Home » Imported messages » comp.lang.php » I Need to search over 100 largeish text documents efficiently. What's the best approach?
I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184735] Sun, 26 January 2014 13:34
Rob Bradford
As part of my hosting provider's re-platforming cycle, my site has moved server. On the new server, and on all new servers, PHP exec() and equivalents are blocked. This has taken out my fast document search, which used exec() to call grep and then awk. I now need to do the grep part as effectively as possible in PHP, as I can no longer access the shell from the scripts. The awk part is easily sorted.

What is the best/fastest approach to scan 100+ largish text files for word strings? I really don't wish to index each file into a database, as the documents change quite frequently. My grep-awk scan took around one second to begin rendering the results page. I know I can't match that, but I can't afford too much of a delay.

Any ideas appreciated whilst I look for a new hosting provider. I feel that any hosting setup that makes such a change without notification really has no respect for its clients.

Rob
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184736 is a reply to message #184735] Sun, 26 January 2014 14:09
Richard Damon
If you can't call grep from the command line via exec, the best solution
may be to write a version of grep in your program. Read the files (or
chunks of them in sequence) and use the PHP string search functions on
the data block. If reading chunks, make sure to do any needed overlap
between chunks so you don't miss matches across chunk breaks.
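
A minimal sketch of the chunk-with-overlap idea, assuming a plain fixed-string search. The function name file_contains() and the 64 KB default chunk size are made up for illustration and are not from any library. The overlap is the needle length minus one byte, which is just enough to catch a match that straddles a chunk break.

```php
<?php
// Hypothetical helper: report whether $needle occurs anywhere in $path,
// reading the file in fixed-size chunks instead of loading it whole.
function file_contains(string $path, string $needle, int $chunkSize = 65536): bool
{
    $overlap = strlen($needle) - 1;    // bytes to carry across a chunk break
    if ($overlap < 0) {
        return false;                  // an empty needle is treated as "no match"
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $tail = '';
    $found = false;
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $buffer = $tail . $chunk;
        if (strpos($buffer, $needle) !== false) {
            $found = true;
            break;
        }
        // Keep the last (needle length - 1) bytes so that a match split
        // across two chunks is still seen on the next pass.
        $tail = $overlap > 0 ? substr($buffer, -$overlap) : '';
    }
    fclose($handle);
    return $found;
}
```

Calling file_contains('/path/to/doc.txt', 'some phrase') for each document gives the set of matching files. A regular-expression search would need a different overlap rule, since the length of a match is not known in advance.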
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184737 is a reply to message #184735] Sun, 26 January 2014 15:29
Jerry Stuckle
Whether the files change frequently or not, your best bet is going to be
putting the documents in a database. You won't be able to do anything
nearly as fast in PHP as the database does. And it isn't that hard to
insert the documents into the database when they are uploaded.

And kudos to your hosting provider for closing a huge security exposure.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstucklex(at)attglobal(dot)net
==================
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184738 is a reply to message #184735] Sun, 26 January 2014 19:56
The Natural Philosoph
Change service providers and get your own virtual machine.


--
Ineptocracy

(in-ep-toc’-ra-cy) – a system of government where the least capable to
lead are elected by the least capable of producing, and where the
members of society least likely to sustain themselves or succeed, are
rewarded with goods and services paid for by the confiscated wealth of a
diminishing number of producers.
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184739 is a reply to message #184737] Sun, 26 January 2014 20:14
Denis McMahon
On Sun, 26 Jan 2014 10:29:58 -0500, Jerry Stuckle wrote:

> On 1/26/2014 8:34 AM, rob(dot)bradford2805(at)gmail(dot)com wrote:

>> Any ideas appreciated whilst I look for a new hosting provider. I feel
>> that any hosting setup that makes such a change without notification
>> really has no respect for its clients.

> And kudos to your hosting provider for closing a huge security exposure.

+1 to both these comments.

Your hosting provider should have told you about this change as soon as
they became aware it would happen (and if they weren't aware they'd be
forcing this change on their customers, then they need to be shot because
that would mean that the people managing the server switch had no idea of
the effect of the differences in configurations).

However, you'd have been even more pissed at your hosting provider if
your website had suddenly started serving up porn or Russian mafia
viruses or viagra-clone adverts to all visitors because the now-closed
security hole had been exploited.

--
Denis McMahon, denismfmcmahon(at)gmail(dot)com
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184740 is a reply to message #184735] Sun, 26 January 2014 20:55
Ben Bacarisse
rob(dot)bradford2805(at)gmail(dot)com writes:

> As part of my hosting provider's re-platforming cycle, my site has moved
> server. On the new server, and on all new servers, PHP exec() and
> equivalents are blocked. This has taken out my fast document search,
> which used exec() to call grep and then awk. I now need to do the grep
> part as effectively as possible in PHP, as I can no longer access the
> shell from the scripts. The awk part is easily sorted.
>
> What is the best/fastest approach to scan 100+ largish text files for
> word strings? I really don't wish to index each file into a database,
> as the documents change quite frequently. My grep-awk scan took around
> one second to begin rendering the results page. I know I can't match
> that, but I can't afford too much of a delay.

I second Richard Damon's suggestion. If the awk bit is easily sorted,
and all you are missing is grep, it must be a matter of minutes to make
something like the functionality you had before from a handful of lines
of PHP[1]. Of course, it won't be an external command, so the way it
integrates with the rest of the code might make this not quite the
trivial task that it first appears to be.

[1] fgets, preg_match and glob (if you need it).
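
As a rough illustration of the footnote, here is one way the three functions might fit together. This is a sketch under assumptions, not the code from the original setup: the function name grep_lines() and the "file:line:text" output format (modelled on grep -n) are invented for the example.

```php
<?php
// Hypothetical helper: return matching lines as "file:line-number:text",
// roughly what `grep -n PATTERN files...` prints.
function grep_lines(string $pattern, string $globPattern): array
{
    $matches = [];
    foreach (glob($globPattern) ?: [] as $path) {
        if (!is_file($path)) {
            continue;                       // skip directories and the like
        }
        $handle = fopen($path, 'r');
        if ($handle === false) {
            continue;                       // skip unreadable files
        }
        $lineNo = 0;
        while (($line = fgets($handle)) !== false) {
            $lineNo++;
            if (preg_match($pattern, $line) === 1) {
                $matches[] = $path . ':' . $lineNo . ':' . rtrim($line, "\r\n");
            }
        }
        fclose($handle);
    }
    return $matches;
}
```

The pattern is a PCRE pattern with delimiters, for example grep_lines('/invoice/i', '/path/to/docs/*.txt'). Because fgets() reads one line at a time, memory use stays small however large the files are.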

--
Ben.
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184741 is a reply to message #184736] Sun, 26 January 2014 21:34
Michael Vilain
In article <SG8Fu.46700$vG7(dot)15374(at)en-nntp-03(dot)dc1(dot)easynews(dot)com>,
Richard Damon <Richard(at)Damon-Family(dot)org> wrote:

> If you can't call grep from the command line via exec, the best solution
> may be to write a version of grep in your program. Read the files (or
> chunks of them in sequence) and use the PHP string search functions on
> the data block. If reading chunks, make sure to do any needed overlap
> between chunks so you don't miss matches across chunk breaks.

If you're grepping multiple large files from an exec, grep will produce
the results for each file as it's processed. You can easily replicate
this behavior from within PHP.

Loop through an array containing the filenames you want to grep.
For each file, open it and read it into memory as an array of lines.
Use preg_grep to scan the entire array for results (or preg_match_all
on the file contents as a single string).
Do whatever you want with the resultant array of matching results.
Process the next file. That seems fairly straightforward.
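
The steps above can be sketched roughly as follows. The function name grep_files() and the shape of the returned array are assumptions made for the example; file() and preg_grep() are the standard PHP functions.

```php
<?php
// Hypothetical helper: for each file, return the matching lines keyed by
// their zero-based line index (preg_grep() preserves the input keys).
function grep_files(array $paths, string $pattern): array
{
    $results = [];
    foreach ($paths as $path) {
        if (!is_readable($path)) {
            continue;                        // skip missing or unreadable files
        }
        // Read the whole file into an array, one element per line.
        $lines = file($path, FILE_IGNORE_NEW_LINES);
        if ($lines === false) {
            continue;
        }
        $hits = preg_grep($pattern, $lines);
        if ($hits) {
            $results[$path] = $hits;         // only keep files with matches
        }
    }
    return $results;
}
```

Each file is held in memory in full while it is scanned, so this suits files of a few megabytes rather than gigabytes.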

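If these files are HUGE (e.g. GB), you may have to do your own I/O with
fopen/fread, convert the string buffer into an array with explode
(split() was deprecated in PHP 5.3 and removed in PHP 7), grep it, and
get more data. The problem there is that a chunk may end with a partial
line, which has to be held over and joined to the start of the next chunk.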
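
One possible way to handle the partial line, sketched under the assumption that lines end in "\n" and that a 1 MB chunk is acceptable. The name grep_big_file() is hypothetical. The last element produced by explode() is held back and prepended to the next chunk, so no line is ever tested in two halves.

```php
<?php
// Hypothetical helper: return the lines of $path that match $pattern,
// reading the file in chunks and carrying partial lines forward.
function grep_big_file(string $path, string $pattern, int $chunkSize = 1048576): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return [];
    }
    $hits = [];
    $carry = '';
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $lines = explode("\n", $carry . $chunk);
        // The last element is either '' (the chunk ended on a newline) or
        // an incomplete line; hold it back until the rest arrives.
        $carry = array_pop($lines);
        foreach (preg_grep($pattern, $lines) ?: [] as $line) {
            $hits[] = $line;
        }
    }
    fclose($handle);
    // Whatever is left is the final line of a file without a trailing newline.
    if ($carry !== '' && preg_match($pattern, $carry) === 1) {
        $hits[] = $carry;
    }
    return $hits;
}
```

Files with Windows line endings would leave a trailing "\r" on each returned line, which rtrim() can remove if it matters.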

--
DeeDee, don't press that button! DeeDee! NO! Dee...
[I filter all Goggle Groups posts, so any reply may be automatically ignored]
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184742 is a reply to message #184735] Mon, 27 January 2014 01:43
Denis McMahon
On Sun, 26 Jan 2014 05:34:21 -0800, rob.bradford2805 wrote:

> What is the best/fastest approach to scan 100+ largish text files for
> word strings

A quick googling finds:

http://sourceforge.net/projects/php-grep/
http://net-wrench.com/download-tools/php-grep.php

Claims to be able to search 1000 files in under 10 secs

--
Denis McMahon, denismfmcmahon(at)gmail(dot)com
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184743 is a reply to message #184742] Mon, 27 January 2014 09:58
Arno Welzel
Am 27.01.2014 02:43, schrieb Denis McMahon:

> On Sun, 26 Jan 2014 05:34:21 -0800, rob.bradford2805 wrote:
>
>> What is the best/fastest approach to scan 100+ largish text files for
>> word strings
>
> A quick googling finds:
>
> http://sourceforge.net/projects/php-grep/
> http://net-wrench.com/download-tools/php-grep.php
>
> Claims to be able to search 1000 files in under 10 secs

Under ideal conditions - maybe. But if each file is more than 1 MB, it
is barely possible to even read this amount of data in just 10 seconds
(at around 80 MB/s, reading 1000 MB of data takes about 12.5 seconds).

Even using a simple word index (word plus the name of the file(s) and
the position(s) where the word is located) would be the better solution.
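
A minimal sketch of such a word index, kept in memory as a nested PHP array of the form word => file => list of byte offsets. The function name build_word_index() is made up for the example, and the word pattern assumes plain ASCII text.

```php
<?php
// Hypothetical helper: build an index of
//   word => [file => [byte offsets where the word starts, ...]].
function build_word_index(array $paths): array
{
    $index = [];
    foreach ($paths as $path) {
        $text = is_readable($path) ? file_get_contents($path) : false;
        if ($text === false) {
            continue;                        // skip unreadable files
        }
        // Lowercase so that lookups are case-insensitive, then record
        // every run of word characters together with its offset.
        preg_match_all('/\w+/', strtolower($text), $m, PREG_OFFSET_CAPTURE);
        foreach ($m[0] as [$word, $offset]) {
            $index[$word][$path][] = $offset;
        }
    }
    return $index;
}
```

A lookup is then a single array access, for example array_keys($index['invoice'] ?? []) for the files containing "invoice". The index would have to be stored somewhere (a serialized file or a database table) and rebuilt for a document whenever that document changes.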


--
Arno Welzel
http://arnowelzel.de
http://de-rec-fahrrad.de
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184744 is a reply to message #184743] Mon, 27 January 2014 12:23
Denis McMahon
On Mon, 27 Jan 2014 10:58:42 +0100, Arno Welzel wrote:

> Am 27.01.2014 02:43, schrieb Denis McMahon:
>
>> On Sun, 26 Jan 2014 05:34:21 -0800, rob.bradford2805 wrote:
>>
>>> What is the best/fastest approach to scan 100+ largish text files for
>>> word strings
>>
>> A quick googling finds:
>>
>> http://sourceforge.net/projects/php-grep/
>> http://net-wrench.com/download-tools/php-grep.php
>>
>> Claims to be able to search 1000 files in under 10 secs
>
> Under ideal conditions - maybe. But if each file is more than 1 MB, it
> is barely possible to even read this amount of data in just 10 seconds
> (assuming around 80 MB/s and 1000 MB of data to be searched).
>
> Even using a simple word index (word plus the name of the file(s) and
> the position(s) where the word is located) would be the better solution.

Indeed, the fastest solution would be to index each file when it changes,
and keep the indexes in a db.

Perhaps there are common words you wouldn't index. In English these might
include:

a the in on an this that then ....

Then, if you have a search phrase, remove the common words and look for the
uncommon words in close proximity to each other.
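
A rough sketch of the two steps just described: dropping common words from the search phrase, then testing whether the remaining words occur close together. The stop-word list only contains the examples named above, the function names are invented, and the proximity test scans the text directly for brevity, where a real version would presumably consult the stored index instead.

```php
<?php
// Assumed stop-word list: only the example words given in the post.
const STOP_WORDS = ['a', 'the', 'in', 'on', 'an', 'this', 'that', 'then'];

// Split a search phrase into lowercase words and drop the stop words.
function significant_words(string $phrase): array
{
    $words = preg_split('/\W+/', strtolower($phrase), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($words, STOP_WORDS));
}

// Hypothetical proximity test: true when every word in $words occurs
// somewhere within a run of $window consecutive words of $text.
function words_near(string $text, array $words, int $window = 10): bool
{
    $tokens = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $needed = array_unique($words);
    if (!$needed) {
        return false;                        // nothing left to search for
    }
    for ($i = 0, $n = count($tokens); $i < $n; $i++) {
        $slice = array_slice($tokens, $i, $window);
        if (!array_diff($needed, $slice)) {
            return true;                     // all words found in this window
        }
    }
    return false;
}
```

For a phrase such as "the cat in the hat", significant_words() leaves "cat" and "hat", and words_near($text, ['cat', 'hat'], 5) asks whether both appear within five consecutive words.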

It might help to know more about the grep too: is this using a complex
regexp, or is it a simple string search done externally using grep?

--
Denis McMahon, denismfmcmahon(at)gmail(dot)com
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184745 is a reply to message #184743] Mon, 27 January 2014 17:05
Ben Bacarisse
Arno Welzel <usenet(at)arnowelzel(dot)de> writes:

> Am 27.01.2014 02:43, schrieb Denis McMahon:
>
>> On Sun, 26 Jan 2014 05:34:21 -0800, rob.bradford2805 wrote:
>>
>>> What is the best/fastest approach to scan 100+ largish text files for
>>> word strings
>>
>> A quick googling finds:
>>
>> http://sourceforge.net/projects/php-grep/
>> http://net-wrench.com/download-tools/php-grep.php
>>
>> Claims to be able to search 1000 files in under 10 secs
>
> Under ideal conditions - maybe. But if each file is more than 1 MB, it
> is barely possible to even read this amount of data in just 10 seconds
> (assuming around 80 MB/s and 1000 MB of data to be searched).

There are so many variables here; we don't know what "largeish" means and
we don't know what sort of grep is being done (to count lines, display
lines, or just to find matching files?).

Anyway, I tried a couple of these out. With the most naive PHP grep
imaginable, I can do the equivalent of grep -l (to find matching files)
in reasonable time. With 987M of data in 201 5M files, it takes about
0.5 seconds in PHP compared to about 0.15 seconds for grep. I reckon if
that is what is being done, then it's well worth it as a temporary
solution.

Counting matching lines is much slower: about 17s for PHP vs 1s for
grep. Although the server is likely to be faster than my laptop, the
ratios might be useful data.
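
The code behind these timings was not posted, so the following is only a guess at what a naive version of each test might look like, assuming a fixed-string search. Both function names are made up. The first does one strpos() call per file, the second one per line, which may go some way to explaining why counting lines is so much slower.

```php
<?php
// Rough equivalent of `grep -l`: names of the files that contain $needle.
function files_matching(array $paths, string $needle): array
{
    $found = [];
    foreach ($paths as $path) {
        $text = is_readable($path) ? file_get_contents($path) : false;
        if ($text !== false && strpos($text, $needle) !== false) {
            $found[] = $path;
        }
    }
    return $found;
}

// Rough equivalent of `grep -c`: number of lines in $path containing $needle.
function count_matching_lines(string $path, string $needle): int
{
    $handle = is_readable($path) ? fopen($path, 'r') : false;
    if ($handle === false) {
        return 0;
    }
    $count = 0;
    while (($line = fgets($handle)) !== false) {
        if (strpos($line, $needle) !== false) {
            $count++;
        }
    }
    fclose($handle);
    return $count;
}
```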

--
Ben.
Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184746 is a reply to message #184735] Mon, 27 January 2014 21:14
Rob Bradford
Thanks for all the suggestions. My gripe with the hosting provider is not that they closed the security risk, just the lack of notice that it was going to happen, which left me wondering why stuff had stopped working.

Rob