Tokenize an HTML page. [message #179710] |
Mon, 26 November 2012 07:56 |
Simon
Messages: 29 Registered: February 2011
Hi,
I would like to index a whole bunch of HTML documents on my site to speed
up my internal searches (I currently use 'LIKE "%...%"', and that's not
very efficient).
My understanding is that I would need to:
1) Remove the HTML markup (with strip_tags( ... ))
2) Walk the string and, every time I come across a stop character
(<space>, ', ", ?, ! etc.), count the text before it as a word.
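
As a rough illustration of those two steps, here is a minimal sketch in Python (standing in for PHP's strip_tags(); the name naive_tokenize and the exact set of stop characters are assumptions made for the example, not part of any library):

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and drops the tags, roughly like strip_tags()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Assumed stop characters: whitespace, both quote styles, common punctuation.
STOP_CHARS = re.compile(r"""[\s'"?!.,;:()]+""")


def naive_tokenize(html):
    """Step 1: strip the markup. Step 2: split on every stop character."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Splitting leaves empty strings at the edges, so filter them out.
    return [word.lower() for word in STOP_CHARS.split(text) if word]


print(naive_tokenize("<p>Hello, world! Isn't it?</p>"))
```

Because the single quote is treated as a stop character, "Isn't" is split into "isn" and "t".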
The above solution is overly simplistic, as it does not work for many
languages (Hebrew, for example, uses the single quote as part of the word).
Also, stripping HTML assumes that it is properly formatted, something I
cannot really guarantee (and in any case, I might want to keep certain
items, such as the URLs inside href='' attributes).
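
To make those two concerns concrete, here is a hedged Python sketch of the kind of behaviour I have in mind. It uses the standard library's html.parser, which tolerates unclosed tags; the class name LinkAwareExtractor, the function tokenize_html and the word pattern are assumptions for illustration only. The pattern keeps an apostrophe (or the Hebrew geresh) only when it sits between word characters, which will not cover every language.

```python
import re
from html.parser import HTMLParser

# Assumption: a token is a run of Unicode word characters, optionally joined
# by ', \u2019 (curly apostrophe) or \u05F3 (Hebrew geresh) inside the word.
WORD = re.compile(r"\w+(?:['\u2019\u05F3]\w+)*")


class LinkAwareExtractor(HTMLParser):
    """Collects word tokens from text nodes and href values from <a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(m.group(0).lower() for m in WORD.finditer(data))


def tokenize_html(html):
    """Returns (words, links); missing closing tags do not raise an error."""
    parser = LinkAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.words, parser.links


words, links = tokenize_html("<p>Isn't <a href='http://example.com/a'>this<b> fine")
print(words, links)
```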
So, before I re-invent the wheel, can someone suggest a
script/class/code that is able to tokenize html content?
Any suggestions?
Many thanks
Simon