Tokenize an HTML page. [message #179710] Mon, 26 November 2012 07:56
Simon
Hi,

I would like to index a whole bunch of HTML documents on my site to speed
up my internal searches (I currently use 'LIKE "%...%"' and that's not
very efficient).

My understanding is that I would need to:
1) Remove the HTML markup (with strip_tags( ... ))
2) Walk the string and, every time I come across a stop character
(<space>, ', ", ?, ! etc.), count what precedes it as a word.
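
As a rough sketch of steps 1 and 2 (the function name naive_tokenize is only
illustrative, and UTF-8 input and a hand-picked set of stop characters are
assumed):

```php
<?php
// Naive tokenizer: strip the tags, then split on a fixed set of stop characters.
// Assumes UTF-8 input; the stop-character list is an arbitrary example.
function naive_tokenize(string $html): array
{
    // Step 1: drop the markup (unreliable on badly formed HTML).
    $text = strip_tags($html);

    // Step 2: split on whitespace and some punctuation, dropping empty pieces.
    // preg_split() returns false on invalid UTF-8, hence the fallback to [].
    $words = preg_split('/[\s\'",.?!;:()]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Lowercase so that "Hello" and "hello" index as the same word (ASCII only).
    return array_map('strtolower', $words);
}

print_r(naive_tokenize('<p>Hello, world! Is this "fast"?</p>'));
// words: hello, world, is, this, fast
```

Passing the result to array_count_values() would give a per-word count that
could be stored in an index table instead of running LIKE over the raw pages.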

The above solution is overly simplistic, as it does not work for many
languages (Hebrew, for example, uses the single quote as part of the word).

Also, stripping HTML assumes that it is properly formatted, something I
cannot really guarantee (and in any case, I might want to keep certain
items, such as the URLs inside href='' attributes).
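
One possible direction for these two concerns, as a sketch only: PHP's
DOMDocument (libxml) tolerates a fair amount of malformed markup, and a
Unicode-aware pattern can keep the apostrophe inside words. The function name
extract_tokens, the returned array layout and the '<?xml encoding="UTF-8">'
prefix (a commonly used workaround to make loadHTML() treat the input as
UTF-8) are my assumptions, not an established recipe.

```php
<?php
// Sketch: parse possibly malformed HTML, collect the visible text and the
// href values, then split the text on runs of non-letter/non-digit characters.
function extract_tokens(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the XML declaration prefix hints UTF-8 to the HTML parser.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Keep link targets, which strip_tags() would have thrown away.
    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }

    // Gather text nodes, skipping the contents of <script> and <style>.
    $text = '';
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $text .= ' ' . $node->nodeValue;
    }

    // \p{L} matches letters and \p{N} digits in any script; the apostrophe
    // is treated as part of a word rather than as a stop character.
    $words = preg_split('/[^\p{L}\p{N}\']+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Trim apostrophes that were only used as quotes around a word.
    $words = array_map(fn($w) => trim($w, "'"), $words);
    $words = array_values(array_filter($words, 'strlen'));

    return ['words' => $words, 'links' => $links];
}

// Unclosed <a> and <b> tags are recovered by the parser.
print_r(extract_tokens('<p>Don\'t panic <a href="http://example.com/">see <b>this</p>'));
```

Keeping the apostrophe is a trade-off: it helps with words that contain it,
but contractions such as "don't" are then indexed as a single token.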

So, before I reinvent the wheel, can someone suggest a
script/class/code that is able to tokenize HTML content?

Any suggestions?

Many thanks

Simon