
Tokenize an HTML page. [message #179710] Mon, 26 November 2012 02:56
Simon
Hi,

I would like to index a whole bunch of HTML documents on my site to speed
up my internal searches (I currently use 'LIKE "%...%"', and that's not
very efficient).

My understanding is that I would need to:
1) Remove the HTML (with strip_tags( ... ))
2) Walk the string and, every time I come across a stop character
(<space>, ', ", ?, !, etc.), count the text before it as a word.
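
A minimal sketch of the two steps above, assuming UTF-8 input. The
function name naive_tokenize and the exact set of stop characters are
hypothetical choices for illustration, not an established convention:

```php
<?php
// Hypothetical helper: strip the markup, then split on stop characters.
function naive_tokenize(string $html): array
{
    // Step 1: remove the tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Step 2: split on whitespace and an assumed set of stop characters.
    $words = preg_split('/[\s\'"?!.,;:()]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example invalid UTF-8).
    return $words ?: [];
}

print_r(naive_tokenize('<p>Hello, world! Is this "indexed"?</p>'));
```

Note that strip_tags() inserts nothing where a tag used to be, so words
in adjacent elements such as '<p>a</p><p>b</p>' are fused into one.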

The above solution is overly simplistic, as it does not work for many
languages (Hebrew, for example, uses the single quote as part of the word).
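
One possible way around this, sketched under assumptions, is to match
words instead of splitting on stop characters: a word is a run of
Unicode letters, marks and digits, optionally joined by an apostrophe-like
character when letters follow it. The name unicode_tokenize and the
choice of joining characters (', U+2019 and the Hebrew geresh U+05F3)
are hypothetical, and the input is assumed to be UTF-8 text with the
tags already removed:

```php
<?php
// Hypothetical helper: match words rather than split on stop characters.
function unicode_tokenize(string $text): array
{
    // A run of letters/marks/digits, optionally continued by an
    // apostrophe-like character that is followed by more letters.
    $pattern = '/[\p{L}\p{M}\p{N}]+(?:[\x{27}\x{2019}\x{05F3}][\p{L}\p{M}\p{N}]+)*/u';

    // preg_match_all() returns false on failure (for example invalid UTF-8).
    if (!preg_match_all($pattern, $text, $matches)) {
        return [];
    }
    return $matches[0];
}

print_r(unicode_tokenize("don't stop, 'now'!"));
```

A quote that only wraps a word is dropped, while a quote inside a word is
kept. A quote at the very end of a word (as in some abbreviations) is
still lost, so this is only a partial answer.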

Also, stripping HTML assumes that it is properly formatted, something I
cannot really guarantee (and in any case, I might want to keep certain
items, such as the URLs inside href='' attributes).
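
A rough sketch of one alternative to strip_tags(), assuming PHP's DOM
extension is available: DOMDocument::loadHTML() tolerates badly formed
markup, which makes it possible to collect both the visible text and the
href values. The function name extract_text_and_links and the returned
array layout are hypothetical:

```php
<?php
// Hypothetical helper: parse lenient HTML, return visible text and link URLs.
function extract_text_and_links(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // The XML prolog is a common workaround so the input is read as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Keep the URLs found in href attributes of <a> elements.
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }

    // Gather text nodes, skipping the contents of <script> and <style>.
    $xpath = new DOMXPath($doc);
    $parts = [];
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $parts[] = $node->nodeValue;
    }

    // Join with spaces so text from adjacent elements is not fused.
    $text = trim(preg_replace('/\s+/u', ' ', implode(' ', $parts)));

    return ['text' => $text, 'links' => $links];
}

print_r(extract_text_and_links("<p>Unclosed <b>bold <a href='http://example.com/'>link</p>"));
```

Joining the text nodes with a space keeps neighbouring elements apart, at
the cost of splitting a word that is only partly wrapped in an inline tag.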

So, before I re-invent the wheel, can someone suggest a
script/class/code that is able to tokenize html content?

Any suggestions?

Many thanks

Simon