Re: no results when a search includes a numerical character [message #40854 is a reply to message #40031]
Sat, 19 April 2008 19:12
srchild
Messages: 88 Registered: December 2003 Location: UK
Karma: 1
Member
Ilia wrote on Sun, 06 January 2008 17:14: "You will need PHP 5.1.0 or greater, since the feature was introduced (by me, BTW) in that release."
OK, I'm ready to make use of your feature now.
The server is scheduled for an upgrade to PHP 5 next week, and then I want to enable searching on numbers as described above.
The part that concerns me is the rebuilding of the search index, which is said to take a 'long time'. So how long is a 'long time'? I know you can't answer that, but...
My msg_1 file is 120 MB. The current fud_index table has 4,500,000 rows. The server is lightly loaded (load average usually below 1), and I have just upgraded the RAM to 1.5 GB so it does not need the pagefile.
Are we talking 30 minutes, or 10 hours, or what? It's a managed server and has a PHP timeout which I can't increase (I think it is 100 minutes). Can I run this reindex from the command line so it runs faster?
What happens if the index is only partially rebuilt? Then I have no usable index at all... Would it then be a case of reinstalling the old fud_index table from backup?
I assume the forum is unavailable during the rebuild?
Thanks
Simon Child

Re: no results when a search includes a numerical character [message #40897 is a reply to message #40876]
Sat, 26 April 2008 06:30
srchild
Ilia wrote on Tue, 22 April 2008 18:54: "Well, the speed will be pretty minimal, so you may as well consider it disabled..."
Well...
It only took 40 minutes. I ran it from the command line, niced, and it didn't bring down the server (though the forum wouldn't load during this time).
However, it hasn't indexed numbers.
Before doing the reindexing, having updated the template and rebuilt the theme, I waited a couple of days to check that new posts containing numbers were being indexed, and they were. Some new posts with strings including numbers could be found by searching for those strings, e.g. GP2GP.
After rebuilding the index, the search no longer finds those posts.
The rebuild appears to be successful. It cleared the original index, and after the reindex fud_index contains the same number of records as before, and searches for standard text-only strings (e.g. 'test') still work. It just didn't index numbers.
So, does the index rebuild script not make use of text_to_worda()? Do I have to make changes somewhere else as well?
Using PHP 5.2.5, MySQL 5.
Thanks
Simon Child

Re: no results when a search includes a numerical character [message #40925 is a reply to message #40919]
Mon, 28 April 2008 19:17
srchild
Ilia wrote on Mon, 28 April 2008 22:49: "It does use the text_to_worda() function, but perhaps the numbers were shorter than the minimum word length?"
That specific 'word' (GP2GP) was not being indexed before I rebuilt the theme.
After I made the changes to text_to_worda(), some posts containing it were indexed and could be found by searching for GP2GP (GP2GP may not mean anything to you, but it is of interest to my forum visitors!).
After I rebuilt the index, no posts including numbers could be found in a search, including those containing GP2GP which had been indexed before the rebuild.
Since the rebuild, two new posts containing GP2GP have been indexed and can be found in a search for GP2GP, but that is all.
Does the index rebuild use a different minimum word length? Where is the word length set? Incidentally, if you mean the MySQL fulltext search word length, I have that set to three characters.
Thanks
Simon Child

Re: no results when a search includes a numerical character [message #40947 is a reply to message #40938]
Wed, 30 April 2008 16:22
srchild
Ilia wrote on Wed, 30 April 2008 00:41: "The minimum word length is defined inside search.inc"
I guess you mean isearch.inc; I can't find a search.inc.
I can't see where the word length is defined in there, but in any case the words I'm interested in (e.g. GP2GP) are getting indexed for new posts. What is not working is the indexing of these same posts when I run indexdb.php in command-line mode to rebuild the search index. indexdb.php does rebuild the index, but the new index does not include these terms.
So somewhere the rebuild of the index is using different parameters from the routine indexing?
Simon Child

Re: no results when a search includes a numerical character [message #40973 is a reply to message #40951]
Sat, 03 May 2008 07:02
srchild
Ilia wrote on Thu, 01 May 2008 00:38: "In your case you need to change the call to the function str_word_count(), making it into:"
    array_unique(str_word_count(strtolower($text), 1, '0123456789'));
Hmm, I'd missed off the quotes:
    array_unique(str_word_count(strtolower($text),1,1234567890));
But I have now put them back, rebuilt the theme, and checked that include/theme/default/isearch.inc has been updated. It now has:
    array_unique(str_word_count(strtolower($text),1,'1234567890'));
I then rebuilt the search index, and it is still not indexing 'GP2GP' in the rebuilt index, even though new posts containing it are being indexed as they arrive, and the index rebuild does work otherwise.
I see that indexdb.php does include isearch.inc, but it must be doing something different with it?
Simon Child
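
For reference, the effect of the third argument to str_word_count() discussed above can be checked in isolation. This is only a minimal sketch: the sample text is made up, and the array_values() call is added purely to make the output easier to read.

```php
<?php
// Illustration only: the sample text is invented; the calls mirror the ones quoted above.
$text = 'Testing GP2GP transfer';

// Default behaviour: digits are not word characters, so "gp2gp" is split at the "2".
$plain = array_values(array_unique(str_word_count(strtolower($text), 1)));

// With the digits passed as extra word characters, "gp2gp" survives as a single token.
$withDigits = array_values(array_unique(str_word_count(strtolower($text), 1, '0123456789')));

print_r($plain);      // testing, gp, transfer
print_r($withDigits); // testing, gp2gp, transfer
```

So the changed call does what is wanted for a single message; it does not by itself explain why a full rebuild behaves differently.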

Re: no results when a search includes a numerical character [message #41011 is a reply to message #41009]
Tue, 06 May 2008 20:21
srchild
Ilia wrote on Wed, 07 May 2008 00:15: "It definitely does, since it uses index_text() defined inside isearch.inc to index the message text."
Looking at the code for indexdb.php, I can see that, as you say, it does call index_text(), which in turn calls my modified text_to_worda().
However, the behaviour of text_to_worda() is influenced by some global variables:
    /* if no good locale, default to splitting by spaces */
    if (!$GLOBALS['good_locale']) {
        $GLOBALS['usr']->lang = 'latvian';
    }
Might it be that when it is called from the command line, no locale is set?
What would be a good way to fake an appropriate locale?
Simon Child

Re: no results when a search includes a numerical character - FIXED [message #41036 is a reply to message #41018]
Sun, 11 May 2008 06:32
srchild
Ilia wrote on Thu, 08 May 2008 00:47: "Set locale to C."
I found that the locale was already being detected as C.
However, stepping through the code, I have found the cause: a bug in the function text_to_worda():
    function text_to_worda($text)
    {
        $a = array();
        /* if no good locale, default to splitting by spaces */
        if (!$GLOBALS['good_locale']) {
            $GLOBALS['usr']->lang = 'latvian';
        }
        $text = strip_tags(reverse_fmt($text));
        while (1) {
            switch ($GLOBALS['usr']->lang) {
                case 'chinese_big5':
                case 'chinese':
                case 'japanese':
                case 'korean':
                    return mb_word_split($text, $GLOBALS['usr']->lang);
                    break;
                case 'latvian':
                case 'russian-1251':
                    $t1 = array_unique(preg_split('![\x00-\x40]+!', $text, -1, PREG_SPLIT_NO_EMPTY));
                    break;
                default:
                    $t1 = array_unique(str_word_count(strtolower($text),1,'1234567890'));
                    if ($text && !$t1) { /* fall through to split by special chars */
                        $GLOBALS['usr']->lang = 'latvian';
                        continue;
                    }
                    break;
            }
The first time through, it finds the locale as C and the language as English, and so, as desired, goes to 'default':
    array_unique(str_word_count(strtolower($text),1,'1234567890'));
However, if any message yields no words there and so drops into this block:
    if ($text && !$t1) { /* fall through to split by special chars */
        $GLOBALS['usr']->lang = 'latvian';
        continue;
    }
then $GLOBALS['usr']->lang is set to 'latvian', and this persists for the rest of the reindex, affecting the parsing of every subsequent message. Since the 'latvian' branch splits on every byte from \x00 to \x40, a range that includes the digits 0-9, a term like GP2GP is broken into 'GP' and 'GP'.
When indexing a single message it wouldn't matter that $GLOBALS['usr']->lang gets set to 'latvian', since the next message would be a fresh start with it set to English once more. But with the reindex running through all messages in one script, every subsequent message is processed as though the language were Latvian.
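
The effect described above can be reproduced outside the forum code. The following is a rough sketch only: split_words_leaky() and $GLOBALS['lang'] are hypothetical stand-ins for text_to_worda() and $GLOBALS['usr']->lang, and only the language switch is modelled.

```php
<?php
// The 'latvian' pattern splits on any run of bytes in 0x00-0x40.
// The ASCII digits (0x30-0x39) fall inside that range, so they act as separators.
$words = preg_split('![\x00-\x40]+!', 'GP2GP transfer', -1, PREG_SPLIT_NO_EMPTY);
print_r($words); // GP, GP, transfer

// Hypothetical stand-in for the global language setting.
$GLOBALS['lang'] = 'english';

// Hypothetical stand-in for text_to_worda(): the fallback writes to the global.
function split_words_leaky($text)
{
    if ($GLOBALS['lang'] === 'latvian') {
        return preg_split('![\x00-\x40]+!', $text, -1, PREG_SPLIT_NO_EMPTY);
    }
    $t1 = str_word_count(strtolower($text), 1, '1234567890');
    if ($text && !$t1) {
        $GLOBALS['lang'] = 'latvian'; // this change outlives the current call
    }
    return $t1;
}

$first  = split_words_leaky('GP2GP'); // indexed as one token: gp2gp
$second = split_words_leaky('!!!');   // no words found, so the global flips to 'latvian'
$third  = split_words_leaky('GP2GP'); // now split at the digit: GP, GP
```

In this sketch a single message with no recognisable words is enough to change how every later message in the same run is tokenized, which matches the behaviour seen during the rebuild.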
So I just changed three lines like this:
    function text_to_worda($text)
    {
        $a = array();
        /* if no good locale, default to splitting by spaces */
        if (!$GLOBALS['good_locale']) {
            $GLOBALS['usr']->lang = 'latvian';
        }
        // use local variable for message language
        $thismessagelang = $GLOBALS['usr']->lang;
        $text = strip_tags(reverse_fmt($text));
        while (1) {
            // switch ($GLOBALS['usr']->lang) {
            // switch on message language
            switch ($thismessagelang) {
                case 'chinese_big5':
                case 'chinese':
                case 'japanese':
                case 'korean':
                    return mb_word_split($text, $GLOBALS['usr']->lang);
                    break;
                case 'latvian':
                case 'russian-1251':
                    $t1 = array_unique(preg_split('![\x00-\x40]+!', $text, -1, PREG_SPLIT_NO_EMPTY));
                    break;
                default:
                    $t1 = array_unique(str_word_count(strtolower($text),1,'1234567890'));
                    if ($text && !$t1) { /* fall through to split by special chars */
                        // if resetting language, do it locally not globally
                        // $GLOBALS['usr']->lang = 'latvian';
                        $thismessagelang = 'latvian';
                        // inside a switch, a bare "continue" acts like "break";
                        // "continue 2" is needed to restart the while loop
                        continue 2;
                    }
                    break;
            }
This seems to have fixed it for me: my index now includes numbers as required.
Thanks
Simon Child
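
The idea behind the fix can be sketched in a self-contained form. This is not the forum's actual code: split_words_local() and $GLOBALS['lang'] are hypothetical, the CJK branch is left out, and the trailing break assumes the loop should end once $t1 has been set. Note that in PHP a bare continue inside a switch behaves like break, so the sketch uses continue 2 to retry with the fallback splitter.

```php
<?php
// Hypothetical stand-in for the global language setting.
$GLOBALS['lang'] = 'english';

// Simplified tokenizer: the language is copied into a local variable,
// so a fallback for one message cannot affect the next one.
function split_words_local($text)
{
    $lang = $GLOBALS['lang']; // per-message copy
    while (1) {
        switch ($lang) {
            case 'latvian':
                $t1 = preg_split('![\x00-\x40]+!', $text, -1, PREG_SPLIT_NO_EMPTY);
                break;
            default:
                $t1 = str_word_count(strtolower($text), 1, '1234567890');
                if ($text && !$t1) {
                    $lang = 'latvian'; // local change only
                    continue 2;        // restart the while loop, not just the switch
                }
                break;
        }
        break; // leave the loop once $t1 has been set
    }
    return array_values(array_unique($t1));
}

$before = split_words_local('GP2GP'); // gp2gp
$middle = split_words_local('!!!');   // falls back for this message only
$after  = split_words_local('GP2GP'); // still gp2gp
```

Because the fallback only touches the local copy, the order in which messages are processed no longer matters, which is the property a full reindex needs.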

Re: no results when a search includes a numerical character - FIXED [message #158940 is a reply to message #158789]
Fri, 17 April 2009 10:45
Peter Vendike
Messages: 65 Registered: February 2009 Location: Denmark
Karma: 0
Member Translator
kerryg wrote on Tue, 24 March 2009 03:51: "Hi Frank - the ability to search for numerical strings (including the slash character "/" as in "error 1/1" or "7/7" or "1324123g/12341234f") would be an *extremely* useful function for folks like myself whose forums often discuss software error messages - it's one of the most important things to be able to search for. Do you have any plans to commit this patch? I'd understand if it was best to have it default to "off" for most folks, but it would be a killer feature, well worth some slowdown in searching."
As I read the docs, searching would not get slower; only saving (or editing) a message would, since the indexing is done there.
Isn't PHP >= 5.1 standard these days?
I'm for committing that hack soon, if it works for more than just Latvian.