Ilia wrote on Thu, 08 May 2008 00:47 |
Set locale to C.
|
I checked, and the locale was already being detected as C.
However, stepping through the code I have found the cause: a bug in the function text_to_worda:
function text_to_worda($text)
{
    $a = array();
    /* if no good locale, default to splitting by spaces */
    if (!$GLOBALS['good_locale']) {
        $GLOBALS['usr']->lang = 'latvian';
    }
    $text = strip_tags(reverse_fmt($text));
    while (1) {
        switch ($GLOBALS['usr']->lang) {
            case 'chinese_big5':
            case 'chinese':
            case 'japanese':
            case 'korean':
                return mb_word_split($text, $GLOBALS['usr']->lang);
                break;
            case 'latvian':
            case 'russian-1251':
                $t1 = array_unique(preg_split('![\x00-\x40]+!', $text, -1, PREG_SPLIT_NO_EMPTY));
                break;
            default:
                $t1 = array_unique(str_word_count(strtolower($text),1,'1234567890'));
                if ($text && !$t1) { /* fall through to split by special chars */
                    $GLOBALS['usr']->lang = 'latvian';
                    continue;
                }
                break;
        }
The first time through, it finds the locale as C and the language as English, and so, as desired, goes to 'default':
array_unique(str_word_count(strtolower($text),1,'1234567890'));
However, if any message yields no words there and so triggers this fallback:
if ($text && !$t1) { /* fall through to split by special chars */
    $GLOBALS['usr']->lang = 'latvian';
    continue;
}
then $GLOBALS['usr']->lang is set to 'latvian', and this persists for the rest of the reindex, affecting the parsing of every subsequent message. In that branch the pattern [\x00-\x40] treats digits as separators, so numbers are dropped from the index.
When indexing a single message it wouldn't matter that $GLOBALS['usr']->lang gets set to 'latvian', since the next message would start fresh with the language set to English once more. But the reindex runs through all messages in one script, so every subsequent message is processed as though the language were Latvian.
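The effect can be reproduced outside the forum with a stripped-down sketch. The function split_words_global() and the variable $GLOBALS['lang'] are hypothetical stand-ins for text_to_worda() and $GLOBALS['usr']->lang; only the two splitting calls are taken from the code above, and the loop is simplified away:

```php
<?php
// Hypothetical stand-in for $GLOBALS['usr']->lang.
$GLOBALS['lang'] = 'english';

// Simplified sketch of the original logic: the fallback writes to the global.
function split_words_global($text)
{
    if ($GLOBALS['lang'] !== 'latvian') {
        $words = array_unique(str_word_count(strtolower($text), 1, '1234567890'));
        if (!$text || $words) {
            return array_values($words);
        }
        // Nothing found: switch to the special-character split, globally.
        $GLOBALS['lang'] = 'latvian';
    }
    // \x00-\x40 includes the digits 0-9, so numbers act as separators here.
    return array_values(array_unique(preg_split('![\x00-\x40]+!', $text, -1, PREG_SPLIT_NO_EMPTY)));
}

$before = split_words_global('version 42');  // default branch, keeps "42"
split_words_global('???');                    // no words found, triggers the fallback
$after = split_words_global('version 42');   // now split as 'latvian', "42" is lost

print_r($before);
print_r($after);
```

The same input gives a different word list before and after the message that triggered the fallback, which matches what I saw during the reindex.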
So I just changed three lines like this:
function text_to_worda($text)
{
    $a = array();
    /* if no good locale, default to splitting by spaces */
    if (!$GLOBALS['good_locale']) {
        $GLOBALS['usr']->lang = 'latvian';
    }
    // use local variable for message language
    $thismessagelang = $GLOBALS['usr']->lang;
    $text = strip_tags(reverse_fmt($text));
    while (1) {
        // switch ($GLOBALS['usr']->lang) {
        // switch on message language
        switch ($thismessagelang) {
            case 'chinese_big5':
            case 'chinese':
            case 'japanese':
            case 'korean':
                return mb_word_split($text, $GLOBALS['usr']->lang);
                break;
            case 'latvian':
            case 'russian-1251':
                $t1 = array_unique(preg_split('![\x00-\x40]+!', $text, -1, PREG_SPLIT_NO_EMPTY));
                break;
            default:
                $t1 = array_unique(str_word_count(strtolower($text),1,'1234567890'));
                if ($text && !$t1) { /* fall through to split by special chars */
                    // if resetting language, do it locally not globally
                    // $GLOBALS['usr']->lang = 'latvian';
                    $thismessagelang = 'latvian';
                    continue;
                }
                break;
        }
This seems to have fixed it for me; my index now includes numbers as required.
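For anyone who wants to check the idea outside the forum, here is a stripped-down, self-contained sketch. The function split_words_local() and the variable $GLOBALS['lang'] are hypothetical stand-ins, not FUDforum code; the point is only that the fallback is recorded in a local variable, so the next call starts from the global setting again:

```php
<?php
// Hypothetical stand-in for $GLOBALS['usr']->lang.
$GLOBALS['lang'] = 'english';

// Sketch of the fix: the fallback only changes a per-call copy of the language.
function split_words_local($text)
{
    $lang = $GLOBALS['lang'];  // local copy, discarded when the function returns
    if ($lang !== 'latvian') {
        $words = array_unique(str_word_count(strtolower($text), 1, '1234567890'));
        if (!$text || $words) {
            return array_values($words);
        }
        $lang = 'latvian';  // fall back for this message only
    }
    // Special-character split, used for 'latvian' or after the local fallback.
    return array_values(array_unique(preg_split('![\x00-\x40]+!', $text, -1, PREG_SPLIT_NO_EMPTY)));
}

split_words_local('???');                   // triggers the fallback for this call only
$words = split_words_local('version 42');   // still goes through the default branch
print_r($words);
```

Here the global language is untouched after the first call, so the number in the second message is still indexed.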
Thanks