$cl->BuildExcerpts doesn't work for Russian text.

Common forum

ethaniel

Name: Ethaniel
Posts: 30

2006-08-26 17:52:29


$cl->BuildExcerpts doesn't work when the input is Russian CP-1251 text.
It just removes all the Russian characters.

Converting to UTF or KOI8 doesn't help either.

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ethaniel, 2006-08-27 01:36:30


> Converting to UTF or KOI8 doesn't help either.

It doesn't support encodings other than UTF-8, but UTF-8 really should work. Are you
positive you convert both $docs and query $words to UTF-8?
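
For anyone hitting the same issue, here is a minimal sketch of converting both the documents and the query from CP-1251 to UTF-8 before the call. It assumes PHP's iconv extension is available; the helper name cp1251_to_utf8 and the sample strings are made up for illustration, and the index name passed to BuildExcerpts is a placeholder.

```php
<?php
// Hypothetical helper: convert a CP-1251 byte string to UTF-8 using iconv.
function cp1251_to_utf8($s)
{
    return iconv('CP1251', 'UTF-8', $s);
}

// Sample CP-1251 input (illustrative only): a two-word Russian document
// and a query made of its second word.
$doc_cp1251   = "\xEF\xF0\xE8\xE2\xE5\xF2 \xEC\xE8\xF0";
$query_cp1251 = "\xEC\xE8\xF0";

// Convert BOTH the documents and the query words.
$docs  = array(cp1251_to_utf8($doc_cp1251));
$words = cp1251_to_utf8($query_cp1251);

// With sphinxapi.php loaded and searchd running, the call would then be:
// $res = $cl->BuildExcerpts($docs, "your_index", $words, $opts);
```

If only one of the two is converted, the query bytes cannot match the document bytes, so nothing gets highlighted.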

ethaniel

Name: Ethaniel
Posts: 30

to: shodan, 2006-08-27 07:02:54


> > Converting to UTF or KOI8 doesn't help either.
>
> It doesn't support encodings other than UTF-8, but UTF-8 really should work. Are you
> positive you convert both $docs and query $words to UTF-8?

First of all, I would like to thank you for this wonderful program. It is just what I
have wanted to create for a long, long time. Now it will really help me out.

Now regarding UTF. I did convert both the docs and the query to UTF.
My opts are:
$opts = array
(
        "before_match" => "<b>",
        "after_match" => "</b>",
        "chunk_separator" => " ... ",
        "limit" => 400,
        "around" => 3
);

It returns the same text I enter; it doesn't enclose the query words in <b></b>.

http://search.nightparty.ru/np.php

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ethaniel, 2006-08-27 09:27:19


> Now regarding UTF. I did convert both the docs and the query to UTF.

Managed to reproduce that on one of my servers. Will check and fix, thanks for the
report!

ethaniel

Name: Ethaniel
Posts: 30

to: shodan, 2006-08-28 16:55:01


> > Now regarding UTF. I did convert both the docs and the query to UTF.
>
> Managed to reproduce that on one of my servers. Will check and fix, thanks for the
> report!

Can't wait for the new version.

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: shodan, 2006-08-29 11:41:01


> Managed to reproduce that on one of my servers.

It turns out that the charset_table for the index was configured to use an SBCS encoding, so
the excerpts code picked it up and, obviously, failed, as it only supports UTF-8 at the moment.

To work around this in 0.9.6, either use UTF-8 everywhere, or set up a fake index
with UTF-8 encoding and a proper table, and use this fake index for excerpts generation
only.

I have scheduled adding SBCS support to the excerpts generator; it will be fixed in a future release.
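
For readers following along, the fake-index workaround might look roughly like the sketch below. The index name, source name and path are placeholders. The charset_table shown covers only digits, Latin letters and the basic Russian alphabet, so treat it as a starting point rather than a complete table.

```
# sphinx.conf (sketch; names and paths are assumptions)
index excerpts_utf8
{
    # any small source will do; this index exists only so that
    # excerpts generation picks up UTF-8 charset settings
    source        = src_fake
    path          = /path/to/data/excerpts_utf8
    charset_type  = utf-8
    charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
}
```

The index would then be built with indexer, searchd restarted so it sees the new index, and the index name passed as the second argument to BuildExcerpts.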

ethaniel

Name: Ethaniel
Posts: 30

to: shodan, 2006-09-02 16:45:29


> > Managed to reproduce that on one of my servers.
>
> It turns out that the charset_table for the index was configured to use an SBCS encoding, so
> the excerpts code picked it up and, obviously, failed, as it only supports UTF-8 at the moment.
>
> To work around this in 0.9.6, either use UTF-8 everywhere, or set up a fake index
> with UTF-8 encoding and a proper table, and use this fake index for excerpts generation
> only.
>
> I have scheduled adding SBCS support to the excerpts generator; it will be fixed in a future release.

My DBs are cp1251, on MySQL 4.0.24 (no collations or anything like that).

I set utf-8 in the config file and reindexed, and now the search returns zero results.

Any ideas? This fix is rather important.

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ethaniel, 2006-09-03 19:02:10


> My DBs are cp1251, on MySQL 4.0.24 (no collations or anything like that).
>
> I set utf-8 in the config file and reindexed, and now the search returns zero results.

If Sphinx expects UTF-8, you need to make MySQL provide UTF-8 encoded data to Sphinx when
indexing as well.

Something like sql_query_pre = SET CHARACTER_SET_RESULTS=utf8 should help.
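
A sketch of where that statement would sit in the config, assuming a MySQL server that supports UTF-8 (4.1 or later). In MySQL syntax the statement is written SET CHARACTER_SET_RESULTS=utf8. The source name, table and columns below are placeholders.

```
# sphinx.conf (sketch; source name, table and columns are assumptions)
source src_main
{
    type          = mysql
    # connection settings (sql_host, sql_user, sql_pass, sql_db) omitted
    sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
    sql_query     = SELECT id, title, body FROM documents
}
```

The pre-query runs before the main fetch query, so the rows arrive at the indexer already encoded as UTF-8.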

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: shodan, 2006-09-04 06:34:54


> Something like sql_query_pre = SET CHARACTER_SET_RESULTS=utf8 should help.

I've just been told that MySQL 4.0.24 does not support UTF-8.

In this case, you'll have to set up the main Sphinx index to use cp-1251 (and query it in
cp-1251) and a fake index to generate excerpts in UTF-8 (and pass the document data and query
in UTF-8).

ethaniel

Name: Ethaniel
Posts: 30

to: shodan, 2006-09-04 14:36:10


> > Something like sql_query_pre = SET CHARACTER_SET_RESULTS=utf8 should help.
>
> I've just been told that MySQL 4.0.24 does not support UTF-8.
>
> In this case, you'll have to set up the main Sphinx index to use cp-1251 (and query it in
> cp-1251) and a fake index to generate excerpts in UTF-8 (and pass the document data and query
> in UTF-8).

Thanks a lot, I guess that should work.
When will you release the main fix for this problem? I'd love to use your system in
production.

ethaniel

Name: Ethaniel
Posts: 30

to: ethaniel, 2006-09-05 10:09:57


> Thanks a lot, I guess that should work.
> When will you release the main fix for this problem? I'd love to use your system in
> production.

It didn't work.

$text=array(win2utf($text));
$res = $cl->BuildExcerpts ( $text, "utf8", win2utf($q), $opts );

$res is empty.

I use the following function:
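function win2utf($s){
        $t = '';  // result buffer (was used without being initialized)
        $c209 = chr(209); $c208 = chr(208); $c129 = chr(129);
        for($i=0; $i<strlen($s); $i++) {
                $c=ord($s[$i]);
                if ($c>=192 and $c<=239) $t.=$c208.chr($c-48);   // U+0410..U+043F -> D0 90..D0 BF
                elseif ($c>239) $t.=$c209.chr($c-112);           // U+0440..U+044F -> D1 80..D1 8F
                elseif ($c==184) $t.=$c209.chr(145);             // U+0451 -> D1 91 (was D1 D1, invalid UTF-8)
                elseif ($c==168) $t.=$c208.$c129;                // U+0401 -> D0 81
                else $t.=$s[$i];                                 // other bytes are copied unchanged
        }
        return $t;
}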

ethaniel

Name: Ethaniel
Posts: 30

to: ethaniel, 2006-09-05 10:25:12


> > Thanks a lot, I guess that should work.
> > When will you release the main fix for this problem? I'd love to use your system in
> > production.
>
> It didn't work.
>
> $text=array(win2utf($text));
> $res = $cl->BuildExcerpts ( $text, "utf8", win2utf($q), $opts );
>
> $res is empty.
>
> I use the following function:
> function win2utf($s){
> $t = '';
> $c209 = chr(209); $c208 = chr(208); $c129 = chr(129);
> for($i=0; $i<strlen($s); $i++) {
> $c=ord($s[$i]);
> if ($c>=192 and $c<=239) $t.=$c208.chr($c-48);
> elseif ($c>239) $t.=$c209.chr($c-112);
> elseif ($c==184) $t.=$c209.chr(145);
> elseif ($c==168) $t.=$c208.$c129;
> else $t.=$s[$i];
> }
> return $t;
> }

PLEASE DISREGARD THIS COMMENT. I FORGOT TO RESTART searchd.

Now there is an additional problem. For example, I query "blagodaru" in Russian.
The search returns all results, including those with "blagodara" (which is correct too),

but BuildExcerpts doesn't highlight "blagodara" with <b></b>.

(I'm in UTF mode.)

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ethaniel, 2006-09-05 12:06:29


> The search returns all results, including those with "blagodara" (which is correct too),
> but BuildExcerpts doesn't highlight "blagodara" with <b></b>.

This is another feature missing from the excerpts generator: as of 0.9.6, it doesn't support
stemming.

Will hopefully be fixed in the next release as well.

ethaniel

Name: Ethaniel
Posts: 30

to: shodan, 2006-09-05 13:44:45


> Will hopefully be fixed in the next release as well.

Thanks a lot :) When are you planning to make the next release?
I'd even love to donate someday - your search is perfect for the time being.

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ethaniel, 2006-09-06 07:06:09


> > Will hopefully be fixed in the next release as well.
>
> Thanks a lot :) When are you planning to make the next release?

Sometime this month.

> your search is perfect for the time being.

Thanks :)

dweis

Name: Tristan
Posts: 31

to: shodan, 2006-11-05 11:51:43


> > > Will hopefully be fixed in the next release as well.
> >
> > Thanks a lot :) When are you planning to make the next release?
>
> Sometime this month.
>
> > your search is perfect for the time being.
>
> Thanks :)

I haven't been able to try it yet: are there any improvements to excerpts in 0.9.7 RC1?

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: dweis, 2006-11-06 06:58:05


> I haven't been able to try it yet: are there any improvements to excerpts in 0.9.7

I fixed SBCS excerpts after 0.9.7-rc1. The patch is available upon request. :)

dweis

Name: Tristan
Posts: 31

to: shodan, 2006-11-06 14:38:12


> > I haven't been able to try it yet: are there any improvements to excerpts in 0.9.7
>
> I fixed SBCS excerpts after 0.9.7-rc1.
Thanks, that's good news ;)

> The patch is available upon request. :)
I'll wait for the 0.9.7 final :)
