How did the sphinx calculate the weight?

m4eclipse

Name: eclipse
Posts: 15

2010-09-13 05:45:20 | reply!


First, take a look at an example.

This is my table (used for testing only):

+----+--------------------------+----------------------+
| Id | title                    | body                 |
+----+--------------------------+----------------------+
| 1  | National first hospital  | NASA                 |
| 2  | National second hospital | Space Administration |
| 3  | National government      | Support the hospital |
+----+--------------------------+----------------------+

I want to search the contents of the title and body fields, so I configured sphinx.conf
as follows:

--------The sphinx config file----------
source mysql
{
        type = mysql
        sql_host = localhost
        sql_user = root
        sql_pass = 0000
        sql_db = testfull
        sql_port = 3306 # optional, default is 3306
        sql_query_pre = SET NAMES utf8
        sql_query = SELECT * FROM test
}

index mysql
{
        source = mysql
        path = var/data/mysql_old_test
        docinfo = extern
        mlock = 0
        morphology = stem_en, stem_ru, soundex
        min_stemming_len = 1
        min_word_len = 1
        charset_type = utf-8
        html_strip = 0
}

indexer
{
        mem_limit = 128M
}

searchd
{
        listen = 9312
        read_timeout = 5
        max_children = 30
        max_matches = 1000
        seamless_rotate = 0
        preopen_indexes = 0
        unlink_old = 1
        pid_file = var/log/searchd_mysql.pid
        log = var/log/searchd_mysql.log
        query_log = var/log/query_mysql.log
}
------------------

Then I reindexed the database and started the searchd daemon.

On the client side I set the following options:

----------Client side config-------------------
sc = new SphinxClient();
// ... other setup omitted ...
HashMap<String, Integer> weiMap = new HashMap<String, Integer>();
weiMap.put("title", 100);
weiMap.put("body", 0);
sc.SetFieldWeights(weiMap);

sc.SetMatchMode(SphinxClient.SPH_MATCH_ALL);

sc.SetSortMode(SphinxClient.SPH_SORT_EXTENDED,"@weight DESC");
-----------------------------

When I search for "National hospital", I get the following output:

Query 'National hospital' retrieved 3 of 3 matches in 0.0 sec.
Query stats:
        'nation' found 3 times in 3 documents
        'hospit' found 3 times in 3 documents

Matches:
1. id=3, weight=101
2. id=1, weight=100
3. id=2, weight=100

The number of matches (three) is right, but the order of the results is not what I
wanted.

Documents 1 and 2 are obviously the closest to the search string ("National hospital"),
so in my opinion they should get the largest weights, but they are ordered last.

Is there any way to meet this requirement?

PS:
1) Please do not suggest that I set the sort mode to:
sc.SetSortMode(SphinxClient.SPH_SORT_EXTENDED,"@weight ASC");

This may work for this particular example, but it would cause other potential problems.
2) The contents of my table are actually in Chinese; I just use "National Hosp..l" as
an example.
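
A note on the numbers above: the Sphinx documentation describes the SPH_MATCH_ALL weight as a sum of per-field phrase ranks multiplied by the field weights. The Python sketch below is only a simplified model of that rule, not Sphinx's actual code. The helper names (phrase_rank, match_all_weight) are made up for illustration, stemming is ignored, and treating a field weight of 0 as 1 is an assumption chosen because it reproduces the observed weights of 101, 100 and 100.

```python
def phrase_rank(query_words, field_words):
    """Length of the longest run of query words that appears adjacent and in order."""
    best = 0
    for qi in range(len(query_words)):
        for fi in range(len(field_words)):
            run = 0
            while (qi + run < len(query_words)
                   and fi + run < len(field_words)
                   and query_words[qi + run] == field_words[fi + run]):
                run += 1
            best = max(best, run)
    return best


def match_all_weight(query, doc, field_weights):
    """Sum of (field weight * phrase rank) over all fields of one document."""
    query_words = query.lower().split()
    total = 0
    for field, text in doc.items():
        # ASSUMPTION: a configured field weight of 0 behaves like 1.
        weight = max(1, field_weights[field])
        total += weight * phrase_rank(query_words, text.lower().split())
    return total


docs = {
    1: {"title": "National first hospital", "body": "NASA"},
    2: {"title": "National second hospital", "body": "Space Administration"},
    3: {"title": "National government", "body": "Support the hospital"},
}
field_weights = {"title": 100, "body": 0}

weights = {doc_id: match_all_weight("National hospital", doc, field_weights)
           for doc_id, doc in docs.items()}
print(weights)
```

Under this model document 3 comes first simply because the query words occur in two of its fields, while documents 1 and 2 match only in the title, where the two words are not adjacent.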

barryhunter

Name: Barry Hunter
Posts: 6896

to: m4eclipse, 2010-09-14 01:42:55 | reply!


I've pretty much answered this in the second part of this thread:
http://sphinxsearch.com/forum/view.html?id=6228


(I didn't reply earlier because I thought I had replied there, so this was just a duplicate.)

m4eclipse

Name: eclipse
Posts: 15

to: barryhunter, 2010-09-15 12:11:12 | reply!


In this thread:
http://sphinxsearch.com/forum/view.html?id=6228

you gave me a link about how ranking is computed.

It is very useful, and I found that SPH_RANK_WORDCOUNT can meet my requirement.

I use the match mode SPH_MATCH_EXTENDED2 and the ranking mode SPH_RANK_WORDCOUNT

for this table:
+----+-------------+-------+
| id | title       | body  |
+----+-------------+-------+
| 1  | hello world | text  |
| 2  | hello       | world |
+----+-------------+-------+

I also set the field weights: title 20, body 10.
If I search for "hello world", both document 1 and document 2 are returned.
Document 1 is sorted first, since it gets a weight of
40 [= 20 (hello in the title field) + 20 (world in the title field)],
while document 2 gets a weight of 30 [= 20 (hello in the title field) + 10 (world in the
body field)].
This is correct.
However, I cannot do a fuzzy search; that is, when I search for "hello any world", no
document is matched.
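
The arithmetic above can be sketched in Python. This is a simplified model of a word-count ranker (occurrences of each keyword in a field, multiplied by that field's weight, summed over all fields), not the Sphinx implementation. The function name wordcount_weight and the plain whitespace tokenizer are assumptions made for illustration.

```python
def wordcount_weight(query, doc, field_weights):
    """Sum over fields of (keyword occurrences in the field * field weight)."""
    keywords = query.lower().split()
    total = 0
    for field, text in doc.items():
        words = text.lower().split()
        hits = sum(words.count(keyword) for keyword in keywords)
        total += hits * field_weights[field]
    return total


field_weights = {"title": 20, "body": 10}
doc1 = {"title": "hello world", "body": "text"}
doc2 = {"title": "hello", "body": "world"}

weight1 = wordcount_weight("hello world", doc1, field_weights)
weight2 = wordcount_weight("hello world", doc2, field_weights)
print(weight1, weight2)  # 40 30
```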

It seems that SetRankingMode only works when the match mode is "SPH_MATCH_EXTENDED2",
so I want to know whether I can implement this search behaviour:

The search results should be the same as those returned by the match mode
"SPH_MATCH_ANY", with the weight of each document computed by
"SPH_RANK_WORDCOUNT".

Can this be implemented?

BTW, the reason I post my reply here is that we are discussing weight computation,
which does not fit the old thread, whose title is "Chinese lexer".

barryhunter

Name: Barry Hunter
Posts: 6896

to: m4eclipse, 2010-09-15 14:55:08 | reply!


SPH_RANK_WORDCOUNT can only be used with SPH_MATCH_EXTENDED2.

But you can emulate SPH_MATCH_ANY with the extended syntax.

So...

$s->setMatchMode(SPH_MATCH_EXTENDED2);
$s->setRankingMode(SPH_RANK_WORDCOUNT);
$s->Query('"hello world"/1',$index);
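
For reference, the "/1" suffix is the quorum operator of the extended query syntax: a document matches if it contains at least that many of the quoted keywords. The Python sketch below imitates that rule on plain strings. quorum_match is a hypothetical helper, not part of any Sphinx API, and it ignores stemming and field boundaries.

```python
def quorum_match(keywords, text, threshold):
    """True if at least `threshold` distinct keywords occur in the text."""
    words = set(text.lower().split())
    present = sum(1 for keyword in set(keywords) if keyword in words)
    return present >= threshold


keywords = "hello any world".split()
document = "hello world text"

# Requiring every keyword fails, because "any" is missing from the document.
print(quorum_match(keywords, document, len(keywords)))  # False
# A quorum of 1 behaves like an "any word" match.
print(quorum_match(keywords, document, 1))  # True
```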

m4eclipse

Name: eclipse
Posts: 15

to: barryhunter, 2010-09-16 13:14:15 | reply!


> $s->setMatchMode(SPH_MATCH_EXTENDED2);
> $s->setRankingMode(SPH_RANK_WORDCOUNT);
> $s->Query('"hello world"/1',$index);
>
Thanks, it works now.

In your article you mentioned:

> Relevance is ultimately subjective, so there's no single one-size-fits-all ranker,
> and there will never be.

This is true, but it rather confuses me.

It seems that I have to choose between "SPH_RANK_PROXIMITY_BM25" and
"SPH_RANK_WORDCOUNT".

document 1: hello world
document 2: world hello

Search for "hello world":
With "SPH_RANK_WORDCOUNT", both documents get the same weight.
In this case, "SPH_RANK_PROXIMITY_BM25" would be better. However, in my former
case, "SPH_RANK_WORDCOUNT" is obviously better.
So I am rather torn... :)
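
The trade-off can be seen in a small Python sketch. Both functions are simplified stand-ins written for this example, and their names are hypothetical: keyword_count mimics a word-count ranker, and phrase_proximity mimics only the phrase-proximity part of SPH_RANK_PROXIMITY_BM25, leaving out BM25 and field weights.

```python
def keyword_count(query, text):
    """Total occurrences of the query keywords in the text; word order is ignored."""
    words = text.lower().split()
    return sum(words.count(keyword) for keyword in query.lower().split())


def phrase_proximity(query, text):
    """Longest run of query words that appears adjacent and in order in the text."""
    query_words = query.lower().split()
    text_words = text.lower().split()
    best = 0
    for qi in range(len(query_words)):
        for ti in range(len(text_words)):
            run = 0
            while (qi + run < len(query_words)
                   and ti + run < len(text_words)
                   and query_words[qi + run] == text_words[ti + run]):
                run += 1
            best = max(best, run)
    return best


for text in ("hello world", "world hello"):
    print(text, keyword_count("hello world", text), phrase_proximity("hello world", text))
```

The count is the same for both documents, while the proximity value is higher for the document that keeps the query's word order.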

barryhunter

Name: Barry Hunter
Posts: 6896

to: m4eclipse, 2010-09-16 13:30:49 | reply!


>
> In your article you mentioned:

(not my article! I just provided a link)

>
> > Relevance is ultimately subjective, so there's no single one-size-fits-all ranker,
> > and there will never be.
>
>
> It seems that I have to choose between "SPH_RANK_PROXIMITY_BM25" and
> "SPH_RANK_WORDCOUNT".
>
> document 1: hello world
> document 2: world hello
>
> Search for "hello world":
> With "SPH_RANK_WORDCOUNT", both documents get the same weight.

One way to address that is

$c->Query('"hello world" | (hello world)',$index);

If the words are in the correct order, the document matches both sides of the OR.


...

It sounds like you really need to break this down into what you believe are your important
ranking factors (for your data and use case) and then try to address each in turn.

Accept that it might not be possible to do it all, or that you might have to compromise.

Almost by definition, relevance is an inexact science, so compromises are a fact of life.
