How did the sphinx calculate the weight?
|
m4eclipse
Name: eclipse |
2010-09-13 05:45:20
First, take a look at an example. The following is my table (used just for testing):

+----+--------------------------+----------------------+
| Id | title                    | body                 |
+----+--------------------------+----------------------+
| 1  | National first hospital  | NASA                 |
| 2  | National second hospital | Space Administration |
| 3  | National govenment       | Support the hospital |
+----+--------------------------+----------------------+

I want to search the contents of the title and body fields, so I configured sphinx.conf as follows:

--------The sphinx config file----------
source mysql
{
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass = 0000
    sql_db = testfull
    sql_port = 3306 # optional, default is 3306
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT * FROM test
}

index mysql
{
    source = mysql
    path = var/data/mysql_old_test
    docinfo = extern
    mlock = 0
    morphology = stem_en, stem_ru, soundex
    min_stemming_len = 1
    min_word_len = 1
    charset_type = utf-8
    html_strip = 0
}

indexer
{
    mem_limit = 128M
}

searchd
{
    listen = 9312
    read_timeout = 5
    max_children = 30
    max_matches = 1000
    seamless_rotate = 0
    preopen_indexes = 0
    unlink_old = 1
    pid_file = var/log/searchd_mysql.pid
    log = var/log/searchd_mysql.log
    query_log = var/log/query_mysql.log
}
------------------

Then I reindexed the db and started the searchd daemon. On the client side I set the options as follows:

----------Client side config-------------------
sc = new SphinxClient();
// other settings
HashMap<String, Integer> weiMap = new HashMap<String, Integer>();
weiMap.put("title", 100);
weiMap.put("body", 0);
sc.SetFieldWeights(weiMap);
sc.SetMatchMode(SphinxClient.SPH_MATCH_ALL);
sc.SetSortMode(SphinxClient.SPH_SORT_EXTENDED, "@weight DESC");
-----------------------------

When I search for "National hospital", I get the following output:

Query 'National hospital' retrieved 3 of 3 matches in 0.0 sec.
Query stats:
    'nation' found 3 times in 3 documents
    'hospit' found 3 times in 3 documents
Matches:
1. id=3, weight=101
2. id=1, weight=100
3. id=2, weight=100

The number of matches (three) is right, but the order of the results is not what I wanted. Documents 1 and 2 are obviously the closest to the query string ("National hospital"), since both words appear in their title, so in my opinion they should get the largest weights. Instead they are ordered last. Is there any way to meet my requirement?

PS:
1) Please do not suggest setting the sort mode to
   sc.SetSortMode(SphinxClient.SPH_SORT_EXTENDED, "@weight ASC");
   This may work for this one example, but it would cause other potential problems.
2) The contents of my real table are actually Chinese; I only used "National hospital" to make an example.
|
barryhunter
Name: Barry Hunter |
to: m4eclipse, 2010-09-14 01:42:55
I've pretty much answered this in the second part of this thread: http://sphinxsearch.com/forum/view.html?id=6228 (I didn't reply here earlier because I thought I had already replied there, so this looked like just a duplicate.)
|
m4eclipse
Name: eclipse |
to: barryhunter, 2010-09-15 12:11:12
In this thread: http://sphinxsearch.com/forum/view.html?id=6228 you gave me a link about how ranking is computed. It is very useful, and I found that SPH_RANK_WORDCOUNT can meet my requirement.

I use the match mode SPH_MATCH_EXTENDED2 and the ranking mode SPH_RANK_WORDCOUNT for this table:

+----+-------------+-------+
| id | title       | body  |
+----+-------------+-------+
| 1  | hello world | text  |
| 2  | hello       | world |
+----+-------------+-------+

I also set the field weights: title 20, body 10.

If I search "hello world", both document 1 and document 2 are returned, and document 1 is sorted first:

document 1 gets a weight of 40 [= 20 (hello in title field) + 20 (world in title field)]
document 2 gets a weight of 30 [= 20 (hello in title field) + 10 (world in body field)]

This is correct. However, I cannot do a fuzzy search: when I search "hello any world", no document is matched. It seems that SetRankingMode only works when the match mode is SPH_MATCH_EXTENDED2, so I want to know whether I can implement this kind of search: the result set should be the same as what the match mode SPH_MATCH_ANY returns, while the weight of each document is computed by SPH_RANK_WORDCOUNT. Can this be implemented?

BTW, the reason I post my reply here is that what we are discussing is weight computation, and the old thread, titled "Chinese lexer", does not fit it.
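The arithmetic in this post can be reproduced with a small stand-alone model. This is only a sketch under the assumption that the word-count ranker sums, over all fields, the field weight multiplied by the number of keyword occurrences in that field. It does not call Sphinx; the class and method names are hypothetical, and tokenization is a plain whitespace split with no stemming.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone model of word-count ranking; not the Sphinx API.
public class WordCountRankSketch {

    // Number of whitespace-separated tokens in text equal to keyword (case-insensitive).
    static int occurrences(String text, String keyword) {
        int count = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.equals(keyword.toLowerCase())) {
                count++;
            }
        }
        return count;
    }

    // Sum over fields of (field weight * keyword occurrences in that field).
    // Assumption: fields without an explicit weight default to 1.
    static int score(Map<String, String> doc, Map<String, Integer> fieldWeights, String[] keywords) {
        int total = 0;
        for (Map.Entry<String, String> field : doc.entrySet()) {
            int weight = fieldWeights.getOrDefault(field.getKey(), 1);
            for (String keyword : keywords) {
                total += weight * occurrences(field.getValue(), keyword);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("title", 20);
        weights.put("body", 10);

        Map<String, String> doc1 = new LinkedHashMap<>();
        doc1.put("title", "hello world");
        doc1.put("body", "text");

        Map<String, String> doc2 = new LinkedHashMap<>();
        doc2.put("title", "hello");
        doc2.put("body", "world");

        String[] query = {"hello", "world"};
        System.out.println("doc1 = " + score(doc1, weights, query)); // 20 + 20
        System.out.println("doc2 = " + score(doc2, weights, query)); // 20 + 10
    }
}
```

Under this model document 1 scores 40 and document 2 scores 30, matching the order reported in the post.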
|
barryhunter
Name: Barry Hunter |
to: m4eclipse, 2010-09-15 14:55:08
SPH_RANK_WORDCOUNT can only be used with SPH_MATCH_EXTENDED2. But you can emulate SPH_MATCH_ANY with the extended syntax, using the quorum operator. So....

$s->setMatchMode(SPH_MATCH_EXTENDED2);
$s->setRankingMode(SPH_RANK_WORDCOUNT);
$s->Query('"hello world"/1',$index);
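For the Java client used earlier in the thread, the quorum query string could be built with a small helper along these lines. This is a sketch only: the class and method names are hypothetical, and the helper merely drops double quotes from the input; other characters that are special in the extended query syntax are not escaped here.

```java
// Hypothetical helper that wraps user keywords in a quorum expression.
public class QuorumQuerySketch {

    // Builds "w1 w2 ... wn"/minMatches, which in the extended query syntax
    // matches documents containing at least minMatches of the listed words.
    static String anyOf(String keywords, int minMatches) {
        // Drop double quotes so the input cannot close the quoted block early,
        // then collapse runs of whitespace.
        String cleaned = keywords.replace("\"", " ").trim().replaceAll("\\s+", " ");
        return "\"" + cleaned + "\"/" + minMatches;
    }

    public static void main(String[] args) {
        System.out.println(anyOf("hello any world", 1));
    }
}
```

The resulting string would then be passed as the query text with the match mode set to SPH_MATCH_EXTENDED2, as in the PHP snippet above.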
|
m4eclipse
Name: eclipse |
to: barryhunter, 2010-09-16 13:14:15
> $s->setMatchMode(SPH_MATCH_EXTENDED2);
> $s->setRankingMode(SPH_RANK_WORDCOUNT);
> $s->Query('"hello world"/1',$index);

Thanks, it works now.

In your article you mentioned:

> Relevance is ultimately subjective, so there's no single one-size-fits-all ranker, and
> there will never be.

This is true, but it rather confuses me. It seems that I have to choose between SPH_RANK_PROXIMITY_BM25 and SPH_RANK_WORDCOUNT.

document 1: hello world
document 2: world hello

Search "hello world":
With SPH_RANK_WORDCOUNT, both documents get the same weight. In this case SPH_RANK_PROXIMITY_BM25 would be better. However, in my former case SPH_RANK_WORDCOUNT is obviously better. So I am rather torn.... :)
|
barryhunter
Name: Barry Hunter |
to: m4eclipse, 2010-09-16 13:30:49
> In your article you mentioned:

(not my article! I just provided a link)

> > Relevance is ultimately subjective, so there's no single one-size-fits-all ranker, and
> > there will never be.
>
> It seems that I have to choose between SPH_RANK_PROXIMITY_BM25 and
> SPH_RANK_WORDCOUNT.
>
> document 1: hello world
> document 2: world hello
>
> Search "hello world":
> With SPH_RANK_WORDCOUNT, both documents get the same weight.

One way to address that is

$c->Query('"hello world" | (hello world)',$index);

With that query, a document that has the words in the correct order matches both terms of the OR.

... sounds like you really need to break it down into what you believe are your important ranking factors (for your data and use case) and then try to address each in turn. Accept that it might not be able to do it all, or that you might have to make compromises. Almost by definition relevance is an inexact science, so compromises are a fact of life.
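The same query shape can be generated for arbitrary keywords. The sketch below uses a hypothetical helper name, drops double quotes from the input, and escapes nothing else; whether the phrase branch actually raises a document's weight depends on the ranker in use, so treat it as a starting point to test against your own data.

```java
// Hypothetical helper that builds: "exact phrase" | (all words)
public class PhraseBoostQuerySketch {

    static String phraseOrAll(String keywords) {
        // Drop double quotes and collapse whitespace before building both branches.
        String cleaned = keywords.replace("\"", " ").trim().replaceAll("\\s+", " ");
        return "\"" + cleaned + "\" | (" + cleaned + ")";
    }

    public static void main(String[] args) {
        System.out.println(phraseOrAll("hello world"));
    }
}
```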