| View Issue Details | ||||||
| ID | Project | Category | View Status | Date Submitted | Last Update | ||
| 0000540 | Sphinx | documentation | public | 2010-06-23 20:24 | 2012-11-03 19:27 | ||
| Reporter | roby1kenobi | ||||||
| Assigned To | kevg | ||||||
| Priority | normal | Severity | minor | Reproducibility | always | ||
| Status | closed | Resolution | fixed | ||||
| Platform | OS | OS Version | |||||
| Product Version | 0.9.9-rc2 | ||||||
| Target Version | Fixed in Version | ||||||
| Summary | 0000540: Documentation states space (U+20) is valid, but adding it to charset_table breaks searches | ||||||
| Description |

I am attempting to index multi-word tags as single phrases. However, adding the space character (U+20) to the charset_table setting breaks all searches, regardless of match mode.

Expected:
---------
According to the documentation (http://www.sphinxsearch.com/docs/manual-0.9.9.html#charsets) and numerous posts (e.g. http://www.sphinxsearch.com/forum/view.html?id=60), the space character (U+20) should be valid in the charset_table setting. If the issue is not going to be addressed, the documentation should be updated.

Steps to Reproduce:
-------------------
1. Create an xmlpipe2 XML data source file that includes tags containing spaces, and ensure it is saved with UTF-8 character encoding.
2. Create a sphinx.conf file with a charset_table setting that includes space (U+20).
   charset_table = 0..9, A..Z->a..z, _, a..z, U+20
3. Run indexer to generate the index files from the xmlpipe2 data source file.
   sudo indexer --config /local1/roby/sphinx/sphinx.conf --all
4. Run indexer with --buildstops to generate a list of the terms added to the index.
   sudo indexer --config /local1/roby/sphinx/sphinx.conf test --buildstops words.txt 1000
5. Verify that words.txt includes the expected terms.
6. Run search and try to find hits for the terms listed in words.txt. Nothing works, even terms without spaces. For example:
   search tag2_underscore
7. Edit sphinx.conf, replacing the space character (U+20) with a hyphen (U+2D) in charset_table, and index again.
   charset_table = 0..9, A..Z->a..z, _, a..z, U+2D
8. Run the same search queries and note that they now work.
9. Compile a debug build of sphinx:
   make distclean
   ./configure --with-debug --with-pgsql --without-mysql --prefix /local1/roby/sphinx-debug
   make
   sudo make install
10. Repeat steps 3 through 8 to confirm the results are consistent with the debug build.

Note: I also downloaded the Win32 binaries and reproduced the issue with them at home. | ||||||
| Additional Information |

Sphinx.conf:
------------
source test_src
{
    type = xmlpipe2
    xmlpipe_command = cat /local1/roby/sphinx/sphinx-0.9.9/test.xml
    xmlpipe_fixup_utf8 = 1
}

index test
{
    source = test_src
    path = /local1/roby/sphinx/sphinx-0.9.9/var/test/data/test_src
    docinfo = extern
    mlock = 1
    morphology = none
    min_stemming_len = none
    min_prefix_len = 0
    enable_star = false
    charset_type = utf-8
    charset_table = 0..9, A..Z->a..z, _, a..z, U+20
    # use ^ (U+5E) as word delimiter
    #charset_table = 0..9, A..Z->a..z, _, a..z, U+20..U+29, U+2A..U+2F, U+3A..U+3F, U+40, U+5B..U+5D, U+5F, U+60, U+7B..U+7E
}

index dist_index
{
    type = distributed
    local = test
    agent_connect_timeout = 1000
}

indexer
{
    mem_limit = 256M
}

searchd
{
    listen = 127.0.0.1:9312
    log = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/
    query_log = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/query.log
    pid_file = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/searchd.pid
    preopen_indexes = 1
    unlink_old = 1
    max_matches = 1000
}

XMLPipe2 Data File:
-------------------
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
    <sphinx:attr name="id_attr" type="int" bits="16" default="1"/>
    <sphinx:attr name="organism_attr" type="int" bits="16" default="1"/>
    <sphinx:attr name="project_attr" type="int" bits="16" default="0"/>
    <sphinx:attr name="class_attr" type="int" bits="16" default="0"/>
    <sphinx:field name="entrez_id"/>
    <sphinx:field name="homologene_id"/>
    <sphinx:field name="subject"/>
    <sphinx:field name="tags"/>
    <sphinx:field name="content"/>
</sphinx:schema>
<sphinx:document id="2000001">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>310</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>1</entrez_id>
    <homologene_id>1</homologene_id>
    <subject>subject1</subject>
    <tags>tag1 with spaces</tags>
    <content>a</content>
</sphinx:document>
<sphinx:document id="2000002">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>350</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>10</entrez_id>
    <homologene_id>10</homologene_id>
    <subject>subject2</subject>
    <tags>tag1 with spaces</tags>
    <content>b</content>
</sphinx:document>
<sphinx:document id="2000003">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>343</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>0</entrez_id>
    <homologene_id>0</homologene_id>
    <subject>subject3</subject>
    <tags>tag2_underscore</tags>
    <content>c</content>
</sphinx:document>
</sphinx:docset>
| ||||||
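Step 1 of the report asks for an xmlpipe2 file saved as UTF-8 with tags that contain spaces. A minimal sketch of generating such a file could look like the following. The function name `build_docset` is hypothetical, and the schema is reduced to the single `tags` field for brevity, so this is not the reporter's full data file.

```python
from xml.sax.saxutils import escape


def build_docset(tags_by_id):
    """Return a reduced xmlpipe2 docset (only a 'tags' field) as a string.

    Hypothetical helper for illustration; the real report uses a larger schema.
    """
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="tags"/>',
        "</sphinx:schema>",
    ]
    for doc_id, tags in sorted(tags_by_id.items()):
        parts.append('<sphinx:document id="%d">' % doc_id)
        # escape() protects &, < and > so the file stays well-formed XML
        parts.append("<tags>%s</tags>" % escape(tags))
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts) + "\n"


xml_text = build_docset({
    2000001: "tag1 with spaces",
    2000003: "tag2_underscore",
})
# Encode explicitly so the bytes on disk match the declared encoding.
xml_bytes = xml_text.encode("utf-8")
```

Writing `xml_bytes` to the file named in `xmlpipe_command` would then give the indexer a UTF-8 source, matching the encoding declared in the XML prolog.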
| Tags | No tags attached. | ||||||
| Attached Files | |||||||
Notes |
|
|
(0000972) MoneySack (reporter) 2010-11-25 12:28 |
Same bug on my installation. When the index is configured with U+20 in charset_table and is queried with the search shell utility, it looks like sphinx-search is prepending or appending (depending on the query mode) a whitespace character, which results in no matches. For example:

# search -i players_test -e \@\(name\)*somename*

Output:

using config file '/usr/local/etc/sphinx.conf'...
index 'players_test': query '@(name)*somename* ': returned 0 matches of 0 total in 0.000 sec

words:
1. '*somename* ': 0 documents, 0 hits

Note the whitespace after the asterisk. On an index without U+20 in charset_table, the output looks like this:

using config file '/usr/local/etc/sphinx.conf'...
index 'players': query '@(name)*somename*': returned 0 matches of 0 total in 0.000 sec

words:
1. '*somename*': 0 documents, 0 hits

There is no whitespace in the query string after the asterisk. I hope this bug will be fixed soon. |
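The effect described in this note can be illustrated with a toy tokenizer. This is only a sketch under the assumption that any character listed in the character table is treated as part of a keyword and everything else as a separator. It is not Sphinx's actual tokenizer, and the thread does not establish where the extra space in the query comes from. The names `tokenize`, `BASE_CHARS` and `WITH_SPACE` are made up for the example.

```python
def tokenize(text, word_chars):
    """Split text into keywords; characters outside word_chars act as separators."""
    tokens, current = [], []
    for ch in text.lower():
        if ch in word_chars:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Roughly "0..9, A..Z->a..z, _, a..z" (case folding is done by lower() above).
BASE_CHARS = set("0123456789abcdefghijklmnopqrstuvwxyz_")
# The same table with the space character (U+20) added.
WITH_SPACE = BASE_CHARS | {" "}

print(tokenize("tag1 with spaces", BASE_CHARS))   # ['tag1', 'with', 'spaces']
print(tokenize("tag1 with spaces", WITH_SPACE))   # ['tag1 with spaces']
print(tokenize("somename ", BASE_CHARS))          # ['somename']
print(tokenize("somename ", WITH_SPACE))          # ['somename ']
```

In this model, once the space is a keyword character, a query with a stray trailing space yields the keyword `'somename '`, which differs from an indexed `'somename'`. That would be consistent with the zero hits shown above, although it remains an illustration and not a diagnosis.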
|
(0001209) akremer (reporter) 2011-02-21 11:32 |
Confirm this is still an issue in 1.10b. Would love a fix for this as well. |
|
(0001255) Tomat (manager) 2011-03-09 12:05 |
I've just tried the config and data from this issue against both 1.10b and 1.11 (trunk), and it indexes and returns correct matches via SphinxQL and the API (I queried for 'tag1 with spaces' and 'tag2_underscore'). It is not clear to me whether this issue is related to charset_table or to the search utility. |
|
(0003452) kevg (manager) 2012-11-03 19:27 |
U+20 is not valid anymore. It now produces this warning: WARNING: wrong character mapping start specified: U+20, should be between U+21 and U+2ffff (inclusive); CLAMPED |
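The range quoted in the warning (U+21 through U+2ffff, inclusive) can be checked before indexing. The following is a minimal sketch, not part of Sphinx: the helper names are hypothetical, and the parser only handles the entry forms that appear in this report (single characters, `U+XX` codes, `a..z` ranges and `A..Z->a..z` mappings).

```python
import re

MIN_CODE = 0x21      # lower bound quoted in the warning
MAX_CODE = 0x2FFFF   # upper bound quoted in the warning


def parse_codepoint(token):
    """Convert 'U+20' or a single literal character into its code point."""
    token = token.strip()
    match = re.fullmatch(r"[Uu]\+([0-9A-Fa-f]+)", token)
    if match:
        return int(match.group(1), 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError("unsupported charset_table token: %r" % token)


def out_of_range_entries(charset_table):
    """Return the entries with any endpoint outside MIN_CODE..MAX_CODE."""
    bad = []
    for entry in charset_table.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Endpoints are separated by '..' (range) or '->' (mapping).
        endpoints = re.split(r"->|\.\.", entry)
        codes = [parse_codepoint(e) for e in endpoints]
        if any(not MIN_CODE <= c <= MAX_CODE for c in codes):
            bad.append(entry)
    return bad


print(out_of_range_entries("0..9, A..Z->a..z, _, a..z, U+20"))  # ['U+20']
print(out_of_range_entries("0..9, A..Z->a..z, _, a..z, U+2D"))  # []
```

Applied to the two charset_table lines from the report, the check flags only the entry containing U+20 and accepts the hyphen variant (U+2D).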
Issue History |
|||
| Date Modified | Username | Field | Change |
| 2010-06-23 20:24 | roby1kenobi | New Issue | |
| 2010-11-25 12:28 | MoneySack | Note Added: 0000972 | |
| 2010-11-25 12:28 | MoneySack | Issue Monitored: MoneySack | |
| 2011-02-21 11:32 | akremer | Note Added: 0001209 | |
| 2011-03-09 11:27 | kevg | Status | new => confirmed |
| 2011-03-09 12:05 | Tomat | Note Added: 0001255 | |
| 2012-11-03 19:27 | kevg | Note Added: 0003452 | |
| 2012-11-03 19:27 | kevg | Status | confirmed => closed |
| 2012-11-03 19:27 | kevg | Assigned To | => kevg |
| 2012-11-03 19:27 | kevg | Resolution | open => fixed |