| ID | Category | Severity | Reproducibility | Date Submitted | Last Update |
| 0000540 | [Sphinx] documentation | minor | always | 2010-06-23 20:24 | 2010-06-23 20:24 |
| Reporter | roby1kenobi | View Status | public |
| Assigned To | |
| Priority | normal | Resolution | open |
| Status | new | Product Version | 0.9.9-rc2 |
| Summary | 0000540: Documentation states space (U+20) is valid, but adding it to charset_table breaks searches |
| Description |
I am attempting to index multi-word tags as single phrases. However, adding the space character (U+20) to the charset_table setting breaks all searches, regardless of match mode.

Expected:
---------
According to the documentation (http://www.sphinxsearch.com/docs/manual-0.9.9.html#charsets) and numerous forum posts (e.g. http://www.sphinxsearch.com/forum/view.html?id=60), the space character (U+20) should be a valid charset_table entry. If the issue is not going to be addressed, the documentation should be updated.

Steps to Reproduce:
-------------------
1. Create an xmlpipe2 XML data source file that includes tags containing spaces, and ensure it is saved with UTF-8 character encoding.
2. Create a sphinx.conf file with a charset_table setting that includes space (U+20).
   charset_table = 0..9, A..Z->a..z, _, a..z, U+20
3. Run indexer to generate the index files from the xmlpipe2 data source file.
   sudo indexer --config /local1/roby/sphinx/sphinx.conf --all
4. Run indexer with --buildstops to generate the list of terms added to the index.
   sudo indexer --config /local1/roby/sphinx/sphinx.conf test --buildstops words.txt 1000
5. Verify that words.txt includes the expected terms.
6. Run search and try to find hits for the terms listed in words.txt. Note that nothing matches, not even terms without spaces. For example:
   search tag2_underscore
7. Edit sphinx.conf, replacing the space character (U+20) with a hyphen (U+2D) in charset_table, and index again.
   charset_table = 0..9, A..Z->a..z, _, a..z, U+2D
8. Run the same search queries and note that they now work.
9. Compile a debug build of Sphinx:
   make distclean
   ./configure --with-debug --with-pgsql --without-mysql --prefix /local1/roby/sphinx-debug
   make
   sudo make install
10. Repeat steps 3 through 8 to confirm that the results are consistent with the debug build.

Note: I also downloaded the Win32 binaries and reproduced the issue with them at home.
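To make the difference between the two charset_table settings in steps 2 and 7 concrete, here is a minimal toy model in Python. It is only a sketch: `parse_charset_table` and `tokenize` are hypothetical helpers written for this illustration, they cover only the small subset of charset_table syntax used in this report, and they are not Sphinx's actual tokenizer. The model assumes that any character missing from the table acts as a word separator. It does not diagnose the bug; it only shows that once U+20 is a word character, whitespace can no longer split either documents or queries.

```python
def parse_charset_table(spec):
    """Parse a small subset of charset_table syntax into {codepoint: folded codepoint}.

    Handles single characters, ranges (a..z), mappings (A..Z->a..z) and
    U+XX hex codes. This is an illustrative helper, not Sphinx code.
    """
    def codepoint(token):
        token = token.strip()
        # "U+20" style hex code, otherwise a single literal character
        if token.upper().startswith("U+"):
            return int(token[2:], 16)
        return ord(token)

    def bounds(token):
        low, _, high = token.partition("..")
        return codepoint(low), codepoint(high or low)

    table = {}
    for entry in spec.split(","):
        source, _, target = entry.strip().partition("->")
        src_low, src_high = bounds(source)
        dst_low = bounds(target)[0] if target else src_low
        for offset in range(src_high - src_low + 1):
            table[src_low + offset] = dst_low + offset
    return table


def tokenize(text, table):
    """Split text into tokens; characters absent from the table act as separators."""
    tokens, current = [], []
    for char in text:
        folded = table.get(ord(char))
        if folded is None:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(chr(folded))
    if current:
        tokens.append("".join(current))
    return tokens


WITH_SPACE = parse_charset_table("0..9, A..Z->a..z, _, a..z, U+20")
WITH_HYPHEN = parse_charset_table("0..9, A..Z->a..z, _, a..z, U+2D")

print(tokenize("tag1 with spaces", WITH_SPACE))   # ['tag1 with spaces']
print(tokenize("tag1 with spaces", WITH_HYPHEN))  # ['tag1', 'with', 'spaces']
print(tokenize("tag2_underscore ", WITH_SPACE))   # ['tag2_underscore '] (trailing space kept)
```

In this toy model, a query that carries any stray whitespace becomes a different token from the indexed term. Whether something similar happens inside Sphinx for the queries in step 6 is an open question that this report does not answer.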
| Additional Information |
Sphinx.conf:
------------
source test_src
{
    type = xmlpipe2
    xmlpipe_command = cat /local1/roby/sphinx/sphinx-0.9.9/test.xml
    xmlpipe_fixup_utf8 = 1
}

index test
{
    source = test_src
    path = /local1/roby/sphinx/sphinx-0.9.9/var/test/data/test_src
    docinfo = extern
    mlock = 1
    morphology = none
    min_stemming_len = none
    min_prefix_len = 0
    enable_star = false
    charset_type = utf-8
    charset_table = 0..9, A..Z->a..z, _, a..z, U+20
    # use ^ (U+5E) as word delimiter
    #charset_table = 0..9, A..Z->a..z, _, a..z, U+20..U+29, U+2A..U+2F, U+3A..U+3F, U+40, U+5B..U+5D, U+5F, U+60, U+7B..U+7E
}

index dist_index
{
    type = distributed
    local = test
    agent_connect_timeout = 1000
}

indexer
{
    mem_limit = 256M
}

searchd
{
    listen = 127.0.0.1:9312
    log = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/
    query_log = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/query.log
    pid_file = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/searchd.pid
    preopen_indexes = 1
    unlink_old = 1
    max_matches = 1000
}

XMLPipe2 Data File:
-------------------
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:attr name="id_attr" type="int" bits="16" default="1"/>
    <sphinx:attr name="organism_attr" type="int" bits="16" default="1"/>
    <sphinx:attr name="project_attr" type="int" bits="16" default="0"/>
    <sphinx:attr name="class_attr" type="int" bits="16" default="0"/>
    <sphinx:field name="entrez_id"/>
    <sphinx:field name="homologene_id"/>
    <sphinx:field name="subject"/>
    <sphinx:field name="tags"/>
    <sphinx:field name="content"/>
  </sphinx:schema>
  <sphinx:document id="2000001">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>310</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>1</entrez_id>
    <homologene_id>1</homologene_id>
    <subject>subject1</subject>
    <tags>tag1 with spaces</tags>
    <content>a</content>
  </sphinx:document>
  <sphinx:document id="2000002">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>350</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>10</entrez_id>
    <homologene_id>10</homologene_id>
    <subject>subject2</subject>
    <tags>tag1 with spaces</tags>
    <content>b</content>
  </sphinx:document>
  <sphinx:document id="2000003">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>343</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>0</entrez_id>
    <homologene_id>0</homologene_id>
    <subject>subject3</subject>
    <tags>tag2_underscore</tags>
    <content>c</content>
  </sphinx:document>
</sphinx:docset>
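For anyone who wants to vary the tags while reproducing this, the data file above can be regenerated with a short script. The following is a minimal sketch in Python, not part of the original report: `document` and `docset` are hypothetical helper names, and for brevity the sketch emits only the full-text fields from the schema and omits the `sphinx:attr` declarations and attribute values.

```python
from xml.sax.saxutils import escape

# Full-text fields taken from the schema in this report.
FIELDS = ("entrez_id", "homologene_id", "subject", "tags", "content")


def document(doc_id, **fields):
    """Render one <sphinx:document> element, escaping field text for XML."""
    body = "".join(
        f"<{name}>{escape(str(fields[name]))}</{name}>\n"
        for name in FIELDS
        if name in fields
    )
    return f'<sphinx:document id="{doc_id}">\n{body}</sphinx:document>\n'


def docset(documents):
    """Wrap rendered documents in a <sphinx:docset> with a fields-only schema."""
    schema = "".join(f'<sphinx:field name="{name}"/>\n' for name in FIELDS)
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        "<sphinx:docset>\n"
        "<sphinx:schema>\n" + schema + "</sphinx:schema>\n"
        + "".join(documents)
        + "</sphinx:docset>\n"
    )


xml_text = docset([
    document(2000001, entrez_id=1, homologene_id=1,
             subject="subject1", tags="tag1 with spaces", content="a"),
    document(2000003, entrez_id=0, homologene_id=0,
             subject="subject3", tags="tag2_underscore", content="c"),
])
print(xml_text)
```

When saving the result to disk, opening the file with `encoding="utf-8"` keeps it consistent with step 1 of the reproduction.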
| Tags | No tags attached. |
| Attached Files | |
| There are no notes attached to this issue. |