Mantis Bugtracker

ID: 0000540
Category: [Sphinx] documentation
Severity: minor
Reproducibility: always
Date Submitted: 2010-06-23 20:24
Last Update: 2010-06-23 20:24
Reporter: roby1kenobi
View Status: public
Assigned To:
Priority: normal
Resolution: open
Status: new
Product Version: 0.9.9-rc2
Summary: 0000540: Documentation states space (U+20) is valid, but adding it to charset_table breaks searches

Description: I am attempting to index multi-word tags as single phrases. However, adding the space character (U+20) to the charset_table setting breaks all searches, regardless of match mode.

Expected:
---------
According to the documentation (http://www.sphinxsearch.com/docs/manual-0.9.9.html#charsets) and numerous forum posts (e.g. http://www.sphinxsearch.com/forum/view.html?id=60), the space character (U+20) should be valid in the charset_table setting. If the issue is not going to be addressed, the documentation should be updated.
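
To make the expectation concrete, here is a minimal Python sketch of how a charset_table is documented to behave: characters listed in the table are kept (and optionally case-folded), and every other character separates words. This is a simplified model written for this report, not Sphinx's actual tokenizer. `parse_charset_table` and `tokenize` are hypothetical helper names, and only the syntax forms that appear in this report are handled.

```python
def parse_charset_table(spec):
    """Parse a charset_table string into a {codepoint: folded_codepoint} map.

    Handles only the forms used in this report: single characters (`_`),
    `U+XX` code points, ranges (`a..z`, `U+20..U+29`) and range
    mappings (`A..Z->a..z`).
    """
    def codepoint(token):
        token = token.strip()
        if token.upper().startswith("U+"):
            return int(token[2:], 16)
        if len(token) != 1:
            raise ValueError(f"unsupported charset_table token: {token!r}")
        return ord(token)

    def span(part):
        low, _, high = part.partition("..")
        low = codepoint(low)
        return low, (codepoint(high) if high else low)

    table = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        source, _, target = item.partition("->")
        src_low, src_high = span(source)
        dst_low = span(target)[0] if target else src_low
        for offset in range(src_high - src_low + 1):
            table[src_low + offset] = dst_low + offset
    return table


def tokenize(text, table):
    """Split text into words: characters outside the table act as separators."""
    tokens, current = [], []
    for ch in text:
        folded = table.get(ord(ch))
        if folded is None:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(chr(folded))
    if current:
        tokens.append("".join(current))
    return tokens


with_space = parse_charset_table("0..9, A..Z->a..z, _, a..z, U+20")
with_hyphen = parse_charset_table("0..9, A..Z->a..z, _, a..z, U+2D")
print(tokenize("tag1 with spaces", with_space))
print(tokenize("tag1 with spaces", with_hyphen))
```

Under this model, the table containing U+20 keeps "tag1 with spaces" as one word, while the table containing U+2D splits it into three. The sketch only illustrates the expected indexing result; it does not reproduce the search failure itself.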


Steps to Reproduce:
-------------------

1. Create an xmlpipe2 XML data source file that includes tags containing spaces, and ensure it is saved with UTF-8 character encoding.

2. Create a sphinx.conf file with a charset_table setting that includes the space character (U+20).

   charset_table = 0..9, A..Z->a..z, _, a..z, U+20

3. Run indexer to generate the index files based on the xmlpipe2 data source file.

   sudo indexer --config /local1/roby/sphinx/sphinx.conf --all

4. Run indexer with --buildstops to generate list of terms added to index.

   sudo indexer --config /local1/roby/sphinx/sphinx.conf test --buildstops words.txt 1000

5. Verify words.txt includes the expected terms.

6. Run search for the terms listed in words.txt and note that no query returns any hits, even for terms without spaces. For example:

        search tag2_underscore

7. Edit sphinx.conf, replacing space char (U+20) with hyphen (U+2D) in charset_table and index again.

        charset_table = 0..9, A..Z->a..z, _, a..z, U+2D

8. Run same search queries and note that they now work.

9. Compile debug build of sphinx:

   make distclean
   ./configure --with-debug --with-pgsql --without-mysql --prefix /local1/roby/sphinx-debug
   make
   sudo make install

10. Repeat steps 3 through 8 to confirm the results are consistent with the debug build.

Note: I also downloaded the Win32 binaries and reproduced the problem with them at home.
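
The check in step 5 can be scripted. The Python sketch below assumes that words.txt, as written by --buildstops without --buildfreqs, holds one term per line; that format is an assumption here, and `missing_terms` is a hypothetical helper name. Whole lines are compared rather than whitespace-separated words, because with U+20 in charset_table a single term may itself contain spaces.

```python
def missing_terms(words_text, expected):
    """Return the expected terms that do not appear as a line in words_text."""
    listed = {line.strip() for line in words_text.splitlines() if line.strip()}
    return [term for term in expected if term not in listed]


# Example input standing in for the contents of words.txt (illustrative only).
sample = "tag1 with spaces\ntag2_underscore\nsubject1\n"
print(missing_terms(sample, ["tag1 with spaces", "tag2_underscore"]))
```

For a real run, the text would come from something like `open("words.txt", encoding="utf-8").read()`.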
Additional Information:

Sphinx.conf:
------------
source test_src
{
    type = xmlpipe2
    xmlpipe_command = cat /local1/roby/sphinx/sphinx-0.9.9/test.xml
    xmlpipe_fixup_utf8 = 1
}

index test
{
    source = test_src
    path = /local1/roby/sphinx/sphinx-0.9.9/var/test/data/test_src
    docinfo = extern
    mlock = 1
    morphology = none
    min_stemming_len = none
    min_prefix_len = 0
    enable_star = false
    charset_type = utf-8
    charset_table = 0..9, A..Z->a..z, _, a..z, U+20

    # use ^ (U+5E) as word delimiter
    #charset_table = 0..9, A..Z->a..z, _, a..z, U+20..U+29, U+2A..U+2F, U+3A..U+3F, U+40, U+5B..U+5D, U+5F, U+60, U+7B..U+7E
}

index dist_index
{
    type = distributed
    local = test
    agent_connect_timeout = 1000
}

indexer
{
    mem_limit = 256M
}

searchd
{
    listen = 127.0.0.1:9312
    log = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/
    query_log = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/query.log
    pid_file = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/searchd.pid
    preopen_indexes = 1
    unlink_old = 1
    max_matches = 1000
}
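
The only setting that differs between the failing and the working run is the active charset_table line. As a convenience for anyone repeating the comparison, the Python sketch below rewrites that one line in the configuration text and leaves the commented-out alternative untouched. `with_charset_table` is a hypothetical helper name, and the sketch assumes the file contains exactly one uncommented charset_table line, as the configuration above does.

```python
import re

# Matches an uncommented charset_table line, capturing its indentation.
_ACTIVE_CHARSET = re.compile(r"^([ \t]*)charset_table[ \t]*=.*$", re.MULTILINE)


def with_charset_table(conf_text, new_table):
    """Return conf_text with the single active charset_table line replaced."""
    def replace(match):
        return f"{match.group(1)}charset_table = {new_table}"

    new_text, count = _ACTIVE_CHARSET.subn(replace, conf_text)
    if count != 1:
        raise ValueError(f"expected one active charset_table line, found {count}")
    return new_text


conf = (
    "index test\n"
    "{\n"
    "    charset_type = utf-8\n"
    "    charset_table = 0..9, A..Z->a..z, _, a..z, U+20\n"
    "    #charset_table = 0..9, A..Z->a..z, _, a..z, U+20..U+29\n"
    "}\n"
)
print(with_charset_table(conf, "0..9, A..Z->a..z, _, a..z, U+2D"))
```

Writing each variant to its own file and passing it to indexer with --config keeps the two runs otherwise identical.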


XMLPipe2 Data File:
-------------------
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

  <sphinx:schema>
    <sphinx:attr name="id_attr" type="int" bits="16" default="1"/>
    <sphinx:attr name="organism_attr" type="int" bits="16" default="1"/>
    <sphinx:attr name="project_attr" type="int" bits="16" default="0"/>
    <sphinx:attr name="class_attr" type="int" bits="16" default="0"/>
    <sphinx:field name="entrez_id"/>
    <sphinx:field name="homologene_id"/>
    <sphinx:field name="subject"/>
    <sphinx:field name="tags"/>
    <sphinx:field name="content"/>
  </sphinx:schema>

  <sphinx:document id="2000001">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>310</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>1</entrez_id>
    <homologene_id>1</homologene_id>
    <subject>subject1</subject>
    <tags>tag1 with spaces</tags>
    <content>a</content>
  </sphinx:document>

  <sphinx:document id="2000002">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>350</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>10</entrez_id>
    <homologene_id>10</homologene_id>
    <subject>subject2</subject>
    <tags>tag1 with spaces</tags>
    <content>b</content>
  </sphinx:document>

  <sphinx:document id="2000003">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>343</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>0</entrez_id>
    <homologene_id>0</homologene_id>
    <subject>subject3</subject>
    <tags>tag2_underscore</tags>
    <content>c</content>
  </sphinx:document>

</sphinx:docset>
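
Step 1 requires the data file to be saved as UTF-8. One way to guarantee that is to generate the file from a script, as in the Python sketch below. The helper names `render_document` and `render_docset` are hypothetical, the schema block is passed in as ready-made text, and the output layout simply mirrors the file above.

```python
from xml.sax.saxutils import escape


def render_document(doc_id, fields):
    """Render one <sphinx:document> element; field values are XML-escaped."""
    lines = [f'  <sphinx:document id="{int(doc_id)}">']
    for name, value in fields.items():
        lines.append(f"    <{name}>{escape(str(value))}</{name}>")
    lines.append("  </sphinx:document>")
    return "\n".join(lines)


def render_docset(schema_xml, documents):
    """Render a whole xmlpipe2 docset from (doc_id, fields) pairs."""
    parts = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", schema_xml]
    parts += [render_document(doc_id, fields) for doc_id, fields in documents]
    parts.append("</sphinx:docset>")
    return "\n".join(parts) + "\n"


schema = '  <sphinx:schema>\n    <sphinx:field name="tags"/>\n  </sphinx:schema>'
docs = [
    (2000001, {"tags": "tag1 with spaces"}),
    (2000003, {"tags": "tag2_underscore"}),
]
xml_text = render_docset(schema, docs)
print(xml_text)
```

Writing the result with `open(path, "w", encoding="utf-8")` then fixes the on-disk encoding explicitly instead of relying on an editor's default.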
Tags No tags attached.

There are no notes attached to this issue.

Issue History
Date Modified Username Field Change
2010-06-23 20:24 roby1kenobi New Issue

