ID: 0000540
Project: Sphinx
Category: documentation
View Status: public
Date Submitted: 2010-06-23 20:24
Last Update: 2012-11-03 19:27
Reporter: roby1kenobi
Assigned To: kevg
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: fixed
Product Version: 0.9.9-rc2
Summary: 0000540: Documentation states space (U+20) is valid, but adding it to charset_table breaks searches

Description:
I am attempting to index multiple word tags as single phrases. However, adding the space character (U+20) to the charset_table setting breaks all searches, no matter the match mode.

Expected:
---------
According to the documentation (http://www.sphinxsearch.com/docs/manual-0.9.9.html#charsets) and numerous posts (e.g. http://www.sphinxsearch.com/forum/view.html?id=60), the space character (U+20) should be a valid charset_table entry. If the issue is not going to be addressed, the documentation should be updated.


Steps to Reproduce:
-------------------

1. Create an xmlpipe2 XML data source file that includes tags containing spaces, and ensure it is saved with UTF-8 character encoding.

2. Create sphinx.conf file with charset_table setting that includes space (U+20).

   charset_table = 0..9, A..Z->a..z, _, a..z, U+20

3. Run indexer to generate the index files based on the xmlpipe2 data source file.

   sudo indexer --config /local1/roby/sphinx/sphinx.conf --all

4. Run indexer with --buildstops to generate list of terms added to index.

   sudo indexer --config /local1/roby/sphinx/sphinx.conf test --buildstops words.txt 1000

5. Verify words.txt includes the expected terms.

6. Run search and attempt to find hits for the terms listed in words.txt and note that nothing works, even terms without spaces. For example:

        search tag2_underscore

7. Edit sphinx.conf, replacing space char (U+20) with hyphen (U+2D) in charset_table and index again.

        charset_table = 0..9, A..Z->a..z, _, a..z, U+2D

8. Run same search queries and note that they now work.

9. Compile debug build of sphinx:

   make distclean
   ./configure --with-debug --with-pgsql --without-mysql --prefix /local1/roby/sphinx-debug
   make
   sudo make install

10. Repeat steps 3 through 8 to confirm the results are consistent with the debug build.

Note: I also downloaded the Win32 binaries and reproduced the issue with them at home.
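
The check in step 5 can be scripted. The following is a minimal sketch, not part of the original report: it assumes words.txt contains one term per line (which is how I understand the --buildstops output to look when no frequencies are requested), and the helper name `missing_terms` is hypothetical.

```python
from pathlib import Path


def missing_terms(words_path, expected_terms):
    """Return the expected terms that do not appear in the word list.

    Assumption: the file holds one term per line. Inner spaces are kept,
    so a term such as "tag1 with spaces" is compared as a whole line.
    """
    lines = Path(words_path).read_text(encoding="utf-8").splitlines()
    found = {line for line in lines if line}
    return [term for term in expected_terms if term not in found]


if __name__ == "__main__":
    # The terms come from the test data in this report.
    expected = ["tag1 with spaces", "tag2_underscore"]
    if Path("words.txt").exists():
        print(missing_terms("words.txt", expected))
```

An empty list means every expected term was written to the word list, which is the situation described in step 5 before the searches in step 6 fail.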
Additional Information:

Sphinx.conf:
------------
source test_src
{
    type = xmlpipe2
    xmlpipe_command = cat /local1/roby/sphinx/sphinx-0.9.9/test.xml
    xmlpipe_fixup_utf8 = 1
}


index test
{
    source = test_src
    path = /local1/roby/sphinx/sphinx-0.9.9/var/test/data/test_src
    docinfo = extern
    mlock = 1
    morphology = none
    min_stemming_len = none
    min_prefix_len = 0
    enable_star = false
    charset_type = utf-8
    charset_table = 0..9, A..Z->a..z, _, a..z, U+20

    # use ^ (U+5E) as word delimiter
    #charset_table = 0..9, A..Z->a..z, _, a..z, U+20..U+29, U+2A..U+2F, U+3A..U+3F, U+40, U+5B..U+5D, U+5F, U+60, U+7B..U+7E
}



index dist_index
{
    type = distributed
    local = test
    agent_connect_timeout = 1000
}



indexer
{
    mem_limit = 256M
}





searchd
{
    listen = 127.0.0.1:9312
    log = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/
    query_log = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/query.log
    pid_file = /local1/roby/sphinx/sphinx-0.9.9/var/test/log/searchd.pid
    preopen_indexes = 1
    unlink_old = 1
    max_matches = 1000
}


XMLPipe2 Data File:
-------------------
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

  <sphinx:schema>
    <sphinx:attr name="id_attr" type="int" bits="16" default="1"/>
    <sphinx:attr name="organism_attr" type="int" bits="16" default="1"/>
    <sphinx:attr name="project_attr" type="int" bits="16" default="0"/>
    <sphinx:attr name="class_attr" type="int" bits="16" default="0"/>
    <sphinx:field name="entrez_id"/>
    <sphinx:field name="homologene_id"/>
    <sphinx:field name="subject"/>
    <sphinx:field name="tags"/>
    <sphinx:field name="content"/>
  </sphinx:schema>

  <sphinx:document id="2000001">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>310</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>1</entrez_id>
    <homologene_id>1</homologene_id>
    <subject>subject1</subject>
    <tags>tag1 with spaces</tags>
    <content>a</content>
  </sphinx:document>

  <sphinx:document id="2000002">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>350</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>10</entrez_id>
    <homologene_id>10</homologene_id>
    <subject>subject2</subject>
    <tags>tag1 with spaces</tags>
    <content>b</content>
  </sphinx:document>

  <sphinx:document id="2000003">
    <id_attr>1</id_attr>
    <organism_attr>1</organism_attr>
    <project_attr>343</project_attr>
    <class_attr>5</class_attr>
    <entrez_id>0</entrez_id>
    <homologene_id>0</homologene_id>
    <subject>subject3</subject>
    <tags>tag2_underscore</tags>
    <content>c</content>
  </sphinx:document>

</sphinx:docset>
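
A data file of this shape can also be generated programmatically, which makes it easier to guarantee UTF-8 output and proper escaping. This is a hedged sketch rather than the script used for this report: the function names `build_docset` and `write_docset` are hypothetical, and only the `tags` and `content` fields from the schema above are emitted to keep it short.

```python
from xml.sax.saxutils import escape


def build_docset(docs):
    """Build a reduced xmlpipe2 docset string.

    docs is an iterable of (doc_id, tags, content) tuples. Only two of the
    fields from the report's schema are written; attributes are omitted.
    """
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "",
        "  <sphinx:schema>",
        '    <sphinx:field name="tags"/>',
        '    <sphinx:field name="content"/>',
        "  </sphinx:schema>",
        "",
    ]
    for doc_id, tags, content in docs:
        parts += [
            f'  <sphinx:document id="{int(doc_id)}">',
            # escape() protects &, < and > inside field text.
            f"    <tags>{escape(tags)}</tags>",
            f"    <content>{escape(content)}</content>",
            "  </sphinx:document>",
            "",
        ]
    parts.append("</sphinx:docset>")
    return "\n".join(parts) + "\n"


def write_docset(path, docs):
    """Write the docset to disk, explicitly encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(build_docset(docs))


if __name__ == "__main__":
    sample = [
        (2000001, "tag1 with spaces", "a"),
        (2000003, "tag2_underscore", "c"),
    ]
    print(build_docset(sample))
```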
Tags: No tags attached.

Notes
(0000972)
MoneySack (reporter)
2010-11-25 12:28

Same bug on my installation.

When the index is configured with U+20 in charset_table and queried with the search shell utility, it looks like search prepends or appends (depending on the query mode) a whitespace character to the query, which results in no matches.

For example:
# search -i players_test -e \@\(name\)*somename*
Output:
using config file '/usr/local/etc/sphinx.conf'...
index 'players_test': query '@(name)*somename* ': returned 0 matches of 0 total in 0.000 sec

words:
1. '*somename* ': 0 documents, 0 hits

Note the whitespace after the trailing asterisk.

On an index without U+20 in charset_table, the output looks like this:

using config file '/usr/local/etc/sphinx.conf'...
index 'players': query '@(name)*somename*': returned 0 matches of 0 total in 0.000 sec

words:
1. '*somename*': 0 documents, 0 hits

There is no whitespace in the query string after the asterisk.

I hope this bug will be fixed soon.
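
To illustrate why a stray space matters once U+20 counts as a word character, here is a toy model in Python. It is only a sketch under simplifying assumptions and is not Sphinx's tokenizer: the function `tokenize` and the two character sets are hypothetical, and case folding stands in for the `A..Z->a..z` mapping.

```python
def tokenize(text, word_chars):
    """Split text into runs of characters that belong to word_chars.

    Characters outside word_chars act as separators, loosely mimicking
    what a charset table is meant to control.
    """
    tokens, current = [], []
    for ch in text.lower():
        if ch in word_chars:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Rough equivalent of "0..9, A..Z->a..z, _, a..z".
BASE = set("0123456789abcdefghijklmnopqrstuvwxyz_")
# The same set with the space character (U+20) added.
WITH_SPACE = BASE | {" "}

print(tokenize("tag1 with spaces", BASE))        # ['tag1', 'with', 'spaces']
print(tokenize("tag1 with spaces", WITH_SPACE))  # ['tag1 with spaces']
print(tokenize("somename ", WITH_SPACE))         # ['somename ']
```

In this model, a query term that picks up a trailing space becomes `'somename '`, which is a different keyword from an indexed `'somename'`, so an exact lookup finds nothing. That matches the symptom shown in the output above, although the real cause inside the search utility is not confirmed here.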
(0001209)
akremer (reporter)
2011-02-21 11:32

I can confirm this is still an issue in 1.10b. I would love a fix for this as well.
(0001255)
Tomat (manager)
2011-03-09 12:05

I've just tried this issue's config and data against both 1.10b and 1.11 (trunk), and it indexes and returns correct matches via SphinxQL and the API (I queried for 'tag1 with spaces' and 'tag2_underscore').
It is not clear to me: is this issue related to charset_table or to the search utility?
(0003452)
kevg (manager)
2012-11-03 19:27

U+20 is no longer valid in charset_table. It now produces this warning:
WARNING: wrong character mapping start specified: U+20, should be between U+21 and U+2ffff (inclusive); CLAMPED
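
The range quoted in the warning can be checked against a charset_table value before indexing. The following is a simplified sketch, not Sphinx's own parser: the names `parse_codepoint` and `invalid_starts` are hypothetical, only the start of each source range is examined, and only plain characters and `U+XXXX` tokens are understood.

```python
import re

# Bounds taken from the warning text: U+21 to U+2ffff, inclusive.
MIN_CP, MAX_CP = 0x21, 0x2FFFF


def parse_codepoint(token):
    """Convert a single character or a U+XXXX token into a code point."""
    token = token.strip()
    match = re.fullmatch(r"[Uu]\+([0-9A-Fa-f]+)", token)
    if match:
        return int(match.group(1), 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError(f"unrecognized character token: {token!r}")


def invalid_starts(charset_table):
    """Return range-start code points that fall outside the allowed bounds."""
    bad = []
    for entry in charset_table.split(","):
        source = entry.split("->")[0]  # left side of a mapping such as A..Z->a..z
        start = source.split("..")[0]  # first character of a range such as 0..9
        cp = parse_codepoint(start)
        if not MIN_CP <= cp <= MAX_CP:
            bad.append(cp)
    return bad


print([hex(cp) for cp in invalid_starts("0..9, A..Z->a..z, _, a..z, U+20")])  # ['0x20']
print(invalid_starts("0..9, A..Z->a..z, _, a..z, U+2D"))                      # []
```

With the reporter's original setting the space character is flagged, while the hyphen variant from step 7 passes.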

Issue History
Date Modified     Username     Field                       Change
2010-06-23 20:24  roby1kenobi  New Issue
2010-11-25 12:28  MoneySack    Note Added: 0000972
2010-11-25 12:28  MoneySack    Issue Monitored: MoneySack
2011-02-21 11:32  akremer      Note Added: 0001209
2011-03-09 11:27  kevg         Status                      new => confirmed
2011-03-09 12:05  Tomat        Note Added: 0001255
2012-11-03 19:27  kevg         Note Added: 0003452
2012-11-03 19:27  kevg         Status                      confirmed => closed
2012-11-03 19:27  kevg         Assigned To                 => kevg
2012-11-03 19:27  kevg         Resolution                  open => fixed

