View Issue Details
ID: 0000757
Project: Sphinx
Category: general
View Status: public
Date Submitted: 2011-04-07 13:33
Last Update: 2011-05-06 11:11
Reporter: zenuch
Assigned To: Tomat
Priority: normal
Severity: major
Reproducibility: sometimes
Status: closed
Resolution: fixed
Platform:
OS:
OS Version:
Product Version: 1.11-dev
Target Version:
Fixed in Version: 2.0.1-beta
Summary: 0000757: Incorrect normalization of words in certain combinations
Description:
Wordforms:
???????????? > ????????????
????????????? > ????????????
????????????? > ????????????
???????????? > ????????????
???????????? > ????????????
???????????? > ????????????
???????????? > ????????????
????????????? > ????????????
???????????? > ????????????
???????????? > ????????????
???????????? > ????????????
???????????? > ????????????
???????????? > ????????????
???????????? > ????????????
???????????? > ????????????

charset_table = 0..9, A..Z->a..z, _, ., -, a..z, \
       U+410..U+42F->U+430..U+44F, U+430..U+44F, U+0401->U+0435, U+0451->U+0435

Note the last two charset_table entries: «Ё» (U+0401) and «ё» (U+0451) are both converted to «е» (U+0435).
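
To illustrate what the charset_table above asks for, here is a minimal Python sketch that applies the same mapping rules to a string. It is not Sphinx code: `build_fold_table` and `fold` are hypothetical helpers, and treating unlisted characters as separators is an assumption about how the tokenizer behaves.

```python
def build_fold_table():
    """Build a code point mapping that mirrors the charset_table in this report."""
    table = {}
    # 0..9 and a..z are kept as-is
    for cp in range(ord("0"), ord("9") + 1):
        table[cp] = cp
    for cp in range(ord("a"), ord("z") + 1):
        table[cp] = cp
    # _, ., - are kept as-is
    for ch in "_.-":
        table[ord(ch)] = ord(ch)
    # A..Z->a..z
    for cp in range(ord("A"), ord("Z") + 1):
        table[cp] = cp + 0x20
    # U+410..U+42F->U+430..U+44F (Cyrillic upper case to lower case)
    for cp in range(0x410, 0x430):
        table[cp] = cp + 0x20
    # U+430..U+44F (Cyrillic lower case kept as-is)
    for cp in range(0x430, 0x450):
        table[cp] = cp
    # U+0401->U+0435 and U+0451->U+0435 (both forms of "yo" become "e")
    table[0x0401] = 0x0435
    table[0x0451] = 0x0435
    return table


def fold(text, table):
    """Map every character through the table; unlisted characters split tokens (assumption)."""
    mapped = []
    for ch in text:
        target = table.get(ord(ch))
        mapped.append(chr(target) if target is not None else " ")
    return "".join(mapped).split()


if __name__ == "__main__":
    print(fold("Ёлка ёж", build_fold_table()))
```

With this table, a wordforms entry and a query term that differ only in «ё» versus «е» fold to the same token, so both sides of a wordforms pair need to pass through the same mapping to match.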

Query:
«????????????»

«BuildKeywords» returns: http://gyazo.com/00424291209df945e4734487c5c01d0c.png
Additional Information:
index normalized_test
{
    source = search_chunk0
    path = /data/sphinx/index/search/search_chunk0
    docinfo = extern
    mlock = 1
    morphology = stem_ru
    charset_type = utf-8
    html_strip = 1
    html_remove_elements= table, img
    min_word_len = 1
    stopwords = /data/sphinx/stopwords.txt
    wordforms = /data/sphinx/wordforms.txt
    #exceptions = /data/sphinx/exceptions.txt
    #blend_chars = @, /
    index_sp = 1
    charset_table = 0..9, A..Z->a..z, _, ., -, a..z, \
        U+410..U+42F->U+430..U+44F, U+430..U+44F, U+0401->U+0435, U+0451->U+0435
}
Tags: No tags attached.
Attached Files:
i757.conf (1,094 bytes) 2011-04-07 14:30
i757.sql (124 bytes) 2011-04-07 14:31
i757create.sql (260 bytes) 2011-04-07 14:31
i757res.txt (222 bytes) 2011-04-07 14:31
rt_wf.txt (802 bytes) 2011-04-07 14:35

-  Notes
(0001455)
zenuch (reporter)
2011-04-07 13:35

http://pastebin.com/j5sdzr4Z
(0001460)
Tomat (manager)
2011-04-07 14:34

I can't reproduce this issue. I've uploaded all the files I used when trying to reproduce it, and the result set looks correct.
Could you dump the header of your index via
./indextool -c YOUR_CONFIG.conf --dumpheader normalized_test
(0001461)
Tomat (manager)
2011-04-07 14:38

Could you also provide your Sphinx version, and tell me whether you use the id64 option and the expat library?
(0001463)
zenuch (reporter)
2011-04-07 15:09

Sphinx 1.11-id64-dev (r2642)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

dumping header file 'search_chunk11.sph'...
version: 24
idbits: 64
docinfo: extern
fields: 18
  field 0: document_title
  field 1: document_title_clean
  field 2: document_title_codex
  field 3: document_title_topfz
  field 4: document_title_topfz_appendix
  field 5: document_title_topfz_deactivate
  field 6: document_title_topfz_appendix_deactivate
  field 7: document_title_other_fz
  field 8: body
  field 9: body_codex
  field 10: title
  field 11: title_second_fuzzy
  field 12: title_second
  field 13: title_second_fuzzy_deactivate
  field 14: title_second_deactivate
  field 15: title_third
  field 16: title_last
  field 17: string_number
attrs: 27
  attr 0: is_deleted, boolean, bitoff 0
  attr 1: is_active, boolean, bitoff 1
  attr 2: type_id, uint, bits 32, bitoff 32
  attr 3: line_id, uint, bits 32, bitoff 64
  attr 4: last_revision_id, uint, bits 32, bitoff 96
  attr 5: revision_id, uint, bits 32, bitoff 128
  attr 6: entity, uint, bits 32, bitoff 160
  attr 7: document_id, uint, bits 32, bitoff 192
  attr 8: date_activate, timestamp, bitoff 224
  attr 9: category_dispute_id, uint, bits 32, bitoff 256
  attr 10: entity_count, uint, bits 32, bitoff 288
  attr 11: court_id, uint, bits 32, bitoff 320
  attr 12: is_jurisprudence, uint, bits 32, bitoff 352
  attr 13: is_reduced, boolean, bitoff 2
  attr 14: is_lower, boolean, bitoff 3
  attr 15: document_type_id, uint, bits 32, bitoff 384
  attr 16: date_moj, timestamp, bitoff 416
  attr 17: date_accept, timestamp, bitoff 448
  attr 18: date_title, timestamp, bitoff 480
  attr 19: number_moj, uint, bits 32, bitoff 512
  attr 20: document_weight, uint, bits 32, bitoff 544
  attr 21: type_weight, uint, bits 32, bitoff 576
  attr 22: upper_level_id, uint, bits 32, bitoff 608
  attr 23: vbucket_id, uint, bits 32, bitoff 640
  attr 24: department_id, mva, bitoff 672
  attr 25: subject_id, mva, bitoff 704
  attr 26: number, mva, bitoff 736
total-documents: 18850239
total-bytes: 14584053312
min-prefix-len: 0
min-infix-len: 0
exact-words: 0
html-strip: 1
html-index-attrs: (null)
html-remove-elements: table, img
zone prefix: (null)
tokenizer-type: 2
tokenizer-case-folding: 0..9, A..Z->a..z, _, ., -, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+0401->U+0435, U+0451->U+0435
tokenizer-min-word-len: 1
tokenizer-ngram-chars: (null)
tokenizer-ngram-len: 0
tokenizer-exceptions: (null)
tokenizer-phrase-boundary: (null)
tokenizer-ignore-chars: (null)
tokenizer-blend-chars: (null)
tokenizer-blend-mode: (null)
dictionary-morphology: stem_ru
dictionary-stopwords: /data/sphinx/stopwords.txt
dictionary-wordforms: /data/sphinx/wordforms.txt
min-stemming-len: 1
killlist-size: 0
(0001464)
zenuch (reporter)
2011-04-07 15:45

Checked on three computers: all of them have the same set of wordforms, but the results differ.
http://pastebin.com/Pr7YMW2p
The dumpheader output above is from a server that does not work correctly.
(0001468)
Tomat (manager)
2011-04-07 19:47

Could you provide your OS version and gcc version?
(0001470)
zenuch (reporter)
2011-04-08 07:44

http://pastebin.com/PKxkeS8a
(0001471)
Tomat (manager)
2011-04-08 09:03

It's unclear from your description: are these servers 32-bit or 64-bit?
(0001472)
zenuch (reporter)
2011-04-08 09:19

All servers are 64-bit
(0001474)
Tomat (manager)
2011-04-08 18:34

Could I get access to your box to reproduce this issue and try different revisions?
I've failed to reproduce it on my VM.
(0001481)
zenuch (reporter)
2011-04-11 09:03

I will send access details to Shodan
(0001493)
Tomat (manager)
2011-04-14 10:21

fixed 2776

Added a warning that all indexes which share the same wordforms file should have the same tokenizer settings.
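
The condition this warning describes can also be checked on the configuration side. The sketch below is not Sphinx code and does not parse sphinx.conf: it assumes the index settings are already available as plain dictionaries, `find_wordforms_conflicts` is a hypothetical helper, and the list of directives treated as tokenizer settings is an assumption.

```python
from collections import defaultdict

# Directives assumed to influence tokenization (an illustrative choice).
TOKENIZER_KEYS = (
    "charset_type",
    "charset_table",
    "min_word_len",
    "ngram_chars",
    "ngram_len",
    "ignore_chars",
    "blend_chars",
)


def normalize(value):
    """Drop all whitespace so purely cosmetic differences are ignored."""
    return "".join(str(value).split())


def find_wordforms_conflicts(indexes):
    """Return (wordforms_path, index_names) for each file shared by
    indexes whose tokenizer settings are not identical."""
    by_file = defaultdict(list)
    for name, settings in indexes.items():
        path = settings.get("wordforms")
        if path:
            by_file[path].append(name)

    conflicts = []
    for path, names in sorted(by_file.items()):
        signatures = {
            tuple(normalize(indexes[n].get(key, "")) for key in TOKENIZER_KEYS)
            for n in names
        }
        if len(signatures) > 1:
            conflicts.append((path, sorted(names)))
    return conflicts


if __name__ == "__main__":
    example = {
        "normalized_test": {
            "wordforms": "/data/sphinx/wordforms.txt",
            "charset_table": "0..9, A..Z->a..z, a..z, U+0451->U+0435",
        },
        "other_index": {  # hypothetical second index
            "wordforms": "/data/sphinx/wordforms.txt",
            "charset_table": "0..9, A..Z->a..z, a..z",
        },
    }
    for path, names in find_wordforms_conflicts(example):
        print("WARNING: %s is shared by %s with different tokenizer settings"
              % (path, ", ".join(names)))
```

A check like this could be run over every index definition on a server before indexing, which would help explain why the same wordforms file produces different results on different machines.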

- Issue History
Date Modified Username Field Change
2011-04-07 13:33 zenuch New Issue
2011-04-07 13:34 zenuch Issue Monitored: zenuch
2011-04-07 13:35 zenuch Note Added: 0001455
2011-04-07 14:30 Tomat Assigned To => Tomat
2011-04-07 14:30 Tomat Reproducibility always => unable to reproduce
2011-04-07 14:30 Tomat Status new => assigned
2011-04-07 14:30 Tomat File Added: i757.conf
2011-04-07 14:31 Tomat File Added: i757.sql
2011-04-07 14:31 Tomat File Added: i757create.sql
2011-04-07 14:31 Tomat File Added: i757res.txt
2011-04-07 14:34 Tomat Note Added: 0001460
2011-04-07 14:35 Tomat File Added: rt_wf.txt
2011-04-07 14:38 Tomat Note Added: 0001461
2011-04-07 15:09 zenuch Note Added: 0001463
2011-04-07 15:45 zenuch Note Added: 0001464
2011-04-07 19:47 Tomat Note Added: 0001468
2011-04-08 07:44 zenuch Note Added: 0001470
2011-04-08 09:03 Tomat Note Added: 0001471
2011-04-08 09:19 zenuch Note Added: 0001472
2011-04-08 18:34 Tomat Note Added: 0001474
2011-04-11 09:03 zenuch Note Added: 0001481
2011-04-14 10:21 Tomat Note Added: 0001493
2011-04-14 10:22 Tomat Reproducibility unable to reproduce => sometimes
2011-04-14 10:22 Tomat Status assigned => resolved
2011-04-14 10:22 Tomat Resolution open => fixed
2011-05-06 11:11 Tomat Status resolved => closed
2011-05-06 11:11 Tomat Fixed in Version => 2.0.1-beta

