| View Issue Details | ||||||
| ID | Project | Category | View Status | Date Submitted | Last Update | ||
| 0000757 | Sphinx | general | public | 2011-04-07 13:33 | 2011-05-06 11:11 | ||
| Reporter | zenuch | ||||||
| Assigned To | Tomat | ||||||
| Priority | normal | Severity | major | Reproducibility | sometimes | ||
| Status | closed | Resolution | fixed | ||||
| Platform | OS | OS Version | |||||
| Product Version | 1.11-dev | ||||||
| Target Version | Fixed in Version | 2.0.1-beta | |||||
| Summary | 0000757: Incorrect normalization of words in certain combinations | ||||||
| Description | Wordforms: ???????????? > ???????????? ????????????? > ???????????? ????????????? > ???????????? ???????????? > ???????????? ???????????? > ???????????? ???????????? > ???????????? ???????????? > ???????????? ????????????? > ???????????? ???????????? > ???????????? ???????????? > ???????????? ???????????? > ???????????? ???????????? > ???????????? ???????????? > ???????????? ???????????? > ???????????? ???????????? > ???????????? charset_table = 0..9, A..Z->a..z, _, ., -, a..z, \ U+410..U+42F->U+430..U+44F, U+430..U+44F, U+0401->U+0435, U+0451->U+0435 Note the last two rules: «Ё» and «ё» are converted to «е». Query: «????????????» «BuildKeywords» returns: http://gyazo.com/00424291209df945e4734487c5c01d0c.png | ||||||
| Additional Information | ||||||
index normalized_test
{
    source = search_chunk0
    path = /data/sphinx/index/search/search_chunk0
    docinfo = extern
    mlock = 1
    morphology = stem_ru
    charset_type = utf-8
    html_strip = 1
    html_remove_elements = table, img
    min_word_len = 1
    stopwords = /data/sphinx/stopwords.txt
    wordforms = /data/sphinx/wordforms.txt
    #exceptions = /data/sphinx/exceptions.txt
    #blend_chars = @, /
    index_sp = 1
    charset_table = 0..9, A..Z->a..z, _, ., -, a..z, \
        U+410..U+42F->U+430..U+44F, U+430..U+44F, U+0401->U+0435, U+0451->U+0435
}
| Tags | No tags attached. | ||||||
| Attached Files | |||||||
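For readers unfamiliar with `charset_table`, the following is a minimal Python sketch of how the folding rules quoted in the description can be read. It is not Sphinx's tokenizer: the function names (`parse_charset_table`, `tokenize`) are hypothetical, only the rule forms that appear in this report are handled, and it assumes that characters missing from the table act as separators.

```python
import re

# The charset_table from the report, with the line continuation joined.
CHARSET_TABLE = (
    "0..9, A..Z->a..z, _, ., -, a..z, "
    "U+410..U+42F->U+430..U+44F, U+430..U+44F, "
    "U+0401->U+0435, U+0451->U+0435"
)


def _code(token):
    """Return the code point for a 'U+0401' style token or a single character."""
    token = token.strip()
    if re.fullmatch(r"U\+[0-9A-Fa-f]+", token):
        return int(token[2:], 16)
    if len(token) != 1:
        raise ValueError(f"unsupported charset_table item: {token!r}")
    return ord(token)


def _span(text):
    """Parse 'a..z' or a single item into an inclusive (start, end) pair."""
    text = text.strip()
    if ".." in text:
        start, end = text.split("..", 1)
        return _code(start), _code(end)
    code = _code(text)
    return code, code


def parse_charset_table(spec):
    """Build a {source code point: folded code point} mapping (hypothetical helper)."""
    mapping = {}
    for rule in spec.split(","):
        rule = rule.strip()
        if not rule:
            continue
        src, _, dst = rule.partition("->")
        s0, s1 = _span(src)
        d0, d1 = _span(dst) if dst else (s0, s1)
        if s1 - s0 != d1 - d0:
            raise ValueError(f"range sizes differ in rule: {rule!r}")
        for offset in range(s1 - s0 + 1):
            mapping[s0 + offset] = d0 + offset
    return mapping


def tokenize(text, mapping):
    """Fold mapped characters; treat every unmapped character as a separator."""
    tokens, current = [], []
    for ch in text:
        folded = mapping.get(ord(ch))
        if folded is None:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(chr(folded))
    if current:
        tokens.append("".join(current))
    return tokens
```

With this table, `tokenize("Ёлка ёж", parse_charset_table(CHARSET_TABLE))` folds both «Ё» (U+0401) and «ё» (U+0451) to «е» (U+0435), which is the behaviour the reporter points out.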
Notes |
|
|
(0001455) zenuch (reporter) 2011-04-07 13:35 |
http://pastebin.com/j5sdzr4Z |
|
(0001460) Tomat (manager) 2011-04-07 14:34 |
I can't reproduce this issue. I've uploaded all the files I used while trying to reproduce it, and the result set looks correct. Could you dump the header of your index via ./indextool -c YOUR_CONFIG.conf --dumpheader normalized_test |
|
(0001461) Tomat (manager) 2011-04-07 14:38 |
Could you also provide your Sphinx version, and tell me whether you use the id64 option and the expat library? |
|
(0001463) zenuch (reporter) 2011-04-07 15:09 |
Sphinx 1.11-id64-dev (r2642)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)
dumping header file 'search_chunk11.sph'...
version: 24
idbits: 64
docinfo: extern
fields: 18
  field 0: document_title
  field 1: document_title_clean
  field 2: document_title_codex
  field 3: document_title_topfz
  field 4: document_title_topfz_appendix
  field 5: document_title_topfz_deactivate
  field 6: document_title_topfz_appendix_deactivate
  field 7: document_title_other_fz
  field 8: body
  field 9: body_codex
  field 10: title
  field 11: title_second_fuzzy
  field 12: title_second
  field 13: title_second_fuzzy_deactivate
  field 14: title_second_deactivate
  field 15: title_third
  field 16: title_last
  field 17: string_number
attrs: 27
  attr 0: is_deleted, boolean, bitoff 0
  attr 1: is_active, boolean, bitoff 1
  attr 2: type_id, uint, bits 32, bitoff 32
  attr 3: line_id, uint, bits 32, bitoff 64
  attr 4: last_revision_id, uint, bits 32, bitoff 96
  attr 5: revision_id, uint, bits 32, bitoff 128
  attr 6: entity, uint, bits 32, bitoff 160
  attr 7: document_id, uint, bits 32, bitoff 192
  attr 8: date_activate, timestamp, bitoff 224
  attr 9: category_dispute_id, uint, bits 32, bitoff 256
  attr 10: entity_count, uint, bits 32, bitoff 288
  attr 11: court_id, uint, bits 32, bitoff 320
  attr 12: is_jurisprudence, uint, bits 32, bitoff 352
  attr 13: is_reduced, boolean, bitoff 2
  attr 14: is_lower, boolean, bitoff 3
  attr 15: document_type_id, uint, bits 32, bitoff 384
  attr 16: date_moj, timestamp, bitoff 416
  attr 17: date_accept, timestamp, bitoff 448
  attr 18: date_title, timestamp, bitoff 480
  attr 19: number_moj, uint, bits 32, bitoff 512
  attr 20: document_weight, uint, bits 32, bitoff 544
  attr 21: type_weight, uint, bits 32, bitoff 576
  attr 22: upper_level_id, uint, bits 32, bitoff 608
  attr 23: vbucket_id, uint, bits 32, bitoff 640
  attr 24: department_id, mva, bitoff 672
  attr 25: subject_id, mva, bitoff 704
  attr 26: number, mva, bitoff 736
total-documents: 18850239
total-bytes: 14584053312
min-prefix-len: 0
min-infix-len: 0
exact-words: 0
html-strip: 1
html-index-attrs: (null)
html-remove-elements: table, img
zone prefix: (null)
tokenizer-type: 2
tokenizer-case-folding: 0..9, A..Z->a..z, _, ., -, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+0401->U+0435, U+0451->U+0435
tokenizer-min-word-len: 1
tokenizer-ngram-chars: (null)
tokenizer-ngram-len: 0
tokenizer-exceptions: (null)
tokenizer-phrase-boundary: (null)
tokenizer-ignore-chars: (null)
tokenizer-blend-chars: (null)
tokenizer-blend-mode: (null)
dictionary-morphology: stem_ru
dictionary-stopwords: /data/sphinx/stopwords.txt
dictionary-wordforms: /data/sphinx/wordforms.txt
min-stemming-len: 1
killlist-size: 0 |
|
(0001464) zenuch (reporter) 2011-04-07 15:45 |
I checked on three computers. The same set of wordforms is used everywhere, but the results differ: http://pastebin.com/Pr7YMW2p The dumpheader output above was taken from a server that works incorrectly. |
|
(0001468) Tomat (manager) 2011-04-07 19:47 |
Could you provide your OS version and gcc version? |
|
(0001470) zenuch (reporter) 2011-04-08 07:44 |
http://pastebin.com/PKxkeS8a |
|
(0001471) Tomat (manager) 2011-04-08 09:03 |
It's unclear from your description whether these servers are 32-bit or 64-bit. |
|
(0001472) zenuch (reporter) 2011-04-08 09:19 |
All servers are 64-bit |
|
(0001474) Tomat (manager) 2011-04-08 18:34 |
Could I get access to your box to reproduce this issue and try different revisions? I've failed to reproduce it on my VM. |
|
(0001481) zenuch (reporter) 2011-04-11 09:03 |
I will send access details to Shodan |
|
(0001493) Tomat (manager) 2011-04-14 10:21 |
Fixed in revision 2776. Added a warning that all indexes sharing the same wordforms file should have the same tokenizer settings. |
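The closing note implies a simple configuration check: indexes that point at the same `wordforms` file should agree on their tokenizer settings. Below is a rough Python sketch of such a check, run against the text of a config file. It is not the code added in the fix. The helper names, the choice of `TOKENIZER_KEYS`, and the simplified parsing are all assumptions: index inheritance is ignored, and only `key = value` lines with backslash continuations are handled.

```python
import re
from collections import defaultdict

# Settings treated as "tokenizer settings" here; this selection is an assumption.
TOKENIZER_KEYS = ("charset_type", "charset_table", "min_word_len",
                  "ignore_chars", "blend_chars", "ngram_chars")

# Matches "index name { ... }" and "index child : parent { ... }" blocks.
INDEX_BLOCK = re.compile(
    r"^\s*index\s+(\w+)(?:\s*:\s*\w+)?\s*\{(.*?)\}", re.S | re.M)


def parse_indexes(conf_text):
    """Return {index name: {setting: value}} from sphinx.conf-style text."""
    # Join lines that end with a backslash continuation.
    text = re.sub(r"\\[ \t]*\r?\n", " ", conf_text)
    indexes = {}
    for name, body in INDEX_BLOCK.findall(text):
        settings = {}
        for line in body.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and commented-out settings
            key, sep, value = line.partition("=")
            if sep:
                settings[key.strip()] = " ".join(value.split())
        indexes[name] = settings
    return indexes


def find_conflicts(indexes):
    """Return {wordforms path: [index names]} where tokenizer settings differ."""
    groups = defaultdict(list)
    for name, settings in indexes.items():
        path = settings.get("wordforms")
        if path:
            groups[path].append(name)
    conflicts = {}
    for path, names in groups.items():
        # Compare settings with all whitespace removed, so formatting is ignored.
        fingerprints = {
            tuple("".join(indexes[n].get(k, "").split()) for k in TOKENIZER_KEYS)
            for n in names
        }
        if len(fingerprints) > 1:
            conflicts[path] = sorted(names)
    return conflicts
```

Calling `find_conflicts(parse_indexes(open("sphinx.conf").read()))` on a real config would list each shared wordforms file whose indexes disagree, for example one index folding «ё» to «е» while another does not.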
Issue History |
|||
| Date Modified | Username | Field | Change |
| 2011-04-07 13:33 | zenuch | New Issue | |
| 2011-04-07 13:34 | zenuch | Issue Monitored: zenuch | |
| 2011-04-07 13:35 | zenuch | Note Added: 0001455 | |
| 2011-04-07 14:30 | Tomat | Assigned To | => Tomat |
| 2011-04-07 14:30 | Tomat | Reproducibility | always => unable to reproduce |
| 2011-04-07 14:30 | Tomat | Status | new => assigned |
| 2011-04-07 14:30 | Tomat | File Added: i757.conf | |
| 2011-04-07 14:31 | Tomat | File Added: i757.sql | |
| 2011-04-07 14:31 | Tomat | File Added: i757create.sql | |
| 2011-04-07 14:31 | Tomat | File Added: i757res.txt | |
| 2011-04-07 14:34 | Tomat | Note Added: 0001460 | |
| 2011-04-07 14:35 | Tomat | File Added: rt_wf.txt | |
| 2011-04-07 14:38 | Tomat | Note Added: 0001461 | |
| 2011-04-07 15:09 | zenuch | Note Added: 0001463 | |
| 2011-04-07 15:45 | zenuch | Note Added: 0001464 | |
| 2011-04-07 19:47 | Tomat | Note Added: 0001468 | |
| 2011-04-08 07:44 | zenuch | Note Added: 0001470 | |
| 2011-04-08 09:03 | Tomat | Note Added: 0001471 | |
| 2011-04-08 09:19 | zenuch | Note Added: 0001472 | |
| 2011-04-08 18:34 | Tomat | Note Added: 0001474 | |
| 2011-04-11 09:03 | zenuch | Note Added: 0001481 | |
| 2011-04-14 10:21 | Tomat | Note Added: 0001493 | |
| 2011-04-14 10:22 | Tomat | Reproducibility | unable to reproduce => sometimes |
| 2011-04-14 10:22 | Tomat | Status | assigned => resolved |
| 2011-04-14 10:22 | Tomat | Resolution | open => fixed |
| 2011-05-06 11:11 | Tomat | Status | resolved => closed |
| 2011-05-06 11:11 | Tomat | Fixed in Version | => 2.0.1-beta |