Internationalizing city address search: implementing Soundex for Russian in Sphinx Search

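How many foreign tourists visit your city? Mine gets few, but as a rule they end up lost in the middle of a street, repeating a single word (a name). Passers-by try to explain with gestures where to go, and when it comes down to "me no understand you", they take the tourist by the hand and lead them to the destination. Surprisingly, the target is usually within a five-minute walk. So these tourists did have some rough idea of the city; perhaps they were navigating with a paper map.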





How often have you found yourself in this situation in an unfamiliar city in another country?





Smartphones and navigation apps solved many of these problems. Hooray: you can see your geolocation, find where you need to go, estimate the direction, and even plot a route.





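Only one problem remains: every street in the app is labeled in local hieroglyphs, in the local dialect. It helps when the host country uses the Latin alphabet, for which every smartphone has a keyboard, and even then I felt uncomfortable with the diacritics of the Czech alphabet. I can only imagine the pain and suffering of foreigners looking at Cyrillic. Look at some faux Cyrillic and you will understand. In their place, I would type names and addresses in Latin letters, trying to reproduce the sound: a phonetic search.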





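This article describes how to implement the Soundex phonetic search algorithm in the Sphinx search engine. Transliteration alone will not get the job done here. The resulting configuration file is available as a GitHub Gist.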





Introduction

, , -, , , Sphinx Search.





, , , .. , - Sphinx.





, , , , , . , , .





, . Soundex Metaphone, . Soundex , Metaphone .





, Sphinx Soundex, , . , , . .. . .





. , : « » – , , « », , . , , , , , .





, Soundex, , , NYSIIS, Daitch-Mokotoff.





The examples below are run through the SphinxQL interface, connecting with the MySQL client:





mysql -h 127.0.0.1 -P 9306 --default-character-set=utf8







Sphinx, , Sphinx Search, , , . .





Soundex

. , Sphinx Search, , , .. .





, : , – . .





– , Sphinx .





, , , , , : . – , - , , – . " ", . , , , .





Sphinx :





regexp_filter = (|) => a











regexp_filter = (|) =>







, – , GitHub Gist.





soundex :





morphology = soundex







, , Sphinx Soundex.





, , Sphinx. -. - , , . . «», «», - , «Lenina», «ulitsa Lenina».





CALL KEYWORDS:





mysql> call keywords('  Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | lenin     | l500       |
| 2    | lenina    | l500       |
| 3    | lenina    | l500       |
| 4    | lennina   | l500       |
| 5    | lenin     | l500       |
+------+-----------+------------+
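
The normalized column can be approximated outside Sphinx, which helps when checking which spellings collapse to the same key. The following Python sketch is inferred only from the outputs shown in this article, not from the Sphinx source. It assumes the classic Soundex letter groups, assumes that repeated codes are collapsed even when vowels separate them, and assumes that the key is zero-padded to four characters but never truncated. The function name `soundex_like` is hypothetical.

```python
# Classic Soundex letter groups; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex_like(word: str) -> str:
    """Approximate the key seen in the normalized column (assumption-based)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]                  # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        code = CODES.get(ch)
        if code is None:
            continue                    # uncoded letters do not reset prev
        if code != prev:
            out.append(code)
        prev = code
    return "".join(out).ljust(4, "0")   # pad short keys, never truncate


for w in ("Lenin", "Lenina", "Lennina"):
    print(w, soundex_like(w))
```

All three Latin spellings in the query above map to the same key under these assumptions, which is why a slightly misspelled name still finds the street.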
      
      



, tokenized , . normalized, Sphinx , , morphology. 'Lenina' l500, '' l500, , - , . Lennina, Lenena, Lennona. , , .





, :





mysql> select * from STREETS where match('Lenena'); 
+------+--------------------------------------+-----------+--------------+
| id   | aoguid                               | shortname | offname      |
+------+--------------------------------------+-----------+--------------+
|  387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 |         |        |
+------+--------------------------------------+-----------+--------------+
      
      



Sphinx , . . , :





mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+----------------+------------+
| qpos | tokenized      | normalized |
+------+----------------+------------+
| 1    | plekhanovskaja | p42512     |
| 2    | plechanovskaya | p42512     |
| 3    | plehanovskaja  | p4512      |
| 4    | plekhanovska   | p42512     |
+------+----------------+------------+
      
      



plehanovskaja -



. Sphinx . , CALL QSUGGEST:





mysql> CALL QSUGGEST('Plehanovskaja', 'STREETS');
+----------------+----------+------+
| suggest        | distance | docs |
+----------------+----------+------+
| plekhanovskaja | 1        | 1    |
| petrovskaja    | 4        | 1    |
+----------------+----------+------+
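
The distance column is consistent with plain Levenshtein edit distance between the query and each suggestion. As a minimal sketch, assuming that this is indeed the metric, the values in the table can be checked with a few lines of Python. The function name `levenshtein` is only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute, free on a match
            ))
        prev = cur
    return prev[-1]


print(levenshtein("plehanovskaja", "plekhanovskaja"))  # 1
print(levenshtein("plehanovskaja", "petrovskaja"))     # 4
```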
      
      



, , . .. .





, :





min_infix_len = 2







 suggest tokenized, .. , . , Soundex , QSUGGEST . 





- :





mysql> select * from STREETS where match('30 let Pobedy');
+------+--------------------------------------+-----------+------------------------+
| id   | aoguid                               | shortname | offname                |
+------+--------------------------------------+-----------+------------------------+
|  677 | 87234d80-4098-40c0-adb2-fc83ef237a5f |         | 30            |
+------+--------------------------------------+-----------+------------------------+

mysql> select * from STREETS where match('30  ');
+------+--------------------------------------+-----------+------------------------+
| id   | aoguid                               | shortname | offname                |
+------+--------------------------------------+-----------+------------------------+
|  677 | 87234d80-4098-40c0-adb2-fc83ef237a5f |         | 30            |
+------+--------------------------------------+-----------+------------------------+
      
      



, .





: . , , Soundex.





Soundex

. , , , .





.





Sphinx index



, , , . , Sphinx , . .. , regexp_filter



, regexp_filter



.





morphology = soundex



– , . , .





Sphinx , , ! . RE2.





, : regexp_filter = \A(A|a) => a







, 0.





regexp_filter = \B(A|a) => 0
regexp_filter = \B(Y|y) => 0
...
      
      



, regexp_filter = \B(Y|y) =>







, - . , «» «Veelkaseem» .





mysql> call keywords(' Veelkaseem', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | v738      | v738       |
| 2    | v738      | v738       |
+------+-----------+------------+
      
      



- :





mysql> call keywords(' Veelkaseem', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | v738      | v738       |
| 2    | v0730308  | v0730308   |
+------+-----------+------------+
      
      



, H W .





, , /, H W, . .





regexp_filter = 0+ => 0
regexp_filter = 1+ => 1
...
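
The filter chain in this section can be emulated outside Sphinx to see which keys it produces. The Python sketch below is a reconstruction under stated assumptions. The letter groups are those of the commonly published Refined Soundex table, chosen because they reproduce the keys in this section's CALL KEYWORDS output. Python's `re` module stands in for RE2, and the names `build_filters`, `refined_key` and `keep_vowels` are hypothetical.

```python
import re

# Assumed letter groups (Refined Soundex table); not taken from the Gist.
REFINED_GROUPS = {
    "bp": "1", "fv": "2", "cks": "3", "gj": "4", "qxz": "5",
    "dt": "6", "l": "7", "mn": "8", "r": "9",
}


def build_filters(keep_vowels=False):
    """Ordered (pattern, replacement) pairs, one per regexp_filter line."""
    filters = [(r"\B[hw]", "")]                      # drop non-initial H and W
    # Non-initial vowels become 0, or vanish in the final variant.
    filters.append((r"\B[aeiouy]", "0" if keep_vowels else ""))
    for letters, digit in REFINED_GROUPS.items():
        filters.append((rf"\B[{letters}]", digit))   # non-initial consonants
    filters.append((r"(\d)\1+", r"\1"))              # collapse repeated digits
    return filters


def refined_key(text, keep_vowels=False):
    result = text.lower()
    for pattern, replacement in build_filters(keep_vowels):
        result = re.sub(pattern, replacement, result)
    return result


print(refined_key("Veelkaseem", keep_vowels=True))   # v0730308
print(refined_key("Veelkaseem"))                     # v738
print(refined_key("Lennina"))                        # l8
```

Note that `\B` keeps the first letter of every word untouched, so multi-word queries are handled word by word. The digit-collapsing step also applies to digits that were already in the text, a side effect that the `0+ => 0` style filters share.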
      
      



:





mysql> call keywords('  Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | l8        | l8         |
| 2    | l8        | l8         |
| 3    | l8        | l8         |
| 4    | l8        | l8         |
| 5    | l8        | l8         |
+------+-----------+------------+

mysql> select * from STREETS where match('Lenina');
+------+--------------------------------------+-----------+--------------+
| id   | aoguid                               | shortname | offname      |
+------+--------------------------------------+-----------+--------------+
|  387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 |         |        |
+------+--------------------------------------+-----------+--------------+
      
      



, . , tokenized , soundex-. QSUGGEST . - , – . ngram_chars. .





:





mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | p738234   | p738234    |
| 2    | p73823    | p73823     |
| 3    | p78234    | p78234     |
| 4    | p73823    | p73823     |
+------+-----------+------------+
      
      



, , QSUGGEST :





mysql> CALL QSUGGEST('Plehanovskaja', 'STREETS');
Empty set (0.00 sec)

mysql> CALL QSUGGEST('p73823', 'STREETS');
Empty set (0.00 sec)

mysql> CALL QSUGGEST('p78234', 'STREETS');
Empty set (0.00 sec)
      
      



, , , . , , . . , «30 »:





mysql> call keywords('30 let Podedy', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | 30        | 30         |
| 2    | l6        | l6         |
| 3    | p6        | p6         |
+------+-----------+------------+

mysql> select * from STREETS where match('30 let Pobedy');
+------+--------------------------------------+-----------+------------------------+
| id   | aoguid                               | shortname | offname                |
+------+--------------------------------------+-----------+------------------------+
|  677 | 87234d80-4098-40c0-adb2-fc83ef237a5f |         | 30            |
+------+--------------------------------------+-----------+------------------------+
      
      



:





mysql> select * from STREETS where match('');
+------+--------------------------------------+--------------+----------------------+
| id   | aoguid                               | shortname    | offname              |
+------+--------------------------------------+--------------+----------------------+
|  873 | abdb0221-bfe8-4cf8-9217-0ed40b2f6f10 |        | 30           |
| 1208 | f1127b16-8a8e-4520-b1eb-6932654abdcd |            | 50           |
+------+--------------------------------------+--------------+----------------------+
      
      



, , , .





NYSIIS

. «» - . «» , , - , .





(?i) .





, . :









  1. regexp_filter = (?i)\b(mac) => mcc











  2. regexp_filter = (?i)(ee)\b => y







  3. : H, W





    regexp_filter = (?i)(a|e|i|o|u|y)h => \1







    regexp_filter = (?i)(a|e|i|o|u|y)w => \1a











  4. regexp_filter = (?i)\B(e|i|o|u) => a







    regexp_filter = (?i)\B(q) => g







  5. S





    regexp_filter = (?i)s\b =>







  6. AY Y





  7. A
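
The rules listed above can be tried out as an ordered chain of substitutions. The Python sketch below is a partial reconstruction, not the author's configuration. The first rules copy the filters shown above. The `k => c` rule and the readings of items 6 and 7 (a trailing AY becomes Y, a trailing A is dropped) are assumptions taken from the standard NYSIIS description. Python's `re` stands in for RE2, and input is lowercased instead of using `(?i)`. The name `nysiis_like` is hypothetical.

```python
import re

FILTERS = [
    (r"\bmac", "mcc"),            # 1. MAC prefix
    (r"ee\b", "y"),               # 2. trailing EE
    (r"([aeiouy])h", r"\1"),      # 3. H after a vowel is dropped
    (r"([aeiouy])w", r"\1a"),     #    W after a vowel becomes A
    (r"\B[eiou]", "a"),           # 4. non-initial vowels become A
    (r"\Bq", "g"),
    (r"\Bk", "c"),                #    assumption: standard NYSIIS K -> C
    (r"s\b", ""),                 # 5. trailing S is dropped
    (r"\Bay\b", "y"),             # 6. assumption: trailing AY -> Y
    (r"\Ba\b", ""),               # 7. assumption: trailing A is dropped
]


def nysiis_like(text: str) -> str:
    """Apply the substitutions in order, as a regexp_filter chain would."""
    result = text.lower()
    for pattern, replacement in FILTERS:
        result = re.sub(pattern, replacement, result)
    return result


for w in ("Lenina", "Lennina", "Plehanovskaja"):
    print(w, nysiis_like(w))
```

Unlike the Soundex variants, the key stays a readable word, which is what makes it usable as input for CALL QSUGGEST.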





, , !!!





, - , , , CALL QSUGGEST.





:





mysql> call keywords('  Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | lanan     | lanan      |
| 2    | lanan     | lanan      |
| 3    | lanan     | lanan      |
| 4    | lannan    | lannan     |
| 5    | lanan     | lanan      |
+------+-----------+------------+

mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+---------------+---------------+
| qpos | tokenized     | normalized    |
+------+---------------+---------------+
| 1    | plachanavscaj | plachanavscaj |
| 2    | plachanavscay | plachanavscay |
| 3    | plaanavscaj   | plaanavscaj   |
| 4    | plachanavsc   | plachanavsc   |
+------+---------------+---------------+
      
      



, CALL QSUGGEST Plehanovskaja, plaanavscaj:





mysql> CALL QSUGGEST('plaanavscaj', 'STREETS');
+---------------+----------+------+
| suggest       | distance | docs |
+---------------+----------+------+
| paanarscaj    | 2        | 1    |
| plachanavscaj | 2        | 1    |
| latavscaj     | 3        | 1    |
| sladcavscaj   | 3        | 1    |
| pacravscaj    | 3        | 1    |
+---------------+----------+------+
      
      



. - .





paanarscaj





plachanavscaj





latavscaj





sladcavscaj





pacravscaj





- , . - . , . , , .





Daitch-Mokotoff Soundex

, , Soundex.





. , « », , , - , , - .





, .





.





, .. :









  • regexp_filter = (?i)\b(au) => 0











  • regexp_filter = (?i)(a|e|i|o|u|y)(au) => \17







  • , \B ,





    regexp_filter = (?i)au =>







– - :





regexp_filter = (?i)j => 1







:





mysql> call keywords('  Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | 866       | 866        |
| 2    | 866       | 866        |
| 3    | 866       | 866        |
| 4    | 8666      | 8666       |
| 5    | 866       | 866        |
+------+-----------+------------+

mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | 7856745   | 7856745    |
| 2    | 7856745   | 7856745    |
| 3    | 786745    | 786745     |
| 4    | 7856745   | 7856745    |
+------+-----------+------------+
      
      



, QSUGGEST . .





mysql> select * from STREETS where match('Veelkaseem'); show meta;
+------+--------------------------------------+--------------+----------------------+
| id   | aoguid                               | shortname    | offname              |
+------+--------------------------------------+--------------+----------------------+
|  873 | abdb0221-bfe8-4cf8-9217-0ed40b2f6f10 |        | 30           |
| 1208 | f1127b16-8a8e-4520-b1eb-6932654abdcd |            | 50           |
+------+--------------------------------------+--------------+----------------------+
2 rows in set (0.00 sec)
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 2     |
| total_found   | 2     |
| time          | 0.000 |
| keyword[0]    | 78546 |
| docs[0]       | 2     |
| hits[0]       | 2     |
+---------------+-------+
      
      



, , - .





Soundex, , Soundex NYSIIS, CALL QSUGGEST, Sphinx , NYSIIS -. Soundex Daitch-Mokotoff Soundex, , , , 1286 , , - . :





mysql> call keywords(' ', 'STREETS', 0);
+------+------------+------------+
| qpos | tokenized  | normalized |
+------+------------+------------+
| 1    | vorovskogo | v612       |
| 2    | verbovaja  | v612       |
+------+------------+------------+
      
      



Soundex, :





mysql> call keywords(' ', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | v9234     | v9234      |
| 2    | v9124     | v9124      |
+------+-----------+------------+
      
      



, . , Soundex:





mysql> select * from STREETS where match('');
+------+--------------------------------------+-----------+--------------------------+
| id   | aoguid                               | shortname | offname                  |
+------+--------------------------------------+-----------+--------------------------+
|   12 | 0278d3ee-4e17-4347-b128-33f8f62c59e0 |         |              |
+------+--------------------------------------+-----------+--------------------------+
      
      



.





QSUGGEST, . , . , – .





, , : Soundex . - , , - , , Sphinx.





, , , Soundex Daitch-Mokotoff - , . NYSIIS , , , .





sphinx-3.3.1, 2.1.1-beta, . Manticore. Manticore Search, . , , .





, . , .





P.S.

, . Metaphone . , , . :





  1. -









  2. ????





  3. PROFIT







