👩🏾‍💻 🏆 💆 ANYKSスペルチェッカー 🌭 🍘 👇🏼

こんにちは、これはHabréに関する私の3番目の記事です。以前、ALM言語モデルに関する記事を書きました。ここで、ASCタイプミス修正システム（ALMに基づいて実装）を紹介します。

はい、タイプミスを修正するためのシステムは多数あります。それらにはすべて独自の長所と短所があります。オープンシステムから、最も有望なJamSpellの1つを選び出すことができ、それと比較します。DeepPavlovからも同様のシステムがあり、多くの人が考えているかもしれませんが、私はそれと友達になったことがありません。

機能リスト：

最大4レベンシュテイン距離の違いがある単語の間違いの修正。
文字の単語（挿入、削除、置換、再配置）のタイプミスの修正。
の品名は、コンテキストを与えられました。
文脈を考慮して、（適切な名前とタイトル）の単語の最初の文字の大文字と小文字を区別します。
コンテキストを考慮して、結合された単語を別々の単語に分割します。
元のテキストを修正せずにテキスト分析を実行します。
テキストでプレゼンス（エラー、タイプミス、誤ったコンテキスト）を検索します。

サポートされているオペレーティングシステム：

MacOS X
FreeBSD
Linux

システムはC ++ 11で記述されており、Python3用のポートがあります。

準備ができた辞書

名前	サイズ（GB）	RAM（GB）	サイズNグラム	言語
wittenbell-3-big.asc	1.97	15.6	3	RU
wittenbell-3-middle.asc	1.24	9.7	3	RU
mkneserney-3-middle.asc	1.33	9.7	3	RU
wittenbell-3-single.asc	0.772	5.14	3	RU
wittenbell-5-single.asc	1.37	10.7	五	RU

テスト

2016 Dialog21の「タイプミス修正」 コンペティションのデータを使用して、システムをテストしました。訓練されたバイナリ辞書がテストに使用されました： wittenbell-3-middle.asc

テスト実施	精度	想起	FMeasure
タイプミス修正モード	76.97	62.71	69.11
エラー修正モード	73.72	60.53	66.48

他のデータを追加する必要はないと思います。必要に応じて、誰でもテストを繰り返すことができます。以下のテストで使用したすべての資料を添付します。

テストで使用される材料

test.txt- テストするテキスト
correct.txt- 正しいバリアントのテキスト
Evaluation.py- 修正結果を計算するためのPython3スクリプト

ここで、同じ条件でタイプミス自体を修正するためのシステムの動作を比較するのは興味深いことです。同じテキストデータで2つの異なるタイプミスをトレーニングし、テストを実行します。

比較のために、前述のタイプミス修正システムであるJamSpellを見てみましょう。

ASCとJamSpell

インストール

ASC

$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc

$ mkdir ./build
$ cd ./build

$ cmake ..
$ make

JamSpell

$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell

$ mkdir ./build
$ cd ./build

$ cmake ..
$ make

トレーニング

ASC

train.json

{
  "ext": "txt",
  "size": 3,
  "alter": {"":""},
  "debug": 1,
  "threads": 0,
  "method": "train",
  "allow-unk": true,
  "reset-unk": true,
  "confidence": true,
  "interpolate": true,
  "mixed-dicts": true,
  "only-token-words": true,
  "locale": "en_US.UTF-8",
  "smoothing": "wittenbell",
  "pilots": ["","","","","","","","","","","a","i","o","e","g"],
  "corpus": "./texts/correct.txt",
  "w-bin": "./dictionary/3-middle.asc",
  "w-vocab": "./train/lm.vocab",
  "w-arpa": "./train/lm.arpa",
  "mix-restwords": "./similars/letters.txt",
  "alphabet": "abcdefghijklmnopqrstuvwxyz",
  "bin-code": "ru",
  "bin-name": "Russian",
  "bin-author": "You name",
  "bin-copyright": "You company LLC",
  "bin-contacts": "site: https://example.com, e-mail: info@example.com",
  "bin-lictype": "MIT",
  "bin-lictext": "... License text ...",
  "embedding-size": 28,
  "embedding": {
      "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
      "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
      "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
      "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
      "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
      "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
      "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
      "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
      "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
      "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
      "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
      "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
      "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
      "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
  }
}

$ ./asc -r-json ./train.json

Python3

import asc

asc.setSize(3)
asc.setAlmV2()
asc.setThreads(0)
asc.setLocale("en_US.UTF-8")

asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.allowUnk)
asc.setOption(asc.options_t.resetUnk)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.tokenWords)
asc.setOption(asc.options_t.confidence)
asc.setOption(asc.options_t.interpolate)

asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")

asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

def statusArpa1(status):
    print("Build arpa", status)

def statusArpa2(status):
    print("Write arpa", status)

def statusVocab(status):
    print("Write vocab", status)

def statusIndex(text, status):
    print(text, status)

def status(text, status):
    print(text, status)

asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)

asc.buildArpa(statusArpa1)

asc.writeArpa("./train/lm.arpa", statusArpa2)

asc.writeVocab("./train/lm.vocab", statusVocab)

asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("You name")
asc.setCopyright("You company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")

asc.setEmbedding({
     "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
     "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
     "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
     "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
     "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
     "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
     "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
     "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
     "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
     "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
     "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
     "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
     "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
     "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)

JamSpell

$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin

テスト

ASC

spell.json

{
    "debug": 1,
    "threads": 0,
    "method": "spell",
    "spell-verbose": true,
    "confidence": true,
    "mixed-dicts": true,
    "asc-split": true,
    "asc-alter": true,
    "asc-esplit": true,
    "asc-rsplit": true,
    "asc-uppers": true,
    "asc-hyphen": true,
    "asc-wordrep": true,
    "r-text": "./texts/test.txt",
    "w-text": "./texts/output.txt",
    "r-bin": "./dictionary/3-middle.asc"
}

$ ./asc -r-json ./spell.json

Python3

import asc

asc.setAlmV2()
asc.setThreads(0)

asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.confidence)

def status(text, status):
    print(text, status)

asc.loadIndex("./dictionary/3-middle.asc", "", status)

f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()

JamSpell

- Python , C++

#include <fstream>
#include <iostream>
#include <jamspell/spell_corrector.hpp>

//   BOOST
#ifdef USE_BOOST_CONVERT
	#include <boost/locale/encoding_utf.hpp>
//     
#else
	#include <codecvt>
#endif

using namespace std;

/**
 * convert    utf-8  
 * @param  str  utf-8  
 * @return      
 */
const string convert(const wstring & str){
	//   
	string result = "";
	//   
	if(!str.empty()){
//   BOOST
#ifdef USE_BOOST_CONVERT
		//  
		using boost::locale::conv::utf_to_utf;
		//    utf-8 
		result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
//     
#else
		//     UTF-8
		using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
		//  
		wstring_convert <convert_type, wchar_t> conv;
		// wstring_convert <codecvt_utf8 <wchar_t>> conv;
		//    utf-8 
		result = conv.to_bytes(str);
#endif
	}
	//  
	return result;
}

/**
 * convert      utf-8
 * @param  str   
 * @return       utf-8
 */
const wstring convert(const string & str){
	//   
	wstring result = L"";
	//   
	if(!str.empty()){
//   BOOST
#ifdef USE_BOOST_CONVERT
		//  
		using boost::locale::conv::utf_to_utf;
		//    utf-8 
		result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
//     
#else
		//  
		// wstring_convert <codecvt_utf8 <wchar_t>> conv;
		wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
		//    utf-8 
		result = conv.from_bytes(str);
#endif
	}
	//  
	return result;
}

/**
 * safeGetline     
 * @param  is  
 * @param  t     
 * @return     
 */
istream & safeGetline(istream & is, string & t){
	//  
	t.clear();

	istream::sentry se(is, true);
	streambuf * sb = is.rdbuf();

	for(;;){
		int c = sb->sbumpc();
		switch(c){
 			case '\n': return is;
			case '\r':
				if(sb->sgetc() == '\n') sb->sbumpc();
				return is;
			case streambuf::traits_type::eof():
				if(t.empty()) is.setstate(ios::eofbit);
				return is;
			default: t += (char) c;
		}
	}
}

/**
* main   
*/
int main(){
	//  
	NJamSpell::TSpellCorrector corrector;
	//   
	corrector.LoadLangModel("model.bin");
	//    
	ifstream file1("./test_data/test.txt", ios::in);
	//   
	if(file1.is_open()){
		//    
		string line = "", res = "";
		//    
		ofstream file2("./test_data/output.txt", ios::out);
		//   
		if(file2.is_open()){
			//       
			while(file1.good()){
				//    
				safeGetline(file1, line);
				//   ,  
				if(!line.empty()){
					//   
					res = convert(corrector.FixFragment(convert(line)));
					//   ,    
					if(!res.empty()){
						//   
						res.append("\n");
						//    
						file2.write(res.c_str(), res.size());
					}
				}
			}
			//  
			file2.close();
		}
		//  
		file1.close();
	}
    return 0;
}

$ g++ -std=c++11 -I../JamSpell -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf ./test.cpp -o ./bin/test

$ ./bin/test

結果

結果を得る

$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt

ASC

精度	想起	FMeasure
92.13	82.51	87.05

JamSpell

精度	想起	FMeasure
77.87	63.36	69.87

ASC の主な機能の1つは、ダーティデータからの学習です。オープンアクセスでエラーやタイプミスのないテキストコーパスを見つけることは事実上不可能です。テラバイトのデータを手作業で修正するだけでは十分ではありませんが、なんとかして作業する必要があります。

私が提供する教育原則

ダーティデータを使用して言語モデルをまとめる
アセンブルされた言語モデルのすべてのまれな単語とNグラムを削除します
タイプミス修正システムをより正確に操作するために、1つの単語を追加します。
バイナリ辞書をまとめる

始めましょう

異なる主題のコーパスがいくつかあるとすると、それらを別々にトレーニングしてから組み合わせる方が論理的です。

ALMを使用したシャーシの組み立て

collect.json

{
	"size": 3,
	"debug": 1,
	"threads": 0,
	"ext": "txt",
	"method": "train",
	"allow-unk": true,
	"mixed-dicts": true,
	"only-token-words": true,
	"smoothing": "wittenbell",
	"locale": "en_US.UTF-8",
	"w-abbr": "./output/alm.abbr",
	"w-map": "./output/alm.map",
	"w-vocab": "./output/alm.vocab",
	"w-words": "./output/words.txt",
	"corpus": "./texts/corpus",
	"abbrs": "./abbrs/abbrs.txt",
	"goodwords": "./texts/whitelist/words.txt",
	"badwords": "./texts/blacklist/garbage.txt",
	"mix-restwords": "./texts/similars/letters.txt",
	"alphabet": "abcdefghijklmnopqrstuvwxyz"
}

$ ./alm -r-json ./collect.json

size — N- 3
debug —
threads —
ext —
allow-unk — 〈unk〉
mixed-dicts —
only-token-words — N- —
smoothing — wittenbell ( , - )
locale — ( )
w-abbr —
w-map —
w-vocab —
w-words — ( )
corpus —
abbrs — , , (, , ...)
goodwords —
badwords —
mix-restwords —
alphabet — ( )

Python

import alm

#   N-  3
alm.setSize(3)
#      
alm.setThreads(0)
#    (  )
alm.setLocale("en_US.UTF-8")
#      (        )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#    <unk>   
alm.setOption(alm.options_t.allowUnk)
#         
alm.setOption(alm.options_t.mixDicts)
#    N- —       
alm.setOption(alm.options_t.tokenWords)

#    wittenbell (     ,  -    )
alm.init(alm.smoothing_t.wittenBell)

#   ,  ,   (, ,  ...)
f = open('./abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    alm.addAbbr(abbr)
f.close()

#      
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

#      
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def status(text, status):
    print(text, status)

def statusWords(status):
    print("Write words", status)

def statusVocab(status):
    print("Write vocab", status)

def statusMap(status):
    print("Write map", status)

def statusSuffix(status):
    print("Write suffix", status)

#    
alm.collectCorpus("./texts/corpus", status)
#      
alm.writeWords("./output/words.txt", statusWords)
#   
alm.writeVocab("./output/alm.vocab", statusVocab)
#    
alm.writeMap("./output/alm.map", statusMap)
#      
alm.writeSuffix("./output/alm.abbr", statusSuffix)

ALMを使用した組み立てられた船体の剪定

prune.json

{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "method": "vprune",
    "vprune-wltf": -15.0,
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-map": "./corpus1/alm.map",
    "r-vocab": "./corpus1/alm.vocab",
    "w-map": "./output/alm.map",
    "w-vocab": "./output/alm.vocab",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "alphabet": "abcdefghijklmnopqrstuvwxyz"
}

$ ./alm -r-json ./prune.json

size — N- 3
debug —
allow-unk — 〈unk〉
vprune-wltf — - (, — )
locale — ( )
smoothing — wittenbell ( , - )
r-map —
r-vocab —
w-map —
w-vocab —
goodwords —
badwords —
alphabet — ( )

Python

import alm

#   N-  3
alm.setSize(3)
#      
alm.setThreads(0)
#    (  )
alm.setLocale("en_US.UTF-8")
#      (        )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")

#    <unk>   
alm.setOption(alm.options_t.allowUnk)

#    wittenbell (     ,  -    )
alm.init(alm.smoothing_t.wittenBell)

#      
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

#      
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def statusPrune(status):
    print("Prune data", status)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

#  
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
#    
alm.readMap("./corpus1/alm.map", statusReadMap)
#   
alm.pruneVocab(-15.0, 0, 0, statusPrune)
#   
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
#    
alm.writeMap("./output/alm.map", statusWriteMap)

収集したデータをALMと組み合わせる

merge.json

{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "method": "merge",
    "mixed-dicts": "true",
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-words": "./texts/words",
    "r-map": "./corpus1",
    "r-vocab": "./corpus1",
    "w-map": "./output/alm.map",
    "w-vocab": "./output/alm.vocab",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "mix-restwords": "./texts/similars/letters.txt",
    "alphabet": "abcdefghijklmnopqrstuvwxyz"
}

$ ./alm -r-json ./merge.json

size — N- 3
debug —
allow-unk — 〈unk〉
mixed-dicts —
locale — ( )
smoothing — wittenbell ( , - )
r-words —
r-map — ,
r-vocab — ,
w-map —
w-vocab —
goodwords —
badwords —
alphabet — ( )

Python

import alm

#   N-  3
alm.setSize(3)
#      
alm.setThreads(0)
#    (  )
alm.setLocale("en_US.UTF-8")
#      (        )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#    <unk>   
alm.setOption(alm.options_t.allowUnk)
#         
alm.setOption(alm.options_t.mixDicts)

#    wittenbell (     ,  -    )
alm.init(alm.smoothing_t.wittenBell)

#      
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

#      
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

#         
f = open('./texts/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addWord(word)
f.close()

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

#   
alm.readVocab("./corpus1", statusReadVocab)
#    
alm.readMap("./corpus1", statusReadMap)
#   
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
#    
alm.writeMap("./output/alm.map", statusWriteMap)

ALMで言語モデルを学ぶ

train.json

{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "reset-unk": true,
    "interpolate": true,
    "method": "train",
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-map": "./output/alm.map",
    "r-vocab": "./output/alm.vocab",
    "w-arpa": "./output/alm.arpa",
    "w-words": "./output/words.txt",
    "alphabet": "abcdefghijklmnopqrstuvwxyz"
}

$ ./alm -r-json ./train.json

size — N- 3
debug —
allow-unk — 〈unk〉
reset-unk — , 〈unk〉
interpolate —
locale — ( )
smoothing — wittenbell
r-map — ,
r-vocab — ,
w-arpa — ARPA,
w-words — , ( )
alphabet — ( )

Python

import alm

#   N-  3
alm.setSize(3)
#      
alm.setThreads(0)
#    (  )
alm.setLocale("en_US.UTF-8")
#      (        )
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#    <unk>   
alm.setOption(alm.options_t.allowUnk)
#      <unk>   
alm.setOption(alm.options_t.resetUnk)
#         
alm.setOption(alm.options_t.mixDicts)
#     
alm.setOption(alm.options_t.interpolate)

#    wittenbell (     ,  -    )
alm.init(alm.smoothing_t.wittenBell)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusBuildArpa(status):
    print("Build ARPA", status)

def statusWriteMap(status):
    print("Write map", status)

def statusWriteArpa(status):
    print("Write ARPA", status)

def statusWords(status):
    print("Write words", status)

#   
alm.readVocab("./output/alm.vocab", statusReadVocab)
#    
alm.readMap("./output/alm.map", statusReadMap)

#     
alm.buildArpa(statusBuildArpa)

#       ARPA
alm.writeArpa("./output/alm.arpa", statusWriteArpa)

#   
alm.writeWords("./output/words.txt", statusWords)

スペルチェッカーASCトレーニング

train.json

{
	"size": 3,
	"debug": 1,
	"threads": 0,
	"confidence": true,
	"mixed-dicts": true,
	"method": "train",
	"alter": {"":""},
	"locale": "en_US.UTF-8",
	"smoothing": "wittenbell",
	"pilots": ["","","","","","","","","","","a","i","o","e","g"],
	"w-bin": "./dictionary/3-single.asc",
	"r-abbr": "./output/alm.abbr",
	"r-vocab": "./output/alm.vocab",
	"r-arpa": "./output/alm.arpa",
	"abbrs": "./texts/abbrs/abbrs.txt",
	"goodwords": "./texts/whitelist/words.txt",
	"badwords": "./texts/blacklist/garbage.txt",
	"alters": "./texts/alters/yoficator.txt",
	"upwords": "./texts/words/upp",
	"mix-restwords": "./texts/similars/letters.txt",
	"alphabet": "abcdefghijklmnopqrstuvwxyz",
	"bin-code": "ru",
	"bin-name": "Russian",
	"bin-author": "You name",
	"bin-copyright": "You company LLC",
	"bin-contacts": "site: https://example.com, e-mail: info@example.com",
	"bin-lictype": "MIT",
	"bin-lictext": "... License text ...",
	"embedding-size": 28,
	"embedding": {
	    "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
	    "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
	    "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
	    "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
	    "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
	    "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
	    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
	    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
	    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
	    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
	    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
	    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
	    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
	    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
	}
}

$ ./asc -r-json ./train.json

size — N- 3
debug —
threads —
confidence — ARPA - ,
mixed-dicts —
alter — ( , , — «»)
locale — ( )
smoothing — wittenbell ( , - )
pilots — ( )
w-bin —
r-abbr — ,
r-vocab — ,
r-arpa — ARPA,
abbrs — , , (, , ...)
goodwords —
badwords —
alters — , ( )
upwords — , (, , ...)
mix-restwords —
alphabet — ( )
bin-code —
bin-name —
bin-author —
bin-copyright —
bin-contacts —
bin-lictype —
bin-lictext —
embedding-size —
embedding — ( , )

Python

import asc

#   N-  3
asc.setSize(3)
#      
asc.setThreads(0)
#    (  )
asc.setLocale("en_US.UTF-8")

#        
asc.setOption(asc.options_t.uppers)
#    <unk>   
asc.setOption(asc.options_t.allowUnk)
#      <unk>   
asc.setOption(asc.options_t.resetUnk)
#         
asc.setOption(asc.options_t.mixDicts)
#     ARPA -  ,  
asc.setOption(asc.options_t.confidence)

#      (        )
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     (     )
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
#     
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#       
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()

#       
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()

#     
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()

#    ,   (, ,  ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()

#     ,       (, , ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()

#   
asc.addAlt("", "")

#       ,     (        )
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusIndex(text, status):
    print(text, status)

def statusBuildIndex(status):
    print("Build index", status)

def statusArpa(status):
    print("Read arpa", status)

def statusVocab(status):
    print("Read vocab", status)

#        ARPA
asc.readArpa("./output/alm.arpa", statusArpa)
#   
asc.readVocab("./output/alm.vocab", statusVocab)

#     
asc.setCode("RU")
#    
asc.setLictype("MIT")
#   
asc.setName("Russian")
#    
asc.setAuthor("You name")
#   
asc.setCopyright("You company LLC")
#    
asc.setLictext("... License text ...")
#     
asc.setContacts("site: https://example.com, e-mail: info@example.com")

#      ( ,     )
asc.setEmbedding({
    "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
    "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
    "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
    "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
    "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
    "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

#     
asc.buildIndex(statusBuildIndex)

#     
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)

すべての人が独自のバイナリボキャブラリーをトレーニングできるわけではないことを理解しています。これには、テキストコーパスと大量のコンピューティングリソースが必要です。したがって、ASCは、メインディクショナリとして1つのARPAファイルのみを処理できます。

仕事の例

spell.json

{
    "ad": 13,
    "cw": 38120,
    "debug": 1,
    "threads": 0,
    "method": "spell",
    "alter": {"":""},
    "asc-split": true,
    "asc-alter": true,
    "confidence": true,
    "asc-esplit": true,
    "asc-rsplit": true,
    "asc-uppers": true,
    "asc-hyphen": true,
    "mixed-dicts": true,
    "asc-wordrep": true,
    "spell-verbose": true,
    "r-text": "./texts/test.txt",
    "w-text": "./texts/output.txt",
    "upwords": "./texts/words/upp",
    "r-arpa": "./dictionary/alm.arpa",
    "r-abbr": "./dictionary/alm.abbr",
    "abbrs": "./texts/abbrs/abbrs.txt",
    "alters": "./texts/alters/yoficator.txt",
    "mix-restwords": "./similars/letters.txt",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "pilots": ["","","","","","","","","","","a","i","o","e","g"],
    "alphabet": "abcdefghijklmnopqrstuvwxyz",
    "embedding-size": 28,
    "embedding": {
        "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
        "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
        "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
        "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
        "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
        "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
        "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
        "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
        "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
        "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
        "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
        "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
        "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
        "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
    }
}

$ ./asc -r-json ./spell.json

Python

import asc

#      
asc.setThreads(0)
#        
asc.setOption(asc.options_t.uppers)
#   
asc.setOption(asc.options_t.ascSplit)
#   
asc.setOption(asc.options_t.ascAlter)
#      
asc.setOption(asc.options_t.ascESplit)
#      
asc.setOption(asc.options_t.ascRSplit)
#     
asc.setOption(asc.options_t.ascUppers)
#     
asc.setOption(asc.options_t.ascHyphen)
#    
asc.setOption(asc.options_t.ascWordRep)
#         
asc.setOption(asc.options_t.mixDicts)
#     ARPA -  ,  
asc.setOption(asc.options_t.confidence)

#      (        )
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
#     (     )
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
#     
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

#       
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()

#       
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()

#     
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()

#    ,   (, ,  ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()

#     ,       (, , ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()

#   
asc.addAlt("", "")

#       ,     (        )
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusArpa(status):
    print("Read arpa", status)

def statusIndex(status):
    print("Build index", status)

#        ARPA
asc.readArpa("./dictionary/alm.arpa", statusArpa)

#    (38120      13    )
asc.setAdCw(38120, 13)

#      ( ,     )
asc.setEmbedding({
    "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
    "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
    "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
    "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
    "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
    "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

#     
asc.buildIndex(statusIndex)

f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()

PS何も集めて訓練したくない人のために、私はASCのWebバージョンを立ち上げました。タイプミスを修正するためのシステムは万能のシステムではなく、そこでロシア語全体を提供することは不可能であることにも留意する必要があります。ASCはテキストを修正しません。トピックごとに個別にトレーニングする必要があります。

ANYKSスペルチェッカー

機能リスト：

サポートされているオペレーティングシステム：

準備ができた辞書

テスト

テストで使用される材料

ASCとJamSpell

結果

私が提供する教育原則

始めましょう

More articles: