Analysis of a paper on extracting senses from embeddings

tl;dr: a simplified analysis of the paper. The author provides two interesting theorems and, building on them, a way to extract hidden sense vectors from an embedding matrix. A guide to reproducing the results is included; the notebook is available on GitHub.



Introduction



In this article I will talk about one remarkable finding by the researcher Sanjeev Arora in the paper "Linear Algebraic Structure of Word Senses, with Applications to Polysemy". It is one of a series of papers in which he attempts to give a theoretical justification for the properties of word embeddings. In this same work, Arora hypothesizes that simple embeddings such as word2vec or GloVe actually contain several senses of a single word, and proposes a way to recover them. I will stick to the original examples throughout the article.



More formally, let $\upsilon_{tie}$ denote the embedding vector of the word tie, which can mean a knot, a necktie, or the verb "to tie". Arora suggests that this vector can be written as a linear combination:



$$\upsilon_{tie} \approx \alpha_1 \upsilon_{tie_1} + \alpha_2 \upsilon_{tie_2} + \alpha_3 \upsilon_{tie_3} + \dots$$



where $\upsilon_{tie_n}$ is one of the possible senses of tie and the $\alpha$'s are the coefficients. Let's figure out how this comes about.



Theory



Disclaimer

What follows is my simplified retelling, not a rigorous exposition; for the exact statements and proofs, see the original papers. Corrections are welcome.



A small note on Arora's theory



Arora's underlying work is considerably more involved than this, so a complete review of it is not ready yet. Still, I will describe it briefly.



So, Arora proposes the idea that text is produced by a generative model. At every time step t of the model's operation a word w is generated. The model consists of a context vector $c_t$ and embedding vectors $\upsilon_w$, all living in the same space of relatively small dimension. The context vector represents what is currently being talked about: while the text stays on one topic (say, food), it sits in one region of the space; when the topic changes (say, to the weather), it moves to another.



The context vector performs a slow random walk: at each step a small random displacement is added to it, so the topic of the discourse changes gradually. The model also allows occasional large jumps - say, from a passage about "cooking" to one about "politics" - but they are rare and, as the authors show, have little effect on the analysis.



This is, of course, a very simplified picture of how language actually works. Its value is that it is analytically tractable while still agreeing well with what we observe in practice.

: , . , t w



$$P(w \mid c_t) = \frac{1}{Z_{c_t}} \exp\langle c_t, \upsilon_w \rangle$$



where $c_t$ is the context vector at step t, $\upsilon_w$ is the embedding of the word w, and $Z_c = \sum_{w} \exp\langle c, \upsilon_w \rangle$ is the partition function. A key technical result of the paper is that, under its assumptions on the word vectors, $Z_c$ is almost the same for every context vector c, which makes the model tractable.
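To make the generative process concrete, here is a toy sketch of it in numpy. All the sizes and the step scale are made up for illustration; the real model also puts constraints on the word vectors themselves:

import numpy as np

rng = np.random.default_rng(0)

d, n_words = 50, 1000                       # toy dimensionality and vocabulary
word_vecs = rng.normal(size=(n_words, d)) / np.sqrt(d)

def sample_word(c, word_vecs, rng):
    """Sample a word index from P(w | c) ~ exp(<c, v_w>)."""
    logits = word_vecs @ c
    probs = np.exp(logits - logits.max())   # softmax over the vocabulary
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# the context vector does a slow random walk on the unit sphere
c = rng.normal(size=d)
c /= np.linalg.norm(c)
for t in range(10):
    w = sample_word(c, word_vecs, rng)      # emit a word at step t
    c += 0.05 * rng.normal(size=d)          # small random displacement
    c /= np.linalg.norm(c)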



From this model the authors analytically derive several properties that embeddings are known to have empirically: for example, a closed-form expression for the probability that a word Y occurs in the context of a word X, which leads to the familiar PMI-style relations that word2vec and GloVe approximate.



This retelling is necessarily loose in places; whoever wants the precise statements and proofs should consult the original paper.



Now let's return to polysemy. How can a single vector hold several senses? Look at the contexts in which the word "tie" occurs: they split into clearly distinct types. For example:



, ", , , ". , , , ", , , " , " " .





Within the generative model this means that tie is emitted by context vectors from several distinct regions, one region per sense (clothing, sport, and so on). The single trained vector $\upsilon_{tie}$ has to serve all of them at once. The first theorem gives the tool for reasoning about this: it connects a word's embedding with the average of its contexts.



Theorem 1



Let s be a context, a window of n words. Then there exists a linear transformation A such that



$$\upsilon_w \approx A\,\mathbb{E}\left[\frac{1}{n}\sum_{w_i \in s} \upsilon_{w_i} \;\middle|\; w \in s\right]$$



Let me decode this. Take a word w and the set S of all contexts in which it occurs. For each context $s \in S$ compute $\upsilon_s$, the average of the embeddings of its words, and then average these vectors over all of S to get u. The theorem states that u, multiplied by a matrix A that is the same for all words (and can be learned), approximates $\upsilon_w$ well. This is a surprising claim in itself, and among other things it lets us induce embeddings for out-of-vocabulary words: a handful of contexts is enough.



A couple of words on how the context embeddings are computed. The authors use their SIF embeddings, proposed in yet another paper of theirs. The idea is that a good embedding of a sentence or paragraph is a weighted average of the embeddings of its words, with weights that push down the contribution of very frequent words. Simplifying somewhat, the SIF embedding $\upsilon_{SIF}$ of a text of k words is the average of the word vectors $\upsilon_n$ weighted by something TF-IDF-like:



$$\upsilon_{SIF} = \frac{1}{k}\sum_{n=1}^{k} \upsilon_n \cdot \mathrm{tf\_idf}(w_n)$$



Note also that Theorem 1 is stated in terms of the model's ideal context vectors c; in practice we never observe them, so the SIF embeddings of real text windows serve as their stand-in, and that is what is used below.
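As a sketch, such a weighted average is a few lines of numpy. The function name and the `weight` callback here are hypothetical; the real SIF recipe additionally uses smooth inverse-frequency weights and subtracts the first principal component of the sentence vectors:

import numpy as np

def weighted_context_embedding(words, vectors, weight):
    """(1/k) * sum_n vectors[w_n] * weight(w_n) over the words of a context.

    `vectors` maps a word to its vector; `weight` is e.g. a TF-IDF lookup.
    """
    vecs = np.array([vectors[w] * weight(w) for w in words if w in vectors])
    return vecs.mean(axis=0)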



Now let's verify the theorem in practice. The plan is this: take words whose true embeddings $\upsilon_w$ are known, induce embeddings for them from their contexts alone, and compare the results with the originals. Step by step:



  1. Take a text corpus and a vocabulary V.
  2. For each word w ∈ V, pick contexts containing w from the corpus (about 20 per word) and compute the SIF embedding of each. Every w ∈ V thus gets a set of vectors $(\nu_{w_1}, \nu_{w_2}, \dots, \nu_{w_n})$, where n is the number of contexts of w.
  3. For every w ∈ V average its context embeddings: $u_w = \frac{1}{n}\sum_{t=1}^{n} \nu_{w_t}$.
  4. Find the matrix A by solving $\operatorname{argmin}_A \sum_w \lVert A u_w - \upsilon_w \rVert_2^2$ (see the sketch right after this list).
  5. Compare the induced embeddings $\hat{\upsilon}_w = A u_w$ with the true ones.
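Step 4 is ordinary least squares, so it can be sketched in a few lines of numpy. The matrices here are random stand-ins for the real $u_w$ and $\upsilon_w$ stacked row by row:

import numpy as np

U = np.random.rand(1000, 300)    # rows: averaged context embeddings u_w
V = np.random.rand(1000, 300)    # rows: true embeddings v_w

# argmin_A sum_w ||A u_w - v_w||^2  <=>  min_X ||U X - V||_F with X = A^T
X, *_ = np.linalg.lstsq(U, V, rcond=None)
A = X.T

induced = U @ A.T                # step 5: induced embeddings A u_w
cos = np.sum(induced * V, axis=1) / (
    np.linalg.norm(induced, axis=1) * np.linalg.norm(V, axis=1))
print(cos.mean())                # ~0 for random data; high for real embeddings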


To keep the test honest, the vocabulary is split: A is fit on 2/3 of the words, and the quality is measured on the remaining 1/3 as the cosine similarity between $A u_w$ and the true $\upsilon_w$. The numbers reported by the authors, depending on how many paragraphs of text were used:



#paragraphs      250k    500k    750k    1 million
cos similarity   0.94    0.95    0.96    0.96


Theorem 2



Now suppose the word w has two senses, s1 and s2, so that the vector $\upsilon_w$ is trained on contexts of both kinds at once. Imagine that we could mark the senses in the corpus, i.e., replace each occurrence of the word with a sense-specific token, say tie_1 and tie_2, where tie_1 is the clothing sense and tie_2 the drawn-match sense.

Embeddings trained on such a corpus would give separate sense vectors $\upsilon_{w_{s1}}$ and $\upsilon_{w_{s2}}$. Theorem 2 states that the vector $\upsilon_w$ of the unmarked word is, up to small noise, a weighted combination of them:



$$\upsilon_w \approx \frac{f_1}{f_1+f_2}\,\upsilon_{s_1} + \frac{f_2}{f_1+f_2}\,\upsilon_{s_2} = \alpha\,\upsilon_{s_1} + \beta\,\upsilon_{s_2}$$



where $f_1$ and $f_2$ are the corpus frequencies of the senses $s_1$ and $s_2$. The combination is convex: the coefficients are positive and sum to one. For example, if the clothing sense occurs four times more often, then $\alpha = 0.8$ and $\beta = 0.2$, and the word's vector sits much closer to the dominant sense.



But how can this be tested when we have no sense-annotated corpus and do not know the coefficients alpha? The authors run an artificial experiment: they pick pairs of unrelated words, merge every occurrence of each pair in the corpus into a single pseudo-word, and retrain the embeddings. The embedding of the pseudo-word can then be compared, via inner products, with the combination predicted from the two original vectors and their frequencies, and the agreement turns out to be good even when one word of the pair is much rarer than the other. So a vector really does store a mixture of senses, $\upsilon_{tie_1}$ included - now we need a way to unmix it.



So how do we extract sense vectors from ready-made embeddings? This is where sparse coding comes in. Given vectors of dimension d, we look for a dictionary of m vectors $A_1, A_2, \dots, A_m$, called atoms, such that every embedding is a sparse linear combination of them:



$$\upsilon_w = \sum_{j=1}^{m} \alpha_{w,j} A_j + \mu_w$$



where at most k of the coefficients $\alpha_{w,j}$ are nonzero and $\mu_w$ is a noise vector. The dictionary and the coefficients are found by minimizing the total reconstruction error:



$$\min_{A,\,\alpha}\ \sum_w \left\lVert \upsilon_w - \sum_{j=1}^{m} \alpha_{w,j} A_j \right\rVert_2^2$$



Here k is called the sparsity parameter and m is the dictionary size. Solving this problem exactly is computationally infeasible, so in practice it is solved approximately, for example by the k-SVD algorithm. The crucial point is the choice of m: there are far fewer atoms than words (about 2000 versus a vocabulary of hundreds of thousands), so the dictionary cannot simply memorize the embeddings (in particular, it is shared by all words). Each atom $A_i$ is therefore forced to capture some elementary "discourse" common to many words, and the senses of a particular word should show up among the few atoms that enter its decomposition. The authors call these the atoms of discourse.
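The walkthrough below uses the ksvd package, but the same objective can also be handed to scikit-learn's dictionary learner. A minimal sketch, with hyperparameters simply mirroring the paper's m = 2000 and k = 5 (fitting this on a full 400k-word matrix would be very slow):

from sklearn.decomposition import DictionaryLearning

dict_learner = DictionaryLearning(
    n_components=2000,              # m, the number of atoms
    transform_algorithm='omp',      # sparse coding via orthogonal matching pursuit
    transform_n_nonzero_coefs=5,    # k, the sparsity
)
# atoms = dict_learner.fit(embedds).components_   # shape (2000, 300)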





Enough theory - let's see how all of this works in practice.



import numpy as np

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from scipy.spatial.distance import cosine
import warnings
warnings.filterwarnings('ignore')


1. Loading the embeddings with Gensim

We will work with pretrained GloVe embeddings.

Let's take the 300-dimensional vectors trained on the 6B-token corpus and convert them to word2vec format so that gensim can read them.



tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec("/home/astromis/Embeddings/glove.6B.300d.txt", tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)


embeddings = model.wv

index2word = embeddings.index2word
embedds = embeddings.vectors


print(embedds.shape)


(400000, 300)


There are 400,000 words in the vocabulary, each with a 300-dimensional vector.



2. Running k-SVD

We need an implementation of the k-SVD algorithm to build the dictionary. We'll take the ksvd package.



!pip install ksvd
from ksvd import ApproximateKSVD


Requirement already satisfied: ksvd in /home/astromis/anaconda3/lib/python3.6/site-packages (0.0.3)
Requirement already satisfied: numpy in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (1.14.5)
Requirement already satisfied: scikit-learn in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (0.19.1)


Following the paper, we set the number of atoms to 2000 and the sparsity to 5.

A warning: on the full 400,000-vector matrix this computation takes a very long time (note that the %time magic in the next cell times only itself, not the fit). Since the rows are sorted by word frequency, a prefix of the matrix makes a reasonable subsample; I precomputed the results on 10,000 vectors and load them from disk below.



%time
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)


CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.54 µs


#gamma = np.load('./data/mats/.npz')
# dictionary_glove6b_300d.np.npz - whole matrix file
dictionary = np.load('./data/mats/dictionary_glove6b_300d_10000.np.npz')
dictionary = dictionary[dictionary.files[0]]  # .files lists the arrays stored in the npz


#print(gamma.shape)
print(dictionary.shape)


(2000, 300)


#np.savez_compressed('gamma_glove6b_300d.npz', gamma)
#np.savez_compressed('dictionary_glove6b_300d.npz', dictionary)


3. Examining the atoms



Now that the dictionary is ready, let's look at what is inside. For any atom we can print the words whose embeddings are closest to it.



embeddings.similar_by_vector(dictionary[1354,:])


[('slave', 0.8417330980300903),
 ('slaves', 0.7482961416244507),
 ('plantation', 0.6208109259605408),
 ('slavery', 0.5356900095939636),
 ('enslaved', 0.4814416170120239),
 ('indentured', 0.46423888206481934),
 ('fugitive', 0.4226764440536499),
 ('laborers', 0.41914862394332886),
 ('servitude', 0.41276970505714417),
 ('plantations', 0.4113745093345642)]


embeddings.similar_by_vector(dictionary[1350,:])


[('transplant', 0.7767853736877441),
 ('marrow', 0.699995219707489),
 ('transplants', 0.6998592615127563),
 ('kidney', 0.6526087522506714),
 ('transplantation', 0.6381147503852844),
 ('tissue', 0.6344675421714783),
 ('liver', 0.6085026860237122),
 ('blood', 0.5676015615463257),
 ('heart', 0.5653558969497681),
 ('cells', 0.5476219058036804)]


embeddings.similar_by_vector(dictionary[1546,:])


[('commons', 0.7160810828208923),
 ('house', 0.6588335037231445),
 ('parliament', 0.5054076910018921),
 ('capitol', 0.5014163851737976),
 ('senate', 0.4895153343677521),
 ('hill', 0.48859673738479614),
 ('inn', 0.4566132128238678),
 ('congressional', 0.4341348707675934),
 ('congress', 0.42997264862060547),
 ('parliamentary', 0.4264637529850006)]


embeddings.similar_by_vector(dictionary[1850,:])


[('okano', 0.2669774889945984),
 ('erythrocytes', 0.25755012035369873),
 ('windir', 0.25621023774147034),
 ('reapportionment', 0.2507009208202362),
 ('qurayza', 0.2459488958120346),
 ('taschen', 0.24417680501937866),
 ('pfaffenbach', 0.2437630295753479),
 ('boldt', 0.2394050508737564),
 ('frucht', 0.23922981321811676),
 ('rulebook', 0.23821482062339783)]


Impressive! Most atoms look like coherent topics: slavery, transplantation, parliament. Some, like the last one, are noise atoms assembled from rare tokens - that is to be expected. Now the main experiment: let's find the atoms closest to the embeddings of the polysemous words "tie" and "spring".



itie = index2word.index('tie')
ispring = index2word.index('spring')

tie_emb = embedds[itie]
spring_emb = embedds[ispring]


simlist = []

# cosine distance from every atom to the embedding of "tie"
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, tie_emb), i))

simlist = sorted(simlist, key=lambda x: x[0])
closest_atoms_ind = [ins[1] for ins in simlist[:15]]

for atoms_idx in closest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #162: win victory winning victories wins won 2-1 scored 3-1 scoring
Atom #58: game play match matches games played playing tournament players stadium
Atom #237: 0-0 1-1 2-2 3-3 draw 0-1 4-4 goalless 1-0 1-2
Atom #622: wrapped wrap wrapping holding placed attached tied hold plastic held
Atom #1899: struggles tying tied inextricably fortunes struggling tie intertwined redefine define
Atom #1941: semifinals quarterfinals semifinal quarterfinal finals semis semi-finals berth champions quarter-finals
Atom #1074: qualifier quarterfinals semifinal semifinals semi finals quarterfinal champion semis champions
Atom #1914: wearing wore jacket pants dress wear worn trousers shirt jeans
Atom #281: black wearing man pair white who girl young woman big
Atom #1683: overtime extra seconds ot apiece 20-17 turnovers 3-2 halftime overtimes
Atom #369: snap picked snapped pick grabbed picks knocked picking bounced pulled
Atom #98: first team start final second next time before test after
Atom #1455: after later before when then came last took again but
Atom #1203: competitions qualifying tournaments finals qualification matches qualifiers champions competition competed
Atom #1602: hat hats mask trick wearing wears sunglasses trademark wig wore
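As an aside, the same nearest-atom search can be done without an explicit Python loop. A sketch using scipy's cdist, equivalent to the loop above (it assumes `dictionary` and `tie_emb` from the cells above):

import numpy as np
from scipy.spatial.distance import cdist

# one call computes the cosine distance from every atom to tie_emb
dists = cdist(dictionary, tie_emb[None, :], metric='cosine').ravel()
closest_atoms_ind = np.argsort(dists)[:15]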


simlist = []

# cosine distance from every atom to the embedding of "spring"
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, spring_emb), i))

simlist = sorted(simlist, key=lambda x: x[0])
closest_atoms_ind = [ins[1] for ins in simlist[:15]]

for atoms_idx in closest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #528: autumn spring summer winter season rainy seasons fall seasonal during
Atom #1070: start begin beginning starting starts begins next coming day started
Atom #931: holiday christmas holidays easter thanksgiving eve celebrate celebrations weekend festivities
Atom #1455: after later before when then came last took again but
Atom #754: but so not because even only that it this they
Atom #688: yankees yankee mets sox baseball braves steinbrenner dodgers orioles torre
Atom #1335: last ago year months years since month weeks week has
Atom #252: upcoming scheduled preparations postponed slated forthcoming planned delayed preparation preparing
Atom #619: cold cool warm temperatures dry cooling wet temperature heat moisture
Atom #1775: garden gardens flower flowers vegetable ornamental gardeners gardening nursery floral
Atom #21: dec. nov. oct. feb. jan. aug. 27 28 29 june
Atom #84: celebrations celebration marking festivities occasion ceremonies celebrate celebrated celebrating ceremony
Atom #98: first team start final second next time before test after
Atom #606: vacation lunch hour spend dinner hours time ramadan brief workday
Atom #384: golden moon hemisphere mars twilight millennium dark dome venus magic


It worked! For "tie" we clearly see the sporting senses (wins, drawn matches, playoffs), the clothing sense, and the verb sense (wrapped, tied); for "spring" - the season, the holidays around it, the weather, and gardening.
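By the way, the sparse coefficients offer an alternative to this nearest-atom search: the atoms that enter a word's decomposition with nonzero weight are its candidate senses. A sketch, assuming `gamma` from the `aksvd.transform` call above (its rows are aligned with `index2word`):

import numpy as np

# atoms participating in the sparse decomposition of "tie"
word_idx = index2word.index('tie')
sense_atoms_idx = np.nonzero(gamma[word_idx])[0]

for j in sense_atoms_idx:
    nearest = embeddings.similar_by_vector(dictionary[j, :], 5)
    print("Atom #{}: {}".format(j, ' '.join(w for w, _ in nearest)))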

Admittedly, a few of the nearest atoms are generic clusters of function words that attach to almost any query, so a bit of manual filtering is still needed. Now let's check whether all of this works for Russian.



For Russian we'll take the fastText embeddings from RusVectores, also 300-dimensional.



fasttext_model = KeyedVectors.load('/home/astromis/Embeddings/fasttext/model.model')


embeddings = fasttext_model.wv

index2word = embeddings.index2word
embedds = embeddings.vectors


embedds.shape


(164996, 300)


%time
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors[:10000]
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)


CPU times: user 1 µs, sys: 2 µs, total: 3 µs
Wall time: 6.2 µs


dictionary = np.load('./data/mats/dictionary_rus_fasttext_300d.npz')
dictionary = dictionary[dictionary.files[0]]  # .files lists the arrays stored in the npz


embeddings.similar_by_vector(dictionary[1024,:], 20)


[output: the 20 nearest Russian words with similarities from 0.685 down to 0.507; the Cyrillic tokens themselves were lost in extraction]


embeddings.similar_by_vector(dictionary[1582,:], 20)


[output: the 20 nearest Russian words, similarities 0.452 down to 0.346; Cyrillic tokens lost in extraction]


embeddings.similar_by_vector(dictionary[500,:], 20)


[output: the 20 nearest Russian words, similarities 0.687 down to 0.298; Cyrillic tokens lost in extraction]


# the two polysemous Russian query words (their Cyrillic spellings did not
# survive extraction; they play the roles of "tie" and "spring" above)
itie = index2word.index('')
ispring = index2word.index('')

tie_emb = embedds[itie]
spring_emb = embedds[ispring]


simlist = []

# cosine distance from every atom to the embedding of the "spring" analogue
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, spring_emb), i))

simlist = sorted(simlist, key=lambda x: x[0])
closest_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in closest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


[Atoms #185, #1217, #1213, #1978, #1796, #839, #989, #414, #1140, #878: each line listed the atom's ten nearest Russian words, which were lost in extraction]


simlist = []

# cosine distance from every atom to the embedding of the "tie" analogue
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, tie_emb), i))

simlist = sorted(simlist, key=lambda x: x[0])
closest_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in closest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


[Atoms #883, #40, #215, #688, #386, #676, #414, #127, #592, #703: each line listed the atom's ten nearest Russian words, which were lost in extraction]


#np.savez_compressed('./data/mats/gamma_rus_fasttext_300d.npz', gamma)
#np.savez_compressed('./data/mats/dictionary_rus_fasttext_300d.npz', dictionary)


The picture for Russian is similar: the nearest atoms again separate into distinct topics corresponding to the words' senses.





What we end up with is, in effect, an unsupervised method of word sense induction, and problem number one with it is the same as with any such method: the discovered senses still have to be interpreted by a human. The dictionary is also shared by the whole vocabulary, so along with clean sense atoms a word attracts generic ones, and the number of senses has to be judged from the coefficients. Still, the main takeaway stands: ordinary embeddings store several senses of a word in a recoverable, linear-algebraic form, and a simple sparse-coding procedure can pull them out.



UPD: thanks to knagaev for corrections.



