The project has grown: the library now covers all the basic tasks of Russian natural language processing, namely segmentation into tokens and sentences, morphological and syntactic analysis, lemmatization, and named entity extraction.
For news articles, quality on every task is comparable to or better than existing solutions. For example, on NER Natasha is 1 percentage point behind Deeppavlov BERT NER (F1 PER 0.97, LOC 0.91, ORG 0.85), while the model is 75 times smaller (27MB) and runs on a CPU twice as fast (25 articles/sec) as BERT NER does on a GPU.
The project has 9 repositories, and the Natasha library combines them under one interface. This article describes the new tools and compares them with existing solutions: Deeppavlov, SpaCy, UDPipe.
If the length of the text below is daunting, watch the first 20 minutes of the stream about the history of the Natasha project for a short summary. This longread was preceded by a series of posts on natasha.github.io:
- Natasha: high-quality compact NER for Russian
- Navec: compact embeddings for Russian
- Corus: a collection of Russian-language NLP datasets
- Razdel: segmentation of Russian-language text into tokens and sentences
- Naeval: quantitative comparison of systems for Russian NLP
- Nerus: a large synthetic Russian-language dataset with markup of morphology, syntax and named entities
The text draws on notes and discussions from the t.me/natural_language_processing chat; links to new materials are posted there as well:
- Why Natasha does not use Transformers. BERT in 100 lines
- Slovnet BERT models
- Stream about the history of the Natasha project
- Updated Yargy documentation
- Additional resources on the Yargy parser
If you prefer listening, check out the hour-long talk at Datafest 2020; it covers almost all of this post.
Contents:
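- Natasha: a single interface to the project's libraries
- Razdel: segmentation of Russian-language text into tokens and sentences
- Slovnet: deep learning modeling for Russian natural language processing
- Navec: compact embeddings for Russian
- Nerus: a large synthetic dataset with markup of morphology, syntax and named entities
- Corus: a collection of links to public Russian-language datasets plus loader functions
- Naeval: quantitative comparison of systems for Russian NLP
- Yargy: rules and dictionaries for extracting structured information
- Ipymarkup: visualization of NER and syntax markup
Natasha: a single interface to the project's libraries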
Previously the Natasha library solved NER for Russian with rules and showed mediocre quality and performance. Now Natasha is a large project made up of 9 repositories. The Natasha library combines them under one interface and covers the basic tasks of Russian natural language processing: segmentation into tokens and sentences, pretrained embeddings, morphological and syntactic analysis, lemmatization, and NER. All of these solutions show top results on news and run fast on a CPU.
Natasha resembles other all-in-one libraries such as SpaCy, UDPipe and Stanza. SpaCy initializes and calls its models implicitly: the user passes text to the magic function nlp and gets back a fully parsed document.
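import spacy
# load implicitly loads the pretrained embeddings and initializes
# the components for morphology, syntax and NER
nlp = spacy.load('...')
# The model pipeline processes the text and returns a parsed document
text = '...'
doc = nlp(text)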
Natasha's interface is more verbose. The user initializes the components explicitly: loads the pretrained embeddings and passes them to the model constructors. The user then calls the segment, tag_morph and parse_syntax methods to split the text into tokens and sentences and to parse morphology and syntax.
>>> from natasha import (
...     Segmenter,
...     NewsEmbedding,
...     NewsMorphTagger,
...     NewsSyntaxParser,
...     Doc
... )
>>> segmenter = Segmenter()
>>> emb = NewsEmbedding()
>>> morph_tagger = NewsMorphTagger(emb)
>>> syntax_parser = NewsSyntaxParser(emb)
>>> text = ' , , 2019 () ...'
>>> doc = Doc(text)
>>> doc.segment(segmenter)
>>> doc.tag_morph(morph_tagger)
>>> doc.parse_syntax(syntax_parser)
>>> sent = doc.sents[0]
>>> sent.morph.print()
NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
PROPN|Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing
ADP
PROPN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing
PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
...
>>> sent.syntax.print()
âââ⺠nsubj
â
â â⺠case
â ââ
â ââ
â â⺠flat:name
ââââââââââââ
â â âââ⺠, punct
â â â â⺠mark
â ââºââââ ccomp
â â â⺠case
â ââââºââ obl
...
The named entity extractor does not depend on the results of morphological and syntactic parsing, so it can be used on its own.
>>> from natasha import NewsNERTagger
>>> ner_tagger = NewsNERTagger(emb)
>>> doc.tag_ner(ner_tagger)
>>> doc.ner.print()
, ,
LOCââââ LOCââââ PERâââââââ
2019
LOCââââââââââââââ
()
LOCâââ ORGâââââââââââââââââââââââââââââââââââââââ
...
PERââââââââââââ
Natasha solves lemmatization using Pymorphy2 together with the results of morphological parsing.
>>> from natasha import MorphVocab
>>> morph_vocab = MorphVocab()
>>> for token in doc.tokens:
...     token.lemmatize(morph_vocab)
>>> {_.text: _.lemma for _ in doc.tokens}
{'': '',
'': '',
'': '',
'': '',
'': '',
'': '',
'': '',
',': ',',
'': '',
'': ''
...
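To bring a phrase to its normal form, finding the lemma of each word separately is not enough. For names such as "Russian Foreign Ministry" or "Organization of Ukrainian Nationalists", word-by-word lemmatization puts the dependent words into the nominative case and produces an ungrammatical phrase. Natasha uses the results of syntactic parsing, takes the relations between words into account, and normalizes named entities.
>>> for span in doc.spans:
...     span.normalize(morph_vocab)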
>>> {_.text: _.normal for _ in doc.spans}
{'': '',
'': '',
' ': ' ',
' ': ' ',
'': '',
' ()': ' ()',
' ': ' ',
...
Natasha finds person names, organizations and place names in text. For person names, the library ships a set of ready-made rules for the Yargy parser: the module splits a normalized name into parts, so "Viktor Fedorovich Yushchenko" becomes {first: Viktor, last: Yushchenko, middle: Fedorovich}.
>>> from natasha import (
...     PER,
...     NamesExtractor,
... )
>>> names_extractor = NamesExtractor(morph_vocab)
>>> for span in doc.spans:
...     if span.type == PER:
...         span.extract_fact(names_extractor)
>>> {_.normal: _.fact.as_dict for _ in doc.spans if _.type == PER}
{' ': {'first': '', 'last': ''},
' ': {'first': '', 'last': ''},
' ': {'first': '', 'last': ''},
'': {'last': ''},
' ': {'first': '', 'last': ''}}
The library also contains rules for parsing dates, amounts of money and addresses; they are described in the documentation and the reference.
The Natasha library is well suited to demonstrating the project's technologies and is used in education. The archives with model weights are bundled into the package, so after installation nothing needs to be downloaded or configured.
Natasha combines the project's other libraries under one interface. For practical tasks it is worth using them directly:
- Razdel: segmentation of text into sentences and tokens;
- Navec: high-quality compact embeddings;
- Slovnet: modern compact models for morphology, syntax and NER;
- Yargy: rules and dictionaries for extracting structured information;
- Ipymarkup: visualization of NER and syntax markup;
- Corus: a collection of links to public Russian-language datasets;
- Nerus: a large corpus with automatic markup of named entities, morphology and syntax.
Razdel: segmentation of Russian-language text into tokens and sentences
The Razdel library is part of the Natasha project; it splits Russian-language text into tokens and sentences. Installation instructions, a usage example and performance measurements are in the Razdel repository.
>>> from razdel import tokenize, sentenize
>>> text = '- 0.5 (50/64 ³, 516;...)'
>>> list(tokenize(text))
[Substring(start=0, stop=13, text='-'),
Substring(start=14, stop=16, text=''),
Substring(start=17, stop=20, text='0.5'),
Substring(start=20, stop=21, text=''),
Substring(start=22, stop=23, text='(')
...]
>>> text = '''
... - " ?" - " --".
... . . . . ,
... '''
>>> list(sentenize(text))
[Substring(start=1, stop=23, text='- " ?"'),
Substring(start=24, stop=40, text='- " --".'),
Substring(start=41, stop=56, text=' . . . .'),
Substring(start=57, stop=76, text=' , ')]
Recent models do not bother with segmentation: they use BPE and show remarkable results, as every version of GPT and the whole BERT zoo demonstrates. Natasha, however, solves morphological and syntactic parsing, and those tasks only make sense for individual words within a single sentence. So we treat the segmentation stage seriously and try to reproduce the markup of popular open datasets: SynTagRus, OpenCorpora, GICRYA.
Razdel's speed and quality are comparable to or better than other open-source solutions for Russian.
| Token segmentation solution | Errors per 1000 tokens | Processing time, seconds |
| Regexp-baseline | 19 | 0.5 |
| SpaCy | 17 | 5.4 |
| NLTK | 130 | 3.1 |
| MyStem | 19 | 4.5 |
| Moses | 11 | 1.9 |
| SegTok | 12 | 2.1 |
| SpaCy Russian Tokenizer | 8 | 46.4 |
| RuTokenizer | 15 | 1.0 |
| Razdel | 7 | 2.6 |

| Sentence segmentation solution | Errors per 1000 sentences | Processing time, seconds |
| Regexp-baseline | 76 | 0.7 |
| SegTok | 381 | 10.8 |
| Moses | 166 | 7.0 |
| NLTK | 57 | 7.1 |
| DeepPavlov | 41 | 8.5 |
| Razdel | 43 | 4.8 |
Error counts are averaged over 4 datasets: SynTagRus, OpenCorpora, GICRYA and RNC. See the Razdel repository for details.
Why is Razdel needed if a regular-expression baseline gives similar quality and plenty of ready-made solutions exist for Russian? Because Razdel is not just a tokenizer but a small rule-based segmentation engine. Segmentation is a basic task that comes up often in practice. For example, given a court ruling, you may need to pick out its operative part and split it into paragraphs. Naturally, ready-made solutions cannot do that. The source code shows how to write your own rules. Below we describe how we built top-performing token and sentence segmenters on this engine.
What is the difficulty?
In Russian, sentences usually end with a period, a question mark or an exclamation mark. Let us split text with the regular expression [.?!]\s+. This solution makes 76 errors per 1000 sentences. The types of errors, with examples:
Abbreviations
... 3,000人以äžã®èŠèŽè ããããã©ãããã©ãŒã ã¯ããã¬ãŒã§ãã
... 17äžçŽã®çµããããããŒãã圌ãã®äžã«ç«ã£ãŠããŸããã
âŠâB.Aãã«ã¡ãªãã§åä»ããããChamberMusicalTheaterã§ãã¯ããã¹ããŒã
Initials
V.A.âMozart-R.âStraussã«ãããªãã©ãIdomeneoãã«ç¶ã...
Lists
2.ãã£ã³ã©ã³ãé äºé€šã«ã¯çŸããé·ãåããããšæããŸãã...
g ããã·ã¢ã®ééã®åè»ã®ãã±ãã...
A smiley or an ellipsis at the end of a sentence
ãã€ãã¹ãåãé€ãæ¹æ³ãæäŸãã人ã¯èª°ã§ã-ãã®ãããã§ïŒïŒâç§ã¯èŠããææ ®æ·±ã...âã³ã³ãã³ããå£ããŠããŸãã®ã§ãããã¯ãã£ãšäžå¿«ã§ãã
Quotations and direct speech, with the closing quote at the end of the sentence
-ããªãã¯çºã«è±å«ãããŸããïŒãâã誰ã®ããã«è±å«ãããŸããïŒã
ãããã¯ç§ããã®ããã§ã¯ãªãã»ã©è¯ãã§ãïŒãâä»ã翻蚳äžã«ãç§ã¯ããã€ãã®ééããç¯ããŸããïŒãidologyãã
Razdel takes these nuances into account and reduces the number of errors from 76 to 43 per 1000 sentences.
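The regular-expression baseline described above can be sketched in a few lines. This is only an illustration: the function name and the English sample sentence are hypothetical stand-ins for the Russian examples, chosen to show the same failure modes (an abbreviation and initials).

```python
import re

# Baseline from the text: every ., ? or ! followed by whitespace ends a sentence.
# The lookbehind keeps the punctuation attached to the sentence on its left.
SENT_END = re.compile(r'(?<=[.?!])\s+')


def split_sentences(text):
    """Naive splitter: treat each [.?!] + whitespace as a boundary."""
    return [part for part in SENT_END.split(text.strip()) if part]


# Hypothetical sample with an abbreviation ("c.") and initials ("B. A.").
text = 'Built in the 17th c. by B. A. Pokrovsky. Done!'
sentences = split_sentences(text)
print(sentences)
```

The sample contains two real sentences, but the baseline returns five fragments, which is the kind of error the counts in the tables above measure.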
The situation with tokens is similar. A good baseline is the regular expression [а-яё-]+|[0-9]+|[^а-яё0-9 ], which makes 19 errors per 1000 tokens. Examples:
Fractions and complex punctuation
... 1980幎代åŸåãã1990幎代åé
...BS-â3ã®è³ªéã¯ãããã«å°ãªãïŒ3âãâ6tïŒ
-ãããŠåœŒå¥³ã¯â.âæ»ãã ãé·¹ã女ã®åãåãããŸããïŒâïŒ
Razdel reduces the error rate to 7 per 1000 tokens.
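A minimal sketch of such a token baseline follows. The exact pattern is an assumption (runs of letters and hyphens, runs of digits, or any other single non-space character), and Latin letters are added so that the hypothetical English sample works.

```python
import re

# Assumed baseline pattern: letters/hyphens | digits | any other single symbol.
TOKEN = re.compile(r'[a-zа-яё-]+|[0-9]+|[^a-zа-яё0-9 ]', re.IGNORECASE)


def tokenize_baseline(text):
    """Return the list of baseline tokens; spaces are simply skipped."""
    return TOKEN.findall(text)


# Hypothetical sample: a decimal fraction, a suffix after a number, an ellipsis.
tokens = tokenize_baseline('mass (3,6 t) in 1990-s...')
print(tokens)
```

The fraction 3,6 falls apart into three tokens and the ellipsis into three separate dots, which mirrors the fraction and complex-punctuation errors listed above.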
How it works
The system is built on rules. The principle is the same for segmentation into tokens and into sentences.
Collecting candidates
We find every candidate for the end of a sentence in the text: periods, ellipses, brackets, quotation marks.
6.âãç§ã¯ããããããšããåçã®æãé »ç¹ã§åæã«é«ãè©äŸ¡ã®ãªãã·ã§ã³âïŒ13ã®ã¹ããŒãã¡ã³ãã25ãã€ã³ãïŒââæ¿èªãšå±ãŸããåããç¶æ³â7.âãç§ã¯ç¥ã£ãŠããããšããåçã§ã¯ãæãå žåçãªãã®ãšããŠæšå®ãããŠããããšã¯æ³šç®ã«å€ããŸãããããããç§ã¯å¥³æ§ã§ãããšããçãã«åºãããããšãã ãâ;ããã®äººçã§ç§ãåŸ ã£ãŠããã®ã¯1ã€ã®çµå©ã ãã§ããâãšãé ããæ©ããç§ã¯åºç£ããªããã°ãªããŸãããâ.âã³ã³ãã€ã©ïŒV.âP.âãŽããã³ãF.âV.âZanichevãA.âL.âRastorguevãR.âV.âSavkoãI.âI.âTuchkovã
For tokens, we break the text into atoms. A token boundary never passes through the inside of an atom.
1980幎ã®çµããã«â-ââ-beginning1990â-
ââBSâ-â3âãããã«â
å°ãã質éâïŒâ3âãâ6ââïŒâââãããŒã¯ããããšãå¯èœã§ãã Daâandâumerlaâ.â.â.âGotâligirlãâthefalconâïŒâïŒ
Joining
We go through the split candidates one by one and remove the unnecessary ones, using a list of heuristics.
List item: the delimiter is a period or a bracket, with a digit or a letter to its left
6ã§ããâæãé »ç¹ã§ãããšåæã«é«ãè©äŸ¡ãããŠããåçããããããïŒ13ã¹ããŒãã¡ã³ãã25ãã€ã³ãïŒã¯ãæ¿èªãšå±ãŸããåããŠããç¶æ³ã§ãã 7.âãç§ã¯ç¥ã£ãŠããããšããçãã®äžã§...
Initials: the delimiter is a period, with a single capital letter to its left
...V.âP.âGolovinãF.âV.âZanichevãA.âL.âRastorguevãR.âV.âSavkoãI.âI.âTuchkovã«ãã£ãŠç·šéãããŸããã ..ã
There is no space to the right of the delimiter
...ãããããç§ã¯å¥³æ§ã§ãããšããçãã¯äžåºŠã ãã§ãâ; ããã®äººçã§ç§ãåŸ ã£ãŠããã®ã¯1ã€ã®çµå©ã ãã§ãããšãé ããæ©ããç§ã¯åºç£ããªããã°ãªããªãããšãã声æããããŸãâã
There is no sentence-ending mark before the closing quote or bracket, so this is not a quotation or direct speech
ãæãé »ç¹ã§é«ãè©äŸ¡ãããŠããåçã¯ãããããããã§ã«ïŒ13ã¹ããŒãã¡ã³ãã25ãã€ã³ãïŒâ-æ¿èªãšå±ãŸããåŸãç¶æ³ã ...ããã®äººçã§ç§ãåŸ ã£ãŠããã®ã¯1ã€ã®çµå©ã ãã§ãããããŠãé ããæ©ããç§ã¯åºç£ããªããã°ãªããŸãããã
As a result, two delimiters remain, and we treat them as sentence ends.
6.ãç§ã¯ãããããïŒ13ã®ã¹ããŒãã¡ã³ãã25ã®ãã€ã³ãïŒãšããåçã®æãé »ç¹ã§åæã«é«ãè©äŸ¡ãããŠããå€åœ¢ã¯ãæ¿èªãšå±ãŸããåããŠããç¶æ³ã§ããâ7ã ãç§ã¯ç¥ã£ãŠããããšããçãã§æãå žåçãªãã®ãšããŠè©äŸ¡ãããŠããããšã¯æ³šç®ã«å€ããŸããããç§ã¯å¥³æ§ã§ãããšããçãã«åºäŒã£ãã®ã¯äžåºŠã ãã§ãã ããã®äººçã§ç§ãåŸ ã£ãŠããã®ã¯1ã€ã®çµå©ã ãã§ããããé ããæ©ããç§ã¯åºç£ããªããã°ãªããªãããšãã声æããããŸããV.PãGolovinãF.VãZanichevãA.LãRastorguevãR.Vã SavkoãIãIãTuchkov
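The two-stage scheme above (collect candidates, then join them with heuristics) can be sketched as follows. This is a minimal illustration under stated assumptions, not Razdel's actual engine: the rule names, the JOIN constant and the English sample are all hypothetical.

```python
import re

JOIN = 'join'  # hypothetical verdict: "these two chunks belong together"


def initials_rule(left, right):
    # A period after a single capital letter looks like an initial.
    if re.search(r'(^|\s)[A-Z]\.$', left):
        return JOIN


def list_item_rule(left, right):
    # "1." or "a)" standing alone is a list bullet, not a sentence.
    if re.fullmatch(r'\s*(\d+|[a-z])[.)]', left):
        return JOIN


def lowercase_rule(left, right):
    # A sentence rarely starts with a lowercase letter.
    if right[:1].islower():
        return JOIN


RULES = [initials_rule, list_item_rule, lowercase_rule]


def segment(text, rules=RULES):
    # Stage 1: collect candidates, every [.?!)] followed by whitespace.
    chunks = re.split(r'(?<=[.?!)])\s+', text.strip())
    # Stage 2: walk over the candidates and merge wherever a rule votes JOIN.
    buffer = chunks[0]
    for right in chunks[1:]:
        if any(rule(buffer, right) == JOIN for rule in rules):
            buffer = buffer + ' ' + right
        else:
            yield buffer
            buffer = right
    yield buffer


sample = '1. Edited by B. A. Pokrovsky in the 17th c. and later. Done!'
sentences = list(segment(sample))
print(sentences)
```

Each rule looks only at the text to the left and right of one candidate, so adding a domain-specific rule means appending one more small function to the list.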
The procedure for tokens is similar, but the rules are different.
A fraction or a rational number
...ïŒ3âãâ6tïŒ...
Complex punctuation
-ã¯ããæ»äº¡ããŸãããâ.âãé·¹ã女ã®åãåãããŸããïŒâïŒ
There are no spaces around the hyphen, so it is not the start of direct speech
1980幎ã®çµããã«â-â-1990幎ã®åã
â-âBSâ-â3ã«æ³šæããŠãã ãã...
Everything that remains is treated as a token boundary.
1980幎代ã®çµããã«-xâ-å§ãŸã-1990-xâBS
-3âããã¯ââéç¥âãããã«äœã質éâïŒâ3.6âtâïŒâââ
ã¯ãããããŠæ»ãã ã ..âäºè§£âliâgirlãâsokolâïŒïŒ
Limitations
Razdel's rules are tuned for carefully written texts with correct punctuation. The solution works well on news articles and literary texts. On social network posts and transcripts of phone conversations the quality is lower. If there is no space between sentences, no period at the end, or a sentence starts with a lowercase letter, Razdel makes a mistake. The source code shows how to write rules for your own task; this topic is not yet covered in the documentation.
Slovnet: deep learning modeling for Russian natural language processing
Within the Natasha project, Slovnet handles training and inference of modern models for Russian NLP. The library contains high-quality compact models for named entity extraction and for morphological and syntactic parsing. On news texts, quality on all tasks is comparable to or better than other open solutions for Russian. Installation instructions and usage examples are in the Slovnet repository. Let us look in detail at how the NER solution is built; morphology and syntax follow the same pattern.
At the end of 2018, after Google's BERT paper, English-language NLP made a lot of progress. In 2019 the DeepPavlov team adapted multilingual BERT to Russian, and RuBERT appeared. A CRF head trained on top of it gave DeepPavlov BERT NER, the SOTA for Russian. The model's quality is excellent, with half as many errors as its closest pursuer, DeepPavlov NER, but its size and performance are frightening: 6GB of GPU RAM, a 2GB model, and 13 articles per second on a good GPU.
In 2020 the Natasha project managed to come close to DeepPavlov BERT NER in quality with a model that is 75 times smaller (27MB), uses 30 times less memory (205MB), and runs twice as fast on a CPU (25 articles/sec).
| | Natasha, Slovnet NER | DeepPavlov BERT NER |
| PER / LOC / ORG F1 by tokens, averaged over Collection5, factRuEval-2016, BSNLP-2019, Gareev | 0.97 / 0.91 / 0.85 | 0.98 / 0.92 / 0.86 |
| Model size | 27MB | 2GB |
| Memory consumption | 205MB | 6GB (GPU) |
| Performance, news articles per second (1 article ≈ 1KB) | 25 on CPU (Core i5) | 13 on GPU (RTX 2080 Ti), 1 on CPU |
| Initialization time, seconds | 1 | 35 |
| Library supports | Python 3.5+, PyPy3 | Python 3.6+ |
| Dependencies | NumPy | TensorFlow |
Slovnet NER is 1 percentage point behind the SOTA DeepPavlov BERT NER in quality, while the model is 75 times smaller, uses 30 times less memory, and runs twice as fast on a CPU. A comparison with SpaCy, PullEnti and other solutions for Russian NER is in the Slovnet repository.
How did we get this result? The short recipe:
Slovnet NER = Slovnet BERT NER (an analogue of DeepPavlov BERT NER) + distillation through synthetic markup (Nerus) into WordCNN-CRF with quantized embeddings (Navec) + an inference engine in NumPy.
Now step by step. The plan is as follows: train a heavy model with the BERT architecture on a small manually annotated dataset. Use it to mark up a news corpus, which gives a large, noisy synthetic training dataset. Train a compact, primitive model on that dataset. This process is called distillation: the heavy model is the teacher and the compact model is the student. We count on the BERT architecture being redundant for NER, so the compact model should not lose much quality compared with the heavy one.
Teacher model
DeepPavlov BERT NER consists of a RuBERT encoder and a CRF head. Our heavy teacher model repeats this architecture with minor improvements.
All benchmarks measure NER quality on news texts, so we continued training RuBERT on news. The Corus repository collects links to public Russian-language news corpora, 12GB of text in total. We used techniques from Facebook's RoBERTa paper: large aggregated batches, dynamic masking, and dropping next sentence prediction (NSP). RuBERT uses a huge vocabulary of 120,000 subtokens, a legacy of Google's multilingual BERT. We cut it down to the 50,000 subtokens most frequent in news, and coverage dropped by 5%. The result is NewsRuBERT, which predicts masked subtokens in news 5 percentage points better than RuBERT (63% top-1).
We trained the NewsRuBERT encoder and a CRF head on the 1000 articles of Collection5. The result is Slovnet BERT NER: its quality is 0.5 percentage points better than DeepPavlov BERT NER, the model is 4 times smaller (473MB), and it runs 3 times faster (40 articles per second).
NewsRuBERT = RuBERT + 12GB of news + techniques from RoBERTa + a 50K vocabulary.
Slovnet BERT NER (an analogue of DeepPavlov BERT NER) = NewsRuBERT + CRF head + Collection5.
Today it is customary to train BERT-like models with HuggingFace's Transformers. Transformers is 100,000 lines of Python code. When the loss explodes or inference produces garbage, it is hard to work out what went wrong. Admittedly, much of that code is duplicated; when training RoBERTa, the problem can be localized fairly quickly to about 3000 lines of code, but that is still a lot. With modern PyTorch the Transformers library is less essential.
With torch.nn.TransformerEncoderLayer, the code for a BERT-like model takes 100 lines:
import torch
from torch import nn
from torch.nn import functional as F


class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, seq_len, emb_dim, dropout=0.1, norm_eps=1e-12):
        super(BERTEmbedding, self).__init__()
        self.word = nn.Embedding(vocab_size, emb_dim)
        self.position = nn.Embedding(seq_len, emb_dim)
        self.norm = nn.LayerNorm(emb_dim, eps=norm_eps)
        self.drop = nn.Dropout(dropout)

    def forward(self, input):
        batch_size, seq_len = input.shape
        # position ids 0..seq_len-1, repeated for every item in the batch
        position = torch.arange(seq_len).expand_as(input).to(input.device)
        emb = self.word(input) + self.position(position)
        emb = self.norm(emb)
        return self.drop(emb)


def BERTLayer(emb_dim, heads_num, hidden_dim, dropout=0.1, norm_eps=1e-12):
    layer = nn.TransformerEncoderLayer(
        d_model=emb_dim,
        nhead=heads_num,
        dim_feedforward=hidden_dim,
        dropout=dropout,
        activation='gelu'
    )
    layer.norm1.eps = norm_eps
    layer.norm2.eps = norm_eps
    return layer


class BERTEncoder(nn.Module):
    def __init__(self, layers_num, emb_dim, heads_num, hidden_dim,
                 dropout=0.1, norm_eps=1e-12):
        super(BERTEncoder, self).__init__()
        self.layers = nn.ModuleList([
            BERTLayer(
                emb_dim, heads_num, hidden_dim,
                dropout, norm_eps
            )
            for _ in range(layers_num)
        ])

    def forward(self, input, pad_mask=None):
        input = input.transpose(0, 1)  # torch expects seq x batch x emb
        for layer in self.layers:
            input = layer(input, src_key_padding_mask=pad_mask)
        return input.transpose(0, 1)  # restore


class BERTMLMHead(nn.Module):
    def __init__(self, emb_dim, vocab_size, norm_eps=1e-12):
        super(BERTMLMHead, self).__init__()
        self.linear1 = nn.Linear(emb_dim, emb_dim)
        self.norm = nn.LayerNorm(emb_dim, eps=norm_eps)
        self.linear2 = nn.Linear(emb_dim, vocab_size)

    def forward(self, input):
        x = self.linear1(input)
        x = F.gelu(x)
        x = self.norm(x)
        return self.linear2(x)


class BERTMLM(nn.Module):
    def __init__(self, emb, encoder, head):
        super(BERTMLM, self).__init__()
        self.emb = emb
        self.encoder = encoder
        self.head = head

    def forward(self, input):
        x = self.emb(input)
        x = self.encoder(x)
        return self.head(x)
This is not a prototype; the code is copied from the Slovnet repository. Transformers is still worth reading: its authors do a great deal of work turning Arxiv papers into code, and the Python source is often clearer than the explanation in the paper.
Synthetic dataset
We marked up 700,000 articles from the Lenta.ru corpus with the heavy model and obtained a huge synthetic training dataset. The archive is available in the Nerus repository of the Natasha project. The markup is of very high quality; F1 by tokens is PER 99.7%, LOC 98.6%, ORG 97.2%. Rare examples of errors:
ORGââââââââââââââ LOCââââââââââââââââââââââââââââ
241- 4- 10-
<
LOCâââ LOCââââââ
>.
âââââââââââ~~~~~~~~~~~
ORGââââââââââââââââââââ~~~~~~~~~~~~~~~~
.
LOCâââ
<>
~~~~~~~~ LOCââââââââââââââââââ
.
~~~~ ~~~~~~ LOCâââ
.
LOCââââ
-
PERâââââââââââââââââââââ
M&A.
~~~
:
~~~~~~~~~~~~ORGâââ LOCââ
,
PERâââââââ LOCâââ
,
ORGâ LOCâââââââââââââ
.
LOC
Student model
Choosing the architecture of the heavy teacher model was not a problem: the only option was a transformer. The compact student model is harder, because there are many options. Between 2013 and 2018, from the appearance of word2vec to the BERT paper, a whole series of neural network architectures for NER was invented. They all share a common scheme:
Scheme of neural network architectures for the NER task: token encoder, context encoder, tag decoder. The abbreviations are explained in the survey article by Yang (2018).
There are many possible combinations. Which one should we choose? For example, (CharCNN + Embedding)-WordBiLSTM-CRF is the model scheme from the DeepPavlov NER paper, the SOTA for Russian until 2019.
We skip the CharCNN and CharRNN options: running a small neural network over the characters of every token is too slow for us. We would also like to avoid WordRNN, because the solution has to work on a CPU and multiply matrices quickly for each token. For NER, the choice between Linear and CRF is nominal: we use BIO encoding, so the order of tags matters, and we have to accept the heavy slowdown and use CRF. One option remains: Embedding-WordCNN-CRF. Such a model ignores letter case, which matters for NER: the Russian word for "hope" in lowercase is just a word, but capitalized it may be the name Nadezhda. So we add ShapeEmbedding, an embedding of token outlines, for example: "NER" is EN_XX, "Vainovich" is RU_Xx, "!" is PUNCT_!, "and" is RU_x, "5.1" is NUM, "New York" is RU_Xx-Xx (the RU examples are Russian words in the original). The Slovnet NER scheme is (WordEmbedding + ShapeEmbedding)-WordCNN-CRF.
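The token outlines used by ShapeEmbedding can be illustrated with a small function. This is a sketch that reproduces the labels quoted above; the function names and the exact case rules are assumptions, not Slovnet's actual implementation.

```python
import re


def part_outline(part):
    """Collapse the letter case of one hyphen-separated part."""
    if len(part) == 1:
        return 'X' if part.isupper() else 'x'
    if part.isupper():
        return 'XX'
    if part.islower():
        return 'xx'
    if part[0].isupper() and part[1:].islower():
        return 'Xx'
    return 'xX'  # any other mixed case


def token_shape(token):
    """Map a token to a coarse outline such as EN_XX or RU_Xx-Xx."""
    if re.fullmatch(r'[0-9]+(?:[.,][0-9]+)?', token):
        return 'NUM'
    if re.fullmatch(r'[A-Za-z]+(?:-[A-Za-z]+)*', token):
        lang = 'EN'
    elif re.fullmatch(r'[А-Яа-яЁё]+(?:-[А-Яа-яЁё]+)*', token):
        lang = 'RU'
    elif len(token) == 1:
        return 'PUNCT_' + token
    else:
        return 'OTHER'  # hypothetical fallback label
    outline = '-'.join(part_outline(part) for part in token.split('-'))
    return lang + '_' + outline


shapes = [token_shape(t) for t in ['NER', 'Вайнович', '!', 'и', '5.1', 'Нью-Йорк']]
print(shapes)
```

The number of distinct outlines is small, so the corresponding embedding table is tiny compared with the word embedding table, yet it restores the case information that a lowercase vocabulary loses.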
Distillation
We train Slovnet NER on the huge synthetic dataset and compare the result with the heavy teacher model, Slovnet BERT NER. Quality is computed and averaged over the manually annotated Collection5, Gareev, factRuEval-2016 and BSNLP-2019. The size of the training set matters a great deal: with 250 news articles (the size of factRuEval-2016) the average F1 over PER, LOC and ORG is 0.64, with 1000 (comparable to Collection5) it is 0.81, and on the whole dataset it is 0.91. The quality of Slovnet BERT NER is 0.92.
Slovnet NER quality as a function of the number of synthetic training examples. The gray line is the quality of Slovnet BERT NER. Slovnet NER never sees manually annotated examples; it is trained on synthetic data only.
The primitive student model is 1 percentage point behind the heavy teacher model. That is a remarkable result, and a universal recipe suggests itself:
Mark up a small amount of data by hand. Train a heavy transformer. Generate a lot of synthetic data. Train a simple model on the large sample. You get the quality of a transformer with the size and performance of a simple model.
The Slovnet library has two more models trained with this recipe: Slovnet Morph, a morphological tagger, and Slovnet Syntax, a syntactic parser. Slovnet Morph is 2 percentage points behind its heavy teacher model and Slovnet Syntax is 5 behind. Both models have better quality and performance than existing Russian solutions for news articles.
Quantization
Slovnet NER is 289MB in size, and 287MB of that is the embedding table. The model uses a large vocabulary of 250,000 rows, which covers 98% of the words in news texts. We use quantization and replace 300-dimensional float vectors with 100-dimensional 8-bit vectors. The model becomes 10 times smaller (27MB) and quality does not change. The Navec library is part of the Natasha project and is a collection of quantized pretrained embeddings. The weights trained on fiction take 50MB and, by synthetic evaluations, outperform all static RusVectores models.
Inference
Slovnet NER uses PyTorch for training. The PyTorch package weighs 700MB, and nobody wants to drag it into production just for inference. PyTorch also does not work with the PyPy interpreter. Slovnet is used together with the Yargy parser, an analogue of Yandex's Tomita parser. With PyPy, Yargy runs 2-10 times faster, depending on the complexity of the grammars. We do not want to lose that speed because of a PyTorch dependency.
The standard solution is to use TorchScript, or to convert the model to ONNX and run inference in ONNXRuntime. However, Slovnet NER uses non-standard blocks (quantized embeddings, a CRF decoder), and neither TorchScript nor ONNXRuntime supports PyPy.
Slovnet NER is a simple model, so we implement all of its blocks by hand in NumPy and use the weights computed by PyTorch. With a bit of NumPy magic we carefully implement the CNN block and the CRF decoder; unpacking the quantized embeddings takes 5 lines. Inference speed on a CPU is the same as with ONNXRuntime and PyTorch: 25 news articles per second on a Core i5.
The technique also works for more complex models: Slovnet Morph and Slovnet Syntax are implemented in NumPy as well. Slovnet NER, Morph and Syntax share a common embedding table. We moved its weights into a separate file, so the table is not duplicated in memory or on disk:
>>> from navec import Navec
>>> from slovnet import Morph, Syntax, NER
>>> navec = Navec.load('navec_news_v1_1B.tar')  # 25MB
>>> morph = Morph.load('slovnet_morph_news_v1.tar')  # 2MB
>>> syntax = Syntax.load('slovnet_syntax_news_v1.tar')  # 3MB
>>> ner = NER.load('slovnet_ner_news_v1.tar')  # 2MB
# 25 + 2 + 3 + 2 instead of 25+2 + 25+3 + 25+2
>>> morph.navec(navec)
>>> syntax.navec(navec)
>>> ner.navec(navec)
Limitations
Natasha extracts the standard entities: person names, place names and organizations. The solution shows good quality on news. What about other entities and other kinds of text? A new model has to be trained, and that is not easy: the price of compact size and speed is the complexity of preparing the model. The repository provides a notebook script for preparing the heavy teacher model, a notebook script for the student model, and instructions for preparing quantized embeddings.
Navec: compact embeddings for Russian
Compact models are convenient to work with. They start quickly and use little memory, so more parallel processes fit on one instance.
In NLP, 80-90% of a model's weights sit in the embedding table. The Navec library is part of the Natasha project and is a collection of pretrained embeddings for Russian. On intrinsic quality metrics they fall slightly short of the top RusVectores solutions, but the archive with weights is 5-6 times smaller (51MB) and the vocabulary is 2-3 times larger (500K words).
| | Quality* | Model size, MB | Vocabulary size, ×10³ |
| Navec | 0.719 | 50.6 | 500 |
| RusVectores | 0.638-0.726 | 220.6-290.7 | 189-249 |
We are talking about good old word embeddings, which revolutionized NLP in 2013. The technology is still relevant today. In the Natasha project, the models for morphology, syntax and named entity extraction run on Navec word embeddings and show better quality than other open solutions.
RusVectores
For Russian it is customary to use pretrained embeddings from RusVectores. They have an unpleasant feature: the table contains not words but "word_POS-tag" pairs. The idea is a good one. In Russian the same word form means both "to bake" and "stove", so for the pair "oven_VERB" we expect vectors similar to "cook_VERB" and "boil_VERB", and for "oven_NOUN" vectors similar to "hut_NOUN" and "furnace_NOUN".
In practice such embeddings are inconvenient. Splitting the text into tokens is not enough; a POS tag has to be determined for each token somehow. The embedding table also bloats: instead of the single word "become" we store 6 entries, 2 reasonable ones ("become_VERB", "become_NOUN") and 4 strange ones ("become_ADV", "become_PROPN", "become_NUM", "become_ADJ"). A table of 250,000 entries contains only 195,000 unique words.
Quality
We evaluate embedding quality on the semantic similarity task. Take a pair of words, find the embedding vector for each, and compute the cosine similarity. For the similar words "cup" and "jug" Navec returns 0.49; for "fruit" and "oven" it returns 0.0047. Then collect many pairs with reference similarity scores and compute the Spearman correlation between those scores and our answers.
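The evaluation described above can be sketched in plain Python. The two-dimensional vectors and the reference scores below are made-up values for illustration only; real Navec vectors have 300 dimensions, and this Spearman formula assumes there are no tied scores.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def ranks(values):
    """Rank 1 for the smallest value; assumes all values are distinct."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for samples without ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical toy embeddings and human similarity scores.
vectors = {
    'cup': [1.0, 0.0],
    'jug': [0.8, 0.6],
    'fruit': [0.0, 1.0],
    'oven': [0.6, -0.8],
}
pairs = [('cup', 'jug', 0.9), ('fruit', 'oven', 0.1), ('cup', 'fruit', 0.3)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [score for _, _, score in pairs]
correlation = spearman(model_scores, human_scores)
print(model_scores, correlation)
```

Only the ordering of the pairs matters for the rank correlation, which is why absolute cosine values such as 0.49 can be compared with human scores given on a different scale.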
The authors of RusVectores use SimLex965, a small, carefully checked and corrected test list of pairs. We add the new Yandex LRWC and the datasets from the RUSSE project: HJ, RT, AE, AE2.
| | | Average quality over 6 datasets | Load time, seconds | Model size, MB | Vocabulary size, ×10³ |
| Navec | hudlit_12B_500K_300d_100q | 0.719 | 1.0 | 50.6 | 500 |
| | news_1B_250K_300d_100q | 0.653 | 0.5 | 25.4 | 250 |
| RusVectores | ruscorpora_upos_cbow_300_20_2019 | 0.692 | 3.3 | 220.6 | 189 |
| | ruwikiruscorpora_upos_skipgram_300_2_2019 | 0.691 | 5.0 | 290.0 | 248 |
| | tayga_upos_skipgram_300_2_2019 | 0.726 | 5.2 | 290.7 | 249 |
| | tayga_none_fasttextcbow_300_10_2019 | 0.638 | 8.0 | 2741.9 | 192 |
| | araneum_none_fasttextcbow_300_5_2018 | 0.664 | 16.4 | 2752.1 | 195 |
The quality of hudlit_12B_500K_300d_100q is comparable to or better than the RusVectores solutions, its vocabulary is 2-3 times larger, and the model is 5-6 times smaller. How did we get this quality and size?
How it works
hudlit_12B_500K_300d_100q is a set of GloVe embeddings trained on 145GB of fiction. We took the archive with texts from the RUSSE project and used the original GloVe implementation in C, wrapped in a convenient Python interface.
Why not word2vec? Experiments on a large dataset are faster with GloVe: the co-occurrence matrix is computed once and then used to prepare embeddings of different dimensions, from which the best option is chosen.
Why not fastText? In the Natasha project we work with news texts. They contain few typos, and the problem of OOV tokens is solved by a large vocabulary. The 250,000 rows of the news_1B_250K_300d_100q table cover 98% of the words in news articles.
The vocabulary of hudlit_12B_500K_300d_100q has 500,000 entries and covers 98% of the words in fiction texts. The optimal vector dimension is 300. A 500,000 × 300 table of floats takes 578MB, yet the archive with the hudlit_12B_500K_300d_100q weights is 12 times smaller (48MB). The reason is quantization.
Quantization
We replace 32-bit floats with 8-bit codes: [−∞, −0.86) is code 0, [−0.86, −0.79) is code 1, [−0.79, −0.74) is code 2, …, [0.86, ∞) is code 255. The table becomes 4 times smaller (143MB).
Before:
-0.220 -0.071 0.320 -0.279 0.376 0.409 0.340 -0.329 0.400
0.046 0.870 -0.163 0.075 0.198 -0.357 -0.279 0.267 0.239
0.111 0.057 0.746 -0.240 -0.254 0.504 0.202 0.212 0.570
0.529 0.088 0.444 -0.005 -0.003 -0.350 -0.001 0.472 0.635
ââââââ ââââââ
-0.170 0.677 0.212 0.202 -0.030 0.279 0.229 -0.475 -0.031
ââââââ ââââââ
After:
63 105 215 49 225 230 219 39 228
143 255 78 152 187 34 49 204 198
163 146 253 58 55 240 188 191 246
243 155 234 127 127 35 128 237 249
âââ âââ
76 251 191 188 118 207 195 18 118
âââ âââ
The data becomes coarser: the different values -0.005 and -0.003 are replaced by the same code 127, and -0.030 and -0.031 by code 118.
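The idea of replacing floats with bin codes can be sketched as follows. This toy example uses 4 codes instead of 256, and the bin edges and centers are made-up values; how Navec actually chooses its bins is not shown here.

```python
from bisect import bisect_right

# Toy codebook with 4 codes instead of 256 (illustrative values only).
EDGES = [-0.5, 0.0, 0.5]               # boundaries between neighbouring bins
CENTERS = [-0.75, -0.25, 0.25, 0.75]   # value restored for each code


def encode(value):
    """Return the code of the half-open bin [edge_i, edge_i+1) holding value."""
    return bisect_right(EDGES, value)


def decode(code):
    """Restore an approximate float from its code."""
    return CENTERS[code]


codes = [encode(v) for v in (-0.005, -0.003, 0.3, 0.9)]
restored = [decode(c) for c in codes]
print(codes, restored)
```

As in the text, nearby values such as -0.005 and -0.003 collapse into one code, so the restored table is only an approximation of the original one.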
Next we replace not one number but three with a single code. Using the k-means algorithm, we cluster all triples of numbers from the embedding table into 256 clusters and store a code from 0 to 255 in place of each triple. The table shrinks by another factor of 3 (48MB). Navec uses the PQk-means library, which splits the matrix into 100 columns and clusters each one separately; quality on synthetic tests drops by 1 percentage point. Quantization is explained clearly in the article Product Quantizers for k-NN.
Quantized embeddings are slower than regular ones, because a compressed vector has to be unpacked before use. We implemented the procedure carefully with NumPy magic, and in PyTorch we use torch.gather. In Slovnet NER, access to the embedding table takes 0.1% of the total computation time. The NavecEmbedding module from the Slovnet library integrates Navec into PyTorch models:
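Encoding and unpacking with a product quantizer can be sketched in plain Python. The codebooks here are hand-written toy values with 2 centroids per sub-space; in the scheme described above they would come from k-means with 256 centroids for each of the 100 sub-spaces. All names are hypothetical.

```python
def chunks(vector, size=3):
    """Split a vector into consecutive sub-vectors of the given size."""
    return [vector[i:i + size] for i in range(0, len(vector), size)]


def nearest(chunk, centroids):
    """Index of the centroid closest to chunk (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(chunk, centroid))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))


def pq_encode(vector, codebooks):
    """One small integer code per sub-vector."""
    return [nearest(chunk, book) for chunk, book in zip(chunks(vector), codebooks)]


def pq_decode(codes, codebooks):
    """Unpack codes by concatenating the chosen centroids."""
    vector = []
    for code, book in zip(codes, codebooks):
        vector.extend(book[code])
    return vector


# Toy codebooks for a 6-dimensional vector: 2 sub-spaces, 2 centroids each.
codebooks = [
    [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    [[0.0, 0.0, 0.0], [-1.0, -1.0, -1.0]],
]
vector = [0.9, 1.1, 1.0, -0.1, 0.0, 0.2]
codes = pq_encode(vector, codebooks)
unpacked = pq_decode(codes, codebooks)
print(codes, unpacked)
```

With 3 floats per code, a 300-dimensional vector of 32-bit floats (1200 bytes) is stored as 100 one-byte codes, which matches the roughly 12-fold reduction quoted in the text.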
>>> import torch
>>> from navec import Navec
>>> from slovnet.model.emb import NavecEmbedding
>>> path = 'hudlit_12B_500K_300d_100q.tar' # 51MB
>>> navec = Navec.load(path) # ~1 sec, ~100MB RAM
>>> words = ['', '<unk>', '<pad>']
>>> ids = [navec.vocab[_] for _ in words]
>>> emb = NavecEmbedding(navec)
>>> input = torch.tensor(ids)
>>> emb(input) # 3 x 300
tensor([[ 4.2000e-01, 3.6666e-01, 1.7728e-01,
[ 1.6954e-01, -4.6063e-01, 5.4519e-01,
[ 0.0000e+00, 0.0000e+00, 0.0000e+00,
...
Nerus: a large synthetic dataset with markup of morphology, syntax and named entities
In the Natasha project, morphological and syntactic analysis and named entity extraction are performed by 3 compact models: Slovnet NER, Slovnet Morph and Slovnet Syntax. Their quality is 1-5 percentage points below that of the heavy analogues with the BERT architecture, their size is 50-75 times smaller, and they run twice as fast on a CPU. The models are trained on the huge synthetic Nerus dataset, an archive of 700,000 news articles with CoNLL-U markup of morphology, syntax and named entities:
# newdoc id = 0
# sent_id = 0_0
# text = - , ...
1 - _ NOUN _ Animacy=Anim|C... 7 nsubj _ Tag=O
2 _ ADP _ _ 4 case _ Tag=O
3 _ ADJ _ Case=Dat|Degre... 4 amod _ Tag=O
4 _ NOUN _ Animacy=Inan|C... 1 nmod _ Tag=O
5 _ PROPN _ Animacy=Anim|C... 1 appos _ Tag=B-PER
6 _ PROPN _ Animacy=Anim|C... 5 flat:name _ Tag=I-PER
7 _ VERB _ Aspect=Perf|Ge... 0 root _ Tag=O
8 , _ PUNCT _ _ 13 punct _ Tag=O
9 _ ADP _ _ 11 case _ Tag=O
10 _ DET _ Case=Loc|Numbe... 11 det _ Tag=O
11 _ NOUN _ Animacy=Inan|C... 13 obl _ Tag=O
12 _ PROPN _ Animacy=Inan|C... 11 nmod _ Tag=B-LOC
13 _ VERB _ Aspect=Perf|Ge... 7 ccomp _ Tag=O
14 _ ADV _ Degree=Pos 15 advmod _ Tag=O
15 _ ADJ _ Case=Nom|Degre... 16 amod _ Tag=O
16 _ NOUN _ Animacy=Inan|C... 13 nsubj _ Tag=O
17 _ ADP _ _ 18 case _ Tag=O
18 _ NOUN _ Animacy=Inan|C... 16 nmod _ Tag=O
19 , _ PUNCT _ _ 20 punct _ Tag=O
20 _ VERB _ Aspect=Imp|Moo... 0 root _ Tag=O
21 _ PROPN _ Animacy=Inan|C... 20 nsubj _ Tag=B-ORG
22 _ PROPN _ Animacy=Inan|C... 21 appos _ Tag=I-ORG
23 . _ PUNCT _ _ 20 punct _ Tag=O
# sent_id = 0_1
# text = , , , ...
1 _ ADP _ _ 2 case _ Tag=O
2 _ NOUN _ Animacy=Inan|C... 9 parataxis _ Tag=O
...
Slovnet NER, Morph and Syntax are primitive models. With 1000 examples in the training set, Slovnet NER is 11 percentage points behind its heavy BERT analogue; with 10,000 examples it is 3 points behind, and with 500,000 it is 1 point behind. Nerus is the output of the heavy models with the BERT architecture: Slovnet BERT NER, Slovnet BERT Morph and Slovnet BERT Syntax. Processing 700,000 news articles takes 20 hours on a Tesla V100. We save other researchers that time by publishing the finished archive in open access. The SpaCy-Ru project trains high-quality Russian models for SpaCy on Nerus and is preparing a patch for the official repository.
The synthetic markup is of high quality. Accuracy of morphological tags is 98% and of syntactic links 96%. For NER, F1 by tokens is PER 99%, LOC 98%, ORG 97%. To evaluate quality we marked up SynTagRus, Collection5 and the news slice of GramEval2020 and compared the reference markup with ours; details are in the Nerus repository. The syntactic markup does contain errors: there are cycles and multiple roots, and POS tags do not always match the syntactic edges. It is useful to run the Universal Dependencies validator and skip such examples.
The Nerus Python package provides a convenient interface for loading and visualizing the markup:
>>> from nerus import load_nerus
>>> docs = load_nerus('nerus_lenta.conllu.gz')
>>> doc = next(docs)
>>> doc
NerusDoc(
id='0',
sents=[NerusSent(
id='0_0',
text='- , ...',
tokens=[NerusToken(
id='1',
text='-',
pos='NOUN',
feats={'Animacy': 'Anim',
'Case': 'Nom',
'Gender': 'Masc',
'Number': 'Sing'},
head_id='7',
rel='nsubj',
tag='O'
),
NerusToken(
id='2',
text='',
pos='ADP',
...
>>> doc.ner.print()
- ,
PERâââââââââââââ LOCâââ
, . ,
ORGââââââââ PERââââââ
...
â
>>> sent = doc.sents[0]
>>> sent.morph.print()
- NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
ADP
ADJ|Case=Dat|Degree=Pos|Number=Plur
NOUN|Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur
PROPN|Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing
PROPN|Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing
VERB|Aspect=Perf|Gender=Fem|Mood=Ind|Number=Sing
...
>>> sent.syntax.print()
ââºââââââââ - nsubj
â â â âââ⺠case
â â â â â⺠amod
â â ââºââââ nmod
â ââââââºââ appos
â â⺠flat:name
ââââââââââââ
â âââââââ⺠, punct
â â âââ⺠case
â â â â⺠det
â â ââºââââ obl
â â â âââ⺠nmod
ââââºââââââââ ccomp
â â⺠advmod
â ââºââ amod
ââºââââââ nsubj:pass
â â⺠case
ââââºââ nmod
â⺠, punct
ââââââ
â ââºââ nsubj
â â⺠appos
âââââ⺠. punct
Installation instructions, usage examples and quality evaluations are in the Nerus repository.
Corus: a collection of links to public Russian-language datasets plus loader functions
The Corus library is part of the Natasha project. It is a collection of links to public Russian-language NLP datasets plus a Python package with loader functions. The list of links to sources, installation instructions and usage examples are in the Corus repository.
>>> from corus import load_lenta
# Corus Lenta.ru, :
# wget https://github.com/yutkin/Lenta.Ru-News-Dataset/...
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path) # 2, 750 000
>>> next(records)
LentaRecord(
url='https://lenta.ru/news/2018/12/14/cancer/',
title=' \xa0 ...',
text='- ...',
topic='',
tags=''
)
Useful open datasets for Russian are hidden so well that few people know about them.
Examples
Corpus of news articles
Suppose we want to train a language model on news articles; that takes a lot of text. The first thing that comes to mind is the news slice of the Taiga dataset (~1 GB). Many people know about the Lenta.ru dump (2 GB). Other sources are harder to find. In 2019, the Dialogue conference held a headline generation competition, for which the organizers prepared a dump of RIA Novosti covering 4 years (3.7 GB). In 2018, Yuri Baburov published a scrape of 40 Russian-language news sites (7.5 GB). Volunteers from ODS shared archives (7 GB) collected for a project on analyzing the news agenda.
In the Corus registry, links to these datasets carry the "news" tag, and every source has a loader function:
load_taiga_*, load_lenta, load_ria, load_buriy_*, load_ods_*.
NER
Suppose we want to train a NER model for Russian; that takes annotated texts. The first thing that comes to mind is the data from the factRuEval-2016 competition. Its markup has drawbacks: a complex format, overlapping entity spans, and an ambiguous "LocOrg" category. Not everyone knows about the NamedEntities5 collection, the successor to Persons-1000. Its markup is in a standard format and the spans do not overlap, a beauty! The other three sources are known only to the most devoted fans of Russian NER. Write an email to Rinat Gareev with a link to his 2013 paper attached, and in reply you get 250 news articles with annotated names and organizations. BSNLP-2019, a competition on NER for Slavic languages, took place in 2019; write to the organizers and you get another 450 annotated texts. The WiNER project came up with the idea of building semi-automatic NER markup from Wikipedia dumps, and a large dump for Russian is available on GitHub.
The Corus registry has links and loader functions for these sources:
load_factru, load_ne5, load_gareev, load_bsnlp, load_wikiner.
Collection of links
Before a source gets a loader and enters the registry, links to sources accumulate in the issue tracker. The collection holds 30 datasets, including a new version of Taiga, 568 GB of Russian text from Common Crawl, and reviews from Banki.ru and Auto.ru. You are welcome to share your findings and open issues with links.
Loader functions
Code for a simple dataset is easy to write yourself. The Lenta.ru dump is well formed, so the implementation is straightforward. Taiga consists of about 15 million CoNLL-U files packed into zip archives. For loading to be fast, use little memory, and not wreck the file system, you have to go to some trouble and carefully implement low-level work with zip files.
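To give a feel for the care involved, here is a minimal sketch of streaming text files out of a zip archive one at a time with the standard `zipfile` module, without unpacking anything to disk. It is not the Corus implementation, which the text says works with zip files at a lower level. The function name `iter_zip_texts` and the `.txt` suffix filter are assumptions made for this example.

```python
import zipfile


def iter_zip_texts(source, suffix='.txt', encoding='utf8'):
    """Lazily yield (name, text) pairs for matching files in a zip archive.

    Only one member is decompressed at a time, and nothing is extracted
    to the file system.
    """
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            # Skip directories and files with other extensions.
            if info.is_dir() or not info.filename.endswith(suffix):
                continue
            with archive.open(info) as file:
                yield info.filename, file.read().decode(encoding)
```

Note that `zipfile` still reads the whole central directory into memory when the archive is opened, which is one reason a hand-written reader can pay off for archives with millions of entries.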
The Corus Python package has loader functions for 35 sources. The interface for accessing Taiga is no more complicated than the one for the Lenta.ru dump:
>>> from corus import load_taiga_proza_metas, load_taiga_proza
>>> path = 'taiga/proza_ru.zip'
>>> metas = load_taiga_proza_metas(path)
>>> records = load_taiga_proza(path, metas)
>>> next(records)
TaigaRecord(
id='20151231005',
meta=Meta(
id='20151231005',
timestamp=datetime.datetime(2015, 12, 31, 23, 40),
genre=' ',
topic='',
author=Author(
name='',
readers=7973,
texts=92681,
url='http://www.proza.ru/avtor/sadshoot'
),
title=' !',
url='http://www.proza.ru/2015/12/31/1875'
),
text='... ...\n... ..\n...
)
Users are invited to make pull requests and send in their own loader functions; there is a short guide in the Corus repository.
Naeval - quantitative comparison of systems for Russian-language NLP
Natasha is not a scientific project, and beating SOTA is not a goal, but it is important to check quality on public benchmarks and try to rank high without losing much in performance. In academia the usual approach is to measure quality, get a number, take tables from other papers, and compare those numbers with your own. This scheme has two problems:
- Performance is forgotten. Model size and speed are not compared; the focus is on quality alone.
- The code is not published. Computing a quality metric usually involves a million nuances. How exactly was it computed in other papers? Unknown.
Naeval is a set of scripts for evaluating the quality and speed of open-source tools for Russian natural language processing. It is part of the Natasha project.
| Task | Datasets | Solutions |
| Tokenization | SynTagRus, OpenCorpora, GICRYA, RNC | SpaCy, NLTK, MyStem, Moses, SegTok, SpaCy Russian Tokenizer, RuTokenizer, Razdel |
| Sentence segmentation | SynTagRus, OpenCorpora, GICRYA, RNC | SegTok, Moses, NLTK, RuSentTokenizer, Razdel |
| Embeddings | SimLex965, HJ, LRWC, RT, AE, AE2 | RusVectores, Navec |
| Morphology analysis | GramEval2020 (SynTagRus, GSD, Lenta.ru, Taiga) | DeepPavlov Morph, DeepPavlov BERT Morph, RuPosTagger, RNNMorph, Maru, UDPipe, SpaCy, Stanza, Slovnet Morph, Slovnet BERT Morph |
| Syntax analysis | GramEval2020 (SynTagRus, GSD, Lenta.ru, Taiga) | DeepPavlov BERT Syntax, UDPipe, SpaCy, Stanza, Slovnet Syntax, Slovnet BERT Syntax |
| NER | factRuEval-2016, Collection5, Gareev, BSNLP-2019, WiNER | DeepPavlov NER, DeepPavlov BERT NER, DeepPavlov Slavic BERT NER, PullEnti, SpaCy, Stanza, Texterra, Tomita, MITIE, Slovnet NER, Slovnet BERT NER |
Let's take a closer look at the NER task.
Datasets
There are 5 public benchmarks for Russian NER: factRuEval-2016, Collection5, Gareev, BSNLP-2019, and WiNER. Links to the sources are collected in the Corus registry. All the datasets consist of news articles in which substrings with person names, organization names, and toponyms are marked. What could be simpler?
Every source has its own markup format. Collection5 uses the standoff format of the Brat utility, Gareev and WiNER use different dialects of BIO markup, BSNLP-2019 has its own format, and factRuEval-2016 has its own non-trivial specification too. Naeval converts all sources to a common format. The markup consists of spans. A span is a triple: entity type, start of the substring, and end of the substring.
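As an illustration of converting one of these formats to span triples, here is a minimal sketch for a generic BIO dialect. It is not Naeval's code. The `Span` class, the `bio_to_spans` name, and the input shape (tokens as `(text, start, stop)` tuples with tags such as `B-PER`, `I-PER`, `O`) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    type: str   # entity type, e.g. 'PER'
    start: int  # offset of the first character
    stop: int   # offset just past the last character


def bio_to_spans(tokens, tags):
    """Convert BIO tags over (text, start, stop) tokens into Span triples."""
    spans = []
    current = None  # [type, start, stop] of the span being built
    for (_, start, stop), tag in zip(tokens, tags):
        prefix, _, label = tag.partition('-')
        if prefix == 'I' and current and current[0] == label:
            current[2] = stop  # extend the open span
            continue
        if current:
            spans.append(Span(*current))
            current = None
        if prefix in ('B', 'I'):
            # A stray I- tag without a matching B- starts a new span.
            current = [label, start, stop]
    if current:
        spans.append(Span(*current))
    return spans
```

Treating a stray `I-` tag as the start of a span is one possible way to cope with dialect differences; a stricter converter could raise an error instead.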
Entity types. factRuEval-2016 and Collection5 separately mark entities that are half toponym, half organization: "Kremlin", "EU", "USSR". BSNLP-2019 and WiNER mark event names: "Russian Championship", "Brexit". Naeval adapts some tags and removes others, keeping the reference tags PER, LOC, and ORG: names of people, toponyms, and organizations.
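The adaptation of entity types can be pictured as a lookup table. The sketch below is hypothetical: the tag names `GEOPOLIT` and `EVENT` and the choice to map the half-toponym category to `LOC` are assumptions for illustration, and the actual Naeval rules may differ per dataset.

```python
# Hypothetical mapping for illustration; the real Naeval rules may differ.
TYPE_MAP = {
    'PER': 'PER',
    'LOC': 'LOC',
    'ORG': 'ORG',
    'GEOPOLIT': 'LOC',  # assumption: "Kremlin"/"EU"-like entities become LOC
    'EVENT': None,      # dropped: outside the PER/LOC/ORG reference set
}


def adapt_types(spans, mapping=TYPE_MAP):
    """Map (type, start, stop) spans to reference types, dropping the rest."""
    adapted = []
    for label, start, stop in spans:
        target = mapping.get(label)  # unknown tags are dropped as well
        if target is not None:
            adapted.append((target, start, stop))
    return adapted
```

Keeping the mapping in a plain dictionary makes each dataset's decisions easy to read and review in one place.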
Nested spans. In factRuEval-2016, spans overlap. Naeval simplifies the markup:
Before:
, 5 Retail Group,
org_name───────
Org────────────
"", "" "",
org_descr───── org_name─ org_name─── org_name
Org──────────────────────
org_descr─────
Org─────────────────────────────────────
org_descr─────
Org──────────────────────────────────────────────────
, .
After:
, 5 Retail Group,
ORG────────────
"", "" "",
ORG────── ORG──────── ORG─────
, .
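One simple way to get from overlapping spans to a flat markup like the one above is to prefer shorter spans, so that a long span wrapping several names gives way to the names inside it. The sketch below shows that heuristic only. It is an assumption about how such simplification could work, not a description of Naeval's algorithm, and `drop_nested` is a hypothetical helper.

```python
def drop_nested(spans):
    """Keep a non-overlapping subset of (type, start, stop) spans.

    Shorter spans win, so a long span that wraps several short ones is
    dropped in favour of the spans inside it.
    """
    kept = []
    by_length = sorted(spans, key=lambda span: (span[2] - span[1], span[1]))
    for span in by_length:
        _, start, stop = span
        # Accept the span only if it is disjoint from everything kept so far.
        if all(stop <= other[1] or other[2] <= start for other in kept):
            kept.append(span)
    return sorted(kept, key=lambda span: span[1])
```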
Models
Naeval compares 12 open-source solutions for Russian NER. Each tool is wrapped in a Docker container with a web interface:
$ docker run -p 8080:8080 natasha/tomita-algfio
2020-07-02 11:09:19 BIN: 'tomita-linux64', CONFIG: 'algfio'
2020-07-02 11:09:19 Listening http://0.0.0.0:8080
$ curl -X POST http://localhost:8080 --data \
' \
\
'
<document url="" di="5" bi="-1" date="2020-07-02">
<facts>
<Person pos="18" len="16" sn="0" fw="2" lw="3">
<Name_Surname val="" />
<Name_FirstName val="" />
<Name_SurnameIsDictionary val="1" />
</Person>
<Person pos="67" len="14" sn="0" fw="8" lw="9">
<Name_Surname val="" />
<Name_FirstName val="" />
<Name_SurnameIsDictionary val="1" />
</Person>
</facts>
</document>
Some solutions are so hard to launch and configure that few people use them. PullEnti, a sophisticated rule-based system, took first place in the factRuEval competition in 2016. The tool is distributed as an SDK for C#. Work on Naeval turned into a separate project with a set of wrappers for PullEnti: PullentiServer is a web server in C#, and pullenti-client is a Python client for PullentiServer:
$ docker run -p 8080:8080 pullenti/pullenti-server
2020-07-02 11:42:02 [INFO] Init Pullenti v3.21 ...
2020-07-02 11:42:02 [INFO] Load lang: ru, en
2020-07-02 11:42:03 [INFO] Load analyzer: geo, org, person
2020-07-02 11:42:05 [INFO] Listen prefix: http://*:8080/
>>> from pullenti_client import Client
>>> client = Client('localhost', 8080)
>>> text = ' ' \
... ' ' \
... ' '
>>> result = client(text)
>>> result.graph
Every tool has a slightly different markup format. Naeval loads the results, adapts the entity types, and simplifies the span structure:
Before (PullEnti):
, 19
ORGANIZATION──────────
GEO─────────
PERSON────────────────
PERSONPROPERTY───────
──────────────── PERSON───────────────────────
PERSONPROPERTY──────────────
ORGANIZATION───
.
────────────────
After:
, 19
ORG────── LOC─────────
PER───────────── ORG────────────
.
PER─────────────
PullEnti's output is harder to adapt than the factRuEval-2016 markup. The algorithm removes the PERSONPROPERTY tag and splits nested PERSON, ORGANIZATION, and GEO spans into non-overlapping PER, LOC, and ORG.
Comparison
For each (model, dataset) pair, Naeval computes the token-level F1 measure and publishes a table of quality scores.
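For readers unfamiliar with the metric, here is a minimal sketch of token-level F1 for a single entity type. It assumes both markups have already been projected onto the same tokenization as per-token labels. The `token_f1` function is written for this example and is not Naeval's scoring code, which may handle details differently.

```python
def token_f1(reference, predicted, label):
    """Token-level F1 for one entity type.

    reference and predicted are equally long lists of per-token labels
    such as 'PER', 'LOC', 'ORG' or 'O'.
    """
    if len(reference) != len(predicted):
        raise ValueError('label sequences must have the same length')
    pairs = list(zip(reference, predicted))
    true_positive = sum(ref == label and pred == label for ref, pred in pairs)
    predicted_total = sum(pred == label for _, pred in pairs)
    reference_total = sum(ref == label for ref, _ in pairs)
    if not true_positive:
        return 0.0
    precision = true_positive / predicted_total
    recall = true_positive / reference_total
    return 2 * precision * recall / (precision + recall)
```

Scoring by tokens gives partial credit when a model finds only part of a long name, whereas span-level scoring would count that prediction as a miss.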
Natasha is not a scientific project; the practicality of a solution matters to us. Naeval measures startup time, execution speed, model size, and RAM consumption. The table with results is in the repository.
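A rough idea of how such measurements can be taken for an in-process Python model is sketched below with the standard library only. It is an illustration under stated assumptions, not Naeval's benchmarking code: `benchmark` is a hypothetical helper, `load` is assumed to be a zero-argument callable returning a model, and the model is assumed to be a callable that processes one text.

```python
import time
import tracemalloc


def benchmark(load, texts):
    """Measure load time, throughput and peak Python memory of a model."""
    tracemalloc.start()

    started = time.perf_counter()
    model = load()
    init_seconds = time.perf_counter() - started

    started = time.perf_counter()
    for text in texts:
        model(text)
    run_seconds = time.perf_counter() - started

    # Peak size of memory blocks allocated by Python since start().
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        'init_seconds': init_seconds,
        'texts_per_second': len(texts) / run_seconds if run_seconds else float('inf'),
        'peak_mb': peak_bytes / 2 ** 20,
    }
```

`tracemalloc` only sees allocations made through Python's allocator and slows execution down, so for tools running in separate containers the memory and timing figures would have to come from the container runtime instead.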
We prepared datasets, wrapped 20 systems in Docker containers, and computed metrics for 5 other Russian NLP tasks. The results cover tokenization, sentence segmentation, embeddings, morphology analysis, and syntax analysis.
Yargy parser
Yargy parser is a Python analogue of Yandex's Tomita parser. Installation instructions, usage examples, and documentation are in the Yargy repository. Rules for extracting entities are described with context-free grammars and dictionaries. Two years ago I wrote an article on Habr about Yargy and the Natasha library, covering how to solve the NER task for Russian. The project was well received. Yargy parser has replaced Tomita in large projects inside Sberbank, Interfax, and RIA Novosti. Plenty of educational material has appeared, including a long video from a workshop at Yandex: an hour and a half on the process of developing grammars, with examples.
The documentation has been updated: I tidied up the introductory section and the reference. Most importantly, a Cookbook has appeared, a section of useful practices. It collects answers to the most frequent questions from t.me/natural_language_processing:
- how to skip part of the text;
- how to feed tokens to the parser instead of text;
- what to do if the parser is slow.
Yargy parser is a complex tool. The Cookbook describes the non-obvious points that come up when working with large rule sets.
In our lab, several large services run on Yargy. I reread their code and collected in the Cookbook the patterns that are not described in public sources:
- rule generation;
- fact inheritance (especially useful; in practice, no solution manages without this technique).
After reading the documentation, it is useful to look at the repository with examples.
The Natasha project has a natasha-usage repository. It collects code from Yargy parser users that is published on GitHub. 80% of the links are educational projects, but there are instructive examples as well:
- parsing a feed about the operation of the St. Petersburg metro;
- parsing rental housing ads on social networks;
- extracting attributes from car tire names;
- parsing job postings from the jobs channel of the ODS chat.
Of course, the most interesting use cases of Yargy parser are not published on GitHub. If your company uses Yargy and you don't mind, send a private message and we will add your logo to natasha.github.io.
Ipymarkup - visualization of named entity markup and syntactic relations
Ipymarkup is a primitive library for highlighting substrings in text, that is, for visualizing NER. Installation instructions and usage examples are in the Ipymarkup repository. The library is similar to displaCy and displaCy ENT, and it is very useful for debugging Yargy parser grammars:
>>> from yargy import Parser
>>> from ipymarkup import show_span_box_markup as show_markup
>>> parser = Parser(...)
>>> text = '...'
>>> matches = parser.findall(text)
>>> spans = [_.span for _ in matches]
>>> show_markup(text, spans)
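For intuition about what highlighting substrings involves, here is a minimal plain-text sketch that underlines spans with their type labels. It does not use Ipymarkup and is not its implementation. The `underline_spans` helper is hypothetical, and spans are assumed to be non-overlapping `(start, stop, type)` tuples.

```python
def underline_spans(text, spans):
    """Render non-overlapping (start, stop, type) spans as a line under text."""
    line = [' '] * len(text)
    for start, stop, label in spans:
        width = stop - start
        # Label followed by a rule, cut to the width of the substring.
        line[start:stop] = (label + '─' * width)[:width]
    return text + '\n' + ''.join(line).rstrip()


print(underline_spans('Anna visited Paris', [(0, 4, 'PER'), (13, 18, 'LOC')]))
```

Overlapping spans would need several underline rows, which is where a dedicated library earns its keep.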
The Natasha project now has a solution for syntax parsing. We needed not only to highlight words in text but also to draw arrows between them. There are many ready-made solutions, and even a scientific paper on the topic.
Naturally, none of the existing ones fit. One day I went all in, applied all the known CSS and HTML magic, and added a new visualization to Ipymarkup. Usage instructions are in the docs.
>>> from ipymarkup import show_dep_markup
>>> words = ['', '', '', '', '', '', '-', '', '', '', '', '', '', '', '.']
>>> deps = [(2, 0, 'case'), (2, 1, 'amod'), (10, 2, 'obl'), (2, 3, 'nmod'), (10, 4, 'obj'), (7, 5, 'compound'), (5, 6, 'punct'), (4, 7, 'nmod'), (9, 8, 'case'), (4, 9, 'nmod'), (13, 11, 'case'), (13, 12, 'nummod'), (10, 13, 'nsubj'), (10, 14, 'punct')]
>>> show_dep_markup(words, deps)
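The `deps` triples in the example above appear to be zero-based `(head, dependent, relation)` tuples with no entry for the root word. Under that assumption, triples like these could be built from CoNLL-U style rows with 1-based ids as sketched below. The `to_deps` helper is hypothetical and not part of Ipymarkup.

```python
def to_deps(tokens):
    """Convert (id, head_id, rel) rows with 1-based ids into 0-based triples.

    A head_id of 0 marks the root word, which gets no incoming arc.
    """
    deps = []
    for token_id, head_id, rel in tokens:
        if head_id == 0:
            continue
        deps.append((head_id - 1, token_id - 1, rel))
    return deps
```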
Natasha and Nerus now offer a convenient way to view syntax parsing results.
