Today, social media has become one of the main communication platforms, both online and in real life. The freedom to express different points of view, including toxic, aggressive, and offensive comments, can have a long-term negative effect on people's opinions and on social cohesion. One of the most important tasks of modern society is therefore the development of tools that automatically detect toxic information on the Internet and reduce its negative effects.
This article describes how to solve this problem for the Russian language. As a data source, we used a dataset published anonymously on Kaggle and additionally checked the quality of its annotation. To build a classification model, we fine-tuned two versions of the Multilingual Universal Sentence Encoder, Bidirectional Encoder Representations from Transformers, and ruBERT. The fine-tuned ruBERT model showed the best classification result, F1 = 92.20%. We have made the trained models and code examples publicly available.
1. Introduction
Today, the problem of identifying toxic comments is well solved using advanced deep learning techniques [1], [35]. Although some works directly investigate the detection of insults, toxicity, and hate speech in Russian [2], [8], [17], there is only one publicly available dataset of toxic Russian-language comments [5]. It was published on Kaggle without a description of the annotation process, so it may be unreliable for academic and practical purposes without additional detailed study.
This article is devoted to the automatic detection of toxic comments in Russian. For this task, we verified the annotation of the Russian Language Toxic Comments Dataset [5]. We then built classification models by fine-tuning the pre-trained multilingual versions of the Multilingual Universal Sentence Encoder (M-USE) [48] and Bidirectional Encoder Representations from Transformers (M-BERT) [13], as well as ruBERT [22]. The most accurate model, ruBERT-Toxic, showed F1 = 92.20% in the binary classification of toxic comments. The resulting M-BERT and M-USE models can be downloaded from GitHub.
The article is organized as follows. Section 2 briefly describes other work on this topic and the available Russian-language datasets. Section 3 gives a general overview of the Russian Language Toxic Comments Dataset and describes how we checked its annotation. Section 4 describes the fine-tuning of language models for the text classification task. Section 5 describes the classification experiment. Finally, we discuss the performance of the system and directions for future research.
2. Related work
Extensive work has been done on detecting toxic comments in various data sources. For example, Prabowo et al. used Naive Bayes (NB), Support Vector Machine (SVM), and Random Forest Decision Tree (RFDT) classifiers to detect hate and offensive language on Indonesian Twitter [34]. The experiments showed an accuracy of 68.43% for the hierarchical approach with word unigram features and an SVM model. A team led by Founta [15] proposed a deep neural network based on GRU with pre-trained GloVe embeddings for the classification of toxic texts. The model showed high accuracy on five datasets, with AUC ranging from 92% to 98%.
More and more workshops and competitions are devoted to detecting toxic, hateful, and offensive comments. Examples include HatEval and OffensEval at SemEval-2019; HASOC at FIRE-2019; the shared tasks on the identification of offensive language at GermEval-2019 and GermEval-2018; and TRAC at COLING-2018. The models used in these tasks range from traditional machine learning (SVM, logistic regression, and so on) to deep learning (RNN, LSTM, GRU, CNN, CapsNet with an attention mechanism [45], [49], and advanced models such as ELMo [31], BERT [13], and USE [9], [48]). A considerable number of the teams that achieved good results [18], [24], [27], [28], [30], [36], [38] used embeddings from the listed pre-trained language models. Because representations from pre-trained models worked well in classification, they were widely used in subsequent research. For example, researchers from the University of Lorraine performed multi-class and binary classification of Twitter messages using two approaches: training a DNN classifier on pre-trained word embeddings, and carefully fine-tuning a pre-trained BERT model [14]. The second approach showed significantly better results than CNN and bidirectional LSTM neural networks based on FastText embeddings.
A considerable number of studies [7], [33], [41] are devoted to toxic and aggressive behavior in Russian-language social networks, but little attention has been paid to its automatic classification. To determine the aggressiveness of English and Russian texts, Gordeev used convolutional neural networks and a random forest classifier (RFC) [17]. The set of messages annotated as aggressive contained about 1,000 messages in Russian and about the same number in English, but it is not publicly available. The trained CNN model showed an accuracy of 66.68% in the binary classification of Russian-language texts. Based on these results, the authors concluded that convolutional neural networks and deep-learning-based approaches are more promising for identifying aggressive texts. Andrusyak et al. proposed an unsupervised probabilistic approach with a seed dictionary to classify offensive YouTube comments written in Ukrainian and Russian [2]. The authors published a manually labeled dataset of 2,000 comments, but it contains both Russian and Ukrainian texts, so it cannot be used directly for research on Russian-language text.
Several recent studies focus on the automatic identification of attitudes toward migrants and ethnic groups in Russian-speaking social networks, including the identification of identity-based attacks. Bodrunova and co-authors studied 363,000 Russian-language LiveJournal publications on attitudes toward migrants from the post-Soviet republics compared with other nations [8]. It turned out that in Russian-language blogs migrants did not provoke significant discussion and were not subjected to the worst treatment. At the same time, representatives of North Caucasian and Central Asian nationalities are treated in completely different ways. A research group led by Bessudnov found that Russians are traditionally more hostile toward people from the Caucasus and Central Asia, while Ukrainians and Moldovans are generally accepted as potential neighbors [6]. According to the findings of a group led by Koltsova, attitudes toward representatives of Central Asian nationalities and Ukrainians are the most negative [19]. Some academic studies focus on identifying toxic, aggressive, and hateful speech, but none of their authors has made a Russian-language dataset publicly available. As far as we know, the Russian Language Toxic Comments Dataset [5] is the only set of toxic Russian-language comments in the public domain. However, it was published on Kaggle without a description of how it was created and annotated, so its use in academic and practical projects is not recommended without a detailed study.
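Since there is very little research on identifying toxic Russian-language comments, we decided to evaluate the performance of deep learning models on the Russian Language Toxic Comments Dataset [5]. We are not aware of any classification studies based on this data source. Multilingual BERT and Multilingual USE are among the most widespread and successful models in recent research projects, and only they officially support Russian. We chose fine-tuning as the transfer learning approach because it gave the best classification results in recent studies [13], [22], [43], [48].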
3. Toxic comments dataset
The Russian Language Toxic Comments Dataset [5] is a collection of annotated comments from the sites Dvach and Pikabu. It was published on Kaggle in 2019 and contains 14,412 comments, of which 4,826 are toxic and 9,586 are non-toxic. The average comment length is 175 characters, the minimum is 21, and the maximum is 7,403.
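To check the quality of the annotation, we manually annotated part of the comments and compared the result with the original tags using inter-annotator agreement. We decided to consider the existing annotation correct if a substantial or high level of inter-annotator agreement was reached.
First, we manually tagged 3,000 comments and compared the resulting class labels with the original ones. The annotation was done by Russian-speaking participants of the Yandex.Toloka crowdsourcing platform, which has already been used in several academic studies of Russian-language text [10], [29], [32], [44]. As the markup guideline, we used the toxicity recognition instruction with additional attributes that was used in the Jigsaw Toxic Comment Classification Challenge. Annotators were asked to determine the toxicity of the texts, and its level had to be indicated for each comment. To improve the accuracy of the markup and limit the possibility of cheating, we used the following techniques: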
- assigning annotators a level based on their answers to control tasks and banning those who gave incorrect answers;
- restricting access to tasks for those who answered too quickly;
- restricting access to tasks for those who entered the captcha incorrectly several times in a row.
Each comment was annotated by 3 to 8 annotators using the dynamic overlap technique. The results were aggregated using the Dawid-Skene method [12], following the Yandex.Toloka recommendations. The annotators showed a high level of inter-annotator agreement, with a Krippendorff's alpha of 0.81. Cohen's kappa between the original and the aggregated labels was 0.68, which corresponds to a substantial level of inter-annotator agreement [11]. We therefore decided to consider the dataset markup correct, especially given the possible differences in annotation instructions.
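The comparison between the original and the aggregated labels can be illustrated with a small sketch of Cohen's kappa. The function name and the toy label lists below are hypothetical and are not taken from the dataset; the sketch only shows how the coefficient is computed from observed and chance agreement.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Share of items on which both label sets agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    classes = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for illustration only (1 = toxic, 0 = non-toxic).
original = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
aggregated = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(original, aggregated)
print(f"kappa = {kappa:.2f}")
```

On real data, the two lists would hold the original Kaggle label and the aggregated crowdsourced label for each of the re-annotated comments.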
4. Machine learning models
4.1. Baseline approaches
For the baseline approaches, we took one basic machine learning approach and one modern neural network approach. In both cases, we preprocessed the text: we replaced URLs and nicknames with keywords, removed punctuation, and converted uppercase letters to lowercase.
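A minimal sketch of such preprocessing is shown below. The placeholder keywords (`URL`, `USER`), the `@name` nickname pattern, and the function name are assumptions made for illustration; the article does not specify the exact keywords or patterns that were used.

```python
import re

# Assumed patterns: the source does not state how URLs and nicknames were matched.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NICK_RE = re.compile(r"@\w+")
PUNCT_RE = re.compile(r"[^\w\s]")  # \w is Unicode-aware, so Cyrillic letters are kept


def preprocess(text):
    """Replace URLs and nicknames with keywords, drop punctuation, lowercase."""
    text = URL_RE.sub(" URL ", text)
    text = NICK_RE.sub(" USER ", text)
    text = PUNCT_RE.sub(" ", text)
    text = text.lower()
    # Collapse the extra whitespace left by the substitutions.
    return " ".join(text.split())


print(preprocess("@ivan Смотри: https://example.com/page?x=1 !!!"))
```

The substitutions run before punctuation removal so that the URL and nickname patterns still see the characters they depend on.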
First, we applied the Multinomial Naive Bayes (MNB) model, which performs well in text classification tasks [16], [40]. To build the model, we used Bag-of-Words and TF-IDF vectorization. The second model was a Bidirectional Long Short-Term Memory (BiLSTM) neural network. For the embedding layer, we pre-trained Word2Vec embeddings (dim = 300) [25] on the collection of Russian-language Twitter messages from RuTweetCorp [37]. On top of the Word2Vec embeddings, we added two bidirectional LSTM layers, followed by a fully connected hidden layer and a sigmoid output layer. To reduce overfitting, we added regularization layers with Gaussian noise and dropout to the network. We used the Adam optimizer with an initial learning rate of 0.001 and binary cross-entropy as the loss function. The model was trained with frozen embeddings for 10 epochs. We also tried unfreezing the embeddings at different epochs while lowering the learning rate, but the results were worse. The reason is probably the size of the training set [4].
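To make the first baseline concrete, here is a small self-contained sketch of a Multinomial Naive Bayes classifier over bag-of-words counts with Laplace smoothing. It is not the authors' implementation: the class name, the English toy comments, and the labels are hypothetical, and TF-IDF weighting is left out for brevity.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Multinomial Naive Bayes over whitespace tokens with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc.split()}
        class_counts = Counter(labels)
        token_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            token_counts[label].update(doc.split())
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            self.log_prior[c] = math.log(class_counts[c] / len(docs))
            total = sum(token_counts[c].values()) + self.alpha * len(self.vocab)
            self.log_likelihood[c] = {
                tok: math.log((token_counts[c][tok] + self.alpha) / total)
                for tok in self.vocab
            }
        return self

    def predict(self, doc):
        # Tokens unseen during training are ignored.
        tokens = [tok for tok in doc.split() if tok in self.vocab]
        scores = {
            c: self.log_prior[c] + sum(self.log_likelihood[c][t] for t in tokens)
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Made-up training data for illustration only (1 = toxic, 0 = non-toxic).
train_docs = [
    "you are an idiot",
    "what an idiot move",
    "thanks for the help",
    "great post thanks",
    "have a nice day",
]
train_labels = [1, 1, 0, 0, 0]
model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict("idiot"), model.predict("thanks for the post"))
```

In practice a library implementation with proper vectorization would be preferable; the sketch only exposes the counting and smoothing that the model relies on.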
4.2. BERT model
Two versions of the multilingual BERT BASE model are now officially available, but only the Cased version is officially recommended. BERT BASE takes a sequence of no more than 512 tokens and returns its representation. Tokenization is performed with WordPiece [46] after preliminary text normalization and punctuation splitting. Researchers from MIPT trained a BERT BASE Cased model for the Russian language and published it as ruBERT [22]. We used both models, multilingual BERT BASE Cased and ruBERT. Each contains 12 sequential transformer blocks, has a hidden size of 768, and contains 12 self-attention heads and 110 million parameters. The fine-tuning stage was performed with the parameters recommended in [43] and in the official repository: 3 training epochs, 10% warm-up steps, maximum sequence length 128, batch size 32, and learning rate 5e-5.
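The following sketch shows what these hyperparameters imply for the number of optimization steps. It assumes an 80/20 train/test split of the 14,412 comments (the article reports a 20% test set) and a simplified schedule with linear warm-up followed by linear decay to zero; the exact schedule used by the authors is not described here, and the function names are hypothetical.

```python
import math


def fine_tuning_schedule(num_train_examples, epochs=3, batch_size=32,
                         warmup_proportion=0.1, peak_lr=5e-5):
    """Return total steps, warm-up steps, and a step -> learning rate function."""
    steps_per_epoch = math.ceil(num_train_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(total_steps * warmup_proportion)

    def learning_rate(step):
        if step < warmup_steps:
            # Linear warm-up from 0 to the peak learning rate.
            return peak_lr * step / warmup_steps
        # Assumed linear decay from the peak to 0 at the last step.
        remaining = total_steps - step
        return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))

    return total_steps, warmup_steps, learning_rate


train_size = int(14412 * 0.8)  # assumed 80% training share
total_steps, warmup_steps, lr_at = fine_tuning_schedule(train_size)
print(total_steps, warmup_steps, lr_at(warmup_steps))
```

Under these assumptions, the whole fine-tuning run is only about a thousand optimizer steps.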
4.3. M-USE model
Multilingual USE Trans takes a sequence of no more than 100 tokens as input, and Multilingual USE CNN takes a sequence of no more than 256 tokens. SentencePiece [20] tokenization is used for all supported languages. We used the pre-trained Multilingual USE Trans, which supports 16 languages including Russian and contains a transformer encoder with 6 transformer layers, 8 attention heads, a filter size of 2048, and a hidden size of 512. We also used the pre-trained Multilingual USE CNN, which supports 16 languages including Russian and contains a CNN encoder with two CNN layers, filter widths of (1, 2, 3, 5), and a fixed filter size (its value is not stated here). For both models, we used the parameters recommended on the TensorFlow Hub pages: 100 training epochs, batch size 32, and learning rate 3e-4.
5. Experiment
We compared the baseline and transfer learning approaches:
- Multinomial Naive Bayes classifier;
- Bidirectional Long Short-Term Memory (BiLSTM) neural network;
- multilingual version of Bidirectional Encoder Representations from Transformers (M-BERT);
- ruBERT;
- two versions of the Multilingual Universal Sentence Encoder (M-USE).
The classification quality of the trained models on the test set (20%) is shown in the table. All fine-tuned language models exceeded the baseline level in precision, recall, and F1-measure. ruBERT showed the best result, F1 = 92.20%.
Binary classification of toxic Russian-language comments:
System | P | R | F1 |
MNB | 87.01% | 81.22% | 83.21% |
BiLSTM | 86.56% | 86.65% | 86.59% |
M-BERT BASE-Toxic | 91.19% | 91.10% | 91.15% |
ruBERT-Toxic | 91.91% | 92.51% | 92.20% |
M-USE CNN-Toxic | 89.69% | 90.14% | 89.91% |
M-USE Trans-Toxic | 90.85% | 91.92% | 91.35% |
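For reference, the three reported metrics can be computed from predictions as in the sketch below. It scores only the positive (toxic) class; the article does not state which averaging scheme was used for the table, so the numbers above need not follow from this exact formula. The function name and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels for illustration only (1 = toxic, 0 = non-toxic).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"P = {p:.2%}, R = {r:.2%}, F1 = {f1:.2%}")
```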
6. Conclusion
In this article, we used two fine-tuned versions of the Multilingual Universal Sentence Encoder [48], the multilingual Bidirectional Encoder Representations from Transformers [13], and ruBERT [22] to identify toxic Russian-language comments. The fine-tuned ruBERT-Toxic model showed the best classification result, F1 = 92.20%.
The resulting M-BERT and M-USE models are available on GitHub.
References
- Aken, B. van et al.: Challenges for toxic comment classification: An in-depth error analysis. In: Proceedings of the 2nd workshop on abusive language online (ALW2). pp. 33–42. Association for Computational Linguistics, Brussels, Belgium (2018).
- Andrusyak, B. et al.: Detection of abusive speech for mixed sociolects of russian and ukrainian languages. In: The 12th workshop on recent advances in slavonic natural languages processing, RASLAN 2018, karlova studanka, czech republic, december 7–9, 2018. pp. 77–84 (2018).
- Basile, V. et al.: SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter. In: Proceedings of the 13th international workshop on semantic evaluation. pp. 54–63. Association for Computational Linguistics, Minneapolis, Minnesota, USA (2019).
- Baziotis, C. et al.: DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017). pp. 747–754. Association for Computational Linguistics, Vancouver, Canada (2017).
- Belchikov, A.: Russian language toxic comments, https://www.kaggle.com/blackmoon/russian-language-toxic-comments.
- Bessudnov, A., Shcherbak, A.: Ethnic discrimination in multi-ethnic societies: Evidence from russia. European Sociological Review. (2019).
- Biryukova, E. V. et al.: READER'S comment in on-line magazine as a genre of internet discourse (by the material of the german and russian languages). Philological Sciences. Issues of Theory and Practice. 12, 1, 79–82 (2018).
- Bodrunova, S. S. et al.: Who's bad? Attitudes toward resettlers from the post-soviet south versus other nations in the russian blogosphere. International Journal of Communication. 11, 23 (2017).
- Cer, D. M. et al.: Universal sentence encoder. ArXiv. abs/1803.11175, (2018).
- Chernyak, E. et al.: Char-rnn for word stress detection in east slavic languages. CoRR. abs/1906.04082, (2019).
- Cohen, J.: A coefficient of agreement for nominal scales. Educational and psychological measurement. 20, 1, 37–46 (1960).
- Dawid, A. P., Skene, A. M.: Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics). 28, 1, 20–28 (1979).
- Devlin, J. et al.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019).
- d'Sa, A. G. et al.: BERT and fastText embeddings for automatic detection of toxic speech. In: SIIE 2020-information systems and economic intelligence. (2020).
- Founta, A. M. et al.: A unified deep learning architecture for abuse detection. In: Proceedings of the 10th acm conference on web science. pp. 105–114. Association for Computing Machinery, New York, NY, USA (2019).
- Frank, E., Bouckaert, R.: Naive bayes for text classification with unbalanced classes. In: Fürnkranz, J. et al. (eds.) Knowledge discovery in databases: PKDD 2006. pp. 503–510. Springer Berlin Heidelberg, Berlin, Heidelberg (2006).
- Gordeev, D.: Detecting state of aggression in sentences using cnn. In: International conference on speech and computer. pp. 240–245. Springer (2016).
- Indurthi, V. et al.: FERMI at SemEval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in twitter. In: Proceedings of the 13th international workshop on semantic evaluation. pp. 70–74. Association for Computational Linguistics, Minneapolis, Minnesota, USA (2019).
- Koltsova, O. et al.: FINDING and analyzing judgements on ethnicity in the russian-language social media. AoIR Selected Papers of Internet Research. (2017).
- Kudo, T., Richardson, J.: SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 conference on empirical methods in natural language processing: System demonstrations. pp. 66–71. Association for Computational Linguistics, Brussels, Belgium (2018).
- Kumar, R. et al. eds: Proceedings of the first workshop on trolling, aggression and cyberbullying (TRAC-2018). Association for Computational Linguistics, Santa Fe, New Mexico, USA (2018).
- Kuratov, Y., Arkhipov, M.: Adaptation of deep bidirectional multilingual transformers for Russian language. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference «Dialogue». pp. 333–340. RSUH, Moscow, Russia (2019).
- Lenhart, A. et al.: Online harassment, digital abuse, and cyberstalking in america. Data & Society Research Institute (2016).
- Liu, P. et al.: NULI at SemEval-2019 task 6: Transfer learning for offensive language detection using bidirectional transformers. In: Proceedings of the 13th international workshop on semantic evaluation. pp. 87–91. Association for Computational Linguistics, Minneapolis, Minnesota, USA (2019).
- Mikolov, T. et al.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th international conference on neural information processing systems - volume 2. pp. 3111–3119. Curran Associates Inc., Red Hook, NY, USA (2013).
- Mishra, P. et al.: Abusive language detection with graph convolutional networks. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers). pp. 2145–2150 (2019).
- Mishra, S., Mishra, S.: 3Idiots at HASOC 2019: Fine-tuning transformer neural networks for hate speech identification in indo-european languages. In: Working notes of FIRE 2019 - forum for information retrieval evaluation, kolkata, india, december 12–15, 2019. pp. 208–213 (2019).
- Nikolov, A., Radivchev, V.: Nikolov-radivchev at SemEval-2019 task 6: Offensive tweet classification with BERT and ensembles. In: Proceedings of the 13th international workshop on semantic evaluation. pp. 691–695. Association for Computational Linguistics, Minneapolis, Minnesota, USA (2019).
- Panchenko, A. et al.: RUSSE'2018: A Shared Task on Word Sense Induction for the Russian Language. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference «Dialogue». pp. 547–564. RSUH, Moscow, Russia (2018).
- Paraschiv, A., Cercel, D.-C.: UPB at germeval-2019 task 2: BERT-based offensive language classification of german tweets. In: Preliminary proceedings of the 15th conference on natural language processing (konvens 2019). Erlangen, germany: German society for computational linguistics & language technology. pp. 396–402 (2019).
- Peters, M. et al.: Deep contextualized word representations. In: Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long papers). pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana (2018).
- Ponomareva, M. et al.: Automated word stress detection in Russian. In: Proceedings of the first workshop on subword and character level models in NLP. pp. 31–35. Association for Computational Linguistics, Copenhagen, Denmark (2017).
- Potapova, R., Komalova, L.: Lexico-semantical indices of «deprivation–aggression» modality correlation in social network discourse. In: International conference on speech and computer. pp. 493–502. Springer (2017).
- Prabowo, F. A. et al.: Hierarchical multi-label classification to identify hate speech and abusive language on indonesian twitter. In: 2019 6th international conference on information technology, computer and electrical engineering (icitacee). pp. 1–5 (2019).
- Risch, J., Krestel, R.: Toxic comment detection in online discussions. In: Deep learning-based approaches for sentiment analysis. pp. 85–109. Springer (2020).
- Risch, J. et al.: HpiDEDIS at germeval 2019: Offensive language identification using a german bert model. In: Preliminary proceedings of the 15th conference on natural language processing (konvens 2019). Erlangen, germany: German society for computational linguistics & language technology. pp. 403–408 (2019).
- Rubtsova, Y.: A method for development and analysis of short text corpus for the review classification task. Proceedings of conferences Digital Libraries: Advanced Methods and Technologies, Digital Collections (RCDL'2013). pp. 269–275 (2013).
- Ruiter, D. et al.: LSV-uds at HASOC 2019: The problem of defining hate. In: Working notes of FIRE 2019 - forum for information retrieval evaluation, kolkata, india, december 12–15, 2019. pp. 263–270 (2019).
- Sambasivan, N. et al.: «They don't leave us alone anywhere we go»: Gender and digital abuse in south asia. In: Proceedings of the 2019 chi conference on human factors in computing systems. Association for Computing Machinery, New York, NY, USA (2019).
- Sang-Bum Kim et al.: Some effective techniques for naive bayes text classification. IEEE Transactions on Knowledge and Data Engineering. 18, 11, 1457–1466 (2006).
- Shkapenko, T., Vertelova, I.: Hate speech markers in internet comments to translated articles from polish media. Political Linguistics. 70, 4, pp. 104–111 (2018).
- Strus, J. M. et al.: Overview of germeval task 2, 2019 shared task on the identification of offensive language. Presented at the (2019).
- Sun, C. et al.: How to fine-tune bert for text classification? In: Sun, M. et al. (eds.) Chinese computational linguistics. pp. 194–206. Springer International Publishing, Cham (2019).
- Ustalov, D., Igushkin, S.: Sense inventory alignment using lexical substitutions and crowdsourcing. In: 2016 international fruct conference on intelligence, social media and web (ismw fruct). (2016).
- Vaswani, A. et al.: Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems. pp. 6000–6010. Curran Associates Inc., Red Hook, NY, USA (2017).
- Wu, Y. et al.: Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. (2016).
- Yang, F. et al.: Exploring deep multimodal fusion of text and photo for hate speech classification. In: Proceedings of the third workshop on abusive language online. pp. 11–18. Association for Computational Linguistics, Florence, Italy (2019).
- Yang, Y. et al.: Multilingual universal sentence encoder for semantic retrieval. CoRR. abs/1907.04307, (2019).
- Yang, Z. et al.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the north American chapter of the association for computational linguistics: Human language technologies. pp. 1480–1489. Association for Computational Linguistics, San Diego, California (2016).