It has been a while since I wrote an article, so it is high time for a new one. In it I want to tell you how the data science knowledge gained during the well-known "Machine Learning and Data Analysis" specialization from Yandex and MIPT came in handy. In fairness, it should be noted that the knowledge has not been fully acquired: the specialization is not finished yet :) Nevertheless, it is already possible to solve simple, real business problems. Or is it necessary at all? This question will be answered in just a couple of paragraphs.
So, today in this article I will tell the dear reader about my first experience of participating in an open competition. I would like to note right away that my goal in the competition was not to win a prize; my only wish was to try my hand in the real world :) On top of that, the topic of the competition hardly intersected with the material of the courses I had taken. This added some difficulty, but it made the competition all the more interesting and the experience gained all the more valuable.
By tradition, let me name who might find this article interesting. First, those who have already completed the first two courses of the above specialization and want to try their hand at a practical problem, but are shy and worried ("Will I manage? Will they laugh at me?" and so on). After reading the article, I hope such fears will be dispelled. Second, those who are solving a similar problem and have no idea where to start. And third, as real data interns say, here is a ready-made, unpretentious baseline :)
Here it would be worth outlining the research plan, but let us digress for a moment and try to answer the question from the first paragraph: does a data science beginner need to try his hand at such competitions? Opinions differ on this score. Personally, my opinion is: yes! Let me explain why. There are many reasons, and I will not list them all; I will point out the most important ones. First, such competitions help to consolidate theoretical knowledge in practice. Second, in my experience, the experience gained in near-combat conditions is almost always a very strong motivator for further work. Third, and this is the most important thing, during the competition you have a special chat where you can communicate with the other participants. You do not even have to communicate: you can simply read what people write, and this often leads to interesting thoughts about what other changes could be made in the research. You can also validate your own ideas and gain confidence in them, especially if they are voiced in the chat. These advantages should be approached with a certain caution, so that a feeling of omniscience does not set in...
Now a little about how I decided to participate. I learned about the competition only a few days before it started. My first thought: "If I had known about the competition a month in advance, I would have prepared, I would have studied some additional materials that could be useful for the research... Otherwise, without preparation I will not make the deadline..." My second thought: "Actually, why worry, if the goal is not a prize? Especially since in 95% of cases the participants speak Russian, there is a special discussion chat, there will be some webinars from the organizers, and in the end I will get to see how live data scientists of all stripes and sizes work..." As you can guess, the second thought won, and not in vain: the few days of hard work, not at all simple, brought valuable experience, even if on a simple business task. So if you are just starting to conquer the heights of data science and see an upcoming competition with chat support in your native language, and you have some free time, do not hesitate for long: try it, and may the force come with you! On a positive note, let us move on to the task and the research plan.
Matching names
We will not invent a problem description ourselves; instead, we quote the original from the website of the competition organizers.
The task
When searching for new clients, SIBUR has to process information about millions of new companies from various sources. At the same time, company names may have different spellings, contain abbreviations or errors, or be affiliated with companies already known to SIBUR.
To process information about potential customers more efficiently, SIBUR needs to know whether two names are related (that is, whether they belong to the same company or to affiliated companies).
In this case, SIBUR can use information it already has about the company itself or about affiliated companies, avoid duplicate outreach to the same company, and not waste time on irrelevant companies or on subsidiaries of competitors.
The training sample contains pairs of names from various sources (including custom ones) and their markup.
The markup was obtained partly manually and partly algorithmically, so it may contain errors. Your task is to build a binary model that predicts whether two names are related. The metric used in this task is F1.
In this task, it is possible, and even necessary, to use open data sources to enrich the dataset or to find additional information important for identifying affiliated companies.
Additional information about the task
Let me reveal some details
[The organizers' detailed Q&A was garbled during extraction; the legible fragments mention: names written differently can still be related (for example "Sibur Digital", "Sibur international GMBH" and "International GMBH"); a leaderboard split into public (50%) and private (50%) parts; a prize fund of 1 000 000; a submission deadline in 2020; a limit of 10 submissions; solution checks via an API; pairs referring to legal entities; markup partly obtained by crowdsourcing, hence possible errors; and that open source tools are allowed.]
Data
train.csv - the training set
test.csv - the test set
sample_submission.csv - an example of a solution in the correct format
Naming baseline.ipynb - the code
baseline_submission.csv - the basic solution
Note that the competition organizers, taking care of the younger generation, posted a basic solution to the task, which gives an f1 quality of about 0.1. It was my first time participating in a competition, and also my first time seeing such care :)
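Since the competition metric is F1, it is worth recalling what that 0.1 actually measures. A minimal sketch with made-up labels (sklearn.metrics.f1_score, imported later in the notebook, computes the same thing):

```python
def f1(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# toy markup for five name pairs (illustrative numbers, not competition data)
print(round(f1([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]), 3))  # 0.8
```

With a strong class imbalance, as in this task, F1 is far more informative than accuracy: a model that always predicts 0 gets high accuracy but an F1 of exactly zero.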
So, having become familiar with the problem itself and the requirements for its solution, let us move on to the plan for solving it.
Plan for solving the problem
Setting up the hardware
Loading the libraries
Writing auxiliary functions
Data preprocessing
Dropping the stop words. Manually!
Top 50 & Drop it smart
Calculating the Levenshtein distance
Calculating the normalized Levenshtein distance
Visualizing the features
Generating numerous features by comparing the words in the texts of each pair
Comparing the words of the texts with the words in the names of the top 50 holding brands of the petrochemical and construction industries, and collecting a second large batch of features. The second CHEAT
Preparing the data to feed into the model
Setting up and training the model
Competition results
Sources of information
Now that we are familiar with the research plan, let us move on to carrying it out.
Setting up the hardware
Loading the libraries
Actually, everything is simple here. First, let us install the missing libraries.
Install a library to get a list of countries, which we will then remove from the texts:
pip install pycountry
Install a library for determining the Levenshtein distance between words in the texts:
pip install strsimpy
Install a library with whose help we will transliterate Russian text into Latin:
pip install cyrtranslit
Now pull up the libraries:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import pycountry
import re
from tqdm import tqdm
tqdm.pandas()
from strsimpy.levenshtein import Levenshtein
from strsimpy.normalized_levenshtein import NormalizedLevenshtein
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
sns.set()
sns.set_style("whitegrid")
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit
from scipy.sparse import csr_matrix
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report, f1_score
# import googletrans
# from googletrans import Translator
import cyrtranslit
Let's write some auxiliary functions
Instead of copying large chunks of code, I recommend wrapping them in functions that can then be called in a single line; this works in almost every case.
I will not claim that the code inside the functions is of excellent quality (there are certainly places to optimize), but for a quick study, correct calculations are quite enough.
So, the first function converts text to lowercase.
Code
# convert text to lowercase
def lower_str(data,column):
    data[column] = data[column].str.lower()
The next four functions help to visualize the features under study and how they separate the objects by the target label (0 or 1).
Code
# statistic table for analysing float values (needed for histograms and boxplots)
def data_statistics(data,analyse,title_print):
    data0 = data[data['target']==0][analyse]
    data1 = data[data['target']==1][analyse]
    data_describe = pd.DataFrame()
    data_describe['target_0'] = data0.describe()
    data_describe['target_1'] = data1.describe()
    data_describe = data_describe.T
    if title_print == 'yes':
        print ('\033[1m' + ' ',analyse,'\033[m')
    elif title_print == 'no':
        None
    return data_describe
# histograms for float values
def hist_fz(data,data_describe,analyse,size):
    print ()
    print ('\033[1m' + 'Information about',analyse,'\033[m')
    print ()
    data_0 = data[data['target'] == 0][analyse]
    data_1 = data[data['target'] == 1][analyse]
    min_data = data_describe['min'].min()
    max_data = data_describe['max'].max()
    data0_mean = data_describe.loc['target_0']['mean']
    data0_median = data_describe.loc['target_0']['50%']
    data0_min = data_describe.loc['target_0']['min']
    data0_max = data_describe.loc['target_0']['max']
    data0_count = data_describe.loc['target_0']['count']
    data1_mean = data_describe.loc['target_1']['mean']
    data1_median = data_describe.loc['target_1']['50%']
    data1_min = data_describe.loc['target_1']['min']
    data1_max = data_describe.loc['target_1']['max']
    data1_count = data_describe.loc['target_1']['count']
    print ('\033[4m' + 'Analyse'+ '\033[m','No duplicates')
    figure(figsize=size)
    sns.distplot(data_0,color='darkgreen',kde = False)
    plt.scatter(data0_mean,0,s=200,marker='o',c='dimgray',label='Mean')
    plt.scatter(data0_median,0,s=250,marker='|',c='black',label='Median')
    plt.legend(scatterpoints=1,
               loc='upper right',
               ncol=3,
               fontsize=16)
    plt.xlim(min_data, max_data)
    plt.show()
    print ('Quantity:', data0_count,
           ' Min:', round(data0_min,2),
           ' Max:', round(data0_max,2),
           ' Mean:', round(data0_mean,2),
           ' Median:', round(data0_median,2))
    print ()
    print ('\033[4m' + 'Analyse'+ '\033[m','Duplicates')
    figure(figsize=size)
    sns.distplot(data_1,color='darkred',kde = False)
    plt.scatter(data1_mean,0,s=200,marker='o',c='dimgray',label='Mean')
    plt.scatter(data1_median,0,s=250,marker='|',c='black',label='Median')
    plt.legend(scatterpoints=1,
               loc='upper right',
               ncol=3,
               fontsize=16)
    plt.xlim(min_data, max_data)
    plt.show()
    print ('Quantity:', data_1.count(),
           ' Min:', round(data1_min,2),
           ' Max:', round(data1_max,2),
           ' Mean:', round(data1_mean,2),
           ' Median:', round(data1_median,2))
# draw boxplot
def boxplot(data,analyse,size):
    print ('\033[4m' + 'Analyse'+ '\033[m','All pairs')
    data_0 = data[data['target'] == 0][analyse]
    data_1 = data[data['target'] == 1][analyse]
    figure(figsize=size)
    sns.boxplot(x=analyse,y='target',data=data,orient='h',
                showmeans=True,
                meanprops={"marker":"o",
                           "markerfacecolor":"dimgray",
                           "markeredgecolor":"black",
                           "markersize":"14"},
                palette=['palegreen', 'salmon'])
    plt.ylabel('target', size=14)
    plt.xlabel(analyse, size=14)
    plt.show()
# draw graph to analyse two chosen features for predicting the target label
def two_features(data,analyse1,analyse2,size):
    fig = plt.subplots(figsize=size)
    x0 = data[data['target']==0][analyse1]
    y0 = data[data['target']==0][analyse2]
    x1 = data[data['target']==1][analyse1]
    y1 = data[data['target']==1][analyse2]
    plt.scatter(x0,y0,c='green',marker='.')
    plt.scatter(x1,y1,c='black',marker='+')
    plt.xlabel(analyse1)
    plt.ylabel(analyse2)
    title = [analyse1,analyse2]
    plt.title(title)
    plt.show()
The fifth function is designed to generate a table of the algorithm's guesses and errors, better known as a contingency table.
In other words, after forming the vector of predictions, we need to compare the predictions with the target labels. The result of such a comparison should be a contingency table for each pair of companies from the training sample, in which it is determined, for each pair, whether the prediction matched its class. A match is classified as 'true positive', 'false positive', 'true negative' or 'false negative'. These data are very important for analyzing the behavior of the algorithm and deciding how to improve the model and the feature space.
Code
def contingency_table(X,features,probability_level,tridx,cvidx,model):
    tr_predict_proba = model.predict_proba(X.iloc[tridx][features].values)
    cv_predict_proba = model.predict_proba(X.iloc[cvidx][features].values)
    tr_predict_target = (tr_predict_proba[:, 1] > probability_level).astype(int)
    cv_predict_target = (cv_predict_proba[:, 1] > probability_level).astype(int)
    X_tr = X.iloc[tridx]
    X_cv = X.iloc[cvidx]
    X_tr['predict_proba'] = tr_predict_proba[:,1]
    X_cv['predict_proba'] = cv_predict_proba[:,1]
    X_tr['predict_target'] = tr_predict_target
    X_cv['predict_target'] = cv_predict_target
    # make true positive column
    data = pd.DataFrame(X_tr[X_tr['target']==1][X_tr['predict_target']==1]['pair_id'])
    data['True_Positive'] = 1
    X_tr = X_tr.merge(data,on='pair_id',how='left')
    data = pd.DataFrame(X_cv[X_cv['target']==1][X_cv['predict_target']==1]['pair_id'])
    data['True_Positive'] = 1
    X_cv = X_cv.merge(data,on='pair_id',how='left')
    # make false positive column
    data = pd.DataFrame(X_tr[X_tr['target']==0][X_tr['predict_target']==1]['pair_id'])
    data['False_Positive'] = 1
    X_tr = X_tr.merge(data,on='pair_id',how='left')
    data = pd.DataFrame(X_cv[X_cv['target']==0][X_cv['predict_target']==1]['pair_id'])
    data['False_Positive'] = 1
    X_cv = X_cv.merge(data,on='pair_id',how='left')
    # make true negative column
    data = pd.DataFrame(X_tr[X_tr['target']==0][X_tr['predict_target']==0]['pair_id'])
    data['True_Negative'] = 1
    X_tr = X_tr.merge(data,on='pair_id',how='left')
    data = pd.DataFrame(X_cv[X_cv['target']==0][X_cv['predict_target']==0]['pair_id'])
    data['True_Negative'] = 1
    X_cv = X_cv.merge(data,on='pair_id',how='left')
    # make false negative column
    data = pd.DataFrame(X_tr[X_tr['target']==1][X_tr['predict_target']==0]['pair_id'])
    data['False_Negative'] = 1
    X_tr = X_tr.merge(data,on='pair_id',how='left')
    data = pd.DataFrame(X_cv[X_cv['target']==1][X_cv['predict_target']==0]['pair_id'])
    data['False_Negative'] = 1
    X_cv = X_cv.merge(data,on='pair_id',how='left')
    return X_tr,X_cv
The sixth function is used to form the confusion matrix. Do not confuse it with the contingency table: one follows from the other. You will see everything for yourself below.
Code
def matrix_confusion(X):
    list_matrix = ['True_Positive','False_Positive','True_Negative','False_Negative']
    tr_pos = X[list_matrix].sum().loc['True_Positive']
    f_pos = X[list_matrix].sum().loc['False_Positive']
    tr_neg = X[list_matrix].sum().loc['True_Negative']
    f_neg = X[list_matrix].sum().loc['False_Negative']
    matrix_confusion = pd.DataFrame()
    matrix_confusion['0_algorithm'] = np.array([tr_neg,f_neg]).T
    matrix_confusion['1_algorithm'] = np.array([f_pos,tr_pos]).T
    matrix_confusion = matrix_confusion.rename(index={0: '0_target', 1: '1_target'})
    return matrix_confusion
The seventh function is designed to visualize a report on the algorithm's performance, which includes the confusion matrix and the values of precision, recall and f1.
Code
def report_score(tr_matrix_confusion,
                 cv_matrix_confusion,
                 data,tridx,cvidx,
                 X_tr,X_cv):
    # print some important information
    print ('\033[1m'+'Matrix confusion on train data'+'\033[m')
    display(tr_matrix_confusion)
    print ()
    print(classification_report(data.iloc[tridx]["target"].values, X_tr['predict_target']))
    print ('******************************************************')
    print ()
    print ()
    print ('\033[1m'+'Matrix confusion on test(cv) data'+'\033[m')
    display(cv_matrix_confusion)
    print ()
    print(classification_report(data.iloc[cvidx]["target"].values, X_cv['predict_target']))
    print ('******************************************************')
With the help of the eighth and ninth functions, we will analyze the usefulness of the features for the LightGBM model used, in terms of the 'gain' (information gain) value of each feature under study.
Code
def table_gain_coef(model,features,start,stop):
    data_gain = pd.DataFrame()
    data_gain['Features'] = features
    data_gain['Gain'] = model.booster_.feature_importance(importance_type='gain')
    return data_gain.sort_values('Gain', ascending=False)[start:stop]

def gain_hist(df,size,start,stop):
    fig, ax = plt.subplots(figsize=(size))
    x = (df.sort_values('Gain', ascending=False)['Features'][start:stop])
    y = (df.sort_values('Gain', ascending=False)['Gain'][start:stop])
    plt.bar(x,y)
    plt.xlabel('Features')
    plt.ylabel('Gain')
    plt.xticks(rotation=90)
    plt.show()
The tenth function is needed to form, for each pair of companies, an array of the numbers of matching words.
The same function can also be used to form an array of the numbers of non-matching words.
Code
def compair_metrics(data):
    duplicate_count = []
    duplicate_sum = []
    for i in range(len(data)):
        count = len(data[i])
        duplicate_count.append(count)
        if count <= 0:
            duplicate_sum.append(0)
        elif count > 0:
            temp_sum = 0
            for j in range(len(data[i])):
                temp_sum += len(data[i][j])
            duplicate_sum.append(temp_sum)
    return duplicate_count,duplicate_sum
The eleventh function transliterates Russian text into the Latin alphabet.
Code
def transliterate(data):
    text_transliterate = []
    for i in range(data.shape[0]):
        temp_list = list(data[i:i+1])
        temp_str = ''.join(temp_list)
        result = cyrtranslit.to_latin(temp_str,'ru')
        text_transliterate.append(result)
    return text_transliterate
The twelfth function renames the columns of a table after aggregation. Without it, the aggregated columns keep clumsy multi-level names, which makes further work with the table inconvenient.
Code
def rename_agg_columns(id_client,data,rename):
    columns = [id_client]
    for lev_0 in data.columns.levels[0]:
        if lev_0 != id_client:
            for lev_1 in data.columns.levels[1][:-1]:
                columns.append(rename % (lev_0, lev_1))
    data.columns = columns
    return data
The thirteenth and fourteenth functions are needed to display and generate the table of Levenshtein distances and other important indicators.
What kind of table is this, what metrics are in it, and how is it formed? Let us look step by step at how the table is built.
- Step 1. Define the data we need: the pair id, the text of both columns, and the list of holding names (the top 50 petrochemical and construction companies).
- Step 2. In column 1, for each word of each pair, measure the Levenshtein distance to each word from the list of holding names, as well as the length of each word and the ratio of the distance to the length.
- Step 3. If the ratio is at most 0.4, remember the pair id and the ratio for the word from the first name, and move on to the second name of the pair.
- Step 4. If, for some word of the second name, the ratio is also at most 0.4, record the pair id, both distances, the lengths of the words and the ratios, and stop searching within the pair.
- Step 5. If a pair id occurs more than once, group the recorded values by pair id, taking the minimum of each indicator.
- Step 6. Merge the resulting table to the research table.
An important note:
the code was written in a hurry, so the calculation takes quite a while.
Code
def dist_name_to_top_list_view(data,column1,column2,list_top_companies):
    id_pair = []
    r1 = []
    r2 = []
    words1 = []
    words2 = []
    top_words = []
    for n in range(0, data.shape[0], 1):
        for line1 in data[column1][n:n+1]:
            line1 = line1.split()
            for word1 in line1:
                if len(word1) >= 3:
                    for top_word in list_top_companies:
                        dist1 = levenshtein.distance(word1, top_word)
                        ratio = max(dist1/float(len(top_word)),dist1/float(len(word1)))
                        if ratio <= 0.4:
                            ratio1 = ratio
                            break
                    if ratio <= 0.4:
                        for line2 in data[column2][n:n+1]:
                            line2 = line2.split()
                            for word2 in line2:
                                dist2 = levenshtein.distance(word2, top_word)
                                ratio = max(dist2/float(len(top_word)),dist2/float(len(word2)))
                                if ratio <= 0.4:
                                    ratio2 = ratio
                                    id_pair.append(int(data['pair_id'][n:n+1].values))
                                    r1.append(ratio1)
                                    r2.append(ratio2)
                                    break
    df = pd.DataFrame()
    df['pair_id'] = id_pair
    df['levenstein_dist_w1_top_w'] = dist1
    df['levenstein_dist_w2_top_w'] = dist2
    df['length_w1_top_w'] = len(word1)
    df['length_w2_top_w'] = len(word2)
    df['length_top_w'] = len(top_word)
    df['ratio_dist_w1_to_top_w'] = r1
    df['ratio_dist_w2_to_top_w'] = r2
    feature = df.groupby(['pair_id']).agg([min]).reset_index()
    feature = rename_agg_columns(id_client='pair_id',data=feature,rename='%s_%s')
    data = data.merge(feature,on='pair_id',how='left')
    display(data)
    print ('Words:', word1,word2,top_word)
    print ('Levenstein distance:',dist1,dist2)
    print ('Length of word:',len(word1),len(word2),len(top_word))
    print ('Ratio (distance/length word):',ratio1,ratio2)
def dist_name_to_top_list_make(data,column1,column2,list_top_companies):
    id_pair = []
    r1 = []
    r2 = []
    dist_w1 = []
    dist_w2 = []
    length_w1 = []
    length_w2 = []
    length_top_w = []
    for n in range(0, data.shape[0], 1):
        for line1 in data[column1][n:n+1]:
            line1 = line1.split()
            for word1 in line1:
                if len(word1) >= 3:
                    for top_word in list_top_companies:
                        dist1 = levenshtein.distance(word1, top_word)
                        ratio = max(dist1/float(len(top_word)),dist1/float(len(word1)))
                        if ratio <= 0.4:
                            ratio1 = ratio
                            break
                    if ratio <= 0.4:
                        for line2 in data[column2][n:n+1]:
                            line2 = line2.split()
                            for word2 in line2:
                                dist2 = levenshtein.distance(word2, top_word)
                                ratio = max(dist2/float(len(top_word)),dist2/float(len(word2)))
                                if ratio <= 0.4:
                                    ratio2 = ratio
                                    id_pair.append(int(data['pair_id'][n:n+1].values))
                                    r1.append(ratio1)
                                    r2.append(ratio2)
                                    dist_w1.append(dist1)
                                    dist_w2.append(dist2)
                                    length_w1.append(float(len(word1)))
                                    length_w2.append(float(len(word2)))
                                    length_top_w.append(float(len(top_word)))
                                    break
    df = pd.DataFrame()
    df['pair_id'] = id_pair
    df['levenstein_dist_w1_top_w'] = dist_w1
    df['levenstein_dist_w2_top_w'] = dist_w2
    df['length_w1_top_w'] = length_w1
    df['length_w2_top_w'] = length_w2
    df['length_top_w'] = length_top_w
    df['ratio_dist_w1_to_top_w'] = r1
    df['ratio_dist_w2_to_top_w'] = r2
    feature = df.groupby(['pair_id']).agg([min]).reset_index()
    feature = rename_agg_columns(id_client='pair_id',data=feature,rename='%s_%s')
    data = data.merge(feature,on='pair_id',how='left')
    return data
Data preprocessing
In my modest experience, it is data preprocessing, in the broad sense of the word, that takes the most time. Let us proceed in order.
Loading the data
Everything is very simple here. Let us load the data and replace the name of the column with the target label, 'is_duplicate', with 'target'. This is for the convenience of using the functions: some of them were written as part of earlier studies and use the column name 'target' for the target label.
Code
# DOWNLOAD DATA
text_train = pd.read_csv('train.csv')
text_test = pd.read_csv('test.csv')
# RENAME DATA
text_train = text_train.rename(columns={"is_duplicate": "target"})
Let's look at the data
The data is loaded. Let us see how many objects there are in total and how balanced they are.
Code
# ANALYSE BALANCE OF DATA
target_1 = text_train[text_train['target']==1]['target'].count()
target_0 = text_train[text_train['target']==0]['target'].count()
print ('There are', text_train.shape[0], 'objects')
print ('There are', target_1, 'objects with target 1')
print ('There are', target_0, 'objects with target 0')
print ('Balance is', round(100*target_1/target_0,2),'%')
Table 1 'Balance of the labels'
There are almost half a million objects, and they are completely unbalanced: out of roughly 500 thousand objects, fewer than 4 thousand in total have a target label of 1 (less than 1%). Let us look at the table itself: the first five objects labeled 0 and the first five objects labeled 1.
Code
display(text_train[text_train['target']==0].head(5))
display(text_train[text_train['target']==1].head(5))
Table 2 'First 5 objects of class 0', Table 3 'First 5 objects of class 1'
A few simple steps suggest themselves right away: bring the texts to a single register, remove stop words such as 'ltd', remove country names and, along with them, geographical names.
Actually, that is how this task can be solved: you do some preprocessing, make sure it works correctly, run the model, check the quality, and selectively analyze the objects on which the model is wrong. That is how I did my research. But in the article itself the final solution is presented, and the quality of the algorithm after each preprocessing step is not analyzed; the final analysis will be at the end of the article. Otherwise the article would be of an unspeakable size :)
Let's make copies
To be honest, I do not know why, but out of habit I always make copies of the data; I will do so this time as well.
Code
baseline_train = text_train.copy()
baseline_test = text_test.copy()
Convert all the characters in the texts to lowercase
Code
# convert text to lowercase
columns = ['name_1','name_2']
for column in columns:
    lower_str(baseline_train,column)
for column in columns:
    lower_str(baseline_test,column)
Remove the country names
Note what great guys the competition organizers are! Along with the assignment, they provided a notebook with a very simple baseline, which, among other things, contained the following code:
Code
# drop any names of countries
countries = [country.name.lower() for country in pycountry.countries]
for country in tqdm(countries):
    baseline_train.replace(re.compile(country), "", inplace=True)
    baseline_test.replace(re.compile(country), "", inplace=True)
Remove punctuation marks and special characters
Code
# drop punctuation marks
baseline_train.replace(re.compile(r"\s+\(.*\)"), "", inplace=True)
baseline_test.replace(re.compile(r"\s+\(.*\)"), "", inplace=True)
baseline_train.replace(re.compile(r"[^\w\s]"), "", inplace=True)
baseline_test.replace(re.compile(r"[^\w\s]"), "", inplace=True)
Remove the digits
On the first attempt, removing digits from the texts head-on greatly degraded the quality of the model. I give the code below, but it was not actually used.
Also note that, up to this point, we have been performing transformations directly on the columns we were given. Let us now create a new column for each preprocessing step: the more columns, the better. If something goes wrong at some stage of preprocessing, it is enough to roll back to the column from the previous stage; you do not have to redo everything from scratch.
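The column-per-step idea can be sketched like this (a toy frame; in the notebook the frame would be baseline_train, and the regexes mirror the steps above):

```python
import pandas as pd

# toy frame standing in for baseline_train (the names are illustrative)
df = pd.DataFrame({"name_1": ["SIBUR Ltd. 2020", "Michelin, Inc."]})

# each preprocessing step writes a NEW column, so every stage can be
# inspected, and a failed step only requires rolling back one column
df["name_1_lower"] = df["name_1"].str.lower()
df["name_1_no_punct"] = df["name_1_lower"].str.replace(r"[^\w\s]", "", regex=True)
df["name_1_no_digits"] = (df["name_1_no_punct"]
                          .str.replace(r"\d+", "", regex=True)
                          .str.strip())

print(df["name_1_no_digits"].tolist())  # ['sibur ltd', 'michelin inc']
```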
Here is the code that degraded the quality; it needs a more delicate approach.
# # first: make dictionary of frequency every word
# list_words = baseline_train['name_1'].to_string(index=False).split() +\
#     baseline_train['name_2'].to_string(index=False).split()
# freq_words = {}
# for w in list_words:
#     freq_words[w] = freq_words.get(w, 0) + 1
# # second: make data frame of frequency words
# df_freq = pd.DataFrame.from_dict(freq_words,orient='index').reset_index()
# df_freq.columns = ['word','frequency']
# df_freq_agg = df_freq.groupby(['word']).agg([sum]).reset_index()
# df_freq_agg = rename_agg_columns(id_client='word',data=df_freq_agg,rename='%s_%s')
# df_freq_agg = df_freq_agg.sort_values(by=['frequency_sum'], ascending=False)
# # third: make drop list of digits
# string = df_freq_agg['word'].to_string(index=False)
# digits = [digit for digit in string.split() if digit.isdigit()]
# digits = list(set(digits))
# # drop the digits
# baseline_train['name_1_no_digits'] =\
#     baseline_train['name_1'].apply(
#         lambda x: ' '.join([word for word in x.split() if word not in (digits)]))
# baseline_train['name_2_no_digits'] =\
#     baseline_train['name_2'].apply(
#         lambda x: ' '.join([word for word in x.split() if word not in (digits)]))
# baseline_test['name_1_no_digits'] =\
#     baseline_test['name_1'].apply(
#         lambda x: ' '.join([word for word in x.split() if word not in (digits)]))
# baseline_test['name_2_no_digits'] =\
#     baseline_test['name_2'].apply(
#         lambda x: ' '.join([word for word in x.split() if word not in (digits)]))
Remove the first list of stop words. Manually!
Now let us define stop words and remove them from the texts of the company names.
I compiled the list based on a manual review of the training sample. Logically, such a list should be compiled automatically, using the following approaches:
- first, take the top 10 (20, 50, 100) most frequent words;
- second, use standard stop-word libraries for various languages, for example the designations of the organizational and legal forms of companies in various languages (LLC, PJSC, CJSC, ltd, gmbh, inc, etc.);
- third, it makes sense to compile a list of regions in different languages
We will return to the first option, the automatic compilation of a list of the most frequent words, a little later; for now we stay with manual preprocessing.
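The automatic route described above can be sketched as follows (toy names; both token lists here are illustrative, not the lists actually used below):

```python
from collections import Counter

# toy corpus standing in for the name_1/name_2 columns
names = ["sibur trading llc", "michelin gmbh",
         "knauf trading gmbh", "shell trading llc"]

# source 1: the top-N most frequent tokens across all names
counts = Counter(w for name in names for w in name.split())
top_n = {w for w, _ in counts.most_common(2)}

# source 2: standard legal-form abbreviations in several languages
legal_forms = {"llc", "ltd", "gmbh", "inc", "pjsc", "cjsc"}

# protect holding names so they are never dropped (the "cheat" used later)
top_companies = {"sibur", "michelin", "knauf", "shell"}
stop_words = (top_n | legal_forms) - top_companies

print(sorted(stop_words))  # ['cjsc', 'gmbh', 'inc', 'llc', 'ltd', 'pjsc', 'trading']
```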
Code
# drop some stop-words
drop_list = ["ltd.", "co.", "inc.", "b.v.", "s.c.r.l.", "gmbh", "pvt.",
'retail','usa','asia','ceska republika','limited','tradig','llc','group',
'international','plc','retail','tire','mills','chemical','korea','brasil',
'holding','vietnam','tyre','venezuela','polska','americas','industrial','taiwan',
'europe','america','north','czech republic','retailers','retails',
'mexicana','corporation','corp','ltd','co','toronto','nederland','shanghai','gmb','pacific',
'industries','industrias',
'inc', 'ltda', 'ceska republika',
'sibur', 'enterprises', 'electronics', 'products', 'distribution', 'logistics', 'development',
'technologies', 'pvt', 'technologies', 'comercio', 'industria', 'trading', 'internacionais',
'bank', 'sports',
'express','east', 'west', 'south', 'north', 'factory', 'transportes', 'trade', 'banco',
'management', 'engineering', 'investments', 'enterprise', 'city', 'national', 'express', 'tech',
'auto', 'transporte', 'technology', 'and', 'central', 'american',
'logistica','global','exportacao', 'ceska republika', 'vancouver', 'deutschland',
'sro','rus','chemicals','private','distributors','tyres','industry','services','italia','beijing',
'company','the','und']
baseline_train['name_1_non_stop_words'] =\
baseline_train['name_1'].apply(
lambda x: ' '.join([word for word in x.split() if word not in (drop_list)]))
baseline_train['name_2_non_stop_words'] =\
baseline_train['name_2'].apply(
lambda x: ' '.join([word for word in x.split() if word not in (drop_list)]))
baseline_test['name_1_non_stop_words'] =\
baseline_test['name_1'].apply(
lambda x: ' '.join([word for word in x.split() if word not in (drop_list)]))
baseline_test['name_2_non_stop_words'] =\
baseline_test['name_2'].apply(
lambda x: ' '.join([word for word in x.split() if word not in (drop_list)]))
Let us selectively check that the stop words have actually been removed from the texts.
Code
baseline_train[baseline_train.name_1_non_stop_words.str.contains("factory")].head(3)
Table 4 'Selective check of the stop-word removal code'
Everything seems to work. All the stop words separated by spaces have been removed. Just what we wanted. Moving on.
Transliterate the Russian text into the Latin alphabet
For this we use our self-written function and the cyrtranslit library. It seems to work. Checked manually.
Code
# transliteration to latin
baseline_train['name_1_transliterated'] = transliterate(baseline_train['name_1_non_stop_words'])
baseline_train['name_2_transliterated'] = transliterate(baseline_train['name_2_non_stop_words'])
baseline_test['name_1_transliterated'] = transliterate(baseline_test['name_1_non_stop_words'])
baseline_test['name_2_transliterated'] = transliterate(baseline_test['name_2_non_stop_words'])
Let us look at the pair with id 353150. In it, the second column ('name_2') contains the word 'Michelin'; after preprocessing, the word is written as 'mishlen' (see the column 'name_2_transliterated'). Not entirely correct, but clearly better.
Code
pair_id = 353150
baseline_train[baseline_train['pair_id']==353150]
Table 5 'Selective check of the transliteration code'
It's time to start the automatic compilation of the list of the top 50 most frequently occurring words & Drop it smart. The first CHEAT
Let us lift the curtain a little and see what is happening here.
First, we combine the texts of the first and second columns into one array and count, for each unique word, the number of its occurrences.
Then we select the top 50 of these words. It would seem that we could simply remove them, but no: these words sometimes contain the names of holdings ('total', 'knauf', 'shell', ...), and that is very important information which must not be lost, since it will be used further. So we resort to a cheat trick (in a good sense). To begin with, based on a careful, selective study of the training sample, we compile a list of frequently occurring holding names. The list is not complete, otherwise it would not be fair at all :) Although, since we are not chasing a prize, why not. Then we compare the array of the top 50 most frequent words with the list of holding names and remove from the array the words that match a holding name.
Now the second stop-word list is ready; we can remove the words from the texts.
Before that, however, I would like to insert a small comment on the cheat list of holding names. The fact that we compiled the list of holding names from observation makes our life much easier, but in reality such a list could be compiled differently: for example, by taking ratings of the largest companies in the petrochemical, construction, automotive and other industries, combining them, and taking the holding names from there. For the purposes of our research, we limit ourselves to the simple approach. This approach is not forbidden in the competition! Moreover, the organizers do check the work of the candidates for prize places for forbidden techniques. Be careful!
Code
list_top_companies = ['arlanxeo', 'basf', 'bayer', 'bdp', 'bosch', 'brenntag', 'contitech',
'daewoo', 'dow', 'dupont', 'evonik', 'exxon', 'exxonmobil', 'freudenberg',
'goodyear', 'goter', 'henkel', 'hp', 'hyundai', 'isover', 'itochu', 'kia', 'knauf',
'kraton', 'kumho', 'lusocopla', 'michelin', 'paul bauder', 'pirelli', 'ravago',
'rehau', 'reliance', 'sabic', 'sanyo', 'shell', 'sherwinwilliams', 'sojitz',
'soprema', 'steico', 'strabag', 'sumitomo', 'synthomer', 'synthos',
'total', 'trelleborg', 'trinseo', 'yokohama']
# drop top 50 common words (NAME 1 & NAME 2) exept names of top companies
# first: make dictionary of frequency every word
list_words = baseline_train['name_1_transliterated'].to_string(index=False).split() +\
baseline_train['name_2_transliterated'].to_string(index=False).split()
freq_words = {}
for w in list_words:
    freq_words[w] = freq_words.get(w, 0) + 1
# # second: make data frame
df_freq = pd.DataFrame.from_dict(freq_words,orient='index').reset_index()
df_freq.columns = ['word','frequency']
df_freq_agg = df_freq.groupby(['word']).agg([sum]).reset_index()
df_freq_agg = rename_agg_columns(id_client='word',data=df_freq_agg,rename='%s_%s')
df_freq_agg = df_freq_agg.sort_values(by=['frequency_sum'], ascending=False)
drop_list = list(set(df_freq_agg[0:50]['word'].to_string(index=False).split()) - set(list_top_companies))
# # check list of top 50 common words
# print (drop_list)
# drop the top 50 words
baseline_train['name_1_finish'] =\
baseline_train['name_1_transliterated'].apply(
lambda x: ' '.join([word for word in x.split() if word not in (drop_list)]))
baseline_train['name_2_finish'] =\
baseline_train['name_2_transliterated'].apply(
lambda x: ' '.join([word for word in x.split() if word not in (drop_list)]))
baseline_test['name_1_finish'] =\
baseline_test['name_1_transliterated'].apply(
lambda x: ' '.join([word for word in x.split() if word not in (drop_list)]))
baseline_test['name_2_finish'] =\
baseline_test['name_2_transliterated'].apply(
lambda x: ' '.join([word for word in x.split() if word not in (drop_list)]))
This completes the data preprocessing. Let us start generating new features and visually assessing how well they separate the objects into 0 and 1.
Generating and analyzing the features
Calculating the Levenshtein distance
Using the strsimpy library, in each pair (after all the preprocessing), we calculate the Levenshtein distance from the company name in the first column to the company name in the second column.
Code
# create feature with LEVENSTAIN DISTANCE
levenshtein = Levenshtein()
column_1 = 'name_1_finish'
column_2 = 'name_2_finish'
baseline_train["levenstein"] = baseline_train.progress_apply(
lambda r: levenshtein.distance(r[column_1], r[column_2]), axis=1)
baseline_test["levenstein"] = baseline_test.progress_apply(
lambda r: levenshtein.distance(r[column_1], r[column_2]), axis=1)
Calculating the normalized Levenshtein distance
Everything is the same as above, only we count the normalized distance.
Code
# create feature with normalized Levenshtein distance
from strsimpy.normalized_levenshtein import NormalizedLevenshtein

normalized_levenshtein = NormalizedLevenshtein()
column_1 = 'name_1_finish'
column_2 = 'name_2_finish'
baseline_train["norm_levenstein"] = baseline_train.progress_apply(
    lambda r: normalized_levenshtein.distance(r[column_1], r[column_2]), axis=1)
baseline_test["norm_levenstein"] = baseline_test.progress_apply(
    lambda r: normalized_levenshtein.distance(r[column_1], r[column_2]), axis=1)
We've counted; now let's visualize.
Feature visualization
Let's look at the distribution of the 'levenstein' feature.
Code
data = baseline_train
analyse = 'levenstein'
size = (12,2)
dd = data_statistics(data,analyse,title_print='no')
hist_fz(data,dd,analyse,size)
boxplot(data,analyse,size)
Graph No. 1. Histogram and box-and-whisker plot for assessing the significance of the feature.
At first glance, the metric can mark up the data. Clearly not very well, but it can be used.
Let's look at the distribution of the 'norm_levenstein' feature.
Code
data = baseline_train
analyse = 'norm_levenstein'
size = (14,2)
dd = data_statistics(data,analyse,title_print='no')
hist_fz(data,dd,analyse,size)
boxplot(data,analyse,size)
Graph No. 2. Histogram and box-and-whisker plot for assessing the significance of the feature.
Already better. Now let's see how the two features combined separate the objects in the space into classes 0 and 1.
Code
data = baseline_train
analyse1 = 'levenstein'
analyse2 = 'norm_levenstein'
size = (14,6)
two_features(data,analyse1,analyse2,size)
Graph No. 3. Scatter plot.
We got a very good markup. So it was not for nothing that we preprocessed the data :)
Everyone understands that horizontally we have the values of the 'levenstein' metric, vertically the values of 'norm_levenstein', and that the green and black dots are objects of classes 0 and 1. Let's move on.
Let's generate a batch of features by comparing the words of the texts in each pair.
Below we compare the words of the company names. Let's create the following features:
- a list of the words that are duplicated in columns No. 1 and No. 2 of each pair
- a list of the words that are not duplicated
Based on these word lists, we create features that can be fed to the model for training:
- the number of duplicated words
- the number of non-duplicated words
- the sum of the characters of the duplicated words
- the sum of the characters of the non-duplicated words
- the average length of the duplicated words
- the average length of the non-duplicated words
- the ratio of the number of duplicates to the number of non-duplicates
The code here was written in a hurry, so it is probably not very elegant, but it works. It will do for quick research.
Code
# make some information about duplicates and differences for TRAIN
column_1 = 'name_1_finish'
column_2 = 'name_2_finish'
duplicates = []
difference = []
for i in range(baseline_train.shape[0]):
    list1 = list(baseline_train[i:i+1][column_1])
    str1 = ''.join(list1).split()
    list2 = list(baseline_train[i:i+1][column_2])
    str2 = ''.join(list2).split()
    duplicates.append(list(set(str1) & set(str2)))
    difference.append(list(set(str1).symmetric_difference(set(str2))))
# continue making information about duplicates
duplicate_count, duplicate_sum = compair_metrics(duplicates)
dif_count, dif_sum = compair_metrics(difference)
# create features with information about duplicates and differences for TRAIN
baseline_train['duplicate'] = duplicates
baseline_train['difference'] = difference
baseline_train['duplicate_count'] = duplicate_count
baseline_train['duplicate_sum'] = duplicate_sum
baseline_train['duplicate_mean'] = baseline_train['duplicate_sum'] / baseline_train['duplicate_count']
baseline_train['duplicate_mean'] = baseline_train['duplicate_mean'].fillna(0)
baseline_train['dif_count'] = dif_count
baseline_train['dif_sum'] = dif_sum
baseline_train['dif_mean'] = baseline_train['dif_sum'] / baseline_train['dif_count']
baseline_train['dif_mean'] = baseline_train['dif_mean'].fillna(0)
baseline_train['ratio_duplicate/dif_count'] = baseline_train['duplicate_count'] / baseline_train['dif_count']
# make some information about duplicates and differences for TEST
column_1 = 'name_1_finish'
column_2 = 'name_2_finish'
duplicates = []
difference = []
for i in range(baseline_test.shape[0]):
    list1 = list(baseline_test[i:i+1][column_1])
    str1 = ''.join(list1).split()
    list2 = list(baseline_test[i:i+1][column_2])
    str2 = ''.join(list2).split()
    duplicates.append(list(set(str1) & set(str2)))
    difference.append(list(set(str1).symmetric_difference(set(str2))))
# continue making information about duplicates
duplicate_count, duplicate_sum = compair_metrics(duplicates)
dif_count, dif_sum = compair_metrics(difference)
# create features with information about duplicates and differences for TEST
baseline_test['duplicate'] = duplicates
baseline_test['difference'] = difference
baseline_test['duplicate_count'] = duplicate_count
baseline_test['duplicate_sum'] = duplicate_sum
baseline_test['duplicate_mean'] = baseline_test['duplicate_sum'] / baseline_test['duplicate_count']
baseline_test['duplicate_mean'] = baseline_test['duplicate_mean'].fillna(0)
baseline_test['dif_count'] = dif_count
baseline_test['dif_sum'] = dif_sum
baseline_test['dif_mean'] = baseline_test['dif_sum'] / baseline_test['dif_count']
baseline_test['dif_mean'] = baseline_test['dif_mean'].fillna(0)
baseline_test['ratio_duplicate/dif_count'] = baseline_test['duplicate_count'] / baseline_test['dif_count']
Let's visualize some of the features.
Code
data = baseline_train
analyse = 'dif_sum'
size = (14,2)
dd = data_statistics(data,analyse,title_print='no')
hist_fz(data,dd,analyse,size)
boxplot(data,analyse,size)
Graph No. 4. Histogram and box-and-whisker plot for assessing the significance of the feature.
Code
data = baseline_train
analyse1 = 'duplicate_mean'
analyse2 = 'dif_mean'
size = (14,6)
two_features(data,analyse1,analyse2,size)
Graph No. 5. Scatter plot.
What a markup, eh! Note that most of the pairs with target label 1 have zero duplicated words in their texts, while most of the pairs whose duplicated name words are on average longer than 12 characters belong to target label 0.
Let's look at the tabular data and prepare a query. Here is the first case: the company names have no duplicated words, yet the companies are the same.
Code
baseline_train[
    (baseline_train['duplicate_mean'] == 0)
    & (baseline_train['target'] == 1)].drop(
        ['duplicate', 'difference',
         'name_1_non_stop_words',
         'name_2_non_stop_words', 'name_1_transliterated',
         'name_2_transliterated'], axis=1)
Clearly, there is a systematic error in our processing: we did not take into account that words can not only be misspelled but also simply written together or, when required, separately. For example, pair No. 9764: "" in the first column versus "" in the second. It is far from obvious, but the companies are the same. Or another example, pair No. 482600: "bridgestoneshenyang" versus "bridgestone".
What can be done? The first thing that came to mind was to compare the names not head-on but with the Levenshtein metric. But here, too, an ambush awaits: the distance between "bridgestoneshenyang" and "bridgestone" is not small at all. Perhaps word lemmatization would help, but it is far from obvious how company names can be lemmatized. Or we could use the Tanimoto coefficient. For now, let's leave this moment to more experienced colleagues and move on.
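To make the ambush concrete: the raw edit distance between the glued and the short spelling is bounded below by the length difference, so it stays large however similar the names look. A quick check, using a standalone dynamic-programming edit distance purely for illustration (the article itself relies on strsimpy):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

d = levenshtein('bridgestoneshenyang', 'bridgestone')
# d == 8: the entire suffix 'shenyang' has to be deleted,
# so normalized by the longer name the distance is 8/19 ≈ 0.42
```

A normalized distance around 0.42 sits right at the boundary of any reasonable cut-off, which is why the head-on comparison is unreliable for such glued names.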
Let's compare the words of the texts with the words from the names of the top 50 holding brands in petrochemicals, construction, and other industries, and obtain a second large batch of features.
In fact, this approach involves two violations of the competition rules:
- first: the «duplicate_name_company» feature was adjusted manually, on the basis of a selective review of the training sample
- second: the list of top holdings was compiled using external data sources
Both techniques are prohibited by the competition rules. The prohibition could be circumvented: the list of holding names would have to be edited automatically from external sources rather than manually on the basis of a selective review of the training sample. But, first, as soon as the list of holdings grows large, the word-by-word comparison in this work becomes very time-consuming, and second, that list would still have to be curated :) Therefore, to keep the research simple, let's just check how much these features improve the quality of the model. Looking ahead: the quality grows astonishingly!
With the first approach everything seems clear, but the second one requires explanation.
So: let's determine the Levenshtein distance from each word of each row in the first column of company names to each word from the list of top petrochemical (and not only) companies.
If the ratio of the Levenshtein distance to the length of the word is less than 0.4, then for the word selected from the top-companies list we determine the ratio of the Levenshtein distance to the word length against each word of the second column (the second company's name).
If the second coefficient (the ratio of the distance to the length of the word from the top list) is also below 0.4, we fix the following values in the table:
- the Levenshtein distance from the word in column No. 1 to the word from the top-companies list
- the Levenshtein distance from the word in column No. 2 to the word from the top-companies list
- the length of the word from column No. 1
- the length of the word from column No. 2
- the length of the word from the top-companies list
- the ratio of the distance to the word length for column No. 1
- the ratio of the distance to the word length for column No. 2
Since there can be more than one match in a row, we select the minimum (aggregation with min).
I would like to draw attention once more to the fact that the proposed method of generating features is quite resource-hungry, and if you obtain the list from an external source, the code that compiles the metrics would have to be modified.
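The body of the dist_name_to_top_list_make helper is not shown in the article (it lives in the author's repository); a minimal sketch of the procedure described above, with a hypothetical function name, a toy top list, and a standalone edit distance in place of strsimpy, could look like this:

```python
def levenshtein(a, b):
    # dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def top_list_features(name_1, name_2, top_list, cutoff=0.4):
    """For one pair: scan the words of name_1 against the top list; when the
    distance/length ratio passes the cutoff, check the words of name_2 as
    well, and keep the candidate with the minimal distances (min-aggregation).
    Returns (dist_w1, dist_w2, len_w1, len_w2, len_top_w, ratio_w1, ratio_w2)
    or None when nothing matches."""
    best = None
    for w1 in name_1.split():
        for top_w in top_list:
            d1 = levenshtein(w1, top_w)
            if d1 / len(top_w) >= cutoff:
                continue
            for w2 in name_2.split():
                d2 = levenshtein(w2, top_w)
                if d2 / len(top_w) >= cutoff:
                    continue
                cand = (d1, d2, len(w1), len(w2), len(top_w),
                        d1 / len(w1), d2 / len(w2))
                if best is None or cand[:2] < best[:2]:
                    best = cand
    return best

feats = top_list_features('mizushima pvc ltd', 'pvc mizushima', ['mizushima'])
```

The real helper presumably applies this over the whole frame and writes the seven columns used later ('levenstein_dist_w1_top_w_min' and friends); the triple-nested loop above is exactly why the method is resource-hungry.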
Code
# create information about duplicate names of petrochemical companies from the top list
dp_train = []
for i in list(baseline_train['duplicate']):
    dp_train.append(''.join(list(set(i) & set(list_top_companies))))
dp_test = []
for i in list(baseline_test['duplicate']):
    dp_test.append(''.join(list(set(i) & set(list_top_companies))))
baseline_train['duplicate_name_company'] = dp_train
baseline_test['duplicate_name_company'] = dp_test
# replace the duplicated name with a number
baseline_train['duplicate_name_company'] = \
    baseline_train['duplicate_name_company'].replace('', 0, regex=True)
baseline_train.loc[baseline_train['duplicate_name_company'] != 0, 'duplicate_name_company'] = 1
baseline_test['duplicate_name_company'] = \
    baseline_test['duplicate_name_company'].replace('', 0, regex=True)
baseline_test.loc[baseline_test['duplicate_name_company'] != 0, 'duplicate_name_company'] = 1
# create some important features about similar words in the data and names of top companies for TRAIN
# (Levenshtein distance, length of word, ratio of distance to length)
baseline_train = dist_name_to_top_list_make(baseline_train,
    'name_1_finish', 'name_2_finish', list_top_companies)
# create some important features about similar words in the data and names of top companies for TEST
# (Levenshtein distance, length of word, ratio of distance to length)
baseline_test = dist_name_to_top_list_make(baseline_test,
    'name_1_finish', 'name_2_finish', list_top_companies)
Let's look at the usefulness of the features through the prism of graphs.
Code
data = baseline_train
analyse = 'levenstein_dist_w1_top_w_min'
size = (14,2)
dd = data_statistics(data,analyse,title_print='no')
hist_fz(data,dd,analyse,size)
boxplot(data,analyse,size)
Very good!
Preparing the data to be fed into the model
We have a large table, and we do not need all of its data for the analysis. Let's look at the names of the table's columns.
Code
baseline_train.columns
Let's select the columns for the analysis.
Let's fix the seed for the reproducibility of the results.
Code
# fix some parameters
features = ['levenstein', 'norm_levenstein',
            'duplicate_count', 'duplicate_sum', 'duplicate_mean',
            'dif_count', 'dif_sum', 'dif_mean', 'ratio_duplicate/dif_count',
            'duplicate_name_company',
            'levenstein_dist_w1_top_w_min', 'levenstein_dist_w2_top_w_min',
            'length_w1_top_w_min', 'length_w2_top_w_min', 'length_top_w_min',
            'ratio_dist_w1_to_top_w_min', 'ratio_dist_w2_to_top_w_min']
seed = 42
Before training the model on all the available data and sending the solution for validation, it makes sense to test the model. To do this, we split the training sample into a conditional train and a conditional test. We will measure the quality on the latter, and if it suits us, we will send the solution to the contest.
Code
# provides train/test indices to split data in train/test sets
split = StratifiedShuffleSplit(n_splits=1, train_size=0.8, random_state=seed)
tridx, cvidx = list(split.split(baseline_train[features],
baseline_train["target"]))[0]
print ('Split baseline data train',baseline_train.shape[0])
print (' - new train data:',tridx.shape[0])
print (' - new test data:',cvidx.shape[0])
Setting up and training the model
As the model, we will use decision trees from the LightGBM library.
There is little point in fiddling with the parameters at length. Let's look at the code.
Code
# train a LightGBM classifier
import lightgbm as lgb

seed = 50
params = {'n_estimators': 1,
          'objective': 'binary',
          'max_depth': 40,
          'min_child_samples': 5,
          'learning_rate': 1,
          # 'reg_lambda': 0.75,
          # 'subsample': 0.75,
          # 'colsample_bytree': 0.4,
          # 'min_split_gain': 0.02,
          # 'min_child_weight': 40,
          'random_state': seed}
model = lgb.LGBMClassifier(**params)
model.fit(baseline_train.iloc[tridx][features].values,
          baseline_train.iloc[tridx]["target"].values)
The model has been configured and trained. Now let's look at the results.
Code
# make predicted probabilities and predicted targets
probability_level = 0.99
X = baseline_train
X_tr, X_cv = contingency_table(X, features, probability_level, tridx, cvidx, model)
train_matrix_confusion = matrix_confusion(X_tr)
cv_matrix_confusion = matrix_confusion(X_cv)
report_score(train_matrix_confusion,
             cv_matrix_confusion,
             baseline_train,
             tridx, cvidx,
             X_tr, X_cv)
Note that we use the f1 metric as the model's score: this means it makes sense to tune the probability threshold at which an object is assigned to class 1 or 0. We chose a level of 0.99, i.e. when the probability is 0.99 or higher an object is assigned to class 1, and below 0.99 to class 0. This is an important point: the score can be improved quite substantially by this simple, not-at-all-tricky move.
The quality seems decent. On the conditional test sample, the algorithm erred on 222 objects when identifying class 0, and on 90 objects belonging to class 0 it erred by assigning them to class 1 (see the confusion matrix for the test (cv) data).
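The 0.99 level need not be guessed: given the validation labels and the model's predict_proba output, the cut-off can be chosen by a simple sweep. A dependency-free sketch with toy numbers (the variable names here are illustrative, not from the article's code):

```python
def f1_at_threshold(y_true, proba, thr):
    """F1 for the class-1 predictions obtained by thresholding probabilities."""
    pred = [int(p >= thr) for p in proba]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# toy scores: class 1 is rare, and some class-0 objects get high probabilities
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
proba  = [0.10, 0.95, 0.30, 0.99, 0.98, 1.00, 0.20, 0.40]
best_thr = max([0.5, 0.9, 0.99], key=lambda t: f1_at_threshold(y_true, proba, t))
# here best_thr == 0.99: the stricter cut-off removes the false positives
```

On real data the sweep would run over a fine grid of thresholds against the conditional-test probabilities.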
Let's see which features turned out to be the most important and which did not.
Code
start = 0
stop = 50
size = (12,6)
tg = table_gain_coef(model,features,start,stop)
gain_hist(tg,size,start,stop)
display(tg)
Note that to assess feature importance we use the 'gain' parameter rather than 'split'. In a very simplified form: the first parameter measures a feature's contribution to the reduction of entropy, while the second shows how many times the feature was used to partition the space. Gain looks more important here.
At first glance, the feature 'levenstein_dist_w1_top_w_min', on which we spent so long, turned out to be completely useless: its contribution is 0. But that is only at first glance: its meaning is almost entirely duplicated by the 'duplicate_name_company' feature. If we remove 'duplicate_name_company' and keep 'levenstein_dist_w1_top_w_min', the second feature is simply used in place of the first, and the quality does not change. Check it!
In general, such a table is convenient, especially when you have hundreds of features and a model with 5000 iterations. You can remove features step by step and check how the quality changes after each such reasonable action. In our case, removing the feature did not affect the quality.
Now let's look at the error table. First, at the objects labeled 'False Positive', i.e. those the algorithm assigned to class 1, deciding that the names match, whereas in reality they belong to class 0.
Code
X_cv[X_cv['False_Positive']==1][0:50].drop(['name_1_non_stop_words',
'name_2_non_stop_words', 'name_1_transliterated',
'name_2_transliterated', 'duplicate', 'difference',
'levenstein',
'levenstein_dist_w1_top_w_min', 'levenstein_dist_w2_top_w_min',
'length_w1_top_w_min', 'length_w2_top_w_min', 'length_top_w_min',
'ratio_dist_w1_to_top_w_min', 'ratio_dist_w2_to_top_w_min',
'True_Positive','True_Negative','False_Negative'],axis=1)
Frankly, in some places even a human could not decide whether it is 0 or 1. For example, pair No. 146825: 'mitsubishicorp' versus 'mitsubishicorpl'. The eye says it is the same thing, but the sample says they are different companies. Whom to believe?
Let's just say that we squeezed out what could be squeezed out quickly; the rest of the work we leave to more experienced colleagues :)
Let's upload the data to the organizers' website and find out the assessment of the quality of our work.
Competition results
Code
model = lgb.LGBMClassifier(**params)
model.fit(baseline_train[features].values,
          baseline_train["target"].values)
sample_sub = pd.read_csv('sample_submission.csv', index_col="pair_id")
sample_sub['is_duplicate'] = (model.predict_proba(
    baseline_test[features].values)[:, 1] > probability_level).astype(int)
sample_sub.to_csv('baseline_submission.csv')
So, the score with the forbidden techniques taken into account: 0.5999
Without them, the quality lands somewhere between 0.3 and 0.4. To say exactly, the model would have to be rerun, but I am a little too lazy :)
Let's summarize the experience.
First, as you can see, we ended up with quite reproducible code and a fairly sensible file structure. Because of my limited experience, I once got badly burned: working in a hurry, just to squeeze out at least a somewhat tolerable score, I wrote carelessly, and as a result, a week later, the file looked so frightening that nothing in it was clear anymore. So my message is: write the code right away so that the file remains readable. Then, even coming back to the data within a year, you first look at the structure, recall what steps were performed, and can easily reproduce each of them. Of course, if you are a beginner, on the first attempt the files will not be pretty and the code will break; but if you periodically rewrite the code during the research, then by the 5th-7th rewrite you will be surprised yourself at how much cleaner the code has become, and you may even find errors and improve the score. Do not forget about functions: they make the file far easier to read.
Second, every time you process the data, check whether everything went according to plan. To do this, you need to be able to filter tables in pandas. There is a lot of filtering in this work; use it, for your health :)
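A couple of the pandas filtering idioms this relies on, shown on a toy frame (the column names mirror the ones used above, the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'duplicate_mean': [0.0, 3.5, 0.0, 12.0],
                   'target':         [1,   0,   0,   0]})

# boolean masks combined with & — the parentheses are mandatory
suspicious = df[(df['duplicate_mean'] == 0) & (df['target'] == 1)]

# the same filter written as a query string
suspicious_q = df.query('duplicate_mean == 0 and target == 1')
```

Chaining two separate selections as df[mask1][mask2], with both masks built on the original frame, triggers reindexing warnings and can silently misalign; a single combined mask avoids that.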
Third, in classification tasks, always, literally always, form both a contingency table and a confusion matrix. From the table you can easily find the objects on which the algorithm made a mistake. Start with the errors I would call systematic: they usually take less work to fix and yield a bigger gain. Then, once you have dealt with the systematic errors, move on to the special cases. The confusion matrix shows where the algorithm makes more mistakes, in class 0 or in class 1, and which errors to attack. For example, I noticed that my tree identified class 1 well but made a lot of mistakes in class 0, i.e. the tree often said 'this object is class 1' when it was actually 0. This turned out to be tied to the probability threshold for classifying an object as 0 or 1: my threshold had been fixed at 0.9. Raising the threshold for assigning an object to class 1 to 0.99 made the selection of class-1 objects stricter, and the score improved noticeably.
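The bookkeeping behind a binary confusion matrix fits in a few lines; a dependency-free sketch (the article uses its own matrix_confusion helper, whose exact output format is not shown):

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) — the order sklearn's confusion_matrix().ravel() uses."""
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return tn, fp, fn, tp

tn, fp, fn, tp = confusion_counts([0, 0, 1, 1, 0, 1], [0, 1, 1, 0, 0, 1])
# fp is where the tree in this article suffered most: class-0 objects called class 1
```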
I repeat once more: my purpose in taking part in the contest was not to win a prize but to gain experience. Considering that before the contest started I had no idea at all how to work with text in machine learning, and that within just a few days I arrived at a simple but working model, we can say the goal was achieved. Besides, for any beginner samurai in the world of data science, I think gaining experience matters more than a prize; rather, the experience is the prize. So do not be afraid to take part in competitions: go for it, everyone into battle!
At the time this article is published, the competition has not yet finished. Based on its final results, I will write in the comments to the article about the approaches and features that improve the model's quality, and about the best fair-play moves.
And you, dear reader, if you already have ideas on how to boost the score, write them in the comments. Do a good deed :)
Sources and auxiliary materials
- Data and Jupyter notebook on GitHub
- SIBUR CHALLENGE 2020 competition platform
- Website of the organizers of the SIBUR CHALLENGE 2020 competition
- A good article: "Fundamentals of natural language processing for text"
- Another good article: "Fuzzy string comparison: understand it if you can"
- A publication from the APNI magazine
- An article about the Tanimoto coefficient (string similarity; not used here)