LightGBM extends the gradient boosting algorithm by adding a type of automatic feature selection as well as focusing on boosting examples with larger gradients. This can result in a dramatic speedup of training and improved predictive performance. As such, LightGBM has become a de facto algorithm for machine learning competitions when working with tabular data for regression and classification predictive modeling problems. In this tutorial, you will discover how to develop Light Gradient Boosted Machine ensembles for classification and regression. After completing this tutorial, you will know:
- Light Gradient Boosted Machine (LightGBM) is an efficient open-source implementation of the stochastic gradient boosting ensemble algorithm.
- How to develop LightGBM ensembles for classification and regression with the scikit-learn API.
- How to explore the effect of LightGBM model hyperparameters on model performance.

This tutorial is divided into three parts; they are:
- Light Gradient Boosted Machine Algorithm
- LightGBM Scikit-Learn API
  - LightGBM Ensemble for Classification
  - LightGBM Ensemble for Regression
- LightGBM Hyperparameters
  - Explore Number of Trees
  - Explore Tree Depth
  - Explore Learning Rate
  - Explore Boosting Type
Light Gradient Boosted Machine Algorithm
Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.
Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.
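This error-correcting process can be sketched in a few lines of scikit-learn code (a simplified illustration of boosting with a squared-error loss and shallow regression trees, not LightGBM's implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy one-dimensional regression problem
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# boosting: each new shallow tree is fit to the residual errors
# of the ensemble built so far
prediction = np.zeros_like(y)
trees, learning_rate = [], 0.1
for _ in range(100):
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print('training MSE: %.4f' % np.mean((y - prediction) ** 2))
```

Each new tree only has to model what the ensemble so far gets wrong, which is why the fit improves as trees are added.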
Models are fit using any arbitrary differentiable loss function and a gradient descent optimization algorithm. This gives the technique its name, "gradient boosting," as the loss gradient is minimized as the model is fit, much like a neural network. For more on gradient boosting, see the tutorial "A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning."
LightGBM is an open-source implementation of gradient boosting designed to be efficient, and in some cases more efficient than other implementations.
As such, LightGBM refers to the open-source project, the software library, and the machine learning algorithm itself. In this way, the project is very similar to the Extreme Gradient Boosting (XGBoost) technique.
LightGBM was described by Ke, Guolin, et al. in the 2017 paper titled "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." The implementation introduces two key ideas: GOSS and EFB.
Gradient-based One-Side Sampling (GOSS) is a modification to gradient boosting that focuses attention on those training examples that result in larger gradients, in turn speeding up learning and reducing the computational complexity of the method.
With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. Since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimates of the information gain with a much smaller data size.
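The sampling step can be sketched in plain NumPy (a simplified illustration, not the library's implementation; the fractions a and b and the (1 - a) / b re-weighting follow the description in the paper):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=1):
    """Simplified GOSS: keep the top `a` fraction of instances by
    |gradient|, randomly sample a `b` fraction of the remainder, and
    upweight the sampled small-gradient instances by (1 - a) / b."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(round(a * n))
    n_rest = int(round(b * n))
    order = np.argsort(-np.abs(gradients))   # descending |gradient|
    top = order[:n_top]                      # always kept
    sampled = rng.choice(order[n_top:], size=n_rest, replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.ones(len(idx))
    weights[n_top:] = (1.0 - a) / b          # compensate the down-sampling
    return idx, weights

grads = np.random.default_rng(7).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx))  # → 300: only 30 percent of the instances are used
```

The re-weighting keeps the information-gain estimate roughly unbiased even though most small-gradient instances are discarded.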
Exclusive Feature Bundling (EFB) is an approach for bundling sparse (mostly zero) mutually exclusive features, such as categorical inputs that have been one-hot encoded, into a single feature. As such, it is a type of automatic feature selection.
... we bundle mutually exclusive features (i.e., they rarely take non-zero values simultaneously) to reduce the number of features.
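A toy illustration of the bundling idea (greatly simplified; LightGBM's actual implementation works on feature histograms): two features that never take non-zero values on the same row can share one column by offsetting the value range of the second feature.

```python
import numpy as np

# two sparse features that are never non-zero on the same row
f1 = np.array([3, 0, 0, 1, 0, 2])
f2 = np.array([0, 5, 4, 0, 0, 0])
assert not np.any((f1 != 0) & (f2 != 0))  # mutually exclusive

# bundle: offset f2 by the range of f1 so its values stay distinguishable
offset = f1.max()
bundle = np.where(f2 != 0, f2 + offset, f1)
print(bundle)  # → [3 8 7 1 0 2]
```

One column now carries the information of two, which shrinks the number of features the tree-building step must scan.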
Together, these two changes can accelerate the training time of the algorithm by up to 20x. As such, LightGBM may be considered Gradient Boosted Decision Trees (GBDT) with the addition of GOSS and EFB.
We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.
LightGBM Scikit-Learn API
LightGBM can be installed as a standalone library, and LightGBM models can be developed using the scikit-learn API.
The first step is to install the LightGBM library. On most platforms, this can be done using the pip package manager; for example:
sudo pip install lightgbm
You can confirm the installation and version as follows:
# check lightgbm version
import lightgbm
print(lightgbm.__version__)
Running the script will print the version of LightGBM you have installed. Your version should be the same or higher. If not, you must upgrade your version of LightGBM. If you require specific instructions for your development environment, see the tutorial "LightGBM Installation Guide."
The LightGBM library has its own custom API, although we will use the method via the scikit-learn wrapper classes: LGBMRegressor and LGBMClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.
Both models operate the same way and take the same arguments that influence how the decision trees are created and added to the ensemble. The models use randomness, meaning that each time the algorithm is run on the same data, a slightly different model is produced.
When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions. Let's take a look at how to develop a LightGBM ensemble for both classification and regression.
LightGBM Ensemble for Classification
In this section, we will look at using LightGBM for a classification problem. First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features. The complete example is listed below.
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the input and output components.
(1000, 20) (1000,)
Next, we can evaluate a LightGBM algorithm on this dataset. We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.
# evaluate lightgbm algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = LGBMClassifier()
# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Running the example reports the mean and standard deviation accuracy of the model.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see the LightGBM ensemble with default hyperparameters achieves a classification accuracy of about 92.5 percent on this test dataset.
Accuracy: 0.925 (0.031)
We can also use the LightGBM model as a final model and make predictions for classification. First, the LightGBM ensemble is fit on all available data, then the predict() function can be called to make predictions on new data. The example below demonstrates this on our binary classification dataset.
# make predictions using lightgbm for classification
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = LGBMClassifier()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [0.2929949,-4.21223056,-1.288332,-2.17849815,-0.64527665,2.58097719,0.28422388,-7.1827928,-1.91211104,2.73729512,0.81395695,3.96973717,-2.66939799,3.34692332,4.19791821,0.99990998,-0.30201875,-4.43170633,-2.82646737,0.44916808]
yhat = model.predict([row])
print('Predicted Class: %d' % yhat[0])
Running the example fits the LightGBM ensemble model on the entire dataset and then uses it to make a prediction on a new row of data, as we might when using the model in an application.
Predicted Class: 1
Now that we are familiar with using LightGBM for classification, let's look at the API for regression.
LightGBM Ensemble for Regression
In this section, we will look at using LightGBM for a regression problem. First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features. The complete example is listed below.
# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the input and output components.
(1000, 20) (1000,)
Next, we can evaluate a LightGBM algorithm on this dataset.
As we did with the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE values are better and a perfect model has a MAE of 0. The complete example is listed below.
# evaluate lightgbm ensemble for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from lightgbm import LGBMRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)
# define the model
model = LGBMRegressor()
# evaluate the model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Running the example reports the mean and standard deviation MAE of the model.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome. In this case, we can see the LightGBM ensemble with default hyperparameters achieves a MAE of about 60.
MAE: -60.004 (2.887)
We can also use the LightGBM model as a final model and make predictions for regression. First, the LightGBM ensemble is fit on all available data, then the predict() function can be called to make predictions on new data. The example below demonstrates this on our regression dataset.
# gradient lightgbm for making predictions for regression
from sklearn.datasets import make_regression
from lightgbm import LGBMRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)
# define the model
model = LGBMRegressor()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [0.20543991,-0.97049844,-0.81403429,-0.23842689,-0.60704084,-0.48541492,0.53113006,2.01834338,-0.90745243,-1.85859731,-1.02334791,-0.6877744,0.60984819,-0.70630121,-1.29161497,1.32385441,1.42150747,1.26567231,2.56569098,-0.11154792]
yhat = model.predict([row])
print('Prediction: %d' % yhat[0])
Running the example fits the LightGBM ensemble model on the entire dataset and then uses it to make a prediction on a new row of data, as we might when using the model in an application.
Prediction: 52
Now that we are familiar with using the scikit-learn API to evaluate and use LightGBM ensembles, let's look at configuring the model.
LightGBM Hyperparameters
In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the LightGBM ensemble, and their effect on model performance. There are many hyperparameters we can look at for LightGBM, although in this case we will look at the number of trees and tree depth, the learning rate, and the boosting type. For general advice on tuning LightGBM hyperparameters, see the documentation "LightGBM Parameters Tuning."
Explore Number of Trees
An important hyperparameter for the LightGBM ensemble algorithm is the number of decision trees used in the ensemble. Recall that decision trees are added to the model sequentially in an effort to correct and improve upon the predictions made by prior trees. As a rule, more trees is often better. The number of trees can be specified via the n_estimators argument and defaults to 100. The example below explores the effect of the number of trees with values between 10 and 5,000.
# explore lightgbm number of trees effect on performance
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    trees = [10, 50, 100, 500, 1000, 5000]
    for n in trees:
        models[str(n)] = LGBMClassifier(n_estimators=n)
    return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example first reports the mean accuracy for each configured number of decision trees.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that performance improves on this dataset until about 500 trees, after which performance appears to level out.
>10 0.857 (0.033)
>50 0.916 (0.032)
>100 0.925 (0.031)
>500 0.938 (0.026)
>1000 0.938 (0.028)
>5000 0.937 (0.028)
A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees. We can see the general trend of increasing model performance with ensemble size.
Explore Tree Depth
Varying the depth of each tree added to the ensemble is another important hyperparameter for gradient boosting. The tree depth controls how specialized each tree is to the training dataset: how general or how overfit it might be. Trees are preferred that are not too shallow and general (like AdaBoost) and not too deep and specialized (like bootstrap aggregation).
Gradient boosting generally performs well with trees that have a modest depth, finding a balance between skill and generality. Tree depth is controlled via the max_depth argument and defaults to an unspecified value, as the default mechanism for controlling tree complexity in LightGBM is to use a finite number of leaf nodes.
There are two main ways to control tree complexity: the maximum depth of the trees, and the maximum number of terminal nodes (leaves) in the tree. In this case, we are exploring tree depth, so we set the num_leaves argument to match, increasing the number of leaves to support deeper trees. The example below explores tree depths between 1 and 10 and the effect on model performance.
# explore lightgbm tree depth effect on performance
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    for i in range(1,11):
        models[str(i)] = LGBMClassifier(max_depth=i, num_leaves=2**i)
    return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example first reports the mean accuracy for each configured tree depth.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that performance improves with tree depth, perhaps up to 10 levels. It might be interesting to explore even deeper trees.
>1 0.833 (0.028)
>2 0.870 (0.033)
>3 0.899 (0.032)
>4 0.912 (0.026)
>5 0.925 (0.031)
>6 0.924 (0.029)
>7 0.922 (0.027)
>8 0.926 (0.027)
>9 0.925 (0.028)
>10 0.928 (0.029)
A box and whisker plot is created for the distribution of accuracy scores for each configured tree depth. We can see the general trend of increasing model performance with tree depth to a depth of around five levels, after which performance appears to remain fairly flat.
Explore Learning Rate
Learning rate controls the amount of contribution that each model has on the ensemble prediction. Smaller rates may require more decision trees in the ensemble. The learning rate can be controlled via the learning_rate argument and defaults to 0.1. The example below explores the learning rate and compares the effect of values between 0.0001 and 1.0.
# explore lightgbm learning rate effect on performance
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
    for r in rates:
        key = '%.4f' % r
        models[key] = LGBMClassifier(learning_rate=r)
    return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example first reports the mean accuracy for each configured learning rate.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that a larger learning rate results in better performance on this dataset. We would expect that adding more trees to the ensemble for the smaller learning rates would further lift performance.
>0.0001 0.800 (0.038)
>0.0010 0.811 (0.035)
>0.0100 0.859 (0.035)
>0.1000 0.925 (0.031)
>1.0000 0.928 (0.025)
A box and whisker plot is created for the distribution of accuracy scores for each configured learning rate. We can see the general trend of increasing model performance with the increase in learning rate, all the way up to 1.0.
Explore Boosting Type
A feature of LightGBM is that it supports a number of different boosting algorithms, referred to as boosting types. The boosting type can be specified via the boosting_type argument, which takes a string. The options are:
- 'gbdt': Gradient Boosted Decision Tree (GBDT).
- 'dart': Dropouts meet Multiple Additive Regression Trees (DART).
- 'goss': Gradient-based One-Side Sampling (GOSS).
The default is GBDT, the classical gradient boosting algorithm.
DART is described in the 2015 paper titled "DART: Dropouts meet Multiple Additive Regression Trees" and, as its name suggests, adds the concept of dropout from deep learning to the Multiple Additive Regression Trees (MART) algorithm, a precursor to gradient boosting decision trees.
This algorithm is known by many names, including Gradient TreeBoost, boosted trees, and Multiple Additive Regression Trees (MART). We use the latter name to refer to the algorithm.
GOSS is introduced with the LightGBM paper and the lightgbm library. The approach seeks to only use instances that result in a large error gradient to update the model, dropping the rest.
... we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain.
The example below compares LightGBM on the synthetic classification dataset with the three key boosting techniques.
# explore lightgbm boosting type effect on performance
from numpy import arange
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    types = ['gbdt', 'dart', 'goss']
    for t in types:
        models[t] = LGBMClassifier(boosting_type=t)
    return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example first reports the mean accuracy for each configured boosting type.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that the default boosting method performed better than the other two techniques that were evaluated.
>gbdt 0.925 (0.031)
>dart 0.912 (0.028)
>goss 0.918 (0.027)
A box and whisker plot is created for the distribution of accuracy scores for each configured boosting method, allowing the techniques to be compared directly.
