Unconventional Sentiment Analysis: BERT vs. CatBoost

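Introduction

Sentiment analysis is a natural language processing (NLP) technique used to determine whether data (text) is positive, negative, or neutral.

Sentiment analysis is fundamental to understanding the emotional nuances of language. It makes it possible to automatically sort the opinions behind reviews, social media discussions, comments, and so on.

Sentiment analysis has become very popular in recent years, but research on it has been going on since the early 2000s. Traditional machine learning methods such as Naive Bayes, Logistic Regression, and Support Vector Machines (SVM) are widely used on large volumes of data because they scale well. Deep learning (DL) methods have since been shown to give the best accuracy on a range of NLP tasks, including sentiment analysis. However, they tend to be slower and more expensive to train and use.

In this article, I want to offer a little-known alternative that combines speed and quality. A comparison and conclusions need a baseline model, so I chose the proven and popular BERT.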

Data

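The data consists of tweets, each labeled with one of three sentiment classes. In the code that follows, the text is read from the `Tweet` column and the label from the `Sentiment` column.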

BERT

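The model is loaded from TensorFlow Hub. TensorFlow Hub is a repository of trained machine learning models, such as BERT and Faster R-CNN, that are ready for fine-tuning and deployment.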

!pip install tensorflow_hub
!pip install tensorflow_text

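small_bert/bert_en_uncased_L-4_H-512_A-8 is one of the smaller BERT models referenced in the paper "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models". The name encodes the architecture: 4 transformer layers, a hidden size of 512, and 8 attention heads. The smaller BERT models are intended for environments with restricted computational resources, and they can be fine-tuned in the same way as the original BERT models.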

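bert_en_uncased_preprocess is the text preprocessing model for BERT. It uses an English vocabulary extracted from Wikipedia and BooksCorpus. Text inputs are normalized the "uncased" way: the text is lower-cased before tokenization into word pieces, and accent markers are stripped.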
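To make "uncased" normalization concrete, here is a minimal sketch of just the lower-casing and accent-stripping step, written with the Python standard library. It is an illustration under that assumption, not the code that runs inside the TensorFlow Hub preprocessing model; the function name `uncased_normalize` is hypothetical, and word-piece tokenization is not shown.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Lower-case text and strip accent markers (combining marks)."""
    lowered = text.lower()
    # NFD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn"), keeping the base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("Café CLOSED, très déçu"))  # cafe closed, tres decu
```

After this step, "Café" and "cafe" map to the same tokens, which is why the model cannot distinguish text by case or accents.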

tfhub_handle_encoder = \
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"
tfhub_handle_preprocess = \
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"

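The classifier is deliberately simple: BERT's pooled output, a dropout layer, and a three-way softmax layer. The aim here is a baseline, not a SOTA (State-of-the-Art) result.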

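import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401, registers the ops used by the preprocessing model
from tensorflow.keras.optimizers import Adam


def build_classifier_model():
    # Raw strings in, class probabilities out.
    text_input = tf.keras.layers.Input(
        shape=(), dtype=tf.string, name='text')

    preprocessing_layer = hub.KerasLayer(
        tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)

    # trainable=True fine-tunes the BERT weights together with the classifier head.
    encoder = hub.KerasLayer(
        tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)

    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(
        3, activation='softmax', name='classifier')(net)
    model = tf.keras.Model(text_input, net)

    # The last layer already applies softmax, so the loss receives
    # probabilities rather than logits.
    loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
    metric = tf.metrics.CategoricalAccuracy('accuracy')
    # `decay` is a legacy optimizer argument; newer Keras releases may
    # require tf.keras.optimizers.legacy.Adam to accept it.
    optimizer = Adam(
        learning_rate=5e-05, epsilon=1e-08, decay=0.01, clipnorm=1.0)
    model.compile(
        optimizer=optimizer, loss=loss, metrics=[metric])
    model.summary()
    return model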

30% of the training data is held out for validation.

from sklearn.model_selection import train_test_split

# Stratify on the label so both parts keep the same class proportions.
train, valid = train_test_split(
    df_train,
    train_size=0.7,
    random_state=0,
    stratify=df_train['Sentiment'])

y_train, X_train = \
    train['Sentiment'], train.drop(['Sentiment'], axis=1)
y_valid, X_valid = \
    valid['Sentiment'], valid.drop(['Sentiment'], axis=1)

# One-hot encode the three classes for the Keras model.
y_train_c = tf.keras.utils.to_categorical(
    y_train.astype('category').cat.codes.values, num_classes=3)
y_valid_c = tf.keras.utils.to_categorical(
    y_valid.astype('category').cat.codes.values, num_classes=3)
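The label encoding above happens in two steps: pandas turns each distinct label into an integer code (assigned in sorted order of the label values), and `to_categorical` turns each code into a one-hot row. The following pure-Python sketch mirrors those two steps so the result is easy to inspect. The label strings are hypothetical, since the actual values in the `Sentiment` column are not shown here, and `encode_labels` is an illustrative helper, not part of pandas or Keras.

```python
def encode_labels(labels):
    """Return (classes, codes, one_hot) for a list of string labels."""
    # pandas assigns category codes in sorted order of the distinct values.
    classes = sorted(set(labels))
    index = {name: code for code, name in enumerate(classes)}
    codes = [index[name] for name in labels]
    # One row per sample, with a single 1.0 in the column of its class.
    one_hot = [[1.0 if col == code else 0.0 for col in range(len(classes))]
               for code in codes]
    return classes, codes, one_hot


# Hypothetical label values; the real names in the `Sentiment` column may differ.
classes, codes, one_hot = encode_labels(
    ["Positive", "Negative", "Neutral", "Positive"])
print(classes)     # ['Negative', 'Neutral', 'Positive']
print(codes)       # [2, 0, 1, 2]
print(one_hot[0])  # [0.0, 0.0, 1.0]
```

Because the codes depend on which labels are present, the training and validation parts must contain the same set of classes for the columns to line up; the stratified split takes care of that.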

The model is trained for five epochs.

classifier_model = build_classifier_model()

history = classifier_model.fit(
    x=X_train['Tweet'].values,
    y=y_train_c,
    validation_data=(X_valid['Tweet'].values, y_valid_c),
    epochs=5)

BERT Accuracy: 0.833859920501709

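A confusion matrix is a table that compares predicted classes with actual classes. Each row corresponds to an actual class and each column to a predicted class (or the other way round), so the diagonal holds the correct predictions.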

The classification report is a text summary of the main classification metrics for each class.
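As a rough illustration of what these two summaries contain, the sketch below builds a confusion matrix and per-class precision, recall, and F1 by hand on toy labels. The data is made up and the helper names are hypothetical; in practice `sklearn.metrics.confusion_matrix` and `sklearn.metrics.classification_report` compute the same quantities.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def per_class_report(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(matrix)
    report = []
    for c in range(n):
        tp = matrix[c][c]
        predicted_c = sum(matrix[r][c] for r in range(n))  # column total
        actual_c = sum(matrix[c])                          # row total
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report.append((precision, recall, f1))
    return report


# Toy labels (not the article's data): 0, 1, 2 stand for the three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
for label, (p, r, f1) in enumerate(per_class_report(cm)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Accuracy is the sum of the diagonal divided by the total number of samples, which is 4/6 in this toy case.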


CatBoost

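CatBoost is a high-performance open-source library for gradient boosting on decision trees. Since release 0.19.1, it supports text features for classification on GPU out of the box.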

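The main advantage is that CatBoost can use categorical and text features in your data without additional preprocessing. For those who value inference speed, CatBoost predictions are reported to be 20–40 times faster than those of other open-source gradient boosting libraries, which makes CatBoost useful for latency-critical tasks.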

!pip install catboost

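The helper below creates the model and fits it. Any extra parameters are passed through as keyword arguments.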

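from catboost import CatBoostClassifier


def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=5000,
        eval_metric='Accuracy',
        od_type='Iter',  # overfitting detector: stop when the metric stops improving
        od_wait=500,     # number of iterations to wait for an improvement
        **kwargs
    )

    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=100,
        plot=True,
        use_best_model=True)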
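The `od_type='Iter'` and `od_wait=500` settings tell CatBoost to stop training once the evaluation metric has not improved for 500 iterations, and `use_best_model=True` keeps the model as of the best iteration. The sketch below shows that stopping rule on a list of toy accuracy values. It is a conceptual illustration only, not CatBoost's implementation, and `iter_overfitting_detector` is a hypothetical name.

```python
def iter_overfitting_detector(metric_per_iteration, od_wait):
    """Return (best_iteration, stop_iteration) for a higher-is-better metric."""
    best_value = float("-inf")
    best_iteration = -1
    for iteration, value in enumerate(metric_per_iteration):
        if value > best_value:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= od_wait:
            # No improvement for od_wait iterations: stop training here.
            return best_iteration, iteration
    # Ran through every iteration without triggering the detector.
    return best_iteration, len(metric_per_iteration) - 1


# Toy validation accuracies (not real training output).
accuracies = [0.70, 0.75, 0.80, 0.79, 0.80, 0.78, 0.77]
print(iter_overfitting_detector(accuracies, od_wait=3))  # (2, 5)
```

With a budget of 5000 iterations, this rule usually ends training much earlier, and the returned model corresponds to the best validation score rather than the last iteration.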

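CatBoost takes its training data as a Pool. Pool is a convenience wrapper that combines features, labels, and further metadata such as categorical and text features.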

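text_features: a one-dimensional array of text column indices (specified as integers) or names (specified as strings). Use it only if the data parameter is a two-dimensional feature matrix (one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series). If any elements of this array are given as names rather than indices, names for all columns must be provided. To do this, either use the feature_names parameter to specify them explicitly, or pass a pandas.DataFrame with column names in the data parameter.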
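The `train_pool` and `valid_pool` objects used for training are not constructed in the text, so here is a minimal sketch of how they could be built. It assumes that the `Tweet` column is the only feature passed to the model, which may differ from the original notebook; the helper name `make_text_pools` is hypothetical. `Pool`, `data`, `label`, and `text_features` are the CatBoost names described above.

```python
def make_text_pools(X_train, y_train, X_valid, y_valid, text_column="Tweet"):
    """Wrap the train/validation splits in CatBoost Pools with one text feature."""
    # Imported here so the helper can be defined even where catboost is absent.
    from catboost import Pool

    train_pool = Pool(
        data=X_train[[text_column]],  # a DataFrame, so the column name is known
        label=y_train,
        text_features=[text_column])
    valid_pool = Pool(
        data=X_valid[[text_column]],
        label=y_valid,
        text_features=[text_column])
    return train_pool, valid_pool
```

Called as `train_pool, valid_pool = make_text_pools(X_train, y_train, X_valid, y_valid)`, this passes the labels as they are, since CatBoost does not need the one-hot encoding prepared for the Keras model.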

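tokenizers: tokenizers used to preprocess text feature columns before creating the dictionary.
dictionaries: dictionaries used to preprocess text feature columns.
feature_calcers: feature calcers used to calculate new features based on the preprocessed text feature columns.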
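The three parameters describe a pipeline: text is split into tokens, the most frequent tokens get ids in a dictionary, and a feature calcer turns each text into numeric features. The sketch below imitates that pipeline in plain Python with a whitespace tokenizer and a binary bag-of-words calcer. It is a conceptual illustration only; CatBoost's own tokenizers, dictionaries, and BoW calcer are more elaborate, and all function names here are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def build_dictionary(texts, max_dictionary_size):
    """Map the most frequent tokens to integer ids."""
    counts = Counter(token for text in texts for token in tokenize(text))
    most_common = counts.most_common(max_dictionary_size)
    return {token: idx for idx, (token, _) in enumerate(most_common)}


def bag_of_words(text, dictionary):
    """Binary vector: 1 if the dictionary token occurs in the text, else 0."""
    present = set(tokenize(text))
    vector = [0] * len(dictionary)
    for token, idx in dictionary.items():
        if token in present:
            vector[idx] = 1
    return vector


# Toy texts (not the article's data).
texts = ["great flight great crew", "flight delayed", "crew was great"]
dictionary = build_dictionary(texts, max_dictionary_size=3)
vector = bag_of_words("great crew today", dictionary)
print(sorted(dictionary))  # ['crew', 'flight', 'great']
print(vector)
```

The dictionary size caps the number of generated features, which is the role `max_dictionary_size` and `top_tokens_count` play in the CatBoost configuration.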

The model is then fitted with the text-processing parameters set explicitly.

model = fit_model(
    train_pool, valid_pool,
    learning_rate=0.35,
    tokenizers=[
        {
            'tokenizer_id': 'Sense',
            'separator_type': 'BySense',
            'lowercasing': 'True',
            'token_types':['Word', 'Number', 'SentenceBreak'],
            'sub_tokens_policy':'SeveralTokens'
        }      
    ],
    dictionaries = [
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': '50000'
        }
    ],
    feature_calcers = [
        'BoW:top_tokens_count=10000'
    ]
)

CatBoost model accuracy: 0.8299104791995787

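The results are very close. Can the two models be combined for a further gain? The simplest ensemble is to average their predicted class probabilities and take the most probable class.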

y_proba_avg = np.argmax((y_proba_cb + y_proba_bert)/2, axis=1)

Average accuracy: 0.855713533438652
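The averaging step is small enough to trace by hand. The following sketch does the same thing as the one-line NumPy expression, on toy probabilities for two samples, so it is clear how averaging can change a prediction. The numbers are made up, and `average_and_predict` is an illustrative name.

```python
def average_and_predict(proba_a, proba_b):
    """Average two lists of per-class probabilities and return class indices."""
    predictions = []
    for row_a, row_b in zip(proba_a, proba_b):
        averaged = [(a + b) / 2 for a, b in zip(row_a, row_b)]
        # argmax: index of the largest averaged probability
        predictions.append(max(range(len(averaged)), key=averaged.__getitem__))
    return predictions


# Toy probabilities for two samples and three classes (not the article's outputs).
proba_cb = [[0.50, 0.40, 0.10], [0.20, 0.30, 0.50]]
proba_bert = [[0.20, 0.70, 0.10], [0.10, 0.20, 0.70]]
print(average_and_predict(proba_cb, proba_bert))  # [1, 2]
```

For the first sample, the first model alone would pick class 0, but the second model's stronger vote for class 1 wins after averaging. This only works if both models order their class columns the same way, so the class-to-index mapping should be checked before averaging.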

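In this article, I built a baseline model with BERT, built a CatBoost model using its built-in text processing, and looked at what happens when the results of both models are averaged.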

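In my opinion, complex and slow SOTA solutions can be avoided in most cases, especially when speed is a critical requirement.

CatBoost provides good text sentiment analysis capabilities out of the box. For competition enthusiasts on platforms such as Kaggle and DrivenData, CatBoost can serve well both as a baseline solution and as part of an ensemble of models.

The code for this article can be found here.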
