Clustering and Classifying Big Text Data with Machine Learning in Java. Article #2: Algorithms

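Hello, Habr! Today I continue the topic of clustering and classifying big text data with machine learning in Java. This article is a continuation of the first article.

This article covers the theory and the implementation of the algorithms I used.

1. Tokenization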

Theory:

Tokenization splits a stream of text into individual tokens (words). The implementation used here reads the input line by line and treats every run of non-whitespace characters as one token.

Code:

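Iterator<String> finalIterator = new WordIterator(reader);

import java.io.BufferedReader;
import java.io.IOError;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordIterator implements Iterator<String> {
    // Assumed definition: a token is a maximal run of non-whitespace characters.
    private static final Pattern notWhiteSpace = Pattern.compile("\\S+");

    private final BufferedReader br;
    private String curLine;  // line currently being scanned
    private Matcher matcher; // matcher over curLine
    private String next;     // next token to return, or null once the input is exhausted

    public WordIterator(BufferedReader br) {
        this.br = br;
        curLine = null;
        advance();
    }

    // Moves to the next token, reading further lines as needed.
    private void advance() {
        try {
            while (true) {
                if (curLine == null || !matcher.find()) {
                    String line = br.readLine();
                    if (line == null) {
                        next = null;
                        br.close();
                        return;
                    }
                    matcher = notWhiteSpace.matcher(line);
                    curLine = line;
                    if (!matcher.find())
                        continue;
                }
                next = curLine.substring(matcher.start(), matcher.end());
                break;
            }
        } catch (IOException ioe) {
            throw new IOError(ioe);
        }
    }

    public boolean hasNext() {
        return next != null;
    }

    public String next() {
        if (next == null) throw new NoSuchElementException();
        String token = next;
        advance();
        return token;
    }
}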
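
For experimenting without a file reader, the same idea (one token per run of non-whitespace characters) can be sketched on an in-memory string. This is only an illustration; the class name `WhitespaceTokenizer` is hypothetical and not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceTokenizer {
    // One token = one maximal run of non-whitespace characters.
    private static final Pattern NOT_WHITESPACE = Pattern.compile("\\S+");

    /** Splits the text into tokens; punctuation stays attached to the words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = NOT_WHITESPACE.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hello,  Habr!\nBig text data")) {
            System.out.println(token);
        }
    }
}
```

Punctuation is deliberately left attached to the tokens here; in the pipeline it is stripped by the later filtering and lemmatization steps.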

2. Stop-word removal

Stop words are words that occur so frequently that they carry little information for telling documents apart, so they are removed before further processing.
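Stop-word removal methods:

Classic method: remove stop words taken from precompiled lists.
Methods based on Zipf's law (Z-methods): remove the most frequent words (TF-High), remove words that occur only once (TF1), or remove words with a low inverse document frequency (IDF).
Mutual information (MI): a supervised method that computes the mutual information between a term and a document class (for example, positive or negative). Low mutual information means the term discriminates poorly between classes and should be removed.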

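Term-based random sampling (TBRS): a method for manually detecting stop words in documents. It iterates over separate, randomly selected chunks of data and ranks the terms in each chunk by how informative they are, measured with the Kullback-Leibler divergence:

$$d_x(t) = P_x(t) \log_2 \frac{P_x(t)}{P(t)}$$

where $P_x(t)$ is the normalized frequency of term $t$ within chunk $x$, and $P(t)$ is the normalized frequency of $t$ in the whole collection. The final stop list is built by taking the least informative terms across all documents and removing any duplicates.

Code: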

TokenFilter filter = new TokenFilter().loadFromResource("stopwords.txt");
if (!filter.accept(token)) continue;

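import java.io.BufferedReader;
import java.io.IOError;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;

public class TokenFilter {
    private Set<String> tokens;    // words loaded from the list
    private boolean excludeTokens; // true: listed words are rejected (stop list)
    private TokenFilter parent;    // optional filter that is applied first

    // Loads one word per line from a classpath resource and uses the words as a stop list.
    public TokenFilter loadFromResource(String fileName) {
        ClassLoader classLoader = getClass().getClassLoader();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                classLoader.getResourceAsStream(fileName), Charset.defaultCharset()))) {
            Set<String> words = new HashSet<String>();
            for (String line = null; (line = br.readLine()) != null;)
                words.add(line);

            this.tokens = words;
            this.excludeTokens = true;
            this.parent = null;
        } catch (Exception e) {
            throw new IOError(e);
        }
        return this;
    }

    // Normalizes the token (lowercase, no dots, spaces or digits), then accepts it only if
    // the parent accepts it, it is not in the stop list, it is longer than two characters,
    // and it matches the character pattern.
    public boolean accept(String token) {
        token = token.toLowerCase().replaceAll("[\\. \\d]", "");
        return (parent == null || parent.accept(token))
                && tokens.contains(token) ^ excludeTokens && token.length() > 2 && token.matches("^[-]+");
    }
}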

File (stopwords.txt, one stop word per line):

....
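
The Kullback-Leibler ranking used by TBRS earlier in this section can be sketched as follows. This is a minimal illustration, not the project's code: the class `KlTermRanker` and its method names are hypothetical, and it assumes that the chunk is drawn from the collection, so every term of the chunk also occurs in the collection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KlTermRanker {
    /** Normalized term frequencies: count of each term divided by the number of tokens. */
    static Map<String, Double> normalizedFrequencies(List<String> tokens) {
        Map<String, Double> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1.0, Double::sum);
        }
        double total = tokens.size();
        freq.replaceAll((term, count) -> count / total);
        return freq;
    }

    /** d_x(t) = P_x(t) * log2(P_x(t) / P(t)) for a single term. */
    static double klScore(double pChunk, double pCollection) {
        return pChunk * (Math.log(pChunk / pCollection) / Math.log(2));
    }

    /** Terms of the chunk, ordered from least to most informative. */
    static List<String> rankChunk(List<String> chunk, List<String> collection) {
        Map<String, Double> px = normalizedFrequencies(chunk);
        Map<String, Double> p = normalizedFrequencies(collection);
        List<String> terms = new ArrayList<>(px.keySet());
        terms.sort(Comparator
                .comparingDouble((String t) -> klScore(px.get(t), p.get(t)))
                .thenComparing(Comparator.<String>naturalOrder()));
        return terms;
    }

    public static void main(String[] args) {
        List<String> collection = Arrays.asList(
                "the", "cat", "the", "dog", "the", "fish", "cat", "sat");
        List<String> chunk = collection.subList(0, 4); // stands in for a random chunk
        System.out.println(rankChunk(chunk, collection));
    }
}
```

Terms at the front of the returned list are the stop-word candidates of that chunk. A full TBRS run would repeat this over many random chunks and merge the per-chunk candidates into one list.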

3. Lemmatization

Theory:

Lemmatization reduces the different inflected forms of a word to a single base form. For example, the forms working, works and work are all reduced to work; likewise, computers, computing and computer are reduced to compute.

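Over the years, many tools that provide lemmatization have been developed. Their processing methods differ, but all of them use a word lexicon, a set of rules, or a combination of the two as resources for morphological analysis. The best-known lemmatization tools are: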

WordNet: a lemmatizer built on the WordNet lexical database.
CLEAR: a morphological analyzer that also builds on WordNet.
GENIA tagger: a POS tagger that also returns base forms; its resources include WordNet and the GENIA and PennBioIE corpora.
TreeTagger: another POS tagger that also returns lemmas.
Norm and LuiNorm: lexical normalization programs from the UMLS tools.
MorphAdorner: a lemmatizer that takes POS information into account.
morpha: a morphological analyzer based on about 1,400 rules.
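
The "lexicon plus rules" approach that these tools share can be illustrated with a toy sketch. It is deliberately tiny and does not reproduce any of the tools listed above: the class `ToyLemmatizer`, its lexicon entries and its suffix rules are all hypothetical, and it needs Java 9 or later for `Map.of`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ToyLemmatizer {
    // Lexicon of irregular forms (hypothetical sample entries).
    private static final Map<String, String> LEXICON = Map.of(
            "went", "go",
            "mice", "mouse",
            "better", "good");

    // Suffix rewrite rules, tried in insertion order (hypothetical and far from complete).
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ies", "y"); // studies -> study
        RULES.put("ing", "");  // working -> work
        RULES.put("ed", "");   // worked  -> work
        RULES.put("s", "");    // works   -> work
    }

    /** Looks the word up in the lexicon first, then falls back to the suffix rules. */
    static String lemma(String word) {
        String w = word.toLowerCase();
        String irregular = LEXICON.get(w);
        if (irregular != null) {
            return irregular;
        }
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            String suffix = rule.getKey();
            // Only strip a suffix if at least three letters remain.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length()) + rule.getValue();
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"working", "works", "work", "went", "studies"}) {
            System.out.println(w + " -> " + lemma(w));
        }
    }
}
```

Real lemmatizers add part-of-speech information and far larger lexicons, which is why the pipeline in this article relies on an existing library rather than hand-written rules.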

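import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Build the pipeline once; the lemma annotator needs tokenization, sentence splitting and POS tags.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Keep only Latin letters and lowercase the token before lemmatizing it.
String token = documentTokens.next().replaceAll("[^a-zA-Z]", "").toLowerCase();
Annotation lemmaText = new Annotation(token);
pipeline.annotate(lemmaText);
List<CoreLabel> lemmaToken = lemmaText.get(TokensAnnotation.class);
String word = "";
for (CoreLabel t : lemmaToken) {
    word = t.get(LemmaAnnotation.class); // lemma (base form) of the token
}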

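4. Term frequency-inverse document frequency (TF-IDF)

Term frequency-inverse document frequency (TF-IDF) is the most widely used algorithm for computing the weights of terms (keywords in a document) in modern information retrieval systems. The weight is a statistical measure of how important a word is to a document in a collection or corpus. It increases in proportion to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus.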

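Term frequency (TF) measures how often a term occurs in a document. Below, $d$ is a single document and $D$ is the collection of $N$ documents. The simplest choice is the raw count of term $t$ in document $d$:

$$\mathrm{tf}(t,d) = f_{t,d},$$

where $f_{t,d}$ is the number of occurrences of term $t$ in document $d$.

Other options:

Boolean frequency: $\mathrm{tf}(t,d) = 1$ if $t$ occurs in $d$ and 0 otherwise;

frequency adjusted for document length:

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

logarithmically scaled frequency:

$$\mathrm{tf}(t,d) = \log(1 + f_{t,d})$$

augmented frequency, which divides the raw count by the count of the most frequent term in the document to avoid a bias toward longer documents:

$$\mathrm{tf}(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}$$

Inverse document frequency (IDF) measures how much information a term carries, that is, whether it is common or rare across the collection. It is the logarithm of the total number of documents $N$ divided by the number of documents that contain the term:

$$\mathrm{idf}(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

TF-IDF is the product of TF and IDF. A high weight results from a high term frequency in the given document combined with a low document frequency of the term in the whole collection, so the weights tend to filter out common terms:

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)$$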
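
As a small worked example of the formulas above, the sketch below computes the length-adjusted term frequency and the logarithmic IDF for a three-document toy corpus. It is an illustration only: the class `TfIdfSketch` and the sample data are hypothetical, and documents are assumed to be already tokenized.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {
    /** Length-adjusted term frequency: occurrences of the term / number of tokens in the document. */
    static double tf(String term, List<String> doc) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** idf = log(N / number of documents that contain the term). */
    static double idf(String term, List<List<String>> corpus) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        if (containing == 0) {
            throw new IllegalArgumentException("term does not occur in the corpus: " + term);
        }
        return Math.log((double) corpus.size() / containing);
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("java", "text", "java"),
                Arrays.asList("text", "cluster"),
                Arrays.asList("text", "svd"));
        List<String> first = corpus.get(0);
        // "java" occurs only in the first document, "text" occurs in all three.
        System.out.println("java: " + tfIdf("java", first, corpus));
        System.out.println("text: " + tfIdf("text", first, corpus));
    }
}
```

A term that occurs in every document gets an IDF of log(1) = 0, so its TF-IDF weight vanishes no matter how often it appears. This is the sense in which the weighting filters out common terms.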

private final TObjectIntMap<T> counts;
public int count(T obj) {
    int count = counts.get(obj);
    count++;
    counts.put(obj, count);
    sum++;
    return count;
}

public synchronized int addColumn(SparseArray<? extends Number> column) {
     if (column.length() > numRows)
         numRows = column.length();
    
     int[] nonZero = column.getElementIndices();
     nonZeroValues += nonZero.length;
     try {
         matrixDos.writeInt(nonZero.length);
         for (int i : nonZero) {
             matrixDos.writeInt(i); // write the row index
             matrixDos.writeFloat(column.get(i).floatValue());
         }
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
     return ++curCol;
}

public interface SparseArray<T> {
    int cardinality();
    T get(int index);
    int[] getElementIndices();
    int length();
    void set(int index, T obj);
    <E> E[] toArray(E[] array);
}

public File transform(File inputFile, File outFile, GlobalTransform transform) {
     try {
         DataInputStream dis = new DataInputStream(
             new BufferedInputStream(new FileInputStream(inputFile)));
         int rows = dis.readInt();
         int cols = dis.readInt();
         DataOutputStream dos = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream(outFile)));
         dos.writeInt(rows);
         dos.writeInt(cols);
         for (int row = 0; row < rows; ++row) {
             for (int col = 0; col < cols; ++col) {
                 double val = dis.readFloat();
                 dos.writeFloat((float) transform.transform(row, col, val));
             }
         }
         dos.close();
         return outFile;
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
}

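public double transform(int row, int column, double value) {
        // Term frequency: occurrences of the term relative to the document's total term count
        double tf = value / docTermCount[column];
        // Inverse document frequency with +1 smoothing; the cast avoids integer division
        double idf = Math.log((double) totalDocCount / (termDocCount[row] + 1));
        return tf * idf;
}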

public void factorize(MatrixFile mFile, int dimensions) {
        try {
            String formatString = "";
            switch (mFile.getFormat()) {
            case SVDLIBC_DENSE_BINARY:
                formatString = " -r db ";
                break;
            case SVDLIBC_DENSE_TEXT:
                formatString = " -r dt ";
                break;
            case SVDLIBC_SPARSE_BINARY:
                formatString = " -r sb ";
                break;
            case SVDLIBC_SPARSE_TEXT:
                break;
            default:
                throw new UnsupportedOperationException(
                    "Format type is not accepted");
            }

            File outputMatrixFile = File.createTempFile("svdlibc", ".dat");
            outputMatrixFile.deleteOnExit();
            String outputMatrixPrefix = outputMatrixFile.getAbsolutePath();

            LOG.fine("creating SVDLIBC factor matrices at: " + 
                              outputMatrixPrefix);
            String commandLine = "svd -o " + outputMatrixPrefix + formatString +
                " -w dt " + 
                " -d " + dimensions + " " + mFile.getFile().getAbsolutePath();
            LOG.fine(commandLine);
            Process svdlibc = Runtime.getRuntime().exec(commandLine);
            BufferedReader stdout = new BufferedReader(
                new InputStreamReader(svdlibc.getInputStream()));
            BufferedReader stderr = new BufferedReader(
                new InputStreamReader(svdlibc.getErrorStream()));

            StringBuilder output = new StringBuilder("SVDLIBC output:\n");
            for (String line = null; (line = stderr.readLine()) != null; ) {
                output.append(line).append("\n");
            }
            LOG.fine(output.toString());
            
            int exitStatus = svdlibc.waitFor();
            LOG.fine("svdlibc exit status: " + exitStatus);

            if (exitStatus == 0) {
                File Ut = new File(outputMatrixPrefix + "-Ut");
                File S  = new File(outputMatrixPrefix + "-S");
                File Vt = new File(outputMatrixPrefix + "-Vt");
                U = MatrixIO.readMatrix(
                        Ut, Format.SVDLIBC_DENSE_TEXT, 
                        Type.DENSE_IN_MEMORY, true); // matrix U
                scaledDataClasses = false; 
                
                V = MatrixIO.readMatrix(
                        Vt, Format.SVDLIBC_DENSE_TEXT,
                        Type.DENSE_IN_MEMORY); // matrix V
                scaledClassFeatures = false;


                singularValues =  readSVDLIBCsingularVector(S, dimensions);
            } else {
                // stderr was already drained into 'output' above, so reuse it here
                // warning or error?
                LOG.warning("svdlibc exited with error status.  " + 
                               "stderr:\n" + output.toString());
            }
        } catch (IOException ioe) {
            LOG.log(Level.SEVERE, "SVDLIBC", ioe);
        } catch (InterruptedException ie) {
            LOG.log(Level.SEVERE, "SVDLIBC", ie);
        }
    }

    public MatrixBuilder getBuilder() {
        return new SvdlibcSparseBinaryMatrixBuilder();
    }

    private static double[] readSVDLIBCsingularVector(File sigmaMatrixFile,
                                                      int dimensions)
            throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(sigmaMatrixFile));
        double[] m = new double[dimensions];

        int readDimensions = Integer.parseInt(br.readLine());
        if (readDimensions != dimensions)
            throw new RuntimeException(
                    "SVDLIBC generated the incorrect number of " +
                    "dimensions: " + readDimensions + " versus " + dimensions);

        int i = 0;
        for (String line = null; (line = br.readLine()) != null; )
            m[i++] = Double.parseDouble(line);
        return m;
    }

SVD implementation in Java (from the S-Space library)
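
The factorize method above delegates the actual decomposition to the external svd program. To get a feeling for what a singular value is, the sketch below estimates the largest singular value of a small dense matrix with power iteration on A^T A. This is a different, much simpler technique than the one the external tool uses, it only recovers the dominant singular value, and it assumes the all-ones start vector is not orthogonal to the dominant right singular vector. The class `PowerIterationSvd` is hypothetical.

```java
import java.util.Arrays;

class PowerIterationSvd {
    /** Computes A * v for a dense matrix A. */
    static double[] multiply(double[][] a, double[] v) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < v.length; j++)
                out[i] += a[i][j] * v[j];
        return out;
    }

    /** Computes A^T * u for a dense matrix A. */
    static double[] multiplyTransposed(double[][] a, double[] u) {
        double[] out = new double[a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < out.length; j++)
                out[j] += a[i][j] * u[i];
        return out;
    }

    /** Euclidean norm of a vector. */
    static double norm(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x * x;
        return Math.sqrt(sum);
    }

    /** Estimates the largest singular value of A by power iteration on A^T A. */
    static double topSingularValue(double[][] a, int iterations) {
        double[] v = new double[a[0].length];
        Arrays.fill(v, 1.0); // start vector (assumption: not orthogonal to the solution)
        double sigma = 0.0;
        for (int it = 0; it < iterations; it++) {
            double[] next = multiplyTransposed(a, multiply(a, v)); // A^T (A v)
            double n = norm(next);
            if (n == 0.0) return 0.0; // zero matrix, or v lies in the null space
            for (int j = 0; j < v.length; j++) v[j] = next[j] / n;
            sigma = norm(multiply(a, v)); // ||A v|| for the unit vector v
        }
        return sigma;
    }

    public static void main(String[] args) {
        // Tiny term-document matrix: rows are terms, columns are documents.
        double[][] termDoc = {
            {2, 0, 1},
            {0, 1, 0},
            {1, 0, 1}
        };
        System.out.println(topSingularValue(termDoc, 100));
    }
}
```

A truncated SVD keeps only the few largest singular values and their vectors, which is what the dimensions argument of factorize controls.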

5. Aylien API

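The Aylien Text Analysis API is an API for natural language processing and machine learning on text.

Among other things, it classifies text by taxonomy. Two taxonomies are available, IPTC and IAB-QAG; the example below uses IAB-QAG.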

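The IAB-QAG contextual taxonomy was developed by the IAB (Interactive Advertising Bureau) together with taxonomy experts from academia. It defines content categories on at least two levels to make content classification more consistent. The first level is a broad category, and the second level is a more detailed description within the root-type structure (Figure 6).

To use this API, you need to obtain a key and an ID on the official website. With these credentials you can call the API's POST and GET methods from Java code.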

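// Replace the placeholders with the application ID and key obtained from the website.
private static TextAPIClient client = new TextAPIClient("YOUR_APP_ID", "YOUR_APP_KEY");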

You can then run the classification by passing in the data to classify.

ClassifyByTaxonomyParams.Builder builder = ClassifyByTaxonomyParams.newBuilder();
URL url = new URL("http://techcrunch.com/2015/07/16/microsoft-will-never-give-up-on-mobile");
builder.setUrl(url);
builder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(builder.build());
for (TaxonomyCategory c: response.getCategories()) {
  System.out.println(c);
}

The service returns its response in JSON format.

{
  "categories": [
    {
      "confident": true,
      "id": "IAB19-36",
      "label": "Windows",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19-36",
          "rel": "self"
        },
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "parent"
        }
      ],
      "score": 0.5675236066291172
    },
    {
      "confident": true,
      "id": "IAB19",
      "label": "Technology & Computing",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "self"
        }
      ],
      "score": 0.46704140928338533
    }
  ],
  "language": "en",
  "taxonomy": "iab-qag",
  "text": "When Microsoft announced its wrenching..."
}

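This API is used to classify the clusters obtained with the unsupervised clustering method.

Afterword

There are alternatives and ready-made libraries for every algorithm described above. You just have to look for them. If you liked this article, or if you have ideas or questions, please leave a comment. The third part will be more abstract and will mainly cover the system architecture: a description of the algorithms, which ones were used, and in what order.

It will also present the results after each algorithm is applied, as well as the final result of this work.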
