Clustering and Classification of Big Text Data with Machine Learning in Java. Article #3: Architecture/Results

Hello, Habr! Today's article is the final part of the series on clustering and classifying big text data with machine learning in Java. It continues the first and second articles.









This article describes the system architecture, the algorithm, and the visual results. All the theoretical and algorithmic details are covered in the first two articles.









The system architecture can be divided into two main parts: the web application, and the data clustering and classification software.









The machine learning algorithm consists of three main parts:





  1. Natural language processing;
    1. Tokenization;
    2. Lemmatization;
    3. Stop-word filtering;
    4. Word frequency;
  2. Clustering methods;
    1. TF-IDF;
    2. SVD;
    3. Finding cluster groups;
  3. Classification method: Aylien API.





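Natural language processing

The algorithm starts by reading the text data. Because our system is an electronic library, most of the books are in PDF format. The NLP implementation and its details are covered in the earlier articles of this series.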
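The preprocessing steps listed above can be sketched in plain Java as follows. This is a minimal illustration rather than the project's implementation: the class name `TextPreprocessing`, the tiny stop-word list, and the rule of splitting on non-letter characters are all assumptions, and lemmatization is left out because it needs a dictionary or an NLP library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class TextPreprocessing {
    // Hypothetical, deliberately tiny stop-word list (illustration only).
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "in", "is", "of", "the", "to");

    /** Lower-cases the text, splits it on non-letter characters, and drops stop words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Counts how many times each token occurs, keeping first-seen order. */
    static Map<String, Integer> wordFrequency(List<String> tokens) {
        Map<String, Integer> frequency = new LinkedHashMap<>();
        for (String token : tokens) {
            frequency.merge(token, 1, Integer::sum);
        }
        return frequency;
    }

    public static void main(String[] args) {
        List<String> tokens =
                tokenize("The design space of rendered robot faces: a robot face is critical.");
        System.out.println(tokens);
        System.out.println(wordFrequency(tokens));
    }
}
```

In a real pipeline the text would first be extracted from the PDF files, and a lemmatizer would run between tokenization and stop-word filtering.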





:





  : 4173415
    : 88547
    : 82294
      
      











Sample of the text after lemmatization:





characterize, design, space, render, robot, face, alisa, kalegina, university, washington, seattle, washington, grace, schroeder, university, washington, seattle, washington, aidan, allchin, lakeside, also, il, school, seattle, washington, keara, berlin, macalester, college, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, university, washington, seattle, washington, abstract, face, critical, establish, agency, social, robot, building, expressive, mechanical, face, costly, difficult, robot, build, year, face, ren, der, screen, great, flexibility, robot, face, open, design, space, tablish, robot, character, perceive, property, despite, prevalence, robot, render, face, systematic, exploration, design, space, work, aim, fill, gap, conduct, survey, identify, robot, render, face, code, term, property, statistics
      
      



The same text after stemming:





character, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, face
      
      











tf-idf . HashMap, - , - -.
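As an illustration of the tf-idf step, the sketch below keeps one `HashMap` per document, keyed by term. The weighting variant (relative term frequency multiplied by the natural logarithm of the number of documents over the document frequency) and the class name `TfIdf` are assumptions made for this example; the formula actually used in the project is described in the earlier articles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class TfIdf {
    /** Returns one map per document: key = term, value = tf-idf weight. */
    static List<Map<String, Double>> compute(List<List<String>> documents) {
        // Document frequency: in how many documents each term appears.
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            for (String term : new HashSet<>(document)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> document : documents) {
            // Raw term counts inside this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : document) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                double tf = (double) entry.getValue() / document.size();
                double idf = Math.log(
                        (double) documents.size() / documentFrequency.get(entry.getKey()));
                weights.put(entry.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> documents = List.of(
                List.of("robot", "face", "robot"),
                List.of("robot", "design"));
        System.out.println(compute(documents));
    }
}
```

With this variant a term that occurs in every document gets weight zero, which is why very common words contribute nothing to the later clustering.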





-:





tf-idf:









, , tf-idf. :





-0.0031139399383999997 0.023330604746 -1.3650204652799997E-4
-0.038380206566 0.00104373247064 0.056140327901
-0.006980774822399999 0.073057418689 -0.0035209342337999996
-0.0047152503238 0.0017397257449 0.024816828582999998
-0.005195951771999999 0.03189764447 -5.9991080912E-4
-0.008568593700999999 0.114337675179 -0.0088221197958
-0.00337365927 0.022604474721999997 -1.1457816390099999E-4
-0.03938283525 -0.0012682796482399999 0.0023486548592
-0.034341362795999995 -0.00111758118864 0.0036010404917
-0.0039026609385999994 0.0016699372352999998 0.021206653766000002
-0.0079418490394 0.003116062838 0.072380311755
-0.007021828444599999 0.0036496566028 0.07869801528199999
-0.0030219410092 0.018637386319 0.00102082843809
-0.0042041069026 0.023621439238999998 0.0022947637053
-0.0061050946438 0.00114796066823 0.018477825284
-0.0065708646563999995 0.0022944737838999996 0.035902813761
-0.037790461814 -0.0015372596281999999 0.008878823611899999
-0.13264545848599998 -0.0144908102251 -0.033606397957999995
-0.016229093174 1.41831464625E-4 0.005181988760999999
-0.024075296507999996 -8.708131965899999E-4 0.0034344653516999997

      
      











SVD   .





, .  – , . OrientDB , OrientDB . OrientDB , , , . . .





, .









– . , , DBSCAN. . . r=0.007. 562 80.000 , . , .





r = max(D) / n









   max(D)  ‒ , . n -
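A rough sketch of how a radius of the form r = max(D) / n could be computed and then used for a neighbour search. It assumes that D is the set of pairwise Euclidean distances between the document coordinates and treats n as a plain divisor supplied by the caller. Both readings, as well as the class and method names, are assumptions for illustration and not the article's definitions.

```java
import java.util.ArrayList;
import java.util.List;

final class RadiusSearch {
    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Largest distance between any two points (assumed meaning of max(D)). */
    static double maxPairwiseDistance(double[][] points) {
        double max = 0.0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                max = Math.max(max, distance(points[i], points[j]));
            }
        }
        return max;
    }

    /** r = max(D) / n, with n passed in as a tuning divisor (assumption). */
    static double radius(double[][] points, int n) {
        return maxPairwiseDistance(points) / n;
    }

    /** Indices of all other points that lie within distance r of points[index]. */
    static List<Integer> neighbours(double[][] points, int index, double r) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            if (j != index && distance(points[index], points[j]) <= r) {
                result.add(j);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // First three rows of the reduced matrix shown earlier in the article.
        double[][] points = {
                {-0.0031139399383999997, 0.023330604746, -1.3650204652799997E-4},
                {-0.038380206566, 0.00104373247064, 0.056140327901},
                {-0.006980774822399999, 0.073057418689, -0.0035209342337999996}
        };
        double r = radius(points, 2); // the divisor 2 is arbitrary here
        System.out.println("r = " + r);
        System.out.println("neighbours of point 0: " + neighbours(points, 0, r));
    }
}
```

The pairwise scan is quadratic in the number of points, so for tens of thousands of documents a spatial index would be preferable to this brute-force loop.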













, . – , –









, . 4-. ( > nt)





nt = N / S

N‒ - , S ‒ .
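The threshold nt = N / S can be illustrated with the sketch below. It assumes that N is the total number of documents, that S is the number of clusters found, and that a cluster is kept only when its size is greater than nt. These meanings are guesses based on the surviving formula and the "> nt" condition, and the class name `ClusterThreshold` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ClusterThreshold {
    /** nt = N / S, where N = total documents and S = number of clusters (assumed meanings). */
    static double threshold(Map<String, Integer> clusterSizes) {
        int totalDocuments = 0;
        for (int size : clusterSizes.values()) {
            totalDocuments += size;
        }
        return (double) totalDocuments / clusterSizes.size();
    }

    /** Keeps only the clusters whose size is strictly greater than nt. */
    static Map<String, Integer> keepLarge(Map<String, Integer> clusterSizes) {
        double nt = threshold(clusterSizes);
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : clusterSizes.entrySet()) {
            if (entry.getValue() > nt) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> sizes = new LinkedHashMap<>();
        sizes.put("cluster-a", 10);
        sizes.put("cluster-b", 2);
        sizes.put("cluster-c", 3);
        System.out.println("nt = " + threshold(sizes));
        System.out.println("kept = " + keepLarge(sizes));
    }
}
```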









, .





Classification: Aylien API





Aylien API . API json , . API . 9 , . POST API:





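import com.aylien.textapi.parameters.ClassifyByTaxonomyParams;
import com.aylien.textapi.responses.TaxonomyCategory;
import com.aylien.textapi.responses.TaxonomyClassifications;
import com.orientechnologies.orient.core.sql.executor.OResult;
import com.orientechnologies.orient.core.sql.executor.OResultSet;

// Collect the text of every document that belongs to the given cluster.
// A positional parameter (?) replaces string concatenation, so the cluster
// name cannot break or alter the query.
String queryText = "select DocText from documents where clusters = ?";
try (OResultSet resultSet = database.query(queryText, cluster)) {
    while (resultSet.hasNext()) {
        OResult result = resultSet.next();
        // Read the property directly instead of parsing result.toString().
        String textDoc = result.getProperty("DocText");
        if (textDoc != null) {
            // Normalise case and drop line breaks before sending the text to the API.
            keywords.add(textDoc.toLowerCase().replaceAll("\\n", ""));
        }
    }
}

// Classify the collected cluster text against the IAB QAG taxonomy.
ClassifyByTaxonomyParams.Builder classifyByTaxonomyBuilder = ClassifyByTaxonomyParams.newBuilder();
classifyByTaxonomyBuilder.setText(keywords.toString());
classifyByTaxonomyBuilder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(classifyByTaxonomyBuilder.build());
for (TaxonomyCategory c : response.getCategories()) {
    // Store each returned category label as a label of the cluster.
    clusterUpdate.add(c.getLabel());
}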

      
      







GET, :









. .













. . , . . , . , :









Web application





- – . , . - , . Vaadin Flow:









:





  • , .





  • .





  • -.





  • , , , , -.





  • .













“Technology & Computing”:









:









:









, . . , , . . . . : .





, , , -, tf-idf, . , . DBSCAN . . , , . , , , , ..





, NoSQL , OrientDB, 4 NoSQL. , . OrientDB , .





Aylien API, . , 100 . , , , k-, . , .







