Collecting training data for solving NLP problems

Choosing a source and implementation tools

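As the source, we chose habr.com, a collective blog with elements of a news site (it publishes news, analytical articles, and articles on information technology, business, the Internet, and more). All material on this resource is sorted into categories (hubs), of which there are 416 main ones alone. Each item can belong to one or more categories.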

The parser is written in Python. The development environment is a Jupyter notebook in Google Colab. The following libraries are used:

BeautifulSoup: parsing HTML / XML;
Requests: sending HTTP requests;
re: regular expressions;
Pandas: storing the collected data in tabular form.

tqdm and ratelim are also used (to show progress and to limit the request rate).

Posts are addressed by number, so the parser needs only a base URL and the number of posts to try:

mainUrl = 'https://habr.com/ru/post/'
postCount = 10000

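The get_post function builds the post URL, requests the page, extracts the fields, and appends them to the data frame. Errors are handled with try…except around the requests call: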

import re

import pandas as pd
import ratelim
import requests
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

@ratelim.patient(1, 1)
def get_post(postNum):
    currPostUrl = mainUrl + str(postNum)
    try:
        response = requests.get(currPostUrl)
        # Raise HTTPError for 4xx / 5xx responses
        response.raise_for_status()
        # title, post, numComment, rating, ratingUp, ratingDown, bookMark, views
        fields = executePost(response)
        habrParse_df.loc[len(habrParse_df)] = [postNum, currPostUrl, *fields]
    except requests.exceptions.HTTPError:
        # Skip post numbers for which the server returns an HTTP error
        pass

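The decorator limits how often the function is called: here, one call per one-second interval. The try block lets the loop skip post numbers for which the request returns an HTTP error.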
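To make the effect of the decorator concrete, here is a minimal sketch of a "patient" rate limiter written with the standard library only. It is not ratelim's actual implementation; patient_limit, its clock and sleep parameters, and build_post_url are hypothetical names introduced for illustration.

```python
import time
from functools import wraps


def patient_limit(max_calls, period, clock=time.monotonic, sleep=time.sleep):
    """Allow at most `max_calls` calls per `period` seconds, sleeping when over the limit."""
    def decorator(func):
        call_times = []  # timestamps of the calls inside the current window

        @wraps(func)
        def wrapper(*args, **kwargs):
            now = clock()
            # Forget calls that have already left the window.
            while call_times and now - call_times[0] >= period:
                call_times.pop(0)
            if len(call_times) >= max_calls:
                # Wait until the oldest call leaves the window.
                sleep(period - (now - call_times[0]))
                now = clock()
                call_times.pop(0)
            call_times.append(now)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@patient_limit(max_calls=1, period=1)
def build_post_url(post_num):
    # Stand-in for get_post: no network access, just builds the URL.
    return 'https://habr.com/ru/post/' + str(post_num)
```

The clock and sleep arguments exist only so the behaviour can be checked without real waiting; in normal use the defaults are enough.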

The executePost function extracts the required fields from the page.

def executePost(page):
    soup = bs(page.text, 'html.parser')
    # Post title
    title = soup.find('meta', property='og:title')['content']
    # Post body (kept as HTML, newlines replaced with spaces)
    post = str(soup.find('div', id="post-content-body"))
    post = post.replace('\n', ' ')
    # Number of comments
    num_comment = int(soup.find('span', id='comments_count').text.strip())
    # Info panel: holds the rating, bookmark and view counters
    info_panel = soup.find('ul', attrs={'class' : 'post-stats post-stats_post js-user_'})
    # Post rating: try the plain counter first, then the positive and negative variants
    try:
        rating = int(info_panel.find('span', attrs={'class' : 'voting-wjt__counter js-score'}).text)
    except (AttributeError, ValueError):
        rating = info_panel.find('span', attrs={'class' : 'voting-wjt__counter voting-wjt__counter_positive js-score'})
        if rating is not None:
            rating = int(rating.text.replace('+', ''))
        else:
            rating = info_panel.find('span', attrs={'class' : 'voting-wjt__counter voting-wjt__counter_negative js-score'}).text
            rating = -int(rating.replace('–', ''))
    # Up and down votes, read from the title attribute of the first span
    vote = info_panel.find_all('span')[0].attrs['title']
    rating_upVote = int(re.search(r'↑(\d+)', vote).group(1))
    rating_downVote = int(re.search(r'↓(\d+)', vote).group(1))
    # Number of bookmarks
    bookmk = int(info_panel.find_all('span')[1].text)
    # Number of views (left as text)
    views = info_panel.find_all('span')[3].text
    return title, post, num_comment, rating, rating_upVote, rating_downVote, bookmk, views
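The rating branches in executePost differ mainly in the sign character of the counter text. As a possible simplification, a small helper could normalize the sign before converting. This is only a sketch: parse_rating is a hypothetical name that does not appear in the original code, and it assumes the labels look like '+12', '–3' or '0'.

```python
import re


def parse_rating(text):
    """Convert a rating label such as '+12', '–3' or '0' to an int."""
    cleaned = text.strip()
    # Treat the en dash and the Unicode minus sign as an ordinary minus.
    cleaned = re.sub('[–−]', '-', cleaned)
    return int(cleaned)
```

With such a helper, the sign handling would live in one place, although the lookup of the counter element by class would still be needed.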

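The page is parsed with BeautifulSoup: soup = bs(page.text, 'html.parser'). The find / find_all functions then locate the required elements (by tag name and HTML attributes). Which tags and attributes to look for is determined by inspecting the page's HTML.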
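The find / find_all calls can be tried out on a small inline document without any network access. The markup below is hypothetical and heavily simplified; it borrows a few ids from the code above, but the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
SAMPLE_HTML = """
<html><head><meta property="og:title" content="Sample post"></head>
<body>
<div id="post-content-body">First paragraph.
Second paragraph.</div>
<span id="comments_count"> 12 </span>
<ul class="post-stats">
  <li><span class="bookmark">7</span></li>
  <li><span class="views">1,5k</span></li>
</ul>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find returns the first matching element; attributes are read like dict keys.
title = soup.find('meta', property='og:title')['content']
# .text gives the element's text; newlines are replaced with spaces.
body = soup.find('div', id='post-content-body').text.replace('\n', ' ')
num_comments = int(soup.find('span', id='comments_count').text.strip())

# find_all returns every match, in document order.
panel = soup.find('ul', attrs={'class': 'post-stats'})
spans = panel.find_all('span')
bookmarks = int(spans[0].text)
views = spans[1].text
```

Selecting by position, as in spans[0] and spans[1], is short but fragile: it breaks as soon as the page adds or reorders elements.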

The function is then called in a loop over the post numbers from 1 to postCount (10,000 here). tqdm shows the progress.

for pc in tqdm(range(postCount)):
    postNum = pc + 1
    get_post(postNum)

The collected data is stored in a pandas data frame (habrParse_df).
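The text does not show how habrParse_df is created. A minimal sketch, assuming column names that follow the order of the values written by get_post; the names in COLUMNS and all row values are made up for illustration.

```python
import pandas as pd

# Hypothetical column names, in the order get_post writes the values.
COLUMNS = ['postNum', 'url', 'title', 'post', 'numComment', 'rating',
           'ratingUp', 'ratingDown', 'bookMark', 'views']

# Made-up rows standing in for parsed posts.
rows = [
    [1, 'https://habr.com/ru/post/1', 'First title', 'First text', 3, 5, 6, 1, 2, '1,2k'],
    [2, 'https://habr.com/ru/post/2', 'Second title', 'Second text', 0, -1, 1, 2, 0, '300'],
]

habrParse_df = pd.DataFrame(rows, columns=COLUMNS)

# Appending one more row the same way get_post does.
habrParse_df.loc[len(habrParse_df)] = [
    3, 'https://habr.com/ru/post/3', 'Third title', 'Third text', 1, 0, 0, 0, 1, '50',
]
```

Starting from an empty frame, pd.DataFrame(columns=COLUMNS), works the same way with the row-by-row append used in get_post.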

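As a result, we obtained a dataset containing the texts of articles from habr.com along with additional information: title, link to the article, number of comments, rating, number of bookmarks, and number of views.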

In the future, the resulting dataset can be enriched with additional data and used to train various language models, text classifiers, and so on.
