In this article we touch on a slightly more complex and more interesting topic (at least for me, a developer on a search team): full-text search. We will add an Elasticsearch node to our container setup, learn how to build an index and search content, and use the descriptions of 5,000 films from the TMDB 5000 Movie Dataset as test data. We will also learn how to build search filters and dig a little way into ranking.
Infrastructure: Elasticsearch
Elasticsearch is a popular document store that can build full-text indexes and is, as a rule, used specifically as a search engine. On top of the Apache Lucene engine it is based on, Elasticsearch adds sharding, replication, a convenient JSON API, and a million other details that have made it one of the most popular full-text search solutions.
Let's add a single Elasticsearch node to our docker-compose.yml:
services:
  ...
  elasticsearch:
    image: "elasticsearch:7.5.1"
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"
  ...
The environment variable discovery.type=single-node tells Elasticsearch to get ready to work alone instead of looking for other nodes and merging with them into a cluster (which is the default behavior).
Note that we publish port 9200 to the outside even though our application reaches Elasticsearch inside the network created by docker-compose. This is purely for debugging: it lets us query Elasticsearch directly from the terminal (until we find a smarter way, described below).
Adding the Elasticsearch client to our wiring is not difficult; fortunately, Elastic provides a minimalistic Python client.
Indexing
In the previous article we put our main entities, the cards, into a MongoDB collection. We can quickly fetch a card's content from the collection by its identifier because MongoDB has built a direct index for us, using B-trees.
Now we face the inverse task: getting the identifiers of cards from their content (or fragments of it). For that we need an inverted index. This is where Elasticsearch comes in handy!
The general scheme for building an index usually looks like this.
- Create a new empty index with a unique name and configure it as needed.
- Go through all the entities in the database and put them into the new index.
- Switch production so that all queries start going to the new index.
- Delete the old index. Here you are free to keep the last few indexes around, which is convenient when, for example, you need to debug a problem.
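The "keep the last few indexes" advice from the list above can be sketched as a small pure function. This is only an illustration, assuming index names follow the cards-%Y-%m-%d-%H-%M-%S pattern used by the indexer in this article, so that sorting names alphabetically also sorts them chronologically. The names select_indexes_to_delete and keep_last are hypothetical and not part of the original code.

```python
from typing import Iterable, List


def select_indexes_to_delete(index_names: Iterable[str], keep_last: int = 3,
                             prefix: str = "cards-") -> List[str]:
    """Return the names of old indexes that can be deleted (hypothetical helper)."""
    # Only consider our own timestamped indexes; the timestamp format
    # makes lexicographic order equal to chronological order.
    ours = sorted(name for name in index_names if name.startswith(prefix))
    if keep_last <= 0:
        return ours
    # Everything except the `keep_last` newest indexes is a deletion candidate.
    return ours[:-keep_last]
```

The returned names could then be passed one by one to whatever deletion call the indexer uses; the newest indexes stay available for debugging or rollback.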
Let's start with a skeleton of the indexer and then look at each step in more detail.
import datetime

from elasticsearch import Elasticsearch, NotFoundError

from backend.storage.card import Card, CardDAO


class Indexer(object):

    def __init__(self, elasticsearch_client: Elasticsearch, card_dao: CardDAO, cards_index_alias: str):
        self.elasticsearch_client = elasticsearch_client
        self.card_dao = card_dao
        self.cards_index_alias = cards_index_alias

    def build_new_cards_index(self) -> str:
        # Build a new index.
        # First, come up with a unique name for it.
        index_name = "cards-" + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
        # Create an empty index
        # under that name.
        self.create_empty_cards_index(index_name)
        # Put all our cards into the index one by one
        # (one request per card).
        for card in self.card_dao.get_all():
            self.put_card_into_index(card, index_name)
        return index_name

    def create_empty_cards_index(self, index_name):
        ...

    def put_card_into_index(self, card: Card, index_name: str):
        ...

    def switch_current_cards_index(self, new_index_name: str):
        ...
Indexing: creating the index
An index in Elasticsearch is created by a simple PUT request to /<index-name> or, if you use the Python client (our case), by calling
elasticsearch_client.indices.create(index_name, {
    ...
})
The request body can contain three fields:
- Alias descriptions ("aliases": ...). The alias system lets Elasticsearch itself keep track of which index is currently the up-to-date one; we discuss it below.
- Settings ("settings": ...). Once we are big shots with real production, this is where we will configure replication, sharding, and other SRE delights.
- Data schema ("mappings": ...). Here we can specify the types of the fields in the documents we index, which of those fields need inverted indexes, which aggregations must be supported, and so on.
For now we are only interested in the schema, and ours is very simple:
{
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "english"
            },
            "text": {
                "type": "text",
                "analyzer": "english"
            },
            "tags": {
                "type": "keyword",
                "fields": {
                    "text": {
                        "type": "text",
                        "analyzer": "english"
                    }
                }
            }
        }
    }
}
We marked the name and text fields as English text. An analyzer is an Elasticsearch entity that processes text before it is stored in the index. With the english analyzer, the text is split into tokens at word boundaries (see the documentation for details), each token is reduced to its stem according to the rules of English (for example, the word trees is reduced to tree), overly common words (such as the) are removed, and the remaining terms are put into the inverted index.
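To make the analysis steps described above more concrete, here is a toy sketch in plain Python. It is not the real english analyzer: the stopword list and the "strip a trailing s" rule are deliberately crude stand-ins, and the names toy_analyze and build_inverted_index are hypothetical. It only shows the overall shape: tokenize, normalize, drop common words, and map each remaining term to the documents that contain it.

```python
import re
from collections import defaultdict
from typing import Dict, List, Mapping, Set

# A tiny illustrative stopword list (the real analyzer has its own).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}


def toy_analyze(text: str) -> List[str]:
    """Split text into lowercase tokens, drop stopwords, crudely strip plurals."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOPWORDS:
            continue
        # Very naive "stemming": trees -> tree. Real stemmers do much more.
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms


def build_inverted_index(documents: Mapping[str, str]) -> Dict[str, Set[str]]:
    """Map every analyzed term to the set of document IDs containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in toy_analyze(text):
            index[term].add(doc_id)
    return index
```

Because the query is analyzed the same way as the documents, searching for "tree" finds a document that only mentions "trees".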
The tags field is slightly more complicated. The keyword type assumes that the values of this field are string constants that do not need to be processed by an analyzer; the inverted index is built on their "raw" values, without tokenization or stemming. However, Elasticsearch creates special data structures so that aggregations can be computed over the values of this field (for example, so that together with the search results you can find out which tags occur in the documents matching the query, and how many times). This is ideal for fields that are essentially enums, and we will use this feature to build some cool search filters.
So that the text of the tags can also be found by full-text search, we add a "text" subfield to it, configured by analogy with name and text above. In essence, this means that Elasticsearch creates one more "virtual" field named tags.text in every document it receives, copies the contents of tags into it, and indexes it by different rules.
Indexing: populating the index
To index a document, it is enough to make a PUT request to /<index-name>/_create/<document-id> or, when using the Python client, to call the corresponding method. Our implementation looks like this:
    def put_card_into_index(self, card: Card, index_name: str):
        self.elasticsearch_client.create(index_name, card.id, {
            "name": card.name,
            "text": card.markdown,
            "tags": card.tags,
        })
Pay attention to the tags field. Although we described it as containing a keyword, we send a list of strings rather than a single string. Elasticsearch supports this: our document will be found by any of the values.
Indexing: switching the index
To implement search, we need to know the name of the most recent fully built index. The alias mechanism lets us keep this information on the Elasticsearch side.
An alias is a pointer to zero or more indexes. The Elasticsearch API lets you use an alias name instead of an index name when searching (POST /<alias-name>/_search instead of POST /<index-name>/_search); in that case Elasticsearch searches all the indexes the alias points to.
We will create an alias called cards that always points to the current index. Switching to the new index once it has been built then looks like this:
    def switch_current_cards_index(self, new_index_name: str):
        try:
            # Remove the alias from the old indexes, if there are any.
            remove_actions = [
                {
                    "remove": {
                        "index": index_name,
                        "alias": self.cards_index_alias,
                    }
                }
                for index_name in self.elasticsearch_client.indices.get_alias(name=self.cards_index_alias)
            ]
        except NotFoundError:
            # The alias does not exist yet: this must be the first run.
            # Fine, there is nothing to clean up.
            remove_actions = []
        # Remove the old links and add the new one
        # in a single update_aliases request.
        self.elasticsearch_client.indices.update_aliases({
            "actions": remove_actions + [{
                "add": {
                    "index": new_index_name,
                    "alias": self.cards_index_alias,
                }
            }]
        })
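The core of the switch is the list of alias actions, and that part can be isolated into a pure function that is easy to test without a running cluster. The sketch below mirrors the logic of the method above; make_alias_switch_actions is a hypothetical helper name, not part of the original code.

```python
from typing import Any, Dict, Iterable, List


def make_alias_switch_actions(alias: str, old_index_names: Iterable[str],
                              new_index_name: str) -> List[Dict[str, Any]]:
    """Build the "actions" list: unlink the old indexes, link the new one."""
    actions: List[Dict[str, Any]] = [
        {"remove": {"index": name, "alias": alias}}
        for name in old_index_names
    ]
    # The "add" action goes into the same list, so the whole switch
    # travels to Elasticsearch as one request.
    actions.append({"add": {"index": new_index_name, "alias": alias}})
    return actions
```

The result is what goes under the "actions" key of the update-aliases request body; on the very first run old_index_names is simply empty.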
I will not go into more detail on the alias API here; all the details are in the documentation.
It is worth noting that in a real high-load service such a switch can be quite painful, and it may make sense to warm up the new index first, that is, to run a pool of saved user queries against it.
All the code implementing indexing can be found in this commit.
Indexing: adding content
For the demonstration in this article I use data from the TMDB 5000 Movie Dataset. To avoid copyright problems, I only provide the code of the utility that imports the data from a CSV file, which I suggest you download yourself from the Kaggle website. After downloading, it is enough to run
docker-compose exec -T backend python -m tools.add_movies < ~/Downloads/tmdb-movie-metadata/tmdb_5000_movies.csv
to create five thousand movie cards, and
docker-compose exec backend python -m tools.build_index
to build the index. Note that the last command does not actually build the index; it only puts a task into the task queue, which is then executed on a worker. I described this approach in more detail in the previous article. Running
docker-compose logs worker
will show you how hard the worker tried!
Before we start searching, we would like to see with our own eyes whether anything was actually written to Elasticsearch and, if so, what it looks like.
The fastest and most direct way to do that is the Elasticsearch HTTP API. First, let's check where the alias points:
$ curl -s localhost:9200/_cat/aliases
cards cards-2020-09-20-16-14-18 - - - -
Great, the index exists! Let's take a closer look at it:
$ curl -s localhost:9200/cards-2020-09-20-16-14-18 | jq
{
"cards-2020-09-20-16-14-18": {
"aliases": {
"cards": {}
},
"mappings": {
...
},
"settings": {
"index": {
"creation_date": "1600618458522",
"number_of_shards": "1",
"number_of_replicas": "1",
"uuid": "iLX7A8WZQuCkRSOd7mjgMg",
"version": {
"created": "7050199"
},
"provided_name": "cards-2020-09-20-16-14-18"
}
}
}
}
And finally, let's look at its contents:
$ curl -s localhost:9200/cards-2020-09-20-16-14-18/_search | jq
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4704,
"relation": "eq"
},
"max_score": 1,
"hits": [
...
]
}
}
In total, our index contains 4704 documents, and in the hits field (which I omitted because it is too large) you can even see the contents of some of them. Success!
A more convenient way to browse the contents of an index, and to play with Elasticsearch in general, is Kibana. Let's add its container to docker-compose.yml:
services:
  ...
  kibana:
    image: "kibana:7.5.1"
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
  ...
After running docker-compose up again, we can open Kibana at localhost:5601 (note that the server may take a while to start) and, after a short setup, browse the contents of our indexes in a nice web interface.
I strongly recommend the Dev Tools tab: during development you will often need to run particular queries against Elasticsearch, and doing so interactively, with autocompletion and autoformatting, is far more convenient.
Search
With all the incredibly boring preparations behind us, it is time to add search to our web application!
Let's split this important task into three stages and discuss each one separately.
- Add a Searcher component responsible for search logic to the backend. It will build the query to Elasticsearch and convert the results into something more digestible for our backend.
- Add a /cards/search endpoint (handle? route? what do you call it at your company?) to the API that performs the search. It will call the Searcher component's method, process the results, and return them to the client.
- Implement the search interface on the frontend. It will call /cards/search once the user has decided what to search for, and display the results (and possibly some extra controls).
Search: implementation
Writing a search manager is not as hard as designing one. Let's describe the search result and the manager's interface and discuss why they look this way and not otherwise.
# backend/backend/search/searcher.py
import abc
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CardSearchResult:
    total_count: int
    card_ids: Iterable[str]
    next_card_offset: Optional[int]


class Searcher(metaclass=abc.ABCMeta):

    @abc.abstractmethod
    def search_cards(self, query: str = "",
                     count: int = 20, offset: int = 0) -> CardSearchResult:
        pass
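Some things are obvious. Pagination, for example: we are an ambitious young startup, after all.
Some are less obvious. For example, the result is a list of IDs rather than cards. By default Elasticsearch stores our documents in full and returns them in the search results. This behavior can be turned off to save on index size, but for us that is clearly premature optimization. So why not return the cards right away? Answer: it would violate the single responsibility principle. Perhaps one day we will build complex logic into the card manager that translates cards into other languages depending on the user's settings. At exactly that moment the data on the card page and the data in the search results will diverge, because we will forget to add the same logic to the search manager. And so on and so forth.
The implementation of this interface is so simple that I was too lazy to write this section :-(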
# backend/backend/search/searcher_impl.py
from typing import Any

from elasticsearch import Elasticsearch

from backend.search.searcher import CardSearchResult, Searcher

ElasticsearchQuery = Any  # type alias used only for annotations


class ElasticsearchSearcher(Searcher):

    def __init__(self, elasticsearch_client: Elasticsearch, cards_index_name: str):
        self.elasticsearch_client = elasticsearch_client
        self.cards_index_name = cards_index_name

    def search_cards(self, query: str = "", count: int = 20, offset: int = 0) -> CardSearchResult:
        result = self.elasticsearch_client.search(index=self.cards_index_name, body={
            "size": count,
            "from": offset,
            "query": self._make_text_query(query) if query else self._match_all_query
        })
        total_count = result["hits"]["total"]["value"]
        return CardSearchResult(
            total_count=total_count,
            card_ids=[hit["_id"] for hit in result["hits"]["hits"]],
            next_card_offset=offset + count if offset + count < total_count else None,
        )

    def _make_text_query(self, query: str) -> ElasticsearchQuery:
        return {
            # A multi-match query searches several fields at once
            # (unlike a match query, which searches
            # a single field).
            "multi_match": {
                "query": query,
                # ^ sets the priority (boost) of a field.
                # A match in the name counts three times as much.
                "fields": ["name^3", "tags.text", "text"],
            }
        }

    _match_all_query: ElasticsearchQuery = {"match_all": {}}
In essence, we just call the Elasticsearch API and carefully extract the IDs of the found cards from the result.
The endpoint implementation is also quite trivial:
# backend/backend/server.py
...
    def search_cards(self):
        request = flask.request.json
        search_result = self.wiring.searcher.search_cards(**request)
        cards = self.wiring.card_dao.get_by_ids(search_result.card_ids)
        return flask.jsonify({
            "totalCount": search_result.total_count,
            "cards": [
                {
                    "id": card.id,
                    "slug": card.slug,
                    "name": card.name,
                    # Only the fields needed to render
                    # a card in the search results
                    # are returned here.
                } for card in cards
            ],
            "nextCardOffset": search_result.next_card_offset,
        })
...
The frontend implementation that uses this endpoint is voluminous but on the whole quite straightforward, and I do not want to dwell on it in this article. All the code can be seen in this commit.
So far so good; let's move on.
Search: adding filters
Text search is great, but if you have ever searched on serious resources, you have probably seen all kinds of goodies such as filters.
Besides titles and descriptions, our film descriptions from the TMDB 5000 database also have tags, so for practice let's implement filtering by tags. Our goal is shown in the screenshot: when you click a tag, only films with that tag remain in the search results (their number is shown in brackets next to the tag).
To implement filters, we need to solve two problems.
- Learn to work out from the query which set of filters is available. We do not want to show every possible filter value on every screen, because there are a lot of them and most would lead to an empty result. We need to find out which tags the documents found by the query have, and ideally keep only the N most popular ones.
- Learn to actually apply a filter, that is, keep in the results only the documents that have the tags the user selected.
The second problem is solved in Elasticsearch simply through the query API (see the term query); the first, through the slightly less trivial aggregation mechanism.
So, we need to know which tags occur in the found cards and be able to filter cards by the required tags. First, let's update the design of the search manager:
# backend/backend/search/searcher.py
import abc
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TagStats:
    tag: str
    cards_count: int


@dataclass
class CardSearchResult:
    total_count: int
    card_ids: Iterable[str]
    next_card_offset: Optional[int]
    tag_stats: Iterable[TagStats]


class Searcher(metaclass=abc.ABCMeta):

    @abc.abstractmethod
    def search_cards(self, query: str = "",
                     count: int = 20, offset: int = 0,
                     tags: Optional[Iterable[str]] = None) -> CardSearchResult:
        pass
Now let's move on to the implementation. The first thing we need to do is set up an aggregation over the tags field:
--- a/backend/backend/search/searcher_impl.py
+++ b/backend/backend/search/searcher_impl.py
@@ -10,6 +10,8 @@ ElasticsearchQuery = Any
 class ElasticsearchSearcher(Searcher):
+    TAGS_AGGREGATION_NAME = "tags_aggregation"
+
     def __init__(self, elasticsearch_client: Elasticsearch, cards_index_name: str):
         self.elasticsearch_client = elasticsearch_client
         self.cards_index_name = cards_index_name
@@ -18,7 +20,12 @@ class ElasticsearchSearcher(Searcher):
         result = self.elasticsearch_client.search(index=self.cards_index_name, body={
             "size": count,
             "from": offset,
             "query": self._make_text_query(query) if query else self._match_all_query,
+            "aggregations": {
+                self.TAGS_AGGREGATION_NAME: {
+                    "terms": {"field": "tags"}
+                }
+            }
         })
Now the search result from Elasticsearch contains an aggregations field, from which, under the TAGS_AGGREGATION_NAME key, we can get buckets describing which values occur in the tags field of the found documents and how often. Let's extract this data and return it in the form designed above:
--- a/backend/backend/search/searcher_impl.py
+++ b/backend/backend/search/searcher_impl.py
@@ -28,10 +28,15 @@ class ElasticsearchSearcher(Searcher):
         total_count = result["hits"]["total"]["value"]
+        tag_stats = [
+            TagStats(tag=bucket["key"], cards_count=bucket["doc_count"])
+            for bucket in result["aggregations"][self.TAGS_AGGREGATION_NAME]["buckets"]
+        ]
         return CardSearchResult(
             total_count=total_count,
             card_ids=[hit["_id"] for hit in result["hits"]["hits"]],
             next_card_offset=offset + count if offset + count < total_count else None,
+            tag_stats=tag_stats,
         )
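To see what this extraction operates on, here is a self-contained sketch that runs the same logic against a hand-written response. The sample dictionary only imitates the shape of a terms aggregation result (a list of buckets with key and doc_count); its tags and counts are made up for illustration, and extract_tag_stats is a hypothetical standalone version of the code in the diff.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping


@dataclass
class TagStats:
    tag: str
    cards_count: int


def extract_tag_stats(response: Mapping[str, Any],
                      aggregation_name: str = "tags_aggregation") -> List[TagStats]:
    """Turn terms-aggregation buckets into TagStats objects."""
    buckets = (response.get("aggregations", {})
                       .get(aggregation_name, {})
                       .get("buckets", []))
    return [TagStats(tag=b["key"], cards_count=b["doc_count"]) for b in buckets]


# Made-up sample imitating the relevant part of a search response.
SAMPLE_RESPONSE = {
    "hits": {"total": {"value": 3, "relation": "eq"}, "hits": []},
    "aggregations": {
        "tags_aggregation": {
            "buckets": [
                {"key": "space", "doc_count": 2},
                {"key": "alien", "doc_count": 1},
            ]
        }
    },
}
```

Using .get() with defaults makes the helper tolerant of responses without aggregations, which the indexing-style access in the diff is not.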
Adding the application of filters is the easiest part:
--- a/backend/backend/search/searcher_impl.py
+++ b/backend/backend/search/searcher_impl.py
@@ -16,11 +16,17 @@ class ElasticsearchSearcher(Searcher):
         self.elasticsearch_client = elasticsearch_client
         self.cards_index_name = cards_index_name
-    def search_cards(self, query: str = "", count: int = 20, offset: int = 0) -> CardSearchResult:
+    def search_cards(self, query: str = "", count: int = 20, offset: int = 0,
+                     tags: Optional[Iterable[str]] = None) -> CardSearchResult:
         result = self.elasticsearch_client.search(index=self.cards_index_name, body={
             "size": count,
             "from": offset,
-            "query": self._make_text_query(query) if query else self._match_all_query,
+            "query": {
+                "bool": {
+                    "must": self._make_text_queries(query),
+                    "filter": self._make_filter_queries(tags),
+                }
+            },
             "aggregations": {
Subqueries in the must clause are mandatory, and they are also taken into account when the document score is computed and therefore affect ranking; if we ever add more conditions on the texts, this is the place for them. Subqueries in the filter clause only filter, without affecting the score or the ranking.
It remains to implement _make_filter_queries():
    def _make_filter_queries(self, tags: Optional[Iterable[str]] = None) -> List[ElasticsearchQuery]:
        return [] if tags is None else [{
            "term": {
                "tags": {
                    "value": tag
                }
            }
        } for tag in tags]
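Putting the pieces together, the whole request body can be assembled by one pure function, which makes the must/filter split easy to inspect. This is a sketch under assumptions: the article does not show _make_text_queries, so the text part below (a multi-match query, or match_all for an empty query) is my guess based on the earlier version of the searcher, and make_search_body is a hypothetical name.

```python
from typing import Any, Dict, Iterable, Optional


def make_search_body(query: str = "", tags: Optional[Iterable[str]] = None,
                     count: int = 20, offset: int = 0) -> Dict[str, Any]:
    """Assemble a search request body with text queries, tag filters, and tag aggregation."""
    if query:
        # Scored part: affects both matching and ranking.
        text_queries = [{
            "multi_match": {
                "query": query,
                "fields": ["name^3", "tags.text", "text"],
            }
        }]
    else:
        # Assumption: with no text query, match everything.
        text_queries = [{"match_all": {}}]
    # Unscored part: one term query per selected tag.
    filter_queries = [{"term": {"tags": {"value": tag}}} for tag in (tags or [])]
    return {
        "size": count,
        "from": offset,
        "query": {"bool": {"must": text_queries, "filter": filter_queries}},
        "aggregations": {"tags_aggregation": {"terms": {"field": "tags"}}},
    }
```

Since every entry of filter must match, selecting several tags narrows the results to cards that carry all of them.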
Again, I will not go into the frontend part; all the code is in this commit.
Ranking
So, our search finds cards, filters them by a given list of tags, and displays them in some order. But in which one? Order matters a great deal for practical search, yet all we have done about it so far is hint to Elasticsearch, by specifying the ^3 priority in the multi-match query, that finding words in a card's title is worth more than finding them in its description or tags.
Although Elasticsearch by default ranks documents with a rather intricate TF-IDF-based formula, this is hardly enough for our imaginary ambitious startup. If our documents are products, we need to be able to take their sales into account; if they are user-generated content, we need to be able to take freshness into account; and so on. But we cannot simply sort by number of sales or date added either, because then relevance to the search query would be ignored.
Ranking is a large and confusing realm of technology that cannot be covered in a single section at the end of an article. So from here on I switch to broad strokes: I will try to describe in the most general terms how industrial-grade ranking in search can be arranged, and reveal a few technical details of how it can be implemented with Elasticsearch.
The ranking task is very complex, so it is no surprise that one of the main modern methods of solving it is machine learning. Applying machine learning techniques to ranking is collectively known as learning to rank.
A typical process looks like this.
- Decide what we want to rank. We put the entities of interest into an index, learn to get some reasonable top of those entities for a given search query (for example, with a simple sort and cutoff), and now want to learn to rank that top in a smarter way.
- Decide how we want to rank. We choose the characteristic by which to rank our results in line with the business goals of the service. For example, if our entities are products we sell, we may want to sort them by decreasing probability of purchase; if they are memes, by the probability of a like or a share; and so on. Of course, we cannot compute these probabilities. At best we can estimate them, and even that only for old entities with enough statistics. But we can try to teach a model to predict them from indirect signs.
- Extract features. We come up with a set of features of our entities that could help us assess how relevant an entity is to a search query. Besides the same TF-IDF that Elasticsearch already computes for us, a typical example is CTR (click-through rate): we take the logs of our service for its whole lifetime, count for every entity and query pair how many times the entity was shown in the results for that query and how many times it was clicked, divide one by the other, and there it is, the simplest estimate of the conditional click probability. We can also come up with user-specific features and features of user and entity pairs to personalize the ranking. Having come up with the features, we write code that computes them, puts them into some storage, and can serve them in real time for a given search query, user, and set of entities.
- Assemble a training dataset. There are many options, but as a rule all of them are built from logs of "good" events (for example, a click followed by a purchase) and "bad" events (for example, a click followed by a return to the results) in our service. Once the dataset is assembled, whether it is a list of statements "the relevance of product X to query Q is approximately P", a list of pairs "product X is more relevant to query Q than product Y", or a set of lists "for query Q, products P1, P2, ... are correctly ranked like this", we attach the corresponding features to every row in it.
- Train the model. This is all ML classics: train/test split, hyperparameters, retraining.
- Embed the model. What remains is to wire in the computation of the model on the fly, for the whole top, so that users receive results that are already ranked. There are many options; for illustration I will (again) focus on a simple Elasticsearch plugin, Learning to Rank.
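The CTR feature from the feature extraction step above can be sketched in a few lines. This is only an illustration, assuming the logs have already been reduced to (query, card_id, clicked) tuples, one per impression; the tuple layout and the name compute_ctr are assumptions, not something defined in the article.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, bool]  # (query, card_id, was_clicked)


def compute_ctr(events: Iterable[Event]) -> Dict[Tuple[str, str], float]:
    """Estimate click-through rate for every (query, card) pair seen in the log."""
    shows: Counter = Counter()
    clicks: Counter = Counter()
    for query, card_id, clicked in events:
        shows[(query, card_id)] += 1
        if clicked:
            clicks[(query, card_id)] += 1
    # Every pair in `shows` has at least one impression, so no division by zero.
    return {pair: clicks[pair] / shows[pair] for pair in shows}
```

With few impressions such an estimate is very noisy, which is one reason it is fed to a model as one feature among several rather than used for sorting directly.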
Ranking: the Elasticsearch Learning to Rank plugin
Elasticsearch Learning to Rank is a plugin that adds to Elasticsearch the ability to compute an ML model over the SERP and immediately rank the results according to the computed scores. It also helps us obtain features identical to those used in real time, while reusing the capabilities of Elasticsearch (TF-IDF and the like).
First, we need to install the plugin in our Elasticsearch container. For that we need a simple Dockerfile
# elasticsearch/Dockerfile
FROM elasticsearch:7.5.1
RUN ./bin/elasticsearch-plugin install --batch http://es-learn-to-rank.labs.o19s.com/ltr-1.1.2-es7.5.1.zip
and the corresponding changes in docker-compose.yml:
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -5,7 +5,8 @@ services:
   elasticsearch:
-    image: "elasticsearch:7.5.1"
+    build:
+      context: elasticsearch
     environment:
       - discovery.type=single-node
We also need plugin support in the Python client. To my surprise, Python support does not ship with the plugin, so I wrote it specially for this article. We add elasticsearch_ltr to requirements.txt and upgrade the client in the wiring:
--- a/backend/backend/wiring.py
+++ b/backend/backend/wiring.py
@@ -1,5 +1,6 @@
 import os
+from elasticsearch_ltr import LTRClient
 from celery import Celery
 from elasticsearch import Elasticsearch
 from pymongo import MongoClient
@@ -39,5 +40,6 @@ class Wiring(object):
         self.task_manager = TaskManager(self.celery_app)
         self.elasticsearch_client = Elasticsearch(hosts=self.settings.ELASTICSEARCH_HOSTS)
+        LTRClient.infect_client(self.elasticsearch_client)
         self.indexer = Indexer(self.elasticsearch_client, self.card_dao, self.settings.CARDS_INDEX_ALIAS)
         self.searcher: Searcher = ElasticsearchSearcher(self.elasticsearch_client, self.settings.CARDS_INDEX_ALIAS)
Ranking: crafting the features
Every query in Elasticsearch returns not only a list of IDs of the found documents but also a score for each of them. If it is a match or multi-match query, which is what we use, the score is the result of computing that same intricate formula involving TF-IDF; if it is a bool query, it is a combination of the scores of the nested queries; if it is a function score query, it is the result of computing a given function (for example, the value of some numeric field in the document); and so on. The ELTR plugin lets us use the score of any query as a feature, which makes it easy to combine data about how well a document matches the query (via a multi-match query) with precomputed statistics that we put into the document in advance (via a function score query).
Since we have the TMDB 5000 database at hand, which contains film descriptions and, among other things, their ratings, let's take the rating as an example of a precomputed feature.
In this commit I added some basic infrastructure for storing features to the backend of our web application and supported loading the rating from the movie file. To spare you reading yet another pile of code, I will describe the essentials.
- We store features in a separate collection and fetch them with a separate manager. Dumping all data into a single entity is bad practice.
- We call this manager at the indexing stage and put all available features into the documents being indexed.
- To know the index schema, we need to know the list of all existing features before we start building the index. For now we hardcode this list.
- Since we are not going to filter documents by feature values, only extract the values from already found documents to compute the model, we turn off building inverted indexes for the new fields with the index: false option in the schema and save a little space.
Ranking: collecting the dataset
Since, first, we have no production and, second, the margins of this article are too narrow for a story about telemetry, Kafka, NiFi, Hadoop, Spark, and building ETL processes, I will simply generate random views and clicks for our cards and some search queries. After that we need to compute the features for the resulting card and query pairs.
It is time to dig deeper into the ELTR plugin API. To compute features, we need to create a feature store entity (as far as I understand, this is really just an index in Elasticsearch where the plugin keeps all its data) and then a feature set, that is, a list of features with a description of how to compute each of them. After that, a special request to Elasticsearch is enough to get a vector of feature values for every found entity.
Let's start by creating the feature set:
# backend/backend/search/ranking.py
from typing import Iterable, List, Mapping

from elasticsearch import Elasticsearch
from elasticsearch_ltr import LTRClient

from backend.search.features import CardFeaturesManager


class SearchRankingManager:

    DEFAULT_FEATURE_SET_NAME = "card_features"

    def __init__(self, elasticsearch_client: Elasticsearch,
                 card_features_manager: CardFeaturesManager,
                 cards_index_name: str):
        self.elasticsearch_client = elasticsearch_client
        self.card_features_manager = card_features_manager
        self.cards_index_name = cards_index_name

    def initialize_ranking(self, feature_set_name=DEFAULT_FEATURE_SET_NAME):
        ltr: LTRClient = self.elasticsearch_client.ltr
        try:
            # The feature store has to exist, but creating it
            # a second time raises an error ¯\_(ツ)_/¯
            ltr.create_feature_store()
        except Exception as exc:
            if "resource_already_exists_exception" not in str(exc):
                raise
        # Create the feature set with our features!
        ltr.create_feature_set(feature_set_name, {
            "featureset": {
                "features": [
                    # The first feature is the score
                    # of a match query
                    # against the card name
                    # only.
                    self._make_feature("name_tf_idf", ["query"], {
                        "match": {
                            # ELTR lets feature queries
                            # take parameters.
                            # Here {{query}} is replaced
                            # with the "query" parameter
                            # before the match query runs.
                            "name": "{{query}}"
                        }
                    }),
                    # The score of the same query we search with.
                    self._make_feature("combined_tf_idf", ["query"], {
                        "multi_match": {
                            "query": "{{query}}",
                            "fields": ["name^3", "tags.text", "text"]
                        }
                    }),
                    *(
                        # Precomputed features are read
                        # via function score.
                        # If a document has no value
                        # for a feature, 0 is used.
                        # These features take
                        # no parameters.
                        self._make_feature(feature_name, [], {
                            "function_score": {
                                "field_value_factor": {
                                    "field": feature_name,
                                    "missing": 0
                                }
                            }
                        })
                        for feature_name in sorted(self.card_features_manager.get_all_feature_names_set())
                    )
                ]
            }
        })

    @staticmethod
    def _make_feature(name, params, query):
        return {
            "name": name,
            "params": params,
            "template_language": "mustache",
            "template": query,
        }
Now, a function that computes the features for a given query and set of cards:
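    def compute_cards_features(self, query: str, card_ids: Iterable[str],
                               feature_set_name=DEFAULT_FEATURE_SET_NAME) -> Mapping[str, List[float]]:
        card_ids = list(card_ids)
        result = self.elasticsearch_client.search({
            "query": {
                "bool": {
                    # We are not really searching here:
                    # the cards are already known,
                    # so we just select them
                    # by ID.
                    "filter": [
                        {
                            "terms": {
                                "_id": card_ids
                            }
                        },
                        # The second filter is the special
                        # SLTR query.
                        # It makes the plugin compute
                        # the features of the feature set.
                        # (Although it sits in
                        # filter, it filters nothing out.)
                        {
                            "sltr": {
                                "_name": "logged_featureset",
                                "featureset": feature_set_name,
                                "params": {
                                    # Parameters for the feature queries.
                                    # This value is what gets
                                    # substituted for
                                    # {{query}}.
                                    "query": query
                                }
                            }
                        }
                    ]
                }
            },
            # Ask the plugin to log the computed
            # feature values into the response.
            "ext": {
                "ltr_log": {
                    "log_specs": {
                        "name": "log_entry1",
                        "named_query": "logged_featureset"
                    }
                }
            },
            "size": len(card_ids),
        })
        # Pull the feature values out of the (rather
        # deeply nested) response.
        # (A feature with no logged value
        # becomes NaN.)
        return {
            hit["_id"]: [feature.get("value", float("nan")) for feature in hit["fields"]["_ltrlog"][0]["log_entry1"]]
            for hit in result["hits"]["hits"]
        }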
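The final dictionary comprehension is easier to follow against a concrete hit. The sketch below applies the same extraction to a hand-written sample; the sample only imitates the nesting that the code above relies on (fields, _ltrlog, log_entry1, then a list of entries with name and an optional value), and its numbers are made up. extract_features is a hypothetical helper that additionally keeps the feature names.

```python
import math
from typing import Any, Dict, Mapping


def extract_features(hit: Mapping[str, Any], log_name: str = "log_entry1") -> Dict[str, float]:
    """Return {feature name: value} for one search hit with logged features."""
    entries = hit["fields"]["_ltrlog"][0][log_name]
    # A feature without a logged value is represented as NaN.
    return {entry["name"]: entry.get("value", float("nan")) for entry in entries}


# Made-up sample hit with one feature lacking a value.
SAMPLE_HIT = {
    "_id": "42",
    "fields": {
        "_ltrlog": [{
            "log_entry1": [
                {"name": "name_tf_idf"},
                {"name": "combined_tf_idf", "value": 3.5},
                {"name": "rating", "value": 7.2},
            ]
        }]
    },
}
```

Keeping names next to values guards against silently mixing up columns if the order of features in the feature set ever changes.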
And a simple script that takes a CSV with queries and card IDs as input and outputs a CSV with the features:
# backend/tools/compute_movie_features.py
import csv
import itertools
import sys

import tqdm

from backend.wiring import Wiring

if __name__ == "__main__":
    wiring = Wiring()

    reader = iter(csv.reader(sys.stdin))
    header = next(reader)

    feature_names = wiring.search_ranking_manager.get_feature_names()
    writer = csv.writer(sys.stdout)
    writer.writerow(["query", "card_id"] + feature_names)

    query_index = header.index("query")
    card_id_index = header.index("card_id")

    chunks = itertools.groupby(reader, lambda row: row[query_index])
    for query, rows in tqdm.tqdm(chunks):
        card_ids = [row[card_id_index] for row in rows]
        features = wiring.search_ranking_manager.compute_cards_features(query, card_ids)
        for card_id in card_ids:
            writer.writerow((query, card_id, *features[card_id]))
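One detail of the script above is worth spelling out: itertools.groupby only groups consecutive rows with the same key, so the script issues one request per query only if the input rows are already grouped by query. The sketch below shows the difference on made-up rows; group_card_ids_by_query is a hypothetical helper that sorts first.

```python
import itertools
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str]  # (query, card_id)


def group_card_ids_by_query(rows: Iterable[Row]) -> Dict[str, List[str]]:
    """Group card IDs by query, regardless of the input order."""
    # sorted() is stable, so card order within one query is preserved.
    ordered = sorted(rows, key=lambda row: row[0])
    return {
        query: [card_id for _, card_id in group]
        for query, group in itertools.groupby(ordered, key=lambda row: row[0])
    }


def consecutive_group_keys(rows: Iterable[Row]) -> List[str]:
    """Show which groups plain groupby produces without sorting."""
    return [query for query, _ in itertools.groupby(rows, key=lambda row: row[0])]
```

If the events file is known to be grouped already, sorting is unnecessary; otherwise sorting (or collecting into a dictionary) avoids repeated requests for the same query.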
Finally, we can run it all!
# Create the feature set
docker-compose exec backend python -m tools.initialize_search_ranking
# Generate the events
docker-compose exec -T backend \
    python -m tools.generate_movie_events \
    < ~/Downloads/tmdb-movie-metadata/tmdb_5000_movies.csv \
    > ~/Downloads/habr-app-demo-dataset-events.csv
# Compute the features
docker-compose exec -T backend \
    python -m tools.compute_features \
    < ~/Downloads/habr-app-demo-dataset-events.csv \
    > ~/Downloads/habr-app-demo-dataset-features.csv
Now we have two files, one with events and one with features, and we can start training.
Ranking: training and embedding the model
Let's skip the details of loading the datasets (the full script can be seen in this commit) and get straight to the point.
# backend/tools/train_model.py
...
if __name__ == "__main__":
    args = parser.parse_args()

    feature_names, features = read_features(args.features)
    events = read_events(args.events)

    # Split the queries into train and test 4 to 1.
    # (random.sample needs a sequence, not a set, in recent Python versions.)
    all_queries = set(events.keys())
    train_queries = random.sample(sorted(all_queries), int(0.8 * len(all_queries)))
    test_queries = all_queries - set(train_queries)

    # DMatrix is the data format used by xgboost.
    # Each row holds the features of a card and query pair
    # plus a label: 1 if the card was clicked,
    # 0 if it was not.
    train_dmatrix = make_dmatrix(train_queries, events, feature_names, features)
    test_dmatrix = make_dmatrix(test_queries, events, feature_names, features)

    # Train the model!
    # The parameters below are not tuned;
    # this is a plain, minimal
    # XGBoost setup.
    param = {
        "max_depth": 2,
        "eta": 0.3,
        "objective": "binary:logistic",
        "eval_metric": "auc",
    }
    num_round = 10
    booster = xgboost.train(param, train_dmatrix, num_round, evals=((train_dmatrix, "train"), (test_dmatrix, "test")))

    # Save the model.
    booster.dump_model(args.output, dump_format="json")

    # Plot two pictures: feature importance
    # and the ROC curve.
    xgboost.plot_importance(booster)
    plt.figure()
    build_roc(test_dmatrix.get_label(), booster.predict(test_dmatrix))

    plt.show()
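The "auc" metric and the ROC plot used above have a simple interpretation: AUC is the probability that a randomly chosen clicked example receives a higher score than a randomly chosen non-clicked one. Here is a small dependency-free sketch of that definition, for intuition only; it is quadratic in the number of examples, and roc_auc is a hypothetical name, not the build_roc helper from the script.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to a coin toss and 1.0 to a perfect ordering, which is how the ROC picture should be read.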
Let's run it:
python backend/tools/train_search_ranking_model.py \
    --events ~/Downloads/habr-app-demo-dataset-events.csv \
    --features ~/Downloads/habr-app-demo-dataset-features.csv \
     -o ~/Downloads/habr-app-demo-model.xgb
Note that, since we exported all the necessary data with the previous scripts, this script no longer has to run inside Docker; it should run on your own machine, with xgboost and sklearn installed beforehand. Likewise, in real production the previous scripts would have to run somewhere with access to the production environment, but this one would not.
If everything was done correctly, the model trains successfully and we see two beautiful pictures. The first is a chart of feature importance:
Although the events were generated randomly, combined_tf_idf turned out to be much more important than the other features, because I cheated and artificially lowered the click probability for cards that sit lower in the results ranked the old way. The fact that the model noticed this is a good sign, and a sign that we did not make any completely stupid mistakes in the training process.
The second chart is the ROC curve:
The blue line is above the red one, which means that our model predicts the labels a little better than a coin toss. (For your mom's friend's son, the ML engineer, the curve should almost touch the top left corner.)
Very little remains: we add a script for uploading the model, upload it, and add one small new item to the search query, rescoring:
--- a/backend/backend/search/searcher_impl.py
+++ b/backend/backend/search/searcher_impl.py
@@ -27,6 +30,19 @@ class ElasticsearchSearcher(Searcher):
                     "filter": list(self._make_filter_queries(tags, ids)),
                 }
             },
+            "rescore": {
+                "window_size": 1000,
+                "query": {
+                    "rescore_query": {
+                        "sltr": {
+                            "params": {
+                                "query": query
+                            },
+                            "model": self.ranking_manager.get_current_model_name()
+                        }
+                    }
+                }
+            },
             "aggregations": {
                 self.TAGS_AGGREGATION_NAME: {
                     "terms": {"field": "tags"}
Now, after Elasticsearch has performed the search we need and ranked the results with its own (fairly fast) algorithm, we take the top 1000 results and re-rank them with our (relatively slow) machine-learned formula. Success!
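The effect of rescoring can be imitated in plain Python. This is a simplified sketch, not the Elasticsearch implementation: it assumes the first-pass and model scores are combined as a weighted sum (my understanding of the default "total" score mode) and that documents outside the window keep their place below it. The function name and its parameters are hypothetical.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (document ID, score)


def rescore_top(hits: List[Hit], model_score: Callable[[str], float],
                window_size: int = 1000, query_weight: float = 1.0,
                rescore_query_weight: float = 1.0) -> List[Hit]:
    """Re-rank the first `window_size` hits with a second, more expensive score."""
    window, rest = hits[:window_size], hits[window_size:]
    rescored = [
        (doc_id, query_weight * score + rescore_query_weight * model_score(doc_id))
        for doc_id, score in window
    ]
    # Only the window is reordered; the tail keeps its first-pass order.
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + rest
```

The window size is the trade-off knob: a larger window lets the model fix more first-pass mistakes, at the price of more model evaluations per request.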
Conclusion
We took our minimalistic web application and went from having no search at all to a scalable solution with many advanced features. It was not that easy. But it was not that hard either! The final application lives in the GitHub repository, in a branch with the modest name feature/search, and requires Docker and Python 3 with machine learning libraries to run.
I used Elasticsearch to show how this works in general, what problems come up, and how they can be solved, but it is certainly not the only tool to choose from. Solr, PostgreSQL full-text indexes, and other engines are alternatives as well.
And of course, this solution does not claim to be complete or production-ready; it is purely an illustration of how it all can be done. You can improve it almost endlessly!
- Incremental indexing. When cards are changed through CardManager, it would be good to update them in the index right away. So that CardManager does not have to know that the service has search, and to avoid circular dependencies, we would have to introduce dependency inversion in one form or another.
- For indexing in our specific case, MongoDB paired with Elasticsearch, you can use ready-made solutions such as mongo-connector.
- Tuning a cluster of nodes with sharding and replication is a whole separate pleasure.
But to keep the article at a readable size, I will stop here and leave you alone with these challenges. Thank you for your attention!