Implementing Japanese full-text search

Overview

Implementing full-text search for Japanese requires attention to the following points.

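- Diverse character types: Japanese search has to handle hiragana, katakana, kanji, alphanumerics (half-width and full-width), special symbols (such as ① and ㌢), emoticons, and more.
- Orthographic variation: terms must be treated as the same even when they are written differently.
  - Variation from character type, full-width versus half-width, and letter case (あいふぉん, アイフォン, アイフォーン, iphone, i-Phone, iPhone, ｉｐｈｏｎｅ, and so on)
  - Variation from the presence or absence of a trailing long-vowel mark (ー) (コンピューター and コンピュータ)
  - Variation between the long-vowel mark and katakana vowels (サラダボール and サラダボウル must be treated as the same word, whereas バレエ and バレー must be treated as different words)
  - Variation from kanji iteration marks (明明白白 and 明々白々)
- Compound words: some requirements call for treating a compound made of several terms as a single word (山桜, 東京タワー, エアバスA300, ホームページ, 瀬戸内しまなみ海道, and so on).
- Synonyms: searches with similar keywords must also match (正確 / 的確 / 明確 / 確実 / 確か, and so on).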

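This lab explains how OpenSearch addresses these challenges in implementing Japanese search.

Lab setup

This lab uses a notebook environment (JupyterLab) and Amazon OpenSearch Service.

Dataset

This lab uses JSQuAD, the question-answering dataset included in JGLUE.

Prerequisites

Installing packages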

In [48]:

!pip install opensearch-py requests-aws4auth "awswrangler[opensearch]" --quiet

Imports

In [49]:

import boto3
import json
import logging

import awswrangler as wr
import pandas as pd
import numpy as np
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

from ipywidgets import interact

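Defining helper functions

Define the helper function needed by the later steps. It reads the global default_region, which is set in the next cell, so run that cell before calling the function.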

In [50]:

def search_cloudformation_output(stackname, key):
    cloudformation_client = boto3.client("cloudformation", region_name=default_region)
    for output in cloudformation_client.describe_stacks(StackName=stackname)["Stacks"][0]["Outputs"]:
        if output["OutputKey"] == key:
            return output["OutputValue"]
    raise ValueError(f"{key} is not found in outputs of {stackname}.")

Setting common variables

In [51]:

default_region = boto3.Session().region_name
logging.getLogger().setLevel(logging.ERROR)

Checking the connection to the OpenSearch cluster

Confirm that the OpenSearch cluster is reachable over the network and that the OpenSearch Security feature permits API requests.

If the response contains cluster_name and cluster_uuid, the connection check succeeded.

In [52]:

cloudformation_stack_name = "search-lab-jp"
opensearch_cluster_endpoint = search_cloudformation_output(cloudformation_stack_name, "OpenSearchDomainEndpoint")

credentials = boto3.Session().get_credentials()
service_code = "es"
auth = AWSV4SignerAuth(credentials=credentials, region=default_region, service=service_code)
opensearch_client = OpenSearch(
    hosts=[{"host": opensearch_cluster_endpoint, "port": 443}],
    http_compress=True, 
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class = RequestsHttpConnection
)
opensearch_client.info()

Out[52]:

{'name': 'cf756e86f83b28e0bd2ffe2ff501ccf4',
 'cluster_name': '123456789012:opensearchservi-cyiiwlmtgk2r',
 'cluster_uuid': 'UoIf1GJCTauJlwbKQrxTUA',
 'version': {'distribution': 'opensearch',
  'number': '2.17.0',
  'build_type': 'tar',
  'build_hash': 'unknown',
  'build_date': '2025-02-14T09:38:50.023788640Z',
  'build_snapshot': False,
  'lucene_version': '9.11.1',
  'minimum_wire_compatibility_version': '7.10.0',
  'minimum_index_compatibility_version': '7.0.0'},
 'tagline': 'The OpenSearch Project: https://opensearch.org/'}

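Japanese search walkthrough

How OpenSearch processes text

Data targeted by full-text search is processed in the following order before it is registered in the inverted index: character filters, then a tokenizer, then token filters.

The following sections explain the components that appear in each phase and show how they behave in practice.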
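The three phases map onto three fields of the _analyze request body that the cells in this lab send: char_filter, tokenizer, and filter. As a minimal sketch, the hypothetical helper below (build_analyze_payload is not part of opensearch-py or of the original lab) assembles such a body so that the role of each slot is visible in one place.

```python
def build_analyze_payload(text, tokenizer, char_filter=None, token_filter=None):
    """Build a request body for the _analyze API (hypothetical helper)."""
    payload = {"text": text, "tokenizer": tokenizer}
    if char_filter:
        # Phase 1: character filters run on the raw text.
        payload["char_filter"] = list(char_filter)
    if token_filter:
        # Phase 3: token filters run on the tokenizer output.
        payload["filter"] = list(token_filter)
    return payload


payload = build_analyze_payload(
    "学問のすゝめ",
    tokenizer={"type": "kuromoji_tokenizer", "mode": "search"},  # Phase 2
    char_filter=["kuromoji_iteration_mark"],
    token_filter=["kuromoji_baseform"],
)
print(payload)
# With a connected client, the body would be sent as:
# opensearch_client.indices.analyze(body=payload)
```

The helper only builds a dictionary, so it can be tried without a cluster; the commented-out call shows where it would plug into the cells that follow.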

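Tokenizer

A tokenizer is the component that splits the input text according to its own logic. Japanese search generally uses either morphological analysis or N-Gram, which cuts the text into pieces of N characters. Let's look at how each approach actually behaves.

N-Gram

N-Gram tokenizes text by extracting N characters at a time. Extracting one character at a time is called uni-gram, two at a time bi-gram, and three at a time tri-gram.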
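To make the idea concrete before calling OpenSearch, here is a plain-Python sketch of character N-grams. The function name char_ngrams is an illustration only; unlike the ngram tokenizer, it does not break tokens at whitespace or symbols.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters, sliding one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("関西国際空港", 1))  # uni-gram
print(char_ngrams("関西国際空港", 2))  # bi-gram
print(char_ngrams("関西国際空港", 3))  # tri-gram
```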

Here we look at the result of using the ngram tokenizer to split the following string into 2-character tokens. The token_chars parameter keeps whitespace and symbols out of the tokens.

"大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です"

In [53]:

payload = {
  "text": "大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です。",
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 2,
    "token_chars": ["letter", "digit"]
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)
df_bigram = pd.json_normalize(response["tokens"])
df_bigram

Out[53]:

	token	start_offset	end_offset	type	position
0	大阪	0	2	word	0
1	阪府	1	3	word	1
2	府の	2	4	word	2
3	の関	3	5	word	3
4	関西	4	6	word	4
5	西国	5	7	word	5
6	国際	6	8	word	6
7	際空	7	9	word	7
8	空港	8	10	word	8
9	KI	11	13	word	9
10	IX	12	14	word	10
11	から	15	17	word	11
12	ら東	16	18	word	12
13	東京	17	19	word	13
14	京都	18	20	word	14
15	都の	19	21	word	15
16	の羽	20	22	word	16
17	羽田	21	23	word	17
18	田空	22	24	word	18
19	空港	23	25	word	19
20	HN	26	28	word	20
21	ND	27	29	word	21
22	まで	30	32	word	22
23	での	31	33	word	23
24	のフ	32	34	word	24
25	フラ	33	35	word	25
26	ライ	34	36	word	26
27	イト	35	37	word	27
28	ト時	36	38	word	28
29	時間	37	39	word	29
30	間は	38	40	word	30
31	はお	39	41	word	31
32	およ	40	42	word	32
33	よそ	41	43	word	33
34	70	44	46	word	34
35	分で	47	49	word	35
36	です	48	50	word	36

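In the example above, 2-character tokens were extracted while sliding through the text one character at a time. Because N-Gram extracts tokens mechanically, N characters at a time, without relying on a dictionary, it can be expected to improve the hit rate for unknown words.

On the other hand, the increase in search noise needs consideration. The extracted tokens include "京都" (taken from 東京都), so a search for 京都 also hits this unrelated sentence.

Techniques such as the following can reduce search noise.

- Use several N-Gram indexes together (for example bi-gram and tri-gram) and branch in the application according to the length of the keyword the user entered
- Combine N-Gram with morphological analysis
- Apply a token filter to remove tokens made up of unnecessary words (stop words) such as "です" and "ます"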
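The first technique, branching on keyword length, could look roughly like the sketch below. The field names body.bigram and body.trigram are hypothetical; they assume the same text has been indexed twice, once with a 2-gram analyzer and once with a 3-gram analyzer.

```python
def pick_ngram_field(keyword, bigram_field="body.bigram", trigram_field="body.trigram"):
    """Choose which (hypothetical) N-gram field to query from the keyword length."""
    length = len(keyword.strip())
    if length < 2:
        raise ValueError("keyword must contain at least 2 characters")
    # Keywords of 3 or more characters go to the tri-gram field, which is less noisy.
    return trigram_field if length >= 3 else bigram_field


def build_match_query(keyword):
    """Build a match_phrase query body against the chosen field."""
    return {"query": {"match_phrase": {pick_ngram_field(keyword): keyword}}}


print(build_match_query("京都"))
print(build_match_query("東京都"))
```

With this split, a search for 東京都 is answered from tri-grams, so it no longer depends on the 京都 bi-gram. A two-character search for 京都 still uses the bi-gram field and can still match 東京都, which is why the other techniques in the list remain relevant.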

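Settings when the minimum and maximum N-gram lengths differ by 2 or more

When min_gram and max_gram of the ngram tokenizer differ by 2 or more, the index.max_ngram_diff setting must be added to the index. Without it, an error like the following occurs; as the message shows, the limit is 1 unless the setting is changed.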

In [54]:

payload = {
  "text": "大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です。",
  "tokenizer": {
    "type": "ngram",
    "min_gram": 1,
    "max_gram": 3,
  }
}

try:
    response = opensearch_client.indices.analyze(
      body=payload
    )
    df_bigram = pd.json_normalize(response["tokens"])
    df_bigram
except Exception as e:
    print(e)

RequestError(400, 'illegal_argument_exception', 'The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to: [1] but was [2]. This limit can be set by changing the [index.max_ngram_diff] index level setting.')
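Because index.max_ngram_diff is an index-level setting, one way to avoid the error is to create an index that carries the setting and then call _analyze against that index. The sketch below only builds the request body; the tokenizer name my_ngram_tokenizer, the analyzer name my_ngram_analyzer, and the index name ngram-demo are hypothetical.

```python
def build_ngram_index_body(min_gram, max_gram):
    """Build an index-creation body whose max_ngram_diff matches the tokenizer."""
    return {
        "settings": {
            "index": {
                # Must be at least max_gram - min_gram.
                "max_ngram_diff": max_gram - min_gram
            },
            "analysis": {
                "tokenizer": {
                    "my_ngram_tokenizer": {
                        "type": "ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                        "token_chars": ["letter", "digit"],
                    }
                },
                "analyzer": {
                    "my_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "my_ngram_tokenizer",
                    }
                },
            },
        }
    }


body = build_ngram_index_body(1, 3)
print(body["settings"]["index"])
# With a connected client (index name is hypothetical):
# opensearch_client.indices.create(index="ngram-demo", body=body)
# opensearch_client.indices.analyze(
#     index="ngram-demo",
#     body={"analyzer": "my_ngram_analyzer", "text": "関西国際空港"},
# )
```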

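Morphological analysis

Morphological analysis splits text into tokens based on grammar and a dictionary that stores part-of-speech information for words.

For example, processing the sentence 吾輩は猫である。 with a morphological analysis engine yields naturally split tokens: 吾輩 / は / 猫 / で / ある / 。

OpenSearch can use either Sudachi or Kuromoji. The following sections explain how each engine behaves.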

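Kuromoji

Kuromoji is an open-source Japanese morphological analyzer implemented in Java. It was developed by atilika and donated to the Apache Software Foundation, and it is built into Apache Lucene, on which OpenSearch is based. Kuromoji is available by default in Amazon OpenSearch Service and Amazon OpenSearch Serverless.

In the OSS distribution of OpenSearch it is registered as a standard Japanese plugin, so it can be installed with the opensearch-plugin install analysis-kuromoji command.

kuromoji_tokenizer supports the following three segmentation modes.

- normal: the default mode. Outputs tokens in the longest segmentation unit and does not split compound tokens.
- search: a mode specialized for search. Also splits compound tokens.
- extended: in addition to the search behavior, outputs unknown words as unigrams (1-character tokens).

Let's look at the results for each mode.

normal mode

Symbols such as parentheses and punctuation marks are absent from the tokens because the discard_punctuation option of kuromoji_tokenizer defaults to true. To include symbols and punctuation as tokens, set this option to false.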

In [55]:

payload = {
  "text": "大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です。",
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "normal",
    "discard_punctuation": True  # default
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)
df_kuromoji_normal = pd.json_normalize(response["tokens"])
df_kuromoji_normal

Out[55]:

	token	start_offset	end_offset	type	position
0	大阪	0	2	word	0
1	府	2	3	word	1
2	の	3	4	word	2
3	関西国際空港	4	10	word	3
4	KIX	11	14	word	4
5	から	15	17	word	5
6	東京	17	19	word	6
7	都	19	20	word	7
8	の	20	21	word	8
9	羽田空港	21	25	word	9
10	HND	26	29	word	10
11	まで	30	32	word	11
12	の	32	33	word	12
13	フライト	33	37	word	13
14	時間	37	39	word	14
15	は	39	40	word	15
16	およそ	40	43	word	16
17	70	44	46	word	17
18	分	47	48	word	18
19	です	48	50	word	19

search mode

In [56]:

payload = {
  "text": "大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です",
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)
df_kuromoji_search = pd.json_normalize(response["tokens"])


df_kuromoji_search

Out[56]:

	token	start_offset	end_offset	type	position	positionLength
0	大阪	0	2	word	0	NaN
1	府	2	3	word	1	NaN
2	の	3	4	word	2	NaN
3	関西	4	6	word	3	NaN
4	関西国際空港	4	10	word	3	3.0
5	国際	6	8	word	4	NaN
6	空港	8	10	word	5	NaN
7	KIX	11	14	word	6	NaN
8	から	15	17	word	7	NaN
9	東京	17	19	word	8	NaN
10	都	19	20	word	9	NaN
11	の	20	21	word	10	NaN
12	羽田	21	23	word	11	NaN
13	羽田空港	21	25	word	11	2.0
14	空港	23	25	word	12	NaN
15	HND	26	29	word	13	NaN
16	まで	30	32	word	14	NaN
17	の	32	33	word	15	NaN
18	フライト	33	37	word	16	NaN
19	時間	37	39	word	17	NaN
20	は	39	40	word	18	NaN
21	およそ	40	43	word	19	NaN
22	70	44	46	word	20	NaN
23	分	47	48	word	21	NaN
24	です	48	50	word	22	NaN

extended mode

In [57]:

payload = {
  "text": "大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です",
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "extended"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)
df_kuromoji_extended = pd.json_normalize(response["tokens"])
df_kuromoji_extended

Out[57]:

	token	start_offset	end_offset	type	position	positionLength
0	大阪	0	2	word	0	NaN
1	府	2	3	word	1	NaN
2	の	3	4	word	2	NaN
3	関西	4	6	word	3	NaN
4	関西国際空港	4	10	word	3	3.0
5	国際	6	8	word	4	NaN
6	空港	8	10	word	5	NaN
7	K	11	12	word	6	NaN
8	I	12	13	word	7	NaN
9	X	13	14	word	8	NaN
10	から	15	17	word	9	NaN
11	東京	17	19	word	10	NaN
12	都	19	20	word	11	NaN
13	の	20	21	word	12	NaN
14	羽田	21	23	word	13	NaN
15	羽田空港	21	25	word	13	2.0
16	空港	23	25	word	14	NaN
17	H	26	27	word	15	NaN
18	N	27	28	word	16	NaN
19	D	28	29	word	17	NaN
20	まで	30	32	word	18	NaN
21	の	32	33	word	19	NaN
22	フライト	33	37	word	20	NaN
23	時間	37	39	word	21	NaN
24	は	39	40	word	22	NaN
25	およそ	40	43	word	23	NaN
26	7	44	45	word	24	NaN
27	0	45	46	word	25	NaN
28	分	47	48	word	26	NaN
29	です	48	50	word	27	NaN

Comparing the normal, search, and extended modes

Compare the three modes. The number of tokens increases in the order normal -> search -> extended.

In [58]:

df_kuromoji_search_and_normal = pd.merge(df_kuromoji_search, df_kuromoji_normal, on=["start_offset", "end_offset"], how="left", suffixes=["_kuromoji_search","_kuromoji_normal"]).drop(["type_kuromoji_search","type_kuromoji_normal","positionLength","position_kuromoji_search", "position_kuromoji_normal"],axis=1).reindex(["start_offset", "end_offset", "token_kuromoji_search", "token_kuromoji_normal"],axis=1).fillna("")
df_kuromoji_extended_and_normal = pd.merge(df_kuromoji_extended, df_kuromoji_normal, on=["start_offset", "end_offset"], how="left", suffixes=["_kuromoji_extended","_kuromoji_normal"]).drop(["type_kuromoji_extended","type_kuromoji_normal","positionLength","position_kuromoji_extended","position_kuromoji_normal"],axis=1).reindex(["start_offset", "end_offset", "token_kuromoji_extended", "token_kuromoji_normal"],axis=1)
df_kuromoji = pd.merge(df_kuromoji_extended_and_normal, df_kuromoji_search_and_normal, on=["start_offset"], how="left").drop(["token_kuromoji_normal_x"],axis=1).rename(columns={"token_kuromoji_normal_y": "token_kuromoji_normal"}).reindex(["start_offset", "token_kuromoji_extended", "token_kuromoji_search", "token_kuromoji_normal"],axis=1).fillna("")
df_kuromoji

Out[58]:

	start_offset	token_kuromoji_extended	token_kuromoji_search	token_kuromoji_normal
0	0	大阪	大阪	大阪
1	2	府	府	府
2	3	の	の	の
3	4	関西	関西
4	4	関西	関西国際空港	関西国際空港
5	4	関西国際空港	関西
6	4	関西国際空港	関西国際空港	関西国際空港
7	6	国際	国際
8	8	空港	空港
9	11	K	KIX	KIX
10	12	I
11	13	X
12	15	から	から	から
13	17	東京	東京	東京
14	19	都	都	都
15	20	の	の	の
16	21	羽田	羽田
17	21	羽田	羽田空港	羽田空港
18	21	羽田空港	羽田
19	21	羽田空港	羽田空港	羽田空港
20	23	空港	空港
21	26	H	HND	HND
22	27	N
23	28	D
24	30	まで	まで	まで
25	32	の	の	の
26	33	フライト	フライト	フライト
27	37	時間	時間	時間
28	39	は	は	は
29	40	およそ	およそ	およそ
30	44	7	70	70
31	45	0
32	47	分	分	分
33	48	です	です	です
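In the table above, offsets 4 and 21 appear in several rows because search and extended emit two tokens at the same start offset and the merge pairs every combination. As an alternative sketch, the hypothetical helper below groups tokens per start offset in plain Python so that each offset yields exactly one row. The sample data is a small excerpt copied from the outputs above.

```python
from collections import defaultdict


def tokens_by_offset(tokens):
    """Group token strings by their start_offset, preserving order."""
    grouped = defaultdict(list)
    for token in tokens:
        grouped[token["start_offset"]].append(token["token"])
    return grouped


def compare_token_lists(results):
    """Return one row per start_offset with the tokens of each labelled result."""
    grouped = {label: tokens_by_offset(tokens) for label, tokens in results.items()}
    offsets = sorted({offset for g in grouped.values() for offset in g})
    return [
        {
            "start_offset": offset,
            **{label: " / ".join(g.get(offset, [])) for label, g in grouped.items()},
        }
        for offset in offsets
    ]


sample = {
    "normal": [{"token": "関西国際空港", "start_offset": 4}],
    "search": [
        {"token": "関西", "start_offset": 4},
        {"token": "関西国際空港", "start_offset": 4},
        {"token": "国際", "start_offset": 6},
        {"token": "空港", "start_offset": 8},
    ],
}
rows = compare_token_lists(sample)
for row in rows:
    print(row)
```

In the notebook, the same function could take response["tokens"] from each mode, and pd.DataFrame(rows) would render the result as a table.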

To discard the original compound token in search or extended mode, set discard_compound_token to true. The following shows how the discard_compound_token parameter changes the result in search mode.

In [59]:

payload = {
  "text": "大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です",
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search",
    "discard_compound_token": True
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)
df_kuromoji_search_discard_compound_token = pd.json_normalize(response["tokens"])
df_kuromoji_search_results = pd.merge(df_kuromoji_search, df_kuromoji_search_discard_compound_token, on=["start_offset", "end_offset"], how="left", suffixes=["_without_discard_compound_token","_with_discard_compound_token"]).drop(["type_without_discard_compound_token","type_with_discard_compound_token","positionLength","position_without_discard_compound_token", "position_with_discard_compound_token"],axis=1).reindex(["start_offset", "end_offset", "token_without_discard_compound_token", "token_with_discard_compound_token"],axis=1).fillna("")
df_kuromoji_search_results

Out[59]:

	start_offset	end_offset	token_without_discard_compound_token	token_with_discard_compound_token
0	0	2	大阪	大阪
1	2	3	府	府
2	3	4	の	の
3	4	6	関西	関西
4	4	10	関西国際空港
5	6	8	国際	国際
6	8	10	空港	空港
7	11	14	KIX	KIX
8	15	17	から	から
9	17	19	東京	東京
10	19	20	都	都
11	20	21	の	の
12	21	23	羽田	羽田
13	21	25	羽田空港
14	23	25	空港	空港
15	26	29	HND	HND
16	30	32	まで	まで
17	32	33	の	の
18	33	37	フライト	フライト
19	37	39	時間	時間
20	39	40	は	は
21	40	43	およそ	およそ
22	44	46	70	70
23	47	48	分	分
24	48	50	です	です

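Sudachi

Sudachi is a morphological analysis engine developed by Works Applications. It has the following advantages over Kuromoji.

- A continuously maintained system dictionary. Maintenance of Kuromoji's standard dictionary stopped in 2007, whereas Sudachi's system dictionary was still being maintained as of 2024.
- Flexible choice of segmentation mode, from UniDic short units to named-entity extraction. A custom dictionary can also specify, per entry, how the entry is split in each mode.
- Rich normalization features. Normalized forms can also be defined in the dictionary for custom entries.

Support and usage differ by platform.

- OpenSearch Service: available by associating Sudachi through the custom packages feature
- OpenSearch Serverless: Sudachi is not supported
- OSS OpenSearch: obtain the source code from the repository, build it, and install the resulting plugin binary

Sudachi provides the following three split modes. Let's look at the differences between them.

- A: splits into the smallest units, equivalent to UniDic short unit words (SUW)
- B: splits named entities into middle-sized units (combining up to two of the smallest units)
- C: extracts named entities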

Punctuation and symbols are omitted from the results because the discard_punctuation option is left at its default of true.

In [60]:

payload = {
  "text": "大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です",
  "tokenizer": {
    "type": "sudachi_tokenizer",
    "split_mode": "A"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)
df_sudachi_a = pd.json_normalize(response["tokens"])
df_sudachi_a

Out[60]:

	token	start_offset	end_offset	type	position
0	大阪	0	2	word	0
1	府	2	3	word	1
2	の	3	4	word	2
3	関西	4	6	word	3
4	国際	6	8	word	4
5	空港	8	10	word	5
6	KIX	11	14	word	6
7	から	15	17	word	7
8	東京	17	19	word	8
9	都	19	20	word	9
10	の	20	21	word	10
11	羽田	21	23	word	11
12	空港	23	25	word	12
13	HND	26	29	word	13
14	まで	30	32	word	14
15	の	32	33	word	15
16	フライト	33	37	word	16
17	時間	37	39	word	17
18	は	39	40	word	18
19	およそ	40	43	word	19
20	70	44	46	word	20
21	分	47	48	word	21
22	です	48	50	word	22

In [61]:

payload = {
  "text": "大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です",
  "tokenizer": {
    "type": "sudachi_tokenizer",
    "split_mode": "B"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)
df_sudachi_b = pd.json_normalize(response["tokens"])
df_sudachi_b

Out[61]:

	token	start_offset	end_offset	type	position
0	大阪府	0	3	word	0
1	の	3	4	word	1
2	関西	4	6	word	2
3	国際	6	8	word	3
4	空港	8	10	word	4
5	KIX	11	14	word	5
6	から	15	17	word	6
7	東京都	17	20	word	7
8	の	20	21	word	8
9	羽田	21	23	word	9
10	空港	23	25	word	10
11	HND	26	29	word	11
12	まで	30	32	word	12
13	の	32	33	word	13
14	フライト	33	37	word	14
15	時間	37	39	word	15
16	は	39	40	word	16
17	およそ	40	43	word	17
18	70	44	46	word	18
19	分	47	48	word	19
20	です	48	50	word	20

In [62]:

payload = {
  "text": "大阪府の関西国際空港(KIX)から東京都の羽田空港(HND)までのフライト時間はおよそ 70 分です",
  "tokenizer": {
    "type": "sudachi_tokenizer",
    "split_mode": "C"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)
df_sudachi_c = pd.json_normalize(response["tokens"])
df_sudachi_c

Out[62]:

	token	start_offset	end_offset	type	position
0	大阪府	0	3	word	0
1	の	3	4	word	1
2	関西国際空港	4	10	word	2
3	KIX	11	14	word	3
4	から	15	17	word	4
5	東京都	17	20	word	5
6	の	20	21	word	6
7	羽田空港	21	25	word	7
8	HND	26	29	word	8
9	まで	30	32	word	9
10	の	32	33	word	10
11	フライト	33	37	word	11
12	時間	37	39	word	12
13	は	39	40	word	13
14	およそ	40	43	word	14
15	70	44	46	word	15
16	分	47	48	word	16
17	です	48	50	word	17

Compare the results of the three split modes.

In [63]:

df_sudachi_b_and_a = pd.merge(df_sudachi_a, df_sudachi_b, on=["start_offset"], how="left", suffixes=["_a","_b"]).drop(["type_a","type_b","position_a", "position_b"],axis=1).reindex(["start_offset", "token_a", "token_b"],axis=1).fillna("")
df_sudachi_c_and_a = pd.merge(df_sudachi_a, df_sudachi_c, on=["start_offset"], how="left", suffixes=["_a","_c"]).drop(["type_a","type_c","position_a", "position_c"],axis=1).reindex(["start_offset", "token_a", "token_c"],axis=1).fillna("")
df_sudachi = pd.merge(df_sudachi_c_and_a, df_sudachi_b_and_a, on=["start_offset"], how="left").drop(["token_a_x"],axis=1).rename(columns={"token_a_y": "token_a"}).reindex(["start_offset", "token_a", "token_b", "token_c"],axis=1).fillna("")
df_sudachi

Out[63]:

	start_offset	token_a	token_b	token_c
0	0	大阪	大阪府	大阪府
1	2	府
2	3	の	の	の
3	4	関西	関西	関西国際空港
4	6	国際	国際
5	8	空港	空港
6	11	KIX	KIX	KIX
7	15	から	から	から
8	17	東京	東京都	東京都
9	19	都
10	20	の	の	の
11	21	羽田	羽田	羽田空港
12	23	空港	空港
13	26	HND	HND	HND
14	30	まで	まで	まで
15	32	の	の	の
16	33	フライト	フライト	フライト
17	37	時間	時間	時間
18	39	は	は	は
19	40	およそ	およそ	およそ
20	44	70	70	70
21	47	分	分	分
22	48	です	です	です

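Character Filter

A character filter is the component responsible for normalization before the text is passed to the tokenizer. By removing unnecessary characters and unifying half-width and full-width forms, it prevents orthographic variation from lowering search accuracy.

Some character filters, such as the one that replaces iteration marks, also help improve the accuracy of token segmentation itself.

ICU normalization character filter

The ICU normalization character filter normalizes strings. It can correct orthographic variations such as the following.

Conversion	Example (before)	Example (after)
Uppercase -> lowercase	OpenSearch	opensearch
Full-width alphanumerics and symbols -> half-width alphanumerics and symbols	oｐeｎ＿sｅaｒcｈ	open_search
Half-width kana -> full-width kana	ｵｰﾌﾟﾝｿｰｽ	オープンソース
Number symbols -> half-width digits	①	1
Unit symbols -> full-width kana	㍍	メートル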
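The conversions in the table can be approximated locally with Python's standard library, which helps when checking what a given string should look like after normalization. This is only a sketch: it assumes the filter is used with its default nfkc_cf form, and it combines NFKC with str.casefold(), which is close to but not identical to ICU's NFKC_Casefold.

```python
import unicodedata


def nfkc_casefold(text):
    """Approximate ICU nfkc_cf: NFKC compatibility normalization plus case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


for raw in ["OpenSearch", "oｐeｎ＿sｅaｒcｈ", "ｵｰﾌﾟﾝｿｰｽ", "①", "㍍"]:
    print(raw, "->", nfkc_casefold(raw))
```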

The following example normalizes a string that mixes many kinds of characters.

In [64]:

payload = {
  "text": "OｐeｎsｅaｒCｈは①⓪⓪㌫ｵｰﾌﾟンｿｰｽの検索／分析スイートです",
  "tokenizer": {
    "type": "sudachi_tokenizer"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi = pd.json_normalize(response["tokens"])

payload = {
  "text": "OｐeｎsｅaｒCｈは①⓪⓪㌫ｵｰﾌﾟンｿｰｽの検索／分析スイートです",
  "tokenizer": {
    "type": "sudachi_tokenizer"
  },
  "char_filter": ["icu_normalizer"]
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi_normalized = pd.json_normalize(response["tokens"])

pd.merge(df_sudachi, df_sudachi_normalized, on=["start_offset","end_offset"], how="outer").rename(columns={"token_x": "token", "token_y": "token_normalized"}).reindex(["start_offset", "end_offset", "token", "token_normalized"],axis=1).fillna("")

Out[64]:

	start_offset	end_offset	token	token_normalized
0	0	10	OｐeｎsｅaｒCｈ	opensearch
1	10	11	は	は
2	11	14	①⓪⓪	100
3	14	15	㌫	パーセント
4	15	23	ｵｰﾌﾟンｿｰｽ	オープンソース
5	23	24	の	の
6	24	26	検索	検索
7	27	29	分析	分析
8	29	33	スイート	スイート
9	33	35	です	です

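kuromoji_iteration_mark character filter

kuromoji_iteration_mark replaces iteration marks (々, ゝ, ヽ) with the preceding character.

If text is tokenized without converting iteration marks, the following problems occur.

- The iteration mark alone gets indexed as a token
- A search for a keyword containing an iteration mark hits every keyword that contains an iteration mark
- The text is split at odd positions

For example, when こゝろ or つゝむ is processed as-is by the Kuromoji tokenizer, ゝ is extracted as a token of its own. If data is stored in the index in this state, a search for こゝろ also hits つゝむ.

In addition, 学問のすゝめ is split at an unnatural position: 学問 / の / すゝ / め.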

In [65]:

payload = {
  "text": "こゝろ",
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[65]:

	token	start_offset	end_offset	type	position
0	こ	0	1	word	0
1	ゝ	1	2	word	1
2	ろ	2	3	word	2

In [66]:

payload = {
  "text": "つゝむ",
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[66]:

	token	start_offset	end_offset	type	position
0	つ	0	1	word	0
1	ゝ	1	2	word	1
2	む	2	3	word	2

In [67]:

payload = {
  "text": "学問のすゝめ",
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[67]:

	token	start_offset	end_offset	type	position
0	学問	0	2	word	0
1	の	2	3	word	1
2	すゝ	3	5	word	2
3	め	5	6	word	3

With kuromoji_iteration_mark, each iteration mark is replaced by the preceding character and the tokens are extracted correctly.

In [68]:

payload = {
  "text": "学問のすゝめ",
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "char_filter": ["kuromoji_iteration_mark"]
}

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[68]:

	token	start_offset	end_offset	type	position
0	学問	0	2	word	0
1	の	2	3	word	1
2	すすめ	3	6	word	2
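The replacement itself is simple to picture. The sketch below is a deliberately simplified stand-in written for this explanation, not the filter's actual implementation: it copies the previous character for each of the three marks listed above and does not handle voiced marks (ゞ, ヾ) or runs of consecutive marks.

```python
ITERATION_MARKS = {"々", "ゝ", "ヽ"}


def expand_iteration_marks(text):
    """Replace each iteration mark with the character that precedes it (simplified)."""
    result = []
    for char in text:
        if char in ITERATION_MARKS and result:
            # Repeat the most recent output character.
            result.append(result[-1])
        else:
            result.append(char)
    return "".join(result)


for sample in ["学問のすゝめ", "こゝろ", "明々白々"]:
    print(sample, "->", expand_iteration_marks(sample))
```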

When the Sudachi tokenizer is used, iteration marks generally do not cause unnatural token boundaries, so this filter is not mandatory.

In [69]:

payload = {
  "text": "学問のすゝめ",
  "tokenizer": {
    "type": "sudachi_tokenizer"
  }
}

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[69]:

	token	start_offset	end_offset	type	position
0	学問	0	2	word	0
1	の	2	3	word	1
2	すゝめ	3	6	word	2

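Token Filter

A token filter processes the tokens that the tokenizer has split and extracted. Token filters provide processing that is essential for improving search accuracy, such as removing stop words and specific parts of speech that add search noise, stemming, and correcting orthographic variation. The following sections explain the main token filters.

Some token filters rely on information such as part-of-speech classification. These assume that the tokens were produced by the same plugin (Kuromoji or Sudachi), so tokens generated by Kuromoji sometimes cannot be processed by a Sudachi token filter. The following sections also cover these restrictions.

Converting to the base form

Indexing and searching with inflected forms replaced by their base form prevents differences such as 食べる versus 食べた from lowering the hit rate. Use the kuromoji_baseform token filter when tokenizing with Kuromoji, and sudachi_baseform when tokenizing with Sudachi.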

In [70]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_kuromoji = pd.json_normalize(response["tokens"])

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": ["kuromoji_baseform"],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_kuromoji_baseform = pd.json_normalize(response["tokens"])

pd.merge(df_kuromoji_baseform, df_kuromoji, on=["start_offset","end_offset"], how="outer").rename(columns={"token_x": "token_baseform", "token_y": "token"}).reindex(["start_offset", "end_offset", "token", "token_baseform"],axis=1).fillna("")

Out[70]:

	start_offset	end_offset	token	token_baseform
0	0	2	寿司	寿司
1	2	3	を	を
2	3	5	食べ	食べる
3	5	6	た	た
4	7	12	美味しかっ	美味しい
5	12	13	た	た
6	13	14	な	な

In [71]:

payload = {
  "tokenizer": "sudachi_tokenizer",
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi = pd.json_normalize(response["tokens"])

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": ["sudachi_baseform"],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi_baseform = pd.json_normalize(response["tokens"])

pd.merge(df_sudachi_baseform, df_sudachi, on=["start_offset","end_offset"], how="outer").rename(columns={"token_x": "token_baseform", "token_y": "token"}).reindex(["start_offset", "end_offset", "token", "token_baseform"],axis=1).fillna("")

Out[71]:

	start_offset	end_offset	token	token_baseform
0	0	2	寿司	寿司
1	2	3	を	を
2	3	5	食べ	食べる
3	5	6	た	た
4	7	12	美味しかっ	美味しい
5	12	13	た	た
6	13	14	な	な

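Mixed combinations such as sudachi_tokenizer with kuromoji_baseform, or kuromoji_tokenizer with sudachi_baseform, do not work. As the following cells show, kuromoji_tokenizer with sudachi_baseform raises an error, while sudachi_tokenizer with kuromoji_baseform returns no error but leaves the tokens unconverted.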

In [72]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": ["sudachi_baseform"],
  "text": "寿司を食べた。美味しかったな"
}

try: 
    response = opensearch_client.indices.analyze(
      body=payload
    )
    pd.json_normalize(response["tokens"])
except Exception as e:
    print(e)

TransportError(500, 'illegal_state_exception', 'Attribute MorphemeAttribute was not present')

In [73]:

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": ["kuromoji_baseform"],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[73]:

	token	start_offset	end_offset	type	position
0	寿司	0	2	word	0
1	を	2	3	word	1
2	食べ	3	5	word	2
3	た	5	6	word	3
4	美味しかっ	7	12	word	4
5	た	12	13	word	5
6	な	13	14	word	6
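Since the second mismatch fails silently, it can be worth checking an analyzer definition before sending it. The sketch below is a hypothetical lint helper, not a plugin feature. It assumes that only the baseform, part_of_speech, and readingform filters depend on the tokenizer's plugin, which matches the cases covered in this lab but may not be exhaustive.

```python
# Filter name suffixes assumed (for this sketch) to need tokens from the same plugin.
MORPHOLOGY_FILTER_SUFFIXES = ("_baseform", "_part_of_speech", "_readingform")


def find_mismatched_filters(tokenizer, filters):
    """Return filter names whose plugin prefix differs from the tokenizer's."""
    family = tokenizer.split("_", 1)[0]  # "kuromoji" or "sudachi"
    mismatched = []
    for entry in filters:
        # Filters may be given by name or as an inline definition with a "type".
        name = entry["type"] if isinstance(entry, dict) else entry
        if name.endswith(MORPHOLOGY_FILTER_SUFFIXES) and not name.startswith(family + "_"):
            mismatched.append(name)
    return mismatched


print(find_mismatched_filters("kuromoji_tokenizer", ["sudachi_baseform"]))
print(find_mismatched_filters("sudachi_tokenizer", ["kuromoji_baseform"]))
print(find_mismatched_filters("sudachi_tokenizer", ["sudachi_baseform"]))
```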

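Removing tokens by part of speech

Tokens extracted by the tokenizer carry part-of-speech information. Based on that classification, tokens that can become search noise, such as particles and conjunctions, are removed. In the cells below, the Kuromoji example uses kuromoji_part_of_speech with hyphen-separated tags, and the Sudachi example uses sudachi_part_of_speech with comma-separated tags.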

In [74]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": ["kuromoji_baseform"],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_kuromoji_baseform = pd.json_normalize(response["tokens"])

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    "kuromoji_baseform",
    {
        "type": "kuromoji_part_of_speech",
        "stoptags": [
          "助詞-格助詞-一般",
          "助動詞",
          "助詞-終助詞"
        ]
    }
  ],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_kuromoji_baseform_part_of_speech = pd.json_normalize(response["tokens"])

pd.merge(df_kuromoji_baseform, df_kuromoji_baseform_part_of_speech, on=["start_offset","end_offset"], how="outer").rename(columns={"token_x": "token_baseform", "token_y": "token_baseform_part_of_speech"}).reindex(["start_offset", "end_offset", "token_baseform", "token_baseform_part_of_speech"],axis=1).fillna("")

Out[74]:

	start_offset	end_offset	token_baseform	token_baseform_part_of_speech
0	0	2	寿司	寿司
1	2	3	を
2	3	5	食べる	食べる
3	5	6	た
4	7	12	美味しい	美味しい
5	12	13	た
6	13	14	な

In [75]:

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": ["sudachi_baseform"],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi_baseform = pd.json_normalize(response["tokens"])

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": [
    "sudachi_baseform",
    {
        "type": "sudachi_part_of_speech",
        "stoptags": [
            "助詞,終助詞",
            "助詞,格助詞",
            "助動詞",
        ]
    }
  ],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi_baseform_part_of_speech = pd.json_normalize(response["tokens"])

pd.merge(df_sudachi_baseform, df_sudachi_baseform_part_of_speech, on=["start_offset","end_offset"], how="outer").rename(columns={"token_x": "token_baseform", "token_y": "token_baseform_part_of_speech"}).reindex(["start_offset", "end_offset", "token_baseform", "token_baseform_part_of_speech"],axis=1).fillna("")

Out[75]:

	start_offset	end_offset	token_baseform	token_baseform_part_of_speech
0	0	2	寿司	寿司
1	2	3	を
2	3	5	食べる	食べる
3	5	6	た
4	7	12	美味しい	美味しい
5	12	13	た
6	13	14	な

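Removing stop words

Words that are not important for search, such as the Japanese particles "てにをは", are called stop words. Storing stop words in the index degrades search quality, so they are generally removed before indexing. This resembles removal by part of speech, but stop-word removal decides based on a stop-word list rather than on part-of-speech classification.

The cells below apply ja_stop and sudachi_ja_stop with the predefined _japanese_ list plus the custom word 寿司, followed by the generic stop filter. In the last output only 寿司 is removed, which suggests that the generic stop filter does not expand _japanese_ in this environment.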

In [76]:

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": [
    "sudachi_baseform"
  ],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi = pd.json_normalize(response["tokens"])

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": [
    "sudachi_baseform",
    {
      "type": "ja_stop",
      "stopwords": ["_japanese_","寿司"]
    }
  ],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi_ja_stop = pd.json_normalize(response["tokens"])

pd.merge(df_sudachi, df_sudachi_ja_stop, on=["start_offset","end_offset"], how="outer").rename(columns={"token_x": "token", "token_y": "token_ja_stop"}).reindex(["start_offset", "end_offset", "token", "token_ja_stop"],axis=1).fillna("")

Out[76]:

	start_offset	end_offset	token	token_ja_stop
0	0	2	寿司
1	2	3	を
2	3	5	食べる	食べる
3	5	6	た
4	7	12	美味しい	美味しい
5	12	13	た
6	13	14	な

In [78]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    "kuromoji_baseform"
  ],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_kuromoji = pd.json_normalize(response["tokens"])

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    "kuromoji_baseform",
    {
      "type": "sudachi_ja_stop",
      "stopwords": ["_japanese_","寿司"]
    }
  ],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_kuromoji_sudachi_ja_stop = pd.json_normalize(response["tokens"])

pd.merge(df_kuromoji, df_kuromoji_sudachi_ja_stop, on=["start_offset","end_offset"], how="outer").rename(columns={"token_x": "token", "token_y": "token_ja_stop"}).reindex(["start_offset", "end_offset", "token", "token_ja_stop"],axis=1).fillna("")

Out[78]:

	start_offset	end_offset	token	token_ja_stop
0	0	2	寿司
1	2	3	を
2	3	5	食べる	食べる
3	5	6	た
4	7	12	美味しい	美味しい
5	12	13	た
6	13	14	な

In [79]:

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": [
    "sudachi_baseform"
  ],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi = pd.json_normalize(response["tokens"])

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": [
    "sudachi_baseform",
    {
      "type": "stop",
      "stopwords": ["_japanese_","寿司"]
    }
  ],
  "text": "寿司を食べた。美味しかったな"
}

response = opensearch_client.indices.analyze(
  body=payload
)

df_sudachi_stop = pd.json_normalize(response["tokens"])

pd.merge(df_sudachi, df_sudachi_stop, on=["start_offset","end_offset"], how="outer").rename(columns={"token_x": "token", "token_y": "token_ja_stop"}).reindex(["start_offset", "end_offset", "token", "token_ja_stop"],axis=1).fillna("")

Out[79]:

	start_offset	end_offset	token	token_ja_stop
0	0	2	寿司
1	2	3	を	を
2	3	5	食べる	食べる
3	5	6	た	た
4	7	12	美味しい	美味しい
5	12	13	た	た
6	13	14	な	な
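The filters shown so far are normally combined into one custom analyzer in the index settings rather than passed ad hoc to _analyze. The following is a sketch of such settings built from the Kuromoji components used in this lab. The names ja_search_tokenizer, ja_pos_filter, ja_stop_filter, and ja_text are hypothetical, and the choice and order of filters is one reasonable arrangement rather than a recommendation from the original lab.

```python
def build_japanese_analyzer_settings():
    """Build index settings that chain char filters, a tokenizer, and token filters."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "ja_search_tokenizer": {
                        "type": "kuromoji_tokenizer",
                        "mode": "search",
                    }
                },
                "filter": {
                    "ja_pos_filter": {
                        "type": "kuromoji_part_of_speech",
                        "stoptags": ["助詞-格助詞-一般", "助動詞", "助詞-終助詞"],
                    },
                    "ja_stop_filter": {
                        "type": "ja_stop",
                        "stopwords": ["_japanese_"],
                    },
                },
                "analyzer": {
                    "ja_text": {
                        "type": "custom",
                        # 1. Normalize characters before tokenizing.
                        "char_filter": ["icu_normalizer", "kuromoji_iteration_mark"],
                        # 2. Split into tokens.
                        "tokenizer": "ja_search_tokenizer",
                        # 3. Base form, then part-of-speech and stop-word removal.
                        "filter": ["kuromoji_baseform", "ja_pos_filter", "ja_stop_filter"],
                    }
                },
            }
        }
    }


settings_body = build_japanese_analyzer_settings()
print(settings_body["settings"]["analysis"]["analyzer"]["ja_text"])
```

A body like this would be passed when creating an index, and a text field would then reference ja_text as its analyzer.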

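Synonyms

OpenSearch improves search accuracy by treating synonyms as the same term.

For example, "パイン" and "パイナップル" refer to the same thing, but because they are written differently they are treated as different keywords. The following shows the actual behavior.

Sudachi and Kuromoji behave the same for the words below, so the behavior is checked with Kuromoji only. Setting tokenizer to sudachi_tokenizer switches to Sudachi.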

In [80]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  #"tokenizer": "sudachi_tokenizer",
  "text": ["パイン", "パイナップル"]
} 

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[80]:

	token	start_offset	end_offset	type	position
0	パイン	0	3	word	0
1	パイナップル	4	10	word	101

Configuring synonyms makes it possible to expand the synonyms of a text at indexing time and at search time.

In [81]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    {
      "type": "synonym",
      "lenient": False,
      "synonyms": [ "パイン=> パイナップル" ]
    }
  ],
  "text": ["パインゼリー", "パイナップルアイス"]
} 

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[81]:

	token	start_offset	end_offset	type	position
0	パイナップル	0	3	SYNONYM	0
1	ゼリー	3	6	word	1
2	パイナップル	7	13	word	102
3	アイス	13	16	word	103

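In the _analyze API output, a token whose type is SYNONYM was produced by expansion through a synonym definition. In the example above, パイン became パイナップル because the arrow (=>) in the synonym rule restricts the direction of expansion: パイン is converted to パイナップル before it is stored in the index.

When the terms are instead separated by commas without an arrow, the synonyms expand in both directions. The following is an example.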

In [82]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    {
      "type": "synonym",
      "lenient": False,
      "synonyms": [ "パイン,パイナップル" ]
    }
  ],
  "text": ["パインゼリー", "パイナップルアイス"]
} 

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[82]:

	token	start_offset	end_offset	type	position
0	パイン	0	3	word	0
1	パイナップル	0	3	SYNONYM	0
2	ゼリー	3	6	word	1
3	パイナップル	7	13	word	102
4	パイン	7	13	SYNONYM	102
5	アイス	13	16	word	103
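The difference between the two rule styles can be summarized with a small plain-Python model. This is an illustration written for this explanation, not how the synonym filter is implemented: arrow rules replace the left-hand terms with the right-hand terms, and comma-only rules map every term to the whole group.

```python
def parse_synonym_rules(rules):
    """Turn rules like "a => b" or "a,b" into a {term: [replacement terms]} mapping."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # One-way rule: left-hand terms are replaced by right-hand terms.
            left, right = rule.split("=>", 1)
            sources = [term.strip() for term in left.split(",")]
            targets = [term.strip() for term in right.split(",")]
        else:
            # Two-way rule: every term expands to all terms in the group.
            sources = targets = [term.strip() for term in rule.split(",")]
        for source in sources:
            expanded = mapping.setdefault(source, [])
            for target in targets:
                if target not in expanded:
                    expanded.append(target)
    return mapping


def expand(token, mapping):
    """Return the tokens emitted for one input token."""
    return mapping.get(token, [token])


one_way = parse_synonym_rules(["パイン=> パイナップル"])
two_way = parse_synonym_rules(["パイン,パイナップル"])
print(expand("パイン", one_way), expand("パイナップル", one_way))
print(expand("パイン", two_way), expand("パイナップル", two_way))
```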

Converting to kana and romaji readings

Converting tokens to their kana or romaji reading can correct variation in search terms.

Sudachi and Kuromoji each require their own readingform filter: use kuromoji_readingform with kuromoji_tokenizer and sudachi_readingform with sudachi_tokenizer.

Setting the use_romaji option to true converts tokens to romaji, and false converts them to katakana.

In [83]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    {
      "type": "kuromoji_readingform",
      "use_romaji": True
    },
  ],
  "text": ["いか", "烏賊", "イカ"]
} 

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[83]:

	token	start_offset	end_offset	type	position
0	ika	0	2	word	0
1	ika	3	5	word	101
2	ika	6	8	word	202

In [84]:

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": [
    {
      "type": "sudachi_readingform",
      "use_romaji": False
    },
  ],
  "text": ["いか", "烏賊", "イカ"]
} 

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[84]:

	token	start_offset	end_offset	type	position
0	イカ	0	2	word	0
1	イカ	3	5	word	101
2	イカ	6	8	word	202

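Conversion accuracy depends on the dictionary. For example, "紅まどんな" (read "べにまどんな") is not registered in Sudachi's default system dictionary, so it is split into several tokens and 紅 is read as "くれない" instead of "べに". You can address this by registering the word, together with its reading, in a custom dictionary.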

In [85]:

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": [
    {
      "type": "sudachi_readingform",
      "use_romaji": False
    },
  ],
  "text": ["紅まどんな"]
} 

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[85]:

	token	start_offset	end_offset	type	position
0	クレナイ	0	1	word	0
1	マ	1	2	word	1
2	ドンナ	2	5	word	2

Another point to note is that homophones written with different characters are also converted to the same string, which can increase search noise.

In [86]:

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": [
    {
      "type": "sudachi_readingform",
      "use_romaji": False
    },
  ],
  "text": ["感情", "勘定", "環状"]
} 

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[86]:

	token	start_offset	end_offset	type	position
0	カンジョウ	0	2	word	0
1	カンジョウ	3	5	word	101
2	カンジョウ	6	8	word	202
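One possible way to limit this noise, which the lab itself does not configure, is to keep the reading in a separate sub-field and give the surface form a higher weight at query time. The sketch below only builds the request bodies. The names `katakana_reading`, `surface_analyzer`, `reading_analyzer`, the `reading` sub-field, and the boost value are all assumptions chosen for illustration.

```python
def build_reading_index_body(field="question"):
    """Index body with a surface-form field and a katakana-reading sub-field."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; the type comes from the Sudachi plugin.
                    "katakana_reading": {"type": "sudachi_readingform", "use_romaji": False}
                },
                "analyzer": {
                    "surface_analyzer": {"type": "custom", "tokenizer": "sudachi_tokenizer"},
                    "reading_analyzer": {
                        "type": "custom",
                        "tokenizer": "sudachi_tokenizer",
                        "filter": ["katakana_reading"],
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "surface_analyzer",
                    # Multi-field: the same text indexed a second time by reading.
                    "fields": {"reading": {"type": "text", "analyzer": "reading_analyzer"}},
                }
            }
        },
    }


def build_reading_query(text, field="question", surface_boost=3):
    """Query both fields, weighting exact surface matches above reading matches."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": [f"{field}^{surface_boost}", f"{field}.reading"],
            }
        }
    }


print(build_reading_query("感情"))
```

With a layout like this, a query for 感情 would still match documents containing 勘定 or 環状 through the reading sub-field, but documents containing 感情 itself would be expected to score higher. Whether that trade-off is acceptable depends on the use case.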

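Other normalization features¶

This section covers features that are specific to each morphological analyzer.

Comprehensive normalization (Sudachi)¶

The Sudachi plugin provides the sudachi_normalizedform token filter, which can perform the following kinds of normalization.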

Okurigana: e.g. 打込む → 打ち込む
Script: e.g. かつ丼 → カツ丼
Variant: e.g. 附属 → 付属
Misspelling: e.g. シュミレーション → シミュレーション
Contracted form: e.g. ちゃあ → ては
Long sign: e.g. コンピュータ → コンピューター

In [87]:

payload = {
  "tokenizer": "sudachi_tokenizer",
  "filter": ["sudachi_normalizedform"],
  "text": ["コンピュータ", "ユーザ", "プリンタ", "シュミレーション", "コーラ", "ちゃあ", "附属", "打込み", "かつ丼"]
}

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[87]:

	token	start_offset	end_offset	type	position
0	コンピューター	0	6	word	0
1	ユーザー	7	10	word	101
2	プリンター	11	15	word	202
3	シミュレーション	16	24	word	303
4	コーラ	25	28	word	404
5	だ	29	32	word	505
6	付属	33	35	word	606
7	打ち込む	36	39	word	707
8	カツ丼	40	43	word	808

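Long-vowel mark stemming (Kuromoji)¶

Kuromoji provides kuromoji_stemmer, a token filter that removes the long-vowel mark (ー) at the end of a token. The minimum_length option sets the minimum token length, in characters, for which the mark is removed.

Tokens shorter than minimum_length keep their trailing long-vowel mark. The default value is 4. This value probably derives from an earlier edition of JIS Z 8301, which specified that words of three or more sounds omit the word-final long-vowel mark, while words of two sounds or fewer keep it. As of 2024, this rule has been removed from JIS Z 8301.
This token filter applies only to full-width katakana. To apply it to half-width katakana or to hiragana, first convert half-width katakana to full-width katakana with icu_normalizer, or convert hiragana to full-width katakana with kuromoji_readingform.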

In [88]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    {
      "type": "kuromoji_stemmer",
      "minimum_length": 4 #default
    }
  ],
  "text": ["コピー", "サーバー"]
}

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[88]:

	token	start_offset	end_offset	type	position
0	コピー	0	3	word	0
1	サーバ	4	8	word	101

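With Sudachi, sudachi_normalizedform can normalize the trailing long-vowel mark, so this filter is not required. kuromoji_stemmer can be combined with sudachi_tokenizer, but sudachi_normalizedform adds the long-vowel mark (for example, converting "コンピュータ" to "コンピューター"), which is today's prevailing convention and the opposite of what kuromoji_stemmer does. We therefore see no benefit in combining kuromoji_stemmer with Sudachi.

For details, see the Cabinet Notification and Cabinet Directive「外来語の表記　留意事項その2(細則的な事項)」(notation of loanwords, notes part 2: detailed rules) and JTCA's「TC 関連ガイドライン」(TC-related guidelines).

This token filter resolves spelling variants caused by the presence or absence of a word-final long-vowel mark, but it has a side effect to watch for: removing the mark can change the meaning of the original word.

For example, removing the trailing long-vowel mark from コーラー (caller) produces the token コーラ (cola), which is a different word.

The minimum_length setting exists to limit this problem. With the default value of 4, the following conversions are prevented.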

エコー(echo) -> エコ(eco)
エラー(error) -> エラ(era)
カバー(cover) -> カバ
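The rule described in this section can be written as a small pure-Python function. This is a simplified model of the documented behavior, not the filter's actual implementation. The function names are hypothetical, and the check against the Unicode Katakana block (U+30A0 to U+30FF) is an assumption about what "full-width katakana only" means in practice.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_fullwidth_katakana(token):
    """True when every character falls in the Unicode Katakana block."""
    return bool(token) and all("\u30a0" <= ch <= "\u30ff" for ch in token)


def stem_katakana(token, minimum_length=4):
    """Drop a trailing ー from katakana tokens of at least minimum_length characters."""
    if len(token) < minimum_length or not is_fullwidth_katakana(token):
        return token
    if token.endswith(PROLONGED_SOUND_MARK):
        return token[:-1]
    return token


for word in ["コピー", "サーバー", "コーラー", "エコー", "ｻｰﾊﾞｰ"]:
    print(word, "->", stem_katakana(word))
```

Three-character words such as エコー, エラー, and カバー stay intact with the default of 4, while コーラー (four characters) still loses its mark, so minimum_length reduces the side effect but does not remove it.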

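Conversion to Arabic numerals (Kuromoji)¶

Kuromoji provides kuromoji_number, a token filter that replaces kanji numerals with Arabic numerals. The kanji numerals it converts are listed in Lucene's JapaneseNumberFilter.java.

The largest supported unit is 垓 (10 to the 20th power).

Conversion applies only to tokens that, after tokenization, consist entirely of kanji numerals. An idiom such as 千載一遇 is therefore left unchanged, as the example below shows.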

In [89]:

payload = {
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    {
      "type": "kuromoji_number"
    }
  ],
  "text": ["千垓千一",  "二千,五百十円です", "千載一遇"]
}

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[89]:

	token	start_offset	end_offset	type	position	positionLength
0	100000000000000000001001	0	4	word	0	NaN
1	2510	5	11	word	101	NaN
2	円	11	12	word	102	NaN
3	です	12	14	word	103	NaN
4	千載	15	17	word	204	NaN
5	千載一遇	15	19	word	204	3.0
6	一	17	18	word	205	NaN
7	遇	18	19	word	206	NaN
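To see how 千垓千一 becomes a 24-digit number, the arithmetic can be sketched in plain Python. This is a simplified parser written for illustration, not Lucene's implementation. The function name `kanji_to_int` is hypothetical, and the handling of separators and of a bare large unit is an assumption.

```python
DIGITS = {ch: value for value, ch in enumerate("〇一二三四五六七八九")}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12, "京": 10**16, "垓": 10**20}
SEPARATORS = {",", "，"}


def kanji_to_int(text):
    """Convert a string of kanji numerals (units up to 垓) to an int."""
    total = 0       # sum of groups already closed by a large unit
    section = 0     # value accumulated below the next large unit
    current = None  # digits still waiting for a unit
    for ch in text:
        if ch in SEPARATORS:
            continue
        if ch in DIGITS:
            current = DIGITS[ch] if current is None else current * 10 + DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (1 if current is None else current) * SMALL_UNITS[ch]
            current = None
        elif ch in LARGE_UNITS:
            section += current or 0
            # A bare large unit such as 万 is read as 1 x that unit (assumption).
            total += (section or 1) * LARGE_UNITS[ch]
            section, current = 0, None
        else:
            raise ValueError(f"not a kanji numeral: {ch!r}")
    return total + section + (current or 0)


print(kanji_to_int("千垓千一"))     # 1000 x 10**20 + 1001
print(kanji_to_int("二千,五百十"))  # 2000 + 500 + 10
```

In this sketch 千載一遇 raises ValueError because 載 is not in the unit table, which corresponds to the filter leaving that token unconverted.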

Re-splitting tokens¶

Tokens produced by the tokenizer can be split further with a token filter. With Sudachi, the sudachi_split filter does this, and its behavior depends on the mode option.

"search": Additional segmentation useful for search. (Use C and A mode)
Ex. 関西国際空港, 関西, 国際, 空港 / アバラカダブラ

"extended": Similar to search mode, but also unigram unknown words.
Ex. 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ

extended mode

In [90]:

payload = {
  "tokenizer": {
    "type": "sudachi_tokenizer",
    "split_mode": "C"
  },
  "filter": [
    {
      "type": "sudachi_split",
      "mode": "extended"
    },
  ],
  "text": ["アバラカダブラ","関西国際空港"]
} 

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[90]:

	token	start_offset	end_offset	type	position	positionLength
0	アバラカダブラ	0	7	word	0	7.0
1	ア	0	1	word	0	NaN
2	バ	1	2	word	1	NaN
3	ラ	2	3	word	2	NaN
4	カ	3	4	word	3	NaN
5	ダ	4	5	word	4	NaN
6	ブ	5	6	word	5	NaN
7	ラ	6	7	word	6	NaN
8	関西国際空港	8	14	word	107	3.0
9	関西	8	10	word	107	NaN
10	国際	10	12	word	108	NaN
11	空港	12	14	word	109	NaN

search mode

In [91]:

payload = {
  "tokenizer": {
    "type": "sudachi_tokenizer",
    "split_mode": "C"
  },
  "filter": [
    {
      "type": "sudachi_split",
      "mode": "search"
    },
  ],
  "text": ["アバラカダブラ","関西国際空港"]
} 

response = opensearch_client.indices.analyze(
  body=payload
)

pd.json_normalize(response["tokens"])

Out[91]:

	token	start_offset	end_offset	type	position	positionLength
0	アバラカダブラ	0	7	word	0	NaN
1	関西国際空港	8	14	word	101	3.0
2	関西	8	10	word	101	NaN
3	国際	10	12	word	102	NaN
4	空港	12	14	word	103	NaN
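The difference between the two modes can be summarized with a toy model. The sketch below is not how Sudachi works internally: the `dictionary` argument is a hypothetical stand-in for the system dictionary, mapping each known word to its shorter units, and any word missing from it is treated as unknown.

```python
def split_token(token, dictionary, mode="search"):
    """Return the token followed by any sub-tokens the given mode would add."""
    if mode not in ("search", "extended"):
        raise ValueError(f"unsupported mode: {mode}")
    if token in dictionary:
        # Known word: both modes add its shorter dictionary units, if any.
        return [token] + list(dictionary[token])
    if mode == "extended" and len(token) > 1:
        # Unknown word: extended mode also emits one token per character.
        return [token] + list(token)
    return [token]


# Hypothetical dictionary content, chosen to match the examples above.
toy_dictionary = {"関西国際空港": ["関西", "国際", "空港"], "空港": []}

for mode in ("search", "extended"):
    for word in ("アバラカダブラ", "関西国際空港"):
        print(mode, split_token(word, toy_dictionary, mode))
```

Extended mode makes unknown words partially searchable at the cost of many single-character tokens, so it tends to raise recall and lower precision compared with search mode.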

Running Japanese searches¶

Load data into a sample index and run several Japanese searches.

Preparing the sample data¶

In [98]:

%%time
dataset_dir = "./dataset/jsquad"
%mkdir -p $dataset_dir
!curl -L -s -o $dataset_dir/valid.json https://github.com/yahoojapan/JGLUE/raw/main/datasets/jsquad-v1.3/valid-v1.3.json 
!curl -L -s -o $dataset_dir/train.json https://github.com/yahoojapan/JGLUE/raw/main/datasets/jsquad-v1.3/train-v1.3.json 

CPU times: user 36.1 ms, sys: 13.6 ms, total: 49.6 ms
Wall time: 2.83 s

In [99]:

%%time
import pandas as pd
import json

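def squad_json_to_dataframe(input_file_path, record_path=["data", "paragraphs", "qas", "answers"]):
    # Read the SQuAD-format JSON file and close the file handle afterwards
    with open(input_file_path, encoding="utf-8") as f:
        file = json.load(f)
    m = pd.json_normalize(file, record_path[:-1])  # one row per question
    r = pd.json_normalize(file, record_path[:-2])  # one row per paragraph

    # Repeat each paragraph's context once for every question in that paragraph
    idx = np.repeat(r["context"].values, r.qas.str.len())
    m["context"] = idx
    # Keep only the unique answer texts for each question
    m["answers"] = m["answers"].apply(lambda x: np.unique(pd.json_normalize(x)["text"].to_list()))
    return m[["id", "question", "context", "answers"]]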

valid_filename = f"{dataset_dir}/valid.json"
valid_df = squad_json_to_dataframe(valid_filename)

train_filename = f"{dataset_dir}/train.json"
train_df = squad_json_to_dataframe(train_filename)

CPU times: user 18.6 s, sys: 597 ms, total: 19.2 s
Wall time: 18.7 s
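The same flattening can be expressed with plain loops, which may make the nesting of a SQuAD-format file easier to follow. This is a standard-library sketch for illustration: `iter_squad_rows` is a hypothetical name, and the sample record is made up rather than taken from JSQuAD.

```python
def iter_squad_rows(squad):
    """Yield one flat record per question from a SQuAD-format dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    # The context belongs to the paragraph, so it repeats per question.
                    "context": paragraph["context"],
                    # Unique answer texts, sorted for a stable order.
                    "answers": sorted({answer["text"] for answer in qa["answers"]}),
                }


# Made-up sample in the same shape as the dataset files.
sample = {
    "data": [
        {
            "title": "sample",
            "paragraphs": [
                {
                    "context": "sample [SEP] sample context",
                    "qas": [
                        {
                            "id": "q0",
                            "question": "sample question?",
                            "answers": [
                                {"text": "b", "answer_start": 0},
                                {"text": "a", "answer_start": 1},
                                {"text": "b", "answer_start": 0},
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}

print(list(iter_squad_rows(sample)))
```

The pandas version in the cell above is more compact and returns a DataFrame directly; the loop version trades that for readability and has no dependencies.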

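Inspecting the sample data¶

The sample data is a Japanese question-answering dataset. Each record consists of the question field (the question text), the answers field (the answers), the context field (the explanatory passage), and the id field (the question ID).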

In [100]:

valid_df

Out[100]:

	id	question	context	answers
0	a10336p0q0	日本で梅雨がないのは北海道とどこか。	梅雨 [SEP] 梅雨（つゆ、ばいう）は、北海道と小笠原諸島を除く日本、朝鮮半島南部、中国の...	[小笠原諸島, 小笠原諸島を除く日本]
1	a10336p0q1	梅雨とは何季の一種か?	梅雨 [SEP] 梅雨（つゆ、ばいう）は、北海道と小笠原諸島を除く日本、朝鮮半島南部、中国の...	[雨季]
2	a10336p0q2	梅雨は、世界的にどのあたりで見られる気象ですか？	梅雨 [SEP] 梅雨（つゆ、ばいう）は、北海道と小笠原諸島を除く日本、朝鮮半島南部、中国の...	[東アジア, 東アジアの広範囲]
3	a10336p0q3	梅雨がみられるのはどの期間？	梅雨 [SEP] 梅雨（つゆ、ばいう）は、北海道と小笠原諸島を除く日本、朝鮮半島南部、中国の...	[5月から7月, 5月から7月にかけて]
4	a10336p1q0	入梅は何の目安の時期か？	梅雨 [SEP] 梅雨の時期が始まることを梅雨入りや入梅（にゅうばい）といい、社会通念上・気...	[春の終わりであるとともに夏の始まり（初夏）, 田植えの時期, 田植えの時期の目安]
...	...	...	...	...
4437	a95156p5q3	国際銀行間通信協会ならびに国際決済機関の何と何も企業体である	多国籍企業 [SEP] 国際銀行間通信協会ならびに国際決済機関のクリアストリームとユーロクリ...	[クリアストリームとユーロクリア]
4438	a95156p6q0	ゼネコンはどの国特有の形態か	多国籍企業 [SEP] ゼネコンは日本特有の形態。セメントメジャーにラファージュホルシムやイ...	[日本]
4439	a95156p6q1	多国籍企業においてゼネコンはどこの国特有の形態であるか？	多国籍企業 [SEP] ゼネコンは日本特有の形態。セメントメジャーにラファージュホルシムやイ...	[日本]
4440	a95156p6q2	多国籍企業を一つ挙げよ	多国籍企業 [SEP] ゼネコンは日本特有の形態。セメントメジャーにラファージュホルシムやイ...	[イタルチェメンティ, ラファージュホルシム]
4441	a95156p6q3	ゼネコンはどの国の特有の形態か？	多国籍企業 [SEP] ゼネコンは日本特有の形態。セメントメジャーにラファージュホルシムやイ...	[日本]

4442 rows × 4 columns

In [101]:

train_df

Out[101]:

	id	question	context	answers
0	a1000888p0q0	新たに語（単語）を造ることや、既存の語を組み合わせて新たな意味の語を造ること	造語 [SEP] 造語（ぞうご）は、新たに語（単語）を造ることや、既存の語を組み合わせて新た...	[造語]
1	a1000888p0q1	新たに造られた語のことを新語または何という？	造語 [SEP] 造語（ぞうご）は、新たに語（単語）を造ることや、既存の語を組み合わせて新た...	[新造語]
2	a1000888p0q2	たに語（単語）を造ることや、既存の語を組み合わせて新たな意味の語を造ること、また、そうして造...	造語 [SEP] 造語（ぞうご）は、新たに語（単語）を造ることや、既存の語を組み合わせて新た...	[造語]
3	a1000888p0q3	新たに語を造ることや、既存の語を組み合わせて新たな意味の語を造ることを何という？	造語 [SEP] 造語（ぞうご）は、新たに語（単語）を造ることや、既存の語を組み合わせて新た...	[造語]
4	a1000888p0q4	既存の語を組み合わせたりして新しく単語を造ることを何と言う？	造語 [SEP] 造語（ぞうご）は、新たに語（単語）を造ることや、既存の語を組み合わせて新た...	[造語]
...	...	...	...	...
62692	a99943p9q0	ストラングラーズは、どんな車で各地を回っていたか？	パンク・ロック [SEP] 他に、ザ・ジャムがネオ・モッズ・ムーブメントを巻き起こし、UKチ...	[アイスクリーム販売用のバン]
62693	a99943p9q1	ザ・ジャムが解散したのはいつか？	パンク・ロック [SEP] 他に、ザ・ジャムがネオ・モッズ・ムーブメントを巻き起こし、UKチ...	[1982年]
62694	a99943p9q2	ストラングラーズは、イギリス国内を何で移動してライヴを行った？	パンク・ロック [SEP] 他に、ザ・ジャムがネオ・モッズ・ムーブメントを巻き起こし、UKチ...	[アイスクリーム販売用のバン]
62695	a99943p9q3	ザ・ジャムが解散したのは何年か。	パンク・ロック [SEP] 他に、ザ・ジャムがネオ・モッズ・ムーブメントを巻き起こし、UKチ...	[1982年]
62696	a99943p9q4	アイスクリーム販売用のバンで移動しながらライブを行ったバンドは何か。	パンク・ロック [SEP] 他に、ザ・ジャムがネオ・モッズ・ムーブメントを巻き起こし、UKチ...	[ストラングラーズ]

62697 rows × 4 columns

Creating the index¶

In [102]:

index_name = "jsquad-sudachi"

payload = {
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "question": {"type": "text", "analyzer": "custom_sudachi_analyzer"},
      "context":  {"type": "text", "analyzer": "custom_sudachi_analyzer"},
      "answers":  {"type": "text", "analyzer": "custom_sudachi_analyzer"}
    }
  },
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0,
    "index.refresh_interval": -1,
    "analysis": {
      "analyzer": {
        "custom_sudachi_analyzer": {
          "char_filter": ["icu_normalizer"],
          "filter": [
              "sudachi_normalizedform",
              "custom_sudachi_part_of_speech"
          ],
          "tokenizer": "sudachi_tokenizer",
          "type": "custom"
        }
      },
      "filter": {
        "custom_sudachi_part_of_speech": {
          "type": "sudachi_part_of_speech",
          "stoptags": ["感動詞,フィラー","接頭辞","代名詞","副詞","助詞","助動詞","動詞,一般,*,*,*,終止形-一般","名詞,普通名詞,副詞可能"]
        }
      }
    }
  }
}

try:
    # If an index with the same name already exists, delete it first
    print("# delete index")
    response = opensearch_client.indices.delete(index=index_name)
    print(json.dumps(response, indent=2))
except Exception as e:
    print(e)

# Create the index
print("# create index")
response = opensearch_client.indices.create(index=index_name, body=payload)
print(json.dumps(response, indent=2))

# delete index
NotFoundError(404, 'index_not_found_exception', 'no such index [jsquad-sudachi]', jsquad-sudachi, index_or_alias)
# create index
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "jsquad-sudachi"
}

Loading documents¶

Load the documents. As explained in the lab "Understanding the basic concepts and operations of OpenSearch", the bulk API loads documents efficiently, but a data processing framework can make ingestion even simpler. This workshop uses AWS SDK for Pandas to ingest the data.

In [103]:

%%time
index_name = "jsquad-sudachi"

response = wr.opensearch.index_df(
    client=opensearch_client,
    df=pd.concat([train_df, valid_df]),
    use_threads=True,
    id_keys=["id"],
    index=index_name,
    bulk_size=1000,
    refresh=False
)

CPU times: user 8.82 s, sys: 202 ms, total: 9.02 s
Wall time: 25.9 s

Check that the value of response["success"] matches the number of rows in the DataFrame. If True is displayed, all documents were indexed successfully.

In [104]:

response["success"] == pd.concat([train_df, valid_df]).id.count()

Out[104]:

True
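For comparison, the bulk API mentioned earlier expects a newline-delimited JSON body in which each document is preceded by an action line. The sketch below builds such a body with the standard library only. `to_bulk_body` is a hypothetical helper and the sample row is made up; in this lab index_df takes care of the ingestion.

```python
import json


def to_bulk_body(rows, index_name, id_key="id"):
    """Build an NDJSON bulk body: one action line and one source line per row."""
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": row[id_key]}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(row, ensure_ascii=False))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Made-up row for illustration.
rows = [{"id": "q0", "question": "sample question?", "answers": ["a", "b"]}]
print(to_bulk_body(rows, "jsquad-sudachi"))
```

A body like this could be passed to the client's bulk method in batches. Using the document id as `_id` makes reloading the same data overwrite documents instead of duplicating them, which is also why the cell above passes id_keys=["id"].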

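This lab intentionally disables refresh during data loading (the index is created with refresh_interval set to -1 and index_df is called with refresh=False), so run the Refresh API to make sure that the indexed documents are searchable. The cell below also calls the Force merge API to merge the index segments.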

In [105]:

index_name = "jsquad-sudachi"
response = opensearch_client.indices.refresh(index=index_name)
response = opensearch_client.indices.forcemerge(index=index_name)

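Searching documents¶

Search with シュミレーション, a common misspelling of シミュレーション (simulation), and confirm that the results are returned with the spelling variant corrected.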

In [106]:

index_name = "jsquad-sudachi"
query = "シュミレーション 言語"

payload = {
  "query": {
    "match": {
      "question": {
        "query": query,
        "operator": "and"
      }
    }
  }
}
response = opensearch_client.search(
    index=index_name,
    body=payload
)

pd.json_normalize(response["hits"]["hits"]) 

Out[106]:

	_index	_id	_score	_source.id	_source.question	_source.context	_source.answers
0	jsquad-sudachi	a30060p2q2	16.041592	a30060p2q2	シミュレーション言語のプロジェクトを開始した人物は？	オブジェクト指向プログラミング [SEP] 1962年、クリステン・ニゴールはでシミュレーシ...	[クリステン・ニゴール]
1	jsquad-sudachi	a30060p2q3	16.041592	a30060p2q3	クリステン・ニゴールはでシミュレーション言語のプロジェクトを開始	オブジェクト指向プログラミング [SEP] 1962年、クリステン・ニゴールはでシミュレーシ...	[オブジェクト指向プログラミング]
2	jsquad-sudachi	a30060p2q0	15.179986	a30060p2q0	1962年、シミュレーション言語のプロジェクトを開始したのは誰？	オブジェクト指向プログラミング [SEP] 1962年、クリステン・ニゴールはでシミュレーシ...	[クリステン・ニゴール]

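Interactive search¶

For the rest of the available time, feel free to run your own search queries.

Edit the contents of the query text box to change the search query.
Toggle the question, context, and answers checkboxes to include or exclude each field from the search.
The highlight.* columns show exactly where each hit matched.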

In [108]:

def search(index_name, query, question, context, answers):
    fields = []
    if question:
        fields.append("question")
    if context:
        fields.append("context")
    if answers:
        fields.append("answers")
    payload = {
      "query": {
        "multi_match": {
          "query": query,
          "fields": fields,
          "operator": "and"
        }
      },
      "highlight": {
        "fields": {
          "*" : {}
        }
      },
      "_source": False,
      "fields": fields
    }
    response = opensearch_client.search(
        index=index_name,
        body=payload
    )
    return pd.json_normalize(response["hits"]["hits"])

index_name = "jsquad-sudachi"
query = "シュミレーション 言語"

# Text boxes and checkboxes (rendered by interact)
interact(search, index_name=index_name, query=query, question=True, context=True, answers=True)

interactive(children=(Text(value='jsquad-sudachi', description='index_name'), Text(value='シュミレーション 言語', descri…

Out[108]:

<function __main__.search(index_name, query, question, context, answers)>
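When every checkbox is turned off, the fields list passed to multi_match is empty, and how the server treats that case depends on its default-field settings. One option is to build the request body in a separate function that rejects an empty selection. This is a sketch only: `build_search_payload` is a hypothetical helper that is not part of the lab, and raising an error is just one of several reasonable choices.

```python
def build_search_payload(query, question=True, context=True, answers=True):
    """Build the multi_match request body for the selected fields."""
    selected = {"question": question, "context": context, "answers": answers}
    fields = [name for name, enabled in selected.items() if enabled]
    if not fields:
        # Fail early instead of relying on server-side defaults.
        raise ValueError("select at least one field to search")
    return {
        "query": {
            "multi_match": {"query": query, "fields": fields, "operator": "and"}
        },
        "highlight": {"fields": {"*": {}}},
        "_source": False,
        "fields": fields,
    }


print(build_search_payload("シュミレーション 言語", context=False))
```

Keeping the payload construction separate from the client call also makes it possible to check the query without a running cluster.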

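Summary¶

In this lab, you learned about Japanese search in OpenSearch. As a next step, build on what you learned here by trying the following labs.

For those who want to learn about improving Japanese search accuracy¶

Improving Japanese search accuracy by customizing the Kuromoji user dictionary

For those who want to learn other search techniques such as vector search¶

Implementing vector search (Amazon SageMaker edition)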

Cleanup¶

Deleting the index¶

Delete the index used in this workshop with the Delete index API. Deleting an index also deletes the documents in it.

In [109]:

index_name = "jsquad-sudachi"

try:
    response = opensearch_client.indices.delete(index=index_name)
    print(json.dumps(response, indent=2))
except Exception as e:
    print(e)

{
  "acknowledged": true
}

Deleting the dataset¶

Delete the downloaded dataset. If nothing remains under the ./dataset directory, delete the ./dataset directory as well.

In [110]:

%rm -rf {dataset_dir}

In [111]:

%rmdir ./dataset