Use Elasticsearch machine learning to identify and filter out gibberish text

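When you analyze text from social media, forums, or other online communication, you may encounter content that is unclear, illogical, or gibberish. Such content lowers the accuracy of data analysis and, in turn, the quality of decisions based on that data. This topic describes how to use a natural language processing (NLP) model to identify and filter out gibberish text in an Elasticsearch cluster.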

Preparations

Upload the model

This example uses the text classification model madhurjindal/autonlp-Gibberish-Detector-492513457 from the Hugging Face library.

Access to Hugging Face over networks in the Chinese mainland is slow, so this example uploads the model to Elasticsearch in offline mode.

Click madhurjindal--autonlp-gibberish-detector-492513457.tar.gz to download the model.
Upload the model to an Elastic Compute Service (ECS) instance.
- Create a folder in the root directory (/) of the ECS instance and upload the model to that folder. Do not upload the model to the /root/ directory of the ECS instance. In this example, a folder named model is created.
- The model file is large. We recommend that you use WinSCP to upload it. For more information, see "Use WinSCP to upload or download a file (on-premises host running Windows)".
Run the following commands on the ECS instance to decompress the model in the model folder:
```
cd /model/
tar -xzvf madhurjindal--autonlp-gibberish-detector-492513457.tar.gz
cd
```
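
The upload command takes the path of a snapshots/<hash> directory inside the extracted archive. The following is a minimal sketch for locating that directory, assuming the archive unpacks into the Hugging Face cache layout (models--<org>--<name>/snapshots/<hash>) that the path in this example suggests. The helper name find_snapshot_dirs is hypothetical and not part of any library.

```python
from pathlib import Path
from typing import List


def find_snapshot_dirs(extract_root: str, repo_dir_name: str) -> List[Path]:
    """Return the snapshot directories of one model under an extracted cache tree."""
    root = Path(extract_root)
    snapshots: List[Path] = []
    # rglob tolerates any number of leading directories, such as root/.cache/huggingface/hub.
    for repo in root.rglob(repo_dir_name):
        snapshot_parent = repo / "snapshots"
        if snapshot_parent.is_dir():
            snapshots.extend(p for p in sorted(snapshot_parent.iterdir()) if p.is_dir())
    return snapshots
```

For example, find_snapshot_dirs("/model", "models--madhurjindal--autonlp-Gibberish-Detector-492513457") would list the candidate directories to pass to --hub-model-id.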

Run the following command on the ECS instance to upload the model to the Elasticsearch cluster:

eland_import_hub_model \
  --url 'http://es-cn-lbj3l7erv0009****.elasticsearch.aliyuncs.com:9200' \
  --hub-model-id '/model/root/.cache/huggingface/hub/models--madhurjindal--autonlp-Gibberish-Detector-492513457/snapshots/c068f552cdee957e45d8773db9f7158d43902244' \
  --task-type text_classification \
  --es-username elastic \
  --es-password '****' \
  --es-model-id models--madhurjindal--autonlp-gibberish-detector
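
After the upload finishes, you may want to confirm that the model is registered in the cluster. The following is a minimal sketch, assuming an elasticsearch-py 8.x client and the usual shape of the get trained models response (a trained_model_configs list whose entries carry a model_id). The helper name model_is_uploaded is hypothetical.

```python
from typing import Any, Mapping

# Model ID passed to --es-model-id in the upload command.
MODEL_ID = "models--madhurjindal--autonlp-gibberish-detector"


def model_is_uploaded(response: Mapping[str, Any], model_id: str = MODEL_ID) -> bool:
    """Return True if model_id appears in a get-trained-models response body."""
    configs = response.get("trained_model_configs", [])
    return any(config.get("model_id") == model_id for config in configs)


# Usage against a live cluster (assumes a client named es already exists):
#   response = es.ml.get_trained_models(model_id="models--madhurjindal*")
#   print(model_is_uploaded(response.body))
```

The wildcard pattern is used so that a missing model yields an empty list rather than a not-found error.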

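Deploy the model

Log on to the Kibana console of the Elasticsearch cluster. For more information, see "Log on to the Kibana console".
Click the icon in the upper-left corner of the Kibana console. In the left-side navigation pane, choose [Analytics] > [Machine Learning].
In the left-side navigation pane of the page that appears, choose [Model Management] > [Trained Models].
Optional. At the top of the [Trained Models] page, click [Synchronize your jobs and trained models]. In the panel that appears, click [Synchronize].
On the [Trained Models] page, find the uploaded model and click the icon in the [Actions] column to start the model.
In the dialog box that appears, configure the model and click [Start].
If the message [Model started] appears in the lower-right corner of the page, the model is deployed.
Note
If the model cannot be started, the Elasticsearch cluster may not have enough memory. Upgrade the configuration of the cluster, and then start the model again. To view the cause of the failure, click [View full error message] in the error dialog box.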
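
The deployment can also be started from a script instead of the Kibana console. The following is a minimal sketch, assuming an elasticsearch-py 8.x client, whose ml.start_trained_model_deployment method accepts model_id and wait_for. The wrapper name start_model is hypothetical, and this is not necessarily how the console itself starts the model.

```python
from typing import Any

# Model ID passed to --es-model-id in the upload command.
MODEL_ID = "models--madhurjindal--autonlp-gibberish-detector"


def start_model(client: Any, model_id: str = MODEL_ID, wait_for: str = "started") -> Any:
    """Start a trained model deployment and wait until it reaches the wait_for state."""
    return client.ml.start_trained_model_deployment(model_id=model_id, wait_for=wait_for)


# Usage against a live cluster (assumes a client named es already exists):
#   start_model(es)
```

If the cluster does not have enough memory, the call is expected to raise an API error, and its message can help you diagnose the failure.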

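Test the model

On the [Trained Models] page, find the deployed model and click the Test model icon in the [Actions] column.
In the [Test trained model] panel, test the model and check whether the output is as expected.
Output description:
- word salad: gibberish, or disordered terms that cannot be understood. This label is used to detect gibberish text. The higher its score, the more likely the text is gibberish.
  In the sample test, word salad receives the highest score, which indicates that the test text is very likely gibberish.
- clean: normal text
- mild gibberish: probably gibberish
- noise: gibberish
  In the sample test, noise receives the highest score, which indicates that the input text is very likely gibberish.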
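
The four labels can be folded into a single decision in application code. The following is a minimal sketch, assuming the label strings match the ones listed above and that 0.5 is an acceptable confidence cutoff. Both the function name interpret_label and the cutoff are hypothetical choices, not values defined by the model.

```python
GIBBERISH_LABELS = {"word salad", "noise"}
SUSPECT_LABELS = {"mild gibberish"}
CLEAN_LABELS = {"clean"}


def interpret_label(predicted_value: str, probability: float, threshold: float = 0.5) -> str:
    """Map a predicted label and its probability to a coarse verdict."""
    label = predicted_value.strip().lower()
    if probability < threshold:
        # The model is not confident enough in its top label.
        return "uncertain"
    if label in GIBBERISH_LABELS:
        return "gibberish"
    if label in SUSPECT_LABELS:
        return "suspect"
    if label in CLEAN_LABELS:
        return "clean"
    return "unknown"
```

Lowering the threshold flags more text as gibberish at the cost of more false positives.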

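Identify gibberish text on the [Dev Tools] page of the Kibana console

Click the icon in the upper-left corner of the Kibana console. In the left-side navigation pane, choose [Management] > [Dev Tools].

On the [Console] tab of the [Dev Tools] page, run the following code.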

1. Create an index.
PUT /gibberish_index
{
  "mappings": {
    "properties": {
      "text_field": { "type": "text" }
    }
  }
}

2. Add data.
POST /gibberish_index/_doc/1
{
  "text_field": "how are you"
}

# Gibberish sample
POST /gibberish_index/_doc/2
{
  "text_field": "sdfgsdfg wertwert"
}

POST /gibberish_index/_doc/3
{
  "text_field": "I am not sure this makes sense"
}

# Gibberish sample
POST /gibberish_index/_doc/4
{
  "text_field": "䧀䳀䇀䛀䧀䳀痀糀䧀䳀䇀䛀䧀䳀䇀䛀䧀䳀"
}

POST /gibberish_index/_doc/5
{
  "text_field": "The test fields."
}

# Gibberish sample
POST /gibberish_index/_doc/6
{
  "text_field": "䇀䛀䧀䳀痀糀䧀䳀䇀䛀䧀䳀䇀䛀䧀"
}

3. Create an ingest pipeline.
Fields of the inference processor:
model_id: the ID of the machine learning model used for inference.
target_field: the field that stores the inference results.
field_map.text_field: maps the input field in the document to the field that the model expects.

PUT /_ingest/pipeline/gibberish_detection_pipeline
{
  "description": "A pipeline to detect gibberish text",
  "processors": [
    {
      "inference": {
        "model_id": "models--madhurjindal--autonlp-gibberish-detector",
        "target_field": "inference_results",
        "field_map": {
          "text_field": "text_field"
        }
      }
    }
  ]
}
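
Before the pipeline is applied to an index, it can be tried on a few sample texts with the simulate pipeline API. The following is a minimal sketch, assuming an elasticsearch-py 8.x client and the usual simulate response shape (a docs list whose entries contain doc._source). The helper names build_simulate_docs and extract_predictions are hypothetical.

```python
from typing import Any, Dict, List, Mapping, Optional, Tuple


def build_simulate_docs(texts: List[str]) -> List[Dict[str, Any]]:
    """Wrap raw strings as simulate-API documents with the text_field input field."""
    return [{"_source": {"text_field": text}} for text in texts]


def extract_predictions(response: Mapping[str, Any]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (text, predicted label) pairs; entries that failed yield (None, None)."""
    rows = []
    for item in response.get("docs", []):
        source = item.get("doc", {}).get("_source", {})
        result = source.get("inference_results", {})
        rows.append((source.get("text_field"), result.get("predicted_value")))
    return rows


# Usage against a live cluster (assumes a client named es already exists):
#   response = es.ingest.simulate(
#       id="gibberish_detection_pipeline",
#       docs=build_simulate_docs(["how are you", "sdfgsdfg wertwert"]),
#   )
#   print(extract_predictions(response.body))
```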

4. Use the pipeline to update the documents in the index.
POST /gibberish_index/_update_by_query?pipeline=gibberish_detection_pipeline
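
The update-by-query response reports how many documents were processed and whether any failed, which is worth checking before you search for inference results. The following is a minimal sketch, assuming the standard response fields updated, version_conflicts, and failures. The helper name check_update_response is hypothetical.

```python
from typing import Any, Mapping, Optional


def check_update_response(response: Mapping[str, Any], expected: Optional[int] = None) -> int:
    """Return the number of updated documents, raising if the run was incomplete."""
    failures = response.get("failures", [])
    if failures:
        raise RuntimeError(f"{len(failures)} document(s) failed: {failures[0]}")
    conflicts = response.get("version_conflicts", 0)
    if conflicts:
        raise RuntimeError(f"{conflicts} version conflict(s) occurred")
    updated = response.get("updated", 0)
    if expected is not None and updated != expected:
        raise RuntimeError(f"expected {expected} updated documents, got {updated}")
    return updated
```

With the six sample documents in this example, check_update_response(response, expected=6) should pass only when every document went through the pipeline.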

5. Search for documents that contain inference results.
GET /gibberish_index/_search
{
  "query": {
    "exists": {
      "field": "inference_results"
    }
  }
}

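6. Run an exact match query that returns documents meeting both of the following conditions:
The value of the inference_results.predicted_value.keyword field is "word salad".
The value of the inference_results.prediction_probability field is greater than or equal to 0.1.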

GET /gibberish_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "inference_results.predicted_value.keyword": "word salad"
          }
        },
        {
          "range": {
            "inference_results.prediction_probability": {
              "gte": 0.1
            }
          }
        }
      ]
    }
  }
}

The exact match query returns the following two records. In both records, word salad receives the highest score, which indicates that both records are very likely gibberish.

{ /* Sample response */
  // ... (omitted)
}
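
The query above identifies gibberish. To filter it out instead, the same conditions can be placed in a must_not clause. The following is a minimal sketch that builds such a request body in Python, assuming that word salad and noise are the labels to exclude and that 0.1 is kept as the probability threshold. The helper name build_clean_text_query is hypothetical.

```python
from typing import Any, Dict, Sequence

LABEL_FIELD = "inference_results.predicted_value.keyword"
PROBABILITY_FIELD = "inference_results.prediction_probability"


def build_clean_text_query(
    gibberish_labels: Sequence[str] = ("word salad", "noise"),
    min_probability: float = 0.1,
) -> Dict[str, Any]:
    """Build a search body that excludes documents flagged as gibberish."""
    flagged = {
        "bool": {
            "must": [
                {"terms": {LABEL_FIELD: list(gibberish_labels)}},
                {"range": {PROBABILITY_FIELD: {"gte": min_probability}}},
            ]
        }
    }
    return {
        "query": {
            "bool": {
                # Only consider documents that the pipeline has already processed.
                "filter": [{"exists": {"field": "inference_results"}}],
                "must_not": [flagged],
            }
        }
    }
```

The returned dictionary can be pasted into the console as JSON or passed to the search method of a Python client. The label strings must match the values stored in the index exactly, because keyword fields are case-sensitive.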

Identify gibberish text by using a Python script

You can also use a Python script to identify gibberish text. Run the python3 command on the ECS instance to start the Python interpreter, and then run the following code.

from elasticsearch import Elasticsearch

es_username = 'elastic'
es_password = '****'

# Create an Elasticsearch client instance by using the basic_auth parameter.
es = Elasticsearch(
    "http://es-cn-lbj3l7erv0009****.elasticsearch.aliyuncs.com:9200",
    basic_auth=(es_username, es_password)
)

# Create an index and configure its mappings.
create_index_body = {
  "mappings": {
    "properties": {
      "text_field": { "type": "text" }
    }
  }
}
es.indices.create(index='gibberish_index2', body=create_index_body)

# Insert documents.
docs = [
    {"text_field": "how are you"},
    {"text_field": "sdfgsdfg wertwert"},  # gibberish sample
    {"text_field": "I am not sure this makes sense"},
    {"text_field": "䧀䳀䇀䛀䧀䳀痀糀䧀䳀䇀䛀䧀䳀䇀䛀䧀䳀"},  # gibberish sample
    {"text_field": "The test fields."},
    {"text_field": "䇀䛀䧀䳀痀糀䧀䳀䇀䛀䧀䳀䇀䛀䧀"}  # gibberish sample
]

for i, doc in enumerate(docs):
    es.index(index='gibberish_index2', id=i+1, body=doc)

# Refresh the index so that the new documents are visible to the update-by-query request.
es.indices.refresh(index='gibberish_index2')

# Create a pipeline that contains an inference processor.
pipeline_body = {
    "description": "A pipeline to detect gibberish text",
    "processors": [
      {
        "inference": {
          "model_id": "models--madhurjindal--autonlp-gibberish-detector",
          "target_field": "inference_results",
          "field_map": {
            "text_field": "text_field"
          }
        }
      }
    ]
}
es.ingest.put_pipeline(id='gibberish_detection_pipeline2', body=pipeline_body)

# Use the pipeline to update the existing documents. refresh=True makes the results searchable right away.
es.update_by_query(index='gibberish_index2', body={}, pipeline='gibberish_detection_pipeline2', refresh=True)

# Search for documents that contain inference results.
search_body = {
  "query": {
    "exists": {
      "field": "inference_results"
    }
  }
}
response = es.search(index='gibberish_index2', body=search_body)
print(response)

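# Run an exact match query that returns documents meeting both of the following conditions:
# 1. The value of the inference_results.predicted_value.keyword field is "word salad".
# 2. The value of the inference_results.prediction_probability field is greater than or equal to 0.1.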
search_query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "inference_results.predicted_value.keyword": "word salad"
                    }
                },
                {
                    "range": {
                        "inference_results.prediction_probability": {
                            "gte": 0.1
                        }
                    }
                }
            ]
        }
    }
}
response = es.search(index='gibberish_index2', body=search_query)
print(response)

{ /* Sample response */
	// ... (omitted)
}
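
Printing the raw response is hard to read once more than a few documents match. The following is a minimal sketch that reduces a search response to one row per hit, assuming the hits carry the text_field and inference_results fields used in this example. The helper name summarize_hits is hypothetical.

```python
from typing import Any, Dict, List, Mapping


def summarize_hits(response: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Return one row per hit, sorted by prediction probability in descending order."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        result = source.get("inference_results", {})
        rows.append({
            "id": hit.get("_id"),
            "text": source.get("text_field"),
            "label": result.get("predicted_value"),
            "probability": result.get("prediction_probability"),
        })
    # Hits without a probability sort last.
    rows.sort(key=lambda row: row["probability"] or 0.0, reverse=True)
    return rows


# Usage with the response from the script above:
#   for row in summarize_hits(response.body):
#       print(row)
```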

References

Access an Alibaba Cloud Elasticsearch cluster by using a client