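Build a multimodal search pipeline with Qwen-VL and Milvus

This topic describes how to integrate Vector Retrieval Service for Milvus (Milvus) with the Qwen-VL large vision-language model (LVLM) to build a multimodal search system. Qwen-VL extracts descriptions from images, and a multimodal embedding model converts images and text into vectors for efficient retrieval. The system supports text-to-image, text-to-text, image-to-image, and image-to-text search.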

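Background information

Multimodal search requires converting unstructured data, such as images and text, into vector representations. Vector retrieval then finds similar content quickly. This topic uses the following tools: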

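Vector Retrieval Service for Milvus: an efficient vector database that stores and retrieves vectors.
Qwen-VL: extracts a description and keywords from each image. For more information, see "Qwen-VL".
DashScope Embedding API: converts images and text into vectors. For more information, see "Multimodal embedding API details".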
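
To make the retrieval step concrete, the sketch below ranks a few toy vectors by Euclidean (L2) distance to a query. L2 is the metric that the indexes in this topic are configured with. The three-dimensional vectors, the file names, and the `rank_by_l2` helper are made up for illustration. Real embeddings in this topic have 1,024 dimensions, and Milvus ranks them through an index rather than a full scan.

```python
import math


def rank_by_l2(query, candidates):
    """Return (name, distance) pairs sorted from nearest to farthest."""
    scored = [(name, math.dist(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "lion.jpg": [0.1, 0.9, 0.2],
    "beach.jpg": [0.0, 0.2, 0.9],
}
ranking = rank_by_l2([1.0, 0.0, 0.0], toy_vectors)
print(ranking[0][0])  # nearest item: dog.jpg
```

A smaller distance means a closer match, which is why the search results later in this topic are listed in ascending order of distance.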

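The system supports the following search modes:

Text-to-image search: enter a text query to find the most similar images.
Text-to-text search: enter a text query to find the most similar image descriptions.
Image-to-image search: enter an image query to find the most similar images.
Image-to-text search: enter an image query to find the most similar image descriptions.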
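
The four modes differ only in the modality of the query and in the vector field that is searched. A minimal sketch of that mapping follows. The field names `image_embedding` and `text_embedding` come from the collection schema defined later in this topic, while the mode names and the `plan_search` helper are hypothetical.

```python
# Query modality and the vector field searched for each mode.
SEARCH_MODES = {
    "text-to-image": ("text", "image_embedding"),
    "text-to-text": ("text", "text_embedding"),
    "image-to-image": ("image", "image_embedding"),
    "image-to-text": ("image", "text_embedding"),
}


def plan_search(mode):
    """Return (query_type, anns_field) for a search mode name."""
    try:
        return SEARCH_MODES[mode]
    except KeyError:
        raise ValueError(f"Unknown search mode: {mode!r}") from None


query_type, anns_field = plan_search("text-to-image")
print(query_type, anns_field)  # text image_embedding
```

Because one embedding model produces both text and image vectors, a single query embedding can be searched against either field.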

System architecture

The following figure shows the overall architecture of the multimodal search system used in this topic.

Prerequisites

Create a Milvus instance. For more information, see "Quickly create a Milvus instance".
Activate Alibaba Cloud Model Studio and obtain an API key. For more information, see "Obtain an API key".
Install the required dependency packages.
```
pip3 install dashscope pymilvus==2.5.0 pandas tqdm
```
The examples in this topic run in a Python 3.9 environment.
Download and decompress the sample dataset.
```
wget https://github.com/milvus-io/pymilvus-assets/releases/download/imagedata/reverse_image_search.zip
unzip -q -o reverse_image_search.zip
```
The sample dataset contains a CSV file named reverse_image_search.csv and a set of image files.
Note
This topic uses the sample dataset and images from the open source Milvus project.
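
The code later in this topic reads the `path` column of this CSV file with pandas. As a quick way to inspect the file without pandas, the sketch below reads the same column with the standard `csv` module. The `first_image_paths` helper is hypothetical, and the demo runs on a tiny stand-in file whose `label` column is an assumption about the real file's layout.

```python
import csv
import os
import tempfile


def first_image_paths(csv_path, limit=200):
    """Read up to `limit` values from the `path` column of the CSV file."""
    paths = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if len(paths) >= limit:
                break
            paths.append(row["path"])
    return paths


# Demo on a tiny stand-in file; the `label` column is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = os.path.join(tmp, "demo.csv")
    with open(demo_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "label"])
        writer.writerow(["./train/lion/a.JPEG", "lion"])
        writer.writerow(["./train/mongoose/b.JPEG", "mongoose"])
    demo_paths = first_image_paths(demo_csv, limit=1)

print(demo_paths)  # ['./train/lion/a.JPEG']
```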

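Core code overview

In the example in this topic, the Qwen-VL model first extracts a description of each image and stores it in the image_description field. The multimodal embedding model then converts the image and its description into vectors named image_embedding and text_embedding. Storing both vectors for each image is what enables cross-modal retrieval and analysis.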
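
The flow described above can be sketched independently of any API. In the sketch below, `build_record` is a hypothetical helper that receives the description and embedding steps as callables, and the stand-in functions return fixed values so that the code runs without network access. The dictionary keys match the field names used in this topic.

```python
def build_record(image_path, describe, embed):
    """Build one row for insertion from an image path.

    `describe(path)` returns a text description of the image.
    `embed(data, kind)` returns a vector for kind "image" or "text".
    """
    description = describe(image_path)
    return {
        "origin": image_path,
        "image_description": description,
        "image_embedding": embed(image_path, "image"),
        "text_embedding": embed(description, "text"),
    }


# Stand-in callables so the sketch runs without any API access.
def fake_describe(path):
    return "A lion resting in grass. Keywords: lion, grass"


def fake_embed(data, kind):
    return [float(len(data)), 1.0 if kind == "image" else 0.0]


record = build_record("./train/lion/a.JPEG", fake_describe, fake_embed)
print(sorted(record))
```

Note that the text embedding is computed from the generated description, not from the image, so the two vectors capture different views of the same picture.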

To keep things simple, the following code processes only the first 200 images.

import base64
import csv
import dashscope
import os
import pandas as pd
import sys
import time
from tqdm import tqdm
from pymilvus import (
    connections,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    MilvusException,
    utility,
)

from http import HTTPStatus
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class FeatureExtractor:
    def __init__(self, DASHSCOPE_API_KEY):
        self._api_key = DASHSCOPE_API_KEY  # Keep the API key on the instance for later calls

    def __call__(self, input_data, input_type):
        if input_type not in ("image", "text"):
            raise ValueError("Invalid input type. Must be 'image' or 'text'.")

        try:
            if input_type == "image":
                _, ext = os.path.splitext(input_data)
                image_format = ext.lstrip(".").lower()
                with open(input_data, "rb") as image_file:
                    base64_image = base64.b64encode(image_file.read()).decode("utf-8")
                input_data = f"data:image/{image_format};base64,{base64_image}"
                payload = [{"image": input_data}]
            else:
                payload = [{"text": input_data}]

            resp = dashscope.MultiModalEmbedding.call(
                model="multimodal-embedding-v1",
                input=payload,
                api_key=self._api_key,
            )

            if resp.status_code == HTTPStatus.OK:
                return resp.output["embeddings"][0]["embedding"]
            else:
                raise RuntimeError(
                    f"API call failed. Status code: {resp.status_code}, Error message: {resp.message}"
                )
        except Exception as e:
            logger.error(f"Processing failed: {str(e)}")
            raise


class FeatureExtractorVL:
    def __init__(self, DASHSCOPE_API_KEY):
        self._api_key = DASHSCOPE_API_KEY  # Keep the API key on the instance for later calls

    def __call__(self, input_data, input_type):
        # Compare directly: ("image") is a plain string, so `in` would do a substring match
        if input_type != "image":
            raise ValueError("Invalid input type. Must be 'image'.")

        try:
            if input_type == "image":
                payload=[
                            {
                                "role": "system",
                                "content": [{"type":"text","text": "You are a helpful assistant."}]
                            },
                            {
                                "role": "user",
                                "content": [
                                            # {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                                            {"image": input_data},
                                            {"text": "First, describe this image in under 50 words, and then provide 5 keywords"}
                                            ],
                            }
                        ]

            resp = dashscope.MultiModalConversation.call(
                model="qwen-vl-plus",
                messages=payload,
                api_key=self._api_key,
            )

            if resp.status_code == HTTPStatus.OK:
                return resp.output["choices"][0]["message"].content[0]["text"]
            else:
                raise RuntimeError(
                    f"API call failed. Status code: {resp.status_code}, Error message: {resp.message}"
                )
        except Exception as e:
            logger.error(f"Processing failed: {str(e)}")
            raise


class MilvusClient:
    def __init__(self, MILVUS_TOKEN, MILVUS_HOST, MILVUS_PORT, INDEX, COLLECTION_NAME):
        self._token = MILVUS_TOKEN
        self._host = MILVUS_HOST
        self._port = MILVUS_PORT
        self._index = INDEX
        self._collection_name = COLLECTION_NAME

        self._connect()
        self._create_collection_if_not_exists()

    def _connect(self):
        try:
            connections.connect(alias="default", host=self._host, port=self._port, token=self._token)
            logger.info("Connected to Milvus successfully.")
        except Exception as e:
            logger.error(f"Failed to connect to Milvus: {str(e)}")
            sys.exit(1)

    def _collection_exists(self):
        return self._collection_name in utility.list_collections()
    
    def _create_collection_if_not_exists(self):
        try:
            if not self._collection_exists():
                fields = [
                    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
                    FieldSchema(name="origin", dtype=DataType.VARCHAR, max_length=512),
                    FieldSchema(name="image_description", dtype=DataType.VARCHAR, max_length=1024),
                    FieldSchema(name="image_embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
                    FieldSchema(name="text_embedding", dtype=DataType.FLOAT_VECTOR, dim=1024)
                ]

                schema = CollectionSchema(fields)

                self._collection = Collection(self._collection_name, schema)

                if self._index == 'IVF_FLAT':
                    self._create_ivf_index()
                else:
                    self._create_hnsw_index()   
                logger.info("Collection created successfully.")
            else:
                self._collection = Collection(self._collection_name)
                logger.info("Collection already exists.")
        except Exception as e:
            logger.error(f"Failed to create or load collection: {str(e)}")
            sys.exit(1)


    def _create_ivf_index(self):
        index_params = {
            "index_type": "IVF_FLAT",
            "params": {
                "nlist": 1024,  # Number of clusters in the index
            },
            "metric_type": "L2",
        }
        self._collection.create_index("image_embedding", index_params)
        self._collection.create_index("text_embedding", index_params)
        logger.info("Index created successfully.")

    def _create_hnsw_index(self):
        index_params = {
            "index_type": "HNSW",
            "params": {
                "M": 64,  # Maximum number of neighbors each node can connect to in the graph
                "efConstruction": 100,  # Number of candidate neighbors considered while building the index
            },
            "metric_type": "L2",
        }
        self._collection.create_index("image_embedding", index_params)
        self._collection.create_index("text_embedding", index_params)
        logger.info("Index created successfully.")
    
    def insert(self, data):
        try:
            self._collection.insert(data)
            self._collection.load()
            logger.info("Data inserted and loaded successfully.")
        except MilvusException as e:
            logger.error(f"Failed to insert data: {str(e)}")
            raise

    def search(self, query_embedding, field, limit=3):
        try:
            if self._index == 'IVF_FLAT':
                param={"metric_type": "L2", "params": {"nprobe": 10}}
            else:
                param={"metric_type": "L2", "params": {"ef": 10}}

            result = self._collection.search(
                data=[query_embedding],
                anns_field=field,
                param=param,
                limit=limit,
                output_fields=["origin", "image_description"],
            )
            return [{"id": hit.id, "distance": hit.distance, "origin": hit.origin, "image_description": hit.image_description} for hit in result[0]]
        except Exception as e:
            logger.error(f"Search failed: {str(e)}")
            return None


# Load data and generate embeddings
def load_image_embeddings(extractor, extractorVL, csv_path):
    df = pd.read_csv(csv_path)
    image_embeddings = {}

    for image_path in tqdm(df["path"].tolist()[:200], desc="Generating image embeddings"):  # This demo uses only the first 200 images
        try:
            desc = extractorVL(image_path, "image")
            image_embeddings[image_path] = [desc, extractor(image_path, "image"), extractor(desc, "text")]
            time.sleep(1)  # Throttle the API call rate
        except Exception as e:
            logger.warning(f"Failed to process {image_path}, skipping: {str(e)}")

    return [{"origin": k, 'image_description':v[0], "image_embedding": v[1], 'text_embedding': v[2]} for k, v in image_embeddings.items()]

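The code contains the following classes:

FeatureExtractor: calls the DashScope Embedding API to convert an image or a piece of text into a vector.
FeatureExtractorVL: calls the Qwen-VL model to extract a text description and keywords from an image.
MilvusClient: encapsulates Milvus operations, including connecting, managing the collection, building indexes, inserting data, and searching.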
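
The collection schema in MilvusClient limits `origin` to 512, `image_description` to 1024, and both vector fields to 1,024 dimensions. Because the description is generated by a model, it can be useful to check a row against these limits before inserting it. The sketch below is one way to do that. `validate_record` is a hypothetical helper, and measuring string limits in UTF-8 bytes is an assumption about how the limit is enforced.

```python
VECTOR_DIM = 1024
MAX_LENGTHS = {"origin": 512, "image_description": 1024}


def validate_record(record):
    """Return a list of problems that would conflict with the schema limits."""
    problems = []
    for field, limit in MAX_LENGTHS.items():
        # Assumption: the limit is compared against the UTF-8 byte length.
        size = len(record[field].encode("utf-8"))
        if size > limit:
            problems.append(f"{field}: {size} bytes exceeds {limit}")
    for field in ("image_embedding", "text_embedding"):
        if len(record[field]) != VECTOR_DIM:
            problems.append(
                f"{field}: expected {VECTOR_DIM} values, got {len(record[field])}"
            )
    return problems


good = {
    "origin": "./train/lion/a.JPEG",
    "image_description": "A lion resting in grass.",
    "image_embedding": [0.0] * VECTOR_DIM,
    "text_embedding": [0.0] * VECTOR_DIM,
}
bad = dict(good, image_description="x" * 2000, text_embedding=[0.0] * 3)
print(validate_record(good))      # []
print(len(validate_record(bad)))  # 2
```

Rows that fail the check could be skipped or truncated before the insert call, depending on what the application needs.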

Procedure

Step 1: Load the dataset

if __name__ == "__main__":
    # Configure Milvus and the DashScope API
    MILVUS_TOKEN = "root:****"
    MILVUS_HOST = "c-0aa16b1****.milvus.aliyuncs.com"
    MILVUS_PORT = "19530"
    COLLECTION_NAME = "multimodal_search"
    INDEX = "IVF_FLAT"  # IVF_FLAT or HNSW
    DASHSCOPE_API_KEY = "<YOUR_DASHSCOPE_API_KEY>"
    script_dir = os.path.dirname(os.path.abspath(__file__))
    csv_path = os.path.join(script_dir, "reverse_image_search.csv")

    # Step 1: Initialize the Milvus client
    milvus_client = MilvusClient(MILVUS_TOKEN, MILVUS_HOST, MILVUS_PORT, INDEX, COLLECTION_NAME)

    # Step 2: Initialize the Qwen-VL model and the multimodal embedding model
    extractor = FeatureExtractor(DASHSCOPE_API_KEY)
    extractorVL = FeatureExtractorVL(DASHSCOPE_API_KEY)

    # Step 3: Generate embeddings for the image dataset and insert them into Milvus
    embeddings = load_image_embeddings(extractor, extractorVL, csv_path)
    milvus_client.insert(embeddings)

Replace the following parameters with your actual values.

Parameter	Description
`DASHSCOPE_API_KEY`	The DashScope API key, used to call Qwen-VL and the multimodal embedding model.
`MILVUS_TOKEN`	The access credentials of the Milvus instance, in the `username:password` format.
`MILVUS_HOST`	The internal or public endpoint of the Milvus instance, for example `c-xxxxxxxxxxxx.milvus.aliyuncs.com`. You can view it on the [Instance Details] page of the Milvus instance.
`MILVUS_PORT`	The port number of the Milvus instance. Default value: `19530`.
`COLLECTION_NAME`	The name of the Milvus collection that stores the image and text vector data.
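
Instead of writing credentials into the script, these values can also be read from environment variables. The sketch below is one possible approach, not part of the original example. `load_settings` is a hypothetical helper, and reusing the parameter names as environment variable names is an assumption.

```python
import os


def load_settings(env=os.environ):
    """Collect connection settings from environment variables."""
    required = ("DASHSCOPE_API_KEY", "MILVUS_TOKEN", "MILVUS_HOST")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "DASHSCOPE_API_KEY": env["DASHSCOPE_API_KEY"],
        "MILVUS_TOKEN": env["MILVUS_TOKEN"],
        "MILVUS_HOST": env["MILVUS_HOST"],
        # Optional values fall back to the defaults used in this topic.
        "MILVUS_PORT": env.get("MILVUS_PORT", "19530"),
        "COLLECTION_NAME": env.get("COLLECTION_NAME", "multimodal_search"),
    }


# Demo with an explicit mapping instead of the real environment.
demo_env = {
    "DASHSCOPE_API_KEY": "sk-demo",
    "MILVUS_TOKEN": "root:demo",
    "MILVUS_HOST": "localhost",
}
settings = load_settings(demo_env)
print(settings["MILVUS_PORT"])  # 19530
```

The returned values can then be passed to the constructors in the main section in place of the hard-coded strings.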

Run the Python file. If the output contains the following information, the data is loaded.

Generating image embeddings: 100%
INFO:__main__:Data inserted and loaded successfully.

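You can also open the Attu page and go to the [Data] tab to check the dataset.

For example, when the Qwen-VL model analyzes an image, it produces a text summary that describes the scene, such as: "A person on a beach wearing jeans and green boots. The sand is covered with water marks. Keywords: beach, footprints, sand, shoes, pants".

The description uses concise, vivid language to highlight the main features of the image and give a clear picture of the scene.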
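
The example in this topic stores the description and the keywords together as one string. If the keywords are needed separately, for example for filtering, the string can be split after the fact. The sketch below assumes that the model output contains an English `Keywords:` marker followed by a comma-separated list. That format follows from the prompt but is not guaranteed, so the hypothetical `split_description` helper falls back to an empty keyword list.

```python
def split_description(text, marker="Keywords:"):
    """Split model output into (description, keywords)."""
    head, sep, tail = text.partition(marker)
    if not sep:
        # Marker not found: treat the whole text as the description.
        return text.strip(), []
    cleaned = (part.strip().strip(".").strip() for part in tail.split(","))
    return head.strip(), [keyword for keyword in cleaned if keyword]


# An English rendering of the example description above.
sample = (
    "A person on a beach wearing jeans and green boots. "
    "The sand is covered with water marks.\n\n"
    "Keywords: beach, footprints, sand, shoes, pants"
)
description, keywords = split_description(sample)
print(keywords)  # ['beach', 'footprints', 'sand', 'shoes', 'pants']
```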

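Step 2: Run multimodal vector retrieval

Example 1: Text-to-image and text-to-text search

In this example, the query text is "a brown dog". The multimodal embedding model converts the query into an embedding. That embedding is used for a text-to-image search on the image_embedding field and for a text-to-text search on the text_embedding field, and the results of both searches are returned.

In the Python file, replace the main section with the following code and run the file.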

if __name__ == "__main__":
    MILVUS_HOST = "c-xxxxxxxxxxxx.milvus.aliyuncs.com"
    MILVUS_PORT = "19530"
    MILVUS_TOKEN = "root:****"
    COLLECTION_NAME = "multimodal_search"
    INDEX = "IVF_FLAT"  # IVF_FLAT or HNSW
    DASHSCOPE_API_KEY = "<YOUR_DASHSCOPE_API_KEY>"

    # Step 1: Initialize the Milvus client
    milvus_client = MilvusClient(MILVUS_TOKEN, MILVUS_HOST, MILVUS_PORT, INDEX, COLLECTION_NAME)

    # Step 2: Initialize the multimodal embedding model
    extractor = FeatureExtractor(DASHSCOPE_API_KEY)

    # Step 3: Run text-to-image and text-to-text searches with the same query embedding
    text_query = "a brown dog"
    text_embedding = extractor(text_query, "text")
    text_results_1 = milvus_client.search(text_embedding, field="image_embedding")
    logger.info(f"Text-to-image search results: {text_results_1}")
    text_results_2 = milvus_client.search(text_embedding, field="text_embedding")
    logger.info(f"Text-to-text search results: {text_results_2}")

The following information is returned.

Note

The output of large models is non-deterministic, so your results may differ slightly from this example.

INFO:__main__:Text-to-image search results: [
{'id': 456882250782308942, 'distance': 1.338853359222412, 'origin': './train/Rhodesian_ridgeback/n02087394_9675.JPEG', 'image_description': 'A photo of a puppy standing on a carpet. It has brown fur and blue eyes.\nKeywords: puppy, carpet, eyes, fur color, standing'},
{'id': 456882250782308933, 'distance': 1.3568601608276367, 'origin': './train/Rhodesian_ridgeback/n02087394_6382.JPEG', 'image_description': 'This is a brown hound with floppy ears and a collar. It is looking straight ahead.\n\nKeywords: dog, brown, hound, ears, collar'},
{'id': 456882250782308940, 'distance': 1.3838427066802979, 'origin': './train/Rhodesian_ridgeback/n02087394_5846.JPEG', 'image_description': 'Two puppies are playing on a blanket. One dog is lying on top of the other, and there is a teddy bear in the background.\n\nKeywords: puppies, playing, blanket, teddy bear, interaction'}]
INFO:__main__:Text-to-text search results: [
{'id': 456882250782309025, 'distance': 0.6969608068466187, 'origin': './train/mongoose/n02137549_7552.JPEG', 'image_description': 'This is a close-up photo of a small brown animal. It has a round face and big eyes.\n\nKeywords: small animal, brown fur, round face, big eyes, natural background'},
{'id': 456882250782308933, 'distance': 0.7110348343849182, 'origin': './train/Rhodesian_ridgeback/n02087394_6382.JPEG', 'image_description': 'This is a brown hound with floppy ears and a collar. It is looking straight ahead.\n\nKeywords: dog, brown, hound, ears, collar'},
{'id': 456882250782308992, 'distance': 0.7725887298583984, 'origin': './train/lion/n02129165_19310.JPEG', 'image_description': 'This is a close-up photo of a lion. It has a thick mane and sharp eyes.\n\nKeywords: lion, eyes, mane, natural environment, wildlife'}]
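
The raw log lines are long, so a compact summary can make results easier to compare. The sketch below formats each hit as a rank, a label, and a distance. `summarize_hits` is a hypothetical helper, and treating the parent directory of each image as its class label is an assumption based on the paths in the sample output. The helper also handles `None`, which the `search` method returns when a search fails.

```python
import os


def summarize_hits(hits):
    """Format search hits as 'rank. label (distance)' lines."""
    if not hits:  # Covers both None (failed search) and an empty result
        return ["no results"]
    lines = []
    for rank, hit in enumerate(hits, start=1):
        # Assumption: the parent directory of each image is its class label.
        label = os.path.basename(os.path.dirname(hit["origin"]))
        lines.append(f"{rank}. {label} ({hit['distance']:.3f})")
    return lines


# Values copied from the first two text-to-text hits in the sample output.
hits = [
    {"id": 456882250782309025, "distance": 0.6969608068466187,
     "origin": "./train/mongoose/n02137549_7552.JPEG"},
    {"id": 456882250782308933, "distance": 0.7110348343849182,
     "origin": "./train/Rhodesian_ridgeback/n02087394_6382.JPEG"},
]
print("\n".join(summarize_hits(hits)))
```

In the sample output, the text-to-text search for "a brown dog" also returns a mongoose and a lion, because their descriptions mention similar attributes. A summary by label makes such cases easy to spot.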

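Example 2: Image-to-image and image-to-text search

In this example, a lion image from the `test` directory (path: `test/lion/n02129165_13728.JPEG`) is used for the similarity search.

Image-to-image and image-to-text search retrieve content related to the target image from both a visual and a textual perspective, which enables similarity matching along more than one dimension.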
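
Before a query image is embedded, FeatureExtractor encodes it as a base64 data URI and takes the image type from the file extension. With that approach, a `.jpg` file produces `image/jpg` rather than the registered type `image/jpeg`. Whether the API accepts both is not stated in this topic. The sketch below is an alternative that derives the type with the standard `mimetypes` module. `image_to_data_uri` is a hypothetical helper, and the demo uses placeholder bytes instead of a real image.

```python
import base64
import mimetypes
import os
import tempfile


def image_to_data_uri(path):
    """Encode an image file as a base64 data URI."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Demo with placeholder bytes; a real query would use an actual image file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "query.JPEG")
    with open(demo_path, "wb") as f:
        f.write(b"not-a-real-image")
    data_uri = image_to_data_uri(demo_path)

print(data_uri.split(",")[0])  # data:image/jpeg;base64
```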

if __name__ == "__main__":
    # Configure Milvus and the DashScope API
    MILVUS_TOKEN = "root:****"
    MILVUS_HOST = "c-0aa16b1****.milvus.aliyuncs.com"
    MILVUS_PORT = "19530"
    COLLECTION_NAME = "multimodal_search"
    INDEX = "IVF_FLAT"  # IVF_FLAT or HNSW
    DASHSCOPE_API_KEY = "<YOUR_DASHSCOPE_API_KEY>"

    # Step 1: Initialize the Milvus client
    milvus_client = MilvusClient(MILVUS_TOKEN, MILVUS_HOST, MILVUS_PORT, INDEX, COLLECTION_NAME)

    # Step 2: Initialize the multimodal embedding model
    extractor = FeatureExtractor(DASHSCOPE_API_KEY)

    # Step 3: Run image-to-image and image-to-text searches with the same query embedding
    image_query_path = "./test/lion/n02129165_13728.JPEG"
    image_embedding = extractor(image_query_path, "image")
    image_results_1 = milvus_client.search(image_embedding, field="image_embedding")
    logger.info(f"Image-to-image search results: {image_results_1}")
    image_results_2 = milvus_client.search(image_embedding, field="text_embedding")
    logger.info(f"Image-to-text search results: {image_results_2}")

The following output is returned.

Note

The output of large models is non-deterministic, so your results may differ from this example.

INFO:__main__:Image-to-image search results: [
{'id': 456882250782308987, 'distance': 0.23892249166965485, 'origin': './train/lion/n02129165_19953.JPEG', 'image_description': 'A majestic lion stands beside a rock, with trees and bushes in the background. Sunlight is shining on it.\n\nKeywords: lion, rock, forest, sunlight, wild'},
{'id': 456882250782308989, 'distance': 0.4113130569458008, 'origin': './train/lion/n02129165_1142.JPEG', 'image_description': 'A lion is resting in dense green vegetation. The background consists of bamboo and trees.\n\nKeywords: lion, grass, green plants, tree trunk, natural environment'},
{'id': 456882250782308984, 'distance': 0.5206397175788879, 'origin': './train/lion/n02129165_16.JPEG', 'image_description': 'The image shows a pair of lions standing on grass. The male lion has a thick mane, and the lioness looks leaner.\n\nKeywords: lion, grass, male, female, natural environment'}]
INFO:__main__:Image-to-text search results: [
{'id': 456882250782308989, 'distance': 1.0935896635055542, 'origin': './train/lion/n02129165_1142.JPEG', 'image_description': 'A lion is resting in dense green vegetation. The background consists of bamboo and trees.\n\nKeywords: lion, grass, green plants, tree trunk, natural environment'},
{'id': 456882250782308987, 'distance': 1.2102885246276855, 'origin': './train/lion/n02129165_19953.JPEG', 'image_description': 'A majestic lion stands beside a rock, with trees and bushes in the background. Sunlight is shining on it.\n\nKeywords: lion, rock, forest, sunlight, wild'},
{'id': 456882250782308992, 'distance': 1.2725986242294312, 'origin': './train/lion/n02129165_19310.JPEG', 'image_description': 'This is a close-up photo of a lion. It has a thick mane and sharp eyes.\n\nKeywords: lion, eyes, mane, natural environment, wildlife'}]
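
The two result lists overlap: some images match the query both visually and through their descriptions. The sketch below finds the IDs that appear in both lists. `shared_ids` is a hypothetical helper, and the IDs are copied from the sample output above.

```python
def shared_ids(hits_a, hits_b):
    """Return IDs present in both hit lists, in the order of the first list."""
    ids_b = {hit["id"] for hit in hits_b}
    return [hit["id"] for hit in hits_a if hit["id"] in ids_b]


# IDs copied from the sample output above.
image_to_image = [
    {"id": 456882250782308987},
    {"id": 456882250782308989},
    {"id": 456882250782308984},
]
image_to_text = [
    {"id": 456882250782308989},
    {"id": 456882250782308987},
    {"id": 456882250782308992},
]
print(shared_ids(image_to_image, image_to_text))
# [456882250782308987, 456882250782308989]
```

Hits that appear in both lists are supported by two independent signals, which is one simple way to use the two vector fields together.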