
Platform For AI: Manage and use multimodal data

Last Updated: Jan 27, 2026

1. Overview

Multimodal data management lets you process data, such as images, using multimodal large language models (MLLMs) and embedding models. This preprocessing, through Smart Tagging and Semantic Indexing, generates rich metadata. Use this metadata to search, filter, and quickly find specific data subsets for downstream tasks like data annotation and model training. Additionally, Platform for AI (PAI) datasets provide a comprehensive OpenAPI to simplify integration with your custom platforms. The following figure shows the product architecture.

image

2. Limitations

Multimodal data management in PAI has the following limitations:

  • Region: This feature is available in the following regions: China (Hangzhou), China (Shanghai), China (Shenzhen), China (Ulanqab), China (Beijing), China (Guangzhou), Singapore, Germany (Frankfurt), US (Virginia), China (Hong Kong), Japan (Tokyo), Indonesia (Jakarta), US (Silicon Valley), Malaysia (Kuala Lumpur), and Korea (Seoul).

  • Storage type: Currently, multimodal data management only supports Object Storage Service (OSS).

  • File type: Only image files are supported. Supported formats include JPG, JPEG, PNG, GIF, BMP, TIFF, and WEBP.

  • File quantity: A single dataset version supports a maximum of 1,000,000 files. To increase the capacity for special requirements, contact PAI PDSA.

  • Models:

    • Tagging models: Supports Qwen-VL-MAX and Qwen-VL-Plus models from the Model Studio platform.

    • Indexing models: Supports the Model Studio Multimodal Embedding Model (such as tongyi-embedding-vision-plus) and GME models from the PAI Model Gallery. You can deploy these models to PAI-EAS.

  • Metadata storage:

    • Metadata: PAI securely stores metadata in its built-in metadata database.

    • Embedding vectors: Supports storage in the following custom vector databases:

      • Elasticsearch (Vector Search Edition, version 8.17.0 or later)

      • OpenSearch (Vector Search Edition)

      • Milvus (version 2.4 or later)

      • Hologres (version 4.0.9 or later)

      • Lindorm (Vector Engine Edition)

  • Dataset processing mode: Supports running Smart Tagging and Semantic Indexing tasks in full and incremental modes.

3. Workflow

PAI Multimodal Data Management Guide

3.1 Prerequisites

3.1.1 Activate PAI, create a default workspace, and obtain administrator permissions

  1. Use a root account to activate PAI and create a workspace. Log on to the PAI console, select a region in the upper-left corner, and then authorize and activate the product.

  2. Authorize the operating account. You can skip this step if you are using a root account. RAM users must have the Workspace administrator role. For instructions on how to authorize an account, see the "Configure member roles" section in Create and manage workspaces.

3.1.2 Activate Model Studio and create an API key

To activate Alibaba Cloud Model Studio and create an API key, see Get an API key.

3.1.3 Create a vector database

Create a vector database instance

Multimodal dataset management currently supports the following Alibaba Cloud vector databases:

  • Elasticsearch (Vector Search Edition, 8.17.0 or later)

  • OpenSearch (Vector Search Edition)

  • Milvus (2.4 or later)

  • Hologres (4.0.9 or later)

  • Lindorm (Vector Engine Edition)

For instructions on how to create an instance for each cloud vector database, refer to the documentation for the respective product.

Configure network and whitelist settings

  • Public network access

    If your vector database instance has a public endpoint enabled, add the following IP addresses to the instance's public access whitelist. This allows the multimodal data management service to access the instance over the public network. For instructions on how to set up an Elasticsearch whitelist, see Configure a public or private IP address whitelist for an Elasticsearch cluster.

    Region   | IP address list
    Hangzhou | 47.110.230.142, 47.98.189.92
    Shanghai | 47.117.86.159, 106.14.192.90
    Shenzhen | 47.106.88.217, 39.108.12.110
    Ulanqab  | 8.130.24.177, 8.130.82.15
    Beijing  | 39.107.234.20, 182.92.58.94

  • Private network access

    Please submit a ticket to apply for this option.

Create a vector index table (Optional)

The system can create an index table automatically. You can skip this step unless you need a custom one.

In some vector databases, a vector index table is also known as a Collection or an Index.

The index table structure must be defined as follows:

Table schema

{
    "id":"text",                    // Primary key ID. This must be defined in OpenSearch. It exists by default in other databases and does not need to be defined.
    "index_set_id": "keyword",      // Index set ID. Must support indexing.
    "file_meta_id": "text",         // File metadata ID.   
    "dataset_id": "text",           // Dataset ID.
    "dataset_version": "text",      // Dataset version.
    "uri": "text",                  // URI of the OSS file.
    "file_vector": {                // Vector field.
        "type": "float",            // Vector type: float.
        "dims": 1536,               // Vector dimensions. Custom.
        "similarity": "DotProduct"  // Vector distance algorithm. Cosine distance or dot product.
    }
}

This section uses Elasticsearch as an example to show how to create a semantic index table with Python. For instructions on how to create index tables for other types of vector databases, refer to the documentation for the respective product. The sample code is as follows:

Sample code for creating a semantic index table in Elasticsearch

from elasticsearch import Elasticsearch

# 1. Connect to the Alibaba Cloud Elasticsearch instance.
# Note:
# (1) Python 3.9 or later is required: python3 -V
# (2) The Elasticsearch client must be version 8.x: pip show elasticsearch
# (3) If you use a VPC endpoint, the caller must be able to communicate with the VPC of the ES instance. Otherwise, use a public endpoint and add the public IP address of the caller to the ES whitelist.
# The default userName is elastic.
es_client = Elasticsearch(
    hosts=["http://es-cn-l4p***5z.elasticsearch.aliyuncs.com:9200"],
    basic_auth=("{userName}", "{password}"),
)

# 2. Define the index name and structure. The HNSW index algorithm is used by default.
index_name = "dataset_embed_test"
index_mapping = {
    "settings": {
        "number_of_shards": 1,          # Number of shards.
        "number_of_replicas": 1         # Number of replicas.
    },
    "mappings": {
        "properties": {
            "index_set_id": {
                "type": "keyword"
            },
            "uri": {
                "type": "text"
            },
            "file_meta_id": {
                "type": "text"
            },
            "dataset_id": {
                "type": "text"
            },
            "dataset_version": {
                "type": "text"  
            },
            "file_vector": {
                "type": "dense_vector",  # Define file_vector as a dense vector type.
                "dims": 1536,  # The vector dimension is 1536.
                "similarity": "dot_product"  # The similarity calculation method is dot product.
            }
        }
    }
}

# 3. Create the index.
if not es_client.indices.exists(index=index_name):
    es_client.indices.create(index=index_name, body=index_mapping)
    print(f"Index {index_name} created successfully!")
else:
    print(f"Index {index_name} already exists. It will not be created again.")

# 4. View the schema of the created index table (Optional).
# indexes = es_client.indices.get(index=index_name)
# print(indexes)
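
The preceding example targets Elasticsearch. As an illustration for other vector databases, the following is a minimal sketch that creates an equivalent collection in Milvus 2.4 or later with the pymilvus MilvusClient. The endpoint, token, field lengths, and HNSW parameters are placeholders and assumptions; refer to the Milvus documentation for the authoritative API.

Sample code for creating a semantic index collection in Milvus (sketch)

from pymilvus import DataType, MilvusClient

# Placeholder endpoint and credentials. Replace them with the values of your Milvus instance.
client = MilvusClient(
    uri="http://xxx.milvus.aliyuncs.com:19530",
    token="root:{password}",
    db_name="{your_database}",
)

collection_name = "dataset_embed_test"

# Define fields that mirror the table schema above. The max_length values are assumptions.
schema = MilvusClient.create_schema(auto_id=False, enable_dynamic_field=False)
schema.add_field(field_name="id", datatype=DataType.VARCHAR, is_primary=True, max_length=128)
schema.add_field(field_name="index_set_id", datatype=DataType.VARCHAR, max_length=128)
schema.add_field(field_name="file_meta_id", datatype=DataType.VARCHAR, max_length=256)
schema.add_field(field_name="dataset_id", datatype=DataType.VARCHAR, max_length=128)
schema.add_field(field_name="dataset_version", datatype=DataType.VARCHAR, max_length=64)
schema.add_field(field_name="uri", datatype=DataType.VARCHAR, max_length=1024)
schema.add_field(field_name="file_vector", datatype=DataType.FLOAT_VECTOR, dim=1536)

# Build an HNSW index on the vector field. The "IP" (inner product) metric corresponds to dot product.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="file_vector",
    index_type="HNSW",
    metric_type="IP",
    params={"M": 16, "efConstruction": 200},
)

# Create the collection only if it does not exist yet.
if not client.has_collection(collection_name):
    client.create_collection(
        collection_name=collection_name,
        schema=schema,
        index_params=index_params,
    )
    print(f"Collection {collection_name} created.")
else:
    print(f"Collection {collection_name} already exists.")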

3.2 Create a dataset

  1. Go to your PAI workspace. In the left-side navigation pane, choose AI Asset Management > Datasets > Create Dataset.

    image

  2. Configure the dataset parameters. Key parameters are as follows. You can keep the default values for other parameters.

    1. Storage: Select Object Storage Service.

    2. Type: Select Premium.

    3. Content Type: Select Image.

    4. OSS Path: Select the OSS storage path for the dataset. If you have not prepared a dataset, you can download the sample dataset retrieval_demo_data, upload it to OSS, and then try out the multimodal data management feature.

    Note

    Importing a file or folder only records its path and does not copy the data.

    image

    Then, click OK to create the dataset.

3.3 Create connections

3.3.1 Create a Smart Tagging model connection

  1. Go to your PAI workspace. In the left-side navigation pane, choose AI Asset Management > Connection > Model Service > Create Connection.

    image

  2. Select Alibaba Cloud Model Studio Service and configure the Model Studio api_key.

    image

  3. After the connection is created, you can see the Alibaba Cloud Model Studio Service in the list.

    image

3.3.2 Create a Semantic Indexing model connection

  1. If you plan to use the Model Studio Semantic Indexing model service, skip this step. Otherwise, in the left-side navigation pane, click Model Gallery, then find and deploy the GME multimodal retrieval model to obtain an EAS service. The deployment takes about 5 minutes. When the status changes to Running, the deployment is successful.

    Important

    When you no longer need the index model, you can stop and delete the service to avoid further charges.

    image

  2. Go to your PAI workspace. In the left-side navigation pane, choose AI Asset Management > Connection > Model Service > Create Connection.

  3. Configure the model connection information based on whether you chose the Model Studio Semantic Indexing model or a custom-deployed EAS Semantic Indexing model.

    Use the Model Studio Semantic Indexing model

    • For Connection Type, select General Multimodal Embedding Model Service.

    • For Service Provider, select Third-party Model Service.

    • Model Name: tongyi-embedding-vision-plus.

    • base_url: https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding

    • api_key: Get an API key and fill it in.

    image

    Use a custom-deployed EAS Semantic Indexing model

    • For Connection Type, select General-purpose Multimodal Embedding Model Service.

    • For Service Provider, select PAI-EAS Model Service.

    • EAS Service: Select the GME multimodal retrieval model that you just deployed. If the service provider is not under the current account, you can select Third-party Service Model.

    image

    image

  4. After the connection is created, you can see the model connection service in the list.

    image
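
If you deployed a GME model on EAS, you can optionally verify that the service responds before you create the connection. The following is a minimal sketch that sends one text input using the request and response format described in Section 4.4; the endpoint and token are placeholders that you must replace with the values from the EAS service details page.

import json

import requests

# Placeholders: copy the endpoint and token from the EAS service details page.
EAS_URL = "EAS URL"
EAS_TOKEN = "EAS TOKEN"

payload = json.dumps({"input": {"contents": [{"text": "a silver sedan on a snowy street"}]}})
response = requests.post(EAS_URL, data=payload, headers={"Authorization": EAS_TOKEN})
result = response.json()

if result.get("status_code") == 200:
    embedding = result["output"]["embeddings"][0]["embedding"]
    print(f"Service is reachable. Embedding dimension: {len(embedding)}")
else:
    print("Request failed:", result.get("message"))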

3.3.3 Create a vector database connection

  1. In the left-side navigation pane, choose AI Asset Management > Connection > Database > Create Connection.

    image

  2. The multimodal retrieval service supports vector databases such as Milvus, Lindorm, OpenSearch, Elasticsearch, and Hologres. This topic uses Elasticsearch as an example to describe how to create a database connection. Select Elasticsearch and configure parameters such as url, username, and password. For more information, see Create a database connection.

    image

    The connection formats for each vector database are as follows:

    Milvus

    uri: http://xxx.milvus.aliyuncs.com:19530 
    database: {your_database} 
    token: root:{password}

    OpenSearch

    uri: http://xxxx.ha.aliyuncs.com
    username: {username} 
    password: {password}

    Hologres

    host: xxxx.hologres.aliyuncs.com
    database: {your_database} 
    port: {port}
    access_key_id={password}

    Elasticsearch

    uri: http://xxxx.elasticsearch.aliyuncs.com:9200
    username: {username} 
    password: {password}

    Lindorm

    uri: xxxx.lindorm.aliyuncs.com:{port}
    username: {username} 
    password: root:{password}
  3. After the connection is created, the vector database connection appears in the list.

    image
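
Optionally, you can confirm that the connection parameters work before you save them. The following minimal sketch pings an Elasticsearch instance with the elasticsearch 8.x Python client and the same uri, username, and password that you enter in the connection form. Note that this only verifies access from your own machine; the whitelist configuration described in section 3.1.3 is still required for the multimodal data management service itself.

from elasticsearch import Elasticsearch

# Use the same values that you plan to enter in the database connection form.
es_client = Elasticsearch(
    hosts=["http://xxxx.elasticsearch.aliyuncs.com:9200"],
    basic_auth=("{username}", "{password}"),
)

# ping() returns True if the cluster is reachable with the given credentials.
if es_client.ping():
    print("Elasticsearch is reachable. The same values can be used for the database connection.")
else:
    print("Connection failed. Check the endpoint, credentials, and whitelist settings.")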

3.4 Create a Smart Tagging task

3.4.1 Create a Smart Tag Definition

In the left menu bar, click AI Asset Management > Datasets > Intelligent Tag Definition > Create Intelligent Tag Definition. The tag configuration page opens. The following is an example configuration:

  • Guide Prompt: You are a seasoned driver with extensive experience on both highways and urban roads.

  • Tag Definition:

    Sample tag definition for autonomous driving

    {
        "Reflective strips": "Usually yellow, or alternating yellow and black, attached to corners and other permanent protruding obstacles to alert drivers to avoid them. They are strip-shaped, not traffic cones, parking locks, or water-filled barriers.",
        "Parking locks": "Also known as parking space locks, they can be raised to prevent a parking space from being occupied. If a parking lock is present, you must specify whether it is in the raised or lowered state. It is in the raised state if it has a raised frame, otherwise it is in the lowered state.",
        "Lit construction vehicles": "The target is a vehicle with two arrow-shaped lights on the left and right that are lit. Otherwise, it does not exist.",
        "Overturned vehicles": "A vehicle that has overturned on the ground.",
        "Fallen water-filled barriers": "A water-filled barrier is a plastic shell obstacle used to divide road surfaces or form a blockage, typically in the form of a red plastic wall. It is commonly used in road traffic facilities and is often seen on highways, urban roads, and at overpass intersections. It is significantly larger than a traffic cone and has a sheet-like structure. Water-filled barriers are normally upright. If one is lying on the ground, it must be clearly indicated.",
        "Fallen traffic cones": "Also known as conical traffic markers or pylons, commonly called road cones or safety cones, they are cone-shaped temporary road signs. Obstacles that are rod-shaped or sheet-like are not traffic cones because they are not conical. A traffic cone may be knocked over by a car. If a traffic cone is present in the image and you need to determine if it has fallen, observe whether the bottom of the cone (the base of the cone) is in contact with the ground. If it is, it has not fallen. Otherwise, it has.",
        "Charging spaces": "A parking space against a wall with a visible charging gun, charging pile equipment, or marked as a new energy vehicle space is a charging space. It can only appear in a parking lot (indoor or outdoor). Note that parking locks are not related to charging.",
        "Speed bumps": "Usually yellow and black, or just yellow, these are narrow raised strips across the road, perpendicular to the road edge, used to slow down vehicles. They cannot appear within a parking space.",
        "Deceleration lane lines": "Dashed lines in a fishbone pattern on both sides of the lane, inside the solid lines. Both sides must have them to be considered deceleration lane lines.",
        "Ramps": "Can only be identified if there is a clear large curve on a highway. Ramps are usually on the right side of the main highway and are used to enter or exit toll stations.",
        "Ground shadows": "There are clear shadows on the ground.",
        "Cloudy": "Can only be identified if the sky is visible and there are clear clouds in the sky.",
        "Glaring car": "The lights of a car ahead are causing glare (the light changes from a single point to a line of light), usually occurring at night or on rainy days.",
        "Left turn, right turn, U-turn arrows": "Milky white arrow markings painted on the road surface (a few are yellow), not the green and white arrows on highway signs indicating a right curve. When determining the presence of these arrows, only clear arrow markings in the middle of the road lane are the target. Those on the roadside are not. If there are arrows on the ground, the direction is determined as follows: a right-turn arrow rotates clockwise from the base to the tip; a left-turn arrow rotates counter-clockwise from the base to the tip; a U-shaped arrow is a U-turn arrow.",
        "Crosswalks": "Can only exist on the road surface (also possible in parking lots) and at intersections. They must be white lines distributed at repeated intervals parallel to the road edge for pedestrian crossing. They cannot appear on highways, highway ramps, or in tunnels.",
        "Overexposure": "During the day, direct sunlight causes the lens to be overexposed (can only happen during the day).",
        "Motor vehicles": "There are other motor vehicles in the field of view.",
        "Merging in or out": "A place on a highway where multiple lanes merge into one, or one lane divides into multiple lanes.",
        "Intersections": "An intersection where there are no lane lines within the intersection area (refers to the absence of lines within the intersection itself; lines outside the intersection do not matter).",
        "No parking signs": "A sign, either hanging or standing on the ground, with the words 'No Parking' or a symbol of a 'P' in a circle with a diagonal line through it.",
        "Lane lines": "Lane lines on the road, with special attention to blurry lane lines.",
        "Fallen rocks or tires on the road": "Obstacles on the road that affect traffic.",
        "Tunnels": "Pay special attention to distinguishing between entering and exiting a tunnel.",
        "Wet ground on a rainy day": "The ground is slippery due to rain.",
        "Non-motorized vehicles": "Includes non-motorized objects such as bicycles, electric bikes, wheelchairs, unicycles, and shopping carts, which may be parked on the roadside, in parking spaces, or moving on the road."
      }

3.4.2 Create an Offline Smart Tagging Task

  1. Click Custom Dataset, click a dataset name to open its details page, and then click the Dataset jobs tab.

    image

  2. On the task page, click Create job > Smart tag to configure the task parameters.

    image

    • Dataset Version: Select the version to label, such as v1.

    • Labeling Model Connection: Select an existing Model Studio model connection.

    • Smart Labeling Model: Supported models include Qwen-VL-MAX and Qwen-VL-Plus.

    • Max Concurrency: This value depends on the specifications of the EAS model service. For a single card, the recommended maximum concurrency is 5.

    • Intelligent Tag Definition: Select the definition that you just created.

    • Labeling Mode: The available modes are Incremental and Full.

  3. After the smart tagging task is created, it appears in the task list. You can click the links in the Actions column to view logs or stop the task.

    Note

    When you start a smart tagging task for the first time, the system builds the metadata. This process may take a long time.

3.5 Create a Semantic Indexing Task

  1. Click the dataset name to open the details page. In the Index Configuration area, click the edit button.

    image

  2. Configure the index.

    • Index Model Connection: Select the connection that you created in 3.3.2.

    • Index Database Connection: Select the index database connection you created in Section 3.3.3.

    • Index Database Table: Enter the name of the index table created in Create a vector index table (Optional), such as dataset_embed_test.

    Click Save > Refresh Now. A semantic index task is created to update the semantic index for all files in the selected dataset version. You can click Semantic Indexing Task in the upper-right corner of the dataset details page to view the task details.

    Note

    When you start a semantic indexing task for the first time, the system builds the metadata. This process may take a long time.

    If you click Cancel instead of Refresh Now, you can create the task manually by following these steps:

    On the dataset details page, click Dataset jobs to open the Tasks page.

    image

    Click Create job > Semantic Indexing. Configure the dataset version and set the maximum number of concurrent requests based on the EAS model service specifications (the recommended maximum is 5 for a single card). Then, click Confirm to create the semantic index task.

    image

3.6 Preview Data

  1. After the Smart Tagging and Semantic Indexing tasks are complete, go to the dataset details page and click View Data to preview the images in that dataset version.

    image.png

  2. On the View Data page, you can preview the images in the dataset version. You can switch between Gallery View and List View.

    image.png

    image.png

  3. Click a specific image to view a larger version and see the tags it contains.

    image.png

  4. Click the checkbox in the upper-left corner of a thumbnail to select it. You can also hold down the Shift key and click a checkbox to select multiple rows of data at once.

    image.png

3.7 Basic Data Search (Combined Search)

  1. On the left toolbar of the View Data page, you can perform Index Retrieval and Search by Tag. Press Enter or click Search to start the search.

  2. Index Retrieval: Performs a text keyword search by matching keyword vectors with image index vectors from the semantic index. In Advanced Settings, you can set parameters such as topk and the score threshold.

    image

  3. Index Retrieval (search by image): Based on semantic indexing, you can upload an image from your local computer or select an image from OSS to search for matching images in the dataset by comparing vectors. In Advanced Settings, you can set parameters such as topk and Score threshold.

    image

  4. Search by Tag: Finds images by matching keywords with tags from the Smart Tagging feature. You can combine the following search conditions: Include Any of Following (OR), Include All Following (AND), and Exclude Any of Following (NOT).

    image

  5. Metadata search: You can search for files by file name, storage path, and last modified time.

    image

    All the preceding search conditions are combined with an AND operator.

3.8 Advanced Data Search (DSL)

Advanced search uses DSL search, a domain-specific language for expressing complex search conditions. DSL is ideal for advanced search scenarios and supports features such as grouping, Boolean logic (AND/OR/NOT), range comparison (>, >=, <, <=), property existence (HAS/NOT HAS), token matching (:), and exact match (=). For more information about the syntax, see Retrieve a list of dataset file metadata.

image
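
For illustration only, an expression such as (tag:"speed bump" OR tag:"crosswalk") AND NOT tag:"tunnel" combines grouping with Boolean logic, and HAS tag AND file_size >= 102400 combines property existence with a range comparison. The field names tag and file_size here are assumptions; see Retrieve a list of dataset file metadata for the actual metadata fields and the authoritative syntax.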

3.9 Export search results

Note

This step exports the search results as a file list index for subsequent model training or data analytics.

After the search is complete, you can click the Export Results button at the bottom of the page. Two export modes are available:

image

3.9.1 Export to a file

  1. Click Export as file. On the configuration page, set the export content and the destination OSS folder, and then click OK.

    image.png

  2. You can view the export progress under AI Asset Management > Job > Dataset jobs in the left navigation bar.

  3. Use the exported results. After the export is complete, you can mount the exported result file and the original dataset to the training environment, such as a DLC or DSW instance. Then, you can write code to read the exported result file index and load the object files from the original dataset for model training or data analytics.
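
The layout of the exported index file depends on your export configuration. The following is a minimal sketch that assumes the export is a JSON Lines file in which each record contains the OSS URI of an image, and that both the exported file and the original dataset are mounted into the DLC or DSW instance; the mount paths and the uri field name are illustrative only.

import json
import os

from PIL import Image

# Hypothetical mount paths. Adjust them to your DLC or DSW mount configuration.
EXPORT_FILE = "/mnt/export/result.jsonl"   # Exported search-result index.
DATASET_ROOT = "/mnt/dataset/"             # Mount point of the original OSS dataset.

with open(EXPORT_FILE, "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Assumption: each record carries a "uri" field such as "oss://bucket/path/img.jpg".
        uri = record["uri"]
        relative_path = uri.split("/", 3)[-1] if uri.startswith("oss://") else uri
        local_path = os.path.join(DATASET_ROOT, relative_path)
        image = Image.open(local_path)
        # ... feed the image into your training or analytics pipeline ...
        print(local_path, image.size)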

3.9.2 Export to a logical dataset version

You can export the search results into a version of another logical dataset, and then access the data of that version through the dataset software development kit (SDK).

  1. Click Export to logical dataset version, select the target logical dataset, and then click Confirm.

    image.png

    If no logical dataset is available, create one as described in the following section:

    Create a logical dataset

    In the navigation pane on the left, choose AI Asset Management > Dataset > Create Dataset, and then configure the following key parameters. You can configure the other parameters as needed.

    • Set Dataset Type to Logical.

    • Metadata OSS path: Select the OSS path of the exported metadata.

    • Set Import method to Import later.

    Click OK to create the dataset.

  2. Use the logical dataset. After the import task is complete, the destination logical dataset contains the exported metadata. You can use the SDK to load and use the data. For information about how to use the SDK, see the dataset details page.

    image

    image

    The command to install the SDK is:

    pip install https://pai-sdk.oss-cn-shanghai.aliyuncs.com/dataset/pai_dataset_sdk-1.0.0-py3-none-any.whl

4. Custom semantic indexing model (Optional)

You can fine-tune a custom semantic retrieval model. After the model is deployed in EAS, you can create a model connection as described in Section 3.3.2 to use it for multimodal data management.

4.1 Data preparation

This topic provides sample data. You can click retrieval_demo_data to download it.

4.1.1 Data format requirements

Each data sample is saved as a line in JSON format in the dataset.jsonl file. Each sample must contain the following fields:

  • image_id: A unique identifier for the image, such as the image name or a unique ID.

  • tags: A list of text tags associated with the image. The tags are an array of strings.

Example format:

{
    "image_id": "c909f3df-ac4074ed",
    "tags": ["silver sedan", "white SUV", "city street", "snowing", "night"]
}

4.1.2 File organization

Place all image files in a folder named images. Place the dataset.jsonl file in the same directory as the images folder.

Directory structure example:

├── images
│   ├── image1.jpg
│   ├── image2.jpg
│   └── image3.jpg
└── dataset.jsonl

Important

The filename dataset.jsonl and the folder name images are required and cannot be changed.
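
Before you start training, you can run a quick check that the data matches this format. The following is a minimal sketch that validates the directory layout and the fields of each sample; the root path is a placeholder.

import json
import os

root = "path_to_your_data"  # Directory that contains the images folder and dataset.jsonl.
images_dir = os.path.join(root, "images")
jsonl_path = os.path.join(root, "dataset.jsonl")

assert os.path.isdir(images_dir), "Missing required folder: images"
assert os.path.isfile(jsonl_path), "Missing required file: dataset.jsonl"

with open(jsonl_path, "r", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        sample = json.loads(line)
        # Each sample must contain image_id (a string) and tags (a list of strings).
        assert isinstance(sample.get("image_id"), str), f"Line {line_no}: image_id is missing or not a string"
        tags = sample.get("tags")
        assert isinstance(tags, list) and all(isinstance(t, str) for t in tags), \
            f"Line {line_no}: tags must be a list of strings"

print("dataset.jsonl format check passed.")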

4.2 Model training

  1. In the Model Gallery, find retrieval-related models. Select a suitable model for fine-tuning and deployment based on the required model size and compute resources.

    image

    Model  | Fine-tuning VRAM (bs=4) | Fine-tuning throughput (4 × A800, train_samples/second) | Deployment VRAM | Vector dimensions
    GME-2B | 14 GB                   | 16.331                                                  | 5 GB            | 1536
    GME-7B | 35 GB                   | 13.868                                                  | 16 GB           | 3584

  2. For example, to train the GME-2B model, click Train, and then enter the data address and the model output path to start the training. The data address defaults to the sample data address.

    image

    image

4.3 Model deployment

After a model is trained, you can deploy the fine-tuned model by clicking Deploy in the training task.

To deploy the original GME model, click the Deploy button on the model tab in Model Gallery.

image

After the deployment completes, you can retrieve the EAS Endpoint and Token from the page.

image

4.4 Service invocation

Input parameters

Name | Type | Required | Example | Description
model | String | Yes | pai-multimodal-embedding-v1 | The model type. Support for custom user models and base model version iterations can be added later.
contents.input | list(dict) or list(str) | No | input = [{'text': text}]; input = [xxx, xxx, xxx, ...]; input = [{'text': text}, {'image': f"data:image/{image_format};base64,{image64}"}] | The content to be embedded. Currently, only text and image are supported.

Response parameters

Name | Type | Example | Description
status_code | Integer | 200 | The HTTP status code. 200: the request was successful. 204: the request was partially successful. 400: the request failed.
message | list(str) | ['Invalid input data: must be a list of strings or dict'] | The error message.
output | dict | See the following table. | The embedding result. The result from Dashscope is as follows: {'output': {'embeddings': list(dict), 'usage': xxx, 'request_id': xxx}}. The usage and request_id parameters are currently not used.

The elements in embeddings contain the following keys. If an index fails, a message key is added to the corresponding element to provide the reason for the failure.

Name | Type | Example | Description
index | Data ID | 0 | The HTTP status code. 200, 400, 500, etc.
embedding | List[Float] | [0.0391846, 0.0518188, ..., -0.0329895, 0.0251465] (1536 dimensions) | The vector after embedding.
type | String | "Internal execute error." | The error message.

Sample code

import base64
import json

import requests

ENCODING = 'utf-8'

hosts = 'EAS URL'
head = {
    'Authorization': 'EAS TOKEN'
}

def encode_image_to_base64(image_path):
    """
    Encodes an image file into a Base64 string.
    """
    with open(image_path, "rb") as image_file:
        # Read the binary data of the image file.
        image_data = image_file.read()
        # Encode it into a Base64 string.
        base64_encoded = base64.b64encode(image_data).decode(ENCODING)

    return base64_encoded

if __name__ == '__main__':
    image_path = "path_to_your_image"
    text = 'prompt'

    image_format = 'jpg'
    input_data = []

    # Add one image input and one text input to the request.
    image64 = encode_image_to_base64(image_path)
    input_data.append({'image': f"data:image/{image_format};base64,{image64}"})
    input_data.append({'text': text})

    datas = json.dumps({
        'input': {
            'contents': input_data
        }
    })
    r = requests.post(hosts, data=datas, headers=head)
    data = json.loads(r.content.decode(ENCODING))

    if data['status_code'] == 200:
        if len(data['message']) != 0:
            # Some inputs failed. Print the reasons.
            print('Some inputs failed for the following reasons:')
            print(data['message'])

        # Print the result for each successfully embedded input.
        for result_item in data['output']['embeddings']:
            print('index', result_item['index'])
            print('type', result_item['type'])
            print('embedding', len(result_item['embedding']))
    else:
        print('Processing failed.')
        print(data['message'])

Sample output:

{
    "status_code": 200,
    "message": "",
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.020782470703125,
                    -0.01399993896484375,
                    -0.0229949951171875,
                    ...
                ],
                "type": "text"
            }
        ]
    }
}

4.5 Model evaluation

The following table shows the evaluation results on our sample data. The evaluation file was used for this test.

Model | Metric | Original model precision | Precision after 1 epoch of fine-tuning
gme2b | Precision@1 | 0.3542 | 0.4271
gme2b | Precision@5 | 0.5280 | 0.6480
gme2b | Precision@10 | 0.5923 | 0.7308
gme2b | Precision@50 | 0.5800 | 0.7331
gme2b | Precision@100 | 0.5792 | 0.7404
gme7b | Precision@1 | 0.3958 | 0.4375
gme7b | Precision@5 | 0.5920 | 0.6680
gme7b | Precision@10 | 0.6667 | 0.7590
gme7b | Precision@50 | 0.6517 | 0.7683
gme7b | Precision@100 | 0.6415 | 0.7723

Sample script for model evaluation

import base64
import json
import os
import requests
import numpy as np
import torch
from tqdm import tqdm
from collections import defaultdict


# Constants
ENCODING = 'utf-8'
HOST_URL = 'http://1xxxxxxxx4.cn-xxx.pai-eas.aliyuncs.com/api/xxx'
AUTH_HEADER = {'Authorization': 'ZTg*********Mw=='}

def encode_image_to_base64(image_path):
    """Encodes an image file into a Base64 string."""
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
        base64_encoded = base64.b64encode(image_data).decode(ENCODING)
    return base64_encoded


def load_image_features(feature_file):
    print("Begin to load image features...")
    image_ids, image_feats = [], []
    with open(feature_file, "r") as fin:
        for line in tqdm(fin):
            obj = json.loads(line.strip())
            image_ids.append(obj['image_id'])
            image_feats.append(obj['feature'])
    image_feats_array = np.array(image_feats, dtype=np.float32)
    print("Finished loading image features.")
    return image_ids, image_feats_array


def precision_at_k(predictions, gts, k):
    """
    Calculates the precision at K.
    
    :param predictions: [(image_id, similarity_score), ...]
    :param gts: set of ground truth image_ids
    :param k: int, the top K results
    :return: float, precision
    """
    if len(predictions) > k:
        predictions = predictions[:k]
    
    predicted_ids = {p[0] for p in predictions}
    relevant_and_retrieved = predicted_ids.intersection(gts)
    precision = len(relevant_and_retrieved) / k
    return precision


def main():
    root_dir = '/mnt/data/retrieval/data/'
    data_dir = os.path.join(root_dir, 'images')
    tag_file = os.path.join(root_dir, 'meta/test.jsonl')
    model_type = 'finetune_gme7b_final'
    save_feature_file = os.path.join(root_dir, 'features', f'features_{model_type}_eas.jsonl')
    final_result_log = os.path.join(root_dir, 'results', f'retrieval_{model_type}_log_eas.txt')
    final_result = os.path.join(root_dir, 'results', f'retrieval_{model_type}_log_eas.jsonl')

    os.makedirs(os.path.join(root_dir, 'features'), exist_ok=True)
    os.makedirs(os.path.join(root_dir, 'results'), exist_ok=True)

    tag_dict = defaultdict(list)
    gt_image_ids = []
    with open(tag_file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            data = json.loads(line.strip())
            gt_image_ids.append(data['image_id'])
            img_id = data['image_id'].split('.')[0]
            for caption in data['tags']:
                tag_dict[caption.strip()].append(img_id)

    print('Total tags:', len(tag_dict.keys()))

    prefix = ''
    texts = [prefix + text for text in tag_dict.keys()]
    images = [os.path.join(data_dir, i+'.jpg') for i in gt_image_ids]
    print('Total images:', len(images))

    encode_images = True
    if encode_images:
        with open(save_feature_file, "w") as fout:
            for image_path in tqdm(images):
                image_id = os.path.basename(image_path).split('.')[0]
                image64 = encode_image_to_base64(image_path)
                input_data = [{'image': f"data:image/jpg;base64,{image64}"}]

                datas = json.dumps({'input': {'contents': input_data}})
                r = requests.post(HOST_URL, data=datas, headers=AUTH_HEADER)

                data = json.loads(r.content.decode(ENCODING))
                if data['status_code'] == 200:
                    if len(data['message']) != 0:
                        print('Part failed:', data['message'])
                    for result_item in data['output']['embeddings']:
                        fout.write(json.dumps({"image_id": image_id, "feature": result_item['embedding']}) + "\n")
                else:
                    print('Processed fail:', data['message'])

    image_ids, image_feats_array = load_image_features(save_feature_file)

    top_k_list = [1, 5, 10, 50, 100]
    top_k_list_precision  = [[] for _ in top_k_list]

    with open(final_result, 'w') as f_w, open(final_result_log, 'w') as f:
        for tag in tqdm(texts):
            datas = json.dumps({'input': {'contents': [{'text': tag}]}})
            r = requests.post(HOST_URL, data=datas, headers=AUTH_HEADER)
            data = json.loads(r.content.decode(ENCODING))

            if data['status_code'] == 200:
                if len(data['message']) != 0:
                    print('Part failed:', data['message'])

                for result_item in data['output']['embeddings']:
                    text_feat_tensor = result_item['embedding']
                    idx = 0
                    score_tuples = []
                    batch_size = 128
                    while idx < len(image_ids):
                        img_feats_tensor = torch.from_numpy(image_feats_array[idx:min(idx + batch_size, len(image_ids))]).cuda()
                        batch_scores = torch.from_numpy(np.array(text_feat_tensor)).cuda().float() @ img_feats_tensor.t()
                        for image_id, score in zip(image_ids[idx:min(idx + batch_size, len(image_ids))], batch_scores.squeeze(0).tolist()):
                            score_tuples.append((image_id, score))
                        idx += batch_size
                    
                    predictions = sorted(score_tuples, key=lambda x: x[1], reverse=True)
            else:
                print('Processed fail:', data['message'])

            gts = tag_dict[tag.replace(prefix, '')]

            # Write result
            predictions_tmp = predictions[:10]
            result_dict = {'tag': tag, 'gts': gts, 'preds': [pred[0] for pred in predictions_tmp]}
            f_w.write(json.dumps(result_dict, ensure_ascii=False, indent=4) + '\n')

            for top_k_id, k in enumerate(top_k_list):
                need_exit = False

                if k > len(gts):
                    k = len(gts)
                    need_exit = True

                prec = precision_at_k(predictions, gts, k)

                f.write(f'Tag {tag}, Len(GT) {len(gts)}, Precision@{k} {prec:.4f} \n')
                f.flush()

                if need_exit:
                    break
                else:
                    top_k_list_precision[top_k_id].append(prec)
                    
    for idx, k in enumerate(top_k_list):
        print(f'Precision@{k} {np.mean(top_k_list_precision[idx]):.4f}')


if __name__ == "__main__":
    main()

4.6 Use the Model

After the fine-tuned embedding model is deployed in EAS, you can create a model connection as described in Section 3.3.2 to use it for multimodal data management.