1. Overview
Multimodal data management lets you process data, such as images, using multimodal large language models (MLLMs) and embedding models. This preprocessing, through Smart Tagging and Semantic Indexing, generates rich metadata. Use this metadata to search, filter, and quickly find specific data subsets for downstream tasks like data annotation and model training. Additionally, Platform for AI (PAI) datasets provide a comprehensive OpenAPI to simplify integration with your custom platforms. The following figure shows the product architecture.

2. Limitations
Multimodal data management in PAI has the following limitations:
Region: This feature is available in the following regions: China (Hangzhou), China (Shanghai), China (Shenzhen), China (Ulanqab), China (Beijing), China (Guangzhou), Singapore, Germany (Frankfurt), US (Virginia), China (Hong Kong), Japan (Tokyo), Indonesia (Jakarta), US (Silicon Valley), Malaysia (Kuala Lumpur), and Korea (Seoul).
Storage type: Currently, multimodal data management only supports Object Storage Service (OSS).
File type: Only image files are supported. Supported formats include JPG, JPEG, PNG, GIF, BMP, TIFF, and WEBP.
File quantity: A single dataset version supports a maximum of 1,000,000 files. To increase the capacity for special requirements, contact PAI PDSA.
Models:
Tagging models: Supports Qwen-VL-MAX and Qwen-VL-Plus models from the Model Studio platform.
Indexing models: Supports the Model Studio Multimodal Embedding Model (such as tongyi-embedding-vision-plus) and GME models from the PAI Model Gallery. You can deploy these models to PAI-EAS.
Metadata storage:
Metadata: PAI securely stores metadata in its built-in metadatabase.
Embedding vectors: Supports storage in the following custom vector databases:
Elasticsearch (Vector Search Edition, version 8.17.0 or later)
OpenSearch (Vector Search Edition)
Milvus (version 2.4 or later)
Hologres (version 4.0.9 or later)
Lindorm (Vector Engine Edition)
Dataset processing mode: Supports running Smart Tagging and Semantic Indexing tasks in full and incremental modes.
3. Workflow

3.1 Prerequisites
3.1.1 Activate PAI, create a default workspace, and obtain administrator permissions
Use a root account to activate PAI and create a workspace. Log on to the PAI console, select a region in the upper-left corner, and then authorize and activate the product.
Authorize the operating account. You can skip this step if you are using a root account. RAM users must have the Workspace administrator role. For instructions on how to authorize an account, see the "Configure member roles" section in Create and manage workspaces.
3.1.2 Activate Model Studio and create an API key
To activate Alibaba Cloud Model Studio and create an API key, see Get an API key.
3.1.3 Create a vector database
Create a vector database instance
Multimodal dataset management currently supports the following Alibaba Cloud vector databases:
Elasticsearch (Vector Search Edition, 8.17.0 or later)
OpenSearch (Vector Search Edition)
Milvus (2.4 or later)
Hologres (4.0.9 or later)
Lindorm (Vector Engine Edition)
For instructions on how to create an instance for each cloud vector database, refer to the documentation for the respective product.
Configure network and whitelist settings
Public network access
If your vector database instance has a public endpoint enabled, add the following IP addresses to the instance's public access whitelist. This allows the multimodal data management service to access the instance over the public network. For instructions on how to set up an Elasticsearch whitelist, see Configure a public or private IP address whitelist for an Elasticsearch cluster.
Region | IP address list |
Hangzhou | 47.110.230.142, 47.98.189.92 |
Shanghai | 47.117.86.159, 106.14.192.90 |
Shenzhen | 47.106.88.217, 39.108.12.110 |
Ulanqab | 8.130.24.177, 8.130.82.15 |
Beijing | 39.107.234.20, 182.92.58.94 |
Private network access
To use private network access, submit a ticket.
Create a vector index table (Optional)
The system can create an index table automatically. You can skip this step unless you need a custom one.
In some vector databases, a vector index table is also known as a Collection or an Index.
The index table must be created with the structure that the semantic indexing service expects.
This section uses Elasticsearch as an example to show how to create a semantic index table with Python. For instructions on how to create index tables for other types of vector databases, refer to the documentation for the respective product. The sample code is as follows:
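The following is a minimal sketch rather than the authoritative schema. It assumes that the index only needs a document ID plus a dense_vector field whose dimension matches your embedding model (for example, 1536 for GME-2B or 3584 for GME-7B, as listed in section 4.2). The endpoint, credentials, index name dataset_embed_test, and field names are placeholders that you should align with the structure expected by the semantic indexing service.

# Requires the Python Elasticsearch client for 8.x: pip install "elasticsearch>=8"
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials of your Elasticsearch (Vector Search Edition) instance.
es = Elasticsearch(
    "http://xxxx.elasticsearch.aliyuncs.com:9200",
    basic_auth=("{username}", "{password}"),
)

# Hypothetical schema: a keyword ID field plus a dense_vector field.
# Set dims to the output dimension of the embedding model that you connect in section 3.3.2.
es.indices.create(
    index="dataset_embed_test",
    mappings={
        "properties": {
            "id": {"type": "keyword"},
            "embedding": {
                "type": "dense_vector",
                "dims": 1536,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# Confirm that the index table was created.
print(es.indices.get(index="dataset_embed_test"))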
3.2 Create a dataset
Go to your PAI workspace. In the left-side navigation pane, choose AI Asset Management > Datasets > Create Dataset.

Configure the dataset parameters. Key parameters are as follows. You can keep the default values for other parameters.
Storage: Select Object Storage Service.
Type: Select Premium.
Content Type: Select Image.
OSS Path: Select the OSS storage path for the dataset. If you have not prepared a dataset, you can download the sample dataset retrieval_demo_data, upload it to OSS, and then try out the multimodal data management feature.
Note: Importing a file or folder only records its path and does not copy the data.

Then, click OK to create the dataset.
3.3 Create connections
3.3.1 Create a Smart Tagging model connection
Go to your PAI workspace. In the left-side navigation pane, choose AI Asset Management > Connection > Model Service > Create Connection.

Select Alibaba Cloud Model Studio Service and configure the Model Studio api_key.

After the connection is created, you can see the Alibaba Cloud Model Studio Service in the list.

3.3.2 Create a Semantic Indexing model connection
If you plan to use the Model Studio Semantic Indexing model service, you can skip this step. Otherwise, in the left-side navigation pane, click Model Gallery, then find and deploy the GME multimodal retrieval model to obtain an EAS service. The deployment takes about 5 minutes. When the status is Running, the deployment is successful.
Important: When you no longer need the indexing model, you can stop and delete the service to avoid further charges.

Go to your PAI workspace. In the left-side navigation pane, choose AI Asset Management > Connection > Model Service > Create Connection.
Configure the model connection information based on whether you chose the Model Studio Semantic Indexing model or a custom-deployed EAS Semantic Indexing model.
Use the Model Studio Semantic Indexing model
For Connection Type, select General Multimodal Embedding Model Service.
For Service Provider, select Third-party Model Service.
Model Name: tongyi-embedding-vision-plus.
base_url: https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
api_key: Get an API key and enter it.

Use a custom-deployed EAS Semantic Indexing model
For Connection Type, select General-purpose Multimodal Embedding Model Service.
For Service Provider, select PAI-EAS Model Service.
EAS Service: Select the GME multimodal retrieval model service that you just deployed. If the service is not deployed under the current account, select Third-party Model Service.


After the connection is created, you can see the model connection service in the list.

3.3.3 Create a vector database connection
In the left-side navigation pane, choose AI Asset Management > Connection > Database > Create Connection.

The multimodal retrieval service supports vector databases such as Milvus, Lindorm, OpenSearch, Elasticsearch, and Hologres. This topic uses Elasticsearch as an example to describe how to create a database connection. Select Elasticsearch and configure parameters such as url, username, and password. For more information, see Create a database connection.

The connection formats for each vector database are as follows:
Milvus
uri: http://xxx.milvus.aliyuncs.com:19530
database: {your_database}
token: root:{password}
OpenSearch
uri: http://xxxx.ha.aliyuncs.com
username: {username}
password: {password}
Hologres
host: xxxx.hologres.aliyuncs.com
database: {your_database}
port: {port}
access_key_id={password}
Elasticsearch
uri: http://xxxx.elasticsearch.aliyuncs.com:9200
username: {username}
password: {password}
Lindorm
uri: xxxx.lindorm.aliyuncs.com:{port}
username: {username}
password: root:{password}
After the connection is created, the vector database connection appears in the list.

3.4 Create a Smart Tagging task
3.4.1 Create a Smart Tag Definition
In the left-side navigation pane, choose AI Asset Management > Datasets > Intelligent Tag Definition > Create Intelligent Tag Definition. The tag configuration page opens. The following is an example configuration:
Guide Prompt: You are a seasoned driver with extensive experience on both highways and urban roads.
Tag Definition:
3.4.2 Create an Offline Smart Tagging Task
Click Custom Dataset, click a dataset name to open its details page, and then click the Dataset jobs tab.

On the task page, click Create job > Smart tag to configure the task parameters.

Dataset Version: Select the version to label, such as v1.
Labeling Model Connection: Select an existing Model Studio model connection.
Smart Labeling Model: Supported models include Qwen-VL-MAX and Qwen-VL-Plus.
Max Concurrency: This value depends on the specifications of the EAS model service. For a single card, the recommended maximum concurrency is 5.
Intelligent Tag Definition: Select the definition that you just created.
Labeling Mode: The available modes are Incremental and Full.
After the smart tagging task is created, it appears in the task list. You can click the links in the Actions column to view logs or stop the task.
Note: When you start a smart tagging task for the first time, the system builds the metadata. This process may take a long time.
3.5 Create a Semantic Indexing Task
Click the dataset name to open the details page. In the Index Configuration area, click the edit button.

Configure the index.
Index Model Connection: Select the connection that you created in 3.3.2.
Index Database Connection: Select the index database connection you created in Section 3.3.3.
Index Database Table: Enter the name of the index table created in Create a vector index table (Optional), such as dataset_embed_test.
Click Save > Refresh Now. A semantic index task is created to update the semantic index for all files in the selected dataset version. You can click Semantic Indexing Task in the upper-right corner of the dataset details page to view the task details.
Note: When you start a semantic indexing task for the first time, the system builds the metadata. This process may take a long time.
If you click Cancel instead of Refresh Now, you can create the task manually by following these steps:
On the dataset details page, click Dataset jobs to open the Tasks page.

Click Create job > Semantic Indexing. Configure the dataset version and set the maximum number of concurrent requests based on the EAS model service specifications (the recommended maximum is 5 for a single card). Then, click Confirm to create the semantic index task.

3.6 Preview Data
After the Smart Tagging and Semantic Indexing tasks are complete, go to the dataset details page and click View Data to preview the images in that dataset version.

On the View Data page, you can preview the images in the dataset version. You can switch between Gallery View and List View.


Click a specific image to view a larger version and see the tags it contains.

Click the checkbox in the upper-left corner of a thumbnail to select it. You can also hold down the Shift key and click a checkbox to select multiple rows of data at once.

3.7 Basic Data Search (Combined Search)
On the left toolbar of the View Data page, you can perform Index Retrieval and Search by Tag. Press Enter or click Search to start the search.
Index Retrieval: Performs a text keyword search by matching keyword vectors with image index vectors from the semantic index. In Advanced Settings, you can set parameters such as topk and the score threshold.

Index Retrieval (search by image): Based on semantic indexing, you can upload an image from your local computer or select an image from OSS to search for matching images in the dataset by comparing vectors. In Advanced Settings, you can set parameters such as topk and Score threshold.

Search by Tag: Finds images by matching keywords with tags from the Smart Tagging feature. You can combine the following search conditions: Include Any of Following (OR), Include All Following (AND), and Exclude Any of Following (NOT).

Metadata search: You can search for files by file name, storage path, and last modified time.

All the preceding search conditions are combined with an AND operator.
3.8 Advanced Data Search (DSL)
Advanced search uses DSL search, a domain-specific language for expressing complex search conditions. DSL is ideal for advanced search scenarios and supports features such as grouping, Boolean logic (AND/OR/NOT), range comparison (>, >=, <, <=), property existence (HAS/NOT HAS), token matching (:), and exact match (=). For more information about the syntax, see Retrieve a list of dataset file metadata.
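The following query is purely illustrative and only combines the operators listed above. The property names (tags, file_name, size) are hypothetical; see Retrieve a list of dataset file metadata for the actual properties, operators, and quoting rules.
(tags:"snow" OR tags:"night") AND size >= 1024 AND HAS tags AND NOT file_name = "test.jpg"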

3.9 Export search results
This step exports the search results as a file list index for subsequent model training or data analytics.
After the search is complete, you can click the Export Results button at the bottom of the page. Two export modes are available:

3.9.1 Export to a file
Click Export as file. On the configuration page, set the export content and the destination OSS folder, and then click OK.

You can view the export progress under AI Asset Management > Job > Dataset jobs in the left navigation bar.
Use the exported results. After the export is complete, you can mount the exported result file and the original dataset to the training environment, such as a DLC or DSW instance. Then, you can write code to read the exported result file index and load the object files from the original dataset for model training or data analytics.
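The following is a minimal sketch of such reading code, under two assumptions that you should adjust to your actual setup: the exported result is mounted at /mnt/export/result.jsonl and contains one JSON record per line with a hypothetical path field that is relative to the original dataset mounted at /mnt/dataset.

import json

from PIL import Image  # Requires: pip install pillow

EXPORT_FILE = "/mnt/export/result.jsonl"  # hypothetical mount path of the exported result file
DATASET_ROOT = "/mnt/dataset"             # hypothetical mount path of the original dataset

with open(EXPORT_FILE) as f:
    for line in f:
        record = json.loads(line)                               # assumes one JSON record per line
        image = Image.open(f"{DATASET_ROOT}/{record['path']}")  # 'path' is a hypothetical field name
        # Feed `image` and the exported metadata into your training or analytics code.
        print(record.get("path"), image.size)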
3.9.2 Export to a logical dataset version
You can import the search results into a version of another logical dataset. You can then access the data of that logical dataset version by using the dataset software development kit (SDK).
Click Export to logical dataset version, select the target logical dataset, and then click Confirm.

If no logical dataset is available, create one first.
Use the logical dataset. After the import task is complete, the destination logical dataset contains the exported metadata. You can use the SDK to load and use the data. For information about how to use the SDK, see the dataset details page.


The command to install the SDK is:
pip install https://pai-sdk.oss-cn-shanghai.aliyuncs.com/dataset/pai_dataset_sdk-1.0.0-py3-none-any.whl
4. Custom semantic indexing model (Optional)
You can fine-tune a custom semantic retrieval model. After the model is deployed in EAS, you can create a model connection as described in Section 3.3.2 to use it for multimodal data management.
4.1 Data preparation
This topic provides sample data. You can click retrieval_demo_data to download it.
4.1.1 Data format requirements
Each data sample is saved as a line in JSON format in the dataset.jsonl file. Each sample must contain the following fields:
image_id: A unique identifier for the image, such as the image name or a unique ID.
tags: A list of text tags associated with the image. The tags are an array of strings.
Example format:
{
"image_id": "c909f3df-ac4074ed",
"tags": ["silver sedan", "white SUV", "city street", "snowing", "night"],
}4.1.2 File organization
Place all image files in a folder named images. Place the dataset.jsonl file in the same directory as the images folder.
Directory structure example:
├── images
│ ├── image1.jpg
│ ├── image2.jpg
│ └── image3.jpg
└── dataset.jsonl
The filename dataset.jsonl and the folder name images are required and cannot be changed.
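Before uploading the data to OSS, you can sanity-check the layout and fields with a short script. The following sketch only relies on the requirements above (an images folder, a dataset.jsonl file in the same directory, and the image_id and tags fields); the directory path is a placeholder.

import json
import os

DATA_ROOT = "."  # directory that contains the images folder and dataset.jsonl

images_dir = os.path.join(DATA_ROOT, "images")
jsonl_path = os.path.join(DATA_ROOT, "dataset.jsonl")
assert os.path.isdir(images_dir), "missing images folder"
assert os.path.isfile(jsonl_path), "missing dataset.jsonl"

with open(jsonl_path) as f:
    for line_no, line in enumerate(f, start=1):
        sample = json.loads(line)
        assert "image_id" in sample, f"line {line_no}: missing image_id"
        tags = sample.get("tags")
        assert isinstance(tags, list) and all(isinstance(t, str) for t in tags), (
            f"line {line_no}: tags must be an array of strings"
        )

print("dataset.jsonl format check passed")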
4.2 Model training
In the Model Gallery, find retrieval-related models. Select a suitable model for fine-tuning and deployment based on the required model size and compute resources.

Model | Fine-tuning VRAM (bs=4) | Fine-tuning throughput on 4 × A800 (train_samples/second) | Deployment VRAM | Vector dimensions |
GME-2B | 14 GB | 16.331 | 5 GB | 1536 |
GME-7B | 35 GB | 13.868 | 16 GB | 3584 |
For example, to train the GME-2B model, click Train, and then enter the data address and the model output path to start the training. The data address defaults to the sample data address.


4.3 Model deployment
After a model is trained, you can deploy the fine-tuned model by clicking Deploy in the training task.
To deploy the original GME model, click the Deploy button on the model tab in Model Gallery.

After the deployment completes, you can retrieve the EAS Endpoint and Token from the page.
4.4 Service invocation
Input parameters
Name | Type | Required | Example | Description |
model | String | Yes | pai-multimodal-embedding-v1 | The model type. Support for custom user models and iterations of the base model version may be added later. |
contents.input | list(dict) or list(str) | No | input = [{'text': text}]; input = [xxx, xxx, xxx, ...]; input = [{'text': text}, {'image': f"data:image/{image_format};base64,{image64}"}] | The content to be embedded. Currently, only text and image are supported. |
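The following is a minimal client sketch, not the official invocation code. It assumes that the service accepts a JSON body in which contents.input is nested under a contents object, as the parameter name suggests, and that the EAS token is passed in the Authorization header as is usual for EAS services; verify the exact request format on the service's invocation page. The endpoint, token, image file, and model name (taken from the example above) are placeholders.

import base64

import requests  # Requires: pip install requests

EAS_ENDPOINT = "http://xxxx.pai-eas.aliyuncs.com/api/predict/{your_service_name}"  # placeholder
EAS_TOKEN = "{your_eas_token}"                                                     # placeholder

# Encode a local image as a data URI, as shown in the contents.input example.
with open("image1.jpg", "rb") as f:
    image64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "pai-multimodal-embedding-v1",
    "contents": {
        "input": [
            {"text": "silver sedan on a snowy street"},
            {"image": f"data:image/jpeg;base64,{image64}"},
        ]
    },
}

response = requests.post(
    EAS_ENDPOINT,
    headers={"Authorization": EAS_TOKEN, "Content-Type": "application/json"},
    json=payload,
)
result = response.json()

# Each element of output.embeddings carries the input index, the vector, and its type
# (see the response parameters below).
for item in result["output"]["embeddings"]:
    print(item["index"], item["type"], len(item["embedding"]))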
Response parameters
Name | Type | Example | Description |
status_code | Integer | 200 | The HTTP status code. 200: The request was successful. 204: The request was partially successful. 400: The request failed. |
message | list(str) | ['Invalid input data: must be a list of strings or dict'] | The error message. |
output | dict | See the following table. | The embedding result. |
The result from DashScope is in the following format: {'output': {'embeddings': list(dict), 'usage': xxx, 'request_id': xxx}}. The usage and request_id parameters are currently not used.
The elements in embeddings contain the following keys. If an index fails, a message key is added to the corresponding element to provide the reason for the failure.
Name | Type | Example | Description |
index | Integer | 0 | The index (data ID) of the corresponding input element. |
embedding | List[Float] | [0.0391846, 0.0518188, ..., -0.0329895, 0.0251465] | The embedding vector. The dimension depends on the model, for example, 1536. |
type | String | "text" | The type of the embedded content: text or image. |
message | String | "Internal execute error." | The error message. Present only for elements that failed. |
Sample output:
{
"status_code": 200,
"message": "",
"output": {
"embeddings": [
{
"index": 0,
"embedding": [
-0.020782470703125,
-0.01399993896484375,
-0.0229949951171875,
...
],
"type": "text"
}
]
}
}
4.5 Model evaluation
The following table shows the evaluation results on our sample data. The evaluation file was used for this test.
Model | Metric | Original model precision | Precision after 1 epoch of fine-tuning |
GME-2B | Precision@1 | 0.3542 | 0.4271 |
GME-2B | Precision@5 | 0.5280 | 0.6480 |
GME-2B | Precision@10 | 0.5923 | 0.7308 |
GME-2B | Precision@50 | 0.5800 | 0.7331 |
GME-2B | Precision@100 | 0.5792 | 0.7404 |
GME-7B | Precision@1 | 0.3958 | 0.4375 |
GME-7B | Precision@5 | 0.5920 | 0.6680 |
GME-7B | Precision@10 | 0.6667 | 0.7590 |
GME-7B | Precision@50 | 0.6517 | 0.7683 |
GME-7B | Precision@100 | 0.6415 | 0.7723 |
4.6 Use the Model
After the fine-tuned embedding model is deployed in EAS, you can create a model connection as described in Section 3.3.2 to use it for multimodal data management.