1. Overview
Multimodal data management uses large multimodal models and embedding models to preprocess data, such as images. This process generates rich metadata through smart tagging and semantic indexing. This metadata lets you search and filter your data. You can quickly find data subsets for specific scenarios to use in processes such as data annotation and model training. PAI datasets also provide a full set of OpenAPI operations, which allows for easy integration with your own platforms. The service architecture is shown in the following figure:

2. Limits
The following limits apply to multimodal data management in PAI:
Regions: Currently available in Hangzhou, Shanghai, Shenzhen, Ulanqab, Beijing, Guangzhou, Singapore, Germany, US (Virginia), China (Hong Kong), Tokyo, Jakarta, US (Silicon Valley), Kuala Lumpur, and Seoul.
Storage class: Multimodal data management can only be used with Object Storage Service (OSS).
File types: Only image files are supported. Supported file formats include JPG, JPEG, PNG, GIF, BMP, TIFF, and WEBP.
Number of files: A single dataset version can contain a maximum of 1,000,000 files. If you have special requirements, contact a PAI Product and Solution Architect (PDSA) to increase the limit.
Supported models:
Tagging models: Supports the Qwen-VL Max and Qwen-VL Plus models from the Model Studio platform.
Indexing models: Supports the General Multimodal Embedding (GME) model from PAI-Model Gallery for indexing. The model is deployed on PAI-Elastic Algorithm Service (EAS).
Metadata storage:
Metadata: Metadata is securely stored in PAI's built-in metadata store.
Embedding vectors: You can store embedding vectors in your own vector database. The following vector databases are supported:
Elasticsearch (Vector-enhanced Edition, version 8.17.0 or later)
OpenSearch (Vector Search Edition)
Milvus (version 2.4 or later)
Hologres (version 4.0.9 or later)
Dataset processing mode: Currently, smart tagging and semantic indexing tasks can run in full mode only. Incremental mode is not supported.
3. Procedure

3.1. Prerequisites
3.1.1. Activate PAI, create a default workspace, and obtain administrator permissions
Use your Alibaba Cloud account to activate PAI and create a workspace. Log on to the PAI console, select a region in the upper-left corner, and then authorize and activate the service. For more information, see Activate PAI and create a workspace.
Grant permissions to the operating account. You can skip this step if you are using your Alibaba Cloud account. If you are using a Resource Access Management (RAM) user, you must have the administrator role for the workspace. For more information about how to grant permissions, see the "Configure member roles" section of Manage workspaces.
3.1.2. Activate Model Studio and create an API key
For more information about how to activate Alibaba Cloud Model Studio and create an API key, see Obtain an API key.
3.1.3. Create a vector database
Create a vector database instance
Multimodal dataset management currently supports the following Alibaba Cloud vector databases:
Elasticsearch (Vector-enhanced Edition, version 8.17.0 or later)
OpenSearch (Vector Search Edition)
Milvus (version 2.4 or later)
Hologres (version 4.0.9 or later)
For information about how to create an instance for each cloud vector database, see the documentation for the corresponding product.
Configure the network and whitelist
Public network access
If your vector database instance has a public endpoint, add the following IP addresses to the instance's public access whitelist. This allows the multimodal data management service to access the instance over the public network. For information about how to configure the whitelist for Elasticsearch, see Configure a public or private IP address whitelist for an instance.
Region | IP address list |
Hangzhou | 47.110.230.142, 47.98.189.92 |
Shanghai | 47.117.86.159, 106.14.192.90 |
Shenzhen | 47.106.88.217, 39.108.12.110 |
Ulanqab | 8.130.24.177, 8.130.82.15 |
Private network access
To apply, submit a ticket.
Create a vector index table (Optional)
In some vector databases, a vector index table is also called a collection or an index.
The index table schema must be defined so that it can store the embedding vectors and the associated file metadata. This topic uses Elasticsearch as an example to demonstrate how to create a semantic index table by using Python. For information about how to create index tables for other types of vector databases, see the documentation for the corresponding cloud product. The following code provides an example:
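The following is a minimal sketch that uses the official elasticsearch Python client (8.x). The endpoint, credentials, field names (uri and embedding), and the vector dimension (1536, the GME-2B output dimension) are assumptions for illustration; adjust them to the schema that your semantic index actually requires. The index name dataset_embed_test is the name referenced later in this topic.
# Minimal sketch: create a semantic index table (an Elasticsearch index) by using
# the official elasticsearch Python client (8.x). The endpoint, credentials, field
# names, and vector dimension below are illustrative assumptions.
from elasticsearch import Elasticsearch

# Replace with the endpoint and credentials of your Elasticsearch instance.
es = Elasticsearch(
    "http://your-es-endpoint:9200",
    basic_auth=("elastic", "your_password"),
)

index_name = "dataset_embed_test"  # the table name referenced later in this topic

# Hypothetical schema: a keyword field for the OSS path of the file and a
# dense_vector field for the embedding (1536 dimensions for GME-2B).
mappings = {
    "properties": {
        "uri": {"type": "keyword"},
        "embedding": {
            "type": "dense_vector",
            "dims": 1536,
            "index": True,
            "similarity": "cosine",
        },
    }
}

if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, mappings=mappings)
    print(f"Created index {index_name}")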
3.2. Create a dataset
Go to the PAI workspace. In the navigation pane on the left, click AI Asset Management > Datasets > Create Dataset to go to the dataset configuration page.

Configure the dataset parameters. The key parameters are described in the following list. You can use the default values for the other parameters.
Set Storage Type to Object Storage Service (OSS).
Set Type to Advanced.
Set Content Type to Image.
Set OSS Path to the OSS storage path of the dataset. If you do not have a dataset, you can download the sample dataset retrieval_demo_data and upload it to OSS to test the multimodal data management feature.
Note: When you import a file or folder, only the path is recorded. The data is not copied.

Then, click OK to create the dataset.
3.3. Create connections
3.3.1. Create a smart tagging model connection
Go to the PAI workspace. In the navigation pane on the left, click AI Asset Management > Connections > Model Services > Create Connection to open the Create Connection page.

Select Model Studio Large Model Service and configure the Model Studio api_key.

After the connection is created, the Model Studio large model service appears in the list.

3.3.2. Create a custom semantic indexing model connection
In the navigation pane on the left, click Model Gallery. Find and deploy the GME multimodal retrieval model to obtain an EAS service. The deployment takes about 5 minutes. The deployment is successful when the status changes to Running.
Important: When you no longer need the index model, stop and delete the service to avoid further charges.

Go to the PAI workspace. In the navigation pane on the left, click AI Asset Management > Connections > Model Services > Create Connection to open the Create Connection page.
Select General Multimodal Embedding Model Service. Click the EAS Service input box and select the GME multimodal retrieval model that you deployed. If the service provider is not under your current account, you can select a third-party model service.


After the connection is created, the model connection service appears in the list.

3.3.3. Create a vector database connection
In the navigation pane on the left, click AI Asset Management > Connections > Databases > Create Connection to open the Create Connection page.

The multimodal retrieval service supports the Milvus, Hologres, OpenSearch, and Elasticsearch vector databases. This section uses Elasticsearch as an example to show how to create a connection. Select Search And Analytics Service - Elasticsearch and configure parameters such as uri, username, and password. For more information about the configuration, see Create a database connection.

After the connection is created, the vector database connection appears in the list.

3.4. Create a smart tagging task
3.4.1. Create smart tag definitions
In the navigation pane on the left, click AI Asset Management > Datasets > Smart Tag Definitions > Create Smart Tag to open the tag configuration page. The following example shows how to configure the tags:
Set Guiding Prompt to: You are an expert driver with many years of experience driving on both highways and urban roads.
Set Tag Definition to:
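As a purely illustrative example that mirrors the tags in the sample data (the category and value names below are hypothetical, and the exact format expected by the console may differ), a set of driving-scene tag definitions might look like the following:
vehicle_type: sedan, SUV, truck, bus
road_type: highway, urban road, rural road
weather: sunny, raining, snowing, foggy
time_of_day: day, night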
3.4.2. Create an offline smart tagging task
Click Custom Datasets. Click the name of the dataset to go to its details page. Then, click the Dataset jobs tab.

On the tasks page, click Create Task > Smart Tagging and configure the task parameters.

Set Dataset Version to the version that you want to tag, such as v1.
Set Smart Tagging Model Connection to the Model Studio model connection that you created.
Set Smart Tagging Model. Qwen-VL Max and Qwen-VL Plus are supported.
Set Maximum Concurrency based on the specifications of the EAS model service. For a single GPU, the recommended maximum concurrency is 5.
Set Smart Tag Definition to the smart tag definition that you created.
Note: Currently, a tagging task can only tag all files in a dataset version.
After the smart tagging task is created, it appears in the task list. You can monitor the running task and click the link on the right side of the list to view logs or stop the task.
Note: When you start a smart tagging task for the first time, metadata is built. This process may take a long time.
3.5. Create a semantic indexing task
Click the name of the dataset to go to its details page. In the Index Library Configuration section, click the edit icon.

Configure the index library.
Set Index Model Connection to the index model connection that you created in section 3.3.2.
Set Index Database Connection to the index library connection that you created in section 3.3.3.
Set Index Database Table to the name of the index table that you created in the Create a vector index table (Optional) section, which is `dataset_embed_test` in this example.
Click Save and then click Refresh Now. A semantic indexing task is created for the selected dataset version. This task updates the semantic index for all files in the version. You can click Semantic Indexing Task in the upper-right corner of the dataset details page to view the task details.
Note: When you start a semantic indexing task for the first time, metadata is built. This process may take a long time.
If you click Cancel instead of Refresh Now, you can create the task manually by following these steps:
On the dataset details page, click the Dataset Tasks tab to go to the tasks page.

Click Create Task > Semantic Indexing. Configure the dataset version and set the maximum concurrency based on the specifications of the Elastic Algorithm Service (EAS) model service. For a single GPU, a maximum concurrency of 5 is recommended. Then, click OK to create the task.

3.6. Preview data
After the smart tagging and semantic indexing tasks are complete, on the dataset details page, click View Data to preview the images in the dataset version.

On the View Data page, you can switch between Gallery View and List View.


Click an image to view a larger version and see its tags.

Click the checkbox in the upper-left corner of a thumbnail to select the image. You can select multiple images this way. You can also hold down the Shift key and click a checkbox to select multiple rows of data at once.

3.7. Basic data search
In the toolbar on the left side of the View Data page, you can perform an Index Search and a Tag Search. Press Enter or click Search to begin.
An Index Search lets you search using text keywords. The search is based on the results of the semantic index and works by matching the keyword vector with the image index results. In "Advanced Settings", you can set parameters such as top-k and the score threshold.

For an Index Search, you can also search by image. This search uses semantic index results. You can upload an image from your local computer or select an image from OSS. The search matches the vector of the uploaded image with the image index results in the dataset. In "Advanced Settings", you can set parameters such as top-k and the score threshold.

For a Tag Search, you can search by keyword. This search is based on the results of smart tagging and works by matching the keyword with the image tags. You can combine the logic of Contains Any Of The Following Tags, Contains All Of The Following Tags, and Excludes Any Of The Following Tags in a single search.

For a Metadata Search, you can search by file name, storage path, or last modified time.

All the preceding search conditions are combined using an AND operator.
3.8. Advanced data search (DSL)
For an Advanced Search, you can use a DSL Search. Domain-Specific Language (DSL) is a language used to express complex search conditions. It supports features such as grouping, Boolean logic (AND, OR, and NOT), range comparisons (>, >=, <, and <=), property existence (HAS and NOT HAS), token matching (:), and exact matching (=). DSL is suitable for advanced search scenarios. For more information about the syntax, see Obtain a list of file metadata in a dataset.
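For illustration only, assuming a property named tags exists (see the linked reference for the actual property names), a query that combines several of these operators might look like the following:
HAS tags AND (tags : "sedan" OR tags : "SUV") AND NOT tags : "night"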

3.9. Export search result sets
This step exports the search results as a file list index for subsequent model training or data analytics.
After the search is complete, you can click the Export Search Results button at the bottom of the page. Two export modes are supported:

3.9.1. Export to a file
Click Export To File. On the configuration page, set the export content and the destination OSS directory. Then, click OK.

To view the export progress, in the navigation pane on the left, click AI Asset Management > Tasks > Dataset Tasks.
Use the exported results. After the export is complete, you can mount the exported result file and the original dataset to the corresponding training environment, such as a Deep Learning Containers (DLC) or Data Science Workshop (DSW) instance. Then, you can use code to read the exported result file index and load the object files from the original dataset for model training or analysis.
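The following is a minimal sketch of this step. It assumes that the exported result file is a JSON Lines manifest in which each line records the OSS path of a file under a hypothetical uri field, and that the export file and the original dataset are mounted at the hypothetical paths /mnt/export and /mnt/dataset; check the actual export format and mount paths in your environment.
# Minimal sketch: read the exported result file and load the referenced images.
# Assumptions: the export is a JSONL manifest with a "uri" field per line, and the
# mount paths below are placeholders for your DLC or DSW environment.
import json
import os
from PIL import Image

manifest_path = "/mnt/export/result.jsonl"  # hypothetical mount path of the exported file
dataset_root = "/mnt/dataset"               # hypothetical mount path of the original dataset

with open(manifest_path, "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        uri = record["uri"]
        # Map an OSS path such as oss://bucket/path/to/img.jpg to the mounted local path.
        relative_path = uri.split("/", 3)[-1] if uri.startswith("oss://") else uri
        image = Image.open(os.path.join(dataset_root, relative_path))
        # ... feed the image into your training or analysis pipeline ...
        print(relative_path, image.size)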
3.9.2. Export to a logical dataset version
You can import the search results from an advanced dataset into a version of a logical dataset. You can then use the dataset software development kit (SDK) to access the data in that logical dataset version.
Click Export To Logical Dataset Version, select the destination logical dataset, and click Confirm.

If no logical datasets are available to select, create a logical dataset first and then repeat this step.
Use the logical dataset. After the import task is complete, the destination logical dataset contains the exported metadata. You can use the SDK to load and use the data. For information about how to use the SDK, see the dataset's details page.


The command to install the SDK is:
pip install https://pai-sdk.oss-cn-shanghai.aliyuncs.com/dataset/pai_dataset_sdk-1.0.0-py3-none-any.whl
4. (Optional) Customize a semantic indexing model
You can fine-tune a custom semantic retrieval model. After the model is deployed on EAS, you can follow the steps in section 3.3.2 to create a model connection and use it for multimodal data management.
4.1. Prepare data
This topic provides sample data. You can download retrieval_demo_data.
4.1.1. Data format requirements
Each data sample is saved as a line in JSON format in the `dataset.jsonl` file. Each sample must contain the following fields:
`image_id`: A unique identifier for the image, such as the image name or a unique ID.
`tags`: A list of text tags associated with the image. The tags are a string array.
Example:
{
  "image_id": "c909f3df-ac4074ed",
  "tags": ["silver sedan", "white SUV", "city street", "snowing", "night"]
}
4.1.2. File organization structure
Place all image files in a folder named `images`. Place the `dataset.jsonl` file in the same directory as the `images` folder.
Example directory structure:
├── images
│ ├── image1.jpg
│ ├── image2.jpg
│ └── image3.jpg
└── dataset.jsonl
You must use the original file name `dataset.jsonl`. The folder name `images` cannot be changed.
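Before training, you can run a quick check of this layout. The sketch below assumes that each image_id matches the file name (without extension) of an image in the images folder; adjust the matching rule to your own naming scheme.
# Minimal sketch: validate the fine-tuning data layout described above.
# Assumption: each image_id corresponds to an image file name (without extension).
import json
from pathlib import Path

data_dir = Path("retrieval_demo_data")  # hypothetical local path of the prepared data
image_stems = {p.stem for p in (data_dir / "images").iterdir() if p.is_file()}

with open(data_dir / "dataset.jsonl", "r", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        sample = json.loads(line)
        assert "image_id" in sample, f"line {line_no}: missing image_id"
        assert isinstance(sample.get("tags"), list), f"line {line_no}: tags must be a string array"
        if sample["image_id"] not in image_stems:
            print(f"line {line_no}: no image found for image_id {sample['image_id']}")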
4.2. Model training
In Model Gallery, find a retrieval-related model. Select a suitable model for fine-tuning and deployment based on your required model size and compute resources.

Model | Fine-tuning VRAM (bs=4) | Fine-tuning throughput (4 × A800, train_samples/second) | Deployment VRAM | Vector dimensions |
GME-2B | 14 GB | 16.331 | 5 GB | 1536 |
GME-7B | 35 GB | 13.868 | 16 GB | 3584 |
This section uses the GME-2B model as an example. Click Train, enter the data address (the default address is the sample data address), and specify the model output path to start training the model.


4.3. Model deployment
After the model is trained, click Deploy in the training task to deploy the fine-tuned model. Alternatively, to deploy the original GME model without fine-tuning, click Deploy on the model card in Model Gallery.

After the deployment is complete, you can obtain the corresponding EAS Endpoint and Token on the page.
4.4. Service invocation
Input parameters
Name | Type | Required | Example | Description |
model | String | Yes | pai-multimodal-embedding-v1 | The model type. Support for custom user models or base model version iterations may be added later. |
contents.input | list(dict) or list(str) | No | input = [{'text': text}]; input = [xxx, xxx, xxx, ...]; input = [{'text': text}, {'image': f"data:image/{image_format};base64,{image64}"}] | The content to be embedded. Currently, only text and image are supported. |
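As a sketch only, the following shows one way to call the deployed service over HTTP. The endpoint and token placeholders, the Authorization header, and the nesting of the request body ({"model": ..., "contents": {"input": [...]}}, inferred from the parameter name contents.input) are assumptions; verify them against the invocation information shown on the EAS service page.
# Minimal sketch: call the deployed GME embedding service.
# Assumptions: endpoint/token placeholders and the request body nesting
# (inferred from the parameter table above) must be verified against the
# invocation information of your EAS service.
import base64
import requests

endpoint = "http://your-eas-service-endpoint"  # from the EAS service page
token = "your-eas-service-token"               # from the EAS service page

with open("example.jpg", "rb") as f:
    image64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "pai-multimodal-embedding-v1",
    "contents": {
        "input": [
            {"text": "silver sedan on a snowy city street"},
            {"image": f"data:image/jpeg;base64,{image64}"},
        ]
    },
}

response = requests.post(
    endpoint,
    headers={"Authorization": token, "Content-Type": "application/json"},
    json=payload,
)
result = response.json()
print(result["status_code"], len(result["output"]["embeddings"]))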
Response parameters
Name | Type | Example | Description |
status_code | Integer | 200 | HTTP status code. 200: The request was successful. 204: The request was partially successful. 400: The request failed. |
message | list(str) | ['Invalid input data: must be a list of strings or dict'] | Error message. |
output | dict | See the table below. | Embedding result. |
The DashScope-style response has the form {'output': {'embeddings': list(dict), 'usage': xxx, 'request_id': xxx}}. The 'usage' and 'request_id' fields are not currently used.
The elements in `embeddings` contain the following keys. If an index fails, the reason is added to the `message` field.
Name | Type | Example | Description |
index | Integer | 0 | The index of the corresponding input. |
embedding | List[Float] | [0.0391846, 0.0518188, ..., -0.0329895, 0.0251465] (1536 dimensions for GME-2B) | The embedding vector. |
type | String | "text" | The type of the embedded content, such as text or image. |
Sample output:
{
"status_code": 200,
"message": "",
"output": {
"embeddings": [
{
"index": 0,
"embedding": [
-0.020782470703125,
-0.01399993896484375,
-0.0229949951171875,
...
],
"type": "text"
}
]
}
}
4.5. Model evaluation
The following table shows the evaluation results on our sample data (the evaluation file used is available at the provided link):
Model | Original Model Precision | Fine-tuned Model Precision (1 epoch) |
gme2b | Precision@1: 0.3542; Precision@5: 0.5280; Precision@10: 0.5923; Precision@50: 0.5800; Precision@100: 0.5792 | Precision@1: 0.4271; Precision@5: 0.6480; Precision@10: 0.7308; Precision@50: 0.7331; Precision@100: 0.7404 |
gme7b | Precision@1: 0.3958; Precision@5: 0.5920; Precision@10: 0.6667; Precision@50: 0.6517; Precision@100: 0.6415 | Precision@1: 0.4375; Precision@5: 0.6680; Precision@10: 0.7590; Precision@50: 0.7683; Precision@100: 0.7723 |