The multimodal embedding API converts text, images, or a combination of both into dense vector representations. Use the resulting embeddings for cross-modal retrieval (text-to-image, image-to-text) and similarity search.
The GME model in this service is built on the Qwen2-VL multimodal large language model (MLLM) and supports both single-modal and multimodal input combinations; the M2-Encoder models accept a single modality (text or image) per input entry.
Available models
| Model | Service ID | Dimensions | Language | Description |
|---|---|---|---|---|
| M2-Encoder-Multimodal Vector Model | ops-m2-encoder | 768 | Chinese-English bilingual | Trained on 6 billion image-text pairs (3 billion Chinese, 3 billion English) based on BM-6B. Supports cross-modal retrieval between text and images, and image classification. |
| M2-Encoder-Large-Multimodal Vector Model | ops-m2-encoder-large | 1024 | Chinese-English bilingual | Larger model with 1 billion parameters. Provides stronger expression capabilities and higher performance in multimodal tasks compared to ops-m2-encoder. |
| GME Multimodal Vector-Qwen2-VL-2B | ops-gme-qwen2-vl-2b-instruct | 1536 | - | Trained on the Qwen2-VL MLLM. Supports single-modal and multimodal input combinations, processing text, images, and combined data types. |
For ops-m2-encoder and ops-m2-encoder-large, text and an image cannot be combined in the same input entry.
Rate limits
The queries per second (QPS) limits apply per Alibaba Cloud account, including all RAM users under that account.
| Model | QPS |
|---|---|
| ops-m2-encoder | 10 |
To request a higher QPS limit, submit a ticket.
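To avoid hitting the server-side limit, a client can space its requests by at least 1/QPS seconds. A minimal sketch of such a throttle, assuming a single-threaded caller (the class and its injectable clock are illustrative, not part of the API):

```python
import time

class QpsThrottle:
    """Blocks so that successive acquire() calls run at most `qps` per second."""

    def __init__(self, qps, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / qps
        self.clock = clock      # injectable for testing
        self.sleep = sleep
        self._next_ok = 0.0     # earliest time the next call may proceed

    def acquire(self):
        now = self.clock()
        if now < self._next_ok:
            self.sleep(self._next_ok - now)
            now = self._next_ok
        self._next_ok = now + self.interval

# Call throttle.acquire() before each API request.
throttle = QpsThrottle(qps=10)
```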
Prerequisites
Before you begin, make sure you have:
An API key and the service endpoint for the AI Search Open Platform. For details, see Obtain a service endpoint.
API reference
Endpoint
POST {host}/v3/openapi/workspaces/{workspace_name}/multi-modal-embedding/{service_id}

Replace the path parameters with actual values:
| Parameter | Description | Example |
|---|---|---|
| host | Service endpoint. Supports Internet and VPC access. | http://ops-cn-hangzhou.opensearch.aliyuncs.com |
| workspace_name | Workspace name. | default |
| service_id | ID of the embedding model. | ops-m2-encoder |
Request
The request body must not exceed 8 MB.
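Batches of Base64-encoded images can exceed the 8 MB cap quickly, so a client-side size check before sending is a reasonable safeguard. A sketch (the helper name is illustrative, not part of the API):

```python
import json

MAX_BODY_BYTES = 8 * 1024 * 1024  # 8 MB request-body limit

def body_within_limit(payload: dict) -> bool:
    """Return True if the JSON-serialized payload fits the 8 MB cap."""
    return len(json.dumps(payload).encode("utf-8")) <= MAX_BODY_BYTES

payload = {"input": [{"image": "http://example.com/photo.jpg"}]}
```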
Headers
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| Content-Type | String | Yes | Request content type. | application/json |
| Authorization | String | Yes | API key for authentication. | Bearer OS-d1**2a |
Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| input | List<ContentObject> | Yes | List of inputs. Maximum 32 entries per request. |
ContentObject fields:
| Field | Type | Required | Description |
|---|---|---|---|
| text | String | No | Text to embed. |
| image | String | No | Image to embed. Accepts a URL or Base64-encoded data. |
For ops-m2-encoder and ops-m2-encoder-large, each ContentObject must contain either a text field or an image field. Providing both fields causes the image to be ignored.
For ops-gme-qwen2-vl-2b-instruct, a ContentObject can contain both text and image fields for combined multimodal embedding.
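The difference between the two model families shows up in how the input list is assembled. A sketch of both payload shapes, following the body parameters above (no request is sent; the example text and URL are placeholders):

```python
# ops-m2-encoder / ops-m2-encoder-large: one modality per entry.
m2_payload = {
    "input": [
        {"text": "a red bicycle leaning against a wall"},  # text-only entry
        {"image": "http://example.com/photo.jpg"},         # image-only entry
    ]
}

# ops-gme-qwen2-vl-2b-instruct: text and image may share one entry.
gme_payload = {
    "input": [
        {
            "text": "product photo of a red bicycle",
            "image": "http://example.com/photo.jpg",
        }
    ]
}

# The API accepts at most 32 entries per request.
assert len(m2_payload["input"]) <= 32
```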
Image input formats:

URL -- Must be accessible.

{ "image": "http://example.com/photo.jpg" }

Base64 -- Use the format data:image/{format};base64,{base64_image}, where {format} is the actual image type (for example, jpeg or png) and {base64_image} is the encoded data.

{ "image": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAoHCB..." }
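Building the Base64 form from raw image bytes can be sketched as follows (the helper name is illustrative, not part of the API):

```python
import base64

def to_data_uri(image_bytes: bytes, image_format: str = "jpeg") -> str:
    """Encode raw image bytes as a data:image/{format};base64,... string."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{image_format};base64,{b64}"

# Usage with a file on disk:
# with open("photo.jpg", "rb") as f:
#     entry = {"image": to_data_uri(f.read(), "jpeg")}
```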
Response
A successful response includes the following fields:
| Field | Type | Description |
|---|---|---|
| request_id | String | Unique request identifier. |
| latency | Int | Processing time in milliseconds. |
| usage.image | Int | Number of images processed. |
| usage.token_count | Int | Number of tokens processed. |
| result.embeddings | List | Array of embedding results. Each element corresponds to one input entry. |
| result.embeddings[].index | Int | Position of the input in the request array (zero-based). |
| result.embeddings[].embedding | List<Double> | The embedding vector. |
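Each returned embedding maps back to its input through the index field, so results can be matched to inputs regardless of ordering; for cross-modal retrieval, the vectors are then typically compared by cosine similarity. A sketch against the response shape above (the sample vectors here are made up for illustration):

```python
import math

def embeddings_by_index(response: dict) -> dict:
    """Map each input's position to its embedding vector."""
    return {e["index"]: e["embedding"] for e in response["result"]["embeddings"]}

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sample = {"result": {"embeddings": [
    {"index": 1, "embedding": [0.0, 1.0]},   # e.g. an image entry
    {"index": 0, "embedding": [1.0, 0.0]},   # e.g. a text entry
]}}
vecs = embeddings_by_index(sample)
```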
Error responses include code and message fields describing the error:
| Field | Type | Description |
|---|---|---|
| request_id | String | Unique request identifier. |
| latency | Int | 0 for error responses. |
| code | String | Error code. |
| message | String | Error description. |
For a full list of error codes, see Status codes.
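Because error responses carry a code field that success responses lack, a simple presence check distinguishes the two cases. A sketch (the helper name and exception type are illustrative):

```python
def raise_for_api_error(response: dict) -> dict:
    """Raise if the response body is an error; otherwise return it unchanged."""
    if "code" in response:
        raise RuntimeError(f"{response['code']}: {response.get('message', '')}")
    return response
```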
Examples
Generate an image embedding
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <your-api-key>" \
"http://<your-endpoint>/v3/openapi/workspaces/default/multi-modal-embedding/ops-m2-encoder" \
-d '{
"input": [
{
"image": "http://example.com/photo.jpg"
}
]
}'

Replace the following placeholders with actual values:
| Placeholder | Description | Example |
|---|---|---|
| <your-api-key> | API key for authentication | OS-d1xxxxx2a |
| <your-endpoint> | Service endpoint | ops-cn-hangzhou.opensearch.aliyuncs.com |
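The same call can be assembled in Python with the standard library; building the request object without sending it shows how the endpoint, headers, and body fit together (the endpoint and key remain placeholders):

```python
import json
import urllib.request

def build_embedding_request(endpoint, workspace, service_id, api_key, payload):
    """Assemble the POST request for the multimodal embedding endpoint (not sent here)."""
    url = (f"http://{endpoint}/v3/openapi/workspaces/{workspace}"
           f"/multi-modal-embedding/{service_id}")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_embedding_request(
    "ops-cn-hangzhou.opensearch.aliyuncs.com", "default", "ops-m2-encoder",
    "<your-api-key>", {"input": [{"image": "http://example.com/photo.jpg"}]},
)
# To execute: urllib.request.urlopen(req)  # requires a valid key and endpoint
```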
Sample success response
{
"request_id": "B4AB89C8-B135-****-A6F8-2BAB801A2CE4",
"latency": 38,
"usage": {
"image": 1,
"token_count": 28
},
"result": {
"embeddings": [
{
"index": 0,
"embedding": [
-0.033447265625,
0.10577392578125,
-0.0015211105346679688,
-0.044189453125,
"...",
0.004688262939453125,
-4.5239925384521484E-5
]
}
]
}
}

Sample error response
{
"request_id": "651B3087-8A07-****-B931-9C4E7B60F52D",
"latency": 0,
"code": "InvalidParameter",
"message": "JSON parse error: Cannot deserialize value of type `InputType` from String \"xxx\""
}