How to use the Multimodal-Embedding API - Alibaba Cloud Model Studio

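The multimodal embedding models convert text, images, or videos into floating-point vectors in a unified semantic space (1024 dimensions for multimodal-embedding-v1; the other models support several dimensions). They are suited to video classification, image classification, image-text retrieval, and similar tasks.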

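Core capabilities

Cross-modal retrieval: semantic search across modalities, such as text-to-image, image-to-video, and image-to-image search.
Semantic similarity: measure the semantic similarity between content of different modalities in a unified vector space.
Content classification and clustering: group, tag, and cluster content based on its semantic vectors.

Key feature: the vectors produced for every modality (text, image, video) lie in the same semantic space, so they can be matched and compared across modalities directly, for example by computing cosine similarity. For more on model selection and usage, see Text and multimodal embedding.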
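Because all modalities share one vector space, cross-modal matching reduces to comparing vectors. The following is a minimal sketch of that idea using only the standard library; the function names and the toy 3-dimensional vectors are illustrative assumptions, not values returned by the service.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
text_query = [0.1, 0.9, 0.2]
image_vectors = {
    "cat.png": [0.1, 0.8, 0.3],
    "car.png": [0.9, 0.1, 0.0],
}
ranking = rank_by_similarity(text_query, image_vectors)
print(ranking[0][0])  # name of the image closest to the text query
```

In a real text-to-image search, `text_query` would be the embedding of the query text and `image_vectors` the stored embeddings of the image collection, all produced by the same model and dimension.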

Important

This model service is available only in the "China (Beijing)" region. Calls must use an API key from that region.

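Model overview

Singapore

| Model name | Vector dimensions | Text length limit | Image size limit | Video size limit | Price (per million input tokens) | Free quota (note) |
| --- | --- | --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | 1152, 1024, 512, 256, 128, 64 | 1,024 tokens | Up to 3 MB per image | Video file up to 10 MB | Image/video: $0.09; Text: $0.09 | 1 million tokens; valid for 90 days after Model Studio is activated |
| tongyi-embedding-vision-flash | 768, 512, 256, 128, 64 | | | | Image/video: $0.03; Text: $0.09 | |

Beijing

| Model name | Vector dimensions | Text length limit | Image size limit | Video size limit | Price (per million input tokens) |
| --- | --- | --- | --- | --- | --- |
| qwen3-vl-embedding | 2560, 2048, 1536, 1024, 768, 512, 256 | 32,000 tokens | At most 1 image, up to 5 MB | Video file up to 50 MB | Image/video: $0.258; Text: $0.1 |
| multimodal-embedding-v1 | 1024 | 512 tokens | At most 8 images, each up to 3 MB | Video file up to 10 MB | Free trial |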
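To make the per-million-token pricing concrete, here is a rough cost estimator. It is a sketch only: the prices are copied from the tables above, while the function name, the split into "text" and "media" tokens, and the decision to ignore the free quota are assumptions made for illustration.

```python
from decimal import Decimal

# USD per one million input tokens, copied from the model overview tables.
# multimodal-embedding-v1 is listed as a free trial, so it has no entry here.
PRICE_PER_MILLION = {
    "tongyi-embedding-vision-plus": {"media": Decimal("0.09"), "text": Decimal("0.09")},
    "tongyi-embedding-vision-flash": {"media": Decimal("0.03"), "text": Decimal("0.09")},
    "qwen3-vl-embedding": {"media": Decimal("0.258"), "text": Decimal("0.1")},
}


def estimate_cost_usd(model: str, text_tokens: int, media_tokens: int) -> Decimal:
    """Estimate the charge for a token volume, ignoring any free quota."""
    if model not in PRICE_PER_MILLION:
        raise ValueError(f"no listed price for model: {model}")
    price = PRICE_PER_MILLION[model]
    total = price["text"] * text_tokens + price["media"] * media_tokens
    return total / Decimal(1_000_000)


# 2M text tokens plus 10M image/video tokens on the flash model.
print(estimate_cost_usd("tongyi-embedding-vision-flash", 2_000_000, 10_000_000))
```

`Decimal` is used instead of `float` so that small per-token prices do not accumulate binary rounding error.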

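Input formats and language restrictions:

Multimodal fusion embedding model

| Model | Text | Image | Video | Items per request |
| --- | --- | --- | --- | --- |
| qwen3-vl-embedding | 33 mainstream languages, including Chinese, English, Japanese, Korean, French, and German | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64) | MP4, AVI, MOV (URL only) | At most 20 content elements per request; images, text, and videos share this limit. |

Multimodal independent embedding models

| Model | Text | Image | Video | Items per request |
| --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | Chinese/English | JPG, PNG, BMP (URL or Base64) | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, MKV (URL only) | No limit on the number of content elements for now; the total input tokens must not exceed the token limit. |
| tongyi-embedding-vision-flash | | | | No limit on the number of content elements for now; the total input tokens must not exceed the token limit. |
| multimodal-embedding-v1 | | | | At most 20 content elements per request; at most 1 image and 1 video, at most 20 text items, all sharing the total limit. |

The API accepts a single text segment, a single image, or a single video file, and also allows combinations of different types (such as text + image). Some models accept multiple inputs of the same type (such as several images); see the limits for the specific model.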
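The per-request limits in the table can be checked on the client before a call is made. The sketch below encodes only the element-count rules stated above; `MAX_ELEMENTS` and `check_contents` are hypothetical names, and the check does not cover token, file-size, or format limits.

```python
from typing import Dict, List, Union

Content = Union[str, Dict[str, object]]

# Total element limits from the table; models without an entry have no stated count limit.
MAX_ELEMENTS = {"qwen3-vl-embedding": 20, "multimodal-embedding-v1": 20}


def check_contents(model: str, contents: List[Content]) -> List[str]:
    """Return a list of problems; an empty list means no count limit was exceeded."""
    problems = []
    limit = MAX_ELEMENTS.get(model)
    if limit is not None and len(contents) > limit:
        problems.append(f"total: {len(contents)} elements, limit is {limit}")
    if model == "multimodal-embedding-v1":
        # This model additionally allows at most one image and one video element.
        for kind in ("image", "video"):
            count = sum(1 for c in contents if isinstance(c, dict) and kind in c)
            if count > 1:
                problems.append(f"{kind}: {count} elements, limit is 1")
    return problems


print(check_contents("multimodal-embedding-v1", [{"image": "a.png"}, {"image": "b.png"}]))
```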

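Prerequisites

You must have obtained an API key and API host, and set the API key as an environment variable. If you call the service through the SDK, you also need to install the DashScope SDK. Replace DASHSCOPE_API_HOST in the sample code with the API host you obtained.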

HTTP call

POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding

Request

Independent multimodal embeddings

`qwen3-vl-embedding` supports two usage modes: passing text, image, and video together produces one fused vector, while passing them separately (as in the sample below) produces one independent vector per item.

curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "tongyi-embedding-vision-plus",
    "input": {
        "contents": [
            {"text": "多模態向量模型"},
            {"image": "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"},
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"},
            {"multi_images": [
                "https://img.alicdn.com/imgextra/i2/O1CN019eO00F1HDdlU4Syj5_!!6000000000724-2-tps-2476-1158.png",
                "https://img.alicdn.com/imgextra/i2/O1CN01dSYhpw1nSoamp31CD_!!6000000005089-2-tps-1765-1639.png"
            ]}
        ]
    }
}'

Fused multimodal embedding

`qwen3-vl-embedding` supports two usage modes: passing text, image, and video together (as in the sample below) produces one fused vector, while passing them separately produces one independent vector per item.

curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3-vl-embedding",
    "input": {
        "contents": [
            {
                "text": "這是一段測試文本，用於產生多模態融合向量",
                "image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png",
                "video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
            }
        ]
    },
    "parameters": {
        "dimension": 1024,
        "output_type": "dense",
        "fps": 0.5
    }
}'
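For callers who prefer Python over curl without installing the SDK, the same request can be assembled with the standard library. This is a sketch under assumptions: `build_request` is a hypothetical helper, the `example.com` image URL and the `sk-xxxx` fallback key are placeholders, and the request is only built here, not sent.

```python
import json
import os
import urllib.request

ENDPOINT = (
    "https://dashscope.aliyuncs.com/api/v1/services/embeddings/"
    "multimodal-embedding/multimodal-embedding"
)


def build_request(api_key, model, contents, parameters=None):
    """Build (but do not send) the POST request for the embedding endpoint."""
    body = {"model": model, "input": {"contents": contents}}
    if parameters:
        # Over HTTP, optional settings are nested under "parameters".
        body["parameters"] = parameters
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


api_key = os.getenv("DASHSCOPE_API_KEY", "sk-xxxx")  # placeholder if unset
image_url = "https://example.com/bike.png"  # hypothetical URL

# One element holding several keys: one fused vector.
fused = build_request(
    api_key, "qwen3-vl-embedding",
    [{"text": "a red bicycle", "image": image_url}],
    {"dimension": 1024},
)
# Separate elements: one independent vector per element.
independent = build_request(
    api_key, "qwen3-vl-embedding",
    [{"text": "a red bicycle"}, {"image": image_url}],
)
```

Sending is then a matter of `urllib.request.urlopen(fused)` and decoding the JSON body of the response.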
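Request headers

Content-Type `string` (required): The request content type. Must be set to `application/json`.
Authorization `string` (required): The request credentials. The API authenticates with an Alibaba Cloud Model Studio API key. Example: Bearer sk-xxxx.

Request body

model `string` (required): The model name. Set it to one of the model names in the model overview.

input `object` (required): The input content.

    Properties

    contents `array` (required): The list of content to process. Each element is a dict or a string that specifies the type and value of the content, in the form {"modality type": "input string, or image/video URL"}. Four modality types are supported: `text`, `image`, `video`, and `multi_images`.

    `qwen3-vl-embedding` can produce both fused and independent vectors. When text, image, and video are placed in the same object, a fused vector is produced; when the three are passed as separate elements, a vector is produced for each one. `qwen2.5-vl-embedding` supports fused vectors only, not independent vectors.

    Text: the key is `text` and the value is a string. A plain string can also be passed without wrapping it in a dict.
    Image: the key is `image`. The value can be a publicly accessible URL or a Base64-encoded data URI. The Base64 format is `data:image/{format};base64,{data}`, where `{format}` is the image format (such as `jpeg` or `png`) and `{data}` is the Base64-encoded string.
    Multiple images: only the `tongyi-embedding-vision-plus` and `tongyi-embedding-vision-flash` models support this type. The key is `multi_images` and the value is a list of images, each in the format described above, with at most 8 images.
    Video: the key is `video` and the value must be a publicly accessible URL.

parameters `object` (optional): Embedding parameters. HTTP calls must wrap them in the parameters object; SDK calls can pass the following parameters directly.

    Properties

    output_type `string` (optional): The output vector format. Only dense is currently supported.
    dimension `integer` (optional): The output vector dimension. The supported values depend on the model:
        `qwen3-vl-embedding` supports 2560, 2048, 1536, 1024, 768, 512, and 256; the default is 2560.
        `tongyi-embedding-vision-plus` supports 64, 128, 256, 512, 1024, and 1152; the default is 1152.
        `tongyi-embedding-vision-flash` supports 64, 128, 256, 512, and 768; the default is 768.
        `multimodal-embedding-v1` does not support this parameter and always returns 1024-dimensional vectors.
    fps `float` (optional): Controls how many video frames are sampled. The smaller the ratio, the fewer frames are actually extracted. Range: [0, 1]. Default: 1.0.
    instruct `string` (optional): A custom task description that can guide the model in understanding the query intent. Writing it in English is recommended and typically yields an improvement of about 1%–5%.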
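The parameter rules above lend themselves to a small client-side check before a request is sent. The sketch below assumes nothing beyond the values listed in the parameter descriptions; `build_parameters` and the two lookup tables are hypothetical names, and models not listed are passed through unchecked.

```python
# Dimensions listed per model in the parameter description.
SUPPORTED_DIMENSIONS = {
    "qwen3-vl-embedding": (2560, 2048, 1536, 1024, 768, 512, 256),
    "tongyi-embedding-vision-plus": (1152, 1024, 512, 256, 128, 64),
    "tongyi-embedding-vision-flash": (768, 512, 256, 128, 64),
}
# Models that always return a fixed dimension and reject `dimension`.
FIXED_DIMENSION = {"multimodal-embedding-v1": 1024}


def build_parameters(model, dimension=None, fps=None, instruct=None):
    """Assemble the HTTP `parameters` object, rejecting values the docs rule out."""
    params = {}
    if dimension is not None:
        if model in FIXED_DIMENSION:
            raise ValueError(f"{model} does not accept `dimension`")
        allowed = SUPPORTED_DIMENSIONS.get(model)
        if allowed is not None and dimension not in allowed:
            raise ValueError(f"{model} supports dimensions {allowed}, got {dimension}")
        params["dimension"] = dimension
    if fps is not None:
        if not 0.0 <= fps <= 1.0:
            raise ValueError("fps must be within [0, 1]")
        params["fps"] = fps
    if instruct is not None:
        params["instruct"] = instruct
    return params


print(build_parameters("qwen3-vl-embedding", dimension=1024, fps=0.5))
```

Failing early on the client avoids spending a round trip on a request the service would be expected to reject.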

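Response

Successful response

{
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.026611328125,
                    -0.016571044921875,
                    -0.02227783203125,
                    ...
                ],
                "type": "text"
            },
            {
                "index": 1,
                "embedding": [
                    0.051544189453125,
                    0.007717132568359375,
                    0.026611328125,
                    ...
                ],
                "type": "image"
            },
            {
                "index": 2,
                "embedding": [
                    -0.0217437744140625,
                    -0.016448974609375,
                    0.040679931640625,
                    ...
                ],
                "type": "video"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "image_tokens": 896
    },
    "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5"
}

Error response

{
    "code": "InvalidApiKey",
    "message": "Invalid API-key provided.",
    "request_id": "fb53c4ec-1c12-4fc4-a580-cdb7c3261fc1"
}

output `object`: The task output.

    Properties

    embeddings `array`: The list of embedding results. Each object corresponds to one element of the input list.

        Properties

        index `int`: The index of the result in the input list.
        embedding `array`: The generated vector. Its length is 1024 for `multimodal-embedding-v1`; for the other models it follows the `dimension` parameter.
        type `string`: The input type the result corresponds to: text, image, video, multi_images, or vl (vl is returned only when the `qwen3-vl-embedding` model is used).

request_id `string`: The unique request identifier. Use it to trace request details and troubleshoot problems.
code `string`: The error code of a failed request. Not returned when the request succeeds. For details, see Error messages.
message `string`: The detailed message of a failed request. Not returned when the request succeeds. For details, see Error messages.

usage `object`: Usage statistics.

    Properties

    input_tokens `int`: The number of tokens in the input content of this request.
    image_tokens `int`: The number of tokens for the images or videos in this request. The system samples frames from input videos, up to a frame limit set by system configuration, and then computes tokens from the result.
    image_count `int`: The number of images in this request.
    duration `int`: The duration of the input video in this request, in seconds.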
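A decoded response body can be turned into a lookup from input index to vector, with error bodies surfaced as exceptions. The sketch below relies only on the fields documented above; `extract_embeddings` is a hypothetical helper, and the sample dict uses shortened, made-up vectors and a placeholder request ID.

```python
def extract_embeddings(response: dict) -> dict:
    """Map each input index to its (type, vector) pair, or raise on an error body."""
    if response.get("code"):
        # Error bodies carry code/message instead of output.
        raise RuntimeError(
            f"{response['code']}: {response.get('message', '')} "
            f"(request_id={response.get('request_id', '')})"
        )
    items = response["output"]["embeddings"]
    return {item["index"]: (item["type"], item["embedding"]) for item in items}


# Shortened, made-up vectors in the documented response shape.
sample = {
    "output": {
        "embeddings": [
            {"index": 0, "embedding": [-0.0266, -0.0166], "type": "text"},
            {"index": 1, "embedding": [0.0515, 0.0077], "type": "image"},
        ]
    },
    "usage": {"input_tokens": 10, "image_tokens": 896},
    "request_id": "example-request-id",
}
by_index = extract_embeddings(sample)
kind, vector = by_index[1]
print(kind, len(vector))
```

Keying by `index` rather than list position keeps results aligned with the inputs even if the caller reorders or filters the list later.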

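SDK usage

In the current SDK version, the request structure differs from that of the native HTTP call: the SDK's input parameter corresponds to input.contents in the HTTP request.

Sample code

Generate an image embedding

Using an image URL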

import dashscope
import json
from http import HTTPStatus
# In real use, replace this URL with the URL of your own image
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
# Named input_data to avoid shadowing the built-in input()
input_data = [{'image': image}]
# Call the model API
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input_data
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the error code and message when the call fails
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")

Using a local image

The following sample converts a local image to Base64 and then calls the tongyi-embedding-vision-plus model to embed it.

import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64; in real use, replace xxx.png with your image file name or path
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
    # Read the file and encode it as Base64
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format
image_format = "png"  # Change to match the actual file, for example jpg or bmp
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data (named input_data to avoid shadowing the built-in input())
input_data = [{'image': image_data}]

# Call the model API
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input_data
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the error code and message when the call fails
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")
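Setting the image format by hand is easy to get wrong when files of several types are processed. As one possible alternative, the sketch below infers the MIME type from the file extension with the standard `mimetypes` module. `image_to_data_uri` is a hypothetical helper; whether the service accepts every subtype that `mimetypes` reports is an assumption that has not been verified here.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as data:image/{format};base64,{data}."""
    # guess_type looks only at the file extension, not at the file contents.
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image type from: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the value of the `image` key, for example `[{'image': image_to_data_uri("xxx.png")}]`.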

Generate a video embedding

The multimodal embedding models currently accept video files only as URLs. Passing a local video directly is not supported.

import dashscope
import json
from http import HTTPStatus
# In real use, replace this URL with the URL of your own video
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
# Named input_data to avoid shadowing the built-in input()
input_data = [{'video': video}]
# Call the model API
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input_data
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the error code and message when the call fails
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")

Generate a text embedding

import dashscope
import json
from http import HTTPStatus

text = "通用多模態表徵模型樣本"
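# Named input_data to avoid shadowing the built-in input()
input_data = [{'text': text}]
# Call the model API
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input_data
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Report the error code and message when the call fails
    print(f"Request failed: {getattr(resp, 'code', '')} {getattr(resp, 'message', '')}")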

Generate a fused embedding

import dashscope
import json
import os
from http import HTTPStatus

# Fused multimodal embedding: fuse text, image, and video into a single vector
# Suited to cross-modal retrieval, image search, and similar scenarios
text = "這是一段測試文本，用於產生多模態融合向量"
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"

# The input holds text, image, and video in one element, so the model fuses them into one vector
input_data = [
    {
        "text": text,
        "image": image,
        "video": video
    }
]

# Use qwen3-vl-embedding to generate the fused vector
resp = dashscope.MultiModalEmbedding.call(
    # If the environment variable is not set, replace the next line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    # Optional: set the vector dimension (supported: 2560, 2048, 1536, 1024, 768, 512, 256; default 2560)
    # parameters={"dimension": 1024}
)

print(json.dumps(resp, ensure_ascii=False, indent=4))

Sample output

{
    "status_code": 200,
    "request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
    "code": "",
    "message": "",
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.009490966796875,
                    -0.024871826171875,
                    -0.031280517578125,
                    ...
                ],
                "type": "text"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "input_tokens_details": {
            "image_tokens": 0,
            "text_tokens": 10
        },
        "output_tokens": 1,
        "total_tokens": 11
    }
}

Error codes

If a model call fails and returns an error message, see Error messages to resolve it.