
Alibaba Cloud Model Studio: Audio understanding (Qwen3-Omni-Captioner)

Last updated: Dec 20, 2025

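Qwen3-Omni-Captioner is an open-source model built on Qwen3-Omni. Without any prompt, it automatically generates accurate, comprehensive descriptions of complex speech, ambient sounds, music, and film and TV sound effects. It can identify the speaker's emotion, musical elements (such as style and instruments), and sensitive information, which makes it suitable for audio content analysis, security review, intent recognition, audio editing, and similar fields.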

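Supported models

International (Singapore)

| Model name | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per million tokens) | Output cost (per million tokens) | Free quota (note) |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-30b-a3b-captioner | 65,536 | 32,768 | 32,768 | $3.81 | $3.06 | 1 million tokens; valid for 90 days after activating Alibaba Cloud Model Studio |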

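Chinese mainland (Beijing)

| Model name | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per million tokens) | Output cost (per million tokens) | Free quota (note) |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-30b-a3b-captioner | 65,536 | 32,768 | 32,768 | $2.265 | $1.821 | No free quota |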

Audio-to-token conversion rule: total tokens = audio duration (in seconds) × 12.5. Audio shorter than 1 second is counted as 1 second.
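
The conversion rule can be sketched as a small estimator. This is a rough sketch, not an official billing calculator: the function names are hypothetical, the page does not say how fractional tokens are rounded (so the result is left as a float), and the sample responses on this page also report a few input text tokens that this sketch does not model.

```python
TOKENS_PER_SECOND = 12.5  # conversion rate stated in the rule above


def estimate_audio_tokens(duration_seconds: float) -> float:
    """Estimate audio tokens; clips shorter than 1 second count as 1 second."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must not be negative")
    return max(duration_seconds, 1.0) * TOKENS_PER_SECOND


def estimate_input_cost(duration_seconds: float, price_per_million_tokens: float) -> float:
    """Estimate the input cost from a per-million-token price (hypothetical helper)."""
    return estimate_audio_tokens(duration_seconds) * price_per_million_tokens / 1_000_000


print(estimate_audio_tokens(60))   # 750.0
print(estimate_audio_tokens(0.4))  # 12.5 (billed as 1 second)
```

For example, `estimate_input_cost(60, 3.81)` applies the Singapore input price from the table to a one-minute clip.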

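Quick start

Prerequisites

The Qwen3-Omni-Captioner model can be called only through the API. It is not yet available for online trial in the Alibaba Cloud Model Studio console.

The following sample code analyzes online audio specified by URL (not a local file). See also how to pass in local files and the limits on audio files.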

OpenAI compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # This is the Singapore-region URL. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"                    
                    }
                }
            ]
        }
    ]
)
print(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “哎呀,這樣我還怎麼安靜工作啊?” (“Oh my, how can I possibly work quietly like this?”). His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // This is the Singapore-region URL. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                     }
            }]
        }]
});

console.log(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “哎呀,這樣我還怎麼安靜工作啊?” (“Oh my, how can I possibly work quietly like this?”). His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

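# ======= Important =======
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# This is the Singapore-region base_url. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete these comments before running ===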

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "The audio clip is a brief, low-fidelity recording-approximately six seconds long—captured in a small, reverberant indoor space, likely a home office or bedroom. It opens with a rapid, metallic, rhythmic hammering sound, repeating every 0.5 to 0.6 seconds, with each strike slightly uneven and accompanied by a short echo. This sound dominates the left side of the stereo field and is close to the microphone, suggesting the hammering is occurring nearby and slightly to the left.\n\nOverlaid with the hammering, a single male voice speaks in Mandarin Chinese, his tone clearly one of frustration and exasperation. He says, “哎呀,這樣我還怎麼安靜工作啊?” (“Oh, with this, how am I supposed to work quietly?”) His speech is clear despite the poor audio quality, and is delivered in a standard, unaccented Mandarin, indicative of a native speaker from northern or central China.\n\nThe voice is more distant and centered in the stereo field, with more room reverberation than the hammering. The emotional content is palpable: his voice rises slightly at the end, turning the phrase into a rhetorical complaint, underscoring his irritation. No other voices, music, or ambient sounds are present; the only non-speech sounds are the hammering and the faint hiss of the recording device.\n\nThe combination of the environmental sound, the speaker’s language, and his tone strongly suggests a scenario of home office disruption—perhaps someone working from home is being disturbed by renovation or repair work happening nearby. The recording ends abruptly, mid-hammer, further emphasizing the spontaneous and candid nature of the capture.\n\nIn summary, the audio is a realistic, low-fidelity snapshot of a Mandarin-speaking man, likely in China, expressing frustration at being unable to work in peace due to nearby construction or repair activity, captured in a personal, indoor setting.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 160,
    "completion_tokens": 387,
    "total_tokens": 547,
    "prompt_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "completion_tokens_details": {
      "text_tokens": 387
    }
  },
  "created": 1758002134,
  "system_fingerprint": null,
  "model": "qwen3-omni-30b-a3b-captioner",
  "id": "chatcmpl-f4155bf9-b860-49d6-8ee2-092da7359097"
}

DashScope

Python

import dashscope
import os

# This is the Singapore-region URL. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
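    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If the environment variable is not configured, replace the next line with your Model Studio API key: api_key="sk-xxx",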
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages
)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “哎呀,這樣我還怎麼安靜工作啊?” (“Oh my, how can I possibly work quietly like this?”). His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // This is the Singapore-region base URL. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “哎呀,這樣我還怎麼安靜工作啊?” (“Oh my, how can I possibly work quietly like this?”). His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

# ======= Important =======
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# This is the Singapore-region base_url. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete these comments before running ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
                ]
            }
        ]
    }
}'

Response

{
  "output":{
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "The audio clip is a 6-second, high-fidelity recording set in a quiet, indoor environment. The primary sound is a male speaker, likely in his late teens to mid-20s, speaking Mandarin Chinese in a tone of mild exasperation. His speech is clear and natural, delivered in a conversational manner: “哎呀,這樣我還怎麼安靜工作啊?” (“Oh, how can I possibly work quietly like this?”). His voice is close to the microphone, and the room is acoustically neutral, with no noticeable echo or background noise, suggesting a small, well-furnished space.\n\nOverlaying the speech is a persistent, rhythmic mechanical sound—a series of sharp, metallic clicks or clatters that repeat every 0.6 seconds. The sound is dry and lacks any reverberation, further supporting the inference that it is produced by a mechanical device very close to the microphone. The regularity and timbre of the sound suggest a small, metallic object (such as a key, coin, or pen) being repeatedly tapped or struck on a hard surface, rather than a larger or more complex machine.\n\nThe speaker’s complaint is a direct response to the mechanical noise, expressing frustration at being unable to concentrate or work in peace due to the disturbance. The tone is not angry or urgent, but rather one of resigned annoyance, typical of someone encountering a minor, persistent annoyance in a personal or domestic setting.\n\nThere are no other voices, music, or environmental cues present. The overall impression is of a brief, candid moment—perhaps a student, office worker, or someone in a quiet home environment—caught on microphone while complaining (to themselves or a nearby companion) about a distracting, repetitive noise. The recording is technically clean and focused, with all attention on the speaker and the mechanical sound, making it highly plausible that the clip was captured intentionally, possibly for a voice note, social media post, or as a sample for a sound effect library."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "input_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "total_tokens": 559,
    "output_tokens": 399,
    "input_tokens": 160,
    "output_tokens_details": {
      "text_tokens": 399
    }
  },
  "request_id": "d532f72c-e75b-4ffb-a1ef-d2465e758958"
}

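How it works

  • Single-turn interaction: the model does not support multi-turn conversations. Each request is an independent analysis task.

  • Fixed task: the model's core task is to generate audio descriptions (in English only). Instructions such as a System Message cannot change its behavior, for example to control the output format or the focus of the content.

  • Audio input only: the model accepts only audio as input. No text prompt is needed, and the format of the message parameter is fixed.

    Sample message format

    OpenAI compatible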

    messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                        }
                    }
                ]
            }
        ]

    DashScope

    messages = [
        {
            "role": "user",
            "content": [
                {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
            ]
        }
    ]
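
Because the message format is fixed, it can be convenient to check a request locally before sending it. The sketch below is a hypothetical client-side check for the OpenAI-compatible format only. The function name and the wording of the problems are assumptions, and it does not describe how the API itself validates or rejects requests.

```python
def check_captioner_messages(messages: list) -> list:
    """Return a list of problems found; an empty list means the shape looks right."""
    problems = []
    if len(messages) != 1:
        problems.append("expected exactly one message (the model is single-turn)")
    for message in messages:
        role = message.get("role")
        if role != "user":
            problems.append(f"unexpected role {role!r}: instructions do not change the model's behavior")
        content = message.get("content")
        if not isinstance(content, list):
            problems.append("content should be a list holding one input_audio part")
            continue
        audio_parts = [
            part for part in content
            if isinstance(part, dict) and part.get("type") == "input_audio"
        ]
        if len(audio_parts) != 1 or len(content) != 1:
            problems.append("content should contain exactly one input_audio part and nothing else")
    return problems


valid = [{
    "role": "user",
    "content": [{"type": "input_audio", "input_audio": {"data": "https://example.com/sample.wav"}}],
}]
print(check_captioner_messages(valid))  # []
```

The URL in the usage example is a placeholder, not a real audio file.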

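Streaming output

After receiving input, a large model generates intermediate results step by step, and the final result is the concatenation of those intermediate results. Emitting intermediate results while generation is still in progress is called streaming output. With streaming output, you can read the response as the model produces it, which shortens the wait for the reply.

OpenAI compatible

Enabling streaming output through the OpenAI-compatible interface is simple: set the stream parameter to true in the request.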

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # This is the Singapore-region URL. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                    }
                }
            ]
        }
    ],
    stream=True,
    stream_options={"include_usage": True},

)
for chunk in completion:
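    # When stream_options.include_usage is True, the last chunk has an empty choices list and must be skipped (token usage is available via chunk.usage).
    # delta.content can also be None or empty, so print only non-empty content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")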

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // This is the Singapore-region URL. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                     },
            }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta.content);
    } else {
        console.log(chunk.usage);
    }
}

curl

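# ======= Important =======
# This is the Singapore-region base_url. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete these comments before running ===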

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{
        "include_usage":true
    }
}'

DashScope

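You can call the model through the DashScope SDK or over HTTP to use streaming output. The setting that enables streaming depends on the calling method:

  • Python SDK: set the stream parameter to True.

  • Java SDK: call the streamCall interface.

  • HTTP: set X-DashScope-SSE to enable in the request header.

By default, streaming output is non-incremental: each returned chunk contains all previously generated content. To receive incremental streaming output, set the incremental_output parameter (incrementalOutput in Java) to true.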
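
The difference between the two modes matters when you assemble the final text. The sketch below uses plain strings in place of real API chunks, and the function name is hypothetical. In incremental mode the pieces are concatenated, while in non-incremental mode the last chunk already holds the full text.

```python
from typing import Iterable


def assemble_stream(chunks: Iterable[str], incremental: bool) -> str:
    """Return the final text from a sequence of streamed text chunks."""
    if incremental:
        # Each chunk carries only the newly generated text.
        return "".join(chunks)
    # Each chunk repeats everything generated so far, so keep only the last one.
    final_text = ""
    for text in chunks:
        final_text = text
    return final_text


incremental_chunks = ["The audio ", "clip begins ", "with clanking."]
cumulative_chunks = [
    "The audio ",
    "The audio clip begins ",
    "The audio clip begins with clanking.",
]
print(assemble_stream(incremental_chunks, incremental=True))
print(assemble_stream(cumulative_chunks, incremental=False))
```

Concatenating non-incremental chunks would duplicate earlier text, which is why the mode has to match how the result is assembled.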

Python

import dashscope
import os

# This is the Singapore-region URL. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
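    # If the environment variable is not configured, replace the next line with your Model Studio API key: api_key="sk-xxx",
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key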
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
    stream=True,
    incremental_output=True
)

full_content = ""
print("Streaming output:")
for chunk in response:
    content = chunk["output"]["choices"][0]["message"].content
    if content:
        print(content[0]["text"])
        full_content += content[0]["text"]
print(f"Full content: {full_content}")

Java

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // This is the Singapore-region URL. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void streamCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // qwen3-omni-30b-a3b-captioner accepts only one audio file as input
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav");}}
                )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
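                // If the environment variable is not configured, replace the next line with your Model Studio API key: .apiKey("sk-xxx")
                // API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key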
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                List<com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult.Output.Choice.Message.Content> content = item.getOutput().getChoices().get(0).getMessage().getContent();
                // Check that content exists and is not empty
                if (content != null &&  !content.isEmpty()) {
                    System.out.println(content.get(0).get("text"));
                }
            } catch (Exception e){
                System.exit(0);
            }
        });
    }

    public static void main(String[] args) {
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# This is the Singapore-region base_url. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete these comments before running ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
                ]
            }
        ]
    },
    "parameters": {
      "incremental_output": true
    }
}'

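Passing in local files (Base64 encoding or file path)

The model supports two ways to upload local files:

  • Base64-encoded upload

  • Direct upload by file path (more stable transfer; recommended)

Upload methods:

File path

Pass the file path directly to the model. Only the DashScope Python and Java SDKs support this; HTTP calls do not. Use the table below to specify the file path for your programming language and operating system.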

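Specifying the file path

| System | SDK | File path to pass | Example |
| --- | --- | --- | --- |
| Linux or macOS | Python SDK | file://{absolute path of the file} | file:///home/images/test.mp3 |
| Linux or macOS | Java SDK | file://{absolute path of the file} | file:///home/images/test.mp3 |
| Windows | Python SDK | file://{absolute path of the file} | file://D:/images/test.mp3 |
| Windows | Java SDK | file:///{absolute path of the file} | file:///D:/images/test.mp3 |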
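
For the Python SDK rows of the table, the path can be built with a small helper. This is a sketch under stated assumptions: the function name is hypothetical, it covers only the Python SDK convention (the Java SDK on Windows uses the `file:///` prefix instead), and it uses pure path classes so that it behaves the same on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath


def python_sdk_file_url(absolute_path: str, windows: bool = False) -> str:
    """Build the file:// string the table lists for the DashScope Python SDK."""
    path = PureWindowsPath(absolute_path) if windows else PurePosixPath(absolute_path)
    if not path.is_absolute():
        raise ValueError(f"not an absolute path: {absolute_path}")
    # as_posix() turns Windows backslashes into forward slashes.
    return "file://" + path.as_posix()


print(python_sdk_file_url("/home/images/test.mp3"))
print(python_sdk_file_url("D:\\images\\test.mp3", windows=True))
```

The two calls reproduce the Linux/macOS and Windows examples from the Python SDK rows.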

Base 64 編碼傳入

Base64 編碼,將檔案轉換為 Base 64 編碼字串,再傳入模型。

傳入 Base 64 編碼字串的步驟

  1. 檔案編碼:將本地音頻檔案轉換為 Base 64 編碼;

    音頻轉換為 Base 64 編碼的範例程式碼

    import base64

    # Encoding function: convert a local file to a Base64-encoded string
    def encode_audio(audio_path):
        with open(audio_path, "rb") as audio_file:
            return base64.b64encode(audio_file.read()).decode("utf-8")
    
    # Replace xxxx/test.mp3 with the absolute path of your local audio file
    base64_audio = encode_audio("xxxx/test.mp3")
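  2. Build a Data URL in the format data:;base64,{base64_audio}, where base64_audio is the Base64 string generated in the previous step;

  3. Call the model: pass the Data URL through the audio parameter (DashScope SDK) or the input_audio parameter (OpenAI-compatible SDK).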
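
The three steps can be sketched end to end without calling the service. The helper names to_data_url, dashscope_content and openai_content are hypothetical; the message shapes follow the DashScope and OpenAI-compatible examples in this document. The three bytes used as input are a stand-in for real audio data.

```python
import base64


def to_data_url(audio_bytes: bytes) -> str:
    """Steps 1 and 2: Base64-encode the bytes and wrap them in a Data URL."""
    base64_audio = base64.b64encode(audio_bytes).decode("utf-8")
    return f"data:;base64,{base64_audio}"


def dashscope_content(data_url: str) -> list:
    """Step 3 (DashScope SDK): the Data URL goes in the audio parameter."""
    return [{"audio": data_url}]


def openai_content(data_url: str) -> list:
    """Step 3 (OpenAI-compatible SDK): the Data URL goes in input_audio.data."""
    return [{"type": "input_audio", "input_audio": {"data": data_url}}]


# Stand-in bytes; in practice read them from the audio file in "rb" mode.
data_url = to_data_url(b"ID3")
print(data_url)  # data:;base64,SUQz
print(dashscope_content(data_url))
print(openai_content(data_url))
```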

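Usage limits:

  • Prefer file path upload (more stable transfer); Base64 encoding is also an option for files under 1MB;

  • When passing a file path directly, the audio file itself must be smaller than 10MB;

  • When passing Base64-encoded data, the encoded Base64 string must be smaller than 10MB; note that Base64 encoding increases the data size.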
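
Because Base64 turns every 3 input bytes into 4 output characters, the encoded size can be computed before encoding. The sketch below assumes that "10MB" means 10 × 1024 × 1024 bytes and that the limit applies to the Base64 string alone; neither is confirmed by this document, and the function names are illustrative.

```python
LIMIT_BYTES = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB


def base64_length(raw_size: int) -> int:
    """Length of the padded Base64 string for raw_size input bytes."""
    return 4 * ((raw_size + 2) // 3)


def fits_base64_limit(raw_size: int, limit: int = LIMIT_BYTES) -> bool:
    """True if the encoded string stays strictly below the limit."""
    return base64_length(raw_size) < limit


print(base64_length(3_000_000))      # 4000000
print(fits_base64_limit(7_000_000))  # True
print(fits_base64_limit(8_000_000))  # False
```

Under these assumptions, a raw file of roughly 7.5MB is already too large for Base64 input even though it is below 10MB on disk.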

File path input

Passing a file path is supported only when calling through the DashScope Python and Java SDKs; HTTP calls are not supported.

Python

import dashscope
import os

# The URL below is for the Singapore region. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full local path must start with file:// to be valid, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
    {
        "role": "user",
        # Pass the file path prefixed with file:// in the audio parameter
        "content": [{"audio": audio_file_path}],
    }
]

response = dashscope.MultiModalConversation.call(
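    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If the environment variable is not set, replace the next line with your Model Studio API key: api_key="sk-xxx"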
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner", 
    messages=messages)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The URL below is for the Singapore region. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException {

        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file
        // The full local path must start with file:// to be valid, for example: file:///home/images/test.mp3
        // This example was tested on macOS. On Windows, use "file:///ABSOLUTE_PATH/welcome.mp3" instead

        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", localFilePath);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
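                // API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If the environment variable is not set, replace the next line with your Model Studio API key: .apiKey("sk-xxx")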
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Base64-encoded input

OpenAI compatible

Python

import os
from openai import OpenAI
import base64

client = OpenAI(
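    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The URL below is for the Singapore region. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1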
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")


# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        # When passing a local file as Base64, the value must start with data: to be a valid URL.
                        # The Base64 data (base64_audio) must be preceded by "base64,"; otherwise an error is returned.
                        "data": f"data:;base64,{base64_audio}"
                    },
                }
            ],
        },
    ]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The URL below is for the Singapore region. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeAudio = (audioPath) => {
    const audioFile = readFileSync(audioPath);
    return audioFile.toString('base64');
};
// Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file
const base64Audio = encodeAudio("xxx/ABSOLUTE_PATH/welcome.mp3")

const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": { "data": `data:;base64,${base64Audio}`}
            }]
        }]
});

console.log(completion.choices[0].message.content);

curl

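  • For how to convert a file to a Base64-encoded string, see the sample code

  • For readability, the Base64 string "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." in the code is truncated; pass the complete encoded string in actual use.

# ======= Important =======
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# The URL below is for the Singapore region. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete these comments before running ===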

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."
            
          }
        }
      ]
    }
  ]
}'

DashScope

Python

import os
import base64
import dashscope 

dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"
# Encoding function: convert a local file to a Base64-encoded string
def encode_audio(audio_file_path):
    with open(audio_file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

messages = [
    {
        "role": "user",
        # When passing a local file as Base64, the value must start with data: to be a valid URL.
        # The Base64 data (base64_audio) must be preceded by "base64,"; otherwise an error is returned.
        "content": [{"audio":f"data:;base64,{base64_audio}"}],
    }
]

response = dashscope.MultiModalConversation.call(
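    # If the environment variable is not set, replace the next line with your Model Studio API key: api_key="sk-xxx"
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key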
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
    )
print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The URL below is for the Singapore region. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeAudioToBase64(String audioPath) throws IOException {
        Path path = Paths.get(audioPath);
        byte[] audioBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(audioBytes);
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException,IOException{
        // Replace ABSOLUTE_PATH/welcome.mp3 with the actual path of your local file
        String localFilePath = "ABSOLUTE_PATH/welcome.mp3";
        String base64Audio = encodeAudioToBase64(localFilePath);

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // When passing a local file as Base64, the value must start with data: to be a valid URL.
                // The Base64 data (base64Audio) must be preceded by "base64,"; otherwise an error is returned.
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "data:;base64," + base64Audio);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen3-omni-30b-a3b-captioner")
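                // API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If the environment variable is not set, replace the next line with your Model Studio API key: .apiKey("sk-xxx")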
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

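  • For how to convert a file to a Base64-encoded string, see the sample code

  • For readability, the Base64 string "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." in the code is truncated; pass the complete encoded string in actual use.

# ======= Important =======
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# The URL below is for the Singapore region. For models in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete these comments before running ===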

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."}
                ]
            }
        ]
    }
}'

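API reference

For the input and output parameters of Qwen3-Omni-Captioner, see Qwen.

Error codes

If a model call fails and returns an error message, see Error messages to resolve the issue.

FAQ

How do I compress an audio file to the required size?

  • Online tools: use an online tool such as Compresss to compress the audio file.

  • Code: use FFmpeg. For more usage, see the FFmpeg official website.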

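    # Basic conversion command (general-purpose template)
    # -i: input file path. Example value: input.mp3
    
    # -b:a: audio bitrate.
      # Common values: 64 kbps (low quality; speech, low-bandwidth streaming), 128 kbps (medium quality; everyday audio, podcasts), 192 kbps (high quality; music, broadcasting)
      # The higher the bitrate, the better the audio quality and the larger the file
      
    # -ar: audio sampling rate, the number of samples per second.
     # Common values: 8000 Hz, 22050 Hz, 44100 Hz (standard sampling rate)
     # The higher the sampling rate, the larger the file
     
    # -ac: number of audio channels. Common values: 1 (mono), 2 (stereo). Mono files are smaller
    
    # -y: overwrite an existing output file (takes no value)
    # output.mp3: output file path
    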
    
    ffmpeg -i input.mp3 -b:a 128k -ar 44100 -ac 1 output.mp3 -y
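
To pick a value for -b:a, the output size can be estimated as bitrate × duration ÷ 8. The sketch below assumes a constant bitrate and ignores container overhead, so the result is an approximation; the function names are illustrative, and the 10 × 1024 × 1024 byte target is an assumed reading of "10MB".

```python
def estimated_size_bytes(bitrate_kbps: float, duration_seconds: float) -> float:
    """Approximate output size: bits per second times seconds, divided by 8."""
    return bitrate_kbps * 1000 * duration_seconds / 8


def max_bitrate_kbps(target_bytes: float, duration_seconds: float) -> float:
    """Highest constant bitrate whose estimated size stays within target_bytes."""
    return target_bytes * 8 / duration_seconds / 1000


ten_mb = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB

# A 10-minute clip at 128 kbps is about 9.6 million bytes.
print(estimated_size_bytes(128, 600))  # 9600000.0
# Upper bound on the bitrate for a 40-minute clip under the assumed target.
print(max_bitrate_kbps(ten_mb, 40 * 60))
```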

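Limits

The model places the following limits on audio files:

  • Duration: no longer than 40 minutes

  • Number of files: one audio file per request

  • File formats: AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, MP3, and other mainstream formats

  • Input methods: a publicly accessible audio URL, Base64 encoding, or a local file path

  • File size:

    • Public URL: no larger than 1GB

    • File path: the audio file itself must be smaller than 10MB

    • Base64 encoding: the encoded Base64 string must be smaller than 10MB. For details, see How to pass a local file

    To compress a file, see How do I compress an audio file to the required size?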
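
The duration limit can be related to the token rule stated earlier in this document (12.5 tokens per second of audio, with clips shorter than 1 second counted as 1 second). The following is a rough pre-flight sketch; the function names are illustrative, and whether the service rounds fractional token counts is not specified here.

```python
MAX_DURATION_SECONDS = 40 * 60  # 40-minute limit from the list above
TOKENS_PER_SECOND = 12.5        # audio-to-token rule from this document


def audio_tokens(duration_seconds: float) -> float:
    """Estimated input tokens; clips under 1 second are billed as 1 second."""
    return max(duration_seconds, 1.0) * TOKENS_PER_SECOND


def within_duration_limit(duration_seconds: float) -> bool:
    """True if the clip is no longer than 40 minutes."""
    return duration_seconds <= MAX_DURATION_SECONDS


print(audio_tokens(0.4))                           # 12.5
print(audio_tokens(MAX_DURATION_SECONDS))          # 30000.0
print(within_duration_limit(MAX_DURATION_SECONDS)) # True
```

A clip at the 40-minute limit therefore corresponds to 30,000 tokens, which is below the 32,768-token maximum input listed in the model table.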