
Alibaba Cloud Model Studio: Audio understanding (Qwen3-Omni-Captioner)

Last Updated: Mar 14, 2026

Qwen3-Omni-Captioner is an open-source model built on Qwen3-Omni. It automatically generates accurate and comprehensive descriptions for complex audio—speech, ambient sounds, music, and sound effects—without prompts. The model identifies speaker emotions, musical elements (style, instruments), and sensitive information. Ideal for audio content analysis, security audits, intent recognition, and video editing.

Availability

Supported regions

  • Singapore: Requires an API key from this region.

  • Beijing: Requires an API key from this region.

Supported models

International

In international deployment mode, the endpoint and data storage are both located in the Singapore region. Model inference compute resources are dynamically scheduled worldwide, excluding Chinese Mainland.

| Model | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per 1M tokens) | Output cost (per 1M tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-30b-a3b-captioner | 65,536 | 32,768 | 32,768 | $3.81 | $3.06 | 1 million tokens (valid for 90 days after activating Model Studio) |

Chinese Mainland

In Chinese Mainland deployment mode, the endpoint and data storage are both located in the Beijing region. Model inference compute resources are limited to Chinese Mainland.

| Model | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per 1M tokens) | Output cost (per 1M tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-30b-a3b-captioner | 65,536 | 32,768 | 32,768 | $2.265 | $1.821 | No free quota |

Token conversion rule for audio: Total tokens = Audio duration (in seconds) × 12.5. If the audio duration is less than one second, it is counted as one second.
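To illustrate the conversion rule above, the following minimal sketch estimates billed audio tokens and input cost for a clip of a given duration. The `audio_tokens` helper is hypothetical (not part of any SDK), rounding up fractional token counts is an assumption since the rule does not specify rounding, and the price used is the Singapore-region input cost of $3.81 per 1M tokens:

```python
import math

SINGAPORE_INPUT_PRICE_PER_1M = 3.81  # USD per 1M input tokens (Singapore region)

def audio_tokens(duration_seconds: float) -> int:
    """Estimate billed audio tokens: duration (s) x 12.5.

    Audio shorter than one second is counted as one second.
    Rounding up is an assumption; the documented rule does not specify it.
    """
    seconds = max(duration_seconds, 1.0)
    return math.ceil(seconds * 12.5)

# Example: a 6-second clip -> 6 x 12.5 = 75 tokens
tokens = audio_tokens(6)
print(tokens)

# Estimated input cost for those tokens
print(f"${tokens / 1_000_000 * SINGAPORE_INPUT_PRICE_PER_1M:.6f}")
```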

Getting started

Prerequisites

Qwen3-Omni-Captioner supports API calls only. Online testing in the Model Studio console is not available.

The following code samples analyze audio from a public URL rather than local files. For details on how to pass local files and the limits that apply to audio files, see the related documentation.

OpenAI compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys differ by region. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region URL. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"                    
                    }
                }
            ]
        }
    ]
)
print(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces-likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise-likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction-possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Singapore region URL. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                     }
            }]
        }]
});

console.log(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces-likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise-likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction-possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Singapore region base_url. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "The audio clip is a brief, low-fidelity recording-approximately six seconds long-captured in a small, reverberant indoor space, likely a home office or bedroom. It opens with a rapid, metallic, rhythmic hammering sound, repeating every 0.5 to 0.6 seconds, with each strike slightly uneven and accompanied by a short echo. This sound dominates the left side of the stereo field and is close to the microphone, suggesting the hammering is occurring nearby and slightly to the left.\n\nOverlaid with the hammering, a single male voice speaks in Mandarin Chinese, his tone clearly one of frustration and exasperation. He says, “Oh, with this, how am I supposed to work quietly?” His speech is clear despite the poor audio quality, and is delivered in a standard, unaccented Mandarin, indicative of a native speaker from northern or central China.\n\nThe voice is more distant and centered in the stereo field, with more room reverberation than the hammering. The emotional content is palpable: his voice rises slightly at the end, turning the phrase into a rhetorical complaint, underscoring his irritation. No other voices, music, or ambient sounds are present; the only non-speech sounds are the hammering and the faint hiss of the recording device.\n\nThe combination of the environmental sound, the speaker’s language, and his tone strongly suggests a scenario of home office disruption-perhaps someone working from home is being disturbed by renovation or repair work happening nearby. The recording ends abruptly, mid-hammer, further emphasizing the spontaneous and candid nature of the capture.\n\nIn summary, the audio is a realistic, low-fidelity snapshot of a Mandarin-speaking man, likely in China, expressing frustration at being unable to work in peace due to nearby construction or repair activity, captured in a personal, indoor setting.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 160,
    "completion_tokens": 387,
    "total_tokens": 547,
    "prompt_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "completion_tokens_details": {
      "text_tokens": 387
    }
  },
  "created": 1758002134,
  "system_fingerprint": null,
  "model": "qwen3-omni-30b-a3b-captioner",
  "id": "chatcmpl-f4155bf9-b860-49d6-8ee2-092da7359097"
}

DashScope

Python

import dashscope
import os

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys differ by region. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If environment variable not configured, replace with your API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages
)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces-likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise-likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction-possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
         // The following is the base-url for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces-likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise-likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction-possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Singapore region base_url. For Beijing, use: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
                ]
            }
        ]
    }
}'

Response

{
  "output":{
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "The audio clip is a 6-second, high-fidelity recording set in a quiet, indoor environment. The primary sound is a male speaker, likely in his late teens to mid-20s, speaking Mandarin Chinese in a tone of mild exasperation. His speech is clear and natural, delivered in a conversational manner: “Oh, how can I possibly work quietly like this?” His voice is close to the microphone, and the room is acoustically neutral, with no noticeable echo or background noise, suggesting a small, well-furnished space.\n\nOverlaying the speech is a persistent, rhythmic mechanical sound-a series of sharp, metallic clicks or clatters that repeat every 0.6 seconds. The sound is dry and lacks any reverberation, further supporting the inference that it is produced by a mechanical device very close to the microphone. The regularity and timbre of the sound suggest a small, metallic object (such as a key, coin, or pen) being repeatedly tapped or struck on a hard surface, rather than a larger or more complex machine.\n\nThe speaker’s complaint is a direct response to the mechanical noise, expressing frustration at being unable to concentrate or work in peace due to the disturbance. The tone is not angry or urgent, but rather one of resigned annoyance, typical of someone encountering a minor, persistent annoyance in a personal or domestic setting.\n\nThere are no other voices, music, or environmental cues present. The overall impression is of a brief, candid moment-perhaps a student, office worker, or someone in a quiet home environment-caught on microphone while complaining (to themselves or a nearby companion) about a distracting, repetitive noise. The recording is technically clean and focused, with all attention on the speaker and the mechanical sound, making it highly plausible that the clip was captured intentionally, possibly for a voice note, social media post, or as a sample for a sound effect library."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "input_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "total_tokens": 559,
    "output_tokens": 399,
    "input_tokens": 160,
    "output_tokens_details": {
      "text_tokens": 399
    }
  },
  "request_id": "d532f72c-e75b-4ffb-a1ef-d2465e758958"
}

How it works

  • Single-turn interaction: Each request is an independent analysis task. Multi-turn conversation is not supported.

  • Fixed task: The model generates audio descriptions in English only. You cannot use instructions (e.g., system messages) to change behavior, output format, or content focus.

  • Audio input only: The model accepts audio only. Text prompts are not needed. The message parameter format is fixed.

    Example message format

    OpenAI compatible

    messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                        }
                    }
                ]
            }
        ]

    DashScope

    messages = [
        {
            "role": "user",
            "content": [
                {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
            ]
        }
    ]

Streaming output

Streaming output returns intermediate results as soon as they are generated, so you can read the response while it is still being produced instead of waiting for the complete result. This reduces perceived wait time.

OpenAI compatible

Set stream to true to enable streaming output.

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region URL. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                    }
                }
            ]
        }
    ],
    stream=True,
    stream_options={"include_usage": True},

)
for chunk in completion:
    # If stream_options.include_usage is True, the choices field of the last chunk is an empty list and should be skipped. You can get the token usage from chunk.usage.
    if chunk.choices and chunk.choices[0].delta.content != "":
        print(chunk.choices[0].delta.content,end="")

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Singapore region URL. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                     },
            }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta.content);
    } else {
        console.log(chunk.usage);
    }
}

curl

# ======= Important =======
# Singapore region base_url. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{
        "include_usage":true
    }
}'

DashScope

Call via DashScope SDK or HTTP. Set parameters based on your method:

  • Python SDK: Set the stream parameter to True.

  • Java SDK: Use the streamCall method.

  • HTTP: In the header, set X-DashScope-SSE to enable.

By default, streaming output is non-incremental. This means each returned chunk contains all previously generated content. For incremental output, set incremental_output (incrementalOutput in Java) to true.

Python

import os
import dashscope

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
     # If environment variable not configured, replace with your API key: api_key="sk-xxx"
    # API keys differ by region. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
    stream=True,
    incremental_output=True
)

full_content = ""
print("Streaming output:")
for chunk in response:
    if chunk["output"]["choices"][0]["message"].content:
        print(chunk["output"]["choices"][0]["message"].content[0]["text"])
        full_content += chunk["output"]["choices"][0]["message"].content[0]["text"]
print(f"Full content: {full_content}")

Java

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // Singapore region URL. For Beijing, use: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void streamCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // qwen3-omni-30b-a3b-captioner supports only one audio file as input.
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav");}}
                )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                 // If environment variable not configured, replace with your API key: .apiKey("sk-xxx")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                List<com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult.Output.Choice.Message.Content> content = item.getOutput().getChoices().get(0).getMessage().getContent();
                // Check if content exists and is not empty.
                if (content != null &&  !content.isEmpty()) {
                    System.out.println(content.get(0).get("text"));
                }
            } catch (Exception e){
                System.exit(0);
            }
        });
    }

    public static void main(String[] args) {
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Singapore region base_url. For Beijing, use: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
                ]
            }
        ]
    },
    "parameters": {
      "incremental_output": true
    }
}'

Pass local file (Base64 encoding or file path)

Two methods are available to upload local files:

  • Base64 encoding

  • File path (recommended for greater transmission stability)

Upload methods:

Pass by file path

Pass the file path directly to the model. Supported by DashScope Python and Java SDKs only, not HTTP. See the table below for path format by language and OS.

Specify the file path

  • Linux or macOS

    • Python SDK and Java SDK: file://{absolute_path_of_the_file}

      Example: file:///home/images/test.mp3

  • Windows

    • Python SDK: file://{absolute_path_of_the_file}

      Example: file://D:/images/test.mp3

    • Java SDK: file:///{absolute_path_of_the_file}

      Example: file:///D:/images/test.mp3
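The path formats above can be assembled programmatically. A minimal sketch (the helper name `to_file_url` is illustrative, not part of the DashScope SDK):

```python
def to_file_url(abs_path: str, sdk: str = "python") -> str:
    """Build a file:// URL in the format described in the table above.

    abs_path must be an absolute path; sdk is "python" or "java".
    """
    p = abs_path.replace("\\", "/")  # normalize Windows backslashes
    if p.startswith("/"):
        # Linux/macOS: "file://" + "/home/..." -> "file:///home/..."
        return "file://" + p
    # Windows drive path: the Java SDK expects three slashes, the Python SDK two.
    return ("file:///" if sdk == "java" else "file://") + p

print(to_file_url("/home/images/test.mp3"))           # file:///home/images/test.mp3
print(to_file_url("D:\\images\\test.mp3"))            # file://D:/images/test.mp3
print(to_file_url("D:/images/test.mp3", sdk="java"))  # file:///D:/images/test.mp3
```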

Pass by Base64 encoding

Convert the file to a Base64 string and pass it to the model.

Steps to pass a Base64-encoded string

  1. Encode the file: Convert the local audio file to a Base64 string.

    Example: Converting an audio file to a Base64 string

    import base64
    
    # Encoding function: Converts a local file to a Base64-encoded string
    def encode_audio(audio_path):
        with open(audio_path, "rb") as audio_file:
            return base64.b64encode(audio_file.read()).decode("utf-8")
    
    # Replace xxxx/test.mp3 with the absolute path of your local audio file
    base64_audio = encode_audio("xxxx/test.mp3")
  2. Construct a Data URL: data:;base64,{base64_audio}, where base64_audio is the Base64 string from step 1.

  3. Pass the Data URL via audio (DashScope) or input_audio (OpenAI compatible) parameter.
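The three steps can be combined into a single helper (the function name `audio_data_url` is illustrative):

```python
import base64

def audio_data_url(audio_path: str) -> str:
    # Step 1: encode the file; step 2: prepend the data:;base64, prefix.
    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:;base64,{b64}"
```

The returned string is then passed via the audio (DashScope) or input_audio (OpenAI compatible) parameter, as in the samples below.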

Limits:

  • Recommended: pass the file path directly for greater transmission stability. For files under 1 MB, Base64 encoding also works.

  • When passing by file path, audio files must be under 10 MB.

  • When using Base64, the encoded string must be under 10 MB. Note: Base64 increases file size.
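Because Base64 expands every 3 raw bytes to 4 encoded characters, you can estimate the encoded size before choosing a method. A sketch, using the 10 MB threshold from the limits above (helper names are illustrative):

```python
import os

MAX_ENCODED = 10 * 1024 * 1024  # 10 MB limit on the Base64 string

def base64_size(raw_bytes: int) -> int:
    # Base64 maps every 3 raw bytes (with padding) to 4 characters.
    return (raw_bytes + 2) // 3 * 4

def fits_base64(path: str) -> bool:
    return base64_size(os.path.getsize(path)) < MAX_ENCODED

print(base64_size(3 * 1024 * 1024))  # 4194304: a 3 MiB file encodes to 4 MiB
```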

Pass by file path

File path passing is supported by DashScope Python and Java SDKs only, not HTTP.

Python

import dashscope
import os

# Singapore region base_url. For Beijing, use: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full path of the local file must be prefixed with file:// to ensure a valid path, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
    {
        "role": "user",
        # Pass the file path prefixed with file:// in the audio parameter.
        "content": [{"audio": audio_file_path}],
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys differ by region. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner", 
    messages=messages)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException {

        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
        // Prefix the absolute path with file:// to form a valid path, for example: file:///home/images/test.mp3
        // This example uses a Linux/macOS path. On Windows, use "file:///ABSOLUTE_PATH/welcome.mp3" instead.

        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", localFilePath);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                 // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If environment variable not configured, replace with your API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Pass by Base64 encoding

OpenAI compatible

Python

import os
from openai import OpenAI
import base64

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region URL. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")


# Replace xxx/ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        # When passing a Base64-encoded local file, the value must be a valid Data URL:
                        # keep the data:;base64, prefix before the encoded data (base64_audio), or the call fails.
                        "data": f"data:;base64,{base64_audio}"
                    },
                }
            ],
        },
    ]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Singapore region URL. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeAudio = (audioPath) => {
    const audioFile = readFileSync(audioPath);
    return audioFile.toString('base64');
};
//  Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
const base64Audio = encodeAudio("xxx/ABSOLUTE_PATH/welcome.mp3")

const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": { "data": `data:;base64,${base64Audio}`}
            }]
        }]
});

console.log(completion.choices[0].message.content);

curl

  • For information about how to convert a file to a Base64-encoded string, see the code sample.

  • For demonstration purposes, the Base64 string in "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." is truncated. In practice, you must pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Singapore region base_url. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."
            
          }
        }
      ]
    }
  ]
}'

DashScope

Python

import os
import base64
import dashscope

# Singapore region base_url. For Beijing, use: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"
# Encoding function: Converts a local file to a Base64-encoded string
def encode_audio(audio_file_path):
    with open(audio_file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

messages = [
    {
        "role": "user",
        # When passing a Base64-encoded local file, the value must be a valid Data URL:
        # keep the data:;base64, prefix before the encoded data (base64_audio), or the call fails.
        "content": [{"audio": f"data:;base64,{base64_audio}"}],
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    # API keys differ by region. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
    )
print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeAudioToBase64(String audioPath) throws IOException {
        Path path = Paths.get(audioPath);
        byte[] audioBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(audioBytes);
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException,IOException{
        // Replace ABSOLUTE_PATH/welcome.mp3 with the actual path of your local file.
        String localFilePath = "ABSOLUTE_PATH/welcome.mp3";
        String base64Audio = encodeAudioToBase64(localFilePath);

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // When passing a Base64-encoded local file, the value must be a valid Data URL:
                // keep the data:;base64, prefix before the encoded data (base64Audio), or the call fails.
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "data:;base64," + base64Audio);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen3-omni-30b-a3b-captioner")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
               // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For information about how to convert a file to a Base64-encoded string, see the code sample.

  • For demonstration purposes, the Base64 string in "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." is truncated. In practice, you must pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Singapore region base_url. For Beijing, use: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."}
                ]
            }
        ]
    }
}'

API reference

For Qwen3-Omni-Captioner parameters, see Qwen.

Error codes

If the model call fails and returns an error message, see Error messages for resolution.

FAQ

How to compress an audio file to the required size?

  • Online tools: You can use tools like Compresss to compress audio.

  • Code implementation: Use FFmpeg. For usage details, see the official FFmpeg website.

    # Basic conversion command (universal template)
    # -i: Input file path. Example: input.mp3
    # -b:a: Audio bitrate. Common values: 64k (low quality, for voice and
    #       low-bandwidth streaming), 128k (medium quality, for general audio
    #       and podcasts), 192k (high quality, for music and broadcasting).
    #       A higher bitrate gives better audio quality and a larger file.
    # -ar: Audio sample rate (samples per second). Common values: 8000, 22050,
    #      44100 (standard sample rate). A higher sample rate gives a larger file.
    # -ac: Number of audio channels. Common values: 1 (mono), 2 (stereo). Mono files are smaller.
    # -y: Overwrite the output file if it exists (no value needed).
    # output.mp3: Output file path.
    
    ffmpeg -i input.mp3 -b:a 128k -ar 44100 -ac 1 output.mp3 -y
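To pick a bitrate that keeps the compressed file under the 10 MB local-file limit, invert the relationship file size ≈ bitrate × duration. A sketch (the function name is illustrative, and container overhead is ignored, so leave some headroom):

```python
def max_bitrate_kbps(duration_s: float, size_limit_mb: float = 10.0) -> int:
    # File size (bits) ≈ audio bitrate (bits/s) x duration (s).
    bits = size_limit_mb * 1024 * 1024 * 8
    return int(bits / duration_s / 1000)

# A 40-minute file (the duration limit) must stay below ~34 kbps to fit in 10 MB:
print(max_bitrate_kbps(40 * 60))  # 34
```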

Limitations

Audio file limits:

  • Duration: Up to 40 minutes.

  • Number of files: Only one audio file is supported per request.

  • File formats: AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3.

  • File input methods: Public URL, Base64 encoding, or local file path.

  • File size:

    • Public URL: No more than 1 GB.

    • File path: The audio file must be smaller than 10 MB.

    • Base64 encoding: The encoded Base64 string must be smaller than 10 MB. For more information, see Pass local file.

    To compress a file, see How to compress an audio file to the required size?