
Alibaba Cloud Model Studio: Audio understanding (Qwen3-Omni-Captioner)

Last Updated: Oct 30, 2025

Qwen3-Omni-Captioner is an open-source model built on Qwen3-Omni. It automatically generates accurate and comprehensive descriptions for complex audio, including speech, ambient sounds, music, and sound effects, without requiring any prompts. The model can identify speaker emotions, musical elements such as style and instruments, and sensitive information. It is ideal for audio content analysis, security audits, intent recognition, and video editing.

Supported models

International (Singapore)

| Model | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per million tokens) | Output cost (per million tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-30b-a3b-captioner | 65,536 | 32,768 | 32,768 | $3.81 | $3.06 | 1 million tokens, valid for 90 days after you activate Alibaba Cloud Model Studio |

Mainland China (Beijing)

| Model | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per million tokens) | Output cost (per million tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-30b-a3b-captioner | 65,536 | 32,768 | 32,768 | $2.265 | $1.821 | No free quota |

Token conversion rule for audio: Total tokens = Audio duration (in seconds) × 12.5. If the audio duration is less than one second, it is counted as one second.
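As a rough sanity check, the conversion rule above can be sketched in code. Rounding fractional token counts up is an assumption; the documentation states only the 12.5 multiplier and the one-second minimum.

```python
import math

def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate billable audio tokens: duration (s) x 12.5.

    Audio shorter than one second is billed as one full second.
    Rounding fractional results up is an assumption, not documented
    behavior.
    """
    seconds = max(duration_seconds, 1.0)
    return math.ceil(seconds * 12.5)

print(estimate_audio_tokens(10))   # 125
print(estimate_audio_tokens(0.4))  # billed as 1 second -> 13
```

For reference, the sample responses in this topic report 152 audio tokens for the example clip, consistent with a duration of about 12 seconds.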

Getting started

Prerequisites

Qwen3-Omni-Captioner only supports API calls. It is not available for online testing in the Alibaba Cloud Model Studio console.

The following code samples show how to analyze online audio specified by a URL rather than a local file. For details on passing local files and the limits on audio files, see the related documentation.

OpenAI compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"                    
                    }
                }
            ]
        }
    ]
)
print(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"
                     }
            }]
        }]
});

console.log(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"
          }
        }
      ]
    }
  ]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "The audio clip is a brief, low-fidelity recording-approximately six seconds long—captured in a small, reverberant indoor space, likely a home office or bedroom. It opens with a rapid, metallic, rhythmic hammering sound, repeating every 0.5 to 0.6 seconds, with each strike slightly uneven and accompanied by a short echo. This sound dominates the left side of the stereo field and is close to the microphone, suggesting the hammering is occurring nearby and slightly to the left.\n\nOverlaid with the hammering, a single male voice speaks in Mandarin Chinese, his tone clearly one of frustration and exasperation. He says, “Oh, with this, how am I supposed to work quietly?” His speech is clear despite the poor audio quality, and is delivered in a standard, unaccented Mandarin, indicative of a native speaker from northern or central China.\n\nThe voice is more distant and centered in the stereo field, with more room reverberation than the hammering. The emotional content is palpable: his voice rises slightly at the end, turning the phrase into a rhetorical complaint, underscoring his irritation. No other voices, music, or ambient sounds are present; the only non-speech sounds are the hammering and the faint hiss of the recording device.\n\nThe combination of the environmental sound, the speaker’s language, and his tone strongly suggests a scenario of home office disruption—perhaps someone working from home is being disturbed by renovation or repair work happening nearby. The recording ends abruptly, mid-hammer, further emphasizing the spontaneous and candid nature of the capture.\n\nIn summary, the audio is a realistic, low-fidelity snapshot of a Mandarin-speaking man, likely in China, expressing frustration at being unable to work in peace due to nearby construction or repair activity, captured in a personal, indoor setting.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 160,
    "completion_tokens": 387,
    "total_tokens": 547,
    "prompt_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "completion_tokens_details": {
      "text_tokens": 387
    }
  },
  "created": 1758002134,
  "system_fingerprint": null,
  "model": "qwen3-omni-30b-a3b-captioner",
  "id": "chatcmpl-f4155bf9-b860-49d6-8ee2-092da7359097"
}
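The usage object in the response can be used to estimate what a request costs. A minimal sketch, assuming the Singapore prices from the pricing table ($3.81 input and $3.06 output per million tokens) and ignoring any free-quota deduction:

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      input_price: float = 3.81,
                      output_price: float = 3.06) -> float:
    """Estimate request cost from token usage.

    Prices are USD per million tokens (Singapore region, from the
    pricing table above); free-quota deductions are not modeled.
    """
    return (prompt_tokens * input_price
            + completion_tokens * output_price) / 1_000_000

# Token counts from the sample response above.
cost = estimate_cost_usd(prompt_tokens=160, completion_tokens=387)
print(f"${cost:.6f}")  # $0.001794
```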

DashScope

Python

import dashscope
import os

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages
)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"}
                ]
            }
        ]
    }
}'

Response

{
  "output":{
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "The audio clip is a 6-second, high-fidelity recording set in a quiet, indoor environment. The primary sound is a male speaker, likely in his late teens to mid-20s, speaking Mandarin Chinese in a tone of mild exasperation. His speech is clear and natural, delivered in a conversational manner: “Oh, how can I possibly work quietly like this?” His voice is close to the microphone, and the room is acoustically neutral, with no noticeable echo or background noise, suggesting a small, well-furnished space.\n\nOverlaying the speech is a persistent, rhythmic mechanical sound—a series of sharp, metallic clicks or clatters that repeat every 0.6 seconds. The sound is dry and lacks any reverberation, further supporting the inference that it is produced by a mechanical device very close to the microphone. The regularity and timbre of the sound suggest a small, metallic object (such as a key, coin, or pen) being repeatedly tapped or struck on a hard surface, rather than a larger or more complex machine.\n\nThe speaker’s complaint is a direct response to the mechanical noise, expressing frustration at being unable to concentrate or work in peace due to the disturbance. The tone is not angry or urgent, but rather one of resigned annoyance, typical of someone encountering a minor, persistent annoyance in a personal or domestic setting.\n\nThere are no other voices, music, or environmental cues present. The overall impression is of a brief, candid moment—perhaps a student, office worker, or someone in a quiet home environment—caught on microphone while complaining (to themselves or a nearby companion) about a distracting, repetitive noise. The recording is technically clean and focused, with all attention on the speaker and the mechanical sound, making it highly plausible that the clip was captured intentionally, possibly for a voice note, social media post, or as a sample for a sound effect library."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "input_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "total_tokens": 559,
    "output_tokens": 399,
    "input_tokens": 160,
    "output_tokens_details": {
      "text_tokens": 399
    }
  },
  "request_id": "d532f72c-e75b-4ffb-a1ef-d2465e758958"
}

How it works

  • Single-turn interaction: The model does not support multi-turn conversation. Each request is an independent analysis task.

  • Fixed task: The model's core task is to generate audio descriptions, in English only. You cannot change its behavior through instructions such as a system message, for example to control the output format or content focus.

  • Audio input only: The model accepts only audio as input. You do not need to pass a text prompt. The format of the messages parameter is fixed.

    Example message format

    OpenAI compatible

    messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"
                        }
                    }
                ]
            }
        ]

    DashScope

    messages = [
        {
            "role": "user",
            "content": [
                {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"}
            ]
        }
    ]

Streaming output

After the model receives input, it generates intermediate results step-by-step. The final result is a combination of these intermediate results. This method of generating and outputting results simultaneously is called streaming output. With streaming output, you can read the response as it is generated, which reduces your waiting time.

OpenAI compatible

To enable streaming output with the OpenAI-compatible API, set the stream parameter to true in your request.

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"
                    }
                }
            ]
        }
    ],
    stream=True,
    stream_options={"include_usage": True},

)
for chunk in completion:
    # If stream_options.include_usage is True, the choices field of the last chunk is an empty list and should be skipped. You can get the token usage from chunk.usage.
    if chunk.choices and chunk.choices[0].delta.content != "":
        print(chunk.choices[0].delta.content, end="")

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"
                     },
            }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        // Write incrementally so chunks are not separated by extra newlines.
        process.stdout.write(chunk.choices[0].delta.content ?? "");
    } else {
        // The final chunk carries token usage when include_usage is set.
        console.log(chunk.usage);
    }
}

curl

# ======= Important =======
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"
          }
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{
        "include_usage":true
    }
}'

DashScope

You can enable streaming output through the DashScope SDK or over HTTP. Set the parameters as follows based on your call method:

  • Python SDK: Set the stream parameter to True.

  • Java SDK: Use the streamCall method.

  • HTTP: In the header, set X-DashScope-SSE to enable.

By default, streaming output is non-incremental. This means that each returned chunk contains all previously generated content. If you want incremental streaming output, set the incremental_output parameter (or incrementalOutput for Java) to true.

Python

import os
import dashscope

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
     # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
    stream=True,
    incremental_output=True
)

full_content = ""
print("Streaming output:")
for chunk in response:
    if chunk["output"]["choices"][0]["message"].content:
        print(chunk["output"]["choices"][0]["message"].content[0]["text"])
        full_content += chunk["output"]["choices"][0]["message"].content[0]["text"]
print(f"Full content: {full_content}")

Java

import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void streamCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // qwen3-omni-30b-a3b-captioner supports only one audio file as input.
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav");}}
                )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                 // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                var content = item.getOutput().getChoices().get(0).getMessage().getContent();
                // Check if content exists and is not empty.
                if (content != null &&  !content.isEmpty()) {
                    System.out.println(content.get(0).get("text"));
                }
            } catch (Exception e){
                System.exit(0);
            }
        });
    }

    public static void main(String[] args) {
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20240916/xvappi/Renovation_Noise.wav"}
                ]
            }
        ]
    },
    "parameters": {
      "incremental_output": true
    }
}'

Pass a local file (Base64 encoding or file path)

The model provides two methods to upload a local file:

  • Base64 encoding

  • Direct file path (Recommended for more stable transmission)

Upload methods:

Pass by file path

Pass the file path directly to the model. This method is supported only by the DashScope Python and Java SDKs, not by HTTP. Refer to the following table to specify the file path based on your programming language and operating system.

Specify the file path

  • Linux or macOS (Python SDK and Java SDK): file://{absolute_path_of_the_file}

    Example: file:///home/images/test.mp3

  • Windows (Python SDK): file://{absolute_path_of_the_file}

    Example: file://D:/images/test.mp3

  • Windows (Java SDK): file:///{absolute_path_of_the_file}

    Example: file:///D:/images/test.mp3
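
The prefix rule above can be sketched in Python. `to_file_uri` is a hypothetical helper, not part of any SDK; it applies the Python SDK convention only (the Java SDK on Windows needs three slashes instead).

```python
def to_file_uri(absolute_path: str) -> str:
    """Apply the Python SDK's file-path convention: prefix with "file://".

    On Linux/macOS the absolute path already starts with "/", so the result
    has three slashes (file:///home/...); on Windows drive paths (D:/...)
    it has two (file://D:/...).
    """
    return "file://" + absolute_path

print(to_file_uri("/home/images/test.mp3"))  # file:///home/images/test.mp3
print(to_file_uri("D:/images/test.mp3"))     # file://D:/images/test.mp3
```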

Pass by Base64 encoding

Convert the file to a Base64-encoded string and then pass it to the model.

Steps to pass a Base64-encoded string

  1. Encode the file: Convert the local audio file to a Base64 string.

    Example: Converting an audio file to a Base64 string

    import base64

    # Encoding function: Converts a local audio file to a Base64-encoded string
    def encode_audio(audio_path):
        with open(audio_path, "rb") as audio_file:
            return base64.b64encode(audio_file.read()).decode("utf-8")

    # Replace xxxx/test.mp3 with the absolute path of your local audio file
    base64_audio = encode_audio("xxxx/test.mp3")
  2. Construct a Data URL in the following format: data:;base64,{base64_audio}, where base64_audio is the Base64 string generated in the previous step.

  3. Invoke the model: Pass the Data URL using the audio (DashScope SDK) or input_audio (OpenAI compatible SDK) parameter.
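
The encoding and Data URL steps above can be combined into one small function. `audio_to_data_url` is an illustrative name, not an SDK API:

```python
import base64

def audio_to_data_url(audio_path: str) -> str:
    # Step 1: read the local file and Base64-encode it.
    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    # Step 2: build the Data URL. The "data:" prefix and the ";base64,"
    # marker before the encoded payload are both required.
    return f"data:;base64,{b64}"
```

The returned string can then be passed in the audio (DashScope SDK) or input_audio (OpenAI compatible SDK) parameter.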

Limits:

  • We recommend passing the file path directly for greater stability. You can also use Base64 encoding for files smaller than 1 MB.

  • When passing a file path directly, the audio file must be smaller than 10 MB.

  • When passing a file using Base64 encoding, the encoded string must be smaller than 10 MB. Base64 encoding increases the data size.
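
Because Base64 inflates data by about 4/3, you can pre-check whether a file fits the encoded-string limit before encoding it. This is a sketch assuming the 10 MB figure above; the helper names are our own:

```python
import os

LIMIT_BYTES = 10 * 1024 * 1024  # the encoded string must be smaller than 10 MB

def base64_size_bytes(raw_bytes: int) -> int:
    # Base64 expands every 3 raw bytes into 4 output characters (with padding).
    return 4 * ((raw_bytes + 2) // 3)

def fits_base64_limit(path: str) -> bool:
    return base64_size_bytes(os.path.getsize(path)) < LIMIT_BYTES
```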

Pass by file path

Passing a file path is supported only by the DashScope Python and Java SDKs, not by HTTP.

Python

import os
import dashscope

# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full path of the local file must be prefixed with file:// to ensure a valid path, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
    {
        "role": "user",
        # Pass the file path prefixed with file:// in the audio parameter.
        "content": [{"audio": audio_file_path}],
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner", 
    messages=messages)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException {

        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
        // The full path of the local file must be prefixed with file:// to ensure a valid path, for example: file:///home/images/test.mp3
        // The current test system is macOS. If you use Windows, use "file:///ABSOLUTE_PATH/welcome.mp3" instead.

        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", localFilePath);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                 // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Pass by Base64 encoding

OpenAI compatible

Python

import os
from openai import OpenAI
import base64

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")


# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        # When passing a local file with Base64 encoding, you must use the data: prefix to form a valid data URL.
                        # The "base64" keyword must appear before the encoded data (base64_audio); otherwise, an error occurs.
                        "data": f"data:;base64,{base64_audio}"
                    },
                }
            ],
        },
    ]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeAudio = (audioPath) => {
    const audioFile = readFileSync(audioPath);
    return audioFile.toString('base64');
};
//  Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
const base64Audio = encodeAudio("xxx/ABSOLUTE_PATH/welcome.mp3")

const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": { "data": `data:;base64,${base64Audio}`}
            }]
        }]
});

console.log(completion.choices[0].message.content);

curl

  • For information about how to convert a file to a Base64-encoded string, see the code sample.

  • The Base64 string "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." is truncated for demonstration purposes. In practice, you must pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."
            
          }
        }
      ]
    }
  ]
}'

DashScope

Python

import os
import base64
import dashscope 

# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"
# Encoding function: Converts a local file to a Base64-encoded string
def encode_audio(audio_file_path):
    with open(audio_file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

messages = [
    {
        "role": "user",
        # When passing a local file with Base64 encoding, you must use the data: prefix to form a valid data URL.
        # The "base64" keyword must appear before the encoded data (base64_audio); otherwise, an error occurs.
        "content": [{"audio":f"data:;base64,{base64_audio}"}],
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
    )
print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeAudioToBase64(String audioPath) throws IOException {
        Path path = Paths.get(audioPath);
        byte[] audioBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(audioBytes);
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException,IOException{
        // Replace ABSOLUTE_PATH/welcome.mp3 with the actual path of your local file.
        String localFilePath = "ABSOLUTE_PATH/welcome.mp3";
        String base64Audio = encodeAudioToBase64(localFilePath);

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // When passing a local file with Base64 encoding, you must use the data: prefix to form a valid data URL.
                // The "base64" keyword must appear before the encoded data (base64Audio); otherwise, an error occurs.
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "data:;base64," + base64Audio);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen3-omni-30b-a3b-captioner")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
               // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For information about how to convert a file to a Base64-encoded string, see the code sample.

  • The Base64 string "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." is truncated for demonstration purposes. In practice, you must pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."}
                ]
            }
        ]
    }
}'

API reference

For more information about the input and output parameters of Qwen3-Omni-Captioner, see Qwen.

Error codes

If a call fails, see Error messages for troubleshooting.

FAQ

How to compress an audio file to the required size?

  • Online tools: You can use online tools such as Compresss to compress audio files.

  • Code implementation: You can use the FFmpeg tool. For more information about its usage, see the official FFmpeg website.

    # Basic conversion command (universal template)
    # -i: Specifies the input file path. Example: input.mp3
    # -b:a: Sets the audio bitrate.
    #   Common values: 64k (low quality, for voice and low-bandwidth streaming), 128k (medium quality, for general audio and podcasts), 192k (high quality, for music and broadcasting).
    #   A higher bitrate results in better audio quality and a larger file size.
    # -ar: Sets the audio sample rate, which is the number of samples per second.
    #   Common values: 8000 Hz, 22050 Hz, 44100 Hz (standard sample rate).
    #   A higher sample rate results in a larger file size.
    # -ac: Sets the number of audio channels. Common values: 1 (mono), 2 (stereo). Mono files are smaller.
    # -y: Overwrites the output file if it exists (no value needed).
    # output.mp3: Specifies the output file path.
    ffmpeg -y -i input.mp3 -b:a 128k -ar 44100 -ac 1 output.mp3
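
Before compressing, you can estimate the output size from the target bitrate and the clip's duration. This is a rough sketch (containers add a little overhead, and the helper name is ours):

```python
def estimated_size_mb(bitrate_kbps: int, duration_seconds: float) -> float:
    # size ≈ bitrate (kbit/s) × duration (s) / 8 → kilobytes; / 1024 → MB
    return bitrate_kbps * duration_seconds / 8 / 1024

# A 60-second clip at 128 kbps is roughly 0.94 MB:
print(round(estimated_size_mb(128, 60), 2))  # 0.94
```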

Limitations

The model has the following limits for audio files:

  • Duration: Less than or equal to 40 minutes.

  • Number of files: Only one audio file is supported per request.

  • File formats: Supported formats include AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3.

  • File input methods: Publicly accessible audio URL, Base64 encoding, or local file path.

  • File size: When passing a file path, the audio file must be smaller than 10 MB. When passing a file using Base64 encoding, the encoded string must be smaller than 10 MB.

    To compress a file, see How to compress an audio file to the required size?
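
The format and size limits above can be checked client-side before calling the API. This sketch assumes the listed formats map to their usual file extensions and uses the 10 MB local-file limit; both names are hypothetical, and the 40-minute duration limit is not checked because that requires decoding the audio:

```python
import os

SUPPORTED_EXTENSIONS = {".amr", ".wav", ".3gp", ".3gpp", ".aac", ".mp3"}
MAX_LOCAL_FILE_BYTES = 10 * 1024 * 1024  # local files must be smaller than 10 MB

def precheck_audio(path: str) -> None:
    # Reject unsupported formats by extension (a heuristic; the real check is the codec).
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {ext}")
    # Reject files over the local-file size limit.
    if os.path.getsize(path) >= MAX_LOCAL_FILE_BYTES:
        raise ValueError("Audio file too large; compress it first (see FAQ)")
```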