Alibaba Cloud Model Studio: Audio file recognition - Qwen

Last Updated: Nov 13, 2025

The Qwen audio file recognition models accurately convert recorded audio into text. These models support features such as multi-language recognition, singing voice recognition, and noise rejection.

Supported models

International (Singapore)

Models: qwen3-asr-flash (Stable, currently equivalent to qwen3-asr-flash-2025-09-08) and qwen3-asr-flash-2025-09-08 (Snapshot)

  • Supported languages: Chinese (Mandarin, Sichuanese, Minnan, Wu, Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish

  • Supported sample rates: 16 kHz

  • Unit price: $0.000035/second

  • Free quota (Note): 36,000 seconds (10 hours), valid within 90 days after you activate Model Studio

Mainland China (Beijing)

Models: qwen3-asr-flash (Stable, currently equivalent to qwen3-asr-flash-2025-09-08) and qwen3-asr-flash-2025-09-08 (Snapshot)

  • Supported languages: Chinese (Mandarin, Sichuanese, Minnan, Wu, Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish

  • Supported sample rates: 16 kHz

  • Unit price: $0.000032/second

Features

The following features apply to Qwen3-ASR:

  • Connection type: Java/Python SDK, HTTP API

  • Multi-language: Chinese (Mandarin, Sichuanese, Minnan, Wu), Cantonese, English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese

  • Context enhancement: ✅ Configure context using the text request parameter for custom recognition

  • Emotion recognition: ✅

  • Language detection: ✅

  • Specify recognition language: ✅ If the audio language is known, specify it using the language request parameter to improve accuracy

  • Singing voice recognition: ✅

  • Noise rejection: ✅

  • ITN (Inverse Text Normalization): ✅ Enable by setting the enable_itn request parameter to true. This feature is only available for Chinese and English audio.

  • Punctuation prediction: ✅

  • Streaming output: ✅

  • Audio input method: Audio file URL or local file

  • Supported audio formats for detection: aac, amr, avi, aiff, flac, flv, m4a, mkv, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

  • Audio channel to detect: Mono

  • Input audio sampling rate: 16 kHz

  • Audio file size: The audio file cannot exceed 100 MB in size or 20 minutes in duration.

Getting started

An online trial is not currently available. To use the model, you must call the API. The following sections provide sample code for API calls.

Before you start, make sure that you have created an API key and exported it as an environment variable. If you call the API through an SDK, you must also install the latest version of the DashScope SDK.

Qwen3-ASR

The Qwen3-ASR model is invoked in a single turn. It does not support multi-turn conversation or custom prompts: the system and user messages are not interpreted as instructions, and the text field of the system message is used only to provide context for recognition.

Audio file URL

Python

import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        #"language": "zh", # Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn":True
    }
)
print(response)

The complete result is output to the console in JSON format. The result includes the status code, a unique request ID, the recognized content, and token usage information for the call.

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "annotations": [
                        {
                            "language": "zh",
                            "type": "audio_info",
                            "emotion": "neutral"
                        }
                    ],
                    "content": [
                        {
                            "text": "Welcome to Alibaba Cloud."
                        }
                    ],
                    "role": "assistant"
                }
            }
        ]
    },
    "usage": {
        "input_tokens_details": {
            "text_tokens": 0
        },
        "output_tokens_details": {
            "text_tokens": 6
        },
        "seconds": 1
    },
    "request_id": "568e2bf0-d6f2-97f8-9f15-a57b11dc6977"
}
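
If you only need the transcript, the detected audio language, and the billed duration, a minimal post-processing sketch (assuming the response object can be indexed like the JSON shown above; field names follow that structure) could look like this:

# Sketch: extract fields from the response structure shown above.
choice = response["output"]["choices"][0]
transcript = choice["message"]["content"][0]["text"]      # "Welcome to Alibaba Cloud."
audio_info = choice["message"]["annotations"][0]          # detected language and emotion
print(transcript)
print(audio_info["language"], audio_info["emotion"])      # "zh", "neutral"
print(response["usage"]["seconds"])                       # billed audio duration in seconds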

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
                .build();

        MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                // Configure the context for custom recognition here
                .content(Arrays.asList(Collections.singletonMap("text", "")))
                .build();

        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", true);
        // asrOptions.put("language", "zh"); // Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-asr-flash")
                .message(userMessage)
                .message(sysMessage)
                .parameter("asr_options", asrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(JsonUtils.toJson(result));
    }
    public static void main(String[] args) {
        try {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

The complete result is output to the console in JSON format. The result includes the status code, a unique request ID, the recognized content, and token usage information for the call.

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "annotations": [
                        {
                            "language": "zh",
                            "type": "audio_info",
                            "emotion": "neutral"
                        }
                    ],
                    "content": [
                        {
                            "text": "Welcome to Alibaba Cloud."
                        }
                    ],
                    "role": "assistant"
                }
            }
        ]
    },
    "usage": {
        "input_tokens_details": {
            "text_tokens": 0
        },
        "output_tokens_details": {
            "text_tokens": 6
        },
        "seconds": 1
    },
    "request_id": "568e2bf0-d6f2-97f8-9f15-a57b11dc6977"
}

curl

You can configure the context for custom recognition using the text parameter of the System Message.

# ======= Important =======
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl --location --request POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3-asr-flash",
    "input": {
        "messages": [
            {"content": [{"text": ""}],"role": "system"},
            {"content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}],"role": "user"}
        ]
    },
    "parameters": {
        "asr_options": {
            "enable_itn": true
        }
    }
}'

The complete result is output to the console in JSON format. The result includes the status code, a unique request ID, the recognized content, and token usage information for the call.

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "annotations": [
                        {
                            "language": "zh",
                            "type": "audio_info",
                            "emotion": "neutral"
                        }
                    ],
                    "content": [
                        {
                            "text": "Welcome to Alibaba Cloud."
                        }
                    ],
                    "role": "assistant"
                }
            }
        ]
    },
    "usage": {
        "input_tokens_details": {
            "text_tokens": 0
        },
        "output_tokens_details": {
            "text_tokens": 6
        },
        "seconds": 1
    },
    "request_id": "568e2bf0-d6f2-97f8-9f15-a57b11dc6977"
}

Local file

When you use the DashScope SDK to process local audio files, you must provide a file path. The following examples show how to format the file path for your operating system and SDK.

  • Linux or macOS (Python SDK and Java SDK): file://{absolute_path_of_the_file}, for example file:///home/audio/test.wav

  • Windows (Python SDK): file://{absolute_path_of_the_file}, for example file://D:/audio/test.wav

  • Windows (Java SDK): file:///{absolute_path_of_the_file}, for example file:///D:/audio/test.wav

Important

When you use local files, the API call limit is 100 QPS, and this limit cannot be increased. This method is not recommended for production environments, high concurrency, or stress testing scenarios. For higher concurrency, you can upload the file to OSS and call the API using the audio file URL.

Python

import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"

messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition
    {"role": "user", "content": [{"audio": audio_file_path}]}
]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh", # Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn":True
    }
)
print(response)

The complete result is output to the console in JSON format. The result includes the status code, a unique request ID, the recognized content, and token usage information for the call.

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "annotations": [
                        {
                            "language": "zh",
                            "type": "audio_info",
                            "emotion": "neutral"
                        }
                    ],
                    "content": [
                        {
                            "text": "Welcome to Alibaba Cloud."
                        }
                    ],
                    "role": "assistant"
                }
            }
        ]
    },
    "usage": {
        "input_tokens_details": {
            "text_tokens": 0
        },
        "output_tokens_details": {
            "text_tokens": 6
        },
        "seconds": 1
    },
    "request_id": "568e2bf0-d6f2-97f8-9f15-a57b11dc6977"
}

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local file.
        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", localFilePath)))
                .build();

        MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                // Configure the context for custom recognition here
                .content(Arrays.asList(Collections.singletonMap("text", "")))
                .build();

        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", true);
        // asrOptions.put("language", "zh"); // Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-asr-flash")
                .message(userMessage)
                .message(sysMessage)
                .parameter("asr_options", asrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(JsonUtils.toJson(result));
    }
    public static void main(String[] args) {
        try {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

The complete result is output to the console in JSON format. The result includes the status code, a unique request ID, the recognized content, and token usage information for the call.

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "annotations": [
                        {
                            "language": "zh",
                            "type": "audio_info",
                            "emotion": "neutral"
                        }
                    ],
                    "content": [
                        {
                            "text": "Welcome to Alibaba Cloud."
                        }
                    ],
                    "role": "assistant"
                }
            }
        ]
    },
    "usage": {
        "input_tokens_details": {
            "text_tokens": 0
        },
        "output_tokens_details": {
            "text_tokens": 6
        },
        "seconds": 1
    },
    "request_id": "568e2bf0-d6f2-97f8-9f15-a57b11dc6977"
}

Streaming output

The model can generate results progressively. In non-streaming mode, the API returns the complete result after the entire generation process is finished. In streaming mode, the API returns intermediate results in real time as they are generated. This reduces the waiting time. To enable streaming output, you must set a specific parameter based on the calling method:

  • DashScope Python SDK: Set the stream parameter to true.

  • DashScope Java SDK: Call the API through the streamCall interface.

  • DashScope HTTP: Set the X-DashScope-SSE header to enable.

Python

import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh", # Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn":True
    },
    stream=True
)

for chunk in response:
    try:
        print(chunk["output"]["choices"][0]["message"].content[0]["text"])
    except (IndexError, KeyError, TypeError):
        # Some chunks (for example, the final stop chunk) may not contain text
        pass

The intermediate recognition results are output to the console as strings.

Welcome
Welcome to
Welcome to Ali
Welcome to Alibaba Cloud
Welcome to Alibaba Cloud.
Welcome to Alibaba Cloud.
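
If you only need the final transcript, a minimal sketch (assuming, as shown above, that each partial result replaces the previous one) keeps the last non-empty text from a freshly created stream:

# Sketch: keep only the last non-empty partial result from the stream.
final_text = ""
for chunk in response:
    try:
        final_text = chunk["output"]["choices"][0]["message"].content[0]["text"]
    except (IndexError, KeyError, TypeError):
        pass  # some chunks (for example, the final stop chunk) carry no text
print(final_text)  # "Welcome to Alibaba Cloud."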

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
                .build();

        MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                // Configure the context for custom recognition here
                .content(Arrays.asList(Collections.singletonMap("text", "")))
                .build();

        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", true);
        // asrOptions.put("language", "zh"); // Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-asr-flash")
                .message(userMessage)
                .message(sysMessage)
                .parameter("asr_options", asrOptions)
                .build();
        Flowable<MultiModalConversationResult> resultFlowable = conv.streamCall(param);
        resultFlowable.blockingForEach(item -> {
            try {
                System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
            } catch (Exception e){
                System.exit(0);
            }
        });
    }

    public static void main(String[] args) {
        try {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

The intermediate recognition results are output to the console as strings.

Welcome
Welcome to
Welcome to Ali
Welcome to Alibaba Cloud
Welcome to Alibaba Cloud.
Welcome to Alibaba Cloud.

curl

You can configure the context for custom recognition using the text parameter of the System Message.

# ======= Important =======
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before execution ===

curl --location --request POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-SSE: enable' \
--data '{
    "model": "qwen3-asr-flash",
    "input": {
        "messages": [
            {
                "content": [
                    {
                        "text": ""
                    }
                ],
                "role": "system"
            },
            {
                "content": [
                    {
                        "audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                    }
                ],
                "role": "user"
            }
        ]
    },
    "parameters": {
        "incremental_output": true,
        "asr_options": {
            "enable_itn": true
        }
    }
}'

The intermediate recognition results are output to the console in JSON format.

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"zh","emotion": "neutral"}],"content":[{"text":"Welcome"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"output_tokens_details":{"text_tokens":2},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"zh","emotion": "neutral"}],"content":[{"text":"to"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"output_tokens_details":{"text_tokens":3},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}

id:3
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"zh","emotion": "neutral"}],"content":[{"text":"Alibaba"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"output_tokens_details":{"text_tokens":4},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}

id:4
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"zh","emotion": "neutral"}],"content":[{"text":"Cloud"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"output_tokens_details":{"text_tokens":5},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}

id:5
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"zh","emotion": "neutral"}],"content":[{"text":"."}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"output_tokens_details":{"text_tokens":6},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}

id:6
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"output_tokens_details":{"text_tokens":6},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}

Core usage: Contextual biasing

With Qwen3-ASR, you can provide context to improve the recognition of domain-specific vocabulary, such as names, places, and product terms. This feature significantly improves transcription accuracy and is more flexible and powerful than traditional hotword solutions.

Length limit: The context content cannot exceed 10,000 tokens.

Usage: When you call the API, pass the context in the text parameter of the system message (a code sketch follows the example at the end of this section).

Supported text types: The context can be any of the following:

  • Hotword lists in various separator formats (for example, comma-separated, space-separated, or bracketed lists)

  • Text paragraphs or chapters of any format and length

  • Mixed content: Any combination of word lists and paragraphs

  • Irrelevant or meaningless text (including garbled text). The model has a high fault tolerance for irrelevant text, which rarely has a negative impact on performance.

Example:

The correct recognition result for a certain audio clip should be "What jargon from the investment banking circle do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB ...".

Without contextual biasing

Without contextual biasing, some investment bank names are recognized incorrectly. For example, "Bird Rock" should be "Bulge Bracket".

Recognition result: "What jargon from the investment banking circle do you know? First, the nine major foreign investment banks, Bird Rock, BB ..."

With contextual biasing

With contextual biasing, the investment bank names are recognized correctly.

Recognition result: "What jargon from the investment banking circle do you know? First, the nine major foreign investment banks, Bulge Bracket, BB ..."

To achieve this result, you can add any of the following content to the context:

  • Word list:

    • Word list 1:

      Bulge Bracket, Boutique, Middle Market, domestic securities firms
    • Word list 2:

      Bulge Bracket Boutique Middle Market domestic securities firms
    • Word list 3:

      ['Bulge Bracket', 'Boutique', 'Middle Market', 'domestic securities firms']
  • Natural language:

    Investment Banking Categories Revealed!
    Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it to everyone. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
    Bulge Bracket Investment Banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large firms are enormous in both business scope and scale.
    Boutique Investment Banks: These investment banks are relatively small in scale but are very focused in their business areas. For example, Lazard, Evercore, etc., have deep professional knowledge and experience in specific fields.
    Middle Market Investment Banks: This type of investment bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major firms, they have high influence in specific markets.
    Domestic Securities Firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market.
    In addition, there are some divisions of Position and business, which you can refer to in the relevant charts. I hope this information helps everyone better understand investment banking and prepare for their future careers!
  • Natural language with interference: Some text is irrelevant to the recognition content, such as the names in the following example.

    Investment Banking Categories Revealed!
    Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it to everyone. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
    Bulge Bracket Investment Banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large firms are enormous in both business scope and scale.
    Boutique Investment Banks: These investment banks are relatively small in scale but are very focused in their business areas. For example, Lazard, Evercore, etc., have deep professional knowledge and experience in specific fields.
    Middle Market Investment Banks: This type of investment bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major firms, they have high influence in specific markets.
    Domestic Securities Firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market.
    In addition, there are some divisions of Position and business, which you can refer to in the relevant charts. I hope this information helps everyone better understand investment banking and prepare for their future careers!
    Wang Haoxuan, Li Zihan, Zhang Jingxing, Liu Xinyi, Chen Junjie, Yang Siyuan, Zhao Yutong, Huang Zhiqiang, Zhou Zimo, Wu Yajing, Xu Ruoxi, Sun Haoran, Hu Jinyu, Zhu Chenxi, Guo Wenbo, He Jingshu, Gao Yuhang, Lin Yifei 
    Zheng Xiaoyan, Liang Bowen, Luo Jiaqi, Song Mingzhe, Xie Wanting, Tang Ziqian, Han Mengyao, Feng Yiran, Cao Qinxue, Deng Zirui, Xiao Wangshu, Xu Jiashu 
    Cheng Yinuo, Yuan Zhiruo, Peng Haoyu, Dong Simiao, Fan Jingyu, Su Zijin, Lv Wenxuan, Jiang Shihan, Ding Muchen 
    Wei Shuyao, Ren Tianyou, Jiang Yichen, Hua Qingyu, Shen Xinghe, Fu Jinyu, Yao Xingchen, Zhong Lingyu, Yan Licheng, Jin Ruoshui, Taoranting, Qi Shaoshang, Xue Zhilan, Zou Yunfan, Xiong Ziang, Bai Wenfeng, Yi Qianfan
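
To pass such context in code, reuse the call pattern from the earlier sections and place the text in the system message. The following minimal sketch uses Word list 1 above as the context; the audio URL is a hypothetical placeholder:

import os
import dashscope

# Singapore region endpoint; for the Beijing region use https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Context for biasing: Word list 1 from the example above
context = "Bulge Bracket, Boutique, Middle Market, domestic securities firms"

messages = [
    {"role": "system", "content": [{"text": context}]},
    {"role": "user", "content": [{"audio": "https://example.com/investment_banking.mp3"}]}  # hypothetical URL
]
response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={"enable_itn": True}
)
print(response)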

API reference

Audio file recognition - Qwen API reference

FAQ

Q: How do I provide a publicly accessible audio URL for the API?

Use Alibaba Cloud Object Storage Service (OSS). OSS is a highly available and reliable storage service that lets you easily generate public access URLs.

Verify that the generated URL is accessible from the internet: Access the URL in a browser or using a curl command to ensure the audio file can be successfully downloaded or played (HTTP status code 200).
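
A minimal sketch using the oss2 Python SDK (the Alibaba Cloud OSS SDK; the bucket name, endpoint, object key, and credentials below are placeholders) uploads a local file and generates a time-limited signed URL that you can pass in the audio field:

import oss2

# Placeholder credentials and bucket details; replace them with your own values.
auth = oss2.Auth("YOUR_ACCESS_KEY_ID", "YOUR_ACCESS_KEY_SECRET")
bucket = oss2.Bucket(auth, "https://oss-cn-beijing.aliyuncs.com", "your-bucket-name")

# Upload the local audio file, then generate a signed URL valid for 1 hour.
bucket.put_object_from_file("audios/welcome.mp3", "/path/to/welcome.mp3")
audio_url = bucket.sign_url("GET", "audios/welcome.mp3", 3600)
print(audio_url)  # pass this URL in the "audio" field of the user message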

Q: How do I check if the audio format meets the requirements?

Use the open-source tool ffprobe to quickly obtain detailed information about the audio:

# Query the container format (format_name), encoding (codec_name), sample rate (sample_rate), and number of channels (channels)
ffprobe -v error -show_entries format=format_name:stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 your_audio_file.mp3
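
To run the same check programmatically, a hedged Python sketch (assuming ffprobe is on PATH and that the first stream is the audio stream; the limits come from the Features section above) could be:

import json
import subprocess

def check_audio(path: str) -> None:
    # Run ffprobe and parse its JSON output (requires ffprobe on PATH).
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_format", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    info = json.loads(out)
    stream = info["streams"][0]  # assumes the first stream is the audio stream (true for audio-only files)
    duration = float(info["format"]["duration"])
    size_mb = int(info["format"]["size"]) / (1024 * 1024)
    print("codec:", stream["codec_name"], "sample_rate:", stream["sample_rate"], "channels:", stream["channels"])
    # Limits from the Features section: 16 kHz, mono, no more than 100 MB or 20 minutes
    if int(stream["sample_rate"]) != 16000 or int(stream["channels"]) != 1 or size_mb > 100 or duration > 20 * 60:
        print("Audio may need conversion; see the FFmpeg examples below.")

check_audio("your_audio_file.mp3")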

Q: How do I process audio to meet the model's requirements?

Use the open-source tool FFmpeg to crop or convert the audio format:

  • Audio cropping: Extract a segment from a long audio file

    # -i: input file
    # -ss 00:01:30: Set the start time for cropping (starts at 1 minute and 30 seconds)
    # -t 00:02:00: Set the duration of the crop (crops for 2 minutes)
    # -c copy: Directly copy the audio stream without re-encoding for faster processing
    # output_clip.wav: output file
    ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav
  • Format conversion

    For example, you can convert any audio to a 16 kHz, 16-bit, mono WAV file.

    # -i: input file
    # -ac 1: Set the number of channels to 1 (mono)
    # -ar 16000: Set the sample rate to 16000 Hz (16 kHz)
    # -sample_fmt s16: Set the sample format to 16-bit signed integer PCM
    # output.wav: output file
    ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav