The Qwen audio file recognition models accurately convert recorded audio into text. These models support features such as multi-language recognition, singing voice recognition, and noise rejection.
Supported models
International (Singapore)
| Model | Version | Supported languages | Supported sample rates | Unit price | Free quota (Note) |
| --- | --- | --- | --- | --- | --- |
| qwen3-asr-flash (currently equivalent to qwen3-asr-flash-2025-09-08) | Stable | Chinese (Mandarin, Sichuanese, Minnan, Wu, Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish | 16 kHz | $0.000035/second | 36,000 seconds (10 hours). Valid for 90 days after you activate Model Studio. |
| qwen3-asr-flash-2025-09-08 | Snapshot | Same as above | 16 kHz | $0.000035/second | Same as above |
Mainland China (Beijing)
| Model | Version | Supported languages | Supported sample rates | Unit price |
| --- | --- | --- | --- | --- |
| qwen3-asr-flash (currently equivalent to qwen3-asr-flash-2025-09-08) | Stable | Chinese (Mandarin, Sichuanese, Minnan, Wu, Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish | 16 kHz | $0.000032/second |
| qwen3-asr-flash-2025-09-08 | Snapshot | Same as above | 16 kHz | $0.000032/second |
Features
| Feature | Qwen3-ASR |
| --- | --- |
| Connection type | Java/Python SDK, HTTP API |
| Multi-language recognition | Chinese (Mandarin, Sichuanese, Minnan, Wu), Cantonese, English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese |
| Context enhancement | ✅ Configure context using the text parameter of the system message. |
| Emotion recognition | ✅ |
| Language detection | ✅ |
| Specify recognition language | ✅ If the audio language is known, specify it using the language parameter in asr_options to improve recognition accuracy. |
| Singing voice recognition | ✅ |
| Noise rejection | ✅ |
| ITN (Inverse Text Normalization) | ✅ Enable by setting the enable_itn parameter in asr_options to true. |
| Punctuation prediction | ✅ |
| Streaming output | ✅ |
| Audio input method | Public audio file URL or local file |
| Supported audio formats | aac, amr, avi, aiff, flac, flv, m4a, mkv, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
| Audio channels | Mono |
| Input audio sample rate | 16 kHz |
| Audio file size and duration | The audio file cannot exceed 100 MB in size or 20 minutes in duration. |
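Before you call the API, you can pre-check a file against these size and duration limits. The following is a minimal sketch, assuming ffprobe from the FFmpeg suite is installed (see the FAQ below); the helper name check_audio_limits is illustrative, not part of any SDK:
import json
import subprocess
from pathlib import Path

def check_audio_limits(path: str, max_mb: int = 100, max_minutes: int = 20) -> bool:
    # Illustrative pre-flight check against the limits in the table above.
    size_mb = Path(path).stat().st_size / (1024 * 1024)
    # ffprobe prints the container duration as JSON; requires FFmpeg to be installed.
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    duration_s = float(json.loads(probe.stdout)["format"]["duration"])
    return size_mb <= max_mb and duration_s <= max_minutes * 60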
Getting started
An online trial is not currently available. To use the model, you must call the API. The following sections provide sample code for API calls.
Before you start, make sure you have created an API key and exported it as an environment variable. If you call the API through an SDK, you must also install the latest version of the DashScope SDK.
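As a quick sanity check before you run the Python samples in this topic (a minimal sketch, not a required step):
# Requires the DashScope Python SDK: pip install -U dashscope
import os

# All samples in this topic read the API key from this environment variable.
assert os.getenv("DASHSCOPE_API_KEY"), "Set the DASHSCOPE_API_KEY environment variable first"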
Qwen3-ASR
The Qwen3-ASR model supports only single-turn invocation. It does not support multi-turn conversations or custom prompts (system prompts or user prompts).
Audio file URL
Python
import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh",  # Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn": True
    }
)
print(response)
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
.build();
MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
// Configure the context for custom recognition here
.content(Arrays.asList(Collections.singletonMap("text", "")))
.build();
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_itn", true);
// asrOptions.put("language", "zh"); // Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
MultiModalConversationParam param = MultiModalConversationParam.builder()
// The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen3-asr-flash")
.message(userMessage)
.message(sysMessage)
.parameter("asr_options", asrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(JsonUtils.toJson(result));
}
public static void main(String[] args) {
try {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
You can configure the context for custom recognition using the text parameter of the System Message.
# ======= Important =======
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===
curl --location --request POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen3-asr-flash",
"input": {
"messages": [
{"content": [{"text": ""}],"role": "system"},
{"content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}],"role": "user"}
]
},
"parameters": {
"asr_options": {
"enable_itn": true
}
}
}'
Local file
When you use the DashScope SDK to process local audio files, you must provide a file path. The following table shows how to construct the file path for your operating system.
| System | SDK | File path to pass | Example |
| --- | --- | --- | --- |
| Linux or macOS | Python SDK | file://{absolute_path_of_the_file} | file:///home/audio/test.wav |
| Linux or macOS | Java SDK | file://{absolute_path_of_the_file} | file:///home/audio/test.wav |
| Windows | Python SDK | file://{absolute_path_of_the_file} | file://D:/audio/test.wav |
| Windows | Java SDK | file:///{absolute_path_of_the_file} | file:///D:/audio/test.wav |
When you use local files, the API call limit is 100 QPS, and this limit cannot be increased. This method is not recommended for production environments, high concurrency, or stress testing scenarios. For higher concurrency, you can upload the file to OSS and call the API using the audio file URL.
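If you need to build these paths programmatically, the following is a minimal sketch for the Python SDK (the helper name to_dashscope_file_uri is illustrative, not part of the SDK):
import platform
from pathlib import Path

def to_dashscope_file_uri(path_str: str) -> str:
    # Illustrative helper that produces the path formats from the table above.
    p = Path(path_str).resolve()
    if platform.system() == "Windows":
        # Windows, Python SDK: file://D:/audio/test.wav (two slashes)
        return "file://" + str(p).replace("\\", "/")
    # Linux or macOS: file:///home/audio/test.wav (three slashes)
    return "file://" + str(p)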
Python
import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition
    {"role": "user", "content": [{"audio": audio_file_path}]}
]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh",  # Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn": True
    }
)
print(response)
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
// Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local file.
String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", localFilePath)))
.build();
MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
// Configure the context for custom recognition here
.content(Arrays.asList(Collections.singletonMap("text", "")))
.build();
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_itn", true);
// asrOptions.put("language", "zh"); // Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
MultiModalConversationParam param = MultiModalConversationParam.builder()
// The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen3-asr-flash")
.message(userMessage)
.message(sysMessage)
.parameter("asr_options", asrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(JsonUtils.toJson(result));
}
public static void main(String[] args) {
try {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
Streaming output
The model can generate results progressively. In non-streaming mode, the API returns the complete result after the entire generation process is finished. In streaming mode, the API returns intermediate results in real time as they are generated. This reduces the waiting time. To enable streaming output, you must set a specific parameter based on the calling method:
DashScope Python SDK: Set the stream parameter to True.
DashScope Java SDK: Call the API through the streamCall interface.
DashScope HTTP: Set the X-DashScope-SSE header to enable.
Python
import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
responses = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh",  # Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn": True
    },
    stream=True
)
for response in responses:
    try:
        print(response["output"]["choices"][0]["message"].content[0]["text"])
    except (KeyError, IndexError, TypeError):
        # Skip chunks that do not contain text.
        pass
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
.build();
MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
// Configure the context for custom recognition here
.content(Arrays.asList(Collections.singletonMap("text", "")))
.build();
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_itn", true);
// asrOptions.put("language", "zh"); // Optional. If the audio language is known, specify it with this parameter to improve recognition accuracy.
MultiModalConversationParam param = MultiModalConversationParam.builder()
// The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen3-asr-flash")
.message(userMessage)
.message(sysMessage)
.parameter("asr_options", asrOptions)
.build();
Flowable<MultiModalConversationResult> resultFlowable = conv.streamCall(param);
resultFlowable.blockingForEach(item -> {
try {
System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
} catch (Exception e){
System.exit(0);
}
});
}
public static void main(String[] args) {
try {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
You can configure the context for custom recognition using the text parameter of the System Message.
# ======= Important =======
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===
curl --location --request POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-SSE: enable' \
--data '{
"model": "qwen3-asr-flash",
"input": {
"messages": [
{
"content": [
{
"text": ""
}
],
"role": "system"
},
{
"content": [
{
"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
],
"role": "user"
}
]
},
"parameters": {
"incremental_output": true,
"asr_options": {
"enable_itn": true
}
}
}'
Core usage: Contextual biasing
With Qwen3-ASR, you can provide context to improve the recognition of domain-specific vocabulary, such as names, places, and product terms. This feature significantly improves transcription accuracy and is more flexible and powerful than traditional hotword solutions.
Length limit: The context content cannot exceed 10,000 tokens.
Usage: When calling the API, pass the text in the text parameter of the system message.
Supported text types include the following:
Hotword lists in any separator format (for example, comma-separated, space-separated, or a bracketed list)
Text paragraphs or chapters of any format and length
Mixed content: Any combination of word lists and paragraphs
Irrelevant or meaningless text (including garbled text). The model has a high fault tolerance for irrelevant text, which rarely has a negative impact on performance.
Example:
The correct recognition result for a certain audio clip should be "What jargon from the investment banking circle do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB ...".
| Without contextual biasing | With contextual biasing |
| --- | --- |
| Some investment bank names are recognized incorrectly. For example, "Bird Rock" should be "Bulge Bracket". Recognition result: "What jargon from the investment banking circle do you know? First, the nine major foreign investment banks, Bird Rock, BB ..." | The investment bank names are recognized correctly. Recognition result: "What jargon from the investment banking circle do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB ..." |
To achieve this result, you can add any of the following content to the context:
Word list:
Word list 1: Bulge Bracket, Boutique, Middle Market, domestic securities firms
Word list 2: Bulge Bracket Boutique Middle Market domestic securities firms
Word list 3: ['Bulge Bracket', 'Boutique', 'Middle Market', 'domestic securities firms']
Natural language:
Investment Banking Categories Revealed! Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it to everyone. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms. Bulge Bracket Investment Banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large firms are enormous in both business scope and scale. Boutique Investment Banks: These investment banks are relatively small in scale but are very focused in their business areas. For example, Lazard, Evercore, etc., have deep professional knowledge and experience in specific fields. Middle Market Investment Banks: This type of investment bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major firms, they have high influence in specific markets. Domestic Securities Firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market. In addition, there are some divisions of Position and business, which you can refer to in the relevant charts. I hope this information helps everyone better understand investment banking and prepare for their future careers!
Natural language with interference: Some text is irrelevant to the recognition content, such as the names in the following example.
Investment Banking Categories Revealed! Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it to everyone. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms. Bulge Bracket Investment Banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large firms are enormous in both business scope and scale. Boutique Investment Banks: These investment banks are relatively small in scale but are very focused in their business areas. For example, Lazard, Evercore, etc., have deep professional knowledge and experience in specific fields. Middle Market Investment Banks: This type of investment bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major firms, they have high influence in specific markets. Domestic Securities Firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market. In addition, there are some divisions of Position and business, which you can refer to in the relevant charts. I hope this information helps everyone better understand investment banking and prepare for their future careers! Wang Haoxuan, Li Zihan, Zhang Jingxing, Liu Xinyi, Chen Junjie, Yang Siyuan, Zhao Yutong, Huang Zhiqiang, Zhou Zimo, Wu Yajing, Xu Ruoxi, Sun Haoran, Hu Jinyu, Zhu Chenxi, Guo Wenbo, He Jingshu, Gao Yuhang, Lin Yifei Zheng Xiaoyan, Liang Bowen, Luo Jiaqi, Song Mingzhe, Xie Wanting, Tang Ziqian, Han Mengyao, Feng Yiran, Cao Qinxue, Deng Zirui, Xiao Wangshu, Xu Jiashu Cheng Yinuo, Yuan Zhiruo, Peng Haoyu, Dong Simiao, Fan Jingyu, Su Zijin, Lv Wenxuan, Jiang Shihan, Ding Muchen Wei Shuyao, Ren Tianyou, Jiang Yichen, Hua Qingyu, Shen Xinghe, Fu Jinyu, Yao Xingchen, Zhong Lingyu, Yan Licheng, Jin Ruoshui, Taoranting, Qi Shaoshang, Xue Zhilan, Zou Yunfan, Xiong Ziang, Bai Wenfeng, Yi Qianfan
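To apply such context in a call, pass it as the text of the system message. The following is a minimal sketch based on the URL-input Python sample earlier in this topic (the audio URL is the sample file from that section and serves only as a placeholder):
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Domain vocabulary passed as the system message text biases recognition toward these terms.
context = "Bulge Bracket, Boutique, Middle Market, domestic securities firms"
messages = [
    {"role": "system", "content": [{"text": context}]},
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
)
print(response)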
API reference
Audio file recognition - Qwen API reference
FAQ
Q: How do I provide a publicly accessible audio URL for the API?
Use Alibaba Cloud Object Storage Service (OSS). OSS is a highly available and reliable storage service that lets you easily generate public access URLs.
Verify that the generated URL is accessible from the internet: Access the URL in a browser or using a curl command to ensure the audio file can be successfully downloaded or played (HTTP status code 200).
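The following is a minimal sketch using the oss2 Python SDK (pip install oss2); the bucket name, endpoint, object key, and credentials are placeholders you must replace, and the signed URL below expires after one hour:
import oss2

# Placeholder credentials and names: replace with your own.
auth = oss2.Auth("<AccessKeyId>", "<AccessKeySecret>")
bucket = oss2.Bucket(auth, "https://oss-cn-beijing.aliyuncs.com", "<your-bucket-name>")

# Upload the local file, then generate a signed URL that is valid for 3600 seconds.
bucket.put_object_from_file("audios/test.wav", "/local/path/test.wav")
url = bucket.sign_url("GET", "audios/test.wav", 3600)
print(url)  # Pass this URL in the "audio" field of the user message.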
Q: How do I check if the audio format meets the requirements?
Use the open-source tool ffprobe to quickly obtain detailed information about the audio:
# Query the container format (format_name), encoding (codec_name), sample rate (sample_rate), and number of channels (channels)
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 your_audio_file.mp3
Q: How do I process audio to meet the model's requirements?
Use the open-source tool FFmpeg to crop or convert the audio format:
Audio cropping: Extract a segment from a long audio file
# -i: input file
# -ss 00:01:30: Set the start time for cropping (starts at 1 minute and 30 seconds)
# -t 00:02:00: Set the duration of the crop (crops for 2 minutes)
# -c copy: Directly copy the audio stream without re-encoding for faster processing
# output_clip.wav: output file
ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav
Format conversion
For example, you can convert any audio to a 16 kHz, 16-bit, mono WAV file.
# -i: input file
# -ac 1: Set the number of channels to 1 (mono)
# -ar 16000: Set the sample rate to 16000 Hz (16 kHz)
# -sample_fmt s16: Set the sample format to 16-bit signed integer PCM
# output.wav: output file
ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav