Transcribe pre-recorded audio into text. Non-real-time speech recognition models support multilingual recognition, sung-content recognition, noise reduction, and speaker diarization, making them well suited for meeting transcription, call analysis, subtitle generation, and similar use cases.
Overview
Transcribe pre-recorded audio and video files in bulk by submitting asynchronous tasks.
- Configurable features include speaker diarization, sensitive-word filtering, sentence- and word-level timestamps, and hotword enhancement.
- Asynchronously transcribes a single audio file of up to 12 hours and 2 GB.
- Accepts any sample rate and works with common audio and video formats, including AAC, WAV, and MP3.
For real-time scenarios such as live captioning, online meetings, or voice assistants, use Real-time speech recognition - Qwen instead. For guidance on choosing the right model, see Speech-to-text.
Prerequisites
- You have created an API key and stored it as an environment variable.
- To call the API through the DashScope SDK, install the latest SDK.
Quick start
Fun-ASR
Audio and video files are typically large, so the file-transcription API is asynchronous: submit the task, poll the query endpoint for its status, and retrieve the recognition result after the task completes.
Python
from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json
# The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API Key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
task_response = Transcription.async_call(
model='fun-asr',
file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
language_hints=['zh', 'en'] # language_hints is an optional parameter used to specify the language code of the audio to be recognized. For the value range, see the API reference documentation.
)
transcription_response = Transcription.wait(task=task_response.output.task_id)
if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4,
                             ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)
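The sample above relies on the SDK's blocking wait call. If you prefer to control the polling interval yourself, the submit, poll, retrieve flow can be sketched roughly as follows. This is only an illustration: `poll_until_done`, its parameters, and the doubling back-off are hypothetical and not part of the DashScope SDK, and the set of terminal statuses is assumed to be the one checked by the raw HTTP examples later in this document.

```python
import time
from typing import Callable

# Assumption: terminal statuses match those checked in the raw HTTP polling examples.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "UNKNOWN"}


def poll_until_done(fetch_status: Callable[[], str],
                    interval: float = 2.0,
                    max_interval: float = 30.0,
                    timeout: float = 3600.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch_status() until it returns a terminal status.

    The wait between queries doubles after each attempt, capped at
    max_interval. Raises TimeoutError once the accumulated wait reaches
    timeout without a terminal status.
    """
    waited = 0.0
    while True:
        status = fetch_status().upper()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"task still {status} after {waited:.0f}s")
        sleep(interval)
        waited += interval
        interval = min(interval * 2, max_interval)


if __name__ == "__main__":
    # Simulated status sequence; a real fetch_status would query the task endpoint.
    fake = iter(["PENDING", "RUNNING", "SUCCEEDED"])
    print(poll_until_done(lambda: next(fake), sleep=lambda s: None))
```

Passing the status query in as a callable keeps the loop independent of how the task is queried, so the same helper can wrap either an SDK call or a plain HTTP request.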
Java
import com.alibaba.dashscope.audio.asr.transcription.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
public class Main {
public static void main(String[] args) {
// The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
// Create the transcription request parameters.
TranscriptionParam param =
TranscriptionParam.builder()
// API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("fun-asr")
// language_hints is an optional parameter used to specify the language code of the audio to be recognized. For the value range, see the API reference documentation.
.parameter("language_hints", new String[]{"zh", "en"})
.fileUrls(
Arrays.asList(
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
.build();
try {
Transcription transcription = new Transcription();
// Submit the transcription request
TranscriptionResult result = transcription.asyncCall(param);
System.out.println("RequestId: " + result.getRequestId());
// Block and wait for the task to complete, then get the result
result = transcription.wait(
TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
// Get the transcription result
List<TranscriptionTaskResult> taskResultList = result.getResults();
if (taskResultList != null && taskResultList.size() > 0) {
for (TranscriptionTaskResult taskResult : taskResultList) {
String transcriptionUrl = taskResult.getTranscriptionUrl();
HttpURLConnection connection =
(HttpURLConnection) new URL(transcriptionUrl).openConnection();
connection.setRequestMethod("GET");
connection.connect();
BufferedReader reader =
new BufferedReader(new InputStreamReader(connection.getInputStream()));
Gson gson = new GsonBuilder().setPrettyPrinting().create();
JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
System.out.println(gson.toJson(jsonResult));
}
}
} catch (Exception e) {
System.out.println("error: " + e);
}
System.exit(0);
}
}
Qwen3-ASR-Flash-Filetrans
Qwen3-ASR-Flash-Filetrans is purpose-built for asynchronous transcription of audio files. It supports recordings of up to 12 hours, accepts only publicly accessible audio file URLs (local file upload is not supported), and returns the full recognition result in a single response after the task completes.
cURL
When you call the API with cURL, submit the task first to obtain a task_id, then use that ID to query the result.
Submit a task
The URL below targets the Singapore region. To use the model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription.
curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-Async: enable" \
-d '{
"model": "qwen3-asr-flash-filetrans",
"input": {
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
},
"parameters": {
"channel_id":[
0
],
"enable_itn": false,
"enable_words": true
}
}'
Get the task result
The URL below targets the Singapore region. To use the model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}. Replace {task_id} with the ID of the task you want to query.
curl -X GET 'https://dashscope-intl.aliyuncs.com/api/v1/tasks/{task_id}' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "X-DashScope-Async: enable" \
-H "Content-Type: application/json"
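The submit and query endpoints above differ between regions only in the host name. A minimal sketch of a helper that builds both URLs from a region label is shown below. The function names and the region keys (`"singapore"`, `"beijing"`) are hypothetical labels chosen for this example; only the URLs themselves come from this document.

```python
# Base URLs as listed in this document; the dictionary keys are illustrative labels.
BASE_URLS = {
    "singapore": "https://dashscope-intl.aliyuncs.com/api/v1",
    "beijing": "https://dashscope.aliyuncs.com/api/v1",
}


def submit_url(region: str) -> str:
    """Return the task-submission endpoint for the given region label."""
    return f"{BASE_URLS[region]}/services/audio/asr/transcription"


def task_url(region: str, task_id: str) -> str:
    """Return the task-query endpoint for the given region label and task ID."""
    return f"{BASE_URLS[region]}/tasks/{task_id}"


if __name__ == "__main__":
    print(submit_url("singapore"))
    print(task_url("beijing", "your-task-id"))
```

Remember that API keys are region-specific, so switching the region label also means switching the key.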
Download the recognition result
After the task succeeds, the output.result.transcription_url field returned by the query endpoint points to a publicly downloadable JSON file that contains the full recognition result. The URL is valid for 24 hours by default, so download and persist the file promptly.
# Replace {transcription_url} with the transcription_url value returned by the query endpoint
curl -sS '{transcription_url}' -o transcription.json
cat transcription.json | jq .
Complete example
Java
import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import okhttp3.*;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
public class Main {
// The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription
private static final String API_URL_SUBMIT = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription";
// The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/tasks/
private static final String API_URL_QUERY = "https://dashscope-intl.aliyuncs.com/api/v1/tasks/";
private static final Gson gson = new Gson();
public static void main(String[] args) {
// API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: String apiKey = "sk-xxx"
String apiKey = System.getenv("DASHSCOPE_API_KEY");
OkHttpClient client = new OkHttpClient();
// 1. Submit the task
/*String payloadJson = """
{
"model": "qwen3-asr-flash-filetrans",
"input": {
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
},
"parameters": {
"channel_id": [0],
"enable_itn": false,
"language": "zh"
}
}
""";*/
String payloadJson = """
{
"model": "qwen3-asr-flash-filetrans",
"input": {
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
},
"parameters": {
"channel_id": [0],
"enable_itn": false,
"enable_words": true
}
}
""";
RequestBody body = RequestBody.create(payloadJson, MediaType.get("application/json; charset=utf-8"));
Request submitRequest = new Request.Builder()
.url(API_URL_SUBMIT)
.addHeader("Authorization", "Bearer " + apiKey)
.addHeader("Content-Type", "application/json")
.addHeader("X-DashScope-Async", "enable")
.post(body)
.build();
String taskId = null;
try (Response response = client.newCall(submitRequest).execute()) {
if (response.isSuccessful() && response.body() != null) {
String respBody = response.body().string();
ApiResponse apiResp = gson.fromJson(respBody, ApiResponse.class);
if (apiResp.output != null) {
taskId = apiResp.output.taskId;
System.out.println("Task submitted, task_id: " + taskId);
} else {
System.out.println("Submit response content: " + respBody);
return;
}
} else {
System.out.println("Task submission failed! HTTP code: " + response.code());
if (response.body() != null) {
System.out.println(response.body().string());
}
return;
}
} catch (IOException e) {
e.printStackTrace();
return;
}
// 2. Poll the task status
boolean finished = false;
while (!finished) {
try {
TimeUnit.SECONDS.sleep(2); // Wait 2 seconds before querying again
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return;
}
String queryUrl = API_URL_QUERY + taskId;
Request queryRequest = new Request.Builder()
.url(queryUrl)
.addHeader("Authorization", "Bearer " + apiKey)
.addHeader("X-DashScope-Async", "enable")
.addHeader("Content-Type", "application/json")
.get()
.build();
try (Response response = client.newCall(queryRequest).execute()) {
if (response.body() != null) {
String queryResponse = response.body().string();
ApiResponse apiResp = gson.fromJson(queryResponse, ApiResponse.class);
if (apiResp.output != null && apiResp.output.taskStatus != null) {
String status = apiResp.output.taskStatus;
System.out.println("Current task status: " + status);
if ("SUCCEEDED".equalsIgnoreCase(status)
|| "FAILED".equalsIgnoreCase(status)
|| "UNKNOWN".equalsIgnoreCase(status)) {
finished = true;
System.out.println("Task completed. Final result: ");
System.out.println(queryResponse);
}
} else {
System.out.println("Query response content: " + queryResponse);
}
}
} catch (IOException e) {
e.printStackTrace();
return;
}
}
}
static class ApiResponse {
@SerializedName("request_id")
String requestId;
Output output;
}
static class Output {
@SerializedName("task_id")
String taskId;
@SerializedName("task_status")
String taskStatus;
}
}
Python
import os
import time
import requests
import json
# The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription
API_URL_SUBMIT = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription"
# The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/tasks/
API_URL_QUERY_BASE = "https://dashscope-intl.aliyuncs.com/api/v1/tasks/"
def main():
    # API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-DashScope-Async": "enable"
    }
    # 1. Submit the task
    payload = {
        "model": "qwen3-asr-flash-filetrans",
        "input": {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
        },
        "parameters": {
            "channel_id": [0],
            # "language": "zh",
            "enable_itn": False,
            "enable_words": True
        }
    }
    print("Submitting ASR transcription task...")
    try:
        submit_resp = requests.post(API_URL_SUBMIT, headers=headers, data=json.dumps(payload))
    except requests.RequestException as e:
        print(f"Request to submit task failed: {e}")
        return
    if submit_resp.status_code != 200:
        print(f"Task submission failed! HTTP code: {submit_resp.status_code}")
        print(submit_resp.text)
        return
    resp_data = submit_resp.json()
    output = resp_data.get("output")
    if not output or "task_id" not in output:
        print("Unexpected submit response content:", resp_data)
        return
    task_id = output["task_id"]
    print(f"Task submitted, task_id: {task_id}")
    # 2. Poll the task status
    finished = False
    while not finished:
        time.sleep(2)  # Wait 2 seconds before querying again
        query_url = API_URL_QUERY_BASE + task_id
        try:
            query_resp = requests.get(query_url, headers=headers)
        except requests.RequestException as e:
            print(f"Request to query task failed: {e}")
            return
        if query_resp.status_code != 200:
            print(f"Task query failed! HTTP code: {query_resp.status_code}")
            print(query_resp.text)
            return
        query_data = query_resp.json()
        output = query_data.get("output")
        if output and "task_status" in output:
            status = output["task_status"]
            print(f"Current task status: {status}")
            if status.upper() in ("SUCCEEDED", "FAILED", "UNKNOWN"):
                finished = True
                print("Task completed. Final result:")
                print(json.dumps(query_data, indent=2, ensure_ascii=False))
        else:
            print("Query response content:", query_data)


if __name__ == "__main__":
    main()
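The complete example above stops once it has printed the final query response. Because the transcription URL is described as valid for only 24 hours by default, you may also want to download and persist the result file right away. The following is a minimal sketch under the assumption that the query response carries the URL at `output.result.transcription_url`, as described in the download step earlier. The helper name `save_transcription` and the default file name are hypothetical.

```python
import json
from pathlib import Path
from urllib import request


def save_transcription(query_data: dict, dest: str = "transcription.json") -> dict:
    """Download the result JSON referenced by a finished task and save it to dest.

    query_data is the parsed JSON body of the task-query response.
    Returns the parsed recognition result.
    """
    output = query_data.get("output") or {}
    if output.get("task_status") != "SUCCEEDED":
        raise ValueError(f"task did not succeed: {output.get('task_status')}")
    # Assumption: the URL is located at output.result.transcription_url.
    url = (output.get("result") or {}).get("transcription_url")
    if not url:
        raise ValueError("output.result.transcription_url is missing")
    with request.urlopen(url) as resp:
        raw = resp.read()
    result = json.loads(raw.decode("utf-8"))  # fail early if the body is not JSON
    Path(dest).write_bytes(raw)               # persist the file exactly as received
    return result
```

In the polling loop, a call such as `save_transcription(query_data)` could follow the final status check once the status is `SUCCEEDED`.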
Java SDK
import com.alibaba.dashscope.audio.qwen_asr.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonObject;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
public class Main {
public static void main(String[] args) {
// The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
QwenTranscriptionParam param =
QwenTranscriptionParam.builder()
// API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen3-asr-flash-filetrans")
.fileUrl("https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav")
//.parameter("language", "zh")
//.parameter("channel_id", new ArrayList<String>(){{add("0");add("1");}})
.parameter("enable_itn", false)
.parameter("enable_words", true)
.build();
try {
QwenTranscription transcription = new QwenTranscription();
// Submit the task
QwenTranscriptionResult result = transcription.asyncCall(param);
System.out.println("create task result: " + result);
// Query the task status
result = transcription.fetch(QwenTranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
System.out.println("task status: " + result);
// Wait for the task to complete
result =
transcription.wait(
QwenTranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
System.out.println("task result: " + result);
// Get the speech recognition result
QwenTranscriptionTaskResult taskResult = result.getResult();
if (taskResult != null) {
// Get the URL of the recognition result
String transcriptionUrl = taskResult.getTranscriptionUrl();
// Fetch the content at the URL
HttpURLConnection connection =
(HttpURLConnection) new URL(transcriptionUrl).openConnection();
connection.setRequestMethod("GET");
connection.connect();
BufferedReader reader =
new BufferedReader(new InputStreamReader(connection.getInputStream()));
// Pretty-print the JSON result
Gson gson = new GsonBuilder().setPrettyPrinting().create();
System.out.println(gson.toJson(gson.fromJson(reader, JsonObject.class)));
}
} catch (Exception e) {
System.out.println("error: " + e);
}
}
}
Python SDK
import json
import os
import sys
from http import HTTPStatus
import dashscope
from dashscope.audio.qwen_asr import QwenTranscription
from dashscope.api_entities.dashscope_response import TranscriptionResponse
# run the transcription script
if __name__ == '__main__':
    # API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API Key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    # The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
    dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
    task_response = QwenTranscription.async_call(
        model='qwen3-asr-flash-filetrans',
        file_url='https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav',
        #language="",
        enable_itn=False,
        enable_words=True
    )
    print(f'task_response: {task_response}')
    print(task_response.output.task_id)
    query_response = QwenTranscription.fetch(task=task_response.output.task_id)
    print(f'query_response: {query_response}')
    task_result = QwenTranscription.wait(task=task_response.output.task_id)
    print(f'task_result: {task_result}')
Qwen3-ASR-Flash
Qwen3-ASR-Flash supports recordings of up to 5 minutes. It accepts either a publicly accessible audio file URL or a local file upload, and can stream the recognition result back to you.
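The duration limits quoted in this document (5 minutes for Qwen3-ASR-Flash, 12 hours for Qwen3-ASR-Flash-Filetrans, which accepts only public URLs) can be turned into a small routing helper. The sketch below is a simplification: `choose_model` is a hypothetical function, and it ignores other factors such as file size, features, and cost.

```python
FLASH_MAX_SECONDS = 5 * 60            # Qwen3-ASR-Flash: up to 5 minutes
FILETRANS_MAX_SECONDS = 12 * 60 * 60  # Qwen3-ASR-Flash-Filetrans: up to 12 hours


def choose_model(duration_seconds: float, is_public_url: bool) -> str:
    """Pick a model name based only on audio duration and input type."""
    if duration_seconds <= FLASH_MAX_SECONDS:
        # Short audio: accepts a public URL or a local file.
        return "qwen3-asr-flash"
    if duration_seconds <= FILETRANS_MAX_SECONDS:
        if not is_public_url:
            raise ValueError("long audio must be reachable through a public URL")
        return "qwen3-asr-flash-filetrans"
    raise ValueError("audio longer than 12 hours must be split first")


if __name__ == "__main__":
    print(choose_model(90, is_public_url=False))
    print(choose_model(2 * 60 * 60, is_public_url=True))
```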
Input: audio file URL
Python SDK
import os
import dashscope
# The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. To use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
# API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
# To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
model="qwen3-asr-flash",
messages=messages,
result_format="message",
asr_options={
#"language": "zh", # Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
"enable_itn":False
}
)
print(response)
Java SDK
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
.build();
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_itn", false);
// asrOptions.put("language", "zh"); // Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
// To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
.model("qwen3-asr-flash")
.message(userMessage)
.parameter("asr_options", asrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(JsonUtils.toJson(result));
}
public static void main(String[] args) {
try {
// The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. To use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
cURL
The URL below targets the Singapore region. To use the model in the US region, replace the URL with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation. To use the model in the Beijing region, replace it with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation.
curl -X POST "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-asr-flash",
"input": {
"messages": [
{
"content": [
{
"text": ""
}
],
"role": "system"
},
{
"content": [
{
"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
],
"role": "user"
}
]
},
"parameters": {
"asr_options": {
"enable_itn": false
}
}
}'
Input: Base64-encoded audio file
Pass Base64-encoded audio as a data URL in the form data:<mediatype>;base64,<data>.
- <mediatype>: the MIME type. The value depends on the audio format. For example:
  - WAV: audio/wav
  - MP3: audio/mpeg
- <data>: the audio data encoded as a Base64 string. Base64 encoding increases the payload size, so keep the source file small enough that the encoded result stays within the 10 MB input limit.
- Example:
  data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9
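Base64 turns every 3 input bytes into 4 output characters, so the encoded payload is roughly a third larger than the source file. The sketch below computes the encoded size up front and refuses to build a data URL that would exceed the limit. It rests on two assumptions that this document does not spell out: that "10 MB" means 10 × 1024 × 1024 bytes, and that the limit applies to the whole data URL including its prefix. The function names are hypothetical.

```python
import base64

# Assumption: the 10 MB input limit is interpreted as 10 * 1024 * 1024 bytes.
MAX_INPUT_BYTES = 10 * 1024 * 1024


def encoded_size(raw_size: int) -> int:
    """Length of the padded Base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def to_data_url(audio: bytes, mime_type: str) -> str:
    """Build data:<mediatype>;base64,<data>, checking the size before encoding."""
    prefix = f"data:{mime_type};base64,"
    total = len(prefix) + encoded_size(len(audio))
    if total > MAX_INPUT_BYTES:
        raise ValueError(f"data URL would be {total} bytes, over the {MAX_INPUT_BYTES}-byte limit")
    return prefix + base64.b64encode(audio).decode("ascii")


if __name__ == "__main__":
    print(encoded_size(7_500_000))             # encoded size of a 7.5 MB file
    print(to_data_url(b"abc", "audio/wav"))
```

Under these assumptions, the largest source file that still fits is a little under 7.5 MiB.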
Python SDK
The example uses this audio file: welcome.mp3.
import base64
import dashscope
import os
import pathlib
# The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. To use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# Replace with the actual path to your audio file
file_path = "welcome.mp3"
# Replace with the actual MIME type of your audio file
audio_mime_type = "audio/mpeg"
file_path_obj = pathlib.Path(file_path)
if not file_path_obj.exists():
    raise FileNotFoundError(f"Audio file not found: {file_path}")
base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
data_uri = f"data:{audio_mime_type};base64,{base64_str}"
messages = [
{"role": "user", "content": [{"audio": data_uri}]}
]
response = dashscope.MultiModalConversation.call(
# API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
# To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
model="qwen3-asr-flash",
messages=messages,
result_format="message",
asr_options={
# "language": "zh", # Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
"enable_itn":False
}
)
print(response)
Java SDK
The example uses this audio file: welcome.mp3.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
// Replace with the actual path to your audio file
private static final String AUDIO_FILE = "welcome.mp3";
// Replace with the actual MIME type of your audio file
private static final String AUDIO_MIME_TYPE = "audio/mpeg";
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException, IOException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", toDataUrl())))
.build();
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_itn", false);
// asrOptions.put("language", "zh"); // Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
// To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
.model("qwen3-asr-flash")
.message(userMessage)
.parameter("asr_options", asrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(JsonUtils.toJson(result));
}
public static void main(String[] args) {
try {
// The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. To use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
// Generate a data URI
public static String toDataUrl() throws IOException {
byte[] bytes = Files.readAllBytes(Paths.get(AUDIO_FILE));
String encoded = Base64.getEncoder().encodeToString(bytes);
return "data:" + AUDIO_MIME_TYPE + ";base64," + encoded;
}
}
Input: absolute path to a local audio file
When you process a local audio file with the DashScope SDK, pass the file path as input. Build the path according to your SDK and operating system, as shown in the following table.
| Operating system | SDK | File path format | Example |
| --- | --- | --- | --- |
| Linux or macOS | Python SDK | file://{absolute_path_to_file} | file:///home/audio/test.wav |
| Linux or macOS | Java SDK | file://{absolute_path_to_file} | file:///home/audio/test.wav |
| Windows | Python SDK | file://{absolute_path_to_file} | file://D:/audio/test.wav |
| Windows | Java SDK | file:///{absolute_path_to_file} | file:///D:/audio/test.wav |
Local-file calls are capped at 100 QPS and the limit cannot be increased, so they are not suitable for production, high-concurrency, or load-testing workloads. For higher concurrency, upload the file to OSS and call the API with its URL.
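The path rules in the table above are easy to get wrong, particularly the number of slashes on Windows. The following sketch builds the string for each combination listed in the table. `sdk_file_url` and its parameters are hypothetical names for this illustration, and the function only formats the string; it does not check that the file exists.

```python
from pathlib import PurePosixPath, PureWindowsPath


def sdk_file_url(path: str, windows: bool, sdk: str = "python") -> str:
    """Format an absolute local path the way the table above describes.

    windows selects Windows path rules; sdk is "python" or "java".
    """
    if sdk not in ("python", "java"):
        raise ValueError("sdk must be 'python' or 'java'")
    pure = PureWindowsPath(path) if windows else PurePosixPath(path)
    if not pure.is_absolute():
        raise ValueError(f"not an absolute path: {path}")
    # On Windows, the Java SDK form carries one extra slash before the drive letter.
    prefix = "file:///" if (windows and sdk == "java") else "file://"
    return prefix + pure.as_posix()


if __name__ == "__main__":
    print(sdk_file_url("/home/audio/test.wav", windows=False))
    print(sdk_file_url("D:\\audio\\test.wav", windows=True))
    print(sdk_file_url("D:\\audio\\test.wav", windows=True, sdk="java"))
```

The pure path classes are used so that the helper produces the same result regardless of the operating system it runs on.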
Python SDK
The example uses this audio file: welcome.mp3.
import os
import dashscope
# The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. To use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path to your local audio file
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
{"role": "user", "content": [{"audio": audio_file_path}]}
]
response = dashscope.MultiModalConversation.call(
# API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
# To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
model="qwen3-asr-flash",
messages=messages,
result_format="message",
asr_options={
# "language": "zh", # Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
"enable_itn":False
}
)
print(response)
Java SDK
The example uses this audio file: welcome.mp3.
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
// Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path to your local file
String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", localFilePath)))
.build();
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_itn", false);
// asrOptions.put("language", "zh"); // Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
// To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
.model("qwen3-asr-flash")
.message(userMessage)
.parameter("asr_options", asrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(JsonUtils.toJson(result));
}
public static void main(String[] args) {
try {
// The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. To use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
Streaming output
The model generates intermediate results step by step, and the final result is assembled from them. A non-streaming call waits for the full result and returns it in one response, while a streaming call returns results as they are generated, which significantly reduces time to first token. Choose the streaming parameter that matches your call method:
-
DashScope Python SDK: set stream to True.
-
DashScope Java SDK: call the streamCall method.
-
DashScope HTTP: set the X-DashScope-SSE header to enable.
Python SDK
import os
import dashscope
# The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. To use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
# API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
# To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
model="qwen3-asr-flash",
messages=messages,
result_format="message",
asr_options={
# "language": "zh", # Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
"enable_itn":False
},
stream=True
)
for chunk in response:
    try:
        print(chunk["output"]["choices"][0]["message"].content[0]["text"])
    except (KeyError, IndexError, TypeError, AttributeError):
        # Skip chunks that carry no text content
        pass
Java SDK
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
.build();
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_itn", false);
// asrOptions.put("language", "zh"); // Optional. If the audio language is known, use this parameter to specify the language to improve recognition accuracy.
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API Keys for the Singapore and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
// To use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
.model("qwen3-asr-flash")
.message(userMessage)
.parameter("asr_options", asrOptions)
.build();
Flowable<MultiModalConversationResult> resultFlowable = conv.streamCall(param);
resultFlowable.blockingForEach(item -> {
try {
System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
} catch (Exception e){
System.exit(0);
}
});
}
public static void main(String[] args) {
try {
// The following is the Singapore region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. To use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
cURL
The URL below targets the Singapore region. To use the model in the US region, replace the URL with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation. To use the model in the Beijing region, replace it with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation.
curl -X POST "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
"model": "qwen3-asr-flash",
"input": {
"messages": [
{
"content": [
{
"text": ""
}
],
"role": "system"
},
{
"content": [
{
"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
],
"role": "user"
}
]
},
"parameters": {
"incremental_output": true,
"asr_options": {
"enable_itn": false
}
}
}'
Paraformer
The example code for Paraformer is similar to Fun-ASR. Replace the model name with a Paraformer model name.
Advanced features
Use the OpenAI-compatible API
The OpenAI-compatible mode is not available in the US region.
Only the Qwen3-ASR-Flash model series supports OpenAI-compatible calls. This mode accepts only publicly accessible audio file URLs; absolute paths to local audio files are not accepted.
The OpenAI Python SDK must be 1.52.0 or later, and the Node.js SDK must be 4.68.0 or later. To install or upgrade:
# Python
pip install -U "openai>=1.52.0"
# Node.js
npm install openai@^4.68.0
asr_options is not a standard OpenAI parameter. When you call the API through the OpenAI SDK, pass it through extra_body.
Input: audio file URL
Python SDK
from openai import OpenAI
import os
try:
    client = OpenAI(
        # API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        # If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # The following is the Singapore/US region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    stream_enabled = False  # Whether to enable streaming output
    completion = client.chat.completions.create(
        model="qwen3-asr-flash",
        messages=[
            {
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                        }
                    }
                ],
                "role": "user"
            }
        ],
        stream=stream_enabled,
        # When stream is False, stream_options cannot be set
        # stream_options={"include_usage": True},
        extra_body={
            "asr_options": {
                # "language": "zh",
                "enable_itn": False
            }
        }
    )
    if stream_enabled:
        full_content = ""
        print("Streaming output:")
        for chunk in completion:
            # When stream_options.include_usage is True, the choices field of the last chunk is an empty list and must be skipped (you can get token usage via chunk.usage)
            print(chunk)
            if chunk.choices and chunk.choices[0].delta.content:
                full_content += chunk.choices[0].delta.content
        print(f"Full content: {full_content}")
    else:
        print(f"Non-streaming output: {completion.choices[0].message.content}")
except Exception as e:
    print(f"Error: {e}")
Node.js SDK
// Preparations before running:
// Common to Windows/Mac/Linux:
// 1. Make sure Node.js is installed (version >= 14 recommended)
// 2. Run the following command to install the required dependency: npm install openai
import OpenAI from "openai";
const client = new OpenAI({
// API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
// The following is the Singapore/US region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});
async function main() {
try {
const streamEnabled = false; // Whether to enable streaming output
const completion = await client.chat.completions.create({
model: "qwen3-asr-flash",
messages: [
{
role: "user",
content: [
{
type: "input_audio",
input_audio: {
data: "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
}
]
}
],
stream: streamEnabled,
// When stream is false, stream_options cannot be set
// stream_options: {
// "include_usage": true
// },
extra_body: {
asr_options: {
// language: "zh",
enable_itn: false
}
}
});
if (streamEnabled) {
let fullContent = "";
console.log("Streaming output:");
for await (const chunk of completion) {
console.log(JSON.stringify(chunk));
if (chunk.choices && chunk.choices.length > 0) {
const delta = chunk.choices[0].delta;
if (delta && delta.content) {
fullContent += delta.content;
}
}
}
console.log(`Full content: ${fullContent}`);
} else {
console.log(`Non-streaming output: ${completion.choices[0].message.content}`);
}
} catch (err) {
console.error(`Error: ${err}`);
}
}
main();
cURL
The URL below targets the Singapore and US regions. To call a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions.
curl -X POST 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-asr-flash",
"messages": [
{
"content": [
{
"type": "input_audio",
"input_audio": {
"data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
}
],
"role": "user"
}
],
"stream":false,
"asr_options": {
"enable_itn": false
}
}'
Input: Base64-encoded audio file
You can also pass Base64-encoded audio as a data URL, in the format data:<mediatype>;base64,<data>.
-
<mediatype>: The MIME type. The value varies by audio format. For example:
-
WAV: audio/wav
-
MP3: audio/mpeg
-
<data>: The Base64-encoded string of the audio data. Base64 encoding inflates the payload size. Keep the source file small enough that the encoded data still fits within the 10 MB input limit.
-
Example:
data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9
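The data URL format above can be sketched as a small helper that also checks the encoded size before sending. This is a minimal sketch, not part of any SDK: the function names are hypothetical, the MIME table contains only the two formats listed above, and it assumes the 10 MB limit applies to the encoded payload and is counted in binary megabytes.

```python
import base64
from pathlib import Path
from typing import Optional

# Assumption: the 10 MB input limit is measured on the Base64 payload, in binary megabytes.
MAX_ENCODED_BYTES = 10 * 1024 * 1024

# MIME types taken from the list above; extend this table for other formats.
MIME_BY_SUFFIX = {".wav": "audio/wav", ".mp3": "audio/mpeg"}


def encoded_size(raw_size: int) -> int:
    """Length of the Base64 text for raw_size bytes (4 characters per 3-byte group)."""
    return 4 * ((raw_size + 2) // 3)


def build_data_url(path: str, mime_type: Optional[str] = None) -> str:
    """Return data:<mediatype>;base64,<data> for a local audio file (hypothetical helper)."""
    file = Path(path)
    mime_type = mime_type or MIME_BY_SUFFIX.get(file.suffix.lower())
    if mime_type is None:
        raise ValueError(f"Unknown MIME type for {file.name}; pass mime_type explicitly")
    if encoded_size(file.stat().st_size) > MAX_ENCODED_BYTES:
        raise ValueError(f"{file.name} would exceed the input limit after Base64 encoding")
    payload = base64.b64encode(file.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{payload}"
```

Because Base64 emits 4 characters for every 3 input bytes, a source file needs to stay below roughly three quarters of the limit.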
Python SDK
The example uses this audio file: welcome.mp3.
import base64
from openai import OpenAI
import os
import pathlib
try:
    # Replace with the actual path to your audio file
    file_path = "welcome.mp3"
    # Replace with the actual MIME type of your audio file
    audio_mime_type = "audio/mpeg"
    file_path_obj = pathlib.Path(file_path)
    if not file_path_obj.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
    data_uri = f"data:{audio_mime_type};base64,{base64_str}"
    client = OpenAI(
        # API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        # If the environment variable is not configured, replace the following line with your Model Studio API Key: api_key = "sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # The following is the Singapore/US region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    stream_enabled = False  # Whether to enable streaming output
    completion = client.chat.completions.create(
        model="qwen3-asr-flash",
        messages=[
            {
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": data_uri
                        }
                    }
                ],
                "role": "user"
            }
        ],
        stream=stream_enabled,
        # When stream is False, stream_options cannot be set
        # stream_options={"include_usage": True},
        extra_body={
            "asr_options": {
                # "language": "zh",
                "enable_itn": False
            }
        }
    )
    if stream_enabled:
        full_content = ""
        print("Streaming output:")
        for chunk in completion:
            # When stream_options.include_usage is True, the choices field of the last chunk is an empty list and must be skipped (you can get token usage via chunk.usage)
            print(chunk)
            if chunk.choices and chunk.choices[0].delta.content:
                full_content += chunk.choices[0].delta.content
        print(f"Full content: {full_content}")
    else:
        print(f"Non-streaming output: {completion.choices[0].message.content}")
except Exception as e:
    print(f"Error: {e}")
Node.js SDK
The example uses this audio file: welcome.mp3.
// Preparations before running:
// Common to Windows/Mac/Linux:
// 1. Make sure Node.js is installed (version >= 14 recommended)
// 2. Run the following command to install the required dependency: npm install openai
import OpenAI from "openai";
import { readFileSync } from 'fs';
const client = new OpenAI({
// API Keys for the Singapore/US and Beijing regions are different. To obtain an API Key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
// The following is the Singapore/US region URL. To use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});
const encodeAudioFile = (audioFilePath) => {
const audioFile = readFileSync(audioFilePath);
return audioFile.toString('base64');
};
// Replace with the actual path to your audio file
const dataUri = `data:audio/mpeg;base64,${encodeAudioFile("welcome.mp3")}`;
async function main() {
try {
const streamEnabled = false; // Whether to enable streaming output
const completion = await client.chat.completions.create({
model: "qwen3-asr-flash",
messages: [
{
role: "user",
content: [
{
type: "input_audio",
input_audio: {
data: dataUri
}
}
]
}
],
stream: streamEnabled,
// When stream is false, stream_options cannot be set
// stream_options: {
// "include_usage": true
// },
extra_body: {
asr_options: {
// language: "zh",
enable_itn: false
}
}
});
if (streamEnabled) {
let fullContent = "";
console.log("Streaming output:");
for await (const chunk of completion) {
console.log(JSON.stringify(chunk));
if (chunk.choices && chunk.choices.length > 0) {
const delta = chunk.choices[0].delta;
if (delta && delta.content) {
fullContent += delta.content;
}
}
}
console.log(`Full content: ${fullContent}`);
} else {
console.log(`Non-streaming output: ${completion.choices[0].message.content}`);
}
} catch (err) {
console.error(`Error: ${err}`);
}
}
main();
Process long audio files
Non-real-time speech recognition transcribes long audio files asynchronously, making it well suited for meeting minutes, interview transcripts, and call-recording review.
Limitations:
-
Qwen3-ASR-Flash-Filetrans, Fun-ASR, and Paraformer: Each audio file is capped at 2 GB in size and 12 hours in duration.
-
Qwen3-ASR-Flash: Each audio file is capped at 10 MB in size and 5 minutes in duration. For longer audio, use Qwen3-ASR-Flash-Filetrans or Fun-ASR.
-
When speaker diarization is enabled: Keep the audio duration under 2 hours to avoid recognition failures or timeouts. For details, see Speaker diarization.
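The limits above can be expressed as a pre-flight check that runs before a request is submitted. The sketch below is illustrative only: the function and constant names are hypothetical, binary units (1 MB = 1024 × 1024 bytes) are an assumption, and the choice of qwen3-asr-flash-filetrans as the default long-audio model is arbitrary (Fun-ASR fits the same limits).

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: limits are counted in binary units
GB = 1024 * MB


@dataclass(frozen=True)
class Limit:
    max_bytes: int
    max_seconds: int


SYNC_LIMIT = Limit(10 * MB, 5 * 60)       # Qwen3-ASR-Flash
ASYNC_LIMIT = Limit(2 * GB, 12 * 3600)    # Qwen3-ASR-Flash-Filetrans, Fun-ASR, Paraformer
DIARIZATION_MAX_SECONDS = 2 * 3600        # recommended ceiling when diarization is on


def fits(limit: Limit, size_bytes: int, duration_s: float) -> bool:
    return size_bytes <= limit.max_bytes and duration_s <= limit.max_seconds


def choose_model(size_bytes: int, duration_s: float, diarization: bool = False) -> str:
    """Pick a model name for one file, or raise ValueError if no limit fits."""
    if diarization:
        # Diarization is offered by Fun-ASR and Paraformer only, both asynchronous.
        if duration_s >= DIARIZATION_MAX_SECONDS or not fits(ASYNC_LIMIT, size_bytes, duration_s):
            raise ValueError("Split the audio: keep diarized files under 2 hours")
        return "fun-asr"
    if fits(SYNC_LIMIT, size_bytes, duration_s):
        return "qwen3-asr-flash"
    if fits(ASYNC_LIMIT, size_bytes, duration_s):
        return "qwen3-asr-flash-filetrans"
    raise ValueError("Split the audio: it exceeds 2 GB or 12 hours")


print(choose_model(5 * MB, 120))      # short clip
print(choose_model(300 * MB, 5400))   # 90-minute recording
```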
How it works: Long-audio transcription runs as an asynchronous task in three steps:
-
Submit a transcription task to receive a task_id.
-
Poll the task status, or call the SDK's wait method to block until the task completes.
-
After the task completes, download the result JSON from the returned URL.
For code samples, see the quick start of Qwen3-ASR-Flash-Filetrans.
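The polling step can be sketched independently of any SDK. In the sketch below, fetch_status is a hypothetical callable that you supply (for example, a thin wrapper around the task-query call of your SDK), and the status strings PENDING, RUNNING, SUCCEEDED, and FAILED are assumptions about what that wrapper returns.

```python
import time
from typing import Callable

# Assumption: the wrapper reports one of these strings once the task has finished.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def wait_for_task(
    fetch_status: Callable[[str], str],
    task_id: str,
    interval_s: float = 3.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status(task_id) until it reports a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(task_id)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task {task_id} is still {status} after {timeout_s} seconds")
        sleep(interval_s)


# Usage with a stand-in status source; replace it with a real query in practice.
fake_statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_task(lambda _task_id: next(fake_statuses), "task-123", sleep=lambda _s: None))
```

Passing sleep as a parameter keeps the loop testable without real waiting.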
Streaming output
Qwen3-ASR-Flash supports streaming output: intermediate results are returned while the audio is being processed, which is well suited for use cases that require real-time progress feedback.
Fun-ASR, Paraformer, and Qwen3-ASR-Flash-Filetrans are asynchronous transcription models and do not support streaming output. Retrieve their final results through task polling (see Process long audio files).
To enable streaming output:
-
DashScope Python SDK: set stream to True.
-
DashScope Java SDK: call the API through the streamCall method.
-
DashScope HTTP: set the X-DashScope-SSE header to enable.
-
OpenAI-compatible SDK: set stream to True.
For a streaming code sample, see Streaming output in the Qwen3-ASR-Flash quick start.
Improve accuracy with hotwords
Fun-ASR and Paraformer improve recognition accuracy for domain-specific proper nouns (names, locations, product names) through hotwords. Create a hotword list in the Model Studio console, then pass its ID to the API through the vocabulary_id parameter.
For instructions on creating and using hotword lists, see Custom hotwords.
SDK naming conventions for these parameters vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference for each SDK.
Speaker diarization
Speaker diarization identifies the different speakers in an audio file and tags each sentence in the transcript with a speaker label. It is well suited for multi-party meetings and interview recordings.
Supported models: Fun-ASR and Paraformer support speaker diarization (off by default). The Qwen-ASR series does not yet support it.
To enable: Set diarization_enabled to true in the API request. Each sentence in the result then includes a speaker_id field that identifies the speaker.
Response structure (excerpt):
{
"transcripts": [
{
"sentences": [
{ "begin_time": 100, "end_time": 3820, "text": "Hello, let's discuss the project progress today.", "speaker_id": 0 },
{ "begin_time": 3820, "end_time": 6500, "text": "Sure, I'll give the update first.", "speaker_id": 1 }
]
}
]
}
SDK naming conventions for these fields vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference for each SDK.
When speaker diarization is enabled, keep the audio duration under 2 hours to avoid recognition failures or timeouts. (For the audio duration limit when diarization is not enabled, see Process long audio files.) Diarization is supported only for mono audio.
For complete field definitions, see the API reference.
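As an illustration of how the speaker_id field might be consumed, the sketch below turns a result shaped like the excerpt above into one dialogue line per sentence. The function name is hypothetical, and the input is assumed to be the already-downloaded result JSON parsed into a dictionary.

```python
def format_dialogue(result: dict) -> list:
    """Return one 'Speaker N [start-end]: text' line per sentence."""
    lines = []
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            # speaker_id is present only when diarization was enabled for the task.
            speaker = sentence.get("speaker_id", "?")
            begin_s = sentence["begin_time"] / 1000  # timestamps are in milliseconds
            end_s = sentence["end_time"] / 1000
            lines.append(f"Speaker {speaker} [{begin_s:.2f}s-{end_s:.2f}s]: {sentence['text']}")
    return lines


sample = {
    "transcripts": [
        {
            "sentences": [
                {"begin_time": 100, "end_time": 3820,
                 "text": "Hello, let's discuss the project progress today.", "speaker_id": 0},
                {"begin_time": 3820, "end_time": 6500,
                 "text": "Sure, I'll give the update first.", "speaker_id": 1},
            ]
        }
    ]
}
for line in format_dialogue(sample):
    print(line)
```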
Sensitive word filter
The sensitive word filter replaces or removes sensitive words from recognition results. It is well suited for customer service quality assurance (QA), content compliance, and subtitle moderation.
Supported models: Fun-ASR and Paraformer support the sensitive word filter. The Qwen-ASR series (Qwen3-ASR-Flash and Qwen3-ASR-Flash-Filetrans) does not yet support it.
Default behavior: When the special_word_filter parameter is not specified, the system applies the built-in Alibaba Cloud Model Studio sensitive word list. Matched words are replaced with the same number of * characters.
Custom configuration: special_word_filter is a JSON object with three fields:
-
filter_with_signed.word_list: An array of strings whose matches are replaced with the same number of * characters. For example, with ["test"], "please help me test it" becomes "please help me **** it".
-
filter_with_empty.word_list: An array of strings whose matches are removed from the result. For example, with ["start"], "is the game about to start now" becomes "is the game about to now".
-
system_reserved_filter: A boolean. Defaults to true. Controls whether the built-in sensitive word list is applied in addition to the custom lists.
Example configuration:
{
"special_word_filter": {
"filter_with_signed": {
"word_list": ["test"]
},
"filter_with_empty": {
"word_list": ["start", "happen"]
},
"system_reserved_filter": true
}
}
SDK naming conventions for these parameters vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference.
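A configuration like the one above can be assembled programmatically when the word lists come from another system. The helper below is a hypothetical sketch: it only builds the JSON object described in this section and does not call any API. Rejecting a word that appears in both lists is a design choice of the sketch, not documented service behavior.

```python
import json
from typing import Iterable


def build_special_word_filter(
    mask_words: Iterable[str] = (),
    remove_words: Iterable[str] = (),
    use_builtin_list: bool = True,
) -> dict:
    """Build the special_word_filter object from two word lists (hypothetical helper)."""
    mask_list = list(mask_words)
    remove_list = list(remove_words)
    overlap = set(mask_list) & set(remove_list)
    if overlap:
        # Sketch-level safeguard: a word cannot be both masked and removed.
        raise ValueError(f"Words listed in both filters: {sorted(overlap)}")
    return {
        "special_word_filter": {
            "filter_with_signed": {"word_list": mask_list},
            "filter_with_empty": {"word_list": remove_list},
            "system_reserved_filter": bool(use_builtin_list),
        }
    }


config = build_special_word_filter(["test"], ["start", "happen"])
print(json.dumps(config, indent=2))
```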
Emotion recognition
Qwen3-ASR-Flash-Filetrans and Qwen3-ASR-Flash have emotion recognition always on, with no additional configuration required. The result includes an emotion tag for the speaker, drawn from seven fine-grained categories: surprised, neutral, happy, sad, disgusted, angry, and fearful.
Field paths (vary by interface):
-
OpenAI-compatible interface (Qwen3-ASR-Flash real-time transcription): nested at choices[].delta.annotations[].emotion (streaming) or choices[].message.annotations[].emotion (non-streaming).
-
DashScope synchronous interface (Qwen3-ASR-Flash): nested at output.choices[].message.annotations[].emotion.
-
DashScope asynchronous task interface (Qwen3-ASR-Flash-Filetrans, audio file transcription): nested at transcripts[].sentences[].emotion, alongside the timestamp and speaker fields on each sentence object.
Response structure (excerpt from the DashScope asynchronous task interface):
{
"transcripts": [{
"sentences": [{
"begin_time": 0,
"end_time": 1440,
"text": "Welcome to Alibaba Cloud.",
"emotion": "neutral",
"language": "en"
}]
}]
}
SDK naming conventions for these fields vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference.
The Fun-ASR and Paraformer non-real-time models do not yet support emotion recognition. To use emotion recognition with real-time recognition, see the corresponding section in Real-time speech recognition - Qwen.
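For the asynchronous task interface, the per-sentence emotion tags can be summarized with a few lines of code. The sketch below assumes the result JSON has already been downloaded and parsed into a dictionary shaped like the excerpt above; the function name is hypothetical.

```python
from collections import Counter

# The seven categories listed in this section.
EMOTIONS = {"surprised", "neutral", "happy", "sad", "disgusted", "angry", "fearful"}


def emotion_counts(result: dict) -> Counter:
    """Count how many sentences carry each emotion tag."""
    counts = Counter()
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            emotion = sentence.get("emotion")
            if emotion is not None:
                # Tags outside EMOTIONS are counted as-is rather than dropped.
                counts[emotion] += 1
    return counts


sample = {
    "transcripts": [{
        "sentences": [{
            "begin_time": 0,
            "end_time": 1440,
            "text": "Welcome to Alibaba Cloud.",
            "emotion": "neutral",
            "language": "en",
        }]
    }]
}
print(emotion_counts(sample))
```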
Get timestamps
Non-real-time speech recognition can return timestamps in the transcript, which supports subtitle generation, keyword highlighting, and audio or video editing. All three asynchronous transcription models—Fun-ASR, Paraformer, and Qwen3-ASR-Flash-Filetrans—support timestamps, but the default behavior and the control method differ by model:
-
Qwen3-ASR-Flash-Filetrans: Only the DashScope asynchronous interface supports timestamps; the feature is permanently on. The enable_words request parameter controls the granularity: false (default) returns sentence-level timestamps; true returns word-level timestamps. Word-level timestamps are supported only for Chinese, English, Japanese, Korean, German, French, Spanish, Italian, Portuguese, and Russian. Accuracy is not guaranteed for other languages.
-
Fun-ASR: Timestamps are permanently on and cannot be disabled.
-
Paraformer: Timestamps are off by default. To enable them, set the timestamp_alignment_enabled request parameter to true.
When Qwen3-ASR-Flash is called through the OpenAI-compatible interface, the output is a chat.completion and does not include timestamp fields. For timestamps, use Qwen3-ASR-Flash-Filetrans (the asynchronous task interface).
Timestamps are returned in milliseconds at two levels:
-
Sentence level: sentences[].begin_time and sentences[].end_time mark the start and end of each sentence in the audio.
-
Word level: The sentences[].words[] array. Each element contains begin_time, end_time, and text (the word or character text).
Response structure (excerpt from the DashScope asynchronous task interface):
{
"transcripts": [{
"sentences": [{
"begin_time": 100,
"end_time": 3820,
"text": "Hello, let's discuss the project progress today.",
"words": [
{ "begin_time": 100, "end_time": 596, "text": "Hello" },
{ "begin_time": 596, "end_time": 844, "text": "let's" }
]
}]
}]
}
The in-audio timestamps are integer milliseconds (for example, 100). They are not the same as the task-level end_time (the task completion time, a string such as "2024-09-12 15:11:40.903"). Do not confuse them.
SDK naming conventions for these fields vary (dictionary keys, object attributes, or methods). For the full field mapping, see the API reference.
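Since subtitle generation is one of the uses named in this section, here is a minimal sketch that converts sentence-level timestamps into SubRip (SRT) text. It assumes a parsed result dictionary shaped like the excerpt above; the function names are hypothetical and each sentence becomes one cue, with no line wrapping.

```python
def ms_to_srt(ms: int) -> str:
    """Format integer milliseconds as an SRT timecode, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(result: dict) -> str:
    """Build SRT text with one numbered cue per sentence."""
    cues = []
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            start = ms_to_srt(sentence["begin_time"])
            end = ms_to_srt(sentence["end_time"])
            cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{sentence['text']}\n")
    return "\n".join(cues)


sample = {
    "transcripts": [{
        "sentences": [{
            "begin_time": 100,
            "end_time": 3820,
            "text": "Hello, let's discuss the project progress today.",
        }]
    }]
}
print(to_srt(sample))
```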
Apply in production
The best practices below improve recognition quality and system stability when you use non-real-time speech recognition in production.
Production best practices
-
File hosting: Upload audio files to Alibaba Cloud OSS and call the API by URL. Avoid uploading local files (the local-file API is capped at 100 QPS and the limit cannot be increased).
-
Asynchronous polling: Long-audio transcription uses an asynchronous flow. Set a reasonable polling interval (for example, 2–5 seconds) to avoid burning through your quota with frequent queries.
-
Error handling: Implement a robust retry mechanism. Retry network timeouts and transient server errors (5xx) with exponential backoff.
-
Noise reduction: For noisy audio, preprocess the file with tools such as FFmpeg before submitting it for recognition.
-
Model selection: Choose a model based on audio duration. Use Qwen3-ASR-Flash for short audio up to 5 minutes. Use Qwen3-ASR-Flash-Filetrans or Fun-ASR for longer audio.
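The retry advice above can be sketched as a small wrapper. This is one possible shape, not a prescribed implementation: TransientError is a hypothetical exception that your own code raises for network timeouts and 5xx responses, and the delay values and jitter range are arbitrary starting points.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 5xx responses)."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying TransientError with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter spreads out retries from clients that failed at the same moment.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

Errors that are not transient, such as an invalid API key or an unsupported file format, should propagate immediately instead of being retried.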
Supported scope
Available models vary by deployment scope:
International
If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.
To call the models below, use an API key in the Singapore region:
-
Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)
-
Qwen3-ASR-Flash-Filetrans: qwen3-asr-flash-filetrans (stable, currently equivalent to qwen3-asr-flash-filetrans-2025-11-17), qwen3-asr-flash-filetrans-2025-11-17 (snapshot)
-
Qwen3-ASR-Flash: qwen3-asr-flash (stable, currently equivalent to qwen3-asr-flash-2025-09-08), qwen3-asr-flash-2026-02-10 (latest snapshot), qwen3-asr-flash-2025-09-08 (snapshot)
US
If you select the US deployment scope, model inference compute resources are restricted to the United States. Static data is stored in your selected region. Supported region: US (Virginia).
To call the models below, use an API key in the US region:
Qwen3-ASR-Flash: qwen3-asr-flash-us (stable, currently equivalent to qwen3-asr-flash-2025-09-08-us), qwen3-asr-flash-2025-09-08-us (snapshot)
Chinese mainland
If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).
To call the models below, use an API key in the Beijing region:
-
Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)
-
Qwen3-ASR-Flash-Filetrans: qwen3-asr-flash-filetrans (stable, currently equivalent to qwen3-asr-flash-filetrans-2025-11-17), qwen3-asr-flash-filetrans-2025-11-17 (snapshot)
-
Qwen3-ASR-Flash: qwen3-asr-flash (stable, currently equivalent to qwen3-asr-flash-2025-09-08), qwen3-asr-flash-2026-02-10 (latest snapshot), qwen3-asr-flash-2025-09-08 (snapshot)
-
Paraformer: paraformer-v2, paraformer-8k-v2
API reference
FAQ
Q: How do I provide a publicly accessible audio URL to the API?
Use Alibaba Cloud Object Storage Service (OSS). OSS provides highly available and durable storage and can generate publicly accessible URLs.
Verify that the URL is reachable from the public internet: Open the URL in a browser or run curl against it to confirm the audio file downloads or plays successfully (HTTP status code 200).
Q: How do I check whether the audio format meets the requirements?
Use the open-source tool ffprobe to quickly inspect audio details:
# Inspect the container format (format_name), codec (codec_name), sample rate (sample_rate), and channel count (channels)
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 your_audio_file.mp3
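To run the same inspection from a script, ffprobe can emit JSON instead of plain text. The sketch below mirrors the command above with the output format switched to JSON and then extracts the four fields. The function names are hypothetical, and it assumes ffprobe is installed and on your PATH.

```python
import json
import subprocess


def probe_command(path: str) -> list:
    """Same query as the command above, with JSON output for easier parsing."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=format_name",
        "-show_entries", "stream=codec_name,sample_rate,channels",
        "-of", "json",
        path,
    ]


def summarize_probe(probe_json: str) -> dict:
    """Extract container, codec, sample rate, and channel count from ffprobe JSON."""
    data = json.loads(probe_json)
    # Take the first stream that reports a sample rate, which is the audio stream.
    stream = next((s for s in data.get("streams", []) if "sample_rate" in s), {})
    rate = stream.get("sample_rate")  # ffprobe reports this value as a string
    return {
        "format_name": data.get("format", {}).get("format_name"),
        "codec_name": stream.get("codec_name"),
        "sample_rate": int(rate) if rate is not None else None,
        "channels": stream.get("channels"),
    }


def inspect_audio(path: str) -> dict:
    """Run ffprobe on a local file and summarize the result."""
    completed = subprocess.run(probe_command(path), capture_output=True, text=True, check=True)
    return summarize_probe(completed.stdout)
```

For example, inspect_audio("your_audio_file.mp3") returns a dictionary with the keys format_name, codec_name, sample_rate, and channels.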
Q: How do I process audio to meet the model requirements?
Use the open-source tool FFmpeg to clip audio or convert formats:
-
Audio clipping: extract a segment from a long audio file
# -i: Input file
# -ss 00:01:30: Set the clip start time (start at 1 minute 30 seconds)
# -t 00:02:00: Set the clip duration (clip 2 minutes)
# -c copy: Copy the audio stream directly without re-encoding; faster
# output_clip.wav: Output file
ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav
-
Format conversion
For example, convert any audio to a 16 kHz, 16-bit, mono WAV file:
# -i: Input file
# -ac 1: Set the channel count to 1 (mono)
# -ar 16000: Set the sample rate to 16000 Hz (16 kHz)
# -sample_fmt s16: Set the sample format to 16-bit signed integer PCM
# output.wav: Output file
ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav
Q: How do I improve recognition accuracy?
The following factors affect recognition accuracy. Check each one and tune accordingly.
Main factors:
-
Sound quality: Recording equipment, sample rate, and ambient noise directly affect audio clarity. High-quality input is the foundation of accurate recognition.
-
Speaker characteristics: Variations in pitch, speaking rate, accent, and dialect—especially uncommon dialects and strong accents—make recognition harder.
-
Language and vocabulary: Mixed languages, technical terms, and slang make recognition harder. Configure hotwords to improve accuracy for domain-specific terminology.
How to optimize:
-
Improve audio quality: Use a high-quality microphone, record at the recommended sample rate, and minimize ambient noise and echo.
-
Adapt to the speaker: For audio with strong accents or distinct dialects, choose a model that supports the relevant dialect.
-
Configure hotwords: Set hotwords for technical terms and proper nouns.