
Alibaba Cloud Model Studio: Audio file recognition - Qwen

Last Updated: Jan 08, 2026

The Qwen audio file recognition models convert recorded audio into text. These models support features such as multi-language recognition, singing voice recognition, and noise rejection.

Core features

  • Multi-language recognition: Recognizes multiple languages, including Mandarin and various dialects, such as Cantonese and Sichuanese.

  • Adaptation to complex environments: Handles complex acoustic environments, with automatic language detection and intelligent filtering of non-human sounds.

  • Singing voice recognition: Transcribes entire songs, even with background music (BGM).

  • Context biasing: Improves recognition accuracy by configuring context. For more information, see Context biasing.

  • Emotion recognition: Recognizes multiple emotional states, including surprise, calm, happiness, sadness, disgust, anger, and fear.

Scope

Supported models:

This service offers two core models:

  • Qwen3-ASR-Flash-Filetrans: Designed for asynchronous recognition of long audio files up to 12 hours in length. This model is suitable for scenarios such as transcribing meeting records and interviews.

  • Qwen3-ASR-Flash: Designed for synchronous or streaming recognition of short audio files up to 5 minutes in length. This model is suitable for scenarios such as voice messaging and real-time captions.

International

In the International deployment mode, the endpoint and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding the Chinese mainland.

When you call the following models, select an API key from the Singapore region:

  • Qwen3-ASR-Flash-Filetrans: qwen3-asr-flash-filetrans, qwen3-asr-flash-filetrans-2025-11-17 (snapshot)

  • Qwen3-ASR-Flash: qwen3-asr-flash (stable, currently equivalent to qwen3-asr-flash-2025-09-08), qwen3-asr-flash-2025-09-08 (snapshot)

US

In the US deployment mode, the endpoint and data storage are in the US (Virginia) region. Model inference compute resources are restricted to the United States.

When you call the following models, select an API key from the US (Virginia) region:

  • Qwen3-ASR-Flash: qwen3-asr-flash-us (stable, currently equivalent to qwen3-asr-flash-2025-09-08-us), qwen3-asr-flash-2025-09-08-us (snapshot)

Chinese mainland

In the Chinese mainland deployment mode, the endpoint and data storage are in the Beijing region. Model inference compute resources are restricted to the Chinese mainland.

When you call the following models, select an API key from the Beijing region:

  • Qwen3-ASR-Flash-Filetrans: qwen3-asr-flash-filetrans, qwen3-asr-flash-filetrans-2025-11-17 (snapshot)

  • Qwen3-ASR-Flash: qwen3-asr-flash (stable, currently equivalent to qwen3-asr-flash-2025-09-08), qwen3-asr-flash-2025-09-08 (snapshot)

For more information, see Model list.

Model selection

| Scenario | Recommended model | Reason | Notes |
| --- | --- | --- | --- |
| Long audio recognition | qwen3-asr-flash-filetrans | Supports recordings up to 12 hours long and provides emotion recognition and sentence-level timestamps, which are suitable for later indexing and analysis. | The audio file size cannot exceed 2 GB, and the duration cannot exceed 12 hours. |
| Short audio recognition | qwen3-asr-flash (or qwen3-asr-flash-us) | Provides low-latency recognition for short audio. | The audio file size cannot exceed 10 MB, and the duration cannot exceed 5 minutes. |
| Customer service quality inspection | qwen3-asr-flash-filetrans, qwen3-asr-flash (or qwen3-asr-flash-us) | Analyzes customer emotions. | These models do not support sensitive word filtering or speaker diarization. Select the appropriate model based on the audio duration. |
| News/interview program caption generation | qwen3-asr-flash-filetrans | Supports long audio, punctuation prediction, and timestamps to directly generate structured captions. | Post-processing is required to generate standard subtitle files. Select the appropriate model based on the audio duration. |
| Multilingual video localization | qwen3-asr-flash-filetrans, qwen3-asr-flash (or qwen3-asr-flash-us) | Covers multiple languages and dialects, which makes these models suitable for cross-language caption creation. | Select the appropriate model based on the audio duration. |
| Singing audio analysis | qwen3-asr-flash-filetrans, qwen3-asr-flash (or qwen3-asr-flash-us) | Recognizes lyrics and analyzes emotions, which makes these models suitable for song indexing and recommendation. | Select the appropriate model based on the audio duration. |

For more information, see Model feature comparison.

Getting started

Before you begin, obtain an API key. If you use an SDK, install the latest version of the SDK.
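
All the code samples in this topic read the key from the DASHSCOPE_API_KEY environment variable. The following is a minimal Python check you can run before the samples; the variable name matches the examples below, and how to set it is described in the API key documentation.

import os

# The code samples in this topic read the API key from this environment variable.
if not os.getenv("DASHSCOPE_API_KEY"):
    raise RuntimeError(
        "DASHSCOPE_API_KEY is not set. Export your Model Studio API key, "
        "or replace the api_key argument in the samples with your key."
    )
print("DASHSCOPE_API_KEY is set.")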

DashScope

qwen3-asr-flash-filetrans

The qwen3-asr-flash-filetrans model is designed for asynchronous transcription of audio files and supports recordings up to 12 hours long. This model requires a publicly accessible URL of an audio file as input and does not support direct uploads of local files. It is a non-streaming API that returns all recognition results at once after the task is complete.

cURL

When you use cURL for speech recognition, first submit a task to obtain a task ID (task_id), and then use this ID to retrieve the task execution result.

Submit a task

# ======= Important =======
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription
# The API keys for the Singapore and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-Async: enable" \
-d '{
    "model": "qwen3-asr-flash-filetrans",
    "input": {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
    },
    "parameters": {
        "channel_id":[
            0
        ], 
        "enable_itn": false
    }
}'

Get the task execution result

# ======= Important =======
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}. Note: Replace {task_id} with the ID of the task to query.
# The API keys for the Singapore and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X GET 'https://dashscope-intl.aliyuncs.com/api/v1/tasks/{task_id}' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "X-DashScope-Async: enable" \
-H "Content-Type: application/json"

Complete examples

Java

import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import okhttp3.*;

import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class Main {
    // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription
    private static final String API_URL_SUBMIT = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription";
    // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/tasks/
    private static final String API_URL_QUERY = "https://dashscope-intl.aliyuncs.com/api/v1/tasks/";
    private static final Gson gson = new Gson();

    public static void main(String[] args) {
        // The API keys for the Singapore and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: String apiKey = "sk-xxx"
        String apiKey = System.getenv("DASHSCOPE_API_KEY");

        OkHttpClient client = new OkHttpClient();

        // 1. Submit the task
        /*String payloadJson = """
                {
                    "model": "qwen3-asr-flash-filetrans",
                    "input": {
                        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                    },
                    "parameters": {
                        "channel_id": [0],
                        "enable_itn": false,
                        "language": "zh",
                        "corpus": {
                            "text": ""
                        }
                    }
                }
                """;*/
        String payloadJson = """
                {
                    "model": "qwen3-asr-flash-filetrans",
                    "input": {
                        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                    },
                    "parameters": {
                        "channel_id": [0],
                        "enable_itn": false
                    }
                }
                """;

        RequestBody body = RequestBody.create(payloadJson, MediaType.get("application/json; charset=utf-8"));
        Request submitRequest = new Request.Builder()
                .url(API_URL_SUBMIT)
                .addHeader("Authorization", "Bearer " + apiKey)
                .addHeader("Content-Type", "application/json")
                .addHeader("X-DashScope-Async", "enable")
                .post(body)
                .build();

        String taskId = null;

        try (Response response = client.newCall(submitRequest).execute()) {
            if (response.isSuccessful() && response.body() != null) {
                String respBody = response.body().string();
                ApiResponse apiResp = gson.fromJson(respBody, ApiResponse.class);
                if (apiResp.output != null) {
                    taskId = apiResp.output.taskId;
                    System.out.println("Task submitted. task_id: " + taskId);
                } else {
                    System.out.println("Submission response content: " + respBody);
                    return;
                }
            } else {
                System.out.println("Task submission failed! HTTP code: " + response.code());
                if (response.body() != null) {
                    System.out.println(response.body().string());
                }
                return;
            }
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }

        // 2. Poll the task status
        boolean finished = false;
        while (!finished) {
            try {
                TimeUnit.SECONDS.sleep(2);  // Wait 2 seconds before querying again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }

            String queryUrl = API_URL_QUERY + taskId;
            Request queryRequest = new Request.Builder()
                    .url(queryUrl)
                    .addHeader("Authorization", "Bearer " + apiKey)
                    .addHeader("X-DashScope-Async", "enable")
                    .addHeader("Content-Type", "application/json")
                    .get()
                    .build();

            try (Response response = client.newCall(queryRequest).execute()) {
                if (response.body() != null) {
                    String queryResponse = response.body().string();
                    ApiResponse apiResp = gson.fromJson(queryResponse, ApiResponse.class);

                    if (apiResp.output != null && apiResp.output.taskStatus != null) {
                        String status = apiResp.output.taskStatus;
                        System.out.println("Current task status: " + status);
                        if ("SUCCEEDED".equalsIgnoreCase(status)
                                || "FAILED".equalsIgnoreCase(status)
                                || "UNKNOWN".equalsIgnoreCase(status)) {
                            finished = true;
                            System.out.println("Task completed. Final result: ");
                            System.out.println(queryResponse);
                        }
                    } else {
                        System.out.println("Query response content: " + queryResponse);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
                return;
            }
        }
    }

    static class ApiResponse {
        @SerializedName("request_id")
        String requestId;
        Output output;
    }

    static class Output {
        @SerializedName("task_id")
        String taskId;
        @SerializedName("task_status")
        String taskStatus;
    }
}

Python

import os
import time
import requests
import json

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription
API_URL_SUBMIT = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription"
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/tasks/
API_URL_QUERY_BASE = "https://dashscope-intl.aliyuncs.com/api/v1/tasks/"


def main():
    # The API keys for the Singapore and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-DashScope-Async": "enable"
    }

    # 1. Submit the task
    payload = {
        "model": "qwen3-asr-flash-filetrans",
        "input": {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
        },
        "parameters": {
            "channel_id": [0],
            # "language": "zh",
            "enable_itn": False
            # "corpus": {
            #     "text": ""
            # }
        }
    }

    print("Submitting ASR transcription task...")
    try:
        submit_resp = requests.post(API_URL_SUBMIT, headers=headers, data=json.dumps(payload))
    except requests.RequestException as e:
        print(f"Failed to submit the task request: {e}")
        return

    if submit_resp.status_code != 200:
        print(f"Task submission failed! HTTP code: {submit_resp.status_code}")
        print(submit_resp.text)
        return

    resp_data = submit_resp.json()
    output = resp_data.get("output")
    if not output or "task_id" not in output:
        print("Abnormal submission response content:", resp_data)
        return

    task_id = output["task_id"]
    print(f"Task submitted. task_id: {task_id}")

    # 2. Poll the task status
    finished = False
    while not finished:
        time.sleep(2)  # Wait 2 seconds before querying again

        query_url = API_URL_QUERY_BASE + task_id
        try:
            query_resp = requests.get(query_url, headers=headers)
        except requests.RequestException as e:
            print(f"Failed to query the task: {e}")
            return

        if query_resp.status_code != 200:
            print(f"Task query failed! HTTP code: {query_resp.status_code}")
            print(query_resp.text)
            return

        query_data = query_resp.json()
        output = query_data.get("output")
        if output and "task_status" in output:
            status = output["task_status"]
            print(f"Current task status: {status}")

            if status.upper() in ("SUCCEEDED", "FAILED", "UNKNOWN"):
                finished = True
                print("Task completed. The final result is as follows:")
                print(json.dumps(query_data, indent=2, ensure_ascii=False))
        else:
            print("Query response content:", query_data)


if __name__ == "__main__":
    main()

Java SDK

import com.alibaba.dashscope.audio.qwen_asr.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonObject;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;

public class Main {
    public static void main(String[] args) {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        QwenTranscriptionParam param =
                QwenTranscriptionParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                        // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("qwen3-asr-flash-filetrans")
                        .fileUrl("https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav")
                        //.parameter("language", "zh")
                        //.parameter("channel_id", new ArrayList<String>(){{add("0");add("1");}})
                        .parameter("enable_itn", false)
                        //.parameter("corpus", new HashMap<String, String>() {{put("text", "");}})
                        .build();
        try {
            QwenTranscription transcription = new QwenTranscription();
            // Submit the task
            QwenTranscriptionResult result = transcription.asyncCall(param);
            System.out.println("create task result: " + result);
            // Query the task status
            result = transcription.fetch(QwenTranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            System.out.println("task status: " + result);
            // Wait for the task to complete
            result =
                    transcription.wait(
                            QwenTranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            System.out.println("task result: " + result);
            // Get the speech recognition result
            QwenTranscriptionTaskResult taskResult = result.getResult();
            if (taskResult != null) {
                // Get the URL of the recognition result
                String transcriptionUrl = taskResult.getTranscriptionUrl();
                // Get the result from the URL
                HttpURLConnection connection =
                        (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                connection.setRequestMethod("GET");
                connection.connect();
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(connection.getInputStream()));
                // Format and print the JSON result
                Gson gson = new GsonBuilder().setPrettyPrinting().create();
                System.out.println(gson.toJson(gson.fromJson(reader, JsonObject.class)));
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
    }
}

Python SDK

import os

import dashscope
from dashscope.audio.qwen_asr import QwenTranscription


# run the transcription script
if __name__ == '__main__':
    # The API keys for the Singapore and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
    dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
    task_response = QwenTranscription.async_call(
        model='qwen3-asr-flash-filetrans',
        file_url='https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav',
        #language="",
        enable_itn=False
        #corpus= {
        #    "text": ""
        #}
    )
    print(f'task_response: {task_response}')
    print(task_response.output.task_id)
    query_response = QwenTranscription.fetch(task=task_response.output.task_id)
    print(f'query_response: {query_response}')
    task_result = QwenTranscription.wait(task=task_response.output.task_id)
    print(f'task_result: {task_result}')
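
For subtitle scenarios (see the model selection table above), the completed filetrans task points to a transcription JSON that you can post-process into an SRT file. The following is a minimal sketch, assuming the JSON fetched from the transcription URL contains a transcripts list whose sentences carry begin_time, end_time (in milliseconds), and text fields; these field names are assumptions, so verify them against your actual task output.

import json
import urllib.request


def ms_to_srt_time(ms: int) -> str:
    # Convert milliseconds to the SRT timestamp format HH:MM:SS,mmm.
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def transcription_to_srt(transcription_url: str) -> str:
    # Download the transcription JSON and format each sentence as an SRT cue.
    # The field names (transcripts, sentences, begin_time, end_time, text) are
    # assumptions; check the actual JSON returned for your task.
    with urllib.request.urlopen(transcription_url) as resp:
        result = json.load(resp)
    cues = []
    index = 1
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            start = ms_to_srt_time(sentence["begin_time"])
            end = ms_to_srt_time(sentence["end_time"])
            cues.append(f"{index}\n{start} --> {end}\n{sentence['text']}\n")
            index += 1
    return "\n".join(cues)


# Example usage with a hypothetical transcription URL taken from a completed task:
# print(transcription_to_srt("https://example.com/your-transcription-result.json"))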

qwen3-asr-flash

The qwen3-asr-flash model supports audio files up to 5 minutes long. This model accepts a publicly accessible URL of an audio file or a direct upload of a local file as input. It also supports streaming output for recognition results.

Input: Audio file URL

Python SDK

import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]

response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # If you use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        #"language": "zh", # Optional. If the language of the audio is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn":False
    }
)
print(response)

Java SDK

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
                .build();

        MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                // Configure the context for custom recognition here
                .content(Arrays.asList(Collections.singletonMap("text", "")))
                .build();

        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", false);
        // asrOptions.put("language", "zh"); // Optional. If the language of the audio is known, specify it with this parameter to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // If you use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
                .model("qwen3-asr-flash")
                .message(sysMessage)
                .message(userMessage)
                .parameter("asr_options", asrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(JsonUtils.toJson(result));
    }
    public static void main(String[] args) {
        try {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

You can configure context for custom recognition using the text parameter in the System Message.

# ======= Important =======
# The following is the URL for the Singapore region. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you use a model in the US region, you must append the -us suffix.
# === Delete this comment before execution ===

curl -X POST "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-asr-flash",
    "input": {
        "messages": [
            {
                "content": [
                    {
                        "text": ""
                    }
                ],
                "role": "system"
            },
            {
                "content": [
                    {
                        "audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                    }
                ],
                "role": "user"
            }
        ]
    },
    "parameters": {
        "asr_options": {
            "enable_itn": false
        }
    }
}'
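
For example, with the DashScope Python SDK you can bias recognition toward domain terms by filling the system message text. The context string below is a made-up example; the supported context formats are described in Context biasing.

import os
import dashscope

# The following is the URL for the Singapore region; adjust it for the Beijing or US region as in the examples above.
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

# Hypothetical context: replace with terms, names, or phrases from your own domain.
context = "Product names: Model Studio, Qwen3-ASR-Flash. Speaker names: Zhang Wei, Li Na."

messages = [
    {"role": "system", "content": [{"text": context}]},  # Context biasing text.
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",  # Use qwen3-asr-flash-us for the US region.
    messages=messages,
    result_format="message",
)
print(response)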

Input: Base64-encoded audio file

You can input Base64-encoded data (Data URL) in the following format: data:<mediatype>;base64,<data>.

  • <mediatype>: The Multipurpose Internet Mail Extensions (MIME) type.

    This parameter varies depending on the audio format. For example:

    • WAV: audio/wav

    • MP3: audio/mpeg

    • M4A: audio/mp4

  • <data>: The Base64-encoded string of the audio.

    Base64 encoding increases the file size. Control the original file size to ensure that the encoded data does not exceed the input audio size limit of 10 MB.

  • Example: data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9

    Sample code (Python):

    import base64, pathlib

    # input.mp3 is the local audio file to recognize. Replace it with the path to your audio file and ensure it meets the audio requirements.
    file_path = pathlib.Path("input.mp3")
    base64_str = base64.b64encode(file_path.read_bytes()).decode()
    data_uri = f"data:audio/mpeg;base64,{base64_str}"

    Sample code (Java):

    import java.nio.file.*;
    import java.util.Base64;

    public class Main {
        /**
         * filePath is the local audio file to recognize. Replace it with the path to your audio file and ensure it meets the audio requirements.
         */
        public static String toDataUrl(String filePath) throws Exception {
            byte[] bytes = Files.readAllBytes(Paths.get(filePath));
            String encoded = Base64.getEncoder().encodeToString(bytes);
            return "data:audio/mpeg;base64," + encoded;
        }

        // Example usage
        public static void main(String[] args) throws Exception {
            System.out.println(toDataUrl("input.mp3"));
        }
    }

Python SDK

The audio file used in the example is: welcome.mp3.

import base64
import dashscope
import os
import pathlib

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace with the actual path to your audio file.
file_path = "welcome.mp3"
# Replace with the actual MIME type of your audio file.
audio_mime_type = "audio/mpeg"

file_path_obj = pathlib.Path(file_path)
if not file_path_obj.exists():
    raise FileNotFoundError(f"Audio file not found: {file_path}")

base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
data_uri = f"data:{audio_mime_type};base64,{base64_str}"

messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition.
    {"role": "user", "content": [{"audio": data_uri}]}
]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # If you use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh", # Optional. If the language of the audio is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn":False
    }
)
print(response)

Java SDK

The audio file used in the example is: welcome.mp3.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    // Replace with the actual path to your audio file.
    private static final String AUDIO_FILE = "welcome.mp3";
    // Replace with the actual MIME type of your audio file.
    private static final String AUDIO_MIME_TYPE = "audio/mpeg";

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException, IOException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", toDataUrl())))
                .build();

        MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                // Configure the context for custom recognition here.
                .content(Arrays.asList(Collections.singletonMap("text", "")))
                .build();

        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", false);
        // asrOptions.put("language", "zh"); // Optional. If the language of the audio is known, specify it with this parameter to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // If you use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
                .model("qwen3-asr-flash")
                .message(sysMessage)
                .message(userMessage)
                .parameter("asr_options", asrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(JsonUtils.toJson(result));
    }

    public static void main(String[] args) {
        try {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }

    // Generate data URI
    public static String toDataUrl() throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(AUDIO_FILE));
        String encoded = Base64.getEncoder().encodeToString(bytes);
        return "data:" + AUDIO_MIME_TYPE + ";base64," + encoded;
    }
}

Input: Absolute path of a local audio file

When you use the DashScope SDK to process local audio files, you must provide a file path. For instructions on how to construct the file path based on your operating system and SDK, see the following table.

| System | SDK | File path to pass | Example |
| --- | --- | --- | --- |
| Linux or macOS | Python SDK, Java SDK | file://{absolute_path_of_the_file} | file:///home/audio/welcome.mp3 |
| Windows | Python SDK | file://{absolute_path_of_the_file} | file://D:/audio/welcome.mp3 |
| Windows | Java SDK | file:///{absolute_path_of_the_file} | file:///D:/audio/welcome.mp3 |

Important

When you use local files, the API call limit is 100 queries per second (QPS), and this limit cannot be increased. Do not use this method for production environments, high-concurrency situations, or stress testing scenarios. For higher concurrency, we recommend that you upload the file to OSS and call the API using the audio file URL.
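
If you need higher concurrency, a common pattern is to upload the file to OSS and then pass a signed URL as the audio input, as recommended above. The following is a minimal sketch that uses the OSS Python SDK (oss2); the endpoint, bucket name, object key, and credential variable names are placeholders, so adapt them to your own OSS setup.

import os
import oss2  # OSS Python SDK: pip install oss2

# Placeholders: replace with your own credentials, endpoint, and bucket name.
auth = oss2.Auth(os.getenv("OSS_ACCESS_KEY_ID"), os.getenv("OSS_ACCESS_KEY_SECRET"))
bucket = oss2.Bucket(auth, "https://oss-ap-southeast-1.aliyuncs.com", "your-bucket-name")

# Upload the local audio file, then generate a temporary signed URL (valid for 1 hour).
object_key = "audio/welcome.mp3"
bucket.put_object_from_file(object_key, "welcome.mp3")
audio_url = bucket.sign_url("GET", object_key, 3600)

# Pass audio_url as the audio input, as shown in the "Input: Audio file URL" examples above.
print(audio_url)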

Python SDK

The audio file used in the example is: welcome.mp3.

import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path to your local audio file.
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"

messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition.
    {"role": "user", "content": [{"audio": audio_file_path}]}
]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # If you use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh", # Optional. If the language of the audio is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn":False
    }
)
print(response)

Java SDK

The audio file used in the example is: welcome.mp3.

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path to your local file.
        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", localFilePath)))
                .build();

        MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                // Configure the context for custom recognition here.
                .content(Arrays.asList(Collections.singletonMap("text", "")))
                .build();

        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", false);
        // asrOptions.put("language", "zh"); // Optional. If the language of the audio is known, specify it with this parameter to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // If you use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
                .model("qwen3-asr-flash")
                .message(sysMessage)
                .message(userMessage)
                .parameter("asr_options", asrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(JsonUtils.toJson(result));
    }
    public static void main(String[] args) {
        try {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Streaming output

In non-streaming mode, the model returns the complete result after it finishes processing. In streaming mode, the model returns intermediate results as they are generated, which reduces waiting time because you can start reading the output sooner. To enable streaming output, set the appropriate parameter for your calling method:

  • DashScope Python SDK: Set the stream parameter to True.

  • DashScope Java SDK: Call the service through the streamCall interface.

  • DashScope HTTP: Set the X-DashScope-SSE header to enable.

Python SDK

import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {"role": "system", "content": [{"text": ""}]},  # Configure the context for custom recognition.
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # If you use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={
        # "language": "zh", # Optional. If the language of the audio is known, specify it with this parameter to improve recognition accuracy.
        "enable_itn":False
    },
    stream=True
)

for chunk in response:
    try:
        # Print the recognition text carried by each streamed chunk.
        print(chunk["output"]["choices"][0]["message"].content[0]["text"])
    except (KeyError, IndexError, TypeError):
        # Skip chunks that do not contain recognized text.
        pass

Java SDK

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
                .build();

        MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                // Configure the context for custom recognition here.
                .content(Arrays.asList(Collections.singletonMap("text", "")))
                .build();

        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_itn", false);
        // asrOptions.put("language", "zh"); // Optional. If the language of the audio is known, specify it with this parameter to improve recognition accuracy.
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // If you use a model in the US region, append the "-us" suffix to the model name, for example, qwen3-asr-flash-us
                .model("qwen3-asr-flash")
                .message(sysMessage)
                .message(userMessage)
                .parameter("asr_options", asrOptions)
                .build();
        Flowable<MultiModalConversationResult> resultFlowable = conv.streamCall(param);
        resultFlowable.blockingForEach(item -> {
            try {
                System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
            } catch (Exception e){
                System.exit(0);
            }
        });
    }

    public static void main(String[] args) {
        try {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

You can configure context for custom recognition using the text parameter in the System Message.

# ======= Important =======
# The following is the URL for the Singapore region. If you use a model in the US region, replace the URL with: https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you use a model in the US region, you must append the -us suffix.
# === Delete this comment before execution ===

curl -X POST "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
    "model": "qwen3-asr-flash",
    "input": {
        "messages": [
            {
                "content": [
                    {
                        "text": ""
                    }
                ],
                "role": "system"
            },
            {
                "content": [
                    {
                        "audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                    }
                ],
                "role": "user"
            }
        ]
    },
    "parameters": {
        "incremental_output": true,
        "asr_options": {
            "enable_itn": false
        }
    }
}'

OpenAI compatible

Important

The US (Virginia) region does not support OpenAI compatible mode.

Only the Qwen3-ASR-Flash series models support calls in OpenAI compatible mode. This mode accepts publicly accessible audio file URLs or Base64-encoded audio data as input, and does not support absolute paths of local audio files.

The OpenAI Python SDK version must be 1.52.0 or later, and the Node.js SDK version must be 4.68.0 or later.

The asr_options parameter is not a standard OpenAI parameter. If you use the OpenAI SDK, pass this parameter in extra_body.

Input: Audio file URL

Python SDK

from openai import OpenAI
import os

try:
    client = OpenAI(
        # The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # The following is the URL for the Singapore/US region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    

    stream_enabled = False  # Specifies whether to enable streaming output.
    completion = client.chat.completions.create(
        model="qwen3-asr-flash",
        messages=[
            {
                # Configure the context for custom recognition.
                "content": [{
                    "text": ""
                }],
                "role": "system"
            },
            {
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                        }
                    }
                ],
                "role": "user"
            }
        ],
        stream=stream_enabled,
        # When stream is set to false, do not set the stream_options parameter.
        # stream_options={"include_usage": True},
        extra_body={
            "asr_options": {
                # "language": "zh",
                "enable_itn": False
            }
        }
    )
    if stream_enabled:
        full_content = ""
        print("Streaming output:")
        for chunk in completion:
            # If stream_options.include_usage is true, the choices field of the last chunk is an empty list and must be skipped. You can get token usage from chunk.usage.
            print(chunk)
            if chunk.choices and chunk.choices[0].delta.content:
                full_content += chunk.choices[0].delta.content
        print(f"Full content: {full_content}")
    else:
        print(f"Non-streaming output: {completion.choices[0].message.content}")
except Exception as e:
    print(f"Error: {e}")

Node.js SDK

// Preparations:
// For Windows/Mac/Linux:
// 1. Make sure Node.js is installed (version >= 14 is recommended).
// 2. Run the following command to install the necessary dependencies: npm install openai

import OpenAI from "openai";

const client = new OpenAI({
  // The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following is the URL for the Singapore/US region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1", 
});

async function main() {
  try {
    const streamEnabled = false; // Specifies whether to enable streaming output.
    const completion = await client.chat.completions.create({
      model: "qwen3-asr-flash",
      messages: [
        {
          role: "system",
          content: [
            { text: "" }
          ]
        },
        {
          role: "user",
          content: [
            {
              type: "input_audio",
              input_audio: {
                data: "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
              }
            }
          ]
        }
      ],
      stream: streamEnabled,
      // When stream is set to false, do not set the stream_options parameter.
      // stream_options: {
      //   "include_usage": true
      // },
      extra_body: {
        asr_options: {
          // language: "zh",
          enable_itn: false
        }
      }
    });

    if (streamEnabled) {
      let fullContent = "";
      console.log("Streaming output:");
      for await (const chunk of completion) {
        console.log(JSON.stringify(chunk));
        if (chunk.choices && chunk.choices.length > 0) {
          const delta = chunk.choices[0].delta;
          if (delta && delta.content) {
            fullContent += delta.content;
          }
        }
      }
      console.log(`Full content: ${fullContent}`);
    } else {
      console.log(`Non-streaming output: ${completion.choices[0].message.content}`);
    }
  } catch (err) {
    console.error(`Error: ${err}`);
  }
}

main();

cURL

You can configure context for custom recognition using the text parameter in the System Message.

# ======= Important =======
# The following is the URL for the Singapore/US region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-asr-flash",
    "messages": [
        {
            "content": [
                {
                    "text": ""
                }
            ],
            "role": "system"
        },
        {
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                    }
                }
            ],
            "role": "user"
        }
    ],
    "stream":false,
    "asr_options": {
        "enable_itn": false
    }
}'

Input: Base64-encoded audio file

You can input Base64-encoded data (Data URL) in the following format: data:<mediatype>;base64,<data>.

  • <mediatype>: The Multipurpose Internet Mail Extensions (MIME) type.

    This parameter varies depending on the audio format. For example:

    • WAV: audio/wav

    • MP3: audio/mpeg

    • M4A: audio/mp4

  • <data>: The Base64-encoded string of the audio.

    Base64 encoding increases the file size. Control the original file size to ensure that the encoded data does not exceed the input audio size limit of 10 MB.

  • Example: data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9

    Sample code for building a Data URL:

    Python

    import base64, pathlib
    
    # input.mp3 is the local audio file to be recognized. Replace it with the path to your audio file and make sure it meets the audio requirements.
    file_path = pathlib.Path("input.mp3")
    base64_str = base64.b64encode(file_path.read_bytes()).decode()
    data_uri = f"data:audio/mpeg;base64,{base64_str}"

    Java

    import java.nio.file.*;
    import java.util.Base64;
    
    public class Main {
        /**
         * filePath is the local audio file to be recognized. Replace it with the path to your audio file and make sure it meets the audio requirements.
         */
        public static String toDataUrl(String filePath) throws Exception {
            byte[] bytes = Files.readAllBytes(Paths.get(filePath));
            String encoded = Base64.getEncoder().encodeToString(bytes);
            return "data:audio/mpeg;base64," + encoded;
        }
    
        // Example usage
        public static void main(String[] args) throws Exception {
            System.out.println(toDataUrl("input.mp3"));
        }
    }
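Because Base64 encoding inflates the payload by roughly one third, it can help to check the encoded size against the 10 MB limit before you send the request. The following is a minimal sketch; input.mp3 is a placeholder path.

import base64
import pathlib

# Encode the audio file and report the size of the resulting Base64 string.
file_path = pathlib.Path("input.mp3")  # Replace with your audio file.
encoded = base64.b64encode(file_path.read_bytes()).decode()
encoded_mb = len(encoded) / (1024 * 1024)
print(f"Encoded size: {encoded_mb:.2f} MB")
if encoded_mb > 10:
    print("The encoded audio exceeds the 10 MB limit. Trim or compress the file first.")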

Python SDK

The audio file used in the example is: welcome.mp3.

import base64
from openai import OpenAI
import os
import pathlib

try:
    # Replace with the actual path to your audio file.
    file_path = "welcome.mp3"
    # Replace with the actual MIME type of your audio file.
    audio_mime_type = "audio/mpeg"

    file_path_obj = pathlib.Path(file_path)
    if not file_path_obj.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")

    base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
    data_uri = f"data:{audio_mime_type};base64,{base64_str}"

    client = OpenAI(
        # The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # The following is the URL for the Singapore/US region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    

    stream_enabled = False  # Specifies whether to enable streaming output.
    completion = client.chat.completions.create(
        model="qwen3-asr-flash",
        messages=[
            {
                # Configure the context for custom recognition.
                "content": [{
                    "text": ""
                }],
                "role": "system"
            },
            {
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": data_uri
                        }
                    }
                ],
                "role": "user"
            }
        ],
        stream=stream_enabled,
        # When stream is set to false, do not set the stream_options parameter.
        # stream_options={"include_usage": True},
        extra_body={
            "asr_options": {
                # "language": "zh",
                "enable_itn": False
            }
        }
    )
    if stream_enabled:
        full_content = ""
        print("Streaming output:")
        for chunk in completion:
            # If stream_options.include_usage is true, the choices field of the last chunk is an empty list and must be skipped. You can get token usage from chunk.usage.
            print(chunk)
            if chunk.choices and chunk.choices[0].delta.content:
                full_content += chunk.choices[0].delta.content
        print(f"Full content: {full_content}")
    else:
        print(f"Non-streaming output: {completion.choices[0].message.content}")
except Exception as e:
    print(f"Error: {e}")

Node.js SDK

The audio file used in the example is: welcome.mp3.

// Preparations:
// For Windows/Mac/Linux:
// 1. Make sure Node.js is installed (version >= 14 is recommended).
// 2. Run the following command to install the necessary dependencies: npm install openai

import OpenAI from "openai";
import { readFileSync } from 'fs';

const client = new OpenAI({
  // The API keys for the Singapore/US and Beijing regions are different. To get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following is the URL for the Singapore/US region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1", 
});

const encodeAudioFile = (audioFilePath) => {
    const audioFile = readFileSync(audioFilePath);
    return audioFile.toString('base64');
};

// Replace with the actual path to your audio file.
const dataUri = `data:audio/mpeg;base64,${encodeAudioFile("welcome.mp3")}`;

async function main() {
  try {
    const streamEnabled = false; // Specifies whether to enable streaming output.
    const completion = await client.chat.completions.create({
      model: "qwen3-asr-flash",
      messages: [
        {
          role: "system",
          content: [
            { text: "" }
          ]
        },
        {
          role: "user",
          content: [
            {
              type: "input_audio",
              input_audio: {
                data: dataUri
              }
            }
          ]
        }
      ],
      stream: streamEnabled,
      // When stream is set to false, do not set the stream_options parameter.
      // stream_options: {
      //   "include_usage": true
      // },
      extra_body: {
        asr_options: {
          // language: "zh",
          enable_itn: false
        }
      }
    });

    if (streamEnabled) {
      let fullContent = "";
      console.log("Streaming output:");
      for await (const chunk of completion) {
        console.log(JSON.stringify(chunk));
        if (chunk.choices && chunk.choices.length > 0) {
          const delta = chunk.choices[0].delta;
          if (delta && delta.content) {
            fullContent += delta.content;
          }
        }
      }
      console.log(`Full content: ${fullContent}`);
    } else {
      console.log(`Non-streaming output: ${completion.choices[0].message.content}`);
    }
  } catch (err) {
    console.error(`Error: ${err}`);
  }
}

main();

Core usage: context biasing

Qwen3-ASR supports context biasing, which lets you provide context to improve the recognition of domain-specific vocabulary, such as names, places, and product terms. This feature significantly improves transcription accuracy and is more flexible and powerful than traditional hotword solutions.

Length limit: The context cannot exceed 10,000 tokens.

Usage: When you call the API, pass the context text in the text parameter of the System Message. A code sketch follows the example below.

Supported text types: Supported text types include, but are not limited to, the following:

  • Hotword lists that use various separators (for example, hotword 1, hotword 2, hotword 3, hotword 4)

  • Text paragraphs or chapters in any format

  • Mixed content: Any combination of word lists and paragraphs

  • Irrelevant or meaningless text (including garbled characters). The model has high fault tolerance and is unlikely to be negatively affected by irrelevant text.

Example:

In this example, the correct transcription of an audio segment is "What jargon from the investment banking industry do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB...".

Without context biasing

Without context biasing, the model incorrectly recognizes some investment bank names. For example, "Bulge Bracket" is recognized as "Bird Rock".

Recognition result: "What jargon from the investment banking industry do you know? First, the nine major foreign investment banks, the Bird Rock, BB..."

With context biasing

With context biasing, the model correctly recognizes the investment bank names.

Recognition result: "What jargon from the investment banking industry do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB..."

To achieve this result, you can add any of the following content to the context:

  • Word lists:

    • Word list 1:

      Bulge Bracket, Boutique, Middle Market, domestic securities firms
    • Word list 2:

      Bulge Bracket Boutique Middle Market domestic securities firms
    • Word list 3:

      ['Bulge Bracket', 'Boutique', 'Middle Market', 'domestic securities firms']
  • Natural language:

    The secrets of investment banking categories revealed!
    Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
    Bulge Bracket investment banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large banks are enormous in both business scope and scale.
    Boutique investment banks: These banks are relatively small but highly specialized in their business areas. For example, Lazard, Evercore, etc., have deep expertise and experience in specific fields.
    Middle Market investment banks: This type of bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major banks, they have a high influence in specific markets.
    Domestic securities firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market.
    In addition, there are some Position and business divisions, you can refer to the relevant charts. I hope this information helps you better understand investment banking and prepare for your future career!
  • Natural language with irrelevant text: The context can contain irrelevant text. The following example includes irrelevant names.

    The secrets of investment banking categories revealed!
    Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
    Bulge Bracket investment banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large banks are enormous in both business scope and scale.
    Boutique investment banks: These banks are relatively small but highly specialized in their business areas. For example, Lazard, Evercore, etc., have deep expertise and experience in specific fields.
    Middle Market investment banks: This type of bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major banks, they have a high influence in specific markets.
    Domestic securities firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market.
    In addition, there are some Position and business divisions, you can refer to the relevant charts. I hope this information helps you better understand investment banking and prepare for your future career!
    Wang Haoxuan, Li Zihan, Zhang Jingxing, Liu Xinyi, Chen Junjie, Yang Siyuan, Zhao Yutong, Huang Zhiqiang, Zhou Zimo, Wu Yajing, Xu Ruoxi, Sun Haoran, Hu Jinyu, Zhu Chenxi, Guo Wenbo, He Jingshu, Gao Yuhang, Lin Yifei, 
    Zheng Xiaoyan, Liang Bowen, Luo Jiaqi, Song Mingzhe, Xie Wanting, Tang Ziqian, Han Mengyao, Feng Yiran, Cao Qinxue, Deng Zirui, Xiao Wangshu, Xu Jiashu, 
    Cheng Yinuo, Yuan Zhiruo, Peng Haoyu, Dong Simiao, Fan Jingyu, Su Zijin, Lv Wenxuan, Jiang Shihan, Ding Muchen, 
    Wei Shuyao, Ren Tianyou, Jiang Yichen, Hua Qingyu, Shen Xinghe, Fu Jinyu, Yao Xingchen, Zhong Lingyu, Yan Licheng, Jin Ruoshui, Taoranting, Qi Shaoshang, Xue Zhilan, Zou Yunfan, Xiong Ziang, Bai Wenfeng, Yi Qianfan
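For reference, the following is a minimal sketch of passing such context through the system message. It reuses the OpenAI-compatible Python setup from the examples above; the context string is Word list 1 from this example, and the audio URL is the sample file used earlier rather than the investment-banking recording.

import os
from openai import OpenAI

client = OpenAI(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key.
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore/US endpoint. For the Beijing region, use https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Context biasing: place the word list (or any other context text) in the text field of the system message.
context = "Bulge Bracket, Boutique, Middle Market, domestic securities firms"

completion = client.chat.completions.create(
    model="qwen3-asr-flash",
    messages=[
        {"role": "system", "content": [{"text": context}]},
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                    },
                }
            ],
        },
    ],
)
print(completion.choices[0].message.content)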

API reference

Qwen API reference: Recognizing audio files

Model feature comparison

The features listed below for the qwen3-asr-flash and qwen3-asr-flash-2025-09-08 models also apply to their corresponding models in the US (Virginia) region: qwen3-asr-flash-us and qwen3-asr-flash-2025-09-08-us. In the following list, qwen3-asr-flash-filetrans also covers its snapshot qwen3-asr-flash-filetrans-2025-11-17, and qwen3-asr-flash also covers its snapshot qwen3-asr-flash-2025-09-08.

  • Supported languages: Chinese (Mandarin, Sichuanese, Minnan, Wu, and Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese, Czech, Danish, Filipino, Finnish, Icelandic, Malay, Norwegian, Polish, and Swedish

  • Supported audio formats:

    • qwen3-asr-flash-filetrans: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

    • qwen3-asr-flash: aac, amr, avi, aiff, flac, flv, m4a, mkv, mp3, mpeg, ogg, opus, wav, webm, wma, wmv

  • Sample rate: Any

  • Channels: Any. The models handle multi-channel audio differently:

    • qwen3-asr-flash-filetrans: You must specify the audio track index by using the channel_id parameter.

    • qwen3-asr-flash: No extra processing is needed. The model averages and merges the multi-channel audio before processing.

  • Input format:

    • qwen3-asr-flash-filetrans: Publicly accessible URL of the file to be recognized

    • qwen3-asr-flash: Base64-encoded file, absolute path of a local file, or publicly accessible URL of the file to be recognized

  • Audio size/duration:

    • qwen3-asr-flash-filetrans: File size up to 2 GB, duration up to 12 hours

    • qwen3-asr-flash: File size up to 10 MB, duration up to 5 minutes

  • Emotion recognition: Supported (always on)

  • Timestamp:

    • qwen3-asr-flash-filetrans: Supported (always on)

    • qwen3-asr-flash: Not supported

  • Punctuation prediction: Supported (always on)

  • Context biasing: Supported (configurable)

  • ITN: Supported (off by default; can be enabled with enable_itn). Applies only to Chinese and English.

  • Singing voice recognition: Supported (always on)

  • Noise rejection: Supported (always on)

  • Sensitive word filtering: Not supported

  • Speaker diarization: Not supported

  • Filler word filtering: Not supported

  • VAD (voice activity detection):

    • qwen3-asr-flash-filetrans: Supported (always on)

    • qwen3-asr-flash: Not supported

  • Rate limit (RPM): 100

  • Connection type:

    • qwen3-asr-flash-filetrans: DashScope (Java/Python SDK, RESTful API)

    • qwen3-asr-flash: DashScope (Java/Python SDK, RESTful API), OpenAI-compatible (Python/Node.js SDK, RESTful API)

  • Price:

    • International (Singapore): $0.000035/second

    • US (Virginia): $0.000032/second

    • Chinese mainland (Beijing): $0.000032/second

FAQ

Q: How do I provide a publicly accessible audio URL for the API?

You can upload the audio file to Alibaba Cloud Object Storage Service (OSS), a highly available and reliable storage service, and then generate a publicly accessible URL for the uploaded object.
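For example, the following is a minimal sketch that uses the oss2 Python SDK to upload a recording and generate a time-limited signed URL. The endpoint, bucket name, object key, and file name below are placeholders; a public-read bucket with a plain object URL works as well.

import os
import oss2

# Placeholders: replace the endpoint, bucket name, object key, and local file name with your own values.
auth = oss2.Auth(os.getenv("OSS_ACCESS_KEY_ID"), os.getenv("OSS_ACCESS_KEY_SECRET"))
bucket = oss2.Bucket(auth, "https://oss-ap-southeast-1.aliyuncs.com", "my-audio-bucket")

# Upload the local recording, then generate a signed URL that stays valid for one hour.
bucket.put_object_from_file("meeting.mp3", "local_meeting.mp3")
audio_url = bucket.sign_url("GET", "meeting.mp3", 3600)
print(audio_url)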

To verify that the generated URL is publicly accessible, open it in a browser or request it with curl and confirm that the audio file downloads or plays successfully. A successful request returns an HTTP 200 status code.
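For example, the following curl command downloads the file, discards the body, and prints only the HTTP status code; the URL is a placeholder.

# Expect 200 in the output. -L follows redirects, -o /dev/null discards the downloaded data.
curl -sSL -o /dev/null -w "%{http_code}\n" "https://my-audio-bucket.oss-ap-southeast-1.aliyuncs.com/meeting.mp3"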

Q: How do I check if the audio format meets the requirements?

You can use the open-source tool ffprobe to quickly obtain detailed information about the audio:

# Query the audio container format (format_name), encoding (codec_name), sample rate (sample_rate), and number of channels (channels).
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 your_audio_file.mp3

Q: How do I process audio to meet the model's requirements?

You can use the open-source tool FFmpeg to crop or convert audio:

  • Crop audio: Extract a clip from a long audio file

    # -i: input file
    # -ss 00:01:30: Set the crop start time (starts at 1 minute and 30 seconds).
    # -t 00:02:00: Set the crop duration (2 minutes).
    # -c copy: Directly copy the audio stream without re-encoding for faster processing.
    # output_clip.wav: output file
    ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav
  • Format conversion

    For example, you can convert any audio file to a 16 kHz, 16-bit, mono WAV file.

    # -i: input file
    # -ac 1: Set the number of audio channels to 1 (mono).
    # -ar 16000: Set the sample rate to 16000 Hz (16 kHz).
    # -sample_fmt s16: Set the sample format to 16-bit signed integer PCM.
    # output.wav: output file
    ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav