
Alibaba Cloud Model Studio: Real-time speech recognition - Qwen

Last Updated: Jan 16, 2026

In scenarios such as live streaming, online meetings, voice chats, or smart assistants, you may need to convert a continuous audio stream into text in real time. Qwen's real-time speech recognition lets you provide instant captions, generate meeting minutes, or respond to voice commands.

Core features

  • High-accuracy multilingual recognition: Accurately recognizes multiple languages, including Mandarin and Chinese dialects such as Cantonese and Sichuanese. For more information, see Model features.

  • Adaptation to complex environments: Adapts to complex acoustic environments and supports automatic language detection and intelligent filtering of non-human sounds.

  • Context biasing: Improves recognition accuracy by allowing you to provide context. This feature is more flexible and powerful than traditional hotword solutions.

  • Emotion recognition: Recognizes multiple emotional states, including surprise, calmness, happiness, sadness, disgust, anger, and fear.

Availability

Supported models:

International

In the international deployment mode, the endpoint and data storage are located in the Singapore region. Model inference computing resources are dynamically scheduled globally, excluding Mainland China.

To call the following models, use an API key from the Singapore region:

Qwen3-ASR-Flash-Realtime: qwen3-asr-flash-realtime (the stable version, which is currently an alias for qwen3-asr-flash-realtime-2025-10-27) and qwen3-asr-flash-realtime-2025-10-27 (the snapshot version)

Mainland China

In the Mainland China deployment mode, the endpoint and data storage are located in the Beijing region. Model inference computing resources are limited to Mainland China.

To call the following models, use an API key from the Beijing region:

Qwen3-ASR-Flash-Realtime: qwen3-asr-flash-realtime (the stable version, which is currently an alias for qwen3-asr-flash-realtime-2025-10-27) and qwen3-asr-flash-realtime-2025-10-27 (the snapshot version)

For more information, see Model list.

Model selection

All of the following scenarios use the recommended model qwen3-asr-flash-realtime (or its snapshot, qwen3-asr-flash-realtime-2025-10-27):

  • Intelligent quality inspection for customer service: Analyzes call content and customer emotions in real time to assist agents and monitor service quality.

  • Live streaming/Short videos: Generates real-time captions for live content to reach a multilingual audience.

  • Online meetings/Interviews: Records meeting speech in real time and quickly generates text summaries to make organizing information more efficient.

  • Transcription for professional domains (such as medical and legal): Supports context biasing to dynamically inject domain-specific hotwords and improve their hit rate.

For more information, see Model features.

Getting started

DashScope SDK

Java

  1. Install the SDK. Make sure that the DashScope SDK version is 2.22.5 or later.

  2. Get an API key. Set the API key as an environment variable to avoid hard coding it in your code.

  3. Run the sample code.

    import com.alibaba.dashscope.audio.omni.*;
    import com.alibaba.dashscope.exception.NoApiKeyException;
    import com.google.gson.JsonObject;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    
    import javax.sound.sampled.LineUnavailableException;
    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Base64;
    import java.util.Collections;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicReference;
    
    public class Qwen3AsrRealtimeUsage {
        private static final Logger log = LoggerFactory.getLogger(Qwen3AsrRealtimeUsage.class);
        private static final int AUDIO_CHUNK_SIZE = 1024; // Audio chunk size in bytes
        private static final int SLEEP_INTERVAL_MS = 30;  // Sleep interval in milliseconds
    
        public static void main(String[] args) throws InterruptedException, LineUnavailableException {
            CountDownLatch finishLatch = new CountDownLatch(1);
    
            OmniRealtimeParam param = OmniRealtimeParam.builder()
                    .model("qwen3-asr-flash-realtime")
                    // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
                    .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                    // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                    // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apikey("sk-xxx")
                    .apikey(System.getenv("DASHSCOPE_API_KEY"))
                    .build();
    
            OmniRealtimeConversation conversation = null;
            final AtomicReference<OmniRealtimeConversation> conversationRef = new AtomicReference<>(null);
            conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
                @Override
                public void onOpen() {
                    System.out.println("connection opened");
                }
                @Override
                public void onEvent(JsonObject message) {
                    String type = message.get("type").getAsString();
                    switch(type) {
                        case "session.created":
                            System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
                            break;
                        case "conversation.item.input_audio_transcription.completed":
                            System.out.println("transcription: " + message.get("transcript").getAsString());
                            finishLatch.countDown();
                            break;
                        case "input_audio_buffer.speech_started":
                            System.out.println("======VAD Speech Start======");
                            break;
                        case "input_audio_buffer.speech_stopped":
                            System.out.println("======VAD Speech Stop======");
                            break;
                        case "conversation.item.input_audio_transcription.text":
                            System.out.println("transcription: " + message.get("text").getAsString());
                            break;
                        default:
                            break;
                    }
                }
                @Override
                public void onClose(int code, String reason) {
                    System.out.println("connection closed code: " + code + ", reason: " + reason);
                }
            });
            conversationRef.set(conversation);
            try {
                conversation.connect();
            } catch (NoApiKeyException e) {
                throw new RuntimeException(e);
            }
    
            OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
            transcriptionParam.setLanguage("zh");
            transcriptionParam.setInputAudioFormat("pcm");
            transcriptionParam.setInputSampleRate(16000);
            // Corpus, optional. If you have a corpus, set it to enhance recognition.
            // transcriptionParam.setCorpusText("This is a stand-up comedy show");
    
            OmniRealtimeConfig config = OmniRealtimeConfig.builder()
                    .modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
                    .transcriptionConfig(transcriptionParam)
                    .build();
            conversation.updateSession(config);
    
            String filePath = "your_audio_file.pcm";
            File audioFile = new File(filePath);
            if (!audioFile.exists()) {
                log.error("Audio file not found: {}", filePath);
                return;
            }
    
            try (FileInputStream audioInputStream = new FileInputStream(audioFile)) {
                byte[] audioBuffer = new byte[AUDIO_CHUNK_SIZE];
                int bytesRead;
                int totalBytesRead = 0;
    
                log.info("Starting to send audio data from: {}", filePath);
    
                // Read and send audio data in chunks
                while ((bytesRead = audioInputStream.read(audioBuffer)) != -1) {
                    totalBytesRead += bytesRead;
                    // Encode only the bytes actually read; the last chunk may be shorter than the buffer.
                    String audioB64 = Base64.getEncoder().encodeToString(java.util.Arrays.copyOf(audioBuffer, bytesRead));
                    // Send audio chunk to conversation
                    conversation.appendAudio(audioB64);
    
                    // Add small delay to simulate real-time audio streaming
                    Thread.sleep(SLEEP_INTERVAL_MS);
                }
    
                log.info("Finished sending audio data. Total bytes sent: {}", totalBytesRead);
    
            } catch (Exception e) {
                log.error("Error sending audio from file: {}", filePath, e);
            }
    
            // Send session.finish, wait for the final transcription result, and then exit.
            conversation.endSession();
            finishLatch.await();
            log.info("Task finished");
    
            System.exit(0);
        }
    }

Python

  1. Install the SDK. Make sure that the DashScope SDK version is 1.25.6 or later.

  2. Get an API key. Set the API key as an environment variable to avoid hard coding it in your code.

  3. Run the sample code.

    import logging
    import os
    import base64
    import signal
    import sys
    import time
    import dashscope
    from dashscope.audio.qwen_omni import *
    from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams
    
    
    def setup_logging():
        """Configure logging."""
        logger = logging.getLogger('dashscope')
        logger.setLevel(logging.DEBUG)
        handler = logging.StreamHandler(sys.stdout)
        handler.setLevel(logging.DEBUG)
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.propagate = False
        return logger
    
    
    def init_api_key():
        """Initialize the API key."""
        # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
        # If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
        dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY', 'YOUR_API_KEY')
        if dashscope.api_key == 'YOUR_API_KEY':
            print('[Warning] Using placeholder API key, set DASHSCOPE_API_KEY environment variable.')
    
    
    class MyCallback(OmniRealtimeCallback):
        """Handle real-time recognition callbacks."""
        def __init__(self, conversation):
            self.conversation = conversation
            self.handlers = {
                'session.created': self._handle_session_created,
                'conversation.item.input_audio_transcription.completed': self._handle_final_text,
                'conversation.item.input_audio_transcription.text': self._handle_stash_text,
                'input_audio_buffer.speech_started': lambda r: print('======Speech Start======'),
                'input_audio_buffer.speech_stopped': lambda r: print('======Speech Stop======')
            }
    
        def on_open(self):
            print('Connection opened')
    
        def on_close(self, code, msg):
            print(f'Connection closed, code: {code}, msg: {msg}')
    
        def on_event(self, response):
            try:
                handler = self.handlers.get(response['type'])
                if handler:
                    handler(response)
            except Exception as e:
                print(f'[Error] {e}')
    
        def _handle_session_created(self, response):
            print(f"Start session: {response['session']['id']}")
    
        def _handle_final_text(self, response):
            print(f"Final recognized text: {response['transcript']}")
    
        def _handle_stash_text(self, response):
            print(f"Got stash result: {response['stash']}")
    
    
    def read_audio_chunks(file_path, chunk_size=3200):
        """Read the audio file in chunks (3200 bytes is about 0.1 s of 16 kHz, 16-bit mono PCM)."""
        with open(file_path, 'rb') as f:
            while chunk := f.read(chunk_size):
                yield chunk
    
    
    def send_audio(conversation, file_path, delay=0.1):
        """Send audio data."""
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"Audio file {file_path} does not exist.")
    
        print("Processing audio file... Press 'Ctrl+C' to stop.")
        for chunk in read_audio_chunks(file_path):
            audio_b64 = base64.b64encode(chunk).decode('ascii')
            conversation.append_audio(audio_b64)
            time.sleep(delay)
    
    def main():
        setup_logging()
        init_api_key()
    
        audio_file_path = "./your_audio_file.pcm"
        conversation = OmniRealtimeConversation(
            model='qwen3-asr-flash-realtime',
            # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
            url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime',
            callback=MyCallback(conversation=None)  # Temporarily pass None and inject it later.
        )
    
        # Inject self into the callback.
        conversation.callback.conversation = conversation
    
        def handle_exit(sig, frame):
            print('Ctrl+C pressed, exiting...')
            conversation.close()
            sys.exit(0)
    
        signal.signal(signal.SIGINT, handle_exit)
    
        conversation.connect()
    
        transcription_params = TranscriptionParams(
            language='zh',
            sample_rate=16000,
            input_audio_format="pcm"
            # Corpus of the input audio to assist recognition.
            # corpus_text=""
        )
    
        conversation.update_session(
            output_modalities=[MultiModality.TEXT],
            enable_input_audio_transcription=True,
            transcription_params=transcription_params
        )
    
        try:
            send_audio(conversation, audio_file_path)
            # Send session.finish, wait for the session to finish, and then close the connection.
            conversation.end_session()
        except Exception as e:
            print(f"Error occurred: {e}")
        finally:
            conversation.close()
            print("Audio processing completed.")
    
    
    if __name__ == '__main__':
        main()
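
    The sample above streams a local PCM file to simulate real-time input. To transcribe live microphone audio instead, you could replace send_audio with a sketch like the following. It assumes the third-party pyaudio package (not part of the DashScope SDK) and reuses the conversation.append_audio call from the sample above; treat it as a starting point rather than a reference implementation.

    import base64
    import pyaudio  # Assumption: third-party package, installed separately (pip install pyaudio).

    def send_microphone_audio(conversation, seconds=10):
        """Stream 16 kHz, mono, 16-bit PCM microphone audio to the session."""
        chunk_frames = 1600  # 1600 frames x 2 bytes = 3200 bytes, about 0.1 s of audio
        pa = pyaudio.PyAudio()
        stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                         input=True, frames_per_buffer=chunk_frames)
        try:
            for _ in range(int(seconds * 16000 / chunk_frames)):
                data = stream.read(chunk_frames, exception_on_overflow=False)
                conversation.append_audio(base64.b64encode(data).decode('ascii'))
        finally:
            stream.stop_stream()
            stream.close()
            pa.terminate()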

WebSocket API

The following example shows how to send a local audio file and retrieve recognition results over a WebSocket connection.

  1. Get an API key. For security, set the API key as an environment variable instead of hard coding it.

  2. Write and run code: Implement the complete flow of authentication, connection, sending audio, and receiving results. For more information, see Interaction flow.

    Python

    Before you run the example, install the dependencies by running the following commands:

    pip uninstall websocket-client
    pip uninstall websocket
    pip install websocket-client

    Do not name the sample code file websocket.py. Otherwise, the following error may occur: AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?

    # pip install websocket-client
    import os
    import time
    import json
    import threading
    import base64
    import websocket
    import logging
    import logging.handlers
    from datetime import datetime
    
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: API_KEY="sk-xxx"
    API_KEY = os.environ.get("DASHSCOPE_API_KEY", "sk-xxx")
    QWEN_MODEL = "qwen3-asr-flash-realtime"
    # The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the base URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
    baseUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
    url = f"{baseUrl}?model={QWEN_MODEL}"
    print(f"Connecting to server: {url}")
    
    # Note: If you are not in VAD mode, the cumulative duration of continuously sent audio should not exceed 60s.
    enableServerVad = True
    is_running = True  # Flag used to stop the audio-sending thread when the session finishes.
    
    headers = [
        "Authorization: Bearer " + API_KEY,
        "OpenAI-Beta: realtime=v1"
    ]
    
    def init_logger():
        formatter = logging.Formatter('%(asctime)s|%(levelname)s|%(message)s')
        f_handler = logging.handlers.RotatingFileHandler(
            "omni_tester.log", maxBytes=100 * 1024 * 1024, backupCount=3
        )
        f_handler.setLevel(logging.DEBUG)
        f_handler.setFormatter(formatter)
    
        console = logging.StreamHandler()
        console.setLevel(logging.DEBUG)
        console.setFormatter(formatter)
    
        logger.addHandler(f_handler)
        logger.addHandler(console)
    
    def on_open(ws):
        logger.info("Connected to server.")
    
        # Session update event.
        event_manual = {
            "event_id": "event_123",
            "type": "session.update",
            "session": {
                "modalities": ["text"],
                "input_audio_format": "pcm",
                "sample_rate": 16000,
                "input_audio_transcription": {
                    # Language identifier, optional. If you have clear language information, set it.
                    "language": "zh"
                    # Corpus, optional. If you have a corpus, set it to enhance recognition.
                    # "corpus": {
                    #     "text": ""
                    # }
                },
                "turn_detection": None
            }
        }
        event_vad = {
            "event_id": "event_123",
            "type": "session.update",
            "session": {
                "modalities": ["text"],
                "input_audio_format": "pcm",
                "sample_rate": 16000,
                "input_audio_transcription": {
                    "language": "zh"
                },
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.0,
                    "silence_duration_ms": 400
                }
            }
        }
        if enableServerVad:
            logger.info(f"Sending event: {json.dumps(event_vad, indent=2)}")
            ws.send(json.dumps(event_vad))
        else:
            logger.info(f"Sending event: {json.dumps(event_manual, indent=2)}")
            ws.send(json.dumps(event_manual))
    
    def on_message(ws, message):
        global is_running
        try:
            data = json.loads(message)
            logger.info(f"Received event: {json.dumps(data, ensure_ascii=False, indent=2)}")
            if data.get("type") == "session.finished":
                logger.info(f"Final transcript: {data.get('transcript')}")
                logger.info("Closing WebSocket connection after session finished...")
                is_running = False  # Stop the audio sending thread.
                ws.close()
        except json.JSONDecodeError:
            logger.error(f"Failed to parse message: {message}")
    
    def on_error(ws, error):
        logger.error(f"Error: {error}")
    
    def on_close(ws, close_status_code, close_msg):
        logger.info(f"Connection closed: {close_status_code} - {close_msg}")
    
    def send_audio(ws, local_audio_path):
        time.sleep(3)  # Wait for the session update to complete.
        global is_running
    
        with open(local_audio_path, 'rb') as audio_file:
            logger.info(f"Start reading the file: {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
            while is_running:
                audio_data = audio_file.read(3200)  # ~0.1s PCM16/16kHz
                if not audio_data:
                    logger.info(f"Finished reading the file: {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
                    if ws.sock and ws.sock.connected:
                        if not enableServerVad:
                            commit_event = {
                                "event_id": "event_789",
                                "type": "input_audio_buffer.commit"
                            }
                            ws.send(json.dumps(commit_event))
                        finish_event = {
                            "event_id": "event_987",
                            "type": "session.finish"
                        }
                        ws.send(json.dumps(finish_event))
                    break
    
                if not ws.sock or not ws.sock.connected:
                    logger.info("The WebSocket is closed. Stop sending audio.")
                    break
    
                encoded_data = base64.b64encode(audio_data).decode('utf-8')
                eventd = {
                    "event_id": f"event_{int(time.time() * 1000)}",
                    "type": "input_audio_buffer.append",
                    "audio": encoded_data
                }
                ws.send(json.dumps(eventd))
                logger.info(f"Sending audio event: {eventd['event_id']}")
                time.sleep(0.1)  # Simulate real-time collection.
    
    # Initialize the logger.
    init_logger()
    logger.info(f"Connecting to WebSocket server at {url}...")
    
    local_audio_path = "your_audio_file.pcm"
    ws = websocket.WebSocketApp(
        url,
        header=headers,
        on_open=on_open,
        on_message=on_message,
        on_error=on_error,
        on_close=on_close
    )
    
    thread = threading.Thread(target=send_audio, args=(ws, local_audio_path))
    thread.start()
    ws.run_forever()
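
    The on_message handler above only prints the final transcript from the session.finished event. To also surface intermediate results, you could extend it along the following lines. The event names are taken from the DashScope SDK samples earlier on this page; whether the raw WebSocket API emits exactly these event types and field names is an assumption, so verify against the API reference.

    def on_message(ws, message):
        global is_running
        try:
            data = json.loads(message)
            event_type = data.get("type")
            if event_type == "conversation.item.input_audio_transcription.text":
                # Intermediate result; the SDK samples above read "text" (Java) or "stash" (Python).
                logger.info(f"Partial result: {data.get('text') or data.get('stash')}")
            elif event_type == "conversation.item.input_audio_transcription.completed":
                # Transcription of a completed speech segment.
                logger.info(f"Segment transcript: {data.get('transcript')}")
            elif event_type == "session.finished":
                logger.info(f"Final transcript: {data.get('transcript')}")
                is_running = False  # Stop the audio-sending thread.
                ws.close()
        except json.JSONDecodeError:
            logger.error(f"Failed to parse message: {message}")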

    Java

    Before you run the example, install the Java-WebSocket dependency:

    Maven

    <dependency>
        <groupId>org.java-websocket</groupId>
        <artifactId>Java-WebSocket</artifactId>
        <version>1.5.6</version>
    </dependency>

    Gradle

    implementation 'org.java-websocket:Java-WebSocket:1.5.6'

    import org.java_websocket.client.WebSocketClient;
    import org.java_websocket.handshake.ServerHandshake;
    import org.json.JSONObject;
    
    import java.net.URI;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Base64;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.logging.*;
    
    public class QwenASRRealtimeClient {
    
        private static final Logger logger = Logger.getLogger(QwenASRRealtimeClient.class.getName());
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
        // If you have not configured an environment variable, replace the following line with your Model Studio API key: private static final String API_KEY = "sk-xxx"
        private static final String API_KEY = System.getenv().getOrDefault("DASHSCOPE_API_KEY", "sk-xxx");
        private static final String MODEL = "qwen3-asr-flash-realtime";
    
        // Controls whether to use VAD mode.
        private static final boolean enableServerVad = true;
    
        private static final AtomicBoolean isRunning = new AtomicBoolean(true);
        private static WebSocketClient client;
    
        public static void main(String[] args) throws Exception {
            initLogger();
    
            // The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the base URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
            String baseUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime";
            String url = baseUrl + "?model=" + MODEL;
            logger.info("Connecting to server: " + url);
    
            client = new WebSocketClient(new URI(url)) {
                @Override
                public void onOpen(ServerHandshake handshake) {
                    logger.info("Connected to server.");
                    sendSessionUpdate();
                }
    
                @Override
                public void onMessage(String message) {
                    try {
                        JSONObject data = new JSONObject(message);
                        String eventType = data.optString("type");
    
                        logger.info("Received event: " + data.toString(2));
    
                        // When the finish event is received, stop the sending thread and close the connection.
                        if ("session.finished".equals(eventType)) {
                            logger.info("Final transcript: " + data.optString("transcript"));
                            logger.info("Closing WebSocket connection after session finished...");
    
                            isRunning.set(false); // Stop the audio sending thread.
                            if (this.isOpen()) {
                                this.close(1000, "ASR finished");
                            }
                        }
                    } catch (Exception e) {
                        logger.severe("Failed to parse message: " + message);
                    }
                }
    
                @Override
                public void onClose(int code, String reason, boolean remote) {
                    logger.info("Connection closed: " + code + " - " + reason);
                }
    
                @Override
                public void onError(Exception ex) {
                    logger.severe("Error: " + ex.getMessage());
                }
            };
    
            // Add request headers.
            client.addHeader("Authorization", "Bearer " + API_KEY);
            client.addHeader("OpenAI-Beta", "realtime=v1");
    
            client.connectBlocking(); // Block until the connection is established.
    
            // Replace with the path of the audio file to recognize.
            String localAudioPath = "your_audio_file.pcm";
            Thread audioThread = new Thread(() -> {
                try {
                    sendAudio(localAudioPath);
                } catch (Exception e) {
                    logger.severe("Audio sending thread error: " + e.getMessage());
                }
            });
            audioThread.start();
        }
    
        /** Session update event (enable/disable VAD). */
        private static void sendSessionUpdate() {
            JSONObject eventNoVad = new JSONObject()
                    .put("event_id", "event_123")
                    .put("type", "session.update")
                    .put("session", new JSONObject()
                            .put("modalities", new String[]{"text"})
                            .put("input_audio_format", "pcm")
                            .put("sample_rate", 16000)
                            .put("input_audio_transcription", new JSONObject()
                                    .put("language", "zh"))
                            .put("turn_detection", JSONObject.NULL) // Manual mode.
                    );
    
            JSONObject eventVad = new JSONObject()
                    .put("event_id", "event_123")
                    .put("type", "session.update")
                    .put("session", new JSONObject()
                            .put("modalities", new String[]{"text"})
                            .put("input_audio_format", "pcm")
                            .put("sample_rate", 16000)
                            .put("input_audio_transcription", new JSONObject()
                                    .put("language", "zh"))
                            .put("turn_detection", new JSONObject()
                                    .put("type", "server_vad")
                                    .put("threshold", 0.0)
                                    .put("silence_duration_ms", 400))
                    );
    
            if (enableServerVad) {
                logger.info("Sending event (VAD):\n" + eventVad.toString(2));
                client.send(eventVad.toString());
            } else {
                logger.info("Sending event (Manual):\n" + eventNoVad.toString(2));
                client.send(eventNoVad.toString());
            }
        }
    
        /** Send the audio file stream. */
        private static void sendAudio(String localAudioPath) throws Exception {
            Thread.sleep(3000); // Wait for the session to be ready.
            byte[] allBytes = Files.readAllBytes(Paths.get(localAudioPath));
            logger.info("Start reading the file.");
    
            int offset = 0;
            while (isRunning.get() && offset < allBytes.length) {
                int chunkSize = Math.min(3200, allBytes.length - offset);
                byte[] chunk = new byte[chunkSize];
                System.arraycopy(allBytes, offset, chunk, 0, chunkSize);
                offset += chunkSize;
    
                if (client != null && client.isOpen()) {
                    String encoded = Base64.getEncoder().encodeToString(chunk);
                    JSONObject eventd = new JSONObject()
                            .put("event_id", "event_" + System.currentTimeMillis())
                            .put("type", "input_audio_buffer.append")
                            .put("audio", encoded);
    
                    client.send(eventd.toString());
                    logger.info("Sending audio event: " + eventd.getString("event_id"));
                } else {
                    break; // Avoid sending after disconnection.
                }
    
                Thread.sleep(100); // Simulate real-time sending.
            }
    
            logger.info("Finished reading the file.");
    
            if (client != null && client.isOpen()) {
                // A commit is required in non-VAD mode.
                if (!enableServerVad) {
                    JSONObject commitEvent = new JSONObject()
                            .put("event_id", "event_789")
                            .put("type", "input_audio_buffer.commit");
                    client.send(commitEvent.toString());
                    logger.info("Sent commit event for manual mode.");
                }
    
                JSONObject finishEvent = new JSONObject()
                        .put("event_id", "event_987")
                        .put("type", "session.finish");
                client.send(finishEvent.toString());
                logger.info("Sent finish event.");
            }
        }
    
        /** Initialize the logger. */
        private static void initLogger() {
            logger.setLevel(Level.ALL);
            Logger rootLogger = Logger.getLogger("");
            for (Handler h : rootLogger.getHandlers()) {
                rootLogger.removeHandler(h);
            }
    
            Handler consoleHandler = new ConsoleHandler();
            consoleHandler.setLevel(Level.ALL);
            consoleHandler.setFormatter(new SimpleFormatter());
            logger.addHandler(consoleHandler);
        }
    }

    Node.js

    Before you run the example, install the dependency by running the following command:

    npm install ws

    /**
     * Qwen-ASR Realtime WebSocket Client (Node.js version)
     * Features:
     * - Supports VAD and Manual modes.
     * - Sends session.update to start a session.
     * - Continuously sends input_audio_buffer.append audio chunks.
     * - Sends input_audio_buffer.commit in Manual mode.
     * - Sends a session.finish event.
     * - Closes the connection after receiving a session.finished event.
     */
    
    import WebSocket from 'ws';
    import fs from 'fs';
    
    // ===== Configuration =====
    // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
    // If you have not configured an environment variable, replace the following line with your Model Studio API key: const API_KEY = "sk-xxx"
    const API_KEY = process.env.DASHSCOPE_API_KEY || 'sk-xxx';
    const MODEL = 'qwen3-asr-flash-realtime';
    const enableServerVad = true; // true for VAD mode, false for Manual mode
    const localAudioPath = 'your_audio_file.pcm'; // Path to the PCM16, 16 kHz audio file
    
    // The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the base URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
    const baseUrl = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime';
    const url = `${baseUrl}?model=${MODEL}`;
    
    console.log(`Connecting to server: ${url}`);
    
    // ===== Status Control =====
    let isRunning = true;
    
    // ===== Establish Connection =====
    const ws = new WebSocket(url, {
        headers: {
            'Authorization': `Bearer ${API_KEY}`,
            'OpenAI-Beta': 'realtime=v1'
        }
    });
    
    // ===== Event Binding =====
    ws.on('open', () => {
        console.log('[WebSocket] Connected to server.');
        sendSessionUpdate();
        // Start the audio sending thread.
        sendAudio(localAudioPath);
    });
    
    ws.on('message', (message) => {
        try {
            const data = JSON.parse(message);
            console.log('[Received Event]:', JSON.stringify(data, null, 2));
    
            // Received finish event.
            if (data.type === 'session.finished') {
                console.log(`[Final Transcript] ${data.transcript}`);
                console.log('[Action] Closing WebSocket connection after session finished...');
                
                if (ws.readyState === WebSocket.OPEN) {
                    ws.close(1000, 'ASR finished');
                }
            }
        } catch (e) {
            console.error('[Error] Failed to parse message:', message);
        }
    });
    
    ws.on('close', (code, reason) => {
        console.log(`[WebSocket] Connection closed: ${code} - ${reason}`);
    });
    
    ws.on('error', (err) => {
        console.error('[WebSocket Error]', err);
    });
    
    // ===== Session Update =====
    function sendSessionUpdate() {
        const eventNoVad = {
            event_id: 'event_123',
            type: 'session.update',
            session: {
                modalities: ['text'],
                input_audio_format: 'pcm',
                sample_rate: 16000,
                input_audio_transcription: {
                    language: 'zh'
                },
                turn_detection: null
            }
        };
    
        const eventVad = {
            event_id: 'event_123',
            type: 'session.update',
            session: {
                modalities: ['text'],
                input_audio_format: 'pcm',
                sample_rate: 16000,
                input_audio_transcription: {
                    language: 'zh'
                },
                turn_detection: {
                    type: 'server_vad',
                    threshold: 0.0,
                    silence_duration_ms: 400
                }
            }
        };
    
        if (enableServerVad) {
            console.log('[Send Event] VAD Mode:\n', JSON.stringify(eventVad, null, 2));
            ws.send(JSON.stringify(eventVad));
        } else {
            console.log('[Send Event] Manual Mode:\n', JSON.stringify(eventNoVad, null, 2));
            ws.send(JSON.stringify(eventNoVad));
        }
    }
    
    // ===== Send Audio File Stream =====
    function sendAudio(audioPath) {
        setTimeout(() => {
            console.log(`[File Read Start] ${audioPath}`);
            const buffer = fs.readFileSync(audioPath);
    
            let offset = 0;
            const chunkSize = 3200; // Approx. 0.1s of PCM16 audio
    
            function sendChunk() {
                if (!isRunning) return;
                if (offset >= buffer.length) {
                    isRunning = false; // Stop sending audio.
                    console.log('[File Read End]');
                    if (ws.readyState === WebSocket.OPEN) {
                        if (!enableServerVad) {
                            const commitEvent = {
                                event_id: 'event_789',
                                type: 'input_audio_buffer.commit'
                            };
                            ws.send(JSON.stringify(commitEvent));
                            console.log('[Send Commit Event]');
                        }
    
                        const finishEvent = {
                            event_id: 'event_987',
                            type: 'session.finish'
                        };
                        ws.send(JSON.stringify(finishEvent));
                        console.log('[Send Finish Event]');
                    }
                    
                    return;
                }
    
                if (ws.readyState !== WebSocket.OPEN) {
                    console.log('[Stop] WebSocket is not open.');
                    return;
                }
    
                const chunk = buffer.slice(offset, offset + chunkSize);
                offset += chunkSize;
    
                const encoded = chunk.toString('base64');
                const appendEvent = {
                    event_id: `event_${Date.now()}`,
                    type: 'input_audio_buffer.append',
                    audio: encoded
                };
    
                ws.send(JSON.stringify(appendEvent));
                console.log(`[Send Audio Event] ${appendEvent.event_id}`);
    
                setTimeout(sendChunk, 100); // Simulate real-time sending.
            }
    
            sendChunk();
        }, 3000); // Wait for session configuration to complete.
    }

Core usage: Context biasing

By providing context, you can optimize recognition of domain-specific vocabulary, such as names, places, and product terms, and significantly improve transcription accuracy. This approach is more flexible and powerful than traditional hotword solutions.

Length limit: The context content cannot exceed 10,000 tokens.

Usage:

  • WebSocket API: Set the session.input_audio_transcription.corpus.text parameter in the session.update event.

  • Python SDK: Set the corpus_text parameter (see the sketch after this list).

  • Java SDK: Set the corpusText parameter.
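
For example, with the DashScope Python SDK, the corpus is passed through TranscriptionParams before updating the session. The following is a minimal sketch based on the Python sample in Getting started; conversation is assumed to be the connected OmniRealtimeConversation from that sample, and the hotword content is only illustrative:

    from dashscope.audio.qwen_omni import *
    from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams

    transcription_params = TranscriptionParams(
        language='zh',
        sample_rate=16000,
        input_audio_format='pcm',
        # Context biasing: any matching text works, for example a plain hotword list.
        corpus_text='Bulge Bracket, Boutique, Middle Market, domestic securities firms'
    )

    # `conversation` is the connected OmniRealtimeConversation from the Getting started sample.
    conversation.update_session(
        output_modalities=[MultiModality.TEXT],
        enable_input_audio_transcription=True,
        transcription_params=transcription_params
    )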

Supported text types include, but are not limited to, the following:

  • Hotword lists that use various separators (for example, hotword 1, hotword 2, hotword 3, hotword 4)

  • Text paragraphs or chapters in any format

  • Mixed content: Any combination of word lists and paragraphs

  • Irrelevant or meaningless text (including garbled characters). The model has high fault tolerance and is unlikely to be negatively affected by irrelevant text.

Example:

In this example, the correct transcription of an audio segment is "What jargon from the investment banking industry do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB...".

Without context biasing

Without context biasing, the model incorrectly recognizes some investment bank names. For example, "Bulge Bracket" is recognized as "Bird Rock".

Recognition result: "What jargon from the investment banking industry do you know? First, the nine major foreign investment banks, the Bird Rock, BB..."

Using context biasing

With context biasing, the model correctly recognizes the investment bank names.

Recognition result: "What jargon from the investment banking industry do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB..."

To achieve this result, you can add any of the following content to the context:

  • Word lists:

    • Word list 1:

      Bulge Bracket, Boutique, Middle Market, domestic securities firms
    • Word list 2:

      Bulge Bracket Boutique Middle Market domestic securities firms
    • Word list 3:

      ['Bulge Bracket', 'Boutique', 'Middle Market', 'domestic securities firms']
  • Natural language:

    The secrets of investment banking categories revealed!
    Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
    Bulge Bracket investment banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large banks are enormous in both business scope and scale.
    Boutique investment banks: These banks are relatively small but highly specialized in their business areas. For example, Lazard, Evercore, etc., have deep expertise and experience in specific fields.
    Middle Market investment banks: This type of bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major banks, they have a high influence in specific markets.
    Domestic securities firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market.
    In addition, for specific positions and business divisions, you can refer to the relevant charts. I hope this information helps you better understand investment banking and prepare for your future career!
  • Natural language with irrelevant text: The context can contain irrelevant text. The following example includes irrelevant names.

    The secrets of investment banking categories revealed!
    Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
    Bulge Bracket investment banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large banks are enormous in both business scope and scale.
    Boutique investment banks: These banks are relatively small but highly specialized in their business areas. For example, Lazard, Evercore, etc., have deep expertise and experience in specific fields.
    Middle Market investment banks: This type of bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major banks, they have a high influence in specific markets.
    Domestic securities firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market.
    In addition, for specific positions and business divisions, you can refer to the relevant charts. I hope this information helps you better understand investment banking and prepare for your future career!
    Wang Haoxuan, Li Zihan, Zhang Jingxing, Liu Xinyi, Chen Junjie, Yang Siyuan, Zhao Yutong, Huang Zhiqiang, Zhou Zimo, Wu Yajing, Xu Ruoxi, Sun Haoran, Hu Jinyu, Zhu Chenxi, Guo Wenbo, He Jingshu, Gao Yuhang, Lin Yifei, 
    Zheng Xiaoyan, Liang Bowen, Luo Jiaqi, Song Mingzhe, Xie Wanting, Tang Ziqian, Han Mengyao, Feng Yiran, Cao Qinxue, Deng Zirui, Xiao Wangshu, Xu Jiashu, 
    Cheng Yinuo, Yuan Zhiruo, Peng Haoyu, Dong Simiao, Fan Jingyu, Su Zijin, Lv Wenxuan, Jiang Shihan, Ding Muchen, 
    Wei Shuyao, Ren Tianyou, Jiang Yichen, Hua Qingyu, Shen Xinghe, Fu Jinyu, Yao Xingchen, Zhong Lingyu, Yan Licheng, Jin Ruoshui, Taoranting, Qi Shaoshang, Xue Zhilan, Zou Yunfan, Xiong Ziang, Bai Wenfeng, Yi Qianfan
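
To apply a context like the ones above over the WebSocket API, place it in the session.update event under session.input_audio_transcription.corpus.text (the field shown commented out in the WebSocket samples earlier). The following sketch builds the event payload in Python; the word list is only illustrative:

    import json

    corpus_text = "Bulge Bracket, Boutique, Middle Market, domestic securities firms"
    session_update = {
        "event_id": "event_123",
        "type": "session.update",
        "session": {
            "modalities": ["text"],
            "input_audio_format": "pcm",
            "sample_rate": 16000,
            "input_audio_transcription": {
                "language": "zh",
                # Context biasing: any of the word lists or paragraphs above can go here.
                "corpus": {"text": corpus_text}
            },
            "turn_detection": {"type": "server_vad", "threshold": 0.0, "silence_duration_ms": 400}
        }
    }
    # ws is the connected WebSocket client from the WebSocket API samples.
    ws.send(json.dumps(session_update))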

API reference

Real-time speech recognition - Qwen API reference

Model features

The following features apply to qwen3-asr-flash-realtime and qwen3-asr-flash-realtime-2025-10-27:

  • Supported languages: Chinese (Mandarin, Sichuanese, Minnan, Wu, and Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese, Czech, Danish, Filipino, Finnish, Icelandic, Malay, Norwegian, Polish, and Swedish

  • Supported audio formats: pcm, opus

  • Sample rate: 8 kHz, 16 kHz

  • Channels: Mono

  • Input format: Binary audio stream

  • Audio size/duration: Unlimited

  • Emotion recognition: Supported (always on)

  • Sensitive word filtering: Not supported

  • Speaker diarization: Not supported

  • Filler word filtering: Not supported

  • Timestamps: Not supported

  • Punctuation prediction: Supported (always on)

  • Context biasing (more powerful than hotwords): Supported (configurable)

  • Inverse Text Normalization (ITN): Not supported

  • Voice Activity Detection (VAD): Supported (always on)

  • Rate limit: 20 RPS

  • Connection types: Java/Python SDK, WebSocket API

  • Pricing: International: $0.00009 per second; Mainland China: $0.000047 per second
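
As a rough illustration of the per-second prices above, one hour of audio (3,600 seconds) costs about $0.324 in the international deployment and about $0.169 in Mainland China. A minimal sketch for estimating cost from audio duration:

    # Per-second prices listed above (USD).
    PRICE_PER_SECOND = {
        "international": 0.00009,    # Singapore region
        "mainland_china": 0.000047,  # Beijing region
    }

    def estimate_cost(duration_seconds, deployment="international"):
        """Return the estimated transcription cost in USD for the given audio duration."""
        return duration_seconds * PRICE_PER_SECOND[deployment]

    print(estimate_cost(3600))                    # 1 hour, international: ~0.324
    print(estimate_cost(3600, "mainland_china"))  # 1 hour, Mainland China: ~0.1692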