Convert audio streams to text in real time over WebSocket. The service supports multilingual recognition, emotion detection, and voice activity detection (VAD).
Supported regions and endpoints
| | International | Mainland China |
|---|---|---|
| Data storage | Singapore | Beijing |
| Inference computing | Dynamically scheduled globally, excluding Mainland China | Limited to Mainland China |
| WebSocket endpoint | wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime | wss://dashscope.aliyuncs.com/api-ws/v1/realtime |
| API key console | modelstudio.console.alibabacloud.com | bailian.console.alibabacloud.com |
| Pricing | $0.00009/second | $0.000047/second |
For more information about deployment modes, see Compare deployment modes.
Session workflow
A typical session follows this flow:
1. Connect: Establish a WebSocket connection to the service endpoint with your API key.
2. Configure: Send a session.update event to set the audio format, language, and turn detection mode.
3. Stream audio: Send audio chunks as Base64-encoded data through input_audio_buffer.append events.
4. Receive results: The server returns intermediate transcription through conversation.item.input_audio_transcription.text events and final transcription through conversation.item.input_audio_transcription.completed events.
5. Finish: Send a session.finish event to end the session. The server responds with a session.finished event.
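The client-to-server events in this flow can be sketched as plain JSON payloads sent as WebSocket text frames. The field values below are illustrative, not defaults:

```python
import json

# Illustrative client-to-server payloads for one session.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text"],
        "input_audio_format": "pcm",
        "sample_rate": 16000,
        "input_audio_transcription": {"language": "zh"},
        "turn_detection": {"type": "server_vad"},
    },
}
audio_append = {
    "type": "input_audio_buffer.append",
    "audio": "<Base64-encoded PCM chunk>",  # placeholder, not real audio
}
session_finish = {"type": "session.finish"}

# Each event is serialized to one JSON text frame before sending.
frames = [json.dumps(e) for e in (session_update, audio_append, session_finish)]
```

The complete SDK and raw-WebSocket examples later in this topic send exactly these event types in this order.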
Turn detection modes
| Mode | Configuration | Behavior |
|---|---|---|
| Server VAD | Set turn_detection.type to server_vad | The server automatically detects speech boundaries. |
| Manual | Set turn_detection to null | Control turn boundaries with input_audio_buffer.commit events. Continuous audio must not exceed 60 seconds. |
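The two modes differ only in the turn_detection field of the session.update payload. A minimal sketch; the threshold and silence values are illustrative, taken from the examples later in this topic:

```python
# Server VAD mode: the service segments speech automatically.
server_vad_session = {
    "turn_detection": {
        "type": "server_vad",
        "threshold": 0.0,            # VAD sensitivity (illustrative value)
        "silence_duration_ms": 400,  # silence that ends a turn (illustrative value)
    }
}

# Manual mode: disable VAD and commit the audio buffer yourself.
manual_session = {"turn_detection": None}

# In manual mode the client must send input_audio_buffer.commit,
# and continuously sent audio must stay under 60 seconds.
commit_event = {"type": "input_audio_buffer.commit"}
```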
Supported models
All deployment regions support the same model family: Qwen3-ASR-Flash-Realtime.
| Version type | Model ID | Description |
|---|---|---|
| Stable | qwen3-asr-flash-realtime | Points to qwen3-asr-flash-realtime-2025-10-27 |
| Latest snapshot | qwen3-asr-flash-realtime-2026-02-10 | Most recent snapshot |
| Snapshot | qwen3-asr-flash-realtime-2025-10-27 | Point-in-time snapshot |
The stable version alias points to a tested snapshot; use it for production. Snapshot versions are fixed releases for pinning a specific model revision.
For the complete model catalog, see Model list.
Choose a model for your scenario
| Scenario | Recommended model | Reason |
|---|---|---|
| Customer service quality inspection | qwen3-asr-flash-realtime-2026-02-10 | Provides real-time call analysis with emotion detection for quality monitoring. |
| Live streaming and short videos | qwen3-asr-flash-realtime-2026-02-10 | Generates real-time multilingual captions for live content. |
| Online meetings and interviews | qwen3-asr-flash-realtime-2026-02-10 | Provides real-time meeting transcription for text summaries. |
Before you begin
- Install the SDK or dependencies for your chosen integration method. See the version requirements in each section below.
- Obtain an API key. See Create an API key (the Singapore and Beijing regions use different keys).
- Set the API key as an environment variable: export DASHSCOPE_API_KEY="sk-xxx"
- Prepare a test audio file (PCM format, 16 kHz, mono) named your_audio_file.pcm in your working directory.
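If you do not have a recording at hand, you can generate a placeholder PCM file in the expected format (raw 16-bit little-endian samples, 16 kHz, mono, no header) with the standard library. The service will return an empty transcript for silence, but the file lets you verify connectivity and the event flow end to end:

```python
import struct

SAMPLE_RATE = 16000  # 16 kHz, as required by the examples below
DURATION_S = 2       # two seconds of audio
samples = [0] * (SAMPLE_RATE * DURATION_S)  # silence

# PCM16 is raw little-endian signed 16-bit samples with no container header.
with open("your_audio_file.pcm", "wb") as f:
    f.write(struct.pack(f"<{len(samples)}h", *samples))
```

Replace the file with a real recording to see actual transcription results.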
Get started with the DashScope SDK
Java
Install the SDK. Ensure that the DashScope SDK version is 2.22.5 or later.
import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.sound.sampled.LineUnavailableException;
import java.io.File;
import java.io.FileInputStream;
import java.util.Base64;
import java.util.Collections;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
public class Qwen3AsrRealtimeUsage {
private static final Logger log = LoggerFactory.getLogger(Qwen3AsrRealtimeUsage.class);
private static final int AUDIO_CHUNK_SIZE = 1024; // Audio chunk size in bytes
private static final int SLEEP_INTERVAL_MS = 30; // Sleep interval in milliseconds
public static void main(String[] args) throws InterruptedException, LineUnavailableException {
CountDownLatch finishLatch = new CountDownLatch(1);
OmniRealtimeParam param = OmniRealtimeParam.builder()
.model("qwen3-asr-flash-realtime")
// The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
.url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apikey("sk-xxx")
.apikey(System.getenv("DASHSCOPE_API_KEY"))
.build();
OmniRealtimeConversation conversation = null;
final AtomicReference<OmniRealtimeConversation> conversationRef = new AtomicReference<>(null);
conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
@Override
public void onOpen() {
System.out.println("connection opened");
}
@Override
public void onEvent(JsonObject message) {
String type = message.get("type").getAsString();
switch(type) {
case "session.created":
System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
break;
case "conversation.item.input_audio_transcription.completed":
System.out.println("transcription: " + message.get("transcript").getAsString());
finishLatch.countDown();
break;
case "input_audio_buffer.speech_started":
System.out.println("======VAD Speech Start======");
break;
case "input_audio_buffer.speech_stopped":
System.out.println("======VAD Speech Stop======");
break;
case "conversation.item.input_audio_transcription.text":
System.out.println("transcription: " + message.get("text").getAsString());
break;
default:
break;
}
}
@Override
public void onClose(int code, String reason) {
System.out.println("connection closed code: " + code + ", reason: " + reason);
}
});
conversationRef.set(conversation);
try {
conversation.connect();
} catch (NoApiKeyException e) {
throw new RuntimeException(e);
}
OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputAudioFormat("pcm");
transcriptionParam.setInputSampleRate(16000);
OmniRealtimeConfig config = OmniRealtimeConfig.builder()
.modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
.transcriptionConfig(transcriptionParam)
.build();
conversation.updateSession(config);
String filePath = "your_audio_file.pcm";
File audioFile = new File(filePath);
if (!audioFile.exists()) {
log.error("Audio file not found: {}", filePath);
return;
}
try (FileInputStream audioInputStream = new FileInputStream(audioFile)) {
byte[] audioBuffer = new byte[AUDIO_CHUNK_SIZE];
int bytesRead;
int totalBytesRead = 0;
log.info("Starting to send audio data from: {}", filePath);
// Read and send audio data in chunks
while ((bytesRead = audioInputStream.read(audioBuffer)) != -1) {
totalBytesRead += bytesRead;
// Encode only the bytes actually read so the final partial chunk is not padded with stale data.
String audioB64 = Base64.getEncoder().encodeToString(java.util.Arrays.copyOf(audioBuffer, bytesRead));
// Send audio chunk to conversation
conversation.appendAudio(audioB64);
// Add small delay to simulate real-time audio streaming
Thread.sleep(SLEEP_INTERVAL_MS);
}
log.info("Finished sending audio data. Total bytes sent: {}", totalBytesRead);
} catch (Exception e) {
log.error("Error sending audio from file: {}", filePath, e);
}
// Send session.finish, wait for the final transcription, and then close the connection.
conversation.endSession();
finishLatch.await();
log.info("Task finished");
System.exit(0);
}
}
Expected output:
connection opened
start session: <session-id>
======VAD Speech Start======
transcription: <intermediate text>
======VAD Speech Stop======
transcription: <final transcribed text>
connection closed code: 1000, reason: ...
Python
Install the SDK. Ensure that the DashScope SDK version is 1.25.6 or later.
import logging
import os
import base64
import signal
import sys
import time
import dashscope
from dashscope.audio.qwen_omni import *
from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams
def setup_logging():
"""Configure logging."""
logger = logging.getLogger('dashscope')
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.propagate = False
return logger
def init_api_key():
"""Initialize the API key."""
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY', 'YOUR_API_KEY')
if dashscope.api_key == 'YOUR_API_KEY':
print('[Warning] Using placeholder API key, set DASHSCOPE_API_KEY environment variable.')
class MyCallback(OmniRealtimeCallback):
"""Handle real-time recognition callbacks."""
def __init__(self, conversation):
self.conversation = conversation
self.handlers = {
'session.created': self._handle_session_created,
'conversation.item.input_audio_transcription.completed': self._handle_final_text,
'conversation.item.input_audio_transcription.text': self._handle_stash_text,
'input_audio_buffer.speech_started': lambda r: print('======Speech Start======'),
'input_audio_buffer.speech_stopped': lambda r: print('======Speech Stop======')
}
def on_open(self):
print('Connection opened')
def on_close(self, code, msg):
print(f'Connection closed, code: {code}, msg: {msg}')
def on_event(self, response):
try:
handler = self.handlers.get(response['type'])
if handler:
handler(response)
except Exception as e:
print(f'[Error] {e}')
def _handle_session_created(self, response):
print(f"Start session: {response['session']['id']}")
def _handle_final_text(self, response):
print(f"Final recognized text: {response['transcript']}")
def _handle_stash_text(self, response):
print(f"Got stash result: {response['stash']}")
def read_audio_chunks(file_path, chunk_size=3200):
"""Read the audio file in chunks."""
with open(file_path, 'rb') as f:
while chunk := f.read(chunk_size):
yield chunk
def send_audio(conversation, file_path, delay=0.1):
"""Send audio data."""
if not os.path.exists(file_path):
raise FileNotFoundError(f"Audio file {file_path} does not exist.")
print("Processing audio file... Press 'Ctrl+C' to stop.")
for chunk in read_audio_chunks(file_path):
audio_b64 = base64.b64encode(chunk).decode('ascii')
conversation.append_audio(audio_b64)
time.sleep(delay)
def main():
setup_logging()
init_api_key()
audio_file_path = "./your_audio_file.pcm"
conversation = OmniRealtimeConversation(
model='qwen3-asr-flash-realtime',
# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime',
callback=MyCallback(conversation=None) # Temporarily pass None and inject it later.
)
# Inject self into the callback.
conversation.callback.conversation = conversation
def handle_exit(sig, frame):
print('Ctrl+C pressed, exiting...')
conversation.close()
sys.exit(0)
signal.signal(signal.SIGINT, handle_exit)
conversation.connect()
transcription_params = TranscriptionParams(
language='zh',
sample_rate=16000,
input_audio_format="pcm"
)
conversation.update_session(
output_modalities=[MultiModality.TEXT],
enable_input_audio_transcription=True,
transcription_params=transcription_params
)
try:
send_audio(conversation, audio_file_path)
# Send session.finish, wait for the session to finish, and then close the connection.
conversation.end_session()
except Exception as e:
print(f"Error occurred: {e}")
finally:
conversation.close()
print("Audio processing completed.")
if __name__ == '__main__':
main()
Connect with the WebSocket API
The following examples send a local audio file over a raw WebSocket connection and retrieve recognition results. For more information about the protocol, see Interaction flow.
Python
Install the required dependency:
pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client
Do not name the sample code file websocket.py. Otherwise, the following error may occur: AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?
# pip install websocket-client
import os
import time
import json
import threading
import base64
import websocket
import logging
import logging.handlers
from datetime import datetime
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: API_KEY="sk-xxx"
API_KEY = os.environ.get("DASHSCOPE_API_KEY", "sk-xxx")
QWEN_MODEL = "qwen3-asr-flash-realtime"
# The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the base URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
baseUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
url = f"{baseUrl}?model={QWEN_MODEL}"
print(f"Connecting to server: {url}")
# Note: If you are not in VAD mode, the cumulative duration of continuously sent audio should not exceed 60 seconds.
enableServerVad = True
is_running = True # Add a running flag.
headers = [
"Authorization: Bearer " + API_KEY,
"OpenAI-Beta: realtime=v1"
]
def init_logger():
formatter = logging.Formatter('%(asctime)s|%(levelname)s|%(message)s')
f_handler = logging.handlers.RotatingFileHandler(
"omni_tester.log", maxBytes=100 * 1024 * 1024, backupCount=3
)
f_handler.setLevel(logging.DEBUG)
f_handler.setFormatter(formatter)
console = logging.StreamHandler()
console.setLevel(logging.DEBUG)
console.setFormatter(formatter)
logger.addHandler(f_handler)
logger.addHandler(console)
def on_open(ws):
logger.info("Connected to server.")
# Session update event.
event_manual = {
"event_id": "event_123",
"type": "session.update",
"session": {
"modalities": ["text"],
"input_audio_format": "pcm",
"sample_rate": 16000,
"input_audio_transcription": {
# Language identifier, optional. If you have clear language information, set it.
"language": "zh"
},
"turn_detection": None
}
}
event_vad = {
"event_id": "event_123",
"type": "session.update",
"session": {
"modalities": ["text"],
"input_audio_format": "pcm",
"sample_rate": 16000,
"input_audio_transcription": {
"language": "zh"
},
"turn_detection": {
"type": "server_vad",
"threshold": 0.0,
"silence_duration_ms": 400
}
}
}
if enableServerVad:
logger.info(f"Sending event: {json.dumps(event_vad, indent=2)}")
ws.send(json.dumps(event_vad))
else:
logger.info(f"Sending event: {json.dumps(event_manual, indent=2)}")
ws.send(json.dumps(event_manual))
def on_message(ws, message):
global is_running
try:
data = json.loads(message)
logger.info(f"Received event: {json.dumps(data, ensure_ascii=False, indent=2)}")
if data.get("type") == "session.finished":
logger.info(f"Final transcript: {data.get('transcript')}")
logger.info("Closing WebSocket connection after session finished...")
is_running = False # Stop the audio sending thread.
ws.close()
except json.JSONDecodeError:
logger.error(f"Failed to parse message: {message}")
def on_error(ws, error):
logger.error(f"Error: {error}")
def on_close(ws, close_status_code, close_msg):
logger.info(f"Connection closed: {close_status_code} - {close_msg}")
def send_audio(ws, local_audio_path):
time.sleep(3) # Wait for the session update to complete.
global is_running
with open(local_audio_path, 'rb') as audio_file:
logger.info(f"Start reading the file: {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
while is_running:
audio_data = audio_file.read(3200) # ~0.1 second of PCM16/16 kHz audio.
if not audio_data:
logger.info(f"Finished reading the file: {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
if ws.sock and ws.sock.connected:
if not enableServerVad:
commit_event = {
"event_id": "event_789",
"type": "input_audio_buffer.commit"
}
ws.send(json.dumps(commit_event))
finish_event = {
"event_id": "event_987",
"type": "session.finish"
}
ws.send(json.dumps(finish_event))
break
if not ws.sock or not ws.sock.connected:
logger.info("The WebSocket is closed. Stop sending audio.")
break
encoded_data = base64.b64encode(audio_data).decode('utf-8')
eventd = {
"event_id": f"event_{int(time.time() * 1000)}",
"type": "input_audio_buffer.append",
"audio": encoded_data
}
ws.send(json.dumps(eventd))
logger.info(f"Sending audio event: {eventd['event_id']}")
time.sleep(0.1) # Simulate real-time collection.
# Initialize the logger.
init_logger()
logger.info(f"Connecting to WebSocket server at {url}...")
local_audio_path = "your_audio_file.pcm"
ws = websocket.WebSocketApp(
url,
header=headers,
on_open=on_open,
on_message=on_message,
on_error=on_error,
on_close=on_close
)
thread = threading.Thread(target=send_audio, args=(ws, local_audio_path))
thread.start()
ws.run_forever()
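The 3200-byte chunk size used in the sending loop above corresponds to roughly 0.1 seconds of audio, which is why each send is followed by a 0.1-second sleep. The arithmetic, assuming 16-bit mono PCM at 16 kHz:

```python
SAMPLE_RATE = 16000    # samples per second
BYTES_PER_SAMPLE = 2   # PCM16 = 2 bytes per sample
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE  # 32000 bytes/s

chunk_bytes = 3200
chunk_seconds = chunk_bytes / BYTES_PER_SECOND  # 0.1 s per chunk
```

Pacing the sends this way simulates real-time capture; sending faster is possible but does not reflect a live-microphone workload.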
Java
Install the Java-WebSocket dependency.
Maven
<dependency>
<groupId>org.java-websocket</groupId>
<artifactId>Java-WebSocket</artifactId>
<version>1.5.6</version>
</dependency>
Gradle
implementation 'org.java-websocket:Java-WebSocket:1.5.6'
import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;
import org.json.JSONObject;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.*;
public class QwenASRRealtimeClient {
private static final Logger logger = Logger.getLogger(QwenASRRealtimeClient.class.getName());
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: private static final String API_KEY = "sk-xxx"
private static final String API_KEY = System.getenv().getOrDefault("DASHSCOPE_API_KEY", "sk-xxx");
private static final String MODEL = "qwen3-asr-flash-realtime";
// Controls whether to use VAD mode.
private static final boolean enableServerVad = true;
private static final AtomicBoolean isRunning = new AtomicBoolean(true);
private static WebSocketClient client;
public static void main(String[] args) throws Exception {
initLogger();
// The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the base URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
String baseUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime";
String url = baseUrl + "?model=" + MODEL;
logger.info("Connecting to server: " + url);
client = new WebSocketClient(new URI(url)) {
@Override
public void onOpen(ServerHandshake handshake) {
logger.info("Connected to server.");
sendSessionUpdate();
}
@Override
public void onMessage(String message) {
try {
JSONObject data = new JSONObject(message);
String eventType = data.optString("type");
logger.info("Received event: " + data.toString(2));
// When the finish event is received, stop the sending thread and close the connection.
if ("session.finished".equals(eventType)) {
logger.info("Final transcript: " + data.optString("transcript"));
logger.info("Closing WebSocket connection after session finished...");
isRunning.set(false); // Stop the audio sending thread.
if (this.isOpen()) {
this.close(1000, "ASR finished");
}
}
} catch (Exception e) {
logger.severe("Failed to parse message: " + message);
}
}
@Override
public void onClose(int code, String reason, boolean remote) {
logger.info("Connection closed: " + code + " - " + reason);
}
@Override
public void onError(Exception ex) {
logger.severe("Error: " + ex.getMessage());
}
};
// Add request headers.
client.addHeader("Authorization", "Bearer " + API_KEY);
client.addHeader("OpenAI-Beta", "realtime=v1");
client.connectBlocking(); // Block until the connection is established.
// Replace with the path of the audio file to recognize.
String localAudioPath = "your_audio_file.pcm";
Thread audioThread = new Thread(() -> {
try {
sendAudio(localAudioPath);
} catch (Exception e) {
logger.severe("Audio sending thread error: " + e.getMessage());
}
});
audioThread.start();
}
/** Session update event (enable/disable VAD). */
private static void sendSessionUpdate() {
JSONObject eventNoVad = new JSONObject()
.put("event_id", "event_123")
.put("type", "session.update")
.put("session", new JSONObject()
.put("modalities", new String[]{"text"})
.put("input_audio_format", "pcm")
.put("sample_rate", 16000)
.put("input_audio_transcription", new JSONObject()
.put("language", "zh"))
.put("turn_detection", JSONObject.NULL) // Manual mode.
);
JSONObject eventVad = new JSONObject()
.put("event_id", "event_123")
.put("type", "session.update")
.put("session", new JSONObject()
.put("modalities", new String[]{"text"})
.put("input_audio_format", "pcm")
.put("sample_rate", 16000)
.put("input_audio_transcription", new JSONObject()
.put("language", "zh"))
.put("turn_detection", new JSONObject()
.put("type", "server_vad")
.put("threshold", 0.0)
.put("silence_duration_ms", 400))
);
if (enableServerVad) {
logger.info("Sending event (VAD):\n" + eventVad.toString(2));
client.send(eventVad.toString());
} else {
logger.info("Sending event (Manual):\n" + eventNoVad.toString(2));
client.send(eventNoVad.toString());
}
}
/** Send the audio file stream. */
private static void sendAudio(String localAudioPath) throws Exception {
Thread.sleep(3000); // Wait for the session to be ready.
byte[] allBytes = Files.readAllBytes(Paths.get(localAudioPath));
logger.info("Start reading the file.");
int offset = 0;
while (isRunning.get() && offset < allBytes.length) {
int chunkSize = Math.min(3200, allBytes.length - offset);
byte[] chunk = new byte[chunkSize];
System.arraycopy(allBytes, offset, chunk, 0, chunkSize);
offset += chunkSize;
if (client != null && client.isOpen()) {
String encoded = Base64.getEncoder().encodeToString(chunk);
JSONObject eventd = new JSONObject()
.put("event_id", "event_" + System.currentTimeMillis())
.put("type", "input_audio_buffer.append")
.put("audio", encoded);
client.send(eventd.toString());
logger.info("Sending audio event: " + eventd.getString("event_id"));
} else {
break; // Avoid sending after disconnection.
}
Thread.sleep(100); // Simulate real-time sending.
}
logger.info("Finished reading the file.");
if (client != null && client.isOpen()) {
// A commit is required in non-VAD mode.
if (!enableServerVad) {
JSONObject commitEvent = new JSONObject()
.put("event_id", "event_789")
.put("type", "input_audio_buffer.commit");
client.send(commitEvent.toString());
logger.info("Sent commit event for manual mode.");
}
JSONObject finishEvent = new JSONObject()
.put("event_id", "event_987")
.put("type", "session.finish");
client.send(finishEvent.toString());
logger.info("Sent finish event.");
}
}
/** Initialize the logger. */
private static void initLogger() {
logger.setLevel(Level.ALL);
Logger rootLogger = Logger.getLogger("");
for (Handler h : rootLogger.getHandlers()) {
rootLogger.removeHandler(h);
}
Handler consoleHandler = new ConsoleHandler();
consoleHandler.setLevel(Level.ALL);
consoleHandler.setFormatter(new SimpleFormatter());
logger.addHandler(consoleHandler);
}
}
Node.js
Install the required dependency:
npm install ws
/**
* Qwen-ASR Realtime WebSocket Client (Node.js version)
* Features:
* - Supports VAD and Manual modes.
* - Sends session.update to start a session.
* - Continuously sends input_audio_buffer.append audio chunks.
* - Sends input_audio_buffer.commit in Manual mode.
* - Sends a session.finish event.
* - Closes the connection after receiving a session.finished event.
*/
import WebSocket from 'ws';
import fs from 'fs';
// ===== Configuration =====
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: const API_KEY = "sk-xxx"
const API_KEY = process.env.DASHSCOPE_API_KEY || 'sk-xxx';
const MODEL = 'qwen3-asr-flash-realtime';
const enableServerVad = true; // true for VAD mode, false for Manual mode
const localAudioPath = 'your_audio_file.pcm'; // Path to the PCM16, 16 kHz audio file
// The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the base URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
const baseUrl = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime';
const url = `${baseUrl}?model=${MODEL}`;
console.log(`Connecting to server: ${url}`);
// ===== Status Control =====
let isRunning = true;
// ===== Establish Connection =====
const ws = new WebSocket(url, {
headers: {
'Authorization': `Bearer ${API_KEY}`,
'OpenAI-Beta': 'realtime=v1'
}
});
// ===== Event Binding =====
ws.on('open', () => {
console.log('[WebSocket] Connected to server.');
sendSessionUpdate();
// Start the audio sending thread.
sendAudio(localAudioPath);
});
ws.on('message', (message) => {
try {
const data = JSON.parse(message);
console.log('[Received Event]:', JSON.stringify(data, null, 2));
// Received finish event.
if (data.type === 'session.finished') {
console.log(`[Final Transcript] ${data.transcript}`);
console.log('[Action] Closing WebSocket connection after session finished...');
if (ws.readyState === WebSocket.OPEN) {
ws.close(1000, 'ASR finished');
}
}
} catch (e) {
console.error('[Error] Failed to parse message:', message);
}
});
ws.on('close', (code, reason) => {
console.log(`[WebSocket] Connection closed: ${code} - ${reason}`);
});
ws.on('error', (err) => {
console.error('[WebSocket Error]', err);
});
// ===== Session Update =====
function sendSessionUpdate() {
const eventNoVad = {
event_id: 'event_123',
type: 'session.update',
session: {
modalities: ['text'],
input_audio_format: 'pcm',
sample_rate: 16000,
input_audio_transcription: {
language: 'zh'
},
turn_detection: null
}
};
const eventVad = {
event_id: 'event_123',
type: 'session.update',
session: {
modalities: ['text'],
input_audio_format: 'pcm',
sample_rate: 16000,
input_audio_transcription: {
language: 'zh'
},
turn_detection: {
type: 'server_vad',
threshold: 0.0,
silence_duration_ms: 400
}
}
};
if (enableServerVad) {
console.log('[Send Event] VAD Mode:\n', JSON.stringify(eventVad, null, 2));
ws.send(JSON.stringify(eventVad));
} else {
console.log('[Send Event] Manual Mode:\n', JSON.stringify(eventNoVad, null, 2));
ws.send(JSON.stringify(eventNoVad));
}
}
// ===== Send Audio File Stream =====
function sendAudio(audioPath) {
setTimeout(() => {
console.log(`[File Read Start] ${audioPath}`);
const buffer = fs.readFileSync(audioPath);
let offset = 0;
const chunkSize = 3200; // Approx. 0.1 second of PCM16 audio
function sendChunk() {
if (!isRunning) return;
if (offset >= buffer.length) {
isRunning = false; // Stop sending audio.
console.log('[File Read End]');
if (ws.readyState === WebSocket.OPEN) {
if (!enableServerVad) {
const commitEvent = {
event_id: 'event_789',
type: 'input_audio_buffer.commit'
};
ws.send(JSON.stringify(commitEvent));
console.log('[Send Commit Event]');
}
const finishEvent = {
event_id: 'event_987',
type: 'session.finish'
};
ws.send(JSON.stringify(finishEvent));
console.log('[Send Finish Event]');
}
return;
}
if (ws.readyState !== WebSocket.OPEN) {
console.log('[Stop] WebSocket is not open.');
return;
}
const chunk = buffer.slice(offset, offset + chunkSize);
offset += chunkSize;
const encoded = chunk.toString('base64');
const appendEvent = {
event_id: `event_${Date.now()}`,
type: 'input_audio_buffer.append',
audio: encoded
};
ws.send(JSON.stringify(appendEvent));
console.log(`[Send Audio Event] ${appendEvent.event_id}`);
setTimeout(sendChunk, 100); // Simulate real-time sending.
}
sendChunk();
}, 3000); // Wait for session configuration to complete.
}
WebSocket event reference
| Direction | Event type | Description |
|---|---|---|
| Client to server | session.update | Configures session parameters (audio format, language, sample rate, turn detection mode). |
| Client to server | input_audio_buffer.append | Sends a Base64-encoded audio chunk to the server. |
| Client to server | input_audio_buffer.commit | Commits the audio buffer. Required in manual mode only. |
| Client to server | session.finish | Signals that the client has finished sending audio. |
| Server to client | session.created | Confirms that the session was created, and returns the session ID. |
| Server to client | session.updated | Confirms that the session configuration was applied. |
| Server to client | input_audio_buffer.speech_started | VAD detected the start of speech. |
| Server to client | input_audio_buffer.speech_stopped | VAD detected the end of speech. |
| Server to client | conversation.item.created | A new conversation item was created for the incoming audio. |
| Server to client | conversation.item.input_audio_transcription.text | Contains an intermediate transcription result. |
| Server to client | conversation.item.input_audio_transcription.completed | Contains the final transcription result for a speech segment. |
| Server to client | input_audio_buffer.committed | Confirms that the audio buffer was committed. |
| Server to client | session.finished | The session ended; includes the final transcript. |
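Client code typically dispatches the server-to-client events in this table off the type field, as the SDK examples above do. A minimal sketch of such a dispatcher; the payload field names (text, transcript) follow the examples earlier in this topic, and the handler bodies are placeholders:

```python
import json

def make_dispatcher():
    transcripts = []
    handlers = {
        "conversation.item.input_audio_transcription.text":
            lambda e: print("partial:", e.get("text")),
        "conversation.item.input_audio_transcription.completed":
            lambda e: transcripts.append(e.get("transcript")),
        "session.finished":
            lambda e: print("session finished"),
    }

    def dispatch(raw_message: str):
        event = json.loads(raw_message)
        handler = handlers.get(event.get("type"))
        if handler:  # unknown event types are ignored
            handler(event)

    return dispatch, transcripts

dispatch, transcripts = make_dispatcher()
dispatch('{"type": "conversation.item.input_audio_transcription.completed", '
         '"transcript": "hello"}')
```

Keeping the handler table explicit makes it easy to add or drop event types without touching the receive loop.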
Model capabilities and limitations
| Feature | Qwen3-ASR-Flash-Realtime |
|---|---|
| Supported languages | Chinese (Mandarin, Sichuanese, Minnan, Wu, and Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese, Czech, Danish, Filipino, Finnish, Icelandic, Malay, Norwegian, Polish, and Swedish |
| Supported audio formats | PCM, Opus |
| Sample rate | 8 kHz, 16 kHz |
| Channel | Mono |
| Input format | Binary audio stream |
| Audio size/duration | Unlimited |
| Emotion recognition | Supported (always on) |
| Sensitive word filtering | Not supported |
| Speaker diarization | Not supported |
| Filler word filtering | Not supported |
| Timestamp | Not supported |
| Punctuation prediction | Supported (always on) |
| Inverse Text Normalization (ITN) | Not supported |
| Voice Activity Detection (VAD) | Supported (always on) |
| Rate limit | 20 requests per second (RPS) |
| Connection type | Java/Python SDK, WebSocket API |
| Pricing (International) | $0.00009/second |
| Pricing (Mainland China) | $0.000047/second |
These features apply to all versions: qwen3-asr-flash-realtime, qwen3-asr-flash-realtime-2026-02-10, and qwen3-asr-flash-realtime-2025-10-27.
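Billing is per second of audio, so cost scales linearly with duration. For example, transcribing one hour of audio at the rates listed above:

```python
INTL_PRICE_PER_SECOND = 0.00009   # USD/second, International
CN_PRICE_PER_SECOND = 0.000047    # USD/second, Mainland China

seconds = 60 * 60  # one hour of audio
intl_cost = seconds * INTL_PRICE_PER_SECOND  # ~ $0.324
cn_cost = seconds * CN_PRICE_PER_SECOND      # ~ $0.1692
```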