Convert audio streams to text in real time over WebSocket. The service supports multilingual recognition, emotion detection, and voice activity detection (VAD).
Supported regions and endpoints
| | International | Mainland China |
|---|---|---|
| Data storage | Singapore | Beijing |
| Inference computing | Dynamically scheduled globally, excluding Mainland China | Limited to Mainland China |
| WebSocket endpoint | wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime | wss://dashscope.aliyuncs.com/api-ws/v1/realtime |
| API key console | modelstudio.console.alibabacloud.com | bailian.console.alibabacloud.com |
| Pricing | $0.00009/second | $0.000047/second |
For more information about deployment modes, see Compare deployment modes.
Session workflow
A typical session follows this flow:
1. Connect: Establish a WebSocket connection to the service endpoint with your API key.
2. Configure: Send a session.update event to set the audio format, language, and turn detection mode.
3. Stream audio: Send audio chunks as Base64-encoded data through input_audio_buffer.append events.
4. Receive results: The server returns intermediate transcription through conversation.item.input_audio_transcription.text events and final transcription through conversation.item.input_audio_transcription.completed events.
5. Finish: Send a session.finish event to end the session. The server responds with a session.finished event.
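The client-to-server events in this flow can be sketched as plain JSON payloads sent as WebSocket text frames. The field values below are illustrative, not defaults:

```python
import json

# Illustrative client-to-server payloads for one session.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text"],
        "input_audio_format": "pcm",
        "sample_rate": 16000,
        "input_audio_transcription": {"language": "zh"},
        "turn_detection": {"type": "server_vad"},
    },
}
audio_append = {
    "type": "input_audio_buffer.append",
    "audio": "<Base64-encoded PCM chunk>",  # placeholder, not real audio
}
session_finish = {"type": "session.finish"}

# Each event is serialized to one JSON text frame before sending.
frames = [json.dumps(e) for e in (session_update, audio_append, session_finish)]
```

The complete SDK and raw-WebSocket examples later in this topic send exactly these event types in this order.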
Turn detection modes
| Mode | Configuration | Behavior |
|---|---|---|
| Server VAD | Set turn_detection.type to server_vad | The server automatically detects speech boundaries. |
| Manual | Set turn_detection to null | Control turn boundaries with input_audio_buffer.commit events. Continuous audio must not exceed 60 seconds. |
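The two modes differ only in the turn_detection field of the session.update payload. A minimal sketch; the threshold and silence values are illustrative, taken from the examples later in this topic:

```python
# Server VAD mode: the service segments speech automatically.
server_vad_session = {
    "turn_detection": {
        "type": "server_vad",
        "threshold": 0.0,            # VAD sensitivity (illustrative value)
        "silence_duration_ms": 400,  # silence that ends a turn (illustrative value)
    }
}

# Manual mode: disable VAD and commit the audio buffer yourself.
manual_session = {"turn_detection": None}

# In manual mode the client must send input_audio_buffer.commit,
# and continuously sent audio must stay under 60 seconds.
commit_event = {"type": "input_audio_buffer.commit"}
```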
Supported models
All deployment regions support the same model family: Qwen3-ASR-Flash-Realtime.
| Version type | Model ID | Description |
|---|---|---|
| Stable | qwen3-asr-flash-realtime | Points to qwen3-asr-flash-realtime-2025-10-27 |
| Latest snapshot | qwen3-asr-flash-realtime-2026-02-10 | Most recent snapshot |
| Snapshot | qwen3-asr-flash-realtime-2025-10-27 | Point-in-time snapshot |
The stable version alias points to a tested snapshot; use it for production. Snapshot versions are fixed releases for pinning a specific model revision.
For the complete model catalog, see Model list.
Choose a model for your scenario
| Scenario | Recommended model | Reason |
|---|---|---|
| Customer service quality inspection | qwen3-asr-flash-realtime-2026-02-10 | Provides real-time call analysis with emotion detection for quality monitoring. |
| Live streaming and short videos | qwen3-asr-flash-realtime-2026-02-10 | Generates real-time multilingual captions for live content. |
| Online meetings and interviews | qwen3-asr-flash-realtime-2026-02-10 | Provides real-time meeting transcription for text summaries. |
Before you begin
- Install the SDK or dependencies for your chosen integration method. See the version requirements in each section below.
- Obtain an API key. See Create an API key (the Singapore and Beijing regions use different keys).
- Set the API key as an environment variable: export DASHSCOPE_API_KEY="sk-xxx"
- Prepare a test audio file (PCM format, 16 kHz, mono) named your_audio_file.pcm in your working directory.
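If you do not have a recording at hand, you can generate a placeholder PCM file in the expected format (raw 16-bit little-endian samples, 16 kHz, mono, no header) with the standard library. The service will return an empty transcript for silence, but the file lets you verify connectivity and the event flow end to end:

```python
import struct

SAMPLE_RATE = 16000  # 16 kHz, as required by the examples below
DURATION_S = 2       # two seconds of audio
samples = [0] * (SAMPLE_RATE * DURATION_S)  # silence

# PCM16 is raw little-endian signed 16-bit samples with no container header.
with open("your_audio_file.pcm", "wb") as f:
    f.write(struct.pack(f"<{len(samples)}h", *samples))
```

Replace the file with a real recording to see actual transcription results.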
Get started with the DashScope SDK
Java
Install the SDK. Ensure that the DashScope SDK version is 2.22.5 or later.
import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.sound.sampled.LineUnavailableException;
import java.io.File;
import java.io.FileInputStream;
import java.util.Base64;
import java.util.Collections;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
public class Qwen3AsrRealtimeUsage {
private static final Logger log = LoggerFactory.getLogger(Qwen3AsrRealtimeUsage.class);
private static final int AUDIO_CHUNK_SIZE = 1024; // Audio chunk size in bytes
private static final int SLEEP_INTERVAL_MS = 30; // Sleep interval in milliseconds
public static void main(String[] args) throws InterruptedException, LineUnavailableException {
CountDownLatch finishLatch = new CountDownLatch(1);
OmniRealtimeParam param = OmniRealtimeParam.builder()
.model("qwen3-asr-flash-realtime")
// The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
.url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apikey("sk-xxx")
.apikey(System.getenv("DASHSCOPE_API_KEY"))
.build();
OmniRealtimeConversation conversation = null;
final AtomicReference<OmniRealtimeConversation> conversationRef = new AtomicReference<>(null);
conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
@Override
public void onOpen() {
System.out.println("connection opened");
}
@Override
public void onEvent(JsonObject message) {
String type = message.get("type").getAsString();
switch(type) {
case "session.created":
System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
break;
case "conversation.item.input_audio_transcription.completed":
System.out.println("transcription: " + message.get("transcript").getAsString());
finishLatch.countDown();
break;
case "input_audio_buffer.speech_started":
System.out.println("======VAD Speech Start======");
break;
case "input_audio_buffer.speech_stopped":
System.out.println("======VAD Speech Stop======");
break;
case "conversation.item.input_audio_transcription.text":
System.out.println("transcription: " + message.get("text").getAsString());
break;
default:
break;
}
}
@Override
public void onClose(int code, String reason) {
System.out.println("connection closed code: " + code + ", reason: " + reason);
}
});
conversationRef.set(conversation);
try {
conversation.connect();
} catch (NoApiKeyException e) {
throw new RuntimeException(e);
}
OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputAudioFormat("pcm");
transcriptionParam.setInputSampleRate(16000);
OmniRealtimeConfig config = OmniRealtimeConfig.builder()
.modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
.transcriptionConfig(transcriptionParam)
.build();
conversation.updateSession(config);
String filePath = "your_audio_file.pcm";
File audioFile = new File(filePath);
if (!audioFile.exists()) {
log.error("Audio file not found: {}", filePath);
return;
}
try (FileInputStream audioInputStream = new FileInputStream(audioFile)) {
byte[] audioBuffer = new byte[AUDIO_CHUNK_SIZE];
int bytesRead;
int totalBytesRead = 0;
log.info("Starting to send audio data from: {}", filePath);
// Read and send audio data in chunks
while ((bytesRead = audioInputStream.read(audioBuffer)) != -1) {
totalBytesRead += bytesRead;
// Encode only the bytes actually read so the final partial chunk is not padded with stale data.
String audioB64 = Base64.getEncoder().encodeToString(java.util.Arrays.copyOf(audioBuffer, bytesRead));
// Send audio chunk to conversation
conversation.appendAudio(audioB64);
// Add small delay to simulate real-time audio streaming
Thread.sleep(SLEEP_INTERVAL_MS);
}
log.info("Finished sending audio data. Total bytes sent: {}", totalBytesRead);
} catch (Exception e) {
log.error("Error sending audio from file: {}", filePath, e);
}
// Send session.finish, wait for the final transcription, and then close the connection.
conversation.endSession();
finishLatch.await();
log.info("Task finished");
System.exit(0);
}
}
Expected output:
connection opened
start session: <session-id>
======VAD Speech Start======
transcription: <intermediate text>
======VAD Speech Stop======
transcription: <final transcribed text>
connection closed code: 1000, reason: ...
Python
Install the SDK. Ensure that the DashScope SDK version is 1.25.6 or later.
import logging
import os
import base64
import signal
import sys
import time
import dashscope
from dashscope.audio.qwen_omni import *
from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams
def setup_logging():
"""Configure logging."""
logger = logging.getLogger('dashscope')
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.propagate = False
return logger
def init_api_key():
"""Initialize the API key."""
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY', 'YOUR_API_KEY')
if dashscope.api_key == 'YOUR_API_KEY':
print('[Warning] Using placeholder API key, set DASHSCOPE_API_KEY environment variable.')
class MyCallback(OmniRealtimeCallback):
"""Handle real-time recognition callbacks."""
def __init__(self, conversation):
self.conversation = conversation
self.handlers = {
'session.created': self._handle_session_created,
'conversation.item.input_audio_transcription.completed': self._handle_final_text,
'conversation.item.input_audio_transcription.text': self._handle_stash_text,
'input_audio_buffer.speech_started': lambda r: print('======Speech Start======'),
'input_audio_buffer.speech_stopped': lambda r: print('======Speech Stop======')
}
def on_open(self):
print('Connection opened')
def on_close(self, code, msg):
print(f'Connection closed, code: {code}, msg: {msg}')
def on_event(self, response):
try:
handler = self.handlers.get(response['type'])
if handler:
handler(response)
except Exception as e:
print(f'[Error] {e}')
def _handle_session_created(self, response):
print(f"Start session: {response['session']['id']}")
def _handle_final_text(self, response):
print(f"Final recognized text: {response['transcript']}")
def _handle_stash_text(self, response):
print(f"Got stash result: {response['stash']}")
def read_audio_chunks(file_path, chunk_size=3200):
"""Read the audio file in chunks."""
with open(file_path, 'rb') as f:
while chunk := f.read(chunk_size):
yield chunk
def send_audio(conversation, file_path, delay=0.1):
"""Send audio data."""
if not os.path.exists(file_path):
raise FileNotFoundError(f"Audio file {file_path} does not exist.")
print("Processing audio file... Press 'Ctrl+C' to stop.")
for chunk in read_audio_chunks(file_path):
audio_b64 = base64.b64encode(chunk).decode('ascii')
conversation.append_audio(audio_b64)
time.sleep(delay)
def main():
setup_logging()
init_api_key()
audio_file_path = "./your_audio_file.pcm"
conversation = OmniRealtimeConversation(
model='qwen3-asr-flash-realtime',
# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime',
callback=MyCallback(conversation=None) # Temporarily pass None and inject it later.
)
# Inject self into the callback.
conversation.callback.conversation = conversation
def handle_exit(sig, frame):
print('Ctrl+C pressed, exiting...')
conversation.close()
sys.exit(0)
signal.signal(signal.SIGINT, handle_exit)
conversation.connect()
transcription_params = TranscriptionParams(
language='zh',
sample_rate=16000,
input_audio_format="pcm"
)
conversation.update_session(
output_modalities=[MultiModality.TEXT],
enable_input_audio_transcription=True,
transcription_params=transcription_params
)
try:
send_audio(conversation, audio_file_path)
# Send session.finish, wait for the session to finish, and then close the connection.
conversation.end_session()
except Exception as e:
print(f"Error occurred: {e}")
finally:
conversation.close()
print("Audio processing completed.")
if __name__ == '__main__':
main()
Connect with the WebSocket API
The following examples send a local audio file over a raw WebSocket connection and retrieve recognition results. For more information about the protocol, see Interaction flow.
Python
Install the required dependency:
pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client
Do not name the sample code file websocket.py. Otherwise, the following error may occur: AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?
# pip install websocket-client
import os
import time
import json
import threading
import base64
import websocket
import logging
import logging.handlers
from datetime import datetime
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: API_KEY="sk-xxx"
API_KEY = os.environ.get("DASHSCOPE_API_KEY", "sk-xxx")
QWEN_MODEL = "qwen3-asr-flash-realtime"
# The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the base URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
baseUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
url = f"{baseUrl}?model={QWEN_MODEL}"
print(f"Connecting to server: {url}")
# Note: If you are not in VAD mode, the cumulative duration of continuously sent audio should not exceed 60 seconds.
enableServerVad = True
is_running = True # Add a running flag.
headers = [
"Authorization: Bearer " + API_KEY,
"OpenAI-Beta: realtime=v1"
]
def init_logger():
formatter = logging.Formatter('%(asctime)s|%(levelname)s|%(message)s')
f_handler = logging.handlers.RotatingFileHandler(
"omni_tester.log", maxBytes=100 * 1024 * 1024, backupCount=3
)
f_handler.setLevel(logging.DEBUG)
f_handler.setFormatter(formatter)
console = logging.StreamHandler()
console.setLevel(logging.DEBUG)
console.setFormatter(formatter)
logger.addHandler(f_handler)
logger.addHandler(console)
def on_open(ws):
logger.info("Connected to server.")
# Session update event.
event_manual = {
"event_id": "event_123",
"type": "session.update",
"session": {
"modalities": ["text"],
"input_audio_format": "pcm",
"sample_rate": 16000,
"input_audio_transcription": {
# Language identifier, optional. If you have clear language information, set it.
"language": "zh"
},
"turn_detection": None
}
}
event_vad = {
"event_id": "event_123",
"type": "session.update",
"session": {
"modalities": ["text"],
"input_audio_format": "pcm",
"sample_rate": 16000,
"input_audio_transcription": {
"language": "zh"
},
"turn_detection": {
"type": "server_vad",
"threshold": 0.0,
"silence_duration_ms": 400
}
}
}
if enableServerVad:
logger.info(f"Sending event: {json.dumps(event_vad, indent=2)}")
ws.send(json.dumps(event_vad))
else:
logger.info(f"Sending event: {json.dumps(event_manual, indent=2)}")
ws.send(json.dumps(event_manual))
def on_message(ws, message):
global is_running
try:
data = json.loads(message)
logger.info(f"Received event: {json.dumps(data, ensure_ascii=False, indent=2)}")
if data.get("type") == "session.finished":
logger.info(f"Final transcript: {data.get('transcript')}")
logger.info("Closing WebSocket connection after session finished...")
is_running = False # Stop the audio sending thread.
ws.close()
except json.JSONDecodeError:
logger.error(f"Failed to parse message: {message}")
def on_error(ws, error):
logger.error(f"Error: {error}")
def on_close(ws, close_status_code, close_msg):
logger.info(f"Connection closed: {close_status_code} - {close_msg}")
def send_audio(ws, local_audio_path):
time.sleep(3) # Wait for the session update to complete.
global is_running
with open(local_audio_path, 'rb') as audio_file:
logger.info(f"Start reading the file: {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
while is_running:
audio_data = audio_file.read(3200) # ~0.1 second of PCM16/16 kHz audio.
if not audio_data:
logger.info(f"Finished reading the file: {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
if ws.sock and ws.sock.connected:
if not enableServerVad:
commit_event = {
"event_id": "event_789",
"type": "input_audio_buffer.commit"
}
ws.send(json.dumps(commit_event))
finish_event = {
"event_id": "event_987",
"type": "session.finish"
}
ws.send(json.dumps(finish_event))
break
if not ws.sock or not ws.sock.connected:
logger.info("The WebSocket is closed. Stop sending audio.")
break
encoded_data = base64.b64encode(audio_data).decode('utf-8')
eventd = {
"event_id": f"event_{int(time.time() * 1000)}",
"type": "input_audio_buffer.append",
"audio": encoded_data
}
ws.send(json.dumps(eventd))
logger.info(f"Sending audio event: {eventd['event_id']}")
time.sleep(0.1) # Simulate real-time collection.
# Initialize the logger.
init_logger()
logger.info(f"Connecting to WebSocket server at {url}...")
local_audio_path = "your_audio_file.pcm"
ws = websocket.WebSocketApp(
url,
header=headers,
on_open=on_open,
on_message=on_message,
on_error=on_error,
on_close=on_close
)
thread = threading.Thread(target=send_audio, args=(ws, local_audio_path))
thread.start()
ws.run_forever()
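The 3200-byte chunk size used in the sending loop above corresponds to roughly 0.1 seconds of audio, which is why each send is followed by a 0.1-second sleep. The arithmetic, assuming 16-bit mono PCM at 16 kHz:

```python
SAMPLE_RATE = 16000    # samples per second
BYTES_PER_SAMPLE = 2   # PCM16 = 2 bytes per sample
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE  # 32000 bytes/s

chunk_bytes = 3200
chunk_seconds = chunk_bytes / BYTES_PER_SECOND  # 0.1 s per chunk
```

Pacing the sends this way simulates real-time capture; sending faster is possible but does not reflect a live-microphone workload.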
Java
Install the Java-WebSocket dependency.
Maven
<dependency>
<groupId>org.java-websocket</groupId>
<artifactId>Java-WebSocket</artifactId>
<version>1.5.6</version>
</dependency>
Gradle
implementation 'org.java-websocket:Java-WebSocket:1.5.6'
import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;
import org.json.JSONObject;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.*;
public class QwenASRRealtimeClient {
private static final Logger logger = Logger.getLogger(QwenASRRealtimeClient.class.getName());
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: private static final String API_KEY = "sk-xxx"
private static final String API_KEY = System.getenv().getOrDefault("DASHSCOPE_API_KEY", "sk-xxx");
private static final String MODEL = "qwen3-asr-flash-realtime";
// Controls whether to use VAD mode.
private static final boolean enableServerVad = true;
private static final AtomicBoolean isRunning = new AtomicBoolean(true);
private static WebSocketClient client;
public static void main(String[] args) throws Exception {
initLogger();
// The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the base URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
String baseUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime";
String url = baseUrl + "?model=" + MODEL;
logger.info("Connecting to server: " + url);
client = new WebSocketClient(new URI(url)) {
@Override
public void onOpen(ServerHandshake handshake) {
logger.info("Connected to server.");
sendSessionUpdate();
}
@Override
public void onMessage(String message) {
try {
JSONObject data = new JSONObject(message);
String eventType = data.optString("type");
logger.info("Received event: " + data.toString(2));
// When the finish event is received, stop the sending thread and close the connection.
if ("session.finished".equals(eventType)) {
logger.info("Final transcript: " + data.optString("transcript"));
logger.info("Closing WebSocket connection after session finished...");
isRunning.set(false); // Stop the audio sending thread.
if (this.isOpen()) {
this.close(1000, "ASR finished");
}
}
} catch (Exception e) {
logger.severe("Failed to parse message: " + message);
}
}
@Override
public void onClose(int code, String reason, boolean remote) {
logger.info("Connection closed: " + code + " - " + reason);
}
@Override
public void onError(Exception ex) {
logger.severe("Error: " + ex.getMessage());
}
};
// Add request headers.
client.addHeader("Authorization", "Bearer " + API_KEY);
client.addHeader("OpenAI-Beta", "realtime=v1");
client.connectBlocking(); // Block until the connection is established.
// Replace with the path of the audio file to recognize.
String localAudioPath = "your_audio_file.pcm";
Thread audioThread = new Thread(() -> {
try {
sendAudio(localAudioPath);
} catch (Exception e) {
logger.severe("Audio sending thread error: " + e.getMessage());
}
});
audioThread.start();
}
/** Session update event (enable/disable VAD). */
private static void sendSessionUpdate() {
JSONObject eventNoVad = new JSONObject()
.put("event_id", "event_123")
.put("type", "session.update")
.put("session", new JSONObject()
.put("modalities", new String[]{"text"})
.put("input_audio_format", "pcm")
.put("sample_rate", 16000)
.put("input_audio_transcription", new JSONObject()
.put("language", "zh"))
.put("turn_detection", JSONObject.NULL) // Manual mode.
);
JSONObject eventVad = new JSONObject()
.put("event_id", "event_123")
.put("type", "session.update")
.put("session", new JSONObject()
.put("modalities", new String[]{"text"})
.put("input_audio_format", "pcm")
.put("sample_rate", 16000)
.put("input_audio_transcription", new JSONObject()
.put("language", "zh"))
.put("turn_detection", new JSONObject()
.put("type", "server_vad")
.put("threshold", 0.0)
.put("silence_duration_ms", 400))
);
if (enableServerVad) {
logger.info("Sending event (VAD):\n" + eventVad.toString(2));
client.send(eventVad.toString());
} else {
logger.info("Sending event (Manual):\n" + eventNoVad.toString(2));
client.send(eventNoVad.toString());
}
}
/** Send the audio file stream. */
private static void sendAudio(String localAudioPath) throws Exception {
Thread.sleep(3000); // Wait for the session to be ready.
byte[] allBytes = Files.readAllBytes(Paths.get(localAudioPath));
logger.info("Start reading the file.");
int offset = 0;
while (isRunning.get() && offset < allBytes.length) {
int chunkSize = Math.min(3200, allBytes.length - offset);
byte[] chunk = new byte[chunkSize];
System.arraycopy(allBytes, offset, chunk, 0, chunkSize);
offset += chunkSize;
if (client != null && client.isOpen()) {
String encoded = Base64.getEncoder().encodeToString(chunk);
JSONObject eventd = new JSONObject()
.put("event_id", "event_" + System.currentTimeMillis())
.put("type", "input_audio_buffer.append")
.put("audio", encoded);
client.send(eventd.toString());
logger.info("Sending audio event: " + eventd.getString("event_id"));
} else {
break; // Avoid sending after disconnection.
}
Thread.sleep(100); // Simulate real-time sending.
}
logger.info("Finished reading the file.");
if (client != null && client.isOpen()) {
// A commit is required in non-VAD mode.
if (!enableServerVad) {
JSONObject commitEvent = new JSONObject()
.put("event_id", "event_789")
.put("type", "input_audio_buffer.commit");
client.send(commitEvent.toString());
logger.info("Sent commit event for manual mode.");
}
JSONObject finishEvent = new JSONObject()
.put("event_id", "event_987")
.put("type", "session.finish");
client.send(finishEvent.toString());
logger.info("Sent finish event.");
}
}
/** Initialize the logger. */
private static void initLogger() {
logger.setLevel(Level.ALL);
Logger rootLogger = Logger.getLogger("");
for (Handler h : rootLogger.getHandlers()) {
rootLogger.removeHandler(h);
}
Handler consoleHandler = new ConsoleHandler();
consoleHandler.setLevel(Level.ALL);
consoleHandler.setFormatter(new SimpleFormatter());
logger.addHandler(consoleHandler);
}
}
Node.js
Install the required dependency:
npm install ws
/**
* Qwen-ASR Realtime WebSocket Client (Node.js version)
* Features:
* - Supports VAD and Manual modes.
* - Sends session.update to start a session.
* - Continuously sends input_audio_buffer.append audio chunks.
* - Sends input_audio_buffer.commit in Manual mode.
* - Sends a session.finish event.
* - Closes the connection after receiving a session.finished event.
*/
import WebSocket from 'ws';
import fs from 'fs';
// ===== Configuration =====
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: const API_KEY = "sk-xxx"
const API_KEY = process.env.DASHSCOPE_API_KEY || 'sk-xxx';
const MODEL = 'qwen3-asr-flash-realtime';
const enableServerVad = true; // true for VAD mode, false for Manual mode
const localAudioPath = 'your_audio_file.pcm'; // Path to the PCM16, 16 kHz audio file
// The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the base URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
const baseUrl = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime';
const url = `${baseUrl}?model=${MODEL}`;
console.log(`Connecting to server: ${url}`);
// ===== Status Control =====
let isRunning = true;
// ===== Establish Connection =====
const ws = new WebSocket(url, {
headers: {
'Authorization': `Bearer ${API_KEY}`,
'OpenAI-Beta': 'realtime=v1'
}
});
// ===== Event Binding =====
ws.on('open', () => {
console.log('[WebSocket] Connected to server.');
sendSessionUpdate();
// Start the audio sending thread.
sendAudio(localAudioPath);
});
ws.on('message', (message) => {
try {
const data = JSON.parse(message);
console.log('[Received Event]:', JSON.stringify(data, null, 2));
// Received finish event.
if (data.type === 'session.finished') {
console.log(`[Final Transcript] ${data.transcript}`);
console.log('[Action] Closing WebSocket connection after session finished...');
if (ws.readyState === WebSocket.OPEN) {
ws.close(1000, 'ASR finished');
}
}
} catch (e) {
console.error('[Error] Failed to parse message:', message);
}
});
ws.on('close', (code, reason) => {
console.log(`[WebSocket] Connection closed: ${code} - ${reason}`);
});
ws.on('error', (err) => {
console.error('[WebSocket Error]', err);
});
// ===== Session Update =====
function sendSessionUpdate() {
const eventNoVad = {
event_id: 'event_123',
type: 'session.update',
session: {
modalities: ['text'],
input_audio_format: 'pcm',
sample_rate: 16000,
input_audio_transcription: {
language: 'zh'
},
turn_detection: null
}
};
const eventVad = {
event_id: 'event_123',
type: 'session.update',
session: {
modalities: ['text'],
input_audio_format: 'pcm',
sample_rate: 16000,
input_audio_transcription: {
language: 'zh'
},
turn_detection: {
type: 'server_vad',
threshold: 0.0,
silence_duration_ms: 400
}
}
};
if (enableServerVad) {
console.log('[Send Event] VAD Mode:\n', JSON.stringify(eventVad, null, 2));
ws.send(JSON.stringify(eventVad));
} else {
console.log('[Send Event] Manual Mode:\n', JSON.stringify(eventNoVad, null, 2));
ws.send(JSON.stringify(eventNoVad));
}
}
// ===== Send Audio File Stream =====
function sendAudio(audioPath) {
setTimeout(() => {
console.log(`[File Read Start] ${audioPath}`);
const buffer = fs.readFileSync(audioPath);
let offset = 0;
const chunkSize = 3200; // Approx. 0.1 second of PCM16 audio
function sendChunk() {
if (!isRunning) return;
if (offset >= buffer.length) {
isRunning = false; // Stop sending audio.
console.log('[File Read End]');
if (ws.readyState === WebSocket.OPEN) {
if (!enableServerVad) {
const commitEvent = {
event_id: 'event_789',
type: 'input_audio_buffer.commit'
};
ws.send(JSON.stringify(commitEvent));
console.log('[Send Commit Event]');
}
const finishEvent = {
event_id: 'event_987',
type: 'session.finish'
};
ws.send(JSON.stringify(finishEvent));
console.log('[Send Finish Event]');
}
return;
}
if (ws.readyState !== WebSocket.OPEN) {
console.log('[Stop] WebSocket is not open.');
return;
}
const chunk = buffer.slice(offset, offset + chunkSize);
offset += chunkSize;
const encoded = chunk.toString('base64');
const appendEvent = {
event_id: `event_${Date.now()}`,
type: 'input_audio_buffer.append',
audio: encoded
};
ws.send(JSON.stringify(appendEvent));
console.log(`[Send Audio Event] ${appendEvent.event_id}`);
setTimeout(sendChunk, 100); // Simulate real-time sending.
}
sendChunk();
}, 3000); // Wait for session configuration to complete.
}
WebSocket event reference
| Direction | Event type | Description |
|---|---|---|
| Client to server | session.update | Configures session parameters (audio format, language, sample rate, turn detection mode). |
| Client to server | input_audio_buffer.append | Sends a Base64-encoded audio chunk to the server. |
| Client to server | input_audio_buffer.commit | Commits the audio buffer. Required in manual mode only. |
| Client to server | session.finish | Signals that the client has finished sending audio. |
| Server to client | session.created | Confirms that the session was created, and returns the session ID. |
| Server to client | session.updated | Confirms that the session configuration was applied. |
| Server to client | input_audio_buffer.speech_started | VAD detected the start of speech. |
| Server to client | input_audio_buffer.speech_stopped | VAD detected the end of speech. |
| Server to client | conversation.item.created | A new conversation item was created for the incoming audio. |
| Server to client | conversation.item.input_audio_transcription.text | Contains an intermediate transcription result. |
| Server to client | conversation.item.input_audio_transcription.completed | Contains the final transcription result for a speech segment. |
| Server to client | input_audio_buffer.committed | Confirms that the audio buffer was committed. |
| Server to client | session.finished | The session ended; includes the final transcript. |
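Client code typically dispatches the server-to-client events in this table off the type field, as the SDK examples above do. A minimal sketch of such a dispatcher; the payload field names (text, transcript) follow the examples earlier in this topic, and the handler bodies are placeholders:

```python
import json

def make_dispatcher():
    transcripts = []
    handlers = {
        "conversation.item.input_audio_transcription.text":
            lambda e: print("partial:", e.get("text")),
        "conversation.item.input_audio_transcription.completed":
            lambda e: transcripts.append(e.get("transcript")),
        "session.finished":
            lambda e: print("session finished"),
    }

    def dispatch(raw_message: str):
        event = json.loads(raw_message)
        handler = handlers.get(event.get("type"))
        if handler:  # unknown event types are ignored
            handler(event)

    return dispatch, transcripts

dispatch, transcripts = make_dispatcher()
dispatch('{"type": "conversation.item.input_audio_transcription.completed", '
         '"transcript": "hello"}')
```

Keeping the handler table explicit makes it easy to add or drop event types without touching the receive loop.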
Model capabilities and limitations
| Feature | Qwen3-ASR-Flash-Realtime |
|---|---|
| Supported languages | Chinese (Mandarin, Sichuanese, Minnan, Wu, and Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese, Czech, Danish, Filipino, Finnish, Icelandic, Malay, Norwegian, Polish, and Swedish |
| Supported audio formats | PCM, Opus |
| Sample rate | 8 kHz, 16 kHz |
| Channel | Mono |
| Input format | Binary audio stream |
| Audio size/duration | Unlimited |
| Emotion recognition | Supported (always on) |
| Sensitive word filtering | Not supported |
| Speaker diarization | Not supported |
| Filler word filtering | Not supported |
| Timestamp | Not supported |
| Punctuation prediction | Supported (always on) |
| Inverse Text Normalization (ITN) | Not supported |
| Voice Activity Detection (VAD) | Supported (always on) |
| Rate limit | 20 requests per second (RPS) |
| Connection type | Java/Python SDK, WebSocket API |
| Pricing (International) | $0.00009/second |
| Pricing (Mainland China) | $0.000047/second |
These features apply to all versions: qwen3-asr-flash-realtime, qwen3-asr-flash-realtime-2026-02-10, and qwen3-asr-flash-realtime-2025-10-27.
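Billing is per second of audio, so cost scales linearly with duration. For example, transcribing one hour of audio at the rates listed above:

```python
INTL_PRICE_PER_SECOND = 0.00009   # USD/second, International
CN_PRICE_PER_SECOND = 0.000047    # USD/second, Mainland China

seconds = 60 * 60  # one hour of audio
intl_cost = seconds * INTL_PRICE_PER_SECOND  # ~ $0.324
cn_cost = seconds * CN_PRICE_PER_SECOND      # ~ $0.1692
```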