Real-time speech recognition transcribes audio streams to punctuated text over WebSocket, with low latency suited for live captions, online meetings, voice chat, and smart assistants.
Overview
A WebSocket streaming protocol delivers audio to the service and returns transcribed text with low latency.
- High-accuracy recognition for Mandarin and dialects, including Cantonese and Sichuanese.
- Robust performance in complex acoustic environments, with automatic language detection and non-speech filtering.
- Emotion recognition across states such as surprise, calm, joy, sadness, disgust, anger, and fear.
- Custom hotwords that improve recognition accuracy for specified terms.
- Timestamp output for structured recognition results.
- Configurable sample rates and multiple audio formats to fit different recording environments.
For batch scenarios such as meeting transcription, call analysis, and subtitle generation, use Non-real-time speech recognition. For model selection guidance, see Speech-to-text.
Prerequisites
- Create an API key and configure it as an environment variable.
- To call the service through the DashScope SDK, install the latest SDK.
Quick start
The following examples show how to call real-time speech recognition through the DashScope SDK.
Fun-ASR
Recognize microphone audio
Capture audio from the microphone and stream transcribed text in real time, so text appears as the speaker talks.
Java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class Main {
public static void main(String[] args) throws InterruptedException {
// The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
ExecutorService executorService = Executors.newSingleThreadExecutor();
executorService.submit(new RealtimeRecognitionTask());
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.MINUTES);
System.exit(0);
}
}
class RealtimeRecognitionTask implements Runnable {
@Override
public void run() {
RecognitionParam param = RecognitionParam.builder()
.model("fun-asr-realtime")
// The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
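// TargetDataLine delivers headerless PCM samples, so declare the stream as "pcm" (as the Python microphone example does).
.format("pcm")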
.sampleRate(16000)
.build();
Recognition recognizer = new Recognition();
ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
@Override
public void onEvent(RecognitionResult result) {
if (result.isSentenceEnd()) {
System.out.println("Final Result: " + result.getSentence().getText());
} else {
System.out.println("Intermediate Result: " + result.getSentence().getText());
}
}
@Override
public void onComplete() {
System.out.println("Recognition complete");
}
@Override
public void onError(Exception e) {
System.out.println("RecognitionCallback error: " + e.getMessage());
}
};
try {
recognizer.call(param, callback);
// Create an audio format.
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
// Match the default recording device based on the format.
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
// Start recording.
targetDataLine.start();
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
// Record for 50 seconds and perform real-time transcription.
while (System.currentTimeMillis() - start < 50000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
// Send the recorded audio data to the streaming recognition service.
recognizer.sendAudioFrame(buffer);
buffer = ByteBuffer.allocate(1024);
// The recording rate is limited. Sleep for a short period to prevent high CPU usage.
Thread.sleep(20);
}
}
recognizer.stop();
} catch (Exception e) {
e.printStackTrace();
} finally {
// Close the WebSocket connection after the task is complete.
recognizer.getDuplexApi().close(1000, "bye");
}
System.out.println(
"[Metric] requestId: "
+ recognizer.getLastRequestId()
+ ", first package delay ms: "
+ recognizer.getFirstPackageDelay()
+ ", last package delay ms: "
+ recognizer.getLastPackageDelay());
}
}
Python
The Python example requires the pyaudio library for audio capture. Install it with pip install pyaudio before running the example.
import os
import signal  # for keyboard events handling (press "Ctrl+C" to terminate recording)
import sys

import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


# Real-time speech recognition callback
class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=channels,
                          rate=sample_rate,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_complete(self) -> None:
        print('RecognitionCallback completed.')  # recognition completed

    def on_error(self, message) -> None:
        print('RecognitionCallback task_id: ', message.request_id)
        print('RecognitionCallback error: ', message.message)
        # Stop and close the audio stream if it is running
        if stream is not None and stream.is_active():
            stream.stop_stream()
            stream.close()
        # Forcefully exit the program
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
    print('Ctrl+C pressed, stop recognition ...')
    # Stop recognition
    recognition.stop()
    print('Recognition stopped.')
    print(
        '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
        .format(
            recognition.get_last_request_id(),
            recognition.get_first_package_delay(),
            recognition.get_last_package_delay(),
        ))
    # Forcefully exit the program
    sys.exit(0)


# main function
if __name__ == '__main__':
    # The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

    # Create the recognition callback
    callback = Callback()

    # Call the recognition service in async mode. You can customize the recognition parameters,
    # such as model, format, and sample_rate.
    recognition = Recognition(
        model='fun-asr-realtime',
        format=format_pcm,
        # 'pcm', 'wav', 'opus', 'speex', 'aac', 'amr'. You can check the supported formats in the document.
        sample_rate=sample_rate,
        # Supports 8000, 16000.
        semantic_punctuation_enabled=False,
        callback=callback)

    # Start recognition
    recognition.start()

    signal.signal(signal.SIGINT, signal_handler)
    print("Press 'Ctrl+C' to stop recording and recognition...")

    # Keep sending microphone audio until "Ctrl+C" is pressed or the stream is closed
    while True:
        if stream:
            data = stream.read(block_size, exception_on_overflow=False)
            recognition.send_audio_frame(data)
        else:
            break

    recognition.stop()
Recognize a local audio file
Transcribe a local audio file. This fits short, near-real-time scenarios such as voice chat, voice commands, speech input, and voice search.
Java
The audio file used in the example is asr_example.wav.
import com.alibaba.dashscope.api.GeneralApi;
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.base.HalfDuplexParamBase;
import com.alibaba.dashscope.common.GeneralListParam;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.protocol.GeneralServiceOption;
import com.alibaba.dashscope.protocol.HttpMethod;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.protocol.StreamingMode;
import com.alibaba.dashscope.utils.Constants;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class TimeUtils {
private static final DateTimeFormatter formatter =
DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
public static String getTimestamp() {
return LocalDateTime.now().format(formatter);
}
}
public class Main {
public static void main(String[] args) throws InterruptedException {
// The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
// In a real application, call this method only once at program startup.
warmUp();
ExecutorService executorService = Executors.newSingleThreadExecutor();
executorService.submit(new RealtimeRecognitionTask(Paths.get(System.getProperty("user.dir"), "asr_example.wav")));
executorService.shutdown();
// Wait for all tasks to complete.
executorService.awaitTermination(1, TimeUnit.MINUTES);
System.exit(0);
}
public static void warmUp() {
try {
// Lightweight GET request to establish a connection.
GeneralServiceOption warmupOption = GeneralServiceOption.builder()
.protocol(Protocol.HTTP)
.httpMethod(HttpMethod.GET)
.streamingMode(StreamingMode.OUT)
.path("assistants")
.build();
warmupOption.setBaseHttpUrl(Constants.baseHttpApiUrl);
GeneralApi<HalfDuplexParamBase> api = new GeneralApi<>();
api.get(GeneralListParam.builder().limit(1L).build(), warmupOption);
} catch (Exception e) {
// Allow retry if warm-up fails.
}
}
}
class RealtimeRecognitionTask implements Runnable {
private Path filepath;
public RealtimeRecognitionTask(Path filepath) {
this.filepath = filepath;
}
@Override
public void run() {
RecognitionParam param = RecognitionParam.builder()
.model("fun-asr-realtime")
// The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.format("wav")
.sampleRate(16000)
.build();
Recognition recognizer = new Recognition();
String threadName = Thread.currentThread().getName();
ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
@Override
public void onEvent(RecognitionResult message) {
if (message.isSentenceEnd()) {
System.out.println(TimeUtils.getTimestamp()+" "+
"[process " + threadName + "] Final Result:" + message.getSentence().getText());
} else {
System.out.println(TimeUtils.getTimestamp()+" "+
"[process " + threadName + "] Intermediate Result: " + message.getSentence().getText());
}
}
@Override
public void onComplete() {
System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Recognition complete");
}
@Override
public void onError(Exception e) {
System.out.println(TimeUtils.getTimestamp()+" "+
"[" + threadName + "] RecognitionCallback error: " + e.getMessage());
}
};
try {
recognizer.call(param, callback);
// Replace the path with your audio file path.
System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Input file_path is: " + this.filepath);
// Read the file and send audio in chunks.
// Files.readAllBytes reads the whole file and closes it; FileInputStream.available() is only an estimate of the remaining bytes.
byte[] allData = java.nio.file.Files.readAllBytes(this.filepath);
int sendFrameLength = 3200;
for (int i = 0; i * sendFrameLength < allData.length; i ++) {
int start = i * sendFrameLength;
int end = Math.min(start + sendFrameLength, allData.length);
ByteBuffer byteBuffer = ByteBuffer.wrap(allData, start, end - start);
recognizer.sendAudioFrame(byteBuffer);
Thread.sleep(100);
}
System.out.println(TimeUtils.getTimestamp()+" "+LocalDateTime.now());
recognizer.stop();
} catch (Exception e) {
e.printStackTrace();
} finally {
// Close the WebSocket connection after the task is complete.
recognizer.getDuplexApi().close(1000, "bye");
}
System.out.println(
"["
+ threadName
+ "][Metric] requestId: "
+ recognizer.getLastRequestId()
+ ", first package delay ms: "
+ recognizer.getFirstPackageDelay()
+ ", last package delay ms: "
+ recognizer.getLastPackageDelay());
}
}
Python
The audio file used in the example is asr_example.wav.
import os
import sys
import time
from datetime import datetime

import dashscope
from dashscope.audio.asr import *

# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not set an environment variable, replace the next line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# The following URL is for the Singapore region. To use the Beijing region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'


def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp


class Callback(RecognitionCallback):
    def on_complete(self) -> None:
        print(get_timestamp() + ' Recognition completed')  # recognition complete

    def on_error(self, result: RecognitionResult) -> None:
        print('Recognition task_id: ', result.request_id)
        print('Recognition error: ', result.message)
        # Exit with a non-zero status so callers can detect the failure
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print(get_timestamp() + ' RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(get_timestamp() +
                      ' RecognitionCallback sentence end, request_id:%s, usage:%s'
                      % (result.get_request_id(), result.get_usage(sentence)))


callback = Callback()
recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=callback)

# Read the entire file into a buffer; the with block closes the file even if reading fails
with open("asr_example.wav", 'rb') as f:
    file_buffer = f.read()
if not file_buffer:
    raise Exception('The supplied file was empty (zero bytes long)')

print("Start Recognition")
recognition.start()

# Send data in chunks of 3200 bytes
buffer_size = len(file_buffer)
offset = 0
chunk_size = 3200
while offset < buffer_size:
    # Calculate the size of the current chunk
    remaining_bytes = buffer_size - offset
    current_chunk_size = min(chunk_size, remaining_bytes)
    # Extract the current chunk from the buffer
    audio_data = file_buffer[offset:offset + current_chunk_size]
    # Send the audio frame
    recognition.send_audio_frame(audio_data)
    # Update the offset
    offset += current_chunk_size
    # Add a delay to simulate real-time transmission
    time.sleep(0.1)

recognition.stop()

print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))
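The 3200-byte chunk and the 100 ms pause used in the file examples above match each other: at 16 kHz, 16-bit mono PCM, 3200 bytes hold 100 ms of audio, so the loop sends data at roughly playback speed. The following sketch works through that arithmetic. The helper names chunk_duration_ms and chunk_bytes_for are illustrative and not part of the DashScope SDK, and the calculation assumes uncompressed 16-bit samples (it ignores the small WAV header).

```python
def chunk_duration_ms(chunk_bytes: int, sample_rate: int,
                      bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Return how many milliseconds of audio a raw PCM chunk holds."""
    bytes_per_second = sample_rate * bytes_per_sample * channels
    return chunk_bytes * 1000 / bytes_per_second


def chunk_bytes_for(duration_ms: int, sample_rate: int,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Return the chunk size in bytes that holds duration_ms of raw PCM audio."""
    return sample_rate * bytes_per_sample * channels * duration_ms // 1000


if __name__ == "__main__":
    # 3200 bytes at 16 kHz, 16-bit mono
    print(chunk_duration_ms(3200, 16000))
    # Bytes needed for a 100 ms chunk of 8 kHz audio
    print(chunk_bytes_for(100, 8000))
```

When the sample rate changes, scaling the chunk size the same way keeps the send rate close to real time.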
Qwen-ASR
The example reads your_audio_file.pcm (PCM16, 16 kHz, mono). To convert from MP3, WAV, or other formats, use ffmpeg:
ffmpeg -i your_audio.mp3 -ar 16000 -ac 1 -f s16le your_audio_file.pcm
Java
import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.sound.sampled.LineUnavailableException;
import java.io.File;
import java.io.FileInputStream;
import java.util.Base64;
import java.util.Collections;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
public class Qwen3AsrRealtimeUsage {
private static final Logger log = LoggerFactory.getLogger(Qwen3AsrRealtimeUsage.class);
private static final int AUDIO_CHUNK_SIZE = 1024; // Audio chunk size in bytes
private static final int SLEEP_INTERVAL_MS = 30; // Sleep interval in milliseconds
public static void main(String[] args) throws InterruptedException, LineUnavailableException {
CountDownLatch finishLatch = new CountDownLatch(1);
OmniRealtimeParam param = OmniRealtimeParam.builder()
.model("qwen3-asr-flash-realtime")
// The following URL is for the Singapore region. To use the Beijing region model, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
.url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
// The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured the environment variable, replace the line below with your Model Studio API Key: .apikey("sk-xxx")
.apikey(System.getenv("DASHSCOPE_API_KEY"))
.build();
OmniRealtimeConversation conversation = null;
final AtomicReference<OmniRealtimeConversation> conversationRef = new AtomicReference<>(null);
conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
@Override
public void onOpen() {
System.out.println("connection opened");
}
@Override
public void onEvent(JsonObject message) {
String type = message.get("type").getAsString();
switch(type) {
case "session.created":
System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
break;
case "conversation.item.input_audio_transcription.completed":
System.out.println("transcription: " + message.get("transcript").getAsString());
finishLatch.countDown();
break;
case "input_audio_buffer.speech_started":
System.out.println("======VAD Speech Start======");
break;
case "input_audio_buffer.speech_stopped":
System.out.println("======VAD Speech Stop======");
break;
case "conversation.item.input_audio_transcription.text":
System.out.println("transcription: " + message.get("text").getAsString() + message.get("stash").getAsString());
break;
default:
break;
}
}
@Override
public void onClose(int code, String reason) {
System.out.println("connection closed code: " + code + ", reason: " + reason);
}
});
conversationRef.set(conversation);
try {
conversation.connect();
} catch (NoApiKeyException e) {
throw new RuntimeException(e);
}
OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("en");
transcriptionParam.setInputAudioFormat("pcm");
transcriptionParam.setInputSampleRate(16000);
OmniRealtimeConfig config = OmniRealtimeConfig.builder()
.modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
.transcriptionConfig(transcriptionParam)
.build();
conversation.updateSession(config);
String filePath = "your_audio_file.pcm";
File audioFile = new File(filePath);
if (!audioFile.exists()) {
log.error("Audio file not found: {}", filePath);
return;
}
try (FileInputStream audioInputStream = new FileInputStream(audioFile)) {
byte[] audioBuffer = new byte[AUDIO_CHUNK_SIZE];
int bytesRead;
int totalBytesRead = 0;
log.info("Starting to send audio data from: {}", filePath);
// Read and send audio data in chunks
while ((bytesRead = audioInputStream.read(audioBuffer)) != -1) {
totalBytesRead += bytesRead;
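// Encode only the bytes read in this iteration; the final chunk is usually shorter than the buffer.
String audioB64 = Base64.getEncoder().encodeToString(java.util.Arrays.copyOf(audioBuffer, bytesRead));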
// Send audio chunk to conversation
conversation.appendAudio(audioB64);
// Add small delay to simulate real-time audio streaming
Thread.sleep(SLEEP_INTERVAL_MS);
}
log.info("Finished sending audio data. Total bytes sent: {}", totalBytesRead);
} catch (Exception e) {
log.error("Error sending audio from file: {}", filePath, e);
}
//send session.finish and wait for finish and close
conversation.endSession();
log.info("task finished");
System.exit(0);
}
}
Python
import logging
import os
import base64
import signal
import sys
import time

import dashscope
from dashscope.audio.qwen_omni import *
from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams


def setup_logging():
    """Configure logging output"""
    logger = logging.getLogger('dashscope')
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def init_api_key():
    """Initialize API Key"""
    # The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the line below with your Model Studio API Key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY', 'YOUR_API_KEY')
    if dashscope.api_key == 'YOUR_API_KEY':
        print('[Warning] Using placeholder API key, set DASHSCOPE_API_KEY environment variable.')


class MyCallback(OmniRealtimeCallback):
    """Real-time recognition callback handler"""

    def __init__(self, conversation):
        self.conversation = conversation
        self.handlers = {
            'session.created': self._handle_session_created,
            'conversation.item.input_audio_transcription.completed': self._handle_final_text,
            'conversation.item.input_audio_transcription.text': self._handle_transcription_text,
            'input_audio_buffer.speech_started': lambda r: print('======Speech Start======'),
            'input_audio_buffer.speech_stopped': lambda r: print('======Speech Stop======')
        }

    def on_open(self):
        print('Connection opened')

    def on_close(self, code, msg):
        print(f'Connection closed, code: {code}, msg: {msg}')

    def on_event(self, response):
        try:
            handler = self.handlers.get(response['type'])
            if handler:
                handler(response)
        except Exception as e:
            print(f'[Error] {e}')

    def _handle_session_created(self, response):
        print(f"Start session: {response['session']['id']}")

    def _handle_final_text(self, response):
        print(f"Final recognized text: {response['transcript']}")

    def _handle_transcription_text(self, response):
        print(f"Got transcription result: {response['text'] + response['stash']}")


def read_audio_chunks(file_path, chunk_size=3200):
    """Read audio file in chunks"""
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            yield chunk


def send_audio(conversation, file_path, delay=0.1):
    """Send audio data"""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Audio file {file_path} does not exist.")
    print("Processing audio file... Press 'Ctrl+C' to stop.")
    for chunk in read_audio_chunks(file_path):
        audio_b64 = base64.b64encode(chunk).decode('ascii')
        conversation.append_audio(audio_b64)
        time.sleep(delay)


def main():
    setup_logging()
    init_api_key()
    audio_file_path = "./your_audio_file.pcm"

    callback = MyCallback(conversation=None)
    conversation = OmniRealtimeConversation(
        model='qwen3-asr-flash-realtime',
        # The following URL is for the Singapore region. To use the Beijing region model, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
        url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime',
        callback=callback,
    )
    # Inject conversation into the callback so its methods can be invoked from the callback
    callback.conversation = conversation

    def handle_exit(sig, frame):
        print('Ctrl+C pressed, exiting...')
        conversation.close()
        sys.exit(0)

    signal.signal(signal.SIGINT, handle_exit)

    conversation.connect()

    transcription_params = TranscriptionParams(
        language='en',
        sample_rate=16000,
        input_audio_format="pcm"
    )
    conversation.update_session(
        output_modalities=[MultiModality.TEXT],
        enable_input_audio_transcription=True,
        transcription_params=transcription_params
    )

    try:
        send_audio(conversation, audio_file_path)
        # Send session.finish, then wait for the session to finish and close
        conversation.end_session()
    except Exception as e:
        print(f"Error occurred: {e}")
    finally:
        conversation.close()
        print("Audio processing completed.")


if __name__ == '__main__':
    main()
Paraformer
Paraformer reuses the Fun-ASR example code. Replace the model parameter with the Paraformer model name.
Advanced features
Qwen-ASR interaction modes
The Qwen-ASR Realtime API supports two interaction modes:
- VAD mode (default): The server detects the start and end of speech automatically (turn detection). Suitable for real-time conversation, meeting transcription, and similar scenarios. To enable, configure session.turn_detection (enabled by default).
- Manual mode: The client controls turn detection by sending input_audio_buffer.commit. Suitable for scenarios that need explicit control over send timing, such as voice messages in chat apps. To enable, set session.turn_detection to null.
Switch between modes:
- WebSocket: Set the turn_detection field in the session.update event.
  { "type": "session.update", "session": { "turn_detection": null } }
- Python SDK: Set the enable_turn_detection parameter in the update_session method.
  conversation.update_session(enable_turn_detection=False)
- Java SDK: Set the enableTurnDetection parameter through OmniRealtimeConfig.builder().
  OmniRealtimeConfig config = OmniRealtimeConfig.builder()
      .enableTurnDetection(false)
      .build();
  conversation.updateSession(config);
For complete SDK code examples, see Qwen-ASR-Realtime Python SDK - API reference and Java SDK. For the WebSocket event lifecycle, see Event flow.
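As a rough illustration of Manual mode over raw WebSocket, the sketch below builds the sequence of client events for one utterance: disable turn detection, stream the audio, then commit. Only session.update and input_audio_buffer.commit are named in this document. The input_audio_buffer.append event and its audio field are assumptions made for illustration, as are the helper names, so check the API reference before relying on them.

```python
import base64
import json


def session_update_manual() -> str:
    """session.update event that sets turn_detection to null (Manual mode)."""
    return json.dumps({"type": "session.update",
                       "session": {"turn_detection": None}})


def audio_append(chunk: bytes) -> str:
    """ASSUMED event shape for sending one Base64-encoded audio chunk."""
    return json.dumps({"type": "input_audio_buffer.append",
                       "audio": base64.b64encode(chunk).decode("ascii")})


def commit() -> str:
    """input_audio_buffer.commit marks the end of the client-controlled turn."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def manual_turn(chunks):
    """Yield the client events for one manually delimited utterance, in order."""
    yield session_update_manual()
    for chunk in chunks:
        yield audio_append(chunk)
    yield commit()


if __name__ == "__main__":
    # Two tiny placeholder chunks stand in for real PCM audio.
    for event in manual_turn([b"\x00\x01" * 4, b"\x02\x03" * 4]):
        print(event)
```

Each yielded string would be sent as one WebSocket text message. In VAD mode the commit step is unnecessary because the server decides where the turn ends.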
VAD turn detection
Voice Activity Detection (VAD) determines when continuous speech ends, which triggers the final-result event. All three model families enable server-side VAD by default, but the parameter names and tunability differ:
- Qwen-ASR: Configured through session.turn_detection. Includes silence_duration_ms (silence duration threshold; the turn ends when silence exceeds this value; server default 800; recommended 400 for conversation and chat scenarios that need fast turn detection) and threshold (VAD detection sensitivity; server default 0.2). Qwen-ASR also supports a Manual mode that disables VAD and lets the client control turn detection through commit. See Qwen-ASR interaction modes above.
- Fun-ASR and Paraformer: Configured through max_sentence_silence (VAD turn-detection silence threshold, in milliseconds). When the silence after a speech segment exceeds this threshold, the sentence is considered ended.
The parameter name varies by protocol: the same concept is silence_duration_ms in Qwen-ASR and max_sentence_silence in Fun-ASR and Paraformer. For complete field definitions, see API reference.
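The naming difference can be hidden behind a small helper. The following is a minimal sketch, assuming an application that keeps one silence threshold in milliseconds and needs the right key for each model family. The function name and the family labels are hypothetical, and the sketch returns only the key/value structure described above, not a complete request.

```python
def silence_threshold_params(model_family: str, silence_ms: int) -> dict:
    """Map one silence threshold (ms) to the family-specific parameter name."""
    if silence_ms <= 0:
        raise ValueError("silence_ms must be positive")
    family = model_family.lower()
    if family == "qwen-asr":
        # Qwen-ASR nests the value under session.turn_detection.
        return {"session": {"turn_detection": {"silence_duration_ms": silence_ms}}}
    if family in ("fun-asr", "paraformer"):
        # Fun-ASR and Paraformer use max_sentence_silence.
        return {"max_sentence_silence": silence_ms}
    raise ValueError(f"unknown model family: {model_family}")


if __name__ == "__main__":
    print(silence_threshold_params("qwen-asr", 400))
    print(silence_threshold_params("paraformer", 800))
```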
Improve accuracy with hotwords
Fun-ASR and Paraformer support hotwords to improve recognition accuracy for specific terms, such as brand names, personal names, and proper nouns.
For hotword configuration and usage, see Custom hotwords.
Timestamps
Fun-ASR and Paraformer return timestamps at two granularities by default: sentence level and word level. These support subtitle alignment, keyword highlighting, karaoke-style sing-along, and similar use cases. Qwen-ASR Realtime (qwen3-asr-flash-realtime) does not currently return timestamps. For timestamps, use Fun-ASR or Paraformer. The Qwen-ASR file transcription model qwen3-asr-flash-filetrans supports word-level timestamps. See Non-real-time speech recognition.
Timestamps are reported in milliseconds at two levels:
- Sentence level: payload.output.sentence.begin_time and payload.output.sentence.end_time mark the start and end of the sentence within the audio. In intermediate results, end_time may be null. The final value is populated when the sentence ends (sentence_end = true).
- Word level: The payload.output.sentence.words array. Each element contains begin_time, end_time, text (the character or word text), and punctuation (the punctuation that follows the character; an empty string when none).
Example response (excerpt):
{
"payload": {
"output": {
"sentence": {
"begin_time": 170,
"end_time": 920,
"text": "Okay, I got it.",
"sentence_end": true,
"words": [
{ "begin_time": 170, "end_time": 295, "text": "Okay", "punctuation": "," },
{ "begin_time": 295, "end_time": 503, "text": "I", "punctuation": "" },
{ "begin_time": 503, "end_time": 711, "text": "got", "punctuation": "" },
{ "begin_time": 711, "end_time": 920, "text": "it", "punctuation": "" }
]
}
}
}
}
The field names above use the WebSocket JSON path. Each SDK exposes these fields with its own naming convention (dictionary keys, object properties, getter methods, and so on). For the full field mapping, see the API reference of each SDK.
For complete field definitions, see API reference.
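To show how these fields support subtitle alignment, the sketch below turns the example response above into an SRT-style cue and a list of word spans. It reads the raw WebSocket JSON path. The helper names are illustrative, and an SDK result object would need its own accessors instead of dictionary lookups.

```python
import json

# The example response shown above, embedded so the sketch runs offline.
EXAMPLE = json.loads("""
{"payload": {"output": {"sentence": {
  "begin_time": 170, "end_time": 920, "text": "Okay, I got it.", "sentence_end": true,
  "words": [
    {"begin_time": 170, "end_time": 295, "text": "Okay", "punctuation": ","},
    {"begin_time": 295, "end_time": 503, "text": "I", "punctuation": ""},
    {"begin_time": 503, "end_time": 711, "text": "got", "punctuation": ""},
    {"begin_time": 711, "end_time": 920, "text": "it", "punctuation": ""}
  ]}}}}
""")


def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def sentence_to_cue(message: dict, index: int = 1):
    """Return an SRT cue for a final sentence, or None for intermediate results."""
    sentence = message["payload"]["output"]["sentence"]
    # Intermediate results may carry a null end_time, so wait for sentence_end.
    if not sentence.get("sentence_end") or sentence.get("end_time") is None:
        return None
    return (f"{index}\n{ms_to_srt(sentence['begin_time'])} --> "
            f"{ms_to_srt(sentence['end_time'])}\n{sentence['text']}\n")


def word_spans(message: dict):
    """Return (text with trailing punctuation, begin_ms, end_ms) per word."""
    words = message["payload"]["output"]["sentence"].get("words", [])
    return [(w["text"] + w["punctuation"], w["begin_time"], w["end_time"])
            for w in words]


if __name__ == "__main__":
    print(sentence_to_cue(EXAMPLE))
    print(word_spans(EXAMPLE))
```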
Emotion recognition
Some Qwen-ASR and Paraformer models include the speaker's emotional state in transcription results. The output granularity and the way to enable it differ between the two.
Qwen-ASR (qwen3-asr-flash-realtime): Always on, no configuration needed. The top-level emotion field is returned in both the conversation.item.input_audio_transcription.text and conversation.item.input_audio_transcription.completed events. The value is one of seven fine-grained emotions: surprised, neutral, happy, sad, disgusted, angry, and fearful.
{
"type": "conversation.item.input_audio_transcription.text",
"emotion": "neutral",
"text": "The weather is nice today.",
"stash": ""
}
Paraformer (paraformer-realtime-8k-v2): Only this Paraformer model supports emotion recognition. Results are returned through payload.output.sentence.emo_tag and payload.output.sentence.emo_confidence. The value is one of three polarities: positive (positive emotions such as happiness or satisfaction), negative (negative emotions such as anger or low spirits), and neutral (no clear emotion). The confidence range is [0.0, 1.0].
Emotion recognition is returned only when all of the following conditions are met:
- The model is paraformer-realtime-8k-v2.
- Semantic turn detection is disabled: semantic_punctuation_enabled = false (the default; no special setting needed).
- The result is returned only in the sentence-end event where sentence_end = true.
To suppress the emotion recognition fields, set semantic_punctuation_enabled to true. This enables semantic turn detection, and the emo_tag and emo_confidence fields are no longer returned.
The field names above use the WebSocket JSON path. Each SDK exposes these fields with its own naming convention (dictionary keys, object properties, getter methods, and so on). For the full field mapping, see the API reference of each SDK.
For complete field definitions, value constraints, and examples, see API reference.
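Because the two model families report emotion in different places, client code usually normalizes them. The following is a minimal sketch that reads the raw WebSocket JSON described above and returns a (label, confidence) pair. The function name is hypothetical, and the Paraformer sample values in the usage section are made up for illustration.

```python
from typing import Optional, Tuple

QWEN_PREFIX = "conversation.item.input_audio_transcription."


def extract_emotion(message: dict) -> Optional[Tuple[str, Optional[float]]]:
    """Return (emotion label, confidence or None), or None when absent."""
    # Qwen-ASR: top-level "emotion" field on the .text and .completed events.
    if message.get("type", "").startswith(QWEN_PREFIX):
        emotion = message.get("emotion")
        return (emotion, None) if emotion else None
    # Paraformer: emo_tag and emo_confidence, only on sentence-end results.
    sentence = message.get("payload", {}).get("output", {}).get("sentence", {})
    if sentence.get("sentence_end") and "emo_tag" in sentence:
        return sentence["emo_tag"], sentence.get("emo_confidence")
    return None


if __name__ == "__main__":
    qwen_event = {
        "type": "conversation.item.input_audio_transcription.text",
        "emotion": "neutral",
        "text": "The weather is nice today.",
        "stash": "",
    }
    # Illustrative values only; not taken from a real response.
    paraformer_event = {"payload": {"output": {"sentence": {
        "sentence_end": True, "emo_tag": "positive", "emo_confidence": 0.9}}}}
    print(extract_emotion(qwen_event))
    print(extract_emotion(paraformer_event))
```

Qwen-ASR does not report a confidence in the fields described here, so the second element is None for its events.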
Raw WebSocket protocol
The following examples show how to connect to the server through the raw WebSocket protocol, suitable for scenarios that don't use the DashScope SDK. These are minimal runnable implementations. For the WebSocket protocol of each model, see API reference.
Connection reuse (WebSocket)
Fun-ASR and Paraformer support WebSocket connection reuse: after one recognition task ends, you can start the next task on the same connection without reconnecting.
Reuse flow: The client sends finish-task. After the server returns task-finished, the client sends run-task again to start a new task.
-
Wait for the server to return the task-finished event before starting a new task.
-
Each task on a reused connection must use a different task_id.
-
If a task fails, the server returns an error event and closes the connection. That connection can't be reused.
-
The connection closes automatically when no new task starts within 60 seconds after a task ends.
Qwen-ASR Realtime uses a session model. Close the connection after each session ends. Connection reuse isn't supported.
For event descriptions of each model, see the corresponding API reference.
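The Fun-ASR and Paraformer reuse rules above can be sketched as a small state tracker that sits next to the WebSocket client. This is a hedged illustration, not the protocol definition: the message envelope (header.action, header.task_id, header.event) and the error event name task-failed are assumptions, and the class name is hypothetical. Confirm the real message shapes in the API reference.

```python
import json
import uuid


class ReusableConnectionState:
    """Tracks whether a new task may start on a reused connection."""

    def __init__(self):
        self.active_task_id = None  # task currently running, if any
        self.usable = True          # False once the server reports a failure

    def build_run_task(self, payload: dict) -> str:
        if not self.usable:
            raise RuntimeError("a task failed on this connection; open a new one")
        if self.active_task_id is not None:
            raise RuntimeError("wait for task-finished before starting a new task")
        self.active_task_id = uuid.uuid4().hex  # different task_id for every task
        return json.dumps({
            "header": {"action": "run-task", "task_id": self.active_task_id},
            "payload": payload,
        })

    def build_finish_task(self) -> str:
        if self.active_task_id is None:
            raise RuntimeError("no task is running")
        return json.dumps({
            "header": {"action": "finish-task", "task_id": self.active_task_id},
            "payload": {},
        })

    def on_server_event(self, raw_message: str) -> None:
        event = json.loads(raw_message).get("header", {}).get("event")
        if event == "task-finished":
            self.active_task_id = None  # connection is free for the next run-task
        elif event == "task-failed":    # assumed name of the error event
            self.usable = False         # the server closes the connection
```

The 60-second idle timeout is not modeled here; a client would typically treat a closed socket the same way as a failed task and reconnect.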
High-concurrency best practices
The DashScope SDK includes built-in pooling that reuses WebSocket connections and recognition objects, which avoids the overhead of creating and destroying them repeatedly. Currently only the Paraformer Java SDK supports this feature.
Apply in production
Improve recognition accuracy
-
Match the model to the sample rate: For 8 kHz phone audio, use an 8 kHz model directly. Don't upsample to 16 kHz first, because upsampling adds no new information and can distort the signal.
-
Improve input audio quality: Use a high-quality microphone in a recording environment with high signal-to-noise ratio and no echo. At the application layer, integrate preprocessing such as noise reduction (for example, RNNoise) and acoustic echo cancellation (AEC).
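For the sample-rate advice above, a minimal sketch that reads the rate from a WAV header with the standard wave module and picks a model accordingly. The mapping uses two Paraformer names from the model list purely as an illustration; the helper names are hypothetical, and the right model depends on your deployment scope.

```python
import io
import wave

# Illustrative mapping only; check the model list for your deployment scope.
MODEL_BY_SAMPLE_RATE = {
    8000: "paraformer-realtime-8k-v2",
    16000: "paraformer-realtime-v2",
}


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate from an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getframerate()


def pick_model(sample_rate: int) -> str:
    """Choose a model for the native rate instead of resampling the audio."""
    if sample_rate not in MODEL_BY_SAMPLE_RATE:
        raise ValueError(f"no model configured for {sample_rate} Hz")
    return MODEL_BY_SAMPLE_RATE[sample_rate]


# Build a tiny 8 kHz mono WAV in memory to exercise the helpers.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)       # 16-bit samples
    writer.setframerate(8000)
    writer.writeframes(b"\x00\x00" * 80)

print(pick_model(wav_sample_rate(buffer.getvalue())))
```

Raising on an unknown rate makes a mismatch visible early, rather than silently sending audio to a model that expects a different rate.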
Set up resilience
-
Client-side reconnection: The client should implement automatic reconnection to handle network jitter. A reference approach with the Python SDK:
-
Catch exceptions: Implement the on_error method in the Callback class. The dashscope SDK calls this method when it encounters a network error or other issue.
-
Notify state: When on_error is triggered, set a reconnect signal. In Python, use threading.Event, a thread-safe signal flag.
-
Reconnect loop: Wrap the main logic in a for loop (for example, retry 3 times). When the reconnect signal is detected, interrupt the current recognition, clean up resources, wait a few seconds, and then create a new connection on the next iteration.
-
Use a heartbeat to keep the connection alive: When you need a long-lived connection to the server, set the heartbeat parameter to true. The connection stays alive even when the audio is silent for a long time.
-
Rate limits: When you call the model interfaces, stay within the limits described in Rate limits.
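The client-side reconnection steps above can be sketched as follows. This is a minimal illustration that does not import the SDK: ReconnectCallback stands in for the SDK callback class, and run_session is a hypothetical function that would open a connection, stream audio, stop once the signal is set, and release its own resources.

```python
import threading
import time


class ReconnectCallback:
    """Stand-in for the SDK callback class; only on_error is sketched."""

    def __init__(self, reconnect_signal: threading.Event):
        self.reconnect_signal = reconnect_signal

    def on_error(self, error) -> None:
        # Notify state: flag the main loop instead of reconnecting in place.
        self.reconnect_signal.set()


def run_with_reconnect(run_session, max_attempts: int = 3,
                       wait_seconds: float = 3.0) -> int:
    """Call run_session(callback, signal) until one attempt ends without error.

    Returns the number of attempts used; raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        signal = threading.Event()      # fresh flag for each connection
        run_session(ReconnectCallback(signal), signal)
        if not signal.is_set():
            return attempt              # session finished cleanly
        if attempt < max_attempts:
            time.sleep(wait_seconds)    # pause before the next connection
    raise ConnectionError(f"gave up after {max_attempts} attempts")


# Demo with a fake session that fails twice, then succeeds.
failures_left = [2]


def flaky_session(callback, signal):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        callback.on_error("simulated network error")


print(run_with_reconnect(flaky_session, wait_seconds=0))  # prints 3
```

A fresh threading.Event per attempt keeps a late on_error call from an old connection from cancelling the new one.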
Supported scope
Available models vary by deployment scope:
International
If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.
To call any of the following models, use an API key from the Singapore region:
-
Fun-ASR: fun-asr-realtime (stable, currently equivalent to fun-asr-realtime-2025-11-07), fun-asr-realtime-2025-11-07 (snapshot)
-
Qwen3-ASR-Flash-Realtime: qwen3-asr-flash-realtime (stable, currently equivalent to qwen3-asr-flash-realtime-2025-10-27), qwen3-asr-flash-realtime-2026-02-10 (latest snapshot), qwen3-asr-flash-realtime-2025-10-27 (snapshot)
Mainland China
If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).
To call any of the following models, use an API key from the Beijing region:
-
Fun-ASR: fun-asr-realtime (stable, currently equivalent to fun-asr-realtime-2025-11-07), fun-asr-realtime-2026-02-28 (latest snapshot), fun-asr-realtime-2025-11-07 (snapshot), fun-asr-realtime-2025-09-15 (snapshot)
-
fun-asr-flash-8k-realtime (stable, currently equivalent to fun-asr-flash-8k-realtime-2026-01-28), fun-asr-flash-8k-realtime-2026-01-28
-
Qwen3-ASR-Flash-Realtime: qwen3-asr-flash-realtime (stable, currently equivalent to qwen3-asr-flash-realtime-2025-10-27), qwen3-asr-flash-realtime-2026-02-10 (latest snapshot), qwen3-asr-flash-realtime-2025-10-27 (snapshot)
-
Paraformer: paraformer-realtime-v2, paraformer-realtime-v1, paraformer-realtime-8k-v2, paraformer-realtime-8k-v1
API reference
FAQ
What audio formats does real-time speech recognition support?
Fun-ASR and Paraformer support pcm, wav, mp3, opus, speex, aac, and amr. For Qwen-ASR, use pcm or opus. Other formats (such as wav, aac, and amr) pass session.update validation but may fail server-side decoding. Confirm the audio stream uses a recommended format before sending.
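Because unsupported Qwen-ASR formats can pass session.update validation and only fail later, a client-side check before sending can save a round trip. The sketch below encodes the format lists from this answer; the family labels and the function name are hypothetical.

```python
# Format lists taken from the answer above; the family keys are local labels.
COMMON_FORMATS = {"pcm", "wav", "mp3", "opus", "speex", "aac", "amr"}
SUPPORTED_FORMATS = {
    "fun-asr": COMMON_FORMATS,
    "paraformer": COMMON_FORMATS,
    "qwen-asr": {"pcm", "opus"},
}


def check_audio_format(model_family: str, audio_format: str) -> None:
    """Raise ValueError if the format is not recommended for the model family."""
    allowed = SUPPORTED_FORMATS.get(model_family)
    if allowed is None:
        raise ValueError(f"unknown model family: {model_family}")
    if audio_format.lower() not in allowed:
        raise ValueError(
            f"{audio_format} is not recommended for {model_family}; "
            f"use one of {sorted(allowed)}"
        )


check_audio_format("qwen-asr", "pcm")    # passes silently
check_audio_format("paraformer", "amr")  # passes silently
```

Calling check_audio_format("qwen-asr", "wav") raises ValueError, which surfaces the problem before any audio is streamed.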
When should I use the SDK vs. the WebSocket API?
The DashScope SDK wraps WebSocket connection management, authentication, and reconnection, which makes it the fastest path to integration. The WebSocket API gives you direct, fine-grained control. Use it when the SDK doesn't cover your language, or when you need custom connection handling. The SDK is the recommended choice for most use cases.
How do I improve recognition accuracy for proper nouns?
Use hotwords (supported by Fun-ASR and Paraformer). Hotwords work well for fixed vocabulary.
What can I do when the connection keeps dropping?
Implement client-side reconnection logic, and enable the heartbeat parameter (heartbeat=true) to prevent the connection from dropping during long silent periods. For detailed fault-tolerance strategies, see Apply in production.