All Products
Search
Document Center

Alibaba Cloud Model Studio:Real-time speech recognition - Qwen

Last Updated:May 18, 2026

Real-time speech recognition transcribes audio streams to punctuated text over WebSocket, with low latency suited for live captions, online meetings, voice chat, and smart assistants.

Overview

A WebSocket streaming protocol delivers audio to the service and returns transcribed text with low latency.

  • High-accuracy recognition for Mandarin and dialects, including Cantonese and Sichuanese.

  • Robust performance in complex acoustic environments, with automatic language detection and non-speech filtering.

  • Emotion recognition across states such as surprise, calm, joy, sadness, disgust, anger, and fear.

  • Custom hotwords that improve recognition accuracy for specified terms.

  • Timestamp output for structured recognition results.

  • Configurable sample rates and multiple audio formats to fit different recording environments.

For batch scenarios such as meeting transcription, call analysis, and subtitle generation, use Non-real-time speech recognition. For model selection guidance, see Speech-to-text.

Prerequisites

Quick start

The following examples show how to call real-time speech recognition through the DashScope SDK.

Fun-ASR

Recognize microphone audio

Capture audio from the microphone and stream transcribed text in real time, so text appears as the speaker talks.

Java

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask());
        executorService.shutdown();
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("wav")
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                if (result.isSentenceEnd()) {
                    System.out.println("Final Result: " + result.getSentence().getText());
                } else {
                    System.out.println("Intermediate Result: " + result.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println("RecognitionCallback error: " + e.getMessage());
            }
        };
        try {
            recognizer.call(param, callback);
            // Create an audio format.
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            // Match the default recording device based on the format.
            TargetDataLine targetDataLine =
                    AudioSystem.getTargetDataLine(audioFormat);
            targetDataLine.open(audioFormat);
            // Start recording.
            targetDataLine.start();
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            long start = System.currentTimeMillis();
            // Record for 50 seconds and perform real-time transcription.
            while (System.currentTimeMillis() - start < 50000) {
                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                if (read > 0) {
                    buffer.limit(read);
                    // Send the recorded audio data to the streaming recognition service.
                    recognizer.sendAudioFrame(buffer);
                    buffer = ByteBuffer.allocate(1024);
                    // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
                    Thread.sleep(20);
                }
            }
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Python

The Python example requires the pyaudio library for audio capture. Install it with pip install pyaudio before running the example.

import os
import signal  # for keyboard events handling (press "Ctrl+C" to terminate recording)
import sys

import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


# Real-time speech recognition callback
class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=1,
                          rate=16000,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_complete(self) -> None:
        print('RecognitionCallback completed.')  # recognition completed

    def on_error(self, message) -> None:
        print('RecognitionCallback task_id: ', message.request_id)
        print('RecognitionCallback error: ', message.message)
        # Stop and close the audio stream if it is running
        if 'stream' in globals() and stream.active:
            stream.stop()
            stream.close()
        # Forcefully exit the program
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
    print('Ctrl+C pressed, stop recognition ...')
    # Stop recognition
    recognition.stop()
    print('Recognition stopped.')
    print(
        '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
        .format(
            recognition.get_last_request_id(),
            recognition.get_first_package_delay(),
            recognition.get_last_package_delay(),
        ))
    # Forcefully exit the program
    sys.exit(0)


# main function
if __name__ == '__main__':
    # The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

    # Create the recognition callback
    callback = Callback()

    # Call recognition service by async mode, you can customize the recognition parameters, like model, format,
    # sample_rate
    recognition = Recognition(
        model='fun-asr-realtime',
        format=format_pcm,
        # 'pcm', 'wav', 'opus', 'speex', 'aac', 'amr'. You can check the supported formats in the document.
        sample_rate=sample_rate,
        # Supports 8000, 16000.
        semantic_punctuation_enabled=False,
        callback=callback)

    # Start recognition
    recognition.start()

    signal.signal(signal.SIGINT, signal_handler)
    print("Press 'Ctrl+C' to stop recording and recognition...")
    # Create a keyboard listener until "Ctrl+C" is pressed

    while True:
        if stream:
            data = stream.read(3200, exception_on_overflow=False)
            recognition.send_audio_frame(data)
        else:
            break

    recognition.stop()

Recognize a local audio file

Transcribe a local audio file. This fits short, near-real-time scenarios such as voice chat, voice commands, speech input, and voice search.

Java

The audio file used in the example is asr_example.wav.

import com.alibaba.dashscope.api.GeneralApi;
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.base.HalfDuplexParamBase;
import com.alibaba.dashscope.common.GeneralListParam;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.protocol.GeneralServiceOption;
import com.alibaba.dashscope.protocol.HttpMethod;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.protocol.StreamingMode;
import com.alibaba.dashscope.utils.Constants;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        // In a real application, call this method only once at program startup.
        warmUp();

        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask(Paths.get(System.getProperty("user.dir"), "asr_example.wav")));
        executorService.shutdown();

        // Wait for all tasks to complete.
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }

    public static void warmUp() {
        try {
            // Lightweight GET request to establish a connection.
            GeneralServiceOption warmupOption = GeneralServiceOption.builder()
                    .protocol(Protocol.HTTP)
                    .httpMethod(HttpMethod.GET)
                    .streamingMode(StreamingMode.OUT)
                    .path("assistants")
                    .build();

            warmupOption.setBaseHttpUrl(Constants.baseHttpApiUrl);
            GeneralApi<HalfDuplexParamBase> api = new GeneralApi<>();
            api.get(GeneralListParam.builder().limit(1L).build(), warmupOption);
        } catch (Exception e) {
            // Allow retry if warm-up fails.
        }
    }
}

class RealtimeRecognitionTask implements Runnable {
    private Path filepath;

    public RealtimeRecognitionTask(Path filepath) {
        this.filepath = filepath;
    }

    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("wav")
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        String threadName = Thread.currentThread().getName();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult message) {
                if (message.isSentenceEnd()) {

                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Final Result:" + message.getSentence().getText());
                } else {
                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Intermediate Result: " + message.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println(TimeUtils.getTimestamp()+" "+
                        "[" + threadName + "] RecognitionCallback error: " + e.getMessage());
            }
        };

        try {
            recognizer.call(param, callback);
            // Replace the path with your audio file path.
            System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Input file_path is: " + this.filepath);
            // Read the file and send audio in chunks.
            FileInputStream fis = new FileInputStream(this.filepath.toFile());
            byte[] allData = new byte[fis.available()];
            int ret = fis.read(allData);
            fis.close();

            int sendFrameLength = 3200;
            for (int i = 0; i * sendFrameLength < allData.length; i ++) {
                int start = i * sendFrameLength;
                int end = Math.min(start + sendFrameLength, allData.length);
                ByteBuffer byteBuffer = ByteBuffer.wrap(allData, start, end - start);
                recognizer.sendAudioFrame(byteBuffer);
                Thread.sleep(100);
            }

            System.out.println(TimeUtils.getTimestamp()+" "+LocalDateTime.now());
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "["
                        + threadName
                        + "][Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Python

The audio file used in the example is asr_example.wav.

import os
import time
import dashscope
from dashscope.audio.asr import *

# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not set an environment variable, replace the next line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following URL is for the Singapore region. To use the Beijing region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

from datetime import datetime


def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp


class Callback(RecognitionCallback):
    def on_complete(self) -> None:
        print(get_timestamp() + ' Recognition completed')  # recognition complete

    def on_error(self, result: RecognitionResult) -> None:
        print('Recognition task_id: ', result.request_id)
        print('Recognition error: ', result.message)
        exit(0)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print(get_timestamp() + ' RecognitionCallback text: ', sentence['text'])
        if RecognitionResult.is_sentence_end(sentence):
            print(get_timestamp() +
                  'RecognitionCallback sentence end, request_id:%s, usage:%s'
                  % (result.get_request_id(), result.get_usage(sentence)))


callback = Callback()

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=callback)

try:
    audio_data: bytes = None
    f = open("asr_example.wav", 'rb')
    if os.path.getsize("asr_example.wav"):
        # Read the entire file into a buffer
        file_buffer = f.read()
        f.close()
        print("Start Recognition")
        recognition.start()

        # Send data in chunks of 3200 bytes
        buffer_size = len(file_buffer)
        offset = 0
        chunk_size = 3200

        while offset < buffer_size:
            # Calculate the size of the current chunk
            remaining_bytes = buffer_size - offset
            current_chunk_size = min(chunk_size, remaining_bytes)

            # Extract the current chunk from the buffer
            audio_data = file_buffer[offset:offset + current_chunk_size]

            # Send the audio frame
            recognition.send_audio_frame(audio_data)
            # Update the offset
            offset += current_chunk_size

            # Add a delay to simulate real-time transmission
            time.sleep(0.1)

        recognition.stop()
    else:
        raise Exception(
            'The supplied file was empty (zero bytes long)')
except Exception as e:
    raise e

print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Qwen-ASR

Note

The example reads your_audio_file.pcm (PCM16, 16 kHz, mono). To convert from MP3, WAV, or other formats, use ffmpeg:

ffmpeg -i your_audio.mp3 -ar 16000 -ac 1 -f s16le your_audio_file.pcm

Java

import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.sound.sampled.LineUnavailableException;
import java.io.File;
import java.io.FileInputStream;
import java.util.Base64;
import java.util.Collections;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class Qwen3AsrRealtimeUsage {
    private static final Logger log = LoggerFactory.getLogger(Qwen3AsrRealtimeUsage.class);
    private static final int AUDIO_CHUNK_SIZE = 1024; // Audio chunk size in bytes
    private static final int SLEEP_INTERVAL_MS = 30;  // Sleep interval in milliseconds

    public static void main(String[] args) throws InterruptedException, LineUnavailableException {
        CountDownLatch finishLatch = new CountDownLatch(1);

        OmniRealtimeParam param = OmniRealtimeParam.builder()
                .model("qwen3-asr-flash-realtime")
                // The following URL is for the Singapore region. To use the Beijing region model, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
                .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                // The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the line below with your Model Studio API Key: .apikey("sk-xxx")
                .apikey(System.getenv("DASHSCOPE_API_KEY"))
                .build();

        OmniRealtimeConversation conversation = null;
        final AtomicReference<OmniRealtimeConversation> conversationRef = new AtomicReference<>(null);
        conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
            @Override
            public void onOpen() {
                System.out.println("connection opened");
            }
            @Override
            public void onEvent(JsonObject message) {
                String type = message.get("type").getAsString();
                switch(type) {
                    case "session.created":
                        System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
                        break;
                    case "conversation.item.input_audio_transcription.completed":
                        System.out.println("transcription: " + message.get("transcript").getAsString());
                        finishLatch.countDown();
                        break;
                    case "input_audio_buffer.speech_started":
                        System.out.println("======VAD Speech Start======");
                        break;
                    case "input_audio_buffer.speech_stopped":
                        System.out.println("======VAD Speech Stop======");
                        break;
                    case "conversation.item.input_audio_transcription.text":
                        System.out.println("transcription: " + message.get("text").getAsString() + message.get("stash").getAsString());
                        break;
                    default:
                        break;
                }
            }
            @Override
            public void onClose(int code, String reason) {
                System.out.println("connection closed code: " + code + ", reason: " + reason);
            }
        });
        conversationRef.set(conversation);
        try {
            conversation.connect();
        } catch (NoApiKeyException e) {
            throw new RuntimeException(e);
        }

        OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
        transcriptionParam.setLanguage("en");
        transcriptionParam.setInputAudioFormat("pcm");
        transcriptionParam.setInputSampleRate(16000);

        OmniRealtimeConfig config = OmniRealtimeConfig.builder()
                .modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
                .transcriptionConfig(transcriptionParam)
                .build();
        conversation.updateSession(config);

        String filePath = "your_audio_file.pcm";
        File audioFile = new File(filePath);
        if (!audioFile.exists()) {
            log.error("Audio file not found: {}", filePath);
            return;
        }

        try (FileInputStream audioInputStream = new FileInputStream(audioFile)) {
            byte[] audioBuffer = new byte[AUDIO_CHUNK_SIZE];
            int bytesRead;
            int totalBytesRead = 0;

            log.info("Starting to send audio data from: {}", filePath);

            // Read and send audio data in chunks
            while ((bytesRead = audioInputStream.read(audioBuffer)) != -1) {
                totalBytesRead += bytesRead;
                String audioB64 = Base64.getEncoder().encodeToString(audioBuffer);
                // Send audio chunk to conversation
                conversation.appendAudio(audioB64);

                // Add small delay to simulate real-time audio streaming
                Thread.sleep(SLEEP_INTERVAL_MS);
            }

            log.info("Finished sending audio data. Total bytes sent: {}", totalBytesRead);

        } catch (Exception e) {
            log.error("Error sending audio from file: {}", filePath, e);
        }

        //send session.finish and wait for finish and close
        conversation.endSession();
        log.info("task finished");

        System.exit(0);
    }
}

Python

import logging
import os
import base64
import signal
import sys
import time
import dashscope
from dashscope.audio.qwen_omni import *
from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams

def setup_logging():
    """Configure logging output"""
    logger = logging.getLogger('dashscope')
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.propagate = False
    return logger

def init_api_key():
    """Initialize API Key"""
    # The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the line below with your Model Studio API Key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY', 'YOUR_API_KEY')
    if dashscope.api_key == 'YOUR_API_KEY':
        print('[Warning] Using placeholder API key, set DASHSCOPE_API_KEY environment variable.')

class MyCallback(OmniRealtimeCallback):
    """Real-time recognition callback handler"""
    def __init__(self, conversation):
        self.conversation = conversation
        self.handlers = {
            'session.created': self._handle_session_created,
            'conversation.item.input_audio_transcription.completed': self._handle_final_text,
            'conversation.item.input_audio_transcription.text': self._handle_transcription_text,
            'input_audio_buffer.speech_started': lambda r: print('======Speech Start======'),
            'input_audio_buffer.speech_stopped': lambda r: print('======Speech Stop======')
        }

    def on_open(self):
        print('Connection opened')

    def on_close(self, code, msg):
        print(f'Connection closed, code: {code}, msg: {msg}')

    def on_event(self, response):
        try:
            handler = self.handlers.get(response['type'])
            if handler:
                handler(response)
        except Exception as e:
            print(f'[Error] {e}')

    def _handle_session_created(self, response):
        print(f"Start session: {response['session']['id']}")

    def _handle_final_text(self, response):
        print(f"Final recognized text: {response['transcript']}")

    def _handle_transcription_text(self, response):
        print(f"Got transcription result: {response['text'] + response['stash']}")

def read_audio_chunks(file_path, chunk_size=3200):
    """Read audio file in chunks"""
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            yield chunk

def send_audio(conversation, file_path, delay=0.1):
    """Send audio data"""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Audio file {file_path} does not exist.")

    print("Processing audio file... Press 'Ctrl+C' to stop.")
    for chunk in read_audio_chunks(file_path):
        audio_b64 = base64.b64encode(chunk).decode('ascii')
        conversation.append_audio(audio_b64)
        time.sleep(delay)

def main():
    setup_logging()
    init_api_key()

    audio_file_path = "./your_audio_file.pcm"
    callback = MyCallback(conversation=None)
    conversation = OmniRealtimeConversation(
        model='qwen3-asr-flash-realtime',
        # The following URL is for the Singapore region. To use the Beijing region model, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
        url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime',
        callback=callback,
    )
    callback.conversation = conversation  # Inject conversation into the callback so its methods can be invoked from the callback

    def handle_exit(sig, frame):
        print('Ctrl+C pressed, exiting...')
        conversation.close()
        sys.exit(0)

    signal.signal(signal.SIGINT, handle_exit)

    conversation.connect()

    transcription_params = TranscriptionParams(
        language='en',
        sample_rate=16000,
        input_audio_format="pcm"
    )

    conversation.update_session(
        output_modalities=[MultiModality.TEXT],
        enable_input_audio_transcription=True,
        transcription_params=transcription_params
    )

    try:
        send_audio(conversation, audio_file_path)
        # send session.finish and wait for finished and close
        conversation.end_session()
    except Exception as e:
        print(f"Error occurred: {e}")
    finally:
        conversation.close()
        print("Audio processing completed.")

if __name__ == '__main__':
    main()

Paraformer

Paraformer reuses the Fun-ASR example code. Replace the model parameter with the Paraformer model name.

Advanced features

Qwen-ASR interaction modes

The Qwen-ASR Realtime API supports two interaction modes:

  • VAD mode (default): The server detects the start and end of speech automatically (turn detection). Suitable for real-time conversation, meeting transcription, and similar scenarios. To enable, configure session.turn_detection (enabled by default).

  • Manual mode: The client controls turn detection by sending input_audio_buffer.commit. Suitable for scenarios that need explicit control over send timing, such as voice messages in chat apps. To enable, set session.turn_detection to null.

Switch between modes:

  • WebSocket: Set the turn_detection field in the session.update event.

    {
        "type": "session.update",
        "session": {
            "turn_detection": null
        }
    }
  • Python SDK: Set the enable_turn_detection parameter in the update_session method.

    conversation.update_session(
        enable_turn_detection=False
    )
  • Java SDK: Set the enableTurnDetection parameter through OmniRealtimeConfig.builder().

    OmniRealtimeConfig config = OmniRealtimeConfig.builder()
            .enableTurnDetection(false)
            .build();
    conversation.updateSession(config);

For complete SDK code examples, see Qwen-ASR-Realtime Python SDK - API reference and Java SDK. For the WebSocket event lifecycle, see Event flow.

VAD turn detection

Voice Activity Detection (VAD) determines when continuous speech ends, which triggers the final-result event. All three model families enable server-side VAD by default, but the parameter names and tunability differ:

  • Qwen-ASR: Configured through session.turn_detection. Includes silence_duration_ms (silence duration threshold; the turn ends when silence exceeds this value; server default 800; recommended 400 for conversation and chat scenarios that need fast turn detection) and threshold (VAD detection sensitivity; server default 0.2). Qwen-ASR also supports a Manual mode that disables VAD and lets the client control turn detection through commit. See Qwen-ASR interaction modes above.

  • Fun-ASR and Paraformer: Configured through max_sentence_silence (VAD turn-detection silence threshold, in milliseconds). When the silence after a speech segment exceeds this threshold, the sentence is considered ended.

The parameter name varies by protocol: the same concept is silence_duration_ms in Qwen-ASR and max_sentence_silence in Fun-ASR and Paraformer. For complete field definitions, see API reference.

Improve accuracy with hotwords

Fun-ASR and Paraformer support hotwords to improve recognition accuracy for specific terms, such as brand names, personal names, and proper nouns.

For hotword configuration and usage, see Custom hotwords.

Timestamps

Fun-ASR and Paraformer return timestamps at two granularities by default: sentence level and word level. These support subtitle alignment, keyword highlighting, karaoke-style sing-along, and similar use cases. Qwen-ASR Realtime (qwen3-asr-flash-realtime) does not currently return timestamps. For timestamps, use Fun-ASR or Paraformer. The Qwen-ASR file transcription model qwen3-asr-flash-filetrans supports word-level timestamps. See Non-real-time speech recognition.

Timestamps are reported in milliseconds at two levels:

  • Sentence level: payload.output.sentence.begin_time and payload.output.sentence.end_time mark the start and end of the sentence within the audio. In intermediate results, end_time may be null. The final value is populated when the sentence ends (sentence_end = true).

  • Word level: The payload.output.sentence.words array. Each element contains begin_time, end_time, text (the character or word text), and punctuation (the punctuation that follows the character; an empty string when none).

Example response (excerpt):

{
  "payload": {
    "output": {
      "sentence": {
        "begin_time": 170,
        "end_time": 920,
        "text": "Okay, I got it.",
        "sentence_end": true,
        "words": [
          { "begin_time": 170, "end_time": 295, "text": "Okay", "punctuation": "," },
          { "begin_time": 295, "end_time": 503, "text": "I", "punctuation": "" },
          { "begin_time": 503, "end_time": 711, "text": "got", "punctuation": "" },
          { "begin_time": 711, "end_time": 920, "text": "it", "punctuation": "" }
        ]
      }
    }
  }
}

The field names above use the WebSocket JSON path. Each SDK exposes these fields with its own naming convention (dictionary keys, object properties, getter methods, and so on). For the full field mapping, see the API reference of each SDK.

For complete field definitions, see API reference.

Emotion recognition

Some Qwen-ASR and Paraformer models include the speaker's emotional state in transcription results. The output granularity and the way to enable it differ between the two.

Qwen-ASR (qwen3-asr-flash-realtime): Always on, no configuration needed. The top-level emotion field is returned in both the conversation.item.input_audio_transcription.text and conversation.item.input_audio_transcription.completed events. The value is one of seven fine-grained emotions: surprised, neutral, happy, sad, disgusted, angry, and fearful.

{
  "type": "conversation.item.input_audio_transcription.text",
  "emotion": "neutral",
  "text": "The weather is nice today.",
  "stash": ""
}

Paraformer (paraformer-realtime-8k-v2): Only this Paraformer model supports emotion recognition. Results are returned through payload.output.sentence.emo_tag and payload.output.sentence.emo_confidence. The value is one of three polarities: positive (positive emotions such as happiness or satisfaction), negative (negative emotions such as anger or low spirits), and neutral (no clear emotion). The confidence range is [0.0, 1.0].

Emotion recognition is returned only when all of the following conditions are met:

  • The model is paraformer-realtime-8k-v2.

  • Semantic turn detection is disabled: semantic_punctuation_enabled = false (the default; no special setting needed).

  • The result is returned only in the sentence-end event where sentence_end = true.

To suppress the emotion recognition fields, set semantic_punctuation_enabled to true. This enables semantic turn detection, and the emo_tag and emo_confidence fields are no longer returned.

The field names above use the WebSocket JSON path. Each SDK exposes these fields with its own naming convention (dictionary keys, object properties, getter methods, and so on). For the full field mapping, see the API reference of each SDK.

For complete field definitions, value constraints, and examples, see API reference.

Raw WebSocket protocol

The following examples show how to connect to the server through the raw WebSocket protocol, suitable for scenarios that don't use the DashScope SDK. These are minimal runnable implementations. For the WebSocket protocol of each model, see API reference.

Show raw WebSocket protocol examples

Fun-ASR

The example uses the audio file asr_example.wav.

Node.js

Install the required dependencies:

npm install ws
npm install uuid

Example code:

const fs = require('fs');
const WebSocket = require('ws');
const { v4: uuidv4 } = require('uuid'); // Used to generate UUID

// The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured the environment variable, replace the line below with your Model Studio API Key: const apiKey = "sk-xxx"
const apiKey = process.env.DASHSCOPE_API_KEY;
// The following URL is for the Singapore region. To use the Beijing region model, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
const url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server address
const audioFile = 'asr_example.wav'; // Replace with your audio file path

// Generate a 32-character random ID
const TASK_ID = uuidv4().replace(/-/g, '').slice(0, 32);

// Create a WebSocket client
const ws = new WebSocket(url, {
  headers: {
    Authorization: `bearer ${apiKey}`
  }
});

let taskStarted = false; // Flag indicating whether the task has started

// Send the run-task command when the connection is opened
ws.on('open', () => {
  console.log('Connected to server');
  sendRunTask();
});

// Handle received messages
ws.on('message', (data) => {
  const message = JSON.parse(data);
  switch (message.header.event) {
    case 'task-started':
      console.log('Task started');
      taskStarted = true;
      sendAudioStream();
      break;
    case 'result-generated':
      console.log('Recognition result: ', message.payload.output.sentence.text);
      if (message.payload.usage) {
        console.log('Task billing duration (seconds): ', message.payload.usage.duration);
      }
      break;
    case 'task-finished':
      console.log('Task finished');
      ws.close();
      break;
    case 'task-failed':
      console.error('Task failed: ', message.header.error_message);
      ws.close();
      break;
    default:
      console.log('Unknown event: ', message.header.event);
  }
});

// Close the connection if the task-started event is not received
ws.on('close', () => {
  if (!taskStarted) {
    console.error('Task not started, closing connection');
  }
});

// Send the run-task command
function sendRunTask() {
  const runTaskMessage = {
    header: {
      action: 'run-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      task_group: 'audio',
      task: 'asr',
      function: 'recognition',
      model: 'fun-asr-realtime',
      parameters: {
        sample_rate: 16000,
        format: 'wav'
      },
      input: {}
    }
  };
  ws.send(JSON.stringify(runTaskMessage));
}

// Send the audio stream
function sendAudioStream() {
  const audioStream = fs.createReadStream(audioFile);
  let chunkCount = 0;

  function sendNextChunk() {
    const chunk = audioStream.read();
    if (chunk) {
      ws.send(chunk);
      chunkCount++;
      setTimeout(sendNextChunk, 100); // Send every 100 ms
    }
  }

  audioStream.on('readable', () => {
    sendNextChunk();
  });

  audioStream.on('end', () => {
    console.log('Audio stream ended');
    sendFinishTask();
  });

  audioStream.on('error', (err) => {
    console.error('Failed to read audio file: ', err);
    ws.close();
  });
}

// Send the finish-task command
function sendFinishTask() {
  const finishTaskMessage = {
    header: {
      action: 'finish-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      input: {}
    }
  };
  ws.send(JSON.stringify(finishTaskMessage));
}

// Error handling
ws.on('error', (error) => {
  console.error('WebSocket error: ', error);
});

C#

Example code:

using System.Net.WebSockets;
using System.Text;
using System.Text.Json;
using System.Text.Json.Nodes;

class Program {
    private static ClientWebSocket _webSocket = new ClientWebSocket();
    private static CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
    private static bool _taskStartedReceived = false;
    private static bool _taskFinishedReceived = false;
    // The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    // If you have not configured the environment variable, replace the line below with your Model Studio API Key: private static readonly string ApiKey = "sk-xxx"
    private static readonly string ApiKey = Environment.GetEnvironmentVariable("DASHSCOPE_API_KEY") ?? throw new InvalidOperationException("DASHSCOPE_API_KEY environment variable is not set.");

    // The following URL is for the Singapore region. To use the Beijing region model, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
    private const string WebSocketUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/";
    // Replace with your audio file path
    private const string AudioFilePath = "asr_example.wav";

    static async Task Main(string[] args) {
        // Establish the WebSocket connection and configure headers for authentication
        _webSocket.Options.SetRequestHeader("Authorization", $"bearer {ApiKey}");

        await _webSocket.ConnectAsync(new Uri(WebSocketUrl), _cancellationTokenSource.Token);

        // Start a thread to asynchronously receive WebSocket messages
        var receiveTask = ReceiveMessagesAsync();

        // Send the run-task command
        string _taskId = Guid.NewGuid().ToString("N"); // Generate a 32-character random ID
        var runTaskJson = GenerateRunTaskJson(_taskId);
        await SendAsync(runTaskJson);

        // Wait for the task-started event
        while (!_taskStartedReceived) {
            await Task.Delay(100, _cancellationTokenSource.Token);
        }

        // Read the local file and send the audio stream to be recognized to the server
        await SendAudioStreamAsync(AudioFilePath);

        // Send the finish-task command to end the task
        var finishTaskJson = GenerateFinishTaskJson(_taskId);
        await SendAsync(finishTaskJson);

        // Wait for the task-finished event
        while (!_taskFinishedReceived && !_cancellationTokenSource.IsCancellationRequested) {
            try {
                await Task.Delay(100, _cancellationTokenSource.Token);
            } catch (OperationCanceledException) {
                // Task has been cancelled, exit the loop
                break;
            }
        }

        // Close the connection
        if (!_cancellationTokenSource.IsCancellationRequested) {
            await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", _cancellationTokenSource.Token);
        }

        _cancellationTokenSource.Cancel();
        try {
            await receiveTask;
        } catch (OperationCanceledException) {
            // Ignore the operation cancelled exception
        }
    }

    private static async Task ReceiveMessagesAsync() {
        try {
            while (_webSocket.State == WebSocketState.Open && !_cancellationTokenSource.IsCancellationRequested) {
                var message = await ReceiveMessageAsync(_cancellationTokenSource.Token);
                if (message != null) {
                    var eventValue = message["header"]?["event"]?.GetValue<string>();
                    switch (eventValue) {
                        case "task-started":
                            Console.WriteLine("Task started successfully");
                            _taskStartedReceived = true;
                            break;
                        case "result-generated":
                            Console.WriteLine($"Recognition result: {message["payload"]?["output"]?["sentence"]?["text"]?.GetValue<string>()}");
                            if (message["payload"]?["usage"] != null && message["payload"]?["usage"]?["duration"] != null) {
                                Console.WriteLine($"Task billing duration (seconds): {message["payload"]?["usage"]?["duration"]?.GetValue<int>()}");
                            }
                            break;
                        case "task-finished":
                            Console.WriteLine("Task finished");
                            _taskFinishedReceived = true;
                            _cancellationTokenSource.Cancel();
                            break;
                        case "task-failed":
                            Console.WriteLine($"Task failed: {message["header"]?["error_message"]?.GetValue<string>()}");
                            _cancellationTokenSource.Cancel();
                            break;
                    }
                }
            }
        } catch (OperationCanceledException) {
            // Ignore the operation cancelled exception
        }
    }

    private static async Task<JsonNode?> ReceiveMessageAsync(CancellationToken cancellationToken) {
        var buffer = new byte[1024 * 4];
        var segment = new ArraySegment<byte>(buffer);
        var result = await _webSocket.ReceiveAsync(segment, cancellationToken);

        if (result.MessageType == WebSocketMessageType.Close) {
            await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", cancellationToken);
            return null;
        }

        var message = Encoding.UTF8.GetString(buffer, 0, result.Count);
        return JsonNode.Parse(message);
    }

    private static async Task SendAsync(string message) {
        var buffer = Encoding.UTF8.GetBytes(message);
        var segment = new ArraySegment<byte>(buffer);
        await _webSocket.SendAsync(segment, WebSocketMessageType.Text, true, _cancellationTokenSource.Token);
    }

    private static async Task SendAudioStreamAsync(string filePath) {
        using (var audioStream = File.OpenRead(filePath)) {
            var buffer = new byte[1024]; // Send 100 ms of audio data each time
            int bytesRead;

            while ((bytesRead = await audioStream.ReadAsync(buffer, 0, buffer.Length)) > 0) {
                var segment = new ArraySegment<byte>(buffer, 0, bytesRead);
                await _webSocket.SendAsync(segment, WebSocketMessageType.Binary, true, _cancellationTokenSource.Token);
                await Task.Delay(100); // 100 ms interval
            }
        }
    }

    private static string GenerateRunTaskJson(string taskId) {
        var runTask = new JsonObject {
            ["header"] = new JsonObject {
                ["action"] = "run-task",
                ["task_id"] = taskId,
                ["streaming"] = "duplex"
            },
            ["payload"] = new JsonObject {
                ["task_group"] = "audio",
                ["task"] = "asr",
                ["function"] = "recognition",
                ["model"] = "fun-asr-realtime",
                ["parameters"] = new JsonObject {
                    ["format"] = "wav",
                    ["sample_rate"] = 16000,
                },
                ["input"] = new JsonObject()
            }
        };
        return JsonSerializer.Serialize(runTask);
    }

    private static string GenerateFinishTaskJson(string taskId) {
        var finishTask = new JsonObject {
            ["header"] = new JsonObject {
                ["action"] = "finish-task",
                ["task_id"] = taskId,
                ["streaming"] = "duplex"
            },
            ["payload"] = new JsonObject {
                ["input"] = new JsonObject()
            }
        };
        return JsonSerializer.Serialize(finishTask);
    }
}

PHP

The example code uses the following directory structure:

my-php-project/

├── composer.json

├── vendor/

└── index.php

The composer.json file is shown below. Set the dependency versions as appropriate for your project:

{
    "require": {
        "react/event-loop": "^1.3",
        "react/socket": "^1.11",
        "react/stream": "^1.2",
        "react/http": "^1.1",
        "ratchet/pawl": "^0.4"
    },
    "autoload": {
        "psr-4": {
            "App\\": "src/"
        }
    }
}

The index.php file:

<?php

require __DIR__ . '/vendor/autoload.php';

use Ratchet\Client\Connector;
use React\EventLoop\Loop;
use React\Socket\Connector as SocketConnector;
use Ratchet\rfc6455\Messaging\Frame;

// The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured the environment variable, replace the line below with your Model Studio API Key: $api_key = "sk-xxx"
$api_key = getenv("DASHSCOPE_API_KEY");
// The following URL is for the Singapore region. To use the Beijing region model, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
$websocket_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/';
$audio_file_path = 'asr_example.wav'; // Replace with your audio file path

$loop = Loop::get();

// Create a custom connector
$socketConnector = new SocketConnector($loop, [
    'tcp' => [
        'bindto' => '0.0.0.0:0',
    ],
    'tls' => [
        'verify_peer' => false,
        'verify_peer_name' => false,
    ],
]);

$connector = new Connector($loop, $socketConnector);

$headers = [
    'Authorization' => 'bearer ' . $api_key
];

$connector($websocket_url, [], $headers)->then(function ($conn) use ($loop, $audio_file_path) {
    echo "Connected to WebSocket server\n";

    // Start asynchronous thread to receive WebSocket messages
    $conn->on('message', function($msg) use ($conn, $loop, $audio_file_path) {
        $response = json_decode($msg, true);

        if (isset($response['header']['event'])) {
            handleEvent($conn, $response, $loop, $audio_file_path);
        } else {
            echo "Unknown message format\n";
        }
    });

    // Listen for connection close
    $conn->on('close', function($code = null, $reason = null) {
        echo "Connection closed\n";
        if ($code !== null) {
            echo "Close code: " . $code . "\n";
        }
        if ($reason !== null) {
            echo "Close reason: " . $reason . "\n";
        }
    });

    // Generate task ID
    $taskId = generateTaskId();

    // Send the run-task command
    sendRunTaskMessage($conn, $taskId);

}, function ($e) {
    echo "Failed to connect: {$e->getMessage()}\n";
});

$loop->run();

/**
 * Generate task ID
 * @return string
 */
function generateTaskId(): string {
    return bin2hex(random_bytes(16));
}

/**
 * Send the run-task command
 * @param $conn
 * @param $taskId
 */
function sendRunTaskMessage($conn, $taskId) {
    $runTaskMessage = json_encode([
        "header" => [
            "action" => "run-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "task_group" => "audio",
            "task" => "asr",
            "function" => "recognition",
            "model" => "fun-asr-realtime",
            "parameters" => [
                "format" => "wav",
                "sample_rate" => 16000
            ],
            "input" => []
        ]
    ]);
    echo "Preparing to send run-task command: " . $runTaskMessage . "\n";
    $conn->send($runTaskMessage);
    echo "run-task command sent\n";
}

/**
 * Read audio file
 * @param string $filePath
 * @return bool|string
 */
function readAudioFile(string $filePath) {
    $voiceData = file_get_contents($filePath);
    if ($voiceData === false) {
        echo "Failed to read audio file\n";
    }
    return $voiceData;
}

/**
 * Split audio data
 * @param string $data
 * @param int $chunkSize
 * @return array
 */
function splitAudioData(string $data, int $chunkSize): array {
    return str_split($data, $chunkSize);
}

/**
 * Send the finish-task command
 * @param $conn
 * @param $taskId
 */
function sendFinishTaskMessage($conn, $taskId) {
    $finishTaskMessage = json_encode([
        "header" => [
            "action" => "finish-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "input" => []
        ]
    ]);
    echo "Preparing to send finish-task command: " . $finishTaskMessage . "\n";
    $conn->send($finishTaskMessage);
    echo "finish-task command sent\n";
}

/**
 * Handle event
 * @param $conn
 * @param $response
 * @param $loop
 * @param $audio_file_path
 */
function handleEvent($conn, $response, $loop, $audio_file_path) {
    static $taskId;
    static $chunks;
    static $allChunksSent = false;

    if (is_null($taskId)) {
        $taskId = generateTaskId();
    }

    switch ($response['header']['event']) {
        case 'task-started':
            echo "Task started, sending audio data...\n";
            // Read audio file
            $voiceData = readAudioFile($audio_file_path);
            if ($voiceData === false) {
                echo "Failed to read audio file\n";
                $conn->close();
                return;
            }

            // Split audio data
            $chunks = splitAudioData($voiceData, 1024);

            // Define the send function
            $sendChunk = function() use ($conn, &$chunks, $loop, &$sendChunk, &$allChunksSent, $taskId) {
                if (!empty($chunks)) {
                    $chunk = array_shift($chunks);
                    $binaryMsg = new Frame($chunk, true, Frame::OP_BINARY);
                    $conn->send($binaryMsg);
                    // Send the next chunk after 100 ms
                    $loop->addTimer(0.1, $sendChunk);
                } else {
                    echo "All data chunks sent\n";
                    $allChunksSent = true;

                    // Send the finish-task command
                    sendFinishTaskMessage($conn, $taskId);
                }
            };

            // Start sending audio data
            $sendChunk();
            break;
        case 'result-generated':
            $result = $response['payload']['output']['sentence'];
            echo "Recognition result: " . $result['text'] . "\n";
            if (isset($response['payload']['usage']['duration'])) {
                echo "Task billing duration (seconds): " . $response['payload']['usage']['duration'] . "\n";
            }
            break;
        case 'task-finished':
            echo "Task finished\n";
            $conn->close();
            break;
        case 'task-failed':
            echo "Task failed\n";
            echo "Error code: " . $response['header']['error_code'] . "\n";
            echo "Error message: " . $response['header']['error_message'] . "\n";
            $conn->close();
            break;
        case 'error':
            echo "Error: " . $response['payload']['message'] . "\n";
            break;
        default:
            echo "Unknown event: " . $response['header']['event'] . "\n";
            break;
    }

    // If all data has been sent and the task is finished, close the connection
    if ($allChunksSent && $response['header']['event'] == 'task-finished') {
        // Wait 1 second to ensure all data has been transmitted
        $loop->addTimer(1, function() use ($conn) {
            $conn->close();
            echo "Client closes the connection\n";
        });
    }
}

Go

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/google/uuid"
	"github.com/gorilla/websocket"
)

const (
	// The following URL is for the Singapore region. To use the Beijing region model, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
	wsURL     = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/" // WebSocket server address
	audioFile = "asr_example.wav"                                   // Replace with your audio file path
)

var dialer = websocket.DefaultDialer

func main() {
	// The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
	// If you have not configured the environment variable, replace the line below with your Model Studio API Key: apiKey := "sk-xxx"
	apiKey := os.Getenv("DASHSCOPE_API_KEY")

	// Connect to the WebSocket service
	conn, err := connectWebSocket(apiKey)
	if err != nil {
		log.Fatal("Failed to connect WebSocket: ", err)
	}
	defer closeConnection(conn)

	// Start a goroutine to receive results
	taskStarted := make(chan bool)
	taskDone := make(chan bool)
	startResultReceiver(conn, taskStarted, taskDone)

	// Send the run-task command
	taskID, err := sendRunTaskCmd(conn)
	if err != nil {
		log.Fatal("Failed to send run-task command: ", err)
	}

	// Wait for the task-started event
	waitForTaskStarted(taskStarted)

	// Send the audio file stream to be recognized
	if err := sendAudioData(conn); err != nil {
		log.Fatal("Failed to send audio: ", err)
	}

	// Send the finish-task command
	if err := sendFinishTaskCmd(conn, taskID); err != nil {
		log.Fatal("Failed to send finish-task command: ", err)
	}

	// Wait for the task to finish or fail
	<-taskDone
}

// Define structs to represent JSON data
type Header struct {
	Action       string                 `json:"action"`
	TaskID       string                 `json:"task_id"`
	Streaming    string                 `json:"streaming"`
	Event        string                 `json:"event"`
	ErrorCode    string                 `json:"error_code,omitempty"`
	ErrorMessage string                 `json:"error_message,omitempty"`
	Attributes   map[string]interface{} `json:"attributes"`
}

type Output struct {
	Sentence struct {
		BeginTime int64  `json:"begin_time"`
		EndTime   *int64 `json:"end_time"`
		Text      string `json:"text"`
		Words     []struct {
			BeginTime   int64  `json:"begin_time"`
			EndTime     *int64 `json:"end_time"`
			Text        string `json:"text"`
			Punctuation string `json:"punctuation"`
		} `json:"words"`
	} `json:"sentence"`
}

type Payload struct {
	TaskGroup  string `json:"task_group"`
	Task       string `json:"task"`
	Function   string `json:"function"`
	Model      string `json:"model"`
	Parameters Params `json:"parameters"`
	Input      Input  `json:"input"`
	Output     Output `json:"output,omitempty"`
	Usage      *struct {
		Duration int `json:"duration"`
	} `json:"usage,omitempty"`
}

type Params struct {
	Format                   string `json:"format"`
	SampleRate               int    `json:"sample_rate"`
	DisfluencyRemovalEnabled bool   `json:"disfluency_removal_enabled"`
}

type Input struct {
}

type Event struct {
	Header  Header  `json:"header"`
	Payload Payload `json:"payload"`
}

// Connect to the WebSocket service
func connectWebSocket(apiKey string) (*websocket.Conn, error) {
	header := make(http.Header)
	header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
	conn, _, err := dialer.Dial(wsURL, header)
	return conn, err
}

// Start a goroutine to asynchronously receive WebSocket messages
func startResultReceiver(conn *websocket.Conn, taskStarted chan<- bool, taskDone chan<- bool) {
	go func() {
		for {
			_, message, err := conn.ReadMessage()
			if err != nil {
				log.Println("Failed to parse server message: ", err)
				return
			}
			var event Event
			err = json.Unmarshal(message, &event)
			if err != nil {
				log.Println("Failed to parse event: ", err)
				continue
			}
			if handleEvent(conn, event, taskStarted, taskDone) {
				return
			}
		}
	}()
}

// Send the run-task command
func sendRunTaskCmd(conn *websocket.Conn) (string, error) {
	runTaskCmd, taskID, err := generateRunTaskCmd()
	if err != nil {
		return "", err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(runTaskCmd))
	return taskID, err
}

// Generate the run-task command
func generateRunTaskCmd() (string, string, error) {
	taskID := uuid.New().String()
	runTaskCmd := Event{
		Header: Header{
			Action:    "run-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			TaskGroup: "audio",
			Task:      "asr",
			Function:  "recognition",
			Model:     "fun-asr-realtime",
			Parameters: Params{
				Format:     "wav",
				SampleRate: 16000,
			},
			Input: Input{},
		},
	}
	runTaskCmdJSON, err := json.Marshal(runTaskCmd)
	return string(runTaskCmdJSON), taskID, err
}

// Wait for the task-started event
func waitForTaskStarted(taskStarted chan bool) {
	select {
	case <-taskStarted:
		fmt.Println("Task started successfully")
	case <-time.After(10 * time.Second):
		log.Fatal("Timed out waiting for task-started; failed to start task")
	}
}

// Send audio data
func sendAudioData(conn *websocket.Conn) error {
	file, err := os.Open(audioFile)
	if err != nil {
		return err
	}
	defer file.Close()

	buf := make([]byte, 1024)
	for {
		n, err := file.Read(buf)
		if n == 0 {
			break
		}
		if err != nil && err != io.EOF {
			return err
		}
		err = conn.WriteMessage(websocket.BinaryMessage, buf[:n])
		if err != nil {
			return err
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

// Send the finish-task command
func sendFinishTaskCmd(conn *websocket.Conn, taskID string) error {
	finishTaskCmd, err := generateFinishTaskCmd(taskID)
	if err != nil {
		return err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(finishTaskCmd))
	return err
}

// Generate the finish-task command
func generateFinishTaskCmd(taskID string) (string, error) {
	finishTaskCmd := Event{
		Header: Header{
			Action:    "finish-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			Input: Input{},
		},
	}
	finishTaskCmdJSON, err := json.Marshal(finishTaskCmd)
	return string(finishTaskCmdJSON), err
}

// Handle event
func handleEvent(conn *websocket.Conn, event Event, taskStarted chan<- bool, taskDone chan<- bool) bool {
	switch event.Header.Event {
	case "task-started":
		fmt.Println("Received task-started event")
		taskStarted <- true
	case "result-generated":
		if event.Payload.Output.Sentence.Text != "" {
			fmt.Println("Recognition result: ", event.Payload.Output.Sentence.Text)
		}
		if event.Payload.Usage != nil {
			fmt.Println("Task billing duration (seconds): ", event.Payload.Usage.Duration)
		}
	case "task-finished":
		fmt.Println("Task finished")
		taskDone <- true
		return true
	case "task-failed":
		handleTaskFailed(event, conn)
		taskDone <- true
		return true
	default:
		log.Printf("Unexpected event: %v", event)
	}
	return false
}

// Handle the task-failed event
func handleTaskFailed(event Event, conn *websocket.Conn) {
	if event.Header.ErrorMessage != "" {
		log.Fatalf("Task failed: %s", event.Header.ErrorMessage)
	} else {
		log.Fatal("Task failed for unknown reason")
	}
}

// Close the connection
func closeConnection(conn *websocket.Conn) {
	if conn != nil {
		conn.Close()
	}
}

Qwen-ASR

Note

The example reads your_audio_file.pcm (PCM16, 16 kHz, mono). To convert from MP3, WAV, or other formats, use ffmpeg:

ffmpeg -i your_audio.mp3 -ar 16000 -ac 1 -f s16le your_audio_file.pcm

Python

Before running the example, install the dependencies:

pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client

Don't name the example file websocket.py. This name conflicts with the websocket library and causes the following error: AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?.

# pip install websocket-client
import os
import time
import json
import threading
import base64
import websocket
import logging
import logging.handlers
from datetime import datetime

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured the environment variable, replace the line below with your Model Studio API Key: API_KEY="sk-xxx"
API_KEY = os.environ.get("DASHSCOPE_API_KEY", "sk-xxx")
QWEN_MODEL = "qwen3-asr-flash-realtime"
# The following baseUrl is for the Singapore region. To use the Beijing region model, replace the baseUrl with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
baseUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
url = f"{baseUrl}?model={QWEN_MODEL}"
print(f"Connecting to server: {url}")

# Note: In non-VAD mode, the cumulative audio duration sent should not exceed 60 seconds
enableServerVad = True
is_running = True  # Running flag

headers = [
    "Authorization: Bearer " + API_KEY,
    "OpenAI-Beta: realtime=v1"
]

def init_logger():
    formatter = logging.Formatter('%(asctime)s|%(levelname)s|%(message)s')
    f_handler = logging.handlers.RotatingFileHandler(
        "omni_tester.log", maxBytes=100 * 1024 * 1024, backupCount=3
    )
    f_handler.setLevel(logging.DEBUG)
    f_handler.setFormatter(formatter)

    console = logging.StreamHandler()
    console.setLevel(logging.DEBUG)
    console.setFormatter(formatter)

    logger.addHandler(f_handler)
    logger.addHandler(console)

def on_open(ws):
    logger.info("Connected to server.")

    # session.update event
    event_manual = {
        "event_id": "event_123",
        "type": "session.update",
        "session": {
            "modalities": ["text"],
            "input_audio_format": "pcm",
            "sample_rate": 16000,
            "input_audio_transcription": {
                # Language identifier (optional). It is recommended to set this if the language is known
                "language": "en"
            },
            "turn_detection": None
        }
    }
    event_vad = {
        "event_id": "event_123",
        "type": "session.update",
        "session": {
            "modalities": ["text"],
            "input_audio_format": "pcm",
            "sample_rate": 16000,
            "input_audio_transcription": {
                "language": "en"
            },
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.0,
                "silence_duration_ms": 400
            }
        }
    }
    if enableServerVad:
        logger.info(f"Sending event: {json.dumps(event_vad, indent=2)}")
        ws.send(json.dumps(event_vad))
    else:
        logger.info(f"Sending event: {json.dumps(event_manual, indent=2)}")
        ws.send(json.dumps(event_manual))

def on_message(ws, message):
    global is_running
    try:
        data = json.loads(message)
        logger.info(f"Received event: {json.dumps(data, ensure_ascii=False, indent=2)}")
        if data.get("type") == "conversation.item.input_audio_transcription.completed":
            logger.info(f"Final transcript: {data.get('transcript')}")
        elif data.get("type") == "session.finished":
            logger.info("Closing WebSocket connection after session finished...")
            is_running = False  # Stop audio sending thread
            ws.close()
    except json.JSONDecodeError:
        logger.error(f"Failed to parse message: {message}")

def on_error(ws, error):
    logger.error(f"Error: {error}")

def on_close(ws, close_status_code, close_msg):
    logger.info(f"Connection closed: {close_status_code} - {close_msg}")

def send_audio(ws, local_audio_path):
    time.sleep(3)  # Wait for session update to complete
    global is_running

    with open(local_audio_path, 'rb') as audio_file:
        logger.info(f"File read start: {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
        while is_running:
            audio_data = audio_file.read(3200)  # ~0.1s PCM16/16kHz
            if not audio_data:
                logger.info(f"File read complete: {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}")
                if ws.sock and ws.sock.connected:
                    if not enableServerVad:
                        commit_event = {
                            "event_id": "event_789",
                            "type": "input_audio_buffer.commit"
                        }
                        ws.send(json.dumps(commit_event))
                    finish_event = {
                        "event_id": "event_987",
                        "type": "session.finish"
                    }
                    ws.send(json.dumps(finish_event))
                break

            if not ws.sock or not ws.sock.connected:
                logger.info("WebSocket is closed; stopping audio sending.")
                break

            encoded_data = base64.b64encode(audio_data).decode('utf-8')
            eventd = {
                "event_id": f"event_{int(time.time() * 1000)}",
                "type": "input_audio_buffer.append",
                "audio": encoded_data
            }
            ws.send(json.dumps(eventd))
            logger.info(f"Sending audio event: {eventd['event_id']}")
            time.sleep(0.1)  # Simulate real-time capture

# Initialize logger
init_logger()
logger.info(f"Connecting to WebSocket server at {url}...")

local_audio_path = "your_audio_file.pcm"
ws = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

thread = threading.Thread(target=send_audio, args=(ws, local_audio_path))
thread.start()
ws.run_forever()

Java

Before running the example, install the Java-WebSocket dependency:

Maven

<dependency>
    <groupId>org.java-websocket</groupId>
    <artifactId>Java-WebSocket</artifactId>
    <version>1.5.6</version>
</dependency>

Gradle

implementation 'org.java-websocket:Java-WebSocket:1.5.6'
import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;
import org.json.JSONObject;

import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.*;

public class QwenASRRealtimeClient {

    private static final Logger logger = Logger.getLogger(QwenASRRealtimeClient.class.getName());
    // The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    // If you have not configured the environment variable, replace the line below with your Model Studio API Key: private static final String API_KEY = "sk-xxx"
    private static final String API_KEY = System.getenv().getOrDefault("DASHSCOPE_API_KEY", "sk-xxx");
    private static final String MODEL = "qwen3-asr-flash-realtime";

    // Controls whether VAD mode is enabled
    private static final boolean enableServerVad = true;

    private static final AtomicBoolean isRunning = new AtomicBoolean(true);
    private static WebSocketClient client;

    public static void main(String[] args) throws Exception {
        initLogger();

        // The following baseUrl is for the Singapore region. To use the Beijing region model, replace the baseUrl with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
        String baseUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime";
        String url = baseUrl + "?model=" + MODEL;
        logger.info("Connecting to server: " + url);

        client = new WebSocketClient(new URI(url)) {
            @Override
            public void onOpen(ServerHandshake handshake) {
                logger.info("Connected to server.");
                sendSessionUpdate();
            }

            @Override
            public void onMessage(String message) {
                try {
                    JSONObject data = new JSONObject(message);
                    String eventType = data.optString("type");

                    logger.info("Received event: " + data.toString(2));

                    // The final transcript is in the transcription.completed event
                    if ("conversation.item.input_audio_transcription.completed".equals(eventType)) {
                        logger.info("Final transcript: " + data.optString("transcript"));
                    }

                    // On end event, stop the sending thread and close the connection
                    if ("session.finished".equals(eventType)) {
                        logger.info("Closing WebSocket connection after session finished...");

                        isRunning.set(false); // Stop the audio sending thread
                        if (this.isOpen()) {
                            this.close(1000, "ASR finished");
                        }
                    }
                } catch (Exception e) {
                    logger.severe("Failed to parse message: " + message);
                }
            }

            @Override
            public void onClose(int code, String reason, boolean remote) {
                logger.info("Connection closed: " + code + " - " + reason);
            }

            @Override
            public void onError(Exception ex) {
                logger.severe("Error: " + ex.getMessage());
            }
        };

        // Add request headers
        client.addHeader("Authorization", "Bearer " + API_KEY);
        client.addHeader("OpenAI-Beta", "realtime=v1");

        client.connectBlocking(); // Block until the connection is established

        // Replace with the path of the audio file to be recognized
        String localAudioPath = "your_audio_file.pcm";
        Thread audioThread = new Thread(() -> {
            try {
                sendAudio(localAudioPath);
            } catch (Exception e) {
                logger.severe("Audio sending thread error: " + e.getMessage());
            }
        });
        audioThread.start();
    }

    /** session.update event (enable/disable VAD) */
    private static void sendSessionUpdate() {
        JSONObject eventNoVad = new JSONObject()
                .put("event_id", "event_123")
                .put("type", "session.update")
                .put("session", new JSONObject()
                        .put("modalities", new String[]{"text"})
                        .put("input_audio_format", "pcm")
                        .put("sample_rate", 16000)
                        .put("input_audio_transcription", new JSONObject()
                                .put("language", "en"))
                        .put("turn_detection", JSONObject.NULL) // Manual mode
                );

        JSONObject eventVad = new JSONObject()
                .put("event_id", "event_123")
                .put("type", "session.update")
                .put("session", new JSONObject()
                        .put("modalities", new String[]{"text"})
                        .put("input_audio_format", "pcm")
                        .put("sample_rate", 16000)
                        .put("input_audio_transcription", new JSONObject()
                                .put("language", "en"))
                        .put("turn_detection", new JSONObject()
                                .put("type", "server_vad")
                                .put("threshold", 0.0)
                                .put("silence_duration_ms", 400))
                );

        if (enableServerVad) {
            logger.info("Sending event (VAD):\n" + eventVad.toString(2));
            client.send(eventVad.toString());
        } else {
            logger.info("Sending event (Manual):\n" + eventNoVad.toString(2));
            client.send(eventNoVad.toString());
        }
    }

    /** Send the audio file stream */
    private static void sendAudio(String localAudioPath) throws Exception {
        Thread.sleep(3000); // Wait for the session to be prepared
        byte[] allBytes = Files.readAllBytes(Paths.get(localAudioPath));
        logger.info("File read start");

        int offset = 0;
        while (isRunning.get() && offset < allBytes.length) {
            int chunkSize = Math.min(3200, allBytes.length - offset);
            byte[] chunk = new byte[chunkSize];
            System.arraycopy(allBytes, offset, chunk, 0, chunkSize);
            offset += chunkSize;

            if (client != null && client.isOpen()) {
                String encoded = Base64.getEncoder().encodeToString(chunk);
                JSONObject eventd = new JSONObject()
                        .put("event_id", "event_" + System.currentTimeMillis())
                        .put("type", "input_audio_buffer.append")
                        .put("audio", encoded);

                client.send(eventd.toString());
                logger.info("Sending audio event: " + eventd.getString("event_id"));
            } else {
                break; // Avoid sending after disconnection
            }

            Thread.sleep(100); // Simulate real-time sending
        }

        logger.info("File read complete");

        if (client != null && client.isOpen()) {
            // commit is required in non-VAD mode
            if (!enableServerVad) {
                JSONObject commitEvent = new JSONObject()
                        .put("event_id", "event_789")
                        .put("type", "input_audio_buffer.commit");
                client.send(commitEvent.toString());
                logger.info("Sent commit event for manual mode.");
            }

            JSONObject finishEvent = new JSONObject()
                    .put("event_id", "event_987")
                    .put("type", "session.finish");
            client.send(finishEvent.toString());
            logger.info("Sent finish event.");
        }
    }

    /** Initialize logger */
    private static void initLogger() {
        logger.setLevel(Level.ALL);
        Logger rootLogger = Logger.getLogger("");
        for (Handler h : rootLogger.getHandlers()) {
            rootLogger.removeHandler(h);
        }

        Handler consoleHandler = new ConsoleHandler();
        consoleHandler.setLevel(Level.ALL);
        consoleHandler.setFormatter(new SimpleFormatter());
        logger.addHandler(consoleHandler);
    }
}

Node.js

Before running the example, install the dependencies:

npm install ws
/**
 * Qwen-ASR Realtime WebSocket Client (Node.js version)
 * Features:
 * - Supports VAD mode and Manual mode
 * - Sends session.update to start the session
 * - Continuously sends audio chunks via input_audio_buffer.append
 * - In Manual mode, sends input_audio_buffer.commit
 * - Sends session.finish event
 * - Closes the connection after receiving session.finished
 */

import WebSocket from 'ws';
import fs from 'fs';

// ===== Config =====
// The API Key is different for the Singapore and Beijing regions. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured the environment variable, replace the line below with your Model Studio API Key: const API_KEY = "sk-xxx"
const API_KEY = process.env.DASHSCOPE_API_KEY || 'sk-xxx';
const MODEL = 'qwen3-asr-flash-realtime';
const enableServerVad = true; // true for VAD mode, false for Manual mode
const localAudioPath = 'your_audio_file.pcm'; // PCM16, 16 kHz audio file path

// The following baseUrl is for the Singapore region. To use the Beijing region model, replace the baseUrl with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
const baseUrl = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime';
const url = `${baseUrl}?model=${MODEL}`;

console.log(`Connecting to server: ${url}`);

// ===== State control =====
let isRunning = true;

// ===== Establish connection =====
const ws = new WebSocket(url, {
    headers: {
        'Authorization': `Bearer ${API_KEY}`,
        'OpenAI-Beta': 'realtime=v1'
    }
});

// ===== Event binding =====
ws.on('open', () => {
    console.log('[WebSocket] Connected to server.');
    sendSessionUpdate();
    // Start the audio sending thread
    sendAudio(localAudioPath);
});

ws.on('message', (message) => {
    try {
        const data = JSON.parse(message);
        console.log('[Received Event]:', JSON.stringify(data, null, 2));

        // The final transcript is in the transcription.completed event
        if (data.type === 'conversation.item.input_audio_transcription.completed') {
            console.log(`[Final Transcript] ${data.transcript}`);
        }

        // On end event
        if (data.type === 'session.finished') {
            console.log('[Action] Closing WebSocket connection after session finished...');

            if (ws.readyState === WebSocket.OPEN) {
                ws.close(1000, 'ASR finished');
            }
        }
    } catch (e) {
        console.error('[Error] Failed to parse message:', message);
    }
});

ws.on('close', (code, reason) => {
    console.log(`[WebSocket] Connection closed: ${code} - ${reason}`);
});

ws.on('error', (err) => {
    console.error('[WebSocket Error]', err);
});

// ===== Session update =====
function sendSessionUpdate() {
    const eventNoVad = {
        event_id: 'event_123',
        type: 'session.update',
        session: {
            modalities: ['text'],
            input_audio_format: 'pcm',
            sample_rate: 16000,
            input_audio_transcription: {
                language: 'en'
            },
            turn_detection: null
        }
    };

    const eventVad = {
        event_id: 'event_123',
        type: 'session.update',
        session: {
            modalities: ['text'],
            input_audio_format: 'pcm',
            sample_rate: 16000,
            input_audio_transcription: {
                language: 'en'
            },
            turn_detection: {
                type: 'server_vad',
                threshold: 0.0,
                silence_duration_ms: 400
            }
        }
    };

    if (enableServerVad) {
        console.log('[Send Event] VAD Mode:\n', JSON.stringify(eventVad, null, 2));
        ws.send(JSON.stringify(eventVad));
    } else {
        console.log('[Send Event] Manual Mode:\n', JSON.stringify(eventNoVad, null, 2));
        ws.send(JSON.stringify(eventNoVad));
    }
}

// ===== Send the audio file stream =====
function sendAudio(audioPath) {
    setTimeout(() => {
        console.log(`[File Read Start] ${audioPath}`);
        const buffer = fs.readFileSync(audioPath);

        let offset = 0;
        const chunkSize = 3200; // ~0.1s of PCM16 audio

        function sendChunk() {
            if (!isRunning) return;
            if (offset >= buffer.length) {
                isRunning = false; // Stop sending audio
                console.log('[File Read End]');
                if (ws.readyState === WebSocket.OPEN) {
                    if (!enableServerVad) {
                        const commitEvent = {
                            event_id: 'event_789',
                            type: 'input_audio_buffer.commit'
                        };
                        ws.send(JSON.stringify(commitEvent));
                        console.log('[Send Commit Event]');
                    }

                    const finishEvent = {
                        event_id: 'event_987',
                        type: 'session.finish'
                    };
                    ws.send(JSON.stringify(finishEvent));
                    console.log('[Send Finish Event]');
                }

                return;
            }

            if (ws.readyState !== WebSocket.OPEN) {
                console.log('[Stop] WebSocket is not open.');
                return;
            }

            const chunk = buffer.slice(offset, offset + chunkSize);
            offset += chunkSize;

            const encoded = chunk.toString('base64');
            const appendEvent = {
                event_id: `event_${Date.now()}`,
                type: 'input_audio_buffer.append',
                audio: encoded
            };

            ws.send(JSON.stringify(appendEvent));
            console.log(`[Send Audio Event] ${appendEvent.event_id}`);

            setTimeout(sendChunk, 100); // Simulate real-time sending
        }

        sendChunk();
    }, 3000); // Wait for session configuration to complete
}

Paraformer

Paraformer reuses the Fun-ASR example code. Replace the model parameter with the Paraformer model name.

Connection reuse (WebSocket)

Fun-ASR and Paraformer support WebSocket connection reuse: after one recognition task ends, you can start the next task on the same connection without reconnecting.

Reuse flow: The client sends finish-task. After the server returns task-finished, the client sends run-task again to start a new task.

Important
  1. Wait for the server to return the task-finished event before starting a new task.

  2. Each task on a reused connection must use a different task_id.

  3. If a task fails, the server returns an error event and closes the connection. That connection can't be reused.

  4. The connection closes automatically when no new task starts within 60 seconds after a task ends.

Qwen-ASR Realtime uses a session model. Close the connection after each session ends. Connection reuse isn't supported.

For event descriptions of each model, see the corresponding API reference.

High-concurrency best practices

The DashScope SDK includes built-in pooling that reuses WebSocket connections and recognition objects, which avoids the overhead of creating and destroying them repeatedly. Currently only the Paraformer Java SDK supports this feature.

Show high-concurrency best practices

Prerequisites

The Java SDK reaches optimal performance through a built-in connection pool working alongside a custom object pool:

  • Connection pool: An OkHttp3 connection pool built into the SDK manages and reuses underlying WebSocket connections, which reduces network handshake overhead. This is enabled by default.

  • Object pool: Built on commons-pool2 to maintain a set of pre-connected Recognition objects. Borrowing an object from the pool eliminates connection-establishment latency and significantly reduces first-packet latency.

Implementation steps

  1. Add dependencies

    Add dashscope-sdk-java and commons-pool2 to your project's dependency configuration file, based on your build tool.

    The Maven and Gradle configurations are shown below:

    Maven

    1. Open the pom.xml file of your Maven project.

    2. Add the following dependencies inside the <dependencies> tag.

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>dashscope-sdk-java</artifactId>
        <!-- Replace 'the-latest-version' with 2.16.9 or later. See https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java for available versions. -->
        <version>the-latest-version</version>
    </dependency>
    
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-pool2</artifactId>
        <!-- Replace 'the-latest-version' with the latest version. See https://mvnrepository.com/artifact/org.apache.commons/commons-pool2 for available versions. -->
        <version>the-latest-version</version>
    </dependency>
    1. Save the pom.xml file.

    2. Update the project dependencies with a Maven command, such as mvn clean install or mvn compile.

    Gradle

    1. Open the build.gradle file of your Gradle project.

    2. Add the following dependencies inside the dependencies block.

      dependencies {
          // Replace 'the-latest-version' with 2.16.9 or later. See https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java for available versions.
          implementation group: 'com.alibaba', name: 'dashscope-sdk-java', version: 'the-latest-version'
      
          // Replace 'the-latest-version' with the latest version. See https://mvnrepository.com/artifact/org.apache.commons/commons-pool2 for available versions.
          implementation group: 'org.apache.commons', name: 'commons-pool2', version: 'the-latest-version'
      }
    3. Save the build.gradle file.

    4. Open a command line, navigate to the project root directory, and run the following Gradle command to update the dependencies.

      ./gradlew build --refresh-dependencies

      On Windows, use:

      gradlew build --refresh-dependencies
  2. Configure the connection pool

    Configure the connection pool through environment variables:

    Environment variable

    Description

    DASHSCOPE_CONNECTION_POOL_SIZE

    Connection pool size.

    Recommended value: at least twice the peak concurrency.

    Default value: 32.

    DASHSCOPE_MAXIMUM_ASYNC_REQUESTS

    Maximum number of concurrent async requests.

    Recommended value: same as DASHSCOPE_CONNECTION_POOL_SIZE.

    Default value: 32.

    DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST

    Maximum number of async requests per host.

    Recommended value: same as DASHSCOPE_CONNECTION_POOL_SIZE.

    Default value: 32.

  3. Configure the object pool

    Configure the object pool size through an environment variable:

    Environment variable

    Description

    RECOGNITION_OBJECTPOOL_SIZE

    Object pool size.

    Recommended value: 1.5 to 2 times the peak concurrency.

    Default value: 500.

    Important
    • The object pool size (RECOGNITION_OBJECTPOOL_SIZE) must be less than or equal to the connection pool size (DASHSCOPE_CONNECTION_POOL_SIZE). Otherwise, calling threads block when the object pool requests an object and the connection pool is full.

    • The object pool size shouldn't exceed your account's QPS (queries per second) limit.

    Use the following code to create the object pool:

    class RecognitionObjectPool {
        // ... See the complete code section for the full example
        public static GenericObjectPool<Recognition> getInstance() {
            lock.lock();
            if (recognitionGenericObjectPool == null) {
                int objectPoolSize = getObjectivePoolSize();
                RecognitionObjectFactory recognitionObjectFactory =
                        new RecognitionObjectFactory();
                GenericObjectPoolConfig<Recognition> config =
                        new GenericObjectPoolConfig<>();
                config.setMaxTotal(objectPoolSize);
                config.setMaxIdle(objectPoolSize);
                config.setMinIdle(objectPoolSize);
                recognitionGenericObjectPool =
                        new GenericObjectPool<>(recognitionObjectFactory, config);
            }
            lock.unlock();
            return recognitionGenericObjectPool;
        }
    }
  4. Borrow a Recognition object from the pool

    When the number of unreturned objects exceeds the pool size, the system creates additional Recognition objects. These extra objects must establish a new WebSocket connection and can't be reused.

    recognizer = RecognitionObjectPool.getInstance().borrowObject();
  5. Run speech recognition

    Call the Recognition object's call or streamCall method to run speech recognition.

  6. Return the Recognition object

    After the speech-recognition task ends, return the Recognition object so it can be reused. Don't return objects whose task didn't complete or failed.

    RecognitionObjectPool.getInstance().returnObject(recognizer);

Complete code

package org.alibaba.bailian.example.examples;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.ApiKey;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

public class Main {
    public static void checkoutEnv(String envName, int defaultSize) {
        if (System.getenv(envName) != null) {
            System.out.println("[ENV CHECK]: " + envName + " "
                    + System.getenv(envName));
        } else {
            System.out.println("[ENV CHECK]: " + envName
                    + " Using Default which is " + defaultSize);
        }
    }

    public static void main(String[] args)
            throws NoApiKeyException, InterruptedException {
        checkoutEnv("DASHSCOPE_CONNECTION_POOL_SIZE", 32);
        checkoutEnv("DASHSCOPE_MAXIMUM_ASYNC_REQUESTS", 32);
        checkoutEnv("DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST", 32);
        checkoutEnv(RecognitionObjectPool.RECOGNITION_OBJECTPOOL_SIZE_ENV,
                RecognitionObjectPool.DEFAULT_OBJECT_POOL_SIZE);

        int threadNums = 3;
        String currentDir = System.getProperty("user.dir");
        Path[] filePaths = {
                Paths.get(currentDir, "asr_example.wav"),
                Paths.get(currentDir, "asr_example.wav"),
                Paths.get(currentDir, "asr_example.wav"),
        };
        ExecutorService executorService = Executors.newFixedThreadPool(threadNums);
        for (int i = 0; i < threadNums; i++) {
            executorService.submit(new RealtimeRecognizeTask(filePaths));
        }
        executorService.shutdown();
        executorService.awaitTermination(10, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RecognitionObjectFactory extends BasePooledObjectFactory<Recognition> {
    public RecognitionObjectFactory() {
        super();
    }

    @Override
    public Recognition create() throws Exception {
        return new Recognition();
    }

    @Override
    public PooledObject<Recognition> wrap(Recognition obj) {
        return new DefaultPooledObject<>(obj);
    }
}

class RecognitionObjectPool {
    public static GenericObjectPool<Recognition> recognitionGenericObjectPool;
    public static String RECOGNITION_OBJECTPOOL_SIZE_ENV =
            "RECOGNITION_OBJECTPOOL_SIZE";
    public static int DEFAULT_OBJECT_POOL_SIZE = 500;
    private static Lock lock = new java.util.concurrent.locks.ReentrantLock();

    public static int getObjectivePoolSize() {
        try {
            Integer n = Integer.parseInt(
                    System.getenv(RECOGNITION_OBJECTPOOL_SIZE_ENV));
            return n;
        } catch (NumberFormatException e) {
            return DEFAULT_OBJECT_POOL_SIZE;
        }
    }

    public static GenericObjectPool<Recognition> getInstance() {
        lock.lock();
        if (recognitionGenericObjectPool == null) {
            int objectPoolSize = getObjectivePoolSize();
            System.out.println("RECOGNITION_OBJECTPOOL_SIZE: "
                    + objectPoolSize);
            RecognitionObjectFactory recognitionObjectFactory =
                    new RecognitionObjectFactory();
            GenericObjectPoolConfig<Recognition> config =
                    new GenericObjectPoolConfig<>();
            config.setMaxTotal(objectPoolSize);
            config.setMaxIdle(objectPoolSize);
            config.setMinIdle(objectPoolSize);
            recognitionGenericObjectPool =
                    new GenericObjectPool<>(recognitionObjectFactory, config);
        }
        lock.unlock();
        return recognitionGenericObjectPool;
    }
}

class RealtimeRecognizeTask implements Runnable {
    private static final Object lock = new Object();
    private Path[] filePaths;

    public RealtimeRecognizeTask(Path[] filePaths) {
        this.filePaths = filePaths;
    }

    private static String getDashScopeApiKey() throws NoApiKeyException {
        String dashScopeApiKey = null;
        try {
            ApiKey apiKey = new ApiKey();
            dashScopeApiKey = ApiKey.getApiKey(null);
        } catch (NoApiKeyException e) {
            System.out.println("No API key found in environment.");
        }
        if (dashScopeApiKey == null) {
            dashScopeApiKey = "your-dashscope-apikey";
        }
        return dashScopeApiKey;
    }

    public void runCallback() {
        for (Path filePath : filePaths) {
            RecognitionParam param = null;
            try {
                param = RecognitionParam.builder()
                        .model("paraformer-realtime-v2")
                        .format("pcm")
                        .sampleRate(16000)
                        .apiKey(getDashScopeApiKey())
                        .build();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }

            Recognition recognizer = null;
            final boolean[] hasError = {false};
            try {
                recognizer = RecognitionObjectPool.getInstance().borrowObject();
                String threadName = Thread.currentThread().getName();

                ResultCallback<RecognitionResult> callback =
                        new ResultCallback<RecognitionResult>() {
                            @Override
                            public void onEvent(RecognitionResult message) {
                                synchronized (lock) {
                                    if (message.isSentenceEnd()) {
                                        System.out.println("[process " + threadName
                                                + "] Fix:" + message.getSentence().getText());
                                    } else {
                                        System.out.println("[process " + threadName
                                                + "] Result: " + message.getSentence().getText());
                                    }
                                }
                            }

                            @Override
                            public void onComplete() {
                                System.out.println("[" + threadName
                                        + "] Recognition complete");
                            }

                            @Override
                            public void onError(Exception e) {
                                System.out.println("[" + threadName
                                        + "] RecognitionCallback error: " + e.getMessage());
                                hasError[0] = true;
                            }
                        };
                System.out.println("[" + threadName
                        + "] Input file_path is: " + filePath);
                FileInputStream fis = null;
                try {
                    fis = new FileInputStream(filePath.toFile());
                } catch (Exception e) {
                    System.out.println("Error when loading file: " + filePath);
                    e.printStackTrace();
                }
                recognizer.call(param, callback);

                // chunk size set to 100 ms for 16KHz sample rate
                byte[] buffer = new byte[3200];
                int bytesRead;
                while ((bytesRead = fis.read(buffer)) != -1) {
                    ByteBuffer byteBuffer;
                    if (bytesRead < buffer.length) {
                        byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
                    } else {
                        byteBuffer = ByteBuffer.wrap(buffer);
                    }
                    recognizer.sendAudioFrame(byteBuffer);
                    Thread.sleep(100);
                    buffer = new byte[3200];
                }
                System.out.println("[" + threadName + "] send audio done");
                recognizer.stop();
                System.out.println("[" + threadName + "] asr task finished");
            } catch (Exception e) {
                e.printStackTrace();
                hasError[0] = true;
            }
            if (recognizer != null) {
                try {
                    if (hasError[0] == true) {
                        recognizer.getDuplexApi().close(1000, "bye");
                        RecognitionObjectPool.getInstance()
                                .invalidateObject(recognizer);
                    } else {
                        RecognitionObjectPool.getInstance()
                                .returnObject(recognizer);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    @Override
    public void run() {
        runCallback();
    }
}

Recommended configuration

The configurations below come from tests where only the Paraformer real-time speech recognition service runs on Alibaba Cloud servers of the specified specs. The single-host concurrency is the number of Paraformer real-time speech recognition tasks running simultaneously on a single host (the number of worker threads).

Server configuration (Alibaba Cloud)

Max concurrency per host

Object pool size

Connection pool size

4 cores, 8 GiB

100

500

2000

8 cores, 16 GiB

200

500

2000

16 cores, 32 GiB

400

500

2000

Resource management and exception handling

  • Task succeeds: Call GenericObjectPool.returnObject() to return the Recognition object to the pool for reuse.

    Important

    Don't return Recognition objects whose task didn't complete or failed.

  • Task fails: When the SDK or business logic throws an exception that interrupts the task, do both of the following:

    1. Close the underlying WebSocket connection.

    2. Invalidate the object in the pool to prevent reuse.

    // Close the connection
    recognizer.getDuplexApi().close(1000, "bye");
    // Invalidate the failed recognizer in the object pool
    RecognitionObjectPool.getInstance().invalidateObject(recognizer);
  • For service-side TaskFailed errors, no extra handling is required.

Warmup and latency measurement

When you measure concurrent-call latency or other performance metrics of the DashScope Java SDK, perform sufficient warmup before the formal test.

Connection reuse mechanism

The DashScope Java SDK uses a global singleton connection pool to manage and reuse WebSocket connections. The mechanism behaves as follows:

  • On-demand creation: The SDK doesn't pre-create WebSocket connections at service startup. Connections are established on demand at the first call.

  • Time-limited reuse: After a request completes, the connection is kept in the pool for up to 60 seconds for reuse.

    • If a new request arrives within 60 seconds, the existing connection is reused, which avoids the handshake overhead.

    • If the connection stays idle for more than 60 seconds, it's closed automatically to release resources.

Why warmup matters

In the following scenarios, the connection pool might not have a reusable active connection, so the request must create a new one:

  • The application has just started and hasn't made any calls.

  • The service has been idle for more than 60 seconds, so the pooled connections have closed due to timeout.

In these scenarios, the first or initial requests trigger a full WebSocket connection process (TCP handshake, TLS negotiation, and protocol upgrade). The end-to-end latency is significantly higher than for subsequent requests that reuse a connection.

Recommended approach

Before formal performance stress tests or latency measurement, follow these warmup steps:

  1. Match the concurrency level of the actual test by running calls in advance (for example, for 1 to 2 minutes) to fully populate the connection pool.

  2. Confirm that the connection pool is established and maintains enough active connections, then start the formal performance-data collection.

Apply in production

Improve recognition accuracy

  • Match the model to the sample rate: For 8 kHz phone audio, use an 8 kHz model directly. Don't upsample to 16 kHz, because that distorts the signal.

  • Improve input audio quality: Use a high-quality microphone in a recording environment with high signal-to-noise ratio and no echo. At the application layer, integrate preprocessing such as noise reduction (for example, RNNoise) and acoustic echo cancellation (AEC).

Set up resilience

  • Client-side reconnection: The client should implement automatic reconnection to handle network jitter. A Python SDK reference implementation:

    1. Catch exceptions: Implement the on_error method in the Callback class. The dashscope SDK calls this method when it encounters a network error or other issue.

    2. Notify state: When on_error is triggered, set a reconnect signal. In Python, use threading.Event, a thread-safe signal flag.

    3. Reconnect loop: Wrap the main logic in a for loop (for example, retry 3 times). When the reconnect signal is detected, the current recognition is interrupted, resources are cleaned up, then after a few seconds the loop creates a new connection.

  • Use a heartbeat to keep the connection alive: When you need a long-lived connection to the server, set the heartbeat parameter to true. The connection stays alive even when the audio is silent for a long time.

  • Rate limits: When you call the model interfaces, observe the model Rate limits rules.

Supported scope

Available models vary by deployment scope:

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

To call any of the following models, use an API key from the Singapore region:

  • Fun-ASR: fun-asr-realtime (stable, currently equivalent to fun-asr-realtime-2025-11-07), fun-asr-realtime-2025-11-07 (snapshot)

  • Qwen3-ASR-Flash-Realtime: qwen3-asr-flash-realtime (stable, currently equivalent to qwen3-asr-flash-realtime-2025-10-27), qwen3-asr-flash-realtime-2026-02-10 (latest snapshot), qwen3-asr-flash-realtime-2025-10-27 (snapshot)

Mainland China

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

To call any of the following models, use an API key from the Beijing region:

  • Fun-ASR: fun-asr-realtime (stable, currently equivalent to fun-asr-realtime-2025-11-07), fun-asr-realtime-2026-02-28 (latest snapshot), fun-asr-realtime-2025-11-07 (snapshot), fun-asr-realtime-2025-09-15 (snapshot)

    • fun-asr-flash-8k-realtime (stable, currently equivalent to fun-asr-flash-8k-realtime-2026-01-28), fun-asr-flash-8k-realtime-2026-01-28

  • Qwen3-ASR-Flash-Realtime: qwen3-asr-flash-realtime (stable, currently equivalent to qwen3-asr-flash-realtime-2025-10-27), qwen3-asr-flash-realtime-2026-02-10 (latest snapshot), qwen3-asr-flash-realtime-2025-10-27 (snapshot)

  • Paraformer: paraformer-realtime-v2, paraformer-realtime-v1, paraformer-realtime-8k-v2, paraformer-realtime-8k-v1

API reference

FAQ

What audio formats does real-time speech recognition support?

Fun-ASR and Paraformer support pcm, wav, mp3, opus, speex, aac, and amr. For Qwen-ASR, use pcm or opus. Other formats (such as wav, aac, and amr) pass session.update validation but may fail server-side decoding. Confirm the audio stream uses a recommended format before sending.

When should I use the SDK vs. the WebSocket API?

The DashScope SDK wraps WebSocket connection management, authentication, and reconnection, which makes it the fastest path to integration. The WebSocket API gives you direct, fine-grained control. Use it when the SDK doesn't cover your language, or when you need custom connection handling. The SDK is the recommended choice for most use cases.

How do I improve recognition accuracy for proper nouns?

Use hot words (supported by Fun-ASR and Paraformer). Hot words work well for fixed vocabulary .

What can I do when the connection keeps dropping?

Implement client-side reconnection logic, and enable the heartbeat parameter (heartbeat=true) to prevent the connection from dropping during long silent periods. For detailed fault-tolerance strategies, see Apply in production.