This topic describes how to connect directly to the Fun-ASR real-time speech recognition service over the WebSocket protocol. This approach works with any programming language that supports WebSocket. To simplify integration, higher-level Python and Java SDKs are also available. However, the general protocol described in this topic gives you maximum development flexibility.
User guide: For an introduction to the models and for model selection recommendations, see Real-time speech recognition - Fun-ASR/Paraformer.
Getting started
Preparations
Create an API key. For security, export the API key as an environment variable.
Download the sample audio file: asr_example.wav.
Sample code
Node.js
Install the required dependencies:
npm install ws
npm install uuid
The following is the sample code:
const fs = require('fs');
const WebSocket = require('ws');
const { v4: uuidv4 } = require('uuid'); // Used to generate a UUID
// API keys are different for the Singapore and China (Beijing) regions. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured environment variables, replace the following line with your Model Studio API key: const apiKey = "sk-xxx"
const apiKey = process.env.DASHSCOPE_API_KEY;
// The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference/
const url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server address
const audioFile = 'asr_example.wav'; // Replace with the path to your audio file
// Generate a 32-character random task ID
const TASK_ID = uuidv4().replace(/-/g, '').slice(0, 32);
// Create a WebSocket client
const ws = new WebSocket(url, {
headers: {
Authorization: `bearer ${apiKey}`
}
});
let taskStarted = false; // A flag that indicates whether the task has started
// Send the run-task instruction when the connection is opened
ws.on('open', () => {
console.log('Connected to the server');
sendRunTask();
});
// Process received messages
ws.on('message', (data) => {
const message = JSON.parse(data);
switch (message.header.event) {
case 'task-started':
console.log('The task has started');
taskStarted = true;
sendAudioStream();
break;
case 'result-generated':
console.log('Recognition result:', message.payload.output.sentence.text);
if (message.payload.usage) {
console.log('Billable duration of the task (in seconds):', message.payload.usage.duration);
}
break;
case 'task-finished':
console.log('The task is complete');
ws.close();
break;
case 'task-failed':
console.error('The task failed:', message.header.error_message);
ws.close();
break;
default:
console.log('Unknown event:', message.header.event);
}
});
// When the connection closes, warn if the task never started
ws.on('close', () => {
if (!taskStarted) {
console.error('The connection closed before the task started.');
}
});
// Send the run-task instruction
function sendRunTask() {
const runTaskMessage = {
header: {
action: 'run-task',
task_id: TASK_ID,
streaming: 'duplex'
},
payload: {
task_group: 'audio',
task: 'asr',
function: 'recognition',
model: 'fun-asr-realtime',
parameters: {
sample_rate: 16000,
format: 'wav'
},
input: {}
}
};
ws.send(JSON.stringify(runTaskMessage));
}
// Send the audio stream
function sendAudioStream() {
const audioStream = fs.createReadStream(audioFile);
let chunkCount = 0;
function sendNextChunk() {
const chunk = audioStream.read();
if (chunk) {
ws.send(chunk);
chunkCount++;
setTimeout(sendNextChunk, 100); // Send a chunk every 100 ms
}
}
audioStream.on('readable', () => {
sendNextChunk();
});
audioStream.on('end', () => {
console.log('The audio stream has ended');
sendFinishTask();
});
audioStream.on('error', (err) => {
console.error('Error reading the audio file:', err);
ws.close();
});
}
// Send the finish-task instruction
function sendFinishTask() {
const finishTaskMessage = {
header: {
action: 'finish-task',
task_id: TASK_ID,
streaming: 'duplex'
},
payload: {
input: {}
}
};
ws.send(JSON.stringify(finishTaskMessage));
}
// Handle errors
ws.on('error', (error) => {
console.error('WebSocket error:', error);
});
C#
The following is the sample code:
using System.Net.WebSockets;
using System.Text;
using System.Text.Json;
using System.Text.Json.Nodes;
class Program {
private static ClientWebSocket _webSocket = new ClientWebSocket();
private static CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
private static bool _taskStartedReceived = false;
private static bool _taskFinishedReceived = false;
// API keys are different for the Singapore and China (Beijing) regions. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured environment variables, replace the following line with your Model Studio API key: private static readonly string ApiKey = "sk-xxx"
private static readonly string ApiKey = Environment.GetEnvironmentVariable("DASHSCOPE_API_KEY") ?? throw new InvalidOperationException("DASHSCOPE_API_KEY environment variable is not set.");
// The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference/
private const string WebSocketUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/";
// Replace with the path to your audio file
private const string AudioFilePath = "asr_example.wav";
static async Task Main(string[] args) {
// Establish a WebSocket connection and configure headers for authentication
_webSocket.Options.SetRequestHeader("Authorization", $"bearer {ApiKey}");
await _webSocket.ConnectAsync(new Uri(WebSocketUrl), _cancellationTokenSource.Token);
// Start a thread to asynchronously receive WebSocket messages
var receiveTask = ReceiveMessagesAsync();
// Send the run-task instruction
string _taskId = Guid.NewGuid().ToString("N"); // Generate a 32-character random ID
var runTaskJson = GenerateRunTaskJson(_taskId);
await SendAsync(runTaskJson);
// Wait for the task-started event
while (!_taskStartedReceived) {
await Task.Delay(100, _cancellationTokenSource.Token);
}
// Read the local file and send the audio stream to the server for recognition
await SendAudioStreamAsync(AudioFilePath);
// Send the finish-task instruction to end the task
var finishTaskJson = GenerateFinishTaskJson(_taskId);
await SendAsync(finishTaskJson);
// Wait for the task-finished event
while (!_taskFinishedReceived && !_cancellationTokenSource.IsCancellationRequested) {
try {
await Task.Delay(100, _cancellationTokenSource.Token);
} catch (OperationCanceledException) {
// The task has been canceled. Exit the loop.
break;
}
}
// Close the connection
if (!_cancellationTokenSource.IsCancellationRequested) {
await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", _cancellationTokenSource.Token);
}
_cancellationTokenSource.Cancel();
try {
await receiveTask;
} catch (OperationCanceledException) {
// Ignore the operation canceled exception
}
}
private static async Task ReceiveMessagesAsync() {
try {
while (_webSocket.State == WebSocketState.Open && !_cancellationTokenSource.IsCancellationRequested) {
var message = await ReceiveMessageAsync(_cancellationTokenSource.Token);
if (message != null) {
var eventValue = message["header"]?["event"]?.GetValue<string>();
switch (eventValue) {
case "task-started":
Console.WriteLine("The task started successfully");
_taskStartedReceived = true;
break;
case "result-generated":
Console.WriteLine($"Recognition result: {message["payload"]?["output"]?["sentence"]?["text"]?.GetValue<string>()}");
if (message["payload"]?["usage"] != null && message["payload"]?["usage"]?["duration"] != null) {
Console.WriteLine($"Billable duration of the task (in seconds): {message["payload"]?["usage"]?["duration"]?.GetValue<int>()}");
}
break;
case "task-finished":
Console.WriteLine("The task is complete");
_taskFinishedReceived = true;
_cancellationTokenSource.Cancel();
break;
case "task-failed":
Console.WriteLine($"The task failed: {message["header"]?["error_message"]?.GetValue<string>()}");
_cancellationTokenSource.Cancel();
break;
}
}
}
} catch (OperationCanceledException) {
// Ignore the operation canceled exception
}
}
private static async Task<JsonNode?> ReceiveMessageAsync(CancellationToken cancellationToken) {
    var buffer = new byte[1024 * 4];
    var segment = new ArraySegment<byte>(buffer);
    using var messageStream = new MemoryStream();
    WebSocketReceiveResult result;
    // A single event can span several WebSocket frames, so read until EndOfMessage is true
    do {
        result = await _webSocket.ReceiveAsync(segment, cancellationToken);
        if (result.MessageType == WebSocketMessageType.Close) {
            await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", cancellationToken);
            return null;
        }
        messageStream.Write(buffer, 0, result.Count);
    } while (!result.EndOfMessage);
    var message = Encoding.UTF8.GetString(messageStream.ToArray());
    return JsonNode.Parse(message);
}
private static async Task SendAsync(string message) {
var buffer = Encoding.UTF8.GetBytes(message);
var segment = new ArraySegment<byte>(buffer);
await _webSocket.SendAsync(segment, WebSocketMessageType.Text, true, _cancellationTokenSource.Token);
}
private static async Task SendAudioStreamAsync(string filePath) {
using (var audioStream = File.OpenRead(filePath)) {
var buffer = new byte[1024]; // Read and send the audio in 1024-byte chunks, paced by the 100 ms delay below
int bytesRead;
while ((bytesRead = await audioStream.ReadAsync(buffer, 0, buffer.Length)) > 0) {
var segment = new ArraySegment<byte>(buffer, 0, bytesRead);
await _webSocket.SendAsync(segment, WebSocketMessageType.Binary, true, _cancellationTokenSource.Token);
await Task.Delay(100); // 100 ms interval
}
}
}
private static string GenerateRunTaskJson(string taskId) {
var runTask = new JsonObject {
["header"] = new JsonObject {
["action"] = "run-task",
["task_id"] = taskId,
["streaming"] = "duplex"
},
["payload"] = new JsonObject {
["task_group"] = "audio",
["task"] = "asr",
["function"] = "recognition",
["model"] = "fun-asr-realtime",
["parameters"] = new JsonObject {
["format"] = "wav",
["sample_rate"] = 16000,
},
["input"] = new JsonObject()
}
};
return JsonSerializer.Serialize(runTask);
}
private static string GenerateFinishTaskJson(string taskId) {
var finishTask = new JsonObject {
["header"] = new JsonObject {
["action"] = "finish-task",
["task_id"] = taskId,
["streaming"] = "duplex"
},
["payload"] = new JsonObject {
["input"] = new JsonObject()
}
};
return JsonSerializer.Serialize(finishTask);
}
}
PHP
The sample code has the following directory structure:
my-php-project/
├── composer.json
├── vendor/
└── index.php
The following is the content of composer.json. Specify the dependency version numbers as needed:
{
"require": {
"react/event-loop": "^1.3",
"react/socket": "^1.11",
"react/stream": "^1.2",
"react/http": "^1.1",
"ratchet/pawl": "^0.4"
},
"autoload": {
"psr-4": {
"App\\": "src/"
}
}
}
The following is the content of index.php:
<?php
require __DIR__ . '/vendor/autoload.php';
use Ratchet\Client\Connector;
use React\EventLoop\Loop;
use React\Socket\Connector as SocketConnector;
use Ratchet\rfc6455\Messaging\Frame;
// API keys are different for the Singapore and China (Beijing) regions. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured environment variables, replace the following line with your Model Studio API key: $api_key = "sk-xxx"
$api_key = getenv("DASHSCOPE_API_KEY");
// The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference/
$websocket_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/';
$audio_file_path = 'asr_example.wav'; // Replace with the path to your audio file
$loop = Loop::get();
// Create a custom connector
$socketConnector = new SocketConnector($loop, [
'tcp' => [
'bindto' => '0.0.0.0:0',
],
'tls' => [
'verify_peer' => false,
'verify_peer_name' => false,
],
]);
$connector = new Connector($loop, $socketConnector);
$headers = [
'Authorization' => 'bearer ' . $api_key
];
$connector($websocket_url, [], $headers)->then(function ($conn) use ($loop, $audio_file_path) {
echo "Connected to the WebSocket server\n";
// Generate a task ID so that run-task and finish-task use the same task_id
$taskId = generateTaskId();
// Asynchronously receive WebSocket messages on the event loop
$conn->on('message', function($msg) use ($conn, $loop, $audio_file_path, $taskId) {
$response = json_decode($msg, true);
if (isset($response['header']['event'])) {
handleEvent($conn, $response, $loop, $audio_file_path, $taskId);
} else {
echo "Unknown message format\n";
}
});
// Listen for connection closure
$conn->on('close', function($code = null, $reason = null) {
echo "Connection closed\n";
if ($code !== null) {
echo "Close code: " . $code . "\n";
}
if ($reason !== null) {
echo "Close reason: " . $reason . "\n";
}
});
// Send the run-task instruction
sendRunTaskMessage($conn, $taskId);
}, function ($e) {
echo "Cannot connect: {$e->getMessage()}\n";
});
$loop->run();
/**
* Generate a task ID
* @return string
*/
function generateTaskId(): string {
return bin2hex(random_bytes(16));
}
/**
* Send the run-task instruction
* @param $conn
* @param $taskId
*/
function sendRunTaskMessage($conn, $taskId) {
$runTaskMessage = json_encode([
"header" => [
"action" => "run-task",
"task_id" => $taskId,
"streaming" => "duplex"
],
"payload" => [
"task_group" => "audio",
"task" => "asr",
"function" => "recognition",
"model" => "fun-asr-realtime",
"parameters" => [
"format" => "wav",
"sample_rate" => 16000
],
"input" => []
]
]);
echo "Preparing to send the run-task instruction: " . $runTaskMessage . "\n";
$conn->send($runTaskMessage);
echo "The run-task instruction has been sent\n";
}
/**
* Read the audio file
* @param string $filePath
* @return bool|string
*/
function readAudioFile(string $filePath) {
$voiceData = file_get_contents($filePath);
if ($voiceData === false) {
echo "Cannot read the audio file\n";
}
return $voiceData;
}
/**
* Split the audio data
* @param string $data
* @param int $chunkSize
* @return array
*/
function splitAudioData(string $data, int $chunkSize): array {
return str_split($data, $chunkSize);
}
/**
* Send the finish-task instruction
* @param $conn
* @param $taskId
*/
function sendFinishTaskMessage($conn, $taskId) {
$finishTaskMessage = json_encode([
"header" => [
"action" => "finish-task",
"task_id" => $taskId,
"streaming" => "duplex"
],
"payload" => [
"input" => []
]
]);
echo "Preparing to send the finish-task instruction: " . $finishTaskMessage . "\n";
$conn->send($finishTaskMessage);
echo "The finish-task instruction has been sent\n";
}
/**
 * Handle events
 * @param $conn
 * @param $response
 * @param $loop
 * @param $audio_file_path
 * @param $taskId
 */
function handleEvent($conn, $response, $loop, $audio_file_path, $taskId) {
    static $chunks;
    static $allChunksSent = false;
switch ($response['header']['event']) {
case 'task-started':
echo "Task started. Sending audio data...\n";
// Read the audio file
$voiceData = readAudioFile($audio_file_path);
if ($voiceData === false) {
echo "Cannot read the audio file\n";
$conn->close();
return;
}
// Split the audio data
$chunks = splitAudioData($voiceData, 1024);
// Define the send function
$sendChunk = function() use ($conn, &$chunks, $loop, &$sendChunk, &$allChunksSent, $taskId) {
if (!empty($chunks)) {
$chunk = array_shift($chunks);
$binaryMsg = new Frame($chunk, true, Frame::OP_BINARY);
$conn->send($binaryMsg);
// Send the next segment after 100 ms
$loop->addTimer(0.1, $sendChunk);
} else {
echo "All blocks have been sent\n";
$allChunksSent = true;
// Send the finish-task instruction
sendFinishTaskMessage($conn, $taskId);
}
};
// Start sending audio data
$sendChunk();
break;
case 'result-generated':
$result = $response['payload']['output']['sentence'];
echo "Recognition result: " . $result['text'] . "\n";
if (isset($response['payload']['usage']['duration'])) {
echo "Billable duration of the task (in seconds): " . $response['payload']['usage']['duration'] . "\n";
}
break;
case 'task-finished':
echo "The task is complete\n";
$conn->close();
break;
case 'task-failed':
echo "The task failed\n";
echo "Error code: " . $response['header']['error_code'] . "\n";
echo "Error message: " . $response['header']['error_message'] . "\n";
$conn->close();
break;
case 'error':
echo "Error: " . $response['payload']['message'] . "\n";
break;
default:
echo "Unknown event: " . $response['header']['event'] . "\n";
break;
}
// If all data has been sent and the task is complete, close the connection
if ($allChunksSent && $response['header']['event'] == 'task-finished') {
// Wait for 1 second to ensure all data has been transferred
$loop->addTimer(1, function() use ($conn) {
$conn->close();
echo "The client closes the connection\n";
});
}
}
Go
package main
import (
"encoding/json"
"fmt"
"io"
"log"
"net/http"
"os"
"time"
"github.com/google/uuid"
"github.com/gorilla/websocket"
)
const (
// The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference/
wsURL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/" // WebSocket server address
audioFile = "asr_example.wav" // Replace with the path to your audio file
)
var dialer = websocket.DefaultDialer
func main() {
// API keys are different for the Singapore and China (Beijing) regions. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured environment variables, replace the following line with your Model Studio API key: apiKey := "sk-xxx"
apiKey := os.Getenv("DASHSCOPE_API_KEY")
// Connect to the WebSocket service
conn, err := connectWebSocket(apiKey)
if err != nil {
log.Fatal("Failed to connect to WebSocket:", err)
}
defer closeConnection(conn)
// Start a goroutine to receive results
taskStarted := make(chan bool)
taskDone := make(chan bool)
startResultReceiver(conn, taskStarted, taskDone)
// Send the run-task instruction
taskID, err := sendRunTaskCmd(conn)
if err != nil {
log.Fatal("Failed to send the run-task instruction:", err)
}
// Wait for the task-started event
waitForTaskStarted(taskStarted)
// Send the audio file stream for recognition
if err := sendAudioData(conn); err != nil {
log.Fatal("Failed to send audio:", err)
}
// Send the finish-task instruction
if err := sendFinishTaskCmd(conn, taskID); err != nil {
log.Fatal("Failed to send the finish-task instruction:", err)
}
// Wait for the task to complete or fail
<-taskDone
}
// Define a struct to represent JSON data
type Header struct {
Action string `json:"action"`
TaskID string `json:"task_id"`
Streaming string `json:"streaming"`
Event string `json:"event"`
ErrorCode string `json:"error_code,omitempty"`
ErrorMessage string `json:"error_message,omitempty"`
Attributes map[string]interface{} `json:"attributes"`
}
type Output struct {
Sentence struct {
BeginTime int64 `json:"begin_time"`
EndTime *int64 `json:"end_time"`
Text string `json:"text"`
Words []struct {
BeginTime int64 `json:"begin_time"`
EndTime *int64 `json:"end_time"`
Text string `json:"text"`
Punctuation string `json:"punctuation"`
} `json:"words"`
} `json:"sentence"`
}
type Payload struct {
TaskGroup string `json:"task_group"`
Task string `json:"task"`
Function string `json:"function"`
Model string `json:"model"`
Parameters Params `json:"parameters"`
Input Input `json:"input"`
Output Output `json:"output,omitempty"`
Usage *struct {
Duration int `json:"duration"`
} `json:"usage,omitempty"`
}
type Params struct {
Format string `json:"format"`
SampleRate int `json:"sample_rate"`
VocabularyID string `json:"vocabulary_id"`
DisfluencyRemovalEnabled bool `json:"disfluency_removal_enabled"`
}
type Input struct {
}
type Event struct {
Header Header `json:"header"`
Payload Payload `json:"payload"`
}
// Connect to the WebSocket service
func connectWebSocket(apiKey string) (*websocket.Conn, error) {
header := make(http.Header)
header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
conn, _, err := dialer.Dial(wsURL, header)
return conn, err
}
// Start a goroutine to asynchronously receive WebSocket messages
func startResultReceiver(conn *websocket.Conn, taskStarted chan<- bool, taskDone chan<- bool) {
go func() {
for {
_, message, err := conn.ReadMessage()
if err != nil {
log.Println("Failed to parse the server message:", err)
return
}
var event Event
err = json.Unmarshal(message, &event)
if err != nil {
log.Println("Failed to parse the event:", err)
continue
}
if handleEvent(conn, event, taskStarted, taskDone) {
return
}
}
}()
}
// Send the run-task instruction
func sendRunTaskCmd(conn *websocket.Conn) (string, error) {
runTaskCmd, taskID, err := generateRunTaskCmd()
if err != nil {
return "", err
}
err = conn.WriteMessage(websocket.TextMessage, []byte(runTaskCmd))
return taskID, err
}
// Generate the run-task instruction
func generateRunTaskCmd() (string, string, error) {
taskID := uuid.New().String()
runTaskCmd := Event{
Header: Header{
Action: "run-task",
TaskID: taskID,
Streaming: "duplex",
},
Payload: Payload{
TaskGroup: "audio",
Task: "asr",
Function: "recognition",
Model: "fun-asr-realtime",
Parameters: Params{
Format: "wav",
SampleRate: 16000,
},
Input: Input{},
},
}
runTaskCmdJSON, err := json.Marshal(runTaskCmd)
return string(runTaskCmdJSON), taskID, err
}
// Wait for the task-started event
func waitForTaskStarted(taskStarted chan bool) {
select {
case <-taskStarted:
fmt.Println("The task started successfully")
case <-time.After(10 * time.Second):
log.Fatal("Timed out waiting for task-started. The task failed to start.")
}
}
// Send audio data
func sendAudioData(conn *websocket.Conn) error {
file, err := os.Open(audioFile)
if err != nil {
return err
}
defer file.Close()
buf := make([]byte, 1024)
for {
n, err := file.Read(buf)
if n == 0 {
break
}
if err != nil && err != io.EOF {
return err
}
err = conn.WriteMessage(websocket.BinaryMessage, buf[:n])
if err != nil {
return err
}
time.Sleep(100 * time.Millisecond)
}
return nil
}
// Send the finish-task instruction
func sendFinishTaskCmd(conn *websocket.Conn, taskID string) error {
finishTaskCmd, err := generateFinishTaskCmd(taskID)
if err != nil {
return err
}
err = conn.WriteMessage(websocket.TextMessage, []byte(finishTaskCmd))
return err
}
// Generate the finish-task instruction
func generateFinishTaskCmd(taskID string) (string, error) {
finishTaskCmd := Event{
Header: Header{
Action: "finish-task",
TaskID: taskID,
Streaming: "duplex",
},
Payload: Payload{
Input: Input{},
},
}
finishTaskCmdJSON, err := json.Marshal(finishTaskCmd)
return string(finishTaskCmdJSON), err
}
// Handle events
func handleEvent(conn *websocket.Conn, event Event, taskStarted chan<- bool, taskDone chan<- bool) bool {
switch event.Header.Event {
case "task-started":
fmt.Println("Received task-started event")
taskStarted <- true
case "result-generated":
if event.Payload.Output.Sentence.Text != "" {
fmt.Println("Recognition result:", event.Payload.Output.Sentence.Text)
}
if event.Payload.Usage != nil {
fmt.Println("Billable duration of the task (in seconds):", event.Payload.Usage.Duration)
}
case "task-finished":
fmt.Println("The task is complete")
taskDone <- true
return true
case "task-failed":
handleTaskFailed(event, conn)
taskDone <- true
return true
default:
log.Printf("Unexpected event: %v", event)
}
return false
}
// Handle the task-failed event
func handleTaskFailed(event Event, conn *websocket.Conn) {
if event.Header.ErrorMessage != "" {
log.Fatalf("The task failed: %s", event.Header.ErrorMessage)
} else {
log.Fatal("The task failed for an unknown reason")
}
}
// Close the connection
func closeConnection(conn *websocket.Conn) {
if conn != nil {
conn.Close()
}
}
Core concepts
Interaction sequence
The client and server interact in a strict sequence to ensure correct task execution.
1. Establish a connection: The client sends a WebSocket connection request to the server with authentication information in the request header.
2. Start the task: After the connection is established, the client sends a run-task instruction to specify the model and audio parameters.
3. Confirm the task: The server returns a task-started event to indicate that it is ready to receive audio.
4. Transfer data:
   - The client continuously sends binary audio data.
   - During recognition, the server repeatedly returns result-generated events in real time, which contain intermediate and final recognition results.
5. End the task: After all audio is sent, the client sends a finish-task instruction.
6. End confirmation: After processing all remaining audio, the server returns a task-finished event, which indicates that the task has completed successfully.
7. Close the connection: The client or server closes the WebSocket connection.
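The following is a minimal sketch of this sequence in Python, assuming the third-party websocket-client package (install with pip install websocket-client). The region URL, audio file name, and chunk size mirror the samples earlier in this topic and are placeholders you can adjust.
import json
import os
import time
import uuid

import websocket  # third-party package: websocket-client

API_KEY = os.environ["DASHSCOPE_API_KEY"]
URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/"  # Singapore region
TASK_ID = uuid.uuid4().hex  # 32-character task ID
AUDIO_FILE = "asr_example.wav"

# Steps 1-2: establish the connection and start the task
ws = websocket.create_connection(URL, header=[f"Authorization: bearer {API_KEY}"])
ws.send(json.dumps({
    "header": {"action": "run-task", "task_id": TASK_ID, "streaming": "duplex"},
    "payload": {"task_group": "audio", "task": "asr", "function": "recognition",
                "model": "fun-asr-realtime",
                "parameters": {"format": "wav", "sample_rate": 16000},
                "input": {}},
}))

# Step 3: wait for the task-started event
assert json.loads(ws.recv())["header"]["event"] == "task-started"

# Steps 4-5: stream binary audio, then end the task
with open(AUDIO_FILE, "rb") as f:
    while chunk := f.read(3200):  # roughly 100 ms of 16 kHz 16-bit mono PCM
        ws.send_binary(chunk)
        time.sleep(0.1)
ws.send(json.dumps({
    "header": {"action": "finish-task", "task_id": TASK_ID, "streaming": "duplex"},
    "payload": {"input": {}},
}))

# Steps 6-7: read events until the task finishes, then close the connection
while True:
    event = json.loads(ws.recv())
    if event["header"]["event"] == "result-generated":
        print(event["payload"]["output"]["sentence"]["text"])
    elif event["header"]["event"] in ("task-finished", "task-failed"):
        break
ws.close()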
Audio stream specifications
Channel: The binary audio sent to the server must be mono.
Format and encoding: The pcm, wav, mp3, opus, speex, aac, and amr formats are supported.
WAV files must use PCM encoding.
Opus or Speex files must be encapsulated in an Ogg container.
The amr format supports only the AMR-NB type.
Sample rate: The sample rate must be consistent with the sample_rate parameter specified in the run-task instruction and the requirements of the selected model.
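To avoid mismatches with these constraints, you can inspect a local WAV file before sending it. A small sketch using Python's standard wave module (the file name is a placeholder):
import wave

# Verify that a WAV file is mono PCM and note its sample rate before streaming it
with wave.open("asr_example.wav", "rb") as wav:  # the wave module only reads PCM-encoded WAV files
    assert wav.getnchannels() == 1, "audio sent to the server must be mono"
    assert wav.getsampwidth() == 2, "expecting 16-bit PCM samples"
    print("sample rate:", wav.getframerate())  # must match the sample_rate in the run-task instruction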
Model availability
International (Singapore)
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Price | Free quota (Note) |
fun-asr-realtime (currently equivalent to fun-asr-realtime-2025-11-07) | Stable | Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. This model also supports Mandarin accents from various Chinese regions, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. | 16 kHz | ApsaraVideo Live, conferences, call centers, and more | PCM, WAV, MP3, Opus, Speex, AAC, and AMR | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
fun-asr-realtime-2025-11-07 | Snapshot | Same as fun-asr-realtime | | | | | |
China (Beijing)
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Price |
fun-asr-realtime (equivalent to fun-asr-realtime-2025-11-07) | Stable | Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. This model also supports Mandarin accents from various Chinese regions and provinces, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. | 16 kHz | ApsaraVideo Live, conferences, call centers, and more | PCM, WAV, MP3, Opus, Speex, AAC, and AMR | $0.000047/second |
fun-asr-realtime-2025-11-07 (optimized for far-field Voice Activity Detection (VAD); provides higher recognition accuracy than fun-asr-realtime-2025-09-15) | Snapshot | Same as fun-asr-realtime | | | | |
fun-asr-realtime-2025-09-15 | Snapshot | Chinese (Mandarin), English | | | | |
API reference
Connection endpoint (URL)
China (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/inference/
International (Singapore): wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/
Headers
Parameter | Type | Required | Description |
Authorization | string | Yes | The authentication token. The format is bearer <your_api_key>, as shown in the sample code. |
user-agent | string | No | The client identifier. This helps the server track the source of the request. |
X-DashScope-WorkSpace | string | No | Model Studio workspace ID. |
X-DashScope-DataInspection | string | No | Specifies whether to enable the data compliance check feature. The default value is |
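For illustration, the following Python sketch opens the connection with these headers using the websocket-client package; the user-agent value and workspace ID are placeholders, and the optional headers can be omitted.
import os
import websocket  # third-party package: websocket-client

ws = websocket.create_connection(
    "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/",
    header=[
        f"Authorization: bearer {os.environ['DASHSCOPE_API_KEY']}",  # required
        "user-agent: my-asr-client/0.1",                             # optional client identifier (placeholder)
        # "X-DashScope-WorkSpace: ws_xxx",                           # optional workspace ID (placeholder)
    ],
)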
Instructions (client→server)
Instructions are JSON-formatted text messages sent by the client to control the lifecycle of a recognition task.
1. run-task instruction: Start a task
Purpose: After a connection is established, send this instruction to start a speech recognition task and configure its parameters.
Example:
{
"header": {
"action": "run-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"task_group": "audio",
"task": "asr",
"function": "recognition",
"model": "fun-asr-realtime",
"parameters": {
"format": "pcm",
"sample_rate": 16000,
"vocabulary_id": "vocab-xxx-24ee19fa8cfb4d52902170a0xxxxxxxx"
},
"input": {}
}
}
header parameters:
Parameter | Type | Required | Description |
header.action | string | Yes | Instruction type. Set to run-task. |
header.task_id | string | Yes | A unique ID for the task. Subsequent finish-task instructions must use the same task_id. |
header.streaming | string | Yes | The communication pattern is fixed to duplex. |
payload parameters:
Parameter | Type | Required | Description |
payload.task_group | string | Yes | Task group. Set to audio. |
payload.task | string | Yes | Task type. Set to asr. |
payload.function | string | Yes | Function type. Set to recognition. |
payload.model | string | Yes | The model to use. For more information, see the model list. |
payload.input | object | Yes | Input configuration. Set to an empty object {}. |
payload.parameters | | | |
format | string | Yes | Audio format. Supported formats: pcm, wav, mp3, opus, speex, aac, and amr. |
sample_rate | integer | Yes | Audio sample rate in Hz. The fun-asr-realtime model supports a sample rate of 16000 Hz. |
vocabulary_id | string | No | The vocabulary ID. For more information, see Custom vocabulary. This parameter is not set by default. |
semantic_punctuation_enabled | boolean | No | Specifies whether to enable semantic punctuation. Default value: false. Semantic punctuation more accurately identifies sentence boundaries, whereas VAD punctuation provides faster responses. |
max_sentence_silence | integer | No | The silence duration threshold for VAD, in milliseconds. In punctuation based on VAD, a sentence is considered to have ended if the silence duration exceeds this threshold. Default value: 1300. Value range: [200, 6000]. This parameter takes effect only when semantic_punctuation_enabled is false. |
multi_threshold_mode_enabled | boolean | No | Specifies whether to enable the feature that prevents excessively long sentences in VAD punctuation. Default value: false. This parameter takes effect only when semantic_punctuation_enabled is false. |
heartbeat | boolean | No | Specifies whether to enable the persistent connection keep-alive switch. Default value: false. Note: When this parameter is set to true, continuously send silent audio to the server to keep the connection alive during silence (see the FAQ). |
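As a reference, the following Python sketch assembles a run-task instruction from these parameters; the default values and the vocabulary ID in the usage comment are placeholders.
import json
import uuid

def build_run_task(model="fun-asr-realtime", audio_format="pcm", sample_rate=16000, **optional_parameters):
    """Return (task_id, JSON text) for a run-task instruction."""
    task_id = uuid.uuid4().hex  # 32-character ID; the later finish-task must reuse it
    message = {
        "header": {"action": "run-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {
            "task_group": "audio",
            "task": "asr",
            "function": "recognition",
            "model": model,
            # Optional keys such as vocabulary_id, semantic_punctuation_enabled,
            # max_sentence_silence, or heartbeat are merged in here.
            "parameters": {"format": audio_format, "sample_rate": sample_rate, **optional_parameters},
            "input": {},
        },
    }
    return task_id, json.dumps(message)

# Example usage: task_id, run_task_json = build_run_task(vocabulary_id="vocab-xxx")  # placeholder vocabulary ID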
2. finish-task instruction: End a task
Purpose: After the client finishes sending audio data, send this instruction to notify the server.
Example:
{
"header": {
"action": "finish-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"input": {}
}
}
header parameters:
Parameter | Type | Required | Description |
header.action | string | Yes | The instruction type. The value is fixed to finish-task. |
header.task_id | string | Yes | The task ID. It must be the same as the task_id in the corresponding run-task instruction. |
header.streaming | string | Yes | Communication pattern. The value is fixed to duplex. |
payload parameters:
Parameter | Type | Required | Description |
payload.input | object | Yes | Input configuration. Set to an empty object {}. |
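A matching Python sketch for the finish-task instruction, reusing the task_id returned by the run-task helper sketched above:
import json

def build_finish_task(task_id):
    # task_id must be identical to the one sent in the run-task instruction
    return json.dumps({
        "header": {"action": "finish-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {"input": {}},
    })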
Events (server→client)
Events are JSON-formatted text messages sent by the server to synchronize task status and recognition results.
1. task-started
Trigger: After the server successfully processes the run-task instruction.
Action: Notifies the client that the task has started and that it can begin sending audio data.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-started",
"attributes": {}
},
"payload": {}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. The value is fixed to task-started. |
header.task_id | string | The task ID. |
2. result-generated
Trigger: When the server generates a new recognition result during the recognition process.
Action: Returns real-time recognition results, including intermediate and final sentence results.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"begin_time": 170,
"end_time": 920,
"text": "Okay, I got it",
"heartbeat": false,
"sentence_end": true,
"words": [
{
"begin_time": 170,
"end_time": 295,
"text": "Okay",
"punctuation": ","
},
{
"begin_time": 295,
"end_time": 503,
"text": "I",
"punctuation": ""
},
{
"begin_time": 503,
"end_time": 711,
"text": "got",
"punctuation": ""
},
{
"begin_time": 711,
"end_time": 920,
"text": "it",
"punctuation": ""
}
]
}
},
"usage": {
"duration": 3
}
}
}
header parameters:
Parameter | Type | Description |
header.event | string | Event type. Set to result-generated. |
header.task_id | string | The task ID. |
payload parameters:
Parameter | Type | Description |
output | object | output.sentence is the recognition result. See the following section for details. |
usage | object | The usage information. When the sentence has not ended (intermediate result), this field is null. When the sentence ends, it contains the billable duration (see payload.usage below). |
The format of payload.usage is as follows:
Parameter | Type | Description |
duration | integer | The billable duration of the task in seconds. |
The payload.output.sentence object has the following format:
Parameter | Type | Description |
begin_time | integer | The start time of the sentence in ms. |
end_time | integer or null | The end time of the sentence in ms. This is null for intermediate recognition results. |
text | string | The recognized text. |
words | array | Word timestamp information. |
heartbeat | boolean or null | If this value is true, you can skip processing the recognition result. This value is consistent with the heartbeat parameter in the run-task instruction. |
sentence_end | boolean | Indicates whether the given sentence has ended. |
payload.output.sentence.words is a list of word timestamps. Each word has the following format:
Parameter | Type | Description |
begin_time | integer | The start time of the word in ms. |
end_time | integer | The end time of the word in ms. |
text | string | The recognized word. |
punctuation | string | Punctuation. |
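To illustrate how these fields fit together, the following Python sketch extracts the text, sentence state, and word timings from a decoded result-generated event (event is assumed to be the parsed JSON message):
def handle_result_generated(event):
    sentence = event["payload"]["output"]["sentence"]
    if sentence.get("heartbeat"):
        return  # keep-alive result; nothing to process
    state = "final" if sentence.get("sentence_end") else "intermediate"
    print(f"[{state}] {sentence['text']} (begins at {sentence['begin_time']} ms)")
    for word in sentence.get("words", []):
        print(f"  {word['begin_time']}-{word['end_time']} ms: {word['text']}{word['punctuation']}")
    usage = event["payload"].get("usage")
    if usage:
        print("billable duration (seconds):", usage["duration"])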
3. task-finished
Trigger: After the server receives a finish-task instruction and finishes processing all cached audio.
Action: Indicates that the recognition task has successfully ended.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {}
},
"payload": {
"output": {}
}
}
header parameters:
Parameter | Type | Description |
header.event | string | Event type. Set to task-finished. |
header.task_id | string | The task ID. |
4. task-failed
Trigger: When any error occurs during task processing.
Action: Notifies the client that the task has failed and provides the reason for the failure. After receiving this event, close the WebSocket connection and handle the error.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-failed",
"error_code": "CLIENT_ERROR",
"error_message": "request timeout after 23 seconds.",
"attributes": {}
},
"payload": {}
}
header parameters:
Parameter | Type | Description |
header.event | string | Event type. Set to task-failed. |
header.task_id | string | The task ID. |
header.error_code | string | A description of the error type. |
header.error_message | string | The specific reason for the error. |
Connection overhead and reuse
The WebSocket service supports connection reuse to improve resource utilization and avoid connection establishment overhead.
When the server receives a run-task instruction from the client, a new task starts. After the client sends a finish-task instruction, the server returns a task-finished event to end the task. After a task ends, the WebSocket connection can be reused. The client can send another run-task instruction to start a new task on the same connection.
Each task on a reused connection must have a unique task_id.
If a task fails during execution, the service returns a task-failed event and closes the connection. The connection cannot be reused.
If no new task is started within 60 seconds after a task ends, the connection automatically times out and closes.
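The following Python sketch illustrates this reuse pattern. It assumes an already-open ws connection and the build_run_task and build_finish_task helpers sketched earlier; pacing between audio chunks is omitted for brevity.
import json

def run_one_task(ws, audio_path):
    """Run a single recognition task on an existing, reusable WebSocket connection."""
    task_id, run_task_json = build_run_task(audio_format="wav")  # a new task_id for every task
    ws.send(run_task_json)
    assert json.loads(ws.recv())["header"]["event"] == "task-started"
    with open(audio_path, "rb") as f:
        while chunk := f.read(3200):
            ws.send_binary(chunk)
    ws.send(build_finish_task(task_id))
    # Drain events until this task ends; the connection stays open for the next task
    while json.loads(ws.recv())["header"]["event"] not in ("task-finished", "task-failed"):
        pass

for path in ["meeting_part1.wav", "meeting_part2.wav"]:  # placeholder file names
    run_one_task(ws, path)  # start the next task within 60 seconds to keep the connection alive
ws.close()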
Error codes
For information about how to troubleshoot errors, see Error messages.
FAQ
Features
Q: How can I maintain a persistent connection with the server during long periods of silence?
Set the heartbeat request parameter to true and continuously send silent audio to the server.
Note:
Silent audio: Content in an audio file or data stream that contains no sound signals.
Generation method: Use audio editing software, such as Audacity or Adobe Audition, or a command-line interface (CLI), such as FFmpeg, to create silent audio.
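For raw PCM streams, you can also synthesize the silence directly in code instead of preparing a file. A minimal Python sketch, assuming 16 kHz 16-bit mono PCM and an open ws connection as in the earlier examples:
import time

# 100 ms of silence for 16 kHz, 16-bit (2-byte) mono PCM: 16000 * 0.1 * 2 = 3200 zero bytes
SILENT_CHUNK = b"\x00" * 3200

def keep_alive(ws, seconds):
    """Send silent audio for the given duration so the connection stays open (requires heartbeat=true)."""
    for _ in range(int(seconds * 10)):
        ws.send_binary(SILENT_CHUNK)
        time.sleep(0.1)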
Q: How do I convert an audio format to a supported format?
You can use the FFmpeg tool. For more information, visit the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bitrate to control audio quality. Examples: 192k, 320k
# -ar: Specifies the sample rate
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites the existing file. No value is needed.
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bitrate -ar sample_rate -ac num_channels output.ext
# Example: Convert WAV to MP3 and maintain the original quality
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: Convert MP3 to WAV in the 16-bit PCM standard format
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 16000 -ac 2 output.wav
# Example: Convert M4A to AAC to extract or convert Apple audio
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: Convert lossless FLAC to Opus for high compression
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus
Q: Why use the WebSocket protocol instead of the HTTP/HTTPS protocol? Why not provide a RESTful API?
Voice Service uses WebSocket because it requires full-duplex communication. This allows the server and client to actively exchange data. For example, the server can push real-time progress updates for speech synthesis or recognition. RESTful APIs, which are based on HTTP, support only a one-way, client-initiated request-response pattern. This pattern cannot support real-time interaction.
Troubleshooting
For troubleshooting information about code errors, see Error codes.
Q: Why is speech not recognized (no recognition result)?
Ensure that the audio format (format) and sample rate (sample_rate) in the request parameters are set correctly and meet the parameter constraints. The following are common error examples:
- The audio file has a .wav extension but is in the MP3 format, and the format request parameter is incorrectly set to mp3.
- The audio sampling rate is 3600 Hz, but the sample_rate request parameter is incorrectly set to 48000.
You can use the ffprobe tool to retrieve information about the container, encoding, sample rate, and channels of the audio:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
If the preceding checks do not reveal any issues, you can use custom vocabulary to improve the recognition of specific words.