
Alibaba Cloud Model Studio: WebSocket API for Fun-ASR real-time speech recognition

Last Updated: Dec 25, 2025

This topic describes how to connect directly to the Fun-ASR real-time speech recognition service over the WebSocket protocol. This approach works with any programming language that supports WebSocket. To simplify integration for Java and Python developers, higher-level SDKs are also available: the Python SDK and the Java SDK. You can still use the general protocol described in this topic when you need maximum development flexibility.

User guide: For an introduction to the models and for model selection recommendations, see Real-time speech recognition - Fun-ASR/Paraformer.

Getting started

Preparations

  1. Create an API key. For security, export the API key as an environment variable.

  2. Download the sample audio file: asr_example.wav.

Sample code

Node.js

Install the required dependencies:

npm install ws
npm install uuid

The following is the sample code:

const fs = require('fs');
const WebSocket = require('ws');
const { v4: uuidv4 } = require('uuid'); // Used to generate a UUID

// API keys are different for the Singapore and China (Beijing) regions. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured environment variables, replace the following line with your Model Studio API key: const apiKey = "sk-xxx"
const apiKey = process.env.DASHSCOPE_API_KEY;
// The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference/
const url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server address
const audioFile = 'asr_example.wav'; // Replace with the path to your audio file

// Generate a 32-character random task ID
const TASK_ID = uuidv4().replace(/-/g, '').slice(0, 32);

// Create a WebSocket client
const ws = new WebSocket(url, {
  headers: {
    Authorization: `bearer ${apiKey}`
  }
});

let taskStarted = false; // A flag that indicates whether the task has started

// Send the run-task instruction when the connection is opened
ws.on('open', () => {
  console.log('Connected to the server');
  sendRunTask();
});

// Process received messages
ws.on('message', (data) => {
  const message = JSON.parse(data);
  switch (message.header.event) {
    case 'task-started':
      console.log('The task has started');
      taskStarted = true;
      sendAudioStream();
      break;
    case 'result-generated':
      console.log('Recognition result:', message.payload.output.sentence.text);
      if (message.payload.usage) {
        console.log('Billable duration of the task (in seconds):', message.payload.usage.duration);
      }
      break;
    case 'task-finished':
      console.log('The task is complete');
      ws.close();
      break;
    case 'task-failed':
      console.error('The task failed:', message.header.error_message);
      ws.close();
      break;
    default:
      console.log('Unknown event:', message.header.event);
  }
});

// When the connection closes, check whether the task ever started
ws.on('close', () => {
  if (!taskStarted) {
    console.error('The connection closed before the task started.');
  }
});

// Send the run-task instruction
function sendRunTask() {
  const runTaskMessage = {
    header: {
      action: 'run-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      task_group: 'audio',
      task: 'asr',
      function: 'recognition',
      model: 'fun-asr-realtime',
      parameters: {
        sample_rate: 16000,
        format: 'wav'
      },
      input: {}
    }
  };
  ws.send(JSON.stringify(runTaskMessage));
}

// Send the audio stream
function sendAudioStream() {
  const audioStream = fs.createReadStream(audioFile);
  let chunkCount = 0;

  function sendNextChunk() {
    const chunk = audioStream.read();
    if (chunk) {
      ws.send(chunk);
      chunkCount++;
      setTimeout(sendNextChunk, 100); // Send a chunk every 100 ms
    }
  }

  audioStream.on('readable', () => {
    sendNextChunk();
  });

  audioStream.on('end', () => {
    console.log('The audio stream has ended');
    sendFinishTask();
  });

  audioStream.on('error', (err) => {
    console.error('Error reading the audio file:', err);
    ws.close();
  });
}

// Send the finish-task instruction
function sendFinishTask() {
  const finishTaskMessage = {
    header: {
      action: 'finish-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      input: {}
    }
  };
  ws.send(JSON.stringify(finishTaskMessage));
}

// Handle errors
ws.on('error', (error) => {
  console.error('WebSocket error:', error);
});

C#

The following is the sample code:

using System.Net.WebSockets;
using System.Text;
using System.Text.Json;
using System.Text.Json.Nodes;

class Program {
    private static ClientWebSocket _webSocket = new ClientWebSocket();
    private static CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
    private static bool _taskStartedReceived = false;
    private static bool _taskFinishedReceived = false;
    // API keys are different for the Singapore and China (Beijing) regions. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    // If you have not configured environment variables, replace the following line with your Model Studio API key: private static readonly string ApiKey = "sk-xxx"
    private static readonly string ApiKey = Environment.GetEnvironmentVariable("DASHSCOPE_API_KEY") ?? throw new InvalidOperationException("DASHSCOPE_API_KEY environment variable is not set.");

    // The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference/
    private const string WebSocketUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/";
    // Replace with the path to your audio file
    private const string AudioFilePath = "asr_example.wav";

    static async Task Main(string[] args) {
        // Establish a WebSocket connection and configure headers for authentication
        _webSocket.Options.SetRequestHeader("Authorization", $"bearer {ApiKey}");

        await _webSocket.ConnectAsync(new Uri(WebSocketUrl), _cancellationTokenSource.Token);

        // Start a thread to asynchronously receive WebSocket messages
        var receiveTask = ReceiveMessagesAsync();

        // Send the run-task instruction
        string _taskId = Guid.NewGuid().ToString("N"); // Generate a 32-character random ID
        var runTaskJson = GenerateRunTaskJson(_taskId);
        await SendAsync(runTaskJson);

        // Wait for the task-started event
        while (!_taskStartedReceived) {
            await Task.Delay(100, _cancellationTokenSource.Token);
        }

        // Read the local file and send the audio stream to the server for recognition
        await SendAudioStreamAsync(AudioFilePath);

        // Send the finish-task instruction to end the task
        var finishTaskJson = GenerateFinishTaskJson(_taskId);
        await SendAsync(finishTaskJson);

        // Wait for the task-finished event
        while (!_taskFinishedReceived && !_cancellationTokenSource.IsCancellationRequested) {
            try {
                await Task.Delay(100, _cancellationTokenSource.Token);
            } catch (OperationCanceledException) {
                // The task has been canceled. Exit the loop.
                break;
            }
        }

        // Close the connection
        if (!_cancellationTokenSource.IsCancellationRequested) {
            await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", _cancellationTokenSource.Token);
        }

        _cancellationTokenSource.Cancel();
        try {
            await receiveTask;
        } catch (OperationCanceledException) {
            // Ignore the operation canceled exception
        }
    }

    private static async Task ReceiveMessagesAsync() {
        try {
            while (_webSocket.State == WebSocketState.Open && !_cancellationTokenSource.IsCancellationRequested) {
                var message = await ReceiveMessageAsync(_cancellationTokenSource.Token);
                if (message != null) {
                    var eventValue = message["header"]?["event"]?.GetValue<string>();
                    switch (eventValue) {
                        case "task-started":
                            Console.WriteLine("The task started successfully");
                            _taskStartedReceived = true;
                            break;
                        case "result-generated":
                            Console.WriteLine($"Recognition result: {message["payload"]?["output"]?["sentence"]?["text"]?.GetValue<string>()}");
                            if (message["payload"]?["usage"] != null && message["payload"]?["usage"]?["duration"] != null) {
                                Console.WriteLine($"Billable duration of the task (in seconds): {message["payload"]?["usage"]?["duration"]?.GetValue<int>()}");
                            }
                            break;
                        case "task-finished":
                            Console.WriteLine("The task is complete");
                            _taskFinishedReceived = true;
                            _cancellationTokenSource.Cancel();
                            break;
                        case "task-failed":
                            Console.WriteLine($"The task failed: {message["header"]?["error_message"]?.GetValue<string>()}");
                            _cancellationTokenSource.Cancel();
                            break;
                    }
                }
            }
        } catch (OperationCanceledException) {
            // Ignore the operation canceled exception
        }
    }

    private static async Task<JsonNode?> ReceiveMessageAsync(CancellationToken cancellationToken) {
        var buffer = new byte[1024 * 4];
        using var ms = new MemoryStream();
        WebSocketReceiveResult result;

        do {
            result = await _webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), cancellationToken);
            if (result.MessageType == WebSocketMessageType.Close) {
                await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", cancellationToken);
                return null;
            }
            ms.Write(buffer, 0, result.Count);
        } while (!result.EndOfMessage); // Keep reading until the complete message is received

        var message = Encoding.UTF8.GetString(ms.ToArray());
        return JsonNode.Parse(message);
    }

    private static async Task SendAsync(string message) {
        var buffer = Encoding.UTF8.GetBytes(message);
        var segment = new ArraySegment<byte>(buffer);
        await _webSocket.SendAsync(segment, WebSocketMessageType.Text, true, _cancellationTokenSource.Token);
    }

    private static async Task SendAudioStreamAsync(string filePath) {
        using (var audioStream = File.OpenRead(filePath)) {
            var buffer = new byte[1024]; // Send 1,024 bytes of audio data per message
            int bytesRead;

            while ((bytesRead = await audioStream.ReadAsync(buffer, 0, buffer.Length)) > 0) {
                var segment = new ArraySegment<byte>(buffer, 0, bytesRead);
                await _webSocket.SendAsync(segment, WebSocketMessageType.Binary, true, _cancellationTokenSource.Token);
                await Task.Delay(100); // 100 ms interval
            }
        }
    }

    private static string GenerateRunTaskJson(string taskId) {
        var runTask = new JsonObject {
            ["header"] = new JsonObject {
                ["action"] = "run-task",
                ["task_id"] = taskId,
                ["streaming"] = "duplex"
            },
            ["payload"] = new JsonObject {
                ["task_group"] = "audio",
                ["task"] = "asr",
                ["function"] = "recognition",
                ["model"] = "fun-asr-realtime",
                ["parameters"] = new JsonObject {
                    ["format"] = "wav",
                    ["sample_rate"] = 16000,
                },
                ["input"] = new JsonObject()
            }
        };
        return JsonSerializer.Serialize(runTask);
    }

    private static string GenerateFinishTaskJson(string taskId) {
        var finishTask = new JsonObject {
            ["header"] = new JsonObject {
                ["action"] = "finish-task",
                ["task_id"] = taskId,
                ["streaming"] = "duplex"
            },
            ["payload"] = new JsonObject {
                ["input"] = new JsonObject()
            }
        };
        return JsonSerializer.Serialize(finishTask);
    }
}

PHP

The sample code has the following directory structure:

my-php-project/
├── composer.json
├── vendor/
└── index.php

The following is the content of composer.json. Specify the dependency version numbers as needed:

{
    "require": {
        "react/event-loop": "^1.3",
        "react/socket": "^1.11",
        "react/stream": "^1.2",
        "react/http": "^1.1",
        "ratchet/pawl": "^0.4"
    },
    "autoload": {
        "psr-4": {
            "App\\": "src/"
        }
    }
}

The following is the content of index.php:

<?php

require __DIR__ . '/vendor/autoload.php';

use Ratchet\Client\Connector;
use React\EventLoop\Loop;
use React\Socket\Connector as SocketConnector;
use Ratchet\rfc6455\Messaging\Frame;

// API keys are different for the Singapore and China (Beijing) regions. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not configured environment variables, replace the following line with your Model Studio API key: $api_key = "sk-xxx"
$api_key = getenv("DASHSCOPE_API_KEY");
// The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference/
$websocket_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/';
$audio_file_path = 'asr_example.wav'; // Replace with the path to your audio file

$loop = Loop::get();

// Create a custom connector
$socketConnector = new SocketConnector($loop, [
    'tcp' => [
        'bindto' => '0.0.0.0:0',
    ],
    'tls' => [
        'verify_peer' => false,
        'verify_peer_name' => false,
    ],
]);

$connector = new Connector($loop, $socketConnector);

$headers = [
    'Authorization' => 'bearer ' . $api_key
];

$connector($websocket_url, [], $headers)->then(function ($conn) use ($loop, $audio_file_path) {
    echo "Connected to the WebSocket server\n";

    // Register a handler to asynchronously receive WebSocket messages
    $conn->on('message', function($msg) use ($conn, $loop, $audio_file_path, &$taskId) {
        $response = json_decode($msg, true);

        if (isset($response['header']['event'])) {
            handleEvent($conn, $response, $loop, $audio_file_path, $taskId);
        } else {
            echo "Unknown message format\n";
        }
    });

    // Listen for connection closure
    $conn->on('close', function($code = null, $reason = null) {
        echo "Connection closed\n";
        if ($code !== null) {
            echo "Close code: " . $code . "\n";
        }
        if ($reason !== null) {
            echo "Close reason: " . $reason . "\n";
        }
    });

    // Generate a task ID
    $taskId = generateTaskId();

    // Send the run-task instruction
    sendRunTaskMessage($conn, $taskId);

}, function ($e) {
    echo "Cannot connect: {$e->getMessage()}\n";
});

$loop->run();

/**
 * Generate a task ID
 * @return string
 */
function generateTaskId(): string {
    return bin2hex(random_bytes(16));
}

/**
 * Send the run-task instruction
 * @param $conn
 * @param $taskId
 */
function sendRunTaskMessage($conn, $taskId) {
    $runTaskMessage = json_encode([
        "header" => [
            "action" => "run-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "task_group" => "audio",
            "task" => "asr",
            "function" => "recognition",
            "model" => "fun-asr-realtime",
            "parameters" => [
                "format" => "wav",
                "sample_rate" => 16000
            ],
            "input" => []
        ]
    ]);
    echo "Preparing to send the run-task instruction: " . $runTaskMessage . "\n";
    $conn->send($runTaskMessage);
    echo "The run-task instruction has been sent\n";
}

/**
 * Read the audio file
 * @param string $filePath
 * @return bool|string
 */
function readAudioFile(string $filePath) {
    $voiceData = file_get_contents($filePath);
    if ($voiceData === false) {
        echo "Cannot read the audio file\n";
    }
    return $voiceData;
}

/**
 * Split the audio data
 * @param string $data
 * @param int $chunkSize
 * @return array
 */
function splitAudioData(string $data, int $chunkSize): array {
    return str_split($data, $chunkSize);
}

/**
 * Send the finish-task instruction
 * @param $conn
 * @param $taskId
 */
function sendFinishTaskMessage($conn, $taskId) {
    $finishTaskMessage = json_encode([
        "header" => [
            "action" => "finish-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "input" => []
        ]
    ]);
    echo "Preparing to send the finish-task instruction: " . $finishTaskMessage . "\n";
    $conn->send($finishTaskMessage);
    echo "The finish-task instruction has been sent\n";
}

/**
 * Handle events
 * @param $conn
 * @param $response
 * @param $loop
 * @param $audio_file_path
 * @param $taskId The task ID generated for the run-task instruction; finish-task must reuse it
 */
function handleEvent($conn, $response, $loop, $audio_file_path, $taskId) {
    static $chunks;
    static $allChunksSent = false;

    switch ($response['header']['event']) {
        case 'task-started':
            echo "Task started. Sending audio data...\n";
            // Read the audio file
            $voiceData = readAudioFile($audio_file_path);
            if ($voiceData === false) {
                echo "Cannot read the audio file\n";
                $conn->close();
                return;
            }

            // Split the audio data
            $chunks = splitAudioData($voiceData, 1024);

            // Define the send function
            $sendChunk = function() use ($conn, &$chunks, $loop, &$sendChunk, &$allChunksSent, $taskId) {
                if (!empty($chunks)) {
                    $chunk = array_shift($chunks);
                    $binaryMsg = new Frame($chunk, true, Frame::OP_BINARY);
                    $conn->send($binaryMsg);
                    // Send the next segment after 100 ms
                    $loop->addTimer(0.1, $sendChunk);
                } else {
                    echo "All blocks have been sent\n";
                    $allChunksSent = true;

                    // Send the finish-task instruction
                    sendFinishTaskMessage($conn, $taskId);
                }
            };

            // Start sending audio data
            $sendChunk();
            break;
        case 'result-generated':
            $result = $response['payload']['output']['sentence'];
            echo "Recognition result: " . $result['text'] . "\n";
            if (isset($response['payload']['usage']['duration'])) {
                echo "Billable duration of the task (in seconds): " . $response['payload']['usage']['duration'] . "\n";
            }
            break;
        case 'task-finished':
            echo "The task is complete\n";
            $conn->close();
            break;
        case 'task-failed':
            echo "The task failed\n";
            echo "Error code: " . $response['header']['error_code'] . "\n";
            echo "Error message: " . $response['header']['error_message'] . "\n";
            $conn->close();
            break;
        case 'error':
            echo "Error: " . $response['payload']['message'] . "\n";
            break;
        default:
            echo "Unknown event: " . $response['header']['event'] . "\n";
            break;
    }

    // If all data has been sent and the task is complete, close the connection
    if ($allChunksSent && $response['header']['event'] == 'task-finished') {
        // Wait for 1 second to ensure all data has been transferred
        $loop->addTimer(1, function() use ($conn) {
            $conn->close();
            echo "The client closes the connection\n";
        });
    }
}

Go

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/google/uuid"
	"github.com/gorilla/websocket"
)

const (
	// The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference/
	wsURL     = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/" // WebSocket server address
	audioFile = "asr_example.wav"                                   // Replace with the path to your audio file
)

var dialer = websocket.DefaultDialer

func main() {
	// API keys are different for the Singapore and China (Beijing) regions. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
	// If you have not configured environment variables, replace the following line with your Model Studio API key: apiKey := "sk-xxx"
	apiKey := os.Getenv("DASHSCOPE_API_KEY")

	// Connect to the WebSocket service
	conn, err := connectWebSocket(apiKey)
	if err != nil {
		log.Fatal("Failed to connect to WebSocket:", err)
	}
	defer closeConnection(conn)

	// Start a goroutine to receive results
	taskStarted := make(chan bool)
	taskDone := make(chan bool)
	startResultReceiver(conn, taskStarted, taskDone)

	// Send the run-task instruction
	taskID, err := sendRunTaskCmd(conn)
	if err != nil {
		log.Fatal("Failed to send the run-task instruction:", err)
	}

	// Wait for the task-started event
	waitForTaskStarted(taskStarted)

	// Send the audio file stream for recognition
	if err := sendAudioData(conn); err != nil {
		log.Fatal("Failed to send audio:", err)
	}

	// Send the finish-task instruction
	if err := sendFinishTaskCmd(conn, taskID); err != nil {
		log.Fatal("Failed to send the finish-task instruction:", err)
	}

	// Wait for the task to complete or fail
	<-taskDone
}

// Define a struct to represent JSON data
type Header struct {
	Action       string                 `json:"action"`
	TaskID       string                 `json:"task_id"`
	Streaming    string                 `json:"streaming"`
	Event        string                 `json:"event"`
	ErrorCode    string                 `json:"error_code,omitempty"`
	ErrorMessage string                 `json:"error_message,omitempty"`
	Attributes   map[string]interface{} `json:"attributes"`
}

type Output struct {
	Sentence struct {
		BeginTime int64  `json:"begin_time"`
		EndTime   *int64 `json:"end_time"`
		Text      string `json:"text"`
		Words     []struct {
			BeginTime   int64  `json:"begin_time"`
			EndTime     *int64 `json:"end_time"`
			Text        string `json:"text"`
			Punctuation string `json:"punctuation"`
		} `json:"words"`
	} `json:"sentence"`
}

type Payload struct {
	TaskGroup  string `json:"task_group"`
	Task       string `json:"task"`
	Function   string `json:"function"`
	Model      string `json:"model"`
	Parameters Params `json:"parameters"`
	Input      Input  `json:"input"`
	Output     Output `json:"output,omitempty"`
	Usage      *struct {
		Duration int `json:"duration"`
	} `json:"usage,omitempty"`
}

type Params struct {
	Format                   string `json:"format"`
	SampleRate               int    `json:"sample_rate"`
	VocabularyID             string `json:"vocabulary_id"`
	DisfluencyRemovalEnabled bool   `json:"disfluency_removal_enabled"`
}

type Input struct {
}

type Event struct {
	Header  Header  `json:"header"`
	Payload Payload `json:"payload"`
}

// Connect to the WebSocket service
func connectWebSocket(apiKey string) (*websocket.Conn, error) {
	header := make(http.Header)
	header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
	conn, _, err := dialer.Dial(wsURL, header)
	return conn, err
}

// Start a goroutine to asynchronously receive WebSocket messages
func startResultReceiver(conn *websocket.Conn, taskStarted chan<- bool, taskDone chan<- bool) {
	go func() {
		for {
			_, message, err := conn.ReadMessage()
			if err != nil {
				log.Println("Failed to parse the server message:", err)
				return
			}
			var event Event
			err = json.Unmarshal(message, &event)
			if err != nil {
				log.Println("Failed to parse the event:", err)
				continue
			}
			if handleEvent(conn, event, taskStarted, taskDone) {
				return
			}
		}
	}()
}

// Send the run-task instruction
func sendRunTaskCmd(conn *websocket.Conn) (string, error) {
	runTaskCmd, taskID, err := generateRunTaskCmd()
	if err != nil {
		return "", err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(runTaskCmd))
	return taskID, err
}

// Generate the run-task instruction
func generateRunTaskCmd() (string, string, error) {
	taskID := uuid.New().String()
	runTaskCmd := Event{
		Header: Header{
			Action:    "run-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			TaskGroup: "audio",
			Task:      "asr",
			Function:  "recognition",
			Model:     "fun-asr-realtime",
			Parameters: Params{
				Format:     "wav",
				SampleRate: 16000,
			},
			Input: Input{},
		},
	}
	runTaskCmdJSON, err := json.Marshal(runTaskCmd)
	return string(runTaskCmdJSON), taskID, err
}

// Wait for the task-started event
func waitForTaskStarted(taskStarted chan bool) {
	select {
	case <-taskStarted:
		fmt.Println("The task started successfully")
	case <-time.After(10 * time.Second):
		log.Fatal("Timed out waiting for task-started. The task failed to start.")
	}
}

// Send audio data
func sendAudioData(conn *websocket.Conn) error {
	file, err := os.Open(audioFile)
	if err != nil {
		return err
	}
	defer file.Close()

	buf := make([]byte, 1024)
	for {
		n, err := file.Read(buf)
		if n == 0 {
			break
		}
		if err != nil && err != io.EOF {
			return err
		}
		err = conn.WriteMessage(websocket.BinaryMessage, buf[:n])
		if err != nil {
			return err
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

// Send the finish-task instruction
func sendFinishTaskCmd(conn *websocket.Conn, taskID string) error {
	finishTaskCmd, err := generateFinishTaskCmd(taskID)
	if err != nil {
		return err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(finishTaskCmd))
	return err
}

// Generate the finish-task instruction
func generateFinishTaskCmd(taskID string) (string, error) {
	finishTaskCmd := Event{
		Header: Header{
			Action:    "finish-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			Input: Input{},
		},
	}
	finishTaskCmdJSON, err := json.Marshal(finishTaskCmd)
	return string(finishTaskCmdJSON), err
}

// Handle events
func handleEvent(conn *websocket.Conn, event Event, taskStarted chan<- bool, taskDone chan<- bool) bool {
	switch event.Header.Event {
	case "task-started":
		fmt.Println("Received task-started event")
		taskStarted <- true
	case "result-generated":
		if event.Payload.Output.Sentence.Text != "" {
			fmt.Println("Recognition result:", event.Payload.Output.Sentence.Text)
		}
		if event.Payload.Usage != nil {
			fmt.Println("Billable duration of the task (in seconds):", event.Payload.Usage.Duration)
		}
	case "task-finished":
		fmt.Println("The task is complete")
		taskDone <- true
		return true
	case "task-failed":
		handleTaskFailed(event, conn)
		taskDone <- true
		return true
	default:
		log.Printf("Unexpected event: %v", event)
	}
	return false
}

// Handle the task-failed event
func handleTaskFailed(event Event, conn *websocket.Conn) {
	if event.Header.ErrorMessage != "" {
		log.Fatalf("The task failed: %s", event.Header.ErrorMessage)
	} else {
		log.Fatal("The task failed for an unknown reason")
	}
}

// Close the connection
func closeConnection(conn *websocket.Conn) {
	if conn != nil {
		conn.Close()
	}
}

Core concepts

Interaction sequence

The client and server interact in a strict sequence to ensure correct task execution.

  1. Establish a connection: The client sends a WebSocket connection request to the server with authentication information in the request header.

  2. Start the task: After the connection is established, the client sends a run-task instruction to specify the model and audio parameters.

  3. Confirm the task: The server returns a task-started event to indicate that it is ready to receive audio.

  4. Transfer data:

    • The client continuously sends binary audio data.

    • During recognition, the server repeatedly returns result-generated events in real time, which contain intermediate and final recognition results.

  5. End the task: After all audio is sent, the client sends a finish-task instruction.

  6. End confirmation: After processing all remaining audio, the server returns a task-finished event, which indicates that the task has completed successfully.

  7. Close the connection: The client or server closes the WebSocket connection.

Audio stream specifications

  • Channel: The binary audio sent to the server must be mono.

  • Format and encoding: The pcm, wav, mp3, opus, speex, aac, and amr formats are supported.

    • WAV files must use PCM encoding.

    • Opus or Speex files must be encapsulated in an Ogg container.

    • The amr format supports only the AMR-NB type.

  • Sample rate: The sample rate must be consistent with the sample_rate parameter specified in the run-task instruction and the requirements of the selected model.
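
The sample programs in this topic pace their uploads by sending a chunk of audio roughly every 100 ms. For PCM-encoded audio, the number of bytes that corresponds to 100 ms follows from the sample rate, sample width, and channel count. The following Go sketch shows the calculation under the assumption of 16-bit mono PCM at 16 kHz:

package main

import "fmt"

func main() {
	const (
		sampleRate     = 16000 // Hz; must match the sample_rate parameter in run-task
		bytesPerSample = 2     // 16-bit PCM
		channels       = 1     // the service requires mono audio
	)
	// Number of bytes that correspond to 100 ms of audio.
	chunkBytes := sampleRate / 10 * bytesPerSample * channels
	fmt.Println("bytes per 100 ms:", chunkBytes) // prints 3200
}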

Model availability

International (Singapore)

Models: fun-asr-realtime (Stable; currently equivalent to fun-asr-realtime-2025-11-07) and fun-asr-realtime-2025-11-07 (Snapshot).

  • Supported languages: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. The models also support Mandarin accents from various Chinese regions, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia.

  • Supported sample rate: 16 kHz

  • Scenarios: ApsaraVideo Live, conferences, call centers, and more

  • Supported audio formats: PCM, WAV, MP3, Opus, Speex, AAC, and AMR

  • Price: $0.00009/second

  • Free quota (Note): 36,000 seconds (10 hours), valid for 90 days

China (Beijing)

Models:

  • fun-asr-realtime (Stable): Equivalent to fun-asr-realtime-2025-11-07.

  • fun-asr-realtime-2025-11-07 (Snapshot): Optimized for far-field Voice Activity Detection (VAD); provides higher recognition accuracy than fun-asr-realtime-2025-09-15.

  • fun-asr-realtime-2025-09-15: Supports Chinese (Mandarin) and English.

Specifications:

  • Supported languages (fun-asr-realtime and fun-asr-realtime-2025-11-07): Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. These versions also support Mandarin accents from various Chinese regions and provinces, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia.

  • Supported sample rate: 16 kHz

  • Scenarios: ApsaraVideo Live, conferences, call centers, and more

  • Supported audio formats: PCM, WAV, MP3, Opus, Speex, AAC, and AMR

  • Price: $0.000047/second

API reference

Connection endpoint (URL)

China (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/inference/
International (Singapore): wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/

Headers

Authorization (string, required): The authentication token. The format is Bearer <your_api_key>. Replace "<your_api_key>" with your actual API key.

user-agent (string, optional): The client identifier. This helps the server track the source of the request.

X-DashScope-WorkSpace (string, optional): The Model Studio workspace ID.

X-DashScope-DataInspection (string, optional): Specifies whether to enable the data compliance check feature. The default value is enable. Do not enable this parameter unless it is necessary.
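
For reference, the following sketch shows how these headers could be set when dialing the endpoint with gorilla/websocket, the library used in the Go sample above. Only Authorization is required; the user-agent and workspace ID values are placeholders.

// Assumes: import ("fmt"; "net/http"; "github.com/gorilla/websocket")
func dialWithHeaders(apiKey string) (*websocket.Conn, error) {
	header := make(http.Header)
	header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey)) // required
	header.Add("user-agent", "my-asr-client/1.0")                 // optional client identifier (placeholder value)
	header.Add("X-DashScope-WorkSpace", "ws_xxxxxxxx")            // optional workspace ID (placeholder value)
	conn, _, err := websocket.DefaultDialer.Dial("wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/", header)
	return conn, err
}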

Instructions (client→server)

Instructions are JSON-formatted text messages sent by the client to control the lifecycle of a recognition task.

1. run-task instruction: Start a task

Purpose: After a connection is established, send this instruction to start a speech recognition task and configure its parameters.

Example:

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "asr",
        "function": "recognition",
        "model": "fun-asr-realtime",
        "parameters": {
            "format": "pcm",
            "sample_rate": 16000,
            "vocabulary_id": "vocab-xxx-24ee19fa8cfb4d52902170a0xxxxxxxx"
        },
        "input": {}
    }
}

header parameters:

header.action (string, required): The instruction type. Set to run-task.

header.task_id (string, required): A unique ID for the task. The subsequent finish-task instruction must use the same task_id.

header.streaming (string, required): The communication pattern. The value is fixed to duplex.

payload parameters:

payload.task_group (string, required): Task group. Set to audio.

payload.task (string, required): Task type. Set to asr.

payload.function (string, required): Function type. Set to recognition.

payload.model (string, required): The model to use. For more information, see the model list.

payload.input (object, required): Input configuration. Set to an empty object {}.

payload.parameters:

format (string, required): Audio format. Supported formats: pcm, wav, mp3, opus, speex, aac, and amr. For detailed constraints, see Audio stream specifications.

sample_rate (integer, required): Audio sample rate in Hz. The fun-asr-realtime model supports a sample rate of 16000 Hz.

vocabulary_id (string, optional): The vocabulary ID. For more information, see Custom vocabulary. This parameter is not set by default.

semantic_punctuation_enabled (boolean, optional): Specifies whether to enable semantic punctuation. Default value: false.

  • true: Uses semantic punctuation and disables punctuation based on Voice Activity Detection (VAD). This is suitable for meeting transcription and provides high accuracy.

  • false: Uses VAD punctuation and disables semantic punctuation. This is suitable for interactive scenarios and provides low latency.

Semantic punctuation identifies sentence boundaries more accurately, and VAD punctuation responds faster. Adjust semantic_punctuation_enabled to switch between the two methods for different scenarios.

max_sentence_silence (integer, optional): The silence duration threshold for VAD, in milliseconds (ms). In VAD-based punctuation, a sentence is considered to have ended when the silence duration exceeds this threshold. Default value: 1300. Value range: [200, 6000]. This parameter takes effect only when semantic_punctuation_enabled is set to false (VAD punctuation is used).

multi_threshold_mode_enabled (boolean, optional): Specifies whether to enable the feature that prevents excessively long sentences in VAD punctuation. Default value: false.

  • true: Enables the feature, which limits the length of sentences split by VAD to avoid overly long segments.

  • false: Disables the feature.

This parameter takes effect only when semantic_punctuation_enabled is set to false (VAD punctuation is used).

heartbeat (boolean, optional): Specifies whether to keep the persistent connection alive. Default value: false.

  • true: The connection with the server is maintained without interruption as long as you continuously send silent audio.

  • false: The connection times out and closes after 60 seconds, even if you continuously send silent audio.

Note:

  • Silent audio: Content in an audio file or data stream that contains no sound signals.

  • Generation method: Use audio editing software, such as Audacity or Adobe Audition, or a command-line tool, such as FFmpeg, to create silent audio.
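
The optional parameters above can be combined in a single run-task instruction. The following is an illustrative example (the task and vocabulary IDs are placeholders); it keeps VAD punctuation (semantic_punctuation_enabled set to false), shortens the sentence-break silence threshold, and enables the keep-alive switch:

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "asr",
        "function": "recognition",
        "model": "fun-asr-realtime",
        "parameters": {
            "format": "pcm",
            "sample_rate": 16000,
            "vocabulary_id": "vocab-xxx-24ee19fa8cfb4d52902170a0xxxxxxxx",
            "semantic_punctuation_enabled": false,
            "max_sentence_silence": 800,
            "heartbeat": true
        },
        "input": {}
    }
}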

2. finish-task instruction: End a task

Purpose: After the client finishes sending audio data, send this instruction to notify the server.

Example:

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}
    }
}

header parameters:

header.action (string, required): The instruction type. The value is fixed to finish-task.

header.task_id (string, required): The task ID. It must be the same as the task_id in the run-task instruction.

header.streaming (string, required): The communication pattern. The value is fixed to duplex.

payload parameters:

payload.input (object, required): Input configuration. Set to an empty object {}.

Events (server→client)

Events are JSON-formatted text messages sent by the server to synchronize task status and recognition results.

1. task-started

Trigger: After the server successfully processes the run-task instruction.
Action: Notifies the client that the task has started and that it can begin sending audio data.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-started",
        "attributes": {}
    },
    "payload": {}
}

header parameters:

header.event (string): The event type. The value is fixed to task-started.

header.task_id (string): The task ID.

2. result-generated

Trigger: When the server generates a new recognition result during the recognition process.
Action: Returns real-time recognition results, including intermediate and final sentence results.

Example:

{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "result-generated",
    "attributes": {}
  },
  "payload": {
    "output": {
      "sentence": {
        "begin_time": 170,
        "end_time": 920,
        "text": "Okay, I got it",
        "heartbeat": false,
        "sentence_end": true,
        "words": [
          {
            "begin_time": 170,
            "end_time": 295,
            "text": "Okay",
            "punctuation": ","
          },
          {
            "begin_time": 295,
            "end_time": 503,
            "text": "I",
            "punctuation": ""
          },
          {
            "begin_time": 503,
            "end_time": 711,
            "text": "got",
            "punctuation": ""
          },
          {
            "begin_time": 711,
            "end_time": 920,
            "text": "it",
            "punctuation": ""
          }
        ]
      }
    },
    "usage": {
      "duration": 3
    }
  }
}

header parameters:

header.event (string): The event type. The value is fixed to result-generated.

header.task_id (string): The task ID.

payload parameters:

output (object): output.sentence contains the recognition result. See the payload.output.sentence format below.

usage (object): When payload.output.sentence.sentence_end is false (the current sentence is not finished), usage is null. When payload.output.sentence.sentence_end is true (the current sentence is finished), usage.duration is the billable duration of the current task in seconds.

The payload.usage object has the following format:

duration (integer): The billable duration of the task in seconds.

The payload.output.sentence object has the following format:

begin_time (integer): The start time of the sentence, in milliseconds.

end_time (integer or null): The end time of the sentence, in milliseconds. This is null for intermediate recognition results.

text (string): The recognized text.

words (array): Word timestamp information.

heartbeat (boolean or null): If this value is true, you can skip processing the recognition result. This value is consistent with the heartbeat parameter in the run-task instruction.

sentence_end (boolean): Indicates whether the sentence has ended.

payload.output.sentence.words is a list of word timestamps. Each word has the following format:

begin_time (integer): The start time of the word, in milliseconds.

end_time (integer): The end time of the word, in milliseconds.

text (string): The recognized word.

punctuation (string): Punctuation.
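
As an illustration of how these fields are typically consumed, the following Go sketch (building on the Go sample earlier in this topic, with the sentence struct extended by the sentence_end and heartbeat fields described above) skips heartbeat results, prints intermediate results, and keeps only sentences whose sentence_end is true:

// Assumed extension of the sentence structure from the Go sample above; the JSON
// tags follow the payload.output.sentence format described in this section.
type Sentence struct {
	Text        string `json:"text"`
	SentenceEnd bool   `json:"sentence_end"`
	Heartbeat   bool   `json:"heartbeat"` // a null value unmarshals as false
}

// handleSentence collects final sentences and prints intermediate ones.
func handleSentence(s Sentence, finals *[]string) {
	if s.Heartbeat {
		// Heartbeat results carry no new recognition content and can be skipped.
		return
	}
	if s.SentenceEnd {
		// The sentence is final; keep it.
		*finals = append(*finals, s.Text)
	} else {
		// Intermediate result; later events may revise it.
		fmt.Println("partial:", s.Text)
	}
}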

3. task-finished

Trigger: After the server receives a finish-task instruction and finishes processing all cached audio.
Action: Indicates that the recognition task has successfully ended.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-finished",
        "attributes": {}
    },
    "payload": {
        "output": {}
    }
}

header parameters:

header.event (string): The event type. The value is fixed to task-finished.

header.task_id (string): The task ID.

4. task-failed

Trigger: When any error occurs during task processing.
Action: Notifies the client that the task has failed and provides the reason for the failure. After receiving this event, close the WebSocket connection and handle the error.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-failed",
        "error_code": "CLIENT_ERROR",
        "error_message": "request timeout after 23 seconds.",
        "attributes": {}
    },
    "payload": {}
}

header parameters:

header.event (string): The event type. The value is fixed to task-failed.

header.task_id (string): The task ID.

header.error_code (string): A description of the error type.

header.error_message (string): The specific reason for the error.

Connection overhead and reuse

The WebSocket service supports connection reuse to improve resource utilization and avoid connection establishment overhead.

When the server receives a run-task instruction from the client, a new task starts. After the client sends a finish-task instruction, the server returns a task-finished event to end the task. After a task ends, the WebSocket connection can be reused. The client can send another run-task instruction to start a new task on the same connection.
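
For example, with the helper functions from the Go sample earlier in this topic (connectWebSocket, startResultReceiver, sendRunTaskCmd, waitForTaskStarted, sendAudioData, sendFinishTaskCmd, and closeConnection), running two consecutive tasks on one connection could look like the following sketch. Each task gets its own task_id, which sendRunTaskCmd already generates:

func runTwoTasksOnOneConnection(apiKey string) error {
	conn, err := connectWebSocket(apiKey)
	if err != nil {
		return err
	}
	defer closeConnection(conn)

	for i := 0; i < 2; i++ {
		taskStarted := make(chan bool)
		taskDone := make(chan bool)
		// The receiver goroutine exits after task-finished, so start a new one per task.
		startResultReceiver(conn, taskStarted, taskDone)

		taskID, err := sendRunTaskCmd(conn) // a new task_id for every task
		if err != nil {
			return err
		}
		waitForTaskStarted(taskStarted)
		if err := sendAudioData(conn); err != nil {
			return err
		}
		if err := sendFinishTaskCmd(conn, taskID); err != nil {
			return err
		}
		<-taskDone // wait for task-finished before reusing the connection
	}
	return nil
}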

Important
  1. Each task on a reused connection must have a unique task_id.

  2. If a task fails during execution, the service returns a task-failed event and closes the connection. The connection cannot be reused.

  3. If no new task is started within 60 seconds after a task ends, the connection automatically times out and closes.

Error codes

For information about how to troubleshoot errors, see Error messages.

FAQ

Features

Q: How can I maintain a persistent connection with the server during long periods of silence?

Set the heartbeat request parameter to true and continuously send silent audio to the server.

Note:

  • Silent audio: Content in an audio file or data stream that contains no sound signals.

  • Generation method: Use audio editing software, such as Audacity or Adobe Audition, or a command-line interface (CLI), such as FFmpeg, to create silent audio.
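
For example, with a gorilla/websocket connection (as in the Go sample earlier in this topic) and a run-task instruction that set "heartbeat": true and "format": "pcm", silence can be sent as zero-filled PCM buffers. The following sketch assumes 16-bit mono PCM at 16 kHz, so 3,200 bytes correspond to about 100 ms of silence; the 5-second duration is arbitrary:

// conn is the *websocket.Conn returned by connectWebSocket in the Go sample.
silence := make([]byte, 3200) // zero-filled buffer = 100 ms of silent 16-bit mono PCM at 16 kHz
for i := 0; i < 50; i++ {     // keep the connection alive for about 5 seconds
	if err := conn.WriteMessage(websocket.BinaryMessage, silence); err != nil {
		log.Println("Failed to send silent audio:", err)
		break
	}
	time.Sleep(100 * time.Millisecond)
}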

Q: How do I convert an audio format to a supported format?

You can use the FFmpeg tool. For more information, visit the official FFmpeg website.

# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bitrate to control audio quality. Examples: 192k, 320k
# -ar: Specifies the sample rate
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites the existing file. No value is needed.
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bitrate -ar sample_rate -ac num_channels output.ext

# Example: Convert WAV to MP3 and maintain the original quality
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: Convert MP3 to 16 kHz, 16-bit PCM mono WAV (the mono format required by the recognition service)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 16000 -ac 1 output.wav
# Example: Convert M4A to AAC to extract or convert Apple audio
ffmpeg -i input.m4a -c:a copy output.aac  # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode to improve quality
# Example: Convert lossless FLAC to Opus for high compression
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: Why use the WebSocket protocol instead of the HTTP/HTTPS protocol? Why not provide a RESTful API?

The speech services use the WebSocket protocol because they require full-duplex communication, which lets the server and client exchange data proactively. For example, the server can push real-time progress updates for speech synthesis or recognition. RESTful APIs, which are based on HTTP, support only a one-way, client-initiated request-response pattern, and that pattern cannot support this kind of real-time interaction.

Troubleshooting

For troubleshooting information about code errors, see Error codes.

Q: Why is speech not recognized (no recognition result)?

  1. Ensure that the audio format (format) and sample rate (sample_rate) in the request parameters are set correctly and meet the parameter constraints. The following are common error examples:

    • The audio file has a .wav extension but is in the MP3 format, and the format request parameter is incorrectly set to mp3.

    • The audio sampling rate is 3600 Hz, but the sample_rate request parameter is incorrectly set to 48000.

    You can use the ffprobe tool to retrieve information about the container, encoding, sample rate, and channels of the audio:

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
  2. If the preceding checks do not reveal any issues, you can use custom vocabulary to improve the recognition of specific words.