Alibaba Cloud Model Studio:WebSocket API

Last Updated: Mar 25, 2026

Use the WebSocket protocol to integrate directly with the Fun-ASR real-time speech recognition service from any programming language that supports WebSocket. Higher-level SDKs for Python and Java simplify integration, but the generic protocol offers maximum flexibility.

For model descriptions and selection guidance, see real-time speech recognition - Fun-ASR/Paraformer.

Getting started

Prerequisites

  1. Get an API key. For security, we recommend storing the API key in an environment variable.

  2. Download the sample audio file: asr_example.wav.

Sample code

Node.js

Install the required dependencies:

npm install ws uuid

Use the following sample code:

const fs = require('fs');
const WebSocket = require('ws');
const { v4: uuidv4 } = require('uuid'); // Used to generate a UUID

// API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not set an environment variable, replace the following line with your Model Studio API key: const apiKey = "sk-xxx"
const apiKey = process.env.DASHSCOPE_API_KEY;
// This is the endpoint for the Singapore region. To use a model in the China (Beijing) region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
const url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server endpoint
const audioFile = 'asr_example.wav'; // Replace with the path to your audio file

// Generate a 32-character random ID
const TASK_ID = uuidv4().replace(/-/g, '').slice(0, 32);

// Create a WebSocket client
const ws = new WebSocket(url, {
  headers: {
    Authorization: `bearer ${apiKey}`
  }
});

let taskStarted = false; // A flag to indicate whether the task has started

// Once the connection opens, send the run-task command
ws.on('open', () => {
  console.log('Connected to the server');
  sendRunTask();
});

// Handle incoming messages
ws.on('message', (data) => {
  const message = JSON.parse(data);
  switch (message.header.event) {
    case 'task-started':
      console.log('Task started');
      taskStarted = true;
      sendAudioStream();
      break;
    case 'result-generated':
      console.log('Recognition result:', message.payload.output.sentence.text);
      if (message.payload.usage) {
        console.log('Billable duration (seconds):', message.payload.usage.duration);
      }
      break;
    case 'task-finished':
      console.log('Task finished');
      ws.close();
      break;
    case 'task-failed':
      console.error('Task failed:', message.header.error_message);
      ws.close();
      break;
    default:
      console.log('Unknown event:', message.header.event);
  }
});

// On close, report whether the task ever started
ws.on('close', () => {
  if (!taskStarted) {
    console.error('Connection closed before the task started.');
  }
});

// Send the run-task command
function sendRunTask() {
  const runTaskMessage = {
    header: {
      action: 'run-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      task_group: 'audio',
      task: 'asr',
      function: 'recognition',
      model: 'fun-asr-realtime',
      parameters: {
        sample_rate: 16000,
        format: 'wav'
      },
      input: {}
    }
  };
  ws.send(JSON.stringify(runTaskMessage));
}

// Send the audio stream
function sendAudioStream() {
  // Read in 1 KB chunks so each send matches the ~1 KB / 100 ms pacing used by the other samples
  const audioStream = fs.createReadStream(audioFile, { highWaterMark: 1024 });
  let chunkCount = 0;

  function sendNextChunk() {
    const chunk = audioStream.read();
    if (chunk) {
      ws.send(chunk);
      chunkCount++;
      setTimeout(sendNextChunk, 100); // Send a chunk every 100 ms
    }
  }

  audioStream.on('readable', () => {
    sendNextChunk();
  });

  audioStream.on('end', () => {
    console.log('Audio stream ended');
    sendFinishTask();
  });

  audioStream.on('error', (err) => {
    console.error('Error reading the audio file:', err);
    ws.close();
  });
}

// Send the finish-task command
function sendFinishTask() {
  const finishTaskMessage = {
    header: {
      action: 'finish-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      input: {}
    }
  };
  ws.send(JSON.stringify(finishTaskMessage));
}

// Handle errors
ws.on('error', (error) => {
  console.error('WebSocket error:', error);
});

C#

Use the following sample code:

using System.Net.WebSockets;
using System.Text;
using System.Text.Json;
using System.Text.Json.Nodes;

class Program {
    private static ClientWebSocket _webSocket = new ClientWebSocket();
    private static CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
    private static bool _taskStartedReceived = false;
    private static bool _taskFinishedReceived = false;
    // API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    // If you have not set an environment variable, replace the following line with your Model Studio API key: private static readonly string ApiKey = "sk-xxx"
    private static readonly string ApiKey = Environment.GetEnvironmentVariable("DASHSCOPE_API_KEY") ?? throw new InvalidOperationException("DASHSCOPE_API_KEY environment variable is not set.");

    // This is the endpoint for the Singapore region. To use a model in the China (Beijing) region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
    private const string WebSocketUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/";
    // Replace with the path to your audio file
    private const string AudioFilePath = "asr_example.wav";

    static async Task Main(string[] args) {
        // Establish a WebSocket connection and configure headers for authentication.
        _webSocket.Options.SetRequestHeader("Authorization", $"bearer {ApiKey}");

        await _webSocket.ConnectAsync(new Uri(WebSocketUrl), _cancellationTokenSource.Token);

        // Start a task to receive WebSocket messages asynchronously.
        var receiveTask = ReceiveMessagesAsync();

        // Send the run-task command.
        string _taskId = Guid.NewGuid().ToString("N"); // Generate a 32-character random ID.
        var runTaskJson = GenerateRunTaskJson(_taskId);
        await SendAsync(runTaskJson);

        // Wait for the task-started event.
        while (!_taskStartedReceived) {
            await Task.Delay(100, _cancellationTokenSource.Token);
        }

        // Read the local file and stream the audio to the server for recognition.
        await SendAudioStreamAsync(AudioFilePath);

        // Send the finish-task command to end the task.
        var finishTaskJson = GenerateFinishTaskJson(_taskId);
        await SendAsync(finishTaskJson);

        // Wait for the task-finished event.
        while (!_taskFinishedReceived && !_cancellationTokenSource.IsCancellationRequested) {
            try {
                await Task.Delay(100, _cancellationTokenSource.Token);
            } catch (OperationCanceledException) {
                // The task was canceled, so exit the loop.
                break;
            }
        }

        // Close the connection.
        if (!_cancellationTokenSource.IsCancellationRequested) {
            await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", _cancellationTokenSource.Token);
        }

        _cancellationTokenSource.Cancel();
        try {
            await receiveTask;
        } catch (OperationCanceledException) {
            // Ignore operation canceled exceptions.
        }
    }

    private static async Task ReceiveMessagesAsync() {
        try {
            while (_webSocket.State == WebSocketState.Open && !_cancellationTokenSource.IsCancellationRequested) {
                var message = await ReceiveMessageAsync(_cancellationTokenSource.Token);
                if (message != null) {
                    var eventValue = message["header"]?["event"]?.GetValue<string>();
                    switch (eventValue) {
                        case "task-started":
                            Console.WriteLine("Task started successfully.");
                            _taskStartedReceived = true;
                            break;
                        case "result-generated":
                            Console.WriteLine($"Recognition result: {message["payload"]?["output"]?["sentence"]?["text"]?.GetValue<string>()}");
                            if (message["payload"]?["usage"] != null && message["payload"]?["usage"]?["duration"] != null) {
                                Console.WriteLine($"Billable duration (seconds): {message["payload"]?["usage"]?["duration"]?.GetValue<int>()}");
                            }
                            break;
                        case "task-finished":
                            Console.WriteLine("Task finished.");
                            _taskFinishedReceived = true;
                            _cancellationTokenSource.Cancel();
                            break;
                        case "task-failed":
                            Console.WriteLine($"Task failed: {message["header"]?["error_message"]?.GetValue<string>()}");
                            _cancellationTokenSource.Cancel();
                            break;
                    }
                }
            }
        } catch (OperationCanceledException) {
            // Ignore operation canceled exceptions.
        }
    }

    private static async Task<JsonNode?> ReceiveMessageAsync(CancellationToken cancellationToken) {
        var buffer = new byte[1024 * 4];
        using var messageStream = new MemoryStream();
        WebSocketReceiveResult result;

        // A single JSON event can span multiple WebSocket frames, so read until EndOfMessage.
        do {
            result = await _webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), cancellationToken);
            if (result.MessageType == WebSocketMessageType.Close) {
                await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", cancellationToken);
                return null;
            }
            messageStream.Write(buffer, 0, result.Count);
        } while (!result.EndOfMessage);

        var message = Encoding.UTF8.GetString(messageStream.ToArray());
        return JsonNode.Parse(message);
    }

    private static async Task SendAsync(string message) {
        var buffer = Encoding.UTF8.GetBytes(message);
        var segment = new ArraySegment<byte>(buffer);
        await _webSocket.SendAsync(segment, WebSocketMessageType.Text, true, _cancellationTokenSource.Token);
    }

    private static async Task SendAudioStreamAsync(string filePath) {
        using (var audioStream = File.OpenRead(filePath)) {
            var buffer = new byte[1024]; // Create a buffer for audio chunks.
            int bytesRead;

            while ((bytesRead = await audioStream.ReadAsync(buffer, 0, buffer.Length)) > 0) {
                var segment = new ArraySegment<byte>(buffer, 0, bytesRead);
                await _webSocket.SendAsync(segment, WebSocketMessageType.Binary, true, _cancellationTokenSource.Token);
                await Task.Delay(100); // Wait 100 ms between chunks.
            }
        }
    }

    private static string GenerateRunTaskJson(string taskId) {
        var runTask = new JsonObject {
            ["header"] = new JsonObject {
                ["action"] = "run-task",
                ["task_id"] = taskId,
                ["streaming"] = "duplex"
            },
            ["payload"] = new JsonObject {
                ["task_group"] = "audio",
                ["task"] = "asr",
                ["function"] = "recognition",
                ["model"] = "fun-asr-realtime",
                ["parameters"] = new JsonObject {
                    ["format"] = "wav",
                    ["sample_rate"] = 16000,
                },
                ["input"] = new JsonObject()
            }
        };
        return JsonSerializer.Serialize(runTask);
    }

    private static string GenerateFinishTaskJson(string taskId) {
        var finishTask = new JsonObject {
            ["header"] = new JsonObject {
                ["action"] = "finish-task",
                ["task_id"] = taskId,
                ["streaming"] = "duplex"
            },
            ["payload"] = new JsonObject {
                ["input"] = new JsonObject()
            }
        };
        return JsonSerializer.Serialize(finishTask);
    }
}

PHP

The sample code uses the following directory structure:

my-php-project/
├── composer.json
├── vendor/
└── index.php

Use the following content for the composer.json file. You can specify the versions of the dependencies based on your needs.

{
    "require": {
        "react/event-loop": "^1.3",
        "react/socket": "^1.11",
        "react/stream": "^1.2",
        "react/http": "^1.1",
        "ratchet/pawl": "^0.4"
    },
    "autoload": {
        "psr-4": {
            "App\\": "src/"
        }
    }
}

The index.php file contains the following code:

<?php

require __DIR__ . '/vendor/autoload.php';

use Ratchet\Client\Connector;
use React\EventLoop\Loop;
use React\Socket\Connector as SocketConnector;
use Ratchet\rfc6455\Messaging\Frame;

// API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If you have not set an environment variable, replace the following line with your Model Studio API key: $api_key = "sk-xxx"
$api_key = getenv("DASHSCOPE_API_KEY");
// This is the endpoint for the Singapore region. To use a model in the China (Beijing) region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
$websocket_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/';
$audio_file_path = 'asr_example.wav'; // Replace with the path to your audio file

$loop = Loop::get();

// Create a custom connector.
$socketConnector = new SocketConnector($loop, [
    'tcp' => [
        'bindto' => '0.0.0.0:0',
    ],
    'tls' => [
        // Disabling certificate verification is for local testing only. Keep verification enabled in production.
        'verify_peer' => false,
        'verify_peer_name' => false,
    ],
]);

$connector = new Connector($loop, $socketConnector);

$headers = [
    'Authorization' => 'bearer ' . $api_key
];

$connector($websocket_url, [], $headers)->then(function ($conn) use ($loop, $audio_file_path) {
    echo "Connected to the WebSocket server\n";

    // Generate a task ID. The finish-task command must reuse this same ID.
    $taskId = generateTaskId();

    // Start listening for WebSocket messages asynchronously.
    $conn->on('message', function($msg) use ($conn, $loop, $audio_file_path, $taskId) {
        $response = json_decode($msg, true);

        if (isset($response['header']['event'])) {
            handleEvent($conn, $response, $loop, $audio_file_path, $taskId);
        } else {
            echo "Unknown message format\n";
        }
    });

    // Listen for the connection close event.
    $conn->on('close', function($code = null, $reason = null) {
        echo "Connection closed\n";
        if ($code !== null) {
            echo "Close code: " . $code . "\n";
        }
        if ($reason !== null) {
            echo "Close reason: " . $reason . "\n";
        }
    });

    // Send the run-task command.
    sendRunTaskMessage($conn, $taskId);

}, function ($e) {
    echo "Could not connect: {$e->getMessage()}\n";
});

$loop->run();

/**
 * Generate a task ID.
 * @return string
 */
function generateTaskId(): string {
    return bin2hex(random_bytes(16));
}

/**
 * Send the run-task command.
 * @param $conn
 * @param $taskId
 */
function sendRunTaskMessage($conn, $taskId) {
    $runTaskMessage = json_encode([
        "header" => [
            "action" => "run-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "task_group" => "audio",
            "task" => "asr",
            "function" => "recognition",
            "model" => "fun-asr-realtime",
            "parameters" => [
                "format" => "wav",
                "sample_rate" => 16000
            ],
            "input" => []
        ]
    ]);
    echo "Preparing to send the run-task command: " . $runTaskMessage . "\n";
    $conn->send($runTaskMessage);
    echo "run-task command sent\n";
}

/**
 * Read the audio file.
 * @param string $filePath
 * @return bool|string
 */
function readAudioFile(string $filePath) {
    $voiceData = file_get_contents($filePath);
    if ($voiceData === false) {
        echo "Failed to read the audio file\n";
    }
    return $voiceData;
}

/**
 * Split the audio data into chunks.
 * @param string $data
 * @param int $chunkSize
 * @return array
 */
function splitAudioData(string $data, int $chunkSize): array {
    return str_split($data, $chunkSize);
}

/**
 * Send the finish-task command.
 * @param $conn
 * @param $taskId
 */
function sendFinishTaskMessage($conn, $taskId) {
    $finishTaskMessage = json_encode([
        "header" => [
            "action" => "finish-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "input" => []
        ]
    ]);
    echo "Preparing to send the finish-task command: " . $finishTaskMessage . "\n";
    $conn->send($finishTaskMessage);
    echo "finish-task command sent\n";
}

/**
 * Handle events.
 * @param $conn
 * @param $response
 * @param $loop
 * @param $audio_file_path
 * @param string $taskId The task ID used in the run-task command.
 */
function handleEvent($conn, $response, $loop, $audio_file_path, $taskId) {
    static $chunks;
    static $allChunksSent = false;

    switch ($response['header']['event']) {
        case 'task-started':
            echo "Task started, sending audio data...\n";
            // Read the audio file.
            $voiceData = readAudioFile($audio_file_path);
            if ($voiceData === false) {
                echo "Failed to read the audio file\n";
                $conn->close();
                return;
            }

            // Split the audio data into chunks.
            $chunks = splitAudioData($voiceData, 1024);

            // Define the sending function.
            $sendChunk = function() use ($conn, &$chunks, $loop, &$sendChunk, &$allChunksSent, $taskId) {
                if (!empty($chunks)) {
                    $chunk = array_shift($chunks);
                    $binaryMsg = new Frame($chunk, true, Frame::OP_BINARY);
                    $conn->send($binaryMsg);
                    // Send the next chunk after 100 ms.
                    $loop->addTimer(0.1, $sendChunk);
                } else {
                    echo "All chunks have been sent\n";
                    $allChunksSent = true;

                    // Send the finish-task command.
                    sendFinishTaskMessage($conn, $taskId);
                }
            };

            // Start sending audio data.
            $sendChunk();
            break;
        case 'result-generated':
            $result = $response['payload']['output']['sentence'];
            echo "Recognition result: " . $result['text'] . "\n";
            if (isset($response['payload']['usage']['duration'])) {
                echo "Billable duration (seconds): " . $response['payload']['usage']['duration'] . "\n";
            }
            break;
        case 'task-finished':
            echo "Task finished\n";
            $conn->close();
            break;
        case 'task-failed':
            echo "Task failed\n";
            echo "Error code: " . $response['header']['error_code'] . "\n";
            echo "Error message: " . $response['header']['error_message'] . "\n";
            $conn->close();
            break;
        case 'error':
            echo "Error: " . $response['payload']['message'] . "\n";
            break;
        default:
            echo "Unknown event: " . $response['header']['event'] . "\n";
            break;
    }

}

Go

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/google/uuid"
	"github.com/gorilla/websocket"
)

const (
	// This is the endpoint for the Singapore region. To use a model in the China (Beijing) region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
	wsURL     = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/" // WebSocket server endpoint
	audioFile = "asr_example.wav"                                   // Replace with the path to your audio file
)

var dialer = websocket.DefaultDialer

func main() {
	// API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
	// If you have not set an environment variable, replace the following line with your Model Studio API key: apiKey := "sk-xxx"
	apiKey := os.Getenv("DASHSCOPE_API_KEY")

	// Connect to the WebSocket service.
	conn, err := connectWebSocket(apiKey)
	if err != nil {
		log.Fatal("Failed to connect to WebSocket: ", err)
	}
	defer closeConnection(conn)

	// Start a goroutine to receive results.
	taskStarted := make(chan bool)
	taskDone := make(chan bool)
	startResultReceiver(conn, taskStarted, taskDone)

	// Send the run-task command.
	taskID, err := sendRunTaskCmd(conn)
	if err != nil {
		log.Fatal("Failed to send run-task command: ", err)
	}

	// Wait for the task-started event.
	waitForTaskStarted(taskStarted)

	// Send the audio stream for recognition.
	if err := sendAudioData(conn); err != nil {
		log.Fatal("Failed to send audio: ", err)
	}

	// Send the finish-task command.
	if err := sendFinishTaskCmd(conn, taskID); err != nil {
		log.Fatal("Failed to send finish-task command: ", err)
	}

	// Wait for the task to finish or fail.
	<-taskDone
}

// Define structs to represent the JSON data.
type Header struct {
	Action       string                 `json:"action"`
	TaskID       string                 `json:"task_id"`
	Streaming    string                 `json:"streaming"`
	Event        string                 `json:"event"`
	ErrorCode    string                 `json:"error_code,omitempty"`
	ErrorMessage string                 `json:"error_message,omitempty"`
	Attributes   map[string]interface{} `json:"attributes"`
}

type Output struct {
	Sentence struct {
		BeginTime int64  `json:"begin_time"`
		EndTime   *int64 `json:"end_time"`
		Text      string `json:"text"`
		Words     []struct {
			BeginTime   int64  `json:"begin_time"`
			EndTime     *int64 `json:"end_time"`
			Text        string `json:"text"`
			Punctuation string `json:"punctuation"`
		} `json:"words"`
	} `json:"sentence"`
}

type Payload struct {
	TaskGroup  string `json:"task_group"`
	Task       string `json:"task"`
	Function   string `json:"function"`
	Model      string `json:"model"`
	Parameters Params `json:"parameters"`
	Input      Input  `json:"input"`
	Output     Output `json:"output,omitempty"`
	Usage      *struct {
		Duration int `json:"duration"`
	} `json:"usage,omitempty"`
}

type Params struct {
	Format                   string `json:"format"`
	SampleRate               int    `json:"sample_rate"`
	DisfluencyRemovalEnabled bool   `json:"disfluency_removal_enabled"`
}

type Input struct {
}

type Event struct {
	Header  Header  `json:"header"`
	Payload Payload `json:"payload"`
}

// Connect to the WebSocket service.
func connectWebSocket(apiKey string) (*websocket.Conn, error) {
	header := make(http.Header)
	header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
	conn, _, err := dialer.Dial(wsURL, header)
	return conn, err
}

// Start a goroutine to asynchronously receive WebSocket messages.
func startResultReceiver(conn *websocket.Conn, taskStarted chan<- bool, taskDone chan<- bool) {
	go func() {
		for {
			_, message, err := conn.ReadMessage()
			if err != nil {
				log.Println("Failed to read server message: ", err)
				return
			}
			var event Event
			err = json.Unmarshal(message, &event)
			if err != nil {
				log.Println("Failed to parse event: ", err)
				continue
			}
			if handleEvent(conn, event, taskStarted, taskDone) {
				return
			}
		}
	}()
}

// Send the run-task command.
func sendRunTaskCmd(conn *websocket.Conn) (string, error) {
	runTaskCmd, taskID, err := generateRunTaskCmd()
	if err != nil {
		return "", err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(runTaskCmd))
	return taskID, err
}

// Generate the run-task command.
func generateRunTaskCmd() (string, string, error) {
	taskID := uuid.New().String()
	runTaskCmd := Event{
		Header: Header{
			Action:    "run-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			TaskGroup: "audio",
			Task:      "asr",
			Function:  "recognition",
			Model:     "fun-asr-realtime",
			Parameters: Params{
				Format:     "wav",
				SampleRate: 16000,
			},
			Input: Input{},
		},
	}
	runTaskCmdJSON, err := json.Marshal(runTaskCmd)
	return string(runTaskCmdJSON), taskID, err
}

// Wait for the task-started event.
func waitForTaskStarted(taskStarted chan bool) {
	select {
	case <-taskStarted:
		fmt.Println("Task started successfully.")
	case <-time.After(10 * time.Second):
		log.Fatal("Timeout waiting for task-started event; the task failed to start.")
	}
}

// Send the audio data.
func sendAudioData(conn *websocket.Conn) error {
	file, err := os.Open(audioFile)
	if err != nil {
		return err
	}
	defer file.Close()

	buf := make([]byte, 1024)
	for {
		n, err := file.Read(buf)
		if n == 0 {
			break
		}
		if err != nil && err != io.EOF {
			return err
		}
		err = conn.WriteMessage(websocket.BinaryMessage, buf[:n])
		if err != nil {
			return err
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

// Send the finish-task command.
func sendFinishTaskCmd(conn *websocket.Conn, taskID string) error {
	finishTaskCmd, err := generateFinishTaskCmd(taskID)
	if err != nil {
		return err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(finishTaskCmd))
	return err
}

// Generate the finish-task command.
func generateFinishTaskCmd(taskID string) (string, error) {
	finishTaskCmd := Event{
		Header: Header{
			Action:    "finish-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			Input: Input{},
		},
	}
	finishTaskCmdJSON, err := json.Marshal(finishTaskCmd)
	return string(finishTaskCmdJSON), err
}

// Handle events.
func handleEvent(conn *websocket.Conn, event Event, taskStarted chan<- bool, taskDone chan<- bool) bool {
	switch event.Header.Event {
	case "task-started":
		fmt.Println("Received task-started event.")
		taskStarted <- true
	case "result-generated":
		if event.Payload.Output.Sentence.Text != "" {
			fmt.Println("Recognition result:", event.Payload.Output.Sentence.Text)
		}
		if event.Payload.Usage != nil {
			fmt.Println("Billable duration (seconds):", event.Payload.Usage.Duration)
		}
	case "task-finished":
		fmt.Println("Task finished.")
		taskDone <- true
		return true
	case "task-failed":
		handleTaskFailed(event, conn)
		taskDone <- true
		return true
	default:
		log.Printf("Unexpected event: %v", event)
	}
	return false
}

// Handle the task-failed event. Use log.Printf instead of log.Fatal so that
// the deferred connection cleanup in main still runs.
func handleTaskFailed(event Event, conn *websocket.Conn) {
	if event.Header.ErrorMessage != "" {
		log.Printf("Task failed: %s", event.Header.ErrorMessage)
	} else {
		log.Println("Task failed due to an unknown reason.")
	}
}

// Close the connection.
func closeConnection(conn *websocket.Conn) {
	if conn != nil {
		conn.Close()
	}
}

Key concepts

Interaction sequence

The client and server follow a strict interaction sequence to ensure proper task execution.

  1. Establish a connection: The client initiates a WebSocket connection request to the server and includes authentication information in the request headers.

  2. Start the task: After the connection is established, the client sends a run-task command to specify the model and audio parameters to use.

  3. Confirm the task: The server returns a task-started event, indicating it is ready to receive audio.

  4. Transfer data:

    • The client continuously sends binary audio.

    • During the recognition process, the server returns multiple result-generated events in real time, containing intermediate and final recognition results.

  5. End the task: After sending all audio, the client sends a finish-task command.

  6. Confirm task completion: After processing any remaining audio, the server returns a task-finished event, signaling that the task completed successfully.

  7. Close the connection: The client or server closes the WebSocket connection.
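
The condensed Node.js sketch below maps these steps onto code. For brevity it sends the whole file in a single binary frame and omits error handling; production code should pace audio in small chunks, as in the full samples above.

const fs = require('fs');
const crypto = require('crypto');
const WebSocket = require('ws');

const taskId = crypto.randomUUID().replace(/-/g, ''); // 32-character task ID
const ws = new WebSocket('wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/', {
  headers: { Authorization: `bearer ${process.env.DASHSCOPE_API_KEY}` } // Step 1: connect and authenticate
});

ws.on('open', () => ws.send(JSON.stringify({ // Step 2: start the task
  header: { action: 'run-task', task_id: taskId, streaming: 'duplex' },
  payload: {
    task_group: 'audio', task: 'asr', function: 'recognition',
    model: 'fun-asr-realtime',
    parameters: { format: 'wav', sample_rate: 16000 },
    input: {}
  }
})));

ws.on('message', (data) => {
  const { header, payload } = JSON.parse(data);
  if (header.event === 'task-started') {            // Step 3: server is ready
    ws.send(fs.readFileSync('asr_example.wav'));    // Step 4: send binary audio (single frame for brevity)
    ws.send(JSON.stringify({                        // Step 5: end the task
      header: { action: 'finish-task', task_id: taskId, streaming: 'duplex' },
      payload: { input: {} }
    }));
  } else if (header.event === 'result-generated') { // Step 4: real-time results
    console.log(payload.output.sentence.text);
  } else if (header.event === 'task-finished' || header.event === 'task-failed') {
    ws.close();                                     // Steps 6-7: close the connection
  }
});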

Audio stream specifications

  • Channels: The binary audio sent to the server must be mono.

  • Format and encoding: Supported formats include pcm, wav, mp3, opus, speex, aac, and amr.

    • WAV files must be PCM-encoded.

    • Opus or Speex files must be encapsulated in an Ogg container.

    • For the amr format, only the AMR-NB type is supported.

  • Sample rate: The sample rate must match the sample_rate parameter specified in the run-task command and the requirements of the selected model.

Available models

International

In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding Chinese Mainland.

| Model | Version | Unit price | Free quota |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |

  • Languages supported: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.

  • Sample rates supported: 16 kHz

  • Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr

Chinese Mainland

In the Chinese Mainland deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Chinese Mainland.

| Model | Version | Unit price | Free quota |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.000047/second | No free quota |
| fun-asr-realtime-2026-02-28 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-09-15 | Snapshot | $0.000047/second | No free quota |
| fun-asr-flash-8k-realtime (currently fun-asr-flash-8k-realtime-2026-01-28) | Stable | $0.000032/second | No free quota |
| fun-asr-flash-8k-realtime-2026-01-28 | Snapshot | $0.000032/second | No free quota |

  • Languages supported:

    • fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, and Japanese.

    • fun-asr-realtime-2025-09-15: Chinese (Mandarin), English

  • Sample rates supported:

    • fun-asr-flash-8k-realtime and fun-asr-flash-8k-realtime-2026-01-28: 8 kHz

    • All other models: 16 kHz

  • Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr

API reference

Endpoint

  • Singapore: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/

  • China (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/inference/

Request headers

  • Authorization (string, required): The authentication token, in the format Bearer <your_api_key>. Replace <your_api_key> with your actual API key.

  • user-agent (string, optional): The client identifier. It helps the server track the request source.

  • X-DashScope-WorkSpace (string, optional): The Model Studio workspace ID.

  • X-DashScope-DataInspection (string, optional): Specifies whether to enable the data compliance check. Default: enable. Do not enable this parameter unless necessary.
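
For example, a Node.js client sets these headers when opening the connection (the optional header values below are placeholders):

const WebSocket = require('ws');

const ws = new WebSocket('wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/', {
  headers: {
    Authorization: `bearer ${process.env.DASHSCOPE_API_KEY}`, // required
    'user-agent': 'my-asr-client/1.0',                        // optional; placeholder identifier
    'X-DashScope-WorkSpace': '<your_workspace_id>',           // optional; placeholder workspace ID
    'X-DashScope-DataInspection': 'enable'                    // optional
  }
});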

Commands (client to server)

Commands are JSON-formatted text messages sent from the client to manage a speech recognition task.

1. run-task command: Start a task

Purpose: After establishing a connection, send this command to start a speech recognition task and configure its parameters.

Example:

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "asr",
        "function": "recognition",
        "model": "fun-asr-realtime",
        "parameters": {
            "format": "pcm",
            "sample_rate": 16000
        },
        "input": {}
    }
}

header parameters:

  • header.action (string, required): The command type. Must be run-task.

  • header.task_id (string, required): A unique identifier for the task. The subsequent finish-task command must use the same task_id.

  • header.streaming (string, required): The communication mode. Must be duplex.

payload parameters:

  • payload.task_group (string, required): The task group. Must be audio.

  • payload.task (string, required): The task type. Must be asr.

  • payload.function (string, required): The function type. Must be recognition.

  • payload.model (string, required): The model to use. For details, see the model list.

  • payload.input (object, required): The input configuration. Must be an empty object, {}.

payload.parameters:

  • format (string, required): The audio format. Supported formats include pcm, wav, mp3, opus, speex, aac, and amr. For detailed constraints, see Audio stream specifications.

  • sample_rate (integer, required): The audio sample rate in Hz. The fun-asr-realtime models require 16000 Hz; the fun-asr-flash-8k-realtime models require 8000 Hz.

  • semantic_punctuation_enabled (boolean, optional): Specifies whether to enable semantic punctuation. Default value: false.

    • true: Enables semantic punctuation and disables sentence splitting based on Voice Activity Detection (VAD). Suitable for meeting transcription; provides high accuracy.

    • false: Uses VAD to split sentences and disables semantic punctuation. Suitable for low-latency interactive scenarios.

    Semantic punctuation provides more precise sentence boundaries, while VAD offers a faster response. Adjust this parameter to switch between segmentation methods based on your use case.

  • max_sentence_silence (integer, optional): The silence duration threshold for VAD segmentation: a silence period longer than this value marks the end of a sentence. Unit: milliseconds (ms). Default value: 1300. Valid range: [200, 6000]. Takes effect only when semantic_punctuation_enabled is false (VAD segmentation).

  • multi_threshold_mode_enabled (boolean, optional): Specifies whether to limit the segment length of VAD segmentation to prevent excessively long sentences. Default value: false (disabled). Takes effect only when semantic_punctuation_enabled is false (VAD segmentation).

  • heartbeat (boolean, optional): Specifies whether to maintain a persistent connection. Default value: false.

    • true: The connection to the server remains active while silent audio is continuously sent.

    • false: Even if silent audio is continuously sent, the connection is closed after 60 seconds due to a timeout.

    Note: Silent audio is any part of an audio file or stream with no audible signal. You can create silent audio using audio editing software (such as Audacity or Adobe Audition) or command-line tools (such as FFmpeg).

  • language_hints (array[string], optional): The language codes for recognition. If the language is unknown in advance, leave this parameter unset and the model identifies it automatically. The system reads only the first value in the array and ignores all other values.

    Supported language codes by model:

    • fun-asr-realtime, fun-asr-realtime-2025-11-07: zh (Chinese), en (English), ja (Japanese)

    • fun-asr-realtime-2025-09-15: zh (Chinese), en (English)

  • speech_noise_threshold (float, optional): Adjusts the speech-noise detection threshold to control VAD sensitivity. Range: [-1.0, 1.0].

    • Near -1: Lowers the noise threshold; more noise may be transcribed as speech.

    • Near +1: Raises the noise threshold; some speech may be filtered out as noise.

    Important: This is an advanced parameter, and adjustments can significantly affect recognition quality. Test thoroughly before adjusting, and make small adjustments (step size 0.1) based on your audio environment.
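
For illustration, a run-task command that tunes VAD segmentation might look like the following Node.js snippet. It assumes the ws connection and taskId from the samples above; the parameter values are examples, not recommendations.

const runTaskMessage = {
  header: { action: 'run-task', task_id: taskId, streaming: 'duplex' },
  payload: {
    task_group: 'audio',
    task: 'asr',
    function: 'recognition',
    model: 'fun-asr-realtime',
    parameters: {
      format: 'pcm',
      sample_rate: 16000,
      semantic_punctuation_enabled: false, // keep VAD-based sentence splitting
      max_sentence_silence: 800,           // end a sentence after 800 ms of silence
      multi_threshold_mode_enabled: true,  // cap overly long VAD segments
      language_hints: ['zh']               // only the first value is read
    },
    input: {}
  }
};
ws.send(JSON.stringify(runTaskMessage));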

2. finish-task command: End a task

Purpose: After sending all audio data, the client sends this command to signal that the transmission is complete.

Example:

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}
    }
}

header parameters:

  • header.action (string, required): The command type. Must be finish-task.

  • header.task_id (string, required): The task ID. This must match the task_id from the run-task command.

  • header.streaming (string, required): The communication mode. Must be duplex.

payload parameters:

  • payload.input (object, required): The input configuration. Must be an empty object, {}.

Events (server to client)

Events are JSON-formatted text messages sent from the server to provide the client with task status updates and recognition results.

1. task-started

Trigger: Occurs after the server successfully processes a run-task command.
Purpose: Notifies the client that the task has started and that the client can begin sending audio data.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-started",
        "attributes": {}
    },
    "payload": {}
}

header parameters:

  • header.event (string): The event type. Must be task-started.

  • header.task_id (string): The task ID.

2. result-generated

Trigger: Occurs during the recognition process whenever the server generates a new result.
Purpose: Returns real-time recognition results, including both intermediate and final sentence results.

Example:

{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "result-generated",
    "attributes": {}
  },
  "payload": {
    "output": {
      "sentence": {
        "begin_time": 170,
        "end_time": 920,
        "text": "Okay, I got it.",
        "heartbeat": false,
        "sentence_end": true,
        "words": [
          {
            "begin_time": 170,
            "end_time": 295,
            "text": "Okay",
            "punctuation": ","
          },
          {
            "begin_time": 295,
            "end_time": 503,
            "text": "I",
            "punctuation": ""
          },
          {
            "begin_time": 503,
            "end_time": 711,
            "text": "got",
            "punctuation": ""
          },
          {
            "begin_time": 711,
            "end_time": 920,
            "text": "it",
            "punctuation": "."
          }
        ]
      }
    },
    "usage": {
      "duration": 3
    }
  }
}

header parameters:

  • header.event (string): The event type. Must be result-generated.

  • header.task_id (string): The task ID.

payload parameters:

  • output (object): Contains the recognition result in the output.sentence object. See below.

  • usage (object | null): Billing information. For intermediate results (payload.output.sentence.sentence_end is false), usage is null. For final sentence results (sentence_end is true), usage.duration indicates the task's billable duration in seconds.

The payload.usage object has the following format:

  • duration (integer): The billing duration for the task, in seconds.

The payload.output.sentence object has the following format:

  • begin_time (integer): The start time of the sentence in milliseconds (ms).

  • end_time (integer | null): The end time of the sentence in milliseconds (ms). This value is null for an intermediate result.

  • text (string): The transcribed text content.

  • words (array): An array of word timestamp objects.

  • heartbeat (boolean | null): If true, you can skip processing this recognition result. Corresponds to the heartbeat parameter in the run-task command.

  • sentence_end (boolean): Indicates whether the current sentence is complete.

The payload.output.sentence.words array contains word timestamp objects, each with the following format:

  • begin_time (integer): The start time of the word in milliseconds (ms).

  • end_time (integer): The end time of the word in milliseconds (ms).

  • text (string): The transcribed word.

  • punctuation (string): The punctuation that follows the word.
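
Putting these fields together, a consumer can separate intermediate from final results roughly as follows (a Node.js sketch; ws is the connected client from the samples above):

ws.on('message', (data) => {
  const message = JSON.parse(data);
  if (message.header.event !== 'result-generated') return;

  const sentence = message.payload.output.sentence;
  if (sentence.heartbeat) return;                    // no new content; safe to skip

  if (sentence.sentence_end) {
    console.log('Final sentence:', sentence.text);   // end_time is set
    if (message.payload.usage) {
      console.log('Billable seconds:', message.payload.usage.duration);
    }
  } else {
    console.log('Partial result:', sentence.text);   // end_time is null here
  }
});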

3. task-finished

Trigger: Occurs after the server receives a finish-task command and processes all cached audio.
Purpose: Signals that the recognition task is complete.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-finished",
        "attributes": {}
    },
    "payload": {
        "output": {}
    }
}

header parameters:

  • header.event (string): The event type. Must be task-finished.

  • header.task_id (string): The task ID.

4. task-failed

Trigger: Occurs if any error is encountered during task processing.
Purpose: Notifies the client that the task has failed and provides an error message. After receiving this event, close the WebSocket connection and handle the error.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-failed",
        "error_code": "CLIENT_ERROR",
        "error_message": "request timeout after 23 seconds.",
        "attributes": {}
    },
    "payload": {}
}

header parameters:

  • header.event (string): The event type. Must be task-failed.

  • header.task_id (string): The task ID.

  • header.error_code (string): The error code.

  • header.error_message (string): A detailed error message.

Connection overhead and connection reuse

The WebSocket service supports connection reuse to improve resource efficiency and avoid connection overhead.

The client starts a task by sending a run-task command and ends it by sending a finish-task command. The server confirms the task's completion by returning a task-finished event, after which the client can reuse the connection to start another task by sending a new run-task command.

Important
  1. Each task on a reused connection must have a unique task_id.

  2. If a task fails, the service returns a task-failed event and closes the connection. This connection cannot be reused.

  3. If the client does not start a new task within 60 seconds after a task finishes, the connection times out.
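
A reuse loop might look like the following Node.js sketch. runTask, streamAudio, and finishTask are hypothetical helpers that wrap the commands shown earlier and resolve when the matching event arrives; they are not part of the API.

const crypto = require('crypto');

// Run several files sequentially over one WebSocket connection.
async function recognizeAll(ws, files) {
  for (const file of files) {
    const taskId = crypto.randomUUID().replace(/-/g, ''); // fresh task_id for every task
    await runTask(ws, taskId);     // hypothetical: send run-task, resolve on task-started
    await streamAudio(ws, file);   // hypothetical: send binary audio in ~1 KB chunks every 100 ms
    await finishTask(ws, taskId);  // hypothetical: send finish-task, resolve on task-finished
    // Start the next run-task within 60 seconds, or the idle connection times out.
  }
  ws.close();
}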

Error codes

For troubleshooting, see error messages.

FAQ

Features

Maintain a long-lived connection

Set the heartbeat request parameter to true and continuously send silent audio to the server.

Note:

  • Silent audio is any part of an audio file or stream with no audible signal.

  • You can create silent audio using audio editing software (such as Audacity or Adobe Audition) or command-line tools (such as FFmpeg).
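
For example, with heartbeat set to true and pcm audio at 16 kHz, a Node.js client might keep the connection alive like this (a sketch; ws is the open connection from the samples above):

// 100 ms of silent 16-bit mono PCM at 16 kHz:
// 16000 samples/s x 2 bytes x 0.1 s = 3200 bytes of zeros.
const SILENCE_100MS = Buffer.alloc(3200);

// While no real audio is available, send silence every 100 ms.
const keepAlive = setInterval(() => ws.send(SILENCE_100MS), 100);

// Call clearInterval(keepAlive) when real audio resumes.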

Convert audio format

You can use the FFmpeg tool. For more information, see the official FFmpeg website.

# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate.
# -ac: Specifies the number of channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites the output file if it exists (no value required).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac channels output.ext

# Example: Convert WAV to MP3 (maintaining original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: Convert MP3 to WAV (standard 16-bit PCM format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 16000 -ac 2 output.wav
# Example: Convert M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac  # Extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode for higher quality
# Example: Convert lossless FLAC to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

WebSocket vs. HTTP/RESTful API

The speech service uses the WebSocket protocol because it requires full-duplex communication. The WebSocket protocol allows the server and client to exchange data simultaneously, such as sending real-time updates on speech synthesis or recognition progress. In contrast, RESTful APIs, which are based on the HTTP/HTTPS protocol, only support a one-way request-response model initiated by the client. This model does not support real-time interaction.

Troubleshooting

If you receive an error code, refer to Error codes for troubleshooting.

Audio not recognized

  1. Check that the audio format (format) and sample rate (sample_rate) in your request parameters are set correctly and meet the parameter constraints. The following are common errors:

    • An audio file has a .wav extension but is actually in MP3 format; in this case, setting the format request parameter to wav based on the extension is incorrect. It should be mp3.

    • The audio's sample rate is 3600 Hz, but the sample_rate request parameter is incorrectly set to 48000.

    You can use the ffprobe tool to check the audio container, encoding, sample rate, and channels:

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx