This document applies only to the China (Beijing) region. To use the models, you must use a China (Beijing) region API key.
Access the real-time speech recognition service through WebSocket.
The DashScope SDK supports only Java and Python. For other languages, use the WebSocket connection described here.
User guide: For model descriptions and selection guidance, see Real-time speech recognition - Fun-ASR/Paraformer.
WebSocket provides full-duplex communication: the client and server establish a persistent connection with a single handshake, allowing both parties to push data to each other, providing better real-time performance.
WebSocket libraries are available for most languages (Go: gorilla/websocket, PHP: Ratchet, Node.js: ws). Familiarize yourself with WebSocket basics before starting.
Prerequisites
You have activated Model Studio and created an API key. Export the key as an environment variable (do not hard-code it) to reduce security risks.
For temporary access or strict control over high-risk operations (accessing/deleting sensitive data), use a temporary authentication token instead.
Compared with long-term API keys, temporary tokens are more secure (60-second lifespan) and reduce API key leakage risk.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Model availability
Feature | paraformer-realtime-v2 | paraformer-realtime-8k-v2 |
Scenarios | Scenarios such as live streaming and meetings | Recognition scenarios for 8 kHz audio, such as telephone customer service and voicemail |
Sample rate | Any | 8 kHz |
Languages | Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian. Supported Chinese dialects: Shanghainese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese. | Chinese |
Punctuation prediction | ✅ Default (no configuration needed) | ✅ Default (no configuration needed) |
Inverse text normalization (ITN) | ✅ Default (no configuration needed) | ✅ Default (no configuration needed) |
Specify recognition language | ✅ Specify using the language_hints parameter | ❌ |
Emotion recognition | ❌ | ✅ (only when semantic_punctuation_enabled is false and sentence_end is true) |
Interaction flow
The client sends two message types to the server: JSON instructions and binary audio (must be single-channel). Server responses are called events.
The interaction flow is as follows:
Establish a connection: The client connects to the server via WebSocket.
Start a task:
The client sends the run-task instruction to start the task.
The client receives the task-started event from the server, which indicates that the task has started successfully.
Send an audio stream:
The client starts sending binary audio and simultaneously receives the result-generated event from the server, which contains the speech recognition result.
Notify the server to end the task:
The client sends the finish-task instruction to notify the server to end the task, and continues to receive the result-generated event returned by the server.
End the task:
The client receives the task-finished event from the server, which marks the end of the task.
Close the connection: The client closes the WebSocket connection.
URL
The WebSocket URL is as follows:
wss://dashscope.aliyuncs.com/api-ws/v1/inference
Headers
Parameter | Type | Required | Description |
Authorization | string | Yes | The authentication token, in the format bearer <your-api-key>. |
user-agent | string | No | The client identifier. It helps the server track the request source. |
X-DashScope-WorkSpace | string | No | Model Studio workspace ID. |
X-DashScope-DataInspection | string | No | Specifies whether to enable the data compliance check. Default: |
Instructions (Client → Server)
Instructions are JSON messages (Text Frames) sent from client to server to control task start/end and mark boundaries. Binary audio (single-channel) is sent separately, not in instructions.
Send instructions in this order (out-of-order sends may fail):
1. run-task instruction - Starts the task; save the returned task_id for step 3.
2. Binary audio (mono) - Send after receiving the task-started event.
3. finish-task instruction - Ends the task after all audio is sent, using the same task_id from step 1.
1. run-task instruction: Start a task
This instruction starts a speech recognition task. Use the same task_id when sending the finish-task instruction.
When to send: After the WebSocket connection is established.
Example:
{
"header": {
"action": "run-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // random uuid
"streaming": "duplex"
},
"payload": {
"task_group": "audio",
"task": "asr",
"function": "recognition",
"model": "paraformer-realtime-v2",
"parameters": {
"format": "pcm", // Audio format
"sample_rate": 16000, // Sample rate
"disfluency_removal_enabled": false, // Filter disfluent words
"language_hints": [
"en"
] // Specify language, only supported by the paraformer-realtime-v2 model
    },
    "input": {}
}
}
header parameters:
Parameter | Type | Required | Description |
header.action | string | Yes | The instruction type. Fixed value: "run-task". |
header.task_id | string | Yes | The current task ID. A 32-character universally unique identifier (UUID), consisting of 32 randomly generated letters and digits. It can include hyphens (for example, 2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx). When you later send the finish-task instruction, use the same task_id that you used for the run-task instruction. |
header.streaming | string | Yes | Fixed string: "duplex" |
payload parameters:
Parameter | Type | Required | Description |
payload.task_group | string | Yes | Fixed string: "audio". |
payload.task | string | Yes | Fixed string: "asr". |
payload.function | string | Yes | Fixed string: "recognition". |
payload.model | string | Yes | The name of the model. For a list of supported models, see Model List. |
payload.input | object | Yes | Fixed format: {}. |
payload.parameters | |||
format | string | Yes | The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr. Important opus/speex: Must be encapsulated in Ogg. wav: Must be PCM encoded. amr: Only the AMR-NB type is supported. |
sample_rate | integer | Yes | The audio sampling rate in Hz. This parameter varies by model:
- paraformer-realtime-v2: supports any sample rate.
- paraformer-realtime-8k-v2: supports only 8000 Hz. |
disfluency_removal_enabled | boolean | No | Specifies whether to filter out disfluent words:
- false (default): does not filter.
- true: filters. |
language_hints | array[string] | No | The language code for recognition. If you cannot determine the language in advance, leave this parameter unset for automatic detection. Currently supported language codes:
- zh: Chinese
- en: English
- ja: Japanese
- ko: Korean
- de: German
- fr: French
- ru: Russian
This parameter only applies to models that support multiple languages (see Model List). |
semantic_punctuation_enabled | boolean | No | Specifies whether to enable semantic sentence segmentation (disabled by default):
- false (default): sentences are segmented by voice activity detection (VAD).
- true: sentences are segmented semantically.
Semantic segmentation provides higher accuracy and is ideal for meeting transcription. VAD segmentation has lower latency and is ideal for interactive scenarios. Applies to v2 and later models. |
max_sentence_silence | integer | No | The VAD sentence segmentation silence threshold (ms). If silence after a speech segment exceeds this value, the sentence ends. Range: 200-6000 ms. Default: 800 ms. Applies only when |
multi_threshold_mode_enabled | boolean | No | Specifies whether to prevent VAD from over-segmenting long sentences (disabled by default). Applies only when |
punctuation_prediction_enabled | boolean | No | Specifies whether to automatically add punctuation to results (enabled by default):
- true (default): adds punctuation to the recognition results.
- false: does not add punctuation.
Applies to v2 and later models only. |
heartbeat | boolean | No | Specifies whether to maintain a persistent server connection during long periods of silence:
- false (default): the server may close the connection during long silence.
- true: the connection is kept alive as long as the client continuously sends silent audio.
Applies to v2 and later models only. |
inverse_text_normalization_enabled | boolean | No | Specifies whether to enable Inverse Text Normalization (ITN). When enabled, Chinese numerals are converted to Arabic numerals (enabled by default). Applies to v2 and later models only. |
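Putting the tables above together, a minimal helper can assemble a valid run-task instruction. This is a sketch: the function name and parameter defaults are illustrative, not part of the API.

```python
import json
import uuid

def build_run_task(model="paraformer-realtime-v2",
                   audio_format="pcm",
                   sample_rate=16000):
    """Return (task_id, run-task JSON string) following the parameter tables above."""
    task_id = uuid.uuid4().hex  # 32 random hex characters; hyphenated UUIDs also work
    instruction = {
        "header": {
            "action": "run-task",
            "task_id": task_id,
            "streaming": "duplex",
        },
        "payload": {
            "task_group": "audio",
            "task": "asr",
            "function": "recognition",
            "model": model,
            "parameters": {
                "format": audio_format,
                "sample_rate": sample_rate,
            },
            "input": {},
        },
    }
    return task_id, json.dumps(instruction)
```

Keep the returned task_id: the finish-task instruction must reuse it.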
2. finish-task instruction: End a task
This instruction ends the speech recognition task. The client sends this instruction after all audio has been sent.
When to send: After the audio is completely sent.
Example:
{
"header": {
"action": "finish-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"input": {}
}
}
header parameters:
Parameter | Type | Required | Description |
header.action | string | Yes | The instruction type. Fixed value: "finish-task". |
header.task_id | string | Yes | The current task ID. Must be the same as the task_id you used to send the run-task instruction. |
header.streaming | string | Yes | Fixed string: "duplex" |
payload parameters:
Parameter | Type | Required | Description |
payload.input | object | Yes | Fixed format: {}. |
Binary audio (Client → Server)
Send audio after receiving the task-started event. You can use real-time audio (microphone) or file audio, which must be single-channel. Upload the audio via the WebSocket binary channel, and we recommend sending 100 ms of audio every 100 ms.
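The recommended pacing of 100 ms of audio every 100 ms translates into a chunk size that depends on the sample rate. A sketch of the arithmetic, assuming raw 16-bit mono PCM (the helper name is illustrative):

```python
def pcm_chunk_size(sample_rate, chunk_ms=100, sample_width=2, channels=1):
    """Number of bytes in one chunk of raw PCM audio.

    bytes = samples_per_second * bytes_per_sample * channels * seconds
    """
    return sample_rate * sample_width * channels * chunk_ms // 1000

# 16 kHz, 16-bit mono: one 100 ms chunk is 3200 bytes; at 8 kHz it is 1600 bytes.
```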
Events (Server → Client)
Events are JSON messages from server to client representing different processing stages.
1. task-started event: Task has started
The task-started event confirms successful task start. Wait for this event before sending audio or the finish-task instruction — otherwise the task will fail.
The payload of the task-started event has no content.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-started",
"attributes": {}
},
"payload": {}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. Fixed value: "task-started". |
header.task_id | string | The task_id generated by the client. |
2. result-generated event: Speech recognition result
While the client sends the audio for recognition and the finish-task instruction, the server continuously returns the result-generated event, which contains the speech recognition result.
You can determine whether the result is intermediate or final by checking whether payload.output.sentence.end_time in the result-generated event is null.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"begin_time": 170,
"end_time": null,
"text": "Okay, I got it",
"heartbeat": false,
"sentence_end": true,
"emo_tag": "neutral", // This field is displayed only when the model parameter is set to paraformer-realtime-8k-v2, the semantic_punctuation_enabled parameter is false, and sentence_end in the result-generated event is true
"emo_confidence": 0.914, // This field is displayed only when the model parameter is set to paraformer-realtime-8k-v2, the semantic_punctuation_enabled parameter is false, and sentence_end in the result-generated event is true
"words": [
{
"begin_time": 170,
"end_time": 295,
"text": "Okay",
"punctuation": ","
},
{
"begin_time": 295,
"end_time": 503,
"text": "I",
"punctuation": ""
},
{
"begin_time": 503,
"end_time": 711,
"text": "got",
"punctuation": ""
},
{
"begin_time": 711,
"end_time": 920,
"text": "it",
"punctuation": ""
}
]
}
},
"usage": {
"duration": 3
}
}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. Fixed value: "result-generated". |
header.task_id | string | The task_id generated by the client. |
payload parameters:
Parameter | Type | Description |
output | object | output.sentence is the recognition result. See the following text for details. |
usage | object | The usage information. See the format of payload.usage below. |
The format of payload.usage is as follows:
Parameter | Type | Description |
duration | integer | The billable duration of the task, in seconds. |
The format of payload.output.sentence is as follows:
Parameter | Type | Description |
begin_time | integer | The start time of the sentence, in ms. |
end_time | integer | null | The end time of the sentence, in ms. If this is an intermediate recognition result, the value is null. |
text | string | The recognized text. |
words | array | Character timestamp information. |
heartbeat | boolean | null | If this value is true, you can skip processing the recognition result. |
sentence_end | boolean | Indicates whether the given sentence has ended. |
emo_tag | string | The emotion of the current sentence. Valid values: positive, negative, and neutral.
Emotion recognition has the following constraints:
- It is supported only by the paraformer-realtime-8k-v2 model.
- semantic_punctuation_enabled must be false (the default).
- The field is returned only when sentence_end is true. |
emo_confidence | number | The confidence level of the emotion recognized in the current sentence. The value ranges from 0.0 to 1.0. A larger value indicates a higher confidence level.
Emotion recognition has the following constraints:
- It is supported only by the paraformer-realtime-8k-v2 model.
- semantic_punctuation_enabled must be false (the default).
- The field is returned only when sentence_end is true. |
payload.output.sentence.words is a list of character timestamps, where each word has the following format:
Parameter | Type | Description |
begin_time | integer | The start time of the character, in ms. |
end_time | integer | The end time of the character, in ms. |
text | string | The character. |
punctuation | string | The punctuation mark. |
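The end_time and heartbeat fields described above can be combined into a small event handler. This is a sketch; the function name and return shape are illustrative:

```python
def handle_result_generated(event):
    """Classify a result-generated event.

    Returns (is_final, text). Per the field tables above, end_time is null
    for intermediate results, and heartbeat results can be skipped entirely.
    """
    sentence = event["payload"]["output"]["sentence"]
    if sentence.get("heartbeat"):
        return False, ""  # heartbeat message: nothing to process
    return sentence.get("end_time") is not None, sentence.get("text", "")
```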
3. task-finished event: Task has ended
When you receive the task-finished event, the task has ended. Close the WebSocket connection and terminate the program.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {}
},
"payload": {
"output": {},
"usage": null
}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. Fixed value: "task-finished". |
header.task_id | string | The task_id generated by the client. |
4. task-failed event: Task failed
If you receive a task-failed event, close the connection and analyze the error message. Fix your code if the failure is due to a programming issue.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-failed",
"error_code": "CLIENT_ERROR",
"error_message": "request timeout after 23 seconds.",
"attributes": {}
},
"payload": {}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. Fixed value: "task-failed". |
header.task_id | string | The task_id generated by the client. |
header.error_code | string | A description of the error type. |
header.error_message | string | The specific reason for the error. |
Connection overhead and reuse
The WebSocket service supports connection reuse to reduce overhead.
The server starts a task on receiving a run-task instruction. After the client sends a finish-task instruction and receives the task-finished event, the connection can be reused by sending a new run-task instruction.
Different tasks within a reused connection must use different task_ids.
If a failure occurs during task execution, the service will still return a task-failed event and close the connection. This connection cannot be reused.
If no new task is started within 60 seconds after a task ends, the connection automatically times out and disconnects.
Code examples
These examples show a basic implementation; adapt them to your scenarios. WebSocket clients typically use asynchronous programming or multithreading to send and receive messages simultaneously.
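The full interaction flow can be sketched in Python. This is a minimal, hedged sketch, not a definitive implementation: it assumes the third-party websocket-client package (pip install websocket-client), a DASHSCOPE_API_KEY environment variable, and 16 kHz 16-bit mono PCM input; the helper names are illustrative.

```python
import json
import os
import threading
import uuid

WS_URL = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"

def instruction(action, task_id):
    """Build a run-task or finish-task instruction as documented above."""
    payload = {"input": {}}
    if action == "run-task":
        payload.update({
            "task_group": "audio", "task": "asr", "function": "recognition",
            "model": "paraformer-realtime-v2",
            "parameters": {"format": "pcm", "sample_rate": 16000},
        })
    return json.dumps({
        "header": {"action": action, "task_id": task_id, "streaming": "duplex"},
        "payload": payload,
    })

def recognize(pcm_chunks):
    """Stream PCM chunks to the service and return the final sentence texts."""
    import websocket  # third-party websocket-client package

    ws = websocket.create_connection(
        WS_URL, header={"Authorization": "bearer " + os.environ["DASHSCOPE_API_KEY"]})
    task_id = uuid.uuid4().hex
    started, results = threading.Event(), []

    def receive():
        while True:
            msg = json.loads(ws.recv())
            event = msg["header"]["event"]
            if event == "task-started":
                started.set()
            elif event == "result-generated":
                sentence = msg["payload"]["output"]["sentence"]
                if sentence.get("end_time") is not None:  # final result
                    results.append(sentence["text"])
            elif event in ("task-finished", "task-failed"):
                started.set()  # unblock the sender if the task never started
                return

    receiver = threading.Thread(target=receive, daemon=True)
    ws.send(instruction("run-task", task_id))
    receiver.start()
    started.wait()  # never send audio before task-started arrives
    for chunk in pcm_chunks:
        ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
    ws.send(instruction("finish-task", task_id))
    receiver.join()
    ws.close()
    return results
```

In a production client you would also pace the audio (about 100 ms per chunk), handle task-failed explicitly, and reuse the connection for subsequent tasks with fresh task_ids.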
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
FAQ
Features
Q: How to maintain a persistent connection with the server during long periods of silence?
Set the heartbeat parameter to true and continuously send silent audio to the server.
Silent audio: audio with no sound signal. Generate it with editing software (Audacity, Adobe Audition) or FFmpeg.
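Silence can also be generated directly in code. As a sketch, assuming the raw 16-bit mono PCM format described above (the helper name is illustrative), a silent buffer is simply zero bytes:

```python
def silent_pcm(duration_ms, sample_rate=16000, sample_width=2):
    """Raw mono PCM silence: every sample byte is zero."""
    return b"\x00" * (sample_rate * sample_width * duration_ms // 1000)

# For example, send silent_pcm(100) every 100 ms over the binary channel
# during pauses to keep the connection alive when heartbeat is enabled.
```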
Q: How to convert an audio format to the required format?
You can use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites an existing file (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac num_channels output.ext
# Example: WAV to MP3 (maintain original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus
Q: Can I view the time range for each sentence?
Yes. Results include start/end timestamps for each sentence to determine time ranges.
Q: Why use WebSocket instead of HTTP/HTTPS? Why not provide a RESTful API?
The Speech Service uses WebSocket instead of HTTP/HTTPS or RESTful APIs because it requires full-duplex communication. WebSocket allows both the server and client to proactively push data, such as real-time progress updates for synthesis or recognition. RESTful APIs over HTTP only support client-initiated request-response cycles and cannot meet real-time interaction requirements.
Q: How to recognize a local file (recording file)?
Convert the local file into a binary audio stream and upload it through the WebSocket binary channel using the send method. For a complete example, see the Code examples section.
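Reading a local file in fixed-size chunks can be sketched as follows. The chunk size assumes 16 kHz 16-bit mono PCM (3200 bytes per 100 ms); the helper name is illustrative:

```python
def stream_file(path, chunk_size=3200):
    """Yield fixed-size binary chunks of a local audio file.

    3200 bytes is 100 ms of 16 kHz 16-bit mono PCM; each chunk would be
    sent over the WebSocket binary channel after task-started arrives.
    """
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```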
Troubleshooting
If an error occurs in your code, refer to Error codes for troubleshooting.
Q: Why is there no recognition result?
Verify that the audio format and sampleRate/sample_rate match the parameter constraints. Common errors:
- The audio file has a .wav extension but is actually in MP3 format, and the format parameter is incorrectly set to mp3.
- The audio sample rate is 3600 Hz, but the sampleRate/sample_rate parameter is incorrectly set to 48000.
Use ffprobe to check audio info (container, encoding, sample rate, channels):
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
When you use the paraformer-realtime-v2 model, check whether the language set in language_hints matches the actual language of the audio. For example, the audio is in Chinese, but language_hints is set to en (English).