
Alibaba Cloud Model Studio: WebSocket API for Paraformer real-time speech recognition

Last Updated: Dec 15, 2025
Important

This document applies only to the China (Beijing) region. To use the models, you must use a China (Beijing) region API key.

This topic describes how to access the real-time speech recognition service through a WebSocket connection.

The DashScope SDK currently supports only Java and Python. To develop Paraformer real-time speech recognition applications in other programming languages, you can communicate with the service through a WebSocket connection.

User guide: For model descriptions and selection guidance, see Real-time speech recognition - Fun-ASR/Paraformer.

WebSocket is a network protocol that supports full-duplex communication. The client and server establish a persistent connection with a single handshake, which allows both parties to actively push data to each other. This provides significant advantages in real-time performance and efficiency.

For common programming languages, many ready-to-use WebSocket libraries and examples are available, such as:

  • Go: gorilla/websocket

  • PHP: Ratchet

  • Node.js: ws

Familiarize yourself with the basic principles and technical details of WebSocket before you begin development.

Prerequisites

You have activated the Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.

Note

To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.

Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.

To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
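For example, in Python the only change is the value placed in the Authorization header. The following is a minimal sketch; the variable names are illustrative, and the token is assumed to have been obtained through the temporary-token mechanism described above:

import os

# Hypothetical environment variable names; use whichever credential you actually hold.
temp_token = os.environ.get("DASHSCOPE_TEMP_TOKEN")  # short-lived temporary token, if available
api_key = os.environ.get("DASHSCOPE_API_KEY")        # long-term API key
headers = {"Authorization": f"bearer {temp_token or api_key}"}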

Model availability

paraformer-realtime-v2

  • Scenarios: Scenarios such as live streaming and meetings

  • Sample rate: Any

  • Languages: Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian. Supported Chinese dialects: Shanghainese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese

  • Punctuation prediction: ✅ Supported by default. No configuration is required.

  • Inverse text normalization (ITN): ✅ Supported by default. No configuration is required.

  • Custom vocabulary: ✅ See Custom hotwords

  • Specify recognition language: ✅ Specify using the language_hints parameter

  • Emotion recognition: Not supported

paraformer-realtime-8k-v2

  • Scenarios: Recognition scenarios for 8 kHz audio, such as telephone customer service and voicemail

  • Sample rate: 8 kHz

  • Languages: Chinese

  • Punctuation prediction: ✅ Supported by default. No configuration is required.

  • Inverse text normalization (ITN): ✅ Supported by default. No configuration is required.

  • Custom vocabulary: ✅ See Custom hotwords

  • Specify recognition language: Not supported

  • Emotion recognition: ✅ Supported. See the constraints and usage notes below.

Emotion recognition has the following constraints:

  • It is only available for the paraformer-realtime-8k-v2 model.

  • You must disable semantic punctuation. This is controlled by the semantic_punctuation_enabled parameter of the run-task instruction. By default, semantic punctuation is disabled.

  • The emotion recognition result is displayed only when the value of payload.output.sentence.sentence_end is true.

You can obtain the emotion recognition results by parsing the result-generated event, where the payload.output.sentence.emo_tag and payload.output.sentence.emo_confidence fields represent the emotion and emotion confidence level of the current sentence, respectively.
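As an illustration, the following minimal Python sketch reads the emotion fields. It assumes msg is a parsed result-generated event, for example the output of json.loads on the received text frame:

# Assumption: msg is a parsed result-generated event (a Python dict).
sentence = msg["payload"]["output"]["sentence"]
if sentence.get("sentence_end"):  # emotion fields appear only on final sentences
    print(sentence.get("emo_tag"), sentence.get("emo_confidence"))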

Interaction flow

(Figure: interaction flow between the client and the server)

The client sends two types of messages to the server: instructions in JSON format and binary audio (must be single-channel audio). Messages returned from the server to the client are called events.

The interaction flow between the client and the server, in chronological order, is as follows:

  1. Establish a connection: The client establishes a WebSocket connection with the server.

  2. Start a task: The client sends the run-task instruction to start the task and waits for the server to return the task-started event.

  3. Send an audio stream: The client sends binary audio to the server, and the server returns result-generated events that contain the recognition results.

  4. Notify the server to end the task: After all audio is sent, the client sends the finish-task instruction.

  5. End the task: The server returns the task-finished event to end the task.

  6. Close the connection: The client closes the WebSocket connection.

URL

The WebSocket URL is as follows:

wss://dashscope.aliyuncs.com/api-ws/v1/inference

Headers

  • Authorization (string, required): The authentication token. The format is Bearer <your_api_key>. Replace <your_api_key> with your actual API key.

  • user-agent (string, optional): The client identifier, which helps the server track the source of the request.

  • X-DashScope-WorkSpace (string, optional): The Model Studio workspace ID.

  • X-DashScope-DataInspection (string, optional): Specifies whether to enable the data compliance check feature. Set the value to enable to turn it on. Do not enable this feature unless necessary.
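The following is a minimal connection sketch in Python, assuming the third-party websocket-client package (any WebSocket library that supports custom headers works the same way):

import os
import websocket  # third-party package: websocket-client

url = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"
headers = {
    "Authorization": "bearer " + os.environ["DASHSCOPE_API_KEY"],
    "user-agent": "my-asr-client/0.1",  # optional, illustrative value
}
ws = websocket.create_connection(url, header=headers)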

Instructions (Client → Server)

Instructions are messages sent from the client to the server. They are in JSON format, sent as Text Frames, and are used to control the start and end of a task and to mark task boundaries.

Note

The binary audio (must be single-channel) sent from the client to the server is not included in any instruction and must be sent separately.

Send instructions in the following strict order. Otherwise, the task may fail:

  1. Send the run-task instruction

    • Starts the speech recognition task.

    • The returned task_id is required for the subsequent finish-task instruction and must be the same.

  2. Send binary audio (mono)

    • Sends the audio for recognition.

    • You must send the audio after you receive the task-started event from the server.

  3. Send the finish-task instruction

    • Ends the speech recognition task.

    • Send this instruction after the audio has been completely sent.

1. run-task instruction: Start a task

This instruction starts a speech recognition task. The task_id is also required when you send the finish-task instruction and the same task_id must be used.

Important

When to send: After the WebSocket connection is established.

Example:

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // random uuid
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "asr",
        "function": "recognition",
        "model": "paraformer-realtime-v2",
        "parameters": {
            "format": "pcm", // Audio format
            "sample_rate": 16000, // Sample rate
            "vocabulary_id": "vocab-xxx-24ee19fa8cfb4d52902170a0xxxxxxxx", // Hotword ID supported by paraformer-realtime-v2
            "disfluency_removal_enabled": false, // Filter disfluent words
            "language_hints": [
                "en"
            ] // Specify language, only supported by the paraformer-realtime-v2 model
        },
        "resources": [ // If you are not using the custom vocabulary feature, do not pass the resources parameter
            {
                "resource_id": "xxxxxxxxxxxx", // Hotword ID supported by paraformer-realtime-v2
                "resource_type": "asr_phrase"
            }
        ],
        "input": {}
    }
}

header parameters:

Parameter

Type

Required

Description

header.action

string

Yes

The instruction type.

For this instruction, the value is fixed as "run-task".

header.task_id

string

Yes

The current task ID.

A universally unique identifier (UUID) that consists of 32 randomly generated letters and digits. It can include hyphens (for example, "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx") or not (for example, "2bf83b9abaeb4fda8d9axxxxxxxxxxxx"). Most programming languages provide built-in APIs for generating UUIDs. For example, in Python:

import uuid

def generateTaskId():
    # Generate a random UUID
    return uuid.uuid4().hex

When you later send the finish-task instruction, use the same task_id that you used for the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameters:

Parameter

Type

Required

Description

payload.task_group

string

Yes

Fixed string: "audio".

payload.task

string

Yes

Fixed string: "asr".

payload.function

string

Yes

Fixed string: "recognition".

payload.model

string

Yes

The name of the model. For a list of supported models, see Model List.

payload.input

object

Yes

Fixed format: {}.

payload.parameters

format

string

Yes

The format of the audio to be recognized.

Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr.

Important

opus/speex: Must be encapsulated in Ogg.

wav: Must be PCM encoded.

amr: Only the AMR-NB type is supported.

sample_rate

integer

Yes

The audio sampling rate in Hz.

This parameter varies by model:

  • paraformer-realtime-v2 supports any sample rate.

  • paraformer-realtime-8k-v2 supports only an 8000 Hz sample rate.

vocabulary_id

string

No

The ID of the hotword vocabulary. This parameter takes effect only when it is set. Use this field to set the hotword ID for v2 and later models.

The hotword information for this hotword ID is applied to the speech recognition request. For more information, see Custom vocabulary.

disfluency_removal_enabled

boolean

No

Specifies whether to filter out disfluent words:

  • true

  • false (default)

language_hints

array[string]

No

The language code of the language to be recognized. If you cannot determine the language in advance, you can leave this parameter unset. The model automatically detects the language.

Currently supported language codes:

  • zh: Chinese

  • en: English

  • ja: Japanese

  • yue: Cantonese

  • ko: Korean

  • de: German

  • fr: French

  • ru: Russian

This parameter only applies to models that support multiple languages (see Model List).

semantic_punctuation_enabled

boolean

No

Specifies whether to enable semantic sentence segmentation. This feature is disabled by default.

  • true: Enables semantic sentence segmentation and disables Voice Activity Detection (VAD) sentence segmentation.

  • false (default): Enables VAD sentence segmentation and disables semantic sentence segmentation.

Semantic sentence segmentation provides higher accuracy and is suitable for meeting transcription scenarios. VAD sentence segmentation has lower latency and is suitable for interactive scenarios.

By adjusting the semantic_punctuation_enabled parameter, you can flexibly switch the sentence segmentation method to suit different scenarios.

This parameter is effective only for v2 and later models.

max_sentence_silence

integer

No

The silence duration threshold for VAD sentence segmentation, in ms.

If the silence duration after a speech segment exceeds this threshold, the system determines that the sentence has ended.

The parameter ranges from 200 ms to 6000 ms. The default value is 800 ms.

This parameter is effective only when the semantic_punctuation_enabled parameter is `false` (VAD segmentation) and the model is v2 or later.

multi_threshold_mode_enabled

boolean

No

If this parameter is set to `true`, VAD sentence segmentation avoids producing overly long sentences. This feature is disabled by default.

This parameter is effective only when the semantic_punctuation_enabled parameter is `false` (VAD segmentation) and the model is v2 or later.

punctuation_prediction_enabled

boolean

No

Specifies whether to automatically add punctuation to the recognition results:

  • true (default)

  • false

This parameter is effective only for v2 and later models.

heartbeat

boolean

No

Controls whether to maintain a persistent connection with the server:

  • true: Maintains the connection with the server without interruption when you continuously send silent audio.

  • false (default): Even if silent audio is continuously sent, the connection will disconnect after 60 seconds due to a timeout.

    Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio using various methods, such as using audio editing software such as Audacity or Adobe Audition, or command-line tools such as FFmpeg.

This parameter is effective only for v2 and later models.

inverse_text_normalization_enabled

boolean

No

Specifies whether to enable Inverse Text Normalization (ITN).

This feature is enabled by default (`true`). When enabled, Chinese numerals are converted to Arabic numerals.

This parameter is effective only for v2 and later models.

payload.resources (This is a list. Do not pass this parameter if you are not using the custom vocabulary feature.)

resource_id

string

No

The hotword ID. This parameter must be used with the resource_type parameter.

resource_type

string

No

Fixed string "asr_phrase". This parameter must be used with the resource_id parameter.

2. finish-task instruction: End a task

This instruction ends the speech recognition task. The client sends this instruction after all audio has been sent.

Important

When to send: After the audio is completely sent.

Example:

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}
    }
}

header parameters:

Parameter

Type

Required

Description

header.action

string

Yes

The instruction type.

For this instruction, the value is fixed as "finish-task".

header.task_id

string

Yes

The current task ID.

Must be the same as the task_id you used to send the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameters:

Parameter

Type

Required

Description

payload.input

object

Yes

Fixed format: {}.
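A minimal sketch (reusing ws and task_id from the run-task sketch above) sends the finish-task instruction as another text frame after all audio has been sent:

import json

finish_task = {
    "header": {"action": "finish-task", "task_id": task_id, "streaming": "duplex"},
    "payload": {"input": {}},
}
ws.send(json.dumps(finish_task))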

Binary audio (Client → Server)

The client must send the audio stream for recognition after receiving the task-started event.

You can send a real-time audio stream (for example, from a microphone) or an audio stream from a recording file. The audio must be single-channel.

The audio is uploaded through the WebSocket binary channel. We recommend sending 100 ms of audio every 100 ms.
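A minimal Python sketch of this pacing, assuming 16 kHz, 16-bit mono PCM (so 100 ms is about 3200 bytes) and the websocket-client connection ws from the earlier sketches:

import time

CHUNK = 3200  # about 100 ms of 16 kHz, 16-bit mono PCM
with open("asr_example.pcm", "rb") as f:  # illustrative file name
    while True:
        data = f.read(CHUNK)
        if not data:
            break
        ws.send_binary(data)  # binary frame
        time.sleep(0.1)       # send 100 ms of audio every 100 ms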

Events (Server → Client)

Events are messages returned from the server to the client. They are in JSON format and represent different processing stages.

1. task-started event: Task has started

The task-started event from the server indicates that the task has started successfully. You must wait to receive this event before you send the audio for recognition or the finish-task instruction. If you send them before receiving this event, the task will fail.

The payload of the task-started event has no content.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-started",
        "attributes": {}
    },
    "payload": {}
}

header parameters:

Parameter

Type

Description

header.event

string

The event type.

For this event, the value is fixed as "task-started".

header.task_id

string

The task_id generated by the client.
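A minimal Python sketch of this wait (assuming the websocket-client connection ws; events arrive as JSON text frames):

import json

while True:
    msg = json.loads(ws.recv())
    event = msg["header"]["event"]
    if event == "task-started":
        break  # safe to start sending audio now
    if event == "task-failed":
        raise RuntimeError(msg["header"].get("error_message"))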

2. result-generated event: Speech recognition result

While the client sends the audio for recognition and the finish-task instruction, the server continuously returns the result-generated event, which contains the speech recognition result.

You can determine whether the result is intermediate or final by checking whether payload.output.sentence.end_time in the result-generated event is null.

Example:

{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "result-generated",
    "attributes": {}
  },
  "payload": {
    "output": {
      "sentence": {
        "begin_time": 170,
        "end_time": null,
        "text": "Okay, I got it",
        "heartbeat": false,
        "sentence_end": true,
        "emo_tag": "neutral", // This field is displayed only when the model parameter is set to paraformer-realtime-8k-v2, the semantic_punctuation_enabled parameter is false, and sentence_end in the result-generated event is true
        "emo_confidence": 0.914, // This field is displayed only when the model parameter is set to paraformer-realtime-8k-v2, the semantic_punctuation_enabled parameter is false, and sentence_end in the result-generated event is true
        "words": [
          {
            "begin_time": 170,
            "end_time": 295,
            "text": "Okay",
            "punctuation": ","
          },
          {
            "begin_time": 295,
            "end_time": 503,
            "text": "I",
            "punctuation": ""
          },
          {
            "begin_time": 503,
            "end_time": 711,
            "text": "got",
            "punctuation": ""
          },
          {
            "begin_time": 711,
            "end_time": 920,
            "text": "it",
            "punctuation": ""
          }
        ]
      }
    },
    "usage": {
      "duration": 3
    }
  }
}

header parameters:

Parameter

Type

Description

header.event

string

The event type.

For this event, the value is fixed as "result-generated".

header.task_id

string

The task_id generated by the client.

payload parameters:

Parameter

Type

Description

output

object

output.sentence is the recognition result. See the following text for details.

usage

object

When payload.output.sentence.sentence_end is false (the current sentence is not finished, see payload.output.sentence parameter description), usage is null.

When payload.output.sentence.sentence_end is true (the current sentence is complete, see payload.output.sentence parameter description), usage.duration is the billable duration of the current task in seconds.

The format of payload.usage is as follows:

Parameter

Type

Description

duration

integer

The billable duration of the task, in seconds.

The format of payload.output.sentence is as follows:

Parameter

Type

Description

begin_time

integer

The start time of the sentence, in ms.

end_time

integer | null

The end time of the sentence, in ms. If this is an intermediate recognition result, the value is null.

text

string

The recognized text.

words

array

Character timestamp information.

heartbeat

boolean | null

If this value is true, you can skip processing the recognition result.

sentence_end

boolean

Indicates whether the given sentence has ended.

emo_tag

string

The emotion of the current sentence:

  • positive: Positive emotions, such as happiness or satisfaction

  • negative: Negative emotions, such as anger or a low mood

  • neutral: No obvious emotion

Emotion recognition has the following constraints:

  • It is only available for the paraformer-realtime-8k-v2 model.

  • You must disable semantic punctuation. This is controlled by the semantic_punctuation_enabled parameter of the run-task instruction. By default, semantic punctuation is disabled.

  • The emotion recognition result is displayed only when the value of payload.output.sentence.sentence_end is true.

emo_confidence

number

The confidence level of the emotion recognized in the current sentence. The value ranges from 0.0 to 1.0. A larger value indicates a higher confidence level.

Emotion recognition has the following constraints:

  • It is only available for the paraformer-realtime-8k-v2 model.

  • You must disable semantic punctuation. This is controlled by the semantic_punctuation_enabled parameter of the run-task instruction. By default, semantic punctuation is disabled.

  • The emotion recognition result is displayed only when the value of payload.output.sentence.sentence_end is true.

payload.output.sentence.words is a list of character timestamps, where each word has the following format:

Parameter

Type

Description

begin_time

integer

The start time of the character, in ms.

end_time

integer

The end time of the character, in ms.

text

string

The character.

punctuation

string

The punctuation mark.
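A minimal Python sketch that consumes these events (assuming the websocket-client connection ws from the earlier sketches) and prints only the final result of each sentence:

import json

while True:
    msg = json.loads(ws.recv())
    event = msg["header"]["event"]
    if event == "result-generated":
        sentence = msg["payload"]["output"]["sentence"]
        if sentence.get("sentence_end"):  # final result for this sentence
            print(sentence["text"])
    elif event in ("task-finished", "task-failed"):
        break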

3. task-finished event: Task has ended

When you receive the task-finished event from the server, the task has ended. At this point, you can close the WebSocket connection and terminate the program.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-finished",
        "attributes": {}
    },
    "payload": {
        "output": {},
        "usage": null
    }
}

header parameters:

Parameter

Type

Description

header.event

string

The event type.

For this event, the value is fixed as "task-finished".

header.task_id

string

The task_id generated by the client.

4. task-failed event: Task failed

If you receive a task-failed event, the task has failed. You must close the WebSocket connection and handle the error. Analyze the error message to determine the cause. If the failure is due to a programming issue, adjust your code to fix it.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-failed",
        "error_code": "CLIENT_ERROR",
        "error_message": "request timeout after 23 seconds.",
        "attributes": {}
    },
    "payload": {}
}

header parameters:

Parameter

Type

Description

header.event

string

The event type.

For this event, the value is fixed as "task-failed".

header.task_id

string

The task_id generated by the client.

header.error_code

string

A description of the error type.

header.error_message

string

The specific reason for the error.

Connection overhead and reuse

The WebSocket service supports connection reuse to improve resource utilization and avoid connection establishment overhead.

The server starts a new task when it receives a run-task instruction from the client. When the client sends a finish-task instruction, the server returns a task-finished event to end the task. After the task ends, the WebSocket connection can be reused. The client can start another task by sending a new run-task instruction.

Important
  1. Different tasks within a reused connection must use different task_ids.

  2. If a failure occurs during task execution, the service will still return a task-failed event and close the connection. This connection cannot be reused.

  3. If no new task is started within 60 seconds after a task ends, the connection automatically times out and disconnects.
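A minimal reuse sketch, assuming a hypothetical helper run_one_task(ws, task_id, audio_path) that performs run-task, audio sending, and finish-task and waits for task-finished, as outlined in the sketches above:

import uuid
import websocket  # third-party package: websocket-client

ws = websocket.create_connection(url, header=headers)  # url and headers as in the Headers section
for audio_path in ["first.pcm", "second.pcm"]:          # illustrative file names
    run_one_task(ws, uuid.uuid4().hex, audio_path)      # a new task_id for every task on the reused connection
ws.close()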

Code examples

The code examples provide a basic implementation to help you get the service running. You need to extend them for your specific business scenarios.

When writing WebSocket client code, asynchronous programming is typically used to send and receive messages simultaneously. You can write your program by following these steps:

  1. Establish a WebSocket connection

    Call a WebSocket library function to establish a WebSocket connection, passing the Headers and URL. The specific implementation varies depending on the programming language or library.

  2. Listen for server messages

    You can use the callback functions (observer pattern) provided by the WebSocket library to listen for messages from the server. The specific implementation depends on the programming language.

    The server returns two types of messages: binary audio streams and events.

    Monitor events:

    • task-started: A task-started event indicates that the task has started successfully. You can send binary audio or a finish-task instruction to the server only after this event is triggered. Otherwise, the task will fail.

    • result-generated: When the client sends binary audio, the server may continuously send the result-generated event, which contains the speech recognition result.

    • task-finished: When you receive a task-finished event, the task is complete. You can then close the WebSocket connection and terminate the program.

    • task-failed: A task-failed event means the task has failed. You must close the WebSocket connection and correct the code based on the error message.

  3. Send messages to the server (pay close attention to the timing)

    Send instructions and binary audio to the server from a thread other than the one that listens for server messages. For example, you can use the main thread. The specific implementation varies depending on the programming language.

    Send instructions in the following strict order. Otherwise, the task may fail:

    1. Send the run-task instruction

      • Starts the speech recognition task.

      • The returned task_id is required for the subsequent finish-task instruction and must be the same.

    2. Send binary audio (mono)

      • Sends the audio for recognition.

      • You must send the audio after you receive the task-started event from the server.

    3. Send the finish-task instruction

      • Ends the speech recognition task.

      • Send this instruction after the audio has been completely sent.

  4. Close the WebSocket connection

    Close the WebSocket connection when the program finishes, an exception occurs, or you receive a task-finished event or a task-failed event. You can typically do this by calling the close function in the WebSocket library.


In the following example, the audio file used is asr_example.wav.

Go

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/google/uuid"
	"github.com/gorilla/websocket"
)

const (
	wsURL     = "wss://dashscope.aliyuncs.com/api-ws/v1/inference/" // WebSocket server address
	audioFile = "asr_example.wav"                                   // Replace with the path to your audio file
)

var dialer = websocket.DefaultDialer

func main() {
	// If you have not configured the API key as an environment variable, you can replace the next line with: apiKey := "your_api_key". We do not recommend hard coding the API key directly into the code in a production environment to reduce the risk of API key leakage.
	apiKey := os.Getenv("DASHSCOPE_API_KEY")

	// Connect to the WebSocket service
	conn, err := connectWebSocket(apiKey)
	if err != nil {
		log.Fatal("Failed to connect to WebSocket: ", err)
	}
	defer closeConnection(conn)

	// Start a goroutine to receive results
	taskStarted := make(chan bool)
	taskDone := make(chan bool)
	startResultReceiver(conn, taskStarted, taskDone)

	// Send the run-task instruction
	taskID, err := sendRunTaskCmd(conn)
	if err != nil {
		log.Fatal("Failed to send run-task instruction: ", err)
	}

	// Wait for the task-started event
	waitForTaskStarted(taskStarted)

	// Send the audio file stream for recognition
	if err := sendAudioData(conn); err != nil {
		log.Fatal("Failed to send audio: ", err)
	}

	// Send the finish-task instruction
	if err := sendFinishTaskCmd(conn, taskID); err != nil {
		log.Fatal("Failed to send finish-task instruction: ", err)
	}

	// Wait for the task to complete or fail
	<-taskDone
}

// Define structs to represent JSON data
type Header struct {
	Action       string                 `json:"action"`
	TaskID       string                 `json:"task_id"`
	Streaming    string                 `json:"streaming"`
	Event        string                 `json:"event"`
	ErrorCode    string                 `json:"error_code,omitempty"`
	ErrorMessage string                 `json:"error_message,omitempty"`
	Attributes   map[string]interface{} `json:"attributes"`
}

type Output struct {
	Sentence struct {
		BeginTime int64  `json:"begin_time"`
		EndTime   *int64 `json:"end_time"`
		Text      string `json:"text"`
		Words     []struct {
			BeginTime   int64  `json:"begin_time"`
			EndTime     *int64 `json:"end_time"`
			Text        string `json:"text"`
			Punctuation string `json:"punctuation"`
		} `json:"words"`
	} `json:"sentence"`
}

type Payload struct {
	TaskGroup  string `json:"task_group"`
	Task       string `json:"task"`
	Function   string `json:"function"`
	Model      string `json:"model"`
	Parameters Params `json:"parameters"`
	// If you are not using the custom vocabulary feature, do not pass the resources parameter
	// Resources  []Resource `json:"resources"`
	Input  Input  `json:"input"`
	Output Output `json:"output,omitempty"`
	Usage  *struct {
		Duration int `json:"duration"`
	} `json:"usage,omitempty"`
}

type Params struct {
	Format                   string   `json:"format"`
	SampleRate               int      `json:"sample_rate"`
	VocabularyID             string   `json:"vocabulary_id"`
	DisfluencyRemovalEnabled bool     `json:"disfluency_removal_enabled"`
	LanguageHints            []string `json:"language_hints"`
}

// If you are not using the custom vocabulary feature, do not pass the resources parameter
type Resource struct {
	ResourceID   string `json:"resource_id"`
	ResourceType string `json:"resource_type"`
}

type Input struct {
}

type Event struct {
	Header  Header  `json:"header"`
	Payload Payload `json:"payload"`
}

// Connect to the WebSocket service
func connectWebSocket(apiKey string) (*websocket.Conn, error) {
	header := make(http.Header)
	header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
	conn, _, err := dialer.Dial(wsURL, header)
	return conn, err
}

// Start a goroutine to asynchronously receive WebSocket messages
func startResultReceiver(conn *websocket.Conn, taskStarted chan<- bool, taskDone chan<- bool) {
	go func() {
		for {
			_, message, err := conn.ReadMessage()
			if err != nil {
				log.Println("Failed to parse server message: ", err)
				return
			}
			var event Event
			err = json.Unmarshal(message, &event)
			if err != nil {
				log.Println("Failed to parse event: ", err)
				continue
			}
			if handleEvent(conn, event, taskStarted, taskDone) {
				return
			}
		}
	}()
}

// Send the run-task instruction
func sendRunTaskCmd(conn *websocket.Conn) (string, error) {
	runTaskCmd, taskID, err := generateRunTaskCmd()
	if err != nil {
		return "", err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(runTaskCmd))
	return taskID, err
}

// Generate the run-task instruction
func generateRunTaskCmd() (string, string, error) {
	taskID := uuid.New().String()
	runTaskCmd := Event{
		Header: Header{
			Action:    "run-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			TaskGroup: "audio",
			Task:      "asr",
			Function:  "recognition",
			Model:     "paraformer-realtime-v2",
			Parameters: Params{
				Format:     "wav",
				SampleRate: 16000,
			},
			Input: Input{},
		},
	}
	runTaskCmdJSON, err := json.Marshal(runTaskCmd)
	return string(runTaskCmdJSON), taskID, err
}

// Wait for the task-started event
func waitForTaskStarted(taskStarted chan bool) {
	select {
	case <-taskStarted:
		fmt.Println("Task started successfully")
	case <-time.After(10 * time.Second):
		log.Fatal("Timed out waiting for task-started, task failed to start")
	}
}

// Send audio data
func sendAudioData(conn *websocket.Conn) error {
	file, err := os.Open(audioFile)
	if err != nil {
		return err
	}
	defer file.Close()

	buf := make([]byte, 1024) // Assume 100 ms of audio data is approximately 1024 bytes
	for {
		n, err := file.Read(buf)
		if n == 0 {
			break
		}
		if err != nil && err != io.EOF {
			return err
		}
		err = conn.WriteMessage(websocket.BinaryMessage, buf[:n])
		if err != nil {
			return err
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

// Send the finish-task instruction
func sendFinishTaskCmd(conn *websocket.Conn, taskID string) error {
	finishTaskCmd, err := generateFinishTaskCmd(taskID)
	if err != nil {
		return err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(finishTaskCmd))
	return err
}

// Generate the finish-task instruction
func generateFinishTaskCmd(taskID string) (string, error) {
	finishTaskCmd := Event{
		Header: Header{
			Action:    "finish-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			Input: Input{},
		},
	}
	finishTaskCmdJSON, err := json.Marshal(finishTaskCmd)
	return string(finishTaskCmdJSON), err
}

// Handle event
func handleEvent(conn *websocket.Conn, event Event, taskStarted chan<- bool, taskDone chan<- bool) bool {
	switch event.Header.Event {
	case "task-started":
		fmt.Println("Received task-started event")
		taskStarted <- true
	case "result-generated":
		if event.Payload.Output.Sentence.Text != "" {
			fmt.Println("Recognition result: ", event.Payload.Output.Sentence.Text)
		}
		if event.Payload.Usage != nil {
			fmt.Println("Task billable duration (seconds): ", event.Payload.Usage.Duration)
		}
	case "task-finished":
		fmt.Println("Task finished")
		taskDone <- true
		return true
	case "task-failed":
		handleTaskFailed(event, conn)
		taskDone <- true
		return true
	default:
		log.Printf("Unexpected event: %v", event)
	}
	return false
}

// Handle task-failed event
func handleTaskFailed(event Event, conn *websocket.Conn) {
	if event.Header.ErrorMessage != "" {
		log.Fatalf("Task failed: %s", event.Header.ErrorMessage)
	} else {
		log.Fatal("Task failed for an unknown reason")
	}
}

// Close the connection
func closeConnection(conn *websocket.Conn) {
	if conn != nil {
		conn.Close()
	}
}

C#

Example code:

using System.Net.WebSockets;
using System.Text;
using System.Text.Json;
using System.Text.Json.Nodes;

class Program {
    private static ClientWebSocket _webSocket = new ClientWebSocket();
    private static CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
    private static bool _taskStartedReceived = false;
    private static bool _taskFinishedReceived = false;
    // If you have not configured the API key as an environment variable, you can replace the next line with: private const string ApiKey="your_api_key". We do not recommend hard coding the API key directly into the code in a production environment to reduce the risk of API key leakage.
    private static readonly string ApiKey = Environment.GetEnvironmentVariable("DASHSCOPE_API_KEY") ?? throw new InvalidOperationException("DASHSCOPE_API_KEY environment variable is not set.");

    // WebSocket server address
    private const string WebSocketUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference/";
    // Replace with the path to your audio file
    private const string AudioFilePath = "asr_example.wav";

    static async Task Main(string[] args) {
        // Establish a WebSocket connection and configure headers for authentication
        _webSocket.Options.SetRequestHeader("Authorization", "bearer " + ApiKey);

        await _webSocket.ConnectAsync(new Uri(WebSocketUrl), _cancellationTokenSource.Token);

        // Start a thread to asynchronously receive WebSocket messages
        var receiveTask = ReceiveMessagesAsync();

        // Send the run-task instruction
        string _taskId = Guid.NewGuid().ToString("N"); // Generate a 32-character random ID
        var runTaskJson = GenerateRunTaskJson(_taskId);
        await SendAsync(runTaskJson);

        // Wait for the task-started event
        while (!_taskStartedReceived) {
            await Task.Delay(100, _cancellationTokenSource.Token);
        }

        // Read the local file and send the audio stream for recognition to the server
        await SendAudioStreamAsync(AudioFilePath);

        // Send the finish-task instruction to end the task
        var finishTaskJson = GenerateFinishTaskJson(_taskId);
        await SendAsync(finishTaskJson);

        // Wait for the task-finished event
        while (!_taskFinishedReceived && !_cancellationTokenSource.IsCancellationRequested) {
            try {
                await Task.Delay(100, _cancellationTokenSource.Token);
            } catch (OperationCanceledException) {
                // The task has been canceled, exit the loop
                break;
            }
        }

        // Close the connection
        if (!_cancellationTokenSource.IsCancellationRequested) {
            await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", _cancellationTokenSource.Token);
        }

        _cancellationTokenSource.Cancel();
        try {
            await receiveTask;
        } catch (OperationCanceledException) {
            // Ignore the operation canceled exception
        }
    }

    private static async Task ReceiveMessagesAsync() {
        try {
            while (_webSocket.State == WebSocketState.Open && !_cancellationTokenSource.IsCancellationRequested) {
                var message = await ReceiveMessageAsync(_cancellationTokenSource.Token);
                if (message != null) {
                    var eventValue = message["header"]?["event"]?.GetValue<string>();
                    switch (eventValue) {
                        case "task-started":
                            Console.WriteLine("Task started successfully");
                            _taskStartedReceived = true;
                            break;
                        case "result-generated":
                            Console.WriteLine($"Recognition result: {message["payload"]?["output"]?["sentence"]?["text"]?.GetValue<string>()}");
                            if (message["payload"]?["usage"] != null && message["payload"]?["usage"]?["duration"] != null) {
                                Console.WriteLine($"Task billable duration (seconds): {message["payload"]?["usage"]?["duration"]?.GetValue<int>()}");
                            }
                            break;
                        case "task-finished":
                            Console.WriteLine("Task finished");
                            _taskFinishedReceived = true;
                            _cancellationTokenSource.Cancel();
                            break;
                        case "task-failed":
                            Console.WriteLine($"Task failed: {message["header"]?["error_message"]?.GetValue<string>()}");
                            _cancellationTokenSource.Cancel();
                            break;
                    }
                }
            }
        } catch (OperationCanceledException) {
            // Ignore the operation canceled exception
        }
    }

    private static async Task<JsonNode?> ReceiveMessageAsync(CancellationToken cancellationToken) {
        var buffer = new byte[1024 * 4];
        var segment = new ArraySegment<byte>(buffer);
        var result = await _webSocket.ReceiveAsync(segment, cancellationToken);

        if (result.MessageType == WebSocketMessageType.Close) {
            await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", cancellationToken);
            return null;
        }

        var message = Encoding.UTF8.GetString(buffer, 0, result.Count);
        return JsonNode.Parse(message);
    }

    private static async Task SendAsync(string message) {
        var buffer = Encoding.UTF8.GetBytes(message);
        var segment = new ArraySegment<byte>(buffer);
        await _webSocket.SendAsync(segment, WebSocketMessageType.Text, true, _cancellationTokenSource.Token);
    }

    private static async Task SendAudioStreamAsync(string filePath) {
        using (var audioStream = File.OpenRead(filePath)) {
            var buffer = new byte[1024]; // Send 100 ms of audio data each time
            int bytesRead;

            while ((bytesRead = await audioStream.ReadAsync(buffer, 0, buffer.Length)) > 0) {
                var segment = new ArraySegment<byte>(buffer, 0, bytesRead);
                await _webSocket.SendAsync(segment, WebSocketMessageType.Binary, true, _cancellationTokenSource.Token);
                await Task.Delay(100); // 100 ms interval
            }
        }
    }

    private static string GenerateRunTaskJson(string taskId) {
        var runTask = new JsonObject {
            ["header"] = new JsonObject {
                ["action"] = "run-task",
                ["task_id"] = taskId,
                ["streaming"] = "duplex"
            },
            ["payload"] = new JsonObject {
                ["task_group"] = "audio",
                ["task"] = "asr",
                ["function"] = "recognition",
                ["model"] = "paraformer-realtime-v2",
                ["parameters"] = new JsonObject {
                    ["format"] = "wav",
                    ["sample_rate"] = 16000,
                    ["vocabulary_id"] = "vocab-xxx-24ee19fa8cfb4d52902170a0xxxxxxxx",
                    ["disfluency_removal_enabled"] = false
                },
                // If you are not using the custom vocabulary feature, do not pass the resources parameter
                //["resources"] = new JsonArray {
                //    new JsonObject {
                //        ["resource_id"] = "xxxxxxxxxxxx",
                //        ["resource_type"] = "asr_phrase"
                //    }
                //},
                ["input"] = new JsonObject()
            }
        };
        return JsonSerializer.Serialize(runTask);
    }

    private static string GenerateFinishTaskJson(string taskId) {
        var finishTask = new JsonObject {
            ["header"] = new JsonObject {
                ["action"] = "finish-task",
                ["task_id"] = taskId,
                ["streaming"] = "duplex"
            },
            ["payload"] = new JsonObject {
                ["input"] = new JsonObject()
            }
        };
        return JsonSerializer.Serialize(finishTask);
    }
}

PHP

The example code directory structure is:

my-php-project/

├── composer.json

├── vendor/

└── index.php

The content of composer.json is as follows. Determine the version numbers of the dependencies as needed:

{
    "require": {
        "react/event-loop": "^1.3",
        "react/socket": "^1.11",
        "react/stream": "^1.2",
        "react/http": "^1.1",
        "ratchet/pawl": "^0.4"
    },
    "autoload": {
        "psr-4": {
            "App\\": "src/"
        }
    }
}

The content of index.php is as follows:

<?php

require __DIR__ . '/vendor/autoload.php';

use Ratchet\Client\Connector;
use React\EventLoop\Loop;
use React\Socket\Connector as SocketConnector;
use Ratchet\rfc6455\Messaging\Frame;

# If you have not configured the API key as an environment variable, you can replace the next line with: $api_key="your_api_key". We do not recommend hard coding the API key directly into the code in a production environment to reduce the risk of API key leakage.
$api_key = getenv("DASHSCOPE_API_KEY");
$websocket_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server address
$audio_file_path = 'asr_example.wav'; // Replace with the path to your audio file

$loop = Loop::get();

// Create a custom connector
$socketConnector = new SocketConnector($loop, [
    'tcp' => [
        'bindto' => '0.0.0.0:0',
    ],
    'tls' => [
        'verify_peer' => false,
        'verify_peer_name' => false,
    ],
]);

$connector = new Connector($loop, $socketConnector);

$headers = [
    'Authorization' => 'bearer ' . $api_key
];

$connector($websocket_url, [], $headers)->then(function ($conn) use ($loop, $audio_file_path) {
    echo "Connected to WebSocket server\n";

    // Start a thread to asynchronously receive WebSocket messages
    $conn->on('message', function($msg) use ($conn, $loop, $audio_file_path) {
        $response = json_decode($msg, true);

        if (isset($response['header']['event'])) {
            handleEvent($conn, $response, $loop, $audio_file_path);
        } else {
            echo "Unknown message format\n";
        }
    });

    // Listen for connection closure
    $conn->on('close', function($code = null, $reason = null) {
        echo "Connection closed\n";
        if ($code !== null) {
            echo "Close code: " . $code . "\n";
        }
        if ($reason !== null) {
            echo "Close reason: " . $reason . "\n";
        }
    });

    // Generate a task ID
    $taskId = generateTaskId();

    // Send the run-task instruction
    sendRunTaskMessage($conn, $taskId);

}, function ($e) {
    echo "Could not connect: {$e->getMessage()}\n";
});

$loop->run();

/**
 * Generate a task ID
 * @return string
 */
function generateTaskId(): string {
    return bin2hex(random_bytes(16));
}

/**
 * Send the run-task instruction
 * @param $conn
 * @param $taskId
 */
function sendRunTaskMessage($conn, $taskId) {
    $runTaskMessage = json_encode([
        "header" => [
            "action" => "run-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "task_group" => "audio",
            "task" => "asr",
            "function" => "recognition",
            "model" => "paraformer-realtime-v2",
            "parameters" => [
                "format" => "wav",
                "sample_rate" => 16000
            ],
            // If you are not using the custom vocabulary feature, do not pass the resources parameter
            //"resources" => [
            //    [
            //        "resource_id" => "xxxxxxxxxxxx",
            //        "resource_type" => "asr_phrase"
            //    ]
            //],
            "input" => []
        ]
    ]);
    echo "Preparing to send run-task instruction: " . $runTaskMessage . "\n";
    $conn->send($runTaskMessage);
    echo "run-task instruction sent\n";
}

/**
 * Read the audio file
 * @param string $filePath
 * @return bool|string
 */
function readAudioFile(string $filePath) {
    $voiceData = file_get_contents($filePath);
    if ($voiceData === false) {
        echo "Could not read audio file\n";
    }
    return $voiceData;
}

/**
 * Split the audio data
 * @param string $data
 * @param int $chunkSize
 * @return array
 */
function splitAudioData(string $data, int $chunkSize): array {
    return str_split($data, $chunkSize);
}

/**
 * Send the finish-task instruction
 * @param $conn
 * @param $taskId
 */
function sendFinishTaskMessage($conn, $taskId) {
    $finishTaskMessage = json_encode([
        "header" => [
            "action" => "finish-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "input" => []
        ]
    ]);
    echo "Preparing to send finish-task instruction: " . $finishTaskMessage . "\n";
    $conn->send($finishTaskMessage);
    echo "finish-task instruction sent\n";
}

/**
 * Handle event
 * @param $conn
 * @param $response
 * @param $loop
 * @param $audio_file_path
 */
function handleEvent($conn, $response, $loop, $audio_file_path) {
    static $taskId;
    static $chunks;
    static $allChunksSent = false;

    if (is_null($taskId)) {
        $taskId = generateTaskId();
    }

    switch ($response['header']['event']) {
        case 'task-started':
            echo "Task started, sending audio data...\n";
            // Read the audio file
            $voiceData = readAudioFile($audio_file_path);
            if ($voiceData === false) {
                echo "Could not read audio file\n";
                $conn->close();
                return;
            }

            // Split the audio data
            $chunks = splitAudioData($voiceData, 1024);

            // Define the send function
            $sendChunk = function() use ($conn, &$chunks, $loop, &$sendChunk, &$allChunksSent, $taskId) {
                if (!empty($chunks)) {
                    $chunk = array_shift($chunks);
                    $binaryMsg = new Frame($chunk, true, Frame::OP_BINARY);
                    $conn->send($binaryMsg);
                    // Send the next chunk after 100 ms
                    $loop->addTimer(0.1, $sendChunk);
                } else {
                    echo "All data blocks sent\n";
                    $allChunksSent = true;

                    // Send the finish-task instruction
                    sendFinishTaskMessage($conn, $taskId);
                }
            };

            // Start sending audio data
            $sendChunk();
            break;
        case 'result-generated':
            $result = $response['payload']['output']['sentence'];
            echo "Recognition result: " . $result['text'] . "\n";
            if (isset($response['payload']['usage']['duration'])) {
                echo "Task billable duration (seconds): " . $response['payload']['usage']['duration'] . "\n";
            }
            break;
        case 'task-finished':
            echo "Task finished\n";
            $conn->close();
            break;
        case 'task-failed':
            echo "Task failed\n";
            echo "Error code: " . $response['header']['error_code'] . "\n";
            echo "Error message: " . $response['header']['error_message'] . "\n";
            $conn->close();
            break;
        case 'error':
            echo "Error: " . $response['payload']['message'] . "\n";
            break;
        default:
            echo "Unknown event: " . $response['header']['event'] . "\n";
            break;
    }

    // If all data has been sent and the task is finished, close the connection
    if ($allChunksSent && $response['header']['event'] == 'task-finished') {
        // Wait 1 second to ensure all data has been transmitted
        $loop->addTimer(1, function() use ($conn) {
            $conn->close();
            echo "Client closed connection\n";
        });
    }
}

Node.js

Install the required dependencies:

npm install ws
npm install uuid

Example code:

const fs = require('fs');
const WebSocket = require('ws');
const { v4: uuidv4 } = require('uuid'); // Used to generate UUIDs

// If you have not configured the API key as an environment variable, you can replace the next line with: apiKey = 'your_api_key'. We do not recommend hard coding the API key directly into the code in a production environment to reduce the risk of API key leakage.
const apiKey = process.env.DASHSCOPE_API_KEY;
const url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server address
const audioFile = 'asr_example.wav'; // Replace with the path to your audio file

// Generate a 32-character random ID
const TASK_ID = uuidv4().replace(/-/g, '').slice(0, 32);

// Create a WebSocket client
const ws = new WebSocket(url, {
  headers: {
    Authorization: `bearer ${apiKey}`
  }
});

let taskStarted = false; // Flag to indicate if the task has started

// Send the run-task instruction when the connection is open
ws.on('open', () => {
  console.log('Connected to server');
  sendRunTask();
});

// Handle received messages
ws.on('message', (data) => {
  const message = JSON.parse(data);
  switch (message.header.event) {
    case 'task-started':
      console.log('Task started');
      taskStarted = true;
      sendAudioStream();
      break;
    case 'result-generated':
      console.log('Recognition result: ', message.payload.output.sentence.text);
      if (message.payload.usage) {
        console.log('Task billable duration (seconds): ', message.payload.usage.duration);
      }
      break;
    case 'task-finished':
      console.log('Task finished');
      ws.close();
      break;
    case 'task-failed':
      console.error('Task failed: ', message.header.error_message);
      ws.close();
      break;
    default:
      console.log('Unknown event: ', message.header.event);
  }
});

// If the task-started event is not received, close the connection
ws.on('close', () => {
  if (!taskStarted) {
    console.error('Task not started, closing connection');
  }
});

// Send the run-task instruction
function sendRunTask() {
  const runTaskMessage = {
    header: {
      action: 'run-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      task_group: 'audio',
      task: 'asr',
      function: 'recognition',
      model: 'paraformer-realtime-v2',
      parameters: {
        sample_rate: 16000,
        format: 'wav'
      },
      input: {}
    }
  };
  ws.send(JSON.stringify(runTaskMessage));
}

// Send the audio stream
function sendAudioStream() {
  const audioStream = fs.createReadStream(audioFile);
  let chunkCount = 0;

  function sendNextChunk() {
    const chunk = audioStream.read();
    if (chunk) {
      ws.send(chunk);
      chunkCount++;
      setTimeout(sendNextChunk, 100); // Send every 100 ms
    }
  }

  audioStream.on('readable', () => {
    sendNextChunk();
  });

  audioStream.on('end', () => {
    console.log('Audio stream ended');
    sendFinishTask();
  });

  audioStream.on('error', (err) => {
    console.error('Error reading audio file: ', err);
    ws.close();
  });
}

// Send the finish-task instruction
function sendFinishTask() {
  const finishTaskMessage = {
    header: {
      action: 'finish-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      input: {}
    }
  };
  ws.send(JSON.stringify(finishTaskMessage));
}

// Handle errors
ws.on('error', (error) => {
  console.error('WebSocket error: ', error);
});

Error codes

If an error occurs, see Error messages for troubleshooting.

If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.

FAQ

Features

Q: How to maintain a persistent connection with the server during long periods of silence?

You can set the heartbeat request parameter to true and continuously send silent audio to the server.

Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio using various methods, such as using audio editing software such as Audacity or Adobe Audition, or command-line tools such as FFmpeg.
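A minimal Python sketch of this keep-alive approach (assuming heartbeat is set to true in the run-task parameters and the websocket-client connection ws from the earlier sketches); all-zero samples are valid silent PCM:

import time

SILENT_100MS = b"\x00" * 3200   # 100 ms of 16 kHz, 16-bit mono PCM silence
for _ in range(50):             # roughly 5 seconds of silence
    ws.send_binary(SILENT_100MS)
    time.sleep(0.1)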

Q: How to convert an audio format to the required format?

You can use the FFmpeg tool. For more information, see the official FFmpeg website.

# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites an existing file (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac num_channels output.ext

# Example: WAV to MP3 (maintain original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac  # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus
Q: Can I view the time range for each sentence?

Yes, you can. The speech recognition results include the start and end timestamps for each sentence. You can use these timestamps to determine the time range of each sentence.

Q: Why use the WebSocket protocol instead of the HTTP/HTTPS protocol? Why not provide a RESTful API?

The speech service uses WebSocket instead of HTTP, HTTPS, or RESTful APIs because it requires full-duplex communication. WebSocket allows the server and client to actively push data to each other, such as real-time speech synthesis audio or recognition progress. In contrast, HTTP-based RESTful APIs support only a one-way, client-initiated request-response model, which is unsuitable for real-time interaction.

Q: How to recognize a local file (recording file)?

Convert the local file into a binary audio stream and upload the stream for recognition through the binary channel of the WebSocket. You can typically do this using the send method of a WebSocket library. A code snippet is shown below. For a complete example, see Code examples.


Go

// Send audio data
func sendAudioData(conn *websocket.Conn) error {
	file, err := os.Open(audioFile)
	if err != nil {
		return err
	}
	defer file.Close()

	buf := make([]byte, 1024) // Assume 100 ms of audio data is approximately 1024 bytes
	for {
		n, err := file.Read(buf)
		if n == 0 {
			break
		}
		if err != nil && err != io.EOF {
			return err
		}
		err = conn.WriteMessage(websocket.BinaryMessage, buf[:n])
		if err != nil {
			return err
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}
C#

private static async Task SendAudioStreamAsync(string filePath) {
    using (var audioStream = File.OpenRead(filePath)) {
        var buffer = new byte[1024]; // Send 100 ms of audio data each time
        int bytesRead;

        while ((bytesRead = await audioStream.ReadAsync(buffer, 0, buffer.Length)) > 0) {
            var segment = new ArraySegment<byte>(buffer, 0, bytesRead);
            await _webSocket.SendAsync(segment, WebSocketMessageType.Binary, true, _cancellationTokenSource.Token);
            await Task.Delay(100); // 100 ms interval
        }
    }
}
PHP

// Read the audio file
$voiceData = readAudioFile($audio_file_path);
if ($voiceData === false) {
    echo "Could not read audio file\n";
    $conn->close();
    return;
}

// Split the audio data
$chunks = splitAudioData($voiceData, 1024);

// Define the send function
$sendChunk = function() use ($conn, &$chunks, $loop, &$sendChunk, &$allChunksSent, $taskId) {
    if (!empty($chunks)) {
        $chunk = array_shift($chunks);
        $binaryMsg = new Frame($chunk, true, Frame::OP_BINARY);
        $conn->send($binaryMsg);
        // Send the next chunk after 100 ms
        $loop->addTimer(0.1, $sendChunk);
    } else {
        echo "All data blocks sent\n";
        $allChunksSent = true;

        // Send the finish-task instruction
        sendFinishTaskMessage($conn, $taskId);
    }
};

// Start sending audio data
$sendChunk();
Node.js

// Send the audio stream
function sendAudioStream() {
  const audioStream = fs.createReadStream(audioFile);
  let chunkCount = 0;

  function sendNextChunk() {
    const chunk = audioStream.read();
    if (chunk) {
      ws.send(chunk);
      chunkCount++;
      setTimeout(sendNextChunk, 100); // Send every 100 ms
    }
  }

  audioStream.on('readable', () => {
    sendNextChunk();
  });

  audioStream.on('end', () => {
    console.log('Audio stream ended');
    sendFinishTask();
  });

  audioStream.on('error', (err) => {
    console.error('Error reading audio file: ', err);
    ws.close();
  });
}

Troubleshooting

If an error occurs in your code, refer to Error codes for troubleshooting.

Q: Why is there no recognition result?

  1. Check whether the audio format and sampleRate/sample_rate in the request parameters are set correctly and meet the parameter constraints. The following are common error examples:

    • The audio file has a .wav extension but is actually in MP3 format, and the format parameter is incorrectly set to `wav` based on the file extension.

    • The audio sample rate is 3600 Hz, but the sampleRate/sample_rate parameter is incorrectly set to 48000.

    You can use the ffprobe tool to obtain audio information, such as the container, encoding, sample rate, and sound channels:

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
  2. When you use the paraformer-realtime-v2 model, check whether the language set in language_hints matches the actual language of the audio.

    For example, the audio is in Chinese, but language_hints is set to en (English).

  3. If all the preceding checks pass, you can use custom hotwords to improve the recognition of specific words.