This topic describes the client events for the Qwen-TTS Realtime API.
Reference: Real-time speech synthesis - Qwen.
session.update
Updates the session configuration. After a WebSocket connection is established, send this event as the first step of the interaction. If you do not send this event, the system uses the default configurations. After the server successfully processes this event, it returns a session.updated event as confirmation.
event_id string (Required)
A unique event ID generated by the client. It must be unique within a single WebSocket connection session. We strongly recommend using a Universally Unique Identifier (UUID).

Example:

{
  "event_id": "event_123",
  "type": "session.update",
  "session": {
    "voice": "Cherry",
    "mode": "server_commit",
    "language_type": "Chinese",
    "response_format": "pcm",
    "sample_rate": 24000,
    "instructions": "",
    "optimize_instructions": false
  }
}
type string (Required)
The event type. Set to session.update.
session object (Optional)
The session configuration.

Properties:

voice string (Required)
The voice used for speech synthesis. For more information, see Supported voices. System voices and custom voices are supported:
- System voices: available only for the Qwen3-TTS-Instruct-Flash-Realtime, Qwen3-TTS-Flash-Realtime, and Qwen-TTS-Realtime model series. For voice samples, see Supported voices.
- Custom voices

mode string (Optional)
The interaction mode. Valid values:
- server_commit (default): the server automatically determines when to synthesize, balancing latency and quality. This mode is recommended for most scenarios.
- commit: the client manually triggers synthesis. This mode provides the lowest latency but requires you to manage sentence integrity.

language_type string (Optional)
The language of the synthesized audio. Default value: Auto.
- Auto: use this value when the language of the text is uncertain or the text contains multiple languages. The model automatically matches the pronunciation for each language segment in the text, but cannot guarantee perfect accuracy.
- Specific language: use this for single-language text. Specifying a language significantly improves synthesis quality and usually performs better than Auto. Valid values: Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian.

response_format string (Optional)
The format of the audio output from the model. Qwen-TTS-Realtime (see Supported models) supports only pcm.

sample_rate integer (Optional)
The sample rate (in Hz) of the audio output from the model. Supported values: 8000, 16000, 24000 (default), and 48000. Qwen-TTS-Realtime (see Supported models) supports only 24000.

speech_rate float (Optional)
The speech rate of the audio. A value of 1.0 is normal speed; values below 1.0 are slower, and values above 1.0 are faster. Default value: 1.0. Valid range: [0.5, 2.0]. Not supported by Qwen-TTS-Realtime (see Supported models).

volume integer (Optional)
The volume of the audio. Default value: 50. Valid range: [0, 100]. Not supported by Qwen-TTS-Realtime (see Supported models).

pitch_rate float (Optional)
The pitch of the synthesized audio. Default value: 1.0. Valid range: [0.5, 2.0]. Not supported by Qwen-TTS-Realtime (see Supported models).

bit_rate integer (Optional)
The bitrate (in kbps) of the audio. A higher bitrate results in better audio quality and a larger file size. This parameter is available only when the audio format (response_format) is set to opus. Default value: 128. Valid range: [6, 510]. Not supported by Qwen-TTS-Realtime (see Supported models).

instructions string (Optional)
Instructions for the speech synthesis; see Real-time speech synthesis - Qwen. Default value: none; the parameter has no effect if not set. Length limit: up to 1600 tokens. Supported languages: Chinese and English only. Scope: available only for the Qwen3-TTS-Instruct-Flash-Realtime model series.

optimize_instructions boolean (Optional)
Specifies whether to optimize the instructions to improve the naturalness and expressiveness of the speech synthesis. Default value: false. When set to true, the system semantically enhances and rewrites the content of instructions to generate internal instructions better suited for speech synthesis. Enable this feature in scenarios that require high-quality, fine-grained vocal expression. This parameter depends on the instructions parameter being set; if instructions is empty, it has no effect. Scope: available only for the Qwen3-TTS-Instruct-Flash-Realtime model series.
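As a concrete illustration, the session.update payload above can be built programmatically before being serialized and sent over the WebSocket. The sketch below is a minimal Python helper, not part of any official SDK; the field names mirror the event reference, while the helper name and default arguments are our own choices.

```python
import json
import uuid

def session_update_event(voice="Cherry", mode="server_commit",
                         language_type="Chinese",
                         response_format="pcm", sample_rate=24000):
    """Build a session.update event dict ready for json.dumps().

    Illustrative helper only; field names follow the event reference.
    """
    return {
        # A UUID keeps the event_id unique within the connection.
        "event_id": f"event_{uuid.uuid4().hex}",
        "type": "session.update",
        "session": {
            "voice": voice,
            "mode": mode,
            "language_type": language_type,
            "response_format": response_format,
            "sample_rate": sample_rate,
        },
    }

# Serialize before sending as the first message on the WebSocket.
payload = json.dumps(session_update_event())
```

Send the serialized payload as the first message after the WebSocket connection is established, then wait for the server's session.updated confirmation before appending text.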
input_text_buffer.append
Appends text to be synthesized to the text buffer. In server_commit mode, the text is appended to the server-side text buffer. In commit mode, the text is appended to the client-side text buffer.
event_id string (Required)
A unique event ID generated by the client. It must be unique within a single WebSocket connection session. We strongly recommend using a UUID.

type string (Required)
The event type. Set to input_text_buffer.append.

text string (Required)
The text to be synthesized.

Example:

{
  "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
  "type": "input_text_buffer.append",
  "text": "Hello, I am Qwen."
}
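In streaming use, text to be synthesized often arrives in fragments (for example, tokens from an upstream model). A minimal sketch of wrapping each fragment in an input_text_buffer.append event; the helper name and the example fragments are our own:

```python
import uuid

def append_event(text):
    # Wrap one text fragment in an input_text_buffer.append event.
    return {
        "event_id": f"event_{uuid.uuid4().hex}",
        "type": "input_text_buffer.append",
        "text": text,
    }

# Example: stream output fragments to the synthesis buffer one by one.
fragments = ["Hello, ", "I am ", "Qwen."]
events = [append_event(chunk) for chunk in fragments]
```

Each event would be JSON-serialized and sent in order; in server_commit mode the server decides when to synthesize the buffered text, while in commit mode the client must follow up with input_text_buffer.commit.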
input_text_buffer.commit
Commits the user input text buffer to create a new user message item in the conversation. If the input text buffer is empty, this event causes an error. In server_commit mode, submitting this event synthesizes all previous text immediately, and the server no longer caches text. In commit mode, the client must commit the text buffer to create a user message item. Committing the input text buffer does not create a response from the model. The server responds with an input_text_buffer.committed event.
event_id string (Required)
A unique event ID generated by the client. It must be unique within a single WebSocket connection session. We strongly recommend using a UUID.

type string (Required)
The event type. Set to input_text_buffer.commit.

Example:

{
  "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
  "type": "input_text_buffer.commit"
}
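Because committing an empty buffer causes an error, a commit-mode client may track how much uncommitted text it has appended and only emit input_text_buffer.commit when the buffer is non-empty. A sketch under that assumption (the class name and tracking strategy are ours, not part of the API):

```python
import uuid

class CommitModeBuffer:
    """Track uncommitted text client-side to avoid committing an empty buffer."""

    def __init__(self):
        self.pending = 0  # characters appended since the last commit

    def _event(self, payload):
        return {"event_id": f"event_{uuid.uuid4().hex}", **payload}

    def append(self, text):
        self.pending += len(text)
        return self._event({"type": "input_text_buffer.append", "text": text})

    def commit(self):
        # Committing an empty buffer would cause a server-side error.
        if self.pending == 0:
            return None
        self.pending = 0
        return self._event({"type": "input_text_buffer.commit"})

buf = CommitModeBuffer()
appended = buf.append("Hello, I am Qwen.")
committed = buf.commit()   # commit event: the buffer was non-empty
redundant = buf.commit()   # None: nothing new to commit
```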
input_text_buffer.clear
Clears the text in the buffer. The server responds with an input_text_buffer.cleared event.
event_id string (Required)
A unique event ID generated by the client. It must be unique within a single WebSocket connection session. We strongly recommend using a UUID.

type string (Required)
The event type. Set to input_text_buffer.clear.

Example:

{
  "event_id": "event_2728",
  "type": "input_text_buffer.clear"
}
session.finish
The client sends a session.finish event to notify the server that there is no more text input. The server returns the remaining audio and then closes the connection.
event_id string (Required)
A unique event ID generated by the client. It must be unique within a single WebSocket connection session. We strongly recommend using a UUID.

type string (Required)
The event type. Set to session.finish.

Example:

{
  "event_id": "event_2239",
  "type": "session.finish"
}
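Putting the events together, a complete commit-mode interaction follows a fixed order: configure the session, append text, commit, then finish. The generator below sketches that sequence as plain dicts (the function name and session defaults are illustrative, not an official API); each dict would be JSON-serialized and sent over the WebSocket in order.

```python
import uuid

def client_event_stream(chunks, voice="Cherry"):
    """Yield the client events for one commit-mode synthesis session, in order."""
    def event(payload):
        return {"event_id": f"event_{uuid.uuid4().hex}", **payload}

    # 1. Configure the session first; the server replies with session.updated.
    yield event({
        "type": "session.update",
        "session": {"voice": voice, "mode": "commit",
                    "response_format": "pcm", "sample_rate": 24000},
    })
    # 2. Stream the text into the buffer.
    for chunk in chunks:
        yield event({"type": "input_text_buffer.append", "text": chunk})
    # 3. Trigger synthesis of the buffered text.
    yield event({"type": "input_text_buffer.commit"})
    # 4. Signal that no more text is coming; the server returns the
    #    remaining audio and then closes the connection.
    yield event({"type": "session.finish"})

events = list(client_event_stream(["Hello, ", "I am Qwen."]))
```

While these events are being sent, the client should concurrently read server events (audio deltas and confirmations) from the same connection.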