WebSocket interaction protocol for real-time speech synthesis - Alibaba Cloud Model Studio

This topic explains the WebSocket interaction flow between the server and the client for real-time TTS.

User Guide: For a model overview and selection recommendations, see Real-time speech synthesis - Qwen.

qwen-tts uses a persistent WebSocket connection with event-driven responses. The client sends text in real time and continuously receives streamed audio. Two modes are supported:

ServerCommit mode: The server automatically segments the text and decides when to start synthesis, so developers do not need to manage chunking themselves. Suitable for latency-sensitive scenarios that do not require manual control over timing.
Commit mode: The client controls when each text segment is submitted. Suitable for complex control logic, such as precisely synchronizing synthesis with text produced by concurrently running models.

Mode descriptions:

In ServerCommit mode, after multiple input_text_buffer.append calls, the system determines the synthesis start point based on internal rules.
In ServerCommit mode, if input_text_buffer.commit is called, the system immediately synthesizes the current buffer content. The session remains in ServerCommit mode.
In Commit mode, calling input_text_buffer.append does not trigger synthesis. Synthesis is triggered only by explicit input_text_buffer.commit calls.
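
The difference between the two modes can be illustrated with a small local model of the text buffer. This is only a sketch and not the server's implementation: the class name TextBufferModel, the mode labels "server_commit" and "commit", and the rule "flush when the buffer ends with sentence punctuation" are all assumptions made for illustration, since the actual internal rules are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class TextBufferModel:
    """Toy local model of the two modes (hypothetical, not the server's logic)."""
    mode: str  # "server_commit" or "commit" (labels chosen for this sketch)
    pending: str = ""
    synthesized: list = field(default_factory=list)

    def append(self, text: str) -> None:
        """Mirror input_text_buffer.append: add text to the buffer."""
        self.pending += text
        # ASSUMPTION: stand-in for the undocumented internal segmentation rules.
        ends_sentence = self.pending.rstrip().endswith((".", "!", "?"))
        if self.mode == "server_commit" and ends_sentence:
            self._flush()

    def commit(self) -> None:
        """Mirror input_text_buffer.commit: synthesize now; the mode is unchanged."""
        self._flush()

    def _flush(self) -> None:
        # Record the buffered text as one synthesized segment and clear the buffer.
        if self.pending:
            self.synthesized.append(self.pending)
            self.pending = ""


auto = TextBufferModel(mode="server_commit")
auto.append("Hello, ")
auto.append("world.")    # the hypothetical rule fires here
auto.append("And then")
auto.commit()            # forced synthesis; still in server_commit mode

manual = TextBufferModel(mode="commit")
manual.append("Hello, ")
manual.append("world.")  # nothing is synthesized until commit
manual.commit()
```

In this model, append alone never produces a segment in Commit mode, while in ServerCommit mode a segment may be produced either by the automatic rule or by an explicit commit.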


Key flow steps:

Connection: The client initiates a WebSocket connection. The server returns session.created to indicate that the session has been initialized.
Text input: The client adds text to the buffer via multiple input_text_buffer.append calls.
Triggering synthesis:
- ServerCommit: The system automatically determines when to start synthesis. The client can also call input_text_buffer.commit to force synthesis immediately.
- Commit: Synthesis is triggered only by an explicit input_text_buffer.commit call.
Audio generation: The server first sends response.created to indicate that the task has started.
Audio streaming: The server returns Base64-encoded audio chunks via response.audio.delta until audio.done is sent.
Ending the session: The client explicitly calls session.finish to clear the server state, then closes the connection.
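
The steps above can be sketched from the client side without opening a real connection. The following is a minimal sketch under stated assumptions: events are assumed to be JSON text frames with a "type" field, the text payload is assumed to travel in a "text" field and the audio payload in a "delta" field, and the helper names client_event and collect_audio are hypothetical. Check the field names against the event reference before relying on them.

```python
import base64
import json


def client_event(event_type: str, **fields) -> str:
    """Serialize one client event as a JSON text frame (assumed wire format)."""
    return json.dumps({"type": event_type, **fields})


def collect_audio(server_frames) -> bytes:
    """Decode and join audio chunks until the audio done event arrives."""
    audio = bytearray()
    for frame in server_frames:
        event = json.loads(frame)
        kind = event.get("type", "")
        if kind == "response.audio.delta":
            # ASSUMPTION: the Base64 payload is carried in a "delta" field.
            audio += base64.b64decode(event["delta"])
        elif kind.endswith("audio.done"):
            break  # end of the audio stream for this response
        # Other events (session.created, response.created) carry no audio.
    return bytes(audio)


# Frames the client would send, in order, for a Commit-mode exchange.
outgoing = [
    client_event("input_text_buffer.append", text="Hello, "),
    client_event("input_text_buffer.append", text="world."),
    client_event("input_text_buffer.commit"),
    client_event("session.finish"),
]

# Simulated server frames standing in for a live connection.
simulated = [
    json.dumps({"type": "session.created"}),
    json.dumps({"type": "response.created"}),
    json.dumps({"type": "response.audio.delta",
                "delta": base64.b64encode(b"\x00\x01").decode()}),
    json.dumps({"type": "response.audio.delta",
                "delta": base64.b64encode(b"\x02\x03").decode()}),
    json.dumps({"type": "audio.done"}),
]

pcm = collect_audio(simulated)
```

In a real client, each string in outgoing would be sent over the WebSocket and the frames passed to collect_audio would come from the connection's receive loop; the decoded bytes would then be written to a player or file according to the configured audio format.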

This flow supports both manual control and automated calls, and the protocol stays the same across languages and styles.