This topic explains the WebSocket interaction flow between the server and the client for real-time TTS.
User Guide: For a model overview and selection recommendations, see Real-time speech synthesis - Qwen.
qwen-tts uses a WebSocket persistent connection with event-driven responses. Clients input text in real time and receive audio streams continuously. Two modes are supported:
- ServerCommit mode: The server automatically segments the text and decides when to start synthesis, so developers do not need to manage chunking state. Suitable for latency-sensitive scenarios that do not require manual control over timing.
- Commit mode: The client controls when each text segment is submitted. Suitable for complex control logic, such as precise synchronization across concurrent model generation.
Mode descriptions:
- In ServerCommit mode, after multiple `input_text_buffer.append` calls, the system determines the synthesis start point based on internal rules.
- In ServerCommit mode, if `input_text_buffer.commit` is called, the system immediately synthesizes the buffer content. The session remains in ServerCommit mode.
- In Commit mode, calling `input_text_buffer.append` does not trigger synthesis. Synthesis is triggered only by explicit `input_text_buffer.commit` calls.
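As a rough illustration of the Commit-mode behavior described above, the following sketch builds the sequence of client messages for one synthesis unit: several appends followed by a single explicit commit. The event type names come from this topic; the JSON layout, the `text` field name, and the helper `build_commit_mode_events` are assumptions made for illustration and are not confirmed by this document.

```python
import json


def build_commit_mode_events(segments):
    """Return JSON messages for one Commit-mode synthesis unit (hypothetical helper)."""
    events = []
    for segment in segments:
        # Event type taken from this topic; the "text" field name is an assumption.
        events.append({"type": "input_text_buffer.append", "text": segment})
    # In Commit mode, nothing is synthesized until this explicit commit is sent.
    events.append({"type": "input_text_buffer.commit"})
    return [json.dumps(event, ensure_ascii=False) for event in events]


if __name__ == "__main__":
    for message in build_commit_mode_events(["Hello, ", "world."]):
        print(message)
```

In ServerCommit mode the same append messages would be sent, but the final commit becomes optional because the server chooses the synthesis start point itself.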
Key flow steps:
- Connection: The client initiates a WebSocket connection. The server returns `session.created`, indicating that the session has been initialized.
- Text input: The client adds text to the buffer via multiple `input_text_buffer.append` calls.
- Triggering synthesis:
  - ServerCommit: The system automatically determines when to start synthesis. The client can call `commit` to force synthesis immediately.
  - Commit: Synthesis is triggered only by `commit` operations.
- Audio generation: The server first sends `response.created`, indicating that the task has started.
- Audio streaming: The server returns audio chunks via `response.audio.delta` (base64 encoded) until `audio.done` is sent.
- Ending the session: The client explicitly calls `session.finish` to clear the server state, then closes the connection.
This flow supports both manual control and automated calls, and the protocol stays the same across languages and styles.
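The server side of the flow can be sketched as a small event handler that runs without a network connection. It takes already-parsed server events, decodes each base64 audio chunk, and stops at the completion event. The event type names follow this topic as written; the `type` and `delta` field names and the function `collect_audio` are assumptions for illustration, so check them against the API reference before relying on them.

```python
import base64


def collect_audio(server_events):
    """Concatenate decoded audio chunks from parsed server events (hypothetical helper)."""
    audio = bytearray()
    for event in server_events:
        kind = event.get("type")
        if kind == "response.audio.delta":
            # Audio chunks arrive base64 encoded; the "delta" field name is an assumption.
            audio.extend(base64.b64decode(event["delta"]))
        elif kind == "audio.done":
            # Completion event as named in this topic: no more chunks for this response.
            break
        # session.created and response.created carry no audio and are skipped here.
    return bytes(audio)


if __name__ == "__main__":
    fake_events = [
        {"type": "session.created"},
        {"type": "response.created"},
        {"type": "response.audio.delta", "delta": base64.b64encode(b"\x00\x01").decode()},
        {"type": "audio.done"},
    ]
    print(len(collect_audio(fake_events)), "bytes of audio collected")
```

In a real client, the events would be read from the WebSocket as JSON messages, and the client would send `session.finish` before closing the connection.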