Alibaba Cloud Model Studio: Interaction flow for real-time speech synthesis

Last Updated: Mar 15, 2026

This topic explains the WebSocket interaction flow between the server and the client for real-time TTS.

User Guide: For a model overview and selection recommendations, see Real-time speech synthesis - Qwen.

qwen-tts uses a persistent WebSocket connection with event-driven responses. Clients send text in real time and receive a continuous audio stream. Two modes are supported:

  • ServerCommit mode: The server automatically segments the text and decides when to start synthesis. Developers do not need to manage text chunking themselves. Suitable for latency-sensitive scenarios that do not require manual control over synthesis timing.

  • Commit mode: The client controls when each text segment is submitted. Suitable for scenarios with complex control logic, such as precise synchronization with concurrent model generation.

Mode descriptions:
  • In ServerCommit mode, after multiple input_text_buffer.append calls, the system determines the synthesis start point based on internal rules.

  • In ServerCommit mode, if input_text_buffer.commit is called, the system immediately synthesizes buffer content. The session remains in ServerCommit mode.

  • In Commit mode, calling input_text_buffer.append does not trigger synthesis. Synthesis is triggered only by explicit input_text_buffer.commit calls.
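
The three rules above can be illustrated with a small local model of the text buffer. This is a sketch for reasoning about the semantics, not the service's implementation: the class name `TextBufferModel`, the mode labels `"server_commit"` and `"commit"`, and in particular the sentence-punctuation rule standing in for the server's undocumented "internal rules" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TextBufferModel:
    """Toy local model of the text buffer; not the service's implementation."""

    mode: str  # "server_commit" or "commit" (hypothetical labels)
    pending: list = field(default_factory=list)
    synthesized: list = field(default_factory=list)

    def append(self, text: str) -> None:
        # Mirrors input_text_buffer.append: the text is buffered first.
        self.pending.append(text)
        # ASSUMPTION: sentence-ending punctuation stands in for the
        # server's internal segmentation rules, which are not documented.
        if self.mode == "server_commit" and text.rstrip().endswith((".", "!", "?")):
            self._flush()

    def commit(self) -> None:
        # Mirrors input_text_buffer.commit: synthesize the buffer now.
        # The mode is left unchanged.
        self._flush()

    def _flush(self) -> None:
        if self.pending:
            self.synthesized.append("".join(self.pending))
            self.pending.clear()


auto = TextBufferModel(mode="server_commit")
auto.append("Hello, ")
auto.append("world.")     # the stand-in rule triggers synthesis here
auto.append("More text")  # stays buffered until a rule or commit fires
auto.commit()             # forced synthesis; session stays in the same mode

manual = TextBufferModel(mode="commit")
manual.append("Hello, ")
manual.append("world.")   # nothing is synthesized yet
manual.commit()           # synthesis happens only here
```

In the toy model, the only difference between the modes is whether `append` may trigger synthesis on its own; `commit` behaves identically in both.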

Key flow steps:

  1. Connection: The client initiates a WebSocket connection. The server returns session.created indicating session initialization.

  2. Text input: The client adds text to the buffer via multiple input_text_buffer.append calls.

  3. Triggering synthesis:

    • ServerCommit: The system automatically determines when to start synthesis. The client can also send input_text_buffer.commit to force synthesis immediately.

    • Commit: Synthesis is triggered only by input_text_buffer.commit.

  4. Audio generation: The server first sends response.created to indicate that the task has started.

  5. Audio streaming: The server returns audio chunks via response.audio.delta (base64 encoded) until response.audio.done is sent.

  6. Ending the session: The client explicitly calls session.finish to clear the server state, then closes the connection.
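
The steps above can be sketched from the client's side without a live connection. The example below builds the client events as JSON and feeds a simulated server event stream into a small tracker. Only the event type names come from this topic; the `"type"`, `"text"`, and `"delta"` field names, the `response.audio.done` name for the end-of-audio event, and the helpers `client_event` and `FlowTracker` are assumptions for illustration. A real client would send and receive these messages over the WebSocket.

```python
import base64
import json


def client_event(event_type: str, **fields) -> str:
    """Serialize a client event; the "type" key is an assumed field name."""
    return json.dumps({"type": event_type, **fields})


class FlowTracker:
    """Tracks the documented flow from the client's point of view."""

    def __init__(self) -> None:
        self.state = "connecting"
        self.audio = bytearray()

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event["type"]
        if kind == "session.created":
            self.state = "ready"         # step 1: session initialized
        elif kind == "response.created":
            self.state = "synthesizing"  # step 4: synthesis task started
        elif kind == "response.audio.delta":
            # Step 5: chunks are base64 encoded; "delta" is an assumed key.
            self.audio.extend(base64.b64decode(event["delta"]))
        elif kind == "response.audio.done":
            self.state = "audio_done"    # step 5: no more audio chunks


# Steps 2, 3, and 6: events the client would send, in order.
outgoing = [
    client_event("input_text_buffer.append", text="Hello, "),
    client_event("input_text_buffer.append", text="world."),
    client_event("input_text_buffer.commit"),
    client_event("session.finish"),
]


def fake_delta(chunk: bytes) -> str:
    """Build a simulated audio chunk event for this offline example."""
    payload = base64.b64encode(chunk).decode("ascii")
    return json.dumps({"type": "response.audio.delta", "delta": payload})


# Simulated server events; a real client would read these from the socket.
incoming = [
    json.dumps({"type": "session.created"}),
    json.dumps({"type": "response.created"}),
    fake_delta(b"\x00\x01"),
    fake_delta(b"\x02\x03"),
    json.dumps({"type": "response.audio.done"}),
]

tracker = FlowTracker()
for raw in incoming:
    tracker.handle(raw)
```

Dispatching on the event type keeps the client logic independent of the mode: the same handler works whether synthesis was started by the server or by an explicit commit.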

This flow supports both manual control and automatic triggering, and the protocol stays the same across languages and styles.