Alibaba Cloud Model Studio: Real-time multimodal interaction flow

Last Updated: Mar 28, 2026

This topic describes the interaction flow between the real-time multimodal server and the client. Two turn-detection modes determine when a user finishes speaking:

  • VAD mode (default) -- The server runs Voice Activity Detection (VAD) on incoming audio and responds once it detects the end of speech. Use this mode for continuous audio streaming, such as hands-free voice assistants.

  • Manual mode -- The client explicitly commits audio and requests a response. Use this mode for push-to-talk interfaces, such as voice messaging in chat applications.

Select the mode with the session.turn_detection parameter, which the client sets in a session configuration event (see Client events).
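As a minimal sketch of the configuration step, the following Python function builds a JSON event that sets session.turn_detection for either mode. The event type "session.update" and the helper name build_session_config are assumptions for illustration and are not stated in this topic. The turn_detection value is written exactly as this topic gives it ("server_vad" or null), and the real schema may nest it differently, so check the Client events reference before use.

```python
import json


def build_session_config(mode: str) -> str:
    """Return a JSON session-configuration event for the chosen mode.

    mode is "vad" or "manual" (hypothetical labels used only in this sketch).
    """
    if mode == "vad":
        turn_detection = "server_vad"
    elif mode == "manual":
        turn_detection = None  # serialized as JSON null
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    event = {
        # Assumed event name; this topic only says the parameter is set
        # through a client event.
        "type": "session.update",
        "session": {"turn_detection": turn_detection},
    }
    return json.dumps(event)


if __name__ == "__main__":
    print(build_session_config("vad"))
    print(build_session_config("manual"))
```

The function only produces the payload. Sending it over the connection is left to whatever client library the application uses.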

VAD mode

Set session.turn_detection to "server_vad" to enable VAD mode. The server detects speech boundaries on incoming audio and manages the interaction flow.

[Figure: event sequence in VAD mode (server_vad)]

Event sequence

  • The server sends input_audio_buffer.speech_started when it detects the start of speech.

  • The client can append audio to the buffer at any time by sending input_audio_buffer.append.

  • The server sends input_audio_buffer.speech_stopped when it detects the end of speech.

  • The server commits the buffered audio and sends input_audio_buffer.committed.

  • The server creates a user message from the committed audio and sends conversation.item.created.

  • The server then generates a response. The client does not need to send response.create.

Note: If session.turn_detection is not explicitly set, the server defaults to "server_vad".
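The server-side events in the VAD sequence above can be followed on the client with a small state tracker. The sketch below is one possible approach, not part of the service's SDK: the class VadTurnTracker and its method names are hypothetical, and it assumes each server event has already been decoded to its type string.

```python
from dataclasses import dataclass

# Server events for one user turn, in the order this topic lists them.
VAD_SERVER_EVENTS = [
    "input_audio_buffer.speech_started",
    "input_audio_buffer.speech_stopped",
    "input_audio_buffer.committed",
    "conversation.item.created",
]


@dataclass
class VadTurnTracker:
    """Tracks how far the current user turn has progressed."""

    next_index: int = 0
    completed_turns: int = 0

    def on_server_event(self, event_type: str) -> bool:
        """Return False only when a turn event arrives out of order."""
        if event_type not in VAD_SERVER_EVENTS:
            return True  # unrelated event (for example, response output)
        if event_type != VAD_SERVER_EVENTS[self.next_index]:
            return False
        self.next_index += 1
        if self.next_index == len(VAD_SERVER_EVENTS):
            # The user message exists; the server responds on its own.
            self.completed_turns += 1
            self.next_index = 0
        return True


if __name__ == "__main__":
    tracker = VadTurnTracker()
    for event_type in VAD_SERVER_EVENTS:
        tracker.on_server_event(event_type)
    print(tracker.completed_turns)
```

The client keeps sending input_audio_buffer.append throughout, so the tracker deliberately ignores client-side events and only watches what the server reports.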

Manual mode

Set session.turn_detection to null to enable manual mode. The client controls when to submit audio and request a response. The server does not perform automatic speech detection.

[Figure: event sequence in manual mode]

Event sequence

  1. The client appends audio to the buffer by sending input_audio_buffer.append.

  2. The client commits the buffer by sending input_audio_buffer.commit, creating a user message in the conversation.

  3. The server confirms the commit by sending input_audio_buffer.committed.

  4. The server sends conversation.item.created with the new user message.

  5. The client sends response.create to request a response.
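The client side of the manual sequence above can be sketched as two helpers: one builds the append and commit messages for a turn, the other builds the response.create request to send after the server has confirmed the commit. The function names, the "audio" field, and the Base64 encoding of audio chunks are assumptions for illustration. This topic names only the event types.

```python
import base64
import json


def build_commit_messages(audio_chunks: list[bytes]) -> list[str]:
    """Build the append messages for each chunk, followed by one commit."""
    events = []
    for chunk in audio_chunks:
        events.append({
            "type": "input_audio_buffer.append",
            # Assumed field name and encoding; verify against the
            # Client events reference.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    events.append({"type": "input_audio_buffer.commit"})
    return [json.dumps(event) for event in events]


def build_response_request() -> str:
    """Build the message that asks the server to generate a response."""
    return json.dumps({"type": "response.create"})


if __name__ == "__main__":
    for message in build_commit_messages([b"\x00\x01", b"\x02\x03"]):
        print(message)
    # Send this after input_audio_buffer.committed and
    # conversation.item.created arrive from the server.
    print(build_response_request())
```

Keeping response.create in a separate helper mirrors the sequence, where the request follows the server's confirmation instead of being sent together with the commit.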

Mode comparison

| Feature | VAD mode | Manual mode |
| --- | --- | --- |
| session.turn_detection value | "server_vad" | null |
| Speech detection | Server-side (VAD) | None (client-controlled) |
| Audio submission | Continuous streaming | Explicit commit |
| Response trigger | Server responds after it detects the end of speech | Explicit response.create from client |
| Typical use case | Hands-free voice assistants | Push-to-talk chat applications |
