Understand the real-time multimodal interaction flow - Alibaba Cloud Model Studio

This topic describes the interaction flow between the real-time multimodal server and the client. Two turn-detection modes determine when a user finishes speaking:

VAD mode (default) -- The server runs Voice Activity Detection (VAD) on incoming audio and responds once it detects that the user has stopped speaking. Use this mode for continuous audio streaming, such as hands-free voice assistants.
Manual mode -- The client explicitly commits audio and requests a response. Use this mode for push-to-talk interfaces, such as voice messaging in chat applications.

Select either mode through the session.turn_detection parameter, which the client sets in a session configuration event (see the Client events topic).
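As a minimal sketch of how a client might select a mode, the following Python builds the JSON payload for each setting. The event name "session.update" and the helper build_session_update are assumptions for illustration; only the session.turn_detection parameter and its two values ("server_vad" and null) come from this topic, so check the Client events topic for the exact event shape.

```python
import json
from typing import Optional


def build_session_update(turn_detection: Optional[str]) -> str:
    """Serialize a session configuration event that selects the turn-detection mode.

    "server_vad" selects VAD mode; None (serialized as JSON null) selects manual mode.
    """
    if turn_detection not in ("server_vad", None):
        raise ValueError("turn_detection must be 'server_vad' or None")
    event = {
        "type": "session.update",  # assumed event name, not confirmed by this topic
        "session": {"turn_detection": turn_detection},
    }
    return json.dumps(event)


vad_event = build_session_update("server_vad")  # VAD mode
manual_event = build_session_update(None)       # manual mode
```

The client would send one of these strings over its open connection before streaming audio.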

VAD mode

Set session.turn_detection to "server_vad" to enable VAD mode. The server detects speech boundaries on incoming audio and manages the interaction flow.

Event sequence

The server sends input_audio_buffer.speech_started when it detects the start of speech.
The client can append audio to the buffer at any time by sending input_audio_buffer.append.
The server sends input_audio_buffer.speech_stopped when it detects the end of speech.
The server commits the buffered audio and sends input_audio_buffer.committed.
The server creates a user message from the committed audio and sends conversation.item.created.

Note: If session.turn_detection is not explicitly set, the server defaults to "server_vad".
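The VAD event sequence above can be tracked on the client with a small state machine. The sketch below is illustrative only: the class name VadTurnTracker, the state labels, and the "item" payload field are assumptions, while the event type names are the ones listed in this section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VadTurnTracker:
    """Tracks where the client is within a VAD-mode turn, based on server events."""

    state: str = "listening"
    items: List[dict] = field(default_factory=list)

    def handle(self, event: dict) -> str:
        kind = event.get("type")
        if kind == "input_audio_buffer.speech_started":
            self.state = "user_speaking"      # server detected the start of speech
        elif kind == "input_audio_buffer.speech_stopped":
            self.state = "speech_ended"       # server detected the end of speech
        elif kind == "input_audio_buffer.committed":
            self.state = "committed"          # server committed the buffered audio
        elif kind == "conversation.item.created":
            # "item" is an assumed field name for the new user message.
            self.items.append(event.get("item", {}))
            self.state = "listening"          # ready for the next turn
        return self.state                     # unknown events leave the state unchanged


tracker = VadTurnTracker()
states = [
    tracker.handle({"type": t})
    for t in (
        "input_audio_buffer.speech_started",
        "input_audio_buffer.speech_stopped",
        "input_audio_buffer.committed",
        "conversation.item.created",
    )
]
```

A UI could use the state to show a "listening" or "processing" indicator while the client keeps sending input_audio_buffer.append in the background.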

Manual mode

Set session.turn_detection to null to enable manual mode. The client controls when to submit audio and request a response. The server does not perform automatic speech detection.

Event sequence

The client appends audio to the buffer by sending input_audio_buffer.append.
The client commits the buffer by sending input_audio_buffer.commit, creating a user message in the conversation.
The server confirms the commit by sending input_audio_buffer.committed.
The server sends conversation.item.created with the new user message.
The client sends response.create to request a response.
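The client-side half of the manual sequence above might look like the following sketch for a push-to-talk turn. The helper build_manual_turn and the base64-encoded "audio" field are assumptions made for illustration; the three event types are the ones named in this section.

```python
import base64
import json
from typing import Iterable, List


def build_manual_turn(chunks: Iterable[bytes]) -> List[str]:
    """Build the client events for one manual-mode turn, in send order."""
    events = []
    for chunk in chunks:
        events.append({
            "type": "input_audio_buffer.append",
            # Assumed payload shape: raw audio bytes, base64-encoded.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    # Commit the buffered audio, then explicitly ask for a response.
    events.append({"type": "input_audio_buffer.commit"})
    events.append({"type": "response.create"})
    return [json.dumps(e) for e in events]


messages = build_manual_turn([b"\x00\x01", b"\x02\x03"])
```

A stricter client could wait for input_audio_buffer.committed and conversation.item.created from the server before sending response.create, which mirrors the listed order exactly.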

Mode comparison

| Feature | VAD mode | Manual mode |
| --- | --- | --- |
| `session.turn_detection` value | `"server_vad"` | `null` |
| Speech detection | Server-side (VAD) | None (client-controlled) |
| Audio submission | Continuous streaming | Explicit commit |
| Response trigger | Server responds after it detects the end of speech | Explicit `response.create` from client |
| Typical use case | Hands-free voice assistants | Push-to-talk chat applications |
