This topic describes the interaction flow between the real-time multimodal server and the client. Two turn-detection modes determine when a user finishes speaking:
VAD mode (default) -- The server runs Voice Activity Detection (VAD) on incoming audio and responds when it detects the end of speech. Use this mode for continuous audio streaming, such as hands-free voice assistants.
Manual mode -- The client explicitly commits audio and requests a response. Use this mode for push-to-talk interfaces, such as voice messaging in chat applications.
Select either mode with the session.turn_detection parameter, which is described in the Client events reference.
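As a minimal sketch of the configuration step, the Python helper below builds a session payload for either mode. Only the session.turn_detection values come from this topic. The message envelope (a top-level "type" field) and the event name "session.update" used in the usage line are assumptions made for illustration, so take the actual event name and schema from the Client events reference.

```python
import json


def turn_detection_value(mode):
    """Map a mode name to the session.turn_detection value described in this topic."""
    if mode == "vad":
        return "server_vad"  # server-side Voice Activity Detection
    if mode == "manual":
        return None  # serialized as JSON null
    raise ValueError(f"unknown mode: {mode!r}")


def build_session_event(event_type, mode):
    """Serialize a session configuration event as JSON text.

    ASSUMPTION: the envelope {"type": ..., "session": {...}} and the
    event_type passed by the caller are hypothetical, not taken from this topic.
    """
    return json.dumps({
        "type": event_type,
        "session": {"turn_detection": turn_detection_value(mode)},
    })


# Usage: "session.update" is a placeholder event name (assumption).
manual_config = build_session_event("session.update", "manual")
```

Keeping the mode-to-value mapping in one function means the rest of a client can refer to "vad" or "manual" without repeating the literal values.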
VAD mode
Set session.turn_detection to "server_vad" to enable VAD mode. The server detects speech boundaries on incoming audio and manages the interaction flow.
Event sequence
1. The server sends input_audio_buffer.speech_started when it detects the start of speech.
2. The client can append audio to the buffer at any time by sending input_audio_buffer.append.
3. The server sends input_audio_buffer.speech_stopped when it detects the end of speech.
4. The server commits the buffered audio and sends input_audio_buffer.committed.
5. The server creates a user message from the committed audio and sends conversation.item.created.
Note: If session.turn_detection is not explicitly set, the server defaults to "server_vad".
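The VAD-mode event order can be expressed as a small check that a client might run while debugging. The sketch below assumes the client has already collected the type names of the server events for a single user turn. The function name and the choice to ignore other event types are illustrative, not part of the API.

```python
# Server events for one user turn in VAD mode, in the order this topic lists them.
VAD_TURN_EVENTS = (
    "input_audio_buffer.speech_started",
    "input_audio_buffer.speech_stopped",
    "input_audio_buffer.committed",
    "conversation.item.created",
)


def follows_vad_sequence(server_event_types):
    """Return True if one turn's server events arrive in the documented order.

    Event types other than the four listed above are ignored.
    """
    relevant = [t for t in server_event_types if t in VAD_TURN_EVENTS]
    return relevant == list(VAD_TURN_EVENTS)


# Usage: event type names as received for a single turn.
received = [
    "input_audio_buffer.speech_started",
    "input_audio_buffer.speech_stopped",
    "input_audio_buffer.committed",
    "conversation.item.created",
]
in_order = follows_vad_sequence(received)
```

The check covers one turn only. A longer session would need the event stream split into turns first.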
Manual mode
Set session.turn_detection to null to enable manual mode. The client controls when to submit audio and request a response. The server does not perform automatic speech detection.
Event sequence
1. The client appends audio to the buffer by sending input_audio_buffer.append.
2. The client commits the buffer by sending input_audio_buffer.commit, creating a user message in the conversation.
3. The server confirms the commit by sending input_audio_buffer.committed.
4. The server sends conversation.item.created with the new user message.
5. The client sends response.create to request a response.
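The client side of a manual-mode turn can be sketched as a function that produces the outgoing events in sending order. The event names come from the sequence above. The payload shape is an assumption: this topic does not specify the fields, so the "type" envelope and the base64-encoded "audio" field below are hypothetical.

```python
import base64
import json


def manual_turn_messages(audio_chunks):
    """Return the client events for one push-to-talk turn, in sending order.

    ASSUMPTION: the "type" envelope and the base64 "audio" field are
    hypothetical; this topic names the events but not their payloads.
    """
    messages = [
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }
        for chunk in audio_chunks
    ]
    # Commit the buffered audio, then ask the server for a response.
    messages.append({"type": "input_audio_buffer.commit"})
    messages.append({"type": "response.create"})
    return [json.dumps(m) for m in messages]


# Usage: two raw audio chunks captured while the talk button was held.
outgoing = manual_turn_messages([b"\x00\x01", b"\x02\x03"])
```

This sketch only orders the client events. A client that mirrors the sequence above strictly would wait for input_audio_buffer.committed and conversation.item.created from the server before sending response.create.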
Mode comparison
| Feature | VAD mode | Manual mode |
| --- | --- | --- |
| Speech detection | Server-side (VAD) | None (client-controlled) |
| Audio submission | Continuous streaming | Explicit commit (input_audio_buffer.commit) |
| Response trigger | Automatic, when the server detects the end of speech | Explicit (response.create) |
| Typical use case | Hands-free voice assistants | Push-to-talk chat applications |