Intelligent Media Services:AIAgentConfig

Last Updated:Mar 20, 2026

Parameter

Type

Description

Example

object

Specifies the configuration for the AI Agent.

Greeting

string

The greeting the AI Agent delivers at the start of a session. Changes to this value take effect in the next session. By default, no greeting is used.

你好

WakeUpQuery

string

A user-defined query that the AI Agent responds to immediately when the session starts.

今天天气怎么样?

MaxIdleTime

integer

The maximum idle time in seconds. If the session remains idle for this period, the agent automatically ends the session. Default: 600.

600

UserOnlineTimeout

integer

The time in seconds the agent waits for a user to join. If a user does not join within this period, the agent terminates the session. Default: 60.

60

UserOfflineTimeout

integer

The time in seconds the AI Agent waits after the user leaves before it terminates the session. Default: 5.

5

EnablePushToTalk

boolean

Specifies whether to enable push-to-talk mode. Default: false.

false

GracefulShutdown

boolean

Specifies whether to enable graceful shutdown. Default: false.

When enabled, if the session is terminated, the AI Agent completes its current utterance before disconnecting. The agent speaks for a maximum of 10 seconds.

false

Volume

integer

The speaking volume of the AI Agent.

  • If you do not set this parameter, the agent uses an adaptive volume mode by default.

  • If you set this parameter, the valid range is 0 to 400. The final output volume is calculated as: (Workflow Output Volume * Volume) / 100. For example:

  1. If Volume is 0, the output is muted.

  2. If Volume is 100, the output volume is unchanged.

  3. If Volume is 200, the output volume is doubled.

100
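The volume formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `final_output_volume` and the sample workflow volume of 80 are illustrative assumptions, not part of the API.

```python
def final_output_volume(workflow_output_volume: float, volume: int) -> float:
    """Apply the documented formula: (Workflow Output Volume * Volume) / 100."""
    if not 0 <= volume <= 400:
        raise ValueError("Volume must be between 0 and 400")
    return workflow_output_volume * volume / 100


# 80 is an arbitrary workflow output volume chosen for illustration.
print(final_output_volume(80, 0))    # muted
print(final_output_volume(80, 100))  # unchanged
print(final_output_volume(80, 200))  # doubled
```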

WorkflowOverrideParams

string

Specifies parameters to override the workflow configuration. By default, this is not set.

{}

AvatarUrl

string

The URL for the AI Agent's profile image in audio-only calls. By default, no image is specified.

http://example.com/a.jpg

AvatarUrlType

string

The type of the profile image URL. By default, this is not set.

USER

EnableIntelligentSegment

boolean

Specifies whether to enable intelligent sentence segmentation. If enabled, the system intelligently merges short, consecutive user utterances into a single sentence. Default: true.

true

AsrConfig

object

Specifies the Automatic Speech Recognition (ASR) configuration.

AsrLanguageId

string

The language ID for ASR. Valid values:

  • zh_mandarin: Chinese

  • en: English

  • zh_en: Chinese and English

  • es: Spanish

  • jp: Japanese

zh_mandarin

AsrMaxSilence

integer

The silence detection threshold for sentence segmentation. A silence period longer than this duration triggers a sentence break. Unit: milliseconds. Valid range: 200 to 1200. Default: 400.

400

AsrHotWords

array

A list of hotwords to improve ASR accuracy. You can specify up to 128 hotwords.

string

A hotword. The string must be between 1 and 10 characters in length.

检查

VadLevel

integer

Controls the sensitivity of the voice activity detection (VAD) for interruptions. A higher value makes the agent harder to interrupt. Valid range: 0 to 11. Default: 11.

  • 0: Disables VAD.

  • 1-10: Adjusts the interruption sensitivity.

  • 11: An enhanced mode with improved noise resistance and less impact on audio quality.

11

CustomParams

string

Specifies pass-through parameters for custom ASR integrations.

mode=fast&sample=16000&format=wav

VadDuration

integer

The minimum duration of voice activity, in milliseconds, required to trigger an interruption. This helps control interruption sensitivity. Valid range: 200 to 2000, or 0 to disable this feature. A typical setting is between 200 and 500, which corresponds to 1 to 4 words. By default, this parameter is not set and the feature is inactive.

300
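The ASR ranges listed above can be checked on the client before a request is sent. The following is a minimal sketch, not an official SDK helper: the function `build_asr_config` is hypothetical, the assumption that these fields nest directly under AsrConfig is inferred from the order of this page, and counting hotword length with `len()` (Unicode code points) is also an assumption.

```python
from __future__ import annotations


def build_asr_config(
    language_id: str = "zh_mandarin",
    max_silence_ms: int = 400,
    hotwords: tuple[str, ...] = (),
    vad_level: int = 11,
    vad_duration_ms: int | None = None,
) -> dict:
    """Build an AsrConfig dict, enforcing the documented ranges."""
    if not 200 <= max_silence_ms <= 1200:
        raise ValueError("AsrMaxSilence must be between 200 and 1200 ms")
    if len(hotwords) > 128:
        raise ValueError("AsrHotWords accepts up to 128 hotwords")
    if any(not 1 <= len(word) <= 10 for word in hotwords):
        raise ValueError("Each hotword must be 1 to 10 characters long")
    if not 0 <= vad_level <= 11:
        raise ValueError("VadLevel must be between 0 and 11")

    config = {
        "AsrLanguageId": language_id,
        "AsrMaxSilence": max_silence_ms,
        "AsrHotWords": list(hotwords),
        "VadLevel": vad_level,
    }
    # VadDuration is omitted unless set, which leaves the feature inactive.
    if vad_duration_ms is not None:
        if vad_duration_ms != 0 and not 200 <= vad_duration_ms <= 2000:
            raise ValueError("VadDuration must be 0 or between 200 and 2000 ms")
        config["VadDuration"] = vad_duration_ms
    return config


print(build_asr_config(hotwords=("检查",), vad_duration_ms=300))
```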

TtsConfig

object

Specifies the Text-to-Speech (TTS) configuration.

VoiceId

string

The ID of the voice to use for synthesis. Changes take effect on the next utterance. If not specified, the agent uses the default voice from its template. This parameter only applies to preset TTS voices. Maximum length: 64 characters. For available values, see Voice Demos.

longcheng_v2

VoiceIdList

array

A list of available voices.

string

A voice ID.

zhixiaoxia

PronunciationRules

array

A list of pronunciation rules for TTS, applied sequentially. You can specify up to 20 rules.

object

A TTS pronunciation rule.

Word

string

The word to replace. It must consist of Chinese characters, be 10 characters or fewer, and contain no spaces.

一一零

Pronunciation

string

The target pronunciation for the word. It must consist of Chinese characters, be 10 characters or fewer, and contain no spaces.

幺幺零

Type

string

The type of pronunciation rule. Valid value:

  • replacement: Replaces the Word with the specified Pronunciation.

replacement
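The constraints on Word and Pronunciation can be expressed as a small validator. This sketch is illustrative only: the helper names are hypothetical, and it assumes that "Chinese characters" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) and that an empty string is not allowed. The service may apply a different definition.

```python
import re

# Assumption: "Chinese characters" = CJK Unified Ideographs, 1 to 10 of them.
# fullmatch also rejects spaces, digits, and Latin letters.
_CHINESE_WORD = re.compile(r"[\u4e00-\u9fff]{1,10}")


def make_rule(word: str, pronunciation: str) -> dict:
    """Build one PronunciationRules entry of type 'replacement'."""
    for label, value in (("Word", word), ("Pronunciation", pronunciation)):
        if not _CHINESE_WORD.fullmatch(value):
            raise ValueError(f"{label} must be 1 to 10 Chinese characters without spaces")
    return {"Word": word, "Pronunciation": pronunciation, "Type": "replacement"}


def validate_rules(rules: list) -> list:
    """Enforce the documented limit of 20 rules."""
    if len(rules) > 20:
        raise ValueError("PronunciationRules accepts up to 20 rules")
    return rules


rules = validate_rules([make_rule("一一零", "幺幺零")])
print(rules)
```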

ModelId

string

The model ID. Currently, only MiniMax models are supported. Valid values: speech-01-turbo and speech-02-turbo.

speech-01-turbo

LanguageId

string

The language ID. Currently, only MiniMax models support this parameter. By default, this parameter is empty. Setting it improves performance for the specified language or dialect. If you are unsure of the language, set the value to auto to enable automatic detection. Supported values:

  • Chinese

  • Chinese,Yue: Cantonese

  • English

  • Arabic

  • Russian

  • Spanish

  • French

  • Portuguese

  • German

  • Turkish

  • Dutch

  • Ukrainian

  • Vietnamese

  • Indonesian

  • Japanese

  • Italian

  • Korean

  • Thai

  • Polish

  • Romanian

  • Greek

  • Czech

  • Finnish

  • Hindi

  • auto: Enables automatic language detection.

Chinese

Emotion

string

Specifies the emotion for the synthesized speech. Currently, only MiniMax models support this parameter. Valid values:

  • happy

  • sad

  • angry

  • fearful

  • disgusted

  • surprised

  • calm

happy

SpeechRate

number

The speech rate. Supported on all platforms.

1.0

LlmConfig

object

Specifies the Large Language Model (LLM) configuration.

LlmHistory

array

The LLM/MLLM conversation history context.

object

A single turn in the conversation.

Role

string

The role of the participant in the conversation. Valid values:

  • user

  • assistant

  • system

  • function

  • plugin

  • tool

user

Content

string

The text content of the message for the specified role.

你好

LlmHistoryLimit

integer

The maximum number of conversational turns to retain in the LLM/MLLM history. Default: 10.

10

LlmSystemPrompt

string

The system prompt for the LLM at the start of the call.

你是一位友好且乐于助人的助手,专注于为用户提供准确的信息和建议。

BailianAppParams

string

Parameters for Alibaba Cloud Model Studio (Bailian) applications, formatted as a JSON string. For parameter format details, see Alibaba Cloud Model Studio (Bailian) application parameters.

"{\"biz_params\":{\"user_defined_params\":{\"your_plugin_id\":{\"article_index\":2}}},\"memory_id\":\"your_memory_id\",\"image_list\":[\"https://your_image_url\"],\"rag_options\":{\"pipeline_ids\":[\"your_id\"],\"file_ids\":[\"文档ID1\",\"文档ID2\"],\"metadata_filter\":{\"name\":\"张三\"},\"structured_filter\":{\"key1\":\"value1\",\"key2\":\"value2\"},\"tags\":[\"标签1\",\"标签2\"]}}"

OpenAIExtraQuery

string

Additional query parameters for an OpenAI-compatible LLM. Parameters must be in key=value format, with multiple parameters separated by &. All values must be strings.

api-version=2024-02-01&api-key=sk-xxx
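Both BailianAppParams and OpenAIExtraQuery are passed as strings, so structured values need to be serialized first. The sketch below shows one way to do this with the Python standard library. The helper `build_extra_query` is hypothetical, the placeholder values come from the examples on this page, and whether the service expects percent-encoded values in OpenAIExtraQuery is an assumption.

```python
import json
from urllib.parse import urlencode

# BailianAppParams: serialize a dict to a JSON string.
app_params = {
    "biz_params": {"user_defined_params": {"your_plugin_id": {"article_index": 2}}},
    "memory_id": "your_memory_id",
}
bailian_app_params = json.dumps(app_params, ensure_ascii=False)


def build_extra_query(params: dict) -> str:
    """Join key=value pairs with '&'; the reference requires string values."""
    for key, value in params.items():
        if not isinstance(value, str):
            raise TypeError(f"Value for {key!r} must be a string")
    # urlencode percent-encodes reserved characters (an assumption about
    # what the service accepts); plain values pass through unchanged.
    return urlencode(params)


open_ai_extra_query = build_extra_query(
    {"api-version": "2024-02-01", "api-key": "sk-xxx"}
)

print(bailian_app_params)
print(open_ai_extra_query)
```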

LlmCompleteReply

boolean

If enabled, the AI Agent sends the complete LLM result to the client after the full response is generated. This setting does not affect the streaming of subtitles.

true

FunctionMap

array

A list of function mappings used to associate AI Agent capabilities with LLM functions. This is currently only supported for function calling with user-defined, OpenAI-compatible LLMs.

object

A single mapping rule.

Function

string

The name of the built-in function provided by the AI Agent system. Currently, only hangup is supported.

hangup

MatchFunction

string

The user-defined LLM function name that corresponds to the agent's built-in function. For details on the custom LLM protocol, see LLM standard interface.

hangup

OutputMinLength

integer

The minimum length in characters for a text output chunk. Text shorter than this value is buffered. Valid range: 0 to 100. A value of 0 or an empty value (default) disables this limit.

5

OutputMaxDelay

integer

The maximum delay in milliseconds before buffered text is forcibly sent. Valid range: 1000 to 10000. A value of 0 or an empty value (default) disables this limit.

2000
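The interaction between OutputMinLength and OutputMaxDelay can be pictured as a small text buffer. The following is a conceptual sketch only, not the service's implementation: the class `OutputBuffer` is hypothetical, timestamps are passed in explicitly to keep the example deterministic, and a real system would presumably flush on a timer rather than only when new text arrives.

```python
from __future__ import annotations


class OutputBuffer:
    """Hold text until it is long enough or has waited too long."""

    def __init__(self, min_length: int = 0, max_delay_ms: int = 0):
        self.min_length = min_length      # 0 disables the length limit
        self.max_delay_ms = max_delay_ms  # 0 disables the delay limit
        self._text = ""
        self._since_ms = 0                # arrival time of the oldest buffered text

    def push(self, text: str, now_ms: int) -> str | None:
        """Add text; return a chunk to send, or None to keep buffering."""
        if not self._text:
            self._since_ms = now_ms
        self._text += text

        long_enough = len(self._text) >= self.min_length
        waited_too_long = (
            self.max_delay_ms > 0
            and now_ms - self._since_ms >= self.max_delay_ms
        )
        if long_enough or waited_too_long:
            chunk, self._text = self._text, ""
            return chunk
        return None


buf = OutputBuffer(min_length=5, max_delay_ms=2000)
print(buf.push("Hi", 0))     # too short, buffered
print(buf.push("!", 500))    # still too short
print(buf.push(" ok", 900))  # length reached, chunk is released
```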

HistorySyncWithTTS

boolean

Specifies whether the LLM message history should be synchronized with the content played by TTS. Default: false. When enabled, the saved history reflects the exact content played by the TTS, including interruptions.

Note

When a user interrupts the AI Agent, the system inserts an <ims_agent_interrupted> tag at the interruption point in the assistant's message history. This updated message is then used in the context for the next LLM request. For example:

[
  {"role": "user", "content": "Tell me a story."},
  {"role": "assistant", "content": "Sure, I can tell you a story from Romance of the Three Kingdoms. Would you<ims_agent_interrupted> like to hear it?"},
  {"role": "user", "content": "How about another one?"}
]

false
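If your own LLM service receives this history, it may be useful to locate the interruption point. The sketch below splits an assistant message at the tag. The function name is hypothetical, and reading the text after the tag as "generated but not played" is an interpretation of the example above, not a documented guarantee.

```python
from __future__ import annotations

INTERRUPT_TAG = "<ims_agent_interrupted>"


def split_at_interruption(content: str) -> tuple[str, str, bool]:
    """Return (text before tag, text after tag, whether the tag was present)."""
    before, tag, after = content.partition(INTERRUPT_TAG)
    return before, after, bool(tag)


message = (
    "Sure, I can tell you a story from Romance of the Three Kingdoms. "
    "Would you<ims_agent_interrupted> like to hear it?"
)
played, remainder, interrupted = split_at_interruption(message)
print(interrupted)
print(repr(played))
print(repr(remainder))
```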

AvatarConfig

object

The avatar configuration. This takes effect only if the workflow includes an avatar node.

AvatarId

string

The model ID of the avatar.

5257

InterruptConfig

object

Specifies the speech interruption strategy configuration.

EnableVoiceInterrupt

boolean

Specifies whether to allow voice interruption. Default: true.

true

InterruptWords

array

A list of specific words or phrases that trigger a conversation interruption.

string

A specific word or phrase that triggers a conversation interruption.

打断一下

NoInterruptMode

string

The ASR processing policy when interruptions are disabled.

  • cache: Caches the ASR text. The system processes the cached text in the next turn.

  • discard: Discards the ASR text immediately.

By default, ASR text is cached.

cache

KeepInterruptWordsForLLM

boolean

Specifies whether to include the interruption keywords in the text sent to the LLM. Default: false.

VoiceprintConfig

object

Specifies the voiceprint recognition configuration.

UseVoiceprint

boolean

Specifies whether to enable voiceprint recognition. Default: false. If you enable this feature, you must provide a valid voiceprint ID.

false

VoiceprintId

string

The unique ID for voiceprint recognition. By default, this is not set. You must register the provided voiceprint ID. For more information, see Register a voiceprint.

zhixiaoxia

RegistrationMode

string

TurnDetectionConfig

object

Specifies the conversational turn detection configuration.

TurnEndWords

array

A list of keywords that indicate the end of a user's turn.

string

A keyword that indicates the end of a user's turn.

我说完了

Mode

string

The mode for turn detection.

  • Normal (Default): Does not use AI to determine semantic completeness.

  • Semantic: Uses AI to determine if the user has finished speaking based on semantic context.

Semantic

SemanticWaitDuration

integer

The pause detection time in Semantic mode. Unit: milliseconds. Default: -1.

  • -1: The AI automatically determines an appropriate wait time.

  • 0-10000: A custom wait time. We recommend a value between 0 and 1500 ms.

Note

This parameter is only effective in Semantic mode.

-1

Eagerness

string

Controls how quickly the AI responds after detecting a pause. This parameter is only effective in Semantic mode.

  • Low: Patient waiting. The AI waits up to 6 seconds, reducing the risk of interruption.

  • Medium: Balanced waiting. The AI waits up to 4 seconds. Suitable for most scenarios.

  • High: Quick response. The AI waits up to 2 seconds. This increases responsiveness but may also increase the risk of accidental cut-offs.

By default, this parameter is not set.

High
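The Semantic-mode settings above can be assembled and range-checked as follows. This is a minimal sketch: the helper `semantic_turn_detection` is hypothetical, and the lookup table simply restates the maximum wait times listed for each Eagerness value.

```python
# Maximum wait per Eagerness value, in seconds, as listed in this reference.
MAX_WAIT_SECONDS = {"Low": 6, "Medium": 4, "High": 2}


def semantic_turn_detection(eagerness: str, semantic_wait_ms: int = -1) -> dict:
    """Build a TurnDetectionConfig dict for Semantic mode."""
    if eagerness not in MAX_WAIT_SECONDS:
        raise ValueError("Eagerness must be Low, Medium, or High")
    # -1 lets the AI choose the wait time; otherwise 0 to 10000 ms is allowed.
    if semantic_wait_ms != -1 and not 0 <= semantic_wait_ms <= 10000:
        raise ValueError("SemanticWaitDuration must be -1 or between 0 and 10000")
    return {
        "Mode": "Semantic",
        "SemanticWaitDuration": semantic_wait_ms,
        "Eagerness": eagerness,
    }


config = semantic_turn_detection("High")
print(config, "max wait:", MAX_WAIT_SECONDS[config["Eagerness"]], "s")
```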

ExperimentalConfig

string

Parameters for experimental features. Contact support for assistance if you need to use this.

""

VcrConfig

object

Configuration for video content recognition, which sends callbacks to the client about content that is identified in the video stream.

StillFrameMotion

object

Specifies the still frame detection configuration.

Enabled

boolean

Specifies whether to enable still frame detection. Default: false.

false

CallbackDelay

integer

The delay in milliseconds before a still frame detection event is triggered. The system sends a notification only after the frame has been static for this duration. If not set, the value from the console configuration is used. Valid range: 200 to 5000.

3000

InvalidFrameMotion

object

Specifies the parameters for invalid frame detection.

Enabled

boolean

Specifies whether to enable invalid frame detection. Default: false.

false

CallbackDelay

integer

The delay in milliseconds before an invalid frame detection event is triggered. The system sends a notification only after the frame has been invalid for this duration. If not set, the value from the console configuration is used. Valid range: 200 to 5000.

3000

PeopleCount

object

Configuration for the people counting feature.

Enabled

boolean

Specifies whether to enable people counting. Default: false.

false

Equipment

object

Configuration for device identification.

Enabled

boolean

Specifies whether to check for prohibited devices. Default: false.

false

HeadMotion

object

Configuration for head motion detection.

Enabled

boolean

Specifies whether to enable head motion detection. Default: false.

false

LookAway

object

Configuration for gaze deviation detection.

Enabled

boolean

Specifies whether to enable gaze deviation detection. Default: false.

true

AmbientSoundConfig

object

Specifies the ambient sound configuration.

ResourceId

string

The ID of the ambient sound. You can obtain this ID from the advanced configuration section of the agent settings in the console.

f67901c595834************

Volume

integer

The volume of the ambient sound. Valid range: 0 to 100. A value of 0 disables the sound.

50

AutoSpeechConfig

object

Manages the agent's proactive speech events, such as playing prompts during LLM delays or when the user is silent.

UserIdle

object

Configuration for prompts played when the user is idle for an extended period.

WaitTime

integer

The idle time threshold in milliseconds that triggers a prompt. Required. Valid range: 5000 to 600000.

5000

MaxRepeats

integer

The maximum number of times to prompt the user. After this limit is reached, the call is terminated. Required. Valid range: 0 to 10.

5

Messages

array

A list of prompts. You can specify up to 10 prompts, each with a maximum length of 100 characters. The sum of probabilities for all prompts must be 1.0.

object

A prompt and its probability.

Text

string

The text of the prompt. Maximum length: 100 characters.

您还在吗?

Probability

number

The probability of this prompt being selected. Valid range: 0.0 to 1.0.

0.5

LlmPending

object

Configuration for prompts played during LLM response delays.

WaitTime

integer

The LLM response time threshold in milliseconds. If the response time exceeds this value, a prompt is played. Required. Valid range: 500 to 10000. Set this based on the actual performance of your LLM.

3000

Messages

array

A list of prompts. You can specify up to 10 prompts, each with a maximum length of 100 characters. The sum of probabilities for all prompts must be 1.0.

object

A prompt and its probability.

Text

string

The text of the prompt. Maximum length: 100 characters.

稍等一下

Probability

number

The probability of this prompt being selected. Valid range: 0.0 to 1.0.

0.5
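The Messages constraints (up to 10 prompts, 100 characters each, probabilities summing to 1.0) can be validated before the configuration is submitted. The sketch below also illustrates weighted selection with `random.choices`. Both helper names are hypothetical, the second prompt text is a made-up sample, and how the service actually picks a prompt is an assumption.

```python
import math
import random


def validate_messages(messages: list) -> list:
    """Check a Messages list against the documented limits."""
    if not 1 <= len(messages) <= 10:
        raise ValueError("Messages must contain 1 to 10 prompts")
    for message in messages:
        if len(message["Text"]) > 100:
            raise ValueError("Text must be 100 characters or fewer")
        if not 0.0 <= message["Probability"] <= 1.0:
            raise ValueError("Probability must be between 0.0 and 1.0")
    total = sum(message["Probability"] for message in messages)
    # Compare with a tolerance because floats such as 0.1 + 0.2 are inexact.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("Probabilities must sum to 1.0")
    return messages


def pick_message(messages: list, rng=random) -> str:
    """Pick one prompt text, weighted by its Probability."""
    weights = [message["Probability"] for message in messages]
    return rng.choices(messages, weights=weights, k=1)[0]["Text"]


messages = validate_messages([
    {"Text": "稍等一下", "Probability": 0.5},
    {"Text": "Let me think about that.", "Probability": 0.5},
])
print(pick_message(messages))
```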

BackChannelingConfigs

array

Configuration for backchanneling, which plays short, affirming phrases at specific triggers to acknowledge the user's speech.

object

A single backchanneling configuration.

Enabled

boolean

Specifies whether to enable this backchanneling rule. Required.

true

TriggerStage

string

The trigger for the backchanneling phrase. Valid value:

  • pause_detected: Triggers when a brief pause in the user's speech is detected.

pause_detected

Probability

number

The probability of this rule being triggered. Required. Valid range: 0.0 to 1.0.

0.5

Words

array

A list of backchanneling phrases. You can specify up to 10 phrases, each with a maximum length of 20 characters. The sum of probabilities for all phrases must be 1.0.

object

A backchanneling phrase and its probability.

Text

string

The text of the phrase. Required. Maximum length: 20 characters. Multi-language support.

嗯嗯

Probability

number

The probability of this phrase being selected. Required. Valid range: 0.0 to 1.0.

0.3

BackChannelingConfig

array

Important

This parameter is deprecated. Use BackChannelingConfigs instead.

object

A single backchanneling configuration.

Enabled

boolean

Specifies whether to enable this backchanneling rule. Required.

true

TriggerStage

string

The trigger for the backchanneling phrase. Valid value:

  • pause_detected: Triggers when a brief pause in the user's speech is detected.

pause_detected

Probability

number

The probability that the feature is triggered. The valid range is 0.0–1.0. This parameter is required.

0.5

Words

array

A collection of up to 10 backchanneling phrases. Each phrase must be 20 characters or less. The sum of the probabilities must be 1.0.

object

Configuration for a backchanneling phrase.

Text

string

The text of the phrase. The maximum length is 20 characters. Multiple languages are supported. This parameter is required.

嗯嗯

Probability

number

The probability that this phrase is triggered. The value must be between 0.0 and 1.0. This parameter is required.

0.3