Intelligent Media Services: AIAgentConfig

Last Updated: Feb 05, 2026

Parameter

Type

Description

Example

object

Parameters for the agent template.

Greeting

string

The greeting message. Changes take effect the next time the agent joins a session. By default, no greeting is set.

你好

WakeUpQuery

string

An instruction from the user before the call starts. The agent responds to this instruction immediately after the call begins.

今天天气怎么样?

MaxIdleTime

integer

The maximum time to wait for user interaction before the agent goes offline. Unit: seconds. Default: 600.

600

UserOnlineTimeout

integer

How long the agent waits for the user to join the session before it shuts down the task. Unit: seconds. Default: 60.

60

UserOfflineTimeout

integer

How long the agent waits after the user leaves the session before it shuts down the task. Unit: seconds. Default: 5.

5

EnablePushToTalk

boolean

Specifies whether to enable push-to-talk mode. Default: false.

false

GracefulShutdown

boolean

Specifies whether to enable graceful shutdown. Default: false.

Graceful shutdown: When the agent is stopped, it finishes speaking the current sentence before stopping. The playback lasts for a maximum of 10 seconds.

false

Volume

integer

The speaking volume of the agent.

  • If you do not set this parameter, the agent uses the recommended adaptive volume mode from Alibaba Cloud.

  • If you set this parameter, the valid range is 0 to 400. The output volume is calculated as: Output Volume = Voice Output Volume in Workflow × Volume / 100. Examples:

  1. If Volume is 0, the output volume is 0.

  2. If Volume is 100, the output volume is the original volume.

  3. If Volume is 200, the output volume is twice the original volume.

100
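The Volume formula above can be sketched as a small helper. This is only an illustration of the arithmetic: the function name `output_volume` and the `workflow_volume` argument are hypothetical, and the service performs this scaling itself.

```python
from typing import Optional


def output_volume(workflow_volume: float, volume: Optional[int]) -> Optional[float]:
    """Return the scaled output volume, or None when adaptive mode applies."""
    if volume is None:
        # Volume unset: the agent uses the adaptive volume mode instead.
        return None
    if not 0 <= volume <= 400:
        raise ValueError("Volume must be in the range 0 to 400")
    # Output Volume = Voice Output Volume in Workflow x Volume / 100
    return workflow_volume * volume / 100


print(output_volume(80, 0))    # 0.0
print(output_volume(80, 100))  # 80.0
print(output_volume(80, 200))  # 160.0
```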

WorkflowOverrideParams

string

The parameters that override the workflow configuration. By default, this is not set.

{}

AvatarUrl

string

The URL of the agent's profile picture for voice calls. By default, this is not set.

http://example.com/a.jpg

AvatarUrlType

string

The type of the agent's profile picture URL. By default, this is not set.

USER

EnableIntelligentSegment

boolean

Specifies whether to enable intelligent sentence segmentation. When enabled, speech fragments separated by pauses in the user's speech are merged into a single sentence. Default: true.

true
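As a rough sketch, the top-level fields described so far could be assembled like this. The values are the documented defaults or example values; that the configuration is passed as a JSON string is an assumption, so check the API operation you call for the exact format.

```python
import json

# Top-level AIAgentConfig fields, using the defaults or examples documented above.
# Volume is left out on purpose so that the adaptive volume mode applies.
agent_config = {
    "Greeting": "你好",
    "MaxIdleTime": 600,        # seconds
    "UserOnlineTimeout": 60,   # seconds
    "UserOfflineTimeout": 5,   # seconds
    "EnablePushToTalk": False,
    "GracefulShutdown": False,
    "EnableIntelligentSegment": True,
}

# Assumption: the calling API accepts the configuration as a JSON string.
payload = json.dumps(agent_config, ensure_ascii=False)
print(payload)
```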

AsrConfig

object

Speech recognition configuration.

AsrLanguageId

string

The language ID for Automatic Speech Recognition (ASR). Valid values:

  • zh_mandarin: Chinese

  • en: English

  • zh_en: Chinese-English mixed

  • es: Spanish

  • jp: Japanese

zh_mandarin

AsrMaxSilence

integer

The threshold for speech segmentation. A silence duration exceeding this threshold is considered a sentence break. The valid range is 200 ms to 1200 ms. Default: 400 ms.

400

AsrHotWords

array

A list of hotwords for ASR. The list can contain up to 128 words.

string

The hotword string. The string must be 1 to 10 characters in length.

检查

VadLevel

integer

The threshold parameter for interruptions. Valid range: [0, 11]. Default: 11.

  • 0: Disables the Voice Activity Detection (VAD) feature.

  • 1–10: A higher value makes it more difficult to interrupt.

  • 11: Significantly different from other values. It provides lower audio degradation during preprocessing and stronger anti-interference capabilities.

11

CustomParams

string

The pass-through parameters for custom ASR integration.

mode=fast&sample=16000&format=wav

VadDuration

integer

The minimum speech duration that VAD requires before it treats speech as an interruption. This controls interruption sensitivity. Unit: milliseconds. Valid range: 200 to 2000. A value of 0 disables this feature. A common range is [200, 500], which corresponds to 1 to 4 words. By default, this parameter is empty and does not take effect.

300
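The AsrConfig limits above can be checked on the client before a request is sent. The following is a minimal sketch: `validate_asr_config` is a hypothetical helper, not part of any SDK, and it assumes that "characters" means Unicode code points.

```python
def validate_asr_config(cfg: dict) -> list:
    """Collect human-readable problems found in an AsrConfig-shaped dict."""
    problems = []
    silence = cfg.get("AsrMaxSilence")
    if silence is not None and not 200 <= silence <= 1200:
        problems.append("AsrMaxSilence must be 200 to 1200 ms")
    hot_words = cfg.get("AsrHotWords", [])
    if len(hot_words) > 128:
        problems.append("AsrHotWords accepts at most 128 words")
    for word in hot_words:
        # Assumption: length is counted in Unicode code points.
        if not 1 <= len(word) <= 10:
            problems.append(f"hotword {word!r} must be 1 to 10 characters")
    level = cfg.get("VadLevel")
    if level is not None and not 0 <= level <= 11:
        problems.append("VadLevel must be 0 to 11")
    duration = cfg.get("VadDuration")
    # 0 disables the feature; otherwise the documented range applies.
    if duration not in (None, 0) and not 200 <= duration <= 2000:
        problems.append("VadDuration must be 0 or 200 to 2000 ms")
    return problems


asr_config = {
    "AsrLanguageId": "zh_mandarin",
    "AsrMaxSilence": 400,
    "AsrHotWords": ["检查"],
    "VadLevel": 11,
    "VadDuration": 300,
}
print(validate_asr_config(asr_config))  # []
```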

TtsConfig

object

Speech synthesis configuration.

VoiceId

string

The voice ID. Changes take effect on the next sentence. If you do not set this parameter, the voice ID configured in the agent template is used. This parameter is valid only for preset Text-to-Speech (TTS) voices. The value can be up to 64 characters long. For more information about valid values, see Examples of intelligent speech effects.

longcheng_v2

VoiceIdList

array

A list of available voices.

string

A voice ID.

zhixiaoxia

PronunciationRules

array

The pronunciation rules for TTS. The array can contain up to 20 rules. The rules are executed in order.

object

A TTS pronunciation rule.

Word

string

The word to be replaced. It must be less than 10 characters long, consist of Chinese characters, and not contain spaces.

一一零

Pronunciation

string

The target pronunciation. It must be less than 10 characters long, consist of Chinese characters, and not contain spaces.

幺幺零

Type

string

The type of the pronunciation rule. Valid value:

  • replacement: Directly replaces the Word with the Pronunciation.

replacement
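To illustrate what "executed in order" means for replacement rules, here is a local sketch. The service applies the rules itself; `apply_pronunciation_rules` is a hypothetical function that only mimics the documented behavior.

```python
def apply_pronunciation_rules(text: str, rules: list) -> str:
    """Apply replacement rules one after another, in list order."""
    if len(rules) > 20:
        raise ValueError("PronunciationRules accepts at most 20 rules")
    for rule in rules:
        if rule["Type"] != "replacement":
            raise ValueError(f"unsupported rule type: {rule['Type']!r}")
        # Later rules see the output of earlier ones, so order matters.
        text = text.replace(rule["Word"], rule["Pronunciation"])
    return text


rules = [{"Word": "一一零", "Pronunciation": "幺幺零", "Type": "replacement"}]
print(apply_pronunciation_rules("请拨打一一零", rules))  # 请拨打幺幺零
```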

ModelId

string

The TTS model ID. Currently supported only for minimax. Valid values: speech-01-turbo and speech-02-turbo.

speech-01-turbo

LanguageId

string

The language of the synthesized speech. Currently supported only for minimax. The default value is empty. This parameter enhances the handling of specified minority languages and dialects, improving speech performance in those scenarios. If the language type is unclear, you can set this to "auto", and the model automatically determines the language. The following values are supported:

Supported languages

  • Chinese: Chinese

  • Chinese,Yue: Cantonese

  • English: English

  • Arabic: Arabic

  • Russian: Russian

  • Spanish: Spanish

  • French: French

  • Portuguese: Portuguese

  • German: German

  • Turkish: Turkish

  • Dutch: Dutch

  • Ukrainian: Ukrainian

  • Vietnamese: Vietnamese

  • Indonesian: Indonesian

  • Japanese: Japanese

  • Italian: Italian

  • Korean: Korean

  • Thai: Thai

  • Polish: Polish

  • Romanian: Romanian

  • Greek: Greek

  • Czech: Czech

  • Finnish: Finnish

  • Hindi: Hindi

  • auto: Automatic detection

Chinese

Emotion

string

The emotion of the synthesized speech. Currently supported only for minimax, which offers the following seven emotions:

  • happy: Happy

  • sad: Sad

  • angry: Angry

  • fearful: Fearful

  • disgusted: Disgusted

  • surprised: Surprised

  • calm: Neutral

happy

SpeechRate

number

The speech rate. Supported on all platforms. For both cosyvoice and minimax, the default is 1.0 and the valid range is 0.5 to 2.0.

1.0

LlmConfig

object

Large language model configuration.

LlmHistory

array

The historical conversation context for the large language model (LLM) or multimodal large language model (MLLM).

object

A single conversation message.

Role

string

The role of the participant in the conversation. Valid values:

  • user: The user

  • assistant: The assistant

  • system: The system

  • function: The function

  • plugin: The plugin

  • tool: The tool

user

Content

string

The text content of the message, that is, what the given role said in the conversation.

你好
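An LlmHistory array is a list of Role/Content objects. A minimal sketch of building one while rejecting roles outside the documented set (`build_llm_history` is a hypothetical helper):

```python
VALID_ROLES = {"user", "assistant", "system", "function", "plugin", "tool"}


def build_llm_history(messages: list) -> list:
    """Turn (role, content) pairs into LlmHistory-shaped entries."""
    history = []
    for role, content in messages:
        if role not in VALID_ROLES:
            raise ValueError(f"unsupported Role: {role!r}")
        history.append({"Role": role, "Content": content})
    return history


history = build_llm_history([("user", "你好"), ("assistant", "你好,有什么可以帮您?")])
print(history[0])  # {'Role': 'user', 'Content': '你好'}
```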

LlmHistoryLimit

integer

The maximum number of conversation rounds to retain in the LLM/MLLM history. Default: 10.

10

LlmSystemPrompt

string

The system prompt for the LLM after the call starts.

你是一位友好且乐于助人的助手,专注于为用户提供准确的信息和建议。

BailianAppParams

string

The parameters for Alibaba Cloud Model Studio Application Center, as a JSON string. For more information about the parameter format, see Alibaba Cloud Model Studio Application Center parameters.

"{\"biz_params\":{\"user_defined_params\":{\"your_plugin_id\":{\"article_index\":2}}},\"memory_id\":\"your_memory_id\",\"image_list\":[\"https://your_image_url\"],\"rag_options\":{\"pipeline_ids\":[\"your_id\"],\"file_ids\":[\"文档ID1\",\"文档ID2\"],\"metadata_filter\":{\"name\":\"张三\"},\"structured_filter\":{\"key1\":\"value1\",\"key2\":\"value2\"},\"tags\":[\"标签1\",\"标签2\"]}}"

OpenAIExtraQuery

string

Extra query parameters for an OpenAI protocol-based LLM. Parameters must be in key=value format, with multiple parameters connected by ampersands (&). All values must be strings.

api-version=2024-02-01&api-key=sk-xxx
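Both BailianAppParams and OpenAIExtraQuery are strings that encode structured data. The sketch below shows one way to produce them with the Python standard library. The helper name `build_extra_query` is hypothetical, and whether the service expects percent-encoding of special characters (which `urlencode` applies) is an assumption.

```python
import json
from urllib.parse import urlencode

# BailianAppParams: serialize a dict to a JSON string (placeholder values only).
bailian_app_params = json.dumps(
    {"memory_id": "your_memory_id", "rag_options": {"pipeline_ids": ["your_id"]}},
    ensure_ascii=False,
)


def build_extra_query(params: dict) -> str:
    """Join key=value pairs with '&'; every value must already be a string."""
    for key, value in params.items():
        if not isinstance(value, str):
            raise TypeError(f"value for {key!r} must be a string")
    return urlencode(params)


open_ai_extra_query = build_extra_query(
    {"api-version": "2024-02-01", "api-key": "sk-xxx"}
)
print(open_ai_extra_query)  # api-version=2024-02-01&api-key=sk-xxx
```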

LlmCompleteReply

boolean

If enabled, the agent sends the complete LLM result to the client after the LLM provides a full response. This switch does not affect the streaming generation of captions.

true

FunctionMap

array

A list of function mappings used to associate agent capabilities with LLM functions. Currently, this only supports function invocation for user-defined, OpenAI protocol-based LLMs.

object

A single mapping rule.

Function

string

The name of a built-in function provided by the Alibaba agent system. Currently, only hangup is supported.

hangup

MatchFunction

string

The name of the LLM function to map to this built-in function. The name is user-defined and is used by the LLM to invoke the corresponding feature. For more information about the user-defined LLM protocol, see Standard LLM interfaces.

hangup

OutputMinLength

integer

The minimum length of the text output in characters. Text shorter than this length is cached and waits to be concatenated. Valid range: [0, 100]. A value of 0 or empty means no limit. Default: empty.

5

OutputMaxDelay

integer

The maximum delay for text output in milliseconds. If this time is exceeded, the cached text is forcibly output. Valid range: [1000, 10000]. A value of 0 or empty means no limit. Default: empty.

2000
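The interaction between OutputMinLength and OutputMaxDelay can be pictured with a small buffer. This is a sketch of the documented behavior, not the service's implementation: `OutputBuffer` is a hypothetical class, timestamps are passed in explicitly, and the delay is only checked when new text arrives, whereas a real system would presumably also flush on a timer.

```python
from typing import Optional


class OutputBuffer:
    """Cache short text fragments until they are long enough or too old."""

    def __init__(self, min_length: int = 0, max_delay_ms: int = 0):
        self.min_length = min_length      # 0 means no minimum length
        self.max_delay_ms = max_delay_ms  # 0 means no delay limit
        self._cached = ""
        self._cached_since = 0

    def push(self, text: str, now_ms: int) -> Optional[str]:
        """Add a fragment; return text to output now, or None to keep caching."""
        if not self._cached:
            self._cached_since = now_ms
        self._cached += text
        long_enough = len(self._cached) >= self.min_length
        waited_too_long = (
            self.max_delay_ms > 0
            and now_ms - self._cached_since >= self.max_delay_ms
        )
        if long_enough or waited_too_long:
            out, self._cached = self._cached, ""
            return out
        return None


buf = OutputBuffer(min_length=5, max_delay_ms=2000)
print(buf.push("Hi", 0))        # None (too short, cached)
print(buf.push(", ", 500))      # None (still too short)
print(buf.push("there", 800))   # Hi, there
print(buf.push("ok", 1000))     # None
print(buf.push("!", 3500))      # ok! (forced out after the delay)
```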

HistorySyncWithTTS

boolean

Specifies whether to keep the saved LLM message history consistent with the content actually played by TTS. Default: false.

false

AvatarConfig

object

The digital human configuration. This takes effect only if the workflow contains a digital human node.

AvatarId

string

The model ID of the digital human.

5257

InterruptConfig

object

The configuration for the voice interruption policy.

EnableVoiceInterrupt

boolean

Specifies whether to support voice interruption. Default: true.

true

InterruptWords

array

A list of specific words or phrases that trigger a conversation interruption.

string

A specific word or phrase that triggers a conversation interruption.

打断一下

NoInterruptMode

string

The ASR processing policy in no-interrupt mode. Valid values:

  • cache: Caches the ASR text. After the current round ends, the cached ASR text is processed in the next round.

  • discard: Discards the ASR text directly.

The default behavior is to cache the ASR text.

cache

VoiceprintConfig

object

Voiceprint configuration.

UseVoiceprint

boolean

Specifies whether to use voiceprint recognition. Default: false. A valid voiceprint ID must be provided when you enable this feature.

false

VoiceprintId

string

The unique ID for voiceprint recognition. By default, this is not set. The provided voiceprint ID must be registered through the voiceprint registration interface. For more information, see Register a human voiceprint.

zhixiaoxia

RegistrationMode

string

TurnDetectionConfig

object

Configuration for conversation round detection.

TurnEndWords

array

A list of keywords used to determine the end of a user's turn.

string

A keyword used to determine the end of a user's turn.

我说完了

Mode

string

The mode for turn detection.

  • Normal (default): Does not use AI to determine semantics.

  • Semantic: Uses AI to determine if the user has finished speaking based on contextual semantics.

Semantic

SemanticWaitDuration

integer

The pause detection time in Semantic mode. Unit: milliseconds. Default: -1.

  • -1: The AI automatically determines the appropriate waiting time.

  • 0–10000: A custom waiting time. A value between 0 ms and 1500 ms is recommended.

Note

This parameter is not valid in Normal mode.

-1

Eagerness

string

This parameter is valid only in Semantic mode. It controls how quickly the agent responds after the AI detects a pause:

  • "Low": Waits patiently. The maximum waiting time is 6 seconds. This reduces the risk of interruption.

  • "Medium": Balanced waiting. The maximum waiting time is 4 seconds. This is suitable for most scenarios.

  • "High": Responds quickly. The maximum waiting time is 2 seconds. This improves speed but may increase the risk of incorrect turn-taking.

By default, this field is empty.

High
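The TurnDetectionConfig constraints above can be summarized in a short check. This is a sketch under stated assumptions: `check_turn_detection` is a hypothetical helper, and it reports Semantic-only parameters that are set while Mode is Normal, where the documentation says they are not valid.

```python
# Maximum waiting time per Eagerness level, in seconds, as documented above.
EAGERNESS_MAX_WAIT_SECONDS = {"Low": 6, "Medium": 4, "High": 2}
SEMANTIC_ONLY_KEYS = ("SemanticWaitDuration", "Eagerness")


def check_turn_detection(cfg: dict) -> list:
    """Validate a TurnDetectionConfig-shaped dict; return keys that have no effect."""
    mode = cfg.get("Mode", "Normal")
    if mode not in ("Normal", "Semantic"):
        raise ValueError(f"unknown Mode: {mode!r}")
    wait = cfg.get("SemanticWaitDuration", -1)
    if wait != -1 and not 0 <= wait <= 10000:
        raise ValueError("SemanticWaitDuration must be -1 or 0 to 10000 ms")
    eagerness = cfg.get("Eagerness")
    if eagerness is not None and eagerness not in EAGERNESS_MAX_WAIT_SECONDS:
        raise ValueError(f"unknown Eagerness: {eagerness!r}")
    if mode == "Normal":
        # Semantic-only parameters are not valid in Normal mode.
        return [key for key in SEMANTIC_ONLY_KEYS if key in cfg]
    return []


print(check_turn_detection({"Mode": "Semantic", "Eagerness": "High"}))  # []
print(check_turn_detection({"Mode": "Normal", "Eagerness": "Low"}))     # ['Eagerness']
```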

ExperimentalConfig

string

Parameters for experimental features. If you have any requirements, contact technical support.

""

VcrConfig

object

Configuration for the video content recognition feature. This supports sending a callback to the client with the content that the algorithm recognizes in the video.

StillFrameMotion

object

Configuration for still frame detection.

Enabled

boolean

Specifies whether to enable still frame detection. Default: false.

false

CallbackDelay

integer

The delay for sending a notification after a still frame is detected. After you set this, a notification is triggered only after a still frame persists for the specified duration. Unit: milliseconds. By default, this is empty, and the configuration in the console is used for the call. Valid range: [200, 5000].

3000

InvalidFrameMotion

object

Parameter configuration for invalid frame detection.

Enabled

boolean

Specifies whether to enable invalid frame detection.

false

CallbackDelay

integer

The delay for sending a notification after an invalid frame is detected. After you set this, a notification is triggered only after an invalid frame persists for the specified duration. Unit: milliseconds. By default, this is empty, and the configuration in the console is used for the call. Valid range: [200, 5000].

3000

PeopleCount

object

Configuration for the people counting feature.

Enabled

boolean

Specifies whether to enable people counting. Default: false.

false

Equipment

object

Configuration for device recognition.

Enabled

boolean

Specifies whether to check for prohibited devices. Default: false.

false

HeadMotion

object

Configuration for head motion recognition.

Enabled

boolean

Specifies whether to enable head motion recognition. Default: false.

false

LookAway

object

Configuration for gaze aversion recognition.

Enabled

boolean

Specifies whether to enable gaze aversion recognition. Default: false.

true

AmbientSoundConfig

object

Configuration for ambient sound during a call.

ResourceId

string

The ID of the ambient sound. You can obtain this ID from the advanced configuration of the agent in the console.

f67901c595834************

Volume

integer

The volume of the background sound for the call. Valid range: [0, 100]. A value of 0 means the sound is off.

50

AutoSpeechConfig

object

The configuration module for the agent's automatic speech, including prompts while waiting for the LLM and inquiries during long periods of user silence.

UserIdle

object

Configuration for inquiry playback when the user is silent for a long time.

WaitTime

integer

The silence duration threshold. Unit: milliseconds. Required. An inquiry is triggered if the silence exceeds this duration. Valid range: 5000–600000 ms.

5000

MaxRepeats

integer

The maximum number of inquiries. Valid range: 0–10. Required. If this number is exceeded, no more inquiries are triggered, and the call is terminated.

5

Messages

array

A collection of inquiry prompts. You can have up to 10 prompts. Each prompt can be up to 100 characters long. The probabilities of all prompts must sum to 1 (100%).

object

The structure of an inquiry prompt.

Text

string

The text of the inquiry prompt. It can be up to 100 characters long.

您还在吗?

Probability

number

The probability of selecting this prompt. Valid range: 0–1, which corresponds to 0%–100%.

0.5
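The constraints on Messages (at most 10 prompts, at most 100 characters each, probabilities summing to 1) can be checked before sending a request. A minimal sketch: `validate_messages` is a hypothetical helper, the floating-point tolerance is an assumption, and the second prompt text is made up for illustration.

```python
import math


def validate_messages(messages: list, max_items: int = 10, max_chars: int = 100) -> None:
    """Raise ValueError if a Messages array breaks the documented limits."""
    if len(messages) > max_items:
        raise ValueError(f"at most {max_items} prompts are allowed")
    for message in messages:
        if len(message["Text"]) > max_chars:
            raise ValueError(f"prompt longer than {max_chars} characters")
        if not 0 <= message["Probability"] <= 1:
            raise ValueError("Probability must be between 0 and 1")
    total = sum(message["Probability"] for message in messages)
    # Assumption: a small tolerance absorbs floating-point rounding.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {total}, expected 1")


messages = [
    {"Text": "您还在吗?", "Probability": 0.5},
    {"Text": "请问您还在听吗?", "Probability": 0.5},  # hypothetical second prompt
]
validate_messages(messages)  # passes silently
```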

LlmPending

object

Configuration for playback when the LLM response is delayed.

WaitTime

integer

The threshold for waiting for an LLM response. A prompt is played if the wait time exceeds this threshold. Required. Unit: milliseconds. Valid range: 500–10000 ms. Set this based on the actual performance of your LLM.

3000

Messages

array

A collection of prompts to play while waiting for the LLM response. You can have up to 10 prompts. Each prompt can be up to 100 characters long. The probabilities of all prompts must sum to 1 (100%).

object

The structure of a waiting prompt.

Text

string

The text of the waiting prompt. It can be up to 100 characters long.

稍等一下

Probability

number

The probability of selecting this prompt. Valid range: 0–1, which corresponds to 0%–100%.

0.5

BackChannelingConfigs

array

object

Enabled

boolean

TriggerStage

string

Probability

number

Words

array

object

Text

string

Probability

number

BackChannelingConfig

array

The configuration module for the backchanneling feature. When enabled, the system randomly plays short backchanneling phrases at specific trigger points.

object

A single backchanneling configuration.

Enabled

boolean

Specifies whether to enable the backchanneling feature. Required. Valid values: true, false.

true

TriggerStage

string

The trigger for backchanneling. Valid value:

  • pause_detected (a short pause in speech is detected)

pause_detected

Probability

number

The probability of triggering the feature. Valid range: 0.0–1.0. Required.

0.5

Words

array

A collection of backchanneling phrases. You can have up to 10 phrases. Each phrase can be up to 20 characters long. The sum of probabilities must be 1.0.

object

Configuration for a backchanneling phrase.

Text

string

The text of the phrase. It can be up to 20 characters long and supports multiple languages. Required.

嗯嗯

Probability

number

The probability of triggering this phrase. Valid range: 0.0–1.0. Required.

0.3
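The backchanneling settings describe two levels of probability: whether to respond at all, and which phrase to play. How the service samples is not documented here, so the following is only one plausible reading. `pick_backchannel` is a hypothetical function, and the second phrase is made up so that the weights sum to 1.0.

```python
import random
from typing import Optional


def pick_backchannel(config: dict, rng: random.Random) -> Optional[str]:
    """Return a phrase to play at a trigger point, or None to stay silent."""
    if not config["Enabled"]:
        return None
    # First stage: decide whether to backchannel at all.
    if rng.random() >= config["Probability"]:
        return None
    # Second stage: weighted choice among the configured phrases.
    words = config["Words"]
    texts = [word["Text"] for word in words]
    weights = [word["Probability"] for word in words]
    return rng.choices(texts, weights=weights, k=1)[0]


config = {
    "Enabled": True,
    "TriggerStage": "pause_detected",
    "Probability": 0.5,
    "Words": [
        {"Text": "嗯嗯", "Probability": 0.3},
        {"Text": "好的", "Probability": 0.7},  # hypothetical second phrase
    ],
}
print(pick_backchannel(config, random.Random()))  # a phrase or None
```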