This topic describes how to quickly create an audio and video agent.
Service activation
Before you use Real-time Conversational AI, make sure that the following requirements are met:
Your Intelligent Media Services (IMS) subscription is upgraded to IMS Enterprise Standard Edition or Ultimate Edition. To upgrade your subscription, go to the IMS Subscription page.
The Real-time Conversational AI feature is enabled. To enable the feature, go to the buy page.
Step 1: Create an audio and video workflow
Log on to the Intelligent Media Services console and click Create Workflow Template.
Select Voice Call, Digital Human Call, Visual Understanding Call, or Video Call as needed, and configure the workflow nodes.
STT (speech-to-text)
This node converts speech input into readable text and supports multi-language recognition.
System Preset: System preset models allow you to select a language model, set silent time, and configure custom hot words.
Language Model: You can select different language models based on your business scenario.
Silent Time: How long the agent waits when there is no voice input from the user.
Configure Custom Hot Words: Configuring hot words can improve the recognition of business domain vocabulary. For configuration details, see Speech Recognition Hot Words.
Self-developed Integration: Alibaba Cloud supports the integration of your self-developed speech-to-text models. For integration details, see STT Standard Interface.
Note: If you need to integrate a self-developed speech-to-text model, the process involves whitelist operations. For detailed information, join the group for consultation.
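The actual wire protocol for a self-developed STT integration is defined in STT Standard Interface. As a rough illustration only, the sketch below shows the general shape of such a service: an endpoint that accepts audio chunks and returns recognized text. Every name and message field in it is an assumption, not the real interface.

```python
# Hypothetical sketch only: the real protocol is defined in the STT
# Standard Interface document. This shows the general shape of a
# self-developed STT service that accepts audio chunks and returns
# recognized text. All field names here are assumptions.
import asyncio
import json

import websockets  # pip install websockets


def transcribe_chunk(chunk: bytes) -> str:
    """Stand-in for your own speech recognition model."""
    return "<recognized text>"


async def handle_session(websocket):
    # Receive binary audio chunks and reply with JSON results.
    async for message in websocket:
        if isinstance(message, bytes):
            text = transcribe_chunk(message)
            await websocket.send(json.dumps({"text": text, "final": True}))


async def main():
    async with websockets.serve(handle_session, "0.0.0.0", 8765):
        await asyncio.Future()  # serve forever


asyncio.run(main())
```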
LLM (large language model)
The LLM node uses a pre-trained large language model to understand the text converted by STT and to generate a natural language response.
Currently, Real-time Conversational AI supports the integration of Tongyi Qianwen (system preset), Alibaba Cloud Model Studio, Alibaba Tongyi Xingchen, and self-developed large models.
Alibaba Cloud Model Studio
Alibaba Cloud Model Studio is a one-stop platform for large model development and application building. When connecting to models and services provided by Alibaba Cloud Model Studio, you can connect to either the Model Hub or the Application Center.
Model Hub: Select a suitable model from the Alibaba Cloud Model Studio Model Marketplace and click View Details to get the ModelId. Click API-Key to get the API-Key.
Application Center: First create an agent application in Alibaba Cloud Model Studio, and then obtain the AppId. Click Call to enter the agent application interface, and then click API KEY in the upper right corner to get the API-KEY.
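As a quick check that the AppId and API-KEY are valid, you can call the agent application directly with the DashScope Python SDK (pip install dashscope). This is only a sketch; treat the exact SDK surface as an assumption and follow the Model Studio documentation for authoritative usage.

```python
# Sketch: calling a Model Studio agent application with the AppId and
# API-KEY obtained above, via the DashScope Python SDK.
from http import HTTPStatus

from dashscope import Application

response = Application.call(
    app_id="YOUR_APP_ID",    # AppId from the Application Center
    api_key="YOUR_API_KEY",  # API-KEY from the agent application page
    prompt="Hello, please introduce yourself.",
)
if response.status_code == HTTPStatus.OK:
    print(response.output.text)
else:
    print(f"Request failed: {response.message}")
```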
Tongyi Xingchen
Tongyi Xingchen lets you create deeply personalized agents, each with its own persona and style. Combined with real-time digital human voice interaction, such agents can engage in rich interactions across different scenarios.
ModelId: Currently, Tongyi Xingchen offers five models to choose from: xingchen-lite, xingchen-base, xingchen-plus, xingchen-plus-v2, and xingchen-max.
API-KEY: Go to the Xingchen console to create and obtain an API-KEY.
Self-developed model
Real-time Conversational AI also supports the integration of your self-developed large models. You can integrate your large model according to the OpenAI specification.
OpenAI Specification: If you choose to integrate according to the OpenAI specification, you need to fill in the following parameters:
ModelId: The OpenAI-standard model field, representing the model name. Example value: abc
API-KEY: The OpenAI-standard api_key field, representing the API authentication information. Example value: AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI
Target Model HTTPS Address: The OpenAI-standard base_url field, representing the target service request address. Example value: http://www.abc.com
For more details on self-developed LLM integration, see LLM Standard Interface.
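For orientation, the sketch below shows how the three parameters map onto a standard OpenAI-compatible request using the official openai Python SDK (pip install openai); the values are the example values from the list above, and streaming is shown because conversational agents typically consume tokens as they arrive.

```python
# A minimal sketch of an OpenAI-compatible request against a
# self-developed model, using the example values from above.
from openai import OpenAI

client = OpenAI(
    api_key="AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI",  # API-KEY
    base_url="http://www.abc.com",  # Target Model HTTPS Address (base_url)
)

stream = client.chat.completions.create(
    model="abc",  # ModelId
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # stream tokens as they are generated
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```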
TTS (text-to-speech)
This node is responsible for converting processed text back into speech format so that users can hear the system's response.
You can select a text-to-speech model suitable for your application scenario from the following options: System Preset Template, Self-developed Template, and Third-party Plugin.
System Preset Template: When selecting a preset template, you need to configure the voice tone. For examples of various types of intelligent voice effects, see Intelligent Voice Effect Examples.
Self-developed Template: You can add your self-developed text-to-speech model to the workflow through a standardized protocol. For details, see TTS Standard Interface.
Third-party Plugin: Currently, only the MiniMax voice model is supported. It meets the needs of complex productivity and multilingual character dialogue scenarios and supports a maximum context window of 245k. For details, see MiniMax Voice Model.
In the TTS node, you can also filter the text that the LLM outputs before it is synthesized.
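As an illustration of the kind of filtering this enables, the sketch below strips markup that should not be read aloud; the specific rules are assumptions, not the node's built-in behavior.

```python
# Example filter applied to LLM output before speech synthesis:
# drop code blocks and Markdown marks, then normalize whitespace.
import re


def filter_for_tts(text: str) -> str:
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)  # drop code blocks
    text = re.sub(r"[*_#>`]", "", text)                     # drop Markdown marks
    text = re.sub(r"\s+", " ", text).strip()                # collapse whitespace
    return text


print(filter_for_tts("**Hello!** Here is `code`:\n```py\nx=1\n```"))
# -> "Hello! Here is code:"
```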
Digital human
This node is responsible for generating a digital human video stream with actions, expressions, and lip synchronization corresponding to the processed text and audio.
Currently, the digital human node supports Connecting To Digital Human Plugins or Connecting To Alibaba Cloud Model Studio Platform:
Connecting To Digital Human Plugins:
FaceUnity: You need to consult FaceUnity Technology customer service to activate FaceUnity 3D digital human service and obtain AppId, AppKey, and AvatarId.
Tongyi Xingchen: To use Tongyi Xingchen digital humans, please go to the Tongyi Xingchen console to obtain ModelId, AppKey, and AvatarId.
Connecting To Alibaba Model Studio Platform: To connect to Alibaba Cloud Model Studio platform digital humans, you need to obtain ModelId, AppKey, and AvatarId in advance. For detailed information, see Digital Human Integration.
Note: If you need to integrate Tongyi Xingchen digital humans or Alibaba Cloud Model Studio digital humans, the process involves whitelist operations. For detailed information, join the group for consultation.
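To keep the three credential sets straight, the hypothetical sketch below groups them by option. The actual workflow configuration format is defined in the console and the Digital Human Integration document, so the structure shown is an assumption.

```python
# Hypothetical grouping of the credentials each option expects.
DIGITAL_HUMAN_OPTIONS = {
    "FaceUnity": {        # obtained from FaceUnity Technology customer service
        "AppId": "YOUR_APP_ID",
        "AppKey": "YOUR_APP_KEY",
        "AvatarId": "YOUR_AVATAR_ID",
    },
    "Tongyi Xingchen": {  # obtained from the Tongyi Xingchen console
        "ModelId": "YOUR_MODEL_ID",
        "AppKey": "YOUR_APP_KEY",
        "AvatarId": "YOUR_AVATAR_ID",
    },
    "Model Studio": {     # obtained in advance, see Digital Human Integration
        "ModelId": "YOUR_MODEL_ID",
        "AppKey": "YOUR_APP_KEY",
        "AvatarId": "YOUR_AVATAR_ID",
    },
}
```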
Video frame extraction
This node is responsible for extracting single or multiple frames of images from a video.
MLLM (multi-modal large language model)
Based on the frames extracted by the preceding node, the MLLM can understand the input images and text and generate natural language text.
You can select a language model suitable for your application scenario: use the built-in large language model (Tongyi Qianwen), or integrate your self-developed multi-modal large language model according to the OpenAI specification or the Alibaba specification. For self-developed integration, see MLLM Standard Interface.
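If you integrate a self-developed model according to the OpenAI specification, an extracted frame typically travels as an image_url content part alongside the text. The sketch below shows that message shape with the openai Python SDK; the endpoint, key, and model name are placeholders.

```python
# Sketch of an OpenAI-specification multi-modal request: one text part
# plus one image part (an extracted video frame) in a single message.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://your-mllm.example.com/v1",  # placeholder endpoint
)

response = client.chat.completions.create(
    model="your-mllm-model",  # placeholder ModelId
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is happening in this frame."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/frame.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```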
Click Save to complete the audio and video workflow creation.
Step 2: Create an audio and video agent
Log on to the Intelligent Media Services console and click Create Agent.
Configure the basic information and bind the audio and video workflow.
Bind a specific audio and video workflow. The AI agent will run according to this workflow.
When configuring the agent, select an existing Alibaba Real-Time Communication (ARTC) application under the current account. If there is no ARTC application under the logged-in account, you can also choose to have the system automatically create one. For more information about real-time audio and video, see Introduction to Real-time Audio and Video.
Note: Real-time Conversational AI depends on ARTC applications, which serve as a communication bridge to ensure the normal operation of the conversation function.
If the bound workflow type is Voice Call, you can upload a custom image in the advanced feature configuration; this image is displayed in the voice call scenario.
Click Submit to complete the audio and video agent creation.
Step 3: Experience the agent
After the audio and video agent is created, you can experience the agent by scanning the experience QR code.
Generate a Demo experience QR code in the console.
Use DingTalk, WeChat, or a browser to scan the QR code, or copy the experience URL to your browser to experience the H5 version of the Demo.
Integrate audio and video agents
For information on how to integrate audio and video agents into your project, see Audio and Video Call Agent Integration.