In AI real-time interaction, a workflow consists of a sequence of nodes, each dedicated to a specific task, such as speech-to-text (STT), text-to-speech (TTS), LLM text generation, and avatar integration. An AI agent follows a structured workflow to interact with end users. Intelligent Media Services (IMS) provides preset workflow templates for multiple scenarios, including audio calls and avatar calls. This topic describes how to configure a real-time workflow template.
Limitations
Preset templates are accessible to all users.
Custom templates are exclusive to their creators.
Preset templates cannot be deleted.
Workflow types
There are four types of workflows: audio call, avatar call, visual call, and messaging. Each workflow type has preset nodes. You can configure the nodes as needed.
Type | Scenario
Audio call | Ideal for one-on-one or group calls. Users can talk with intelligent assistants and obtain instant feedback and services.
Avatar call | Users can make video calls with a 3D avatar, which features rich body language and facial expressions, creating more engaging and realistic interactions.
Visual call | In video calls with users, the AI agent analyzes camera feeds. This involves object recognition, scene understanding and segmentation, semantic analysis, and behavior recognition, so the agent can understand the contextual setting of objects.
Messaging | Users can communicate with agents through voice or text messages.
Create a template in the console
Log on to the IMS console. In the left-side navigation pane, choose Intelligent Production > AI Real-time Interaction > Workflows. Click Create Workflow Template.
Configure the basic information.
Customize the workflow name. You will reference it when creating an AI agent.
Choose a workflow type and configure the nodes.
Each workflow comes with preset nodes that cannot be added or removed, but you can edit the nodes.
Speech-to-text
This node converts input audio into text. Multiple languages are supported. You can use the preset STT model, third-party plug-in, or self-developed model.
Preset: The preset model supports the following configurations:
Language Model: Select the language based on the input audio.
Silent Time: When there is no voice input, the STT node waits for the specified period before transmitting the transcribed text to the LLM.
Custom Hotword: Upload hotword files to ensure accurate recognition of specific terms or phrases related to your business.
Third-party plug-in: Only iFLYTEK is supported. Visit the iFLYTEK official website to learn more.
Self-developed: To integrate a self-developed STT model, see STT standard interface.
Note: To use a self-developed model, whitelisting is required. For more information, consult us by joining the DingTalk group.
LLM
This node uses a pre-trained LLM to understand the input text converted by the STT node and generate text responses.
You have the following options:
Use the preset Qwen model
Integrate a model from Alibaba Cloud Model Studio
Integrate a model from Tongyi Xingchen
Access a self-developed model
Alibaba Cloud Model Studio
Alibaba Cloud Model Studio is an all-in-one platform for foundation model development and application building. You can integrate a model from Model Center or Application Center of Model Studio.
Model Center: Deploy a model in Model Studio and get the ModelId and API key.
Application Center: Create an agent application in Model Studio and get its AppId. To get the API key, click Call, then click API KEY in the upper-right corner.
Note: To learn how to access Model Studio in AI real-time interaction, see Pass through business parameters to Alibaba Cloud Model Studio.
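For reference, the following is a minimal sketch, assuming the DashScope Python SDK, of what the AppId and API key identify. In this workflow, IMS calls the agent application on your behalf; this snippet only illustrates what the credentials refer to, and both IDs below are placeholders.

```python
# A minimal sketch, assuming the DashScope Python SDK. It only illustrates
# what the AppId and API key refer to; in the workflow, IMS calls the agent
# application for you. Both IDs below are placeholders.
from http import HTTPStatus

import dashscope
from dashscope import Application

dashscope.api_key = "your-api-key"  # API key from Model Studio

response = Application.call(
    app_id="your-app-id",  # AppId of the agent application
    prompt="Hello",
)
if response.status_code == HTTPStatus.OK:
    print(response.output.text)
else:
    print(f"Request failed: {response.message}")
```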
Tongyi Xingchen
Tongyi Xingchen lets you customize an AI agent with a distinct persona. Combined with real-time voice communication, the agent can engage in interactions across a variety of scenarios.
To integrate a model from Tongyi Xingchen, specify ModelId and API-KEY.
ModelId: Tongyi Xingchen supports xingchen-lite, xingchen-base, xingchen-plus, xingchen-plus-v2, and xingchen-max.
API-KEY: Visit Tongyi Xingchen and create an API key.
Self-developed model
Integrate a self-developed model into the workflow based on the OpenAI specifications:
Name | Description | Sample value
ModelId | The model field, which represents the model name. | abc
API-KEY | The api_key field. | AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI
HTTPS URL of Destination Model | The base_url field, which represents the service request address. | http://www.abc.com
For more information on self-developed LLMs, see Access LLMs.
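To illustrate, the following minimal sketch shows the request that an endpoint compatible with the OpenAI specifications must be able to serve. It uses the sample values from the table above (not real credentials) and the openai Python SDK:

```python
# A minimal sketch of an OpenAI-compatible chat completions request.
# The model name, key, and base URL are the sample values from the
# table above, not real credentials.
from openai import OpenAI

client = OpenAI(
    api_key="AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI",  # API-KEY (api_key field)
    base_url="http://www.abc.com",  # HTTPS URL of Destination Model (base_url field)
)

# Real-time workflows typically consume responses as a token stream.
stream = client.chat.completions.create(
    model="abc",  # ModelId (model field)
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```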
Text-to-speech
This node converts processed text back into speech, allowing users to hear the responses.
The following options are available:
Preset Template: You can select a voice. For voice effects, see Intelligent voice samples.
Self-developed Template: Integrate your self-developed model into the workflow. For specific requirements, see Access TTS models.
Third-party Plug-in: Only the MiniMax Speech Model is supported. This model caters to complex productivity and multilingual dialogue scenarios, supporting a context window of 245k. For more information, see MiniMax model.
Alibaba Cloud Model Studio: To customize the agent's voice, access Alibaba Cloud Model Studio.
You can also select specific symbols to prevent the agent from vocalizing them.
Avatar
This node generates the video stream of the avatar, which moves and speaks according to the processed text and audio, with rich facial expressions.
Avatars from Faceunity and Tongyi Xingchen are supported.
Faceunity: Consult Faceunity customer support to activate the service. You must get AppId, AppKey, and AvatarId.
Tongyi Xingchen: Go to Tongyi Xingchen platform to get ModelId, AppKey, and AvatarId.
Note: To integrate an avatar from Tongyi Xingchen, you need to request whitelisting. For more information, join the DingTalk group to contact us.
To learn more about avatar integration, see Avatar integration.
Frame Extractor
This node extracts single or multiple frames from the video for further processing.
Multi-modal LLM
The multi-modal LLM (MLLM) interprets the input images and text to generate text responses.
You can select a preset Qwen model or integrate a self-developed model following the OpenAI specifications. For more information, see Access MLLMs.
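For reference, the following hedged sketch shows the OpenAI-style multimodal request shape that such a self-developed endpoint would need to accept; the endpoint URL, model name, API key, and image URL are placeholders, not values from this topic:

```python
# A sketch of an OpenAI-style multimodal request. All identifiers and
# URLs below are placeholders for illustration only.
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://your-mllm-endpoint.example.com/v1",
)

response = client.chat.completions.create(
    model="your-mllm-model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this frame."},
                {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```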
Click Save.
Manage workflow templates
All workflow templates are displayed on the Real-time Workflow Template page. You can perform the following operations on a template:
View details: Click Manage in the Actions column.
Modify: On the workflow details page, click Modify in the upper-right corner to change the workflow name and node configurations.
Delete: Click Delete in the Actions column.
The preset templates cannot be deleted.
If an AI agent is running based on a custom workflow template, this template cannot be deleted.
Use workflow templates
Initiate a workflow in the IMS console:
When you create an AI agent, select a workflow template to automate the processing of audio and video streams over RTC. Tasks such as STT, TTS, and intelligent conversation are then performed according to the workflow.