Real-time Conversational AI enables efficient audio and video interaction between AI agents and users.
Introduction
Real-time Conversational AI is a solution that enables enterprises to build applications for human-AI interaction. You can create a human-like agent in the console within 10 minutes. The agent communicates with end users through the Global Realtime Transport Network (GRTN). The solution suits scenarios such as online customer service, AI assistants, AI companions, matchmaking assistants, and virtual teachers.
Capabilities
An AI agent is a virtual user that interacts with end users. You can configure five types of workflows for an agent to meet different business scenarios:
Audio/Video call
Audio call | Users communicate with intelligent assistants through voice. |
Avatar call | Users can make video calls with avatars, which provide more realistic interactions. |
Vision call | The agent provides feedback based on the user's voice and camera feed. |
Video call | The avatar communicates with end users through a two-way video call. |
Take the audio call as an example: you only need to configure three nodes to create a voice call workflow.
Messaging
Users communicate with the agent through voice or text messages.
For interactive messaging, configure the following nodes:
New features
Semantic endpointing | AI knows the perfect moment to respond. AI intelligently determines if the user has finished speaking based on conversational context, preventing it from interrupting them during natural pauses. Powered by Alibaba Cloud's semantic endpointing technology, it achieves natural interaction with low latency and an accuracy rate of up to 95%. |
AI acoustics V2.5 | Full-duplex conversation in noisy environments. AI acoustics V2.5 is available. Compared to V2.0, it reduces interference from far-field human voices, enabling smooth, full-duplex conversations in various scenarios such as offices, cafeterias, shopping malls, and on the street. |
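The idea behind semantic endpointing, deciding that a turn is over from what was said and not only from how long the user paused, can be illustrated with a toy rule-based check. This is only a sketch: the product uses its own technology, and the function name `user_finished`, the word list, and the pause thresholds below are all assumptions made for illustration.

```python
# Words that often signal an unfinished thought when they end an utterance.
# Illustrative list only; a real system would rely on a trained model.
TRAILING_INCOMPLETE = {"and", "but", "because", "so", "or", "um", "uh", "the", "to"}


def user_finished(transcript: str, pause_ms: int,
                  short_pause_ms: int = 300, long_pause_ms: int = 1500) -> bool:
    """Guess whether the user has finished speaking (hypothetical heuristic)."""
    words = transcript.lower().rstrip(".?! ").split()
    if not words:
        return False  # nothing was said yet
    if pause_ms >= long_pause_ms:
        return True   # a very long silence ends the turn regardless of wording
    if pause_ms < short_pause_ms:
        return False  # too short to be a turn boundary
    # Medium-length pause: use the wording to decide.
    return words[-1].strip(",") not in TRAILING_INCOMPLETE


# Same pause length, different decisions depending on the wording.
mid_sentence = user_finished("I would like to book a flight to", pause_ms=600)
complete = user_finished("I would like to book a flight to Paris", pause_ms=600)
```

With a purely pause-based rule, both utterances would be treated the same. Here the first one is held open because it ends on a word that usually introduces more speech.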
Terms
SessionId | SessionId is defined by developers. We recommend that you set it as the unique identifier of chat records. Usage examples:
|
Messaging | A workflow where the agent interacts with users through voice or text messages. |
Audio call | A workflow where users interact with the agent through voice to obtain timely feedback and service support. |
Avatar call | A workflow where users interact with the agent that has a virtual character with rich body movements and facial expressions. This enhances the authenticity and user engagement in the conversations. |
Vision call | A workflow where the MLLM-based agent provides feedback based on the users' voice input and camera feeds. This allows users to obtain a more intuitive, efficient, and personalized interaction experience, breaking the limitations of traditional voice or text communication. |
Video call | A workflow that combines the advantages of avatar and vision calls to allow users to engage in two-way video calls with the agent. The avatar can understand the camera feeds and provide feedback, enhancing interaction and authenticity. |
Interactive messages (IM) | A value-added service of ApsaraVideo Live that enhances message communication in live rooms and improves the interactive experience. |
ApsaraVideo Real-time Communication (ARTC) | A value-added service of ApsaraVideo Live that provides a stable, high-quality, and low-latency interactive streaming service based on advanced multimedia technologies and over 3,200 points of presence worldwide. Web Real-Time Communication (WebRTC) technology is used for real-time human-AI communications. For more information, see Overview of ARTC. |
Real-time workflow | A workflow consists of a sequence of nodes, each dedicated to a task, such as speech-to-text (STT), text-to-speech (TTS), LLM, and self-managed vector database. You can flexibly orchestrate nodes through plug-in and drag-and-drop methods. An AI agent follows the structured workflow to interact with end users. |
AI agent | An AI agent is a human-like virtual user that interacts with end users. You can create one or use the default agent. |
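The real-time workflow defined above, a sequence of nodes that each handle one task, can be pictured as a simple chain. The following minimal sketch assumes a voice call built from STT, LLM, and TTS nodes. The `Node` class, `run_workflow`, and the stub functions are hypothetical names for illustration and are not part of any Alibaba Cloud SDK.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical workflow node: a name plus a function that transforms the payload."""
    name: str
    run: Callable[[str], str]


def run_workflow(nodes: List[Node], payload: str) -> str:
    """Pass the payload through every node in order."""
    for node in nodes:
        payload = node.run(payload)
    return payload


# Stubs standing in for real STT, LLM, and TTS services.
def fake_stt(audio: str) -> str:
    return audio.replace("audio:", "", 1)   # "audio" in, text out


def fake_llm(text: str) -> str:
    return f"You said: {text}"              # text in, reply text out


def fake_tts(text: str) -> str:
    return f"audio:{text}"                  # reply text in, "audio" out


voice_call = [Node("STT", fake_stt), Node("LLM", fake_llm), Node("TTS", fake_tts)]
reply = run_workflow(voice_call, "audio:hello")
```

Because each node only sees the output of the previous one, swapping a node (for example, a different LLM) does not affect the rest of the chain, which is the property that makes plug-in style orchestration possible.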
Benefits
High availability and low latency worldwide
Real-time Conversational AI relies on Alibaba Cloud's Global Realtime Transport Network (GRTN), which covers more than 3,200 points of presence (POPs) worldwide and applies Quality of Service (QoS) optimization, so users can interact with agents from anywhere in the world.
Easy access and debugging
You can integrate AI components, such as STT service, LLM, speech synthesis service, and self-developed vector databases, into workflows as plug-ins to quickly develop and debug your business solutions.
Highly human-like
Alibaba Cloud continuously iterates and optimizes features such as smart noise reduction, intelligent interruption, and intelligent sentence segmentation to make AI agents behave more like humans.
Easy integration
Alibaba Cloud provides four integration methods to suit different application scenarios.
How it works
The following diagram illustrates how Real-time Conversational AI works:

A user initiates a real-time audio or video call request to a cloud-hosted AI agent by using a client SDK.
After the agent receives the request from the user, the workflow starts and an AI response is generated.
The agent pushes the audio or video stream that contains the response to the ARTC network. The user subscribes to the stream for playback. The conversation between the user and the agent is established.
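The three steps above follow a publish/subscribe pattern, which can be sketched with an in-memory stand-in for the network. The classes `MediaNetwork` and `Agent`, the stream ID, and the stubbed response are assumptions for illustration only. A real client would use the ARTC client SDK, whose API is not shown here.

```python
from typing import Callable, Dict, List


class MediaNetwork:
    """In-memory stand-in for the real-time network (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = {}

    def subscribe(self, stream_id: str, on_media: Callable[[str], None]) -> None:
        self._subscribers.setdefault(stream_id, []).append(on_media)

    def publish(self, stream_id: str, media: str) -> None:
        # Deliver the media to everyone subscribed to this stream.
        for on_media in self._subscribers.get(stream_id, []):
            on_media(media)


class Agent:
    """Cloud-hosted agent that answers requests by publishing a stream."""

    def __init__(self, network: MediaNetwork, stream_id: str) -> None:
        self.network = network
        self.stream_id = stream_id

    def handle_request(self, user_input: str) -> None:
        # Step 2: the workflow starts and generates a response (stubbed).
        response = f"response to '{user_input}'"
        # Step 3: the agent pushes the response stream to the network.
        self.network.publish(self.stream_id, response)


network = MediaNetwork()
agent = Agent(network, stream_id="agent-stream")

received: List[str] = []
# The user subscribes to the agent's stream for playback.
network.subscribe("agent-stream", received.append)
# Step 1: the user initiates a call request.
agent.handle_request("hello")
```

In this sketch the subscription is registered before the request is sent so that no media is missed. The real service handles stream setup and ordering for you.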
Features
Feature | Description |
Real-time workflow | You can orchestrate a workflow in the console. A workflow may contain the following nodes:
|
Custom agent profile | Upload an image for the AI agent. The image is displayed during voice calls. |
Emotion recognition | Recognize users' emotions and generate empathetic responses. |
Welcome message | Configure the welcome message in the IMS console. When the user starts a conversation, the agent broadcasts the welcome message first. |
Proactive messages | Configure the business server to allow the agent to proactively push audio and video content to the user by using OpenAPI. |
Live subtitles | The conversation content can be presented in real time on the user interface. |
Intelligent noise reduction | Automatically filter the noise from the user side during a conversation. If multiple users are speaking at the same time, the voice with the highest volume is preferentially collected. |
Intelligent interruption | Recognize the conversation interruption intention of users. |
Intelligent sentence segmentation | Automatically identify and segment long or complex sentences to improve text readability and user experience. |
Audio sentence callback | You can configure this callback in the console to store audio data in Object Storage Service (OSS). |
Push-to-talk mode | The user can set the call mode to the push-to-talk mode at the beginning of or during a call, and interact with the agent by pressing a button. |
ASR hotwords | You can define business-related hotwords to improve the speech recognition accuracy of intelligent agents. |
Voiceprint-based noise suppression | In a multi-speaker scenario, the agent can identify the voiceprint characteristics of the main speaker to accurately capture their speech and minimize interference from background noise. |
Human takeover | When the agent encounters situations beyond its capabilities or requires critical decision-making, human agents can take over the conversations with users. |
Graceful shutdown | When the business server stops the agent, the agent can complete the current sentence. This prevents abrupt interruptions of conversations. |
Data archiving | The conversations between the agent and users are converted into text for storage. You can call API operations to consume the data. In addition, you can store audio and video data of calls in OSS or ApsaraVideo VOD. |
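As an illustration of the intelligent noise reduction behavior described in the table, where the voice with the highest volume is preferentially collected, the following sketch picks the loudest speaker by root mean square (RMS) amplitude. The function names, the sample data, and the use of RMS as the loudness measure are assumptions for illustration. The service's actual algorithm is not documented here.

```python
import math
from typing import Dict, List


def rms(samples: List[float]) -> float:
    """Root mean square amplitude of one audio frame; 0.0 for an empty frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudest_speaker(frames: Dict[str, List[float]]) -> str:
    """Return the speaker whose current frame has the highest RMS amplitude."""
    return max(frames, key=lambda speaker: rms(frames[speaker]))


# Two users speaking at the same time (made-up sample values).
frames = {
    "user_a": [0.1, -0.1, 0.1, -0.1],
    "user_b": [0.5, -0.5, 0.5, -0.5],
}
selected = loudest_speaker(frames)
```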
Billing
Real-time Conversational AI is in public preview and is free of charge.
FAQ
How does an agent access a large language model (LLM) deployed in Alibaba Cloud Model Studio?
The client reports the "AgentNotFound" error when starting a messaging conversation
The client reports the "UnsupportedWorkflowType" error when starting a messaging conversation
Contact us
To obtain more information and technical support, join the DingTalk group (ID: 106730016696) to contact us.







