Real-time Conversational AI enables efficient audio and video interactions between AI and users. This document describes the solution's capabilities and benefits.
Product introduction
Real-time Conversational AI is a solution that helps businesses quickly build audio and video call applications for interactions between AI and users. The visual configuration interface lets you build a dedicated AI agent in as little as 10 minutes. The agent interacts with end users in real time through the ApsaraVideo Real-time Communication network. This solution is suitable for various scenarios, such as online customer service, AI assistants, AI companions, matchmaking assistants, and virtual teachers.
Capabilities
An AI agent in Real-time Conversational AI is a highly realistic, cloud-based user that interacts with end users through audio and video calls and messaging. You can configure workflows for the agent to enable the following capabilities:
Audio and video calls
Audio call | Users communicate with the AI agent through voice. |
Digital human call | Users interact with a digital human through video to enhance the realism of the experience. |
Visual understanding call | Through video interaction, the agent provides feedback based on both voice and visual input. |
Video call | The digital human uses visual understanding to engage in two-way video calls with the user. |
For a tutorial, see Quick Start for audio and video calls. You can configure the following three nodes to create an audio call workflow.
Messaging
Users can communicate directly with the agent through voice or text in a chat dialog box.
For a tutorial, see Quick Start for messaging. You can configure the following workflow to create a messaging conversation.
Terms
SessionId | A developer-defined ID. We recommend that you use it as the unique identifier for chat records. Examples:
|
Messaging | Users interact with the agent through voice or text in a chat dialog box. This allows both parties to quickly share ideas, ask questions, or get information. |
Audio call | Users interact with the AI agent through voice to get timely information and support. |
3D digital human call | Uses 3D technology to simulate a virtual character for interaction. A 3D digital human can perform voice interactions and use rich body movements and facial expressions to enhance the realism and engagement of the user experience. |
Visual understanding call | A new interactive method that combines video and audio. It analyzes images captured by the camera in real time and combines them with user voice commands to provide accurate feedback through multi-modal interaction. This gives users a more intuitive, efficient, and personalized intelligent interaction experience during calls, breaking the limits of traditional voice or text communication. |
Video call | Combines the benefits of digital humans and visual understanding. During a video call, both the digital human's and the user's video are displayed. The digital human can understand the user's video feed and provide feedback, which enhances the sense of interaction and realism. |
Interactive messages | A service that enhances message communication between users and improves the interactive experience. |
ApsaraVideo Real-time Communication (ARTC) | Real-time audio and video calls between users and AI agents require Web Real-Time Communication (WebRTC) technology. Backed by over 3,200 points of presence (POPs) worldwide and years of audio and video technology experience, Alibaba Cloud ApsaraVideo Real-time Communication (ARTC) provides highly available, high-quality, and ultra-low-latency audio and video communication services. For more information, see Introduction to ApsaraVideo Real-time Communication. |
Real-time workflow | A real-time workflow is a key part of an AI agent. It lets you flexibly orchestrate AI components, such as speech-to-text, large language models, speech synthesis, and self-developed vector databases, using plug-ins and drag-and-drop actions. The AI agent operates according to this predefined workflow. |
AI agent | A highly realistic cloud-based user defined in the Real-time Conversational AI solution. An AI agent can be preset by the system or created by a user. It can directly interact with end users through audio and video. |
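To make the real-time workflow term more concrete, the following minimal sketch chains three stand-in components (speech-to-text, a large language model, and speech synthesis) into a pipeline. Everything here is an assumption for illustration: the `Workflow` class, the `fake_*` functions, and the string payloads are hypothetical and are not part of the Real-time Conversational AI console or any Alibaba Cloud SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical node type: each node maps one text payload to another.
# Real components would process audio frames and call external services.
Node = Callable[[str], str]


def fake_speech_to_text(audio: str) -> str:
    # Stand-in: treat "audio:<text>" as audio whose transcript is <text>.
    prefix = "audio:"
    return audio[len(prefix):] if audio.startswith(prefix) else audio


def fake_llm(prompt: str) -> str:
    # Stand-in for a large language model call.
    return f"You said: {prompt}"


def fake_speech_synthesis(text: str) -> str:
    # Stand-in: wrap the reply text so it looks like synthesized audio.
    return f"audio:{text}"


@dataclass
class Workflow:
    """An ordered list of nodes that are run one after another."""

    nodes: List[Node] = field(default_factory=list)

    def add(self, node: Node) -> "Workflow":
        self.nodes.append(node)
        return self

    def run(self, payload: str) -> str:
        # Feed the output of each node into the next one.
        for node in self.nodes:
            payload = node(payload)
        return payload


workflow = (
    Workflow()
    .add(fake_speech_to_text)
    .add(fake_llm)
    .add(fake_speech_synthesis)
)
reply = workflow.run("audio:hello")
print(reply)
```

Because each node shares the same signature, a component can be swapped or a new one inserted without touching the others, which is the property the plug-in model described above relies on.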
Benefits
High availability and low latency worldwide: This solution runs on the ApsaraVideo Real-time Communication network from Alibaba Cloud, which has over 3,200 points of presence (POPs) and uses Quality of Service (QoS) optimization. Users can have smooth audio and video calls with AI agents from anywhere.
Easy to access and debug: You can integrate AI components, such as speech-to-text, large language models, speech synthesis, and self-developed vector databases, as plug-ins into a workflow. This lets you quickly deploy your service and conveniently debug the entire technical solution.
Highly human-like: Alibaba Cloud continuously iterates and optimizes features such as intelligent noise reduction, intelligent interruption, and semantic endpointing to make the agent's interactive behavior more human-like.
Easy to integrate: Alibaba Cloud provides four integration methods to help you build a Real-time Conversational AI system suitable for various application scenarios.
How it works

A user initiates a real-time audio or video call with a cloud-based AI agent using a client software development kit (SDK).
The AI agent receives the user's audio and video input, runs the workflow, and outputs the AI's response.
The AI agent pushes the audio and video stream of the response to the ApsaraVideo Real-time Communication network. The user subscribes to the stream for playback, which completes the conversation between the user and the AI agent.
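The three steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: `InMemoryNetwork`, `Agent`, the channel name `call-1`, and the lambda workflow are hypothetical stand-ins, not the client SDK or the ApsaraVideo Real-time Communication API, and plain strings replace real audio and video streams.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryNetwork:
    """Hypothetical stand-in for the real-time network: per-channel publish/subscribe."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, on_stream: Callable[[str], None]) -> None:
        self._subscribers[channel].append(on_stream)

    def publish(self, channel: str, stream: str) -> None:
        # Deliver the stream to every subscriber of the channel.
        for on_stream in self._subscribers[channel]:
            on_stream(stream)


class Agent:
    """Hypothetical cloud-side agent that runs a workflow and pushes the result."""

    def __init__(self, network: InMemoryNetwork, workflow: Callable[[str], str]) -> None:
        self.network = network
        self.workflow = workflow

    def handle_input(self, channel: str, user_input: str) -> None:
        response = self.workflow(user_input)      # step 2: run the workflow
        self.network.publish(channel, response)   # step 3: push the response stream


network = InMemoryNetwork()
agent = Agent(network, workflow=lambda text: f"agent reply to '{text}'")

received: List[str] = []
# Step 1: the user starts a call and subscribes to the agent's stream.
network.subscribe("call-1", received.append)
# The user's input reaches the agent, which answers over the same channel.
agent.handle_input("call-1", "hello")
print(received)
```

The point of the sketch is the separation of roles: the agent never talks to the user directly, it only publishes to the network, and the user only plays back what they subscribed to.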
Features
Feature | Description |
Real-time workflow | Flexibly orchestrate the agent's AI workflow in a visual editor. |
Custom agent image | Upload an image for the agent you create to display its avatar in voice call scenarios. |
Agent emotion recognition | The agent can recognize the user's current emotion and provide an emotional response. |
Welcome message | Configure a welcome message in the console. When a user starts a conversation with the AI agent, the agent plays the welcome message. |
Proactive broadcast | The business server can use an OpenAPI operation to have the agent proactively output audio and video content to the user. |
Real-time captions | The conversation between the user and the agent can be rendered in real time on the end-user interface. |
Intelligent noise reduction | The AI agent automatically filters noise from the user's side during the conversation. When multiple people on the user's side are speaking at the same time, the voice with the highest volume is prioritized. |
Intelligent interruption | During a conversation, the agent can effectively recognize when the user intends to interrupt. |
Intelligent sentence segmentation | The agent can automatically recognize and segment long or complex sentences to improve text readability and user experience. |
Sentence-by-sentence audio callback | Configure a callback in the console to store real-time audio data in OSS. |
Walkie-talkie mode | Users can set the call mode to walkie-talkie mode at the start of or during a call and interact with the agent by pressing a button. |
ASR hotwords | Define business-related hotwords to improve the AI agent's accuracy in speech recognition. |
Voiceprint denoising | In a multi-person conversation scenario, the agent identifies the voiceprint features of the main speaker to more accurately capture and retain their speech while reducing interference from irrelevant noise. |
Human takeover | If the agent cannot handle a situation or a key decision needs to be made during a conversation, a human can take over to make the decision. |
Graceful shutdown | When the business server needs to stop the agent, the agent is allowed to finish its current utterance before stopping. This avoids abrupt interruptions to the conversation. |
Data archiving | Convert the conversation between the user and the AI agent into text and store it. Businesses can call an API operation to consume this data. Businesses can also store the audio and video data from calls between users and AI agents in Object Storage Service (OSS) or ApsaraVideo VOD (VOD). |
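As an illustration of one feature from the table, the sketch below models the graceful shutdown behavior: a stop request that arrives in the middle of an utterance lets that utterance finish, and only the remaining utterances are skipped. The `SpeakingAgent` class, its methods, and the `stop_after_chunk` trigger are hypothetical and exist only to simulate the timing; the actual service is stopped through the business server, and its API is not shown here.

```python
from typing import List, Optional


class SpeakingAgent:
    """Hypothetical agent that plays utterances chunk by chunk."""

    def __init__(self) -> None:
        self.stop_requested = False
        self.played: List[str] = []

    def request_stop(self) -> None:
        # Only record the request; the current utterance is not cut off.
        self.stop_requested = True

    def speak(self, utterances: List[str], stop_after_chunk: Optional[str] = None) -> None:
        for utterance in utterances:
            # The stop flag is checked between utterances, never inside one.
            if self.stop_requested:
                break
            for chunk in utterance.split():
                self.played.append(chunk)
                if chunk == stop_after_chunk:
                    # Simulate a stop request arriving mid-utterance.
                    self.request_stop()


agent = SpeakingAgent()
agent.speak(
    ["Your order has shipped.", "Anything else today?"],
    stop_after_chunk="order",
)
print(" ".join(agent.played))
```

Checking the flag only at utterance boundaries is the design choice that avoids an abrupt cut: the first sentence is played in full even though the stop request arrived after its second word.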
Billing
Real-time Conversational AI is currently in public preview and is free of charge for a limited time.
FAQ
Does the AI agent need to be deployed on the customer's origin server?
Why does the client report an "AgentNotFound" error when starting a messaging conversation?
Why does the client report an "UnsupportedWorkflowType" error when starting a messaging conversation?
Contact us
For more product inquiries or support, join our DingTalk group by searching for the group ID 106730016696.







