
Intelligent Media Services: Overview

Last Updated: Nov 19, 2025

Real-time Conversational AI enables efficient audio and video interaction between AI agents and users.

Introduction

Real-time Conversational AI is a solution that enables enterprises to build applications for human-AI interaction. You can create a human-like agent in the console within 10 minutes. The agent communicates with end users over the Global Realtime Transport Network (GRTN) and suits various scenarios such as online customer service, AI assistants, AI companions, matchmaking assistants, and virtual teachers.

Capabilities

An AI agent is a virtual user that interacts with end users. You can configure five types of workflows for an agent to meet the needs of different business scenarios:

Audio/Video call

Audio call

Users communicate with intelligent assistants through voice.


Avatar call

Users can make video calls with avatars, which provide more realistic interactions.


Vision call

The agent provides feedback based on the user's voice and camera feed.


Video call

The avatar communicates with end users through a two-way video call.


Take the audio call as an example: you only need to configure three nodes (speech-to-text, LLM, and text-to-speech) to create a voice call workflow, as shown in the sketch below.
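The sketch below shows what such a three-node workflow might look like when expressed as data. The dictionary shape, field names, and node type strings are illustrative assumptions for this document, not the console's actual configuration format.

```python
# Illustrative sketch of a three-node voice call workflow. The dictionary
# shape, field names, and node type strings are assumptions, not the
# console's actual export format.
voice_call_workflow = {
    "name": "voice-call-demo",
    "nodes": [
        {"id": "stt", "type": "speech-to-text"},  # transcribe the user's speech
        {"id": "llm", "type": "llm"},             # generate the reply text
        {"id": "tts", "type": "text-to-speech"},  # synthesize the reply audio
    ],
    # Audio flows in, becomes text, becomes a reply, and becomes audio again.
    "edges": [["stt", "llm"], ["llm", "tts"]],
}
```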

Messaging

Users communicate with the agent through voice or text messages.


For interactive messaging, configure the corresponding nodes in the console.

New features

Semantic endpointing

AI knows the perfect moment to respond.

AI intelligently determines whether the user has finished speaking based on conversational context, preventing it from interrupting users during natural pauses. Powered by Alibaba Cloud's semantic endpointing technology, it achieves natural, low-latency interaction with an accuracy rate of up to 95%.

AI acoustics V2.5

Full-duplex conversation in noisy environments.

AI acoustics V2.5 is available. Compared to V2.0, it reduces interference from far-field human voices, enabling smooth, full-duplex conversations in various scenarios such as offices, cafeterias, shopping malls, and on the street.

Terms

SessionId

SessionId is defined by developers. We recommend that you use it as the unique identifier of chat records. Usage examples:

  • User-associated: When a user chats with the AI agent on mobile or PC, the same SessionId links the conversations across devices and times.

  • Session-associated: When a user initiates multiple sessions, a distinct SessionId for each session keeps their chat records separate.
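As a minimal sketch of both patterns, the helper functions below show one way to construct SessionId values. The ID formats and function names are illustrative assumptions, since the service only requires a developer-supplied identifier.

```python
# Minimal sketch of developer-defined SessionId schemes. The ID formats are
# illustrative assumptions; the service only requires a developer-supplied
# unique identifier for chat records.
import uuid

def user_session_id(user_id: str) -> str:
    # User-associated: reuse one SessionId per user so chats on mobile and
    # PC, and at different times, link to the same chat records.
    return f"user-{user_id}"

def per_call_session_id(user_id: str) -> str:
    # Session-associated: mint a fresh SessionId per session so the chat
    # records of each session stay separate.
    return f"user-{user_id}-{uuid.uuid4().hex}"

print(user_session_id("42"))      # user-42 (same on every device and day)
print(per_call_session_id("42"))  # user-42-<random hex> (unique per session)
```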

Messaging

A workflow where the agent interacts with users through voice or text messages.

Audio call

A workflow where users interact with the agent through voice to obtain timely feedback and service support.

Avatar call

A workflow where users interact with an agent embodied as a virtual character with rich body movements and facial expressions. This enhances authenticity and user engagement in conversations.

Vision call

A workflow where the MLLM-based agent provides feedback based on the users' voice input and camera feeds. This allows users to obtain a more intuitive, efficient, and personalized interaction experience, breaking the limitations of traditional voice or text communication.

Video call

A workflow that combines the advantages of avatar and vision calls to allow users to engage in two-way video calls with the agent. The avatar can understand the camera feeds and provide feedback, enhancing interaction and authenticity.

Interactive messages (IM)

A value-added service of ApsaraVideo Live that enhances message communication in live rooms and improves the interactive experience.

ApsaraVideo Real-time Communication (ARTC)

A value-added service of ApsaraVideo Live that provides a stable, high-quality, and low-latency interactive streaming service based on advanced multimedia technologies and over 3,200 points of presence worldwide. Web Real-Time Communication (WebRTC) technology is used for real-time human-AI communications.

For more information, see Overview of ARTC.

Real-time workflow

A workflow consists of a sequence of nodes, each dedicated to a task such as speech-to-text (STT), text-to-speech (TTS), an LLM, or a self-managed vector database. You can flexibly orchestrate nodes in the console by plugging them in and dragging and dropping them. An AI agent follows the structured workflow to interact with end users.
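Extending the earlier voice call sketch, a self-managed vector database could appear as one more node between speech-to-text and the LLM. Again, the shape, type strings, and endpoint are illustrative assumptions rather than the console's actual format.

```python
# Illustrative extension of the voice call workflow with a retrieval node
# backed by a self-managed vector database (all names are assumptions).
rag_voice_workflow = {
    "name": "voice-call-with-retrieval",
    "nodes": [
        {"id": "stt", "type": "speech-to-text"},
        {"id": "vdb", "type": "vector-database",       # look up business knowledge
         "endpoint": "https://vdb.example.internal"},  # hypothetical endpoint
        {"id": "llm", "type": "llm"},                  # answer with retrieved context
        {"id": "tts", "type": "text-to-speech"},
    ],
    "edges": [["stt", "vdb"], ["vdb", "llm"], ["llm", "tts"]],
}
```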

AI agent

An AI agent is a human-like virtual user that interacts with end users. You can create one or use the default agent.

Benefits

  • High availability and low latency worldwide

    Real-time Conversational AI relies on Alibaba Cloud's Global Realtime Transport Network (GRTN), which covers more than 3,200 POPs worldwide and applies Quality of Service (QoS) optimization, so users can interact with agents from anywhere in the world.

  • Easy access and debugging

    You can integrate AI components, such as STT services, LLMs, speech synthesis services, and self-developed vector databases, into workflows as plug-ins to quickly develop and debug your business solutions.

  • Highly human-like

    Alibaba Cloud continuously iterates and optimizes features such as smart noise reduction, intelligent interruption, and intelligent sentence segmentation to make AI agents behave more like humans.

  • Easy integration

    Alibaba Cloud provides four integration methods to meet application development requirements in different scenarios.

How it works

Real-time Conversational AI works as follows:

  1. A user initiates a real-time audio or video call request to a cloud-hosted AI agent by using a client SDK.

  2. After the agent receives the request from the user, the workflow starts and an AI response is generated.

  3. The agent ingests the audio or video stream that contains the response into the ARTC network. The user subscribes to the stream for playback. This establishes the conversation between the user and the agent.
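The plain-Python sketch below mirrors these three steps. The function names are illustrative placeholders, not actual SDK or API calls; the sketch only shows the order of operations.

```python
# Plain-Python mirror of the three steps above. All names are illustrative
# placeholders, not actual SDK or API calls.

def initiate_call(user_id: str) -> dict:
    # Step 1: the client SDK sends a real-time call request to the
    # cloud-hosted AI agent.
    return {"caller": user_id, "channel": "demo-channel"}

def run_workflow(call: dict, utterance: str) -> str:
    # Step 2: the agent's workflow runs (for example, STT -> LLM -> TTS)
    # and generates an AI response.
    return f"response for {call['caller']}: you said '{utterance}'"

def stream_response(response: str) -> None:
    # Step 3: the agent ingests the response stream into the ARTC network,
    # and the user subscribes to the stream for playback.
    print("playing:", response)

call = initiate_call("user-123")
stream_response(run_workflow(call, "hello"))
```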

Features


Real-time workflow

You can orchestrate a workflow in the console. A workflow may contain the following nodes:

  • Speech-to-text: Alibaba Cloud Qwen model is integrated. 

  • Text-to-speech

    • Alibaba Cloud Qwen model is integrated.

    • Your self-developed speech synthesis module can be integrated based on standard protocols.

    • You can integrate MiniMax models as third-party plug-ins.

  • LLM

    • Alibaba Cloud Qwen model is integrated.

    • AI models from Alibaba Cloud Model Studio can be integrated.

    • Your self-developed LLM can be integrated based on the OpenAI API specification (see the sketch at the end of this section).

  • Avatar

    • You can integrate avatars from Faceunity or Alibaba Cloud Model Studio.

  • Video frame extraction

    • Extract frames from camera feeds for model understanding.

  • Multi-modal LLM (MLLM)

    • Alibaba Cloud Qwen model is integrated.

    • Your self-developed MLLM can be integrated based on the OpenAI API specification.

Custom agent profile

Upload an image for the AI agent. The image is displayed during voice calls. 

Emotion recognition

Recognize users' emotions and generate empathetic responses.

Welcome message

Configure the welcome message in the IMS console. When the user starts a conversation, the agent broadcasts the welcome message first. 

Proactive messages

Configure the business server to allow the agent to proactively push audio and video content to the user by using OpenAPI. 

Live subtitles

The conversation content can be presented in real time on the user interface. 

Intelligent noise reduction

Automatically filter the noise from the user side during a conversation. If multiple users are speaking at the same time, the voice with the highest volume is preferentially collected. 

Intelligent interruption

Recognize users' intention to interrupt the conversation. 

Intelligent sentence segmentation

Automatically identify and segment long or complex sentences to improve text readability and user experience. 

Audio sentence callback

You can configure this callback in the console to store audio data in Object Storage Service (OSS). 

Push-to-talk mode

The user can switch to push-to-talk mode at the beginning of or during a call and interact with the agent by pressing a button. 

ASR hotwords

You can define business-related hotwords to improve the speech recognition accuracy of intelligent agents. 

Voiceprint-based noise suppression

In a multi-speaker scenario, the agent can identify the voiceprint characteristics of the main speaker to accurately capture their speech and minimize interference from background noise.

Human takeover

When the agent encounters situations beyond its capabilities or requires critical decision-making, human agents can take over the conversations with users.

Graceful shutdown

When the business server stops the agent, the agent can complete the current sentence. This prevents abrupt interruptions of conversations. 

Data archiving

The conversations between the agent and users are converted into text for storage. You can call API operations to consume the data. In addition, you can store the audio and video data of calls in OSS or ApsaraVideo VOD.
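As mentioned in the real-time workflow feature above, a self-developed LLM plugs in through the OpenAI API specification. The following is a minimal sketch of an OpenAI-compatible chat completions endpoint built with FastAPI; the echoed reply is a placeholder for your own inference code, and details such as required fields and streaming support should be verified against the integration guide.

```python
# Minimal sketch of an OpenAI-compatible chat completions endpoint.
# Run with: uvicorn server:app --port 8000
# The echoed reply is a placeholder for your own model's inference code.
import time
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: list[Message]

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    user_text = req.messages[-1].content
    # Respond in the standard chat completion shape that OpenAI-spec
    # clients expect.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": f"You said: {user_text}"},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```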

Billing

Real-time Conversational AI is in public preview and is currently free of charge.


Contact us

For more information and technical support, contact us by joining the DingTalk group (ID: 106730016696).