Integration solution with UI - Intelligent Media Services

This topic describes the Real-time Conversational AI solution that provides UI components.

Overview

This solution is based on AICallKit SDK and provides UI components for audio and video applications. You can flexibly reuse functional modules of AUI Kits based on your business requirements to quickly bring real-time and interactive AI to your app. This solution is designed for enterprises and developers who want to build Real-time Conversational AI scenarios in an efficient and quick manner. The functional modules of AUI Kits significantly reduce the development time and costs and ensure app quality and stability. For more information about how to integrate AUI Kits for Real-time Conversational AI, see the following topics:

For more information about server-side development, see Server-side integration and API references.

Features

Feature	Description
Real-time workflow	You can orchestrate a workflow in the console. A workflow may contain the following nodes: Speech-to-text: Alibaba Cloud Qwen model is integrated. Text-to-speech Alibaba Cloud Qwen model is integrated. Your self-developed speech synthesis module can be integrated based on standard protocols. You can integrate the model of MiniMax as a third-party plug-in. LLM Alibaba Cloud Qwen model is integrated. AI models from Alibaba Cloud Model Studio can be integrated. Your self-developed LLM can be integrated based on OpenAI standards. Avatar You can integrate the avatar from Faceunity or Alibaba Cloud Model Studio. Video frame extraction Extract frames from camera feeds for model understanding. Multi-modal LLM (MLLM) Alibaba Cloud Qwen model is integrated. Your self-developed MLLM can be integrated based on OpenAI standards.
Custom agent profile	Upload an image for the AI agent. The image is displayed during voice calls.
Emotion recognition	Recognize users' emotions and generate empathetic responses.
Welcome message	Configure the welcome message in the IMS console. When the user starts a conversation, the agent broadcasts the welcome message first.
Proactive messages	Configure the business server to allow the agent to proactively push audio and video content to the user by using OpenAPI.
Live subtitles	The conversation content can be presented in real time on the user interface.
Intelligent noise reduction	Automatically filter the noise from the user side during a conversation. If multiple users are speaking at the same time, the voice with the highest volume is preferentially collected.
Intelligent interruption	Recognize the conversation interruption intention of users.
Intelligent sentence segmentation	Automatically identify and segment long or complex sentences to improve text readability and user experience.
Audio sentence callback	You can configure this callback in the console to store audio data in Object Storage Service (OSS).
Push-to-talk mode	The user can set the call mode to the push-to-talk mode at the beginning of or during a call, and interact with the agent by pressing a button.
ASR hotwords	You can define business-related hotwords to improve the speech recognition accuracy of intelligent agents
Voiceprint-based noise suppression	In a multi-speaker scenario, the agent can identify the voiceprint characteristics of the main speaker to accurately capture their speech and minimize interference from background noise.
Human takeover	When the agent encounters situations beyond its capabilities or requires critical decision-making, human agents can take over the conversations with users.
Graceful shutdown	When the business server stops the agent, the agent can complete the current sentence. This prevents abrupt interruptions of conversations.
Data archiving	The conversations between the agent and users are converted into text for storage. You can call API operations to consume the data. In addition, you can store audio and video data of calls OSS or ApsaraVideo VOD.