Intelligent Media Services: Integration solution with UI

Last Updated: Nov 07, 2025

This topic describes the Real-time Conversational AI solution that provides UI components.

Overview

This solution is based on the AICallKit SDK and provides UI components for audio and video applications. You can reuse the functional modules of AUI Kits as your business requires to quickly bring real-time, interactive AI to your app. The solution is designed for enterprises and developers who want to build Real-time Conversational AI scenarios quickly and efficiently. The functional modules of AUI Kits reduce development time and costs and help ensure app quality and stability. For more information about how to integrate AUI Kits for Real-time Conversational AI, see the following topics:

For more information about server-side development, see Server-side integration and API references.

Features

Feature

Description

Real-time workflow

You can orchestrate a workflow in the console. A workflow may contain the following nodes:

  • Speech-to-text

    • The Alibaba Cloud Qwen model is integrated.

  • Text-to-speech

    • The Alibaba Cloud Qwen model is integrated.

    • Your self-developed speech synthesis module can be integrated based on standard protocols.

    • The MiniMax model can be integrated as a third-party plug-in.

  • LLM

    • The Alibaba Cloud Qwen model is integrated.

    • AI models from Alibaba Cloud Model Studio can be integrated.

    • Your self-developed LLM can be integrated based on OpenAI standards.

  • Avatar

    • An avatar from Faceunity or Alibaba Cloud Model Studio can be integrated.

  • Video frame extraction

    • Frames are extracted from camera feeds for model understanding.

  • Multi-modal LLM (MLLM)

    • The Alibaba Cloud Qwen model is integrated.

    • Your self-developed MLLM can be integrated based on OpenAI standards.
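
Integration "based on OpenAI standards" is commonly understood to mean the OpenAI Chat Completions request and streaming response format. The following Python sketch assumes that interpretation and is not the documented contract of this service. It shows how a self-developed model endpoint might frame streamed tokens as server-sent events. The names `sse_chunk` and `stream_reply`, the ID `chatcmpl-demo`, and the model name `my-llm` are hypothetical.

```python
import json
import time
from typing import Iterator, List, Optional


def sse_chunk(completion_id: str, model: str, text: Optional[str],
              finish_reason: Optional[str] = None) -> str:
    """Frame one delta as an OpenAI-style chat.completion.chunk SSE event."""
    # An empty delta is used for the final chunk that only carries finish_reason.
    delta = {"content": text} if text is not None else {}
    payload = {
        "id": completion_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }
    # Each SSE event is a "data: " line followed by a blank line.
    return "data: " + json.dumps(payload, ensure_ascii=False) + "\n\n"


def stream_reply(completion_id: str, model: str, tokens: List[str]) -> Iterator[str]:
    """Yield one event per token, then a stop chunk and the [DONE] sentinel."""
    for token in tokens:
        yield sse_chunk(completion_id, model, token)
    yield sse_chunk(completion_id, model, None, finish_reason="stop")
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for event in stream_reply("chatcmpl-demo", "my-llm", ["Hello", ", ", "world"]):
        print(event, end="")
```

Streaming token by token matters in a voice workflow because the text-to-speech node can start synthesizing before the full reply is available.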

Custom agent profile

Upload an image for the AI agent. The image is displayed during voice calls. 

Emotion recognition

Recognize users' emotions and generate empathetic responses.

Welcome message

Configure the welcome message in the IMS console. When the user starts a conversation, the agent broadcasts the welcome message first. 

Proactive messages

Configure the business server to call OpenAPI so that the agent can proactively push audio and video content to the user.

Live subtitles

The conversation content can be presented in real time on the user interface. 

Intelligent noise reduction

Automatically filter the noise from the user side during a conversation. If multiple users are speaking at the same time, the voice with the highest volume is preferentially collected. 
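
The "highest volume" rule can be illustrated with a simple loudness comparison. The following sketch is illustrative only and is not the algorithm that the service uses. It assumes that each speaker contributes one frame of 16-bit PCM samples and compares the frames by root mean square (RMS) amplitude. The function names and speaker IDs are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def rms(samples: Sequence[int]) -> float:
    """Root mean square amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudest_speaker(frames: Dict[str, Sequence[int]]) -> Optional[str]:
    """Return the ID of the speaker whose current frame has the highest RMS."""
    if not frames:
        return None
    return max(frames, key=lambda speaker: rms(frames[speaker]))


if __name__ == "__main__":
    frames = {"alice": [100, -120, 90], "bob": [2000, -1800, 2100]}
    print(loudest_speaker(frames))
```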

Intelligent interruption

Recognize when a user intends to interrupt the conversation.

Intelligent sentence segmentation

Automatically identify and segment long or complex sentences to improve text readability and user experience. 
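
As a point of comparison, the simplest form of sentence segmentation is a punctuation-based split. The following sketch is a naive baseline under that assumption and does not represent how the service segments text. The function name `split_sentences` is hypothetical.

```python
import re
from typing import List

# Split after ., ! or ? when whitespace follows (so "3.5" stays intact),
# or directly after full-width CJK sentence punctuation.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at terminal punctuation."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("It is 3.5 km away. Turn left! 好的。谢谢"):
        print(sentence)
```

A rule like this fails on abbreviations and unpunctuated speech transcripts, which is where model-based segmentation is useful.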

Audio sentence callback

You can configure this callback in the console to store audio data in Object Storage Service (OSS). 

Push-to-talk mode

The user can switch to push-to-talk mode at the beginning of or during a call and interact with the agent by pressing a button.
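
The mode switch can be pictured as a gate on the microphone stream. The following sketch is a conceptual model only and does not use AICallKit SDK calls. `CallMode` and `MicGate` are hypothetical names introduced for illustration.

```python
from enum import Enum


class CallMode(Enum):
    ALWAYS_ON = "always_on"
    PUSH_TO_TALK = "push_to_talk"


class MicGate:
    """Decides whether captured audio should be sent to the agent."""

    def __init__(self, mode: CallMode = CallMode.ALWAYS_ON) -> None:
        self.mode = mode
        self.button_down = False

    def set_mode(self, mode: CallMode) -> None:
        # The mode can change at the start of a call or while it is in progress.
        self.mode = mode
        self.button_down = False

    def press(self) -> None:
        self.button_down = True

    def release(self) -> None:
        self.button_down = False

    def should_send(self) -> bool:
        if self.mode is CallMode.ALWAYS_ON:
            return True
        # In push-to-talk mode, audio is sent only while the button is held.
        return self.button_down


if __name__ == "__main__":
    gate = MicGate()
    gate.set_mode(CallMode.PUSH_TO_TALK)
    gate.press()
    print(gate.should_send())
```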

ASR hotwords

You can define business-related hotwords to improve the speech recognition accuracy of agents.

Voiceprint-based noise suppression

In a multi-speaker scenario, the agent can identify the voiceprint characteristics of the main speaker to accurately capture their speech and minimize interference from background noise.

Human takeover

When the agent encounters situations that are beyond its capabilities or that require critical decision-making, human agents can take over the conversations with users.

Graceful shutdown

When the business server stops the agent, the agent can complete the current sentence. This prevents abrupt interruptions of conversations. 
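
The behavior can be modeled as checking the stop request only at sentence boundaries. The following sketch assumes that model and is not the service's implementation. The function `speak_until_stopped` and its parameters are hypothetical.

```python
import threading
from typing import Callable, Iterable


def speak_until_stopped(sentences: Iterable[str],
                        stop_requested: threading.Event,
                        speak: Callable[[str], None]) -> int:
    """Speak sentences in order and return how many were spoken.

    The stop flag is checked only between sentences, so a sentence that
    has already started is always completed.
    """
    spoken = 0
    for sentence in sentences:
        if stop_requested.is_set():
            break
        speak(sentence)
        spoken += 1
    return spoken


if __name__ == "__main__":
    stop = threading.Event()
    heard = []

    def speak(sentence: str) -> None:
        heard.append(sentence)
        if len(heard) == 2:
            stop.set()  # Simulates a stop request that arrives mid-sentence.

    print(speak_until_stopped(["One.", "Two.", "Three."], stop, speak), heard)
```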

Data archiving

The conversations between the agent and users are converted into text for storage. You can call API operations to consume the data. In addition, you can store the audio and video data of calls in OSS or ApsaraVideo VOD.