Intelligent Media Services: AI companionship

Last Updated: Dec 15, 2025

This topic provides a solution guide to help you develop and launch AI companionship applications.

Background

AI companionship products have seen a recent surge in innovation and diversity, spanning genres such as role-playing, emotional chat, and psychological therapy. While many current AI chat applications are based on offline text or voice messages in IM-style interfaces, the release of models such as GPT-4o is driving the adoption of multimodal technology for real-time voice and video interactions, creating more immersive and authentic virtual entertainment experiences.

Alibaba Cloud's solution integrates leading third-party LLMs and TTS technologies to enable real-time, interactive companionship with dynamic, evolving storylines where users can both consume and create content. This provides users with a personalized companionship experience while inspiring their own creativity.

Options

Interaction modes

Real-time Conversational AI offers two interaction modes for AI companionship scenarios. Choose a mode by specifying the call type when you create your agent, then integrate the corresponding SDK. To preview both modes before you integrate, try the demo. To integrate the service, see Quick start for audio/video calls.

Audio-only call

  • Example: [screenshot]

  • Interaction: the user sends audio and the AI companion responds with audio.

  • Cost: Low

Avatar call

  • Example: [screenshot]

  • Interaction: the user sends audio and the AI companion responds with video.

  • Cost: Medium

Client SDKs

For detailed SDK integration instructions, see Developer guide.

Web

Recommended for the following environments:

  • Desktop browsers, such as Chrome.

  • Mobile H5, such as Alipay H5, DingTalk H5, and WeChat mini program H5.

  • In-app WebViews.

Note
  • Use on native mobile browsers is not recommended due to potential WebRTC compatibility issues on some devices.

  • Direct integration with native WeChat Mini Program components is not supported. Use the H5 version within a mini program instead.

Android/iOS

Recommended for native applications on Android or iOS.

Other

For development on Windows or macOS desktops, contact us by joining our DingTalk group (ID: 106730016696).

Basic features

Personalized calls

Alibaba Cloud provides APIs that let you tailor the call experience for each user. To do so, configure the call startup parameters when you initiate a call.

The following settings are available. Each entry states whether the setting can be modified during a call.

  • LLM prompt: Pass user-specific information as part of the initial prompt so that the AI can provide a more authentic and personal companionship experience. Modifiable during call: Yes.

  • ASR language: Set the speech recognition language (such as Chinese or English). Modifiable during call: Yes.

  • TTS voice: Set the AI's voice and timbre. Modifiable during call: Yes.

  • Avatar: If you use a VideoAgent with multiple avatars, specify which one to use for the call. Modifiable during call: No.

  • Welcome message: Set a custom welcome message for each user, such as "Hi, Alice, it's great to see you again!" Modifiable during call: No.
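As a rough illustration of how an application might organize these per-user settings, the sketch below models them as a small data structure and enforces the "modifiable during call" rule described above. The class, field, and function names are hypothetical and chosen for illustration only; they are not the actual startup parameter names, which are documented in the Developer guide.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence

# Settings that the list above marks as modifiable during a call.
MODIFIABLE_DURING_CALL = {"llm_prompt", "asr_language", "tts_voice"}


@dataclass(frozen=True)
class CallStartupConfig:
    """Hypothetical container for per-user call settings (names are illustrative)."""
    llm_prompt: str
    asr_language: str
    tts_voice: str
    avatar_id: Optional[str] = None        # only meaningful for avatar calls
    welcome_message: Optional[str] = None


def build_startup_config(user_name: str, interests: Sequence[str]) -> CallStartupConfig:
    """Assemble settings for one user before the call starts."""
    prompt = (
        f"You are a friendly companion talking with {user_name}. "
        f"They enjoy: {', '.join(interests)}."
    )
    return CallStartupConfig(
        llm_prompt=prompt,
        asr_language="en",                 # assumed language code format
        tts_voice="voice_a",               # placeholder voice identifier
        welcome_message=f"Hi, {user_name}, it's great to see you again!",
    )


def apply_update(config: CallStartupConfig, **changes: str) -> CallStartupConfig:
    """Return a copy with in-call changes applied; reject settings fixed at startup."""
    locked = set(changes) - MODIFIABLE_DURING_CALL
    if locked:
        raise ValueError(f"Cannot change during a call: {sorted(locked)}")
    return replace(config, **changes)
```

For example, `apply_update(config, tts_voice="voice_b")` returns an updated copy, while `apply_update(config, welcome_message="Hello")` raises `ValueError` because the welcome message can only be set when the call starts.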

Pass user information to the model

When multiple users are online, the LLM needs to know which user each input comes from. Real-time Conversational AI can pass custom information, such as a UserID, through to the model. For details, see Pass through business parameters to Alibaba Cloud Model Studio.
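The idea of passing a user identifier through to the model can be sketched as follows. This is a minimal illustration that assumes the custom information travels as a JSON string; the key names (`user_id`, `session`) and both helper functions are hypothetical, and the real parameter format is described in the pass-through topic referenced above.

```python
import json


def build_passthrough_params(user_id: str, **extra: str) -> str:
    """Serialize custom business parameters (hypothetical key names)."""
    if not user_id:
        raise ValueError("user_id must be non-empty")
    params = {"user_id": user_id, **extra}
    return json.dumps(params, ensure_ascii=False, sort_keys=True)


def tag_user_input(passthrough_json: str, text: str) -> dict:
    """What a model-side handler might do: attribute an utterance to its user."""
    params = json.loads(passthrough_json)
    return {"user_id": params["user_id"], "text": text}
```

With this shape, each utterance that reaches the model side carries the identifier of the user who produced it, so concurrent conversations can be kept apart.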

Detect and handle user silence

You can monitor the timestamp of each user utterance by listening for the intent_recognized callback. See Agent callbacks for details. This allows you to detect when a user has been silent for an extended period and respond in a way that suits your application.
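A minimal sketch of silence detection based on utterance timestamps is shown here. It assumes that your handler for the intent_recognized callback can extract a timestamp in seconds; the `SilenceMonitor` class, its method names, and the 30-second default threshold are illustrative assumptions, not part of the service API.

```python
import time
from typing import Optional


class SilenceMonitor:
    """Tracks the most recent user utterance and reports extended silence."""

    def __init__(self, threshold_seconds: float = 30.0):
        self.threshold_seconds = threshold_seconds
        self.last_utterance_at: Optional[float] = None

    def on_intent_recognized(self, timestamp: float) -> None:
        """Call this from your intent_recognized handler (timestamp in seconds)."""
        self.last_utterance_at = timestamp

    def silent_for(self, now: Optional[float] = None) -> float:
        """Seconds since the last utterance, or 0.0 if the user has not spoken yet."""
        if self.last_utterance_at is None:
            return 0.0
        current = time.time() if now is None else now
        return max(0.0, current - self.last_utterance_at)

    def is_silent(self, now: Optional[float] = None) -> bool:
        """True once the silence has lasted at least the configured threshold."""
        return self.silent_for(now) >= self.threshold_seconds
```

An application could poll `is_silent()` on a timer and trigger whatever follow-up it has chosen. The `now` argument exists so that the logic can be tested without waiting in real time.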

Conversation archiving

You can save the audio data and text transcripts generated during the entire companionship session. For instructions, see Data archiving.

Advanced features

Spoken language assessment (per-sentence)

For scenarios where you want to evaluate a user's pronunciation, Real-time Conversational AI can record each user utterance as a separate audio file. These files are saved in real time to the Object Storage Service (OSS) bucket that you specify, and you can then use them for pronunciation assessment.

Note

Real-time Conversational AI provides the per-sentence audio recording capability but does not include the assessment feature itself. To configure per-sentence audio callbacks, see Agent callbacks.
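Because the assessment itself is left to you, one way to structure it is to treat each per-sentence recording as a record and hand it to a scoring function that you supply. The sketch below assumes a hypothetical record shape (`sentence_id`, `oss_object_key`, `transcript`); the actual callback fields are documented in Agent callbacks, and the scorer stands in for whatever assessment engine you choose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class SentenceRecording:
    """Hypothetical record for one per-sentence audio file."""
    sentence_id: int
    oss_object_key: str   # where the audio file was saved in your OSS bucket
    transcript: str


def assess_recordings(
    recordings: Iterable[SentenceRecording],
    scorer: Callable[[SentenceRecording], float],
) -> Dict[int, float]:
    """Score each sentence in order with a caller-supplied assessment function."""
    ordered = sorted(recordings, key=lambda r: r.sentence_id)
    return {r.sentence_id: scorer(r) for r in ordered}
```

Keeping the scorer as a parameter means the recording pipeline does not depend on any particular assessment engine, so the engine can be replaced without changing how recordings are collected.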