This topic provides a solution guide to help you develop and launch AI companionship applications.
Background
AI companionship products have recently seen a surge in innovation and diversity, spanning genres such as role-playing, emotional chat, and psychological therapy. While many current AI chat applications rely on asynchronous text or voice messages in IM-style interfaces, the release of models such as GPT-4o is driving the adoption of multimodal technology for real-time voice and video interaction, creating more immersive and authentic virtual entertainment experiences.
Alibaba Cloud's solution integrates leading third-party LLMs and TTS technologies to enable real-time, interactive companionship with dynamic, evolving storylines where users can both consume and create content. This provides users with a personalized companionship experience while inspiring their own creativity.
Options
Interaction modes
Real-time Conversational AI offers two interaction modes for AI companionship scenarios. Choose a mode by specifying the call type when you create your agent, then integrate the corresponding SDK. You can try the demo first to compare the two modes. To integrate the service, see Quick start for audio/video calls.
| | Audio-only call | Avatar call |
| --- | --- | --- |
| Example | (image not shown) | (image not shown) |
| Interaction | (not shown) | (not shown) |
| Cost | Low | Medium |
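As a rough illustration of the decision above, the sketch below models picking a call type when creating an agent. The call-type names, fields, and validation are assumptions for illustration, not the actual Real-time Conversational AI API.

```python
# Hypothetical sketch: call-type names and fields are illustrative
# assumptions, not the real agent-creation API.
from dataclasses import dataclass

CALL_TYPES = {
    "audio": {"description": "Audio-only call", "relative_cost": "low"},
    "avatar": {"description": "Avatar call", "relative_cost": "medium"},
}

@dataclass
class AgentConfig:
    call_type: str

    def __post_init__(self):
        # Reject call types the service does not offer.
        if self.call_type not in CALL_TYPES:
            raise ValueError(f"unknown call type: {self.call_type}")

config = AgentConfig(call_type="audio")
print(CALL_TYPES[config.call_type]["relative_cost"])  # low
```

In a real integration, the chosen call type would be sent with the agent-creation request and determine which client SDK features you integrate.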
Client SDKs
For detailed SDK integration instructions, see Developer guide.
| SDK | Description |
| --- | --- |
| Web SDK | Recommended for web applications. |
| Android/iOS SDK | Recommended for native applications on Android or iOS. |
| Other | For development on Windows or macOS desktops, contact us by joining our DingTalk group (ID: 106730016696). |
Basic features
Personalized calls
Alibaba Cloud provides a rich set of APIs that allow you to create a tailored call experience for each user. You can implement this by configuring call startup parameters when initiating a call.
| Setting | Description | Modifiable during call? |
| --- | --- | --- |
| LLM prompt | Pass user-specific information as part of the initial prompt so the AI can provide a more authentic, personal companionship experience. | Yes |
| ASR language | Set the speech recognition language (such as Chinese or English). | Yes |
| TTS voice | Set the AI's voice and timbre. | Yes |
| Avatar | If you use an avatar call, specify the avatar to display. | No |
| Welcome message | Set a custom welcome message for each user, such as "Hi, Alice, it's great to see you again!" | No |
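To show how these settings fit together, here is a minimal sketch of assembling per-user startup parameters. The field names and the user-record shape are illustrative assumptions, not the documented parameter schema.

```python
# Hypothetical sketch: field names are assumptions, not the actual
# call-startup parameter schema.
def build_startup_params(user):
    return {
        "llm_prompt": (
            "You are a warm, attentive companion. "
            f"The user's name is {user['name']}; "
            f"their interests: {', '.join(user['interests'])}."
        ),
        "asr_language": user.get("language", "en"),  # modifiable during the call
        "tts_voice": user.get("voice", "default"),   # modifiable during the call
        # Fixed once the call starts:
        "welcome_message": f"Hi, {user['name']}, it's great to see you again!",
    }

params = build_startup_params(
    {"name": "Alice", "interests": ["hiking", "jazz"], "language": "en"}
)
print(params["welcome_message"])  # Hi, Alice, it's great to see you again!
```

The point of the sketch is the split between per-user fields you can update mid-call (prompt, ASR language, TTS voice) and fields fixed at call start (avatar, welcome message).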
Pass user information to the model
When multiple users are online, the LLM needs to distinguish which input comes from which user. Real-time Conversational AI can pass custom information, such as a UserID, through to the model. For details, see Pass through business parameters to Alibaba Cloud Model Studio.
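The idea of tagging each utterance with a UserID can be sketched as follows. The payload shape and field names are assumptions for illustration; the actual pass-through format is defined in the Model Studio documentation referenced above.

```python
# Hypothetical sketch: the payload layout is an assumption, not the
# documented business-parameter pass-through format.
import json

def tag_utterance(user_id, text):
    # Attach a custom UserID alongside the user's text so the model
    # can tell speakers apart in a multi-user room.
    return json.dumps({"custom_params": {"user_id": user_id}, "text": text})

payload = tag_utterance("u-1001", "What should we play next?")
```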
Detect and handle user silence
You can monitor the timestamp of each user utterance by listening for the intent_recognized callback (see Agent callbacks for details). This lets you handle cases where a user is silent for an extended period. Common actions include:
End the conversation: See StopAIAgentInstance.
Play a reminder: Have the AI play a reminder after X seconds of silence. See Vocalize notifications from AI agent.
Trigger the next question: Send a text input to the LLM to have it ask the next question. See Send text input to an LLM via API.
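The pattern above, tracking the last utterance timestamp and escalating as silence grows, can be sketched as a small watchdog. The class name, thresholds, and action strings are illustrative assumptions; only the callback name intent_recognized comes from the service.

```python
# Hypothetical silence watchdog: thresholds and action names are
# illustrative assumptions.
import time

class SilenceMonitor:
    def __init__(self, remind_after=15.0, end_after=60.0):
        self.remind_after = remind_after
        self.end_after = end_after
        self.last_utterance = time.monotonic()

    def on_intent_recognized(self):
        # Call this from the intent_recognized callback.
        self.last_utterance = time.monotonic()

    def check(self, now=None):
        # Decide what to do based on how long the user has been silent.
        idle = (now or time.monotonic()) - self.last_utterance
        if idle >= self.end_after:
            return "stop_agent"      # e.g. call StopAIAgentInstance
        if idle >= self.remind_after:
            return "play_reminder"   # e.g. vocalize a reminder, or send
                                     # text to the LLM for a next question
        return None

monitor = SilenceMonitor()
monitor.last_utterance = time.monotonic() - 20  # simulate 20 s of silence
print(monitor.check())  # play_reminder
```

You would call `check()` on a periodic timer; which action each threshold maps to (reminder, next question, or ending the call) is a product decision.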
Conversation archiving
You can save the audio data and text transcripts generated during the entire companionship session. For instructions, see Data archiving.
Advanced features
Spoken language assessment (per-sentence)
For scenarios where you want to evaluate a user's pronunciation, Real-time Conversational AI can record each user utterance as a separate audio file. These files are saved in real time to an Object Storage Service (OSS) bucket that you specify and can then be used for pronunciation assessment.
Real-time Conversational AI provides the per-sentence recording capability but does not include the assessment itself. To configure per-sentence audio callbacks, see Agent callbacks.
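When consuming per-sentence recordings from OSS, a deterministic key scheme makes it easy to fetch a session's utterances in order for assessment. The layout below is an assumption for illustration, not the service's actual naming scheme.

```python
# Hypothetical sketch: the OSS key layout is an assumption, not the
# service's actual per-sentence recording naming scheme.
def recording_key(session_id, utterance_index, started_at_ms):
    # Zero-pad the index so keys sort in utterance order.
    return f"recordings/{session_id}/{utterance_index:04d}-{started_at_ms}.wav"

key = recording_key("sess-42", 7, 1718000000000)
print(key)  # recordings/sess-42/0007-1718000000000.wav
```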

