This topic provides a solution guide to help you develop and launch AI companionship applications.
Background
AI companionship products have recently seen a surge in innovation and diversity, spanning genres such as role-playing, emotional chat, and psychological therapy. While many current AI chat applications rely on asynchronous text or voice messages in IM-style interfaces, the release of models such as GPT-4o is driving the adoption of multimodal technology for real-time voice and video interaction, creating more immersive and authentic virtual entertainment experiences.
Alibaba Cloud's solution integrates leading third-party LLMs and TTS technologies to enable real-time, interactive companionship with dynamic, evolving storylines where users can both consume and create content. This provides users with a personalized companionship experience while inspiring their own creativity.
Options
Interaction modes
Real-time Conversational AI offers two interaction modes for AI companionship scenarios. Choose a mode by specifying the call type when you create your agent, then integrate the corresponding SDK. You can try the demo first to compare the two modes. To integrate the service, see Quick start for audio/video calls.
| | Audio-only call | Avatar call |
| --- | --- | --- |
| Example | (image not shown) | (image not shown) |
| Interaction | (not shown) | (not shown) |
| Cost | Low | Medium |
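As a rough illustration of the decision above, the sketch below models picking a call type when creating an agent. The call-type names, fields, and validation are assumptions for illustration, not the actual Real-time Conversational AI API.

```python
# Hypothetical sketch: call-type names and fields are illustrative
# assumptions, not the real agent-creation API.
from dataclasses import dataclass

CALL_TYPES = {
    "audio": {"description": "Audio-only call", "relative_cost": "low"},
    "avatar": {"description": "Avatar call", "relative_cost": "medium"},
}

@dataclass
class AgentConfig:
    call_type: str

    def __post_init__(self):
        # Reject call types the service does not offer.
        if self.call_type not in CALL_TYPES:
            raise ValueError(f"unknown call type: {self.call_type}")

config = AgentConfig(call_type="audio")
print(CALL_TYPES[config.call_type]["relative_cost"])  # low
```

In a real integration, the chosen call type would be sent with the agent-creation request and determine which client SDK features you integrate.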
Client SDKs
For detailed SDK integration instructions, see Developer guide.
| SDK | Description |
| --- | --- |
| Web SDK | Recommended for web applications. |
| Android/iOS SDK | Recommended for native applications on Android or iOS. |
| Other | For development on Windows or macOS desktops, contact us by joining our DingTalk group (ID: 106730016696). |
Basic features
Personalized calls
Alibaba Cloud provides a rich set of APIs that allow you to create a tailored call experience for each user. You can implement this by configuring call startup parameters when initiating a call.
| Setting | Description | Modifiable during call? |
| --- | --- | --- |
| LLM prompt | Pass user-specific information as part of the initial prompt so the AI can provide a more authentic, personal companionship experience. | Yes |
| ASR language | Set the speech recognition language (such as Chinese or English). | Yes |
| TTS voice | Set the AI's voice and timbre. | Yes |
| Avatar | If you use an avatar call, specify the avatar to display. | No |
| Welcome message | Set a custom welcome message for each user, such as "Hi, Alice, it's great to see you again!" | No |
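To show how these settings fit together, here is a minimal sketch of assembling per-user startup parameters. The field names and the user-record shape are illustrative assumptions, not the documented parameter schema.

```python
# Hypothetical sketch: field names are assumptions, not the actual
# call-startup parameter schema.
def build_startup_params(user):
    return {
        "llm_prompt": (
            "You are a warm, attentive companion. "
            f"The user's name is {user['name']}; "
            f"their interests: {', '.join(user['interests'])}."
        ),
        "asr_language": user.get("language", "en"),  # modifiable during the call
        "tts_voice": user.get("voice", "default"),   # modifiable during the call
        # Fixed once the call starts:
        "welcome_message": f"Hi, {user['name']}, it's great to see you again!",
    }

params = build_startup_params(
    {"name": "Alice", "interests": ["hiking", "jazz"], "language": "en"}
)
print(params["welcome_message"])  # Hi, Alice, it's great to see you again!
```

The point of the sketch is the split between per-user fields you can update mid-call (prompt, ASR language, TTS voice) and fields fixed at call start (avatar, welcome message).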
Pass user information to the model
When multiple users are online, the LLM needs to distinguish which input comes from which user. Real-time Conversational AI can pass custom information, such as a UserID, through to the model. For details, see Pass through business parameters to Alibaba Cloud Model Studio.
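The idea of tagging each utterance with a UserID can be sketched as follows. The payload shape and field names are assumptions for illustration; the actual pass-through format is defined in the Model Studio documentation referenced above.

```python
# Hypothetical sketch: the payload layout is an assumption, not the
# documented business-parameter pass-through format.
import json

def tag_utterance(user_id, text):
    # Attach a custom UserID alongside the user's text so the model
    # can tell speakers apart in a multi-user room.
    return json.dumps({"custom_params": {"user_id": user_id}, "text": text})

payload = tag_utterance("u-1001", "What should we play next?")
```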
Detect and handle user silence
You can monitor the timestamp of each user utterance by listening for the intent_recognized callback (see Agent callbacks for details). This lets you handle cases where a user is silent for an extended period. Common actions include:
End the conversation: See StopAIAgentInstance.
Play a reminder: Have the AI play a reminder after X seconds of silence. See Vocalize notifications from AI agent.
Trigger the next question: Send a text input to the LLM to have it ask the next question. See Send text input to an LLM via API.
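The pattern above, tracking the last utterance timestamp and escalating as silence grows, can be sketched as a small watchdog. The class name, thresholds, and action strings are illustrative assumptions; only the callback name intent_recognized comes from the service.

```python
# Hypothetical silence watchdog: thresholds and action names are
# illustrative assumptions.
import time

class SilenceMonitor:
    def __init__(self, remind_after=15.0, end_after=60.0):
        self.remind_after = remind_after
        self.end_after = end_after
        self.last_utterance = time.monotonic()

    def on_intent_recognized(self):
        # Call this from the intent_recognized callback.
        self.last_utterance = time.monotonic()

    def check(self, now=None):
        # Decide what to do based on how long the user has been silent.
        idle = (now or time.monotonic()) - self.last_utterance
        if idle >= self.end_after:
            return "stop_agent"      # e.g. call StopAIAgentInstance
        if idle >= self.remind_after:
            return "play_reminder"   # e.g. vocalize a reminder, or send
                                     # text to the LLM for a next question
        return None

monitor = SilenceMonitor()
monitor.last_utterance = time.monotonic() - 20  # simulate 20 s of silence
print(monitor.check())  # play_reminder
```

You would call `check()` on a periodic timer; which action each threshold maps to (reminder, next question, or ending the call) is a product decision.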
Conversation archiving
You can save the audio data and text transcripts generated during the entire companionship session. For instructions, see Data archiving.
Advanced features
Spoken language assessment (per-sentence)
For scenarios where you want to evaluate a user's pronunciation, Real-time Conversational AI can record each user utterance as a separate audio file. These files are saved in real time to an Object Storage Service (OSS) bucket that you specify and can then be used for pronunciation assessment.
Real-time Conversational AI provides the per-sentence recording capability but does not include the assessment itself. To configure per-sentence audio callbacks, see Agent callbacks.
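When consuming per-sentence recordings from OSS, a deterministic key scheme makes it easy to fetch a session's utterances in order for assessment. The layout below is an assumption for illustration, not the service's actual naming scheme.

```python
# Hypothetical sketch: the OSS key layout is an assumption, not the
# service's actual per-sentence recording naming scheme.
def recording_key(session_id, utterance_index, started_at_ms):
    # Zero-pad the index so keys sort in utterance order.
    return f"recordings/{session_id}/{utterance_index:04d}-{started_at_ms}.wav"

key = recording_key("sess-42", 7, 1718000000000)
print(key)  # recordings/sess-42/0007-1718000000000.wav
```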

