This topic provides a solution guide for launching an AI-powered spoken language tutoring service that helps learners improve their speaking skills.
Background
An AI-powered speaking tutor addresses the challenges of finding practice partners and overcoming time and location constraints. It offers on-demand practice sessions, analyzes a learner's historical data to pinpoint issues, and provides personalized exercises with instant feedback and corrections.
Furthermore, an AI tutor can simulate a wide variety of scenarios and topics, broadening the learner's practical language skills. By providing a low-pressure, stress-free learning environment, it helps learners build confidence, overcome speaking anxiety, and improve their oral proficiency.
Options
Tutoring modes
Real-time Conversational AI offers two interaction modes for an AI speaking tutor. You can choose a mode by specifying the call type when creating your AI agent and then integrating the corresponding SDK. You can try these modes firsthand in our demo. To integrate the service, see Quick start for audio/video calls.
| | Audio-only call | Avatar call |
| --- | --- | --- |
| Example | (image not shown) | (image not shown) |
| Interaction | Voice-only conversation with the AI tutor | Voice conversation with a digital human avatar |
| Cost | Low | Medium |
Client SDKs
For detailed SDK integration instructions, see Developer guide.
| SDK | Description |
| --- | --- |
| Web SDK | Recommended for web applications. |
| Native SDK | Recommended for native applications on Android or iOS. |
| Other | For development on Windows or macOS desktops, contact us by joining our DingTalk group (ID: 106730016696). |
Basic features
Personalized calls and scene switching
Alibaba Cloud provides a rich set of APIs to create a tailored session for each learner. You achieve this by configuring call startup parameters when initiating a call.
Real-time Conversational AI also lets users switch conversation scenes mid-session without ending the call, for example, moving from a "directions" practice to a "shopping" practice. To do this, redefine the LLM prompt for the new scene. The following table lists the configurable settings; a code sketch follows the table.
| Setting | Description | Modifiable during call? |
| --- | --- | --- |
| LLM prompt | Pass learner-specific information (such as proficiency level or learning goals) as part of the initial prompt so the AI can deliver a more targeted practice session. | Yes |
| ASR language | Set the speech recognition language (such as Chinese or English). | Yes |
| TTS voice | Set the AI tutor's voice and timbre. | Yes |
| Avatar | If using an avatar call, set the digital human avatar. | No |
| Welcome message | Set a custom welcome message for each learner, such as "Hi, Alice! Today, we'll be practicing a shopping scenario." | No |
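The following is a minimal sketch of how these settings might come together at call time. Every name in it (buildStartupConfig, buildScenePrompt, the config fields, and the voice ID) is an illustrative assumption, not the SDK's actual startup parameters; check the startup-parameter documentation for the real field names.

```typescript
// Illustrative only: the helpers and field names below are assumptions,
// not the SDK's real startup parameters.

interface CallStartupConfig {
  llmPrompt: string;      // learner-specific system prompt
  asrLanguage: string;    // speech recognition language, e.g. "en-US"
  ttsVoice: string;       // AI tutor's voice and timbre
  welcomeMessage: string; // per-learner greeting (not modifiable mid-call)
}

// Build a tailored session for one learner from their profile.
function buildStartupConfig(learner: { name: string; level: string; goal: string }): CallStartupConfig {
  return {
    llmPrompt:
      `You are a friendly English speaking tutor. The learner is ${learner.name}, ` +
      `proficiency level ${learner.level}. Today's goal: ${learner.goal}. ` +
      `Correct mistakes gently and keep replies short.`,
    asrLanguage: "en-US",
    ttsVoice: "warm-female-1", // placeholder voice ID
    welcomeMessage: `Hi, ${learner.name}! Today, we'll be practicing a shopping scenario.`,
  };
}

// Mid-session scene switch: only the LLM prompt is redefined; the call stays up.
function buildScenePrompt(scene: "directions" | "shopping"): string {
  const scenarios = {
    directions: "Role-play a passer-by giving street directions.",
    shopping: "Role-play a shop assistant in a clothing store.",
  };
  return `Continue tutoring the same learner. New scenario: ${scenarios[scene]}`;
}
```

On a scene switch, you would pass the string from buildScenePrompt to whichever API your integration uses to update the LLM prompt mid-call.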
Send custom messages to clients
If you need to send custom information, such as test questions or informational cards, to the client in real time, our platform provides a dedicated channel for this. Once received, the client can render the content or perform any custom action.
There are two ways to implement this:
Method 1: Your server can send custom messages directly to the client. See Send proactive messages to clients.
Method 2: You can embed custom commands within the LLM's response.
Note: Custom commands can be marked with special characters, such as {} or []. These markers can be filtered out by the TTS node so they are not spoken aloud. Parse the marked content on the client to handle your custom business logic, as in the sketch below.
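Here is a minimal client-side sketch of Method 2. It assumes your LLM prompt instructs the model to emit commands as JSON wrapped in {} markers; the CustomCommand shape and the quiz example are illustrative, not a platform-defined format.

```typescript
// Assumption: the LLM prompt instructs the model to emit commands as JSON
// wrapped in {} markers, and the TTS node filters these so they are not spoken.

interface CustomCommand {
  type: string;     // e.g. "quiz", "card"
  payload?: unknown;
}

// Matches {...} spans, allowing one level of nested braces for JSON payloads.
const COMMAND_PATTERN = /\{(?:[^{}]|\{[^{}]*\})*\}/g;

function extractCommands(llmText: string): { display: string; commands: CustomCommand[] } {
  const commands: CustomCommand[] = [];
  // Strip each command marker from the display text while collecting its payload.
  const display = llmText
    .replace(COMMAND_PATTERN, (match) => {
      try {
        commands.push(JSON.parse(match) as CustomCommand);
        return ""; // remove the command from the visible text
      } catch {
        return match; // not valid JSON: keep it as ordinary text
      }
    })
    .trim();
  return { display, commands };
}

// Example: render a quiz card when the tutor's reply carries one.
const { display, commands } = extractCommands(
  'Great job! Try this one. {"type":"quiz","payload":{"q":"How do you ask for a discount?"}}'
);
console.log(display);  // "Great job! Try this one."
console.log(commands); // [{ type: "quiz", payload: { q: "..." } }]
```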
Pass user information to the model
When multiple users are online, the LLM needs to distinguish which input comes from which user. Real-time Conversational AI can pass custom information, such as a UserID, through to the model. For details, see Pass through business parameters to Alibaba Cloud Model Studio.
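As a rough illustration, the pass-through payload might simply be a serialized object carrying the user's ID. The field names below are assumptions; the actual parameter name and format are described in the linked topic.

```typescript
// Assumption: the pass-through payload is a serialized JSON string; the real
// parameter name and format are described in the linked topic.

interface PassThroughParams {
  userId: string; // stable identifier so the LLM can attribute each utterance
  role?: string;  // optional app-level metadata, e.g. "student"
}

// Serialize once at call start; the platform forwards it to the model with
// each input so prompts can reference the speaker by ID.
function buildPassThrough(params: PassThroughParams): string {
  return JSON.stringify(params);
}

const passThrough = buildPassThrough({ userId: "alice-042", role: "student" });
```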
Detect and handle user silence
You can monitor the timestamp of each user utterance by listening for the intent_recognized callback (see Agent callbacks for details). This lets you handle cases where a user is silent for an extended period. Common actions include the following; a minimal watchdog sketch follows the list:
End the conversation: See StopAIAgentInstance.
Play a reminder: Have the AI play a reminder after X seconds of silence. See Vocalize notifications from AI agent.
Trigger the next question: Send a text input to the LLM to have it ask the next question. See Send text input to an LLM via API.
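Below is a minimal client-side watchdog sketch. It assumes your callback handler can call resetOnUtterance() whenever an utterance is recognized; the SilenceWatchdog class and the 20-second threshold are illustrative, not part of the SDK.

```typescript
// Sketch: a silence watchdog that fires after N seconds without an utterance.
// Call resetOnUtterance() from your intent_recognized callback handler; plug
// one of the actions above (reminder, next question, or ending the call) into
// onSilence(). All wiring here is illustrative.

class SilenceWatchdog {
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private readonly silenceMs: number,
    private readonly onSilence: () => void,
  ) {}

  // Call whenever the user speaks to restart the countdown.
  resetOnUtterance(): void {
    this.stop();
    this.timer = setTimeout(this.onSilence, this.silenceMs);
  }

  stop(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
  }
}

// After 20 s of silence, nudge the learner (e.g. send a text input to the LLM
// or vocalize a reminder; see the links above for each approach).
const watchdog = new SilenceWatchdog(20_000, () => {
  console.log("User silent for 20 s: trigger reminder or next question");
});
watchdog.resetOnUtterance(); // start the countdown once the call begins
```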
Conversation archiving
You can save the audio data and text transcripts from the entire tutoring session. For instructions, see Data archiving.
Advanced features
Spoken language assessment (per sentence)
For scenarios where you want to evaluate a user's pronunciation, Real-time Conversational AI can record each user utterance as a separate audio file. These files are saved in real time to your specified Object Storage Service (OSS) bucket, where you can use them for pronunciation assessment.
Real-time Conversational AI provides the per-sentence audio recording capability but does not include the assessment feature itself. To configure per-sentence audio callbacks, see Agent callbacks.
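As a rough sketch, your server might expose a webhook that receives each per-sentence recording event and forwards the audio to whatever assessment engine you use. The endpoint path and payload fields below are assumptions; the actual callback schema is documented in Agent callbacks.

```typescript
// Sketch: a webhook that receives per-sentence recording callbacks and hands
// each audio file to a (hypothetical) pronunciation-assessment service.
// The payload fields below are assumptions; check "Agent callbacks" for the
// actual schema delivered by Real-time Conversational AI.
import express from "express";

interface SentenceRecordedEvent {
  userId: string;       // assumed: which learner spoke
  sentenceText: string; // assumed: ASR transcript of the utterance
  ossUrl: string;       // assumed: location of the audio file in your OSS bucket
}

const app = express();
app.use(express.json());

app.post("/callbacks/sentence-recorded", async (req, res) => {
  const event = req.body as SentenceRecordedEvent;

  // Hand off to your own assessment pipeline; Real-time Conversational AI
  // records the audio but does not score pronunciation itself.
  await assessPronunciation(event.ossUrl, event.sentenceText);

  res.sendStatus(200); // acknowledge so the callback is not retried
});

// Placeholder for whatever third-party or in-house scorer you use.
async function assessPronunciation(audioUrl: string, transcript: string): Promise<void> {
  console.log(`Scoring ${audioUrl} against transcript: ${transcript}`);
}

app.listen(3000);
```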

