Intelligent Media Services: AI speaking tutor

Last Updated: Dec 15, 2025

This topic provides a solution guide to help you launch an AI-powered spoken language tutoring service to meet learners' needs for improving their speaking skills.

Background

An AI-powered speaking tutor addresses the challenges of finding practice partners and overcoming time and location constraints. It offers on-demand practice sessions, analyzes a learner's historical data to pinpoint issues, and provides personalized exercises with instant feedback and corrections.

Furthermore, an AI tutor can simulate a wide variety of scenarios and topics, broadening the learner's practical language skills. By providing a low-pressure, stress-free learning environment, it helps learners build confidence, overcome speaking anxiety, and improve their oral proficiency.

Options

Tutoring modes

Real-time Conversational AI offers two interaction modes for an AI speaking tutor. You can choose a mode by specifying the call type when creating your AI agent and then integrating the corresponding SDK. You can try these modes firsthand in our demo. To integrate the service, see Quick start for audio/video calls.

| | Audio-only call | Avatar call |
| --- | --- | --- |
| Example | [image] | [image] |
| Interaction | Learner: Audio; AI tutor: Audio | Learner: Audio; AI tutor: Video |
| Cost | Low | Medium |

Client SDKs

For detailed SDK integration instructions, see Developer guide.

Web

Recommended for:

  • Desktop browsers, such as Chrome.

  • Mobile H5, such as Alipay H5, DingTalk H5, and WeChat Mini Program H5.

  • In-app WebViews.

Note
  • Use on native mobile browsers is not recommended due to potential WebRTC compatibility issues on some devices.

  • Direct integration with native WeChat Mini Program components is not supported. Use the H5 version within a mini program instead.

Android/iOS

Recommended for native applications on Android or iOS.

Other

For development on Windows or macOS desktops, contact us by joining our DingTalk group (ID: 106730016696).

Basic features

Personalized calls and scene switching

Alibaba Cloud provides a rich set of APIs to create a tailored session for each learner. You can achieve this by configuring call startup parameters when initiating a call.

Real-time Conversational AI also allows users to switch conversation scenes mid-session without ending the call, for example, moving from a "directions" practice to a "shopping" practice. To do this, redefine the LLM prompt for the new scene.

| Setting | Description | Modifiable during call? |
| --- | --- | --- |
| LLM prompt | Pass learner-specific information (such as proficiency level or learning goals) as part of the initial prompt to enable the AI to provide a more targeted practice session. | Yes |
| ASR language | Set the speech recognition language (such as Chinese or English). | Yes |
| TTS voice | Set the AI tutor's voice and timbre. | Yes |
| Avatar | If using a VideoAgent with multiple avatars, you can specify which one to use for the call. | No |
| Welcome message | Set a custom welcome message for each learner, such as, "Hi, Alice! Today, we'll be practicing a shopping scenario." | No |
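The settings above can be read as a small contract: everything can be set when the call starts, but only some settings can change mid-call. The following Python sketch illustrates that contract and the scene switch described earlier. The field names (`llm_prompt`, `asr_language`, `tts_voice`, `greeting`) and the helper functions are hypothetical and are not taken from the Real-time Conversational AI API; map them onto the actual call startup parameters in the developer documentation.

```python
from dataclasses import dataclass

# Hypothetical setting names; the real parameter names come from the API reference.
MODIFIABLE_DURING_CALL = {"llm_prompt", "asr_language", "tts_voice"}


@dataclass
class Learner:
    name: str
    level: str  # for example "beginner"
    goal: str   # for example "travel English"


def build_prompt(learner: Learner, scene: str) -> str:
    """Compose a scene-specific LLM prompt that carries learner information."""
    return (
        f"You are a friendly speaking tutor. The learner is {learner.name}, "
        f"level: {learner.level}, goal: {learner.goal}. "
        f"Role-play a '{scene}' scenario and correct mistakes gently."
    )


def build_startup_settings(learner: Learner, scene: str) -> dict:
    """Settings passed once, when the call is initiated."""
    return {
        "llm_prompt": build_prompt(learner, scene),
        "asr_language": "en",
        "tts_voice": "default",
        "greeting": f"Hi, {learner.name}! Today, we'll be practicing a {scene} scenario.",
    }


def apply_update(settings: dict, changes: dict) -> dict:
    """Return updated settings, rejecting fields that cannot change mid-call."""
    blocked = set(changes) - MODIFIABLE_DURING_CALL
    if blocked:
        raise ValueError(f"not modifiable during a call: {sorted(blocked)}")
    return {**settings, **changes}


learner = Learner(name="Alice", level="beginner", goal="travel English")
settings = build_startup_settings(learner, "directions")
# Switch scenes mid-call by redefining only the prompt.
settings = apply_update(settings, {"llm_prompt": build_prompt(learner, "shopping")})
```

Keeping the list of modifiable fields in one place means a later attempt to change the avatar or welcome message mid-call fails early in your own code rather than at the service.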

Send custom messages to clients

If you need to send custom information, such as test questions or informational cards, to the client in real time, the platform provides a dedicated channel for this. Once the message is received, the client can render the content or perform any custom action.

There are two ways to implement this:

  • Method 1: Your server can send custom messages directly to the client. See Send proactive messages to clients.

  • Method 2: You can embed custom commands within the LLM's response.

    Note

    The custom commands can be marked with special characters, such as {} or []. These markers can be filtered out by the TTS node so they are not spoken aloud. Parse this content to handle custom business logic.
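As an illustration of Method 2, the sketch below separates embedded commands from the text meant for the learner. It assumes a command is written as `{name:argument}`, for example `{show_card:apple}`; that format and the function name `split_commands` are assumptions made for this example, not a convention defined by the platform.

```python
import re

# Assumed convention: a command looks like {name:argument}, e.g. {show_card:apple}.
COMMAND_PATTERN = re.compile(r"\{([^{}]*)\}")


def split_commands(llm_text: str) -> tuple[str, list[tuple[str, str]]]:
    """Return (text without markers, list of (command, argument)) for one LLM response."""
    commands = []
    for body in COMMAND_PATTERN.findall(llm_text):
        name, _, argument = body.partition(":")
        commands.append((name.strip(), argument.strip()))
    # Remove the markers, then collapse the whitespace they leave behind.
    cleaned = COMMAND_PATTERN.sub(" ", llm_text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, commands


cleaned, commands = split_commands(
    "Great job! {show_card:apple} Now try saying the word on the card."
)
```

The cleaned text is useful for subtitles or chat history even when the TTS node already filters the markers from the audio, and the command list is what your business logic acts on.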

Pass user information to the model

When multiple users are online, the LLM needs to distinguish which input comes from which user. Real-time Conversational AI lets you pass custom information, such as a UserID, through to the model. For details, see Pass through business parameters to Alibaba Cloud Model Studio.

Detect and handle user silence

You can monitor the timestamp of each user utterance by listening for the intent_recognized callback. See Agent callbacks for details. This allows you to handle cases where a user is silent for an extended period. Common actions include:
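Whichever action you take, detecting the silence itself comes down to comparing the current time with the timestamp of the last utterance. The sketch below is one minimal way to do that. It assumes your handler for the intent_recognized callback can extract a timestamp in seconds; the class and method names are hypothetical, and the callback payload format is not shown here.

```python
from typing import Optional


class SilenceMonitor:
    """Tracks the latest utterance timestamp and flags prolonged silence."""

    def __init__(self, threshold_seconds: float) -> None:
        self.threshold_seconds = threshold_seconds
        self.last_utterance_at: Optional[float] = None

    def on_utterance(self, timestamp: float) -> None:
        """Call this from your handler for the intent_recognized callback."""
        self.last_utterance_at = timestamp

    def is_silent(self, now: float) -> bool:
        """True once the gap since the last utterance exceeds the threshold."""
        if self.last_utterance_at is None:
            return False  # nothing heard yet; handle the start of the call separately
        return now - self.last_utterance_at > self.threshold_seconds


monitor = SilenceMonitor(threshold_seconds=15.0)
monitor.on_utterance(timestamp=100.0)  # timestamp taken from the callback
print(monitor.is_silent(now=110.0))    # False: 10 s gap
print(monitor.is_silent(now=120.0))    # True: 20 s gap
```

In practice you would call `is_silent` from a periodic timer and trigger your chosen action once it returns true.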

Conversation archiving

You can save the audio data and text transcripts from the entire tutoring session. For instructions, see Data archiving.

Advanced features

Spoken language assessment (Per-sentence)

To evaluate a user's pronunciation, Real-time Conversational AI can record each user utterance as a separate audio file. These audio files are saved in real time to the Object Storage Service (OSS) bucket that you specify, and you can then use them for pronunciation assessment.

Note

Real-time Conversational AI provides the per-sentence audio recording capability but does not include the assessment feature itself. To configure per-sentence audio callbacks, see Agent callbacks.
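Because the assessment step is yours to supply, it can help to keep it pluggable. The sketch below shows one possible hand-off: per-sentence recordings are scored by an external assessor and the lowest-scoring sentences are selected for review. The `SentenceRecording` fields, the object keys, and the `fake_assess` scorer are all hypothetical stand-ins; the actual callback fields are described in Agent callbacks, and the scorer would be replaced by whichever assessment service you choose.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SentenceRecording:
    sentence_id: int
    text: str       # transcript of the utterance
    audio_key: str  # object key of the audio file in your OSS bucket


def weakest_sentences(
    recordings: list[SentenceRecording],
    assess: Callable[[SentenceRecording], float],
    limit: int = 3,
) -> list[tuple[SentenceRecording, float]]:
    """Score every recording with the external assessor and return the lowest-scoring ones."""
    scored = [(rec, assess(rec)) for rec in recordings]
    scored.sort(key=lambda pair: pair[1])
    return scored[:limit]


def fake_assess(rec: SentenceRecording) -> float:
    """Stand-in scorer for the demo; replace with a real assessment service."""
    return {1: 92.0, 2: 61.5, 3: 78.0}[rec.sentence_id]


recordings = [
    SentenceRecording(1, "Where is the station?", "calls/demo/1.wav"),
    SentenceRecording(2, "I would like three apples.", "calls/demo/2.wav"),
    SentenceRecording(3, "How much does it cost?", "calls/demo/3.wav"),
]
review = weakest_sentences(recordings, fake_assess, limit=2)
```

The selected sentences could then feed back into the LLM prompt so that the tutor revisits them in the next session.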