Real-time speech synthesis - CosyVoice/Sambert - Alibaba Cloud Model Studio

Speech synthesis, also known as text-to-speech (TTS), is a technology that converts text into natural-sounding speech. It uses machine learning algorithms to learn the prosody, intonation, and pronunciation rules of a language from numerous audio samples, enabling it to generate human-like, natural speech from text input.

Core features

Generates high-fidelity audio in real time and supports natural-sounding speech in multiple languages, including Chinese and English.
Provides a voice cloning feature to quickly create custom voices.
Supports streaming input and output for low-latency responses in real-time interactive scenarios.
Allows fine-grained control over speech performance by adjusting speech rate, pitch, volume, and bitrate.
Compatible with major audio formats and supports an audio sampling rate of up to 48 kHz.

Availability

Supported regions: This service is available only in the China (Beijing) region. Use an API key from the China (Beijing) region.
Supported models: cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
Supported voices: See CosyVoice voice list.

Model selection

Scenario	Recommended model	Reason	Notes
Brand voice customization/Personalized voice cloning service	cosyvoice-v3-plus	Features the strongest voice cloning capability and supports 48 kHz high-quality audio output. The combination of high-quality audio and voice cloning helps create a human-like brand voiceprint.	Higher cost ($0.286706 per 10,000 characters). Recommended for core scenarios
Smart customer service / Voice assistant	cosyvoice-v3-flash	This is the lowest-cost model ($0.14335 per 10,000 characters). It supports streaming interaction and emotional expression, provides fast response times, and is highly cost-effective.
Dialect broadcasting system	cosyvoice-v3-flash, cosyvoice-v3-plus	Supports various dialects such as Northeastern Mandarin and Minnan, suitable for local content broadcasting.	cosyvoice-v3-plus has a higher cost ($0.286706 per 10,000 characters)
Educational applications (including formula reading)	cosyvoice-v2, cosyvoice-v3-flash, cosyvoice-v3-plus	Supports converting LaTeX formulas to speech, suitable for explaining math, physics, and chemistry lessons.	cosyvoice-v2 and cosyvoice-v3-plus have a higher cost ($0.286706 per 10,000 characters).
Structured voice announcements (news/bulletins)	cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2	Supports SSML to control speech rate, pauses, pronunciation, and more, enhancing the professionalism of broadcasts.	Requires additional development for SSML generation logic. Does not support setting emotions.
Precise audio-text alignment (for caption generation, lecture playback, dictation training)	cosyvoice-v3-flash, cosyvoice-v3-plus, cosyvoice-v2	Supports timestamp output, which allows synchronization between synthesized audio and the original text.	The timestamp feature must be explicitly enabled because it is disabled by default.
Multilingual products for global markets	cosyvoice-v3-flash, cosyvoice-v3-plus	Supports multiple languages.	Sambert does not support streaming input and is more expensive than cosyvoice-v3-flash.

For more information, see Feature comparison.

Getting started

The following code shows an example of how to call the API. For more code examples that cover common scenarios, see our repository on GitHub.

You must obtain an API key and set the API key as an environment variable. If you use an SDK to make calls, you must also install the DashScope SDK.

CosyVoice

Save synthesized audio to a file

Python

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

# If you have not configured the API Key in an environment variable, replace "your-api-key" with your API Key.
# dashscope.api_key = "your-api-key"

# Model
# Different model versions require corresponding voice versions:
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# cosyvoice-v2: Use voices such as longxiaochun_v2.
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized to get the binary audio.
audio = synthesizer.call("What is the weather like today?")
# The first time you send text, a WebSocket connection must be established, so the first-packet latency includes the connection setup time.
print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Save the audio to a local file.
with open('output.mp3', 'wb') as f:
    f.write(audio)

Java

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // Model
    // Different model versions require corresponding voice versions:
    // cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
    // cosyvoice-v2: Use voices such as longxiaochun_v2.
    private static String model = "cosyvoice-v3-flash";
    // Voice
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() {
        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // If you have not configured the API Key in an environment variable, uncomment the following line and replace "your-api-key" with your API Key.
                        // .apiKey("your-api-key")
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();

        // Synchronous mode: Disable callback (the second parameter is null).
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until the audio is returned.
            audio = synthesizer.call("What is the weather like today?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection when the task is finished.
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save the audio data to the local file "output.mp3".
            File file = new File("output.mp3");
            // The first time you send text, a WebSocket connection must be established, so the first-packet latency includes the connection setup time.
            System.out.println(
                    "[Metric] Request ID: "
                            + synthesizer.getLastRequestId()
                            + " First-packet latency (ms): "
                            + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Convert text generated by an LLM into speech in real time and play it through a speaker

The following code shows how to play the real-time text output from the Qwen large language model (qwen-turbo) on an on-premises device.

Python

Before running the Python example, you need to install a third-party audio playback library using pip.

# coding=utf-8
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import pyaudio
import dashscope
from dashscope.audio.tts_v2 import *


from http import HTTPStatus
from dashscope import Generation

# If you have not configured the API Key in an environment variable, uncomment the following line and replace "apiKey" with your API Key.
# dashscope.api_key = "apiKey"

# Different model versions require corresponding voice versions:
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# cosyvoice-v2: Use voices such as longxiaochun_v2.
model = "cosyvoice-v3-flash"
voice = "longanyang"


class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("websocket is open.")
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("speech synthesis task complete successfully.")

    def on_error(self, message: str):
        print(f"speech synthesis task failed, {message}")

    def on_close(self):
        print("websocket is closed.")
        # stop player
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        print(f"recv speech synthsis message {message}")

    def on_data(self, data: bytes) -> None:
        print("audio result length:", len(data))
        self._stream.write(data)


def synthesizer_with_llm():
    callback = Callback()
    synthesizer = SpeechSynthesizer(
        model=model,
        voice=voice,
        format=AudioFormat.PCM_22050HZ_MONO_16BIT,
        callback=callback,
    )

    messages = [{"role": "user", "content": "Please introduce yourself."}]
    responses = Generation.call(
        model="qwen-turbo",
        messages=messages,
        result_format="message",  # set result format as 'message'
        stream=True,  # enable stream output
        incremental_output=True,  # enable incremental output 
    )
    for response in responses:
        if response.status_code == HTTPStatus.OK:
            print(response.output.choices[0]["message"]["content"], end="")
            synthesizer.streaming_call(response.output.choices[0]["message"]["content"])
        else:
            print(
                "Request id: %s, Status code: %s, error code: %s, error message: %s"
                % (
                    response.request_id,
                    response.status_code,
                    response.code,
                    response.message,
                )
            )
    synthesizer.streaming_complete()
    print('Request ID: ', synthesizer.get_last_request_id())


if __name__ == "__main__":
    synthesizer_with_llm()

Java

import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.sound.sampled.*;

public class Main {
    // Different model versions require corresponding voice versions:
    // cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
    // cosyvoice-v2: Use voices such as longxiaochun_v2.
    private static String model = "cosyvoice-v3-flash";
    private static String voice = "longanyang";
    public static void process() throws NoApiKeyException, InputRequiredException {
        // Playback thread
        class PlaybackRunnable implements Runnable {
            // Set the audio format. Configure this based on your device,
            // synthesized audio parameters, and platform. Here, it is set to
            // 22050 Hz 16-bit single channel. We recommend that you choose other
            // sample rates and formats based on the model sample rate and device
            // compatibility.
            private AudioFormat af = new AudioFormat(22050, 16, 1, true, false);
            private DataLine.Info info = new DataLine.Info(SourceDataLine.class, af);
            private SourceDataLine targetSource = null;
            private AtomicBoolean runFlag = new AtomicBoolean(true);
            private ConcurrentLinkedQueue<ByteBuffer> queue =
                    new ConcurrentLinkedQueue<>();

            // Prepare the player.
            public void prepare() throws LineUnavailableException {
                targetSource = (SourceDataLine) AudioSystem.getLine(info);
                targetSource.open(af, 4096);
                targetSource.start();
            }

            public void put(ByteBuffer buffer) {
                queue.add(buffer);
            }

            // Stop playback.
            public void stop() {
                runFlag.set(false);
            }

            @Override
            public void run() {
                if (targetSource == null) {
                    return;
                }

                while (runFlag.get()) {
                    if (queue.isEmpty()) {
                        try {
                            Thread.sleep(100);
                        } catch (InterruptedException e) {
                        }
                        continue;
                    }

                    ByteBuffer buffer = queue.poll();
                    if (buffer == null) {
                        continue;
                    }

                    byte[] data = buffer.array();
                    targetSource.write(data, 0, data.length);
                }

                // Play all remaining cached audio.
                if (!queue.isEmpty()) {
                    ByteBuffer buffer = null;
                    while ((buffer = queue.poll()) != null) {
                        byte[] data = buffer.array();
                        targetSource.write(data, 0, data.length);
                    }
                }
                // Release the player.
                targetSource.drain();
                targetSource.stop();
                targetSource.close();
            }
        }

        // Create a subclass that inherits from ResultCallback<SpeechSynthesisResult>
        // to implement the callback interface.
        class ReactCallback extends ResultCallback<SpeechSynthesisResult> {
            private PlaybackRunnable playbackRunnable = null;

            public ReactCallback(PlaybackRunnable playbackRunnable) {
                this.playbackRunnable = playbackRunnable;
            }

            // Callback for when the service returns the streaming synthesis result.
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                // Get the binary data of the streaming result using getAudioFrame.
                if (result.getAudioFrame() != null) {
                    // Stream the data to the player.
                    playbackRunnable.put(result.getAudioFrame());
                }
            }

            // Callback for when the service completes the synthesis.
            @Override
            public void onComplete() {
                // Notify the playback thread to end.
                playbackRunnable.stop();
            }

            // Callback for when an error occurs.
            @Override
            public void onError(Exception e) {
                // Tell the playback thread to end.
                System.out.println(e);
                playbackRunnable.stop();
            }
        }

        PlaybackRunnable playbackRunnable = new PlaybackRunnable();
        try {
            playbackRunnable.prepare();
        } catch (LineUnavailableException e) {
            throw new RuntimeException(e);
        }
        Thread playbackThread = new Thread(playbackRunnable);
        // Start the playback thread.
        playbackThread.start();
        /*******  Call the generative AI model to get streaming text *******/
        // Prepare for the LLM call.
        Generation gen = new Generation();
        Message userMsg = Message.builder()
                .role(Role.USER.getValue())
                .content("Please introduce yourself.")
                .build();
        GenerationParam genParam =
                GenerationParam.builder()
                        // If you have not configured the API Key in an environment variable, uncomment the following line and replace "apiKey" with your API Key.
                        // .apiKey("apikey")
                        .model("qwen-turbo")
                        .messages(Arrays.asList(userMsg))
                        .resultFormat(GenerationParam.ResultFormat.MESSAGE)
                        .topP(0.8)
                        .incrementalOutput(true)
                        .build();
        // Prepare the speech synthesis task.
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // If you have not configured the API Key in an environment variable, uncomment the following line and replace "apiKey" with your API Key.
                        // .apiKey("apikey")
                        .model(model)
                        .voice(voice)
                        .format(SpeechSynthesisAudioFormat
                                .PCM_22050HZ_MONO_16BIT)
                        .build();
        SpeechSynthesizer synthesizer =
                new SpeechSynthesizer(param, new ReactCallback(playbackRunnable));
        Flowable<GenerationResult> result = gen.streamCall(genParam);
        result.blockingForEach(message -> {
            String text =
                    message.getOutput().getChoices().get(0).getMessage().getContent().trim();
            if (text != null && !text.isEmpty()) {
                System.out.println("LLM output: " + text);
                synthesizer.streamingCall(text);
            }
        });
        synthesizer.streamingComplete();
        System.out.print("Request ID: " + synthesizer.getLastRequestId());
        try {
            // Wait for the playback thread to finish playing all audio.
            playbackThread.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws NoApiKeyException, InputRequiredException {
        process();
        System.exit(0);
    }
}

API reference

Feature comparison

Feature	cosyvoice-v3-plus	cosyvoice-v3-flash	cosyvoice-v2
Supported languages	Varies by voice: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian		Varies by voice: Chinese, English (British, American), Korean, Japanese
Audio format	pcm, wav, mp3, opus
Audio sampling rate	8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz
Voice cloning	Supported. See CosyVoice voice cloning API.
SSML	Supported. See SSML markup language overview. This feature applies to cloned voices and the system voices marked as supported in the voice list.
LaTeX	Supported. See Convert LaTeX formulas to speech.
Volume adjustment	Supported
Speech rate adjustment	Supported
Pitch adjustment	Supported
Bitrate adjustment	Supported. Only available for the opus format.
Timestamp	Supported. Disabled by default but can be enabled. This feature applies to cloned voices and the system voices marked as supported in the voice list.
Instruction control (Instruct)	Supported. This feature applies to cloned voices and the system voices marked as supported in the voice list.		Not supported
Streaming input	Supported
Streaming output	Supported
Rate limit (RPS)	3
Connection types	Java/Python SDK, WebSocket API
Price	$0.286706 per 10,000 characters	$0.14335 per 10,000 characters	$0.286706 per 10,000 characters

FAQ

Q: What should I do if the speech synthesis has incorrect pronunciation? How can I control the pronunciation of polyphonic characters?

Replace the polyphonic character with a homophone to quickly correct the pronunciation.
Use Speech Synthesis Markup Language (SSML) to control the pronunciation.