
Alibaba Cloud Model Studio: Audio file recognition - Fun-ASR/Paraformer

Last Updated:Jan 16, 2026

The Fun-ASR/Paraformer audio file recognition models convert recorded audio into text. They support both single-file and batch processing and are suitable for scenarios that do not require real-time results.

Core features

  • Multilingual recognition: Supports multiple languages, including Chinese (with various dialects), English, Japanese, Korean, German, French, and Russian.

  • Broad format compatibility: Supports any sample rate (except paraformer-8k-v2, which expects 8 kHz audio) and is compatible with mainstream audio and video formats such as AAC, WAV, and MP3.

  • Long audio file processing: Supports asynchronous transcription for a single audio file up to 12 hours in duration and 2 GB in size.

  • Singing recognition: Transcribes entire songs, even with background music. This feature is available only for the fun-asr and fun-asr-2025-11-07 models.

  • Rich recognition features: Provides configurable features such as speaker diarization, sensitive word filtering, sentence-level and word-level timestamps, and hotword enhancement to meet specific requirements.
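As a concrete illustration of how these optional features map onto a request, the sketch below assembles keyword arguments for a transcription call. The parameter names `diarization_enabled` and `speaker_count` are assumptions modeled on the DashScope transcription API, not confirmed here; check the API reference for your model before using them.

```python
def build_transcription_kwargs(model, file_urls, diarize=False, speaker_count=2):
    """Assemble keyword arguments for a file-transcription request.

    diarization_enabled / speaker_count are assumed parameter names;
    verify them against the API reference before relying on them.
    """
    kwargs = {"model": model, "file_urls": list(file_urls)}
    if diarize:
        kwargs["diarization_enabled"] = True      # speaker diarization is off by default
        kwargs["speaker_count"] = speaker_count   # expected number of speakers
    return kwargs

# Hypothetical usage with a placeholder file URL.
kwargs = build_transcription_kwargs(
    "paraformer-v2", ["https://example.com/meeting.wav"], diarize=True
)
print(sorted(kwargs))  # → ['diarization_enabled', 'file_urls', 'model', 'speaker_count']
```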

Availability

Supported models:

International

In the International deployment mode, endpoints and data storage are located in the Singapore region, and model inference compute resources are dynamically scheduled globally (excluding Mainland China).

When you call the following models, select an API key from the Singapore region:

  • Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)

Mainland China

In the Mainland China deployment mode, endpoints and data storage are located in the Beijing region, and model inference compute resources are limited to Mainland China.

When you call the following models, use an API key from the Beijing region:

  • Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)

  • Paraformer: paraformer-v2, paraformer-8k-v2

For more information, see Model list.

Model selection

| Scenario | Recommended model | Reason |
| --- | --- | --- |
| Chinese recognition (meetings, live streaming) | fun-asr | Deeply optimized for Chinese, covering multiple dialects. Strong far-field Voice Activity Detection (VAD) and noise robustness make it suitable for real-world scenarios with noise or multiple distant speakers, resulting in higher accuracy. |
| Multilingual recognition (international conferences) | fun-asr-mtl, paraformer-v2 | A single model can handle multiple languages, which simplifies development and deployment. |
| Entertainment content analysis and caption generation | fun-asr | Unique singing recognition transcribes songs and singing segments in live streams; combined with its noise robustness, it is well suited to complex media audio. |
| Caption generation for news and interview programs | fun-asr, paraformer-v2 | Long audio processing, punctuation prediction, and timestamps allow direct generation of structured captions. |
| Far-field voice interaction for smart hardware | fun-asr | Far-field VAD is specially optimized to capture and recognize user commands from a distance in noisy environments such as homes and vehicles. |

For more information, see Model feature comparison.

Getting started

The following sections provide sample code for API calls.

You must obtain an API key and set it as an environment variable. If you use an SDK to make calls, you must also install the DashScope SDK.
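On Linux or macOS, for example, the key can be exported for the current shell session (replace sk-xxx with your actual API key):

```shell
# Set the DashScope API key for the current shell session.
export DASHSCOPE_API_KEY="sk-xxx"

# Confirm the variable is set and visible to child processes.
echo "$DASHSCOPE_API_KEY"  # prints sk-xxx
```

The variable lasts only for the current session; add the export line to your shell profile (such as ~/.bashrc or ~/.zshrc) to make it persistent. On Windows, use `set` (current session) or `setx` (persistent) instead.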

Fun-ASR

Because audio and video files are often large, file transfer and speech recognition can take a long time. The file recognition API is therefore asynchronous: you first submit a transcription task, and after it completes you call the query API to retrieve the recognition results.

Python

from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)

Java

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("fun-asr")
                        // language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete and get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Get the transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && !taskResultList.isEmpty()) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    // Read and pretty-print the JSON result; try-with-resources closes the reader.
                    try (BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
                        Gson gson = new GsonBuilder().setPrettyPrinting().create();
                        JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                        System.out.println(gson.toJson(jsonResult));
                    }
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}

The complete recognition result is printed to the console in JSON format. The result includes the transcribed text and the start and end times of the text in the audio or video file, specified in milliseconds.

  • First result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 3834
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 2480,
                "text": "Hello World, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 760,
                        "end_time": 3240,
                        "text": "Hello World, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 760,
                                "end_time": 1000,
                                "text": "Hello",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1000,
                                "end_time": 1120,
                                "text": " World",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 1400,
                                "end_time": 1920,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1920,
                                "end_time": 2520,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2520,
                                "end_time": 2840,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2840,
                                "end_time": 3240,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
  • Second result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 4726
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 3800,
                "text": "Hello World, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 680,
                        "end_time": 4480,
                        "text": "Hello World, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 680,
                                "end_time": 960,
                                "text": "Hello",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 960,
                                "end_time": 1080,
                                "text": " World",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 1480,
                                "end_time": 2160,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2160,
                                "end_time": 3080,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3080,
                                "end_time": 3520,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3520,
                                "end_time": 4480,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
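Because each sentence carries `begin_time` and `end_time` in milliseconds, results like the ones above can be converted directly into caption formats. The following sketch (an illustration, not part of the SDK) turns a transcript's `sentences` array into SubRip (SRT) text:

```python
def ms_to_srt(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{milli:03d}"

def sentences_to_srt(sentences):
    """Build SRT caption text from a transcript's `sentences` array."""
    blocks = []
    for i, sent in enumerate(sentences, start=1):
        blocks.append(f"{i}\n{ms_to_srt(sent['begin_time'])} --> "
                      f"{ms_to_srt(sent['end_time'])}\n{sent['text']}")
    return "\n\n".join(blocks)

# Example: the single sentence from the first result above.
sentences = [{"begin_time": 760, "end_time": 3240,
              "text": "Hello World, this is Alibaba Speech Lab."}]
print(sentences_to_srt(sentences))
# 1
# 00:00:00,760 --> 00:00:03,240
# Hello World, this is Alibaba Speech Lab.
```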

Paraformer

Because audio and video files are often large, file transfer and speech recognition can take a long time. The file recognition API is therefore asynchronous: you first submit a transcription task, and after it completes you call the query API to retrieve the recognition results.

Python

from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json


# To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. This parameter is supported only by the paraformer-v2 model of the Paraformer series. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)

Java

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("paraformer-v2")
                        // language_hints is an optional parameter that specifies the language code of the audio to be recognized. This parameter is supported only by the paraformer-v2 model of the Paraformer series. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete and get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Get the transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && !taskResultList.isEmpty()) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    // Read and pretty-print the JSON result; try-with-resources closes the reader.
                    try (BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
                        Gson gson = new GsonBuilder().setPrettyPrinting().create();
                        JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                        System.out.println(gson.toJson(jsonResult));
                    }
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}

The complete recognition result is printed to the console in JSON format. The result includes the transcribed text and the start and end times of the text in the audio or video file, specified in milliseconds.

  • First result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 4726
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 4720,
                "text": "Hello world, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 0,
                        "end_time": 4720,
                        "text": "Hello world, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 0,
                                "end_time": 629,
                                "text": "Hello ",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 629,
                                "end_time": 944,
                                "text": "world",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 944,
                                "end_time": 1888,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1888,
                                "end_time": 3146,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3146,
                                "end_time": 3776,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3776,
                                "end_time": 4720,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
  • Second result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 3834
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 3720,
                "text": "Hello word, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 100,
                        "end_time": 3820,
                        "text": "Hello word, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 100,
                                "end_time": 596,
                                "text": "Hello ",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 596,
                                "end_time": 844,
                                "text": "word",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 844,
                                "end_time": 1588,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1588,
                                "end_time": 2580,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2580,
                                "end_time": 3076,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3076,
                                "end_time": 3820,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
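When only the text matters, the per-channel transcripts can be flattened with a small helper. This is an illustrative snippet that assumes the result structure shown above:

```python
def extract_texts(result):
    """Return the full transcribed text of each channel in a parsed result."""
    return [t["text"] for t in result.get("transcripts", [])]

# Minimal result mirroring the structure printed above.
result = {
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "transcripts": [{"channel_id": 0, "text": "Hello word, this is Alibaba Speech Lab."}],
}
print(extract_texts(result))  # → ['Hello word, this is Alibaba Speech Lab.']
```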

API reference

Model feature comparison

| Feature | Fun-ASR | Paraformer |
| --- | --- | --- |
| Supported languages | Varies by model. fun-asr, fun-asr-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), Mandarin accents from regions such as Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong, and Taiwan (including accents from Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, and Japanese. fun-asr-2025-08-25: Chinese (Mandarin) and English. fun-asr-mtl, fun-asr-mtl-2025-08-25: Chinese (Mandarin and Cantonese), English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish. | Varies by model. paraformer-v2: Chinese (Mandarin, Cantonese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Jiangxi, Yunnan, Shanghai), English, Japanese, Korean, German, French, and Russian. paraformer-8k-v2: Chinese (Mandarin). |
| Supported audio formats | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
| Sample rate | Any | paraformer-v2: any; paraformer-8k-v2: 8 kHz |
| Sound channel | Any | Any |
| Input format | Publicly accessible URL of the file to be recognized. Up to 100 audio files are supported. | Publicly accessible URL of the file to be recognized. Up to 100 audio files are supported. |
| Audio size/duration | Each audio file cannot exceed 2 GB in size or 12 hours in duration. | Each audio file cannot exceed 2 GB in size or 12 hours in duration. |
| Emotion recognition | Not supported | Not supported |
| Timestamp | Supported (always on) | Supported (off by default; can be enabled) |
| Punctuation prediction | Supported (always on) | Supported (always on) |
| Hotwords | Supported (configurable) | Supported (configurable) |
| ITN (inverse text normalization) | Supported (always on) | Supported (always on) |
| Singing recognition | Supported (fun-asr and fun-asr-2025-11-07 only) | Not supported |
| Noise rejection | Supported (always on) | Supported (always on) |
| Sensitive word filtering | Supported (filters the Alibaba Cloud Model Studio sensitive word list by default; custom filtering is required for other content) | Supported (filters the Alibaba Cloud Model Studio sensitive word list by default; custom filtering is required for other content) |
| Speaker diarization | Supported (off by default; can be enabled) | Supported (off by default; can be enabled) |
| Filler word filtering | Supported (off by default; can be enabled) | Supported (off by default; can be enabled) |
| VAD | Supported (always on) | Supported (always on) |
| Rate limiting (RPS) | Job submission API: 10; task query API: 20 | Job submission API: 20; task query API: 20 |
| Connection type | DashScope: Java/Python SDK, RESTful API | DashScope: Java/Python SDK, RESTful API |
| Price | International: $0.000035/second; Mainland China: $0.000032/second | Mainland China: $0.000012/second |
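Billing is per second of audio, so costs can be estimated directly from the rates above. A quick sketch using the International Fun-ASR rate:

```python
RATE_PER_SECOND = 0.000035  # USD, International Fun-ASR rate listed above

def estimate_cost(duration_seconds, rate=RATE_PER_SECOND):
    """Return the estimated transcription cost in USD for an audio duration."""
    return duration_seconds * rate

# One hour of audio at the International Fun-ASR rate.
print(f"${estimate_cost(3600):.3f}")  # → $0.126
```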

FAQ

Q: How can I improve recognition accuracy?

Recognition accuracy depends on several interacting factors; identify which ones affect your audio, then apply the matching optimization method.

Key factors include the following:

  1. Sound quality: The quality of the recording device, the sample rate, and environmental noise affect audio clarity. High-quality audio is essential for accurate recognition.

  2. Speaker characteristics: Differences in pitch, speech rate, accent, and dialect can make recognition more difficult, especially for rare dialects or heavy accents.

  3. Language and vocabulary: Mixed languages, professional jargon, or slang can make recognition more difficult. You can configure hotwords to optimize recognition for these cases.

  4. Contextual understanding: Lack of context can lead to semantic ambiguity, especially in situations where context is necessary for correct recognition.

Optimization methods:

  1. Optimize audio quality: Use high-performance microphones and devices that support the recommended sample rate. Reduce environmental noise and echo.

  2. Adapt to the speaker: For scenarios that involve strong accents or diverse dialects, choose a model that supports those dialects.

  3. Configure hotwords: Set hotwords for technical terms, proper nouns, and other specific terms. For more information, see Custom hotwords.

  4. Preserve context: Avoid segmenting audio into clips that are too short.
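As an illustration of step 3, a hotword vocabulary is typically attached through a request parameter. The parameter name `vocabulary_id` below is an assumption; confirm it in the Custom hotwords documentation for your model.

```python
def with_hotwords(request_kwargs, vocabulary_id):
    """Return a copy of the request kwargs with a hotword vocabulary attached.

    `vocabulary_id` is an assumed parameter name; verify it in the
    Custom hotwords documentation before use.
    """
    out = dict(request_kwargs)
    out["vocabulary_id"] = vocabulary_id
    return out

base = {"model": "paraformer-v2", "file_urls": ["https://example.com/talk.wav"]}  # hypothetical URL
print(with_hotwords(base, "vocab-demo-123")["vocabulary_id"])  # → vocab-demo-123
```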