Alibaba Cloud Model Studio: Audio file recognition - Fun-ASR/Paraformer

Last Updated: Mar 24, 2026

Convert recorded audio to text using the Fun-ASR and Paraformer models. These models support single-file and batch processing for non-real-time scenarios.

Core features

  • Multilingual recognition: Supports Chinese (multiple dialects), English, Japanese, Korean, German, French, and Russian.

  • Broad format compatibility: Supports any sample rate and is compatible with various mainstream audio and video formats, such as AAC, WAV, and MP3.

  • Long audio file processing: Supports asynchronous transcription of files up to 12 hours in duration and 2 GB in size.

  • Singing recognition: Transcribes songs with background music (fun-asr and fun-asr-2025-11-07 only).

  • Rich recognition features: Includes speaker diarization, sensitive word filtering, sentence and word-level timestamps (all configurable).

Availability

Supported models:

International

In the International deployment mode, endpoints and data storage are located in the Singapore region, and model inference compute resources are dynamically scheduled globally (excluding Chinese Mainland).

When you call the following models, select an API key from the Singapore region:

  • Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)

Chinese Mainland

In the Chinese Mainland deployment mode, endpoints and data storage are located in the Beijing region, and model inference compute resources are limited to Chinese Mainland.

When you call the following models, use an API key from the Beijing region:

  • Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)

  • Paraformer: paraformer-v2, paraformer-8k-v2

For more information, see Model list.

Model selection

  • Chinese recognition (meetings/live streaming): fun-asr. Deeply optimized for Chinese dialects. Strong far-field VAD and noise robustness deliver higher accuracy in noisy environments with distant speakers.

  • Multilingual recognition (international conferences): fun-asr-mtl, paraformer-v2. A single model handles multiple languages, simplifying development and deployment.

  • Entertainment content analysis and caption generation: fun-asr. Unique singing recognition effectively transcribes songs and singing segments in live streams. Its noise robustness makes it ideal for complex media audio.

  • Caption generation for news/interview programs: fun-asr, paraformer-v2. Long audio processing, punctuation prediction, and timestamps enable structured caption generation.

  • Far-field voice interaction for smart hardware: fun-asr. Far-field VAD accurately captures and recognizes distant user commands in noisy environments like homes and vehicles.

For more information, see Model feature comparison.

Getting started

Get an API key and export the API key as an environment variable. If you use an SDK to make calls, install the DashScope SDK.
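Because every example below reads the key from the environment, it can help to fail fast when the variable is missing. A minimal sketch (the variable name DASHSCOPE_API_KEY matches the examples below; the helper name and error message are illustrative):

```python
import os

def require_api_key(name: str = "DASHSCOPE_API_KEY") -> str:
    """Return the API key from the environment, failing fast if it is missing."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export your Model Studio API key first, "
            "for example: export DASHSCOPE_API_KEY=sk-xxx"
        )
    return key
```

Calling `require_api_key()` at startup gives a clear error instead of an authentication failure deep inside the SDK call.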

Fun-ASR

File recognition uses asynchronous invocation due to file size and processing time. Submit tasks and then poll for results.

Python

from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)

Java

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("fun-asr")
                        // language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete and get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Get the transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && taskResultList.size() > 0) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()));
                    Gson gson = new GsonBuilder().setPrettyPrinting().create();
                    JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                    System.out.println(gson.toJson(jsonResult));
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}

Results are printed to the console as JSON and include the transcribed text with sentence- and word-level start/end timestamps in milliseconds.

  • First result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 3834
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 2480,
                "text": "Hello World, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 760,
                        "end_time": 3240,
                        "text": "Hello World, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 760,
                                "end_time": 1000,
                                "text": "Hello",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1000,
                                "end_time": 1120,
                                "text": " World",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 1400,
                                "end_time": 1920,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1920,
                                "end_time": 2520,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2520,
                                "end_time": 2840,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2840,
                                "end_time": 3240,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
  • Second result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 4726
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 3800,
                "text": "Hello World, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 680,
                        "end_time": 4480,
                        "text": "Hello World, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 680,
                                "end_time": 960,
                                "text": "Hello",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 960,
                                "end_time": 1080,
                                "text": " World",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 1480,
                                "end_time": 2160,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2160,
                                "end_time": 3080,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3080,
                                "end_time": 3520,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3520,
                                "end_time": 4480,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
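The sentence-level timestamps shown above are sufficient to generate captions. The following sketch converts one result dict (parsed from transcription_url, with the structure shown above) into SRT caption text; the function names and formatting choices are illustrative:

```python
def ms_to_srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{milli:03d}"

def transcript_to_srt(result: dict) -> str:
    """Build SRT caption text from a transcription result dict.

    Expects the shape shown above: result["transcripts"][n]["sentences"],
    where each sentence carries begin_time, end_time (ms), and text.
    """
    blocks = []
    index = 1
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            blocks.append(
                f"{index}\n"
                f"{ms_to_srt_time(sentence['begin_time'])} --> "
                f"{ms_to_srt_time(sentence['end_time'])}\n"
                f"{sentence['text']}\n"
            )
            index += 1
    return "\n".join(blocks)
```

Writing the returned string to a `.srt` file yields captions that most video players accept directly.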

Paraformer

File recognition uses asynchronous invocation due to file size and processing time. Submit tasks and then poll for results.

Python

from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json


# To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. This parameter is supported only by the paraformer-v2 model of the Paraformer series. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)

Java

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("paraformer-v2")
                        // language_hints is an optional parameter that specifies the language code of the audio to be recognized. This parameter is supported only by the paraformer-v2 model of the Paraformer series. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete and get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Get the transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && taskResultList.size() > 0) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()));
                    Gson gson = new GsonBuilder().setPrettyPrinting().create();
                    JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                    System.out.println(gson.toJson(jsonResult));
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}

Results are printed to the console as JSON and include the transcribed text with sentence- and word-level start/end timestamps in milliseconds.

  • First result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 4726
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 4720,
                "text": "Hello world, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 0,
                        "end_time": 4720,
                        "text": "Hello world, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 0,
                                "end_time": 629,
                                "text": "Hello ",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 629,
                                "end_time": 944,
                                "text": "world",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 944,
                                "end_time": 1888,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1888,
                                "end_time": 3146,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3146,
                                "end_time": 3776,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3776,
                                "end_time": 4720,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
  • Second result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 3834
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 3720,
                "text": "Hello word, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 100,
                        "end_time": 3820,
                        "text": "Hello word, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 100,
                                "end_time": 596,
                                "text": "Hello ",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 596,
                                "end_time": 844,
                                "text": "word",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 844,
                                "end_time": 1588,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1588,
                                "end_time": 2580,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2580,
                                "end_time": 3076,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3076,
                                "end_time": 3820,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
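When you submit several files in one task, it is often convenient to collapse the per-file results into a simple mapping from file URL to recognized text. A minimal sketch over result dicts of the shape shown above (the function name is illustrative):

```python
def texts_by_file(results: list[dict]) -> dict[str, str]:
    """Map each file_url to its full recognized text.

    Each result dict follows the shape shown above; multi-channel
    transcripts are joined with newlines.
    """
    out: dict[str, str] = {}
    for result in results:
        parts = [t.get("text", "") for t in result.get("transcripts", [])]
        out[result["file_url"]] = "\n".join(parts)
    return out
```

This keeps downstream processing (indexing, search, caption pipelines) independent of the order in which subtasks finish.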

API reference

Model feature comparison

  • Languages:
      • Fun-ASR (varies by model):
          • fun-asr and fun-asr-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, Jin); regional Mandarin accents including Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan; plus English and Japanese
          • fun-asr-2025-08-25: Chinese (Mandarin) and English
          • fun-asr-mtl and fun-asr-mtl-2025-08-25: Chinese (Mandarin and Cantonese), English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish
      • Paraformer (varies by model):
          • paraformer-v2: Chinese (Mandarin, Cantonese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Jiangxi, Yunnan, and Shanghai dialects), English, Japanese, Korean, German, French, and Russian
          • paraformer-8k-v2: Chinese (Mandarin only)

  • Audio formats (both): aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

  • Sample rate:
      • Fun-ASR: any
      • Paraformer: any for paraformer-v2; 8 kHz for paraformer-8k-v2

  • Sound channel: any

  • Input format: publicly accessible URL of the file to be recognized; up to 100 audio files per job

  • Audio size/duration: each audio file can be up to 2 GB in size and 12 hours in duration

  • Emotion recognition: not supported

  • Timestamps:
      • Fun-ASR: supported, always on
      • Paraformer: supported, off by default, can be enabled

  • Punctuation prediction: supported, always on

  • Hotwords: not supported

  • ITN: supported, always on

  • Singing recognition:
      • Fun-ASR: supported (fun-asr and fun-asr-2025-11-07 only)
      • Paraformer: not supported

  • Noise rejection: supported, always on

  • Sensitive word filtering: supported; by default, filters content from the Alibaba Cloud Model Studio sensitive word list. Custom filtering is required for other content.

  • Speaker diarization: supported, off by default, can be enabled

  • Filler word filtering: supported, off by default, can be enabled

  • VAD: supported, always on

  • Rate limiting (RPS):
      • Fun-ASR: job submission API: 10; task query API: 20
      • Paraformer: job submission API: 20; task query API: 20

  • Connection type: DashScope (Java/Python SDK, RESTful API)

  • Price:
      • Fun-ASR: International: $0.000035/second; Chinese Mainland: $0.000032/second
      • Paraformer: Chinese Mainland: $0.000012/second
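Because a single transcription job accepts at most 100 file URLs, larger batches must be split across several jobs. A minimal sketch of the chunking step; each yielded list can then be passed as file_urls to Transcription.async_call (MAX_FILES_PER_JOB reflects the limit stated above):

```python
from typing import Iterable, Iterator

MAX_FILES_PER_JOB = 100  # per-job file-count limit from the feature comparison above

def chunk_urls(urls: Iterable[str], size: int = MAX_FILES_PER_JOB) -> Iterator[list[str]]:
    """Yield lists of at most `size` URLs, preserving the input order."""
    batch: list[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

Submitting each chunk as its own job also keeps you within the job-submission rate limit if you pace the calls.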

FAQ

Q: How can I improve recognition accuracy?

Key factors:

  1. Audio quality: Recording device quality, sample rate, and environmental noise directly impact recognition accuracy. Use high-quality audio.

  2. Speaker characteristics: Pitch, speech rate, accent, and dialect variations (especially rare dialects or heavy accents) increase recognition difficulty.

  3. Language and vocabulary: Mixed languages, jargon, or slang reduce recognition accuracy. Configure hotwords for technical terms.

  4. Contextual understanding: Semantic ambiguity occurs without sufficient context.

Optimization methods:

  1. Optimize audio quality: Use high-performance microphones at the recommended sample rate. Minimize noise and echo.

  2. Adapt to the speaker: Choose dialect-specific models for strong accents or regional speech.

  3. Configure hotwords: Add technical terms, proper nouns, and domain-specific vocabulary. See Customize hotwords for details.

  4. Preserve context: Avoid overly short audio segments.