
Alibaba Cloud Model Studio: Audio file recognition - Fun-ASR/Paraformer

Last Updated:Jan 16, 2026

The Fun-ASR/Paraformer audio file recognition models convert recorded audio into text. They support both single-file and batch processing and are suitable for scenarios that do not require real-time results.

Core features

  • Multilingual recognition: Supports multiple languages, including Chinese (with various dialects), English, Japanese, Korean, German, French, and Russian.

  • Broad format compatibility: Supports any sample rate (except paraformer-8k-v2, which expects 8 kHz audio) and is compatible with mainstream audio and video formats such as AAC, WAV, and MP3.

  • Long audio file processing: Supports asynchronous transcription for a single audio file up to 12 hours in duration and 2 GB in size.

  • Singing recognition: Transcribes entire songs, even with background music. This feature is available only for the fun-asr and fun-asr-2025-11-07 models.

  • Rich recognition features: Provides configurable features such as speaker diarization, sensitive word filtering, sentence-level and word-level timestamps, and hotword enhancement to meet specific requirements.
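As a concrete illustration of how these optional features map onto a request, the sketch below assembles keyword arguments for a transcription call. The parameter names `diarization_enabled` and `speaker_count` are assumptions modeled on the DashScope transcription API, not confirmed here; check the API reference for your model before using them.

```python
def build_transcription_kwargs(model, file_urls, diarize=False, speaker_count=2):
    """Assemble keyword arguments for a file-transcription request.

    diarization_enabled / speaker_count are assumed parameter names;
    verify them against the API reference before relying on them.
    """
    kwargs = {"model": model, "file_urls": list(file_urls)}
    if diarize:
        kwargs["diarization_enabled"] = True      # speaker diarization is off by default
        kwargs["speaker_count"] = speaker_count   # expected number of speakers
    return kwargs

# Hypothetical usage with a placeholder file URL.
kwargs = build_transcription_kwargs(
    "paraformer-v2", ["https://example.com/meeting.wav"], diarize=True
)
print(sorted(kwargs))  # → ['diarization_enabled', 'file_urls', 'model', 'speaker_count']
```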

Availability

Supported models:

International

In the International deployment mode, endpoints and data storage are located in the Singapore region, and model inference compute resources are dynamically scheduled globally (excluding Mainland China).

When you call the following models, select an API key from the Singapore region:

  • Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)

Mainland China

In the Mainland China deployment mode, endpoints and data storage are located in the Beijing region, and model inference compute resources are limited to Mainland China.

When you call the following models, use an API key from the Beijing region:

  • Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)

  • Paraformer: paraformer-v2, paraformer-8k-v2

For more information, see Model list.

Model selection

| Scenario | Recommended model | Reason |
| --- | --- | --- |
| Chinese recognition (meetings, live streaming) | fun-asr | Deeply optimized for Chinese, covering multiple dialects. Strong far-field Voice Activity Detection (VAD) and noise robustness make it suitable for real-world scenarios with noise or multiple distant speakers, resulting in higher accuracy. |
| Multilingual recognition (international conferences) | fun-asr-mtl, paraformer-v2 | A single model can handle multiple languages, which simplifies development and deployment. |
| Entertainment content analysis and caption generation | fun-asr | Unique singing recognition transcribes songs and singing segments in live streams; combined with its noise robustness, it is well suited to complex media audio. |
| Caption generation for news and interview programs | fun-asr, paraformer-v2 | Long audio processing, punctuation prediction, and timestamps allow direct generation of structured captions. |
| Far-field voice interaction for smart hardware | fun-asr | Far-field VAD is specially optimized to capture and recognize user commands from a distance in noisy environments such as homes and vehicles. |

For more information, see Model feature comparison.

Getting started

The following sections provide sample code for API calls.

You must obtain an API key and set it as an environment variable. If you use an SDK to make calls, you must also install the DashScope SDK.
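On Linux or macOS, for example, the key can be exported for the current shell session (replace sk-xxx with your actual API key):

```shell
# Set the DashScope API key for the current shell session.
export DASHSCOPE_API_KEY="sk-xxx"

# Confirm the variable is set and visible to child processes.
echo "$DASHSCOPE_API_KEY"  # prints sk-xxx
```

The variable lasts only for the current session; add the export line to your shell profile (such as ~/.bashrc or ~/.zshrc) to make it persistent. On Windows, use `set` (current session) or `setx` (persistent) instead.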

Fun-ASR

Because audio and video files are often large, file transfer and speech recognition can take a long time. The file recognition API is therefore asynchronous: you first submit a transcription task, and after it completes you call the query API to retrieve the recognition results.

Python

from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)

Java

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("fun-asr")
                        // language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete and get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Get the transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && !taskResultList.isEmpty()) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    // Read and pretty-print the JSON result; try-with-resources closes the reader.
                    try (BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
                        Gson gson = new GsonBuilder().setPrettyPrinting().create();
                        JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                        System.out.println(gson.toJson(jsonResult));
                    }
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}

The complete recognition result is printed to the console in JSON format. The result includes the transcribed text and the start and end times of the text in the audio or video file, specified in milliseconds.

  • First result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 3834
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 2480,
                "text": "Hello World, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 760,
                        "end_time": 3240,
                        "text": "Hello World, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 760,
                                "end_time": 1000,
                                "text": "Hello",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1000,
                                "end_time": 1120,
                                "text": " World",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 1400,
                                "end_time": 1920,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1920,
                                "end_time": 2520,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2520,
                                "end_time": 2840,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2840,
                                "end_time": 3240,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
  • Second result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 4726
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 3800,
                "text": "Hello World, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 680,
                        "end_time": 4480,
                        "text": "Hello World, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 680,
                                "end_time": 960,
                                "text": "Hello",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 960,
                                "end_time": 1080,
                                "text": " World",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 1480,
                                "end_time": 2160,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2160,
                                "end_time": 3080,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3080,
                                "end_time": 3520,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3520,
                                "end_time": 4480,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
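Because each sentence carries `begin_time` and `end_time` in milliseconds, results like the ones above can be converted directly into caption formats. The following sketch (an illustration, not part of the SDK) turns a transcript's `sentences` array into SubRip (SRT) text:

```python
def ms_to_srt(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{milli:03d}"

def sentences_to_srt(sentences):
    """Build SRT caption text from a transcript's `sentences` array."""
    blocks = []
    for i, sent in enumerate(sentences, start=1):
        blocks.append(f"{i}\n{ms_to_srt(sent['begin_time'])} --> "
                      f"{ms_to_srt(sent['end_time'])}\n{sent['text']}")
    return "\n\n".join(blocks)

# Example: the single sentence from the first result above.
sentences = [{"begin_time": 760, "end_time": 3240,
              "text": "Hello World, this is Alibaba Speech Lab."}]
print(sentences_to_srt(sentences))
# 1
# 00:00:00,760 --> 00:00:03,240
# Hello World, this is Alibaba Speech Lab.
```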

Paraformer

Because audio and video files are often large, file transfer and speech recognition can take a long time. The file recognition API is therefore asynchronous: you first submit a transcription task, and after it completes you call the query API to retrieve the recognition results.

Python

from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json


# To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. This parameter is supported only by the paraformer-v2 model of the Paraformer series. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)

Java

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("paraformer-v2")
                        // language_hints is an optional parameter that specifies the language code of the audio to be recognized. This parameter is supported only by the paraformer-v2 model of the Paraformer series. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete and get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Get the transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && !taskResultList.isEmpty()) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    // Read and pretty-print the JSON result; try-with-resources closes the reader.
                    try (BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
                        Gson gson = new GsonBuilder().setPrettyPrinting().create();
                        JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                        System.out.println(gson.toJson(jsonResult));
                    }
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}

The complete recognition result is printed to the console in JSON format. The result includes the transcribed text and the start and end times of the text in the audio or video file, specified in milliseconds.

  • First result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 4726
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 4720,
                "text": "Hello world, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 0,
                        "end_time": 4720,
                        "text": "Hello world, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 0,
                                "end_time": 629,
                                "text": "Hello ",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 629,
                                "end_time": 944,
                                "text": "world",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 944,
                                "end_time": 1888,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1888,
                                "end_time": 3146,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3146,
                                "end_time": 3776,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3776,
                                "end_time": 4720,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
  • Second result

    {
        "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
        "properties": {
            "audio_format": "pcm_s16le",
            "channels": [
                0
            ],
            "original_sampling_rate": 16000,
            "original_duration_in_milliseconds": 3834
        },
        "transcripts": [
            {
                "channel_id": 0,
                "content_duration_in_milliseconds": 3720,
                "text": "Hello word, this is Alibaba Speech Lab.",
                "sentences": [
                    {
                        "begin_time": 100,
                        "end_time": 3820,
                        "text": "Hello word, this is Alibaba Speech Lab.",
                        "sentence_id": 1,
                        "words": [
                            {
                                "begin_time": 100,
                                "end_time": 596,
                                "text": "Hello ",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 596,
                                "end_time": 844,
                                "text": "word",
                                "punctuation": ", "
                            },
                            {
                                "begin_time": 844,
                                "end_time": 1588,
                                "text": "this is",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 1588,
                                "end_time": 2580,
                                "text": "Alibaba",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 2580,
                                "end_time": 3076,
                                "text": "Speech",
                                "punctuation": ""
                            },
                            {
                                "begin_time": 3076,
                                "end_time": 3820,
                                "text": "Lab",
                                "punctuation": "."
                            }
                        ]
                    }
                ]
            }
        ]
    }
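When only the text matters, the per-channel transcripts can be flattened with a small helper. This is an illustrative snippet that assumes the result structure shown above:

```python
def extract_texts(result):
    """Return the full transcribed text of each channel in a parsed result."""
    return [t["text"] for t in result.get("transcripts", [])]

# Minimal result mirroring the structure printed above.
result = {
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "transcripts": [{"channel_id": 0, "text": "Hello word, this is Alibaba Speech Lab."}],
}
print(extract_texts(result))  # → ['Hello word, this is Alibaba Speech Lab.']
```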

API reference

Model feature comparison

| Feature | Fun-ASR | Paraformer |
| --- | --- | --- |
| Supported languages | Varies by model. fun-asr, fun-asr-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), Mandarin accents from regions such as Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong, and Taiwan (including accents from Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, and Japanese. fun-asr-2025-08-25: Chinese (Mandarin) and English. fun-asr-mtl, fun-asr-mtl-2025-08-25: Chinese (Mandarin and Cantonese), English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish. | Varies by model. paraformer-v2: Chinese (Mandarin, Cantonese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Jiangxi, Yunnan, Shanghai), English, Japanese, Korean, German, French, and Russian. paraformer-8k-v2: Chinese (Mandarin). |
| Supported audio formats | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
| Sample rate | Any | paraformer-v2: any; paraformer-8k-v2: 8 kHz |
| Sound channel | Any | Any |
| Input format | Publicly accessible URL of the file to be recognized. Up to 100 audio files are supported. | Publicly accessible URL of the file to be recognized. Up to 100 audio files are supported. |
| Audio size/duration | Each audio file cannot exceed 2 GB in size or 12 hours in duration. | Each audio file cannot exceed 2 GB in size or 12 hours in duration. |
| Emotion recognition | Not supported | Not supported |
| Timestamp | Supported (always on) | Supported (off by default; can be enabled) |
| Punctuation prediction | Supported (always on) | Supported (always on) |
| Hotwords | Supported (configurable) | Supported (configurable) |
| ITN (inverse text normalization) | Supported (always on) | Supported (always on) |
| Singing recognition | Supported (fun-asr and fun-asr-2025-11-07 only) | Not supported |
| Noise rejection | Supported (always on) | Supported (always on) |
| Sensitive word filtering | Supported (filters the Alibaba Cloud Model Studio sensitive word list by default; custom filtering is required for other content) | Supported (filters the Alibaba Cloud Model Studio sensitive word list by default; custom filtering is required for other content) |
| Speaker diarization | Supported (off by default; can be enabled) | Supported (off by default; can be enabled) |
| Filler word filtering | Supported (off by default; can be enabled) | Supported (off by default; can be enabled) |
| VAD | Supported (always on) | Supported (always on) |
| Rate limiting (RPS) | Job submission API: 10; task query API: 20 | Job submission API: 20; task query API: 20 |
| Connection type | DashScope: Java/Python SDK, RESTful API | DashScope: Java/Python SDK, RESTful API |
| Price | International: $0.000035/second; Mainland China: $0.000032/second | Mainland China: $0.000012/second |
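Billing is per second of audio, so costs can be estimated directly from the rates above. A quick sketch using the International Fun-ASR rate:

```python
RATE_PER_SECOND = 0.000035  # USD, International Fun-ASR rate listed above

def estimate_cost(duration_seconds, rate=RATE_PER_SECOND):
    """Return the estimated transcription cost in USD for an audio duration."""
    return duration_seconds * rate

# One hour of audio at the International Fun-ASR rate.
print(f"${estimate_cost(3600):.3f}")  # → $0.126
```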

FAQ

Q: How can I improve recognition accuracy?

Recognition accuracy depends on several interacting factors; identify which ones affect your audio, then apply the matching optimization method.

Key factors include the following:

  1. Sound quality: The quality of the recording device, the sample rate, and environmental noise affect audio clarity. High-quality audio is essential for accurate recognition.

  2. Speaker characteristics: Differences in pitch, speech rate, accent, and dialect can make recognition more difficult, especially for rare dialects or heavy accents.

  3. Language and vocabulary: Mixed languages, professional jargon, or slang can make recognition more difficult. You can configure hotwords to optimize recognition for these cases.

  4. Contextual understanding: Lack of context can lead to semantic ambiguity, especially in situations where context is necessary for correct recognition.

Optimization methods:

  1. Optimize audio quality: Use high-performance microphones and devices that support the recommended sample rate. Reduce environmental noise and echo.

  2. Adapt to the speaker: For scenarios that involve strong accents or diverse dialects, choose a model that supports those dialects.

  3. Configure hotwords: Set hotwords for technical terms, proper nouns, and other specific terms. For more information, see Custom hotwords.

  4. Preserve context: Avoid segmenting audio into clips that are too short.
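As an illustration of step 3, a hotword vocabulary is typically attached through a request parameter. The parameter name `vocabulary_id` below is an assumption; confirm it in the Custom hotwords documentation for your model.

```python
def with_hotwords(request_kwargs, vocabulary_id):
    """Return a copy of the request kwargs with a hotword vocabulary attached.

    `vocabulary_id` is an assumed parameter name; verify it in the
    Custom hotwords documentation before use.
    """
    out = dict(request_kwargs)
    out["vocabulary_id"] = vocabulary_id
    return out

base = {"model": "paraformer-v2", "file_urls": ["https://example.com/talk.wav"]}  # hypothetical URL
print(with_hotwords(base, "vocab-demo-123")["vocabulary_id"])  # → vocab-demo-123
```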