The Fun-ASR/Paraformer audio file recognition models convert recorded audio into text. They support both single-file and batch processing and are suitable for scenarios that do not require real-time results.
Core features
Multilingual recognition: Supports multiple languages, including Chinese (with various dialects), English, Japanese, Korean, German, French, and Russian.
Broad format compatibility: Supports any sample rate and is compatible with various mainstream audio and video formats, such as AAC, WAV, and MP3.
Long audio file processing: Supports asynchronous transcription for a single audio file up to 12 hours in duration and 2 GB in size.
Singing recognition: Transcribes entire songs, even with background music. This feature is available only for the fun-asr and fun-asr-2025-11-07 models.
Rich recognition features: Provides configurable features such as speaker diarization, sensitive word filtering, sentence-level and word-level timestamps, and hotword enhancement to meet specific requirements.
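Optional features such as speaker diarization are turned on by passing extra parameters when you submit a recognition task. The following minimal Python sketch shows the pattern; API key and endpoint setup are shown in the Getting started examples below, and the parameter names diarization_enabled and speaker_count are assumptions modeled on the Paraformer parameters, so confirm them in the API reference for the model you use.
from dashscope.audio.asr import Transcription

# A minimal sketch: optional features are passed as extra parameters to
# Transcription.async_call. The names diarization_enabled and speaker_count are
# assumptions; check the API reference for your model before relying on them.
task = Transcription.async_call(
    model='paraformer-v2',
    # Placeholder; replace with a publicly accessible URL of your own file.
    file_urls=['https://example.com/meeting.wav'],
    diarization_enabled=True,  # speaker diarization: off by default, can be enabled
    speaker_count=2            # assumed hint for the expected number of speakers
)
print(task.output.task_id)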
Availability
Supported models:
International
In the International deployment mode, endpoints and data storage are located in the Singapore region, and model inference compute resources are dynamically scheduled globally (excluding Mainland China).
When you call the following models, use an API key from the Singapore region:
Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)
Mainland China
In the Mainland China deployment mode, endpoints and data storage are located in the Beijing region, and model inference compute resources are limited to Mainland China.
When you call the following models, use an API key from the Beijing region:
Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)
Paraformer: paraformer-v2, paraformer-8k-v2
For more information, see Model list.
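The endpoint and the API key must come from the same region. The following minimal Python sketch pairs them, using the two base URLs that appear in the code examples later in this topic; the region variable is only an illustrative switch for this sketch, not a DashScope setting.
import os
import dashscope

# Illustrative region switch for this sketch only.
region = os.getenv("MODEL_STUDIO_REGION", "intl")

if region == "intl":
    # International deployment: Singapore endpoint, paired with a Singapore-region API key.
    dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"
else:
    # Mainland China deployment: Beijing endpoint, paired with a Beijing-region API key.
    dashscope.base_http_api_url = "https://dashscope.aliyuncs.com/api/v1"

dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")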
Model selection
Scenario | Recommended model | Reason |
Chinese recognition (meetings/live streaming) | fun-asr | Deeply optimized for Chinese, covering multiple dialects. Strong far-field Voice Activity Detection (VAD) and noise robustness make it suitable for real-world scenarios with noise or multiple distant speakers, resulting in higher accuracy. |
Multilingual recognition (international conferences) | fun-asr-mtl, paraformer-v2 | A single model can handle multiple languages, which simplifies development and deployment. |
Entertainment content analysis and caption generation | fun-asr | Features unique singing recognition capabilities to effectively transcribe songs and singing segments in live streams. Combined with its noise robustness, it is ideal for processing complex media audio. |
Caption generation for news/interview programs | fun-asr, paraformer-v2 | Long audio processing, punctuation prediction, and timestamps allow for direct generation of structured captions. |
Far-field voice interaction for smart hardware | fun-asr | Far-field VAD is specially optimized to more accurately capture and recognize user commands from a distance in noisy environments such as homes and vehicles. |
For more information, see Model feature comparison.
Getting started
The following sections provide sample code for API calls.
You must obtain an API key and set it as an environment variable. If you call the API through an SDK, you must also install the DashScope SDK.
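Before running the samples, you can verify that the environment is ready with a quick Python check; the environment variable name is the one the samples read, and the SDK package name is dashscope.
import os

# Fail fast if the API key environment variable used by the samples is missing.
# Install the SDK first if needed: pip install dashscope
if not os.getenv("DASHSCOPE_API_KEY"):
    raise RuntimeError("Set the DASHSCOPE_API_KEY environment variable before running the examples.")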
Fun-ASR
Audio and video files are often large, so file transfer and recognition can take a long time. The file recognition API therefore works asynchronously: you first submit a transcription task, and after it completes, you retrieve the results through the query API.
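The examples below block on Transcription.wait until the task finishes. If you prefer to poll the query API yourself, the following minimal sketch does so with the SDK's Transcription.fetch query method; the polling interval and timeout are arbitrary values chosen for illustration.
import time
from dashscope.audio.asr import Transcription

def poll_transcription(task_id, interval_s=5, timeout_s=600):
    # Poll the query API until the task reaches a terminal status (a minimal sketch).
    waited = 0
    while waited < timeout_s:
        response = Transcription.fetch(task=task_id)
        if response.output.task_status in ('SUCCEEDED', 'FAILED'):
            return response
        time.sleep(interval_s)
        waited += interval_s
    raise TimeoutError(f'Transcription task {task_id} did not finish within {timeout_s} seconds')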
Python
from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json
# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
)
transcription_response = Transcription.wait(task=task_response.output.task_id)
if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)
Java
import com.alibaba.dashscope.audio.asr.transcription.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
public class Main {
    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("fun-asr")
                        // language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete and get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Get the transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && taskResultList.size() > 0) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()));
                    Gson gson = new GsonBuilder().setPrettyPrinting().create();
                    JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                    System.out.println(gson.toJson(jsonResult));
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}
Paraformer
Audio and video files are often large, so file transfer and recognition can take a long time. The file recognition API therefore works asynchronously: you first submit a transcription task, and after it completes, you retrieve the results through the query API.
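In these examples, each successful subtask returns a transcription_url that points to a JSON result file. The following minimal sketch pulls sentence text and timestamps out of such a file; the field names (transcripts, sentences, begin_time, end_time, text) are assumptions based on the documented result format and should be checked against the payload you actually receive.
import json
from urllib import request

def print_sentences(transcription_url):
    # Download a finished transcription and print each sentence with its timestamps.
    # Field names here are assumptions; verify them against the API reference.
    result = json.loads(request.urlopen(transcription_url).read().decode('utf8'))
    for transcript in result.get('transcripts', []):
        for sentence in transcript.get('sentences', []):
            print(f"[{sentence['begin_time']} ms - {sentence['end_time']} ms] {sentence['text']}")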
Python
from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json
# To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
task_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. This parameter is supported only by the paraformer-v2 model of the Paraformer series. For the value range, see the API reference.
)
transcription_response = Transcription.wait(task=task_response.output.task_id)
if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)
Java
import com.alibaba.dashscope.audio.asr.transcription.*;
import com.google.gson.*;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
public class Main {
    public static void main(String[] args) {
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("paraformer-v2")
                        // language_hints is an optional parameter that specifies the language code of the audio to be recognized. This parameter is supported only by the paraformer-v2 model of the Paraformer series. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete and get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Get the transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && taskResultList.size() > 0) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()));
                    Gson gson = new GsonBuilder().setPrettyPrinting().create();
                    JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                    System.out.println(gson.toJson(jsonResult));
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}
API reference
Model feature comparison
Feature | Fun-ASR | Paraformer |
Supported languages | Varies by model | Varies by model |
Supported audio formats | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, and wmv | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
Sample rate | Any | Varies by model |
Sound channel | Any | |
Input format | Publicly accessible URL of the file to be recognized. Up to 100 audio files are supported. | |
Audio size/duration | Each audio file cannot exceed 2 GB in size or 12 hours in duration. |
Emotion recognition | ||
Timestamp | Always on | Off by default, can be enabled |
Punctuation prediction | Always on | |
Hotwords | Configurable | |
ITN | Always on | |
Singing recognition | This feature is available only for the fun-asr and fun-asr-2025-11-07 models. | |
Noise rejection | Always on | |
Sensitive word filtering | By default, filters content from the Alibaba Cloud Model Studio sensitive word list. Custom filtering is required for other content. | |
Speaker diarization | Off by default, can be enabled | |
Filler word filtering | Off by default, can be enabled | |
VAD | Always on | |
Rate limiting (RPS) | Job submission API: 10; Task query API: 20 | Job submission API: 20; Task query API: 20 |
Connection type | DashScope: Java/Python SDK, RESTful API | |
Price | International: $0.000035/second; Mainland China: $0.000032/second | Mainland China: $0.000012/second |
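A single job submission accepts up to 100 file URLs, and submissions are rate limited as listed above. The following minimal sketch batches a larger file list and paces the calls with the same Transcription.async_call used in the examples; the batch size and interval are illustrative values to tune against your own quota.
import time
from dashscope.audio.asr import Transcription

def submit_in_batches(file_urls, model='fun-asr', batch_size=100, min_interval_s=0.2):
    # Submit up to 100 URLs per call and pace submissions to stay under the
    # job submission rate limit (0.2 s between calls is roughly 5 requests per second).
    task_ids = []
    for i in range(0, len(file_urls), batch_size):
        batch = file_urls[i:i + batch_size]
        response = Transcription.async_call(model=model, file_urls=batch)
        task_ids.append(response.output.task_id)
        time.sleep(min_interval_s)
    return task_ids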
FAQ
Q: How can I improve recognition accuracy?
Recognition accuracy depends on several factors. Identify which ones apply to your scenario and apply the corresponding optimization methods.
Key factors include the following:
Sound quality: The quality of the recording device, the sample rate, and environmental noise affect audio clarity. High-quality audio is essential for accurate recognition.
Speaker characteristics: Differences in pitch, speech rate, accent, and dialect can make recognition more difficult, especially for rare dialects or heavy accents.
Language and vocabulary: Mixed languages, professional jargon, or slang can make recognition more difficult. You can configure hotwords to optimize recognition for these cases.
Contextual understanding: Lack of context can lead to semantic ambiguity, especially in situations where context is necessary for correct recognition.
Optimization methods:
Optimize audio quality: Use high-performance microphones and devices that support the recommended sample rate. Reduce environmental noise and echo.
Adapt to the speaker: For scenarios that involve strong accents or diverse dialects, choose a model that supports those dialects.
Configure hotwords: Set hotwords for technical terms, proper nouns, and other specific terms. For more information, see Custom hotwords. A minimal sketch of passing a hotword vocabulary follows this list.
Preserve context: Avoid segmenting audio into clips that are too short.
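The following minimal sketch attaches a hotword vocabulary to a transcription job, assuming the vocabulary has already been created as described in Custom hotwords. The vocabulary_id parameter name and the placeholder ID are assumptions; confirm the exact parameter and workflow for your model in the API reference.
from dashscope.audio.asr import Transcription

# The vocabulary ID below is a placeholder, and the vocabulary_id parameter name
# is an assumption; see Custom hotwords and the API reference for the exact workflow.
task = Transcription.async_call(
    model='paraformer-v2',
    # Placeholder; replace with a publicly accessible URL of your own file.
    file_urls=['https://example.com/product-briefing.wav'],
    vocabulary_id='vocab-xxxxxxxx'
)
print(task.output.task_id)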