Convert recorded audio to text using the Fun-ASR and Paraformer models. These models support single-file and batch processing for non-real-time scenarios.
Core features
Multilingual recognition: Supports Chinese (multiple dialects), English, Japanese, Korean, German, French, and Russian.
Broad format compatibility: Supports any sample rate and is compatible with various mainstream audio and video formats, such as AAC, WAV, and MP3.
Long audio file processing: Supports asynchronous transcription of files up to 12 hours in duration and 2 GB in size.
Singing recognition: Transcribes songs with background music (fun-asr and fun-asr-2025-11-07 only).
Rich recognition features: Includes speaker diarization, sensitive word filtering, sentence and word-level timestamps (all configurable).
Availability
Supported models:
International
In the International deployment mode, endpoints and data storage are located in the Singapore region, and model inference compute resources are dynamically scheduled globally (excluding the Chinese Mainland).
When you call the following models, use an API key from the Singapore region:
Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)
Chinese Mainland
In the Chinese Mainland deployment mode, endpoints and data storage are located in the Beijing region, and model inference compute resources are limited to Chinese Mainland.
When calling the following models, use an API key from the Beijing region:
Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25), fun-asr-mtl-2025-08-25 (snapshot)
Paraformer: paraformer-v2, paraformer-8k-v2
For more information, see Model list.
Model selection
Scenario | Recommended model | Reason |
Chinese recognition (meetings/live streaming) | fun-asr | Deeply optimized for Chinese dialects. Strong far-field VAD and noise robustness deliver higher accuracy in noisy environments with distant speakers. |
Multilingual recognition (international conferences) | fun-asr-mtl, paraformer-v2 | Single model handles multiple languages, simplifying development and deployment. |
Entertainment content analysis and caption generation | fun-asr | Unique singing recognition effectively transcribes songs and singing segments in live streams. Its noise robustness makes it ideal for complex media audio. |
Caption generation for news/interview programs | fun-asr, paraformer-v2 | Long audio processing, punctuation prediction, and timestamps enable structured caption generation. |
Far-field voice interaction for smart hardware | fun-asr | Far-field VAD accurately captures and recognizes distant user commands in noisy environments like homes and vehicles. |
For more information, see Model feature comparison.
Getting started
Get an API key and export it as an environment variable. If you use an SDK to make calls, install the DashScope SDK.
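For example, on Linux or macOS the setup can look like the following (sk-xxx is a placeholder for your own key):

```shell
# Install or upgrade the DashScope Python SDK.
pip install -U dashscope
# Make the API key available to the sample code below.
export DASHSCOPE_API_KEY="sk-xxx"  # replace sk-xxx with your actual API key
```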
Fun-ASR
File recognition uses asynchronous invocation due to file size and processing time. Submit tasks and then poll for results.
Python
from http import HTTPStatus
from urllib import request
import json
import os

import dashscope
from dashscope.audio.asr import Transcription

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# Submit the transcription task.
task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    # language_hints is an optional parameter that specifies the language of the audio to be recognized. For the value range, see the API reference.
    language_hints=['zh', 'en']
)

# Block until the task completes, then fetch the result.
transcription_response = Transcription.wait(task=task_response.output.task_id)
if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)
Java
import com.alibaba.dashscope.audio.asr.transcription.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1.
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("fun-asr")
                        // language_hints is an optional parameter that specifies the language of the audio to be recognized. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block until the task completes, then get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Download and print each transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && !taskResultList.isEmpty()) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()));
                    Gson gson = new GsonBuilder().setPrettyPrinting().create();
                    JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                    System.out.println(gson.toJson(jsonResult));
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}
Paraformer
File recognition uses asynchronous invocation due to file size and processing time. Submit tasks and then poll for results.
Python
from http import HTTPStatus
from urllib import request
import json
import os

import dashscope
from dashscope.audio.asr import Transcription

# To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# Submit the transcription task.
task_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    # language_hints is an optional parameter that specifies the language of the audio to be recognized. It is supported only by paraformer-v2 in the Paraformer series. For the value range, see the API reference.
    language_hints=['zh', 'en']
)

# Block until the task completes, then fetch the result.
transcription_response = Transcription.wait(task=task_response.output.task_id)
if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)
Java
import com.alibaba.dashscope.audio.asr.transcription.*;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model("paraformer-v2")
                        // language_hints is an optional parameter that specifies the language of the audio to be recognized. It is supported only by paraformer-v2 in the Paraformer series. For the value range, see the API reference.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block until the task completes, then get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Download and print each transcription result.
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && !taskResultList.isEmpty()) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()));
                    Gson gson = new GsonBuilder().setPrettyPrinting().create();
                    JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                    System.out.println(gson.toJson(jsonResult));
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}
API reference
Model feature comparison
Feature | Fun-ASR | Paraformer |
Languages | Varies by model; see Model list. | Varies by model; see Model list. |
Audio formats | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
Sample rate | Any | Varies by model; see Model list. |
Sound channel | Any | |
Input format | Publicly accessible URL of the file to be recognized. Up to 100 audio files are supported. | |
Audio size/duration | Each audio file can be up to 2 GB in size and 12 hours in duration. | |
Emotion recognition | ||
Timestamp | Always on | Off by default, can be enabled |
Punctuation prediction | Always on | |
Hotwords | ||
ITN | Always on | |
Singing recognition | This feature is available only for the fun-asr and fun-asr-2025-11-07 models. | |
Noise rejection | Always on | |
Sensitive word filtering | By default, filters content from the Alibaba Cloud Model Studio sensitive word list. Custom filtering is required for other content. | |
Speaker diarization | Off by default, can be enabled | |
Filler word filtering | Off by default, can be enabled | |
VAD | Always on | |
Rate limiting (RPS) | Job submission API: 10; task query API: 20 | Job submission API: 20; task query API: 20 |
Connection type | DashScope: Java/Python SDK, RESTful API | |
Price | International: $0.000035/second Chinese Mainland: $0.000032/second | Chinese Mainland: $0.000012/second |
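Given the task-query rate limits above, a client that polls manually (instead of using the SDK's blocking wait call shown earlier) can throttle itself with exponential backoff. The query callable below is a stand-in for illustration, not a DashScope API:

```python
# Sketch: poll a transcription task with exponential backoff so repeated status
# checks stay well under the task-query rate limit. `query` is a stand-in
# callable that returns the task status string for a task ID.
import time

def poll_until_done(query, task_id, initial_delay=1.0, max_delay=30.0, timeout=600.0):
    """Call query(task_id) until it reports a terminal status or the timeout expires."""
    delay = initial_delay
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = query(task_id)
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # back off to reduce query pressure
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")

# Demo with a fake status sequence standing in for real task queries.
statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(poll_until_done(lambda task_id: next(statuses), "demo-task", initial_delay=0.01))  # prints SUCCEEDED
```

With the SDK, `Transcription.wait` already blocks for you; this pattern is only needed if you implement polling yourself over the RESTful API.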
FAQ
Q: How can I improve recognition accuracy?
Key factors:
Audio quality: Recording device quality, sample rate, and environmental noise directly impact recognition accuracy. Use high-quality audio.
Speaker characteristics: Pitch, speech rate, accent, and dialect variations (especially rare dialects or heavy accents) increase recognition difficulty.
Language and vocabulary: Mixed languages, jargon, or slang reduce recognition accuracy. Configure hotwords for technical terms.
Contextual understanding: Without sufficient context, semantically ambiguous phrases are more likely to be misrecognized.
Optimization methods:
Optimize audio quality: Use high-performance microphones at the recommended sample rate. Minimize noise and echo.
Adapt to the speaker: Choose dialect-specific models for strong accents or regional speech.
Configure hotwords: Add technical terms, proper nouns, and domain-specific vocabulary. See Customize hotwords for details.
Preserve context: Avoid overly short audio segments.
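As an example of using timestamps for caption generation, the sketch below converts sentence-level timestamps into SRT captions. The "transcripts"/"sentences" layout here is an assumed example shape; consult the API reference for the exact result schema of your model.

```python
# Sketch: turn sentence-level timestamps from a transcription result into SRT
# captions. The result layout used here is an assumption for illustration.
import json

def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{milli:03d}"

def result_to_srt(result: dict) -> str:
    """Render numbered SRT caption blocks from an assumed result structure."""
    lines = []
    index = 1
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            lines.append(str(index))
            lines.append(f"{ms_to_srt(sentence['begin_time'])} --> {ms_to_srt(sentence['end_time'])}")
            lines.append(sentence["text"])
            lines.append("")
            index += 1
    return "\n".join(lines)

# Hypothetical sample result with sentence-level timestamps in milliseconds.
sample = {
    "transcripts": [{
        "sentences": [
            {"begin_time": 0, "end_time": 1500, "text": "Hello world."},
            {"begin_time": 1500, "end_time": 4200, "text": "This is a caption."},
        ]
    }]
}
print(result_to_srt(sample))
```

In practice you would load the JSON downloaded from `transcription_url` (as in the examples above) instead of the inline sample.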