This topic describes the parameters and interfaces of the Fun-ASR audio file recognition Python SDK.
User guide: For more information about models and selection suggestions, see Audio file recognition - Fun-ASR/Paraformer.
Prerequisites
You have activated Alibaba Cloud Model Studio and created an API key. To prevent security risks, export the key as an environment variable instead of hard-coding it.
Note: For temporary access, or for strict control over high-risk operations (such as accessing or deleting sensitive data), use a temporary authentication token instead.
Compared with long-term API keys, temporary tokens are more secure (60-second lifespan) and reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
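As an illustration, a small stdlib-only helper can read the key from the environment and fail fast if it is missing. The helper name `load_api_key` is ours, not part of the SDK:

```python
import os

def load_api_key() -> str:
    """Return the Model Studio API key from the DASHSCOPE_API_KEY
    environment variable, failing loudly if it is not set."""
    key = os.getenv("DASHSCOPE_API_KEY")
    if not key:
        raise RuntimeError("Set the DASHSCOPE_API_KEY environment variable first.")
    return key
```

In practice the DashScope SDK typically picks up DASHSCOPE_API_KEY from the environment on its own; an explicit check like this simply surfaces a missing key earlier.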
Model availability
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding Chinese Mainland.
| Model | Version | Unit price | Free quota (Note) |
| --- | --- | --- | --- |
| fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | | |
| fun-asr-2025-08-25 | Snapshot | | |
| fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | | |
| fun-asr-mtl-2025-08-25 | Snapshot | | |
Languages supported:
fun-asr and fun-asr-2025-11-07: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Mandarin accents from the Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong-Taiwan regions are also supported, covering Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. English and Japanese are also supported.
fun-asr-2025-08-25: Mandarin and English.
fun-asr-mtl and fun-asr-mtl-2025-08-25: Mandarin, Cantonese, English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.
Sample rates supported: Any
Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
Chinese Mainland
In the Chinese Mainland deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Chinese Mainland.
| Model | Version | Unit price | Free quota (Note) |
| --- | --- | --- | --- |
| fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000032/second | No free quota |
| fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | | |
| fun-asr-2025-08-25 | Snapshot | | |
| fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | | |
| fun-asr-mtl-2025-08-25 | Snapshot | | |
Languages supported:
fun-asr and fun-asr-2025-11-07: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Mandarin accents from the Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong-Taiwan regions are also supported, covering Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. English and Japanese are also supported.
fun-asr-2025-08-25: Mandarin and English.
fun-asr-mtl and fun-asr-mtl-2025-08-25: Mandarin, Cantonese, English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.
Sample rates supported: Any
Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
Limitations
The input must be a publicly accessible file URL (HTTP/HTTPS), for example, https://your-domain.com/file.mp3. Local files and Base64 audio are not supported.
When using the SDK to access a file stored in OSS, you cannot use a temporary URL with the oss:// prefix.
When using the RESTful API to access a file stored in OSS, you can use a temporary URL with the oss:// prefix:
The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.
The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.
For production environments, use a stable storage service such as OSS to ensure long-term file availability and avoid rate limiting issues.
Specify the URLs using the file_urls parameter. A single request supports up to 100 URLs.
Audio formats
aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
Important: Because many audio and video formats and their variants exist, it is not technically feasible to test all of them, and the API cannot guarantee that every format is recognized correctly. Test your files to verify that you obtain the expected speech recognition results.
Audio sample rate: Any
Audio file size and duration
The maximum file size is 2 GB. The maximum duration is 12 hours. For files exceeding these limits, split or compress the file before uploading. For more information about best practices for file pre-processing, see Preprocess video files to improve file transcription efficiency (for audio file recognition scenarios).
Number of audio files for batch processing
A single request supports up to 100 file URLs.
Recognizable languages: see the per-model language lists under Model availability.
Getting started
The Transcription core class provides interfaces to submit tasks, wait for completion, and query results. Two recognition methods:
Asynchronous submission and synchronous waiting: Submit a task and block until it completes to get the result.
Asynchronous submission and asynchronous query: Submit a task and query the result when needed.
Asynchronous submission and synchronous waiting
Call the Transcription class's async_call() method with request parameters.
Note: Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, recognition proceeds at a significantly accelerated speed. Recognition results and their download URLs expire 24 hours after the task completes; expired tasks can no longer be queried.
Call Transcription.wait() to block until the task completes.
Task statuses are PENDING, RUNNING, SUCCEEDED, and FAILED. The wait method blocks while the status is PENDING or RUNNING and returns a TranscriptionResponse when the status reaches SUCCEEDED or FAILED.
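The submit-and-wait flow can be sketched as follows. The helper names check_file_urls and submit_and_wait are illustrative, not part of the SDK; only Transcription.async_call() and Transcription.wait() are SDK calls:

```python
# Minimal sketch of "asynchronous submission and synchronous waiting".
from typing import List

MAX_FILE_URLS = 100  # a single request supports up to 100 URLs

def check_file_urls(file_urls: List[str]) -> List[str]:
    """Validate the batch before submitting: public HTTP(S) URLs, at most 100."""
    if not 1 <= len(file_urls) <= MAX_FILE_URLS:
        raise ValueError(f"file_urls must contain 1 to {MAX_FILE_URLS} URLs")
    for url in file_urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not a publicly accessible HTTP(S) URL: {url}")
    return file_urls

def submit_and_wait(file_urls: List[str], model: str = "fun-asr"):
    """Submit a transcription task and block until it reaches a final state."""
    # Imported lazily so the validation helper above stays stdlib-only.
    from dashscope.audio.asr import Transcription

    task = Transcription.async_call(model=model,
                                    file_urls=check_file_urls(file_urls))
    # wait() blocks while the task is PENDING/RUNNING and returns a
    # TranscriptionResponse once the status is SUCCEEDED or FAILED.
    return Transcription.wait(task=task.output.task_id)
```

Remember that the result download URLs expire 24 hours after completion, so fetch the output promptly after wait() returns.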
Asynchronous submission and asynchronous query
Call the Transcription class's async_call() method with request parameters.
Note: Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, recognition proceeds at a significantly accelerated speed. Recognition results and their download URLs expire 24 hours after the task completes; expired tasks can no longer be queried.
Poll Transcription.fetch() until you get the final result.
Stop polling when the status is SUCCEEDED or FAILED. Each call returns a TranscriptionResponse.
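A polling loop over Transcription.fetch() might look like the sketch below. The helper names is_final and poll_until_done are ours; the polling interval is an arbitrary choice:

```python
import time

FINAL_STATUSES = ("SUCCEEDED", "FAILED")

def is_final(task_status: str) -> bool:
    """Polling stops once the task status is SUCCEEDED or FAILED."""
    return task_status in FINAL_STATUSES

def poll_until_done(task_id: str, interval_s: float = 10.0):
    """Call Transcription.fetch() repeatedly until the task is final."""
    from dashscope.audio.asr import Transcription  # lazy import: network call below

    while True:
        response = Transcription.fetch(task=task_id)
        if is_final(response.output.task_status):
            return response
        time.sleep(interval_s)  # avoid hammering the query endpoint
```

Choose an interval appropriate to your file durations; long files can stay in RUNNING for a while, and overly aggressive polling risks rate limiting.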
Request parameters
Set request parameters in Transcription.async_call().
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
model | str | - | Yes | Model for audio/video transcription. See Model availability. |
file_urls | list[str] | - | Yes | URLs of audio/video files to transcribe (HTTP/HTTPS). Up to 100 URLs per request. If your audio files are stored in OSS, the SDK does not support temporary URLs that start with the oss:// prefix. |
channel_id | list[int] | [0] | No | Audio track indexes to recognize in multi-track files (0-indexed). Examples: [0] = first track only, [0, 1] = first and second tracks. Default: [0]. Important Each track is billed separately. Example: [0, 1] = two charges per file. |
| special_word_filter | str | - | No | Specifies sensitive words to process during speech recognition; different words can be given different processing methods. If this parameter is not passed, the system's built-in sensitive-word filtering logic is enabled: words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with asterisks (*). If this parameter is passed, custom sensitive-word processing policies can be implemented:
The value of this parameter must be a JSON string. |
| diarization_enabled | bool | False | No | Automatic speaker diarization, disabled by default. This feature applies to single-channel audio only (multi-channel audio is not supported). When enabled, recognition results include the speaker_id field, which distinguishes speakers. |
| speaker_count | int | - | No | Reference speaker count (integer, 2-100). Only applies when diarization_enabled is set to true. Default: auto-detected. This parameter helps guide the algorithm but does not guarantee the exact count. |
language_hints | list[str] | - | No | Language codes for recognition. Leave unset for automatic language detection. The system reads only the first value in the array and ignores any additional values. The language codes supported by different models are as follows:
|
speech_noise_threshold | float | - | No |
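A sketch of assembling these request parameters before calling Transcription.async_call(). The helper build_request is illustrative; it only includes optional fields when they are actually set, mirroring the table above:

```python
from typing import List, Optional

def build_request(file_urls: List[str],
                  model: str = "fun-asr",
                  diarization: bool = False,
                  speaker_count: Optional[int] = None,
                  language_hints: Optional[List[str]] = None) -> dict:
    """Assemble keyword arguments for Transcription.async_call()."""
    kwargs = {"model": model, "file_urls": file_urls}
    if diarization:
        kwargs["diarization_enabled"] = True
        if speaker_count is not None:
            # speaker_count only applies with diarization and must be 2-100.
            if not 2 <= speaker_count <= 100:
                raise ValueError("speaker_count must be between 2 and 100")
            kwargs["speaker_count"] = speaker_count
    if language_hints:
        # The service reads only the first entry in the array.
        kwargs["language_hints"] = language_hints
    return kwargs
```

Usage would then be `Transcription.async_call(**build_request(...))`, keeping validation separate from the network call.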
Response results
TranscriptionResponse
TranscriptionResponse encapsulates the basic information of a task, such as task_id and task_status, and the execution result. The execution result corresponds to the output property. For more information, see TranscriptionOutput.
Parameters to note:
| Parameter | Description |
| --- | --- |
status_code | HTTP request status code. |
| code | The error code. |
| message | The error message. |
task_id | Task ID. |
| task_status | Task status. The valid values are PENDING, RUNNING, SUCCEEDED, and FAILED. If a task contains multiple subtasks, the status of the entire task is marked as SUCCEEDED as long as at least one subtask succeeds; check subtask_status for the result of each subtask. |
results | Recognition results of subtasks. |
| subtask_status | Subtask status. The valid values include SUCCEEDED and FAILED. |
file_url | The URL of the audio file to be recognized. |
| transcription_url | The URL of the audio recognition result. The recognition result is saved as a JSON file, which you can download from the URL specified by transcription_url. |
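Since transcription_url points at a JSON file, the result can be fetched with the standard library alone. The helper download_transcription is illustrative:

```python
import json
import urllib.request

def download_transcription(transcription_url: str) -> dict:
    """Fetch and parse the JSON recognition result behind transcription_url.
    Download promptly: the URL expires 24 hours after the task completes."""
    with urllib.request.urlopen(transcription_url) as resp:
        return json.loads(resp.read().decode("utf-8"))
```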
TranscriptionOutput
TranscriptionOutput is the output property of TranscriptionResponse, containing task execution results.
Key parameters:
| Parameter | Description |
| --- | --- |
| code | The error code. You can use this field together with the message field to troubleshoot the issue. |
| message | The error message. You can use this field together with the code field to troubleshoot the issue. |
task_id | Task ID. |
| task_status | Task status. The valid values are PENDING, RUNNING, SUCCEEDED, and FAILED. If a task contains multiple subtasks, the status of the entire task is marked as SUCCEEDED as long as at least one subtask succeeds; check subtask_status for the result of each subtask. |
results | Recognition results of subtasks. |
| subtask_status | Subtask status. The valid values include SUCCEEDED and FAILED. |
file_url | The URL of the audio file to be recognized. |
| transcription_url | The URL of the audio recognition result. The recognition result is saved as a JSON file, which you can download from the URL specified by transcription_url. |
Recognition result description
The recognition result is saved as a JSON file.
The key parameters are as follows:
| Parameter | Type | Description |
| --- | --- | --- |
audio_format | string | The format of the audio in the source file. |
channels | array[integer] | The audio track index information in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on. |
original_sampling_rate | integer | The sample rate of the audio in the source file (Hz). |
original_duration_in_milliseconds | integer | The original duration of the audio in the source file (ms). |
channel_id | integer | The index of the transcribed audio track, starting from 0. |
content_duration | integer | The duration of the content in the audio track that is identified as speech (ms). Important Billing is based on speech content duration only (non-speech parts are not metered). Speech duration is typically shorter than total audio duration. The AI-based speech detection may have minor discrepancies. |
transcript | string | The paragraph-level speech transcription result. |
sentences | array | The sentence-level speech transcription result. |
words | array | The word-level speech transcription result. |
begin_time | integer | Start timestamp (ms). |
end_time | integer | End timestamp (ms). |
text | string | The speech transcription result. |
speaker_id | integer | The index of the current speaker, starting from 0. This is used to distinguish different speakers. This field is displayed in the recognition result only when speaker diarization is enabled. |
punctuation | string | The predicted punctuation mark after the word, if any. |
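The sentence- and word-level fields above can be flattened for downstream use. The helper sentence_rows is illustrative, and the top-level "transcripts" list is an assumption about the downloaded file's layout (verify against an actual result file):

```python
def sentence_rows(result: dict) -> list:
    """Flatten a recognition-result JSON into (begin_time, end_time,
    speaker_id, text) tuples, one per sentence."""
    rows = []
    for transcript in result.get("transcripts", []):  # assumed top-level key
        for sentence in transcript.get("sentences", []):
            rows.append((sentence.get("begin_time"),
                         sentence.get("end_time"),
                         sentence.get("speaker_id"),  # present only with diarization
                         sentence.get("text")))
    return rows
```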
Key interfaces
Core class (Transcription)
Import: from dashscope.audio.asr import Transcription
| Member method | Method signature | Description |
| --- | --- | --- |
async_call | | Asynchronously submits a speech recognition task. |
wait | | Blocks the current thread until the asynchronous task is complete (task status is This method returns a TranscriptionResponse. |
fetch | | Asynchronously queries the execution result of the current task. This method returns a TranscriptionResponse. |
Error codes
If an error occurs, see Error messages to troubleshoot the issue.
If a task contains multiple subtasks, the overall task status is marked as SUCCEEDED if at least one subtask succeeds. You must check the subtask_status field to determine the result of each subtask.
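A small helper can make this per-file check explicit. The name split_subtasks is illustrative; the field names match the response format documented above:

```python
def split_subtasks(task_output: dict):
    """Partition per-file results by subtask_status. Even when the overall
    task_status is SUCCEEDED, individual files may have FAILED."""
    succeeded, failed = [], []
    for item in task_output.get("results", []):
        if item.get("subtask_status") == "SUCCEEDED":
            succeeded.append(item)
        else:
            failed.append(item)
    return succeeded, failed
```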
Example of an error response:
{
"task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
"task_status": "SUCCEEDED",
"submit_time": "2024-12-16 16:30:59.170",
"scheduled_time": "2024-12-16 16:30:59.204",
"end_time": "2024-12-16 16:31:02.375",
"results": [
{
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
"transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
"subtask_status": "SUCCEEDED"
},
{
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
"code": "InvalidFile.DownloadFailed",
"message": "The audio file cannot be downloaded.",
"subtask_status": "FAILED"
}
],
"task_metrics": {
"TOTAL": 2,
"SUCCEEDED": 1,
"FAILED": 1
}
}
FAQ
Features
Q: Is audio in Base64 encoding supported?
This service recognizes audio from publicly accessible URLs only. It does not support audio in Base64 encoding, binary streams, or local files.
Q: How do I provide an audio file as a publicly accessible URL?
The specific steps vary by storage product, but in general you upload the file to a cloud storage service and generate a publicly accessible URL. We recommend uploading the audio to Object Storage Service (OSS).
When using the SDK to access a file stored in OSS, you cannot use a temporary URL with the oss:// prefix.
When using the RESTful API to access a file stored in OSS, you can use a temporary URL with the oss:// prefix:
The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.
The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.
For production environments, use a stable storage service such as OSS to ensure long-term file availability and avoid rate limiting issues.
Q: How long does it take to get the recognition result?
Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) varies with the queue length and file duration. The longer the audio file, the longer the processing time.
Troubleshooting
If a code error occurs, refer to Error codes to troubleshoot the issue.
Q: Why can't I get a result after continuous polling?
This may be because of rate limiting.
Q: Why is the audio not recognized (no recognition result)?
Check whether the audio format and sample rate are correct and meet the parameter constraints.
You can use the ffprobe tool to retrieve information about the audio container, codec, sample rate, and channels:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx