Transcribe audio and video files using Paraformer speech recognition models via the DashScope Python SDK.
This service is available only in the China (Beijing) region. Use an API key from China (Beijing).
For model selection, see Audio file recognition - Fun-ASR/Paraformer.
Prerequisites
Before you begin, ensure the following:
- An activated Model Studio service with an API key
- The API key exported as an environment variable (prevents hardcoding credentials)
- The latest DashScope SDK installed
For temporary access or high-risk operations (like accessing or deleting sensitive data), use a temporary authentication token instead of a long-term API key. Tokens expire after 60 seconds, reducing credential leakage risk. Replace the API key in your code with the temporary token.
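Since the SDK reads the key from the environment, it helps to fail fast when the variable is missing. The helper below is an illustrative sketch (not part of the SDK); it assumes the default `DASHSCOPE_API_KEY` variable name:

```python
import os

def require_api_key(var: str = "DASHSCOPE_API_KEY") -> str:
    """Return the API key from the environment, failing fast if it is absent.

    Illustrative helper, not part of the DashScope SDK.
    """
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; run: export {var}=<your-api-key>")
    return key
```

Calling `require_api_key()` before submitting a task turns a confusing authentication error into a clear configuration message.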
Models
| Feature | paraformer-v2 | paraformer-8k-v2 |
|---|---|---|
| Use case | Multilingual recognition for live streaming, meetings | Chinese recognition for telephony, voicemail |
| Sample rate | Any | 8 kHz |
| Languages | Chinese (Mandarin + 18 dialects), English, Japanese, Korean, German, French, Russian | Chinese |
| Punctuation and ITN | Enabled by default | Enabled by default |
| Custom hotwords | Supported | Supported |
| Language hints | Supported via language_hints | Not supported |
Supported Chinese dialects (paraformer-v2): Shanghai, Wu, Min Nan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese.
Limitations
Input requirements
The service does not accept local files or base64-encoded audio. Provide publicly accessible file URLs (HTTP or HTTPS).
Specify URLs using the file_urls parameter. A single request supports up to 100 URLs.
When using the SDK to access OSS files, you cannot use temporary URLs with the oss:// prefix. The RESTful API supports oss:// URLs, but they expire after 48 hours. Do not use temporary oss:// URLs in production, high-concurrency scenarios, or stress testing. The upload credential API is limited to 100 QPS with no scale-out. For production, use standard HTTPS URLs from Alibaba Cloud OSS.
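Because the SDK rejects `oss://` URLs and local paths, a simple pre-flight check on each entry of `file_urls` can catch bad inputs before submission. This is an illustrative sketch, not part of the SDK; `is_valid_file_url` is a hypothetical helper name:

```python
from urllib.parse import urlparse

def is_valid_file_url(url: str) -> bool:
    """Accept only http/https URLs; reject oss:// URLs and local paths,
    which the SDK does not support. Illustrative helper, not part of the SDK."""
    scheme = urlparse(url).scheme.lower()
    return scheme in ("http", "https")
```

Run it over your URL list and drop or fix any entry that returns `False` before calling `async_call`.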
Supported formats
aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
Not all format variants produce correct results. Test files before production.
Audio constraints
| Constraint | Limit |
|---|---|
| File size | 2 GB |
| Duration | 12 hours |
| URLs per request | 100 |
| Sample rate (paraformer-v2) | Any |
| Sample rate (paraformer-8k-v2) | 8 kHz |
To process larger files, see Preprocess video files to improve transcription efficiency.
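For WAV inputs specifically, the standard library can read the header locally, so duration and sample rate can be checked against the limits above before uploading. A minimal sketch, assuming a WAV file (other formats would need a different probe); `check_wav` is a hypothetical helper name:

```python
import wave

MAX_DURATION_S = 12 * 60 * 60  # documented 12-hour limit

def check_wav(source, model: str = "paraformer-v2") -> None:
    """Illustrative pre-flight check (not part of the SDK): read a WAV header
    and compare it against the documented duration and sample-rate limits.
    `source` is a file path or a binary file object."""
    with wave.open(source, "rb") as w:
        rate = w.getframerate()
        duration_s = w.getnframes() / rate
    if duration_s > MAX_DURATION_S:
        raise ValueError(f"duration {duration_s:.0f}s exceeds the 12-hour limit")
    if model == "paraformer-8k-v2" and rate != 8000:
        raise ValueError(f"paraformer-8k-v2 requires 8 kHz audio, got {rate} Hz")
```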
Recognizable languages
- paraformer-v2: Chinese (Mandarin and 18 dialects), English, Japanese, Korean, German, French, Russian
- paraformer-8k-v2: Chinese only
Quick start
The Transcription class supports two patterns:
- Submit + wait (async_call + wait): blocks until the task completes.
- Submit + poll (async_call + fetch): non-blocking; poll for results when ready.
Submit and wait for results
```python
from http import HTTPStatus
from dashscope.audio.asr import Transcription
import json

# To set the API key directly instead of using an environment variable:
# import dashscope
# dashscope.api_key = "<your-api-key>"

task_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=[
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'
    ],
    language_hints=['zh', 'en']  # paraformer-v2 only
)

transcribe_response = Transcription.wait(task=task_response.output.task_id)
if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('Transcription done!')
```
Submit and poll for results
```python
import time
from http import HTTPStatus
from dashscope.audio.asr import Transcription
import json

# To set the API key directly instead of using an environment variable:
# import dashscope
# dashscope.api_key = "<your-api-key>"

transcribe_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=[
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'
    ],
    language_hints=['zh', 'en']  # paraformer-v2 only
)

while True:
    if transcribe_response.output.task_status in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(3)  # pause between polls to avoid a busy-wait
    transcribe_response = Transcription.fetch(task=transcribe_response.output.task_id)

if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('Transcription done!')
```
How it works
- Call Transcription.async_call() to submit a transcription task. The task enters the PENDING queue.
- The service processes queued tasks on a best-effort basis. Queue time depends on queue length and file duration (typically a few minutes). Once processing begins, recognition runs hundreds of times faster than real-time.
- Retrieve results using Transcription.wait() (blocking) or Transcription.fetch() (non-blocking).
Recognition results and download URLs expire 24 hours after task completion.
Task statuses: PENDING -> RUNNING -> SUCCEEDED | FAILED
When a batch task contains multiple files and at least one succeeds, the overall task_status is SUCCEEDED. Check each file's subtask_status individually.
Request parameters
Set these parameters in Transcription.async_call():
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| model | str | - | Yes | Model name: paraformer-v2 or paraformer-8k-v2 |
| file_urls | list[str] | - | Yes | Audio/video file URLs (HTTP/HTTPS). Max 100 per request. The SDK does not support oss:// URLs. |
| language_hints | list[str] | ["zh", "en"] | No | Language codes for recognition (paraformer-v2 only). Supported: zh, en, ja, yue, ko, de, fr, ru |
| vocabulary_id | str | - | No | Hotword list ID. The latest v2-series models support this parameter. See Custom hotwords. |
| channel_id | list[int] | [0] | No | Audio track indexes to recognize (starting from 0). Each specified audio track is billed separately. Example: a request with [0, 1] for one file incurs two charges. |
| diarization_enabled | bool | False | No | Enable speaker diarization (mono audio only). Adds speaker_id to results. |
| speaker_count | int | - | No | Reference speaker count (2-100). Requires diarization_enabled=True. Assists the algorithm but does not guarantee the exact count. |
| disfluency_removal_enabled | bool | False | No | Remove filler words from results. |
| timestamp_alignment_enabled | bool | False | No | Align timestamps with audio playback. |
| special_word_filter | str | - | No | JSON string for sensitive word filtering. See Sensitive word filter. |
Sensitive word filter
By default, words matching the built-in sensitive word list (Chinese) are replaced with * characters.
Pass a JSON string to special_word_filter to customize this behavior:
```json
{
    "filter_with_signed": {
        "word_list": ["test"]
    },
    "filter_with_empty": {
        "word_list": ["start", "happen"]
    },
    "system_reserved_filter": true
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| filter_with_signed | object | No | Words replaced with equal-length * characters. Example: "Help me test this code" becomes "Help me \*\*\*\* this code". |
| filter_with_empty | object | No | Words removed from results entirely. Example: "Is the match about to start now?" becomes "Is the match about to now?". |
| system_reserved_filter | bool | No | Enable the built-in sensitive word list. Default: true. |
Response structure
TranscriptionResponse
Both wait() and fetch() return a TranscriptionResponse object.
The async_call() return value contains only task_id and task_status. Call wait() or fetch() for the full response including submit_time, results, and usage.
Successful response example (from wait() or fetch()):
```json
{
    "status_code": 200,
    "request_id": "16668704-6702-9e03-8ab7-a32a5d7bb095",
    "code": null,
    "message": "",
    "output": {
        "task_id": "6351feef-9694-45d2-9d32-63454f2ffb8d",
        "task_status": "SUCCEEDED",
        "submit_time": "2025-02-13 17:31:20.681",
        "scheduled_time": "2025-02-13 17:31:20.703",
        "end_time": "2025-02-13 17:31:21.867",
        "results": [
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/...",
                "subtask_status": "SUCCEEDED"
            },
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
                "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/...",
                "subtask_status": "SUCCEEDED"
            }
        ],
        "task_metrics": {
            "TOTAL": 2,
            "SUCCEEDED": 2,
            "FAILED": 0
        }
    },
    "usage": {
        "duration": 9
    }
}
```
Response fields:
| Field | Description |
|---|---|
| status_code | HTTP status code |
| task_id | Unique task identifier |
| task_status | PENDING, RUNNING, SUCCEEDED, or FAILED. The overall status is SUCCEEDED if any subtask succeeds. |
| results | Array of per-file results |
| subtask_status | Per-file status: PENDING, RUNNING, SUCCEEDED, or FAILED |
| file_url | Source audio URL |
| transcription_url | URL to download the recognition result JSON (valid for 24 hours) |
| code / message | Error details for failed subtasks. The outermost code and message can be ignored; check the code and message under each item in results for per-file errors. See Error codes. |
Partial failure example
When some files fail, the overall task status may still be SUCCEEDED. Always check subtask_status for each file:
```json
{
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
            "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/...",
            "subtask_status": "SUCCEEDED"
        },
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ]
}
```
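Checking each subtask can be done with a small helper that splits the results array by subtask_status. This is an illustrative sketch, not part of the SDK; `split_results` is a hypothetical name, and it takes the parsed `output` object of a response:

```python
def split_results(output: dict) -> tuple[list, list]:
    """Separate per-file results into succeeded and failed lists.

    Illustrative helper, not part of the SDK. `output` is the parsed
    response output containing a "results" array.
    """
    results = output.get("results", [])
    succeeded = [r for r in results if r.get("subtask_status") == "SUCCEEDED"]
    failed = [r for r in results if r.get("subtask_status") == "FAILED"]
    return succeeded, failed
```

Failed entries carry their own code and message fields, which is where the per-file error cause lives.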
Recognition result JSON
Download the JSON file from transcription_url to access detailed transcription data:
```json
{
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties": {
        "audio_format": "pcm_s16le",
        "channels": [0],
        "original_sampling_rate": 16000,
        "original_duration_in_milliseconds": 3834
    },
    "transcripts": [
        {
            "channel_id": 0,
            "content_duration_in_milliseconds": 3720,
            "text": "Hello world, this is Alibaba Speech Lab.",
            "sentences": [
                {
                    "begin_time": 100,
                    "end_time": 3820,
                    "text": "Hello world, this is Alibaba Speech Lab.",
                    "sentence_id": 1,
                    "speaker_id": 0,
                    "words": [
                        {
                            "begin_time": 100,
                            "end_time": 596,
                            "text": "Hello ",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 596,
                            "end_time": 844,
                            "text": "world",
                            "punctuation": ", "
                        }
                    ]
                }
            ]
        }
    ]
}
```
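After downloading the file behind transcription_url (for example with urllib.request.urlopen and json.load), the nested structure above can be flattened into timestamped rows. A minimal sketch; `extract_sentences` is a hypothetical helper name:

```python
def extract_sentences(result: dict) -> list[dict]:
    """Flatten per-channel transcripts into (channel_id, begin_time, end_time, text)
    rows. Illustrative helper, not part of the SDK; `result` is the parsed
    recognition result JSON downloaded from transcription_url."""
    rows = []
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            rows.append({
                "channel_id": transcript["channel_id"],
                "begin_time": sentence["begin_time"],
                "end_time": sentence["end_time"],
                "text": sentence["text"],
            })
    return rows
```

The same pattern extends one level deeper to the words array if word-level timestamps are needed.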
speaker_id appears only when diarization_enabled=True. Speech content duration (content_duration_in_milliseconds) is typically shorter than the original audio duration because only speech is measured. An AI model determines speech presence, so discrepancies may occur.
Recognition result fields:
| Field | Type | Description |
|---|---|---|
| audio_format | string | Detected audio format |
| channels | array[int] | Audio track indexes. [0] for mono, [0, 1] for stereo. |
| original_sampling_rate | int | Sample rate in Hz |
| original_duration_in_milliseconds | int | Total audio duration in ms |
| channel_id | int | Track index for this transcript (starting from 0) |
| content_duration_in_milliseconds | int | Duration of detected speech in ms. Only speech content is billed; silence and non-speech are excluded. Speech duration is typically shorter than the original duration. |
| text | string | Transcription text |
| sentences | array | Sentence-level results with timestamps |
| words | array | Word-level results with timestamps |
| begin_time / end_time | int | Timestamps in ms |
| speaker_id | int | Speaker index (starts from 0). Only present when diarization_enabled=True. |
| punctuation | string | Predicted punctuation after the word |
API reference
Transcription class
Import: from dashscope.audio.asr import Transcription
| Method | Signature | Description |
|---|---|---|
| async_call | Transcription.async_call(model, file_urls, phrase_id=None, api_key=None, workspace=None, **kwargs) -> TranscriptionResponse | Submit a transcription task asynchronously |
| wait | Transcription.wait(task, api_key=None, workspace=None, **kwargs) -> TranscriptionResponse | Blocks until the task completes (SUCCEEDED or FAILED) |
| fetch | Transcription.fetch(task, api_key=None, workspace=None, **kwargs) -> TranscriptionResponse | Retrieve the current task status without blocking |
Error codes
For error details, see Error messages.
When a batch task contains multiple files and any file succeeds, task_status is SUCCEEDED. Check subtask_status for per-file results.
| Error code | Message | Cause |
|---|---|---|
| InvalidFile.DownloadFailed | The audio file cannot be downloaded. | The file URL is inaccessible or expired |
For unresolved issues, report them on the GitHub repository with the request_id.
FAQ
Does the service accept base64-encoded audio?
No. Only publicly accessible HTTP/HTTPS URLs are supported. Local files and binary streams cannot be used.
How do I host audio files as accessible URLs?
Upload files to Alibaba Cloud OSS and use the generated URL (format: https://<bucket-name>.<region>.aliyuncs.com/<file-name>).
Alternatives include hosting on a web server (Nginx, Apache) or a CDN. Verify URL accessibility in a browser or with curl.
The SDK does not support oss:// URLs. When using the RESTful API, temporary oss:// URLs expire after 48 hours. Do not use them in production, high-concurrency scenarios, or stress testing. The credential API is rate-limited to 100 QPS with no scale-out. Use standard HTTPS URLs from Alibaba Cloud OSS for production.
How long does transcription take?
After submission, tasks queue in PENDING state. Queue time depends on queue length and file duration (typically a few minutes). Once processing starts, recognition runs hundreds of times faster than real-time.
Timestamps are not aligned with audio playback
Set timestamp_alignment_enabled=True in the request parameters.
Polling returns no result
This may be caused by rate limiting. Request a quota increase via the GitHub repository.
No speech is recognized
- Verify that the audio format and sample rate match the model requirements.
- For paraformer-v2, check that language_hints includes the correct language codes.
- Use custom hotwords to improve recognition of domain-specific terms.