
Alibaba Cloud Model Studio: Paraformer recording file recognition Python SDK

Last Updated: Mar 15, 2026

Transcribe audio and video files using Paraformer speech recognition models via the DashScope Python SDK.

Important

This service is available only in the China (Beijing) region. Use an API key from China (Beijing).

For model selection, see Audio file recognition - Fun-ASR/Paraformer.

Prerequisites

Before you begin, you need a valid API key for the China (Beijing) region and the DashScope Python SDK installed.

For temporary access or high-risk operations (like accessing or deleting sensitive data), use a temporary authentication token instead of a long-term API key. Tokens expire after 60 seconds, reducing credential leakage risk. Replace the API key in your code with the temporary token.

Models

| Feature | paraformer-v2 | paraformer-8k-v2 |
| --- | --- | --- |
| Use case | Multilingual recognition for live streaming, meetings | Chinese recognition for telephony, voicemail |
| Sample rate | Any | 8 kHz |
| Languages | Chinese (Mandarin + 18 dialects), English, Japanese, Korean, German, French, Russian | Chinese |
| Punctuation and ITN | Enabled by default | Enabled by default |
| Custom hotwords | Supported | Supported |
| Language hints | Supported via language_hints | Not supported |

Supported Chinese dialects (paraformer-v2): Shanghai, Wu, Min Nan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese.

Limitations

Input requirements

The service does not accept local files or base64-encoded audio. Provide publicly accessible file URLs (HTTP or HTTPS).

Specify URLs using the file_urls parameter. A single request supports up to 100 URLs.
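If you have more than 100 files, split them into batches that each respect the per-request limit and submit each batch as a separate task. A minimal sketch; the helper name and batch handling are illustrative, and only the 100-URL limit comes from this page:

```python
# Split a long URL list into slices of at most 100 entries, the
# documented per-request maximum for file_urls.
def batch_file_urls(urls, batch_size=100):
    """Yield successive slices of at most `batch_size` URLs."""
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]

# Example: 250 URLs become batches of 100, 100, and 50.
batches = list(batch_file_urls([f"https://example.com/{i}.wav" for i in range(250)]))
```

Each batch can then be passed as the file_urls argument of a separate Transcription.async_call() request.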

Important

When using the SDK to access OSS files, you cannot use temporary URLs with the oss:// prefix. The RESTful API supports oss:// URLs, but they expire after 48 hours. Do not use temporary oss:// URLs in production, high-concurrency scenarios, or stress testing. The upload credential API is limited to 100 QPS with no scale-out. For production, use standard HTTPS URLs from Alibaba Cloud OSS.

Supported formats

aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

Important

Not all format variants produce correct results. Test files before production.

Audio constraints

| Constraint | Limit |
| --- | --- |
| File size | 2 GB |
| Duration | 12 hours |
| URLs per request | 100 |
| Sample rate (paraformer-v2) | Any |
| Sample rate (paraformer-8k-v2) | 8 kHz |

To process larger files, see Preprocess video files to improve transcription efficiency.
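A simple pre-flight check against these limits can catch oversized inputs before a request is submitted. This is a sketch, assuming the caller has already measured file size and duration; the helper and its argument names are our own:

```python
# Documented service limits: 2 GB per file, 12 hours per file,
# 100 URLs per request.
MAX_SIZE_BYTES = 2 * 1024 ** 3
MAX_DURATION_S = 12 * 60 * 60
MAX_URLS = 100

def check_limits(url_count, size_bytes, duration_s):
    """Return a list of human-readable limit violations (empty if OK)."""
    problems = []
    if url_count > MAX_URLS:
        problems.append(f"too many URLs: {url_count} > {MAX_URLS}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file exceeds 2 GB")
    if duration_s > MAX_DURATION_S:
        problems.append("file exceeds 12 hours")
    return problems

# A 3 GB, 13-hour file violates two limits.
issues = check_limits(1, 3 * 1024 ** 3, 13 * 3600)
```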

Recognizable languages

  • paraformer-v2: Chinese (Mandarin and 18 dialects), English, Japanese, Korean, German, French, Russian

  • paraformer-8k-v2: Chinese only

Quick start

The Transcription class supports two patterns:

  1. Submit + wait (async_call + wait): Waits until the task completes.

  2. Submit + poll (async_call + fetch): Non-blocking. Poll for results when ready.

Submit and wait for results

from http import HTTPStatus
from dashscope.audio.asr import Transcription
import json

# To set the API key directly instead of using an environment variable:
# import dashscope
# dashscope.api_key = "<your-api-key>"

task_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=[
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'
    ],
    language_hints=['zh', 'en']  # paraformer-v2 only
)

transcribe_response = Transcription.wait(task=task_response.output.task_id)
if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('Transcription done!')
else:
    print('Error:', transcribe_response.message)

Submit and poll for results

from http import HTTPStatus
from dashscope.audio.asr import Transcription
import json
import time

# To set the API key directly instead of using an environment variable:
# import dashscope
# dashscope.api_key = "<your-api-key>"

transcribe_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=[
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'
    ],
    language_hints=['zh', 'en']  # paraformer-v2 only
)

while True:
    if transcribe_response.output.task_status in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(1)  # pause between polls instead of busy-waiting
    transcribe_response = Transcription.fetch(task=transcribe_response.output.task_id)

if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('Transcription done!')
else:
    print('Error:', transcribe_response.message)

How it works

  1. Call Transcription.async_call() to submit a transcription task. The task enters the PENDING queue.

  2. The service processes queued tasks on a best-effort basis. Queue time depends on queue length and file duration (typically a few minutes). Once processing begins, recognition runs hundreds of times faster than real-time.

  3. Retrieve results using Transcription.wait() (blocking) or Transcription.fetch() (non-blocking).

Recognition results and download URLs expire 24 hours after task completion.

Task statuses: PENDING -> RUNNING -> SUCCEEDED | FAILED

When a batch task contains multiple files and at least one succeeds, the overall task_status is SUCCEEDED. Check each file's subtask_status individually.

Request parameters

Set these parameters in Transcription.async_call():

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | str | - | Yes | Model name: paraformer-v2 or paraformer-8k-v2 |
| file_urls | list[str] | - | Yes | Audio/video file URLs (HTTP/HTTPS). Max 100 per request. The SDK does not support oss:// URLs. |
| language_hints | list[str] | ["zh", "en"] | No | Language codes for recognition (paraformer-v2 only). Supported: zh, en, ja, yue, ko, de, fr, ru |
| vocabulary_id | str | - | No | Hotword list ID. Supported by the latest v2-series models. See Custom hotwords. |
| channel_id | list[int] | [0] | No | Audio track indexes to recognize (starting from 0). Each specified audio track is billed separately. Example: a request with [0, 1] for one file incurs two charges. |
| diarization_enabled | bool | False | No | Enable speaker diarization (mono audio only). Adds speaker_id to results. |
| speaker_count | int | - | No | Reference speaker count (2-100). Requires diarization_enabled=True. Assists the algorithm but does not guarantee the exact count. |
| disfluency_removal_enabled | bool | False | No | Remove filler words from results. |
| timestamp_alignment_enabled | bool | False | No | Align timestamps with audio playback. |
| special_word_filter | str | - | No | JSON string for sensitive word filtering. See Sensitive word filter. |
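The optional parameters can be collected into a kwargs dict and unpacked into the call. This is a sketch: the parameter names match the table above, but the URL is a placeholder and the network call is left commented out:

```python
# Assemble request parameters for Transcription.async_call(**request).
request = {
    'model': 'paraformer-v2',
    'file_urls': ['https://example.com/meeting.wav'],  # placeholder URL
    'language_hints': ['zh', 'en'],        # paraformer-v2 only
    'diarization_enabled': True,           # mono audio only
    'speaker_count': 2,                    # a hint, not a hard guarantee
    'disfluency_removal_enabled': True,    # drop filler words
    'timestamp_alignment_enabled': True,   # align timestamps with playback
}
# task_response = Transcription.async_call(**request)
```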

Sensitive word filter

By default, words matching the built-in sensitive word list (Chinese) are replaced with * characters.

Pass a JSON string to special_word_filter to customize this behavior:

{
  "filter_with_signed": {
    "word_list": ["test"]
  },
  "filter_with_empty": {
    "word_list": ["start", "happen"]
  },
  "system_reserved_filter": true
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| filter_with_signed | object | No | Words replaced with equal-length * characters. Example: "Help me test this code" becomes "Help me **** this code". |
| filter_with_empty | object | No | Words removed from results entirely. Example: "Is the match about to start now?" becomes "Is the match about to now?". |
| system_reserved_filter | bool | No | Enable the built-in sensitive word list. Default: true. |
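Rather than hand-writing the JSON string, it can be built with json.dumps, which avoids quoting mistakes. A minimal sketch using the field names shown above; the word lists are examples only:

```python
import json

# Build the special_word_filter payload as a JSON string.
filter_config = json.dumps({
    "filter_with_signed": {"word_list": ["test"]},   # replaced with ****
    "filter_with_empty": {"word_list": ["start"]},   # removed entirely
    "system_reserved_filter": True,                  # keep the built-in list
})
# Pass it as: Transcription.async_call(..., special_word_filter=filter_config)
```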

Response structure

TranscriptionResponse

Both wait() and fetch() return a TranscriptionResponse object.

The async_call() return value contains only task_id and task_status. Call wait() or fetch() for the full response including submit_time, results, and usage.
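Because async_call() returns immediately, the task_id can be persisted so that a separate process picks up the results later with Transcription.fetch(). A minimal sketch; the file-based storage and names are illustrative:

```python
from pathlib import Path

# Persist a submitted task_id so results can be fetched later,
# possibly from a different process.
def save_task_id(task_id, path="pending_task_id.txt"):
    Path(path).write_text(task_id)

def load_task_id(path="pending_task_id.txt"):
    return Path(path).read_text().strip()

save_task_id("6351feef-9694-45d2-9d32-63454f2ffb8d")
resumed = load_task_id()
# transcribe_response = Transcription.fetch(task=resumed)
```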

Successful response example (from wait() or fetch()):

{
    "status_code": 200,
    "request_id": "16668704-6702-9e03-8ab7-a32a5d7bb095",
    "code": null,
    "message": "",
    "output": {
        "task_id": "6351feef-9694-45d2-9d32-63454f2ffb8d",
        "task_status": "SUCCEEDED",
        "submit_time": "2025-02-13 17:31:20.681",
        "scheduled_time": "2025-02-13 17:31:20.703",
        "end_time": "2025-02-13 17:31:21.867",
        "results": [
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/...",
                "subtask_status": "SUCCEEDED"
            },
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
                "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/...",
                "subtask_status": "SUCCEEDED"
            }
        ],
        "task_metrics": {
            "TOTAL": 2,
            "SUCCEEDED": 2,
            "FAILED": 0
        }
    },
    "usage": {
        "duration": 9
    }
}

Response fields:

| Field | Description |
| --- | --- |
| status_code | HTTP status code |
| task_id | Unique task identifier |
| task_status | PENDING, RUNNING, SUCCEEDED, or FAILED. The overall status is SUCCEEDED if any subtask succeeds. |
| results | Array of per-file results |
| subtask_status | Per-file status: PENDING, RUNNING, SUCCEEDED, or FAILED |
| file_url | Source audio URL |
| transcription_url | URL to download the recognition result JSON (valid for 24 hours) |
| code / message | Error details for failed subtasks. The outermost code and message can be ignored; check the code and message under each item in results for per-file errors. See Error codes. |

Partial failure example

When some files fail, the overall task status may still be SUCCEEDED. Always check subtask_status for each file:

{
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
            "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/...",
            "subtask_status": "SUCCEEDED"
        },
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ]
}
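A loop like the following separates succeeded and failed files. Here `output` stands in for transcribe_response.output from wait() or fetch(), with placeholder URLs:

```python
# Partition per-file results by subtask_status and report failures.
output = {
    "task_status": "SUCCEEDED",
    "results": [
        {"file_url": "https://example.com/a.wav",
         "transcription_url": "https://example.com/a.json",
         "subtask_status": "SUCCEEDED"},
        {"file_url": "https://example.com/b.wav",
         "code": "InvalidFile.DownloadFailed",
         "message": "The audio file cannot be downloaded.",
         "subtask_status": "FAILED"},
    ],
}

succeeded = [r for r in output["results"] if r["subtask_status"] == "SUCCEEDED"]
failed = [r for r in output["results"] if r["subtask_status"] == "FAILED"]
for r in failed:
    print(f"{r['file_url']}: {r['code']} - {r['message']}")
```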

Recognition result JSON

Download the JSON file from transcription_url to access detailed transcription data:

{
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties": {
        "audio_format": "pcm_s16le",
        "channels": [0],
        "original_sampling_rate": 16000,
        "original_duration_in_milliseconds": 3834
    },
    "transcripts": [
        {
            "channel_id": 0,
            "content_duration_in_milliseconds": 3720,
            "text": "Hello world, this is Alibaba Speech Lab.",
            "sentences": [
                {
                    "begin_time": 100,
                    "end_time": 3820,
                    "text": "Hello world, this is Alibaba Speech Lab.",
                    "sentence_id": 1,
                    "speaker_id": 0,
                    "words": [
                        {
                            "begin_time": 100,
                            "end_time": 596,
                            "text": "Hello ",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 596,
                            "end_time": 844,
                            "text": "world",
                            "punctuation": ", "
                        }
                    ]
                }
            ]
        }
    ]
}
speaker_id appears only when diarization_enabled=True. Speech content duration (content_duration_in_milliseconds) is typically shorter than the original audio duration because only speech is measured. An AI model determines speech presence, so discrepancies may occur.
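After downloading the JSON from transcription_url (for example with urllib or requests), it can be flattened into plain text or sentence-level lines with timestamps. A sketch against a trimmed copy of the structure shown above:

```python
# `result` mirrors the recognition result JSON, reduced to the fields
# used here; in practice it comes from downloading transcription_url.
result = {
    "transcripts": [
        {
            "channel_id": 0,
            "text": "Hello world, this is Alibaba Speech Lab.",
            "sentences": [
                {"begin_time": 100, "end_time": 3820,
                 "text": "Hello world, this is Alibaba Speech Lab."},
            ],
        }
    ]
}

# One text blob per audio track.
full_text = "\n".join(t["text"] for t in result["transcripts"])

# Sentence-level lines with timestamps, e.g. for subtitle generation.
lines = [f"[{s['begin_time']}-{s['end_time']} ms] {s['text']}"
         for t in result["transcripts"] for s in t["sentences"]]
```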

Recognition result fields:

| Field | Type | Description |
| --- | --- | --- |
| audio_format | string | Detected audio format |
| channels | array[int] | Audio track indexes. [0] for mono, [0, 1] for stereo. |
| original_sampling_rate | int | Sample rate in Hz |
| original_duration_in_milliseconds | int | Total audio duration in ms |
| channel_id | int | Track index for this transcript (starting from 0) |
| content_duration_in_milliseconds | int | Duration of detected speech in ms. Only speech content is billed; silence and non-speech are excluded. Speech duration is typically shorter than the original duration. |
| text | string | Transcription text |
| sentences | array | Sentence-level results with timestamps |
| words | array | Word-level results with timestamps |
| begin_time / end_time | int | Timestamps in ms |
| speaker_id | int | Speaker index (starts from 0). Only present when diarization_enabled=True. |
| punctuation | string | Predicted punctuation after the word |

API reference

Transcription class

Import: from dashscope.audio.asr import Transcription

| Method | Signature | Description |
| --- | --- | --- |
| async_call | Transcription.async_call(model, file_urls, phrase_id=None, api_key=None, workspace=None, **kwargs) -> TranscriptionResponse | Submit a transcription task asynchronously |
| wait | Transcription.wait(task, api_key=None, workspace=None, **kwargs) -> TranscriptionResponse | Block until the task completes (SUCCEEDED or FAILED) |
| fetch | Transcription.fetch(task, api_key=None, workspace=None, **kwargs) -> TranscriptionResponse | Retrieve the current task status without blocking |

Error codes

For error details, see Error messages.

When a batch task contains multiple files and any file succeeds, task_status is SUCCEEDED. Check subtask_status for per-file results.

| Error code | Message | Cause |
| --- | --- | --- |
| InvalidFile.DownloadFailed | The audio file cannot be downloaded. | The file URL is inaccessible or expired |

For unresolved issues, report them on the GitHub repository with the request_id.

FAQ

Does the service accept base64-encoded audio?

No. Only publicly accessible HTTP/HTTPS URLs are supported. Local files and binary streams cannot be used.

How do I host audio files as accessible URLs?

Upload files to Alibaba Cloud OSS and use the generated URL (format: https://<bucket-name>.<region>.aliyuncs.com/<file-name>).

Alternatives include hosting on a web server (Nginx, Apache) or a CDN. Verify URL accessibility in a browser or with curl.

Important

The SDK does not support oss:// URLs. When using the RESTful API, temporary oss:// URLs expire after 48 hours. Do not use them in production, high-concurrency scenarios, or stress testing. The credential API is rate-limited to 100 QPS with no scale-out. Use standard HTTPS URLs from Alibaba Cloud OSS for production.

How long does transcription take?

After submission, tasks queue in PENDING state. Queue time depends on queue length and file duration (typically a few minutes). Once processing starts, recognition runs hundreds of times faster than real-time.

Timestamps are not aligned with audio playback

Set timestamp_alignment_enabled=True in the request parameters.

Polling returns no result

This may be caused by rate limiting. Request a quota increase via the GitHub repository.

No speech is recognized

  1. Verify audio format and sample rate match model requirements.

  2. For paraformer-v2, check that language_hints includes the correct language codes.

  3. Use custom hotwords to improve recognition of domain-specific terms.
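For step 2, a quick check can confirm that every hint is one of the codes this page lists as supported. A minimal sketch; the helper name is illustrative and the code set comes from the request-parameter table above:

```python
# Language codes documented as supported for language_hints (paraformer-v2).
SUPPORTED_HINTS = {"zh", "en", "ja", "yue", "ko", "de", "fr", "ru"}

def invalid_hints(hints):
    """Return any hint codes not documented as supported."""
    return sorted(set(hints) - SUPPORTED_HINTS)

bad = invalid_hints(["zh", "en", "es"])  # "es" (Spanish) is not supported
```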

More resources