Paraformer recording file recognition Python SDK - Alibaba Cloud Model Studio

This topic describes the parameters and API details for the Paraformer recording file recognition Python SDK.

Important

This document applies only to the "China (Beijing)" region. To use the model, you must use an API key from the "China (Beijing)" region.

User guide: For an overview of models and how to select them, see Audio file recognition - Fun-ASR/Paraformer.

Prerequisites

You have activated the Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
Note
To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Install the latest version of the DashScope SDK.

Models

	paraformer-v2	paraformer-8k-v2
Scenarios	Multilingual recognition for scenarios such as live streaming and meetings	Chinese recognition for scenarios such as telephone customer service and voicemail
Sample rate	Any	8 kHz
Languages	Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian Supported Chinese dialects: Shanghai dialect, Wu dialect, Min Nan dialect, Northeastern dialect, Gansu dialect, Guizhou dialect, Henan dialect, Hubei dialect, Hunan dialect, Jiangxi dialect, Ningxia dialect, Shanxi dialect, Shaanxi dialect, Shandong dialect, Sichuan dialect, Tianjin dialect, Yunnan dialect, and Cantonese	Chinese
Punctuation prediction	✅ Supported by default, no configuration required	✅ Supported by default, no configuration required
Inverse text normalization (ITN)	✅ Supported by default, no configuration required	✅ Supported by default, no configuration required
Custom hotwords	✅ For more information, see Custom vocabulary	✅ For more information, see Custom vocabulary
Specify language for recognition	✅ Specified by the `language_hints` parameter	❌

Limits

The service does not support direct uploads of local audio or video files. It also does not support base64-encoded audio. The input source must be a file URL that is accessible over the Internet and supports the HTTP or HTTPS protocol, for example, https://your-domain.com/file.mp3.

When using the SDK to access a file stored in OSS, you cannot use a temporary URL with the oss:// prefix.

When using the RESTful API to access a file stored in OSS, you can use a temporary URL with the oss:// prefix:

Important

The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.
The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.
For production environments, use a stable storage service such as Alibaba Cloud OSS to ensure long-term file availability and avoid rate limiting issues.

You can specify the URL using the file_urls parameter. A single request supports up to 100 URLs.

Audio formats
aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, and wmv
Important
The API cannot guarantee correct recognition for all audio and video formats and their variants because it is not feasible to test every possibility. We recommend testing your files to confirm that they produce the expected speech recognition results.
Audio sampling rate
The sample rate varies by model:
- paraformer-v2 supports any sample rate
- paraformer-8k-v2 only supports an 8 kHz sample rate
Audio file size and duration
The audio file cannot exceed 2 GB in size and 12 hours in duration.
To process files that exceed these limits, you can pre-process them to reduce their size. For more information about pre-processing best practices, see Preprocess video files to improve file transcription efficiency (for audio file recognition scenarios).
Number of audio files for batch processing
A single request supports up to 100 file URLs.
Recognizable languages
Varies by model:
- paraformer-v2:
  - Chinese, including Mandarin and various dialects: Shanghai dialect, Wu dialect, Min Nan dialect, Northeastern dialect, Gansu dialect, Guizhou dialect, Henan dialect, Hubei dialect, Hunan dialect, Jiangxi dialect, Ningxia dialect, Shanxi dialect, Shaanxi dialect, Shandong dialect, Sichuan dialect, Tianjin dialect, Yunnan dialect, and Cantonese
  - English
  - Japanese
  - Korean
- paraformer-8k-v2 only supports Chinese

Get started

The Transcription class provides methods for submitting tasks asynchronously, waiting for tasks to complete synchronously, and querying task execution results asynchronously. You can use the following two methods for recording file recognition:

Asynchronous task submission + synchronous waiting for task completion: After you submit a task, the current thread is blocked until the task is complete and the recognition result is returned.
Asynchronous task submission + asynchronous querying of task execution results: After you submit a task, you can retrieve the task execution result by calling the query task interface when needed.

Asynchronous task submission + synchronous waiting for task completion

Call the async_call method of the core class (Transcription) and set the request parameters.
Note
- The file transcription service processes tasks submitted through the API on a best-effort basis. After a task is submitted, it enters the queuing (PENDING) state. The queuing time depends on the queue length and file duration and cannot be precisely determined, but it is typically within a few minutes. Once the task starts processing, the speech recognition process is hundreds of times faster than real-time playback.
- The recognition results and download URLs are valid for 24 hours after a task is complete. After this period, you cannot query the task or download the results.
Call the wait method of the core class (Transcription) to synchronously wait for the task to complete.
Task statuses include PENDING, RUNNING, SUCCEEDED, and FAILED. If the task status is PENDING or RUNNING, the wait method is blocked. If the task status is SUCCEEDED or FAILED, the wait method is no longer blocked and returns the task execution result.
The wait method returns a TranscriptionResponse.

Click to view the complete example

from http import HTTPStatus
from dashscope.audio.asr import Transcription
import json

# If you have not configured the API Key in the environment variable, you need to uncomment the following line of code and replace apiKey with your own API Key.
# import dashscope
# dashscope.api_key = "apiKey"

task_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # "language_hints" is only supported by the paraformer-v2 model
)

transcribe_response = Transcription.wait(task=task_response.output.task_id)
if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('transcription done!')

Asynchronous task submission + asynchronous querying of task execution results

Call the async_call method of the core class (Transcription) and set the request parameters.
Note
- The file transcription service processes tasks submitted through the API on a best-effort basis. After a task is submitted, it enters the queuing (PENDING) state. The queuing time depends on the queue length and file duration and cannot be precisely determined, but it is typically within a few minutes. Once the task starts processing, the speech recognition process is hundreds of times faster than real-time playback.
- The recognition results and download URLs are valid for 24 hours after a task is complete. After this period, you cannot query the task or download the results.
Call the fetch method of the core class (Transcription) in a loop until you retrieve the final task result.
If the task status is SUCCEEDED or FAILED, stop polling and process the result.
The fetch method returns a TranscriptionResponse.

Click to view the complete example

from http import HTTPStatus
from dashscope.audio.asr import Transcription
import json

# If you have not configured the API Key in the environment variable, you need to uncomment the following line of code and replace apiKey with your own API Key.
# import dashscope
# dashscope.api_key = "apiKey"

transcribe_response = Transcription.async_call(
    model='paraformer-v2',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # "language_hints" is only supported by the paraformer-v2 model
)

while True:
    if transcribe_response.output.task_status == 'SUCCEEDED' or transcribe_response.output.task_status == 'FAILED':
        break
    transcribe_response = Transcription.fetch(task=transcribe_response.output.task_id)

if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('transcription done!')

Request parameters

You can set request parameters using the async_call method of the core class (Transcription).

Parameter	Type	Default value	Required	Description
model	str	-	Yes	Specifies the Paraformer model name used for audio/video file transcription. For more information, see Models.
file_urls	list[str]	-	Yes	A list of URLs for audio/video file transcription. It supports HTTP/HTTPS protocols. A single request supports up to 100 URLs. If your audio files are stored in Alibaba Cloud OSS, the SDK does not support temporary URLs that start with the oss:// prefix.
vocabulary_id	str	-	No	The ID of the latest hotword. The latest v2 series models support this parameter and language configurations. The hotwords corresponding to this hotword ID take effect in the current speech recognition. This feature is disabled by default. For more information about how to use this feature, see Custom vocabulary.
channel_id	list[int]	[0]	No	Specifies the indices of the audio tracks in a multi-track file that require speech recognition. The indices are provided in a list. For example, `[0]` indicates that only the first audio track is recognized, and `[0, 1]` indicates that the first two audio tracks are recognized at the same time.
disfluency_removal_enabled	bool	False	No	Filters filler words. It is disabled by default.
timestamp_alignment_enabled	bool	False	No	Enables timestamp alignment. It is disabled by default.
special_word_filter	str	-	No	Specifies the sensitive words to be processed during speech recognition and supports different processing methods for different sensitive words. If you do not pass this parameter, the system enables its built-in sensitive word filtering logic. Any words in the detection results that match the Alibaba Cloud Model Studio sensitive word list (Chinese) are replaced with an equal number of `` characters. If this parameter is passed, the following sensitive word processing strategies can be implemented: Replace with ``: Replaces the matched sensitive words with an equal number of asterisks (``). Direct filtering: Completely removes matching sensitive words from the recognition results. The value of this parameter must be a JSON string with the following structure: `{ "filter_with_signed": { "word_list": ["test"] }, "filter_with_empty": { "word_list": ["start", "happen"] }, "system_reserved_filter": true }` JSON field description: `filter_with_signed` Type: object. Required: No. Description: Configures the list of sensitive words to be replaced with ``. Matched words in the recognition results are replaced with an equal number of asterisks (``). Example: Based on the preceding JSON, the speech recognition result for "Help me test this code" will be "Help me * this code". Internal field: `word_list`: A string array that lists the sensitive words to be replaced. `filter_with_empty` Type: object. Required: No. Description: Configures the list of sensitive words to be removed (filtered) from the recognition results. Matched words in the recognition results are completely deleted. Example: Based on the preceding JSON, the speech recognition result for "Is the match about to start now?" will be "Is the match about to now?". Internal field: `word_list`: A string array that lists the sensitive words to be completely removed (filtered). `system_reserved_filter` Type: Boolean value. Required: No. Default value: true. Description: Specifies whether to enable the system-predefined sensitive word rule. If this parameter is set to `true`, the system's built-in sensitive word filtering logic is also enabled, and words in the detection results that match the Alibaba Cloud Model Studio sensitive word list (Chinese) are replaced with an equal-length string of `*` characters.
language_hints	list[str]	["zh", "en"]	No	Specifies the language codes of the speech to be recognized. This parameter is applicable only to the paraformer-v2 model. Supported language codes: zh: Chinese en: English ja: Japanese yue: Cantonese ko: Korean de: German fr: French ru: Russian
diarization_enabled	bool	False	No	Automatic speaker diarization. This feature is disabled by default. This feature is applicable only to mono audio. Multi-channel audio does not support speaker diarization. When this feature is enabled, the recognition results will display a `speaker_id` field to distinguish different speakers. For an example of `speaker_id`, see Recognition result description.
speaker_count	int	-	No	The reference value for the number of speakers. The value must be an integer from 2 to 100, including 2 and 100. This parameter takes effect after speaker diarization is enabled (`diarization_enabled` is set to true). By default, the number of speakers is automatically determined. If you configure this parameter, it can only assist the algorithm in trying to output the specified number of speakers, but cannot guarantee that this number will definitely be output.

Response results

`TranscriptionResponse`

TranscriptionResponse encapsulates the basic information of the task, such as task_id and task_status, and the execution result. The execution result corresponds to the output attribute. For more information, see TranscriptionOutput.

Click to view the TranscriptionResponse structure example

`PENDING`

{
    "status_code":200,
    "request_id":"251aceab-a6aa-9fc4-b7f7-0cc6d3e2a9f3",
    "code":null,
    "message":"",
    "output":{
        "task_id":"7d0a58a3-1dbe-4de9-8cff-5f48213128b0",
        "task_status":"PENDING",
        "submit_time":"2025-02-13 16:55:08.573",
        "scheduled_time":"2025-02-13 16:55:08.592",
        "task_metrics":{
            "TOTAL":2,
            "SUCCEEDED":0,
            "FAILED":0
        }
    },
    "usage":null
}

`RUNNING`

{
    "status_code":200,
    "request_id":"d9d530f1-853c-9848-a5f1-f5de59086ff7",
    "code":null,
    "message":"",
    "output":{
        "task_id":"6351feef-9694-45d2-9d32-63454f2ffb8d",
        "task_status":"RUNNING",
        "submit_time":"2025-02-13 17:31:20.681",
        "scheduled_time":"2025-02-13 17:31:20.703",
        "task_metrics":{
            "TOTAL":2,
            "SUCCEEDED":1,
            "FAILED":0
        }
    },
    "usage":null
}

`SUCCEEDED`

{
    "status_code":200,
    "request_id":"16668704-6702-9e03-8ab7-a32a5d7bb095",
    "code":null,
    "message":"",
    "output":{
        "task_id":"6351feef-9694-45d2-9d32-63454f2ffb8d",
        "task_status":"SUCCEEDED",
        "submit_time":"2025-02-13 17:31:20.681",
        "scheduled_time":"2025-02-13 17:31:20.703",
        "end_time":"2025-02-13 17:31:21.867",
        "results":[
            {
                "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A31/20ee4e4f-0404-4806-b617-c7d4c62eed19-1.json?Expires=1739525481&OSSAccessKeyId=yourOSSAccessKeyId&Signature=3q%2B1uQmRwltd7FPn5HQM2mBKw74%3D",
                "subtask_status":"SUCCEEDED"
            },
            {
                "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
                "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A31/be4f14c5-e46b-47ff-b03a-476ae9a45fd3-1.json?Expires=1739525481&OSSAccessKeyId=yourOSSAccessKeyId&Signature=EUX%2FRkGcn46L5d93ihQmpWUeYE4%3D",
                "subtask_status":"SUCCEEDED"
            }
        ],
        "task_metrics":{
            "TOTAL":2,
            "SUCCEEDED":2,
            "FAILED":0
        }
    },
    "usage":{
        "duration":9
    }
}

`FAILED`

{
    "status_code":200,
    "request_id":"16668704-6702-9e03-8ab7-a32a5d7bb095",
    "code":null,
    "message":"",
    "output":{
        "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
        "task_status": "SUCCEEDED",
        "submit_time": "2024-12-16 16:30:59.170",
        "scheduled_time": "2024-12-16 16:30:59.204",
        "end_time": "2024-12-16 16:31:02.375",
        "results": [
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
                "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
                "subtask_status": "SUCCEEDED"
            },
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
                "code": "InvalidFile.DownloadFailed",
                "message": "The audio file cannot be downloaded.",
                "subtask_status": "FAILED"
            }
        ],
        "task_metrics": {
            "TOTAL": 2,
            "SUCCEEDED": 1,
            "FAILED": 1
        }
    },
    "usage":{
        "duration":9
    }
}

The following table describes the key parameters:

Parameter	Description
status_code	HTTP request status code.
code	The outermost `code` can be ignored. The `code` under `output.results` represents the error code. You can refer to the error codes along with the `message` field to troubleshoot issues.
message	The outermost `message` can be ignored. The `message` under `output.results` represents the error message. You can refer to the error codes along with the `code` field to troubleshoot issues.
task_id	Task ID.
task_status	Task status. There are four statuses: `PENDING`, `RUNNING`, `SUCCEEDED`, and `FAILED`. When a task contains multiple subtasks, as long as any subtask succeeds, the entire task status will be marked as `SUCCEEDED`. You need to check the `subtask_status` field to determine the specific subtask results.
results	Subtask recognition results.
subtask_status	Subtask status. There are four statuses: `PENDING`, `RUNNING`, `SUCCEEDED`, and `FAILED`.
file_url	URL of the recognized audio.
transcription_url	URL corresponding to the audio recognition result. The recognition result is saved in a JSON file. You can download the file through the link corresponding to `transcription_url` or directly read the content of the file through an HTTP request. For the content of the JSON file, see Recognition result description.

`TranscriptionOutput`

TranscriptionOutput corresponds to the output attribute of TranscriptionResponse and represents the current task execution result.

Click to view the TranscriptionOutput structure example

PENDING

{
    "task_id":"f2f7c2fa-0cd9-4bb2-a283-27b26ee4bb67",
    "task_status":"PENDING",
    "submit_time":"2025-02-13 17:59:27.754",
    "scheduled_time":"2025-02-13 17:59:27.789",
    "task_metrics":{
        "TOTAL":2,
        "SUCCEEDED":0,
        "FAILED":0
    }
}

RUNNING

{
    "task_id":"f2f7c2fa-0cd9-4bb2-a283-27b26ee4bb67",
    "task_status":"RUNNING",
    "submit_time":"2025-02-13 17:59:27.754",
    "scheduled_time":"2025-02-13 17:59:27.789",
    "task_metrics":{
        "TOTAL":2,
        "SUCCEEDED":0,
        "FAILED":0
    }
}

`SUCCEEDED`

{
    "task_id":"f2f7c2fa-0cd9-4bb2-a283-27b26ee4bb67",
    "task_status":"SUCCEEDED",
    "submit_time":"2025-02-13 17:59:27.754",
    "scheduled_time":"2025-02-13 17:59:27.789",
    "end_time":"2025-02-13 17:59:28.828",
    "results":[
        {
            "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
            "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A59/70e737cc-bf8c-418b-b0c8-83fab192a0fa-1.json?Expires=1739527168&OSSAccessKeyId=yourOSSAccessKeyId&Signature=AtGjIKI%2BdgbzjJIu%2BHsr1R5nSAY%3D",
            "subtask_status":"SUCCEEDED"
        },
        {
            "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
            "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A59/ce1ebe74-be78-4ac8-b4f8-8e438a14d1c2-1.json?Expires=1739527168&OSSAccessKeyId=yourOSSAccessKeyId&Signature=z5s0ROpSU8HwiM8WHPNVpkuFG3A%3D",
            "subtask_status":"SUCCEEDED"
        }
    ],
    "task_metrics":{
        "TOTAL":2,
        "SUCCEEDED":2,
        "FAILED":0
    }
}

`FAILED`

The code parameter is the error code, and the message parameter is the error message. These two fields appear only in exceptional cases. You can use these two fields to troubleshoot issues. For more information, see error codes.

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
            "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
            "subtask_status": "SUCCEEDED"
        },
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 2,
        "SUCCEEDED": 1,
        "FAILED": 1
    }
}

The following table describes the key parameters:

Parameter	Description
code	Represents the error code. You can refer to the error codes along with the `message` field to troubleshoot issues.
message	Represents the error message. You can refer to the error codes along with the `code` field to troubleshoot issues.
task_id	Task ID.
task_status	Task status. There are four statuses: `PENDING`, `RUNNING`, `SUCCEEDED`, and `FAILED`. When a task contains multiple subtasks, as long as any subtask succeeds, the entire task status will be marked as `SUCCEEDED`. You need to check the `subtask_status` field to determine the specific subtask results.
results	Subtask recognition results.
subtask_status	Subtask status. There are four statuses: `PENDING`, `RUNNING`, `SUCCEEDED`, and `FAILED`.
file_url	URL of the recognized audio.
transcription_url	URL corresponding to the audio recognition result. The recognition result is saved in a JSON file. You can download the file through the link corresponding to `transcription_url` or directly read the content of the file through an HTTP request. For the content of the JSON file, see Recognition result description.

Recognition result description

The recognition result is saved as a JSON file.

Click to view recognition result example

{
    "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties":{
        "audio_format":"pcm_s16le",
        "channels":[
            0
        ],
        "original_sampling_rate":16000,
        "original_duration_in_milliseconds":3834
    },
    "transcripts":[
        {
            "channel_id":0,
            "content_duration_in_milliseconds":3720,
            "text":"Hello world, this is Alibaba Speech Lab.",
            "sentences":[
                {
                    "begin_time":100,
                    "end_time":3820,
                    "text":"Hello world, this is Alibaba Speech Lab.",
                    "sentence_id":1,
                    "speaker_id":0, //This field is only displayed when automatic speaker diarization is enabled.
                    "words":[
                        {
                            "begin_time":100,
                            "end_time":596,
                            "text":"Hello ",
                            "punctuation":""
                        },
                        {
                            "begin_time":596,
                            "end_time":844,
                            "text":"world",
                            "punctuation":", "
                        }
                        // Other content is omitted here.
                    ]
                }
            ]
        }
    ]
}

The following table describes the key parameters:

Parameter	Type	Description
audio_format	string	The audio format in the source file.
channels	array[integer]	The audio track index information in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on.
original_sampling_rate	integer	The sample rate (Hz) of the audio in the source file.
original_duration	integer	The original audio duration (ms) in the source file.
channel_id	integer	The audio track index of the transcription result, starting from 0.
content_duration	integer	The duration (ms) of content determined to be speech in the audio track. Important The Paraformer speech recognition model service only transcribes and charges for the duration of content determined to be speech in the audio track. Non-speech content is not measured or charged. Typically, the speech content duration is shorter than the original audio duration. Because an AI model determines whether speech content exists, discrepancies may occur.
transcript	string	The paragraph-level speech transcription result.
sentences	array	The sentence-level speech transcription result.
words	array	The word-level speech transcription result.
begin_time	integer	The start timestamp (ms).
end_time	integer	The end timestamp (ms).
text	string	The speech transcription result.
speaker_id	integer	The index of the current speaker, starting from 0, used to distinguish different speakers. This field is displayed in the recognition result only when speaker diarization is enabled.
punctuation	string	The predicted punctuation after the word, if any.

Key interfaces

Core class (`Transcription`)

You can import Transcription using from dashscope.audio.asr import Transcription.

Member method	Method signature	Description
async_call	`@classmethod def async_call(cls, model: str, file_urls: List[str], phrase_id: str = None, api_key: str = None, workspace: str = None, **kwargs) -> TranscriptionResponse`	Asynchronously submits a speech recognition task.
wait	`@classmethod def wait(cls, task: Union[str, TranscriptionResponse], api_key: str = None, workspace: str = None, **kwargs) -> TranscriptionResponse`	Blocks the current thread until the task completes (task status is `SUCCEEDED` or `FAILED`). This method returns a TranscriptionResponse.
fetch	`@classmethod def fetch(cls, task: Union[str, TranscriptionResponse], api_key: str = None, workspace: str = None, **kwargs) -> TranscriptionResponse`	Asynchronously retrieves the task result. This method returns a TranscriptionResponse.

Error codes

If you encounter an error, see Error messages for troubleshooting.

If the problem persists, join the developer group to report the issue and provide the Request ID for further investigation.

If a task contains multiple subtasks and any subtask succeeds, the overall task status is marked as SUCCEEDED. You must check the subtask_status field to determine the result of each subtask.

Error response example:

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
            "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
            "subtask_status": "SUCCEEDED"
        },
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 2,
        "SUCCEEDED": 1,
        "FAILED": 1
    }
}

More examples

For more examples, see GitHub repository for speech examples.

FAQ

Features

Q: Is Base64 encoded audio supported?

No, it is not. The service only supports recognition of audio from URLs that are accessible over the internet. It does not support binary streams or local files.

Q: How can I provide audio files as publicly accessible URLs?

Follow these general steps. The specific process may vary depending on the storage product you use. We recommend uploading audio to OSS.

1. Choose a storage and hosting method

You can use methods such as the following:

Object Storage Service (OSS) (recommended):
- Use an Object Storage Service such as Alibaba Cloud OSS, upload audio files to a bucket, and set them for public access.
- Advantages: High availability, supports content delivery network (CDN) acceleration, easy to manage.
Web server:
- Place audio files on a web server that supports HTTP/HTTPS access, such as Nginx or Apache.
- Advantages: Suitable for small projects or local testing.
Content delivery network (CDN):
- Host audio files on a CDN and access them through URLs provided by the CDN.
- Advantages: Accelerates file transfer, suitable for high concurrency scenarios.

2. Upload audio files

Upload the audio files based on your chosen storage method. For example:

Object Storage Service:
- Log in to the cloud service provider's console and create a bucket.
- Upload audio files and set file permissions to "public-read" or generate temporary access links.
Web server:
- Place audio files in a specified directory on the server, such as /var/www/html/audio/.
- Ensure files can be accessed via HTTP/HTTPS.

3. Generate publicly accessible URLs

For example:

Object Storage Service:
- After file upload, the system automatically generates a public access URL, typically in the format https://<bucket-name>.<region>.aliyuncs.com/<file-name>.
- If you need a more friendly domain name, you can bind a custom domain name and enable HTTPS.
Web server:
- The file access URL is typically the server address plus the file path, such as https://your-domain.com/audio/file.mp3.
CDN:
- After you configure CDN acceleration, use the URL provided by the CDN, such as https://cdn.your-domain.com/audio/file.mp3.

4. Verify URL availability

Verify that the generated URL is publicly accessible. For example:

In a browser, open the URL and check if the audio file can be played.
Use a tool, such as curl or Postman, to verify if the URL returns the correct HTTP response (status code 200).

When using the SDK to access a file stored in OSS, you cannot use a temporary URL with the oss:// prefix.

When using the RESTful API to access a file stored in OSS, you can use a temporary URL with the oss:// prefix:

Important

The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.
The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.
For production environments, use a stable storage service such as Alibaba Cloud OSS to ensure long-term file availability and avoid rate limiting issues.

Q: How long does it take to obtain the recognition results?

After a task is submitted, it enters the PENDING state. The queuing time depends on the queue length and file duration and cannot be precisely determined, but it is typically within a few minutes. Longer audio files require more processing time.

Troubleshooting

If you encounter an error, refer to the information in Error codes.

Q: What should I do if the recognition results are not synchronized with the audio playback?

A: Set the request parameter timestamp_alignment_enabled to true. This enables timestamp alignment, which synchronizes the recognition result with the audio playback.

Q: Why can't I obtain a result after continuous polling?

This may be due to rate limiting. To request a quota increase, join the developer group.

Q: Why is the speech not recognized (no recognition result)?

Check whether the audio meets the format and sample rate requirements.
If you are using the paraformer-v2 model, check whether the language_hints parameter is set correctly.
If the previous checks do not resolve the issue, you can use custom hotwords to improve the recognition of specific words.

Alibaba Cloud Model Studio:Paraformer recording file recognition Python SDK

Prerequisites

Models

Limits

Get started

Asynchronous task submission + synchronous waiting for task completion

Asynchronous task submission + asynchronous querying of task execution results

Request parameters

Response results

`TranscriptionResponse`

`PENDING`

`RUNNING`

`SUCCEEDED`

`FAILED`

`TranscriptionOutput`

PENDING

RUNNING

`SUCCEEDED`

`FAILED`

Recognition result description

Key interfaces

Core class (`Transcription`)

Error codes

More examples

FAQ

Features

Q: Is Base64 encoded audio supported?

Q: How can I provide audio files as publicly accessible URLs?

Q: How long does it take to obtain the recognition results?

Troubleshooting

Q: What should I do if the recognition results are not synchronized with the audio playback?

Q: Why can't I obtain a result after continuous polling?

Q: Why is the speech not recognized (no recognition result)?

More questions

Prerequisites

Models

Limits

Get started

Asynchronous task submission + synchronous waiting for task completion

Asynchronous task submission + asynchronous querying of task execution results

Request parameters

Response results

TranscriptionResponse

PENDING

RUNNING

SUCCEEDED

FAILED

TranscriptionOutput

PENDING

RUNNING

SUCCEEDED

FAILED

Recognition result description

Key interfaces

Core class (Transcription)

Error codes

More examples

FAQ

Features

Q: Is Base64 encoded audio supported?

Q: How can I provide audio files as publicly accessible URLs?

Q: How long does it take to obtain the recognition results?

Troubleshooting

Q: What should I do if the recognition results are not synchronized with the audio playback?

Q: Why can't I obtain a result after continuous polling?

Q: Why is the speech not recognized (no recognition result)?

More questions

`TranscriptionResponse`

`PENDING`

`RUNNING`

`SUCCEEDED`

`FAILED`

`TranscriptionOutput`

`SUCCEEDED`

`FAILED`

Core class (`Transcription`)