
Alibaba Cloud Model Studio:Python SDK

Last Updated:Mar 24, 2026

This topic describes the parameters and interfaces of the Fun-ASR audio file recognition Python SDK.

User guide: For more information about models and selection suggestions, see Audio file recognition - Fun-ASR/Paraformer.

Prerequisites

Model availability

International

In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding Chinese Mainland.

| Model | Version | Unit price | Free quota (Note) |
| --- | --- | --- | --- |
| fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |

  • Languages supported:

    • fun-asr and fun-asr-2025-11-07: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.

    • fun-asr-2025-08-25: Mandarin and English.

    • fun-asr-mtl and fun-asr-mtl-2025-08-25: Mandarin, Cantonese, English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.

  • Sample rates supported: Any

  • Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

Chinese Mainland

In the Chinese Mainland deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Chinese Mainland.

| Model | Version | Unit price | Free quota (Note) |
| --- | --- | --- | --- |
| fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000032/second | No free quota |
| fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | $0.000032/second | No free quota |
| fun-asr-2025-08-25 | Snapshot | $0.000032/second | No free quota |
| fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | $0.000032/second | No free quota |
| fun-asr-mtl-2025-08-25 | Snapshot | $0.000032/second | No free quota |

  • Languages supported:

    • fun-asr and fun-asr-2025-11-07: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.

    • fun-asr-2025-08-25: Mandarin and English.

    • fun-asr-mtl and fun-asr-mtl-2025-08-25: Mandarin, Cantonese, English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.

  • Sample rates supported: Any

  • Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

Limitations

The input must be a publicly accessible file URL (HTTP/HTTPS), for example, https://your-domain.com/file.mp3. Local files and Base64 audio are not supported.

When using the SDK to access a file stored in OSS, you cannot use a temporary URL with the oss:// prefix.

When using the RESTful API to access a file stored in OSS, you can use a temporary URL with the oss:// prefix:

Important
  • The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.

  • The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.

  • For production environments, use a stable storage service such as OSS to ensure long-term file availability and avoid rate limiting issues.

Specify the URLs using the file_urls parameter. A single request supports up to 100 URLs.

  • Audio formats

    aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

    Important

    Because many audio and video formats and their variants exist, it is not technically feasible to test all of them. The API cannot guarantee that all formats can be correctly recognized. You should test your files to verify that you can obtain the expected speech recognition results.

  • Audio sample rate: Any

  • Audio file size and duration

    The maximum file size is 2 GB. The maximum duration is 12 hours. For files exceeding these limits, split or compress the file before uploading. For more information about best practices for file pre-processing, see Preprocess video files to improve file transcription efficiency (for audio file recognition scenarios).

  • Number of audio files for batch processing

    A single request supports up to 100 file URLs.

  • Recognizable languages: fun-asr supports Chinese, English, and Japanese. fun-asr-mtl supports Chinese, Cantonese, English, Japanese, Korean, Vietnamese, Indonesian, Thai, and the other languages listed under Model availability.
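Because a single request accepts at most 100 URLs, a larger file list must be split client-side into multiple requests. A minimal sketch (the helper name is ours, not part of the SDK):

```python
def batch_urls(urls, batch_size=100):
    """Split a URL list into chunks of at most `batch_size` (the per-request limit)."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

# Each chunk can then be passed as file_urls in its own async_call request.
```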

Getting started

The Transcription core class provides interfaces to submit tasks, wait for completion, and query results. It supports two recognition patterns:

  • Asynchronous submission and synchronous waiting: Submit a task and block until it completes to get the result.

  • Asynchronous submission and asynchronous query: Submit a task and query the result when needed.

Asynchronous submission and synchronous waiting

  1. Call the Transcription class's async_call() method with request parameters.

    Note
    • Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, recognition runs much faster than the audio's actual duration.

    • Recognition results and download URLs expire 24 hours after the task completes. Tasks become unqueryable after expiration.

  2. Call Transcription.wait() to block until the task completes.

    Task statuses: PENDING, RUNNING, SUCCEEDED, FAILED. The wait method blocks on PENDING/RUNNING and returns when the status reaches SUCCEEDED or FAILED.

    Returns a TranscriptionResponse.

Complete sample code:

from http import HTTPStatus
from dashscope.audio.asr import Transcription
import dashscope
import os
import json

# Singapore region. For Beijing region, use: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# API keys differ by region. Get your key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If DASHSCOPE_API_KEY is not set, use: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav']
)

transcribe_response = Transcription.wait(task=task_response.output.task_id)
if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('transcription done!')
else:
    print('transcription failed:', transcribe_response.output)

Asynchronous submission and asynchronous query

  1. Call the Transcription class's async_call() method with request parameters.

    Note
    • Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, recognition runs much faster than the audio's actual duration.

    • Recognition results and download URLs expire 24 hours after the task completes. Tasks become unqueryable after expiration.

  2. Poll Transcription.fetch() until you get the final result.

    Stop polling when status is SUCCEEDED or FAILED.

    Returns a TranscriptionResponse.

Complete sample code:

from http import HTTPStatus
from dashscope.audio.asr import Transcription
import dashscope
import os
import json
import time

# Singapore region. For Beijing region, use: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# API keys differ by region. Get your key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If DASHSCOPE_API_KEY is not set, use: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

transcribe_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav']
)

while True:
    if transcribe_response.output.task_status in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(3)  # pause between polls to avoid unnecessary requests
    transcribe_response = Transcription.fetch(task=transcribe_response.output.task_id)

if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('transcription done!')
else:
    print('transcription failed:', transcribe_response.output)
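The polling loop above can be wrapped in a reusable helper with an overall timeout. A sketch (the helper name is ours; any callable with the shape of Transcription.fetch works):

```python
import time

def poll_until_done(fetch, task_id, interval=3.0, timeout=600.0):
    """Poll `fetch(task=task_id)` until task_status is SUCCEEDED or FAILED,
    sleeping `interval` seconds between calls; raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        response = fetch(task=task_id)
        if response.output.task_status in ('SUCCEEDED', 'FAILED'):
            return response
        time.sleep(interval)
    raise TimeoutError('task %s did not finish within %s seconds' % (task_id, timeout))

# Usage with the SDK: poll_until_done(Transcription.fetch, task_id)
```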

Request parameters

Set request parameters in Transcription.async_call().

Parameter

Type

Default

Required

Description

model

str

-

Yes

Model for audio/video transcription. See Model availability.

file_urls

list[str]

-

Yes

URLs of audio/video files to transcribe (HTTP/HTTPS). Up to 100 URLs per request.

If your audio files are stored in OSS, the SDK does not support temporary URLs that start with the oss:// prefix.

channel_id

list[int]

[0]

No

Audio track indexes to recognize in multi-track files (0-indexed). Examples: [0] = first track only, [0, 1] = first and second tracks. Default: [0].

Important

Each track is billed separately. Example: [0, 1] = two charges per file.
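Because each requested track is billed separately, transcribing both tracks of a stereo file doubles the metered duration. A rough cost sketch (using the international stable price listed above; the variable values are illustrative):

```python
channel_id = [0, 1]        # transcribe both tracks of a stereo file
duration_seconds = 600     # a 10-minute recording
price_per_second = 0.000035

# Two tracks -> two charges for the same file.
estimated_cost = len(channel_id) * duration_seconds * price_per_second
```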

special_word_filter

str

-

No

Specifies the sensitive words to be processed during speech recognition and supports different processing methods for different sensitive words.

If this parameter is not passed, the system's built-in sensitive word filtering logic is enabled. Words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with asterisks (*) of the same length.

If this parameter is passed, the following sensitive word processing policies can be implemented:

  • Replace with *: Replaces the matched sensitive word with asterisks (*) of the same length.

  • Filter out: Completely removes the matched sensitive word from the recognition result.

The value of this parameter must be a JSON string with the following structure:

{
  "filter_with_signed": {
    "word_list": ["test"]
  },
  "filter_with_empty": {
    "word_list": ["start", "happen"]
  },
  "system_reserved_filter": true
}

JSON field descriptions:

  • filter_with_signed

    • Type: object.

    • Required: No.

    • Description: Configures the list of sensitive words to be replaced with *. Matched words in the recognition result are replaced with asterisks (*) of the same length.

    • Example: Based on the JSON example, the speech recognition result for "Help me test this piece of code" will be "Help me **** this piece of code".

    • Internal field:

      • word_list: A string array that lists the sensitive words to be replaced.

  • filter_with_empty

    • Type: object.

    • Required: No.

    • Description: Configures the list of sensitive words to be removed (filtered) from the recognition result. Matched words in the recognition result are completely deleted.

    • Example: Based on the JSON example, the speech recognition result for "Is the game about to start?" will be "Is the game about to?".

    • Internal field:

      • word_list: A string array that lists the sensitive words to be completely removed (filtered).

  • system_reserved_filter

    • Type: Boolean value.

    • Required: No.

    • Default value: true.

    • Description: Specifies whether to enable the system's preset sensitive word rules. If set to true, the system's built-in sensitive word filtering logic is also enabled. Words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with asterisks (*) of the same length.
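The JSON string above can be built with json.dumps rather than written by hand. A sketch (the async_call invocation is shown commented out because it requires a valid API key and file URL):

```python
import json

special_word_filter = json.dumps({
    "filter_with_signed": {"word_list": ["test"]},            # replaced with ****
    "filter_with_empty": {"word_list": ["start", "happen"]},  # removed entirely
    "system_reserved_filter": True,                           # keep the built-in list active
})

# Transcription.async_call(model='fun-asr',
#                          file_urls=['https://your-domain.com/file.mp3'],
#                          special_word_filter=special_word_filter)
```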

diarization_enabled

bool

False

No

Automatic speaker diarization is disabled by default. This feature applies to single-channel audio only (not supported for multi-channel audio).

When enabled, recognition results include the speaker_id field to distinguish speakers.

For an example of speaker_id, see Recognition result description.

speaker_count

int

-

No

Reference speaker count (integer, 2-100). Only applies when diarization_enabled is true.

Default: auto-detected. This parameter helps guide the algorithm but does not guarantee the exact count.

language_hints

list[str]

-

No

Language codes for recognition. Leave unset for automatic language detection.

The system reads only the first value in the array and ignores any additional values.

The language codes supported by different models are as follows:

  • fun-asr, fun-asr-2025-11-07:

    • zh: Chinese

    • en: English

    • ja: Japanese

  • fun-asr-2025-08-25:

    • zh: Chinese

    • en: English

  • fun-asr-mtl, fun-asr-mtl-2025-08-25:

    • zh: Chinese

    • en: English

    • ja: Japanese

    • ko: Korean

    • vi: Vietnamese

    • id: Indonesian

    • th: Thai

    • ms: Malay

    • tl: Filipino

    • ar: Arabic

    • hi: Hindi

    • bg: Bulgarian

    • hr: Croatian

    • cs: Czech

    • da: Danish

    • nl: Dutch

    • et: Estonian

    • fi: Finnish

    • el: Greek

    • hu: Hungarian

    • ga: Irish

    • lv: Latvian

    • lt: Lithuanian

    • mt: Maltese

    • pl: Polish

    • pt: Portuguese

    • ro: Romanian

    • sk: Slovak

    • sl: Slovenian

    • sv: Swedish
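Since the service reads only the first hint, it can be worth validating that value locally before submitting. A sketch (the helper and the SUPPORTED_HINTS table are ours, derived from the lists above):

```python
# Supported language codes per model, taken from the lists above.
SUPPORTED_HINTS = {
    'fun-asr': {'zh', 'en', 'ja'},
    'fun-asr-2025-08-25': {'zh', 'en'},
    'fun-asr-mtl': {'zh', 'en', 'ja', 'ko', 'vi', 'id', 'th', 'ms', 'tl',
                    'ar', 'hi', 'bg', 'hr', 'cs', 'da', 'nl', 'et', 'fi',
                    'el', 'hu', 'ga', 'lv', 'lt', 'mt', 'pl', 'pt', 'ro',
                    'sk', 'sl', 'sv'},
}

def first_hint(model, hints):
    """Return the hint the service will actually use, or None for auto-detection."""
    if not hints:
        return None
    hint = hints[0]  # only the first value is read; the rest are ignored
    if hint not in SUPPORTED_HINTS.get(model, set()):
        raise ValueError('%r is not supported by %s' % (hint, model))
    return hint
```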

speech_noise_threshold

float

-

No

Response results

TranscriptionResponse

TranscriptionResponse encapsulates the basic information of a task, such as task_id and task_status, and the execution result. The execution result corresponds to the output property. For more information, see TranscriptionOutput.

Sample TranscriptionResponse structures:

PENDING status

{
    "status_code":200,
    "request_id":"251aceab-a6aa-9fc4-b7f7-0cc6d3e2a9f3",
    "code":null,
    "message":"",
    "output":{
        "task_id":"7d0a58a3-1dbe-4de9-8cff-5f48213128b0",
        "task_status":"PENDING",
        "submit_time":"2025-02-13 16:55:08.573",
        "scheduled_time":"2025-02-13 16:55:08.592",
        "task_metrics":{
            "TOTAL":2,
            "SUCCEEDED":0,
            "FAILED":0
        }
    },
    "usage":null
}

RUNNING status

{
    "status_code":200,
    "request_id":"d9d530f1-853c-9848-a5f1-f5de59086ff7",
    "code":null,
    "message":"",
    "output":{
        "task_id":"6351feef-9694-45d2-9d32-63454f2ffb8d",
        "task_status":"RUNNING",
        "submit_time":"2025-02-13 17:31:20.681",
        "scheduled_time":"2025-02-13 17:31:20.703",
        "task_metrics":{
            "TOTAL":2,
            "SUCCEEDED":1,
            "FAILED":0
        }
    },
    "usage":null
}

SUCCEEDED status

{
    "status_code":200,
    "request_id":"16668704-6702-9e03-8ab7-a32a5d7bb095",
    "code":null,
    "message":"",
    "output":{
        "task_id":"6351feef-9694-45d2-9d32-63454f2ffb8d",
        "task_status":"SUCCEEDED",
        "submit_time":"2025-02-13 17:31:20.681",
        "scheduled_time":"2025-02-13 17:31:20.703",
        "end_time":"2025-02-13 17:31:21.867",
        "results":[
            {
                "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A31/20ee4e4f-0404-4806-b617-c7d4c62eed19-1.json?Expires=1739525481&OSSAccessKeyId=yourOSSAccessKeyId&Signature=3q%2B1uQmRwltd7FPn5HQM2mBKw74%3D",
                "subtask_status":"SUCCEEDED"
            },
            {
                "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
                "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A31/be4f14c5-e46b-47ff-b03a-476ae9a45fd3-1.json?Expires=1739525481&OSSAccessKeyId=yourOSSAccessKeyId&Signature=EUX%2FRkGcn46L5d93ihQmpWUeYE4%3D",
                "subtask_status":"SUCCEEDED"
            }
        ],
        "task_metrics":{
            "TOTAL":2,
            "SUCCEEDED":2,
            "FAILED":0
        }
    },
    "usage":{
        "duration":9
    }
}

FAILED status

In this example, one subtask failed but the overall task_status is still SUCCEEDED, because a task is marked SUCCEEDED as long as at least one subtask succeeds. Check subtask_status to identify the failed subtask.

{
    "status_code":200,
    "request_id":"16668704-6702-9e03-8ab7-a32a5d7bb095",
    "code":null,
    "message":"",
    "output":{
        "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
        "task_status": "SUCCEEDED",
        "submit_time": "2024-12-16 16:30:59.170",
        "scheduled_time": "2024-12-16 16:30:59.204",
        "end_time": "2024-12-16 16:31:02.375",
        "results": [
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
                "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
                "subtask_status": "SUCCEEDED"
            },
            {
                "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
                "code": "InvalidFile.DownloadFailed",
                "message": "The audio file cannot be downloaded.",
                "subtask_status": "FAILED"
            }
        ],
        "task_metrics": {
            "TOTAL": 2,
            "SUCCEEDED": 1,
            "FAILED": 1
        }
    },
    "usage":{
        "duration":9
    }
}

Parameters to note:

Parameter

Description

status_code

HTTP request status code.

code

  • The outermost code can be ignored.

  • The code under output.results is the error code. You can use this field together with the message field to troubleshoot issues based on the Error codes.

message

  • The outermost message can be ignored.

  • The message under output.results is the error message. You can use this field together with the code field to troubleshoot issues based on the Error codes.

task_id

Task ID.

task_status

Task status.

The valid values are PENDING, RUNNING, SUCCEEDED, and FAILED.

If a task contains multiple subtasks, the status of the entire task is marked as SUCCEEDED as long as at least one subtask succeeds. You must check the subtask_status field to determine the result of a specific subtask.

results

Recognition results of subtasks.

subtask_status

Subtask status.

The valid values are PENDING, RUNNING, SUCCEEDED, and FAILED.

file_url

The URL of the audio file to be recognized.

transcription_url

The URL of the audio recognition result.

The recognition result is saved as a JSON file. You can download the file from the URL specified by transcription_url or directly read the content of the file using an HTTP request. For more information about the content of the JSON file, see Recognition result description.
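Because a task can succeed overall while individual subtasks fail, the results array is worth splitting by subtask_status before downloading anything. A minimal sketch over the response's output dict (the helper name is ours):

```python
def split_subtasks(output):
    """Separate subtasks into transcription URLs (succeeded) and errors (failed)."""
    succeeded, failed = [], []
    for result in output.get('results', []):
        if result.get('subtask_status') == 'SUCCEEDED':
            succeeded.append(result['transcription_url'])
        else:
            failed.append((result.get('code'), result.get('message')))
    return succeeded, failed
```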

TranscriptionOutput

TranscriptionOutput is the output property of TranscriptionResponse, containing task execution results.

Sample TranscriptionOutput structures:

PENDING status

{
    "task_id":"f2f7c2fa-0cd9-4bb2-a283-27b26ee4bb67",
    "task_status":"PENDING",
    "submit_time":"2025-02-13 17:59:27.754",
    "scheduled_time":"2025-02-13 17:59:27.789",
    "task_metrics":{
        "TOTAL":2,
        "SUCCEEDED":0,
        "FAILED":0
    }
}

RUNNING status

{
    "task_id":"f2f7c2fa-0cd9-4bb2-a283-27b26ee4bb67",
    "task_status":"RUNNING",
    "submit_time":"2025-02-13 17:59:27.754",
    "scheduled_time":"2025-02-13 17:59:27.789",
    "task_metrics":{
        "TOTAL":2,
        "SUCCEEDED":0,
        "FAILED":0
    }
}

SUCCEEDED status

{
    "task_id":"f2f7c2fa-0cd9-4bb2-a283-27b26ee4bb67",
    "task_status":"SUCCEEDED",
    "submit_time":"2025-02-13 17:59:27.754",
    "scheduled_time":"2025-02-13 17:59:27.789",
    "end_time":"2025-02-13 17:59:28.828",
    "results":[
        {
            "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
            "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A59/70e737cc-bf8c-418b-b0c8-83fab192a0fa-1.json?Expires=1739527168&OSSAccessKeyId=yourOSSAccessKeyId&Signature=AtGjIKI%2BdgbzjJIu%2BHsr1R5nSAY%3D",
            "subtask_status":"SUCCEEDED"
        },
        {
            "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
            "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/17%3A59/ce1ebe74-be78-4ac8-b4f8-8e438a14d1c2-1.json?Expires=1739527168&OSSAccessKeyId=yourOSSAccessKeyId&Signature=z5s0ROpSU8HwiM8WHPNVpkuFG3A%3D",
            "subtask_status":"SUCCEEDED"
        }
    ],
    "task_metrics":{
        "TOTAL":2,
        "SUCCEEDED":2,
        "FAILED":0
    }
}

FAILED status

The code field specifies the error code and the message field provides the error message. These two fields are returned only when an exception occurs; use them with the Error codes reference to troubleshoot. In this example, one subtask failed while the overall task_status remains SUCCEEDED because another subtask succeeded.

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
            "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
            "subtask_status": "SUCCEEDED"
        },
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 2,
        "SUCCEEDED": 1,
        "FAILED": 1
    }
}

Key parameters:

Parameter

Description

code

The error code. You can use this field together with the message field to troubleshoot issues based on the Error codes.

message

The error message. You can use this message with the code field and refer to Error codes to troubleshoot the issue.

task_id

Task ID.

task_status

Task status.

The valid values are PENDING, RUNNING, SUCCEEDED, and FAILED.

If a task contains multiple subtasks, the status of the entire task is marked as SUCCEEDED as long as at least one subtask succeeds. You must check the subtask_status field to determine the result of a specific subtask.

results

Recognition results of subtasks.

subtask_status

Subtask status.

The valid values are PENDING, RUNNING, SUCCEEDED, and FAILED.

file_url

The URL of the audio file to be recognized.

transcription_url

The URL of the audio recognition result.

The recognition result is saved in a JSON file. You can download the file from the URL specified by transcription_url or directly read the content of the file using an HTTP request. For more information about the content of the JSON file, see Recognition result description.

Recognition result description

The recognition result is saved as a JSON file.

Recognition result example:

{
    "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties":{
        "audio_format":"pcm_s16le",
        "channels":[
            0
        ],
        "original_sampling_rate":16000,
        "original_duration_in_milliseconds":3834
    },
    "transcripts":[
        {
            "channel_id":0,
            "content_duration_in_milliseconds":3720,
            "text":"Hello world, this is Alibaba Speech Lab.",
            "sentences":[
                {
                    "begin_time":100,
                    "end_time":3820,
                    "text":"Hello world, this is Alibaba Speech Lab.",
                    "sentence_id":1,
                    "speaker_id":0, // This field is displayed only when automatic speaker diarization is enabled.
                    "words":[
                        {
                            "begin_time":100,
                            "end_time":596,
                            "text":"Hello ",
                            "punctuation":""
                        },
                        {
                            "begin_time":596,
                            "end_time":844,
                            "text":"world",
                            "punctuation":", "
                        }
                        // Other content is omitted here.
                    ]
                }
            ]
        }
    ]
}

The key parameters are as follows:

Parameter

Type

Description

audio_format

string

The format of the audio in the source file.

channels

array[integer]

The audio track index information in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on.

original_sampling_rate

integer

The sample rate of the audio in the source file (Hz).

original_duration_in_milliseconds

integer

The original duration of the audio in the source file (ms).

channel_id

integer

The index of the transcribed audio track, starting from 0.

content_duration_in_milliseconds

integer

The duration of the content in the audio track that is identified as speech (ms).

Important

Billing is based on speech content duration only (non-speech parts are not metered). Speech duration is typically shorter than total audio duration. The AI-based speech detection may have minor discrepancies.

transcripts

array

The array of per-track transcription results. The text field of each element contains the paragraph-level transcription result.

sentences

array

The sentence-level speech transcription result.

words

array

The word-level speech transcription result.

begin_time

integer

Start timestamp (ms).

end_time

integer

End timestamp (ms).

text

string

The speech transcription result.

speaker_id

integer

The index of the current speaker, starting from 0. This is used to distinguish different speakers.

This field is displayed in the recognition result only when speaker diarization is enabled.

punctuation

string

The predicted punctuation mark after the word, if any.
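Once the JSON file behind transcription_url is downloaded, the fields above can be flattened into timed sentences. A sketch (the function name is ours; speaker_id is present only when diarization is enabled):

```python
def timed_sentences(result):
    """Flatten a recognition-result dict into (begin_ms, end_ms, speaker_id, text) tuples."""
    rows = []
    for transcript in result.get('transcripts', []):
        for sentence in transcript.get('sentences', []):
            rows.append((sentence['begin_time'], sentence['end_time'],
                         sentence.get('speaker_id'), sentence['text']))
    return rows
```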

Key interfaces

Core class (Transcription)

Import: from dashscope.audio.asr import Transcription

Member method

Method signature

Description

async_call

@classmethod
def async_call(cls,
               model: str,
               file_urls: List[str],
               phrase_id: str = None,
               api_key: str = None,
               workspace: str = None,
               **kwargs) -> TranscriptionResponse

Asynchronously submits a speech recognition task.

wait

@classmethod
def wait(cls,
         task: Union[str, TranscriptionResponse],
         api_key: str = None,
         workspace: str = None,
         **kwargs) -> TranscriptionResponse

Blocks the current thread until the asynchronous task is complete (task status is SUCCEEDED or FAILED).

This method returns a TranscriptionResponse.

fetch

@classmethod
def fetch(cls,
          task: Union[str, TranscriptionResponse],
          api_key: str = None,
          workspace: str = None,
          **kwargs) -> TranscriptionResponse

Asynchronously queries the execution result of the current task.

This method returns a TranscriptionResponse.

Error codes

If an error occurs, see Error codes to troubleshoot the issue.

If a task contains multiple subtasks, the overall task status is marked as SUCCEEDED if at least one subtask succeeds. You must check the subtask_status field to determine the result of each subtask.

Example of an error response:

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
            "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
            "subtask_status": "SUCCEEDED"
        },
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 2,
        "SUCCEEDED": 1,
        "FAILED": 1
    }
}

FAQ

Features

Q: Is audio in Base64 encoding supported?

This service recognizes audio from publicly accessible URLs only. It does not support audio in Base64 encoding, binary streams, or local files.

Q: How do I provide an audio file as a publicly accessible URL?

You can typically follow these steps. This is a general guide, and the specific steps may vary for different storage products. We recommend that you upload the audio to Object Storage Service (OSS).

1. Choose a storage and hosting method

Examples include the following:

  • Object Storage Service (Recommended):

    • Use a cloud provider's object storage service, such as OSS. Upload the audio file to a bucket and set its access permissions to public.

    • Advantages: High availability, CDN acceleration support, and easy management.

  • Web server:

    • Place the audio file on a web server that supports HTTP/HTTPS access, such as Nginx or Apache.

    • Advantages: Suitable for small projects or local testing.

  • Content Delivery Network (CDN):

    • Host the audio file on a CDN and access it through the URL provided by the CDN.

    • Advantages: Accelerates file transfer, suitable for high-concurrency scenarios.

2. Upload the audio file

Upload the audio file based on your chosen storage/hosting method. For example:

  • Object Storage Service:

    • Log on to the cloud provider's console and create a bucket.

    • Upload the audio file and set its permissions to "public-read" or generate a temporary access link.

  • Web server:

    • Place the audio file in a specified directory on the server, such as /var/www/html/audio/.

    • Ensure the file is accessible via HTTP/HTTPS.

3. Generate a publicly accessible URL

For example:

  • Object Storage Service:

    • After uploading the file, the system automatically generates a public access URL, typically in the format https://<bucket-name>.<region>.aliyuncs.com/<file-name>.

    • For a more user-friendly domain name, you can attach a custom domain name and enable HTTPS.

  • Web server:

    • The access URL for the file is usually the server address plus the file path, such as https://your-domain.com/audio/file.mp3.

  • CDN:

    • After configuring CDN acceleration, use the URL provided by the CDN, such as https://cdn.your-domain.com/audio/file.mp3.

4. Verify the URL's availability

In a public network environment, ensure that the generated URL is accessible. For example:

  • Open the URL in a browser to check if the audio file can be played.

  • Use a tool, such as curl or Postman, to verify that the URL returns a correct HTTP response (status code 200).
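Step 4 can also be scripted. A sketch using only the Python standard library (a network connection is required for a real check; some servers reject HEAD requests, in which case a ranged GET is a common fallback):

```python
import urllib.request

def url_is_reachable(url, timeout=10):
    """Return True if the URL answers a HEAD request with HTTP 200."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False
```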

When using the SDK to access a file stored in OSS, you cannot use a temporary URL with the oss:// prefix.

When using the RESTful API to access a file stored in OSS, you can use a temporary URL with the oss:// prefix:

Important
  • The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.

  • The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.

  • For production environments, use a stable storage service such as OSS to ensure long-term file availability and avoid rate limiting issues.

Q: How long does it take to get the recognition result?

Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) varies with the queue length and file duration. The longer the audio file, the longer the processing time.

Troubleshooting

If a code error occurs, refer to Error codes to troubleshoot the issue.

Q: Why can't I get a result after continuous polling?

This may be caused by rate limiting. Reduce the polling frequency (for example, poll every few seconds) and retry.

Q: Why is the audio not recognized (no recognition result)?

Check whether the audio format and sample rate are correct and meet the parameter constraints.

You can use the ffprobe tool to retrieve information about the audio container, codec, sample rate, and channels:

ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx