Alibaba Cloud Model Studio: Java SDK

Last Updated: Mar 24, 2026

This topic describes the parameters and interfaces of the Fun-ASR audio file recognition Java SDK.

User guide: For more information about models and selection suggestions, see Audio file recognition - Fun-ASR/Paraformer.

Prerequisites

  • You have activated Model Studio and created an API key. To prevent security risks from code leaks, do not hard-code the API key in your code. Instead, export it as an environment variable.

    Note

    Use a temporary authentication token to grant temporary access to third-party applications or users, or to strictly control high-risk operations like accessing or deleting sensitive data.

    A temporary authentication token is more secure than a long-lived API key because it expires in 60 seconds. This short lifespan makes it ideal for temporary call scenarios and significantly reduces the risk of a compromised API key.

    To use this method, replace the API key in your code with the temporary authentication token.

  • Install the latest version of the DashScope SDK.
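
One way to follow the advice about not hard-coding the key is to read it from the environment at startup and fail fast when it is missing. The sketch below assumes the variable is named DASHSCOPE_API_KEY (the name the DashScope SDK conventionally reads; confirm this for your SDK version). The class and method names are illustrative and are not part of the SDK.

```java
import java.util.Map;
import java.util.Optional;

/** Illustrative helper (not part of the DashScope SDK). */
final class ApiKeyResolver {
    // Assumed variable name; adjust it if your environment uses a different one.
    static final String ENV_NAME = "DASHSCOPE_API_KEY";

    /** Returns the trimmed key from the given environment map, or empty if unset or blank. */
    static Optional<String> resolve(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }

    /** Masks a key for logging so that the full secret never reaches log files. */
    static String mask(String key) {
        if (key.length() <= 6) {
            return "***";
        }
        return key.substring(0, 3) + "***" + key.substring(key.length() - 3);
    }

    public static void main(String[] args) {
        String key = resolve(System.getenv())
                .orElseThrow(() -> new IllegalStateException(ENV_NAME + " is not set"));
        System.out.println("Using API key " + mask(key));
    }
}
```

The resolved value can then be passed to the apiKey(...) builder method shown in the examples later in this topic, or omitted entirely if the SDK picks the variable up on its own.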

Model availability

International

In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding Chinese Mainland.

Model | Version | Unit price | Free quota (Note)
fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days
fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | (as above) | (as above)
fun-asr-2025-08-25 | Snapshot | (as above) | (as above)
fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | (as above) | (as above)
fun-asr-mtl-2025-08-25 | Snapshot | (as above) | (as above)

  • Languages supported:

    • fun-asr and fun-asr-2025-11-07: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.

    • fun-asr-2025-08-25: Mandarin and English.

    • fun-asr-mtl and fun-asr-mtl-2025-08-25: Mandarin, Cantonese, English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.

  • Sample rates supported: Any

  • Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

Chinese Mainland

In the Chinese Mainland deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Chinese Mainland.

Model | Version | Unit price | Free quota (Note)
fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000032/second | No free quota
fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | (as above) | (as above)
fun-asr-2025-08-25 | Snapshot | (as above) | (as above)
fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | (as above) | (as above)
fun-asr-mtl-2025-08-25 | Snapshot | (as above) | (as above)

  • Languages supported:

    • fun-asr and fun-asr-2025-11-07: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.

    • fun-asr-2025-08-25: Mandarin and English.

    • fun-asr-mtl and fun-asr-mtl-2025-08-25: Mandarin, Cantonese, English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.

  • Sample rates supported: Any

  • Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
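
As a rough illustration of the per-second pricing: this topic states that billing is based on speech content duration and that each recognized track is billed separately. The sketch below assumes the cost is simply the summed speech duration of all tracks multiplied by the unit price, with no rounding and no free quota applied. Actual metering may differ, and the class and method names are illustrative.

```java
import java.math.BigDecimal;
import java.util.List;

/** Illustrative cost estimate; not an official billing formula. */
final class CostEstimator {
    /**
     * Sums the speech duration of every recognized track (in milliseconds) and
     * multiplies it by the per-second unit price. Rounding and free quota are ignored.
     */
    static BigDecimal estimate(List<Long> speechMillisPerTrack, BigDecimal pricePerSecond) {
        long totalMillis = 0;
        for (long millis : speechMillisPerTrack) {
            if (millis < 0) {
                throw new IllegalArgumentException("duration must be non-negative");
            }
            totalMillis += millis;
        }
        // movePointLeft(3) converts milliseconds to seconds without any rounding.
        return pricePerSecond.multiply(BigDecimal.valueOf(totalMillis)).movePointLeft(3);
    }

    public static void main(String[] args) {
        BigDecimal international = new BigDecimal("0.000035"); // USD per second, International price
        // One hour of detected speech on a single track.
        System.out.println(estimate(List.of(3_600_000L), international).toPlainString());
    }
}
```

BigDecimal is used instead of double so that very small per-second prices do not accumulate floating-point error across many files.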

Limitations

The input must be a publicly accessible file URL (HTTP/HTTPS), for example, https://your-domain.com/file.mp3. Local files and Base64 audio are not supported.

The SDK does not support temporary URLs with the oss:// prefix for audio files stored in OSS.

The RESTful API supports temporary URLs with the oss:// prefix for audio files in OSS, but with the following limitations:

Important
  • The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.

  • The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.

  • For production environments, use a stable storage service such as OSS to ensure long-term file availability and avoid rate limiting issues.

Specify the URLs using the fileUrls parameter. A single request supports up to 100 URLs.

  • Audio formats

    aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

    Important

    Because many audio and video formats and their variants exist, it is not technically feasible to test all of them. The API cannot guarantee that all formats can be correctly recognized. You should test your files to verify that you can obtain the expected speech recognition results.

  • Audio sample rate: Any

  • Audio file size and duration

    The maximum file size is 2 GB. The maximum duration is 12 hours. For files exceeding these limits, split or compress the file before uploading. For more information about best practices for file pre-processing, see Preprocess video files to improve file transcription efficiency (for audio file recognition scenarios).

  • Number of audio files for batch processing

    A single request supports up to 100 file URLs.

  • Recognizable languages: Vary by model. See the supported languages listed under Model availability.
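
Some of the limits above can be checked locally before a task is submitted. The following sketch rejects URLs that are not HTTP/HTTPS (including oss:// URLs and local paths) and splits a larger list into requests of at most 100 URLs. It does not verify that a URL is publicly reachable or that the file is within the size and duration limits. All names are hypothetical helpers, not SDK APIs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative pre-flight checks; names are hypothetical, not SDK APIs. */
final class FileUrlChecks {
    static final int MAX_URLS_PER_REQUEST = 100;

    /** Returns true only for absolute http:// or https:// URLs that have a host. */
    static boolean isSupportedUrl(String url) {
        if (url == null) {
            return false;
        }
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Validates every URL, then splits the list into consecutive batches of at most 100. */
    static List<List<String>> toBatches(List<String> urls) {
        for (String url : urls) {
            if (!isSupportedUrl(url)) {
                throw new IllegalArgumentException("Unsupported file URL: " + url);
            }
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < urls.size(); start += MAX_URLS_PER_REQUEST) {
            int end = Math.min(start + MAX_URLS_PER_REQUEST, urls.size());
            batches.add(new ArrayList<>(urls.subList(start, end)));
        }
        return batches;
    }
}
```

Each returned batch can be passed to fileUrls(...) as its own request.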

Getting started

The core class (Transcription) provides interfaces for submitting tasks asynchronously, waiting for tasks to finish synchronously, and querying task results asynchronously. You can use the following two methods to perform audio file recognition:

  • Asynchronous submission and synchronous wait: Submit a task and block the current thread until the task is complete to retrieve the recognition result.

  • Asynchronous submission and asynchronous query: Submit a task and then call the query interface to retrieve the task result when needed.

Asynchronous submission + synchronous wait

  1. Configure request parameters.

  2. Instantiate the core class (Transcription).

  3. Call the asyncCall method of the core class (Transcription) to asynchronously submit the task.

    Note
    • Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, recognition runs considerably faster than the audio's playback duration.

    • Recognition results and download URLs expire 24 hours after the task completes. Tasks become unqueryable after expiration.

  4. Call the wait method of the core class (Transcription) to synchronously wait for the task to finish.

    Task statuses include PENDING, RUNNING, SUCCEEDED, and FAILED. The wait interface blocks on PENDING/RUNNING and unblocks when the task reaches SUCCEEDED or FAILED, returning the task execution result.

    wait returns the task execution result (TranscriptionResult).

Complete example:

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        //.apiKey("apikey")
                        .model("fun-asr") // Here, fun-asr is used as an example. You can change the model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/models.
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Block and wait for the task to complete, then get the result.
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Print the result.
            System.out.println(new GsonBuilder().setPrettyPrinting().create().toJson(result.getOutput()));
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}
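
Because wait blocks the calling thread until the task reaches SUCCEEDED or FAILED, an application may want an upper bound on how long it blocks. The sketch below shows one generic way to do this with the standard library. The Callable stands in for the transcription.wait(...) call, and the class and method names are illustrative. Giving up locally only stops the local wait; it is an assumption here that the submitted task keeps running on the server and can still be queried with fetch.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative wrapper that bounds how long a blocking call may take. */
final class BoundedWait {
    /**
     * Runs the blocking call on a worker thread. Returns its value, or empty if the
     * call does not finish within the timeout (a null result also maps to empty).
     */
    static <T> Optional<T> callWithTimeout(Callable<T> blockingCall, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(blockingCall);
        try {
            return Optional.ofNullable(future.get(timeout, unit));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread only
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("Blocking call failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A call such as callWithTimeout(() -> transcription.wait(queryParam), 30, TimeUnit.MINUTES) would then return empty after 30 minutes instead of blocking indefinitely.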

Asynchronous submission + asynchronous query

  1. Configure request parameters.

  2. Instantiate the core class (Transcription).

  3. Call the asyncCall method of the core class (Transcription) to asynchronously submit the task.

    Note
    • Tasks enter the PENDING state after submission. Queuing time (typically a few minutes) depends on the queue length and file duration. Once processing starts, recognition runs considerably faster than the audio's playback duration.

    • Recognition results and download URLs expire 24 hours after the task completes. Tasks become unqueryable after expiration.

  4. Repeatedly call the fetch method of the core class (Transcription) until you retrieve the final task result.

    If the task status is SUCCEEDED or FAILED, stop polling and process the result.

    fetch returns the task execution result (TranscriptionResult).

Complete example:

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.alibaba.dashscope.common.TaskStatus;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        // Create transcription request parameters.
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        //.apiKey("apikey")
                        .model("fun-asr") // Here, fun-asr is used as an example. You can change the model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/models.
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request.
            TranscriptionResult result = transcription.asyncCall(param);
            System.out.println("RequestId: " + result.getRequestId());
            // Poll for the task execution result until the task is complete.
            while (true) {
                result = transcription.fetch(TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
                if (result.getTaskStatus() == TaskStatus.SUCCEEDED || result.getTaskStatus() == TaskStatus.FAILED) {
                    break;
                }
                Thread.sleep(1000);
            }
            // Print the result.
            System.out.println(new GsonBuilder().setPrettyPrinting().create().toJson(result.getOutput()));
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}
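
The example above polls once per second and never gives up. Since queuing can take minutes, a longer-running client might prefer a growing delay and an overall deadline. The sketch below shows that idea with the standard library only. The Supplier stands in for a call that fetches the current status string, and the class, method names, and delay values are illustrative assumptions.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Illustrative polling helper with capped exponential backoff and a deadline. */
final class StatusPoller {
    /** True for the two terminal statuses named in this topic. */
    static boolean isTerminal(String status) {
        return "SUCCEEDED".equals(status) || "FAILED".equals(status);
    }

    /** Doubles the delay after each attempt, capped at maxDelay. */
    static Duration nextDelay(Duration current, Duration maxDelay) {
        Duration doubled = current.multipliedBy(2);
        return doubled.compareTo(maxDelay) > 0 ? maxDelay : doubled;
    }

    /**
     * Polls until a terminal status is seen or the timeout elapses.
     * Returns the last status observed, which may be non-terminal on timeout.
     */
    static String pollUntilDone(Supplier<String> fetchStatus, Duration initialDelay,
                                Duration maxDelay, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        Duration delay = initialDelay;
        String status = fetchStatus.get();
        while (!isTerminal(status) && System.nanoTime() - deadline < 0) {
            try {
                Thread.sleep(delay.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return status;
            }
            delay = nextDelay(delay, maxDelay);
            status = fetchStatus.get();
        }
        return status;
    }
}
```

With the SDK, the supplier could wrap transcription.fetch(...) and return result.getTaskStatus().name(); any exception handling that fetch requires would go inside that lambda.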

Request parameters

Configure the request parameters using the chained methods of TranscriptionParam.

Example:

TranscriptionParam param = TranscriptionParam.builder()
  .model("fun-asr")
  .fileUrls(
          Arrays.asList(
                  "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                  "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
  .build();

Parameter

Type

Default value

Required

Description

model

String

-

Yes

The name of the model used for audio and video file transcription. For more information, see Model availability.

fileUrls

List<String>

-

Yes

A list of URLs for the audio and video files to be transcribed. The HTTP and HTTPS protocols are supported. A single request supports up to 100 URLs.

If your audio files are stored in Alibaba Cloud OSS, the SDK does not support temporary URLs with the oss:// prefix.

vocabularyId

String

-

No

The ID of a hotword vocabulary. The hotwords in the specified vocabulary take effect during this speech recognition task. This feature is disabled by default. For more information about how to use this feature, see Customize hotwords.

channelId

List<Integer>

[0]

No

Specifies track indices to recognize in multi-track audio files. Indices start from 0. For example, [0] recognizes the first track, and [0, 1] recognizes the first and second tracks simultaneously. If omitted, only the first track is processed by default.

Important

Each specified track is billed separately. For example, requesting [0, 1] for a single file incurs two separate charges.

specialWordFilter

String

-

No

Specifies the sensitive words to be processed during speech recognition and supports different processing methods for different sensitive words.

If this parameter is not passed, the system's built-in sensitive word filtering logic is enabled. Words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with asterisks (*) of the same length.

If this parameter is passed, the following sensitive word processing policies can be implemented:

  • Replace with *: Replaces the matched sensitive word with asterisks (*) of the same length.

  • Filter out: Completely removes the matched sensitive word from the recognition result.

The value of this parameter must be a JSON string with the following structure:

{
  "filter_with_signed": {
    "word_list": ["test"]
  },
  "filter_with_empty": {
    "word_list": ["start", "happen"]
  },
  "system_reserved_filter": true
}

JSON field descriptions:

  • filter_with_signed

    • Type: object.

    • Required: No.

    • Description: Configures the list of sensitive words to be replaced with *. Matched words in the recognition result are replaced with asterisks (*) of the same length.

    • Example: Based on the JSON example, the speech recognition result for "Help me test this piece of code" will be "Help me **** this piece of code".

    • Internal field:

      • word_list: A string array that lists the sensitive words to be replaced.

  • filter_with_empty

    • Type: object.

    • Required: No.

    • Description: Configures the list of sensitive words to be removed (filtered) from the recognition result. Matched words in the recognition result are completely deleted.

    • Example: Based on the JSON example, the speech recognition result for "Is the game about to start?" will be "Is the game about to?".

    • Internal field:

      • word_list: A string array that lists the sensitive words to be completely removed (filtered).

  • system_reserved_filter

    • Type: Boolean value.

    • Required: No.

    • Default value: true.

    • Description: Specifies whether to enable the system's preset sensitive word rules. If set to true, the system's built-in sensitive word filtering logic is also enabled. Words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with asterisks (*) of the same length.

diarizationEnabled

Boolean

false

No

Automatic speaker diarization is disabled by default. This feature applies to single-channel audio only (not supported for multi-channel audio).

When enabled, recognition results include the speaker_id field to distinguish speakers.

For an example of speaker_id, see Recognition result description.

speakerCount

Integer

-

No

This parameter is a speaker count hint (an integer from 2 to 100). It only takes effect when diarizationEnabled is true.

By default, speaker count is auto-detected. This parameter assists the algorithm but does not guarantee the exact speaker count.

language_hints

String[]

-

No

Sets language codes for recognition. If you cannot determine the language in advance, leave this unset—the model will auto-detect the language.

The system reads only the first value in the array. Extra values are ignored.

Language codes supported by different models:

  • fun-asr, fun-asr-2025-11-07:

    • zh: Chinese

    • en: English

    • ja: Japanese

  • fun-asr-2025-08-25:

    • zh: Chinese

    • en: English

  • fun-asr-mtl, fun-asr-mtl-2025-08-25:

    • zh: Chinese

    • en: English

    • ja: Japanese

    • ko: Korean

    • vi: Vietnamese

    • id: Indonesian

    • th: Thai

    • ms: Malay

    • tl: Filipino

    • ar: Arabic

    • hi: Hindi

    • bg: Bulgarian

    • hr: Croatian

    • cs: Czech

    • da: Danish

    • nl: Dutch

    • et: Estonian

    • fi: Finnish

    • el: Greek

    • hu: Hungarian

    • ga: Irish

    • lv: Latvian

    • lt: Lithuanian

    • mt: Maltese

    • pl: Polish

    • pt: Portuguese

    • ro: Romanian

    • sk: Slovak

    • sl: Slovenian

    • sv: Swedish

Note

language_hints must be set using the parameter or parameters method of the TranscriptionParam instance:

Set using parameter

TranscriptionParam param = TranscriptionParam.builder()
  .model("fun-asr")
  .parameter("language_hints", new String[]{"zh"})
  .build();

Set using parameters

TranscriptionParam param = TranscriptionParam.builder()
  .model("fun-asr")
  .parameters(Collections.singletonMap("language_hints", new String[]{"zh"}))
  .build();

apiKey

String

-

No

Your API key. If you have configured the API key as an environment variable, you do not need to set it in the code. Otherwise, you must set it in the code.
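
The specialWordFilter parameter described above takes a JSON string. As a sketch of how that string might be assembled from two word lists, the helper below builds the documented structure with the standard library only. Its escaping covers backslashes and double quotes but is not a full JSON encoder, so a JSON library such as Gson (already imported by the examples in this topic) is the safer choice for arbitrary input. The class and method names are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative builder for the specialWordFilter JSON string. */
final class SpecialWordFilterJson {
    /** Quotes a word, escaping backslashes and double quotes only. */
    static String quote(String word) {
        return "\"" + word.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Renders a list of words as a JSON array of strings. */
    static String wordList(List<String> words) {
        return words.stream()
                .map(SpecialWordFilterJson::quote)
                .collect(Collectors.joining(",", "[", "]"));
    }

    /**
     * @param replaceWithAsterisks words for filter_with_signed (replaced with *)
     * @param removeEntirely       words for filter_with_empty (deleted from the result)
     * @param systemReservedFilter value of system_reserved_filter
     */
    static String build(List<String> replaceWithAsterisks, List<String> removeEntirely,
                        boolean systemReservedFilter) {
        return "{"
                + "\"filter_with_signed\":{\"word_list\":" + wordList(replaceWithAsterisks) + "},"
                + "\"filter_with_empty\":{\"word_list\":" + wordList(removeEntirely) + "},"
                + "\"system_reserved_filter\":" + systemReservedFilter
                + "}";
    }
}
```

Calling build(List.of("test"), List.of("start", "happen"), true) yields a compact form of the JSON example shown for specialWordFilter.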

Response

Task execution result (TranscriptionResult)

TranscriptionResult encapsulates the current task execution result.

Interface/Method

Parameter

Return value

Description

public String getRequestId()

None

requestId

Gets the request ID.

public String getTaskId()

None

taskId

Gets the task ID.

public TaskStatus getTaskStatus()

None

TaskStatus, task status

Gets the task status.

TaskStatus is an enumeration class. You only need to be concerned with the PENDING, RUNNING, SUCCEEDED, and FAILED statuses.

Note

When a task contains multiple subtasks, the overall task status is marked as SUCCEEDED as long as at least one subtask succeeds. You need to check the subtask_status field to determine the result of each specific subtask.

public List<TranscriptionTaskResult> getResults()

None

Subtask execution result (TranscriptionTaskResult)

Gets the subtask execution result (TranscriptionTaskResult).

Each task recognizes one or more audio files. Different audio files are processed in different subtasks. Therefore, each task corresponds to one or more subtasks.

public JsonObject getOutput()

None

Task execution result, in JSON format

Gets the task execution result.

The result is in JSON format. If you use the getOutput interface to get the task execution result, you must parse the JSON data after you receive it.

JSON examples:

Normal example

{
    "task_id":"0795ff8c-b666-4e91-bb8b-xxx",
    "task_status":"SUCCEEDED",
    "submit_time":"2025-02-13 16:12:09.109",
    "scheduled_time":"2025-02-13 16:12:09.128",
    "end_time":"2025-02-13 16:12:10.189",
    "results":[
        {
            "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
            "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/16%3A12/34604a7b-579a-4223-8797-5116a49b07ec-1.json?Expires=1739520730&OSSAccessKeyId=yourOSSAccessKeyId&Signature=tMqyH56oB5rDW9%2FFqD8Yo%2F3WaPk%3D",
            "subtask_status":"SUCCEEDED"
        },
        {
            "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
            "transcription_url":"https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20250213/16%3A12/3baafe5f-d09d-46c6-8b01-724927670edb-1.json?Expires=1739520730&OSSAccessKeyId=yourOSSAccessKeyId&Signature=BF7vPxlsJN9hkJlY%2BLReezxOwK8%3D",
            "subtask_status":"SUCCEEDED"
        }
    ],
    "task_metrics":{
        "TOTAL":2,
        "SUCCEEDED":2,
        "FAILED":0
    }
}

Abnormal example

The code field is the error code and the message field is the error message. Both fields appear only when an error occurs. Use them together with Error codes to troubleshoot the issue.

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
            "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
            "subtask_status": "SUCCEEDED"
        },
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 2,
        "SUCCEEDED": 1,
        "FAILED": 1
    }
}
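
Because the overall status can be SUCCEEDED while individual files fail, as in the abnormal example, client code usually needs to inspect each entry. The sketch below shows one way to collect failed entries. The Entry class is a hypothetical stand-in for one element of the results array; it is not an SDK type.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative summary of per-file outcomes; not part of the DashScope SDK. */
final class SubtaskReport {
    /** Minimal stand-in for one entry of the "results" array. */
    static final class Entry {
        final String fileUrl;
        final String subtaskStatus;
        final String code;    // present only for failed subtasks, otherwise null
        final String message; // present only for failed subtasks, otherwise null

        Entry(String fileUrl, String subtaskStatus, String code, String message) {
            this.fileUrl = fileUrl;
            this.subtaskStatus = subtaskStatus;
            this.code = code;
            this.message = message;
        }
    }

    /** Returns the entries whose subtask_status is FAILED, in their original order. */
    static List<Entry> failed(List<Entry> results) {
        List<Entry> failures = new ArrayList<>();
        for (Entry entry : results) {
            if ("FAILED".equals(entry.subtaskStatus)) {
                failures.add(entry);
            }
        }
        return failures;
    }

    /** One line per failed file, suitable for logging or a retry queue. */
    static String describe(Entry entry) {
        return entry.fileUrl + " -> " + entry.code + ": " + entry.message;
    }
}
```

With the SDK, the same check could be driven by getResults() and getSubTaskStatus(), which are described in the next section.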

Subtask execution result (TranscriptionTaskResult)

TranscriptionTaskResult encapsulates the subtask execution result. A subtask is the recognition of a single audio file.

Interface/Method

Parameter

Return value

Description

public String getFileUrl()

None

URL of the recognized audio file

Gets the URL of the recognized audio file.

public String getTranscriptionUrl()

None

URL for the recognition result

Gets the URL for the recognition result. This URL is valid for 24 hours. After this period, you cannot query the task or download the result using the URL from a previous query.

The recognition result is saved as a JSON file. You can download this file from the URL or read its content directly through an HTTP request.

For the meaning of each field in the JSON data, see Recognition result description.

public TaskStatus getSubTaskStatus()

None

TaskStatus, subtask status

Gets the subtask status.

TaskStatus is an enumeration class. You only need to be concerned with the PENDING, RUNNING, SUCCEEDED, and FAILED statuses.

public String getMessage()

None

Key information from the task execution process. May be empty.

Gets key information from the task execution process.

If a task fails, you can view this content to analyze the cause.
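
Since the URL returned by getTranscriptionUrl is valid for only 24 hours, it can help to check its expiry before downloading. The sketch below reads the Expires query parameter that appears in the sample URLs in this topic and downloads the JSON with the JDK HTTP client (Java 11 or later). Treating Expires as a Unix timestamp in seconds is an assumption based on those sample URLs, and the class and method names are illustrative.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;
import java.util.Optional;

/** Illustrative helpers for working with a transcription result URL. */
final class TranscriptionDownload {
    /** Extracts the Expires query parameter, assumed to be Unix epoch seconds. */
    static Optional<Instant> expiresAt(String transcriptionUrl) {
        String query = URI.create(transcriptionUrl).getRawQuery();
        if (query == null) {
            return Optional.empty();
        }
        for (String pair : query.split("&")) {
            if (pair.startsWith("Expires=")) {
                try {
                    long seconds = Long.parseLong(pair.substring("Expires=".length()));
                    return Optional.of(Instant.ofEpochSecond(seconds));
                } catch (NumberFormatException e) {
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    /** Downloads the recognition result and returns the JSON document as a string. */
    static String download(String transcriptionUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(transcriptionUrl)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

If the result is needed beyond the validity period, store the downloaded JSON yourself rather than the URL.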

Recognition result description

The recognition result is saved as a JSON file.

Recognition result example:

{
    "file_url":"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties":{
        "audio_format":"pcm_s16le",
        "channels":[
            0
        ],
        "original_sampling_rate":16000,
        "original_duration_in_milliseconds":3834
    },
    "transcripts":[
        {
            "channel_id":0,
            "content_duration_in_milliseconds":3720,
            "text":"Hello world, this is Alibaba Speech Lab.",
            "sentences":[
                {
                    "begin_time":100,
                    "end_time":3820,
                    "text":"Hello world, this is Alibaba Speech Lab.",
                    "sentence_id":1,
                    "speaker_id":0, // This field is displayed only when automatic speaker diarization is enabled.
                    "words":[
                        {
                            "begin_time":100,
                            "end_time":596,
                            "text":"Hello ",
                            "punctuation":""
                        },
                        {
                            "begin_time":596,
                            "end_time":844,
                            "text":"world",
                            "punctuation":", "
                        }
                        // Other content is omitted here.
                    ]
                }
            ]
        }
    ]
}

The key parameters are as follows:

Parameter

Type

Description

audio_format

string

The format of the audio in the source file.

channels

array[integer]

The audio track index information in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on.

original_sampling_rate

integer

The sample rate of the audio in the source file (Hz).

original_duration_in_milliseconds

integer

The original duration of the audio in the source file (ms).

channel_id

integer

The index of the transcribed audio track, starting from 0.

content_duration_in_milliseconds

integer

The duration of the content in the audio track that is identified as speech (ms).

Important

Billing is based on speech content duration only (non-speech parts are not metered). Speech duration is typically shorter than total audio duration. The AI-based speech detection may have minor discrepancies.

transcript

string

The paragraph-level speech transcription result.

sentences

array

The sentence-level speech transcription result.

words

array

The word-level speech transcription result.

begin_time

integer

Start timestamp (ms).

end_time

integer

End timestamp (ms).

text

string

The speech transcription result.

speaker_id

integer

The index of the current speaker, starting from 0. This is used to distinguish different speakers.

This field is displayed in the recognition result only when speaker diarization is enabled.

punctuation

string

The predicted punctuation mark after the word, if any.
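
As one illustration of how the sentence-level begin_time, end_time, and text fields can be used, the sketch below formats a sentence as a subtitle cue in the common SubRip (SRT) layout. The choice of output format is only an example and is not something the service produces. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Illustrative formatter that turns sentence timestamps into SubRip cues. */
final class SrtFormatter {
    /** Formats a millisecond offset as HH:MM:SS,mmm. */
    static String timestamp(long millis) {
        long hours = millis / 3_600_000;
        long minutes = (millis / 60_000) % 60;
        long seconds = (millis / 1_000) % 60;
        long ms = millis % 1_000;
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", hours, minutes, seconds, ms);
    }

    /** Builds one cue: index line, time range line, then the sentence text. */
    static String cue(int index, long beginMillis, long endMillis, String text) {
        return index + "\n"
                + timestamp(beginMillis) + " --> " + timestamp(endMillis) + "\n"
                + text + "\n";
    }

    public static void main(String[] args) {
        // Values taken from the recognition result example in this topic.
        System.out.print(cue(1, 100, 3820, "Hello world, this is Alibaba Speech Lab."));
    }
}
```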

Key interfaces

Task query parameter class (TranscriptionQueryParam)

TranscriptionQueryParam is used when waiting for a task to complete (by calling the wait method of Transcription) or querying a task execution result (by calling the fetch method of Transcription).

Create a TranscriptionQueryParam instance using the static method FromTranscriptionParam.

Example:

// Create transcription request parameters.
TranscriptionParam param =
        TranscriptionParam.builder()
                // If you have not configured the API key as an environment variable, replace apiKey with your own API key.
                //.apiKey("apikey")
                .model("fun-asr")
                .fileUrls(
                        Arrays.asList(
                                "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                .build();
try {
    Transcription transcription = new Transcription();
    // Submit the transcription request.
    TranscriptionResult result = transcription.asyncCall(param);
    System.out.println("RequestId: " + result.getRequestId());
    TranscriptionQueryParam queryParam = TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId());
    // Pass queryParam to transcription.wait(...) or transcription.fetch(...).
} catch (Exception e) {
    System.out.println("error: " + e);
}

Interface/Method

Parameter

Return value

Description

public static TranscriptionQueryParam FromTranscriptionParam(TranscriptionParam param, String taskId)
  • param: A TranscriptionParam instance

  • taskId: The task ID

A TranscriptionQueryParam instance

Creates a TranscriptionQueryParam instance.

Core class (Transcription)

To use Transcription, add the statement import com.alibaba.dashscope.audio.asr.transcription.*; to your code. The key interfaces of this class are as follows:

Interface/Method

Parameter

Return value

Description

public TranscriptionResult asyncCall(TranscriptionParam param)

param: Speech recognition parameters, a TranscriptionParam instance

Task execution result (TranscriptionResult)

Asynchronously submits a speech recognition task.

public TranscriptionResult wait(TranscriptionQueryParam queryParam)

queryParam: A TranscriptionQueryParam instance

Task execution result (TranscriptionResult)

Blocks the current thread until the asynchronous task is complete (task status is SUCCEEDED or FAILED).

public TranscriptionResult fetch(TranscriptionQueryParam queryParam)

queryParam: A TranscriptionQueryParam instance

Task execution result (TranscriptionResult)

Asynchronously queries the execution result of the current task.

Error codes

If an error occurs, see Error messages to troubleshoot the issue.

If a task contains multiple subtasks, the overall task status is marked as SUCCEEDED if at least one subtask succeeds. You must check the subtask_status field to determine the result of each subtask.

Example of an error response:

{
    "task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
    "task_status": "SUCCEEDED",
    "submit_time": "2024-12-16 16:30:59.170",
    "scheduled_time": "2024-12-16 16:30:59.204",
    "end_time": "2024-12-16 16:31:02.375",
    "results": [
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
            "transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
            "subtask_status": "SUCCEEDED"
        },
        {
            "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
            "code": "InvalidFile.DownloadFailed",
            "message": "The audio file cannot be downloaded.",
            "subtask_status": "FAILED"
        }
    ],
    "task_metrics": {
        "TOTAL": 2,
        "SUCCEEDED": 1,
        "FAILED": 1
    }
}
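Because the overall status can be SUCCEEDED while individual files fail, it may help to filter the results by subtask_status before using them. The following is a minimal sketch, not the SDK's own API: SubtaskResult is a hypothetical holder class that mirrors the fields of one entry in the results array, and it assumes the response has already been mapped into such objects by whatever JSON handling your project uses.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SubtaskCheck {
    /** Hypothetical holder mirroring one entry of the "results" array. */
    public static final class SubtaskResult {
        public final String fileUrl;
        public final String subtaskStatus;
        public final String code;     // null when the subtask succeeded
        public final String message;  // null when the subtask succeeded

        public SubtaskResult(String fileUrl, String subtaskStatus, String code, String message) {
            this.fileUrl = fileUrl;
            this.subtaskStatus = subtaskStatus;
            this.code = code;
            this.message = message;
        }
    }

    /** Returns every entry whose subtask_status is not SUCCEEDED. */
    public static List<SubtaskResult> failed(List<SubtaskResult> results) {
        return results.stream()
                .filter(r -> !"SUCCEEDED".equals(r.subtaskStatus))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder URLs; the codes and messages follow the sample response above.
        List<SubtaskResult> results = List.of(
                new SubtaskResult("https://example.com/a.mp3", "SUCCEEDED", null, null),
                new SubtaskResult("https://example.com/b.wav", "FAILED",
                        "InvalidFile.DownloadFailed", "The audio file cannot be downloaded."));
        for (SubtaskResult r : failed(results)) {
            System.out.println(r.fileUrl + " -> " + r.code + ": " + r.message);
        }
    }
}
```

Checking each entry this way avoids treating a partially failed task as fully transcribed.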

FAQ

Features

Q: Is audio in Base64 encoding supported?

This service recognizes audio from publicly accessible URLs only. It does not support audio in Base64 encoding, binary streams, or local files.
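A client-side check can catch unsupported inputs before a task is submitted. The following is a minimal sketch under stated assumptions: AudioUrlCheck and isHttpUrl are hypothetical names, and the check only confirms that the input is an absolute http:// or https:// URL with a host. It does not confirm that the URL is actually reachable from the public network.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class AudioUrlCheck {
    /** Returns true only for absolute http:// or https:// URLs that include a host. */
    public static boolean isHttpUrl(String candidate) {
        if (candidate == null) {
            return false;
        }
        try {
            URI uri = new URI(candidate.trim());
            String scheme = uri.getScheme();
            boolean httpScheme = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return httpScheme && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Not parseable as a URI at all, for example a Windows file path.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHttpUrl("https://your-domain.com/audio/file.mp3"));
        System.out.println(isHttpUrl("/var/www/html/audio/file.mp3"));
    }
}
```

Local paths, data: URIs that carry Base64 content, and oss:// URLs all fail this check, which matches the inputs that the SDK does not accept.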

Q: How do I provide an audio file as a publicly accessible URL?

The following steps are a general guide; the details vary by storage product. We recommend uploading the audio to Object Storage Service (OSS).

1. Choose a storage and hosting method

Examples include the following:

  • Object Storage Service (Recommended):

    • Use a cloud provider's object storage service, such as OSS. Upload the audio file to a bucket and set its access permissions to public.

    • Advantages: High availability, CDN acceleration support, and easy management.

  • Web server:

    • Place the audio file on a web server that supports HTTP/HTTPS access, such as Nginx or Apache.

    • Advantages: Suitable for small projects or local testing.

  • Content Delivery Network (CDN):

    • Host the audio file on a CDN and access it through the URL provided by the CDN.

    • Advantages: Accelerates file transfer, suitable for high-concurrency scenarios.

2. Upload the audio file

Upload the audio file based on your chosen storage/hosting method. For example:

  • Object Storage Service:

    • Log on to the cloud provider's console and create a bucket.

    • Upload the audio file and set its permissions to "public-read" or generate a temporary access link.

  • Web server:

    • Place the audio file in a specified directory on the server, such as /var/www/html/audio/.

    • Ensure the file is accessible via HTTP/HTTPS.

3. Generate a publicly accessible URL

For example:

  • Object Storage Service:

    • After uploading the file, the system automatically generates a public access URL, typically in the format https://<bucket-name>.<region>.aliyuncs.com/<file-name>.

    • For a more user-friendly domain name, you can attach a custom domain name and enable HTTPS.

  • Web server:

    • The access URL for the file is usually the server address plus the file path, such as https://your-domain.com/audio/file.mp3.

  • CDN:

    • After configuring CDN acceleration, use the URL provided by the CDN, such as https://cdn.your-domain.com/audio/file.mp3.

4. Verify the URL's availability

In a public network environment, ensure that the generated URL is accessible. For example:

  • Open the URL in a browser to check if the audio file can be played.

  • Use a tool, such as curl or Postman, to verify that the URL returns a correct HTTP response (status code 200).
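The same check can be scripted. The following is a minimal sketch using the JDK's built-in java.net.http.HttpClient (Java 11 or later); the class and method names are hypothetical. It requests only the first byte with a Range header so that a large audio file is not downloaded in full, and therefore treats both 200 and 206 (Partial Content) as reachable. Run it from a machine outside your private network, since a URL that works internally may still be unreachable publicly.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class UrlReachability {
    /** Sends a GET for the first byte only and returns the HTTP status code. */
    public static int probe(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("Range", "bytes=0-0") // servers that ignore Range reply with 200
                .GET()
                .build();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }

    /** 200 = full response, 206 = the server honored the Range header. */
    public static boolean isReachable(int statusCode) {
        return statusCode == 200 || statusCode == 206;
    }

    public static void main(String[] args) throws Exception {
        int status = probe(args[0]);
        System.out.println(status + (isReachable(status) ? " reachable" : " not reachable"));
    }
}
```

GET is used instead of HEAD because some signed or temporary links are valid for a single HTTP method only.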

The SDK does not support temporary URLs with the oss:// prefix for audio files stored in OSS.

The RESTful API supports temporary URLs with the oss:// prefix for audio files in OSS, but with the following limitations:

Important
  • The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.

  • The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.

  • For production environments, use a stable storage service such as OSS to ensure long-term file availability and avoid rate limiting issues.

Q: How long does it take to get the recognition result?

After submission, a task enters the PENDING state. The queuing time, typically a few minutes, depends on the queue length and the file duration. Longer audio files also take longer to process.

Troubleshooting

If the service returns an error code, troubleshoot the issue based on the information in Error codes.

Q: Why can't I get a result after continuous polling?

Rate limiting is a possible cause.
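If rate limiting is the cause, polling less often may help. The following is a minimal sketch of polling with exponential backoff. It is not part of the SDK: BackoffPoller is a hypothetical helper, and statusSource is a stand-in for whatever call returns the current task status in your code (for example, a wrapper around fetch). The delay values are illustrative assumptions, not documented limits.

```java
import java.util.function.Supplier;

public class BackoffPoller {
    /**
     * Calls statusSource until it reports SUCCEEDED or FAILED, doubling the
     * wait between attempts up to maxDelayMillis. Returns the last status seen,
     * which may be non-terminal if maxAttempts is exhausted.
     */
    public static String pollUntilDone(Supplier<String> statusSource,
                                       long initialDelayMillis,
                                       long maxDelayMillis,
                                       int maxAttempts) {
        long delay = initialDelayMillis;
        String status = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = statusSource.get();
            if ("SUCCEEDED".equals(status) || "FAILED".equals(status)) {
                return status;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return status;
                }
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
        return status;
    }

    public static void main(String[] args) {
        // Fake status source: PENDING for the first two calls, then SUCCEEDED.
        int[] calls = {0};
        Supplier<String> fake = () -> ++calls[0] < 3 ? "PENDING" : "SUCCEEDED";
        System.out.println(pollUntilDone(fake, 1_000, 30_000, 20));
    }
}
```

Capping the delay keeps the worst-case wait bounded, and capping the attempts prevents the loop from running indefinitely when a task never reaches a terminal state.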

Q: Why is the audio not recognized (no recognition result)?

Check whether the audio format and sample rate are correct and meet the parameter constraints.

You can use the ffprobe tool to retrieve information about the audio container, codec, sample rate, and channels:

ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
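With these options, ffprobe prints one key=value pair per line, which is straightforward to read from Java before a file is submitted. The following is a minimal sketch under stated assumptions: FfprobeInfo is a hypothetical helper, ffprobe must be installed and on the PATH, and for files with several streams a later stream's value overwrites an earlier one with the same key.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FfprobeInfo {
    /** Parses "key=value" lines as printed with -of default=noprint_wrappers=1. */
    public static Map<String, String> parse(String output) {
        Map<String, String> info = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                info.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return info;
    }

    /** Runs ffprobe with the options shown above and parses its standard output. */
    public static Map<String, String> probe(String path)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder("ffprobe", "-v", "error",
                "-show_entries", "format=format_name",
                "-show_entries", "stream=codec_name,sample_rate,channels",
                "-of", "default=noprint_wrappers=1", path)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String out = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return parse(out);
    }

    public static void main(String[] args) throws Exception {
        probe(args[0]).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The resulting map can then be compared against the format and sample rate constraints before the URL is passed to the recognition task.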