This topic describes the parameters and API of the Paraformer audio file recognition Java SDK.
This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.
User guide: For an overview of models and how to select them, see Audio file recognition - Fun-ASR/Paraformer.
Prerequisites
You have activated Alibaba Cloud Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
Note: To grant temporary access permissions to third-party applications or users, or to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Model list
Feature | paraformer-v2 | paraformer-8k-v2 |
Scenarios | Multilingual recognition for scenarios such as live streaming and meetings | Chinese recognition for scenarios such as telephone customer service and voicemail |
Sample rate | Any | 8 kHz |
Languages | Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian. Supported Chinese dialects: Shanghai dialect, Wu dialect, Min Nan dialect, Northeastern dialect, Gansu dialect, Guizhou dialect, Henan dialect, Hubei dialect, Hunan dialect, Jiangxi dialect, Ningxia dialect, Shanxi dialect, Shaanxi dialect, Shandong dialect, Sichuan dialect, Tianjin dialect, Yunnan dialect, and Cantonese | Chinese |
Punctuation prediction | ✅ Supported by default, no configuration required | ✅ Supported by default, no configuration required |
Inverse text normalization (ITN) | ✅ Supported by default, no configuration required | ✅ Supported by default, no configuration required |
Custom hotwords | ✅ For more information, see Custom hotwords | ✅ For more information, see Custom hotwords |
Specify language for recognition | ✅ Specified by the language_hints parameter | ❌ |
Limitations
The service does not support direct uploads of local audio or video files. It also does not support base64-encoded audio. The input source must be a file URL that is accessible over the Internet and supports the HTTP or HTTPS protocol, for example, https://your-domain.com/file.mp3.
When using the SDK to access a file stored in OSS, you cannot use a temporary URL with the oss:// prefix.
When using the RESTful API to access a file stored in OSS, you can use a temporary URL with the oss:// prefix:
The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.
The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.
For production environments, use a stable storage service such as Alibaba Cloud OSS to ensure long-term file availability and avoid rate limiting issues.
Specify the URL using the fileUrls parameter. A single request supports up to 100 URLs.
Audio formats
aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, and wmv.
Important: The API cannot guarantee correct recognition for all audio and video formats and their variants because it is not feasible to test every possibility. We recommend testing your files to confirm that they produce the expected speech recognition results.
Audio sampling rate
The sample rate varies by model:
paraformer-v2 supports any sample rate
paraformer-8k-v2 only supports an 8 kHz sample rate
Audio file size and duration
The audio file cannot exceed 2 GB in size or 12 hours in duration.
To process files that exceed these limits, you can pre-process them to reduce their size. For more information about pre-processing best practices, see Preprocess video files to improve file transcription efficiency (for audio file recognition scenarios).
Number of audio files for batch processing
A single request supports up to 100 file URLs.
Recognizable languages
Varies by model:
paraformer-v2:
Chinese, including Mandarin and various dialects: Shanghai dialect, Wu dialect, Min Nan dialect, Northeastern dialect, Gansu dialect, Guizhou dialect, Henan dialect, Hubei dialect, Hunan dialect, Jiangxi dialect, Ningxia dialect, Shanxi dialect, Shaanxi dialect, Shandong dialect, Sichuan dialect, Tianjin dialect, Yunnan dialect, and Cantonese
English
Japanese
Korean
paraformer-8k-v2 only supports Chinese
Getting started
The core class (Transcription) provides methods to submit tasks asynchronously, wait for them to complete synchronously, and query task results asynchronously. You can perform audio file recognition using one of the following two approaches:
Asynchronous task submission + synchronous waiting for task completion: After you submit a task, the current thread is blocked until the task is complete and the recognition result is returned.
Asynchronous task submission + asynchronous query of task execution results: After you submit a task, you can query the task execution results by calling the query task interface when needed.
Asynchronous task submission + synchronous waiting for task completion
Configure request parameters.
Instantiate the core class (Transcription).
Call the asyncCall method of the core class (Transcription) to asynchronously submit the task.
Note: The file transcription service processes tasks submitted through the API on a best-effort basis. After a task is submitted, it enters the queuing (PENDING) state. The queuing time depends on the queue length and file duration and cannot be precisely determined, but it is typically within a few minutes. Once the task starts processing, the speech recognition process is hundreds of times faster than real-time playback.
The recognition results and download URLs are valid for 24 hours after a task is complete. After this period, you cannot query the task or download the results.
Call the wait method of the core class (Transcription) to wait synchronously for the task to complete.
A task can have a status of PENDING, RUNNING, SUCCEEDED, or FAILED. The wait call is blocked while the task status is PENDING or RUNNING. Once the task status is SUCCEEDED or FAILED, the wait call is unblocked and returns the task execution result.
The wait method returns a task execution result (TranscriptionResult). See the sketch following these steps.
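The following minimal sketch strings these steps together. The file URL is a placeholder, and the builder method names follow the chained request parameters described in the Request parameters section below.

import com.alibaba.dashscope.audio.asr.transcription.*;
import java.util.Arrays;
import java.util.List;

public class TranscriptionWaitDemo {
    public static void main(String[] args) throws Exception {
        // Step 1: configure request parameters (the file URL is a placeholder).
        TranscriptionParam param = TranscriptionParam.builder()
                .model("paraformer-v2")
                .fileUrls(Arrays.asList("https://your-domain.com/file.mp3"))
                .build();
        // Step 2: instantiate the core class.
        Transcription transcription = new Transcription();
        // Step 3: submit the task asynchronously.
        TranscriptionResult result = transcription.asyncCall(param);
        System.out.println("Task ID: " + result.getTaskId());
        // Step 4: block until the task status is SUCCEEDED or FAILED.
        result = transcription.wait(
                TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
        // Each file URL corresponds to one subtask result.
        List<TranscriptionTaskResult> taskResults = result.getResults();
        for (TranscriptionTaskResult taskResult : taskResults) {
            System.out.println(taskResult.getTranscriptionUrl());
        }
    }
}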
Asynchronous task submission + asynchronous query of task execution results
Configure request parameters.
Instantiate the core class (Transcription).
Call the asyncCall method of the core class (Transcription) to asynchronously submit the task.
Note: The file transcription service processes tasks submitted through the API on a best-effort basis. After a task is submitted, it enters the queuing (PENDING) state. The queuing time depends on the queue length and file duration and cannot be precisely determined, but it is typically within a few minutes. Once the task starts processing, the speech recognition process is hundreds of times faster than real-time playback.
The recognition results and download URLs are valid for 24 hours after a task is complete. After this period, you cannot query the task or download the results.
Call the fetch method of the core class (Transcription) repeatedly until you retrieve the final task result. When the task status is SUCCEEDED or FAILED, stop polling and process the result.
The fetch method returns a task execution result (TranscriptionResult). See the sketch following these steps.
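The following minimal sketch shows the polling variant. The task status is compared through its string form to keep the sketch self-contained; a production version would also handle request exceptions and cap the number of polling attempts.

import com.alibaba.dashscope.audio.asr.transcription.*;
import java.util.Arrays;

public class TranscriptionFetchDemo {
    public static void main(String[] args) throws Exception {
        TranscriptionParam param = TranscriptionParam.builder()
                .model("paraformer-v2")
                .fileUrls(Arrays.asList("https://your-domain.com/file.mp3"))
                .build();
        Transcription transcription = new Transcription();
        // Submit the task asynchronously and keep the task ID for later queries.
        TranscriptionResult submitResult = transcription.asyncCall(param);
        TranscriptionQueryParam queryParam =
                TranscriptionQueryParam.FromTranscriptionParam(param, submitResult.getTaskId());
        // Poll with fetch until the task reaches a final status.
        while (true) {
            TranscriptionResult result = transcription.fetch(queryParam);
            String status = String.valueOf(result.getTaskStatus());
            if ("SUCCEEDED".equals(status) || "FAILED".equals(status)) {
                for (TranscriptionTaskResult taskResult : result.getResults()) {
                    System.out.println(taskResult.getTranscriptionUrl());
                }
                break;
            }
            Thread.sleep(5000);  // avoid querying too frequently
        }
    }
}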
Request parameters
You can configure request parameters using the chained methods of the TranscriptionParam class.
Parameter | Type | Default value | Required | Description |
model | String | - | Yes | The Paraformer model name used for audio and video file transcription. For more information, see Model list. |
fileUrls | List<String> | - | Yes | The URLs of the audio and video files to be transcribed. The HTTP and HTTPS protocols are supported. You can specify up to 100 URLs in each request. If your audio files are stored in Alibaba Cloud OSS, the SDK does not support temporary URLs that start with the oss:// prefix. |
vocabularyId | String | - | No | The ID of a hotword vocabulary created with the latest custom hotwords feature. Only the v2 series models support this parameter and language configurations. The hotwords associated with this ID take effect for the current recognition request. This feature is disabled by default. For more information about how to use this feature, see Custom hotwords. |
channelId | List<Integer> | [0] | No | Specifies the indexes of the audio tracks in a multi-track audio file to recognize. The index starts from 0. For example, [0] indicates that only the first track is recognized, and [0, 1] indicates that both the first and second tracks are recognized. If you omit this parameter, the first track is processed by default. Important Each specified audio track is billed separately. For example, a request for [0, 1] for a single file incurs two separate charges. |
disfluencyRemovalEnabled | Boolean | false | No | Filters filler words. This feature is disabled by default. |
timestampAlignmentEnabled | Boolean | false | No | Specifies whether to enable the timestamp alignment feature. This feature is disabled by default. |
specialWordFilter | String | - | No | Specifies the sensitive words to be processed during speech recognition and supports different processing methods for different sensitive words. If you do not pass this parameter, the system enables its built-in sensitive word filtering logic: any words in the detection results that match the Alibaba Cloud Model Studio sensitive word list (Chinese) are replaced with an equal number of asterisks (*). If you pass this parameter, you can implement your own processing strategy for each sensitive word. The value of this parameter must be a JSON string. |
language_hints | String[] | ["zh", "en"] | No | Specifies the language codes of the speech to be recognized. This parameter is applicable only to the paraformer-v2 model. Supported language codes include zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), and ru (Russian). Note: Because language_hints is not a dedicated field of TranscriptionParam, set it using the generic parameter or parameters method. |
diarizationEnabled | Boolean | false | No | Enables automatic speaker diarization. This feature is disabled by default. It is applicable only to mono audio; multi-channel audio does not support speaker diarization. When this feature is enabled, each entry in the recognition results carries a speaker_id field that distinguishes different speakers. For an example of speaker_id, see Recognition result description. |
speakerCount | Integer | - | No | The reference value for the number of speakers. The value must be an integer from 2 to 100, inclusive. This parameter takes effect after speaker diarization is enabled ( By default, the number of speakers is automatically determined. If you configure this parameter, it can only assist the algorithm in trying to output the specified number of speakers, but cannot guarantee that this number will be output. |
apiKey | String | - | No | Your API key. If you have configured the API key as an environment variable, you do not need to set it in the code. Otherwise, you must set it in the code. |
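As an illustration of the chained configuration style described above, the following fragment (imports and surrounding code as in the earlier sketches) sets several optional parameters. The vocabulary ID is a hypothetical placeholder, and language_hints is passed through the generic parameter method as noted in its table entry.

TranscriptionParam param = TranscriptionParam.builder()
        .model("paraformer-v2")
        .fileUrls(Arrays.asList("https://your-domain.com/file.mp3"))
        .vocabularyId("vocab-xxxx")        // hypothetical hotword vocabulary ID
        .diarizationEnabled(true)          // enable speaker diarization (mono audio only)
        .speakerCount(2)                   // reference value, effective with diarization enabled
        .parameter("language_hints", new String[] {"zh", "en"})
        .build();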
Response results
Task execution result (TranscriptionResult)
TranscriptionResult encapsulates the current task execution result.
Interface/Method | Parameters | Return value | Description |
getRequestId | None | requestId | Returns the request ID. |
getTaskId | None | taskId | Returns the task ID. |
getTaskStatus | None | Task status (PENDING, RUNNING, SUCCEEDED, or FAILED) | Returns the task status. Note: If a task contains multiple subtasks, the overall task status is marked as SUCCEEDED as long as any subtask succeeds. Check the subtask status to determine the result of each subtask. |
getResults | None | Subtask execution results (List<TranscriptionTaskResult>) | Returns the subtask execution results. Each task recognizes one or more audio files. Different audio files are processed in different subtasks, so each task corresponds to one or more subtasks. |
 | None | Task execution result in JSON format | Returns the task execution result in JSON format. |
Subtask execution result (TranscriptionTaskResult)
TranscriptionTaskResult encapsulates the subtask execution result. A subtask corresponds to the recognition of a single audio file.
Interface/Method | Parameters | Return value | Description |
getFileUrl | None | Link to the recognized audio file | Returns the link to the audio file that was recognized. |
getTranscriptionUrl | None | Link to the recognition result | Returns the link to the recognition result. This link is valid for 24 hours; after this period, you cannot query the task or download the results from it. The recognition result is saved as a JSON file. You can download this file from the link or read its content directly through an HTTP request (see the sketch after this table). For the meaning of each field in the JSON data, see Recognition result description. |
getSubtaskStatus | None | Subtask status (PENDING, RUNNING, SUCCEEDED, or FAILED) | Returns the subtask status. |
getMessage | None | Key information recorded during task execution, which may be empty | Returns key information recorded during task execution. When a task fails, you can check this content to analyze the reason. |
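Because the recognition result is a plain JSON file, the link returned by getTranscriptionUrl can be read with any HTTP client. A minimal sketch using only the JDK (the URL value is a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DownloadTranscriptionDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder: use the value returned by getTranscriptionUrl().
        URL url = new URL("https://example.com/transcription-result.json");
        StringBuilder json = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                json.append(line);
            }
        }
        System.out.println(json);  // recognition result in JSON format
    }
}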
Recognition result description
The recognition result is saved as a JSON file.
The following table describes the key parameters:
Parameter | Type | Description |
audio_format | string | The audio format in the source file. |
channels | array[integer] | The audio track index information in the source file. Returns [0] for single-track audio, [0, 1] for dual-track audio, and so on. |
original_sampling_rate | integer | The sample rate (Hz) of the audio in the source file. |
original_duration | integer | The original audio duration (ms) in the source file. |
channel_id | integer | The audio track index of the transcription result, starting from 0. |
content_duration | integer | The duration (ms) of content determined to be speech in the audio track. Important The Paraformer speech recognition model service only transcribes and charges for the duration of content determined to be speech in the audio track. Non-speech content is not measured or charged. Typically, the speech content duration is shorter than the original audio duration. Because an AI model determines whether speech content exists, discrepancies may occur. |
transcript | string | The paragraph-level speech transcription result. |
sentences | array | The sentence-level speech transcription result. |
words | array | The word-level speech transcription result. |
begin_time | integer | The start timestamp (ms). |
end_time | integer | The end timestamp (ms). |
text | string | The speech transcription result. |
speaker_id | integer | The index of the current speaker, starting from 0, used to distinguish different speakers. This field is displayed in the recognition result only when speaker diarization is enabled. |
punctuation | string | The predicted punctuation after the word, if any. |
Key interfaces
Task query parameter configuration class (TranscriptionQueryParam)
The TranscriptionQueryParam class is used to wait for a task to complete by calling the wait method of the Transcription class, or to query the execution result of a task by calling the fetch method of the Transcription class.
You can create a TranscriptionQueryParam instance using the FromTranscriptionParam static method.
Interface/Method | Parameters | Return value | Description |
FromTranscriptionParam | param: the request parameters (TranscriptionParam); taskId: the task ID (String) | A TranscriptionQueryParam instance | Creates a TranscriptionQueryParam instance from the request parameters and the task ID. |
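For example, where param and taskId come from a previously submitted task:

TranscriptionQueryParam queryParam =
        TranscriptionQueryParam.FromTranscriptionParam(param, taskId);
// Pass queryParam to the wait or fetch method of the Transcription class.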
Core class (Transcription)
You can import the Transcription class using the statement import com.alibaba.dashscope.audio.asr.transcription.*;. The key methods of this class are described in the following table:
Interface/Method | Parameters | Return value | Description |
asyncCall | param: the request parameters (TranscriptionParam) | Task execution result (TranscriptionResult) | Asynchronously submits a speech recognition task. |
wait | queryParam: the task query parameters (TranscriptionQueryParam) | Task execution result (TranscriptionResult) | Blocks the current thread until the asynchronous task ends (task status is SUCCEEDED or FAILED), then returns the task execution result. |
fetch | queryParam: the task query parameters (TranscriptionQueryParam) | Task execution result (TranscriptionResult) | Asynchronously queries the current task execution result. |
Error codes
If you encounter an error, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue and provide the Request ID for further investigation.
If a task contains multiple subtasks and any subtask succeeds, the overall task status is marked as SUCCEEDED. You must check the subtask_status field to determine the result of each subtask.
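The following fragment sketches this check, using the getters described in Response results (result is the TranscriptionResult returned by the wait or fetch method):

for (TranscriptionTaskResult taskResult : result.getResults()) {
    if ("FAILED".equals(String.valueOf(taskResult.getSubtaskStatus()))) {
        // Analyze the failure cause for this file.
        System.out.println("Failed: " + taskResult.getMessage());
    } else {
        System.out.println("Result: " + taskResult.getTranscriptionUrl());
    }
}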
Error response example:
{
"task_id": "7bac899c-06ec-4a79-8875-xxxxxxxxxxxx",
"task_status": "SUCCEEDED",
"submit_time": "2024-12-16 16:30:59.170",
"scheduled_time": "2024-12-16 16:30:59.204",
"end_time": "2024-12-16 16:31:02.375",
"results": [
{
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/long_audio_demo_cn.mp3",
"transcription_url": "https://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/prod/paraformer-v2/20241216/xxxx",
"subtask_status": "SUCCEEDED"
},
{
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_exaple_1.wav",
"code": "InvalidFile.DownloadFailed",
"message": "The audio file cannot be downloaded.",
"subtask_status": "FAILED"
}
],
"task_metrics": {
"TOTAL": 2,
"SUCCEEDED": 1,
"FAILED": 1
}
}

More examples
For more examples, see GitHub.
FAQ
Features
Q: Is Base64 encoded audio supported?
No, it is not. The service only supports recognition of audio from URLs that are accessible over the internet. It does not support binary streams or local files.
Q: How can I provide audio files as publicly accessible URLs?
The specific process varies depending on the storage product you use. We recommend uploading the audio to OSS and using the generated URL. Keep the following in mind:
When using the SDK to access a file stored in OSS, you cannot use a temporary URL with the oss:// prefix.
When using the RESTful API to access a file stored in OSS, you can use a temporary URL with the oss:// prefix:
The temporary URL is valid for 48 hours and cannot be used after it expires. Do not use it in a production environment.
The API for obtaining an upload credential is limited to 100 QPS and does not support scaling out. Do not use it in production environments, high-concurrency scenarios, or stress testing scenarios.
For production environments, use a stable storage service such as Alibaba Cloud OSS to ensure long-term file availability and avoid rate limiting issues.
Q: How long does it take to obtain the recognition results?
After a task is submitted, it enters the PENDING state. The queuing time depends on the queue length and file duration and cannot be precisely determined, but it is typically within a few minutes. Longer audio files require more processing time.
Troubleshooting
If you encounter an error code, see Error codes for troubleshooting.
Q: What should I do if the recognition results are not synchronized with the audio playback?
You can set the request parameter timestampAlignmentEnabled to true to enable the timestamp alignment feature. This feature synchronizes the recognition results with the audio playback.
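For example, as a fragment of the parameter configuration shown in Getting started:

TranscriptionParam param = TranscriptionParam.builder()
        .model("paraformer-v2")
        .fileUrls(Arrays.asList("https://your-domain.com/file.mp3"))
        .timestampAlignmentEnabled(true)  // align timestamps with the audio playback
        .build();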
Q: Why can't I obtain a result after continuous polling?
This may be due to rate limiting. To request a quota increase, join the developer group.
Q: Why is the speech not recognized (no recognition result)?
Check whether the audio meets the format and sample rate requirements.
If you are using the paraformer-v2 model, check whether the language_hints parameter is set correctly.
If the previous checks do not resolve the issue, you can use custom hotwords to improve the recognition of specific words.
More questions
For more information, see the Q&A on GitHub.