Best practices - Alibaba Cloud Model Studio - Alibaba Cloud Documentation Center

Important

This document applies only to the China (Beijing) region. To use the model, you must use an API key for the China (Beijing) region.

Preprocess video files to improve file transcription efficiency (for audio file recognition scenarios)

The Paraformer speech recognition API accepts video files. However, video files are typically large and slow to transfer. We recommend preprocessing them: extract only the audio track needed for speech recognition and compress it, which significantly reduces the file size and improves the throughput of video file transcription. The following best practice shows how to use ffmpeg for this preprocessing.

Prerequisites

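Install ffmpeg: Visit the official ffmpeg website for download and installation instructions. The command in this topic uses the libopus encoder, so the ffmpeg build you install must include libopus support.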
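
To confirm that the installation is usable before processing any files, a quick check along the following lines may help. This is a minimal sketch, not part of the official procedure: the function name check_ffmpeg and its messages are illustrative, and it only relies on the standard `ffmpeg -encoders` listing.

```shell
# Hypothetical helper: report whether ffmpeg is on PATH and whether
# this build lists the libopus encoder.
check_ffmpeg() {
  if ! command -v ffmpeg >/dev/null 2>&1; then
    echo "ffmpeg not found"
    return 1
  fi
  # "ffmpeg -encoders" prints every encoder compiled into this build.
  if ffmpeg -hide_banner -encoders 2>/dev/null | grep -q libopus; then
    echo "ffmpeg with libopus found"
  else
    echo "ffmpeg found, but libopus encoder is missing"
    return 2
  fi
}
```

After defining the function, run check_ffmpeg in the same shell. If it reports that libopus is missing, install a build of ffmpeg that includes it.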

Preprocess video files

You can use ffmpeg to extract the audio track from the video file, downmix it to mono (-ac 1), downsample it to 16 kHz (-ar 16000), and compress it with the Opus codec (-acodec libopus).

Shell

ffmpeg -i input-video-file -ac 1 -ar 16000 -acodec libopus output-audio-file.opus
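
If a video contains several audio tracks, ffmpeg chooses one automatically unless told otherwise. The sketch below wraps the same command in a small function and adds `-map 0:a:0` to select the first audio stream of the first input explicitly. The function name extract_first_audio_track and the default output name (the input name with its extension replaced by .opus) are assumptions made for illustration.

```shell
# Hypothetical wrapper around the command above.
# $1: input video file (assumed to have a file extension)
# $2: optional output file; defaults to the input name with a .opus extension
extract_first_audio_track() {
  efa_input="$1"
  efa_output="${2:-${efa_input%.*}.opus}"
  # -map 0:a:0 keeps only the first audio stream, so no video is written.
  # -ac 1 downmixes to mono; -ar 16000 resamples to 16 kHz.
  ffmpeg -hide_banner -i "$efa_input" -map 0:a:0 \
    -ac 1 -ar 16000 -acodec libopus "$efa_output"
}
```

For example, `extract_first_audio_track meeting.mp4` would write meeting.opus next to the input file.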

The output audio file is usually much smaller than the input video file. You can then submit the audio file, by URL, to the file transcription API to obtain speech recognition results.
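
To see how much smaller the audio file is for your own material, you can compare the two file sizes. The following is a minimal sketch: the function name size_report and the output format are illustrative, and it uses only `wc -c` and shell arithmetic.

```shell
# Hypothetical helper: print the byte sizes of the input and output files
# and the output size as a percentage of the input size.
# $1: original video file, $2: extracted audio file
size_report() {
  # Arithmetic expansion strips the padding some wc implementations print.
  sr_in=$(( $(wc -c < "$1") ))
  sr_out=$(( $(wc -c < "$2") ))
  if [ "$sr_in" -eq 0 ]; then
    echo "input file is empty"
    return 1
  fi
  echo "input: $sr_in bytes, output: $sr_out bytes ($(( sr_out * 100 / sr_in ))% of input)"
}
```

For example, `size_report input-video-file output-audio-file.opus` prints both sizes and the percentage, rounded down to a whole number.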