VideoRetalk is a model that generates a new video of a person speaking by synchronizing their lip movements with a given audio track. This document describes how to use the API to generate videos.
This document applies only to the China (Beijing) region. To call the model, you must use an API key from the China (Beijing) region.
HTTP
The VideoRetalk API supports only HTTP calls. It uses an asynchronous process to reduce wait times and prevent request timeouts. This means you make two separate requests to generate a video:
Submit a task: Submit a request to create a video generation task. The API returns a task ID.
Query task status and retrieve results: Use the returned task ID to query the task status and retrieve the generated video.
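The following is a minimal end-to-end sketch of this two-step flow, assuming the jq command-line tool is installed, the DASHSCOPE_API_KEY environment variable is set (see Prerequisites), and VIDEO_URL and AUDIO_URL are placeholders for accessible file URLs. The endpoints and fields used here are documented in the sections below.
# Step 1: Submit a video generation task and capture the task ID.
TASK_ID=$(curl -s 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
  --header 'X-DashScope-Async: enable' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "videoretalk",
    "input": {
      "video_url": "'"$VIDEO_URL"'",
      "audio_url": "'"$AUDIO_URL"'"
    }
  }' | jq -r '.output.task_id')

# Step 2: Poll the task status until the task succeeds or fails.
while true; do
  RESULT=$(curl -s "https://dashscope.aliyuncs.com/api/v1/tasks/$TASK_ID" \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY")
  STATUS=$(echo "$RESULT" | jq -r '.output.task_status')
  if [ "$STATUS" = "SUCCEEDED" ]; then
    echo "$RESULT" | jq -r '.output.video_url'   # URL is valid for 24 hours
    break
  elif [ "$STATUS" = "FAILED" ]; then
    echo "$RESULT" | jq -r '.output.code + ": " + .output.message' >&2
    break
  fi
  sleep 10  # video generation takes time; poll at a modest interval
done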
Prerequisites
You have created an API key and set the API key as an environment variable.
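For example, in a bash shell (the key value below is a placeholder; use your own key from the China (Beijing) region):
# Set the API key for the current shell session.
export DASHSCOPE_API_KEY="YOUR_API_KEY"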
Input limitations
Video requirements:
File: MP4, AVI, and MOV files are supported. The file size must be 300 MB or less. The duration must be between 2 and 120 seconds.
Properties: The video frame rate must be between 15 fps and 60 fps. The video must be encoded in H.264 or H.265. The length of each side must be between 640 and 2,048 pixels.
Content: The video must be a close-up shot of a person facing forward. Avoid extreme side angles or very small faces.
Audio requirements:
File: WAV, MP3, and AAC files are supported. The file size must be 30 MB or less. The duration must be between 2 and 120 seconds.
Content: The audio must contain a clear and loud human voice. Remove any interference such as ambient noise or background music.
Character reference image requirements:
File: JPEG, JPG, PNG, BMP, and WebP files are supported. The file size must be 10 MB or less.
Content: The image must contain a clear, frontal view of a person's face, and this person must appear in the video. You can also use a screenshot from the video.
File URL requirements:
The uploaded video, audio, and image files must be accessible through HTTP links. Local paths are not supported. You can also use the temporary storage space provided by the platform to upload local files and create links.
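Before you submit a task, you can optionally verify a local video against these limits with ffprobe (part of FFmpeg; not part of this API). A minimal check, where input.mp4 is a placeholder file name:
# Print the properties that the API validates: codec, resolution,
# frame rate, duration, and file size.
ffprobe -v error -select_streams v:0 \
  -show_entries stream=codec_name,width,height,r_frame_rate:format=duration,size \
  -of default=noprint_wrappers=1 input.mp4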
Submit a task
POST https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/
Request parameters
Field | Type | Location | Required | Description | Example |
Content-Type | String | Header | Yes | Request type: application/json | application/json |
Authorization | String | Header | Yes | API key. Example: Bearer d1**2a | Bearer d1**2a |
X-DashScope-Async | String | Header | Yes | Set to enable to use asynchronous processing. | enable |
model | String | Body | Yes | Specifies the model to call. | videoretalk |
input.video_url | String | Body | Yes | The URL of the video file that you uploaded. The URL must be a publicly accessible address and must use the HTTP or HTTPS protocol. For video file requirements, see Input limitations. | http://aaa/bbb.mp4 |
input.audio_url | String | Body | Yes | The URL of the audio file that you uploaded. The URL must be a publicly accessible address and must use the HTTP or HTTPS protocol. For audio file requirements, see Input limitations. | http://aaa/bbb.wav |
input.ref_image_url | String | Body | No | The URL of the reference face image. The URL must be a publicly accessible address and must use the HTTP or HTTPS protocol. Use this parameter to specify which face to use for lip-syncing when multiple faces are present in the input video. If the video contains only one face, this parameter is not needed. If you omit this parameter, the system uses the largest face detected in the first frame that contains a face. For image file requirements, see Input limitations. | http://aaa/bbb.jpg |
parameters.video_extension | Boolean | Body | No | Specifies whether to extend the video length when the input audio is longer than the video. Default: false. | false |
Response parameters
Field | Type | Description | Example |
output.task_id | String | The ID of the submitted asynchronous task. Use this ID to retrieve the actual task result using the task status query API. | a8532587-fa8c-4ef8-82be-0c46b17950d1 |
output.task_status | String | The status of the task after submission. | "PENDING" |
request_id | String | The request ID. | 7574ee8f-38a3-4b1e-9280-11c33ab46e51 |
Sample request
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
--header 'X-DashScope-Async: enable' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "videoretalk",
"input": {
"video_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250717/pvegot/input_video_01.mp4",
"audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250717/aumwir/stella2-%E6%9C%89%E5%A3%B0%E4%B9%A67.wav",
"ref_image_url": ""
},
"parameters": {
"video_extension": false
}
}'
Sample response
{
"output": {
"task_id": "a8532587-fa8c-4ef8-82be-0c46b17950d1",
"task_status": "PENDING"
},
"request_id": "7574ee8f-38a3-4b1e-9280-11c33ab46e51"
}
Query task status and retrieve results
GET https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}
Request parameters
Field | Type | Location | Required | Description | Example |
Authorization | String | Header | Yes | API key. Example: Bearer d1**2a. | Bearer d1**2a |
task_id | String | URL path | Yes | The ID of the task to query. This is the value returned by the task submission API. | a8532587-fa8c-4ef8-82be-0c46b17950d1 |
Response parameters
Field | Type | Description | Example |
output.task_id | String | The ID of the queried task. | a8532587-fa8c-4ef8-82be-0c46b17950d1 |
output.task_status | String | The status of the queried task. Task statuses: PENDING (queued), RUNNING (processing), SUCCEEDED (successful), FAILED (failed), CANCELED (canceled), and UNKNOWN (unknown). | "SUCCEEDED" |
output.video_url | String | The URL of the generated video. The URL is valid for 24 hours after the task is completed. | https://xxx/1.mp4 |
usage.video_duration | Float | The duration of the video generated for this request, in seconds. | "video_duration": 10.23 |
usage.video_ratio | String | The aspect ratio type of the video generated for this request. The value is standard, which means the output video keeps the aspect ratio of the input video. | "video_ratio": "standard" |
request_id | String | The request ID. | 7574ee8f-38a3-4b1e-9280-11c33ab46e51 |
Sample request
curl -X GET 'https://dashscope.aliyuncs.com/api/v1/tasks/<YOUR_TASK_ID>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"Sample response
{
"request_id": "87b9dce5-7f36-4305-a347-xxxxxx",
"output": {
"task_id": "3afd65eb-9604-48ea-8a91-xxxxxx",
"task_status": "SUCCEEDED",
"submit_time": "2025-09-11 20:15:29.887",
"scheduled_time": "2025-09-11 20:15:36.741",
"end_time": "2025-09-11 20:16:40.577",
"video_url": "http://dashscope-result-sh.oss-cn-shanghai.aliyuncs.com/xxx.mp4?Expires=xxx"
},
"usage": {
"video_duration": 7.16,
"video_ratio": "standard"
}
}
Sample error response
{
"request_id": "7574ee8f-38a3-4b1e-9280-11c33ab46e51",
"output": {
"task_id": "a8532587-fa8c-4ef8-82be-0c46b17950d1",
"task_status": "FAILED",
"code": "xxx",
"message": "xxxxxx"
}
}
Error codes
For general status codes, see Error messages.
This model also has the following specific error codes:
HTTP return code | Error code | Error message | Description |
400 | InvalidParameter | Field required: xxx | A request parameter is missing or the format is incorrect. |
400 | InvalidURL.ConnectionRefused | Connection to ${url} refused, please provide avaiable URL | The connection to the URL was refused. Provide an accessible URL. |
400 | InvalidURL.Timeout | Download ${url} timeout, please check network connection. | The download timed out. The timeout period is 60s. |
400 | InvalidFile.Size | Invalid file size. The video/audio/image file size must be less than **MB. | The file exceeds the size limit: 300 MB for videos, 30 MB for audio files, and 10 MB for images. |
400 | InvalidFile.Format | Invalid file format,the request file format is one of the following types: MP4, AVI, MOV, MP3, WAV, AAC, JPEG, JPG, PNG, BMP, and WEBP. | The file format is invalid. Supported video formats: MP4, AVI, MOV. Supported audio formats: MP3, WAV, AAC. Supported image formats: JPG, JPEG, PNG, BMP, WebP. |
400 | InvalidFile.Resolution | Invalid video resolution. The height or width of video must be 640 ~ 2048. | The side length of the video must be between 640 and 2048 pixels. |
400 | InvalidFile.FPS | Invalid video FPS. The video FPS must be 15 ~ 60. | The video frame rate must be between 15 and 60 fps. |
400 | InvalidFile.Duration | Invalid file duration. The video/audio file duration must be 2s ~ 120s. | The duration of the video or audio file must be between 2 and 120 seconds. |
400 | InvalidFile.ImageSize | The size of image is beyond limit. | The image size exceeds the limit. The aspect ratio of the image must be 2 or less, and the longest side must be 4,096 pixels or less. |
400 | InvalidFile.Openerror | Invalid file, cannot open file as video/audio/image. | The video, audio, or image file cannot be opened. |
400 | InvalidFile.Content | The input image has no human body or multi human bodies. Please upload other image with single person. | The input image contains either no person or more than one person. Upload an image that contains a single person. |
400 | InvalidFile.FaceNotMatch | There are no matched face in the video with the provided reference image. | The face in the reference image does not match any face in the video. |
FAQ
How do I handle input video and audio with different durations?
By default, the longer file is truncated to match the duration of the shorter file.
If the input audio is longer than the video and you want to generate a video based on the audio length, set the video_extension parameter to true. This extends the video by looping it in a "reverse-play, forward-play" pattern until its duration matches the audio's length.
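For example, the following request body enables video extension. The URLs are placeholders in the same style as the Submit a task sample; the endpoint and headers are the same as in that section.
{
  "model": "videoretalk",
  "input": {
    "video_url": "http://aaa/bbb.mp4",
    "audio_url": "http://aaa/bbb.wav"
  },
  "parameters": {
    "video_extension": true
  }
}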
How does the API handle silent segments in the input audio?
The model generates frames where the person's mouth is closed to correspond with any silent segments in the input audio.
What happens when a video frame contains no face, but the corresponding audio has speech?
The original video frame is preserved, and the audio continues to play over it. Lip-syncing is only applied to frames where a detectable face is present.
How do I select a specific person for lip-syncing in a video with multiple people?
The API can only sync the lips of one person in the video. The algorithm detects the specified face using the input reference image (input.ref_image_url). If it is not provided, the algorithm defaults to the largest face in the first frame that contains a face.
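For example, the following request body (with placeholder URLs) selects the person shown in the reference image for lip-syncing:
{
  "model": "videoretalk",
  "input": {
    "video_url": "http://aaa/bbb.mp4",
    "audio_url": "http://aaa/bbb.wav",
    "ref_image_url": "http://aaa/bbb.jpg"
  }
}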