VideoRetalk is a character video generation model. Given a video of a character and an audio file, it generates a new video in which the character's lip movements are synchronized with the input audio. This document describes how to call the model's video generation API.
This document applies only to the China (Beijing) region. To use the model, use an API key from the China (Beijing) region.
HTTP
The VideoRetalk video generation API can only be called over HTTP. To reduce waiting times and avoid request timeouts, this API uses an asynchronous processing method. Therefore, you must make two requests to generate a video.
Task submission: Submit a request to create a video generation task. The API returns a task ID.
Task status query and result retrieval: Use the returned task ID to query the task status and retrieve the generated video.
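The two-step flow above can be sketched in Python using only the standard library. This is an illustrative sketch, not an official SDK: the endpoint, headers, and fields are taken from the sections below, while the function names and polling interval are arbitrary.

```python
import json
import os
import time
import urllib.request

API_BASE = "https://dashscope.aliyuncs.com/api/v1"

def build_payload(video_url, audio_url, ref_image_url=None, video_extension=False):
    # Request body as documented under "Task submission"; ref_image_url is optional.
    payload = {
        "model": "videoretalk",
        "input": {"video_url": video_url, "audio_url": audio_url},
        "parameters": {"video_extension": video_extension},
    }
    if ref_image_url:
        payload["input"]["ref_image_url"] = ref_image_url
    return payload

def _headers():
    # The API key is read from the environment, as recommended in Prerequisites.
    return {
        "Authorization": "Bearer " + os.environ["DASHSCOPE_API_KEY"],
        "Content-Type": "application/json",
    }

def submit_task(video_url, audio_url, **kwargs):
    # Step 1: submit the asynchronous task and return its task_id.
    req = urllib.request.Request(
        API_BASE + "/services/aigc/image2video/video-synthesis/",
        data=json.dumps(build_payload(video_url, audio_url, **kwargs)).encode("utf-8"),
        headers={**_headers(), "X-DashScope-Async": "enable"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["output"]["task_id"]

def wait_for_video(task_id, interval=5):
    # Step 2: poll the task until it reaches a terminal status.
    while True:
        req = urllib.request.Request(API_BASE + "/tasks/" + task_id, headers=_headers())
        with urllib.request.urlopen(req) as resp:
            output = json.load(resp)["output"]
        if output["task_status"] == "SUCCEEDED":
            return output["video_url"]
        if output["task_status"] == "FAILED":
            raise RuntimeError(output.get("message", "task failed"))
        time.sleep(interval)
```

A typical call would be `wait_for_video(submit_task(video_url, audio_url))`.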
Prerequisites
You have obtained an API key and configured the API key as an environment variable.
Input limitations
Video requirements:
File limits: MP4, AVI, and MOV files are supported. The file size must be 300 MB or less. The duration must be between 2 and 120 seconds.
Video limits: The video frame rate must be between 15 fps and 60 fps. The video must be encoded in H.264 or H.265. The length of each side must be between 640 and 2048 pixels.
Video content: The video must be a close-up shot of a person facing forward. Avoid extreme side angles or very small faces.
Audio requirements:
File limits: WAV, MP3, and AAC files are supported. The file size must be 30 MB or less. The duration must be between 2 and 120 seconds.
Audio content: The audio must contain a clear and loud human voice. Interference, such as environmental noise and background music, must be removed.
Character reference image requirements:
File limits: JPEG, JPG, PNG, BMP, and WebP files are supported. The file size must be 10 MB or less.
Image content: The image must contain a clear, frontal view of a person's face, and this person must appear in the video. You can also use a screenshot from the video as the character reference image.
Uploaded file link requirements:
The uploaded video, audio, and image files must be accessible through HTTP links. Local paths are not supported. You can also use the temporary storage space provided by the platform to upload local files and create links.
Task submission
POST https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/
Request parameters
Field | Type | Method | Required | Description | Example |
Content-Type | String | Header | Yes | Request type: application/json | application/json |
Authorization | String | Header | Yes | API key. Example: Bearer d1**2a | Bearer d1**2a |
X-DashScope-Async | String | Header | Yes | Enables asynchronous processing. Set to enable. | enable |
model | String | Body | Yes | Specifies the model to call. | videoretalk |
input.video_url | String | Body | Yes | The URL of the video file that you uploaded. The URL must be a publicly accessible HTTP or HTTPS address. For video file requirements, see Input limitations. | http://aaa/bbb.mp4 |
input.audio_url | String | Body | Yes | The URL of the audio file that you uploaded. The URL must be a publicly accessible HTTP or HTTPS address. For audio file requirements, see Input limitations. | http://aaa/bbb.wav |
input.ref_image_url | String | Body | No | The URL of the reference face image that you uploaded. The URL must be a publicly accessible HTTP or HTTPS address. When the input video contains multiple faces, use this parameter to specify the face for lip-syncing. If the video contains only one face, you do not need to specify it. If you do not provide a reference face image, the person with the largest face in the first frame that contains a face is selected by default. For image file requirements, see Input limitations. | http://aaa/bbb.jpg |
parameters.video_extension | Boolean | Body | No | Specifies whether to extend the video length when the input audio is longer than the video. Default value: false. | false |
Response parameters
Field | Type | Description | Example |
output.task_id | String | The ID of the submitted asynchronous task. Use this ID to retrieve the actual task result using the task status query API. | a8532587-fa8c-4ef8-82be-0c46b17950d1 |
output.task_status | String | The status of the task after the asynchronous task is submitted. | "PENDING" |
request_id | String | The request ID. | 7574ee8f-38a3-4b1e-9280-11c33ab46e51 |
Sample request
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
--header 'X-DashScope-Async: enable' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "videoretalk",
"input": {
"video_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20250717/pvegot/input_video_01.mp4",
"audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20250717/aumwir/stella2-audiobook7.wav",
"ref_image_url": ""
},
"parameters": {
"video_extension": false
}
}'
Sample response
{
    "output": {
        "task_id": "a8532587-fa8c-4ef8-82be-0c46b17950d1",
        "task_status": "PENDING"
    },
    "request_id": "7574ee8f-38a3-4b1e-9280-11c33ab46e51"
}
Task status query and result retrieval
GET https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}
Request parameters
Field | Type | Method | Required | Description | Example |
Authorization | String | Header | Yes | API key. Example: Bearer d1**2a. | Bearer d1**2a |
task_id | String | Url Path | Yes | The ID of the task to query. This is the value returned by the task submission API. | a8532587-fa8c-4ef8-82be-0c46b17950d1 |
Response parameters
Field | Type | Description | Example |
output.task_id | String | The ID of the queried task. | a8532587-fa8c-4ef8-82be-0c46b17950d1 |
output.task_status | String | The status of the queried task. Task statuses: PENDING (queued), RUNNING (processing), SUCCEEDED (succeeded), FAILED (failed), CANCELED (canceled), UNKNOWN (task does not exist or has expired). | "SUCCEEDED" |
output.video_url | String | The URL of the generated video. The URL is valid for 24 hours after the task is completed. | https://xxx/1.mp4 |
usage.video_duration | Float | The duration of the video generated for this request, in seconds. | "video_duration": 10.23 |
usage.video_ratio | String | The aspect ratio type of the video generated for this request. The value is standard, which means the output video has the same aspect ratio as the original video by default. | "video_ratio": "standard" |
request_id | String | The request ID. | 7574ee8f-38a3-4b1e-9280-11c33ab46e51 |
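Because video_url expires 24 hours after the task completes, the result should be copied to your own storage promptly. A minimal stdlib sketch (the function name and chunk size are illustrative):

```python
import urllib.request

def download_result(video_url, dest_path, chunk_size=1 << 16):
    # Stream the generated video to disk. The signed URL returned in
    # output.video_url is only valid for 24 hours, so save a copy early.
    with urllib.request.urlopen(video_url) as resp, open(dest_path, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    return dest_path
```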
Sample request
curl -X GET 'https://dashscope.aliyuncs.com/api/v1/tasks/<YOUR_TASK_ID>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"Sample response
{
"request_id": "87b9dce5-7f36-4305-a347-xxxxxx",
"output": {
"task_id": "3afd65eb-9604-48ea-8a91-xxxxxx",
"task_status": "SUCCEEDED",
"submit_time": "2025-09-11 20:15:29.887",
"scheduled_time": "2025-09-11 20:15:36.741",
"end_time": "2025-09-11 20:16:40.577",
"video_url": "http://dashscope-result-sh.oss-cn-shanghai.aliyuncs.com/xxx.mp4?Expires=xxx"
},
"usage": {
"video_duration": 7.16,
"video_ratio": "standard"
}
}
Sample error response
{
    "request_id": "7574ee8f-38a3-4b1e-9280-11c33ab46e51",
    "output": {
        "task_id": "a8532587-fa8c-4ef8-82be-0c46b17950d1",
        "task_status": "FAILED",
        "code": "xxx",
        "message": "xxxxxx"
    }
}
Status codes
For information about the general status codes of the model service, see Error messages.
This model also has the following specific error codes:
HTTP return code | Error code | Error message | Description |
400 | InvalidParameter | Field required: xxx | A request parameter is missing or the format is incorrect. |
400 | InvalidURL.ConnectionRefused | Connection to ${url} refused, please provide avaiable URL | The download was rejected. Provide an available URL. |
400 | InvalidURL.Timeout | Download ${url} timeout, please check network connection. | The download timed out. The timeout period is 60s. |
400 | InvalidFile.Size | Invalid file size. The video/audio/image file size must be less than **MB. | The video, audio, or image file must be smaller than ** MB. |
400 | InvalidFile.Format | Invalid file format,the request file format is one of the following types: MP4, AVI, MOV, MP3, WAV, AAC, JPEG, JPG, PNG, BMP, and WEBP. | The file format is invalid. Videos in MP4, AVI, or MOV format are supported. Audios in MP3, WAV, or AAC format are supported. Images in JPG, JPEG, PNG, BMP, or WebP format are supported. |
400 | InvalidFile.Resolution | Invalid video resolution. The height or width of video must be 640 ~ 2048. | The side length of the video must be between 640 and 2048 pixels. |
400 | InvalidFile.FPS | Invalid video FPS. The video FPS must be 15 ~ 60. | The video frame rate must be between 15 and 60 fps. |
400 | InvalidFile.Duration | Invalid file duration. The video/audio file duration must be 2s ~ 120s. | The duration of the video or audio file must be between 2s and 120s. |
400 | InvalidFile.ImageSize | The size of image is beyond limit. | The image size exceeds the limit. The aspect ratio of the image must be 2 or less, and the longest side must be 4096 pixels or less. |
400 | InvalidFile.Openerror | Invalid file, cannot open file as video/audio/image. | The video, audio, or image file cannot be opened. |
400 | InvalidFile.Content | The input image has no human body or multi human bodies. Please upload other image with single person. | The input image contains no person or multiple people. |
400 | InvalidFile.FaceNotMatch | There are no matched face in the video with the provided reference image. | The face in the reference image does not match any face in the video. |
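Several of the InvalidFile.* errors above can be caught locally before uploading. The sketch below checks only format and size against the limits documented in Input limitations; the function name and structure are illustrative, and duration, resolution, and frame-rate checks would additionally require a media library:

```python
import os

# Allowed extensions and maximum sizes in MB, as documented in "Input limitations".
LIMITS = {
    "video": ({".mp4", ".avi", ".mov"}, 300),
    "audio": ({".wav", ".mp3", ".aac"}, 30),
    "image": ({".jpeg", ".jpg", ".png", ".bmp", ".webp"}, 10),
}

def check_media_file(path, kind):
    """Return a list of problems; an empty list means the basic checks passed."""
    exts, max_mb = LIMITS[kind]
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in exts:
        problems.append(f"unsupported {kind} format: {ext or '(none)'}")
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        problems.append(f"{kind} file exceeds {max_mb} MB")
    return problems
```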
FAQ
How are inconsistencies between the input audio and video durations handled?
By default, the longer file is truncated to match the duration of the shorter file.
If the input audio is longer than the video and you want to generate a video based on the audio length, you can set the video_extension parameter to true. The algorithm extends the video duration using a "reverse-playback" alternating pattern with the original video frames until the video duration matches the audio duration.
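The "reverse-playback" alternating pattern can be illustrated with a small sketch that generates the extended frame-index sequence. This only demonstrates the ping-pong pattern; the model's exact boundary handling may differ, and the function name is illustrative:

```python
def extend_frame_indices(num_frames, target_len):
    # Emit the original frame indices forward, then backward, alternating,
    # until the sequence is long enough to cover the audio duration.
    indices, forward = [], True
    while len(indices) < target_len:
        step = range(num_frames) if forward else range(num_frames - 1, -1, -1)
        indices.extend(step)
        forward = not forward
    return indices[:target_len]
```

For example, a 3-frame video extended to 8 frames yields the sequence 0, 1, 2, 2, 1, 0, 0, 1.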
How is silence in the input audio handled?
During periods of silence in the audio, the character in the video keeps their mouth closed.
How are videos with no person or an incomplete face handled?
If the audio contains a human voice but a person or their mouth is not visible in a video frame, the original video frame is retained, and the audio plays normally.
How are videos with multiple people handled?
Only one person's face can be replaced. The algorithm detects the specified face using the input reference image (input.ref_image_url). If a reference image is not provided, the algorithm defaults to selecting the largest face in the first frame that contains a face.