Alibaba Cloud Model Studio: VideoRetalk API reference

Last Updated: Jan 21, 2026

VideoRetalk is a model that generates a new video of a person speaking by synchronizing their lip movements with a given audio track. This document describes how to use the API to generate videos.

Important

This document applies only to the China (Beijing) region. To use the model, use an API key from the China (Beijing) region.

HTTP

The VideoRetalk API supports only HTTP calls. It uses an asynchronous process to reduce wait times and prevent request timeouts. This means you make two separate requests to generate a video: one to submit the task, and one to query the task status and retrieve the result.

Prerequisites

You have created an API key and set the API key as an environment variable.

Input limitations

  • Video requirements:

    • File: MP4, AVI, and MOV files are supported. The file size must be 300 MB or less. The duration must be between 2 and 120 seconds.

    • Properties: The video frame rate must be between 15 fps and 60 fps. The video must be encoded in H.264 or H.265. The length of each side must be between 640 and 2,048 pixels.

    • Content: The video must be a close-up shot of a person facing forward. Avoid extreme side angles or very small faces.

  • Audio requirements:

    • File: WAV, MP3, and AAC files are supported. The file size must be 30 MB or less. The duration must be between 2 and 120 seconds.

    • Content: The audio must contain a clear and loud human voice. Remove any interference such as ambient noise or background music.

  • Character reference image requirements:

    • File: JPEG, JPG, PNG, BMP, and WebP files are supported. The file size must be 10 MB or less.

    • Content: The image must contain a clear, frontal view of a person's face, and this person must appear in the video. You can also use a screenshot from the video.

  • File URL requirements:

    • The uploaded video, audio, and image files must be accessible through HTTP links. Local paths are not supported. You can also use the temporary storage space provided by the platform to upload local files and create links.
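
The limits above can be checked on the client before a task is submitted, which avoids a round trip that ends in an InvalidFile.* error. The sketch below assumes you already know each file's metadata (for example, from a media probe); the helper names are illustrative, and the server remains the authoritative validator.

```python
# Client-side checks for the documented VideoRetalk input limits.
# Helper names are illustrative; the API performs the authoritative
# validation and returns InvalidFile.* error codes on violations.

VIDEO_FORMATS = {"mp4", "avi", "mov"}
AUDIO_FORMATS = {"wav", "mp3", "aac"}
IMAGE_FORMATS = {"jpeg", "jpg", "png", "bmp", "webp"}

def check_video(size_mb, duration_s, fps, width, height, ext):
    """Return a list of violated video requirements (empty if valid)."""
    errors = []
    if ext.lower() not in VIDEO_FORMATS:
        errors.append("format must be MP4, AVI, or MOV")
    if size_mb > 300:
        errors.append("size must be 300 MB or less")
    if not 2 <= duration_s <= 120:
        errors.append("duration must be between 2 and 120 seconds")
    if not 15 <= fps <= 60:
        errors.append("frame rate must be between 15 and 60 fps")
    if not (640 <= width <= 2048 and 640 <= height <= 2048):
        errors.append("each side must be between 640 and 2,048 pixels")
    return errors

def check_audio(size_mb, duration_s, ext):
    """Return a list of violated audio requirements (empty if valid)."""
    errors = []
    if ext.lower() not in AUDIO_FORMATS:
        errors.append("format must be WAV, MP3, or AAC")
    if size_mb > 30:
        errors.append("size must be 30 MB or less")
    if not 2 <= duration_s <= 120:
        errors.append("duration must be between 2 and 120 seconds")
    return errors

def check_image(size_mb, width, height, ext):
    """Return a list of violated reference-image requirements."""
    errors = []
    if ext.lower() not in IMAGE_FORMATS:
        errors.append("format must be JPEG, JPG, PNG, BMP, or WebP")
    if size_mb > 10:
        errors.append("size must be 10 MB or less")
    long_side, short_side = max(width, height), min(width, height)
    if long_side / short_side > 2:
        errors.append("aspect ratio must be 2 or less")
    if long_side > 4096:
        errors.append("longest side must be 4,096 pixels or less")
    return errors
```

The image checks also apply the aspect-ratio and side-length limits listed for `input.ref_image_url` in the parameter table below.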

Submit a task

POST https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/

Request parameters

Each parameter is listed with its type, location, and whether it is required.

  • Content-Type (String, Header, required): The request content type. Set to application/json.

  • Authorization (String, Header, required): The API key. Example: Bearer d1**2a.

  • X-DashScope-Async (String, Header, required): Set to enable to indicate that the task is created asynchronously.

  • model (String, Body, required): The model to call. Set to videoretalk.

  • input.video_url (String, Body, required): The URL of the video file that you uploaded. The URL must be publicly accessible and must use the HTTP or HTTPS protocol. Example: http://aaa/bbb.mp4

    Video file requirements:

    • Size: 300 MB or less.

    • Format: MP4, AVI, or MOV.

    • Duration: Between 2 and 120 seconds.

    • Frame rate: Between 15 and 60 fps.

    • Encoding: H.264 or H.265 is recommended.

    • Side length: Between 640 and 2,048 pixels.

    • Content: A close-up, front-facing shot of a person. Avoid extreme side angles or very small faces. If the video does not contain a full face or any face, see the FAQ for a solution.

  • input.audio_url (String, Body, required): The URL of the audio file that you uploaded. The URL must be publicly accessible and must use the HTTP or HTTPS protocol. Example: http://aaa/bbb.wav

    Audio file requirements:

    • Size: 30 MB or less.

    • Format: WAV, MP3, or AAC.

    • Duration: Between 2 and 120 seconds. If the video and audio have different durations, see the FAQ for a solution.

    • Content: Must contain clear, loud human speech. Remove any background noise or music.

  • input.ref_image_url (String, Body, optional): The URL of the reference face image. The URL must be publicly accessible and must use the HTTP or HTTPS protocol. Example: http://aaa/bbb.jpg

    Use this parameter to specify which face to lip-sync when the input video contains multiple faces. If the video contains only one face, this parameter is not needed. If you omit this parameter, the system uses the largest face detected in the first frame that contains a face.

    Image file requirements:

    • Content: Must be a clear, front-facing image of a person who appears in the video. You can also use a screenshot from the video.

    • File size: 10 MB or less.

    • Image size: The aspect ratio must be 2 or less, and the longest side must be 4,096 pixels or less.

    • Format: JPEG, JPG, PNG, BMP, or WebP.

  • parameters.video_extension (Boolean, Body, optional): Specifies whether to extend the video when the input audio is longer than the video. Default: false.

    • true: Extends the video duration to match the audio length by looping the video in a "reverse-play, forward-play" pattern.

    • false: Does not extend the video. The generated video has the same duration as the original video, and the audio is truncated.

Response parameters

  • output.task_id (String): The ID of the submitted asynchronous task. Use this ID with the task status query API to retrieve the result. Example: a8532587-fa8c-4ef8-82be-0c46b17950d1

  • output.task_status (String): The status of the task after submission. Example: "PENDING"

  • request_id (String): The request ID. Example: 7574ee8f-38a3-4b1e-9280-11c33ab46e51

Sample request

curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
--header 'X-DashScope-Async: enable' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "videoretalk",
    "input": {
        "video_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250717/pvegot/input_video_01.mp4",
        "audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250717/aumwir/stella2-%E6%9C%89%E5%A3%B0%E4%B9%A67.wav",
        "ref_image_url": ""
    },
    "parameters": {
        "video_extension": false
    }
  }'
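
The same submission can be made from Python using only the standard library. This is a minimal sketch assuming the DASHSCOPE_API_KEY environment variable is set, as in the curl example; build_payload and submit_task are illustrative names, not part of an official SDK.

```python
import json
import os
import urllib.request

SUBMIT_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/"

def build_payload(video_url, audio_url, ref_image_url="", video_extension=False):
    """Assemble the request body described in the parameter table above."""
    return {
        "model": "videoretalk",
        "input": {
            "video_url": video_url,
            "audio_url": audio_url,
            "ref_image_url": ref_image_url,
        },
        "parameters": {"video_extension": video_extension},
    }

def submit_task(payload):
    """POST the task asynchronously; returns the parsed JSON response,
    whose output.task_id is used for the status query in the next section."""
    req = urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
            "X-DashScope-Async": "enable",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```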

Sample response

{
    "output": {
        "task_id": "a8532587-fa8c-4ef8-82be-0c46b17950d1",
        "task_status": "PENDING"
    },
    "request_id": "7574ee8f-38a3-4b1e-9280-11c33ab46e51"
}

Query task status and retrieve results

GET https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}

Request parameters

  • Authorization (String, Header, required): The API key. Example: Bearer d1**2a.

  • task_id (String, URL path, required): The ID of the task to query. This is the value returned by the task submission API. Example: a8532587-fa8c-4ef8-82be-0c46b17950d1

Response parameters

  • output.task_id (String): The ID of the queried task. Example: a8532587-fa8c-4ef8-82be-0c46b17950d1

  • output.task_status (String): The status of the queried task. Possible values:

    • PENDING

    • PRE-PROCESSING

    • RUNNING

    • POST-PROCESSING

    • SUCCEEDED

    • FAILED

    • UNKNOWN: The task does not exist or its status is unknown.

  • output.video_url (String): The URL of the generated video. The URL is valid for 24 hours after the task is completed. Example: https://xxx/1.mp4

  • usage.video_duration (Float): The duration of the video generated for this request, in seconds. Example: "video_duration": 10.23

  • usage.video_ratio (String): The aspect ratio type of the generated video. The only value is standard, which means the output video keeps the aspect ratio of the original video. Example: "video_ratio": "standard"

  • request_id (String): The request ID. Example: 7574ee8f-38a3-4b1e-9280-11c33ab46e51

Sample request

curl -X GET 'https://dashscope.aliyuncs.com/api/v1/tasks/<YOUR_TASK_ID>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"
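
Because the task runs asynchronously, the query is typically wrapped in a polling loop. The sketch below takes the fetch function as an argument (for example, a function that issues the GET request above), so the loop itself needs no network access; poll_task and the 5-second interval are illustrative choices, not API requirements.

```python
import time

# Statuses after which the task will not change further.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "UNKNOWN"}

def poll_task(fetch_status, interval_s=5, max_attempts=120):
    """Poll until the task reaches a terminal status.

    fetch_status is any callable returning the parsed JSON of
    GET /api/v1/tasks/{task_id}; injecting it keeps the loop testable.
    """
    for attempt in range(max_attempts):
        result = fetch_status()
        if result["output"]["task_status"] in TERMINAL_STATUSES:
            return result
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError("task did not reach a terminal status within the polling window")
```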

Sample response

{
    "request_id": "87b9dce5-7f36-4305-a347-xxxxxx",
    "output": {
        "task_id": "3afd65eb-9604-48ea-8a91-xxxxxx",
        "task_status": "SUCCEEDED",
        "submit_time": "2025-09-11 20:15:29.887",
        "scheduled_time": "2025-09-11 20:15:36.741",
        "end_time": "2025-09-11 20:16:40.577",
        "video_url": "http://dashscope-result-sh.oss-cn-shanghai.aliyuncs.com/xxx.mp4?Expires=xxx"
    },
    "usage": {
        "video_duration": 7.16,
        "video_ratio": "standard"
    }
}

Sample error response

{
    "request_id": "7574ee8f-38a3-4b1e-9280-11c33ab46e51",
    "output": {
        "task_id": "a8532587-fa8c-4ef8-82be-0c46b17950d1",
        "task_status": "FAILED",
        "code": "xxx",
        "message": "xxxxxx"
    }
}
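
Once the task reaches a terminal status, the response can be checked for success or failure. A FAILED response carries code and message inside output, as in the sample above. This is a minimal sketch; TaskFailed and extract_result are illustrative helper names, not part of the API.

```python
class TaskFailed(Exception):
    """Raised when the task response reports task_status FAILED."""

def extract_result(response):
    """Return the video URL from a SUCCEEDED response, or raise.

    Expects the parsed JSON of the task status query; a FAILED task
    surfaces its error code and message in the exception text.
    """
    output = response["output"]
    status = output["task_status"]
    if status == "SUCCEEDED":
        return output["video_url"]
    if status == "FAILED":
        raise TaskFailed(f"{output.get('code')}: {output.get('message')}")
    raise ValueError(f"task not finished, status: {status}")
```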

Error codes

For general status codes, see Error messages.

This model also has the following specific error codes:

All of the errors below return HTTP status code 400.

  • InvalidParameter
    Error message: Field required: xxx
    Description: A request parameter is missing or its format is incorrect.

  • InvalidURL.ConnectionRefused
    Error message: Connection to ${url} refused, please provide avaiable URL
    Description: The download was refused. Provide an accessible URL.

  • InvalidURL.Timeout
    Error message: Download ${url} timeout, please check network connection.
    Description: The download timed out. The timeout period is 60 seconds.

  • InvalidFile.Size
    Error message: Invalid file size. The video/audio/image file size must be less than **MB.
    Description: The file exceeds the size limit: 300 MB for video, 30 MB for audio, or 10 MB for images.

  • InvalidFile.Format
    Error message: Invalid file format,the request file format is one of the following types: MP4, AVI, MOV, MP3, WAV, AAC, JPEG, JPG, PNG, BMP, and WEBP.
    Description: The file format is invalid. Supported formats are MP4, AVI, and MOV for video; MP3, WAV, and AAC for audio; and JPEG, JPG, PNG, BMP, and WebP for images.

  • InvalidFile.Resolution
    Error message: Invalid video resolution. The height or width of video must be 640 ~ 2048.
    Description: The side length of the video must be between 640 and 2,048 pixels.

  • InvalidFile.FPS
    Error message: Invalid video FPS. The video FPS must be 15 ~ 60.
    Description: The video frame rate must be between 15 and 60 fps.

  • InvalidFile.Duration
    Error message: Invalid file duration. The video/audio file duration must be 2s ~ 120s.
    Description: The duration of the video or audio file must be between 2 and 120 seconds.

  • InvalidFile.ImageSize
    Error message: The size of image is beyond limit.
    Description: The image dimensions exceed the limit. The aspect ratio must be 2 or less, and the longest side must be 4,096 pixels or less.

  • InvalidFile.Openerror
    Error message: Invalid file, cannot open file as video/audio/image.
    Description: The file cannot be opened as a video, audio, or image file.

  • InvalidFile.Content
    Error message: The input image has no human body or multi human bodies. Please upload other image with single person.
    Description: The input image contains no person or more than one person. Upload an image with a single person.

  • InvalidFile.FaceNotMatch
    Error message: There are no matched face in the video with the provided reference image.
    Description: The face in the reference image does not match any face in the video.

FAQ

  1. How do I handle input video and audio with different durations?

    By default, the longer file is truncated to match the duration of the shorter file.

    If the input audio is longer than the video and you want to generate a video based on the audio length, set the video_extension parameter to true. This extends the video by looping it in a "reverse-play, forward-play" pattern until its duration matches the audio's length.

  2. How does the API handle silent segments in the input audio?

    The model generates frames where the person's mouth is closed to correspond with any silent segments in the input audio.

  3. What happens when a video frame contains no face, but the corresponding audio has speech?

    The original video frame is preserved, and the audio continues to play over it. Lip-syncing is only applied to frames where a detectable face is present.

  4. How do I select a specific person for lip-syncing in a video with multiple people?

    The API can only sync the lips of one person in the video. The algorithm detects the specified face using the input reference image (input.ref_image_url). If it is not provided, the algorithm defaults to the largest face in the first frame that contains a face.
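
The duration behavior described in FAQ 1 and the video_extension parameter can be summarized as a simple rule. The helper below is an illustration of the documented behavior, not an API call; the name is hypothetical.

```python
def expected_output_duration(video_s, audio_s, video_extension=False):
    """Duration in seconds of the generated video per the documented rules:
    by default the longer input is truncated to match the shorter one;
    with video_extension=True and audio longer than video, the video is
    looped in a "reverse-play, forward-play" pattern to match the audio.
    """
    if video_extension and audio_s > video_s:
        return audio_s
    return min(video_s, audio_s)
```

Note that video_extension only takes effect when the audio is the longer input; a longer video is always truncated to the audio length.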