Alibaba Cloud Model Studio: VideoRetalk video generation API reference

Last Updated: Oct 21, 2025

VideoRetalk is a character video generation model. You can use a video of a character and an audio file to generate a new video where the character's lip movements are synchronized with the input audio. This document describes how to call the API for the video generation feature provided by this model.

Important

This document applies only to the China (Beijing) region. To use the model, use an API key from the China (Beijing) region.

HTTP

The VideoRetalk video generation API can be called only over HTTP. To reduce waiting time and avoid request timeouts, the API processes tasks asynchronously. Generating a video therefore takes two requests: one to submit the task, and one to query the task status and retrieve the result.

Prerequisites

You have obtained an API key and configured the API key as an environment variable.
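As a minimal sketch of this prerequisite in Python, the helper below reads the key from the environment and builds the Authorization header. The variable name DASHSCOPE_API_KEY is taken from the curl samples in this document; the function name auth_header is hypothetical.

```python
import os


def auth_header() -> dict:
    """Build the Authorization header from the DASHSCOPE_API_KEY variable.

    auth_header is a hypothetical helper name, not part of any SDK.
    """
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        # Fail early instead of sending a request with an empty key.
        raise RuntimeError("Set the DASHSCOPE_API_KEY environment variable first.")
    return {"Authorization": f"Bearer {api_key}"}
```

Failing early with a clear message avoids sending a request that the service would reject for a missing key.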

Input limitations

  • Video requirements:

    • File limits: MP4, AVI, and MOV files are supported. The file size must be 300 MB or less. The duration must be between 2 and 120 seconds.

    • Video limits: The video frame rate must be between 15 fps and 60 fps. The video must be encoded in H.264 or H.265. The length of each side must be between 640 and 2048 pixels.

    • Video content: The video must be a close-up shot of a person facing forward. Avoid extreme side angles or very small faces.

  • Audio requirements:

    • File limits: WAV, MP3, and AAC files are supported. The file size must be 30 MB or less. The duration must be between 2 and 120 seconds.

    • Audio content: The audio must contain a clear and loud human voice. Interference, such as environmental noise and background music, must be removed.

  • Character reference image requirements:

    • File limits: JPEG, JPG, PNG, BMP, and WebP files are supported. The file size must be 10 MB or less.

    • Image content: The image must contain a clear, frontal view of a person's face, and this person must appear in the video. You can also use a screenshot from the video as the character reference image.

  • Uploaded file link requirements:

    • The uploaded video, audio, and image files must be accessible through HTTP links. Local paths are not supported. You can also use the temporary storage space provided by the platform to upload local files and create links.
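The limits above can be checked locally before you upload anything. The following is a sketch only: it assumes you have already extracted the file metadata with a tool of your choice, that the bounds are inclusive, and that 1 MB means 1024 * 1024 bytes. The image dimension limits are the ones listed for input.ref_image_url. All names (VideoInfo, check_video, check_audio, check_image) are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MB = 1024 * 1024  # assumption: the documented "MB" means 1024 * 1024 bytes


@dataclass
class VideoInfo:
    fmt: str          # container format, for example "mp4"
    size_bytes: int
    duration_s: float
    fps: float
    codec: str        # for example "h264"; "hevc" is accepted as H.265
    width: int
    height: int


def check_video(v: VideoInfo) -> List[str]:
    """Return the violated video limits (empty list if none)."""
    problems = []
    if v.fmt.lower() not in {"mp4", "avi", "mov"}:
        problems.append("format must be MP4, AVI, or MOV")
    if v.size_bytes > 300 * MB:
        problems.append("file size must be 300 MB or less")
    if not 2 <= v.duration_s <= 120:  # assumption: bounds are inclusive
        problems.append("duration must be between 2 and 120 seconds")
    if not 15 <= v.fps <= 60:
        problems.append("frame rate must be between 15 and 60 fps")
    if v.codec.lower() not in {"h264", "h265", "hevc"}:
        problems.append("encoding must be H.264 or H.265")
    if not all(640 <= side <= 2048 for side in (v.width, v.height)):
        problems.append("each side must be between 640 and 2048 pixels")
    return problems


def check_audio(fmt: str, size_bytes: int, duration_s: float) -> List[str]:
    """Return the violated audio limits (empty list if none)."""
    problems = []
    if fmt.lower() not in {"wav", "mp3", "aac"}:
        problems.append("format must be WAV, MP3, or AAC")
    if size_bytes > 30 * MB:
        problems.append("file size must be 30 MB or less")
    if not 2 <= duration_s <= 120:  # assumption: bounds are inclusive
        problems.append("duration must be between 2 and 120 seconds")
    return problems


def check_image(fmt: str, size_bytes: int, width: int, height: int) -> List[str]:
    """Return the violated reference image limits (empty list if none)."""
    problems = []
    if fmt.lower() not in {"jpeg", "jpg", "png", "bmp", "webp"}:
        problems.append("format must be JPEG, JPG, PNG, BMP, or WebP")
    if size_bytes > 10 * MB:
        problems.append("file size must be 10 MB or less")
    if max(width, height) > 4096:
        problems.append("longest side must be 4096 pixels or less")
    # Assumption: aspect ratio means longer side divided by shorter side.
    if min(width, height) <= 0 or max(width, height) / min(width, height) > 2:
        problems.append("aspect ratio must be 2 or less")
    return problems
```

These checks cover only the measurable limits. Content requirements, such as a frontal close-up or clean speech, still need a manual review.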

Task submission

POST https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/

Request parameters

| Field | Type | Method | Required | Description | Example |
| --- | --- | --- | --- | --- | --- |
| Content-Type | String | Header | Yes | Request type: application/json | application/json |
| Authorization | String | Header | Yes | API key. Example: Bearer d1**2a | Bearer d1**2a |
| X-DashScope-Async | String | Header | Yes | Set to enable to indicate that the task is created asynchronously. | enable |
| model | String | Body | Yes | Specifies the model to call. | videoretalk |
| input.video_url | String | Body | Yes | The URL of the video file that you uploaded. The URL must be a publicly accessible address and support the HTTP or HTTPS protocol. Video file requirements: • Size: The file size must be 300 MB or less. • Format: MP4, AVI, or MOV. • Duration: The duration must be longer than 2 seconds and shorter than 120 seconds. • Frame rate: The frame rate must be between 15 fps and 60 fps. • Encoding: We recommend that you use H.264 or H.265 encoding. • Side length: The side length must be between 640 and 2048 pixels. • Content: The video must be a close-up shot of a person facing forward. Avoid extreme side angles or very small faces. If the face in the video is incomplete or no person is present, see the FAQ for how to handle this. | http://aaa/bbb.mp4 |
| input.audio_url | String | Body | Yes | The URL of the audio file that you uploaded. The URL must be a publicly accessible address and support the HTTP or HTTPS protocol. Audio file requirements: • Size: The file size must be 30 MB or less. • Format: WAV, MP3, or AAC. • Duration: The duration must be longer than 2 seconds and shorter than 120 seconds. If the video and audio have different durations, see the FAQ for how to handle this. • Content: The audio must contain a clear and loud human voice. Interference, such as environmental noise and background music, must be removed. | http://aaa/bbb.wav |
| input.ref_image_url | String | Body | No | The URL of the reference face image that you uploaded. The URL must be a publicly accessible address and support the HTTP or HTTPS protocol. When multiple faces are present in the input video, you can use this parameter to specify the face for lip-syncing. If only one face is present in the video, you do not need to specify it. If you do not provide a reference face image, the person with the largest face in the first frame that contains a face is selected by default. Image file requirements: • Content: The image must contain a clear front view of a person's face, and this person must appear in the video. • File size: The file size must be 10 MB or less. • Image size: The aspect ratio must be 2 or less. The maximum side length must be 4096 pixels or less. • Format: JPEG, JPG, PNG, BMP, or WebP. | http://aaa/bbb.jpg |
| parameters.video_extension | Boolean | Body | No | Specifies whether to extend the video length when the input audio is longer than the video. The default value is false. You can set it to true or false. • If you set this parameter to true, the original video frames are extended in a "reverse-playback" alternating pattern until the video duration matches the audio duration. • If you set this parameter to false, the video length is not extended. The generated video will have the same duration as the original video, and the audio will be truncated. | false |
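To make the effect of parameters.video_extension concrete, the sketch below computes the expected output duration and one possible frame order for the "reverse-playback" alternating pattern. It illustrates the documented behavior and is not the service's implementation. The function names are hypothetical, and whether the boundary frames are repeated when the direction flips is an assumption.

```python
from typing import List


def output_duration(video_s: float, audio_s: float, video_extension: bool = False) -> float:
    """Expected duration of the generated video, per the documented rules."""
    if video_extension and audio_s > video_s:
        # The video is extended until it matches the audio duration.
        return audio_s
    # Otherwise the longer input is truncated to the shorter one.
    return min(video_s, audio_s)


def extended_frame_order(num_frames: int, target_frames: int) -> List[int]:
    """Frame indices for a forward/reverse alternating extension.

    Assumption: each pass replays all frames, so the boundary frame is
    repeated when the direction flips. The service may behave differently.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    order: List[int] = []
    forward = True
    while len(order) < target_frames:
        indices = range(num_frames) if forward else range(num_frames - 1, -1, -1)
        order.extend(indices)
        forward = not forward
    return order[:target_frames]
```

For example, a 5-second video with 8 seconds of audio yields a 5-second result by default and an 8-second result when video_extension is true.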

Response parameters

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| output.task_id | String | The ID of the submitted asynchronous task. Use this ID to retrieve the actual task result using the task status query API. | a8532587-fa8c-4ef8-82be-0c46b17950d1 |
| output.task_status | String | The status of the task after the asynchronous task is submitted. | "PENDING" |
| request_id | String | The request ID. | 7574ee8f-38a3-4b1e-9280-11c33ab46e51 |

Sample request

curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
--header 'X-DashScope-Async: enable' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "videoretalk",
    "input": {
        "video_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20250717/pvegot/input_video_01.mp4",
        "audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20250717/aumwir/stella2-audiobook7.wav",
        "ref_image_url": ""
     },
    "parameters": {
        "video_extension": false
    }
  }'
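The same submission can be sketched in Python with only the standard library. The helper below builds the request without sending it, so you can inspect it first. The function name build_submit_request is hypothetical, and omitting ref_image_url when no reference image is given (instead of sending an empty string, as the curl sample does) is an assumption based on the parameter being optional.

```python
import json
import urllib.request
from typing import Optional

SUBMIT_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/"


def build_submit_request(
    api_key: str,
    video_url: str,
    audio_url: str,
    ref_image_url: Optional[str] = None,
    video_extension: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) the asynchronous task submission request."""
    input_fields = {"video_url": video_url, "audio_url": audio_url}
    if ref_image_url:
        # Optional: only needed when the video contains multiple faces.
        input_fields["ref_image_url"] = ref_image_url
    body = {
        "model": "videoretalk",
        "input": input_fields,
        "parameters": {"video_extension": video_extension},
    }
    return urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "X-DashScope-Async": "enable",  # required: creates the task asynchronously
        },
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and read output.task_id from the JSON body of the response.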

Sample response

{
    "output": {
        "task_id": "a8532587-fa8c-4ef8-82be-0c46b17950d1",
        "task_status": "PENDING"
    },
    "request_id": "7574ee8f-38a3-4b1e-9280-11c33ab46e51"
}

Task status query and result retrieval

GET https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}

Request parameters

| Field | Type | Method | Required | Description | Example |
| --- | --- | --- | --- | --- | --- |
| Authorization | String | Header | Yes | API key. Example: Bearer d1**2a. | Bearer d1**2a |
| task_id | String | URL path | Yes | The ID of the task to query. This is the value returned by the task submission API. | a8532587-fa8c-4ef8-82be-0c46b17950d1 |

Response parameters

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| output.task_id | String | The ID of the queried task. | a8532587-fa8c-4ef8-82be-0c46b17950d1 |
| output.task_status | String | The status of the queried task. Task statuses: • PENDING: The task is in the queue. • PRE-PROCESSING: The task is being pre-processed. • RUNNING: The task is in progress. • POST-PROCESSING: The task is being post-processed. • SUCCEEDED: The task is successful. • FAILED: The task failed. • UNKNOWN: The task does not exist or its status is unknown. | |
| output.video_url | String | The generated video. The video_url is valid for 24 hours after the task is completed. | https://xxx/1.mp4 |
| usage.video_duration | Float | The duration of the video generated for this request, in seconds. | "video_duration": 10.23 |
| usage.video_ratio | String | The aspect ratio type of the video generated for this request. The value is standard, which means the output video has the same aspect ratio as the original video by default. | "video_ratio": "standard" |
| request_id | String | The request ID. | 7574ee8f-38a3-4b1e-9280-11c33ab46e51 |

Sample request

curl -X GET 'https://dashscope.aliyuncs.com/api/v1/tasks/<YOUR_TASK_ID>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"
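Because the task runs asynchronously, a client usually repeats this query until the task reaches a final status. The sketch below shows one way to do that in Python. The polling interval, the poll limit, and treating UNKNOWN as final are assumptions and not documented behavior. The names fetch_task and wait_for_task are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable

TASK_URL = "https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}"
# Assumption: UNKNOWN is treated as final because the task may not exist.
FINAL_STATUSES = {"SUCCEEDED", "FAILED", "UNKNOWN"}


def fetch_task(task_id: str, api_key: str) -> dict:
    """Send one task status query and return the parsed JSON response."""
    req = urllib.request.Request(
        TASK_URL.format(task_id=task_id),
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_task(
    task_id: str,
    fetch: Callable[[str], dict],
    interval_s: float = 5.0,   # assumption: not a documented interval
    max_polls: int = 120,      # assumption: not a documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reaches a final status, then return the response."""
    for _ in range(max_polls):
        result = fetch(task_id)
        status = result.get("output", {}).get("task_status")
        if status in FINAL_STATUSES:
            return result
        sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

A typical call is wait_for_task(task_id, lambda tid: fetch_task(tid, api_key)). Passing the fetch function as an argument keeps the loop testable without network access.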

Sample response

{
    "request_id": "87b9dce5-7f36-4305-a347-xxxxxx",
    "output": {
        "task_id": "3afd65eb-9604-48ea-8a91-xxxxxx",
        "task_status": "SUCCEEDED",
        "submit_time": "2025-09-11 20:15:29.887",
        "scheduled_time": "2025-09-11 20:15:36.741",
        "end_time": "2025-09-11 20:16:40.577",
        "video_url": "http://dashscope-result-sh.oss-cn-shanghai.aliyuncs.com/xxx.mp4?Expires=xxx"
    },
    "usage": {
        "video_duration": 7.16,
        "video_ratio": "standard"
    }
}
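Since video_url is valid for only 24 hours, it helps to record when the link expires and to download the file promptly. The sketch below reads the relevant fields from a successful response. It assumes that the 24-hour window starts at end_time and that the timestamp format matches the sample above. The time zone of end_time is not stated in this document, so the result is a naive datetime. The function name summarize_result is hypothetical.

```python
from datetime import datetime, timedelta


def summarize_result(response: dict) -> dict:
    """Extract the video URL, duration, and an estimated URL expiry time."""
    output = response["output"]
    usage = response.get("usage", {})
    # Format taken from the sample response, e.g. "2025-09-11 20:16:40.577".
    end_time = datetime.strptime(output["end_time"], "%Y-%m-%d %H:%M:%S.%f")
    return {
        "video_url": output["video_url"],
        "video_duration_s": usage.get("video_duration"),
        # Assumption: the 24-hour validity is counted from end_time.
        "url_expires_at": end_time + timedelta(hours=24),
    }
```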

Sample error response

{
    "request_id": "7574ee8f-38a3-4b1e-9280-11c33ab46e51",
    "output": {
        "task_id": "a8532587-fa8c-4ef8-82be-0c46b17950d1",
        "task_status": "FAILED",
        "code": "xxx",
        "message": "xxxxxx"
    }
}
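A client can turn a failed task into an exception that carries the error code and message. This is a minimal sketch based on the fields shown in the sample error response. TaskFailedError and raise_if_failed are hypothetical names.

```python
class TaskFailedError(Exception):
    """Raised when a queried task reports the FAILED status."""

    def __init__(self, task_id, code, message):
        super().__init__(f"task {task_id} failed: [{code}] {message}")
        self.task_id = task_id
        self.code = code
        self.message = message


def raise_if_failed(response: dict) -> dict:
    """Return the response unchanged, or raise if the task failed."""
    output = response.get("output", {})
    if output.get("task_status") == "FAILED":
        raise TaskFailedError(
            output.get("task_id"), output.get("code"), output.get("message")
        )
    return response
```

Keeping code as a separate attribute lets callers branch on specific values, such as the error codes listed under Status codes.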

Status codes

For information about the general status codes of the model service, see Error messages.

This model also has the following specific error codes:

| HTTP return code | Error code | Error message | Description |
| --- | --- | --- | --- |
| 400 | InvalidParameter | Field required: xxx | A request parameter is missing or the format is incorrect. |
| 400 | InvalidURL.ConnectionRefused | Connection to ${url} refused, please provide avaiable URL | The download was rejected. Provide an available URL. |
| 400 | InvalidURL.Timeout | Download ${url} timeout, please check network connection. | The download timed out. The timeout period is 60s. |
| 400 | InvalidFile.Size | Invalid file size. The video/audio/image file size must be less than **MB. | The video, audio, or image file must be smaller than ** MB. |
| 400 | InvalidFile.Format | Invalid file format,the request file format is one of the following types: MP4, AVI, MOV, MP3, WAV, AAC, JPEG, JPG, PNG, BMP, and WEBP. | The file format is invalid. Videos in MP4, AVI, or MOV format are supported. Audios in MP3, WAV, or AAC format are supported. Images in JPG, JPEG, PNG, BMP, or WebP format are supported. |
| 400 | InvalidFile.Resolution | Invalid video resolution. The height or width of video must be 640 ~ 2048. | The side length of the video must be between 640 and 2048 pixels. |
| 400 | InvalidFile.FPS | Invalid video FPS. The video FPS must be 15 ~ 60. | The video frame rate must be between 15 and 60 fps. |
| 400 | InvalidFile.Duration | Invalid file duration. The video/audio file duration must be 2s ~ 120s. | The duration of the video or audio file must be between 2s and 120s. |
| 400 | InvalidFile.ImageSize | The size of image is beyond limit. | The image size exceeds the limit. The aspect ratio of the image must be 2 or less, and the longest side must be 4096 pixels or less. |
| 400 | InvalidFile.Openerror | Invalid file, cannot open file as video/audio/image. | The video, audio, or image file cannot be opened. |
| 400 | InvalidFile.Content | The input image has no human body or multi human bodies. Please upload other image with single person. | The input image contains no person or multiple people. |
| 400 | InvalidFile.FaceNotMatch | There are no matched face in the video with the provided reference image. | The face in the reference image does not match any face in the video. |

FAQ

  1. How are inconsistencies between the input audio and video durations handled?

    By default, the longer file is truncated to match the duration of the shorter file.

    If the input audio is longer than the video and you want to generate a video based on the audio length, you can set the video_extension parameter to true. The algorithm extends the video duration using a "reverse-playback" alternating pattern with the original video frames until the video duration matches the audio duration.

  2. How is silence in the input audio handled?

    During periods of silence in the audio, the character in the video keeps their mouth closed.

  3. What happens if the input video contains no person or only an incomplete face?

    If the audio contains a human voice but a person or their mouth is not visible in a video frame, the original video frame is retained, and the audio plays normally.

  4. What happens if the input video contains multiple people?

    Only one person's face can be replaced. The algorithm detects the specified face using the input reference image (input.ref_image_url). If a reference image is not provided, the algorithm defaults to selecting the largest face in the first frame that contains a face.
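The default face selection described in the last answer can be illustrated as follows. This is a sketch of the stated rule and not the service's algorithm. It assumes face detection results are already available as bounding boxes per frame and that "largest" means the largest bounding-box area. The names Box and default_face are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical bounding box layout: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


def default_face(frames: List[List[Box]]) -> Optional[Tuple[int, Box]]:
    """Pick the largest face in the first frame that contains any face.

    frames holds one list of detected face boxes per video frame.
    Returns (frame_index, box), or None if no frame contains a face.
    """
    for index, boxes in enumerate(frames):
        if boxes:
            # Assumption: "largest" is measured by bounding-box area.
            return index, max(boxes, key=lambda b: b[2] * b[3])
    return None
```

If the face selected this way is not the intended speaker, provide input.ref_image_url to choose the face explicitly.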