The wan2.2-s2v model uses a single image and an audio clip to generate videos of a person speaking, singing, or performing with natural movements. The model supports portrait, full-body, or half-body images and has no restrictions on image composition.
This document applies only to the China (Beijing) region. An API key from the China (Beijing) region is required to use the model.
Model overview
Sample results
Sample input (input image and input audio) | Output video
Models and pricing
Model | Description | Unit price | RPS limit for task submission API | Concurrent tasks
wan2.2-s2v-detect | Checks whether the input image meets requirements, such as image clarity, a single person, and a frontal view. | $0.000574/image | 5 | No limit (synchronous API)
wan2.2-s2v | Generates a dynamic video of a person from a validated image and an audio clip. | 480p: $0.071677/second; 720p: $0.129018/second | 5 | 1
Rate limits are shared by an Alibaba Cloud account and its RAM users.
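For example, based on the unit prices above, a 30-second generated video costs about 30 × $0.071677 ≈ $2.15 at 480p, or 30 × $0.129018 ≈ $3.87 at 720p.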
The process to generate a digital human video is as follows:
Step 1: Call the wan2.2-s2v-detect API. Pass an image URL to check whether the image is compliant (a request sketch follows these steps).
Step 2: If the image is compliant, call the asynchronous wan2.2-s2v API. Pass the image URL and an audio URL to submit the video generation task. Poll the API to retrieve the result.
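The following is a minimal sketch of the detection request. This topic does not show the endpoint path or request schema for wan2.2-s2v-detect, so the URL below is a placeholder; take the exact values from the wan2.2-s2v-detect API reference.
curl 'https://dashscope.aliyuncs.com/api/v1/<path-from-the-wan2.2-s2v-detect-API-reference>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "wan2.2-s2v-detect",
    "input": {
        "image_url": "https://img.alicdn.com/imgextra/i3/O1CN011FObkp1T7Ttowoq4F_!!6000000002335-0-tps-1440-1797.jpg"
    }
}'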
Getting started
Prerequisites
Before you call the API, activate Model Studio and obtain an API key. Then, set the API key as an environment variable.
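For example, on Linux or macOS, you can set the variable for the current shell session (the key value below is a placeholder):
# Replace the placeholder with your actual Model Studio API key.
export DASHSCOPE_API_KEY="sk-xxxxxxxxxxxxxxxx"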
Sample code
The sample image in this topic has already passed detection. The following sample code shows how to generate a video.
The HTTP request involves two steps: creating a task and then retrieving the result.
Step 1: Create a task to get a task ID
This request returns a task_id that you can use to query the result.
curl 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
--header 'X-DashScope-Async: enable' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "wan2.2-s2v",
"input": {
"image_url": "https://img.alicdn.com/imgextra/i3/O1CN011FObkp1T7Ttowoq4F_!!6000000002335-0-tps-1440-1797.jpg",
"audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250825/iaqpio/input_audio.MP3"
},
"parameters": {
"style": "speech"
}
}'
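If the submission succeeds, the response contains the task ID. The following is a minimal sketch of capturing it, assuming jq is installed, the request body from above is saved in a local file named request.json (a name used here for illustration), and the response carries the ID at output.task_id:
# Submit the task and extract the task ID from the JSON response.
TASK_ID=$(curl -s 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
    --header 'X-DashScope-Async: enable' \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json' \
    --data @request.json | jq -r '.output.task_id')
echo "task_id: $TASK_ID"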
Step 2: Query the result by task ID
Replace 86ecf553-d340-4e21-xxxxxxxxx with the actual task ID.
This topic applies to the China (Beijing) region, so the request below uses the Beijing endpoint, https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}. API keys for the Singapore and Beijing regions are different and are not interchangeable.
curl -X GET https://dashscope.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"
The task_id is valid for 24 hours. If you query a task after its task ID has expired, the API returns the task status UNKNOWN.
Model comparison
Model selection recommendations: Use the wan2.2-s2v model to generate videos that include a full-body or large half-body view of a person. If cost-effectiveness is a priority, choose EMO instead.
Feature comparison | Digital Human wan2.2-s2v | EMO
Model description | Larger and more natural movements. Wide range of supported frames (especially full-body). Supports cartoon characters. | Better suited for close-ups or portraits. Natural lip-syncing and expressions. |
Applicable frames | Full-body, half-body, portrait | Portrait, half-body (recommended) |
Invocation method | Two-step call. The detection API is used only for compliance checks, which simplifies integration. | Two-step call. The coordinates returned by the detection API are a required input parameter for the generation API. |
Style control | Scenario-driven (speaking, singing, performing) | Style-driven (moderate, calm, lively) |
Output specifications | By resolution (480p, 720p) | By aspect ratio (1:1, 3:4) |
Model call price | See the Models and pricing section of this topic. | See the EMO documentation.
Next steps
Refer to the API references for wan2.2-s2v and wan2.2-s2v-detect to start your development.