Build Lip-Sync Digital Human Videos with Wan2.2-S2V - Model Studio

Generate lip-sync videos from one image and one audio clip. Supports portrait, half-body, or full-body frames with no composition restrictions.

Important

This document applies only to the China (Beijing) region. An API key from the China (Beijing) region is required to use the model.

Model overview

Sample results

Sample input: one image (input_image) and one audio clip. Output: the generated lip-sync video.

Models and pricing

Rate limits are shared by the Alibaba Cloud account and its RAM users.

Model	Description	Unit price	RPS limit for task submission API	Concurrent tasks
wan2.2-s2v-detect	Checks that the image meets quality requirements and shows a single person in frontal view.	$0.000574/image	5	No limit (synchronous API)
wan2.2-s2v	Generates a video from a validated image and audio clip.	480p: $0.071677/second; 720p: $0.129018/second	5	1

Video generation workflow:

1. Validate the image with the wan2.2-s2v-detect API.
2. If the image is compliant, submit a video generation task with the wan2.2-s2v API (image URL + audio URL), then poll for the result.
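The two-step flow above can be sketched as a small gating function. This only illustrates the control flow: `detect`, `submit`, and `poll` are hypothetical callables that you would implement against the detection and video generation APIs, and they are not part of any SDK.

```python
from typing import Callable


def generate_lip_sync_video(
    image_url: str,
    audio_url: str,
    detect: Callable[[str], bool],
    submit: Callable[[str, str], str],
    poll: Callable[[str], dict],
) -> dict:
    """Run detection first and submit a generation task only if it passes."""
    # Step 1: validate the image (hypothetical wrapper around wan2.2-s2v-detect).
    if not detect(image_url):
        raise ValueError(f"Image did not pass detection: {image_url}")
    # Step 2: submit the task (hypothetical wrapper around wan2.2-s2v) ...
    task_id = submit(image_url, audio_url)
    # ... then poll with the returned task ID until a result is available.
    return poll(task_id)
```

Keeping detection as a hard gate avoids paying for a generation task that would be rejected because of the image.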

Getting started

Prerequisites

Before you call the API, activate Model Studio and get an API key. Then, set the API key as an environment variable.
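In application code, it can help to fail early when the key is missing. The following is a minimal Python sketch, assuming the variable is named DASHSCOPE_API_KEY as in the curl samples in this document; the helper name `get_api_key` is illustrative.

```python
import os
from typing import Mapping


def get_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or raise if it is not set."""
    # DASHSCOPE_API_KEY is the variable name used by the curl samples.
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set. Export it before calling the API."
        )
    return key
```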

Sample code

The sample image has already passed detection, so the code below goes straight to video generation.

Note

HTTP workflow: create task → retrieve result.

Step 1: Create a task to get a task ID

Returns a task_id for querying results.

curl 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
  --header 'X-DashScope-Async: enable' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "wan2.2-s2v",
    "input": {
      "image_url": "https://img.alicdn.com/imgextra/i3/O1CN011FObkp1T7Ttowoq4F_!!6000000002335-0-tps-1440-1797.jpg",
      "audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250825/iaqpio/input_audio.MP3"
    },
    "parameters": {
      "style": "speech"
    }
  }'
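For reference, the same request can be built in Python with only the standard library. This is a sketch rather than official SDK code: the function names are illustrative, and `extract_task_id` assumes the response JSON carries the ID at `output.task_id`, which this page does not show.

```python
import json
import urllib.request

CREATE_TASK_URL = (
    "https://dashscope.aliyuncs.com/api/v1/services/aigc/"
    "image2video/video-synthesis/"
)


def build_create_task_request(api_key, image_url, audio_url, style="speech"):
    """Build the POST request that mirrors the curl command above."""
    payload = {
        "model": "wan2.2-s2v",
        "input": {"image_url": image_url, "audio_url": audio_url},
        "parameters": {"style": style},
    }
    return urllib.request.Request(
        CREATE_TASK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-DashScope-Async": "enable",  # run the task asynchronously
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_task_id(response_body: dict) -> str:
    """Assumption: the task ID is returned under output.task_id."""
    return response_body["output"]["task_id"]


def create_task(api_key, image_url, audio_url) -> str:
    """Send the request and return the task ID (performs a network call)."""
    request = build_create_task_request(api_key, image_url, audio_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_task_id(json.load(response))
```

Separating request construction from sending keeps the payload easy to inspect and test without network access.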

Step 2: Query the result by task ID

Replace 86ecf553-d340-4e21-xxxxxxxxx with the actual task ID.

API keys are region-specific. See API key documentation for details.

This model is available only in the China (Beijing) region, so query the task through the Beijing endpoint (dashscope.aliyuncs.com), the same host used to create the task.

curl -X GET https://dashscope.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY"

A task ID expires after 24 hours. Queries for an expired task ID return the status UNKNOWN.
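Because the task runs asynchronously, the Step 2 query is usually repeated until the task finishes. A minimal polling sketch follows. The status names other than UNKNOWN (PENDING, RUNNING, SUCCEEDED, FAILED, CANCELED) are assumptions not listed on this page, as are the default interval and attempt limit; `fetch_status` is a hypothetical callable that performs the Step 2 query and returns the status string.

```python
import time
from typing import Callable

# Assumed terminal statuses; only UNKNOWN is confirmed by this page.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED", "UNKNOWN"}


def wait_for_task(
    fetch_status: Callable[[], str],
    interval_s: float = 15.0,
    max_attempts: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the task reaches a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # Wait between queries, but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"Task not finished after {max_attempts} queries")
```

Treating UNKNOWN as terminal stops the loop when the task ID has expired instead of polling indefinitely.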

Model comparison

Model selection: Use wan2.2-s2v for full-body or large half-body frames. For cost-effective portraits, use EMO.

Feature comparison	Digital Human wan2.2-s2v	EMO
Model description	More natural movements with wider frame support (especially full-body and cartoon characters).	Better for close-ups and portraits with natural lip-sync and expressions.
Applicable frames	Full-body, half-body, portrait	Portrait, half-body (recommended)
Invocation method	Two-step: detection API for compliance only (simpler integration).	Two-step: detection API returns coordinates required by generation API.
Style control	Scenario-driven (speaking, singing, performing)	Style-driven (moderate, calm, lively)
Output specifications	By resolution (480p, 720p)	By aspect ratio (1:1, 3:4)
Model call price	Image detection: $0.000574/image; video generation: $0.071677/second (480p), $0.129018/second (720p)	Image detection: $0.000574/image; video generation: $0.011469/second (1:1 aspect ratio), $0.022937/second (3:4 aspect ratio)
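To make the price difference concrete, the following sketch estimates the cost of one clip under each option, using the unit prices from the table. It assumes one detection call per video and billing proportional to the output duration with no rounding; actual billing rules may differ. The key "EMO" is only a label here, not a model ID.

```python
DETECTION_PRICE = 0.000574  # USD per image, same for both models

# USD per second of generated video, taken from the comparison table.
PRICE_PER_SECOND = {
    ("wan2.2-s2v", "480p"): 0.071677,
    ("wan2.2-s2v", "720p"): 0.129018,
    ("EMO", "1:1"): 0.011469,
    ("EMO", "3:4"): 0.022937,
}


def estimate_cost(model: str, spec: str, duration_s: float) -> float:
    """Estimate the USD cost of one detection call plus one generated video."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return DETECTION_PRICE + PRICE_PER_SECOND[(model, spec)] * duration_s


if __name__ == "__main__":
    # Print the estimated cost of a 10-second clip for every option.
    for model, spec in PRICE_PER_SECOND:
        cost = estimate_cost(model, spec, 10)
        print(f"{model} {spec}: ${cost:.4f} for 10 s")
```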

Next steps

API documentation:

Image detection API

Video generation API