Alibaba Cloud Model Studio: Wan - digital human

Last Updated: Mar 15, 2026

Generate lip-sync videos from one image and one audio clip. Supports portrait, half-body, or full-body frames with no composition restrictions.

Important

This document applies only to the China (Beijing) region. An API key from the China (Beijing) region is required to use the model.

Model overview

Sample results

(Sample media omitted: an input image and an input audio clip, and the output video generated from them.)

Models and pricing

wan2.2-s2v-detect

  • Description: Validates image quality, single person, and frontal view.

  • Unit price: $0.000574/image

  • RPS limit for task submission API: 5

  • Concurrent tasks: No limit for sync APIs

wan2.2-s2v

  • Description: Generates a video from a validated image and audio clip.

  • Unit price:

    • 480p: $0.071677/second

    • 720p: $0.129018/second

  • RPS limit for task submission API: 5

  • Concurrent tasks: 1

Rate limits are shared by Alibaba Cloud accounts and RAM users.
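To make the pricing concrete, the sketch below estimates the cost of one detection call plus one generated video. The function name `estimate_cost` is illustrative, and the calculation assumes billing is exactly linear in output seconds with no rounding or minimum charge, which the pricing above does not state.

```python
from decimal import Decimal

# Unit prices in USD, copied from the pricing above.
DETECT_PRICE_PER_IMAGE = Decimal("0.000574")
VIDEO_PRICE_PER_SECOND = {
    "480p": Decimal("0.071677"),
    "720p": Decimal("0.129018"),
}


def estimate_cost(resolution: str, seconds: float, detections: int = 1) -> Decimal:
    """Estimate the USD cost of `detections` image checks plus one video.

    Assumption: video billing is linear in output duration.
    """
    if resolution not in VIDEO_PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    video_cost = VIDEO_PRICE_PER_SECOND[resolution] * Decimal(str(seconds))
    detect_cost = DETECT_PRICE_PER_IMAGE * detections
    return video_cost + detect_cost
```

Under these assumptions, `estimate_cost("720p", 10)` comes to about $1.29, almost all of it from video generation; the detection call is negligible by comparison.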

Video generation workflow:

  1. Validate the image with the wan2.2-s2v-detect API.

  2. If the image passes, submit a video generation task to the wan2.2-s2v API with the image URL and audio URL, then poll for the result.
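The control flow of this workflow can be sketched as a small orchestration function. The callables `detect`, `submit`, and `poll` are hypothetical placeholders for the detection, task submission, and result query requests. They are passed in as arguments so the sketch makes no assumption about request or response formats.

```python
from typing import Callable


def generate_video(
    image_url: str,
    audio_url: str,
    detect: Callable[[str], bool],
    submit: Callable[[str, str], str],
    poll: Callable[[str], dict],
) -> dict:
    """Run detection first and submit a generation task only if it passes."""
    # Step 1: validate the image (wan2.2-s2v-detect).
    if not detect(image_url):
        raise ValueError("image did not pass wan2.2-s2v-detect validation")
    # Step 2: submit the generation task (wan2.2-s2v), then poll for the result.
    task_id = submit(image_url, audio_url)
    return poll(task_id)
```

Stopping before submission when detection fails avoids paying per-second generation charges for an image the model would reject.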

Getting started

Prerequisites

Before you call the API, activate Model Studio and get an API key. Then, set the API key as an environment variable.
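For callers scripting the API in Python rather than curl, a minimal sketch of reading the key from the environment might look like the following. The variable name `DASHSCOPE_API_KEY` matches the curl examples in this document; the helper names `load_api_key` and `auth_headers` are illustrative only.

```python
import os


def load_api_key(env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before calling the API")
    return key


def auth_headers(api_key: str) -> dict:
    """Build the authorization and content-type headers used by the curl examples."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing at startup when the variable is missing is easier to diagnose than an authentication error returned by the first request.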

Sample code

The sample image has already passed detection, so the code below goes straight to video generation.

Note

HTTP workflow: create task → retrieve result.

Step 1: Create a task to get a task ID

Returns a task_id for querying results.

curl 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
  --header 'X-DashScope-Async: enable' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "wan2.2-s2v",
    "input": {
      "image_url": "https://img.alicdn.com/imgextra/i3/O1CN011FObkp1T7Ttowoq4F_!!6000000002335-0-tps-1440-1797.jpg",
      "audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250825/iaqpio/input_audio.MP3"
    },
    "parameters": {
      "style": "speech"
    }
  }'
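The same request can be sketched in Python with only the standard library. The URL, headers, and body mirror the curl command above. The helper names are illustrative, and the location of the task ID in the response (`output.task_id`) is an assumption, since this document does not show the response body.

```python
import json
import urllib.request

SUBMIT_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/"


def build_submit_request(api_key, image_url, audio_url, style="speech"):
    """Build the POST request equivalent to the curl command above."""
    body = {
        "model": "wan2.2-s2v",
        "input": {"image_url": image_url, "audio_url": audio_url},
        "parameters": {"style": style},
    }
    return urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-DashScope-Async": "enable",  # request asynchronous processing
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_task(api_key, image_url, audio_url):
    """Send the request and return the task ID for later queries."""
    request = build_submit_request(api_key, image_url, audio_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Assumption: the task ID is returned at output.task_id.
    return payload["output"]["task_id"]
```

Separating request construction from sending makes it possible to inspect the headers and body without making a network call.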
Step 2: Query the result by task ID

Replace 86ecf553-d340-4e21-xxxxxxxxx with the actual task ID.

API keys are region-specific. See API key documentation for details. Because this model is available only in the China (Beijing) region, the query uses the Beijing endpoint, https://dashscope.aliyuncs.com.

curl -X GET 'https://dashscope.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY"

A task ID is valid for 24 hours. Querying an expired task ID returns the status UNKNOWN.
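Because the task runs asynchronously, the query usually has to be repeated until the task finishes. The sketch below polls the China (Beijing) task endpoint on a fixed interval. Only the UNKNOWN status is confirmed by this document; the other status strings, the `output.task_status` response path, and the default interval and attempt count are assumptions to check against the API reference.

```python
import json
import time
import urllib.request

TASK_URL = "https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}"
# Assumption: these strings mark a finished task. Only UNKNOWN (expired
# task ID) is stated in this document.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED", "UNKNOWN"}


def fetch_task(api_key, task_id):
    """GET the task record, mirroring the curl query above."""
    request = urllib.request.Request(
        TASK_URL.format(task_id=task_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_task(task_id, fetch, interval=15.0, max_attempts=40, sleep=time.sleep):
    """Call fetch(task_id) until a terminal status appears or attempts run out."""
    for attempt in range(max_attempts):
        task = fetch(task_id)
        # Assumption: the status is reported at output.task_status.
        status = task.get("output", {}).get("task_status")
        if status in TERMINAL_STATUSES:
            return task
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} polls")
```

A typical call would be `wait_for_task(task_id, lambda t: fetch_task(api_key, t))`. Treating UNKNOWN as terminal stops the loop as soon as an expired task ID is queried.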

Model comparison

Model selection: Use wan2.2-s2v for full-body or large half-body frames. For cost-effective portraits, use EMO.

Feature comparison

Model description

  • Digital Human wan2.2-s2v: More natural movements with wider frame support (especially full-body and cartoon characters).

  • EMO: Better for close-ups and portraits with natural lip-sync and expressions.

Applicable frames

  • Digital Human wan2.2-s2v: Full-body, half-body, portrait

  • EMO: Portrait, half-body (recommended)

Invocation method

  • Digital Human wan2.2-s2v: Two-step: detection API for compliance only (simpler integration).

  • EMO: Two-step: detection API returns coordinates required by generation API.

Style control

  • Digital Human wan2.2-s2v: Scenario-driven (speaking, singing, performing)

  • EMO: Style-driven (moderate, calm, lively)

Output specifications

  • Digital Human wan2.2-s2v: By resolution (480p, 720p)

  • EMO: By aspect ratio (1:1, 3:4)

Model call price

  • Digital Human wan2.2-s2v:

    • Image detection: $0.000574/image

    • Video generation, 480p: $0.071677/second

    • Video generation, 720p: $0.129018/second

  • EMO:

    • Image detection: $0.000574/image

    • Video generation, 1:1 aspect ratio: $0.011469/second

    • Video generation, 3:4 aspect ratio: $0.022937/second
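To see how the per-second prices compare for a clip of a given length, the sketch below ranks the four generation options by cost. The function name `generation_costs` is illustrative, and it assumes linear per-second billing. The detection fee is left out because it is the same for both models.

```python
from decimal import Decimal

# USD per second of generated video, copied from the prices above.
PRICE_PER_SECOND = {
    ("wan2.2-s2v", "480p"): Decimal("0.071677"),
    ("wan2.2-s2v", "720p"): Decimal("0.129018"),
    ("EMO", "1:1"): Decimal("0.011469"),
    ("EMO", "3:4"): Decimal("0.022937"),
}


def generation_costs(seconds):
    """Return (option, cost) pairs for a clip of this length, cheapest first.

    Assumption: billing is linear in output duration.
    """
    duration = Decimal(str(seconds))
    costs = {option: price * duration for option, price in PRICE_PER_SECOND.items()}
    return sorted(costs.items(), key=lambda item: item[1])
```

For any duration the ordering is EMO 1:1, EMO 3:4, wan2.2-s2v 480p, then wan2.2-s2v 720p, which is why EMO is the cost-effective choice when a portrait frame is sufficient.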

Next steps

API documentation:

Image detection API

Video generation API