Alibaba Cloud Model Studio: Wan - digital human

Last Updated: Nov 10, 2025

The wan2.2-s2v model uses a single image and an audio clip to generate videos of a person speaking, singing, or performing with natural movements. The model supports portrait, full-body, or half-body images and has no restrictions on image composition.

Important

This document applies only to the China (Beijing) region. An API key from the China (Beijing) region is required to use the model.

Model overview

Sample results

[Sample table: an input image and an input audio clip (sample input), shown alongside the resulting output video.]

Models and pricing

Rate limits are shared by an Alibaba Cloud account and its RAM users.

wan2.2-s2v-detect

  • Description: Checks whether the input image meets requirements such as sufficient image clarity, a single person, and a frontal view.

  • Unit price: $0.000574/image

  • RPS limit for task submission API: 5

  • Concurrent tasks: No limit for sync APIs

wan2.2-s2v

  • Description: Generates a dynamic video of a person from a validated image and an audio clip.

  • Unit price: 480p: $0.071677/second; 720p: $0.129018/second

  • RPS limit for task submission API: 5

  • Concurrent tasks: 1
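The unit prices above can be turned into a rough cost estimate for one detection call plus one generated video. The following is a minimal sketch, assuming that video generation is billed on the exact duration of the output video in seconds (the rounding rule is not stated here) and that each run makes one detection call. The function name estimate_cost is hypothetical.

```python
from decimal import Decimal

# Unit prices in USD, copied from the pricing list above.
DETECT_PRICE_PER_IMAGE = Decimal("0.000574")
VIDEO_PRICE_PER_SECOND = {
    "480p": Decimal("0.071677"),
    "720p": Decimal("0.129018"),
}


def estimate_cost(duration_seconds, resolution, detect_calls=1):
    """Estimate the USD cost of detect_calls detections plus one video.

    Assumption: billing uses the exact video duration with no rounding.
    """
    if resolution not in VIDEO_PRICE_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution}")
    video = VIDEO_PRICE_PER_SECOND[resolution] * Decimal(str(duration_seconds))
    detection = DETECT_PRICE_PER_IMAGE * detect_calls
    return video + detection


print(estimate_cost(10, "480p"))  # 0.717344
print(estimate_cost(10, "720p"))  # 1.290754
```

Decimal is used instead of float so that the six-digit unit prices add up without binary rounding noise.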

The process to generate a digital human video is as follows:

  • Step 1: Call the wan2.2-s2v-detect API. Pass an image URL to check if the image is compliant.

  • Step 2: If the image is compliant, call the asynchronous wan2.2-s2v API. Pass the image URL and an audio URL to submit the video generation task. Poll the API to retrieve the result.

Getting started

Prerequisites

Before you call the API, activate Model Studio and obtain an API key. Then, set the API key as the DASHSCOPE_API_KEY environment variable, which the sample commands below read.

Sample code

The sample image in this topic has already passed detection, so the following sample code covers only the video generation step.

Note

The HTTP request involves two steps: creating a task and then retrieving the result.

Step 1: Create a task to get a task ID

This request returns a task_id that you can use to query the result.

curl 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
 --header 'X-DashScope-Async: enable' \
 --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
 --header 'Content-Type: application/json' \
 --data '{
     "model": "wan2.2-s2v",
     "input": {
         "image_url": "https://img.alicdn.com/imgextra/i3/O1CN011FObkp1T7Ttowoq4F_!!6000000002335-0-tps-1440-1797.jpg",
         "audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250825/iaqpio/input_audio.MP3"
     },
     "parameters": {
         "style": "speech"
     }
 }'
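
For callers who prefer Python to curl, the same request can be assembled with the standard library. This is a minimal sketch that mirrors the URL, headers, and body of the curl command above. The function names build_submit_request and submit_task are hypothetical, and the response path output.task_id is an assumption about the response layout, which this topic does not show.

```python
import json
import urllib.request

SUBMIT_URL = (
    "https://dashscope.aliyuncs.com/api/v1/services/aigc/"
    "image2video/video-synthesis/"
)


def build_submit_request(image_url, audio_url, api_key, style="speech"):
    """Build (but do not send) the asynchronous task-creation request."""
    body = {
        "model": "wan2.2-s2v",
        "input": {"image_url": image_url, "audio_url": audio_url},
        "parameters": {"style": style},
    }
    return urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-DashScope-Async": "enable",
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_task(request):
    """Send the request and return the task ID (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumed response layout: {"output": {"task_id": "..."}, ...}
    return payload["output"]["task_id"]
```

A typical call would pass the value of the DASHSCOPE_API_KEY environment variable as api_key and hand the built request to submit_task. Keeping request construction separate from sending makes the payload easy to inspect without touching the network.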
Step 2: Query the result by task ID

Replace 86ecf553-d340-4e21-xxxxxxxxx with the actual task ID.

API keys are region-specific. Because this model is available only in the China (Beijing) region, use a Beijing-region API key and the Beijing endpoint https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}.

curl -X GET https://dashscope.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

The task_id is valid for 24 hours. If you try to query a task after the task ID has expired, the API returns a task status of UNKNOWN.
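
Because the task runs asynchronously, the query above usually has to be repeated until the task finishes. The following is a minimal polling sketch. It assumes the response carries the status at output.task_status and that SUCCEEDED, FAILED, and CANCELED are terminal status names; only UNKNOWN is confirmed by this topic. The function names, the 15-second interval, and the attempt limit are illustrative choices, not documented values.

```python
import json
import time
import urllib.request

TASK_URL = "https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}"

# UNKNOWN is described above (expired task ID); the other names are assumed.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED", "UNKNOWN"}


def fetch_task(task_id, api_key):
    """Query one task by ID and return the parsed JSON (network call)."""
    request = urllib.request.Request(
        TASK_URL.format(task_id=task_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_task(task_id, fetch, interval_seconds=15.0, max_attempts=40,
                  sleep=time.sleep):
    """Call fetch(task_id) until the task reaches a terminal status.

    Returns the final payload, or raises TimeoutError after max_attempts.
    """
    for attempt in range(max_attempts):
        payload = fetch(task_id)
        # Assumed response layout: {"output": {"task_status": "..."}}
        status = payload["output"]["task_status"]
        if status in TERMINAL_STATUSES:
            return payload
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"task {task_id} did not finish in {max_attempts} polls")
```

In practice, fetch would be something like lambda tid: fetch_task(tid, api_key). Passing fetch and sleep as parameters keeps the loop testable without network access or real delays.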

Model comparison

Model selection recommendations: Use the wan2.2-s2v model to generate videos that include a full-body or large half-body view of a person. If cost-effectiveness is a priority, choose EMO instead.

Feature comparison

Model description

  • Digital Human wan2.2-s2v: Larger and more natural movements. Wide range of supported frames (especially full-body). Supports cartoon characters.

  • EMO: Better suited for close-ups or portraits. Natural lip-syncing and expressions.

Applicable frames

  • Digital Human wan2.2-s2v: Full-body, half-body, portrait

  • EMO: Portrait, half-body (recommended)

Invocation method

  • Digital Human wan2.2-s2v: Two-step call. The detection API is used only for compliance checks, which simplifies integration.

  • EMO: Two-step call. The coordinates returned by the detection API are a required input parameter for the generation API.

Style control

  • Digital Human wan2.2-s2v: Scenario-driven (speaking, singing, performing)

  • EMO: Style-driven (moderate, calm, lively)

Output specifications

  • Digital Human wan2.2-s2v: By resolution (480p, 720p)

  • EMO: By aspect ratio (1:1, 3:4)

Model call price

  • Digital Human wan2.2-s2v:

    • Image detection: $0.000574/image

    • Video generation, 480p: $0.071677/second

    • Video generation, 720p: $0.129018/second

  • EMO:

    • Image detection: $0.000574/image

    • Video generation, 1:1 aspect ratio: $0.011469/second

    • Video generation, 3:4 aspect ratio: $0.022937/second
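
To make the price difference between the two models concrete, the per-second generation prices listed above can be applied to a clip of a given length. This is a minimal sketch that assumes billing on the exact video duration and leaves out the detection fee, which is the same for both models. The names video_cost and cost_table are hypothetical.

```python
from decimal import Decimal

# Video generation prices in USD per second, from the comparison above.
PRICE_PER_SECOND = {
    ("wan2.2-s2v", "480p"): Decimal("0.071677"),
    ("wan2.2-s2v", "720p"): Decimal("0.129018"),
    ("EMO", "1:1"): Decimal("0.011469"),
    ("EMO", "3:4"): Decimal("0.022937"),
}


def video_cost(model, spec, duration_seconds):
    """Return the generation cost in USD for one video."""
    return PRICE_PER_SECOND[(model, spec)] * Decimal(str(duration_seconds))


def cost_table(duration_seconds):
    """Return the cost of every model/spec option for one duration."""
    return {
        key: video_cost(key[0], key[1], duration_seconds)
        for key in PRICE_PER_SECOND
    }


for (model, spec), cost in cost_table(30).items():
    print(f"{model} {spec}: ${cost}")
```

For a 30-second clip this puts the cheapest EMO option at about $0.34 and the cheapest wan2.2-s2v option at about $2.15, which is the gap behind the recommendation to choose EMO when cost is the priority.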

Next steps

Refer to the following API documents to start your development:

Image detection API

Video generation API