Video generation - Alibaba Cloud Documentation Center

Supported models

Wan - text-to-video

Generates videos from text prompts. It supports text and audio input to create cinematic, multi-shot videos.

API reference | Model pricing | Try online: Singapore, Virginia, Beijing

Global

If you select the Global deployment scope, model inference compute resources are dynamically scheduled worldwide. Static data is stored in your selected region. Supported regions: US (Virginia) and Germany (Frankfurt).

Model	Features	Input modality	Output video specifications
wan2.6-t2v `Recommended`	Video with audio Multi-shot narrative, audio-video synchronization	Text, audio	Resolution options: 720P, 1080P Video duration: 5s, 10s, 15s Defined specifications: 30 fps, MP4 (H.264 encoding)

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

Model	Features	Input modality	Output video specifications
wan2.7-t2v `Recommended`	Video with audio Multi-shot narrative, audio-video synchronization	Text, audio	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.6-t2v	Video with audio Multi-shot narrative, audio-video synchronization	Text, audio	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.5-t2v-preview	Video with audio Audio-video synchronization	Text, audio	Resolution options: 480P, 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.2-t2v-plus	Silent video Improved stability and success rate compared to the 2.1 model.	Text	Resolution options: 480P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.1-t2v-turbo	Silent video	Text	Resolution options: 480P, 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.1-t2v-plus	Silent video	Text	Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
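
The International table can be encoded as a small lookup so that request parameters are checked before a job is submitted. This is a minimal sketch, not an official client: `T2V_INTL` and `check_request` are hypothetical names, and only the resolution and duration values come from the table above.

```python
# Hypothetical lookup built from the International text-to-video table.
# Durations are stored as the set of allowed whole seconds.
T2V_INTL = {
    "wan2.7-t2v": {"resolutions": {"720P", "1080P"}, "durations": set(range(2, 16))},
    "wan2.6-t2v": {"resolutions": {"720P", "1080P"}, "durations": set(range(2, 16))},
    "wan2.5-t2v-preview": {"resolutions": {"480P", "720P", "1080P"}, "durations": {5, 10}},
    "wan2.2-t2v-plus": {"resolutions": {"480P", "1080P"}, "durations": {5}},
    "wan2.1-t2v-turbo": {"resolutions": {"480P", "720P"}, "durations": {5}},
    "wan2.1-t2v-plus": {"resolutions": {"720P"}, "durations": {5}},
}


def check_request(model: str, resolution: str, duration: int) -> list[str]:
    """Return a list of problems; an empty list means the request matches the table."""
    spec = T2V_INTL.get(model)
    if spec is None:
        return [f"unknown model: {model}"]
    problems = []
    if resolution not in spec["resolutions"]:
        problems.append(f"{model} does not list {resolution}")
    if duration not in spec["durations"]:
        problems.append(f"{model} does not list a {duration}s duration")
    return problems
```

For example, `check_request("wan2.2-t2v-plus", "720P", 5)` reports a problem because that model lists only 480P and 1080P, while `check_request("wan2.7-t2v", "1080P", 12)` returns an empty list.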

US

If you select the US deployment scope, model inference compute resources are restricted to the United States. Static data is stored in your selected region. Supported region: US (Virginia).

Model	Features	Input modality	Output video specifications
wan2.6-t2v-us `Recommended`	Video with audio Multi-shot narrative, audio-video synchronization	Text, audio	Resolution options: 720P, 1080P Video duration: 5s, 10s, 15s Defined specifications: 30 fps, MP4 (H.264 encoding)

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

Model	Features	Input modality	Output video specifications
wan2.7-t2v `Recommended`	Video with audio Multi-shot narrative, audio-video synchronization	Text, audio	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.6-t2v	Video with audio Multi-shot narrative, audio-video synchronization	Text, audio	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.5-t2v-preview	Video with audio Audio-video synchronization	Text, audio	Resolution options: 480P, 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.2-t2v-plus	Silent video Improved stability and success rate compared to the 2.1 model.	Text	Resolution options: 480P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wanx2.1-t2v-turbo	Silent video	Text	Resolution options: 480P, 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wanx2.1-t2v-plus	Silent video	Text	Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
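
The four text-to-video deployment scopes can be summarized as data. The sketch below is illustrative only: `T2V_SCOPES` and `scopes_for_region` are hypothetical names, and the regions and recommended model names are copied from the scope descriptions and tables above.

```python
# Hypothetical summary of the text-to-video deployment scopes listed above.
T2V_SCOPES = {
    "Global": {
        "regions": ["US (Virginia)", "Germany (Frankfurt)"],
        "recommended": "wan2.6-t2v",
    },
    "International": {"regions": ["Singapore"], "recommended": "wan2.7-t2v"},
    "US": {"regions": ["US (Virginia)"], "recommended": "wan2.6-t2v-us"},
    "Chinese mainland": {"regions": ["China (Beijing)"], "recommended": "wan2.7-t2v"},
}


def scopes_for_region(region: str) -> list[str]:
    """List every deployment scope that names the given region."""
    return [scope for scope, info in T2V_SCOPES.items() if region in info["regions"]]
```

US (Virginia) appears under both the Global and US scopes, so the scope must be chosen explicitly in that region. Model names also differ by scope: the 2.1 models are listed as `wan2.1-*` in the International table and `wanx2.1-*` in the Chinese mainland table, and the US scope uses a `-us` suffix.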

Input prompt	Output video (wan2.6, multi-shot video)
Shot from a low angle, in a medium close-up, with warm tones, mixed lighting (the practical light from the desk lamp blends with the overcast light from the window), side lighting, and a central composition. In a classic detective office, wooden bookshelves are filled with old case files and ashtrays. A green desk lamp illuminates a case file spread out in the center of the desk. A fox, wearing a dark brown trench coat and a light gray fedora, sits in a leather chair, its fur crimson, its tail resting lightly on the edge, its fingers slowly turning yellowed pages. Outside, a steady drizzle falls beneath a blue sky, streaking the glass with meandering streaks. It slowly raises its head, its ears twitching slightly, its amber eyes gazing directly at the camera, its mouth clearly moving as it speaks in a smooth, cynical voice: 'The case was cold, colder than a fish in winter. But every chicken has its secrets, and I, for one, intended to find them.'	

Wan - image-to-video

The Wan image-to-video model is upgraded with multimodal input (text/image/audio/video) and supports three tasks: first-frame-to-video, first-and-last-frame-to-video, and video continuation.

API reference | Model pricing | Prompt guide

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

Model	Features	Input modality	Output video specifications
wan2.7-i2v `Recommended`	Video with audio First-frame-to-video, first-and-last-frame-to-video, video continuation, video continuation with last frame control Multi-shot narrative, audio-video synchronization	Text, image, audio, video	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

Model	Features	Input modality	Output video specifications
wan2.7-i2v `Recommended`	Video with audio First-frame-to-video, first-and-last-frame-to-video, video continuation, video continuation with last frame control Multi-shot narrative, audio-video synchronization	Text, image, audio, video	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)

Wan - image-to-video - first frame

Generates a video from a specified first-frame image. This model accepts text, a first-frame image, and audio as input to generate cinematic, multi-shot videos.

API reference | Model pricing | Try online: Singapore, Virginia, Beijing

Global

If you select the Global deployment scope, model inference compute resources are dynamically scheduled worldwide. Static data is stored in your selected region. Supported regions: US (Virginia) and Germany (Frankfurt).

Model	Features	Input modality	Output video specifications
wan2.6-i2v `Recommended`	Video with audio Multi-shot narrative, audio-video synchronization	Text, image, audio	Resolution options: 720P, 1080P Video duration: 5s, 10s, 15s Defined specifications: 30 fps, MP4 (H.264 encoding)

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

Model	Features	Input modality	Output video specifications
wan2.6-i2v-flash `Recommended`	Video with audio, silent video Multi-shot narrative, audio-video synchronization	Text, image, audio	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.6-i2v `Recommended`	Video with audio Multi-shot narrative, audio-video synchronization	Text, image, audio	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.5-i2v-preview	Video with audio Audio-video synchronization	Text, image, audio	Resolution options: 480P, 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.2-i2v-flash	Silent video 50% faster than the 2.1 model.	Text, image	Resolution options: 480P, 720P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.2-i2v-plus	Silent video Improved stability and success rate compared to the 2.1 model.	Text, image	Resolution options: 480P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.1-i2v-plus	Silent video	Text, image	Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.1-i2v-turbo	Silent video	Text, image	Resolution options: 480P, 720P Video duration: 3s, 4s, 5s Defined specifications: 30 fps, MP4 (H.264 encoding)

US

If you select the US deployment scope, model inference compute resources are restricted to the United States. Static data is stored in your selected region. Supported region: US (Virginia).

Model	Features	Input modality	Output video specifications
wan2.6-i2v-us `Recommended`	Video with audio Multi-shot narrative, audio-video synchronization	Text, image, audio	Resolution options: 720P, 1080P Video duration: 5s, 10s, 15s Defined specifications: 30 fps, MP4 (H.264 encoding)

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

Model	Features	Input modality	Output video specifications
wan2.6-i2v-flash `Recommended`	Video with audio, silent video Multi-shot narrative, audio-video synchronization	Text, image, audio	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.6-i2v `Recommended`	Video with audio Multi-shot narrative, audio-video synchronization	Text, image, audio	Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.5-i2v-preview	Video with audio Audio-video synchronization	Text, image, audio	Resolution options: 480P, 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.2-i2v-flash	Silent video 50% faster than the 2.1 model.	Text, image	Resolution options: 480P, 720P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.2-i2v-plus	Silent video Improved stability and success rate compared to the 2.1 model.	Text, image	Resolution options: 480P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wanx2.1-i2v-plus	Silent video	Text, image	Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wanx2.1-i2v-turbo	Silent video	Text, image	Resolution options: 480P, 720P Video duration: 3s, 4s, 5s Defined specifications: 30 fps, MP4 (H.264 encoding)

Input prompt	Input first frame image and audio	Output video (wan2.6, multi-shot video)
An urban fantasy art scene. A dynamic graffiti art character. A teenager made of spray paint comes to life from a concrete wall. He performs an English rap at high speed while striking a classic, energetic rapper pose. The scene is set under an urban railway bridge at night. The lighting comes from a single streetlight, creating a cinematic atmosphere with high energy and amazing detail. The audio of the video consists entirely of his rap, with no other dialogue or noise.		

Wan - image-to-video - first and last frames

Generates a video that smoothly transitions between specified first and last frame images. This model accepts text, first and last frame images, and audio as input to generate cinematic, multi-shot videos.

API reference | Model pricing | Try online

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

Model	Features	Input modality	Output video specifications
wan2.2-kf2v-flash `Recommended`	Silent video Improved stability and success rate compared to the 2.1 model.	Text, image	Resolution options: 480P, 720P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.1-kf2v-plus	Silent video	Text, image	Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

Model	Features	Input modality	Output video specifications
wan2.2-kf2v-flash `Recommended`	Silent video Improved stability and success rate compared to the 2.1 model.	Text, image	Resolution options: 480P, 720P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)
wanx2.1-kf2v-plus	Silent video	Text, image	Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding)

Input first frame image	Input last frame image	Input prompt	Output video
		Realistic style. A small black cat looks up at the sky curiously. The camera starts at eye level, gradually rises, and ends with a top-down shot of the cat's curious gaze.

Wan - reference-to-video

Make characters from a specified video perform actions. Input a video and a text prompt to generate an output video that maintains character consistency.

API reference | Model pricing

Global

If you select the Global deployment scope, model inference compute resources are dynamically scheduled worldwide. Static data is stored in your selected region. Supported regions: US (Virginia) and Germany (Frankfurt).

Model	Features	Input modality	Output video specifications
wan2.6-r2v `Recommended`	Video with audio Single-role/multi-role video generation Multi-shot narrative, audio-video synchronization	Text, video	Resolution options: 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding)

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

Model	Features	Input modality	Output video specifications
wan2.7-r2v `Recommended`	Video with audio Multi-entity reference-to-video; supports configuring voice timbre for each entity.	Text, image, video, audio	Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.6-r2v-flash	Video with audio, silent video Single-role/multi-role video generation Multi-shot narrative, audio-video synchronization Faster generation, cost-effective.	Text, image, video	Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.6-r2v	Video with audio Single-role/multi-role video generation Multi-shot narrative, audio-video synchronization	Text, image, video	Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

Model	Features	Input modality	Output video specifications
wan2.7-r2v `Recommended`	Video with audio Multi-entity reference-to-video; supports configuring voice timbre for each entity.	Text, image, video, audio	Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.6-r2v-flash	Video with audio, silent video Single-role/multi-role video generation Multi-shot narrative, audio-video synchronization Faster generation, cost-effective.	Text, image, video	Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.6-r2v	Video with audio Single-role/multi-role video generation Multi-shot narrative, audio-video synchronization	Text, image, video	Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)

Input reference video 1 (role: little girl)	Input reference video 2 (role: alarm clock)	Input prompt	Output video (multi-role dialogue)
		character1 says to character2: “I’ll rely on you tomorrow morning!” character2 replies: “You can count on me!”

Wan - video editing

Video editing model. Accepts text, image, and video multimodal input to perform various video generation and editing tasks.

Video editing 2.7 API reference | Video editing 2.1 API reference | Model pricing

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

Model	Features	Input modality	Output video specifications
wan2.7-videoedit `Recommended`	Video with audio, silent video (depends on the input video) Instruction-based editing, video migration	Text, image, video	Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wan2.1-vace-plus	Silent video Multi-image reference, video redrawing, local editing, video extension, video frame extension	Text, image, video	Resolution options: 720P Video duration: Up to 5s Defined specifications: 30 fps, MP4 (H.264 encoding)

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

Model	Features	Input modality	Output video specifications
wan2.7-videoedit `Recommended`	Video with audio, silent video (depends on the input video) Instruction-based editing, video migration	Text, image, video	Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding)
wanx2.1-vace-plus	Silent video Multi-image reference, video redrawing, local editing, video extension, video frame extension	Text, image, video	Resolution options: 720P Video duration: Up to 5s Defined specifications: 30 fps, MP4 (H.264 encoding)

Video editing 2.1

Feature 1: Multi-image reference

Input reference image 1 (reference entity)	Input reference image 2 (reference background)	Input prompt	Output video
		Video shows a girl gracefully walking out from the depths of an ancient, misty forest. Her steps are light, and the camera captures her every nimble moment. When she stops and looks around at the lush woods, a smile of surprise and joy blossoms on her face. This moment, frozen in the interplay of light and shadow, records her wonderful encounter with nature.

Feature 2: Video redrawing

Input video	Input prompt	Output video
	The video shows a black steampunk-style car, driven by a gentleman, adorned with gears and copper pipes. The background is a steam-powered candy factory with retro elements, creating a vintage and playful scene.

Feature 3: Local video editing

Input video	Input mask image (the white area indicates the editing area)	Input prompt	Output video
		The video shows a Parisian-style French cafe where a lion in a suit elegantly sips coffee. It holds a coffee cup in one hand, taking a gentle sip with a contented expression. The cafe is tastefully decorated, with soft hues and warm lighting illuminating the area where the lion is.

Feature 4: Video extension

Input first video segment (1s)	Input prompt	Output video (extended video is 5s)
	A dog wearing sunglasses skateboards on a street, 3D cartoon.

Feature 5: Video frame extension

Input video	Input prompt	Output video
	An elegant woman passionately plays the violin, with a full symphony orchestra behind her.

Wan - digital human

Note

Only the Chinese mainland service deployment scope is supported. Data storage is in the Beijing access region. Model inference compute resources are limited to the Chinese mainland.

Digital human lip-syncing animates a person or cartoon character in an image to speak, sing, narrate, or perform. You provide an image and an audio file, and the model automatically generates a video with synchronized lip movements, facial expressions, and head and body motions.

Image detection API reference | Video generation API reference | Model pricing

Model	Features	Input modality	Output description
wan2.2-s2v-detect	Image detection	Image	Output detection status: Pass or Fail
wan2.2-s2v	Video generation Video with audio	Image, audio	Resolution options: 480P, 720P Video duration: Up to 20s (follows audio duration) Defined specifications: 480P: 16 fps, MP4 (H.264 encoding); 720P: 30 fps, MP4 (H.264 encoding)
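
The wan2.2-s2v output depends on two inputs: the chosen resolution fixes the frame rate, and the audio length sets the video length. The sketch below restates that rule; `s2v_output` is a hypothetical helper, and rejecting audio longer than 20s is an assumption, since the table does not say whether longer audio is refused or truncated.

```python
# Frame rate per resolution and the duration cap, taken from the wan2.2-s2v row.
S2V_FPS = {"480P": 16, "720P": 30}
S2V_MAX_SECONDS = 20


def s2v_output(resolution: str, audio_seconds: float) -> dict:
    """Describe the expected output for a lip-sync job (hypothetical helper)."""
    if resolution not in S2V_FPS:
        raise ValueError(f"unsupported resolution: {resolution}")
    # Assumption: audio over the cap is treated as invalid rather than truncated.
    if not 0 < audio_seconds <= S2V_MAX_SECONDS:
        raise ValueError(f"audio must be longer than 0s and at most {S2V_MAX_SECONDS}s")
    return {
        "resolution": resolution,
        "fps": S2V_FPS[resolution],
        "duration_seconds": audio_seconds,  # the video follows the audio duration
        "format": "MP4 (H.264 encoding)",
    }
```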

Input example (character image + audio)	Output video (lip-sync)

Wan - image to action

Animates a person from an image using motion from a reference video. You provide an image and a video, and the model generates a video that applies the motion from the reference video to the person, while keeping the background of the original image static.

API reference | Model pricing

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

Model	Features	Input modality	Output video specifications
wan2.2-animate-move	Video with audio, silent video (depends on the input video) Standard mode `wan-std`: Fast generation, cost-effective. Professional mode `wan-pro`: More realistic results.	Image, video	Resolution options: 720P Video duration: 2s < duration < 30s Defined specifications: Standard mode `wan-std`: 15 fps, MP4 (H.264 encoding); Professional mode `wan-pro`: 25 fps, MP4 (H.264 encoding)

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

Model	Features	Input modality	Output video specifications
wan2.2-animate-move	Video with audio, silent video (depends on the input video) Standard mode `wan-std`: Fast generation, cost-effective. Professional mode `wan-pro`: More realistic results.	Image, video	Resolution options: 720P Video duration: 2s < duration < 30s Defined specifications: Standard mode `wan-std`: 15 fps, MP4 (H.264 encoding); Professional mode `wan-pro`: 25 fps, MP4 (H.264 encoding)
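
The duration bounds for wan2.2-animate-move are strict inequalities, and the frame rate depends on the mode. The following sketch makes both points concrete; the function names are hypothetical, and the frame count is a nominal estimate (frame rate times duration), not a figure stated in the table.

```python
# Frame rate per service mode, taken from the wan2.2-animate-move specifications.
ANIMATE_FPS = {"wan-std": 15, "wan-pro": 25}


def animate_duration_ok(seconds: float) -> bool:
    """The table gives 2s < duration < 30s, so both endpoints are excluded."""
    return 2 < seconds < 30


def nominal_frame_count(mode: str, seconds: float) -> int:
    """Estimate the number of output frames as fps * duration (hypothetical helper)."""
    if mode not in ANIMATE_FPS:
        raise ValueError(f"unknown mode: {mode}")
    if not animate_duration_ok(seconds):
        raise ValueError("duration must be strictly between 2s and 30s")
    return round(ANIMATE_FPS[mode] * seconds)
```

A 10-second clip comes to roughly 150 frames in `wan-std` and 250 frames in `wan-pro`, while clips of exactly 2s or 30s fall outside the listed range.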

Input character image	Input reference video	Output video (standard mode `wan-std`)	Output video (professional mode `wan-pro`)

Wan - video character swap

Replaces a character in a video with one from a reference image. You provide a source video and a reference image, and the model generates an output video that retains the original background. This feature is ideal for use cases like face swapping and full character replacement.

API reference | Model pricing

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

Model	Features	Input modality	Output video specifications
wan2.2-animate-mix	Video with audio, silent video (depends on the input video) Standard mode `wan-std`: Fast generation, cost-effective. Professional mode `wan-pro`: More realistic results.	Image, video	Resolution options: 720P Video duration: 2s < duration < 30s Defined specifications: Standard mode `wan-std`: 15 fps, MP4 (H.264 encoding); Professional mode `wan-pro`: 25 fps, MP4 (H.264 encoding)

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

Model	Features	Input modality	Output video specifications
wan2.2-animate-mix	Video with audio, silent video (depends on the input video) Standard mode `wan-std`: Fast generation, cost-effective. Professional mode `wan-pro`: More realistic results.	Image, video	Resolution options: 720P Video duration: 2s < duration < 30s Defined specifications: Standard mode `wan-std`: 15 fps, MP4 (H.264 encoding); Professional mode `wan-pro`: 25 fps, MP4 (H.264 encoding)

Input video	Input character image for replacement	Output video (standard mode `wan-std`)	Output video (professional mode `wan-pro`)

AnimateAnyone

Note

Only the Chinese mainland service deployment scope is supported. Data storage is in the Beijing access region. Model inference compute resources are limited to the Chinese mainland.
We recommend using Wan - image to action and Wan - video character swap instead of AnimateAnyone. These models offer better quality, while AnimateAnyone is a more cost-effective option.

Designed specifically for dancing, this model replaces the dancer in a video with a person from an image. You provide an image and a video to generate an output video in one of two ways: 1. Retain the image background. 2. Retain the video background.

Image detection API reference | Action template generation API reference | Video generation API reference | Model pricing

Model	Features	Input modality	Output description
animate-anyone-detect-gen2	Image detection	Image	Output detection status: Pass or Fail
animate-anyone-template-gen2	Dance video template generation Extracts an action template from a dance video.	Video	Outputs a dance action template ID.
animate-anyone-gen2	Video generation Silent video	Image, video, dance action template ID	Video resolution options: 720P Video duration: 2s ≤ duration ≤ 60s Defined specifications: 15 fps, MP4 (H.264 encoding)

Input character image	Input dance video	Output video (generated with image background)	Output video (generated with video background)

EMO

Note

Only the Chinese mainland service deployment scope is supported. Data storage is in the Beijing access region. Model inference compute resources are limited to the Chinese mainland.
Consider using Wan - digital human as an alternative to EMO. Wan - digital human provides better results, while EMO is a more cost-effective option.

Generates singing and performance videos from an image. You provide an image and an audio file, and the model automatically generates a video with synchronized lip movements, facial expressions, and head motions.

Image detection API reference | Video generation API reference | Model pricing

Model	Features	Input modality	Output description
emo-detect-v1	Image detection	Image	Output detection status: Pass or Fail
emo-v1	Video generation Video with audio	Image, audio	Video resolution: 1:1 aspect ratio: Fixed at 512 × 512; 3:4 aspect ratio: Fixed at 512 × 704 Video duration: Up to 60s Defined specifications: 15 fps, MP4 (H.264 encoding)

Input example (portrait image + audio)	Output video (lip-sync singing)

LivePortrait

Note

Only the Chinese mainland service deployment scope is supported. Data storage is in the Beijing access region. Model inference compute resources are limited to the Chinese mainland.
Consider using Wan - digital human as an alternative to LivePortrait. Wan - digital human delivers higher quality results, while LivePortrait is a more cost-effective option. Note that LivePortrait is suitable for generating long videos (over 20 seconds).

Generates narration videos from an image by animating the person in the image to deliver news or tell stories. You provide an image and an audio file, and the model automatically generates a video with synchronized lip movements, facial expressions, and slight head motions.

Image detection API reference | Video generation API reference | Model pricing

Model	Features	Input modality	Output description
liveportrait-detect	Image detection	Image	Output detection status: Pass or Fail
liveportrait	Video generation Video with audio	Image, audio	Video resolution: Follows the input image, up to nearly 4K (4096 × 4096). Video duration: 1s < duration < 180s Video frame rate: 15 fps ≤ frame rate ≤ 30 fps Video format: MP4 (H.264 encoding)
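
The LivePortrait limits mix strict and inclusive bounds: the duration range excludes its endpoints, while the frame rate range includes them. The sketch below checks a planned output against those limits. `liveportrait_problems` is a hypothetical helper, and treating 4096 pixels per side as an inclusive maximum is an assumption based on the "up to nearly 4K (4096 × 4096)" wording.

```python
# Limits copied from the liveportrait specifications.
MAX_SIDE = 4096                 # assumption: inclusive per-side maximum
MIN_SECONDS, MAX_SECONDS = 1, 180   # both endpoints excluded
MIN_FPS, MAX_FPS = 15, 30           # both endpoints included


def liveportrait_problems(width: int, height: int, seconds: float, fps: float) -> list[str]:
    """Return the limits a planned LivePortrait output would violate (hypothetical helper)."""
    problems = []
    if width > MAX_SIDE or height > MAX_SIDE:
        problems.append("resolution exceeds 4096 x 4096")
    if not MIN_SECONDS < seconds < MAX_SECONDS:
        problems.append("duration must be strictly between 1s and 180s")
    if not MIN_FPS <= fps <= MAX_FPS:
        problems.append("frame rate must be between 15 fps and 30 fps")
    return problems
```

A 120-second clip at 30 fps passes, which matches the note above that LivePortrait suits videos longer than 20 seconds.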

Input example (portrait image + audio)	Output video (lip-sync voiceover)

Emoji

Note

Only the Chinese mainland service deployment scope is supported. Data storage is in the Beijing access region. Model inference compute resources are limited to the Chinese mainland.

Creates emojis using fixed emoji templates. You provide an image and an emoji template ID to generate an emoji video.

Image detection API reference | Video generation API reference | Model pricing

Model	Features	Input modality	Output description
emoji-detect-v1	Image detection	Image	Output detection status: Pass or Fail
emoji-v1	Video generation Silent video	Image, emoji template ID	Video resolution: Fixed at 512 × 512 Video duration: Up to 5s (follows template duration) Defined specifications: 15 fps, MP4 (H.264 encoding)

Input portrait image	Output video ("disgusted" emoji)

VideoRetalk

Note

Only the Chinese mainland service deployment scope is supported. Data storage is in the Beijing access region. Model inference compute resources are limited to the Chinese mainland.

Lip sync: Replaces the lip movements in a video to match a new audio track. You provide a video and an audio file, and the model generates an output video with synchronized lip movements.

API reference | Model pricing

Model	Features	Input modality	Output video specifications
videoretalk	Video with audio	Video, audio	Video resolution: Follows the input video, up to nearly 2K (2048 × 2048). Video duration: 2s < duration < 120s Video frame rate: 15 fps ≤ frame rate ≤ 60 fps Video format: MP4 (H.264 encoding)

Input example (character broadcast video + audio)	Output video (lip-sync replacement)
Input audio:

Video style transform

Note

Only the Chinese mainland service deployment scope is supported. Data storage is in the Beijing access region. Model inference compute resources are limited to the Chinese mainland.

Applies a new artistic style to a video based on a predefined style template. You provide a video and a style transfer ID to generate a restyled video.

API reference | Model pricing

Model	Features	Input modality	Output video specifications
video-style-transform	Video with audio, silent video Depends on the input video.	Video redraw style ID	Video resolution: Follows the input video, up to nearly 4K (4096 × 4096). Video duration: Up to 30s Video frame rate: 15 fps ≤ frame rate ≤ 25 fps Video format: MP4 (H.264 encoding)

Input video	Output video (style transfer: "Japanese manga")
