Alibaba Cloud Model Studio offers a diverse selection of video models for scenarios such as text-to-video, image-to-video (including general, dance, singing, and broadcasting), and video editing (including general, style transfer, and lip-syncing).
Model overview
Category | Description |
Text-to-video |
|
Image-to-video |
|
Video editing |
|
Supported models
Text-to-video
The Wan text-to-video model generates videos from a single sentence. The videos feature rich artistic styles and cinematic quality. API reference | Try it online
International (Singapore)
Model | Description | Unit price | Free quota (Claim) Valid for 90 days after you activate Alibaba Cloud Model Studio |
wan2.6-t2v | Wan 2.6 introduces a multi-shot narrative feature and supports automatic voiceover and the import of custom audio files. | 720P: $0.10/second 1080P: $0.15/second | 50 seconds |
wan2.5-t2v-preview | Wan 2.5 preview. Supports automatic voiceover and custom audio file input. | 480p: $0.05/second 720p: $0.10/second 1080p: $0.15/second | 50 seconds |
wan2.2-t2v-plus | Wan 2.2 Professional Edition. Significantly improved image detail and motion stability. | 480p: $0.02/second 1080p: $0.10/second | 50 seconds |
wan2.1-t2v-turbo | Wan 2.1 Turbo Edition. Fast generation speed and balanced performance. | $0.036/second | 200 seconds |
wan2.1-t2v-plus | Wan 2.1 Professional Edition. Generates rich details and higher-quality images. | $0.10/second | 200 seconds |
China (Beijing)
Model | Description | Unit price | Free quota |
wan2.6-t2v | Wan 2.6 introduces a multi-shot narrative feature and supports automatic voiceover and the import of custom audio files. | 720P: $0.086012/second 1080P: $0.143353/second | No free quota |
wan2.5-t2v-preview | Wan 2.5 preview. Supports automatic voiceover and custom audio file input. | 480p: $0.043006/second 720p: $0.086012/second 1080p: $0.143353/second | No free quota |
wan2.2-t2v-plus | Wan 2.2 Professional Edition. Significantly improved image detail and motion stability. | 480p: $0.02007/second 1080p: $0.100347/second | No free quota |
wanx2.1-t2v-turbo | Faster generation speed and balanced performance. | $0.034405/second | No free quota |
wanx2.1-t2v-plus | Generates richer details and higher-quality images. | $0.100347/second | No free quota |
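Because text-to-video is billed per second, the cost of a clip can be estimated by multiplying its duration by the listed rate. The following is a minimal sketch that copies a few of the rates from the tables above. The function name `estimate_t2v_cost` and the region keys `singapore` and `beijing` are hypothetical labels chosen for illustration, and the sketch ignores the free quota and any rounding the billing system may apply.

```python
from decimal import Decimal

# Per-second prices in USD, copied from the text-to-video tables above.
# Key: (region, model, resolution). Flat-rate models use resolution None.
# The region labels are illustrative, not official identifiers.
T2V_PRICES = {
    ("singapore", "wan2.6-t2v", "720p"): Decimal("0.10"),
    ("singapore", "wan2.6-t2v", "1080p"): Decimal("0.15"),
    ("singapore", "wan2.2-t2v-plus", "480p"): Decimal("0.02"),
    ("singapore", "wan2.2-t2v-plus", "1080p"): Decimal("0.10"),
    ("singapore", "wan2.1-t2v-turbo", None): Decimal("0.036"),
    ("beijing", "wan2.5-t2v-preview", "720p"): Decimal("0.086012"),
    ("beijing", "wanx2.1-t2v-turbo", None): Decimal("0.034405"),
}


def estimate_t2v_cost(region, model, seconds, resolution=None):
    """Return duration times the listed per-second rate (assumed billing rule)."""
    key = (region, model, resolution.lower() if resolution else None)
    if key not in T2V_PRICES:
        raise KeyError(f"no price listed for {key}")
    # Decimal avoids binary floating-point error on currency values.
    return T2V_PRICES[key] * Decimal(str(seconds))


print(estimate_t2v_cost("singapore", "wan2.6-t2v", 5, "1080P"))  # 0.75
```

Under these assumptions, a 5-second 1080P clip from wan2.6-t2v in the International region comes to $0.75.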
Input example | Output video (wan2.5) |
Input prompt: Shot from a low angle, in a medium close-up, with warm tones, mixed lighting (the practical light from the desk lamp blends with the overcast light from the window), side lighting, and a central composition. In a classic detective office, wooden bookshelves are filled with old case files and ashtrays. A green desk lamp illuminates a case file spread out in the center of the desk. A fox, wearing a dark brown trench coat and a light gray fedora, sits in a leather chair, its fur crimson, its tail resting lightly on the edge, its fingers slowly turning yellowed pages. Outside, a steady drizzle falls beneath a blue sky, streaking the glass with meandering streaks. It slowly raises its head, its ears twitching slightly, its amber eyes gazing directly at the camera, its mouth clearly moving as it speaks in a smooth, cynical voice: 'The case was cold, colder than a fish in winter. But every chicken has its secrets, and I, for one, intended to find them.' Input audio: |
Image-to-video - based on the first frame
The Wan image-to-video model uses an input image as the first frame of a video. It then generates the rest of the video based on a prompt. The videos feature rich artistic styles and cinematic quality. API reference | Try it online
International (Singapore)
Model | Description | Unit price | Free quota (Note) Validity: Within 90 days after you activate Alibaba Cloud Model Studio |
wan2.6-i2v | Wan 2.6 introduces a multi-shot narrative feature and supports automatic voiceover and the import of custom audio files. | 720P: $0.10/second 1080P: $0.15/second | 50 seconds |
wan2.5-i2v-preview | Wan 2.5 preview. Supports automatic dubbing and custom audio file uploads. | 480P: $0.05/second 720P: $0.10/second 1080P: $0.15/second | 50 seconds |
wan2.2-i2v-flash | Wan 2.2 Flash Edition. Delivers extremely fast generation speed with significant improvements in visual detail and motion stability. | 480P: $0.015/second 720P: $0.036/second | 50 seconds |
wan2.2-i2v-plus | Wan 2.2 Professional Edition. Delivers significant improvements in visual detail and motion stability. | 480P: $0.02/second 1080P: $0.10/second | 50 seconds |
wan2.1-i2v-turbo | Wan 2.1 Turbo Edition. Fast generation speed with balanced performance. | $0.036/second | 200 seconds |
wan2.1-i2v-plus | Wan 2.1 Professional Edition. Generates rich details and produces higher-quality, more textured visuals. | $0.10/second | 200 seconds |
China (Beijing)
Model | Description | Unit price | Free quota |
wan2.6-i2v | Wan 2.6 introduces a multi-shot narrative feature and supports automatic voiceover and the import of custom audio files. | 720P: $0.086012/second 1080P: $0.143353/second | No free quota |
wan2.5-i2v-preview | Wan 2.5 preview. Supports automatic dubbing and custom audio file uploads. | 480P: $0.043006/second 720P: $0.086012/second 1080P: $0.143353/second | No free quota |
wan2.2-i2v-plus | Wan 2.2 Professional Edition. Delivers significant improvements in visual detail and motion stability. | 480P: $0.02007/second 1080P: $0.100347/second | No free quota |
wanx2.1-i2v-turbo | Wan 2.1 Turbo Edition. Fast generation speed with balanced performance. | $0.034405/second | No free quota |
wanx2.1-i2v-plus | Wan 2.1 Professional Edition. Generates rich details and produces higher-quality, more textured visuals. | $0.100347/second | No free quota |
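The free quota in the International tables is expressed in seconds, so one way to reason about a bill is to subtract the unused free seconds before applying the per-second rate. The sketch below assumes that the quota is tracked per model and consumed before any paid usage; the tables do not state this, and the helper `billable_seconds` is a hypothetical name.

```python
from decimal import Decimal


def billable_seconds(generated_seconds, free_quota_seconds, already_used_seconds=0):
    """Seconds left to pay for after the remaining free quota is applied.

    Assumption: free seconds are consumed first and never go negative.
    """
    remaining_free = max(free_quota_seconds - already_used_seconds, 0)
    return max(generated_seconds - remaining_free, 0)


# wan2.2-i2v-flash, International, 480P: $0.015/second with 50 free seconds.
seconds_to_bill = billable_seconds(60, 50)
cost = Decimal("0.015") * seconds_to_bill
print(seconds_to_bill, cost)  # 10 0.150
```

With these numbers, 60 generated seconds leave 10 billable seconds, or about $0.15, provided the quota is still within its 90-day validity period.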
Input first frame image and audio | Output video (wan2.6) |
Input audio: | |
Input prompt: A scene of urban fantasy art. A dynamic graffiti art character. A boy painted with spray paint comes to life from a concrete wall. He sings an English rap song at a very fast pace while striking a classic, energetic rapper pose. The scene is set under an urban railway bridge at night. The lighting comes from a single streetlight, creating a cinematic atmosphere full of high energy and amazing detail. The audio of the video consists entirely of his rap, with no other dialogue or noise. | |
Image-to-video - based on the first and last frames
The Wan first-and-last-frame video model generates a smooth, dynamic video from a prompt. You only need to provide the first and last frame images. The videos feature rich artistic styles and cinematic quality. API reference | Try it online
International (Singapore)
Model | Unit price | Free quota (Note) |
wan2.1-kf2v-plus | $0.10/second | 200 seconds Validity period: Within 90 days after you activate Model Studio |
China (Beijing)
Model | Unit price | Free quota (Note) |
wanx2.1-kf2v-plus | $0.100347/second | No free quota |
Example input: first frame | Example input: last frame | Example input: prompt | Output video |
(image omitted) | (image omitted) | In a realistic style, the camera starts at eye level with a small black cat looking up at the sky with curiosity, then gradually moves upward, ending in a top-down shot focused on the cat's curious eyes. | (video omitted) |
Reference-to-video
The Wan reference-to-video model takes a character's appearance and voice from an input video and, guided by a prompt, generates a new video that keeps the character consistent. API reference
International (Singapore)
Model | Input price | Output price | Free quota (Note) |
wan2.6-r2v | 720P: $0.10/second 1080P: $0.15/second | 720P: $0.10/second 1080P: $0.15/second | 50 seconds Validity period: Within 90 days after you activate Model Studio |
China (Beijing)
Model | Input price | Output price | Free quota (Note) |
wan2.6-r2v | 720P: $0.086012/second 1080P: $0.143353/second | 720P: $0.086012/second 1080P: $0.143353/second | No free quota |
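Unlike the other models, wan2.6-r2v lists both an input price and an output price. A rough estimate therefore needs two durations. The sketch below assumes that the total is the input seconds times the input rate plus the output seconds times the output rate; the tables do not spell out how input duration is measured, so treat the result as an approximation. The name `estimate_r2v_cost` and the region keys are hypothetical.

```python
from decimal import Decimal

# (input price, output price) in USD per second for wan2.6-r2v,
# copied from the tables above. Region labels are illustrative.
R2V_PRICES = {
    ("singapore", "720p"): (Decimal("0.10"), Decimal("0.10")),
    ("singapore", "1080p"): (Decimal("0.15"), Decimal("0.15")),
    ("beijing", "720p"): (Decimal("0.086012"), Decimal("0.086012")),
    ("beijing", "1080p"): (Decimal("0.143353"), Decimal("0.143353")),
}


def estimate_r2v_cost(region, resolution, input_seconds, output_seconds):
    """Assumed rule: input and output durations are billed separately and summed."""
    input_price, output_price = R2V_PRICES[(region, resolution.lower())]
    return (input_price * Decimal(str(input_seconds))
            + output_price * Decimal(str(output_seconds)))


print(estimate_r2v_cost("singapore", "720P", 5, 10))  # 1.50
```

Under this assumption, a 5-second reference video and a 10-second 720P output in the International region would cost about $1.50.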
General video editing
The Wan unified video editing model supports multimodal inputs, including text, images, and videos. It can perform video generation and general editing tasks. API reference | Try it online
International (Singapore)
Model | Unit price | Free quota (Note) |
wan2.1-vace-plus | $0.10/second | 50 seconds Validity period: Within 90 days after you activate Model Studio |
China (Beijing)
Model | Unit price | Free quota (Note) |
wanx2.1-vace-plus | $0.100347/second | No free quota |
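The tables above show that the Wan 2.1 models use a `wan2.1-` prefix in the International (Singapore) region and a `wanx2.1-` prefix in the China (Beijing) region, while newer models keep the same name in both. Code that targets both regions can centralize that difference in a lookup. This is a minimal sketch: `model_name_for_region` and the region keys are hypothetical, and the helper only renames models; it does not check whether a model is offered in a region.

```python
# Model names that differ between regions, taken from the tables above.
BEIJING_NAMES = {
    "wan2.1-t2v-turbo": "wanx2.1-t2v-turbo",
    "wan2.1-t2v-plus": "wanx2.1-t2v-plus",
    "wan2.1-i2v-turbo": "wanx2.1-i2v-turbo",
    "wan2.1-i2v-plus": "wanx2.1-i2v-plus",
    "wan2.1-kf2v-plus": "wanx2.1-kf2v-plus",
    "wan2.1-vace-plus": "wanx2.1-vace-plus",
}


def model_name_for_region(model, region):
    """Map an International model name to the name listed for the region."""
    if region == "beijing":
        # Models without an entry are listed under the same name in both regions.
        return BEIJING_NAMES.get(model, model)
    return model


print(model_name_for_region("wan2.1-vace-plus", "beijing"))  # wanx2.1-vace-plus
print(model_name_for_region("wan2.6-t2v", "beijing"))        # wan2.6-t2v
```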
The unified video editing model supports the following features:
Feature | Input reference image | Input prompt | Output video |
Multi-image reference | Reference image 1 (reference entity); Reference image 2 (reference background) | In the video, a girl gracefully walks out from a misty, ancient forest. Her steps are light, and the camera captures her every nimble moment. When the girl stops and looks around at the lush woods, a smile of surprise and joy blossoms on her face. This scene, frozen in a moment of interplay between light and shadow, records her wonderful encounter with nature. | Output video |
Video repainting | The video shows a black steampunk-style car driven by a gentleman. The car is decorated with gears and copper pipes. The background features a steam-powered candy factory and retro elements, creating a vintage and playful scene. | ||
Local editing | Input video; Input mask image (the white area indicates the editing area) | The video shows a Parisian-style French cafe where a lion in a suit is elegantly sipping coffee. It holds a coffee cup in one hand, taking a gentle sip with a relaxed expression. The cafe is tastefully decorated, with soft hues and warm lighting illuminating the area where the lion is. | The content in the editing area is modified based on the prompt. |
Video extension | Input first clip (1 second) | A dog wearing sunglasses is skateboarding on the street, 3D cartoon. | Output extended video (5 seconds) |
Video outpainting | An elegant lady is passionately playing the violin, with a full symphony orchestra behind her. |





