Alibaba Cloud Model Studio: Video generation

Last Updated: Dec 16, 2025

Alibaba Cloud Model Studio offers a diverse selection of video models for scenarios such as text-to-video, image-to-video (including general, dance, singing, and broadcasting), and video editing (including general, style transfer, and lip-syncing).

Model overview

Category

Description

Text-to-video

  • Text-to-video: Generates a video from a single sentence. The video features rich artistic styles and high-quality visuals.

Image-to-video

  • First frame to video: Uses an input image as the first frame of a video and generates the rest based on a prompt.

  • First and last frame to video: Generates a smooth, dynamic video from first and last frame images and a prompt.

  • Multi-image to video: Generates a video based on one or more reference images for the subject or background, combined with a prompt.

Video editing

  • General video editing: Performs various video editing tasks based on input text prompts, images, and videos. For example, it can extract motion features from an input video and combine them with a prompt to generate a new video.

Supported models

Text-to-video

The Wan text-to-video model generates videos from a single sentence. The videos feature rich artistic styles and cinematic quality. API reference | Try it online

International (Singapore)

| Model | Description | Unit price | Free quota (Claim) |
| --- | --- | --- | --- |
| wan2.6-t2v (Recommended) | Wan 2.6 introduces a multi-shot narrative feature and supports automatic voiceover and the import of custom audio files. | 720P: $0.10/second; 1080P: $0.15/second | 50 seconds |
| wan2.5-t2v-preview (Recommended) | Wan 2.5 preview. Supports automatic voiceover and custom audio file input. | 480P: $0.05/second; 720P: $0.10/second; 1080P: $0.15/second | 50 seconds |
| wan2.2-t2v-plus | Wan 2.2 Professional Edition. Significantly improved image detail and motion stability. | 480P: $0.02/second; 1080P: $0.10/second | 50 seconds |
| wan2.1-t2v-turbo | Wan 2.1 Turbo Edition. Fast generation speed and balanced performance. | $0.036/second | 200 seconds |
| wan2.1-t2v-plus | Wan 2.1 Professional Edition. Generates rich details and higher-quality images. | $0.10/second | 200 seconds |

The free quota is valid for 90 days after you activate Alibaba Cloud Model Studio.
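
As a rough illustration of how the per-second prices above translate into a charge, the following sketch multiplies clip duration by the listed Singapore unit price. The names `PRICES_SGP` and `estimate_cost` are hypothetical and are not part of any Model Studio SDK. The sketch assumes that billing is simply duration times unit price and ignores the free quota.

```python
from decimal import Decimal

# USD per second, copied from the International (Singapore) table above.
# The key None marks models listed with a single price for all resolutions.
PRICES_SGP = {
    "wan2.6-t2v": {"720P": Decimal("0.10"), "1080P": Decimal("0.15")},
    "wan2.5-t2v-preview": {
        "480P": Decimal("0.05"),
        "720P": Decimal("0.10"),
        "1080P": Decimal("0.15"),
    },
    "wan2.2-t2v-plus": {"480P": Decimal("0.02"), "1080P": Decimal("0.10")},
    "wan2.1-t2v-turbo": {None: Decimal("0.036")},
    "wan2.1-t2v-plus": {None: Decimal("0.10")},
}


def estimate_cost(model, seconds, resolution=None):
    """Return the estimated charge in USD for one generated clip.

    Assumption: charge = duration in seconds * listed unit price.
    """
    tiers = PRICES_SGP[model]
    if None in tiers:
        # Single-price model: the resolution does not affect the price.
        return tiers[None] * seconds
    if resolution not in tiers:
        raise ValueError(f"{model} has no listed price for {resolution!r}")
    return tiers[resolution] * seconds


print(estimate_cost("wan2.6-t2v", 5, "1080P"))  # 0.75
print(estimate_cost("wan2.1-t2v-turbo", 10))    # 0.360
```

`Decimal` is used instead of `float` so that small per-second prices add up without binary rounding noise.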

China (Beijing)

| Model | Description | Unit price | Free quota |
| --- | --- | --- | --- |
| wan2.6-t2v (Recommended) | Wan 2.6 introduces a multi-shot narrative feature and supports automatic voiceover and the import of custom audio files. | 720P: $0.086012/second; 1080P: $0.143353/second | No free quota |
| wan2.5-t2v-preview (Recommended) | Wan 2.5 preview. Supports automatic voiceover and custom audio file input. | 480P: $0.043006/second; 720P: $0.086012/second; 1080P: $0.143353/second | No free quota |
| wan2.2-t2v-plus | Wan 2.2 Professional Edition. Significantly improved image detail and motion stability. | 480P: $0.02007/second; 1080P: $0.100347/second | No free quota |
| wanx2.1-t2v-turbo | Faster generation speed and balanced performance. | $0.034405/second | No free quota |
| wanx2.1-t2v-plus | Generates richer details and higher-quality images. | $0.100347/second | No free quota |
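
Note that the two regions list the Wan 2.1 models under different IDs (`wan2.1-...` in Singapore, `wanx2.1-...` in Beijing), while the 2.2, 2.5, and 2.6 IDs are identical. The sketch below shows one way to keep that difference in a single place. The helper `model_id_for_region` and the region keys are hypothetical, and the sketch assumes the naming difference in the tables is intentional.

```python
# Model IDs that differ between regions, taken from the two tables above.
# Keys are the Singapore IDs; values are the Beijing IDs.
BEIJING_MODEL_IDS = {
    "wan2.1-t2v-turbo": "wanx2.1-t2v-turbo",
    "wan2.1-t2v-plus": "wanx2.1-t2v-plus",
}


def model_id_for_region(model, region):
    """Map a Singapore model ID to the ID listed for the given region."""
    if region == "singapore":
        return model
    if region == "beijing":
        # Models without an entry use the same ID in both regions.
        return BEIJING_MODEL_IDS.get(model, model)
    raise ValueError(f"unknown region: {region!r}")


print(model_id_for_region("wan2.1-t2v-turbo", "beijing"))  # wanx2.1-t2v-turbo
print(model_id_for_region("wan2.6-t2v", "beijing"))        # wan2.6-t2v
```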

Input example

Output video (wan2.5)

Input prompt: Shot from a low angle, in a medium close-up, with warm tones, mixed lighting (the practical light from the desk lamp blends with the overcast light from the window), side lighting, and a central composition. In a classic detective office, wooden bookshelves are filled with old case files and ashtrays. A green desk lamp illuminates a case file spread out in the center of the desk. A fox, wearing a dark brown trench coat and a light gray fedora, sits in a leather chair, its fur crimson, its tail resting lightly on the edge, its fingers slowly turning yellowed pages. Outside, a steady drizzle falls beneath a blue sky, streaking the glass with meandering streaks. It slowly raises its head, its ears twitching slightly, its amber eyes gazing directly at the camera, its mouth clearly moving as it speaks in a smooth, cynical voice: 'The case was cold, colder than a fish in winter. But every chicken has its secrets, and I, for one, intended to find them.'

Input audio:

Image-to-video - based on the first frame

The Wan image-to-video model uses an input image as the first frame of a video. It then generates the rest of the video based on a prompt. The videos feature rich artistic styles and cinematic quality. API reference | Try it online

International (Singapore)

| Model | Description | Unit price | Free quota (Note) |
| --- | --- | --- | --- |
| wan2.6-i2v (Recommended) | Wan 2.6 introduces a multi-shot narrative feature and supports automatic voiceover and the import of custom audio files. | 720P: $0.10/second; 1080P: $0.15/second | 50 seconds |
| wan2.5-i2v-preview (Recommended) | Wan 2.5 preview. Supports automatic dubbing and custom audio file uploads. | 480P: $0.05/second; 720P: $0.10/second; 1080P: $0.15/second | 50 seconds |
| wan2.2-i2v-flash | Wan 2.2 Flash Edition. Delivers extremely fast generation speed with significant improvements in visual detail and motion stability. | 480P: $0.015/second; 720P: $0.036/second | 50 seconds |
| wan2.2-i2v-plus | Wan 2.2 Professional Edition. Delivers significant improvements in visual detail and motion stability. | 480P: $0.02/second; 1080P: $0.10/second | 50 seconds |
| wan2.1-i2v-turbo | Wan 2.1 Turbo Edition. Fast generation speed with balanced performance. | $0.036/second | 200 seconds |
| wan2.1-i2v-plus | Wan 2.1 Professional Edition. Generates rich details and produces higher-quality, more textured visuals. | $0.10/second | 200 seconds |

Validity: Within 90 days after you activate Alibaba Cloud Model Studio.
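
To see how a free quota might offset usage, the sketch below subtracts the quota from the total generated seconds before applying the unit price. The function `billable_cost` is a hypothetical helper. The sketch assumes that the quota is counted in seconds of generated video and is consumed before any charge applies. The page does not spell out these rules, so treat the result as an estimate only.

```python
from decimal import Decimal


def billable_cost(total_seconds, free_quota_seconds, price_per_second):
    """Estimate the charge in USD after a free quota is used up.

    Assumption: the quota covers the first `free_quota_seconds` of output,
    and only the remainder is billed at `price_per_second`.
    """
    billable_seconds = max(0, total_seconds - free_quota_seconds)
    return billable_seconds * Decimal(price_per_second)


# wan2.2-i2v-flash at 720P ($0.036/second) with the 50-second quota:
print(billable_cost(80, 50, "0.036"))  # 1.080
# Usage that stays within the quota costs nothing:
print(billable_cost(40, 50, "0.036"))  # 0.000
```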

China (Beijing)

| Model | Description | Unit price | Free quota |
| --- | --- | --- | --- |
| wan2.6-i2v (Recommended) | Wan 2.6 introduces a multi-shot narrative feature and supports automatic voiceover and the import of custom audio files. | 720P: $0.086012/second; 1080P: $0.143353/second | No free quota |
| wan2.5-i2v-preview (Recommended) | Wan 2.5 preview. Supports automatic dubbing and custom audio file uploads. | 480P: $0.043006/second; 720P: $0.086012/second; 1080P: $0.143353/second | No free quota |
| wan2.2-i2v-plus | Wan 2.2 Professional Edition. Delivers significant improvements in visual detail and motion stability. | 480P: $0.02007/second; 1080P: $0.100347/second | No free quota |
| wanx2.1-i2v-turbo | Wan 2.1 Turbo Edition. Fast generation speed with balanced performance. | $0.034405/second | No free quota |
| wanx2.1-i2v-plus | Wan 2.1 Professional Edition. Generates rich details and produces higher-quality, more textured visuals. | $0.100347/second | No free quota |

Input first frame image and audio

Output video (wan2.6)

Input audio:

Input prompt: A scene of urban fantasy art. A dynamic graffiti art character. A boy painted with spray paint comes to life from a concrete wall. He sings an English rap song at a very fast pace while striking a classic, energetic rapper pose. The scene is set under an urban railway bridge at night. The lighting comes from a single streetlight, creating a cinematic atmosphere full of high energy and amazing detail. The audio of the video consists entirely of his rap, with no other dialogue or noise.

Image-to-video - based on the first and last frames

The Wan first-and-last-frame video model generates a smooth, dynamic video from a first frame image, a last frame image, and a prompt. The videos feature rich artistic styles and cinematic quality. API reference | Try it online

International (Singapore)

| Model | Unit price | Free quota (Note) |
| --- | --- | --- |
| wan2.1-kf2v-plus | $0.10/second | 200 seconds |

Validity period: Within 90 days after you activate Model Studio.

China (Beijing)

| Model | Unit price | Free quota (Note) |
| --- | --- | --- |
| wanx2.1-kf2v-plus | $0.100347/second | No free quota |

Example input

Output video

First frame

Last frame

Prompt

In a realistic style, the camera starts at eye level with a small black cat looking up at the sky with curiosity, then gradually moves upward, ending in a top-down shot focused on the cat's curious eyes.

Reference-to-video

The Wan reference-to-video model uses a character's appearance and voice from an input video, together with a prompt, to generate a new video that maintains character consistency. API reference

International (Singapore)

| Model | Input price | Output price | Free quota (Note) |
| --- | --- | --- | --- |
| wan2.6-r2v | 720P: $0.10/second; 1080P: $0.15/second | 720P: $0.10/second; 1080P: $0.15/second | 50 seconds |

Validity period: Within 90 days after you activate Model Studio.

China (Beijing)

| Model | Input price | Output price | Free quota (Note) |
| --- | --- | --- | --- |
| wan2.6-r2v | 720P: $0.086012/second; 1080P: $0.143353/second | 720P: $0.086012/second; 1080P: $0.143353/second | No free quota |
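
Unlike the other models, wan2.6-r2v lists separate input and output prices. The sketch below adds the two components together. The helper `estimate_r2v_cost` is hypothetical. The sketch assumes that the input price applies to the duration of the reference video and the output price to the duration of the generated video, both at the same resolution tier. The page does not state the exact billing rule.

```python
from decimal import Decimal

# (input price, output price) in USD per second, from the two tables above.
R2V_PRICES = {
    "singapore": {
        "720P": (Decimal("0.10"), Decimal("0.10")),
        "1080P": (Decimal("0.15"), Decimal("0.15")),
    },
    "beijing": {
        "720P": (Decimal("0.086012"), Decimal("0.086012")),
        "1080P": (Decimal("0.143353"), Decimal("0.143353")),
    },
}


def estimate_r2v_cost(region, resolution, input_seconds, output_seconds):
    """Estimate the wan2.6-r2v charge as input cost plus output cost."""
    input_price, output_price = R2V_PRICES[region][resolution]
    return input_price * input_seconds + output_price * output_seconds


# A 5-second reference video and a 10-second result at 720P in Singapore:
print(estimate_r2v_cost("singapore", "720P", 5, 10))  # 1.50
```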

General video editing

The Wan unified video editing model supports multimodal inputs, including text, images, and videos. It can perform video generation and general editing tasks. API reference | Try it online

International (Singapore)

| Model | Unit price | Free quota (Note) |
| --- | --- | --- |
| wan2.1-vace-plus | $0.10/second | 50 seconds |

Validity: Valid for 90 days after Model Studio activation.

China (Beijing)

| Model | Unit price | Free quota (Note) |
| --- | --- | --- |
| wanx2.1-vace-plus | $0.100347/second | No free quota |

The unified video editing model supports the following features:

Feature

Input reference image

Input prompt

Output video

Multi-image reference

Reference image 1 (reference entity)

Reference image 2 (reference background)

In the video, a girl gracefully walks out from a misty, ancient forest. Her steps are light, and the camera captures her every nimble moment. When the girl stops and looks around at the lush woods, a smile of surprise and joy blossoms on her face. This scene, frozen in a moment of interplay between light and shadow, records her wonderful encounter with nature.

Output video

Video repainting

The video shows a black steampunk-style car driven by a gentleman. The car is decorated with gears and copper pipes. The background features a steam-powered candy factory and retro elements, creating a vintage and playful scene.

Local editing

Input video

Input mask image (The white area indicates the editing area)

The video shows a Parisian-style French cafe where a lion in a suit is elegantly sipping coffee. It holds a coffee cup in one hand, taking a gentle sip with a relaxed expression. The cafe is tastefully decorated, with soft hues and warm lighting illuminating the area where the lion is.

The content in the editing area is modified based on the prompt.

Video extension

Input first clip (1 second)

A dog wearing sunglasses is skateboarding on the street, 3D cartoon.

Output extended video (5 seconds)

Video outpainting

An elegant lady is passionately playing the violin, with a full symphony orchestra behind her.