Alibaba Cloud Model Studio provides video generation models for general-purpose creation (text-to-video, image-to-video, reference-to-video, video editing) and vertical scenarios (digital human lip-syncing, image-to-action, video character swapping, emoji creation).
Model overview
Deployment mode (Compare modes) | Global: Compute resources for model inference are scheduled globally. | International: Compute resources for model inference are scheduled globally, excluding Chinese Mainland. | US: Compute resources for model inference are restricted to the US. | Chinese Mainland: Compute resources for model inference are restricted to Chinese Mainland. |
Region | Virginia | Singapore | Virginia | Beijing |
Supported models | Wan - image-to-video - first frame | Wan - image-to-video - first frame |
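As a rough illustration, the deployment modes in the table above can be captured as a small lookup structure. The `DeploymentMode` class, the dictionary keys, and `modes_for_region` are hypothetical helpers for this sketch and are not part of any Model Studio SDK; only the regions and scheduling descriptions come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentMode:
    name: str
    endpoint_region: str   # region listed in the overview table
    inference_scope: str   # where inference compute may be scheduled


# Values copied from the overview table; the keys are illustrative only.
DEPLOYMENT_MODES = {
    "global": DeploymentMode("Global", "Virginia", "scheduled globally"),
    "international": DeploymentMode(
        "International", "Singapore", "scheduled globally, excluding Chinese Mainland"
    ),
    "us": DeploymentMode("US", "Virginia", "restricted to the US"),
    "chinese_mainland": DeploymentMode(
        "Chinese Mainland", "Beijing", "restricted to Chinese Mainland"
    ),
}


def modes_for_region(region: str) -> list[str]:
    """Return the keys of all deployment modes whose endpoint is in `region`."""
    return sorted(
        key
        for key, mode in DEPLOYMENT_MODES.items()
        if mode.endpoint_region.lower() == region.lower()
    )
```

For example, `modes_for_region("Virginia")` returns both the Global and US modes, because the table lists Virginia for each of them.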
Model selection
General video generation
To generate a video from a text prompt, use Wan - text-to-video.
To generate a cinematic shot from a single image, use Wan - image-to-video - first frame.
To control the transition between a starting and an ending image, use Wan - image-to-video - first and last frames.
To replicate a character's appearance and voice from reference videos to match a new script, use Wan - reference-to-video.
Digital human lip-syncing: Animates static photos to speak, sing, or narrate. The background stays fixed while the face, head, and body move.
For the most natural results, including facial expressions, head, and body movements, use Wan - digital human. This model replaces EMO.
For videos longer than 20 seconds with simple head movements, such as news reports, use LivePortrait.
Video motion transfer: This feature keeps the background of the photo static and animates the person using motion from a reference video. Use Wan - image-to-action.
Video character swapping: This feature replaces the person in a video with a person from an image while keeping the original background. Use Wan - video character swap.
Dance replacement: Replaces the dancer in a video with a person from an image. For best quality, use Wan - image-to-action and Wan - video character swap. If budget is limited, use AnimateAnyone.
Video lip replacement: This feature replaces the lip movements in an existing video to match new audio. Use VideoRetalk.
Emoji creation: This feature creates emojis using fixed-style templates. Use Emoji.
Video redrawing: To use fixed-style templates, use Video style transform. To describe styles freely using prompts, use Wan - general video editing.
Video editing: For all the following tasks, use Wan - general video editing.
Local video editing: Replace elements such as subjects or clothing, or remove bystanders.
Video extension: Extend short videos, for example, from 1 second to 5 seconds.
Video frame expansion: Convert landscape videos to portrait mode or fill in missing borders.
Multi-image reference generation: Fuse background and subject images to create a video.
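The selection guidance above can be sketched as a lookup table plus two small decision helpers. This is only an illustration: the task keys and function names are hypothetical, and the returned strings are the model family names used in this document, not API model IDs.

```python
# Task -> recommended model family, following the selection guidance above.
# The task keys are illustrative names invented for this sketch.
MODEL_FOR_TASK = {
    "text_to_video": "Wan - text-to-video",
    "image_to_video_first_frame": "Wan - image-to-video - first frame",
    "image_to_video_first_and_last_frames": "Wan - image-to-video - first and last frames",
    "reference_to_video": "Wan - reference-to-video",
    "video_motion_transfer": "Wan - image-to-action",
    "video_character_swapping": "Wan - video character swap",
    "video_lip_replacement": "VideoRetalk",
    "emoji_creation": "Emoji",
    "video_editing": "Wan - general video editing",
}


def pick_lip_sync_model(duration_s: float) -> str:
    """Pick a digital human lip-syncing model by target duration.

    The guidance recommends LivePortrait for videos longer than 20 seconds
    with simple head movements, and Wan - digital human otherwise.
    """
    if duration_s > 20:
        return "LivePortrait"
    return "Wan - digital human"


def pick_dance_models(budget_limited: bool) -> list[str]:
    """Pick the model(s) for dance replacement."""
    if budget_limited:
        return ["AnimateAnyone"]
    return ["Wan - image-to-action", "Wan - video character swap"]
```

A call such as `pick_lip_sync_model(45)` selects LivePortrait, which matches the guidance for long narration-style videos.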
Supported models
Wan - Text-to-video
Generates videos from text prompts. It supports text and audio input to create cinematic, multi-shot videos.
API reference | Model pricing | Try online: Singapore, Virginia, Beijing
Global
In the Global deployment mode, endpoint and data storage are located in the US (Virginia) region or Germany (Frankfurt) region, and model inference computing resources are dynamically scheduled globally.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-t2v |
Video with audio Multi-shot narrative, audio-video sync |
Text, audio |
Resolution options: 720P, 1080P Video duration: 5s, 10s, 15s Defined specifications: 30 fps, MP4 (H.264 encoding) |
International
In the International deployment mode, endpoint and data storage are located in the Singapore region, and model inference computing resources are dynamically scheduled worldwide, excluding the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-t2v |
Video with audio Multi-shot narrative, audio-video sync |
Text, audio |
Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.5-t2v-preview |
Video with audio Audio-video sync |
Text, audio |
Resolution options: 480P, 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.2-t2v-plus |
Video without audio Improved overall stability and a higher success rate compared to Model 2.1. |
Text |
Resolution options: 480P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.1-t2v-turbo |
Video without audio |
Text |
Resolution options: 480P, 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.1-t2v-plus |
Video without audio |
Text |
Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
US
In the US deployment mode, endpoint and data storage are located in the US (Virginia) region, and model inference computing resources are restricted to the US.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-t2v-us |
Video with audio Multi-shot narrative, audio-video sync |
Text, audio |
Resolution options: 720P, 1080P Video duration: 5s, 10s, 15s Defined specifications: 30 fps, MP4 (H.264 encoding) |
Chinese Mainland
In the Chinese Mainland deployment mode, endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-t2v |
Video with audio Multi-shot narrative, audio-video sync |
Text, audio |
Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.5-t2v-preview |
Video with audio Audio-video sync |
Text, audio |
Resolution options: 480P, 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.2-t2v-plus |
Video without audio Improved overall stability and a higher success rate compared to Model 2.1. |
Text |
Resolution options: 480P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wanx2.1-t2v-turbo |
Video without audio |
Text |
Resolution options: 480P, 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wanx2.1-t2v-plus |
Video without audio |
Text |
Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
Input prompt |
Output video (wan2.6, multi-shot video) |
|
Shot from a low angle, in a medium close-up, with warm tones, mixed lighting (the practical light from the desk lamp blends with the overcast light from the window), side lighting, and a central composition. In a classic detective office, wooden bookshelves are filled with old case files and ashtrays. A green desk lamp illuminates a case file spread out in the center of the desk. A fox, wearing a dark brown trench coat and a light gray fedora, sits in a leather chair, its fur crimson, its tail resting lightly on the edge, its fingers slowly turning yellowed pages. Outside, a steady drizzle falls beneath a blue sky, streaking the glass with meandering streaks. It slowly raises its head, its ears twitching slightly, its amber eyes gazing directly at the camera, its mouth clearly moving as it speaks in a smooth, cynical voice: 'The case was cold, colder than a fish in winter. But every chicken has its secrets, and I, for one, intended to find them.' |
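The text-to-video tables differ by deployment mode in two ways: the allowed durations for `wan2.6-t2v`, and the model IDs (a `-us` suffix in the US mode, a `wanx` prefix for the 2.1 models in the Chinese Mainland mode). A minimal sketch of a client-side duration check follows. The dictionary layout, mode keys, and function name are assumptions made for illustration; only the model IDs and duration values are taken from the tables.

```python
# Allowed output durations in seconds, keyed by (deployment mode, model ID).
T2V_DURATIONS = {
    ("global", "wan2.6-t2v"): {5, 10, 15},
    ("us", "wan2.6-t2v-us"): {5, 10, 15},
}

# International and Chinese Mainland list the same specs; only the 2.1 model
# IDs differ ("wan2.1-..." versus "wanx2.1-...").
for mode, prefix in (("international", "wan"), ("chinese_mainland", "wanx")):
    T2V_DURATIONS[(mode, "wan2.6-t2v")] = set(range(2, 16))  # [2s, 15s], integer
    T2V_DURATIONS[(mode, "wan2.5-t2v-preview")] = {5, 10}
    T2V_DURATIONS[(mode, "wan2.2-t2v-plus")] = {5}
    T2V_DURATIONS[(mode, f"{prefix}2.1-t2v-turbo")] = {5}
    T2V_DURATIONS[(mode, f"{prefix}2.1-t2v-plus")] = {5}


def is_valid_t2v_duration(mode: str, model: str, seconds: int) -> bool:
    """Check a requested duration against the tables above.

    Raises KeyError if the model is not listed for the deployment mode.
    """
    allowed = T2V_DURATIONS.get((mode, model))
    if allowed is None:
        raise KeyError(f"{model!r} is not listed for deployment mode {mode!r}")
    return seconds in allowed
```

With this table, a 7-second request is accepted for `wan2.6-t2v` in the International mode but rejected in the Global mode, where only 5, 10, and 15 seconds are listed.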
Wan - image-to-video - first frame
Generates a video from a specified first-frame image. This model accepts text, a first-frame image, and audio as input to generate cinematic, multi-shot videos.
API reference | Model pricing | Try online: Singapore, Virginia, Beijing
Global
In the Global deployment mode, endpoint and data storage are located in the US (Virginia) region or Germany (Frankfurt) region, and model inference computing resources are dynamically scheduled globally.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-i2v |
Video with audio Multi-shot narrative, audio-video sync |
Text, image, audio |
Resolution options: 720P, 1080P Video duration: 5s, 10s, 15s Defined specifications: 30 fps, MP4 (H.264 encoding) |
International
In the International deployment mode, endpoint and data storage are located in the Singapore region, and model inference computing resources are dynamically scheduled worldwide, excluding the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-i2v-flash |
Video with audio, video without audio Multi-shot narrative, audio-video sync |
Text, image, audio |
Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.6-i2v |
Video with audio Multi-shot narrative, audio-video sync |
Text, image, audio |
Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.5-i2v-preview |
Video with audio Audio-video sync |
Text, image, audio |
Resolution options: 480P, 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.2-i2v-flash |
Video without audio 50% faster than model 2.1 |
Text, image |
Resolution options: 480P, 720P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.2-i2v-plus |
Video without audio Improved overall stability and a higher success rate compared to Model 2.1. |
Text, image |
Resolution options: 480P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.1-i2v-plus |
Video without audio |
Text, image |
Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.1-i2v-turbo |
Video without audio |
Text, image |
Resolution options: 480P, 720P Video duration: 3s, 4s, 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
US
In the US deployment mode, endpoint and data storage are located in the US (Virginia) region, and model inference computing resources are restricted to the US.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-i2v-us |
Video with audio Multi-shot narrative, audio-video sync |
Text, image, audio |
Resolution options: 720P, 1080P Video duration: 5s, 10s, 15s Defined specifications: 30 fps, MP4 (H.264 encoding) |
Chinese Mainland
In the Chinese Mainland deployment mode, endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-i2v-flash |
Video with audio, video without audio Multi-shot narrative, audio-video sync |
Text, image, audio |
Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.6-i2v |
Video with audio Multi-shot narrative, audio-video sync |
Text, image, audio |
Resolution options: 720P, 1080P Video duration: [2s, 15s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.5-i2v-preview |
Video with audio Audio-video sync |
Text, image, audio |
Resolution options: 480P, 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.2-i2v-flash |
Video without audio 50% faster than model 2.1 |
Text, image |
Resolution options: 480P, 720P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.2-i2v-plus |
Video without audio Improved overall stability and a higher success rate compared to Model 2.1. |
Text, image |
Resolution options: 480P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wanx2.1-i2v-plus |
Video without audio |
Text, image |
Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wanx2.1-i2v-turbo |
Video without audio |
Text, image |
Resolution options: 480P, 720P Video duration: 3s, 4s, 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
Input prompt |
Input first frame image and audio |
Output video (wan2.6, multi-shot video) |
|
A scene of urban fantasy art. A dynamic graffiti art character. A boy made of spray paint comes to life from a concrete wall. He raps an English song at high speed while striking a classic, energetic rapper pose. The scene is set under an urban railway bridge at night. The light comes from a single street lamp, creating a cinematic atmosphere with high energy and amazing detail. The audio of the video consists entirely of his rap, with no other dialogue or noise. |
Input audio: |
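The image-to-video tables can also be read as an availability matrix. The sketch below records which model IDs are listed under each deployment mode and which ones list audio among their input modalities. The data structure and function names are hypothetical; the model IDs and modalities are copied from the tables above.

```python
# Model IDs listed under each deployment mode in the tables above.
I2V_MODELS = {
    "global": ["wan2.6-i2v"],
    "international": [
        "wan2.6-i2v-flash", "wan2.6-i2v", "wan2.5-i2v-preview",
        "wan2.2-i2v-flash", "wan2.2-i2v-plus",
        "wan2.1-i2v-plus", "wan2.1-i2v-turbo",
    ],
    "us": ["wan2.6-i2v-us"],
    "chinese_mainland": [
        "wan2.6-i2v-flash", "wan2.6-i2v", "wan2.5-i2v-preview",
        "wan2.2-i2v-flash", "wan2.2-i2v-plus",
        "wanx2.1-i2v-plus", "wanx2.1-i2v-turbo",
    ],
}

# In the tables, only the 2.6 and 2.5 models list audio as an input modality.
AUDIO_INPUT_PREFIXES = ("wan2.6-", "wan2.5-")


def accepts_audio(model: str) -> bool:
    """Return True if the tables list audio as an input for this model ID."""
    return model.startswith(AUDIO_INPUT_PREFIXES)


def modes_offering(model: str) -> list[str]:
    """Return the deployment modes whose table lists this exact model ID."""
    return sorted(mode for mode, models in I2V_MODELS.items() if model in models)
```

Note that the lookup is by exact ID, so `wan2.6-i2v-us` and `wanx2.1-i2v-turbo` each resolve to a single deployment mode.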
Wan - image-to-video - first and last frames
Generates a video that smoothly transitions between specified first and last frame images. This model accepts text, first and last frame images, and audio as input to generate cinematic, multi-shot videos.
API reference | Model pricing | Try online
International
In the International deployment mode, endpoint and data storage are located in the Singapore region, and model inference computing resources are dynamically scheduled worldwide, excluding the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.2-kf2v-flash |
Video without audio Overall stability and success rate have improved compared to model 2.1. |
Text, image |
Resolution options: 480P, 720P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.1-kf2v-plus |
Video without audio |
Text, image |
Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
Chinese Mainland
In the Chinese Mainland deployment mode, endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.2-kf2v-flash |
Video without audio Overall stability and success rate have improved compared to model 2.1. |
Text, image |
Resolution options: 480P, 720P, 1080P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wanx2.1-kf2v-plus |
Video without audio |
Text, image |
Resolution options: 720P Video duration: 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
Input first frame image |
Input last frame image |
Input prompt |
Output video |
|
|
|
Photorealistic style. A small black cat looks up at the sky curiously. The camera starts at eye level, gradually rises, and ends with a top-down shot of the cat's curious gaze. |
Wan - reference-to-video
Makes characters from a reference video perform new actions. You provide a video and a text prompt, and the model generates an output video that maintains character consistency.
Global
In the Global deployment mode, endpoint and data storage are located in the US (Virginia) region or Germany (Frankfurt) region, and model inference computing resources are dynamically scheduled globally.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-r2v |
Video with audio Single or multi character to video Multi-shot narrative, audio-video sync |
Text, video |
Resolution options: 720P, 1080P Video duration: 5s, 10s Defined specifications: 30 fps, MP4 (H.264 encoding) |
International
In the International deployment mode, endpoint and data storage are located in the Singapore region, and model inference computing resources are dynamically scheduled worldwide, excluding the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-r2v-flash |
Video with or without audio Single or multi character to video Multi-shot narrative, audio-video sync Fast and cost-effective |
Text, image, video |
Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.6-r2v |
Video with audio Multi-role reference-to-video Multi-shot narrative, audio-video sync |
Text, image, video |
Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
Chinese Mainland
In the Chinese Mainland deployment mode, endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.6-r2v-flash |
Video with or without audio Single or multi character to video Multi-shot narrative, audio-video sync Fast and cost-effective |
Text, image, video |
Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
wan2.6-r2v |
Video with audio Single or multi character to video Multi-shot narrative, audio-video sync |
Text, image, video |
Resolution options: 720P, 1080P Video duration: [2s, 10s] (integer) Defined specifications: 30 fps, MP4 (H.264 encoding) |
|
Input reference video 1 (role: little girl) |
Input reference video 2 (role: alarm clock) |
Input prompt |
Output video (multi-role dialogue) |
|
character1 says to character2: "I’ll rely on you tomorrow morning!" character2 replies: "You can count on me!" |
Wan - general-purpose video editing
A general-purpose video editing model that supports multimodal inputs, including text, images, and videos, for various video generation and editing tasks.
International
In the International deployment mode, endpoint and data storage are located in the Singapore region, and model inference computing resources are dynamically scheduled worldwide, excluding the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.1-vace-plus |
Video without audio Multi-image reference, video redraw, local editing, video extension, video frame expansion |
Text, image, video |
Resolution options: 720P Video duration: Up to 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
Chinese Mainland
In the Chinese Mainland deployment mode, endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wanx2.1-vace-plus |
Video without audio Multi-image reference, video redraw, local editing, video extension, video frame expansion |
Text, image, video |
Resolution options: 720P Video duration: Up to 5s Defined specifications: 30 fps, MP4 (H.264 encoding) |
Feature 1: Multi-image reference
Reference 1 (entity)
Reference 2 (Background)
Input prompt
Output video


In the video, a girl walks gracefully from the depths of an ancient, misty forest. Her steps are light, and the camera captures her every nimble movement. When she stops and looks around at the lush woods, a smile of surprise and joy blossoms on her face. This scene records her beautiful encounter with nature.
Feature 2: Video redraw
Input video
Input prompt
Output video
The video shows a black steampunk-style car driven by a gentleman, decorated with gears and copper pipes. The background is a steam-powered candy factory with retro elements, creating a vintage and fun scene.
Feature 3: Local video editing
Input video
Input mask image (white area indicates the editing area)
Input prompt
Output video

The video shows a Parisian-style French cafe. A lion in a suit elegantly sips coffee. It holds a coffee cup and drinks with a look of contentment. The cafe is tastefully decorated. Soft tones and warm light illuminate the lion.
Feature 4: Video extension
Input first video segment (1 second)
Input prompt
Output video (extended video is 5 seconds)
A dog wearing sunglasses skateboards on a street. 3D cartoon style.
Feature 5: Video frame expansion
Input video
Input prompt
Output video
An elegant woman passionately plays the violin. A full symphony orchestra is behind her.
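The five feature examples above each use a different combination of inputs. As a rough sketch, that mapping can be written down and used to check a request before it is sent. The feature keys, input names, and the `missing_inputs` helper are hypothetical labels chosen for this illustration; they are not parameter names from the API reference.

```python
# Inputs shown for each feature in the examples above.
# Keys and input labels are illustrative, not API parameter names.
VACE_FEATURE_INPUTS = {
    "multi_image_reference": {"reference_images", "prompt"},
    "video_redraw": {"video", "prompt"},
    "local_editing": {"video", "mask_image", "prompt"},
    "video_extension": {"first_video_segment", "prompt"},
    "video_frame_expansion": {"video", "prompt"},
}


def missing_inputs(feature: str, provided: set[str]) -> set[str]:
    """Return the inputs the feature example uses that were not provided."""
    return VACE_FEATURE_INPUTS[feature] - provided
```

For instance, `missing_inputs("local_editing", {"video", "prompt"})` reports that the mask image is still needed, since local editing is the only feature whose example includes a mask.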
Wan - digital human
Only the Chinese Mainland deployment mode is supported. Endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to Chinese Mainland.
Digital human lip-syncing animates a person or cartoon character in an image to speak, sing, narrate, or perform. You provide an image and an audio file, and the model automatically generates a video with synchronized lip movements, facial expressions, and head and body motions.
Image detection API reference | Video generation API reference | Model pricing
|
Model |
Features |
Input modalities |
Output details |
|
wan2.2-s2v-detect |
Image detection |
Image |
Output detection status: Pass or Fail |
|
wan2.2-s2v |
Video generation Video with audio |
Image, audio |
Resolution options: 480P, 720P Video duration: Up to 20s (matches audio duration) Defined specifications:
|
|
Input example (person image + audio) |
Output video (lip-sync) |
|
Input audio: |
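The two models in the table suggest a two-step flow: check the image with `wan2.2-s2v-detect`, then generate the video with `wan2.2-s2v`. The sketch below assumes that detection is run before generation and that a failed detection should stop the flow; the `detect` and `generate` callables are hypothetical stand-ins for whatever client code actually calls the two APIs.

```python
from typing import Callable


def lip_sync_video(
    image_url: str,
    audio_url: str,
    detect: Callable[[str, str], bool],
    generate: Callable[[str, str, str], str],
) -> str:
    """Run image detection, then video generation.

    `detect(model, image_url)` should return True for a Pass result.
    `generate(model, image_url, audio_url)` should return the video URL.
    Both callables are placeholders supplied by the caller.
    """
    if not detect("wan2.2-s2v-detect", image_url):
        raise ValueError("image did not pass detection")
    return generate("wan2.2-s2v", image_url, audio_url)
```

Passing the API calls in as functions keeps the sketch independent of any particular SDK or HTTP client and makes the flow easy to test with fakes.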
Wan - image-to-action
Animates a person from an image using motion from a reference video. You provide an image and a video, and the model generates a video that applies the motion from the reference video to the person, while keeping the background of the original image static.
International
In the International deployment mode, endpoint and data storage are located in the Singapore region, and model inference computing resources are dynamically scheduled worldwide, excluding the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.2-animate-move |
Video with or without audio (depends on the input video)
|
Image, video |
Resolution options: 720P Video duration: 2s < duration < 30s Defined specifications:
|
Chinese Mainland
In the Chinese Mainland deployment mode, endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.2-animate-move |
Video with or without audio (depends on the input video)
|
Image, video |
Resolution options: 720P Video duration: 2s < duration < 30s Specifications:
|
|
Input person image |
Input reference video |
Output video (Standard mode) |
Output video (Professional mode) |
|
|
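The duration limit for `wan2.2-animate-move` is written as a strict inequality (2s < duration < 30s), so exactly 2 or 30 seconds falls outside the range. A minimal sketch of that check, with a hypothetical function name and the single 720P resolution option from the table:

```python
# Resolution options listed for wan2.2-animate-move in the table above.
ANIMATE_MOVE_RESOLUTIONS = {"720P"}


def animate_move_request_ok(resolution: str, duration_s: float) -> bool:
    """Check resolution and duration against the listed specifications.

    Both duration bounds are exclusive, as written in the table.
    """
    return resolution in ANIMATE_MOVE_RESOLUTIONS and 2 < duration_s < 30
```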
Wan - video character swap
Replaces a character in a video with one from a reference image. You provide a source video and a reference image, and the model generates an output video that retains the original background. This feature is ideal for use cases like face swapping and full character replacement.
International
In the International deployment mode, endpoint and data storage are located in the Singapore region, and model inference computing resources are dynamically scheduled worldwide, excluding the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.2-animate-mix |
Video with or without audio (depends on the input video)
|
Image, video |
Resolution options: 720P Video duration: 2s < duration < 30s Defined specifications:
|
Chinese Mainland
In the Chinese Mainland deployment mode, endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to the Chinese Mainland.
|
Model |
Features |
Input modalities |
Output video specifications |
|
wan2.2-animate-mix |
Video with or without audio (depends on the input video)
|
Image, video |
Resolution options: 720P Video duration: 2s < duration < 30s Defined specifications:
|
|
Input video |
Input image of the person to swap |
Output video (Standard mode) |
Output video (Professional mode) |
|
|
AnimateAnyone
Only the Chinese Mainland deployment mode is supported. Endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to Chinese Mainland.
We recommend using Wan - image-to-action and Wan - video character swap instead of AnimateAnyone. These models offer better quality, while AnimateAnyone is a more cost-effective option.
Designed specifically for dancing, this model replaces the dancer in a video with a person from an image. You provide an image and a video to generate an output video in one of two ways: 1. Retain the image background. 2. Retain the video background.
Image detection API reference | Action template generation API reference | Video generation API reference | Model pricing
|
Model |
Features |
Input modalities |
Output details |
|
animate-anyone-detect-gen2 |
Image detection |
Image |
Output detection status: Pass or Fail |
|
animate-anyone-template-gen2 |
Dance video template generation Extracts an action template from a dance video. |
Video |
Outputs the dance action template ID |
|
animate-anyone-gen2 |
Video generation Video without audio |
Image, video, dance action template ID |
Video resolution options: 720P Video duration: 2s ≤ duration ≤ 60s Defined specifications: 15 fps, MP4 (H.264 encoding) |
|
Input person image |
Input dance video |
Output video (generated with image background) |
Output video (generated with video background) |
|
|
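The three AnimateAnyone models chain together: image detection, then template extraction from the dance video, then video generation from the image and the template ID. The sketch below illustrates that ordering under the assumption that a failed detection stops the flow. The three callables and the `keep_image_background` flag are hypothetical placeholders for real API calls and parameters.

```python
from typing import Callable


def animate_anyone_pipeline(
    image_url: str,
    dance_video_url: str,
    detect: Callable[[str, str], bool],
    make_template: Callable[[str, str], str],
    generate: Callable[[str, str, str, bool], str],
    keep_image_background: bool = True,
) -> str:
    """Chain detection, template generation, and video generation.

    The callables are caller-supplied stand-ins for the three APIs; each
    receives the model ID from the table as its first argument.
    """
    if not detect("animate-anyone-detect-gen2", image_url):
        raise ValueError("image did not pass detection")
    # The template model outputs a dance action template ID.
    template_id = make_template("animate-anyone-template-gen2", dance_video_url)
    # The flag selects between the image background and the video background.
    return generate(
        "animate-anyone-gen2", image_url, template_id, keep_image_background
    )
```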
EMO
Only the Chinese Mainland deployment mode is supported. Endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to Chinese Mainland.
Consider using Wan - digital human as an alternative to EMO. Wan - digital human provides better results, while EMO is a more cost-effective option.
Generates singing and performance videos from an image. You provide an image and an audio file, and the model automatically generates a video with synchronized lip movements, facial expressions, and head motions.
Image detection API reference | Video generation API reference | Model pricing
|
Model |
Features |
Input modalities |
Output details |
|
emo-detect-v1 |
Image detection |
Image |
Output detection status: Pass or Fail |
|
emo-v1 |
Video generation Video with audio |
Image, audio |
Video resolution:
Video duration: Up to 60s Defined specifications: 15 fps, MP4 (H.264 encoding) |
|
Input example (portrait image + audio) |
Output video (singing lip-sync) |
|
Input audio: |
LivePortrait
Only the Chinese Mainland deployment mode is supported. Endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to Chinese Mainland.
Consider using Wan - digital human as an alternative to LivePortrait. Wan - digital human delivers higher quality results, while LivePortrait is a more cost-effective option. Note that LivePortrait is suitable for generating long videos (over 20 seconds).
Generates narration videos from an image by animating the person in the image to deliver news or tell stories. You provide an image and an audio file, and the model automatically generates a video with synchronized lip movements, facial expressions, and slight head motions.
Image detection API reference | Video generation API reference | Model pricing
|
Model |
Features |
Input modalities |
Output details |
|
liveportrait-detect |
Image detection |
Image |
Output detection status: Pass or Fail |
|
liveportrait |
Video generation Video with audio |
Image, audio |
Video resolution: Matches the input image, up to nearly 4K (4096×4096) Video duration: 1s < duration < 180s Video frame rate: 15 fps ≤ frame rate ≤ 30 fps Video format: MP4 (H.264 encoding) |
|
Input example (portrait image + audio) |
Output video (voice-over lip-sync) |
|
Input audio: |
Emoji
Only the Chinese Mainland deployment mode is supported. Endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to Chinese Mainland.
Creates emojis using fixed emoji templates. You provide an image and an emoji template ID to generate an emoji video.
Image detection API reference | Video generation API reference | Model pricing
|
Model |
Features |
Input modalities |
Output details |
|
emoji-detect-v1 |
Image detection |
Image |
Output detection status: Pass or Fail |
|
emoji-v1 |
Video generation Video without audio |
Image, emoji template ID |
Video resolution: Fixed at 512×512 Video duration: Up to 5s (matches template duration) Defined specifications: 15 fps, MP4 (H.264 encoding) |
|
Input portrait image |
Output video ("disgusted" emoji) |
|
|
VideoRetalk
Only the Chinese Mainland deployment mode is supported. Endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to Chinese Mainland.
Lip sync: Replaces the lip movements in a video to match a new audio track. You provide a video and an audio file, and the model generates an output video with synchronized lip movements.
|
Model |
Features |
Input modalities |
Output video specifications |
|
videoretalk |
Video with audio |
Video, audio |
Video resolution: Matches the input video, up to nearly 2K (2048×2048) Video duration: 2s < duration < 120s Video frame rate: 15 fps ≤ frame rate ≤ 60 fps Video format: MP4 (H.264 encoding) |
|
Input example (person speaking video + audio) |
Output video (lip-sync replacement) |
|
Input audio: |
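Because the VideoRetalk output matches the input video, the listed limits can serve as a pre-flight check on the source clip. The sketch below is one possible reading of the table: it treats "up to 2048×2048" as a per-side limit, which is an assumption, and the function name is hypothetical.

```python
def videoretalk_violations(
    width: int, height: int, duration_s: float, fps: float
) -> list[str]:
    """Return a list of ways a clip falls outside the listed specifications."""
    problems = []
    # Assumption: the 2048x2048 figure is applied to each side independently.
    if width > 2048 or height > 2048:
        problems.append("resolution exceeds 2048x2048")
    # Duration bounds are exclusive: 2s < duration < 120s.
    if not 2 < duration_s < 120:
        problems.append("duration must be between 2s and 120s (exclusive)")
    # Frame rate bounds are inclusive: 15 fps <= frame rate <= 60 fps.
    if not 15 <= fps <= 60:
        problems.append("frame rate must be between 15 and 60 fps")
    return problems
```

An empty list means the clip fits all three limits; otherwise each entry names one limit that was exceeded.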
Video style transform
Only the Chinese Mainland deployment mode is supported. Endpoint and data storage are located in the Beijing region, and model inference computing resources are restricted to Chinese Mainland.
Applies a new artistic style to a video based on a predefined style template. You provide a video and a style transfer ID to generate a restyled video.
|
Model |
Features |
Input modalities |
Output video specifications |
|
video-style-transform |
Video with or without audio Depends on the input video. |
Video, redraw style ID |
Video resolution: Matches the input video, up to nearly 4K (4096×4096) Video duration: Up to 30s Video frame rate: 15 fps ≤ frame rate ≤ 25 fps Video format: MP4 (H.264 encoding) |
|
Input video |
Output video (style transfer option: "Japanese manga") |








