When you generate videos with Wan, if prompt optimization or the official video effects cannot produce the specific actions, effects, or styles that you want, you can fine-tune the model.
Availability
Region: This document applies only to the Singapore region. You must use an API key from this region.
Model: Image-to-video - first frame: wan2.5-i2v-preview.
Method: SFT-LoRA efficient fine-tuning.
How to fine-tune a model
This topic provides an example of how to train a "money rain effect" LoRA model. The goal is to create a model that automatically generates a video with a "money rain effect" from an input image, without a prompt.
The following comparison uses the same input first frame image:
Output video (before fine-tuning): The base model cannot consistently generate a "money rain" effect with specific motion from a prompt, and the motion is uncontrollable.
Output video (after fine-tuning): The fine-tuned model consistently reproduces the specific "money rain" effect from the training dataset without a prompt.
Before you run the following code, you must obtain an API key and configure it as an environment variable.
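The sample requests in this topic read the key from the DASHSCOPE_API_KEY environment variable. A minimal setup in a Linux or macOS shell (replace the placeholder with your own key):
# API key from the Singapore region (placeholder value)
export DASHSCOPE_API_KEY="sk-xxx"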
Step 1: Upload the dataset
Upload your local dataset in .zip format to Alibaba Cloud Model Studio and retrieve the file ID (id).
Sample training set: wan-i2v-training-dataset.zip. For more information about the dataset format, see Training set.
Sample request
This example uploads only a training dataset. The system automatically splits a portion of the training dataset to create a validation set. The dataset upload takes several minutes, depending on the file size.
curl --location --request POST 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/files' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--form 'file=@"./wan-i2v-training-dataset.zip"' \
--form 'purpose="fine-tune"'
Sample response
Save the id. This is the unique identifier for the uploaded dataset.
{
"id": "file-ft-b2416bacc4d742xxxx",
"object": "file",
"bytes": 73310369,
"filename": "wan-i2v-training-dataset.zip",
"purpose": "fine-tune",
"status": "processed",
"created_at": 1766127125
}
Step 2: Fine-tune the model
Step 2.1: Create a fine-tuning job
Start the training job using the file ID from Step 1.
For more information about hyperparameter settings for fine-tuning, see Hyperparameters.
Sample request
Replace <file_id_of_your_training_set> with the id that you retrieved in the previous step.
Wan2.5 model
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model":"wan2.5-i2v-preview",
"training_file_ids":[
"<file_id_of_your_training_set>"
],
"training_type":"efficient_sft",
"hyper_parameters":{
"n_epochs":400,
"batch_size":2,
"learning_rate":2e-5,
"split":0.9,
"eval_epochs": 20,
"max_pixels": 36864
}
}'
Sample response
Note the following three key parameters in the output object:
job_id: The job ID, which is used to query the job progress.
finetuned_output: The name of the new fine-tuned model. You must use this name for subsequent deployment and invocation.
status: The model training status. After you create a fine-tuning job, the initial status is PENDING, which indicates that the training is waiting to start.
{
...
"output": {
"job_id": "ft-202511111122-xxxx",
"status": "PENDING",
"finetuned_output": "wan2.5-i2v-preview-ft-202511111122-xxxx",
...
}
}
Step 2.2: Query the fine-tuning job status
Query the job progress using the job_id obtained in Step 2.1. Poll this API until the status changes to SUCCEEDED.
The fine-tuning job in this example takes several hours to complete, depending on the model. Wait for the job to finish.
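Because the job can run for hours, you may prefer to poll the status from a script instead of rerunning the request by hand. The loop below is a minimal sketch that uses the same query endpoint; it assumes the jq command-line tool is installed and that JOB_ID holds the job ID from Step 2.1.
JOB_ID="ft-202511111122-xxxx"   # replace with your own job_id
while true; do
  STATUS=$(curl -s "https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/${JOB_ID}" \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" | jq -r '.output.status')
  echo "$(date) status: ${STATUS}"
  # SUCCEEDED means training is finished; also stop manually if the API reports a failure state
  [ "$STATUS" = "SUCCEEDED" ] && break
  sleep 300   # poll every 5 minutes
done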
Sample request
Replace <your_fine-tuning_job_id> in the URL with the value of job_id.
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<your_fine-tuning_job_id>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json'
Sample response
Note the two parameters in the output field:
status: When the status changes to SUCCEEDED, the model has finished training and is ready for deployment.
usage: The total number of tokens consumed for model training, which is used for billing.
{
...
"output": {
"job_id": "ft-202511111122-xxxx",
"status": "SUCCEEDED",
"usage": 432000,
...
}
}
Step 3: Deploy the fine-tuned model
Step 3.1: Deploy the model as an online service
After the fine-tuning job status changes to SUCCEEDED, deploy the model as an online service.
Sample request
Replace <your_model_name> with the value of the finetuned_output parameter from the response of the Create a fine-tuning job operation.
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model_name": "<your_model_name>",
"aigc_config": {
"prompt": "Provide a video description based on the image content. The description must include \"Then, the s86b5p money rain effect begins. Countless large USD bills (beige with dark green patterns) pour down like a storm, densely hitting and surrounding him/her. The bills continue to fall as the camera slowly zooms in. He/She stretches his/her arms upward, neck slightly tilted back, with a surprised expression, fully immersed in this wild money rain\".\nOutput template:\nThe video begins by showing [entity description] [background description]. Then, the s86b5p money rain effect begins. Countless large USD bills (beige with dark green patterns) pour down like a storm, densely hitting and surrounding him/her. The bills continue to fall as the camera slowly zooms in. He/She stretches his/her arms upward, neck slightly tilted back, with a surprised expression, fully immersed in this wild money rain.\nExample:\nThe video begins by showing a young woman on a beach. Her hair is wet, dark brown, curly, and slightly messy. She has a bright smile on her face. The background shows crashing waves and distant mountains. Then, the s86b5p money rain effect begins. Countless large USD bills (beige with dark green patterns) pour down like a storm, densely hitting and surrounding her. The bills continue to fall as the camera slowly zooms in. She stretches her arms upward, neck slightly tilted back, with a surprised expression, fully immersed in this wild money rain.",
"lora_prompt_default": "Then, the s86b5p money rain effect begins. Countless large USD bills (beige with dark green patterns) pour down like a storm, densely hitting and surrounding the main character. The bills continue to fall as the camera slowly zooms in. The main character stretches their arms upward, neck slightly tilted back, with a surprised expression, fully immersed in this wild money rain."
},
"capacity": 1,
"plan": "lora"
}'
Sample response
Note the two parameters in the output field:
deployed_model: The name of the deployed model, which is used to query the deployment status and call the model.
status: The model deployment status. After you deploy a fine-tuned model, the initial status is PENDING, which indicates that the deployment has not started.
{
...
"output": {
"deployed_model": "wan2.5-i2v-preview-ft-202511111122-xxxx",
"status": "PENDING",
...
}
}
Step 3.2: Query the deployment status
Query the deployment status. Poll this API until the status changes to RUNNING.
The deployment process for the fine-tuned model in this example is expected to take 5 to 10 minutes.
Sample request
Replace <your_deployed_model_id> with the value of the deployed_model response parameter from Step 3.1.
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments/<your_deployed_model_id>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json'
Sample response
Note the two parameters in the output field:
status: When the status changes to RUNNING, the model is successfully deployed and ready to be called.
deployed_model: The name of the deployed model.
{
...
"output": {
"status": "RUNNING",
"deployed_model": "wan2.5-i2v-preview-ft-202511111122-xxxx",
...
}
}
Step 4: Call the model to generate a video
After the model is successfully deployed (that is, the deployment status is RUNNING), you can call it.
Expected result: If you input an image, the model automatically generates a video with a "money rain effect" without requiring a prompt.
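The request is the same as a normal call to the base image-to-video model, except that you pass the deployed model name (the deployed_model value from Step 3). The sketch below assumes the standard DashScope asynchronous image-to-video endpoint and field names and uses a placeholder image URL; see the API reference for the authoritative request format. Because the LoRA carries a default effect prompt, the prompt field is omitted here.
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/video-generation/video-synthesis' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-Async: enable' \
--data '{
    "model": "wan2.5-i2v-preview-ft-202511111122-xxxx",
    "input": {
        "img_url": "https://example.com/your-first-frame.jpeg"
    }
}'
The response returns a task ID that you then poll for the finished video, for example through the task query API (GET https://dashscope-intl.aliyuncs.com/api/v1/tasks/<task_id>).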
Build a custom dataset
In addition to using the sample data in this topic, you can build your own dataset for fine-tuning.
The dataset must contain a training set (required) and a validation set (optional). The validation set can be automatically split from the training set. You must package all files into a .zip file. Use only English letters, numbers, underscores, or hyphens in the filename.
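For example, if the annotation file and media files are in the current directory, you can package them with the zip command (the filenames here match the sample dataset):
# Package the dataset; the archive name may contain only letters, numbers, underscores, or hyphens
zip wan-i2v-training-dataset.zip data.jsonl image_1.jpeg video_1.mp4 image_2.jpeg video_2.mp4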
Dataset format
Training set: Required
The training set for the image-to-video - first frame model includes first-frame images, videos, and an annotation file (data.jsonl).
Sample dataset: wan-i2v-training-dataset.zip.
Zip package directory structure:
wan-i2v-training-dataset.zip
├── data.jsonl (The jsonl filename must be data, maximum size 20 MB)
├── image_1.jpeg (Maximum image resolution 1024×1024, maximum size per image 10 MB, supported formats: BMP, JPEG, PNG, WEBP)
├── video_1.mp4 (Maximum video resolution 512×512, maximum size per video 10 MB, supported formats: MP4, MOV)
├── image_2.jpeg
└── video_2.mp4
Annotation file (data.jsonl): Each line is a JSON object that contains three fields: prompt, image path, and video path. The structure of one line of training data is as follows:
{ "prompt": "The video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The brick wall in the background is covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.", "first_frame_path": "image_1.jpeg", "video_path": "video_1.mp4" }
Validation set: Optional
The validation set for the image-to-video - first frame model includes first-frame images and an annotation file (data.jsonl). You do not need to provide videos. At each evaluation node, the training job automatically calls the model service to generate preview videos using the images and prompts from the validation set.
Sample dataset: wan-i2v-valid-dataset.zip.
Zip package directory structure:
wan-i2v-valid-dataset.zip
├── data.jsonl (The jsonl filename must be data, maximum size 20 MB)
├── image_1.jpeg (Maximum image resolution 1024×1024, maximum size per image 10 MB, supported formats: BMP, JPEG, PNG, WEBP)
└── image_2.jpeg
Annotation file (data.jsonl): Each line is a JSON object that contains two fields: prompt and image path. The structure of one line of data is as follows:
{ "prompt": "The video begins showing a scene of a young man standing in front of a cityscape. He is wearing a black and white checkered jacket over a black hoodie, with a smile on his face and a confident expression. The background is a city skyline at sunset, with a famous domed building and layered roofs visible in the distance, the sky filled with clouds showing warm orange-yellow hues. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.", "first_frame_path": "image_1.jpg" }
Data scale and limits
Data volume: At least 5 data entries. The larger the training data volume, the better. We recommend 20 to 100 entries for stable results.
Zip package: Total size ≤ 2 GB.
Number of files: Up to 200 images and 200 videos.
Format requirements:
Image: Supported formats are BMP, JPEG, PNG, and WEBP. Image resolution ≤ 1024x1024. Single image file size ≤ 10 MB.
Video: Supported formats are MP4 and MOV. Video resolution ≤ 512x512. Single video file size ≤ 10 MB.
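To spot files that exceed these limits before packaging, you can inspect them locally, for example with ffprobe from ffmpeg and find (both assumed to be installed):
# Print width x height for every video; anything above 512x512 must be downscaled
for f in *.mp4 *.mov; do
  [ -e "$f" ] || continue
  printf '%s: ' "$f"
  ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=s=x:p=0 "$f"
done
# List media files larger than 10 MB
find . -maxdepth 1 \( -name '*.mp4' -o -name '*.mov' -o -name '*.jpeg' -o -name '*.png' \) -size +10M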
Data collection and cleaning
1. Identify the fine-tuning scenario
Wan supports fine-tuning for image-to-video scenarios, such as the following:
Fixed video effects: Teach the model a specific visual change, such as a carousel rotation or a magical costume change.
Fixed character actions: Improve the model's ability to reproduce specific body movements, such as particular dance moves or martial arts forms.
Fixed camera movements: Replicate complex camera language, such as push-pull, pan-tilt, or orbiting shots in a fixed template.
2. Get source materials
AI generation and selection: Use the Wan base model to generate videos in batches, and then manually select high-quality samples that best match the target effect. This is the most common method.
Live-action shooting: If your goal is to achieve highly realistic interactive scenes (such as hugs or handshakes), using live-action footage is the best choice.
3D software rendering: For effects or abstract animations that require detailed control, you can use 3D software (such as Blender or C4D) to create the materials.
3. Clean the data
Clean the collected data according to the following requirements:
Dimension | Requirement | Negative example |
Consistency | Core features must be consistent. For example, to train a "360-degree rotation," all videos must rotate clockwise and at a roughly consistent speed. | Inconsistent directions. The dataset contains both clockwise and counter-clockwise rotations. The model cannot determine which direction to learn. |
Diversity | Entities and scenes must be diverse. Include diverse entities (men, women, old, young, cats, dogs, buildings) and compositions (close-ups, long shots, high-angle, low-angle). The resolution and aspect ratio should also be diverse. | Lack of diversity in scenes or entities. All videos are of "a person in red clothes rotating in front of a white wall." The model might mistakenly associate "red clothes" and "white wall" with the rotation effect and fail to apply the effect if the clothes are different. |
Balance | Proportions of different data types must be balanced. If multiple styles are included, the number of samples for each style should be roughly equal. | Imbalanced proportions. 90% are portrait videos, and 10% are landscape videos. The model may perform poorly when generating landscape videos. |
Purity | The image is clear and sharp. Use original materials that do not contain interfering elements. | Presence of interfering elements. The video contains captions, logos, watermarks, obvious black bars, or noise. The model might learn the watermark as part of the effect. |
Duration | Source material duration must be less than or equal to the target duration. If you expect to generate a 5-second video, the source material should be cropped to 4 to 5 seconds. | Source material is too long. If you expect to generate a 5-second video but feed the model an 8-second source, this can lead to incomplete action learning and cause the generated video to feel truncated. |
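For the Duration requirement above, an over-long source clip can be trimmed before packaging, for example with ffmpeg (assumed to be installed; the filenames are placeholders):
# Keep only the first 5 seconds; re-encoding makes the cut frame-accurate
ffmpeg -i source_clip.mp4 -t 5 money_rain_sample_1.mp4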
Video annotation: Write prompts for videos
In the dataset's annotation file (data.jsonl), each video has a corresponding prompt. The prompt is used to describe the visual content of the video, and the quality of the prompt directly determines what the model "learns."
Sample prompt: The video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, is wearing a white sleeveless dress and a shiny silver necklace, and has a smile on her face. The background is a brick wall covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green pattern) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.
Prompt writing formula
Prompt = [Entity description] + [Background description] + [Trigger word] + [Motion description]
Prompt item | Description | Recommendation | Example |
Entity description | Describes the person or object originally present in the frame | Required | The video begins showing a young woman... |
Background description | Describes the environment where the entity is located | Required | The background is a brick wall covered with green vines... |
Trigger word | A rare word with no actual meaning | Recommended | s86b5p or m01aa |
Motion description | Describes in detail the motion changes of the effect in the video | Recommended | countless huge-sized US dollar bills (beige background/dark green pattern) pour down like a torrential rain... |
How to write good prompts
Follow the consistency principle for effect descriptions
For all samples that contain the effect, the motion description part of the effect must be as consistent as possible. Both the training and validation sets must follow this rule.
Purpose: When the model finds that s86b5p is always followed by a fixed description and the frame always shows a money rain, it learns to associate s86b5p with the money rain visual effect.
Example: Whether it is a "young woman" or a "man in a suit," as long as it is the money rain effect, the second half of the prompt is uniformly written as: "...then the s86b5p money rain effect begins, countless US dollar bills pour down like a torrential rain..."
Sample type | Prompt content (note that the effect description stays consistent) |
Training set sample 1 | The video begins showing a young woman standing in front of a brick wall... (environment description omitted)...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green pattern) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, expression surprised, completely immersed in this wild money rain. |
Training set sample 2 | The video begins showing a man in a suit in a high-end restaurant... (environment description omitted)...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green pattern) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall, he stretches his arms upward, expression surprised, completely immersed in this wild money rain. |
Validation set sample 1 | The video begins showing a young child standing in front of a cityscape... (environment description omitted)...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green pattern) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain. |
Generate prompts with AI assistance
To obtain high-quality prompts, you can use multimodal large models such as Qwen-VL to help generate prompts for videos.
Use AI to help generate initial descriptions
Brainstorm (find inspiration): If you do not know how to describe the effect, you can use AI to brainstorm ideas.
Send the prompt "Describe the video content in detail" and observe what the model outputs.
Focus on the vocabulary the model uses to describe the motion trajectory of the effect (such as "pour down like a torrential rain," "camera slowly zooms in"). These words can be used as material for later optimization.
Fixed sentence pattern (standardize output): Once you have a general idea, you can design a fixed sentence pattern based on the annotation formula to guide the AI in generating prompts that conform to the format.
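For example, you can ask a multimodal model to describe the first frame so that you have raw material for the entity and background descriptions. The sketch below targets Qwen-VL through the DashScope multimodal generation API; the endpoint, field names, and image URL are assumptions, so adapt them to the Qwen-VL documentation:
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen-vl-max",
    "input": {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"image": "https://example.com/first_frame.jpeg"},
                    {"text": "Describe the video content in detail. Output template: The video begins by showing [entity description] [background description]."}
                ]
            }
        ]
    }
}'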
Refine the effect template
Repeat the process for multiple samples with the same effect to find common, accurate phrases used to describe the effect. From these, you can extract a universal "effect description."
Copy and paste this standardized effect description into all datasets for that effect.
Keep the unique "entity" and "background" descriptions for each sample, and only replace the "effect description" part with the unified template.
Manual check
AI models may hallucinate or make recognition errors. Therefore, you must perform a final manual check to confirm that the descriptions of the entity and background match the actual scene.
Evaluate the model using a validation set
Specify a validation set
A fine-tuning job must include a training set, while a validation set is optional. You can choose to have the system automatically split the set or manually upload a validation set. The methods are as follows:
Method 1: No validation set uploaded (system auto-split)
When you create a fine-tuning job, if you do not upload a separate validation set (that is, the validation_file_ids parameter is not passed), the system uses the following two hyperparameters to automatically split a portion of the training set for use as a validation set:
split: The training set split ratio. For example, 0.9 means 90% of the data is used for training, and the remaining 10% is used for validation.
max_split_val_dataset_sample: The maximum number of samples for the automatically split validation set.
Validation set splitting rule: The system selects the smaller value between total number of dataset entries × (1 - split) and max_split_val_dataset_sample.
Example: Assume you only upload a training set with 100 data entries, split=0.9 (meaning 10% for the validation set), and max_split_val_dataset_sample=5.
Theoretical split: 100 × 10% = 10 entries.
Actual split: min(10, 5) = 5. So the system takes only 5 entries as the validation set.
Method 2: Manually upload a validation set (by validation_file_ids)
If you want to use your own prepared data to evaluate checkpoints instead of relying on the system's random split, you can upload a custom validation set.
Note: If you choose to upload a validation set manually, the system completely ignores the automatic splitting rules and uses only the data you uploaded for validation.
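The following is a sketch of the Step 2.1 request with a manually uploaded validation set added. It assumes that validation_file_ids sits at the same level as training_file_ids; the split hyperparameter is left out because splitting is ignored in this case.
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "wan2.5-i2v-preview",
    "training_file_ids": ["<file_id_of_your_training_set>"],
    "validation_file_ids": ["<file_id_of_your_validation_set>"],
    "training_type": "efficient_sft",
    "hyper_parameters": {
        "n_epochs": 400,
        "batch_size": 2,
        "learning_rate": 2e-5,
        "eval_epochs": 20,
        "max_pixels": 36864
    }
}'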
Select the best checkpoint for deployment
During the training process, the system periodically saves "snapshots" of the model (that is, checkpoints). By default, the system outputs the last checkpoint as the final fine-tuned model. However, checkpoints produced during the training process may yield better results than the final version. You can select the best one for deployment.
The system evaluates checkpoints against the validation set and generates preview videos at the interval set by the hyperparameter eval_epochs.
How to evaluate: Evaluate the effect by directly observing the generated preview videos.
Selection criteria: Select the checkpoint that produces the best results and does not cause action distortion.
Procedure
Step 1: View the preview effects generated by the checkpoint
Step 2: Export the checkpoint and get the model name for deployment
Step 3: Deploy and call the model
Go live
In a production environment, if the initially trained model performs poorly (such as distorted frames, indistinct effects, or inaccurate actions), you can optimize it in the following ways:
1. Check the data and prompts
Data consistency: Data consistency is key. Check for "bad samples" with opposite directions or vastly different styles.
Sample quantity: Increase high-quality data to more than 20 entries.
Prompt: Ensure that the trigger word is a meaningless rare word (such as s86b5p) to avoid interference from common words (such as running).
2. Adjust hyperparameters: For more information, see Hyperparameters.
n_epochs (number of training epochs)
Default value: 400. We recommend the default value. If you adjust it, follow the principle of "total training steps ≥ 800".
Total steps calculation formula: steps = n_epochs × ceiling(training set size / batch_size).
Therefore, the formula for the minimum value of n_epochs is: n_epochs = 800 / ceiling(training set size / batch_size).
Example: Assume the training set has 5 data entries and you are using the Wan2.5 model (batch_size=2).
Training steps per epoch: 5 / 2 = 2.5, which rounds up to 3.
Minimum training epochs: n_epochs = 800 / 3 ≈ 267. This is the recommended minimum value. You can increase it as needed, for example, to 300 (see the calculation sketch below).
learning_rate (learning rate), batch_size (batch size): We recommend the default values. You usually do not need to modify them.
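A quick way to reproduce the calculation above in a shell, using integer ceiling division (the values match the example):
dataset_size=5; batch_size=2; target_steps=800
# ceiling(dataset_size / batch_size) -> training steps per epoch (3 in this example)
steps_per_epoch=$(( (dataset_size + batch_size - 1) / batch_size ))
# smallest n_epochs that reaches at least 800 total steps (267 in this example)
min_epochs=$(( (target_steps + steps_per_epoch - 1) / steps_per_epoch ))
echo "steps per epoch: ${steps_per_epoch}, minimum n_epochs: ${min_epochs}"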
Billing
Training: Billed.
Fee = Total training tokens × Unit price. For more information, see Model training billing.
After the training is complete, you can view the total number of tokens consumed during training in the usage field of the Query fine-tuning job status API.
Deployment: Free.
Invocation: Billed.
Billed at the standard invocation price of the base model. For more information, see Model pricing.
API reference
FAQ
Q: How do I calculate the data volume for the training and validation sets?
A: A training set is required, and a validation set is optional. The calculation method is as follows:
If you do not pass a validation set: The uploaded training set is the "total dataset size," and the system automatically splits a portion of the training data for validation.
Validation set size = min(total dataset size × (1 - split), max_split_val_dataset_sample). For a calculation example, see Specify a validation set.
Training set size = total dataset size - validation set size.
If you manually upload a validation set: The system no longer splits the training data for validation.
Training set size = Uploaded training data volume.
Validation set size = Uploaded validation data volume.
Q: How do I design a good trigger word?
A: The rules are as follows:
Use a meaningless combination of letters, such as sksstyle or a8z2_effect.
Avoid using common English words (such as beautiful, fire, or dance), because this will interfere with the model's original understanding of these words.
Q: Can fine-tuning change the video resolution or duration?
A: No. Fine-tuning allows the model to learn content and dynamics, but it does not change specifications. The output video format (resolution, frame rate, or maximum duration) is determined by the base model.
