When using Wan for image-to-video generation, if prompt optimization or the official video effects still cannot produce the specific actions, effects, or styles that you need, use model fine-tuning.
Applicability
Applicable deployment modes and regions: This document applies only to the Singapore region in International deployment mode, and you must use an API key from this region.
Supported fine-tuning method: supervised fine-tuning (SFT) with LoRA (parameter-efficient fine-tuning).
Supported models for fine-tuning:
Image-to-video based on the first frame: wan2.6-i2v, wan2.5-i2v-preview, wan2.2-i2v-flash.
Image-to-video based on the first and last frames: wan2.2-kf2v-flash.
How to fine-tune a model
Image-to-video based on the first frame
Fine-tuning goal: Train a LoRA model for a "money rain" effect.
Expected result: Input a first frame image, and the model automatically generates a video with the "money rain" effect without a prompt.
| Input first frame image | Output video (before fine-tuning) | Output video (after fine-tuning) |
| --- | --- | --- |
| (image) | Prompts cannot consistently generate a "money rain" effect with fixed motion. The motion is uncontrollable. | The fine-tuned model can stably reproduce the specific "money rain" effect from the training set without a prompt. |
Image-to-video based on the first and last frames
Fine-tuning goal: Train a LoRA model for a "fashion magazine" effect.
Expected result: Input first and last frame images, and the model automatically generates a video with the "fashion magazine" effect without a prompt.
| Input first frame image | Input last frame image | Output video (before fine-tuning) | Output video (after fine-tuning) |
| --- | --- | --- | --- |
| (image) | (image) | Prompts cannot consistently generate a "fashion magazine" effect with fixed motion. The motion is uncontrollable. | The fine-tuned model can stably reproduce the specific "fashion magazine" effect from the training set without a prompt. |
Before you run the following code, create an API key and set the API key as an environment variable.
Step 1: Upload the dataset
Upload your local dataset (in .zip format) to the Alibaba Cloud Model Studio platform and obtain the file ID (id).
Training set sample data: For the format, see Training set.
Image-to-video based on the first frame: wan-i2v-training-dataset.zip.
Image-to-video based on the first and last frames: wan-kf2v-training-dataset.zip.
Request example
This example uses the image-to-video model based on the first frame. Only a training set is uploaded. The system automatically splits a portion of the training set to use as a validation set. Uploading the dataset takes several minutes. The exact time depends on the file size.
curl --location --request POST 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/files' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--form 'file=@"./wan-i2v-training-dataset.zip"' \
--form 'purpose="fine-tune"'
Response example
Save the id. It is the unique identifier for the uploaded dataset.
{
"id": "file-ft-b2416bacc4d742xxxx",
"object": "file",
"bytes": 73310369,
"filename": "wan-i2v-training-dataset.zip",
"purpose": "fine-tune",
"status": "processed",
"created_at": 1766127125
}
Step 2: Fine-tune the model
Step 2.1: Create a fine-tuning job
Start a training job using the file ID from Step 1.
Hyperparameter values vary across models. For hyperparameter settings, see Hyperparameters. For more call examples, see Request examples.
Request example
Replace <replace_with_training_dataset_file_id> with the id that you obtained in the previous step.
Image-to-video based on the first frame
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model":"wan2.6-i2v",
"training_file_ids":[
"<replace_with_training_dataset_file_id>"
],
"training_type":"efficient_sft",
"hyper_parameters":{
"n_epochs":400,
"batch_size":2,
"learning_rate":2e-5,
"split":0.9,
"eval_epochs": 50,
"max_pixels": 36864
}
}'
Image-to-video based on the first and last frames
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model":"wan2.2-kf2v-flash",
"training_file_ids":[
"<replace_with_training_dataset_file_id>"
],
"training_type":"efficient_sft",
"hyper_parameters":{
"n_epochs":400,
"batch_size":4,
"learning_rate":2e-5,
"split":0.9,
"eval_epochs": 50,
"max_pixels": 262144
}
}'
Response example
Note the following three key parameters in output:
job_id: The task ID, used to query progress.
finetuned_output: The name of the new fine-tuned model. You must use this name for subsequent deployment and calls.
status: The model training status. After you create a fine-tuning job, the initial status is PENDING, which indicates that training has not started.
{
...
"output": {
"job_id": "ft-202511111122-xxxx",
"status": "PENDING",
"finetuned_output": "xxxx-ft-202511111122-xxxx",
...
}
}
Step 2.2: Query the status of the fine-tuning job
Use the job_id obtained in Step 2.1 to query the job progress. Poll the following API operation until the status becomes SUCCEEDED.
The fine-tuning job in this example takes several hours to train. The exact duration depends on the model being fine-tuned.
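The polling loop can be sketched generically in Python. Here `fetch_status` is a placeholder for whatever issues the GET request shown below (a call on `/api/v1/fine-tunes/<job_id>` that returns `output.status`); the function name and parameters are illustrative, not part of the API.

```python
import time

def poll_until(fetch_status, target="SUCCEEDED", interval_s=60, max_polls=480):
    """Call fetch_status() repeatedly until it returns the target status.

    fetch_status is a placeholder for a function that queries the
    fine-tuning job and returns the value of output.status.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status == target:
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"job did not reach {target} within the polling budget")

# Usage with a stub that succeeds on the third poll:
statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(poll_until(lambda: next(statuses), interval_s=0))  # -> SUCCEEDED
```

Choose a polling interval of minutes rather than seconds; the job runs for hours, so frequent polling gains nothing.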
Request example
Replace <replace_with_fine-tuning_job_id> in the URL with the value of job_id.
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json'
Response example
Note the following two parameters in the output field:
status: When the value becomes SUCCEEDED, the model has been trained and can be deployed.
usage: The total number of tokens consumed for model training, used for billing.
{
...
"output": {
"job_id": "ft-202511111122-xxxx",
"status": "SUCCEEDED",
"usage": 432000,
...
}
}
Step 3: Deploy the fine-tuned model
Step 3.1: Deploy the model as an online service
After the status of the fine-tuning job becomes SUCCEEDED, deploy the model as an online service.
Request example
Replace <replace_with_model_name> with the value of the finetuned_output response parameter from the create a fine-tuning job operation.
Image-to-video based on the first frame
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model_name": "<replace_with_model_name>",
"aigc_config": {
"use_input_prompt": false,
"prompt": "Provide a video description based on the image content. The description must include the phrase “Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she/he stretches her/his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.“\nOutput Template:\nThe video begins with a shot of [subject description]. [Environment description]. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she/he stretches her/his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.\nExample:\nThe video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The brick wall in the background is covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
"lora_prompt_default": "Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she/he stretches her/his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain."
},
"capacity": 1,
"plan": "lora"
}'
Image-to-video based on the first and last frames
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model_name": "<replace_with_model_name>",
"aigc_config": {
"use_input_prompt": false,
"prompt": "Provide a video description based on the image content. The description must include the phrase “Then she/he begins the s86b5p transformation.“\nOutput Template:\nThe video begins with a shot of [subject description]. [Environment description]. Then she/he begins the s86b5p transformation.\nExample:\nThe video begins with a young woman in an outdoor setting. She has short, curly dark brown hair and a friendly smile. She is wearing a black Polo shirt with colorful floral embroidery. The background features green vegetation and distant mountains. Then she begins the s86b5p transformation.",
"lora_prompt_default": "Then she/he begins the s86b5p transformation."
},
"capacity": 1,
"plan": "lora"
}'
Response example
Note the following two parameters in output:
deployed_model: The name of the deployed model, used to query the deployment status and call the model.
status: The model deployment status. After you deploy a fine-tuned model, the initial status is PENDING, which indicates that the deployment has not started.
{
...
"output": {
"deployed_model": "xxxx-ft-202511111122-xxxx",
"status": "PENDING",
...
}
}
Step 3.2: Query the deployment status
Query the deployment status. Poll the following API operation until the status becomes RUNNING.
The deployment process for the fine-tuned model in this example is expected to take 5 to 10 minutes.
Request example
Replace <replace_with_deployed_model> with the value of the deployed_model parameter returned in Step 3.1.
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments/<replace_with_deployed_model>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json'
Response example
Note the following two parameters in the output field:
status: When the status becomes RUNNING, the model is deployed and ready to be called.
deployed_model: The name of the deployed model.
{
...
"output": {
"status": "RUNNING",
"deployed_model": "xxxx-ft-202511111122-xxxx",
...
}
}
Step 4: Call the model to generate a video
After the model is deployed (the deployment status is RUNNING), you can call it.
Build a custom dataset
In addition to using the sample data in this topic to experience the fine-tuning process, you can also build your own dataset for fine-tuning.
A dataset must include a training set (required) and can optionally include a validation set (which can be automatically split from the training set). Package all files into a .zip file. We recommend that you use only English letters, digits, underscores (_), or hyphens (-) in the filename.
Dataset format
Training set: Required
Image-to-video based on the first frame
The training set includes first frame images, training videos, and an annotation file (data.jsonl).
Sample training set: wan-i2v-training-dataset.zip.
ZIP package directory structure:
wan-i2v-training-dataset.zip
├── data.jsonl    # Must be named data.jsonl, max size 20 MB
├── image_1.jpeg  # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
├── video_1.mp4   # Max video resolution 4096x4096, supports MP4, MOV formats
├── image_2.jpeg
└── video_2.mp4
Annotation file (data.jsonl): Each line represents a training data entry and must be a JSON object. The structure of a training data entry is as follows:
{
  "prompt": "The video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The brick wall in the background is covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
  "first_frame_path": "image_1.jpg",
  "video_path": "video_1.mp4"
}
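For illustration, an annotation file in this format can be generated with a short script. The prompt text and file names below are placeholders, not real training data:

```python
import json

# Placeholder entries; replace prompts and paths with your own data.
entries = [
    {
        "prompt": "The video begins showing ... Then the s86b5p money rain effect begins ...",
        "first_frame_path": "image_1.jpg",
        "video_path": "video_1.mp4",
    },
    {
        "prompt": "The video begins showing ... Then the s86b5p money rain effect begins ...",
        "first_frame_path": "image_2.jpg",
        "video_path": "video_2.mp4",
    },
]

# data.jsonl requires exactly one JSON object per line.
with open("data.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Writing the file programmatically avoids the most common annotation error: multi-line or pretty-printed JSON objects, which break the one-object-per-line requirement.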
Image-to-video based on the first and last frames
The training set includes first frame images, last frame images, training videos, and an annotation file (data.jsonl).
Sample training set: wan-kf2v-training-dataset.zip.
ZIP package directory structure:
wan-kf2v-training-dataset.zip
├── data.jsonl              # Must be named data.jsonl, max size 20 MB
├── image/                  # Stores first and last frame images
│   ├── image_1_first.jpg   # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
│   └── image_1_last.png
└── video/                  # Stores video files used as training targets
    ├── video_1.mp4         # Max video resolution 4096x4096, supports MP4, MOV formats
    └── video_2.mov
Annotation file (data.jsonl): Each line represents a training data entry and must be a JSON object. The structure of a training data entry is as follows:
{
  "prompt": "The video begins by showing a young woman in an outdoor setting. She has short, curly dark brown hair, a smile on her face, and looks very friendly. She is wearing a black polo shirt with colorful floral embroidery, with a background of green vegetation and distant mountains. Then she begins the s86b5p transformation.",
  "first_frame_path": "image/image_1_first.jpg",
  "last_frame_path": "image/image_1_last.jpg",
  "video_path": "video/video_1.mp4"
}
Validation set: Optional
Image-to-video based on the first frame
The validation set includes first frame images and an annotation file (data.jsonl). You do not need to provide videos. At each evaluation node, the training job automatically calls the model service to generate preview videos using the images and prompts from the validation set.
Sample validation set: wan-i2v-valid-dataset.zip.
ZIP package directory structure:
wan-i2v-valid-dataset.zip
├── data.jsonl    # Must be named data.jsonl, max size 20 MB
├── image_1.jpeg  # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
└── image_2.jpeg
Annotation file (data.jsonl): Each line represents a validation data entry and must be a JSON object. The structure of a validation data entry is as follows:
{
  "prompt": "The video begins showing a scene of a young man standing in front of a cityscape. He is wearing a black and white checkered jacket over a black hoodie, with a smile on his face and a confident expression. The background is a city skyline at sunset, with a famous domed building and layered roofs visible in the distance, the sky filled with clouds showing warm orange-yellow hues. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
  "first_frame_path": "image_1.jpg"
}
Image-to-video based on the first and last frames
The validation set includes first frame images, last frame images, and an annotation file (data.jsonl). You do not need to provide videos. At each evaluation node, the training job automatically calls the model service to generate preview videos using the images and prompts from the validation set.
Sample validation set: wan-kf2v-valid-dataset.zip.
ZIP package directory structure:
wan-kf2v-valid-dataset.zip
├── data.jsonl              # Must be named data.jsonl, max size 20 MB
└── image/                  # Stores first and last frame images
    ├── image_1_first.jpg   # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
    └── image_1_last.jpg
Annotation file (data.jsonl): Each line represents a validation data entry and must be a JSON object. The structure of a validation data entry is as follows:
{
  "prompt": "The video begins showing a scene of a young man standing in front of a cityscape. He is wearing a black and white checkered jacket over a black hoodie, with a smile on his face and a confident expression. The background is a city skyline at sunset, with a famous domed building and layered roofs visible in the distance, the sky filled with clouds showing warm orange-yellow hues. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
  "first_frame_path": "image/image_1_first.jpg",
  "last_frame_path": "image/image_1_last.jpg"
}
Data volume and limitations
Data volume: Provide at least 10 data entries. The more training data, the better. We recommend 20 to 100 entries for stable results.
ZIP package: The total size of the package must be 1 GB or less when uploaded using an API.
Training image requirements:
Supported formats are BMP, JPEG, PNG, and WEBP.
Image resolution must be 4096×4096 or less.
There is no hard limit on the size of a single image file. The system automatically performs pre-processing.
Training video requirements:
Supported formats are MP4 and MOV.
Video resolution must be 4096×4096 or less.
There is no hard limit on the size of a single video file. The system automatically performs pre-processing.
Maximum duration of a single video: 5 seconds for wan2.2 models; 10 seconds for wan2.5 and wan2.6 models.
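The package-level limits above can be checked locally before upload. This is a minimal sketch, assuming the annotation file sits at the root of the package, as in the sample structures; the function name is illustrative.

```python
import os
import zipfile

def check_dataset_zip(path, max_zip_bytes=1 << 30, max_jsonl_bytes=20 * 1024 * 1024):
    """Return a list of problems found against the documented package limits."""
    problems = []
    # Total package size must be 1 GB or less for API upload.
    if os.path.getsize(path) > max_zip_bytes:
        problems.append("package exceeds the 1 GB API upload limit")
    with zipfile.ZipFile(path) as zf:
        # The annotation file must be named exactly data.jsonl.
        if "data.jsonl" not in zf.namelist():
            problems.append("annotation file must be named data.jsonl")
        elif zf.getinfo("data.jsonl").file_size > max_jsonl_bytes:
            problems.append("data.jsonl exceeds the 20 MB limit")
    return problems
```

An empty list means the package passed these basic checks; verifying image and video resolutions would additionally require a media library.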
Data collection and cleansing
1. Determine the fine-tuning scenario
The fine-tuning scenarios for image-to-video generation supported by Wan include the following:
Fixed video effects: Teach the model a specific visual change, such as a carousel or a magical transformation.
Fixed character actions: Improve the model's ability to reproduce specific body movements, such as particular dance moves or martial arts forms.
Fixed video camera movements: Replicate complex camera language, such as fixed templates for push-pull, pan-tilt, and surround shots.
2. Obtain raw materials
AI generation and selection: Use the Wan foundation model to generate videos in batches, then manually select the high-quality samples that best match the target effect. This is the most common method.
Live shooting: If your goal is to achieve highly realistic interactive scenes (such as hugs or handshakes), using live-shot footage is the best choice.
3D software rendering: For effects or abstract animations that require detailed control, we recommend using 3D software (such as Blender or C4D) to create the materials.
3. Cleanse the data
| Dimension | Positive requirements | Negative examples |
| --- | --- | --- |
| Consistency | Core features must be highly consistent. For example, to train a "360-degree rotation," all videos must rotate clockwise at a roughly consistent speed. | Mixed directions. The dataset contains both clockwise and counter-clockwise rotations. The model does not know which direction to learn. |
| Diversity | The richer the subjects and scenes, the better. Cover different subjects (men, women, old, young, cats, dogs, buildings) and different compositions (close-ups, long shots, high-angle, low-angle). The resolutions and aspect ratios should also be as diverse as possible. | Single scene or subject. All videos show "a person in red clothes rotating in front of a white wall." The model will mistakenly treat "red clothes" and "white wall" as part of the effect and will not rotate if the clothes change. |
| Balance | The proportions of different data types are balanced. If multiple styles are included, their quantities should be roughly equal. | Severely imbalanced proportions. 90% of the videos are portrait and 10% are landscape. The model may perform poorly when generating landscape videos. |
| Purity | Clean and clear images. Use raw materials without interference. | Interfering elements. The video contains captions, station logos, watermarks, obvious black bars, or noise. The model might learn the watermark as part of the effect. |
| Duration | Material duration ≤ target duration. If you expect to generate a 5-second video, crop the material to 4–5 seconds. | Material is too long. Expecting a 5-second video but feeding the model 8-second material results in incomplete action learning and a sense of truncation. |
Video annotation: Write prompts for videos
In the dataset's annotation file (data.jsonl), each video has a corresponding prompt. The prompt describes the visual content of the video. The quality of the prompt directly determines what the model learns.
Prompt example: The video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The background is a brick wall covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.
Prompt writing formula
Prompt = [Subject description] + [Background description] + [Trigger word] + [Motion description]
| Prompt component | Description | Required or recommended | Example |
| --- | --- | --- | --- |
| Subject description | Describes the person or object originally present in the scene | Required | The video begins showing a young woman... |
| Background description | Describes the environment where the subject is located | Required | The background is a brick wall covered with green vines... |
| Trigger word | A rare word with no actual meaning | Recommended | s86b5p or m01aa |
| Motion description | Describes in detail the motion changes that occur during the effect in the video | Recommended | Countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain... |
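The formula can be expressed as a trivial helper. The function name and arguments are illustrative; the point is that the trigger-plus-motion sentence is one fixed string reused across all samples for an effect, while subject and background vary per sample.

```python
def build_prompt(subject, background, effect):
    """Prompt = [Subject description] + [Background description]
              + [Trigger word + Motion description]."""
    return f"{subject} {background} {effect}"

# The effect string (trigger word + motion description) stays identical
# across samples; only subject and background change.
prompt = build_prompt(
    "The video begins showing a young woman standing in front of a brick wall.",
    "The background is a brick wall covered with green vines.",
    "Then the s86b5p money rain effect begins, countless US dollar bills "
    "pour down like a torrential rain.",
)
```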
How to write good prompts
Follow the consistency principle for effect descriptions
For all samples containing the effect, the motion description part of the effect should be as consistent as possible. This rule applies to both the training set and the validation set.
Purpose: When the model finds that s86b5p is always followed by a fixed description and the scene always shows a money rain, it remembers: s86b5p = money rain visual effect.
Example: Whether the subject is a "young woman" or a "man in a suit," as long as the video shows the money rain effect, the second half of the prompt is uniformly written as: "...then the s86b5p money rain effect begins, countless US dollar bills pour down like a torrential rain..."
| Sample type | Prompt content (note the consistent effect description) |
| --- | --- |
| Training set sample 1 | The video begins showing a young woman standing in front of a brick wall... (environment description omitted) ...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, expression surprised, completely immersed in this wild money rain. |
| Training set sample 2 | The video begins showing a man in a suit in a high-end restaurant... (environment description omitted) ...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall, he stretches his arms upward, expression surprised, completely immersed in this wild money rain. |
| Validation set sample 1 | The video begins showing a young child in front of a cityscape... (environment description omitted) ...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain. |
Generate prompts with AI assistance
To obtain high-quality prompts, we recommend using a multimodal large language model (LLM) such as Qwen-VL to assist in generating prompts for videos.
Use AI to help generate initial descriptions
Brainstorm (find inspiration): If you do not know how to describe the effect, you can let the AI brainstorm first.
Directly send "Describe the video content in detail" and observe the model's output.
Focus on the words the model uses to describe the motion trajectory of the effect (such as "pour down like a torrential rain" or "camera slowly zooms in"). These words can serve as material for subsequent optimization.
Fixed sentence structure (standardize output): Once you have a general idea, you can design a fixed sentence structure based on the annotation formula to guide the AI in generating prompts that conform to the format.
Refine the effect template
We recommend running this process repeatedly on multiple samples with the same effect to identify common, accurate phrases used to describe the effect. From these, extract a universal "effect description."
Copy and paste this standardized effect description into all datasets for that effect.
Keep the unique "subject" and "background" descriptions for each sample, but replace the "effect description" part with the unified template.
Manual check
AI may hallucinate or make recognition errors. Perform a final manual check, for example, to confirm that the subject and background descriptions match the actual scene.
Evaluate the model using a validation set
Specify a validation set
A fine-tuning job must include a training set, while a validation set is optional. You can choose to have the system automatically split the validation set or manually upload one. The specific methods are as follows:
Method 1: Do not upload a validation set (system automatically splits)
When you create a fine-tuning job, if you do not pass the validation_file_ids parameter to specify a validation set, the system automatically splits a portion of the training set to use as the validation set based on the following two hyperparameters:
split: The proportion of the training set to be used for training. For example, 0.9 means 90% of the data is used for training, and the remaining 10% is used for validation.
max_split_val_dataset_sample: The maximum number of samples for the automatically split validation set.
Validation set splitting rule: The system takes the smaller value between total dataset size × (1 - split) and max_split_val_dataset_sample.
Example: Assume you only upload a training set with 100 data entries, split=0.9 (meaning 10% for validation), and max_split_val_dataset_sample=5.
Theoretical split: 100 × 10% = 10 entries.
Actual split: min(10, 5) = 5. Therefore, the system takes only 5 entries for the validation set.
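The splitting rule can be written out directly. This is a sketch: the function name is illustrative, and rounding the theoretical split to the nearest integer is an assumption of this example.

```python
def validation_split_size(total, split, max_split_val_dataset_sample):
    # min(total dataset size × (1 − split), max_split_val_dataset_sample)
    theoretical = round(total * (1 - split))
    return min(theoretical, max_split_val_dataset_sample)

# Example from the text: 100 entries, split=0.9, cap of 5 samples.
print(validation_split_size(100, 0.9, 5))  # -> 5
```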
Method 2: Upload a validation set (specify using validation_file_ids)
If you want to use your own prepared data to evaluate checkpoints instead of relying on the system's random split, you can upload a custom validation set.
Note: Once you choose to upload a validation set, the system will completely ignore the automatic splitting rule and use only the data you uploaded for validation.
Select the best checkpoint for deployment
During the training process, the system periodically saves "snapshots" of the model, known as checkpoints. By default, the system outputs the last checkpoint as the final fine-tuned model. However, checkpoints produced during the intermediate process may have better effects than the final version. You can select the most satisfactory one for deployment.
The system will run the checkpoint on the validation set and generate a preview video at the interval specified by the hyperparameter eval_epochs.
How to evaluate: Judge the effect by directly observing the generated preview videos.
Selection criteria: Find the checkpoint with the best effect and no action distortion.
Procedure
Step 1: View the preview effects generated by checkpoints
Step 2: Export a checkpoint and get the model name for deployment
Step 3: Deploy and call the model
Going live
In a production environment, if the initially trained model performs poorly (for example, corrupted frames, indistinct effects, or inaccurate actions), troubleshoot and retrain along the following dimensions:
1. Check the data and prompts
Data consistency: Data consistency is key. Check for "bad samples" with opposite directions or vastly different styles.
Number of samples: We recommend increasing the number of high-quality data entries to more than 20.
Prompt: Ensure the trigger word is a meaningless rare word (such as s86b5p) and avoid using common words (such as running) to prevent interference.
2. Adjust hyperparameters: For parameter descriptions, see Hyperparameters.
n_epochs (number of training epochs)
Default value: 400. We recommend using the default value. To adjust it, follow the principle of "Total training steps ≥ 800".
Formula for total steps:
steps = n_epochs × ceil(training set size / batch_size)
Therefore, the formula for the minimum n_epochs is:
n_epochs = 800 / ceil(training set size / batch_size)
Example: Assume the training set has 5 data entries and you are using the wan2.5 model (batch_size=2).
Training steps per epoch: ceil(5 / 2) = 3.
Minimum number of training epochs: n_epochs = 800 / 3 ≈ 267. You can increase it as needed for your business, for example, to 300.
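The minimum-epochs calculation above can be sketched as a small helper (the function name is illustrative):

```python
import math

def min_n_epochs(dataset_size, batch_size, min_total_steps=800):
    # steps per epoch = ceil(training set size / batch_size)
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    # smallest whole n_epochs such that n_epochs × steps_per_epoch ≥ min_total_steps
    return math.ceil(min_total_steps / steps_per_epoch)

print(min_n_epochs(5, 2))  # -> 267, matching the example above
```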
learning_rate, batch_size: We recommend using the default values. You usually do not need to modify them.
Billing
Model training: Billed.
Cost = Total training tokens × Unit price. For more information, see Model training billing.
After the training is complete, you can view the total number of tokens consumed during training in the usage field returned by the Query the status of a fine-tuning job operation.
Model deployment: Free of charge.
Model calling: Billed.
You are charged at the standard invocation price of the fine-tuned foundation model. For more information, see Model pricing.
API reference
FAQ
Q: How do I calculate the data volume for the training and validation sets?
A: A training set is required, and a validation set is optional. The calculation method is as follows:
If you do not pass a validation set: The uploaded training set is the "total dataset size." The system automatically splits a portion of the training set for validation.
Size of the validation set = min(Total dataset size × (1 − split), max_split_val_dataset_sample). For a calculation example, see Specify a validation set.
Number of training set entries = Total dataset size − Number of validation set entries.
If you upload a validation set: The system no longer splits the training data for validation.
Number of training set entries = Data volume of the uploaded training set.
Number of validation set entries = Data volume of the uploaded validation set.
Q: How do I design a good trigger word?
A: The rules are as follows:
Use a meaningless combination of letters, such as sksstyle or a8z2_bbb.
Avoid using common English words (such as beautiful, fire, dance), as this will interfere with the model's original understanding of these words.
Q: Can fine-tuning change the video resolution or duration?
A: No. Fine-tuning learns content and motion, not specifications. The format of the output video (resolution, frame rate, maximum duration) is still determined by the foundation model.