
Alibaba Cloud Model Studio: Fine-tune video generation models

Last Updated: Feb 25, 2026

When you use Wan for image-to-video generation and prompt optimization or the official video effects still cannot meet your customization needs for specific actions, effects, or styles, use model fine-tuning.

Applicability

  • Applicable deployment modes and regions: This document applies only to the Singapore region in International deployment mode, and you must use an API key from this region.

  • Supported fine-tuning method: SFT with LoRA efficient fine-tuning.

  • Supported models for fine-tuning:

    • Image-to-video based on the first frame: wan2.6-i2v, wan2.5-i2v-preview, wan2.2-i2v-flash.

    • Image-to-video based on the first and last frames: wan2.2-kf2v-flash.

How to fine-tune a model

Image-to-video based on the first frame

Fine-tuning goal: Train a LoRA model for a "money rain" effect.

Expected result: Input a first frame image, and the model automatically generates a video with the "money rain" effect without a prompt.

Input first frame image


Output video (before fine-tuning)

Prompts cannot consistently generate a "money rain" effect with fixed motion. The motion is uncontrollable.

Output video (after fine-tuning)

The fine-tuned model can stably reproduce the specific "money rain" effect from the training set without a prompt.

Image-to-video based on the first and last frames

Fine-tuning goal: Train a LoRA model for a "fashion magazine" effect.

Expected result: Input first and last frame images, and the model automatically generates a video with the "fashion magazine" effect without a prompt.

Input first frame image


Input last frame image


Output video (before fine-tuning)

Prompts cannot consistently generate a "fashion magazine" effect with fixed motion. The motion is uncontrollable.

Output video (after fine-tuning)

The fine-tuned model can stably reproduce the specific "fashion magazine" effect from the training set without a prompt.

Before you run the following code, create an API key and set the API key as an environment variable.

Step 1: Upload the dataset

Upload your local dataset (in .zip format) to the Alibaba Cloud Model Studio platform and obtain the file ID (id).

Training set sample data: For the format, see Training set.

Request example

This example uses the image-to-video model based on the first frame. Only a training set is uploaded. The system automatically splits a portion of the training set to use as a validation set. Uploading the dataset takes several minutes. The exact time depends on the file size.
curl --location --request POST 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/files' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--form 'file=@"./wan-i2v-training-dataset.zip"' \
--form 'purpose="fine-tune"'

Response example

Save the id. It is the unique identifier for the uploaded dataset.

{
    "id": "file-ft-b2416bacc4d742xxxx",
    "object": "file",
    "bytes": 73310369,
    "filename": "wan-i2v-training-dataset.zip",
    "purpose": "fine-tune",
    "status": "processed",
    "created_at": 1766127125
}
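The upload above can be wrapped in a small helper. This is an illustrative sketch, not an official SDK: the endpoint and the `purpose` value come from this guide, while the `upload_dataset` name and the injectable `post` function (for example, `requests.post`) are assumptions made so the helper can be exercised offline.

```python
# Illustrative helper around the files endpoint shown above.
UPLOAD_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1/files"

def upload_dataset(zip_path, api_key, post):
    """Upload a dataset ZIP with purpose=fine-tune and return its file id.

    `post` is an injected HTTP function with a requests.post-like signature,
    so the helper can be tested with a stub before touching the network.
    """
    with open(zip_path, "rb") as f:
        resp = post(
            UPLOAD_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={"purpose": "fine-tune"},
        )
    body = resp.json()
    if body.get("status") != "processed":
        raise RuntimeError(f"upload not processed: {body}")
    return body["id"]
```

Passing the HTTP function in explicitly keeps the helper easy to unit-test with a stub before running it against the live endpoint.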

Step 2: Fine-tune the model

Step 2.1: Create a fine-tuning job

Start a training job using the file ID from Step 1.

Note

Hyperparameter values vary across models. For hyperparameter settings, see Hyperparameters. For more call examples, see Request examples.

Request example

Replace <replace_with_training_dataset_file_id> with the id that you obtained in the previous step.

Image-to-video based on the first frame

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model":"wan2.6-i2v",
    "training_file_ids":[
        "<replace_with_training_dataset_file_id>"
    ],
    "training_type":"efficient_sft",
    "hyper_parameters":{
        "n_epochs":400,
        "batch_size":2,
        "learning_rate":2e-5,
        "split":0.9,
        "eval_epochs": 50,
        "max_pixels": 36864
    }
}'

Image-to-video based on the first and last frames

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model":"wan2.2-kf2v-flash",
    "training_file_ids":[
        "<replace_with_training_dataset_file_id>"
    ],
    "training_type":"efficient_sft",
    "hyper_parameters":{
        "n_epochs":400,
        "batch_size":4,
        "learning_rate":2e-5,
        "split":0.9,
        "eval_epochs": 50,
        "max_pixels": 262144
    }
}'
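Both requests share the same shape and differ only in the model name and hyperparameter values. The body can be sketched in Python as follows; the `build_finetune_payload` helper is illustrative, and its defaults mirror the first-frame example above (adjust them per model, see Hyperparameters).

```python
import json

# Illustrative builder for the fine-tune request body. Defaults mirror the
# first-frame example; override per model via keyword arguments.
def build_finetune_payload(model, training_file_id, **hyper):
    params = {
        "n_epochs": 400,
        "batch_size": 2,
        "learning_rate": 2e-5,
        "split": 0.9,        # 90% train / 10% auto-split validation
        "eval_epochs": 50,
        "max_pixels": 36864,
    }
    params.update(hyper)
    return {
        "model": model,
        "training_file_ids": [training_file_id],
        "training_type": "efficient_sft",  # SFT with LoRA
        "hyper_parameters": params,
    }

# Serialized body for POST https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes
body = json.dumps(build_finetune_payload("wan2.6-i2v", "file-ft-b2416bacc4d742xxxx"))
```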

Response example

Note the following three key parameters in the output field:

  • job_id: The task ID, used to query progress.

  • finetuned_output: The name of the new fine-tuned model. You must use this name for subsequent deployment and calls.

  • status: The model training status. After you create a fine-tuning job, the initial status is PENDING, which indicates that the training has not started.

{
    ...
    "output": {
        "job_id": "ft-202511111122-xxxx",
        "status": "PENDING",
        "finetuned_output": "xxxx-ft-202511111122-xxxx",
        ...
    }
}
Step 2.2: Query the status of the fine-tuning job

Use the job_id obtained in Step 2.1 to query the job progress. Poll the following API operation until the status becomes SUCCEEDED.

Note

The fine-tuning job in this example takes several hours to train. The exact duration depends on the model being fine-tuned.

Request example

Replace <replace_with_fine-tuning_job_id> in the URL with the value of job_id.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json'

Response example

Note the following two parameters in the output field:

  • status: When the value becomes SUCCEEDED, the model has been trained and can be deployed.

  • usage: The total number of tokens consumed for model training, used for billing.

{
    ...
    "output": {
        "job_id": "ft-202511111122-xxxx",
        "status": "SUCCEEDED",
        "usage": 432000,
        ...
    }
}
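Polling until a target status is a pattern you will reuse for the deployment step as well. A minimal sketch with an injectable status function; `wait_until` is an illustrative name, and the terminal failure states `FAILED`/`CANCELED` are assumptions, not values taken from this guide:

```python
import time

# Illustrative polling helper; FAILED/CANCELED are assumed failure states.
def wait_until(get_status, target, failed=("FAILED", "CANCELED"),
               interval=30, max_polls=1000):
    """Call get_status() until it returns `target`; raise on failure states."""
    for _ in range(max_polls):
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"job ended in state {status}")
        time.sleep(interval)
    raise TimeoutError("gave up polling")
```

For the fine-tuning job, `get_status` would issue the GET request above and return `output.status`; for Step 3.2 the same helper applies with `target="RUNNING"`.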

Step 3: Deploy the fine-tuned model

Step 3.1: Deploy the model as an online service

After the status of the fine-tuning job becomes SUCCEEDED, deploy the model as an online service.

Request example

Replace <replace_with_model_name> with the finetuned_output value returned when you created the fine-tuning job in Step 2.1.

Image-to-video based on the first frame

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model_name": "<replace_with_model_name>",
    "aigc_config": {
        "use_input_prompt": false,
        "prompt": "Provide a video description based on the image content. The description must include the phrase “Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she/he stretches her/his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.“\nOutput Template:\nThe video begins with a shot of [subject description]. [Environment description]. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she/he stretches her/his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.\nExample:\nThe video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The brick wall in the background is covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
        "lora_prompt_default": "Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she/he stretches her/his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain."
    },
    "capacity": 1,
    "plan": "lora"
}'

Image-to-video based on the first and last frames

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model_name": "<replace_with_model_name>",
    "aigc_config": {
        "use_input_prompt": false,
        "prompt": "Provide a video description based on the image content. The description must include the phrase “Then she/he begins the s86b5p transformation.“\nOutput Template:\nThe video begins with a shot of [subject description]. [Environment description]. Then she/he begins the s86b5p transformation.\nExample:\nThe video begins with a young woman in an outdoor setting. She has short, curly dark brown hair and a friendly smile. She is wearing a black Polo shirt with colorful floral embroidery. The background features green vegetation and distant mountains. Then she begins the s86b5p transformation.",
        "lora_prompt_default": "Then she/he begins the s86b5p transformation."
    },
    "capacity": 1,
    "plan": "lora"
}'

Response example

Note the following two parameters in the output field:

  • deployed_model: The name of the deployed model, used to query the deployment status and call the model.

  • status: The model deployment status. After you deploy a fine-tuned model, the initial status is PENDING, which indicates that the deployment has not started.

{
    ...
    "output": {
        "deployed_model": "xxxx-ft-202511111122-xxxx",
        "status": "PENDING",
        ...
    }
}
Step 3.2: Query the deployment status

Query the deployment status. Poll the following API operation until the status becomes RUNNING.

Note

The deployment process for the fine-tuned model in this example is expected to take 5 to 10 minutes.

Request example

Replace <replace_with_deployed_model> with the value of the deployed_model parameter returned in Step 3.1.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments/<replace_with_deployed_model>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' 

Response example

Note the following two parameters in the output field:

  • status: When the status becomes RUNNING, the model is deployed and ready to be called.

  • deployed_model: The name of the deployed model.

{
    ...
    "output": {
        "status": "RUNNING",
        "deployed_model": "xxxx-ft-202511111122-xxxx",
        ...
    }
}

Step 4: Call the model to generate a video

After the model is deployed (the deployment status is RUNNING), you can call it.

Step 4.1: Create a video generation task and get the task_id

Request example

Replace <replace_with_deployed_model_name> with the deployed_model value returned in the previous step.

Image-to-video based on the first frame

Expected result: Input a first frame image, and the model automatically generates a video with the "money rain" effect without a prompt.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/video-generation/video-synthesis' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-Async: enable' \
--data '{
    "model": "<replace_with_deployed_model_name>",
    "input": {
        "img_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20251219/xmvyqn/lora.webp"
    },
    "parameters": {
        "resolution": "720P",
        "prompt_extend": false
    }
}'

Response example

Copy and save the task_id to query the result in the next step.

{
    "output": {
        "task_status": "PENDING",
        "task_id": "0385dc79-5ff8-4d82-bcb6-xxxxxx"
    },
    "request_id": "4909100c-7b5a-9f92-bfe5-xxxxxx"
}
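The request above can be assembled programmatically. A sketch that assumes only what the curl example shows; the `build_video_request` helper name is illustrative:

```python
# Illustrative assembly of the asynchronous video-generation request.
def build_video_request(deployed_model, img_url, api_key, resolution="720P"):
    return {
        "url": ("https://dashscope-intl.aliyuncs.com/api/v1/services/"
                "aigc/video-generation/video-synthesis"),
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "X-DashScope-Async": "enable",  # required: the task runs asynchronously
        },
        "json": {
            "model": deployed_model,
            "input": {"img_url": img_url},
            # Prompt rewriting must stay disabled for fine-tuned LoRA models.
            "parameters": {"resolution": resolution, "prompt_extend": False},
        },
    }
```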

Input parameter description

Note

When you call a fine-tuned LoRA model, the input parameter usage is essentially the same as that of the Wan image-to-video (first frame) API.

The following table lists only the unique parameter usage or specific limits for LoRA models. For general parameters not mentioned in this table (such as duration), see the API reference.

  • model (string, required): The model name. You must use a fine-tuned model that has been successfully deployed and is in the RUNNING state. Example: xxxx-ft-202511111122-xxxx.

  • input.prompt (string, optional): The text prompt. Whether this parameter takes effect depends on the configuration of aigc_config.use_input_prompt:

    • If use_input_prompt=true, this parameter takes effect. The system generates the video based on this prompt.

    • If use_input_prompt=false, this parameter is ignored. The system automatically generates a prompt from the aigc_config.prompt preset template.

  • input.img_url (string, required): The URL of the first frame image. For more information, see the img_url parameter. Example: https://help-static-aliyun-doc.aliyuncs.com/xxx.jpg.

  • parameters.resolution (string, optional): The resolution tier of the generated video. For wan2.2 and wan2.5 models: 480P and 720P. For wan2.6 models: 720P and 1080P. The default value is 720P. Example: 720P.

  • parameters.prompt_extend (boolean, optional): Specifies whether to enable prompt rewriting. When you call a fine-tuned LoRA model, set this parameter to false.

Image-to-video based on the first and last frames

Expected result: Input first and last frame images, and the model automatically generates a video with the "fashion magazine" effect without a prompt.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-Async: enable' \
--data '{
    "model": "<replace_with_deployed_model_name>",
    "input": {
        "first_frame_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260113/typemn/kf2v-first.webp",
        "last_frame_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260113/ekzmff/kf2v_last.webp"
    },
    "parameters": {
        "resolution": "720P",
        "prompt_extend": false
    }
}'

Response example

Copy and save the task_id to query the result in the next step.

{
    "output": {
        "task_status": "PENDING",
        "task_id": "0385dc79-5ff8-4d82-bcb6-xxxxxx"
    },
    "request_id": "4909100c-7b5a-9f92-bfe5-xxxxxx"
}

Input parameter description

Note

When you call a fine-tuned LoRA model, the input parameter usage is essentially the same as that of the Wan image-to-video (first and last frames) API.

The following table lists only the unique parameter usage or specific limits for LoRA models. For general parameters not mentioned in this table (such as duration), see the API reference.

  • model (string, required): The model name. You must use a fine-tuned model that has been successfully deployed and is in the RUNNING state. Example: xxxx-ft-202511111122-xxxx.

  • input.prompt (string, optional): The text prompt. Whether this parameter takes effect depends on the configuration of aigc_config.use_input_prompt:

    • If use_input_prompt=true, this parameter takes effect. The system generates the video based on this prompt.

    • If use_input_prompt=false, this parameter is ignored and does not need to be specified. The system automatically generates a prompt from the aigc_config.prompt preset template.

  • input.first_frame_url (string, required): The URL of the first frame image. For information about how to pass the parameter, see the first_frame_url parameter. Example: https://help-static-aliyun-doc.aliyuncs.com/xxx.jpg.

  • input.last_frame_url (string, optional): The URL of the last frame image. For information about how to pass the parameter, see the last_frame_url parameter. Example: https://help-static-aliyun-doc.aliyuncs.com/xxx.jpg.

  • parameters.resolution (string, optional): The resolution tier of the generated video. Fine-tuned models support 480P and 720P. The default value is 720P. Example: 720P.

  • parameters.prompt_extend (boolean, optional): Specifies whether to enable prompt rewriting. When you call a fine-tuned LoRA model, set this parameter to false.

Step 4.2: Query the result based on the task_id

Use the task_id to poll the task status until task_status becomes SUCCEEDED, and then get the video URL.

Request example

Replace 86ecf553-d340-4e21-xxxxxxxxx with the actual task_id.
curl -X GET https://dashscope-intl.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

Response example

The video URL is valid for 24 hours. Download the video promptly.
{
    "request_id": "c87415d2-f436-41c3-9fe8-xxxxxx",
    "output": {
        "task_id": "a017e64c-012b-431a-84fd-xxxxxx",
        "task_status": "SUCCEEDED",
        "submit_time": "2025-11-12 11:03:33.672",
        "scheduled_time": "2025-11-12 11:03:33.699",
        "end_time": "2025-11-12 11:04:07.088",
        "orig_prompt": "",
        "video_url": "https://dashscope-result-sh.oss-cn-shanghai.aliyuncs.com/xxx.mp4?Expires=xxxx"
    },
    "usage": {
        "duration": 5,
        "video_count": 1,
        "SR": 480
    }
}
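Once task_status reaches SUCCEEDED, the video URL can be pulled out of the query response. The field names below follow the response example above; the `extract_video_url` function name is illustrative:

```python
# Illustrative extraction of the result; field names follow the response
# example above. Remember the video URL expires after 24 hours.
def extract_video_url(response):
    output = response["output"]
    if output["task_status"] != "SUCCEEDED":
        return None  # still PENDING/RUNNING, or the task failed
    return output["video_url"]
```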

Build a custom dataset

In addition to using the sample data in this topic to experience the fine-tuning process, you can also build your own dataset for fine-tuning.

A dataset must include a training set (required) and can optionally include a validation set (which can be automatically split from the training set). Package all files into a .zip file. We recommend that you use only English letters, digits, underscores (_), or hyphens (-) in the filename.

Dataset format

Training set: Required

Image-to-video based on the first frame

The training set includes first frame images, training videos, and an annotation file (data.jsonl).

  • Sample training set: wan-i2v-training-dataset.zip.

  • ZIP package directory structure:

    wan-i2v-training-dataset.zip
    ├── data.jsonl        # Must be named data.jsonl, max size 20 MB
    ├── image_1.jpeg      # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
    ├── video_1.mp4       # Max video resolution 4096x4096, supports MP4, MOV formats
    ├── image_2.jpeg
    └── video_2.mp4
  • Annotation file (data.jsonl): Each line represents a training data entry and must be a JSON object. The structure of a training data entry is as follows:

    {
        "prompt": "The video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The brick wall in the background is covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
        "first_frame_path": "image_1.jpg",
        "video_path": "video_1.mp4"        
    }
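The annotation file can be generated programmatically instead of hand-editing JSON Lines. A sketch for the first-frame format; `write_annotations` is an illustrative name, and the prompt below is truncated for brevity:

```python
import json

# Illustrative generator for the first-frame annotation file (data.jsonl).
def write_annotations(entries, path="data.jsonl"):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

write_annotations([
    {
        "prompt": "The video begins showing a young woman... Then the "
                  "s86b5p money rain effect begins...",
        "first_frame_path": "image_1.jpeg",
        "video_path": "video_1.mp4",
    },
])
```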

Image-to-video based on the first and last frames

The training set includes first frame images, last frame images, training videos, and an annotation file (data.jsonl).

  • Sample training set: wan-kf2v-training-dataset.zip.

  • ZIP package directory structure:

    wan-kf2v-training-dataset.zip
    ├── data.jsonl                # Must be named data.jsonl, max size 20 MB
    ├── image/                    # Stores first and last frame images
    │   ├── image_1_first.jpg     # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
    │   └── image_1_last.png
    └── video/                    # Stores video files as "training targets"
        ├── video_1.mp4           # Max video resolution 4096x4096, supports MP4, MOV formats
        └── video_2.mov
  • Annotation file (data.jsonl): Each line represents a training data entry and must be a JSON object. The structure of a training data entry is as follows:

    {
        "prompt": "The video begins by showing a young woman in an outdoor setting. She has short, curly dark brown hair, a smile on her face, and looks very friendly. She is wearing a black polo shirt with colorful floral embroidery, with a background of green vegetation and distant mountains. Then she begins the s86b5p transformation.",
        "first_frame_path": "image/image_1_first.jpg",
        "last_frame_path": "image/image_1_last.jpg", 
        "video_path": "video/video_1.mp4"  
    }

Validation set: Optional

Image-to-video based on the first frame

The validation set includes first frame images and an annotation file (data.jsonl). You do not need to provide videos. At each evaluation node, the training job automatically calls the model service to generate preview videos using the images and prompts from the validation set.

  • Sample validation set: wan-i2v-valid-dataset.zip.

  • ZIP package directory structure:

    wan-i2v-valid-dataset.zip
    ├── data.jsonl       # Must be named data.jsonl, max size 20 MB
    ├── image_1.jpeg     # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
    └── image_2.jpeg
  • Annotation file (data.jsonl): Each line represents a validation data entry and must be a JSON object. The structure of a validation data entry is as follows:

    {
        "prompt": "The video begins showing a scene of a young man standing in front of a cityscape. He is wearing a black and white checkered jacket over a black hoodie, with a smile on his face and a confident expression. The background is a city skyline at sunset, with a famous domed building and layered roofs visible in the distance, the sky filled with clouds showing warm orange-yellow hues. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
        "first_frame_path": "image_1.jpg"
    }

Image-to-video based on the first and last frames

The validation set includes first frame images, last frame images, and an annotation file (data.jsonl). You do not need to provide videos. At each evaluation node, the training job automatically calls the model service to generate preview videos using the images and prompts from the validation set.

  • Sample validation set: wan-kf2v-valid-dataset.zip.

  • ZIP package directory structure:

    wan-kf2v-valid-dataset.zip
    ├── data.jsonl                 # Must be named data.jsonl, max size 20 MB
    └── image/                     # Stores first and last frame images
        ├── image_1_first.jpg      # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
        └── image_1_last.jpg
  • Annotation file (data.jsonl): Each line represents a validation data entry and must be a JSON object. The structure of a validation data entry is as follows:

    {
        "prompt": "The video begins showing a scene of a young man standing in front of a cityscape. He is wearing a black and white checkered jacket over a black hoodie, with a smile on his face and a confident expression. The background is a city skyline at sunset, with a famous domed building and layered roofs visible in the distance, the sky filled with clouds showing warm orange-yellow hues. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
        "first_frame_path": "image/image_1_first.jpg",
        "last_frame_path": "image/image_1_last.jpg",
    }

Data volume and limitations

  • Data volume: Provide at least 10 data entries. The more training data, the better. We recommend 20 to 100 entries for stable results.

  • ZIP package: The total size of the package must be 1 GB or less when uploaded using an API.

  • Training image requirements:

    • Supported formats are BMP, JPEG, PNG, and WEBP.

    • Image resolution must be 4096×4096 or less.

    • There is no hard limit on the size of a single image file. The system automatically performs pre-processing.

  • Training video requirements:

    • Supported formats are MP4 and MOV.

    • Video resolution must be 4096×4096 or less.

    • There is no hard limit on the size of a single video file. The system automatically performs pre-processing.

    • Maximum duration of a single video: 5 seconds for wan2.2 models; 10 seconds for wan2.5 models; 10 seconds for wan2.6 models.
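The limits above can be checked locally before packaging the ZIP. A sketch with the documented constraints hard-coded; `check_dataset` and its exact error messages are illustrative, not an official tool:

```python
import os

# Documented limits, hard-coded for an illustrative local pre-check.
ALLOWED_IMAGES = {".bmp", ".jpeg", ".jpg", ".png", ".webp"}
ALLOWED_VIDEOS = {".mp4", ".mov"}
MAX_ZIP_BYTES = 1 * 1024**3  # 1 GB upload limit via the API

def check_dataset(entries, zip_bytes):
    """Validate annotation entries and package size before uploading."""
    if len(entries) < 10:
        raise ValueError("provide at least 10 entries (20-100 recommended)")
    if zip_bytes > MAX_ZIP_BYTES:
        raise ValueError("ZIP exceeds the 1 GB API upload limit")
    for entry in entries:
        img_ext = os.path.splitext(entry["first_frame_path"])[1].lower()
        vid_ext = os.path.splitext(entry["video_path"])[1].lower()
        if img_ext not in ALLOWED_IMAGES:
            raise ValueError(f"unsupported image format: {img_ext}")
        if vid_ext not in ALLOWED_VIDEOS:
            raise ValueError(f"unsupported video format: {vid_ext}")
```

Resolution and duration limits require decoding the media, so they are left to the platform's own preprocessing here.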

Data collection and cleansing

1. Determine the fine-tuning scenario

The fine-tuning scenarios for image-to-video generation supported by Wan include the following:

  • Fixed video effects: Teach the model a specific visual change, such as a carousel or a magical transformation.

  • Fixed character actions: Improve the model's ability to reproduce specific body movements, such as particular dance moves or martial arts forms.

  • Fixed video camera movements: Replicate complex camera language, such as fixed templates for push-pull, pan-tilt, and surround shots.

2. Obtain raw materials
  • AI generation and selection: Use the Wan foundation model to generate videos in batches, then manually select the high-quality samples that best match the target effect. This is the most common method.

  • Live shooting: If your goal is to achieve highly realistic interactive scenes (such as hugs or handshakes), using live-shot footage is the best choice.

  • 3D software rendering: For effects or abstract animations that require detailed control, we recommend using 3D software (such as Blender or C4D) to create the materials.

3. Cleanse the data

  • Consistency

    • Positive requirement: Core features must be highly consistent. For example, to train a "360-degree rotation," all videos must rotate clockwise at a roughly consistent speed.

    • Negative example: Mixed directions. The dataset contains both clockwise and counter-clockwise rotations, so the model does not know which direction to learn.

  • Diversity

    • Positive requirement: The richer the subjects and scenes, the better. Cover different subjects (men, women, old, young, cats, dogs, buildings) and different compositions (close-ups, long shots, high-angle, low-angle). The resolution and aspect ratio should also be as diverse as possible.

    • Negative example: A single scene or subject. If all videos show "a person in red clothes rotating in front of a white wall," the model will mistakenly treat "red clothes" and "white wall" as part of the effect and will not rotate if the clothes are changed.

  • Balance

    • Positive requirement: The proportions of different data types are balanced. If multiple styles are included, their quantities should be roughly equal.

    • Negative example: Severely imbalanced proportions. If 90% of the videos are portrait and 10% are landscape, the model may perform poorly when generating landscape videos.

  • Purity

    • Positive requirement: Clean and clear images. Use raw materials without interference.

    • Negative example: Interfering elements. If the video contains captions, station logos, watermarks, obvious black bars, or noise, the model might learn the watermark as part of the effect.

  • Duration

    • Positive requirement: Material duration ≤ target duration. If you expect to generate a 5-second video, crop the material to 4–5 seconds.

    • Negative example: Material that is too long. Expecting a 5-second video but feeding the model 8-second material results in incomplete action learning and a sense of truncation.

Video annotation: Write prompts for videos

In the dataset's annotation file (data.jsonl), each video has a corresponding prompt. The prompt describes the visual content of the video. The quality of the prompt directly determines what the model learns.

Prompt example

The video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The background is a brick wall covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.

Prompt writing formula

Prompt = [Subject description] + [Background description] + [Trigger word] + [Motion description]

  • Subject description (required): Describes the person or object originally present in the scene. Example: "The video begins showing a young woman..."

  • Background description (required): Describes the environment where the subject is located. Example: "The background is a brick wall covered with green vines..."

  • Trigger word (recommended): A rare word with no actual meaning. Example: s86b5p or m01aa.

  • Motion description (recommended): Describes in detail the motion changes that occur during the effect in the video. Example: "Countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain..."
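The formula can be expressed as a small template function. A sketch using this guide's trigger word s86b5p; the `build_prompt` helper is illustrative:

```python
# Illustrative template following the formula:
# Prompt = subject + background + trigger word + motion description.
def build_prompt(subject, background, trigger, motion):
    return (f"The video begins showing {subject}. {background}. "
            f"Then the {trigger} {motion}")

prompt = build_prompt(
    "a young woman standing in front of a brick wall covered with ivy",
    "The background is a brick wall covered with green vines",
    "s86b5p",
    "money rain effect begins, countless US dollar bills pour down like a torrential rain...",
)
```

Keeping the trigger word and motion description in fixed slots makes it easy to hold them identical across samples, which is exactly what the consistency principle below requires.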

About "trigger words"
  • What is a trigger word?

    It acts as a "visual anchor". Because many complex dynamics (such as a special dance trajectory or an original light and shadow change) are difficult to describe with words, this word is used to forcibly tell the model: when you see s86b5p, you must generate this specific visual effect.

  • Why use it?

    Model fine-tuning establishes a mapping relationship between "text" and "video features." The trigger word is what binds the "indescribable effect" to a unique word, allowing the model to lock onto the target.

  • Since there is a trigger word, why still describe the motion in detail?

    The two have different roles and work better together.

    • Motion description: Responsible for explaining "what is happening in the scene." It tells the model the basic physical actions and logic, and the motion descriptions for multiple samples are usually consistent.

    • Trigger word: Responsible for explaining "what the action specifically looks like." It represents the unique changes and features that words cannot describe.

How to write good prompts

Follow the consistency principle for effect descriptions

For all samples containing the effect, the motion description part of the effect should be as consistent as possible. This rule applies to both the training set and the validation set.

  • Purpose: When the model finds that s86b5p is always followed by a fixed description and the scene always shows a money rain, it will remember: s86b5p = money rain visual effect.

  • Example: Whether it is a "young woman" or a "man in a suit," as long as it is a money rain effect, the second half of the prompt is uniformly written as: "...then the s86b5p money rain effect begins, countless US dollar bills pour down like a torrential rain..."

    The prompts below share an identical effect description (the text from the trigger word onward), while the subject and background vary per sample:

    • Training set sample 1: The video begins showing a young woman standing in front of a brick wall... (environment description omitted)...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, expression surprised, completely immersed in this wild money rain.

    • Training set sample 2: The video begins showing a man in a suit in a high-end restaurant... (environment description omitted)...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall, he stretches his arms upward, expression surprised, completely immersed in this wild money rain.

    • Validation set sample 1: The video begins showing a young child in front of a cityscape... (environment description omitted)...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.
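The consistency principle above can be enforced mechanically. The sketch below (an illustrative helper, not an official utility) composes each training prompt from a sample-specific opening plus one shared effect-description template, so the wording after the trigger word never drifts between samples:

```python
# Illustrative sketch: keep the effect description identical across samples by
# composing each prompt from a per-sample opening plus a fixed shared template.
# Trigger word and template text follow the examples in this document.
TRIGGER = "s86b5p"
EFFECT_TEMPLATE = (
    "then the " + TRIGGER + " money rain effect begins, countless huge-sized "
    "US dollar bills (beige background/dark green patterns) pour down like a "
    "torrential rain, densely hitting and surrounding {pronoun}."
)

def build_prompt(opening: str, pronoun: str) -> str:
    """Join the per-sample opening with the shared effect description."""
    return opening + "..." + EFFECT_TEMPLATE.format(pronoun=pronoun)

prompt = build_prompt(
    "The video begins showing a young woman standing in front of a brick wall",
    pronoun="her",
)
```

Only the opening (subject and background) and the pronoun change per sample; everything after the trigger word is emitted verbatim from the template.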

Generate prompts with AI assistance

To obtain high-quality prompts, we recommend using a multimodal large language model (LLM) such as Qwen-VL to assist in generating prompts for videos.

  1. Use AI to help generate initial descriptions

    1. Brainstorm (find inspiration): If you do not know how to describe the effect, you can let the AI brainstorm first.

      • Directly send "Describe the video content in detail" and observe the model's output.

      • Focus on the words the model uses to describe the motion trajectory of the effect (such as "pour down like a torrential rain," "camera slowly zooms in"). These words can be used as material for subsequent optimization.

    2. Fixed sentence structure (standardize output): Once you have a general idea, you can design a fixed sentence structure based on the annotation formula to guide the AI in generating prompts that conform to the format.

      Sample code

      For more information about code calls, see Image and video understanding.
      import os
      from openai import OpenAI
      
      client = OpenAI(
          # The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
          # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
          api_key=os.getenv("DASHSCOPE_API_KEY"),
          # The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
          base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
      )
      completion = client.chat.completions.create(
          model="qwen3-vl-plus",
          messages=[
              {"role": "user","content": [{
                  # When passing a video file directly, set the value of type to video_url
                  # When using the OpenAI SDK, one frame is extracted every 0.5 seconds from the video file by default, and this cannot be changed. To customize the frame extraction frequency, use the DashScope SDK.
                  "type": "video_url",            
                  "video_url": {"url": "https://cloud.video.taobao.com/vod/Tm1s_RpnvdXfarR12RekQtR66lbYXj1uziPzMmJoPmI.mp4"}},
                  {"type": "text", "text": "Please carefully analyze the video and generate a detailed video description according to the following fixed sentence structure.\n"
                                          "Sentence template: The video begins showing [subject description]. The background is [background description]. Then the s86b5p melting effect begins, [detailed motion description].\n"
                                          "Requirements:\n"
                                          "1. [Subject description]: Describe in detail the person or object originally present in the scene, including details such as appearance, clothing, and expression.\n"
                                          "2. [Background description]: Describe in detail the environment where the subject is located, including details such as environment, lighting, and weather.\n"
                                          "3. [Motion description]: Describe in detail the dynamic change process when the effect occurs (such as how objects move, how lighting changes, how the camera moves).\n"
                                          "4. All content must be naturally integrated into the sentence structure. Do not retain the '[ ]' symbols, and do not add any text unrelated to the description."}]
               }]
      )
      print(completion.choices[0].message.content)

  2. Refine the effect template

    1. We recommend running this process repeatedly on multiple samples with the same effect to identify common, accurate phrases used to describe the effect. From these, extract a universal "effect description."

    2. Copy and paste this standardized effect description into all datasets for that effect.

    3. Keep the unique "subject" and "background" descriptions for each sample, but replace the "effect description" part with the unified template.

  3. Manual check

    AI may hallucinate or make recognition errors. Perform a final manual check, for example, to confirm that the subject and background descriptions match the actual scene.
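The "refine the effect template" step can be sketched as a small heuristic (hypothetical helper, not part of the Model Studio API): split each AI-generated description at the trigger word and keep the most frequent effect tail as the unified template:

```python
from collections import Counter

TRIGGER = "s86b5p"

def most_common_effect_description(descriptions):
    """Split each AI-generated description at the trigger word and return the
    most frequent effect tail, to use as the unified effect template."""
    tails = [d.split(TRIGGER, 1)[1].strip() for d in descriptions if TRIGGER in d]
    if not tails:
        raise ValueError("no description contains the trigger word")
    return Counter(tails).most_common(1)[0][0]

# Outputs collected from several runs of the prompt-generation step (made-up examples).
samples = [
    "... then the s86b5p money rain effect begins, bills pour down like a torrential rain.",
    "... then the s86b5p money rain effect begins, bills pour down like a torrential rain.",
    "... then the s86b5p money rain effect begins, bills drift down slowly.",
]
template = most_common_effect_description(samples)
```

In practice you would still review the winning phrasing manually, then paste it into every sample for that effect.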

Evaluate the model using a validation set

Specify a validation set

A fine-tuning job must include a training set, while a validation set is optional. You can choose to have the system automatically split the validation set or manually upload one. The specific methods are as follows:

Method 1: Do not upload a validation set (system automatically splits)

When you create a fine-tuning job, if you do not pass the validation_file_ids parameter to specify a validation set, the system automatically splits a portion of the training set to use as the validation set based on the following two hyperparameters:

  • split: The proportion of the training set to be used for training. For example, 0.9 means 90% of the data is used for training, and the remaining 10% is used for validation.

  • max_split_val_dataset_sample: The maximum number of samples for the automatically split validation set.

Validation set splitting rule: The system takes the smaller value between total dataset size × (1 - split) and max_split_val_dataset_sample.

  • Example: Assume you only upload a training set with 100 data entries, split=0.9 (meaning 10% for validation), and max_split_val_dataset_sample=5.

    • Theoretical split: 100 × 10% = 10 entries.

    • Actual split: min(10, 5) = 5. Therefore, the system takes only 5 entries for the validation set.
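The splitting rule above can be written out directly. This is a minimal sketch; the exact rounding of the theoretical split is an assumption, since the document does not specify it:

```python
def auto_validation_size(total, split, max_split_val_dataset_sample):
    """System rule from this document: min(total x (1 - split), cap).
    Rounding of the theoretical split to a whole number is an assumption."""
    theoretical = round(total * (1 - split))
    return min(theoretical, max_split_val_dataset_sample)

# Example from this document: 100 entries, split=0.9, cap of 5.
val = auto_validation_size(total=100, split=0.9, max_split_val_dataset_sample=5)
train = 100 - val
```

With these inputs the validation set gets 5 entries and the remaining 95 are used for training.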

Method 2: Upload a validation set (specify using validation_file_ids)

If you want to use your own prepared data to evaluate checkpoints instead of relying on the system's random split, you can upload a custom validation set.

Note: Once you choose to upload a validation set, the system will completely ignore the automatic splitting rule and use only the data you uploaded for validation.

Procedure: Upload a validation set

  1. Prepare the validation set: Package your validation data into a separate .zip file. For more information, see Validation set format.

  2. Upload the validation set: Call the Upload dataset API to upload the validation set .zip file and obtain a unique file ID.

  3. Specify a validation set during job creation: When you call the Create fine-tuning job API, enter this file ID in the validation_file_ids parameter.

    {
        "model":"wan2.5-i2v-preview",
        "training_file_ids":[ "<file_ID_of_the_training_set>" ],
        "validation_file_ids": [ "<file_ID_of_the_custom_validation_set>" ],
        ...
    }
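The request body above can be assembled programmatically. A minimal sketch in Python (the file IDs are placeholders returned by the Upload dataset API, not values to type in by hand):

```python
import json

# Job-creation request body as shown in this document. File IDs are placeholders
# obtained from the Upload dataset API.
payload = {
    "model": "wan2.5-i2v-preview",
    "training_file_ids": ["<file_ID_of_the_training_set>"],
    # Passing validation_file_ids disables the automatic train/validation split.
    "validation_file_ids": ["<file_ID_of_the_custom_validation_set>"],
}
body = json.dumps(payload)
```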

Select the best checkpoint for deployment

During the training process, the system periodically saves "snapshots" of the model, known as checkpoints. By default, the system outputs the last checkpoint as the final fine-tuned model. However, checkpoints produced during the intermediate process may have better effects than the final version. You can select the most satisfactory one for deployment.

The system will run the checkpoint on the validation set and generate a preview video at the interval specified by the hyperparameter eval_epochs.

  • How to evaluate: Judge the effect by directly observing the generated preview videos.

  • Selection criteria: Find the checkpoint with the best effect and no action distortion.

Procedure

Step 1: View the preview effects generated by checkpoints
Step 1.1: Query the list of validated checkpoints

This API operation only returns checkpoints that have passed validation on the validation set and successfully generated preview videos. Those that failed validation will not be listed.

Request example

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>/validation-results' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' 

Response example

This API operation returns a list containing only the names of checkpoints that have successfully passed validation.

{
    "request_id": "da1310f5-5a21-4e29-99d4-xxxxxx",
    "output": [
        {
            "checkpoint": "checkpoint-160"
        },
        ...
    ]
}

Step 1.2: Query the validation results for a checkpoint

Select a checkpoint from the list returned in the previous step (for example, "checkpoint-160") to view its generated video effect.

Request example

  • <replace_with_fine-tuning_job_id>: Replace this with the value of the job_id response parameter from Create fine-tuning job.

  • <replace_with_selected_checkpoint>: Replace this with the value of the checkpoint, for example, "checkpoint-160".

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>/validation-details/<replace_with_selected_checkpoint>?page_no=1&page_size=10' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

Response example

The preview video URL is returned in the video_path field and is valid for 24 hours. Download the video promptly to view the effect. Repeat this step to compare the effects of multiple checkpoints and find the most satisfactory one.

{
    "request_id": "375b3ad0-d3fa-451f-b629-xxxxxxx",
    "output": {
        "page_no": 1,
        "page_size": 10,
        "total": 1,
        "list": [
            {
                "video_path": "https://finetune-swap-wulanchabu.oss-cn-wulanchabu.aliyuncs.com/xxx.mp4?Expires=xxxx",
                "prompt": "The video begins with a young man sitting in a cafe. He is wearing a beige Polo shirt, looking focused and slightly contemplative, with his fingers gently touching his chin. In front of him is a cup of hot coffee. The background is a wall with wooden stripes and a decorative sign. Then the s86b5p money rain effect begins, and countless enormous US dollar bills (beige with dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall as he stretches his arms upward, neck slightly tilted back, with a surprised expression, completely immersed in this wild money rain.",
                "first_frame_path": "https://finetune-swap-wulanchabu.oss-cn-wulanchabu.aliyuncs.com/xxx.jpeg"
            }
        ]
    }
}
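A validation-details response shaped like the one above can be parsed as follows (a sketch with a made-up placeholder URL; collect the video_path links promptly, since they expire after 24 hours):

```python
# Illustrative: pull the preview-video URLs out of a validation-details
# response like the one above. The URL below is a made-up placeholder.
response = {
    "output": {
        "page_no": 1,
        "page_size": 10,
        "total": 1,
        "list": [
            {
                "video_path": "https://example.com/xxx.mp4?Expires=xxxx",
                "prompt": "The video begins ...",
                "first_frame_path": "https://example.com/xxx.jpeg",
            }
        ],
    }
}

video_urls = [item["video_path"] for item in response["output"]["list"]]
```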

Step 2: Export a checkpoint and get the model name for deployment
Step 2.1: Export the model

Assuming "checkpoint-160" has the best effect, the next step is to export it.

Request example

  • <replace_with_fine-tuning_job_id>: Replace this placeholder with the value of the response parameter job_id that is returned by the Create fine-tuning job operation.

  • <replace_with_selected_checkpoint>: Replace this with the value of the checkpoint, for example, "checkpoint-160".

  • <replace_with_exported_model_name_for_console_display>: The custom name for the model. This name is displayed only in the console. For example, "wan2.5-checkpoint-160". The name must be globally unique. You cannot use the same name for multiple exports. For more information about how to specify this parameter, see Export a checkpoint.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>/export/<replace_with_selected_checkpoint>?model_name=<replace_with_exported_model_name_for_console_display>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

Response example

The response parameter output=true indicates that the export request has been successfully created.

{
    "request_id": "0817d1ed-b6b6-4383-9650-xxxxx",
    "output": true
}
Step 2.2: Query the new model name after export

Query the status of all checkpoints to confirm that the export is complete and to get its exclusive new model name for deployment (model_name).

Request example

  • <replace_with_fine-tuning_job_id>: The value of the job_id response parameter from the Create fine-tuning job operation.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>/checkpoints' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

Response example

In the returned list, locate the exported checkpoint (such as checkpoint-160). When its status becomes SUCCEEDED, it means the export was successful. The model_name field returned at this time is the new model name after export.

{
    "request_id": "b0e33c6e-404b-4524-87ac-xxxxxx",
    "output": [
         ...,
        {
            "create_time": "2025-11-11T13:27:29",
            "full_name": "ft-202511111122-496e-checkpoint-160",
            "job_id": "ft-202511111122-496e",
            "checkpoint": "checkpoint-160",                             
            "model_name": "xxxx-ft-202511111122-xxxx-c160", // Important field, will be used for model deployment and calling
            "model_display_name": "xxxx-ft-202511111122-xxxx", 
            "status": "SUCCEEDED" // Successfully exported checkpoint
        },
        ...
        
    ]
}
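Picking the exported model name out of a checkpoints response like the one above can be sketched as a small helper (illustrative only; the values below are placeholders matching the sample response):

```python
def exported_model_name(checkpoints, checkpoint):
    """Return model_name once the given checkpoint's status is SUCCEEDED;
    None means the export is still in progress or failed."""
    for entry in checkpoints:
        if entry.get("checkpoint") == checkpoint and entry.get("status") == "SUCCEEDED":
            return entry["model_name"]
    return None

# Shaped like the "output" list in the response above (placeholder values).
output = [
    {
        "checkpoint": "checkpoint-160",
        "status": "SUCCEEDED",
        "model_name": "xxxx-ft-202511111122-xxxx-c160",
    },
]
name = exported_model_name(output, "checkpoint-160")
```

If the helper returns None, poll the checkpoints API again until the status becomes SUCCEEDED.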
Step 3: Deploy and call the model

After successfully exporting the checkpoint and obtaining the model_name, perform the following operations:

  • Model deployment: For the input parameter model_name, enter the value obtained from the export.

  • Model invocation: Refer to the API documentation and call the deployed model.

Going live

In a production environment, if the initially trained model performs poorly (for example, with corrupted frames, indistinct effects, or inaccurate actions), you can improve it along the following dimensions:

1. Check the data and prompts

  • Data consistency: Consistency is key. Check for "bad samples" whose motion direction is reversed or whose style differs greatly from the rest.

  • Number of samples: We recommend increasing the number of high-quality data entries to more than 20.

  • Prompt: Ensure the trigger word is a meaningless rare word (such as s86b5p) and avoid using common words (such as running) to prevent interference.

2. Adjust hyperparameters: For parameter descriptions, see Hyperparameters.

  • n_epochs (number of training epochs)

    • Default value: 400. We recommend using the default value. To adjust it, follow the principle of "Total training steps ≥ 800".

    • Formula for total steps: steps = n_epochs × ceil(training set size / batch_size).

    • Therefore, the formula for the minimum n_epochs is: n_epochs = 800 / ceil(dataset size / batch_size).

    • Example: Assume the training set has 5 data entries and you are using the Wan2.5 model (batch_size=2).

      • Training steps per epoch: 5 / 2 = 2.5, which rounds up to 3. Total number of training epochs: n_epochs = 800 / 3 ≈ 267. This is the recommended minimum value. You can increase it as needed for your business, for example, to 300.

  • learning_rate, batch_size: We recommend using the default values. You usually do not need to modify them.
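The minimum-n_epochs rule above can be computed directly. A short sketch using the formulas from this section:

```python
import math

def min_n_epochs(dataset_size, batch_size, min_total_steps=800):
    """Minimum n_epochs so that total steps = n_epochs * ceil(dataset_size / batch_size)
    reaches min_total_steps, per the rule of thumb in this section."""
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return math.ceil(min_total_steps / steps_per_epoch)

# Example from this document: 5 samples, batch_size=2 (Wan2.5) -> 267 epochs minimum.
n = min_n_epochs(5, 2)
```

As in the worked example, 5 samples with batch_size 2 give 3 steps per epoch, so at least 267 epochs are needed to reach 800 total steps.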

Billing

  • Model training: Billed.

  • Model deployment: Free of charge.

  • Model calling: Billed.

    • You are charged at the standard invocation price of the fine-tuned foundation model. For more information, see Model pricing.

API reference

Video generation model fine-tuning API reference

FAQ

Q: How do I calculate the data volume for the training and validation sets?

A: A training set is required, and a validation set is optional. The calculation method is as follows:

  • If you do not pass a validation set: The uploaded training set is the "total dataset size." The system automatically splits a portion of the training set for validation.

    • Size of the validation set = min(Total dataset size × (1 − split), max_split_val_dataset_sample). For a calculation example, see Specify a validation set.

    • Number of training set entries = Total dataset size − Number of validation set entries.

  • If you upload a validation set: The system no longer splits the training data for validation.

    • Number of training set entries = Data volume of the uploaded training set.

    • Number of validation set entries = Data volume of the uploaded validation set.

Q: How do I design a good trigger word?

A: The rules are as follows:

  • Use a meaningless combination of letters, such as sksstyle or a8z2_bbb.

  • Avoid using common English words (such as beautiful, fire, dance), as this will interfere with the model's original understanding of these words.

Q: Can fine-tuning change the video resolution or duration?

A: No. Fine-tuning learns content and motion, not specifications. The format of the output video (resolution, frame rate, maximum duration) is still determined by the foundation model.