
Alibaba Cloud Model Studio: Fine-tune video generation models

Last Updated: Feb 25, 2026

When you use Wan for image-to-video generation and prompt optimization or the official video effects still cannot meet your customization needs for specific actions, effects, or styles, use model fine-tuning.

Applicability

  • Applicable deployment modes and regions: This document applies only to the Singapore region in International deployment mode, and you must use an API key from this region.

  • Supported fine-tuning method: SFT with LoRA efficient fine-tuning.

  • Supported models for fine-tuning:

    • Image-to-video based on the first frame: wan2.6-i2v, wan2.5-i2v-preview, wan2.2-i2v-flash.

    • Image-to-video based on the first and last frames: wan2.2-kf2v-flash.

How to fine-tune a model

Image-to-video based on the first frame

Fine-tuning goal: Train a LoRA model for a "money rain" effect.

Expected result: Input a first frame image, and the model automatically generates a video with the "money rain" effect without a prompt.

Input first frame image


Output video (before fine-tuning)

Prompts cannot consistently generate a "money rain" effect with fixed motion. The motion is uncontrollable.

Output video (after fine-tuning)

The fine-tuned model can stably reproduce the specific "money rain" effect from the training set without a prompt.

Image-to-video based on the first and last frames

Fine-tuning goal: Train a LoRA model for a "fashion magazine" effect.

Expected result: Input first and last frame images, and the model automatically generates a video with the "fashion magazine" effect without a prompt.

Input first frame image


Input last frame image


Output video (before fine-tuning)

Prompts cannot consistently generate a "fashion magazine" effect with fixed motion. The motion is uncontrollable.

Output video (after fine-tuning)

The fine-tuned model can stably reproduce the specific "fashion magazine" effect from the training set without a prompt.

Before you run the following code, create an API key and set the API key as an environment variable.

Step 1: Upload the dataset

Upload your local dataset (in .zip format) to the Alibaba Cloud Model Studio platform and obtain the file ID (id).

Training set sample data: For the format, see Training set.

Request example

This example uses the image-to-video model based on the first frame. Only a training set is uploaded. The system automatically splits a portion of the training set to use as a validation set. Uploading the dataset takes several minutes. The exact time depends on the file size.
curl --location --request POST 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/files' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--form 'file=@"./wan-i2v-training-dataset.zip"' \
--form 'purpose="fine-tune"'

Response example

Save the id. It is the unique identifier for the uploaded dataset.

{
    "id": "file-ft-b2416bacc4d742xxxx",
    "object": "file",
    "bytes": 73310369,
    "filename": "wan-i2v-training-dataset.zip",
    "purpose": "fine-tune",
    "status": "processed",
    "created_at": 1766127125
}
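The upload above can be wrapped in a small helper. This is an illustrative sketch, not an official SDK: the endpoint and the `purpose` value come from this guide, while the `upload_dataset` name and the injectable `post` function (for example, `requests.post`) are assumptions made so the helper can be exercised offline.

```python
# Illustrative helper around the files endpoint shown above.
UPLOAD_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1/files"

def upload_dataset(zip_path, api_key, post):
    """Upload a dataset ZIP with purpose=fine-tune and return its file id.

    `post` is an injected HTTP function with a requests.post-like signature,
    so the helper can be tested with a stub before touching the network.
    """
    with open(zip_path, "rb") as f:
        resp = post(
            UPLOAD_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={"purpose": "fine-tune"},
        )
    body = resp.json()
    if body.get("status") != "processed":
        raise RuntimeError(f"upload not processed: {body}")
    return body["id"]
```

Passing the HTTP function in explicitly keeps the helper easy to unit-test with a stub before running it against the live endpoint.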

Step 2: Fine-tune the model

Step 2.1: Create a fine-tuning job

Start a training job using the file ID from Step 1.

Note

Hyperparameter values vary across models. For hyperparameter settings, see Hyperparameters. For more call examples, see Request examples.

Request example

Replace <replace_with_training_dataset_file_id> with the id that you obtained in the previous step.

Image-to-video based on the first frame

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model":"wan2.6-i2v",
    "training_file_ids":[
        "<replace_with_training_dataset_file_id>"
    ],
    "training_type":"efficient_sft",
    "hyper_parameters":{
        "n_epochs":400,
        "batch_size":2,
        "learning_rate":2e-5,
        "split":0.9,
        "eval_epochs": 50,
        "max_pixels": 36864
    }
}'

Image-to-video based on the first and last frames

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model":"wan2.2-kf2v-flash",
    "training_file_ids":[
        "<replace_with_training_dataset_file_id>"
    ],
    "training_type":"efficient_sft",
    "hyper_parameters":{
        "n_epochs":400,
        "batch_size":4,
        "learning_rate":2e-5,
        "split":0.9,
        "eval_epochs": 50,
        "max_pixels": 262144
    }
}'
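Both requests share the same shape and differ only in the model name and hyperparameter values. The body can be sketched in Python as follows; the `build_finetune_payload` helper is illustrative, and its defaults mirror the first-frame example above (adjust them per model, see Hyperparameters).

```python
import json

# Illustrative builder for the fine-tune request body. Defaults mirror the
# first-frame example; override per model via keyword arguments.
def build_finetune_payload(model, training_file_id, **hyper):
    params = {
        "n_epochs": 400,
        "batch_size": 2,
        "learning_rate": 2e-5,
        "split": 0.9,        # 90% train / 10% auto-split validation
        "eval_epochs": 50,
        "max_pixels": 36864,
    }
    params.update(hyper)
    return {
        "model": model,
        "training_file_ids": [training_file_id],
        "training_type": "efficient_sft",  # SFT with LoRA
        "hyper_parameters": params,
    }

# Serialized body for POST https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes
body = json.dumps(build_finetune_payload("wan2.6-i2v", "file-ft-b2416bacc4d742xxxx"))
```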

Response example

Note the following three key parameters in the output field:

  • job_id: The task ID, used to query progress.

  • finetuned_output: The name of the new fine-tuned model. You must use this name for subsequent deployment and calls.

  • status: The model training status. After you create a fine-tuning job, the initial status is PENDING, which indicates that the training has not started.

{
    ...
    "output": {
        "job_id": "ft-202511111122-xxxx",
        "status": "PENDING",
        "finetuned_output": "xxxx-ft-202511111122-xxxx",
        ...
    }
}
Step 2.2: Query the status of the fine-tuning job

Use the job_id obtained in Step 2.1 to query the job progress. Poll the following API operation until the status becomes SUCCEEDED.

Note

The fine-tuning job in this example takes several hours to train. The exact duration depends on the model being fine-tuned.

Request example

Replace <replace_with_fine-tuning_job_id> in the URL with the value of job_id.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json'

Response example

Note the following two parameters in the output field:

  • status: When the value becomes SUCCEEDED, the model has been trained and can be deployed.

  • usage: The total number of tokens consumed for model training, used for billing.

{
    ...
    "output": {
        "job_id": "ft-202511111122-xxxx",
        "status": "SUCCEEDED",
        "usage": 432000,
        ...
    }
}
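Polling until a target status is a pattern you will reuse for the deployment step as well. A minimal sketch with an injectable status function; `wait_until` is an illustrative name, and the terminal failure states `FAILED`/`CANCELED` are assumptions, not values taken from this guide:

```python
import time

# Illustrative polling helper; FAILED/CANCELED are assumed failure states.
def wait_until(get_status, target, failed=("FAILED", "CANCELED"),
               interval=30, max_polls=1000):
    """Call get_status() until it returns `target`; raise on failure states."""
    for _ in range(max_polls):
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"job ended in state {status}")
        time.sleep(interval)
    raise TimeoutError("gave up polling")
```

For the fine-tuning job, `get_status` would issue the GET request above and return `output.status`; for Step 3.2 the same helper applies with `target="RUNNING"`.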

Step 3: Deploy the fine-tuned model

Step 3.1: Deploy the model as an online service

After the status of the fine-tuning job becomes SUCCEEDED, deploy the model as an online service.

Request example

Replace <replace_with_model_name> with the finetuned_output value returned when you created the fine-tuning job in Step 2.1.

Image-to-video based on the first frame

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model_name": "<replace_with_model_name>",
    "aigc_config": {
        "use_input_prompt": false,
        "prompt": "Provide a video description based on the image content. The description must include the phrase “Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she/he stretches her/his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.“\nOutput Template:\nThe video begins with a shot of [subject description]. [Environment description]. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she/he stretches her/his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.\nExample:\nThe video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The brick wall in the background is covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
        "lora_prompt_default": "Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she/he stretches her/his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain."
    },
    "capacity": 1,
    "plan": "lora"
}'

Image-to-video based on the first and last frames

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model_name": "<replace_with_model_name>",
    "aigc_config": {
        "use_input_prompt": false,
        "prompt": "Provide a video description based on the image content. The description must include the phrase “Then she/he begins the s86b5p transformation.“\nOutput Template:\nThe video begins with a shot of [subject description]. [Environment description]. Then she/he begins the s86b5p transformation.\nExample:\nThe video begins with a young woman in an outdoor setting. She has short, curly dark brown hair and a friendly smile. She is wearing a black Polo shirt with colorful floral embroidery. The background features green vegetation and distant mountains. Then she begins the s86b5p transformation.",
        "lora_prompt_default": "Then she/he begins the s86b5p transformation."
    },
    "capacity": 1,
    "plan": "lora"
}'

Response example

Note the following two parameters in the output field:

  • deployed_model: The name of the deployed model, used to query the deployment status and call the model.

  • status: The model deployment status. After you deploy a fine-tuned model, the initial status is PENDING, which indicates that the deployment has not started.

{
    ...
    "output": {
        "deployed_model": "xxxx-ft-202511111122-xxxx",
        "status": "PENDING",
        ...
    }
}
Step 3.2: Query the deployment status

Query the deployment status. Poll the following API operation until the status becomes RUNNING.

Note

The deployment process for the fine-tuned model in this example is expected to take 5 to 10 minutes.

Request example

Replace <replace_with_deployed_model> with the value of the deployed_model parameter returned in Step 3.1.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/deployments/<replace_with_deployed_model>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' 

Response example

Note the following two parameters in the output field:

  • status: When the status becomes RUNNING, the model is deployed and ready to be called.

  • deployed_model: The name of the deployed model.

{
    ...
    "output": {
        "status": "RUNNING",
        "deployed_model": "xxxx-ft-202511111122-xxxx",
        ...
    }
}

Step 4: Call the model to generate a video

After the model is deployed (the deployment status is RUNNING), you can call it.

Step 4.1: Create a video generation task and get the task_id

Request example

Replace <replace_with_deployed_model_name> with the deployed_model value returned in the previous step.

Image-to-video based on the first frame

Expected result: Input a first frame image, and the model automatically generates a video with the "money rain" effect without a prompt.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/video-generation/video-synthesis' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-Async: enable' \
--data '{
    "model": "<replace_with_deployed_model_name>",
    "input": {
        "img_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20251219/xmvyqn/lora.webp"
    },
    "parameters": {
        "resolution": "720P",
        "prompt_extend": false
    }
}'

Response example

Copy and save the task_id to query the result in the next step.

{
    "output": {
        "task_status": "PENDING",
        "task_id": "0385dc79-5ff8-4d82-bcb6-xxxxxx"
    },
    "request_id": "4909100c-7b5a-9f92-bfe5-xxxxxx"
}
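The request above can be assembled programmatically. A sketch that assumes only what the curl example shows; the `build_video_request` helper name is illustrative:

```python
# Illustrative assembly of the asynchronous video-generation request.
def build_video_request(deployed_model, img_url, api_key, resolution="720P"):
    return {
        "url": ("https://dashscope-intl.aliyuncs.com/api/v1/services/"
                "aigc/video-generation/video-synthesis"),
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "X-DashScope-Async": "enable",  # required: the task runs asynchronously
        },
        "json": {
            "model": deployed_model,
            "input": {"img_url": img_url},
            # Prompt rewriting must stay disabled for fine-tuned LoRA models.
            "parameters": {"resolution": resolution, "prompt_extend": False},
        },
    }
```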

Input parameter description

Note

When you call a fine-tuned LoRA model, the input parameter usage is essentially the same as that of the Wan image-to-video (first frame) API.

The following table lists only the unique parameter usage or specific limits for LoRA models. For general parameters not mentioned in this table (such as duration), see the API reference.

  • model (string, required): The model name. You must use a fine-tuned model that has been successfully deployed and is in the RUNNING state. Example: xxxx-ft-202511111122-xxxx.

  • input.prompt (string, optional): The text prompt. Whether this parameter takes effect depends on the configuration of aigc_config.use_input_prompt:

    • If use_input_prompt=true, this parameter takes effect. The system generates the video based on this prompt.

    • If use_input_prompt=false, this parameter is ignored. The system automatically generates a prompt from the aigc_config.prompt preset template.

  • input.img_url (string, required): The URL of the first frame image. For more information, see the img_url parameter. Example: https://help-static-aliyun-doc.aliyuncs.com/xxx.jpg.

  • parameters.resolution (string, optional): The resolution tier of the generated video. For wan2.2 and wan2.5 models: 480P and 720P. For wan2.6 models: 720P and 1080P. The default value is 720P. Example: 720P.

  • parameters.prompt_extend (boolean, optional): Specifies whether to enable prompt rewriting. When you call a fine-tuned LoRA model, set this parameter to false.

Image-to-video based on the first and last frames

Expected result: Input first and last frame images, and the model automatically generates a video with the "fashion magazine" effect without a prompt.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-Async: enable' \
--data '{
    "model": "<replace_with_deployed_model_name>",
    "input": {
        "first_frame_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260113/typemn/kf2v-first.webp",
        "last_frame_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260113/ekzmff/kf2v_last.webp"
    },
    "parameters": {
        "resolution": "720P",
        "prompt_extend": false
    }
}'

Response example

Copy and save the task_id to query the result in the next step.

{
    "output": {
        "task_status": "PENDING",
        "task_id": "0385dc79-5ff8-4d82-bcb6-xxxxxx"
    },
    "request_id": "4909100c-7b5a-9f92-bfe5-xxxxxx"
}

Input parameter description

Note

When you call a fine-tuned LoRA model, the input parameter usage is essentially the same as that of the Wan image-to-video (first and last frames) API.

The following table lists only the unique parameter usage or specific limits for LoRA models. For general parameters not mentioned in this table (such as duration), see the API reference.

  • model (string, required): The model name. You must use a fine-tuned model that has been successfully deployed and is in the RUNNING state. Example: xxxx-ft-202511111122-xxxx.

  • input.prompt (string, optional): The text prompt. Whether this parameter takes effect depends on the configuration of aigc_config.use_input_prompt:

    • If use_input_prompt=true, this parameter takes effect. The system generates the video based on this prompt.

    • If use_input_prompt=false, this parameter is ignored and does not need to be specified. The system automatically generates a prompt from the aigc_config.prompt preset template.

  • input.first_frame_url (string, required): The URL of the first frame image. For information about how to pass the parameter, see the first_frame_url parameter. Example: https://help-static-aliyun-doc.aliyuncs.com/xxx.jpg.

  • input.last_frame_url (string, optional): The URL of the last frame image. For information about how to pass the parameter, see the last_frame_url parameter. Example: https://help-static-aliyun-doc.aliyuncs.com/xxx.jpg.

  • parameters.resolution (string, optional): The resolution tier of the generated video. Fine-tuned models support 480P and 720P. The default value is 720P. Example: 720P.

  • parameters.prompt_extend (boolean, optional): Specifies whether to enable prompt rewriting. When you call a fine-tuned LoRA model, set this parameter to false.

Step 4.2: Query the result based on the task_id

Use the task_id to poll the task status until task_status becomes SUCCEEDED, and then get the video URL.

Request example

Replace 86ecf553-d340-4e21-xxxxxxxxx with the actual task_id.
curl -X GET https://dashscope-intl.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

Response example

The video URL is valid for 24 hours. Download the video promptly.
{
    "request_id": "c87415d2-f436-41c3-9fe8-xxxxxx",
    "output": {
        "task_id": "a017e64c-012b-431a-84fd-xxxxxx",
        "task_status": "SUCCEEDED",
        "submit_time": "2025-11-12 11:03:33.672",
        "scheduled_time": "2025-11-12 11:03:33.699",
        "end_time": "2025-11-12 11:04:07.088",
        "orig_prompt": "",
        "video_url": "https://dashscope-result-sh.oss-cn-shanghai.aliyuncs.com/xxx.mp4?Expires=xxxx"
    },
    "usage": {
        "duration": 5,
        "video_count": 1,
        "SR": 480
    }
}
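Once task_status reaches SUCCEEDED, the video URL can be pulled out of the query response. The field names below follow the response example above; the `extract_video_url` function name is illustrative:

```python
# Illustrative extraction of the result; field names follow the response
# example above. Remember the video URL expires after 24 hours.
def extract_video_url(response):
    output = response["output"]
    if output["task_status"] != "SUCCEEDED":
        return None  # still PENDING/RUNNING, or the task failed
    return output["video_url"]
```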

Build a custom dataset

In addition to using the sample data in this topic to experience the fine-tuning process, you can also build your own dataset for fine-tuning.

A dataset must include a training set (required) and can optionally include a validation set (which can be automatically split from the training set). Package all files into a .zip file. We recommend that you use only English letters, digits, underscores (_), or hyphens (-) in the filename.

Dataset format

Training set: Required

Image-to-video based on the first frame

The training set includes first frame images, training videos, and an annotation file (data.jsonl).

  • Sample training set: wan-i2v-training-dataset.zip.

  • ZIP package directory structure:

    wan-i2v-training-dataset.zip
    ├── data.jsonl        # Must be named data.jsonl, max size 20 MB
    ├── image_1.jpeg      # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
    ├── video_1.mp4       # Max video resolution 4096x4096, supports MP4, MOV formats
    ├── image_2.jpeg
    └── video_2.mp4
  • Annotation file (data.jsonl): Each line represents a training data entry and must be a JSON object. The structure of a training data entry is as follows:

    {
        "prompt": "The video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The brick wall in the background is covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
        "first_frame_path": "image_1.jpg",
        "video_path": "video_1.mp4"        
    }
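The annotation file can be generated programmatically instead of hand-editing JSON Lines. A sketch for the first-frame format; `write_annotations` is an illustrative name, and the prompt below is truncated for brevity:

```python
import json

# Illustrative generator for the first-frame annotation file (data.jsonl).
def write_annotations(entries, path="data.jsonl"):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

write_annotations([
    {
        "prompt": "The video begins showing a young woman... Then the "
                  "s86b5p money rain effect begins...",
        "first_frame_path": "image_1.jpeg",
        "video_path": "video_1.mp4",
    },
])
```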

Image-to-video based on the first and last frames

The training set includes first frame images, last frame images, training videos, and an annotation file (data.jsonl).

  • Sample training set: wan-kf2v-training-dataset.zip.

  • ZIP package directory structure:

    wan-kf2v-training-dataset.zip
    ├── data.jsonl                # Must be named data.jsonl, max size 20 MB
    ├── image/                    # Stores first and last frame images
    │   ├── image_1_first.jpg     # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
    │   └── image_1_last.png
    └── video/                    # Stores video files as "training targets"
        ├── video_1.mp4           # Max video resolution 4096x4096, supports MP4, MOV formats
        └── video_2.mov
  • Annotation file (data.jsonl): Each line represents a training data entry and must be a JSON object. The structure of a training data entry is as follows:

    {
        "prompt": "The video begins by showing a young woman in an outdoor setting. She has short, curly dark brown hair, a smile on her face, and looks very friendly. She is wearing a black polo shirt with colorful floral embroidery, with a background of green vegetation and distant mountains. Then she begins the s86b5p transformation.",
        "first_frame_path": "image/image_1_first.jpg",
        "last_frame_path": "image/image_1_last.jpg", 
        "video_path": "video/video_1.mp4"  
    }

Validation set: Optional

Image-to-video based on the first frame

The validation set includes first frame images and an annotation file (data.jsonl). You do not need to provide videos. At each evaluation node, the training job automatically calls the model service to generate preview videos using the images and prompts from the validation set.

  • Sample validation set: wan-i2v-valid-dataset.zip.

  • ZIP package directory structure:

    wan-i2v-valid-dataset.zip
    ├── data.jsonl       # Must be named data.jsonl, max size 20 MB
    ├── image_1.jpeg     # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
    └── image_2.jpeg
  • Annotation file (data.jsonl): Each line represents a validation data entry and must be a JSON object. The structure of a validation data entry is as follows:

    {
        "prompt": "The video begins showing a scene of a young man standing in front of a cityscape. He is wearing a black and white checkered jacket over a black hoodie, with a smile on his face and a confident expression. The background is a city skyline at sunset, with a famous domed building and layered roofs visible in the distance, the sky filled with clouds showing warm orange-yellow hues. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
        "first_frame_path": "image_1.jpg"
    }

Image-to-video based on the first and last frames

The validation set includes first frame images, last frame images, and an annotation file (data.jsonl). You do not need to provide videos. At each evaluation node, the training job automatically calls the model service to generate preview videos using the images and prompts from the validation set.

  • Sample validation set: wan-kf2v-valid-dataset.zip.

  • ZIP package directory structure:

    wan-kf2v-valid-dataset.zip
    ├── data.jsonl                 # Must be named data.jsonl, max size 20 MB
    └── image/                     # Stores first and last frame images
        ├── image_1_first.jpg      # Max image resolution 4096x4096, supports BMP, JPEG, PNG, WEBP formats
        └── image_1_last.jpg
  • Annotation file (data.jsonl): Each line represents a validation data entry and must be a JSON object. The structure of a validation data entry is as follows:

    {
        "prompt": "The video begins showing a scene of a young man standing in front of a cityscape. He is wearing a black and white checkered jacket over a black hoodie, with a smile on his face and a confident expression. The background is a city skyline at sunset, with a famous domed building and layered roofs visible in the distance, the sky filled with clouds showing warm orange-yellow hues. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.",
        "first_frame_path": "image/image_1_first.jpg",
        "last_frame_path": "image/image_1_last.jpg",
    }

Data volume and limitations

  • Data volume: Provide at least 10 data entries. The more training data, the better. We recommend 20 to 100 entries for stable results.

  • ZIP package: The total size of the package must be 1 GB or less when uploaded using an API.

  • Training image requirements:

    • Supported formats are BMP, JPEG, PNG, and WEBP.

    • Image resolution must be 4096×4096 or less.

    • There is no hard limit on the size of a single image file. The system automatically performs pre-processing.

  • Training video requirements:

    • Supported formats are MP4 and MOV.

    • Video resolution must be 4096×4096 or less.

    • There is no hard limit on the size of a single video file. The system automatically performs pre-processing.

    • Maximum duration of a single video: 5 seconds for wan2.2 models; 10 seconds for wan2.5 models; 10 seconds for wan2.6 models.
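The limits above can be checked locally before packaging the ZIP. A sketch with the documented constraints hard-coded; `check_dataset` and its exact error messages are illustrative, not an official tool:

```python
import os

# Documented limits, hard-coded for an illustrative local pre-check.
ALLOWED_IMAGES = {".bmp", ".jpeg", ".jpg", ".png", ".webp"}
ALLOWED_VIDEOS = {".mp4", ".mov"}
MAX_ZIP_BYTES = 1 * 1024**3  # 1 GB upload limit via the API

def check_dataset(entries, zip_bytes):
    """Validate annotation entries and package size before uploading."""
    if len(entries) < 10:
        raise ValueError("provide at least 10 entries (20-100 recommended)")
    if zip_bytes > MAX_ZIP_BYTES:
        raise ValueError("ZIP exceeds the 1 GB API upload limit")
    for entry in entries:
        img_ext = os.path.splitext(entry["first_frame_path"])[1].lower()
        vid_ext = os.path.splitext(entry["video_path"])[1].lower()
        if img_ext not in ALLOWED_IMAGES:
            raise ValueError(f"unsupported image format: {img_ext}")
        if vid_ext not in ALLOWED_VIDEOS:
            raise ValueError(f"unsupported video format: {vid_ext}")
```

Resolution and duration limits require decoding the media, so they are left to the platform's own preprocessing here.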

Data collection and cleansing

1. Determine the fine-tuning scenario

The fine-tuning scenarios for image-to-video generation supported by Wan include the following:

  • Fixed video effects: Teach the model a specific visual change, such as a carousel or a magical transformation.

  • Fixed character actions: Improve the model's ability to reproduce specific body movements, such as particular dance moves or martial arts forms.

  • Fixed video camera movements: Replicate complex camera language, such as fixed templates for push-pull, pan-tilt, and surround shots.

2. Obtain raw materials
  • AI generation and selection: Use the Wan foundation model to generate videos in batches, then manually select the high-quality samples that best match the target effect. This is the most common method.

  • Live shooting: If your goal is to achieve highly realistic interactive scenes (such as hugs or handshakes), using live-shot footage is the best choice.

  • 3D software rendering: For effects or abstract animations that require detailed control, we recommend using 3D software (such as Blender or C4D) to create the materials.

3. Cleanse the data

  • Consistency

    • Positive requirement: Core features must be highly consistent. For example, to train a "360-degree rotation," all videos must rotate clockwise at a roughly consistent speed.

    • Negative example: Mixed directions. The dataset contains both clockwise and counter-clockwise rotations, so the model does not know which direction to learn.

  • Diversity

    • Positive requirement: The richer the subjects and scenes, the better. Cover different subjects (men, women, old, young, cats, dogs, buildings) and different compositions (close-ups, long shots, high-angle, low-angle). The resolution and aspect ratio should also be as diverse as possible.

    • Negative example: A single scene or subject. If all videos show "a person in red clothes rotating in front of a white wall," the model will mistakenly treat "red clothes" and "white wall" as part of the effect and will not rotate if the clothes are changed.

  • Balance

    • Positive requirement: The proportions of different data types are balanced. If multiple styles are included, their quantities should be roughly equal.

    • Negative example: Severely imbalanced proportions. If 90% of the videos are portrait and 10% are landscape, the model may perform poorly when generating landscape videos.

  • Purity

    • Positive requirement: Clean and clear images. Use raw materials without interference.

    • Negative example: Interfering elements. If the video contains captions, station logos, watermarks, obvious black bars, or noise, the model might learn the watermark as part of the effect.

  • Duration

    • Positive requirement: Material duration ≤ target duration. If you expect to generate a 5-second video, crop the material to 4–5 seconds.

    • Negative example: Material that is too long. Expecting a 5-second video but feeding the model 8-second material results in incomplete action learning and a sense of truncation.

Video annotation: Write prompts for videos

In the dataset's annotation file (data.jsonl), each video has a corresponding prompt. The prompt describes the visual content of the video. The quality of the prompt directly determines what the model learns.

Prompt example

The video begins showing a young woman standing in front of a brick wall covered with ivy. She has long, smooth reddish-brown hair, wearing a white sleeveless dress, a shiny silver necklace, and a smile on her face. The background is a brick wall covered with green vines, appearing rustic and natural. Then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.

Prompt writing formula

Prompt = [Subject description] + [Background description] + [Trigger word] + [Motion description]

  • Subject description (required): Describes the person or object originally present in the scene. Example: "The video begins showing a young woman..."

  • Background description (required): Describes the environment where the subject is located. Example: "The background is a brick wall covered with green vines..."

  • Trigger word (recommended): A rare word with no actual meaning. Example: s86b5p or m01aa.

  • Motion description (recommended): Describes in detail the motion changes that occur during the effect in the video. Example: "Countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain..."
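The formula can be expressed as a small template function. A sketch using this guide's trigger word s86b5p; the `build_prompt` helper is illustrative:

```python
# Illustrative template following the formula:
# Prompt = subject + background + trigger word + motion description.
def build_prompt(subject, background, trigger, motion):
    return (f"The video begins showing {subject}. {background}. "
            f"Then the {trigger} {motion}")

prompt = build_prompt(
    "a young woman standing in front of a brick wall covered with ivy",
    "The background is a brick wall covered with green vines",
    "s86b5p",
    "money rain effect begins, countless US dollar bills pour down like a torrential rain...",
)
```

Keeping the trigger word and motion description in fixed slots makes it easy to hold them identical across samples, which is exactly what the consistency principle below requires.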

About "trigger words"
  • What is a trigger word?

    It acts as a "visual anchor". Because many complex dynamics (such as a special dance trajectory or an original light and shadow change) are difficult to describe with words, this word is used to forcibly tell the model: when you see s86b5p, you must generate this specific visual effect.

  • Why use it?

    Model fine-tuning establishes a mapping relationship between "text" and "video features." The trigger word is what binds the "indescribable effect" to a unique word, allowing the model to lock onto the target.

  • Since there is a trigger word, why still describe the motion in detail?

    The two have different roles and work better together.

    • Motion description: Responsible for explaining "what is happening in the scene." It tells the model the basic physical actions and logic, and the motion descriptions for multiple samples are usually consistent.

    • Trigger word: Responsible for explaining "what the action specifically looks like." It represents the unique changes and features that words cannot describe.

How to write good prompts

Follow the consistency principle for effect descriptions

For all samples containing the effect, the motion description part of the effect should be as consistent as possible. This rule applies to both the training set and the validation set.

  • Purpose: When the model finds that s86b5p is always followed by a fixed description and the scene always shows a money rain, it will remember: s86b5p = money rain visual effect.

  • Example: Whether it is a "young woman" or a "man in a suit," as long as it is a money rain effect, the second half of the prompt is uniformly written as: "...then the s86b5p money rain effect begins, countless US dollar bills pour down like a torrential rain..."

    The prompts below share an identical effect description (the text from the trigger word onward), while the subject and background vary per sample:

    • Training set sample 1: The video begins showing a young woman standing in front of a brick wall... (environment description omitted)...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding her. The bills continue to fall, she stretches her arms upward, expression surprised, completely immersed in this wild money rain.

    • Training set sample 2: The video begins showing a man in a suit in a high-end restaurant... (environment description omitted)...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall, he stretches his arms upward, expression surprised, completely immersed in this wild money rain.

    • Validation set sample 1: The video begins showing a young child in front of a cityscape... (environment description omitted)...then the s86b5p money rain effect begins, countless huge-sized US dollar bills (beige background/dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall while the camera slowly zooms in, he stretches his arms upward, neck slightly tilted back, expression surprised, completely immersed in this wild money rain.
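The consistency principle above can be enforced mechanically. The sketch below (an illustrative helper, not an official utility) composes each training prompt from a sample-specific opening plus one shared effect-description template, so the wording after the trigger word never drifts between samples:

```python
# Illustrative sketch: keep the effect description identical across samples by
# composing each prompt from a per-sample opening plus a fixed shared template.
# Trigger word and template text follow the examples in this document.
TRIGGER = "s86b5p"
EFFECT_TEMPLATE = (
    "then the " + TRIGGER + " money rain effect begins, countless huge-sized "
    "US dollar bills (beige background/dark green patterns) pour down like a "
    "torrential rain, densely hitting and surrounding {pronoun}."
)

def build_prompt(opening: str, pronoun: str) -> str:
    """Join the per-sample opening with the shared effect description."""
    return opening + "..." + EFFECT_TEMPLATE.format(pronoun=pronoun)

prompt = build_prompt(
    "The video begins showing a young woman standing in front of a brick wall",
    pronoun="her",
)
```

Only the opening (subject and background) and the pronoun change per sample; everything after the trigger word is emitted verbatim from the template.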

Generate prompts with AI assistance

To obtain high-quality prompts, we recommend using a multimodal large language model (LLM) such as Qwen-VL to assist in generating prompts for videos.

  1. Use AI to help generate initial descriptions

    1. Brainstorm (find inspiration): If you do not know how to describe the effect, you can let the AI brainstorm first.

      • Directly send "Describe the video content in detail" and observe the model's output.

      • Focus on the words the model uses to describe the motion trajectory of the effect (such as "pour down like a torrential rain," "camera slowly zooms in"). These words can be used as material for subsequent optimization.

    2. Fixed sentence structure (standardize output): Once you have a general idea, you can design a fixed sentence structure based on the annotation formula to guide the AI in generating prompts that conform to the format.

      Sample code

      For more information about code calls, see Image and video understanding.
      import os
      from openai import OpenAI
      
      client = OpenAI(
          # The API keys for the Singapore and Beijing regions are different. To obtain an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
          # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
          api_key=os.getenv("DASHSCOPE_API_KEY"),
          # The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
          base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
      )
      completion = client.chat.completions.create(
          model="qwen3-vl-plus",
          messages=[
              {"role": "user","content": [{
                  # When passing a video file directly, set the value of type to video_url
                  # When using the OpenAI SDK, one frame is extracted every 0.5 seconds from the video file by default, and this cannot be changed. To customize the frame extraction frequency, use the DashScope SDK.
                  "type": "video_url",            
                  "video_url": {"url": "https://cloud.video.taobao.com/vod/Tm1s_RpnvdXfarR12RekQtR66lbYXj1uziPzMmJoPmI.mp4"}},
                  {"type": "text", "text": "Please carefully analyze the video and generate a detailed video description according to the following fixed sentence structure.\n"
                                          "Sentence template: The video begins showing [subject description]. The background is [background description]. Then the s86b5p melting effect begins, [detailed motion description].\n"
                                          "Requirements:\n"
                                          "1. [Subject description]: Describe in detail the person or object originally present in the scene, including details such as appearance, clothing, and expression.\n"
                                          "2. [Background description]: Describe in detail the environment where the subject is located, including details such as environment, lighting, and weather.\n"
                                          "3. [Motion description]: Describe in detail the dynamic change process when the effect occurs (such as how objects move, how lighting changes, how the camera moves).\n"
                                          "4. All content must be naturally integrated into the sentence structure. Do not retain the '[ ]' symbols, and do not add any text unrelated to the description."}]
               }]
      )
      print(completion.choices[0].message.content)

  2. Refine the effect template

    1. We recommend running this process repeatedly on multiple samples with the same effect to identify common, accurate phrases used to describe the effect. From these, extract a universal "effect description."

    2. Copy and paste this standardized effect description into all datasets for that effect.

    3. Keep the unique "subject" and "background" descriptions for each sample, but replace the "effect description" part with the unified template.

  3. Manual check

    AI may hallucinate or make recognition errors. Perform a final manual check, for example, to confirm that the subject and background descriptions match the actual scene.
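The "refine the effect template" step can be sketched as a small heuristic (hypothetical helper, not part of the Model Studio API): split each AI-generated description at the trigger word and keep the most frequent effect tail as the unified template:

```python
from collections import Counter

TRIGGER = "s86b5p"

def most_common_effect_description(descriptions):
    """Split each AI-generated description at the trigger word and return the
    most frequent effect tail, to use as the unified effect template."""
    tails = [d.split(TRIGGER, 1)[1].strip() for d in descriptions if TRIGGER in d]
    if not tails:
        raise ValueError("no description contains the trigger word")
    return Counter(tails).most_common(1)[0][0]

# Outputs collected from several runs of the prompt-generation step (made-up examples).
samples = [
    "... then the s86b5p money rain effect begins, bills pour down like a torrential rain.",
    "... then the s86b5p money rain effect begins, bills pour down like a torrential rain.",
    "... then the s86b5p money rain effect begins, bills drift down slowly.",
]
template = most_common_effect_description(samples)
```

In practice you would still review the winning phrasing manually, then paste it into every sample for that effect.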

Evaluate the model using a validation set

Specify a validation set

A fine-tuning job must include a training set, while a validation set is optional. You can choose to have the system automatically split the validation set or manually upload one. The specific methods are as follows:

Method 1: Do not upload a validation set (system automatically splits)

When you create a fine-tuning job, if you do not pass the validation_file_ids parameter to specify a validation set, the system automatically splits a portion of the training set to use as the validation set based on the following two hyperparameters:

  • split: The proportion of the training set to be used for training. For example, 0.9 means 90% of the data is used for training, and the remaining 10% is used for validation.

  • max_split_val_dataset_sample: The maximum number of samples for the automatically split validation set.

Validation set splitting rule: The system takes the smaller value between total dataset size × (1 - split) and max_split_val_dataset_sample.

  • Example: Assume you only upload a training set with 100 data entries, split=0.9 (meaning 10% for validation), and max_split_val_dataset_sample=5.

    • Theoretical split: 100 × 10% = 10 entries.

    • Actual split: min(10, 5) = 5. Therefore, the system takes only 5 entries for the validation set.
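The splitting rule above can be written out directly. This is a minimal sketch; the exact rounding of the theoretical split is an assumption, since the document does not specify it:

```python
def auto_validation_size(total, split, max_split_val_dataset_sample):
    """System rule from this document: min(total x (1 - split), cap).
    Rounding of the theoretical split to a whole number is an assumption."""
    theoretical = round(total * (1 - split))
    return min(theoretical, max_split_val_dataset_sample)

# Example from this document: 100 entries, split=0.9, cap of 5.
val = auto_validation_size(total=100, split=0.9, max_split_val_dataset_sample=5)
train = 100 - val
```

With these inputs the validation set gets 5 entries and the remaining 95 are used for training.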

Method 2: Upload a validation set (specify using validation_file_ids)

If you want to use your own prepared data to evaluate checkpoints instead of relying on the system's random split, you can upload a custom validation set.

Note: Once you choose to upload a validation set, the system will completely ignore the automatic splitting rule and use only the data you uploaded for validation.

Procedure: Upload a validation set

  1. Prepare the validation set: Package your validation data into a separate .zip file. For more information, see Validation set format.

  2. Upload the validation set: Call the Upload dataset API to upload the validation set .zip file and obtain a unique file ID.

  3. Specify a validation set during job creation: When you call the Create fine-tuning job API, enter this file ID in the validation_file_ids parameter.

    {
        "model":"wan2.5-i2v-preview",
        "training_file_ids":[ "<file_ID_of_the_training_set>" ],
        "validation_file_ids": [ "<file_ID_of_the_custom_validation_set>" ],
        ...
    }
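The request body above can be assembled programmatically. A minimal sketch in Python (the file IDs are placeholders returned by the Upload dataset API, not values to type in by hand):

```python
import json

# Job-creation request body as shown in this document. File IDs are placeholders
# obtained from the Upload dataset API.
payload = {
    "model": "wan2.5-i2v-preview",
    "training_file_ids": ["<file_ID_of_the_training_set>"],
    # Passing validation_file_ids disables the automatic train/validation split.
    "validation_file_ids": ["<file_ID_of_the_custom_validation_set>"],
}
body = json.dumps(payload)
```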

Select the best checkpoint for deployment

During the training process, the system periodically saves "snapshots" of the model, known as checkpoints. By default, the system outputs the last checkpoint as the final fine-tuned model. However, checkpoints produced during the intermediate process may have better effects than the final version. You can select the most satisfactory one for deployment.

The system will run the checkpoint on the validation set and generate a preview video at the interval specified by the hyperparameter eval_epochs.

  • How to evaluate: Judge the effect by directly observing the generated preview videos.

  • Selection criteria: Find the checkpoint with the best effect and no action distortion.

Procedure

Step 1: View the preview effects generated by checkpoints
Step 1.1: Query the list of validated checkpoints

This API operation only returns checkpoints that have passed validation on the validation set and successfully generated preview videos. Those that failed validation will not be listed.

Request example

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>/validation-results' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' 

Response example

This API operation returns a list containing only the names of checkpoints that have successfully passed validation.

{
    "request_id": "da1310f5-5a21-4e29-99d4-xxxxxx",
    "output": [
        {
            "checkpoint": "checkpoint-160"
        },
        ...
    ]
}

Step 1.2: Query the validation results for a checkpoint

Select a checkpoint from the list returned in the previous step (for example, "checkpoint-160") to view its generated video effect.

Request example

  • <replace_with_fine-tuning_job_id>: Replace this with the value of the job_id response parameter from Create fine-tuning job.

  • <replace_with_selected_checkpoint>: Replace this with the value of the checkpoint, for example, "checkpoint-160".

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>/validation-details/<replace_with_selected_checkpoint>?page_no=1&page_size=10' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

Response example

The preview video URL is returned in the video_path field and is valid for 24 hours. Download the video promptly to view the effect. Repeat this step to compare the effects of multiple checkpoints and find the most satisfactory one.

{
    "request_id": "375b3ad0-d3fa-451f-b629-xxxxxxx",
    "output": {
        "page_no": 1,
        "page_size": 10,
        "total": 1,
        "list": [
            {
                "video_path": "https://finetune-swap-wulanchabu.oss-cn-wulanchabu.aliyuncs.com/xxx.mp4?Expires=xxxx",
                "prompt": "The video begins with a young man sitting in a cafe. He is wearing a beige Polo shirt, looking focused and slightly contemplative, with his fingers gently touching his chin. In front of him is a cup of hot coffee. The background is a wall with wooden stripes and a decorative sign. Then the s86b5p money rain effect begins, and countless enormous US dollar bills (beige with dark green patterns) pour down like a torrential rain, densely hitting and surrounding him. The bills continue to fall as he stretches his arms upward, neck slightly tilted back, with a surprised expression, completely immersed in this wild money rain.",
                "first_frame_path": "https://finetune-swap-wulanchabu.oss-cn-wulanchabu.aliyuncs.com/xxx.jpeg"
            }
        ]
    }
}
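A validation-details response shaped like the one above can be parsed as follows (a sketch with a made-up placeholder URL; collect the video_path links promptly, since they expire after 24 hours):

```python
# Illustrative: pull the preview-video URLs out of a validation-details
# response like the one above. The URL below is a made-up placeholder.
response = {
    "output": {
        "page_no": 1,
        "page_size": 10,
        "total": 1,
        "list": [
            {
                "video_path": "https://example.com/xxx.mp4?Expires=xxxx",
                "prompt": "The video begins ...",
                "first_frame_path": "https://example.com/xxx.jpeg",
            }
        ],
    }
}

video_urls = [item["video_path"] for item in response["output"]["list"]]
```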

Step 2: Export a checkpoint and get the model name for deployment
Step 2.1: Export the model

Assuming "checkpoint-160" has the best effect, the next step is to export it.

Request example

  • <replace_with_fine-tuning_job_id>: Replace this placeholder with the value of the response parameter job_id that is returned by the Create fine-tuning job operation.

  • <replace_with_selected_checkpoint>: Replace this with the value of the checkpoint, for example, "checkpoint-160".

  • <replace_with_exported_model_name_for_console_display>: The custom name for the model. This name is displayed only in the console. For example, "wan2.5-checkpoint-160". The name must be globally unique. You cannot use the same name for multiple exports. For more information about how to specify this parameter, see Export a checkpoint.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>/export/<replace_with_selected_checkpoint>?model_name=<replace_with_exported_model_name_for_console_display>' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

Response example

The response parameter output=true indicates that the export request has been successfully created.

{
    "request_id": "0817d1ed-b6b6-4383-9650-xxxxx",
    "output": true
}
Step 2.2: Query the new model name after export

Query the status of all checkpoints to confirm that the export is complete and to get its exclusive new model name for deployment (model_name).

Request example

  • <replace_with_fine-tuning_job_id>: The value of the job_id response parameter from the Create fine-tuning job operation.

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/fine-tunes/<replace_with_fine-tuning_job_id>/checkpoints' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

Response example

In the returned list, locate the exported checkpoint (such as checkpoint-160). When its status becomes SUCCEEDED, it means the export was successful. The model_name field returned at this time is the new model name after export.

{
    "request_id": "b0e33c6e-404b-4524-87ac-xxxxxx",
    "output": [
         ...,
        {
            "create_time": "2025-11-11T13:27:29",
            "full_name": "ft-202511111122-496e-checkpoint-160",
            "job_id": "ft-202511111122-496e",
            "checkpoint": "checkpoint-160",                             
            "model_name": "xxxx-ft-202511111122-xxxx-c160", // Important field, will be used for model deployment and calling
            "model_display_name": "xxxx-ft-202511111122-xxxx", 
            "status": "SUCCEEDED" // Successfully exported checkpoint
        },
        ...
        
    ]
}
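Picking the exported model name out of a checkpoints response like the one above can be sketched as a small helper (illustrative only; the values below are placeholders matching the sample response):

```python
def exported_model_name(checkpoints, checkpoint):
    """Return model_name once the given checkpoint's status is SUCCEEDED;
    None means the export is still in progress or failed."""
    for entry in checkpoints:
        if entry.get("checkpoint") == checkpoint and entry.get("status") == "SUCCEEDED":
            return entry["model_name"]
    return None

# Shaped like the "output" list in the response above (placeholder values).
output = [
    {
        "checkpoint": "checkpoint-160",
        "status": "SUCCEEDED",
        "model_name": "xxxx-ft-202511111122-xxxx-c160",
    },
]
name = exported_model_name(output, "checkpoint-160")
```

If the helper returns None, poll the checkpoints API again until the status becomes SUCCEEDED.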
Step 3: Deploy and call the model

After successfully exporting the checkpoint and obtaining the model_name, perform the following operations:

  • Model deployment: For the input parameter model_name, enter the value obtained from the export.

  • Model invocation: Refer to the API documentation and call the deployed model.

Going live

In a production environment, if the initially trained model performs poorly (for example, with corrupted frames, indistinct effects, or inaccurate actions), you can improve it along the following dimensions:

1. Check the data and prompts

  • Data consistency: Consistency is key. Check for "bad samples" whose motion direction is reversed or whose style differs greatly from the rest.

  • Number of samples: We recommend increasing the number of high-quality data entries to more than 20.

  • Prompt: Ensure the trigger word is a meaningless rare word (such as s86b5p) and avoid using common words (such as running) to prevent interference.

2. Adjust hyperparameters: For parameter descriptions, see Hyperparameters.

  • n_epochs (number of training epochs)

    • Default value: 400. We recommend using the default value. To adjust it, follow the principle of "Total training steps ≥ 800".

    • Formula for total steps: steps = n_epochs × ceil(training set size / batch_size).

    • Therefore, the formula for the minimum n_epochs is: n_epochs = 800 / ceil(dataset size / batch_size).

    • Example: Assume the training set has 5 data entries and you are using the Wan2.5 model (batch_size=2).

      • Training steps per epoch: 5 / 2 = 2.5, which rounds up to 3. Total number of training epochs: n_epochs = 800 / 3 ≈ 267. This is the recommended minimum value. You can increase it as needed for your business, for example, to 300.

  • learning_rate, batch_size: We recommend using the default values. You usually do not need to modify them.
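The minimum-n_epochs rule above can be computed directly. A short sketch using the formulas from this section:

```python
import math

def min_n_epochs(dataset_size, batch_size, min_total_steps=800):
    """Minimum n_epochs so that total steps = n_epochs * ceil(dataset_size / batch_size)
    reaches min_total_steps, per the rule of thumb in this section."""
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return math.ceil(min_total_steps / steps_per_epoch)

# Example from this document: 5 samples, batch_size=2 (Wan2.5) -> 267 epochs minimum.
n = min_n_epochs(5, 2)
```

As in the worked example, 5 samples with batch_size 2 give 3 steps per epoch, so at least 267 epochs are needed to reach 800 total steps.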

Billing

  • Model training: Billed.

  • Model deployment: Free of charge.

  • Model calling: Billed.

    • You are charged at the standard invocation price of the fine-tuned foundation model. For more information, see Model pricing.

API reference

Video generation model fine-tuning API reference

FAQ

Q: How do I calculate the data volume for the training and validation sets?

A: A training set is required, and a validation set is optional. The calculation method is as follows:

  • If you do not pass a validation set: The uploaded training set is the "total dataset size." The system automatically splits a portion of the training set for validation.

    • Size of the validation set = min(Total dataset size × (1 − split), max_split_val_dataset_sample). For a calculation example, see Specify a validation set.

    • Number of training set entries = Total dataset size − Number of validation set entries.

  • If you upload a validation set: The system no longer splits the training data for validation.

    • Number of training set entries = Data volume of the uploaded training set.

    • Number of validation set entries = Data volume of the uploaded validation set.

Q: How do I design a good trigger word?

A: The rules are as follows:

  • Use a meaningless combination of letters, such as sksstyle or a8z2_bbb.

  • Avoid using common English words (such as beautiful, fire, dance), as this will interfere with the model's original understanding of these words.

Q: Can fine-tuning change the video resolution or duration?

A: No. Fine-tuning learns content and motion, not specifications. The format of the output video (resolution, frame rate, maximum duration) is still determined by the foundation model.