Introduction to model fine-tuning

You can use model fine-tuning in Alibaba Cloud Model Studio if a model's performance does not meet your expectations after you have tried optimization methods such as prompt engineering and plugin calls. As a core strategy for improving model performance, model fine-tuning can significantly enhance a model's capabilities in specific industries or business scenarios, align its outputs with human preferences, and reduce output latency. Model fine-tuning includes three training methods: supervised fine-tuning (SFT), continual pre-training (CPT), and direct preference optimization (DPO).

Model fine-tuning is an important method for optimizing model performance. It can:

- Improve model performance in specific industries or for specific business needs
- Reduce model output latency
- Suppress model hallucination
- Align the model with human values or preferences
- Replace larger models with fine-tuned lightweight models

During fine-tuning, the model learns business-specific or scenario-specific features from the training data, such as knowledge, tone, expression styles, and self-awareness. Because the model has already learned from many industry-specific or scenario-specific examples during fine-tuning, its zero-shot or one-shot performance after fine-tuning can exceed its few-shot performance before fine-tuning. Shorter prompts in turn save input tokens and reduce model output latency.

Model fine-tuning process

For more information, see:

Supported models

Singapore

Text generation

Model name	Model code	SFT full-parameter training (sft)	SFT efficient training (efficient_sft)
Qwen3-14B	qwen3-14b	×	Supported

Visual understanding (Qwen-VL)

Model name	Model code	SFT full-parameter training (sft)	SFT efficient training (efficient_sft)
-	-	-	-

North China 2 (Beijing)

Text generation

Model service	Model code	CPT full-parameter training (cpt)	SFT full-parameter training (sft)	SFT efficient training (efficient_sft)	DPO full-parameter training (dpo_full)	DPO efficient training (dpo_lora)
Qwen3.6-Flash-2026-04-16	qwen3.6-flash-2026-04-16	×	Supported	×	×	×

Qwen3.5-27B	qwen3.5-27b	×	Supported	Supported	×	×
Qwen3.5-9B	qwen3.5-9b	×	Supported	Supported	×	×
Qwen3.5-Flash-2026-02-23	qwen3.5-flash-2026-02-23	×	Supported	×	×	×

Qwen3-32B	qwen3-32b	Supported	Supported	Supported	Supported	Supported
Qwen3-30B-A3B-Instruct-2507	qwen3-30b-a3b-instruct-2507	Supported	Supported	Supported	×	×
Qwen3-14B	qwen3-14b	×	Supported	Supported	Supported	Supported
Qwen3-8B	qwen3-8b	×	Supported	Supported	Supported	Supported
Qwen3-4B-Instruct-2507	qwen3-4b-instruct-2507	Supported	Supported	Supported	Supported	Supported
Qwen3-1.7B	qwen3-1.7b	Supported	Supported	Supported	Supported	Supported
Qwen3-0.6B	qwen3-0.6b	Supported	Supported	Supported	Supported	Supported

Qwen2.5-72B-Instruct	qwen2.5-72b-instruct	Supported	Supported	Supported	Supported	Supported
Qwen2.5-32B-Instruct	qwen2.5-32b-instruct	Supported	Supported	Supported	Supported	Supported
Qwen2.5-14B-Instruct	qwen2.5-14b-instruct	Supported	Supported	Supported	Supported	Supported
Qwen2.5-7B-Instruct	qwen2.5-7b-instruct	Supported	Supported	Supported	Supported	Supported

Qwen-Plus-Character-2025-11-06	qwen-plus-character-2025-11-06	×	Supported	Supported	Supported	Supported

Visual understanding (Qwen-VL)

Model service	Model code	CPT full-parameter training (cpt)	SFT full-parameter training (sft)	SFT efficient training (efficient_sft)	DPO full-parameter training (dpo_full)	DPO efficient training (dpo_lora)
Qwen3-VL-8B-Instruct	qwen3-vl-8b-instruct	×	Supported	Supported	×	×
Qwen3-VL-8B-Thinking	qwen3-vl-8b-thinking	×	Supported	Supported	×	×
Qwen3-VL-4B-Instruct	qwen3-vl-4b-instruct	×	Supported	Supported	×	×

Qwen2.5-VL-72B-Instruct	qwen2.5-vl-72b-instruct	×	Supported	Supported	×	×
Qwen2.5-VL-32B-Instruct	qwen2.5-vl-32b-instruct	×	Supported	Supported	×	×
Qwen2.5-VL-7B-Instruct	qwen2.5-vl-7b-instruct	×	Supported	Supported	×	×

Comparison of tuning methods

Feature	CPT (Continual Pre-training)	SFT (Supervised Fine-tuning)	DPO (Direct Preference Optimization)
Summary	Supplements knowledge (Injects domain knowledge)	Learns to perform tasks (Follows instructions)	Performs tasks better (Aligns with human preferences)
Input data	10 million+ tokens of unlabeled domain text	Over 1,000 high-quality "question-answer" pairs	100+ sets of "better-worse" response pairs for the same instruction
Core objective	Domain adaptation. Learns specialized vocabulary and facts.	Teaches the model conversation formats and task execution capabilities.	Makes model outputs better align with human values and preferences.
Learning method	Self-supervised learning (Predicts the next word)	Supervised learning (Imitates the ground truth)	Direct preference learning (Increases the probability of good responses and decreases the probability of bad responses)
Model stage	Typically before SFT	After CPT and before DPO	Typically after SFT, as the final step for alignment.
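
As a quick sanity check before choosing a method, the rule-of-thumb data volumes in the table can be encoded in a small lookup. This is only a minimal sketch: `MIN_DATA` and `meets_minimum` are hypothetical names, and the thresholds are the approximate figures from the table, not hard limits enforced by Model Studio.

```python
# Rule-of-thumb minimum data volumes taken from the comparison table.
# The names below are illustrative only and are not part of any Model Studio API.
MIN_DATA = {
    "cpt": (10_000_000, "tokens of unlabeled domain text"),
    "sft": (1_000, "question-answer pairs"),
    "dpo": (100, "better-worse response pairs"),
}

def meets_minimum(method: str, amount: int) -> bool:
    """Returns True if 'amount' reaches the suggested minimum for 'method'."""
    minimum, _unit = MIN_DATA[method]
    return amount >= minimum

for method, amount in [("cpt", 2_500_000), ("sft", 1_200), ("dpo", 150)]:
    minimum, unit = MIN_DATA[method]
    status = "OK" if meets_minimum(method, amount) else "below the suggested minimum"
    print(f"{method}: {amount:,} / {minimum:,} {unit} -> {status}")
```

A dataset that falls below these figures may still train, but the table suggests the result is less likely to generalize.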

Comparison of training patterns

Feature	Full-parameter training	Efficient training (LoRA, recommended)
Scenarios	The model needs to acquire new capabilities; achieving optimal global performance.	Optimizing model performance for specific scenarios; cost-sensitive and time-sensitive scenarios.
Training time	Longer, with slower convergence.	Shorter, with faster convergence.

Billing

Billing method

Pay-as-you-go based on the amount of training data

Billing formula

Model training fee = (Total tokens in training data + Total tokens in mixed training data) × Number of epochs × Training unit price (Minimum billing unit: 1 token)

You can view the estimated training fee at the bottom of the Model Fine-tuning console and click Computing Details to view the total number of training tokens, number of epochs, and training unit price.
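
To make the billing formula concrete, the following sketch applies it to a made-up job. The helper name `estimate_training_fee` and the token counts are assumptions for illustration; the unit price is the Qwen3-14B price listed for Singapore. The console estimate remains the reference for actual charges.

```python
from decimal import Decimal

def estimate_training_fee(training_tokens, mixed_tokens, epochs, price_per_1k_tokens):
    """Applies the billing formula: (training + mixed tokens) x epochs x unit price."""
    billed_tokens = (training_tokens + mixed_tokens) * epochs
    # The unit price is quoted per 1,000 tokens, so divide by 1,000.
    return billed_tokens * Decimal(price_per_1k_tokens) / 1000

# Hypothetical job: 2,000,000 training tokens, 500,000 mixed-data tokens, 3 epochs,
# at a unit price of $0.0016 per 1,000 tokens.
fee = estimate_training_fee(2_000_000, 500_000, 3, "0.0016")
print(f"Estimated training fee: ${fee:.2f}")  # prints "Estimated training fee: $12.00"
```

`Decimal` is used instead of floating-point numbers so that small per-token prices do not accumulate rounding error.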

Training unit price

The following table lists the training unit prices for pre-built models. The training unit price for a custom model is the same as that for the corresponding pre-built model.

Singapore

Qwen

Model service	Model code	Price
Qwen3-14B	qwen3-14b	$0.0016/1,000 tokens

Qwen-VL

Model service	Model code	Price
-	-	-

Calculate tokens for images and videos

Images

Formula: Image Tokens = h_bar * w_bar / token_pixels + 2

h_bar, w_bar: The height and width of the scaled image. Before processing an image, the model performs pre-processing to scale it down to a specific pixel limit. This limit depends on the values of the max_pixels and vl_high_resolution_images parameters. For more information, see Process high-resolution images.
token_pixels: The pixel value corresponding to each visual token. This varies by model:
- qwen3.7-series, qwen3.6-series, qwen3.5-series, Qwen3-VL, qwen-vl-max, and qwen-vl-plus: Each token corresponds to 32x32 pixels.
- QVQ and other Qwen2.5-VL models: Each token corresponds to 28x28 pixels.

The following code demonstrates the approximate image scaling logic used by the model. Use it to estimate the tokens for an image. For actual billing, refer to the API response.

import math
from PIL import Image  # pip install Pillow

def smart_size(image_path, max_pixels, vl_high_resolution_images):
    """Calculates the scaled dimensions of an image based on model parameters to estimate image tokens."""
    # Read the original dimensions; the context manager closes the file promptly.
    with Image.open(image_path) as image:
        height, width = image.height, image.width

    # The scaling factor is 32 for models such as Qwen3.6, Qwen3.5, and Qwen3-VL. For other models, it is 28.
    factor = 32
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor

    # Token lower limit: 4 tokens
    min_pixels = 4 * factor * factor

    # If vl_high_resolution_images=True, the token upper limit is fixed at 16384, and max_pixels is ignored.
    if vl_high_resolution_images:
        max_pixels = 16384 * factor * factor

    # Constrains the total number of pixels to the range [min_pixels, max_pixels].
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor

    return h_bar, w_bar

if __name__ == "__main__":
    # Note: The values of max_pixels and vl_high_resolution_images must match the parameters passed when calling the model.
    h_bar, w_bar = smart_size("xxx/test.jpg", max_pixels=2560 * 32 * 32, vl_high_resolution_images=False)
    print(f"Scaled image dimensions: Height {h_bar}, Width {w_bar}")

    # Each image includes one <|vision_bos|> and one <|vision_eos|> token.
    token = int(h_bar * w_bar / (32 * 32)) + 2
    print(f"Number of image tokens: {token}")
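
For a quick check without an image file, the same scaling logic can be run on raw dimensions. The sketch below mirrors the code above; the function name `estimate_image_tokens` and the sample dimensions are assumptions for illustration, and the result is an estimate only.

```python
import math

def estimate_image_tokens(height, width, max_pixels, factor=32, high_resolution=False):
    """Estimates image tokens from raw dimensions using the scaling logic shown above."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    min_pixels = 4 * factor * factor
    if high_resolution:
        # Mirrors vl_high_resolution_images=True: max_pixels is ignored.
        max_pixels = 16384 * factor * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    # token_pixels is factor * factor; add 2 for the begin and end vision markers.
    return h_bar * w_bar // (factor * factor) + 2

limit = 2560 * 32 * 32
# 768 x 1024 already fits within the limit: 24 * 32 + 2 tokens.
print(estimate_image_tokens(768, 1024, limit))   # prints 770
# 3000 x 4000 is scaled down to 1376 x 1856: 43 * 58 + 2 tokens.
print(estimate_image_tokens(3000, 4000, limit))  # prints 2496
```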

Videos

Video files:

When processing a video file, the model first extracts frames and then calculates the total number of tokens for all video frames. Because this calculation is complex, you can use the following code to estimate the total token consumption for a video by providing its path:

# Before use, install: pip install opencv-python
import math
import os
import logging
import cv2

logger = logging.getLogger(__name__)

FRAME_FACTOR = 2

# For models such as Qwen3.6, Qwen3.5, Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and qwen-vl-plus-0710, the image scaling factor is 32.
IMAGE_FACTOR = 32

# For other models, the image scaling factor is 28.
# IMAGE_FACTOR = 28

# Maximum aspect ratio for video frames
MAX_RATIO = 200
# Pixel lower limit for video frames
VIDEO_MIN_PIXELS = 4 * 32 * 32
# Pixel upper limit for video frames. For the Qwen3-VL-Plus model, VIDEO_MAX_PIXELS is 640 * 32 * 32. For other models, it is 768 * 32 * 32.
VIDEO_MAX_PIXELS = 640 * 32 * 32

# If the user does not pass the FPS parameter, the default value is used for fps.
FPS = 2.0
# Minimum number of extracted frames
FPS_MIN_FRAMES = 4
# Maximum number of extracted frames (set based on the selected model)
FPS_MAX_FRAMES = 2000

# Total pixel budget for the whole video input; it can be overridden through the VIDEO_MAX_PIXELS environment variable. For the Qwen3-VL-Plus model, set VIDEO_TOTAL_PIXELS to 131072 * 32 * 32. For other models, set it to 65536 * 32 * 32.
VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 131072 * 32 * 32)))

def round_by_factor(number: int, factor: int) -> int:
    """Returns the integer closest to 'number' that is divisible by 'factor'."""
    return round(number / factor) * factor

def ceil_by_factor(number: int, factor: int) -> int:
    """Returns the smallest integer that is greater than or equal to 'number' and divisible by 'factor'."""
    return math.ceil(number / factor) * factor

def floor_by_factor(number: int, factor: int) -> int:
    """Returns the largest integer that is less than or equal to 'number' and divisible by 'factor'."""
    return math.floor(number / factor) * factor

def extract_vision_info(conversations):
    vision_infos = []
    if isinstance(conversations[0], dict):
        conversations = [conversations]
    for conversation in conversations:
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if (
                        "image" in ele
                        or "image_url" in ele
                        or "video" in ele
                        or ele.get("type","") in ("image", "image_url", "video")
                    ):
                        vision_infos.append(ele)
    return vision_infos

def smart_nframes(ele,total_frames,video_fps):
    """Calculates the number of extracted video frames.

    Args:
        ele (dict): A dictionary containing the video configuration.
            - fps: Controls the number of input frames extracted for the model.
        total_frames (int): The original total number of frames in the video.
        video_fps (int | float): The original frame rate of the video.

    Raises:
        ValueError: If nframes is not within the interval [FRAME_FACTOR, total_frames].

    Returns:
        The number of video frames for model input.
    """
    assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
    fps = ele.get("fps", FPS)
    min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
    max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
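    # A frame rate of 0 usually means the video metadata could not be read.
    if video_fps <= 0:
        raise ValueError(f"Invalid video frame rate: {video_fps}")
    duration = total_frames / video_fps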
    if duration-int(duration)>(1/fps):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration)*video_fps)
    nframes = total_frames / video_fps * fps
    if nframes > total_frames:
        logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes and nframes <= total_frames):
        raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")

    return nframes

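def get_video(video_path):
    # Open the video and read its metadata
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise ValueError(f"Cannot open video file: {video_path}")

    # Get video width and height
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # Get the total number of frames and the original frame rate
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()  # Release the file handle
    return frame_height, frame_width, total_frames, video_fps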

def smart_resize(ele, path, factor=IMAGE_FACTOR):
    # Get the original width and height of the video
    height, width, total_frames, video_fps = get_video(path)
    # Token lower limit for video frames
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    # Number of extracted video frames
    nframes = smart_nframes(ele, total_frames, video_fps)
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR),int(min_pixels * 1.05))

    # The aspect ratio of the video should not exceed 200:1 or 1:200.
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(
            f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
        )

    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

def token_calculate(video_path, fps):
    # Pass the video path and the fps frame extraction parameter.
    messages = [{"content": [{"video": video_path, "fps": fps}]}]
    vision_infos = extract_vision_info(messages)[0]

    resized_height, resized_width = smart_resize(vision_infos, video_path)

    height, width, total_frames, video_fps = get_video(video_path)
    num_frames = smart_nframes(vision_infos, total_frames, video_fps)
    print(f"Original video dimensions: {height}*{width}, Model input dimensions: {resized_height}*{resized_width}, Total video frames: {total_frames}, Total frames extracted when fps is {fps}: {num_frames}", end=", ")
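    # Use IMAGE_FACTOR rather than a hard-coded 32 so the estimate stays correct when IMAGE_FACTOR is set to 28.
    video_token = int(math.ceil(num_frames / 2) * resized_height / IMAGE_FACTOR * resized_width / IMAGE_FACTOR)
    video_token += 2   # The system automatically adds <|vision_bos|> and <|vision_eos|> visual markers (1 token each).
    return video_token

if __name__ == "__main__":
    video_token = token_calculate("xxx/test.mp4", 1)
    print("Video tokens:", video_token)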
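
The arithmetic can also be traced without a video file. The sketch below repeats the frame-count and resize steps from the code above on plain numbers; `estimate_nframes` and `estimate_video_tokens` are hypothetical helper names, the constants are the Qwen3-VL-Plus values quoted in the comments above, and the range check on the frame count is omitted for brevity.

```python
import math

FRAME_FACTOR = 2
IMAGE_FACTOR = 32
VIDEO_MIN_PIXELS = 4 * 32 * 32
VIDEO_MAX_PIXELS = 640 * 32 * 32
VIDEO_TOTAL_PIXELS = 131072 * 32 * 32
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 2000

def estimate_nframes(total_frames, video_fps, fps=2.0):
    """Number of frames extracted at the requested sampling rate 'fps'."""
    duration = total_frames / video_fps
    if duration - int(duration) > 1 / fps:
        usable = math.ceil(duration * video_fps)
    else:
        usable = math.ceil(int(duration) * video_fps)
    min_frames = math.ceil(FPS_MIN_FRAMES / FRAME_FACTOR) * FRAME_FACTOR
    max_frames = math.floor(min(FPS_MAX_FRAMES, total_frames) / FRAME_FACTOR) * FRAME_FACTOR
    nframes = usable / video_fps * fps
    return int(min(max(nframes, min_frames), max_frames, usable))

def estimate_video_tokens(height, width, total_frames, video_fps, fps=2.0):
    """Estimated tokens for a video, given only its dimensions and frame counts."""
    nframes = estimate_nframes(total_frames, video_fps, fps)
    # Per-frame pixel limit: the smaller of the frame cap and the shared budget.
    max_pixels = max(min(VIDEO_MAX_PIXELS, VIDEO_TOTAL_PIXELS / nframes * FRAME_FACTOR),
                     int(VIDEO_MIN_PIXELS * 1.05))
    h_bar = max(IMAGE_FACTOR, round(height / IMAGE_FACTOR) * IMAGE_FACTOR)
    w_bar = max(IMAGE_FACTOR, round(width / IMAGE_FACTOR) * IMAGE_FACTOR)
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / IMAGE_FACTOR) * IMAGE_FACTOR
        w_bar = math.floor(width / beta / IMAGE_FACTOR) * IMAGE_FACTOR
    elif h_bar * w_bar < VIDEO_MIN_PIXELS:
        beta = math.sqrt(VIDEO_MIN_PIXELS / (height * width))
        h_bar = math.ceil(height * beta / IMAGE_FACTOR) * IMAGE_FACTOR
        w_bar = math.ceil(width * beta / IMAGE_FACTOR) * IMAGE_FACTOR
    return math.ceil(nframes / 2) * (h_bar // IMAGE_FACTOR) * (w_bar // IMAGE_FACTOR) + 2

# A 60-second 1080p clip recorded at 30 fps and sampled at 2 fps:
# 120 frames, each scaled to 576 x 1056, so 60 * 18 * 33 + 2 tokens.
print(estimate_video_tokens(1080, 1920, 1800, 30.0, fps=2.0))  # prints 35642
```

Halving the sampling rate to 1 fps halves the frame count and brings the estimate down to 17822 tokens, which shows how strongly the `fps` parameter drives cost.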

Image list:

When a video is passed as a list of images, it means that frame extraction has already been performed. Use the following code to calculate the token consumption by providing the path and number of images:

# Before use, install: pip install Pillow
import math
import os
import logging
from typing import Tuple
from PIL import Image

logger = logging.getLogger(__name__)

# ==================== Constant Definitions ====================
FRAME_FACTOR = 2
# For models such as Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and qwen-vl-plus-0710, the scaling factor is 32.
IMAGE_FACTOR = 32

# For other models, the scaling factor is 28.
# IMAGE_FACTOR = 28

# Constants for token calculation
TOKEN_DIVISOR = IMAGE_FACTOR  # Each visual token covers IMAGE_FACTOR x IMAGE_FACTOR pixels
VISION_SPECIAL_TOKENS = 2  # <|vision_bos|> and <|vision_eos|> markers

# Maximum aspect ratio for video frames
MAX_RATIO = 200
# Pixel lower limit for video frames
VIDEO_MIN_PIXELS = 4 * 32 * 32
# Pixel upper limit for video frames. For the Qwen3-VL-Plus model, VIDEO_MAX_PIXELS is 640 * 32 * 32. For other models, it is 768 * 32 * 32.
VIDEO_MAX_PIXELS = 640 * 32 * 32

# Total pixel budget for the whole video input; it can be overridden through the VIDEO_MAX_PIXELS environment variable. For the Qwen3-VL-Plus model, set VIDEO_TOTAL_PIXELS to 131072 * 32 * 32. For other models, set it to 65536 * 32 * 32.
VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 131072 * 32 * 32)))

def round_by_factor(number: int, factor: int) -> int:
    """Returns the integer closest to 'number' that is divisible by 'factor'."""
    return round(number / factor) * factor

def ceil_by_factor(number: int, factor: int) -> int:
    """Returns the smallest integer that is greater than or equal to 'number' and divisible by 'factor'."""
    return math.ceil(number / factor) * factor

def floor_by_factor(number: int, factor: int) -> int:
    """Returns the largest integer that is less than or equal to 'number' and divisible by 'factor'."""
    return math.floor(number / factor) * factor

def get_image_size(image_path: str) -> Tuple[int, int]:
    if not os.path.exists(image_path):
        raise FileNotFoundError(f"Image file not found: {image_path}")

    try:
        # The context manager closes the file promptly, even if reading fails
        with Image.open(image_path) as image:
            return image.height, image.width
    except Exception as e:
        raise ValueError(f"Cannot read image file {image_path}: {str(e)}") from e

def smart_resize(height: int, width: int, nframes: int, factor: int = IMAGE_FACTOR) -> Tuple[int, int]:
    """
    Calculates the scaled dimensions of an image

    Args:
        height: Original image height
        width: Original image width
        nframes: Number of video frames
        factor: Scaling factor, defaults to IMAGE_FACTOR

    Returns:
        (resized_height, resized_width) The scaled height and width

    Raises:
        ValueError: Aspect ratio exceeds the limit
    """
    # Pixel lower limit for each video frame
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    # Per-frame pixel upper limit, derived from the total pixel budget and the number of frames
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))

    # The aspect ratio of the video should not exceed 200:1 or 1:200.
    aspect_ratio = max(height, width) / min(height, width)
    if aspect_ratio > MAX_RATIO:
        raise ValueError(
            f"Image aspect ratio must be less than {MAX_RATIO}:1, but is currently {aspect_ratio:.2f}:1"
        )

    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

def calculate_video_tokens(image_path: str, nframes: int = 1, factor: int = IMAGE_FACTOR, verbose: bool = True) -> int:
    """
    Estimates the token consumption of a video passed as a list of images.

    Args:
        image_path: Path to the video frame file
        nframes: Number of video frames
        factor: Scaling factor, defaults to IMAGE_FACTOR
        verbose: Whether to print detailed information

    Returns:
        The number of tokens consumed

    Raises:
        FileNotFoundError: The file does not exist
        ValueError: The file format is invalid or the aspect ratio exceeds the limit
    """
    # Get the original image dimensions (read only once)
    height, width = get_image_size(image_path)

    # Calculate the scaled dimensions
    resized_height, resized_width = smart_resize(height, width, nframes, factor)

    # Calculate the number of tokens
    # Formula: ceil(nframes/2) * (height/TOKEN_DIVISOR) * (width/TOKEN_DIVISOR) + VISION_SPECIAL_TOKENS
    video_token = int(
        math.ceil(nframes / 2) *
        (resized_height / TOKEN_DIVISOR) *
        (resized_width / TOKEN_DIVISOR)
    )
    # Add visual marker tokens (<|vision_bos|> and <|vision_eos|>)
    video_token += VISION_SPECIAL_TOKENS

    if verbose:
        print(f"Original video frame dimensions: {height}x{width}, Model input dimensions: {resized_height}x{resized_width}, ", end="")

    return video_token

if __name__ == "__main__":
    try:
        video_token = calculate_video_tokens("xxx/test.jpg", nframes=30)
        print(f"Video tokens: {video_token}\n")
    except Exception as e:
        print(f"Error: {str(e)}\n")

North China 2 (Beijing)

Qwen

Model service	Model code	Price
Qwen3.5-27B	qwen3.5-27b	$0.006876/1,000 tokens
Qwen3.5-9B	qwen3.5-9b	$0.00275/1,000 tokens

Qwen3-32B	qwen3-32b	$0.005501/1,000 tokens
Qwen3-30B-A3B-Instruct-2507	qwen3-30b-a3b-instruct-2507	$0.004126/1,000 tokens
Qwen3-14B	qwen3-14b	$0.004126/1,000 tokens
Qwen3-8B	qwen3-8b	$0.000825/1,000 tokens
Qwen3-4B-Instruct-2507	qwen3-4b-instruct-2507	$0.000825/1,000 tokens
Qwen3-1.7B	qwen3-1.7b	$0.000619/1,000 tokens
Qwen3-0.6B	qwen3-0.6b	$0.000413/1,000 tokens

Qwen2.5-72B-Instruct	qwen2.5-72b-instruct	$0.020628/1,000 tokens
Qwen2.5-32B-Instruct	qwen2.5-32b-instruct	$0.004126/1,000 tokens
Qwen2.5-14B-Instruct	qwen2.5-14b-instruct	$0.004126/1,000 tokens
Qwen2.5-7B-Instruct	qwen2.5-7b-instruct	$0.000825/1,000 tokens

Qwen-Plus-Character-2025-11-06	qwen-plus-character-2025-11-06	$0.020628/1,000 tokens

Qwen-VL

Model service	Model code	Price
Qwen3-VL-8B-Instruct	qwen3-vl-8b-instruct	$0.00165/1,000 tokens
Qwen3-VL-8B-Thinking	qwen3-vl-8b-thinking	$0.00165/1,000 tokens
Qwen3-VL-4B-Instruct	qwen3-vl-4b-instruct	$0.000825/1,000 tokens

Qwen2.5-VL-72B-Instruct	qwen2.5-vl-72b-instruct	$0.006876/1,000 tokens
Qwen2.5-VL-32B-Instruct	qwen2.5-vl-32b-instruct	$0.00275/1,000 tokens
Qwen2.5-VL-7B-Instruct	qwen2.5-vl-7b-instruct	$0.001375/1,000 tokens

Calculate tokens for images and videos

Images

Formula: Image Tokens = h_bar * w_bar / token_pixels + 2

h_bar, w_bar: The height and width of the scaled image. Before processing an image, the model performs pre-processing to scale it down to a specific pixel limit. This limit depends on the values of the max_pixels and vl_high_resolution_images parameters. For more information, see Process high-resolution images.
token_pixels: The pixel value corresponding to each visual token. This varies by model:
- qwen3.7-series, qwen3.6-series, qwen3.5-series, Qwen3-VL, qwen-vl-max, and qwen-vl-plus: Each token corresponds to 32x32 pixels.
- QVQ and other Qwen2.5-VL models: Each token corresponds to 28x28 pixels.

The following code demonstrates the approximate image scaling logic used by the model. Use it to estimate the tokens for an image. For actual billing, refer to the API response.

import math
from PIL import Image  # pip install Pillow

def smart_size(image_path, max_pixels, vl_high_resolution_images):
    """Calculates the scaled dimensions of an image based on model parameters to estimate image tokens."""
    # Read the original dimensions; the context manager closes the file promptly.
    with Image.open(image_path) as image:
        height, width = image.height, image.width

    # The scaling factor is 32 for models such as Qwen3.6, Qwen3.5, and Qwen3-VL. For other models, it is 28.
    factor = 32
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor

    # Token lower limit: 4 tokens
    min_pixels = 4 * factor * factor

    # If vl_high_resolution_images=True, the token upper limit is fixed at 16384, and max_pixels is ignored.
    if vl_high_resolution_images:
        max_pixels = 16384 * factor * factor

    # Constrains the total number of pixels to the range [min_pixels, max_pixels].
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor

    return h_bar, w_bar

if __name__ == "__main__":
    # Note: The values of max_pixels and vl_high_resolution_images must match the parameters passed when calling the model.
    h_bar, w_bar = smart_size("xxx/test.jpg", max_pixels=2560 * 32 * 32, vl_high_resolution_images=False)
    print(f"Scaled image dimensions: Height {h_bar}, Width {w_bar}")

    # Each image includes one <|vision_bos|> and one <|vision_eos|> token.
    token = int(h_bar * w_bar / (32 * 32)) + 2
    print(f"Number of image tokens: {token}")

Videos

Video files:

# Before use, install: pip install opencv-python
import math
import os
import logging
import cv2

logger = logging.getLogger(__name__)

FRAME_FACTOR = 2

# For models such as Qwen3.6, Qwen3.5, Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and qwen-vl-plus-0710, the image scaling factor is 32.
IMAGE_FACTOR = 32

# For other models, the image scaling factor is 28.
# IMAGE_FACTOR = 28

# Maximum aspect ratio for video frames
MAX_RATIO = 200
# Pixel lower limit for video frames
VIDEO_MIN_PIXELS = 4 * 32 * 32
# Pixel upper limit for video frames. For the Qwen3-VL-Plus model, VIDEO_MAX_PIXELS is 640 * 32 * 32. For other models, it is 768 * 32 * 32.
VIDEO_MAX_PIXELS = 640 * 32 * 32

# If the user does not pass the FPS parameter, the default value is used for fps.
FPS = 2.0
# Minimum number of extracted frames
FPS_MIN_FRAMES = 4
# Maximum number of extracted frames (set based on the selected model)
FPS_MAX_FRAMES = 2000

# Total pixel budget for the whole video input; it can be overridden through the VIDEO_MAX_PIXELS environment variable. For the Qwen3-VL-Plus model, set VIDEO_TOTAL_PIXELS to 131072 * 32 * 32. For other models, set it to 65536 * 32 * 32.
VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 131072 * 32 * 32)))

def round_by_factor(number: int, factor: int) -> int:
    """Returns the integer closest to 'number' that is divisible by 'factor'."""
    return round(number / factor) * factor

def ceil_by_factor(number: int, factor: int) -> int:
    """Returns the smallest integer that is greater than or equal to 'number' and divisible by 'factor'."""
    return math.ceil(number / factor) * factor

def floor_by_factor(number: int, factor: int) -> int:
    """Returns the largest integer that is less than or equal to 'number' and divisible by 'factor'."""
    return math.floor(number / factor) * factor

def extract_vision_info(conversations):
    vision_infos = []
    if isinstance(conversations[0], dict):
        conversations = [conversations]
    for conversation in conversations:
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if (
                        "image" in ele
                        or "image_url" in ele
                        or "video" in ele
                        or ele.get("type","") in ("image", "image_url", "video")
                    ):
                        vision_infos.append(ele)
    return vision_infos

def smart_nframes(ele,total_frames,video_fps):
    """Calculates the number of extracted video frames.

    Args:
        ele (dict): A dictionary containing the video configuration.
            - fps: Controls the number of input frames extracted for the model.
        total_frames (int): The original total number of frames in the video.
        video_fps (int | float): The original frame rate of the video.

    Raises:
        ValueError: If nframes is not within the interval [FRAME_FACTOR, total_frames].

    Returns:
        The number of video frames for model input.
    """
    assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
    fps = ele.get("fps", FPS)
    min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
    max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
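    # A frame rate of 0 usually means the video metadata could not be read.
    if video_fps <= 0:
        raise ValueError(f"Invalid video frame rate: {video_fps}")
    duration = total_frames / video_fps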
    if duration-int(duration)>(1/fps):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration)*video_fps)
    nframes = total_frames / video_fps * fps
    if nframes > total_frames:
        logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes and nframes <= total_frames):
        raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")

    return nframes

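def get_video(video_path):
    # Open the video and read its metadata
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise ValueError(f"Cannot open video file: {video_path}")

    # Get video width and height
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # Get the total number of frames and the original frame rate
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()  # Release the file handle
    return frame_height, frame_width, total_frames, video_fps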

def smart_resize(ele, path, factor=IMAGE_FACTOR):
    # Get the original width and height of the video
    height, width, total_frames, video_fps = get_video(path)
    # Token lower limit for video frames
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    # Number of extracted video frames
    nframes = smart_nframes(ele, total_frames, video_fps)
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR),int(min_pixels * 1.05))

    # The aspect ratio of the video should not exceed 200:1 or 1:200.
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(
            f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
        )

    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

def token_calculate(video_path, fps):
    # Pass the video path and the fps frame extraction parameter.
    messages = [{"content": [{"video": video_path, "fps": fps}]}]
    vision_infos = extract_vision_info(messages)[0]

    resized_height, resized_width = smart_resize(vision_infos, video_path)

    height, width, total_frames, video_fps = get_video(video_path)
    num_frames = smart_nframes(vision_infos, total_frames, video_fps)
    print(f"Original video dimensions: {height}*{width}, Model input dimensions: {resized_height}*{resized_width}, Total video frames: {total_frames}, Total frames extracted when fps is {fps}: {num_frames}", end=", ")
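    # Use IMAGE_FACTOR rather than a hard-coded 32 so the estimate stays correct when IMAGE_FACTOR is set to 28.
    video_token = int(math.ceil(num_frames / 2) * resized_height / IMAGE_FACTOR * resized_width / IMAGE_FACTOR)
    video_token += 2   # The system automatically adds <|vision_bos|> and <|vision_eos|> visual markers (1 token each).
    return video_token

if __name__ == "__main__":
    video_token = token_calculate("xxx/test.mp4", 1)
    print("Video tokens:", video_token)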

Image list:

# Before use, install: pip install Pillow
import math
import os
import logging
from typing import Tuple
from PIL import Image

logger = logging.getLogger(__name__)

# ==================== Constant Definitions ====================
FRAME_FACTOR = 2
# For models such as Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and qwen-vl-plus-0710, the scaling factor is 32.
IMAGE_FACTOR = 32

# For other models, the scaling factor is 28.
# IMAGE_FACTOR = 28

# Constants for token calculation
TOKEN_DIVISOR = 32  # Divisor for token calculation
VISION_SPECIAL_TOKENS = 2  # <|vision_bos|> and <|vision_eos|> markers

# Maximum aspect ratio for video frames
MAX_RATIO = 200
# Pixel lower limit for video frames
VIDEO_MIN_PIXELS = 4 * 32 * 32
# Pixel upper limit for video frames. For the Qwen3-VL-Plus model, VIDEO_MAX_PIXELS is 640 * 32 * 32. For other models, it is 768 * 32 * 32.
VIDEO_MAX_PIXELS = 640 * 32 * 32

# Maximum pixel value for video input. For the Qwen3-VL-Plus model, set VIDEO_TOTAL_PIXELS to 131072 * 32 * 32. For other models, set it to 65536 * 32 * 32.
VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 131072 * 32 * 32)))

def round_by_factor(number: int, factor: int) -> int:
    """Returns the integer closest to 'number' that is divisible by 'factor'."""
    return round(number / factor) * factor

def ceil_by_factor(number: int, factor: int) -> int:
    """Returns the smallest integer that is greater than or equal to 'number' and divisible by 'factor'."""
    return math.ceil(number / factor) * factor

def floor_by_factor(number: int, factor: int) -> int:
    """Returns the largest integer that is less than or equal to 'number' and divisible by 'factor'."""
    return math.floor(number / factor) * factor

def get_image_size(image_path: str) -> Tuple[int, int]:
    """Returns the (height, width) of the image at 'image_path'."""
    if not os.path.exists(image_path):
        raise FileNotFoundError(f"Image file not found: {image_path}")

    try:
        # The context manager closes the file even if reading the size fails.
        with Image.open(image_path) as image:
            return image.height, image.width
    except Exception as e:
        raise ValueError(f"Cannot read image file {image_path}: {e}") from e

def smart_resize(height: int, width: int, nframes: int, factor: int = IMAGE_FACTOR) -> Tuple[int, int]:
    """
    Calculates the scaled dimensions of an image

    Args:
        height: Original image height
        width: Original image width
        nframes: Number of video frames
        factor: Scaling factor, defaults to IMAGE_FACTOR

    Returns:
        (resized_height, resized_width) The scaled height and width

    Raises:
        ValueError: Aspect ratio exceeds the limit
    """
    # Pixel lower limit for video frames
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    # Per-frame pixel upper limit, derived from the number of extracted video frames
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))

    # The aspect ratio of the video should not exceed 200:1 or 1:200.
    aspect_ratio = max(height, width) / min(height, width)
    if aspect_ratio > MAX_RATIO:
        raise ValueError(
            f"Image aspect ratio must be less than {MAX_RATIO}:1, but is currently {aspect_ratio:.2f}:1"
        )

    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

def calculate_video_tokens(image_path: str, nframes: int = 1, factor: int = IMAGE_FACTOR, verbose: bool = True) -> int:
    """
    Calculates the number of tokens consumed by an image list passed as video frames

    Args:
        image_path: Path to the video frame file
        nframes: Number of video frames
        factor: Scaling factor, defaults to IMAGE_FACTOR
        verbose: Whether to print detailed information

    Returns:
        The number of tokens consumed

    Raises:
        FileNotFoundError: The file does not exist
        ValueError: The file format is invalid or the aspect ratio exceeds the limit
    """
    # Get the original image dimensions (read only once)
    height, width = get_image_size(image_path)

    # Calculate the scaled dimensions
    resized_height, resized_width = smart_resize(height, width, nframes, factor)

    # Calculate the number of tokens
    # Formula: ceil(nframes/2) * (height/TOKEN_DIVISOR) * (width/TOKEN_DIVISOR) + VISION_SPECIAL_TOKENS
    video_token = int(
        math.ceil(nframes / 2) *
        (resized_height / TOKEN_DIVISOR) *
        (resized_width / TOKEN_DIVISOR)
    )
    # Add visual marker tokens (<|vision_bos|> and <|vision_eos|>)
    video_token += VISION_SPECIAL_TOKENS

    if verbose:
        print(f"Original video frame dimensions: {height}x{width}, Model input dimensions: {resized_height}x{resized_width}, ", end="")

    return video_token

if __name__ == "__main__":
    try:
        video_token = calculate_video_tokens("xxx/test.jpg", nframes=30)
        print(f"Video tokens: {video_token}\n")
    except Exception as e:
        print(f"Error: {str(e)}\n")

Before you fine-tune a model

Although text generation model fine-tuning can achieve excellent results in specific business scenarios, it has the following limitations:
- Time-consuming: You need to create a large-scale CPT dataset (at least 50 million tokens), build an effective SFT dataset (1,000+ entries), or collect enough bad cases (100+) to build an effective DPO dataset. Model optimization iterations are also slow.
- High cost: A fine-tuned model can be used only after it is deployed, and model deployment is expensive.
Alibaba Cloud Model Studio recommends that you first try using Prompt Engineering or Function Calling to customize your application. Model fine-tuning is usually the "last resort" for improving model performance. This is because:
1. In many tasks, a model may initially perform poorly, but applying the correct prompt techniques can improve the results without requiring model fine-tuning.
2. Iteratively optimizing prompts and plugins is more agile and cost-effective than model fine-tuning iterations, because fine-tuning may require re-collecting, cleaning, and optimizing data, collecting bad cases, and conducting customer surveys.
3. Even if you ultimately decide to perform model fine-tuning, the initial work on prompt engineering and plugin optimization will not be wasted. This preliminary work can be fully reused when building the fine-tuning dataset.

Getting started

Fine-tune a model using the console

Step 1: On the Model Fine-tuning page, click Create Training Task.
Step 2: Configure training.
- Training Method: Supervised Fine-tuning (SFT)
- Select Model: Qwen3-8B
- Training Method: Efficient Training
- Configure Parameters: You can keep the default settings because Model Studio provides recommended configurations for fine-tuning hyperparameters.
This combination has a short training time and low data requirements.
Step 3: Configure data.
- Training Set: Select the uploaded fine-tuning dataset to build the model. Data sample: SFT-ChatML_format_example.jsonl.
- Mixed Training: Disabled
- Validation Set: When set to Automatic Splitting, 10% of the data is used as the validation set.
Step 4: Configure the saving parameters for model parameter snapshots (checkpoints).
- Model Name: Keep the default name.
- Maximum Export Count: Keep the default value.
- Checkpoint save interval: Keep the default value.
Note: After model fine-tuning is complete, you can export a parameter snapshot on the Model Studio platform. You can then deploy the model on Model Studio based on this parameter snapshot. The exported parameter snapshots are saved in cloud storage and cannot be accessed or downloaded.
Step 5: Click Start Training and wait for the model training to complete.
Step 6: Use the Model Deployment feature of Alibaba Cloud Model Studio to deploy the trained custom model. After deployment, you can evaluate the fine-tuned model. For more information, see Model deployment.

Typical fine-tuning process

The three fine-tuning methods provided by Model Studio are not mutually exclusive but are progressive and complementary.

CPT (optional) → SFT → DPO (optional)

CPT (continual pre-training) - Supplements knowledge (General models have broad but shallow knowledge, which may not meet the depth and precision requirements of professional fields)
- Finance model: Learns financial terms
- Medical model: Memorizes drug pathology
- Legal model: Understands legal articles and precedents
SFT (supervised fine-tuning) - Learns how to perform tasks
- Customer service bot: Learns customer service procedures
- Code assistant: Learns programming paradigms
- Tool calling (Agent): Learns to use MCP
DPO (direct preference optimization) - Performs tasks better
- Safety and responsibility: Rejects harmful suggestions
- Conciseness and effectiveness: Provides concise answers
- Objectivity and neutrality: Evaluates fairly and objectively

Fine-tuning data format

SFT training set

SFT ChatML (Chat Markup Language) format training data supports multi-turn conversations and various role settings.

The OpenAI name and weight parameters are not supported. All assistant outputs will be trained.

# A single line of training data (in JSON format) has the following typical structure when expanded:
{"messages": [
  {"role": "system", "content": "System input 1"}, 
  {"role": "user", "content": "User input 1"}, 
  {"role": "assistant", "content": "Expected model output 1"}, 
  {"role": "user", "content": "User input 2"}, 
  {"role": "assistant", "content": "Expected model output 2"}
  ...
]}

For information about the differences between system, user, and assistant, see Overview. Sample training datasets: SFT-ChatML_format_example.jsonl, SFT-ChatML_format_example.xlsx. XLS and XLSX formats support only single-turn conversations.
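The structure above can be checked locally before you upload a file. The following is a minimal sketch, not the validation that Model Studio itself performs: the helper names validate_record and to_jsonl_line are hypothetical, and the rules (three roles, string content for text-only data, no name or weight fields, at least one assistant message) are assumptions drawn from the format description above.

```python
import json

# Roles that appear in the ChatML example above.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_record(record: dict) -> list[str]:
    """Return the problems found in one training record (empty list if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: must be a JSON object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        # Assumption: text-only data, so content is a plain string.
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
        # The OpenAI name and weight parameters are not supported.
        if "name" in msg or "weight" in msg:
            problems.append(f"message {i}: 'name' and 'weight' are not supported")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to train on")
    return problems

def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line (newlines in content are escaped)."""
    return json.dumps(record, ensure_ascii=False)

if __name__ == "__main__":
    record = {"messages": [
        {"role": "system", "content": "System input 1"},
        {"role": "user", "content": "User input 1"},
        {"role": "assistant", "content": "Expected model output 1"},
    ]}
    print(validate_record(record))
    print(to_jsonl_line(record))
```

Writing one to_jsonl_line result per line produces a .jsonl file in which each line is a complete training entry.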

Every assistant message in a training data entry supports the "loss_weight" parameter, which sets the relative importance of that message during training (range: 0.0 to 1.0; a larger value indicates higher importance).

This parameter is in invitational preview. To use it, contact your account manager.

 {"role": "assistant", "content": "Expected model output 1", "loss_weight": 1.0}, 
 {"role": "assistant", "content": "Expected model output 2", "loss_weight": 0.5}
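A range check for "loss_weight" could look like the sketch below. The function name check_loss_weights is hypothetical, and flagging the parameter on non-assistant messages is an assumption based on the description above, which mentions it only for assistant messages.

```python
def check_loss_weights(messages: list[dict]) -> list[str]:
    """Return the problems found with 'loss_weight' values in one entry's messages."""
    problems = []
    for i, msg in enumerate(messages):
        if "loss_weight" not in msg:
            continue  # The parameter is optional.
        weight = msg["loss_weight"]
        # Assumption: only assistant messages may carry loss_weight.
        if msg.get("role") != "assistant":
            problems.append(f"message {i}: loss_weight is described only for assistant messages")
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(weight, (int, float)) and not isinstance(weight, bool)
        if not is_number or not 0.0 <= weight <= 1.0:
            problems.append(f"message {i}: loss_weight must be a number from 0.0 to 1.0")
    return problems

if __name__ == "__main__":
    messages = [
        {"role": "user", "content": "User input 1"},
        {"role": "assistant", "content": "Expected model output 1", "loss_weight": 1.0},
        {"role": "assistant", "content": "Expected model output 2", "loss_weight": 0.5},
    ]
    print(check_loss_weights(messages))
```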

Dataset building tips

Dataset size requirements

For CPT, the dataset requires at least 50 million tokens of high-quality pre-training data. For SFT, the dataset requires at least 1,000 high-quality fine-tuning data entries. For DPO, the dataset generally requires hundreds of human preference data entries. If the model evaluation results after fine-tuning are not satisfactory, the simplest way to improve them is to collect more training data.
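These thresholds can be turned into a quick pre-flight check. The sketch below is illustrative only: the names MIN_SIZE and size_shortfall are hypothetical, the token or entry count must be computed separately, and the DPO minimum of 100 is an assumed lower bound for "hundreds".

```python
# Minimum dataset sizes per training method, as (unit, minimum).
MIN_SIZE = {
    "cpt": ("tokens", 50_000_000),
    "sft": ("entries", 1_000),
    "dpo": ("entries", 100),  # Assumption: lower bound for "hundreds" of entries.
}

def size_shortfall(method: str, count: int) -> int:
    """Return how many more tokens or entries are needed (0 if the minimum is met)."""
    _unit, minimum = MIN_SIZE[method]
    return max(0, minimum - count)

if __name__ == "__main__":
    for method, count in [("cpt", 60_000_000), ("sft", 800), ("dpo", 40)]:
        unit = MIN_SIZE[method][0]
        print(f"{method}: {size_shortfall(method, count)} more {unit} needed")
```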

If you lack data, we recommend building an agent application and using a knowledge base index to enhance the model's capabilities. In many complex business scenarios, you can also use a combination of model fine-tuning and knowledge base retrieval.

For example, in a customer service scenario, you can use model fine-tuning to address issues with the customer service agent's tone, expression styles, and self-awareness. Professional knowledge related to the scenario can be dynamically introduced into the model's context using a knowledge base.

Alibaba Cloud Model Studio recommends that you first build and test a retrieval-augmented generation (RAG) application. After collecting enough application data, you can then use model fine-tuning to further improve the model's performance.

You can also use the following strategies to expand your dataset:

Use a large language model (LLM) to simulate the generation of content for specific business scenarios to help you generate more data for fine-tuning. (We recommend selecting a larger, high-performing model for generation.)
Acquire more data through various methods, such as collecting from application scenarios, web scraping, social media and online forums, public datasets, partners and industry resources, and user contributions.

Data diversity and balance

The requirements for model fine-tuning vary by scenario. For example, professionalism is critical for specific business scenarios, whereas versatility is more important for Q&A scenarios. You need to design data use cases based on the business modules or usage scenarios the model is responsible for. Therefore, the training effectiveness depends not only on the data volume but also on the professionalism and diversity of the data for the specific scenario.

For example, in an intelligent AI conversation scenario, a professional and diverse dataset should include the following business scenarios:

Specific business	Diverse scenarios/businesses
E-commerce customer service	Promotion pushes, pre-sales consultation, in-sales guidance, after-sales service, after-sales follow-up, complaint handling, etc.
Financial services	Loan consultation, investment and wealth management advice, credit card services, bank account management, etc.
Online healthcare	Symptom consultation, appointment registration, visit instructions, drug information inquiry, health tips, etc.
AI secretary	IT information, administrative information, HR information, employee benefit inquiries, company calendar queries, etc.
Travel assistant	Travel planning, entry and exit guides, travel insurance consultation, destination customs and culture introductions, etc.
Corporate legal counsel	Contract review, intellectual property protection, compliance checks, labor law Q&A, cross-border transaction consultation, individual case legal analysis, etc.

It is also important to note that the amount of data for each scenario/business should be relatively balanced, and the data proportion should match the actual scenario proportion. This avoids having too much data of one type, which could cause the model to be biased towards learning those features and affect its generalization ability.
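One way to check balance is to compare each scenario's share of the dataset with its share of real traffic. The sketch below assumes that you keep a scenario label for each training entry outside the training file. The function names, the labels, and the 10-percentage-point tolerance are all hypothetical choices, not Model Studio requirements.

```python
from collections import Counter

def scenario_shares(labels: list[str]) -> dict[str, float]:
    """Return each scenario's fraction of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def find_imbalance(shares: dict[str, float],
                   expected: dict[str, float],
                   tolerance: float = 0.10) -> dict[str, tuple[float, float]]:
    """Return scenarios whose dataset share differs from the expected share by more than 'tolerance'."""
    flagged = {}
    for name, target in expected.items():
        actual = shares.get(name, 0.0)
        if abs(actual - target) > tolerance:
            flagged[name] = (actual, target)
    return flagged

if __name__ == "__main__":
    # Hypothetical labels for an e-commerce customer service dataset.
    labels = ["pre-sales"] * 70 + ["after-sales"] * 20 + ["complaints"] * 10
    # Hypothetical proportions observed in the real application.
    expected = {"pre-sales": 0.45, "after-sales": 0.40, "complaints": 0.15}
    for name, (actual, target) in find_imbalance(scenario_shares(labels), expected).items():
        print(f"{name}: {actual:.0%} of the dataset, expected about {target:.0%}")
```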

Splitting the training and validation sets

When you fine-tune a model in the console, you can provide the validation set in either of the following ways:

- Automatically split the complete training dataset: a small amount of data is randomly sampled to form the validation set.
- Upload a separate dataset as the validation set.

The console displays the validation set loss and token accuracy in real time during training.
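If you prefer to upload a separate validation set, you can split the data yourself first. The sketch below mirrors the 10% ratio that Automatic Splitting uses, but it is not the console's own algorithm. The function name split_dataset and the fixed seed are assumptions made for reproducibility.

```python
import random

def split_dataset(records: list, validation_ratio: float = 0.10, seed: int = 42) -> tuple[list, list]:
    """Randomly split records into (training set, validation set)."""
    if len(records) < 2:
        raise ValueError("need at least two records to split")
    shuffled = list(records)  # Copy so the caller's list is left unchanged.
    random.Random(seed).shuffle(shuffled)
    # Keep at least one record on each side of the split.
    n_val = max(1, round(len(shuffled) * validation_ratio))
    n_val = min(n_val, len(shuffled) - 1)
    return shuffled[n_val:], shuffled[:n_val]

if __name__ == "__main__":
    train, validation = split_dataset(list(range(1000)))
    print(len(train), len(validation))
```

Using a fixed seed keeps the split stable across runs, so validation loss stays comparable between training iterations on the same data.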

FAQ

Can I fine-tune my own model?

No. Model Studio does not support uploading or fine-tuning your own models, and fine-tuned models cannot be exported or downloaded.