Platform For AI: Guide to fine-tuning LLMs

Last Updated: Dec 04, 2025

Pre-trained large language models (LLMs) may not fully meet specific requirements, so fine-tuning is often needed to improve their performance on particular tasks. This topic describes fine-tuning strategies (SFT and DPO), fine-tuning techniques (full parameter, LoRA, and QLoRA), and hyperparameter selection to help you optimize model performance.

Fine-tuning methods

SFT/DPO

The Model Gallery module allows for supervised fine-tuning (SFT) and direct preference optimization (DPO). The training process for LLMs typically involves the following steps:

Pre-training (PT)

PT is the initial phase where the model learns basic grammar, logical reasoning, and common sense knowledge from a vast corpus.

  • Objective: To endow the model with comprehensive language understanding, logical reasoning capabilities, and common sense knowledge.

  • Example: Pre-trained models provided by Model Gallery, such as the Qwen series and the Llama series.

Supervised fine-tuning (SFT)

Supervised fine-tuning refines the pre-trained model to improve its performance in scenarios such as conversation and Q&A.

  • Objective: To improve the model's performance in both output format and content.

  • Scenario: The model needs to give professional answers to questions in specific domains, such as medicine.

  • Solution: Fine-tune the model with Q&A-format data from the target domain, such as medicine.

Preference optimization (PO)

After supervised fine-tuning, the model may still generate responses that are grammatically correct but factually inaccurate or misaligned with human values. PO further refines the model's conversational abilities and aligns it with human values. The main methods are Proximal Policy Optimization (PPO), which is based on reinforcement learning, and Direct Preference Optimization (DPO). DPO uses an implicit reinforcement learning paradigm that requires neither an explicit reward model nor direct reward maximization, which makes its training more stable than PPO.

Direct preference optimization (DPO)
  • DPO involves a trainable LLM and a reference model. The reference model is a pre-trained LLM with frozen parameters to guide training and prevent deviation from desired outcomes.

  • Training data: Triplets of (prompt, chosen (good outcome), rejected (poor outcome))

  • The DPO loss function (shown below) trains the model so that, relative to the reference model, the probability of generating good outcomes is as high as possible while the probability of generating poor outcomes is as low as possible.

    • σ is the Sigmoid function, mapping results to the range (0, 1).

    • β is a hyperparameter, typically between 0.1 and 0.5, used to adjust the sensitivity of the loss function.
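
For reference, the standard DPO loss function takes the following form, where x is the prompt, y_w is the chosen (good) outcome, y_l is the rejected (poor) outcome, and π_θ and π_ref denote the trained model and the reference model, respectively:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]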

PEFT: LoRA/QLoRA

Parameter-efficient fine-tuning (PEFT) is a popular method for fine-tuning large pre-trained models. It aims to achieve competitive performance by updating only a small subset of parameters while leaving the majority unchanged, which reduces the data and computing resources required and improves fine-tuning efficiency. Model Gallery supports full parameter fine-tuning and two parameter-efficient techniques, LoRA and QLoRA:

LoRA (Low-Rank Adaptation)

LoRA fine-tuning introduces a bypass alongside the parameter matrix of the model. This bypass is the product of two low-rank matrices (with dimensions m×r and r×n, where r is much smaller than m and n). During the forward pass, the input data passes through both the original parameter matrix and the LoRA bypass, and the resulting outputs are summed. During training, the original parameters are frozen and only the LoRA components are trained, which significantly reduces computational overhead.
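
The following is a minimal PyTorch sketch of the LoRA bypass described above; the class name, rank, and scaling are illustrative assumptions rather than the exact implementation used by Model Gallery:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank bypass: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the original m x n parameter matrix
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # n -> r
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # r -> m
        nn.init.zeros_(self.lora_B.weight)          # the bypass starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # The input passes through both the frozen matrix and the bypass; outputs are summed.
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

During training, only lora_A and lora_B receive gradients, so the number of trainable parameters for this layer drops from m*n to r*(m+n).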

QLoRA (Quantized LoRA)

QLoRA combines model quantization with LoRA. In addition to introducing LoRA bypasses, QLoRA quantizes the model to 4-bit or 8-bit precision during loading. During actual computation, the quantized parameters are dequantized to 16-bit for processing. This approach not only reduces the storage requirements of model parameters but also further reduces memory consumption during training compared to LoRA.
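
For illustration, the following sketch shows how QLoRA-style loading is commonly done with the open-source transformers, bitsandbytes, and peft libraries; the model ID and hyperparameter values are placeholders, and in Model Gallery this configuration is handled for you through the load_in_4bit/load_in_8bit parameters:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit quantized weights; they are dequantized to 16-bit during computation.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B",                      # placeholder model ID
    quantization_config=bnb_config,
)

# Attach trainable LoRA bypasses on top of the frozen, quantized base model.
model = get_peft_model(model, LoraConfig(r=32, lora_alpha=32))
model.print_trainable_parameters()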

Training method selection and data preparation

Select SFT or DPO based on your specific scenario, as described in the previous section.

Full parameter training, LoRA, or QLoRA

  • Complex tasks: We recommend full parameter training to leverage all model parameters for enhanced performance.

  • Simple tasks: We recommend LoRA or QLoRA for faster training and lower computing resource demands.

  • Limited data: LoRA or QLoRA can help prevent overfitting when data volume is limited.

  • Limited computing resources: QLoRA can further reduce GPU memory usage, though it may increase training time because of the quantization and dequantization operations.

Prepare training data

  • Simple tasks: A large dataset is not required.

  • SFT: Thousands of data entries typically yield good results. However, data quality is more important than quantity.

Hyperparameters

learning_rate

The learning rate controls how significantly the model's parameters are updated during each iteration. A balance is needed to ensure stable convergence without hindering training speed.
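
As a simple illustration, in plain gradient descent each parameter update has the form below, so the learning rate η directly scales the step size (adaptive optimizers such as AdamW rescale the gradient, but the learning rate still controls the overall step size):

\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)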

num_train_epochs

An epoch represents one full pass over the entire training dataset. If num_train_epochs is set to 10, the model goes through the entire dataset 10 times.

A small number of epochs can result in underfitting, whereas too many epochs can cause overfitting. We recommend that you set it between 2 and 10. For smaller datasets, increasing epoch may help prevent underfitting. Conversely, for larger datasets, 2 epochs are usually enough. Moreover, a lower learning rate often requires a greater number of epochs. You can monitor the accuracy against the validation set and cease training once there are no further improvements.
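
As a rough illustration of how the number of epochs translates into optimizer updates (all values below are placeholders):

dataset_size = 5000                 # number of training samples (placeholder)
effective_batch_size = 32           # per_device_train_batch_size x number of GPUs x gradient_accumulation_steps
num_train_epochs = 3

steps_per_epoch = dataset_size // effective_batch_size         # 156
total_update_steps = steps_per_epoch * num_train_epochs        # 468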

per_device_train_batch_size

Batch size is the number of training samples processed in each iteration. The ideal batch size maximizes hardware utilization without compromising training speed or effectiveness. The per_device_train_batch_size parameter specifies the number of samples processed on each GPU in a single training step.

  • Batch size mainly affects training speed rather than final model performance. A smaller batch size increases the variance of gradient estimation and may require more iterations to converge, whereas a larger batch size shortens the training duration.

  • The ideal batch size is the largest value that the hardware can accommodate. To find the largest batch size that does not cause GPU memory overflow, check memory usage as follows: In the PAI console, click Model Gallery in the left-side navigation pane, and then click Job Management. Click the name of the desired training job, and on its details page, click the Task Monitoring tab to view GPU memory and memory usage.


  • Using more GPUs with the same per_device_train_batch_size effectively increases the total batch size.
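
For example (values are placeholders):

per_device_train_batch_size = 8
num_gpus = 4                                                    # e.g. one machine with 4 GPUs
global_batch_size = per_device_train_batch_size * num_gpus      # 32 samples per training step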

seq_length

For an LLM, the training data is processed by a tokenizer to obtain a sequence of tokens. Sequence length refers to the token sequence length accepted by the model for a single training sample. Training data is either truncated or padded to match this length.

Choose an appropriate sequence length based on the token sequence length distribution of the training data. For a given text, the token sequence lengths produced by different tokenizers are usually similar, so you can use OpenAI's online tokenizer tool to estimate the token sequence length of your text.

For SFT, estimate the sequence length of system prompt + instruction + output. For DPO, estimate the sequence lengths of system prompt + prompt + chosen and system prompt + prompt + rejected, and take the larger value.
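
The following sketch estimates the token-length distribution of the training data with an open-source tokenizer; the tokenizer ID, sample texts, and the 95th-percentile rule of thumb are illustrative assumptions:

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")   # placeholder tokenizer ID

samples = [
    "system prompt + instruction + output for sample 1 ...",   # concatenated text of one training sample
    "system prompt + instruction + output for sample 2 ...",
]

lengths = [len(tokenizer(text)["input_ids"]) for text in samples]
# Pick a seq_length that covers most samples, e.g. the 95th percentile rounded up to 512/1024/2048.
print(np.percentile(lengths, 95))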

lora_dim/lora_rank

LoRA is used within the Transformer, specifically in the multi-head attention component. Experiments have demonstrated the following:

  • Adapting multiple weight matrices in a multi-head attention mechanism is more effective than adjusting a single weight matrix.

  • Increasing the rank does not guarantee coverage of a more meaningful subspace. A low-rank adaptation matrix might suffice.

The llm_deepspeed_peft algorithm, featured in the Model Gallery, uses LoRA to adapt all four weight matrix types within the multi-head attention mechanism, with a default rank value set to 32.
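
In open-source PEFT terms, this setup roughly corresponds to the configuration below; the target module names (q_proj/k_proj/v_proj/o_proj) are specific to Llama/Qwen-style architectures and are shown as an assumption:

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                                                      # lora_dim / lora_rank
    lora_alpha=32,                                             # see the lora_alpha section below
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # the four attention weight matrices
)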

lora_alpha

lora_alpha is the LoRA scaling coefficient. A higher value increases the impact of the LoRA matrix and is suitable for small data volumes, while a lower value reduces the impact and is suitable for large data volumes. The value of lora_alpha is typically 0.5 to 2 times the value of lora_dim.
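
In common open-source LoRA implementations (such as PEFT), the bypass output is scaled by lora_alpha / lora_dim before being added to the output of the frozen layer:

h = W_0 x + \frac{\text{lora\_alpha}}{\text{lora\_dim}} \, B A x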

dpo_beta

In DPO training, dpo_beta controls the deviation from the reference model, with a default value of 0.1. A higher beta value indicates less deviation. This parameter is not used in SFT training.

load_in_4bit/load_in_8bit

In QLoRA, these parameters indicate loading the base model with 4-bit or 8-bit precision, respectively.

gradient_accumulation_steps

Larger batch sizes require more GPU memory, which can lead to OOM (CUDA out of memory) errors, whereas smaller batch sizes increase the variance in gradient estimation and slow down convergence. Gradient accumulation improves convergence speed while avoiding OOM: gradients are accumulated over multiple batches before the model is updated, so the effective batch size equals the configured batch size multiplied by gradient_accumulation_steps.
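
The following is a minimal, self-contained PyTorch sketch of gradient accumulation; the toy model, data, and hyperparameter values are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(16, 2)                                # toy stand-in for the LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

gradient_accumulation_steps = 4                         # effective batch size = 4 x 8 = 32

for step in range(100):
    x = torch.randn(8, 16)                              # one small batch that fits in GPU memory
    y = torch.randint(0, 2, (8,))
    loss = loss_fn(model(x), y) / gradient_accumulation_steps   # scale so accumulated gradients average out
    loss.backward()                                     # gradients accumulate across micro-batches
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                                # one parameter update per effective batch
        optimizer.zero_grad()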

apply_chat_template

Setting apply_chat_template to true automatically incorporates the default chat template into the training data. To utilize a custom chat template, set apply_chat_template to false and include the necessary special tokens in the training data. Note that even with apply_chat_template set to true, you can still manually set system prompts.
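
For illustration, the open-source transformers API shows what a chat template does; the tokenizer ID below is a placeholder, and the exact special tokens in the output depend on the model:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")   # placeholder tokenizer ID

messages = [
    {"role": "system", "content": "You are a helpful medical Q&A assistant."},
    {"role": "user", "content": "What are common symptoms of the flu?"},
]

# With apply_chat_template=true, special tokens like those in this output are added for you;
# with apply_chat_template=false, your training data must already contain them.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))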

system_prompt

The system prompt provides context to help the model respond appropriately to user queries. For example, "You are a friendly and professional customer service representative, aiming to amicably and concisely solve user problems, often through examples."