
Platform for AI: Best practices for DLC LoRA training

Last Updated: Jan 30, 2026

This guide explains how to fine-tune a LoRA model and find the optimal hyperparameter configuration by running a Hyperparameter Tuning experiment on DLC compute resources.

Prerequisites

Step 1: Create a dataset

Create an OSS dataset. This dataset mounts an Object Storage Service (OSS) directory to a DLC path to store data files from the Hyperparameter Tuning experiment. Configure the following key parameters and use the default values for the others. For more information, see Create and manage datasets.

  • Dataset Name: Enter a custom name for the dataset.

  • Select Data Storage: Select the OSS directory where the script file is located.

  • Property: Select Folder.

Step 2: Create an experiment

Go to the New Experiment page and configure the key parameters. For details on other parameters, see New Experiment. After you configure the parameters, click Submit.

  1. Configure the execution settings.

    • Job type: Select DLC.

    • Resource Group: Select Public Resource Group.

    • Framework: Select Tensorflow.

    • Datasets: Select the dataset that you created in Step 1.

    • Node Image: Select Image Address and enter registry.cn-shanghai.aliyuncs.com/mybigpai/nni:diffusers in the text box. The following data is pre-configured in the runtime image:

      • Pre-trained foundation model: The image contains the Stable-Diffusion-V1-5 model at /workspace/diffusers_model_data/model.

      • LoRA training data: The image contains the pokemon dataset at /workspace/diffusers_model_data/data.

      • Training code: The image contains the diffusers library at /workspace/diffusers.

    • Instance Type: Select GPU > 12vCPU+92GB Mem+1*NVIDIA V100 ecs.gn6e-c12g1.3xlarge.

    • Nodes: Set to 1.

    Startup Command

    cd /workspace/diffusers/examples/text_to_image && accelerate launch --mixed_precision="fp16" train_text_to_image_lora_eval.py \
        --pretrained_model_name_or_path="/workspace/diffusers_model_data/model" \
        --dataset_name="/workspace/diffusers_model_data/data" \
        --caption_column="text" \
        --resolution=512 --random_flip \
        --train_batch_size=8 \
        --val_batch_size=8 \
        --num_train_epochs=100 --checkpointing_steps=100 \
        --learning_rate=${lr} --lr_scheduler=${lr_scheduler} --lr_warmup_steps=0 \
        --rank=${rank} --adam_beta1=${adam_beta1} --adam_beta2=${adam_beta2} --adam_weight_decay=${adam_weight_decay} \
        --max_grad_norm=${max_grad_norm} \
        --seed=42 \
        --output_dir="/mnt/data/diffusers/pokemon/sd-pokemon_${exp_id}_${trial_id}" \
        --validation_prompts "a cartoon pikachu pokemon with big eyes and big ears" \
        --validation_metrics ImageRewardPatched \
        --save_by_metric val_loss 

    Hyperparameters

    Configure the constraint type and search space for each hyperparameter as follows:

    • lr:

      • Constraint Type: choice.

      • Search Space: Add three enumeration values: 1e-4, 1e-5, and 2e-5.

    • lr_scheduler:

      • Constraint Type: choice.

      • Search Space: Add three enumeration values: constant, cosine, and polynomial.

    • rank:

      • Constraint Type: choice.

      • Search Space: Add three enumeration values: 4, 32, and 64.

    • adam_beta1:

      • Constraint Type: choice.

      • Search Space: Add two enumeration values: 0.9 and 0.95.

    • adam_beta2:

      • Constraint Type: choice.

      • Search Space: Add two enumeration values: 0.99 and 0.999.

    • adam_weight_decay:

      • Constraint Type: choice.

      • Search Space: Add two enumeration values: 1e-2 and 1e-3.

    • max_grad_norm:

      • Constraint Type: choice.

      • Search Space: Add three enumeration values: 1, 5, and 10.

    This configuration defines a search space of 648 hyperparameter combinations (3 × 3 × 3 × 2 × 2 × 2 × 3). Each trial runs the script with one combination from this space.
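The size of the search space and the way a combination fills the startup command can be checked with a short Python sketch. The value lists are copied from the hyperparameter settings above. How the platform substitutes the ${...} placeholders internally is an assumption; string.Template is used here only because it shares the same ${name} syntax, and the command is abbreviated to the tuned flags.

```python
from itertools import product
from math import prod
from string import Template

# Search space as configured in the Hyperparameters settings.
SEARCH_SPACE = {
    "lr": ["1e-4", "1e-5", "2e-5"],
    "lr_scheduler": ["constant", "cosine", "polynomial"],
    "rank": [4, 32, 64],
    "adam_beta1": [0.9, 0.95],
    "adam_beta2": [0.99, 0.999],
    "adam_weight_decay": ["1e-2", "1e-3"],
    "max_grad_norm": [1, 5, 10],
}

# Abbreviated startup command: only the flags that carry placeholders.
COMMAND = Template(
    "accelerate launch train_text_to_image_lora_eval.py "
    "--learning_rate=${lr} --lr_scheduler=${lr_scheduler} --rank=${rank} "
    "--adam_beta1=${adam_beta1} --adam_beta2=${adam_beta2} "
    "--adam_weight_decay=${adam_weight_decay} --max_grad_norm=${max_grad_norm}"
)

# Number of combinations is the product of the per-parameter choice counts.
total = prod(len(values) for values in SEARCH_SPACE.values())

# Enumerate every combination as a dict keyed by hyperparameter name.
combinations = [
    dict(zip(SEARCH_SPACE, values)) for values in product(*SEARCH_SPACE.values())
]

print(total)  # 648
# Render the command for the first combination in enumeration order.
print(COMMAND.substitute(combinations[0]))
```

Enumerating the full space is only for illustration; the experiment itself runs no more trials than the search settings allow.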

  2. Configure trial settings.

    • Metric Type: Select stdout.

    • Calculation Method: Select best.

    • Metric Weight:

      • Key: val_loss=([0-9\\.]+).

      • Value: 1.

    • Metric Source: Set to cmd1.

    • Optimization Direction: Select The larger the better.
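The stdout metric settings above can be illustrated with a small Python sketch that applies the configured key as a regular expression to log text. The log lines are hypothetical, and the assumption that the best calculation method keeps the single best observed value according to the optimization direction is an interpretation, not confirmed platform behavior.

```python
import re
from typing import List, Optional

# Metric key exactly as entered in the trial settings; the capture group
# holds the numeric value that follows "val_loss=".
METRIC_PATTERN = re.compile(r"val_loss=([0-9\\.]+)")


def extract_metrics(stdout_text: str) -> List[float]:
    """Return every val_loss value found in the captured stdout."""
    return [float(value) for value in METRIC_PATTERN.findall(stdout_text)]


def best_metric(values: List[float], larger_is_better: bool = True) -> Optional[float]:
    """Pick the best value for the given direction (assumed meaning of 'best')."""
    if not values:
        return None
    return max(values) if larger_is_better else min(values)


# Hypothetical training output; the real script's log format may differ.
sample_log = """\
epoch=1 val_loss=0.1312
epoch=2 val_loss=0.0954
epoch=3 val_loss=0.1021
"""

values = extract_metrics(sample_log)
print(values)
print(best_metric(values, larger_is_better=True))
```

Testing the pattern against a few real log lines before submitting the experiment helps confirm that the key matches what the training script prints.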

  3. Configure search settings.

    • Search Algorithm: Select TPE.

    • Maximum Number of Searches: Set to 5.

    • Maximum Concurrency: Set to 2.

    • Enable earlystop: Turn on the switch.

    • start step: Set to 5.

Step 3: View implementation details and execution results

  1. On the Experiment List page, click the experiment name to go to the Experiment Details page.

    On this page, you can view the progress and status of the trials. The experiment automatically creates five trials based on the Search Algorithm and Maximum Number of Searches settings.

  2. Click the Trials tab. On this page, you can view a list of all trials generated for the experiment, along with each trial's execution status, final metric, and hyperparameter combination.

    Each trial runs for approximately five hours. Based on the configured optimization direction (the larger the better), the hyperparameter combination corresponding to the final metric of 0.087655 is optimal.
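For planning purposes, the total run time of the experiment can be estimated from the number of trials, the maximum concurrency, and the duration per trial. The Python sketch below assumes that every trial takes the full five hours and that trials run in batches of up to two; early stopping can end trials sooner, so treat the result as a rough upper estimate.

```python
import math


def estimate_wall_clock_hours(
    num_trials: int, max_concurrency: int, hours_per_trial: float
) -> float:
    """Estimate total run time when trials execute in equally long batches."""
    if num_trials < 0 or max_concurrency < 1:
        raise ValueError("num_trials must be >= 0 and max_concurrency >= 1")
    batches = math.ceil(num_trials / max_concurrency)
    return batches * hours_per_trial


# Values from this guide: 5 trials, 2 concurrent trials, about 5 hours each.
print(estimate_wall_clock_hours(5, 2, 5.0))  # 15.0
```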

Step 4: Deploy the model service and perform inference

  1. Download the LoRA model and convert the model file format.

    1. After the experiment completes successfully, a model file is generated in the output_dir directory specified in the startup command. To view and download the model file, go to the checkpoint-best directory in the OSS path that is mounted by the dataset for this experiment. For more information, see Get started with the console.

    2. Run the following command to convert pytorch_model.bin to pytorch_model_converted.safetensors.

      wget http://automl-nni.oss-cn-beijing.aliyuncs.com/aigc/convert.py
      python convert.py --file pytorch_model.bin
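After the conversion, it can be useful to confirm that the output is a readable safetensors file before uploading it. The Python sketch below relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header) and lists the tensor names. It does not reproduce what convert.py does, and the function names are illustrative.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Read the JSON header of a safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        length_bytes = f.read(8)
        if len(length_bytes) != 8:
            raise ValueError("file too short to be a safetensors file")
        # The first 8 bytes are an unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", length_bytes)
        header_bytes = f.read(header_len)
        if len(header_bytes) != header_len:
            raise ValueError("truncated safetensors header")
    return json.loads(header_bytes.decode("utf-8"))


def tensor_names(header: dict) -> list:
    """Return sorted tensor names; '__metadata__' is optional and not a tensor."""
    return sorted(name for name in header if name != "__metadata__")
```

For example, tensor_names(read_safetensors_header("pytorch_model_converted.safetensors")) should return a non-empty list of LoRA weight names if the conversion succeeded.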
  2. Deploy the Stable Diffusion WebUI service.

    1. Go to the Elastic Algorithm Service (EAS) page. For more information, see Custom Deployment.

    2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment area, click Custom Deployment.

    3. On the Create Service page, configure the following key parameters.

      • Service Name: Enter a custom service name. In this example, the service name is sdwebui_demo.

      • Deployment Method: Select Deploy Web App by Using Image.

      • Select Image: In the PAI Image list, select stable-diffusion-webui. For Image Version, select 4.2-standard.

        Note: Because versions are updated frequently, select the latest image version for deployment.

      • Model Settings: Click Specify Model Settings to configure the model.

        • For Model Settings, select OSS. Set the OSS path to the path of the OSS bucket that you created in Step 1. For example: oss://bucket-test/data-oss/.

        • Mount Path: Mount the OSS file directory that you configured to the /code/stable-diffusion-webui path in the image. For example, set the path to /code/stable-diffusion-webui/data-oss.

        • Enable Read-Only: Turn off the switch.

      • Command to Run: After the image is configured, the system automatically configures the run command. You must add --data-dir <mount_directory> to the run command. The mount directory must be the same as the last-level directory of the Mount Path in the Model Settings section. In this example, add --data-dir data-oss to the end of the run command.

      • Resource Configuration Mode: Select General.

      • Resource Configuration: Select a GPU type. For the best cost-effectiveness, set Instance Type to ml.gu7i.c16m60.1-gu30.

      • Extra System Storage: Set the extra system disk to 100 GB.
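The rule for the Command to Run setting, where the --data-dir value must equal the last-level directory of the Mount Path, can be expressed as a small Python helper. This is a sketch for checking your own values; the function name is illustrative and not part of any PAI tool.

```python
import posixpath


def data_dir_flag(mount_path: str) -> str:
    """Build the --data-dir argument from the last directory of the mount path."""
    # Strip any trailing slash so basename returns the final directory name.
    last_level = posixpath.basename(mount_path.rstrip("/"))
    if not last_level:
        raise ValueError("mount path has no last-level directory")
    return f"--data-dir {last_level}"


# Mount path from this guide's example.
print(data_dir_flag("/code/stable-diffusion-webui/data-oss"))  # --data-dir data-oss
```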

    4. Click Deploy.

      PAI automatically creates the required directory structure in the empty OSS directory that you configured and copies the necessary data.

  3. Upload the model files to their designated paths, then in the Actions column of the target service, click Restart Service. The changes take effect after the service restarts.

    • Upload the pytorch_model_converted.safetensors file generated in the previous step to the models/lora/ directory in OSS.

    • Upload the revAnimated_v122 foundation model to the models/Stable-diffusion/ directory in OSS.

  4. In the Service Type column of the target service, click View Web App to open the WebUI and validate the model by running inference.
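When validating in the WebUI, the LoRA model usually has to be referenced in the prompt. The Python sketch below builds such a prompt, assuming the deployed image follows the common Stable Diffusion WebUI convention of a <lora:name:weight> tag, where name is the uploaded file name without its extension. The weight of 0.8 is an arbitrary illustrative value, and the prompt text is the validation prompt from the startup command.

```python
import posixpath


def lora_prompt(prompt: str, lora_file: str, weight: float = 1.0) -> str:
    """Append a <lora:name:weight> tag for the given LoRA file to a prompt."""
    # The tag uses the file name without directory or extension.
    name = posixpath.splitext(posixpath.basename(lora_file))[0]
    return f"{prompt} <lora:{name}:{weight}>"


print(
    lora_prompt(
        "a cartoon pikachu pokemon with big eyes and big ears",
        "models/lora/pytorch_model_converted.safetensors",
        0.8,
    )
)
```

Comparing outputs generated with and without the tag, using the same seed, gives a quick visual check that the fine-tuned weights are being applied.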