Deploy Kohya service using EAS - Platform For AI - Alibaba Cloud Documentation Center

This topic describes how to deploy the open-source Kohya_ss service by using Elastic Algorithm Service (EAS) to fine-tune a Low-Rank Adaptation (LoRA) model. In AI image generation scenarios, you can apply the fine-tuned LoRA model to a Stable Diffusion (SD) service to enhance the quality of generated images.

Prerequisites

You have activated Platform for AI (PAI) and created a default Workspace. For more information, see Activate PAI and create the default workspace.
If you use a RAM user to deploy the model, you must grant the RAM user administrative permissions for EAS. For more information, see Cloud service dependencies and authorizations for EAS.
You have created an Object Storage Service (OSS) bucket and directories in the same Region to store training data, output model files, and logs. For instructions on how to upload files, see Upload files by using the OSS console.

Preparations

Log on to the OSS console and navigate to a bucket path in the same region. For example: oss://kohya-demo/kohya/.
In the bucket, create a new project folder. For example: KaraDetroit_loar. Within this project folder, create three subfolders: Image, Log, and Model. If you have a JSON configuration file, you can also upload it to this project folder.
- Image: Stores the training data.
- Model: Stores the output model files.
- Log: Stores logs.
- SS_config.json: (Optional) A JSON file for batch-configuring parameters. If you use this file, you can modify parameters such as folder paths and the output model name. For configuration details, see the project on GitHub. For a sample JSON file, see SS_config.json.
Upload your prepared images to the Image folder. This tutorial uses the sample images and description files from 100_pic.zip. Download and decompress the file, then upload the resulting folder to OSS. The result should look like the following figure:
Important
- Supported image formats are .png, .jpg, .jpeg, .webp, and .bmp.
- Each image must have a corresponding description file with the same name and a .txt extension. The description must be on the first line. If you have multiple descriptive tags, separate them with commas.
- The image folder name must follow the format number_name. For example, 100_pic. The number specifies how many times each image is repeated during training; a value of 100 or greater is recommended. The name can be any string that complies with OSS file naming rules. The total number of training steps (number of images × number of repetitions) should generally be 1,500 or more.
  - If the folder contains 10 images, each image is trained 1500 / 10 = 150 times. The number in the folder name can be 150.
  - If the folder contains 20 images, each image is trained 1500 / 20 = 75 times, which is less than the recommended 100. In this case, the number in the folder name should be 100.
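The naming rule and the description-file rule above can be checked before you upload. The following is a minimal sketch, not part of Kohya_ss or PAI: the function names `recommended_repeats` and `images_missing_captions` are hypothetical, and the defaults of 1,500 steps and 100 repetitions are taken from the guidance above.

```python
import math
import os

# Image formats listed as supported in this topic.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def recommended_repeats(num_images, target_steps=1500, min_repeats=100):
    """Return the number to use in the folder name (number_name).

    Uses enough repetitions to reach target_steps, but never fewer than
    min_repeats, following the two examples in this topic.
    """
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return max(min_repeats, math.ceil(target_steps / num_images))


def images_missing_captions(filenames):
    """Return supported image files that lack a same-named .txt file."""
    names = set(filenames)
    missing = []
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        if ext.lower() in IMAGE_EXTENSIONS and stem + ".txt" not in names:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for count in (10, 20):
        print(f"{count} images -> folder name {recommended_repeats(count)}_pic")
    print(images_missing_captions(["a.png", "a.txt", "b.jpg"]))
```

Pass the result of `os.listdir()` on your local image folder to `images_missing_captions` to list images that still need a description file.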

Deploy a Kohya_ss service

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

On the Custom Deployment page, configure the parameters using a form or a JSON script.

Configure parameters using a form

Parameter		Description
Basic Information	Service Name	Enter a custom name for the service. This tutorial uses `kohya_ss_demo`.
Environment Information	Deployment Method	Select Image-based Deployment, and select Enable Web App.
	Image Configuration	Select Kohya_ss > kohya_ss:2.2 from the Alibaba Cloud Image list. Note The image version is updated frequently. Select the latest version when you deploy the service.
	Mount storage	Select the OSS mount type and configure the following parameters: Uri: Select an OSS path in the same Region. This tutorial uses `oss://kohya-demo/kohya/`. Mount Path: Enter a custom path. This tutorial uses `/workspace`. Important Leave the Enable Read-only Mode switch disabled. Otherwise, the model files cannot be written to your OSS bucket.
	Command	The system automatically populates this field after you select an image. This tutorial uses: `python -u kohya_gui.py --listen=0.0.0.0 --server_port=8000 --headless`. `--listen`: Binds the application to a specific local IP address to receive and process external requests. `--server_port`: Specifies the listening port number.
Resource Information	Resource Type	Select Public Resources.
Resource Information	Deployment Resources	For the Instance Type, we recommend `GPU > ml.gu7i.c16m60.1-gu30` for the best cost-effectiveness. This tutorial uses the minimum available specification, `ml.gu7i.c8m30.1-gu30`.

Configure parameters using a JSON script

Click Edit in the Service Configuration section, and configure the JSON script.

Sample JSON file:

Important

Replace the value of "name" in line 4 and the "path" value of the "oss" object that starts in line 18 with your actual values.

{
    "metadata":
    {
        "name": "kohya_ss_demo",
        "instance": 1,
        "enable_webservice": true
    },
    "cloud":
    {
        "computing":
        {
            "instance_type": "ecs.gn6e-c12g1.12xlarge",
            "instances": null
        }
    },
    "storage": [
    {
        "oss":
        {
            "path": "oss://kohya-demo/kohya/",
            "readOnly": false
        },
        "properties":
        {
            "resource_type": "model"
        },
        "mount_path": "/workspace"
    }],
    "containers": [
    {
        "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/kohya_ss:1.2",
        "script": "python -u kohya_gui.py --listen=0.0.0.0 --server_port=8000 --headless",
        "port": 8000
    }]
}
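
If you deploy more than one service, it can be convenient to generate this script instead of editing it by hand. The following is a minimal sketch that reproduces the sample above with the two replaceable values passed as arguments. The function name `build_service_config` is hypothetical; all other values are copied from the sample and may need to change for your region and image version.

```python
import json


def build_service_config(service_name, oss_path, mount_path="/workspace"):
    """Return the sample EAS configuration with name and OSS path filled in."""
    return {
        "metadata": {
            "name": service_name,
            "instance": 1,
            "enable_webservice": True,
        },
        "cloud": {
            "computing": {
                "instance_type": "ecs.gn6e-c12g1.12xlarge",
                "instances": None,
            }
        },
        "storage": [
            {
                # readOnly must stay false so that model files can be written.
                "oss": {"path": oss_path, "readOnly": False},
                "properties": {"resource_type": "model"},
                "mount_path": mount_path,
            }
        ],
        "containers": [
            {
                "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/kohya_ss:1.2",
                "script": "python -u kohya_gui.py --listen=0.0.0.0 --server_port=8000 --headless",
                "port": 8000,
            }
        ],
    }


if __name__ == "__main__":
    config = build_service_config("kohya_ss_demo", "oss://kohya-demo/kohya/")
    print(json.dumps(config, indent=4))
```

Paste the printed JSON into the Service Configuration editor.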

Check the configuration, and then click Deploy. The deployment takes a few minutes to complete. The service is successfully deployed when the Service Status changes to Running.

Fine-tune a LoRA model

In the service list, find your target service and click Web Application in the Overview column to open the Kohya_ss service UI.
Select LoRA (LoRA).
Configure the Configuration file. This step is optional. If you do not have an SS_config.json file, skip this step.
Note
The configuration file path is formed by joining the Mount Path that you specified in the Configure parameters using a form step, the path of the folder that you created in OSS, and the SS_config.json file. In this tutorial, the path is:
/workspace/KaraDetroit_loar/SS_config.json.
Configure the Source Model. This tutorial uses the safetensors format, which is more secure than the checkpoint format.

Configure the Folders. Enter the paths for the Image, Log, and Model folders you created in OSS, and specify an output file name.

Parameter	Description
Image folder	The path to the folder containing the training images. This path is the Mount Path from your service configuration joined with the `Image` folder path in OSS. In this tutorial, the path is: `/workspace/KaraDetroit_loar/Image`.
Logging folder	The path to the folder where logs are saved. This path is the Mount Path from your service configuration joined with the `Log` folder path in OSS. In this tutorial, the path is: `/workspace/KaraDetroit_loar/Log`.
Output folder	The path to the folder where the output model is saved. This path is the Mount Path from your service configuration joined with the `Model` folder path in OSS. In this tutorial, the path is: `/workspace/KaraDetroit_loar/Model`.
Model output name	The name for the output model file. For example, `my_model`.
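
Each path in the table is the mount path joined with the project folder and a subfolder name. The following is a minimal sketch of that join; the function name `kohya_paths` and the dictionary keys are hypothetical, and the folder names are the ones used in this tutorial.

```python
import posixpath


def kohya_paths(mount_path, project_folder, output_name="my_model"):
    """Build the container paths to enter in the Kohya_ss UI.

    mount_path is the Mount Path from the service configuration, and
    project_folder is the folder created in the mounted OSS path.
    """
    base = posixpath.join(mount_path, project_folder)
    return {
        "config_file": posixpath.join(base, "SS_config.json"),
        "image_folder": posixpath.join(base, "Image"),
        "logging_folder": posixpath.join(base, "Log"),
        "output_folder": posixpath.join(base, "Model"),
        # Where the trained file is expected when safetensors is selected.
        "output_model": posixpath.join(base, "Model", output_name + ".safetensors"),
    }


if __name__ == "__main__":
    for key, value in kohya_paths("/workspace", "KaraDetroit_loar").items():
        print(f"{key}: {value}")
```

`posixpath` is used instead of `os.path` so that the result uses forward slashes even if you run the script on Windows.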

Configure the Parameters. For sample values, refer to the SS_config.json file mentioned in the Preparations section.

Parameter	Description
LoRA type	The type of LoRA: `LoCon`: Can adjust every layer of SD, such as `Res`, `Block`, and `Transformer`. `LoHA`: Processes more information with the same size.
LoRA network weights	Optional. If you are continuing to train a model, select the last trained LoRA file.
Train batch size	The size of the training batch. A larger value requires more GPU memory (VRAM).
Epoch	The number of training epochs. In one epoch, all training data is processed once. Configure the parameter as needed. In most cases: `Total number of training steps in Kohya = Number of training images × Number of repetitions × Number of epochs / Train batch size`. `Total number of training steps in web UI = Number of training images × Number of repetitions`. If you use class images, the total number of training steps in both Kohya and the web UI is doubled, and the number of times the model is saved in Kohya is halved.
Save every N epochs	Saves a checkpoint every N epochs. For example, if set to 2, a training result is saved after every 2 epochs.
Caption Extension	Optional. The file extension for caption files, for example, `.txt`.
Mixed precision	The mixed precision setting, which depends on your GPU performance. Options are `no`, `fp16`, and `bf16`. For GPUs with over 30 GB of VRAM, `bf16` is recommended.
Save precision	The precision used when the model is saved. The options are the same as for Mixed precision.
Number of CPU threads per core	The number of CPU threads per core. Adjust this based on your selected instance and requirements.
Learning rate	The learning rate. The default is `0.0001`.
LR Scheduler	The learning rate scheduler. Select a function such as `cosine` or `cosine with restart` as needed.
LR Warmup (% Of Steps)	The percentage of the total training steps that is used for learning rate warmup. The default is 10. Set to 0 if no warmup is needed.
Optimizer	The optimizer. Select as needed. The default is `AdamW8bit`. `DAdaptation` provides automatic optimization.
Max Resolution	The maximum resolution. Set this based on your image dimensions.
Network Rank (Dimension)	The complexity of the model. A value of 128 is sufficient for most scenarios.
Network Alpha	This value is typically less than or equal to the Network Rank (Dimension). A common configuration is a Network Rank of 128 and a Network Alpha of 64.
Conv dims & Conv alphas	The convolution dimensions and alphas, which affect the scope of the LoRA fine-tuning. Adjust these based on the selected LoRA type. Official Kohya recommendations: `LoCon`: Set `dim <= 64` and `alpha = 1` (or lower). `LoHA`: Set `dim <= 32` and `alpha = 1`.
Clip skip	The number of layers to skip in the CLIP model. The value ranges from 1 to 12. A smaller value produces an image closer to the original or input image. For realistic models, select 1. For anime-style models, select 2.
Sample every n steps	Saves a sample image every N steps.
Sample prompts	Enter the prompt, followed by any of these options: `--n`: the negative prompt. `--w`: the width of the image. `--h`: the height of the image. `--d`: the seed of the image. `--l`: the Classifier Free Guidance (CFG) scale, which indicates the relevance of the image generation to the prompt. `--s`: the number of iteration steps.
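
To estimate how long a run will be before you start it, you can evaluate the Kohya formula from the Epoch row. The following is a minimal sketch; the function name `kohya_total_steps` is hypothetical, and rounding a fractional result up is an assumption, because the formula in the table does not say how a remainder is handled.

```python
import math


def kohya_total_steps(num_images, repeats, epochs, batch_size):
    """Images x repetitions x epochs / batch size, per the Epoch row.

    Rounding up is an assumption; the table gives only the plain formula.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_images * repeats * epochs / batch_size)


if __name__ == "__main__":
    # 10 images in a folder named 150_pic, 1 epoch, batch size 1.
    print(kohya_total_steps(num_images=10, repeats=150, epochs=1, batch_size=1))
    # 20 images in a folder named 100_pic, 2 epochs, batch size 4.
    print(kohya_total_steps(num_images=20, repeats=100, epochs=2, batch_size=4))
```

Increasing the batch size lowers the step count for the same data, so compare runs by images × repetitions × epochs rather than by steps alone.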

At the bottom of the page, click Start training to begin training.
On the Elastic Algorithm Service (EAS) page, in the service list, click the corresponding service name. Click Logs to view the training progress in real time.
When `model saved` appears in the log, the training is complete.
After the training is complete, find the generated LoRA model file in the configured model output directory (Model). For example, my_model.safetensors.

Use a fine-tuned LoRA model with Stable Diffusion

After fine-tuning your LoRA model, you can upload it to the appropriate directory in your Stable Diffusion WebUI service to generate images. For instructions on how to deploy a Stable Diffusion service, see Deploy Stable Diffusion for AI image generation with EAS in a few clicks.

The following section describes how to upload a LoRA model to the SD Web Application.

SD web application (cluster edition)

Configure the Stable Diffusion WebUI image. You must select a -cluster version (for example, stable-diffusion-webui:4.2-cluster-webui).
Add the following parameters to Command:
- --lora-dir: optional.
  - If --lora-dir is not specified, each user's model files are isolated. The service only loads model files from the {OSS-path}/data-{current-user-ID}/models/Lora directory.
  - If --lora-dir is specified, the service loads model files from both the specified directory and the {OSS-path}/data-{current-user-ID}/models/Lora directory for all users. For example: --lora-dir /code/stable-diffusion-webui/data-oss/models/Lora.
- --data-dir {OSS mount path}. Example: --data-dir /code/stable-diffusion-webui/data-oss.
Upload the LoRA model file to the {OSS-path}/data-{current-user-ID}/models/Lora directory. For example: oss://bucket-test/data-oss/data-1596******100/models/Lora.
Note
The service automatically creates the /data-{current-user-ID}/models/Lora path in your OSS bucket after it starts. Therefore, upload the model file only after the service has started. You can find your {current-user-ID} by clicking your profile avatar in the upper-right corner of the console.

SD web application (basic edition)

Configure the Stable Diffusion WebUI image. You must select a version without the -cluster suffix (for example, stable-diffusion-webui:4.2-standard).
Add the following parameter to the Command: --data-dir {OSS-mount-path}. For example: --data-dir /code/stable-diffusion-webui/data-oss.
Upload the LoRA model file to the {OSS-path}/models/Lora directory. For example: oss://bucket-test/data-oss/models/Lora.
Note
The service automatically creates the /models/Lora path in your mounted OSS bucket after it starts, so do not create this path manually. Upload the model file only after the service has started.
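
The upload directory differs between the two editions only by the per-user data-{current-user-ID} segment. The following is a minimal sketch that builds the directory and the extra command arguments described above. The function names `lora_upload_dir` and `webui_extra_args` are hypothetical, and the user ID in the example is a placeholder, not a real account ID.

```python
def lora_upload_dir(oss_path, user_id=None):
    """Return the OSS directory to upload LoRA files to.

    Pass user_id for the cluster edition; omit it for the basic edition.
    """
    base = oss_path.rstrip("/")
    if user_id is not None:
        base = f"{base}/data-{user_id}"
    return f"{base}/models/Lora"


def webui_extra_args(data_dir, lora_dir=None):
    """Return the arguments to append to the service Command.

    --lora-dir is optional and is described only for the cluster edition.
    """
    args = ["--data-dir", data_dir]
    if lora_dir is not None:
        args += ["--lora-dir", lora_dir]
    return args


if __name__ == "__main__":
    oss_path = "oss://bucket-test/data-oss"
    print(lora_upload_dir(oss_path))                     # basic edition
    print(lora_upload_dir(oss_path, user_id="<your-user-id>"))  # cluster edition
    print(" ".join(webui_extra_args("/code/stable-diffusion-webui/data-oss")))
```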