Container Service for Kubernetes: Fine-tune and manage models by using Arena

Last Updated: Nov 01, 2024

This topic describes how to use Arena to submit a model fine-tuning job and manage the model that the job generates. In this example, the Qwen-7B-Chat model is used.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster that runs Kubernetes 1.20 or later and contains at least one GPU-accelerated node is created. For more information, see Create an ACK Pro cluster.

    In this topic, Elastic Compute Service (ECS) instances of the ecs.gn7i-c8g1.2xlarge type are used as GPU-accelerated nodes. For more information about ECS instance types, see Overview of instance families.

  • MLflow Model Registry is deployed in the ACK cluster. For more information, see Configure MLflow Model Registry.

  • The latest version of the Arena client is installed. For more information, see Configure the Arena client.

Background information

For more information about the Qwen-7B-Chat model, see Qwen. The original Qwen code repository cannot directly run the fine-tuned model. Therefore, the repository is modified and the container image is rebuilt in this topic. The image address is kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117.

Step 1: Prepare model data

  1. Create a volume to store model data. In this example, a File Storage NAS (NAS) volume whose persistent volume claim (PVC) is named nas-pvc is used to show how to prepare model data.

    You can use an NAS file system or an Object Storage Service (OSS) bucket to prepare model data based on your business requirements. For more information, see Mount a statically provisioned NAS volume or Mount a statically provisioned OSS volume.

  2. Mount the NAS file system that is used as a volume to an ECS instance. For more information, see topics under Usage notes.

  3. Log on to the ECS instance to download the model data. In this example, the Qwen-7B-Chat model is used.

    1. Run the following command to go to the directory to which the NAS file system is mounted. In this example, /mnt/nas is used.

      cd /mnt/nas
    2. Run the following command to install Git:

      sudo yum install git
    3. Run the following command to install the Git Large File Storage (LFS) extension:

      sudo yum install git-lfs
    4. Run the following command to clone the Qwen-7B-Chat repository from the ModelScope community to your local host:

      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat.git
    5. Run the following command to go to the directory in which the Qwen-7B-Chat repository is stored:

      cd Qwen-7B-Chat
    6. Run the following command to download the large files that are managed by Git LFS from the directory in which the Qwen-7B-Chat repository is stored. You can verify the download by using the optional check that follows this procedure:

      git lfs pull
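
Because the repository is cloned with GIT_LFS_SKIP_SMUDGE=1, the model weight files are downloaded only when you run git lfs pull. The following optional check is a minimal sketch, not part of the procedure, that lists the large files in the clone directory so that you can confirm that the weights were downloaded. The path /mnt/nas/Qwen-7B-Chat is based on the mount directory used in this example; adjust it if your NAS file system is mounted to a different directory.

from pathlib import Path

# Path of the cloned repository on the ECS instance. /mnt/nas is the NAS mount
# directory used in this example; change it if the file system is mounted elsewhere.
repo_dir = Path("/mnt/nas/Qwen-7B-Chat")

# List files larger than 100 MB. After git lfs pull, the model weight shards should
# appear with sizes in the gigabyte range. If the list is empty or the files are only
# a few hundred bytes in size, the LFS download did not complete.
for path in sorted(repo_dir.rglob("*")):
    if path.is_file() and path.stat().st_size > 100 * 1024 * 1024:
        print(f"{path.stat().st_size / 1024 ** 3:.2f} GiB  {path.name}")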

Step 2: Use Arena to deploy the model as an inference service before fine-tuning

To demonstrate the effect of model fine-tuning, first deploy the original Qwen-7B-Chat model as an inference service before the model is fine-tuned.

  1. Run the following command to deploy the original Qwen-7B-Chat model as an inference service:

    arena serve custom \
        --name=qwen-7b-chat \
        --namespace=default \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
        --gpus=1 \
        --data=nas-pvc:/mnt \
        --restful-port=80 \
        "python web_demo.py --server-port 80 --server-name 0.0.0.0 -c /mnt/models/Qwen-7B-Chat/"

    Expected output:

    service/qwen-7b-chat-202404301015 created
    deployment.apps/qwen-7b-chat-202404301015-custom-serving created
    INFO[0003] The Job qwen-7b-chat has been submitted successfully
    INFO[0003] You can run `arena serve get qwen-7b-chat --type custom-serving -n default` to check the job status

    The preceding output indicates that the qwen-7b-chat inference service is deployed.

  2. Run the following command to view the logs of the inference service:

    arena serve logs -f qwen-7b-chat

    Expected output:

    The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
    Try importing flash-attention for faster inference...
    Loading checkpoint shards: 100%|██████████| 8/8 [03:16<00:00, 24.59s/it]
  3. After the model is loaded, run the following command to map the web port of the model to port 8901 of the local host to access the model:

    Important

    You can run the kubectl port-forward command to forward requests only in the test environment. This method is not suitable for production environments. Pay attention to security risks when you use this method.

    kubectl port-forward services/qwen-7b-chat-202404301015 8901:80

    Expected output:

    Forwarding from 127.0.0.1:8901 -> 80
    Forwarding from [::1]:8901 -> 80
  4. Enter http://localhost:8901 in the address bar of your browser to access the Qwen-7B-Chat model and start a conversation.

  5. Run the following command to delete the inference service of the Qwen-7B-Chat model:

    arena serve delete qwen-7b-chat

Step 3: Prepare a dataset to fine-tune the model

Before you fine-tune the Qwen-7B-Chat model by using the Low-Rank Adaptation (LoRA) method, you need to prepare a dataset for model fine-tuning to improve the performance of the model in specific conversation scenarios.

All data samples are stored in a single JSON file as a JSON array. Each data sample must contain the id and conversations fields. The value of the conversations field is an array. The following sample code provides an example dataset for fine-tuning the model:

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "Hello"
      },
      {
        "from": "assistant",
        "value": "I am a language model and my name is Qwen."
      }
    ]
  }
]

Parameter description:

  • id: the unique identifier of the data sample. Example: identity_0.

  • conversations: the array that contains the content of the conversation. Each conversation consists of alternating message objects generated by the user and model. Each message object contains two parameters:

    • from: the source of the message. Valid values: user and assistant.

    • value: the content of the message. For example, the user sends Hello to the model, and the model returns I am a language model and my name is Qwen.

The dataset used in this example is packaged in the container image. This dataset provides a specific answer to the question that is asked in Step 2. You can log on to the container and run the cat /data/shared/Qwen/example.json command to view the details of the fine-tuning data.
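
If you want to build your own dataset in this format, the following minimal sketch shows how to assemble and validate such a file with Python. The sample content and the output file name my_dataset.json are placeholders; replace them with your own data and verify the exact requirements against the finetune script that you use.

import json

# Hypothetical dialogue samples in the format described above. Replace them with your own data.
samples = [
    {
        "id": "identity_0",
        "conversations": [
            {"from": "user", "value": "Hello"},
            {"from": "assistant", "value": "I am a language model and my name is Qwen."},
        ],
    },
]

# Basic validation: every sample needs an id and a non-empty conversations array,
# and every message must come from either the user or the assistant.
for sample in samples:
    assert sample.get("id"), f"missing id: {sample}"
    assert sample.get("conversations"), f"missing conversations: {sample}"
    for message in sample["conversations"]:
        assert message["from"] in ("user", "assistant"), f"unexpected source: {message}"
        assert isinstance(message["value"], str), f"unexpected value type: {message}"

# Write all samples as a single JSON array, which is the layout shown above.
with open("my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)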

Step 4: Use Arena to submit a fine-tuning job

In the following example, Arena is used to submit a job that fine-tunes the Qwen-7B-Chat model by using the LoRA method. The job generates a LoRA model, which is registered in the model registry as a new model named Qwen-7B-Chat-Lora with version 1.

  1. Run the following command to submit a job to fine-tune the Qwen-7B-Chat model by using the LoRA method:

    # Submit a finetune job and register the output PEFT model as a new model version
    arena submit pytorchjob \
        --name=qwen-7b-chat-finetune \
        --namespace=default \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
        --image-pull-policy=Always \
        --gpus=1 \
        --working-dir /data/shared/Qwen \
        --data=nas-pvc:/mnt/ \
        --model-name=Qwen-7B-Chat-Lora \
        --model-source=pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora \
        "bash finetune/finetune_lora_single_gpu.sh -m /mnt/models/Qwen-7B-Chat/ -d example.json -o /mnt/finetune/Qwen-7B-Chat-Lora"

    The path of the Qwen-7B-Chat-Lora model is pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora, which indicates that the model is stored in the /finetune/Qwen-7B-Chat-Lora directory of the nas-pvc volume in the default namespace.

    Expected output:

    pytorchjob.kubeflow.org/qwen-7b-chat-finetune created
    INFO[0004] The Job qwen-7b-chat-finetune has been submitted successfully
    INFO[0004] You can run `arena get qwen-7b-chat-finetune --type pytorchjob -n default` to check the job status
    INFO[0004] registered model "Qwen-7B-Chat-Lora" created
    INFO[0005] model version 1 for "Qwen-7B-Chat-Lora" created

    The preceding output indicates that the fine-tuning job is created and submitted, the fine-tuned model is automatically registered, and the model version is automatically created.

  2. Run the following command to view the job details:

    arena get qwen-7b-chat-finetune

    Expected output:

    Name:          qwen-7b-chat-finetune
    Status:        RUNNING
    Namespace:     default
    Priority:      N/A
    Trainer:       PYTORCHJOB
    Duration:      2m
    CreateTime:    2024-04-29 16:02:01
    EndTime:
    ModelName:     Qwen-7B-Chat-Lora
    ModelVersion:  1
    ModelSource:   pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora/
    
    Instances:
      NAME                            STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
      ----                            ------   ---  --------  --------------  ----
      qwen-7b-chat-finetune-master-0  Running  2m   true      1               ap-southeast-1.XX.XX.XX.XX

    The preceding output shows the details of the fine-tuned model, such as its name, version number, and source path.

  3. Run the following command to view the logs of the fine-tuning job:

    arena logs -f qwen-7b-chat-finetune

    Expected output:

    + export CUDA_VISIBLE_DEVICES=0
    + CUDA_VISIBLE_DEVICES=0
    + mkdir -p /mnt/finetune/Qwen-7B-Chat-Lora
    + python finetune.py --model_name_or_path /mnt/models/Qwen-7B-Chat/ --data_path example.json --bf16 True --output_dir /mnt/finetune/Qwen-7B-Chat-Lora --num_train_epochs 5 --per_device_train_batch_size 2 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy no --save_strategy steps --save_steps 1000 --save_total_limit 10 --learning_rate 3e-4 --weight_decay 0.1 --adam_beta2 0.95 --warmup_ratio 0.01 --lr_scheduler_type cosine --logging_steps 1 --report_to none --model_max_length 512 --lazy_preprocess True --gradient_checkpointing --use_lora
    [2024-04-30 02:26:42,358] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    ...
    Loading checkpoint shards: 100%|██████████| 8/8 [00:02<00:00,  3.29it/s]
    /usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
    dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
      warnings.warn(
    You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
    trainable params: 143,130,624 || all params: 7,864,455,168 || trainable%: 1.8199687192876373
    Loading data...
    Formatting inputs...Skip in lazy mode
    100%|██████████| 20/20 [02:42<00:00,  8.12s/it]
    {'loss': 2.6322, 'learning_rate': 0.0003, 'epoch': 0.23}
    {'loss': 2.6542, 'learning_rate': 0.00029795419551040833, 'epoch': 0.46}
    {'loss': 2.3209, 'learning_rate': 0.00029187258625509513, 'epoch': 0.69}
    {'loss': 2.1613, 'learning_rate': 0.00028192106268097334, 'epoch': 0.91}
    {'loss': 1.6563, 'learning_rate': 0.00026837107640945905, 'epoch': 1.14}
    {'loss': 1.4985, 'learning_rate': 0.00025159223574386114, 'epoch': 1.37}
    {'loss': 1.3369, 'learning_rate': 0.00023204222371836405, 'epoch': 1.6}
    {'loss': 1.0505, 'learning_rate': 0.0002102543136979454, 'epoch': 1.83}
    {'loss': 0.7033, 'learning_rate': 0.00018682282307111987, 'epoch': 2.06}
    {'loss': 0.5576, 'learning_rate': 0.00016238690182084986, 'epoch': 2.29}
    {'loss': 0.2523, 'learning_rate': 0.00013761309817915014, 'epoch': 2.51}
    {'loss': 0.2481, 'learning_rate': 0.00011317717692888012, 'epoch': 2.74}
    {'loss': 0.1343, 'learning_rate': 8.97456863020546e-05, 'epoch': 2.97}
    {'loss': 0.0676, 'learning_rate': 6.795777628163599e-05, 'epoch': 3.2}
    {'loss': 0.0489, 'learning_rate': 4.840776425613886e-05, 'epoch': 3.43}
    {'loss': 0.0312, 'learning_rate': 3.162892359054098e-05, 'epoch': 3.66}
    {'loss': 0.018, 'learning_rate': 1.8078937319026654e-05, 'epoch': 3.89}
    {'loss': 0.0134, 'learning_rate': 8.127413744904804e-06, 'epoch': 4.11}
    {'loss': 0.0141, 'learning_rate': 2.0458044895916513e-06, 'epoch': 4.34}
    {'loss': 0.0099, 'learning_rate': 0.0, 'epoch': 4.57}
    {'train_runtime': 162.4618, 'train_samples_per_second': 2.154, 'train_steps_per_second': 0.123, 'train_loss': 0.8704732102807611, 'epoch': 4.57}

Step 5: Use Arena to access the model registry

  1. Run the following command to use Arena to query all the registered models:

    arena model list

    Expected output:

    NAME                 LATEST_VERSION       LAST_UPDATED_TIME
    Qwen-7B-Chat-Lora    1                    2024-04-30T10:26:14+08:00
  2. Run the following command to view the details of version 1 of the registered Qwen-7B-Chat-Lora model:

    arena model get \
        --name Qwen-7B-Chat-Lora \
        --version 1

    Expected output:

    Name:                Qwen-7B-Chat-Lora
    Version:             1
    CreationTime:        2024-04-30T10:26:14+08:00
    LastUpdatedTime:     2024-04-30T10:26:14+08:00
    Source:              pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora
    Description:
      arena submit pytorchjob \
          --data nas-pvc:/mnt/ \
          --gpus 1 \
          --image kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
          --image-pull-policy Always \
          --model-name Qwen-7B-Chat-Lora \
          --model-source pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora \
          --name qwen-7b-chat-finetune \
          --namespace default \
          --working-dir /data/shared/Qwen \
          "bash finetune/finetune_lora_single_gpu.sh -m /mnt/models/Qwen-7B-Chat/ -d example.json -o /mnt/finetune/Qwen-7B-Chat-Lora"
    Tags:
      createdBy: arena
      modelName: Qwen-7B-Chat-Lora
      arena.kubeflow.org/uid: 3399d840e8b371ed7ca45dda29debeb1

    The preceding output indicates that Arena automatically adds the complete command that was used to submit the fine-tuning job to the description of the model version and adds the corresponding tags. You can also query the same information through the MLflow Python client, as shown in the following sketch.
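
Because the model registry in this topic is backed by MLflow Model Registry, the registered model can also be queried with the MLflow Python client. The following is a minimal sketch; the tracking URI http://localhost:5000 is an assumption and depends on how MLflow is exposed in your cluster, for example through a kubectl port-forward to the MLflow service.

import mlflow
from mlflow import MlflowClient

# Assumption: the MLflow tracking and registry server is reachable at this address.
mlflow.set_tracking_uri("http://localhost:5000")
client = MlflowClient()

# List all registered models, similar to `arena model list`.
for model in client.search_registered_models():
    print(model.name)

# Show the details of version 1 of the fine-tuned model, similar to `arena model get`.
version = client.get_model_version(name="Qwen-7B-Chat-Lora", version="1")
print(version.source)       # pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora
print(version.description)  # the arena submit command that Arena recorded
print(version.tags)         # createdBy, modelName, and other tags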

Step 6: Use Arena to deploy the fine-tuned model as an inference service

  1. Run the following command to deploy the fine-tuned Qwen-7B-Chat model as an inference service:

    arena serve custom \
        --name=qwen-7b-chat-lora \
        --namespace=default \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
        --image-pull-policy=Always \
        --gpus=1 \
        --data=nas-pvc:/mnt \
        --restful-port=80 \
        --model-name=Qwen-7B-Chat-Lora \
        --model-version=1 \
        "python web_demo_peft.py --server-port 80 --server-name 0.0.0.0 -c /mnt/finetune/Qwen-7B-Chat-Lora"

    Expected output:

    service/qwen-7b-chat-lora-202404301036 created
    deployment.apps/qwen-7b-chat-lora-202404301036-custom-serving created
    INFO[0003] The Job qwen-7b-chat-lora has been submitted successfully
    INFO[0003] You can run `arena serve get qwen-7b-chat-lora --type custom-serving -n default` to check the job status

    The preceding output indicates that the fine-tuned model is deployed as an inference service. The web_demo_peft.py script in the image loads the fine-tuned weights from /mnt/finetune/Qwen-7B-Chat-Lora; a conceptual sketch of how a LoRA adapter is loaded on top of the base model is provided after this procedure.

  2. Run the following command to view the logs of the inference service:

    arena serve logs -f qwen-7b-chat-lora

    Expected output:

    The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
    Try importing flash-attention for faster inference...
    Loading checkpoint shards: 100%|██████████| 8/8 [03:10<00:00, 23.76s/it]

  3. After the model is loaded, run the following command to map the web port of the model to port 8901 of the local host to access the model:

    Important

    You can run the kubectl port-forward command to forward requests only in the test environment. This method is not suitable for production environments. Pay attention to security risks when you use this method.

    kubectl port-forward services/qwen-7b-chat-lora-202404301036 8901:80

    Expected output:

    Forwarding from 127.0.0.1:8901 -> 80
    Forwarding from [::1]:8901 -> 80
  4. Enter http://localhost:8901 in the address bar of your browser and ask the same question as in Step 2.

    The following example shows a simple conversation in which the user asks a question and the model provides an answer. Q indicates the question that the user sends to the model, and A indicates the answer that the model generates.

    Q: Can I change the container runtime after I create a cluster?
    A: After you create a cluster, you cannot change the container runtime. However, you can create node pools with different runtimes.

    If you compare this answer with the built-in fine-tuning data in the container image, you can see that the quality of the answer is significantly improved after the model is fine-tuned.

  5. Run the following command to delete the inference service of the fine-tuned Qwen-7B-Chat-Lora model:

    arena serve delete qwen-7b-chat-lora
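
At serving time, web_demo_peft.py reads the fine-tuned weights from /mnt/finetune/Qwen-7B-Chat-Lora. The following is not that script, but a minimal conceptual sketch of how a LoRA adapter produced by the fine-tuning job can be loaded on top of the base Qwen-7B-Chat model for inference. It assumes that the transformers and peft libraries are installed and that the paths match the volume layout used in this topic; the model.chat() helper is part of Qwen's custom modeling code and may behave differently in your image.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Paths follow the volume layout used in this topic; adjust them to your environment.
base_model_path = "/mnt/models/Qwen-7B-Chat/"
adapter_path = "/mnt/finetune/Qwen-7B-Chat-Lora"

# Load the base Qwen-7B-Chat model. trust_remote_code is required because Qwen ships
# custom modeling code together with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# Apply the LoRA adapter produced by the fine-tuning job on top of the base weights.
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

# Qwen's custom code exposes a chat() helper; this call is an assumption based on the
# upstream Qwen repository and may differ in your image.
response, _ = model.chat(
    tokenizer, "Can I change the container runtime after I create a cluster?", history=None
)
print(response)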

References

For more information about how to manage models in MLflow Model Registry, see Manage models in MLflow Model Registry.