This topic describes how to use Arena to submit a model fine-tuning job and manage the model generated by the job. In this example, the Qwen-7B-Chat model is used.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster that runs Kubernetes 1.20 or later and contains at least one GPU-accelerated node is created. For more information, see Create an ACK Pro cluster.
In this topic, Elastic Compute Service (ECS) instances of the ecs.gn7i-c8g1.2xlarge type are used as GPU-accelerated nodes. For more information about ECS instance types, see Overview of instance families.
MLflow Model Registry is deployed in the ACK cluster. For more information, see Configure MLflow Model Registry.
The latest version of the Arena client is installed. For more information, see Configure the Arena client.
Background information
For more information about the Qwen-7B-Chat model, see Qwen. The official Qwen code repository cannot directly run the fine-tuned model. Therefore, the repository is modified and the container image is rebuilt for this topic. The image address is kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117.
Step 1: Prepare model data
Create a volume to prepare model data. In this example, a File Storage NAS (NAS) volume named nas-pvc is used to describe how to prepare model data. You can use a NAS file system or an Object Storage Service (OSS) bucket based on your business requirements. For more information, see Mount a statically provisioned NAS volume or Mount a statically provisioned OSS volume.
Mount the NAS file system that provides the volume to an ECS instance. For more information, see the topics under Usage notes.
Log on to the ECS instance to download the model data. In this example, the Qwen-7B-Chat model is used.
Run the following command to go to the directory to which the NAS file system is mounted. In this example, /mnt/nas is used.
cd /mnt/nas
Run the following command to install Git:
sudo yum install git
Run the following command to install the Git Large File Storage (LFS) plug-in:
sudo yum install git-lfs
Run the following command to clone the Qwen-7B-Chat repository from the ModelScope community to your local host:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat.git
Run the following command to go to the directory in which the Qwen-7B-Chat repository is stored:
cd Qwen-7B-Chat
Run the following command to download the large files that are managed by the LFS plug-in:
git lfs pull
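Optionally, before you continue, you can verify that the LFS-managed weight files were fully downloaded instead of remaining as pointer stubs. The following Python snippet is a minimal sketch that you can run on the ECS instance; the directory /mnt/nas/Qwen-7B-Chat follows from the mount path used in this example, and the weight file extensions are assumptions based on typical model repositories.
import os

model_dir = "/mnt/nas/Qwen-7B-Chat"
for name in sorted(os.listdir(model_dir)):
    path = os.path.join(model_dir, name)
    if name.endswith((".safetensors", ".bin")):
        size_mb = os.path.getsize(path) / (1024 * 1024)
        # Weight files that are only a few hundred bytes in size are Git LFS pointer
        # stubs, which means that git lfs pull has not downloaded the actual weights yet.
        status = "OK" if size_mb > 1 else "LFS pointer only"
        print(f"{name}: {size_mb:.1f} MiB ({status})")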
Step 2: Use Arena to deploy the model as an inference service before fine-tuning
To demonstrate the effect of model fine-tuning, first deploy the original Qwen-7B-Chat model as an inference service and interact with it before the model is fine-tuned.
Run the following command to deploy the original Qwen-7B-Chat model as an inference service:
arena serve custom \
    --name=qwen-7b-chat \
    --namespace=default \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
    --gpus=1 \
    --data=nas-pvc:/mnt \
    --restful-port=80 \
    "python web_demo.py --server-port 80 --server-name 0.0.0.0 -c /mnt/models/Qwen-7B-Chat/"
Expected output:
service/qwen-7b-chat-202404301015 created
deployment.apps/qwen-7b-chat-202404301015-custom-serving created
INFO[0003] The Job qwen-7b-chat has been submitted successfully
INFO[0003] You can run `arena serve get qwen-7b-chat --type custom-serving -n default` to check the job status
The preceding output indicates that the qwen-7b-chat inference service is deployed.
Run the following command to view the run logs of the model:
arena serve logs -f qwen-7b-chat
Expected output:
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Loading checkpoint shards: 100%|██████████| 8/8 [03:16<00:00, 24.59s/it]
After the model is loaded, run the following command to map the web port of the model to port 8901 of the local host so that you can access the model:
Important
You can run the kubectl port-forward command to forward requests only in test environments. This method is not suitable for production environments. Pay attention to security risks when you use this method.
kubectl port-forward services/qwen-7b-chat-202404301015 8901:80
Expected output:
Forwarding from 127.0.0.1:8901 -> 80
Forwarding from [::1]:8901 -> 80
Enter http://localhost:8901 in the address bar of your browser to access the Qwen-7B-Chat model and start a dialogue.
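If you want to confirm from the command line that the forwarded port responds before you open the browser, the following Python snippet is a minimal check. It assumes that the requests library is installed on your local host and that the port-forward session from the previous command is still running.
import requests

# A 200 status code indicates that the forwarded web demo is reachable.
resp = requests.get("http://localhost:8901", timeout=10)
print(resp.status_code)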
Run the following command to delete the qwen-7b-chat inference service after you finish testing:
arena serve delete qwen-7b-chat
Step 3: Prepare a dataset to fine-tune the model
Before you fine-tune the Qwen-7B-Chat model by using the Low-Rank Adaptation (LoRA) method, you need to prepare a dataset for model fine-tuning to improve the performance of the model in specific conversation scenarios.
The dataset is a JSON file that contains a JSON array of data samples. Each data sample must contain the id and conversations fields. The value of the conversations field is an array of messages. The following sample code provides an example dataset for fine-tuning the model:
[
{
"id": "identity_0",
"conversations": [
{
"from": "user",
"value": "Hello"
},
{
"from": "assistant",
"value": "I am a language model and my name is Qwen."
}
]
}
]
Parameter description:
id: the unique identifier of the data sample. Example: identity_0.
conversations: the array that contains the content of the conversation. Each conversation consists of alternating message objects generated by the user and the model. Each message object contains two parameters:
from: the source of the message. Valid values: user and assistant.
value: the content of the message. For example, the user enters Hello, and the model returns I am a language model and my name is Qwen.
The dataset used in this example is packaged in the image. This dataset provides a specific answer to the question asked in Step 2. You can log on to the container and run the cat /data/shared/Qwen/example.json command to view the details of the fine-tuning data.
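If you prepare your own dataset, you can run a quick format check before you submit the fine-tuning job. The following Python snippet is a minimal sketch that validates a dataset file against the format described above; the file name example.json is taken from the dataset packaged in the image, so replace it with the path of your own file if necessary.
import json

# Load the dataset and make sure that every sample follows the required structure.
with open("example.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    assert "id" in sample, "each data sample must contain an id field"
    conversations = sample["conversations"]
    assert isinstance(conversations, list) and conversations, "conversations must be a non-empty array"
    for message in conversations:
        assert message["from"] in ("user", "assistant"), f"unexpected message source: {message['from']}"
        assert isinstance(message["value"], str), "value must be a string"

print(f"{len(samples)} data samples passed the format check")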
Step 4: Use Arena to submit a fine-tuning job
In the following example, Arena is used to submit a job that fine-tunes the Qwen-7B-Chat model by using the LoRA method. The fine-tuning job generates a LoRA model, which is automatically registered in the model registry as a new model named Qwen-7B-Chat-Lora.
Run the following command to submit a job to fine-tune the Qwen-7B-Chat model by using the LoRA method:
# Submit a finetune job and register the output PEFT model as a new model version
arena submit pytorchjob \
    --name=qwen-7b-chat-finetune \
    --namespace=default \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
    --image-pull-policy=Always \
    --gpus=1 \
    --working-dir /data/shared/Qwen \
    --data=nas-pvc:/mnt/ \
    --model-name=Qwen-7B-Chat-Lora \
    --model-source=pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora \
    "bash finetune/finetune_lora_single_gpu.sh -m /mnt/models/Qwen-7B-Chat/ -d example.json -o /mnt/finetune/Qwen-7B-Chat-Lora"
The path of the Qwen-7B-Chat-Lora model is pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora, which indicates that the model is stored in the /finetune/Qwen-7B-Chat-Lora directory of the nas-pvc volume in the default namespace.
Expected output:
pytorchjob.kubeflow.org/qwen-7b-chat-finetune created
INFO[0004] The Job qwen-7b-chat-finetune has been submitted successfully
INFO[0004] You can run `arena get qwen-7b-chat-finetune --type pytorchjob -n default` to check the job status
INFO[0004] registered model "Qwen-7B-Chat-Lora" created
INFO[0005] model version 1 for "Qwen-7B-Chat-Lora" created
The preceding output indicates that the fine-tuning job is created and submitted, the fine-tuned model is automatically registered, and the model version is automatically created.
Run the following command to view the job details:
arena get qwen-7b-chat-finetune
Expected output:
Name:          qwen-7b-chat-finetune
Status:        RUNNING
Namespace:     default
Priority:      N/A
Trainer:       PYTORCHJOB
Duration:      2m
CreateTime:    2024-04-29 16:02:01
EndTime:
ModelName:     Qwen-7B-Chat-Lora
ModelVersion:  1
ModelSource:   pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora/

Instances:
  NAME                            STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                            ------   ---  --------  --------------  ----
  qwen-7b-chat-finetune-master-0  Running  2m   true      1               ap-southeast-1.XX.XX.XX.XX
The preceding output records the details of the fine-tuned model, such as the name, version number, and source path.
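If you want to poll the job status from a script instead of running the arena get command, you can query the underlying PyTorchJob resource directly. The following Python snippet is a minimal sketch that assumes the kubernetes Python client is installed and a kubeconfig for the cluster is available on the host where you run it.
from kubernetes import client, config

# Load the local kubeconfig and query the PyTorchJob created by Arena.
config.load_kube_config()
api = client.CustomObjectsApi()
job = api.get_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
    name="qwen-7b-chat-finetune",
)
# Print the job conditions, such as Created, Running, and Succeeded.
for condition in job.get("status", {}).get("conditions", []):
    print(condition["type"], condition["status"])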
Run the following command to view the logs of the fine-tuning job:
arena logs -f qwen-7b-chat-finetune
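The fine-tuning job writes a LoRA (PEFT) adapter to the /mnt/finetune/Qwen-7B-Chat-Lora directory rather than a full copy of the model. The following Python snippet is a minimal sketch of how such an adapter can be loaded on top of the base model for a quick offline test, for example inside the container image used in this topic where both paths are mounted. It assumes that the peft and transformers libraries are available and is not part of the fine-tuning job itself.
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the LoRA adapter produced by the fine-tuning job. The adapter references the
# base model path that was passed to the fine-tuning script (-m /mnt/models/Qwen-7B-Chat/).
model = AutoPeftModelForCausalLM.from_pretrained(
    "/mnt/finetune/Qwen-7B-Chat-Lora",
    device_map="auto",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/models/Qwen-7B-Chat",
    trust_remote_code=True,
)

# Ask the question that is used for comparison in this topic.
response, _ = model.chat(tokenizer, "Can I change the container runtime after I create a cluster?", history=None)
print(response)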
Step 5: Use Arena to access the model registry
Run the following command to use Arena to query all the registered models:
arena model list
Expected output:
NAME               LATEST_VERSION  LAST_UPDATED_TIME
Qwen-7B-Chat-Lora  1               2024-04-30T10:26:14+08:00
Run the following command to view the version details of the registered Qwen-7B-Chat-Lora model whose version number is 1:
arena model get \
    --name Qwen-7B-Chat-Lora \
    --version 1
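Because the model registry used in this topic is backed by MLflow Model Registry (see the prerequisites), you can also query the same information through the MLflow client. The following Python snippet is a minimal sketch; it assumes that the mlflow library is installed and that the MLFLOW_TRACKING_URI environment variable points to the MLflow instance deployed in your cluster.
import mlflow

# List all registered versions of the Qwen-7B-Chat-Lora model.
client = mlflow.MlflowClient()
for mv in client.search_model_versions("name='Qwen-7B-Chat-Lora'"):
    print(mv.name, mv.version, mv.source)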
Step 6: Use Arena to deploy the fine-tuned model as an inference service
Run the following command to deploy the fine-tuned Qwen-7B-Chat model as an inference service:
arena serve custom \
    --name=qwen-7b-chat-lora \
    --namespace=default \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
    --image-pull-policy=Always \
    --gpus=1 \
    --data=nas-pvc:/mnt \
    --restful-port=80 \
    --model-name=Qwen-7B-Chat-Lora \
    --model-version=1 \
    "python web_demo_peft.py --server-port 80 --server-name 0.0.0.0 -c /mnt/finetune/Qwen-7B-Chat-Lora"
Expected output:
service/qwen-7b-chat-lora-202404301036 created
deployment.apps/qwen-7b-chat-lora-202404301036-custom-serving created
INFO[0003] The Job qwen-7b-chat-lora has been submitted successfully
INFO[0003] You can run `arena serve get qwen-7b-chat-lora --type custom-serving -n default` to check the job status
The preceding output indicates that the fine-tuned model is deployed as an inference service.
Run the following command to view the run logs of the job:
arena serve logs -f qwen-7b-chat-lora
Expected output:
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Loading checkpoint shards: 100%|██████████| 8/8 [03:10<00:00, 23.76s/it]
After the model is loaded, run the following command to map the web port of the model to port 8901 of the local host so that you can access the model:
Important
You can run the kubectl port-forward command to forward requests only in test environments. This method is not suitable for production environments. Pay attention to security risks when you use this method.
kubectl port-forward services/qwen-7b-chat-lora-202404301036 8901:80
Expected output:
Forwarding from 127.0.0.1:8901 -> 80
Forwarding from [::1]:8901 -> 80
Enter http://localhost:8901 in the address bar of your browser and ask the same question as in Step 2.
The following example shows a simple conversation scenario in which the user asks a question and the model provides an answer. Q indicates the question that the user sends to the model, and A indicates the answer generated by the model.
Q: Can I change the container runtime after I create a cluster?
A: After you create a cluster, you cannot change the container runtime. However, you can create node pools with different runtimes.
If you compare this answer with the fine-tuning data built into the container image, you can see that the quality of the answers provided by the model is significantly improved after fine-tuning.
Run the following command to delete the inference service that serves the fine-tuned Qwen-7B-Chat-Lora model:
arena serve delete qwen-7b-chat-lora
References
For more information about how to manage models in MLflow Model Registry, see Manage models in MLflow Model Registry.