Use Arena for model fine-tuning and model management - Container Service for Kubernetes

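This topic uses the large language model Qwen-7B-Chat as an example to show how to use Arena to submit a model fine-tuning job and, at the same time, manage the model that the job produces.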

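Prerequisites

An ACK Pro cluster that contains at least one GPU node is created, and the cluster runs Kubernetes 1.20 or later. For more information, see Create an ACK Pro cluster.
The GPU instance type used in this topic is ecs.gn7i-c8g1.2xlarge. For details about GPU instance specifications, see Instance families.
The MLflow model registry is deployed in the ACK cluster. For more information, see Configure the MLflow model registry.
The latest version of the Arena client is installed. For more information, see Configure the Arena client.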

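Background information

For details about the Qwen-7B-Chat large language model, see the official Qwen code repository. Because the code in that repository cannot directly run a fine-tuned model, this topic makes minor modifications to it and rebuilds the container image. The image address is kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117.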

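Step 1: Prepare model data

Create a volume to hold the model data. This topic uses a NAS volume named nas-pvc as an example.
You can use either a NAS volume or an OSS volume to prepare the model data, as needed. For more information, see Use a statically provisioned NAS volume or Use a statically provisioned OSS volume.
Mount the NAS file system that backs the NAS volume to an ECS instance. For more information, see Scenarios for mounting a file system.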
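Log on to the ECS instance and download the model. This topic uses the Qwen-7B-Chat model as an example. The commands in later steps mount the nas-pvc volume at /mnt and load the model from /mnt/models/Qwen-7B-Chat/, so make sure that the repository ends up at models/Qwen-7B-Chat relative to the root of the volume.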
1. Run the following command to go to the mount path of the NAS file system, for example /mnt/nas.
```
cd /mnt/nas
```
2. Run the following command to install Git.
```
sudo yum install git
```
3. Run the following command to install the Git LFS (Large File Storage) extension.
```
sudo yum install git-lfs
```
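4. Run the following command to clone the Qwen-7B-Chat repository from ModelScope to the local machine. Setting GIT_LFS_SKIP_SMUDGE=1 skips the large LFS-managed files during the clone; they are downloaded in a later step.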
```
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat.git
```
5. Run the following command to go to the Qwen-7B-Chat repository directory.
```
cd Qwen-7B-Chat
```
6. Run the following command in the Qwen-7B-Chat directory to download the large files managed by LFS.
```
git lfs pull
```
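Because the repository was cloned with GIT_LFS_SKIP_SMUDGE=1, each large file exists only as a small pointer file until `git lfs pull` finishes. The following Python sketch is one possible way to confirm that no pointer files remain. It assumes that the clone lives at /mnt/nas/Qwen-7B-Chat and relies on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`. The helper name `find_lfs_pointers` is hypothetical and is not part of Git or Arena.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file (per the Git LFS pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir):
    """Return files under repo_dir that are still LFS pointers, not real content."""
    root = Path(repo_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's own metadata directory.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as fh:
            if fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers


# Assumed location of the clone on the ECS instance.
leftover = find_lfs_pointers("/mnt/nas/Qwen-7B-Chat")
print("pointer files remaining:", [str(p) for p in leftover])
```

An empty list suggests that the model weights were downloaded in full. Any listed file would need another `git lfs pull`.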

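Step 2: Use Arena to deploy an inference service for the model before fine-tuning

To show the effect of fine-tuning, first run an inference service for the Qwen-7B-Chat model before it is fine-tuned.

Run the following command to run an inference service for the Qwen-7B-Chat model that has not been fine-tuned.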

arena serve custom \
    --name=qwen-7b-chat \
    --namespace=default \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
    --gpus=1 \
    --data=nas-pvc:/mnt \
    --restful-port=80 \
    "python web_demo.py --server-port 80 --server-name 0.0.0.0 -c /mnt/models/Qwen-7B-Chat/"

Expected output:

service/qwen-7b-chat-202404301015 created
deployment.apps/qwen-7b-chat-202404301015-custom-serving created
INFO[0003] The Job qwen-7b-chat has been submitted successfully
INFO[0003] You can run `arena serve get qwen-7b-chat --type custom-serving -n default` to check the job status

The output indicates that the qwen-7b-chat inference service is deployed.

Run the following command to view the logs of the inference service.

arena serve logs -f qwen-7b-chat

Expected output:

The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Loading checkpoint shards: 100%|██████████| 8/8 [03:16<00:00, 24.59s/it]

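After the model is loaded, run the following command to forward its web port to local port 8901 so that you can access it.
Important
Port forwarding set up with kubectl port-forward is suitable only for quick verification in test environments. It is not suitable for production environments. Be aware of the security risks when you use it.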
```
kubectl port-forward services/qwen-7b-chat-202404301015 8901:80
```
Expected output:
```
Forwarding from 127.0.0.1:8901 -> 80
Forwarding from [::1]:8901 -> 80
```
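The web UI responds only after the model has finished loading and the port forwarding is active. As an optional convenience, the following Python sketch polls the forwarded address until it answers with HTTP 200. The function name `wait_until_ready` and the timeout values are illustrative assumptions. The URL is the local address used in this topic.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout_s=60.0, interval_s=2.0):
    """Poll url until it answers with HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Not reachable yet, for example because the port-forward is not up.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call against the forwarded port used in this topic:
# wait_until_ready("http://localhost:8901")
```

The function returns False instead of raising an error, so a calling script can decide whether to retry or to inspect the service logs.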

Enter http://localhost:8901 in the address bar of your browser to start a conversation with Qwen-7B-Chat.

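The following sample shows a simple conversation in which the user asks a question and the assistant replies. Q indicates user input and A indicates model output:

Q: After a cluster is created, can I change the container runtime?
A: Yes, after a cluster is created, you can change the container runtime. Alibaba Cloud server images and platforms provide a variety of runtime environments, such as Docker and Kubernetes. You can choose a suitable runtime environment based on your needs.
To change the container runtime, you need to stop the currently running container and start a new one. If you want to make the change on an existing cluster, you may need to create a new cluster and migrate your applications from the old cluster to the new one.
In addition, Alibaba Cloud provides automation tools and services that can help you easily change the runtime in a cluster. For example, you can use the "one-click deployment" service to quickly deploy new applications, or use the "unified monitoring" service to monitor your applications and detect and handle issues in a timely manner.

Run the following command to delete the Qwen-7B-Chat inference service.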
```
arena serve delete qwen-7b-chat
```

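Step 3: Prepare the fine-tuning dataset

Before you perform LoRA (Low-Rank Adaptation) fine-tuning on the Qwen-7B-Chat model, prepare a fine-tuning dataset to improve the model's performance in a specific conversation scenario.

All samples in the fine-tuning data are stored in a single JSON file as one JSON array. Each sample must contain the id and conversations fields, and the conversations field is itself an array. The following is a sample fine-tuning dataset: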

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "Hello"
      },
      {
        "from": "assistant",
        "value": "I am a language model, and my name is Tongyi Qianwen."
      }
    ]
  }
]

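The fields of the fine-tuning dataset are described as follows:

id: a unique identifier that distinguishes one conversation sample from another, for example "id": "identity_0".
conversations: an array that holds the actual conversation. Each conversation consists of alternating user and assistant messages, and each message object has two fields:
- "from": the source of the message, either "user" or "assistant".
- "value": the content of the message, such as the user's greeting or the assistant's self-introduction in the sample above.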

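The dataset used for this fine-tuning run is already packaged in the image and provides a specific answer to the question asked in Step 2. To view the fine-tuning data, log on to the container and run cat /data/shared/Qwen/example.json.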
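If you prepare your own dataset in this format, a quick structural check before you submit the job can catch mistakes early. The following Python sketch validates only the rules described in this step: a top-level array, a unique id for each sample, and a conversations array whose messages carry "from" and "value". The function name `validate_dataset` is hypothetical, and the check is not part of Qwen or Arena.

```python
import json

ALLOWED_ROLES = {"user", "assistant"}


def validate_dataset(samples):
    """Return a list of problems; an empty list means the structure looks valid."""
    if not isinstance(samples, list):
        return ["top level must be a JSON array"]
    problems = []
    seen_ids = set()
    for i, sample in enumerate(samples):
        if not isinstance(sample, dict):
            problems.append(f"sample {i}: must be an object")
            continue
        sample_id = sample.get("id")
        if not isinstance(sample_id, str) or not sample_id:
            problems.append(f"sample {i}: missing or empty 'id'")
        elif sample_id in seen_ids:
            problems.append(f"sample {i}: duplicate id {sample_id!r}")
        else:
            seen_ids.add(sample_id)
        turns = sample.get("conversations")
        if not isinstance(turns, list) or not turns:
            problems.append(f"sample {i}: 'conversations' must be a non-empty array")
            continue
        for j, turn in enumerate(turns):
            if not isinstance(turn, dict) or turn.get("from") not in ALLOWED_ROLES:
                problems.append(f"sample {i}, message {j}: 'from' must be 'user' or 'assistant'")
            elif not isinstance(turn.get("value"), str):
                problems.append(f"sample {i}, message {j}: 'value' must be a string")
    return problems


# A sample shaped like the one shown in this step.
example = json.loads("""
[{"id": "identity_0",
  "conversations": [{"from": "user", "value": "Hello"},
                    {"from": "assistant", "value": "I am a language model."}]}]
""")
print(validate_dataset(example))  # prints []
```

To check a file instead, load it with `json.load` and pass the result to the same function.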

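Step 4: Use Arena to submit a fine-tuning job

The following steps use Arena to submit a job that performs LoRA fine-tuning on the Qwen-7B-Chat model. The job produces a LoRA model, which this topic registers as a new model version under the model name Qwen-7B-Chat-Lora.

Run the following command to submit a LoRA fine-tuning job for the Qwen-7B-Chat model.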

# Submit a finetune job and register the output PEFT model as a new model version
arena submit pytorchjob \
    --name=qwen-7b-chat-finetune \
    --namespace=default \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
    --image-pull-policy=Always \
    --gpus=1 \
    --working-dir /data/shared/Qwen \
    --data=nas-pvc:/mnt/ \
    --model-name=Qwen-7B-Chat-Lora \
    --model-source=pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora \
    "bash finetune/finetune_lora_single_gpu.sh -m /mnt/models/Qwen-7B-Chat/ -d example.json -o /mnt/finetune/Qwen-7B-Chat-Lora"

The source path of the Qwen-7B-Chat-Lora model is pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora, which means that the model is stored in the volume named nas-pvc in the default namespace, under the path /finetune/Qwen-7B-Chat-Lora.
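The structure of the model source path, pvc://&lt;namespace&gt;/&lt;volume name&gt;/&lt;path inside the volume&gt;, can be illustrated with a short Python sketch. It follows only the layout described in this topic and is not how Arena itself parses the value. The helper names `parse_pvc_source` and `container_path` are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def parse_pvc_source(source):
    """Split pvc://<namespace>/<pvc-name>/<path> into its three parts."""
    parts = urlsplit(source)
    if parts.scheme != "pvc" or not parts.netloc:
        raise ValueError(f"not a pvc:// model source: {source!r}")
    pvc_name, _, sub_path = parts.path.lstrip("/").partition("/")
    if not pvc_name:
        raise ValueError(f"missing PVC name in {source!r}")
    return parts.netloc, pvc_name, "/" + sub_path


def container_path(source, mount_point):
    """Map the model source to the path seen by a pod that mounts the PVC at mount_point."""
    _, _, sub_path = parse_pvc_source(source)
    return posixpath.join(mount_point, sub_path.lstrip("/"))


src = "pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora"
print(parse_pvc_source(src))         # ('default', 'nas-pvc', '/finetune/Qwen-7B-Chat-Lora')
print(container_path(src, "/mnt/"))  # /mnt/finetune/Qwen-7B-Chat-Lora
```

The second result matches the output directory passed to the fine-tuning script with -o, which is why the registered source and the job output refer to the same files.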

Expected output:

pytorchjob.kubeflow.org/qwen-7b-chat-finetune created
INFO[0004] The Job qwen-7b-chat-finetune has been submitted successfully
INFO[0004] You can run `arena get qwen-7b-chat-finetune --type pytorchjob -n default` to check the job status
INFO[0004] registered model "Qwen-7B-Chat-Lora" created
INFO[0005] model version 1 for "Qwen-7B-Chat-Lora" created

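The expected output indicates that the fine-tuning job was created and submitted, and that the model was registered and a model version was created automatically.

Run the following command to view the job details.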

arena get qwen-7b-chat-finetune

Expected output:

Name:          qwen-7b-chat-finetune
Status:        RUNNING
Namespace:     default
Priority:      N/A
Trainer:       PYTORCHJOB
Duration:      2m
CreateTime:    2024-04-29 16:02:01
EndTime:
ModelName:     Qwen-7B-Chat-Lora
ModelVersion:  1
ModelSource:   pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora/

Instances:
  NAME                            STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                            ------   ---  --------  --------------  ----
  qwen-7b-chat-finetune-master-0  Running  2m   true      1               ap-southeast-1.XX.XX.XX.XX

The output records details of the model associated with the job, including the model name, version number, and source path.
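When scripting around the job, the header fields of this output can be read into a dictionary. The following Python sketch assumes that the output keeps the "Key: value" layout shown above and stops at the Instances table. The helper name `parse_job_fields` is hypothetical, and the sample text is copied from the expected output in this step.

```python
import re


def parse_job_fields(output):
    """Collect the top-level 'Key: value' lines printed before the Instances table."""
    fields = {}
    for line in output.splitlines():
        if line.startswith("Instances:"):
            break
        match = re.match(r"^(\w+):\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


sample_output = """\
Name:          qwen-7b-chat-finetune
Status:        RUNNING
Namespace:     default
CreateTime:    2024-04-29 16:02:01
EndTime:
ModelName:     Qwen-7B-Chat-Lora
ModelVersion:  1
ModelSource:   pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora/

Instances:
  NAME                            STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
"""

fields = parse_job_fields(sample_output)
print(fields["Status"], fields["ModelName"], fields["ModelVersion"])
```

In a real script, the text could come from running the `arena get qwen-7b-chat-finetune` command shown above, for example through `subprocess.run` with `capture_output=True`.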

Run the following command to view the logs of the fine-tuning job.

arena logs -f qwen-7b-chat-finetune

Expected output:

+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ mkdir -p /mnt/finetune/Qwen-7B-Chat-Lora
+ python finetune.py --model_name_or_path /mnt/models/Qwen-7B-Chat/ --data_path example.json --bf16 True --output_dir /mnt/finetune/Qwen-7B-Chat-Lora --num_train_epochs 5 --per_device_train_batch_size 2 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy no --save_strategy steps --save_steps 1000 --save_total_limit 10 --learning_rate 3e-4 --weight_decay 0.1 --adam_beta2 0.95 --warmup_ratio 0.01 --lr_scheduler_type cosine --logging_steps 1 --report_to none --model_max_length 512 --lazy_preprocess True --gradient_checkpointing --use_lora
[2024-04-30 02:26:42,358] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
...
Loading checkpoint shards: 100%|██████████| 8/8 [00:02<00:00,  3.29it/s]
/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
  warnings.warn(
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
trainable params: 143,130,624 || all params: 7,864,455,168 || trainable%: 1.8199687192876373
Loading data...
Formatting inputs...Skip in lazy mode
100%|██████████| 20/20 [02:42<00:00,  8.12s/it]
{'loss': 2.6322, 'learning_rate': 0.0003, 'epoch': 0.23}
{'loss': 2.6542, 'learning_rate': 0.00029795419551040833, 'epoch': 0.46}
{'loss': 2.3209, 'learning_rate': 0.00029187258625509513, 'epoch': 0.69}
{'loss': 2.1613, 'learning_rate': 0.00028192106268097334, 'epoch': 0.91}
{'loss': 1.6563, 'learning_rate': 0.00026837107640945905, 'epoch': 1.14}
{'loss': 1.4985, 'learning_rate': 0.00025159223574386114, 'epoch': 1.37}
{'loss': 1.3369, 'learning_rate': 0.00023204222371836405, 'epoch': 1.6}
{'loss': 1.0505, 'learning_rate': 0.0002102543136979454, 'epoch': 1.83}
{'loss': 0.7033, 'learning_rate': 0.00018682282307111987, 'epoch': 2.06}
{'loss': 0.5576, 'learning_rate': 0.00016238690182084986, 'epoch': 2.29}
{'loss': 0.2523, 'learning_rate': 0.00013761309817915014, 'epoch': 2.51}
{'loss': 0.2481, 'learning_rate': 0.00011317717692888012, 'epoch': 2.74}
{'loss': 0.1343, 'learning_rate': 8.97456863020546e-05, 'epoch': 2.97}
{'loss': 0.0676, 'learning_rate': 6.795777628163599e-05, 'epoch': 3.2}
{'loss': 0.0489, 'learning_rate': 4.840776425613886e-05, 'epoch': 3.43}
{'loss': 0.0312, 'learning_rate': 3.162892359054098e-05, 'epoch': 3.66}
{'loss': 0.018, 'learning_rate': 1.8078937319026654e-05, 'epoch': 3.89}
{'loss': 0.0134, 'learning_rate': 8.127413744904804e-06, 'epoch': 4.11}
{'loss': 0.0141, 'learning_rate': 2.0458044895916513e-06, 'epoch': 4.34}
{'loss': 0.0099, 'learning_rate': 0.0, 'epoch': 4.57}
{'train_runtime': 162.4618, 'train_samples_per_second': 2.154, 'train_steps_per_second': 0.123, 'train_loss': 0.8704732102807611, 'epoch': 4.57}
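The learning rates in this log are consistent with a cosine decay schedule that follows a short linear warmup, as selected by --lr_scheduler_type cosine, --learning_rate 3e-4, and --warmup_ratio 0.01. The following Python sketch recomputes a few of them. It assumes 20 optimizer steps (the 20/20 shown by the progress bar), a single warmup step (0.01 × 20 rounded up), and the common cosine-with-warmup formula. The function name `cosine_lr` is hypothetical.

```python
import math

BASE_LR = 3e-4     # --learning_rate
TOTAL_STEPS = 20   # optimizer steps shown by the progress bar (20/20)
WARMUP_STEPS = math.ceil(0.01 * TOTAL_STEPS)  # --warmup_ratio 0.01, assumed to round up to 1


def cosine_lr(step):
    """Learning rate after `step` optimizer steps: linear warmup, then half-cosine decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Compare with the 1st, 2nd, 10th, and 20th 'learning_rate' entries in the log.
for step in (1, 2, 10, 20):
    print(step, cosine_lr(step))

# Samples consumed per optimizer step on one GPU:
# --per_device_train_batch_size 2 x --gradient_accumulation_steps 8
print("samples per optimizer step:", 2 * 8)

# Share of parameters trained by LoRA, from the 'trainable params' log line.
print("trainable %:", 100 * 143_130_624 / 7_864_455_168)
```

Under these assumptions, the computed values line up with the logged learning rates, and the schedule reaches 0 at the final step. The last line reproduces the roughly 1.82% of parameters that LoRA trains.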

Step 5: Use Arena to access the model registry

Run the following command to use Arena to list all registered models.

arena model list

Expected output:

NAME                 LATEST_VERSION       LAST_UPDATED_TIME
Qwen-7B-Chat-Lora    1                    2024-04-30T10:26:14+08:00

Run the following command to view the details of version 1 of the registered model Qwen-7B-Chat-Lora.

arena model get \
    --name Qwen-7B-Chat-Lora \
    --version 1

Expected output:

Name:                Qwen-7B-Chat-Lora
Version:             1
CreationTime:        2024-04-30T10:26:14+08:00
LastUpdatedTime:     2024-04-30T10:26:14+08:00
Source:              pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora
Description:
  arena submit pytorchjob \
      --data nas-pvc:/mnt/ \
      --gpus 1 \
      --image kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
      --image-pull-policy Always \
      --model-name Qwen-7B-Chat-Lora \
      --model-source pvc://default/nas-pvc/finetune/Qwen-7B-Chat-Lora \
      --name qwen-7b-chat-finetune \
      --namespace default \
      --working-dir /data/shared/Qwen \
      "bash finetune/finetune_lora_single_gpu.sh -m /mnt/models/Qwen-7B-Chat/ -d example.json -o /mnt/finetune/Qwen-7B-Chat-Lora"
Tags:
  createdBy: arena
  modelName: Qwen-7B-Chat-Lora
  arena.kubeflow.org/uid: 3399d840e8b371ed7ca45dda29debeb1

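The output shows that Arena automatically adds the complete job submission command to the description and attaches the corresponding tags.

Step 6: Use Arena to deploy an inference service for the fine-tuned model

Run the following command to run an inference service for the fine-tuned Qwen-7B-Chat model.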

arena serve custom \
    --name=qwen-7b-chat-lora \
    --namespace=default \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/qwen:cu117 \
    --image-pull-policy=Always \
    --gpus=1 \
    --data=nas-pvc:/mnt \
    --restful-port=80 \
    --model-name=Qwen-7B-Chat-Lora \
    --model-version=1 \
    "python web_demo_peft.py --server-port 80 --server-name 0.0.0.0 -c /mnt/finetune/Qwen-7B-Chat-Lora"

Expected output:

service/qwen-7b-chat-lora-202404301036 created
deployment.apps/qwen-7b-chat-lora-202404301036-custom-serving created
INFO[0003] The Job qwen-7b-chat-lora has been submitted successfully
INFO[0003] You can run `arena serve get qwen-7b-chat-lora --type custom-serving -n default` to check the job status

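The output indicates that the inference service for the fine-tuned model is deployed.

Run the following command to view the logs of the inference service.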

arena serve logs -f qwen-7b-chat-lora

Expected output:

The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Loading checkpoint shards: 100%|██████████| 8/8 [03:10<00:00, 23.76s/it]

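After the model is loaded, run the following command to forward its web port to local port 8901 so that you can access it.
Important
Port forwarding set up with kubectl port-forward is suitable only for quick verification in test environments. It is not suitable for production environments. Be aware of the security risks when you use it.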
```
kubectl port-forward services/qwen-7b-chat-lora-202404301036 8901:80
```
Expected output:
```
Forwarding from 127.0.0.1:8901 -> 80
Forwarding from [::1]:8901 -> 80
```
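Enter http://localhost:8901 in the address bar of your browser, and then ask the same question as in Step 2 (before fine-tuning).
The following sample shows a simple conversation in which the user asks a question and the assistant replies. Q indicates user input and A indicates model output: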
```
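Q: After a cluster is created, can I change the container runtime?
A: After a cluster is created, you cannot switch the container runtime, but you can create node pools that use different runtime types. Different node pools can use different runtimes. For more information, see Node pool overview.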
```
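Compare this answer with the fine-tuning data built into the container image. After fine-tuning, the assistant gives the specific answer that the dataset provides for this question, whereas the model in Step 2 gave a generic answer that contradicts it.
Run the following command to delete the inference service for the fine-tuned Qwen-7B-Chat model.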
```
arena serve delete qwen-7b-chat-lora
```
