KServe, formerly known as KFServing, is a model serving and inference platform for cloud-native environments. It supports automatic scaling, scale-to-zero, and canary deployments. This topic describes how to use Alibaba Cloud Service Mesh (ASM) and Arena to deploy a KServe model as an inference service in Serverless mode.
Prerequisites
Before you begin, ensure that you have:
- A Container Service for Kubernetes (ACK) cluster with GPU-accelerated nodes.
- An ASM instance of version 1.17.2.7 or later. For more information, see Create an ASM instance or Upgrade an ASM instance.
- The KServe component installed. For more information, see Integrate KServe with ASM to implement inference services based on cloud-native AI models.
- An Arena client of version 0.9.11 or later. For more information, see Configure the Arena client.
Step 1: Prepare model data
Use NAS or Object Storage Service (OSS) to store model data. For more information, see Mount a statically provisioned NAS volume and Use an ossfs 1.0 statically provisioned volume. This topic uses NAS as an example.
1.1 Get the NAS mount target
- Log on to the File Storage NAS console. In the left-side navigation pane, choose File System > File System List. In the upper part of the page, select the region where the NAS file system resides.
- On the File System List page, click the ID of the file system that you want to manage. On the details page, click Mount Targets. Move the pointer over the icon in the Mount Target column to view the mount target. Record the mount target and the mount command for later use.
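The mount command copied from the console typically follows the pattern below. This is a sketch: the file system domain name is a placeholder that you replace with the mount target recorded above.

```shell
# Hypothetical example; replace the mount target domain name with the one
# recorded from the NAS console. The options follow the NFSv3 settings
# that NAS recommends for mounting.
sudo mkdir -p /mnt
sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport \
    xxxxxxxxxx.cn-beijing.nas.aliyuncs.com:/ /mnt
```

The same command is reused later when you log on to a node to download the model data.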
1.2 Configure a PV and a PVC
- Log on to the ACK console. In the left-side navigation pane, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Volumes > Persistent Volumes.
- In the upper-right corner of the Persistent Volumes page, click Create.
- In the Create PV dialog box, configure the following parameters and click Create to create a persistent volume (PV) named training-data. For other parameters, see Mount a statically provisioned NAS volume.

  | Parameter | Value |
  | --- | --- |
  | PV Type | NAS |
  | Volume Name | training-data |
  | Mount Target Domain Name | Select the mount target from step 1.1 |

- In the left-side navigation pane, choose Volumes > Persistent Volume Claims. In the upper-right corner of the Persistent Volume Claims page, click Create.
- In the Create PVC dialog box, configure the following parameters and click Create to create a persistent volume claim (PVC) named training-data. For other parameters, see Mount a statically provisioned NAS volume.

  | Parameter | Value |
  | --- | --- |
  | PVC Type | NAS |
  | Name | training-data |
  | Allocation Mode | Existing Volumes |
  | Existing Volumes | Click Select PV and select the PV that you created |
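If you prefer kubectl to the console, the console steps above correspond roughly to manifests like the following. This is a sketch, not the exact objects the console creates: the NAS server address is a placeholder, and the CSI driver fields follow the ACK NAS static-provisioning convention, so verify them against the linked NAS volume topic.

```yaml
# Hypothetical equivalent of the console steps; field values are assumptions.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: training-data
    volumeAttributes:
      server: "xxxxxxxxxx.cn-beijing.nas.aliyuncs.com"  # your mount target
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: training-data  # bind to the PV above instead of dynamic provisioning
```

Setting volumeName pins the claim to the statically provisioned PV, which mirrors the Existing Volumes allocation mode in the console.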
1.3 Download model data to the NAS file system
- Log on to the ACK console. In the left-side navigation pane, click Clusters.
- On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Nodes > Nodes.
- On the Nodes page, click the instance ID of the node that you want to manage. On the Instance Details page, choose More > Workbench Remote Access, and then click Log in.
- Run the mount command from step 1.1 to mount the NAS file system.
- Download the bloom-560m model from Hugging Face and store it in the PVC at the path pvc://training-data/bloom-560m.
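One way to fetch the model is with git and Git LFS. This is a sketch that assumes both tools are installed on the node, that the node has Internet access to Hugging Face, and that the NAS file system is mounted at /mnt (so that /mnt/bloom-560m corresponds to pvc://training-data/bloom-560m).

```shell
# Assumes git and git-lfs are installed and the NAS file system is mounted
# at /mnt. The cloned directory corresponds to pvc://training-data/bloom-560m.
cd /mnt
git lfs install
git clone https://huggingface.co/bigscience/bloom-560m
```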
Step 2: Deploy the inference service
- Query the GPU resources available in the cluster:

  ```shell
  arena top node
  ```

  Expected output:

  ```
  NAME                      IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
  cn-beijing.192.1xx.x.xx   192.1xx.x.xx  <none>  Ready   0           0
  cn-beijing.192.1xx.x.xx   192.1xx.x.xx  <none>  Ready   0           0
  cn-beijing.192.1xx.x.xx   192.1xx.x.xx  <none>  Ready   0           0
  cn-beijing.192.1xx.x.xx   192.1xx.x.xx  <none>  Ready   1           0
  cn-beijing.192.1xx.x.xx   192.1xx.x.xx  <none>  Ready   1           0
  ---------------------------------------------------------------------------------------------------
  Allocated/Total GPUs In Cluster: 0/2 (0.0%)
  ```

  The output shows that two GPU-accelerated nodes are available.

- Start the inference service named bloom-560m. The PVC training-data is mounted to /mnt/models in the container, which is where the model was downloaded in step 1.3.

  ```shell
  arena serve kserve \
      --name=bloom-560m \
      --image=ghcr.io/huggingface/text-generation-inference:1.0.2 \
      --gpus=1 \
      --cpu=6 \
      --memory=20Gi \
      --port=8080 \
      --env=STORAGE_URI=pvc://training-data \
      "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m --num-shard 1 -p 8080"
  ```

  Expected output:

  ```
  inferenceservice.serving.kserve.io/bloom-560m created
  INFO[0013] The Job bloom-560m has been submitted successfully
  INFO[0013] You can run `arena serve get bloom-560m --type kserve -n default` to check the job status
  ```

  The following table describes the parameters.

  | Parameter | Required | Description |
  | --- | --- | --- |
  | --name | Yes | The name of the submitted job. Must be globally unique. |
  | --image | Yes | The container image address for the inference service. |
  | --gpus | No | The number of GPUs required. Default value: 0. |
  | --cpu | No | The number of CPUs required. |
  | --memory | No | The amount of memory required. |
  | --port | No | The port exposed for external access. |
  | --env | No | Environment variables as key-value pairs. In this example, STORAGE_URI is set to point to the PVC that stores the model. |
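Under the hood, arena serve kserve submits a KServe InferenceService object. A manifest along these lines would describe the same service; this is illustrative only, since the exact object that Arena generates may differ, and the STORAGE_URI environment variable is what tells KServe's storage initializer where to find the model.

```yaml
# Illustrative sketch; inspect the real object with:
#   kubectl get inferenceservice bloom-560m -o yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bloom-560m
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ghcr.io/huggingface/text-generation-inference:1.0.2
        args: ["text-generation-launcher", "--disable-custom-kernels",
               "--model-id", "/mnt/models/bloom-560m", "--num-shard", "1",
               "-p", "8080"]
        env:
          - name: STORAGE_URI       # KServe mounts pvc://training-data at /mnt/models
            value: pvc://training-data
        ports:
          - containerPort: 8080
        resources:
          limits:
            cpu: "6"
            memory: 20Gi
            nvidia.com/gpu: "1"
```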
Step 3: Verify the inference service
- Check the deployment status of the KServe inference service:

  ```shell
  arena serve get bloom-560m
  ```

  Expected output:

  ```
  Name:       bloom-560m
  Namespace:  default
  Type:       KServe
  Version:    00001
  Desired:    1
  Available:  1
  Age:        9m
  Address:    http://bloom-560m.default.example.com
  Port:       :80
  GPU:        1

  LatestRevision:     bloom-560m-predictor-00001
  LatestPrecent:      100

  Instances:
    NAME                                                   STATUS   AGE  READY  RESTARTS  GPU  NODE
    ----                                                   ------   ---  -----  --------  ---  ----
    bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp  Running  9m   2/2    0         1    cn-hongkong.192.1xx.x.xxx
  ```

  The output confirms that the inference service is deployed. The model access address is http://bloom-560m.default.example.com.

- Get the IP address of the ASM ingress gateway. For more information, see Step 2: Obtain the IP address of the ASM ingress gateway.
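If the linked topic is not at hand, the gateway address can usually be read from the ingress gateway Service in the istio-system namespace. This is a sketch that assumes the default Service name istio-ingressgateway; your ASM instance may use a different name.

```shell
# Assumes the ASM ingress gateway Service is named istio-ingressgateway
# in the istio-system namespace; adjust the name if yours differs.
ASM_GATEWAY=$(kubectl -n istio-system get service istio-ingressgateway \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "${ASM_GATEWAY}"
```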
- Send a test request to the inference service by using the IP address of the ASM ingress gateway:

  ```shell
  # Replace ${ASM_GATEWAY} with the IP address of the ASM ingress gateway.
  curl -H "Host: bloom-560m.default.example.com" http://${ASM_GATEWAY}:80/generate \
      -X POST \
      -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
      -H 'Content-Type: application/json'
  ```

  Expected output:

  ```
  {"generated_text":" Deep Learning is a new type of machine learning that is used to solve complex problems."}
  ```
Step 4: Update the inference service
Use canary deployment to gradually shift traffic to a new model version before full cutover.
- Create a new version of the model and copy the model files in the PVC to the new path bloom-560m-v2.

- Deploy the new version and route 10% of traffic to it:

  ```shell
  arena serve update kserve \
      --name bloom-560m \
      --canary-traffic-percent=10 \
      "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m-v2 --num-shard 1 -p 8080"
  ```

- Check the status of the inference service:

  ```shell
  arena serve get bloom-560m
  ```

  Expected output:

  ```
  Name:       bloom-560m
  Namespace:  default
  Type:       KServe
  Version:    00002
  Desired:    2
  Available:  2
  Age:        36m
  Address:    http://bloom-560m.default.example.com
  Port:       :80
  GPU:        2

  LatestRevision:     bloom-560m-predictor-00002
  LatestPrecent:      10
  PrevRevision:       bloom-560m-predictor-00001
  PrevPrecent:        90

  Instances:
    NAME                                                    STATUS   AGE  READY  RESTARTS  GPU  NODE
    ----                                                    ------   ---  -----  --------  ---  ----
    bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp   Running  36m  2/2    0         1    cn-hongkong.192.1xx.x.xxx
    bloom-560m-predictor-00002-deployment-5b7bb66cfb-nqprp  Running  6m   2/2    0         1    cn-hongkong.192.1xx.x.xxx
  ```

  The output shows that 10% of traffic goes to bloom-560m-predictor-00002 (the new version) and 90% to bloom-560m-predictor-00001 (the previous version).

- After the new version passes testing, shift all traffic to it:

  ```shell
  arena serve update kserve \
      --name bloom-560m \
      --canary-traffic-percent=100
  ```
(Optional) Step 5: Delete the inference service
Delete the inference service to release all GPU and compute resources:
```shell
arena serve delete bloom-560m
```