KServe, formerly known as KFServing, is a model serving and inference platform for cloud-native environments. It supports automatic scaling, scale-to-zero, and canary deployments. This topic describes how to use Service Mesh (ASM) and Arena to deploy a KServe model as an inference service in Serverless mode.
Prerequisites
A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created.
An ASM instance whose version is 1.17.2.7 or later is created. For more information, see Create an ASM instance or Update an ASM instance.
The KServe component is installed. For more information, see Integrate KServe with ASM to implement inference services based on cloud-native AI models.
An Arena client whose version is 0.9.11 or later is installed. For more information, see Configure the Arena client.
Step 1: Prepare model data
You can use File Storage NAS (NAS) or Object Storage Service (OSS) to prepare model data. For more information, see Mount a statically provisioned NAS volume and Mount a statically provisioned OSS volume. This topic uses NAS as an example to describe how to prepare model data.
1. Obtain the mount target of the NAS file system.
Log on to the File Storage NAS console. In the left-side navigation pane, choose File System > File System List. In the upper part of the page, select the region where the NAS file system resides.
On the File System List page, click the ID of the file system that you want to manage. On the details page, click Mount Targets. In the Mount Target column, move the pointer over the mount target to view its domain name. Record the mount target and the mount command for subsequent operations.
2. Configure a PV and a PVC for the cluster
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Volumes > Persistent Volumes.
In the upper-right corner of the Persistent Volumes page, click Create.
In the Create PV dialog box, configure the parameters and click Create to create a persistent volume (PV) named training-data.
The following parameters are required. You can configure other parameters based on your business requirements. For more information, see Mount a statically provisioned NAS volume.
PV Type: NAS
Volume Name: training-data
Mount Target Domain Name: Select the mount target that you obtained in Step 1.
In the left-side navigation pane, choose Volumes > Persistent Volume Claims. On the Persistent Volume Claims page, click Create in the upper-right corner.
In the Create PVC dialog box, configure the parameters and click Create to create a persistent volume claim (PVC) named training-data.
The following parameters are required. You can configure other parameters based on your business requirements. For more information, see Mount a statically provisioned NAS volume.
PVC Type: NAS
Name: training-data
Allocation Mode: In this example, Existing Volumes is selected.
Existing Volumes: Click Select PV to select the PV that you created.
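If you prefer the CLI over the console, the following is a minimal sketch of equivalent PV and PVC manifests applied with kubectl. The capacity, labels, and CSI settings are assumptions based on a typical statically provisioned NAS volume on ACK; replace <your-nas-mount-target> with the mount target that you obtained in Step 1.
# A minimal sketch, not the console-generated configuration. Adjust capacity and
# mount settings to your environment, and replace <your-nas-mount-target>.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data
  labels:
    alicloud-pvname: training-data
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: training-data
    volumeAttributes:
      server: "<your-nas-mount-target>"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: training-data
EOF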
3. Download data to the NAS file system
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Nodes.
On the Nodes page, click the instance ID of the node that you want to manage. On the Instance Details page, click Connect and then click Sign in now.
Run the mount command obtained in Step 1 to mount the NAS file system.
Download the BLOOM model and training data.
Download the bloom-560m model from Hugging Face and store the model in the PVC. The path is pvc://training-data/bloom-560m.
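For reference, the following is a minimal sketch of how the model can be downloaded on the node after the NAS file system is mounted. The mount options and the local mount path /mnt/nas are assumptions; use the mount command that you recorded in Step 1, and make sure that git-lfs is installed on the node.
# A minimal sketch: mount the NAS file system and download bloom-560m into its root directory,
# which corresponds to pvc://training-data/bloom-560m.
# Replace <your-nas-mount-target> with the mount target recorded in Step 1.
sudo mkdir -p /mnt/nas
sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport <your-nas-mount-target>:/ /mnt/nas
cd /mnt/nas
git lfs install
git clone https://huggingface.co/bigscience/bloom-560m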
Step 2: Deploy the inference service
Run the following command to query the GPU resources available in the cluster:
arena top node
Expected output:
NAME                      IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster: 0/2 (0.0%)
The expected output shows that two nodes providing GPU resources are available to run the inference service.
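If you want to cross-check the result with kubectl, the following sketch lists the allocatable GPUs per node. It assumes that GPUs are exposed through the standard nvidia.com/gpu resource.
# List allocatable GPUs per node (nodes without GPUs show <none>).
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'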
Run the following command to start the inference service named bloom-560m:
The PVC training-data is mounted to the /mnt/models directory in the container, so the model that you downloaded in Step 1 is available at /mnt/models/bloom-560m.
arena serve kserve \
  --name=bloom-560m \
  --image=ghcr.io/huggingface/text-generation-inference:1.0.2 \
  --gpus=1 \
  --cpu=6 \
  --memory=20Gi \
  --port=8080 \
  --env=STORAGE_URI=pvc://training-data \
  "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m --num-shard 1 -p 8080"
Expected output:
inferenceservice.serving.kserve.io/bloom-560m created
INFO[0013] The Job bloom-560m has been submitted successfully
INFO[0013] You can run `arena serve get bloom-560m --type kserve -n default` to check the job status
The following parameters are used in the command.
--name (required): The name of the submitted job. The name must be globally unique and cannot be duplicated.
--image (required): The image address of the inference service.
--gpus (optional): The number of GPUs required by the inference service. Default value: 0.
--cpu (optional): The number of CPUs required by the inference service.
--memory (optional): The amount of memory required by the inference service.
--port (optional): The port of the inference service that is exposed for external access.
--env (optional): The environment variables of the inference service. In this example, the PVC named training-data is specified to store the model.
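Arena creates a KServe InferenceService resource named after the job, as shown in the expected output above. If you want to inspect it directly, you can use kubectl. The following commands are a sketch and assume that the service runs in the default namespace.
# Inspect the InferenceService that Arena created.
kubectl get inferenceservice bloom-560m -n default
kubectl describe inferenceservice bloom-560m -n default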
Step 3: Verify the inference service
Run the following command to view the deployment progress of the KServe inference service:
arena serve get bloom-560m
Expected output:
Name:           bloom-560m
Namespace:      default
Type:           KServe
Version:        00001
Desired:        1
Available:      1
Age:            9m
Address:        http://bloom-560m.default.example.com
Port:           :80
GPU:            1

LatestRevision:     bloom-560m-predictor-00001
LatestPrecent:      100

Instances:
  NAME                                                    STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                    ------   ---  -----  --------  ---  ----
  bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp   Running  9m   2/2    0         1    cn-hongkong.192.1xx.x.xxx
The expected output indicates that the KServe inference service is deployed and that the model access address is http://bloom-560m.default.example.com.
Obtain the IP address of the ASM ingress gateway. For more information, see Step 2: Obtain the IP address of the ASM ingress gateway.
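If your ASM ingress gateway is exposed through a LoadBalancer Service named istio-ingressgateway in the istio-system namespace (a common default; adjust the names to your deployment), you can look up its IP address with a command similar to the following sketch:
# Look up the external IP address of the ASM ingress gateway and store it for later use.
ASM_GATEWAY=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo ${ASM_GATEWAY}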
Run the following command to access the inference service by using the obtained IP address of the ASM ingress gateway:
# Replace ${ASM_GATEWAY} with the IP address of the ASM ingress gateway.
curl -H "Host: bloom-560m.default.example.com" http://${ASM_GATEWAY}:80/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
  -H 'Content-Type: application/json'
Expected output:
{"generated_text":" Deep Learning is a new type of machine learning that is used to solve complex problems."}
Step 4: Update the inference service
Create a new version of the model and copy the model in the PVC to the new path bloom-560m-v2.
Run the following command to perform a canary release for the KServe inference service. Set the new model path to bloom-560m-v2, distribute 10% of the traffic to the new version of the inference service, and distribute 90% of the traffic to the old version.
arena serve update kserve \
  --name bloom-560m \
  --canary-traffic-percent=10 \
  "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m-v2 --num-shard 1 -p 8080"
Run the following command to view the status of the inference service:
arena serve get bloom-560m
Expected output:
Name:           bloom-560m
Namespace:      default
Type:           KServe
Version:        00002
Desired:        2
Available:      2
Age:            36m
Address:        http://bloom-560m.default.example.com
Port:           :80
GPU:            2

LatestRevision:     bloom-560m-predictor-00002
LatestPrecent:      10
PrevRevision:       bloom-560m-predictor-00001
PrevPrecent:        90

Instances:
  NAME                                                     STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                     ------   ---  -----  --------  ---  ----
  bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp    Running  36m  2/2    0         1    cn-hongkong.192.1xx.x.xxx
  bloom-560m-predictor-00002-deployment-5b7bb66cfb-nqprp   Running  6m   2/2    0         1    cn-hongkong.192.1xx.x.xxx
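To test the canary release, you can send repeated requests through the ASM ingress gateway. With a 90/10 split, roughly 1 in 10 requests is routed to revision 00002; note that the response itself does not identify the revision unless the two model versions produce different output. The loop below is a sketch based on the request used in Step 3.
# Send 20 requests to exercise the 90/10 traffic split between the two revisions.
for i in $(seq 1 20); do
  curl -s -H "Host: bloom-560m.default.example.com" http://${ASM_GATEWAY}:80/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'
  echo
done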
After the new version of the inference service passes the test, run the following command to set the canary-traffic-percent parameter to 100 to forward all traffic to the new version of the inference service:
arena serve update kserve \
  --name bloom-560m \
  --canary-traffic-percent=100
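You can then confirm the switch by querying the service again; LatestPrecent should be 100.
# Confirm that all traffic is now routed to the new revision.
arena serve get bloom-560m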
(Optional) Step 5: Delete the inference service
Run the following command to delete the inference service:
arena serve delete bloom-560m
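To confirm that the inference service is removed, you can list the remaining serving jobs; the deleted service should no longer appear in the output.
# List the serving jobs that remain in the cluster.
arena serve list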