Container Service for Kubernetes:Deploy a KServe model inference service in serverless mode

Last Updated: Mar 26, 2026

KServe, formerly known as KFServing, is a model serving and inference engine for cloud-native environments. It supports automatic scaling, scale-to-zero, and canary deployments. This topic describes how to use Service Mesh (ASM) and Arena to deploy a KServe model as an inference service in serverless mode.

Prerequisites

Before you begin, ensure that you have an ACK cluster with Service Mesh (ASM) and KServe configured, and the Arena client installed.

Step 1: Prepare model data

Use NAS or Object Storage Service (OSS) to store model data. For more information, see Mount a statically provisioned NAS volume and Use an ossfs 1.0 statically provisioned volume. This topic uses NAS as an example.

1.1 Get the NAS mount target

  1. Log on to the File Storage NAS console. In the left-side navigation pane, choose File System > File System List. In the upper part of the page, select the region where the NAS file system resides.

  2. On the File System List page, click the ID of the file system you want to manage. On the details page, click Mount Targets. Move the pointer over the icon in the Mount Target column to view the mount target. Record the mount target and the mount command for later use.
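The mount command recorded from the console is typically an NFS mount of the following shape. The mount target domain name below is a placeholder for illustration, not a value from this topic; substitute the one you recorded:

```shell
# Sketch of a typical NAS mount command, assuming NFSv3 with the options
# the NAS console commonly suggests. Replace the placeholder mount target
# with the one recorded from the console.
mkdir -p /mnt/nas
sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport \
    xxxxxx.cn-beijing.nas.aliyuncs.com:/ /mnt/nas
```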

1.2 Configure a PV and a PVC

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Volumes > Persistent Volumes.

  3. In the upper-right corner of the Persistent Volumes page, click Create.

  4. In the Create PV dialog box, configure the following parameters and click Create to create a persistent volume (PV) named training-data. For other parameters, see Mount a statically provisioned NAS volume.

    Parameter                   Value
    ------------------------    --------------------------------------------
    PV Type                     NAS
    Volume Name                 training-data
    Mount Target Domain Name    Select the mount target recorded in step 1.1
  5. In the left navigation pane, choose Volumes > Persistent Volume Claims. On the Persistent Volume Claims page, click Create in the upper-right corner.

  6. In the Create PVC dialog box, configure the following parameters and click Create to create a persistent volume claim (PVC) named training-data. For other parameters, see Mount a statically provisioned NAS volume.

    Parameter           Value
    ----------------    --------------------------------------------
    PVC Type            NAS
    Name                training-data
    Allocation Mode     Existing Volumes
    Existing Volumes    Click Select PV and select the PV you created
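If you prefer the command line over the console, the same PV and PVC can be sketched as manifests applied with kubectl. The mount target domain name and the 20Gi capacity below are placeholder assumptions, not values from this topic:

```shell
# Sketch only: a statically provisioned NAS PV and a PVC bound to it.
# Replace the placeholder mount target with the one recorded in step 1.1.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data
spec:
  capacity:
    storage: 20Gi                # placeholder capacity
  accessModes: ["ReadWriteMany"]
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: training-data
    volumeAttributes:
      server: "xxxxxx.cn-beijing.nas.aliyuncs.com"   # placeholder mount target
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 20Gi
  volumeName: training-data      # bind to the PV above
EOF
```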

1.3 Download model data to the NAS file system

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster you want. In the left navigation pane, choose Nodes > Nodes.

  3. On the Nodes page, click the instance ID of the node you want to manage. On the Instance Details page, click More > Workbench Remote Access, then click Log in.

  4. Run the mount command from step 1.1 to mount the NAS file system.

  5. Download the bloom-560m model from Hugging Face and store it in the PVC at the path pvc://training-data/bloom-560m.
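One way to download the model is a Git LFS clone from Hugging Face into the mounted file system. This sketch assumes git and git-lfs are installed on the node and that the NAS file system from the previous step is mounted at /mnt/nas (an assumed mount path):

```shell
# Clone the bloom-560m repository from Hugging Face onto the NAS file system.
cd /mnt/nas
git lfs install
git clone https://huggingface.co/bigscience/bloom-560m
# The model files are now visible to the cluster as pvc://training-data/bloom-560m.
```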

Step 2: Deploy the inference service

  1. Query the GPU resources available in the cluster:

    arena top node

    Expected output:

    NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    0/2 (0.0%)

    The output shows that two GPU-accelerated nodes are available, each with one unallocated GPU.

  2. Start the inference service named bloom-560m. The PVC training-data is mounted to /mnt/models in the container, which is where the model was downloaded in step 1.3.

    arena serve kserve \
        --name=bloom-560m \
        --image=ghcr.io/huggingface/text-generation-inference:1.0.2 \
        --gpus=1 \
        --cpu=6 \
        --memory=20Gi \
        --port=8080 \
        --env=STORAGE_URI=pvc://training-data \
        "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m --num-shard 1 -p 8080"

    Expected output:

    inferenceservice.serving.kserve.io/bloom-560m created
    INFO[0013] The Job bloom-560m has been submitted successfully
    INFO[0013] You can run `arena serve get bloom-560m --type kserve -n default` to check the job status

    The following table describes the parameters.

    Parameter    Required    Description
    ---------    --------    ------------------------------------------------------------
    --name       Yes         The name of the submitted job. Must be globally unique.
    --image      Yes         The container image address for the inference service.
    --gpus       No          The number of GPUs required. Default value: 0.
    --cpu        No          The number of CPUs required.
    --memory     No          The amount of memory required.
    --port       No          The port exposed for external access.
    --env        No          Environment variables as key-value pairs. In this example,
                             STORAGE_URI points to the PVC that stores the model.

Step 3: Verify the inference service

  1. Check the deployment status of the KServe inference service:

    arena serve get bloom-560m

    Expected output:

    Name:       bloom-560m
    Namespace:  default
    Type:       KServe
    Version:    00001
    Desired:    1
    Available:  1
    Age:        9m
    Address:    http://bloom-560m.default.example.com
    Port:       :80
    GPU:        1
    
    LatestRevision:     bloom-560m-predictor-00001
    LatestPrecent:      100
    
    Instances:
      NAME                                                   STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                   ------   ---  -----  --------  ---  ----
      bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp  Running  9m   2/2    0         1    cn-hongkong.192.1xx.x.xxx

    The output confirms the inference service is deployed. The model access address is http://bloom-560m.default.example.com.
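Under the hood, Arena creates a standard KServe InferenceService, so the same deployment can also be inspected with kubectl. This is an optional cross-check, not a step from this topic:

```shell
# List the InferenceService that arena serve kserve created.
kubectl get inferenceservice bloom-560m -n default
# Show the predictor pods backing it, selected by the standard KServe label.
kubectl get pods -n default -l serving.kserve.io/inferenceservice=bloom-560m
```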

  2. Get the IP address of the ASM ingress gateway. For more information, see Step 2: Obtain the IP address of the ASM ingress gateway.

  3. Send a test request to the inference service using the ASM ingress gateway IP address:

    # Replace ${ASM_GATEWAY} with the IP address of the ASM ingress gateway.
    curl -H "Host: bloom-560m.default.example.com" http://${ASM_GATEWAY}:80/generate \
        -X POST \
        -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
        -H 'Content-Type: application/json'

    Expected output:

    {"generated_text":" Deep Learning is a new type of machine learning that is used to solve complex problems."}

Step 4: Update the inference service

Use canary deployment to gradually shift traffic to a new model version before full cutover.

  1. Prepare a new model version: copy the model files in the PVC to a new path, bloom-560m-v2.

  2. Deploy the new version with 10% of traffic routed to it:

    arena serve update kserve \
        --name bloom-560m \
        --canary-traffic-percent=10 \
        "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m-v2 --num-shard 1 -p 8080"
  3. Check the status of the inference service:

    arena serve get bloom-560m

    Expected output:

    Name:       bloom-560m
    Namespace:  default
    Type:       KServe
    Version:    00002
    Desired:    2
    Available:  2
    Age:        36m
    Address:    http://bloom-560m.default.example.com
    Port:       :80
    GPU:        2
    
    LatestRevision:     bloom-560m-predictor-00002
    LatestPrecent:      10
    PrevRevision:       bloom-560m-predictor-00001
    PrevPrecent:        90
    
    Instances:
      NAME                                                    STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                    ------   ---  -----  --------  ---  ----
      bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp   Running  36m  2/2    0         1    cn-hongkong.192.1xx.x.xxx
      bloom-560m-predictor-00002-deployment-5b7bb66cfb-nqprp  Running  6m   2/2    0         1    cn-hongkong.192.1xx.x.xxx

    The output shows 10% of traffic goes to bloom-560m-predictor-00002 (new version) and 90% to bloom-560m-predictor-00001 (previous version).
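One rough way to observe the split is to repeat the request from step 3 many times and compare request counts across the two predictor deployments. The deployment names below are taken from the instance list above; treat the log-counting step as a sketch, since the exact log format depends on the serving image:

```shell
# Send 20 requests through the ASM gateway; with a 10% canary weight,
# roughly 2 of them should land on revision 00002.
# ${ASM_GATEWAY} is the gateway IP address obtained in step 3.
for i in $(seq 1 20); do
  curl -s -X POST "http://${ASM_GATEWAY}:80/generate" \
      -H "Host: bloom-560m.default.example.com" \
      -H 'Content-Type: application/json' \
      -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
      > /dev/null
done
# Compare how many requests each revision's pods logged (log format may vary).
kubectl logs deploy/bloom-560m-predictor-00002-deployment -n default | grep -c generate
```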

  4. After the new version passes testing, shift all traffic to it:

    arena serve update kserve \
        --name bloom-560m \
        --canary-traffic-percent=100

(Optional) Step 5: Delete the inference service

Delete the inference service to release all GPU and compute resources:

arena serve delete bloom-560m