Container Service for Kubernetes: Deploy a KServe model as an inference service in Serverless mode

Last Updated: Nov 01, 2024

KServe, formerly known as KFServing, is a model serving and inference engine for cloud-native environments. It supports automatic scaling, scale-to-zero, and canary releases. This topic describes how to use Service Mesh (ASM) and Arena to deploy a KServe model as an inference service in Serverless mode.

Prerequisites

Step 1: Prepare model data

You can use File Storage NAS (NAS) or Object Storage Service (OSS) to prepare model data. For more information, see Mount a statically provisioned NAS volume and Mount a statically provisioned OSS volume. This topic uses NAS as an example to describe how to prepare model data.

1. Obtain the mount target of the NAS file system.

  1. Log on to the File Storage NAS console. In the left-side navigation pane, choose File System > File System List. In the upper part of the page, select the region where the NAS file system resides.

  2. On the File System List page, click the ID of the file system that you want to manage. On the details page, click Mount Targets. Move the pointer over the icon in the Mount Target column to view the mount target of the NAS file system. Record the mount target and the mount command for subsequent operations.

2. Configure a PV and a PVC for the cluster. (A kubectl-based sketch of equivalent PV and PVC objects appears at the end of this step.)

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Volumes > Persistent Volumes.

  3. In the upper-right corner of the Persistent Volumes page, click Create.

  4. In the Create PV dialog box, configure the parameters and click Create to create a persistent volume (PV) named training-data.

    The following table describes the required parameters. You can configure other parameters based on your business requirements. For more information, see Mount a statically provisioned NAS volume.

    Parameter                    Description
    ---------                    -----------
    PV Type                      NAS
    Volume Name                  training-data
    Mount Target Domain Name     Select the mount target that you obtained in Step 1.

  5. In the left-side navigation pane, choose Volumes > Persistent Volume Claims. On the Persistent Volume Claims page, click Create in the upper-right corner.

  6. In the Create PVC dialog box, configure the parameters and click Create to create a persistent volume claim (PVC) named training-data.

    The following table describes the required parameters. You can configure other parameters based on your business requirements. For more information, see Mount a statically provisioned NAS volume.

    Parameter           Description
    ---------           -----------
    PVC Type            NAS
    Name                training-data
    Allocation Mode     In this example, Existing Volumes is selected.
    Existing Volumes    Click Select PV and select the PV that you created.

3. Download data to the NAS file system.

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Nodes.

  3. On the Nodes page, click the instance ID of the node that you want to manage. On the Instance Details page, click Connect and then click Sign in now.

  4. Run the mount command obtained in Step 1 to mount the NAS file system.

  5. Download the BLOOM model and training data.

    Download the bloom-560m model from Hugging Face and store the model in the PVC so that its path is pvc://training-data/bloom-560m, as sketched below.
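
    The following is a minimal sketch of one way to do this. It assumes that git and git-lfs are installed on the node and that the NAS file system is mounted at /mnt; adjust the paths to match the mount command that you ran.

        # Clone the bloom-560m repository from Hugging Face into the root of the NAS file
        # system so that the model is available in the PVC as pvc://training-data/bloom-560m.
        cd /mnt                 # replace with the local mount point of your NAS file system
        git lfs install
        git clone https://huggingface.co/bigscience/bloom-560m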
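
If you prefer to manage the PV and PVC from this step with kubectl instead of the console, the following is a minimal sketch of equivalent objects. It assumes that the managed NAS CSI plug-in (nasplugin.csi.alibabacloud.com) is installed in the cluster; the console may generate additional labels and parameters, and you must replace the placeholder server value with the mount target that you recorded.

# Create a statically provisioned NAS PV and bind it to a PVC named training-data.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data
spec:
  capacity:
    storage: 20Gi             # required by Kubernetes; NAS does not enforce this size
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: training-data
    volumeAttributes:
      server: "xxxx.cn-beijing.nas.aliyuncs.com"   # replace with your NAS mount target
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""        # empty string disables dynamic provisioning
  volumeName: training-data   # bind to the PV defined above
  resources:
    requests:
      storage: 20Gi
EOF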

Step 2: Deploy the inference service

  1. Run the following command to query the GPU resources available in the cluster:

    arena top node

    Expected output:

    NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    0/2 (0.0%)

    The expected output shows that two nodes providing GPU resources are available to run the inference service.

  2. Run the following command to start the inference service named bloom-560m. The PVC training-data is mounted to the /mnt/models directory of the container, so the model downloaded in Step 1 is available at /mnt/models/bloom-560m.

    arena serve kserve \
        --name=bloom-560m \
        --image=ghcr.io/huggingface/text-generation-inference:1.0.2 \
        --gpus=1 \
        --cpu=6 \
        --memory=20Gi \
        --port=8080 \
        --env=STORAGE_URI=pvc://training-data \
        "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m --num-shard 1 -p 8080"

    Expected output:

    inferenceservice.serving.kserve.io/bloom-560m created
    INFO[0013] The Job bloom-560m has been submitted successfully
    INFO[0013] You can run `arena serve get bloom-560m --type kserve -n default` to check the job status

    The following table describes the parameters in the command. After you submit the service, you can also inspect the InferenceService object that Arena creates by using kubectl, as sketched after the table.

    Parameter    Required    Description
    ---------    --------    -----------
    --name       Yes         The name of the submitted job. The name must be unique.
    --image      Yes         The image address of the inference service.
    --gpus       No          The number of GPUs required by the inference service. Default value: 0.
    --cpu        No          The number of CPUs required by the inference service.
    --memory     No          The amount of memory required by the inference service.
    --port       No          The port of the inference service that is exposed for external access.
    --env        No          An environment variable of the inference service. In this example, STORAGE_URI specifies the PVC named training-data as the model store.
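
Arena submits the service as a KServe InferenceService object (see the expected output above). If kubectl is configured for the cluster, you can optionally inspect the generated objects directly. The following is a sketch; the pod label shown is the one that KServe typically applies and may differ across KServe versions.

# Inspect the InferenceService created by Arena and its predictor pods.
kubectl get inferenceservice bloom-560m
kubectl get pods -l serving.kserve.io/inferenceservice=bloom-560m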

Step 3: Verify the inference service

  1. Run the following command to view the deployment progress of the KServe inference service:

    arena serve get bloom-560m

    Expected output:

    Name:       bloom-560m
    Namespace:  default
    Type:       KServe
    Version:    00001
    Desired:    1
    Available:  1
    Age:        9m
    Address:    http://bloom-560m.default.example.com
    Port:       :80
    GPU:        1
    
    LatestRevision:     bloom-560m-predictor-00001
    LatestPrecent:      100
    
    Instances:
      NAME                                                   STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                   ------   ---  -----  --------  ---  ----
      bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp  Running  9m   2/2    0         1    cn-hongkong.192.1xx.x.xxx

    The expected output indicates that the KServe inference service is deployed and the model access address is http://bloom-560m.default.example.com.

  2. Obtain the IP address of the ASM ingress gateway. For more information, see Step 2: Obtain the IP address of the ASM ingress gateway. Alternatively, you can look up the address with kubectl, as sketched at the end of this step.

  3. Run the following command to access the inference service by using the obtained IP address of the ASM ingress gateway:

    # Replace ${ASM_GATEWAY} with the IP address of the ASM ingress gateway. 
    curl -H "Host: bloom-560m.default.example.com" http://${ASM_GATEWAY}:80/generate \
        -X POST \
        -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
        -H 'Content-Type: application/json'

    Expected output:

    {"generated_text":" Deep Learning is a new type of machine learning that is used to solve complex problems."}
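
If you prefer to look up the gateway address with kubectl instead of the console, the following sketch assumes that the ASM ingress gateway is deployed as a LoadBalancer Service named istio-ingressgateway in the istio-system namespace and that it exposes an IP address; adjust the name and namespace to match your gateway.

# Read the external IP address of the ASM ingress gateway into the ASM_GATEWAY variable.
ASM_GATEWAY=$(kubectl -n istio-system get service istio-ingressgateway \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo ${ASM_GATEWAY}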

Step 4: Update the inference service

  1. Create a new version of the model and copy it to the new path bloom-560m-v2 in the PVC.

  2. Run the following command to perform a canary release of the KServe inference service. The command sets the model path of the new version to bloom-560m-v2, routes 10% of the traffic to the new version, and routes the remaining 90% to the old version. (A simple way to spot-check the traffic split is sketched at the end of this step.)

    arena serve update kserve \
        --name bloom-560m \
        --canary-traffic-percent=10 \
        "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m-v2 --num-shard 1 -p 8080"

  3. Run the following command to view the status of the inference service:

    arena serve get bloom-560m

    Expected output:

    Name:       bloom-560m
    Namespace:  default
    Type:       KServe
    Version:    00002
    Desired:    2
    Available:  2
    Age:        36m
    Address:    http://bloom-560m.default.example.com
    Port:       :80
    GPU:        2
    
    LatestRevision:     bloom-560m-predictor-00002
    LatestPrecent:      10
    PrevRevision:       bloom-560m-predictor-00001
    PrevPrecent:        90
    
    Instances:
      NAME                                                    STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                    ------   ---  -----  --------  ---  ----
      bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp   Running  36m  2/2    0         1    cn-hongkong.192.1xx.x.xxx
      bloom-560m-predictor-00002-deployment-5b7bb66cfb-nqprp  Running  6m   2/2    0         1    cn-hongkong.192.1xx.x.xxx

  4. After the new version of the inference service passes the test, run the following command to set the canary-traffic-percent parameter to 100. This forwards all traffic to the new version of the inference service:

    arena serve update kserve \
        --name bloom-560m \
        --canary-traffic-percent=100
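
To spot-check the traffic split while the canary version still receives 10% of the traffic (that is, before you run the command in the last substep), you can send a batch of identical requests through the gateway and compare which revision's pods handle them, for example by checking the pod logs. The following sketch reuses the request from Step 3; ${ASM_GATEWAY} is the IP address of the ASM ingress gateway obtained there.

# Send 20 identical requests; roughly 10% of them should be routed to the
# bloom-560m-predictor-00002 revision.
for i in $(seq 1 20); do
  curl -s -H "Host: bloom-560m.default.example.com" http://${ASM_GATEWAY}:80/generate \
      -X POST \
      -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
      -H 'Content-Type: application/json'
  echo
done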

(Optional) Step 5: Delete the inference service

Run the following command to delete the inference service:

arena serve delete bloom-560m
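
If the PV and PVC from Step 1 were created only for this example, you can optionally remove them as well. Deleting these objects does not delete the model files stored on the NAS file system.

# Remove the PVC first, then the PV.
kubectl delete pvc training-data
kubectl delete pv training-data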