Container Service for Kubernetes:Deploy a KServe model inference service in serverless mode

Last Updated: Mar 26, 2026

KServe, formerly known as KFServing, is a model serving and inference engine for cloud-native environments. It supports automatic scaling, scale-to-zero, and canary deployments. This topic describes how to use Service Mesh (ASM) and Arena to deploy a KServe model as an inference service in serverless mode.

Prerequisites

Before you begin, ensure that you have an ACK cluster with Service Mesh (ASM) and KServe configured, and the Arena client installed.

Step 1: Prepare model data

Use NAS or Object Storage Service (OSS) to store model data. For more information, see Mount a statically provisioned NAS volume and Use an ossfs 1.0 statically provisioned volume. This topic uses NAS as an example.

1.1 Get the NAS mount target

  1. Log on to the File Storage NAS console. In the left-side navigation pane, choose File System > File System List. In the upper part of the page, select the region where the NAS file system resides.

  2. On the File System List page, click the ID of the file system you want to manage. On the details page, click Mount Targets. Move the pointer over the icon in the Mount Target column to view the mount target. Record the mount target and the mount command for later use.
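The mount command recorded from the console is typically an NFS mount of the following shape. The mount target domain name below is a placeholder for illustration, not a value from this topic; substitute the one you recorded:

```shell
# Sketch of a typical NAS mount command, assuming NFSv3 with the options
# the NAS console commonly suggests. Replace the placeholder mount target
# with the one recorded from the console.
mkdir -p /mnt/nas
sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport \
    xxxxxx.cn-beijing.nas.aliyuncs.com:/ /mnt/nas
```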

1.2 Configure a PV and a PVC

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Volumes > Persistent Volumes.

  3. In the upper-right corner of the Persistent Volumes page, click Create.

  4. In the Create PV dialog box, configure the following parameters and click Create to create a persistent volume (PV) named training-data. For other parameters, see Mount a statically provisioned NAS volume.

    Parameter                   Value
    ------------------------    --------------------------------------------
    PV Type                     NAS
    Volume Name                 training-data
    Mount Target Domain Name    Select the mount target recorded in step 1.1
  5. In the left navigation pane, choose Volumes > Persistent Volume Claims. On the Persistent Volume Claims page, click Create in the upper-right corner.

  6. In the Create PVC dialog box, configure the following parameters and click Create to create a persistent volume claim (PVC) named training-data. For other parameters, see Mount a statically provisioned NAS volume.

    Parameter           Value
    ----------------    --------------------------------------------
    PVC Type            NAS
    Name                training-data
    Allocation Mode     Existing Volumes
    Existing Volumes    Click Select PV and select the PV you created
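If you prefer the command line over the console, the same PV and PVC can be sketched as manifests applied with kubectl. The mount target domain name and the 20Gi capacity below are placeholder assumptions, not values from this topic:

```shell
# Sketch only: a statically provisioned NAS PV and a PVC bound to it.
# Replace the placeholder mount target with the one recorded in step 1.1.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data
spec:
  capacity:
    storage: 20Gi                # placeholder capacity
  accessModes: ["ReadWriteMany"]
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: training-data
    volumeAttributes:
      server: "xxxxxx.cn-beijing.nas.aliyuncs.com"   # placeholder mount target
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 20Gi
  volumeName: training-data      # bind to the PV above
EOF
```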

1.3 Download model data to the NAS file system

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster you want. In the left navigation pane, choose Nodes > Nodes.

  3. On the Nodes page, click the instance ID of the node you want to manage. On the Instance Details page, click More > Workbench Remote Access, then click Log in.

  4. Run the mount command from step 1.1 to mount the NAS file system.

  5. Download the bloom-560m model from Hugging Face and store it in the PVC at the path pvc://training-data/bloom-560m.
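One way to download the model is a Git LFS clone from Hugging Face into the mounted file system. This sketch assumes git and git-lfs are installed on the node and that the NAS file system from the previous step is mounted at /mnt/nas (an assumed mount path):

```shell
# Clone the bloom-560m repository from Hugging Face onto the NAS file system.
cd /mnt/nas
git lfs install
git clone https://huggingface.co/bigscience/bloom-560m
# The model files are now visible to the cluster as pvc://training-data/bloom-560m.
```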

Step 2: Deploy the inference service

  1. Query the GPU resources available in the cluster:

    arena top node

    Expected output:

    NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   0           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
    cn-beijing.192.1xx.x.xx   192.1xx.x.xx   <none>  Ready   1           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    0/2 (0.0%)

    The output shows that two GPU-accelerated nodes are available, each with one unallocated GPU.

  2. Start the inference service named bloom-560m. The PVC training-data is mounted to /mnt/models in the container, which is where the model was downloaded in step 1.3.

    arena serve kserve \
        --name=bloom-560m \
        --image=ghcr.io/huggingface/text-generation-inference:1.0.2 \
        --gpus=1 \
        --cpu=6 \
        --memory=20Gi \
        --port=8080 \
        --env=STORAGE_URI=pvc://training-data \
        "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m --num-shard 1 -p 8080"

    Expected output:

    inferenceservice.serving.kserve.io/bloom-560m created
    INFO[0013] The Job bloom-560m has been submitted successfully
    INFO[0013] You can run `arena serve get bloom-560m --type kserve -n default` to check the job status

    The following table describes the parameters.

    Parameter    Required    Description
    ---------    --------    ------------------------------------------------------------
    --name       Yes         The name of the submitted job. Must be globally unique.
    --image      Yes         The container image address for the inference service.
    --gpus       No          The number of GPUs required. Default value: 0.
    --cpu        No          The number of CPUs required.
    --memory     No          The amount of memory required.
    --port       No          The port exposed for external access.
    --env        No          Environment variables as key-value pairs. In this example,
                             STORAGE_URI points to the PVC that stores the model.

Step 3: Verify the inference service

  1. Check the deployment status of the KServe inference service:

    arena serve get bloom-560m

    Expected output:

    Name:       bloom-560m
    Namespace:  default
    Type:       KServe
    Version:    00001
    Desired:    1
    Available:  1
    Age:        9m
    Address:    http://bloom-560m.default.example.com
    Port:       :80
    GPU:        1
    
    LatestRevision:     bloom-560m-predictor-00001
    LatestPrecent:      100
    
    Instances:
      NAME                                                   STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                   ------   ---  -----  --------  ---  ----
      bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp  Running  9m   2/2    0         1    cn-hongkong.192.1xx.x.xxx

    The output confirms the inference service is deployed. The model access address is http://bloom-560m.default.example.com.
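Under the hood, Arena creates a standard KServe InferenceService, so the same deployment can also be inspected with kubectl. This is an optional cross-check, not a step from this topic:

```shell
# List the InferenceService that arena serve kserve created.
kubectl get inferenceservice bloom-560m -n default
# Show the predictor pods backing it, selected by the standard KServe label.
kubectl get pods -n default -l serving.kserve.io/inferenceservice=bloom-560m
```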

  2. Get the IP address of the ASM ingress gateway. For more information, see Step 2: Obtain the IP address of the ASM ingress gateway.

  3. Send a test request to the inference service using the ASM ingress gateway IP address:

    # Replace ${ASM_GATEWAY} with the IP address of the ASM ingress gateway.
    curl -H "Host: bloom-560m.default.example.com" http://${ASM_GATEWAY}:80/generate \
        -X POST \
        -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
        -H 'Content-Type: application/json'

    Expected output:

    {"generated_text":" Deep Learning is a new type of machine learning that is used to solve complex problems."}

Step 4: Update the inference service

Use canary deployment to gradually shift traffic to a new model version before full cutover.

  1. Prepare a new model version: copy the model files in the PVC to a new path, bloom-560m-v2.

  2. Deploy the new version with 10% of traffic routed to it:

    arena serve update kserve \
        --name bloom-560m \
        --canary-traffic-percent=10 \
        "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m-v2 --num-shard 1 -p 8080"
  3. Check the status of the inference service:

    arena serve get bloom-560m

    Expected output:

    Name:       bloom-560m
    Namespace:  default
    Type:       KServe
    Version:    00002
    Desired:    2
    Available:  2
    Age:        36m
    Address:    http://bloom-560m.default.example.com
    Port:       :80
    GPU:        2
    
    LatestRevision:     bloom-560m-predictor-00002
    LatestPrecent:      10
    PrevRevision:       bloom-560m-predictor-00001
    PrevPrecent:        90
    
    Instances:
      NAME                                                    STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                    ------   ---  -----  --------  ---  ----
      bloom-560m-predictor-00001-deployment-ff4c49bf6-twrlp   Running  36m  2/2    0         1    cn-hongkong.192.1xx.x.xxx
      bloom-560m-predictor-00002-deployment-5b7bb66cfb-nqprp  Running  6m   2/2    0         1    cn-hongkong.192.1xx.x.xxx

    The output shows 10% of traffic goes to bloom-560m-predictor-00002 (new version) and 90% to bloom-560m-predictor-00001 (previous version).
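One rough way to observe the split is to repeat the request from step 3 many times and compare request counts across the two predictor deployments. The deployment names below are taken from the instance list above; treat the log-counting step as a sketch, since the exact log format depends on the serving image:

```shell
# Send 20 requests through the ASM gateway; with a 10% canary weight,
# roughly 2 of them should land on revision 00002.
# ${ASM_GATEWAY} is the gateway IP address obtained in step 3.
for i in $(seq 1 20); do
  curl -s -X POST "http://${ASM_GATEWAY}:80/generate" \
      -H "Host: bloom-560m.default.example.com" \
      -H 'Content-Type: application/json' \
      -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
      > /dev/null
done
# Compare how many requests each revision's pods logged (log format may vary).
kubectl logs deploy/bloom-560m-predictor-00002-deployment -n default | grep -c generate
```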

  4. After the new version passes testing, shift all traffic to it:

    arena serve update kserve \
        --name bloom-560m \
        --canary-traffic-percent=100

(Optional) Step 5: Delete the inference service

Delete the inference service to release all GPU and compute resources:

arena serve delete bloom-560m