
Alibaba Cloud Service Mesh:Accelerate AI model serving with KServe on ASM and Fluid

Last Updated: Mar 10, 2026

KServe, formerly known as KFServing, is a model serving and inference engine for cloud-native environments that supports automatic scaling, scale-to-zero, and canary deployments. Service Mesh (ASM) integrates with the Knative Serving components deployed in ACK or ACK Serverless clusters and provides the KServe on ASM feature for one-click KServe integration. Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios such as big data and AI. You can integrate Fluid directly with the KServe on ASM feature to accelerate model loading.

This topic walks through integrating KServe on ASM with Fluid to deploy an AI inference service backed by Object Storage Service (OSS).

How it works

This integration connects three components:

  1. Fluid caches model files from an OSS bucket to local SSD storage across cluster nodes. A JindoRuntime custom resource (CR) manages the distributed cache, and a DataLoad CR prefetches data before any inference pod starts.

  2. KServe on ASM deploys an InferenceService CR that references cached data through a persistent volume claim (PVC). The inference container loads the model from the Fluid-managed mount point instead of downloading it from OSS.

  3. Service Mesh (ASM) routes traffic through an ingress gateway to the KServe inference service. Knative Serving handles autoscaling and scale-to-zero.

When a request arrives and Knative scales a pod from zero, the model is already cached locally. The pod starts and serves predictions without a remote download.
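
The glue between Fluid and KServe is a naming convention: after the Dataset created in Step 2 is bound, Fluid exposes its cache as a PVC that has the same name as the Dataset (oss-data in this topic), and the InferenceService in Step 3 points its STORAGE_URI at that PVC. The following fragment, taken from the manifest in Step 3, shows the reference pattern:

env:
  - name: STORAGE_URI
    value: "pvc://oss-data/bloom-560m"   # pvc://<Dataset name>/<path declared in the Dataset mount>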

Prerequisites

Before you begin, complete the following prerequisites:

Deploy the Knative Serving component

Note

If you select Kourier as the gateway during Knative installation, uninstall it after installation completes. In the ACK console, go to Clusters, click your cluster, and choose Applications > Knative. On the Components tab, uninstall Kourier in the Add-on Component section.
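
After Knative Serving is deployed, you can optionally confirm that its pods are running in the data plane cluster. The namespace in the following command assumes the default knative-serving installation namespace:

# Optional check: all Knative Serving pods should be in the Running state.
kubectl get pods -n knative-serving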

Enable Knative on ASM

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Ecosystem > Knative on ASM.

  3. On the Knative on ASM page, click Enable Knative on ASM.

Note

To update an existing ASM instance, see Update an ASM instance.

Step 1: Enable KServe on ASM

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Ecosystem > KServe on ASM.

  3. On the KServe on ASM page, configure the CertManager option and click Enable KServe on ASM. KServe depends on CertManager for certificate lifecycle management, and the CertManager component can be automatically installed together with KServe.

    • If CertManager is not installed in the cluster, turn on Automatically install the CertManager component in the cluster.

    • If CertManager is already installed in the cluster, turn off Automatically install the CertManager component in the cluster.
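
After KServe on ASM is enabled, you can optionally verify that the KServe controller and the CertManager component are running in the data plane cluster. The namespaces in the following commands are common defaults (kserve and cert-manager) and may differ in your environment:

# Optional checks; the namespaces are assumptions based on a default installation.
kubectl get pods -n kserve
kubectl get pods -n cert-manager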

Step 2: Install Fluid and set up model caching

Install the ack-fluid component

Deploy the ack-fluid component (version 0.9.10 or later) in the data plane cluster.

For an ACK cluster, install the cloud-native AI suite and deploy the ack-fluid component:

  • If the cloud-native AI suite is not installed, enable Fluid acceleration during installation. For more information, see Deploy the cloud-native AI suite.

  • If the cloud-native AI suite is already installed, log on to the Container Service for Kubernetes console, click the cluster, and choose Applications > Cloud-native AI Suite to deploy the ack-fluid component.

Note

Uninstall open source Fluid before installing the ack-fluid component. The two cannot coexist.

For an ACK Serverless cluster, see the Deploy the control plane components of Fluid section of the Accelerate Jobs topic.
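
In either cluster type, you can optionally confirm that the Fluid control plane components are running. The following command assumes the default fluid-system namespace used by the ack-fluid component:

# Optional check: the Fluid controller and webhook pods should be in the Running state.
kubectl get pods -n fluid-system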

Upload the AI model to OSS

This tutorial uses BLOOM-560m, an open source transformer-based large language model (LLM) implemented in PyTorch.

  1. Download the model files from Hugging Face.

  2. Upload the files to your OSS bucket and record the storage path.

    The path format is oss://<bucket>/<path>. For example, if the bucket is fluid-demo and the files are in the models/bloom directory, the storage path is oss://fluid-demo/models/bloom.

Note

Use ossutil to upload files. For more information, see Install ossutil.
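
The following commands are a sketch of the download and upload steps. They assume that Git LFS and ossutil are installed and configured, and they use the example bucket and path from this topic (fluid-demo/models/bloom); adjust the names for your environment:

# Download the model files from Hugging Face (requires git-lfs for the weight files).
git lfs install
git clone https://huggingface.co/bigscience/bloom-560m

# Recursively upload the files to OSS and record the resulting storage path.
ossutil cp -r ./bloom-560m oss://fluid-demo/models/bloom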

Create a namespace and configure OSS access

  1. Connect to the data plane cluster with kubectl. For more information, see Connect to an ACK cluster by using kubectl.

  2. Create a namespace for Fluid cache and inference workloads:

       kubectl create ns kserve-fluid-demo
  3. Create a file named oss-secret.yaml with the following content:

    Replace the following placeholders with actual values:

    • <your-access-key-id>: AccessKey ID of an Alibaba Cloud account that has access to OSS. Example: LTAI5tXxx.
    • <your-access-key-secret>: AccessKey secret of the account. Example: xXxXxXx.
       apiVersion: v1
       kind: Secret
       metadata:
         name: access-key
       stringData:
         fs.oss.accessKeyId: <your-access-key-id>       # AccessKey ID of an account with OSS access
         fs.oss.accessKeySecret: <your-access-key-secret> # AccessKey secret of the account
  4. Apply the Secret:

       kubectl apply -f oss-secret.yaml -n kserve-fluid-demo
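
You can optionally confirm that the Secret exists before the Dataset in the next section references it by name:

# Optional check: the Dataset's encryptOptions reference this Secret by the name access-key.
kubectl get secret access-key -n kserve-fluid-demo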

Declare the model data in Fluid

Submit a Dataset CR and a JindoRuntime CR to declare the model data. The Dataset describes the remote storage location, and the JindoRuntime configures the cache system, including the cache medium, capacity, and eviction thresholds.

  1. Create a file named oss-jindo.yaml with the following content.

    Replace the following placeholders with actual values. For a list of available endpoints, see Regions and endpoints.

    • <bucket>/<path>: OSS storage path of the model files. Example: fluid-demo/models/bloom.
    • <endpoint>: OSS endpoint for the region where the bucket resides. Example: oss-cn-hangzhou-internal.aliyuncs.com.

    The JindoRuntime configuration uses the following parameters:

    • replicas (2): number of cache worker replicas.
    • mediumtype (SSD): storage medium for the local cache.
    • quota (50Gi): maximum cache capacity per worker node.
    • high / low (0.95 / 0.7): cache eviction watermarks. Eviction starts when usage reaches high and stops at low.
    • cleanPolicy (OnDemand): clean-up policy for the FUSE sidecar.


       apiVersion: data.fluid.io/v1alpha1
       kind: Dataset
       metadata:
         name: oss-data
       spec:
         mounts:
         - mountPoint: "oss://<bucket>/<path>"  # Storage path of the model files
           name: bloom-560m
           path: /bloom-560m
           options:
             fs.oss.endpoint: "<endpoint>"      # OSS endpoint for your region
           encryptOptions:
             - name: fs.oss.accessKeyId
               valueFrom:
                 secretKeyRef:
                   name: access-key
                   key: fs.oss.accessKeyId
             - name: fs.oss.accessKeySecret
               valueFrom:
                 secretKeyRef:
                   name: access-key
                   key: fs.oss.accessKeySecret
         accessModes:
           - ReadOnlyMany
       ---
       apiVersion: data.fluid.io/v1alpha1
       kind: JindoRuntime
       metadata:
         name: oss-data
       spec:
         replicas: 2
         tieredstore:
           levels:
             - mediumtype: SSD
               volumeType: emptyDir
               path: /mnt/ssd0/cache
               quota: 50Gi
               high: "0.95"
               low: "0.7"
         fuse:
           properties:
             fs.jindofsx.data.cache.enable: "true"
           args:
             - -okernel_cache
             - -oro
             - -oattr_timeout=7200
             - -oentry_timeout=7200
             - -ometrics_port=9089
           cleanPolicy: OnDemand
  2. Deploy the Dataset and JindoRuntime:

       kubectl create -f oss-jindo.yaml -n kserve-fluid-demo
  3. Verify the deployment:

       kubectl get jindoruntime,dataset -n kserve-fluid-demo

    Expected output:

       NAME                                  MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
       jindoruntime.data.fluid.io/oss-data   Ready          Ready          Ready        3m
    
       NAME                             UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
       dataset.data.fluid.io/oss-data   3.14GiB          0.00B    100.00GiB        0.0%                Bound   3m

    The Dataset PHASE shows Bound and the JindoRuntime FUSE PHASE shows Ready, which confirms successful deployment.
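
At this point, Fluid has also created a PV and a PVC for the Dataset. The PVC has the same name as the Dataset and is what the InferenceService mounts in Step 3. You can optionally confirm that it is Bound:

# Optional check: a PVC named oss-data should exist in the namespace and be Bound.
kubectl get pvc oss-data -n kserve-fluid-demo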

Prefetch model data

Prefetching loads model files into the local cache before any inference pod starts, eliminating remote download latency during pod startup.

  1. Create a file named oss-dataload.yaml with the following content:

       apiVersion: data.fluid.io/v1alpha1
       kind: DataLoad
       metadata:
         name: oss-dataload
       spec:
         dataset:
           name: oss-data
           namespace: kserve-fluid-demo
         target:
           - path: /bloom-560m
             replicas: 2
  2. Deploy the DataLoad:

       kubectl create -f oss-dataload.yaml -n kserve-fluid-demo
  3. Check the prefetch progress:

       kubectl get dataload -n kserve-fluid-demo

    Expected output:

       NAME           DATASET    PHASE      AGE     DURATION
       oss-dataload   oss-data   Complete   1m      45s

    Wait until PHASE shows Complete before proceeding to the next step.
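
After the prefetch completes, you can optionally re-check the Dataset. The CACHED and CACHED PERCENTAGE columns should now reflect the model size instead of 0.00B and 0.0%:

# Optional check: CACHED should be close to UFS TOTAL SIZE (about 3.14GiB in this example).
kubectl get dataset oss-data -n kserve-fluid-demo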

Step 3: Deploy the inference service

  1. Create a file named oss-fluid-isvc.yaml with the InferenceService configuration that matches your cluster type.

    ACK cluster

       apiVersion: "serving.kserve.io/v1beta1"
       kind: "InferenceService"
       metadata:
         name: "fluid-bloom"
       spec:
         predictor:
           timeout: 600
           minReplicas: 0
           containers:
             - name: kserve-container
               image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu
               resources:
                 limits:
                   cpu: "12"
                   memory: 48Gi
                   nvidia.com/gpu: 1  # Number of GPUs required. Remove this line if GPUs are not used.
                 requests:
                   cpu: "12"
                   memory: 48Gi
               env:
                 - name: STORAGE_URI
                   value: "pvc://oss-data/bloom-560m"
                 - name: MODEL_NAME
                   value: "bloom"
                 - name: GPU_ENABLED
                   value: "True"  # Set to "False" if GPUs are not used.

    ACK Serverless cluster

       apiVersion: "serving.kserve.io/v1beta1"
       kind: "InferenceService"
       metadata:
         name: "fluid-bloom"
         labels:
           alibabacloud.com/fluid-sidecar-target: "eci"
         annotations:
           k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c16g1.4xlarge"  # ECS instance type
           knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.gn6i-c16g1.4xlarge"  # ECS instance type
       spec:
         predictor:
           timeout: 600
           minReplicas: 0
           containers:
             - name: kserve-container
               image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu
               resources:
                 limits:
                   cpu: "12"
                   memory: 48Gi
                 requests:
                   cpu: "12"
                   memory: 48Gi
               env:
                 - name: STORAGE_URI
                   value: "pvc://oss-data/bloom-560m"
                 - name: MODEL_NAME
                   value: "bloom"
                 - name: GPU_ENABLED
                   value: "True"  # Set to "False" if GPUs are not used.

    The InferenceService configuration uses the following parameters:

    • image (registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu): sample image that provides the model loading and inference interfaces. To customize it, see the KServe Fluid Docker samples.
    • STORAGE_URI (pvc://oss-data/bloom-560m): PVC-based storage URI that points to the Fluid-managed cache.
    • MODEL_NAME (bloom): model name used in the prediction API endpoint.
    • GPU_ENABLED (True): set to False if GPUs are not used.
    • resources (12 CPU, 48 GiB memory): resource allocation for the BLOOM-560m model. Adjust it based on your model size and cluster capacity.
  2. Deploy the inference service:

       kubectl create -f oss-fluid-isvc.yaml -n kserve-fluid-demo
  3. Verify the deployment:

       kubectl get inferenceservice -n kserve-fluid-demo

    Expected output:

       NAME          URL                                                READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION           AGE
       fluid-bloom   http://fluid-bloom.kserve-fluid-demo.example.com   True           100                              fluid-bloom-predictor-00001   2d

    READY shows True, which confirms the inference service is running.
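
Because minReplicas is set to 0, Knative may scale the predictor to zero replicas when the service is idle, so seeing no predictor pod in the namespace at this stage is normal. To watch pods being created when traffic arrives in the next step, you can keep the following command running in a separate terminal:

# Watch predictor pods scale from zero when a request arrives. Press Ctrl+C to stop.
kubectl get pods -n kserve-fluid-demo -w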

Step 4: Send a prediction request

  1. Get the ASM ingress gateway address.

    1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

    2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose ASM Gateways > Ingress Gateway.

    3. In the Service address section of ingressgateway, copy the service address.

  2. Send a test request. Replace <gateway-address> with the address from the previous step.

       curl -v \
         -H "Content-Type: application/json" \
         -H "Host: fluid-bloom.kserve-fluid-demo.example.com" \
         "http://<gateway-address>:80/v1/models/bloom:predict" \
         -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'

    Expected output:

       *   Trying xxx.xx.xx.xx:80...
       * Connected to xxx.xx.xx.xx (xxx.xx.xx.xx) port 80 (#0)
       > POST /v1/models/bloom:predict HTTP/1.1
       > Host: fluid-bloom-predictor.kserve-fluid-demo.example.com
       > Content-Type: application/json
       >
       < HTTP/1.1 200 OK
       < content-type: application/json
       < server: istio-envoy
       <
       {
         "result": "It was a dark and stormy night, and the wind was blowing in the\ndirection of the west. The wind was blowing in the direction of the\nwest, and the wind was blowing in the direction of the west. The\nwind was"
       }

    A 200 OK response with generated text confirms that the inference service is working.
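
To get a rough sense of the end-to-end latency, including any scale-from-zero delay, you can time a request with curl's built-in timing variables. The following command is a simple sketch; the result depends on your instance type and model size:

# Rough end-to-end timing of a single prediction request.
curl -o /dev/null -s -w "total time: %{time_total}s\n" \
  -H "Content-Type: application/json" \
  -H "Host: fluid-bloom.kserve-fluid-demo.example.com" \
  "http://<gateway-address>:80/v1/models/bloom:predict" \
  -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'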

Clean up

To remove all resources created in this tutorial, run the following commands in order:

kubectl delete inferenceservice fluid-bloom -n kserve-fluid-demo
kubectl delete dataload oss-dataload -n kserve-fluid-demo
kubectl delete dataset oss-data -n kserve-fluid-demo
kubectl delete jindoruntime oss-data -n kserve-fluid-demo
kubectl delete secret access-key -n kserve-fluid-demo
kubectl delete ns kserve-fluid-demo

What's next