
Alibaba Cloud Service Mesh:Accelerate AI model serving with KServe on ASM and Fluid

Last Updated: Mar 10, 2026

KServe, formerly known as KFServing, is a model serving and inference engine for cloud-native environments that supports automatic scaling, scale-to-zero, and canary deployments. Service Mesh (ASM) integrates with the Knative Serving components deployed in ACK or ACK Serverless clusters and provides the KServe on ASM feature for one-click KServe integration. Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios such as big data and AI. You can integrate Fluid directly with the KServe on ASM feature to accelerate model loading.

This topic walks through integrating KServe on ASM with Fluid to deploy an AI inference service backed by Object Storage Service (OSS).

How it works

This integration connects three components:

  1. Fluid caches model files from an OSS bucket to local SSD storage across cluster nodes. A JindoRuntime custom resource (CR) manages the distributed cache, and a DataLoad CR prefetches data before any inference pod starts.

  2. KServe on ASM deploys an InferenceService CR that references cached data through a persistent volume claim (PVC). The inference container loads the model from the Fluid-managed mount point instead of downloading it from OSS.

  3. Service Mesh (ASM) routes traffic through an ingress gateway to the KServe inference service. Knative Serving handles autoscaling and scale-to-zero.

When a request arrives and Knative scales a pod from zero, the model is already cached locally. The pod starts and serves predictions without a remote download.
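
The glue between Fluid and KServe is a naming convention: after the Dataset created in Step 2 is bound, Fluid exposes its cache as a PVC that has the same name as the Dataset (oss-data in this topic), and the InferenceService in Step 3 points its STORAGE_URI at that PVC. The following fragment, taken from the manifest in Step 3, shows the reference pattern:

env:
  - name: STORAGE_URI
    value: "pvc://oss-data/bloom-560m"   # pvc://<Dataset name>/<path declared in the Dataset mount>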

Prerequisites

Before you begin, complete the following prerequisites:

Deploy the Knative Serving component

Note

If you select Kourier as the gateway during Knative installation, uninstall it after installation completes. In the ACK console, go to Clusters, click your cluster, and choose Applications > Knative. On the Components tab, uninstall Kourier in the Add-on Component section.
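
After Knative Serving is deployed, you can optionally confirm that its pods are running in the data plane cluster. The namespace in the following command assumes the default knative-serving installation namespace:

# Optional check: all Knative Serving pods should be in the Running state.
kubectl get pods -n knative-serving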

Enable Knative on ASM

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Ecosystem > Knative on ASM.

  3. On the Knative on ASM page, click Enable Knative on ASM.

Note

To update an existing ASM instance, see Update an ASM instance.

Step 1: Enable KServe on ASM

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Ecosystem > KServe on ASM.

  3. On the KServe on ASM page, configure the CertManager option and click Enable KServe on ASM. KServe depends on CertManager for certificate lifecycle management, and the CertManager component can be automatically installed together with KServe.

    • If CertManager is not installed in the cluster, turn on Automatically install the CertManager component in the cluster.

    • If CertManager is already installed in the cluster, turn off Automatically install the CertManager component in the cluster.
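
After KServe on ASM is enabled, you can optionally verify that the KServe controller and the CertManager component are running in the data plane cluster. The namespaces in the following commands are common defaults (kserve and cert-manager) and may differ in your environment:

# Optional checks; the namespaces are assumptions based on a default installation.
kubectl get pods -n kserve
kubectl get pods -n cert-manager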

Step 2: Install Fluid and set up model caching

Install the ack-fluid component

Deploy the ack-fluid component (version 0.9.10 or later) in the data plane cluster.

For an ACK cluster, install the cloud-native AI suite and deploy the ack-fluid component:

  • If the cloud-native AI suite is not installed, enable Fluid acceleration during installation. For more information, see Deploy the cloud-native AI suite.

  • If the cloud-native AI suite is already installed, log on to the Container Service for Kubernetes console, click the cluster, and choose Applications > Cloud-native AI Suite to deploy the ack-fluid component.

Note

Uninstall open source Fluid before installing the ack-fluid component. The two cannot coexist.

For an ACK Serverless cluster, see the Deploy the control plane components of Fluid section of the Accelerate Jobs topic.
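
In either cluster type, you can optionally confirm that the Fluid control plane components are running. The following command assumes the default fluid-system namespace used by the ack-fluid component:

# Optional check: the Fluid controller and webhook pods should be in the Running state.
kubectl get pods -n fluid-system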

Upload the AI model to OSS

This tutorial uses BLOOM-560m, an open source transformer-based large language model (LLM) implemented in PyTorch.

  1. Download the model files from Hugging Face.

  2. Upload the files to your OSS bucket and record the storage path.

    The path format is oss://<bucket>/<path>. For example, if the bucket is fluid-demo and the files are in the models/bloom directory, the storage path is oss://fluid-demo/models/bloom.

Note

Use ossutil to upload files. For more information, see Install ossutil.
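
The following commands are a sketch of the download and upload steps. They assume that Git LFS and ossutil are installed and configured, and they use the example bucket and path from this topic (fluid-demo/models/bloom); adjust the names for your environment:

# Download the model files from Hugging Face (requires git-lfs for the weight files).
git lfs install
git clone https://huggingface.co/bigscience/bloom-560m

# Recursively upload the files to OSS and record the resulting storage path.
ossutil cp -r ./bloom-560m oss://fluid-demo/models/bloom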

Create a namespace and configure OSS access

  1. Connect to the data plane cluster with kubectl. For more information, see Connect to an ACK cluster by using kubectl.

  2. Create a namespace for Fluid cache and inference workloads:

       kubectl create ns kserve-fluid-demo
  3. Create a file named oss-secret.yaml with the following content:

    Replace the following placeholders with actual values:

    • <your-access-key-id>: AccessKey ID of an Alibaba Cloud account that has access to OSS. Example: LTAI5tXxx.
    • <your-access-key-secret>: AccessKey secret of the account. Example: xXxXxXx.
       apiVersion: v1
       kind: Secret
       metadata:
         name: access-key
       stringData:
         fs.oss.accessKeyId: <your-access-key-id>       # AccessKey ID of an account with OSS access
         fs.oss.accessKeySecret: <your-access-key-secret> # AccessKey secret of the account
  4. Apply the Secret:

       kubectl apply -f oss-secret.yaml -n kserve-fluid-demo
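
You can optionally confirm that the Secret exists before the Dataset in the next section references it by name:

# Optional check: the Dataset's encryptOptions reference this Secret by the name access-key.
kubectl get secret access-key -n kserve-fluid-demo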

Declare the model data in Fluid

Submit a Dataset CR and a JindoRuntime CR to declare the model data. The Dataset describes the remote storage location, and the JindoRuntime configures the cache system, including the cache medium, capacity, and eviction thresholds.

  1. Create a file named oss-jindo.yaml with the following content.

    Replace the following placeholders with actual values. For a list of available endpoints, see Regions and endpoints.

    • <bucket>/<path>: OSS storage path of the model files. Example: fluid-demo/models/bloom.
    • <endpoint>: OSS endpoint for the region where the bucket resides. Example: oss-cn-hangzhou-internal.aliyuncs.com.

    The JindoRuntime configuration uses the following parameters:

    • replicas (2): number of cache worker replicas.
    • mediumtype (SSD): storage medium for the local cache.
    • quota (50Gi): maximum cache capacity per worker node.
    • high / low (0.95 / 0.7): cache eviction watermarks. Eviction starts when usage reaches high and stops at low.
    • cleanPolicy (OnDemand): clean-up policy for the FUSE sidecar.


       apiVersion: data.fluid.io/v1alpha1
       kind: Dataset
       metadata:
         name: oss-data
       spec:
         mounts:
         - mountPoint: "oss://<bucket>/<path>"  # Storage path of the model files
           name: bloom-560m
           path: /bloom-560m
           options:
             fs.oss.endpoint: "<endpoint>"      # OSS endpoint for your region
           encryptOptions:
             - name: fs.oss.accessKeyId
               valueFrom:
                 secretKeyRef:
                   name: access-key
                   key: fs.oss.accessKeyId
             - name: fs.oss.accessKeySecret
               valueFrom:
                 secretKeyRef:
                   name: access-key
                   key: fs.oss.accessKeySecret
         accessModes:
           - ReadOnlyMany
       ---
       apiVersion: data.fluid.io/v1alpha1
       kind: JindoRuntime
       metadata:
         name: oss-data
       spec:
         replicas: 2
         tieredstore:
           levels:
             - mediumtype: SSD
               volumeType: emptyDir
               path: /mnt/ssd0/cache
               quota: 50Gi
               high: "0.95"
               low: "0.7"
         fuse:
           properties:
             fs.jindofsx.data.cache.enable: "true"
           args:
             - -okernel_cache
             - -oro
             - -oattr_timeout=7200
             - -oentry_timeout=7200
             - -ometrics_port=9089
           cleanPolicy: OnDemand
  2. Deploy the Dataset and JindoRuntime:

       kubectl create -f oss-jindo.yaml -n kserve-fluid-demo
  3. Verify the deployment:

       kubectl get jindoruntime,dataset -n kserve-fluid-demo

    Expected output:

       NAME                                  MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
       jindoruntime.data.fluid.io/oss-data   Ready          Ready          Ready        3m
    
       NAME                             UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
       dataset.data.fluid.io/oss-data   3.14GiB          0.00B    100.00GiB        0.0%                Bound   3m

    The Dataset PHASE shows Bound and the JindoRuntime FUSE PHASE shows Ready, which confirms successful deployment.
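
At this point, Fluid has also created a PV and a PVC for the Dataset. The PVC has the same name as the Dataset and is what the InferenceService mounts in Step 3. You can optionally confirm that it is Bound:

# Optional check: a PVC named oss-data should exist in the namespace and be Bound.
kubectl get pvc oss-data -n kserve-fluid-demo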

Prefetch model data

Prefetching loads model files into the local cache before any inference pod starts, eliminating remote download latency during pod startup.

  1. Create a file named oss-dataload.yaml with the following content:

       apiVersion: data.fluid.io/v1alpha1
       kind: DataLoad
       metadata:
         name: oss-dataload
       spec:
         dataset:
           name: oss-data
           namespace: kserve-fluid-demo
         target:
           - path: /bloom-560m
             replicas: 2
  2. Deploy the DataLoad:

       kubectl create -f oss-dataload.yaml -n kserve-fluid-demo
  3. Check the prefetch progress:

       kubectl get dataload -n kserve-fluid-demo

    Expected output:

       NAME           DATASET    PHASE      AGE     DURATION
       oss-dataload   oss-data   Complete   1m      45s

    Wait until PHASE shows Complete before proceeding to the next step.
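
After the prefetch completes, you can optionally re-check the Dataset. The CACHED and CACHED PERCENTAGE columns should now reflect the model size instead of 0.00B and 0.0%:

# Optional check: CACHED should be close to UFS TOTAL SIZE (about 3.14GiB in this example).
kubectl get dataset oss-data -n kserve-fluid-demo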

Step 3: Deploy the inference service

  1. Create a file named oss-fluid-isvc.yaml with the InferenceService configuration that matches your cluster type.

    ACK cluster

       apiVersion: "serving.kserve.io/v1beta1"
       kind: "InferenceService"
       metadata:
         name: "fluid-bloom"
       spec:
         predictor:
           timeout: 600
           minReplicas: 0
           containers:
             - name: kserve-container
               image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu
               resources:
                 limits:
                   cpu: "12"
                   memory: 48Gi
                   nvidia.com/gpu: 1  # Number of GPUs required. Remove this line if GPUs are not used.
                 requests:
                   cpu: "12"
                   memory: 48Gi
               env:
                 - name: STORAGE_URI
                   value: "pvc://oss-data/bloom-560m"
                 - name: MODEL_NAME
                   value: "bloom"
                 - name: GPU_ENABLED
                   value: "True"  # Set to "False" if GPUs are not used.

    ACK Serverless cluster

       apiVersion: "serving.kserve.io/v1beta1"
       kind: "InferenceService"
       metadata:
         name: "fluid-bloom"
         labels:
           alibabacloud.com/fluid-sidecar-target: "eci"
         annotations:
           k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c16g1.4xlarge"  # ECS instance type
           knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.gn6i-c16g1.4xlarge"  # ECS instance type
       spec:
         predictor:
           timeout: 600
           minReplicas: 0
           containers:
             - name: kserve-container
               image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu
               resources:
                 limits:
                   cpu: "12"
                   memory: 48Gi
                 requests:
                   cpu: "12"
                   memory: 48Gi
               env:
                 - name: STORAGE_URI
                   value: "pvc://oss-data/bloom-560m"
                 - name: MODEL_NAME
                   value: "bloom"
                 - name: GPU_ENABLED
                   value: "True"  # Set to "False" if GPUs are not used.

    The InferenceService configuration uses the following parameters:

    • image (registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu): sample image that provides the model loading and inference interfaces. To customize it, see the KServe Fluid Docker samples.
    • STORAGE_URI (pvc://oss-data/bloom-560m): PVC-based storage URI that points to the Fluid-managed cache.
    • MODEL_NAME (bloom): model name used in the prediction API endpoint.
    • GPU_ENABLED (True): set to False if GPUs are not used.
    • resources (12 CPU, 48 GiB memory): resource allocation for the BLOOM-560m model. Adjust it based on your model size and cluster capacity.
  2. Deploy the inference service:

       kubectl create -f oss-fluid-isvc.yaml -n kserve-fluid-demo
  3. Verify the deployment:

       kubectl get inferenceservice -n kserve-fluid-demo

    Expected output:

       NAME          URL                                                READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION           AGE
       fluid-bloom   http://fluid-bloom.kserve-fluid-demo.example.com   True           100                              fluid-bloom-predictor-00001   2d

    READY shows True, which confirms the inference service is running.
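
Because minReplicas is set to 0, Knative may scale the predictor to zero replicas when the service is idle, so seeing no predictor pod in the namespace at this stage is normal. To watch pods being created when traffic arrives in the next step, you can keep the following command running in a separate terminal:

# Watch predictor pods scale from zero when a request arrives. Press Ctrl+C to stop.
kubectl get pods -n kserve-fluid-demo -w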

Step 4: Send a prediction request

  1. Get the ASM ingress gateway address.

    1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

    2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose ASM Gateways > Ingress Gateway.

    3. In the Service address section of ingressgateway, copy the service address.

  2. Send a test request. Replace <gateway-address> with the address from the previous step.

       curl -v \
         -H "Content-Type: application/json" \
         -H "Host: fluid-bloom.kserve-fluid-demo.example.com" \
         "http://<gateway-address>:80/v1/models/bloom:predict" \
         -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'

    Expected output:

       *   Trying xxx.xx.xx.xx:80...
       * Connected to xxx.xx.xx.xx (xxx.xx.xx.xx) port 80 (#0)
       > POST /v1/models/bloom:predict HTTP/1.1
       > Host: fluid-bloom-predictor.kserve-fluid-demo.example.com
       > Content-Type: application/json
       >
       < HTTP/1.1 200 OK
       < content-type: application/json
       < server: istio-envoy
       <
       {
         "result": "It was a dark and stormy night, and the wind was blowing in the\ndirection of the west. The wind was blowing in the direction of the\nwest, and the wind was blowing in the direction of the west. The\nwind was"
       }

    A 200 OK response with generated text confirms that the inference service is working.
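
To get a rough sense of the end-to-end latency, including any scale-from-zero delay, you can time a request with curl's built-in timing variables. The following command is a simple sketch; the result depends on your instance type and model size:

# Rough end-to-end timing of a single prediction request.
curl -o /dev/null -s -w "total time: %{time_total}s\n" \
  -H "Content-Type: application/json" \
  -H "Host: fluid-bloom.kserve-fluid-demo.example.com" \
  "http://<gateway-address>:80/v1/models/bloom:predict" \
  -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'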

Clean up

To remove all resources created in this tutorial, run the following commands in order:

kubectl delete inferenceservice fluid-bloom -n kserve-fluid-demo
kubectl delete dataload oss-dataload -n kserve-fluid-demo
kubectl delete dataset oss-data -n kserve-fluid-demo
kubectl delete jindoruntime oss-data -n kserve-fluid-demo
kubectl delete secret access-key -n kserve-fluid-demo
kubectl delete ns kserve-fluid-demo

What's next