KServe, formerly known as KFServing, is a model serving and inference engine for cloud-native environments that supports autoscaling, scale-to-zero, and canary deployments. Service Mesh (ASM) integrates with Knative Serving deployed in ACK or ACK Serverless clusters and provides a KServe on ASM feature for one-click KServe integration. Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios, such as big data and AI. You can combine Fluid with the KServe on ASM feature to accelerate model loading.
This topic walks through integrating KServe on ASM with Fluid to deploy an AI inference service backed by Object Storage Service (OSS).
How it works
This integration connects three components:
Fluid caches model files from an OSS bucket to local SSD storage across cluster nodes. A JindoRuntime custom resource (CR) manages the distributed cache, and a DataLoad CR prefetches data before any inference pod starts.
KServe on ASM deploys an InferenceService CR that references cached data through a persistent volume claim (PVC). The inference container loads the model from the Fluid-managed mount point instead of downloading it from OSS.
Service Mesh (ASM) routes traffic through an ingress gateway to the KServe inference service. Knative Serving handles autoscaling and scale-to-zero.
When a request arrives and Knative scales a pod from zero, the model is already cached locally. The pod starts and serves predictions without a remote download.
Prerequisites
Before you begin, make sure that you have:
An ASM instance of version 1.17 or later with a Kubernetes cluster added to it. For more information, see Create an ASM instance and Add a cluster to an ASM instance
One of the following Kubernetes clusters added to the ASM instance:
A Container Service for Kubernetes (ACK) cluster of version 1.22 or later. For GPU-based inference, the cluster must include GPU-accelerated nodes such as ecs.gn6i-c16g1.4xlarge. For more information, see Create an ACK managed cluster or Update an ACK cluster
An ACK Serverless cluster of version 1.18 or later with the CoreDNS component installed. For more information, see Create an ACK Serverless cluster and Manage system components
Kubernetes API access to Istio resources enabled for the ASM instance. For more information, see Use the Kubernetes API of clusters on the data plane to access Istio resources
An ASM ingress gateway named ingressgateway with ports 80 and 443 exposed. For more information, see Create an ingress gateway
The Knative Serving component deployed in the cluster with the Knative on ASM feature enabled. For more information, see Use Knative on ASM to deploy a serverless application
An active OSS subscription with at least one bucket created. For more information, see Activate OSS and Create buckets
To update an existing ASM instance, see Update an ASM instance.
Step 1: Enable KServe on ASM
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Ecosystem > KServe on ASM.
On the KServe on ASM page, configure the CertManager option and click Enable KServe on ASM. KServe depends on CertManager for certificate lifecycle management, and the component can be installed automatically together with KServe:
If CertManager is not installed in the cluster, turn on Automatically install the CertManager component in the cluster.
If CertManager is already installed in the cluster, turn off Automatically install the CertManager component in the cluster.
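After KServe on ASM is enabled, you can optionally verify that the CertManager workloads are up. This is a quick sanity check, assuming CertManager runs in its default cert-manager namespace:

```bash
# All pods should be Running before you proceed to Step 2.
kubectl get pods -n cert-manager
```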
Step 2: Install Fluid and set up model caching
Install the ack-fluid component
Deploy the ack-fluid component (version 0.9.10 or later) in the data plane cluster.
For an ACK cluster, install the cloud-native AI suite and deploy the ack-fluid component:
If the cloud-native AI suite is not installed, enable Fluid acceleration during installation. For more information, see Deploy the cloud-native AI suite.
If the cloud-native AI suite is already installed, log on to the Container Service for Kubernetes console, click the cluster, and choose Applications > Cloud-native AI Suite to deploy the ack-fluid component.
Uninstall open source Fluid before installing the ack-fluid component. The two cannot coexist.
For an ACK Serverless cluster, see the Deploy the control plane components of Fluid section of the Accelerate Jobs topic.
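After the ack-fluid component is deployed in either cluster type, you can optionally confirm that the Fluid control plane is running. This check assumes the default fluid-system namespace used by Fluid:

```bash
# The Fluid controller and webhook pods should be in the Running state.
kubectl get pods -n fluid-system
```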
Upload the AI model to OSS
This tutorial uses the BLOOM-560m model, an open source transformer-based LLM implemented in PyTorch.
Download the model files from Hugging Face.
Upload the files to your OSS bucket and record the storage path.
The path format is `oss://<bucket>/<path>`. For example, if the bucket is fluid-demo and the files are in the models/bloom directory, the storage path is `oss://fluid-demo/models/bloom`.
Use ossutil to upload files. For more information, see Install ossutil.
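For example, with ossutil installed and configured, the upload might look like the following. The local directory ./bloom-560m and the fluid-demo bucket are illustrative; substitute your own paths:

```bash
# Recursively upload the downloaded model files to the OSS path used in this tutorial.
ossutil cp -r ./bloom-560m oss://fluid-demo/models/bloom
```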
Create a namespace and configure OSS access
Connect to the data plane cluster with kubectl. For more information, see Connect to an ACK cluster by using kubectl.
Create a namespace for Fluid cache and inference workloads:
```bash
kubectl create ns kserve-fluid-demo
```

Create a file named `oss-secret.yaml` with the following content. Replace the following placeholders with actual values:

| Placeholder | Description | Example |
| --- | --- | --- |
| `<your-access-key-id>` | AccessKey ID of an Alibaba Cloud account with OSS access | LTAI5tXxx |
| `<your-access-key-secret>` | AccessKey secret of the account | xXxXxXx |

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: <your-access-key-id>         # AccessKey ID of an account with OSS access
  fs.oss.accessKeySecret: <your-access-key-secret> # AccessKey secret of the account
```

Apply the Secret:
```bash
kubectl apply -f oss-secret.yaml -n kserve-fluid-demo
```
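You can optionally confirm that the Secret exists before Fluid references it:

```bash
# The access-key Secret should be listed in the kserve-fluid-demo namespace.
kubectl get secret access-key -n kserve-fluid-demo
```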
Declare the model data in Fluid
Submit a Dataset CR and a JindoRuntime CR to declare the model data. The Dataset describes the remote storage location, and the JindoRuntime configures the cache system that backs it.
Create a file named `oss-jindo.yaml` with the following content. Replace the following placeholders with actual values. For a list of available endpoints, see Regions and endpoints.
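The manifest itself is not reproduced in this extract. The following is a minimal sketch of `oss-jindo.yaml`, reconstructed from the placeholder and parameter tables below and the standard Fluid Dataset and JindoRuntime schemas; treat it as a starting point and verify the field names against the Fluid documentation for your version:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: oss-data
spec:
  mounts:
    - mountPoint: oss://<bucket>/<path>   # OSS storage path of the model files
      name: bloom-560m                    # mounted under /bloom-560m in the dataset
      options:
        fs.oss.endpoint: <endpoint>       # OSS endpoint of the bucket's region
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: oss-data                          # must match the Dataset name
spec:
  replicas: 2                             # number of cache worker replicas
  tieredstore:
    levels:
      - mediumtype: SSD                   # storage medium for the local cache
        path: /mnt/ssd0/cache             # illustrative cache directory on the node
        quota: 50Gi                       # maximum cache capacity per worker node
        high: "0.95"                      # watermark that triggers eviction
        low: "0.7"                        # eviction target watermark
  fuse:
    cleanPolicy: OnDemand                 # FUSE sidecar clean-up policy
```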
| Placeholder | Description | Example |
| --- | --- | --- |
| `<bucket>/<path>` | OSS storage path of the model files | fluid-demo/models/bloom |
| `<endpoint>` | OSS endpoint for the region where the bucket resides | oss-cn-hangzhou-internal.aliyuncs.com |

The following table describes the JindoRuntime configuration parameters:
| Parameter | Value | Description |
| --- | --- | --- |
| `replicas` | 2 | Number of cache worker replicas |
| `mediumtype` | SSD | Storage medium for the local cache |
| `quota` | 50Gi | Maximum cache capacity per worker node |
| `high`/`low` | 0.95 / 0.7 | Cache eviction watermarks (`high` triggers eviction, `low` is the target) |
| `cleanPolicy` | OnDemand | FUSE sidecar clean-up policy |

Deploy the Dataset and JindoRuntime:
```bash
kubectl create -f oss-jindo.yaml -n kserve-fluid-demo
```

Verify the deployment:
```bash
kubectl get jindoruntime,dataset -n kserve-fluid-demo
```

Expected output:
```
NAME                                  MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
jindoruntime.data.fluid.io/oss-data   Ready          Ready          Ready        3m

NAME                             UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/oss-data   3.14GiB          0.00B    100.00GiB        0.0%                Bound   3m
```

The Dataset `PHASE` shows `Bound` and the JindoRuntime `FUSE PHASE` shows `Ready`, which confirms successful deployment.
Prefetch model data
Prefetching loads model files into the local cache before any inference pod starts, eliminating remote download latency during pod startup.
Create a file named `oss-dataload.yaml` with the following content:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: oss-dataload
spec:
  dataset:
    name: oss-data
    namespace: kserve-fluid-demo
  target:
    - path: /bloom-560m
      replicas: 2
```

Deploy the DataLoad:
```bash
kubectl create -f oss-dataload.yaml -n kserve-fluid-demo
```

Check the prefetch progress:
```bash
kubectl get dataload -n kserve-fluid-demo
```

Expected output:
```
NAME           DATASET    PHASE      AGE   DURATION
oss-dataload   oss-data   Complete   1m    45s
```

Wait until `PHASE` shows `Complete` before proceeding to the next step.
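After the prefetch completes, the Dataset status should reflect the warm cache; the CACHED and CACHED PERCENTAGE columns from the earlier verification step should now be close to the model size:

```bash
# CACHED should have grown from 0.00B toward the UFS TOTAL SIZE.
kubectl get dataset oss-data -n kserve-fluid-demo
```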
Step 3: Deploy the inference service
Create a file named `oss-fluid-isvc.yaml` with the InferenceService configuration that matches your cluster type.

ACK cluster
apiVersion: "serving.kserve.io/v1beta1" kind: "InferenceService" metadata: name: "fluid-bloom" spec: predictor: timeout: 600 minReplicas: 0 containers: - name: kserve-container image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu resources: limits: cpu: "12" memory: 48Gi nvidia.com/gpu: 1 # Number of GPUs required. Remove this line if GPUs are not used. requests: cpu: "12" memory: 48Gi env: - name: STORAGE_URI value: "pvc://oss-data/bloom-560m" - name: MODEL_NAME value: "bloom" - name: GPU_ENABLED value: "True" # Set to "False" if GPUs are not used.ACK Serverless cluster
apiVersion: "serving.kserve.io/v1beta1" kind: "InferenceService" metadata: name: "fluid-bloom" labels: alibabacloud.com/fluid-sidecar-target: "eci" annotations: k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c16g1.4xlarge" # ECS instance type knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.gn6i-c16g1.4xlarge" # ECS instance type spec: predictor: timeout: 600 minReplicas: 0 containers: - name: kserve-container image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu resources: limits: cpu: "12" memory: 48Gi requests: cpu: "12" memory: 48Gi env: - name: STORAGE_URI value: "pvc://oss-data/bloom-560m" - name: MODEL_NAME value: "bloom" - name: GPU_ENABLED value: "True" # Set to "False" if GPUs are not used.The following table describes the InferenceService parameters:
| Parameter | Value | Description |
| --- | --- | --- |
| `image` | registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu | Sample image with model loading and inference interfaces. To customize, see the KServe Fluid Docker samples |
| `STORAGE_URI` | pvc://oss-data/bloom-560m | PVC-based storage URI that points to the Fluid-managed cache |
| `MODEL_NAME` | bloom | Model name used in the prediction API endpoint |
| `GPU_ENABLED` | True | Set to `False` if GPUs are not used |
| `resources` | 12 CPU, 48 GiB memory | Resource allocation for the BLOOM-560m model. Adjust based on your model size and cluster capacity |

Deploy the inference service:
```bash
kubectl create -f oss-fluid-isvc.yaml -n kserve-fluid-demo
```

Verify the deployment:
```bash
kubectl get inferenceservice -n kserve-fluid-demo
```

Expected output:
```
NAME          URL                                                READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION           AGE
fluid-bloom   http://fluid-bloom.kserve-fluid-demo.example.com   True           100                              fluid-bloom-predictor-00001   2d
```

`READY` shows `True`, which confirms that the inference service is running.
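The `pvc://oss-data/bloom-560m` URI works because Fluid exposes a bound Dataset as a PersistentVolumeClaim with the same name in the Dataset's namespace. If the InferenceService fails to become ready, a quick check, assuming that convention, is to confirm that the PVC exists:

```bash
# A Bound PVC named oss-data should be listed.
kubectl get pvc oss-data -n kserve-fluid-demo
```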
Step 4: Send a prediction request
Get the ASM ingress gateway address.
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose ASM Gateways > Ingress Gateway.
In the Service address section of ingressgateway, copy the service address.
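Alternatively, you can read the address from the data plane cluster with kubectl. This sketch assumes the ingress gateway is exposed as a LoadBalancer Service named istio-ingressgateway in the istio-system namespace, which is the typical layout for an ASM ingress gateway:

```bash
# Prints the external IP of the ingress gateway.
kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```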
Send a test request. Replace `<gateway-address>` with the address from the previous step.

```bash
curl -v \
  -H "Content-Type: application/json" \
  -H "Host: fluid-bloom.kserve-fluid-demo.example.com" \
  "http://<gateway-address>:80/v1/models/bloom:predict" \
  -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'
```

Expected output:
```
*   Trying xxx.xx.xx.xx:80...
* Connected to xxx.xx.xx.xx (xxx.xx.xx.xx) port 80 (#0)
> POST /v1/models/bloom:predict HTTP/1.1
> Host: fluid-bloom-predictor.kserve-fluid-demo.example.com
> Content-Type: application/json
>
< HTTP/1.1 200 OK
< content-type: application/json
< server: istio-envoy
<
{
  "result": "It was a dark and stormy night, and the wind was blowing in the\ndirection of the west. The wind was blowing in the direction of the\nwest, and the wind was blowing in the direction of the west. The\nwind was"
}
```

A `200 OK` response with generated text confirms that the inference service is working.
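Because `minReplicas` is 0, Knative scales the predictor to zero after an idle period. To observe the cached cold start described in How it works, watch the pods in a second terminal while you resend the request after the service has scaled down:

```bash
# A predictor pod should appear and become Ready without first
# downloading the model from OSS, because the data is already cached.
kubectl get pods -n kserve-fluid-demo -w
```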
Clean up
To remove all resources created in this tutorial, run the following commands in order:
```bash
kubectl delete inferenceservice fluid-bloom -n kserve-fluid-demo
kubectl delete dataload oss-dataload -n kserve-fluid-demo
kubectl delete dataset oss-data -n kserve-fluid-demo
kubectl delete jindoruntime oss-data -n kserve-fluid-demo
kubectl delete secret access-key -n kserve-fluid-demo
kubectl delete ns kserve-fluid-demo
```