
KServe + Fluid Accelerates Big Model Inference

This article explores how to implement KServe big model inference on Alibaba Cloud Container Service for Kubernetes (ACK).

By Chilin Huang, Luying and Cheyang


KServe is a standardized model inference platform designed for highly scalable scenarios on Kubernetes. It supports serverless inference workloads for machine learning (ML) models on any framework. KServe offers high-performance, abstract interfaces for popular ML frameworks such as TensorFlow, XGBoost, Scikit-learn, PyTorch, and ONNX, making it ideal for production model serving. It hides the complexity of auto-scaling, networking, health checks, and service configuration, and supports GPU auto-scaling, scale-to-zero, and canary releases. KServe provides a plug-and-play solution for production-level machine learning services, encompassing prediction, pre-processing, post-processing, and interpretability capabilities.
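To give a concrete feel for this interface, the following is a minimal InferenceService manifest for a scikit-learn model, adapted from the KServe getting-started examples (the storage URI points to KServe's public sample model; field names may vary slightly across KServe versions):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # Public sample model from the KServe documentation
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

Applying a manifest like this is all that is needed; KServe provisions the serving runtime, networking, and auto-scaling behind the scenes.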


Artificial intelligence generated content (AIGC) and large language models (LLMs) have gained significant prominence in the past year, further raising public expectations for AI. To generate new business value, an increasing number of companies are turning to KServe for the following reasons:

  1. Distributed processing: LLMs have a massive number of parameters and require extensive computing resources. To address this challenge, the KServe platform provides distributed processing capabilities, enabling parallel computing across multiple nodes and accelerating the overall computational process.
  2. Serverless: KServe follows a serverless model and scales automatically as demand changes. This makes the deployment of large language models more flexible and efficient and can significantly improve the response speed of the service.
  3. Unified deployment: The KServe platform offers users a unified and simplified approach to deploy and manage large language models. This eliminates the need for complex algorithm environment setup and configuration, enabling users to begin model training and prediction work seamlessly.
  4. Monitoring and management: The KServe platform encompasses comprehensive monitoring and management functionalities. Users can gain clear insights into model performance and runtime status, facilitating timely parameter adjustments and issue resolution, ensuring optimal model efficiency and reliability.

However, in real-world production scenarios, KServe encounters challenges in supporting LLMs. The main problems are as follows:

  1. Long model startup time. LLMs consist of an extensive parameter set, which can reach sizes of hundreds of gigabytes. Consequently, pulling these models into GPU memory is time-consuming and results in long startup times. Additionally, KServe's model retrieval from remote storage to local storage via the storage initializer also contributes to extensive delays, impacting the KServe serverless auto-scaling feature based on traffic.
  2. Long container image pull time. The runtime of LLMs depends on the GPU base environment, resulting in large container images. This leads to long pull times and slows down application startup.
  3. Inefficient model update process and high complexity. LLMs consist of multiple files, so updating a model may only require partial updates or additions of specific files. However, KServe must restart the container and re-pull the entire model, which means it cannot support hot upgrades of models. This leads to low efficiency and high complexity.

During KubeCon 2023, KServe mentioned that Fluid could potentially address its elasticity challenges. Fluid is an open-source, Kubernetes-native distributed dataset orchestration and acceleration engine. It primarily caters to data-intensive applications in cloud-native scenarios, such as big data and AI applications. For more details, please refer to the Overview of Fluid Data Acceleration [1].


The Alibaba Cloud Container Service team has collaborated with the KServe and Fluid communities to explore a simple, convenient, high-performance, production-level way to support LLMs on the Alibaba Cloud Serverless Kubernetes platform. The key capabilities include:

  1. Hosted services and product support. Alibaba Cloud Service Mesh (ASM) provides native support for KServe, ensuring the stability of the underlying Istio. By using the hosted service mesh, KServe achieves high availability and built-in security, relieving you of operational and maintenance tasks. This allows you to focus more on LLM-related work. Additionally, KServe and Fluid can be installed with one click.
  2. Optimized usage mode with community participation. The latest version of KServe supports not only the storage initializer but also the standard Persistent Volume Claim (PVC) mode. This reduces the risk of insufficient storage and improves startup speed by avoiding the need to pull models from remote storage to local devices. Furthermore, it supports hot upgrades of models, enhancing flexibility.
  3. Accelerated model loading through elastic distributed caching. The combination of Fluid and KServe enables you to preload data into a distributed cache, reducing 80% of the pod startup time. This combination also supports hot upgrades of models without requiring container restarts.
  4. These capabilities are fully realized on the Alibaba Cloud Serverless Container Service for Kubernetes (ASK). KServe with Fluid is suitable for auto-scaling and scaling to zero based on GPU serverless requests.


Everything is ready. Let's explore KServe big model inference in Alibaba Cloud Container Service for Kubernetes (ACK).

Prerequisites



  • An ACK cluster is created. The Kubernetes version is 1.18 or later. For more information, see Create a Managed Kubernetes Cluster [2].

The ACK cluster used in this example contains three ecs.g7.xlarge ECS instances and one ecs.g7.2xlarge ECS instance. You can add the three ecs.g7.xlarge ECS instances when creating the cluster. After the cluster is created, create a node pool to add the ecs.g7.2xlarge ECS instance. For details about how to create a node pool, see Create a Node Pool [3].

  • An ASM instance of version 1.17 or later is created, and the preceding ACK cluster is added to the instance. For more information, see Create an ASM Instance [4] and Add a Cluster to an ASM Instance [5].


Use the following configuration when you create the ASM instance:

  • Choose an Enterprise instance to enable the KubeAPI of the cluster on the data plane to access Istio resources.
  • Istio must be v1.17 or later.
  • The region and private network of the mesh must be consistent with those of the created Kubernetes cluster to ensure smooth network communication.
  • You can choose whether to select "Expose API Server with EIP" as required. If enabled, an EIP is created and bound to the internal-facing SLB instance, allowing you to access the API server of the ASM instance over the Internet and operate on Istio CRs.
  • Observability and mesh audit depend on Log Service and Alibaba Cloud Prometheus Monitoring Service. If these dependent services have not been activated, you can leave them unselected.
  • [Important] You must select "Enable the KubeAPI of the cluster on the data plane to access Istio resources".
  • The local domain name of the cluster must be the same as the local domain name of your Kubernetes cluster.
  • The ASM instance is enabled to access Istio resources by using the KubeAPI of clusters on the data plane. For more information, see Access Istio Resource Through the KubeAPI of Clusters on the Data Plane[6].
  • An ingress gateway is added to the cluster. In this example, an ASM ingress gateway is used as the cluster gateway. The ASM ingress gateway name is ingressgateway by default, and ports 80 and 443 are enabled. For more information, see Create an Ingress Gateway Service [7].
  • The Knative Serving component is deployed in an ACK or ASK cluster, and the Knative on ASM feature is enabled. For more information, see Preconditions and Step 1 in Deploy Serverless Applications by Using Knative on ASM [8].

Step 1: Enable KServe on ASM

  1. Log on to the ASM console [9]. In the left-side navigation pane, click Service Mesh > Mesh Management.
  2. On the ASM management page, click the name of the ASM instance you want to manage. In the left-side navigation pane, click Eco-Integration Center > KServe on ASM.
  3. On the KServe on ASM page, click Enable KServe on ASM.

Note: KServe on ASM relies on the deployment and usage of the cert-manager component. cert-manager is a certificate lifecycle management system that facilitates certificate issuance and deployment. If cert-manager is not installed in the cluster, you need to enable the Automatically Install CertManager in the Cluster option on the KServe on ASM page to install cert-manager automatically. If cert-manager is already installed, disable this option before clicking Enable KServe on ASM.
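To check whether cert-manager is already installed before enabling KServe on ASM, you can look for its pods (assuming cert-manager was deployed in its default namespace):

```shell
# List cert-manager pods; if the namespace does not exist or no pods are
# returned, cert-manager is not installed yet.
kubectl get pods -n cert-manager
```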

Step 2: Install ACK-Fluid and Enable AI Model Caching Acceleration

1.  Deploy ack-fluid in your ACK or ASK cluster and make sure that the version of ack-fluid is 0.9.10 or later.

Note: If your cluster is an ACK cluster on the data plane, you must install the cloud-native AI suite and deploy ack-fluid in the ACK cluster. Reference: Accelerate Online Application Data Access [10].

If your cluster on the data plane is an ASK cluster, you must deploy ack-fluid in the ASK cluster. For more information, see Accelerate Job Application Data Access [11].

2.  Prepare an AI model and upload it to an OSS bucket.

a) Prepare the trained AI model data. This article uses BLOOM, an open-source transformer language model based on PyTorch, as an example. You can obtain the model data from the Hugging Face community: https://huggingface.co/bigscience/bloom-560m/tree/main

b) Upload the downloaded model data files to an OSS bucket and record their storage location, which is in the format oss://{bucket}/{path}. For example, if you create a bucket named fluid-demo and upload all model data files to the models/bloom directory in the bucket, the storage location of the model data files is oss://fluid-demo/models/bloom.

Note: You can use ossutil, a client tool provided by OSS, to upload data. For more information, see Install Ossutil [12].
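For example, after configuring ossutil with your AccessKey, uploading the downloaded model directory to the example bucket and path used in this article might look like this (bucket name and path are the example values from above):

```shell
# Recursively upload the local bloom-560m model directory to OSS
ossutil cp -r ./bloom-560m oss://fluid-demo/models/bloom/
```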

3.  Create a namespace for deploying Fluid and AI services, and configure OSS access permissions.

a) Use kubectl to connect to the ACK/ASK cluster on the data plane. For more information, see Use kubectl to Connect to a Kubernetes Cluster [13].

b) Use kubectl to create a namespace to deploy the Fluid and KServe AI services. In this example, a kserve-fluid-demo namespace is used.

kubectl create ns kserve-fluid-demo

c) Use kubectl to add eci labels to the namespace to schedule pods in the namespace to virtual nodes.

kubectl label namespace kserve-fluid-demo alibabacloud.com/eci=true

d) Create an oss-secret.yaml file with the following content. fs.oss.accessKeyId and fs.oss.accessKeySecret respectively represent the AccessKey ID and AccessKey secret that are used to access OSS.

apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: xxx # Replace the value with the Alibaba Cloud AccessKey ID that is used to access OSS.
  fs.oss.accessKeySecret: xxx # Replace the value with the Alibaba Cloud AccessKey secret that is used to access OSS.
e) Run the following command to deploy the Secret and configure the OSS AccessKey:

kubectl apply -f oss-secret.yaml -n kserve-fluid-demo

4.  Declare the AI model data to be accessed in Fluid. You need to submit a Dataset CR and a JindoRuntime CR. The Dataset CR describes the URL of the data in the external storage system, and the JindoRuntime CR describes the cache system and its specific configuration.

a) Save the following content as an oss-jindo.yaml file.

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: oss-data
spec:
  mounts:
    - mountPoint: "oss://{bucket}/{path}" # Replace it with the location where the model data file is saved.
      name: bloom-560m
      path: /bloom-560m
      options:
        fs.oss.endpoint: "{endpoint}"  # Replace it with the actual OSS endpoint.
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeySecret
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: oss-data
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: emptyDir
        path: /mnt/ssd0/cache
        quota: 50Gi
        high: "0.95"
        low: "0.7"
  fuse:
    args:
      - -ometrics_port=-1
  master:
    nodeSelector:
      node.kubernetes.io/instance-type: ecs.g7.xlarge
  worker:
    nodeSelector:
      node.kubernetes.io/instance-type: ecs.g7.xlarge

Note: You need to replace oss://{bucket}/{path} in the Dataset CR with the storage location of the model data files recorded above, and replace {endpoint} with the endpoint of OSS. For more information about how to obtain the endpoints of OSS in different regions, see Access Domains and Data Centers [14].

b) Run the following command to deploy the Dataset and JindoRuntime CRs:

kubectl create -f oss-jindo.yaml -n kserve-fluid-demo

c) Run the following command to view the deployment status of the Dataset and JindoRuntime:

kubectl get jindoruntime,dataset -n kserve-fluid-demo

Expected output:

NAME                                  MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
jindoruntime.data.fluid.io/oss-data   Ready          Ready          Ready        3m

NAME                             UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/oss-data   3.14GiB          0.00B    100.00GiB        0.0%                Bound   3m

The output shows that the PHASE of the Dataset is Bound and the FUSE PHASE of the JindoRuntime is Ready, indicating that both the Dataset and the JindoRuntime have been deployed successfully.
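You can also inspect the Dataset's cache state directly. For example, the cached percentage is exposed in the Dataset status (a sketch, assuming the standard Fluid status fields):

```shell
# Print how much of the dataset is currently cached
kubectl get dataset oss-data -n kserve-fluid-demo \
  -o jsonpath='{.status.cacheStates.cachedPercentage}'
```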

5.  Prefetch data in Fluid to improve data access performance.

a) Save the following content as an oss-dataload.yaml file.

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: oss-dataload
spec:
  dataset:
    name: oss-data
    namespace: kserve-fluid-demo
  target:
    - path: /bloom-560m
      replicas: 2

b) Run the following command to deploy Dataload for data prefetching:

kubectl create -f oss-dataload.yaml -n kserve-fluid-demo

c) Run the following command to query the progress of data prefetching:

kubectl get dataload -n kserve-fluid-demo

Expected output:

NAME           DATASET    PHASE      AGE     DURATION
oss-dataload   oss-data   Complete   1m      45s

The output shows that the data prefetching took 45 seconds. Prefetching may take a while to complete.
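If prefetching has not finished yet, you can watch the DataLoad object until its PHASE turns Complete:

```shell
# Watch the DataLoad object; press Ctrl+C to stop once PHASE shows Complete
kubectl get dataload -n kserve-fluid-demo -w
```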

Step 3: Deploy the AI Model Inference Service

1.  Save the following content as an oss-fluid-isvc.yaml file.

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "fluid-bloom"
spec:
  predictor:
    timeout: 600
    minReplicas: 0
    nodeSelector:
      node.kubernetes.io/instance-type: ecs.g7.2xlarge
    containers:
      - name: kserve-container
        image: cheyang/kserve-fluid:bloom-gpu
        resources:
          limits:
            cpu: "3"
            memory: 8Gi
          requests:
            cpu: "3"
            memory: 8Gi
        env:
          - name: STORAGE_URI
            value: "pvc://oss-data/bloom-560m"
          - name: MODEL_NAME
            value: "bloom"
            # Set this parameter to True if GPU is used. Otherwise, set this parameter to False.
          - name: GPU_ENABLED
            value: "False"

Note: In this example, the cheyang/kserve-fluid:bloom-gpu sample image is used in the image field in the InferenceService configuration. This image provides an interface for loading models and inference services. You can find the code of this sample image in the KServe open source community and customize the image: https://github.com/kserve/kserve/tree/master/docs/samples/fluid/docker
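The actual serving code lives in the sample image linked above. As a rough illustration of the contract between the InferenceService spec and the container, the sketch below shows how such an image might read its configuration (the MODEL_PATH variable and its default mount path are assumptions for illustration, not the sample image's exact code):

```python
import os

def load_config():
    """Read the serving configuration injected via the InferenceService env vars."""
    return {
        # MODEL_NAME and GPU_ENABLED are set in the InferenceService spec above.
        "model_name": os.environ.get("MODEL_NAME", "bloom"),
        "gpu_enabled": os.environ.get("GPU_ENABLED", "False").lower() == "true",
        # The PVC referenced by STORAGE_URI (pvc://oss-data/bloom-560m) is mounted
        # into the container; the exact mount path here is an assumption.
        "model_path": os.environ.get("MODEL_PATH", "/mnt/pvc/bloom-560m"),
    }

if __name__ == "__main__":
    config = load_config()
    # A real server would now load the model from config["model_path"] onto
    # CPU or GPU and expose a /v1/models/{model_name}:predict endpoint.
    print(config["model_name"], config["gpu_enabled"])
```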

2.  Run the following command to deploy the InferenceService AI model inference service:

kubectl create -f oss-fluid-isvc.yaml -n kserve-fluid-demo

3.  Run the following command to view the deployment status of the AI model inference service.

kubectl get inferenceservice -n kserve-fluid-demo

Expected output:

NAME          URL                                                READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION           AGE
fluid-bloom   http://fluid-bloom.kserve-fluid-demo.example.com   True           100                              fluid-bloom-predictor-00001   2d

The expected output shows that the READY field is True, which indicates that the AI model inference service is ready.

Step 4: Access the AI Model Inference Service

1.  Obtain the ASM ingress gateway address.

a) Log on to the ASM console. In the left-side navigation pane, click Service Mesh > Mesh Management.

b) On the Mesh Management page, click the name of the target instance, and click ASM Gateways > Ingress Gateway in the left-side navigation pane.

c) On the Ingress Gateway page, find the ASM ingress gateway named ingressgateway. In the Service Address section, view and obtain the service address of the ASM gateway.

2.  Access the sample of AI model inference service

Run the following command to access the sample AI model inference service bloom. Replace {ASM gateway service address} with the ASM ingress gateway address you obtained.

curl -v -H "Content-Type: application/json" -H "Host: fluid-bloom.kserve-fluid-demo.example.com" "http://{ASM gateway service address}:80/v1/models/bloom:predict" -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'

Expected output:

*   Trying xxx.xx.xx.xx:80...
* Connected to xxx.xx.xx.xx (xxx.xx.xx.xx) port 80 (#0)
> POST /v1/models/bloom:predict HTTP/1.1
> Host: fluid-bloom-predictor.kserve-fluid-demo.example.com
> User-Agent: curl/7.84.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 65
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 227
< content-type: application/json
< date: Thu, 20 Apr 2023 09:49:00 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 1142
<
{"result": "It was a dark and stormy night, and the wind was blowing in the\ndirection of the west. The wind was blowing in the direction of the\nwest, and the wind was blowing in the direction of the west. The\nwind was"}
* Connection #0 to host xxx.xx.xx.xx left intact

As the expected output shows, the AI model inference service continued the sample prompt and returned the inference result.

Performance Benchmark Test

Our performance benchmark compares the scale-out time of KServe inference services that use the OSS storage initializer with those that use Fluid, under models of different sizes. That is, we measure the time required for the service to scale from 0 to 1 replicas.

The node types we selected are:

  • Fluid cache nodes: ecs.g7.xlarge
  • Inference service nodes: ecs.g7.2xlarge

The Fluid Runtime used is:

  • JindoFS

Other prerequisites:

  • The OSS bucket and the ACK cluster are in the same region.
  • Data prefetching has been completed in advance.
  • Container images are cached on the nodes of the ACK cluster.

Performance Testing command:

# Total time includes: pod initialization and running + download model (storage initializer) + load model + inference + network
curl --connect-timeout 3600 --max-time 3600 -o /dev/null -s -w 'Total: %{time_total}s\n' -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bloom:predict" -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'
Test results:

Model Name                   Model Size (Snapshot)   Machine Type                            KServe + Storage Initializer                             KServe + Fluid
bigscience/bloom-560m [15]   3.14GB                  ecs.g7.2xlarge (cpu: 8, memory: 32G)    total: 58.031551s (download: 33.866s, load: 5.016s)      total: 8.488353s (load: 2.349s, 2 workers)
bigscience/bloom-7b1 [16]    26.35GB                 ecs.g7.4xlarge (cpu: 16, memory: 64G)   total: 329.019987s (download: 228.440s, load: 71.964s)   total: 27.800123s (load: 12.084s, 3 workers)

Total: the response time to scale out from 0 when no service instance is available, including container scheduling, container startup, model download, model loading into GPU memory, model inference, and network latency.


In the context of LLMs, it is evident that:

  1. Fluid significantly improves the cold start speed of KServe. The optimization of startup time becomes more pronounced as the size of the model increases.
  2. Fluid not only greatly reduces the time required for downloading the model to the disk during Storage Initializer initialization, but it also overcomes the bandwidth limitation of the disk by increasing the cache worker bandwidth. In this example, the time taken to read the model from the disk to memory can be reduced by half or even two-thirds.

This greatly improves the elastic scaling capability of KServe in container scenarios.

Summary and Outlook

While there is still a long way to go in improving and optimizing the support for large models through existing cloud-native AI frameworks, progress can be made by continuous effort. The Alibaba Cloud Container Service team is committed to collaborating with community partners to explore solutions that enable better support for LLM inference scenarios at a lower cost. Our aim is to provide standard, open solutions and product-based capabilities. In the future, we will introduce methods to control costs, such as leveraging the rule of computing elastic scaling to trigger the elastic scaling of the data cache. Additionally, we will provide a hot update method for large models.

Related Links

[1] Overview of Fluid Data Acceleration
[2] Create a Managed Kubernetes Cluster
[3] Create a Node Pool
[4] Create an ASM Instance
[5] Add a Cluster to an ASM Instance
[6] Access Istio Resources Through the KubeAPI of Clusters on the Data Plane
[7] Create an Ingress Gateway Service
[8] Deploy Serverless Applications by Using Knative on ASM
[9] ASM Console
[10] Accelerate Online Application Data Access
[11] Accelerate Job Application Data Access
[12] Install Ossutil
[13] Use kubectl to Connect to a Kubernetes Cluster
[14] Access Domains and Data Centers
[15] bigscience/bloom-560m
[16] bigscience/bloom-7b1
