Container Service for Kubernetes: Accelerate models using Fluid on KServe

Last Updated: Mar 26, 2026

Large language models (LLMs) can easily reach tens of gigabytes, causing slow cold starts and high restart latency when model files are pulled from Object Storage Service (OSS). Fluid solves this by caching model files locally on cluster nodes using JindoRuntime. After the first load, the inference service reads model files from the local JindoFS memory cache instead of pulling them from OSS, so subsequent restarts are significantly faster. This topic shows how to set up Fluid-based caching for a Qwen-7B-Chat-Int8 model and deploy it as a KServe inference service backed by vLLM on NVIDIA V100 GPUs.

Prerequisites

Before you begin, make sure you have:

How it works

Fluid introduces two Kubernetes custom resources that work together:

  • Dataset — declares which remote storage path to expose: in this case, an OSS bucket path containing the model files.

  • JindoRuntime — starts a JindoFS cluster that caches the dataset contents in memory on cluster nodes, so subsequent reads are served locally instead of from OSS.

When the KServe inference service mounts the dataset, the vLLM server reads model files from the local JindoFS cache rather than pulling them from OSS on each start.

Step 1: Prepare model data and upload it to OSS

Download the Qwen-7B-Chat-Int8 model

  1. Install Git and the Git Large File Storage (LFS) extension:

    sudo yum install git
    sudo yum install git-lfs
  2. Clone the Qwen-7B-Chat-Int8 repository from ModelScope, skipping LFS downloads during clone:

    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat-Int8.git
  3. Go to the cloned directory, then pull the LFS-managed model files:

    cd Qwen-7B-Chat-Int8
    git lfs pull
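Because the clone skips LFS downloads, any file that git lfs pull did not fetch stays a small text pointer instead of model weights. The following is a minimal sketch for spotting such files before the upload. The function names are hypothetical, and the check assumes the standard Git LFS pointer format, whose first line is "version https://git-lfs.github.com/spec/v1".

```shell
# Hypothetical helper: succeeds if the file is still a Git LFS pointer stub.
# It compares the first 42 bytes with the standard pointer header line.
is_lfs_pointer() {
  head -c 42 "$1" 2>/dev/null | grep -Fxq 'version https://git-lfs.github.com/spec/v1'
}

# Hypothetical helper: print every pointer stub under a directory,
# skipping the repository's .git metadata.
list_lfs_pointers() {
  find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      echo "$f"
    fi
  done
}

# Usage (run from the parent of the cloned directory):
#   list_lfs_pointers ./Qwen-7B-Chat-Int8
```

If the command prints nothing, no pointer stubs remain. Any file it lists should be fetched again with git lfs pull before you upload the model.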

Upload the model to OSS

  1. Log in to the OSS console and record the name of your OSS bucket. To create a bucket, see Create a bucket.

  2. Install and configure ossutil. See Install ossutil.

  3. Create a directory in the bucket and upload the model files:

    ossutil mkdir oss://<your-bucket-name>/Qwen-7B-Chat-Int8
    ossutil cp -r ./Qwen-7B-Chat-Int8 oss://<your-bucket-name>/Qwen-7B-Chat-Int8

Step 2: Create a dataset and a JindoRuntime

Create a Secret for OSS credentials

Create a Kubernetes Secret to store the AccessKey pair used to access the OSS bucket:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  fs.oss.accessKeyId: <your-access-key-id>
  fs.oss.accessKeySecret: <your-access-key-secret>
EOF

Replace <your-access-key-id> and <your-access-key-secret> with your actual credentials. To get an AccessKey pair, see Obtain an AccessKey pair.

Expected output:

secret/oss-secret created

Create the dataset and JindoRuntime

Create a file named resource.yaml with the following content. For configuration details, see Use JindoFS to accelerate access to OSS.

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen-7b-chat-int8
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/Qwen-7B-Chat-Int8  # Replace with your actual OSS path (case-sensitive; must match the upload path in Step 1)
      options:
        fs.oss.endpoint: <oss_endpoint>                  # Replace with your OSS bucket endpoint
      name: models
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen-7b-chat-int8  # Must match the dataset name
spec:
  replicas: 3
  tieredstore:
    levels:
      - mediumtype: MEM       # Cache in memory
        volumeType: emptyDir
        path: /dev/shm
        quota: 3Gi            # Cache capacity per replica
        high: "0.95"
        low: "0.7"
  fuse:
    resources:
      requests:
        memory: 2Gi
    properties:
      fs.oss.download.thread.concurrency: "200"
      fs.oss.read.buffer.size: "8388608"
      fs.oss.read.readahead.max.buffer.count: "200"
      fs.oss.read.sequence.ambiguity.range: "2147483647"

Apply the configuration:

kubectl apply -f resource.yaml

Expected output:

dataset.data.fluid.io/qwen-7b-chat-int8 created
jindoruntime.data.fluid.io/qwen-7b-chat-int8 created
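Whether the whole model fits in the cache depends on the total cache capacity. The sketch below works through that arithmetic under one assumption: that the total capacity is the per-replica quota multiplied by the number of worker replicas. Both function names are hypothetical helpers, not part of Fluid.

```shell
# Hypothetical helper: total cache capacity in GiB, assuming Fluid sums the
# per-replica quota over all JindoRuntime worker replicas.
total_cache_gib() {
  echo $(( $1 * $2 ))
}

# Hypothetical helper: smallest whole-GiB quota per replica that lets the
# cache hold a dataset of the given size (the size may be fractional).
min_quota_gib() {
  awk -v s="$1" -v r="$2" 'BEGIN {
    q = s / r
    c = int(q)
    if (c < q) c++
    print c
  }'
}

total_cache_gib 3 3      # replicas and quota from resource.yaml, prints 9
min_quota_gib 17.01 3    # 17.01 GiB spread over 3 replicas, prints 6
```

Under this assumption, a dataset larger than the total capacity cannot be cached in full, so size quota and replicas against the model size before you apply the configuration.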

Step 3: Deploy a vLLM inference service

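Deploy the Qwen-7B-Chat-Int8 model as a KServe inference service using vLLM. The --data flag mounts the persistent volume claim (PVC) that Fluid creates for the dataset, which has the same name as the dataset, at /mnt/models/Qwen-7B-Chat-Int8 in the container, so vLLM reads the model through the JindoFS cache.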

arena serve kserve \
    --name=qwen-fluid \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --data="qwen-7b-chat-int8:/mnt/models/Qwen-7B-Chat-Int8" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

Expected output:

inferenceservice.serving.kserve.io/qwen-fluid created
INFO[0002] The Job qwen-fluid has been submitted successfully
INFO[0002] You can run `arena serve get qwen-fluid --type kserve -n default` to check the job status

Step 4: Verify acceleration results

Check dataset cache status

kubectl get dataset qwen-7b-chat-int8

Expected output:

NAME                UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen-7b-chat-int8   17.01GiB         10.46MiB   18.00GiB         0.1%                Bound   23h

A PHASE of Bound confirms that the dataset is bound to the JindoRuntime. The CACHED PERCENTAGE value rises as the inference service reads model files.
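To track cache warm-up from a script, the two relevant columns can be pulled out of the table. This is a minimal sketch with a hypothetical function name. It assumes the column layout of the sample output above, so it can misread rows in which a column is empty.

```shell
# Hypothetical helper: read `kubectl get dataset <name>` output on stdin and
# print the CACHED PERCENTAGE and PHASE columns of the first data row.
# Field positions follow the sample output in this step.
dataset_cache_status() {
  awk 'NR == 2 { print $5, $6 }'
}

# Demonstration with the sample output from this step:
dataset_cache_status <<'EOF'
NAME                UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen-7b-chat-int8   17.01GiB         10.46MiB   18.00GiB         0.1%                Bound   23h
EOF
```

For the sample row, this prints "0.1% Bound". Against a live cluster, pipe the command output into the function: kubectl get dataset qwen-7b-chat-int8 | dataset_cache_status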

Check server startup time

Run the following commands to measure how long the server takes to become ready:

# Get the name of the first Pod that belongs to the inference service
POD_NAME=$(kubectl get po | awk '/qwen-fluid/ {print $1; exit}')
# Check how long the server took to become ready
kubectl logs "$POD_NAME" | grep -i "server ready takes"

Expected output:

server ready takes 25.875763 s

With Fluid caching enabled, model files are served from local memory on subsequent restarts instead of being pulled from OSS, which reduces startup time. The actual speedup varies based on dataset size, node memory, and network conditions.
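To compare startup times across restarts, it helps to extract only the number from the log line. The sketch below uses a hypothetical function name and assumes the log line ends with the duration followed by "s", as in the expected output above.

```shell
# Hypothetical helper: read Pod logs on stdin and print the startup time in
# seconds from the first "server ready takes" line.
ready_seconds() {
  grep -i -m 1 "server ready takes" | awk '{ print $(NF - 1) }'
}

# Demonstration with the log line from this step:
echo "server ready takes 25.875763 s" | ready_seconds
```

For the sample line, this prints 25.875763. Against a live Pod, pipe the logs into the function: kubectl logs "$POD_NAME" | ready_seconds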

For benchmark data comparing cached and uncached access times, see "Step 3: Create applications to test data acceleration" in Use JindoFS to accelerate access to OSS.

Troubleshooting

JindoRuntime is not ready

Symptom: The dataset stays in a non-Bound phase after applying resource.yaml.

Cause: The JindoRuntime worker Pods may be pending due to insufficient memory. Each worker replica needs up to 3 GiB of node memory for its cache (quota: 3Gi), and the FUSE component requests an additional 2 GiB.

Fix: Check events on the JindoRuntime:

kubectl describe jindoruntime qwen-7b-chat-int8

If nodes lack free memory, either reduce quota in tieredstore or add nodes with more available memory.

OSS access errors

Symptom: The dataset shows errors or the JindoRuntime pods report access denied.

Cause: The AccessKey ID or AccessKey secret in the oss-secret Secret is incorrect, or the AccessKey does not have read permission on the OSS bucket.

Fix: Inspect the Secret. The values in the output are Base64-encoded, so decode them before comparing them with your AccessKey pair:

kubectl get secret oss-secret -o yaml

Re-create the Secret with the correct values if needed, then restart the JindoRuntime.
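A single key can also be decoded directly. The following is a minimal sketch with a hypothetical function name. It escapes the dots in the key name because kubectl JSONPath treats "." as a path separator, and it assumes a base64 command that accepts -d.

```shell
# Hypothetical helper: print the decoded value of one key in a Secret.
# Dots in the key are escaped so that JSONPath reads the key as one field.
secret_value() {
  secret=$1
  key=$(printf '%s' "$2" | sed 's/\./\\./g')
  kubectl get secret "$secret" -o "jsonpath={.data.$key}" | base64 -d
}

# Usage:
#   secret_value oss-secret fs.oss.accessKeyId
```

The decoded value is printed in plain text, so avoid running this in shared terminals or in logged CI output, especially for fs.oss.accessKeySecret.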

Inference service Pod stuck in init

Symptom: The qwen-fluid Pod stays in Init state.

Cause: The Fuse sidecar may not have mounted the dataset yet, or the dataset is not in Bound phase.

Fix: Check the dataset phase first:

kubectl get dataset qwen-7b-chat-int8

The inference service Pod cannot proceed until the dataset reaches PHASE: Bound. If the dataset is Bound but the Pod is still stuck, check the Pod events (POD_NAME is set as in Step 4):

kubectl describe pod $POD_NAME
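The wait for the Bound phase can be scripted instead of checked by hand. The sketch below is a hypothetical polling helper. It assumes that the Dataset resource exposes its phase at .status.phase, and the attempt count and interval defaults are arbitrary.

```shell
# Hypothetical helper: poll until the dataset reports phase Bound or the
# attempts run out. Arguments: dataset name, max attempts, seconds between polls.
wait_for_bound() {
  dataset=$1
  attempts=${2:-60}
  interval=${3:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl get dataset "$dataset" -o 'jsonpath={.status.phase}')
    if [ "$phase" = "Bound" ]; then
      echo "dataset $dataset is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "dataset $dataset not Bound after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_for_bound qwen-7b-chat-int8 && kubectl describe pod "$POD_NAME"
```

The function returns a non-zero status on timeout, so it can gate later commands in a deployment script.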
