Large language models (LLMs) can easily reach tens of gigabytes, causing slow cold starts and high restart latency when model files are pulled from Object Storage Service (OSS). Fluid solves this by caching model files locally on cluster nodes using JindoRuntime. After the first load, the inference service reads model files from the local JindoFS memory cache instead of pulling them from OSS, so subsequent restarts are significantly faster. This topic shows how to set up Fluid-based caching for a Qwen-7B-Chat-Int8 model and deploy it as a KServe inference service backed by vLLM on NVIDIA V100 GPUs.
Prerequisites
Before you begin, make sure you have:
- An ACK Pro cluster that runs Kubernetes 1.22 or later on an operating system other than ContainerOS, with at least 3 nodes and at least 3 GB of free memory on each node. See Create an ACK Pro cluster.
- The cloud-native AI suite installed, with the ack-fluid component deployed. See Deploy the cloud-native AI suite.
- Arena 0.9.15 or later installed. See Configure the Arena client.
- The ack-kserve component installed. See Install ack-kserve.
- OSS activated. See Activate OSS.
How it works
Fluid introduces two Kubernetes custom resources that work together:
- Dataset: declares which remote storage path to expose. In this case, it is an OSS bucket path that contains the model files.
- JindoRuntime: starts a JindoFS cluster that caches the dataset contents in memory on cluster nodes, so subsequent reads are served locally instead of from OSS.
When the KServe inference service mounts the dataset, the vLLM server reads model files from the local JindoFS cache rather than pulling them from OSS on each start.
Step 1: Prepare model data and upload it to OSS
Download the Qwen-7B-Chat-Int8 model
- Install Git and the Git Large File Storage (LFS) extension:
  sudo yum install git
  sudo yum install git-lfs
- Clone the Qwen-7B-Chat-Int8 repository from ModelScope, skipping LFS downloads during the clone:
  GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat-Int8.git
- Go to the cloned directory, then pull the LFS-managed model files:
  cd Qwen-7B-Chat-Int8
  git lfs pull
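Because the clone skips LFS downloads, it can be worth confirming that git lfs pull replaced every pointer stub with the real file before you upload anything. The following is a minimal sketch and not part of the original procedure. The helper name find_lfs_pointers is hypothetical. It relies on the Git LFS pointer format, in which the first line of a stub begins with "version https://git-lfs.github.com/spec/v1".

```shell
#!/usr/bin/env bash
# Hypothetical helper: list files under a directory that are still Git LFS
# pointer stubs (small text files) instead of the real model files.
find_lfs_pointers() {
  local dir="$1"
  # Pointer stubs are tiny, so only inspect files of at most 1 KiB.
  # The .git directory is skipped.
  find "$dir" -path "$dir/.git" -prune -o -type f -size -2k -print |
    while IFS= read -r f; do
      if head -n 1 "$f" | grep -q '^version https://git-lfs\.github\.com/spec/v1'; then
        printf '%s\n' "$f"
      fi
    done
}

# Usage (run from the parent of the cloned directory):
#   find_lfs_pointers ./Qwen-7B-Chat-Int8
# No output means no pointer stubs were found.
```

If the helper prints any paths, running git lfs pull again inside the repository is the likely remedy.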
Upload the model to OSS
- Log in to the OSS console and record the name of your OSS bucket. To create a bucket, see Create a bucket.
- Install and configure ossutil. See Install ossutil.
- Create a directory in the bucket and upload the model files:
  ossutil mkdir oss://<your-bucket-name>/Qwen-7B-Chat-Int8
  ossutil cp -r ./Qwen-7B-Chat-Int8 oss://<your-bucket-name>/Qwen-7B-Chat-Int8
Step 2: Create a dataset and a JindoRuntime
Create a Secret for OSS credentials
Create a Kubernetes Secret to store the AccessKey pair used to access the OSS bucket:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  fs.oss.accessKeyId: <your-access-key-id>
  fs.oss.accessKeySecret: <your-access-key-secret>
EOF
Replace <your-access-key-id> and <your-access-key-secret> with your actual credentials. To get an AccessKey pair, see Obtain an AccessKey pair.
Expected output:
secret/oss-secret created
Create the dataset and JindoRuntime
Create a file named resource.yaml with the following content. For configuration details, see Use JindoFS to accelerate access to OSS.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen-7b-chat-int8
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/Qwen-7B-Chat-Int8 # Replace with your actual OSS path (case-sensitive; must match the upload path from Step 1)
      options:
        fs.oss.endpoint: <oss_endpoint> # Replace with your OSS bucket endpoint
      name: models
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen-7b-chat-int8 # Must match the dataset name
spec:
  replicas: 3
  tieredstore:
    levels:
      - mediumtype: MEM # Cache in memory
        volumeType: emptyDir
        path: /dev/shm
        quota: 3Gi # Cache capacity per replica
        high: "0.95"
        low: "0.7"
  fuse:
    resources:
      requests:
        memory: 2Gi
    properties:
      fs.oss.download.thread.concurrency: "200"
      fs.oss.read.buffer.size: "8388608"
      fs.oss.read.readahead.max.buffer.count: "200"
      fs.oss.read.sequence.ambiguity.range: "2147483647"
Apply the configuration:
kubectl apply -f resource.yaml
Expected output:
dataset.data.fluid.io/qwen-7b-chat-int8 created
jindoruntime.data.fluid.io/qwen-7b-chat-int8 created
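The total cache capacity of the JindoRuntime is the per-replica quota multiplied by the number of replicas. The sketch below does that arithmetic and compares the result with the size of the data to cache. The function name cache_fits is hypothetical, sizes are whole GiB for simplicity, and the 10 GiB data size is a placeholder, not a measured value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare total JindoRuntime cache capacity
# (replicas x per-replica quota) with the size of the data to cache.
# All sizes are whole GiB so that plain shell integer arithmetic suffices.
cache_fits() {
  local replicas="$1" quota_gib="$2" data_gib="$3"
  local capacity_gib=$(( replicas * quota_gib ))
  if (( capacity_gib >= data_gib )); then
    echo "capacity ${capacity_gib}GiB >= data ${data_gib}GiB: fits"
  else
    echo "capacity ${capacity_gib}GiB < data ${data_gib}GiB: cannot hold all data at once"
  fi
}

# replicas=3 and quota=3 come from resource.yaml.
# 10 is a placeholder: substitute the size of your own model directory.
cache_fits 3 3 10
```

If the data does not fit, increasing quota or replicas in resource.yaml raises the capacity, provided the nodes have enough free memory.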
Step 3: Deploy a vLLM inference service
Deploy the Qwen-7B-Chat-Int8 model as a KServe inference service using vLLM. The --data flag mounts the Fluid dataset into the container, so the model is read from the JindoFS cache.
arena serve kserve \
    --name=qwen-fluid \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --data="qwen-7b-chat-int8:/mnt/models/Qwen-7B-Chat-Int8" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"
Expected output:
inferenceservice.serving.kserve.io/qwen-fluid created
INFO[0002] The Job qwen-fluid has been submitted successfully
INFO[0002] You can run `arena serve get qwen-fluid --type kserve -n default` to check the job status
Step 4: Verify acceleration results
Check dataset cache status
kubectl get dataset qwen-7b-chat-int8
Expected output:
NAME                UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen-7b-chat-int8   17.01GiB         10.46MiB   18.00GiB         0.1%                Bound   23h
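The PHASE: Bound status confirms that the dataset is bound and the JindoRuntime is active. The cached percentage increases as the inference service reads model files. The sizes in your output depend on your model files and on the tieredstore quota, so they may differ from this sample.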
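To track cache warm-up without reading the whole table, the phase and cached percentage can be extracted from the command output. This is a minimal sketch: the helper name dataset_progress is hypothetical, and it assumes the column layout shown in the sample output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get dataset <name>` output on stdin and
# print "<phase> <cached percentage>" from the data row.
dataset_progress() {
  # Skip the header row. In the data row, field 5 is CACHED PERCENTAGE and
  # field 6 is PHASE (data values contain no spaces, unlike the headers).
  # Rows with fewer than 6 fields, for example before the dataset is bound
  # and the size columns are still empty, produce no output.
  awk 'NR > 1 && NF >= 6 { print $6, $5 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get dataset qwen-7b-chat-int8 | dataset_progress
```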
Check server startup time
Run the following commands to measure how long the server takes to become ready:
# Get the name of the inference service Pod (first match if several exist)
POD_NAME=$(kubectl get po | grep qwen-fluid | awk '{print $1}' | head -n 1)
# Check how long the server took to become ready
kubectl logs "$POD_NAME" | grep -i "server ready takes"
Expected output:
server ready takes 25.875763 s
With Fluid caching enabled, model files are served from local memory on subsequent restarts instead of being pulled from OSS, which reduces startup time. The actual speedup varies based on dataset size, node memory, and network conditions.
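To compare a cold start with a later restart, the startup time can be pulled out of the log line and the two values divided. The helper names ready_seconds and speedup are hypothetical, and the sketch assumes the log format shown in the expected output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the startup time in seconds from log text on
# stdin. Matches lines such as "server ready takes 25.875763 s".
ready_seconds() {
  sed -n 's/.*[Ss]erver ready takes \([0-9.][0-9.]*\) s.*/\1/p' | tail -n 1
}

# Hypothetical helper: ratio of a cold-start time to a warm-start time.
# Assumes the second argument is greater than zero.
speedup() {
  awk -v cold="$1" -v warm="$2" 'BEGIN { printf "%.2fx\n", cold / warm }'
}

# Usage against a live cluster (not executed here):
#   cold=$(kubectl logs "$POD_NAME" | ready_seconds)      # first start
#   # ...restart the Pod and set POD_NAME to the new Pod name...
#   warm=$(kubectl logs "$POD_NAME" | ready_seconds)      # after cache warm-up
#   speedup "$cold" "$warm"
```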
For benchmark data comparing cached and uncached access times, see "Step 3: Create applications to test data acceleration" in Use JindoFS to accelerate access to OSS.
Troubleshooting
JindoRuntime is not ready
Symptom: The dataset stays in a non-Bound phase after applying resource.yaml.
Cause: The JindoRuntime workers may be pending due to insufficient memory. Each worker replica needs 3 GiB of memory for the cache (the tieredstore quota), and each Fuse pod requests a further 2 GiB.
Fix: Check events on the JindoRuntime:
kubectl describe jindoruntime qwen-7b-chat-int8
If nodes lack free memory, either reduce quota in tieredstore or add nodes with more available memory.
OSS access errors
Symptom: The dataset shows errors or the JindoRuntime pods report access denied.
Cause: The AccessKey ID or AccessKey secret in the oss-secret Secret is incorrect, or the AccessKey does not have read permission on the OSS bucket.
Fix: Verify the credentials in the Secret. The values under data in the output are base64-encoded, so decode them before comparing:
kubectl get secret oss-secret -o yaml
Re-create the Secret with the correct values if needed, then restart the JindoRuntime.
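Kubernetes stores Secret values base64-encoded, so a single key can be decoded for comparison as sketched below. The helper name decode_secret_value is hypothetical. The command prints the credential in plain text, so run it only in a trusted terminal.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode one base64-encoded Secret value read from stdin.
decode_secret_value() {
  base64 --decode
}

# Usage against a live cluster (not executed here). The dots inside the key
# name are escaped so that jsonpath treats "fs.oss.accessKeyId" as one key:
#   kubectl get secret oss-secret -o jsonpath='{.data.fs\.oss\.accessKeyId}' | decode_secret_value
```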
Inference service Pod stuck in init
Symptom: The qwen-fluid Pod stays in Init state.
Cause: The Fuse sidecar may not have mounted the dataset yet, or the dataset is not in Bound phase.
Fix: Check the dataset phase first:
kubectl get dataset qwen-7b-chat-int8
The inference service Pod cannot proceed until the dataset reaches PHASE: Bound. If the dataset is Bound but the Pod is still stuck, check the Pod events:
kubectl describe pod $POD_NAME
What's next
- Learn more about Fluid data acceleration: Overview of Fluid
- Explore advanced JindoRuntime cache configurations: Use JindoFS to accelerate access to OSS