Reading large datasets from a remote persistent volume (PV) can bottleneck AI/ML training and data-intensive workloads because every read must cross the network. JindoRuntime — a Fluid runtime engine from Alibaba Cloud EMR based on the JindoFS system — sits between your application pods and the PV, caching frequently accessed data in memory or on local disk so that subsequent reads bypass remote storage entirely. JindoFS, written in C++, provides dataset management and caching capabilities for Fluid and supports integration with any self-managed storage system, such as CephFS.
This topic shows how to deploy JindoRuntime on an ACK Pro cluster to accelerate reads from an existing PV storage volume.
Prerequisites
Before you begin, ensure that you have:
- An ACK Pro cluster running a non-ContainerOS operating system, with Kubernetes 1.18 or later. See Create an ACK Pro cluster.
  Important: ack-fluid is not supported on ContainerOS.
- ack-fluid 1.0.6 or later, installed as part of the cloud-native AI suite.
  - To install from scratch: enable Fluid when installing the cloud-native AI suite. See Install the cloud-native AI suite.
  - If the suite is already installed: open the Container Service Management Console, go to the Cloud-native AI Suite page, and deploy ack-fluid.
  Important: If you have already installed open-source Fluid, uninstall it before deploying ack-fluid.
- A kubectl client connected to the ACK Pro cluster. See Connect to a cluster using kubectl.
- Persistent volumes (PVs) and persistent volume claims (PVCs) already created for your target storage system. Follow the official documentation for your storage system to make sure the connection to the cluster is stable.
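The Kubernetes version requirement can be checked in a script. The following is a minimal sketch, not part of ack-fluid: `version_at_least` is a hypothetical helper, and it assumes you have already extracted the `serverVersion.major` and `serverVersion.minor` fields from `kubectl version -o json`.

```shell
# Exit 0 if version MAJOR.MINOR (arguments 1 and 2) is at least
# REQ_MAJOR.REQ_MINOR (arguments 3 and 4).
# A trailing "+" on the minor version, if the server reports one, is ignored.
version_at_least() {
  minor=${2%+}
  [ "$1" -gt "$3" ] || { [ "$1" -eq "$3" ] && [ "$minor" -ge "$4" ]; }
}

# Example: check a server that reports major "1" and minor "26+" against 1.18.
if version_at_least 1 26+ 1 18; then
  echo "Kubernetes version meets the 1.18 requirement"
fi
```

The helper only compares numbers; it does not verify the operating system or the ack-fluid version, which still need to be checked in the console.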
Step 1: Verify PV and PVC status
Run the following command to list PVs and PVCs in the cluster.
kubectl get pvc,pv
Expected output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/demo-pvc Bound demo-pv 5Gi RWX 19h
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/demo-pv 30Gi RWX Retain Bound default/demo-pvc 19h
The output shows that demo-pv (30 GiB, ReadWriteMany) is bound to demo-pvc. Both are ready to use.
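If you want to script this check, the sketch below filters the PVC listing for claims that are not yet bound. `unbound_pvcs` is a hypothetical helper name; it assumes the column order produced by `kubectl get pvc --no-headers`, where the second column is STATUS.

```shell
# Print the name of every PVC on stdin whose STATUS column is not "Bound".
# Input format: one PVC per line, as printed by `kubectl get pvc --no-headers`.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 }'
}

# Usage against a live cluster (prints nothing when all PVCs are bound):
#   kubectl get pvc --no-headers | unbound_pvcs
```

An empty result means every PVC in the current namespace is bound and you can continue with the next step.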
Step 2: Create a Dataset and JindoRuntime
A Fluid Dataset declares which PVC to cache, and a JindoRuntime starts the JindoFS distributed caching system. Both resources must share the same name so that Fluid can associate them automatically.
- Create dataset.yaml with the following content.

      apiVersion: data.fluid.io/v1alpha1
      kind: Dataset
      metadata:
        name: pv-demo-dataset
      spec:
        mounts:
          - mountPoint: pvc://demo-pvc # Format: pvc://<pvc-name>/<path>. The path must exist in the storage volume.
            name: data
            path: /
        accessModes:
          - ReadOnlyMany
      ---
      apiVersion: data.fluid.io/v1alpha1
      kind: JindoRuntime
      metadata:
        name: pv-demo-dataset # Must match the Dataset name above.
      spec:
        replicas: 2 # Number of JindoFS worker replicas. Adjust based on your cluster size.
        tieredstore:
          levels:
            - mediumtype: MEM # Cache medium: HDD, SSD, or MEM.
              volumeType: emptyDir # emptyDir for memory/system-disk cache; hostPath for data-disk cache.
              path: /dev/shm # Use a memory-backed path for best performance.
              quota: 10Gi # Maximum cache capacity per worker. Adjust as needed.
              high: "0.9" # Eviction threshold: start evicting when usage reaches 90%.
              low: "0.8" # Eviction target: evict until usage drops to 80%.

  Choosing `mediumtype` and `volumeType`

  These two parameters determine where cached data lives on each worker node. The right combination depends on your cache storage requirements:

  | Cache location | mediumtype | volumeType | When to use |
  | --- | --- | --- | --- |
  | Memory (/dev/shm) | MEM | emptyDir | Fastest reads; cache is released when the pod exits, so no residual data accumulates on nodes. |
  | System disk | HDD or SSD | emptyDir | Persistent local disk; cache is still cleaned up automatically when the pod exits. |
  | Data disk | HDD or SSD | hostPath | Use when you have a dedicated data disk; set path to the disk's mount point on the host. |

  For more guidance, see Strategy 2: Select a cache medium.

  Key parameters

  | Parameter | Description |
  | --- | --- |
  | mountPoint | The data source to cache. Format: `pvc://<pvc-name>/<path>`. The `<pvc-name>` must be in the same namespace as the Dataset. The `<path>` must exist in the storage volume. |
  | replicas | Number of JindoFS worker replicas. |
  | mediumtype | Cache medium type. Valid values: HDD, SSD, MEM. |
  | volumeType | Volume type for cache storage. Valid values: emptyDir and hostPath. The default value is hostPath. See the table above. |
  | path | Directory where workers store cached data. |
  | quota | Maximum cache capacity per worker. |

- Apply the configuration.

      kubectl create -f dataset.yaml

- Wait for the Dataset to become ready.

      kubectl get dataset pv-demo-dataset

  Note: On first startup, JindoFS pulls container images. This typically takes 2–3 minutes depending on your network.

  Expected output when ready:

      NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
      pv-demo-dataset 10.96GiB 0.00B 20.00GiB 0.0% Bound 2m13s

  A PHASE of Bound means the JindoFS caching system is running and application pods can start using the Dataset.
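The CACHE CAPACITY of 20.00GiB in the output above is consistent with two workers that each hold a 10Gi quota. The sketch below works through that arithmetic and the per-worker eviction watermarks described in the YAML comments. The function names are hypothetical, and the assumption that total capacity is simply replicas multiplied by quota is inferred from this example rather than taken from a JindoRuntime specification.

```shell
# Aggregate cache capacity in GiB: number of workers times per-worker quota.
total_cache_gib() {
  awk -v r="$1" -v q="$2" 'BEGIN { printf "%.2f\n", r * q }'
}

# Per-worker cache usage in GiB at which a watermark ratio is reached.
watermark_gib() {
  awk -v q="$1" -v w="$2" 'BEGIN { printf "%.2f\n", q * w }'
}

total_cache_gib 2 10   # prints 20.00 (replicas: 2, quota: 10Gi)
watermark_gib 10 0.9   # prints 9.00  (high: eviction starts)
watermark_gib 10 0.8   # prints 8.00  (low: eviction stops)
```

With these settings the 10.96GiB dataset fits in the aggregate cache; a dataset larger than the total capacity cannot be fully cached, so size replicas and quota accordingly.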
Step 3 (optional): Prefetch data into the cache
Without prefetching, the first read of each file fetches data from remote storage, which is slower. Fluid's DataLoad resource lets you warm the cache ahead of time so that all subsequent reads come from the local cache.
- Create dataload.yaml with the following content.

      apiVersion: data.fluid.io/v1alpha1
      kind: DataLoad
      metadata:
        name: dataset-warmup
      spec:
        dataset:
          name: pv-demo-dataset # Name of the Dataset to prefetch.
          namespace: default # Must match the namespace of the DataLoad object.
        loadMetadata: true # Required for JindoRuntime: syncs file metadata before prefetching.
        target:
          - path: / # Path relative to the Dataset mount point. "/" prefetches everything.
            replicas: 1 # Number of cache copies to create for each file.

- Create the DataLoad object.

      kubectl create -f dataload.yaml

- Monitor the prefetch job.

      kubectl get dataload dataset-warmup

  Expected output when complete:

      NAME DATASET PHASE AGE DURATION
      dataset-warmup pv-demo-dataset Complete 62s 12s

- Confirm that the cache is fully populated.

      kubectl get dataset

  Expected output:

      NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
      pv-demo-dataset 10.96GiB 10.96GiB 20.00GiB 100.0% Bound 3m13s

  When CACHED matches UFS TOTAL SIZE and CACHED PERCENTAGE is 100.0%, all data is cached locally.
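To automate the final check, the sketch below tests whether every Dataset row reports a fully populated cache. `datasets_fully_cached` is a hypothetical helper; it assumes the column order shown in the expected output above (CACHED PERCENTAGE in column 5, PHASE in column 6) and input without the header row.

```shell
# Exit 0 only if stdin has at least one row and every row is
# Bound with a CACHED PERCENTAGE of 100.0%.
datasets_fully_cached() {
  awk '$6 != "Bound" || $5 != "100.0%" { bad = 1 }
       END { exit (bad || NR == 0) }'
}

# Usage against a live cluster, polling every 5 seconds until warm:
#   until kubectl get dataset pv-demo-dataset --no-headers | datasets_fully_cached; do
#     sleep 5
#   done
```

Parsing the table is fragile if the column layout changes between Fluid versions, so treat this as a convenience for interactive use.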
Step 4: Access data through the cache
Fluid creates a PVC with the same name as the Dataset. Mount the Dataset into an application pod by setting claimName to the Dataset name. JindoRuntime intercepts read requests and serves data from the local cache instead of the remote PV.
- Create pod.yaml with the following content.

      apiVersion: v1
      kind: Pod
      metadata:
        name: nginx
      spec:
        containers:
          - name: nginx
            image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
            command:
              - "bash"
              - "-c"
              - "sleep inf"
            volumeMounts:
              - mountPath: /data
                name: data-vol
        volumes:
          - name: data-vol
            persistentVolumeClaim:
              claimName: pv-demo-dataset # Set this to the Dataset name, not the original PVC name.

- Create the pod.

      kubectl create -f pod.yaml

- Open a shell in the pod and read data.

      kubectl exec -it nginx -- bash

  Inside the pod, verify the data is accessible and measure read throughput:

      # List files in the mounted directory.
      ls -lh /data
      total 11G
      -rw-r----- 1 root root 11G Jul 22 2022 demofile

      # Read the entire file and discard output to measure throughput.
      time cat /data/demofile > /dev/null
      real 0m11.004s
      user 0m0.065s
      sys 0m3.089s

  Because the entire dataset is cached locally by JindoFS, reads retrieve data from memory rather than the remote storage system, eliminating network transfer overhead.
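The timing above can be turned into a throughput figure. The sketch below is illustrative only: both function names are hypothetical, and it uses the 10.96GiB size reported by the Dataset rather than the rounded 11G shown by ls.

```shell
# Convert a `time` value such as "0m11.004s" to seconds.
time_to_seconds() {
  echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Throughput in GiB/s: size in GiB divided by elapsed seconds.
throughput_gib_s() {
  awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", size / secs }'
}

secs=$(time_to_seconds "0m11.004s")  # 11.004
throughput_gib_s 10.96 "$secs"       # prints 1.00
```

For this example run, the cached read works out to roughly 1 GiB/s. Repeating the same measurement against the original PVC gives a baseline for comparison.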
What's next
- Strategy 2: Select a cache medium. Guidance on choosing between HDD, SSD, and MEM cache tiers for your workload.