JindoRuntime is the execution engine of JindoFS, developed by the Alibaba Cloud E-MapReduce (EMR) team and implemented in C++. It provides dataset management and caching for Object Storage Service (OSS) data within a Kubernetes cluster. Alibaba Cloud provides cloud service-level support for JindoFS. Fluid manages and schedules JindoRuntime to deliver observability, auto scaling, and dataset portability.
This topic walks you through uploading a test dataset to OSS, creating a Dataset and JindoRuntime, and verifying the cache acceleration effect with a benchmark pod.
Limitations
- JindoRuntime requires an ACK Pro cluster running a non-ContainerOS operating system. The ack-fluid component does not currently support ContainerOS.
- If you have already installed open-source Fluid, uninstall it before deploying the ack-fluid component. Running both simultaneously is not supported.
Prerequisites
Before you begin, ensure that you have:
- An ACK Pro cluster with a non-ContainerOS operating system, running Kubernetes 1.18 or later. See Create an ACK Pro cluster.
- The ack-fluid component deployed in the cluster:
  - If the cloud-native AI suite is not installed, enable Fluid acceleration when you install the suite. See Deploy the cloud-native AI suite.
  - If the cloud-native AI suite is already installed, go to the Cloud-native AI Suite page in the ACK console and deploy the ack-fluid component.
- A kubectl client connected to your ACK Pro cluster. See Connect to a cluster by using kubectl.
- OSS activated. See Activate OSS.
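After connecting kubectl to the cluster, you can confirm that the reported server version meets the 1.18 minimum. The helper below is an illustrative sketch, not part of the ACK tooling; `version_at_least` is a hypothetical function name, and the version strings are examples.

```shell
# Hypothetical helper: compare a "major.minor" Kubernetes version string
# against the 1.18 minimum required by ack-fluid.
# Returns 0 (success) if actual >= required.
version_at_least() {
  local actual_major="${1%%.*}" actual_minor="${1#*.}"
  local required_major="${2%%.*}" required_minor="${2#*.}"
  if [ "$actual_major" -gt "$required_major" ]; then return 0; fi
  if [ "$actual_major" -lt "$required_major" ]; then return 1; fi
  [ "$actual_minor" -ge "$required_minor" ]
}

# Example usage with a sample version string; in practice you would feed it
# the server version reported by `kubectl version`.
if version_at_least "1.22" "1.18"; then
  echo "cluster version OK for ack-fluid"
fi
```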
Step 1: Upload data to OSS
The following steps use an Elastic Compute Service (ECS) instance running Alibaba Cloud Linux 3.2104 LTS 64-bit to upload a test dataset. For other operating systems, see ossutil and ossutil 1.0.
- Download the test dataset to your ECS instance.

  ```shell
  wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
  ```

- Install ossutil on the ECS instance.
- Create an OSS bucket named `examplebucket`.

  ```shell
  ossutil64 mb oss://examplebucket
  ```

  Expected output:

  ```
  0.668238(s) elapsed
  ```

- Upload the test dataset to the bucket.

  ```shell
  ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
  ```
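Bucket creation fails if the name violates OSS naming rules (3 to 63 characters; only lowercase letters, digits, and hyphens; no leading or trailing hyphen). A pre-flight check could be sketched as follows; `valid_bucket_name` is a hypothetical helper, not part of ossutil.

```shell
# Illustrative check of OSS bucket naming rules: 3-63 characters, only
# lowercase letters, digits, and hyphens, and no leading or trailing hyphen.
valid_bucket_name() {
  case "$1" in
    *[!a-z0-9-]*) return 1 ;;  # illegal character
  esac
  local len=${#1}
  [ "$len" -ge 3 ] && [ "$len" -le 63 ] || return 1
  case "$1" in
    -*|*-) return 1 ;;         # must not start or end with a hyphen
  esac
  return 0
}

valid_bucket_name "examplebucket" && echo "examplebucket is a valid name"
```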
Step 2: Create a dataset and a JindoRuntime
JindoRuntime reads OSS credentials from a Kubernetes Secret, so create the Secret before creating the Dataset.
- Create a file named `mySecret.yaml` with the following content. Replace `xxx` with an AccessKey ID and AccessKey secret that have read access to the OSS bucket you created in Step 1.

  ```yaml
  apiVersion: v1
  kind: Secret
  metadata:
    name: mysecret
  stringData:
    fs.oss.accessKeyId: xxx
    fs.oss.accessKeySecret: xxx
  ```
- Apply the Secret. Storing the credentials in a Secret keeps them out of the Dataset manifest, so they do not appear in plaintext there.

  ```shell
  kubectl create -f mySecret.yaml
  ```
- Create a file named `resource.yaml` with the following content. This file defines the Dataset, which points to your OSS data, and the JindoRuntime, which launches the caching cluster.

  ```yaml
  apiVersion: data.fluid.io/v1alpha1
  kind: Dataset
  metadata:
    name: hadoop
  spec:
    mounts:
      - mountPoint: oss://<oss_bucket>/<bucket_dir>
        options:
          fs.oss.endpoint: <oss_endpoint>
        name: hadoop
        path: "/"
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeySecret
  ---
  apiVersion: data.fluid.io/v1alpha1
  kind: JindoRuntime
  metadata:
    name: hadoop
  spec:
    replicas: 2
    tieredstore:
      levels:
        - mediumtype: MEM
          path: /dev/shm
          volumeType: emptyDir
          quota: 2Gi
          high: "0.99"
          low: "0.95"
  ```

  The following table describes the key parameters.
  | Parameter | Description | Required | Default |
  | --- | --- | --- | --- |
  | `mountPoint` | Path to the underlying file system (UFS), in the format `oss://<oss_bucket>/<bucket_dir>`. Do not include the OSS endpoint in this value. | Yes | None |
  | `fs.oss.endpoint` | Public or internal endpoint of the OSS bucket. See Regions and endpoints. | Yes | None |
  | `replicas` | Number of workers in the JindoFS caching cluster. | Yes | None |
  | `mediumtype` | Cache storage medium. Valid values: `HDD`, `SSD`, `MEM`. | Yes | None |
  | `path` | Local storage path on the worker node. Only one path is allowed. Required when `mediumtype` is `MEM` to store data such as logs. | Yes | None |
  | `quota` | Maximum cache size per worker. | Yes | None |
  | `high` | Cache eviction threshold (high watermark). When usage exceeds this ratio, eviction begins. | No | None |
  | `low` | Cache retention threshold (low watermark). Eviction stops when usage drops to this ratio. | No | None |
- Create the Dataset and the JindoRuntime.

  ```shell
  kubectl create -f resource.yaml
  ```
- Verify that the Dataset is bound.

  ```shell
  kubectl get dataset hadoop
  ```

  Expected output:

  ```
  NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
  hadoop   210MiB           0.00B    4.00GiB          0.0%                Bound   1h
  ```
- Verify that the JindoRuntime is ready.

  ```shell
  kubectl get jindoruntime hadoop
  ```

  Expected output:

  ```
  NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
  hadoop   Ready          Ready          Ready        4m45s
  ```
- Verify that the persistent volume (PV) and persistent volume claim (PVC) are created.

  ```shell
  kubectl get pv,pvc
  ```

  Expected output:

  ```
  NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
  persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m

  NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
  persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX                           52m
  ```
The Dataset and JindoRuntime are ready when all phases show Ready and the PVC status is Bound.
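To see what the `quota`, `high`, and `low` settings in `resource.yaml` mean in practice, the sketch below converts them into byte thresholds: with a 2 Gi quota, eviction begins once a worker's cache rises above 99% of the quota and stops once usage falls back to 95%.

```python
# Sketch: translate the JindoRuntime tieredstore watermarks into byte
# thresholds. The values are taken from the resource.yaml example above.
GIB = 1024 ** 3

quota_bytes = 2 * GIB      # quota: 2Gi per worker
high = 0.99                # eviction starts above this fraction of quota
low = 0.95                 # eviction stops at this fraction of quota

evict_start = quota_bytes * high
evict_stop = quota_bytes * low

print(f"eviction starts above {evict_start / 1024**2:.0f} MiB")
print(f"eviction stops at     {evict_stop / 1024**2:.0f} MiB")
```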
Step 3: Test data acceleration
Deploy a pod that mounts the Dataset PVC and compare file read times before and after data is cached.
- Create a file named `app.yaml` with the following content.

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: demo-app
  spec:
    containers:
      - name: demo
        image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
        volumeMounts:
          - mountPath: /data
            name: hadoop
    volumes:
      - name: hadoop
        persistentVolumeClaim:
          claimName: hadoop
  ```
- Deploy the pod.

  ```shell
  kubectl create -f app.yaml
  ```
- Open a shell in the pod and check the size of the test file.

  ```shell
  kubectl exec -it demo-app -- bash
  du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
  ```

  Expected output:

  ```
  210M    /data/spark-3.0.1-bin-hadoop2.7.tgz
  ```
- Measure the initial read time. This access is served directly from OSS, with no cache.

  ```shell
  time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
  ```

  Expected output:

  ```
  real    0m18.386s
  user    0m0.002s
  sys     0m0.105s
  ```

  Reading the file takes about 18 seconds.
- Check the cached data after the read.

  ```shell
  kubectl get dataset hadoop
  ```

  Expected output:

  ```
  NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
  hadoop   210.00MiB        210.00MiB   4.00GiB          100.0%              Bound   1h
  ```

  The full 210 MiB is now cached in local storage.
- Delete and recreate the pod to clear the OS page cache, so that the next read is served from the JindoFS cache rather than from memory.

  ```shell
  kubectl delete -f app.yaml && kubectl create -f app.yaml
  ```
- Measure the read time again with data served from cache.

  ```shell
  kubectl exec -it demo-app -- bash
  time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
  ```

  Expected output:

  ```
  real    0m0.048s
  user    0m0.001s
  sys     0m0.046s
  ```

  With the JindoFS cache, the same read completes in about 48 milliseconds, more than 300 times faster than the direct read from OSS.
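A quick sanity check of the claimed speedup, using the two `real` times from this run (your own numbers will differ):

```python
# Compare the cold read (direct from OSS) with the warm read (served from
# the MEM cache tier), using the `real` times measured above.
cold_read_s = 18.386   # first read, no cache
warm_read_s = 0.048    # second read, after the file is fully cached

speedup = cold_read_s / warm_read_s
print(f"speedup: about {speedup:.0f}x")  # roughly 383x for these timings
```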
(Optional) Clean up
When data acceleration is no longer needed, delete the pod, the Dataset, and the JindoRuntime.
Delete the pod:

```shell
kubectl delete pod demo-app
```

Delete the Dataset and the JindoRuntime:

```shell
kubectl delete dataset hadoop
```
What's next
- To submit machine learning training jobs that use JindoFS-accelerated data, see the cloud-native AI suite documentation.
- To explore other cache storage options (`HDD`, `SSD`), update the `mediumtype` and `path` fields in `resource.yaml`.
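For example, an SSD-backed tier might look like the following sketch. The `/mnt/disk1` path and the 10Gi quota are illustrative assumptions, not values from this guide; point `path` at an SSD-backed directory that actually exists on your worker nodes.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: SSD    # cache on local SSD instead of memory
        path: /mnt/disk1   # illustrative: an SSD-backed directory on the node
        quota: 10Gi        # illustrative: SSD tiers can hold more than MEM
        high: "0.99"
        low: "0.95"
```

An SSD tier trades some read latency for a much larger cache capacity than the memory-backed `/dev/shm` tier used in this walkthrough.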