JindoRuntime is the execution engine of JindoFS, developed by the Alibaba Cloud E-MapReduce (EMR) team and implemented in C++. It provides dataset management and caching for Object Storage Service (OSS) data within a Kubernetes cluster. Alibaba Cloud provides cloud service-level support for JindoFS. Fluid manages and schedules JindoRuntime to deliver observability, auto scaling, and dataset portability.
This topic walks you through uploading a test dataset to OSS, creating a Dataset and JindoRuntime, and verifying the cache acceleration effect with a benchmark pod.
Limitations
- JindoRuntime requires an ACK Pro cluster running a non-ContainerOS operating system. The ack-fluid component does not currently support ContainerOS.
- If you have already installed open-source Fluid, uninstall it before deploying the ack-fluid component. Running both simultaneously is not supported.
Prerequisites
Before you begin, ensure that you have:
- An ACK Pro cluster with a non-ContainerOS operating system, running Kubernetes 1.18 or later. See Create an ACK Pro cluster.
- The ack-fluid component deployed in the cluster:
  - If the cloud-native AI suite is not installed, enable Fluid acceleration when you install the suite. See Deploy the cloud-native AI suite.
  - If the cloud-native AI suite is already installed, go to the Cloud-native AI Suite page in the ACK console and deploy the ack-fluid component.
- A kubectl client connected to your ACK Pro cluster. See Connect to a cluster by using kubectl.
- OSS activated. See Activate OSS.
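After connecting kubectl to the cluster, you can confirm that the reported server version meets the 1.18 minimum. The helper below is an illustrative sketch, not part of the ACK tooling; `version_at_least` is a hypothetical function name, and the version strings are examples.

```shell
# Hypothetical helper: compare a "major.minor" Kubernetes version string
# against the 1.18 minimum required by ack-fluid.
# Returns 0 (success) if actual >= required.
version_at_least() {
  local actual_major="${1%%.*}" actual_minor="${1#*.}"
  local required_major="${2%%.*}" required_minor="${2#*.}"
  if [ "$actual_major" -gt "$required_major" ]; then return 0; fi
  if [ "$actual_major" -lt "$required_major" ]; then return 1; fi
  [ "$actual_minor" -ge "$required_minor" ]
}

# Example usage with a sample version string; in practice you would feed it
# the server version reported by `kubectl version`.
if version_at_least "1.22" "1.18"; then
  echo "cluster version OK for ack-fluid"
fi
```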
Step 1: Upload data to OSS
The following steps use an Elastic Compute Service (ECS) instance running Alibaba Cloud Linux 3.2104 LTS 64-bit to upload a test dataset. For other operating systems, see ossutil and ossutil 1.0.
- Download the test dataset to your ECS instance.

  ```shell
  wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
  ```

- Install ossutil on the ECS instance.
- Create an OSS bucket named `examplebucket`.

  ```shell
  ossutil64 mb oss://examplebucket
  ```

  Expected output:

  ```
  0.668238(s) elapsed
  ```

- Upload the test dataset to the bucket.

  ```shell
  ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
  ```
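Bucket creation fails if the name violates OSS naming rules (3 to 63 characters; only lowercase letters, digits, and hyphens; no leading or trailing hyphen). A pre-flight check could be sketched as follows; `valid_bucket_name` is a hypothetical helper, not part of ossutil.

```shell
# Illustrative check of OSS bucket naming rules: 3-63 characters, only
# lowercase letters, digits, and hyphens, and no leading or trailing hyphen.
valid_bucket_name() {
  case "$1" in
    *[!a-z0-9-]*) return 1 ;;  # illegal character
  esac
  local len=${#1}
  [ "$len" -ge 3 ] && [ "$len" -le 63 ] || return 1
  case "$1" in
    -*|*-) return 1 ;;         # must not start or end with a hyphen
  esac
  return 0
}

valid_bucket_name "examplebucket" && echo "examplebucket is a valid name"
```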
Step 2: Create a dataset and a JindoRuntime
JindoRuntime reads OSS credentials from a Kubernetes Secret, so create the Secret before creating the Dataset.
- Create a file named `mySecret.yaml` with the following content. Replace `xxx` with an AccessKey ID and AccessKey secret that have read access to the OSS bucket you created in Step 1.

  ```yaml
  apiVersion: v1
  kind: Secret
  metadata:
    name: mysecret
  stringData:
    fs.oss.accessKeyId: xxx
    fs.oss.accessKeySecret: xxx
  ```
- Apply the Secret. Storing the credentials in a Secret keeps them out of the Dataset manifest, so they do not appear in plaintext there.

  ```shell
  kubectl create -f mySecret.yaml
  ```
- Create a file named `resource.yaml` with the following content. This file defines the Dataset, which points to your OSS data, and the JindoRuntime, which launches the caching cluster.

  ```yaml
  apiVersion: data.fluid.io/v1alpha1
  kind: Dataset
  metadata:
    name: hadoop
  spec:
    mounts:
      - mountPoint: oss://<oss_bucket>/<bucket_dir>
        options:
          fs.oss.endpoint: <oss_endpoint>
        name: hadoop
        path: "/"
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeySecret
  ---
  apiVersion: data.fluid.io/v1alpha1
  kind: JindoRuntime
  metadata:
    name: hadoop
  spec:
    replicas: 2
    tieredstore:
      levels:
        - mediumtype: MEM
          path: /dev/shm
          volumeType: emptyDir
          quota: 2Gi
          high: "0.99"
          low: "0.95"
  ```

  The following table describes the key parameters.
  | Parameter | Description | Required | Default |
  | --- | --- | --- | --- |
  | `mountPoint` | Path to the underlying file system (UFS), in the format `oss://<oss_bucket>/<bucket_dir>`. Do not include the OSS endpoint in this value. | Yes | None |
  | `fs.oss.endpoint` | Public or internal endpoint of the OSS bucket. See Regions and endpoints. | Yes | None |
  | `replicas` | Number of workers in the JindoFS caching cluster. | Yes | None |
  | `mediumtype` | Cache storage medium. Valid values: `HDD`, `SSD`, `MEM`. | Yes | None |
  | `path` | Local storage path on the worker node. Only one path is allowed. Required when `mediumtype` is `MEM` to store data such as logs. | Yes | None |
  | `quota` | Maximum cache size per worker. | Yes | None |
  | `high` | Cache eviction threshold (high watermark). When usage exceeds this ratio, eviction begins. | No | None |
  | `low` | Cache retention threshold (low watermark). Eviction stops when usage drops to this ratio. | No | None |
- Create the Dataset and the JindoRuntime.

  ```shell
  kubectl create -f resource.yaml
  ```
- Verify that the Dataset is bound.

  ```shell
  kubectl get dataset hadoop
  ```

  Expected output:

  ```
  NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
  hadoop   210MiB           0.00B    4.00GiB          0.0%                Bound   1h
  ```
- Verify that the JindoRuntime is ready.

  ```shell
  kubectl get jindoruntime hadoop
  ```

  Expected output:

  ```
  NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
  hadoop   Ready          Ready          Ready        4m45s
  ```
- Verify that the persistent volume (PV) and persistent volume claim (PVC) are created.

  ```shell
  kubectl get pv,pvc
  ```

  Expected output:

  ```
  NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
  persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m

  NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
  persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX                           52m
  ```
The Dataset and JindoRuntime are ready when all phases show Ready and the PVC status is Bound.
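To see what the `quota`, `high`, and `low` settings in `resource.yaml` mean in practice, the sketch below converts them into byte thresholds: with a 2 Gi quota, eviction begins once a worker's cache rises above 99% of the quota and stops once usage falls back to 95%.

```python
# Sketch: translate the JindoRuntime tieredstore watermarks into byte
# thresholds. The values are taken from the resource.yaml example above.
GIB = 1024 ** 3

quota_bytes = 2 * GIB      # quota: 2Gi per worker
high = 0.99                # eviction starts above this fraction of quota
low = 0.95                 # eviction stops at this fraction of quota

evict_start = quota_bytes * high
evict_stop = quota_bytes * low

print(f"eviction starts above {evict_start / 1024**2:.0f} MiB")
print(f"eviction stops at     {evict_stop / 1024**2:.0f} MiB")
```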
Step 3: Test data acceleration
Deploy a pod that mounts the Dataset PVC and compare file read times before and after data is cached.
- Create a file named `app.yaml` with the following content.

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: demo-app
  spec:
    containers:
      - name: demo
        image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
        volumeMounts:
          - mountPath: /data
            name: hadoop
    volumes:
      - name: hadoop
        persistentVolumeClaim:
          claimName: hadoop
  ```
- Deploy the pod.

  ```shell
  kubectl create -f app.yaml
  ```
- Open a shell in the pod and check the size of the test file.

  ```shell
  kubectl exec -it demo-app -- bash
  du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
  ```

  Expected output:

  ```
  210M    /data/spark-3.0.1-bin-hadoop2.7.tgz
  ```
- Measure the initial read time. This access is served directly from OSS, with no cache.

  ```shell
  time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
  ```

  Expected output:

  ```
  real    0m18.386s
  user    0m0.002s
  sys     0m0.105s
  ```

  Reading the file takes about 18 seconds.
- Check the cached data after the read.

  ```shell
  kubectl get dataset hadoop
  ```

  Expected output:

  ```
  NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
  hadoop   210.00MiB        210.00MiB   4.00GiB          100.0%              Bound   1h
  ```

  The full 210 MiB is now cached in local storage.
- Delete and recreate the pod to clear the OS page cache, so that the next read is served from the JindoFS cache rather than from memory.

  ```shell
  kubectl delete -f app.yaml && kubectl create -f app.yaml
  ```
- Measure the read time again with data served from cache.

  ```shell
  kubectl exec -it demo-app -- bash
  time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
  ```

  Expected output:

  ```
  real    0m0.048s
  user    0m0.001s
  sys     0m0.046s
  ```

  With the JindoFS cache, the same read completes in about 48 milliseconds, more than 300 times faster than the direct read from OSS.
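A quick sanity check of the claimed speedup, using the two `real` times from this run (your own numbers will differ):

```python
# Compare the cold read (direct from OSS) with the warm read (served from
# the MEM cache tier), using the `real` times measured above.
cold_read_s = 18.386   # first read, no cache
warm_read_s = 0.048    # second read, after the file is fully cached

speedup = cold_read_s / warm_read_s
print(f"speedup: about {speedup:.0f}x")  # roughly 383x for these timings
```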
(Optional) Clean up
When data acceleration is no longer needed, delete the pod, the Dataset, and the JindoRuntime.
Delete the pod:

```shell
kubectl delete pod demo-app
```

Delete the Dataset and the JindoRuntime:

```shell
kubectl delete dataset hadoop
```
What's next
- To submit machine learning training jobs that use JindoFS-accelerated data, see the cloud-native AI suite documentation.
- To explore other cache storage options (`HDD`, `SSD`), update the `mediumtype` and `path` fields in `resource.yaml`.
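For example, an SSD-backed tier might look like the following sketch. The `/mnt/disk1` path and the 10Gi quota are illustrative assumptions, not values from this guide; point `path` at an SSD-backed directory that actually exists on your worker nodes.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: SSD    # cache on local SSD instead of memory
        path: /mnt/disk1   # illustrative: an SSD-backed directory on the node
        quota: 10Gi        # illustrative: SSD tiers can hold more than MEM
        high: "0.99"
        low: "0.95"
```

An SSD tier trades some read latency for a much larger cache capacity than the memory-backed `/dev/shm` tier used in this walkthrough.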