
Container Service for Kubernetes:Accelerate access to OSS files by using JindoFS

Last Updated: Mar 26, 2026

JindoRuntime is the execution engine of JindoFS, developed by the Alibaba Cloud E-MapReduce (EMR) team and implemented in C++. It provides dataset management and caching for Object Storage Service (OSS) data within a Kubernetes cluster. Alibaba Cloud provides cloud service-level support for JindoFS. Fluid manages and schedules JindoRuntime to deliver observability, auto scaling, and dataset portability.

This topic walks you through uploading a test dataset to OSS, creating a Dataset and JindoRuntime, and verifying the cache acceleration effect with a benchmark pod.

Limitations

  • JindoRuntime requires an ACK Pro cluster running a non-ContainerOS operating system. The ack-fluid component does not currently support ContainerOS.

  • If you have already installed open-source Fluid, uninstall it before deploying the ack-fluid component. Running both simultaneously is not supported.

Prerequisites

Before you begin, ensure that you have:

  • An ACK Pro cluster with a non-ContainerOS operating system, running Kubernetes 1.18 or later. See Create an ACK Pro cluster.

  • The ack-fluid component deployed in the cluster:

    • If the cloud-native AI suite is not installed, enable Fluid acceleration when you install the suite. See Deploy the cloud-native AI suite.

    • If the cloud-native AI suite is already installed, go to the Cloud-native AI Suite page in the ACK console and deploy the ack-fluid component.

  • A kubectl client connected to your ACK Pro cluster. See Connect to a cluster by using kubectl.

  • OSS activated. See Activate OSS.

Step 1: Upload data to OSS

The following steps use an Elastic Compute Service (ECS) instance running Alibaba Cloud Linux 3.2104 LTS 64-bit to upload a test dataset. For other operating systems, see ossutil and ossutil 1.0.

  1. Download the test dataset to your ECS instance.

    wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
  2. Install ossutil on the ECS instance.

  3. Create an OSS bucket named examplebucket. Bucket names must be globally unique, so replace examplebucket with your own bucket name in the following commands.

    ossutil64 mb oss://examplebucket

    Expected output:

    0.668238(s) elapsed
  4. Upload the test dataset to the bucket.

    ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket

Step 2: Create a Dataset and a JindoRuntime

JindoRuntime reads OSS credentials from a Kubernetes Secret, so create the Secret before creating the Dataset.

  1. Create a file named mySecret.yaml with the following content. Replace xxx with the AccessKey ID and AccessKey secret that have read access to the OSS bucket you created in Step 1.

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: xxx
      fs.oss.accessKeySecret: xxx
  2. Apply the Secret so that the AccessKey pair does not appear in plaintext in the Dataset manifest. Note that Kubernetes base64-encodes Secret data rather than encrypting it by default; enable encryption at rest if the values must be stored encrypted in etcd.

    kubectl create -f mySecret.yaml
  3. Create a file named resource.yaml with the following content. This file defines the Dataset (which points to your OSS data) and the JindoRuntime (which launches the caching cluster).

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: hadoop
    spec:
      mounts:
        - mountPoint: oss://<oss_bucket>/<bucket_dir>
          options:
            fs.oss.endpoint: <oss_endpoint>
          name: hadoop
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: hadoop
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            volumeType: emptyDir
            quota: 2Gi
            high: "0.99"
            low: "0.95"

    The key parameters are described below.

    • mountPoint (required): Path to the underlying file system (UFS), in the format oss://<oss_bucket>/<bucket_dir>. Do not include the OSS endpoint in this path.
    • fs.oss.endpoint (required): Public or internal endpoint of the OSS bucket. See Regions and endpoints.
    • replicas (required): Number of workers in the JindoFS caching cluster.
    • mediumtype (required): Cache storage medium. Valid values: HDD, SSD, and MEM.
    • path (required): Local storage path on the worker node. Only one path is allowed. When mediumtype is MEM, this path is also used to store data such as logs.
    • quota (required): Maximum cache size per worker.
    • high (optional): Cache eviction threshold (high watermark). When cache usage exceeds this ratio, eviction begins.
    • low (optional): Cache retention threshold (low watermark). Eviction stops when cache usage drops to this ratio.
  4. Create the Dataset and JindoRuntime.

    kubectl create -f resource.yaml
  5. Verify that the Dataset is bound.

    kubectl get dataset hadoop

    Expected output:

    NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hadoop   210MiB           0.00B    4.00GiB          0.0%                Bound   1h
  6. Verify that the JindoRuntime is ready.

    kubectl get jindoruntime hadoop

    Expected output:

    NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
    hadoop   Ready          Ready          Ready        4m45s
  7. Verify that the persistent volume (PV) and persistent volume claim (PVC) are created.

    kubectl get pv,pvc

    Expected output:

    NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
    persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m
    
    NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX                           52m

The Dataset and JindoRuntime are ready when all phases show Ready and the PVC status is Bound.
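
The readiness checks in steps 5 through 7 can also be scripted instead of run by hand. The following sketch polls the Dataset until it reports Bound, assuming the Fluid Dataset CRD exposes the phase at .status.phase (the field behind the PHASE column in the output above); wait_bound and is_ready are illustrative helper names, not part of Fluid or kubectl.

```shell
# Return success only when the given phase string is "Bound".
is_ready() {
  [ "$1" = "Bound" ]
}

# Poll the Fluid Dataset until it reports Bound, or give up after N tries.
# Assumes .status.phase holds the PHASE value shown by `kubectl get dataset`.
wait_bound() {
  name=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    phase=$(kubectl get dataset "$name" -o jsonpath='{.status.phase}' 2>/dev/null)
    if is_ready "$phase"; then
      echo "dataset $name is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "timed out waiting for dataset $name" >&2
  return 1
}

# Usage (run in a cluster where the hadoop Dataset has been applied):
# wait_bound hadoop
```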

Step 3: Test data acceleration

Deploy a pod that mounts the Dataset PVC and compare file read times before and after data is cached.

  1. Create a file named app.yaml with the following content.

    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-app
    spec:
      containers:
        - name: demo
          image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
          volumeMounts:
            - mountPath: /data
              name: hadoop
      volumes:
        - name: hadoop
          persistentVolumeClaim:
            claimName: hadoop
  2. Deploy the pod.

    kubectl create -f app.yaml
  3. Open a shell in the pod and check the file size.

    kubectl exec -it demo-app -- bash
    du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz

    Expected output:

    210M    /data/spark-3.0.1-bin-hadoop2.7.tgz
  4. Measure the initial read time. This access comes directly from OSS, with no cache.

    time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

    Expected output:

    real    0m18.386s
    user    0m0.002s
    sys     0m0.105s

    Reading the file takes about 18 seconds.

  5. Check the cached data after the read.

    kubectl get dataset hadoop

    Expected output:

    NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hadoop   210.00MiB        210.00MiB   4.00GiB          100.0%              Bound   1h

    The full 210 MiB is now cached in local storage.

  6. Delete and recreate the pod to clear the operating system's page cache, so that the next read is served by the JindoFS cache rather than by locally cached pages.

    kubectl delete -f app.yaml && kubectl create -f app.yaml
  7. Measure the read time again with data served from cache.

    kubectl exec -it demo-app -- bash
    time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

    Expected output:

    real    0m0.048s
    user    0m0.001s
    sys     0m0.046s

    With the JindoFS cache, the same file read completes in about 48 milliseconds, more than 300 times faster than the direct OSS access.
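
The speedup can be checked directly from the two `time` measurements. The sketch below converts `real` values of the form 0m18.386s to seconds and computes the ratio; to_seconds is an illustrative helper, and the two timings are the ones measured in this walkthrough.

```shell
# Convert a `time` "real" value such as 0m18.386s to seconds.
to_seconds() {
  t=${1%s}        # drop the trailing "s"
  min=${t%m*}     # minutes portion, before "m"
  sec=${t#*m}     # seconds portion, after "m"
  awk -v m="$min" -v s="$sec" 'BEGIN { printf "%.3f", m * 60 + s }'
}

cold=$(to_seconds 0m18.386s)   # first read, straight from OSS
warm=$(to_seconds 0m0.048s)    # second read, served from the JindoFS cache
awk -v c="$cold" -v w="$warm" 'BEGIN { printf "speedup: %.0fx\n", c / w }'
```

With the two measurements above, this prints speedup: 383x, consistent with the more-than-300-fold difference noted earlier.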

(Optional) Clean up

When data acceleration is no longer needed, delete the pod, the Dataset, and the JindoRuntime.

Delete the pod:

kubectl delete pod demo-app

Delete the Dataset and JindoRuntime:

kubectl delete dataset hadoop

What's next

  • To submit machine learning training jobs that use JindoFS-accelerated data, see the cloud-native AI suite documentation.

  • To explore other cache storage options (HDD, SSD), update the mediumtype and path fields in resource.yaml.
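
For example, a JindoRuntime whose cache level is backed by local SSD might look like the following sketch. The path /mnt/disk1 is an assumed example mount point on the worker node, not a value from this walkthrough, and the quota should match the disk space you can dedicate to the cache.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: SSD      # cache on local SSD instead of memory
        path: /mnt/disk1     # assumed example mount point on the worker node
        quota: 50Gi          # per-worker cache size; adjust to your disk
        high: "0.99"
        low: "0.95"
```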