Container Service for Kubernetes: Accelerate data access for PV storage volumes

Last Updated: Mar 26, 2026

Reading large datasets from a remote persistent volume (PV) can bottleneck AI/ML training and other data-intensive workloads because every read must cross the network. JindoRuntime is a Fluid runtime engine from Alibaba Cloud EMR, built on JindoFS. It sits between your application pods and the PV and caches frequently accessed data in memory or on local disk, so that subsequent reads of cached data do not go back to remote storage. JindoFS, written in C++, provides dataset management and caching capabilities for Fluid and supports integration with any self-managed storage system, such as CephFS.

This topic shows how to deploy JindoRuntime on an ACK Pro cluster to accelerate reads from an existing PV storage volume.

Prerequisites

Before you begin, ensure that you have:

  • An ACK Pro cluster running a non-ContainerOS operating system, with Kubernetes 1.18 or later. See Create an ACK Pro cluster.

    Important

    ack-fluid is not supported on ContainerOS.

  • ack-fluid 1.0.6 or later, installed as part of the cloud-native AI suite.

    Important

    If you have already installed open-source Fluid, uninstall it before deploying ack-fluid.

  • A kubectl client connected to the ACK Pro cluster. See Connect to a cluster using kubectl.

  • Persistent volumes (PVs) and persistent volume claims (PVCs) already created for your target storage system. Follow the official documentation for your storage system to make sure the connection to the cluster is stable.
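
The minimum versions listed above (Kubernetes 1.18, ack-fluid 1.0.6) can be checked with a small comparison helper. The following is a minimal sketch: `version_at_least` is a hypothetical helper, not part of kubectl or ack-fluid, and the version strings passed to it are made-up examples rather than values from a real cluster.

```shell
# version_at_least ACTUAL MINIMUM
# Exit status 0 when ACTUAL >= MINIMUM, comparing major.minor.patch numerically.
# A leading "v" and suffixes such as "-aliyun.1" are ignored.
version_at_least() {
  awk -v actual="$1" -v minimum="$2" 'BEGIN {
    sub(/^v/, "", actual); sub(/^v/, "", minimum)
    split(actual, a, "."); split(minimum, m, ".")
    for (i = 1; i <= 3; i++) {
      if ((a[i] + 0) > (m[i] + 0)) exit 0
      if ((a[i] + 0) < (m[i] + 0)) exit 1
    }
    exit 0
  }'
}

# Example version strings (assumptions, not read from a cluster).
version_at_least "v1.26.3-aliyun.1" "1.18" && echo "Kubernetes version OK"
version_at_least "1.0.5" "1.0.6" || echo "ack-fluid is older than 1.0.6"
```

The Kubernetes server version to pass in can be read from the output of `kubectl version`.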

Step 1: Verify PV and PVC status

Run the following command to list PVs and PVCs in the cluster.

kubectl get pvc,pv

Expected output:

NAME                                          STATUS   VOLUME                          CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/demo-pvc                Bound    demo-pv                         5Gi        RWX                           19h

NAME                                             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                           STORAGECLASS   REASON   AGE
persistentvolume/demo-pv                         30Gi       RWX            Retain           Bound    default/demo-pvc                                        19h

The output shows that demo-pv (30 GiB, ReadWriteMany) is bound to demo-pvc. Both are ready to use.
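
In a script, the same check can be made without reading the table by eye. The sketch below assumes the default table layout shown above, where the second column of a PVC row is STATUS. `pvc_is_bound` is a hypothetical helper name, and the sample row is copied from the expected output.

```shell
# pvc_is_bound NAME
# Reads `kubectl get pvc` (or `kubectl get pvc,pv`) table output on stdin and
# succeeds only if a PVC row called NAME has STATUS "Bound".
pvc_is_bound() {
  awk -v name="$1" '
    { row = $1; sub(/^persistentvolumeclaim\//, "", row) }
    row == name && $2 == "Bound" { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Sample rows taken from the expected output in this step.
sample='NAME                             STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/demo-pvc   Bound    demo-pv   5Gi        RWX                           19h'

printf '%s\n' "$sample" | pvc_is_bound demo-pvc && echo "demo-pvc is Bound"
```

Against a live cluster, the input would come from `kubectl get pvc,pv | pvc_is_bound demo-pvc` instead of the sample text.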

Step 2: Create a Dataset and JindoRuntime

A Fluid Dataset declares which PVC to cache, and a JindoRuntime starts the JindoFS distributed caching system. Both resources must share the same name so that Fluid can associate them automatically.

  1. Create dataset.yaml with the following content.

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: pv-demo-dataset
    spec:
      mounts:
        - mountPoint: pvc://demo-pvc  # Format: pvc://<pvc-name>/<path>. The path must exist in the storage volume.
          name: data
          path: /
      accessModes:
        - ReadOnlyMany
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: pv-demo-dataset            # Must match the Dataset name above.
    spec:
      replicas: 2                      # Number of JindoFS worker replicas. Adjust based on your cluster size.
      tieredstore:
        levels:
          - mediumtype: MEM            # Cache medium: HDD, SSD, or MEM.
            volumeType: emptyDir       # emptyDir for memory/system-disk cache; hostPath for data-disk cache.
            path: /dev/shm             # Use a memory-backed path for best performance.
            quota: 10Gi                # Maximum cache capacity per worker. Adjust as needed.
            high: "0.9"                # Eviction threshold: start evicting when usage reaches 90%.
            low: "0.8"                 # Eviction target: evict until usage drops to 80%.

    Choosing `mediumtype` and `volumeType`

    These two parameters determine where cached data lives on each worker node. The right combination depends on your cache storage requirements:

    | Cache location | mediumtype | volumeType | When to use |
    | --- | --- | --- | --- |
    | Memory (/dev/shm) | MEM | emptyDir | Fastest reads; cache is released when the pod exits, so no residual data accumulates on nodes. |
    | System disk | HDD or SSD | emptyDir | Persistent local disk; cache is still cleaned up automatically when the pod exits. |
    | Data disk | HDD or SSD | hostPath | Use when you have a dedicated data disk; set path to the disk's mount point on the host. |

    For more guidance, see Strategy 2: Select a cache medium.

    Key parameters

    | Parameter | Description |
    | --- | --- |
    | mountPoint | The data source to cache. Format: pvc://<pvc-name>/<path>. The <pvc-name> must be in the same namespace as the Dataset. The <path> must exist in the storage volume. |
    | replicas | Number of JindoFS worker replicas. |
    | mediumtype | Cache medium type. Valid values: HDD, SSD, MEM. |
    | volumeType | Volume type for cache storage. Valid values: emptyDir and hostPath. The default value is hostPath. See the table above. |
    | path | Directory where workers store cached data. |
    | quota | Maximum cache capacity per worker. |
  2. Apply the configuration.

    kubectl create -f dataset.yaml
  3. Wait for the Dataset to become ready.

    kubectl get dataset pv-demo-dataset

    Note

    On first startup, JindoFS pulls container images. This typically takes 2 to 3 minutes depending on your network.

    Expected output when ready:

    NAME              UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    pv-demo-dataset   10.96GiB         0.00B    20.00GiB         0.0%                Bound   2m13s

    A PHASE of Bound means the JindoFS caching system is running and application pods can start using the Dataset.
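
Because startup can take a few minutes, automation usually polls for the phase rather than checking once. The sketch below is a generic retry loop. `wait_for_output` is a hypothetical helper, and the commented kubectl call assumes that the PHASE column is backed by the Dataset's `.status.phase` field.

```shell
# wait_for_output WANT TRIES DELAY COMMAND [ARGS...]
# Runs COMMAND up to TRIES times, sleeping DELAY seconds between attempts,
# and succeeds as soon as its standard output equals WANT.
wait_for_output() {
  want=$1 tries=$2 delay=$3
  shift 3
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if [ "$("$@" 2>/dev/null)" = "$want" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt follows.
    [ "$attempt" -le "$tries" ] && sleep "$delay"
  done
  return 1
}

# Against a live cluster (assumes .status.phase holds the PHASE value):
# wait_for_output Bound 30 10 \
#   kubectl get dataset pv-demo-dataset -o jsonpath='{.status.phase}'
```

With 30 attempts spaced 10 seconds apart, the loop gives up after roughly five minutes, which leaves headroom over the 2 to 3 minute image pull mentioned above.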

Step 3 (optional): Prefetch data into the cache

Without prefetching, the first read of each file fetches data from remote storage, which is slower. Fluid's DataLoad resource lets you warm the cache ahead of time so that all subsequent reads come from the local cache.

  1. Create dataload.yaml with the following content.

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: dataset-warmup
    spec:
      dataset:
        name: pv-demo-dataset    # Name of the Dataset to prefetch.
        namespace: default       # Must match the namespace of the DataLoad object.
      loadMetadata: true         # Required for JindoRuntime: syncs file metadata before prefetching.
      target:
        - path: /                # Path relative to the Dataset mount point. "/" prefetches everything.
          replicas: 1            # Number of cache copies to create for each file.
  2. Create the DataLoad object.

    kubectl create -f dataload.yaml
  3. Monitor the prefetch job.

    kubectl get dataload dataset-warmup

    Expected output when complete:

    NAME             DATASET           PHASE      AGE   DURATION
    dataset-warmup   pv-demo-dataset   Complete   62s   12s
  4. Confirm that the cache is fully populated.

    kubectl get dataset

    Expected output:

    NAME              UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    pv-demo-dataset   10.96GiB         10.96GiB   20.00GiB         100.0%              Bound   3m13s

    When CACHED matches UFS TOTAL SIZE and CACHED PERCENTAGE is 100.0%, all data is cached locally.
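
The same condition can be tested in a script before launching a training job. This sketch assumes the column layout shown in the expected output, where the fifth whitespace-separated field of a data row is CACHED PERCENTAGE. `dataset_fully_cached` is a hypothetical helper name, and the sample rows are copied from Steps 2 and 3.

```shell
# dataset_fully_cached NAME
# Reads `kubectl get dataset` table output on stdin; succeeds when the row for
# NAME reports a CACHED PERCENTAGE of at least 100%.
dataset_fully_cached() {
  awk -v name="$1" '
    $1 == name { pct = $5; sub(/%$/, "", pct); if (pct + 0 >= 100) full = 1 }
    END { exit(full ? 0 : 1) }
  '
}

# Rows copied from the expected outputs before and after prefetching.
before='pv-demo-dataset   10.96GiB   0.00B      20.00GiB   0.0%     Bound   2m13s'
after='pv-demo-dataset   10.96GiB   10.96GiB   20.00GiB   100.0%   Bound   3m13s'

printf '%s\n' "$before" | dataset_fully_cached pv-demo-dataset || echo "cache is cold"
printf '%s\n' "$after" | dataset_fully_cached pv-demo-dataset && echo "cache is warm"
```

If a column is empty in real output, the field positions shift, so treat this as a convenience check rather than a robust parser.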

Step 4: Access data through the cache

Mount the Dataset into an application pod by setting claimName to the Dataset name. JindoRuntime intercepts read requests and serves data from the local cache instead of the remote PV.

  1. Create pod.yaml with the following content.

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      containers:
        - name: nginx
          image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
          command:
          - "bash"
          - "-c"
          - "sleep inf"
          volumeMounts:
            - mountPath: /data
              name: data-vol
      volumes:
        - name: data-vol
          persistentVolumeClaim:
            claimName: pv-demo-dataset  # Set this to the Dataset name, not the original PVC name.
  2. Create the pod.

    kubectl create -f pod.yaml
  3. Open a shell in the pod and read data.

    kubectl exec -it nginx -- bash

    Inside the pod, verify the data is accessible and measure read throughput:

    # List files in the mounted directory.
    ls -lh /data
    total 11G
    -rw-r----- 1 root root 11G Jul 22  2022 demofile
    
    # Read the entire file and discard output to measure throughput.
    time cat /data/demofile > /dev/null
    real    0m11.004s
    user    0m0.065s
    sys     0m3.089s

    Because the entire dataset is already cached by JindoFS, the read is served from the in-memory cache rather than from the remote storage system, which removes the transfer from remote storage from the read path.
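
To turn the `time` result into a throughput figure, divide the data size by the elapsed time. The sketch below uses the 10.96 GiB reported as UFS TOTAL SIZE and the 11.004 s `real` time from the sample run. `throughput_mib_per_s` is a hypothetical helper, and the result is a rough average from a single run, not a benchmark.

```shell
# throughput_mib_per_s SIZE_GIB SECONDS
# Prints the average read throughput in MiB/s, rounded to a whole number.
throughput_mib_per_s() {
  awk -v gib="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", gib * 1024 / secs }'
}

throughput_mib_per_s 10.96 11.004   # prints 1020
```

That works out to roughly 1 GiB/s for the cached read. Running the same calculation on a read taken before prefetching gives a baseline for comparison.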

What's next