Container Compute Service: Use Fluid to accelerate data access

Last Updated: Mar 26, 2026

JindoRuntime is based on C++ and supports dataset management, data caching, and data storage in OSS. Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use Fluid to accelerate data access in scenarios in which ACS compute power is used.

When Pods read OSS data repeatedly, each read fetches the data over the network — even if the same file was just accessed moments ago. JindoFS eliminates those repeat round trips by caching data in local memory. Once a file is cached, subsequent reads serve it at near-local speed. The example in this topic demonstrates a 9x speedup on a 210 MiB file after the first cached read.

Prerequisites

Before you begin, make sure that an ACS cluster is available, that the ack-fluid component is installed, and that you can connect to the cluster by using kubectl.

Step 1: Upload data to OSS

  1. Download the test dataset.

    wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
  2. Upload the dataset to an OSS bucket.

    1. Install ossutil.

    2. Create a bucket named examplebucket.

      ossutil64 mb oss://examplebucket

      Expected output:

      0.668238(s) elapsed

      Note

      If the command returns ErrorCode=BucketAlreadyExists, the bucket already exists. OSS bucket names must be globally unique — change the name as needed.
    3. Upload the dataset to the bucket.

      ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
    4. (Optional) Configure bucket and data access permissions. See Permission control.

    Important

    The preceding sub-steps use an ECS instance running Alibaba Cloud Linux 3.2104 LTS 64-bit. For other operating systems, see the ossutil command reference and ossutil 1.0.

  3. Create a file named mySecret.yaml with the following content.

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: <your-access-key-id>          # Replace with your AccessKey ID
      fs.oss.accessKeySecret: <your-access-key-secret>  # Replace with your AccessKey Secret

    Storing the AccessKey pair in a Secret keeps it out of plaintext resource definitions. Note that Kubernetes stores Secret data base64-encoded rather than encrypted by default, so restrict access to the Secret and enable encryption at rest if you need stronger protection.
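A quick way to see how Secret values are stored: base64 encoding is trivially reversible, which is why access to the Secret itself must be restricted. The value below is an obvious placeholder, not a real credential:

```shell
# Secret values are stored base64-encoded, not encrypted.
# "example-access-key-id" is a placeholder, not a real credential.
encoded=$(printf 'example-access-key-id' | base64)
echo "$encoded"                       # ZXhhbXBsZS1hY2Nlc3Mta2V5LWlk
printf '%s' "$encoded" | base64 -d    # prints the original value back
echo
```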

  4. Apply the Secret.

    kubectl create -f mySecret.yaml

Step 2: Create a Dataset and a JindoRuntime

Note

Before proceeding, verify that the dataset-controller and jindoruntime-controller of the ack-fluid component are running:

kubectl get pods --field-selector=status.phase=Running -n fluid-system
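If you want to check the result in a script rather than by eye, you can filter the output for both controllers. A minimal sketch, run here against a sample of the output shape (the pod name suffixes are illustrative placeholders):

```shell
# Sample `kubectl get pods` output; in practice, pipe the real command output in.
pods='dataset-controller-7fdb5c4b8-abcde        1/1   Running   0   2d
jindoruntime-controller-5b47c6fd6-fghij   1/1   Running   0   2d'

for c in dataset-controller jindoruntime-controller; do
  if echo "$pods" | grep -q "^$c"; then
    echo "$c: Running"
  else
    echo "$c: MISSING"
  fi
done
```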

This example uses CPU compute power. If you want to accelerate the loading of large language models (LLMs) instead, make sure that the zone of your cluster provides GPU resources. See Introduction to GPU compute classes.

  1. Create a file named resource.yaml with the following content. The file defines a Dataset that points to your OSS data, and a JindoRuntime that launches a JindoFS cluster to cache it.

    The following parameters are used in the configuration:

    • mountPoint: The OSS path to mount as the underlying file system (UFS). Use the format oss://<bucket>, or oss://<bucket>/<path> for a subdirectory.

    • fs.oss.endpoint: The public or private endpoint of the OSS bucket. Example: oss-cn-beijing-internal.aliyuncs.com. See OSS regions and endpoints.

    • replicas: The number of worker nodes in the JindoFS cluster.

    • mediumtype: The cache storage medium. Supported values: MEM, HDD, SSD.

    • quota: The maximum cache size per worker.

    • high / low: The upper and lower thresholds for cache eviction.
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: hadoop
    spec:
      placement: Shared          # Required for ACS virtual nodes
      mounts:
          # To mount a subdirectory, use oss://<oss_bucket>/<oss_path>
        - mountPoint: oss://<oss_bucket>       # Replace with your OSS bucket name, e.g. oss://examplebucket
          options:
            fs.oss.endpoint: <oss_endpoint>    # Replace with your OSS endpoint, e.g. oss-cn-beijing-internal.aliyuncs.com
          name: hadoop
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: hadoop                # Must match the Dataset name
    spec:
      networkmode: ContainerNetwork
      replicas: 4                 # Number of JindoFS worker nodes; adjust as needed
      master:
        podMetadata:
          labels:
            alibabacloud.com/compute-class: performance
            alibabacloud.com/compute-qos: default
      worker:
        podMetadata:
          labels:
            alibabacloud.com/compute-class: performance
            alibabacloud.com/compute-qos: default
        resources:
          requests:
            cpu: 24
            memory: 48Gi
          limits:
            cpu: 24
            memory: 48Gi
      tieredstore:
        levels:
          - mediumtype: MEM       # Cache medium: MEM, HDD, or SSD
            path: /dev/shm        # Storage path for the cache medium
            volumeType: emptyDir
            quota: 48Gi           # Maximum cache size per worker; adjust as needed
            high: "0.99"          # Eviction starts when usage reaches this threshold
            low: "0.95"           # Eviction stops when usage drops to this threshold
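To make the high and low thresholds concrete: assuming they are interpreted as fractions of the per-worker quota (48 Gi here), eviction begins once cache usage rises above high * quota and stops once it falls below low * quota. A quick sanity check of those numbers (plain awk arithmetic, not a Fluid command):

```shell
# high/low as fractions of the 48 Gi per-worker quota in this example.
awk 'BEGIN {
  quota = 48   # Gi
  printf "eviction starts above: %.2f Gi\n", 0.99 * quota
  printf "eviction stops below:  %.2f Gi\n", 0.95 * quota
}'
```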

  2. Apply the configuration.

    kubectl create -f resource.yaml
  3. Verify that the Dataset and JindoRuntime are ready. Check the Dataset:

    kubectl get dataset hadoop

    Expected output:

    NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hadoop   209.74MiB        0.00B    4.00GiB          0.0%                Bound   56s

    Check the JindoRuntime:

    kubectl get jindoruntime hadoop

    Expected output:

    NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
    hadoop   Ready          Ready          Ready        2m11s
  4. Confirm that the persistent volume (PV) and persistent volume claim (PVC) are created. The PV uses the Dataset name.

    kubectl get pv,pvc

    Expected output:

    NAME                              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
    persistentvolume/default-hadoop   100Pi      ROX            Retain           Bound    default/hadoop   fluid          <unset>                          2m5s
    
    NAME                           STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
    persistentvolumeclaim/hadoop   Bound    default-hadoop   100Pi      ROX            fluid          <unset>                 2m5s

Step 3: Create a DataLoad resource

Preloading the dataset into the JindoFS cache before your workload runs ensures that even the first access is served from the cache, which improves data access efficiency.

  1. If the data in your OSS bucket is static, create a file named dataload.yaml with the following content.

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: hadoop
    spec:
      dataset:
        name: hadoop
        namespace: default
      loadMetadata: true

    If the data changes periodically, set up a recurring preload instead. See Scenario 2: Data in the backend storage is read-only but periodically changes.

  2. Apply the DataLoad resource to start preloading.

    kubectl create -f dataload.yaml
  3. Monitor preloading progress.

    kubectl get dataload

    Expected output when complete:

    NAME     DATASET   PHASE      AGE   DURATION
    hadoop   hadoop    Complete   92m   51s

Step 4: Verify data acceleration

Deploy a test Pod that mounts the Dataset and measure file copy time before and after JindoFS caching takes effect.

  1. Create a file named app.yaml.

    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-app
      labels:
        # Required: instructs the Fluid webhook to inject JindoFS sidecar containers into ACS Pods
        alibabacloud.com/fluid-sidecar-target: acs
    spec:
      containers:
        - name: demo
          image: mirrors-ssl.aliyuncs.com/nginx:latest
          volumeMounts:
            - mountPath: /data
              name: hadoop
          resources:
            requests:
              cpu: 14
              memory: 56Gi
      volumes:
        - name: hadoop
          persistentVolumeClaim:
            claimName: hadoop    # Matches the Fluid Dataset name
      nodeSelector:
        type: virtual-kubelet
      tolerations:
        - key: virtual-kubelet.io/provider
          operator: Equal
          value: alibabacloud
          effect: NoSchedule
  2. Deploy the Pod.

    kubectl create -f app.yaml
  3. Measure the file copy time without JindoFS caching. Check the file size:

    kubectl exec -it demo-app -c demo -- du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz

    Expected output:

    210M    /data/spark-3.0.1-bin-hadoop2.7.tgz

    Time the copy:

    kubectl exec -it demo-app -c demo -- bash
    time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

    Expected output:

    real    0m1.883s
    user    0m0.001s
    sys     0m0.041s
  4. Confirm that the data is fully cached.

    kubectl get dataset hadoop

    Expected output:

    NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hadoop   209.74MiB        209.74MiB   4.00GiB          100.0%              Bound   64m
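If you prefer to check the cached percentage in a script rather than by eye, you can parse the table output. A minimal sketch, shown here against the expected output above rather than a live cluster (it assumes the column layout matches):

```shell
# Parse the CACHED PERCENTAGE column (5th field of the data row).
output='NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   209.74MiB        209.74MiB   4.00GiB          100.0%              Bound   64m'

pct=$(echo "$output" | awk 'NR == 2 { print $5 }')
echo "cached: $pct"      # cached: 100.0%
```

Against a live cluster, the same extraction would be `kubectl get dataset hadoop | awk 'NR == 2 { print $5 }'`, assuming the default column order shown above.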
  5. Delete and recreate the Pod, then rerun the copy test against the JindoFS cache.

    Note

    Recreating the Pod clears the OS page cache, so the second measurement reflects only the JindoFS cache speed — not any in-memory residue from the first run.

    Delete the existing Pod:

    kubectl delete pod demo-app

    Recreate it:

    kubectl create -f app.yaml

    Run the copy test again:

    kubectl exec -it demo-app -c demo -- bash
    time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

    Expected output:

    real    0m0.203s
    user    0m0.000s
    sys     0m0.047s

    The copy now takes 0.203 seconds — about 9x faster than the 1.883 seconds without caching. The speedup comes from JindoFS serving the file from its in-memory cache on /dev/shm rather than fetching it from OSS over the network. Once data is cached locally, subsequent reads skip the network entirely.
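The 9x figure follows directly from the two measurements. A small sketch to reproduce the arithmetic from bash `time` output (the `to_seconds` function is a hypothetical helper written for this comparison, not part of any tool):

```shell
# Convert a bash `time` value such as 0m1.883s to seconds.
to_seconds() {
  echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); print $1 * 60 + $2 }'
}

uncached=$(to_seconds 0m1.883s)
cached=$(to_seconds 0m0.203s)
awk -v u="$uncached" -v c="$cached" 'BEGIN { printf "speedup: %.1fx\n", u / c }'
# speedup: 9.3x
```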

    Important

    The copy times shown here are for reference only and may vary based on your cluster configuration and network conditions.

Use ACS compute power in ACK Pro clusters

The steps above apply to ACS clusters. To use ACS compute power in ACK managed clusters instead, see Use the computing power of ACS in ACK Pro clusters.

For ACK managed clusters, make the following adjustments:

  1. Install the ack-fluid component in the ACK managed cluster. See Use Helm to simplify application deployment.

  2. Create the Dataset and JindoRuntime using the following configuration. The key difference from the ACS configuration is the absence of placement: Shared and networkmode, and no compute-class labels — standard ACK nodes do not require these settings.

    • ACS clusters use virtual nodes that do not support standard node scaling. To enable shared dataset access and inter-pod communication, set placement: Shared and networkmode: ContainerNetwork in the ACS configuration. These fields are not needed for ACK managed clusters.

    • Fluid workers on ACS require high bandwidth. Set compute-class: performance and configure sufficient CPU and memory resources in the ACS configuration to ensure adequate bandwidth. ACK managed clusters allocate resources differently and do not need these labels.

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: hadoop
    spec:
      mounts:
          # To mount a subdirectory, use oss://<oss_bucket>/<oss_path>
        - mountPoint: oss://<oss_bucket>       # Replace with your OSS bucket name
          options:
            fs.oss.endpoint: <oss_endpoint>    # Replace with your OSS endpoint
          name: hadoop
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: hadoop
    spec:
      replicas: 4                 # Adjust as needed
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            volumeType: emptyDir
            quota: 48Gi
            high: "0.99"
            low: "0.95"

What's next