Boost OSS Edge Access Speed & Cut Latency Using Fluid - ACK Edge

In edge computing, each OSS file access travels over the cloud-edge network, adding significant latency. Fluid caches OSS data in the memory of edge nodes so that repeated reads are served from the node pool instead of crossing the cloud-edge network again. This tutorial walks you through uploading a test dataset to OSS, creating a Dataset and JindoRuntime on an edge node pool, deploying a test Pod, and verifying the caching effect: a 210 MiB file read drops from 18 seconds to 48 milliseconds.

Prerequisites

Before you begin, make sure you have:

- An ACK Edge cluster running Kubernetes 1.18 or later. See Create an ACK Edge cluster.
- An edge node pool with edge nodes added. See Create an edge node pool and Add edge nodes.
- The cloud-native AI suite installed with the ack-fluid component deployed.
  Important: Uninstall any open-source Fluid installation before deploying ack-fluid.
  - If the suite is not yet deployed: enable Fluid under Data Access Acceleration when deploying the suite.
  - If the suite is already deployed: go to the Cloud-native AI Suite page in the ACK console and deploy ack-fluid.
- kubectl connected to the ACK cluster. See Get a cluster kubeconfig and connect to the cluster using kubectl.
- Object Storage Service (OSS) activated. See Activate OSS.

Step 1: Upload data to OSS

Download the test dataset to an Elastic Compute Service (ECS) instance:

wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

Upload the dataset to an OSS bucket.
Important
The following steps use an ECS instance running Alibaba Cloud Linux 3.2104 LTS 64-bit. For other operating systems, see ossutil and ossutil 1.0.
1. Install ossutil.
2. Create a bucket named examplebucket:
```
ossutil mb oss://examplebucket
```
  Expected output:
```
0.668238(s) elapsed
```
3. Upload the test dataset to examplebucket:
```
ossutil cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
```
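
Before moving on, it can help to confirm that the object actually landed in the bucket. The following is a minimal sketch that wraps `ossutil ls`. The helper name `object_uploaded` is hypothetical, and the check assumes that the listing output contains the full `oss://<bucket>/<object>` path of each object.

```shell
# Hypothetical helper: succeeds if the bucket listing mentions the object.
# Assumption: `ossutil ls` prints an oss://<bucket>/<object> path per object.
object_uploaded() {
  bucket="$1"
  object="$2"
  ossutil ls "oss://${bucket}" | grep -qF "oss://${bucket}/${object}"
}
```

For example, `object_uploaded examplebucket spark-3.0.1-bin-hadoop2.7.tgz && echo "upload verified"` prints the message only when the object is listed.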

Step 2: Create a Dataset and a JindoRuntime

In an ACK Edge cluster, both edge node management and OSS access use the cloud-edge network. Deploy the Dataset and JindoRuntime to the same node pool as your workloads to keep data access within the node pool and to reserve bandwidth for the management channel.

Create a file named mySecret.yaml with the following content. Replace xxx with the AccessKey ID and AccessKey secret used in Step 1.
```
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx
```
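Create the Secret. Referencing the credentials through a Secret keeps them out of the Dataset manifest. Note that Kubernetes stores Secret values base64-encoded rather than encrypted unless encryption at rest is enabled for the cluster.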
```
kubectl create -f mySecret.yaml
```

Create a file named resource.yaml with the following content:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
          - key: alibabacloud.com/nodepool-id
            operator: In
            values:
              - npxxxxxxxxxxxxxx
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir>
      options:
        fs.oss.endpoint: <oss_endpoint>
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  nodeSelector:
    alibabacloud.com/nodepool-id: npxxxxxxxxxxxxxx
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        volumeType: emptyDir
        quota: 2Gi
        high: "0.99"
        low: "0.95"

This template creates two resources:

- A Dataset that specifies the OSS path to mount and references the Secret for credentials. The nodeAffinity field pins the Dataset to the target node pool.
- A JindoRuntime that launches a JindoFS cluster for data caching. Set nodeSelector to the same node pool as the Dataset's nodeAffinity.

Key parameters:

| Parameter | Description |
| --- | --- |
| `mountPoint` | The OSS path to mount, in the format `oss://<oss_bucket>/<bucket_dir>`. Must point to a directory, not a file. The endpoint is specified separately. |
| `fs.oss.endpoint` | The public or private endpoint of the OSS bucket. See Regions and endpoints. |
| `replicas` | The number of workers in the JindoFS cluster. |
| `mediumtype` | The cache storage type. Valid values: `HDD`, `SSD`, `MEM`. |
| `path` | The local storage path for cache data. Required when `mediumtype` is `MEM`. |
| `quota` | The maximum cache size. Unit: GiB. |
| `high` | The high watermark of the cache tier, expressed as a ratio of the tier's storage capacity (`"0.99"` in this example). |
| `low` | The low watermark of the cache tier, expressed as a ratio of the tier's storage capacity (`"0.95"` in this example). |
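
As a quick sanity check on the sizing, the `CACHE CAPACITY` of 4.00GiB reported for this Dataset is consistent with `replicas` multiplied by `quota` (2 × 2 GiB). The sketch below does that arithmetic and checks that the 210 MiB test file fits. Treating the total as replicas × quota is an assumption inferred from the values in this example, not a documented formula.

```shell
replicas=2        # spec.replicas from resource.yaml
quota_gib=2       # tieredstore quota (2Gi) from resource.yaml
dataset_mib=210   # approximate size of the test file

# Assumed relationship: total capacity = workers x per-worker quota.
capacity_mib=$((replicas * quota_gib * 1024))
echo "cache capacity: ${capacity_mib} MiB"

if [ "$dataset_mib" -le "$capacity_mib" ]; then
  echo "dataset fits in cache"
else
  echo "dataset exceeds cache capacity"
fi
```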

Create the Dataset and JindoRuntime:
```
kubectl create -f resource.yaml
```

Verify the Dataset is bound:

kubectl get dataset hadoop

Expected output:

NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210MiB           0.00B    4.00GiB          0.0%                Bound   1h

Verify the JindoRuntime is ready:

kubectl get jindoruntime hadoop

Expected output:

NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
hadoop   Ready          Ready          Ready        4m45s

Verify the persistent volume (PV) and persistent volume claim (PVC) are created:

kubectl get pv,pvc

Expected output:

NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m

NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX                           52m

The Dataset and JindoRuntime are ready when the Dataset phase shows Bound, the master, worker, and FUSE phases of the JindoRuntime all show Ready, and the PV and PVC status shows Bound.
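
In a script, the Dataset check above can be polled instead of run by hand. The following is a minimal sketch: the function name `wait_for_dataset` and the `POLL_INTERVAL` variable are hypothetical, and reading the phase from `.status.phase` is an assumption about the Dataset resource's status layout.

```shell
# Hypothetical helper: poll until the Dataset reports the Bound phase.
# Assumption: the phase is exposed at .status.phase of the Dataset.
wait_for_dataset() {
  name="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    phase=$(kubectl get dataset "$name" -o jsonpath='{.status.phase}' 2>/dev/null || true)
    if [ "$phase" = "Bound" ]; then
      echo "dataset $name is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "dataset $name not Bound after $tries checks" >&2
  return 1
}
```

Calling `wait_for_dataset hadoop` checks up to 30 times, five seconds apart, and returns a non-zero status if the Dataset never becomes Bound.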

Step 3: Test data access acceleration

Deploy a test Pod to the same node pool, read a file twice, and compare the access times to observe the JindoFS caching effect.

Create a file named app.yaml with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  nodeSelector:
    alibabacloud.com/nodepool-id: npxxxxxxxxxxxxx
  containers:
    - name: demo
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      volumeMounts:
        - mountPath: /data
          name: hadoop
  volumes:
    - name: hadoop
      persistentVolumeClaim:
        claimName: hadoop

Note

Set nodeSelector to the same node pool ID used in Step 2.

Deploy the Pod:
```
kubectl create -f app.yaml
```
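
The Pod needs to be running before `kubectl exec` can open a shell in it. One way to block until then is `kubectl wait`, sketched below. The wrapper name `wait_for_pod` and the default timeout of 120 seconds are illustrative choices, not values from this tutorial.

```shell
# Hypothetical wrapper: block until the Pod reports the Ready condition.
# The 120s default timeout is an arbitrary illustrative value.
wait_for_pod() {
  kubectl wait --for=condition=Ready "pod/$1" --timeout="${2:-120s}"
}
```

For this tutorial the call would be `wait_for_pod demo-app`.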

Verify the file is accessible and check its size:

kubectl exec -it demo-app -- bash
du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz

Expected output:

210M    /data/spark-3.0.1-bin-hadoop2.7.tgz

Measure the first read time. Because no data is cached yet, JindoFS fetches the file from OSS over the cloud-edge network—this read will be slow.
```
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
```
Expected output:
```
real    0m18.386s
user    0m0.002s
sys     0m0.105s
```

Exit the Pod shell, then confirm that the file is now fully cached:

kubectl get dataset hadoop

Expected output:

NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210.00MiB        210.00MiB   4.00GiB          100.0%              Bound   1h

The dataset shows 100% cached, meaning all data is stored in the JindoFS workers on the node pool.
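
The cached percentage can also be read programmatically rather than from the table output. The sketch below is one possible approach: the function names are hypothetical, and the field path `.status.cacheStates.cachedPercentage` is an assumption about where the Dataset stores the value shown in the `CACHED PERCENTAGE` column.

```shell
# Hypothetical helpers for scripting the cache check.
# Assumption: the CACHED PERCENTAGE column is backed by
# .status.cacheStates.cachedPercentage on the Dataset.
cached_percentage() {
  kubectl get dataset "$1" -o jsonpath='{.status.cacheStates.cachedPercentage}'
}

is_fully_cached() {
  [ "$(cached_percentage "$1")" = "100.0%" ]
}
```

A script could then run `is_fully_cached hadoop && echo "cache is warm"` before starting the second measurement.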

Recreate the Pod to clear the Linux page cache. This ensures the second read uses only the JindoFS cache, not the OS-level cache.
```
kubectl delete -f app.yaml && kubectl create -f app.yaml
```
Measure the second read time:
```
kubectl exec -it demo-app -- bash
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
```
Expected output:
```
real    0m0.048s
user    0m0.001s
sys     0m0.046s
```
The second read completes in 48 milliseconds—more than 300x faster—because JindoFS serves the data from local node memory instead of fetching it from OSS.
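
To put the two measurements side by side, the following sketch computes the speedup and the approximate throughput of each read from the `real` times above. It treats the file size as 210 MiB, as reported by `du`, so the throughput figures are rough.

```shell
first_read=18.386   # seconds, uncached read from the first measurement
second_read=0.048   # seconds, cached read from the second measurement
size_mib=210        # approximate file size reported by du

speedup=$(awk -v a="$first_read" -v b="$second_read" 'BEGIN { printf "%.0f", a / b }')
cold_rate=$(awk -v s="$size_mib" -v t="$first_read" 'BEGIN { printf "%.1f", s / t }')
warm_rate=$(awk -v s="$size_mib" -v t="$second_read" 'BEGIN { printf "%.0f", s / t }')

echo "speedup: ${speedup}x"                      # speedup: 383x
echo "uncached throughput: ${cold_rate} MiB/s"   # about 11.4 MiB/s
echo "cached throughput: ${warm_rate} MiB/s"     # about 4375 MiB/s
```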

(Optional) Clean up the environment

Delete the test Pod:

kubectl delete pod demo-app

Delete the Dataset and JindoRuntime:

kubectl delete dataset hadoop