In edge computing, each OSS file access travels over the cloud-edge network, adding significant latency. Fluid caches OSS data on edge node memory so that repeated reads bypass the network entirely. This tutorial walks you through uploading a test dataset to OSS, creating a Dataset and JindoRuntime on an edge node pool, deploying a test Pod, and verifying the caching effect—reducing a 210 MiB file read from 18 seconds down to 48 milliseconds.
Prerequisites
Before you begin, make sure you have:
An ACK Edge cluster running Kubernetes 1.18 or later. See Create an ACK Edge cluster.
An edge node pool with edge nodes added. See Create an edge node pool and Add edge nodes.
The cloud-native AI suite installed with the ack-fluid component deployed (a quick verification command follows this list).
Important: Uninstall any open-source Fluid installation before deploying ack-fluid.
If the suite is not yet deployed: enable Fluid under Data Access Acceleration when deploying the suite.
If the suite is already deployed: go to the Cloud-native AI Suite page in the ACK console and deploy ack-fluid.
kubectl connected to the ACK cluster. See Get a cluster kubeconfig and connect to the cluster using kubectl.
Object Storage Service (OSS) activated. See Activate OSS.
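To confirm that the ack-fluid component is up before you continue, you can check its controller Pods. This is only a quick sanity check and assumes the component runs in the default fluid-system namespace.
# List the Fluid control-plane Pods installed by ack-fluid; all of them should be Running
kubectl get pods -n fluid-system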
Step 1: Upload data to OSS
Download the test dataset to an Elastic Compute Service (ECS) instance:
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
Upload the dataset to an OSS bucket.
Important: The following steps use an ECS instance running Alibaba Cloud Linux 3.2104 LTS 64-bit. For other operating systems, see ossutil and ossutil 1.0.
Create a bucket named examplebucket:
ossutil mb oss://examplebucket
Expected output:
0.668238(s) elapsed
Upload the test dataset to examplebucket:
ossutil cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
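Optionally, confirm that the object landed in the bucket before moving on. The exact output depends on your ossutil version.
# List the objects in examplebucket to confirm the upload succeeded
ossutil ls oss://examplebucket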
Step 2: Create a Dataset and a JindoRuntime
In an ACK Edge cluster, both edge node management and OSS access use the cloud-edge network. Deploy the Dataset and JindoRuntime to the same node pool as your workloads to keep data access within the node pool and to reserve bandwidth for the management channel.
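The manifests in this step select nodes by the alibabacloud.com/nodepool-id label. If you are unsure of your node pool ID, one way to look it up (assuming your edge nodes carry this label) is:
# Show each node together with its node pool ID label
kubectl get nodes -L alibabacloud.com/nodepool-id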
Create a file named mySecret.yaml with the following content. Replace xxx with the AccessKey ID and AccessKey secret used in Step 1.

apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx

Create the Secret. Storing the credentials in a Secret keeps the AccessKey pair out of the Dataset manifest so that it is not exposed in plaintext.
kubectl create -f mySecret.yaml
Create a file named resource.yaml with the following content:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: alibabacloud.com/nodepool-id
              operator: In
              values:
                - npxxxxxxxxxxxxxx
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir>
      options:
        fs.oss.endpoint: <oss_endpoint>
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  nodeSelector:
    alibabacloud.com/nodepool-id: npxxxxxxxxxxxxxx
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        volumeType: emptyDir
        quota: 2Gi
        high: "0.99"
        low: "0.95"

This template creates two resources:
A Dataset that specifies the OSS path to mount and references the Secret for credentials. The nodeAffinity field pins the Dataset to the target node pool.
A JindoRuntime that launches a JindoFS cluster for data caching. Set nodeSelector to the same node pool as the Dataset's nodeAffinity.
Key parameters:
mountPoint: The OSS path to mount, in the format oss://<oss_bucket>/<bucket_dir>. Must point to a directory, not a file. The endpoint is specified separately.
fs.oss.endpoint: The public or private endpoint of the OSS bucket. See Regions and endpoints.
replicas: The number of workers in the JindoFS cluster.
mediumtype: The cache storage type. Valid values: HDD, SSD, MEM.
path: The local storage path for cache data. Required when mediumtype is MEM.
quota: The maximum cache size. Unit: GiB.
high: The upper limit of cache usage, as a fraction of the storage capacity.
low: The lower limit of cache usage, as a fraction of the storage capacity.
Create the Dataset and JindoRuntime:
kubectl create -f resource.yaml
Verify the Dataset is bound:
kubectl get dataset hadoop
Expected output:
NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210MiB           0.00B    4.00GiB          0.0%                Bound   1h
Verify the JindoRuntime is ready:
kubectl get jindoruntime hadoop
Expected output:
NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
hadoop   Ready          Ready          Ready        4m45s
Verify the persistent volume (PV) and persistent volume claim (PVC) are created:
kubectl get pv,pvc
Expected output:
NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
persistentvolume/hadoop    100Gi      RWX            Retain           Bound    default/hadoop                           52m

NAME                             STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/hadoop     Bound    hadoop   100Gi      RWX                           52m
The Dataset and JindoRuntime are ready when all phases show Ready and the PVC status shows Bound.
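Binding can take a few minutes after the resources are created. If you prefer not to re-run the commands by hand, a small polling loop such as the following works; it is only a convenience sketch and assumes the Dataset status exposes a phase field, as current Fluid versions do.
# Poll the Dataset every 10 seconds until its phase reports Bound
until [ "$(kubectl get dataset hadoop -o jsonpath='{.status.phase}')" = "Bound" ]; do
  echo "waiting for dataset hadoop to become Bound..."
  sleep 10
done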
Step 3: Test data access acceleration
Deploy a test Pod to the same node pool, read a file twice, and compare the access times to observe the JindoFS caching effect.
Create a file named app.yaml with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  nodeSelector:
    alibabacloud.com/nodepool-id: npxxxxxxxxxxxxx
  containers:
    - name: demo
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      volumeMounts:
        - mountPath: /data
          name: hadoop
  volumes:
    - name: hadoop
      persistentVolumeClaim:
        claimName: hadoop

Note: Set nodeSelector to the same node pool ID used in Step 2.
Deploy the Pod:
kubectl create -f app.yaml
Verify the file is accessible and check its size:
kubectl exec -it demo-app -- bash
du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
Expected output:
210M	/data/spark-3.0.1-bin-hadoop2.7.tgz
Measure the first read time. Because no data is cached yet, JindoFS fetches the file from OSS over the cloud-edge network, so this read is slow.
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real	0m18.386s
user	0m0.002s
sys	0m0.105s
Exit the container shell, then confirm that the file is now fully cached:
kubectl get dataset hadoop
Expected output:
NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210.00MiB        210.00MiB   4.00GiB          100.0%              Bound   1h
The dataset shows 100% cached, meaning all data is stored in the JindoFS workers on the node pool.
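If you want the cache statistics in a machine-readable form rather than the table above, they are also recorded in the Dataset status. This assumes the status.cacheStates field that current Fluid versions populate.
# Print the cache statistics recorded in the Dataset status
kubectl get dataset hadoop -o jsonpath='{.status.cacheStates}'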
Recreate the Pod to clear the Linux page cache. This ensures the second read uses only the JindoFS cache, not the OS-level cache.
kubectl delete -f app.yaml && kubectl create -f app.yaml
Measure the second read time:
kubectl exec -it demo-app -- bash
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real	0m0.048s
user	0m0.001s
sys	0m0.046s
The second read completes in 48 milliseconds, more than 300 times faster than the first (18.386 s / 0.048 s ≈ 383), because JindoFS serves the data from local node memory instead of fetching it from OSS.
(Optional) Clean up the environment
Delete the test Pod:
kubectl delete pod demo-app
Delete the Dataset and JindoRuntime:
kubectl delete dataset hadoop
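The Secret created in Step 2 and the OSS test data are not removed by the commands above. If you no longer need them, the following cleanup is a reasonable sketch; the JindoRuntime deletion is only needed if the object is still present after the Dataset is removed.
# Remove the JindoRuntime created from resource.yaml, if it still exists
kubectl delete jindoruntime hadoop
# Remove the credentials Secret created in Step 2
kubectl delete secret mysecret
# Optionally remove the test object from OSS
ossutil rm oss://examplebucket/spark-3.0.1-bin-hadoop2.7.tgz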