Fluid is an open-source, Kubernetes-native, distributed dataset orchestration and acceleration engine. It is designed for data-intensive applications in cloud-native environments, such as big data and AI applications. In edge scenarios, Fluid can significantly accelerate file access from edge nodes to Object Storage Service (OSS). This topic describes how to use the Fluid data acceleration feature in an ACK Edge cluster.
Prerequisites
-
You have an ACK Edge cluster, version 1.18 or later. For more information, see Create an ACK Edge cluster.
-
You have created an edge node pool and added edge nodes to it. For more information, see Create an edge node pool and Add an edge node.
-
You have installed the cloud-native AI suite and deployed the ack-fluid component.
Important: If you have installed open source Fluid, uninstall it before you deploy the ack-fluid component.
-
If the cloud-native AI suite is not installed: enable Fluid data acceleration when you install the suite. For more information, see Deploy the AI suite console.
-
If the cloud-native AI suite is already installed: deploy ack-fluid on the Cloud-native AI Suite page of the ACK Console.
-
You have connected to your Kubernetes cluster by using kubectl. For more information, see Connect to an ACK cluster by using kubectl.
-
You have activated Object Storage Service (OSS). For more information, see Activate OSS.
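Before moving on, it can help to confirm that the ack-fluid component is actually in place. The function below is a minimal sketch, not an official check. It assumes that the Fluid CRDs carry their upstream names (datasets.data.fluid.io and jindoruntimes.data.fluid.io) and that the controllers run in a namespace called fluid-system, which is the upstream default and may differ in your installation. The function name check_fluid is hypothetical.

```shell
# Sketch: check that the Fluid CRDs exist and list the controller pods.
# Assumptions (not stated in this topic): the CRD names and the "fluid-system"
# namespace follow upstream Fluid defaults.
# Usage: check_fluid [namespace]
check_fluid() {
  local ns="${1:-fluid-system}"
  # Returns a non-zero exit code if either CRD is missing.
  kubectl get crd datasets.data.fluid.io jindoruntimes.data.fluid.io || return 1
  # Lists the controller pods so that you can confirm they are Running.
  kubectl get pods -n "$ns"
}
```

Call check_fluid with no arguments to use the assumed namespace, or pass the namespace that your installation uses.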
Step 1: Prepare OSS data
-
Run the following command to download the test data to an ECS instance.
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
-
Upload the downloaded test data to an OSS bucket.
Important: The following steps use an ECS instance running Alibaba Cloud Linux 3.2104 LTS 64-bit as an example to show how to upload data to OSS. For other operating systems, see ossutil quick start and ossutil command reference 1.0.
-
Create a bucket named examplebucket.
-
Run the following command to create examplebucket.
ossutil mb oss://examplebucket
-
The following output indicates that examplebucket has been created.
0.668238(s) elapsed
-
Upload the downloaded test data to the examplebucket bucket.
ossutil cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
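To confirm that the upload succeeded before you create the Dataset, you can list the bucket and look for the object. The following is a minimal sketch. The function name verify_upload is hypothetical, and the sketch assumes that ossutil ls prints each object as a full oss:// path, which may vary between ossutil versions.

```shell
# Sketch: confirm that an object is present in a bucket after upload.
# Assumption: "ossutil ls" prints object names as full oss:// paths.
# Usage: verify_upload <bucket> <object-name>
verify_upload() {
  local bucket="$1" object="$2"
  # List the bucket and search for the object path as a fixed substring.
  if ossutil ls "oss://${bucket}" | grep -qF "oss://${bucket}/${object}"; then
    echo "found: oss://${bucket}/${object}"
  else
    echo "missing: oss://${bucket}/${object}" >&2
    return 1
  fi
}
```

For the example in this step, the call would be verify_upload examplebucket spark-3.0.1-bin-hadoop2.7.tgz. Because the match is a substring match, an object whose name merely starts with the given name is also reported as found.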
Step 2: Create a Dataset and a JindoRuntime
Before you create a Dataset, create a file named mySecret.yaml.
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx
The fs.oss.accessKeyId and fs.oss.accessKeySecret parameters are the AccessKey ID and AccessKey Secret used to access OSS in Step 1.
Run the following command to create the Secret. Storing the AccessKey pair in a Secret keeps it out of the Dataset definition, so it is not exposed there as plaintext.
kubectl create -f mySecret.yaml
-
Create a resource.yaml file with the following content. This file serves two purposes:
-
Create a Dataset, which describes the remote dataset and provides information about the underlying file system (UFS).
-
Create a JindoRuntime to launch a JindoFS cluster for data caching.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: alibabacloud.com/nodepool-id
              operator: In
              values:
                - npxxxxxxxxxxxxxx
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir>
      options:
        fs.oss.endpoint: <oss_endpoint>
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  nodeSelector:
    alibabacloud.com/nodepool-id: npxxxxxxxxxxxxxx
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        volumeType: emptyDir
        quota: 2Gi
        high: "0.99"
        low: "0.95"
Note:
-
In an ACK Edge cluster, you must use nodeAffinity and nodeSelector to deploy the Dataset and JindoRuntime to the same node pool. This ensures that nodes in the node pool can communicate.
-
Because both edge node management and OSS access require cloud-to-edge network communication, we recommend that you ensure sufficient network bandwidth to maintain the stability of the control channel.
The following table describes the parameters.
| Parameter | Description |
| --- | --- |
| mountPoint | oss://<oss_bucket>/<bucket_dir> specifies the path of the UFS to mount. This path must point to a directory, not a single file. Do not include the endpoint in this path. |
| fs.oss.endpoint | The endpoint of the OSS bucket. You can use a public or internal endpoint. For more information, see Regions and endpoints. |
| replicas | The number of workers in the JindoFS cluster. |
| mediumtype | The cache medium type. JindoFS supports only one cache type at a time: HDD, SSD, or MEM. |
| path | The storage path. Only a single path is supported. If mediumtype is MEM, you must specify a local path for files such as logs. |
| quota | The maximum cache capacity, in GB. |
| high | The high watermark for storage usage. |
| low | The low watermark for storage usage. |
-
Run the following command to create the JindoRuntime and Dataset.
kubectl create -f resource.yaml
Run the following command to check the status of the Dataset.
kubectl get dataset hadoop
Expected output:
NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210MiB           0.00B    4.00GiB          0.0%                Bound   1h
Run the following command to check the status of the JindoRuntime.
kubectl get jindoruntime hadoop
Expected output:
NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
hadoop   Ready          Ready          Ready        4m45s
Run the following command to verify that the PersistentVolume (PV) and PersistentVolumeClaim (PVC) have been created.
kubectl get pv,pvc
Expected output:
NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m

NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX                           52m
The preceding output shows that the Dataset and JindoRuntime have been created.
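The checks in this step can also be scripted. The two helpers below are a rough sketch, and both function names are hypothetical. The first reflects the observation that the 4.00GiB shown under CACHE CAPACITY is consistent with each of the 2 workers contributing its full 2Gi quota. That relationship is inferred from the example output, not a documented formula. The second helper assumes kubectl 1.23 or later, which supports --for=jsonpath in kubectl wait, and assumes that the Dataset reports its phase in .status.phase.

```shell
# Sketch: helpers for checking the result of Step 2.

# Expected CACHE CAPACITY in GiB, assuming each worker contributes its full quota.
# Usage: expected_cache_gib <replicas> <quota-in-GiB-per-worker>
expected_cache_gib() {
  echo $(( $1 * $2 ))
}

# Block until the Dataset reports phase Bound, or fail after the timeout.
# Assumptions: kubectl 1.23 or later; the phase is exposed at .status.phase.
# Usage: wait_for_dataset <name> [timeout]
wait_for_dataset() {
  local name="$1" timeout="${2:-300s}"
  kubectl wait "dataset/${name}" --for=jsonpath='{.status.phase}'=Bound --timeout="$timeout"
}
```

For the configuration in this topic, expected_cache_gib 2 2 yields 4, and wait_for_dataset hadoop waits up to five minutes for the Dataset to become Bound.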
Step 3: Test data access acceleration
You can create an application container or submit a machine learning job to use the JindoFS acceleration service. This topic demonstrates the acceleration effect of JindoRuntime by using an application container to access the same data multiple times and comparing the access times.
-
Create a file named app.yaml with the following content.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  nodeSelector:
    alibabacloud.com/nodepool-id: npxxxxxxxxxxxxx
  containers:
    - name: demo
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      volumeMounts:
        - mountPath: /data
          name: hadoop
  volumes:
    - name: hadoop
      persistentVolumeClaim:
        claimName: hadoop
Note: In an ACK Edge cluster, you must use nodeSelector to deploy the test pod to the node pool specified in Step 2.
-
Run the following command to create the application container.
kubectl create -f app.yaml
-
Open a shell in the pod and check the file size.
kubectl exec -it demo-app -- bash
du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
Expected output:
210M    /data/spark-3.0.1-bin-hadoop2.7.tgz
Run the following command to time the file copy.
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real    0m18.386s
user    0m0.002s
sys     0m0.105s
The output shows that it took about 18 seconds to copy the file.
Exit the pod shell, and then run the following command to check the cache status of the Dataset.
kubectl get dataset hadoop
Expected output:
NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210.00MiB        210.00MiB   4.00GiB          100.0%              Bound   1h
The output shows that all 210 MiB of data has been cached locally.
Run the following command to delete the previous application container and create a new one.
Note: This step prevents interference from other factors, such as the operating system's page cache.
kubectl delete -f app.yaml && kubectl create -f app.yaml
Run the following commands to time the file copy again.
kubectl exec -it demo-app -- bash
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real    0m0.048s
user    0m0.001s
sys     0m0.046s
The output shows that copying the file now takes about 48 milliseconds, roughly 380 times faster than the initial copy.
Note: The second access is much faster because JindoFS has cached the file.
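The comparison can be reproduced with a short script. The sketch below uses two hypothetical helpers: speedup divides the first read time by the second, and time_copy runs the timed copy through a non-interactive kubectl exec instead of an interactive shell. The sketch assumes that bash is available in the container image, as it is in the example above.

```shell
# Sketch: compute how many times faster the cached read is.
# Usage: speedup <first-read-seconds> <second-read-seconds>
speedup() {
  awk -v first="$1" -v second="$2" 'BEGIN { printf "%.0f\n", first / second }'
}

# Time a copy inside the pod without opening an interactive shell.
# Assumption: the container image provides bash.
# Usage: time_copy <pod> <file>
time_copy() {
  kubectl exec "$1" -- bash -c "time cp '$2' /dev/null"
}

# Values taken from the two expected outputs in this step.
speedup 18.386 0.048   # prints 383
```

Actual timings depend on the bandwidth between the edge node and OSS, so the ratio in your environment will differ from this example.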
(Optional) Clean up
When data acceleration is no longer needed, delete the pod, the Dataset, and the JindoRuntime.
Delete the pod:
kubectl delete pod demo-app
Delete the Dataset and JindoRuntime:
kubectl delete dataset hadoop
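After running the commands above, you may want to confirm that nothing is left behind. The following is a hedged sketch, and the function name cleanup_fluid_demo is hypothetical. It explicitly deletes the JindoRuntime in case your Fluid version does not remove it together with the Dataset, and it also deletes the mysecret Secret from Step 2, which this topic does not list as a cleanup step. Skip that line if the Secret is still in use.

```shell
# Sketch: remove leftover resources from this topic and list what remains.
# Assumption: resource names match the examples in this topic.
cleanup_fluid_demo() {
  # --ignore-not-found keeps the command from failing if a resource is already gone.
  kubectl delete jindoruntime hadoop --ignore-not-found
  # Extra step, not part of the cleanup described in this topic.
  kubectl delete secret mysecret --ignore-not-found
  # The listing should no longer include any resource named hadoop.
  kubectl get dataset,jindoruntime,pvc
}
```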