The S3 compatibility feature provided by Lindorm works as a plug-in that optimizes the storage of a large number of small files. The plug-in allows you to access Lindorm over the S3 protocol. Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios, such as big data applications and AI applications. This topic describes how to use Fluid to accelerate data access to Lindorm over the S3 protocol.
Prerequisites
The S3 compatibility feature is enabled for the Lindorm instance. For more information, see Enable the S3 compatibility feature.
An ACK Pro cluster is created. For more information about how to create an ACK Pro cluster, see Create a managed Kubernetes cluster.
The IP address of the ACK Pro cluster is added to the whitelist of the Lindorm instance. For more information, see Configure whitelists.
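After you configure the whitelist, you can optionally verify that the S3 endpoint of the Lindorm instance is reachable from the cluster. The following command is a minimal sketch: <LINDORM_ENDPOINT> is a placeholder, and the http:// scheme is an assumption; use the scheme shown for your LindormTable endpoint.
# Run on a node of the ACK Pro cluster or in a pod in the cluster.
# Any HTTP response (even an error status) indicates that the endpoint is
# reachable and that the source IP address is allowed by the whitelist.
curl -sv -o /dev/null "http://<LINDORM_ENDPOINT>/"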
Step 1: Install Fluid on the ACK Pro cluster
Log on to the ACK console.
In the left-side navigation pane, choose Marketplace.
On the Marketplace page, enter ack-fluid in the search box and click the Search icon. Then, click ack-fluid.
In the upper-right corner of the page that appears, click Deploy.
In the Basic Information step, select the ACK Pro cluster that is associated with the Lindorm instance, and then click Next.
In the Parameters step, select a chart version, configure the parameters, and then click OK.
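After the chart is deployed, you can optionally confirm that the Fluid components are running before you continue. The following commands are a minimal sketch; they assume that ack-fluid installs its components into the fluid-system namespace, which may differ in your environment.
# List the Fluid controller pods. All pods are expected to be in the Running state.
kubectl get pods -n fluid-system
# Confirm that the Fluid CRDs, such as Dataset and AlluxioRuntime, are registered.
kubectl get crd | grep data.fluid.io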
Step 2: Create a dataset and a runtime
To facilitate data management, Fluid introduces datasets and runtimes. A dataset is a set of logically related data that is used by computing engines. A runtime is the execution engine that implements security, versioning, and data acceleration for datasets and defines a series of lifecycle interfaces. For more information, see Overview of Fluid.
Create a dataset.
You must create a dataset.yaml file to configure an S3-compatible dataset for Lindorm.
The dataset.yaml file defines the dataset. To make sure that Alluxio can mount the S3 path of the Lindorm instance and access its data, configure the parameters in the spec section of the dataset.yaml file. We recommend the following configuration.
Note: Alluxio is a cloud-oriented, open source data orchestration technology for data analytics and AI. Fluid uses Alluxio to preheat data and accelerate data access for cloud-based applications.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: lindorm
spec:
  mounts:
    - mountPoint: s3://<BUCKET>/<DIRECTORY>/
      options:
        aws.accessKeyId: <accessKeyId>
        aws.secretKey: <secretKey>
        alluxio.underfs.s3.endpoint: <LINDORM_ENDPOINT>
        alluxio.underfs.s3.disable.dns.buckets: "true"
      name: lindorm
  accessModes:
    - ReadWriteOnce
  placement: "Shared"

The following list describes the parameters:
mountPoint
The under file system (UFS) path that Alluxio mounts to access the Lindorm instance. The path is in the format s3://<BUCKET>/<DIRECTORY>. You do not need to include the endpoint of LindormTable in the path.
aws.accessKeyId
The AccessKey ID used to access data in the Lindorm instance. Authentication is not supported for Lindorm instances with S3 compatibility enabled. Therefore, you can set this parameter to a custom value.
aws.secretKey
The AccessKey secret used to access data in the Lindorm instance. Authentication is not supported for Lindorm instances with S3 compatibility enabled. Therefore, you can set this parameter to a custom value.
alluxio.underfs.s3.endpoint
The endpoint that can be used to access LindormTable over the S3 protocol. For more information, see View the endpoints of LindormTable.
alluxio.underfs.s3.disable.dns.buckets
Specifies whether to disable DNS-based (virtual-hosted-style) bucket addressing. The S3 compatibility feature supports only path-style URL access. Therefore, set this parameter to true. For the difference between the two URL styles, see the example after this list.
accessModes
The access mode. Valid values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany, and ReadWriteOncePod. Default value: ReadOnlyMany.
placement
Specifies whether only one worker can run on a node. Default value: Exclusive. Valid values:
Exclusive: Only one worker can run on a node.
Shared: Multiple workers can run on a node at the same time.
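The following sketch illustrates the difference between the two S3 URL styles that is referenced in the description of alluxio.underfs.s3.disable.dns.buckets. The object name and the http:// scheme are examples only; the placeholders must be replaced with your own values.
# Path-style request: the bucket name is part of the URL path. This is the style
# used when alluxio.underfs.s3.disable.dns.buckets is set to "true".
#   http://<LINDORM_ENDPOINT>/<BUCKET>/<DIRECTORY>/example-object
# Virtual-hosted-style request: the bucket name is part of the host name.
# This style is not supported by the S3 compatibility feature of Lindorm.
#   http://<BUCKET>.<LINDORM_ENDPOINT>/<DIRECTORY>/example-object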
Run the following command to deploy the dataset:
kubectl create -f dataset.yaml
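Optionally, run the following command to confirm that the dataset object exists. This is a minimal sketch; in most Fluid versions, the dataset stays in the NotBound phase until a matching runtime is created in the next step.
# The PHASE column is expected to show NotBound at this point because no runtime exists yet.
kubectl get dataset lindorm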
Create a runtime. In this example, Alluxio is used as the runtime for the created dataset.
You must define the runtime in the runtime.yaml file and configure the parameters in the spec section. The following example shows the content of the runtime.yaml file.
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: lindorm
spec:
  replicas: 3
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 32Gi
        high: "0.9"
        low: "0.8"
  properties:
    alluxio.user.file.writetype.default: THROUGH
    alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards: "3"
    alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.DeterministicHashPolicy

The following list describes the parameters:
replicas
The total number of workers in the Alluxio cluster.
mediumtype
The cache medium type. Valid values: HDD, SSD, and MEM. In this example, MEM (memory) is used.
path
The cache storage path. You can specify only one path. If you set mediumtype to MEM, you must specify a local path to store data such as logs.
quota
The maximum size of the cache for each worker. In this example, the quota is 32Gi (GiB). With 3 replicas, the total cache capacity of the cluster is 96 GiB, which matches the CACHE CAPACITY value shown in Step 3.
high
The ratio of the high watermark to the maximum size of the cache.
low
The ratio of the low watermark to the maximum size of the cache.
properties
The items that you can configure for Alluxio. For more information, see Properties-List.
Run the following command to create the AlluxioRuntime instance:
kubectl create -f runtime.yaml
Step 3: View the status of each component
Run the following command to view the status of the created AlluxioRuntime instance:
kubectl get alluxioruntime lindorm
The following result is returned:
NAME      MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
lindorm   Ready          Ready          Ready        52m
Run the following command to check the status of the pods:
kubectl get pods
The following result is returned. According to the result, a master and three workers are running.
NAME               READY   STATUS    RESTARTS   AGE
lindorm-master-0   2/2     Running   0          54m
lindorm-worker-0   2/2     Running   0          54m
lindorm-worker-1   2/2     Running   0          54m
lindorm-worker-2   2/2     Running   0          54m
Run the following command to check whether the dataset is created:
kubectl get dataset lindorm
If the dataset is created, the following result is returned. If no result is returned, the dataset failed to be created.
NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
lindorm   0.00B            0.00B    96.00GiB         +Inf%               Bound   60m
Run the following command to check whether the PV and PVC are created.
Note: A PV is a storage resource in a Kubernetes cluster, in the same way that a node is a cluster resource. A PV has a lifecycle that is independent of the pod that uses it. Different types of PVs can be created based on different StorageClasses.
A PVC is a request for storage made by a user. PVCs consume PVs in a similar way to how pods consume node resources.
kubectl get pv,pvc
If the PV and PVC are created, the following result is returned. If no result is returned, the PV and PVC failed to be created.
NAME                               CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS   REASON   AGE
persistentvolume/default-lindorm   100Gi      RWO            Retain           Bound    default/lindorm   fluid                   10m

NAME                            STATUS   VOLUME            CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/lindorm   Bound    default-lindorm   100Gi      RWO            fluid          10m
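Optionally, you can inspect the dataset in more detail. The first command below works in all cases. The second command is a sketch that assumes the Alluxio master container in the lindorm-master-0 pod is named alluxio-master; the container name may differ depending on the Fluid version.
# Show the bound runtime, cache states, and mount information of the dataset.
kubectl describe dataset lindorm
# List the Alluxio namespace to confirm that the S3 path of the Lindorm instance is mounted.
kubectl exec -it lindorm-master-0 -c alluxio-master -- alluxio fs ls /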
Step 4: Access data
Method 1: Use Fluid to accelerate data access to Lindorm
You can create containers or submit machine learning jobs to accelerate data access to Lindorm by using Fluid. In the following example, an application is deployed in a container to test the time used to access the same data. The test is run multiple times to show how data access is accelerated by the AlluxioRuntime instance. Make sure the dataset that is created for the Lindorm instance is deployed in Fluid in the Kubernetes cluster.
Prepare test data. Run the following command to generate a test file with a size of 1,000 MB. Then, upload the test file to the S3 path of the Lindorm instance (one way to do this is shown after the command).
dd if=/dev/zero of=test bs=1M count=1000
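The following commands show one way to upload the test file over the S3 protocol by using the AWS CLI. This is a minimal sketch: the http:// scheme, <BUCKET>, <DIRECTORY>, and <LINDORM_ENDPOINT> are placeholders, and the AccessKey values can be arbitrary because the S3 compatibility feature does not support authentication.
# Credentials are required by the AWS CLI but are not verified by Lindorm.
export AWS_ACCESS_KEY_ID=<accessKeyId>
export AWS_SECRET_ACCESS_KEY=<secretKey>
# The S3 compatibility feature supports only path-style URLs.
aws configure set default.s3.addressing_style path
# Upload the 1,000 MB test file to the S3 path that is mounted by the dataset.
aws s3 cp ./test s3://<BUCKET>/<DIRECTORY>/test --endpoint-url http://<LINDORM_ENDPOINT>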
Create an application container to test data access acceleration. Create a file named app.yaml by using the following YAML template:
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: lindorm
  volumes:
    - name: lindorm
      persistentVolumeClaim:
        claimName: lindorm

Run the following command to create the application container:
kubectl create -f app.yaml
Run the following command to query the cache information about the dataset:
kubectl get dataset
The following result is returned. According to the result, the cache of the dataset is 0 bytes in size.
NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
lindorm   0.00B            0.00B    96.00GiB         NaN%                Bound   4m19s
Run the following command to query the size of the test file:
kubectl exec -it demo-app -- bash
du -sh /data/lindorm/test
The following result is returned. According to the result, the test file is 1,000 MB in size.
1000M   /data/lindorm/test
Run the following command to query the time required to read the test file:
time grep LindormBlob /data/lindorm/test
The following result is returned. According to the result, 55.603 seconds are required to read the test file.
real    0m55.603s
user    0m0.469s
sys     0m0.353s
Run the following command to query the cache information about the dataset:
kubectl get dataset lindorm
The following result is returned. According to the result, the cache of the dataset is 1,000 MB in size, which indicates that the data of the test file has been cached by Alluxio.
NAME      UFS TOTAL SIZE   CACHED       CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
lindorm   0.00B            1000.00MiB   96.00GiB         +Inf%               Bound   11m
Run the following command to delete the current application container and create an identical one.
Note: This step is performed to prevent other factors, such as the page cache, from affecting the result.
kubectl delete -f app.yaml && kubectl create -f app.yaml
Run the following command to query the time required to read the test file:
kubectl exec -it demo-app -- bash
time grep LindormBlob /data/lindorm/test
The following result is returned:
real    0m0.646s
user    0m0.429s
sys     0m0.216s
According to the result, only 0.646 seconds are required to read the test file. The time required to read the same file is significantly reduced the second time. This is because Fluid caches a file when you remotely access it in Lindorm for the first time. Subsequent reads are served from the cache instead of the original file in Lindorm. This way, data access is significantly accelerated.
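If you do not want the first read to pay the cost of pulling data from Lindorm, you can preheat the cache before the application reads the data. The following DataLoad object is a minimal sketch that assumes the installed Fluid version supports the DataLoad CRD; save it as dataload.yaml and create it by running kubectl create -f dataload.yaml.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: lindorm-dataload
spec:
  dataset:
    name: lindorm        # Name of the dataset to preheat.
    namespace: default   # Namespace of the dataset.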
(Optional) If you no longer need to accelerate data access to the file, run the following command to delete the container and clear the environment:
kubectl delete -f .
Method 2: Use the elbencho tool to test data access to Lindorm through Fluid
elbencho is a benchmarking tool for distributed storage. In the following example, elbencho runs as Kubernetes jobs that write data to and read data from Lindorm. Make sure that the dataset created for the Lindorm instance is deployed in Fluid in the Kubernetes cluster. Then, you only need to run commands to submit the data writing and data reading jobs.
Prepare test data.
Before the test, you must write data to the Lindorm instance over the S3 protocol. In this example, a file named write.yaml is configured to create a data writing job. In the job, a container is created from an elbencho image to write data to Lindorm. In this example, 10 threads × 1 directory per thread × 100 files per directory × 16 MiB per file = 16,000 MiB (about 15.625 GiB) of data in total is written to the Lindorm instance. To make sure that data is written to the Lindorm instance in a timely manner, set the write type in the properties parameter of the runtime.yaml file to alluxio.user.file.writetype.default: THROUGH.

apiVersion: batch/v1
kind: Job
metadata:
  name: fluid-elbencho-write
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: elbencho
          image: breuner/elbencho
          command: ["/usr/bin/elbencho"]
          args: ["-d", "--write", "-t", "10", "-n", "1", "-N", "100", "-s", "16M", "--direct", "-b", "16M", "/data/lindorm"]
          volumeMounts:
            - mountPath: /data
              name: lindorm-vol
      volumes:
        - name: lindorm-vol
          persistentVolumeClaim:
            claimName: lindorm

The following list describes the parameters that you can configure in args. A consolidated command line that shows how the total data volume is derived is provided after the list. For more information about other parameters that you can configure in the write.yaml file, visit elbencho.
-d
Creates a test directory.
--write
Specifies that the job is a data writing job.
-t
Specifies the number of threads used to write data.
-n
Specifies the number of directories created by each thread.
-N
Specifies the number of files that you want to create in each directory.
-s
Specifies the size of the files that you want to write.
--direct
Specifies that no data is cached during the job.
-b
Specifies the size of data blocks for each write operation.
/data/lindorm
Specifies the path to which you want to write data.
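For reference, the args in the write.yaml file correspond to the following elbencho command line. The comment shows how the total data volume is derived; the command is only an illustration of the flags listed above.
# 10 threads x 1 directory per thread x 100 files per directory x 16 MiB per file
# = 16,000 MiB, which is about 15.625 GiB of data written to /data/lindorm.
/usr/bin/elbencho -d --write -t 10 -n 1 -N 100 -s 16M --direct -b 16M /data/lindorm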
Execute the data writing job and wait until the job is complete.
kubectl create -f write.yaml
kubectl get pods
The following result is returned:
NAME                         READY   STATUS      RESTARTS   AGE
fluid-elbencho-write-stfpq   0/1     Completed   0          3m29s
lindorm-fuse-8lgj9           1/1     Running     0          3m29s
lindorm-master-0             2/2     Running     0          5m37s
lindorm-worker-0             2/2     Running     0          5m10s
lindorm-worker-1             2/2     Running     0          5m9s
lindorm-worker-2             2/2     Running     0          5m7s
Clear the cache of the dataset and restart the dataset and the AlluxioRuntime instance.
kubectl delete -f .
kubectl create -f dataset.yaml
kubectl create -f runtime.yaml
Verify how data reading is accelerated.
Use elbencho to read the same data multiple times. Then, compare the time used to query the data and the throughput for data reading to verify how data reading is accelerated by using Fluid.
Configure the read.yaml file to create a job to read data in the Lindorm instance over the S3 protocol.
apiVersion: batch/v1
kind: Job
metadata:
  name: fluid-elbencho-read
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: elbencho
          image: breuner/elbencho
          command: ["/usr/bin/elbencho"]
          args: ["-d", "--read", "-t", "10", "-n", "1", "-N", "100", "-s", "16M", "--direct", "-b", "16M", "/data/lindorm"]
          volumeMounts:
            - mountPath: /data
              name: lindorm-vol
      volumes:
        - name: lindorm-vol
          persistentVolumeClaim:
            claimName: lindorm

Execute the data reading job and wait until the job is complete.
kubectl create -f read.yaml
kubectl get pods
The following result is returned:
NAME                        READY   STATUS      RESTARTS   AGE
fluid-elbencho-read-stfpq   0/1     Completed   0          3m29s
lindorm-fuse-8lgj9          1/1     Running     0          3m29s
lindorm-master-0            2/2     Running     0          5m37s
lindorm-worker-0            2/2     Running     0          5m10s
lindorm-worker-1            2/2     Running     0          5m9s
lindorm-worker-2            2/2     Running     0          5m7s
Run the following command to view the time required to read the data and the throughput for data reading:
kubectl logs fluid-elbencho-read-stfpq
The following result is returned:
OPERATION  RESULT TYPE        FIRST DONE  LAST DONE
=========  ================   ==========  =========
MKDIRS     Elapsed ms       :         33        120
           Dirs/s           :         30         83
           Dirs total       :          1         10
           ---
READ       Elapsed ms       :      17585      18479
           Files/s          :         54         54
           Throughput MiB/s :        869        865
           Total MiB        :      15296      16000
           Files total      :        956       1000
           ---
Run the following command to query the status of the dataset:
kubectl get dataset lindorm
The following result is returned. According to the result, all files that have been read are cached in Alluxio.
NAME      UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
lindorm   0.00B            15.63GiB   96.00GiB         +Inf%               Bound   9m54s
Run the following command to delete the current data reading job and create an identical one:
kubectl delete -f read.yaml && kubectl create -f read.yaml
Run the following command to view the status of the new data reading job, which reads the same data in Lindorm again. The time used to execute the job is significantly reduced because the data has been cached by Fluid.
kubectl get pods
The following result is returned:
NAME                        READY   STATUS      RESTARTS   AGE
fluid-elbencho-read-9gxkk   0/1     Completed   0          9s
lindorm-fuse-ckwd9          1/1     Running     0          4m1s
lindorm-fuse-vlr6r          1/1     Running     0          3m6s
lindorm-master-0            2/2     Running     0          10m
lindorm-worker-0            2/2     Running     0          9m28s
lindorm-worker-1            2/2     Running     0          9m27s
lindorm-worker-2            2/2     Running     0          9m26s
Run the following command to view the time required to read the data and the throughput for data reading:
kubectl logs fluid-elbencho-read-9gxkk
The following result is returned:
OPERATION  RESULT TYPE        FIRST DONE  LAST DONE
=========  ================   ==========  =========
MKDIRS     Elapsed ms       :          7         32
           Dirs/s           :        132        312
           Dirs total       :          1         10
           ---
READ       Elapsed ms       :       8081       9165
           Files/s          :        110        109
           Throughput MiB/s :       1771       1745
           Total MiB        :      14320      16000
           Files total      :        895       1000
According to the result, the read throughput is improved from 869 MiB/s to 1,771 MiB/s. This is because Fluid caches a file when you remotely access it in Lindorm for the first time. Subsequent reads are served from the cache instead of the original file in Lindorm. This way, data access is significantly accelerated.
(Optional) If you no longer need to accelerate data access to the files, run the following command to delete the container and clear the environment:
kubectl delete -f .