Lindorm: Use Fluid to accelerate data access to Lindorm over the S3 protocol

Last Updated: Jan 21, 2025

The S3 compatibility feature provided by Lindorm works as a plug-in that optimizes the storage of a large number of small files. The plug-in allows you to access Lindorm over the S3 protocol. Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios, such as big data applications and AI applications. This topic describes how to use Fluid to accelerate data access to Lindorm over the S3 protocol.

Prerequisites

  • An ACK Pro cluster is created and is associated with your Lindorm instance.

  • The S3 compatibility feature is enabled for the Lindorm instance.

Step 1: Install Fluid on the ACK Pro cluster

  1. Log on to the ACK console.

  2. In the left-side navigation pane, choose Marketplace > Marketplace.

  3. On the Marketplace page, enter ack-fluid in the search box and click the Search icon. Then, click ack-fluid.

  4. In the upper-right corner of the page that appears, click Deploy.

  5. In the Basic Information step, select the ACK Pro cluster that is associated with the Lindorm instance, and then click Next.

  6. In the Parameters step, select a chart version, configure the parameters, and then click OK.
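
After Fluid is deployed, you can verify that its control plane is running before you continue. This optional check is a minimal sketch and assumes that ack-fluid installs its components into the default fluid-system namespace:

  kubectl get pods -n fluid-system

If the Fluid controller pods, such as the dataset controller and the AlluxioRuntime controller, are in the Running state, Fluid is ready.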

Step 2: Create a dataset and a runtime

Note

To facilitate data management, Fluid introduces datasets and runtimes. A dataset is a set of logically related data that is used by computing engines. A runtime is the execution engine that implements security, versioning, and data acceleration for datasets and defines a series of lifecycle interfaces. For more information, see Overview of Fluid.

  1. Create a dataset.

    1. Create a dataset.yaml file to configure an S3-compatible dataset for Lindorm.

      The dataset.yaml file defines a dataset. To make sure that Alluxio can mount the S3 path of the Lindorm instance and access data, configure the parameters in the spec section of the dataset.yaml file. The parameters are described after the example.

      Note

      Alluxio is a cloud-oriented open source data orchestration technology for data analytics and AI. Fluid uses Alluxio to preheat data and accelerate data access for cloud-based applications.

      apiVersion: data.fluid.io/v1alpha1
      kind: Dataset
      metadata:
        name: lindorm
      spec:
        mounts:
          - mountPoint: s3://<BUCKET>/<DIRECTORY>/
            options:
              aws.accessKeyId: <accessKeyId>
              aws.secretKey: <secretKey>
              alluxio.underfs.s3.endpoint: <LINDORM_ENDPOINT>
              alluxio.underfs.s3.disable.dns.buckets: "true"
            name: lindorm
        accessModes:
          - ReadWriteOnce
        placement: "Shared"

      • mountPoint: The UFS path that Alluxio mounts to access the Lindorm instance, in the format s3://<BUCKET>/<DIRECTORY>. Do not include the endpoint of LindormTable in the path.

      • aws.accessKeyId: The AccessKey ID used to access data in the Lindorm instance. Lindorm instances with S3 compatibility enabled do not support authentication, so you can set this parameter to a custom value.

      • aws.secretKey: The AccessKey secret used to access data in the Lindorm instance. As with aws.accessKeyId, you can set this parameter to a custom value.

      • alluxio.underfs.s3.endpoint: The endpoint used to access LindormTable over the S3 protocol. For more information, see View the endpoints of LindormTable.

      • alluxio.underfs.s3.disable.dns.buckets: Specifies whether to disable virtual-hosted-style (DNS) bucket URLs and use path-style URLs instead. The S3 compatibility feature supports only path-style access, so set this parameter to "true".

      • accessModes: The access mode of the dataset. Valid values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany, and ReadWriteOncePod. Default value: ReadOnlyMany.

      • placement: Specifies whether only one worker can run on a node. Valid values: Exclusive (only one worker can run on a node) and Shared (multiple workers can run on a node at the same time). Default value: Exclusive.

    2. Run the following command to deploy the dataset:

      kubectl create -f dataset.yaml
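
      Optionally, you can confirm that the dataset object exists. This quick check assumes nothing beyond the dataset.yaml file above:

      kubectl get dataset lindorm

      At this point, the PHASE column is expected to show NotBound, because a dataset becomes usable only after a runtime is bound to it in the next step.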
  2. Create a runtime. In this example, Alluxio is used as the runtime for the created dataset.

    1. Define the runtime in a runtime.yaml file and configure the parameters in the spec section. The following example shows the content of the runtime.yaml file.

      apiVersion: data.fluid.io/v1alpha1
      kind: AlluxioRuntime
      metadata:
        name: lindorm
      spec:
        replicas: 3
        tieredstore:
          levels:
            - mediumtype: MEM
              path: /dev/shm
              quota: 32Gi
              high: "0.9"
              low: "0.8"
        properties:
          # THROUGH writes data synchronously to the underlying storage (Lindorm) instead of caching it on write.
          alluxio.user.file.writetype.default: THROUGH
          alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards: "3"
          alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.DeterministicHashPolicy

      • replicas: The total number of workers in the Alluxio cluster.

      • mediumtype: The cache type. When you create a sample AlluxioRuntime template, the following cache types are supported: HDD, SSD, and MEM.

      • path: The storage path used for the cache. You can specify only one path. If you set mediumtype to MEM, specify a memory-backed local path, such as /dev/shm, to store the cached data.

      • quota: The maximum size of the cached data on each worker, for example, 32Gi.

      • high: The ratio of the high watermark to the maximum cache size.

      • low: The ratio of the low watermark to the maximum cache size.

      • properties: The configuration items of Alluxio. For more information, see Properties-List.
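
      To make the watermark settings concrete: with quota set to 32Gi, high set to 0.9, and low set to 0.8, each worker starts evicting cached blocks when its cache usage reaches about 32 GiB × 0.9 = 28.8 GiB and evicts down to about 32 GiB × 0.8 = 25.6 GiB. With replicas set to 3, the total cache capacity of the cluster is 3 × 32 GiB = 96 GiB, which matches the CACHE CAPACITY value shown in Step 3.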

    2. Run the following command to create the AlluxioRuntime instance:

      kubectl create -f runtime.yaml

Step 3: View the status of each component

  1. Run the following command to view the status of the created AlluxioRuntime instance:

    kubectl get alluxioruntime lindorm

    The following result is returned:

    NAME      MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
    lindorm   Ready          Ready          Ready        52m
  2. Run the following command to check the status of the pods:

    kubectl get pods

    The following result is returned. According to the result, a master and three workers are running.

    NAME                         READY   STATUS      RESTARTS   AGE
    lindorm-master-0             2/2     Running     0          54m
    lindorm-worker-0             2/2     Running     0          54m
    lindorm-worker-1             2/2     Running     0          54m
    lindorm-worker-2             2/2     Running     0          54m
  3. Run the following command to check whether the dataset is created:

    kubectl get dataset lindorm

    If the dataset is created, the following result is returned. If no result is returned, the dataset failed to be created.

    NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    lindorm   0.00B            0.00B    96.00GiB         +Inf%               Bound   60m
  4. Run the following command to check whether the PV and PVC are created.

    Note
    • A PV is a storage resource in a Kubernetes cluster, in the same way that a node is a cluster resource. A PV has a lifecycle that is independent of the pod that uses it. PVs of different types can be provisioned based on different StorageClasses.

    • A PVC is a request for storage by a user. PVCs consume PV resources in a way similar to how pods consume node resources.

    kubectl get pv,pvc

    If the PV and PVC are created, the following result is returned. If no result is returned, the PV and PVC failed to be created.

    NAME                               CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS   REASON   AGE
    persistentvolume/default-lindorm   100Gi      RWO            Retain           Bound    default/lindorm   fluid                   10m
    
    NAME                            STATUS   VOLUME            CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    persistentvolumeclaim/lindorm   Bound    default-lindorm   100Gi      RWO            fluid          10m

Step 4: Access data

Method 1: Use Fluid to accelerate data access to Lindorm

You can create containers or submit machine learning jobs to use Fluid to accelerate data access to Lindorm. In the following example, an application is deployed in a container to test the time used to access the same data. The test is run multiple times to show how data access is accelerated by the AlluxioRuntime instance. Before you start, make sure that the dataset created for the Lindorm instance is deployed in Fluid in the Kubernetes cluster.

  1. Prepare test data. Run the following command to generate a test file with a size of 1,000 MB. Then, upload the test file to the S3 path of the Lindorm instance, as shown in the sketch after the command:

    dd if=/dev/zero of=test bs=1M count=1000
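
    The dd command only generates the file locally. You can upload it with any S3-compatible client. The following sketch is an assumption rather than part of the original procedure; it uses the aws CLI and reuses the endpoint and the custom credentials from dataset.yaml:

    # Hypothetical upload example: any S3-compatible client works. The credentials
    # can be arbitrary values because Lindorm instances with S3 compatibility
    # enabled do not authenticate requests.
    export AWS_ACCESS_KEY_ID=<accessKeyId>
    export AWS_SECRET_ACCESS_KEY=<secretKey>
    # The S3 compatibility feature supports only path-style access.
    aws configure set default.s3.addressing_style path
    aws s3 cp ./test s3://<BUCKET>/<DIRECTORY>/test --endpoint-url http://<LINDORM_ENDPOINT>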
  2. Create an application container to test data access acceleration.

    1. Create a file named app.yaml by using the following YAML template:

      apiVersion: v1
      kind: Pod
      metadata:
        name: demo-app
      spec:
        containers:
          - name: demo
            image: nginx
            volumeMounts:
              - mountPath: /data
                name: lindorm
        volumes:
          - name: lindorm
            persistentVolumeClaim:
              claimName: lindorm
    2. Run the following command to create the application container:

      kubectl create -f app.yaml
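
      Before you open a shell in the container, you can optionally wait until the pod is ready. This check is a small convenience, not part of the original procedure:

      kubectl wait --for=condition=Ready pod/demo-app --timeout=120s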
    3. Run the following command to query the cache information about the dataset:

      kubectl get dataset

      The following result is returned. According to the result, the cached data of the dataset is 0 bytes in size.

      NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
      lindorm   0.00B            0.00B    96.00GiB         NaN%                Bound   4m19s
    4. Run the following command to query the size of the test file:

      kubectl exec -it demo-app -- bash
      du -sh /data/lindorm/test

      The following result is returned. According to the result, the test file is 1,000 MB in size.

      1000M   /data/lindorm/test
    5. Run the following command to query the time required to read the test file. The grep command scans the entire file and therefore forces a full remote read:

      time grep LindormBlob  /data/lindorm/test

      The following result is returned. According to the result, 55.603 seconds are required to read the test file.

      real    0m55.603s
      user    0m0.469s
      sys     0m0.353s
    6. Run the following command to query the cache information about the dataset:

      kubectl get dataset lindorm

      The following result is returned. According to the result, the cached data of the dataset is 1,000 MB in size, which indicates that the data of the test file is cached by the AlluxioRuntime instance.

      NAME      UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
      lindorm   0.00B            1000.00MiB   96.00GiB         +Inf%               Bound   11m
    7. Run the following command to delete the current application container and create an identical one.

      Note

      This step is performed to prevent other factors, such as the page cache, from affecting the result.

      kubectl delete -f app.yaml && kubectl create -f app.yaml
    8. Run the following command to query the time required to read the test file again:

      kubectl exec -it demo-app -- bash
      time grep LindormBlob  /data/lindorm/test

      The following result is returned:

      real    0m0.646s
      user    0m0.429s
      sys     0m0.216s

      According to the result, only 0.646 seconds are required to read the test file. The time required to read the same file is significantly reduced the second time. This is because Fluid caches a file when you remotely access it in Lindorm for the first time. On subsequent reads, the cached copy is read instead of the original file. This way, data access to the file is significantly accelerated.

  3. (Optional) If you no longer need to accelerate data access to the file, run the following command to delete all resources created from the YAML files in the current directory and clear the environment:

    kubectl delete -f .

Method 2: Use the elbencho tool to test data access to Lindorm through Fluid

elbencho is a benchmarking tool for distributed storage. In the following example, elbencho is used to simplify the deployment of data reading and writing jobs. Make sure that the dataset created for the Lindorm instance is deployed in Fluid in the Kubernetes cluster. Then, you only need to run commands to submit the data reading and writing jobs.

  1. Prepare test data.

    Before the test, you must write data to the Lindorm instance over the S3 protocol. In this example, a file named write.yaml is configured to create a data writing job. In the job, a container created from an elbencho image writes data to Lindorm. A total of 15.625 GiB of data is written to the Lindorm instance. To make sure that data can be written to the Lindorm instance in a timely manner, set the data writing mode in the properties parameter of the runtime.yaml file to alluxio.user.file.writetype.default: THROUGH.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: fluid-elbencho-write
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: elbencho
              image: breuner/elbencho
              command: ["/usr/bin/elbencho"]
              args: ["-d","--write", "-t", "10", "-n", "1", "-N", "100", "-s", "16M", "--direct", "-b", "16M", "/data/lindorm"]
              volumeMounts:
                - mountPath: /data
                  name: lindorm-vol
          volumes:
            - name: lindorm-vol
              persistentVolumeClaim:
                claimName: lindorm

    The following list describes the parameters that you can configure in args. For more information about other parameters that you can configure in the write.yaml file, see the elbencho documentation.

    • -d: Creates the test directories.

    • --write: Specifies that the job is a data writing job.

    • -t: Specifies the number of threads used to write data.

    • -n: Specifies the number of directories created by each thread.

    • -N: Specifies the number of files created in each directory.

    • -s: Specifies the size of each file to write.

    • --direct: Specifies that no data is cached during the job.

    • -b: Specifies the size of the data blocks for each write operation.

    • /data/lindorm: Specifies the path to which the data is written.
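
    With these values, each of the 10 threads creates 1 directory containing 100 files of 16 MiB each, so the job writes 10 × 1 × 100 × 16 MiB = 16,000 MiB ≈ 15.625 GiB in total, which matches the amount of data mentioned above.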

  2. Execute the data writing job and wait until the job is complete.

    kubectl create -f write.yaml
    kubectl get pods

    The following result is returned:

    NAME                         READY   STATUS      RESTARTS   AGE
    fluid-elbencho-write-stfpq   0/1     Completed   0          3m29s
    lindorm-fuse-8lgj9           1/1     Running     0          3m29s
    lindorm-master-0             2/2     Running     0          5m37s
    lindorm-worker-0             2/2     Running     0          5m10s
    lindorm-worker-1             2/2     Running     0          5m9s
    lindorm-worker-2             2/2     Running     0          5m7s
  3. Clear the cache of the dataset and restart the dataset and the AlluxioRuntime instance.

    kubectl delete -f .
    kubectl create -f dataset.yaml
    kubectl create -f runtime.yaml
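
    After the dataset and the AlluxioRuntime instance are re-created, you can wait until the PHASE of the dataset returns to Bound before you submit the read job:

    kubectl get dataset lindorm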
  4. Verify how data reading is accelerated.

    Use elbencho to read the same data multiple times. Then, compare the time used to read the data and the throughput for data reading to verify how data reading is accelerated by using Fluid.

    1. Configure the read.yaml file to create a job to read data in the Lindorm instance over the S3 protocol.

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: fluid-elbencho-read
      spec:
        template:
          spec:
            restartPolicy: OnFailure
            containers:
              - name: elbencho
                image: breuner/elbencho
                command: ["/usr/bin/elbencho"]
                args: ["-d","--read", "-t", "10", "-n", "1", "-N", "100", "-s", "16M", "--direct", "-b", "16M", "/data/lindorm"]
                volumeMounts:
                  - mountPath: /data
                    name: lindorm-vol
            volumes:
              - name: lindorm-vol
                persistentVolumeClaim:
                  claimName: lindorm
    2. Execute the data reading job and wait until the job is complete.

      kubectl create -f read.yaml
      kubectl get pods

      The following result is returned:

      NAME                         READY   STATUS      RESTARTS   AGE
      fluid-elbencho-read-stfpq    0/1     Completed   0          3m29s
      lindorm-fuse-8lgj9           1/1     Running     0          3m29s
      lindorm-master-0             2/2     Running     0          5m37s
      lindorm-worker-0             2/2     Running     0          5m10s
      lindorm-worker-1             2/2     Running     0          5m9s
      lindorm-worker-2             2/2     Running     0          5m7s
    3. Run the following command to view the time required to read the data and the throughput for data reading:

      kubectl logs fluid-elbencho-read-stfpq

      The following result is returned:

      OPERATION RESULT TYPE        FIRST DONE  LAST DONE
      ========= ================   ==========  =========
      MKDIRS    Elapsed ms       :         33        120
                Dirs/s           :         30         83
                Dirs total       :          1         10
      ---
      READ      Elapsed ms       :      17585      18479
                Files/s          :         54         54
                Throughput MiB/s :        869        865
                Total MiB        :      15296      16000
                Files total      :        956       1000
      ---
    4. Run the following command to query the status of the dataset:

      kubectl get dataset lindorm

      The following result is returned. According to the result, all files that have been read are cached in Alluxio.

      NAME      UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
      lindorm   0.00B            15.63GiB   96.00GiB         +Inf%               Bound   9m54s
    5. Run the following command to delete the current job and create an identical data reading job:

      kubectl delete -f read.yaml && kubectl create -f read.yaml
    6. After the new job uses elbencho to read the data in the Lindorm instance again, run the following command to check the status of the job. The time used to execute the data reading job is significantly reduced because the data has been cached by Fluid.

      kubectl get pods

      The following result is returned:

      NAME                         READY   STATUS      RESTARTS   AGE
      fluid-elbencho-read-9gxkk    0/1     Completed   0          9s
      lindorm-fuse-ckwd9           1/1     Running     0          4m1s
      lindorm-fuse-vlr6r           1/1     Running     0          3m6s
      lindorm-master-0             2/2     Running     0          10m
      lindorm-worker-0             2/2     Running     0          9m28s
      lindorm-worker-1             2/2     Running     0          9m27s
      lindorm-worker-2             2/2     Running     0          9m26s
    7. Run the following command to view the time required to read the data and the throughput for data reading:

      kubectl logs fluid-elbencho-read-9gxkk

      The following result is returned:

      OPERATION RESULT TYPE        FIRST DONE  LAST DONE
      ========= ================   ==========  =========
      MKDIRS    Elapsed ms       :          7         32
                Dirs/s           :        132        312
                Dirs total       :          1         10
      ---
      READ      Elapsed ms       :       8081       9165
                Files/s          :        110        109
                Throughput MiB/s :       1771       1745
                Total MiB        :      14320      16000
                Files total      :        895       1000

      According to the result, the throughput for data reading is improved from 869 MiB/s to 1,771 MiB/s. This is because Fluid caches a file when you remotely access it in Lindorm for the first time. On subsequent reads, the cached copy is read instead of the original file. This way, data access to the file is significantly accelerated.

  5. (Optional) If you no longer need to accelerate data access to the files, run the following command to delete all resources created from the YAML files in the current directory and clear the environment:

    kubectl delete -f .