Lindorm: Use Fluid to accelerate data access to Lindorm over the S3 protocol

Last Updated: Jan 21, 2025

The S3 compatibility feature provided by Lindorm works as a plug-in that optimizes the storage of a large number of small files. The plug-in allows you to access Lindorm over the S3 protocol. Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios, such as big data applications and AI applications. This topic describes how to use Fluid to accelerate data access to Lindorm over the S3 protocol.

Prerequisites

  • An ACK Pro cluster is created and is associated with your Lindorm instance.

  • The S3 compatibility feature is enabled for the Lindorm instance.

Step 1: Install Fluid on the ACK Pro cluster

  1. Log on to the ACK console.

  2. In the left-side navigation pane, choose Marketplace > Marketplace.

  3. On the Marketplace page, enter ack-fluid in the search box and click the Search icon. Then, click ack-fluid.

  4. In the upper-right corner of the page that appears, click Deploy.

  5. In the Basic Information step, select the ACK Pro cluster that is associated with the Lindorm instance, and then click Next.

  6. In the Parameters step, select a chart version, configure the parameters, and then click OK.
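
After Fluid is deployed, you can verify that its control plane is running before you continue. This optional check is a minimal sketch and assumes that ack-fluid installs its components into the default fluid-system namespace:

  kubectl get pods -n fluid-system

If the Fluid controller pods, such as the dataset controller and the AlluxioRuntime controller, are in the Running state, Fluid is ready.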

Step 2: Create a dataset and a runtime

Note

To facilitate data management, Fluid introduces datasets and runtimes. A dataset is a set of logically related data that is used by computing engines. A runtime is the execution engine that implements security, versioning, and data acceleration for datasets and defines a series of lifecycle interfaces. For more information, see Overview of Fluid.

  1. Create a dataset.

    1. Create a dataset.yaml file to configure an S3-compatible dataset for Lindorm.

      The dataset.yaml file defines a dataset. To make sure that Alluxio can mount the S3 path of the Lindorm instance and access data, configure the parameters in the spec section of the dataset.yaml file. The parameters are described after the example.

      Note

      Alluxio is a cloud-oriented open source data orchestration technology for data analytics and AI. Fluid uses Alluxio to preheat data and accelerate data access for cloud-based applications.

      apiVersion: data.fluid.io/v1alpha1
      kind: Dataset
      metadata:
        name: lindorm
      spec:
        mounts:
          - mountPoint: s3://<BUCKET>/<DIRECTORY>/
            options:
              aws.accessKeyId: <accessKeyId>
              aws.secretKey: <secretKey>
              alluxio.underfs.s3.endpoint: <LINDORM_ENDPOINT>
              alluxio.underfs.s3.disable.dns.buckets: "true"
            name: lindorm
        accessModes:
          - ReadWriteOnce
        placement: "Shared"

      • mountPoint: The UFS path that Alluxio mounts to access the Lindorm instance, in the format s3://<BUCKET>/<DIRECTORY>. Do not include the endpoint of LindormTable in the path.

      • aws.accessKeyId: The AccessKey ID used to access data in the Lindorm instance. Lindorm instances with S3 compatibility enabled do not support authentication, so you can set this parameter to a custom value.

      • aws.secretKey: The AccessKey secret used to access data in the Lindorm instance. As with aws.accessKeyId, you can set this parameter to a custom value.

      • alluxio.underfs.s3.endpoint: The endpoint used to access LindormTable over the S3 protocol. For more information, see View the endpoints of LindormTable.

      • alluxio.underfs.s3.disable.dns.buckets: Specifies whether to disable virtual-hosted-style (DNS) bucket URLs and use path-style URLs instead. The S3 compatibility feature supports only path-style access, so set this parameter to "true".

      • accessModes: The access mode of the dataset. Valid values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany, and ReadWriteOncePod. Default value: ReadOnlyMany.

      • placement: Specifies whether only one worker can run on a node. Valid values: Exclusive (only one worker can run on a node) and Shared (multiple workers can run on a node at the same time). Default value: Exclusive.

    2. Run the following command to deploy the dataset:

      kubectl create -f dataset.yaml
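
      Optionally, you can confirm that the dataset object exists. This quick check assumes nothing beyond the dataset.yaml file above:

      kubectl get dataset lindorm

      At this point, the PHASE column is expected to show NotBound, because a dataset becomes usable only after a runtime is bound to it in the next step.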
  2. Create a runtime. In this example, Alluxio is used as the runtime for the created dataset.

    1. Define the runtime in a runtime.yaml file and configure the parameters in the spec section. The following example shows the content of the runtime.yaml file.

      apiVersion: data.fluid.io/v1alpha1
      kind: AlluxioRuntime
      metadata:
        name: lindorm
      spec:
        replicas: 3
        tieredstore:
          levels:
            - mediumtype: MEM
              path: /dev/shm
              quota: 32Gi
              high: "0.9"
              low: "0.8"
        properties:
          # THROUGH writes data synchronously to the underlying storage (Lindorm) instead of caching it on write.
          alluxio.user.file.writetype.default: THROUGH
          alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards: "3"
          alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.DeterministicHashPolicy

      • replicas: The total number of workers in the Alluxio cluster.

      • mediumtype: The cache type. When you create a sample AlluxioRuntime template, the following cache types are supported: HDD, SSD, and MEM.

      • path: The storage path used for the cache. You can specify only one path. If you set mediumtype to MEM, specify a memory-backed local path, such as /dev/shm, to store the cached data.

      • quota: The maximum size of the cached data on each worker, for example, 32Gi.

      • high: The ratio of the high watermark to the maximum cache size.

      • low: The ratio of the low watermark to the maximum cache size.

      • properties: The configuration items of Alluxio. For more information, see Properties-List.
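
      To make the watermark settings concrete: with quota set to 32Gi, high set to 0.9, and low set to 0.8, each worker starts evicting cached blocks when its cache usage reaches about 32 GiB × 0.9 = 28.8 GiB and evicts down to about 32 GiB × 0.8 = 25.6 GiB. With replicas set to 3, the total cache capacity of the cluster is 3 × 32 GiB = 96 GiB, which matches the CACHE CAPACITY value shown in Step 3.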

    2. Run the following command to create the AlluxioRuntime instance:

      kubectl create -f runtime.yaml

Step 3: View the status of each component

  1. Run the following command to view the status of the created AlluxioRuntime instance:

    kubectl get alluxioruntime lindorm

    The following result is returned:

    NAME      MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
    lindorm   Ready          Ready          Ready        52m
  2. Run the following command to check the status of the pods:

    kubectl get pods

    The following result is returned. According to the result, a master and three workers are running.

    NAME                         READY   STATUS      RESTARTS   AGE
    lindorm-master-0             2/2     Running     0          54m
    lindorm-worker-0             2/2     Running     0          54m
    lindorm-worker-1             2/2     Running     0          54m
    lindorm-worker-2             2/2     Running     0          54m
  3. Run the following command to check whether the dataset is created:

    kubectl get dataset lindorm

    If the dataset is created, the following result is returned. If no result is returned, the dataset failed to be created.

    NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    lindorm   0.00B            0.00B    96.00GiB         +Inf%               Bound   60m
  4. Run the following command to check whether the PV and PVC are created.

    Note
    • A PV is a storage resource in a Kubernetes cluster, in the same way that a node is a cluster resource. A PV has a lifecycle that is independent of the pod that uses it. PVs of different types can be provisioned based on different StorageClasses.

    • A PVC is a request for storage by a user. PVCs consume PV resources in a way similar to how pods consume node resources.

    kubectl get pv,pvc

    If the PV and PVC are created, the following result is returned. If no result is returned, the PV and PVC failed to be created.

    NAME                               CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS   REASON   AGE
    persistentvolume/default-lindorm   100Gi      RWO            Retain           Bound    default/lindorm   fluid                   10m
    
    NAME                            STATUS   VOLUME            CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    persistentvolumeclaim/lindorm   Bound    default-lindorm   100Gi      RWO            fluid          10m

Step 4: Access data

Method 1: Use Fluid to accelerate data access to Lindorm

You can create containers or submit machine learning jobs to use Fluid to accelerate data access to Lindorm. In the following example, an application is deployed in a container to test the time used to access the same data. The test is run multiple times to show how data access is accelerated by the AlluxioRuntime instance. Before you start, make sure that the dataset created for the Lindorm instance is deployed in Fluid in the Kubernetes cluster.

  1. Prepare test data. Run the following command to generate a test file with a size of 1,000 MB. Then, upload the test file to the S3 path of the Lindorm instance, as shown in the sketch after the command:

    dd if=/dev/zero of=test bs=1M count=1000
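
    The dd command only generates the file locally. You can upload it with any S3-compatible client. The following sketch is an assumption rather than part of the original procedure; it uses the aws CLI and reuses the endpoint and the custom credentials from dataset.yaml:

    # Hypothetical upload example: any S3-compatible client works. The credentials
    # can be arbitrary values because Lindorm instances with S3 compatibility
    # enabled do not authenticate requests.
    export AWS_ACCESS_KEY_ID=<accessKeyId>
    export AWS_SECRET_ACCESS_KEY=<secretKey>
    # The S3 compatibility feature supports only path-style access.
    aws configure set default.s3.addressing_style path
    aws s3 cp ./test s3://<BUCKET>/<DIRECTORY>/test --endpoint-url http://<LINDORM_ENDPOINT>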
  2. Create an application container to test data access acceleration.

    1. Create a file named app.yaml by using the following YAML template:

      apiVersion: v1
      kind: Pod
      metadata:
        name: demo-app
      spec:
        containers:
          - name: demo
            image: nginx
            volumeMounts:
              - mountPath: /data
                name: lindorm
        volumes:
          - name: lindorm
            persistentVolumeClaim:
              claimName: lindorm
    2. Run the following command to create the application container:

      kubectl create -f app.yaml
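
      Before you open a shell in the container, you can optionally wait until the pod is ready. This check is a small convenience, not part of the original procedure:

      kubectl wait --for=condition=Ready pod/demo-app --timeout=120s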
    3. Run the following command to query the cache information about the dataset:

      kubectl get dataset

      The following result is returned. According to the result, the cached data of the dataset is 0 bytes in size.

      NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
      lindorm   0.00B            0.00B    96.00GiB         NaN%                Bound   4m19s
    4. Run the following command to query the size of the test file:

      kubectl exec -it demo-app -- bash
      du -sh /data/lindorm/test

      The following result is returned. According to the result, the test file is 1,000 MB in size.

      1000M   /data/lindorm/test
    5. Run the following command to query the time required to read the test file. The grep command scans the entire file and therefore forces a full remote read:

      time grep LindormBlob  /data/lindorm/test

      The following result is returned. According to the result, 55.603 seconds are required to read the test file.

      real    0m55.603s
      user    0m0.469s
      sys     0m0.353s
    6. Run the following command to query the cache information about the dataset:

      kubectl get dataset lindorm

      The following result is returned. According to the result, the cached data of the dataset is 1,000 MB in size, which indicates that the data of the test file is cached by the AlluxioRuntime instance.

      NAME      UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
      lindorm   0.00B            1000.00MiB   96.00GiB         +Inf%               Bound   11m
    7. Run the following command to delete the current application container and create an identical one.

      Note

      This step is performed to prevent other factors, such as the page cache, from affecting the result.

      kubectl delete -f app.yaml && kubectl create -f app.yaml
    8. Run the following command to query the time required to read the test file again:

      kubectl exec -it demo-app -- bash
      time grep LindormBlob  /data/lindorm/test

      The following result is returned:

      real    0m0.646s
      user    0m0.429s
      sys     0m0.216s

      According to the result, only 0.646 seconds are required to read the test file. The time required to read the same file is significantly reduced the second time. This is because Fluid caches a file when you remotely access it in Lindorm for the first time. On subsequent reads, the cached copy is read instead of the original file. This way, data access to the file is significantly accelerated.

  3. (Optional) If you no longer need to accelerate data access to the file, run the following command to delete all resources created from the YAML files in the current directory and clear the environment:

    kubectl delete -f .

Method 2: Use the elbencho tool to test data access to Lindorm through Fluid

elbencho is a benchmarking tool for distributed storage. In the following example, elbencho is used to simplify the deployment of data reading and writing jobs. Make sure that the dataset created for the Lindorm instance is deployed in Fluid in the Kubernetes cluster. Then, you only need to run commands to submit the data reading and writing jobs.

  1. Prepare test data.

    Before the test, you must write data to the Lindorm instance over the S3 protocol. In this example, a file named write.yaml is configured to create a data writing job. In the job, a container created from an elbencho image writes data to Lindorm. A total of 15.625 GiB of data is written to the Lindorm instance. To make sure that data can be written to the Lindorm instance in a timely manner, set the data writing mode in the properties parameter of the runtime.yaml file to alluxio.user.file.writetype.default: THROUGH.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: fluid-elbencho-write
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: elbencho
              image: breuner/elbencho
              command: ["/usr/bin/elbencho"]
              args: ["-d","--write", "-t", "10", "-n", "1", "-N", "100", "-s", "16M", "--direct", "-b", "16M", "/data/lindorm"]
              volumeMounts:
                - mountPath: /data
                  name: lindorm-vol
          volumes:
            - name: lindorm-vol
              persistentVolumeClaim:
                claimName: lindorm

    The following list describes the parameters that you can configure in args. For more information about other parameters that you can configure in the write.yaml file, see the elbencho documentation.

    • -d: Creates the test directories.

    • --write: Specifies that the job is a data writing job.

    • -t: Specifies the number of threads used to write data.

    • -n: Specifies the number of directories created by each thread.

    • -N: Specifies the number of files created in each directory.

    • -s: Specifies the size of each file to write.

    • --direct: Specifies that no data is cached during the job.

    • -b: Specifies the size of the data blocks for each write operation.

    • /data/lindorm: Specifies the path to which the data is written.
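
    With these values, each of the 10 threads creates 1 directory containing 100 files of 16 MiB each, so the job writes 10 × 1 × 100 × 16 MiB = 16,000 MiB ≈ 15.625 GiB in total, which matches the amount of data mentioned above.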

  2. Execute the data writing job and wait until the job is complete.

    kubectl create -f write.yaml
    kubectl get pods

    The following result is returned:

    NAME                         READY   STATUS      RESTARTS   AGE
    fluid-elbencho-write-stfpq   0/1     Completed   0          3m29s
    lindorm-fuse-8lgj9           1/1     Running     0          3m29s
    lindorm-master-0             2/2     Running     0          5m37s
    lindorm-worker-0             2/2     Running     0          5m10s
    lindorm-worker-1             2/2     Running     0          5m9s
    lindorm-worker-2             2/2     Running     0          5m7s
  3. Clear the cache of the dataset and restart the dataset and the AlluxioRuntime instance.

    kubectl delete -f .
    kubectl create -f dataset.yaml
    kubectl create -f runtime.yaml
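
    After the dataset and the AlluxioRuntime instance are re-created, you can wait until the PHASE of the dataset returns to Bound before you submit the read job:

    kubectl get dataset lindorm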
  4. Verify how data reading is accelerated.

    Use elbencho to read the same data multiple times. Then, compare the time used to read the data and the throughput for data reading to verify how data reading is accelerated by using Fluid.

    1. Configure the read.yaml file to create a job to read data in the Lindorm instance over the S3 protocol.

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: fluid-elbencho-read
      spec:
        template:
          spec:
            restartPolicy: OnFailure
            containers:
              - name: elbencho
                image: breuner/elbencho
                command: ["/usr/bin/elbencho"]
                args: ["-d","--read", "-t", "10", "-n", "1", "-N", "100", "-s", "16M", "--direct", "-b", "16M", "/data/lindorm"]
                volumeMounts:
                  - mountPath: /data
                    name: lindorm-vol
            volumes:
              - name: lindorm-vol
                persistentVolumeClaim:
                  claimName: lindorm
    2. Execute the data reading job and wait until the job is complete.

      kubectl create -f read.yaml
      kubectl get pods

      The following result is returned:

      NAME                         READY   STATUS      RESTARTS   AGE
      fluid-elbencho-read-stfpq    0/1     Completed   0          3m29s
      lindorm-fuse-8lgj9           1/1     Running     0          3m29s
      lindorm-master-0             2/2     Running     0          5m37s
      lindorm-worker-0             2/2     Running     0          5m10s
      lindorm-worker-1             2/2     Running     0          5m9s
      lindorm-worker-2             2/2     Running     0          5m7s
    3. Run the following command to view the time required to read the data and the throughput for data reading:

      kubectl logs fluid-elbencho-read-stfpq

      The following result is returned:

      OPERATION RESULT TYPE        FIRST DONE  LAST DONE
      ========= ================   ==========  =========
      MKDIRS    Elapsed ms       :         33        120
                Dirs/s           :         30         83
                Dirs total       :          1         10
      ---
      READ      Elapsed ms       :      17585      18479
                Files/s          :         54         54
                Throughput MiB/s :        869        865
                Total MiB        :      15296      16000
                Files total      :        956       1000
      ---
    4. Run the following command to query the status of the dataset:

      kubectl get dataset lindorm

      The following result is returned. According to the result, all files that have been read are cached in Alluxio.

      NAME      UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
      lindorm   0.00B            15.63GiB   96.00GiB         +Inf%               Bound   9m54s
    5. Run the following command to delete the current job and create an identical data reading job:

      kubectl delete -f read.yaml && kubectl create -f read.yaml
    6. After the new job uses elbencho to read the data in the Lindorm instance again, run the following command to check the status of the job. The time used to execute the data reading job is significantly reduced because the data has been cached by Fluid.

      kubectl get pods

      The following result is returned:

      NAME                         READY   STATUS      RESTARTS   AGE
      fluid-elbencho-read-9gxkk    0/1     Completed   0          9s
      lindorm-fuse-ckwd9           1/1     Running     0          4m1s
      lindorm-fuse-vlr6r           1/1     Running     0          3m6s
      lindorm-master-0             2/2     Running     0          10m
      lindorm-worker-0             2/2     Running     0          9m28s
      lindorm-worker-1             2/2     Running     0          9m27s
      lindorm-worker-2             2/2     Running     0          9m26s
    7. Run the following command to view the time required to read the data and the throughput for data reading:

      kubectl logs fluid-elbencho-read-9gxkk

      The following result is returned:

      OPERATION RESULT TYPE        FIRST DONE  LAST DONE
      ========= ================   ==========  =========
      MKDIRS    Elapsed ms       :          7         32
                Dirs/s           :        132        312
                Dirs total       :          1         10
      ---
      READ      Elapsed ms       :       8081       9165
                Files/s          :        110        109
                Throughput MiB/s :       1771       1745
                Total MiB        :      14320      16000
                Files total      :        895       1000

      According to the result, the throughput for data reading is improved from 869 MiB/s to 1,771 MiB/s. This is because Fluid caches a file when you remotely access it in Lindorm for the first time. On subsequent reads, the cached copy is read instead of the original file. This way, data access to the file is significantly accelerated.

  5. (Optional) If you no longer need to accelerate data access to the files, run the following command to delete all resources created from the YAML files in the current directory and clear the environment:

    kubectl delete -f .