
Container Compute Service:Use Fluid to accelerate data access

Last Updated:Feb 24, 2025

JindoRuntime is implemented in C++ and supports dataset management, data caching, and data storage in OSS. Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use Fluid to accelerate data access in scenarios in which ACS compute power is used.

Prerequisites

  • Object Storage Service (OSS) is activated. For more information, see Activate OSS.

  • ack-fluid 1.0.11-* or later is installed. For more information, see Use Helm to manage applications in ACS.

  • The privileged mode is enabled for ACS pods.

    Note

    The privileged mode is required for using Fluid to accelerate data access. To enable this mode, submit a ticket.

Procedure

Step 1: Upload data to OSS

  1. Run the following command to download the test data.

    wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
  2. Upload the test dataset to the OSS bucket.

    Important

    This example describes how to upload a test dataset to OSS from an ECS instance that runs the Alibaba Cloud Linux 3.2104 LTS 64-bit operating system. If you use other operating systems, see ossutil command reference and ossutil 1.0.

    1. Install ossutil.

    2. Run the following command to create a bucket named examplebucket.

      Note

      If the command returns ErrorCode=BucketAlreadyExists, a bucket with this name already exists. OSS bucket names must be globally unique, so change the examplebucket name on demand.

      ossutil64 mb oss://examplebucket

      Expected results:

      0.668238(s) elapsed

      If the preceding output is displayed, the bucket named examplebucket is created.

    3. Upload the test dataset to examplebucket.

      ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
    4. (Optional) Configure permissions to access the bucket and data. For more information, see Permission control.

  3. Create a file named mySecret.yaml and add the following content to the file.

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: xxx
      fs.oss.accessKeySecret: xxx

    fs.oss.accessKeyId and fs.oss.accessKeySecret specify the AccessKey ID and AccessKey secret used to access OSS.

  4. Run the following command to create a Secret. Storing the AccessKey pair in a Secret keeps it out of plaintext application configurations.

    kubectl create -f mySecret.yaml
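As an aside, Kubernetes stores each stringData value Base64-encoded under the Secret's .data field. The following local round trip, using the placeholder value xxx from mySecret.yaml, shows what is actually stored (plain shell, no cluster required):

```shell
# Base64-encode the placeholder Secret value "xxx", as Kubernetes does for stringData.
printf '%s' 'xxx' | base64

# Decoding recovers the original value, so treat Secret YAML files as sensitive.
printf '%s' 'xxx' | base64 | base64 -d
```

To inspect the stored value after the Secret exists, you can run, for example, kubectl get secret mysecret -o jsonpath='{.data}' and decode the fields the same way.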

Step 2: Create a dataset and a JindoRuntime

  1. Create a file named resource.yaml and add the following content to the file.

    • Create a dataset to specify information about the datasets in remote storage and the underlying file system (UFS).

    • Create a JindoRuntime to launch a JindoFS cluster for data caching.

    Note

    Run the kubectl get pods --field-selector=status.phase=Running -n fluid-system command to check whether the dataset-controller and jindoruntime-controller of the ack-fluid component are running as expected.

    This example uses CPU compute power. To accelerate the loading of large language models (LLMs), make sure that the zone of your cluster provides GPU resources. For more information, see Introduction to GPU compute classes.


    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: hadoop
    spec:
      placement: Shared
      mounts:
          ## To specify subdirectories, configure oss://<oss_bucket>/<oss_path>.
        - mountPoint: oss://<oss_bucket>       # Replace <oss_bucket> with the actual value.
          options:
            fs.oss.endpoint: <oss_endpoint>    # Replace <oss_endpoint> with the actual value.
          name: hadoop
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      ## Specify the name of the dataset.
      name: hadoop
    spec:
      networkmode: ContainerNetwork
      ## Modify on demand.
      replicas: 4
      master:
        podMetadata:
          labels:
            alibabacloud.com/compute-class: performance
            alibabacloud.com/compute-qos: default
      worker:
        podMetadata:
          labels:
            alibabacloud.com/compute-class: performance
            alibabacloud.com/compute-qos: default
        resources:
          requests:
            cpu: 24
            memory: 48Gi
          limits:
            cpu: 24
            memory: 48Gi
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            volumeType: emptyDir
            ## Modify on demand.
            quota: 48Gi
            high: "0.99"
            low: "0.95"

    The following table describes the parameters.

    Parameter

    Description

    mountPoint

    oss://<oss_bucket> indicates the UFS path that is mounted. <oss_bucket> indicates the name of the OSS bucket, such as oss://examplebucket.

    fs.oss.endpoint

    The endpoint of the OSS bucket. You can specify the public or private endpoint. Example: oss-cn-beijing-internal.aliyuncs.com. For more information, see OSS regions and endpoints.

    replicas

    The number of workers in the JindoFS cluster.

    mediumtype

    The type of the cache medium. A JindoRuntime supports only one of the following cache types: HDD, SSD, and MEM.

    path

    The storage path. You can specify only one path. If you set mediumtype to MEM, specify a local memory-backed path, such as /dev/shm, to store data.

    quota

    The maximum size of the data that each worker can cache. In this example, 48Gi.

    high

    The high watermark of cache storage usage, expressed as a ratio of the quota.

    low

    The low watermark of cache storage usage, expressed as a ratio of the quota.
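Given the sample values above (replicas: 4, quota: 48Gi, high: 0.99, low: 0.95), the aggregate cache capacity and the per-worker watermarks are plain arithmetic. The following sketch just computes them; the eviction semantics described in the comment are the common watermark behavior and should be verified against your JindoFS version:

```shell
replicas=4        # spec.replicas in the JindoRuntime
quota_gib=48      # tieredstore quota per worker, in GiB

# Aggregate cache capacity across all workers.
echo "total cache capacity: $((replicas * quota_gib))Gi"

# Watermarks are ratios of the per-worker quota: eviction typically starts
# once usage exceeds the high watermark and stops below the low watermark.
awk -v q="$quota_gib" 'BEGIN { printf "per-worker high/low: %.2fGi / %.2fGi\n", q * 0.99, q * 0.95 }'
```

With these numbers, the four workers together provide 192Gi of memory cache, and each worker evicts between roughly 47.52Gi and 45.60Gi of usage.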

  2. Run the following command to create a dataset and a JindoRuntime:

    kubectl create -f resource.yaml
  3. View the deployment of the JindoRuntime and dataset.

    1. View the deployment of the dataset.

      kubectl get dataset hadoop

      Expected results:

      NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
      hadoop   209.74MiB        0.00B    4.00GiB          0.0%                Bound   56s
    2. View the deployment of the JindoRuntime.

      kubectl get jindoruntime hadoop

      Expected results:

      NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
      hadoop   Ready          Ready          Ready        2m11s

      The preceding outputs indicate that the dataset and JindoRuntime are created.

  4. Run the following command to check whether the persistent volume (PV) and persistent volume claim (PVC) are created. The PV and PVC are named after the dataset.

    kubectl get pv,pvc

    Expected results:

    NAME                              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
    persistentvolume/default-hadoop   100Pi      ROX            Retain           Bound    default/hadoop   fluid          <unset>                          2m5s
    
    NAME                           STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
    persistentvolumeclaim/hadoop   Bound    default-hadoop   100Pi      ROX            fluid          <unset>                 2m5s

Step 3: Create a DataLoad resource

To accelerate subsequent data access, preload the dataset into the cache once.

  1. If the model data stored in the OSS bucket is static, create a file named dataload.yaml and add the following content to the file to preload the data.

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: hadoop
    spec:
      dataset:
        name: hadoop
        namespace: default
      loadMetadata: true
  2. If the model data stored in the OSS bucket is dynamic, you need to periodically preload the data. For more information, see Scenario 2: Data in the backend storage is read-only but periodically changes.

  3. Create a DataLoad resource to preload the model data once.

    kubectl create -f dataload.yaml
  4. View the status of preloading.

    kubectl get dataload

    Expected results:

    NAME     DATASET   PHASE      AGE   DURATION
    hadoop   hadoop    Complete   92m   51s
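For the dynamic-data case in step 2 above, the Fluid DataLoad API also supports cron-style scheduling, so the preload can run periodically instead of once. The following is a sketch only: the resource name hadoop-cron is hypothetical, and the policy and schedule fields come from the Fluid v1alpha1 API, so verify that your installed ack-fluid version supports them before use.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: hadoop-cron        # hypothetical name, modify on demand
spec:
  dataset:
    name: hadoop
    namespace: default
  loadMetadata: true
  policy: Cron             # run on a schedule instead of once
  schedule: "0 * * * *"    # standard cron syntax: hourly, modify on demand
```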

Step 4: Create pods to verify data acceleration

You can create pods or submit a machine learning job to verify the JindoFS data acceleration service. In this example, an application is deployed in a container to test access to the same data. The test is run multiple times to compare the time consumption.

  1. Create a file named app.yaml by using the following YAML template:

    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-app
      labels:
        # For ACS pods, the Fluid webhook injects Fluid components as sidecar containers. You must configure the following label:
        alibabacloud.com/fluid-sidecar-target: acs
    spec:
      containers:
        - name: demo
          image: mirrors-ssl.aliyuncs.com/nginx:latest
          volumeMounts:
            - mountPath: /data
              name: hadoop
          resources:
            requests:
              cpu: 14
              memory: 56Gi
      volumes:
        - name: hadoop
          persistentVolumeClaim:
            ## The name of the Fluid dataset.
            claimName: hadoop
      nodeSelector:
        type: virtual-kubelet
      tolerations:
        - key: virtual-kubelet.io/provider
          operator: Equal
          value: alibabacloud
          effect: NoSchedule
  2. Run the following command to create an application pod:

    kubectl create -f app.yaml
  3. Test the file copy speed without using JindoFS caches.

    1. View the size of the test file.

      kubectl exec -it demo-app -c demo -- du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz

      Expected results:

      210M    /data/spark-3.0.1-bin-hadoop2.7.tgz
    2. View the amount of time required to copy the file in the container.

      kubectl exec -it demo-app -c demo -- bash
      time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

      Expected results:

      real    0m1.883s
      user    0m0.001s
      sys     0m0.041s

      The preceding output indicates that it takes 1.883 seconds to copy the file.

  4. View the dataset cache.

    kubectl get dataset hadoop

    Expected results:

    NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hadoop   209.74MiB        209.74MiB   4.00GiB          100.0%              Bound   64m

    The preceding output indicates that 100.0% of the data is cached by JindoFS.

  5. Delete and re-create the sample pod, and then view the file copy time again.

    Note

    You need to delete the sample pod to eliminate the impact of other factors, such as the page cache. If the pod still held a local page cache, the file would preferably be copied from that cache, which would skew the result.

    Run the following commands to re-create the pod and query the time required to copy the file:

    kubectl delete -f app.yaml
    kubectl create -f app.yaml
    kubectl exec -it demo-app -c demo -- bash
    time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

    Expected results:

    real    0m0.203s
    user    0m0.000s
    sys     0m0.047s

    The preceding output indicates that it takes 0.203 seconds to copy the file, about nine times faster than the first copy. This is because JindoFS has cached the file, and accessing cached data is much faster.

    Important

    The copy time provided in this topic is for reference only.
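The speedup quoted above is just the ratio of the two real times reported by time. With the sample figures from this topic (1.883 s uncached, 0.203 s cached), a quick check:

```shell
# Ratio of the uncached copy time to the cached copy time from the sample runs.
awk 'BEGIN { printf "speedup: %.1fx\n", 1.883 / 0.203 }'
```

Your own timings will differ; recompute the ratio with your measured values.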

Use ACS compute power in ACK Pro clusters

This topic describes how to use JindoFS to accelerate file copy operations in ACS clusters. You can also complete the operation by using ACS compute power in ACK Pro managed clusters. For more information, see Use the computing power of ACS in ACK Pro clusters.

To verify data acceleration in an ACK managed cluster, make the following adjustments:

  1. Install the ack-fluid component in the ACK managed cluster. For more information, see Use Helm to simplify application deployment.

  2. Create a dataset and JindoRuntime based on the following content.

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: hadoop
    spec:
      mounts:
          ## To specify subdirectories, configure oss://<oss_bucket>/<oss_path>.
        - mountPoint: oss://<oss_bucket>       # Replace <oss_bucket> with the actual value.
          options:
            fs.oss.endpoint: <oss_endpoint>    # Replace <oss_endpoint> with the actual value.
          name: hadoop
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: hadoop
    spec:
      ## Modify on demand.
      replicas: 4
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            volumeType: emptyDir
            quota: 48Gi
            high: "0.99"
            low: "0.95"

    Differences between ACK managed clusters and ACS clusters:

    • The nodes of ACS clusters are virtual nodes and cannot be scaled in the same way as ACK cluster nodes. Therefore, in ACS clusters you must configure .spec.placement: Shared and .spec.networkmode: ContainerNetwork.

    • Fluid workers require high bandwidth. To make sure that the ACS pods have sufficient bandwidth, configure the alibabacloud.com/compute-class: performance label and set resource requests and limits as shown in the ACS example.