Container Service for Kubernetes: Use JindoFS to accelerate access to OSS

Last Updated: Nov 24, 2023

JindoRuntime is the execution engine of JindoFS developed by the Alibaba Cloud E-MapReduce (EMR) team. JindoRuntime is based on C++ and provides dataset management and caching. JindoRuntime also supports Object Storage Service (OSS). Alibaba Cloud provides cloud service-level support for JindoFS. Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use JindoFS to accelerate access to OSS.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster is created and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
  • The cloud-native AI suite is installed and the ack-fluid component is deployed.
    • If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
    • If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
  • A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
  • OSS is activated. For more information, see Activate OSS.

Background information

After you set up the Container Service for Kubernetes (ACK) cluster and OSS bucket, you need to deploy JindoRuntime. The deployment requires about 10 minutes.

Step 1: Upload data to OSS

  1. Run the following command to download a test dataset to an Elastic Compute Service (ECS) instance:

    wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
  2. Upload the test dataset to the OSS bucket.

    Important

    This example describes how to upload a test dataset to OSS from an ECS instance that runs the Alibaba Cloud Linux 3.2104 LTS 64-bit operating system. If you use other operating systems, see ossutil and Overview.

    1. Install ossutil.

    2. Create a bucket named examplebucket.

      • Run the following command to create a bucket named examplebucket:

        ossutil64 mb oss://examplebucket
      • If the following output is displayed, the bucket named examplebucket is created:

        0.668238(s) elapsed
    3. Upload the test dataset to examplebucket.

      ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
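The commands in this step can also be collected into one script. The following is a minimal sketch and not part of the original procedure: the function names upload_dataset and run and the DRY_RUN switch are hypothetical, and the script assumes that ossutil is already installed and configured on the ECS instance. When DRY_RUN is set to 1, the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch: the Step 1 commands wrapped in one function.
# DRY_RUN, run, and upload_dataset are hypothetical names used for illustration.
DATASET_URL="https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz"
BUCKET="examplebucket"

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

upload_dataset() {
  file=$(basename "$DATASET_URL")
  # Skip the download when the archive is already in the current directory.
  [ -f "$file" ] || run wget "$DATASET_URL" || return 1
  # Stop if the bucket cannot be created, for example because the name is taken.
  run ossutil64 mb "oss://$BUCKET" || return 1
  run ossutil64 cp "$file" "oss://$BUCKET"
}
```

To preview the commands without running them, set DRY_RUN=1 and call upload_dataset. To run them, call upload_dataset with DRY_RUN unset.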

Step 2: Create a dataset and a JindoRuntime

  1. Before you create a dataset, create a file named mySecret.yaml in the root directory of the ECS instance.

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: xxx
      fs.oss.accessKeySecret: xxx

    Specify fs.oss.accessKeyId and fs.oss.accessKeySecret as the AccessKey ID and AccessKey secret that are used to access OSS in Step 1.

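  2. Run the following command to create the Secret. Referencing the AccessKey pair from a Secret keeps it out of the dataset definition as plaintext. Note that Kubernetes stores Secret values base64-encoded by default and encrypts them at rest only if encryption is enabled for the cluster.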

    kubectl create -f mySecret.yaml
  3. Create a file named resource.yaml by using the following YAML template. This template is used to perform the following operations:

    • Create a dataset to specify information about the datasets in remote storage and the underlying file system (UFS).

    • Create a JindoRuntime to launch a JindoFS cluster for data caching.

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: hadoop
    spec:
      mounts:
        - mountPoint: oss://<oss_bucket>/<bucket_dir>
          options:
            fs.oss.endpoint: <oss_endpoint>
          name: hadoop
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: hadoop
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: HDD
            path: /mnt/disk1
            quota: 100G
            high: "0.99"
            low: "0.8"

    The following table describes the parameters in the YAML template.

    Parameter          Description
    mountPoint         oss://<oss_bucket>/<bucket_dir> specifies the path to the UFS that is mounted. The endpoint is not required in the path.
    fs.oss.endpoint    The public or private endpoint of the OSS bucket. For more information, see Regions and endpoints.
    replicas           The number of workers in the JindoFS cluster.
    mediumtype         The type of storage medium used for the cache. Valid values: HDD, SSD, and MEM.
    path               The storage path. You can specify only one path. If you set mediumtype to MEM, you must still specify a local path to store data such as logs.
    quota              The maximum size of cached data. Unit: GB.
    high               The upper limit of the storage capacity, specified as a ratio of quota ("0.99" in this example).
    low                The lower limit of the storage capacity, specified as a ratio of quota ("0.8" in this example).

  4. Run the following command to create a dataset and a JindoRuntime:

    kubectl create -f resource.yaml
  5. Run the following command to check whether the dataset is deployed:

    kubectl get dataset hadoop

    Expected output:

    NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hadoop   210MiB           0.00B    180.00GiB        0.0%                Bound   1h
  6. Run the following command to check whether the JindoRuntime is deployed:

    kubectl get jindoruntime hadoop

    Expected output:

    NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
    hadoop   Ready          Ready          Ready        4m45s
  7. Run the following command to check whether the persistent volume (PV) and persistent volume claim (PVC) are created:

    kubectl get pv,pvc

    Expected output:

    NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
    persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m
    
    NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX                           52m

The preceding outputs indicate that the dataset and JindoRuntime are created.
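Because the deployment can take several minutes, a script may need to wait for the dataset instead of checking it once. The following is a minimal sketch under stated assumptions: it assumes that the value shown in the PHASE column is exposed at .status.phase of the Dataset resource, and the function name wait_for_dataset_bound and its default values are hypothetical. The defaults poll every 10 seconds for up to 10 minutes.

```shell
#!/bin/sh
# Sketch: poll a Dataset until its phase is Bound.
# Assumption: the PHASE column is backed by .status.phase of the Dataset.
# wait_for_dataset_bound and its defaults are hypothetical.
wait_for_dataset_bound() {
  name=$1
  attempts=${2:-60}    # number of polls before giving up
  interval=${3:-10}    # seconds to sleep between polls
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl get dataset "$name" -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "Bound" ]; then
      echo "dataset $name is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "dataset $name did not reach Bound (last phase: ${phase:-unknown})" >&2
  return 1
}
```

For the dataset in this topic, the call would be wait_for_dataset_bound hadoop. A nonzero exit status means the dataset was not bound within the allotted time.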

Step 3: Create applications to test data acceleration

You can deploy applications in containers to test data acceleration of JindoFS. You can also submit machine learning jobs to use relevant features. In this topic, an application is deployed in a container to test access to the same data. The test is run multiple times to compare the time consumption.

  1. Create a file named app.yaml by using the following YAML template:

    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-app
    spec:
      containers:
        - name: demo
          image: fluidcloudnative/serving
          volumeMounts:
            - mountPath: /data
              name: hadoop
      volumes:
        - name: hadoop
          persistentVolumeClaim:
            claimName: hadoop
  2. Run the following command to deploy the application:

    kubectl create -f app.yaml
  3. Run the following commands to open a shell in the application container and query the size of the test file:

    kubectl exec -it demo-app -- bash
    du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz

    Expected output:

    210M    /data/spark-3.0.1-bin-hadoop2.7.tgz
  4. In the same container shell, run the following command to query the time consumed to copy the file:

    time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

    Expected output:

    real    0m18.386s
    user    0m0.002s
    sys     0m0.105s

    The output indicates that copying the file takes about 18 seconds.

  5. Exit the container shell, and then run the following command to check the cached data of the dataset:

    kubectl get dataset hadoop

    Expected output:

    NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hadoop   210.00MiB        210.00MiB   180.00GiB        100.0%              Bound   1h

    The output indicates that 210 MiB of data is cached in local storage.

  6. Run the following command to delete the current application and then create the same application:

    Note

    This step prevents other factors, such as the page cache, from affecting the result.

    kubectl delete -f app.yaml && kubectl create -f app.yaml
  7. Run the following commands to open a shell in the new container and query the time consumed to copy the file again:

    kubectl exec -it demo-app -- bash
    time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

    Expected output:

    real    0m0.048s
    user    0m0.001s
    sys     0m0.046s

    The output indicates that copying the file takes 48 milliseconds. Compared with the 18.386 seconds of the first read, the copy is about 380 times faster.

    Note

    The second copy is faster because the file is served from the JindoFS cache instead of being read from OSS.
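The speedup can be computed directly from the two "real" values in the outputs above. The following is a minimal sketch: the helper names to_seconds and speedup are hypothetical, and the input is assumed to use the XmY.YYYs format that time prints in these examples.

```shell
#!/bin/sh
# Convert a "real" value printed by time (for example 0m18.386s) to seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of the first (uncached) copy time to the second (cached) copy time,
# rounded to the nearest integer.
speedup() {
  awk -v cold="$(to_seconds "$1")" -v warm="$(to_seconds "$2")" \
    'BEGIN { printf "%.0f\n", cold / warm }'
}

speedup 0m18.386s 0m0.048s   # prints 383
```

The measured times vary from run to run, so treat the ratio as an order of magnitude and not as a fixed benchmark result.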

Clear the environment

If you no longer use data acceleration, clear the environment.

Run the following commands to delete the application and the JindoRuntime:

kubectl delete -f app.yaml
kubectl delete jindoruntime hadoop

Run the following command to delete the dataset:

kubectl delete dataset hadoop