E-MapReduce:Use JindoFS to accelerate access to OSS

Last Updated:Mar 29, 2024

This topic describes how to use Fluid and JindoRuntime in a Spark cluster that is created in E-MapReduce (EMR) on Container Service for Kubernetes (ACK) to accelerate access to Object Storage Service (OSS) buckets.

Background information

Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios, such as big data and AI applications. For more information about Fluid, see Overview. JindoRuntime is the execution engine of JindoFS, which is developed by the Alibaba Cloud EMR team. JindoRuntime is implemented in C++ and supports dataset management, data caching, and data storage in OSS.

You can use Fluid and JindoRuntime to accelerate data reads from OSS buckets for Spark jobs that are created in EMR on ACK clusters. This reduces bandwidth usage and traffic consumption.

Prerequisites

  • A Spark cluster is created on the EMR on ACK page. For more information, see Getting started.

  • Helm is installed on your on-premises machine. For more information, see Install Helm.

Procedure

  1. Step 1: Install Fluid in the associated ACK cluster

  2. Step 2: Create a dataset and JindoRuntime

  3. Step 3: Configure the connection parameters of the Spark cluster

Step 1: Install Fluid in the associated ACK cluster

  1. Log on to the ACK console.

  2. In the left-side navigation pane, choose Marketplace > Marketplace.

  3. On the Marketplace page, enter ack-fluid in the search box and click the Search icon. Then, click ack-fluid.

  4. In the upper-right corner of the page that appears, click Deploy.

  5. In the Basic Information step, select the ACK cluster that is associated with the EMR Spark cluster and click Next.

  6. In the Parameters step, select a chart version, configure the parameters, and then click OK.

Step 2: Create a dataset and JindoRuntime

  1. Create a file named resource.yaml. The file contains the following content:

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: hadoop
    spec:
      mounts:
        - mountPoint: oss://test-bucket/
          options:
            fs.oss.accessKeyId: <OSS_ACCESS_KEY_ID>
            fs.oss.accessKeySecret: <OSS_ACCESS_KEY_SECRET>
            fs.oss.endpoint: <OSS_ENDPOINT>
          name: hadoop
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: hadoop
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: HDD
            path: /mnt/disk1
            quota: 100Gi
            high: "0.9"
            low: "0.8"

    The file content consists of the following parts:

    • The first part is the code of the custom resource definition (CRD) for the dataset. The code describes the source of the dataset. In this example, the source is test-bucket, which is the OSS bucket to which you want to accelerate access.

    • The second part is the code of JindoRuntime. JindoRuntime is created to enable JindoFS for data caching in the cluster. For information about the parameters in the code, see the JindoRuntime documentation.

      The path parameter specifies the directory in which data on the nodes is cached. We recommend that you use nodes with sufficient data disk capacity and set the path parameter to the directory to which data disks are mounted.

  2. Run the following command to create a dataset and JindoRuntime:

    kubectl create -f resource.yaml -n fluid-system
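    After the objects are created, you can confirm that the dataset is bound and that the JindoFS master and worker pods are running. These commands must be run against the ACK cluster, and the exact output depends on your environment, so treat this as a sketch:

    ```shell
    # Check that the dataset has reached the Bound phase.
    kubectl get dataset hadoop -n fluid-system

    # Check the status of the JindoRuntime.
    kubectl get jindoruntime hadoop -n fluid-system

    # List the JindoFS master and worker pods.
    kubectl get pods -n fluid-system | grep hadoop-jindofs
    ```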

Step 3: Configure the connection parameters of the Spark cluster

  1. Run the following command to obtain the connection address that can be used to connect to JindoFS:

    kubectl get svc -n fluid-system | grep jindo

    In the returned information, find the JindoFS master service. In this example, the connection address is hadoop-jindofs-master-0.fluid-system:18000.
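    The connection address follows the pattern <service-name>.<namespace>:<rpc-port>. As a sketch, the following snippet assembles the address from a hypothetical sample line that stands in for live kubectl output; on a real cluster, pipe the actual command output instead:

    ```shell
    # Hypothetical sample line standing in for live output of:
    #   kubectl get svc -n fluid-system | grep jindo
    svc_line='hadoop-jindofs-master-0   ClusterIP   172.16.0.10   <none>   18000/TCP   5m'

    # Connection address = <service-name>.<namespace>:<rpc-port>
    addr="$(echo "$svc_line" | awk '{split($5, p, "/"); print $1 ".fluid-system:" p[1]}')"
    echo "$addr"
    ```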

  2. Go to the spark-defaults.conf tab.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.

    2. On the EMR on ACK page, find the Spark cluster that you created and click Configure in the Actions column.

    3. On the Configure tab, click spark-defaults.conf.

  3. Connect the Spark cluster to JindoFS.

    1. Click Add Configuration Item.

    2. In the Add Configuration Item dialog box, add the configuration items that are described in the following table.

      • spark.hadoop.fs.xengine: Set the value to jindofsx.

      • spark.hadoop.fs.jindofsx.data.cache.enable: Specifies whether to enable data caching. Set the value to true.

      • spark.hadoop.fs.jindofsx.meta.cache.enable: Specifies whether to enable metadata caching. Valid values: true (enables metadata caching) and false (disables metadata caching; this is the default value).

      • spark.hadoop.fs.jindofsx.client.metrics.enable: Set the value to true.

      • spark.hadoop.fs.jindofsx.storage.connect.enable: Set the value to true.

      • spark.hadoop.fs.jindofsx.namespace.rpc.address: The JindoFS connection address that you obtained in substep 1 of this step. In this example, the connection address is hadoop-jindofs-master-0.fluid-system:18000.

      • spark.hadoop.fs.oss.accessKeyId: The AccessKey ID that is used to access OSS. Your account must have read and write permissions on OSS.

      • spark.hadoop.fs.oss.accessKeySecret: The AccessKey secret that is used to access OSS. Your account must have read and write permissions on OSS.

    3. Click OK.

    4. In the dialog box that appears, configure the Execution Reason parameter and click Save.
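    For reference, the configuration items above correspond to the following spark-defaults.conf entries. The values shown are the example values from this topic; the AccessKey placeholders are stand-ins that you must replace with your own credentials:

    ```
    spark.hadoop.fs.xengine                           jindofsx
    spark.hadoop.fs.jindofsx.data.cache.enable        true
    spark.hadoop.fs.jindofsx.meta.cache.enable        true
    spark.hadoop.fs.jindofsx.client.metrics.enable    true
    spark.hadoop.fs.jindofsx.storage.connect.enable   true
    spark.hadoop.fs.jindofsx.namespace.rpc.address    hadoop-jindofs-master-0.fluid-system:18000
    spark.hadoop.fs.oss.accessKeyId                   <OSS_ACCESS_KEY_ID>
    spark.hadoop.fs.oss.accessKeySecret               <OSS_ACCESS_KEY_SECRET>
    ```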

  4. Make the configurations take effect.

    1. In the lower part of the Configure tab, click Deploy Client Configuration.

    2. In the dialog box that appears, configure the Execution Reason parameter and click OK.

    3. In the Confirm message, click OK.

    After you complete the preceding operations, you can submit Spark jobs to the EMR on ACK cluster to verify accelerated access to OSS data.