
Use Apache Flink to access LindormDFS

Last Updated: Oct 20, 2021

This topic describes how to use self-managed Apache Flink clusters to access LindormDFS.

Preparations

  1. Activate the LindormDFS service. For more information, see Activate the LindormDFS service.

  2. Install a Java Development Kit (JDK) on each compute node. The JDK version must be 1.8 or later.
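
    To verify that the JDK is installed and meets the version requirement, you can run the following command on each node:

      java -version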

  3. Install Scala on compute nodes.

    Download Scala from its official website. The Scala version must be compatible with the Apache Flink version.
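
    The Apache Flink package used in this topic is built for Scala 2.11 (see the file name in the Configure Apache Flink section). To check the installed Scala version, you can run:

      scala -version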

  4. Download the Apache Hadoop package.

    Download Apache Hadoop from its official website. We recommend that you download Apache Hadoop 2.7.3 or later. Apache Hadoop 2.7.3 is used in this topic.
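
    For example, you can download the release from the Apache archive (URL shown for reference; you can also use a mirror site):

      wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
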
  5. Download the Apache Flink package.

    The Apache Flink version used with LindormDFS must be 1.9.0 or later. Download Apache Flink from its official website. Apache Flink 1.9.0, an official precompiled release, is used in this topic.
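
    For example, you can download the precompiled release for Scala 2.11 from the Apache archive (URL shown for reference):

      wget https://archive.apache.org/dist/flink/flink-1.9.0/flink-1.9.0-bin-scala_2.11.tgz
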
Note

Replace the installation package versions and the folder paths throughout the procedure in this topic with the actual values.

Configure Apache Hadoop

  1. Extract the downloaded package to the specified directory.

    tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
  2. Modify the hadoop-env.sh configuration file.

    1. Run the following command to open the hadoop-env.sh configuration file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
    2. Configure JAVA_HOME.

      export JAVA_HOME=${JDK installation directory}
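
      For example, if the JDK is installed in /usr/lib/jvm/java-1.8.0 (an example path; replace it with your actual JDK installation directory):

      export JAVA_HOME=/usr/lib/jvm/java-1.8.0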
  3. Modify the core-site.xml file.

    1. Run the following command to open the core-site.xml file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml
    2. Modify the core-site.xml file, as shown in the following example. Replace ${Instance ID} with your actual instance ID.

      <configuration>
        <property>
           <name>fs.defaultFS</name>
           <value>hdfs://${Instance ID}</value>
        </property>
      </configuration>
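
      For example, if your instance ID is ld-uf630q8031846lsxm (the sample instance ID used later in this topic), set the value to hdfs://ld-uf630q8031846lsxm.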
  4. Modify the mapred-site.xml configuration file.

    1. Run the following command to open the mapred-site.xml file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml
    2. Modify the mapred-site.xml file, as shown in the following example:

      <configuration>
      <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
      </property>
      </configuration>
  5. Modify the yarn-site.xml configuration file.

    1. Run the following command to open the yarn-site.xml configuration file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml
    2. Modify the yarn-site.xml file, as shown in the following example:

      <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>xxxx</value>
        <!-- Enter the host name for the ResourceManager of Apache Hadoop YARN in your cluster. -->
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>16384</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>4</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>3584</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>14336</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      </configuration>
  6. Modify the slaves configuration file.

    1. Run the following command to open the slaves configuration file:

       vim /usr/local/hadoop-2.7.3/etc/hadoop/slaves 
    2. Modify the slaves configuration file, as shown in the following example. In this example, the Apache Flink cluster has two nodes: node1 and node2.

      node1
      node2

      node1 and node2 are the names of the machines where the Apache Flink cluster nodes are deployed.

  7. Configure environment variables.

    1. Run the following command to open the profile file in the /etc path:

      vim /etc/profile
    2. Add the following information to the end of the content in the /etc/profile configuration file:

      export HADOOP_HOME=/usr/local/hadoop-2.7.3
      export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.3/etc/hadoop:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/common/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.3/contrib/capacity-scheduler/*.jar
      export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
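
      As an equivalent shortcut (assuming that /usr/local/hadoop-2.7.3/bin/hadoop is executable), you can also let Hadoop generate the classpath value:

      export HADOOP_CLASSPATH=$(/usr/local/hadoop-2.7.3/bin/hadoop classpath)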
    3. Run the following command to make the configuration take effect:

      source /etc/profile

  8. Run the following command to synchronize the folder specified by ${HADOOP_HOME} to the other nodes in the cluster:

    scp -r hadoop-2.7.3/ root@node2:/usr/local/
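
    If your cluster contains more than two nodes, run the same command for each of the other nodes.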

Verify Apache Hadoop configuration

After Apache Hadoop is configured, do not format the NameNode or use the start-dfs.sh script to start the services related to Hadoop Distributed File System (HDFS). If you need to use the Apache Hadoop YARN service, you only need to start this service on your ResourceManager node. For more information about how to check whether Apache Hadoop is configured as expected, see Use open source HDFS clients to access LindormDFS.
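
For example, assuming the directory layout used in this topic, you can start the YARN service on the ResourceManager node and then confirm that the NodeManager nodes have registered:

    /usr/local/hadoop-2.7.3/sbin/start-yarn.sh
    /usr/local/hadoop-2.7.3/bin/yarn node -list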

Configure Apache Flink

  1. Run the following command to extract the Apache Flink package to the specified directory:

    tar -zxvf flink-1.9.0-bin-scala_2.11.tgz -C /usr/local/
Note
  • Before you use Apache Flink, you must configure HADOOP_HOME, HADOOP_CLASSPATH, and HADOOP_CONF_DIR in your cluster environment variables. For more information, see Step 7 Configure environment variables in Configure Apache Hadoop.

  • For more information about how to configure Apache Flink, see Configuration in the Apache Flink documentation.
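
To confirm that these variables are visible in the shell from which you start Flink, you can print them, for example:

    echo $HADOOP_HOME $HADOOP_CONF_DIR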

Verify Apache Flink configuration

Use the WordCount.jar file provided by Apache Flink to read data from LindormDFS and write the computation results back to LindormDFS. Before you perform these operations, you must start the Apache Hadoop YARN service.

  1. Create test data.

    RandomTextWriter in the hadoop-mapreduce-examples-2.7.3.jar JAR file provided by Apache Hadoop 2.7.3 is used to generate test data in LindormDFS.

    /usr/local/hadoop-2.7.3/bin/hadoop jar /usr/local/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar randomtextwriter \
    -D mapreduce.randomtextwriter.totalbytes=10240 \
    -D mapreduce.randomtextwriter.bytespermap=1024 \
    -D mapreduce.job.maps=4 \
    -D mapreduce.job.reduces=2 \
    /flink-test/input
  2. View the test data generated in LindormDFS.

     /usr/local/hadoop-2.7.3/bin/hadoop fs -cat /flink-test/input/*
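
    You can also list the generated files first, for example:

      /usr/local/hadoop-2.7.3/bin/hadoop fs -ls /flink-test/input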
  3. Submit a WordCount Flink job.

    /usr/local/flink-1.9.0/bin/flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 \
    /usr/local/flink-1.9.0/examples/batch/WordCount.jar \
    --input hdfs://ld-uf630q8031846lsxm/flink-test/input \
    --output hdfs://ld-uf630q8031846lsxm/flink-test/output
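
    In this command, -m yarn-cluster submits the job to a YARN cluster, -yn 1 requests one TaskManager container, and -yjm and -ytm set the JobManager and TaskManager memory sizes in MB.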
  4. View the result file on LindormDFS.

    /usr/local/hadoop-2.7.3/bin/hadoop fs -cat /flink-test/output
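
    Each line of the output typically contains a word and its occurrence count.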