
Use Apache Flink to access LindormDFS

Last Updated: Mar 22, 2023

This topic describes how to use a self-managed Apache Flink cluster to access LindormDFS.

Before you begin

  • Activate the LindormDFS service. For more information, see Activate the LindormDFS service.

  • Install a Java Development Kit (JDK) on each compute node. The JDK version must be 1.8 or later.

  • Install Scala on compute nodes.

    Download Scala from its official website. The Scala version must be compatible with your Apache Flink version. The precompiled Apache Flink package used in this topic is built for Scala 2.11.

  • Download the Apache Hadoop package.

    To download Apache Hadoop from its official website, click Apache Hadoop. We recommend that you download Apache Hadoop version 2.7.3 or later. Apache Hadoop version 2.7.3 is used in this topic.

  • Download the Apache Flink package.

    The Apache Flink version used with LindormDFS must be 1.9.0 or later. To download Apache Flink from its official website, click Apache Flink. Apache Flink 1.9.0, an official precompiled release, is used in this topic. Sample download commands for both packages appear after this list.
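
If you prefer to download the packages from the command line, the following is a minimal sketch. The archive.apache.org URLs are assumptions based on the standard Apache archive layout; verify them against the official download pages.

    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
    wget https://archive.apache.org/dist/flink/flink-1.9.0/flink-1.9.0-bin-scala_2.11.tgz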

Note

Replace the installation package versions and the folder paths throughout the procedure in this topic with the actual values.

Configure Apache Hadoop

  1. Decompress the downloaded package to the target directory.

    tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
  2. Modify the configuration file hadoop-env.sh.

    1. Run the following command to open the configuration file hadoop-env.sh:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
    2. Configure JAVA_HOME.

      export JAVA_HOME=${JDK installation directory}
  3. Modify the core-site.xml file.

    1. Run the following command to open the core-site.xml file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml
    2. Modify the core-site.xml file, as shown in the following example. Replace ${Instance ID} with your actual instance ID.

      <configuration>
        <property>
           <name>fs.defaultFS</name>
           <value>hdfs://${Instance ID}</value>
        </property>
      </configuration>
  4. Modify the configuration file mapred-site.xml.

    1. Run the following command to open the mapred-site.xml file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml
    2. Modify the mapred-site.xml file, as shown in the following example:

      <configuration>
      <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
      </property>
      </configuration>
  5. Modify the yarn-site.xml configuration file.

    1. Run the following command to open the yarn-site.xml configuration file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml
    2. Add the following configuration to the yarn-site.xml file:

      <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>xxxx</value>
        <!-- Enter the host name for the ResourceManager of Apache Hadoop YARN in your cluster. -->
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>16384</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>4</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>3584</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>14336</value>
        <!-- Configure this item based on the capabilities of your cluster. -->
      </property>
      </configuration>
  6. Modify the configuration file slaves.

    1. Run the following command to open the configuration file slaves:

       vim /usr/local/hadoop-2.7.3/etc/hadoop/slaves 
    2. Modify the configuration file slaves, as shown in the following example. In this example, the Apache Hadoop cluster contains two nodes: node1 and node2.

      node1
      node2
      Note

      node1 and node2 are the hostnames of the machines on which the Apache Hadoop cluster nodes are deployed.

  7. Configure environment variables.

    1. Run the following command to open the configuration file /etc/profile:

      vim /etc/profile
    2. Add the following lines to the end of the /etc/profile file:

      export HADOOP_HOME=/usr/local/hadoop-2.7.3
      export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.3/etc/hadoop:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/common/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.3/contrib/capacity-scheduler/*.jar
      export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
    3. Run the following command to make the configuration take effect:

      source /etc/profile

  8. Run the following command to synchronize the folder specified by ${HADOOP_HOME} to the other nodes in the cluster:

    scp -r hadoop-2.7.3/ testuser@node2:/usr/local/
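
The HADOOP_HOME, HADOOP_CLASSPATH, and HADOOP_CONF_DIR variables from Step 7 must also be available on the other nodes. The following is a minimal sketch that reuses testuser and node2 from the step above; it assumes that testuser has permission to write /etc/profile on node2.

    # Copy the environment variable configuration to the other node.
    # New login shells on node2 pick up the variables automatically.
    scp /etc/profile testuser@node2:/etc/profile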

Verify Apache Hadoop configuration

After Apache Hadoop is configured, do not format the NameNode or use the start-dfs.sh script to start the services related to Hadoop Distributed File System (HDFS). If you need the Apache Hadoop YARN service, you only need to start that service on your ResourceManager node. For more information about how to check whether Apache Hadoop is configured as expected, see Use open source HDFS clients to access LindormDFS.
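
The following is a minimal sketch of these checks, based on the installation path used in this topic. Replace ${Instance ID} with your actual instance ID.

# Start only the YARN services. Do not format the NameNode or run start-dfs.sh.
/usr/local/hadoop-2.7.3/sbin/start-yarn.sh

# Confirm that the NodeManagers have registered with the ResourceManager.
/usr/local/hadoop-2.7.3/bin/yarn node -list

# Confirm that LindormDFS is reachable through the Hadoop client.
/usr/local/hadoop-2.7.3/bin/hadoop fs -ls hdfs://${Instance ID}/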

Configure Apache Flink

Run the following command to decompress the Apache Flink package to the target directory:

tar -zxvf flink-1.9.0-bin-scala_2.11.tgz -C /usr/local/
Note
  • Before you use Apache Flink, you must configure HADOOP_HOME, HADOOP_CLASSPATH, and HADOOP_CONF_DIR in your cluster environment variables. For more information, see Step 7 Configure environment variables in Configure Apache Hadoop.

  • For more information about how to configure Apache Flink, see Configuration in the Apache Flink documentation.
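
As a quick sanity check before you submit jobs, you can confirm that the Hadoop environment variables are visible and, optionally, start a short-lived Flink session on YARN. The yarn-session.sh options below are the ones documented for Flink 1.9; adjust them if you use another version.

# Confirm that the Hadoop environment variables are set.
echo $HADOOP_HOME $HADOOP_CONF_DIR

# Optionally, start a Flink session with one TaskManager on YARN to
# confirm that Flink can reach both YARN and LindormDFS.
/usr/local/flink-1.9.0/bin/yarn-session.sh -n 1 -jm 1024m -tm 1024m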

Verify Apache Flink configuration

Use the WordCount.jar file provided by Apache Flink to read data from LindormDFS and write the computing results back to LindormDFS. Before you verify the Apache Flink configuration, start the YARN service.

  1. Generate test data.

    RandomTextWriter in the hadoop-mapreduce-examples-2.7.3.jar JAR file provided by Apache Hadoop 2.7.3 is used to generate test data in LindormDFS.

    /usr/local/hadoop-2.7.3/bin/hadoop jar /usr/local/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar randomtextwriter \
    -D mapreduce.randomtextwriter.totalbytes=10240 \
    -D mapreduce.randomtextwriter.bytespermap=1024 \
    -D mapreduce.job.maps=4  \
    -D mapreduce.job.reduces=2  \
    /flink-test/input
  2. View the test data that is generated in LindormDFS.

     /usr/local/hadoop-2.7.3/bin/hadoop fs -cat /flink-test/input/*
  3. Submit a WordCount Flink job. In the following command, replace ld-uf630q8031846lsxm with your actual instance ID.

    /usr/local/flink-1.9.0/bin/flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 \
    /usr/local/flink-1.9.0/examples/batch/WordCount.jar \
    --input hdfs://ld-uf630q8031846lsxm/flink-test/input \
    --output hdfs://ld-uf630q8031846lsxm/flink-test/output
  4. View the result file that is generated in LindormDFS.

    /usr/local/hadoop-2.7.3/bin/hadoop fs -cat /flink-test/output
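
If you want to rerun the verification, remove the test directories first; the WordCount job may fail if the output path already exists. A minimal sketch:

/usr/local/hadoop-2.7.3/bin/hadoop fs -rm -r /flink-test/input /flink-test/output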