This topic describes how to use self-managed Apache Flink clusters to access LindormDFS.
Preparations
Activate the LindormDFS service. For more information, see Activate the LindormDFS service.
Install a Java Development Kit (JDK) on each compute node. The JDK version must be 1.8 or later.
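You can run the following command to check the installed JDK version (the exact output depends on your JDK distribution):
java -version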
Install Scala on compute nodes.
Download Scala from its official website. The Scala version must be compatible with the Apache Flink version.
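Similarly, you can verify the installed Scala version:
scala -version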
Download the Apache Hadoop package.
Download Apache Hadoop from its official website. We recommend that you download Apache Hadoop version 2.7.3 or later. Apache Hadoop version 2.7.3 is used in this topic.
Download the Apache Flink package.
The version of Apache Flink used with LindormDFS must be 1.9.0 or later. Download Apache Flink from its official website. Apache Flink 1.9.0, an official precompiled release, is used in this topic.
Replace the installation package versions and folder paths used throughout this topic with your actual values.
Configure Apache Hadoop
Extract the downloaded package to the specified directory.
tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
Modify the hadoop-env.sh configuration file.
Run the following command to open the hadoop-env.sh configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
Configure JAVA_HOME:
export JAVA_HOME=${JDK installation directory}
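For example, if the JDK is installed in /usr/java/jdk1.8.0_151 (a hypothetical path; use your actual installation directory), the line is:
export JAVA_HOME=/usr/java/jdk1.8.0_151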
Modify the core-site.xml file.
Run the following command to open the core-site.xml file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml
Modify the core-site.xml file, as shown in the following example. Replace ${Instance ID} with your actual instance ID.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://${Instance ID}</value>
    </property>
</configuration>
Modify the mapred-site.xml file.
Run the following command to open the mapred-site.xml file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml
Modify the mapred-site.xml file, as shown in the following example:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Modify the yarn-site.xml configuration file.
Run the following command to open the yarn-site.xml configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml
Modify the yarn-site.xml file, as shown in the following example:
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>xxxx</value> <!-- Enter the hostname of the ResourceManager of Apache Hadoop YARN in your cluster. -->
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>16384</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>4</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>3584</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>14336</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
</configuration>
Modify the slaves configuration file.
Run the following command to open the slaves configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/slaves
Modify the slaves configuration file, as shown in the following example. In this example, the Apache Hadoop cluster has two nodes: node1 and node2.
node1
node2
node1 and node2 are the hostnames of the machines on which the cluster nodes are deployed.
Configure environment variables.
Run the following command to open the profile file in the /etc path:
vim /etc/profile
Add the following lines to the end of the /etc/profile configuration file:
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.3/etc/hadoop:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/common/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.3/contrib/capacity-scheduler/*.jar
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
Run the following command to make the configuration take effect:
source /etc/profile
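To confirm that the variables are set, you can print HADOOP_HOME and invoke the Hadoop client (the version output should match the installed package):
echo $HADOOP_HOME
/usr/local/hadoop-2.7.3/bin/hadoop version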
Run the following command to synchronize the folder specified by ${HADOOP_HOME} to the other nodes in the cluster:
scp -r hadoop-2.7.3/ root@node2:/usr/local/
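The environment variables configured in /etc/profile must also be available on the other nodes. For example, you can copy the profile file as well (this assumes root SSH access to node2, as in the preceding scp command):
scp /etc/profile root@node2:/etc/profile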
Verify Apache Hadoop configuration
After Apache Hadoop is configured, do not format the NameNode or use the start-dfs.sh script to start services related to Hadoop Distributed File System (HDFS). If you need the Apache Hadoop YARN service, you only need to start it on your ResourceManager node. For more information about how to check whether Apache Hadoop is configured as expected, see Use open source HDFS clients to access LindormDFS.
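For example, you can start the YARN services by running the following command on the ResourceManager node (start-yarn.sh starts the ResourceManager locally and the NodeManagers on the nodes listed in the slaves file):
/usr/local/hadoop-2.7.3/sbin/start-yarn.sh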
Configure Apache Flink
Run the following command to extract the Apache Flink package to the specified directory:
tar -zxvf flink-1.9.0-bin-scala_2.11.tgz -C /usr/local/
Before you use Apache Flink, you must configure HADOOP_HOME, HADOOP_CLASSPATH, and HADOOP_CONF_DIR in your cluster environment variables. For more information, see Step 7 "Configure environment variables" in Configure Apache Hadoop.
For more information about how to configure Apache Flink, see Configuration in the Apache Flink documentation.
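A quick way to check that these variables are visible in the shell from which you submit Flink jobs:
echo $HADOOP_HOME
echo $HADOOP_CONF_DIR
echo $HADOOP_CLASSPATH | tr ':' '\n' | head -n 5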
Verify Apache Flink configuration
Use the WordCount.jar JAR file provided by Apache Flink to read data from LindormDFS and write the computing results to LindormDFS. Before you perform these operations, you must start the Apache Hadoop YARN service.
Create test data.
RandomTextWriter in the hadoop-mapreduce-examples-2.7.3.jar JAR file provided by Apache Hadoop 2.7.3 is used to generate test data in LindormDFS.
/usr/local/hadoop-2.7.3/bin/hadoop jar /usr/local/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar randomtextwriter \
-D mapreduce.randomtextwriter.totalbytes=10240 \
-D mapreduce.randomtextwriter.bytespermap=1024 \
-D mapreduce.job.maps=4 \
-D mapreduce.job.reduces=2 \
/flink-test/input
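Before you view the data, you can list the generated files to confirm that the job succeeded (the file names follow the standard MapReduce output naming, such as part-m-00000):
/usr/local/hadoop-2.7.3/bin/hadoop fs -ls /flink-test/input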
View the test data generated in LindormDFS.
/usr/local/hadoop-2.7.3/bin/hadoop fs -cat /flink-test/input/*
Submit a WordCount Flink job.
/usr/local/flink-1.9.0/bin/flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 \
/usr/local/flink-1.9.0/examples/batch/WordCount.jar \
--input hdfs://ld-uf630q8031846lsxm/flink-test/input \
--output hdfs://ld-uf630q8031846lsxm/flink-test/output
Replace ld-uf630q8031846lsxm with your actual instance ID.
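While the job runs, you can check its status through YARN (this assumes the YARN service started earlier is still running):
/usr/local/hadoop-2.7.3/bin/yarn application -list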
View the result file on LindormDFS.
/usr/local/hadoop-2.7.3/bin/hadoop fs -cat /flink-test/output