This topic describes how to use a self-managed Apache Flink cluster to access LindormDFS.
Before you begin
Activate the LindormDFS service. For more information, see Activate the LindormDFS service.
Install a Java Development Kit (JDK) on each compute node. The JDK version must be 1.8 or later.
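To confirm that a suitable JDK is available, you can run the following command on each node:
java -version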
Install Scala on compute nodes.
Download Scala from its official website. The Scala version must be compatible with the Apache Flink version. The Apache Flink 1.9.0 package used in this topic is prebuilt for Scala 2.11.
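You can check the installed Scala version, for example:
scala -version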
Download the Apache Hadoop package.
To download Apache Hadoop from its official website, click Apache Hadoop. We recommend that you download Apache Hadoop version 2.7.3 or later. Apache Hadoop version 2.7.3 is used in this topic.
Download the Apache Flink package.
The Apache Flink version used with LindormDFS must be 1.9.0 or later. To download Apache Flink from its official website, click Apache Flink. Apache Flink 1.9.0, an official precompiled release, is used in this topic.
Replace the installation package versions and folder paths throughout this topic with your actual values.
Configure Apache Hadoop
Decompress the downloaded package to the specified directory.
tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
Modify the hadoop-env.sh configuration file.
Run the following command to open the hadoop-env.sh configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
Configure JAVA_HOME:
export JAVA_HOME=${JDK installation directory}
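For example, if the JDK is installed in /usr/lib/jvm/java-1.8.0-openjdk (a hypothetical path; substitute your actual installation directory), the line would be:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk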
Modify the core-site.xml file.
Run the following command to open the core-site.xml file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml
Modify the core-site.xml file, as shown in the following example. Replace ${Instance ID} with your actual instance ID.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://${Instance ID}</value>
    </property>
</configuration>
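For example, with the instance ID ld-uf630q8031846lsxm that appears in the verification steps later in this topic, the value would be hdfs://ld-uf630q8031846lsxm.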
Modify the mapred-site.xml configuration file.
Run the following command to open the mapred-site.xml file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml
Modify the mapred-site.xml file, as shown in the following example:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Modify the yarn-site.xml configuration file.
Run the following command to open the yarn-site.xml configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml
Add the following configuration to the yarn-site.xml file:
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>xxxx</value> <!-- Enter the hostname of the ResourceManager of Apache Hadoop YARN in your cluster. -->
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>16384</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>4</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>3584</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>14336</value> <!-- Configure this item based on the capabilities of your cluster. -->
    </property>
</configuration>
Modify the slaves configuration file.
Run the following command to open the slaves configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/slaves
Modify the slaves configuration file, as shown in the following example. In this example, the Apache Hadoop cluster contains two nodes: node1 and node2.
node1
node2
Note: node1 and node2 are the hostnames of the machines on which the Apache Hadoop cluster nodes are deployed.
Configure environment variables.
Run the following command to open the /etc/profile configuration file:
vim /etc/profile
Add the following information to the end of the /etc/profile configuration file:
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.3/etc/hadoop:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/common/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.3/contrib/capacity-scheduler/*.jar
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
Run the following command to make the configuration take effect:
source /etc/profile
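To confirm that the variables took effect, you can, for example, print HADOOP_HOME and run a standard Hadoop command:
echo $HADOOP_HOME
$HADOOP_HOME/bin/hadoop version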
Run the following command to synchronize the folder specified by ${HADOOP_HOME} to the other nodes in the cluster:
scp -r hadoop-2.7.3/ testuser@node2:/usr/local/
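The environment variables configured in /etc/profile must also be available on the other nodes. If they are not already configured there, one option is to copy the file over as well; this assumes that the account you use has permission to write to /etc on the target node:
scp /etc/profile testuser@node2:/etc/
New login sessions on node2 then pick up the variables automatically.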
Verify Apache Hadoop configuration
After Apache Hadoop is configured, do not format the NameNode or use the start-dfs.sh script to start services related to Hadoop Distributed File System (HDFS). If you need to use the Apache Hadoop YARN service, you only need to start this service on your ResourceManager node. For more information about how to check whether Apache Hadoop is configured as expected, see Use open source HDFS clients to access LindormDFS.
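As a quick smoke test, assuming fs.defaultFS is configured as described above, listing the root directory of your LindormDFS instance should succeed without starting any local HDFS services:
/usr/local/hadoop-2.7.3/bin/hadoop fs -ls /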
Configure Apache Flink
Run the following command to decompress the Apache Flink package to the specified directory:
tar -zxvf flink-1.9.0-bin-scala_2.11.tgz -C /usr/local/
Before you use Apache Flink, you must configure HADOOP_HOME, HADOOP_CLASSPATH, and HADOOP_CONF_DIR in your cluster environment variables. For more information, see Step 7 "Configure environment variables" in Configure Apache Hadoop. For more information about how to configure Apache Flink, see Configuration in the Apache Flink documentation.
Verify Apache Flink configuration
Use the WordCount.jar JAR file provided by Apache Flink to read data from LindormDFS and write the computing results back to LindormDFS. Before you verify the Apache Flink configuration, start the YARN service, as shown in the following example.
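For example, on the ResourceManager node, the standard Apache Hadoop scripts can be used to start YARN and confirm that the NodeManagers are registered:
/usr/local/hadoop-2.7.3/sbin/start-yarn.sh
/usr/local/hadoop-2.7.3/bin/yarn node -list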
Generate test data.
RandomTextWriter in the hadoop-mapreduce-examples-2.7.3.jar JAR file provided by Apache Hadoop 2.7.3 is used to generate test data in LindormDFS.
/usr/local/hadoop-2.7.3/bin/hadoop jar /usr/local/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar randomtextwriter \
-D mapreduce.randomtextwriter.totalbytes=10240 \
-D mapreduce.randomtextwriter.bytespermap=1024 \
-D mapreduce.job.maps=4 \
-D mapreduce.job.reduces=2 \
/flink-test/input
View the test data that is generated in LindormDFS.
/usr/local/hadoop-2.7.3/bin/hadoop fs -cat /flink-test/input/*
Submit a WordCount Flink job.
/usr/local/flink-1.9.0/bin/flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 \
/usr/local/flink-1.9.0/examples/batch/WordCount.jar \
--input hdfs://ld-uf630q8031846lsxm/flink-test/input \
--output hdfs://ld-uf630q8031846lsxm/flink-test/output
In this example, ld-uf630q8031846lsxm is the instance ID. Replace it with your actual instance ID.
View the result file that is generated in LindormDFS.
/usr/local/hadoop-2.7.3/bin/hadoop fs -cat /flink-test/output
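Note that if the job runs with a parallelism greater than 1, Flink may write the output as a directory that contains one file per parallel task. In that case, list the path first:
/usr/local/hadoop-2.7.3/bin/hadoop fs -ls /flink-test/output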