This topic describes how to configure and connect Apache Spark to LindormDFS of ApsaraDB for Lindorm (Lindorm).
Configure a runtime environment
Activate LindormDFS. For more information, see Activate LindormDFS.
Install Java Development Kit (JDK) 1.8 or later on compute nodes.
Install Scala on compute nodes. Download Scala from the official website. The Scala version must be compatible with the Apache Spark version that you use.
Download Apache Hadoop from the official website. We recommend that you download Apache Hadoop 2.7.3 or later. In this topic, Apache Hadoop 2.7.3 is used.
Download Apache Spark from the official website. The Apache Spark version must be compatible with the Apache Hadoop version that you use. In this topic, Apache Spark 2.4.3 is used.
The versions of the installation packages and the file paths in this topic are for reference only. Replace them with the actual versions and paths in your environment.
Configure Apache Hadoop
Decompress the installation package of Apache Hadoop to a specified directory.
tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
Modify the configuration file hadoop-env.sh.
Run the following command to open the configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
Configure JAVA_HOME:
export JAVA_HOME=${JDK installation directory}
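For example, if the JDK is installed in /usr/java/jdk1.8.0_191 (a hypothetical path; use your actual installation directory), the line would be:
export JAVA_HOME=/usr/java/jdk1.8.0_191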
Modify the file core-site.xml.
Run the following command to open the file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml
Add the following configuration to the core-site.xml file. Replace ${Instance ID} with your actual instance ID.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://${Instance ID}</value>
    </property>
</configuration>
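For example, with a hypothetical instance ID of ld-uf6k8yqb741t3**** (replace it with your own), the property would look like this:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://ld-uf6k8yqb741t3****</value>
</property>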
Modify the configuration file mapred-site.xml. If the file does not exist, create it from the mapred-site.xml.template file that ships with Apache Hadoop.
Run the following command to open the configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml
Add the following configuration to the mapred-site.xml file:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Modify the configuration file yarn-site.xml.
Run the following command to open the configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml
Add the following configuration to the yarn-site.xml file:
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>xxxx</value> <!-- Specify the hostname of the node that runs the YARN ResourceManager. -->
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>spark_shuffle,mapreduce_shuffle</value> <!-- If you do not need to run Apache Spark on YARN, set the value to mapreduce_shuffle. -->
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
        <value>org.apache.spark.network.yarn.YarnShuffleService</value> <!-- If you do not need to run Apache Spark on YARN, this property is not required. -->
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>16384</value> <!-- Configure this parameter based on the specifications of your compute nodes. -->
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value> <!-- Configure this parameter based on the specifications of your compute nodes. -->
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>4</value> <!-- Configure this parameter based on the specifications of your compute nodes. -->
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>3584</value> <!-- Configure this parameter based on the specifications of your compute nodes. -->
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>14336</value> <!-- Configure this parameter based on the specifications of your compute nodes. -->
    </property>
</configuration>
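As an illustration of how the example values fit together: each NodeManager advertises 16384 MB of memory and 4 vcores, so YARN can schedule at most four minimum-size (3584 MB) containers per node, or one maximum-size (14336 MB) container. Scale these numbers to the actual specifications of your nodes.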
Modify the configuration file slaves.
Run the following command to open the configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/slaves
Add the following configuration to the slaves file. In this example, the Spark cluster contains two nodes: node1 and node2.
node1
node2
In the configuration, node1 and node2 are the hostnames of the nodes in the Spark cluster.
Configure environment variables.
Run the following command to open the configuration file /etc/profile:
vim /etc/profile
Append the following content to the configuration file /etc/profile:
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.3/etc/hadoop:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/common/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.3/contrib/capacity-scheduler/*.jar
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
Run the following command to make the configuration take effect:
source /etc/profile
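To confirm that the environment variables are in effect, you can run a quick check. The first command prints the path you configured, and the second prints the Hadoop version banner:
echo $HADOOP_HOME
/usr/local/hadoop-2.7.3/bin/hadoop version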
Run the following command to synchronize the ${HADOOP_HOME} folder to the other nodes in the cluster:
scp -r /usr/local/hadoop-2.7.3/ root@node2:/usr/local/
Verify the Apache Hadoop configuration
After you configure Apache Hadoop, you do not need to format the NameNode or run the start-dfs.sh command to start HDFS daemons, because LindormDFS is a fully managed service. If you need to use YARN, run the following command on the ResourceManager node to start it. For more information about how to verify the Apache Hadoop configuration, see Use the open source HDFS client to connect to Lindorm.
/usr/local/hadoop-2.7.3/sbin/start-yarn.sh
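After YARN starts, you can confirm that the NodeManagers have registered with the ResourceManager. The listed node names and counts depend on your cluster:
/usr/local/hadoop-2.7.3/bin/yarn node -list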
Configure Apache Spark
The following example describes how to configure Apache Spark on YARN. For more information, see Run Apache Spark on YARN in the official documentation.
Run the following command to decompress the installation package to a specified directory:
tar -zxvf spark-2.4.3-bin-hadoop2.7.tgz -C /usr/local/
Modify the configuration file spark-env.sh.
Run the following command to open the configuration file:
vim /usr/local/spark-2.4.3-bin-hadoop2.7/conf/spark-env.sh
Configure the following settings in the configuration file spark-env.sh:
export JAVA_HOME=${JDK installation directory}
export SCALA_HOME=${Scala installation directory}
export SPARK_CONF_DIR=/usr/local/spark-2.4.3-bin-hadoop2.7/conf
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
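As a point of reference, Apache Spark 2.4.3 is built against Scala 2.11 by default, so the two installation directories might look like this (hypothetical paths; replace them with your own):
export JAVA_HOME=/usr/java/jdk1.8.0_191
export SCALA_HOME=/usr/local/scala-2.11.12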
Copy the JAR file.
Copy the spark-x.x.x-yarn-shuffle.jar file from the yarn directory of the Apache Spark installation directory to the yarn/lib directory on each node of the YARN cluster. See the example command after this note.
When you configure Apache Spark on YARN, you do not need to copy the Apache Spark installation directory to each node in the cluster. You need only to copy it to a node from which you submit jobs to Apache Spark.
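For example, with the versions used in this topic and the default layout of the two distributions (an assumption; adjust the paths to your environment), the copy command on each NodeManager node might look like this:
cp /usr/local/spark-2.4.3-bin-hadoop2.7/yarn/spark-2.4.3-yarn-shuffle.jar /usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/
Restart the NodeManagers after you copy the file so that the spark_shuffle auxiliary service is loaded. If you later enable dynamic allocation, you would also set spark.shuffle.service.enabled to true in spark-defaults.conf; this is not required for the WordCount test below.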
Verify the Apache Spark configuration
Use Apache Spark to read a file from LindormDFS, run WordCount, and then print the result and write it back to LindormDFS.
Run the following command to create a test file:
echo -e "11,22\n11,22\n11,23\n11,22\n11,22" > /tmp/words
Then, you can run the cat /tmp/words command to check whether the test file is created.
Run the following commands to create folders on LindormDFS:
/usr/local/hadoop-2.7.3/bin/hadoop fs -mkdir -p /test/input
/usr/local/hadoop-2.7.3/bin/hadoop fs -mkdir -p /test/output
Run the following command to upload the test file to a folder on LindormDFS:
/usr/local/hadoop-2.7.3/bin/hadoop fs -put /tmp/words /test/input
Then, you can run the /usr/local/hadoop-2.7.3/bin/hadoop fs -cat /test/input/words command to check whether the test file is uploaded.
Run the following command to start spark-shell:
/usr/local/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master yarn \
--deploy-mode client \
--driver-cores 1 \
--driver-memory 1G \
--executor-memory 1G \
--num-executors 1
Run the following commands to run WordCount:
scala> val res = sc.textFile("/test/input/words").flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
scala> res.collect.foreach(println)
scala> res.saveAsTextFile("/test/output/res")
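Because the test file contains the token 11 five times, 22 four times, and 23 once, the printed result should look similar to the following (the order of the tuples may vary):
(11,5)
(22,4)
(23,1)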
Run the following command to view the result:
/usr/local/hadoop-2.7.3/bin/hadoop fs -cat /test/output/res/part-00000