This topic describes how to deploy Apache Spark and how to use it to access LindormDFS.
Configure a runtime environment
Activate LindormDFS. For more information, see Activate LindormDFS.
Install a Java Development Kit (JDK) on each compute node. The JDK version must be 1.8 or later.
Install Scala on each compute node.
Download Scala from the official website. The Scala version must be compatible with the Apache Spark version that you use.
Download the Apache Hadoop package.
Download the Apache Hadoop package from the official website. We recommend that you download Apache Hadoop 2.7.3 or later. Apache Hadoop 2.7.3 is used in this topic.
Download Apache Spark.
Download Apache Spark from the official website. The Apache Spark version must be compatible with the Apache Hadoop version that you use. Apache Spark 2.4.3 is used in this topic.
Replace the installation package versions and the folder paths provided in this topic with actual values.
Configure Apache Hadoop
Decompress the installation package of Apache Hadoop to a specified directory.
tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
Modify the hadoop-env.sh configuration file.
Run the following command to open the hadoop-env.sh configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
Configure JAVA_HOME:
export JAVA_HOME=${JDK installation directory}
Modify the core-site.xml file.
Run the following command to open the core-site.xml file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml
Modify the core-site.xml file based on the following example. Replace ${Instance ID} with the ID of your Lindorm instance.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${Instance ID}</value>
  </property>
</configuration>
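If you want to quickly confirm that fs.defaultFS points to your Lindorm instance, you can list the root directory of LindormDFS through the Hadoop FileSystem API. The following Scala sketch is only illustrative (the object name CheckLindormDFS is a placeholder); it assumes that the Hadoop client JARs are on the classpath and that core-site.xml can be found, for example through HADOOP_CONF_DIR:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: lists the root directory of LindormDFS.
object CheckLindormDFS {
  def main(args: Array[String]): Unit = {
    // Loads core-site.xml from the classpath, so fs.defaultFS resolves
    // to hdfs://${Instance ID}.
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    // Listing the root directory confirms that LindormDFS is reachable.
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
    fs.close()
  }
}
A command-line equivalent of this check is hadoop fs -ls /.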
Modify the mapred-site.xml configuration file.
Run the following command to open the mapred-site.xml file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml
Modify the mapred-site.xml file, as shown in the following example:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Modify the yarn-site.xml configuration file.
Run the following command to open the yarn-site.xml configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml
Add the following configuration to the yarn-site.xml file:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>xxxx</value> <!-- Enter the hostname that you want to use for the ResourceManager of Apache Hadoop YARN in your cluster. -->
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>spark_shuffle,mapreduce_shuffle</value> <!-- If you do not want to run Apache Spark on YARN, set the value to mapreduce_shuffle. -->
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value> <!-- If you do not want to run Apache Spark on YARN, you do not need to configure this parameter. -->
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value> <!-- Configure this item based on the capabilities of your cluster. -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value> <!-- Configure this item based on the capabilities of your cluster. -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value> <!-- Configure this item based on the capabilities of your cluster. -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>3584</value> <!-- Configure this item based on the capabilities of your cluster. -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>14336</value> <!-- Configure this item based on the capabilities of your cluster. -->
  </property>
</configuration>
Modify the slaves configuration file.
Run the following command to open the slaves configuration file:
vim /usr/local/hadoop-2.7.3/etc/hadoop/slaves
Modify the slaves configuration file, as shown in the following example. In this example, the Apache Spark cluster contains two nodes named node1 and node2.
node1
node2
Note: node1 and node2 are the names of the machines on which the Apache Spark cluster nodes are deployed.
Configure environment variables.
Run the following command to open the /etc/profile configuration file:
vim /etc/profile
Append the following lines to the /etc/profile configuration file:
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.3/etc/hadoop:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/common/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.3/contrib/capacity-scheduler/*.jar
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
Run the following command to make the configuration take effect:
source /etc/profile
Run the following command to synchronize the ${HADOOP_HOME} folder to the other nodes in the cluster. In this example, the folder is copied to node2:
scp -r hadoop-2.7.3/ testuser@node2:/usr/local/
Verify the Apache Hadoop configuration
After Apache Hadoop is configured, do not format the NameNode or use the start-dfs.sh script to start the services that are related to Hadoop Distributed File System (HDFS), because the file system service is provided by LindormDFS. If you want to use YARN, run the following command on the ResourceManager node to start YARN:
/usr/local/hadoop-2.7.3/sbin/start-yarn.sh
For information about how to verify the Apache Hadoop configuration, see Use the open source HDFS client to connect to Lindorm.
Configure Apache Spark
The following example shows how to configure Apache Spark on YARN.
Run the following command to decompress the installation package to a specified directory:
tar -zxvf spark-2.4.3-bin-hadoop2.7.tgz -C /usr/local/
Modify the spark-env.sh configuration file.
Run the following command to open the spark-env.sh configuration file:
vim /usr/local/spark-2.4.3-bin-hadoop2.7/conf/spark-env.sh
Configure the following settings in the spark-env.sh configuration file:
export JAVA_HOME=${JDK installation directory}
export SCALA_HOME=${Scala installation directory}
export SPARK_CONF_DIR=/usr/local/spark-2.4.3-bin-hadoop2.7/conf
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
Copy the JAR file.
Copy the spark-x.x.x-yarn-shuffle.jar file in the yarn directory of the Apache Spark installation directory to the yarn/lib directory of each node in the Apache Spark cluster.
When you configure Apache Spark on YARN, you do not need to copy the Apache Spark directory to each node in the cluster. You need to copy it only to the node from which you submit jobs to Apache Spark.
Verify the Apache Spark configuration
Use Apache Spark to read files from LindormDFS, run WordCount, print the result, and write the result to LindormDFS.
Run the following command to create a test file:
echo -e "11,22\n11,22\n11,23\n11,22\n11,22" > /tmp/wordsThen, run the
cat /tmp/wordscommand to check whether the test file is created.Run the following commands to create folders on LindormDFS:
/usr/local/hadoop-2.7.3/bin/hadoop fs -mkdir -p /test/input /usr/local/hadoop-2.7.3/bin/hadoop fs -mkdir -p /test/outputRun the following command to upload the test file to a folder on LindormDFS:
/usr/local/hadoop-2.7.3/bin/hadoop fs -put /tmp/words /test/inputThen, run the
/usr/local/hadoop-2.7.3/bin/hadoop fs -cat /test/input/wordscommand to check whether the test file is uploaded.Run the following command to start spark-shell:
/usr/local/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master yarn \ --deploy-mode client \ --driver-cores 1 \ --driver-memory 1G \ --executor-memory 1G \ --num-executors 1 \Run the following command to run WordCount:
scala> val res = sc.textFile("/test/input/words").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_) scala> res.collect.foreach(println) scala> res.saveAsTextFile("/test/output/res")View the result.
/usr/local/hadoop-2.7.3/bin/hadoop fs -cat /test/output/res/part-00000
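The spark-shell session above can also be packaged as a standalone application and submitted with spark-submit. The following Scala sketch implements the same WordCount logic; the object name and the output path /test/output/res2 are placeholders, and unqualified paths such as /test/input/words resolve against the fs.defaultFS value (your Lindorm instance) that is configured in core-site.xml:
import org.apache.spark.sql.SparkSession

// Hypothetical standalone version of the WordCount job shown above.
object WordCount {
  def main(args: Array[String]): Unit = {
    // The master is supplied by spark-submit (for example, --master yarn),
    // so it is not hard-coded here.
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    // Same pipeline as in spark-shell: split each line on commas, map each
    // token to a count of 1, and sum the counts per token.
    val res = sc.textFile("/test/input/words")
      .flatMap(_.split(","))
      .map((_, 1))
      .reduceByKey(_ + _)

    res.collect().foreach(println)
    // saveAsTextFile fails if the target directory already exists, so a
    // fresh path is used here.
    res.saveAsTextFile("/test/output/res2")
    spark.stop()
  }
}
After you package the application into a JAR, you can submit it with resource settings similar to those used for spark-shell, for example: /usr/local/spark-2.4.3-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode client --class WordCount wordcount.jar (the JAR name is a placeholder).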