
Connect Apache Spark to LindormDFS

Last Updated: Jul 09, 2021

This topic describes how to configure and connect Apache Spark to LindormDFS of ApsaraDB for Lindorm (Lindorm).

Configure a runtime environment

  1. Activate LindormDFS. For more information, see Activate LindormDFS.

  2. Install Java Development Kit (JDK) 1.8 or later on compute nodes.

  3. Install Scala on compute nodes.

    Download Scala from the official website. The version of Scala must be compatible with the Apache Spark version that you use.

  4. Download Apache Hadoop.

    Download Apache Hadoop from the official website. We recommend that you download Apache Hadoop 2.7.3 or later. In this topic, Apache Hadoop 2.7.3 is used.
  5. Download Apache Spark.

    Download Apache Spark from the official website. The version of Apache Spark must be compatible with the Apache Hadoop version that you use. In this topic, Apache Spark 2.4.3 is used.

Note

The versions of the installation packages and the file paths used in this topic are for your reference. Replace the versions and paths with the actual values.
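As a quick sanity check before you continue, the version requirements above can be verified with a small shell helper. The `version_ge` function below is an illustrative sketch, not part of the product; in practice you would feed it the versions reported by commands such as `java -version` or `hadoop version`.

```shell
# Illustrative helper: check whether a dotted version string meets a minimum.
version_ge() {
  # True (exit 0) when $1 >= $2, comparing dot-separated fields.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example checks against the versions used in this topic.
version_ge "2.7.3" "2.7.3" && echo "Hadoop version ok"
version_ge "1.9" "1.8" && echo "JDK version ok"
```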

Configure Apache Hadoop

  1. Decompress the installation package of Apache Hadoop to a specified directory.

    tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
  2. Modify the configuration file hadoop-env.sh.

    1. Run the following command to open the configuration file hadoop-env.sh:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
    2. Configure JAVA_HOME.

      export JAVA_HOME=${JDK installation directory}
  3. Modify the file core-site.xml.

    1. Run the following command to open the file core-site.xml:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml
    2. Add the following configuration to the core-site.xml file. Replace ${Instance ID} with the ID of your Lindorm instance.

      <configuration>
        <property>
           <name>fs.defaultFS</name>
           <value>hdfs://${Instance ID}</value>
        </property>
      </configuration>
  4. Modify the configuration file mapred-site.xml.

    1. Run the following command to open the configuration file mapred-site.xml:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml
    2. Add the following configuration to the file mapred-site.xml:

      <configuration>
      <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
      </property>
      </configuration>
    3. Modify the configuration file yarn-site.xml.

      1. Run the following command to open the configuration file yarn-site.xml:

        vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml
      2. Add the following configuration to the file yarn-site.xml:

        <configuration>
        <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>xxxx</value>
          <!-- Specify the name of the host that runs ResourceManager of YARN. -->
        </property>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>spark_shuffle,mapreduce_shuffle</value>
            <!-- If you do not need to run Apache Spark on YARN, set the value to mapreduce_shuffle. -->
        </property>
        <property>
            <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
            <value>org.apache.spark.network.yarn.YarnShuffleService</value>
            <!-- If you do not need to run Apache Spark on YARN, this parameter is not required. -->
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
          <name>yarn.nodemanager.vmem-pmem-ratio</name>
          <value>2.1</value>
        </property>
        <property>
          <name>yarn.nodemanager.pmem-check-enabled</name>
          <value>false</value>
        </property>
        <property>
          <name>yarn.nodemanager.vmem-check-enabled</name>
          <value>false</value>
        </property>
        <property>
          <name>yarn.nodemanager.resource.memory-mb</name>
          <value>16384</value>
          <!-- Configure this parameter based on the specifications of the Lindorm cluster. -->
        </property>
        <property>
          <name>yarn.nodemanager.resource.cpu-vcores</name>
          <value>4</value>
          <!-- Configure this parameter based on the specifications of the Lindorm cluster. -->
        </property>
        <property>
          <name>yarn.scheduler.maximum-allocation-vcores</name>
          <value>4</value>
          <!-- Configure this parameter based on the specifications of the Lindorm cluster. -->
        </property>
        <property>
          <name>yarn.scheduler.minimum-allocation-mb</name>
          <value>3584</value>
          <!-- Configure this parameter based on the specifications of the Lindorm cluster. -->
        </property>
        <property>
          <name>yarn.scheduler.maximum-allocation-mb</name>
          <value>14336</value>
          <!-- Configure this parameter based on the specifications of the Lindorm cluster. -->
        </property>
        </configuration>
    4. Modify the configuration file slaves.

      1. Run the following command to open the configuration file slaves:

         vim /usr/local/hadoop-2.7.3/etc/hadoop/slaves 
      2. Add the following configuration to the slaves file. In this example, two nodes are deployed in the Spark cluster: node1 and node2.

        node1
        node2

        In the configuration, node1 and node2 are the hostnames of the nodes in the Spark cluster. Replace them with your actual node names.

    5. Configure environment variables.

      1. Run the following command to open the configuration file /etc/profile:

        vim /etc/profile
      2. Append the following content to the configuration file /etc/profile:

        export HADOOP_HOME=/usr/local/hadoop-2.7.3
        export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.3/etc/hadoop:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/common/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.3/contrib/capacity-scheduler/*.jar
        export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
      3. Run the following command to validate the configuration:

        source /etc/profile
  5. Run the following command to synchronize the folder ${HADOOP_HOME} to other nodes in the cluster:

    scp -r /usr/local/hadoop-2.7.3/ root@node2:/usr/local/
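The configuration in step 3 can also be scripted. The following sketch writes a core-site.xml for a given instance ID; `INSTANCE_ID` is a placeholder to replace with your actual instance ID, and the temporary directory stands in for /usr/local/hadoop-2.7.3/etc/hadoop, which you would use in practice.

```shell
# Sketch: generate core-site.xml for a Lindorm instance.
# INSTANCE_ID is a placeholder; replace it with your actual instance ID.
INSTANCE_ID="ld-xxxxxxxx"
CONF_DIR="$(mktemp -d)"   # use /usr/local/hadoop-2.7.3/etc/hadoop in practice

cat > "${CONF_DIR}/core-site.xml" <<EOF
<configuration>
  <property>
     <name>fs.defaultFS</name>
     <value>hdfs://${INSTANCE_ID}</value>
  </property>
</configuration>
EOF

grep defaultFS "${CONF_DIR}/core-site.xml"
```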

Verify the Apache Hadoop configuration

After you configure Apache Hadoop, you do not need to format the NameNode or run the start-dfs.sh command to start the HDFS daemons. For more information about how to verify the Apache Hadoop configuration, see Use the open source HDFS client to connect to Lindorm.

If you need to use YARN, run the following command on the ResourceManager node to start YARN:

/usr/local/hadoop-2.7.3/sbin/start-yarn.sh

Configure Apache Spark

The following example describes how to configure Apache Spark on YARN. For more information, see the official documentation Running Spark on YARN.

  1. Run the following command to decompress the installation package to a specified directory:

    tar -zxvf spark-2.4.3-bin-hadoop2.7.tgz -C /usr/local/
  2. Modify the configuration file spark-env.sh.

    1. Run the following command to open the configuration file spark-env.sh:

       vim /usr/local/spark-2.4.3-bin-hadoop2.7/conf/spark-env.sh
    2. Configure the following settings in the configuration file spark-env.sh:

      export JAVA_HOME=${JDK installation directory}
      export SCALA_HOME=${Scala installation directory}
      export SPARK_CONF_DIR=/usr/local/spark-2.4.3-bin-hadoop2.7/conf
      export HADOOP_HOME=/usr/local/hadoop-2.7.3
      export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
  3. Copy the JAR file.

    Copy the spark-x.x.x-yarn-shuffle.jar file from the yarn directory of the Apache Spark installation directory to the yarn/lib directory on each node in the Apache Spark cluster.

Note

  • When you configure Apache Spark on YARN, you do not need to copy the Apache Spark directory to every node in the cluster. You only need to copy it to a node from which you can submit jobs to Apache Spark.
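The copy step above can be sketched as a small loop. The node names (node1, node2) and directory layout are assumptions taken from this topic's examples; adjust them to your cluster. The `echo` keeps the loop a dry run that only prints the commands.

```shell
# Sketch: distribute the YARN shuffle JAR to each node's yarn/lib directory.
# Node names and paths are assumptions; adjust them to your cluster layout.
SPARK_HOME=/usr/local/spark-2.4.3-bin-hadoop2.7
YARN_LIB=/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib
for node in node1 node2; do
  # echo keeps this a dry run; remove it to perform the actual copy.
  echo scp "${SPARK_HOME}"/yarn/spark-*-yarn-shuffle.jar "root@${node}:${YARN_LIB}/"
done
```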

Verify the Apache Spark configuration

Use Apache Spark to read a file from LindormDFS, run WordCount, and then print the result and write it back to LindormDFS.

  1. Run the following command to create a test file:

    echo -e "11,22\n11,22\n11,23\n11,22\n11,22" > /tmp/words

    Then, you can run the cat /tmp/words command to check whether the test file is created.

  2. Run the following commands to create folders on LindormDFS:

    /usr/local/hadoop-2.7.3/bin/hadoop fs -mkdir -p /test/input
    /usr/local/hadoop-2.7.3/bin/hadoop fs -mkdir -p /test/output
  3. Run the following command to upload the test file to a folder on LindormDFS:

    /usr/local/hadoop-2.7.3/bin/hadoop fs -put /tmp/words /test/input

    Then, you can run the /usr/local/hadoop-2.7.3/bin/hadoop fs -cat /test/input/words command to check whether the test file is uploaded.

  4. Run the following command to start spark-shell:

    /usr/local/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master yarn \
    --deploy-mode client \
    --driver-cores 1  \
    --driver-memory 1G \
    --executor-memory 1G \
    --num-executors 1
  5. Run the following command to run WordCount:

    scala> val res = sc.textFile("/test/input/words").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_)
    scala> res.collect.foreach(println)
    scala> res.saveAsTextFile("/test/output/res")
  6. View the result.

    /usr/local/hadoop-2.7.3/bin/hadoop fs -cat /test/output/res/part-00000
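To cross-check the Spark result, the same counts can be computed locally with standard shell tools. This sketch reproduces the test data created earlier and prints (word,count) pairs in the same format as the spark-shell output, so the pairs should match the contents of part-00000.

```shell
# Recompute the word counts for the test data with standard shell tools.
printf '11,22\n11,22\n11,23\n11,22\n11,22\n' \
  | tr ',' '\n' \
  | sort \
  | uniq -c \
  | awk '{printf "(%s,%s)\n", $2, $1}'
# Expected pairs: (11,5), (22,4), (23,1)
```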