
Lindorm: Use Apache Spark to connect to LindormDFS

Last Updated: Aug 30, 2024

This topic describes how to deploy Apache Spark and how to use it to access LindormDFS.

Configure a runtime environment

  • Activate LindormDFS. For more information, see Activate LindormDFS.

  • Install a Java Development Kit (JDK) on each compute node. The JDK version must be 1.8 or later.

  • Install Scala on compute nodes.

    Download Scala from the official website. The Scala version must be compatible with the Apache Spark version.

  • Download the Apache Hadoop package.

    Download the Apache Hadoop package from the official website. We recommend that you download Apache Hadoop version 2.7.3 or later. Apache Hadoop version 2.7.3 is used in this topic.

  • Download the Apache Spark package.

    Download Apache Spark from the official website. The Apache Spark version must be compatible with the Apache Hadoop version that you use. Apache Spark 2.4.3 is used in this topic. Example download commands are provided after the following note.

Note

Replace the installation package versions and the folder paths provided in this topic with actual values.
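For reference, the following commands show one way to obtain the packages that are used in this topic and to confirm the JDK and Scala versions. The archive.apache.org URLs are example download sources; replace them with the sources and versions that apply to your environment.

java -version     # The reported version must be 1.8 or later.
scala -version    # The version must be compatible with your Apache Spark version.
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz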

Configure Apache Hadoop

  1. Decompress the installation package of Apache Hadoop to a specified directory.

    tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/
  2. Modify the hadoop-env.sh configuration file.

    1. Run the following command to open the hadoop-env.sh configuration file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
    2. Configure JAVA_HOME.

      export JAVA_HOME=${JDK installation directory}
  3. Modify the core-site.xml file.

    1. Run the following command to open the core-site.xml file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml
    2. Modify the core-site.xml file based on the following example. Replace ${Instance ID} with the ID of your Lindorm instance.

      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://${Instance ID}</value>
        </property>
      </configuration>
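
      For example, if the ID of your Lindorm instance were ld-bp1xxxxxxxxxxxx (a hypothetical value), the value of fs.defaultFS would be hdfs://ld-bp1xxxxxxxxxxxx.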
  4. Modify the mapred-site.xml configuration file.

    1. Run the following command to open the mapred-site.xml file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml
    2. Modify the mapred-site.xml file, as shown in the following example:

      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
      </configuration>
  5. Modify the yarn-site.xml configuration file.

    1. Run the following command to open the yarn-site.xml configuration file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml
    2. Add the following configuration to the yarn-site.xml file:

      <configuration>
        <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>xxxx</value>
          <!-- Enter the hostname that you want to use for the ResourceManager of Apache Hadoop YARN in your cluster. -->
        </property>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>spark_shuffle,mapreduce_shuffle</value>
          <!-- If you do not want to run Apache Spark on YARN, set the value to mapreduce_shuffle. -->
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
          <value>org.apache.spark.network.yarn.YarnShuffleService</value>
          <!-- If you do not want to run Apache Spark on YARN, you do not need to configure this parameter. -->
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
          <name>yarn.nodemanager.vmem-pmem-ratio</name>
          <value>2.1</value>
        </property>
        <property>
          <name>yarn.nodemanager.pmem-check-enabled</name>
          <value>false</value>
        </property>
        <property>
          <name>yarn.nodemanager.vmem-check-enabled</name>
          <value>false</value>
        </property>
        <property>
          <name>yarn.nodemanager.resource.memory-mb</name>
          <value>16384</value>
          <!-- Configure this item based on the capabilities of your cluster. -->
        </property>
        <property>
          <name>yarn.nodemanager.resource.cpu-vcores</name>
          <value>4</value>
          <!-- Configure this item based on the capabilities of your cluster. -->
        </property>
        <property>
          <name>yarn.scheduler.maximum-allocation-vcores</name>
          <value>4</value>
          <!-- Configure this item based on the capabilities of your cluster. -->
        </property>
        <property>
          <name>yarn.scheduler.minimum-allocation-mb</name>
          <value>3584</value>
          <!-- Configure this item based on the capabilities of your cluster. -->
        </property>
        <property>
          <name>yarn.scheduler.maximum-allocation-mb</name>
          <value>14336</value>
          <!-- Configure this item based on the capabilities of your cluster. -->
        </property>
      </configuration>
  6. Modify the slaves configuration file.

    1. Run the following command to open the slaves configuration file:

      vim /usr/local/hadoop-2.7.3/etc/hadoop/slaves
    2. Modify the slaves configuration file, as shown in the following example. In this example, the Apache Spark cluster contains two nodes named node1 and node2.

      node1
      node2

      Note

      node1 and node2 are the hostnames of the machines on which the Apache Spark cluster nodes are deployed.

  7. Configure environment variables.

    1. Run the following command to open the /etc/profile configuration file:

      vim /etc/profile
    2. Append the following lines to the /etc/profile configuration file:

      export HADOOP_HOME=/usr/local/hadoop-2.7.3
      export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.3/etc/hadoop:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/common/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/yarn/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.3/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.3/contrib/capacity-scheduler/*.jar
      export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
    3. Run the following command to make the configuration take effect:

      source /etc/profile
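
    To confirm that the variables are set, you can run a quick check such as the following:

      echo $HADOOP_HOME
      $HADOOP_HOME/bin/hadoop version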
  8. Run the following command to synchronize the ${HADOOP_HOME} directory to the other nodes in the cluster:

    scp -r /usr/local/hadoop-2.7.3/ testuser@node2:/usr/local/

Verify the Apache Hadoop configuration

After Apache Hadoop is configured, do not format the NameNode or use the start-dfs.sh script to start the services that are related to Hadoop Distributed File System (HDFS), because the file system is provided by LindormDFS. For information about how to verify the Apache Hadoop configuration, see Use the open source HDFS client to connect to Lindorm. If you want to use YARN, run the following command on the ResourceManager node to start it:

/usr/local/hadoop-2.7.3/sbin/start-yarn.sh
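
After YARN starts, you can verify that the related processes are running. For example:

jps
/usr/local/hadoop-2.7.3/bin/yarn node -list

The jps output should include a ResourceManager process on the ResourceManager node and a NodeManager process on each worker node, and yarn node -list should show the nodes of the cluster.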

Configure Apache Spark

The following example shows how to configure Apache Spark on YARN.

  1. Run the following command to decompress the installation package to a specified directory:

    tar -zxvf spark-2.4.3-bin-hadoop2.7.tgz -C /usr/local/
  2. Modify the spark-env.sh configuration file.

    1. Run the following command to open the spark-env.sh configuration file:

      vim /usr/local/spark-2.4.3-bin-hadoop2.7/conf/spark-env.sh
    2. Configure the following settings in the spark-env.sh configuration file:

      export JAVA_HOME=${JDK installation directory}
      export SCALA_HOME=${Scala installation directory}
      export SPARK_CONF_DIR=/usr/local/spark-2.4.3-bin-hadoop2.7/conf
      export HADOOP_HOME=/usr/local/hadoop-2.7.3
      export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
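
    To confirm that Apache Spark starts with this environment, you can print its version information:

      /usr/local/spark-2.4.3-bin-hadoop2.7/bin/spark-submit --version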
  3. Copy the JAR file.

    Copy the spark-x.x.x-yarn-shuffle.jar file from the yarn directory of the Apache Spark installation directory to the yarn/lib directory of Apache Hadoop on each node that runs a NodeManager, as shown in the example after the following note.

Note

When you run Apache Spark on YARN, you do not need to copy the Apache Spark installation directory to every node in the cluster. The directory is required only on the node from which you submit jobs to Apache Spark.
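
For example, with the directories that are used in this topic, the copy commands might look like the following. The exact JAR file name depends on your Apache Spark version, and testuser@node2 follows the scp example that is used earlier in this topic.

cp /usr/local/spark-2.4.3-bin-hadoop2.7/yarn/spark-2.4.3-yarn-shuffle.jar /usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/
scp /usr/local/spark-2.4.3-bin-hadoop2.7/yarn/spark-2.4.3-yarn-shuffle.jar testuser@node2:/usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/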

Verify the Apache Spark configuration

Use Apache Spark to read files from LindormDFS, run WordCount, print the result, and write the result to LindormDFS.

  1. Run the following command to create a test file:

    echo -e "11,22\n11,22\n11,23\n11,22\n11,22" > /tmp/words

    Then, run the cat /tmp/words command to check whether the test file is created.

  2. Run the following commands to create folders on LindormDFS:

    /usr/local/hadoop-2.7.3/bin/hadoop fs -mkdir -p /test/input
    /usr/local/hadoop-2.7.3/bin/hadoop fs -mkdir -p /test/output
  3. Run the following command to upload the test file to a folder on LindormDFS:

    /usr/local/hadoop-2.7.3/bin/hadoop fs -put /tmp/words /test/input

    Then, run the /usr/local/hadoop-2.7.3/bin/hadoop fs -cat /test/input/words command to check whether the test file is uploaded.

  4. Run the following command to start spark-shell:

    /usr/local/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master yarn \
    --deploy-mode client \
    --driver-cores 1 \
    --driver-memory 1G \
    --executor-memory 1G \
    --num-executors 1
  5. Run the following commands to run WordCount:

    scala> val res = sc.textFile("/test/input/words").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_)
    scala> res.collect.foreach(println)
    scala> res.saveAsTextFile("/test/output/res")
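
    Based on the content of the test file, the printed result should contain the following word counts. The order of the tuples may vary:

    (11,5)
    (22,4)
    (23,1)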
  6. View the result.

    /usr/local/hadoop-2.7.3/bin/hadoop fs -cat /test/output/res/part-00000
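
    The file should contain the same word counts that were printed by the previous step.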