This white paper presents a comparative performance analysis of TeraSort benchmark tests run on self-managed Apache Hadoop+Spark clusters and on Spark clusters in Data Lake Analytics (DLA) across three scenarios. The test helps you evaluate the cost performance of the serverless Spark engine of DLA.

Background information

This topic describes the test methods for the three scenarios. For scenario descriptions and configuration requirements, see Test environment.

Preparations

  1. Download the JAR package required for performing TeraSort benchmark tests.

    When you use DLA Spark clusters to perform TeraSort benchmark tests, you must download the JAR package provided by DLA. This package contains Spark applications that can be used to generate test data and perform tests. You can visit GitHub to download the source code of the JAR package.

  2. Upload the JAR package to Object Storage Service (OSS).

    In subsequent test procedures, DLA Spark clusters need to use this JAR package to generate test data and perform tests.
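    If you prefer the command line to the OSS console, the upload can also be done with the ossutil CLI. The following is a minimal sketch; both paths are placeholders to replace with your own local file and bucket:

    ```shell
    # Sketch: upload the benchmark JAR to OSS with the ossutil CLI.
    # Both paths are placeholders -- substitute your own local file and bucket.
    JAR_LOCAL="./dla-spark-perf.jar"
    JAR_OSS="oss://test/performance/dla-spark-perf.jar"

    # Build the command first so it can be inspected before it is run.
    UPLOAD_CMD="ossutil cp -f $JAR_LOCAL $JAR_OSS"

    if command -v ossutil >/dev/null 2>&1; then
        $UPLOAD_CMD
    else
        echo "ossutil not found; would run: $UPLOAD_CMD"
    fi
    ```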

Procedure

Scenario 1: Performance comparison between DLA Spark clusters+OSS and self-managed Apache Hadoop+Spark clusters (1 TB data)
  1. Prepare test data.
    • Generate the test data on OSS.
      Log on to the DLA console. In the left-side navigation pane, choose Serverless Spark > Submit job to submit a Spark job for generating the 1 TB test data. Sample code:
      {
          "args": [
              "1000g",
              "oss://<bucket-name>/<Directory for saving the test data>",  # The OSS directory for saving the test data, for example, oss://test-bucket/terasort/input/1T.
              "true"
          ],
          "file": "<OSS directory for saving the JAR package>",  # The OSS directory for saving the JAR package, for example, oss://test/performance/dla-spark-perf.jar.
          "name": "TeraGen-1T",
          "className": "com.aliyun.dla.perf.terasort.TeraGen",
          "conf": {
              "spark.dla.connectors": "oss",
              "spark.hadoop.job.oss.fileoutputcommitter.enable": "true",
              "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2,
              "spark.driver.resourceSpec": "medium",
              "spark.executor.resourceSpec": "medium",
              "spark.default.parallelism": "2000",
              "spark.executor.memoryOverhead": 2000,
              "spark.executor.instances": 19
          }
      }
    • Generate the test data on the self-managed Apache Hadoop cluster.
      Run the spark-submit command to submit the Spark application for generating the test data to the self-managed Apache Spark cluster. Sample code:
      ./bin/spark-submit \
      --class com.aliyun.dla.perf.terasort.TeraGen \
      --executor-cores 2 \
      --executor-memory 6G \
      --num-executors 19 \
      --driver-memory 8G \
      --driver-cores 2 \
      --name terasort-sort-1000g \
      --conf yarn.nodemanager.local-dirs=/mnt/disk1/yarn (Local directory for saving data on a data disk) \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
      --conf spark.yarn.executor.memoryOverhead=2000 \
      --conf spark.default.parallelism=2000 \
      /dla-spark-perf.jar (Directory for saving the JAR package) 1000g hdfs://test/terasort/input/1T (HDFS path of the test data)
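    Note that the inline # annotations in the JSON samples in this topic are explanatory only; the job configuration you submit must be valid JSON with those annotations removed. If you keep the configurations in local files, a quick validation can catch stray comments or syntax slips before you paste a file into the console. A sketch using Python's standard-library JSON tool (the file name and the abridged configuration are illustrative):

    ```shell
    # Sketch: validate a DLA Spark job config locally before submitting it.
    # The file name and the abridged configuration below are illustrative only.
    cat > teragen-1t.json <<'EOF'
    {
        "args": ["1000g", "oss://test-bucket/terasort/input/1T", "true"],
        "file": "oss://test/performance/dla-spark-perf.jar",
        "name": "TeraGen-1T",
        "className": "com.aliyun.dla.perf.terasort.TeraGen",
        "conf": {
            "spark.dla.connectors": "oss",
            "spark.executor.instances": 19
        }
    }
    EOF

    # python3 -m json.tool exits nonzero with a parse error if the JSON is malformed.
    python3 -m json.tool teragen-1t.json >/dev/null && echo "JSON OK"
    ```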
  2. Run the test application.
    • Run the test application on the DLA Spark cluster.
      Log on to the DLA console. In the left-side navigation pane, choose Serverless Spark > Submit job to submit a Spark job for running the TeraSort benchmark test. Sample code:
      {
          "args": [
              "--input",
              "<OSS directory for saving the test data>", # Sample directory: oss://test-bucket/terasort/input/1T
              "--output",
              "<OSS directory for saving the output data that is generated by the test application>", # Sample directory: oss://test-bucket/terasort/output/1T
              "--optimized",
              "true",
              "--shuffle-part",
              "2000"
          ],
          "file": "<OSS directory for saving the JAR package>", # Sample directory: oss://test/performance/dla-spark-perf.jar
          "name": "Terasort-1T",
          "className": "com.aliyun.dla.perf.terasort.TeraSort",
          "conf": {
              "spark.dla.connectors": "oss",
              "spark.hadoop.job.oss.fileoutputcommitter.enable": "true",
              "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2,
              "spark.driver.resourceSpec": "medium",
              "spark.executor.resourceSpec": "medium",
              "spark.default.parallelism": "2000",
              "spark.executor.memoryOverhead": 2000,
              "spark.executor.instances": 19
          }
      }
    • Run the test application on the self-managed Apache Spark cluster.
      Run the spark-submit command to submit the Spark application for running the TeraSort benchmark test to the self-managed Apache Spark cluster.
      ./bin/spark-submit \
      --class com.aliyun.dla.perf.terasort.TeraSort \
      --driver-memory 8G \
      --driver-cores 2 \
      --executor-cores 2 \
      --executor-memory 6G \
      --num-executors 19 \
      --name terasort-sort-1000g \
      --conf yarn.nodemanager.local-dirs=/mnt/disk1/yarn \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
      --conf spark.default.parallelism=2000 \
      --conf spark.yarn.executor.memoryOverhead=2000 \
      /dla-spark-perf.jar (Directory for saving the JAR package) \
      --input hdfs://test/terasort/input/1T (Replace it with the HDFS path of the test data) --output hdfs://test/terasort/output/1t/ (Replace it with the HDFS path of the output data) --optimized false --shuffle-part 2000
  3. Record test results.

    Record the test durations.
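For the DLA side, job durations are visible in the console job list; for the self-managed cluster you can time the spark-submit invocation yourself. A minimal sketch (the CSV file name, the run label, and the placeholder command are all illustrative):

```shell
# Sketch: time a benchmark command and append "name,seconds" to a CSV file.
record_run() {
    name=$1; shift
    start=$(date +%s)
    "$@"                     # substitute the spark-submit call from step 2 here
    end=$(date +%s)
    echo "$name,$((end - start))" >> results.csv
}

# Usage with a placeholder command instead of the real spark-submit call:
record_run terasort-1t-selfmanaged sleep 1
```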

Scenario 2: Performance comparison between DLA Spark clusters+OSS and self-managed Apache Hadoop+Spark clusters (10 TB data)
  1. Prepare test data.
    • Generate the test data on OSS.
      Log on to the DLA console. In the left-side navigation pane, choose Serverless Spark > Submit job to submit a Spark job for generating the 10 TB test data. Sample code:
      {
          "args": [
              "10000g",
              "oss://<bucket-name>/<Directory for saving the test data>",  # The OSS directory for saving the test data, for example, oss://test-bucket/terasort/input/10T.
              "true"
          ],
          "file": "<OSS directory for saving the JAR package>",  # The OSS directory for saving the JAR package, for example, oss://test/performance/dla-spark-perf.jar.
          "name": "TeraGen-10T",
          "className": "com.aliyun.dla.perf.terasort.TeraGen",
          "conf": {
              "spark.dla.connectors": "oss",
              "spark.hadoop.job.oss.fileoutputcommitter.enable": "true",
              "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2,
              "spark.driver.resourceSpec": "medium",
              "spark.executor.resourceSpec": "medium",
              "spark.default.parallelism": "20000",
              "spark.executor.memoryOverhead": 2000,
              "spark.executor.instances": 39
          }
      }
    • Generate the test data on the self-managed Apache Hadoop cluster.
      Run the spark-submit command to submit the Spark application for generating the test data to the self-managed Apache Spark cluster. Sample code:
      ./bin/spark-submit \
      --class com.aliyun.dla.perf.terasort.TeraGen \
      --executor-cores 2 \
      --executor-memory 6G \
      --num-executors 39 \
      --driver-memory 8G \
      --driver-cores 2 \
      --name terasort-sort-10000g \
      --conf yarn.nodemanager.local-dirs=/mnt/disk1/yarn (Local directory for saving data on a data disk) \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
      --conf spark.yarn.executor.memoryOverhead=2000 \
      --conf spark.default.parallelism=20000 \
      /dla-spark-perf.jar (Directory for saving the JAR package) 10000g hdfs://test/terasort/input/10T (HDFS path of the test data)
  2. Run the test application.
    • Run the test application on the DLA Spark cluster.
      Log on to the DLA console. In the left-side navigation pane, choose Serverless Spark > Submit job to submit a Spark job for running the TeraSort benchmark test. Sample code:
      {
          "args": [
              "--input",
              "<OSS directory for saving the test data>", # Sample directory: oss://test-bucket/terasort/input/10T
              "--output",
              "<OSS directory for saving the output data that is generated by the test application>", # Sample directory: oss://test-bucket/terasort/output/10T
              "--optimized",
              "true",
              "--shuffle-part",
              "20000"
          ],
          "file": "<OSS directory for saving the JAR package>", # Sample directory: oss://test/performance/dla-spark-perf.jar
          "name": "Terasort-10T",
          "className": "com.aliyun.dla.perf.terasort.TeraSort",
          "conf": {
              "spark.dla.connectors": "oss",
              "spark.hadoop.job.oss.fileoutputcommitter.enable": "true",
              "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2,
              "spark.driver.resourceSpec": "medium",
              "spark.executor.resourceSpec": "medium",
              "spark.default.parallelism": "20000",
              "spark.executor.memoryOverhead": 2000,
              "spark.executor.instances": 39
          }
      }
    • Run the test application on the self-managed Apache Spark cluster.
      Run the spark-submit command to submit the Spark application for running the TeraSort benchmark test to the self-managed Apache Spark cluster.
      ./bin/spark-submit \
      --class com.aliyun.dla.perf.terasort.TeraSort \
      --driver-memory 8G \
      --driver-cores 2 \
      --executor-cores 2 \
      --executor-memory 6G \
      --num-executors 39 \
      --name terasort-sort-10000g \
      --conf yarn.nodemanager.local-dirs=/mnt/disk1/yarn \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
      --conf spark.default.parallelism=20000 \
      --conf spark.yarn.executor.memoryOverhead=2000 \
      /dla-spark-perf.jar (Directory for saving the JAR package) \
      --input hdfs://test/terasort/input/10T (Replace it with the HDFS path of the test data) --output hdfs://test/terasort/output/10t/ (Replace it with the HDFS path of the output data) --optimized false --shuffle-part 20000
  3. Record test results.

    Record the test durations.

Scenario 3: Performance comparison between DLA Spark clusters+self-managed Apache Hadoop clusters and self-managed Apache Hadoop+Spark clusters (1 TB data)
  1. Prepare test data.
    Generate the test data on the self-managed Apache Hadoop cluster. Run the spark-submit command to submit the Spark application for generating the test data to the self-managed Apache Spark cluster. Sample code:
    ./bin/spark-submit \
    --class com.aliyun.dla.perf.terasort.TeraGen \
    --executor-cores 2 \
    --executor-memory 6G \
    --num-executors 19 \
    --driver-memory 8G \
    --driver-cores 2 \
    --name terasort-sort-1000g \
    --conf yarn.nodemanager.local-dirs=/mnt/disk1/yarn (Local directory for saving data on a data disk) \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
    --conf spark.yarn.executor.memoryOverhead=2000 \
    --conf spark.default.parallelism=2000 \
    /dla-spark-perf.jar (Directory for saving the JAR package) 1000g hdfs://test/terasort/input/1T (HDFS path of the test data)
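    Before running the sort, it can be worth confirming that TeraGen actually produced the expected data volume. A sketch, assuming the hdfs client is on the PATH and using the input path from the command above:

    ```shell
    # Sketch: check the total size of the generated TeraSort input on HDFS.
    # The path matches the TeraGen command above; replace it with your own.
    INPUT="hdfs://test/terasort/input/1T"
    CHECK_CMD="hdfs dfs -du -s -h $INPUT"

    if command -v hdfs >/dev/null 2>&1; then
        $CHECK_CMD    # prints the total size; expect roughly 1 TB
    else
        echo "hdfs client not found; would run: $CHECK_CMD"
    fi
    ```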
  2. Run the test application.
    • Run the test application on the DLA Spark cluster.
      Note Before you use a DLA Spark cluster to access a self-managed Apache Hadoop cluster, you must configure a virtual private cloud (VPC) and connect the DLA Spark cluster to the VPC. For the descriptions and configuration procedures of HDFS-related parameters, see Hadoop.
      Log on to the DLA console. In the left-side navigation pane, choose Serverless Spark > Submit job to submit a Spark job for running the TeraSort benchmark test. Sample code:
      {
          "args": [
              "--input",
              "<HDFS directory for saving the test data>", # Sample directory: hdfs://test/terasort/input/1T
              "--output",
              "<Directory for saving the output data that is generated by the test application>", # Sample directory: hdfs://test/terasort/output/1t/
              "--optimized",
              "false",
              "--shuffle-part",
              "2000"
          ],
          "file": "<OSS directory for saving the JAR package>", # Sample directory: oss://test/performance/dla-spark-perf.jar
          "name": "TeraSort-HDFS",
          "className": "com.aliyun.dla.perf.terasort.TeraSort",
          "conf": {
              "spark.dla.eni.enable": "true",
              "spark.dla.eni.vswitch.id": "vsw-xxxxx",
              "spark.dla.eni.security.group.id": "sg-xxxx",
              "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2,
              "spark.driver.resourceSpec": "medium",
              "spark.hadoop.dfs.namenode.rpc-address.<nameservices>.nn2": "xxxx2:8020",
              "spark.hadoop.dfs.namenode.rpc-address.<nameservices>.nn1": "xxxx1:8020",
              "spark.hadoop.dfs.ha.automatic-failover.enabled.<nameservices>": "true",
              "spark.hadoop.dfs.namenode.http-address.<nameservices>.nn1": "xxxx1:50070",
              "spark.executor.resourceSpec": "medium",
              "spark.hadoop.dfs.nameservices": "<nameservices>",
              "spark.hadoop.dfs.client.failover.proxy.provider.<nameservices>": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
              "spark.hadoop.dfs.namenode.http-address.<nameservices>.nn2": "xxxx2:50070",
              "spark.hadoop.dfs.ha.namenodes.<nameservices>": "nn1,nn2",
              "spark.executor.memoryOverhead": 2000,
              "spark.default.parallelism": "2000",
              "spark.executor.instances": 19
          }
      }
    • Run the test application on the self-managed Apache Spark cluster.
      Run the spark-submit command to submit the Spark application for running the TeraSort benchmark test to the self-managed Apache Spark cluster.
      ./bin/spark-submit \
      --class com.aliyun.dla.perf.terasort.TeraSort \
      --driver-memory 8G \
      --driver-cores 2 \
      --executor-cores 2 \
      --executor-memory 6G \
      --num-executors 19 \
      --name terasort-sort-1000g \
      --conf yarn.nodemanager.local-dirs=/mnt/disk1/yarn \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
      --conf spark.default.parallelism=2000 \
      --conf spark.yarn.executor.memoryOverhead=2000 \
      /dla-spark-perf.jar (Directory for saving the JAR package) \
      --input hdfs://test/terasort/input/1T (Replace it with the HDFS path of the test data) --output hdfs://test/terasort/output/1t/ (Replace it with the HDFS path of the output data) --optimized false --shuffle-part 2000
  3. Record test results.

    Record the test durations.
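For reference, the spark.hadoop.dfs.* keys in the scenario 3 job configuration mirror the HDFS high-availability settings that a self-managed cluster keeps in hdfs-site.xml. The mapping looks roughly as follows; the nameservice ID (mycluster here) and the host names are placeholders, just as <nameservices> and xxxx1/xxxx2 are in the job configuration:

```xml
<!-- Sketch: hdfs-site.xml counterparts of the spark.hadoop.dfs.* job settings. -->
<!-- "mycluster" stands in for your nameservice ID; host names are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>xxxx1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>xxxx2:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

Prefixing any hdfs-site.xml key with spark.hadoop. passes it through to the Hadoop configuration of the Spark job, which is how the DLA job reaches the HA NameNode pair.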