This white paper presents a comparative performance analysis of the TeraSort benchmark run on self-managed Apache Hadoop+Spark clusters and on Spark clusters in Data Lake Analytics (DLA). This topic describes the configuration requirements for the test environment in each of the three test scenarios.

Configuration requirements

Overall requirements for the test environment:
  • Self-managed Apache Hadoop+Spark clusters reside in a virtual private cloud (VPC).
  • Self-managed Apache Hadoop+Spark clusters and DLA Spark clusters are deployed in the same region.
  • Apache Spark 2.4.5 and Apache Hadoop 2.7.3 are used for self-managed Apache Hadoop+Spark clusters.
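All three scenarios run the same TeraSort workload described above. The snippet below is a minimal, simplified sketch of a TeraSort-style Spark job written for the Spark 2.4.5 RDD API. It treats the first 10 characters of each line as the sort key and uses hypothetical placeholder paths; the actual benchmark operates on 100-byte binary records and uses the standard TeraGen/TeraSort/TeraValidate tooling, so this sketch only illustrates the shape of the workload, not the exact benchmark implementation.

    import org.apache.spark.sql.SparkSession

    // Simplified TeraSort-style job: read, globally sort by key, write.
    // Paths are placeholders; the real benchmark uses 100-byte binary records.
    object SimpleTeraSort {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SimpleTeraSort").getOrCreate()
        val sc = spark.sparkContext

        val input  = args(0)  // e.g. oss://<bucket>/terasort/input or hdfs://<namenode>/terasort/input
        val output = args(1)  // e.g. oss://<bucket>/terasort/output

        sc.textFile(input)
          .map(line => (line.take(10), line))   // key = first 10 characters of each record
          .sortByKey()                          // the global sort drives the shuffle being measured
          .map { case (_, record) => record }
          .saveAsTextFile(output)

        spark.stop()
      }
    }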
Configuration requirements for the test environment in three scenarios:
  • Scenario 1: Performance comparison between DLA Spark clusters+OSS and self-managed Apache Hadoop+Spark clusters (1 TB data)

    Scenario description: The TeraSort benchmark test on 1 TB of data is run once a day for a month. Self-managed Apache Hadoop+Spark clusters use the subscription billing method. DLA Spark clusters+OSS use the pay-as-you-go billing method. Comparative performance analysis is performed based on the fees of the two types of clusters and the test durations.

    The following table describes the configuration requirements for DLA Spark clusters+OSS.
    Item     | Specifications                          | Requirement
    Driver   | medium (2 CPU cores and 8 GB of memory) | 1
    Executor | medium (2 CPU cores and 8 GB of memory) | 19
    OSS      | N/A                                     | 2 TB of storage space
    The following table describes the configuration requirements for self-managed Apache Hadoop+Spark clusters.
    Item   | Specifications                                                                 | Requirement
    Master | ecs.g5.xlarge (4 CPU cores and 16 GB of memory)                                | 2
    Slave  | ecs.g6.2xlarge (8 CPU cores and 32 GB of memory), 4 cloud disks of 500 GB each | 5
    Note
    • The TeraSort benchmark test is expected to occupy 1 TB of storage space for input data, 1 TB for shuffled data, and 1 TB for output data.
    • Self-managed Apache Hadoop+Spark clusters use a typical configuration that provides 5 TB of available storage space. The raw capacity of the cloud disks is 10 TB (4 disks × 500 GB × 5 slave nodes). Because HDFS on cloud disks is configured with two-way replication, the available storage space is 5 TB. Cluster disk usage should not be too high; we recommend that it not exceed 80%, because higher usage can cause various stability issues due to insufficient space. The capacity arithmetic is illustrated in the sketch after this note.
    • DLA Spark clusters allow you to use storage space and computing resources on demand. Shuffled data does not occupy OSS storage space. Input and output data occupy a total of 2 TB of OSS storage space: 1 TB for input data and 1 TB for output data.
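    The following sketch (runnable in the Scala REPL or spark-shell) only restates the capacity arithmetic from this note; the disk counts, two-way replication factor, and 80% usage threshold are taken directly from the note, not from any DLA or ECS API.

      // Illustrative only: usable HDFS capacity for the scenario 1 self-managed cluster.
      val disksPerNode   = 4
      val diskSizeGB     = 500
      val slaveNodes     = 5
      val replication    = 2      // HDFS replication factor used on cloud disks
      val usageThreshold = 0.8    // keep disk usage at or below 80%

      val rawTB       = disksPerNode * diskSizeGB * slaveNodes / 1000.0   // 10 TB raw capacity
      val usableTB    = rawTB / replication                               // 5 TB after replication
      val effectiveTB = usableTB * usageThreshold                         // 4 TB within the 80% threshold
      val requiredTB  = 1 + 1 + 1                                         // input + shuffle + output

      println(f"raw=$rawTB%.1f TB, usable=$usableTB%.1f TB, safe=$effectiveTB%.1f TB, required=$requiredTB TB")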
  • Scenario 2: Performance comparison between DLA Spark clusters+OSS and self-managed Apache Hadoop+Spark clusters (10 TB data)

    Scenario description: The TeraSort benchmark test on 10 TB of data is run once a day for a month. Self-managed Apache Hadoop+Spark clusters use the subscription billing method. DLA Spark clusters+OSS use the pay-as-you-go billing method. Comparative performance analysis is performed based on the fees of the two types of clusters and the test durations.

    The following table describes the configuration requirements for DLA Spark clusters+OSS.
    Item     | Specifications                                        | Requirement
    Driver   | medium (2 CPU cores and 8 GB of memory)               | 1
    Executor | medium (2 CPU cores and 8 GB of memory), 200 GB ESSD  | 39
    OSS      | N/A                                                   | 30 TB of storage space
    The following table describes the configuration requirements for self-managed Apache Hadoop+Spark clusters.
    Item   | Specifications                                                                     | Requirement
    Master | ecs.g5.xlarge (4 CPU cores and 16 GB of memory)                                    | 2
    Slave  | ecs.d1ne.4xlarge (16 CPU cores and 64 GB of memory), 8 local disks of 5.5 TB each  | 5
    Note
    • The TeraSort benchmark test is expected to occupy 10 TB of storage space for input data, 10 TB for shuffled data, and 10 TB for output data.
    • Self-managed Apache Hadoop+Spark clusters use a storage configuration that is typical for big data scenarios: ECS instances of the ecs.d1ne.4xlarge type with local disks, which are more cost-effective than cloud disks when large storage space is required. Local disk capacity is fixed per instance type, and an ecs.d1ne.4xlarge instance can only be configured with 44 TB of local disks (8 × 5.5 TB). Local disks typically use three-way replication for HDFS, so the available storage space is about 73 TB (5.5 TB × 8 disks × 5 nodes ÷ 3).
    • DLA provides a 200 GB ESSD for each executor to store shuffled data. Therefore, shuffled data does not occupy OSS storage space (see the configuration sketch after this note).
    • Only 20 TB of OSS storage space is actually required in this test environment. However, 30 TB is provisioned so that, as with the local disks of the self-managed Apache Hadoop clusters, spare storage space remains available.
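    For reference, the sketch below expresses the scenario 2 DLA resource plan in terms of standard open-source Spark properties. DLA configures resources through its own job configuration and manages the per-executor ESSD automatically, so the property names and the mount point shown here are illustrative assumptions about an equivalent open-source setup, not DLA settings.

      import org.apache.spark.SparkConf

      // Illustrative mapping of the scenario 2 DLA plan to open-source Spark properties.
      val conf = new SparkConf()
        .set("spark.driver.cores", "2")          // driver: medium (2 cores, 8 GB)
        .set("spark.driver.memory", "8g")
        .set("spark.executor.instances", "39")   // 39 executors
        .set("spark.executor.cores", "2")        // each executor: medium (2 cores, 8 GB)
        .set("spark.executor.memory", "8g")
        // In open-source Spark, shuffle files are written under spark.local.dir.
        // The per-executor 200 GB ESSD in DLA plays this role, which is why shuffled
        // data never lands on OSS. The path below is a hypothetical mount point.
        .set("spark.local.dir", "/mnt/shuffle-essd")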
  • Scenario 3: Performance comparison between DLA Spark clusters+self-managed Apache Hadoop clusters and self-managed Apache Hadoop+Spark clusters (1 TB data)

    Scenario description: A self-managed Apache Spark cluster and a DLA Spark cluster separately access self-managed Apache Hadoop clusters. The TeraSort benchmark test on 1 TB of data is run in each setup to compare the test durations.

    The following table describes the configuration requirements for DLA Spark clusters.
    Item     | Specifications                          | Requirement
    Driver   | medium (2 CPU cores and 8 GB of memory) | 1
    Executor | medium (2 CPU cores and 8 GB of memory) | 39
    The following table describes the configuration requirements for self-managed Apache Hadoop+Spark clusters.
    Item   | Specifications                                                  | Requirement
    Master | 4 CPU cores and 8 GB of memory                                  | 2
    Slave  | 8 CPU cores and 32 GB of memory, 4 cloud disks of 500 GB each   | 5
    Note
    • DLA Spark clusters can be used together with self-managed Apache Hadoop clusters. This allows you to elastically scale the computing resources available to an Apache Hadoop cluster, as illustrated in the sketch below.
    • In this test, both the self-managed Apache Spark clusters and the DLA Spark clusters use a total of 40 CPU cores and 160 GB of memory.
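    The sketch below shows, in generic open-source Spark terms, how a Spark job can be pointed at a self-managed Hadoop HDFS cluster for its TeraSort input and output. The NameNode address and paths are hypothetical placeholders, and DLA has its own mechanism for declaring network access to a VPC-hosted Hadoop cluster, which is not shown here.

      import org.apache.spark.sql.SparkSession

      // Illustrative only: directing a Spark job at a self-managed HDFS cluster.
      val spark = SparkSession.builder()
        .appName("TeraSortOnSelfManagedHDFS")
        // Hypothetical NameNode endpoint inside the same VPC as the Hadoop cluster.
        .config("spark.hadoop.fs.defaultFS", "hdfs://<namenode-host>:8020")
        .getOrCreate()

      // With fs.defaultFS pointing at the self-managed cluster, plain paths resolve
      // to its HDFS, so the same TeraSort-style job sketched earlier can read and
      // write HDFS data instead of OSS data.
      spark.sparkContext.textFile("/terasort/input")
        .map(line => (line.take(10), line))
        .sortByKey()
        .map(_._2)
        .saveAsTextFile("/terasort/output")

      spark.stop()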