This white paper presents a comparative performance analysis of TeraSort benchmark tests run on self-managed Apache Hadoop+Spark clusters and on Spark clusters in Data Lake Analytics (DLA). This topic provides the comparison results for three scenarios.

Performance comparison results for 1 TB test data in Scenario 1

Cluster type | Test duration (hours) | Fee (USD)
DLA Spark clusters+OSS | 0.701 | 88.55
Self-managed Apache Hadoop+Spark clusters | 0.733 | 1616.79

The preceding results show that the two types of clusters deliver nearly the same job performance, but DLA Spark clusters are far more cost-effective: they reduce costs by more than 90% and improve cost-effectiveness by roughly nine to ten times. DLA Spark clusters suit small and medium-sized enterprises whose business is relatively simple and whose cluster resource usage is low. Such enterprises can cut costs significantly by deploying DLA Spark clusters.
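The figures in the Scenario 1 table can be checked with a few lines of arithmetic. The following is a minimal Python sketch. The function name `compare` and the definitions of speedup (ratio of test durations) and cost saving (one minus the ratio of fees) are assumptions made for illustration, not definitions given by this white paper.

```python
# Illustrative sketch: names and metric definitions are assumptions,
# not part of the benchmark itself.

def compare(dla_hours, dla_fee, self_hours, self_fee):
    """Return (speedup, cost_saving) of DLA relative to the self-managed cluster."""
    speedup = self_hours / dla_hours          # > 1 means DLA finished sooner
    cost_saving = 1.0 - dla_fee / self_fee    # fraction of the fee saved with DLA
    return speedup, cost_saving

# Values copied from the Scenario 1 table (1 TB of test data).
speedup, cost_saving = compare(0.701, 88.55, 0.733, 1616.79)
print(f"speedup: {speedup:.2f}x, cost saving: {cost_saving:.1%}")
```

With the table values, the two durations differ by less than 5% while the DLA fee is lower by more than 90%.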

Note that DLA Spark clusters let you use storage and computing resources on demand. OSS access in DLA Spark clusters has been heavily optimized, which improves its performance by approximately 100%. As a result, accessing OSS performs on par with accessing the Hadoop Distributed File System (HDFS).

Performance comparison results for 10 TB test data in Scenario 2

Cluster type | Test duration (hours) | Fee (USD)
DLA Spark clusters+OSS | 5.2 | 1685.24
Self-managed Apache Hadoop+Spark clusters | 13.9 | 3628.32

The preceding results show that DLA Spark clusters beat self-managed Apache Hadoop+Spark clusters on both performance and cost. DLA Spark clusters deliver about twice the performance at roughly half the price, which amounts to a fourfold improvement in cost-effectiveness.
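One way to combine performance and price into a single cost-effectiveness figure is to multiply the speedup by the fee ratio. The sketch below uses that definition as an assumption for illustration. The white paper does not state how it derives its own figure, and the function name `cost_effectiveness_gain` is hypothetical.

```python
# Illustrative sketch: the cost-effectiveness definition below is an assumption.

def cost_effectiveness_gain(dla_hours, dla_fee, self_hours, self_fee):
    """Speedup multiplied by fee ratio, both relative to the self-managed cluster."""
    speedup = self_hours / dla_hours    # how many times sooner DLA finished
    fee_ratio = self_fee / dla_fee      # how many times lower the DLA fee was
    return speedup * fee_ratio

# Rounded figures quoted in the text: twice the performance at half the price.
rounded_gain = cost_effectiveness_gain(1.0, 1.0, 2.0, 2.0)

# Values copied from the Scenario 2 table (10 TB of test data).
table_gain = cost_effectiveness_gain(5.2, 1685.24, 13.9, 3628.32)
print(f"rounded: {rounded_gain:.1f}x, from table values: {table_gain:.1f}x")
```

Under this assumed definition, the rounded figures give exactly 4, while the unrounded table values give a somewhat higher gain because the measured speedup is closer to 2.7 than to 2.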

The performance analysis for 10 TB of test data shows that, on local disks, data storage and data shuffling compete for I/O bandwidth. Compute nodes of the serverless Spark engine are equipped with enhanced SSDs (ESSDs), so the disks used for data shuffling are completely independent of data storage. This enables DLA Spark clusters to deliver higher performance.

Performance comparison results for 1 TB test data in Scenario 3

Cluster type | Test duration (hours)
DLA Spark clusters+OSS | 43.5
Self-managed Apache Hadoop+Spark clusters | 44.8

You can use DLA Spark clusters together with self-managed Apache Hadoop clusters, which require more computing resources during peak hours. DLA Spark clusters can connect directly to your virtual private cloud (VPC), so data computations run over the internal bandwidth and perform on par with local computing. DLA Spark clusters are also fully elastic: you can start 500 to 1,000 compute nodes within 1 minute to meet elastic computing requirements.