This topic compares the performance of Spark SQL queries that run on Container Service for Kubernetes (ACK) against 1 TB of data before and after the Alluxio distributed caching service is used.

Hardware configurations

The following table lists the ACK cluster configurations.

Cluster type: Standard dedicated cluster
ECS instances:
  • Instance type: ecs.d1ne.6xlarge
  • Operating system: Alibaba Cloud Linux 2.1903
  • CPU: 24 cores
  • Memory: 96 GB
  • Disk size: 5,500 GB. Disk type: HDD.
Number of worker nodes: 20

Software configurations

  • Version
    • Apache Spark: 2.4.5
    • Alluxio: 2.3.0
  • Spark configurations
    Parameter                   Value
    spark.driver.cores          5
    spark.driver.memory         20480 MB
    spark.executor.cores        7
    spark.executor.memory       20480 MB
    spark.executor.instances    20
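
As a rough illustration of how the parameters above could be passed to a job, the following Python sketch renders them as spark-submit --conf flags and derives the total executor resources that the job requests. The helper names (to_submit_flags, executor_totals) are hypothetical, and appending the "m" size suffix to the memory values is an assumption made for illustration. The scripts used in the original tests are not shown in this topic.

```python
# Spark settings from the table above. Memory values are in MB.
SPARK_CONF = {
    "spark.driver.cores": 5,
    "spark.driver.memory": 20480,
    "spark.executor.cores": 7,
    "spark.executor.memory": 20480,
    "spark.executor.instances": 20,
}

# Keys whose values are memory sizes and need a unit suffix.
MEMORY_KEYS = {"spark.driver.memory", "spark.executor.memory"}


def to_submit_flags(conf):
    """Render settings as spark-submit --conf arguments (hypothetical helper)."""
    flags = []
    for key, value in conf.items():
        # Assumption: express memory with the "m" suffix so the unit is explicit.
        rendered = f"{value}m" if key in MEMORY_KEYS else str(value)
        flags += ["--conf", f"{key}={rendered}"]
    return flags


def executor_totals(conf):
    """Return (total executor cores, total executor memory in MB)."""
    instances = conf["spark.executor.instances"]
    total_cores = instances * conf["spark.executor.cores"]
    total_memory_mb = instances * conf["spark.executor.memory"]
    return total_cores, total_memory_mb


if __name__ == "__main__":
    print(" ".join(to_submit_flags(SPARK_CONF)))
    cores, memory_mb = executor_totals(SPARK_CONF)
    print(f"executor cores: {cores}, executor memory: {memory_mb // 1024} GB")
```

With these values, the 20 executors request 140 cores and 400 GB of memory in total.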

Test results

The following table lists the total time consumed in each test configuration. The 104 queries are run sequentially on 1 TB of data.

Configuration              Total time consumed by 104 queries (minutes)
Spark with OSS             180
Spark with Alluxio Cold    145
Spark with Alluxio Warm    137

The following figure shows the amount of time consumed by each query.


The test results show that query performance improves after the Alluxio caching service is used: the total time drops from 180 minutes with Object Storage Service (OSS) alone to 145 minutes on the cold run and 137 minutes on the warm run. The cold run is slower than the warm run because Alluxio must first cache the data from OSS. Subsequent runs read the cached data and complete faster.
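
To put the totals in the table in relative terms, the following minimal Python sketch computes how much time each Alluxio run saves compared with the OSS-only run. The function name reduction_pct is hypothetical. The input numbers are taken directly from the table in this topic.

```python
# Total time for the 104 queries, in minutes, from the results table.
TOTAL_MINUTES = {
    "Spark with OSS": 180,
    "Spark with Alluxio Cold": 145,
    "Spark with Alluxio Warm": 137,
}


def reduction_pct(baseline, measured):
    """Percentage of time saved relative to the baseline run."""
    return (baseline - measured) / baseline * 100


if __name__ == "__main__":
    baseline = TOTAL_MINUTES["Spark with OSS"]
    for name, minutes in TOTAL_MINUTES.items():
        saved = reduction_pct(baseline, minutes)
        print(f"{name}: {minutes} min, {saved:.1f}% less time than OSS")
```

For the totals in this topic, the cold run takes about 19.4% less time than the OSS-only run, and the warm run takes about 23.9% less.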
Notice: The tests in this topic run ACK-based Spark SQL queries on datasets generated by using Transaction Processing Performance Council-Decision Support (TPC-DS) to compare the performance before and after the Alluxio distributed caching service is used. These tests are not TPC benchmark tests, so their results may differ from the results of tests that are based on the TPC benchmarks.

What to do next