This topic compares the performance of Spark SQL queries that run in a Container Service for Kubernetes (ACK) cluster on 1 TB of data, before and after the Alluxio distributed cache is enabled.

Hardware configurations

The following table lists the ACK cluster configurations.

Cluster type: Standard dedicated cluster
ECS instances:
  • Instance type: ecs.d1ne.6xlarge
  • Operating system: Aliyun Linux 2.1903
  • CPU: 24 cores
  • Memory: 96 GB
  • Disk size: 5,500 GB
  • Disk type: HDD
Number of worker nodes: 20

Software configurations

  • Software version
    • Apache Spark: 2.4.5
    • Alluxio: 2.3.0
  • Spark configurations
    Parameter                      Value
    spark.driver.cores             5
    spark.driver.memory (MB)       20480
    spark.executor.cores           7
    spark.executor.memory (MB)     20480
    spark.executor.instances       20
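
The Spark parameters above can be expressed as spark-submit --conf arguments. The following is a minimal sketch, not the method used in the benchmark: the source does not state how the settings were applied. The helper names to_submit_flags and executor_totals are hypothetical, and the "m" suffix is an assumed way to write the MB values from the table as Spark memory strings.

```python
# Values copied from the Spark configuration table; memory is written with
# an "m" suffix because the table gives it in MB (an assumption about form).
SPARK_CONF = {
    "spark.driver.cores": "5",
    "spark.driver.memory": "20480m",
    "spark.executor.cores": "7",
    "spark.executor.memory": "20480m",
    "spark.executor.instances": "20",
}


def to_submit_flags(conf):
    """Return the settings as a flat list of spark-submit --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


def executor_totals(conf):
    """Return (total executor cores, total executor memory in MB)."""
    instances = int(conf["spark.executor.instances"])
    cores = instances * int(conf["spark.executor.cores"])
    memory_mb = instances * int(conf["spark.executor.memory"].rstrip("m"))
    return cores, memory_mb


if __name__ == "__main__":
    print(" ".join(to_submit_flags(SPARK_CONF)))
    total_cores, total_memory_mb = executor_totals(SPARK_CONF)
    print(f"executor cores: {total_cores}, executor memory: {total_memory_mb} MB")
```

With these values, the 20 executors together request 140 cores and 409,600 MB of memory, excluding the driver and any memory overhead.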

Test results

The following table lists the total time consumed by each benchmark. In each test, the 104 queries are run sequentially on 1 TB of data.

Benchmark                   Total time consumed by 104 queries (minutes)
Spark with OSS              180
Spark with Alluxio Cold     145
Spark with Alluxio Warm     137
The following figure shows the time consumed by each query.


The test results show that query performance improves after the Alluxio cache is used. The total time drops from 180 minutes with OSS to 145 minutes on the first (cold) run, a reduction of about 19%. The cold run is slower than later runs because Alluxio must first cache the data from OSS. After the cache is warm, the total time drops further to 137 minutes, about 24% less than with OSS.
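
The relative improvement can be worked out from the totals in the results table. The following is a small sketch of that arithmetic; the function name reduction_percent is hypothetical, and the only inputs are the three totals reported above.

```python
# Total time in minutes for the 104 queries, copied from the results table.
TOTAL_MINUTES = {
    "Spark with OSS": 180,
    "Spark with Alluxio Cold": 145,
    "Spark with Alluxio Warm": 137,
}


def reduction_percent(baseline, measured):
    """Return how much shorter `measured` is than `baseline`, in percent."""
    return (baseline - measured) / baseline * 100


if __name__ == "__main__":
    baseline = TOTAL_MINUTES["Spark with OSS"]
    for name, minutes in TOTAL_MINUTES.items():
        if name == "Spark with OSS":
            continue
        saved = reduction_percent(baseline, minutes)
        print(f"{name}: {minutes} min, {saved:.1f}% less than Spark with OSS")
```

These figures cover only the total time. The per-query figure may show larger or smaller differences for individual queries.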

What to do next