This topic compares the performance of Spark SQL queries that run on ACK against 1 TB of data, with and without the Alluxio distributed cache.
The following table lists the ACK cluster configurations.
| Item | Configuration |
|---|---|
| Cluster type | Standard dedicated cluster |
| Number of worker nodes | 20 |
- Software version
  - Apache Spark: 2.4.5
  - Alluxio: 2.3.0
- Spark configurations

  | Parameter | Value |
  |---|---|
  | spark.driver.cores | 5 |
  | spark.driver.memory (MB) | 20480 |
  | spark.executor.cores | 7 |
  | spark.executor.memory (MB) | 20480 |
  | spark.executor.instances | 20 |
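
As an illustration of how the Spark configurations listed above could be passed to a job, the following minimal sketch builds a `spark-submit` argument list from those values. The helper name `build_spark_submit_args` and the application file `benchmark.py` are hypothetical placeholders, not part of the original test setup. The master URL, deploy mode, and any Alluxio-specific settings are omitted because this topic does not state them.

```python
# Spark settings copied from the configuration list above.
# Memory values are given in MB in the source, so they carry the "m" suffix here.
SPARK_CONF = {
    "spark.driver.cores": "5",
    "spark.driver.memory": "20480m",
    "spark.executor.cores": "7",
    "spark.executor.memory": "20480m",
    "spark.executor.instances": "20",
}


def build_spark_submit_args(conf, app="benchmark.py"):
    """Return a spark-submit argument list.

    `app` is a hypothetical placeholder for the benchmark application.
    """
    args = ["spark-submit"]
    for key, value in conf.items():
        # Each setting becomes one "--conf key=value" pair.
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


args = build_spark_submit_args(SPARK_CONF)
print(" ".join(args))
```

Keeping the settings in one dictionary makes it easy to rerun the same job with and without the cache while holding the Spark resources constant.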
The following table lists the total time consumed in each test scenario. In each scenario, the 104 queries run one after another against 1 TB of data.
| Benchmark | Total time consumed by 104 queries (minutes) |
|---|---|
| Spark with OSS | 180 |
| Spark with Alluxio Cold | 145 |
| Spark with Alluxio Warm | 137 |
The test results show that query performance improves after the Alluxio cache is used. In the first (cold) run, Alluxio still has to load data from OSS into the cache, so the total time drops from 180 minutes to 145 minutes, a reduction of about 19%. In later (warm) runs, the data is already cached and the total time drops further to 137 minutes, about 24% less than with OSS alone.
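
The relative gains can be checked with a short calculation. The sketch below takes the three totals from the results table above and derives the time reduction and speedup relative to the OSS-only run. The function and variable names are illustrative only.

```python
# Total minutes for the 104 queries, taken from the results table.
TOTAL_MINUTES = {
    "Spark with OSS": 180,
    "Spark with Alluxio Cold": 145,
    "Spark with Alluxio Warm": 137,
}


def reduction_percent(baseline, measured):
    """Percentage of time saved relative to the baseline."""
    return (baseline - measured) / baseline * 100


baseline = TOTAL_MINUTES["Spark with OSS"]

# Map each scenario to (time reduction in %, speedup factor).
summary = {
    name: (round(reduction_percent(baseline, minutes), 1),
           round(baseline / minutes, 2))
    for name, minutes in TOTAL_MINUTES.items()
}

for name, (reduction, speedup) in summary.items():
    print(f"{name}: {reduction}% less time, {speedup}x speedup")
```

With these inputs, the cold run saves 19.4% of the time (1.24x) and the warm run saves 23.9% (1.31x) compared with reading directly from OSS.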