Alibaba Cloud Community Blog

Sevenfold Performance Improvement | Alibaba Cloud AnalyticDB Spark Vectorization Capability Analysis

This article describes how to improve the performance of AnalyticDB for MySQL Spark by using the vectorized engine in the current architecture.

By Helin Jin (Pangbei)

1. AnalyticDB Spark Product Architecture

AnalyticDB Spark is an open-source Spark engine provided by AnalyticDB for MySQL to meet the requirements of complex offline processing and machine learning scenarios.

• The upper layer of AnalyticDB Spark provides a variety of scheduling portals for users, including the console, DMS, and spark-submit scripts commonly used in Spark.

• The intermediate control service provides capabilities such as resource, metadata, multi-tenancy, and security management.

• The lower layer provides Spark in the elastic serverless architecture. The Spark cluster manages database and table information through a unified metadata service and applies for elastic resources through a unified management base.

• The underlying layer consists of the data sources supported by AnalyticDB Spark. One group is Object Storage Service (OSS) and MaxCompute, which can be accessed by using AnyTunnel or an STS token. The other is AnalyticDB, RDS, and HBase instances in users' VPCs, which can be accessed only through an Elastic Network Interface (ENI) that connects the different VPC networks.

The overall architecture diagram is as follows:

[Figure 1: Overall architecture of AnalyticDB Spark]

This article describes how to improve the performance of AnalyticDB Spark by using a vectorized engine in the current architecture. The optimized AnalyticDB Spark delivers 6.98 times the performance of the Spark community edition.

2. Reasons for Performing Spark Vectorized Computing

Spark is now a relatively stable and mature project. Since Spark 2.4, there has been limited optimization at the operator level; most optimization work targets the execution plan, such as adaptive query execution (AQE) and dynamic partition pruning, and operator computing speed has reached a bottleneck. The following figure shows the performance of the HashAgg, HashJoin, and TableScan operators across Spark versions (higher is better). Under the current architecture, it is difficult for Spark to achieve significant performance breakthroughs at the operator level.

[Figure 2: HashAgg, HashJoin, and TableScan operator performance across Spark versions]

Modern SQL engines often have better operator performance. For instance, ClickHouse and Arrow implement native engines with columnar data structures and vectorized execution, which outperform the row-at-a-time processing of Spark's Java stack. Vectorization is therefore an essential means for Spark to achieve significant performance improvements.

3. Spark Vectorization Solutions in the Industry

Among OLAP engines in the industry, vectorized computing is already a core and mature optimization capability. For example, open-source engines such as Doris and ClickHouse, as well as proprietary engines such as Redshift, all have mature vectorization solutions.

Databricks officially released its vectorized engine Photon in 2022 and published a related paper, Photon: A Fast Query Engine for Lakehouse Systems. The test results show that Photon delivers excellent performance; although it is not open source, it provides a new direction for Spark acceleration. In the same year, Kyligence and Intel collaborated to open-source Gluten, which relies on a native engine backend, pushing operators down to Velox or ClickHouse to improve execution efficiency. After two years of community iteration, Gluten now supports most Spark operators and UDFs. Kuaishou has also been working on vectorizing Spark operators and has open-sourced Blaze, a Spark vectorization plug-in that serves as a middle layer to execute vectorized operators based on DataFusion in Rust. In addition, Apple has open-sourced datafusion-comet, which likewise vectorizes Spark operators based on DataFusion and Arrow.

In short, many companies are adopting vectorization technology to further improve Spark's operator performance.

4. Vectorization Solution for AnalyticDB Spark

To provide highly cost-effective Spark capabilities, AnalyticDB Spark began investigating vectorization solutions at the end of 2022. Each solution was comprehensively evaluated in terms of operator compatibility, community activity, and the number of practical usage scenarios. In the initial performance survey, the Gluten + Velox solution achieved a 1.76-fold performance improvement. The AnalyticDB Spark team then held in-depth discussions and cooperated with the Intel Gluten team, and finally chose Gluten + Velox as its vectorization solution.

4.1 Process of Gluten + Velox

In the Gluten + Velox solution, Gluten follows the original Spark framework and is integrated into Spark as a plug-in. When a SQL statement arrives, the Driver converts it into a Spark physical plan through the Catalyst module and then passes the physical plan to Gluten. Rules in Gluten convert the operators of the execution plan into the execution operators of the Native Engine. If the execution plan includes unsupported operators, Gluten falls back to Spark's Java operators for them and inserts row-column conversion operators for compatibility.

The Driver sends the generated execution plan to the Executor, which converts it into actual operators for Velox execution, invoking the Velox library through JNI.

On the whole, Gluten still follows Spark's own SQL parsing and optimization logic, replacing only the operators that are actually executed and running the supported ones in Velox.
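As a rough illustration of the plug-in model on the open-source side, enabling Gluten in open-source Spark typically involves several settings such as the following. The exact class names and sizes vary by Gluten release and are shown here only as an assumed example; consult the Gluten documentation for your version.

```properties
# Load Gluten as a Spark plug-in (the plugin class name depends on the Gluten release)
spark.plugins=org.apache.gluten.GlutenPlugin
# Velox manages its memory off-heap, so off-heap memory must be enabled and sized
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=20g
# Use a columnar shuffle manager to avoid row-column conversion at shuffle boundaries
spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
```

By contrast, as described in the usability section below, AnalyticDB Spark hides these details behind managed defaults.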

[Figure 3: Execution flow of the Gluten + Velox solution]

4.2 Features of AnalyticDB Spark Native Engine

While integrating the Native Engine capabilities of Gluten + Velox, AnalyticDB Spark has also made various improvements, covering security, usability, and performance.

Always-confidential Computing

To meet customers' security demands, the AnalyticDB Spark team and the DAMO Academy team have jointly built an always-confidential, cloud-native big-data computing engine for privacy-preserving computation. It brings trusted, secure, one-stop data exchange to a new level of platformization and meets the needs of security-sensitive scenarios. The fully self-developed TEE engine has passed the highest-level security certification of the China Academy of Information and Communications Technology. You can enable always-confidential computing with a few simple configurations at very low cost.

The AnalyticDB Spark Native Engine also adapts to always-confidential computing to provide secure, fast, and easy-to-use analysis capabilities.

[Figure 4: Always-confidential computing in AnalyticDB Spark]

Enhanced Usability

Using Gluten with open-source Spark requires cumbersome configuration and entails high learning and usage costs. In contrast, AnalyticDB Spark enables the Native Engine with a single configuration item and ships optimal default settings, greatly improving usability.

Moreover, accessing Object Storage Service (OSS) data sources is a common Spark analysis scenario. Accessing OSS through open-source hadoop-oss requires configuring a plaintext AccessKey pair, which may pose information security risks. AnalyticDB Spark has developed a RAM- and STS-based solution on top of the Alibaba Cloud RAM system: the Spark Driver/Executor periodically requests the metadata service center to refresh the STS token, enabling access to OSS data sources without AccessKey pairs. This feature works both in non-Native Engine scenarios and when Velox accesses OSS in the Native Engine.
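The token-refresh flow described above can be sketched as follows. This is a minimal illustration in plain Python, not the actual Driver/Executor implementation; the `fetch_token` callback stands in for the request to the metadata service center, and the 60-second refresh margin is an assumed parameter.

```python
import time

class StsTokenCache:
    """Caches a short-lived STS token and refreshes it before expiry.

    `fetch_token` is a hypothetical callback standing in for the call to
    the metadata service center; it returns (token, ttl_seconds).
    """

    def __init__(self, fetch_token, refresh_margin=60):
        self._fetch_token = fetch_token
        self._margin = refresh_margin      # refresh this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh when the cached token is absent or close to expiry;
        # otherwise reuse the cached token, avoiding extra service calls.
        if self._token is None or time.time() >= self._expires_at - self._margin:
            token, ttl = self._fetch_token()
            self._token = token
            self._expires_at = time.time() + ttl
        return self._token
```

Because every OSS read goes through `get()`, callers never see a plaintext AccessKey pair, and an expiring token is replaced transparently.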

[Figure 5: AccessKey-free access to OSS with STS tokens]

UDF Support

While promoting the AnalyticDB Spark Native Engine, we found that some customers make heavy use of Spark UDFs such as from_json and from_csv in their queries. UDFs of this type are not yet supported by the Gluten community, so in customer test scenarios the affected operators fell back to the Java engine. This fallback incurs a large amount of row-column conversion overhead, resulting in poor performance.

AnalyticDB Spark has made major optimizations for these scenarios. Native support for UDFs such as from_json has been released internally, so no operator fallback, and therefore no additional row-column conversion overhead, occurs in SQL tests.
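To see why fallback is expensive, consider the row-column conversion it forces at the engine boundary. The sketch below is purely illustrative plain Python, not engine code: it transposes a columnar batch into row tuples so a row-based operator could consume it, then transposes the result back. In a real engine this copying and re-layout happens for every batch that crosses between the native and Java operators.

```python
def columns_to_rows(batch):
    """Transpose a dict of equal-length columns into a list of row tuples."""
    names = list(batch)
    n_rows = len(batch[names[0]])
    return [tuple(batch[name][i] for name in names) for i in range(n_rows)]

def rows_to_columns(rows, names):
    """Transpose a list of row tuples back into a dict of columns."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

# A tiny columnar batch, handed to a row-based operator and back again.
batch = {"id": [1, 2, 3], "city": ["a", "b", "c"]}
rows = columns_to_rows(batch)                      # native -> Java hand-off
restored = rows_to_columns(rows, ["id", "city"])   # Java -> native hand-off
```

Keeping UDFs such as from_json inside the native engine eliminates both transpositions.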

Integrated Intelligent Caching

Lakecache, self-developed by the AnalyticDB team, provides efficient, stable, and reliable I/O acceleration. We have integrated this intelligent caching into the vectorized engine to further improve performance.

The basic process of integrating Lakecache acceleration with Velox is as follows: the operator that reads OSS in Velox is proxied to the Lakecache CPP Client. The Client first asks the Lakecache Master which Lakecache Worker holds the requested data, and then sends a read request to that Worker. After receiving the request, the Worker pulls the data from OSS, caches it, and returns it to the requesting Executor. If a read request, or several consecutive requests, to Lakecache fail, the Client triggers a circuit-breaking mechanism and connects directly to OSS. This prevents user job failures caused by occasional Lakecache service unavailability.
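The fallback behavior described above can be modeled as a simple circuit breaker. The following is an illustrative Python sketch, not the actual Lakecache CPP Client; the failure threshold and the two callbacks are assumed parameters standing in for the cache read and the direct OSS read.

```python
class CacheReaderWithBreaker:
    """Reads through the cache until repeated failures trip the breaker,
    after which reads go directly to the origin (OSS)."""

    def __init__(self, read_cache, read_oss, failure_threshold=3):
        self._read_cache = read_cache    # stands in for the Lakecache client call
        self._read_oss = read_oss        # stands in for a direct OSS read
        self._threshold = failure_threshold
        self._failures = 0

    def read(self, path):
        if self._failures >= self._threshold:
            return self._read_oss(path)  # breaker open: bypass the cache entirely
        try:
            data = self._read_cache(path)
            self._failures = 0           # a success resets the failure count
            return data
        except IOError:
            self._failures += 1
            return self._read_oss(path)  # fall back to OSS for this request too
```

Every request still succeeds during an outage; the breaker only decides whether the cache is even attempted, so a flapping Lakecache service cannot fail user jobs.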

[Figure 6: Lakecache integration with the vectorized engine]

5. AnalyticDB Spark Performance

5.1 Test Information

• Test comparison items: Spark Community Edition 3.2.0 vs. AnalyticDB Spark 3.2.0

• Dataset: TPC-H 1 TB, full query set

• AnalyticDB Spark cluster specifications: Driver: 2 cores, 8 GB; Executors: 40 × (2 cores, 8 GB)

5.2 Test Results

In the TPC-H test, the total time consumed by queries is as follows:

• Spark Community: 4351.506 seconds total

• AnalyticDB Spark: 623.273 seconds total

The following figure compares the time consumed by each query:

[Figure 7: Per-query time comparison between Spark Community and AnalyticDB Spark]

The overall performance of AnalyticDB Spark is 6.98 times that of Spark Community in the TPC-H 1T test.

6. Summary and Future Planning

AnalyticDB Spark adopts vectorization technology and combines it with intelligent caching, achieving 6.98 times the performance of the Spark community edition. Our plans for the future are as follows:

6.1 Vectorization Capability Available in all Alibaba Cloud Regions

Currently, the vectorization capability is in invitational preview. In the future, it will be made available for more customers to experience.

6.2 More Supported Scenarios

Support more customer scenarios, including but not limited to:

• Keep up with the Gluten/Velox community. The Gluten/Velox communities iterate quickly, so we will test and follow up with them on a monthly basis.

• Support more data sources, such as JindoFS and AWS S3.

• Adapt to more custom scenarios that are common to customers, such as UDFs that are commonly used by customers.

6.3 Hardware and Software Integrated Optimization

For example, combining the AnalyticDB Spark vectorized engine with Alibaba Cloud Yitian instances to further improve performance.
