By Helin Jin (Pangbei)
To meet the complex data processing needs of batch processing and machine learning/AI scenarios, AnalyticDB provides compatibility with open-source Apache Spark™ (hereinafter referred to as AnalyticDB Spark).
• The upper layer of AnalyticDB Spark provides a variety of scheduling portals for users, including the console, DMS, and the spark-submit scripts commonly used with Apache Spark™ (a minimal job sketch follows the architecture diagram below).
• The intermediate control service provides capabilities such as resource, metadata, multi-tenancy, and security management.
• The lower layer provides Apache Spark™ on an elastic serverless architecture. The Apache Spark™ cluster manages database and table information through a unified metadata service and requests elastic resources through a unified management base.
• The underlying layer consists of the data sources supported by AnalyticDB Spark. One category is Object Storage Service (OSS) and MaxCompute, which can be accessed by using AnyTunnel or an STS token. The other is AnalyticDB, RDS, and HBase in users' VPCs, which can be accessed only through an Elastic Network Interface (ENI) that connects the different VPC networks.
The overall architecture diagram is as follows:

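For reference, the following is a minimal sketch of the kind of batch job that can be submitted through any of the portals above (console, DMS, or spark-submit). The bucket name and paths are placeholders, and the sketch assumes the OSS connector shipped with the AnalyticDB Spark runtime resolves the oss:// scheme.

```scala
import org.apache.spark.sql.SparkSession

object OssBatchJob {
  def main(args: Array[String]): Unit = {
    // A typical batch job: read Parquet data from OSS, aggregate it, and write the result back.
    val spark = SparkSession.builder()
      .appName("adb-spark-batch-demo")
      .getOrCreate()

    // "my-bucket" and the paths below are placeholders for illustration only.
    val orders = spark.read.parquet("oss://my-bucket/warehouse/orders/")

    val dailyCounts = orders
      .groupBy("order_date")
      .count()

    dailyCounts.write.mode("overwrite").parquet("oss://my-bucket/warehouse/orders_daily/")

    spark.stop()
  }
}
```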
This article describes how to improve the performance of AnalyticDB Spark by using the vectorized engine in the current architecture. The performance of the optimized AnalyticDB Spark is 6.98 times that of Apache Spark™ Community.
Currently, Apache Spark™ is a relatively stable and mature project. Since Apache Spark™ 2.4, there has been limited optimization at the operator level; most optimization work has gone into the execution plan, such as adaptive query execution (AQE) and dynamic partition pruning, and operator computation speed has reached a bottleneck. The following figure shows the performance of the HashAgg, HashJoin, and TableScan operators across various Apache Spark™ versions (higher values indicate better performance). Under the current architecture, it is difficult for Apache Spark™ to achieve significant performance breakthroughs at the operator level.

Modern SQL engines often deliver better operator performance. For instance, ClickHouse and Arrow implement their native engines with columnar data structures and vectorized execution, which outperform the row-based processing of the Apache Spark™ Java stack. Vectorization is therefore an essential means for Apache Spark™ to achieve significant performance improvements.
Among industry OLAP engines, vectorized computing is already a core and mature optimization capability. For example, open-source Apache Doris and ClickHouse, as well as proprietary engines such as Redshift, all have mature vectorization solutions.
Databricks officially released the vectorized engine Photon in 2022 and published the related paper Photon: A Fast Query Engine for Lakehouse Systems. The test results show that Photon delivers excellent performance, but it is not open source; it nonetheless points to a new direction for accelerating Apache Spark™. In the same year, Kyligence and Intel collaborated to open source Apache Gluten, which relies on a native engine backend and pushes operators down to Velox or ClickHouse to improve execution efficiency. After two years of community iteration, Apache Gluten now supports most Apache Spark™ operators and UDFs. Kuaishou has also been working on vectorizing Apache Spark™ operators and has open-sourced Blaze, an Apache Spark™ vectorization plug-in that serves as a middle layer and vectorizes operators on top of DataFusion in Rust. In addition, Apple has open-sourced datafusion-comet, which likewise vectorizes Apache Spark™ operators based on DataFusion and Arrow.
In short, many companies are turning to vectorization technology to further improve the operator performance of Apache Spark™.
To provide highly cost-effective Apache Spark™ capabilities, AnalyticDB Spark also began to investigate vectorization solutions at the end of 2022. Each solution was comprehensively evaluated in terms of operator compatibility, community activity, and the number of practical usage scenarios. In the initial performance survey, the Apache Gluten + Velox solution achieved a 1.76-fold performance improvement. Afterward, the AnalyticDB Spark team held in-depth discussions and cooperated with Intel's Apache Gluten team. In the end, AnalyticDB Spark chose Apache Gluten + Velox as its vectorization solution.
In the Apache Gluten + Velox solution for Apache Spark™, Apache Gluten follows the original Apache Spark™ framework and is integrated into Apache Spark™ as a plug-in. When an SQL statement arrives, the Driver converts it into an Apache Spark™ physical plan through the Catalyst module, and the physical plan is then passed to Apache Gluten. The rules in Apache Gluten convert the operators of the execution plan into the execution operators of the native engine. If the execution plan includes unsupported operators, Apache Gluten falls them back to the Java operators of Apache Spark™ and inserts row-column conversion operators for compatibility.
The Driver sends the generated execution plan to the Executor. The Executor then converts the plan into actual Velox operators, which are executed by the Velox library through JNI calls.
On the whole, Apache Gluten still follows the SQL parsing and optimization logic of Apache Spark™ itself, replacing only the operators that are ultimately executed and running the supported ones in Velox.
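A quick way to observe this conversion is to inspect the physical plan. The sketch below, intended for a spark-shell or job with the Gluten plug-in jars on the classpath, is only illustrative: the plug-in class name and operator names vary across Gluten versions.

```scala
import org.apache.spark.sql.SparkSession

// The plug-in class name below matches recent Apache Gluten releases; older
// releases used "io.glutenproject.GlutenPlugin". Gluten also requires off-heap
// memory to be enabled.
val spark = SparkSession.builder()
  .appName("gluten-plan-inspection")
  .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .getOrCreate()

spark.range(0, 1000000).createOrReplaceTempView("t")

// Offloaded operators appear in the plan with names such as ProjectExecTransformer
// or HashAggregateExecTransformer; operators that fall back keep their original
// Spark names, separated by row-to-columnar / columnar-to-row conversion nodes.
spark.sql("SELECT id % 10 AS k, count(*) AS cnt FROM t GROUP BY k").explain()
```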

While integrating the native engine capabilities of Apache Gluten + Velox, AnalyticDB Spark has also made various improvements, including optimizations in security, usability, and performance.
To meet customers' security demands, the AnalyticDB Spark team and the DAMO Academy team have jointly built an always-confidential, cloud-native big data computing engine for privacy-preserving computation. It takes trusted and secure one-stop data exchange to a new level of platformization and meets the needs of security-sensitive scenarios. The fully self-developed TEE engine has also passed the highest-level security certification of the China Academy of Information and Communications Technology. You can enable always-confidential computing with simple configurations at very low cost.
The AnalyticDB Spark native engine has also been adapted to always-confidential computing to provide secure, fast, and easy-to-use analysis capabilities.

Using Apache Gluten with open-source Apache Spark™ requires cumbersome configuration and carries high learning and usage costs. In contrast, AnalyticDB Spark can enable the native engine with a single configuration and ships with tuned default settings to improve usability, as sketched below.
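For illustration, the contrast looks roughly like the following. The Gluten property names reflect common open-source deployments and may differ by version, while the AnalyticDB-side property name is purely a placeholder; the actual switch is exposed through the product configuration.

```scala
import org.apache.spark.sql.SparkSession

// Alternative A: open-source Apache Spark™ + Gluten. Several settings must be
// provided and tuned by hand (names may differ across Gluten versions).
val glutenBuilder = SparkSession.builder()
  .appName("native-engine-demo")
  .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "4g")
  .config("spark.shuffle.manager",
    "org.apache.spark.shuffle.sort.ColumnarShuffleManager")

// Alternative B: AnalyticDB Spark, where the native engine is conceptually a
// single switch backed by tuned defaults. "spark.adb.native.enabled" is a
// placeholder property name used here for illustration only.
val adbBuilder = SparkSession.builder()
  .appName("native-engine-demo")
  .config("spark.adb.native.enabled", "true")

// Pick one of the two builders to create the actual session.
val spark = adbBuilder.getOrCreate()
```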
Moreover, accessing Object Storage Service (OSS) data sources is a common scenario for data analysis with Apache Spark™. To access OSS with the open-source hadoop-oss connector, you need to configure a plaintext AccessKey pair, which may pose information security risks. AnalyticDB Spark has developed a RAM- and STS-based solution on top of the Alibaba Cloud RAM system: the Apache Spark™ Driver and Executors periodically request the metadata service center to refresh the STS token, enabling access to OSS data sources without an AccessKey pair. This feature is available not only in non-native-engine scenarios but also when Velox accesses OSS in the native engine.
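The difference in job configuration looks roughly as follows. The plaintext-key property names follow the hadoop-aliyun OSS connector and may vary by connector version; the RAM role property on the AnalyticDB side is a placeholder, since the token refresh happens transparently in the Driver and Executors.

```scala
import org.apache.spark.sql.SparkSession

// Pattern the text above warns about: a plaintext AccessKey pair placed directly
// in the job configuration of the open-source OSS connector.
val plaintextKeyBuilder = SparkSession.builder()
  .appName("oss-access-demo")
  .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-internal.aliyuncs.com")
  .config("spark.hadoop.fs.oss.accessKeyId", "<plaintext-access-key-id>")
  .config("spark.hadoop.fs.oss.accessKeySecret", "<plaintext-access-key-secret>")

// On AnalyticDB Spark, the job only references a RAM role; the Driver/Executors
// periodically refresh the STS token from the metadata service, so no AccessKey
// pair appears anywhere. The property name below is a placeholder for illustration.
val stsBuilder = SparkSession.builder()
  .appName("oss-access-demo")
  .config("spark.adb.roleArn", "acs:ram::<account-id>:role/<adb-spark-role>")

val spark = stsBuilder.getOrCreate()
// Placeholder path; works once OSS access has been authorized through either approach.
spark.read.parquet("oss://my-bucket/warehouse/orders/").show(10)
```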

In the process of promoting the AnalyticDB Spark native engine, we found that some customers make heavy use of Apache Spark™ UDFs such as from_json and from_csv in their queries. UDFs of this type are not yet supported in the Apache Gluten community, so in customer test scenarios the corresponding operators fall back to the Java engine, incurring a large amount of row-column conversion overhead and resulting in poor performance.
AnalyticDB Spark has made major optimizations for these scenarios. UDFs such as from_json have been released internally, so SQL tests no longer trigger operator fallback or the additional row-column conversion overhead it would bring.
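A hedged sketch of how such a fallback can be observed: with community Gluten, the operator that evaluates from_json stays on the Java engine and is surrounded by row-column conversion nodes in the physical plan, whereas a native from_json keeps the whole plan columnar (exact node names depend on the Gluten version).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("from-json-fallback").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)
))

val parsed = Seq("""{"id":1,"name":"a"}""", """{"id":2,"name":"b"}""")
  .toDF("payload")
  .select(from_json(col("payload"), schema).alias("parsed"))

// With community Gluten, the operator containing from_json is not offloaded, so the
// plan shows the original Spark operator wrapped by columnar-to-row / row-to-columnar
// conversions; with a native from_json implementation, no such conversions appear.
parsed.explain()
```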
Lakecache, self-developed by the AnalyticDB team, provides efficient, stable, and reliable I/O acceleration. Here, we integrate this intelligent cache into the vectorized engine to further improve performance.
The basic process of integrating Lakecache acceleration with Velox is as follows: the operator that reads OSS in Velox is proxied to the Lakecache C++ client. The client first obtains from the Lakecache Master the Worker that holds the required data and then sends a read request to that Worker. After receiving the request, the Worker pulls the data from OSS, caches it, and returns it to the requesting Executor. If a read request, or several consecutive requests, to Lakecache fails, the client triggers a circuit-breaking mechanism and reads directly from OSS. This prevents user job failures caused by occasional Lakecache service unavailability.
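The control flow described above can be illustrated with the following sketch. All types and method names here are hypothetical; the real client is a C++ component embedded in Velox, and this Scala version only mirrors the locate/read/fall-back logic.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical interfaces that stand in for the Lakecache Master/Worker and OSS.
trait LakecacheMaster { def locateWorker(path: String): LakecacheWorker }
trait LakecacheWorker { def read(path: String, offset: Long, length: Int): Array[Byte] }
trait OssClient       { def read(path: String, offset: Long, length: Int): Array[Byte] }

class CachedOssReader(master: LakecacheMaster, oss: OssClient, maxFailures: Int = 3) {
  // Number of consecutive Lakecache failures observed so far.
  private var failures = 0

  def read(path: String, offset: Long, length: Int): Array[Byte] = {
    // Circuit breaker: after repeated Lakecache failures, bypass the cache and read
    // directly from OSS so occasional cache unavailability does not fail the job.
    if (failures >= maxFailures) return oss.read(path, offset, length)

    Try {
      // 1. Ask the Lakecache Master which Worker holds (or should cache) the data.
      val worker = master.locateWorker(path)
      // 2. Read from that Worker; on a cache miss the Worker pulls the data from
      //    OSS, caches it, and returns it to the requesting Executor.
      worker.read(path, offset, length)
    } match {
      case Success(bytes) =>
        failures = 0
        bytes
      case Failure(_) =>
        failures += 1
        // 3. Fall back to a direct OSS read for this request.
        oss.read(path, offset, length)
    }
  }
}
```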

| Test Comparison Items |
|---|
| Apache Spark™ Community Edition 3.2.0 |
| AnalyticDB Spark 3.2.0 |
• Workload: TPC-H 1 TB, full query set
• Driver: 2 cores, 8 GB
• Executors: 40 × (2 cores, 8 GB)
In the TPC-H test, the total time consumed by queries is as follows:
| | Apache Spark™ Community | AnalyticDB Spark |
|---|---|---|
| Total Time Consumed (Seconds) | 4351.506 | 623.273 |
The following figure compares the time consumed by each query:

The overall performance of AnalyticDB Spark is 6.98 times that of Apache Spark™ Community in the TPC-H 1T test (4351.506 s ÷ 623.273 s ≈ 6.98).
AnalyticDB Spark adopts vectorization technology and combines it with intelligent caching, improving performance to 6.98 times that of Apache Spark™ Community. Our plans for the future are as follows:
Currently, our vectorization capability is in invitational preview. In the future, this feature will be made available to more customers.
Support more customer scenarios, including but not limited to:
• Keep up with the Apache Gluten/Velox community. The community iterates quickly, and we will keep testing and following up with it on a monthly basis.
• Support more data sources, such as JindoFS and AWS S3.
• Adapt to more custom scenarios that are common among customers, such as frequently used customer UDFs.
For example, the AnalyticDB Spark vectorized engine will be combined with Alibaba Cloud Yitian instances to further improve performance.