Recently, PolarDB topped the TPC-C benchmark test list with a performance that exceeded the previous record by 2.5 times, setting a new TPC-C world record for performance and cost-effectiveness with a performance of 2.055 billion transactions per minute (tpmC) and a unit cost of CNY 0.8 (price/tpmC).
Behind each seemingly simple number lies the relentless pursuit of database performance, cost-effectiveness, and stability by countless engineers. The pace of innovation in PolarDB has never stopped. We are releasing the series "PolarDB's Technical Secrets of Topping TPC-C" to tell the story behind the "Double First Place". Please stay tuned!
This is the third article in the series - Cost Optimization - Hardware and Software Collaboration.
TPC-C is a benchmark model issued by the Transaction Processing Performance Council (TPC) specifically designed to evaluate OLTP (Online Transaction Processing) systems. It covers typical database processing paths such as inserts, deletes, updates, and queries to test the OLTP performance of a database. The final performance metric is measured in tpmC (transactions per minute). The TPC-C benchmark provides an intuitive measure of database performance.
In this TPC-C benchmark test, we use the cloud-native database PolarDB for MySQL 8.0.2. Through standalone optimization and I/O link optimization, we improve performance, efficiently combine software and hardware, and enhance cost-effectiveness. The unit cost is reduced by nearly 40% compared with the previous record.
This article discusses how PolarDB deeply combines software and hardware to improve I/O performance and reduce costs. During the test, we identified the following data characteristics:
• Massive data volume: The dataset can reach up to 64 PB, resulting in enormous data transfer.
• High I/O throughput and low latency: In high I/O throughput scenarios, extremely low latency is required to ensure high performance.
• Cost-effectiveness and high availability: It is required to lower storage costs while maintaining the high availability of the system and its fault tolerance.
This article elaborates on the I/O link technology of PolarDB featuring software and hardware collaboration. Based on the preceding data features, the software and hardware mechanisms are optimized to achieve high performance and high cost-effectiveness.
In the relational database field, customers place particular emphasis on three core dimensions: performance, scalability, and availability. In the traditional MySQL architecture, local disk deployment provides high I/O performance by virtue of its direct connection to the storage media. However, limited by the standalone deployment mode, the overall availability of the system remains relatively low. To improve availability and fault tolerance, MySQL usually employs a primary/secondary synchronization mechanism that replicates data over the network, which increases transaction processing latency and reduces the overall performance of the database. In addition, fault recovery and adding secondary nodes take a relatively long time.
The scalability of MySQL deployed on local disks is limited by the capacity and throughput of the physical disks. To improve scalability, you can deploy MySQL on cloud disks, which effectively alleviates the limits on storage space expansion. However, because the database and the cloud disks exchange data over the network, compute and storage can no longer be co-located; the cross-tier network latency is high, and the overall performance of MySQL drops significantly.
Figure 1: Traditional MySQL database
The computing-storage separation architecture of PolarDB perfectly addresses the preceding three dimensions. It features highly available distributed storage, supports dynamic adjustment of storage capacity on demand, and scales up to petabytes per instance. Leveraging software and hardware co-optimization technology with high-speed networks and premium storage devices, PolarDB achieves performance comparable to that of local disks. The compression technology with software and hardware collaboration significantly reduces data storage costs and substantially improves the cost-effectiveness of databases while maintaining high performance.
Figure 2: End-to-end PolarDB I/O architecture
This article will elaborate on three aspects:
100Gb RDMA high-speed network: PolarDB fully deploys a 100Gb RDMA network to synchronize data among proxies, compute node databases, and storage cluster servers. PolarDB provides high bandwidth and low latency for large-scale distributed databases, significantly enhances performance in high-concurrency scenarios, and ensures system response speed and stability.
Accelerated I/O performance in EMP: To optimize the performance of I/O links, a single-cluster memory pool at the petabyte level is built on DRAM and AliSCM, a storage medium developed by Alibaba Cloud. Deeply integrated with database data characteristics, PolarFS applies highly customized optimizations to different data types and uses I/O labeling for fine-grained cache management to keep critical paths efficient. Single-node performance under high concurrency and the kernel-stack optimizations are described in detail in another article in this series.
Hardware-software collaborative compression for cost efficiency: PolarDB achieves transparent compression of more than 4x through hardware-software collaborative compression. By combining software compression with SMART SSD hardware, the multi-tiered compression strategy greatly reduces storage costs while intelligently distinguishing data based on its characteristics to keep data access efficient.
Through end-to-end software and hardware co-optimization, PolarDB achieves extreme performance, availability, and cost-effectiveness, delivering high-performance distributed storage with a high compression ratio.
PolarStore uses a 100 Gb lossy RDMA network for communication. Combined with in-depth software-layer optimization, it implements highly reliable RDMA interconnection across thousands of hosts while ensuring consistency and reliability. This architecture provides aggregate bandwidth far exceeding that of local disks, significantly boosts read/write throughput, and offers high-performance, low-latency transport for the massive concurrent requests of the TPC-C benchmark.
PolarDB uses the self-developed PolarFS and scenario-oriented tuning of the RDMA protocol to adapt to real-world workloads and fully exploit the performance advantages of RDMA.
Combine RDMA and software design to achieve zero memory copy. PolarStore registers DPDK huge pages as RDMA memory buffers on its links. End-to-end network communication does not need to copy data through the kernel into user space as in traditional networking. Combined with the RDMA scatter-gather mechanism, distributed data forwarding redirects data for sending in place, achieving zero copy across the entire link and maximizing the capabilities of the network hardware.
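As a rough illustration of this pattern, the sketch below registers a DPDK huge-page buffer with the RDMA NIC once and then posts a send through a scatter-gather list, so no per-request staging copy is needed. It is a minimal sketch built on the public libibverbs and DPDK APIs, not PolarStore's actual code; buffer names, sizes, and access flags are assumptions.

```cpp
// Minimal sketch of zero-copy RDMA sends over a DPDK huge-page buffer.
// Assumes libibverbs and DPDK; names and sizes are illustrative only.
#include <infiniband/verbs.h>
#include <rte_malloc.h>
#include <cstdint>

// Register a DPDK huge-page region once so the NIC can DMA from it directly.
ibv_mr* register_hugepage_buffer(ibv_pd* pd, size_t bytes, void** out_buf) {
    void* buf = rte_malloc("io_pool", bytes, 4096);      // huge-page backed allocation
    *out_buf = buf;
    return ibv_reg_mr(pd, buf, bytes,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}

// Forward a request by pointing a scatter-gather list at the data in place;
// the NIC gathers both segments, so no copy into a per-send buffer is needed.
int post_zero_copy_send(ibv_qp* qp, ibv_mr* mr,
                        void* hdr, uint32_t hdr_len,
                        void* payload, uint32_t payload_len) {
    ibv_sge sge[2] = {
        { reinterpret_cast<uint64_t>(hdr),     hdr_len,     mr->lkey },
        { reinterpret_cast<uint64_t>(payload), payload_len, mr->lkey },
    };
    ibv_send_wr wr{};
    wr.sg_list    = sge;
    wr.num_sge    = 2;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;
    ibv_send_wr* bad = nullptr;
    return ibv_post_send(qp, &wr, &bad);
}
```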
Prevent RDMA connection explosion and reduce NIC (network interface controller) cache pressure. In a distributed cluster, compute nodes are interconnected with multiple storage nodes. To prevent unbounded growth of connections, an I/O forwarding module (PolarSwitch) is introduced on the compute nodes. Database I/O requests are handed to PolarSwitch over shared memory and forwarded to the storage nodes, so each compute node maintains only a single set of network connections. This relieves NIC cache pressure and reduces the memory usage of the network message pool.
Small-scale networking optimization for storage nodes: A storage cluster uses a disk group architecture, where RDMA connections are established only within a disk group. Since a disk group typically contains no more than 20 disks, full-mesh networking across the cluster is unnecessary, and the number of connections on a single machine drops significantly.
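To make the link-count effect of Figure 3 concrete, here is an illustrative back-of-the-envelope calculation. The node counts are assumptions chosen only to show the scale of the reduction; only the 20-disk group bound comes from the text.

```cpp
// Illustrative connection-count comparison; cluster sizes are assumptions,
// not PolarStore's actual deployment figures.
#include <cstdio>

int main() {
    const int db_processes_per_host = 8;    // assumed databases sharing one compute host
    const int storage_nodes         = 500;  // assumed storage nodes reachable by that host
    const int disks_per_group       = 20;   // upper bound of a disk group (per the text)

    // Without PolarSwitch: every database keeps its own links to every storage node.
    long full_mesh = 1L * db_processes_per_host * storage_nodes;

    // With PolarSwitch: databases hand requests over via shared memory, and only
    // the forwarding module holds one set of links per storage node.
    long with_switch = storage_nodes;

    // Storage side: RDMA links stay inside a disk group instead of spanning the cluster.
    long per_storage_node = disks_per_group - 1;

    std::printf("compute host links: %ld -> %ld\n", full_mesh, with_switch);
    std::printf("storage node intra-group links: %ld\n", per_storage_node);
    return 0;
}
```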
Figure 3: Schematic diagram of the number of compute node links before and after the optimization
Network stability is critical in large-scale clusters. PolarStore achieves linear scaling to nearly 10,000 nodes by reducing jitter and congestion and enhancing fast-recovery capabilities.
• Mitigate network congestion caused by Incast traffic. A PolarDB cluster consists of hundreds of compute and storage nodes, and its many-to-many communication pattern is prone to Incast issues, which cause network congestion and RDMA throttling and affect overall performance. PolarStore adopts a hierarchical traffic management policy to establish multi-level protection: PolarSwitch restricts the I/O depth of RDMA links to a single storage node to prevent bursts of I/O from filling the switch buffer and triggering ECN throttling, and the storage node introduces a backpressure mechanism (Throttle) to slow down the I/O sending rate of PolarSwitch. This reduces the risk of Incast and keeps the cluster network stable.
• Fast fault recovery to cope with NIC flapping. NIC flapping causes timeouts or packet loss. PolarStore uses multi-path technology to establish multiple end-to-end links and records the RTT of each link. When flapping occurs, the system re-sends timed-out I/O requests over paths with normal RTT, as sketched below, minimizing the impact of flapping on I/O performance.
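A minimal sketch of the RTT-aware failover idea follows. The data structures and the notion of a smoothed RTT are assumptions for illustration, not PolarStore's implementation.

```cpp
// Minimal sketch of RTT-aware path failover; structures and thresholds are
// assumptions used only to illustrate the idea.
#include <chrono>
#include <vector>

using Micros = std::chrono::microseconds;

struct Path {
    int    id;
    Micros smoothed_rtt;   // updated from recent acknowledgements
    bool   timed_out;      // set when a request on this path misses its deadline
};

// Choose the healthy path with the lowest recorded RTT; requests that timed
// out on a flapping NIC are re-sent over the path returned here.
const Path* pick_path(const std::vector<Path>& paths) {
    const Path* best = nullptr;
    for (const auto& p : paths) {
        if (p.timed_out) continue;                             // skip flapping links
        if (!best || p.smoothed_rtt < best->smoothed_rtt) best = &p;
    }
    return best;                                               // nullptr: all paths degraded
}
```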
In the TPC-C test, the PolarDB cluster faces the dual challenges of a large number of concurrent requests and 64 petabytes of data. The ratio of database memory to data volume is 1:40, resulting in frequent I/O read and write operations. PolarStore uses DRAM and AliSCM developed by Alibaba Cloud to build petabyte-level elastic memory pools. It uses intelligent I/O labeling to refine I/O performance and maximizes I/O throughput through ParallelRaft technology.
In the I/O link of PolarDB, the I/O performance of redo logs directly affects the transaction commit efficiency. The limited buffer pool causes transactions to frequently read pages. Therefore, the write performance of redo logs and the read performance of pages are critical.
Intelligent I/O labeling: As a high-performance user-mode file system, PolarFS improves I/O efficiency by reducing kernel switches and applying customized optimizations. It works with the database kernel and uses an I/O labeling mechanism so that the storage system can be tuned for different data types.
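The following sketch illustrates how such a labeling mechanism might look: the database kernel tags each request, and the storage layer maps the tag to a handling policy. The label names and policy fields are hypothetical, not PolarFS's actual interface.

```cpp
// Hypothetical sketch of I/O labeling: the database kernel tags each request,
// and the storage layer maps the tag to a caching/placement policy.
// Label names and the policy table are assumptions, not PolarFS's real API.
enum class IoLabel { RedoLog, DataPage, TempFile, Backup };

struct IoPolicy {
    bool persist_to_scm_first;   // land the write in AliSCM before acknowledging
    bool cache_in_emp;           // keep the page resident in the elastic memory pool
    bool compress_in_software;   // compress before the data reaches the SMART SSD
};

IoPolicy policy_for(IoLabel label) {
    switch (label) {
        case IoLabel::RedoLog:  return {true,  false, false};  // latency-critical path
        case IoLabel::DataPage: return {false, true,  true};   // read-heavy, compressible
        default:                return {false, false, true};   // cold or bulk data
    }
}
```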
Figure 4: End-to-end data acceleration in elastic memory pool (EMP)
Extreme EMP acceleration: EMP builds petabyte-level elastic memory pools based on DRAM and AliSCM, a persistent storage medium developed by Alibaba Cloud. AliSCM persists data in hundreds of nanoseconds, offers a storage density higher than that of DRAM, and delivers four times the capacity and three times the throughput. AliSCM is used as a write cache to accelerate write I/O, and its space is released as soon as the data has been asynchronously written to the SMART SSD. To meet the high-throughput write requirements of redo logs, the system acknowledges a write as soon as the data is persisted in AliSCM, keeping the critical write latency at just over ten microseconds, on par with local disks. Page flushes on non-critical paths are software-compressed before being stored in AliSCM to reduce storage costs, while the raw data is cached in memory to accelerate reads.
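The write-acceleration pattern described above can be sketched as follows: acknowledge a redo write once it is durable in AliSCM, then destage to the SMART SSD in the background and release the SCM space. All class and method names here are illustrative.

```cpp
// Sketch of the write-acceleration pattern: acknowledge after AliSCM persistence,
// destage to the SMART SSD asynchronously. Stubs stand in for real devices.
#include <cstdint>
#include <thread>
#include <vector>

struct ScmCache {
    uint64_t persist(const std::vector<uint8_t>&) { return 42; }  // returns durable handle
    void release(uint64_t) {}
};
struct SmartSsd {
    void write(uint64_t, const std::vector<uint8_t>&) {}
};

class RedoWriter {
public:
    RedoWriter(ScmCache& scm, SmartSsd& ssd) : scm_(scm), ssd_(ssd) {}

    // Returns as soon as the record is durable in AliSCM, keeping the
    // transaction-commit path short.
    void append(uint64_t offset, std::vector<uint8_t> record) {
        uint64_t handle = scm_.persist(record);
        // Destage in the background; a production system would use a bounded
        // queue and worker threads instead of a detached thread per write.
        std::thread([this, handle, offset, rec = std::move(record)] {
            ssd_.write(offset, rec);
            scm_.release(handle);   // free SCM space once data lands on the SSD
        }).detach();
    }
private:
    ScmCache& scm_;
    SmartSsd& ssd_;
};
```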
PolarFS uses I/O labeling to cache hot pages in EMP. When the database reads a page, data can be returned directly on an EMP hit, reducing read latency to just over ten microseconds, significantly better than the hundred-microsecond latency of local SSDs. In addition, EMP adopts an intelligent prefetching mechanism that pre-loads hot data into the cache based on dynamic analysis of user query patterns, effectively improving the cache hit rate. Fine-grained data management reduces the read and write latency of critical paths and delivers high-performance storage services.
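A simplified view of that read path might look like the sketch below, where hits are served from the EMP-resident cache and the prefetcher populates it ahead of predicted accesses; class and method names are hypothetical.

```cpp
// Hypothetical read path through the EMP page cache: serve hits from memory,
// let the prefetcher pre-load predicted pages, fall back to the SSD on a miss.
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

using Page = std::vector<uint8_t>;

class EmpPageCache {
public:
    // Fast path: return the page if it is resident in EMP.
    std::optional<Page> lookup(uint64_t page_no) {
        auto it = cache_.find(page_no);
        if (it == cache_.end()) return std::nullopt;   // miss -> read from SMART SSD
        return it->second;
    }

    // Called by the prefetcher when query-pattern analysis predicts upcoming pages.
    void prefetch(uint64_t page_no, Page page) { cache_[page_no] = std::move(page); }

private:
    std::unordered_map<uint64_t, Page> cache_;         // eviction policy omitted
};
```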
Low overhead and high availability with ParallelRaft: The self-developed ParallelRaft uses Parallel Commit and Look Behind Buffer techniques to avoid the blocking points of the traditional Raft replication process, maximize system concurrency, give full play to the hardware, and safeguard commit and apply performance.
AliSCM is a self-developed persistent storage device that continuously improves storage density through 3D stacking, achieving lower costs, read and write latency close to that of DRAM, and performance superior to traditional NAND flash. It supports DAX (direct access) to ensure data durability and uses the CXL protocol to natively support resource pooling. The device is flexible and easy to use, has a complete ecosystem and tool interface, and offers extremely high stability and reliability.
Figure 5: AliSCM architecture
PolarDB uses SMART SSDs, which are developed by Alibaba Cloud. ASIC chips are embedded in the SSDs to accelerate compression and decompression, reducing TCO by 60%. The performance of PolarDB is not affected even under extreme loads such as TPC-C. Its core advantages are:
• The hardware compression operation is completed by embedded ASIC chips, which do not occupy the user's CPU resources and avoid performance loss.
• The interface is compatible with standard disks, so no software-stack adaptation is needed.
• The write path lands data in a fast SRAM cache with power-loss protection that flushes it to persistent NAND, and compression runs in the background, achieving write performance comparable to that of standard disks.
• Although the read path incurs decompression overhead, the amount of NAND data to read shrinks after compression (for example, only half of the data needs to be read when 16KB is compressed to 8KB), which offsets part of that overhead, so read performance is not affected.
SMART SSDs embed high-performance dedicated ASIC chips to offload computing tasks such as compression, decompression, encryption, and decryption. The drive's fixed physical NAND capacity stores the compressed data. Before the disk is put into use, a logical capacity is configured with a certain expansion factor to accommodate the uncompressed data. Users transparently access logical addresses and receive the original data.
Figure 6: SMART SSD architecture
SMART SSD compresses data at a granularity of 4KB. Its core technology lies in packing compressed data compactly on NAND through a variable-length FTL, which records the mapping for the compressed data. After compression, a mapping entry no longer points to a fixed 4KB unit; it records the offset and length of the compressed data. Each entry consists of a 32-bit PBA and 25 bits of compression metadata: 1 bit indicates whether the data is compressed, 12 bits record the address offset, and 12 bits record the compressed length.
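Given those bit widths, one plausible in-memory layout of a mapping entry is sketched below; the field names are ours, and the actual on-device format is not public.

```cpp
// Layout of one variable-length FTL mapping entry as described above:
// a 32-bit physical block address plus 25 bits of compression metadata
// (1-bit compressed flag, 12-bit byte offset, 12-bit compressed length).
// Field names are illustrative.
#include <cstdint>

struct FtlEntry {
    uint64_t pba        : 32;  // physical block address of the 4KB NAND unit
    uint64_t compressed : 1;   // 0 = stored raw, 1 = stored compressed
    uint64_t offset     : 12;  // byte offset of the compressed data inside the unit
    uint64_t length     : 12;  // compressed length in bytes
};                             // 57 bits used; packs into a single 64-bit word

static_assert(sizeof(FtlEntry) == 8, "entry fits in one 8-byte word on common ABIs");
```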
Figure 7: Diagram of SMART SSD variable-length FTL
When data on a SMART SSD compresses poorly, there is a risk that its physical capacity will be exhausted. PolarMaster continuously monitors the physical watermark of each SMART SSD and migrates and rebalances data in the background based on usage, preventing physical capacity usage from climbing out of control.
Figure 8: Intelligent compression ratio scheduling based on PolarMaster
Because user data compresses at different ratios, aggregating data onto SMART SSDs leaves the overall compression ratio of individual physical disks significantly unbalanced. Disks holding datasets with low compression ratios are prone to running out of physical capacity prematurely, while disks with high compression ratios may sit idle once their logical capacity limit is reached. PolarMaster continuously monitors user data compression characteristics and per-disk compression ratios, intelligently and dynamically migrates data blocks with different compression characteristics, equalizes the consumption rates of physical and logical capacity, and maximizes the storage resource utilization of SMART SSDs.
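A simplified version of such a scheduling decision might look like the following: pick a migration source whose physical watermark is running ahead of its logical usage and a target with headroom. The threshold and structure names are assumptions.

```cpp
// Illustrative rebalancing decision in the spirit of the scheduling above:
// move poorly compressing blocks off disks whose physical space is running
// out and onto disks with physical headroom. Thresholds are assumptions.
#include <utility>
#include <vector>

struct DiskStats {
    int    id;
    double physical_util;   // consumed physical NAND / physical capacity
    double logical_util;    // consumed logical space / advertised logical capacity
};

// Pick a (source, target) pair when some disk's physical watermark runs far
// ahead of the rest, i.e. its data compresses poorly.
std::pair<int, int> pick_migration(const std::vector<DiskStats>& disks,
                                   double high_watermark = 0.85) {
    const DiskStats* src = nullptr;
    const DiskStats* dst = nullptr;
    for (const auto& d : disks) {
        if (d.physical_util > high_watermark &&
            (!src || d.physical_util > src->physical_util)) src = &d;
        if (!dst || d.physical_util < dst->physical_util)   dst = &d;
    }
    if (!src || !dst || src->id == dst->id) return {-1, -1};  // nothing to do
    return {src->id, dst->id};
}
```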
PolarStore provides a dual-layer compression mechanism of software compression and SMART SSD, which improves the compression ratio and reduces user costs. Databases typically use 16KB data pages. PolarStore performs software compression on 16KB pages based on the intelligent I/O labeling mechanism.
Software compression manages block resources at a 4KB granularity without introducing byte-level resource management, which greatly reduces the complexity of the software compression layer and its variable-length mapping table. When the data is written to the SMART SSD, the drive's variable-length FTL aligns it at byte granularity, avoiding wasted storage space.
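The following worked example, with an assumed software compression result, illustrates how the two layers interact: the software layer allocates in whole 4KB blocks for simplicity, and the drive's byte-aligned FTL reclaims the rounding slack.

```cpp
// Worked example of the dual-layer accounting: software allocates compressed
// 16KB pages in whole 4KB blocks, and the SMART SSD's variable-length FTL then
// packs the actual compressed bytes. The 6KB figure is an assumption.
#include <cstdio>

int main() {
    const int page_bytes    = 16 * 1024;
    const int block_bytes   = 4 * 1024;
    const int sw_compressed = 6 * 1024;   // assume software compresses 16KB -> 6KB

    // Layer 1: software compression rounds up to 4KB blocks (keeps mapping simple).
    int blocks = (sw_compressed + block_bytes - 1) / block_bytes;   // 2 blocks = 8KB

    // Layer 2: the SSD stores the blocks byte-aligned after its own compression,
    // so the rounding padding consumes little to no physical NAND.
    std::printf("logical page: %d KB, software layer: %d KB, "
                "physical after hardware packing: about %d KB\n",
                page_bytes / 1024, blocks * block_bytes / 1024,
                sw_compressed / 1024);
    return 0;
}
```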
Figure 9: Dual-layer compression mechanism of software compression and hardware compression
PolarDB significantly improves its competitiveness by deeply integrating software optimization with cutting-edge hardware technologies. The TPC-C benchmark, as a standard for comprehensively evaluating database performance, not only strictly examines a system's processing capability but also scrutinizes its cost-effectiveness. In this test, PolarDB demonstrated excellent performance and attractive cost-effectiveness, providing users with a more efficient, reliable, and economical database service.