This topic describes the benefits, architectures, and scenarios of PolarDB Archive Database.
Challenges and requirements for archiving historical data
In most cases, recent data is read and updated far more often than historical data. Historical data, such as messages or orders generated more than a year ago, is rarely queried. As a business grows, the database accumulates a large volume of data that is rarely or never queried. This growth can cause the following issues:
- Historical data and new data are stored in the same database, which can exhaust disk space.
- A large volume of data shares the memory, cache space, and disk IOPS of the database, which can degrade performance.
- Backing up a large volume of data takes a long time, and in some cases the backup cannot be completed as expected. An efficient solution is also required to store the large volume of backup data.
To fix these issues, historical data needs to be archived. Historical data can be stored as files in low-cost storage services, such as Alibaba Cloud Object Storage Service (OSS) or Database Backup (DBS). However, historical data is not completely static. Data generated months or even years ago may still occasionally be queried or updated in real time. For example, within Alibaba Group, historical orders in Taobao and Tmall, historical messages in DingTalk, and large volumes of historical logistics orders can all still be queried.
To read and update historical data, an archive database is used as a separate database to store the archived data. Each archive database must meet the following requirements:
- Provides large storage capacity and allows you to continuously save online data to the archive database, so that you do not need to be concerned about the storage capacity.
- Provides the same interface as the cloud databases. For example, the archive database must support the MySQL protocol. This ensures that your applications can access both the online database and the archive database without code changes.
- Is cost-efficient. For example, data can be compressed to reduce storage consumption, and low-cost storage media can be used to store large volumes of data.
- Provides read and write capabilities that meet the requirements of low-frequency reads and writes.
A solution is therefore required that combines large storage capacity, low cost, and adequate read and write capabilities. However, MySQL, the most widely used open source database in the world, does not provide such a solution. Engines with a high compression ratio, such as TokuDB and MyRocks, have been released, but the data volume that they can store is still limited by the disk capacity of a single physical machine.
Solution: PolarDB Archive Database
To handle the preceding challenges and requirements to store archived data, PolarDB provides the Archive Database product edition based on the following technological innovations and breakthroughs:
- X-Engine is developed by Alibaba Cloud based on log-structured merge-tree (LSM tree). X-Engine provides high data compression capabilities and allows you to use archive databases at a low cost. For more information about X-Engine, see X-Engine overview.
- PolarDB supports online scaling of the storage capacity based on the shared distributed storage service. It connects computing resources and storage resources over a high-speed network and transmits data by using the RDMA protocol. This eliminates the I/O performance bottleneck. X-Engine, as integrated into PolarDB, inherits these benefits.
X-Engine is integrated with PolarDB by using the following technological innovations. This enables PolarDB to run in a dual-engine architecture.
- Combines the WAL streams of X-Engine and the redo log streams of InnoDB, so that the same log streams and transmission channels support both InnoDB and X-Engine. The management logic and the interaction logic with the shared storage remain unchanged. This architecture can be reused by other engines that are introduced later.
- The X-Engine IO module is adapted to the Polar file system (PFS) of PolarDB InnoDB. This ensures that InnoDB and X-Engine share a distributed storage. Backups are also accelerated based on the underlying distributed storage.
Single compute node architecture of Archive Database
A PolarDB Cluster Edition cluster consists of one primary node and at least one read-only node. Write requests are forwarded to the primary node, and read requests can be distributed across the read-only nodes. However, Archive Database scenarios require large storage capacity but involve few reads and writes, so even the computing resources of the primary node are not fully used, and the additional read capability that is provided by read-only nodes is unnecessary. For example, in a cluster that has one primary node and one read-only node of the same specifications, 50% of the computing resources are unused.
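The unused-compute figure can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that all nodes have the same specifications and that the archive workload is light enough for the primary node to serve alone. The function name `idle_compute_fraction` is hypothetical and is not part of any PolarDB API.

```python
def idle_compute_fraction(read_only_nodes: int) -> float:
    """Return the fraction of provisioned compute that sits idle.

    Assumptions (illustrative only): every node has the same specification,
    and only the primary node is needed to serve the archive workload.
    """
    if read_only_nodes < 0:
        raise ValueError("read_only_nodes must be non-negative")
    total_nodes = 1 + read_only_nodes  # one primary plus the read-only nodes
    return read_only_nodes / total_nodes


if __name__ == "__main__":
    for n in (0, 1, 2):
        share = idle_compute_fraction(n)
        print(f"{n} read-only node(s): {share:.0%} of compute idle")
```

With one read-only node the function returns 0.5, which matches the 50% figure. With no read-only nodes it returns 0, which is the case that the single-node architecture targets.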
Archive Database reduces storage costs through the data compression capability of X-Engine. It also uses only the primary node to provide services, which eliminates the computing costs of read-only nodes. A PolarDB cluster that has no read-only nodes takes longer to recover when the primary node crashes. However, Archive Database still ensures 99.95% availability based on the high availability that is provided by the underlying distributed storage. Most users in low-frequency read and write scenarios want to reduce storage costs and do not require higher availability. In these scenarios, data is asynchronously imported to Archive Database in batches, and Archive Database can meet the business requirements of these users.
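To put the 99.95% figure in concrete terms, an availability percentage can be converted into the downtime that it permits over a period. The following is a minimal sketch of that conversion. The helper name and the 30-day and 365-day periods are assumptions chosen for illustration, not the terms of any service level agreement.

```python
def allowed_downtime_minutes(availability: float, period_days: float) -> float:
    """Return the downtime, in minutes, that an availability target permits.

    availability is a fraction such as 0.9995 (for 99.95%).
    period_days is the length of the measurement window in days.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    minutes_in_period = period_days * 24 * 60
    return (1.0 - availability) * minutes_in_period


if __name__ == "__main__":
    monthly = allowed_downtime_minutes(0.9995, 30)
    yearly = allowed_downtime_minutes(0.9995, 365)
    print(f"30 days:  about {monthly:.1f} minutes")
    print(f"365 days: about {yearly / 60:.2f} hours")
```

Under these assumptions, 99.95% availability corresponds to roughly 21.6 minutes of downtime in a 30-day window, or about 4.38 hours in a year.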
Read-only nodes are not provided in the single-node architecture of Archive Database. When you perform O&M operations on the node, such as restarting it after a minor version upgrade, a temporary read-only node that is deployed within the system is promoted to the primary node. This reduces the impact on reads and writes to Archive Database.
Benefits
- Provides large storage capacity. Based on 200 TB of storage and the compression capability of X-Engine, Archive Database can store more than 500 TB of raw data. The storage space uses a serverless architecture and automatically scales out as the data volume increases. You do not need to specify the storage capacity when you purchase a cluster, and you are charged only for the storage that you actually use.
- Supports the official MySQL protocol. Unlike solutions that back up historical data to NoSQL products such as HBase, Archive Database allows applications to access both the online database and the archive database without code changes.
- Accelerates data backup. Archive Database relies on the backup capability of the underlying PolarDB distributed storage, and the backup data can be uploaded to and permanently stored in low-cost storage, such as OSS.
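The capacity figures in the first item above imply a compression ratio, which can be worked out directly. The following is a minimal sketch, assuming that "200 TB" is physical storage and "500 TB" is the raw, uncompressed data size. The function names are hypothetical, and the ratio actually achieved depends on the data.

```python
def implied_compression_ratio(raw_tb: float, physical_tb: float) -> float:
    """Return raw size divided by physical size (higher means better compression)."""
    if raw_tb <= 0 or physical_tb <= 0:
        raise ValueError("sizes must be positive")
    return raw_tb / physical_tb


def physical_storage_needed_tb(raw_tb: float, ratio: float) -> float:
    """Estimate the physical storage consumed by raw_tb of data at a given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_tb / ratio


if __name__ == "__main__":
    ratio = implied_compression_ratio(raw_tb=500, physical_tb=200)
    print(f"Implied compression ratio: at least {ratio}x")
    # Hypothetical workload: 100 TB of raw history at the same ratio.
    print(f"100 TB raw needs about {physical_storage_needed_tb(100, ratio)} TB")
```

Storing more than 500 TB of raw data in 200 TB of storage implies a compression ratio of at least 2.5. At that ratio, 100 TB of raw history would occupy about 40 TB of billed storage.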
Scenarios
PolarDB Archive Database provides a large storage capacity and can be used to store the historical data of multiple services. This allows all historical data to be stored and managed centrally. Archive Database is applicable to the following scenarios:
- Use PolarDB Archive Database to store cold data of self-managed databases. These databases include MySQL, TiDB, PostgreSQL, SQL Server, and other relational databases.
- Use PolarDB Archive Database to store archived data of ApsaraDB RDS for MySQL or ApsaraDB PolarDB MySQL-compatible edition services. You can migrate historical data that is rarely queried to ApsaraDB PolarDB MySQL-compatible edition X-Engine. This frees up storage space in the online database, which reduces costs and improves performance.
- Use PolarDB Archive Database as a relational database that provides large storage capacity. This is applicable to scenarios that involve a large volume of writes and low-frequency reads, such as monitoring logs.
You can use Data Transmission Service (DTS) to continuously migrate data from the online database to PolarDB Archive Database in real time. You can also use Data Management Service (DMS) to periodically import online data to PolarDB Archive Database.
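DTS and DMS are managed services, and their interfaces are not shown here. The following sketch only illustrates the general idea behind a periodic batch import: rows older than a cutoff are copied to the archive database in small batches and then removed from the online database. It uses in-memory SQLite databases from the Python standard library as stand-ins for the two MySQL-compatible databases. The `orders` table, its columns, and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical table used by both the online and the archive stand-ins.
SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"


def make_db(rows=()):
    """Create an in-memory SQLite database that stands in for a MySQL database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def archive_old_rows(online, archive, cutoff, batch_size=2):
    """Move rows with created_at < cutoff from online to archive in batches.

    Returns the number of rows moved. Dates are ISO-8601 strings, so plain
    string comparison orders them correctly.
    """
    moved = 0
    while True:
        rows = online.execute(
            "SELECT id, created_at, payload FROM orders "
            "WHERE created_at < ? ORDER BY id LIMIT ?",
            (cutoff, batch_size),
        ).fetchall()
        if not rows:
            return moved
        # Write to the archive first and delete afterwards, so that a failure
        # between the two steps leaves a duplicate rather than a lost row.
        # INSERT OR REPLACE is SQLite syntax; MySQL would use REPLACE INTO.
        archive.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
        archive.commit()
        online.executemany("DELETE FROM orders WHERE id = ?", [(r[0],) for r in rows])
        online.commit()
        moved += len(rows)


if __name__ == "__main__":
    online = make_db([
        (1, "2022-01-05", "a"), (2, "2022-06-30", "b"),
        (3, "2023-03-01", "c"), (4, "2024-02-10", "d"),
    ])
    archive = make_db()
    print("rows moved:", archive_old_rows(online, archive, cutoff="2023-01-01"))
```

Because the archive database speaks the same protocol as the online database, the same query code can read from either side. In a real deployment the two connections would differ only in their endpoints and credentials.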
How does Archive Database ensure service availability and data reliability when only one primary node is used?
Archive Database is a database product edition that provides services from a single compute node. However, it still ensures high service availability and high data reliability by using technologies such as compute scheduling that completes within seconds and distributed multi-replica storage.