ADB PG & Greenplum: Cost-Optimized Disk Water Level Management

1. Background description

Today, most enterprises store their core data in databases as two-dimensional tables. Driven by the need for independent, controllable core technology, customers in the government and enterprise sectors are increasingly adopting domestic or open-source databases, especially for data warehouse OLAP workloads. Greenplum is seeing wider and wider use, and market opportunities for Alibaba Cloud's ADB PG are growing accordingly. In addition, as the value of the data middle platform has become widely recognized in recent years, enterprises are eager to build one, and Greenplum or ADB PG fits many of its scenarios. This year, Alibaba Cloud therefore used ADB PG to help many customers upgrade their core data warehouses. We found that customers pay close attention to the cost of running a cloud-native data warehouse, so helping them save costs is well worth exploring and putting into practice.

ADB PG is short for the AnalyticDB for PostgreSQL edition of Alibaba Cloud's cloud-native data warehouse. It is a massively parallel processing (MPP) database built on the open-source Greenplum kernel. Figure 1 shows the architecture of ADB PG (Greenplum's is the same). The Master accepts connection requests, performs SQL parsing, optimization, and transaction processing, dispatches tasks to the Segments for execution, gathers the results each Segment returns, and presents the final result to the client. Each Segment is an independent PostgreSQL database responsible for storing and computing over its share of the business data. Segments reside on separate physical machines, each holding a different portion of the data; together they form a compute cluster that supports horizontal expansion. The architecture makes it clear that the key to saving cost on Greenplum is minimizing the number of segment servers, while still guaranteeing the overall MPP computing power and meeting data storage requirements. The data in a warehouse is usually synchronized from upstream production systems across the enterprise, has an analytical life cycle, and often must preserve historical changes. Warehouse data therefore comes from many sources, includes history, and is large in volume. Large data volumes inevitably consume storage, and under an MPP architecture that translates directly into server cost. Saving storage space is thus the first priority in helping customers optimize cost.

Next, we will walk through a real disk space optimization case to illustrate how to help customers optimize costs.

2. Introduction to disk management in ADB PG & Greenplum

1. Key technical points of ADB PG disk management

ADB PG is an MPP database built on the Greenplum ("GP" for short) kernel. For disk space management, it shares several technical characteristics with Greenplum:

(1) Business data is mainly distributed in Segment nodes;

(2) Each Segment has a Primary and a Mirror node, so the space available for business data is half of the total server space;

(3) Greenplum's MVCC mechanism means that DML on a table leaves behind dead tuples (garbage data);

(4) A replicated table (fully distributed table) stores a full copy of its data on every segment, while a distributed table scatters its data across segments according to the distribution key;

(5) Greenplum offers Append-Only tables, which support compressed storage and save space; of course, decompression takes time when users access the data, so a balance must be struck between performance and space.

What characterizes a cloud-native database is that it no longer ships the storage and compute kernel alone; it also provides an operations and maintenance management platform, referred to here as "database management and control". Having covered the principles of ADB PG disk management, we next need to understand how management and control handles disk water levels.

2. Disk reservation mechanism for database control

Let's first look at the disk usage of each Segment node in a data warehouse test environment.

The first percentage in the figure above is the disk usage of the physical machine hosting the segment; the second is the disk usage as seen by database management and control. Why does the controlled figure differ from the server's actual usage? This is the first, very important preventive measure in water level management: space reservation. When ADB management and control creates a segment instance, it reserves roughly 12% of the server's space. On a 20 TB server, for example, the control plane considers at most 17.6 TB usable by the business. This logic is propagated to the monitoring system, so when the disk usage ratio is calculated, the denominator is not 20 TB but 17.6 TB. This is the first layer of protection.

Another important reason for reserving space is that the database itself writes WAL transaction logs, error logs, and so on, which also consume disk. Part of the disk must therefore be set aside for logs, and the customer's business data can never use 100% of the server's space; this is why two percentages are displayed in Figure 2.

3. The "lock write" protection mechanism of database control

The second layer of protection is "lock write on full disk". Even within the 17.6 TB, management and control does not allow the business to fill the disk completely, because a full disk can corrupt data files and cause database downtime and unrecoverable disasters. There is therefore a second threshold: when disk usage reaches 90%, the control plane's automatic inspection task triggers the "lock write" operation, and from that point all DML requests to ADB fail. This is an important protection mechanism. As shown in Figure 3 below, once the threshold is reached the status shows "need to lock". The threshold is configurable; if disk space is tight it can be adjusted as the actual situation requires.

Together, these two management-and-control mechanisms effectively keep the disk running at a safe water level, and they are the foundation of our cost optimization. Optimizing disk cost means using the servers' disks as fully as possible; to save disk space, the system must run at a relatively high water level (assuming the data volume really is large). Effective disk management and timely monitoring and problem discovery are therefore essential.

3. Disk space optimization scheme

Let's use a customer case to illustrate disk space optimization. The data (including indexes) in the customer's existing warehouse exceeded 1.5 PB, but for the first phase the customer purchased 40 machines for the ADB warehouse, about 800 TB of total capacity. The customer explicitly asked Alibaba Cloud to work with the business side on the warehouse design to help them save costs, and had already set a cost-optimization KPI that Alibaba Cloud was expected to meet through technology. We collaborated with the business side on several plans at the design stage. Technically, we focused on table compression and separation of hot and cold data; on the business side, we asked developers to reduce the stock data kept in ADB as much as possible at design time. In the end, the developers estimated that about 360 TB of hot data would be migrated from the old warehouse to ADB. Before go-live, they needed to migrate the essential base business data (such as the source layer and middle layer) from DB2 to ADB PG. Once migration finished and the business entered its trial run, we found the space nearly full (as shown in Figure 2). Space optimization was imminent, so we launched disk space optimization governance. Figure 4 shows the framework for disk space management and optimization.

Next, let's expand and explain.

1. Table storage format and compression

Compressed table storage is an effective way to save customers' storage space. Greenplum supports several storage formats: heap (row) storage, Append-Only row storage, and Append-Only column storage. To save space, an Append-Only column-store table is the better choice: it compresses well, and the compression algorithm and level can be specified at table creation. A suitable algorithm and level can reduce storage severalfold. An example CREATE TABLE statement follows:
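The original example statement was not preserved in this page; the following is a minimal sketch of an Append-Only column-store table with zstd compression. The table and column names are illustrative, not from the original case.

```sql
-- Hypothetical fact table; WITH options select Append-Only column storage
-- with zstd compression at level 5 (a common balanced choice)
CREATE TABLE ods_orders (
    order_id    bigint,
    customer_id bigint,
    amount      numeric(18,2),
    order_date  date
)
WITH (appendonly=true, orientation=column,
      compresstype=zstd, compresslevel=5)
DISTRIBUTED BY (order_id);
```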

A column-store table must be of Append-Only type. When creating one, the user selects the compression type with the COMPRESSTYPE option; if none is specified, the data is not compressed. Three compression types are currently supported: zstd, zlib, and lz4.

The zstd algorithm is well balanced across the three dimensions of compression speed, decompression speed, and compression ratio, so in practice it is recommended as the first choice. The zlib algorithm exists mainly for compatibility with existing data; new tables generally do not use it. The lz4 algorithm's compression speed and ratio are inferior to zstd's, but its decompression speed is significantly better, so lz4 is recommended for scenarios with strict query performance requirements.

Users set the compression level with the COMPRESSLEVEL option; the larger the value, the higher the compression ratio, with a valid range of 1-19. A higher number is not automatically better: as mentioned above, decompression costs time, and the higher the compression ratio, the slower the decompression. The level should therefore be chosen based on actual business testing; in production practice, levels 5-9 are typical.

2. Hierarchical storage of hot and cold data

In the data warehouse design of large enterprises, the MPP database (ADB is an MPP database) is only one of the data stores, used mostly for batch processing, online queries, and ad-hoc queries. There is also plenty of cold and archived data, for which Hadoop, MaxCompute, or even OSS is usually planned; and the demand for streaming data computation and storage that has emerged in recent years can be met with Kafka, Blink, or Storm. So when MPP database space runs short, we can also apply a tiered storage scheme for hot and cold data. ADB PG offers roughly two such schemes: (1) the business side manages cold and hot data itself; (2) use ADB PG's built-in hot/cold tiered storage and conversion feature.

The business side accesses HDFS cold data through PXF external tables

The business side stores some cold data as files in HDFS or Hive and creates PXF external tables in ADB to access it; external tables take up no ADB PG disk space. PXF is a parallel data-exchange framework between Greenplum and Hadoop clusters; with PXF, Greenplum can load and unload Hadoop platform data in parallel. The usage is as follows:

(1) Enable the PXF service in the console

Log in to the ADB console, open the external table page of the ADB PG instance, and click to activate a new service.

After filling in the detailed Hadoop service information (this involves Kerberos authentication, which is not the focus of this article), the PXF service starts; the figure above shows a successful start.

(2) Create a PXF extension

-- admin execution
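The statement itself was not preserved here; presumably the administrator runs the standard extension-creation command:

```sql
-- Run as a superuser/admin in the target database to enable PXF
CREATE EXTENSION IF NOT EXISTS pxf;
```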

(3) Create a PXF external table
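The original DDL did not survive extraction; the following sketch reconstructs it from the description below (file path and SERVER=23) and the column layout of the standard Greenplum PXF text example. The column list is an assumption.

```sql
-- Readable external table over a cold-data text file in HDFS.
-- PROFILE=hdfs:text reads delimited text; SERVER=23 is taken from the
-- description below and points at the stored Hadoop cluster address info.
CREATE EXTERNAL TABLE pxf_hdfs_simple (
    location    text,
    month       text,
    num_orders  int,
    total_sales numeric
)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text&SERVER=23')
FORMAT 'TEXT' (delimiter ',');
```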

Description: LOCATION gives the HDFS source file information; /data/pxf_examples/pxf_hdfs_simple.txt is the external cold-data file that the business accesses. SERVER=23 identifies the Hadoop cluster's address information, 23 being the directory where the cluster address information is stored; as shown in Figure 8, it can be looked up from the PXF service.

(4) Access to external tables

Accessing an external table is no different from accessing an ordinary table.

ADB PG cold and hot data hierarchical storage scheme

The PXF external-table approach above has a drawback: if the cold data (in the external table) needs to be joined with hot data, efficiency is poor, because the data must first be loaded from HDFS into ADB and then joined with the ADB table, adding a great deal of IO. ADB PG therefore provides a hot/cold data conversion feature on top of Greenplum's PXF external tables: when the business needs to join an external table with ordinary tables for analytical queries, it can first convert the external table into an ordinary ADB table. The overall solution is called hot/cold tiered storage. Since it uses the same PXF service, steps (1) and (2) of the PXF external-table setup above can be reused. The additional configuration is as follows:

(1) Configure tiered storage to use the Foreign Server created above by default

Execute with super administrator

ALTER DATABASE postgres SET RDS_DEF_OPT_COLD_STORAGE TO 'server "23",resource "/cold_data", format "text",delimiter ","';
Note: replace postgres with the actual database name, and replace /cold_data with the HDFS path that will actually store the cold data.

(2) Execute the check after restarting the database instance

Verify that the configuration is successful.
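The original check command is not shown here; presumably the database-level parameter set above can be inspected after the restart, along the lines of:

```sql
-- Assumed check: display the cold-storage default option configured earlier
SHOW RDS_DEF_OPT_COLD_STORAGE;
```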

(3) Create a test table and insert a small amount of test data
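The original statements were stripped from this page; a minimal sketch matching the table name t1 used in the text (the column layout is illustrative):

```sql
-- Small test table; starts out in ADB local (hot) storage
CREATE TABLE t1 (id int, note text) DISTRIBUTED BY (id);

-- Insert a little test data
INSERT INTO t1 SELECT g, 'row ' || g FROM generate_series(1, 100) g;
```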

At this point, the data in table t1 resides in ADB's local storage and counts as hot data.

(4) Migrate table data to cold storage HDFS
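The migration command itself did not survive extraction; assuming it mirrors the hot-migration command shown later in this section, it would be:

```sql
-- Assumed syntax, the counterpart of "storagepolicy=hot" shown below:
-- move table t1 from local (hot) storage to HDFS cold storage
ALTER TABLE t1 SET (storagepolicy=cold);
```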

Note that this NOTICE is normal in the current version: cold storage has no distribution information as such; rather, the distribution is determined by the external storage (HDFS).

(5) Verify the use of cold data tables

First, verify that the table has been migrated to cold storage by looking at the table definition

Then query the table data normally;

(6) Move data back to hot storage

alter table t1 set (storagepolicy=hot);

Note: after a table is migrated back to hot storage, its distribution information is lost, which is a limitation of the current version. If the table had indexes, they are lost during migration and must be rebuilt. Both schemes above can, to a certain extent, move cold data out of ADB PG to external storage, saving ADB PG space.

In scheme 1, joins are inefficient and hot/cold conversion is not supported, but the data no longer occupies ADB space;

In scheme 2, joins are efficient and hot/cold conversion is supported, but the data occupies ADB space part of the time.

Each scheme has its pros and cons; in a project, the choice depends on the business application. In this customer case, the hot/cold tiered storage solution saved hundreds of terabytes of space in ADB overall. Most of that was achieved at the design stage, and a small part was further optimized during the trial run.

3. Garbage data vacuum

Because of the GP kernel's MVCC mechanism, after DML on a table commits (at time t2), the old tuples are not deleted immediately; they remain stored alongside the table's live tuples, marked as dead tuples. This bloats the table and wastes space. There are two ways to reclaim garbage data: automatic cleanup by the kernel, and manual cleanup via SQL. The automatic mechanism works like this: once a table's dead tuples accumulate past a certain percentage, and every transaction that might still read them (any transaction started at time t1 < t2) has ended, the kernel auto-vacuums the garbage. The mechanism itself is sound, but it runs into trouble with large databases and large tables: on a large table, even a change touching only 1% of the data can mean on the order of 10 GB of dead tuples, and when several large tables change together, the overall space impact is significant. It must therefore be supplemented with manual reclamation.

Manual recovery method

(1) Count the top tables of the system;
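The statistics SQL itself did not survive extraction; one way to list the largest relations (Greenplum's size functions aggregate across segments) is sketched below.

```sql
-- Top 20 relations by total size (table + indexes + toast)
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 20;
```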

(2) Query the proportion and space of dead tuples in large tables;

-- Query tables with an expansion rate greater than 20% based on statistical information
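The query referenced by the comment above was stripped; a sketch based on the standard statistics view is below. Note that on Greenplum the master-side statistics can understate segment-level bloat; gp_toolkit.gp_bloat_diag is the cluster-aware alternative.

```sql
-- Tables whose dead-tuple ratio exceeds 20%, per the comment above
SELECT schemaname, relname, n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup
             / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_all_tables
WHERE n_dead_tup > 0
  AND n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) > 0.20
ORDER BY n_dead_tup DESC;
```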

(3) Use pg_cron timing tasks to help businesses recycle garbage data
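As a sketch of how such a timed task can look with pg_cron (the table name is illustrative):

```sql
-- Schedule a nightly VACUUM at 02:00 via pg_cron; returns a jobid
SELECT cron.schedule('0 2 * * *', 'VACUUM ods_orders');

-- List jobs, and unschedule by jobid when no longer needed
SELECT jobid, schedule, command FROM cron.job;
-- SELECT cron.unschedule(<jobid>);
```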

Here you must agree the execution window with the business. Although a plain vacuum does not block reads and writes, it still consumes extra IO. VACUUM FULL tablename should be used with caution, and the difference between the two is worth spelling out. A plain VACUUM (without FULL) merely reclaims space inside the table and makes it reusable; it can run concurrently with ordinary reads and writes because it requests no exclusive lock, but the freed space is not returned to the operating system; it simply remains available within the same table. VACUUM FULL rewrites the table's entire contents into a new file with no garbage data (consuming new disk space, then deleting the old table's file to free it), which effectively returns the unused space to the operating system. This form is much slower and holds an exclusive lock on the table while it runs, so it disrupts business use of the table.

(4) Add vacuum at appropriate points in the business code

If certain tables are updated frequently and bloat daily, a vacuum can be added to the business code itself, reclaiming garbage data immediately after each burst of DML.

System tables also need recycling

This point is extremely easy to overlook. Especially in warehouses that frequently create and alter tables (temporary tables included), the system tables that store metadata also bloat, and the bloat rate is positively correlated with DDL frequency. One customer saw pg_attribute bloat to hundreds of gigabytes and pg_class bloat to 20 times its normal size. The table below lists the pg system tables that, in our practical experience, bloat most readily.

Limitations of Manual Vacuum

Manual vacuum has certain limitations that also deserve attention.

(1) Do not perform vacuum during periods of high IO usage;

(2) vacuum full requires additional disk space to complete.

If the disk water level is high and little space remains, there may not be enough room to VACUUM FULL a large table; in that case, delete some history tables first to free disk space, then VACUUM FULL the target table.

(3) The large transaction on the target table must be ended first

During routine maintenance of a large table, one vacuum pass failed to reclaim the bloated space. A careful look at pg_stat_activity revealed a large transaction on the table, started before the manual vacuum, that had not yet finished. The kernel therefore assumed the old data might still be needed and could not reclaim it, neither automatically nor manually.

4. Redundant index cleaning

Indexes themselves take up space, especially on large tables. Indexing is a common, fundamental way for databases to speed up queries, but using indexes well does not mean creating as many as possible, least of all on the large tables of a large database. When space is tight, it is worth checking for redundant indexes that can be cleaned up.

Troubleshooting ideas

(1) Are there composite indexes containing an abnormally large number of fields?

(2) Are there multiple composite indexes sharing the same prefix fields?

(3) Are there indexes the optimizer never uses?

Troubleshooting methods and examples

Starting from the first idea, we query for indexes containing four or more columns.
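The query itself was stripped from this page; a sketch against the catalog (pg_index.indnatts holds the number of indexed columns) is:

```sql
-- Indexes whose key contains 4 or more columns
SELECT n.nspname  AS schema_name,
       t.relname  AS table_name,
       i.relname  AS index_name,
       x.indnatts AS num_columns
FROM pg_index x
JOIN pg_class i ON i.oid = x.indexrelid
JOIN pg_class t ON t.oid = x.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE x.indnatts >= 4
ORDER BY x.indnatts DESC;
```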

One customer had built many composite indexes with more than 10 fields, as shown in the figure below:

Composite indexes with more than 6 fields are rarely seen in production, so our initial judgment was that the business side had created redundant indexes when the tables were built. Next, the redundant-index list can be sorted by index size and output.

Here we clearly saw that some indexes exceeded 500 GB, and more than 10 indexes exceeded 1 TB each. Seeing this we were both shocked and pleased: pleased because we should be able to reclaim a great deal of space. The next step was to communicate with the business side, and to delete the indexes only after the business side confirmed they were not needed.

In this customer case we deleted more than 200 redundant indexes totaling 24 TB, directly releasing 7% of the business space, a very impressive optimization. The timing was also fortunate: the optimization was completed at the end of November, the business peak arrived in early December, and the business side then wrote another 20 TB of new data. Without this index optimization it is no exaggeration to say the customer's ADB cluster could not have withstood early December.

For idea (2) (multiple composite indexes sharing the same prefix fields), the check SQL is as follows. It is best to export the index and column metadata to another GP database for analysis, because comparing index definitions (which involves converting vectors to character arrays and computing subsets and supersets) is fairly expensive.

The following is an example of troubleshooting the second problem on the table user_t:

Explanation of the example result: multi_index1 is a subset of multi_index2; every column of the former is already indexed by the latter. multi_index1 is therefore a redundant index.

Idea (3): are there indexes the optimizer never uses?
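One way to surface such candidates from the statistics views is sketched below; note that on an MPP cluster the master-side counters may not fully reflect segment activity, so treat the result as a starting point for discussion with the business side, not a drop list.

```sql
-- Indexes never chosen by the optimizer since statistics were last reset
SELECT s.schemaname,
       s.relname       AS table_name,
       s.indexrelname  AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_all_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique   -- keep unique indexes: they enforce constraints
ORDER BY pg_relation_size(s.indexrelid) DESC;
```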

There is also a very important point of knowledge here: index scans on Append-Only column-store tables support only the bitmap scan mode. If bitmap index scans are disabled in Greenplum, every access to an AO column-store table becomes a full table scan. In theory, then, all non-unique indexes on AO column-store tables are unusable and could all be dropped. Of course, this operation is very risky; in this database it was feasible only because the workloads using AO column-store tables were almost entirely batch processing, with no point lookups or range searches. In summary, deleting redundant indexes can help customers save disk space.

5. Converting replicated tables to distributed tables

As is well known, ADB PG's table distribution strategies are DISTRIBUTED BY (hash distribution), DISTRIBUTED RANDOMLY (random distribution), and DISTRIBUTED REPLICATED (full distribution, i.e. a replicated table). The first two scatter data across the segments, by hashing the specified distribution key or at random; a replicated table stores a complete copy of the data on every segment. The replicated strategy (DISTRIBUTED REPLICATED) should be used only on small tables: replicating a large table's data to every node is expensive in both storage and maintenance. The SQL for finding replicated tables is as follows:
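The query was stripped from this page; a sketch against the Greenplum 6 / ADB PG catalog, where policytype = 'r' marks a replicated table, is:

```sql
-- Replicated (fully distributed) tables and their sizes
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM gp_distribution_policy p
JOIN pg_class c ON c.oid = p.localoid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE p.policytype = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC;
```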

The query results are shown in the figure below; about 10 TB of replicated tables were found. The largest three could be converted to hash-distributed tables, saving about 7 TB of space.

6. Storing temporary table space independently

We know that Greenplum has two default tablespaces: pg_default and pg_global.

If no tablespace is specified at table creation, the table is placed in the pg_default tablespace by default: heap tables, AO tables, column-store tables, temporary tables, and so on. At the segment file level, this corresponds to the ~/data/Segment/${Segment_id}/base/${database_oid} directory on each segment server. Meanwhile, Greenplum generates temporary tables in a variety of scenarios, for example:

(1) Operations such as ORDER BY and GROUP BY in SQL;

(2) Temporary tables the GP engine creates for data reading or shuffling;

(3) Temporary tables created by the business side in ETL tasks.

The problem with this arrangement is that temporary tables generated by business operations also consume space, yet this consumption does not belong to business table data, which makes accurate disk space management in a large database inconvenient. We therefore split the temporary tables into their own tablespace, kept independent at the server file level as well, so that business data can be managed separately and with finer granularity. There are further benefits: temporary-table space, data-table space, WAL logs, and error logs can each be monitored on its own, so we know what each part occupies, and when a disk space alarm fires we can respond with targeted measures. Creating a temporary tablespace in Greenplum is quite standard, as follows:
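The original statements were not preserved; a minimal sketch follows. The tablespace name and directory path are assumptions, and the directory must exist on the master and every segment host before the tablespace is created.

```sql
-- Dedicated tablespace for temporary objects (assumed name and path)
CREATE TABLESPACE tmp_tbs LOCATION '/data/tmp_tbs';

-- Route temporary tables and sort spill files of one database to it
-- (replace mydb with the actual database name)
ALTER DATABASE mydb SET temp_tablespaces = 'tmp_tbs';
```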

7. Other optimization schemes

Besides the schemes detailed above, Greenplum has some general-purpose measures: expanding the segment compute nodes, trimming business data, and cleaning up backup files. Expanding compute nodes is the most effective. As a rule, whether for Alibaba's own businesses or external customers, once database disk usage reaches 60%, expansion is planned with the business increment taken into account. These "basic practices" need to be communicated to customers.

As for trimming business data: besides cold data there are also intermediate and history tables, and we can push the business side to manage the data life cycle properly, deleting or archiving in time. In addition, backup files left over from ad-hoc operations and maintenance work must be cleaned up promptly once the work is done; this simple habit is very easy to neglect and deserves attention. In the disk management of a large database, any small problem gets magnified.

4. Optimization benefits

1. Save server costs for customers

In this case, the customer's original DB2 data volume exceeded 1 PB; after applying all of the optimizations above, only a little more than 300 TB of data needed to be kept in ADB, with the entire business running intact. This saved the customer about 100 servers plus the associated software license fees, amounting to roughly tens of millions.

2. Avoid secondary disasters caused by high disk water level

A high disk water level causes many problems, all of which the disk space optimization scheme helps avoid. These include:

1. A slight increase in business volume may fill the disk, triggering the "lock write" protection and bringing the database to a temporary standstill;

2. When disk space is insufficient, operations and maintenance staff cannot even create a temporary table to diagnose a problem;

3. Maintenance of ADB's large tables, such as VACUUM FULL, has no free disk space to work with.

The disk space optimization methods above may not be exhaustive, but we hope they are helpful to readers. If anything is missing, or readers have additions, you are welcome to discuss the cost optimization of migrating to the cloud with us.


Business side: the users who do business development or data analysis on Greenplum, usually the customer or the customer's developers.

OLAP: online analytical processing, the most important application of a data warehouse system; it is designed to support analysis and query processing over complex, large volumes of data and to quickly return intuitive, easy-to-understand results.

DML: SQL that inserts, deletes, updates, or merges table data, referred to as DML SQL in the database field.
