诊断数据倾斜的原因与消除方法 - AnalyticDB - Alibaba Cloud Documentation Center

AnalyticDB for PostgreSQL provides the intelligent data skew diagnostics feature. This feature automatically performs diagnostics on all database tables by hour and generates diagnostic information tables. This way, you can check all database tables for data skew.

Usage notes

The intelligent data skew diagnostics feature is supported only for AnalyticDB for PostgreSQL instances in elastic storage mode that meet the following minor version requirements:
- AnalyticDB for PostgreSQL V6.0: V6.3.10.0 or later.
- AnalyticDB for PostgreSQL V7.0: V7.0.4.0 or later.
For information about how to view and update the minor version, see View the minor engine version and Update the minor engine version.
The intelligent data skew diagnostics feature is implemented for every database except the five system databases: postgres, template0, template1, adbpgadmin, and aurora. To ensure data skew diagnostics, we recommend that you do not store your data in the five system databases.
The intelligent data skew diagnostics feature scans all database tables except for temporary tables and unlogged tables. However, to balance speed and efficacy, tables that are smaller than 1 GB in size are not scanned by default. For information about how to modify the data volume threshold, see the "Configure parameters for intelligent data skew diagnostics" section of this topic.
By default, the intelligent data skew diagnostics feature does not display the tables whose skew ratio is lower than 20%. For information about how to modify the skew ratio threshold, see the "Configure parameters for intelligent data skew diagnostics" section of this topic.
The intelligent data skew diagnostics feature queries only metadata tables. This does not affect query performance.

Causes of data skew

AnalyticDB for PostgreSQL uses the massively parallel processing (MPP) architecture to distribute data across different compute nodes. AnalyticDB for PostgreSQL supports the following data distribution methods:

Random distribution
Randomly distributed tables cannot utilize data characteristics. For example, collocated joins cannot be used when multiple tables are queried. In this case, the query performance of random distribution is lower than the query performance of hash distribution.
Replicated distribution
In replicated distribution mode, a copy of full data is stored on each compute node. In most cases, this mode is applicable to tables that contain small amounts of data. If this mode is used for large-size tables, a sharp data bloat occurs.
Hash distribution
In hash distribution mode, data is distributed across all compute nodes based on the hash value of the distribution key. To prevent data skew, you must select an appropriate distribution key. When you select a distribution key, comply with the following rules:
- If your table contains a primary key, we recommend that you select the primary key as the distribution key.
- If your table does not contain a primary key, we recommend that you select a column that has evenly distributed data or a large number of distinct values as the distribution key.

Check data skew

The diagnostic information of intelligent data skew diagnostics is stored in the adbpg_toolkit.diag_skew_tables table. You can use one of the following methods to check data skew:

View the diagnostic information directly in the console. For more information, see Data bloat, data skew, and index statistics.

Connect to a database and execute the following statement to query the diagnostic information:

SELECT * FROM adbpg_toolkit.diag_skew_tables;

The following table describes the fields in the diag_skew_tables table.

Field	Data type	Description
schema_name	name (63-byte type for storing system identifiers)	The name of the schema to which the table belongs.
table_name	name	The name of the table.
table_size	bigint	The size of the table. Unit: bytes.
table_skew	real	The skew ratio of the table. Valid values: 0 to 100. Unit: %.
table_owner	name	The name of the table owner.
table_distributed_policy	text	The distribution key of the table.
diagnose_time	timestamp with time zone	The time when the diagnostic information is generated.

You can specify filter conditions to check a schema or table for data skew.

Check all tables within a schema for data skew.

SELECT * FROM adbpg_toolkit.diag_skew_tables WHERE schema_name = '<Schema name>';

Check a table for data skew.

SELECT * FROM adbpg_toolkit.diag_skew_tables WHERE table_name = '<Table name>';

Manually initiate data skew diagnostics

By default, intelligent data skew diagnostics is initiated at the beginning of each hour. If you want to immediately check the effect of changing the distribution key of a table, you can manually initiate data skew diagnostics. Execute the following statement to manually initiate data skew diagnostics:

SELECT adbpg_toolkit.diagnose_skew_tables();

Configure parameters for intelligent data skew diagnostics

Specify the data volume threshold

By default, the intelligent data skew diagnostics feature filters out the tables that are smaller than 1 GB in size. If you want to modify the data volume threshold, you can execute the following statement:

ALTER DATABASE <Database name> SET adb_diagnose_table_threshold_size to <Table size in bytes>;

For example, to diagnose tables that are larger than 500 MB in size, execute the following statement:

ALTER DATABASE diagnose SET adb_diagnose_table_threshold_size to 536870912;

Specify the skew ratio threshold

By default, the intelligent data skew diagnostics feature does not display the tables whose skew ratio is lower than 20%. If you want to modify the skew ratio threshold, you can execute the following statement:

ALTER DATABASE <Database name> SET adb_diagnose_skew_percent to <Skew ratio from 0% to 100%>;

For example, to display all tables whose skew ratio is greater than 0% in the diagnose database, execute the following statement:

ALTER DATABASE diagnose SET adb_diagnose_skew_percent to 0;

Rules for data skew calculation

AnalyticDB for PostgreSQL defines the skew degree of data that is distributed across all compute nodes based on the skew ratios of tables. The value of the skew ratio ranges from 0% to 100%. A greater value indicates that a table is more severely skewed. The skew ratio can be calculated by using the following formula:

Avg=(S1+S2+....Sn)/n
Max=Max(S1,S2,....Sn)
Skew ratio = (1 - Avg/Max)/(1 - 1/n) × 100%

Assume that the data of a table is distributed across n compute nodes. Sn indicates the size of data that is stored on each node.

Eliminate data skew

In most cases, data skew is caused by inappropriate distribution keys. For example, a column that has a large number of identical values in a table is selected as the distribution key for hash distribution. In this case, the compute nodes where these identical values are stored have more data than the other compute nodes. To resolve this issue, we recommend that you change the distribution key. For more information, see Selection of distribution keys.

Execute the following statement to change the distribution key of a table:

ALTER TABLE <table_name> SET DISTRIBUTED BY (<column_name>);

Example: Change the distribution key of the t1 table to the c2 column.

ALTER TABLE t1 SET DISTRIBUTED BY (c2);