This topic describes how to optimize the query performance of an AnalyticDB for PostgreSQL instance in various scenarios.

Collect table statistics

The query optimizer of AnalyticDB for PostgreSQL optimizes the plan of a query and estimates its costs based on the statistics of the queried tables. If no statistics are collected from the queried tables or the collected table statistics are outdated, the query optimizer optimizes the query plan based on the default or outdated values. As a result, the query optimizer cannot generate an optimal query plan. We recommend that you collect statistics on a table after a large volume of data is loaded or more than 20% of the table data is updated.

The ANALYZE statement enables you to collect statistics on all tables, all columns of a table, or specific columns of a table. In most cases, we recommend that you collect statistics on all tables or all columns of a table. However, if you want to control the collection of table statistics at a finer level, you can choose to only collect statistics of the columns on which join keys, filter criteria, or indexes are created.

Examples:

  • After a large volume of data is imported, execute the following statement to collects statistics on all tables:
    ANALYZE;
  • After a large volume of data is inserted, updated, or deleted in the t table, execute the following statement to collect statistics on all columns of that table:
    ANALYZE t;
  • Execute the following statement to collect statistics on the a column of the t table:
    ANALYZE t(a);

Choose a query optimizer

AnalyticDB for PostgreSQL provides two query optimizers: Legacy and ORCA. Each query optimizer has its strengths and weaknesses in various scenarios.

  • Legacy query optimizer

    This is the default query optimizer. The Legacy query optimizer takes a short time to optimize an SQL statement. It is ideal for highly concurrent simple queries that require joins of no more than three tables and for highly concurrent data writes or updates that are executed by using INSERT, UPDATE, or DELETE statements.

  • ORCA query optimizer

    The ORCA query optimizer is designed to optimize complex queries. It traverses more execution paths, so it takes a longer time than the Legacy query optimizer to generate an optimal plan for each query. We recommend that you choose the ORCA query optimizer for complex queries that require joins of more than three tables to complete extract, transform, load (ETL) and report workloads. In addition, the ORCA query optimizer removes the need to join tables in subqueries and dynamically filters partitions. In this sense, we recommend that you choose the ORCA query optimizer to optimize SQL statements that have subqueries and those used to query data from partitioned tables on which parameter-specified filter criteria are created.

The following example shows how to configure both the Legacy and ORCA optimizers for a session:

-- Enable the Legacy query optimizer. The Legacy query optimizer is enabled by default.
set optimizer = off;
-- Enable the ORCA query optimizer.
set optimizer = on;
Note To configure the Legacy and ORCA optimizers for an instance, you must submit a ticket.

Use indexes to expedite queries

If a query contains a filter criterion used to identify identical values or values within a specific range and only a small volume of data is obtained, you can create an index on the column used as the filter criterion to expedite data scanning. AnalyticDB for PostgreSQL supports the following three types of indexes:

  • B-tree indexes: If a column has a large number of unique values and is used to filter, join, or sort data, create a B-tree index.
  • Bitmap indexes: If a column has a small number of unique values and more than one filter criterion is created on it, create a bitmap index.
  • GiST indexes: If you want to query geographic locations, ranges, image features, or Geometry values, create a GiST index.

Examples:

If you execute the following statement to query data from a table without an index, the system scans all data of the table and then filters the data based on the filter criteria specified in the query:

postgres=# EXPLAIN SELECT * FROM t WHERE b = 1;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..431.00 rows=1 width=16)
   ->  Table Scan on t  (cost=0.00..431.00 rows=1 width=16)
         Filter: b = 1
 Settings:  optimizer=on
 Optimizer status: PQO version 1.609
(5 rows)

Execute the following statement to create a B-tree index on the b column of the t table:

postgres=# CREATE INDEX i_t_b ON t USING btree (b);
CREATE INDEX

If you execute the following statement to query data from a table with an index, the system obtains data based on the index:

postgres=# EXPLAIN SELECT * FROM t WHERE b = 1;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..2.00 rows=1 width=16)
   ->  Index Scan using i_t_b on t  (cost=0.00..2.00 rows=1 width=16)
         Index Cond: b = 1
 Settings:  optimizer=on
 Optimizer status: PQO version 1.609
(5 rows)

View query plans

A query plan is a set of steps that AnalyticDB for PostgreSQL executes to complete the query. It is equivalent to an algorithm. You can analyze query execution processes based on query plans to find out why SQL statements are executed slowly. If you add the keyword EXPLAIN to a query, the system only displays the query plan but does not execute the specified SQL statement. If you add the keyword EXPLAIN ANALYZE to a query, the system executes the specified SQL statement, collects the query execution information, and then displays the information in the query plan.

  • The following example shows a query plan with the keyword EXPLAIN added to the query:

    postgres=# EXPLAIN SELECT a, b FROM t;
                                      QUERY PLAN
    ------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..4.00 rows=100 width=8)
       ->  Seq Scan on t  (cost=0.00..4.00 rows=34 width=8)
     Optimizer status: legacy query optimizer
    (3 rows)
  • The following example shows a query plan with the keyword EXPLAIN ANALYZE added to the query:

    postgres=# EXPLAIN ANALYZE SELECT a, b FROM t;
                                                                    QUERY PLAN
    ------------------------------------------------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..4.00 rows=100 width=8)
       Rows out:  100 rows at destination with 2.728 ms to first row, 2.838 ms to end, start offset by 0.418 ms.
       ->  Seq Scan on t  (cost=0.00..4.00 rows=34 width=8)
             Rows out:  Avg 33.3 rows x 3 workers.  Max 37 rows (seg2) with 0.088 ms to first row, 0.107 ms to end, start offset by 2.887 ms.
     Slice statistics:
       (slice0)    Executor memory: 131K bytes.
       (slice1)    Executor memory: 163K bytes avg x 3 workers, 163K bytes max (seg0).
     Statement statistics:
       Memory used: 128000K bytes
     Optimizer status: legacy query optimizer
     Total runtime: 3.739 ms
    (11 rows)

A query plan is composed of operations and organizes their information to process data in a logic flow.

AnalyticDB for PostgreSQL supports the following types of operations:

  • Data scanning operations: Seq Scan, Table Scan, Index Scan, and Bitmap Scan.
  • Join operations: Hash Join, Nested Loop, and Merge Join.
  • Aggregate operations: Hash Aggregate and Group Aggregate.
  • Distribute operations: Redistribute Motion, Broadcast Motion, and Gather Motion.
  • Other operations: Hash, Sort, Limit, and Append.
postgres=# EXPLAIN SELECT * FROM t1, t2 WHERE t1.b = t2.b;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice3; segments: 3)  (cost=0.00..862.00 rows=1 width=32)
   ->  Hash Join  (cost=0.00..862.00 rows=1 width=32)
         Hash Cond: t1.b = t2.b
         ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=0.00..431.00 rows=1 width=16)
               Hash Key: t1.b
               ->  Table Scan on t1  (cost=0.00..431.00 rows=1 width=16)
         ->  Hash  (cost=431.00..431.00 rows=1 width=16)
               ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..431.00 rows=1 width=16)
                     Hash Key: t2.b
                     ->  Table Scan on t2  (cost=0.00..431.00 rows=1 width=16)
 Settings:  optimizer=on
 Optimizer status: PQO version 1.609
(12 rows)
			

The preceding query plan is described as follows:

  1. The Table Scan operation scans the t1 and t2 tables.
  2. The Redistribute Motion operation redistributes the data of the t1 and t2 tables based on the hash values of their b columns to compute nodes.
  3. The Hash operation creates a hash key used for joins on the t2 table.
  4. The Hash Join operation joins the data of the t1 and t2 tables.
  5. The Gather Motion operation transmits the computing result to the coordinator node. The coordinator node then transmits the computing result to the client.

The actual query plan varies based on the SQL statement you specify.

Remove distribute operations to increase query performance

When you call a join or aggregate operation, AnalyticDB for PostgreSQL adds a distribute operation based on the data distribution to redistribute (Redistribute Motion) or broadcast (Broadcast Motion) data. Distribute operations occupy large amounts of network resources. To increase query performance, we recommend that you create tables and adjust the business logic to remove the needs of distribute operations.

How it works:

If the distribution keys of two tables you want to join do not match the business logic, you can change their distribution keys to remove the need of distribute operations.

Example:

SELECT * FROM t1, t2 WHERE t1.a=t2.a;

In this example, the distribution key of the t1 table is the a column.

  • If the distribution key of the t2 table is the b column, AnalyticDB for PostgreSQL redistributes the data of the t2 table:
    postgres=# EXPLAIN SELECT * FROM t1, t2 WHERE t1.a=t2.a;
                                                  QUERY PLAN
    -------------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice2; segments: 3)  (cost=0.00..862.00 rows=1 width=32)
       ->  Hash Join  (cost=0.00..862.00 rows=1 width=32)
             Hash Cond: t1.a = t2.a
             ->  Table Scan on t1  (cost=0.00..431.00 rows=1 width=16)
             ->  Hash  (cost=431.00..431.00 rows=1 width=16)
                   ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=0.00..431.00 rows=1 width=16)
                         Hash Key: t2.a
                         ->  Table Scan on t2  (cost=0.00..431.00 rows=1 width=16)
     Settings:  optimizer=on
     Optimizer status: PQO version 1.609
    (10 rows)
  • If the distribution key of the t2 table is also the a column, AnalyticDB for PostgreSQL joins the t1 and t2 tables without redistributing the data of the t2 table:
    postgres=# EXPLAIN SELECT * FROM t1, t2 WHERE t1.a=t2.a;
                                      QUERY PLAN
    -------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..862.00 rows=1 width=32)
       ->  Hash Join  (cost=0.00..862.00 rows=1 width=32)
             Hash Cond: t1.a = t2.a
             ->  Table Scan on t1  (cost=0.00..431.00 rows=1 width=16)
             ->  Hash  (cost=431.00..431.00 rows=1 width=16)
                   ->  Table Scan on t2  (cost=0.00..431.00 rows=1 width=16)
     Settings:  optimizer=on
     Optimizer status: PQO version 1.609
    (8 rows)

Optimize the data types of the columns to be joined

Columns you want to join must have the same data type to prevent the explicit or implicit conversion of data types, because data type conversion causes data redistribution.

  • Explicit type conversion

    In SQL statements, the data types of the columns you want to join may be forcibly converted. This is called an explicit type conversion. For example, the a column of the t table uses the int data type, but it is converted to the numeric data type by a join.

    After the data type of a column is explicitly converted, the hash functions or values of the data in that column change. Therefore, we recommend that you avoid data type conversion on columns you want to join.

    As shown in the following examples, an explicit type conversion triggers data redistribution:

    -- Execute a join without a data type conversion.
    postgres=# EXPLAIN SELECT * FROM t1, t2 WHERE t1.a=t2.a;
                                      QUERY PLAN
    -------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..862.00 rows=1 width=32)
       ->  Hash Join  (cost=0.00..862.00 rows=1 width=32)
             Hash Cond: t1.a = t2.a
             ->  Table Scan on t1  (cost=0.00..431.00 rows=1 width=16)
             ->  Hash  (cost=431.00..431.00 rows=1 width=16)
                   ->  Table Scan on t2  (cost=0.00..431.00 rows=1 width=16)
     Settings:  optimizer=on
     Optimizer status: PQO version 1.609
    (8 rows)
    
    -- Execute a join with an explicit type conversion.
    postgres=# EXPLAIN SELECT * FROM t1, t2 WHERE t1.a=t2.a::numeric;
                                                  QUERY PLAN
    -------------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice3; segments: 3)  (cost=0.00..862.00 rows=1 width=32)
       ->  Hash Join  (cost=0.00..862.00 rows=1 width=32)
             Hash Cond: t1.a::numeric = t2.a::numeric
             ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=0.00..431.00 rows=1 width=16)
                   Hash Key: t1.a::numeric
                   ->  Table Scan on t1  (cost=0.00..431.00 rows=1 width=16)
             ->  Hash  (cost=431.00..431.00 rows=1 width=16)
                   ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..431.00 rows=1 width=16)
                         Hash Key: t2.a::numeric
                         ->  Table Scan on t2  (cost=0.00..431.00 rows=1 width=16)
     Settings:  optimizer=on
     Optimizer status: PQO version 1.609
    (12 rows)
  • Implicit type conversion

    If you want to join more than one column of two tables but one of the columns uses a distinct data type in each table, the data type in that column needs to be converted. This is called an implicit type conversion.

    After the data type of a column is implicitly converted, the hash functions or values in the original data type may differ from those in the new data type. As a result, AnalyticDB for PostgreSQL redistributes the data of that column. Therefore, we recommend that you choose columns of the same data type as the join keys.

    In the following example, the a column of the t1 table uses the "timestamp without time zone" data type and the a column of the t2 table uses the "timestamp with time zone" data type. This means the two a columns use different hash functions. As a result, AnalyticDB for PostgreSQL redistributes their data before joining them.

    postgres=# CREATE TABLE t1 (a timestamp without time zone);
    CREATE TABLE
    postgres=# CREATE TABLE t2 (a timestamp with time zone);
    CREATE TABLE
    postgres=#
    postgres=# EXPLAIN SELECT * FROM t1, t2 WHERE t1.a=t2.a;
                                               QUERY PLAN
    -------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice2; segments: 3)  (cost=0.04..0.11 rows=4 width=16)
       ->  Nested Loop  (cost=0.04..0.11 rows=2 width=16)
             Join Filter: t1.a = t2.a
             ->  Seq Scan on t1  (cost=0.00..0.00 rows=1 width=8)
             ->  Materialize  (cost=0.04..0.07 rows=1 width=8)
                   ->  Broadcast Motion 3:3  (slice1; segments: 3)  (cost=0.00..0.04 rows=1 width=8)
                         ->  Seq Scan on t2  (cost=0.00..0.00 rows=1 width=8)
    (7 rows)

Locate data skew

If your query is abnormally slow or resource usage is uneven, check whether data skew occurred.

Specifically, check the number of rows distributed to each compute node. If the rows of a table are unevenly distributed among compute nodes, the data of that table is skewed.

postgres=# SELECT gp_segment_id, count(1) FROM t1 GROUP BY 1 ORDER BY 2 DESC;
 gp_segment_id | count
---------------+-------
             0 | 16415
             2 |    37
             1 |    32
(3 rows)

If the data of a table is skewed, we recommend that you use one of the following methods to define a new distribution key for that table:

  • Re-create the table and specify a new distribution key.
  • Execute the ALTER TABLE t1 SET DISTRIBUTED BY (b); statement to change the distribution key.

View running SQL statements

If a large number of SQL statements are executed concurrently, the concurrent queries are slow and your AnalyticDB for PostgreSQL instance may report insufficient resources.

You can obtain the status of your AnalyticDB for PostgreSQL instance from the pg_stat_activity view. This view lists all concurrent SQL statements. You can determine whether a query took an abnormally long time based on the query_start field in this view.

Example:

postgres=# SELECT * FROM pg_stat_activity;
 datid | datname  | procpid | sess_id | usesysid |  usename   |          current_query          | waiting |          query_start          |         backend_start         | client_addr | client_port | application_name |          xact_start           | waiting_reason
-------+----------+---------+---------+----------+------------+---------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------+------------------+-------------------------------+----------------
 10902 | postgres |   53666 |       7 |       10 | yineng.cyn | select * from pg_stat_activity; | f       | 2019-05-13 20:27:12.058656+08 | 2019-05-13 20:16:14.179612+08 |             |          -1 | psql             | 2019-05-13 20:27:12.058656+08 |
 10902 | postgres |   54158 |       9 |       10 | yineng.cyn | select * from t t1, t t2;       | f       | 2019-05-13 20:26:28.138641+08 | 2019-05-13 20:17:40.368623+08 |             |          -1 | psql             | 2019-05-13 20:26:28.138641+08 |
(2 rows)

The crucial fields in the preceding example are as follows:

  • procpid: the ID of the master process that executed the query.
  • usename: the username of the user who executed the query.
  • current_query: the query text.
  • waiting: indicates whether the query was in the waiting state.
  • query_start: the start time of the query.
  • backend_start: the time when the process used to execute the query started.
  • xact_start: the time when the transaction to which the query belongs started.
  • waiting_reason: the reason why the query remained in the waiting state.

In addition, you can add the current_query ! = '<IDLE>' criterion to the current_query field to view the SQL statements that are being executed.

SELECT * FROM pg_stat_activity WHERE current_query ! = '<IDLE>';

Execute the following statement to view the top five SQL statements that took the longest time to execute:

SELECT current_timestamp - query_start as runtime
, datname
, usename
, current_query
FROM pg_stat_activity
WHERE current_query ! = '<IDLE>'
ORDER BY runtime DESC
LIMIT 5;

Check the status of locks

If an object in your AnalyticDB for PostgreSQL instance remains locked by a query for a long time, the other queries that involve that object may remain in the waiting state and cannot be executed properly. Execute the following statement to view the tables that are locked in your AnalyticDB for PostgreSQL instance:

SELECT pgl.locktype AS locktype
, pgl.database AS database
, pgc.relname AS relname
, pgl.relation AS relation
, pgl.transaction AS transaction
, pgl.pid AS pid
, pgl.mode AS mode
, pgl.granted AS granted
, pgsa.current_query AS query
FROM pg_locks pgl
JOIN pg_class pgc ON pgl.relation = pgc.oid
JOIN pg_stat_activity pgsa ON pgl.pid = pgsa.procpid
ORDER BY pgc.relname;

If a query does not respond because it is waiting for the lock on a table to be released, you can check the lock on that table. Use one of the following methods to resolve this issue if necessary:

  • Cancel this query. If the session indicated by the pid parameter is idle, this method is unsuitable. In addition, you must delete data and roll back the transaction to which the query belongs after you cancel the query.
    SELECT pg_cancel_backend(pid);
    Note The pg_cancel_backend function does not take effect on the session for which the value of the pg_stat_activity.current_query parameter is IDLE. In this situation, you can call the pg_terminate_backend function to delete data.
  • Terminate the session to which the query belongs. After the session is terminated, the uncommitted transactions in it are rolled back.
    SELECT pg_terminate_backend(pid);

Join tables by using nested loops to increase query performance

By default, nested loops joins are disabled in AnalyticDB for PostgreSQL. If only a small volume of data is returned by your query, the query performance may not be optimal.

Example:

SELECT *
FROM t1 join t2 on t1.c1 = t2.c1
WHERE t1.c2 >= '230769548' and t1.c2 < '230769549'
LIMIT 100;

In the preceding example, the t1 and t2 tables are large. However, the filter criterion (t1.c2 >= '230769548' and t1.c2 < '23432442') on the t1 table filters out most of the data records and the LIMIT clause further limits qualified data records. As a result, only a small volume of data is queried. In this situation, you can use nested loops to join the two tables.

To join tables by using nested loops, you must execute the SET statement. Example:

show enable_nestloop ;
 enable_nestloop
-----------------
 off
SET enable_nestloop = on ;
show enable_nestloop ;
 enable_nestloop
-----------------
 on
explain SELECT * FROM t1 join t2 on t1.c1 = t2.c1 WHERE t1.c2 >= '230769548' and t1.c2 < '23432442' LIMIT 100;
                                            QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=0.26..16.31 rows=1 width=18608)
   ->  Nested Loop  (cost=0.26..16.31 rows=1 width=18608)
         ->  Index Scan using t1 on c2  (cost=0.12..8.14 rows=1 width=12026)
               Filter: ((c2 >= '230769548'::bpchar) AND (c2 < '230769549'::bpchar))
         ->  Index Scan using t2 on c1  (cost=0.14..8.15 rows=1 width=6582)
               Index Cond: ((c1)::text = (T1.c1)::text)

In the preceding example, the t1 and t2 tables are joined by using nested loops to increase query performance.