
ApsaraDB RDS: How to troubleshoot high CPU utilization of ApsaraDB RDS for PostgreSQL

Last Updated: Aug 24, 2023

Problem description

When you use ApsaraDB RDS for PostgreSQL, you may encounter high or even 100% CPU utilization. This article describes the common causes of and solutions to this issue by analyzing typical scenarios in which the CPU utilization reaches 100%.

Solution

Note

Take note of the following items:

  • Before you perform high-risk operations such as modifying the configurations or data of Alibaba Cloud instances, we recommend that you check the disaster recovery and fault tolerance capabilities of the instances to ensure data security.
  • Before you modify the configurations or data of Alibaba Cloud instances, such as ECS instances or ApsaraDB RDS instances, we recommend that you create snapshots or enable RDS log backups.
  • If you have granted permissions to users or submitted sensitive information, such as logon accounts and passwords, in the Alibaba Cloud Management Console, we recommend that you modify the information at the earliest opportunity.

When the CPU utilization reaches 100%, check whether the number of active connections spikes during your business peak hours and whether the reserved resources of the database are insufficient. Verify whether the number of active connections is much greater than usual when the problem occurs. For an ApsaraDB RDS for PostgreSQL instance, you can view connection changes in the monitoring and alerts section of the ApsaraDB RDS console. To query the current number of active connections, connect to the database and run the following statement:

select count(*) from pg_stat_activity where state not like '%idle';
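
To get an overview of what the active connections are doing, you can also group them by state and wait event. The following query is a minimal sketch; the wait_event columns are available in PostgreSQL 9.6 and later:

select state, wait_event_type, wait_event, count(*)
from pg_stat_activity
where state not like '%idle'
group by state, wait_event_type, wait_event
order by count(*) desc;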

Track slow SQL statements

If the number of active connections stays within the normal range, the cause may be a large number of slow SQL statements. In this case, you can use the slow query logs of ApsaraDB RDS to identify time-consuming statements. However, when this problem occurs, the entire system slows down and all SQL statements run slowly, so many slow statements may be recorded in the logs, which makes it difficult to identify the statements that actually cause the problem. The following methods help you track down the slow SQL statements:

  1. Method 1: Use the pg_stat_statements plug-in to identify slow SQL statements. This method is applicable only to ApsaraDB RDS for PostgreSQL. Perform the following steps:
    1. If the plug-in is not installed in your database, create it. This method relies on counters, such as the accumulated execution time of each SQL statement. Because the counters keep accumulating and contain historical information about statements, you must reset them before you troubleshoot high CPU utilization. Run the following commands:
      create extension pg_stat_statements;
      select pg_stat_reset();
      select pg_stat_statements_reset();
    2. Wait for the counters to accumulate information for a period of time, for example, one minute.
    3. Run the following statement to query the most time-consuming SQL statements. The result usually reveals the direct cause of the problem. For PostgreSQL 13 and later, see the note after this list.
      select * from pg_stat_statements order by total_time desc limit 5;
    4. Run the following statement to query the SQL statements that read the most buffers. Heavy buffer reads often occur because the queried data lacks an index, which consumes a large amount of CPU resources:
      select * from pg_stat_statements order by shared_blks_hit+shared_blks_read desc limit 5;
  2. Method 2: Use the pg_stat_activity view. Run the following statement to query the SQL statements that have been running for a long time. These statements may cause high CPU usage:
    select datname,
           usename,
           client_addr,
           application_name,
           state,
           backend_start,
           xact_start,
           xact_stay,
           query_start,
           query_stay,
           replace(query, chr(10), ' ') as query
    from
      (select pgsa.datname as datname,
              pgsa.usename as usename,
              pgsa.client_addr client_addr,
              pgsa.application_name as application_name,
              pgsa.state as state,
              pgsa.backend_start as backend_start,
              pgsa.xact_start as xact_start,
              extract(epoch
                      from (now() - pgsa.xact_start)) as xact_stay,
              pgsa.query_start as query_start,
              extract(epoch
                      from (now() - pgsa.query_start)) as query_stay,
              pgsa.query as query
       from pg_stat_activity as pgsa
       where pgsa.state != 'idle'
         and pgsa.state != 'idle in transaction'
         and pgsa.state != 'idle in transaction (aborted)') idleconnections
    order by query_stay desc
    limit 5;
  3. Method 3: Query the tables that do not have indexes, based on the information recorded for table scans. If a table has no indexes and most of the hot data is stored in memory (for example, 8 GB of memory and 6 GB of hot data), the database can only scan the table and filter a large number of irrelevant records in memory, which consumes a large amount of CPU resources. For tables with more than 100,000 records, which matches the threshold in the following query, table scans can occupy almost all CPU resources, especially when hundreds of connections run such scans concurrently. After you identify a suspect table, you can also check its existing indexes, as shown in the example after this list.
    1. Run the following SQL statement to query the tables on which table scans are performed most frequently:
      select * from pg_stat_user_tables where n_live_tup > 100000 and seq_scan > 0 order by seq_tup_read desc limit 10;
    2. Run the following SQL statement to query the slow queries that are currently accessing these tables:
      select * from pg_stat_activity where query ilike '%<table name>%' and now() - query_start > interval '10 seconds';
      Note

      You can also run the following statement to locate queries that are related to these tables by using the pg_stat_statements plug-in:

      select * from pg_stat_statements where query ilike '%<table>%' order by shared_blks_hit+shared_blks_read desc limit 3;
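
Note for Method 1: In PostgreSQL 13 and later, the total_time column of the pg_stat_statements view is split into total_exec_time and total_plan_time. On these versions, you can use the following variant of the query in Method 1, provided as a minimal sketch:

select query, calls, total_exec_time, mean_exec_time
from pg_stat_statements
order by total_exec_time desc
limit 5;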
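
Note for Method 3: After you identify a suspect table, you can check which indexes the table already has. The following query is a minimal sketch that uses the pg_indexes catalog view; replace <table name> with the actual table name:

select indexname, indexdef from pg_indexes where tablename = '<table name>';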

Handle slow SQL statements

After you use the preceding methods to identify the slow SQL statements, terminate them to restore your business, as shown in the following examples. The pg_cancel_backend function cancels only the current query of a connection, whereas pg_terminate_backend terminates the entire connection.

select pg_cancel_backend(pid) from pg_stat_activity where query like '%<query text>%' and pid != pg_backend_pid();
select pg_terminate_backend(pid) from pg_stat_activity where query like '%<query text>%' and pid != pg_backend_pid();
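
If the query text is difficult to match, you can instead cancel all statements that have been running longer than a threshold. The following statement is a minimal sketch that assumes a 5-minute threshold; adjust the threshold to your business:

select pg_cancel_backend(pid)
from pg_stat_activity
where state != 'idle'
  and now() - query_start > interval '5 minutes'
  and pid != pg_backend_pid();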

If the slow SQL statements are necessary for your business, you must optimize them by using the following methods:

  1. Run the ANALYZE [$Table] or VACUUM ANALYZE [$Table] statement on the tables that are related to the slow SQL statements to update table statistics. This way, the query planner can generate more accurate execution plans. To minimize the impact on your business, we recommend that you run the statements during off-peak hours. For a combined example of steps 1 and 2, see the sketch after this list.
    Note

    [$Table] represents the table that is related to the slow SQL statements.

  2. Run one of the following statements to view the execution plan of a slow SQL statement. The first statement only generates the plan and does not actually run the SQL statement. The second statement runs the SQL statement and collects detailed execution information. Based on the execution plan, create indexes on the tables that are accessed by table scans.
    explain [$Query_Text]
    explain (buffers true, analyze true, verbose true) [$Query_Text]
    Note

    [$Query_Text] represents the SQL statement to be analyzed.

  3. Rewrite the slow SQL statements, for example, to remove unnecessary subqueries. Further optimizations, such as rewriting queries with UNION ALL and fixing the order of JOIN clauses, are beyond the scope of this article.
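
The following is a minimal end-to-end sketch of steps 1 and 2. The orders table and the customer_id column are hypothetical examples; replace them with the table and filter column from your slow SQL statement:

-- Step 1: Update the statistics of the table.
vacuum analyze orders;

-- Step 2: View the execution plan without running the query.
explain select * from orders where customer_id = 42;

-- If the plan shows a sequential scan on a large table,
-- create an index on the filter column.
create index idx_orders_customer_id on orders (customer_id);

-- Verify that the new plan uses the index.
explain (buffers true, analyze true, verbose true)
select * from orders where customer_id = 42;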

Applicable scope

  • ApsaraDB RDS for PostgreSQL