Overview

This topic describes how to solve the problem that the CPU usage of apsaradb for RDS SQL Server is too high or even reaches 100%.

 

Description

Alibaba Cloud reminds you that:

  • When you perform operations that have risks, such as modifying instances or data, check the disaster recovery and fault tolerance capabilities of the instances to ensure data security.
  • Before you modify the configurations and data of instances including but not limited to ECS and RDS instances, we recommend that you create snapshots or enable RDS log backup.
  • If you have authorized or submitted security information such as the logon account and password in the Alibaba Cloud Management console, we recommend that you modify such information in a timely manner.

High CPU usage can be caused by many factors, the most common of which are described as follows and their solutions are as follows:

 

The parallelism setting of the instance is incorrect.

Troubleshooting

When a task is parallel processing threads, CXPACKET waits may occur because the amount of data processed by each thread is different. If CXPACKET waits frequently, the CPU usage will be high. You can monitor whether there is a large number of CXPACKET waits by using the SQL Server Management Studio activity monitor or by executing the following statement (multiple executions to get difference values):

Tips: CXPACKET refers to threads waiting for each other to complete parallel processing. When SQL Server finds a complex instruction, it will decide to use multiple threads for parallel execution. Because some parallel threads have already completed their work, they are waiting for other parallel threads to synchronize. This waiting is called CXPACKET.

WITH [Waits] AS (
SELECT [wait_type],
       [wait_time_ms] / 1000.0 AS [WaitS],
       ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
       [signal_wait_time_ms] / 1000.0 AS [SignalS],
       [waiting_tasks_count] AS [WaitCount],
       100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
                                    ROW_NUMBER() OVER (
                                                      ORDER BY [wait_time_ms] DESC) AS [RowNum]
FROM sys.dm_os_wait_stats
WHERE [wait_type] NOT IN ( N'BROKER_EVENTHANDLER',
                            N'BROKER_RECEIVE_WAITFOR ',
                             N'BROKER_TASK_STOP ',
                              N'BROKER_TO_FLUSH ',
                               N'BROKER_TRANSMITTER ',
                                N'CHECKPOINT_QUEUE ',
                                 N'CHKPT',
                                  N'CLR_AUTO_EVENT ',
                                   N'CLR_MANUAL_EVENT ',
                                    N'CLR_SEMAPHORE ', -- Maybe uncomment these four if you have mirroring issues N'DBMIRROR_DBM_EVENT', N'DBMIRROR_EVENTS_QUEUE ', N'DBMIRROR_WORKER_QUEUE', N'DBMIRRORING_CMD ', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE ', N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX ', -- Maybe uncomment these six if you have AG issues N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION ', n'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE ', N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE ', N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP ', N'LOGMGR_QUEUE', N'MEMORY_ALLOCATION_EXT ', N'ONDEMAND_TASK_QUEUE', N'PREEMPTIVE_XE_GETTARGETSTATE ', N'PWAIT_ALL_COMPONENTS_INITIALIZED', N'PWAIT_DIRECTLOGCONSUMER_GETNEXT ', N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', N'QDS_ASYNC_QUEUE ', N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', N'QDS_SHUTDOWN_QUEUE ', N'REDO_THREAD_PENDING_WORK', N'REQUEST_FOR_DEADLOCK_SEARCH ', n'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP ', N'SLEEP_DCOMSTARTUP', N'SLEEP_MASTERDBREADY ', N'SLEEP_MASTERMDREADY', N'SLEEP_MASTERUPGRADED ', N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK ', N'SLEEP_TASK', N'SLEEP_TEMPDBSTARTUP ', N'SNI_HTTP_ACCEPT', n'SQLTRACE_INCREMENTAL_FLUSH_SLEEP ', N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS ', N'WAITFOR', N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_RECOVERY', N'WAIT_XTP_HOST_WAIT', N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN', N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT') AND [waiting_tasks_count] > 0)

  SELECT MAX ([W1].[wait_type]) AS [WaitType],
             CAST (MAX ([W1].[ WaitS]) AS DECIMAL (16, 2)) AS [Wait_S],
                  CAST (MAX ([W1].[ ResourceS]) AS DECIMAL (16, 2)) AS [Resource_S],
                       CAST (MAX ([W1].[ SignalS]) AS DECIMAL (16, 2)) AS [Signal_S],
                            MAX ([W1].[ WaitCount]) AS [WaitCount],
                                CAST (MAX ([W1].[ Percentage]) AS DECIMAL (5, 2)) AS [Percentage],
                                     CAST ((MAX ([W1].[ WaitS]) / MAX ([W1].[ WaitCount])) AS DECIMAL (16, 4)) AS [AvgWait_S],
                                          CAST ((MAX ([W1].[ ResourceS]) / MAX ([W1].[ WaitCount])) AS DECIMAL (16, 4)) AS [AvgRes_S],
                                               CAST ((MAX ([W1].[ SignalS]) / MAX ([W1].[ WaitCount])) AS DECIMAL (16, 4)) AS [AvgSig_S]
  FROM [Waits] AS [W1]
  INNER JOIN [Waits] AS [W2] ON [W2].[ RowNum] <= [W1].[ RowNum]
GROUP BY [W1].[ RowNum]
HAVING SUM ([W2].[ Percentage]) - MAX([W1].[ Percentage]) < 95;

 

Troubleshooting procedure

  • Set from statement level.
    1. Run the following SQL statement to search for CPU-consuming statements.
      SELECT TOP 50 [Avg. MultiCore/CPU time(sec)] = qs.total_worker_time / 1000000 / qs.execution_count,
                    [Total MultiCore/CPU time(sec)] = qs.total_worker_time / 1000000,
                    [Avg. Elapsed Time(sec)] = qs.total_elapsed_time / 1000000 / qs.execution_count,
                    [Total Elapsed Time(sec)] = qs.total_elapsed_time / 1000000,
                    qs.execution_count,
                    [Avg. I/O] = (total_logical_reads + total_logical_writes) / qs.execution_count,
                    [Total I/O] = total_logical_reads + total_logical_writes,
                    Query = SUBSTRING(qt.[text], (qs.statement_start_offset / 2) + 1, ((CASE qs.statement_end_offset
                                                                                            WHEN -1 THEN DATALENGTH(qt.[text])
                                                                                            ELSE qs.statement_end_offset
                                                                                        END - qs.statement_start_offset) / 2) +1 minute,
                    Batch = qt.[text],
                    [DB] = DB_NAME(qt.[dbid]),
                    qs.last_execution_time,
                    qp.query_plan
      FROM sys.dm_exec_query_stats AS qs CROSS APPLY sys.dm_exec_sql_text(qs.[sql_handle]) AS qt CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
      where qs.execution_count > 5 --more than 5 occurences ORDER BY [Total MultiCore/CPU time(sec)] DESC;
    2. For RDS SQL Server 2008 R2 instances, you can view slow log statistics in the console for CPU-consuming statements.
    3. After finding the statement, check its execution plan. For statements with a higher degree of parallelism, you can use hints at the statement level to limit the degree of parallelism. Example:
      SELECT column1,column2 
          FROM table1 o INNER JOIN table2 d ON (o.d_id = d.d_id) 
          OPTION (maxdop 1);
  • Set from the instance level.
    1. Run the following SQL statement to view the MAXDOP value of the current instance:
      select * from sys.configurations where name like '%max%';
      查询maxdop
    2. Run the following SQL statement to configure this parameter at the instance level for all queries.
      sp_rds_configure 'max degree of parallelism ',1;

High Application Load

Symptom

No slow queries are found (or slow queries are not the main cause of the problem), and the QPS and CPU usage curves are consistent. They are commonly used in online transaction and transaction systems (such as order systems) with optimized applications and popular Web applications with high read rates.

 

Feature

The QPS of the instance is high, the query is relatively simple, the execution efficiency is high, and there is little room for optimization.

 

Solution

It is recommended to solve this problem in terms of application architecture and instance specifications.

  • Upgrades the instance type to increase CPU resources.
  • Optimize queries as much as possible to reduce the execution costs of queries (logical I /O and the number of rows of table data to be accessed), and improve application scalability.

     

    The read /write of the query statement is too high

    Symptom

    There are slow queries, and the changes of QPS and CPU usage curves are not consistent. Check the statements that consume CPU, and the statements with large I/O.

     

    Feature

    Queries per second (QPS) is not high. You cannot improve the query efficiency. You must scan a large number of tables for queries. However, an optimization operation does not require much time.

     

    Solution

    • For large table queries, check whether an appropriate index exists. Check the actual execution plan to optimize the full table scan. The execution plan also provides suggestions on missing indexes.
      缺失索引检查
    • Use CloudDBA to check performance issues.

     

    Learn More

    The general principle for avoiding 100% CPU usage is as follows.

    • Set a CPU usage alert to ensure certain redundancy for the instance CPU usage.
    • During application design and development, query optimization must be considered and comply with the general optimization principles for SQL optimization. This reduces the logical I/O of queries and improves application scalability.
    • Use production environment data for stress testing before launching new features and modules.
    • CloudDBA is frequently used to check instance performance and detect problems in a timely manner.

     

    Application scope

    • Apsaradb for RDS SQL Server