Case Analysis | Discussion on Thread Pool Faults

By Qingyi

This article discusses the issues related to thread pool faults and analyzes them from both a fault perspective and a technical perspective. The fault perspective focuses on observing the phenomenon and drawing lessons from it, while the technical perspective delves into the essence of the faults and introduces methods to avoid them.

Background

New members of our team have expressed interest in learning about thread pool faults, so I have organized and am sharing the reviews of thread pool faults that our team has accumulated over the years.

This article can also be beneficial to developers outside of Alibaba Cloud, and I hope it will be helpful to all of you.

Fault Perspective

Having worked in the control team for many years, I have encountered various types of faults. The most common are Service Unavailable errors caused by full thread pools. However, a full thread pool is only a symptom; the underlying cause is usually a slow operation somewhere in the system. For example, slow SQL queries in a database can fill the connection pool, which in turn fills a thread pool for external services, such as a Dubbo thread pool. Once the thread pool reaches its maximum capacity, it may not be able to respond to new requests at all, or new requests may queue for so long that they cannot be processed promptly, which lengthens request durations. From the client's perspective, this manifests as a timeout error: Service Unavailable.

Here are some typical cases that developers can easily understand.

Database-related Faults

Hot Row Update

Many transactions concurrently updating the same row can lead to lock waits and slow SQL queries, especially in count-update and quota-update scenarios.

  • Case 1: During a stress test in which the database handled over 600,000 queries per second, a large number of requests updating the same row (a count update) inside a transaction contended for the row lock and produced slow SQL queries.
  • Case 2: Several large users performed highly concurrent operations involving hot row updates inside transactions; a single update was found to take 5-6 seconds. Threads piled up in the Dubbo thread pool for external services, eventually filling it and making the service unavailable.
  • In an offline simulation using the same database version and configurations, with 1,200 concurrent requests updating one row, a single update took 1 minute with the transaction enabled and 3 seconds with it disabled.

Adding Fields to a Large Table

There are multiple ways to perform DDL changes. If the most basic method is used, tables will be locked, resulting in a large number of lock waits for related SQL queries and causing slow SQL queries. It is recommended to use online DDL for DDL changes. In some past cases where tables were locked, either online DDL was not used or the database versions at that time did not support it.

  • Case: A user added fields to a large table without using online DDL. In the final phase, the table was locked with a Metadata lock, causing a large number of related SQL queries to wait for the lock, resulting in slow SQL and rapidly filling up the application thread pool.

Wrong Index (full table scan triggered by a primary key index)

This issue commonly occurs in ORDER BY id LIMIT scenarios. Even if the fields in the WHERE clause are indexed, the optimizer may pick the primary key index and effectively scan the whole table. To address this fault, you can use the IGNORE INDEX (PRIMARY) or FORCE INDEX (idx_xxx) hints, for example: SELECT ... FROM t FORCE INDEX (idx_a_b_c_d) WHERE ... ORDER BY id DESC LIMIT 20.

  • Case: A user received a database CPU usage alarm at around 3:00 AM. Troubleshooting showed that an SQL query had used the primary key index and triggered a full table scan. The query was of the form WHERE a= AND b= AND c= AND d= ORDER BY id DESC LIMIT 20, and at the time only a composite index idx_a_b_e existed. During the incident, CPU usage was temporarily reduced by manually throttling SQL queries on the database's O&M platform, but it quickly climbed again; deleting some invalid data to reduce the data volume was also attempted. Ultimately, the issue was resolved by temporarily adding an index idx_a_b_c_d covering all the filtered fields.

Deep Paging

In scenarios where a large amount of data needs to be queried, slow SQL queries caused by deep paging are common. You can address this issue by using NextToken or a cursor for querying (see the sketch after the case below). Currently, many Alibaba Cloud APIs support NextToken.

  • Case: A user with a significant amount of data called a query interface to retrieve data by page, which resulted in slow SQL queries. As a result, the database connection pool became full, and the Dubbo thread pool became overloaded and unavailable for external services. By implementing emergency throttling on the account's calls to the interface, the issue was resolved.
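
The fix is keyset (cursor) paging: instead of OFFSET, each page seeks past the last id seen, so the cost stays constant however deep the page is. Below is a minimal JDBC sketch of the idea; the orders table and its columns are hypothetical, and the cursor plays the same role as an API NextToken.

    import java.sql.*;

    public class CursorPaging {
        // Keyset paging: seek past the last seen id instead of using OFFSET,
        // so page N costs the same as page 1.
        public static void scan(Connection conn) throws SQLException {
            long lastId = 0; // the cursor, akin to a NextToken
            String sql = "SELECT id, name FROM orders WHERE id > ? ORDER BY id LIMIT 100";
            while (true) {
                int rows = 0;
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setLong(1, lastId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id"); // advance the cursor
                            rows++;
                        }
                    }
                }
                if (rows < 100) break; // last page reached
            }
        }
    }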

Large-volume Calls

Case 1: Shortly after a fault was resolved, a user reran a backlog of pending tasks on a single instance, which filled the thread pool and impacted the service.

  • Solution: Throttling policies are essential for systems. When a bottleneck occurs in a single instance, it is recommended to switch the execution mode to grid jobs.

Case 2: During a stress test, data was not prefetched, and the query volume was instantly increased to the maximum. Consequently, the thread pool became full, and a large number of transactions had to wait in the database, resulting in slow SQL queries.

  • Solution: The query volume should be gradually increased during stress testing. The tester must monitor the system load and pause the test when necessary.

Others

Case: A user did not set query limits, which led to full garbage collections (Full GCs).

  • This case did not involve full thread pool faults, but I believe it is representative, so I am sharing it with you. Whether you are coding or editing SQL queries for querying, deleting, or updating data, it is recommended to set query limits to protect your database and minimize the impact of potential issues.

Technical Perspective

Thread pool faults are usually caused by slow operations or blocking. From a technical perspective, the most common scenarios are as follows.

  1. Slow remote I/O calls result in increased time consumption.
  2. A sudden surge in CPU usage in compute-intensive applications leads to increased time consumption.
  3. A full custom service thread pool causes queuing and waiting, resulting in increased time consumption.

Among these scenarios, Scenario 2 is not very common, but I have encountered this situation in a compute-intensive application where a sudden increase in highly concurrent requests led to 100% CPU usage. Scenario 1 is more common, and popular connection pools for remote calls include Dubbo, HTTP, database, and Redis connection pools. In these situations, connection pools are used to interact with remote services. All connection pools have some common characteristics, and you should pay attention to the following two tips:

  1. Set the timeout for remote calls as low as possible to fail fast. This is usually achieved by setting the ConnectionTimeout, which bounds the connection handshake, and the SocketTimeout, which bounds how long a request may wait for a response.
  2. When the connection pool is full, set a low timeout for acquiring a connection so that callers fail fast. This tip is often overlooked. For example, you can set MaxWait in a Druid connection pool or ConnectionRequestTimeout in an HTTP connection pool.

Here are some specific tips for different connection pools.

Dubbo Thread Pools

1.  Isolate interfaces in the thread pool to avoid mutual interference.

  • For example, isolate interfaces for O&M from interfaces for external services, or isolate core external-service interfaces from non-core ones.

2.  Set timeouts on Dubbo consumers as low as possible, following the fail-fast principle. A timeout configured on the provider serves only as a declaration for consumers to reference; it cannot interrupt a provider thread once the timeout is exceeded.
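
As a minimal sketch of the consumer-side settings (using Dubbo's programmatic API; the DemoService interface and the values are illustrative, and a registry/application config would still be needed before calling ref.get()):

    import org.apache.dubbo.config.ReferenceConfig;

    public class DubboConsumerTimeouts {
        // Hypothetical service interface, for illustration only.
        interface DemoService { String sayHello(String name); }

        public static ReferenceConfig<DemoService> reference() {
            ReferenceConfig<DemoService> ref = new ReferenceConfig<>();
            ref.setInterface(DemoService.class);
            ref.setTimeout(1000); // fail fast: 1-second consumer-side timeout
            ref.setRetries(0);    // avoid retry amplification when the provider is slow
            return ref;
        }
    }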

HTTP Connection Pools

1.  Set ConnectTimeout, SocketTimeout, and ConnectionRequestTimeout.

  • Case: An application introduced an SDK that integrated HttpClient, but no ConnectTimeout was set. When network jitter occurred, the HTTP connection pool quickly filled up. As a result, the application thread pool became full and the service was impaired.

2.  A low DefaultMaxPerRoute can lead to blocking.

  • Case: During a stress test, an SDK used its default DefaultMaxPerRoute of 128. Latency on the client was high while latency on the server did not fluctuate, which pointed to client-side blocking caused by the low DefaultMaxPerRoute. The problem was solved by increasing the value.
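
The following sketch shows these settings together, assuming Apache HttpClient 4.x; the specific values are illustrative, not prescriptive:

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    public class HttpClientDefaults {
        public static CloseableHttpClient build() {
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(2_000)            // TCP handshake
                    .setSocketTimeout(5_000)             // per-request read timeout
                    .setConnectionRequestTimeout(3_000)  // wait for a pooled connection
                    .build();

            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(200);          // pool-wide cap
            cm.setDefaultMaxPerRoute(50); // raise this if one route carries most traffic

            return HttpClients.custom()
                    .setConnectionManager(cm)
                    .setDefaultRequestConfig(config)
                    .build();
        }
    }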

Druid Connection Pools

1.  Set ConnectTimeout and SocketTimeout.

  • Case: A user received an alarm at around 1:00 AM that the API success rate had decreased. Troubleshooting showed that some SQL statements had timed out because of a database failover, and that no SocketTimeout was set for the database connection pool on the application side. As a result, connections established before the failover were never timed out and killed, so the SQL statements running on them hung. Only after the system's default timeout period (900 seconds) elapsed were the old connections closed and new ones established.

2.  Set TransactionTimeout. A transaction holds locks: the longer it runs, the longer the locks are held, which forces related SQL queries outside the transaction to wait for locks and degrades performance.

  • Case: During a change, transactions were not committed due to a bug in the code, and no transaction timeout was set. As a result, a large number of related SQL queries had to wait for locks, and the service was impaired.

3.  Set the defaultStatementTimeout and queryTimeout of iBatis.

4.  Set MaxWait, the maximum time a caller waits for a connection from the pool before timing out.

  • Note: The default value of MaxWait in Druid used to be 60 seconds, which was too long and caused problems. I gave feedback to the developers, and it has since been reduced to 6 seconds. [1]
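
Putting the Druid settings together, here is a minimal sketch; the host, credentials, and values are illustrative, and connectTimeout/socketTimeout are passed as MySQL Connector/J URL parameters (in milliseconds):

    import com.alibaba.druid.pool.DruidDataSource;

    public class DruidDefaults {
        public static DruidDataSource build() {
            DruidDataSource ds = new DruidDataSource();
            // Handshake and per-statement socket timeouts, set on the JDBC URL.
            ds.setUrl("jdbc:mysql://127.0.0.1:3306/demo"
                    + "?connectTimeout=2000&socketTimeout=5000");
            ds.setUsername("app");
            ds.setPassword("secret");
            ds.setMaxActive(50);
            ds.setMaxWait(3_000); // fail fast when the pool is exhausted (ms)
            return ds;
        }
    }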

Custom Thread Pools

  1. A long queue configured in a thread pool may lead to a task backlog and hurt throughput.
  2. By default, future.get has no timeout. Always pass the timeout parameter explicitly.
  • Case: A user received an alarm that the Dubbo thread pool was full. Troubleshooting showed that the business code called future.get without a timeout and that the thread pool's rejection policy was DiscardPolicy. Once the pool filled up, newly submitted tasks were silently discarded, so the corresponding future.get calls blocked forever. As a result, the Dubbo thread pool stayed full and the service was impaired.
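
A minimal JDK-only sketch of both points: a bounded queue, an explicit rejection policy that surfaces overload instead of silently dropping tasks, and a bounded future.get (the pool sizes and timeouts are illustrative):

    import java.util.concurrent.*;

    public class BoundedPoolFailFast {
        public static void main(String[] args) throws Exception {
            // AbortPolicy throws on overload; DiscardPolicy would silently drop
            // the task and leave future.get() blocked forever.
            ExecutorService pool = new ThreadPoolExecutor(
                    4, 8, 60, TimeUnit.SECONDS,
                    new ArrayBlockingQueue<>(100),
                    new ThreadPoolExecutor.AbortPolicy());

            Future<String> future = pool.submit(() -> "done");
            try {
                // Always bound the wait; the no-arg get() waits indefinitely.
                System.out.println(future.get(3, TimeUnit.SECONDS));
            } catch (TimeoutException e) {
                future.cancel(true); // fail fast instead of tying up the caller
            } finally {
                pool.shutdown();
            }
        }
    }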

Redis Connection Pools

  1. Set MaxWait of JedisPool connection pools, which is similar to MaxWait of Druid connection pools and ConnectionRequestTimeout of HTTP connection pools.
  2. Set ConnectionTimeout and SocketTimeout, which are similar to those of Druid and HTTP connection pools.
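
A minimal sketch of these two settings, assuming Jedis 3.x (host, port, and values are illustrative):

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPool;
    import redis.clients.jedis.JedisPoolConfig;

    public class JedisPoolDefaults {
        public static void main(String[] args) {
            JedisPoolConfig config = new JedisPoolConfig();
            config.setMaxTotal(50);
            config.setMaxWaitMillis(3_000); // wait for a pooled connection, then fail fast

            // The last two arguments are the connection timeout and socket timeout (ms).
            JedisPool pool = new JedisPool(config, "127.0.0.1", 6379, 2_000, 5_000);

            try (Jedis jedis = pool.getResource()) {
                jedis.ping();
            }
            pool.close();
        }
    }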

Summary

Fail-fast Principle

  1. The fail-fast principle aims to prevent waste of system resources. Excessively long timeouts leave threads stuck in ineffective I/O waits.
  2. Based on personal experience, the following timeout settings are recommended: ConnectionTimeout between 1-3 seconds, 5 seconds at most; SocketTimeout no more than 10 seconds, sized to the actual request time; MaxWait/ConnectionRequestTimeout between 3-5 seconds, 6 seconds at most.

Protect Your Database: Flow Control/Backpressure

  1. Enable automatic throttling on your database's operations and maintenance (O&M) platform. In case of emergencies, manually execute throttling as soon as you receive an alert.
  2. Implement flow control at various levels: single machine, cluster (Region/AZ), user, and interface (a single-machine sketch follows this list).
  3. Apply the backpressure mechanism on the client of the message middleware for pulling messages.
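
As a single-machine example of interface-level flow control, here is a minimal sketch using Guava's RateLimiter (the class, rate, and error handling are illustrative):

    import com.google.common.util.concurrent.RateLimiter;

    public class InterfaceThrottle {
        // Allow at most 100 requests per second for this interface.
        private static final RateLimiter LIMITER = RateLimiter.create(100.0);

        public static String handle(String request) {
            if (!LIMITER.tryAcquire()) {
                // Shed load immediately; the caller can back off and retry later.
                throw new IllegalStateException("throttled, please retry later");
            }
            return "ok";
        }
    }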

Retry with Caution

Retrying too frequently can accelerate system avalanches. Refer to the blog post [2] for more insights. The key points are as follows:

  • Avoid automatic retries at the top layer and instead perform automatic retries on a single node.
  • Use the token bucket algorithm to control the rate of retries (see the sketch after this list).
  • Disperse scheduled and periodic jobs to avoid peak loads. We have encountered similar failure cases:

    • Case 1: In a similar failure, clients' scheduled heartbeats all reached the server within the same second, causing the server to fail. Such scheduled jobs should be dispersed.
    • Case 2: In a system, a large number of scheduled tasks were executed at the same hour, resulting in excessive pressure on the system and causing online issues. Such periodic tasks should be dispersed.
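
A minimal sketch of a token-bucket retry budget (the class and parameters are illustrative): each retry consumes a token, and when the bucket is empty the caller fails fast instead of retrying.

    public class RetryBudget {
        private final int capacity;           // maximum stored tokens
        private final double refillPerSecond; // steady-state retry rate
        private double tokens;
        private long lastRefillNanos = System.nanoTime();

        public RetryBudget(int capacity, double refillPerSecond) {
            this.capacity = capacity;
            this.refillPerSecond = refillPerSecond;
            this.tokens = capacity;
        }

        // Returns true if a retry is allowed; false means skip the retry.
        public synchronized boolean tryAcquire() {
            long now = System.nanoTime();
            tokens = Math.min(capacity,
                    tokens + (now - lastRefillNanos) / 1e9 * refillPerSecond);
            lastRefillNanos = now;
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return true;
            }
            return false; // budget exhausted: fail fast
        }
    }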

This article does not cover all aspects comprehensively. Your comments and suggestions are welcome.

References

[1] DruidDataSource Configuration
[2] Timeouts, retries, and backoff with jitter

Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
