By Qingyi
This article discusses the issues related to thread pool faults and analyzes them from both a fault perspective and a technical perspective. The fault perspective focuses on observing the phenomenon and drawing lessons from it, while the technical perspective delves into the essence of the faults and introduces methods to avoid them.
I have received feedback from our new team members expressing their interest in learning about thread pool faults. Therefore, I have organized and shared our team's reviews on thread pool faults accumulated over the years.
This article can also be beneficial to developers outside of Alibaba Cloud, and I hope it will be helpful to all of you.
Having worked in the control team for many years, I have encountered various types of faults. Overall, there are often Service Unavailable errors caused by full thread pools. However, a full thread pool is just a result, and the underlying cause is usually a slow process within the system. For example, slow SQL queries in a database can lead to a full connection pool, which in turn causes a thread pool for external services, like a Dubbo thread pool, to become full. Once the thread pool reaches its maximum capacity, it may not be able to respond to new requests. Even if it does, the new requests may be queued for a long time and cannot be processed promptly, resulting in longer request durations. From the client's perspective, this manifests as a timeout error: Service Unavailable.
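The chain described above (a slow dependency keeps workers busy, the queue fills, new requests are refused) can be reproduced with a plain JDK thread pool. The following is a minimal sketch, not the configuration of any particular framework: the class name FullPoolDemo and the use of a latch to stand in for a slow SQL query are assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class FullPoolDemo {
    /**
     * Submits `requests` tasks that all block on a slow dependency to a pool with
     * `threads` workers and a bounded queue of `queueSize` (must be at least 1).
     * Returns how many submissions were rejected because the pool was full.
     */
    public static int countRejected(int threads, int queueSize, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy()); // reject when workers and queue are full
        CountDownLatch slowDependency = new CountDownLatch(1); // hypothetical stand-in for a slow SQL query
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> awaitQuietly(slowDependency));
                } catch (RejectedExecutionException e) {
                    rejected++; // what the client would see as Service Unavailable
                }
            }
        } finally {
            slowDependency.countDown(); // the slow dependency finally responds
            pool.shutdown();
        }
        return rejected;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With 2 workers and a queue of 3, only 5 requests can be held while the dependency is slow; every additional request is rejected immediately.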
Here are some typical cases that developers can easily understand.
Updating the same row in a transaction can lead to lock waits and slow SQL queries, especially in scenarios involving count updates and quota updates.
There are multiple ways to perform DDL changes. If the most basic method is used, tables will be locked, resulting in a large number of lock waits for related SQL queries and causing slow SQL queries. It is recommended to use online DDL for DDL changes. In some past cases where tables were locked, either online DDL was not used or the database versions at that time did not support it.
Slow SQL queries caused by the database choosing the wrong index commonly occur in scenarios involving ORDER BY ID LIMIT. Even if the fields in the WHERE clause have indexes, the database may still perform a full table scan. To address this fault, you can use IGNORE INDEX (PRIMARY) or FORCE INDEX (idx_xxx).
In one case, the query took the following form:
where a= and b= and c= and d= order by id desc limit 20
At that time, the only composite index was idx_a_b_e. During the incident, CPU usage was temporarily reduced by manually throttling SQL queries on the database's O&M platform, but it quickly rose again. Deleting some invalid data to reduce the data volume was also attempted. Ultimately, the issue was resolved by temporarily adding an index, idx_a_b_c_d, that covered all the fields.
In scenarios where a large amount of data needs to be queried, slow SQL queries caused by deep paging are common. You can address this issue by querying with NextToken or a cursor. Currently, many Alibaba Cloud APIs support NextToken.
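To illustrate the NextToken idea, here is a minimal in-memory sketch in which a sorted map stands in for a table ordered by its primary key. The class NextTokenPaging, the Page type, and the choice of the last returned id as the token are assumptions made for illustration; real APIs usually return an opaque token. The point is that each page starts from the last seen id (in SQL terms, WHERE id > ? ORDER BY id LIMIT ?) instead of skipping a growing offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class NextTokenPaging {
    /** One page of results plus the token for the next page (null when there are no more rows). */
    public static final class Page {
        public final List<String> items;
        public final Long nextToken;

        Page(List<String> items, Long nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    /**
     * Returns up to `limit` rows whose id is strictly greater than `nextToken`.
     * Pass null as the token to fetch the first page.
     */
    public static Page query(TreeMap<Long, String> table, Long nextToken, int limit) {
        // Seek directly to the position after the last seen id instead of counting an offset.
        NavigableMap<Long, String> rest = nextToken == null ? table : table.tailMap(nextToken, false);
        List<String> items = new ArrayList<>();
        Long lastId = null;
        for (Map.Entry<Long, String> row : rest.entrySet()) {
            if (items.size() == limit) {
                break;
            }
            items.add(row.getValue());
            lastId = row.getKey();
        }
        boolean more = lastId != null && table.higherKey(lastId) != null;
        return new Page(items, more ? lastId : null);
    }
}
```

A caller keeps passing the returned nextToken back until it is null. The cost of each call depends on the page size, not on how deep into the data the caller already is.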
Case 1: Shortly after the issue was resolved, a user attempted to rerun a pending task on a single instance, leading to a full thread pool and impacting the service.
Case 2: During a stress test, data was not prefetched, and the query volume was instantly increased to the maximum. Consequently, the thread pool became full, and a large number of transactions had to wait in the database, resulting in slow SQL queries.
Case: A user did not set query limits, which led to full garbage collections (Full GCs).
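One simple guard against this kind of case is to never let a request decide the result size on its own. The sketch below assumes a default of 20 rows and a hard cap of 100 rows; both values and the class name QueryLimit are hypothetical and should be tuned to the actual row size and heap budget.

```java
public class QueryLimit {
    /** Hypothetical default page size used when the caller sets no limit. */
    public static final int DEFAULT_LIMIT = 20;
    /** Hypothetical hard cap that protects the heap from oversized result sets. */
    public static final int MAX_LIMIT = 100;

    /** Returns the limit that is actually applied to the query. */
    public static int effectiveLimit(Integer requested) {
        if (requested == null || requested <= 0) {
            return DEFAULT_LIMIT; // missing or invalid limit: fall back to the default
        }
        return Math.min(requested, MAX_LIMIT); // never exceed the cap
    }
}
```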
Thread pool failures are usually caused by slow operations or blockages. From a technical perspective, the most common scenarios are as follows.
Among these scenarios, Scenario 2 is not very common, but I have encountered this situation in a compute-intensive application where a sudden increase in highly concurrent requests led to 100% CPU usage. Scenario 1 is more common, and popular connection pools for remote calls include Dubbo, HTTP, database, and Redis connection pools. In these situations, connection pools are used to interact with remote services. All connection pools have some common characteristics, and you should pay attention to the following two tips:
Here are some specific tips for different connection pools.
1. Isolate interfaces in the thread pool to avoid mutual interference.
2. Set timeouts on Dubbo consumers as low as possible, in line with the fail-fast principle. Timeouts on providers only serve as a declaration for consumers to reference; they cannot kill provider threads when the timeout expires.
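The fail-fast behavior on the calling side can be sketched with plain JDK classes. This is an analogy, not Dubbo's own API: the class FailFastCall and its fallback parameter are assumptions made for illustration. The caller stops waiting after the timeout and then tries to cancel the work. Note that cancellation only helps if the task responds to interruption, which mirrors the point that a timeout by itself does not kill the thread doing the work.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailFastCall {
    /**
     * Runs `task` on `pool` and waits at most `timeoutMillis` for the result.
     * Returns `fallback` if the task times out, fails, or the caller is interrupted.
     * Passing a dedicated pool per interface keeps a slow interface from starving others.
     */
    public static <T> T callWithTimeout(ExecutorService pool, Callable<T> task,
                                        long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort: interrupts the worker, which only helps if the task is interruptible.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback;
        }
    }
}
```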
1. Set ConnectTimeout, SocketTimeout, and ConnectionRequestTimeout.
2. Avoid setting DefaultMaxPerRoute too low, because requests to the same route then block while waiting for a connection.
1. Set ConnectTimeout and SocketTimeout.
2. Set TransactionTimeout. A transaction holds locks, so the longer the timeout, the longer the locks can be held. SQL queries outside the transaction then have to wait for those locks, which results in poor performance.
3. Set defaultStatementTimeout and queryTimeout of iBATIS.
4. Set MaxWait, the time a caller will wait before receiving a connection timeout.
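The effect of a bounded wait such as MaxWait can be sketched with a tiny pool built on a blocking queue. This is not how any specific connection pool is implemented; the class BoundedPool and its method names are assumptions made for illustration. The key property is that a caller gives up after the configured wait instead of blocking its thread indefinitely when the pool is exhausted.

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedPool<T> {
    private final BlockingQueue<T> idle;
    private final long maxWaitMillis;

    /** `resources` must be non-empty; `maxWaitMillis` is the longest a caller may wait in acquire(). */
    public BoundedPool(Collection<T> resources, long maxWaitMillis) {
        this.idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Returns a pooled resource, or null if none became free within the maximum wait. */
    public T acquire() {
        try {
            return idle.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** Hands a resource back to the pool so that waiting callers can proceed. */
    public void release(T resource) {
        idle.offer(resource);
    }
}
```

Without the bounded wait, every request thread that arrives while the pool is empty would be parked, which is exactly how a full connection pool turns into a full thread pool.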
Retrying too frequently can accelerate system avalanches. Refer to the blog post [2] for more insights. The key points are as follows:
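As a rough sketch of retrying without amplifying load, the following helper caps the number of attempts and sleeps for a random time between zero and an exponentially growing, capped bound (often called full jitter). The class RetryWithJitter and its parameters are assumptions made for illustration and are not taken from the referenced post.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class RetryWithJitter {
    /** Upper bound of the sleep before the retry that follows attempt number `attempt` (1-based). */
    public static long backoffCapMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 1; i < attempt && delay < maxMillis; i++) {
            delay *= 2; // exponential growth, stopped once the cap is reached
        }
        return Math.min(delay, maxMillis);
    }

    /** Calls `op` up to `maxAttempts` times and rethrows the last failure if all attempts fail. */
    public static <T> T call(Supplier<T> op, int maxAttempts, long baseMillis, long maxMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // retry budget exhausted
                }
                long cap = backoffCapMillis(attempt, baseMillis, maxMillis);
                try {
                    // Full jitter: sleep a uniformly random time in [0, cap].
                    Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
            }
        }
        throw last;
    }
}
```

The random component keeps many clients that failed at the same moment from retrying at the same moment, and the attempt limit bounds the extra load a struggling service receives.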
Disperse scheduled and periodic jobs to avoid peak loads. We have encountered similar failure cases:
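One way to disperse periodic jobs is to give each instance a different start offset within the period, either at random or by instance index. The sketch below uses the JDK scheduler; the class JobSpread and its method names are assumptions made for illustration.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class JobSpread {
    /** Random start offset in [0, periodMillis); periodMillis must be positive. */
    public static long randomInitialDelay(long periodMillis) {
        return ThreadLocalRandom.current().nextLong(periodMillis);
    }

    /** Deterministic alternative: instance i of n starts at i/n of the period. */
    public static long slotDelay(int instanceIndex, int instanceCount, long periodMillis) {
        return periodMillis * instanceIndex / instanceCount;
    }

    /** Schedules `job` at a fixed rate with a random initial delay so instances do not fire together. */
    public static ScheduledFuture<?> scheduleSpread(ScheduledExecutorService scheduler,
                                                    Runnable job, long periodMillis) {
        return scheduler.scheduleAtFixedRate(
                job, randomInitialDelay(periodMillis), periodMillis, TimeUnit.MILLISECONDS);
    }
}
```

With four instances and a one-minute period, the slot variant starts the jobs 15 seconds apart instead of all at the top of the minute.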
This article does not cover all aspects comprehensively. Your comments and suggestions are welcome.
[1] DruidDataSource Configuration
[2] Timeouts, retries, and backoff with jitter
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.