Dave
Assistant Engineer

[Others] Why the hour-long request congestion on the secondary node?

Posted time: Nov 18, 2016 15:12
I felt heartbroken when I saw the log on the secondary node: an authentication request took as long as 2,977,790 milliseconds (around 50 minutes) to complete. Detailed statistics showed that on this secondary node, every user request initiated after 16:54 was blocked and did not return a result until around 17:54, and the longest-running request took about one hour.
2016-06-17T17:54:57.575+0800 I COMMAND  [conn2581] command admin.system.users command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: "xxx" } keyUpdates:0 writeConflicts:0 numYields:0 reslen:171 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 2977790ms
2016-06-17T17:54:57.575+0800 I COMMAND  [conn2740] command admin.system.users command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: "xxx" } keyUpdates:0 writeConflicts:0 numYields:0 reslen:171 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 2416390ms
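
For reference, requests like these can be pulled out of the mongod log by filtering COMMAND lines whose reported duration exceeds a threshold. Below is a minimal Python sketch; the log path and the 60-second threshold are assumptions for illustration.

import re

LOG_PATH = "/var/log/mongodb/mongod.log"   # assumed path
THRESHOLD_MS = 60 * 1000                   # flag anything slower than 60 seconds

# mongod appends the execution time as "<N>ms" at the end of COMMAND lines.
duration = re.compile(r"(\d+)ms$")

with open(LOG_PATH) as log:
    for line in log:
        line = line.rstrip()
        if " COMMAND " not in line:
            continue
        m = duration.search(line)
        if m and int(m.group(1)) >= THRESHOLD_MS:
            print(line)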


The investigation showed that the main culprit behind the nearly one-hour congestion on the secondary node was the combination of two heavyweight operations on the same database: "background index building + index deletion".
Background
1. After the secondary node pulls a batch of oplog entries from the primary node, it holds a special lock while replaying them. This lock blocks all readers. The reason, in my opinion, is to prevent readers from seeing an intermediate state: clients may only read after the whole batch of oplog entries has been applied, which avoids dirty reads.
2. Index building has two modes: foreground and background. A foreground build holds an exclusive (mutex) lock on the database for the whole build, so reads and writes to that database are blocked until the build completes; a background build only takes an intent write (IX) lock on the database and does not block reads or writes to it.
3. When the secondary node replays an index-build command, a foreground build blocks all readers for the entire duration of the replayed build; a background build is carried out in a background thread during the replay and blocks readers for only a very short time. (One way to see which operations are stuck waiting for a lock is sketched below.)
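
As an aside, blocked operations can be inspected with the currentOp command (available as a command since MongoDB 3.2). A minimal pymongo sketch, with the host name as an assumption:

from pymongo import MongoClient

# Connect directly to the node you want to inspect (host is an assumption).
client = MongoClient("mongodb://secondary-host:27017/")

# currentOp lists in-progress operations; waitingForLock marks blocked ones.
result = client.admin.command({"currentOp": 1})
for op in result.get("inprog", []):
    if op.get("waitingForLock"):
        print(op.get("opid"), op.get("op"), op.get("ns"), op.get("secs_running"))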
In order to prevent index building from affecting the business, we usually:
1. Create the index when the collection is created. At that point the collection is empty, so the build overhead is negligible. (The index is then updated on every subsequent write.)
2. If the collection already contains many documents when the index is built, use the background mode as much as possible, to avoid blocking all reads and writes to the database on the primary node and all reads on the secondary node, as in the sketch below.
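
For illustration, this is how the two recommendations look with pymongo; the connection string, database, collection, and field names are assumptions.

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017/")
coll = client["mydb"]["orders"]

# Preferred: create the index while the collection is still empty,
# so the build itself costs almost nothing.
coll.create_index([("user_id", ASCENDING)])

# If the collection already holds many documents, build in the background
# so the primary keeps serving reads and writes during the build.
coll.create_index([("created_at", ASCENDING)], background=True)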
Problem analysis
In the problem we encountered, the index was built in background mode, yet it still blocked readers on the secondary node for nearly one hour. Let us analyze the cause. The timeline of the main events is as follows (a minimal sketch reproducing this sequence follows the timeline):


1. 15:07:00 The primary node starts a background index build; the build takes nearly 1.5 hours.
2. 16:32:25 The index build completes and the primary node records the oplog entry. The secondary node pulls the oplog entry and replays it. Because it is a background build, the secondary starts a separate thread for the build and holds only an intent lock on the database while it runs, so the replay itself blocks readers for only a short time (milliseconds).
3. 16:54:53 The user issues a command to drop an index in the same database. The drop finishes quickly on the primary node; the secondary pulls the oplog entry and starts to replay it. Dropping an index requires an exclusive (mutex) lock on the database, but the background index build on the secondary is still in progress and holds an intent lock on that database, so the exclusive lock cannot be acquired until the background build completes. Meanwhile, the special lock the secondary holds for oplog replay blocks all readers, which is why every read request from this moment on is stuck waiting.
4. 17:54:53 The background index build on the secondary completes; the index drop acquires the exclusive lock and finishes. The replay completes, and the blocked readers finally acquire the lock and are processed.
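
A minimal sketch of the sequence that triggers the problem, using pymongo against a replica set; the hosts, database, collection, and the existing index name "user_id_1" are assumptions for illustration.

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://primary-host:27017/?replicaSet=rs0")
coll = client["mydb"]["orders"]

# Step 1: background index build on a large collection. The call returns
# once the primary finishes; the secondary then replays the build in a
# background thread, which itself takes a long time.
coll.create_index([("created_at", ASCENDING)], background=True)

# Step 2: drop another index in the same database shortly afterwards.
# The drop needs an exclusive database lock, so the secondary cannot
# replay it until its background build finishes, and the oplog-replay
# lock keeps every reader on the secondary waiting in the meantime.
coll.drop_index("user_id_1")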
Summary
The root cause lies in MongoDB's design: the secondary node holds a very coarse-grained lock while replaying the oplog, so all read requests are blocked; if the lock granularity were reduced, dirty reads could occur, which is unacceptable for some business scenarios. For now, as long as the business avoids time-consuming operations that must be replicated to the secondary nodes, the impact can be ignored entirely. If that is really not feasible and prolonged request congestion does occur on the secondary node, you can simply keep the default readPreference and read only from the primary node, for example as sketched below.
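
For example, with pymongo the read preference can be left at (or explicitly set to) primary so that reads never hit the secondaries; the connection string and names below are assumptions.

from pymongo import MongoClient, ReadPreference

# "primary" is the default read preference: all reads go to the primary.
client = MongoClient(
    "mongodb://host1:27017,host2:27017/?replicaSet=rs0",
    readPreference="primary",
)

# The preference can also be set per collection when only some reads
# must avoid the secondaries.
coll = client["mydb"]["orders"].with_options(read_preference=ReadPreference.PRIMARY)
doc = coll.find_one({"user_id": 42})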