[Others] Analysis of high read latency (synchronization lock) on the MongoDB secondary node

Posted time: Nov 21, 2016 17:14
• In a MongoDB replica set, the secondary node continuously pulls batches of oplog entries from the primary node and replays them locally to keep its data consistent with the primary.
• After the secondary pulls a batch of oplog entries, it takes a special lock - Lock::ParallelBatchWriterMode - while replaying the batch, which blocks all read requests until the whole batch has been replayed. There are two reasons for this.
o To avoid dirty reads as much as possible. A batch of data cannot be read by users until the corresponding batch of oplog entries has been replayed completely.
o To preserve synchronization performance as much as possible. If an ordinary lock were taken while replaying the oplog, the replay would compete for lock resources with normal read/write requests; under a heavy read load on the secondary node, replication would inevitably fall further and further behind. Refer to SERVER-18190.
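The blocking behavior described above can be sketched as a toy reader/batch-writer lock in Python. This is a simplified analogue for illustration only, not MongoDB's actual implementation: new readers wait while a batch is being applied, and the batch waits for in-flight readers to drain.

```python
import threading

class BatchWriterLock:
    """Toy analogue of Lock::ParallelBatchWriterMode: readers normally run
    concurrently, but while a batch of oplog entries is being applied,
    every new read request blocks until the whole batch is done."""

    def __init__(self):
        self._cond = threading.Condition()
        self._batch_in_progress = False
        self._active_readers = 0

    def begin_batch(self):
        with self._cond:
            self._batch_in_progress = True      # stop admitting new readers
            while self._active_readers > 0:     # wait for in-flight reads
                self._cond.wait()

    def end_batch(self):
        with self._cond:
            self._batch_in_progress = False
            self._cond.notify_all()             # wake the blocked readers

    def begin_read(self):
        with self._cond:
            while self._batch_in_progress:      # blocked during batch replay
                self._cond.wait()
            self._active_readers += 1

    def end_read(self):
        with self._cond:
            self._active_readers -= 1
            self._cond.notify_all()

lock = BatchWriterLock()
seen = []

def reader():
    lock.begin_read()
    seen.append("read served")
    lock.end_read()

lock.begin_batch()                 # replay of a batch of oplog begins
t = threading.Thread(target=reader)
t.start()
t.join(timeout=0.2)
print(seen)                        # [] - the read is still blocked
lock.end_batch()                   # batch replayed completely
t.join()
print(seen)                        # ['read served']
```

In the real server, `begin_batch`/`end_batch` correspond roughly to acquiring and releasing Lock::ParallelBatchWriterMode around a batch apply, and the blocked readers are ordinary queries on the secondary node.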
Case analysis
Because of the behavior above, some users may observe long read latency on the secondary node when oplog replaying holds this special lock for a long time.
Issue 1: A prolonged single request
A foreground index build that takes an hour to replay on the secondary node holds the ParallelBatchWriterMode lock for that entire hour, blocking every request on the secondary node for an hour. I have analyzed this case in Why the hour-long request congestion on the secondary node? and will not repeat it here. The solution is to build indexes in the background whenever possible.
Beyond affecting read requests on the secondary node, this scenario can also block writes on the primary node: if writes on the primary specify a writeConcern that requires acknowledgement from multiple nodes ({w: 2+}), then while the secondary node is stuck in the index build, all subsequent oplog replaying must wait for the build to complete, and those writes cannot be acknowledged.
Issue 2: Long accumulated duration of multiple requests
When write concurrency on the primary node is high, the secondary node may pull a large batch of oplog entries each time and replay them concurrently. Each individual entry may replay quickly, but with hundreds or thousands of entries replayed in one batch, the accumulated time can be considerable. Read requests on the secondary node are blocked for the duration of the replay, so the average read latency on the secondary node may be somewhat higher than on the primary. As long as the latency stays within an acceptable range, this has no practical impact; in our experience with cloud users, the vast majority never notice the difference.
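A back-of-envelope calculation makes the accumulation concrete (both numbers below are assumptions chosen for illustration, not measurements from a real deployment):

```python
# Assumed, illustrative numbers - not measurements.
entries_per_batch = 5000      # oplog entries pulled in one batch
replay_us_per_entry = 100     # 0.1 ms to replay a single entry

# Total time the batch holds the lock, i.e. how long secondary
# reads are blocked for this batch.
blocked_ms = entries_per_batch * replay_us_per_entry / 1000
print(blocked_ms)  # 500.0 -> reads wait about half a second per batch
```

Even with a fast per-entry replay, a large enough batch keeps readers waiting for a noticeable fraction of a second.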
But you should pay attention to one circumstance:

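As an illustration of that circumstance, the following Python sketch is a toy model of how an update with the {multi: true} option that matches two documents is expanded into one idempotent oplog entry per document. The field names mimic, but greatly simplify, MongoDB's real oplog format, and the update modeled here is a hypothetical $inc on a counter field.

```python
# Toy model: a {multi: true} update is logged as one idempotent
# oplog entry per matched document (the $inc is rewritten as $set,
# recording the resulting value rather than the increment).
def expand_update_to_oplog(docs, match, inc_field, inc_by):
    """Return one idempotent oplog entry per matched document."""
    oplog = []
    for doc in docs:
        if all(doc.get(k) == v for k, v in match.items()):
            oplog.append({"op": "u",
                          "o2": {"_id": doc["_id"]},
                          "o": {"$set": {inc_field: doc[inc_field] + inc_by}}})
    return oplog

def apply_oplog(docs, oplog):
    """Replay oplog entries; replaying twice leaves the same state."""
    by_id = {d["_id"]: d for d in docs}
    for entry in oplog:
        by_id[entry["o2"]["_id"]].update(entry["o"]["$set"])

docs = [{"_id": 1, "name": "a", "cnt": 0},
        {"_id": 2, "name": "a", "cnt": 5},
        {"_id": 3, "name": "b", "cnt": 9}]

# db.coll.update({name: "a"}, {$inc: {cnt: 1}}, {multi: true})
oplog = expand_update_to_oplog(docs, {"name": "a"}, "cnt", 1)
print(len(oplog))          # 2 - one entry per matched document
apply_oplog(docs, oplog)
apply_oplog(docs, oplog)   # replaying again changes nothing (idempotent)
print([d["cnt"] for d in docs])  # [1, 6, 9]
```

The key point is that one logical update fans out into as many oplog entries as there are matched documents, and the secondary must replay every one of them under the lock.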
When an update operation specifies the {multi: true} option and matches two documents, one oplog entry is generated for each updated document (mainly to keep the oplog idempotent). If the matched documents number in the thousands or tens of thousands, the same number of oplog entries is generated, and the secondary node must pull and replay all of them. In this scenario the cost of the update operation is multiplied on the secondary node, and the read latency there can be affected severely.
How to evaluate the impact of the lock during oplog replaying?
From the cases above, we can see that the secondary node may suffer very high read latency in certain scenarios. How, then, can we confirm that the actual cause of a problem is that oplog replaying holds the lock on the secondary node for too long?
Our practice is to add an audit log that records the time spent applying each batch of oplog entries on the secondary node, so we can see directly how large the impact is. Below is a simplified version of the logging patch; you can apply it to MongoDB 3.2 as needed.
diff --git a/src/mongo/db/repl/sync_tail.cpp b/src/mongo/db/repl/sync_tail.cpp
index 50517c2..e7d58bc 100644
--- a/src/mongo/db/repl/sync_tail.cpp
+++ b/src/mongo/db/repl/sync_tail.cpp
@@ -550,6 +550,8 @@ OpTime SyncTail::multiApply(OperationContext* txn,
     // stop all readers until we're done
     Lock::ParallelBatchWriterMode pbwm(txn->lockState());

+    unsigned long long startTime = curTimeMicros64();
+
     if (inShutdownStrict()) {
         log() << "Cannot apply operations due to shutdown in progress";
         return OpTime();
@@ -585,6 +587,8 @@ OpTime SyncTail::multiApply(OperationContext* txn,
         setMinValid(txn, boundaries->end, DurableRequirement::None);  // Mark batch as complete.

+    log() << "batch writer cost " << (curTimeMicros64() - startTime) << " us";
+
     return lastOpTime;