[Others]Analysis on slow synchronization of MongoDB on the secondary node
Created#More Posted time:Nov 17, 2016 15:54 PM
Recently, there have been many cases of too-high QPS on the primary node in the production environment, making the secondary node unable to keep up the synchronization with the primary node (the latest oplog timestamp on the secondary node is smaller than the oldest oplog timestamp on the primary node), making the secondary node change to the RECOVERING status. This problem requiring human intervention, namely sending the resync command to the secondary node to enforce the secondary node to perform a full re-synchronization.
The figure below demonstrates the procedure of MongoDB data synchronization.
Writes on the primary node will be recorded in the oplog and saved to a capped collection of a fixed size. The secondary node will take the initiative to pull the oplog from the primary node and replay and apply the oplog to itself to maintain data consistency with the primary node.
When a new node is added (or a new node actively sends the resync command to the secondary node), the secondary node will perform an initial sync operation first, that is the full data synchronization to traverse all the sets of all the databases on the primary node and copy the data to its own node. Then it reads the oplog dated “within the period from full synchronization beginning to end” and replay the oplog. Full synchronization is not the focus of this article. So I will not talk about it much here.
After the full synchronization is complete, the secondary node starts to establish the tailable cursor from the ending time and constantly pulls oplog from the synchronization source, replays and oplog and applies it to itself. This process is not undertaken by a single thread. To enhance the efficiency, MongoDB places the oplog pulling and oplog replaying in two different threads.
• Producer thread: This thread keeps pulling oplog from the synchronization source and adds the oplog to a BlockQueue for preservation. The BlockQueue can store up to 240 MB of oplog data. When this threshold value is exceeded, it will not continue the pulling until the oplog is consumed by the replBatcher.
• ReplBatcher thread: This thread is responsible for retrieving oplog from producer thread queues one by one and placing the oplog into the queues it maintains. This queue allows a maximum of 5,000 elements and the total size of the elements shall not exceed 512 MB. When this queue is full, the replBatcher thread needs to wait for oplogApplication to consume the oplog.
• The oplogApplication will retrieve all the elements in the current queue of the replBatch thread, and distribute the elements into different replWriter threads according to the docId (if the storage engine does not support the doc lock, the distribution will be based on the set name). The replWriter thread applies all the oplog to itself; when all the oplog has been applied, the oplogApplication thread writes all the oplog to the local.oplog.rs set in order.
The statistics of the buffer and apply threads of the producer thread can be queried through db.serverStatus().metrics.repl. During the testing process, we simulated around 10,000 QPS of writes on the primary node and observed the synchronization status on the secondary node. The writing speed on the secondary node is far slower than that on the primary node, namely only around 3,000 QPS. We also observed that the producer buffer quickly reached saturation, from which we can determine that the oplog replaying thread failed to keep up.
Under default circumstances, the secondary node adopts 16 replWriter threads to replay the oplog. You can customize the number of threads through setting the replWriterThreadCount parameter at startup. When the number of threads is increased to 32, the synchronization performance is greatly improved and the write QPS levels on the primary and secondary nodes are basically flat. The data synchronization latency between the primary and secondary nodes is controlled within one second, further validating the above conclusions.
Thoughts on improvements
If the synchronization on the secondary node often fails to catch up with the primary node because of the high write QPS on the primary node, you can consider the following improvements:
• Configure a higher replWriterThreadCount to speed up the oplog replaying on the secondary node. But the cost is a higher memory overhead.
• Use a larger oplog. You can follow the official tutorial to change the size of the oplog. Alibaba Cloud MongoDB adds the patch feature and is capable of modifying the oplog size online.
• Distribute the writeOpsToOplog steps into multiple replWriter threads for concurrent execution. This is one of the official solutions under consideration. Refer to Secondaries unable to keep up with primary under WiredTiger.
In the appendix I provided the methods to modify the replWriterThreadCount parameter. The specific thread count is related with the writing load of the primary node, such as the write QPS, the average document size and so on. There is no uniform value for this parameter.
1. Specify the value using the mongod command line.
mongod --setParameter replWriterThreadCount=32
2. Specify the value in the configuration file.