How does MongoDB guarantee the oplog order? - Alibaba Cloud Developer Forums: Cloud Discussion Forums

Blanche
Engineer

[Others] How does MongoDB guarantee the oplog order?

Posted time: Sep 8, 2016 10:09 AM
In a MongoDB replica set, data is synchronized between the Primary and Secondary nodes through the oplog. When data is written on the Primary node, an oplog entry is recorded; each Secondary node pulls oplog entries from the Primary node and replays them, so that all nodes eventually store the same data set.
Oplog key features
• Idempotence. Whether replayed once or multiple times, each oplog entry produces the same result. To achieve idempotence, MongoDB rewrites some operations before logging them, for example converting an insert into an upsert and a $inc into a $set of the resulting value.
• Capped collection. The oplog uses a fixed amount of storage; once the space is full, the oldest documents are deleted automatically.
• The oplog is ordered by timestamp, and this order is consistent across all nodes.
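The idempotence conversion can be illustrated with a small sketch (the helper names are illustrative, not MongoDB source code):

```python
# Sketch: why MongoDB rewrites $inc into $set in the oplog (illustrative only).

def apply_inc(doc, field, delta):
    """Replaying a raw $inc is NOT idempotent: each replay changes the result."""
    doc[field] = doc.get(field, 0) + delta

def apply_set(doc, field, value):
    """MongoDB logs the *resulting* value ($set), which is idempotent."""
    doc[field] = value

# Primary: the user runs {$inc: {x: 1}} on {x: 4}; the oplog records {$set: {x: 5}}.
primary = {"x": 4}
apply_inc(primary, "x", 1)

# Secondary: replaying the logged $set twice still yields the same document.
secondary = {"x": 4}
apply_set(secondary, "x", 5)
apply_set(secondary, "x", 5)   # accidental second replay
assert secondary == primary    # {"x": 5} either way

# Replaying the raw $inc twice would have diverged:
naive = {"x": 4}
apply_inc(naive, "x", 1)
apply_inc(naive, "x", 1)
assert naive["x"] == 6
```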
How to lock during concurrent writes of oplog?
When writing a document on the Primary node, MongoDB first takes an intent lock on the database, then an intent lock on the collection, and calls the underlying engine interface to write the document. It then takes an intent lock on the local database and on the oplog.rs collection, and writes the oplog entry. For details on MongoDB's multi-granularity intent locking, refer to the official documentation.
Write1
DBLock("db1", MODE_IX);
CollectionLock("collection1", MODE_IX);
storageEngine.writeDocument(...);    
DBLock("local", MODE_IX);
CollectionLock("oplog.rs", MODE_IX);
storageEngine.writeOplog(...);


Write2
DBLock("db2", MODE_IX);
CollectionLock("collection2", MODE_IX);
storageEngine.writeDocument(...);    
DBLock("local", MODE_IX);
CollectionLock("oplog.rs", MODE_IX);
storageEngine.writeOplog(...);
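Both writers take MODE_IX on the local database and on oplog.rs, and intent locks do not conflict with each other, which is why the two oplog writes above can proceed concurrently. A sketch of the standard multi-granularity compatibility matrix (simplified; the real MongoDB lock manager also has queueing policies):

```python
# Sketch of multi-granularity lock-mode compatibility (IS/IX/S/X), simplified.
MODE_IS, MODE_IX, MODE_S, MODE_X = "IS", "IX", "S", "X"

# _COMPATIBLE[held][requested] is True when the two modes can coexist.
_COMPATIBLE = {
    MODE_IS: {MODE_IS: True,  MODE_IX: True,  MODE_S: True,  MODE_X: False},
    MODE_IX: {MODE_IS: True,  MODE_IX: True,  MODE_S: False, MODE_X: False},
    MODE_S:  {MODE_IS: True,  MODE_IX: False, MODE_S: True,  MODE_X: False},
    MODE_X:  {MODE_IS: False, MODE_IX: False, MODE_S: False, MODE_X: False},
}

def compatible(held, requested):
    return _COMPATIBLE[held][requested]

# Write1 and Write2 both hold MODE_IX on "local" and "oplog.rs": no conflict,
# so oplog entries for different collections can be written concurrently.
assert compatible(MODE_IX, MODE_IX)

# An exclusive operation (e.g. dropping the collection) needs MODE_X and
# must wait until all intent locks are released.
assert not compatible(MODE_IX, MODE_X)
```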


How to guarantee the oplog order on the Primary node?
Given the concurrency strategy above, how can the oplog order be guaranteed under concurrent multi-writes?
The oplog is a special capped collection: its documents have no _id field but do have a ts (timestamp) field, and all oplog documents are stored in ts order. The following are some example oplog entries.
{ "ts" : Timestamp(1472117563, 1), "h" : NumberLong("2379337421696916806"), "v" : 2, "op" : "c", "ns" : "test.$cmd", "o" : { "create" : "sbtest" } }
{ "ts" : Timestamp(1472117563, 2), "h" : NumberLong("-3720974615875977602"), "v" : 2, "op" : "i", "ns" : "test.sbtest", "o" : { "_id" : ObjectId("57bebb3b082625de06020505"), "x" : "xkfjakfjdksakjf" } }
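The ts value is a (seconds, counter) pair, so ordering oplog entries by ts is just lexicographic comparison of that pair, as a small sketch shows:

```python
# Sketch: an oplog ts is a (seconds, counter) pair; the key order used by the
# storage engine is simply lexicographic comparison of that pair.
def ts_key(seconds, counter):
    return (seconds, counter)

entries = [
    ts_key(1472117563, 2),   # the insert into test.sbtest
    ts_key(1472117563, 1),   # the create of test.sbtest
]
# Sorting by key reproduces the oplog order shown above: within the same
# second, entries are ordered by the counter component.
assert sorted(entries) == [(1472117563, 1), (1472117563, 2)]
```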


WiredTiger, for example, writes each oplog document as a KV record with the ts field as the key and the document content as the value. The engine guarantees that stored records (in both btree and LSM trees) are ordered by key, which solves the problem of ordering documents by the ts field. But a disorder problem remains under concurrency, for example:
Suppose three oplog entries are written concurrently with timestamps ts1, ts2 and ts3 (ts1 < ts2 < ts3), and ts1 and ts3 commit first. If the Secondary node pulls at that moment, it sees those two entries; when ts2 commits later and is pulled, the Secondary observes the order ts1, ts3, ts2, i.e. the oplog appears out of order.
MongoDB's solution (with the WiredTiger engine) is to enforce a visibility limit on reads so that what the Secondary node sees is always in order. The mechanism works as follows:

1. Before writing to the oplog, allocate the oplog timestamp under a lock and register it in an uncommitted list
lock();
ts = getNextOpTime(); // generated from the current timestamp + a counter
_uncommittedRecordIds.insert(ts);
unlock();


2. Write the oplog entry; after the write completes, remove its timestamp from the uncommitted list.
writeOplog(ts, oplogDocument);
lock();
_uncommittedRecordIds.erase(ts);
unlock();


3. When pulling oplogs
if (_uncommittedRecordIds.empty()) {
    // all oplogs are readable
} else {
    // only oplogs before the minimum of uncommitted list can be pulled
}


With the rules above, oplog entries on the Primary node are guaranteed to be stored in ts order, and the Secondary node is eventually able to read all of them in that order.
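The three steps above can be sketched as a small visibility tracker (class and method names are illustrative, not MongoDB internals):

```python
# Sketch of the "uncommitted list" visibility rule from steps 1-3: readers may
# only see oplog entries with ts strictly below the minimum uncommitted ts.
class OplogVisibility:
    def __init__(self):
        self.committed = []        # (ts, doc) records already written
        self.uncommitted = set()   # timestamps allocated but not yet written

    def allocate(self, ts):
        self.uncommitted.add(ts)       # step 1: register under lock

    def commit(self, ts, doc):
        self.committed.append((ts, doc))
        self.uncommitted.discard(ts)   # step 2: remove after the write

    def readable(self):
        # Step 3: cap reads at min(uncommitted); everything becomes visible
        # once the uncommitted list is empty.
        limit = min(self.uncommitted) if self.uncommitted else float("inf")
        return sorted(ts for ts, _ in self.committed if ts < limit)

log = OplogVisibility()
for ts in (1, 2, 3):
    log.allocate(ts)

# ts1 and ts3 finish first; ts2 is still in flight.
log.commit(1, "op1")
log.commit(3, "op3")
assert log.readable() == [1]        # ts3 stays hidden until ts2 commits

log.commit(2, "op2")
assert log.readable() == [1, 2, 3]  # the Secondary now sees ts in order
```

This is how the ts1/ts3/ts2 scenario above is resolved: ts3 is written but not yet visible, so the Secondary never observes a gap.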
How to guarantee that the order of oplogs on the Secondary node is consistent with that on the Primary node?
After the Secondary node pulls oplog entries to local, they are replayed by multiple threads, while a dedicated thread writes the pulled entries into the local local.oplog.rs collection exactly as received; this guarantees that the oplog order on the Secondary node is consistent with that on the Primary node.
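A sketch of this split between parallel replay and ordered oplog writing (illustrative only; in the real implementation, operations on the same document are hashed to the same worker thread so their relative order is preserved):

```python
# Sketch: workers may apply a batch of idempotent operations in any order
# (here each op touches a distinct field, standing in for distinct documents),
# while the local oplog is written exactly as pulled, in ts order.
import random

def replay_batch(batch):
    """batch is a list of (ts, field, value) entries in ts order."""
    local_oplog = []
    data = {}
    # Workers may complete operations in any order within the batch...
    shuffled = batch[:]
    random.shuffle(shuffled)
    for ts, field, value in shuffled:
        data[field] = value          # idempotent $set-style apply
    # ...but local.oplog.rs receives the entries in the original pulled order.
    for entry in batch:
        local_oplog.append(entry)
    return local_oplog, data

batch = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]
oplog, data = replay_batch(batch)
assert oplog == batch                        # oplog order preserved as pulled
assert data == {"a": 10, "b": 20, "c": 30}   # same data set either way
```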