[Others]Mongodump archive mode principles

Posted time: Aug 30, 2016 16:16 PM
Mongodump is the official logical backup tool for MongoDB. This article explains how its archive mode works.
Archive mode
For archive output, Mongodump interleaves the data of multiple collections and writes it to a single output file. This has two advantages:
1. It is convenient to send over a network.
2. It is convenient to use together with other tools, for example using the output as the input for Mongorestore.
The archive mode of Mongodump supports dumping multiple collections concurrently, which is generally hard to do when everything ends up in one file. Let’s see how it does it.
Mongodump is written in Go, so it makes heavy use of goroutines. Two types of goroutines matter here: the backup goroutines that dump the data (their number, that is, the degree of parallelism, is configurable), and the multiplexer goroutine that interleaves the data of the various collections and actually writes it to the archive file. See the figure below:

The multiplexer goroutine receives data from the backup goroutines through channels, using a select call. It is designed so that channels can be added to and removed from the select dynamically. At initialization there is only a control channel, which receives the channels to be monitored. When a backup goroutine is about to back up a collection, it creates three channels: writeChan, for sending the collection's data to the multiplexer goroutine; writeLenChan, for receiving from the multiplexer goroutine the length of the data that was successfully written; and writeCloseFinishedChan, for receiving the signal from the multiplexer goroutine that the collection's dump is complete. Besides the three channels, it also creates a 16MB buffer. It wraps all of these in a MuxIn data structure and sends it to the multiplexer goroutine through the control channel.

Once started, the multiplexer goroutine runs select in a loop. When it receives a MuxIn from the control channel, it adds that MuxIn's writeChan to the channels it selects on and stores the MuxIn in its own array. In the next round, it listens on both the control channel and the writeChan.
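A select statement in Go has a fixed set of cases, so a dynamic set of channels needs something else. One way to get this behaviour is reflect.Select from the standard library. The following is a minimal sketch under that assumption, not Mongodump's actual code: the MuxIn here is cut down to a name and a writeChan, and runMux and demo are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// MuxIn is a simplified stand-in for the structure described above.
// Only the fields needed for this sketch are included.
type MuxIn struct {
	name      string
	writeChan chan []byte
}

// runMux selects over the control channel plus every registered writeChan.
// It returns what it received, in order, once the control channel is closed.
func runMux(control chan *MuxIn) []string {
	// cases[0] is always the control channel; ins[i] pairs with cases[i].
	cases := []reflect.SelectCase{
		{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(control)},
	}
	ins := []*MuxIn{nil}
	var received []string
	for {
		i, v, ok := reflect.Select(cases)
		switch {
		case i == 0 && !ok:
			// Control channel closed: the multiplexer exits.
			return received
		case i == 0:
			// New MuxIn: start listening on its writeChan as well.
			in := v.Interface().(*MuxIn)
			cases = append(cases, reflect.SelectCase{
				Dir:  reflect.SelectRecv,
				Chan: reflect.ValueOf(in.writeChan),
			})
			ins = append(ins, in)
		case !ok:
			// A writeChan was closed: stop listening on it.
			cases = append(cases[:i], cases[i+1:]...)
			ins = append(ins[:i], ins[i+1:]...)
		default:
			received = append(received, ins[i].name+":"+string(v.Bytes()))
		}
	}
}

// demo registers two inputs and sends one chunk on each.
func demo() []string {
	control := make(chan *MuxIn)
	done := make(chan []string)
	go func() { done <- runMux(control) }()

	a := &MuxIn{name: "db.a", writeChan: make(chan []byte)}
	b := &MuxIn{name: "db.b", writeChan: make(chan []byte)}
	control <- a
	control <- b
	a.writeChan <- []byte("1")
	b.writeChan <- []byte("2")
	close(a.writeChan)
	close(b.writeChan)
	close(control)
	return <-done
}

func main() {
	fmt.Println(demo()) // prints [db.a:1 db.b:2]
}
```

All channels in the sketch are unbuffered, so each send returns only after the multiplexer has picked it up, which is why the order of the received data is predictable here.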
Dump process
Each working backup goroutine dumps data from MongoDB while sending data to the multiplexer goroutine through writeChan. This is a typical producer-consumer model. To keep the consumer's disk I/O from slowing down the producer (which reads data from MongoDB), the backup goroutine starts another goroutine for the dump and passes the data along through a buffChan channel. Exit signals are also handled in this goroutine: the primary goroutine starts a signal-handling goroutine that captures SIGTERM, SIGINT and SIGHUP and notifies the other goroutines through a termChan. When the exit signal is received, this goroutine closes buffChan and exits.
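The producer-consumer split with buffChan and termChan can be sketched as follows. This is an illustration under assumptions, not the tool's real code: documents are plain strings, the write callback stands in for sending data towards the multiplexer, and dumpDocs and collect are hypothetical names. The extra non-blocking check on termChan is a choice made for this sketch so that a pending exit signal always wins over a ready consumer.

```go
package main

import "fmt"

// dumpDocs starts a reader goroutine (the producer) that feeds documents
// into buffChan, while the caller's goroutine (the consumer) writes them out.
// Closing termChan asks the reader to stop early.
func dumpDocs(docs []string, termChan <-chan struct{}, write func(string)) {
	buffChan := make(chan string)
	go func() {
		// Closing buffChan tells the consumer that no more data will come.
		defer close(buffChan)
		for _, d := range docs {
			// Check for an exit signal first so it takes priority.
			select {
			case <-termChan:
				return
			default:
			}
			// Otherwise hand the document over, still watching termChan.
			select {
			case <-termChan:
				return
			case buffChan <- d:
			}
		}
	}()
	for d := range buffChan {
		write(d)
	}
}

// collect runs dumpDocs and gathers everything that was written.
func collect(docs []string, termChan <-chan struct{}) []string {
	var got []string
	dumpDocs(docs, termChan, func(d string) { got = append(got, d) })
	return got
}

func main() {
	docs := []string{"doc1", "doc2", "doc3"}

	// No exit signal: every document reaches the consumer.
	fmt.Println(collect(docs, make(chan struct{})))

	// Exit signal already raised: the reader stops before sending anything.
	stopped := make(chan struct{})
	close(stopped)
	fmt.Println(collect(docs, stopped))
}
```

In a real program the termChan would be closed by a goroutine that listens for signals, for example through the os/signal package.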

When the backup goroutine's buffer is full, it sends the buffered data to the multiplexer goroutine through writeChan and waits for a reply on writeLenChan. If the returned length of successfully written data does not match the length of the buffer, the write is reported as failed.
The multiplexer goroutine receives data from writeChan and writes it to the archive file. Several backup goroutines may be working at the same time, so their data has to be interleaved. This is done by splitting each collection into slices, where each slice consists of a header, a body and a terminator. The header is the collection's namespace (database and collection name), the body is the collection data, and the terminator is a 4-byte end marker. In addition, the last slice of each collection is followed by an EOF slice that marks the end of the collection. The multiplexer goroutine keeps track of the collection it is currently writing. When data for a different collection arrives, it ends the current slice (by writing a terminator) and starts a new one (by writing a new header). It then sends the length of the successfully written data back to the backup goroutine through writeLenChan. The physical format of the archive file is shown as follows:
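The buffered write and its length check can be sketched like this. The type muxIn, its limit field, the flush method and the roundTrip helper are hypothetical names for this illustration, and the tiny buffer limit replaces the 16MB buffer so the behaviour is easy to follow. The shortBy parameter only exists to simulate a multiplexer that reports a short write.

```go
package main

import (
	"errors"
	"fmt"
)

var errShortWrite = errors.New("multiplexer wrote fewer bytes than were sent")

// muxIn is a simplified, hypothetical version of the structure in the text.
type muxIn struct {
	buf          []byte
	limit        int // buffer capacity (16MB in the text, tiny here)
	writeChan    chan []byte
	writeLenChan chan int
}

// flush sends the buffer to the multiplexer and checks the reported length.
func (in *muxIn) flush() error {
	if len(in.buf) == 0 {
		return nil
	}
	in.writeChan <- in.buf
	if n := <-in.writeLenChan; n != len(in.buf) {
		return errShortWrite
	}
	// The multiplexer is done with the data once it has replied.
	in.buf = in.buf[:0]
	return nil
}

// Write buffers p and flushes first when p would not fit.
// A p larger than limit is still buffered whole in this sketch.
func (in *muxIn) Write(p []byte) (int, error) {
	if len(in.buf)+len(p) > in.limit {
		if err := in.flush(); err != nil {
			return 0, err
		}
	}
	in.buf = append(in.buf, p...)
	return len(p), nil
}

// roundTrip pushes chunks through a muxIn. A stand-in multiplexer appends
// them to out and reports len(b)-shortBy as the written length.
func roundTrip(chunks []string, limit, shortBy int) (string, error) {
	in := &muxIn{
		limit:        limit,
		writeChan:    make(chan []byte),
		writeLenChan: make(chan int),
	}
	var out []byte
	done := make(chan struct{})
	go func() {
		defer close(done)
		for b := range in.writeChan {
			out = append(out, b...)
			in.writeLenChan <- len(b) - shortBy
		}
	}()

	var err error
	for _, c := range chunks {
		if _, err = in.Write([]byte(c)); err != nil {
			break
		}
	}
	if err == nil {
		err = in.flush()
	}
	close(in.writeChan)
	<-done
	return string(out), err
}

func main() {
	fmt.Println(roundTrip([]string{"abc", "def", "gh"}, 4, 0))
	fmt.Println(roundTrip([]string{"abc", "def", "gh"}, 4, 1))
}
```

The first call prints the full data with a nil error. The second stops at the first flush, because the reported length no longer matches the buffer length.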

When the backup goroutine finishes dumping a collection, it closes writeChan and then writeLenChan, and waits to receive from writeCloseFinishedChan. When the multiplexer goroutine detects that writeChan has been closed, it sends on writeCloseFinishedChan and writes the terminator for the current slice followed by the EOF slice. The purpose of writeCloseFinishedChan is to prevent the multiplexer goroutine's control channel from being closed before the closure of writeChan has been handled: if the control channel is closed, the entire dump process terminates. Once the backup goroutine receives from writeCloseFinishedChan, it moves on to the next collection, until all collections have been processed.
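The close handshake can be sketched as below. This is a simplified model with hypothetical names (muxIn, newMuxIn, serve, dumpOne): serve plays the multiplexer's role for a single input, and a string log stands in for the archive file. One deliberate difference from the description: the sketch records the EOF entry before signalling, so that the caller can read the log safely afterwards.

```go
package main

import "fmt"

// muxIn is a simplified, hypothetical version of the structure in the text.
type muxIn struct {
	name                   string
	writeChan              chan []byte
	writeLenChan           chan int
	writeCloseFinishedChan chan struct{}
}

func newMuxIn(name string) *muxIn {
	return &muxIn{
		name:                   name,
		writeChan:              make(chan []byte),
		writeLenChan:           make(chan int),
		writeCloseFinishedChan: make(chan struct{}),
	}
}

// Close is called by the backup goroutine when a collection is finished.
// It does not return until the multiplexer has seen the closed writeChan.
func (in *muxIn) Close() {
	close(in.writeChan)
	close(in.writeLenChan)
	<-in.writeCloseFinishedChan
}

// serve stands in for the multiplexer handling one input.
func serve(in *muxIn, log *[]string) {
	for b := range in.writeChan {
		*log = append(*log, "body:"+string(b))
		in.writeLenChan <- len(b)
	}
	// writeChan was closed: record the end of the collection, then
	// release the backup goroutine waiting in Close.
	*log = append(*log, "eof:"+in.name)
	in.writeCloseFinishedChan <- struct{}{}
}

// dumpOne sends chunks for one collection and performs the close handshake.
func dumpOne(name string, chunks []string) []string {
	var log []string
	in := newMuxIn(name)
	go serve(in, &log)
	for _, c := range chunks {
		in.writeChan <- []byte(c)
		<-in.writeLenChan
	}
	in.Close()
	// Safe to read log: Close returned only after serve's final send.
	return log
}

func main() {
	fmt.Println(dumpOne("db.a", []string{"x", "y"})) // prints [body:x body:y eof:db.a]
}
```

Because Close blocks until the multiplexer replies, the caller cannot go on to shut down the control channel while the end of this collection is still unprocessed.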
When to exit Goroutines
The primary goroutine creates a resultChan when it creates the backup goroutines. When a backup goroutine finds that all collections have been dumped, or hits an error during the dump, it sends a result to resultChan and exits. The primary goroutine returns once it has received a result from every backup goroutine, or as soon as any backup goroutine reports an error. When the primary goroutine has completed all its tasks and returns, it closes the multiplexer goroutine's control channel so that the multiplexer goroutine can exit.
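The exit logic around resultChan can be sketched as follows. The shared jobs queue, dumpAll, failOn and errDump are hypothetical names used for illustration, the dump callback stands in for dumping one collection, and closing the multiplexer's control channel is left out to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

var errDump = errors.New("dump failed")

// dumpAll starts the given number of backup goroutines. Each one takes
// collections from a shared queue and reports exactly one result on
// resultChan: nil when the queue is empty, or the first error it hits.
func dumpAll(collections []string, workers int, dump func(string) error) error {
	jobs := make(chan string, len(collections))
	for _, c := range collections {
		jobs <- c
	}
	close(jobs)

	// Buffered so that workers can still report after an early return.
	resultChan := make(chan error, workers)
	for i := 0; i < workers; i++ {
		go func() {
			for c := range jobs {
				if err := dump(c); err != nil {
					resultChan <- err
					return
				}
			}
			resultChan <- nil
		}()
	}

	// Return on the first error, or after every worker has reported.
	for i := 0; i < workers; i++ {
		if err := <-resultChan; err != nil {
			return err
		}
	}
	return nil
}

// failOn returns a dump function that fails for one collection name.
func failOn(bad string) func(string) error {
	return func(c string) error {
		if c == bad {
			return errDump
		}
		return nil
	}
}

func main() {
	cols := []string{"a", "b", "c"}
	fmt.Println(dumpAll(cols, 2, failOn("")))  // prints <nil>
	fmt.Println(dumpAll(cols, 2, failOn("b"))) // prints dump failed
}
```

Giving resultChan one buffer slot per worker means a worker that finishes after the primary goroutine has already returned does not block forever on its send.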
Concluding remarks
Mongodump manages to write everything into a single archive file while still dumping collections concurrently. The design is a useful reference for similar problems.