Assistant Engineer
Assistant Engineer
  • UID626
  • Fans1
  • Follows1
  • Posts52

[Others]What You Should Know about MongoDB Sharding

More Posted time:Oct 11, 2016 9:29 AM
Principles of MongoDB Sharded Cluster
If you know nothing about MongoDB Sharded Cluster, please read the document below first.

When to use Sharded Cluster?
You usually consider using Sharded Cluster to solve the following two problems.
1. Storage capacity is limited by a single computer, that is, there is limitation on disk resources.
2. Read/write capacities are limited by a single computer (the read capacity can be expanded through adding secondary nodes into replica sets). This is because there is a limitation on CPU, memory, network card or other resources, and read/write capacity can't be expanded.
If you haven't encountered these problems, using MongoDB is recommended, which is much easier than Sharded Cluster in terms of management and maintenance.
How to determine the number of shards and mongos?
When you decide to use Sharded Cluster, a new problem presents itself: how many shards and mongos should you deploy? The richest man in China has guided us in solving this problem, and he said: Set a small target first, for example first deploy 1,000 shards, and then gradually expand it according to requirements.
Back to the topic, the number of shards and mongos depends ultimately on application requirements. If sharding is only used for data warehousing instead of OLTP, when a single shard can store M, the required total storage is N.
numberOfShards = N / M / 0.75    (volume waterline is assumed to be 75%)
  numberOfMongos = 2+ (Due to the relatively low access requirements, at least two mongos should be deployed for high availability.)

If the sharding is used to solve the problem about high-concurrency data write/read, the total data size is actually small. The shards and mongos deployed must meet requirements for read/write performance, and the volume is not a significant point of consideration. If the max qps of a single shard and a single mongos are M and Ms respectively, the total qps required is N. (Note: The service capacity of mongos and mongod should be based on the measurement by the user according to access features.)
numberOfShards = Q / M  / 0.75    (Load waterline is assumed to be 75%)
numberOfMongos = Q / Ms / 0.75

If the sharding aims to solve both problems above, predication should be carried out as per the higher requirement. The above estimation is based on the ideal case of uniform distribution of data and requests in Sharded Cluster. But in the actual situation, they may not distribute evenly. Here a concept of imbalance coefficient D (This is my personal idea, not a general concept) is introduced, and it means the shard in which most data or requests is distributed is D times of the mean value. The actual number of required shards and mongos should be the above estimated value multiplied by the unbalance coefficient D.
To make the load distribution of the system as even as possible, a reasonable shard key should be selected.
How to select a shard key?
MongoDB Sharded Cluster supports two types of partitions, and they both have advantages and disadvantages.
range partition, it generally provides a lot of support for range query based on shard key.
Hash partition, it can distribute reads evenly into each shard.
Both partitions above can't solve the following problems:
1. Low cardinality of shard key, for example, if the data center is taken as a shard key, as generally there are not many data centers, the partition effect is certainly not good.
2. When there are many files in a shard key value, it results in a single chunk being very big (jumbo chunk), and this will affect chunk moving and load balancing.
3. Query and update operations based on shard key will become scatter-gather queries, and this will affect the efficiency.
A good shard key should possess the following features:
• sufficient cardinality of key distribution
• evenly distributed write
• Avoid scatter-gather query (targeted read)
For example, an Internet of Things application stores logs for a lot of equipment into MongoDB Sharded Cluster. Suppose there are millions of pieces of equipment, each of them delivers log data including deviceId and timestamp to MongoDB once every 10 seconds. The most common application query request is: related log information of a specific piece of equipment within a specific period. (Readers can estimate the magnitude independently. From either read or data size, sharding should be used to enable horizontal expansion).
• Solution 1: The timestamp is taken as a shard key, range partition.
o Bad
o New writes have continuous timestamps, and they may make requests from the same shard, which results in unevenly distributed writes.
o All queries based on deviceId may scatter to all shards during query, which results in low efficiency.
• Solution 2: The timestamp is taken as a shard key, hash partition.
o Bad
o Evenly distributed writes into various shards
o All queries based on deviceId may scatter to all shards during query, which results in low efficiency.
• Solution 3: DeviceId is taken as a shardKey, hash partition. (If id features no distinct rules, so does range partition.)
o Bad
o Evenly distributed writes into various shards
o Data corresponded to the same deviceId can't be divided further, and they can only be scattered into one chunk, which results in a jumbo chunk.
o When a query based on deviceId involves only single shard without adequate shards, requests may be routed to a single shard, and then the full table is scanned and sequenced according to the requirements for the range query of timestamp.
• Solution 4: (deviceId, timestamp) are combined into a shardKey, range partition (Better)
o Good
o Evenly distributed writes into various shards
o The data from the same deviceId can be scattered into various chunks according to timestamp.
o (deviceId, timestamp) composite index can be applied directly to complete the process based on data about the deviceId query time range.
About jumbo chunk and chunk size
Jumbo chunk refers to a chunk that is huge or contains too many files, and that can't split.
If MongoDB cannot split a chunk that exceeds the specified chunk size or contains a number of documents that exceeds the max, MongoDB labels the chunk as jumbo.

For MongoDB, the default chunk size is 64MB. If a chunk exceeds 64MB and can't split (for instance, shard keys of all files are the same), they will be marked as jumbo chunk,and the balancer doesn't move these chunks,resulting in unbalanced load. This situation should be avoided.
Once jumbo chunk occurs, if the server load balancer requirements aren't high, it can be neglected and it doesn't affect the data read/write access. If they must be processed, the method below are available.
1. Split the jumbo chunk. Once it is split, mongos automatically removes the jumbo mark.
2. Chunks which can’t be split are no longer jumbo chunks. Try to remove the jumbo mark from the chunk manually (note: back up the config database first to avoid damaging the config database due to mis-operation).
3. The last method is an increased chunk size. When it doesn't exceed limits, jumbo mark will be removed. But this is a stopgap approach. As data is written, it will occur again. The fundamental solution is to arrange shard keys rationally.
As for how to set the chunk size, in most cases, use the default chunk size, and it needs adjustment in the following scenarios (generally between 1-1024).
• In moving, if IO load is too high, set a smaller chunk size.
• In testing, set a small chunk size to help verify the effect.
• If initial chunk size setting is unsuitable, resulting in the server load balancer being affected due to many jumbo chunks, try to increase chunk size.
• When un-partition set is switched into partition set, if the capacity of the set is too large, it's necessary (the situation occurs when the data size reaches T) to increase chunk size in order to switch. Please refer to Sharding Existing Collection Data Size.
Tag aware sharding
Tag aware sharding is a very useful feature of Sharded Cluster. It enables the user to customize the distribution principles by themselves. The principles of tag aware sharding:
1. Shard is set with tag A through sh.addShardTag().
2. A specific chunk range of a set is set with tag A through sh.addTagRange(). The final MongoDB can ensure chunk range (or the superset of the range) within which tag A is set is distributed in the shard set with tag A.
Tag aware sharding is applied in the following scenarios.
• Set equipment room tag on shards in different equipment rooms, and data within different chunk ranges are distributed into the specified equipment room.
• Set service level tags on shards with unavailable service capacity, and more chunks are scattered to shards with good service capacity.
• ...
When tag aware sharding is used, it is worth noting that chunk distribution into the shards of corresponding tags can't be completed at once. Instead, it is completed gradually including triggering split and moveChunk after constantly inserting and updating. Balancer should be enabled as well. Therefore, you may find after tag range is set for a period of time that write isn't distributed into the shard corresponding with the tag.
About server load balancer
The automatic server load balancing of MongoDB Sharded Cluster is currently realized by the background thread of mongos. Besides, only one moving task is permitted for each set at a time. Load balancing depends mainly on the number of chunks gathered on each shard, chunk moving is triggered when the difference exceeds a certain threshold value (related to the total number).
Server load balancer is enabled by default. To protect online businesses from chunk moving, you can set a moving implementation window, for instance, moving is only allowed between 2:00-6:00 am.
use config
   { _id: "balancer" },
   { $set: { activeWindow : { start : "02:00", stop : "06:00" } } },
   { upsert: true }

Besides, when sharding backup (via mongos or backup config server and all shards separately) is carried out, the server load balancer should be stopped to avoid inconsistent status of data after backup.

moveChunk archive setting
Usage of 3.0 and the former version of Sharded Cluster may encounter a problem, that after data write is stopped, disk space in the data catalog is continuously occupied.
This behavior depends on the sharding.archiveMovedChunks configuration item. The configuration item is true in 3.0 and former versions. That is, when a chunk is moved, the source shard archives the moved chunk data into a data catalog for use in case of any event. In other words, when a chunk moves, the source node space isn't released, and the target node occupies new space.
For version 3.2, the default of the configuration item is also set to false, that is moveChunk data will not be archived in the source shard.
Setting of recoverShardingState
When using MongoDB Sharded Cluster, you may encounter another problem, that is, shards don't operate normally after being started. When ismaster is called on primary node, the result is true, and other commands can't be executed. The status is as follows:
mongo-9003:PRIMARY> db.isMaster()
    "hosts" : [
    "setName" : "mongo-9003",
    "setVersion" : 9,
    "ismaster" : false,  // primary 的 ismaster 为 false???
    "secondary" : true,
    "primary" : "host1:9003",
    "me" : "host1:9003",
    "electionId" : ObjectId("57c7e62d218e9216c70aa3cf"),
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 1000,
    "localTime" : ISODate("2016-09-01T12:29:27.113Z"),
    "maxWireVersion" : 4,
    "minWireVersion" : 0,
    "ok" : 1

When viewing error logs, you may find that the shard can't connect with the config server. This depends on sharding.recoverShardingState, which is set true by default. In other words, once started, shards connect to the config server to execute initialization of sharding status, and if the config server connection fails, initialization isn't completed and shard status is abnormal.
Some users encounter the above problems when moving all nodes on Sharded Cluster to a new host, this is because information about the config server changes, while the shard still connects with the original config server once started. You can solve the problem by adding setParameter recoverShardingState=false to the startup command line when starting the shard.
The above default design is indeed irrational to some extent. The abnormality of the config server shouldn't affect the shard, and the presentation of the final problem is not explicit. MongoDB version 3.4 modifies this, and the parameter is removed, so by default, there is no recoverShardingState logic. Please refer to SERVER-24465 for details.
There are too many problems to consider. What should I do?
Alibaba Cloud has provided MongoDB cloud database service, which can solve all MongoDB operation and maintenance management related problems for developers and enable them to be focus on business development. MongoDB cloud database service provides supports for three-node high availability data replication sets. Also, Sharded Cluster functionality is under intensive research and development.
If you have encountered any problems when using MongoDB sharding, you're welcome to exchange and discuss it at Alibaba Cloud International Forum.