ApsaraDB for MongoDB:Troubleshoot high disk utilization issues on an instance

Last Updated: Mar 28, 2026

When disk utilization on an ApsaraDB for MongoDB instance reaches 100%, the instance becomes unavailable and writes are blocked. Act when utilization exceeds 80%: either reduce disk usage or expand the storage space before the instance runs out of space.

This topic explains how to identify what is consuming disk space and how to resolve the most common causes of high disk utilization.

Check disk usage

Replica set instances

Open the ApsaraDB for MongoDB console and use one of the following methods.

Overview

Go to Basic Information and find the Specification Information section. The Disk Space and Utilization fields show current usage at a glance.

Monitoring charts

In the left-side navigation pane, click Monitoring Data. Select a node to view the Disk Usage (Bytes) and Disk Usage (%) metrics.

A replica set instance includes a primary node (read/write), one or more high-availability secondary nodes, a hidden node, and optional read-only nodes. Disk space on each node follows this formula:

ins_size = data_size + log_size
  • data_size: physical data files (file names start with collection), index files (file names start with index), and metadata files such as WiredTiger.wt. Excludes data in the local database.

  • log_size: physical size of the local database, MongoDB runtime logs, and some audit logs.
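As a sketch, the formula can be expressed and sanity-checked in a few lines; the component field names below are illustrative placeholders, not actual server output:

```javascript
// Sketch: compute total node disk usage per the formula
// ins_size = data_size + log_size. Field names are illustrative.
function insSize(node) {
  const dataSize = node.collectionFiles + node.indexFiles + node.metadataFiles;
  const logSize = node.localDb + node.runtimeLogs + node.auditLogs;
  return dataSize + logSize;
}

function utilizationPct(node, provisionedBytes) {
  return (insSize(node) / provisionedBytes) * 100;
}

// Example: a node using 62 GB of a 100 GB disk.
const node = {
  collectionFiles: 40e9, indexFiles: 8e9, metadataFiles: 1e9,
  localDb: 10e9, runtimeLogs: 2e9, auditLogs: 1e9,
};
console.log(utilizationPct(node, 100e9)); // 62
```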

Detailed analysis

For a breakdown by collection, run db.stats() and db.$collection_name.stats(), or use the storage analysis feature of CloudDBA.

Sharded cluster instances

Open the ApsaraDB for MongoDB console and use one of the following methods.

Monitoring charts

On the Monitoring Data page, select a node to view the Disk Usage (Bytes) and Disk Usage (%) metrics for that node.

Commands

Run db.stats() and db.$collection_name.stats() on each node to analyze disk usage per shard.
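A minimal sketch of comparing the resulting figures across shards; the stats objects below are illustrative samples, not live db.stats() output:

```javascript
// Sketch: find the shard holding the most data, given one stats
// object per shard. Sample figures are illustrative only.
function mostLoadedShard(statsByShard) {
  let worst = null;
  for (const [shard, stats] of Object.entries(statsByShard)) {
    if (!worst || stats.storageSize > worst.storageSize) {
      worst = { shard, storageSize: stats.storageSize };
    }
  }
  return worst;
}

const statsByShard = {
  "shard-0": { storageSize: 120e9 },
  "shard-1": { storageSize: 35e9 },
  "shard-2": { storageSize: 33e9 },
};
console.log(mostLoadedShard(statsByShard)); // shard-0 holds far more data
```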

Common causes and resolutions

High disk fragmentation

Deletions leave fragmented space inside WiredTiger data files that is not automatically returned to the operating system. Running db.runCommand({compact: "<collectionName>"}) reclaims this space, but disk usage temporarily increases while the operation runs.

Resolve

Run compact on a secondary node first, then trigger a primary/secondary switchover to minimize impact on your application:

db.runCommand({compact: "<collectionName>"})

Replace <collectionName> with the actual collection name. For large collections, run compact during off-peak hours.
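Before running compact, you can estimate how much space it might reclaim from the "file bytes available for reuse" figure in the collection's WiredTiger statistics. A rough sketch, with an illustrative stats sample and an arbitrary 20% threshold:

```javascript
// Sketch: decide whether compact is worthwhile, based on the share
// of the collection file that WiredTiger reports as reusable.
// The stats object is an illustrative sample, not live output.
function reclaimableBytes(collStats) {
  return collStats.wiredTiger["block-manager"]["file bytes available for reuse"];
}

function worthCompacting(collStats, thresholdPct = 20) {
  const pct = (reclaimableBytes(collStats) / collStats.storageSize) * 100;
  return pct >= thresholdPct;
}

const collStats = {
  storageSize: 50e9,
  wiredTiger: { "block-manager": { "file bytes available for reuse": 18e9 } },
};
console.log(worthCompacting(collStats)); // true: 36% of the file is reusable
```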

For step-by-step instructions, see Defragment the disks of an instance to increase disk utilization.

Verify

After compact completes, re-run db.$collection_name.stats() and confirm that storageSize has decreased. You can also check Disk Usage (Bytes) in Monitoring Data to confirm the reduction.

Excessive log space usage

Journals growing without bound (MongoDB earlier than 4.0)

In MongoDB versions earlier than 4.0, if the number of open files on the host reaches the system limit, the cleaner threads of the WiredTiger log server exit silently, and journal files then grow without limit.

Look for entries like the following in the instance's runtime logs:

2019-08-25T09:45:16.867+0800 I NETWORK [thread1] Listener: accept() returns -1 Too many open files in system
2019-08-25T09:45:17.000+0800 I - [ftdc] Assertion: 13538:couldn't open [/proc/55692/stat] Too many open files in system src/mongo/util/processinfo_linux.cpp 74
2019-08-25T09:45:17.002+0800 W FTDC [ftdc] Uncaught exception in 'Location13538: couldn't open [/proc/55692/stat] Too many open files in system' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.

Resolve

Upgrade MongoDB to 4.0 or later. As a temporary measure, restart the mongod process. See the upstream bug report: WT-4083.

Verify

After upgrading or restarting, confirm that journal files are no longer growing by checking Disk Usage (Bytes) in Monitoring Data over a 10–15 minute window.

Oplog consuming growing space after replication lag or physical backup

Two scenarios cause oplog space to expand and not shrink automatically:

  • Replication lag: When secondary nodes fall behind, the available oplog space is no longer capped by the fixed collection size in the configuration file. It can reach up to 20% of the disk space provisioned for the instance. After the lag clears, the physical space is not automatically released.

  • Physical backup on a hidden node: A large number of checkpoints are generated during backup, producing substantial log data.
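The resulting upper bound on oplog growth during replication lag can be sketched as follows, per the 20%-of-disk behavior described above; the helper and its inputs are illustrative:

```javascript
// Sketch: during replication lag, the oplog cap grows from the
// configured size to 20% of the provisioned disk, whichever is
// larger. Illustrative helper, not an actual server setting.
function oplogGrowthCapBytes(provisionedBytes, configuredOplogBytes) {
  return Math.max(configuredOplogBytes, 0.2 * provisionedBytes);
}

// Example: a 50 GB oplog on a 500 GB instance can grow to 100 GB.
console.log(oplogGrowthCapBytes(500e9, 50e9)); // 100000000000
```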

Resolve

Run compact on the oplog collection:

Note

All write operations are blocked during the compact operation.

db.grantRolesToUser("root", [{db: "local", role: "dbAdmin"}])
use local
db.runCommand({ compact: "oplog.rs", force: true })

Verify

After the operation, check Disk Usage (Bytes) in Monitoring Data to confirm that log space has decreased on the affected node.

Uneven disk usage across shards

Poor shard key choice: low cardinality

If most data ends up in a small number of chunks—while chunk count across shards stays balanced—the root cause is a low-cardinality shard key.

When a shard key has very few distinct values, the balancer can split and migrate chunks but cannot split a chunk whose documents all share the same key value. The result: chunk counts look balanced, but data sizes are heavily skewed.

Look for warnings like the following in the instance's runtime logs:

2019-08-27T13:31:22.076+0800 W SHARDING [conn12681919] possible low cardinality key detected in superHotItemPool.haodanku_all - key is { batch: "201908260000" }
2019-08-27T13:31:22.076+0800 W SHARDING [conn12681919] possible low cardinality key detected in superHotItemPool.haodanku_all - key is { batch: "201908260200" }
2019-08-27T13:31:22.076+0800 W SHARDING [conn12681919] possible low cardinality key detected in superHotItemPool.haodanku_all - key is { batch: "201908260230" }

Resolve

Redesign the shard key using a field with high cardinality. Consider hashed sharding, which distributes data evenly by applying a hash function to shard key values. Ranged sharding distributes data by value range, which tends to concentrate writes on a single chunk. See shard key concepts, hashed sharding, and ranged sharding.

By default, MongoDB splits a chunk once it grows past 64 MB, but a chunk whose documents all share one shard key value cannot be split further. So if chunk counts are balanced while data sizes differ greatly across shards, a low-cardinality shard key is the likely cause.
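The symptom (balanced chunk counts with heavily skewed data sizes) can be expressed as a quick check; the shard figures and thresholds below are illustrative, not live sh.status() output:

```javascript
// Sketch: flag a likely low-cardinality shard key when chunk counts
// are roughly balanced but data sizes are heavily skewed.
// Tolerance and skew thresholds are arbitrary illustrative values.
function lowCardinalitySuspected(shards, chunkTolerance = 0.1, sizeSkew = 2) {
  const chunks = shards.map(s => s.chunks);
  const sizes = shards.map(s => s.dataSize);
  const chunksBalanced =
    (Math.max(...chunks) - Math.min(...chunks)) / Math.max(...chunks) <= chunkTolerance;
  const sizesSkewed = Math.max(...sizes) / Math.min(...sizes) >= sizeSkew;
  return chunksBalanced && sizesSkewed;
}

const shards = [
  { name: "shard-0", chunks: 100, dataSize: 200e9 },
  { name: "shard-1", chunks: 98, dataSize: 30e9 },
  { name: "shard-2", chunks: 101, dataSize: 28e9 },
];
console.log(lowCardinalitySuspected(shards)); // true
```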

Unsharded databases creating a jumbo shard

Data in an unsharded database is stored entirely on one shard. If that database is large, one shard ends up holding significantly more data than the others. The same situation can occur when data is imported into a sharded cluster instance that was not sharded before the import.

Resolve

Choose the appropriate action based on your situation:

  • Import not yet started: shard the destination instance before importing data.

  • Multiple unsharded databases with similar sizes: run the movePrimary command to distribute each database to a different shard.

  • A single large unsharded database: shard the database, or migrate it to a dedicated replica set instance.

  • Disk space is sufficient: no action required.

For more on how chunks are partitioned and split, see Data partitioning with chunks and Split chunks in a sharded cluster.

Disk fragmentation from moveChunk operations

When the balancer migrates a chunk, it removes the source documents after writing them to the destination. By default, that removal does not release the physical disk space: the WiredTiger data files and index files retain it until it is explicitly reclaimed. This is common when sharding is enabled on an instance that has been running for some time.

Resolve

Run compact on each affected shard to reclaim fragmented space:

db.runCommand({compact: "<collectionName>"})

See Migrate ranges in a sharded cluster and Manage sharded cluster balancer for context on moveChunk behavior.

Verify

After compact completes on each shard, compare the Disk Usage (Bytes) values across shards in Monitoring Data to confirm the distribution is more even.

What's next

  • If disk usage continues to grow after resolving the immediate cause, expand the storage space of the instance from the ApsaraDB for MongoDB console.

  • Review your shard key design to prevent data skew from recurring.

  • Schedule compact operations periodically during off-peak hours to keep fragmentation in check.