Fix Elasticsearch Cluster Restart Failures & Shard Errors

When you restart or update an Elasticsearch cluster, the operation may fail with the following error:

The operation cannot be performed because the cluster is unhealthy or contains indexes in the close state. We recommend that you perform the operation again after the cluster becomes healthy or the indexes are enabled.

This error occurs when the cluster meets one or more of these conditions:

The cluster contains indexes in the close state.
The cluster health status is red or yellow.
The cluster is healthy but heavily loaded.

The following sections explain how to diagnose and resolve each condition.

Closed indexes

Indexes in the close state block cluster restarts and updates. Run the following command to check index states:

GET /_cat/indices?v

Sample output:

health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green  open   my-index-01 30h1EiMvS5uAFr2t5CEVoQ   5   1      820            0       14mb           7mb         7mb
       close  my-index-02 BJxfAErbTtu5HBjIXJV_7A   1   1
green  open   my-index-03 _8C6MIXOSxCqVYicH3jsEA   1   1        7            0     24.3kb        12.1kb       12.1kb

In this example, my-index-02 is in the close state. Open it with the following command:

POST /my-index-02/_open

Replace my-index-02 with the name of the closed index. If multiple indexes are closed, open all of them before retrying the restart or update. The open index API also accepts a comma-separated list of index names.
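When a cluster has many indexes, scanning the table by eye is error-prone. The following is a minimal sketch, assuming the index list was fetched with GET /_cat/indices?format=json and parsed into a list of dictionaries. The helper names closed_indices and open_requests are hypothetical and only illustrate the filtering step; the sketch does not contact a cluster.

```python
def closed_indices(cat_rows):
    """Return the names of indexes whose status column is 'close'."""
    return [row["index"] for row in cat_rows if row.get("status") == "close"]


def open_requests(names):
    """Build one console request line per closed index."""
    return [f"POST /{name}/_open" for name in names]


# Sample rows mirroring the output shown above (assumed JSON shape).
rows = [
    {"health": "green", "status": "open", "index": "my-index-01"},
    {"status": "close", "index": "my-index-02"},
    {"health": "green", "status": "open", "index": "my-index-03"},
]

for line in open_requests(closed_indices(rows)):
    print(line)
```

The printed lines can be pasted into the console one at a time.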

Red or yellow cluster status

A red status means one or more primary shards are unassigned. Searches and indexing on affected indexes may fail. A yellow status means all primary shards are assigned but one or more replica shards are unassigned, which increases the risk of data loss.

Diagnose the issue

Check the cluster health status:

GET /_cat/health?v

If the status is red or yellow, identify the unassigned shards:

GET /_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state
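To summarize a long shard listing, the rows can be grouped by the reason they are unassigned. This is a rough sketch, assuming the same request was sent with format=json added and that each row carries the columns named in the h parameter. The function name unassigned_by_reason and the sample rows are illustrative only.

```python
from collections import defaultdict


def unassigned_by_reason(shard_rows):
    """Group unassigned shards by the value of the unassigned.reason column."""
    groups = defaultdict(list)
    for row in shard_rows:
        if row.get("state") != "UNASSIGNED":
            continue
        kind = "primary" if row.get("prirep") == "p" else "replica"
        reason = row.get("unassigned.reason") or "UNKNOWN"
        groups[reason].append((row["index"], row["shard"], kind))
    return dict(groups)


# Illustrative rows; _cat APIs return every value as a string in JSON.
sample = [
    {"index": "my-index-01", "shard": "0", "prirep": "p", "state": "STARTED",
     "node": "node-1", "unassigned.reason": None},
    {"index": "my-index-02", "shard": "0", "prirep": "p", "state": "UNASSIGNED",
     "node": None, "unassigned.reason": "ALLOCATION_FAILED"},
    {"index": "my-index-02", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
     "node": None, "unassigned.reason": "ALLOCATION_FAILED"},
]

for reason, shards in unassigned_by_reason(sample).items():
    print(reason, shards)
```

An unassigned primary in the result corresponds to a red status, while unassigned replicas alone correspond to yellow.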

To understand why a specific shard cannot be allocated, run:

GET _cluster/allocation/explain

This command returns an error if there are no unassigned shards in the cluster. This is expected behavior.

Sample output:

{
  "index": "my-index-02",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes"
}

Use the allocate_explanation field to identify the root cause. The following sections cover common causes and their solutions.
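As a rough aid for choosing which of the following sections to read, the sketch below maps allocation decider names to the section headings used in this article. It assumes the full response also contains a node_allocation_decisions array whose entries list deciders by name (for example max_retry, same_shard, throttling, and disk_threshold); the abbreviated sample output above does not show that array, so treat the field names and the mapping as assumptions to verify against your own response.

```python
# Assumed decider names mapped to the section headings in this article.
SECTION_BY_DECIDER = {
    "max_retry": "Shard allocation retries exhausted",
    "same_shard": "Primary and replica shards on the same node",
    "throttling": "Maximum simultaneous shard allocation reached",
    "disk_threshold": "High disk usage",
}


def likely_causes(explain):
    """Return the article sections suggested by the deciders in the response."""
    causes = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            name = decider.get("decider")
            section = SECTION_BY_DECIDER.get(name, "Other causes")
            if section not in causes:
                causes.append(section)
    return causes or ["Other causes"]


# Hypothetical response fragment; node names and deciders are made up.
explain = {
    "index": "my-index-02",
    "shard": 0,
    "primary": True,
    "current_state": "unassigned",
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "node-1", "deciders": [{"decider": "disk_threshold", "decision": "NO"}]},
        {"node_name": "node-2", "deciders": [{"decider": "same_shard", "decision": "NO"}]},
    ],
}

print(likely_causes(explain))
```

Deciders that are not in the mapping fall through to "Other causes", so the result is a starting point rather than a diagnosis.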

Shard allocation retries exhausted

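Elasticsearch retries a failed shard allocation automatically, up to 5 times by default (the index.allocation.max_retries setting). If all retries are exhausted, the shard stays unassigned until you trigger another attempt manually: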

POST /_cluster/reroute?retry_failed=true

Primary and replica shards on the same node

If the allocation explanation contains the message "the shard cannot be allocated to the same node on which a copy of the shard already exists", the only candidate node already holds a copy of that shard. Elasticsearch does not place a primary shard and its replica on the same node, so the remaining copy stays unassigned. To resolve this:

Set the number of replica shards to 0:

   PUT /my-index/_settings
   {
     "index": {
       "number_of_replicas": 0
     }
   }

After the cluster status returns to green, set the replica count back to its original value (1 in this example):

   PUT /my-index/_settings
   {
     "index": {
       "number_of_replicas": 1
     }
   }
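The two requests above differ only in the replica count, so they can be generated from the original value to avoid restoring the wrong number. This is a small sketch with hypothetical helper names (replica_settings, replica_reset_plan); it only builds the request bodies and does not send anything.

```python
import json


def replica_settings(count):
    """Build the settings body that sets number_of_replicas to count."""
    return json.dumps({"index": {"number_of_replicas": count}}, indent=2)


def replica_reset_plan(index, original_replicas):
    """Return the two (request line, body) pairs: drop to 0, then restore."""
    path = f"PUT /{index}/_settings"
    return [
        (path, replica_settings(0)),
        (path, replica_settings(original_replicas)),
    ]


for request_line, body in replica_reset_plan("my-index", 1):
    print(request_line)
    print(body)
```

Send the second request only after the cluster status has returned to green.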

Maximum simultaneous shard allocation reached

If the cluster has reached its shard allocation limit, wait for the current allocation to complete. If shards remain unassigned after several minutes, check the allocation explanation:

GET _cluster/allocation/explain

Disconnected nodes

One or more nodes may have disconnected from the cluster. Check node status:

GET _cat/nodes?v

If any nodes are missing from the output, restart those nodes from the Elasticsearch console.
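Spotting a missing node is easier with an explicit comparison against the nodes you expect. The sketch below assumes the node list was fetched with GET _cat/nodes?format=json and that each row has a name column; the expected node names are placeholders you would replace with your own.

```python
def missing_nodes(expected_names, cat_nodes_rows):
    """Return expected node names that do not appear in the _cat/nodes rows."""
    present = {row["name"] for row in cat_nodes_rows}
    return sorted(set(expected_names) - present)


# Placeholder names; substitute the nodes your cluster should contain.
expected = ["node-1", "node-2", "node-3"]
rows = [{"name": "node-1"}, {"name": "node-3"}]

print(missing_nodes(expected, rows))
```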

High disk usage

By default, Elasticsearch does not allocate shards to a node whose disk usage exceeds 85% (the low disk watermark, cluster.routing.allocation.disk.watermark.low). After the disk usage of the affected node drops below 85%, restart the node to restore normal shard allocation.

To check disk usage per node:

GET _cat/allocation?v

To reduce disk usage:

Delete historical indexes that are no longer needed.
Expand the disk capacity of the node.
Set the number of replica shards to 0 temporarily.
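To decide which node to clean up first, the per-node disk figures can be filtered against the 85% threshold. This is a minimal sketch, assuming the data came from GET _cat/allocation?format=json, where each row has node and disk.percent columns and values arrive as strings. The function name nodes_over_disk_limit is hypothetical.

```python
DISK_LIMIT_PERCENT = 85  # threshold stated in this section


def nodes_over_disk_limit(allocation_rows, limit=DISK_LIMIT_PERCENT):
    """Return (node, percent) pairs above the limit, highest usage first."""
    flagged = []
    for row in allocation_rows:
        percent = row.get("disk.percent")
        if percent is None:
            # The row for unassigned shards carries no disk statistics.
            continue
        if int(percent) > limit:
            flagged.append((row["node"], int(percent)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Illustrative rows only.
rows = [
    {"node": "node-1", "disk.percent": "62"},
    {"node": "node-2", "disk.percent": "91"},
    {"node": "node-3", "disk.percent": "87"},
    {"node": "UNASSIGNED", "disk.percent": None},
]

print(nodes_over_disk_limit(rows))
```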

High heap memory usage

When heap memory usage is high, the cluster may suspend operations. To free memory:

Apply throttling to reduce incoming traffic.
Close historical indexes to reduce memory consumption.

Other causes

If none of the above causes apply, check CPU utilization and heap memory usage in the Elasticsearch console. For unassigned shards, run the following command for a detailed explanation:

GET _cluster/allocation/explain

Heavy cluster load

Even when the cluster is healthy (green status), a restart or update can fail if the cluster is heavily loaded. Check the following metrics to identify and resolve load issues.

Disk usage reaches 85%

Diagnose:

Check disk usage monitoring data in the Elasticsearch console.
Run GET _cat/allocation to view disk allocation per node.
Run GET _cluster/allocation/explain to check for allocation issues.
Check cluster logs for disk-related warnings.

Impact: When disk usage reaches 85%, Elasticsearch stops allocating new shards to the affected node.

Solutions:

Delete historical indexes that are no longer needed.
Expand disk capacity.
Set the number of replica shards to 0 temporarily.

After taking action, verify that disk usage has dropped below 85% in the monitoring data.

CPU utilization reaches 85%

Diagnose:

Check CPU utilization monitoring data in the Elasticsearch console.
Review hot threads information to identify CPU-intensive operations.

Impact: High CPU utilization degrades cluster stability.

Solutions:

Check read QPS and write QPS in the monitoring data and reduce traffic if possible.
Scale out the cluster by adding data nodes.
Upgrade the cluster configuration to use larger instance types.

Heap memory usage at or above 75%

Diagnose:

Check heap memory usage monitoring data in the Elasticsearch console.
Review cluster logs for garbage collection (GC) warnings.
Check the old gc collection count and old gc collecting.ms metrics for long GC pauses.

Impact: High heap memory usage degrades cluster stability and may cause operations to hang.

Solutions:

Reduce read and write traffic.
Upgrade the cluster configuration.
Close historical indexes to free heap memory.

Node load exceeds vCPU count

Diagnose:

Check the NodeLoad_1m(value) metric in the Elasticsearch console.
A value greater than the number of vCPUs for the node indicates heavy load.

Impact: An overloaded node may become unresponsive and affect cluster operations.

Solutions:

Check read QPS, write QPS, and disk throughput in the monitoring data.
Reduce read or write traffic.
Scale out the cluster by adding data nodes.
Upgrade the cluster configuration.
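The four thresholds in this section can be checked together before retrying a restart or update. The sketch below is one possible way to do that, assuming per-node figures are available as a dictionary whose keys mirror the _cat/nodes column names cpu, heap.percent, load_1m, and disk.used_percent. The vCPU count is passed in separately because it comes from the instance specification, and load_warnings is a hypothetical helper name.

```python
def load_warnings(node, vcpus):
    """Return the heavy-load conditions from this section that a node meets."""
    warnings = []
    if float(node["disk.used_percent"]) >= 85:
        warnings.append("Disk usage reaches 85%")
    if float(node["cpu"]) >= 85:
        warnings.append("CPU utilization reaches 85%")
    if float(node["heap.percent"]) >= 75:
        warnings.append("Heap memory usage at or above 75%")
    if float(node["load_1m"]) > vcpus:
        warnings.append("Node load exceeds vCPU count")
    return warnings


# Illustrative figures for a node with 4 vCPUs.
node = {
    "name": "node-1",
    "cpu": "91",
    "heap.percent": "60",
    "load_1m": "5.20",
    "disk.used_percent": "40.12",
}

print(load_warnings(node, vcpus=4))
```

An empty list means the node meets none of the four conditions; it does not guarantee that the restart or update will succeed.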
