If performance issues are caused by inappropriate shard configurations when you use
an Elasticsearch cluster, you can use the _split API to split indexes in the Elasticsearch cluster into new indexes with more primary
shards in online mode. For example, if the number of primary shards for an index is too small, each primary shard may store a large amount of data, which may degrade cluster performance. This topic describes how to use the _split API to split an existing index into a new index with more primary shards.
Background information
After an index is created, you cannot change the number of primary shards for the
index. In most cases, if you want to change the number of primary shards for an existing
index, you need to call the reindex API to reindex data, which is time-consuming.
To resolve this issue, Elasticsearch provides the _split API in Elasticsearch V6.X
and later versions. You can use this API to split an existing index into a new index
with more primary shards in online mode. For more information about the API, see Split index API.
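For comparison, the reindex-based approach requires you to create a new index that has the desired number of primary shards and then copy every document into it by using the reindex API. The following commands are only a sketch; the index names source_index and dest_index and the shard count are examples and are not used elsewhere in this topic.
# Create the new index with more primary shards. The index name and shard count are examples.
PUT /dest_index
{
  "settings": {
    "index": {
      "number_of_shards": 20
    }
  }
}
# Copy all documents from the original index into the new index.
POST /_reindex
{
  "source": { "index": "source_index" },
  "dest": { "index": "dest_index" }
}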
The following descriptions provide information about performance tests performed on
the reindex API and
_split API:
- Test environment:
- Nodes: five data nodes, each of which offers 8 vCPUs and 16 GiB of memory
- Data volume: 183 GiB of data stored in an index
- Number of shards: five primary shards for the original index, 20 primary shards for the new index, and no replica shards for either index
- Test results
| Method | Consumed time | Resource usage |
| --- | --- | --- |
| reindex API | 2.5 hours | The write QPS in the cluster is excessively high, and the resource usage of the data nodes is high. |
| _split API | 3 minutes | The CPU utilization of each data node is approximately 78%, and the minute-average load of each data node is approximately 10. |
Prerequisites
- The Elasticsearch cluster is healthy, and the load of the cluster is normal.
- The number of primary shards that can be obtained after the index is split is evaluated based on the number of data nodes in the Elasticsearch cluster and the disk space of the cluster. For more information, see Shard evaluation.
- Data write operations are disabled for the index, and the Elasticsearch cluster does not contain an index that has the same name as the new index.
- The Elasticsearch cluster has sufficient disk space to store the new index. You can check the cluster health and disk usage by running the example commands after this list.
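The following example commands show how you can check the cluster health and the disk usage of each data node in the Kibana console. They are provided only as a reference for verifying the prerequisites.
# Query the health status of the cluster.
GET _cluster/health
# Query the disk usage and the number of shards on each data node.
GET _cat/allocation?v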
Procedure
- Log on to the Kibana console of your Elasticsearch cluster and go to the homepage
of the Kibana console as prompted.
For more information about how to log on to the Kibana console, see Log on to the Kibana console.
Note In this example, an Elasticsearch V7.10.0 cluster is used. Operations on clusters
of other versions may differ. The actual operations in the console prevail.
- In the upper-right corner of the page that appears, click Dev tools.
- On the Console tab of the page that appears, run the following command to create an index. In the
command, configure the index.number_of_routing_shards parameter to specify the number of routing shards and the index.number_of_shards
parameter to specify the number of primary shards.
The number of primary shards that can be obtained after an index is split is a factor
of the value of the
index.number_of_routing_shards parameter and a multiple of the value of the
index.number_of_shards parameter. In this example, an index named
dest1 is created in an Elasticsearch V7.10 cluster. The
index.number_of_routing_shards parameter is set to 24, and the
index.number_of_shards parameter is set to 2. In this case, the number of primary shards that can be obtained
after the index is split is 4, 6, 8, 12, or 24.
Note You must replace dest1 in the following command based on your business requirements.
PUT /dest1
{
"settings": {
"index": {
"number_of_routing_shards": 24,
"number_of_shards":2
}
}
}
| Parameter | Description |
| --- | --- |
| number_of_routing_shards | The number of routing shards. This parameter defines the number of times the original index can be split, and therefore the numbers of primary shards that can be obtained after the split. When you create an index, you must make sure that the number of primary shards configured for the index is a factor of the value of this parameter. For more information, see the following note. |
| number_of_shards | The number of primary shards for the index. |
Note
- If you want to split an index in an Elasticsearch cluster of a version earlier than V7.0, you must configure the index.number_of_routing_shards parameter in the command that is used to create the index. The maximum value of this parameter is 1024. For an Elasticsearch cluster of V7.0 or later, the default value of the index.number_of_routing_shards parameter is related to the number of primary shards for the index that you want to split. If this parameter is not configured in the command used to create the index, the index can be split only by factors of 2, and a maximum of 1,024 primary shards can be obtained after the split. For example, if the number of primary shards for the original index is 1, the number of primary shards obtained after the split can be 2, 4, 8, and so on, up to 1,024. If the number of primary shards for the original index is 5, the number of primary shards that can be obtained after the split is 10, 20, 40, 80, 160, 320, or 640. In this case, the maximum number of primary shards that can be obtained after the split is 640.
- When you split an index whose primary shards have been shrunk by using the _shrink API, you must make sure that the number of primary shards obtained after the split is a multiple of the number of primary shards before the split. For example, if the number of primary shards before the split is 5, the number of primary shards obtained after the split must be a multiple of 5, such as 10, 15, 20, 25, or 30. The number of primary shards obtained after the split cannot exceed 1,024.
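After you create the index, you can check whether the settings have taken effect. The following command is an example. The response should show that index.number_of_shards is set to 2 and index.number_of_routing_shards is set to 24 for the dest1 index.
# Query the settings of the dest1 index.
GET /dest1/_settings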
- Insert data.
Note The following data is used only for testing.
POST /dest1/_bulk
{"index":{}}
{"productName":"Product A","annual_rate":"3.2200%","describe":"A product that allows you to select whether to receive push messages for returns."}
{"index":{}}
{"productName":"Product B","annual_rate":"3.1100%","describe":"A product that daily pushes messages for returns credited to your account."}
{"index":{}}
{"productName":"Product C","annual_rate":"3.3500%","describe":"A product that daily pushes messages for returns immediately credited to your account."}
- Disable data write operations for the index.
PUT /dest1/_settings
{
"settings": {
"index.blocks.write": true
}
}
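To confirm that the write block has taken effect, you can send a test write request. The following request is only an example and is expected to fail with a cluster_block_exception error because write operations are disabled for the index.
# This write request is expected to be rejected because index.blocks.write is enabled.
POST /dest1/_doc
{"productName":"Product D","annual_rate":"3.0000%","describe":"Test document that is expected to be rejected."}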
- Split the original index into a new index with more primary shards and enable data
write operations for the new index.
POST dest1/_split/dest3
{
"settings": {
"index.number_of_shards": 12,
"index.blocks.write": null
}
}
In this example, the original index
dest1 is split into the new index
dest3 by using the
_split API. The number of primary shards for the new index is 12, and data write operations
are enabled for the new index.
Notice
- The number of primary shards for the original index is 2, and the index.number_of_routing_shards parameter is set to 24. In this case, the number of primary shards for the new index must be a multiple of 2 and a factor of 24, such as 4, 6, 8, 12, or 24. Otherwise, an error is reported in the Kibana console.
- During the split, the system merges segments on the nodes. This operation consumes the computing resources of the Elasticsearch cluster and increases the load on the cluster. In addition, the new index requires additional disk space. Therefore, before you split an index, make sure that your Elasticsearch cluster has sufficient disk space. We recommend that you split indexes during off-peak hours.
- You must replace dest1 and dest3 in the preceding commands based on your business requirements.
- View the result.
Call the _cat/recovery API to query the index split progress. If no recovery tasks related to the split are returned and the Elasticsearch cluster is healthy, the index split is complete.
- Query the index split progress
GET _cat/recovery?v&active_only
If the index column in the returned result does not contain an index that is being split, no recovery tasks related to the split exist.
- Query the health status of the Elasticsearch cluster
GET _cluster/health
If the returned result contains "status" : "green", the Elasticsearch cluster is healthy.
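You can also confirm that the new index has the expected number of primary shards. The following command is an example. In the response, the pri column for the dest3 index should show 12.
# Query the basic information about the dest3 index, including the number of primary shards.
GET _cat/indices/dest3?v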
FAQ
Q: Why are the CPU utilization and minute-average load of each data node in my Elasticsearch
cluster not reduced after the index split operation is complete?
A: When you split an index, the system reroutes the documents in the index, and the new index contains a large number of documents that are marked as deleted (docs.deleted). If you run the GET _nodes/hot_threads command, you can see that a merge operation is being performed on the new index. The merge operation consumes a large number of computing resources of the Elasticsearch cluster. We recommend that you split indexes during off-peak hours.
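If you want to check how many documents in the new index are still marked as deleted and how much storage they occupy, you can run a command like the following. The docs.deleted value decreases as the merge operations are completed. The dest3 index is the example index used in this topic.
# Query the document counts and storage size of the dest3 index.
GET _cat/indices/dest3?v&h=index,docs.count,docs.deleted,store.size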