If performance issues are caused by inappropriate shard configurations when you use
an Elasticsearch cluster, you can use the _split API to split indexes in the Elasticsearch cluster into new indexes with more primary
shards in online mode. For example, if the number of primary shards for an index is too small, each primary shard may store a large amount of data, which may degrade cluster performance. This topic describes how to use the _split API to split an existing index into a new index with more primary shards.
Background information
After an index is created, you cannot change the number of primary shards for the
index. In most cases, if you want to change the number of primary shards for an existing
index, you need to call the reindex API to reindex data, which is time-consuming.
To resolve this issue, Elasticsearch provides the _split API in Elasticsearch V6.X
and later versions. You can use this API to split an existing index into a new index
with more primary shards in online mode. For more information about the API, see Split index API.
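For comparison, the reindex-based approach requires you to create a new index that has the desired number of primary shards and then copy every document into it by using the reindex API. The following commands are only a sketch; the index names source_index and dest_index and the shard count are examples and are not used elsewhere in this topic.
# Create the new index with more primary shards. The index name and shard count are examples.
PUT /dest_index
{
  "settings": {
    "index": {
      "number_of_shards": 20
    }
  }
}
# Copy all documents from the original index into the new index.
POST /_reindex
{
  "source": { "index": "source_index" },
  "dest": { "index": "dest_index" }
}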
The following descriptions provide information about performance tests performed on
the reindex API and
_split API:
- Test environment:
- Nodes: five data nodes, each of which offers 8 vCPUs and 16 GiB of memory
- Data volume: 183 GiB of data stored in an index
- Number of shards: five primary shards for the original index, 20 primary shards for the new index, and no replica shards for either index
- Test results
| Method | Consumed time | Resource usage |
| --- | --- | --- |
| reindex API | 2.5 hours | The write QPS in the cluster is excessively high, and the resource usage of the data nodes is high. |
| _split API | 3 minutes | The CPU utilization of each data node is approximately 78%, and the minute-average load of each data node is approximately 10. |
Prerequisites
- The Elasticsearch cluster is healthy, and the load of the cluster is normal.
- The number of primary shards that can be obtained after the index is split is evaluated based on the number of data nodes in the Elasticsearch cluster and the disk space of the cluster. For more information, see Shard evaluation.
- Data write operations are disabled for the index, and the Elasticsearch cluster does not contain an index that has the same name as the new index.
- The Elasticsearch cluster has sufficient disk space to store the new index. You can check the cluster health and disk usage by running the example commands after this list.
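The following example commands show how you can check the cluster health and the disk usage of each data node in the Kibana console. They are provided only as a reference for verifying the prerequisites.
# Query the health status of the cluster.
GET _cluster/health
# Query the disk usage and the number of shards on each data node.
GET _cat/allocation?v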
Procedure
- Log on to the Kibana console of your Elasticsearch cluster and go to the homepage
of the Kibana console as prompted.
For more information about how to log on to the Kibana console, see Log on to the Kibana console.
Note In this example, an Elasticsearch V7.10.0 cluster is used. Operations on clusters
of other versions may differ. The actual operations in the console prevail.
- In the upper-right corner of the page that appears, click Dev tools.
- On the Console tab of the page that appears, run the following command to create an index. In the
command, configure the index.number_of_routing_shards parameter to specify the number of routing shards and the index.number_of_shards
parameter to specify the number of primary shards.
The number of primary shards that can be obtained after an index is split is a factor
of the value of the
index.number_of_routing_shards parameter and a multiple of the value of the
index.number_of_shards parameter. In this example, an index named
dest1 is created in an Elasticsearch V7.10 cluster. The
index.number_of_routing_shards parameter is set to 24, and the
index.number_of_shards parameter is set to 2. In this case, the number of primary shards that can be obtained
after the index is split is 4, 6, 8, 12, or 24.
Note You must replace dest1 in the following command based on your business requirements.
PUT /dest1
{
"settings": {
"index": {
"number_of_routing_shards": 24,
"number_of_shards":2
}
}
}
| Parameter | Description |
| --- | --- |
| number_of_routing_shards | The number of routing shards. This parameter defines the number of times the original index can be split, and therefore the numbers of primary shards that can be obtained after the split. When you create an index, you must make sure that the number of primary shards configured for the index is a factor of the value of this parameter. For more information, see the following note. |
| number_of_shards | The number of primary shards for the index. |
Note
- If you want to split an index in an Elasticsearch cluster of a version earlier than V7.0, you must configure the index.number_of_routing_shards parameter in the command that is used to create the index. The maximum value of this parameter is 1024. For an Elasticsearch cluster of V7.0 or later, the default value of the index.number_of_routing_shards parameter is related to the number of primary shards for the index that you want to split. If this parameter is not configured in the command used to create the index, the index can be split only by factors of 2, and a maximum of 1,024 primary shards can be obtained after the split. For example, if the number of primary shards for the original index is 1, the number of primary shards obtained after the split can be 2, 4, 8, and so on, up to 1,024. If the number of primary shards for the original index is 5, the number of primary shards that can be obtained after the split is 10, 20, 40, 80, 160, 320, or 640. In this case, the maximum number of primary shards that can be obtained after the split is 640.
- When you split an index whose primary shards have been shrunk by using the _shrink API, you must make sure that the number of primary shards obtained after the split is a multiple of the number of primary shards before the split. For example, if the number of primary shards before the split is 5, the number of primary shards obtained after the split must be a multiple of 5, such as 10, 15, 20, 25, or 30. The number of primary shards obtained after the split cannot exceed 1,024.
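After you create the index, you can check whether the settings have taken effect. The following command is an example. The response should show that index.number_of_shards is set to 2 and index.number_of_routing_shards is set to 24 for the dest1 index.
# Query the settings of the dest1 index.
GET /dest1/_settings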
- Insert data.
Note The following data is used only for testing.
POST /dest1/_bulk
{"index":{}}
{"productName":"Product A","annual_rate":"3.2200%","describe":"A product that allows you to select whether to receive push messages for returns."}
{"index":{}}
{"productName":"Product B","annual_rate":"3.1100%","describe":"A product that daily pushes messages for returns credited to your account."}
{"index":{}}
{"productName":"Product C","annual_rate":"3.3500%","describe":"A product that daily pushes messages for returns immediately credited to your account."}
- Disable data write operations for the index.
PUT /dest1/_settings
{
"settings": {
"index.blocks.write": true
}
}
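To confirm that the write block has taken effect, you can send a test write request. The following request is only an example and is expected to fail with a cluster_block_exception error because write operations are disabled for the index.
# This write request is expected to be rejected because index.blocks.write is enabled.
POST /dest1/_doc
{"productName":"Product D","annual_rate":"3.0000%","describe":"Test document that is expected to be rejected."}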
- Split the original index into a new index with more primary shards and enable data
write operations for the new index.
POST dest1/_split/dest3
{
"settings": {
"index.number_of_shards": 12,
"index.blocks.write": null
}
}
In this example, the original index
dest1 is split into the new index
dest3 by using the
_split API. The number of primary shards for the new index is 12, and data write operations
are enabled for the new index.
Notice
- The number of primary shards for the original index is 2, and the index.number_of_routing_shards parameter is set to 24. In this case, the number of primary shards for the new index must be a multiple of 2 and a factor of 24, such as 4, 6, 8, 12, or 24. Otherwise, an error is reported in the Kibana console.
- During the split, the system merges segments on the nodes. This operation consumes the computing resources of the Elasticsearch cluster and increases the load on the cluster. In addition, the new index requires additional disk space. Therefore, before you split an index, make sure that your Elasticsearch cluster has sufficient disk space. We recommend that you split indexes during off-peak hours.
- You must replace dest1 and dest3 in the preceding commands based on your business requirements.
- View the result.
Call the _cat/recovery API to query the index split progress. If no recovery tasks related to the split are returned and the Elasticsearch cluster is healthy, the index split is complete.
- Query the index split progress
GET _cat/recovery?v&active_only
If the index column in the returned result does not contain an index that is being split, no recovery tasks related to the split exist.
- Query the health status of the Elasticsearch cluster
GET _cluster/health
If the returned result contains "status" : "green", the Elasticsearch cluster is healthy.
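You can also confirm that the new index has the expected number of primary shards. The following command is an example. In the response, the pri column for the dest3 index should show 12.
# Query the basic information about the dest3 index, including the number of primary shards.
GET _cat/indices/dest3?v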
FAQ
Q: Why are the CPU utilization and minute-average load of each data node in my Elasticsearch
cluster not reduced after the index split operation is complete?
A: When you split an index, the system reroutes the documents in the index, and the new index contains a large number of documents that are marked as deleted (docs.deleted). If you run the GET _nodes/hot_threads command, you can see that a merge operation is being performed on the new index. The merge operation consumes a large number of computing resources of the Elasticsearch cluster. We recommend that you split indexes during off-peak hours.
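If you want to check how many documents in the new index are still marked as deleted and how much storage they occupy, you can run a command like the following. The docs.deleted value decreases as the merge operations are completed. The dest3 index is the example index used in this topic.
# Query the document counts and storage size of the dest3 index.
GET _cat/indices/dest3?v&h=index,docs.count,docs.deleted,store.size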