Unbalanced loads on an Alibaba Cloud Elasticsearch cluster can have several causes, including inappropriate shard settings, uneven segment sizes, unseparated hot and cold data, and persistent connections that are used with Server Load Balancer (SLB) instances or a multi-zone architecture. This topic analyzes these causes and describes the solutions for unbalanced loads on an Elasticsearch cluster.

Problem description

Cause

  • Shard allocation is inappropriate.
    Notice In most cases, unbalanced loads are caused by inappropriate shard allocation. We recommend that you first check shard allocation.
  • Segment sizes are uneven.
  • Hot data and cold data are not separated on nodes.
    Notice For example, unbalanced loads can occur if you specify the routing parameter in queries or if your queries mainly target hot data.
  • Persistent connections are used for SLB instances and multi-zone architecture. In this case, traffic may be unevenly distributed to nodes. However, this rarely occurs. For more information, see Multi-zone architecture.
Notice If unbalanced loads occur due to other reasons, contact Alibaba Cloud technical support engineers to troubleshoot the issue.

Inappropriate shard allocation

  • Scenario

    Company A has purchased an Alibaba Cloud Elasticsearch cluster. The cluster contains three dedicated master nodes and nine data nodes. Each dedicated master node offers 16 vCPUs and 32 GiB of memory. Each data node offers 32 vCPUs and 64 GiB of memory. Most data is stored in the test index. During peak hours (16:21 to 18:00), the read load is about 2,000 QPS, the write load is 1,000 QPS, and both cold and hot data are queried. In addition, the CPU utilization of two nodes reaches 100%, which affects the Elasticsearch service.

  • Analysis
    1. Check the network and Elastic Compute Service (ECS) instances. If ECS instances are normal, view network monitoring data.

      The network monitoring data shows that the number of network requests and the query QPS increase during peak hours. In addition, the CPU utilization of the related nodes increases significantly. Based on this information, you can conclude that the nodes with high loads are mainly used to process query requests.

    2. Run the GET _cat/shards?v command to query the shards of the index.

      The command output shows that shards are not evenly allocated to nodes. The shards of the test index are mainly allocated to nodes with high loads. In addition, the monitoring data for disk usage shows that the disk usage of nodes with high loads is greater than that of other nodes. You can conclude that the uneven allocation of shards results in uneven storage. When you query or write data, the nodes that store more data handle most of the query and write workloads.

    3. Run the GET _cat/indices?v command to query information of the index.

      The command output shows that the index has five primary shards and one replica shard for each primary shard. In addition, cluster configurations indicate that shards are not evenly allocated and specific documents are deleted. When Elasticsearch searches for data, it also searches for and filters documents marked with .del. This consumes additional resources and significantly decreases search efficiency. We recommend that you call the force merge operation during off-peak hours.

      Figure: Deleted index
    4. View cluster logs and slow search logs.

      The slow search logs show that all queries are normal term queries, and the cluster logs show that no errors occur. Therefore, you can rule out cluster errors and CPU-intensive query statements as the cause.
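    Step 2 of the analysis can be sketched as a small script that counts shards per node from the text output of GET _cat/shards?v. This is a minimal sketch, not part of the original procedure: the sample output, node names, IP addresses, and the helper name shards_per_node are all hypothetical, and the parser assumes that node names contain no spaces.

```python
from collections import Counter

# Hypothetical `GET _cat/shards?v` output for the test index.
# Node names, IPs, document counts, and sizes are invented for illustration.
SAMPLE_CAT_SHARDS = """\
index shard prirep state   docs  store ip       node
test  0     p      STARTED 52000 4.1gb 10.0.0.1 node-1
test  0     r      STARTED 52000 4.1gb 10.0.0.2 node-2
test  1     p      STARTED 51000 4.0gb 10.0.0.2 node-2
test  1     r      STARTED 51000 4.0gb 10.0.0.1 node-1
test  2     p      STARTED 53000 4.2gb 10.0.0.1 node-1
test  2     r      STARTED 53000 4.2gb 10.0.0.2 node-2
test  3     p      STARTED 50000 3.9gb 10.0.0.3 node-3
test  3     r      STARTED 50000 3.9gb 10.0.0.4 node-4
test  4     p      STARTED 52000 4.1gb 10.0.0.5 node-5
test  4     r      STARTED 52000 4.1gb 10.0.0.6 node-6
"""


def shards_per_node(cat_output: str, index: str) -> Counter:
    """Count how many shards (primary and replica) of `index` each node holds."""
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    header = lines[0].split()
    index_col = header.index("index")
    node_col = header.index("node")
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        # Rows with missing columns (for example, unassigned shards) are skipped.
        if len(fields) != len(header):
            continue
        if fields[index_col] == index:
            counts[fields[node_col]] += 1
    return counts


counts = shards_per_node(SAMPLE_CAT_SHARDS, "test")
for node, shard_count in counts.most_common():
    print(f"{node}: {shard_count} shards")
```

    In this invented sample, two nodes hold three shards each while the other nodes hold one or none, which is the kind of skew that the analysis looks for.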

  • Summary
    The preceding analysis indicates that the uneven CPU utilization is mainly caused by uneven shard allocation. You must re-allocate shards for the index. Make sure that the total number of primary and replica shards is a multiple of the number of data nodes in the cluster. After optimization, CPU utilization does not significantly differ among nodes. The following figure shows the CPU utilization.
    Figure: CPU utilization after optimization
  • Solution

    Plan shards properly before you create indexes. For more information, see Shard evaluation guidelines.

Shard evaluation guidelines

The number of shards and the size of each shard determine the stability and performance of an Elasticsearch cluster. You must properly plan shards for each index of an Elasticsearch cluster. Proper planning prevents an excessive number of shards from degrading cluster performance, especially when the business scenario is difficult to define in advance. This section provides the following guidelines:
Note In versions earlier than Elasticsearch V7.X, one index has five primary shards and one replica shard for each primary shard by default. In Elasticsearch V7.X and later, one index has one primary shard and one replica shard by default.
  • For nodes with low specifications, the size of each shard is no more than 30 GB. For nodes with high specifications, the size of each shard is no more than 50 GB.
  • For log analytics scenarios or extremely large indexes, the size of each shard is no more than 100 GB.
  • The total number of primary and replica shards is the same as or a multiple of the number of data nodes.
    Note The more primary shards you configure, the more performance overhead your Elasticsearch cluster incurs.
  • We recommend that you limit the number of shards on a single node to the memory size of the node (in GiB) multiplied by 30. If too many shards are planned, file handles can easily be exhausted, which results in cluster failures.
  • We recommend that you configure a maximum of five primary shards for an index on a node.
  • If you enable the Auto Indexing feature for your cluster, you can use the scenario-based configuration feature to modify shard configurations. Make sure that shards are evenly allocated. For more information, see Use a scenario-based template to modify the configurations of a cluster.
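The guidelines above can be combined into a rough planning calculation. The following is a minimal sketch, not an official sizing tool: the function names plan_primary_shards and max_shards_per_node are hypothetical, the 300 GB index size is an invented input, and the memory size is assumed to be expressed in GiB. It picks the smallest number of primary shards that keeps each shard within the size limit and makes the total number of primary and replica shards a multiple of the number of data nodes.

```python
import math


def plan_primary_shards(index_size_gb: float, data_nodes: int,
                        replicas: int = 1, max_shard_gb: float = 30) -> int:
    """Return the smallest primary shard count that satisfies two guidelines:
    each shard stays within max_shard_gb, and the total number of primary
    and replica shards is a multiple of the number of data nodes."""
    primaries = max(1, math.ceil(index_size_gb / max_shard_gb))
    # The loop ends within data_nodes steps, because any primary count that is
    # itself a multiple of data_nodes satisfies the condition.
    while (primaries * (1 + replicas)) % data_nodes != 0:
        primaries += 1
    return primaries


def max_shards_per_node(memory_gib: int, factor: int = 30) -> int:
    """Upper bound on shards per node: memory size (assumed GiB) times 30."""
    return memory_gib * factor


# Hypothetical input: a 300 GB index on the nine 64 GiB data nodes of the scenario.
primaries = plan_primary_shards(index_size_gb=300, data_nodes=9)
total_shards = primaries * 2  # one replica shard for each primary shard
print(f"primary shards: {primaries}, total shards: {total_shards}")
print(f"shard limit per node: {max_shards_per_node(64)}")
```

Because more primary shards mean more overhead, the sketch returns the smallest count that meets both constraints rather than rounding up generously.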

Uneven segment sizes

  • Scenario
    A node in the Elasticsearch cluster of Company A experiences an abrupt increase in CPU utilization. This affects query performance. Queries are mainly performed on the test index. The index has three primary shards and one replica shard for each primary shard. The shards are evenly allocated to nodes. The index contains a large number of documents that are marked with docs.deleted. In addition, you have checked that ECS instances are normal.
    Figure: Unbalanced loads caused by large segment sizes
  • Analysis
    1. Add "profile": true to the query body.

      The query results show that Elasticsearch requires a longer time to query Shard 1 of the test index than other shards.

    2. Send a query request with preference set to _primary and "profile": true added to the query body, and view the time required to query the primary shard. Then, send another query request with preference set to _replica and "profile": true added to the query body, and view the time required to query the replica shard.

      The time required to query Shard 1 (primary shard) is longer than that required to query its replica shard. This indicates that unbalanced loads are caused by Shard 1.

    3. Run the GET _cat/segments/index?v&h=shard,segment,size,size.memory,ip and GET _cat/shards?v commands to query the information of Shard 1.
      The command outputs show that Shard 1 contains large segments and the number of documents in the shard is greater than that in its replica shard. Based on the preceding information, you can determine that unbalanced loads are caused by uneven segment sizes.
      Note The inconsistency in the number of documents can occur due to various reasons. Examples:
      • A latency exists in the data synchronization between primary and replica shards. If documents are continuously written to the primary shard, data inconsistency may occur. However, after you stop writing documents, the number of documents is the same between the primary shard and its replica shard.
      • After data is written to a primary shard, the system forwards the write request to its replica shard. If you write documents with automatically generated document IDs, do not perform delete operations on the primary shard while write operations are in progress. If you do perform a delete operation (such as sending a Delete by Query request to delete a document that you have just written), the delete operation is also performed on the replica shard, and the system forwards the write request to the replica shard afterward. Because the document ID is automatically generated by the system, the document is written to the replica shard without verification. As a result, the number of documents in the replica shard differs from that in the primary shard, and the primary shard contains a large number of documents that are marked with docs.deleted.
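    The comparison in steps 1 and 2 can be sketched by summing the query time that the profile output reports for each shard. This is a minimal sketch under stated assumptions: the sample response is invented, it keeps only the profile.shards[].id and searches[].query[].time_in_nanos fields of a profiled search response, and the helper name query_time_per_shard is hypothetical.

```python
# Hypothetical, trimmed response of a search sent with "profile": true.
# Shard IDs and timings are invented for illustration.
SAMPLE_PROFILE_RESPONSE = {
    "profile": {
        "shards": [
            {
                "id": "[node-a][test][0]",
                "searches": [{"query": [
                    {"type": "TermQuery", "time_in_nanos": 4_000_000},
                ]}],
            },
            {
                "id": "[node-b][test][1]",
                "searches": [{"query": [
                    {"type": "TermQuery", "time_in_nanos": 40_000_000},
                    {"type": "TermQuery", "time_in_nanos": 5_000_000},
                ]}],
            },
            {
                "id": "[node-c][test][2]",
                "searches": [{"query": [
                    {"type": "TermQuery", "time_in_nanos": 5_000_000},
                ]}],
            },
        ]
    }
}


def query_time_per_shard(profile_response: dict) -> dict:
    """Return the total top-level query time per shard, in milliseconds."""
    totals = {}
    for shard in profile_response["profile"]["shards"]:
        nanos = sum(
            query["time_in_nanos"]
            for search in shard["searches"]
            for query in search["query"]
        )
        totals[shard["id"]] = nanos / 1_000_000
    return totals


totals = query_time_per_shard(SAMPLE_PROFILE_RESPONSE)
slowest = max(totals, key=totals.get)
print(f"slowest shard: {slowest} ({totals[slowest]} ms)")
```

    Running the same calculation on the responses of the primary-only and replica-only queries gives the two numbers that step 2 compares.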
  • Solution
    • Solution 1: During off-peak hours, call the force merge operation to merge small segments and remove documents that are marked with docs.deleted.
    • Solution 2: Restart the node where the primary shard resides to promote the replica shard to a primary shard. Use the new primary shard to generate a new replica shard. This ensures that the segments in the new primary and replica shards are the same.
    The following figure shows the loads after optimization.
    Figure: Loads after optimization
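Solution 1 can be sketched as a request to the force merge API of the index. This is a minimal sketch under stated assumptions: the base URL is a placeholder, authentication is omitted, and the helper name build_force_merge_request is hypothetical. The sketch only builds the request and does not send it. It uses the only_expunge_deletes and max_num_segments parameters of the force merge API and treats them as mutually exclusive.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_force_merge_request(base_url: str, index: str,
                              only_expunge_deletes: bool = False,
                              max_num_segments: Optional[int] = None) -> Request:
    """Build (but do not send) a POST <index>/_forcemerge request."""
    if only_expunge_deletes and max_num_segments is not None:
        raise ValueError("use either only_expunge_deletes or max_num_segments")
    params = {}
    if only_expunge_deletes:
        # Only rewrite segments that contain deleted documents.
        params["only_expunge_deletes"] = "true"
    if max_num_segments is not None:
        # Merge each shard down to at most this many segments.
        params["max_num_segments"] = str(max_num_segments)
    url = f"{base_url.rstrip('/')}/{index}/_forcemerge"
    if params:
        url += "?" + urlencode(params)
    return Request(url, method="POST")


# Placeholder endpoint; replace it with the endpoint of your own cluster.
request = build_force_merge_request("http://localhost:9200", "test",
                                    only_expunge_deletes=True)
print(request.get_method(), request.full_url)
```

To run the merge, pass the request to urllib.request.urlopen during off-peak hours, because a force merge consumes CPU and disk I/O on the nodes that hold the shards.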

Multi-zone architecture

  • Scenario
    Company A deploys an Elasticsearch cluster across two zones: Zone B and Zone C. When the cluster provides services, the loads of the nodes in Zone C are higher than the loads of the nodes in Zone B. You have checked that the unbalanced loads are not caused by hardware or uneven data distribution.
    Figure: Unbalanced loads caused by multi-zone architecture
  • Analysis
    1. View the CPU utilization of nodes in the two zones over the last four days.
      The monitoring data shows that the CPU utilization of the nodes significantly changed.
      Figure: CPU utilization over the last four days
    2. View the TCP connections to the nodes.
      The monitoring data shows that the number of TCP connections in the two zones significantly differs. This indicates that the unbalanced loads are caused by network connections.
      Figure: Number of TCP connections to nodes
    3. Check client connections.

      The client uses persistent connections and rarely establishes new connections. This pattern is exposed to a side effect of independent scheduling in a multi-zone network. Network services are scheduled by independent scheduling units based on the number of connections, and each scheduling unit selects the optimal node when it creates a connection. Independent scheduling provides higher performance. However, if the number of new connections is small, multiple scheduling units may choose the same node to establish a connection. A client node of an Elasticsearch cluster first forwards requests to another node that resides in the same zone. This causes unbalanced loads among zones.

  • Solution
    • Solution 1: Use independent client nodes to forward complex traffic. In this case, data nodes are not affected even though the client nodes are heavily loaded. This reduces the risk of load imbalance.
    • Solution 2: Configure two independent domain names on the client to ensure balanced traffic on the client. This solution cannot ensure high availability. When you upgrade the configuration of an Elasticsearch cluster, access failures may occur because nodes are not removed from the SLB instance.
    The following figure shows the loads after optimization.
    Figure: Loads after optimization
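Solution 2 can be sketched as a client that alternates between two endpoints, one for each zone. This is a minimal sketch under stated assumptions: both domain names are placeholders, and the helper name endpoint_rotation is hypothetical. As noted in Solution 2, this approach balances traffic on the client but does not provide high availability, so a production client would also need failure handling.

```python
from collections import Counter
from itertools import cycle
from typing import Iterator, Sequence

# Placeholder domain names, one per zone; replace them with your own endpoints.
ENDPOINTS = [
    "http://es-zone-b.example.com:9200",
    "http://es-zone-c.example.com:9200",
]


def endpoint_rotation(endpoints: Sequence[str]) -> Iterator[str]:
    """Return an endless round-robin iterator over the configured endpoints."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    return cycle(endpoints)


rotation = endpoint_rotation(ENDPOINTS)
# Pick an endpoint for each of 1,000 simulated requests.
picked = [next(rotation) for _ in range(1000)]
distribution = Counter(picked)
for endpoint, request_count in distribution.items():
    print(f"{endpoint}: {request_count} requests")
```

Round-robin selection is used here because it spreads requests evenly regardless of how long each connection stays open.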