gig is a plug-in developed by Alibaba Cloud Elasticsearch that implements throttling on the client nodes of an Elasticsearch cluster. The plug-in integrates the core throttling capabilities of the Taobao team for handling searches. If an unexpected node exception causes query jitter, the gig plug-in can perform a switchover within seconds. This reduces the likelihood of query jitter and keeps queries stable. In addition, the plug-in uses detection traffic to handle query latency surges caused by enabled warm nodes and to warm up queries for online business. This topic describes how to use the gig plug-in.

Background information

This section describes how the gig plug-in works.
  • The gig plug-in runs on client nodes. For applications that require high query QPS, you can increase the number of replica shards for each primary shard to scale out the cluster. This helps achieve a linear increase in query throughput. The gig plug-in can help client nodes select the most appropriate replica shards to provide query services.
  • The plug-in determines the service capabilities of nodes based on query latency and coordinates the nodes that provide services by using the proportional-integral-derivative (PID) algorithm. This ensures rapid and accurate coordination. If exceptions such as surging query latency or rising error rates occur on nodes, the gig plug-in collects and analyzes the metrics of those nodes in real time by using the PID algorithm. The plug-in then rapidly isolates the anomalous nodes and performs a switchover within seconds.
  • When new nodes join the cluster, the plug-in samples online query traffic in real time, replicates some query traffic to the new nodes, and discards query results. The traffic that is replicated is detection traffic. This avoids direct transmission of traffic to nodes that cannot provide services and reduces query latency. If the detection results and metrics show that the latency of the new nodes is in a normal range, the plug-in transmits online query traffic to these nodes. Then, these nodes can provide online services.
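
The latency-driven coordination described above can be pictured with a generic, textbook PID controller. The following Python code is only a minimal sketch under stated assumptions, not the plug-in's implementation: the PIDController class, the adjust_weight function, the gain values, and the idea of a per-node traffic weight between 0 and 1 are all hypothetical names and choices introduced for illustration. This topic does not document how gig maps PID output to routing decisions.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook PID controller (hypothetical helper, not part of gig)."""
    kp: float  # proportional gain
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, error: float, dt: float) -> float:
        # Accumulate the error over time and estimate its rate of change.
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def adjust_weight(weight: float, controller: PIDController,
                  target_latency_s: float, observed_latency_s: float,
                  dt: float = 1.0) -> float:
    """Return a new traffic weight in [0, 1] for one node (assumed model)."""
    # The error is negative when the node is slower than the target latency.
    error = target_latency_s - observed_latency_s
    correction = controller.update(error, dt)
    return min(1.0, max(0.0, weight + correction))


# Illustrative run: a node that answers in 12 s against a 2 s target.
controller = PIDController(kp=0.05, ki=0.01, kd=0.0)
weight = 1.0
history = []
for _ in range(2):
    weight = adjust_weight(weight, controller,
                           target_latency_s=2.0, observed_latency_s=12.0)
    history.append(weight)
print(history)
```

With these illustrative gains, the weight of the slow node drops to 0 on the second update, which mirrors the idea of isolating an anomalous node quickly. A node whose latency matches the target produces an error of 0 and keeps its weight.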

Limits

The gig plug-in is available for Alibaba Cloud Elasticsearch clusters that meet the following requirements:
  • Cluster version: V6.7.0 or V7.10.0
  • Kernel version: V1.3.0 or later, but earlier than V1.6.0
    Notice: If the kernel version of your cluster does not meet the requirements, you must upgrade the kernel of your cluster before you use the plug-in. For more information, see Upgrade the version of a cluster. You can upgrade only the kernels of Standard Edition V6.7.0 clusters whose kernel versions are V0.3.0, V1.0.2, or V1.2.0.

Precautions

  • After you upgrade the kernel of an Elasticsearch V6.7.0 cluster to V1.3.0, the gig plug-in is automatically installed. After the plug-in is installed, the throttling feature of the plug-in is disabled by default. If you want to use the feature, you must manually enable it.
  • If the version of your Elasticsearch cluster is V7.10.0, the gig plug-in is integrated into the aliyun-qos plug-in by default. You do not need to manually install the gig plug-in.
  • Before you use the gig plug-in, make sure that sufficient resources are reserved for the data nodes in the cluster. If exceptions occur on one of the data nodes, the query traffic is transmitted to other data nodes. This increases the load of these nodes. Therefore, you must reserve sufficient resources for data nodes to ensure business stability.
  • All commands provided in this topic can be run in the Kibana console. For more information about how to log on to the Kibana console, see Log on to the Kibana console.

Procedure

  1. Enable the throttling feature for the gig plug-in.
    PUT test/_settings
    {
     "index.flow_control.enabled": true
    }
    Note: If you want to disable the feature, set index.flow_control.enabled to null or false.
  2. Configure thresholds for query latency in the gig plug-in. If one of the thresholds is met, the plug-in performs throttling.
    PUT test/_settings
    {
      "index.flow_control.search": {
        "latency_upper_limit_extra": "10s",
        "latency_upper_limit_extra_percent": "1.0",
        "probe_percent": "0.2",
        "full_degrade_error_percent": "0.5",
        "full_degrade_latency": "10s"
      }
    }
    Parameter                          Default value  Description
    latency_upper_limit_extra          10s            The threshold for the absolute difference between the actual query latency of a node and the average query latency: |Actual query latency - Average query latency|. For example, if the average query latency of three data nodes in the cluster is 2 seconds and the query latency of one of these nodes reaches 13 seconds, the difference (11 seconds) exceeds the default threshold of 10 seconds, and the gig plug-in performs throttling.
    latency_upper_limit_extra_percent  1.0            The threshold for the ratio of that absolute difference to the average query latency: (|Actual query latency - Average query latency|)/Average query latency. For example, if the average query latency of three data nodes in the cluster is 2 seconds and the actual query latency of one of these nodes reaches 4 seconds, the ratio is (4 - 2)/2 = 1.0, and the gig plug-in performs throttling.
    probe_percent                      0.2            The threshold for the proportion of detection traffic to the actual query traffic. If this proportion is greater than 0.2, the gig plug-in performs throttling.
    full_degrade_error_percent         0.5            The threshold for the proportion of query exceptions. If the error rate of the query responses of a data node in the cluster reaches 50%, the gig plug-in performs throttling.
    full_degrade_latency               10s            The threshold for the query latency. If the query latency is greater than 10 seconds, the gig plug-in performs throttling.
    Notice: You can adjust the values of these parameters based on your business requirements.
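
To make the threshold arithmetic concrete, the following Python code is a minimal sketch that evaluates the per-node thresholds for one data node and returns the names of the thresholds that are met. The should_throttle function and its argument names are hypothetical and exist only for this illustration. The sketch assumes that "reaches" means greater than or equal to, and it leaves out probe_percent because that parameter concerns the share of detection traffic rather than the latency or error rate of a single node. It does not reproduce the plug-in's internal logic.

```python
def should_throttle(node_latency_s: float,
                    avg_latency_s: float,
                    error_rate: float,
                    latency_upper_limit_extra_s: float = 10.0,
                    latency_upper_limit_extra_percent: float = 1.0,
                    full_degrade_error_percent: float = 0.5,
                    full_degrade_latency_s: float = 10.0) -> list:
    """Return the names of the thresholds that one node meets (assumed model)."""
    met = []
    # |Actual query latency - Average query latency|
    deviation = abs(node_latency_s - avg_latency_s)

    if deviation >= latency_upper_limit_extra_s:
        met.append("latency_upper_limit_extra")

    # Ratio of the deviation to the average; skipped if the average is 0.
    if avg_latency_s > 0 and deviation / avg_latency_s >= latency_upper_limit_extra_percent:
        met.append("latency_upper_limit_extra_percent")

    if error_rate >= full_degrade_error_percent:
        met.append("full_degrade_error_percent")

    if node_latency_s > full_degrade_latency_s:
        met.append("full_degrade_latency")

    # Throttling applies if at least one threshold is met.
    return met


# The two examples from the parameter table, with an average latency of 2 s.
print(should_throttle(node_latency_s=13.0, avg_latency_s=2.0, error_rate=0.0))
print(should_throttle(node_latency_s=4.0, avg_latency_s=2.0, error_rate=0.0))
```

An empty list means that none of the modeled thresholds is met. Passing different keyword arguments shows how a stricter or looser setting changes the outcome before you apply it to an index.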