AliES release notes and optimized features - Elasticsearch

AliES is a highly tailored kernel for Alibaba Cloud Elasticsearch. AliES supports all features provided by the open source Elasticsearch kernel and provides additional features, such as metric optimization, thread pooling, circuit breaking optimization, and query and write performance optimization. The additional features are developed based on the abundant experience of the Alibaba Cloud Elasticsearch team in various scenarios. The features help improve cluster stability and performance, reduce costs, and extend the scope of monitoring and O&M. This topic describes the new features and optimized features in each version of AliES.

Elasticsearch V7.16.2

Kernel version 1.7.0

The aliyun-timestream plug-in is provided. The plug-in allows you to create, modify, query, and delete time series indexes and helps simplify the operations that are required to manage time series data in time series scenarios. For more information, see Overview of aliyun-timestream.
Prometheus Query Language (PromQL) statements can be executed to query stored data. For more information, see Integrate Elasticsearch with Prometheus and Grafana based on aliyun-timestream to implement integrated monitoring.

Elasticsearch V7.10.0

Kernel version 1.12.0

New and optimized features
- Search
  - The analysis-dynamic-synonym plug-in is provided.
  - The primary shard balancing feature is supported.
  - The lengths of parameter values in wildcard and prefix queries are limited.
  - Complex queries such as terms and prefix queries of the keyword type are optimized based on doc_values. This improves the query performance by 80% in scenarios with low hit ratios.
  - Numeric term and terms queries are optimized based on doc_values. This improves the query performance by 80% in scenarios with low hit ratios.
  - The performance of BKD-tree term and terms queries is optimized by 30% based on the lazy loading strategy.
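As an illustration of the query types that these optimizations target, the following requests run a terms query and a prefix query on a keyword field. This is a minimal sketch that uses only the standard Elasticsearch query DSL. The index name my-index and the field name status are hypothetical, and whether a particular query takes the doc_values-based path is decided by the kernel.

```
# Hypothetical index and field names; terms query on a keyword field
GET my-index/_search
{
  "query": {
    "terms": {
      "status": ["failed", "timeout"]
    }
  }
}

# Prefix query on the same keyword field
GET my-index/_search
{
  "query": {
    "prefix": {
      "status": {
        "value": "fail"
      }
    }
  }
}
```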
Bug fixes
- Task management at the storage layer is optimized to resolve the following issue: RPC-based communication is occasionally stuck.
- The data replication process is optimized to avoid the fail engine error on replica nodes.
- The promotion process of replica shards is optimized to avoid inconsistency between primary indexes and replica indexes.

Kernel version 1.10.0

New and optimized features
- Store/Snapshot
  LuceneVerifyIndexOutput is optimized to improve the speed of index restoration. For more information, see ES pull 96975.
- Cluster coordination
  ClusterState is no longer referenced by persistent tasks. This addresses the high memory usage of dedicated master nodes in large-scale clusters. To avoid leader election timeouts in large-scale clusters, the default value of cluster.election.initial_timeout is changed from 100 milliseconds to 1 second. For more information, see ES pull 90724.
- Search
  - The end-to-end query timeout feature is added to effectively control the overall query duration. With this feature, some results can be returned in the event of a timeout.
  - Some fields are added to access logs.
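To check the value of cluster.election.initial_timeout that is in effect on a cluster, the standard cluster settings API can be queried with defaults included. The following request is a sketch. It assumes that the setting is reported in the defaults section of the response, as it is in open source Elasticsearch.

```
# Returns explicit and default settings in flat form;
# look for cluster.election.initial_timeout in the "defaults" section
GET _cluster/settings?include_defaults=true&flat_settings=true
```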
Bug fixes
- Lucene
  The following issue is fixed: The DV update index file referenced by Lucene Merge is deleted by concurrent flush operations. For more information, see Lucene.

Kernel version 1.9.0

New and optimized features
- The framework for concurrent queries is reconstructed and optimized for Kernel-enhanced Edition clusters.
  - The query duration is reduced.
  - Memory can be reused, and Java virtual machine (JVM) memory usage and garbage collection (GC) overhead are reduced. This increases resource utilization.
  - The duration of the fetch phase in the concurrent fetch of raw text is reduced. For example, if the size parameter is set to 10000, the fetch phase can be 6 to 10 times faster, and the overall duration can be reduced by 50%.
  - The following types of aggregations are supported in queries: percentile aggregations, percentile rank aggregations, sampler aggregations, diversified sampler aggregations, significant text aggregations, geodistance aggregations, geohash grid aggregations, geotile grid aggregations, geobounds aggregations, geocentroid aggregations, and scripted metric aggregations.
- Fields such as traceId and a query duration-related field are added to end-to-end access logs. You can use traceId to correlate the stages of a query.
- The custom index structure and mapping parsing of raw text are optimized. This doubles write performance for raw text.
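As an example of one of the aggregation types listed above, the following request runs a percentile aggregation. This is a minimal sketch in the standard aggregation syntax. The index name my-index and the field name latency_ms are hypothetical, and the request does not by itself show whether the concurrent query framework is used.

```
# Hypothetical index and field names; returns only the aggregation result
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "latency_percentiles": {
      "percentiles": {
        "field": "latency_ms",
        "percents": [50, 95, 99]
      }
    }
  }
}
```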
The following code can be used to enable caching for subqueries. This resolves an issue in which caching is not enabled for subqueries in some scenarios where few primary queries but a large number of subqueries are performed.
```
PUT _cluster/settings
{
  "persistent": {
    "search.query_cache_get_wait_lock_enable": "true",
    "search.query_cache_skip_factor": "200000000"
  }
}
```
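If the settings need to be reverted later, the standard cluster settings API removes a persistent setting when it is assigned null. The following request is a sketch. It assumes that the two settings above behave like other dynamic cluster settings.

```
# Assumption: assigning null restores each setting to its default value
PUT _cluster/settings
{
  "persistent": {
    "search.query_cache_get_wait_lock_enable": null,
    "search.query_cache_skip_factor": null
  }
}
```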
Data consistency between primary shards and replica shards is improved in scenarios with k-nearest neighbors (k-NN) queries.
Bug fixes
- The following issue is fixed: After a shard on a node is migrated during a blue-green update, the GET _cat/nodes command fails.
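For reference, the following request is a typical form of the cat nodes API. The column selection is only an illustration and is not taken from this release.

```
# Lists nodes with a header row and a few commonly used columns
GET _cat/nodes?v&h=ip,name,node.role,master
```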

Kernel version 1.8.0

The aliyun-timestream plug-in is provided. The plug-in is used to enhance the storage and usage performance of time series data. The plug-in allows you to create, modify, query, and delete time series indexes, execute PromQL statements to query data stored in Elasticsearch, and write data to time series indexes by using the InfluxDB line protocol. The plug-in helps simplify the operations that are required to manage time series data in time series scenarios. For more information, see Overview of aliyun-timestream, Integrate Elasticsearch with Prometheus and Grafana based on aliyun-timestream to implement integrated monitoring, and Integrate aliyun-timestream with the InfluxDB line protocol.

Kernel version 1.7.0

New features
The analytic-search plug-in is provided, which significantly improves the query performance in log-related scenarios. The following descriptions provide the details:
- Index merging policies and date histogram aggregation policies are optimized. This improves the performance of unconditional or single-condition queries by more than six times in log query scenarios, such as queries performed on the Discover page of the Kibana console. In scenarios where more than 1 TB of data is added every day, the time to complete a query is reduced from minutes to 5 seconds or less.
- Concurrent queries are optimized. For concurrent queries, concurrent data recall is supported. This improves resource utilization and reduces the average time required for data recall in log-related scenarios by 50%.
- Read-only small segments are continuously merged before forced merging. This improves query performance by 20%.
Performance improvements
- The lightweight compression algorithm LZ4 is used to transmit write requests between client nodes and data nodes. This reduces the network bandwidth overhead of nodes by 30%.
- Forced merging can be performed in parallel for shards. This reduces the duration of forced merging.
- Large data blocks in raw text can be compressed, and parameters for the zstd compression algorithm are optimized. This reduces the size of raw text by 8%. In addition, the Patched Frame of Reference (PFOR) method is supported for Lucene postings. This reduces the size of an index by 3%.
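Forced merging is triggered with the standard force merge API. The following request is a minimal sketch. The index name my-logs-index is hypothetical, and the parallel execution across shards described above is internal kernel behavior that the request does not control.

```
# Hypothetical index name; merges each shard down to a single segment
POST my-logs-index/_forcemerge?max_num_segments=1
```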
Bug fixes
- The following issue is fixed: The source_reuse_doc_values feature of the aliyun-codec plug-in does not support fields whose names contain periods (.).

Kernel version 1.6.0

The source_reuse_doc_values feature is added to the aliyun-codec plug-in to further reduce index sizes and costs. For more information, see Use the aliyun-codec plug-in.
The aliyun-qos plug-in is updated to V2.0 to support finer-grained throttling types and parameters. For more information, see Use the aliyun-qos plug-in.

Kernel version 1.5.0

The aliyun-codec plug-in is provided to enhance the compression performance of the kernel for a cluster. For more information, see Use the aliyun-codec plug-in.
The bug related to the search_as_you_type field type is fixed. For more information, see search_as_you_type.

Kernel version 1.4.0

The aliyun-knn plug-in is updated to improve write performance. The plug-in supports script queries and is integrated with the optimized capabilities of the related hardware to improve the vector search feature.
The aliyun-qos plug-in is optimized to improve cluster-level throttling. When you use this plug-in, you do not need to focus on the topology and load of the nodes in your Elasticsearch cluster. Traffic is automatically distributed to the nodes. This improves cluster usability and stability.

Kernel version 1.3.0

The slow query isolation feature is provided to reduce the impact of anomalous queries on cluster stability.
The gig plug-in is provided to perform a switchover within seconds after an exception occurs on a cluster. This plug-in prevents query jitter caused by anomalous nodes.
Note
For Elasticsearch V7.10.0 clusters of the Standard Edition, the gig plug-in is integrated into the aliyun-qos plug-in. The aliyun-qos plug-in is installed by default.
The physical replication feature is provided to improve the write performance of indexes that have replica shards.
The pruning feature is provided for time series indexes to improve the query performance of the indexes.
The access logs of clusters can be viewed. These logs contain fields such as Time, Node IP, and Content. You can use these logs to troubleshoot issues and analyze requests.
The scheduling performance of dedicated master nodes is improved by 10 times. Each dedicated master node can schedule more shards.

Elasticsearch V6.7.0

Kernel version 1.3.0

The slow query isolation feature is provided to reduce the impact of anomalous queries on cluster stability.
The gig plug-in is provided to perform a switchover within seconds after an exception occurs on a cluster. This plug-in prevents query jitter caused by anomalous nodes.

Important

Before you use the preceding features, make sure that the kernel version of your Elasticsearch cluster is V1.3.0. Otherwise, upgrade the kernel. You can upgrade only the kernels of Standard Edition clusters whose kernel versions are V0.3.0, V1.0.2, or V1.3.0.

Kernel version 1.2.0

The physical replication feature is provided to improve the write performance of indexes that have replica shards.
The pruning feature is provided for time series indexes to improve the query performance of the indexes.
Primary key-based data deduplication is optimized during queries. This improves the write performance of documents that contain primary keys by 10%.
Finite state transducers (FSTs) that do not occupy heap memory are supported. A single node can store a maximum of 20 TiB of index data.

Kernel version 1.0.2

The access logs of clusters can be viewed. These logs contain fields such as Time, Node IP, and Content. You can use these logs to troubleshoot issues and analyze requests.

Kernel version 1.0.1

Circuit breaking policies can be configured for JVMs. When the usage of the JVM heap memory of your cluster reaches 95%, the system rejects requests to protect the cluster. The following parameters are used to configure the policies:

- indices.breaker.total.use_real_memory: The default value is false.
- indices.breaker.total.limit: The default value is 95%.
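The following requests sketch how the parent breaker limit might be adjusted and inspected with standard Elasticsearch APIs. In open source Elasticsearch, indices.breaker.total.use_real_memory is a static node-level setting. Whether AliES allows it to be changed dynamically is not stated here, so the sketch leaves it out.

```
# Set the parent breaker limit (shown with the documented default of 95%)
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "95%"
  }
}

# Inspect current breaker usage and limits on each node
GET _nodes/stats/breaker
```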

Kernel version 0.3.0

The scheduling performance of dedicated master nodes is improved by 10 times. Each dedicated master node can schedule more shards.
Write performance is improved by 10%, and the overhead of translog flushes is reduced.