aliyun-codec is an index compression plug-in that is developed by the Alibaba Cloud Elasticsearch team. You can use the aliyun-codec plug-in to compress various types of documents in an index at the underlying layer of Elasticsearch. You can also use the source_reuse_doc_values feature provided by this plug-in to further reduce the size of an index. This topic describes how to use the aliyun-codec plug-in.
Background information
- Test environment
- Dataset: the cluster logs of an Alibaba Cloud Elasticsearch cluster.
- Data volume: a single index that stores 1.2 TiB of data and has 22 primary shards.
- Index configuration: Compression is enabled for row-oriented, column-oriented, and inverted documents. The zstd compression algorithm is used for the index.
- Test results
- Compared with a cluster that does not use the aliyun-codec plug-in, a cluster that uses the plug-in with index compression enabled shows the following results:
- Write throughput: unchanged.
- Overall size of the index: reduced by 40%.
- Query latency: reduced by 50%.
- Index storage costs can be reduced if the aliyun-codec plug-in is used and both the index compression and source_reuse_doc_values features are enabled. Compared with a cluster that does not use the aliyun-codec plug-in, a cluster that uses the plug-in with both features enabled shows the following results:
- Write throughput: unchanged.
- Overall size of the index: reduced by up to 40%. The actual reduction depends on the proportion of fields in the index on which the source_reuse_doc_values feature takes effect.
- Query latency: depends on factors such as the proportion of fields in the index on which the source_reuse_doc_values feature takes effect and the disk types of the nodes. The actual test results prevail.
Prerequisites
- An Alibaba Cloud Elasticsearch V7.10.0 cluster is created.
For more information, see Create an Alibaba Cloud Elasticsearch cluster.
- The kernel version of the Elasticsearch cluster is upgraded based on your business requirements.
- The kernel version of the Elasticsearch cluster is upgraded to V1.5.0 or later to use the index compression feature.
- The kernel version of the Elasticsearch cluster is upgraded to V1.6.0 or later to use both the index compression and source_reuse_doc_values features.
For more information about how to upgrade the kernel version of an Elasticsearch cluster, see Upgrade the version of a cluster.
- The aliyun-codec plug-in is installed for the Elasticsearch cluster. By default, the aliyun-codec plug-in is installed for an Elasticsearch V7.10.0 cluster.
You can check whether the plug-in is installed for the cluster on the Plug-ins page in the Elasticsearch console. If the plug-in is not installed, install it first. For more information, see Install and remove a built-in plug-in.
Limits
- Only Alibaba Cloud Elasticsearch V7.10.0 clusters whose kernel versions are V1.5.0 or later support the index compression feature of the aliyun-codec plug-in. If you use an Alibaba Cloud Elasticsearch V6.7.0 cluster, only the codec-compression plug-in can be used for compression. For more information, see Use the codec-compression plug-in of the beta version.
- Only Alibaba Cloud Elasticsearch V7.10.0 clusters whose kernel versions are V1.6.0 or later support the source_reuse_doc_values feature of the aliyun-codec plug-in.
By default, the index compression feature is enabled in the default index template aliyun_default_index_template of such clusters. This indicates that index.codec in the default index template of such clusters is set to true.
Use the index compression feature
Use the source_reuse_doc_values feature
Enable the source_reuse_doc_values feature
PUT test
{
  "settings": {
    "index": {
      "ali_codec_service": {
        "source_reuse_doc_values": {
          "enabled": true
        }
      }
    }
  }
}
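The same request can also be issued from a script. The following is a minimal Python sketch that only builds the PUT request shown above and does not send it. The endpoint http://localhost:9200 and the helper name build_enable_request are assumptions made for illustration; a real Alibaba Cloud Elasticsearch cluster has its own endpoint and requires authentication, which is omitted here.

```python
import json
import urllib.request

# Hypothetical endpoint (assumption): replace with your cluster's address
# and add the authentication that your cluster requires.
ES_URL = "http://localhost:9200"


def build_enable_request(index_name: str, base_url: str = ES_URL) -> urllib.request.Request:
    """Build, but do not send, the PUT request that creates an index
    with the source_reuse_doc_values feature enabled."""
    body = {
        "settings": {
            "index": {
                "ali_codec_service": {
                    "source_reuse_doc_values": {"enabled": True}
                }
            }
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_enable_request("test")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send the request: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it possible to inspect the method, URL, and body before anything reaches the cluster.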
Modify the configurations related to the source_reuse_doc_values feature
- Modify the threshold for the number of fields on which the source_reuse_doc_values feature can take effect.
If the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold that you specify, Elasticsearch reports an error or disables the source_reuse_doc_values feature. The default threshold is 50. You can run the following command to modify the threshold:
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.max_fields": 100
  }
}
- Specify whether the number of fields on which the source_reuse_doc_values feature takes effect must be less than or equal to the threshold that you specify.
Use the strict_max_fields setting to control this behavior. Valid values:
- true: If the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold that you specify, Elasticsearch reports an error.
- false: If the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold that you specify, Elasticsearch disables the source_reuse_doc_values feature.
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.strict_max_fields": true
  }
}
- Modify the number of concurrent threads used to read the values of fields on which the source_reuse_doc_values feature takes effect.
When data is read from a document, the system uses concurrent threads to read the values of the fields on which the source_reuse_doc_values feature takes effect and then combines the values. To reduce the time that this takes, you can modify the number of concurrent threads. The default number of concurrent threads is 5. You can run the following command to modify the number:
PUT test/_settings
{
  "index": {
    "ali_codec_service": {
      "source_reuse_doc_values": {
        "fetch_slice": 2
      }
    }
  }
}
- Modify the sizes of the thread pool and queue that are used to read the values of fields on which the source_reuse_doc_values feature takes effect.
The default size of the thread pool is the same as the total number of vCPUs of data nodes in the cluster. The default size of the queue is 1,000. You can modify the two configurations only by modifying the YML configuration file of your cluster. For more information about how to modify a YML configuration file, see Configure the YML file. You can add the following configuration information to the YML configuration file of your cluster to modify the configurations:
apack.doc_values_fetch:
  size: 8
  queue_size: 1000
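The interaction between the max_fields and strict_max_fields settings described in this list can be summarized as a small decision function. The following Python sketch only models the documented behavior for illustration; it is not the plug-in's implementation, and the function name resolve_source_reuse and its return values are hypothetical. Because this topic does not state the default value of strict_max_fields, the sketch requires it as an explicit argument.

```python
def resolve_source_reuse(field_count: int, strict_max_fields: bool, max_fields: int = 50) -> str:
    """Model the documented outcome for an index (illustrative only).

    field_count: number of fields on which source_reuse_doc_values takes effect.
    max_fields: the configured threshold (the documented default is 50).
    strict_max_fields: True means an error is reported when the threshold
    is exceeded; False means the feature is disabled instead.
    """
    if field_count <= max_fields:
        # At or below the threshold, the feature stays in effect.
        return "enabled"
    if strict_max_fields:
        # Strict mode: exceeding the threshold is reported as an error.
        raise ValueError(
            f"{field_count} fields exceed the threshold of {max_fields}"
        )
    # Non-strict mode: the feature is disabled for the index.
    return "disabled"


print(resolve_source_reuse(50, strict_max_fields=True))
print(resolve_source_reuse(51, strict_max_fields=False))
print(resolve_source_reuse(80, strict_max_fields=True, max_fields=100))
```

With the default threshold of 50, the first call stays within the limit, the second exceeds it in non-strict mode, and the third stays within a threshold that was raised to 100.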