Log and time-series workloads accumulate large indexes that drive up storage costs. The aliyun-codec plug-in addresses this by compressing indexes at the Lucene storage layer — row-oriented (_source), column-oriented (doc_values), and inverted (postings) data — without affecting write throughput. The plug-in also provides source_reuse_doc_values, which eliminates redundant _source storage by reconstructing field values from doc_values at read time.
Performance benchmarks
The following results are based on a test index containing 1.2 TiB of cluster log data across 22 primary shards.
Index compression (zstd algorithm, all three storage layers enabled; compared to a cluster where the aliyun-codec plug-in is used but index compression is not enabled):
| Metric | Result |
|---|---|
| Write throughput | No change |
| Index size | 40% smaller |
| I/O-intensive query latency | 50% lower |
source_reuse_doc_values (compared to the same cluster without the feature enabled):
| Metric | Result |
|---|---|
| Write throughput | No change |
| Index size | Up to 40% smaller (depends on the proportion of applicable fields) |
| I/O-intensive query latency | Varies by proportion of applicable fields and node disk type |
Prerequisites
Before you begin, make sure you have:
- An Alibaba Cloud Elasticsearch V7.10.0 cluster. See Create an Alibaba Cloud Elasticsearch cluster.
- The required kernel version for the features that you plan to use (to upgrade the kernel version, see Upgrade the version of a cluster):
  - V1.5.0 or later: index compression only
  - V1.6.0 or later: both index compression and source_reuse_doc_values
- The aliyun-codec plug-in installed on your cluster. The plug-in is installed by default on Elasticsearch V7.10.0 clusters. To verify or install it, go to the Plug-ins page in the Elasticsearch console. See Install and remove a built-in plug-in.
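If you prefer to verify the plug-in from Kibana instead of the console, you can list installed plug-ins with the standard `_cat` API. This is a quick sanity check; the exact plug-in name shown in the output may vary by cluster version.

```
GET _cat/plugins?v
```

Each data node should report an aliyun-codec entry if the plug-in is installed.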
Limitations
Index compression requires Elasticsearch V7.10.0 with kernel V1.5.0 or later. For Elasticsearch V6.7.0 clusters, use the codec-compression plug-in instead. See Use the codec-compression plug-in of the beta version.
source_reuse_doc_values requires kernel V1.6.0 or later. On clusters that meet this requirement, index compression is enabled by default in aliyun_default_index_template (`index.codec` is set to `ali`). source_reuse_doc_values can only be enabled at index creation time and cannot be disabled after it is enabled.
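To check what the default template currently applies, you can inspect it with the legacy index template API. This is a quick check; the template name is taken from the text above.

```
GET _template/aliyun_default_index_template
```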
Enable index compression
The following steps use an Elasticsearch V7.10.0 cluster. Steps may differ for other versions.
Log on to the Kibana console of your cluster. See Log on to the Kibana console.
In the upper-right corner, click Dev tools.
On the Console tab, run a command to enable index compression on your index. The following command enables compression on an existing index named `test`. By default, the plug-in applies the zstd algorithm to all three storage layers. Set a parameter to `""` to disable compression for that storage layer.

```
PUT test/_settings
{
  "index.codec": "ali"
}
```

To use different algorithms for specific storage layers, specify them individually. The following example uses zstd for `_source` and `doc_values`, and leaves postings uncompressed.

```
PUT test/_settings
{
  "index.codec": "ali",
  "index.doc_value.compression.default": "zstd",
  "index.postings.compression": "",
  "index.source.compression": "zstd"
}
```
Index compression parameters
| Parameter | Values | Description |
|---|---|---|
| `index.codec` | `"ali"` | Enables the aliyun-codec plug-in for the index. |
| `index.doc_value.compression.default` | `lz4`, `zstd` | Compression algorithm for doc_values (column-oriented data). Applies only to fields of the number, date, keyword, and ip types. |
| `index.postings.compression` | `zstd`, `""` | Compression algorithm for postings (inverted data). Set to `""` to disable. |
| `index.source.compression` | `zstd`, `zstd_1024`, `zstd_dict`, `best_compression`, `default` | Compression algorithm for _source (row-oriented data). See the table below. |
| `index.postings.pfor.enabled` | `true`, `false` | Optimizes the encoding of postings. Reduces storage for keyword, match_only_text, and text fields by 14.4% and overall disk usage by 3.5%. Backported from open source Elasticsearch 8.0; also available on Alibaba Cloud Elasticsearch clusters of earlier versions. |
`index.source.compression` options:
| Value | Block size | Notes |
|---|---|---|
| `zstd` | 128 KB | Standard zstd compression. |
| `zstd_1024` | 1,024 KB | zstd with a larger block size. |
| `zstd_dict` | — | zstd with dictionary-based compression. Higher compression ratio, but lower read and write performance than zstd. |
| `best_compression` | — | The best_compression codec from open source Elasticsearch. |
| `default` | — | The default codec from open source Elasticsearch. |
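As a sketch of how these options are applied in practice, the following request uses the parameters documented above to create an index whose `_source` uses dictionary-based zstd while the other layers keep their defaults. The index name `logs-example` is illustrative.

```
PUT logs-example
{
  "settings": {
    "index.codec": "ali",
    "index.source.compression": "zstd_dict"
  }
}
```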
Enable source_reuse_doc_values
Open source Elasticsearch stores multiple copies of field data: in _source, postings, and doc_values. source_reuse_doc_values reduces index size by pruning the JSON data stored in _source — for applicable fields, Elasticsearch reconstructs source data from doc_values at read time instead of maintaining a duplicate copy.
source_reuse_doc_values can only be enabled at index creation time and cannot be disabled after it is enabled.
Enable at index creation
Run the following command when creating your index:

```
PUT test
{
  "settings": {
    "index": {
      "ali_codec_service": {
        "source_reuse_doc_values": {
          "enabled": true
        }
      }
    }
  }
}
```

Configure source_reuse_doc_values
After enabling the feature, adjust the following settings based on your workload.
Set the maximum number of applicable fields
If the number of fields on which source_reuse_doc_values takes effect exceeds the threshold, Elasticsearch either reports an error or disables the feature (controlled by strict_max_fields). The default threshold is 50.
```
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.max_fields": 100
  }
}
```

Control behavior when the threshold is exceeded
```
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.strict_max_fields": true
  }
}
```

- `true`: Elasticsearch reports an error if the number of applicable fields exceeds the threshold.
- `false`: Elasticsearch silently disables source_reuse_doc_values if the threshold is exceeded.
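Both cluster settings can also be applied in a single request, for example raising the threshold while keeping strict checking. The values shown are illustrative; choose them based on your mappings.

```
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.max_fields": 100,
    "apack.ali_codec_service.source_reuse_doc_values.strict_max_fields": true
  }
}
```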
Adjust concurrent read threads per index
When reading a document, the system uses concurrent threads to fetch field values from doc_values and merge them. The default is 5 threads. Adjust this value to tune fetch latency.
```
PUT test/_settings
{
  "index": {
    "ali_codec_service": {
      "source_reuse_doc_values": {
        "fetch_slice": 2
      }
    }
  }
}
```

Adjust the thread pool and queue size (YAML configuration)
The thread pool size defaults to the total number of vCPUs on data nodes in the cluster. The queue size defaults to 1,000. These settings can only be changed in the cluster's YAML configuration file. See Configure the YML file.
Add the following to your YAML configuration file:
```
apack.doc_values_fetch:
  size: 8
  queue_size: 1000
```

Parameter reference
The following table summarizes all parameters for both features.
| Parameter | Default | Scope | Description |
|---|---|---|---|
| `index.codec` | — | Index | Set to `"ali"` to enable the plug-in. |
| `index.doc_value.compression.default` | — | Index | Compression algorithm for doc_values. |
| `index.postings.compression` | — | Index | Compression algorithm for postings. |
| `index.source.compression` | — | Index | Compression algorithm for _source. |
| `index.postings.pfor.enabled` | — | Index | Enables optimized encoding for postings. |
| `apack.ali_codec_service.source_reuse_doc_values.max_fields` | 50 | Cluster | Maximum number of fields on which source_reuse_doc_values takes effect. |
| `apack.ali_codec_service.source_reuse_doc_values.strict_max_fields` | — | Cluster | Behavior when the field limit is exceeded: `true` = error, `false` = silent disable. |
| `index.ali_codec_service.source_reuse_doc_values.fetch_slice` | 5 | Index | Number of concurrent threads for reading field values. |
| `apack.doc_values_fetch.size` | Total vCPUs of data nodes | Cluster (YAML only) | Thread pool size for doc_values reads. |
| `apack.doc_values_fetch.queue_size` | 1,000 | Cluster (YAML only) | Queue size for doc_values reads. |
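To tie the two features together, the following sketch creates an index with both index compression and source_reuse_doc_values enabled at creation time, using only parameters from the table above. The index name `logs-combined` is illustrative; recall that source_reuse_doc_values must be set at creation and cannot be disabled later.

```
PUT logs-combined
{
  "settings": {
    "index.codec": "ali",
    "index.source.compression": "zstd",
    "index.ali_codec_service.source_reuse_doc_values.enabled": true
  }
}
```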