Log and time-series workloads accumulate large indexes that drive up storage costs. The aliyun-codec plug-in addresses this by compressing indexes at the Lucene storage layer — row-oriented (_source), column-oriented (doc_values), and inverted (postings) data — without affecting write throughput. The plug-in also provides source_reuse_doc_values, which eliminates redundant _source storage by reconstructing field values from doc_values at read time.
Performance benchmarks
The following results are based on a test index containing 1.2 TiB of cluster log data across 22 primary shards.
Index compression (zstd algorithm, all three storage layers enabled; compared to a cluster where the aliyun-codec plug-in is used but index compression is not enabled):
| Metric | Result |
|---|---|
| Write throughput | No change |
| Index size | 40% smaller |
| I/O-intensive query latency | 50% lower |
source_reuse_doc_values (compared to the same cluster without the feature enabled):
| Metric | Result |
|---|---|
| Write throughput | No change |
| Index size | Up to 40% smaller (depends on the proportion of applicable fields) |
| I/O-intensive query latency | Varies by proportion of applicable fields and node disk type |
Prerequisites
Before you begin, make sure you have:
- An Alibaba Cloud Elasticsearch V7.10.0 cluster. See Create an Alibaba Cloud Elasticsearch cluster.
- The required kernel version for the features that you plan to use (to upgrade the kernel version, see Upgrade the version of a cluster):
  - V1.5.0 or later: index compression only
  - V1.6.0 or later: both index compression and source_reuse_doc_values
- The aliyun-codec plug-in installed on your cluster. The plug-in is installed by default on Elasticsearch V7.10.0 clusters. To verify or install it, go to the Plug-ins page in the Elasticsearch console. See Install and remove a built-in plug-in.
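If you prefer to verify the plug-in from Kibana instead of the console, you can list installed plug-ins with the standard `_cat` API. This is a quick sanity check; the exact plug-in name shown in the output may vary by cluster version.

```
GET _cat/plugins?v
```

Each data node should report an aliyun-codec entry if the plug-in is installed.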
Limitations
Index compression requires Elasticsearch V7.10.0 with kernel V1.5.0 or later. For Elasticsearch V6.7.0 clusters, use the codec-compression plug-in instead. See Use the codec-compression plug-in of the beta version.
source_reuse_doc_values requires kernel V1.6.0 or later. On clusters that meet this requirement, index compression is enabled by default in aliyun_default_index_template (`index.codec` is set to `ali`). source_reuse_doc_values can only be enabled at index creation time and cannot be disabled after it is enabled.
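To check what the default template currently applies, you can inspect it with the legacy index template API. This is a quick check; the template name is taken from the text above.

```
GET _template/aliyun_default_index_template
```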
Enable index compression
The following steps use an Elasticsearch V7.10.0 cluster. Steps may differ for other versions.
Log on to the Kibana console of your cluster. See Log on to the Kibana console.
In the upper-right corner, click Dev tools.
On the Console tab, run a command to enable index compression on your index. The following command enables compression on an existing index named `test`. By default, the plug-in applies the zstd algorithm to all three storage layers. Set a parameter to `""` to disable compression for that storage layer.

```
PUT test/_settings
{
  "index.codec": "ali"
}
```

To use different algorithms for specific storage layers, specify them individually. The following example uses zstd for `_source` and `doc_values`, and leaves postings uncompressed.

```
PUT test/_settings
{
  "index.codec": "ali",
  "index.doc_value.compression.default": "zstd",
  "index.postings.compression": "",
  "index.source.compression": "zstd"
}
```
Index compression parameters
| Parameter | Values | Description |
|---|---|---|
| `index.codec` | `"ali"` | Enables the aliyun-codec plug-in for the index. |
| `index.doc_value.compression.default` | `lz4`, `zstd` | Compression algorithm for doc_values (column-oriented data). Applies only to fields of the number, date, keyword, and ip types. |
| `index.postings.compression` | `zstd`, `""` | Compression algorithm for postings (inverted data). Set to `""` to disable. |
| `index.source.compression` | `zstd`, `zstd_1024`, `zstd_dict`, `best_compression`, `default` | Compression algorithm for _source (row-oriented data). See the table below. |
| `index.postings.pfor.enabled` | `true`, `false` | Optimizes the encoding of postings. Reduces storage for keyword, match_only_text, and text fields by 14.4% and overall disk usage by 3.5%. Backported from open source Elasticsearch 8.0; also available on Alibaba Cloud Elasticsearch clusters of earlier versions. |
`index.source.compression` options:
| Value | Block size | Notes |
|---|---|---|
| `zstd` | 128 KB | Standard zstd compression. |
| `zstd_1024` | 1,024 KB | zstd with a larger block size. |
| `zstd_dict` | — | zstd with dictionary-based compression. Higher compression ratio, but lower read and write performance than zstd. |
| `best_compression` | — | The best_compression codec from open source Elasticsearch. |
| `default` | — | The default codec from open source Elasticsearch. |
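As a sketch of how these options are applied in practice, the following request uses the parameters documented above to create an index whose `_source` uses dictionary-based zstd while the other layers keep their defaults. The index name `logs-example` is illustrative.

```
PUT logs-example
{
  "settings": {
    "index.codec": "ali",
    "index.source.compression": "zstd_dict"
  }
}
```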
Enable source_reuse_doc_values
Open source Elasticsearch stores multiple copies of field data: in _source, postings, and doc_values. source_reuse_doc_values reduces index size by pruning the JSON data stored in _source — for applicable fields, Elasticsearch reconstructs source data from doc_values at read time instead of maintaining a duplicate copy.
source_reuse_doc_values can only be enabled at index creation time and cannot be disabled after it is enabled.
Enable at index creation
Run the following command when creating your index:

```
PUT test
{
  "settings": {
    "index": {
      "ali_codec_service": {
        "source_reuse_doc_values": {
          "enabled": true
        }
      }
    }
  }
}
```

Configure source_reuse_doc_values
After enabling the feature, adjust the following settings based on your workload.
Set the maximum number of applicable fields
If the number of fields on which source_reuse_doc_values takes effect exceeds the threshold, Elasticsearch either reports an error or disables the feature (controlled by strict_max_fields). The default threshold is 50.
```
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.max_fields": 100
  }
}
```

Control behavior when the threshold is exceeded
```
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.strict_max_fields": true
  }
}
```

- `true`: Elasticsearch reports an error if the number of applicable fields exceeds the threshold.
- `false`: Elasticsearch silently disables source_reuse_doc_values if the threshold is exceeded.
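Both cluster settings can also be applied in a single request, for example raising the threshold while keeping strict checking. The values shown are illustrative; choose them based on your mappings.

```
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.max_fields": 100,
    "apack.ali_codec_service.source_reuse_doc_values.strict_max_fields": true
  }
}
```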
Adjust concurrent read threads per index
When reading a document, the system uses concurrent threads to fetch field values from doc_values and merge them. The default is 5 threads. Adjust this value to tune fetch latency.
```
PUT test/_settings
{
  "index": {
    "ali_codec_service": {
      "source_reuse_doc_values": {
        "fetch_slice": 2
      }
    }
  }
}
```

Adjust the thread pool and queue size (YAML configuration)
The thread pool size defaults to the total number of vCPUs on data nodes in the cluster. The queue size defaults to 1,000. These settings can only be changed in the cluster's YAML configuration file. See Configure the YML file.
Add the following to your YAML configuration file:
```
apack.doc_values_fetch:
  size: 8
  queue_size: 1000
```

Parameter reference
The following table summarizes all parameters for both features.
| Parameter | Default | Scope | Description |
|---|---|---|---|
| `index.codec` | — | Index | Set to `"ali"` to enable the plug-in. |
| `index.doc_value.compression.default` | — | Index | Compression algorithm for doc_values. |
| `index.postings.compression` | — | Index | Compression algorithm for postings. |
| `index.source.compression` | — | Index | Compression algorithm for _source. |
| `index.postings.pfor.enabled` | — | Index | Enables optimized encoding for postings. |
| `apack.ali_codec_service.source_reuse_doc_values.max_fields` | 50 | Cluster | Maximum number of fields on which source_reuse_doc_values takes effect. |
| `apack.ali_codec_service.source_reuse_doc_values.strict_max_fields` | — | Cluster | Behavior when the field limit is exceeded: `true` = error, `false` = silent disable. |
| `index.ali_codec_service.source_reuse_doc_values.fetch_slice` | 5 | Index | Number of concurrent threads for reading field values. |
| `apack.doc_values_fetch.size` | Total vCPUs of data nodes | Cluster (YAML only) | Thread pool size for doc_values reads. |
| `apack.doc_values_fetch.queue_size` | 1,000 | Cluster (YAML only) | Queue size for doc_values reads. |
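To tie the two features together, the following sketch creates an index with both index compression and source_reuse_doc_values enabled at creation time, using only parameters from the table above. The index name `logs-combined` is illustrative; recall that source_reuse_doc_values must be set at creation and cannot be disabled later.

```
PUT logs-combined
{
  "settings": {
    "index.codec": "ali",
    "index.source.compression": "zstd",
    "index.ali_codec_service.source_reuse_doc_values.enabled": true
  }
}
```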