aliyun-codec is an index compression plug-in that is developed by the Alibaba Cloud Elasticsearch team. You can use the aliyun-codec plug-in to compress various types of documents in an index at the underlying layer of Elasticsearch. You can also use the source_reuse_doc_values feature provided by this plug-in to further reduce the size of an index. This topic describes how to use the aliyun-codec plug-in.

Background information

The aliyun-codec plug-in supports various compression algorithms and the source_reuse_doc_values feature. The plug-in is suitable for scenarios that involve large volumes of written data or high index storage costs, such as logging and time series data analysis. In these scenarios, the aliyun-codec plug-in can significantly reduce the storage costs of indexes.
The following information describes a performance test performed on the plug-in:
  • Test environment
    • Dataset: the cluster logs of an Alibaba Cloud Elasticsearch cluster.
    • Data volume: a single index that stores 1.2 TiB of data and has 22 primary shards.
    • Index configuration: Compression is enabled for row-oriented, column-oriented, and inverted documents. The zstd compression algorithm is used for the index.
  • Test results
    • Compared with a cluster for which the aliyun-codec plug-in is not used, the cluster for which the plug-in is used and index compression is enabled has the following improvements:
      • Write throughput: remains unchanged.
      • Overall size of the index: decreases by 40%.
      • Query latency: decreases by 50%.
    • Index storage costs can be reduced if the aliyun-codec plug-in is used and the index compression and source_reuse_doc_values features are enabled. Compared with a cluster for which the aliyun-codec plug-in is not used, the cluster for which the plug-in is used and the preceding features are enabled has the following improvements:
      • Write throughput: remains unchanged.
      • Overall size of the index: decreases by up to 40%. The actual reduction depends on the proportion of fields in the index on which the source_reuse_doc_values feature takes effect.
      • Query latency: varies based on factors such as the proportion of fields in the index on which the source_reuse_doc_values feature takes effect and the disk types of nodes. We recommend that you run your own tests to determine the actual latency.

Prerequisites

  • An Alibaba Cloud Elasticsearch V7.10.0 cluster is created.

    For more information, see Create an Alibaba Cloud Elasticsearch cluster.

  • The kernel version of the Elasticsearch cluster is upgraded based on your business requirements.
    • The kernel version of the Elasticsearch cluster is upgraded to V1.5.0 or later to use the index compression feature.
    • The kernel version of the Elasticsearch cluster is upgraded to V1.6.0 or later to use both the index compression and source_reuse_doc_values features.

    For more information about how to upgrade the kernel version of an Elasticsearch cluster, see Upgrade the version of a cluster.

  • The aliyun-codec plug-in is installed for the Elasticsearch cluster. By default, the aliyun-codec plug-in is installed for an Elasticsearch V7.10.0 cluster.

    You can check whether the plug-in is installed for the cluster on the Plug-ins page in the Elasticsearch console. If the plug-in is not installed, install it first. For more information, see Install and remove a built-in plug-in.

Limits

  • Only Alibaba Cloud Elasticsearch V7.10.0 clusters whose kernel versions are V1.5.0 or later support the index compression feature of the aliyun-codec plug-in. If you use an Alibaba Cloud Elasticsearch V6.7.0 cluster, only the codec-compression plug-in can be used for compression. For more information, see Use the codec-compression plug-in of the beta version.
  • Only Alibaba Cloud Elasticsearch V7.10.0 clusters whose kernel versions are V1.6.0 or later support the source_reuse_doc_values feature of the aliyun-codec plug-in. By default, the index compression feature is enabled in the default index template aliyun_default_index_template of such clusters. This indicates that index.codec in the default index template of such clusters is set to ali.
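
    You can check whether index compression is enabled in the default index template of your cluster. The following request is a minimal example that uses the legacy index template API of Elasticsearch V7.10.0. The template name aliyun_default_index_template is taken from the preceding description:

    GET _template/aliyun_default_index_template

    If index compression is enabled, the settings section of the response is expected to contain "index.codec": "ali".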

Use the index compression feature

  1. Log on to the Kibana console of your Elasticsearch cluster and go to the homepage of the Kibana console as prompted.
    For more information about how to log on to the Kibana console, see Log on to the Kibana console.
    Note In this example, an Elasticsearch V7.10.0 cluster is used. Operations on clusters of other versions may differ. The actual operations in the console prevail.
  2. In the upper-right corner of the page that appears, click Dev tools.
  3. On the Console tab, run a command to enable index compression.
    For example, you can run the following command to enable index compression for an existing index named test:
    PUT test/_settings
    {
      "index.codec" : "ali"
    }

    By default, after index compression is enabled for the index, the system uses the zstd compression algorithm to compress the row-oriented, column-oriented, and inverted documents in the index.
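    After you enable index compression, you can confirm that the setting takes effect by querying the index settings. The following request is a minimal check. flat_settings is a standard Elasticsearch query parameter that flattens the response for readability:

    GET test/_settings?flat_settings=true

    The response is expected to contain "index.codec": "ali".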

    You can also use another compression algorithm to compress a specific type of document in the index. The following code provides an example on how to use the zstd algorithm to compress row-oriented and column-oriented documents without enabling compression for inverted documents in the test index.
    Note If you want to disable index compression for a specific type of document, you can set the related parameter to "". For example, the index.postings.compression parameter is set to "" in the following code.
    PUT test/_settings
    {
      "index.codec":"ali",
      "index.doc_value.compression.default":"zstd",
      "index.postings.compression":"",
      "index.source.compression":"zstd"
    }
    The following list describes the parameters related to index compression.
    • index.doc_value.compression.default: the compression algorithm that is used to compress column-oriented documents. Valid values:
      • lz4: The lz4 compression algorithm is used to compress column-oriented documents.
      • zstd: The zstd compression algorithm is used to compress column-oriented documents.
      Notice The aliyun-codec plug-in can compress only the column-oriented documents that contain fields of the number, date, keyword, and ip types.
    • index.postings.compression: the compression algorithm that is used to compress inverted documents. Valid value:
      • zstd: The zstd compression algorithm is used to compress inverted documents.
    • index.source.compression: the compression algorithm that is used to compress row-oriented documents. Valid values:
      • zstd: The zstd compression algorithm is used to compress row-oriented documents.
      • zstd-dict: The zstd compression algorithm is used to compress row-oriented documents, and the dict feature is used to store data in the documents. zstd-dict provides a higher compression ratio but lower read and write performance than zstd.
      • best-compression: The best_compression algorithm provided by open source Elasticsearch is used to compress row-oriented documents.
      • default: The default compression algorithm provided by open source Elasticsearch is used to compress row-oriented documents.
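
    The preceding settings can also be specified when you create an index, so that all data written to the index is compressed from the start. The following request is a sketch that combines several of the settings described above. The index name test2 is hypothetical:

    PUT test2
    {
      "settings": {
        "index.codec": "ali",
        "index.doc_value.compression.default": "zstd",
        "index.source.compression": "zstd"
      }
    }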

Use the source_reuse_doc_values feature

Enable the source_reuse_doc_values feature

Run the following command to enable the source_reuse_doc_values feature when you create an index:
PUT test
{
  "settings": {
    "index": {
      "ali_codec_service": {
        "source_reuse_doc_values": {
          "enabled": true
        }
      }
    }
  }
}
Notice You can enable the source_reuse_doc_values feature only when you create an index. The feature cannot be disabled after it is enabled.
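
The source_reuse_doc_values feature can be combined with the index compression feature described in the preceding section. The following request is a sketch that enables both features when an index is created. The index name test3 is hypothetical:

PUT test3
{
  "settings": {
    "index": {
      "codec": "ali",
      "ali_codec_service": {
        "source_reuse_doc_values": {
          "enabled": true
        }
      }
    }
  }
}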

Modify the configurations related to the source_reuse_doc_values feature

After you enable the source_reuse_doc_values feature, you can modify the configurations related to this feature based on your business requirements.
  • Modify the threshold for the number of fields on which the source_reuse_doc_values feature can take effect.
    If the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold that you specify, Elasticsearch reports an error or disables the source_reuse_doc_values feature, depending on the strict_max_fields configuration. The default threshold is 50. You can run the following command to modify the threshold:
    PUT _cluster/settings
    {
      "persistent": {
           "apack.ali_codec_service.source_reuse_doc_values.max_fields": 100
      }
    }
  • Specify whether the number of fields on which the source_reuse_doc_values feature takes effect must be less than or equal to the threshold that you specify. Valid values:
    • true: If the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold, Elasticsearch reports an error.
    • false: If the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold, Elasticsearch disables the source_reuse_doc_values feature.
    You can run the following command to modify this configuration:
    PUT _cluster/settings
    {
      "persistent": {
           "apack.ali_codec_service.source_reuse_doc_values.strict_max_fields": true
      }
    }
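    You can check the current values of the preceding cluster-level settings. The following request is a minimal check that returns the persistent and transient settings of the cluster:

    GET _cluster/settings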
  • Modify the number of concurrent threads used to read the values of fields on which the source_reuse_doc_values feature takes effect.
    When data is read from a document, the system uses concurrent threads to read the values of fields on which the source_reuse_doc_values feature takes effect and then combines the values. To reduce the time that this process consumes, you can modify the number of concurrent threads. The default number of concurrent threads is 5. You can run the following command to modify the number:
    PUT test/_settings
    {
      "index": {
        "ali_codec_service": {
          "source_reuse_doc_values": {
            "fetch_slice": 2
          }
        }
      }
    }
  • Modify the sizes of the thread pool and queue that are used to read the values of fields on which the source_reuse_doc_values feature takes effect.
    The default size of the thread pool is the same as the total number of vCPUs of data nodes in the cluster. The default size of the queue is 1,000. You can modify the two configurations only by modifying the YML configuration file of your cluster. For more information about how to modify a YML configuration file, see Configure the YML file. You can add the following configuration information to the YML configuration file of your cluster to modify the configurations:
    apack.doc_values_fetch:
        size: 8
        queue_size: 1000