Elasticsearch uses the HNSW algorithm for approximate knn search. Because the HNSW algorithm is graph-based, it is most effective when most vector data is held in memory. Therefore, you must ensure that data nodes have sufficient off-heap memory to store the vector data and index structures. This topic describes key performance optimization methods for the vector engine.
Set appropriate parameters
Set appropriate m and ef_construction parameters. These are advanced parameters for the dense_vector type that you configure when you create an index. For more information, see Dense vector field type.
HNSW is an approximate k-NN search method that does not guarantee retrieval of the true nearest data points. The main parameters that affect the recall rate are m and ef_construction.
| Parameter | Description |
| --- | --- |
| m | Specifies the number of neighbors for a node. The default value is 16. A larger number of neighbors increases the recall rate but significantly impacts performance and increases memory usage. If you have strict requirements for the recall rate, set this parameter to 64 or a larger value. |
| ef_construction | Specifies the number of candidate nearest neighbors to track for each node when building the HNSW graph. The default value is 100. A larger value increases the recall rate but slows down index builds. |
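For example, you can set these parameters through index_options when you define a dense_vector field. The following request is a sketch; the index name, dimensionality, and parameter values are illustrative:

```
PUT /my_vector_index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index_options": {
          "type": "hnsw",
          "m": 32,
          "ef_construction": 200
        }
      }
    }
  }
}
```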
Reduce memory consumption
Elasticsearch uses quantization to reduce memory usage. Quantization can reduce the memory required for vectors by a factor of 4, 8, or even 32. For example, with the default float type, a vector value uses 4 bytes. If you use int8 quantization, each value uses only 1 byte. With int4 quantization, each value uses half a byte. With Better Binary Quantization (BBQ), each value uses only 1 bit, and eight values use a total of 1 byte. This reduces the memory requirement to 1/32 of that for unquantized vectors.
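For example, you can enable quantization through the index_options type of a dense_vector field. In the following sketch, the index name and dimensionality are illustrative; int8_hnsw builds an HNSW index over int8-quantized vectors, and int4_hnsw and bbq_hnsw are the analogous types for the other quantization levels:

```
PUT /my_quantized_index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index_options": {
          "type": "int8_hnsw"
        }
      }
    }
  }
}
```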
To calculate the memory required for vector data:
You must consider the memory required for both the vector data and the HNSW graph index. When vectors are unquantized or use int8 quantization, the graph index uses a small proportion of the total memory. However, with bbq quantization, the proportion of memory used by the graph index increases significantly. Therefore, when you calculate the memory used by vector data, you must also account for the graph index.
Formulas for calculating vector data memory:
- element_type: float: num_vectors * num_dimensions * 4
- element_type: float with quantization: int8: num_vectors * (num_dimensions + 4)
- element_type: float with quantization: int4: num_vectors * (num_dimensions/2 + 4)
- element_type: float with quantization: bbq: num_vectors * (num_dimensions/8 + 12)
- element_type: byte: num_vectors * num_dimensions
- element_type: bit: num_vectors * (num_dimensions/8)
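The formulas above can be sketched as a small helper. This is a minimal illustration; the function names are not part of the Elasticsearch API:

```python
def vector_data_bytes(num_vectors, num_dimensions, element_type="float", quantization=None):
    """Estimate the memory in bytes for raw vector data, per the formulas above."""
    if element_type == "float" and quantization is None:
        return num_vectors * num_dimensions * 4
    if element_type == "float" and quantization == "int8":
        return num_vectors * (num_dimensions + 4)
    if element_type == "float" and quantization == "int4":
        return num_vectors * (num_dimensions // 2 + 4)
    if element_type == "float" and quantization == "bbq":
        return num_vectors * (num_dimensions // 8 + 12)
    if element_type == "byte":
        return num_vectors * num_dimensions
    if element_type == "bit":
        return num_vectors * num_dimensions // 8
    raise ValueError("unsupported element_type/quantization combination")

def hnsw_graph_bytes(num_vectors, m=16):
    """Estimate the memory in bytes for the HNSW graph index: num_vectors * 4 * m."""
    return num_vectors * 4 * m

# 1 million 768-dimensional float vectors with int8 quantization:
print(vector_data_bytes(1_000_000, 768, quantization="int8"))  # 772000000
print(hnsw_graph_bytes(1_000_000))                             # 64000000
```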
If you use the flat type and do not create an HNSW index, the memory usage for vector data is calculated using the preceding formulas. If you use the HNSW type, you must also calculate the size of the graph index. The following formula can be used to estimate the graph index size:
num_vectors * 4 * HNSW.m. The default value of HNSW.m is 16. Therefore, the default size is num_vectors * 4 * 16.
The total memory for vector data is the sum of these two parts.
In addition, consider the number_of_replicas (number of index replicas). The previous calculation is for a single data copy. The total memory required also accounts for all replica copies. For example, the default value of number_of_replicas is 1, so the total memory required is double the memory for a single data copy.
When you enable quantization, the on-disk index size increases because Elasticsearch stores the quantized vector data in addition to the original vectors. For example, if you apply int8 quantization to 40 GB of floating-point vectors, an additional 10 GB of data is stored for the quantized vectors. The total disk usage becomes 50 GB, but the memory required for fast searches is reduced to 10 GB.
Is off-heap memory capacity sufficient?
When you calculate memory capacity and check whether a node has sufficient memory, you must focus on the node's off-heap memory.
To determine the off-heap memory, note that a node must reserve enough memory for the Java heap. For a node with 64 GB of memory or less, the off-heap memory is typically half of the total memory. For a node with more than 64 GB of memory, the off-heap memory is the total memory minus 31 GB by default. To calculate the exact amount, run the following command:
GET _nodes/stats?human&filter_path=**.os.mem.total,**.jvm.mem.heap_max

The off-heap memory capacity of a node is calculated as follows: os.mem.total - jvm.mem.heap_max.
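The subtraction can be sketched as follows, assuming illustrative values for a node with 64 GB of physical memory and a 31 GB JVM heap (the values you would read from the _nodes/stats response):

```python
os_mem_total = 64 * 2**30   # **.os.mem.total: 64 GB of physical memory (illustrative)
jvm_heap_max = 31 * 2**30   # **.jvm.mem.heap_max: 31 GB JVM heap (illustrative)

# Off-heap memory available for vector data and the file system cache
off_heap = os_mem_total - jvm_heap_max
print(off_heap / 2**30)  # 33.0
```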
Vector index memory calculation
Example:
Assume that you have 10 million data entries, each with 1,024 dimensions. You use the default vector settings, enable int8 quantization, set m=16, and use the default number_of_replicas value of 1. The total memory required for the vector data is calculated as follows:
2 × (10,000,000 × (1024 + 4) + 10,000,000 × 4 × 16) = 20.34 GB.
If you use two data nodes with 16 GB of memory each to store this index, the total off-heap memory of the nodes is (16 / 2) × 2 = 16 GB. This is not enough to store the vector data.
If you use two data nodes with 32 GB of memory each to store this index, the total off-heap memory of the nodes is (32 / 2) × 2 = 32 GB. This is enough to store the vector data.
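The example above can be reproduced as a short calculation:

```python
# Sketch of the worked example: 10 million 1,024-dimensional vectors,
# int8 quantization, m=16, number_of_replicas=1.
num_vectors = 10_000_000
num_dimensions = 1024
m = 16
number_of_replicas = 1

vector_bytes = num_vectors * (num_dimensions + 4)  # int8-quantized vector data
graph_bytes = num_vectors * 4 * m                  # HNSW graph index
total_bytes = (1 + number_of_replicas) * (vector_bytes + graph_bytes)
total_gb = total_bytes / 2**30
print(round(total_gb, 2))  # 20.34

# Two 16 GB nodes provide (16 / 2) * 2 = 16 GB of off-heap memory: not enough.
# Two 32 GB nodes provide (32 / 2) * 2 = 32 GB of off-heap memory: enough.
print(total_gb <= 16, total_gb <= 32)  # False True
```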
When you calculate the actual off-heap memory required, you must also reserve some memory for other indexes, source documents, and network traffic from data read and write operations. In a production environment, insufficient off-heap memory often causes high disk I/O utilization and a large amount of random read traffic.
Prefetch the file system buffer
If the machine that runs Elasticsearch restarts, the file system buffer is cleared. The operating system then needs time to load frequently accessed parts of the index into memory to ensure fast searches. You can use the index.store.preload setting to explicitly instruct the operating system to immediately load specific files into memory based on their file name extensions.
If the file system buffer is not large enough to hold all the data, eagerly loading data into it for too many indexes or files can slow down searches. Use this setting with caution.
Example:
PUT /my_vector_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1,
"index.store.preload": ["vex", "veq"]
},
"mappings": {
"properties": {
"my_vector": {
"type": "dense_vector",
"dims": 3
},
"my_text" : {
"type" : "keyword"
}
}
}
}

To determine which index files index.store.preload should load, refer to the following descriptions of file name extensions:
The following file name extensions apply to approximate knn search. Each extension is subdivided by quantization type.
- vex stores the HNSW graph structure.
- vec stores all non-quantized vector values. This includes all element types: floating-point numbers, bytes, and bits.
- veq stores quantized vectors in an index that uses quantization: int4 or int8.
- veb stores binary vectors in an index that uses quantization: bbq.
- vem, vemf, vemq, and vemb store metadata. They are usually small and do not need to be preloaded.
Typically, when you use a quantized index, preload only the relevant quantized values and the HNSW graph. Preloading the original vectors is unnecessary and can be counterproductive.
Configuration examples:
hnsw: "index.store.preload": ["vex", "vec"]
int8, int4: "index.store.preload": ["vex", "veq"]
bbq: "index.store.preload": ["vex", "veb"]
To configure this setting for an existing index, note that index.store.preload is a static parameter and cannot be modified directly after an index is created. If you can tolerate temporary index unavailability, you can follow these steps: First, close the index. Then, configure the parameter. Finally, reopen the index. The following example shows how to perform these steps:
POST my_vector_index/_close
PUT my_vector_index/_settings
{
"index.store.preload": ["vex", "veq"]
}
POST my_vector_index/_open

Reduce the number of index segments
Elasticsearch shards consist of segments, which are internal storage elements within an index. For approximate knn search, Elasticsearch stores the vector values of each segment as a separate HNSW graph, so a knn search must examine every segment. Recent parallelization of knn search improves performance across multiple segments, but reducing the number of segments can still improve knn search performance severalfold. By default, Elasticsearch periodically merges smaller segments into larger ones through a background merge process. If this is insufficient, you can take the following explicit steps to reduce the number of index segments.
1. Increase the maximum segment size
Elasticsearch provides many settings to control the merge process. One important setting is index.merge.policy.max_merged_segment. This setting controls the maximum size of a segment created during a merge. By increasing this value, you can reduce the number of segments in an index. The default value is 5 GB, which might be too small for vectors with large dimensions. You can consider increasing this value to 10 GB or 20 GB to help reduce the number of segments. Example:
PUT my_vector_index/_settings
{
"index.merge.policy.max_merged_segment": "10gb"
}

2. Create large segments during batch indexing
A common workflow is to perform an initial batch upload and then make the index available for search. You can adjust index settings to encourage Elasticsearch to create larger initial segments instead of forcing a merge. During batch uploads, you can disable index.refresh_interval by setting it to -1. This prevents refresh operations and avoids creating extra segments. You can also configure a larger index buffer for Elasticsearch so that it can accept more documents before a refresh occurs. By default, indices.memory.index_buffer_size is set to 10% of the heap size. For a large heap size, such as 32 GB, this is usually sufficient. To allow the full index buffer to be used, you should also increase the index.translog.flush_threshold_size limit.
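The settings described above can be applied as in the following sketch. The 1gb threshold is an illustrative value; choose one that fits your workload:

```
PUT my_vector_index/_settings
{
  "index.refresh_interval": "-1",
  "index.translog.flush_threshold_size": "1gb"
}
```

After the batch upload completes, restore the refresh interval so that new documents become searchable:

```
PUT my_vector_index/_settings
{
  "index.refresh_interval": null
}
```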
Exclude vector fields from _source
Elasticsearch stores the original JSON document provided at index time in the _source field. By default, each hit in the search results includes the full _source document. When a document contains a high-dimensional dense vector field, the _source can be very large and expensive to load. This can significantly slow down knn searches.
The reindex, update, and update by query operations often require the _source field. Excluding fields from _source can cause these operations to behave unexpectedly. For example, when you reindex, the dense_vector field might not be included in the new index.
You can use the excludes mapping parameter to exclude dense vector fields from being stored in _source. This prevents large amounts of raw vector data from being loaded and returned during a search and also reduces the index size. A vector excluded from _source can still be used in a knn search because the search process relies on a separate data structure. However, you should review the potential drawbacks of excluding _source fields before you use the excludes parameter. For more information about the drawbacks, see the preceding note.
PUT /my_vector_index
{
"mappings": {
"_source": {
"excludes": [
"my_vector"
]
},
"properties": {
"my_vector": {
"type": "dense_vector",
"dims": 3
},
"my_text": {
"type": "keyword"
}
}
}
}

To view the vector content in a document, if your Elasticsearch version is 8.17 or later, you can use the following command:
GET my_vector_index/_search
{
"docvalue_fields": ["my_vector"]
}
In other versions, you can use a script field instead:
GET my_vector_index/_search
{
"script_fields": {
"vector_field": {
"script": {
"source" : "doc['my_vector'].vectorValue"
}
}
}
}

As an alternative to excluding vector fields from the _source field, see synthetic _source.
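For example, in Elasticsearch 8.x you can enable synthetic _source in the index mappings as shown in the following sketch. The exact syntax is version-dependent (newer versions control this through the index.mapping.source.mode index setting instead), so check the documentation for your version:

```
PUT /my_vector_index
{
  "mappings": {
    "_source": {
      "mode": "synthetic"
    },
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3
      }
    }
  }
}
```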
Upgrade instance type configuration
Because vector similarity calculation is compute-intensive, it demands high CPU performance. You can choose Turbo instance types to more than double the performance, and use a blue-green deployment to upgrade to a Turbo instance type of the same specification.