OpenSearch protocol: Performance on the hundred-million-scale MSMARCO dataset - PolarDB

This benchmark measures the write and query performance of the PolarDB vector index on a hundred-million-scale dataset of 1,024-dimension vectors. The test uses the OpenSearch protocol on a fixed hardware and software configuration. This topic covers the test environment, dataset, key configuration parameters, reproduction steps, and an analysis of the performance results. Use the results as a reference for technology selection, capacity planning, and performance tuning.

Scope

The performance data is based on a specific cluster environment and dataset. Before you use this data for decision-making, confirm that your environment is similar to the one described below.

Cluster specifications and versions

Master node: 2-core 8 GB.
Read-only node: 2-core 8 GB.
Search nodes: 32-core 256 GB × 3.
Client latency: 0.097 ms.
PolarDB vector index version: 2.19.3.

Dataset

Category	Item	Details
Software version	PolarDB-Vector	2.19.3
Dataset	MSMARCO V2.1	Cohere/msmarco-v2.1-embed-english-v3
Data scale	Total documents	113,520,750
Data scale	Vector dimensions	1,024
Data scale	Query set size	1,677
Algorithm parameters	Distance measure	L2 (Euclidean distance)
Algorithm parameters	Index type	HNSW

Test steps

The following steps describe how to reproduce index creation, data writing, and performance stress testing.

Note

To obtain the test script and reproduce this test flow, submit a ticket.

Create an HNSW index and write data

Create an index: Use the following configuration to create an index for a hundred-million-scale dataset. This configuration balances build speed, memory usage, and query performance.

Define the index schema and key parameters.
- number_of_shards: Set to 18 to distribute data and computing workloads evenly across three search nodes (96 physical cores in total).
- ef_construction and m: These are key parameters for building an HNSW index. In this test, ef_construction is set to 128 and m is set to 8 to balance build speed and index quality.
- refresh_interval and durability: These are specific optimizations to maximize test performance and are not recommended for direct use in a production environment. For more information, see Going live.
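As a quick check of the shard count, the following sketch works out how 18 shards map onto the three 32-core search nodes. It assumes that shards are allocated evenly across nodes, which the source implies but does not show. The function name shard_layout is illustrative.

```python
# Figures taken from the cluster specification and index settings in this topic.
SEARCH_NODES = 3
CORES_PER_NODE = 32
NUMBER_OF_SHARDS = 18


def shard_layout(shards: int, nodes: int, cores_per_node: int) -> dict:
    """Return shards per node and cores per shard, assuming even allocation."""
    if shards % nodes != 0:
        raise ValueError("shards do not divide evenly across nodes")
    total_cores = nodes * cores_per_node
    return {
        "shards_per_node": shards // nodes,
        "total_cores": total_cores,
        "cores_per_shard": total_cores / shards,
    }


layout = shard_layout(NUMBER_OF_SHARDS, SEARCH_NODES, CORES_PER_NODE)
print(layout)
```

With these inputs each search node holds 6 shards, and each shard has a little over 5 cores available on average.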

Run the following command to create the index.

curl -X PUT "http://<endpoint>:<port>/msmarco" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "docid": { "type": "keyword" },
      "domain": { "type": "keyword" },
      "emb": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "engine": "faiss",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {
            "ef_construction": 128,
            "m": 8
          }
        }
      },
      "url": { "type": "text" }
    }
  },
  "settings": {
    "index": {
      "replication": { "type": "DOCUMENT" },
      "refresh_interval": "0s",
      "number_of_shards": "18",
      "translog": {
        "flush_threshold_size": "1gb",
        "sync_interval": "30s",
        "durability": "async"
      },
      "knn.algo_param": { "ef_search": "64" },
      "knn": "true",
      "number_of_replicas": "0"
    }
  }
}
'
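For scripted runs, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not the test script used for this benchmark. The helper names build_index_body and create_index are illustrative, and the sketch assumes an endpoint that accepts unauthenticated HTTP requests; add credentials if your cluster requires them.

```python
import json
import urllib.request


def build_index_body(dimension=1024, shards=18, ef_construction=128, m=8, ef_search=64):
    """Build the create-index body used in this test as a plain dict."""
    return {
        "mappings": {
            "properties": {
                "docid": {"type": "keyword"},
                "domain": {"type": "keyword"},
                "emb": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "engine": "faiss",
                        "space_type": "l2",
                        "name": "hnsw",
                        "parameters": {"ef_construction": ef_construction, "m": m},
                    },
                },
                "url": {"type": "text"},
            }
        },
        "settings": {
            "index": {
                "replication": {"type": "DOCUMENT"},
                "refresh_interval": "0s",  # test-only value, see Going live
                "number_of_shards": shards,
                "translog": {
                    "flush_threshold_size": "1gb",
                    "sync_interval": "30s",
                    "durability": "async",  # test-only value, see Going live
                },
                "knn.algo_param": {"ef_search": ef_search},
                "knn": True,
                "number_of_replicas": 0,
                # index.provided_name is omitted: it is derived from the
                # index name in the request URL.
            }
        },
    }


def create_index(base_url, index_name, body):
    """Send PUT <base_url>/<index_name>. Assumes no authentication is needed."""
    request = urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example (not executed here): create_index("http://<endpoint>:<port>", "msmarco", build_index_body())
```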

Write data: Write the MSMARCO V2.1 dataset to the HNSW index.
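The source does not show the loader, so the following is only a sketch of how documents could be batched for the OpenSearch _bulk API, which expects newline-delimited JSON with an action line before each document. Using docid as the document _id and the helper name to_bulk_ndjson are assumptions made for illustration.

```python
import json


def to_bulk_ndjson(docs, index_name="msmarco"):
    """Serialize documents into a _bulk request body (NDJSON).

    Each document is preceded by an "index" action line. The body must end
    with a newline. Using docid as _id is an assumption of this sketch.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": doc["docid"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


# Tiny illustrative batch. Real vectors have 1,024 dimensions.
sample_docs = [
    {"docid": "d1", "domain": "example.com", "url": "http://example.com/a", "emb": [0.1, 0.2]},
    {"docid": "d2", "domain": "example.org", "url": "http://example.org/b", "emb": [0.3, 0.4]},
]
payload = to_bulk_ndjson(sample_docs)
```

The resulting payload would be sent with POST to the _bulk endpoint using the Content-Type application/x-ndjson. Batch size is a tuning choice that the source does not specify.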

Write performance

Total time: 13,523.85 seconds (about 3.76 hours). This time includes network transfer of the data, writing to the translog, and background HNSW index construction.
Average write throughput: 8,394.11 docs/sec.
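The throughput figure follows directly from the document count and the total time. The short calculation below reproduces it from the numbers reported in this topic.

```python
TOTAL_DOCS = 113_520_750      # total documents in the dataset
TOTAL_SECONDS = 13_523.85     # reported total write time

docs_per_second = TOTAL_DOCS / TOTAL_SECONDS
hours = TOTAL_SECONDS / 3600

print(f"{docs_per_second:.2f} docs/sec over {hours:.2f} hours")
```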

Run a query performance test

The query throughput (QPS), latency, and recall rate were tested using different combinations of concurrency and the ef_search parameter.

concurrency: The number of queries issued in parallel by the client, from 1 to 128.
ef_search: The breadth of neighbor nodes searched in the HNSW graph during a query. A larger value theoretically results in a higher recall rate but also increases computational overhead, which decreases QPS and increases latency.
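The two request bodies involved can be sketched as follows. The k-NN query shape follows the open source OpenSearch k-NN plugin. The value k=10 is a placeholder because the source does not state the k used in the test. Changing ef_search through an index settings update (PUT /msmarco/_settings) is how the open source plugin exposes it, and this sketch assumes that PolarDB behaves the same way.

```python
def build_knn_query(vector, k=10, field="emb"):
    """Build a k-NN search body for the vector field defined in the index.

    k=10 is a placeholder: the source does not state the k used in the test.
    """
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }


def build_ef_search_update(ef_search):
    """Build a settings body that changes ef_search on an existing index.

    Assumption: the setting is dynamic, as in the open source k-NN plugin.
    """
    if ef_search < 1:
        raise ValueError("ef_search must be a positive integer")
    return {"index": {"knn.algo_param.ef_search": ef_search}}


query_body = build_knn_query([0.1, 0.2, 0.3], k=10)
settings_body = build_ef_search_update(128)
```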

Stress testing command

Run the following command to perform a 60-second stress test for different combinations of concurrency and ef_search.

# Example command. Replace it with your actual script.
python benchmark.py --concurrency 1/2/4/8/16/32/64/128 --ef-search 32/64/128/256 --max-duration 60
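The actual benchmark.py is not published in this topic. As a rough illustration of what such a harness measures, the sketch below runs a query function from a fixed number of threads for a fixed duration and reports QPS, average latency, and P99 latency. The function run_load and the stub query are hypothetical. A real run would replace the stub with a call that sends a k-NN query to the cluster.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(query_fn, concurrency, max_duration):
    """Call query_fn in a closed loop from `concurrency` threads for max_duration seconds."""
    deadline = time.perf_counter() + max_duration

    def worker():
        latencies_ms = []
        while time.perf_counter() < deadline:
            start = time.perf_counter()
            query_fn()
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
        return latencies_ms

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        latencies = [ms for future in futures for ms in future.result()]
    elapsed = time.perf_counter() - started

    if not latencies:
        raise ValueError("no requests completed; increase max_duration")

    latencies.sort()
    # Nearest-rank percentile: the smallest value with at least 99% of samples at or below it.
    p99_index = max(0, math.ceil(0.99 * len(latencies)) - 1)
    return {
        "requests": len(latencies),
        "qps": len(latencies) / elapsed,
        "avg_ms": statistics.fmean(latencies),
        "p99_ms": latencies[p99_index],
    }


def stub_query():
    """Placeholder for a real k-NN request."""
    sum(range(1000))


stats = run_load(stub_query, concurrency=4, max_duration=0.1)
```

Python threads are adequate here only because a real query function spends its time waiting on the network. A harness that must drive very high request rates might use multiple processes instead.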

Performance test results

ef_search	concurrency	QPS	Avg (ms)	P99 (ms)	Recall
32	1	132.4	7.53	8.83	0.9585
32	16	878.19	18.13	30.14	0.9586
32	64	994.43	63.48	135.83	0.9621
32	128	1043.22	118.53	256.14	0.9693
64	1	132.82	7.5	8.82	0.9585
64	16	878.44	18.11	30.35	0.9586
64	64	989.47	63.77	136.55	0.9622
64	128	1062	116.74	238.94	0.9696
128	1	132.74	7.51	8.82	0.9585
128	16	884.77	17.99	29.91	0.9588
128	64	998.4	63.28	133.64	0.962
128	128	1063.91	116.85	244.43	0.9695
256	1	132.45	7.52	8.82	0.9585
256	16	881.95	18.05	30.16	0.9587
256	64	993.25	63.4	135.17	0.962
256	128	1067.68	116.09	227.54	0.9697
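The table can be sanity-checked with Little's law, which says that the average number of in-flight requests equals throughput multiplied by average latency. For a closed-loop client, that product should be close to the configured concurrency. The sketch below applies the check to the ef_search = 32 rows and also computes the speedup relative to a concurrency of 1. The function name analyze is illustrative.

```python
# (concurrency, QPS, average latency in ms) for ef_search = 32, copied from the table.
ROWS = [
    (1, 132.4, 7.53),
    (16, 878.19, 18.13),
    (64, 994.43, 63.48),
    (128, 1043.22, 118.53),
]


def analyze(rows):
    """Return in-flight requests (Little's law), speedup, and efficiency per row."""
    base_qps = rows[0][1]
    results = []
    for concurrency, qps, avg_ms in rows:
        in_flight = qps * avg_ms / 1000.0   # L = throughput x latency
        speedup = qps / base_qps            # relative to concurrency 1
        results.append({
            "concurrency": concurrency,
            "in_flight": in_flight,
            "speedup": speedup,
            "efficiency": speedup / concurrency,
        })
    return results


analysis = analyze(ROWS)
for row in analysis:
    print(
        f"concurrency={row['concurrency']:>3} "
        f"in_flight={row['in_flight']:.2f} "
        f"speedup={row['speedup']:.2f} "
        f"efficiency={row['efficiency']:.2f}"
    )
```

The in-flight estimate stays close to the configured concurrency in every row, which suggests that the reported QPS and latency figures are internally consistent. The speedup column shows where throughput stops scaling with added concurrency.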

Analysis of performance results

Concurrency scalability: QPS rises steeply at low concurrency, from about 132 at a concurrency of 1 to about 880 at a concurrency of 16, a more than sixfold gain. Beyond that point, growth flattens: QPS reaches about 990 to 1,000 at a concurrency of 64 and peaks between about 1,040 and 1,070 at a concurrency of 128. The flattening indicates that system resources, most likely the CPU, are nearly saturated and have become the performance bottleneck.
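Relationship between latency and concurrency: The average (Avg) and P99 latencies increase significantly as concurrency grows. Between a concurrency of 1 and 128, the average latency rises from about 7.5 ms to about 116 to 119 ms, and the P99 latency rises from about 8.8 ms to about 228 to 256 ms. This behavior is expected as the system load increases. In scenarios that require high QPS, ensure that the P99 latency meets your business requirements.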
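Recall rate performance: Under all test conditions, the recall rate stays between 0.9585 and 0.9697, which is above 95.8%. This indicates that the HNSW index has high search accuracy with the current parameters. In these results, recall, QPS, and latency are nearly identical across ef_search values from 32 to 256.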
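The source does not define how recall is computed or which k is used. A common definition, shown here as an assumption, is recall@k: the fraction of the true k nearest neighbors that appear in the top k returned results, averaged over all queries. The function names are illustrative.

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    """Fraction of the true top-k neighbors found in the top-k retrieved results."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth_ids[:k])
    found = set(retrieved_ids[:k]) & truth
    return len(found) / k


def mean_recall(all_retrieved, all_ground_truth, k):
    """Average recall@k over a query set."""
    if len(all_retrieved) != len(all_ground_truth) or not all_retrieved:
        raise ValueError("need one non-empty result list per ground-truth list")
    scores = [recall_at_k(r, g, k) for r, g in zip(all_retrieved, all_ground_truth)]
    return sum(scores) / len(scores)


# Two toy queries: the first finds 2 of 4 true neighbors, the second finds all 4.
example = mean_recall(
    [["a", "b", "c", "d"], ["p", "q", "r", "s"]],
    [["a", "b", "x", "y"], ["s", "r", "q", "p"]],
    k=4,
)
```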

Going live

Using the test environment configuration directly in a production environment is risky. The following sections provide configuration recommendations for key parameters and guidance for resource planning in a production environment.

Production recommendations for key parameters

The following parameters were set to achieve maximum test performance. Evaluate them carefully before using them in a production environment.

"refresh_interval": "0s"
- Test purpose: To disable auto-refresh. This ensures that during the write test, data is written only to memory and the translog. A manual refresh is run before the query test to obtain query performance data without interference from background tasks.
- Production recommendation: Do not set this to 0s in a production environment. Set a reasonable value based on your data visibility requirements. For example, a value of 1s means that new data is searchable approximately 1 second after it is written.
"durability": "async"
- Test purpose: To use asynchronous translog flushing. Data is written to memory and a success response is returned immediately. A background thread then asynchronously persists the data to disk. This improves write throughput.
- Production recommendation: Use this with caution in scenarios that require high data reliability. If a server crashes, the async mode can lose data that was acknowledged but not yet persisted to disk, that is, writes received since the last translog sync. The window is bounded by sync_interval, which is 30s in this test. If you have high data reliability requirements, use the default request mode in your production environment. This mode returns a success response only after data is written to the translog and persisted to disk, but it reduces write performance.
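The recommendations above can be expressed as a settings body for an existing index. This is a minimal sketch that assumes both settings can be updated through PUT /msmarco/_settings, as in open source OpenSearch. The helper name production_settings is illustrative, and the 1s value is the example from this topic, not a universal recommendation.

```python
def production_settings(refresh_interval="1s", durability="request"):
    """Build a settings body that replaces the test-only values.

    refresh_interval: how often new data becomes searchable.
    durability: "request" persists the translog before acknowledging a write,
    "async" persists it in the background.
    """
    if durability not in ("request", "async"):
        raise ValueError("durability must be 'request' or 'async'")
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "translog": {"durability": durability},
        }
    }


settings = production_settings()
```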

Resource utilization assessment

Understanding the system's resource consumption under peak load is crucial for accurate capacity planning.

Write-intensive scenarios: At the measured average write throughput of about 8,394 docs/sec, the primary system bottlenecks are CPU (used for index construction) and disk I/O (used for translog writes).
Query-intensive scenarios: At the peak query load, a concurrency of 128 and about 1,068 QPS, the system bottleneck is primarily CPU usage.
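Memory is not measured in this topic, but it is worth estimating when you size search nodes. The sketch below uses the rule of thumb from the open source OpenSearch k-NN documentation for HNSW, about 1.1 × (4 × dimension + 8 × m) bytes per vector. Whether the PolarDB vector engine has the same footprint is an assumption, so treat the result as a rough starting point rather than a measured value.

```python
def hnsw_memory_bytes(num_vectors, dimension, m, overhead=1.1):
    """Rough HNSW memory estimate: overhead * (4 * dimension + 8 * m) bytes per vector.

    Rule of thumb from the open source OpenSearch k-NN documentation.
    Applying it to PolarDB is an assumption of this sketch.
    """
    return overhead * (4 * dimension + 8 * m) * num_vectors


GIB = 1024 ** 3
total_gib = hnsw_memory_bytes(113_520_750, dimension=1024, m=8) / GIB
per_node_gib = total_gib / 3  # three search nodes, assuming an even spread

print(f"total: {total_gib:.0f} GiB, per search node: {per_node_gib:.0f} GiB")
```

Under this estimate, the graph and vectors need on the order of 480 GiB in total, or roughly 160 GiB on each of the three 256 GB search nodes.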

FAQ

How does the PolarDB vector index perform with a ten-million-scale data volume?

This test is based on a dataset of 113 million records. For a ten-million-scale scenario (for example, 10 million records), the expected performance is as follows:

Query performance: With the same hardware configuration, single-query latency decreases and the maximum system QPS increases. This is because the data volume and index size are smaller.
Write performance: The total time for data writing and index construction is significantly reduced.

To obtain performance data specific to your business scenario, you can run small-scale tests using your actual business data.