Vector Retrieval Service for Milvus: Query node specifications and performance comparison

Last Updated: Sep 22, 2025

This topic describes the Compute Unit (CU) types and the supported numbers of query nodes for Vector Retrieval Service for Milvus. Use this information to select the optimal instance specification for your workload.

CU type introduction

Query nodes of Vector Retrieval Service for Milvus support the following CU types for different business scenarios and requirements:

  • Performance-optimized: Suitable for scenarios that require high queries per second (QPS) and low query latency, such as high-concurrency, high-traffic workloads in search, recommendation systems, generative AI, and chatbot applications.

  • Capacity-optimized: Suitable for scenarios that involve large data volumes but have less stringent latency requirements. Capacity-optimized instances offer four times the storage capacity of performance-optimized instances of the same specifications, so they can store and manage more vector data. Typical scenarios include large-scale unstructured data retrieval, copyright detection, and model data preparation.

    Important

    Capacity-optimized CUs currently have the following limitations:

    • Only horizontal scaling (scale-out and scale-in) is supported. Vertical scaling (upgrading or downgrading node specifications) is not. Therefore, you must carefully confirm the CU specifications before making a purchase.

    • The DiskANN index is recommended. This index type supports only vector data of the Float type. For vector distance measurement, only Euclidean distance (L2), inner product (IP), or cosine similarity (COSINE) are supported.
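For reference, the following is a minimal pymilvus sketch that creates a collection with a DiskANN index on a Float vector field using cosine similarity. It assumes pymilvus 2.4 or later with the MilvusClient API; the endpoint URI, collection name, field names, and dimension are placeholders that you should replace with your own values.

```python
from pymilvus import MilvusClient, DataType

# Placeholder endpoint; replace with your instance's endpoint and credentials.
client = MilvusClient(uri="http://localhost:19530")

# Schema: a string primary key plus a Float vector field
# (DiskANN supports only vector data of the Float type).
schema = MilvusClient.create_schema(auto_id=False)
schema.add_field(field_name="id", datatype=DataType.VARCHAR,
                 is_primary=True, max_length=64)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=768)

# DiskANN index with one of the supported metrics: L2, IP, or COSINE.
index_params = client.prepare_index_params()
index_params.add_index(field_name="embedding", index_type="DISKANN",
                       metric_type="COSINE")

client.create_collection(collection_name="demo_collection", schema=schema,
                         index_params=index_params)
```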

Storage capacity comparison

| CU type | Index type | CU specifications | Capacity (SIFT, 128-dim) | Capacity (GIST, 960-dim) |
| --- | --- | --- | --- | --- |
| Performance-optimized | HNSW (M: 30, efConstruction: 360) | 4 vCPU 16 GiB (4 CUs) | 16 million | 3 million |
| Performance-optimized | HNSW (M: 30, efConstruction: 360) | 8 vCPU 32 GiB (8 CUs) | 32 million | 6 million |
| Performance-optimized | HNSW (M: 30, efConstruction: 360) | 16 vCPU 64 GiB (16 CUs) | 64 million | 12 million |
| Performance-optimized | HNSW (M: 30, efConstruction: 360) | 32 vCPU 128 GiB (32 CUs) | 128 million | 24 million |
| Capacity-optimized | DiskANN | 8 vCPU 32 GiB (8 CUs) | 120 million | 23 million |
| Capacity-optimized | DiskANN | 16 vCPU 64 GiB (16 CUs) | 240 million | 46 million |
| Capacity-optimized | DiskANN | 32 vCPU 128 GiB (32 CUs) | 480 million | 92 million |

Note
  • The data in the preceding table is based on actual tests and can be used as a reference for capacity assessment.

  • The data used for the capacity test includes only primary keys and vector data, with no scalar data. The primary keys are strings converted from auto-incrementing positive integers that start from 0. In most production scenarios, scalar fields are essential and also consume storage space, so the actual number of vectors that you can store will be lower than the values shown in the table.

Retrieval performance comparison

| CU type | CU specifications | Index type | topk=50 (QPS / RT_p99) | topk=100 (QPS / RT_p99) | topk=250 (QPS / RT_p99) | topk=1000 (QPS / RT_p99) |
| --- | --- | --- | --- | --- | --- | --- |
| Performance-optimized | 16 vCPU 64 GiB (16 CUs) | HNSW (M: 30, efConstruction: 360) | 2,000 / < 10 ms | 1,200 / < 10 ms | 550 / < 15 ms | 150 / < 30 ms |
| Capacity-optimized | 16 vCPU 64 GiB (16 CUs) | DiskANN | 700 / < 15 ms | 550 / < 20 ms | 200 / < 30 ms | 60 / < 50 ms |

Note
  • The data in the preceding table is based on test results from the Cohere dataset (10 million 768-dimension vectors). This data is for reference only because performance is affected by the data distribution of different datasets.

  • The RT_p99 metric is the 99th percentile response time, measured by running 1,000 queries sequentially.

  • The data used for this performance test includes only primary keys and vector data, with no scalar data. The primary keys are auto-incrementing positive integers that start from 0. The HNSW index type is used for performance-optimized instances, and the DiskANN index type is used for capacity-optimized instances.

  • Vector Retrieval Service for Milvus periodically optimizes vector indexes in the background. This process is usually completed within 3 hours after data is written, at which point the system performance reaches its optimal state.
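If you want to reproduce the RT_p99 methodology against your own collection, the following is a minimal sketch that runs 1,000 sequential queries and reports the 99th percentile latency. The endpoint, collection name, and vector dimension are placeholders, and client-side timing also includes network round-trip time, so the results will differ from server-side measurements.

```python
import time

import numpy as np
from pymilvus import MilvusClient

# Placeholder endpoint and collection; adjust for your instance.
client = MilvusClient(uri="http://localhost:19530")
rng = np.random.default_rng(seed=0)

latencies_ms = []
for _ in range(1000):  # 1,000 sequential queries, matching the RT_p99 methodology
    query_vector = rng.random(768, dtype=np.float32).tolist()
    start = time.perf_counter()
    client.search(collection_name="demo_collection", data=[query_vector], limit=100)
    latencies_ms.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"RT_p99: {np.percentile(latencies_ms, 99):.1f} ms")
```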

Number of query nodes

Vector Retrieval Service for Milvus lets you adjust the number of query nodes from 1 to 50. QPS scales approximately linearly with the number of nodes, and more nodes also increase service availability. Therefore, for production environments that require high availability, configure at least 2 nodes.

Scenario analysis

Suppose you are building an image retrieval system that stores 20 million images, each represented by a 768-dimension vector, and your goal is to process 2,000 search requests per second, returning the top 100 related images within 10 milliseconds. You can evaluate your options as follows:

  1. Latency assessment: Select the appropriate CU type based on your latency requirements. For example, if you require a latency of less than 10 milliseconds, the performance-optimized CU is the only type that can meet this requirement.

  2. Capacity consideration: Calculate the required number of CUs based on the data volume and vector dimensions. The capacity table does not list a 768-dimension figure, so use the 960-dimension figure as a conservative estimate: one performance-optimized node with 16 vCPU and 64 GiB (16 CUs) can hold 12 million 960-dimension vectors. To accommodate 20 million 768-dimension vectors, you therefore need at least two of these nodes, for a total of 32 CUs.

  3. Throughput check: Verify the throughput of each node for a given top-k setting. With a top-k setting of 100, a 16-CU performance-optimized node provides 1,200 QPS. Because the 20 million vectors already span two nodes, sustaining 2,000 QPS requires doubling the node count to four.

In summary, for this application scenario, select performance-optimized CUs and configure 4 nodes, each with specifications of 16 vCPU and 64 GiB (16 CUs), to meet the performance requirements.
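The sizing arithmetic above can be captured in a short helper. This is a rough sketch of the reasoning in this topic, not an official sizing tool: it assumes QPS scales linearly with the number of nodes, and the per-node capacity and QPS figures are taken from the tables above, so re-check them against your own dataset and workload.

```python
import math

def estimate_query_nodes(num_vectors: int, target_qps: float,
                         capacity_per_node: int, qps_per_node: float,
                         min_nodes: int = 2) -> int:
    """Rough node-count estimate for a single CU specification.

    Each full copy of the dataset spans `nodes_per_copy` nodes, and each
    additional copy adds roughly `qps_per_node` of throughput, so the two
    factors multiply. `min_nodes=2` reflects the high-availability
    guidance above.
    """
    nodes_per_copy = math.ceil(num_vectors / capacity_per_node)
    copies_for_qps = math.ceil(target_qps / qps_per_node)
    return max(min_nodes, nodes_per_copy * copies_for_qps)

# Scenario above: 20 million 768-dim vectors, 2,000 QPS at topk=100.
# 12 million per node is the conservative 960-dim capacity of a 16-CU
# performance-optimized node; 1,200 QPS is its topk=100 throughput.
print(estimate_query_nodes(20_000_000, 2_000,
                           capacity_per_node=12_000_000,
                           qps_per_node=1_200))  # 4
```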