Alibaba Cloud AISearch for Milvus (Milvus) offers different Compute Unit (CU) types and flexible compute node (Query Node) counts. Use this reference to select an instance type that fits your business scenario.
CU types
Milvus compute nodes support the following CU types:
- Compute-optimized: Best for high-QPS, low-latency workloads such as search, recommendation systems, generative AI, and chatbots.
- Storage-optimized: Best for large data volumes with moderate search-performance requirements. Storage-optimized instances offer roughly four times the storage capacity of compute-optimized instances while still delivering search performance that is sufficient for most scenarios. They are well suited to large-scale unstructured data retrieval, copyright detection, and model data preparation.
Important: Storage-optimized CUs currently have the following limitations:
- Only scale-out and scale-in are supported. Vertical upgrades and downgrades are not supported, so confirm the CU specifications carefully before purchase.
- DiskANN is the only recommended index type. It supports only float vector data, and only the following distance metrics: Euclidean distance (L2), inner product (IP), and cosine similarity (COSINE).
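The DiskANN limitations above can be checked before an index is created. The following is a minimal sketch in plain Python, not a call into any Milvus SDK: the function name `diskann_index_params`, the returned dictionary layout, and the type name `FLOAT_VECTOR` are assumptions made for illustration. Only the index name and the three metric names come from the limitations listed above.

```python
# Constraints taken from the storage-optimized limitations above.
# The type name "FLOAT_VECTOR" is an assumed label for "float vector data".
ALLOWED_METRICS = {"L2", "IP", "COSINE"}
ALLOWED_VECTOR_TYPES = {"FLOAT_VECTOR"}


def diskann_index_params(field_name, vector_type, metric_type):
    """Validate a DiskANN index request and return it as a plain dict.

    The dict layout is hypothetical; adapt it to whatever client you use.
    """
    metric = metric_type.upper()
    if vector_type not in ALLOWED_VECTOR_TYPES:
        raise ValueError(f"DiskANN supports only float vectors, got {vector_type}")
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"Unsupported metric for DiskANN: {metric_type}")
    return {
        "field_name": field_name,
        "index_type": "DISKANN",
        "metric_type": metric,
    }


if __name__ == "__main__":
    print(diskann_index_params("embedding", "FLOAT_VECTOR", "cosine"))
```

Failing fast on an unsupported vector type or metric is cheaper than discovering the mismatch after data has been loaded into a storage-optimized instance.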
Storage capacity comparison
| CU type | Index type | CU specifications | Vector data capacity reference (based on 128-dimension SIFT vector data) | Vector data capacity reference (based on 960-dimension GIST vector data) |
| --- | --- | --- | --- | --- |
| Compute-optimized | HNSW (M: 30, efConstruction: 360) | 4 vCPU 16 GiB (4 CUs) | 16 million | 3 million |
| Compute-optimized | HNSW (M: 30, efConstruction: 360) | 8 vCPU 32 GiB (8 CUs) | 32 million | 6 million |
| Compute-optimized | HNSW (M: 30, efConstruction: 360) | 16 vCPU 64 GiB (16 CUs) | 64 million | 12 million |
| Compute-optimized | HNSW (M: 30, efConstruction: 360) | 32 vCPU 128 GiB (32 CUs) | 128 million | 24 million |
| Storage-optimized | DiskANN | 8 vCPU 32 GiB (8 CUs) | 120 million | 23 million |
| Storage-optimized | DiskANN | 16 vCPU 64 GiB (16 CUs) | 240 million | 46 million |
| Storage-optimized | DiskANN | 32 vCPU 128 GiB (32 CUs) | 480 million | 92 million |
- The data in the table is based on performance tests and serves as a reference for capacity assessment.
- The test dataset contains only primary keys and vector data, with no scalar fields. The primary keys are auto-incrementing integers starting from zero, converted to strings. Because most production workloads also store scalar fields, which consume storage, the number of vectors you can actually store will be lower than the values in the table.
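As a rough aid for reading the capacity table, the sketch below picks the smallest CU specification whose reference capacity covers a given vector count. The figures are copied from the table above. The function name `smallest_cu_spec`, the dataset keys, and especially the `headroom` parameter are assumptions: the 20% default is an arbitrary placeholder for scalar-field overhead, not a value from the tests, so measure your own data before relying on it.

```python
# Reference capacities in millions of vectors, copied from the table above.
# Each entry maps CU count -> (128-dim SIFT capacity, 960-dim GIST capacity).
CAPACITY_MILLIONS = {
    "compute-optimized": {4: (16, 3), 8: (32, 6), 16: (64, 12), 32: (128, 24)},
    "storage-optimized": {8: (120, 23), 16: (240, 46), 32: (480, 92)},
}
DATASET_COLUMN = {"sift128": 0, "gist960": 1}


def smallest_cu_spec(cu_type, vectors_millions, dataset="gist960", headroom=0.2):
    """Return the smallest CU count that fits the vectors, or None.

    headroom is the fraction of reference capacity held back for scalar
    fields; 0.2 is an illustrative placeholder, not a measured value.
    """
    column = DATASET_COLUMN[dataset]
    specs = CAPACITY_MILLIONS[cu_type]
    for cus in sorted(specs):
        usable = specs[cus][column] * (1 - headroom)
        if usable >= vectors_millions:
            return cus
    return None  # no single specification in the table is large enough


if __name__ == "__main__":
    print(smallest_cu_spec("storage-optimized", 20))
```

A return value of `None` means that no single specification in the table is large enough under the chosen headroom, in which case the data has to be spread across more than one node.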
Retrieval performance comparison
| CU type | CU specifications | Index type | topk=50 QPS | topk=50 RT_p99 | topk=100 QPS | topk=100 RT_p99 | topk=250 QPS | topk=250 RT_p99 | topk=1000 QPS | topk=1000 RT_p99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Compute-optimized | 16 vCPU 64 GiB (16 CUs) | HNSW (M: 30, efConstruction: 360) | 2000 | < 10 ms | 1200 | < 10 ms | 550 | < 15 ms | 150 | < 30 ms |
| Storage-optimized | 16 vCPU 64 GiB (16 CUs) | DiskANN | 700 | < 15 ms | 550 | < 20 ms | 200 | < 30 ms | 60 | < 50 ms |
- The data is based on test results from the Cohere dataset (10 million vectors, 768 dimensions). Actual performance varies with the data distribution of each dataset.
- RT_p99 is measured by running 1,000 queries sequentially and taking the 99th percentile response time.
- The test data contains only primary keys and vector data, with no scalar fields. The primary keys are auto-incrementing integers starting from zero. HNSW is used for compute-optimized instances and DiskANN for storage-optimized instances.
- Milvus periodically optimizes vector indexes in the background. This process typically completes within 3 hours after data is written, after which the system reaches optimal performance.
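The RT_p99 procedure described in the notes above (run queries one at a time, then take the 99th percentile) can be sketched as follows. This is an illustration under stated assumptions, not the harness used for the published numbers: `fake_search` is a stand-in for a real search call, and the nearest-rank percentile definition is one common choice that the notes do not specify.

```python
import time


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (integer pct, 1-100), nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceil(pct * n / 100) using integers
    return ordered[rank - 1]


def measure_rt_p99(search_fn, queries):
    """Run the queries sequentially and return the p99 latency in milliseconds."""
    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return percentile_nearest_rank(latencies_ms, 99)


def fake_search(query):
    """Hypothetical placeholder; replace with a real vector search call."""
    return [query]


if __name__ == "__main__":
    print(f"RT_p99: {measure_rt_p99(fake_search, range(1000)):.4f} ms")
```

Because the queries run one after another, this measures single-request latency rather than latency under concurrent load. Measuring again after the background index optimization has finished should give numbers closer to steady state.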
Number of compute nodes
You can scale the number of compute nodes (Query Nodes) from 1 to 50. QPS increases linearly with the number of nodes. More nodes also improve service availability, so for production environments that require high availability, use at least 2 nodes.
Scenario analysis
Suppose you are building an image retrieval system that contains 20 million images. Each image is represented by a 768-dimension vector. Your goal is to process 2,000 search requests per second and return the top 100 results within 10 milliseconds. Evaluate your options as follows:
- Latency assessment: Choose a CU type based on your latency requirement. At top-k 100, the compute-optimized type has an RT_p99 below 10 ms, whereas the storage-optimized type has an RT_p99 below 20 ms, so only the compute-optimized type meets a 10-millisecond requirement.
- Capacity consideration: Calculate the number of nodes required from the data volume and vector dimensions. A single compute-optimized node with 16 vCPU 64 GiB (16 CUs) holds about 12 million 960-dimension vectors. Using that figure as a conservative reference for 768-dimension vectors, 20 million vectors require at least two such nodes, for a total of 32 CUs.
- Throughput validation: Check the per-node throughput at your top-k setting. With top-k set to 100, a compute-optimized 16 CU node delivers about 1,200 QPS. To sustain 2,000 QPS, double the number of nodes from two to four.
Based on this analysis, choose compute-optimized CUs and configure four nodes, each with 16 vCPU 64 GiB (16 CUs) specifications.
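The arithmetic behind this recommendation can be written out as a short sketch. The two reference figures come from the tables above. The function name `size_cluster` and the model itself (nodes needed for capacity, multiplied by the factor needed for throughput) are an interpretation of the steps in this section rather than an official sizing formula.

```python
import math

# Reference figures for a compute-optimized 16 vCPU 64 GiB (16 CUs) node.
CAPACITY_960D = 12_000_000  # vectors per node, 960-dimension GIST reference
QPS_TOP100 = 1_200          # QPS at topk=100 from the performance table


def size_cluster(num_vectors, target_qps, capacity_per_node, reference_qps):
    """Return (capacity_nodes, throughput_factor, total_nodes).

    Assumed model: first find the nodes needed to hold the data, then
    scale that count by the factor needed to reach the target QPS.
    """
    capacity_nodes = math.ceil(num_vectors / capacity_per_node)
    throughput_factor = math.ceil(target_qps / reference_qps)
    return capacity_nodes, throughput_factor, capacity_nodes * throughput_factor


if __name__ == "__main__":
    # 20 million vectors at 2,000 QPS with top-k 100
    print(size_cluster(20_000_000, 2_000, CAPACITY_960D, QPS_TOP100))  # (2, 2, 4)
```

The reference capacity excludes scalar fields and the reference QPS was measured on a 10-million-vector dataset, so treat the result as a starting point and validate it with a load test on your own data.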