
Practices for HNSW Distributed Construction

The article introduces practices for improving the distributed construction of Hierarchical Navigable Small World (HNSW) graphs to enhance scalability and efficiency in vector retrieval systems.

By Beishi

1. Background

With the advent of large models, vector retrieval is facing unprecedented challenges: the dimensions and quantities of embeddings have grown sharply, which poses significant engineering problems. The Intelligent Engine Division is responsible for designing and building Alibaba's search, promotion, and AI-related engineering systems. During actual business iteration and development, we encountered many problems caused by this growth in embedding dimensions and quantities, with index creation time being particularly prominent.

Figure 1: HNSW

Approximate graph algorithms, represented by HNSW [1], have become the mainstream choice in vector recall due to their cost-effectiveness and high recall rate. They play a crucial role across a broad range of applications, especially in the search recall scenarios of platforms such as Taobao, Tmall, Pailitao, and Xianyu. However, a major criticism of approximate graph algorithms is their long index creation time, a drawback that is further magnified in high-dimensional scenarios with massive data.

In a distributed scenario, following the divide-and-conquer approach, the original data is divided into multiple disjoint subsets, and each node is responsible for only one of them. In this way, multiple independent compute (storage) nodes can handle data volumes that a single machine cannot. The same applies to the vector recall scenario: the original embedding dataset is evenly divided into multiple columns based on a hash of the primary key (pk), so a single shard greatly reduces the number of vectors that a single instance needs to process. During querying, recall is first performed in the index of each column, and then the results from all columns are aggregated.
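To make this scatter-gather pattern concrete, here is a minimal sketch of pk-hash sharding and result merging. The shard count, the hash choice, and the per-shard search() call are illustrative assumptions, not the production implementation.

import hashlib
import heapq

NUM_SHARDS = 8  # illustrative column count

def shard_of(pk: str) -> int:
    # Route a document to a column (shard) by hashing its primary key.
    digest = hashlib.md5(pk.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def build_shards(docs):
    # docs: iterable of (pk, embedding); returns one document list per shard.
    shards = [[] for _ in range(NUM_SHARDS)]
    for pk, emb in docs:
        shards[shard_of(pk)].append((pk, emb))
    return shards

def query_all_shards(shard_indexes, query, k):
    # Scatter the query to every column's index, then merge the per-shard
    # top-K results into a global top-K by distance.
    candidates = []
    for index in shard_indexes:
        # index.search(...) stands in for each column's HNSW recall and is
        # assumed to return a list of (distance, pk) pairs.
        candidates.extend(index.search(query, k))
    return heapq.nsmallest(k, candidates)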

But is it true that, cost aside, every issue caused by scale expansion can be resolved simply by adding more shards, no matter how large the data grows?

1.1 Scalability Dilemma of Distributed Scenarios

In our search recommendation scenarios, the business typically requires only eventual consistency, eliminating the high cost of supporting complex transaction processing or maintaining strict consistency. Loose consistency makes horizontal scaling easier. However, factors such as the instability of the distributed environment and network overhead still limit this scalability, not to mention the cost.

In the vector recall scenario, adding more columns just because offline index creation is slow is not cost-effective. Take HNSW as an example: disregarding other indexes and retrieval latency, building a single graph over all the data without sharding is optimal in terms of retrieval computation for top-K vector recall. This raises a question: how can we speed up the construction of approximate graphs? Exploiting the multi-threaded concurrency of modern processors is an obvious direction.

1.2 The Bottleneck of Multi-thread Concurrent Construction

The original HNSW paper [1] mentioned that parallel graph construction can be achieved by simply adding locks at a few key points, and the graph quality is hardly affected. This is because the execution streams are mostly isolated from each other during graph construction, with very few synchronization points. Most mainstream vector libraries also support multi-threaded HNSW construction, and with up to about ten concurrent threads, a near-linear speedup is achievable. However, in our practice, we quickly found that once the number of concurrent threads reached about 30, the speedup hit its limit: even with more concurrency, the construction time could hardly be reduced further. In addition, in the online environment, resource instability in the co-located (mixed-deployment) environment makes it difficult to ensure that all subgraphs are built at high speed.
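For reference, a minimal sketch of a concurrent build with the open-source hnswlib binding is shown below; the dataset and parameter values are illustrative, not our production settings.

import numpy as np
import hnswlib

dim, num_elements = 128, 1_000_000
data = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)

# num_threads controls build concurrency. In our experience the speedup is
# near-linear up to roughly ten threads and flattens out around thirty.
index.add_items(data, np.arange(num_elements), num_threads=16)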

2. Divide-and-conquer Graph Construction

This brings us to the final solution: distributed graph construction. Unlike horizontal sharding, distributed graph construction uses multiple instances to build a single graph, i.e., divide-and-conquer graph construction. Through research, we found that industry and academia already have some relatively mature solutions [2][3][4]. Except for the Pyramid [4] solution (detailed in the appendix), the other solutions are largely similar and can be abstracted into the following three divide-and-conquer steps:

  1. Split the original dataset X into multiple subsets X' according to a certain method; the subsets X' may overlap.
  2. Construct a graph G' for each subset X'.
  3. Merge all the graphs G' into one large graph G. This step is usually a simple edge merge, optionally followed by further fine-grained optimization of the large graph G.

The available solutions for the first step of splitting include k-means, random splitting, and multiple random splitting after principal component analysis (PCA). If the original dataset is too large, there may be an initial random sampling process.

The second step is usually the normal subgraph construction.

In the third step, edge merging is always performed. The merged graph may require further processing, such as simple neighbor propagation or nn-descent [5]. If the sets do not intersect in the first step of splitting, some glue nodes need to be added in the third step of graph merging to ensure the graph's connectivity.
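As a rough illustration of the third step, the following sketch merges subgraphs by taking the union of each point's neighbor lists, with an optional degree cap. The adjacency-dict representation and the prune callback are simplifying assumptions, not the on-disk index format.

def merge_subgraphs(subgraphs, max_degree=None, prune=None):
    # subgraphs: list of {point_id: set(neighbor_ids)} adjacency dicts.
    merged = {}
    for graph in subgraphs:
        for node, neighbors in graph.items():
            # Edge union: a point that appears in several subgraphs keeps
            # the neighbors it gained in each of them.
            merged.setdefault(node, set()).update(neighbors)
    if max_degree is not None:
        for node, neighbors in merged.items():
            if len(neighbors) > max_degree:
                # `prune` stands in for a distance-based rule (e.g. triangular
                # pruning); without one we simply truncate the neighbor list.
                kept = (prune(node, neighbors, max_degree) if prune
                        else list(neighbors)[:max_degree])
                merged[node] = set(kept)
    return merged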

Figure 2: Two Methods of Graph Merging

2.1 Distributed HNSW

Due to the insufficient theoretical foundation of approximate graph algorithms, predicting a solution's effectiveness in advance is often difficult. After extensive offline experiments and considering the engineering implementation, we finally adopted the distributed graph construction approach from DiskANN [2]. The first step is to sample some documents from the original dataset and run k-means clustering on the sample to obtain k centroids. Each document is then assigned to its α nearest centroids, so that each centroid has a corresponding set of documents. An HNSW graph is constructed for each of these sets. Finally, in the reduce process, edge merging is performed directly on points that appear in multiple sets; for points whose degree exceeds the limit after merging, pruning is applied. The points shared among the k sets are what stitch the subgraphs back into a single graph.
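The partitioning step can be sketched as follows; k, α, the sample rate, and the use of scikit-learn's KMeans are illustrative choices rather than the exact production setup.

import numpy as np
from sklearn.cluster import KMeans

def partition(vectors: np.ndarray, k: int = 16, alpha: int = 2,
              sample_rate: float = 0.05, seed: int = 0):
    rng = np.random.default_rng(seed)
    sample_size = max(k, int(len(vectors) * sample_rate))
    sample = vectors[rng.choice(len(vectors), sample_size, replace=False)]
    centroids = KMeans(n_clusters=k, n_init=10,
                       random_state=seed).fit(sample).cluster_centers_

    # Distance from every vector to every centroid (computed in batches in
    # practice), keeping the alpha closest centroids per document.
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :alpha]

    partitions = [[] for _ in range(k)]
    for doc_id, centroid_ids in enumerate(nearest):
        for c in centroid_ids:
            partitions[c].append(doc_id)
    # Each partition is then handed to a child node to build its own HNSW.
    return partitions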

Figure 3: Divide-and-conquer Graph Construction

Figure 4: Divide-and-conquer Graph Construction

The advantages of this solution are its minimal changes and its abandonment of the fine-grained optimization process after graph merging, which simplifies the engineering work. The disadvantage is that, although most of the computation is distributed across child nodes, the total computation volume is quite large: the number of vectors involved in construction expands to α times the original, i.e., an increase of α − 1 times. However, the ef parameter for subgraph construction can be appropriately reduced to offset the increased computation. After experiments on multiple public datasets and several online production datasets, we found that this divide-and-conquer construction method has little impact on graph quality and, in most cases, even brings positive benefits.
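As a back-of-envelope illustration of this trade-off (the numbers below are made up, not measurements):

# With each document assigned to alpha centroids, the subgraphs together
# insert alpha * N vectors instead of N, an increase of (alpha - 1) * N.
N, alpha = 100_000_000, 2
total_insertions = alpha * N            # 200 million insertions across subgraphs
extra_insertions = (alpha - 1) * N      # 100 million more than a single graph
# Insertion cost grows roughly with ef_construction (the candidate-list size),
# so lowering it on the subgraphs, e.g. from 400 to 200, offsets much of the
# extra work while the merge step restores edge diversity.
print(total_insertions, extra_insertions)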

The remaining issue is how to support the above three-step graph construction process in BuildService [6]. BuildService is a distributed index construction system developed by the Intelligent Engine team and is widely used in Alibaba's search and promotion scenarios. Fortunately, with BuildService's powerful distributed DAG execution, HNSW's distributed construction can be expressed simply using BuildService's index customization framework.

The distributed graph construction has been launched in multiple scenarios, such as Pailitao and Xianyu. Both the offline full-build time and its stability have improved greatly: the average full-build time has been reduced from the original 7-10 hours to about 3 hours.

3. Other Related Work

3.1 Filtered HNSW

Filtered or category (label) recall is also an important application scenario for vector recall. Since embeddings themselves can hardly fully represent complete category information, combining vector recall with ordinary category filtering for some specific scenarios is necessary.

The trivial solution to this problem (post-filtering during the query) handles the filtering conditions only at the query stage and pays no attention to the index creation process, so it does not exploit the structure of the problem. In the worst case, it may need to traverse the whole graph to gather enough vectors to recall.
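A minimal sketch of this post-filtering approach is shown below, assuming an hnswlib-style index and a user-supplied categories_of() lookup; the widening policy is illustrative.

def post_filter_search(index, query, k, category, categories_of, max_ef=10_000):
    # Widen the candidate set until k results carry the requested category.
    ef = 4 * k
    hits = []
    while ef <= max_ef:
        index.set_ef(ef)
        labels, dists = index.knn_query(query, k=ef)  # hnswlib-style call
        hits = [(d, pk) for d, pk in zip(dists[0], labels[0])
                if category in categories_of(pk)]
        if len(hits) >= k:
            return hits[:k]
        ef *= 2  # for rare categories this approaches scanning the whole graph
    return hits[:k]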

Another solution is to build a separate vector index for each discrete category. However, the storage cost of this solution becomes unacceptable when the number of categories is too large, or each vector is associated with multiple categories.

A follow-up DiskANN paper [7] proposed two vector index creation solutions for discrete category (label) scenarios with category metadata: FilteredVamana and StitchedVamana. One is a streaming algorithm and the other is a batch construction algorithm; the core of both is to take the category labels into account, alongside the vector distance measure, during index creation.

We implemented the Stitched HNSW index by referring to the batch construction method of StitchedVamana and utilizing the distributed HNSW mentioned above. A subgraph is constructed for each category. Since each point can have multiple labels, a point may belong to multiple subgraphs. Finally, graph merging is performed using the overlap of points between subgraphs.
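A sketch of this per-label build and merge is shown below; build_hnsw_subgraph, labels_of, and merge_subgraphs (in the shape of the earlier merging sketch) are stand-ins for the actual components.

from collections import defaultdict

def build_stitched_graph(point_ids, labels_of, build_hnsw_subgraph, merge_subgraphs):
    # Group point ids by label; a multi-label point joins every matching group.
    groups = defaultdict(list)
    for pid in point_ids:
        for label in labels_of(pid):
            groups[label].append(pid)

    # One HNSW subgraph per label; this is where the distributed build runs.
    subgraphs = [build_hnsw_subgraph(pids) for pids in groups.values()]

    # Points shared between labels stitch the subgraphs together during the
    # edge-union merge.
    return merge_subgraphs(subgraphs)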

During the merging process, triangular pruning is performed if the degree of a node exceeds the limit. As described in the paper, the pruning used here is the category-aware triangular pruning, FilteredRobustPrune. The final merged graph retains a navigation point for each category, so that during querying each category starts from its own navigation point and points with the same category lie within the same connected component. The query process only visits points carrying the requested categories, which solves the problem of extra computation in post-filtering. In addition, since the final result is still a single merged graph, it avoids the index expansion caused by building a separate index for each category. After we launched this feature in the Pailitao business, vector recall performance for queries with categories improved significantly.
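The sketch below reflects our reading of the label-aware pruning rule in [7]: a kept neighbor may only prune a remaining candidate if it covers all labels shared by the node and that candidate, so every label keeps a reachable route. dist() and labels() are assumed helpers, and the details may differ from the paper.

def filtered_robust_prune(p, candidates, max_degree, alpha, dist, labels):
    # Prune the over-degree neighbor list of p down to max_degree neighbors.
    candidates = sorted(candidates, key=lambda c: dist(p, c))
    kept = []
    while candidates and len(kept) < max_degree:
        p_star = candidates.pop(0)      # closest remaining candidate is kept
        kept.append(p_star)
        survivors = []
        for p_prime in candidates:
            shared = labels(p) & labels(p_prime)
            # p_star may prune p_prime only if it carries all labels shared by
            # p and p_prime and the usual alpha triangle condition holds.
            if shared <= labels(p_star) and alpha * dist(p_star, p_prime) <= dist(p, p_prime):
                continue
            survivors.append(p_prime)
        candidates = survivors
    return kept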

4. Future Work

Although various approximate graph algorithms represented by HNSW have been widely applied across multiple business scenarios in recent years, their problems have gradually been exposed, such as an insufficient theoretical foundation [8], a lack of error bounds [9], low construction efficiency, and difficulty in parallelizing queries [10]. We will continue working to provide more efficient and accurate solutions for vector retrieval problems in complex data application scenarios.

5. Appendix

The Pyramid [4] solution is unique. Strictly speaking, it is a multi-level index architecture, and its final output is not a single graph. After a simple experiment, we found that its overall query performance was not ideal, so we set this solution aside early on.

Figure 5: The Architecture of Pyramid

References

[1] Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
https://arxiv.org/pdf/1603.09320
[2] DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
https://suhasjs.github.io/files/diskann_neurips19.pdf
[3] Scalable k-NN graph construction for visual descriptors
https://pages.ucsd.edu/~ztu/publication/cvpr12_knnG.pdf
[4] Pyramid: A General Framework for Distributed Similarity Search
https://arxiv.org/pdf/1906.10602
[5] Efficient k-nearest neighbor graph construction for generic similarity measures
https://dl.acm.org/doi/abs/10.1145/1963405.1963487
[6] The Index Creation Service of Havenask, a Large-scale Distributed Retrieval System Widely Used in Alibaba
https://mp.weixin.qq.com/s/uRdk5voz2mmSge1babLC3A
[7] Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters
https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf
[8] Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations
https://arxiv.org/pdf/2310.19126
[9] Graph based Nearest Neighbor Search: Promises and Failures
https://export.arxiv.org/pdf/1904.02077
[10] Speed-ANN: Low-Latency and High-Accuracy Nearest Neighbor Search via Intra-Query Parallelism
https://arxiv.org/pdf/2201.13007


Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
