向量計算核心各模組使用與配置-雲原生MaxCompute MaxCompute-阿里雲

本文為您介紹各個模組的使用流程及參數配置，包括IndexBuilder、IndexConverter、IndexMeasure及IndexSearcher模組。

IndexBuilder

IndexBuilder為索引構建模組，基本調用流程如下： Index Builder

初始化Builder。
資料訓練。
構建索引。
dump索引。
清理資源。

Proxima內建了多種Builder外掛程式，如：ClusteringBuilder、LinearBuilder、HnswBuilder和SsgBuilder等。

IndexConverter

IndexConverter是對特徵向量進行轉換的模組，例如對特徵進行降維，Half FLOAT轉換，INT8量化等。可獨立使用，也可作為檢索流程中的一部分。

IndexConverter在檢索流程中使用時，一般與IndexReformer結合使用，類似IndexBuilder與IndexSearcher的關係。Index Converter作為Index Builder流程前置環節，特徵向量經過IndexConverter後，再進行索引的構建。線上檢索時，所有Query向量先經過IndexReformer轉換，再交由IndexSearcher進行檢索。

IndexMeasure

IndexMeasure為相似性（距離）計算模組，用於計算適配，遵從距離越小，相似性越近的原則。相關的IndexMeasure外掛程式名及參數詳情請參見IndexMeasure參數配置。

距離計算公式

數值距離

距離參數	計算公式
Squared Euclidean	`$$\sum_{i=0}^n (u_i - v_i)^2$$`
Euclidean	`$$\sqrt{\sum_{i=0}^n (u_i - v_i)^2}$$`
Normalized Euclidean	`$$\sqrt{\frac{1}{2}\frac{\sum_{i=0}^n [(u_i-\bar{u}) - (v_i-\bar{v})]^2}{\sum_{i=0}^n [(u_i-\bar{u})^2 + (v_i-\bar{v})^2]}}$$`
Normalized Squared Euclidean	`$$\frac{1}{2}\frac{\sum_{i=0}^n [(u_i-\bar{u}) - (v_i-\bar{v})]^2}{\sum_{i=0}^n [(u_i-\bar{u})^2 + (v_i-\bar{v})^2]}$$`
Manhattan	`$$\sum_{i=0}^n \|u_i - v_i\|$$`
Chebyshev (Chessboard)	`$$\max_{i=0} \|u_i - v_i\|$$`
Cosine	`$$1.0 - \frac{\sum_{i=0}^n u_iv_i}{\sqrt{\sum_{i=0}^n u_i^2}\sqrt{\sum_{i=0}^n v_i^2}}$$`
Minus Inner Product	`$$-\sum_{i=0}^n u_iv_i$$`
Canberra	`$$\sum_{i=0}^n\frac{\|u_i-v_i\|}{\|u_i\|+\|v_i\|}$$`
Bray Curtis	`$$\frac{\sum_{i=0}^n\|u_i-v_i\|}{\sum_{i=0}^n\|u_i+v_i\|}$$`
Correlation	`$$1.0 - \frac{\sum_{i=0}^n(u_i-\bar{u})(v_i-\bar{v})}{\sqrt{\sum_{i=0}^n(u_i-\bar{u})^2} \sqrt{\sum_{i=0}^n(v_i-\bar{v})^2}}$$`
Binary	`$$[!u == v]$$`

二值距離

距離參數	計算公式
Hamming	`$$M_{10}+M_{01}$$`
Jaccard	`$$\frac{M_{10}+M_{01}}{M_{11}+M_{10}+M_{01}}$$`
Matching	`$$\frac{M_{10}+M_{01}}{M_{11}+M_{10}+M_{01}+M_{00}}=\frac{M_{10}+M_{01}}{N}$$`
Dice	`$$\frac{M_{10}+M_{01}}{2M_{11}+M_{10}+M_{01}}$$`
Rogers Tanimoto	`$$\frac{2(M_{10}+M_{01})}{M_{11}+2(M_{10}+M_{01})+M_{00}}$$`
Russell Rao	`$$\frac{M_{10}+M_{01}+M_{00}}{N}$$`
Sokal Michener	`$$\frac{M_{10}+M_{01}}{M_{11}+M_{10}+M_{01}+M_{00}}=\frac{M_{10}+M_{01}}{N}$$`
Sokal Sneath I	`$$1.0 - \frac{M_{11}}{M_{11} + 2(M_{10}+M_{01})}=\frac{2(M_{10}+M_{01})}{M_{11}+2(M_{10}+M_{01})}$$`
Sokal Sneath II	`$$1.0 - \frac{2(M_{11} + M_{00})}{2(M_{11} + M_{00}) + M_{10} + M_{01}} = \frac{M_{10} + M_{01}}{2N - (M_{10} + M_{01})}$$`
Sokal Sneath III	`$$1.0 - \frac{M_{11} + M_{00}}{M_{10} + M_{01}} = \frac{2(M_{10} + M_{01}) - N}{M_{10} + M_{01}}$$`
Sokal Sneath IV	`$$1.0 - \frac{1}{4}(\frac{M_{11}}{M_{11} + M_{10}} + \frac{M_{11}}{M_{11} + M_{01}} + \frac{M_{00}}{M_{10} + M_{00}} + \frac{M_{00}}{M_{01} + M_{00}})$$`
Sokal Sneath V	`$$1.0 - \frac{M_{11}M_{00}}{\sqrt{(M_{11} + M_{10}) (M_{11} + M_{01}) (M_{10} + M_{00}) (M_{01} + M_{00})}}$$`
Kulczynski I	`$$1.0-\frac{S_{AB}}{S_A+S_B-2S_{AB}} = 1.0-\frac{M_{11}}{M_{10}+M_{01}} = \frac{M_{10}+M_{01}-M_{11}}{M_{10}+M_{01}}$$`
Kulczynski II	`$$1.0-\frac{1}{2}\left(\frac{S_{AB}}{S_{A}}+\frac{S_{AB}}{S_{B}}\right)$$`
Yule	`$$\frac{2M_{10}M_{01}}{M_{11}M_{00}+M_{10}M_{01}}$$`

IndexSearcher

IndexSearcher是進行KNN檢索的主要模組，以唯讀方式載入離線構建好的索引並進行線上檢索。

IndexSearcher的調用流程如下：

初始化Searcher。
載入索引資料。
建立檢索上下文 - 檢索。
執行查詢。
卸載索引資料。
清理資源。

IndexSearcher支援並發檢索，但因為使用方的情境和環境差異較大，所以需要將並發的控制權交給引擎使用者。為此，Proxima CE引進了檢索內容相關的概念，即Searcher Context，其儲存了檢索結果以及檢索過程中的中間資料。每一個上下文（Context）對象可複用，但僅允許串列進入，不允許多個線程同時進入。實現並發檢索時，需要建立多個上下文（Context）對象。目前Proxima內建了多種Searcher外掛程式，如：ClusteringSearcher、LinearSearcher、HnswSearcher和SsgSearcher等。

IndexBuilder參數配置

ClusteringBuilder

重要

proxima.hc.builder.max_document_count與proxima.hc.builder.centroid_count至少設定其中一項。

參數名	類型	預設值	說明
proxima.hc.builder.max_document_count	UNIT32	無	當proxima.hc.builder.centroid_count未設定時，使用proxima.hc.builder.max_document_count計算中心點數量。
proxima.hc.builder.centroid_count	STRING	無	聚類中心點參數，支援層次聚類。層之間用``分隔。該參數不指定會根據proxima.hc.builder.max_document_count自動推導。一層聚類樣本：1000 兩層樣本：100100 如果使用兩層中心點，一般第一層中心點數量比第二層多，效果更好。第一層經驗值是第二層10倍。
proxima.hc.builder.thread_count	UNIT32	0	構建時開啟線程數量，設定為0時表示CPU核心數。

HnswBuilder

參數名	類型	預設值	說明
proxima.hnsw.builder.thread_count	UNIT32	0	構建時開啟線程數量，設定為0時表示CPU核心數。
proxima.hnsw.builder.efconstruction	UNIT32	500	用於控製圖的構建精度。該值越大，構建的圖越精確，但構建更耗時。
proxima.hnsw.builder.max_neighbor_count	UNIT32	100	圖的鄰居數。值越大，圖越精確，但計算和儲存開銷越大，一般不超過特徵維度。最大為65535個。

SsgBuilder

參數名	類型	預設值	說明
proxima.ssg.builder.thread_count	UNIT32	0	控制構建線程數。
proxima.ssg.builder.efconstruction	UNIT32	500	用於控製圖的構建精度。該值越大，構建的圖越精確，但構建更耗時。
proxima.ssg.builder.max_neighbor_count	UNIT32	100	圖的鄰居數。值越大，圖越精確，但計算和儲存開銷越大，一般不超過特徵維度。最大為65535。
proxima.ssg.builder.centroid_count	UNIT32	0	訓練樣本產出的聚類中心點數量。中心點越多，構建成本越高，圖精度也會越高。一般建議：文檔數<200W，中心點=2000 文檔數200W~1KW，中心點=5000 文檔數>1KW，中心點=8000
proxima.ssg.builder.scan_ratio	FLOAT	0.01	聚類掃描數，預設是1%。該值控製圖的精度，調整越高，圖精度越高，但構圖成本線性增加。一般建議根據文檔數具體配置：文檔數<200W，ratio = `1W/doc_count` 文檔數`200W~1KW`， ratio = `2W/doc_count` 文檔數>1KW，ratio = `5W/doc_count`

GcBuilder

重要

proxima.gc.builder.centroid_count參數必須指定。

參數名

類型

預設值

說明

proxima.gc.builder.thread_count

UNITt32

構建時開啟線程數量，設定為0時為cpu核心數。

proxima.gc.builder.centroid_count

STRING

無

聚類中心點參數，支援層次聚類。層之間用*分隔。

一層聚類樣本：1000
兩層樣本：100*100

如果使用兩層中心點，一般第一層中心點數量比第二層多，效果更好。第一層經驗值是第二層10倍。

LinearBuilder
參數名
類型
預設值
說明
proxima.linear.builder.column_major_order
STRING
false
構建的時候特徵用行排（false）/列排（true）。

QcBuilder

說明

proxima.qc.builder.centroid_count 參數必須指定。

參數名	類型	預設值	說明
proxima.qc.builder.thread_count	UNIT32	0	構建時開啟線程數量，設定為0時為cpu核心數。
proxima.qc.builder.centroid_count	STRING	無	聚類中心點參數，支援層次聚類。層之間用``分隔。一層聚類樣本：1000 兩層樣本：100100 如果使用兩層中心點，一般第一次中心點數量比第二層多，效果更好。經驗值是第一層是第二層10倍。
proxima.qc.builder.quantizer_class	STRING	無	配置量化器，預設不使用量化器。可選值有 Int8QuantizerConverter、HalfFloatConverter、DoubleBitConverter。一般配置量化器可提升效能，減少索引大小，召回視情況有所損失。
proxima.qc.builder.quantizer_params	IndexParams	無	配置上面量化器相關參數。

IndexSearcher參數配置

ClusteringSearcher

參數名	類型	預設值	說明
proxima.hc.searcher.max_scan_count	UNIT32	無	線上尋找文檔截斷數，用於控制考察範圍。值越大一般召回率越多，但最多不會超過proxima.hc.searcher.scan_count_in level中指定的中心點下doc數量。
proxima.hc.searcher.scan_ratio	FLOAT	0.01	用於計算max_scan_count數量，`總doc數量 * scan_ratio`。

HnswSearcher

參數名	類型	預設值	說明
proxima.hnsw.searcher.ef	UNIT32	500	線上尋找文檔截斷數，用於控制考察範圍。值越大一般召回率越多，但最多不會超過proxima.hc.searcher.scan_count_in level中指定的中心點下doc數量。
proxima.hnsw.searcher.max_scan_ratio	FLOAT	0.1f	用於計算max_scan_count數量，`總doc數量 * scan_ratio`。
proxima.hnsw.searcher.brute_force_threshold	INT	1000	當總doc數量小於此值時，走線性檢索。

SsgSearcher

參數名	類型	預設值	說明
proxima.ssg.searcher.ef	UNIT32	500	用於檢索時，考察精度。該值越大，掃描doc數越多，召回率越高。
proxima.ssg.searcher.max_scan_ratio	UNIT32	0	最大掃描文檔比例，兜底截斷策略，預設0代表不開啟。

GcSearcher
參數名
類型
預設值
說明
proxima.gc.searcher.scan_ratio
FLOAT
0.01
用於計算max_scan_count數量，總doc數量 * scan_ratio。

LinearSearcher

參數名	類型	預設值	說明
proxima.linear.searcher.read_block_size	UNIT32	1024*1024	search階段，一次性讀取到記憶體的大小（1 MB左右）。值越小對QPS影響較大，越大會招用較多記憶體，推薦值（1024*1024）。

QcSearcher

參數名	類型	預設值	說明
proxima.qc.searcher.scan_ratio	FLOAT	0.01	用於計算max_scan_count數量，`總doc數量 * scan_ratio`。
proxima.qc.searcher.brute_force_threshold	INT	1000	如果總doc數小於此值，則走線性檢索。

IndexConverter參數配置

MipsConverter

參數名	類型	預設值	說明
proxima.mips.converter.m_value	UINT32	無	M值，即增加的維數，一般不超過4。
proxima.mips.converter.u_value	FLOAT	0.38196601	U值，取值範圍：0~1.0。
proxima.mips.converter.forced_half_float	BOOLEAN	false	強制將FP32的資料轉換為FP16。
proxima.mips.converter.spherical_injection	BOOLEAN	false	是否採用spherical injection的變換方式，只增加1維。

HalfFloatConverter
無需進行參數配置。
DoubleBitConverter
參數名
類型
預設值
說明
proxima.double_bit.converter.train_sample_count
INT
0
進行訓練的資料量，如果為0，則使用holder全量資料。
Int8QuantizerConverter
無需進行參數配置。
Int4QuantizerConverter
無需進行參數配置。
NormalizeConverter
參數名
類型
預設值
說明
proxima.normalize.reformer.forced_half_float
BOOLEAN
false
強制將FP32的資料轉換為FP16。
proxima.normalize.reformer.p_value
UNIT32
2
使用P-norm中的P值。

IndexReformer參數配置

MipsReformer

參數名	類型	預設值	說明
proxima.mips.reformer.m_value	UINT32	4	M值，即增加的維數，一般不超過 4。
proxima.mips.reformer.u_value	FLOAT	0.38196601	U值，取值範圍：大於 0，小於 1.0。
proxima.mips.reformer.l2_norm	FLOAT	0.0	訓練得到的L2 NORM值。
proxima.mips.reformer.normalize	BOOLEAN	false	是否歸一化。
proxima.mips.reformer.forced_half_float	BOOLEAN	false	強制將FP32的資料轉換為FP16。

HalfFloatReformer
無需進行參數配置。

IndexMeasure參數配置

Canberra
無需進行參數配置。
Chebyshev
無需進行參數配置。
SquaredEuclidean
無需進行參數配置。
Euclidean
無需進行參數配置。
GeographicalDistance
無需進行參數配置。
Hamming
無需進行參數配置。
InnerProduct
無需進行參數配置。
Manhattan
無需進行參數配置。
Matching
無需進行參數配置。
MipsSquaredEuclidean
參數名
類型
預設值
說明
proxima.mips_euclidean.measure.injection_type
INT
0
對內積特徵變換注入方式：
0 LocalizedSpherical
1 Spherical
2 RepeatedQuadratic
3 Identity
RogersTanimoto
無需進行參數配置。
RussellRao
無需進行參數配置。