Proxima 2.X supports converters for data analysis, such as data quantization and data normalization. This topic describes how to use a converter to quantize data, using INT8 data quantization as an example.
Prerequisites
You have installed the Proxima CE package and prepared the input table. For more information, see Install the Proxima CE package.
You have imported data to the input table. For more information, see Import data to the doc and query tables.
Converter-related parameters
-converter: specifies the name of the converter for index building.
-converter_params: specifies the parameters of the converter. Each parameter is a single-line JSON string. The double quotation marks (") in the string do not need to be escaped, and the string must not contain spaces. For example, you can specify {"proxima.normalize.reformer.forced_half_float":false} for NormalizeConverter. For more information, see IndexConverter parameter configuration.
Note: For a complete list of converters, see Index Converter.
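The formatting rules for -converter_params (single line, no spaces, unescaped double quotation marks) can be met by generating the string programmatically. The following Python snippet is a minimal sketch, not part of Proxima CE. The helper name build_converter_params is hypothetical, and the parameter key is the NormalizeConverter example given above.

```python
import json


def build_converter_params(params: dict) -> str:
    """Serialize converter parameters as a single-line JSON string without spaces.

    Hypothetical helper for illustration; it is not part of Proxima CE.
    """
    # separators=(",", ":") drops the spaces that json.dumps adds by default.
    text = json.dumps(params, separators=(",", ":"))
    # Reject values that would still introduce whitespace (for example, inside a string).
    if any(ch.isspace() for ch in text):
        raise ValueError("converter parameters must not contain whitespace")
    return text


# Parameter key taken from the NormalizeConverter example in this topic.
params = build_converter_params(
    {"proxima.normalize.reformer.forced_half_float": False}
)
print(params)  # {"proxima.normalize.reformer.forced_half_float":false}
```

The printed string can be pasted as the value of -converter_params.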
Command examples
For details about the parameter configuration used in the following example code, see Reference: Proxima CE parameters.
--@resource_reference{"proxima-ce-aliyun-1.0.2.jar"} -- Reference the uploaded proxima-ce JAR package. In the left navigation pane, choose Business Flow-MaxCompute-Resources. Right-click the uploaded JAR package and select Reference Resource to generate this comment line
jar -resources proxima-ce-aliyun-1.0.2.jar -- The uploaded proxima-ce JAR package
-classpath proxima-ce-aliyun-1.0.2.jar com.alibaba.proxima2.ce.ProximaCERunner -- The classpath specifies the main function entry class
-doc_table doc_table_xx -- The input doc table
-doc_table_partition 20221111 -- The name of the partition in the doc table
-query_table query_table_xx -- The input query table
-query_table_partition 20221111 -- The name of the partition in the query table
-output_table output_table_xx -- The output table
-output_table_partition 20221111 -- The name of the partition in the output table
-data_type float -- The vector data type
-dimension 8 -- The vector dimension
-external_volume_name xxx_volume_name -- The created external volume, which is stored in an Object Storage Service (OSS) bucket. The underlying OSS directory must already exist. Otherwise, the search task fails to run
-owner_id 123456 -- The ID of the user
-converter Int8QuantizerConverter -- The converter
-converter_params "" -- Specifies the parameters of the converter. This parameter is optional. Each parameter is a single-line JSON string. The double quotation marks (") of the parameter do not need to be escaped. Spaces are not allowed in the configuration of each parameter.
; -- Do not forget the semicolon, which indicates the end of the ODPS SQL statement
Performance description
In most cases, vector quantization causes a loss of precision, and the recall rate after quantization decreases by 1% to 2%. For example, in a test in which the doc table and the query table each contain 20 million records of the FLOAT data type with 512 dimensions, the recall rate decreases from 99.0% before quantization to 98.2% after quantization. In return, vector quantization improves search performance. In the preceding test, search performance improves by approximately 10%. The test data is for reference only.
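To illustrate why quantization loses precision, the following Python snippet is a minimal sketch of one common scheme: symmetric linear INT8 quantization with a single scale per vector. This scheme is an assumption chosen for illustration. This topic does not describe the algorithm that Int8QuantizerConverter uses, and the function names are hypothetical.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] by using one scale per vector.

    Illustrative scheme only; not necessarily what Int8QuantizerConverter does.
    """
    max_abs = max(abs(x) for x in vector)
    if max_abs == 0.0:
        # An all-zero vector needs no scaling.
        return [0] * len(vector), 1.0
    scale = max_abs / 127.0
    codes = [round(x / scale) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the INT8 codes."""
    return [c * scale for c in codes]


vector = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)

# Each restored value differs from the original by at most about scale / 2.
errors = [abs(a - b) for a, b in zip(vector, restored)]
print(codes, max(errors))
```

Each component is stored in 1 byte instead of the 4 bytes of a FLOAT value, at the cost of a rounding error of up to about half the scale per component. Rounding errors of this kind are the reason that recall can drop slightly after quantization.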
The recall rate is a common metric used in vector search to measure the accuracy of query results. For a vector search algorithm, the recall rate refers to the degree of overlap between the doc data that the algorithm retrieves for a query and the doc data that a brute-force (exhaustive) search retrieves for the same query. A higher recall rate indicates a more accurate vector search algorithm.
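The recall rate described above can be computed as follows. The snippet is a minimal sketch that assumes each result is a list of doc IDs and that the brute-force result is the ground truth. The function names and sample IDs are hypothetical and are not Proxima CE output.

```python
def recall_at_k(approx_ids, exact_ids):
    """Return the fraction of brute-force results that the approximate search also returns."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    exact = set(exact_ids)
    hits = len(exact & set(approx_ids))
    return hits / len(exact)


def mean_recall(approx_results, exact_results):
    """Average the per-query recall over all queries."""
    if len(approx_results) != len(exact_results):
        raise ValueError("both result lists must cover the same queries")
    scores = [recall_at_k(a, e) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)


# Hypothetical doc IDs for two queries.
exact_results = [[1, 2, 3, 4, 5], [10, 11]]
approx_results = [[1, 2, 3, 4, 9], [10, 11]]

# The first query recovers 4 of 5 true neighbors, the second recovers 2 of 2.
print(mean_recall(approx_results, exact_results))
```

Running the same calculation on results obtained before and after quantization gives a direct measure of the recall that quantization costs on your own data.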