MaxCompute: Converters

Last Updated: Apr 15, 2025

Proxima 2.X supports converters that transform vector data for index building, such as data quantization and data normalization. This topic describes how to use a converter, using INT8 data quantization as an example.

Prerequisites

Converter-related parameters

  • -converter: specifies the name of the converter for index building.

  • -converter_params: specifies the parameters of the converter as a single-line JSON string. Do not escape the double quotation marks (") in the string, and do not include spaces anywhere in it. For example, you can specify {"proxima.normalize.reformer.forced_half_float":false} for NormalizeConverter. For more information, see IndexConverter parameter configuration.

    Note

    For a complete list of converters, see Index Converter.
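
Because the value of -converter_params must be a single-line JSON string without spaces, it can be safer to generate it than to type it by hand. The following is a minimal Python sketch, not part of Proxima CE: the helper name build_converter_params is hypothetical, and the only parameter key used is the NormalizeConverter example given above.

```python
import json


def build_converter_params(params: dict) -> str:
    """Serialize converter parameters as compact, single-line JSON."""
    # separators=(",", ":") drops the spaces that json.dumps adds by default.
    text = json.dumps(params, separators=(",", ":"))
    # Reject keys or values that themselves contain whitespace.
    if any(ch.isspace() for ch in text):
        raise ValueError("converter parameters must not contain whitespace")
    return text


params = {"proxima.normalize.reformer.forced_half_float": False}
print(build_converter_params(params))
# prints {"proxima.normalize.reformer.forced_half_float":false}
```

The printed string can be pasted as the value of -converter_params.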

Command examples

Note

For details about the parameter configuration used in the following example code, see Reference: Proxima CE parameters.

--@resource_reference{"proxima-ce-aliyun-1.0.2.jar"}  -- Reference the uploaded proxima-ce JAR package. In the left navigation pane, choose Business Flow-MaxCompute-Resources. Right-click the uploaded JAR package and select Reference Resource to generate this comment line
jar -resources proxima-ce-aliyun-1.0.2.jar  -- The uploaded proxima-ce JAR package
-classpath proxima-ce-aliyun-1.0.2.jar com.alibaba.proxima2.ce.ProximaCERunner  -- The classpath specifies the main function entry class
-doc_table doc_table_xx  -- The input doc table
-doc_table_partition 20221111  -- The name of the partition in the doc table
-query_table query_table_xx  -- The input query table
-query_table_partition 20221111  -- The name of the partition in the query table
-output_table output_table_xx  -- The output table
-output_table_partition 20221111  -- The name of the partition in the output table
-data_type float  -- The vector data type
-dimension 8  -- The vector dimension
-external_volume_name xxx_volume_name  -- The external volume that is backed by an Object Storage Service (OSS) bucket. The underlying OSS directory must already exist. Otherwise, the search task fails to run
-owner_id 123456  -- The ID of the user
-converter Int8QuantizerConverter -- The converter
-converter_params "" -- Specifies the parameters of the converter. This parameter is optional. Each parameter is a single-line JSON string. The double quotation marks (") of the parameter do not need to be escaped. Spaces are not allowed in the configuration of each parameter.
; -- Do not forget the semicolon, which indicates the end of the ODPS SQL statement

Performance description

In most cases, vector quantization reduces the precision of the vector data, so the recall rate after quantization typically decreases by about 1% to 2%. For example, in a test where the doc table and the query table each contain 20 million records of 512-dimensional FLOAT vectors, the recall rate decreases from 99.0% before quantization to 98.2% after quantization. In exchange, vector quantization improves search performance: in the same test, search performance increases by approximately 10%. The test data is for reference only.
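
Quantization affects recall because each FLOAT component is mapped to one of a small number of integer levels, so nearby values can become indistinguishable. The following Python sketch illustrates this with generic symmetric linear quantization to INT8. It is an assumption for illustration only and is not necessarily the algorithm that Int8QuantizerConverter uses; the function names and the sample vector are hypothetical.

```python
def quantize_int8(vector):
    """Symmetric linear quantization of a float vector to INT8 codes."""
    max_abs = max(abs(x) for x in vector)
    # One scale per vector; fall back to 1.0 for an all-zero vector.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the range [-127, 127].
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


# An 8-dimensional sample vector (hypothetical data).
vector = [0.12, -0.5, 0.33, 0.9, -0.07, 0.0, 0.61, -0.88]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print("codes:", codes)
print("largest per-component error:", max_error)
```

In this scheme the per-component error is bounded by half of the scale, which is why the recall loss is usually small while each component shrinks from 4 bytes to 1 byte.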

Note

The recall rate is a common metric used in vector search to measure the accuracy of query results. For a given query, the recall rate is the degree of overlap between the doc data that the vector search algorithm retrieves and the doc data that a brute-force (exhaustive) search retrieves. A higher recall rate indicates a more accurate vector search algorithm.
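
As a rough illustration of this definition, the recall rate for one query can be computed as the fraction of the exhaustive-search results that the approximate search also returns. The following Python sketch assumes that results are lists of doc IDs; the function name recall_rate and the sample IDs are hypothetical.

```python
def recall_rate(approx_ids, exact_ids):
    """Fraction of the exact (exhaustive-search) results found by the approximate search."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact.intersection(approx_ids)) / len(exact)


# Top-5 results for a single query (hypothetical doc IDs).
exact_ids = [3, 7, 11, 19, 42]   # from exhaustive search
approx_ids = [3, 7, 11, 42, 58]  # from the vector search algorithm

print(recall_rate(approx_ids, exact_ids))  # prints 0.8
```

Averaging this value over all queries gives a recall rate for the whole query table.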