OpenSearch: Vector indexes

Last Updated: Feb 28, 2024

Overview

Vector-based retrieval represents commodity data and content data as vectors and builds a vector index library from them. You can query the vector index library with one or more user vectors or commodity vectors to retrieve a top-k list of commodities or content ranked by vector distance.

Sample code for configuring a vector index

Configure a vector index without categories

{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    }
  ]
}
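
Before you submit a configuration like the one above, a few of the constraints stated in the Parameter description section can be checked mechanically. The following Python code is a minimal sketch, not an official tool: the function name check_vector_index and the set of checks are assumptions for illustration, and the service itself may enforce additional rules.

```python
import json

# Builder/searcher pairs named in the parameter description.
MATCHING_SEARCHER = {"QcBuilder": "QcSearcher", "LinearBuilder": "LinearSearcher"}


def check_vector_index(index):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if index.get("index_type") != "CUSTOMIZED":
        problems.append("index_type must be CUSTOMIZED")
    if len(index.get("index_fields", [])) < 2:
        problems.append("index_fields needs a primary key field and a vector field")

    params = index.get("parameters", {})
    # Parameter values are strings in the sample, so convert before checking.
    dimension = int(params.get("dimension", "0"))
    if dimension <= 0:
        problems.append("dimension must be a positive integer")
    elif params.get("major_order", "row") == "col":
        # Column store requires dimension == 2**n with n >= 1.
        if dimension < 2 or dimension & (dimension - 1) != 0:
            problems.append("major_order=col requires a power-of-2 dimension")

    expected = MATCHING_SEARCHER.get(params.get("builder_name"))
    if expected is not None and params.get("searcher_name") != expected:
        problems.append("searcher_name should be " + expected)

    # The two *_index_params values are JSON documents stored as strings.
    for key in ("build_index_params", "search_index_params"):
        if key in params:
            try:
                json.loads(params[key])
            except json.JSONDecodeError:
                problems.append(key + " is not valid JSON")
    return problems


# Abbreviated copy of the "embedding" index from the sample above.
sample_index = {
    "index_name": "embedding",
    "index_type": "CUSTOMIZED",
    "index_fields": [
        {"boost": 1, "field_name": "id"},
        {"boost": 1, "field_name": "vector_field"},
    ],
    "indexer": "aitheta2_indexer",
    "parameters": {
        "major_order": "col",
        "builder_name": "QcBuilder",
        "searcher_name": "QcSearcher",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
    },
}
print(check_vector_index(sample_index))  # prints []
```

The function returns a list of messages instead of raising an exception so that all problems in a configuration are reported in one pass.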

Configure a vector index with categories

{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field",
      "category_id"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "field_name": "category_id",
          "boost": 1
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field",
    "category_id"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    },
    {
      "field_name": "category_id",
      "field_type": "INTEGER"
    }
  ]
}
Important
  • Categories allow you to search for vectors within a specific category. For example, an image can belong to several categories. If you build a vector index without categories and only filter the retrieved vectors by category afterward, no results may be returned, because all of the retrieved top-k vectors may belong to other categories.

  • If you configure a vector index as an administrator, the escape characters in the values of the build_index_params and search_index_params parameters must be removed.
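
The escape characters in the samples exist because the values of build_index_params and search_index_params are JSON documents stored inside JSON strings. The following Python code is a minimal sketch of how the escaped form is produced and how the unescaped JSON text is recovered. It uses only a subset of the sample values, and it does not show the exact format that the administrator configuration expects, which is not specified here.

```python
import json

# Builder and searcher settings as plain dicts (a subset of the sample values).
build_params = {
    "proxima.qc.builder.quantizer_class": "Int8QuantizerConverter",
    "proxima.qc.builder.quantize_by_centroid": True,
    "proxima.qc.builder.thread_count": 10,
}
search_params = {"proxima.qc.searcher.scan_ratio": 0.01}

# Compact separators reproduce the spacing used in the samples.
build_str = json.dumps(build_params, separators=(",", ":"))
search_str = json.dumps(search_params, separators=(",", ":"))

parameters = {
    "build_index_params": build_str,
    "search_index_params": search_str,
}

# Serializing the outer object escapes the inner double quotes as \".
escaped = json.dumps(parameters, indent=2)
print(escaped)

# Parsing the outer object and then the inner string recovers the settings.
recovered = json.loads(json.loads(escaped)["search_index_params"])
print(search_str)  # the unescaped JSON text
```

Generating the strings with a JSON library avoids hand-written backslashes, which are easy to get wrong in long values such as build_index_params.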

Parameter description

  • field_name: the fields that are used to build the vector index. The fields must be of the RAW data type. Specify at least two fields: a primary key field of the INTEGER data type (or the hash value of the primary key) and a field that contains the vectors. To build a vector index based on categories, add a category field whose field type is RAW and whose value is of the INTEGER data type. The order of the fields in the index must be the same as the order in the fields parameter. If a category field exists, the order must be the primary key field, the category field, and then the vector field.

  • index_name: the name of the vector index.

  • index_type: the type of the vector index. Set the value of this parameter to CUSTOMIZED.

  • indexer: the plug-in that you want to use to build the vector index. Set the value of this parameter to aitheta2_indexer.

  • parameters: the parameters that are used to configure a builder and a searcher for the vector index.

    • dimension: the number of dimensions.

    • embedding_delimiter: the vector delimiter. By default, the vector delimiter is a comma (,).

    • distance_type: the type of the distance. Valid values:

      • InnerProduct: the inner product.

      • SquaredEuclidean: the squared Euclidean distance. Specify SquaredEuclidean for data that is normalized.

    • major_order: the method that you want to use to store data. Valid values:

      • col: uses column store for the data. If you set the major_order parameter to col, the value of the dimension parameter must be a power of 2, that is, 2 to the power of n, where n is a positive integer. Column store provides better performance than row store.

      • row: uses row store for the data. This is the default value.

    • builder_name: the type of builder that you want to use for the vector index. We recommend that you set the parameter to one of the following values. For information about other parameter values, contact technical support.

      • QcBuilder

      • LinearBuilder: builds indexes in order. We recommend that you set the builder_name parameter to LinearBuilder if the number of documents is less than 10,000.

    • searcher_name: the type of searcher that you want to use for the vector index. The value of the searcher_name parameter must match the value of the builder_name parameter. If you want to use GPU resources, contact technical support.

      • QcSearcher: performs searches by using a CPU. Set the searcher_name parameter to QcSearcher if the builder_name parameter is set to QcBuilder.

      • LinearSearcher: performs exhaustive (linear) searches over all vectors by using a CPU. Set the searcher_name parameter to LinearSearcher if the builder_name parameter is set to LinearBuilder.

    • build_index_params: the parameters that you want to configure for the builder type that you specified for the builder_name parameter. For more information, see Quantized clustering configurations.

    • search_index_params: the parameters that you want to configure for the searcher type that you specified for the searcher_name parameter. For more information, see HNSW configurations.

    • linear_build_threshold: the document count below which LinearBuilder is used. If the number of documents is less than the specified threshold value, the system uses LinearBuilder and LinearSearcher. LinearBuilder reduces memory usage and returns lossless retrieval results, but its performance degrades when the number of documents is large. Default value: 10000.

    • min_scan_doc_cnt: the minimum number of candidate documents to scan. The min_scan_doc_cnt and proxima.qc.searcher.scan_ratio parameters serve a similar purpose: the first specifies the candidates as a count, and the second specifies them as a ratio of the documents. Default value: 10000. If you specify values for both the min_scan_doc_cnt and proxima.qc.searcher.scan_ratio parameters, the one that yields the larger number of candidates is used as the minimum.

      • Do not specify an excessively large value for the min_scan_doc_cnt or proxima.qc.searcher.scan_ratio parameter. An excessively large value degrades system performance and increases latency.

      • In most cases, if you want to retrieve top-k vectors, we recommend that you use max(10000, 100*topk) as the value of the min_scan_doc_cnt parameter and max(10000, 100*topk)/total_doc_cnt as the value of the proxima.qc.searcher.scan_ratio parameter. Treat these values as a starting point and tune them based on your performance requirements, retrieval ratio, and number of documents.

      • These two similar parameters are used to meet the requirements in real-time and multi-category scenarios. If you are a regular user, you can configure only the proxima.qc.searcher.scan_ratio parameter.

    • enable_recall_report: specifies whether to report the retrieval ratio (recall). Default value: false.

    • is_embedding_saved: specifies whether to save original vectors. Default value: false. If you enable INT8 quantization or FP16 quantization and enable real-time retrieval, make sure that you set the is_embedding_saved parameter to true. Otherwise, incremental vectors fail to be built in batches.

    • enable_rt_build: specifies whether to support real-time indexing. Default value: true.

    • ignore_invalid_doc: specifies whether to ignore abnormal vector data. Default value: true.

    • rt_index_params: the parameters for real-time indexing. You can specify this parameter if the enable_rt_build parameter is set to true.

      {
        "proxima.oswg.streamer.segment_size": 2048
      }
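
The sizing recommendation for the min_scan_doc_cnt and proxima.qc.searcher.scan_ratio parameters in the preceding list can be sketched in Python as follows. The function name suggest_scan_params is hypothetical, and capping the ratio at 1.0 is an assumption for illustration. Treat the result as a starting point that still needs tuning against your own latency and retrieval ratio requirements.

```python
def suggest_scan_params(topk, total_doc_cnt):
    """Starting values based on the max(10000, 100*topk) recommendation."""
    if topk <= 0 or total_doc_cnt <= 0:
        raise ValueError("topk and total_doc_cnt must be positive")
    min_scan_doc_cnt = max(10000, 100 * topk)
    # Assumption: a ratio above 1.0 is not meaningful, so cap it at 1.0.
    scan_ratio = min(1.0, min_scan_doc_cnt / total_doc_cnt)
    return {
        # Values under "parameters" are strings in the sample configurations.
        "min_scan_doc_cnt": str(min_scan_doc_cnt),
        # This key belongs inside the search_index_params JSON string.
        "proxima.qc.searcher.scan_ratio": scan_ratio,
    }


params = suggest_scan_params(topk=200, total_doc_cnt=5_000_000)
print(params)
# prints {'min_scan_doc_cnt': '20000', 'proxima.qc.searcher.scan_ratio': 0.004}
```

For a top-200 query over 5,000,000 documents, 100*topk is 20000, which exceeds the floor of 10000, so the count is 20000 and the ratio is 20000/5000000 = 0.004.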