
biz configuration

Last Updated: Mar 08, 2023


In Havenask versions earlier than V3.0, the cluster configuration includes the configurations that are required when you use indexlib to build indexes offline or in real time, to load indexes, and to obtain data online. This configuration method is complex. In Havenask V3.0, the configuration logic is simplified.

  • The configuration that is used when you use indexlib to build full indexes and incremental indexes offline, and to build indexes in real time is moved into the offline cluster configuration file. In most cases, the file is named app_cluster.json.

  • The configuration of the method that is used to load indexlib indexes is included in the table configuration file. In most cases, the file is named cluster_name + "_cluster.json". This file and the configuration file of indexes that are built offline are distinguished by folder path.

  • The configuration that is used when you obtain data online is included in the biz configuration. The following sample code shows the structure of the biz configuration.

biz configuration structure

{

"cluster_config": {...},
"aggregate_sampler_config" : {...},
"rankprofile_config" : {...},
"summary_profile_config" : {...},
"function_config" : {...},
"searcher_cache_config" : {...},
"service_degradation_config" : {...}

}

cluster_config

"cluster_config" : {

    "hash_mode" : "",
    "query_config" : "",
    "join_config" : "",
    "return_hit_rewrite_threshold" : "",
    "return_hit_rewrite_ratio" : "",
    "table_name" : "",
    "pool_trunk_size" : 10,
    "pool_recycle_size_limit" : 20

}

The hash method that is used for a document determines the partition to which the document belongs.

  • hash_field: the field that is used to calculate hash values. In most cases, the primary key field is used. In special cases, the inshop field is used.

  • hash_function: the HASH, GALAXY_HASH, and KINGSO_HASH hash methods are supported. If you want to change the hash range of KINGSO_HASH, configure the method in the KINGSO_HASH#range syntax. The range must be configured.

Example:

"hash_mode": 
{
    "hash_field" : "nid",
    "hash_function" : "HASH"
}

The values in the preceding sample code indicate that the HASH method and the nid field are used to calculate hash values for partitioning.

Indexes of the Havenask engine consist of full or incremental indexes and real-time indexes. Full indexes are built in the build service system, and real-time indexes are built in the searcher builder. Both full and real-time builds consume the documents of the topic in the swift system. The processor of the build service system distributes the documents to be indexed: it reads the fields specified in the hash_mode configuration item from each document, uses the hash function to obtain a value that ranges from 0 to 65535, and sends the document to the corresponding partition in the [0, 65535] range of the swift topic. Indexes are logically divided into 2^16 logical partitions that range from 0 to 65535. You can specify the number of physical partitions that you want to generate, and the engine automatically maps consecutive logical partitions to physical partitions. For example, if an index uses two physical partitions, the logical partition ranges of the physical partitions are [0, 32767] and [32768, 65535]. Each physical partition subscribes to the corresponding range of the swift topic to obtain the required documents. When you perform an online query, the system reads the hash_mode configuration and queries the specified columns.

The preceding information describes level-1 hashes. The following example describes a search in Taobao stores. In hash_mode, hash_field is set to a field such as user_id, which specifies the seller ID. A large seller owns far more product items than a small seller. When the processor of the build service system creates an index, the value of user_id is hashed, and hot data is generated in the following scenarios:

  1. After the processor calculates the hash values, the builder and merger that process the large seller take longer than the other builders and mergers. As a result, a longer period of time is required to complete the building process.

  2. When the indexing result is loaded into different columns of the engine, the index of the column that corresponds to the large seller is larger than the indexes of the other columns. The engine must allocate resources to all columns based on the hottest column, which wastes online resources.

  3. The column that corresponds to the large seller is queried more frequently, which generates hot data for queries.

In summary, the building process is prolonged, online resources are wasted, and query hot spots are generated.
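The mapping from the 2^16 logical partitions to physical partitions described above can be sketched as follows. This is an illustrative Python sketch, not engine code; the helper name partition_ranges is hypothetical.

```python
# Sketch (illustrative only): map the 2^16 logical partitions [0, 65535]
# to N physical partitions as contiguous, near-equal ranges.
def partition_ranges(partition_count: int) -> list[tuple[int, int]]:
    total = 65536
    ranges = []
    start = 0
    for i in range(partition_count):
        # Distribute any remainder so ranges differ by at most one slot.
        size = total // partition_count + (1 if i < total % partition_count else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Two physical partitions yield the ranges given in the text.
print(partition_ranges(2))  # [(0, 32767), (32768, 65535)]
```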

The level-2 hash is based on the level-1 hash. After the level-1 hash hashes documents to the same column, the level-2 hash distributes the documents to different columns based on a level-2 hash field. In this example, set hash_fields in the level-2 hash to user_id and nid, where user_id specifies the seller ID and nid specifies the item ID. The level-1 hash distributes documents by seller, so all items of the large seller would land in one column of the engine. The level-2 hash then hashes the items by item ID, which distributes them across multiple columns of the engine. Hashing is performed by the processor of the build service system and has the following results: the hot data that is generated during the building and merging phases is eliminated; the hot data is spread across different columns of the engine, so each column contains only a small volume of hot data; query requests are sent to different columns, which reduces the volume of hot data per column. This way, hashing accelerates building and improves overall resource utilization. This feature is supported in Havenask V3.7.2 and later. The level-2 hash function ROUTING_HASH is provided in autil. The following sample code describes the new hash_mode configuration for the function:

"hash_mode" :
{
    "hash_field" : "user_id",
    "hash_fields" : ["user_id", "nid"],
    "hash_function" : "ROUTING_HASH",
    "hash_params" : {
        "hash_func" : "HASH",
        "routing_ratio" : "0.25",
        "hot_ranges" : "512-1536;8000-10000",
        "hot_ranges_ratio" : "0.25;0.5",
        "hot_values" : "t1;t2",
        "hot_values_ratio" : "0.3;0.4"
    }
}

The following section describes the parameters.

  1. hash_field and hash_fields: The hash_field parameter is still supported. If you do not configure the hash_fields parameter, the value of the hash_field parameter is automatically used as the default value. If the hash_fields parameter is configured, the value of the hash_field parameter does not take effect, but you still need to configure the hash_field parameter.

  2. hash_function supports HASH, GALAXY_HASH, and KINGSO_HASH. ROUTING_HASH is added as the level-2 hash.

  3. The hash_params parameter is used to configure the hash function. This parameter is of the map<string, string> type and is used when the hash function is created. If you use HASH, GALAXY_HASH, or KINGSO_HASH, you do not need to configure the hash_params parameter. If you use ROUTING_HASH and do not configure the hash_params parameter, ROUTING_HASH behaves in the same manner as HASH by default.

  4. The hash_params parameter of ROUTING_HASH supports the following configuration items:

  • hash_func specifies the hash function that is used in ROUTING_HASH. The default value is HASH. You can use NUMBER_HASH to test the function.

  • routing_ratio specifies the coverage rate of the level-2 hash. In the preceding configuration, the internal level-2 hash value is calculated based on the following formula:

(HASH(user_id) % 65536 + floor(HASH(nid) % 65536 × routing_ratio)) % 65536
  • hot_ranges specifies the hash value ranges of hot data, and hot_ranges_ratio specifies the coverage rate for each range. You can specify a different coverage rate for each hot range. In most cases, these parameters are used when the physical columns that contain hot data are known. To use these parameters, convert the column numbers to their hash value ranges within [0, 65535].

  • hot_values specifies the seller of hot data. hot_values_ratio specifies the coverage rate of the seller. You can specify different coverage rates for different large sellers. In most cases, these parameters are used when the original values of the seller of hot data are known.
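The routing_ratio formula above can be sketched in Python. This is an illustrative sketch: zlib.crc32 stands in for the engine's HASH function, which is different, and the function names are hypothetical.

```python
import zlib

# Stand-in for the engine's HASH function (assumption: for illustration only).
def HASH(value: str) -> int:
    return zlib.crc32(value.encode())

def routing_hash(user_id: str, nid: str, routing_ratio: float) -> int:
    # (HASH(user_id) % 65536 + floor(HASH(nid) % 65536 * routing_ratio)) % 65536
    base = HASH(user_id) % 65536
    offset = int(HASH(nid) % 65536 * routing_ratio)
    return (base + offset) % 65536

# With routing_ratio = 0.25, one seller's items spread over at most a quarter
# of the hash range, starting at the seller's level-1 hash value.
h = routing_hash("seller_42", "item_7", 0.25)
assert 0 <= h < 65536
```

With routing_ratio = 0, the formula degenerates to the plain level-1 hash of user_id, which matches the statement that ROUTING_HASH without hash_params behaves like HASH.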

Note: If you modify the hash function or hash_params in the hash_mode configuration, the distribution method for index data changes. You can use one of the following methods to update the query result:

  1. If you want to query all columns, recreate the full index and update the query result searcher (QRS) configuration to perform the query.

  2. If you query a single column or perform the level-2 hash, recreate the cluster and full index. You need to recreate the cluster and index for full data because the hash configuration of the QRS and the hash configuration that is used by the index may be inconsistent. The columns that are queried before the updated hash_mode configuration takes effect are invalid.

The following list describes the query configuration. You can specify the default index name for queries, the default relationship between multiple keywords, and whether to enable multi-term optimization.

  • default_index

The default index name of the query. If you do not specify an index name in the query, the default index is used. For example, query=nid:1 specifies that the nid index is queried. query='mp3' does not specify an index name, so the default index is used.

  • default_operator

The default relationship among multiple words that you want to query. If you set this parameter to "AND", the intersection of the query result for multiple words is retrieved. For example, "query=a b" is processed as "query=a AND b". If the tokenizer splits a single phrase into multiple terms during the query, the terms are also queried based on the relationship that is specified by the default_operator parameter.

  • multi_term_optimize

multi_term_optimize specifies whether to enable multi-term query optimization. By default, the system processes a|b and a&b in the same manner as a OR b and a AND b: as binary queries that are evaluated term by term. The QRS incurs high overhead to process such queries, especially when | or & is used to concatenate a large number of terms. After this parameter is set to true, the customized recall query for the main search is changed to a multi-term query that is run once, and op is used to identify the OR or AND logic.

Example:

"query_config" : {

    "default_index" : "phrase",
    "default_operator" : "AND",
    "multi_term_optimize" : true

}

In the preceding example, the index of the default query is phrase, the relationship between multiple keywords is AND, and multi-term optimization is performed to resolve the query.

This set of configurations optimizes how the searcher serializes documents to the QRS. By default, the QRS queries the required data from the searcher, and each searcher must sort its documents and return the number of documents that is specified by the start and hit parameters. When the value of start + hit is large, serialization in the searcher and deserialization in the QRS incur high costs. When data is queried from multiple columns and is evenly distributed, reducing the number of records that each searcher returns does not affect the query result, provided that the number of returned records is greater than (start + hit)/partition_count × a coefficient. In the crawler service architecture or rank service architecture, the value of start + hit is usually large. In this case, you can enable this optimization.

  • return_hit_rewrite_threshold: If the value of start + hit is greater than this threshold value, the return hit rewrite optimization is enabled.

  • return_hit_rewrite_ratio: the coefficient by which (start + hit)/partition_count is multiplied to determine the number of records that each searcher sorts and serializes. Set the coefficient to a value greater than 1 based on your business requirements. Valid values range from 1 to partition_count. The optimization is not performed for queries that are routed to a single column, such as inshop queries.

After the optimization is enabled, the number of serialized records on each searcher is calculated based on the following formula: (start + hit)/partition_count × return_hit_rewrite_ratio. A larger number of columns indicates better optimization.
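The arithmetic of this formula can be checked with a small sketch. This is illustrative only; the function name is hypothetical and rounding behavior is an assumption.

```python
import math

# Sketch (assumption: ceiling rounding) of the per-column serialized record
# count after the return-hit rewrite: (start + hit) / partition_count * ratio.
def serialized_records(start: int, hit: int, partition_count: int,
                       return_hit_rewrite_ratio: float) -> int:
    return math.ceil((start + hit) / partition_count * return_hit_rewrite_ratio)

# 4 columns, start=0, hit=400, ratio=1.5: each column serializes 150 records
# instead of the full 400, cutting serialization and deserialization cost.
print(serialized_records(0, 400, 4, 1.5))  # 150
```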

  • pool_trunk_size

The size of the trunk that is allocated by the pool. Unit: MB. Default value: 10 MB.

  • pool_recycle_size_limit

The pool size that triggers the system to reclaim occupied space. Unit: MB. Default value: 20 MB.

join_config configures secondary tables. Typical scenario: in addition to item information, member information also needs to be stored. If you do not use secondary tables, the required member information must be added to the item data. If multiple items belong to the same member, the same member information is stored for each item. This wastes offline processing and storage space, and when member information is updated, all items of the member must be updated. You can use secondary tables to resolve this issue: the item table and the member table are indexed separately. At query time, the primary key of the member table is used to look up the required member information. The forward index of member_id is stored in the item table, and the mechanism is similar to a join in a database. Some optimization operations are performed in this method, and their configuration is elaborated in the following sections. Because the member table is not stored as part of the item data, it can be updated independently. However, you cannot use data in the member table as query conditions. You can only filter, sort, or display the data in the member table. You can also store the fields that you want to query as additional data in the item table and store the other information in the secondary table.

  • join_infos Describes the information about the secondary table that is joined in the current cluster.

  • join_field The field that is used to join the secondary table. join_field in the primary table has the following limits: a. The field must match the join field in the secondary table. b. join_field can contain only a single value. c. join_field supports only the INTEGER and STRING types. d. The join field must be the primary key index field in the secondary table.

  • join_cluster The cluster name of the secondary table that you want to join.

  • strong_join Specifies whether to enable the strong join feature, which requires that the attributes of the secondary table are available for queries. A document may fail to be joined, or the primary key index of the secondary table may fail to be queried by using the join attribute. In this case, if the strong join feature is enabled, the document is discarded; if the feature is disabled, the document is retained by default even though the join fails.

  • use_join_cache Specifies whether to enable the join docid cache feature. join docid cache specifies the mapping between the primary table docid and the secondary table docid. The mapping can be considered as a special attribute and is available for users. The mapping is used to optimize the performance of the secondary table. If this feature is enabled, the primary key of the secondary table does not need to be queried. You can obtain docid of the secondary table from the cache. In V2.8 or later, the use_join_cache parameter can be configured for each cluster specified by the join_cluster parameter.

  • check_cluster_config Specifies whether to read the configuration file in a path of a cluster specified by the cluster_name parameter. If this feature is enabled, the configuration file is read in a path of a cluster specified by the cluster_name parameter. If the feature is disabled, the configuration file is not read and the value of the table_name parameter is automatically filled with the name of the cluster_name parameter. Default value: true.

Example:

"join_config" : {
    "join_infos" : [
        {
            "join_field" : "company_id",
            "join_cluster" : "company",
            "strong_join" : true,
            "use_join_cache" : true,
            "check_cluster_config" : true
        }
    ]
}

aggregate_sampler_config

aggregate_sampler_config is used to configure the parameters that are related to statistics. The following statistical modes are supported: general statistics and batch statistics. The optional aggBatchMode parameter identifies the mode. The default mode is general statistics.

If you use general statistics, each queried record is counted as one query result. The following elements are included in the parameter configuration.

"aggregate_sampler_config" :
 {
        "aggThreshold" : 0,
        "sampleStep" : 1
 }

  • aggThreshold specifies the threshold for full counting. If the number of records that are counted is less than the threshold value, all records are counted. If the number of records is greater than or equal to the threshold value, the sampleStep parameter determines which records are sampled.

  • sampleStep specifies the sampling step size. After the threshold is reached, one record out of every sampleStep records is counted.
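Under one reading of these semantics (an interpretation of the description above; the engine's exact sampling logic may differ), general-statistics sampling can be sketched as:

```python
# Sketch (assumption: illustrative interpretation, not engine code): every
# record is counted up to aggThreshold, after which only every
# sampleStep-th record is counted.
def sampled_indices(total: int, agg_threshold: int, sample_step: int) -> list[int]:
    counted = []
    for i in range(total):
        if i < agg_threshold or (i - agg_threshold) % sample_step == 0:
            counted.append(i)
    return counted

# With aggThreshold=0 and sampleStep=1 (the defaults above), all records
# are counted.
assert sampled_indices(5, 0, 1) == [0, 1, 2, 3, 4]
# With aggThreshold=2 and sampleStep=3, records 0 and 1 are always counted,
# then every third record.
print(sampled_indices(10, 2, 3))  # [0, 1, 2, 5, 8]
```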

If the aggBatchMode parameter is set to true, all queried records are counted at a time. The following sample code provides an example on the configuration.

"aggregate_sampler_config" : {
  "aggBatchMode" : true,
  "aggThreshold" : 1000,
  "batchSampleMaxCount" : 1000
}

Batch statistics

  • aggThreshold specifies the maximum number of records. If the value of the totalhits parameter is less than or equal to the value of the aggThreshold parameter, all the queried records are counted.

  • If the value of the totalhits parameter is greater than the value of the aggThreshold parameter, the value of the batchSampleMaxCount parameter is used as the maximum number of sampled records. In this case, the step size is calculated based on the following formula: (totalhits + batchSampleMaxCount - 1)/batchSampleMaxCount.
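The step-size formula above can be verified with a short sketch; the function name is hypothetical and the code is illustrative only.

```python
# Sketch of the batch-statistics step-size formula from the text:
# (totalhits + batchSampleMaxCount - 1) / batchSampleMaxCount, which is
# totalhits divided by the maximum sample count, rounded up.
def batch_sample_step(totalhits: int, batch_sample_max_count: int) -> int:
    return (totalhits + batch_sample_max_count - 1) // batch_sample_max_count

print(batch_sample_step(1000, 1000))  # 1: every record is counted
print(batch_sample_step(5001, 1000))  # 6: every 6th record is sampled
```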

densemap feature

"aggregate_sampler_config" : {
    "enableDenseMode" : false
}

  • Valid values of the enableDenseMode parameter are true and false. Default value: false. If the groupkey parameter has a large number of distinct values in your business scenario, you can set enableDenseMode to true to improve performance.

    Note: Before you set this parameter to true, perform a stress test.

table_name

The name of the table. The table name is used to obtain the table schema from a searcher.

rankprofile_config

rankprofile_config is used to configure scoring plug-ins. The parameters of the plug-in are optional. The main configuration items are related to the name of the so plug-in and the name of the score plug-in.

  • modules.module_name The name of the module that is used for the plug-in.

  • modules.module_path The name of the dynamic-link library of the plug-in.

  • modules.parameters The user-defined parameters that are passed to the so plug-in.

  • rankprofiles.rank_profile_name The name of the rank_profile file. You can specify the name of the rank_profile file that you want to use in the query statement.

  • rankprofiles.score_name The score plug-in name in the so plug-in.

  • rankprofiles.total_rank_size The maximum number of documents that can be scored in the score plug-in. The value divided by partition_count is the number of documents that are scored in a single partition.

Example:

"rankprofile_config" : {
        "modules" : [
            {
                "module_name" : "fake_scorer",               
                "module_path" : "libfakescorer.so", 
                "parameters" : {
                    "key" : "value"                 
                }
            }
        ],
        "rankprofiles" : [
            {
                "rank_profile_name": "DefaultProfile", 
                "scorers" : [
                    {
                        "scorer_name" : "FakeScorer",
                        "module_name" : "fake_scorer",
                        "parameters" : {               
                            "key" : "value"
                        },
                        "rank_size" : "300"            
                    }
                ],
                "field_boost" : {                      
                    "phrase.title" : 1000,
                    "phrase.body" : 100
                }
            }
        ]
    }

summary_profile_config

  • required_attribute_fields The attribute field that is not included in summary_schema in the summary result. In most cases, this field is used in the secondary table. This indicates that the fields in the secondary table are returned in the query result of the primary table.

  • modules.module_name The name of the module that is used for the plug-in.

  • modules.module_path The name of the dynamic-link library of the plug-in.

  • modules.parameters The user-defined parameters that are passed to the so plug-in.

  • summary_profiles.summary_profile_name The name of the summary_profile file. You can specify the name of the summary_profile file that you want to use in the query statement.

  • summary_profiles.extractors The configuration of the summary extractor processing chain. When the summary is extracted, all extractors are processed in sequence.

  • summary_profiles.parameters The user-defined parameters that are passed to the summary_profile file.

  • summary_profiles.field_configs The field definition of the summary that you want to create.

  • summary_profiles.field_configs.max_summary_length and highlight_prefix Items such as the summary length and whether to highlight keywords.

Example:

"summary_profile_config" : {
        "required_attribute_fields" : ["aux_field1", "aux_field2", "attr_field"],
        "modules" : [
            {
                "module_name" : "ha3_summary_eg",       
                "module_path" : "libha3_summary_eg.so", 
                "parameters" : {
                    "key" : "value"                     
                }
            }
        ],
        "summary_profiles" : [
            {
                "summary_profile_name": "DefaultProfile", 
                "extractors" :[                           
                   {
                      "extractor_name" : "SmartSummaryExtractor",
                      "module_name" : "ha3_summary_eg",
                      "parameters" : {                    
                          "key": "value"
                      }
                   },
                   {
                      "extractor_name" : "DefaultSummaryExtractor",
                      "module_name" : "",
                      "parameters" : {}
                   }
                ],
                "field_configs" : {                       
                    "TITLE" : {                           
                        "max_snippet" : 1,
                        "max_summary_length" : 40,
                        "highlight_prefix": "<font color=red>",
                        "highlight_suffix": "</font>",
                        "snippet_delimiter": "..."
                    },
                    "BODY" : {
                        "max_snippet" : 2,
                        "max_summary_length" : 100,
                        "highlight_prefix": "<font color=red>",
                        "highlight_suffix": "</font>",
                        "snippet_delimiter": "..."
                    }
                }
            }
        ]
    }

function_config

function_config is used to specify the information about the function_expression plug-in that is used in your cluster. The parameters in function_config are optional. The following sample code describes the parameters.

"function_config" : 
     {
        "modules" : [
            {
                "module_name" : "func_module_name",                          
                "module_path" : "libfunction_expression_plugin.so",          
                "parameters" : 
                {
                             "config_path" : "pluginsConf/one_function.json"
                }
            }
        ]
    }

searcher_cache_config

The searcher cache is used to cache the result of the phase-1 query.

"searcher_cache_config" :
{
    "max_size" : 1024, /* MB */
    "max_item_num" : 200000,
    "inc_doc_limit" : 1000,
    "inc_deletion_percent" : 1,
    "latency_limit" : 1, /* ms */
    "min_allowed_cache_doc_num" : 0,
    "max_allowed_cache_doc_num" : 50000
}

  • max_size max_size specifies the maximum memory capacity in the searcher cache. Unit: megabytes (MB). Default value: 1024.

  • max_item_num max_item_num specifies the number of items in the cache when the searcher cache is initialized. The bottom layer of the cache is a hash table. The max_item_num parameter is used to initialize the hash table. The actual number of items in the cache is determined based on the actual item size and the limits on the memory capacity when the engine is running. The value of the max_item_num parameter is used to only initialize the hash table in the cache. In most cases, this parameter can be left empty. The default value is 200000.

  • inc_doc_limit If incremental data is required and data in the searcher cache is required in a query, the queried result in the cache and the queried result in incremental data are combined and returned to you. The queried result in incremental data is the documents that are added after the last time when the cache is accessed by the query. If the number of documents queried in incremental data is greater than the value of the inc_doc_limit parameter, the result of the current query in the cache becomes invalid. In this case, when the query is performed next time, the queried result is filled in the cache again. Default value: 1000.

  • inc_deletion_percent If incremental data deletes or updates documents, some queried documents in the cache may become invalid. When the percentage of invalid documents among the cached query result exceeds the value of the inc_deletion_percent parameter, the cached result becomes invalid. The query is processed as a cache miss: the query is run again and the result is filled in the cache. Default value: 1.

  • latency_limit The searcher cache caches only the result of the query for which the following condition is met: (rank_latency + rerank_latency) > latency_limit. If the latency of the query that is cached is high, you can use the cache to improve the performance in an effective manner. Unit: ms. Default value: 1.

  • min_allowed_cache_doc_num and max_allowed_cache_doc_num The min_allowed_cache_doc_num parameter specifies the minimum number of queried documents that can be cached. Default value: 0. The max_allowed_cache_doc_num parameter specifies the maximum number of queried documents that can be cached. Default value: the value that is returned for std::numeric_limits<uint32_t>::max(). These parameters are used when a large number of documents need to be cached in special scenarios.
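The admission conditions described by latency_limit, min_allowed_cache_doc_num, and max_allowed_cache_doc_num can be sketched as a single check. This is a hypothetical helper for illustration, not engine code.

```python
# Sketch (assumption: illustrative simplification): a phase-1 result is
# cached only if the query was slow enough and the result size falls
# inside the allowed range.
def should_cache(rank_latency_ms: float, rerank_latency_ms: float,
                 doc_num: int, latency_limit: float = 1.0,
                 min_doc_num: int = 0, max_doc_num: int = 50000) -> bool:
    slow_enough = (rank_latency_ms + rerank_latency_ms) > latency_limit
    size_ok = min_doc_num <= doc_num <= max_doc_num
    return slow_enough and size_ok

assert should_cache(2.0, 1.0, 100) is True     # slow query, normal result size
assert should_cache(0.3, 0.2, 100) is False    # too fast to be worth caching
assert should_cache(5.0, 5.0, 60000) is False  # result too large to cache
```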

service_degradation_config

service_degradation_config is used to configure service degradation. When an error occurs in the service or traffic spikes occur, you can use this parameter to ensure the system stability. The following sample code provides an example on how to configure service_degradation_config.

"service_degradation_config" :
{
    "condition" : {
        "worker_queue_size_degrade_threshold" : 100,
        "worker_queue_size_recover_threshold" : 10,
        "duration" : 1000
    },
    "request" : {
        "rank_size" : 5000,
        "rerank_size" : 200
    }
}

To implement service degradation, configure the parameters that are related to the check standards for service degradation and the operations that you want to perform during service degradation.

  • Check standards: The work queue length of the current worker is used to determine whether to perform service degradation. If the work queue length continuously equals or exceeds the value of the worker_queue_size_degrade_threshold parameter for the period of time that is specified by the duration parameter, the service enters the degradation state. In the degradation state, if the length stays below the value of the worker_queue_size_recover_threshold parameter for the period of time specified by the duration parameter, the service exits the degradation state. The value of the duration parameter is in milliseconds. Even in the degradation state, a query is not degraded if the queue length at that moment is less than the value of the worker_queue_size_recover_threshold parameter. This way, the queue length fluctuates around the value of the worker_queue_size_recover_threshold parameter.

  • Measures of service degradation: Only the values of the rank_size and rerank_size parameters of the request can be changed. rank_size specifies the number of roughly sorted documents, and rerank_size specifies the number of finely sorted documents. In the rank service architecture, only the rank_size parameter takes effect.
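The degrade/recover behavior above can be sketched as a small state machine. This is a hypothetical simplification of the described behavior, expressed in sample "ticks" rather than wall-clock milliseconds; the class name is illustrative.

```python
# Sketch (assumption: illustrative simplification, not engine code) of the
# queue-length-based degradation state machine described above.
class DegradationState:
    def __init__(self, degrade_threshold: int, recover_threshold: int,
                 duration_ticks: int):
        self.degrade_threshold = degrade_threshold
        self.recover_threshold = recover_threshold
        self.duration_ticks = duration_ticks
        self.degraded = False
        self._streak = 0  # consecutive ticks the trigger condition held

    def observe(self, queue_size: int) -> bool:
        """Feed one queue-length sample; return True if currently degraded."""
        if not self.degraded:
            # Enter degradation after the queue stays at or above the
            # degrade threshold for the full duration.
            self._streak = self._streak + 1 if queue_size >= self.degrade_threshold else 0
            if self._streak >= self.duration_ticks:
                self.degraded, self._streak = True, 0
        else:
            # Exit degradation after the queue stays below the recover
            # threshold for the full duration.
            self._streak = self._streak + 1 if queue_size < self.recover_threshold else 0
            if self._streak >= self.duration_ticks:
                self.degraded, self._streak = False, 0
        return self.degraded

s = DegradationState(degrade_threshold=100, recover_threshold=10, duration_ticks=3)
for q in [120, 130, 150]:      # queue stays >= 100 for 3 ticks: degrade
    degraded = s.observe(q)
assert degraded is True
for q in [5, 5, 5]:            # queue stays < 10 for 3 ticks: recover
    degraded = s.observe(q)
assert degraded is False
```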

multi_call_config

The following sample code provides an example of the flow control configuration for Havenask V3.2.0 and later. In Havenask V3.2.0, the flow control logic changed significantly and the anomaly_process_config field was renamed to multi_call_config.

"multi_call_config" : {
    "probe_percent" : 0.3,
    "latency_upper_limit_percent" : 0.4,
    "begin_degrade_latency" : 100,
    "full_degrade_latency" : 150,
    "et_trigger_percent" : 0.8,
    "et_wait_time_factor" : 5,
    "et_min_wait_time" : 30,
    "retry_trigger_percent" : 0.6,
    "retry_wait_time_factor" : 3
}

For more information about these configuration items, see the gig documentation. The following section describes the parameters of anomaly_process_config in Havenask versions earlier than V3.2.0. You can configure these settings to ensure service stability for each cluster. The following code provides an example of the parameter configuration:

"anomaly_process_config" : {
    "flowControlEnabled" : true,
    "earlyTerminationEnabled" : true,
    "retryQueryEnabled" : true,
    "flowControl" : {
        "sample_interval" : 100,
        "min_sample_count" : 10,
        "heavy_load_arpc_error_ratio" : 0.5,
        "light_load_arpc_error_ratio" : 0,
        "max_flow_redirect" : 1,
        "queue_size_threshold" : 5,
        "flow_control_categories" : [1,2,3,4,5,6,7,8,9,10,11]
    },
    "detection" : {
        "early_termination_trigger_result_percent" : 0.8,
        "early_termination_wait_time_factor" : 3,
        "retry_query_trigger_result_percent" : 0.8,
        "retry_query_wait_time_factor" : 1
    }
}

  • flowControlEnabled specifies whether to enable flow control.

  • earlyTerminationEnabled specifies whether to end the query process in advance.

  • retryQueryEnabled specifies whether to retry the query.

  • flowControl specifies the configuration of flow control. flowControl contains the parameters that are used to monitor the health status of the searcher.

  • sample_interval specifies the interval at which the health status is updated. Unit: ms.

  • min_sample_count specifies the minimum number of queries that must be sampled before the ratio of ARPC errors is calculated. If the number of sampled queries is insufficient to evaluate the latest health status after the specified interval, the system waits until enough queries are sampled.

  • If the ratio of ARPC query errors is greater than the value of the heavy_load_arpc_error_ratio parameter, the health status of the searcher is downgraded.

  • If the ratio of ARPC query errors is less than the value of the light_load_arpc_error_ratio parameter, the health status of the searcher is upgraded.

  • max_flow_redirect specifies the maximum number of backup searchers that the system can try. If a searcher in a column fails to process data, part of its traffic is redirected to other searchers in the same column. If no healthy searcher node can be found among the searcher nodes in the same column, the query requests are discarded.

  • queue_size_threshold is used to derive the queue-length thresholds at which the health status of a searcher is updated. Each query obtains the length of the work queue from the searcher node, and the QRS or proxy updates the health status of the searcher based on this queue length. The health status has 11 levels, and each level has its own update condition. In the example, the value of the queue_size_threshold parameter is 5, so the array of queue-length thresholds is q = [5,10,15,20,25,30,35,40,45,50,55]. The health status level of the searcher is lowered when the following condition is met: queue_size >= q[10-H_c+1]. The health status level of the searcher is raised when the following condition is met: queue_size < q[10-H_c]. In these conditions, H_c is the current health status level of the searcher and is in the range of 0 <= H_c <= 10.

  • flow_control_categories is the advanced configuration item that specifies the queue-length thresholds as an explicit array of 11 values. queue_size_threshold is the simple configuration method that expands the thresholds linearly. When the flow_control_categories array is configured, it overrides the thresholds that are derived from queue_size_threshold.

  • early_termination_trigger_result_percent specifies the percentage of columns that must return results before the system checks whether the query can be terminated in advance. For example, if the value is 0.8 and data from 8 of 10 columns has been returned, the system checks whether to terminate the query in advance.

  • early_termination_wait_time_factor specifies the multiple of the waiting period after which the query is terminated in advance. Let t be the period of time that is required to return the first expected document. If the factor is 3 and the query is not completed after 3t, the system terminates the query in advance and returns the partial result. If the query is completed within this period, the result is returned as expected.

  • retry_query_trigger_result_percent specifies the percentage of columns that must return results before the system checks whether a query retry is required. For example, if the value is 0.8 and data from 8 of 10 columns has been returned, the system checks whether a retry is required.

  • retry_query_wait_time_factor specifies the multiple of the waiting period after which the query is retried. Let t be the period of time that is required to return the first expected document. If the factor is 1 and the query result is still incomplete after 1t, the query request is sent to another searcher in the same column. If the query is completed within this period, the result is returned as expected.
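The queue-length health model described in the flowControl bullets above can be sketched as follows. This Python sketch only illustrates the stated formulas; it is not the actual implementation, and it assumes that a higher level means a healthier searcher.

```python
def build_thresholds(queue_size_threshold=5, levels=11):
    # Linear expansion: q = [t, 2t, ..., 11t], e.g. [5, 10, ..., 55] for t = 5.
    return [queue_size_threshold * (i + 1) for i in range(levels)]

def update_health(h, queue_size, q):
    """Sketch of the health-level update rule from the text.

    h is the current health level H_c, 0 <= h <= 10 (assumed: 10 = healthiest).
      lower the level when queue_size >= q[10 - h + 1]
      raise the level when queue_size <  q[10 - h]
    """
    if h > 0 and queue_size >= q[10 - h + 1]:
        return h - 1
    if h < 10 and queue_size < q[10 - h]:
        return h + 1
    return h
```

Note that the lower and raise conditions use adjacent thresholds, so a queue length that sits between them leaves the health level unchanged instead of oscillating.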

cava_config

cava_config is used to configure the advanced language script of Havenask. Havenask provides the cava script to allow you to write plug-ins. The following sample code provides an example on the parameter configuration of cava_config.

"cava_config" : {
    "enable" : true, // (1)
    "ha3_cava_conf" : "../binary/usr/local/etc/cava/config/cava_config.json", // (2)
    "lib_path" : "cava/lib", // (3)
    "source_path" : "cava/src", // (4)
    "code_cache_path" : "cava/cache", // (5)
    "pool_trunk_size" : 10, // MB (6)
    "pool_recycle_size_limit" : 20, // MB (7)
    "alloc_size_limit" : 40, // MB (8)
    "init_size_limit" : 1024, // MB (9)
    "module_cache_size" : 100000 // (10)
}

  • (1) Specifies whether to enable the cava feature.

  • (2) The path of the built-in configuration file of the cava language.

  • (3) The path of the public library provided by the platform. The value takes effect when cava code is compiled.

  • (4) The path of the custom plug-ins of the service provider. These plug-ins can reuse the public library provided by the platform. The value takes effect when cava code is compiled.

  • (5) The path of the code cache. Plug-in code that is compiled in advance is stored in this cache.

  • (6) The value of the trunk_size parameter in the cava memory pool. Default value: 10 MB. You cannot change the default value.

  • (7) The value of the recycle_size parameter in the cava memory pool. Default value: 20 MB. You cannot change the default value.

  • (8) The maximum memory capacity that can be allocated to the cava query level.

  • (9) The maximum memory capacity for cava. The memory capacity excludes the memory resources that are allocated to the cava query level.

  • (10) The maximum number of compiled source code modules that are stored in the cache. If the threshold is exceeded, entries are evicted based on the Least Recently Used (LRU) algorithm.
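The module cache eviction described in (10) behaves like a standard LRU cache. The following Python sketch is only an illustration of LRU eviction, not the Havenask implementation; the class and method names are hypothetical, and the capacity corresponds to module_cache_size.

```python
from collections import OrderedDict

class ModuleCache:
    """Minimal LRU cache sketch for compiled cava modules (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity  # corresponds to module_cache_size
        self._cache = OrderedDict()

    def get(self, source_key):
        if source_key not in self._cache:
            return None
        self._cache.move_to_end(source_key)  # mark as most recently used
        return self._cache[source_key]

    def put(self, source_key, module):
        if source_key in self._cache:
            self._cache.move_to_end(source_key)
        self._cache[source_key] = module
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
```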

The information about cava will be described in another topic.

turing_options_config

You can use turing_options_config to overwrite the graph-related configuration. The following sample code provides an example on the configuration of turing_options_config.

"turing_options_config" : {
    "graph_config_path" : "ssss", // The path of the graph.
    "dependency_table" : ["a","b"] // The dependency table.
}