Rough sort functions - OpenSearch - Alibaba Cloud Documentation Center

Rough sort is the process of selecting the top N high-quality documents from all documents that are retrieved. Then, the top N high-quality documents are scored and sorted in the fine sort process. This way, users can obtain the documents that best match their requirements. Rough sort affects the search performance, whereas fine sort affects the ultimate sort results. Therefore, you can use key factors of fine sort to roughly sort documents in an efficient and simple manner. You can use sort expressions to roughly and finely sort documents. This topic describes the feature functions used for rough sort.

Feature functions

static_bm25(): returns the static text relevance that indicates the matching degree between the query and document.

Syntax: static_bm25()
Parameters: none
Return value: The return value is of the FLOAT type. The valid values range from 0 to 1.
Scenario 1: You can use the static_bm25() function in a rough sort expression to calculate text scores.
Usage notes
- By default, static_bm25() takes effect if you use the default rough sort expression.

Note

If you have configured query analysis for queries, the score calculated by using the static_bm25() function exceeds 1. Example:

Synonyms are configured for the default query clause query=index: '苹果'. Then, the query clause is changed to query=index: '苹果' OR index:'apple'. In this case, if a document contains both 苹果 and apple, the scores calculated by using the static_bm25() function for 苹果 and apple are accumulated. That is, the final rough sort score is greater than 1.

exact_match_boost(): calculates the maximum weight of a specified term in a query.

Syntax: exact_match_boost()
Parameters: none
Return value: The return value is of the INT type. The valid values range from 0 to 99.
Scenario: The query clause is query=default:'开放搜索'^60 OR default:'opensearch'^50. You want to sort results based on the boost weight that is specified in the query clause for a matching term. For example, Document A contains the term 开放搜索, and Document B contains the term opensearch. In this case, Document A ranks higher than Document B. The rough sort expression is exact_match_boost().
Usage notes
- The field that you reference in the parameter of this function must be configured as an index field.
- The default boost weight of the term for which no boost weight is specified in the query clause is 99.
- If the exact_match_boost function is used in the rough sort expression for an exclusive application, the function parameter can be set to 'sum' and 'max'.

timeliness: returns the timeliness score that indicates how new the document is.

Syntax: timeliness(pubtime)
pubtime: the field whose timeliness is to be evaluated. The field value must be of the INT type in the unit of seconds.
Return value: The return value is of the FLOAT type. The valid values range from 0 to 1. A greater value indicates better timeliness. If the value of the field is later than the current time, 0 is returned.
Scenario: You can use the timeliness(create_timestamp) function in a rough sort expression to calculate the timeliness score of the document specified by the create_timestamp field.
Usage notes
- The pubtime field must be configured as an attribute field.

timeliness_ms: returns the timeliness score that indicates how new the document is.

Syntax: timeliness_ms(pubtime)
pubtime: the field whose timeliness is to be evaluated. The field value must be of the INT type in the unit of milliseconds.
Return value: The return value is of the FLOAT type. The valid values range from 0 to 1. A greater value indicates better timeliness. If the value of the field is later than the current time, 0 is returned.
Scenario: You can use the timeliness_ms(create_timestamp) function in a rough sort expression to calculate the timeliness score of the document specified by the create_timestamp field.
Usage notes
- The pubtime field must be configured as an attribute field.

normalize: normalizes scores in different value ranges to numeric values in the range from 0 to 1.

Scenario overview: The relevance of a document is calculated from different dimensions. The scores that are calculated from different dimensions may be in different value ranges. For example, a web page can have millions of clicks while the text relevance score of the web page is a value from 0 to 1. You cannot compare such values in different value ranges. The normalize function can normalize the scores in different value ranges to scores in the same value range. This way, you can use the normalized scores for further calculation. The normalize function supports three normalization methods: linear normalization, log normalization, and arctangent normalization. The function automatically selects a normalization method based on the input parameters. If only the value parameter is set, the normalize function uses the arctangent function for normalization. If both the value and max parameters are set, the normalize function uses the logarithmic function for normalization. If the value, max, and min parameters are all set, the normalize function uses the linear function for normalization.
Syntax: normalize(value, max, min)
value: the field in a document or the expression for which you want to normalize the field value or the return value. The field value or return value must be of the DOUBLE type. max: the maximum value of the value range after normalization. This parameter is optional. The maximum value must be of the DOUBLE type. min: the minimum value of the value range after normalization. This parameter is optional. The minimum value must be of the DOUBLE type.
Return value: The return value is of the DOUBLE type. The valid values range from 0 to 1.
Scenario 1: You want to normalize the value of the price field but do not know the value range of the price field. In this case, you can use the normalize function in the following syntax: normalize(price).
Scenario 2: You want to normalize the value of the price field and only know that the maximum value of the price field is 100. In this case, you can use the normalize function in the following syntax: normalize(price, 100).
Scenario 3: You want to normalize the value of the price field and know that the maximum value of the price field is 100 and the minimum value is 1. In this case, you can use the normalize function in the following syntax: normalize(price, 100, 1).
Scenario 4: You want to normalize the return value of the distance function to a value from 0 to 1. In this case, you can use the normalize function in the following syntax: normalize(distance(longitude_in_doc, latitude_in_doc, longitude_in_query, latitude_in_query)).
Usage notes
- The field referenced in the function must be set as an attribute field.
- If the arctangent function is used for normalization and the field value or the return value of the specified expression is smaller than 0, the return value of the normalize function is 0.
- If the logarithmic function is used for normalization, the value of the max parameter must be greater than 1.
- If the linear function is used for normalization, the value of the max parameter must be greater than that of the min parameter.

category_score: the category prediction function that returns the matching score between the specified category field in the parameters and the category obtained by the category prediction query

Syntax
category_score(cate_id)
Parameter
cate_id: the field that is used as a category ID for model training. This field must be of the INT type.
The return value is of the INT type. The valid values range from 0 to 2.
Scenario: You can set the category_score(cate_id) function in a sort expression. For more information, see Use the category prediction feature.
Usage notes
- This function must be used together with the category prediction feature.