Document sorting practices in OpenSearch

Retrieval and sorting are the two search engine features that users care about most. Retrieval returns all documents that meet the query conditions, and sorting places the most relevant of those documents first. Sorting must be configured to fit your business requirements, so you need a good understanding of the sorting capabilities that OpenSearch provides. This topic describes those capabilities in detail and lists common scenarios to which they apply.

Relationships between the sort clause and sort policies

The sort clause controls global sorting in OpenSearch, and a sort policy can be understood as one sorting level within the sort clause. A sort policy combines built-in functions and expressions into document scoring logic for complex business scenarios, and documents are then sorted by the final score that the expressions in the policy produce. For example, suppose a business wants to sort documents by how new they are, and then sort documents at the same newness level by their relevance. To meet these requirements, use the sort clause together with sort policies. If the business table contains the create_time field and you want to retrieve documents based on the name field, add the following content to the sort clause:

sort=-create_time;-RANK

Configure the static_bm25() function in a rough sort policy and the text_relevance(name) function in a fine sort policy. For more information about how to configure a sort policy, see Configure sort policies.

RANK indicates the score that is obtained based on the sort policy. A minus sign (-) indicates the descending order and a plus sign (+) indicates the ascending order.

By default, if the sort clause is not configured, -RANK is used by the system as the sorting condition. If the sort clause is configured, -RANK must be explicitly written to the sort clause when sorting is implemented based on the score that is obtained from the sort policy. Otherwise, the system does not automatically introduce the score as the sorting condition.
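The ordering rules above can be emulated outside OpenSearch to check your expectations. The following Python sketch is only an illustration, not how OpenSearch is implemented: parse_sort_clause and sort_documents are hypothetical helper names, documents are plain dictionaries, and RANK is assumed to be stored as an ordinary key that holds the sort policy score.

```python
def parse_sort_clause(clause):
    """Parse a sort clause such as "-create_time;-RANK" into (field, descending) pairs."""
    if not clause:
        # No sort clause configured: -RANK is the default sorting condition.
        return [("RANK", True)]
    keys = []
    for part in clause.split(";"):
        part = part.strip()
        if not part:
            continue
        if part[0] in "+-":
            keys.append((part[1:], part[0] == "-"))
        else:
            # A field without a sign is sorted in ascending order.
            keys.append((part, False))
    return keys


def sort_documents(docs, clause=""):
    """Return docs ordered by every key of the sort clause, first key first."""
    ordered = list(docs)
    # Python sorts are stable, so sorting by the last key first and the
    # first key last produces a multi-level ordering.
    for field, descending in reversed(parse_sort_clause(clause)):
        ordered.sort(key=lambda doc, f=field: doc[f], reverse=descending)
    return ordered


# Made-up documents for illustration only.
docs = [
    {"id": 1, "age": 13, "RANK": 10000.2259},
    {"id": 2, "age": 13, "RANK": 10000.51},
    {"id": 3, "age": 9, "RANK": 10000.01},
]
print([d["id"] for d in sort_documents(docs, "age;-RANK")])  # [3, 2, 1]
print([d["id"] for d in sort_documents(docs, "age")])        # [3, 1, 2]
```

In the second call, RANK is not part of the sort clause, so the two documents with the same age value are not ordered by score. This mirrors the rule that -RANK must be written explicitly once a sort clause is configured.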

The following example describes the relationships between the sort clause and sort policies:

This example uses an OpenSearch application with the following schema.

Field	Type	Index
id	int	Keyword
name	text	General index for Chinese
age	int	Keyword

A rough sort policy is configured with the following expression: static_bm25()

A fine sort policy is configured for the name field with the following expression: text_relevance(name)

When you retrieve data, configure the sort clause in the following format:

sort=age;-RANK

Turn on Show Sort Details to view the details of score calculation.

First, you can see that the sort score is 13,10000.2259030193. The sort clause is set to age;-RANK. Therefore, 13 indicates the value of the age field and 10000.2259030193 indicates the final score of the sort policy. OpenSearch sorts documents based on the value of the age field in ascending order and then sorts documents with the same age value in descending order based on the final score of the sort policy.

Then, you can see the sorting formula:

FirstRank:
expression[static_bm25()], result[0.496452].
SecondRank:
expression[text_relevance(name)], result[0.225903].

FirstRank indicates the score of the rough sort policy, whereas SecondRank indicates the score of the fine sort policy. The final score is 10000.2259030193 rather than the sum 0.496452 + 0.225903. The reason is explained in the "Scoring principle of sort policies" section.

The preceding results show that the sort clause in OpenSearch is similar to the ORDER BY clause in SQL. You can specify attribute fields in the sort clause to sort documents. You can also use sort policies, which have their own functions and score calculation rules, to calculate scores. The sort clause then determines the final order of documents: each sort field or score is applied in the direction given by its plus sign (+) or minus sign (-).

Sort policies

Scoring principle of sort policies

The score calculation of a sort policy is divided into two stages: rough sort and fine sort. After documents are retrieved by the query clause and filtered by the filter clause, rough sort is performed: each document is scored by the rough sort expression. Then, the N documents with the highest rough sort scores are scored and sorted again by the fine sort expression. Finally, the final score of the sort policy is returned. Take note of the following score calculation rules:

If only a rough sort policy is configured, the document score equals 10,000 plus the result calculated by using the rough sort expression. The maximum document score is 20,000. If the actual document score exceeds 20,000, the displayed score is still 20,000.
If only a fine sort policy is configured, the document score equals 10,000 plus the result calculated by using the fine sort expression. No upper limit exists for the document score.
If both a rough sort policy and a fine sort policy are configured, the final score of a document that enters the fine sort stage equals 10,000 plus the result calculated by using the fine sort expression, and the final score of other documents that are only roughly sorted equals 10,000 plus the results calculated by using the rough sort expression. The maximum final score is 20,000. If the actual document score exceeds 20,000, the displayed score is still 20,000.
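The three rules above can be written as one small function. The following Python sketch only restates the rules as described in this topic. The name policy_score and its parameters are hypothetical, and None is used to mean that the corresponding policy is not configured.

```python
BASE = 10_000   # points added to every sort policy score
CAP = 20_000    # displayed maximum when a rough sort policy is involved


def policy_score(rough=None, fine=None, entered_fine=False):
    """Return the final sort policy score of one document.

    rough / fine: results of the rough and fine sort expressions,
                  or None if that policy is not configured.
    entered_fine: whether the document was among the top N documents
                  that entered the fine sort stage.
    """
    if rough is None and fine is None:
        raise ValueError("at least one sort policy must be configured")
    if rough is None:
        # Only a fine sort policy: no upper limit applies.
        return BASE + fine
    if fine is None:
        # Only a rough sort policy: the score is capped at 20,000.
        return min(BASE + rough, CAP)
    # Both policies: the fine sort result replaces the rough sort result
    # for documents that entered the fine sort stage.
    raw = fine if entered_fine else rough
    return min(BASE + raw, CAP)


print(policy_score(rough=12_000))  # 20000
print(policy_score(fine=12_000))   # 22000
```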

Scores that are calculated in the preceding section:

FirstRank:
expression[static_bm25()], result[0.496452].
SecondRank:
expression[text_relevance(name)], result[0.225903].

The final score is 10000.2259030193.

Apply the preceding principle to this example. The hit document is one of one million retrieved documents. Its rough sort score, calculated by the static_bm25() function, is 0.496452. This score places the document in the top 200 of all hit documents, so the document enters the fine sort stage. When a document enters the fine sort stage, its rough sort score is discarded and its final sort policy score becomes the fine sort score plus 10,000. The fine sort score, calculated by the text_relevance function in the fine sort policy, is 0.225903. Therefore, the final sort policy score of the document is 10000.2259030193.
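To see how the top N cutoff decides which expression produces the final score, the two stages can be simulated on a few documents. The following Python sketch rests on stated assumptions: two_stage_scores is a hypothetical helper, the rough and fine scores are supplied as precomputed values instead of being calculated by static_bm25() or text_relevance(), and the default of 200 is taken from the example above and is not a documented setting.

```python
def two_stage_scores(docs, rough_fn, fine_fn, top_n=200):
    """Return {doc id: final sort policy score} for a rough + fine sort policy."""
    # Stage 1: score every hit document with the rough sort expression.
    rough = {doc["id"]: rough_fn(doc) for doc in docs}
    ranked = sorted(docs, key=lambda doc: rough[doc["id"]], reverse=True)
    # Only the top N documents by rough sort score enter the fine sort stage.
    fine_ids = {doc["id"] for doc in ranked[:top_n]}

    scores = {}
    for doc in docs:
        if doc["id"] in fine_ids:
            raw = fine_fn(doc)      # the rough sort score is discarded
        else:
            raw = rough[doc["id"]]  # the document is only roughly sorted
        scores[doc["id"]] = min(10_000 + raw, 20_000)
    return scores


# Made-up documents; "a" uses the scores from the example above.
docs = [
    {"id": "a", "rough": 0.496452, "fine": 0.225903},
    {"id": "b", "rough": 0.3, "fine": 0.9},
    {"id": "c", "rough": 0.1, "fine": 0.8},
]
scores = two_stage_scores(
    docs, lambda d: d["rough"], lambda d: d["fine"], top_n=2
)
```

With top_n=2, documents a and b enter the fine sort stage and receive 10,000 plus their fine sort scores, whereas document c keeps 10,000 plus its rough sort score.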

Usage of fine sort functions

Note: All fields in the application schema referenced by the following built-in functions must be configured as attribute fields. Otherwise, the Invalid formula error is reported.

Function	Description	Example
i in (value1, value2, …, valuen)	If the value of i is included in the set [value1, value2, …, valuen], 1 is returned. Otherwise, 0 is returned.	If age=5: age in (1,2,3,4,5) returns 1; age in (6,7,8,9) returns 0.
if(cond, then_value, else_value)	If the value of the cond parameter is not 0, the value of the then_value parameter is returned. Otherwise, the value of the else_value parameter is returned. For example, the value of if(2,3,5) is 3 and the value of if(0,3,5) is 5.	If a=1: if(a==1,5,10) returns 5; if(1,5,10) returns 5; if(a==2,5,10) returns 10; if(0,5,10) returns 10.
random()	Returns a random value in the [0,1] range.	-
now()	Returns the number of seconds that have elapsed since 00:00:00 January 1, 1970 in UTC.	-
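The behavior of these four functions can be mimicked in Python to check an expression by hand before you configure it. This is a rough emulation under stated assumptions, not OpenSearch code: the names in_set, if_, random_value, and now are hypothetical stand-ins, and Python's random.random() returns values in [0, 1), which is slightly narrower than the documented [0,1] range.

```python
import random
import time


def in_set(value, *candidates):
    """Emulates `i in (value1, ..., valuen)`: 1 if value is listed, else 0."""
    return 1 if value in candidates else 0


def if_(cond, then_value, else_value):
    """Emulates if(cond, then_value, else_value): any non-zero cond selects then_value."""
    return then_value if cond != 0 else else_value


def random_value():
    """Emulates random(); note that Python never returns exactly 1.0."""
    return random.random()


def now():
    """Emulates now(): whole seconds since 00:00:00 January 1, 1970 UTC."""
    return int(time.time())


age = 5
a = 1
print(in_set(age, 1, 2, 3, 4, 5))  # 1
print(in_set(age, 6, 7, 8, 9))     # 0
print(if_(a == 1, 5, 10))          # 5
print(if_(a == 2, 5, 10))          # 10
```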

For more information about the text relevance, geographical location relevance, timeliness, algorithm relevance, and functionality functions, see Fine sort functions.
For more information about common numeric functions, expressions, and operators, see Configure sort policies.

Note

If the expressions of the sort policy still cannot meet the requirements for score calculation in complex scenarios, you can use the Cava plug-in to write a script. For more information about the usage and principle of the Cava plug-in, see Cava for the development of sort plug-ins.

Configurations of sort policies in common scenarios

1. Add 10 points to the score if the value of the age field is greater than 10, another 20 points if the value of the age field is greater than 40, and 30 points if the value of the weight field is greater than 60. Then, sort documents by the final score.

Implementation 1:

# Set the following fine sort expression. A condition that is met evaluates to 1, so each condition is multiplied by the number of points to add.
(age>10)*10+(age>40)*20+(weight>60)*30

Implementation 2:

# Set the following fine sort expression:
if(age>10,10,0) + if(age>40,20,0) + if(weight>60,30,0)
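The two implementations should always produce the same score. The following Python sketch checks this on a few sample values. It is only an illustration: the function names are hypothetical, and it relies on Python treating True as 1 in arithmetic, which matches the way a met condition evaluates to 1 in Implementation 1.

```python
def score_by_multiplication(age, weight):
    """Implementation 1: each comparison yields 1 or 0 and is multiplied by its points."""
    return (age > 10) * 10 + (age > 40) * 20 + (weight > 60) * 30


def score_by_if(age, weight):
    """Implementation 2: each if(...) contributes its points or 0."""
    return (
        (10 if age > 10 else 0)
        + (20 if age > 40 else 0)
        + (30 if weight > 60 else 0)
    )


# Made-up (age, weight) samples.
samples = [(5, 50), (20, 50), (45, 50), (45, 70), (5, 70)]
for age, weight in samples:
    print(age, weight, score_by_multiplication(age, weight), score_by_if(age, weight))
```

A document with age 45 and weight 70 meets all three conditions and receives 10 + 20 + 30 = 60 points.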

2. Rank "xxx Company" before "xxx Hangzhou Branch".

Implementation:

# Configure the field_match_ratio function in the fine sort expression:
field_match_ratio(title)

3. Rank "dim_itm_tb" before "dim_itm_tb_dst_itm_relation_dd" for the results returned by the all:'dim_itm_tb' query.

Implementation:

# Configure the field_match_ratio function in the fine sort expression:
field_match_ratio(detail)
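The intuition behind scenarios 2 and 3 is that a shorter field in which most terms match the query gets a higher ratio than a longer field that contains the same matches. The following Python sketch is a simplified approximation and not the OpenSearch implementation: it assumes that the ratio is the number of field terms that match a query term divided by the total number of field terms, and it uses a hypothetical tokenizer that splits on every non-alphanumeric character, whereas the real result depends on the analyzer configured for the index.

```python
import re


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on non-alphanumeric characters."""
    return [term for term in re.split(r"[^0-9A-Za-z]+", text.lower()) if term]


def approx_field_match_ratio(field_text, query):
    """Approximate share of field terms that also appear in the query."""
    field_terms = tokenize(field_text)
    if not field_terms:
        return 0.0
    query_terms = set(tokenize(query))
    matched = sum(1 for term in field_terms if term in query_terms)
    return matched / len(field_terms)


short_ratio = approx_field_match_ratio("dim_itm_tb", "dim_itm_tb")
long_ratio = approx_field_match_ratio("dim_itm_tb_dst_itm_relation_dd", "dim_itm_tb")
print(short_ratio > long_ratio)  # True
```

Under these assumptions, every term of "dim_itm_tb" matches the query, whereas only four of the seven terms of "dim_itm_tb_dst_itm_relation_dd" match, so the shorter value ranks first.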

4. Implement a query similar to query = item:"iphone 8" OR item: 'iphone 8'.

Implementation:

# Configure the query_min_slide_window function in the fine sort expression:
query_min_slide_window(title)

5. When the text_relevance function is configured in the fine sort expression and the search keyword is "Republic of China", rank the documents that contain "Republic of China" before the documents that contain "Interesting news of the Republic of China-Republic of China" or "Chinese national history-Republic of China".

Implementation:

# Configure the query_min_slide_window function in the fine sort expression:
query_min_slide_window(title)

6. Prevent the static_bm25() function from repeatedly calculating the score when a search keyword is repeatedly hit in a specific field.

Implementation:

# Configure the query_match_ratio function in the fine sort expression:
query_match_ratio(title)

7. Rank the documents that are stacked with search keywords at the end of the results.

Implementation:

# Configure the field_term_match_count function to check whether a document is stacked with keywords based on the number of times a search keyword is hit in the document.
if(field_term_match_count(title)>3,1,10)
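The effect of this expression can be illustrated with a rough emulation. The following Python sketch is an assumption-laden illustration rather than the OpenSearch implementation: approx_field_term_match_count is a hypothetical stand-in that counts whitespace-separated field terms that also occur in the query, and the threshold of 3 is taken from the expression above.

```python
def approx_field_term_match_count(field_text, query):
    """Approximate number of field terms that match a query term."""
    query_terms = set(query.lower().split())
    return sum(1 for term in field_text.lower().split() if term in query_terms)


def stacking_score(field_text, query, threshold=3):
    """Mirrors if(field_term_match_count(title)>3,1,10)."""
    if approx_field_term_match_count(field_text, query) > threshold:
        return 1   # the keyword is hit too often: likely keyword stacking
    return 10      # normal document


# Made-up titles for illustration.
print(stacking_score("phone phone phone phone case", "phone"))  # 1
print(stacking_score("phone case", "phone"))                    # 10
```

Because the stacked title scores 1 and the normal title scores 10, sorting by -RANK places the stacked document after the normal one.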

8. Configure that specific points are added to the sort score if a string is not empty.

Implementation:

Add the mark field to the source database. Set the value of the mark field to 0 if the string is empty and to 1 otherwise. Then, use the if function in the fine sort expression to check the value of the mark field.
# Set the following fine sort expression. 500 points are added to the sort score when the value of the mark field is 1.
if(mark==1,500,0)
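The mark field has to be derived before the data is pushed to OpenSearch. The following Python sketch shows one way to do that on the source side, under stated assumptions: add_mark and mark_bonus are hypothetical helpers, description is a hypothetical name for the string field to check, and a missing or None value is treated as empty.

```python
def add_mark(record, source_field):
    """Return a copy of record with mark=1 if source_field is a non-empty string, else 0."""
    value = record.get(source_field)
    marked = dict(record)
    marked["mark"] = 1 if value else 0
    return marked


def mark_bonus(record):
    """Mirrors if(mark==1,500,0)."""
    return 500 if record["mark"] == 1 else 0


# Made-up source records; "description" is a hypothetical field name.
rows = [
    {"id": 1, "description": "waterproof case"},
    {"id": 2, "description": ""},
    {"id": 3},
]
marked_rows = [add_mark(row, "description") for row in rows]
print([(row["id"], row["mark"], mark_bonus(row)) for row in marked_rows])
# [(1, 1, 500), (2, 0, 0), (3, 0, 0)]
```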