Configure document sorting using sort clauses and policies in OpenSearch Industry Algorithm Edition

This topic describes document sorting practices in OpenSearch Industry Algorithm Edition. A search engine has two core tasks: recall and sorting. Recall retrieves the documents that meet specific conditions, and sorting displays the most relevant of those documents first. Sorting must be tuned to your business requirements, so you need to understand how OpenSearch implements it. This topic explains the sorting mechanism of OpenSearch Industry Algorithm Edition in detail and uses common scenarios to show how to apply its sorting capabilities to specific business requirements.

Relationships between the sort clause and sort policies

The sort clause controls global sorting in OpenSearch Industry Algorithm Edition, and a sort policy can be viewed as one sorting level within the sort clause. Sort policies combine built-in functions and expressions into document scoring logic for complex business scenarios. However, sorting is ultimately based on the final score that the expressions in a sort policy produce.

For example, a business wants to sort documents by how new they are and then sort documents at the same newness level by text relevance. To meet these requirements, use the sort clause together with sort policies. If the business table contains the create_time field and you want to retrieve documents based on the name field, add the following content to the sort clause:

sort=-create_time;-RANK

Configure the static_bm25() function in a rough sort policy and the text_relevance(name) function in a fine sort policy. For more information, see Sort policy configuration.

RANK indicates the score that is obtained based on the sort policy. A minus sign (-) indicates descending order and a plus sign (+) indicates ascending order.

By default, if the sort clause is not configured, -RANK is used by the system as the sorting condition. If the sort clause is configured, -RANK must be explicitly written to the sort clause when sorting is implemented based on the score obtained from the sort policy. Otherwise, the system does not automatically introduce the score as the sorting condition.
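
The sign and default rules above can be sketched in Python. This is only an illustration, not OpenSearch code: parse_sort_clause and uses_policy_score are hypothetical helpers, and treating a field without a sign as ascending is inferred from the sort=age;-RANK example in this topic.

```python
def parse_sort_clause(clause=None):
    """Split a sort clause into (field, descending) pairs.

    Hypothetical helper for illustration; not part of any OpenSearch SDK.
    """
    if clause and clause.startswith("sort="):
        clause = clause[len("sort="):]
    if not clause:
        # No sort clause configured: the system sorts by -RANK.
        return [("RANK", True)]
    keys = []
    for part in clause.split(";"):
        part = part.strip()
        if not part:
            continue
        if part[0] in "+-":
            keys.append((part[1:], part[0] == "-"))
        else:
            # Assumption: a field without a sign is sorted in ascending order.
            keys.append((part, False))
    return keys


def uses_policy_score(clause=None):
    """Return True if the sort policy score (RANK) takes part in sorting."""
    return any(field == "RANK" for field, _ in parse_sort_clause(clause))


print(parse_sort_clause("sort=-create_time;-RANK"))
# A configured clause without RANK does not use the sort policy score.
print(uses_policy_score("sort=-create_time"))
```

The second call shows the point made above: once a sort clause is configured, the sort policy score is used only if RANK is written explicitly.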

The following example describes the relationships between the sort clause and sort policies:

In this example, an application exists in OpenSearch. The following table describes the application schema.

Field	Type	Index
id	int	Keyword
name	text	General index for Chinese
age	int	Keyword

A rough sort policy is configured for the name field. The expression is shown in the following figure.

A fine sort policy is configured for the name field. The expression is shown in the following figure.

When you retrieve data, configure the sort clause in the following format:

sort=age;-RANK

Turn on Show Sort Details to view the details of score calculation.

First, the sort score is 13,10000.2259030193. Because the sort clause is set to age;-RANK, 13 is the value of the age field and 10000.2259030193 is the final score of the sort policy. OpenSearch sorts documents by the value of the age field in ascending order and then sorts documents with the same age value by the final score of the sort policy in descending order.

Then, you can see the sorting formula:

FirstRank:
expression[static_bm25()], result[0.496452].
SecondRank:
expression[text_relevance(name)], result[0.225903].

FirstRank indicates the score of the rough sort policy, and SecondRank indicates the score of the fine sort policy. The final score is 10000.2259030193. The next section explains why the final score is 10000.2259030193 rather than the sum 0.496452 + 0.225903.

The preceding results show that the sort clause in OpenSearch is similar to the ORDER BY clause in SQL. You can specify attribute fields in the sort clause to sort documents, and you can use sort policies, which provide built-in functions and score calculation rules, for complex score calculation. The sort clause then determines the final order: each sort field or score is sorted in ascending or descending order based on its plus sign (+) or minus sign (-).
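
The ORDER BY analogy can be illustrated with a short Python sketch that orders documents by age in ascending order and then by RANK in descending order. The sort_documents helper and the second and third documents are made up for illustration; only the first document reuses the values from the example above.

```python
def sort_documents(docs, keys):
    """Order docs by a list of (field, descending) pairs, most significant first."""
    ordered = list(docs)
    # Python's sort is stable, so apply the least significant key first.
    for field, descending in reversed(keys):
        ordered.sort(key=lambda doc: doc[field], reverse=descending)
    return ordered


docs = [
    {"id": 1, "age": 13, "RANK": 10000.2259030193},  # values from the example
    {"id": 2, "age": 13, "RANK": 10000.5},           # hypothetical document
    {"id": 3, "age": 9, "RANK": 10000.1},            # hypothetical document
]

# Equivalent of sort=age;-RANK
ordered = sort_documents(docs, [("age", False), ("RANK", True)])
print([doc["id"] for doc in ordered])  # [3, 2, 1]
```

Document 3 comes first because it has the smallest age. Documents 1 and 2 share the same age, so the higher RANK decides their order.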

Sort policies

Scoring principle of sort policies

The score calculation of sort policies is divided into two stages: rough sort and fine sort. After documents are retrieved by using a query clause and filtered by using a filter clause, rough sort is implemented. At this stage, scores are calculated for the documents based on the rough sort expression. Then, the documents with top N scores are scored and sorted based on the fine sort expression. Finally, the final score of the sort policy is returned. Take note of the following score calculation rules:

If only a rough sort policy is configured, the document score equals 10,000 plus the result calculated by using the rough sort expression. The maximum document score is 20,000. If the actual document score exceeds 20,000, the displayed score is still 20,000.
If only a fine sort policy is configured, the document score equals 10,000 plus the result calculated by using the fine sort expression. No upper limit exists for the document score.
If both a rough sort policy and a fine sort policy are configured, the final score of a document that enters the fine sort stage equals 10,000 plus the result calculated by using the fine sort expression, and the final score of other documents that are only roughly sorted equals 10,000 plus the results calculated by using the rough sort expression. The maximum final score is 20,000. If the actual document score exceeds 20,000, the displayed score is still 20,000.
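
The three rules above can be written as one small Python function. This is a sketch of the rules as stated in this topic, not the engine's implementation; final_score and its parameters are hypothetical names.

```python
BASE_SCORE = 10_000   # added to every sort policy score
MAX_SCORE = 20_000    # cap that applies whenever a rough sort policy is configured


def final_score(rough=None, fine=None, entered_fine_sort=True):
    """Return the final sort policy score for one document.

    rough / fine are the expression results, or None if that policy
    is not configured. entered_fine_sort only matters when both exist.
    """
    if rough is None and fine is None:
        raise ValueError("at least one sort policy must be configured")
    if rough is None:
        # Only a fine sort policy: no upper limit.
        return BASE_SCORE + fine
    if fine is None:
        # Only a rough sort policy: capped at 20,000.
        return min(BASE_SCORE + rough, MAX_SCORE)
    # Both policies: the fine score replaces the rough score for
    # documents that entered the fine sort stage.
    raw = fine if entered_fine_sort else rough
    return min(BASE_SCORE + raw, MAX_SCORE)


print(final_score(rough=0.496452, fine=0.225903))
print(final_score(rough=0.496452, fine=0.225903, entered_fine_sort=False))
print(final_score(rough=15_000))  # 20000
```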

The preceding section produced the following scores:

FirstRank:
expression[static_bm25()], result[0.496452].
SecondRank:
expression[text_relevance(name)], result[0.225903].

The final score is 10000.2259030193.

Based on the preceding rules, the hit document is one of one million retrieved documents. Its rough sort score is 0.496452, which is calculated by using the static_bm25 function. That score places the document in the top 200 of all hit documents, so the document enters the fine sort stage. When a document moves from the rough sort stage to the fine sort stage, its rough sort score is discarded and 10,000 points are added by default, so the final sort policy score is the fine sort score plus 10,000. The fine sort score of the document is 0.225903, which is calculated by using the text_relevance function in the fine sort policy. Therefore, the final sort policy score of the document is 10000.2259030193.
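
The walk-through above can be sketched as a two-stage pipeline in Python. This is a simplified model under stated assumptions: two_stage_scores is a hypothetical function, the default of 200 is taken from this example rather than from a documented setting, and the toy documents carry precomputed scores instead of calling static_bm25 or text_relevance.

```python
import heapq

BASE_SCORE = 10_000
MAX_SCORE = 20_000


def two_stage_scores(docs, rough_fn, fine_fn, top_n=200):
    """Return {doc id: final score} after rough sort and fine sort."""
    # Stage 1: score every hit document with the rough sort expression.
    rough = {doc["id"]: rough_fn(doc) for doc in docs}
    # Only the top N documents by rough score enter the fine sort stage.
    finalists = heapq.nlargest(top_n, docs, key=lambda doc: rough[doc["id"]])
    finalist_ids = {doc["id"] for doc in finalists}

    scores = {}
    for doc in docs:
        if doc["id"] in finalist_ids:
            # Stage 2: the rough score is discarded and replaced.
            raw = fine_fn(doc)
        else:
            raw = rough[doc["id"]]
        scores[doc["id"]] = min(BASE_SCORE + raw, MAX_SCORE)
    return scores


# Toy documents with made-up, precomputed expression results.
docs = [
    {"id": 1, "rough": 0.5, "fine": 0.2},
    {"id": 2, "rough": 0.3, "fine": 0.9},
    {"id": 3, "rough": 0.1, "fine": 0.7},
]
scores = two_stage_scores(
    docs,
    rough_fn=lambda doc: doc["rough"],
    fine_fn=lambda doc: doc["fine"],
    top_n=2,
)
print(scores)
```

With top_n=2, documents 1 and 2 are scored by the fine sort expression, and document 3 keeps its rough sort score.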

Usage of fine sort functions

Important

All fields in the application schema referenced by the following built-in functions must be configured as attribute fields. Otherwise, the Invalid formula error is reported.

Function	Description	Example
i in (value1, value2, …, valuen)	Returns 1 if the value of i is in the set [value1, value2, …, valuen]. Otherwise, returns 0.	If age=5: age in (1,2,3,4,5) returns 1; age in (6,7,8,9) returns 0.
if(cond, then_value, else_value)	Returns the value of the then_value parameter if the value of the cond parameter is not 0. Otherwise, returns the value of the else_value parameter. For example, if(2,3,5) returns 3 and if(0,3,5) returns 5.	If a=1: if(a==1,5,10) returns 5; if(1,5,10) returns 5; if(a==2,5,10) returns 10; if(0,5,10) returns 10.
random()	Returns a random value in the [0,1] range.	-
now()	Returns the number of seconds that have elapsed since 00:00:00 UTC on January 1, 1970.	-
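
The behavior described in the table can be mirrored with Python stand-ins. The names in_set and if_ are hypothetical, and the code only models the described return values; it is not how OpenSearch evaluates expressions.

```python
import random
import time


def in_set(value, *candidates):
    """Model of `i in (value1, ..., valuen)`: 1 if present, otherwise 0."""
    return 1 if value in candidates else 0


def if_(cond, then_value, else_value):
    """Model of if(cond, then_value, else_value): any non-zero cond is true."""
    return then_value if cond != 0 else else_value


age = 5
print(in_set(age, 1, 2, 3, 4, 5))  # 1
print(in_set(age, 6, 7, 8, 9))     # 0

a = 1
print(if_(a == 1, 5, 10))  # 5
print(if_(0, 5, 10))       # 10

# Rough analogues of random() and now(); note that Python's
# random.random() draws from the half-open range [0.0, 1.0).
print(random.random())
print(int(time.time()))
```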

For more information about the text relevance, geographical location relevance, timeliness, algorithm relevance, and functionality functions, see Fine sort functions.
For more information about common mathematical functions, see Sort policy configuration-Mathematical functions.
For more information about common expressions and operators, see Sort policy configuration-Basic operations.

Note

If the expressions of the sort policy still cannot meet the requirements for score calculation in complex scenarios, you can use the Cava plug-in to write a script. For more information about the usage and principle of the Cava plug-in, see Sort plugin development-Cava language.

Configurations of sort policies in common scenarios

Add 10 points to the score if the value of the age field is greater than 10, another 20 points if the value of the age field is greater than 40, and 30 points if the value of the weight field is greater than 60. Then, sort documents by the final score.
Implementation 1:
```
# Set the following fine sort expression. Each comparison contributes 1 when its condition is met, so multiply it by the number of points to add.
(age>10)*10+(age>40)*20+(weight>60)*30
```
Implementation 2:
```
# Set the following fine sort expression:
if(age>10,10,0) + if(age>40,20,0) + if(weight>60,30,0)
```
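
To check that the two implementations agree, the following Python sketch evaluates both forms for a few made-up age and weight values. Python booleans behave as 0 and 1 in arithmetic, which mirrors the comparison behavior used in Implementation 1. The function names are hypothetical.

```python
def score_by_multiplication(age, weight):
    # Mirrors (age>10)*10+(age>40)*20+(weight>60)*30
    return (age > 10) * 10 + (age > 40) * 20 + (weight > 60) * 30


def score_by_if(age, weight):
    # Mirrors if(age>10,10,0) + if(age>40,20,0) + if(weight>60,30,0)
    return (
        (10 if age > 10 else 0)
        + (20 if age > 40 else 0)
        + (30 if weight > 60 else 0)
    )


for age, weight in [(5, 50), (20, 50), (45, 50), (45, 70)]:
    assert score_by_multiplication(age, weight) == score_by_if(age, weight)
    print(age, weight, score_by_if(age, weight))
# Scores: 0, 10, 30, 60
```

An age of 45 satisfies both age conditions, so it earns 10 + 20 = 30 points before the weight condition is considered.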

Rank "xxx Company" before "xxx Hangzhou Branch".

Implementation:

# Configure the field_match_ratio function in the fine sort expression
field_match_ratio(title)

Rank "dim_itm_tb" before "dim_itm_tb_dst_itm_relation_dd" for the results returned by the all:'dim_itm_tb' query.
Implementation:
```
# Configure the field_match_ratio function in the fine sort expression:
field_match_ratio(detail) 
```
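
The following Python sketch suggests why field_match_ratio favors the shorter value. It assumes that the function returns the share of terms in the field that match the query, and that both strings are split into terms on underscores. Both assumptions are for illustration only; see Fine sort functions for the actual definition and your analyzer settings for the actual tokenization.

```python
def field_match_ratio_sketch(query_terms, field_terms):
    """Share of field terms that also appear in the query (assumed behavior)."""
    if not field_terms:
        return 0.0
    query = set(query_terms)
    matched = sum(1 for term in field_terms if term in query)
    return matched / len(field_terms)


# Assumption: terms are produced by splitting on underscores.
query_terms = "dim_itm_tb".split("_")
short_doc = "dim_itm_tb".split("_")
long_doc = "dim_itm_tb_dst_itm_relation_dd".split("_")

print(field_match_ratio_sketch(query_terms, short_doc))  # 1.0
print(field_match_ratio_sketch(query_terms, long_doc))   # 4 of 7 terms match
```

Under these assumptions every term of the shorter value matches the query, while the longer value is diluted by unmatched terms, so the shorter value scores higher.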

Achieve an effect similar to query=item:"iphone 8" OR item:'iphone 8'.

Implementation:

# Configure the query_min_slide_window function in the fine sort expression:
query_min_slide_window(title)

When the text_relevance function is configured in the fine sort expression and the search keyword is "Republic of China", rank the documents that contain "Republic of China" before the documents that contain "Interesting news of the Republic of China-Republic of China" or "Chinese national history-Republic of China".
Implementation:
```
# Configure the query_min_slide_window function in the fine sort expression:
query_min_slide_window(title)
```
Prevent the static_bm25() function from repeatedly calculating the score when a search keyword is repeatedly hit in a specific field.
Implementation:
```
# Configure the query_match_ratio function in the fine sort expression:
query_match_ratio(title) 
```

Rank the documents that are stacked with search keywords at the end of the results.

Implementation:

# Configure the field_term_match_count function to check whether a document is stacked with keywords based on the number of times a search keyword is hit in the document.
if(field_term_match_count(title)>3,1,10)
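
The effect of this expression can be sketched in Python. The sketch assumes that field_term_match_count returns the number of terms in the field that match the query and that terms are separated by spaces. The helper names, titles, and query are made up for illustration.

```python
def field_term_match_count_sketch(query_terms, field_terms):
    """Number of field terms that also appear in the query (assumed behavior)."""
    query = set(query_terms)
    return sum(1 for term in field_terms if term in query)


def stacking_score(query, title):
    # Mirrors if(field_term_match_count(title)>3,1,10)
    count = field_term_match_count_sketch(query.split(), title.split())
    return 1 if count > 3 else 10


print(stacking_score("phone", "phone phone phone phone cheap phone"))  # 1
print(stacking_score("phone", "red phone case"))                       # 10
```

The title that repeats the keyword five times exceeds the threshold of 3 and receives the low score, so it sorts after the normal title.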

Add specific points to the sort score if a string is not empty.

Implementation:

Add the mark field to the source database. Set the value of the mark field to 0 if the string is empty and to 1 otherwise. Then, configure the if function in the fine sort expression to check the value of the mark field.
Set the following fine sort expression, which adds 500 points to the sort score when the value of the mark field is 1:
if(mark==1,500,0)
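
The mark field can be derived before the data is uploaded. The following Python sketch shows one possible preprocessing step; with_mark and mark_bonus are hypothetical helpers, and subtitle is a made-up name for the string field being checked.

```python
def with_mark(record, field):
    """Return a copy of record with mark=1 if the string field is non-empty."""
    marked = dict(record)
    # A missing field, None, or an empty string all count as empty.
    marked["mark"] = 1 if record.get(field) else 0
    return marked


def mark_bonus(mark):
    # Mirrors if(mark==1,500,0)
    return 500 if mark == 1 else 0


rows = [
    {"id": 1, "subtitle": "limited edition"},
    {"id": 2, "subtitle": ""},
]
for row in rows:
    marked = with_mark(row, "subtitle")
    print(marked["id"], marked["mark"], mark_bonus(marked["mark"]))
```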

Cases

Functions

When you sort results, you can obtain the text relevance score by configuring the static_bm25 function in the rough sort stage and the text_relevance function in the fine sort stage. The functions are described as follows:

static_bm25: the static text relevance. You can use the function to measure how well the query matches the text. Value range: [0,1].
text_relevance: the text matching degree based on the keywords that are hit in a field. Value range: [0,1].

Preparations

Step 1: To show the effect of the text relevance score on sorting, prepare the following rows of test data. id indicates the primary key and name indicates the text content.

id	name
1	Black humor, also known as "black comedy", is a modernist literary genre that emerged in the United States in the 1960s.
2	"Black Humor" is a song sung by Jay Chou.
3	Jay Chou, a Mandopop male singer, musician, music arranger, record producer, and magician from Taiwan (China).
4	Night is falling, and everything around is dark. To ease the oppressive atmosphere, Jay Chou humorously told a joke.
5	Black Humor female version (Original singer: Jay Chou)
6	Jay Chou "Black Humor" - Official Music Video

Step 2: Create a rough sort policy named test_first_rank_name and a fine sort policy named test_second_rank_name. Configure the static_bm25 function in the test_first_rank_name policy and the text_relevance function in the test_second_rank_name policy.
Step 3: Create a query analysis task named test_qp and specify OR as Search Query Rewriting.
Step 4: Upload the test data that is provided in Step 1 to the OpenSearch application.

Case analysis

Case 1

Requirement: Search for Black Humor Jay Chou.

Query analysis: The user wants to search for the song "Black Humor" sung by Jay Chou. In the test data, the name value of the document whose id is 4 is irrelevant to this intent. However, after the keyword is split into terms, the document matches those terms and is retrieved. Therefore, the irrelevant document must be ranked after the required documents in the sorting stage.

Procedure: Perform the operations described in Step 2 in the "Preparations" section.

Results:

Case 2

Requirement: Search for Black Humor Jay Chou. If the documents do not contain exact matches of the keyword, retrieve the documents that contain the word "Black Humor", and then retrieve the documents that contain the word "Jay Chou".

Procedure: Perform the operations described in Step 2 and Step 3 in the "Preparations" section.

Results: