
OpenSearch: CTR prediction models

Last Updated: Nov 29, 2024

Overview

Click-through rate (CTR) prediction is a core task of a search platform. This task predicts the probability that a user clicks a document that matches the user's query after the document is displayed. The predicted value can be used in sorting scripts to improve search performance and business metrics such as CTR.

Benefits

OpenSearch supports CTR prediction models to better meet the needs of search result sorting in various scenarios. You can create and train CTR prediction models to implement personalized sorting of search results.

Create and train a model

  1. Create an Industry Algorithm Edition instance. Then, log on to the OpenSearch console and choose Search Algorithm Center > Sort Configuration in the left-side navigation pane. On the Policy Management page, click CTR Prediction Models. On the CTR Prediction Models page, click Create to create a CTR prediction model.


  2. Enter a model name and specify the training fields.


  • Map Training Fields: The Commodity ID and Commodity Title fields are required. The more fields you specify, the better the model performs.


  3. After the model is created, find the model on the CTR Prediction Models page and click Train in the Actions column.


  4. After the training starts, view the training progress on the model details page.


  5. After the training is complete and the model status changes to Available, you can use the model. If the model status changes to Unavailable, adjust the model based on the upgrade conditions described in the Data Verification section. After the model meets the conditions, train the model again the next day. If you have any questions, submit a ticket to contact technical support.


Note

We recommend that you enable scheduled training to train the model daily.

Perform a search test

  1. In the left-side navigation pane, choose Search Algorithm Center > Sort Configuration > Policy Management. On the page that appears, click Create to create a Cava-based fine sort policy.


  • Configure the Policy Name parameter, select Fine Sort from the Scope drop-down list, select Cava Script from the Type drop-down list, and then click Next.


  • Click Add Script File and copy the following sample Cava script into the script editor. Click Compile. If the compilation is successful, click Save and then click Publish.

    Then, you can perform a search test.


Sample Cava script:

package users.scorer;
import com.aliyun.opensearch.cava.framework.OpsScoreParams;
import com.aliyun.opensearch.cava.framework.OpsScorerInitParams;
import com.aliyun.opensearch.cava.framework.OpsRequest;
import com.aliyun.opensearch.cava.framework.OpsDoc;
import com.aliyun.opensearch.cava.features.algo.AlgoModel;

class BasicSimilarityScorer {
    boolean init(OpsScorerInitParams params) {
        return true;
    }

    double score(OpsScoreParams params) {
        double score = 0;
        return score;
    }
};

class IntelligenceAlgorithmScorer {
    AlgoModel _algoModel;

    boolean init(OpsScorerInitParams params) {
        // The tf_checkpoint parameter is a fixed parameter. Replace the last
        // argument with the name of your CTR prediction model.
        _algoModel = AlgoModel.create(params, "tf_checkpoint", "ctr", "Name of your CTR prediction model");
        return true;
    }

    double score(OpsScoreParams params) {
        OpsDoc doc = params.getDoc();
        // Evaluate the CTR prediction model for the current document.
        double modelScore = _algoModel.evaluate(params);
        doc.trace("ctrModelScore: ", modelScore);

        // Add an offset to the model score. Adjust the offset as needed.
        double score = modelScore + 700;
        return score;
    }
};

  2. Perform a search test.


Note

  • The second_rank_type, second_rank_name, and raw_query parameters are required in search requests.

  • If the user_id parameter is included in both behavioral data and search queries, the model performs better.
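The required parameters above can be assembled into a query string before the request is sent. The following is a minimal sketch: the parameter names come from this document, but the endpoint, policy name, query values, and the assumption that second_rank_type takes the value cava for Cava-based policies are placeholders to adapt to your own setup.

```python
from urllib.parse import urlencode

# Hypothetical search request parameters. Only the parameter names
# (second_rank_type, second_rank_name, raw_query, user_id) come from
# the documentation; all values are placeholders.
params = {
    "query": "query=phone&&config=start:0,hit:10",
    "second_rank_type": "cava",           # assumed value for a Cava-based policy
    "second_rank_name": "my_ctr_policy",  # required: name of your fine sort policy
    "raw_query": "phone",                 # required: the user's original query
    "user_id": "user_123",                # optional, but improves model quality
}

request_query_string = urlencode(params)
print(request_query_string)
```

In practice, these parameters are usually set through the OpenSearch SDK rather than built by hand; the sketch only illustrates which fields a request must carry.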

Model details page

Basic Information

You can view the following basic information about the model: Created At, Status, Last Training Time, and Latest Version Status.


Configuration Information

Training Fields: After you click Map Training Fields, you can modify or delete training fields in the Map Training Fields panel. After you modify training fields, you must retrain the model.


Scheduled Training: By default, scheduled training is enabled to train the model daily. You can also modify the scheduled training task to customize the training cycle.


Data Verification

Valid values of Data Integrity: Available Data and Abnormal Data.

The integrity report displays the integrity level of the current application. The following list describes each integrity level and the conditions for upgrading to the next level.

l0

The data is completely unavailable. Required core fields are missing, and the data size is too small. Therefore, subsequent data processing cannot be performed.

l0 --> l1:

  • The application table contains more than 1,000 data entries.

  • The number of page views (PVs) in the last 24 hours is greater than 10,000.

  • The number of independent queries in the last 24 hours is greater than 1,000. The number is counted only based on the raw_query field.

l1

The core fields of the data are configured and meet the most basic requirements. However, the size of behavioral data is small, and some fields are missing. Optimization that does not rely on behavioral data can be performed. Issues of the behavioral data must be resolved to perform comprehensive optimization.

l1 --> l2:

  • The number of item page views (IPVs) in the last 24 hours is greater than 1,000.

  • The number of reported exposures is greater than that of IPVs, and the bhv_type field of the behavioral data is not empty.

  • The number of unique visitors (UVs) in the last 24 hours is greater than 1,000. The number is counted based on the user_id field of search requests.

  • The rn field in more than 90% of the behavioral data in the last 24 hours is not empty.

  • The item_id field in more than 90% of the behavioral data in the last 24 hours is not empty.

  • The bhv_time field in more than 90% of the behavioral data in the last 24 hours is not empty.

  • The values of the item_id field in more than 90% of the behavioral data can be matched with the items in the application table in the last 24 hours.

  • The values of the rn field in more than 60% of the behavioral data can be matched with the values of the request_id field in search logs in the last 24 hours.

  • The values of the bhv_time field in more than 60% of the behavioral data are valid timestamps in the last 24 hours.

  • The values of the bhv_time field in more than 60% of the behavioral data indicate points in time on the current day in the last 24 hours. This indicates that the behavioral data is reported with no latency.

l2

The data quality meets the requirements and subsequent optimization can be performed. However, the data size is small. This has a certain impact on the final optimization result.

l2 --> l3:

  • The number of PVs in the last 24 hours is greater than 1,000,000.

  • The number of UVs in the last 24 hours is greater than 100,000.

  • The number of independent queries in the last 24 hours is greater than 100,000.

  • The number of IPVs in the last 24 hours is greater than 100,000.

  • The number of reported exposures is greater than the number of IPVs.

  • The values of the rn field in more than 90% of the behavioral data can be matched with the values of the request_id field in search logs in the last 24 hours.

  • The values of the bhv_time field in more than 90% of the behavioral data are valid timestamps in the last 24 hours.

  • The values of the bhv_time field in more than 90% of the behavioral data indicate points in time on the current day in the last 24 hours. This indicates that the behavioral data is reported with no latency.

l3

Both the data quality and data size meet the requirements and optimization can be performed.

l3 --> l4:

  • The number of PVs in the last 24 hours is greater than 10,000,000.

  • The number of UVs in the last 24 hours is greater than 1,000,000.

  • The number of independent queries in the last 24 hours is greater than 1,000,000.

  • The number of IPVs in the last 24 hours is greater than 1,000,000.

l4

The data size is large and contains tens of millions of data entries. The data integrity is high.

l4 --> l5:

  • The number of PVs in the last 24 hours is greater than 100,000,000.

  • The number of UVs in the last 24 hours is greater than 10,000,000.

  • The number of independent queries in the last 24 hours is greater than 10,000,000.

  • The number of IPVs in the last 24 hours is greater than 10,000,000.

l5

The data size is very large and contains hundreds of millions of data entries. Deep optimization can be performed.
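The upgrade conditions above can be checked programmatically against your own logs. The following sketch covers two kinds of checks: the 90% field-completeness conditions (l1 --> l2) and the traffic-volume thresholds (l2 --> l3 and later). The record and metric formats are hypothetical; only the field names and threshold values come from the list above.

```python
def completeness(records, field):
    """Fraction of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(field)) / len(records)

def meets_l2_field_conditions(records):
    """rn, item_id, and bhv_time must be non-empty in more than 90% of records."""
    return all(completeness(records, f) > 0.9 for f in ("rn", "item_id", "bhv_time"))

# Lower bounds on last-24h PVs, UVs, independent queries, and IPVs
# for upgrading to each level, copied from the list above.
VOLUME_THRESHOLDS = {
    "l3": {"pv": 1_000_000, "uv": 100_000, "queries": 100_000, "ipv": 100_000},
    "l4": {"pv": 10_000_000, "uv": 1_000_000, "queries": 1_000_000, "ipv": 1_000_000},
    "l5": {"pv": 100_000_000, "uv": 10_000_000, "queries": 10_000_000, "ipv": 10_000_000},
}

def meets_volume_thresholds(metrics, target_level):
    """Check whether last-24h metrics exceed the thresholds for target_level."""
    return all(metrics.get(k, 0) > v for k, v in VOLUME_THRESHOLDS[target_level].items())

records = [{"rn": "r1", "item_id": "i1", "bhv_time": "1700000000"}] * 95 + [{}] * 5
metrics = {"pv": 2_000_000, "uv": 150_000, "queries": 120_000, "ipv": 110_000}
print(meets_l2_field_conditions(records))      # True: all three fields are 95% complete
print(meets_volume_thresholds(metrics, "l3"))  # True
print(meets_volume_thresholds(metrics, "l4"))  # False
```

Note that the volume thresholds are only part of each upgrade condition; the match-rate and timestamp-validity checks described above must also be satisfied.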

Note
  • The number of IPVs indicates the number of clicks on search results. In this case, the value of the bhv_type field is click.

  • If exposures are reported, the number of exposures must be greater than the number of IPVs. That is, the number of behavioral data entries in which bhv_type is set to expose must be greater than the number of entries in which bhv_type is set to click. If a user clicks a product, the product must have been exposed first. Therefore, a click requires two behavioral data entries to be uploaded: one with bhv_type=expose and one with bhv_type=click.
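The dual-reporting rule above can be sketched as follows. The field names (item_id, rn, bhv_time, bhv_type) come from this document; the record format and helper function are hypothetical.

```python
def report_click(item_id, rn, bhv_time):
    """Return the two behavioral data entries to upload for one clicked product:
    an expose entry and a click entry that share the same item and request."""
    base = {"item_id": item_id, "rn": rn, "bhv_time": bhv_time}
    return [dict(base, bhv_type="expose"), dict(base, bhv_type="click")]

entries = report_click("item_42", "req_abc", "1700000000")
for e in entries:
    print(e["bhv_type"], e["item_id"])
```

Because every click also produces an expose entry, the exposure count stays greater than or equal to the click count, as the note requires.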

Usage notes

  • You can use CTR prediction models only in Cava-based plug-ins.

  • This feature is available only for Industry Algorithm Edition - Dedicated Cluster instances.

  • Each application supports up to three CTR prediction models.

  • The more training fields you specify, the better the model training result.

  • The raw_query field in upgrade conditions is a required field in search requests. The value of the field must be a unique and independent search query that has search results. For more information, see SDK for Java demo code for implementing the search feature.

  • Related API operations and SDKs: Algorithms.