This topic describes how to use the data mining components of Machine Learning Platform for AI (PAI) to build an ad click-through rate (CTR) prediction experiment and run it on an offline schedule.

Background information

The experiment described in this topic consists of the following steps:
  1. Train a model in PAI based on historical data.
  2. Schedule the model in DataWorks.
  3. Predict the CTRs of ads in the early morning of each day and deliver ads based on the predicted CTRs.

The dataset used in this experiment is generated by using a random number generator. Therefore, the experiment result is not evaluated. This topic describes only how to create an experiment and perform offline scheduling in DataWorks.

Dataset

The training dataset used in this experiment includes the historical data about the ads delivered on September 19, 2016 and September 20, 2016. This experiment predicts the CTRs of the ads delivered on September 21, 2016. The dataset is stored in a MaxCompute partitioned table. The following table describes the fields in the dataset.
| Field | Data type | Description |
| --- | --- | --- |
| id | STRING | The unique ID of the ad. |
| age | DOUBLE | The age of the person to whom the ad is delivered. |
| sex | DOUBLE | The gender of the person to whom the ad is delivered. Valid values: 1 (male) and 0 (female). |
| duration | DOUBLE | The duration for which the ad is displayed. Unit: seconds. |
| place | DOUBLE | The position where the ad is displayed. Valid values: 0 to 4. |
| ctr | DOUBLE | A binary label derived from the CTR of the ad. If the number of clicks divided by the number of views is greater than 0.03, the value is 1. Otherwise, the value is 0. |
| dt | STRING | The date when the ad is delivered. Format: YYYYMMDD. |
The following figure shows the sample table ad that this experiment uses.
[Figure: Example of the data table]
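
The labeling rule for the ctr field can be sketched in Python as follows. This is a minimal illustration, not the code that produced the dataset: the names `ctr_label` and `make_row`, the `clicks` and `views` inputs, and the handling of ads with zero views are all assumptions made for this example.

```python
CTR_THRESHOLD = 0.03  # threshold stated in the description of the ctr field


def ctr_label(clicks: int, views: int) -> int:
    """Return 1 if clicks / views is greater than 0.03, otherwise 0."""
    if views <= 0:
        # Assumption: an ad that was never viewed gets the label 0.
        return 0
    return 1 if clicks / views > CTR_THRESHOLD else 0


def make_row(ad_id, age, sex, duration, place, clicks, views, dt):
    """Build one row with the same fields as the ad table (hypothetical helper)."""
    return {
        "id": str(ad_id),
        "age": float(age),
        "sex": float(sex),            # 1 (male) or 0 (female)
        "duration": float(duration),  # seconds
        "place": float(place),        # 0 to 4
        "ctr": float(ctr_label(clicks, views)),
        "dt": dt,                     # YYYYMMDD
    }


row = make_row("a001", 31, 1, 12, 3, clicks=5, views=100, dt="20160919")
```

Because the comparison is strict, an ad whose ratio is exactly 0.03 is labeled 0.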

Step 1: Create an experiment

  1. Create and configure an experiment.
    1. Create an experiment. Then, click Components in the left-side navigation pane.
    2. In the left-side Component Descriptions pane, click Data Source/Target and drag the Read MaxCompute Table component to the canvas twice. Rename the two Read MaxCompute Table components to ad-1 and ad-2.
    3. Click Data Preprocessing and drag the Normalization component to the canvas twice.
    4. Choose Machine Learning > Binary Classification and drag the Logistic Regression for Binary Classification component to the canvas.
    5. Drag the Prediction component under Machine Learning to the canvas.
    6. Drag the Write MaxCompute Table component under Data Source/Target to the canvas. Rename the Write MaxCompute Table component to ad_result-1.
    7. Drag directed lines to connect the preceding components, as shown in the following figure.
      [Figure: Offline model]
      | Area No. | Description |
      | --- | --- |
      | 1 | The components in this area import data from the source dataset. |
      | 2 | The components in this area preprocess the source data. |
      | 3 | The component in this area trains a model. |
      | 4 | The components in this area perform prediction. |
  2. Set the component parameters.
    1. Click the ad-2 and ad-1 components on the canvas and set the parameters in the right-side pane. The ad-2 component serves as the training data source, whereas the ad-1 component serves as the prediction data source.
      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Select Table | Table Name | The name of the MaxCompute table that you want to import. Enter ad. |
      | Select Table | Partition | Specifies whether the MaxCompute table is a partitioned table. If the MaxCompute table is a partitioned table, the system automatically selects Partition. |
      | Select Table | Parameter | The data that you want to import. Set this parameter to a value in the format of dt=@@{yyyyMMdd}. This ensures that daily incremental data is imported as the prediction data. |
      | Fields Information | Source Table Columns | The columns in the MaxCompute table to import. After you set the parameters on the Select Table tab, the system automatically displays the columns of the MaxCompute table. |
    2. Click the Logistic Regression for Binary Classification-1 component on the canvas and set the parameters in the right-side pane. Set only the parameters described in the following table and use the default values of other parameters.
      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Fields Setting | Training Feature Columns | The columns that you want to use for training. Select age, sex, duration, and place. |
      | Fields Setting | Target Columns | The column that stores the CTR data. Select ctr. |
    3. Click the Prediction-1 component on the canvas and set the parameters in the right-side pane. Set only the parameters described in the following table and use the default values of other parameters.
      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Fields Setting | Feature Columns | The columns that you want to use for prediction. Select age, sex, duration, and place. |
      | Fields Setting | Output Result Column | The column that stores the CTR data. Select ctr. |
  3. In the top toolbar of the canvas, click Run.
  4. After the experiment is run, right-click ad_result-1 on the canvas and select View Data to view the table that is generated based on the prediction result. The following figure shows the table.
    [Figure: Prediction result]
    In the table, the prediction_result field indicates whether the ad is predicted to be clicked. Valid values: 1 (clicked) and 0 (not clicked). The prediction_score field indicates the probability that the ad is clicked.
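
To make the data flow of the experiment concrete, the following Python sketch imitates the same stages outside PAI: normalization, logistic regression training, and prediction. It is an illustration under stated assumptions, not the implementation of the PAI components. Min-max scaling, batch gradient descent, the 0.5 decision threshold, normalizing each dataset on its own, and all function names are assumptions chosen for this example. The rows are random, as in the experiment, so the predictions carry no meaning.

```python
import math
import random

FEATURES = ["age", "sex", "duration", "place"]


def normalize(rows):
    """Min-max scale each feature column to [0, 1] (assumed normalization method)."""
    bounds = {f: (min(r[f] for r in rows), max(r[f] for r in rows)) for f in FEATURES}
    scaled = []
    for r in rows:
        vec = []
        for f in FEATURES:
            lo, hi = bounds[f]
            # A constant column is mapped to 0.0 to avoid division by zero.
            vec.append((r[f] - lo) / (hi - lo) if hi > lo else 0.0)
        scaled.append(vec)
    return scaled


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(xs, ys, epochs=200, lr=0.5):
    """Fit logistic regression weights with batch gradient descent."""
    w = [0.0] * (len(FEATURES) + 1)  # the last entry is the bias term
    n = len(xs)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1]) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[-1] += err
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def predict(x, w):
    """Return (prediction_result, prediction_score) for one feature vector."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
    return (1 if p >= 0.5 else 0), p


def random_rows(n, dt, rng):
    """Generate random rows with the fields of the ad table (hypothetical helper)."""
    return [
        {
            "id": f"{dt}-{i}",
            "age": float(rng.randint(18, 60)),
            "sex": float(rng.randint(0, 1)),
            "duration": float(rng.randint(1, 30)),
            "place": float(rng.randint(0, 4)),
            "ctr": float(rng.randint(0, 1)),
            "dt": dt,
        }
        for i in range(n)
    ]


rng = random.Random(0)
history = random_rows(200, "20160919", rng) + random_rows(200, "20160920", rng)
today = random_rows(5, "20160921", rng)

weights = train(normalize(history), [r["ctr"] for r in history])
results = []
for row, x in zip(today, normalize(today)):
    label, score = predict(x, weights)
    results.append({"id": row["id"], "prediction_result": label, "prediction_score": score})
    print(results[-1])
```

Each printed row has the same shape as a row of the result table: the ad ID, a 0 or 1 prediction, and the associated probability.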

Step 2: Perform offline scheduling in DataWorks

  1. Create a PAI node in DataWorks. For more information, see Create a Machine Learning (PAI) node.
    Configure DataWorks to schedule the PAI node at 00:00 every day. For more information, see Time properties.
  2. Commit the node and go to Operation Center to view the logs of the node. For more information, see View auto triggered nodes.
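
When the node runs on a daily schedule, the partition filter in the format of dt=@@{yyyyMMdd} resolves to a concrete date string. The following Python sketch shows only the string format that such a filter takes. The function `partition_spec` is hypothetical, and the sketch deliberately takes the date as an input because this topic does not state which date DataWorks substitutes at run time.

```python
from datetime import date


def partition_spec(run_date: date) -> str:
    """Format a date as a dt=YYYYMMDD partition filter (hypothetical helper)."""
    return "dt=" + run_date.strftime("%Y%m%d")


# The experiment predicts CTRs for ads delivered on September 21, 2016.
spec = partition_spec(date(2016, 9, 21))
print(spec)  # dt=20160921
```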