use the data mining components of PAI to perform offline scheduling - Platform For AI

This topic describes how to use the data mining components of Platform for AI (PAI) to perform offline scheduling when predicting ad click-through rate (CTR).

Background information

The pipeline used in this topic contains the following steps:

Train a model in PAI based on historical data.
Schedule the model in DataWorks.
In the early morning of every day, the CTRs of ads are predicted, and ads are delivered based on the predicted CTRs.

The dataset used in this pipeline is generated by using a random number generator. Therefore, the pipeline results are not evaluated. This topic describes only how to create a pipeline and perform offline scheduling in DataWorks.

Step 1: Prepare a dataset

The training dataset used in this pipeline includes the historical data of ads that are delivered on September 19, 2016 and September 20, 2016. The pipeline predicts the CTRs of the ads delivered on September 21, 2016. The dataset is stored in a MaxCompute partitioned table. The following table describes the fields in the dataset.

Field	Type	Description
id	STRING	The unique ID of the ad.
age	DOUBLE	The age of the person to which the ad is delivered.
sex	DOUBLE	The gender of the person to which the ad is delivered. Valid values: 1 (male) and 0 (female).
duration	DOUBLE	The display duration of the ad. Unit: seconds.
place	DOUBLE	The position where the ad is displayed. Valid values: 0 to 4. A higher value indicates a lower position.
ctr	DOUBLE	The CTR of the ad. If the number of clicks divided by the number of views is greater than 0.03 for the ad, the value of this field is 1. Otherwise, the value of this field is 0.
dt	STRING	The date when the ad is delivered. Format: YYYYMMDD.

You can use the MaxCompute client to run the following command to create a partitioned table named ad: For more information, see Create tables.

create table if not exists ad (id STRING,age DOUBLE,sex DOUBLE,duration DOUBLE,place DOUBLE,ctr DOUBLE ) partitioned by (dt STRING) ;
alter table ad add if not exists partition (dt='20160919') partition (dt='20160920');

The following table shows the ad table that is used in the pipeline. You can run Tunnel commands to import partitioned table data. For more information, see Import data to tables.

id	age	sex	duration	place	ctr	dt
0	49	1	9	0	0	20160919
1	17	1	3	1	1	20160919
2	44	0	4	0	0	20160919
3	14	1	9	1	0	20160919
4	44	1	5	4	0	20160919
5	10	1	9	3	1	20160919
6	42	1	7	3	0	20160919
7	51	1	3	1	1	20160919
8	18	0	3	3	0	20160919
9	39	0	8	4	1	20160919
10	45	1	3	2	0	20160919
11	57	0	8	2	0	20160919
12	14	0	7	2	1	20160919

Step 2: Create a pipeline

Create a custom pipeline and open the pipeline. For more information, see Prepare data.

Build a pipeline.

In the left-side component list, drag the Read Table component in the Data Source/Target folder to the canvas twice and rename the two Read Table components as ad-1 and ad-2.
In the left-side component list, drag the Normalization component in the Data Preprocessing folder to the canvas twice.
In the left-side component list, choose Machine Learning > Binary Classification and drag the Logistic Regression for Binary Classification component to the canvas.
In the left-side component list, drag the Prediction component in the Machine Learning folder to the canvas.
In the left-side component list, drag the Write Table component in the Data Source/Target folder to the canvas. Rename the Write Table component as ad_result-1.

Connect the preceding components as shown in the following figure.

离线模型

Section	Description
①	The components in this section import data from the source dataset.
②	The components in this section preprocess the source data.
③	The component in this section trains a model.
④	The components in this section use the model to make predictions.

Configure the component

Click the ad-2 and ad-1 components on the canvas and configure the parameters in the right-side panel. The ad-2 component serves as the training data source, and the ad-1 component serves as the prediction data source.

Tab	Parameter	Description
Select Table	Table Name	The name of the MaxCompute table that you want to import. Enter ad.
	Partition	Specifies whether the MaxCompute table is a partitioned table. If the MaxCompute table is a partitioned table, the system automatically selects Partition.
	Parameter	The data that you want to import. Set this parameter to a value in the dt=@@{yyyyMMdd} format. This parameter ensures that daily incremental data is imported as prediction data.
Fields Information	Source Table Columns	The columns in the MaxCompute table that you want to import. After you configure the parameters on the Select Table tab, the system automatically displays the columns of the MaxCompute table.

Click the Normalization -1 component on the canvas. On the Fields Setting tab in the right-side panel, click Select Field and select a field of the DOUBLE or INT type. Perform the same operations for the Normalization -2 component.

Click the Logistic Regression for Binary Classification component on the canvas and configure the parameters in the right-side panel. Configure only the parameters described in the following table and use the default values of other parameters.

Tab	Parameter	Description
Fields Setting	Training Feature Columns	The columns that you want to use for training. Select age, sex, duration, and place.
Fields Setting	Target Columns	The column that stores the CTR data. Select ctr.

Click the Prediction component on the canvas and configure the parameters in the right-side panel. Configure only the parameters described in the following table and use the default values of other parameters.

Tab	Parameter	Description
Fields Setting	Feature Columns	The columns that you want to use for training. Select age, sex, duration, and place.
Fields Setting	Reserved Columns	The column that stores the CTR data. Select ctr.

Click the ad_result-1 component on the canvas. On the Select Table tab, set the New Table Name field to ad_result.

Click the icon in the top toolbar of the canvas to run the pipeline.
After you run the pipeline, right-click ad_result-1 on the canvas and choose View Data > View Output to view the table that is generated based on the prediction results.
In the table:
- prediction_result: indicates whether the ad is clicked. Valid values: 1 and 0. 1 indicates that the ad is clicked, and 0 indicates that the ad is not clicked.
- prediction_score: indicates the probability that the ad is clicked.

Step 3: Perform offline scheduling in DataWorks

Create a PAI node in DataWorks. For more information, see Use DataWorks tasks to schedule pipelines in Machine Learning Designer.
Configure DataWorks to schedule the PAI node at 00:00 every day. For more information, see Configure time properties.
Commit the node and go to Operation Center to view the logs of the node. For more information, see View and manage auto triggered nodes.

References

You can use DataWorks tasks to periodically schedule pipelines in Machine Learning Designer. For more information, see Use DataWorks tasks to schedule pipelines in Machine Learning Designer.
For information about the normalization component, see Normalization.
For information about logistic regression for binary classification, see Logistic Regression for Binary Classification.