
Platform for AI: Offline scheduling

Last Updated: Feb 23, 2024

This topic describes how to use the data mining components of Platform for AI (PAI) to perform offline scheduling when predicting ad click-through rate (CTR).

Background information

The pipeline used in this topic contains the following steps:

  1. Train a model in PAI based on historical data.

  2. Schedule the model in DataWorks.

  3. Every day in the early morning, predict the CTRs of the ads and deliver ads based on the predicted CTRs.

The dataset used in this pipeline is generated by using a random number generator. Therefore, the pipeline results are not evaluated. This topic describes only how to create a pipeline and perform offline scheduling in DataWorks.
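Because the dataset is synthetic, it can be reproduced locally. The following is a minimal sketch of how such a random dataset could be generated; the column names and value ranges follow the field table in Step 1, but the generator logic itself is an assumption, not the code used to produce the official sample data:

```python
import random

def generate_ad_rows(n, dt, seed=42):
    """Generate n synthetic ad records for one dt partition.

    The value ranges mirror the field descriptions in Step 1;
    the distributions themselves are arbitrary assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for ad_id in range(n):
        rows.append({
            "id": str(ad_id),
            "age": float(rng.randint(10, 60)),      # viewer age
            "sex": float(rng.randint(0, 1)),        # 1 = male, 0 = female
            "duration": float(rng.randint(1, 10)),  # display time in seconds
            "place": float(rng.randint(0, 4)),      # 0 (top) to 4 (bottom)
            "ctr": float(rng.randint(0, 1)),        # binary click label
            "dt": dt,                               # partition key, YYYYMMDD
        })
    return rows

rows = generate_ad_rows(5, "20160919")
```

The generated rows can then be written to a local file and imported into the MaxCompute partitioned table.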

Step 1: Prepare a dataset

The training dataset used in this pipeline includes the historical data of ads that are delivered on September 19, 2016 and September 20, 2016. The pipeline predicts the CTRs of the ads delivered on September 21, 2016. The dataset is stored in a MaxCompute partitioned table. The following table describes the fields in the dataset.

| Field | Type | Description |
| --- | --- | --- |
| id | STRING | The unique ID of the ad. |
| age | DOUBLE | The age of the person to whom the ad is delivered. |
| sex | DOUBLE | The gender of the person to whom the ad is delivered. Valid values: 1 (male) and 0 (female). |
| duration | DOUBLE | The display duration of the ad. Unit: seconds. |
| place | DOUBLE | The position where the ad is displayed. Valid values: 0 to 4. A higher value indicates a lower position. |
| ctr | DOUBLE | The click label of the ad. If the number of clicks divided by the number of views is greater than 0.03 for the ad, the value of this field is 1. Otherwise, the value of this field is 0. |
| dt | STRING | The date when the ad is delivered. Format: YYYYMMDD. |
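The ctr field is a binary label derived from raw click and view counts. A small sketch of that labeling rule follows; the 0.03 threshold comes from the field description above, while the function name and the handling of zero views are illustrative assumptions:

```python
def label_ctr(clicks: int, views: int, threshold: float = 0.03) -> float:
    """Return 1.0 if clicks/views exceeds the threshold, else 0.0."""
    if views == 0:
        return 0.0  # no views: treat the ad as not clicked (assumption)
    return 1.0 if clicks / views > threshold else 0.0
```

For example, 4 clicks out of 100 views (0.04) is labeled 1, whereas exactly 3 clicks out of 100 views (0.03) is labeled 0 because the rule requires the ratio to be strictly greater than the threshold.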

You can use the MaxCompute client to run the following commands to create a partitioned table named ad and add partitions to it. For more information, see Create tables.

create table if not exists ad (
    id STRING,
    age DOUBLE,
    sex DOUBLE,
    duration DOUBLE,
    place DOUBLE,
    ctr DOUBLE
) partitioned by (dt STRING);

alter table ad add if not exists
    partition (dt='20160919')
    partition (dt='20160920');

The following table shows the sample data in the ad table that is used in the pipeline. You can run Tunnel commands to import data into the partitions. For more information, see Import data to tables.

| id | age | sex | duration | place | ctr | dt |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 49 | 1 | 9 | 0 | 0 | 20160919 |
| 1 | 17 | 1 | 3 | 1 | 1 | 20160919 |
| 2 | 44 | 0 | 4 | 0 | 0 | 20160919 |
| 3 | 14 | 1 | 9 | 1 | 0 | 20160919 |
| 4 | 44 | 1 | 5 | 4 | 0 | 20160919 |
| 5 | 10 | 1 | 9 | 3 | 1 | 20160919 |
| 6 | 42 | 1 | 7 | 3 | 0 | 20160919 |
| 7 | 51 | 1 | 3 | 1 | 1 | 20160919 |
| 8 | 18 | 0 | 3 | 3 | 0 | 20160919 |
| 9 | 39 | 0 | 8 | 4 | 1 | 20160919 |
| 10 | 45 | 1 | 3 | 2 | 0 | 20160919 |
| 11 | 57 | 0 | 8 | 2 | 0 | 20160919 |
| 12 | 14 | 0 | 7 | 2 | 1 | 20160919 |

Step 2: Create a pipeline

  1. Create a custom pipeline and open the pipeline. For more information, see Prepare data.

  2. Build a pipeline.

    1. In the left-side component list, drag the Read Table component in the Data Source/Target folder to the canvas twice and rename the two Read Table components as ad-1 and ad-2.

    2. In the left-side component list, drag the Normalization component in the Data Preprocessing folder to the canvas twice.

    3. In the left-side component list, choose Machine Learning > Binary Classification and drag the Logistic Regression for Binary Classification component to the canvas.

    4. In the left-side component list, drag the Prediction component in the Machine Learning folder to the canvas.

    5. In the left-side component list, drag the Write Table component in the Data Source/Target folder to the canvas. Rename the Write Table component as ad_result-1.

    6. Connect the preceding components as shown in the following figure.

      [Figure: offline model pipeline]

      The pipeline consists of four sections:

      • The Read Table components import data from the source dataset.

      • The Normalization components preprocess the source data.

      • The Logistic Regression for Binary Classification component trains a model.

      • The Prediction and Write Table components use the model to make predictions.

  3. Configure the components.

    1. Click the ad-2 and ad-1 components on the canvas and configure the parameters in the right-side panel. The ad-2 component serves as the training data source, and the ad-1 component serves as the prediction data source.

      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Select Table | Table Name | The name of the MaxCompute table that you want to import. Enter ad. |
      | Select Table | Partition | Specifies whether the MaxCompute table is a partitioned table. If the MaxCompute table is a partitioned table, the system automatically selects Partition. |
      | Select Table | Parameter | The partition data that you want to import. Set this parameter to a value in the dt=@@{yyyyMMdd} format. This ensures that daily incremental data is imported as prediction data. |
      | Fields Information | Source Table Columns | The columns in the MaxCompute table that you want to import. After you configure the parameters on the Select Table tab, the system automatically displays the columns of the MaxCompute table. |
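At run time, the scheduler replaces the @@{yyyyMMdd} placeholder with the date of the scheduled run. A sketch of the equivalent substitution in Python follows; the exact substitution semantics in DataWorks, such as the time zone and the offset of the business date, are not covered here:

```python
from datetime import date

def resolve_partition(template: str, run_date: date) -> str:
    """Replace the @@{yyyyMMdd} placeholder with the run date."""
    return template.replace("@@{yyyyMMdd}", run_date.strftime("%Y%m%d"))

resolve_partition("dt=@@{yyyyMMdd}", date(2016, 9, 21))  # → "dt=20160921"
```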

    2. Click the Normalization-1 component on the canvas. On the Fields Setting tab in the right-side panel, click Select Field and select a field of the DOUBLE or INT type. Perform the same operations for the Normalization-2 component.
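The Normalization component rescales numeric columns so that features with different ranges, such as age and place, contribute comparably to training. The following is a minimal min-max sketch of what normalization does to one DOUBLE column; the component's actual method and options are not shown here, and the handling of constant columns is an assumption:

```python
def min_max_normalize(values):
    """Scale values linearly into [0, 1]; constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid division by zero (assumption)
    return [(v - lo) / (hi - lo) for v in values]

# Example: normalize the age values of the first four sample rows.
min_max_normalize([49.0, 17.0, 44.0, 14.0])
```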

    3. Click the Logistic Regression for Binary Classification component on the canvas and configure the parameters in the right-side panel. Configure only the parameters described in the following table and use the default values of other parameters.

      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Fields Setting | Training Feature Columns | The columns that you want to use for training. Select age, sex, duration, and place. |
      | Fields Setting | Target Columns | The column that stores the CTR data. Select ctr. |
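Logistic regression for binary classification fits weights so that a sigmoid of the weighted feature sum approximates the probability that ctr is 1. The following is a tiny pure-Python sketch of the underlying model; PAI's actual solver, regularization, and hyperparameters differ, and this sketch is for intuition only:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.1, epochs=200):
    """Batch gradient descent on log loss; returns (weights, bias)."""
    n_feat = len(features[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of log loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        m = len(features)
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

def predict_score(w, b, x):
    """Predicted click probability for one ad's feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

In the pipeline, the feature vector would hold the normalized age, sex, duration, and place values, and the label would be ctr.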

    4. Click the Prediction component on the canvas and configure the parameters in the right-side panel. Configure only the parameters described in the following table and use the default values of other parameters.

      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Fields Setting | Feature Columns | The columns that you want to use for prediction. Select age, sex, duration, and place. |
      | Fields Setting | Reserved Columns | The column that stores the CTR data. Select ctr. |

    5. Click the ad_result-1 component on the canvas. On the Select Table tab, set the New Table Name field to ad_result.

  4. Click the Run icon in the top toolbar of the canvas to run the pipeline.

  5. After you run the pipeline, right-click ad_result-1 on the canvas and choose View Data > View Output to view the table that is generated based on the prediction results.

    In the table:

    • prediction_result: indicates whether the ad is clicked. Valid values: 1 and 0. 1 indicates that the ad is clicked, and 0 indicates that the ad is not clicked.

    • prediction_score: indicates the probability that the ad is clicked.
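A prediction_result of 1 corresponds to a prediction_score above the classification threshold. A sketch of that relationship follows; the 0.5 threshold is a conventional default for binary classification and is an assumption here, not a value stated by this pipeline:

```python
def to_result(score: float, threshold: float = 0.5) -> int:
    """Map a predicted click probability to a 0/1 prediction_result."""
    return 1 if score >= threshold else 0
```

For example, a score of 0.73 yields a result of 1, and a score of 0.12 yields 0.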

Step 3: Perform offline scheduling in DataWorks

  1. Create a PAI node in DataWorks. For more information, see Use DataWorks tasks to schedule pipelines in Machine Learning Designer.

    Configure DataWorks to schedule the PAI node at 00:00 every day. For more information, see Configure time properties.

  2. Commit the node and go to Operation Center to view the logs of the node. For more information, see View and manage auto triggered nodes.
