Platform for AI: Feature Encoding

Last Updated: Apr 07, 2024

The Feature Encoding component uses the gradient boosting decision tree (GBDT) algorithm to convert nonlinear features to linear features.

Background information

The Feature Encoding component uses decision trees and ensemble methods to discover new features. The features are generated by applying one-hot encoding to the leaf nodes of decision trees. Each leaf node represents one or more features.

The following figure shows an example of feature encoding. The example uses three decision trees that have a total of 12 leaf nodes. Each leaf node is assigned a unique feature code based on the order of the trees. The first tree occupies features 0–3, the second tree occupies features 4–7, and the third tree occupies features 8–11. Linear features are then generated by using the GBDT algorithm and one-hot encoding.

[Figure: feature encoding example with three decision trees and 12 leaf nodes]
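The numbering scheme in this example can be sketched in Python. This is only an illustration, not the component's implementation: the function names `leaf_offsets` and `encode_leaves` are hypothetical, and the sketch assumes that each sample activates exactly one leaf per tree and that every tree has four leaf nodes, as the feature ranges 0–3, 4–7, and 8–11 imply.

```python
from typing import List


def leaf_offsets(leaves_per_tree: List[int]) -> List[int]:
    """Return the first feature code assigned to each tree, in tree order."""
    offsets, total = [], 0
    for n in leaves_per_tree:
        offsets.append(total)
        total += n
    return offsets


def encode_leaves(leaf_ids: List[int], leaves_per_tree: List[int]) -> List[int]:
    """One-hot encode the leaf that a sample reaches in each tree.

    leaf_ids[i] is the 0-based local index of the leaf reached in tree i.
    Assumption: exactly one leaf per tree is active for a sample.
    """
    if len(leaf_ids) != len(leaves_per_tree):
        raise ValueError("expected one leaf index per tree")
    offsets = leaf_offsets(leaves_per_tree)
    vector = [0] * sum(leaves_per_tree)
    for tree, (leaf, n) in enumerate(zip(leaf_ids, leaves_per_tree)):
        if not 0 <= leaf < n:
            raise ValueError(f"tree {tree} has no leaf {leaf}")
        # Global feature code = offset of the tree + local leaf index.
        vector[offsets[tree] + leaf] = 1
    return vector


# Three trees with four leaves each, 12 leaf nodes in total.
leaves_per_tree = [4, 4, 4]
print(leaf_offsets(leaves_per_tree))  # [0, 4, 8]

# A sample that reaches local leaf 1, 2, and 0 in trees 1, 2, and 3
# activates feature codes 1, 6, and 8.
print(encode_leaves([1, 2, 0], leaves_per_tree))
# [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
```

The resulting vector is sparse and binary, which is why a linear model can consume it directly.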

Configure the component

You can use one of the following methods to configure the Feature Encoding component.

Method 1: Use the Platform for AI (PAI) console

To configure the Feature Encoding component in the PAI console, perform the following steps:

  1. Log on to the PAI console, go to the Visualized Modeling (Designer) page, and then open a pipeline.
  2. On the pipeline page, drag the Feature Encoding component to the canvas and configure the parameters in the right-side pane.

The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | Optional. The feature columns that are selected from the input table for training. |
| Fields Setting | Label Column | Required. The label column in the input table. |
| Fields Setting | Appended Output Columns | Optional. The feature columns that you want to preserve in the output table. |
| Parameters Setting | Cores | Optional. The number of computing cores. The value must be a positive integer. |
| Parameters Setting | Memory size per core | Optional. The memory size of each computing core. The value must be a positive integer. |

Method 2: Use PAI commands

To configure the Feature Encoding component by using PAI commands, run the commands in the SQL Script component. For more information, see SQL Script.

PAI -name fe_encode_runner -project algo_public
    -DinputTable="tdl_pai_bank_test1"
    -DencodeModel="xlab_m_GBDT_LR_1_19064"
    -DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign"
    -DlabelCol="y"
    -DoutputTable="pai_temp_2159_19061_1"
    -DcoreNum=10
    -DmemSizePerCore=1024;

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | N/A |
| inputTablePartitions | No | The partitions that are selected from the input table for training. Specify a value for this parameter in the partition_name=value format. To specify a multi-level partition, use the name1=value1/name2=value2 format. Separate multiple partitions with commas (,). | All partitions of the input table |
| encodeModel | Yes | The encoded model that is output by the GBDT Binary Classification component. | N/A |
| outputTable | Yes | The name of the output table that stores the encoded features. | N/A |
| selectedCols | Yes | The features that are encoded by using the GBDT algorithm. Set this parameter to the feature columns that you specify for the GBDT Binary Classification component. | N/A |
| labelCol | Yes | The label column. | N/A |
| lifecycle | No | The lifecycle of the output table. | 7 |
| coreNum | No | The number of computing cores. The value of this parameter must be of the BIGINT type. | -1 (The system configures this parameter based on the input data volume) |
| memSizePerCore | No | The memory size of each core. | -1 (The system configures this parameter based on the input data volume) |
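When the command is generated from a script, the parameters can be assembled programmatically. The following Python sketch is a hypothetical helper, not part of PAI: the function names `partition_spec` and `build_fe_encode_command` and the partition names `ds` and `hh` are assumptions made for illustration. It uses only the parameter names documented here, quotes string values, and ends the statement with a single semicolon after the last parameter.

```python
from typing import Dict, Optional, Sequence, Tuple


def partition_spec(partitions: Sequence[Sequence[Tuple[str, str]]]) -> str:
    """Render partitions as name1=value1/name2=value2, separated by commas."""
    return ",".join(
        "/".join(f"{name}={value}" for name, value in levels)
        for levels in partitions
    )


def build_fe_encode_command(
    input_table: str,
    encode_model: str,
    output_table: str,
    selected_cols: Sequence[str],
    label_col: str,
    optional: Optional[Dict[str, object]] = None,
) -> str:
    """Assemble a fe_encode_runner command string (hypothetical helper)."""
    params: Dict[str, object] = {
        "inputTable": input_table,
        "encodeModel": encode_model,
        "selectedCols": ",".join(selected_cols),
        "labelCol": label_col,
        "outputTable": output_table,
    }
    # Optional parameters documented for this component.
    allowed = {"inputTablePartitions", "lifecycle", "coreNum", "memSizePerCore"}
    for key, value in (optional or {}).items():
        if key not in allowed:
            raise ValueError(f"unknown optional parameter: {key}")
        params[key] = value
    lines = ["PAI -name fe_encode_runner -project algo_public"]
    for key, value in params.items():
        # Quote string values; leave numeric values bare.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"    -D{key}={rendered}")
    # The statement terminator goes after the last parameter only.
    return "\n".join(lines) + ";"


# Two two-level partitions; the names ds and hh are made up for this example.
spec = partition_spec([
    [("ds", "20240101"), ("hh", "00")],
    [("ds", "20240102"), ("hh", "00")],
])
print(spec)  # ds=20240101/hh=00,ds=20240102/hh=00

command = build_fe_encode_command(
    input_table="tdl_pai_bank_test1",
    encode_model="xlab_m_GBDT_LR_1_19064",
    output_table="pai_temp_2159_19061_1",
    selected_cols=["pdays", "previous", "age", "campaign"],
    label_col="y",
    optional={"coreNum": 10, "memSizePerCore": 1024},
)
print(command)
```

The generated string can then be pasted into the SQL Script component. Rejecting unknown keys keeps a typo in a parameter name from being passed through silently.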

Examples

  1. Execute the following SQL statements to generate training data:

    CREATE TABLE IF NOT EXISTS tdl_pai_bank_test1
    (
        age            BIGINT COMMENT '',
        campaign       BIGINT COMMENT '',
        pdays          BIGINT COMMENT '',
        previous       BIGINT COMMENT '',
        emp_var_rate   DOUBLE COMMENT '',
        cons_price_idx DOUBLE COMMENT '',
        cons_conf_idx  DOUBLE COMMENT '',
        euribor3m      DOUBLE COMMENT '',
        nr_employed    DOUBLE COMMENT '',
        y              BIGINT COMMENT ''
    )
    LIFECYCLE 7;
    insert overwrite table tdl_pai_bank_test1
    select * from
    (select 53 as age,1 as campaign,999 as pdays,0 as previous,-0.1 as emp_var_rate,
           93.2 as cons_price_idx,-42.0 as cons_conf_idx, 4.021 as euribor3m,5195.8 as nr_employed,0 as y
    union all
    select 28 as age,3 as campaign,6 as pdays,2 as previous,-1.7 as emp_var_rate,
           94.055 as cons_price_idx,-39.8 as cons_conf_idx, 0.729 as euribor3m,4991.6 as nr_employed,1 as y
    union all
    select 39 as age,2 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx, 1.405 as euribor3m,5099.8 as nr_employed,0 as y
    union all
    select 55 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.201 as cons_price_idx,-31.4 as cons_conf_idx, 0.869 as euribor3m,5076.2 as nr_employed,1 as y
    union all
    select 30 as age,8 as campaign,999 as pdays,0 as previous,1.4 as emp_var_rate,
           93.918 as cons_price_idx,-42.7 as cons_conf_idx, 4.961 as euribor3m,5228.2 as nr_employed,0 as y
    union all
    select 37 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx, 1.327 as euribor3m,5099.1 as nr_employed,0 as y
    union all
    select 39 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx, 1.313 as euribor3m,5099.1 as nr_employed,0 as y
    union all
    select 36 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.963 as cons_price_idx,-40.8 as cons_conf_idx, 1.266 as euribor3m,5076.2 as nr_employed,1 as y
    union all
    select 27 as age,2 as campaign,999 as pdays,1 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx, 1.41 as euribor3m,5099.1 as nr_employed,0 as y
    ) a;
  2. Create a pipeline as shown in the following figure. In most cases, the Feature Encoding component is used together with the GBDT Binary Classification component. For more information, see Algorithm modeling.

    Configure the parameters of the GBDT Binary Classification component: set the number of trees to 5, the maximum number of leaf nodes to 3, the label column to y, and the feature columns to all other columns.

    [Figure: modeling pipeline]

  3. Run the pipeline and view the prediction results.

    | kv | y |
    | --- | --- |
    | 2:1,5:1,8:1,12:1,15:1,18:1,28:1,34:1,41:1,50:1,53:1,63:1,72:1 | 0.0 |
    | 2:1,5:1,6:1,12:1,15:1,16:1,28:1,34:1,41:1,50:1,51:1,63:1,72:1 | 0.0 |
    | 2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1 | 1.0 |
    | 2:1,3:1,12:1,13:1,20:1,21:1,22:1,42:1,43:1,46:1,63:1,64:1,67:1,68:1 | 0.0 |
    | 0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1 | 1.0 |

    You can use the generated results as the input of the Logistic Regression for Binary Classification or Logistic Regression for Multiclass Classification component. Training either component on the encoded features provides better performance than the Linear Regression or GBDT Regression component and helps prevent overfitting.
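If you process the output table outside PAI, each value in the kv column can be read as a sparse vector of index:value pairs. The following Python sketch shows one way to parse it. The function names `parse_kv` and `to_dense` are hypothetical, and the vector size of 73 is an assumption derived only from the largest index (72) in the sample output. The actual feature dimension depends on the trained model.

```python
from typing import Dict, List


def parse_kv(kv: str) -> Dict[int, float]:
    """Parse a string such as "2:1,5:1" into {2: 1.0, 5: 1.0}."""
    features: Dict[int, float] = {}
    for item in kv.split(","):
        item = item.strip()
        if not item:
            continue
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return features


def to_dense(features: Dict[int, float], size: int) -> List[float]:
    """Expand a sparse feature map into a dense list of the given size."""
    dense = [0.0] * size
    for index, value in features.items():
        if not 0 <= index < size:
            raise ValueError(f"index {index} is outside a vector of size {size}")
        dense[index] = value
    return dense


# Last row of the sample output.
row = "0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1"
sparse = parse_kv(row)
print(sorted(sparse))  # [0, 10, 28, 29, 32, 36, 37, 55, 56, 59]

# Assumed size: largest index in the sample output (72) plus one.
dense = to_dense(sparse, size=73)
print(sum(dense))  # 10.0
```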