The Feature Encoding component converts nonlinear features into linear features based on the gradient boosting decision tree (GBDT) algorithm.

Overview

Feature Encoding is a technique that uses decision trees and ensemble methods to construct new features. The new features are the one-hot encoding of the leaf nodes of the decision trees, where each leaf node corresponds to a combination of one or more original features.

The following figure shows the feature encoding process. The three trees in the figure have 12 leaf nodes in total, which are encoded as features 0 to 11 in sequence: the leaf nodes of the first tree are encoded as features 0 to 3, those of the second tree as features 4 to 7, and those of the third tree as features 8 to 11.

Figure: Feature encoding process
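The leaf-to-feature mapping described above can be sketched in plain Python. Assuming, as in the figure, three trees with four leaves each (the tree shapes, leaf counts, and helper names below are illustrative):

```python
# Map per-tree leaf indices to global one-hot feature ids.
# Leaf counts per tree match the figure: 3 trees x 4 leaves = 12 features.
LEAVES_PER_TREE = [4, 4, 4]

def encode(leaf_indices):
    """leaf_indices[i] is the 0-based leaf that a sample falls into in tree i.
    Returns the global ids of the features that are set to 1 for this sample."""
    features = []
    offset = 0
    for tree, leaf in enumerate(leaf_indices):
        features.append(offset + leaf)
        offset += LEAVES_PER_TREE[tree]
    return features

# A sample landing in leaf 2 of tree 1, leaf 1 of tree 2, and leaf 3 of tree 3
print(encode([2, 1, 3]))  # → [2, 5, 11]
```

Each sample activates exactly one feature per tree, which is why the encoded output is sparse.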

Configure the component

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    | Tab | Parameter | Description |
    |---|---|---|
    | Fields Setting | Feature Columns | The feature columns that are selected from the input table for training. |
    | Fields Setting | Label Column | Required. The label column. Click the Directory icon. In the Select Column dialog box, enter the keywords of the column that you want to search for, select the column, and click OK. |
    | Fields Setting | Append Output Columns | Optional. The original feature columns that are retained in the output table. |
    | Parameters Setting | Number of cores | The number of cores that are used in computing. The value must be a positive integer. |
    | Parameters Setting | Memory size per core | The memory size of each core, in MB. The value must be a positive integer. |
  • Use commands
    PAI -name fe_encode_runner -project algo_public
        -DinputTable="pai_temp_2159_19087_1"
        -DencodeModel="xlab_m_GBDT_LR_1_19064"
        -DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign"
        -DlabelCol="y"
        -DoutputTable="pai_temp_2159_19061_1"
        -DcoreNum=10
        -DmemSizePerCore=1024;
    | Parameter | Required | Description | Default value |
    |---|---|---|---|
    | inputTable | Yes | The name of the input table. | N/A |
    | inputTablePartitions | No | The partitions that are selected from the input table for training, in the partition_name=value format. To specify multi-level partitions, use the name1=value1/name2=value2 format. If you specify multiple partitions, separate them with commas (,). | All partitions of the input table |
    | encodeModel | Yes | The trained GBDT binary classification model that is imported. | N/A |
    | outputTable | Yes | The name of the output table that stores the encoding results. | N/A |
    | selectedCols | Yes | The feature columns that are encoded by using the GBDT algorithm. These must be the training features of the GBDT component. | N/A |
    | labelCol | Yes | The label column. | N/A |
    | lifecycle | No | The lifecycle of the output table. | 7 |
    | coreNum | No | The number of cores. The value must be of the BIGINT type. | -1 (If you retain the default value, the system determines the number of cores based on the input data volume.) |
    | memSizePerCore | No | The memory size of each core, in MB. | -1 (If you retain the default value, the system determines the memory size based on the input data volume.) |

Example

  1. Execute the following SQL statements to generate training data:
    CREATE TABLE IF NOT EXISTS tdl_pai_bank_test1
    (
        age            BIGINT COMMENT '',
        campaign       BIGINT COMMENT '',
        pdays          BIGINT COMMENT '',
        previous       BIGINT COMMENT '',
        emp_var_rate   DOUBLE COMMENT '',
        cons_price_idx DOUBLE COMMENT '',
        cons_conf_idx  DOUBLE COMMENT '',
        euribor3m      DOUBLE COMMENT '',
        nr_employed    DOUBLE COMMENT '',
        y              BIGINT COMMENT ''
    )
    LIFECYCLE 7;
    insert overwrite table tdl_pai_bank_test1
    select * from
    (select 53 as age,1 as campaign,999 as pdays,0 as previous,-0.1 as emp_var_rate,
           93.2 as cons_price_idx,-42.0 as cons_conf_idx, 4.021 as euribor3m,5195.8 as nr_employed,0 as y
    from dual
    union all
    select 28 as age,3 as campaign,6 as pdays,2 as previous,-1.7 as emp_var_rate,
           94.055 as cons_price_idx,-39.8 as cons_conf_idx, 0.729 as euribor3m,4991.6 as nr_employed,1 as y
    from dual
    union all
    select 39 as age,2 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx, 1.405 as euribor3m,5099.8 as nr_employed,0 as y
    from dual
    union all
    select 55 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.201 as cons_price_idx,-31.4 as cons_conf_idx, 0.869 as euribor3m,5076.2 as nr_employed,1 as y
    from dual
    union all
    select 30 as age,8 as campaign,999 as pdays,0 as previous,1.4 as emp_var_rate,
           93.918 as cons_price_idx,-42.7 as cons_conf_idx, 4.961 as euribor3m,5228.2 as nr_employed,0 as y
    from dual
    union all
    select 37 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx, 1.327 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    union all
    select 39 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx, 1.313 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    union all
    select 36 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.963 as cons_price_idx,-40.8 as cons_conf_idx, 1.266 as euribor3m,5076.2 as nr_employed,1 as y
    from dual
    union all
    select 27 as age,2 as campaign,999 as pdays,1 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx, 1.41 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    ) a;
  2. Create the experiment shown in the following figure. The GBDT Binary Classification component is used in this experiment. For more information, see Generate a model by using an algorithm.
    Set the parameters for the GBDT Binary Classification component: set the Decision Tree Quantity parameter to 5, the Maximum Decision Tree Depth parameter to 3, the Label Column parameter to y, and the Feature Columns parameter to the remaining fields. Figure: Algorithm modeling
  3. Run the experiment and view the prediction results.
    | kv | y |
    |---|---|
    | 2:1,5:1,8:1,12:1,15:1,18:1,28:1,34:1,41:1,50:1,53:1,63:1,72:1 | 0.0 |
    | 2:1,5:1,6:1,12:1,15:1,16:1,28:1,34:1,41:1,50:1,51:1,63:1,72:1 | 0.0 |
    | 2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1 | 1.0 |
    | 2:1,3:1,12:1,13:1,20:1,21:1,22:1,42:1,43:1,46:1,63:1,64:1,67:1,68:1 | 0.0 |
    | 0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1 | 1.0 |
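
The kv column uses a sparse feature_id:value format. A minimal sketch of parsing one such row back into a dense vector (the vector width of 75 and the helper names are assumptions for illustration):

```python
def parse_kv(kv):
    """Parse a sparse kv string such as '2:1,5:1' into {feature_id: value}."""
    pairs = (item.split(":") for item in kv.split(","))
    return {int(k): float(v) for k, v in pairs}

def to_dense(kv, n_features=75):  # width is an assumption; ids here reach 72
    vec = [0.0] * n_features
    for k, v in parse_kv(kv).items():
        vec[k] = v
    return vec

row = "2:1,5:1,8:1,12:1,15:1,18:1,28:1,34:1,41:1,50:1,53:1,63:1,72:1"
dense = to_dense(row)
print(sum(dense))  # → 13.0 (13 active features)
```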

    The generated results can be used as input to the Logistic Regression for Binary Classification or Logistic Regression for Multiclass Classification component. This combination typically performs better than Linear Regression or GBDT Regression alone and helps prevent overfitting.
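The GBDT-plus-logistic-regression pipeline described above can be sketched with scikit-learn (an assumption for illustration; the PAI components run this server-side, and the toy data and hyperparameters below are illustrative): encode each sample by the leaf it reaches in every tree, one-hot encode the leaf indices, and fit a logistic regression on the result.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the bank table above.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Mirror the experiment settings: 5 trees, maximum depth 3.
gbdt = GradientBoostingClassifier(n_estimators=5, max_depth=3, random_state=0)
gbdt.fit(X, y)

# apply() returns the leaf index each sample reaches in each tree.
leaves = gbdt.apply(X)[:, :, 0]                # shape (n_samples, n_trees)
X_enc = OneHotEncoder().fit_transform(leaves)  # sparse one-hot leaf features

lr = LogisticRegression(max_iter=1000).fit(X_enc, y)
acc = lr.score(X_enc, y)
```

The one-hot leaf features make the nonlinear splits learned by the GBDT available to the linear model, which is the same idea the Feature Encoding component implements.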