The Feature Encoding component uses the gradient boosting decision tree (GBDT) algorithm to convert nonlinear features to linear features.
Background information
The Feature Encoding component uses decision trees and ensemble methods to discover new features. The features are generated by applying one-hot encoding to the leaf nodes of decision trees. Each leaf node represents one or more features.
The following figure shows an example of feature encoding. The example uses three decision trees that have a total of 12 leaf nodes. Each leaf node is assigned a unique feature code based on the order of the trees. The first tree occupies features 0–3, the second tree occupies features 4–7, and the third tree occupies features 8–11. Linear features are then generated by using the GBDT algorithm and one-hot encoding.
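The encoding idea above can be sketched with scikit-learn (an illustrative stand-in; PAI's internal implementation is not shown here): train a small GBDT, read off the leaf that each sample reaches in every tree, and one-hot encode those leaf indices. The dataset and model parameters below are assumptions chosen only to mirror the figure's three-tree example.

```python
# Sketch of GBDT leaf one-hot encoding, assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Synthetic data with 9 features, like the example input table.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

# Three trees, each limited to a few leaves, mirroring the figure.
gbdt = GradientBoostingClassifier(n_estimators=3, max_leaf_nodes=4, random_state=0)
gbdt.fit(X, y)

# apply() returns, per sample, the node index of the leaf reached in each tree.
leaves = gbdt.apply(X)[:, :, 0]          # shape: (n_samples, n_trees)

# One-hot encode the leaf indices: each leaf becomes one sparse binary feature.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(leaves)  # shape: (n_samples, total_leaf_features)

print(encoded.shape)
```

Each row of `encoded` has exactly one nonzero entry per tree, which is the sparse `index:1` representation shown in the results later in this topic.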
Configure the component
You can use one of the following methods to configure the Feature Encoding component.
Method 1: Use the Platform for AI (PAI) console
To configure the Feature Encoding component in the PAI console, log on to the PAI console, go to the Visualized Modeling (Designer) page, and open a pipeline. On the pipeline page, drag the Feature Encoding component to the canvas and configure its parameters in the right-side pane. The following table describes the parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | Optional. The feature columns that are selected from the input table for training. |
| Fields Setting | Label Column | Required. The label column in the input table. |
| Fields Setting | Appended Output Columns | Optional. The feature columns that you want to preserve in the output table. |
| Parameters Setting | Cores | Optional. The number of computing cores. The value must be a positive integer. |
| Parameters Setting | Memory size per core | Optional. The memory size of each computing core. The value must be a positive integer. |
Method 2: Use PAI commands
To configure the Feature Encoding component by using PAI commands, run the commands in the SQL Script component. For more information, see SQL Script.
PAI -name fe_encode_runner -project algo_public
-DinputTable="tdl_pai_bank_test1"
-DencodeModel="xlab_m_GBDT_LR_1_19064"
-DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign"
-DlabelCol="y"
-DoutputTable="pai_temp_2159_19061_1"
-DcoreNum=10
-DmemSizePerCore=1024;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | N/A |
| inputTablePartitions | No | The partitions that are selected from the input table for training. Specify this parameter in the partition_name=value format. For a multi-level partition, use the name1=value1/name2=value2 format. Separate multiple partitions with commas (,). | All partitions of the input table |
| encodeModel | Yes | The encoding model that is generated by the GBDT Binary Classification component. | N/A |
| outputTable | Yes | The output table that contains the encoded features. | N/A |
| selectedCols | Yes | The feature columns to encode by using the GBDT algorithm. Set this parameter to the same feature columns that you specify for the GBDT Binary Classification component. | N/A |
| labelCol | Yes | The label column. | N/A |
| lifecycle | No | The lifecycle of the output table. | 7 |
| coreNum | No | The number of computing cores. The value must be of the BIGINT type. | -1 (The system automatically sets this parameter based on the input data volume.) |
| memSizePerCore | No | The memory size of each core. | -1 (The system automatically sets this parameter based on the input data volume.) |
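To keep such commands consistent across experiments, you can assemble them programmatically. The helper below is hypothetical (not part of PAI or any SDK); it simply joins a parameter dict into the `PAI -name … -Dkey="value"` form described in the table above.

```python
# Hypothetical helper: build a PAI command string from keyword parameters.
def build_pai_command(name, project, **params):
    parts = [f"PAI -name {name} -project {project}"]
    # Each keyword becomes a -Dkey="value" flag, matching the parameter table.
    parts += [f'-D{key}="{value}"' for key, value in params.items()]
    return " ".join(parts) + ";"

cmd = build_pai_command(
    "fe_encode_runner", "algo_public",
    inputTable="tdl_pai_bank_test1",
    encodeModel="xlab_m_GBDT_LR_1_19064",
    labelCol="y",
    outputTable="pai_temp_2159_19061_1",
    coreNum=10,
)
print(cmd)
```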
Examples
Execute the following SQL statements to generate training data:
CREATE TABLE IF NOT EXISTS tdl_pai_bank_test1 (
    age            BIGINT,
    campaign       BIGINT,
    pdays          BIGINT,
    previous       BIGINT,
    emp_var_rate   DOUBLE,
    cons_price_idx DOUBLE,
    cons_conf_idx  DOUBLE,
    euribor3m      DOUBLE,
    nr_employed    DOUBLE,
    y              BIGINT
) LIFECYCLE 7;

INSERT OVERWRITE TABLE tdl_pai_bank_test1
SELECT * FROM (
    SELECT 53 AS age, 1 AS campaign, 999 AS pdays, 0 AS previous, -0.1 AS emp_var_rate, 93.2 AS cons_price_idx, -42.0 AS cons_conf_idx, 4.021 AS euribor3m, 5195.8 AS nr_employed, 0 AS y
    UNION ALL SELECT 28 AS age, 3 AS campaign, 6 AS pdays, 2 AS previous, -1.7 AS emp_var_rate, 94.055 AS cons_price_idx, -39.8 AS cons_conf_idx, 0.729 AS euribor3m, 4991.6 AS nr_employed, 1 AS y
    UNION ALL SELECT 39 AS age, 2 AS campaign, 999 AS pdays, 0 AS previous, -1.8 AS emp_var_rate, 93.075 AS cons_price_idx, -47.1 AS cons_conf_idx, 1.405 AS euribor3m, 5099.8 AS nr_employed, 0 AS y
    UNION ALL SELECT 55 AS age, 1 AS campaign, 3 AS pdays, 1 AS previous, -2.9 AS emp_var_rate, 92.201 AS cons_price_idx, -31.4 AS cons_conf_idx, 0.869 AS euribor3m, 5076.2 AS nr_employed, 1 AS y
    UNION ALL SELECT 30 AS age, 8 AS campaign, 999 AS pdays, 0 AS previous, 1.4 AS emp_var_rate, 93.918 AS cons_price_idx, -42.7 AS cons_conf_idx, 4.961 AS euribor3m, 5228.2 AS nr_employed, 0 AS y
    UNION ALL SELECT 37 AS age, 1 AS campaign, 999 AS pdays, 0 AS previous, -1.8 AS emp_var_rate, 92.893 AS cons_price_idx, -46.2 AS cons_conf_idx, 1.327 AS euribor3m, 5099.1 AS nr_employed, 0 AS y
    UNION ALL SELECT 39 AS age, 1 AS campaign, 999 AS pdays, 0 AS previous, -1.8 AS emp_var_rate, 92.893 AS cons_price_idx, -46.2 AS cons_conf_idx, 1.313 AS euribor3m, 5099.1 AS nr_employed, 0 AS y
    UNION ALL SELECT 36 AS age, 1 AS campaign, 3 AS pdays, 1 AS previous, -2.9 AS emp_var_rate, 92.963 AS cons_price_idx, -40.8 AS cons_conf_idx, 1.266 AS euribor3m, 5076.2 AS nr_employed, 1 AS y
    UNION ALL SELECT 27 AS age, 2 AS campaign, 999 AS pdays, 1 AS previous, -1.8 AS emp_var_rate, 93.075 AS cons_price_idx, -47.1 AS cons_conf_idx, 1.41 AS euribor3m, 5099.1 AS nr_employed, 0 AS y
) a;
Create a pipeline as shown in the following figure. In most cases, the Feature Encoding component is used together with the GBDT Binary Classification component. For more information, see Algorithm modeling.
Configure the parameters of the GBDT Binary Classification component. Set the number of trees to 5, the maximum number of leaf nodes to 3, the label column to the y column, and the feature columns to the other columns.
Run the pipeline and view the feature encoding results.
| kv | y |
| --- | --- |
| 2:1,5:1,8:1,12:1,15:1,18:1,28:1,34:1,41:1,50:1,53:1,63:1,72:1 | 0.0 |
| 2:1,5:1,6:1,12:1,15:1,16:1,28:1,34:1,41:1,50:1,51:1,63:1,72:1 | 0.0 |
| 2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1 | 1.0 |
| 2:1,3:1,12:1,13:1,20:1,21:1,22:1,42:1,43:1,46:1,63:1,64:1,67:1,68:1 | 0.0 |
| 0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1 | 1.0 |
You can import the generated features into the Logistic Regression for Binary Classification or Logistic Regression for Multiclass Classification component. This combination of GBDT encoding and a linear model typically performs better than the Linear Regression or GBDT Regression component alone and helps prevent overfitting.
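The end-to-end pattern, GBDT leaf encoding followed by logistic regression, can be sketched with scikit-learn (an assumed stand-in for the PAI components; the data and hyperparameters are illustrative, with 5 trees and 3 leaves matching the example pipeline):

```python
# GBDT + LR sketch: encode samples by leaf membership, then fit a linear model.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 5 trees with at most 3 leaves each, as in the example pipeline.
gbdt = GradientBoostingClassifier(n_estimators=5, max_leaf_nodes=3, random_state=0)
gbdt.fit(X_tr, y_tr)

# One-hot encode leaf indices; ignore leaves unseen during fitting.
enc = OneHotEncoder(handle_unknown="ignore")
leaves_tr = gbdt.apply(X_tr)[:, :, 0]
enc.fit(leaves_tr)

# Train logistic regression on the sparse leaf features.
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.transform(leaves_tr), y_tr)

acc = lr.score(enc.transform(gbdt.apply(X_te)[:, :, 0]), y_te)
print(round(acc, 3))
```

The linear model only ever sees binary leaf-membership features, which is what lets it capture the nonlinear structure that the trees discovered.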