The Feature Encoding component uses gradient boosting decision tree (GBDT) algorithms to encode nonlinear features into linear features.
Feature Encoding uses decision trees and ensemble methods to derive new features. Each new feature is a one-hot encoding of a leaf node in a decision tree, and each leaf node corresponds to a combination of conditions on one or more of the original features.
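The idea can be illustrated with a minimal sketch in plain Python. The two toy trees, their split thresholds, and the index layout below are assumptions made for illustration only; they are not the model or the numbering that the component itself produces. Each tree maps a sample to one leaf, and the leaf reached in every tree becomes a `1` at a distinct position in a sparse vector.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]


def tree_a(s: Sample) -> int:
    """Hypothetical tree with 3 leaves: split on pdays, then on age."""
    if s["pdays"] < 500:
        return 0
    return 1 if s["age"] < 40 else 2


def tree_b(s: Sample) -> int:
    """Hypothetical tree with 2 leaves: split on euribor3m."""
    return 0 if s["euribor3m"] < 2.0 else 1


# Invented ensemble: each entry pairs a tree with its number of leaves.
TREES: List[Callable[[Sample], int]] = [tree_a, tree_b]
LEAF_COUNTS: List[int] = [3, 2]


def encode(sample: Sample) -> str:
    """One-hot encode the leaf reached in each tree as a sparse kv string."""
    parts = []
    offset = 0
    for tree, n_leaves in zip(TREES, LEAF_COUNTS):
        leaf = tree(sample)
        parts.append(f"{offset + leaf}:1")
        offset += n_leaves  # leaves of the next tree get the next index block
    return ",".join(parts)


sample = {"age": 53, "pdays": 999, "euribor3m": 4.021}
encoded = encode(sample)
print(encoded)  # 2:1,4:1
```

Because a leaf is only reached when every split condition on its path holds, a single `1` in the encoded vector stands for a combination of conditions on the original features, which a linear model can then weight directly.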
Configure the component
- Use the Machine Learning Platform for AI console
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | The feature columns that are selected from the input table for training. |
| Fields Setting | Label Column | Required. The label column. Click the icon. In the Select Column dialog box, enter the keywords of the column that you want to search for. Select the column and click OK. |
| Fields Setting | Append Output Columns | Optional. The original features that are reserved in the output table. |
| Parameters Setting | Number of cores | The number of cores that are used in computing. The value of this parameter must be a positive integer. |
| Parameters Setting | Memory size per core | The memory size of each core. The value of this parameter must be a positive integer. |
- Use commands
PAI -name fe_encode_runner
    -project algo_public
    -DinputTable="pai_temp_2159_19087_1"
    -DencodeModel="xlab_m_GBDT_LR_1_19064"
    -DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign"
    -DlabelCol="y"
    -DoutputTable="pai_temp_2159_19061_1"
    -DcoreNum=10
    -DmemSizePerCore=1024;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | N/A |
| inputTablePartitions | No | The partitions that are selected from the input table for training. Set this parameter in the partition_name=value format. To specify multi-level partitions, set this parameter in the name1=value1/name2=value2 format. If you specify multiple partitions, separate them with commas (,). | All partitions of the input table |
| encodeModel | Yes | The GBDT binary classification model that is imported and used for encoding. | N/A |
| outputTable | Yes | The output table that stores the encoding results. | N/A |
| selectedCols | Yes | The features that are encoded by using GBDT algorithms. These features are the training features of GBDT components. | N/A |
| labelCol | Yes | The label column. | N/A |
| lifecycle | No | The lifecycle of the output table. | 7 |
| coreNum | No | The number of cores. The value of this parameter must be of the BIGINT type. | -1 (If you retain the default value, the system determines the number of cores based on the input data volume.) |
| memSizePerCore | No | The memory size of each core. | -1 (If you retain the default value, the system determines the memory size based on the input data volume.) |
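As a rough sketch of how these parameters fit together, the following Python helper assembles the command text from a dictionary and formats a partition specification. The helper names `build_command` and `partition_spec` and the partition names `ds` and `hh` are hypothetical. The code only builds a string; it does not call any PAI or MaxCompute API.

```python
from typing import Dict, List, Union

# Parameters marked "Yes" in the Required column above.
REQUIRED = ("inputTable", "encodeModel", "outputTable", "selectedCols", "labelCol")


def partition_spec(partitions: List[Dict[str, str]]) -> str:
    """Join partition levels with '/' and multiple partitions with ','."""
    return ",".join(
        "/".join(f"{name}={value}" for name, value in p.items())
        for p in partitions
    )


def build_command(params: Dict[str, Union[str, int]]) -> str:
    """Render a fe_encode_runner command; raise if a required key is missing."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    args = []
    for key, value in params.items():
        # Quote string values; leave integers such as coreNum unquoted.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        args.append(f"-D{key}={rendered}")
    # The terminating semicolon goes after the last argument.
    return "PAI -name fe_encode_runner -project algo_public " + " ".join(args) + ";"


cmd = build_command({
    "inputTable": "pai_temp_2159_19087_1",
    "encodeModel": "xlab_m_GBDT_LR_1_19064",
    "selectedCols": "pdays,previous",
    "labelCol": "y",
    "outputTable": "pai_temp_2159_19061_1",
    "coreNum": 10,
    "memSizePerCore": 1024,
})
spec = partition_spec([{"ds": "20240101", "hh": "00"}, {"ds": "20240102", "hh": "00"}])
print(cmd)
print(spec)  # ds=20240101/hh=00,ds=20240102/hh=00
```

Checking for required parameters before rendering catches an incomplete configuration early, and keeping the semicolon at the very end ensures that the optional arguments are part of the same statement.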
Example
- Execute the following SQL statements to generate training data:
CREATE TABLE IF NOT EXISTS tdl_pai_bank_test1
(
    age            BIGINT COMMENT '',
    campaign       BIGINT COMMENT '',
    pdays          BIGINT COMMENT '',
    previous       BIGINT COMMENT '',
    emp_var_rate   DOUBLE COMMENT '',
    cons_price_idx DOUBLE COMMENT '',
    cons_conf_idx  DOUBLE COMMENT '',
    euribor3m      DOUBLE COMMENT '',
    nr_employed    DOUBLE COMMENT '',
    y              BIGINT COMMENT ''
)
LIFECYCLE 7;

INSERT OVERWRITE TABLE tdl_pai_bank_test1
SELECT * FROM
(
    SELECT 53 AS age, 1 AS campaign, 999 AS pdays, 0 AS previous, -0.1 AS emp_var_rate,
           93.2 AS cons_price_idx, -42.0 AS cons_conf_idx, 4.021 AS euribor3m, 5195.8 AS nr_employed, 0 AS y FROM dual
    UNION ALL
    SELECT 28 AS age, 3 AS campaign, 6 AS pdays, 2 AS previous, -1.7 AS emp_var_rate,
           94.055 AS cons_price_idx, -39.8 AS cons_conf_idx, 0.729 AS euribor3m, 4991.6 AS nr_employed, 1 AS y FROM dual
    UNION ALL
    SELECT 39 AS age, 2 AS campaign, 999 AS pdays, 0 AS previous, -1.8 AS emp_var_rate,
           93.075 AS cons_price_idx, -47.1 AS cons_conf_idx, 1.405 AS euribor3m, 5099.8 AS nr_employed, 0 AS y FROM dual
    UNION ALL
    SELECT 55 AS age, 1 AS campaign, 3 AS pdays, 1 AS previous, -2.9 AS emp_var_rate,
           92.201 AS cons_price_idx, -31.4 AS cons_conf_idx, 0.869 AS euribor3m, 5076.2 AS nr_employed, 1 AS y FROM dual
    UNION ALL
    SELECT 30 AS age, 8 AS campaign, 999 AS pdays, 0 AS previous, 1.4 AS emp_var_rate,
           93.918 AS cons_price_idx, -42.7 AS cons_conf_idx, 4.961 AS euribor3m, 5228.2 AS nr_employed, 0 AS y FROM dual
    UNION ALL
    SELECT 37 AS age, 1 AS campaign, 999 AS pdays, 0 AS previous, -1.8 AS emp_var_rate,
           92.893 AS cons_price_idx, -46.2 AS cons_conf_idx, 1.327 AS euribor3m, 5099.1 AS nr_employed, 0 AS y FROM dual
    UNION ALL
    SELECT 39 AS age, 1 AS campaign, 999 AS pdays, 0 AS previous, -1.8 AS emp_var_rate,
           92.893 AS cons_price_idx, -46.2 AS cons_conf_idx, 1.313 AS euribor3m, 5099.1 AS nr_employed, 0 AS y FROM dual
    UNION ALL
    SELECT 36 AS age, 1 AS campaign, 3 AS pdays, 1 AS previous, -2.9 AS emp_var_rate,
           92.963 AS cons_price_idx, -40.8 AS cons_conf_idx, 1.266 AS euribor3m, 5076.2 AS nr_employed, 1 AS y FROM dual
    UNION ALL
    SELECT 27 AS age, 2 AS campaign, 999 AS pdays, 1 AS previous, -1.8 AS emp_var_rate,
           93.075 AS cons_price_idx, -47.1 AS cons_conf_idx, 1.41 AS euribor3m, 5099.1 AS nr_employed, 0 AS y FROM dual
) a;
- Create the experiment shown in the following figure. The GBDT Binary Classification component is used in this experiment. For more information, see Generate a model by using an algorithm.
Set the parameters for the GBDT Binary Classification component: set Decision Tree Quantity to 5, Maximum Decision Tree Depth to 3, Label Column to y, and Feature Columns to all the other columns.
- Run the experiment and view the prediction results.
| kv | y |
| --- | --- |
| 2:1,5:1,8:1,12:1,15:1,18:1,28:1,34:1,41:1,50:1,53:1,63:1,72:1 | 0.0 |
| 2:1,5:1,6:1,12:1,15:1,16:1,28:1,34:1,41:1,50:1,51:1,63:1,72:1 | 0.0 |
| 2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1 | 1.0 |
| 2:1,3:1,12:1,13:1,20:1,21:1,22:1,42:1,43:1,46:1,63:1,64:1,67:1,68:1 | 0.0 |
| 0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1 | 1.0 |
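Each kv value lists the indices of the non-zero encoded features in index:value form. To inspect such a row outside the platform, a small parser along the following lines may help. The function names are hypothetical, and the vector width of 73 is an assumption based only on the largest index (72) visible in the sample rows; the real dimension depends on the model.

```python
from typing import Dict, List


def parse_kv(kv: str) -> Dict[int, float]:
    """Parse 'index:value,index:value' into a {index: value} mapping."""
    features: Dict[int, float] = {}
    for item in kv.split(","):
        index, value = item.split(":")
        features[int(index)] = float(value)
    return features


def to_dense(features: Dict[int, float], width: int) -> List[float]:
    """Expand a sparse mapping into a dense list of the given width."""
    dense = [0.0] * width
    for index, value in features.items():
        dense[index] = value
    return dense


# Last row of the sample output above.
row = parse_kv("0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1")
dense = to_dense(row, width=73)  # width is an assumption, see the note above
print(sorted(row))
```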
The generated results can be used as the input of the Logistic Regression for Binary Classification or Logistic Regression for Multiclass Classification Evaluation component. This combination generally provides better performance than Linear Regression or GBDT Regression alone and helps prevent overfitting.
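To see why the encoded output suits a linear model, consider how a logistic regression scores one sparse row: it adds the weights of the active indices to a bias and applies the sigmoid function. The sketch below uses invented weights and an invented bias purely for illustration; in practice a downstream component learns these values from the encoded table.

```python
import math
from typing import Dict


def predict_proba(kv: str, weights: Dict[int, float], bias: float) -> float:
    """Score a sparse kv row with a logistic regression model."""
    z = bias
    for item in kv.split(","):
        index, value = item.split(":")
        # Indices without a learned weight contribute nothing.
        z += weights.get(int(index), 0.0) * float(value)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights keyed by encoded feature index.
weights = {0: 1.2, 10: 0.8, 2: -0.9, 5: -0.4}

p_positive = predict_proba("0:1,10:1,28:1", weights, bias=-0.5)
p_negative = predict_proba("2:1,5:1", weights, bias=-0.5)
print(p_positive, p_negative)
```

Each weight applies to one encoded tree node, so the linear model effectively learns how much every region of the original feature space contributes to the prediction.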