
Feature engineering

Last Updated: Aug 15, 2018

PCA

  • Principal component analysis (PCA) is used to reduce dimensionality. For more information about the algorithm, see the Wikipedia entry on principal component analysis.

  • The algorithm supports the dense data format.

PAI command

    PAI -name PrinCompAnalysis
    -project algo_public
    -DinputTableName=bank_data
    -DeigOutputTableName=pai_temp_2032_17900_2
    -DprincompOutputTableName=pai_temp_2032_17900_1
    -DselectedColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
    -DtransType=Simple
    -DcalcuType=CORR
    -DcontriRate=0.9;

Algorithm parameter description

Parameter Description Option Default value
inputTableName (Required) Name of the PCA input table NA NA
eigOutputTableName (Required) Name of the output table of the feature vectors and feature values NA NA
princompOutputTableName (Required) Name of the output table after dimensionality reduction of the principal component NA NA
selectedColNames (Required) Feature columns involved in PCA operation NA NA
transType (Optional) Mode of transforming the original table to the principal component table. Simple, Sub-Mean, Normalization Simple
calcuType (Optional) Mode of feature breakdown for the original table. CORR, COVAR_SAMP, COVAR_POP CORR
contriRate (Optional) Information retaining ratio after dimensionality reduction (0,1) 0.9
remainColumns (Optional) Fields retained from the original table after dimensionality reduction NA NA
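
The contriRate parameter determines how many principal components are kept. As a sketch of the usual cumulative-contribution rule (assuming the component follows the standard criterion, where the eigenvalues λ_i are sorted in descending order and d is the original dimensionality):

    k^* = \min\left\{ k : \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ \ge\ \text{contriRate} \right\}

With contriRate=0.9, the smallest number of components that together retain at least 90% of the total variance is kept.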

PCA output example

  • Data table after dimensionality reduction.


  • Feature value and feature vector table.


Feature scaling

Supports common scale functions such as log2, log10, ln, abs, and sqrt. Supports dense and sparse data formats.

PAI command

    PAI -name fe_scale_runner -project algo_public
    -Dlifecycle=28
    -DscaleMethod=log2
    -DscaleCols=nr_employed
    -DinputTable=pai_dense_10_1
    -DoutputTable=pai_temp_2262_20380_1;

Algorithm parameters

Parameter Description Option Default value
inputTable (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
outputTable (Required) Name of the scaling result table NA NA
scaleCols (Required) Select the features that require scaling. Sparse features are automatically filtered. Only numeric features can be selected. NA NA
labelCol (Optional) Label field. If this column is set, the x-y distribution histogram from the feature to the target variable will be visualized. NA NA
categoryCols (Optional) Process the selected fields as enumeration features. Scaling is not supported. NA “”
scaleMethod (Optional) Scaling method. Default value: log2. The following methods are supported: log2, log10, ln, abs, and sqrt. NA log2
scaleTopN Top N scaling features automatically selected when scaleCols is not selected. Default value: 10 NA 10
isSparse (Optional) Indicates whether it is a sparse feature in the key:value format. Default: dense data NA false
itemSpliter (Optional) Delimiter between sparse feature items. Default value: “,” NA “,”
kvSpliter (Optional) Delimiter between keys and values in sparse features. Default value: “:” NA “:”
lifecycle (Optional) Life cycle of the output table. Default value: 7 NA 7

Example

Input data

SQL statement for data generation:

    create table if not exists pai_dense_10_1 as
    select
    nr_employed
    from bank_data limit 10;

Parameter settings

Select nr_employed for feature scaling (only numeric features are supported), and select log2 as the scale function.


Running result

nr_employed
12.352071021075528
12.34313018339218
12.285286613666395
12.316026916036957
12.309533196497519
12.352071021075528
12.316026916036957
12.316026916036957
12.309533196497519
12.316026916036957
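
Each output value is the base-2 logarithm of the corresponding input; for example, log2(5228.1) ≈ 12.3521. A minimal SQL sketch that reproduces the transformation outside the component (assuming MaxCompute's built-in LOG2 function):

    -- verification sketch: apply the same log2 scaling directly in SQL
    select log2(nr_employed) as nr_employed_scaled
    from pai_dense_10_1;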

Feature discretization

  • Supports dense and sparse numeric feature discretization.
  • Supports same-frequency discretization and same-distance discretization (default).
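
As a sketch of the same-distance rule (a common textbook formulation; the component's exact boundary and bin-numbering conventions are not documented here), with B = maxBins:

    \text{bin}(x) = \left\lfloor \frac{x - x_{\min}}{(x_{\max} - x_{\min})/B} \right\rfloor

Same-frequency discretization instead chooses cut points so that each of the B bins receives roughly the same number of records.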

PAI command

    PAI -name fe_discrete_runner -project algo_public
    -DdiscreteMethod=SameFrequency
    -Dlifecycle=28
    -DmaxBins=5
    -DinputTable=pai_dense_10_1
    -DdiscreteCols=nr_employed
    -DoutputTable=pai_temp_2262_20382_1;

Algorithm parameters

Parameter Description Option Default value
inputTable (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
outputTable (Required) Name of the discretization output table NA NA
discreteCols (Required) Selected features that require discretization. Sparse features are automatically filtered. NA “”
labelCol (Optional) Label field. If this column is set, the x-y distribution histogram from the feature to the target variable will be visualized. NA NA
categoryCols (Optional) Process the selected features as enumeration features. Discretization is not supported. NA “”
discreteMethod (Optional) Discretization method. Default value: SameDistance. The options are SameDistance and SameFrequency. NA SameDistance
discreteTopN Top N features that require discretization are automatically selected when discreteCols is not selected. Default value: 10 NA 10
maxBins Discrete interval. Default value: 100 NA 100
isSparse (Optional) Indicates whether it is a sparse feature in the key:value format. Default: dense data NA false
itemSpliter (Optional) Delimiter between sparse feature items. Default value: “,” NA “,”
kvSpliter (Optional) Delimiter between keys and values in sparse features. Default value: “:” NA “:”
lifecycle (Optional) Life cycle of the output table. Default value: 7 NA 7

Modeling example

Input data

SQL statement for data generation:

    create table if not exists pai_dense_10_1 as
    select
    nr_employed
    from bank_data limit 10;

Parameter settings

The input data is pai_dense_10_1. Select nr_employed as the feature to discretize and use the same-distance method to split it into five intervals.


Running result

nr_employed
4.0
3.0
1.0
3.0
2.0
4.0
3.0
3.0
2.0
3.0

Feature exception smoothing

Function: Smooths anomalous values in the input features to a specified interval. Sparse and dense features are supported.

Note: The feature smoothing component only rectifies abnormal values but does not filter or delete any records. Therefore, the dimensions and number of input data records remain unchanged.

Smoothing methods

  • Zscore: If a feature follows a normal distribution with mean μ and standard deviation σ, noise typically falls outside the range [μ-3σ, μ+3σ], so Zscore smoothing maps values outside this range back to its boundaries.

Example: If a feature follows a normal distribution with mean 0 and standard deviation 3, the feature value -10 is determined to be abnormal and is changed to -3×3+0=-9, and the feature value 10 is changed to 3×3+0=9.

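Equivalently, a sketch of the Zscore rule as a clipping operation (assuming the 3σ convention described above):

    \tilde{x} = \min\big(\max(x,\ \mu - 3\sigma),\ \mu + 3\sigma\big)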

  • Percentile: smooths data outside the [minPer, maxPer] percentile range to the minPer and maxPer quantiles.
    Example: The age feature ranges from 0 to 200. Set minPer to 0 and maxPer to 50%. Feature values outside 0-100 are changed to 0 or 100.

  • Threshold: smooths data outside [minThresh, maxThresh] to the minThresh and maxThresh values.
    Example: The age feature ranges from 0 to 200. Set minThresh to 10 and maxThresh to 80. Feature values outside 10-80 are changed to 10 or 80.

PAI command

    PAI -name fe_soften_runner -project algo_public
    -DminThresh=5000 -Dlifecycle=28
    -DsoftenMethod=min-max-thresh
    -DsoftenCols=nr_employed
    -DmaxThresh=6000
    -DinputTable=pai_dense_10_1
    -DoutputTable=pai_temp_2262_20381_1;

Algorithm parameters

Parameter Description Option Default value
inputTable (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
outputTable (Required) Name of the smoothing output table NA NA
labelCol (Optional) Label field. If this column is set, the x-y distribution histogram from the feature to the target variable will be visualized. NA NA
categoryCols (Optional) Process the selected fields as enumeration features NA “”
softenCols (Required) Selected features that require smoothing. Sparse features are automatically filtered. NA NA
softenMethod (Optional) Smoothing method. Default value: zscore. The options are zscore, min-max-thresh, and min-max-per. NA zscore
softenTopN Top N features that require smoothing are automatically selected when softenCols is not selected. Default value: 10 NA 10
cl Confidence level. This parameter is valid when the smoothing method is zscore. NA 10
minPer Minimum percentile. This parameter is valid when the smoothing method is min-max-per. NA 0.0
maxPer Maximum percentile. This parameter is valid when the smoothing method is min-max-per. NA 1.0
minThresh Minimum threshold value. Default value: -9999, which indicates the minimum threshold is not set. This parameter is valid when the smoothing method is min-max-thresh. NA -9999
maxThresh Maximum threshold value. Default value: -9999, which indicates the maximum threshold is not set. This parameter is valid when the smoothing method is min-max-thresh. NA -9999
isSparse (Optional) Indicates whether it is a sparse feature in the key:value format. Default: dense data NA false
itemSpliter (Optional) Delimiter between sparse feature items. Default value: “,” NA “,”
kvSpliter (Optional) Delimiter between keys and values in sparse features. Default value: “:” NA “:”
lifecycle (Optional) Life cycle of the output table. Default value: 7 NA 7

Example

Input data

SQL statement for data generation:

    create table if not exists pai_dense_10_1 as
    select
    nr_employed
    from bank_data limit 10;

nr_employed
5228.1
5195.8
4991.6
5099.1
5076.2
5228.1
5099.1
5099.1
5076.2
5099.1

Parameter settings

Select nr_employed as the smoothing feature column, select min-max-thresh as the smoothing method, and set the lower limit (minThresh) to 5000 and the upper limit (maxThresh) to 6000.


Running result

nr_employed
5228.1
5195.8
5000.0
5099.1
5076.2
5228.1
5099.1
5099.1
5076.2
5099.1
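
The only value outside [5000, 6000], 4991.6, is raised to the lower threshold 5000.0; all other values are unchanged. A minimal SQL sketch of the same clipping (assuming MaxCompute's GREATEST and LEAST functions):

    -- verification sketch: clip nr_employed to [5000, 6000]
    select greatest(least(nr_employed, 6000), 5000) as nr_employed_clipped
    from pai_dense_10_1;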

Random forest feature importance

This component uses the original data and a random forest model to calculate feature importance.

PAI command

    pai -name feature_importance -project algo_public
    -DinputTableName=pai_dense_10_10
    -DmodelName=xlab_m_random_forests_1_20318_v0
    -DoutputTableName=erkang_test_dev.pai_temp_2252_20319_1
    -DlabelColName=y
    -DfeatureColNames="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome"
    -Dlifecycle=28;

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
outputTableName (Required) Name of the output table NA NA
labelColName (Required) Name of the label column NA NA
modelName (Required) Name of the input model NA NA
featureColNames (Optional) Selected feature columns in the input table NA All columns are selected by default except the label column.
inputTablePartitions (Optional) Names of the selected partitions in the input table NA All partitions are selected by default.
lifecycle (Optional) Life cycle of the output table NA Unspecified by default
coreNum (Optional) Number of cores NA Automatically calculated by default
memSizePerCore (Optional) Size of memory NA Automatically calculated by default

Example

Input data

SQL statement for data generation:

    drop table if exists pai_dense_10_10;
    create table if not exists pai_dense_10_10 as
    select
    age,campaign,pdays, previous, poutcome, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
    from bank_data limit 10;

Parameter settings

The data source is pai_dense_10_10. Column y is the label column of the random forest, and the other columns are feature columns. Select age and campaign as the features to convert, which means the two features are processed as enumeration features. Use the default values for the other parameters and run the component.


Running result

colname gini entropy
age 0.06625000000000003 0.13978726292803723
campaign 0.0017500000000000003 0.004348515545596772
cons_conf_idx 0.013999999999999999 0.02908409497018851
cons_price_idx 0.002 0.0049804499913461255
emp_var_rate 0.014700000000000003 0.026786360680260933
euribor3m 0.06300000000000003 0.1321936348846039
nr_employed 0.10499999999999998 0.2203227248076733
pdays 0.0845 0.17750329234397513
poutcome 0.03360000000000001 0.07050327193845542
previous 0.017700000000000004 0.03810381005801592
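
As a sketch of how tree-based importances of this kind are commonly computed (a standard impurity-decrease formulation; the component's exact weighting is not documented here), the importance of feature f accumulates, over all nodes t that split on f, the drop in node impurity:

    \text{imp}(f) = \sum_{t\ \text{splits on}\ f} \Big( I(t) - \sum_{c \in \text{children}(t)} \frac{N_c}{N_t}\, I(c) \Big)

where I is the Gini impurity for the gini column and the information entropy for the entropy column.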

Right-click the random forest feature importance component and choose View Visualized Analysis to view the result.


GBDT feature importance

This component calculates GBDT feature importance.

PAI command

    PAI -name gbdt_importance -project algo_public
    -DmodelName=xlab_m_GBDT_LR_1_20307_v0
    -Dlifecycle=28 -DoutputTableName=pai_temp_2252_20308_1 -DlabelColName=y
    -DfeatureColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
    -DinputTableName=pai_dense_10_9;

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
outputTableName (Required) Name of the output table NA NA
labelColName (Required) Name of the label column NA NA
modelName (Required) Name of the input model NA NA
featureColNames (Optional) Selected feature columns in the input table NA All columns are selected by default except the label column.
inputTablePartitions (Optional) Names of the selected partitions in the input table NA All partitions are selected by default.
lifecycle (Optional) Life cycle of the output table NA Unspecified by default
coreNum (Optional) Number of cores NA Automatically calculated by default
memSizePerCore (Optional) Size of memory NA Automatically calculated by default

Example

Input data

SQL statement for data generation:

    drop table if exists pai_dense_10_9;
    create table if not exists pai_dense_10_9 as
    select
    age,campaign,pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
    from bank_data limit 10;

Parameter settings

The input data is pai_dense_10_9. Select column y as the label column for the GBDT binary classification component, use the other columns as feature columns, set the minimum number of leaf node samples to 1, and run the component.


Running result

colname feature_importance
age 0.004667214954427797
campaign 0.001962038566773853
cons_conf_idx 0.04857761873887033
cons_price_idx 0.01925292649801252
emp_var_rate 0.044881269590771274
euribor3m 0.025034606434306696
nr_employed 0.036085457464908766
pdays 0.639121250405536
previous 0.18041761734639272
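
Note that in this example the feature_importance values sum to 1 across the nine features, so each value can be read as the feature's relative share of the total importance.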

Right-click the GBDT feature importance component and choose View Visualized Analysis to view the result.


Linear model feature importance

This component calculates the importance of linear model features, including linear regression models and binary-class logistic regression models. Sparse and dense features are supported.

PAI command

    PAI -name regression_feature_importance -project algo_public
    -DmodelName=xlab_m_logisticregressi_20317_v0
    -DoutputTableName=pai_temp_2252_20321_1
    -DlabelColName=y
    -DfeatureColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
    -DenableSparse=false -DinputTableName=pai_dense_10_9;

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
outputTableName (Required) Name of the output table NA NA
modelName (Required) Name of the input model NA NA
labelColName (Required) Name of the label column NA NA
featureColNames (Optional) Selected features in the input table NA All columns are selected by default except the label column.
inputTablePartitions (Optional) Selected partitions in the input table NA All partitions are selected by default.
enableSparse (Optional) Indicates whether the input table data is in sparse format true, false false
itemDelimiter (Optional) Indicates the delimiter between key-value pairs when the input table data is in sparse format. NA Space
kvDelimiter (Optional) Indicates the delimiter between keys and values when the input table data is in sparse format. NA Colon
lifecycle (Optional) Life cycle of the output table NA Unspecified by default
coreNum (Optional) Number of cores NA Automatically calculated by default
memSizePerCore (Optional) Size of memory NA Automatically calculated by default

Example

Input data

SQL statement for data generation:

    create table if not exists pai_dense_10_9 as
    select
    age,campaign,pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
    from bank_data limit 10;

Parameter settings

Select column y as the label column for the logistic regression component, use the other columns as feature columns, keep the default values for the other parameters, and run the component.


Running result

colname weight importance
pdays 0.033942600256583334 16.31387797440866
previous 0.00004248130342485344 0.000030038817725357177
emp_var_rate 0.00006720242617694611 0.00010554561260753949
cons_price_idx 0.00012311047229142307 0.00006581255124425219
cons_conf_idx 0.00017227965471819213 0.0008918770542818432
euribor3m 0.00006113758212679113 0.00010427128177450988
nr_employed 0.0034541377310490697 0.26048098230126043
age 0.00009618162708080744 0.0009267659744232966
campaign 0.000019142551785274455 0.000041793353660529855

Metric calculation formula

Column Formula
weight abs(w_j)
importance abs(w_j) * STD(f_j)

Here w_j is the linear model weight of feature j, and STD(f_j) is the standard deviation of feature j in the input data.
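
As a check against the output above: the sample standard deviation of nr_employed over the 10 input rows is about 75.41, and 0.0034541 × 75.41 ≈ 0.26048, which matches the importance column, so STD here appears to be the sample standard deviation.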

Right-click the linear model feature importance component and choose View Visualized Analysis to view the result.


Preference calculation

The function of this component is:

  • Specify the detailed behavior feature data of users and automatically calculate users’ preference scores for feature values.

  • The input table includes the user ID and detailed behavior features of the user. For example, if user 2088xxx1 eats Sichuan food twice and western fast food once within three months, the input table is as follows.

    user_id cate
    2088xxx1 Sichuan food
    2088xxx1 Sichuan food
    2088xxx1 Western fast food
  • The output table shows the user’s preference scores for Sichuan food and western fast food.

    user_id cate
    2088xxx1 Sichuan food:0.0544694,western fast food:0.0272347
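
In this example, the score for Sichuan food (0.0544694) is exactly twice the score for western fast food (0.0272347), matching the 2:1 frequency ratio in the input; the absolute scale of the scores is determined internally by the component.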

PAI command

    PAI -name=preference
    -project=algo_public
    -DInputTableName=preference_input_table
    -DIdColumnName=user_id
    -DFeatureColNames=cate
    -DOutputTableName=preference_output_table
    -DmapInstanceNum=2
    -DreduceInstanceNum=1;

Algorithm parameters

Parameter key Description Required/Optional Default value
InputTableName Name of the input table Required NA
IdColumnName User ID column Required NA
FeatureColNames User feature column Required NA
OutputTableName Name of the output table Required NA
OutputTablePartitions Partitions in the output table Optional NA
mapInstanceNum Number of mappers Optional 2
reduceInstanceNum Number of reducers Optional 1

Example

Input data

SQL statement for data generation:

    drop table if exists preference_input_table;
    create table preference_input_table as
    select
    *
    from
    (
    select '2088xxx1' as user_id, 'Sichuan food' as cate from alipaydw.dual
    union all
    select '2088xxx1' as user_id, 'Sichuan food' as cate from alipaydw.dual
    union all
    select '2088xxx1' as user_id, 'western fast food' as cate from alipaydw.dual
    union all
    select '2088xxx3' as user_id, 'Sichuan food' as cate from alipaydw.dual
    union all
    select '2088xxx3' as user_id, 'Sichuan food' as cate from alipaydw.dual
    union all
    select '2088xxx3' as user_id, 'western fast food' as cate from alipaydw.dual
    ) tmp;

Running result

    +------------+------------+
    | user_id | cate |
    +------------+------------+
    | 2088xxx1 | Sichuan food:0.0544694,western fast food:0.0272347 |
    | 2088xxx3 | Sichuan food:0.0544694,western fast food:0.0272347 |
    +------------+------------+

Filter-based feature selection

Component functions

Selects the top N features based on the chosen feature selection method and outputs the filtered data (left output), and saves the importance table of all features (right output). Sparse and dense data are supported.

PAI command

    PAI -name fe_select_runner -project algo_public
    -DfeatImportanceTable=pai_temp_2260_22603_2
    -DselectMethod=iv
    -DselectedCols=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
    -DtopN=5
    -DlabelCol=y
    -DmaxBins=100
    -DinputTable=pai_dense_10_9
    -DoutputTable=pai_temp_2260_22603_1;

Algorithm parameters

Parameter Description Option Default value
inputTable (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
outputTable (Required) Name of the output table after filtering NA NA
featImportanceTable (Required) Importance weight values of all input features NA NA
selectedCols (Required) Selected feature columns NA NA
labelCol (Required) Label column/target column NA NA
categoryCols (Optional) Columns of Int or Double type to be processed as enumeration features NA “”
maxBins (Optional) Maximum number of intervals for division of consecutive features NA 100
selectMethod (Optional) Feature selection method. Default value: iv. The options are iv (Information Value), GiniGain, InfoGain, and Lasso. iv,GiniGain,InfoGain,Lasso iv
topN (Optional) Top N features selected. If N is greater than the number of input features, the output includes all the features. Default value: 10 NA 10
isSparse (Optional) Indicates whether it is a sparse feature in the key:value format. Default: dense data NA false
itemSpliter (Optional) Delimiter between sparse feature items. Default value: “,” NA “,”
kvSpliter (Optional) Delimiter between keys and values in sparse features. Default value: “:” NA “:”
lifecycle (Optional) Life cycle of the output table. Default value: 7 NA 7

Example

Input data

SQL statement for data generation:

    create table if not exists pai_dense_10_9 as
    select
    age,campaign,pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
    from bank_data limit 10;

Parameter settings

Select the y field as the label column, select the other fields as feature columns, set the feature selection method to IV, and set topN to 5, so that the top 5 features are retained.


Running result

The left output is the filtered data.

pdays nr_employed emp_var_rate cons_conf_idx cons_price_idx y
999.0 5228.1 1.4 -36.1 93.444 0.0
999.0 5195.8 -0.1 -42.0 93.2 0.0
6.0 4991.6 -1.7 -39.8 94.055 1.0
999.0 5099.1 -1.8 -47.1 93.075 0.0
3.0 5076.2 -2.9 -31.4 92.201 1.0
999.0 5228.1 1.4 -42.7 93.918 0.0
999.0 5099.1 -1.8 -46.2 92.893 0.0
999.0 5099.1 -1.8 -46.2 92.893 0.0
3.0 5076.2 -2.9 -40.8 92.963 1.0
999.0 5099.1 -1.8 -47.1 93.075 0.0

The right output is the feature importance table.

The field structure of the feature importance table is as follows.

The featname field indicates the feature name, and weight indicates the weight calculated by the selected feature selection method.

featname weight
pdays 30.675544191232486
nr_employed 29.08332850085075
emp_var_rate 29.08332850085075
cons_conf_idx 28.02710269740324
cons_price_idx 28.02710269740324
euribor3m 27.829058450563718
age 27.829058450563714
previous 14.319325030742775
campaign 10.658129656314467

Weight calculation formula

Selection method Meaning of weight
IV Information value
GiniGain Gini gain
InfoGain Information entropy gain
Lasso Absolute value of the linear model weight
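
As a sketch, the standard definition of information value for a feature discretized into B bins (assuming the component follows the usual formulation):

    \text{IV} = \sum_{i=1}^{B} \big( p_i^{\text{pos}} - p_i^{\text{neg}} \big)\, \ln \frac{p_i^{\text{pos}}}{p_i^{\text{neg}}}

where p_i^pos and p_i^neg are the proportions of positive-label and negative-label samples that fall into bin i.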

RFM

RFM calculates each user’s purchase frequency and total monetary value within a set of time windows. For example, if the time windows are set to 1,7,30,90,180, RFM calculates the purchase frequency and total monetary value over the past 1 day, 7 days, 30 days, 90 days, and 180 days.

PAI command

    PAI -name=rfm
    -project=algo_public
    -DinputTableName=window_input_table
    -DuserName=user_id
    -DdtName=dt
    -DcntName=cnt
    -DamtName=amt
    -Dwindow=1,7,30,90
    -DoutputTableName=window_output_table
    -DmapInstanceNum=2
    -DreduceInstanceNum=2;

Algorithm parameters

Parameter key Description Required/Optional Default value
inputTableName Name of the input table Required NA
userName User ID column Required NA
dtName Time (format: 20160101) Required NA
cntName Frequency Required NA
amtName Monetary value Required NA
window Time window (format: 1,7,30,90…) Required NA
outputTableName Name of the output table Required NA
outputTablePartitions Partitions in the output table Optional NA
mapInstanceNum Number of mappers Optional 2
reduceInstanceNum Number of reducers Optional 2

Example

Input data

SQL statement for data generation

    drop table if exists window_input_table;
    create table window_input_table as
    select
    *
    from
    (
    select 'a' as user_id, '20151201' as dt, 2 as cnt, 32.0 as amt from dual
    union all
    select 'a' as user_id, '20160201' as dt, 3 as cnt, 37.0 as amt from dual
    union all
    select 'a' as user_id, '20160223' as dt, 1 as cnt, 22.0 as amt from dual
    union all
    select 'b' as user_id, '20151212' as dt, 1 as cnt, 12.0 as amt from dual
    union all
    select 'b' as user_id, '20160110' as dt, 2 as cnt, 30.0 as amt from dual
    union all
    select 'c' as user_id, '20151001' as dt, 3 as cnt, 60.0 as amt from dual
    union all
    select 'c' as user_id, '20151201' as dt, 2 as cnt, 39.0 as amt from dual
    ) tmp;

Running result

    +------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+
    | user_id | cnt_1d_sum | amt_1d_sum | cnt_7d_sum | amt_7d_sum | cnt_30d_sum | amt_30d_sum | cnt_90d_sum | amt_90d_sum |
    +------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+
    | a | 1 | 22.0 | 1 | 22.0 | 4 | 59.0 | 6 | 91.0 |
    | c | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 2 | 39.0 |
    | b | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 3 | 42.0 |
    +------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+
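
In this result the windows appear to be counted backward from the latest dt in the table (20160223): for user a, only the 20160223 record falls within 1 day, the 20160201 record is added within 30 days, and the 20151201 record within 90 days. A minimal SQL sketch that reproduces the 30-day sums under that assumption (using MaxCompute's TO_DATE and DATEDIFF; the component's exact boundary semantics may differ):

    -- verification sketch: 30-day window sums relative to the latest dt
    select user_id,
           sum(case when datediff(to_date('20160223', 'yyyymmdd'), to_date(dt, 'yyyymmdd'), 'dd') < 30
                    then cnt else 0 end) as cnt_30d_sum,
           sum(case when datediff(to_date('20160223', 'yyyymmdd'), to_date(dt, 'yyyymmdd'), 'dd') < 30
                    then amt else 0.0 end) as amt_30d_sum
    from window_input_table
    group by user_id;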

Feature encoding

Feature encoding uses decision trees and ensemble algorithms to mine new features. Each derived feature corresponds to a branch of a decision tree built from one or more original features; for example, the path node 1 -> node 2 -> node 4 in a tree is one such feature. This encoding effectively transforms non-linear features into linear features.


PAI command

    PAI -name fe_encode_runner -project algo_public
    -DinputTable="pai_temp_2159_19087_1"
    -DencodeModel="xlab_m_GBDT_LR_1_19064"
    -DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign"
    -DlabelCol="y"
    -DoutputTable="pai_temp_2159_19061_1";

Parameter description

Parameter Description Option Default value
inputTable (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
encodeModel (Required) GBDT binary classification model for encoding NA NA
outputTable (Required) Name of the encoding result table NA NA
selectedCols (Required) Selected GBDT features involved in encoding (mainly training features of the GBDT component) NA NA
labelCol (Required) Label field NA NA
lifecycle (Optional) Life cycle of the output table. Default value: 7 NA 7

Example

Input data

SQL statement for data generation:

    create table if not exists pai_dense_10_9 as
    select
    age,campaign,pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
    from bank_data limit 10;

Parameter settings

This component is generally used together with the GBDT binary classification component in the modeling process.


To simplify the demonstration, set the number of GBDT binary classification trees to 5, set the depth to 3, set field y as the label column, and set other fields as the feature columns. The running result is as follows.

Running result

kv y
2:1,5:1,8:1,8:1,12:1,15:1,18:1,18:1,28:1,34:1,34:1,41:1,50:1,53:1,53:1,63:1,72:1,72:1 0.0
2:1,5:1,6:1,6:1,12:1,15:1,16:1,16:1,28:1,34:1,34:1,41:1,50:1,51:1,51:1,63:1,72:1,72:1 0.0
2:1,3:1,3:1,12:1,13:1,13:1,28:1,34:1,34:1,36:1,39:1,39:1,55:1,61:1,61:1 1.0
2:1,3:1,3:1,12:1,13:1,13:1,20:1,21:1,22:1,22:1,41:1,42:1,43:1,46:1,46:1,63:1,64:1,67:1,68:1,68:1 0.0
0:1,0:1,10:1,10:1,28:1,29:1,32:1,32:1,36:1,37:1,37:1,55:1,56:1,59:1,59:1 1.0
2:1,5:1,8:1,8:1,12:1,15:1,18:1,18:1,20:1,26:1,26:1,41:1,42:1,48:1,48:1,63:1,64:1,67:1,70:1,70:1 0.0
2:1,3:1,3:1,12:1,13:1,13:1,20:1,21:1,24:1,24:1,41:1,42:1,43:1,44:1,44:1,63:1,64:1,65:1,65:1 0.0
2:1,3:1,3:1,12:1,13:1,13:1,20:1,21:1,24:1,24:1,41:1,42:1,43:1,44:1,44:1,63:1,64:1,65:1,65:1 0.0
0:1,0:1,10:1,10:1,28:1,29:1,30:1,30:1,36:1,37:1,37:1,55:1,56:1,57:1,57:1 1.0
2:1,3:1,3:1,12:1,13:1,13:1,20:1,21:1,22:1,22:1,41:1,42:1,43:1,46:1,46:1,63:1,64:1,67:1,68:1,68:1 0.0

One-hot encoding

One-hot encoding converts a feature with m possible values into m binary features. In addition, these features are mutually exclusive, that is, only one feature can be activated at a time. As a result, the data is sparse, and the output is also in the sparse key:value structure.

PAI command

    PAI -name fe_binary_runner -project algo_public
    -DinputTable=one_hot
    -DbinaryCols=edu_num
    -DlabelCol=income
    -DbinaryReserve=false
    -DbinStrategy=noDealStrategy
    -DbinaryIndexTable=pai_temp_2458_23436_3
    -DmodelTable=pai_temp_2458_23436_2
    -DoutputTable=pai_temp_2458_23436_1
    -Dlifecycle=28;

Parameter description

Parameter Description Option Default value
inputTable (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
binaryCols (Required) One-hot encoding field. The features must be enumeration features, and the field type is not required. NA NA
binStrategy (Required) Encoding policy. This component provides two encoding policies: noDealStrategy and autoStrategy. The classification label must be set. NA NA
labelCol Classification label type. Required when the purity-based binarization policy is used, and optional when other policies are used NA NA
impurityMergeThresh (Optional) Purity increase threshold value when the binarization features are merged when the purity-based binarization policy is used NA 0.1
densityMergeThresh (Optional) Binarization feature density (percentage) merging threshold value when the density-based binarization policy is used NA 0.1
binaryReserve (Optional) Indicates whether the output table retains the original features alongside the one-hot encoding result. NA NA
outputTable (Required) Output table of one-hot encoding. The kv field contains the encoding result. NA NA
binaryIndexTable (Required) KV serialization table, which contains the mapping between encoded feature names and keys in key-value pairs. NA NA
modelTable (Required) Mapping logic of one-hot encoding, stored in the JSON format NA NA
lifecycle (Optional) Life cycle of the output table. Default value: 7 NA 7

Encoding policy

  • noDealStrategy: Simple binarization. For example, if the feature values of sex are Female|Male|Unknown, the features after binarization are sex_female, sex_male, and sex_unknown.
  • autoStrategy: Automatic binarization. Based on the simple binarization policy, this policy uses the logistic regression algorithm to evaluate each one-hot feature and merges the features whose one-hot feature weight is 0 into a single item (named other).

Example

Input data

edu_num income
13 <=50K
13 <=50K
9 <=50K
7 <=50K
13 <=50K
14 <=50K
5 <=50K
9 >50K
14 >50K
13 >50K

Parameter settings

Select edu_num as the feature to be binarized and keep the default values for the other parameters, that is, the simple binarization policy (noDealStrategy) is used.

Running result

Encoding result (outputTable)

income kv
<=50K 0:1
<=50K 0:1
<=50K 4:1
<=50K 3:1
<=50K 0:1
<=50K 1:1
<=50K 2:1
>50K 4:1
>50K 1:1
>50K 0:1

KV sequence table (binaryIndexTable)

featname featindex
edu_num 0
13 0
14 1
5 2
7 3
9 4
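
Reading the two tables together: the first input row has edu_num = 13, which maps to index 0 in the KV sequence table, hence the kv value 0:1 in the first row of the encoding result; likewise edu_num = 9 maps to 4:1 and edu_num = 7 to 3:1.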

Exception detection

The component supports the following functions:

  • For continuous-value features, perform abnormal feature detection based on the maximum and minimum values of the box plot (see the sketch after this list).

  • For enumeration-value features, perform abnormal feature detection based on the enumeration frequency threshold value of the features.
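
A sketch of the box-plot rule for the continuous case (assuming the standard 1.5 × IQR convention; the component's exact multiplier is not documented here): a value x is flagged as abnormal when it falls outside

    [\,Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}\,], \qquad \mathrm{IQR} = Q_3 - Q_1

where Q_1 and Q_3 are the first and third quartiles of the feature.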

PAI command

    PAI -name fe_detect_runner -project algo_public
    -DselectedCols="emp_var_rate,cons_price_rate,cons_conf_idx,euribor3m,nr_employed"
    -Dlifecycle="28"
    -DdetectStrategy="boxPlot"
    -DmodelTable="pai_temp_2458_23565_2"
    -DinputTable="pai_bank_data"
    -DoutputTable="pai_temp_2458_23565_1";

Parameter settings

Parameter Description Option Default value
inputTable (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
selectedCols (Required) Input features. The field type is not required. NA NA
detectStrategy (Required) The options are boxPlot (for continuous-value features) and avf (for enumeration-value features). NA boxPlot
outputTable (Required) Data set after detected abnormal features are filtered NA NA
modelTable (Required) Exception detection model NA NA
lifecycle (Optional) Life cycle of the output table. Default value: 7 NA 7

Feature importance filtering

This component filters features based on the output of importance evaluation components such as linear model feature importance, GBDT feature importance, and random forest feature importance, and supports top N feature selection.

PAI command

    PAI -name fe_filter_runner -project algo_public -DselectedCols=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome
    -DinputTable=pai_dense_10_10
    -DweightTable=pai_temp_2252_20319_1
    -DtopN=5
    -DmodelTable=pai_temp_2252_20320_2
    -DoutputTable=pai_temp_2252_20320_1;

Component parameters

Parameter Description Option Default value
inputTable (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
weightTable (Required) Feature importance weight table (output table of linear feature importance/GBDT feature importance/random forest feature importance) NA NA
outputTable (Required) Output table after top N features are filtered NA NA
modelTable (Required) Model file generated by feature filtering NA NA
selectedCols (Optional) Feature columns to filter. All the fields in the input table are selected by default. NA NA
topN Top N features selected. Default value: 10 NA 10
lifecycle (Optional) Life cycle of the output table. Default value: 7 NA 7

Modeling example

Input data

The input on the left of the feature importance filtering component is the original data table, and the input on the right is the feature importance table. The output is the top N feature data after filtering.

Note: The feature importance table uses a specified format. The first field indicates the feature name, and the second field indicates the weight value of the feature, whose data type is generally double. For example, the output table of random forest feature importance is as follows:

Field Type Label Comment
colname string NA NA
gini double NA NA
entropy double NA NA

SQL statement for data generation

    create table if not exists pai_dense_10_10 as
    select
    age,campaign,pdays, previous, poutcome, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
    from bank_data limit 10;

The feature importance table is the output table of the random forest feature importance component.

Parameter settings

This component is generally used with a feature importance component, for example, linear model feature importance, GBDT feature importance, or random forest feature importance.


The input data is pai_dense_10_10.
For random forest feature importance, select field y as the label column and select other fields as the feature columns.
For the feature importance filtering component, set topN to 5, so that only the top 5 features are retained.


Running result

The output on the left is the result.

nr_employed pdays age euribor3m poutcome
5228.1 999.0 44 4.963 nonexistent
5195.8 999.0 53 4.021 nonexistent
4991.6 6.0 28 0.729 success
5099.1 999.0 39 1.405 nonexistent
5076.2 3.0 55 0.869 success
5228.1 999.0 30 4.961 nonexistent
5099.1 999.0 37 1.327 nonexistent
5099.1 999.0 39 1.313 nonexistent
5076.2 3.0 36 1.266 success
5099.1 999.0 27 1.41 failure

The output on the right is the filtering model, which is currently empty.

Thank you! We've received your feedback.