The Feature Selection (Filter Method) component selects the top N features from feature data in sparse or dense format based on the feature selection method that you specify, and saves the importance weight of each feature in a feature importance table. Filtering features in this way simplifies model training and can improve the accuracy of the trained model. This topic describes how to configure the parameters of the Feature Selection (Filter Method) component provided by Machine Learning Designer (formerly known as Machine Learning Studio) and provides an example of how to use the component.

Limits

The Feature Selection (Filter Method) component does not support filtering of data in the LIBSVM or key-value pair format.

Configure the component

You can use one of the following methods to configure the Feature Selection (Filter Method) component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Feature Selection (Filter Method) component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). The following list describes the parameters on each tab.
Fields Setting tab:
  • Feature Columns: The names of the feature columns that are selected from the input table for training.
  • Target Column: The name of the label column that is selected from the input table.
  • Enumeration Features: The columns of features to be processed as enumeration (categorical) features. Only columns of the INT and DOUBLE data types are supported.
  • Sparse Features (K:V,K:V): Specifies whether the features are sparse features in the key-value pair format.
Parameters Setting tab:
  • Feature Selection Method: The method that is used to select features. Valid values:
    • IV: information value, which measures the predictive power of a feature with respect to the label.
    • Gini Gain: measures the reduction in Gini impurity that a feature contributes.
    • Information Gain: measures the reduction in entropy of the label that a feature contributes.
    • Lasso: uses L1 regularization, which shrinks the coefficients of uninformative features to zero and is suitable for filtering ultra-large-scale feature sets.
  • Top N Features: The number of top-ranked features to select. If the specified number is greater than the number of input features, all features are selected.
  • Continuous Feature Partitioning Mode: The mode in which continuous features are partitioned into intervals. Valid values: Automated Partitioning and Equal Width Partitioning.
  • Continuous Feature Discretization Intervals: The number of intervals into which continuous features are discretized. Set this parameter only if you set the Continuous Feature Partitioning Mode parameter to Equal Width Partitioning.
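To make the IV method concrete: after a continuous feature is partitioned into intervals, IV is computed from the positive/negative label counts in each interval. The following is a minimal sketch of the standard IV formula, not PAI's exact implementation; the 0.5 smoothing constant for empty cells is an assumption made here to avoid division by zero.

```python
import math

def information_value(bins):
    """Compute the information value (IV) of a binned feature.

    bins: list of (n_positive, n_negative) label counts per feature interval.
    A 0.5 smoothing constant is used for empty cells (an assumption; the
    component's exact smoothing rule is not documented here).
    """
    total_pos = sum(p for p, _ in bins)
    total_neg = sum(n for _, n in bins)
    iv = 0.0
    for p, n in bins:
        dist_pos = max(p, 0.5) / total_pos  # share of positives in this bin
        dist_neg = max(n, 0.5) / total_neg  # share of negatives in this bin
        iv += (dist_pos - dist_neg) * math.log(dist_pos / dist_neg)
    return iv
```

A feature whose label distribution is identical in every interval carries no information (IV = 0); the more the per-interval distributions differ, the larger the IV, which is why features are ranked by this weight.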

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name fe_select_runner -project algo_public 
     -DfeatImportanceTable=pai_temp_2260_22603_2 
     -DselectMethod=iv 
     -DselectedCols=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign 
     -DtopN=5 
     -DlabelCol=y 
     -DmaxBins=100 
     -DinputTable=pai_dense_10_9 
     -DoutputTable=pai_temp_2260_22603_1;
  • inputTable (required): The name of the input table.
  • inputTablePartitions (optional): The partitions that are selected from the input table for training. The following formats are supported:
    • Partition_name=value
    • name1=value1/name2=value2: multi-level partitions
    Note: If you specify multiple partitions, separate them with commas (,). Default value: all partitions.
  • outputTable (required): The feature result table that is generated after filtering.
  • featImportanceTable (required): The table that stores the importance weight values of all input features.
  • selectedCols (required): The feature columns that are used for training.
  • labelCol (required): The label column that is selected from the input table.
  • categoryCols (optional): The columns of enumeration features. Only columns of the INT or DOUBLE data types are supported. Default value: none.
  • maxBins (optional): The maximum number of intervals for continuous feature partitioning. Default value: 100.
  • selectMethod (optional): The method that is used to select features. Valid values: iv, GiniGain, InfoGain, and Lasso. Default value: iv.
  • topN (optional): The number of top-ranked features to select. If the specified number is greater than the number of input features, all features are selected. Default value: 10.
  • isSparse (optional): Specifies whether the features are sparse features in the key-value pair format. A value of false indicates dense features. Default value: false.
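To illustrate how maxBins relates to Equal Width Partitioning, the following is a minimal sketch, under the assumption that equal-width discretization splits the feature's value range into maxBins intervals of identical width; this is illustrative and not PAI's exact implementation.

```python
def equal_width_bin(values, max_bins):
    """Assign each value to one of max_bins equal-width intervals.

    Sketch: the interval width is (max - min) / max_bins, and the maximum
    value is clamped into the last interval.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / max_bins
    if width == 0:
        # All values are identical: everything falls into the first bin.
        return [0 for _ in values]
    return [min(int((v - lo) / width), max_bins - 1) for v in values]
```

For example, `equal_width_bin(list(range(10)), 5)` assigns the values 0-9 to five intervals of width 1.8, two values per interval.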

Example

  • Input data

    Execute the following SQL statement to generate test data:

    create table if not exists pai_dense_10_9 as 
    select  
        age,campaign,pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
    from bank_data limit 10;
  • Parameter settings
    The input table is pai_dense_10_9. Select the y column for Target Column and the other columns for Feature Columns. The following figures show the detailed parameter settings.
    Figure 1. Fields Setting tab
    Figure 2. Parameters Setting tab
  • Output
    The left output is the filtered data, which is stored in the following table.

    pdays    nr_employed    emp_var_rate    cons_conf_idx    cons_price_idx    y
    999.0    5228.1          1.4            -36.1            93.444            0.0
    999.0    5195.8         -0.1            -42.0            93.2              0.0
    6.0      4991.6         -1.7            -39.8            94.055            1.0
    999.0    5099.1         -1.8            -47.1            93.075            0.0
    3.0      5076.2         -2.9            -31.4            92.201            1.0
    999.0    5228.1          1.4            -42.7            93.918            0.0
    999.0    5099.1         -1.8            -46.2            92.893            0.0
    999.0    5099.1         -1.8            -46.2            92.893            0.0
    3.0      5076.2         -2.9            -40.8            92.963            1.0
    999.0    5099.1         -1.8            -47.1            93.075            0.0
    The right output is the feature importance table shown below. The featname column stores the feature names. The weight column stores the weight values that are calculated by the feature selection method.

    featname          weight
    pdays             30.675544191232486
    nr_employed       29.08332850085075
    emp_var_rate      29.08332850085075
    cons_conf_idx     28.02710269740324
    cons_price_idx    28.02710269740324
    euribor3m         27.829058450563718
    age               27.829058450563714
    previous          14.319325030742775
    campaign          10.658129656314467
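Because the example sets topN=5, the component keeps the five features with the highest weights, which are exactly the feature columns of the left output. Using the weight values from the importance table above, this selection step can be reproduced with a short Python sketch (illustrative only; the component performs this ranking internally):

```python
# Feature weights copied from the feature importance table above.
weights = {
    "pdays": 30.675544191232486,
    "nr_employed": 29.08332850085075,
    "emp_var_rate": 29.08332850085075,
    "cons_conf_idx": 28.02710269740324,
    "cons_price_idx": 28.02710269740324,
    "euribor3m": 27.829058450563718,
    "age": 27.829058450563714,
    "previous": 14.319325030742775,
    "campaign": 10.658129656314467,
}

# Rank the features by weight and keep the top 5 (topN=5 in the PAI command).
top5 = [name for name, _ in sorted(weights.items(),
                                   key=lambda kv: kv[1], reverse=True)[:5]]
print(top5)
# The five selected features match the columns of the filtered output
# (plus the label column y): pdays, nr_employed, emp_var_rate,
# cons_conf_idx, cons_price_idx.
```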