Platform For AI: Random Forest Feature Importance Evaluation

Last Updated: Jan 03, 2025

Random Forest Feature Importance Evaluation analyzes how much each feature contributes to the predictions of a random forest model. The importance of a feature is determined either by how much the feature decreases the average impurity across all decision trees, or by how much the accuracy of the model decreases when the values of the feature are permuted. The result helps you identify the features that matter most to model performance.
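
The two measures mentioned above can be illustrated with a small standard-library Python sketch. This is not the component's implementation, and the source does not state which measure the component reports. The function names (`gini`, `impurity_decrease`, `permutation_importance`) and the toy data are hypothetical and exist only for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(labels, left_mask):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = [y for y, m in zip(labels, left_mask) if m]
    right = [y for y, m in zip(labels, left_mask) if not m]
    n = len(labels)
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, col, seed=0):
    """Accuracy drop after shuffling the values of one feature column."""
    rng = random.Random(seed)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)


# Impurity view: a split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease([0, 0, 1, 1], [True, True, False, False])   # 0.5
useless = impurity_decrease([0, 0, 1, 1], [True, False, True, False])   # 0.0

# Permutation view: a toy "model" that only looks at feature 0.
def toy_predict(row):
    return 1 if row[0] > 5 else 0

toy_rows = [(i, i % 3) for i in range(10)]
toy_labels = [toy_predict(r) for r in toy_rows]
ignored = permutation_importance(toy_predict, toy_rows, toy_labels, col=1)  # 0.0
used = permutation_importance(toy_predict, toy_rows, toy_labels, col=0)     # >= 0
```

In a real forest, the impurity decrease of a feature is accumulated over every split that uses the feature and averaged across trees. The single-split version here only shows the quantity being averaged.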

Configure the component

Method 1: Configure the component on the pipeline page

On the pipeline details page in Machine Learning Designer, add the Random Forest Feature Importance Evaluation component to the pipeline and configure the parameters described in the following table.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | Optional. The feature columns that are selected from the input table for training. By default, all columns other than the label column are selected. |
| Fields Setting | Target Column | Required. The label column. Click the Directory icon. In the Select Column dialog box, enter the keywords of the column that you want to search for. Select the column and click OK. |
| Parameters Setting | Parallel Computing Cores | Optional. The number of cores used in parallel computing. |
| Parameters Setting | Memory Size per Core | Optional. The memory size of each core. Unit: MB. |

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component.

pai -name feature_importance -project algo_public
    -DinputTableName=pai_dense_10_10
    -DmodelName=xlab_m_random_forests_1_20318_v0
    -DoutputTableName=erkang_test_dev.pai_temp_2252_20319_1
    -DlabelColName=y
    -DfeatureColNames="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome"
    -Dlifecycle=28 ;

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | No default value | The name of the input table. |
| outputTableName | Yes | No default value | The name of the output table. |
| labelColName | Yes | No default value | The name of the label column in the input table. |
| modelName | Yes | No default value | The name of the input model. |
| featureColNames | No | All columns other than the label column | The feature columns that are selected from the input table for training. |
| inputTablePartitions | No | All partitions | The partitions that are selected from the input table for training. |
| lifecycle | No | Not specified | The lifecycle of the output table. |
| coreNum | No | Determined by the system | The number of cores. |
| memSizePerCore | No | Determined by the system | The memory size of each core. Unit: MB. |
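
If you generate the command text programmatically before pasting it into the SQL Script component, a small helper along the following lines can check that the required parameters from the table are present. This is a minimal sketch: `build_pai_command` is a hypothetical helper, not part of any PAI SDK, and the parameter values are copied from the sample command in this topic.

```python
# Hypothetical helper (not a PAI API): assembles the command text shown in this topic.
REQUIRED = ("inputTableName", "outputTableName", "labelColName", "modelName")


def build_pai_command(params, name="feature_importance", project="algo_public"):
    """Return a PAI command string, or raise ValueError if a required parameter is missing."""
    missing = [key for key in REQUIRED if not params.get(key)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = [f"pai -name {name} -project {project}"]
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            # Column lists are passed as one quoted, comma-separated string.
            value = '"' + ",".join(value) + '"'
        lines.append(f"    -D{key}={value}")
    return "\n".join(lines) + ";"


command = build_pai_command({
    "inputTableName": "pai_dense_10_10",
    "modelName": "xlab_m_random_forests_1_20318_v0",
    "outputTableName": "erkang_test_dev.pai_temp_2252_20319_1",
    "labelColName": "y",
    "featureColNames": ["pdays", "previous", "age", "campaign"],
    "lifecycle": 28,
})
print(command)
```

The helper only formats text. It does not submit the command or validate that the tables and the model exist.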

Example

  1. Execute the following SQL statements to generate training data:

    In this example, 10 records are selected from the bank_data table to create a table named pai_dense_10_10. You can create a table based on your business requirements.

    drop table if exists pai_dense_10_10;
    create table pai_dense_10_10 as
    select
        age, campaign, pdays, previous, poutcome, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
    from bank_data limit 10;

  2. Create the experiment shown in the following figure. For more information, see Custom pipelines.

    The data source is pai_dense_10_10. y is the label column of the random forest model, and the other columns are feature columns. Select age and campaign for the Columns Forced to Convert parameter so that these two columns are processed as enumerated features. Retain the default settings for the other columns.

    Figure: Generate a model

  3. Run the experiment and view the prediction results.

    Figure: Result

  4. After the experiment is run, right-click the Random Forest Feature Importance Evaluation component and select View Analytics Report to view the result.

    Figure: Analysis report