Platform For AI: Discrete Feature Analysis

Last Updated: Jan 29, 2024

This topic describes the Discrete Feature Analysis component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

The Discrete Feature Analysis component collects statistics on the distribution of discrete features by using the following metrics: gini, entropy, gini gain, information gain, and information gain ratio. Gini and entropy are calculated for each discrete value of a feature. Gini gain, information gain, and information gain ratio are calculated for each feature column.
  • gini: [formula figure: gini index]
  • entropy: [formula figure: entropy]
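
The formulas for gini and entropy are not reproduced in this text. The following minimal sketch assumes the standard definitions: Gini impurity as 1 - sum(p_i^2) and Shannon entropy with base-2 logarithms, both taken over the label distribution. The function names `gini` and `entropy` are illustrative and are not part of the component.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(p_i * log2(1 / p_i))."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


# Labels of the rows that share one discrete feature value (assumed sample).
labels = [1, 0, 0]
print(round(gini(labels), 4), round(entropy(labels), 4))
```

A value whose rows all carry the same label yields 0 for both metrics, and an even two-way split yields a gini of 0.5 and an entropy of 1.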

Configure the component

You can use one of the following methods to configure the Discrete Feature Analysis component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Discrete Feature Analysis component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). The following table describes the parameters.
| Parameter | Description |
| --- | --- |
| Feature Columns | The columns to represent the features of data in training samples. |
| Label Column | The label column. |
| Sparse Matrix | If data in an input table is in the sparse format, features must be in the key-value pair format. |
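
To illustrate the key-value pair format, here is a minimal sketch that splits one sparse cell into its pairs. It assumes the default delimiters listed for the kvDelimiter (:) and itemDelimiter (,) parameters under Method 2. The helper name `parse_sparse_row` and the sample cell are hypothetical, and how the component itself interprets keys and values is not described here.

```python
def parse_sparse_row(text, kv_delimiter=":", item_delimiter=","):
    """Split a cell such as '1:a,3:b' into {'1': 'a', '3': 'b'}."""
    features = {}
    if not text:
        return features
    for item in text.split(item_delimiter):
        # partition() splits on the first delimiter only, so values may
        # themselves contain the key-value delimiter.
        key, _, value = item.partition(kv_delimiter)
        features[key.strip()] = value.strip()
    return features


print(parse_sparse_row("1:a,3:b"))
```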

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI
-name enum_feature_selection
-project algo_public
-DinputTableName=enumfeautreselection_input
-DlabelColName=label
-DfeatureColNames=col0,col1
-DenableSparse=false
-DoutputCntTableName=enumfeautreselection_output_cntTable
-DoutputValueTableName=enumfeautreselection_output_valuetable
-DoutputEnumValueTableName=enumfeautreselection_output_enumvaluetable;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| inputTablePartitions | No | The partitions that are selected from the input table for training. The following formats are supported: Partition_name=value, and name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,). | Full table |
| featureColNames | No | The feature columns that are selected from the input table for training. | No default value |
| labelColName | No | The name of the label column in the input table. | No default value |
| enableSparse | No | Specifies whether the input data is in the sparse format. Valid values: true and false. | false |
| kvFeatureColNames | No | The names of the feature columns that are in the key-value pair format. | Full table |
| kvDelimiter | No | The delimiter that is used to separate keys and values if data in an input table is in the sparse format. | : |
| itemDelimiter | No | The delimiter that is used to separate key-value pairs if data in an input table is in the sparse format. | , |
| outputCntTableName | No | The output distribution table that contains the enumerated values of discrete features. | N/A |
| outputValueTableName | No | The output table that contains gini and entropy values of discrete features. | N/A |
| outputEnumValueTableName | No | The output table that contains enumerated gini and entropy values of discrete features. | N/A |
| lifecycle | No | The lifecycle of the table. | No default value |
| coreNum | No | The number of cores that are used in computing. The value must be a positive integer. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Valid values: 1 to 65536. Unit: MB. | Determined by the system |
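
When many parameters are set, it can help to assemble the command text programmatically. The following sketch only formats a string in the shape of the PAI command shown above. `build_pai_command` is a hypothetical helper, not part of any PAI SDK, and it does not submit anything.

```python
def build_pai_command(name, project, params):
    """Render a PAI command string from a dict of -D parameters."""
    parts = [f"PAI -name {name}", f"-project {project}"]
    for key, value in params.items():
        if isinstance(value, bool):
            # The parameter table lists true and false as valid values.
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Column lists and multiple partitions are comma-separated.
            value = ",".join(str(v) for v in value)
        parts.append(f'-D{key}="{value}"')
    return " ".join(parts) + ";"


cmd = build_pai_command(
    "enum_feature_selection",
    "algo_public",
    {
        "inputTableName": "enum_feature_selection_test_input",
        "labelColName": "col_bigint",
        "featureColNames": ["col_double", "col_string"],
        "enableSparse": False,
    },
)
print(cmd)
```

The resulting text can be pasted into an SQL Script component, as described for Method 2.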

Example

Execute the following SQL statements to generate input data:
drop table if exists enum_feature_selection_test_input;
create table enum_feature_selection_test_input
as
select
    *
from
(
    select
        '00' as col_string,
        1 as col_bigint,
        0.0 as col_double
    from dual
    union all
    select
        cast(null as string) as col_string,
        0 as col_bigint,
        0.0 as col_double
    from dual
    union all
    select
        '01' as col_string,
        0 as col_bigint,
        1.0 as col_double
    from dual
    union all
    select
        '01' as col_string,
        1 as col_bigint,
        cast(null as double) as col_double
    from dual
    union all
    select
        '01' as col_string,
        1 as col_bigint,
        1.0 as col_double
    from dual
    union all
    select
        '00' as col_string,
        0 as col_bigint,
        0.0 as col_double
    from dual
) tmp;
Input data:
+------------+------------+------------+
| col_string | col_bigint | col_double |
+------------+------------+------------+
| 01         | 1          | 1.0        |
| 01         | 0          | 1.0        |
| 01         | 1          | NULL       |
| NULL       | 0          | 0.0        |
| 00         | 1          | 0.0        |
| 00         | 0          | 0.0        |
+------------+------------+------------+
  • PAI command
    • Command
      drop table if exists enum_feature_selection_test_input_enum_value_output;
      drop table if exists enum_feature_selection_test_input_cnt_output;
      drop table if exists enum_feature_selection_test_input_value_output;
      PAI -name enum_feature_selection
          -project algo_public
          -DinputTableName="enum_feature_selection_test_input"
          -DlabelColName="col_bigint"
          -DfeatureColNames="col_double,col_string"
          -DenableSparse="false"
          -DkvDelimiter=","
          -DitemDelimiter=":"
          -Dlifecycle="28"
          -DoutputCntTableName="enum_feature_selection_test_input_cnt_output"
          -DoutputValueTableName="enum_feature_selection_test_input_value_output"
          -DoutputEnumValueTableName="enum_feature_selection_test_input_enum_value_output";
    • Command output
      • enum_feature_selection_test_input_cnt_output
        +------------+------------+------------+------------+
        | colname    | colvalue   | labelvalue | cnt        |
        +------------+------------+------------+------------+
        | col_double | NULL       | 1          | 1          |
        | col_double | 0          | 0          | 2          |
        | col_double | 0          | 1          | 1          |
        | col_double | 1          | 0          | 1          |
        | col_double | 1          | 1          | 1          |
        | col_string | NULL       | 0          | 1          |
        | col_string | 00         | 0          | 1          |
        | col_string | 00         | 1          | 1          |
        | col_string | 01         | 0          | 1          |
        | col_string | 01         | 1          | 2          |
        +------------+------------+------------+------------+
      • enum_feature_selection_test_input_value_output
        +------------+---------------------+-------------------+---------------------+---------------------+---------------------+
        | colname    | gini                | entropy           | infogain            | ginigain            | infogainratio       |
        +------------+---------------------+-------------------+---------------------+---------------------+---------------------+
        | col_double | 0.3888888888888889  | 0.792481250360578 | 0.20751874963942196 | 0.1111111111111111  | 0.14221913160264427 |
        | col_string | 0.38888888888888884 | 0.792481250360578 | 0.20751874963942196 | 0.11111111111111116 | 0.14221913160264427 |
        +------------+---------------------+-------------------+---------------------+---------------------+---------------------+
      • enum_feature_selection_test_input_enum_value_output
        +------------+------------+---------------------+--------------------+
        | colname    | colvalue   | gini                | entropy            |
        +------------+------------+---------------------+--------------------+
        | col_double | NULL       | 0.0                 | 0.0                |
        | col_double | 0          | 0.22222222222222224 | 0.4591479170272448 |
        | col_double | 1          | 0.16666666666666666 | 0.3333333333333333 |
        | col_string | NULL       | 0.0                 | 0.0                |
        | col_string | 00         | 0.16666666666666666 | 0.3333333333333333 |
        | col_string | 01         | 0.2222222222222222  | 0.4591479170272448 |
        +------------+------------+---------------------+--------------------+
  • PAI console
    • Component interface: [figure: Component]
    • Parameter settings: [figure: Parameter settings]
    • Results: [figure: Results]
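
The numbers in the three output tables of the PAI command example can be checked by hand. The sketch below recomputes them from the six input rows. The weighting rules are inferred from the example output and are not stated by the component itself: each per-value gini and entropy is multiplied by that value's share of the rows, the column-level figure is the sum over values, the gains are measured against the impurity of the label column, the information gain ratio divides by the entropy of the value distribution, and NULL is treated as its own value. The function name `analyze` is hypothetical.

```python
import math
from collections import Counter, defaultdict

# (col_string, col_bigint, col_double) rows from the example input table.
rows = [
    ("01", 1, 1.0), ("01", 0, 1.0), ("01", 1, None),
    (None, 0, 0.0), ("00", 1, 0.0), ("00", 0, 0.0),
]


def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    n = sum(counts)
    return sum((c / n) * math.log2(n / c) for c in counts if c)


def analyze(values, labels):
    """Return per-value (gini, entropy) and column-level metrics."""
    n = len(labels)
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1

    per_value = {}
    for value, label_counts in by_value.items():
        counts = list(label_counts.values())
        weight = sum(counts) / n  # share of rows holding this value (inferred)
        per_value[value] = (weight * gini(counts), weight * entropy(counts))

    col_gini = sum(g for g, _ in per_value.values())
    col_entropy = sum(e for _, e in per_value.values())
    label_counts = list(Counter(labels).values())
    info_gain = entropy(label_counts) - col_entropy
    split_info = entropy([sum(c.values()) for c in by_value.values()])
    column = {
        "gini": col_gini,
        "entropy": col_entropy,
        "infogain": info_gain,
        "ginigain": gini(label_counts) - col_gini,
        # Guard for a single-valued column; this fallback is an assumption.
        "infogainratio": info_gain / split_info if split_info else 0.0,
    }
    return per_value, column


labels = [r[1] for r in rows]
per_value, column = analyze([r[2] for r in rows], labels)  # col_double
for name, metric in column.items():
    print(name, round(metric, 6))
```

Under these assumptions the results for col_double and col_string agree with the tables above to within floating-point rounding.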