This topic describes the Discrete Feature Analysis component provided by Machine Learning Studio.

The Discrete Feature Analysis component collects statistics on the distribution of discrete features by using the following metrics: gini, entropy, gini gain, information gain, and information gain ratio. The gini and entropy are calculated for each discrete value. The gini gain, information gain, and information gain ratio are calculated for each column.
  • gini: the Gini index
  • entropy: the information entropy
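
The per-value metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation. It assumes that each value's Gini index and entropy are computed over the label distribution of the rows holding that value, then scaled by that value's share of all rows. This weighting is inferred from the sample output in this topic, not stated by it. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def gini_index(label_counts):
    """Gini index of a label distribution: 1 - sum(p_k ** 2)."""
    total = sum(label_counts)
    return 1.0 - sum((c / total) ** 2 for c in label_counts)


def entropy(label_counts):
    """Shannon entropy (base 2) of a label distribution."""
    total = sum(label_counts)
    return sum((c / total) * math.log2(total / c) for c in label_counts if c > 0)


def per_value_metrics(feature, labels):
    """Return {value: (weighted_gini, weighted_entropy)}.

    Assumption: each value's metric is weighted by the fraction of rows
    that hold the value.
    """
    n = len(feature)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, Counter())[label] += 1
    result = {}
    for value, counts in groups.items():
        dist = list(counts.values())
        weight = sum(dist) / n
        result[value] = (weight * gini_index(dist), weight * entropy(dist))
    return result


# Hypothetical toy data: one discrete feature and a binary label.
feature = ["red", "red", "blue", "blue", "blue", "green"]
labels = [1, 0, 1, 1, 0, 1]
metrics = per_value_metrics(feature, labels)
for value, (g, h) in metrics.items():
    print(value, round(g, 4), round(h, 4))
```

A value whose rows all share one label, such as "green" here, contributes 0 to both metrics.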

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    | Parameter | Description |
    | Feature Columns | The columns that represent the features of the training samples. |
    | Label Column | The label column. |
    | Sparse Matrix | If data in the input table is in the sparse format, features must be in the key-value pair format. |
  • PAI command
    PAI
    -name enum_feature_selection
    -project algo_public
    -DinputTableName=enumfeatureselection_input
    -DlabelColName=label
    -DfeatureColNames=col0,col1
    -DenableSparse=false
    -DoutputCntTableName=enumfeatureselection_output_cntTable
    -DoutputValueTableName=enumfeatureselection_output_valuetable
    -DoutputEnumValueTableName=enumfeatureselection_output_enumvaluetable;
    | Parameter | Required | Description | Default value |
    | inputTableName | Yes | The name of the input table. | No default value |
    | inputTablePartitions | No | The partitions selected from the input table for training. Supported formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Separate multiple partitions with commas (,). | Full table |
    | featureColNames | No | The names of the feature columns selected from the input table for training. | No default value |
    | labelColName | No | The name of the label column in the input table. | No default value |
    | enableSparse | No | Specifies whether the input data is in the sparse format. Valid values: true and false. | false |
    | kvFeatureColNames | No | The names of the feature columns that are in the key-value pair format. | Full table |
    | kvDelimiter | No | The delimiter used to separate keys and values if data in the input table is in the sparse format. | Colon (:) |
    | itemDelimiter | No | The delimiter used to separate key-value pairs if data in the input table is in the sparse format. | Comma (,) |
    | outputCntTableName | No | The output distribution table that contains the enumerated values of discrete features. | N/A |
    | outputValueTableName | No | The output table that contains gini and entropy values of discrete features. | N/A |
    | outputEnumValueTableName | No | The output table that contains enumerated gini and entropy values of discrete features. | N/A |
    | lifecycle | No | The lifecycle of the table. | No default value |
    | coreNum | No | The number of cores. The parameter value must be a positive integer. | Automatically allocated |
    | memSizePerCore | No | The memory size of each core, in MB. Valid values: 1 to 65536. | Automatically allocated |
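
To make the roles of kvDelimiter and itemDelimiter concrete, the following sketch splits one sparse cell into key-value pairs by using the documented defaults (colon and comma). The cell layout "key:value,key:value" is an assumption based on the parameter descriptions, and parse_kv_cell is a hypothetical helper, not part of the component.

```python
def parse_kv_cell(cell, kv_delimiter=":", item_delimiter=","):
    """Split one sparse cell such as "1:a,3:b" into a {key: value} dict.

    Assumption: items are separated by item_delimiter, and each item is a
    key and a value separated by kv_delimiter.
    """
    pairs = {}
    if not cell:
        return pairs  # NULL or empty cell: no features present
    for item in cell.split(item_delimiter):
        key, value = item.split(kv_delimiter, 1)
        pairs[key.strip()] = value.strip()
    return pairs


row = parse_kv_cell("1:a,3:b")
print(row)  # {'1': 'a', '3': 'b'}
```

If the two delimiters are swapped relative to the data, every item fails to split into a key and a value, so the delimiter settings must match the table contents when enableSparse is set to true.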

Example

Execute the following SQL statements to generate input data:
drop table if exists enum_feature_selection_test_input;
create table enum_feature_selection_test_input
as
select
    *
from
(
    select
        '00' as col_string,
        1 as col_bigint,
        0.0 as col_double
    from dual
    union all
        select
            cast(null as string) as col_string,
            0 as col_bigint,
            0.0 as col_double
        from dual
    union all
        select
            '01' as col_string,
            0 as col_bigint,
            1.0 as col_double
        from dual
    union all
        select
            '01' as col_string,
            1 as col_bigint,
            cast(null as double) as col_double
        from dual
    union all
        select
            '01' as col_string,
            1 as col_bigint,
            1.0 as col_double
        from dual
    union all
        select
            '00' as col_string,
            0 as col_bigint,
            0.0 as col_double
        from dual
) tmp;
Input data:
+------------+------------+------------+
| col_string | col_bigint | col_double |
+------------+------------+------------+
| 01         | 1          | 1.0        |
| 01         | 0          | 1.0        |
| 01         | 1          | NULL       |
| NULL       | 0          | 0.0        |
| 00         | 1          | 0.0        |
| 00         | 0          | 0.0        |
+------------+------------+------------+
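
As a local cross-check, the label distribution per feature value can be tallied from the six input rows in plain Python. This is the kind of content that the outputCntTableName table is described as holding. The sketch assumes that NULL is counted as a value of its own, represented here as None. The helper name label_counts is hypothetical.

```python
from collections import Counter

# The six input rows: (col_string, col_bigint, col_double).
rows = [
    ("01", 1, 1.0),
    ("01", 0, 1.0),
    ("01", 1, None),
    (None, 0, 0.0),
    ("00", 1, 0.0),
    ("00", 0, 0.0),
]
columns = ("col_string", "col_bigint", "col_double")


def label_counts(rows, feature_idx, label_idx):
    """Count rows for every (feature value, label value) combination."""
    return Counter((r[feature_idx], r[label_idx]) for r in rows)


# col_bigint serves as the label column, col_string as the feature.
counts = label_counts(rows, columns.index("col_string"), columns.index("col_bigint"))
for (value, label), cnt in sorted(counts.items(), key=lambda kv: (str(kv[0][0]), kv[0][1])):
    print(value, label, cnt)
```
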
  • PAI command
    • Command
      drop table if exists enum_feature_selection_test_input_enum_value_output;
      drop table if exists enum_feature_selection_test_input_cnt_output;
      drop table if exists enum_feature_selection_test_input_value_output;
      PAI -name enum_feature_selection
          -project algo_public
          -DinputTableName="enum_feature_selection_test_input"
          -DlabelColName="col_bigint"
          -DfeatureColNames="col_double,col_string"
          -DenableSparse="false"
          -DitemDelimiter=":"
          -DkvDelimiter=","
          -Dlifecycle="28"
          -DoutputCntTableName="enum_feature_selection_test_input_cnt_output"
          -DoutputValueTableName="enum_feature_selection_test_input_value_output"
          -DoutputEnumValueTableName="enum_feature_selection_test_input_enum_value_output";
    • Command output
      • enum_feature_selection_test_input_cnt_output
        +------------+------------+------------+------------+
        | colname    | colvalue   | labelvalue | cnt        |
        +------------+------------+------------+------------+
        | col_double | NULL       | 1          | 1          |
        | col_double | 0          | 0          | 2          |
        | col_double | 0          | 1          | 1          |
        | col_double | 1          | 0          | 1          |
        | col_double | 1          | 1          | 1          |
        | col_string | NULL       | 0          | 1          |
        | col_string | 00         | 0          | 1          |
        | col_string | 00         | 1          | 1          |
        | col_string | 01         | 0          | 1          |
        | col_string | 01         | 1          | 2          |
        +------------+------------+------------+------------+
      • enum_feature_selection_test_input_value_output
        +------------+---------------------+-------------------+---------------------+---------------------+---------------------+
        | colname    | gini                | entropy           | infogain            | ginigain            | infogainratio       |
        +------------+---------------------+-------------------+---------------------+---------------------+---------------------+
        | col_double | 0.3888888888888889  | 0.792481250360578 | 0.20751874963942196 | 0.1111111111111111  | 0.14221913160264427 |
        | col_string | 0.38888888888888884 | 0.792481250360578 | 0.20751874963942196 | 0.11111111111111116 | 0.14221913160264427 |
        +------------+---------------------+-------------------+---------------------+---------------------+---------------------+
      • enum_feature_selection_test_input_enum_value_output
        +------------+------------+---------------------+--------------------+
        | colname    | colvalue   | gini                | entropy            |
        +------------+------------+---------------------+--------------------+
        | col_double | NULL       | 0.0                 | 0.0                |
        | col_double | 0          | 0.22222222222222224 | 0.4591479170272448 |
        | col_double | 1          | 0.16666666666666666 | 0.3333333333333333 |
        | col_string | NULL       | 0.0                 | 0.0                |
        | col_string | 00         | 0.16666666666666666 | 0.3333333333333333 |
        | col_string | 01         | 0.2222222222222222  | 0.4591479170272448 |
        +------------+------------+---------------------+--------------------+
  • Machine Learning Platform for AI console
    • Component interface
    • Parameter settings
    • Results
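
The column-level figures in the PAI command output of this example can be cross-checked with a short Python sketch. The formulas are assumptions inferred from the sample output, not taken from the component's source:
- The column gini and entropy are the label's Gini index and entropy conditioned on the feature.
- Gini gain and information gain are the label's unconditional value minus the conditional one.
- Information gain ratio divides the information gain by the entropy of the feature's own value distribution.
- NULL is treated as a separate value.

All function names are hypothetical.

```python
import math
from collections import Counter


def gini(counts):
    """Gini index of a distribution given as a list of counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as a list of counts."""
    n = sum(counts)
    return sum((c / n) * math.log2(n / c) for c in counts if c > 0)


def column_metrics(feature, labels):
    """Column-level metrics under the assumptions stated in the lead-in."""
    n = len(labels)
    by_value = {}
    for value, label in zip(feature, labels):
        by_value.setdefault(value, Counter())[label] += 1
    dists = [list(c.values()) for c in by_value.values()]
    cond_gini = sum(sum(d) / n * gini(d) for d in dists)
    cond_entropy = sum(sum(d) / n * entropy(d) for d in dists)
    label_dist = list(Counter(labels).values())
    info_gain = entropy(label_dist) - cond_entropy
    split_info = entropy([sum(d) for d in dists])
    return {
        "gini": cond_gini,
        "entropy": cond_entropy,
        "infogain": info_gain,
        "ginigain": gini(label_dist) - cond_gini,
        # Guard against a feature with a single value (split_info == 0).
        "infogainratio": info_gain / split_info if split_info > 0 else 0.0,
    }


# col_double as the feature and col_bigint as the label, in input-table order.
col_double = [1.0, 1.0, None, 0.0, 0.0, 0.0]
col_bigint = [1, 0, 1, 0, 1, 0]
result = column_metrics(col_double, col_bigint)
for name, value in result.items():
    print(name, value)
```

Under these assumptions the sketch agrees with the col_double row of enum_feature_selection_test_input_value_output up to floating-point rounding.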