The Feature Discretization component discretizes continuous features based on a specific rule.

Overview

The Feature Discretization component supports the following types of discretization:
  • Discretization of features that are of numeric data types and in the dense format
  • Unsupervised discretization such as equal frequency discretization and equal width discretization
    Note The default unsupervised discretization is equal width discretization.
  • Supervised discretization such as Gini gain-based discretization and entropy gain-based discretization
    Note For supervised discretization, the label column must be of the ENUM, STRING, or BIGINT data type.
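The difference between the two unsupervised methods can be illustrated with a short Python sketch. This is not the component's implementation: the function names, the 0-based bin indexes, and the sample data are assumptions made for illustration only.

```python
from bisect import bisect_right

def equal_width_edges(values, max_bins):
    """Interior cut points that split [min, max] into max_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / max_bins
    return [lo + width * i for i in range(1, max_bins)]

def equal_frequency_edges(values, max_bins):
    """Interior cut points chosen so that each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for i in range(1, max_bins):
        cut = ordered[(i * n) // max_bins]
        # Skip duplicate cut points, which occur when many values are identical.
        if not edges or cut > edges[-1]:
            edges.append(cut)
    return edges

def assign_bins(values, edges):
    """Map each value to the 0-based index of the interval it falls into."""
    return [bisect_right(edges, v) for v in values]

# Hypothetical sample with one outlier.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
width_bins = assign_bins(data, equal_width_edges(data, 5))
freq_bins = assign_bins(data, equal_frequency_edges(data, 5))
print(width_bins)
print(freq_bins)
```

With this sample, equal width discretization places every value except the outlier in the first interval, whereas equal frequency discretization spreads the values evenly across the intervals.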
You can use the Feature model prediction component to apply the discretization model for prediction. The following figure shows the detailed directed acyclic graph (DAG) for modeling.
[Figure: Feature model prediction]
Note
  • You must use the same discretization model for prediction as for training. This ensures that the same discretization criteria are applied during prediction.
  • Supervised discretization searches for segmentation points by continuously traversing candidate points and evaluating their entropy gains. This type of discretization may take a long time to run. The number of bins that are obtained after segmentation is not limited by the value specified by the maxBins parameter.
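The search for a segmentation point based on entropy gain can be sketched as follows. This is a minimal single-split illustration in Python, not the component's implementation: the function names and sample data are hypothetical, and how the component repeats the search or decides when to stop is not described here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_entropy_split(values, labels):
    """Scan every candidate cut point and return (cut, gain) with the largest entropy gain."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point exists between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            # Place the cut halfway between the two neighboring values.
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_cut, best_gain

# Hypothetical feature values and labels.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]
cut, gain = best_entropy_split(values, labels)
print(cut, gain)
```

Because every candidate point is evaluated against the labels, the cost grows with the number of distinct feature values, which is consistent with the long run time mentioned above.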

Configure the component

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Discrete Features | The features that require discretization. Sparse features are automatically filtered by the system.
    Fields Setting | Label Column | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and objective variables can be visualized.
    Parameters Setting | Discretization Method | The method that is used for discretization. Valid values: Isometric Discretization (equal width), Isofrequency Discretization (equal frequency), Gini-gain-based Discretization, and Entropy-gain-based Discretization.
    Parameters Setting | Discrete Interval | The number of discretization intervals (bins). The value of this parameter must be a positive integer greater than 1.
    Tuning | Number of Computing Cores | The number of cores used in computing. Set the parameter to a positive integer.
    Tuning | Memory Size per Core | The memory size of each core.
  • Use commands
    PAI -name fe_discrete_runner_1 -project algo_public
       -DdiscreteMethod=SameFrequecy
       -Dlifecycle=28
       -DmaxBins=5
       -DinputTable=pai_dense_10_1
       -DdiscreteCols=nr_employed
       -DoutputTable=pai_temp_2262_20382_1
       -DmodelTable=pai_temp_2262_20382_2;
    Parameter | Required | Description | Default value
    inputTable | Yes | The name of the input table. | N/A
    inputTablePartitions | No | The partitions that are selected from the input table for training. Set this parameter in the Partition_name=value format. To specify multi-level partitions, set this parameter in the name1=value1/name2=value2; format. If you specify multiple partitions, separate them with commas (,). | All partitions in the input table
    outputTable | Yes | The output table after discretization. | N/A
    discreteCols | Yes | The features that require discretization. Sparse features are automatically filtered by the system. | ""
    labelCol | No | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and objective variables can be visualized. | N/A
    categoryCols | No | The selected fields that are processed as enumerated features. These fields do not support discretization. | Empty string
    discreteMethod | No | The method that is used for discretization. Valid values: Isometric Discretization (equal width), Isofrequency Discretization (equal frequency), Gini-gain-based Discretization, and Entropy-gain-based Discretization. | Isometric Discretization
    discreteTopN | No | If you do not set the discreteCols parameter, the system automatically selects the top N features that require discretization. The value must be a positive integer. | 10
    maxBins | No | The number of discretization intervals (bins). The value of this parameter must be a positive integer greater than 1. | 100
    isSparse | No | Specifies whether features are sparse features in the key-value format. Valid values: true and false. The value false indicates that features are in the dense format. | false
    itemSpliter | No | The delimiter that is used to separate key-value pairs in the sparse format. | ,
    kvSpliter | No | The delimiter that is used to separate keys and values in the sparse format. | :
    lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | 7
    coreNum | No | The number of cores. This parameter must be used with the memSizePerCore parameter. The value of this parameter must be a positive integer. | Determined by the system
    memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer. | Determined by the system
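To make the roles of itemSpliter and kvSpliter concrete, the following Python sketch parses one sparse row by using the default delimiters. The function name and the sample row are hypothetical, and the sketch assumes that every value is numeric. It only illustrates the key-value layout and is not how the component reads data.

```python
def parse_sparse_row(row, item_spliter=",", kv_spliter=":"):
    """Parse one sparse row such as "1:0.5,3:2.0" into a {key: value} dictionary."""
    features = {}
    for item in row.split(item_spliter):
        item = item.strip()
        if not item:
            continue  # ignore empty fragments such as a trailing delimiter
        key, _, value = item.partition(kv_spliter)
        features[key] = float(value)
    return features

# Hypothetical sparse row: feature 1 = 0.5, feature 3 = 2.0, feature 7 = 10.
parsed = parse_sparse_row("1:0.5,3:2.0,7:10")
print(parsed)
```

If the data uses other delimiters, pass them through the item_spliter and kv_spliter arguments in the same way that you would set itemSpliter and kvSpliter in the command.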

Example

  • Input data

    Execute the following SQL statements to generate input data:

    create table if not exists pai_dense_10_1 as
    select
        nr_employed
    from bank_data limit 10;
  • Parameter settings

    The input table is pai_dense_10_1. On the Fields Setting tab, set the Discrete Features parameter to nr_employed. On the Parameters Setting tab, set the Discretization Method parameter to Isometric Discretization and the Discrete Interval parameter to 5.

  • Result
    nr_employed
    4.0
    3.0
    1.0
    3.0
    2.0
    4.0
    3.0
    3.0
    2.0
    3.0