Platform For AI: Feature Discretization

Last Updated: Feb 16, 2023

The Feature Discretization component discretizes continuous features based on a specific rule.

Overview

The Feature Discretization component supports the following types of discretization:

  • Discretization of dense features that are of numeric data types

  • Unsupervised discretization such as equal frequency discretization and equal width discretization

    Note

    The default unsupervised discretization is equal width discretization.

  • Supervised discretization such as Gini gain-based discretization and entropy gain-based discretization

    Note

    For supervised discretization, the label column must be of the ENUM, STRING, or BIGINT data type.

  • Supervised discretization searches for split points based on entropy gain by continuously traversing the data. This type of discretization may take a long time to run. The number of bins obtained after splitting is not limited by the value of the maxBins parameter.
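The two unsupervised methods can be illustrated with a minimal Python sketch. It is not the implementation used by the component: the function names, the 0-based bin numbering, and the handling of ties and of the maximum value are assumptions made for illustration only.

```python
def equal_width_bins(values, max_bins):
    """Assign each value to one of max_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0] * len(values)
    width = (hi - lo) / max_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), max_bins - 1) for v in values]


def equal_frequency_bins(values, max_bins):
    """Assign values to max_bins intervals that hold about the same number of rows."""
    # Rank the rows by value, then cut the ranking into equally sized groups.
    # Assumption: tied values may end up in different bins in this simple version.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * max_bins // len(values)
    return bins


sample = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical values with one outlier
print(equal_width_bins(sample, 2))      # the outlier gets its own bin
print(equal_frequency_bins(sample, 2))  # rows are spread evenly across bins
```

With skewed data such as the sample above, equal width discretization places most rows in one bin, whereas equal frequency discretization keeps the bin sizes balanced.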

Configure the component

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Feature Discretization component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Discrete Features | The features that require discretization. |
| Fields Setting | Label Column | The label column. If this parameter is specified, the x-y histograms that display the relationship between the features and the objective variables can be viewed. |
| Parameters Setting | Discretization Method | The method that is used for discretization. Valid values: Equal Width Discretization, Equal Frequency Discretization, Gini Gain-based Discretization, and Entropy Gain-based Discretization. |
| Parameters Setting | Discretization Interval | The number of discrete intervals. The value must be a positive integer that is greater than 1. |
| Tuning | Cores | The number of cores used in computing. The value must be a positive integer. |
| Tuning | Memory Size per Core | The memory size of each core. |
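The Gini Gain-based and Entropy Gain-based methods are supervised: they use the label column to decide where to place split points. The following Python sketch shows the general idea for a single entropy-based split. It is an illustration under stated assumptions, not the algorithm used by the component: the function names are hypothetical, only one binary split is searched, and candidate thresholds are taken as midpoints between adjacent distinct values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_entropy_split(values, labels):
    """Return (threshold, gain) for the binary split with the highest entropy gain.

    Returns (None, 0.0) if no split reduces the entropy.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_threshold, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal feature values cannot be separated by a threshold
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Entropy after the split, weighted by the size of each side.
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_threshold, best_gain


# Hypothetical feature values and labels.
threshold, gain = best_entropy_split([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"])
print(threshold, gain)
```

A Gini-based variant would follow the same structure with the entropy function replaced by a Gini impurity function.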

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name fe_discrete_runner_1 -project algo_public
   -DdiscreteMethod=SameFrequecy
   -Dlifecycle=28
   -DmaxBins=5
   -DinputTable=pai_dense_10_1
   -DdiscreteCols=nr_employed
   -DoutputTable=pai_temp_2262_20382_1
   -DmodelTable=pai_temp_2262_20382_2;

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | None |
| inputTablePartitions | No | The partitions that are selected from the input table for training. Specify this parameter in the Partition_name=value format. To specify multi-level partitions, use the name1=value1/name2=value2 format. If you specify multiple partitions, separate them with commas (,). | All partitions in the input table |
| outputTable | Yes | The output table after discretization. | None |
| discreteCols | Yes | The features that require discretization. Sparse features are automatically filtered by the system. | "" |
| labelCol | No | The label column. If this parameter is specified, the x-y histograms that display the relationship between the features and the objective variables can be viewed. | None |
| discreteMethod | No | The method that is used for discretization. Valid values: Isometric Discretization, Isofrequecy Discretization, Gini-gain-based Discretization, and Entropy-gain-based Discretization. | Isometric Discretization |
| maxBins | No | The number of discrete intervals. The value must be a positive integer that is greater than 1. | 100 |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | 7 |
| coreNum | No | The number of cores. This parameter is used together with the memSizePerCore parameter. The value must be a positive integer. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer. | Determined by the system |
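The format of the inputTablePartitions value can be easier to follow with a small example. The Python helper below is a sketch that assembles such a value from the rules described for that parameter (levels joined with a slash, partitions joined with commas). The function name and the partition names ds and region are hypothetical.

```python
def partition_spec(partitions):
    """Build a value for inputTablePartitions.

    Each partition is a list of (name, value) pairs, ordered from the
    first partition level to the last.
    """
    return ",".join(
        "/".join(f"{name}={value}" for name, value in levels)
        for levels in partitions
    )


# Two multi-level partitions (hypothetical names and values).
spec = partition_spec([
    [("ds", "20230101"), ("region", "cn")],
    [("ds", "20230102"), ("region", "cn")],
])
print(f"-DinputTablePartitions={spec}")
```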

Examples

  • Input data

    Execute the following SQL statements to generate input data:

    create table if not exists pai_dense_10_1 as
    select
        nr_employed
    from bank_data limit 10;

  • Configure the component

    The input table is pai_dense_10_1. On the Fields Setting tab, set the Discrete Features parameter to nr_employed. On the Parameters Setting tab, set the Discretization Method parameter to Equal Width Discretization and the Discretization Interval parameter to 5.

  • Execution results

    nr_employed

    4.0

    3.0

    1.0

    3.0

    2.0

    4.0

    3.0

    3.0

    2.0

    3.0