Feature Discretization - Platform For AI - Alibaba Cloud Documentation Center

Feature discretization is a data preprocessing technique in machine learning that converts continuous features into discrete features. By applying a specific rule (such as equal frequency or equal width), it divides continuous numerical data into a limited number of discrete intervals or categories, which makes the data easier for models to handle and analyze. This transformation can improve the performance of certain algorithms, particularly on classification problems.
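The difference between the two unsupervised rules named above can be illustrated with a minimal Python sketch. It is not the component's implementation: the function names are hypothetical, bin indexes are 0-based here, and the way boundary values and ties are handled is an assumption made for illustration.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior cut points chosen so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[(n * i) // n_bins] for i in range(1, n_bins)]
    # Repeated values can yield identical cut points; keep each one once.
    return sorted(set(edges))


def assign_bins(values, edges):
    """Map each value to the 0-based index of the interval it falls into."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical sample with one outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(assign_bins(sample, equal_width_edges(sample, 3)))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]

print(assign_bins(sample, equal_frequency_edges(sample, 3)))
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

With equal width, the outlier stretches the range so that most values land in the first interval. With equal frequency, the intervals adapt to the data so that each one holds a similar number of values.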

Overview

The Feature Discretization component supports the following types of discretization:

Discretization of dense features that are of numeric data types.
Unsupervised discretization, such as equal frequency discretization and equal width discretization.
Note: The default unsupervised discretization method is equal width discretization.
Supervised discretization, such as Gini gain-based discretization and entropy gain-based discretization.
Note: The label column that is used for supervised discretization must be of the ENUM, STRING, or BIGINT data type. Supervised discretization searches for segmentation points based on entropy gains by repeatedly traversing the data, so it may take a long time to run. The number of bins that are obtained after segmentation is not limited by the value of the maxBins parameter.
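To show why a traversal-based search can be slow, the following Python sketch finds a single segmentation point by entropy gain. It is a simplified illustration, not the component's algorithm: the function names are hypothetical, and the sketch assumes that every boundary between two distinct sorted values is tried as a candidate.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_entropy_split(values, labels):
    """Try every boundary between distinct sorted values and return the
    (threshold, gain) pair with the largest entropy gain."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_threshold, best_gain


# Hypothetical feature values and labels.
threshold, gain = best_entropy_split(
    [1, 2, 3, 10, 11, 12],
    ["a", "a", "a", "b", "b", "b"],
)
print(threshold, gain)  # 6.5 1.0
```

Each candidate requires recomputing the entropy of both sides, and a full discretization would repeat the search inside each resulting interval, which is one plausible reason the run time grows quickly on large tables.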

Configure the component

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Feature Discretization component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Discrete Features	The features that require discretization.
Fields Setting	Label Column	The label column. If you specify this parameter, you can view x-y histograms that show the relationship between the features and the target variable.
Parameters Setting	Discretization Method	The method that is used for discretization. Valid values: Equal Width Discretization, Equal Frequency Discretization, Gini Gain-based Discretization, and Entropy Gain-based Discretization. We recommend that you use Equal Width Discretization or Equal Frequency Discretization. The other two methods, Gini Gain-based Discretization and Entropy Gain-based Discretization, should be regarded as experimental. If you need the weight of evidence (WOE) metric, see Binning.
Parameters Setting	Discretization Interval	The number of discrete intervals. The value must be a positive integer that is greater than 1.
Tuning	Cores	The number of cores used in computing. The value must be a positive integer.
Tuning	Memory Size per Core	The memory size of each core.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name fe_discrete_runner_1 -project algo_public
   -DdiscreteMethod=SameFrequecy
   -Dlifecycle=28
   -DmaxBins=5
   -DinputTable=pai_dense_10_1
   -DdiscreteCols=nr_employed
   -DoutputTable=pai_temp_2262_20382_1
   -DmodelTable=pai_temp_2262_20382_2;

Parameter	Required	Description	Default value
inputTable	Yes	The name of the input table.	None
inputTablePartitions	No	The partitions that are selected from the input table for training. Specify this parameter in the `Partition_name=value` format. To specify multi-level partitions, specify this parameter in the `name1=value1/name2=value2;` format. If you specify multiple partitions, separate them with commas (,).	All partitions in the input table
outputTable	Yes	The output table after discretization.	None
discreteCols	Yes	The features that require discretization. Sparse features are automatically filtered by the system.	""
labelCol	No	The label column. If you specify this parameter, you can view x-y histograms that show the relationship between the features and the target variable.	None
discreteMethod	No	The method that is used for discretization. Valid values: Isometric Discretization (equal width), Isofrequency Discretization (equal frequency), Gini-gain-based Discretization, and Entropy-gain-based Discretization.	Isometric Discretization
maxBins	No	The number of discrete intervals. The value must be a positive integer that is greater than 1.	100
lifecycle	No	The lifecycle of the output table. The value must be a positive integer.	7
coreNum	No	The number of cores. This parameter is used together with the memSizePerCore parameter. The value must be a positive integer.	Determined by the system
memSizePerCore	No	The memory size of each core. Unit: MB. The value must be a positive integer.	Determined by the system
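When the command is generated from a script, a small helper can assemble it from the parameters in the table. The Python sketch below only builds text in the format shown above and does not call any PAI API. The function names and the partition names `ds` and `hour` are hypothetical, and the maxBins check simply mirrors the constraint stated in the table.

```python
def partition_spec(partitions):
    """Format partitions for inputTablePartitions: the levels of one partition
    are joined with '/', and multiple partitions are joined with ','."""
    return ",".join(
        "/".join(f"{name}={value}" for name, value in levels)
        for levels in partitions
    )


def build_pai_command(name, project, params):
    """Assemble a PAI command string from a dict of -D parameters."""
    max_bins = params.get("maxBins")
    if max_bins is not None and (not isinstance(max_bins, int) or max_bins <= 1):
        raise ValueError("maxBins must be a positive integer greater than 1")
    lines = [f"PAI -name {name} -project {project}"]
    for key, value in params.items():
        if value is None:
            continue  # leave optional parameters out so their defaults apply
        lines.append(f"   -D{key}={value}")
    return "\n".join(lines) + ";"


command = build_pai_command(
    "fe_discrete_runner_1",
    "algo_public",
    {
        "inputTable": "pai_dense_10_1",
        # Hypothetical two-level partitions, for illustration only.
        "inputTablePartitions": partition_spec(
            [[("ds", "20240101"), ("hour", "00")], [("ds", "20240102"), ("hour", "00")]]
        ),
        "discreteCols": "nr_employed",
        "maxBins": 5,
        "labelCol": None,
        "outputTable": "pai_temp_2262_20382_1",
    },
)
print(command)
```

The resulting string can then be pasted into the SQL Script component in the same way as the handwritten command.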

Examples

Input data

Execute the following SQL statements to generate input data:

create table if not exists pai_dense_10_1 as
select
    nr_employed
from bank_data limit 10;

Configure the component

The input table is pai_dense_10_1. On the Fields Setting tab, set the Discrete Features parameter to nr_employed. On the Parameters Setting tab, set the Discretization Method parameter to Equal Width Discretization and the Discretization Interval parameter to 5.

Execution results

nr_employed
4.0
3.0
1.0
3.0
2.0
4.0
3.0
3.0
2.0
3.0
