Binning - Platform For AI - Alibaba Cloud Documentation Center

The Binning component is used for feature discretization. Feature discretization is a process of converting continuous data into multiple discrete intervals. The Binning component supports equal frequency binning, equal width binning, and automated binning.

Component configurations

You can use one of the following methods to configure the Binning component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Binning component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer is formerly known as Machine Learning Studio. The following table describes the parameters.


Tab	Parameter	Description
Fields Setting	Feature Columns	Columns of the STRING, BIGINT, and DOUBLE types are supported.
	Label Column	This parameter is required only for binary classification.
	Positive Value	This parameter is valid only if the Label Column parameter is specified.
	Binning Parameter Source	Valid values: Parameters in Parameter Settings and Manual Binning or Custom JSON.
	Reserve Unselected Feature Columns	This parameter is valid only if you set the Binning Parameter Source parameter to Manual Binning or Custom JSON. If you set the Reserve Unselected Feature Columns parameter to Yes, the columns that are not specified for the Feature Columns parameter remain unchanged in the output. Otherwise, the columns that are not specified for the Feature Columns parameter are removed from the output.
	Upload Binning and Constraint JSON Code	This parameter is valid only if you set the Binning Parameter Source parameter to Manual Binning or Custom JSON.
Parameters Setting	Bins	If you set this parameter to 10, continuous features are converted into 10 discrete intervals.
	Custom Bins	You can specify the numbers of bins for specific columns. The setting of this parameter takes precedence over the setting of the Bins parameter. If a specific column is not included in the selected columns, this column is also used in binning. For example, columns col0 and col1 are selected for data binning. The number of bins customized for the col0 column is 3, and that customized for the col2 column is 5. If the Bins parameter is set to 10, binning is performed based on col0:3,col1:10,col2:5. Specify this parameter in the format of Column name 1:Number of bins,Column name 2:Number of bins.
	Custom Discrete Value Count Threshold	Specify this parameter in the col0:3 format.
	Interval Type	Valid values: Left-open, Right-closed and Left-closed, Right-open.
	Binning Mode	Valid values: Equal Frequency, Equal Width, and Automatic Binning.
	Discrete Value Count Threshold	If a value is less than this threshold, the value is distributed to the else bin.
Tuning	Cores	The number of cores. By default, the system determines the value.
Tuning	Memory Size per Core	The memory size of each core. By default, the system determines the value.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name binning
    -project algo_public
    -DinputTableName=input
    -DoutputTableName=output


Parameter	Description	Required	Default value
inputTableName	The name of the input table.	Yes	None
outputTableName	The name of the output table.	Yes	None
selectedColNames	The columns that are selected from the input table for binning.	No	All columns except the label column (If no label column exists, all columns are selected.)
labelColName	The label column.	No	None
validTableName	The name of the validation table. This parameter is required if the binningMethod parameter is set to auto.	No	Null
validTablePartitions	The partitions that are selected from the validation table.	No	Full table
inputTablePartitions	The partitions that are selected from the input table.	No	Full table
inputBinTableName	The input binning table.	No	None
selectedBinColNames	The columns that are selected from the input binning table.	No	Null
positiveLabel	Specifies whether the samples are positive samples.	No	1
nDivide	The number of bins. The value of this parameter must be a positive integer.	No	10
colsNDivide	The numbers of bins for specific columns. Specify this parameter in the format of Column name 1:Number of bins,Column name 2:Number of bins. Example: col0:3,col2:5. If the columns that are specified for the colsNDivide parameter are not included in those specified for the selectedColNames parameter, the columns are also used in binning. For example, the selectedColNames parameter is set to col0,col1, the colsNDivide parameter is set to col0:3,col2:5, and the nDivide parameter is set to 10. In this case, binning is performed based on col0:3,col1:10,col2:5.	No	Null
isLeftOpen	The interval type. Valid values: {true}: left-open, right-closed intervals {false}: left-closed, right-open intervals	No	true
stringThreshold	The threshold for discrete values in the else bin.	No	None
colsStringThreshold	The threshold for specific columns. Specify this parameter in the same format as the colsNDivide parameter.	No	Null
binningMethod	The binning mode. Valid values: quantile: indicates equal frequency binning. bucket: indicates equal width binning. auto: indicates that the system automatically selects a binning mode.	No	quantile
lifecycle	The lifecycle of the output table. The value of this parameter must be a positive integer.	No	None
coreNum	The number of cores. The value of this parameter must be a positive integer.	No	Determined by the system
memSizePerCore	The memory size of each core. The value of this parameter must be a positive integer.	No	Determined by the system

The Binning component must be used with the Scorecard Training component. During scorecard training, the Binning component converts continuous features into multiple discrete dummy variables to achieve feature engineering. You can specify constraints for the weights of the dummy variables. The following information describes the constraints:

Ascending order: Weights must be added to the dummy variables of a feature based on index values in ascending order. This indicates that a dummy variable with a greater index value has a higher weight.
Descending order: Weights must be added to the dummy variables of a feature based on index values in descending order. This indicates that a dummy variable with a greater index value has a lower weight.
Same weight: The weights of two dummy variables of a feature must be the same.
Zero weight: The weight of a dummy variable must be 0.
Specific weight: The weight of a dummy variable must be a specific floating-point value.
WOE order: Weights must be added to the dummy variables of a feature based on the weight of evidence (WOE) values in ascending order. This indicates that a dummy variable with a greater WOE value has a higher weight.

Result presentation

After the workflow that contains the Binning component finishes running, right-click the Binning component on the canvas and select Binning.
On the variable list page, you can check the Bins, Type, and IV information for each variable. The following figure shows an example of variable information.
Click the name of a variable such as f1 to go to the binning details page of the variable. The following figure shows the binning details page of f1.
You can click Merge or Split to merge or split binning data. You can also specify constraints for bins.
Note The specified constraints take effect only on the subsequent Scorecard Training component. If you use the Binning component without the Scorecard Training component, these constraints can be ignored.