The Feature Discretization component discretizes continuous features based on a specific rule.
Overview
- Discretization of features that are of numeric data types and in the dense format
- Unsupervised discretization, such as equal frequency discretization and equal width discretization
Note: The default unsupervised discretization method is equal width discretization.
- Supervised discretization, such as Gini gain-based discretization and entropy gain-based discretization
Note: The label used for supervised discretization must be of the ENUM, STRING, or BIGINT data type.
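The two unsupervised methods differ only in how cut points are chosen. The following Python sketch illustrates the general idea and is not the component's implementation. The function names are hypothetical, and the 0-based interval indexes are an assumption that need not match the encoding of the component's output table.

```python
from bisect import bisect_right


def equal_width_cuts(values, bins):
    """Return bins-1 cut points that split [min, max] into equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def equal_frequency_cuts(values, bins):
    """Return cut points so that each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, bins):
        cut = ordered[(i * n) // bins]
        if not cuts or cut > cuts[-1]:  # skip duplicate cut points
            cuts.append(cut)
    return cuts


def apply_cuts(values, cuts):
    """Map each value to the 0-based index of its interval, using existing cut points."""
    return [bisect_right(cuts, v) for v in values]


# Hypothetical sample with one outlier (100.0).
train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
width_cuts = equal_width_cuts(train, 2)      # one cut in the middle of the value range
freq_cuts = equal_frequency_cuts(train, 2)   # one cut near the median
print(apply_cuts(train, width_cuts))
print(apply_cuts(train, freq_cuts))
```

With an outlier in the data, equal width places almost all values in one interval, while equal frequency splits them evenly. Cut points computed once can be passed to `apply_cuts` again for any later dataset.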

- When you discretize features for prediction, you must apply the same discretization model that was generated during training. This ensures that prediction data is segmented in the same way as training data.
- Supervised discretization searches for segmentation points by repeatedly traversing candidate points and evaluating their entropy gains. This type of discretization may take a long time to run. The number of bins that are obtained after segmentation is not limited by the value of the maxBins parameter.
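The search for a segmentation point can be pictured as follows: sort the feature values, evaluate every candidate boundary, and keep the one with the largest entropy gain. The Python sketch below finds a single split only. It is a generic illustration with hypothetical function names and sample data, not the component's actual algorithm. A full discretizer would typically repeat the search on each side of the split, which is one reason such methods can be slow.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def best_entropy_split(values, labels):
    """Try every midpoint between distinct sorted values.

    Returns (cut, gain) for the boundary with the highest entropy gain,
    or (None, 0.0) if no boundary improves on the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    base = entropy([y for _, y in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no boundary exists between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_cut, best_gain


# Hypothetical data: low values are labeled "no", high values "yes".
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]
cut, gain = best_entropy_split(values, labels)
print(cut, gain)
```

A Gini gain-based variant would follow the same traversal and replace `entropy` with a Gini impurity function.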
Configure the component
- Use the Machine Learning Platform for AI console
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Discrete Features | The features that require discretization. Sparse features are automatically filtered by the system. |
| Fields Setting | Label Column | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and objective variables can be visualized. |
| Parameters Setting | Discretization Method | The method that is used for discretization. Valid values: Isometric Discretization, Isofrequency Discretization, Gini-gain-based Discretization, and Entropy-gain-based Discretization. |
| Parameters Setting | Discrete Interval | The number of intervals (bins) for discretization. The value of this parameter must be a positive integer greater than 1. |
| Tuning | Number of Computing Cores | The number of cores used in computing. Set the parameter to a positive integer. |
| Tuning | Memory Size per Core | The memory size of each core. |

- Use commands
PAI -name fe_discrete_runner_1 -project algo_public -DdiscreteMethod=SameFrequecy -Dlifecycle=28 -DmaxBins=5 -DinputTable=pai_dense_10_1 -DdiscreteCols=nr_employed -DoutputTable=pai_temp_2262_20382_1 -DmodelTable=pai_temp_2262_20382_2;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | N/A |
| inputTablePartitions | No | The partitions that are selected from the input table for training. Set this parameter in the Partition_name=value format. To specify multi-level partitions, set this parameter in the name1=value1/name2=value2; format. If you specify multiple partitions, separate them with commas (,). | All partitions in the input table |
| outputTable | Yes | The output table after discretization. | N/A |
| discreteCols | Yes | The features that require discretization. Sparse features are automatically filtered by the system. | "" |
| labelCol | No | The label column. If this field is specified, the x-y histogram that displays the relationship between the features and objective variables can be visualized. | N/A |
| categoryCols | No | The selected fields that are processed as enumerated features. These fields do not support discretization. | Empty string |
| discreteMethod | No | The method that is used for discretization. Valid values: Isometric Discretization, Isofrequency Discretization, Gini-gain-based Discretization, and Entropy-gain-based Discretization. | Isometric Discretization |
| discreteTopN | No | If you do not set the discreteCols parameter, the system automatically selects the top N features that require discretization. The value must be a positive integer. | 10 |
| maxBins | No | The number of intervals (bins) for discretization. The value of this parameter must be a positive integer greater than 1. | 100 |
| isSparse | No | Specifies whether features are sparse features in the key-value format. Valid values: true and false. The value false indicates that features are in the dense format. | false |
| itemSpliter | No | The delimiter that is used to separate key-value pairs in the sparse format. | , |
| kvSpliter | No | The delimiter that is used to separate keys and values in the sparse format. | : |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | 7 |
| coreNum | No | The number of cores. This parameter must be used with the memSizePerCore parameter. The value of this parameter must be a positive integer. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer. | Determined by the system |
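To make the roles of itemSpliter and kvSpliter concrete, the following Python sketch splits one sparse row into key-value pairs by using the default delimiters. The function name and the sample row `1:0.5,3:2.0,7:10` are hypothetical and chosen only for illustration. How the component itself parses sparse input is not described here.

```python
def parse_sparse_row(row, item_spliter=",", kv_spliter=":"):
    """Split a key-value string such as "1:0.5,3:2.0" into a {key: float} dict."""
    features = {}
    for item in row.split(item_spliter):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments, for example a trailing delimiter
        key, value = item.split(kv_spliter, 1)
        features[key] = float(value)
    return features


row = "1:0.5,3:2.0,7:10"
parsed = parse_sparse_row(row)
print(parsed)

# The same helper with non-default delimiters.
custom = parse_sparse_row("a=1;b=2", item_spliter=";", kv_spliter="=")
print(custom)
```

Keys that are absent from a row are simply missing from the resulting dictionary, which is the usual meaning of a sparse representation.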
Example
- Input data
Execute the following SQL statement to generate input data:
create table if not exists pai_dense_10_1 as select nr_employed from bank_data limit 10;
- Parameter settings
The input table is pai_dense_10_1. On the Fields Setting tab, set the Discrete Features parameter to nr_employed. On the Parameters Setting tab, set the Discretization Method parameter to Isometric Discretization and the Discrete Interval parameter to 5.
- Result
| nr_employed |
| --- |
| 4.0 |
| 3.0 |
| 1.0 |
| 3.0 |
| 2.0 |
| 4.0 |
| 3.0 |
| 3.0 |
| 2.0 |
| 3.0 |