The Feature Softening component smooths anomalous feature values in the input data into a specified interval. Both the sparse and dense data formats are supported.
Background information
- Z-Score
If a feature follows a normal distribution, noise is expected to fall outside the range of [-3×alpha, 3×alpha], where alpha is the standard deviation. Z-Score smoothing corrects such values to the nearest boundary of [-3×alpha, 3×alpha].
For example, a feature follows a normal distribution with a mean of 0 and a standard deviation of 3. The feature value -10 is identified as anomalous and corrected to -9 based on the Z-Score smoothing rule. In the same way, the feature value 10 is corrected to 9.
- Percentile smoothing
Percentile smoothing corrects data that falls outside the range of [minPer, maxPer] to the minPer or maxPer quantile.
For example, the feature value of age is in the range of 0 to 200. Set minPer to 0 and maxPer to 50%. If the values are evenly distributed, the 50% quantile is 100, so feature values outside the range of 0 to 100 are corrected to 0 or 100.
- Threshold smoothing
Threshold smoothing corrects data that falls outside the range of [minThresh, maxThresh] to the minThresh or maxThresh value.
For example, the feature value of age is in the range of 0 to 200. Set minThresh to 10 and maxThresh to 80. Feature values outside the range of 10 to 80 are corrected to 10 or 80.
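The three smoothing rules above all amount to clamping each value to an interval. The following is a minimal Python sketch of that idea, not the component's actual implementation. The function names are hypothetical, the Z-Score interval is centered on the mean (which matches the description above when the mean is 0), and the quantile is computed by linear interpolation, which is an assumption because the exact quantile rule is not stated here.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return min(max(value, low), high)


def quantile(sorted_values, q):
    """Quantile by linear interpolation (an assumed definition)."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)  # pos is non-negative, so int() acts as floor
    upper = min(lower + 1, len(sorted_values) - 1)
    span = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + (pos - lower) * span


def soften_zscore(values, mean, std, k=3.0):
    """Clamp values to [mean - k*std, mean + k*std]."""
    return [clamp(v, mean - k * std, mean + k * std) for v in values]


def soften_percentile(values, min_per, max_per):
    """Clamp values to the [min_per, max_per] quantiles of the data."""
    ordered = sorted(values)
    low = quantile(ordered, min_per)
    high = quantile(ordered, max_per)
    return [clamp(v, low, high) for v in values]


def soften_threshold(values, min_thresh, max_thresh):
    """Clamp values to the fixed interval [min_thresh, max_thresh]."""
    return [clamp(v, min_thresh, max_thresh) for v in values]


# Z-Score: mean 0 and standard deviation 3, so the interval is [-9, 9].
zscore_result = soften_zscore([-10, 0, 10], mean=0, std=3)

# Percentile: ages 0..200, evenly spread, clamped to the 0% and 50% quantiles.
ages = list(range(0, 201))
percentile_result = soften_percentile(ages, 0.0, 0.5)

# Threshold: clamp to the fixed interval [10, 80].
threshold_result = soften_threshold([5, 40, 120], 10, 80)

print(zscore_result)
print(min(percentile_result), max(percentile_result))
print(threshold_result)
```

In this sketch, values that already lie inside the interval pass through unchanged, and only the out-of-range values are moved to the nearest boundary.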
Configure the component
- Use the Machine Learning Platform for AI console
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Softened Feature Columns | The feature columns that you want to smooth. |
| Fields Setting | Label Column | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and objective variables can be visualized. |
| Parameters Setting | Soften Method | The method that is used for smoothing. Valid values: Z-Score, Percentile, Threshold, and Boxplot Smoothing. |
| Parameters Setting | Confidence Interval | The confidence level. This parameter is required when the Soften Method parameter is set to Z-Score. |
| Parameters Setting | Minimum Threshold | The minimum threshold. The default value is -9999, which indicates no minimum threshold. This parameter is required when the Soften Method parameter is set to Threshold. |
| Parameters Setting | Maximum Threshold | The maximum threshold. The default value is -9999, which indicates no maximum threshold. This parameter is required when the Soften Method parameter is set to Threshold. |
| Parameters Setting | Minimum Percentile | The minimum percentile. This parameter is required when the Soften Method parameter is set to Percentile or Boxplot Smoothing. |
| Parameters Setting | Maximum Percentile | The maximum percentile. This parameter is required when the Soften Method parameter is set to Percentile or Boxplot Smoothing. |
- Use commands
PAI -name fe_soften_runner -project algo_public -DminThresh=5000 -Dlifecycle=28 -DsoftenMethod=min-max-thresh -DsoftenCols=nr_employed -DmaxThresh=6000 -DinputTable=pai_dense_10_1 -DoutputTable=pai_temp_2262_20381_1;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | N/A |
| inputTablePartitions | No | The partitions that are selected from the input table for training. Set this parameter in the Partition_name=value format. To specify multi-level partitions, set this parameter in the name1=value1/name2=value2 format. If you specify multiple partitions, separate them with commas (,). | All partitions in the input table |
| outputTable | Yes | The output table after smoothing. | N/A |
| labelCol | No | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and objective variables can be visualized. | Empty string |
| categoryCols | No | The selected fields that are processed as enumerated features. | Empty string |
| softenCols | Yes | The features that you want to smooth. Sparse features are automatically filtered by the system. | N/A |
| softenMethod | No | The method that is used for smoothing. Valid values: Z-Score (Z-Score smoothing), min-max-per (percentile smoothing), min-max-thresh (threshold smoothing), and boxplot (boxplot smoothing). | Z-Score |
| softenTopN | No | If you do not set the softenCols parameter, the system automatically selects the top N features that require smoothing. The value must be a positive integer. | 10 |
| cl | No | The confidence level. This parameter is required when the softenMethod parameter is set to Z-Score. | 10 |
| minPer | No | The minimum percentile. This parameter is required when the softenMethod parameter is set to min-max-per (Percentile) or boxplot (Boxplot Smoothing). | 0.0 |
| maxPer | No | The maximum percentile. This parameter is required when the softenMethod parameter is set to min-max-per (Percentile) or boxplot (Boxplot Smoothing). | 1.0 |
| minThresh | No | The minimum threshold. This parameter is required when the softenMethod parameter is set to min-max-thresh (Threshold). | -9999 |
| maxThresh | No | The maximum threshold. This parameter is required when the softenMethod parameter is set to min-max-thresh (Threshold). | -9999 |
| isSparse | No | Specifies whether features are sparse features in the key-value format. Valid values: true and false. The default value is false, indicating that features are in the dense format. | false |
| itemSpliter | No | The delimiter that is used to separate key-value pairs in the sparse format. | , (comma) |
| kvSpliter | No | The delimiter that is used to separate keys and values in the sparse format. | : (colon) |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | 7 |
| coreNum | No | The number of cores. This parameter must be used with the memSizePerCore parameter. The value of this parameter must be a positive integer. Valid values: [1,9999]. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. The value of this parameter must be a positive integer. Valid values: [2048,64 × 1024]. | Determined by the system |
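To make the roles of itemSpliter and kvSpliter concrete, the following Python sketch parses one sparse row, smooths a single feature, and writes the row back. It is only an illustration: the row layout "key:value,key:value", the feature names, and the helper functions are assumptions and are not taken from the component itself.

```python
def parse_sparse(row, item_spliter=",", kv_spliter=":"):
    """Split a sparse row into a {key: float value} dictionary."""
    features = {}
    for item in row.split(item_spliter):
        if not item.strip():
            continue  # skip empty fragments, for example a trailing delimiter
        key, value = item.split(kv_spliter, 1)
        features[key.strip()] = float(value)
    return features


def format_sparse(features, item_spliter=",", kv_spliter=":"):
    """Join a {key: value} dictionary back into a sparse row."""
    return item_spliter.join(
        f"{key}{kv_spliter}{value}" for key, value in features.items()
    )


# Hypothetical sparse row that uses the default delimiters.
row = "age:250,income:4200.5"
features = parse_sparse(row)

# Apply threshold smoothing to one feature with the interval [10, 80].
features["age"] = min(max(features["age"], 10), 80)

softened_row = format_sparse(features)
print(softened_row)
```

If the data uses other delimiters, the same sketch applies with different item_spliter and kv_spliter arguments.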
Example
- Input data
create table if not exists pai_dense_10_1 as select nr_employed from bank_data limit 10;
| nr_employed |
| --- |
| 5228.1 |
| 5195.8 |
| 4991.6 |
| 5099.1 |
| 5076.2 |
| 5228.1 |
| 5099.1 |
| 5099.1 |
| 5076.2 |
| 5099.1 |

- Parameter settings
On the Fields Setting tab, set the Softened Feature Columns parameter to nr_employed. On the Parameters Setting tab, set the Soften Method to Threshold, Minimum Threshold to 5000, and Maximum Threshold to 6000.
- Result
| nr_employed |
| --- |
| 5228.1 |
| 5195.8 |
| 5000.0 |
| 5099.1 |
| 5076.2 |
| 5228.1 |
| 5099.1 |
| 5099.1 |
| 5076.2 |
| 5099.1 |
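The result above can be reproduced outside the platform with a short Python sketch. This is an illustration under stated assumptions rather than the component's code: the function name is hypothetical, and -9999 is treated as "no bound on this side" based on the description of the default threshold values.

```python
NO_THRESHOLD = -9999  # documented default, interpreted here as "no bound"


def apply_thresholds(values, min_thresh=NO_THRESHOLD, max_thresh=NO_THRESHOLD):
    """Correct values below min_thresh or above max_thresh to the bound."""
    result = []
    for value in values:
        if min_thresh != NO_THRESHOLD and value < min_thresh:
            value = float(min_thresh)
        elif max_thresh != NO_THRESHOLD and value > max_thresh:
            value = float(max_thresh)
        result.append(value)
    return result


# Input column from the example, with minThresh=5000 and maxThresh=6000.
nr_employed = [5228.1, 5195.8, 4991.6, 5099.1, 5076.2,
               5228.1, 5099.1, 5099.1, 5076.2, 5099.1]
softened = apply_thresholds(nr_employed, min_thresh=5000, max_thresh=6000)

for value in softened:
    print(value)
```

Only 4991.6 lies below the minimum threshold of 5000, so it is the only value that changes. No value exceeds 6000, so the maximum threshold has no effect on this data.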