The Feature Anomaly Smoothing component smooths anomalous feature values in input data into a specified interval. Both sparse and dense data are supported.

Background information

The component supports the following smoothing methods:
  • Z-Score

    If a feature follows a normal distribution, noise falls outside the range of -3×alpha to 3×alpha around the mean, where alpha is the standard deviation of the feature. Z-Score smooths the noise to the range of [-3×alpha,3×alpha] around the mean.

    For example, assume that for a feature in a normal distribution, the mean value is 0, and the standard deviation is 3. The feature value -10 is identified as anomalous and corrected to -3 × 3 + 0 (-9) based on the smoothing rule of Z-Score. In the same way, the feature value 10 is corrected to 3 × 3 + 0 (9).

  • Percentile smoothing

    Percentile smoothing is used to smooth the data distributed outside the range of [minPer, maxPer] to the minPer or maxPer quantile.

    For example, assume that the feature values of age are evenly distributed in the range of 0 to 200. Set minPer to 0 and maxPer to 50%. Feature values outside the range of 0 to 100 are corrected to 0 or 100.

  • Threshold smoothing

    Threshold smoothing is used to smooth the data distributed outside the range of [minThresh, maxThresh] to the minThresh or maxThresh data point.

    For example, assume that the feature value of age is in the range of 0 to 200. Set minThresh to 10 and maxThresh to 80. Feature values outside the range of 10 to 80 are corrected to 10 or 80.

  • Boxplot smoothing

    This method uses quartiles to smooth data to the range of [minThresh, maxThresh], where minThresh = q1 - 1.5 × (q3 - q1), maxThresh = q3 + 1.5 × (q3 - q1), and q1 and q3 are the lower and upper quartiles.

Note: The Feature Anomaly Smoothing component corrects anomalous values but does not filter or delete records. The dimensions and the number of records of the input data remain unchanged.
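
The four smoothing rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the component's implementation: the function names (clip, zscore_bounds, percentile_bounds, boxplot_bounds) are hypothetical, the percentile lookup uses a simple nearest-rank rule, and the quartiles come from Python's statistics.quantiles. The component may compute quantiles differently.

```python
import statistics


def clip(values, low, high):
    """Clamp every value into [low, high]; no record is added or removed."""
    return [min(max(v, low), high) for v in values]


def zscore_bounds(mean, std, k=3.0):
    """Bounds at k standard deviations around the mean (k=3 in the text)."""
    return mean - k * std, mean + k * std


def percentile_bounds(values, min_per, max_per):
    """Bounds at the min_per and max_per quantiles (fractions in [0, 1]).

    Assumption: nearest-rank lookup on the sorted values.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(min_per * last)], ordered[round(max_per * last)]


def boxplot_bounds(values):
    """Bounds q1 - 1.5 * (q3 - q1) and q3 + 1.5 * (q3 - q1)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Z-Score example from the text: mean 0, standard deviation 3.
print(clip([-10, 0, 10], *zscore_bounds(0, 3)))        # [-9.0, 0, 9.0]

# Percentile example: ages 0..200, minPer = 0, maxPer = 50%.
print(percentile_bounds(list(range(201)), 0.0, 0.5))   # (0, 100)

# Threshold example: minThresh = 10, maxThresh = 80.
print(clip([0, 50, 200], 10, 80))                      # [10, 50, 80]
```

In every case the output list has the same length as the input, which mirrors the note above: values are corrected, never dropped.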

Configure the component

You can use one of the following methods to configure the Feature Anomaly Smoothing component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Feature Anomaly Smoothing component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer is formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description
Fields Setting | Smoothed Feature Columns | The feature columns that you want to smooth.
Fields Setting | Label Column | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and the objective variables can be viewed.
Parameters Setting | Smoothing Method | The method that is used for smoothing. Valid values: Z-Score, Percentile, Threshold Smoothing, and Box Plot.
Parameters Setting | Confidence Interval | The confidence level. This parameter is required when the Smoothing Method parameter is set to Z-Score.
Parameters Setting | Minimum Threshold | The minimum threshold. The default value is -9999, which indicates that no minimum threshold is set. This parameter is required when the Smoothing Method parameter is set to Threshold Smoothing.
Parameters Setting | Maximum Threshold | The maximum threshold. The default value is -9999, which indicates that no maximum threshold is set. This parameter is required when the Smoothing Method parameter is set to Threshold Smoothing.
Parameters Setting | Minimum Percentile | The minimum percentile. This parameter is required when the Smoothing Method parameter is set to Percentile or Box Plot.
Parameters Setting | Maximum Percentile | The maximum percentile. This parameter is required when the Smoothing Method parameter is set to Percentile or Box Plot.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name fe_soften_runner -project algo_public
    -DminThresh=5000
    -Dlifecycle=28
    -DsoftenMethod=min-max-thresh
    -DsoftenCols=nr_employed
    -DmaxThresh=6000
    -DinputTable=pai_dense_10_1
    -DoutputTable=pai_temp_2262_20381_1;
Parameter | Required | Description | Default value
inputTable | Yes | The name of the input table. | N/A
inputTablePartitions | No | The partitions that are selected from the input table for training. Specify this parameter in the Partition_name=value format. To specify multi-level partitions, specify this parameter in the name1=value1/name2=value2; format. If you specify multiple partitions, separate them with commas (,). | All partitions in the input table
outputTable | Yes | The output table after smoothing. | N/A
labelCol | No | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and the objective variables can be viewed. | Empty string
categoryCols | No | The selected fields that are processed as enumerated features. | Empty string
softenCols | Yes | The features that you want to smooth. Sparse features are automatically displayed by the system. | N/A
softenMethod | No | The method that is used for smoothing. Valid values: ZScore (Z-Score), min-max-per (percentile smoothing), min-max-thresh (threshold smoothing), and boxplot (boxplot smoothing). | ZScore
softenTopN | No | If you do not set the softenCols parameter, the system automatically selects the top N features that require smoothing. The value must be a positive integer. | 10
cl | No | The confidence level. This parameter is required when the softenMethod parameter is set to ZScore. | 10
minPer | No | The minimum percentile. This parameter is required when the softenMethod parameter is set to min-max-per or boxplot. | 0.0
maxPer | No | The maximum percentile. This parameter is required when the softenMethod parameter is set to min-max-per or boxplot. | 1.0
minThresh | No | The minimum threshold. This parameter is required when the softenMethod parameter is set to min-max-thresh. | -9999
maxThresh | No | The maximum threshold. This parameter is required when the softenMethod parameter is set to min-max-thresh. | -9999
isSparse | No | Specifies whether features are sparse features in the key-value format. Valid values: true and false. The default value is false, which indicates that features are dense. | false
itemSpliter | No | The delimiter that is used to separate sparse key-value pairs. | ,
kvSpliter | No | The delimiter that is used to separate sparse keys and values. | :
lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | 7
coreNum | No | The number of cores. This parameter is used together with the memSizePerCore parameter. The value must be a positive integer. Valid values: [1,9999]. | Determined by the system
memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: [2048,64 × 1024]. | Determined by the system
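
To illustrate how the isSparse, itemSpliter, and kvSpliter parameters relate to each other, the following Python sketch parses a sparse key-value row, clamps each value to a threshold range, and writes the row back in the same format. It is only an illustration: the helper names (parse_kv, format_kv, smooth_sparse) and the sample row are hypothetical, and the sketch assumes that every value in the row is numeric and should be smoothed. How the component handles sparse rows internally is not described here.

```python
def parse_kv(row, item_spliter=",", kv_spliter=":"):
    """Split a sparse row such as '1:0.5,2:3.0' into a {key: value} dict."""
    features = {}
    for item in row.split(item_spliter):
        if not item:
            continue  # skip empty fragments, for example from a trailing delimiter
        key, value = item.split(kv_spliter, 1)
        features[key] = float(value)
    return features


def format_kv(features, item_spliter=",", kv_spliter=":"):
    """Join a {key: value} dict back into a sparse row string."""
    return item_spliter.join(f"{k}{kv_spliter}{v}" for k, v in features.items())


def smooth_sparse(row, low, high, item_spliter=",", kv_spliter=":"):
    """Clamp every value of a sparse row into [low, high] (threshold smoothing)."""
    features = parse_kv(row, item_spliter, kv_spliter)
    clipped = {k: min(max(v, low), high) for k, v in features.items()}
    return format_kv(clipped, item_spliter, kv_spliter)


# Hypothetical sparse row using the default delimiters "," and ":".
print(smooth_sparse("1:4991.6,2:5228.1", 5000.0, 6000.0))  # 1:5000.0,2:5228.1
```

The defaults of the two delimiter arguments match the default values of itemSpliter and kvSpliter in the table above.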

Examples

  • Input data
    create table if not exists pai_dense_10_1 as
    select
        nr_employed
    from bank_data limit 10;
    nr_employed
    5228.1
    5195.8
    4991.6
    5099.1
    5076.2
    5228.1
    5099.1
    5099.1
    5076.2
    5099.1
  • Parameter settings
    On the Fields Setting tab, set Smoothed Feature Columns to nr_employed. On the Parameters Setting tab, set Smoothing Method to Threshold Smoothing, Minimum Threshold to 5000, and Maximum Threshold to 6000. The following figure shows the configurations on the Parameters Setting tab.
    [Figure: Smoothing features]
  • Execution results
    nr_employed
    5228.1
    5195.8
    5000.0
    5099.1
    5076.2
    5228.1
    5099.1
    5099.1
    5076.2
    5099.1
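
The execution results can be checked by hand with a few lines of Python. The sketch below applies the threshold rule (minThresh=5000, maxThresh=6000) to the ten input values of this example. It only mirrors the rule and is not the code that the component runs, and the variable names are illustrative.

```python
# Input values of nr_employed from the example table.
nr_employed = [5228.1, 5195.8, 4991.6, 5099.1, 5076.2,
               5228.1, 5099.1, 5099.1, 5076.2, 5099.1]

min_thresh, max_thresh = 5000.0, 6000.0

# Threshold smoothing: values below min_thresh are raised to it,
# values above max_thresh are lowered to it, all others are kept.
smoothed = [min(max(v, min_thresh), max_thresh) for v in nr_employed]

# List the records that were corrected as (row index, before, after).
changed = [(i, before, after)
           for i, (before, after) in enumerate(zip(nr_employed, smoothed))
           if before != after]

print(smoothed)
print(changed)  # [(2, 4991.6, 5000.0)]
```

Only the third record (4991.6) lies outside the range of 5000 to 6000, so it is the only value that changes, and the number of records stays at 10.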