The Feature Softening component smooths anomalous feature values in the input data into a specified interval. Both the sparse and dense data formats are supported.

Background information

The following smoothing methods are supported:
  • Z-Score

    If a feature follows a normal distribution, noise is the data that falls outside the range of -3×alpha to 3×alpha, where alpha is the standard deviation of the feature. Z-Score smooths the noise to the range of [-3×alpha, 3×alpha].

    For example, a feature follows a normal distribution with a mean of 0 and a standard deviation of 3. The feature value -10 is identified as anomalous and corrected to -9 based on the smoothing rule of Z-Score. In the same way, the feature value 10 is corrected to 9.
  • Percentile smoothing

    Percentile smoothing is used to smooth the data distributed outside the range of [minPer, maxPer] to the minPer or maxPer quantile.

    For example, the feature value of age is in the range of 0 to 200 and the values are evenly distributed. Set minPer to 0 and maxPer to 50%. Feature values outside the range of 0 to 100 are corrected to 0 or 100.

  • Threshold smoothing

    Threshold smoothing is used to smooth the data distributed outside the range of [minThresh, maxThresh] to the minThresh or maxThresh data point.

    For example, the feature value of age is in the range of 0 to 200. Set minThresh to 10 and maxThresh to 80. Feature values outside the range of 10 to 80 are corrected to 10 or 80.

Note: The component corrects anomalous values but does not filter or delete records. The dimensions of the input data and the number of records are not changed.
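
The three rules described above all clamp a value into an interval and differ only in how the interval is derived. The following Python sketch illustrates this under stated assumptions: alpha is taken to be the standard deviation, the Z-Score interval is centered on the mean, and the percentile is computed by linear interpolation. All function names are hypothetical, and this is not the component's actual implementation.

```python
import statistics


def clamp(values, low, high):
    """Clamp every value into [low, high]; the number of values is unchanged."""
    return [min(max(v, low), high) for v in values]


def zscore_soften(values, mean=None, std=None, k=3.0):
    """Clamp to [mean - k*std, mean + k*std] (assumption: alpha is the std)."""
    mean = statistics.fmean(values) if mean is None else mean
    std = statistics.pstdev(values) if std is None else std
    return clamp(values, mean - k * std, mean + k * std)


def quantile(sorted_values, q):
    """Linear-interpolation quantile for 0 <= q <= 1 (an assumed definition)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def percentile_soften(values, min_per=0.0, max_per=1.0):
    """Clamp to the [min_per, max_per] quantiles of the data itself."""
    ordered = sorted(values)
    return clamp(values, quantile(ordered, min_per), quantile(ordered, max_per))


def threshold_soften(values, min_thresh, max_thresh):
    """Clamp to fixed, user-supplied bounds."""
    return clamp(values, min_thresh, max_thresh)


# Z-Score example from the text: mean 0, standard deviation 3.
print(zscore_soften([-10.0, 0.0, 10.0], mean=0.0, std=3.0))  # [-9.0, 0.0, 9.0]

# Threshold example from the text: bounds 10 and 80.
print(threshold_soften([5, 50, 120], 10, 80))  # [10, 50, 80]
```

In each case the output list has the same length as the input, which matches the note that records are corrected rather than removed.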

Configure the component

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Softened Feature Columns | The feature columns that you want to smooth.
    Fields Setting | Label Column | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and objective variables can be visualized.
    Parameters Setting | Soften Method | The method that is used for smoothing. Valid values: Z-Score, Percentile, Threshold, and Boxplot Smoothing.
    Parameters Setting | Confidence Interval | The confidence level. This parameter is required when the Soften Method parameter is set to Z-Score.
    Parameters Setting | Minimum Threshold | The minimum threshold. The default value is -9999, which indicates no minimum threshold. This parameter is required when the Soften Method parameter is set to Threshold.
    Parameters Setting | Maximum Threshold | The maximum threshold. The default value is -9999, which indicates no maximum threshold. This parameter is required when the Soften Method parameter is set to Threshold.
    Parameters Setting | Minimum Percentile | The minimum percentile. This parameter is required when the Soften Method parameter is set to Percentile or Boxplot Smoothing.
    Parameters Setting | Maximum Percentile | The maximum percentile. This parameter is required when the Soften Method parameter is set to Percentile or Boxplot Smoothing.

  • Use commands
    PAI -name fe_soften_runner -project algo_public
        -DminThresh=5000
        -Dlifecycle=28
        -DsoftenMethod=min-max-thresh
        -DsoftenCols=nr_employed
        -DmaxThresh=6000
        -DinputTable=pai_dense_10_1
        -DoutputTable=pai_temp_2262_20381_1;
    Parameter | Required | Description | Default value
    inputTable | Yes | The name of the input table. | N/A
    inputTablePartitions | No | The partitions that are selected from the input table for training. Set this parameter in the Partition_name=value format. To specify multi-level partitions, set this parameter in the name1=value1/name2=value2 format. If you specify multiple partitions, separate them with commas (,). | All partitions in the input table
    outputTable | Yes | The output table after smoothing. | N/A
    labelCol | No | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and objective variables can be visualized. | Empty string
    categoryCols | No | The selected fields that are processed as enumerated features. | Empty string
    softenCols | Yes | The features that you want to smooth. Sparse features are automatically filtered by the system. | N/A
    softenMethod | No | The method that is used for smoothing. Valid values: Z-Score (Z-Score smoothing), min-max-per (percentile smoothing), min-max-thresh (threshold smoothing), and boxplot (boxplot smoothing). | Z-Score
    softenTopN | No | If you do not set the softenCols parameter, the system automatically selects the top N features that require smoothing. The value must be a positive integer. | 10
    cl | No | The confidence level. This parameter is required when the softenMethod parameter is set to Z-Score. | 10
    minPer | No | The minimum percentile. This parameter is required when the softenMethod parameter is set to Percentile (min-max-per) or Boxplot Smoothing (boxplot). | 0.0
    maxPer | No | The maximum percentile. This parameter is required when the softenMethod parameter is set to Percentile (min-max-per) or Boxplot Smoothing (boxplot). | 1.0
    minThresh | No | The minimum threshold. This parameter is required when the softenMethod parameter is set to Threshold (min-max-thresh). | -9999
    maxThresh | No | The maximum threshold. This parameter is required when the softenMethod parameter is set to Threshold (min-max-thresh). | -9999
    isSparse | No | Specifies whether features are sparse features in the key-value format. Valid values: true and false. The default value is false, indicating that features are in the dense format. | false
    itemSpliter | No | The delimiter that is used to separate key-value pairs in the sparse format. | ,
    kvSpliter | No | The delimiter that is used to separate keys and values in the sparse format. | :
    lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | 7
    coreNum | No | The number of cores. This parameter must be used with the memSizePerCore parameter. The value of this parameter must be a positive integer. Valid values: [1,9999]. | Determined by the system
    memSizePerCore | No | The memory size of each core. Unit: MB. The value of this parameter must be a positive integer. Valid values: [2048,64 × 1024]. | Determined by the system
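
The isSparse, itemSpliter, and kvSpliter parameters describe a key-value layout for sparse features. The following Python sketch shows how a sparse row that uses the default delimiters might be parsed, smoothed by threshold, and serialized again. The sample row, the feature keys, and the function names are hypothetical, and treating -9999 as "no bound" follows the description of the threshold defaults. How the component itself processes sparse rows is not documented here, so this is only an illustration of the format.

```python
NO_THRESHOLD = -9999  # documented default, interpreted here as "no bound"


def parse_sparse(row, item_spliter=",", kv_spliter=":"):
    """Parse 'k1:v1,k2:v2' into {'k1': v1, 'k2': v2} with float values."""
    pairs = (item.split(kv_spliter, 1) for item in row.split(item_spliter) if item)
    return {key: float(value) for key, value in pairs}


def soften_sparse(row, keys, min_thresh=NO_THRESHOLD, max_thresh=NO_THRESHOLD,
                  item_spliter=",", kv_spliter=":"):
    """Clamp the values of the selected keys; other keys pass through unchanged."""
    features = parse_sparse(row, item_spliter, kv_spliter)
    for key in keys:
        if key not in features:
            continue  # a key that is absent from a sparse row stays absent
        value = features[key]
        if min_thresh != NO_THRESHOLD:
            value = max(value, float(min_thresh))
        if max_thresh != NO_THRESHOLD:
            value = min(value, float(max_thresh))
        features[key] = value
    return item_spliter.join(f"{k}{kv_spliter}{v}" for k, v in features.items())


# Hypothetical sparse row: feature "1" is smoothed, feature "2" is left alone.
print(soften_sparse("1:4991.6,2:0.5", keys=["1"], min_thresh=5000, max_thresh=6000))
# 1:5000.0,2:0.5
```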

Example

  • Input data
    create table if not exists pai_dense_10_1 as
    select
        nr_employed
    from bank_data limit 10;
    nr_employed
    5228.1
    5195.8
    4991.6
    5099.1
    5076.2
    5228.1
    5099.1
    5099.1
    5076.2
    5099.1
  • Parameter settings

    On the Fields Setting tab, set the Softened Feature Columns parameter to nr_employed. On the Parameters Setting tab, set the Soften Method to Threshold, Minimum Threshold to 5000, and Maximum Threshold to 6000.

  • Result
    nr_employed
    5228.1
    5195.8
    5000.0
    5099.1
    5076.2
    5228.1
    5099.1
    5099.1
    5076.2
    5099.1
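
The result above can be checked with a few lines of Python. This sketch only re-applies the threshold rule to the ten input values listed in the example; it does not call the component, and the variable names are illustrative.

```python
# Input values of nr_employed, copied from the example table.
nr_employed = [5228.1, 5195.8, 4991.6, 5099.1, 5076.2,
               5228.1, 5099.1, 5099.1, 5076.2, 5099.1]

min_thresh, max_thresh = 5000.0, 6000.0

# Threshold smoothing: clamp each value into [min_thresh, max_thresh].
softened = [min(max(v, min_thresh), max_thresh) for v in nr_employed]

# List the (row index, original value, corrected value) of every change.
changed = [(i, before, after)
           for i, (before, after) in enumerate(zip(nr_employed, softened))
           if before != after]

print(softened)
print(changed)  # [(2, 4991.6, 5000.0)]
```

Only the third record (4991.6) falls below the minimum threshold, so it is the only value that changes, and the record count stays at 10.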