The Data Conversion Module component performs normalization, discretization, indexation, or weight of evidence (WOE) conversion on data.

Configure the component

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Feature Columns in Input Table | The feature columns that are selected from the input table. By default, all columns in the input table are selected.
    Fields Setting | Columns without Data Conversion | The columns that do not require data conversion. These columns are copied to the output unchanged. You can include label columns among them.
    Fields Setting | Data Conversion Mode | Valid values: Normalization, Discretization, WOE Conversion, and Index.
    Fields Setting | Default WOE Value | This parameter is valid only if the Data Conversion Mode parameter is set to WOE Conversion. If this parameter is specified and a sample value falls into a bin without WOE values, this value is used as the WOE value. If this parameter is not specified and a sample value falls into a bin without WOE values, the system reports an error.
    Tuning | Number of Cores | The number of CPU cores that are required. By default, the system determines the value.
    Tuning | Memory Size per Core | The memory size of each CPU core. By default, the system determines the value.
  • Use commands
    PAI -name data_transform
    -project algo_public
    -DinputFeatureTableName=feature_table
    -DinputBinTableName=bin_table
    -DoutputTableName=output_table
    -DmetaColNames=label
    -DfeatureColNames=feaname1,feaname2
    Parameter | Description | Required | Default value
    inputFeatureTableName | The name of the input feature table. | Yes | N/A
    inputBinTableName | The name of the binning result table. | Yes | N/A
    inputFeatureTablePartitions | The partitions that are selected from the input feature table. | No | Full table
    outputTableName | The name of the output table. | Yes | N/A
    featureColNames | The feature columns that are selected from the input table. | No | All columns
    metaColNames | The columns that do not need to be converted. These columns are copied to the output unchanged. You can include columns such as label and sample_id. | No | N/A
    transformType | The type of data conversion. Valid values: normalize (normalization), dummy (discretization), and woe (WOE conversion). | No | dummy
    itemDelimiter | The delimiter that is used to separate features. This parameter is valid only if the transformType parameter is set to dummy. | No | ,
    kvDelimiter | The delimiter that is used to separate keys and values. This parameter is valid only if the transformType parameter is set to dummy. | No | :
    lifecycle | The lifecycle of the output table. | No | N/A
    coreNum | The number of CPU cores that are required. | No | Determined by the system
    memSizePerCore | The memory size of each CPU core. Unit: MB. | No | Determined by the system
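If you generate the command from a script, the parameters listed above can be assembled programmatically. The following Python snippet is a minimal sketch, not part of the platform: the helper name build_data_transform_command is hypothetical, and the snippet only formats the command text. It does not submit the command.

```python
def build_data_transform_command(params, project="algo_public"):
    """Assemble the text of a PAI data_transform command (hypothetical helper)."""
    # These three parameters are marked as required in the parameter list.
    required = ("inputFeatureTableName", "inputBinTableName", "outputTableName")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    lines = ["PAI -name data_transform", f"-project {project}"]
    # Each parameter is passed as -D<name>=<value>, as in the sample command.
    lines.extend(f"-D{name}={value}" for name, value in params.items())
    return "\n".join(lines)


command = build_data_transform_command({
    "inputFeatureTableName": "feature_table",
    "inputBinTableName": "bin_table",
    "outputTableName": "output_table",
    "metaColNames": "label",
    "featureColNames": "feaname1,feaname2",
    "transformType": "woe",
})
print(command)
```

The table and column names are the placeholders from the sample command. Optional parameters that are left out of the dictionary fall back to the default values in the parameter list.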
To implement normalization, the Data Conversion Module component converts variable values into values between 0 and 1 based on the input binning information. Missing values and raw values of 0 are set to 0. The following algorithm is used:
if feature_raw_value == null or feature_raw_value == 0 then
    feature_norm_value = 0.0
else
    bin_index = FindBin(bin_table, feature_raw_value)
    bin_width = round(1.0 / bin_count * 1000) / 1000.0
    feature_norm_value = 1.0 - (bin_count - bin_index - 1) * bin_width
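
The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the component's actual implementation: the binning table is represented as a sorted list of bin upper bounds, bin_index is assumed to be 0-based, and find_bin is a hypothetical stand-in for FindBin, whose real lookup rules are not described here.

```python
from bisect import bisect_left


def find_bin(upper_bounds, value):
    """Hypothetical stand-in for FindBin.

    Returns the 0-based index of the first bin whose upper bound is >= value.
    Values above the last bound are placed in the last bin (an assumption).
    """
    index = bisect_left(upper_bounds, value)
    return min(index, len(upper_bounds) - 1)


def normalize(value, upper_bounds):
    """Map a raw feature value into the range 0 to 1 based on its bin."""
    # Missing values and raw zeros map to 0.0, as in the pseudocode.
    if value is None or value == 0:
        return 0.0
    bin_count = len(upper_bounds)
    bin_index = find_bin(upper_bounds, value)
    # The bin width is rounded to three decimal places.
    bin_width = round(1.0 / bin_count * 1000) / 1000.0
    return 1.0 - (bin_count - bin_index - 1) * bin_width


bounds = [10, 20, 30, 40]  # four bins, so bin_width is 0.25
print(normalize(None, bounds), normalize(5, bounds), normalize(35, bounds))
```

With a 0-based index, the first bin maps to roughly one bin width and the last bin maps to exactly 1.0, which keeps every result between 0 and 1.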
The output format of the Data Conversion Module component depends on the conversion type:
  • For normalization and WOE conversion, the component generates a regular table.
  • For discretization, in which data is converted into dummy variables, the component generates a table in the key-value format. Each variable in the table is named in the [${feaname}]_bin_${bin_id} format. In the following example, the sns variable is used:
    • If sns falls into the second bin, the generated variable is [sns]_bin_2.
    • If sns does not have a value, it falls into the empty bin, and the generated variable is [sns]_bin_null.
    • If sns has a value but does not fall into a defined bin, it falls into the else bin, and the generated variable is [sns]_bin_else.
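
The naming rules for dummy variables can be illustrated with a short Python sketch. The function names and the bin ranges for sns below are hypothetical and exist only to exercise the three cases. The snippet builds variable names only and does not reproduce the component's key-value output.

```python
def dummy_variable_name(feature, value, find_bin):
    """Build a dummy variable name in the [feaname]_bin_<bin_id> format.

    find_bin is a hypothetical callable that returns the bin ID for a value,
    or None when the value does not fall into any defined bin.
    """
    if value is None:
        suffix = "null"  # missing value: empty bin
    else:
        bin_id = find_bin(value)
        suffix = "else" if bin_id is None else str(bin_id)
    return f"[{feature}]_bin_{suffix}"


def sns_bin(value):
    # Hypothetical bins for sns: bin 1 covers [0, 10), bin 2 covers [10, 20).
    if 0 <= value < 10:
        return 1
    if 10 <= value < 20:
        return 2
    return None


for raw in (15, None, 99):
    print(raw, dummy_variable_name("sns", raw, sns_bin))
```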