The Data Conversion Module component performs normalization, discretization, indexation, or weight of evidence (WOE) conversion on data.

Configure the component

You can use one of the following methods to configure the Data Conversion Module component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Data Conversion Module component on the pipeline page of Machine Learning Designer (formerly known as Machine Learning Studio) in Machine Learning Platform for AI (PAI). The following table describes the parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns in Input Table | The feature columns that are selected from the input table. By default, all columns in the input table are selected. |
|  | Columns without Data Conversion | The columns on which data conversion is not required. The selected columns in the output are the same as those in the input. You can specify labels in the columns. |
|  | Data Conversion Mode | Valid values: Normalization, Discretization, WOE Conversion, and Index. |
|  | Default WOE Value | This parameter is valid only if the Data Conversion Mode parameter is set to WOE Conversion. If this parameter is specified and a sample value falls into a bin without WOE values, this value is used as the WOE value. If this parameter is not specified and a sample value falls into a bin without WOE values, the system reports an error. |
| Tuning | Number of Cores | The number of CPU cores that are required. By default, the system determines the value. |
|  | Memory Size per Core | The memory size of each CPU core. By default, the system determines the value. |
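
The fallback behavior of the Default WOE Value parameter can be sketched as follows. This is a minimal illustration, not the component's implementation: the `lookup_woe` function, the `woe_by_bin` dictionary, and the bin IDs are hypothetical names introduced for this example.

```python
from typing import Dict, Optional


def lookup_woe(bin_id: str,
               woe_by_bin: Dict[str, float],
               default_woe: Optional[float] = None) -> float:
    """Return the WOE value of a bin (hypothetical helper).

    If the bin has no WOE value, fall back to default_woe.
    If no default is set either, raise an error, which mirrors
    the "system reports an error" behavior described above.
    """
    woe = woe_by_bin.get(bin_id)
    if woe is not None:
        return woe
    if default_woe is not None:
        return default_woe
    raise ValueError(
        f"bin {bin_id!r} has no WOE value and no default WOE value is set"
    )


# Assumed example data: only bin "1" has a WOE value.
woe_table = {"1": 0.3}
print(lookup_woe("1", woe_table))                    # WOE value of the bin
print(lookup_woe("2", woe_table, default_woe=0.0))   # falls back to the default
```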

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name data_transform
-project algo_public
-DinputFeatureTableName=feature_table
-DinputBinTableName=bin_table
-DoutputTableName=output_table
-DmetaColNames=label
-DfeatureColNames=feaname1,feaname2
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| inputFeatureTableName | The name of the input feature table. | Yes | No default value |
| inputBinTableName | The name of the binning result table. | Yes | No default value |
| inputFeatureTablePartitions | The partitions that are selected from the input feature table. | No | Full table |
| outputTableName | The name of the output table. | Yes | No default value |
| featureColNames | The feature columns that are selected from the input table. | No | All columns |
| metaColNames | The columns that do not need to be converted. These columns in the output are the same as those in the input. You can specify labels and sample IDs in the columns. | No | No default value |
| transformType | The type of data conversion. Valid values: normalize (normalization), dummy (discretization), and woe (WOE conversion). | No | dummy |
| itemDelimiter | The delimiter that is used to separate features. This parameter is valid only if the transformType parameter is set to dummy. | No | , |
| kvDelimiter | The delimiter that is used to separate keys and values. This parameter is valid only if the transformType parameter is set to dummy. | No | : |
| lifecycle | The lifecycle of the output table. | No | No default value |
| coreNum | The number of CPU cores that are required. | No | Determined by the system |
| memSizePerCore | The memory size of each CPU core. Unit: MB. | No | Determined by the system |
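
If you generate the command from a script, a small helper such as the following can assemble it and check the required parameters first. This is only a sketch: `build_data_transform_command` is a hypothetical helper, not part of PAI, and the table and column names are the placeholder values from the preceding command.

```python
# Parameters marked as required in the preceding table.
REQUIRED_PARAMS = ("inputFeatureTableName", "inputBinTableName", "outputTableName")


def build_data_transform_command(params):
    """Assemble a data_transform PAI command string (hypothetical helper)."""
    missing = [name for name in REQUIRED_PARAMS if not params.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    lines = ["PAI -name data_transform", "-project algo_public"]
    for name, value in params.items():
        # Column lists are passed as comma-separated values.
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        lines.append(f"-D{name}={value}")
    return "\n".join(lines)


command = build_data_transform_command({
    "inputFeatureTableName": "feature_table",
    "inputBinTableName": "bin_table",
    "outputTableName": "output_table",
    "metaColNames": ["label"],
    "featureColNames": ["feaname1", "feaname2"],
    "transformType": "woe",
})
print(command)
```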
To implement normalization, the Data Conversion Module component converts variable values into values between 0 and 1 based on the input binning information, and sets missing values to 0. The following algorithm is used:
if feature_raw_value == null or feature_raw_value == 0 then
    feature_norm_value = 0.0
else
    bin_index = FindBin(bin_table, feature_raw_value)
    bin_width = round(1.0 / bin_count * 1000) / 1000.0
    feature_norm_value = 1.0 - (bin_count - bin_index - 1) * bin_width
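
The preceding pseudocode can be expressed as a runnable Python sketch. The layout of the bin table is not described here, so the sketch makes two assumptions: bins are represented by a sorted list of upper bounds (`upper_bounds`), and bin indexes are zero-based, so that the last bin maps to 1.0. The `find_bin` function is a hypothetical stand-in for `FindBin`.

```python
import bisect
from typing import Optional, Sequence


def find_bin(upper_bounds: Sequence[float], value: float) -> int:
    """Hypothetical FindBin: zero-based index of the first bin whose
    upper bound is >= value. Values above the last bound are placed
    in the last bin (an assumption made for this sketch)."""
    index = bisect.bisect_left(upper_bounds, value)
    return min(index, len(upper_bounds) - 1)


def normalize(value: Optional[float], upper_bounds: Sequence[float]) -> float:
    """Map a raw feature value to a normalized value based on its bin."""
    # Missing values and zeros are set to 0.0, as in the pseudocode.
    if value is None or value == 0:
        return 0.0
    bin_count = len(upper_bounds)
    bin_index = find_bin(upper_bounds, value)
    # The bin width is rounded to three decimal places.
    bin_width = round(1.0 / bin_count * 1000) / 1000.0
    return 1.0 - (bin_count - bin_index - 1) * bin_width


# Assumed example: four bins with upper bounds 10, 20, 30, and 40.
bounds = [10, 20, 30, 40]
for raw in (None, 5, 15, 35):
    print(raw, normalize(raw, bounds))
```

Under these assumptions, every value in the same bin receives the same normalized value, and higher bins receive larger values.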
The output format of the Data Conversion Module component depends on the conversion type:
  • For normalization and WOE conversion, the component generates a regular table.
  • For discretization, which converts data into dummy variables, the component generates a table in the key-value format. Each variable in the table is named in the [${feaname}]_bin_${bin_id} format. The following examples use the sns variable:
    • If sns falls into the second bin, the generated variable is [sns]_bin_2.
    • If sns does not have a value, it falls into the empty bin, and the generated variable is [sns]_bin_null.
    • If sns has a value but does not fall into a defined bin, it falls into the else bin, and the generated variable is [sns]_bin_else.
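
The naming rules above can be sketched in Python as follows. The sketch rests on assumptions: bins are given as hypothetical `(lower, upper]` ranges, bin IDs start at 1 so that the second bin yields `_bin_2` as in the example, and each generated key is paired with the value 1. The default delimiters `,` and `:` are those of the itemDelimiter and kvDelimiter parameters.

```python
from typing import Dict, List, Optional, Tuple

Bins = List[Tuple[float, float]]  # assumed (lower, upper] ranges


def dummy_variable_name(feature: str, value: Optional[float], bins: Bins) -> str:
    """Build the dummy variable name for one feature value."""
    if value is None:
        # No value: the sample falls into the empty bin.
        return f"[{feature}]_bin_null"
    for bin_id, (lower, upper) in enumerate(bins, start=1):
        if lower < value <= upper:
            return f"[{feature}]_bin_{bin_id}"
    # A value outside every defined bin falls into the else bin.
    return f"[{feature}]_bin_else"


def encode_row(row: Dict[str, Optional[float]],
               bins_by_feature: Dict[str, Bins],
               item_delimiter: str = ",",
               kv_delimiter: str = ":") -> str:
    """Encode one sample as a key-value string.

    Pairing each key with the value 1 is an assumption of this sketch.
    """
    keys = [dummy_variable_name(name, value, bins_by_feature[name])
            for name, value in row.items()]
    return item_delimiter.join(f"{key}{kv_delimiter}1" for key in keys)


sns_bins = [(0, 10), (10, 20)]
print(dummy_variable_name("sns", 15, sns_bins))    # second bin
print(dummy_variable_name("sns", None, sns_bins))  # empty bin
print(dummy_variable_name("sns", 99, sns_bins))    # else bin
```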