Use the Data Conversion Module to normalize, discretize, index, or perform Weight of Evidence (WOE) conversion on data.
Configure the component
You can configure the parameters for the Data Conversion Module component in one of the following ways.
Method 1: Use the GUI
You can configure the component parameters on the workflow page in Designer.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature columns in input table | The feature columns from the input table. By default, all columns are selected. |
| Fields Setting | Columns to exclude from conversion | The selected columns are passed through to the output without changes. You can specify a label column here. |
| | Data conversion type | Supported conversion types include Normalization, Discretization, WOE conversion, and Index. |
| | Default WOE value | This parameter takes effect only when Data conversion type is set to WOE conversion. If you specify this parameter, this value is used to replace any sample value that falls into a bin without a WOE value. If you do not specify this parameter, the algorithm reports an error when a sample value falls into a bin without a WOE value. |
| Execution Tuning | Number of cores | The number of CPU cores to use. By default, the system automatically allocates the cores. |
| Execution Tuning | Memory per core | The amount of memory for each CPU core. By default, the system automatically allocates the memory. |
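The fallback behavior of Default WOE value can be sketched in Python as follows. This is a minimal illustration, not the component's implementation: the function name `lookup_woe`, the dictionary `woe_by_bin`, and the use of `None` to mark a bin without a WOE value are assumptions made for this example.

```python
from typing import Dict, Optional


def lookup_woe(bin_id: str,
               woe_by_bin: Dict[str, Optional[float]],
               default_woe: Optional[float] = None) -> float:
    """Return the WOE value for a bin, falling back to default_woe.

    woe_by_bin is a hypothetical mapping from bin id to WOE value;
    None (or a missing key) marks a bin without a WOE value.
    """
    woe = woe_by_bin.get(bin_id)
    if woe is not None:
        return woe
    if default_woe is not None:
        # Default WOE value is set: use it for bins without a WOE value.
        return default_woe
    # Default WOE value is not set: report an error.
    raise ValueError(f"bin {bin_id!r} has no WOE value and no default is set")


# Hypothetical WOE values for three bins; the "else" bin has none.
woe_table = {"0": -0.42, "1": 0.13, "else": None}
print(lookup_woe("1", woe_table))                      # bin with a WOE value
print(lookup_woe("else", woe_table, default_woe=0.0))  # falls back to the default
```

Calling `lookup_woe("else", woe_table)` without a default raises `ValueError`, which mirrors the error case described for an unset Default WOE value.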
Method 2: Use PAI commands
You can configure the component parameters using PAI commands in the SQL Script component. For more information, see SQL Script.
PAI -name data_transform
-project algo_public
-DinputFeatureTableName=feature_table
-DinputBinTableName=bin_table
-DoutputTableName=output_table
-DmetaColNames=label
-DfeatureColNames=feaname1,feaname2
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| inputFeatureTableName | The input feature table. | Yes | None |
| inputBinTableName | The input binning result table. | Yes | None |
| inputFeatureTablePartitions | The partitions to use from the input feature table. | No | Complete table |
| outputTableName | The output table. | Yes | None |
| featureColNames | The feature columns to select from the input table. | No | All columns |
| metaColNames | The columns that are not converted. The selected columns are passed through to the output without changes. You can specify columns such as the label and sample_id. | No | None |
| transformType | The type of data conversion. Valid values: | No | dummy |
| itemDelimiter | The feature separator. This parameter is valid only for discretization. | No | Comma (,) |
| kvDelimiter | The key-value separator. This parameter is valid only for discretization. | No | Colon (:) |
| lifecycle | The lifecycle of the output table. | No | None |
| coreNum | The number of CPU cores to use. | No | System-calculated |
| memSizePerCore | The amount of memory for each CPU core, in MB. | No | System-calculated |
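When the command is generated by a script, the required parameters can be checked before the command text is assembled. The Python sketch below only builds the command string shown earlier. The helper name `build_pai_command` is hypothetical, and submitting the command is out of scope here.

```python
from typing import Dict

# Parameters marked as required in the table above.
REQUIRED = ("inputFeatureTableName", "inputBinTableName", "outputTableName")


def build_pai_command(params: Dict[str, str]) -> str:
    """Render a data_transform PAI command from a parameter dictionary."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    lines = ["PAI -name data_transform", "-project algo_public"]
    # Each parameter is passed as -D<name>=<value>. Optional parameters
    # that are omitted fall back to the defaults listed in the table.
    lines += [f"-D{name}={value}" for name, value in params.items()]
    return "\n".join(lines)


print(build_pai_command({
    "inputFeatureTableName": "feature_table",
    "inputBinTableName": "bin_table",
    "outputTableName": "output_table",
    "metaColNames": "label",
    "featureColNames": "feaname1,feaname2",
}))
```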
Normalization converts variable values to a range between 0 and 1 based on the input binning information. Missing values are filled with 0. The algorithm is as follows.
if feature_raw_value == null or feature_raw_value == 0 then
    feature_norm_value = 0.0
else
    bin_index = FindBin(bin_table, feature_raw_value)
    bin_width = round(1.0 / bin_count * 1000) / 1000.0
    feature_norm_value = 1.0 - (bin_count - bin_index - 1) * bin_width
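The pseudocode above can be sketched in runnable Python as follows. The source does not specify how FindBin works or how the binning table is laid out, so this sketch assumes that the bins are given as a sorted list of upper bounds, that bin indexes are 0-based, and that values above the last bound fall into the last bin. The names `find_bin` and `normalize` are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


def find_bin(upper_bounds: List[float], value: float) -> int:
    """Hypothetical FindBin: 0-based index of the first bin whose upper
    bound is >= value, clamped to the last bin."""
    return min(bisect_left(upper_bounds, value), len(upper_bounds) - 1)


def normalize(value: Optional[float], upper_bounds: List[float]) -> float:
    """Map a raw feature value to a normalized value based on its bin."""
    # Missing values and zeros are mapped to 0.0.
    if value is None or value == 0:
        return 0.0
    bin_count = len(upper_bounds)
    bin_index = find_bin(upper_bounds, value)
    # Bin width is rounded to three decimal places, as in the pseudocode.
    bin_width = round(1.0 / bin_count * 1000) / 1000.0
    return 1.0 - (bin_count - bin_index - 1) * bin_width


bounds = [10.0, 20.0, 30.0, 40.0]  # four bins, so bin_width is 0.25
print(normalize(None, bounds))   # 0.0
print(normalize(5.0, bounds))    # 0.25 (first bin)
print(normalize(15.0, bounds))   # 0.5 (second bin)
print(normalize(40.0, bounds))   # 1.0 (last bin)
```

Under these assumptions, the first bin maps to one bin width and the last bin maps to 1.0, so every non-missing, non-zero value lands in the interval (0, 1].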
The output format varies depending on the type of data conversion performed by the Data Conversion Module:
- Normalization and WOE conversion output a standard table.
- Discretization into dummy variables outputs a table in key-value (KV) format. The generated variables use the format [${feaname}]_bin_${bin_id}. For example, for a variable named sns, the generated variables are as follows:
  - If sns falls into the second bin, the generated variable is [sns]_bin_2.
  - If sns is empty, it falls into the null bin, and the generated variable is [sns]_bin_null.
  - If sns is not empty and does not fall into any defined bin, it falls into the else bin, and the generated variable is [sns]_bin_else.
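The naming rules for dummy variables can be illustrated with a short Python sketch. Several details here are assumptions rather than documented behavior: bins are represented as a mapping from bin ID to a (lower, upper] interval, each generated variable is written with the value 1, and the function names `dummy_variable` and `to_kv` are hypothetical. The default delimiters follow the kvDelimiter and itemDelimiter parameters.

```python
from typing import Dict, Optional, Tuple

# Hypothetical bin layout: bin_id -> (lower, upper], taken from a binning table.
Bins = Dict[int, Tuple[float, float]]


def dummy_variable(feature: str, value: Optional[float], bins: Bins) -> str:
    """Return the generated variable name for one feature value."""
    if value is None:
        return f"[{feature}]_bin_null"      # empty value: null bin
    for bin_id, (lower, upper) in bins.items():
        if lower < value <= upper:
            return f"[{feature}]_bin_{bin_id}"
    return f"[{feature}]_bin_else"          # no defined bin matched: else bin


def to_kv(row: Dict[str, Optional[float]],
          bins_by_feature: Dict[str, Bins],
          kv_delimiter: str = ":",
          item_delimiter: str = ",") -> str:
    """Render one sample as a KV string of generated variables."""
    # Assumption: every generated variable is emitted with the value 1.
    items = [f"{dummy_variable(name, value, bins_by_feature[name])}{kv_delimiter}1"
             for name, value in row.items()]
    return item_delimiter.join(items)


sns_bins = {1: (0.0, 10.0), 2: (10.0, 20.0)}
print(dummy_variable("sns", 15.0, sns_bins))  # [sns]_bin_2
print(dummy_variable("sns", None, sns_bins))  # [sns]_bin_null
print(dummy_variable("sns", 99.0, sns_bins))  # [sns]_bin_else
print(to_kv({"sns": 15.0, "age": None},
            {"sns": sns_bins, "age": {1: (0.0, 30.0)}}))
```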