Normalize feature columns to improve model training efficiency and accuracy.
Component configuration
Configure the Normalization component parameters using either of the following methods:
Method 1: Configure the component in the GUI
Configure component parameters on the Designer workflow page.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields setting | Select all by default | All columns are selected by default. Extra columns do not affect the prediction result. |
| Fields setting | Keep original columns | Processed columns are prefixed with "normalized_", as in the example output columns normalized_col_bigint and normalized_col_double. Supports columns of DOUBLE and BIGINT types. |
| Execution tuning | Number of computing cores | The system automatically allocates the number of training instances based on the input data volume. |
| Execution tuning | Memory size per core | The system automatically allocates memory based on the input data volume. Unit: MB. |
Method 2: Use PAI commands
Use PAI commands to configure component parameters. Use the SQL Script component to call PAI commands. For more information, see SQL Script.
- Command for dense data

    PAI -name Normalize
        -project algo_public
        -DkeepOriginal="true"
        -DoutputTableName="test_4"
        -DinputTablePartitions="pt=20150501"
        -DinputTableName="bank_data_partition"
        -DselectedColNames="emp_var_rate,euribor3m"

- Command for sparse data

    PAI -name Normalize
        -project projectxlib4
        -DkeepOriginal="true"
        -DoutputTableName="kv_norm_output"
        -DinputTableName=kv_norm_test
        -DselectedColNames="f0,f1,f2"
        -DenableSparse=true
        -DoutputParaTableName=kv_norm_model
        -DkvIndices=1,2,8,6
        -DitemDelimiter=",";
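The sparse command above passes -DitemDelimiter="," and -DkvIndices=1,2,8,6. To make those options concrete, here is a minimal Python sketch of splitting one key-value cell and keeping only the listed indexes. The sample cell, the ":" key-value delimiter, and the helper names parse_kv and select_indices are assumptions for illustration; the sketch does not reproduce the component's internal logic.

```python
def parse_kv(cell, item_delimiter=",", kv_delimiter=":"):
    """Split a sparse cell such as '1:0.5,2:3.0' into {index: value}.

    The ':' key-value delimiter is an assumption made for this sketch.
    """
    pairs = {}
    if cell is None or cell.strip() == "":
        return pairs
    for item in cell.split(item_delimiter):
        key, value = item.split(kv_delimiter, 1)
        pairs[int(key)] = float(value)
    return pairs


def select_indices(pairs, kv_indices):
    """Keep only the entries whose index is listed in kv_indices."""
    wanted = set(kv_indices)
    return {k: v for k, v in pairs.items() if k in wanted}


# Hypothetical sample cell; indexes 8 and 6 are simply absent from it.
row = parse_kv("1:0.5,2:3.0,7:9.0")
selected = select_indices(row, [1, 2, 8, 6])
print(selected)  # {1: 0.5, 2: 3.0}
```

Index 7 is dropped because it is not listed, which mirrors the idea that only the features named in kvIndices are normalized.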
| Parameter name | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | Input table name. | None |
| selectedColNames | No | Columns in the input table used for training. Separate column names with commas (,). Supports INT and DOUBLE types. If the input data is in sparse format, STRING columns are also supported. | All columns |
| inputTablePartitions | No | Partitions in the input table used for training, for example pt=20150501. Separate multiple partitions with commas (,). | All partitions |
| outputTableName | Yes | Output table name. | None |
| outputParaTableName | No | Name of the output parameter table. | Non-partitioned table |
| inputParaTableName | No | Name of the input parameter table, such as a table written by an earlier run through outputParaTableName. The example commands that omit this parameter show that it is optional. | None |
| keepOriginal | No | Whether to retain the original columns. Valid values: true (original columns are kept and normalized values are written to new columns with the "normalized_" prefix) and false (normalized values replace the values in the selected columns). | false |
| lifecycle | No | Lifecycle of the output table, in days. Valid range: 1 to 3650. | None |
| coreNum | No | Number of cores for computing. Valid values: positive integers. | System auto-allocated. |
| memSizePerCore | No | Memory size per core in MB. Valid range: 1 to 65536. | System auto-allocated. |
| enableSparse | No | Whether to enable sparse (key-value) format support. Valid values: true and false. | false |
| itemDelimiter | No | Separator between key-value pairs. | default |
| kvDelimiter | No | Separator between a key and its value. | default |
| kvIndices | No | Indexes of features that require normalization in the key-value table. | None |
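The required flags and valid ranges in the parameter table above can be checked before a command is submitted. The following Python sketch renders a command string in the same shape as the examples in this document and rejects out-of-range values. The helper name build_normalize_command is hypothetical, and the sketch only formats text; it does not call any PAI or MaxCompute API.

```python
def build_normalize_command(params, project="algo_public"):
    """Render a PAI Normalize command string from a dict of -D parameters."""
    # inputTableName and outputTableName are marked as required in the table.
    for required in ("inputTableName", "outputTableName"):
        if required not in params:
            raise ValueError(f"missing required parameter: {required}")

    # Range checks taken from the parameter table.
    if "lifecycle" in params and not 1 <= int(params["lifecycle"]) <= 3650:
        raise ValueError("lifecycle must be between 1 and 3650")
    if "memSizePerCore" in params and not 1 <= int(params["memSizePerCore"]) <= 65536:
        raise ValueError("memSizePerCore must be between 1 and 65536")
    if "coreNum" in params and int(params["coreNum"]) < 1:
        raise ValueError("coreNum must be a positive integer")

    parts = ["PAI -name Normalize", f"-project {project}"]
    parts += [f'-D{name}="{value}"' for name, value in params.items()]
    return " ".join(parts) + ";"


command = build_normalize_command({
    "inputTableName": "normalize_test_input",
    "outputTableName": "normalize_test_input_output",
    "selectedColNames": "col_double,col_bigint",
    "keepOriginal": "true",
    "lifecycle": 28,
})
print(command)
```

The resulting string can be pasted into a SQL Script component; validating locally first avoids submitting a job that is bound to fail on a bad lifecycle or memory value.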
Example
- Generate data

    drop table if exists normalize_test_input;
    create table normalize_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime
    );
    insert overwrite table normalize_test_input
    select * from (
        select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime
        union all
        select cast(null as string) as col_string, 11 as col_bigint, 10.2 as col_double, False as col_boolean, cast('2016-07-02 10:00:00' as datetime) as col_datetime
        union all
        select '02' as col_string, cast(null as bigint) as col_bigint, 10.3 as col_double, True as col_boolean, cast('2016-07-03 10:00:00' as datetime) as col_datetime
        union all
        select '03' as col_string, 12 as col_bigint, cast(null as double) as col_double, False as col_boolean, cast('2016-07-04 10:00:00' as datetime) as col_datetime
        union all
        select '04' as col_string, 13 as col_bigint, 10.4 as col_double, cast(null as boolean) as col_boolean, cast('2016-07-05 10:00:00' as datetime) as col_datetime
        union all
        select '05' as col_string, 14 as col_bigint, 10.5 as col_double, True as col_boolean, cast(null as datetime) as col_datetime
    ) tmp;

- PAI command

    drop table if exists normalize_test_input_output;
    drop table if exists normalize_test_input_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output"
        -DinputTableName="normalize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    drop table if exists normalize_test_input_output_using_model;
    drop table if exists normalize_test_input_output_using_model_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_output_using_model_model_output"
        -DinputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output_using_model"
        -DinputTableName="normalize_test_input";

- Input
normalize_test_input

| col_string | col_bigint | col_double | col_boolean | col_datetime |
| --- | --- | --- | --- | --- |
| 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
| NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 |
| 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 |
| 03 | 12 | NULL | false | 2016-07-04 10:00:00 |
| 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 |
| 05 | 14 | 10.5 | true | NULL |
- Outputs
  - normalize_test_input_output

| col_string | col_bigint | col_double | col_boolean | col_datetime | normalized_col_bigint | normalized_col_double |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | 0.0 | 0.0 |
| NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | 0.25 | 0.2499999999999989 |
| 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.5000000000000022 |
| 03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.5 | NULL |
| 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.75 | 0.7500000000000011 |
| 05 | 14 | 10.5 | true | NULL | 1.0 | 1.0 |
  - normalize_test_input_model_output

| feature | json |
| --- | --- |
| col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
| col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
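The model table stores only the minimum and maximum of each selected column, and the normalized values in normalize_test_input_output are consistent with min-max scaling, (x - min) / (max - min), with NULL inputs left as NULL. The following is a minimal Python sketch under that assumption. The function names are hypothetical, and the component's actual implementation is not shown in this document.

```python
def fit_min_max(values):
    """Return (min, max) over the non-NULL values of a column."""
    present = [v for v in values if v is not None]
    return min(present), max(present)


def apply_min_max(values, lo, hi):
    """Scale values to [0, 1]; NULL (None) inputs stay NULL."""
    span = hi - lo
    if span == 0:
        # The document does not say how a constant column is handled,
        # so this sketch refuses rather than guessing.
        raise ValueError("constant column: max equals min")
    return [None if v is None else (v - lo) / span for v in values]


# Column values copied from normalize_test_input, in row order.
col_bigint = [10, 11, None, 12, 13, 14]
col_double = [10.1, 10.2, 10.3, None, 10.4, 10.5]

lo, hi = fit_min_max(col_bigint)
normalized_bigint = apply_min_max(col_bigint, lo, hi)
print(lo, hi)             # 10 14
print(normalized_bigint)  # [0.0, 0.25, None, 0.5, 0.75, 1.0]

normalized_double = apply_min_max(col_double, *fit_min_max(col_double))
print(normalized_double)
```

The col_double results differ from round fractions only by floating-point rounding, which matches the long decimal tails in the output table.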
  - normalize_test_input_output_using_model

| col_string | col_bigint | col_double | col_boolean | col_datetime |
| --- | --- | --- | --- | --- |
| 01 | 0.0 | 0.0 | true | 2016-07-01 10:00:00 |
| NULL | 0.25 | 0.2499999999999989 | false | 2016-07-02 10:00:00 |
| 02 | NULL | 0.5000000000000022 | true | 2016-07-03 10:00:00 |
| 03 | 0.5 | NULL | false | 2016-07-04 10:00:00 |
| 04 | 0.75 | 0.7500000000000011 | NULL | 2016-07-05 10:00:00 |
| 05 | 1.0 | 1.0 | true | NULL |
  - normalize_test_input_output_using_model_model_output

| feature | json |
| --- | --- |
| col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
| col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
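The second PAI command in the example passes the first run's parameter table through inputParaTableName and omits keepOriginal, and its output shows the selected columns overwritten in place. The following Python sketch illustrates how such a parameter table could be parsed and applied to a record. It assumes min-max scaling and uses hypothetical helper names (load_params, apply_model); it is not the component's actual code.

```python
import json

# The json cells are copied from the model table; the dict layout is an assumption.
model_rows = {
    "col_bigint": '{"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}}',
    "col_double": '{"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}}',
}


def load_params(rows):
    """Parse each json cell into a (min, max) pair keyed by feature name."""
    params = {}
    for feature, cell in rows.items():
        paras = json.loads(cell)["paras"]
        params[feature] = (paras["min"], paras["max"])
    return params


def apply_model(record, params):
    """Overwrite each modeled column in place, as when keepOriginal is false."""
    scaled = dict(record)
    for feature, (lo, hi) in params.items():
        value = scaled.get(feature)
        if value is not None:  # NULL values stay NULL
            scaled[feature] = (value - lo) / (hi - lo)
    return scaled


params = load_params(model_rows)
result = apply_model({"col_string": "04", "col_bigint": 13, "col_double": None}, params)
print(result)  # {'col_string': '04', 'col_bigint': 0.75, 'col_double': None}
```

Because the stored minimum and maximum are reused rather than recomputed, new data can be scaled on the same basis as the data used in the first run.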