This topic describes the Normalization component provided by Machine Learning Designer (formerly known as Machine Learning Studio).
Configure the component
You can use one of the following methods to configure the Normalization component.
Method 1: Configure the component on the pipeline page
Configure the component parameters on the pipeline page of Machine Learning Designer.
Tab | Parameter | Description |
---|---|---|
Fields Setting | All Selected by Default | By default, all columns in the input table are selected. Specific columns may not be used for training. These columns do not affect the prediction result. |
Fields Setting | Reserve Original Columns | Specifies whether to reserve original columns. Column names are prefixed with normalized_ after normalization. Only columns of the DOUBLE or BIGINT type can be reserved. |
Tuning | Cores | The number of cores. The system automatically allocates cores used for training based on the volume of input data. |
Tuning | Memory Size per Core | The memory size of each core. The system automatically allocates the memory based on the volume of input data. Unit: MB. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
- Command for dense data
PAI -name Normalize -project algo_public -DkeepOriginal="true" -DoutputTableName="test_4" -DinputTablePartitions="pt=20150501" -DinputTableName="bank_data_partition" -DselectedColNames="emp_var_rate,euribor3m"
- Command for sparse data
PAI -name Normalize -project projectxlib4 -DkeepOriginal="true" -DoutputTableName="kv_norm_output" -DinputTableName=kv_norm_test -DselectedColNames="f0,f1,f2" -DenableSparse=true -DoutputParaTableName=kv_norm_model -DkvIndices=1,2,8,6 -DitemDelimiter=",";
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input table. | No default value |
selectedColNames | No | The columns that are selected from the input table for training. The column names must be separated by commas (,). Columns of the INT and DOUBLE types are supported. If the input data is in the sparse format, columns of the STRING type are supported. | All columns |
inputTablePartitions | No | The partitions that are selected from the input table for training. Specify each partition in the partition_name=value format, for example, pt=20150501. If you specify multiple partitions, separate them with commas (,). | All partitions |
outputTableName | Yes | The name of the output table. | No default value |
outputParaTableName | No | The name of the output parameter table. | No default value |
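inputParaTableName | No | The name of the input parameter table. If you specify this parameter, the normalization parameters stored in the table, such as the output parameter table of a previous run, are applied to the input data. | No default value |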
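keepOriginal | No | Specifies whether to reserve original columns. Valid values: true (the original columns are reserved and the normalized columns are added with the normalized_ prefix) and false (the selected columns are replaced by their normalized values). | false |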
lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | No default value |
coreNum | No | The number of cores used in computing. The value must be a positive integer. | Determined by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: (1,65536). | Determined by the system |
enableSparse | No | Specifies whether to support the input data in the sparse format. Valid values: true and false. | false |
itemDelimiter | No | The delimiter used between key-value pairs. | , |
kvDelimiter | No | The delimiter used between keys and values. | : |
kvIndices | No | The feature indexes that require normalization in the table that contains data in the key-value format. | No default value |
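The sparse-format parameters above can be illustrated with a minimal Python sketch. It is not the component's implementation: the helper names (parse_kv, format_kv, normalize_kv), the integer key type, the sample ranges, and the choice to leave absent or unlisted keys untouched are all assumptions made for illustration. Only the default delimiters (, and :) and the idea of restricting normalization to the indexes in kvIndices come from the table.

```python
from typing import Dict, Iterable, Tuple


def parse_kv(row: str, item_delimiter: str = ",", kv_delimiter: str = ":") -> Dict[int, float]:
    """Split a key-value string such as "1:5,2:1" into {1: 5.0, 2: 1.0}."""
    pairs: Dict[int, float] = {}
    for item in row.split(item_delimiter):
        if not item:
            continue
        key, value = item.split(kv_delimiter, 1)
        pairs[int(key)] = float(value)  # integer keys are an assumption
    return pairs


def format_kv(pairs: Dict[int, float], item_delimiter: str = ",", kv_delimiter: str = ":") -> str:
    """Serialize the pairs back into the key-value string format."""
    return item_delimiter.join(f"{k}{kv_delimiter}{v}" for k, v in pairs.items())


def normalize_kv(row: str, ranges: Dict[int, Tuple[float, float]], kv_indices: Iterable[int]) -> str:
    """Min-max rescale only the listed feature indexes that are present in the row."""
    pairs = parse_kv(row)
    for index in kv_indices:
        if index in pairs and index in ranges:
            lo, hi = ranges[index]
            # Mapping a constant feature to 0.0 is an assumption of this sketch.
            pairs[index] = 0.0 if hi == lo else (pairs[index] - lo) / (hi - lo)
    return format_kv(pairs)


# Hypothetical (min, max) ranges for features 1 and 2; feature 3 is not normalized.
ranges = {1: (0.0, 10.0), 2: (0.0, 4.0)}
result = normalize_kv("1:5,2:1,3:7", ranges, kv_indices=[1, 2])
print(result)
```

In this sketch, feature 3 passes through unchanged because it is not listed in kv_indices, which mirrors the role that kvIndices plays in the command for sparse data.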
Example
- Generate input data
drop table if exists normalize_test_input;
create table normalize_test_input(
    col_string string,
    col_bigint bigint,
    col_double double,
    col_boolean boolean,
    col_datetime datetime);
insert overwrite table normalize_test_input
select * from (
    select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime from dual
    union all
    select cast(null as string) as col_string, 11 as col_bigint, 10.2 as col_double, False as col_boolean, cast('2016-07-02 10:00:00' as datetime) as col_datetime from dual
    union all
    select '02' as col_string, cast(null as bigint) as col_bigint, 10.3 as col_double, True as col_boolean, cast('2016-07-03 10:00:00' as datetime) as col_datetime from dual
    union all
    select '03' as col_string, 12 as col_bigint, cast(null as double) as col_double, False as col_boolean, cast('2016-07-04 10:00:00' as datetime) as col_datetime from dual
    union all
    select '04' as col_string, 13 as col_bigint, 10.4 as col_double, cast(null as boolean) as col_boolean, cast('2016-07-05 10:00:00' as datetime) as col_datetime from dual
    union all
    select '05' as col_string, 14 as col_bigint, 10.5 as col_double, True as col_boolean, cast(null as datetime) as col_datetime from dual
) tmp;
- Run PAI commands
drop table if exists normalize_test_input_output;
drop table if exists normalize_test_input_model_output;
PAI -name Normalize -project algo_public
    -DoutputParaTableName="normalize_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="normalize_test_input_output"
    -DinputTableName="normalize_test_input"
    -DselectedColNames="col_double,col_bigint"
    -DkeepOriginal="true";
drop table if exists normalize_test_input_output_using_model;
drop table if exists normalize_test_input_output_using_model_model_output;
PAI -name Normalize -project algo_public
    -DoutputParaTableName="normalize_test_input_output_using_model_model_output"
    -DinputParaTableName="normalize_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="normalize_test_input_output_using_model"
    -DinputTableName="normalize_test_input";
- Input
normalize_test_input
col_string | col_bigint | col_double | col_boolean | col_datetime |
---|---|---|---|---|
01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 |
02 | NULL | 10.3 | true | 2016-07-03 10:00:00 |
03 | 12 | NULL | false | 2016-07-04 10:00:00 |
04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 |
05 | 14 | 10.5 | true | NULL |
- Output
- normalize_test_input_output
col_string | col_bigint | col_double | col_boolean | col_datetime | normalized_col_bigint | normalized_col_double |
---|---|---|---|---|---|---|
01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | 0.0 | 0.0 |
NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | 0.25 | 0.2499999999999989 |
02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.5000000000000022 |
03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.5 | NULL |
04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.75 | 0.7500000000000011 |
05 | 14 | 10.5 | true | NULL | 1.0 | 1.0 |
- normalize_test_input_model_output
feature | json |
---|---|
col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
- normalize_test_input_output_using_model
col_string | col_bigint | col_double | col_boolean | col_datetime |
---|---|---|---|---|
01 | 0.0 | 0.0 | true | 2016-07-01 10:00:00 |
NULL | 0.25 | 0.2499999999999989 | false | 2016-07-02 10:00:00 |
02 | NULL | 0.5000000000000022 | true | 2016-07-03 10:00:00 |
03 | 0.5 | NULL | false | 2016-07-04 10:00:00 |
04 | 0.75 | 0.7500000000000011 | NULL | 2016-07-05 10:00:00 |
05 | 1.0 | 1.0 | true | NULL |
- normalize_test_input_output_using_model_model_output
feature | json |
---|---|
col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
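The numbers in this example are consistent with min-max scaling, (x - min) / (max - min), computed over the non-NULL values of each column. The following Python sketch reproduces them under that assumption. The function names fit_min_max and apply_min_max are hypothetical, None stands in for NULL, and the handling of a constant column is a guess, not documented behavior.

```python
import json
from typing import List, Optional, Sequence


def fit_min_max(values: Sequence[Optional[float]]) -> dict:
    """Collect min and max over the non-NULL values (None stands in for NULL)."""
    present = [v for v in values if v is not None]
    return {"min": min(present), "max": max(present)}


def apply_min_max(values: Sequence[Optional[float]], paras: dict) -> List[Optional[float]]:
    """Rescale values with (x - min) / (max - min); NULL inputs stay NULL."""
    lo, hi = paras["min"], paras["max"]
    span = hi - lo
    # Mapping a constant column to 0.0 is an assumption of this sketch.
    return [None if v is None else (0.0 if span == 0 else (v - lo) / span) for v in values]


# Column values copied from normalize_test_input.
col_bigint = [10, 11, None, 12, 13, 14]
col_double = [10.1, 10.2, 10.3, None, 10.4, 10.5]

# First run: learn the parameters, similar in spirit to outputParaTableName.
model = {
    "col_bigint": fit_min_max(col_bigint),
    "col_double": fit_min_max(col_double),
}
normalized_bigint = apply_min_max(col_bigint, model["col_bigint"])
normalized_double = apply_min_max(col_double, model["col_double"])

# Second run: reuse stored parameters, similar in spirit to inputParaTableName.
# The JSON string is the col_bigint row of normalize_test_input_model_output.
saved_row = '{"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}}'
reapplied = apply_min_max(col_bigint, json.loads(saved_row)["paras"])

print(model)
print(normalized_bigint)
print(normalized_double)
```

Reapplying the stored parameters yields the same col_bigint values as the first run, which matches the relationship between normalize_test_input_output and normalize_test_input_output_using_model.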