This topic describes the Normalization component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

Configure the component

You can use one of the following methods to configure the Normalization component.

Method 1: Configure the component on the pipeline page

Configure the component parameters on the pipeline page of Machine Learning Designer.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | All Selected by Default | By default, all columns in the input table are selected. Columns that are not used for training do not affect the prediction result. |
| Fields Setting | Reserve Original Columns | Specifies whether to retain the original columns. If the original columns are retained, the normalized columns are named with the normalized_ prefix. Only columns of the DOUBLE or BIGINT type can be retained. |
| Tuning | Cores | The number of cores. The system automatically allocates the cores used for training based on the volume of input data. |
| Tuning | Memory Size per Core | The memory size of each core. Unit: MB. The system automatically allocates the memory based on the volume of input data. |

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
  • Command for dense data
    PAI -name Normalize
        -project algo_public
        -DkeepOriginal="true"
        -DoutputTableName="test_4"
        -DinputTablePartitions="pt=20150501"
        -DinputTableName="bank_data_partition"
        -DselectedColNames="emp_var_rate,euribor3m";
  • Command for sparse data
    PAI -name Normalize
        -project projectxlib4
        -DkeepOriginal="true"
        -DoutputTableName="kv_norm_output"
        -DinputTableName=kv_norm_test
        -DselectedColNames="f0,f1,f2"
        -DenableSparse=true
        -DoutputParaTableName=kv_norm_model
        -DkvIndices=1,2,8,6
        -DitemDelimiter=",";
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| selectedColNames | No | The columns that are selected from the input table for training. Separate multiple column names with commas (,). Columns of the INT and DOUBLE types are supported. If the input data is in the sparse format, columns of the STRING type are supported. | All columns |
| inputTablePartitions | No | The partitions that are selected from the input table for training. The following formats are supported: Partition_name=value, and name1=value1/name2=value2 for multi-level partitions. If you specify multiple partitions, separate them with commas (,). | All partitions |
| outputTableName | Yes | The name of the output table. | No default value |
| outputParaTableName | No | The name of the output parameter table. | No default value |
| inputParaTableName | No | The name of the input parameter table. | No default value |
| keepOriginal | No | Specifies whether to retain the original columns. Valid values: true (retains the original columns and adds the normalized columns with the normalized_ prefix) and false (keeps the original column names and replaces the values of the normalized columns in place). | false |
| lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | No default value |
| coreNum | No | The number of cores used in computing. The value must be a positive integer. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: (1,65536). | Determined by the system |
| enableSparse | No | Specifies whether the input data is in the sparse format. Valid values: true and false. | false |
| itemDelimiter | No | The delimiter used between key-value pairs. | , |
| kvDelimiter | No | The delimiter used between keys and values. | : |
| kvIndices | No | The indexes of the features to normalize in a table that contains data in the key-value format. | No default value |
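
To make the sparse parameters more concrete, the following Python sketch shows one way that itemDelimiter, kvDelimiter, and kvIndices could interact. It is not the component's implementation. The function names parse_kv and normalize_sparse are hypothetical, and the sketch assumes that keys are integer feature indexes, that min and max are taken only over the values that are present in the data, that the values are rescaled with min-max scaling, and that a feature with a single distinct value maps to 0.0.

```python
def parse_kv(cell, item_delimiter=",", kv_delimiter=":"):
    """Turn a sparse cell such as "1:10,2:5" into {1: 10.0, 2: 5.0}."""
    pairs = {}
    for item in filter(None, (cell or "").split(item_delimiter)):
        key, value = item.split(kv_delimiter, 1)
        pairs[int(key)] = float(value)
    return pairs


def normalize_sparse(cells, kv_indices, item_delimiter=",", kv_delimiter=":"):
    """Min-max scale the values whose key is in kv_indices; copy the rest."""
    rows = [parse_kv(c, item_delimiter, kv_delimiter) for c in cells]
    wanted = set(kv_indices)

    # Pass 1: per-key min and max over the values that are actually present.
    lo, hi = {}, {}
    for row in rows:
        for key, value in row.items():
            if key in wanted:
                lo[key] = min(lo.get(key, value), value)
                hi[key] = max(hi.get(key, value), value)

    # Pass 2: rescale the selected keys into [0, 1].
    result = []
    for row in rows:
        scaled = {}
        for key, value in row.items():
            if key in wanted:
                span = hi[key] - lo[key]
                # Assumption: a constant feature maps to 0.0.
                scaled[key] = (value - lo[key]) / span if span else 0.0
            else:
                scaled[key] = value
        result.append(scaled)
    return result


# Hypothetical sparse cells; only keys 1 and 8 are selected for normalization.
cells = ["1:10,2:5", "1:14,2:5,8:3", "1:12"]
for row in normalize_sparse(cells, kv_indices=[1, 8]):
    print(row)
```

In this sketch, key 1 is rescaled to 0.0, 1.0, and 0.5 across the three rows, while key 2 is copied unchanged because it is not listed in kv_indices.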

Example

  • Generate input data
    drop table if exists normalize_test_input;
    create table normalize_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime);
    insert overwrite table normalize_test_input
    select
        *
    from
    (
        select
            '01' as col_string,
            10 as col_bigint,
            10.1 as col_double,
            True as col_boolean,
            cast('2016-07-01 10:00:00' as datetime) as col_datetime
        from dual
        union all
            select
                cast(null as string) as col_string,
                11 as col_bigint,
                10.2 as col_double,
                False as col_boolean,
                cast('2016-07-02 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '02' as col_string,
                cast(null as bigint) as col_bigint,
                10.3 as col_double,
                True as col_boolean,
                cast('2016-07-03 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '03' as col_string,
                12 as col_bigint,
                cast(null as double) as col_double,
                False as col_boolean,
                cast('2016-07-04 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '04' as col_string,
                13 as col_bigint,
                10.4 as col_double,
                cast(null as boolean) as col_boolean,
                cast('2016-07-05 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '05' as col_string,
                14 as col_bigint,
                10.5 as col_double,
                True as col_boolean,
                cast(null as datetime) as col_datetime
            from dual
    ) tmp;
  • Run PAI commands
    -- First run: normalize the selected columns, write the min and max values
    -- to the parameter table, and retain the original columns.
    drop table if exists normalize_test_input_output;
    drop table if exists normalize_test_input_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output"
        -DinputTableName="normalize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    -- Second run: reuse the parameter table from the first run by using
    -- inputParaTableName. keepOriginal defaults to false, so the normalized
    -- values replace the original values in place.
    drop table if exists normalize_test_input_output_using_model;
    drop table if exists normalize_test_input_output_using_model_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_output_using_model_model_output"
        -DinputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output_using_model"
        -DinputTableName="normalize_test_input";
  • Input
    normalize_test_input
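    | col_string | col_bigint | col_double | col_boolean | col_datetime |
    | --- | --- | --- | --- | --- |
    | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
    | NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 |
    | 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 |
    | 03 | 12 | NULL | false | 2016-07-04 10:00:00 |
    | 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 |
    | 05 | 14 | 10.5 | true | NULL |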
  • Output
    • normalize_test_input_output
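      | col_string | col_bigint | col_double | col_boolean | col_datetime | normalized_col_bigint | normalized_col_double |
      | --- | --- | --- | --- | --- | --- | --- |
      | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | 0.0 | 0.0 |
      | NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | 0.25 | 0.2499999999999989 |
      | 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.5000000000000022 |
      | 03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.5 | NULL |
      | 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.75 | 0.7500000000000011 |
      | 05 | 14 | 10.5 | true | NULL | 1.0 | 1.0 |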
    • normalize_test_input_model_output
      | feature | json |
      | --- | --- |
      | col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
      | col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
    • normalize_test_input_output_using_model
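      | col_string | col_bigint | col_double | col_boolean | col_datetime |
      | --- | --- | --- | --- | --- |
      | 01 | 0.0 | 0.0 | true | 2016-07-01 10:00:00 |
      | NULL | 0.25 | 0.2499999999999989 | false | 2016-07-02 10:00:00 |
      | 02 | NULL | 0.5000000000000022 | true | 2016-07-03 10:00:00 |
      | 03 | 0.5 | NULL | false | 2016-07-04 10:00:00 |
      | 04 | 0.75 | 0.7500000000000011 | NULL | 2016-07-05 10:00:00 |
      | 05 | 1.0 | 1.0 | true | NULL |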
    • normalize_test_input_output_using_model_model_output
      | feature | json |
      | --- | --- |
      | col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
      | col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
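
The values in the output tables are consistent with min-max scaling, (x - min) / (max - min), where min and max are the values stored in the parameter table and NULL inputs stay NULL. The following Python sketch reproduces both runs on the two selected columns under that assumption. It is not the component's implementation. The function names fit_min_max, load_params, and apply_min_max are hypothetical, and mapping a constant column to 0.0 is an assumption that the example data does not exercise.

```python
import json

# Parameter table rows copied from normalize_test_input_model_output.
MODEL_ROWS = {
    "col_bigint": '{"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}}',
    "col_double": '{"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}}',
}

# The two selected columns of normalize_test_input; None stands for NULL.
ROWS = [
    {"col_bigint": 10, "col_double": 10.1},
    {"col_bigint": 11, "col_double": 10.2},
    {"col_bigint": None, "col_double": 10.3},
    {"col_bigint": 12, "col_double": None},
    {"col_bigint": 13, "col_double": 10.4},
    {"col_bigint": 14, "col_double": 10.5},
]


def fit_min_max(rows, columns):
    """Collect the min and max of each column, skipping NULL values."""
    params = {}
    for col in columns:
        values = [row[col] for row in rows if row[col] is not None]
        if values:
            params[col] = {"min": min(values), "max": max(values)}
    return params


def load_params(model_rows):
    """Read min and max from the JSON stored in a parameter table."""
    return {feature: json.loads(text)["paras"] for feature, text in model_rows.items()}


def apply_min_max(rows, params, keep_original=False):
    """Scale each parameterized column; NULL inputs stay NULL."""
    result = []
    for row in rows:
        new_row = dict(row)
        for col, p in params.items():
            value = row[col]
            if value is None:
                scaled = None
            else:
                span = p["max"] - p["min"]
                # Assumption: a constant column maps to 0.0.
                scaled = (value - p["min"]) / span if span else 0.0
            # keep_original=True adds a prefixed column; False overwrites in place.
            target = f"normalized_{col}" if keep_original else col
            new_row[target] = scaled
        result.append(new_row)
    return result


fitted = fit_min_max(ROWS, ["col_bigint", "col_double"])
loaded = load_params(MODEL_ROWS)
first_run = apply_min_max(ROWS, fitted, keep_original=True)   # like keepOriginal="true"
second_run = apply_min_max(ROWS, loaded)                      # like the run with inputParaTableName
print(fitted == loaded)
print(first_run[1])
print(second_run[1])
```

With keep_original=True the sketch keeps col_bigint and col_double and adds the normalized_ columns, as in normalize_test_input_output. With the default it overwrites the two columns in place, as in normalize_test_input_output_using_model.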