This topic describes the Normalization component provided by Machine Learning Studio.

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI (PAI) console
    Tab | Parameter | Description
    Fields Setting | All Selected by Default | All columns are selected by default. Additional columns do not affect the prediction result.
    Fields Setting | Reserve Original Columns | Specifies whether to reserve original columns. Column names are prefixed with normalized_ after normalization. Only the columns of the DOUBLE or BIGINT data type can be reserved.
    Tuning | Cores | The number of cores. The system automatically allocates the cores used for training based on the volume of input data.
    Tuning | Memory Size per Core | The memory size of each core. The system automatically allocates the memory size based on the volume of input data. Unit: MB.
  • PAI command
    • Command for dense data
      PAI -name Normalize
          -project algo_public
          -DkeepOriginal="true"
          -DoutputTableName="test_4"
          -DinputTablePartitions="pt=20150501"
          -DinputTableName="bank_data_partition"
          -DselectedColNames="emp_var_rate,euribor3m";
    • Command for sparse data
      PAI -name Normalize
          -project projectxlib4
          -DkeepOriginal="true"
          -DoutputTableName="kv_norm_output"
          -DinputTableName=kv_norm_test
          -DselectedColNames="f0,f1,f2"
          -DenableSparse=true
          -DoutputParaTableName=kv_norm_model
          -DkvIndices=1,2,8,6
          -DitemDelimiter=",";
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input table. | No default value
    selectedColNames | No | The names of the columns selected from the input table for training. If you specify multiple columns, separate the column names with commas (,). Columns of the INT or DOUBLE data type can be used for training. However, if the input data is in the sparse format, only columns of the STRING data type can be used for training. | All columns
    inputTablePartitions | No | The partitions selected from the input table for training. Specify this parameter in one of the following formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. If you specify multiple partitions, separate them with commas (,). | All partitions
    outputTableName | Yes | The name of the output table. | No default value
    outputParaTableName | No | The name of the output parameter table. | Non-partitioned output table 1
    inputParaTableName | No | The name of the input parameter table, for example a table that a previous run wrote through outputParaTableName. The commands in this topic show that the component can run without this parameter. | No default value
    keepOriginal | No | Specifies whether to reserve original columns. Valid values: true (the original columns are reserved and the normalized values are written to new columns whose names carry the normalized_ prefix) and false (the column names stay unchanged and the normalized values replace the original values). | false
    lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | No default value
    coreNum | No | The number of cores used in computing. The value of this parameter must be a positive integer. | Automatically allocated
    memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: (1,65536). | Automatically allocated
    enableSparse | No | Specifies whether to support the input data in the sparse format. Valid values: true and false. | false
    itemDelimiter | No | The delimiter used between key-value pairs. | ,
    kvDelimiter | No | The delimiter used between keys and values. | :
    kvIndices | No | The feature indexes that require normalization in the table that contains data in the key-value format. | No default value
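
The roles of itemDelimiter, kvDelimiter, and kvIndices can be illustrated with a short Python sketch. This is not the component's implementation. The helper names (parse_kv, format_kv, normalize_sparse_cells) and the sample cells are hypothetical, and both the min-max scaling and the choice to leave absent keys untouched are assumptions made for illustration only.

```python
from typing import Dict, List


def parse_kv(cell: str, item_delimiter: str = ",", kv_delimiter: str = ":") -> Dict[int, float]:
    """Parse one sparse cell such as '1:2.0,2:10.0' into {1: 2.0, 2: 10.0}."""
    pairs: Dict[int, float] = {}
    for item in cell.split(item_delimiter):
        if not item:
            continue
        key, value = item.split(kv_delimiter, 1)
        pairs[int(key)] = float(value)
    return pairs


def format_kv(pairs: Dict[int, float], item_delimiter: str = ",", kv_delimiter: str = ":") -> str:
    """Serialize a parsed cell back to key-value text, ordered by key."""
    return item_delimiter.join(f"{k}{kv_delimiter}{v}" for k, v in sorted(pairs.items()))


def normalize_sparse_cells(cells: List[str], kv_indices: List[int]) -> List[str]:
    """Min-max scale only the keys listed in kv_indices (assumed behavior)."""
    rows = [parse_kv(cell) for cell in cells]
    for index in kv_indices:
        present = [row[index] for row in rows if index in row]
        if not present:
            continue
        lo, hi = min(present), max(present)
        for row in rows:
            if index in row:
                # Assumption: a constant feature maps to 0.0 to avoid dividing by zero.
                row[index] = 0.0 if hi == lo else (row[index] - lo) / (hi - lo)
    return [format_kv(row) for row in rows]


# Hypothetical sample cells; key 3 is not listed in kv_indices and stays unchanged.
cells = ["1:2.0,2:10.0", "1:4.0,3:7.0", "1:6.0,2:30.0"]
result = normalize_sparse_cells(cells, kv_indices=[1, 2])
print(result)
```

In this sketch, a key that is missing from a row is skipped rather than treated as zero. Whether the component does the same is not stated in this topic.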

Example

  • Data generation
    drop table if exists normalize_test_input;
    create table normalize_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime);
    insert overwrite table normalize_test_input
    select
        *
    from
    (
        select
            '01' as col_string,
            10 as col_bigint,
            10.1 as col_double,
            True as col_boolean,
            cast('2016-07-01 10:00:00' as datetime) as col_datetime
        from dual
        union all
            select
                cast(null as string) as col_string,
                11 as col_bigint,
                10.2 as col_double,
                False as col_boolean,
                cast('2016-07-02 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '02' as col_string,
                cast(null as bigint) as col_bigint,
                10.3 as col_double,
                True as col_boolean,
                cast('2016-07-03 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '03' as col_string,
                12 as col_bigint,
                cast(null as double) as col_double,
                False as col_boolean,
                cast('2016-07-04 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '04' as col_string,
                13 as col_bigint,
                10.4 as col_double,
                cast(null as boolean) as col_boolean,
                cast('2016-07-05 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '05' as col_string,
                14 as col_bigint,
                10.5 as col_double,
                True as col_boolean,
                cast(null as datetime) as col_datetime
            from dual
    ) tmp;
  • PAI commands
    drop table if exists normalize_test_input_output;
    drop table if exists normalize_test_input_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output"
        -DinputTableName="normalize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    drop table if exists normalize_test_input_output_using_model;
    drop table if exists normalize_test_input_output_using_model_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_output_using_model_model_output"
        -DinputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output_using_model"
        -DinputTableName="normalize_test_input";
  • Input
    normalize_test_input
    col_string | col_bigint | col_double | col_boolean | col_datetime
    01 | 10 | 10.1 | true | 2016-07-01 10:00:00
    NULL | 11 | 10.2 | false | 2016-07-02 10:00:00
    02 | NULL | 10.3 | true | 2016-07-03 10:00:00
    03 | 12 | NULL | false | 2016-07-04 10:00:00
    04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00
    05 | 14 | 10.5 | true | NULL
  • Output
    • normalize_test_input_output
      col_string | col_bigint | col_double | col_boolean | col_datetime | normalized_col_bigint | normalized_col_double
      01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | 0.0 | 0.0
      NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | 0.25 | 0.2499999999999989
      02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.5000000000000022
      03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.5 | NULL
      04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.75 | 0.7500000000000011
      05 | 14 | 10.5 | true | NULL | 1.0 | 1.0
    • normalize_test_input_model_output
      feature | json
      col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}}
      col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}}
    • normalize_test_input_output_using_model
      col_string | col_bigint | col_double | col_boolean | col_datetime
      01 | 0.0 | 0.0 | true | 2016-07-01 10:00:00
      NULL | 0.25 | 0.2499999999999989 | false | 2016-07-02 10:00:00
      02 | NULL | 0.5000000000000022 | true | 2016-07-03 10:00:00
      03 | 0.5 | NULL | false | 2016-07-04 10:00:00
      04 | 0.75 | 0.7500000000000011 | NULL | 2016-07-05 10:00:00
      05 | 1.0 | 1.0 | true | NULL
    • normalize_test_input_output_using_model_model_output
      feature | json
      col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}}
      col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}}
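
The sample values above are consistent with min-max scaling, (x - min) / (max - min), with NULL values passed through unchanged. This topic does not state the formula, so the Python sketch below is an inference from the example tables, not the component's implementation. The function names fit_min_max and apply_min_max are hypothetical. The first call derives the parameters from the data, as in the first PAI command. The second call reads them from a stored JSON row, as in the run that sets inputParaTableName.

```python
import json
from typing import Dict, List, Optional

Number = Optional[float]


def fit_min_max(values: List[Number]) -> Dict[str, float]:
    """Compute min and max over the non-null values of one column."""
    present = [v for v in values if v is not None]
    return {"min": min(present), "max": max(present)}


def apply_min_max(values: List[Number], paras: Dict[str, float]) -> List[Number]:
    """Scale each value to (x - min) / (max - min); None (NULL) stays None."""
    lo, hi = paras["min"], paras["max"]
    span = hi - lo
    return [None if v is None else (v - lo) / span for v in values]


# Column values copied from normalize_test_input.
col_bigint = [10, 11, None, 12, 13, 14]
col_double = [10.1, 10.2, 10.3, None, 10.4, 10.5]

# Case 1: derive the parameters from the data itself.
paras_bigint = fit_min_max(col_bigint)
normalized_bigint = apply_min_max(col_bigint, paras_bigint)

# Case 2: reuse parameters stored as JSON, copied from the col_double row
# of normalize_test_input_model_output.
model_row = '{"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}}'
paras_double = json.loads(model_row)["paras"]
normalized_double = apply_min_max(col_double, paras_double)

print(normalized_bigint)
print(normalized_double)
```

The col_double results differ from 0.25, 0.5, and 0.75 only in the last few digits because of binary floating-point rounding, as the values in normalize_test_input_output also show.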