This topic describes the Standardization component provided by Machine Learning Studio.

Background information

  • You can standardize one or more columns in a table and save the generated data to a new table.
  • The following formula is used for standardization: (X - Mean)/(Standard deviation)
    • Mean: the mean of samples.
    • Standard deviation: the standard deviation of samples. The standard deviation is used when samples are used to calculate the total deviation. To make the value obtained after standardization closer to the mean, you must moderately increase the calculated standard deviation by using the formula standard deviation.
    • The formula used to calculate the standard deviation of samples is expression.

      x represents the mean of samples X1, X2, ..., and Xn.

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI (PAI) console
    Tab Parameter Description
    Fields Setting All Selected by Default By default, all columns are selected. Additional columns do not affect the prediction result.
    Reserve Original Columns Column names are prefixed with stdized_ after standardization. The DOUBLE and BIGINT data types are supported.
    Tuning Cores The number of cores. The system automatically allocates the number of cores used for training based on the volume of input data.
    Memory Size per Core The memory size of each core. The system automatically allocates the memory size based on the volume of input data. Unit: MB.
  • PAI command
    • Command for dense data
      PAI -name Standardize
          -project algo_public 
          -DkeepOriginal="false"    
          -DoutputTableName="test_5"
          -DinputTablePartitions="pt=20150501"  
          -DinputTableName="bank_data_partition" 
          -DselectedColNames="euribor3m,pdays"
    • Command for sparse data
      PAI -name Standardize    
          -project projectxlib4
          -DkeepOriginal="true"
          -DoutputTableName="kv_standard_output"
          -DinputTableName=kv_standard_test
          -DselectedColNames="f0,f1,f2"
          -DenableSparse=true
          -DoutputParaTableName=kv_standard_model    
          -DkvIndices=1,2,8,6
          -DitemDelimiter=",";
    Parameter Required Description Default value
    inputTableName Yes The name of the input table. No default value
    selectedColNames No The names of columns selected from the input table for training. Separate multiple names with commas (,). The selected columns can contain data of the INT or DOUBLE type. If data in the input table is in sparse format, the selected columns can contain data of the STRING type. All columns
    inputTablePartitions No The partitions selected from the input table for training. Specify this parameter in one of the following formats:
    • Partition_name=value
    • name1=value1/name2=value2: multi-level partitions
    Note If you specify multiple partitions, separate them with commas (,).
    All partitions
    outputTableName Yes The name of the output table. No default value
    outputParaTableName Yes The output table of parameters. No default value
    inputParaTableName No The input table of parameters. No default value
    keepOriginal No Specifies whether to reserve original columns. Valid values:
    • true: Rename the standardized columns with the stdized_ prefix and reserve original columns.
    • false: Reserve all columns without renaming them.
    false
    lifecycle No The lifecycle of the output table. No default value
    coreNum No The number of cores. Automatically allocated
    memSizePerCore No The memory size of each core. Automatically allocated
    enableSparse No Specifies whether to support the input data in sparse format. Valid values:
    • true
    • false
    false
    itemDelimiter No The delimiter used between key-value pairs. ,
    kvDelimiter No The delimiter used between keys and values. :
    kvIndices No The feature indexes that require standardization in the table that contains data in the key-value format. No default value

Example

Detailed example
drop table if exists standardize_test_input;
create table standardize_test_input(
    col_string string,
    col_bigint bigint,
    col_double double,
    col_boolean boolean,
    col_datetime datetime);
insert overwrite table standardize_test_input
select
    *
from
(
    select
        '01' as col_string,
        10 as col_bigint,
        10.1 as col_double,
        True as col_boolean,
        cast('2016-07-01 10:00:00' as datetime) as col_datetime
    from dual
    union all
        select
            cast(null as string) as col_string,
            11 as col_bigint,
            10.2 as col_double,
            False as col_boolean,
            cast('2016-07-02 10:00:00' as datetime) as col_datetime
        from dual
    union all
        select
            '02' as col_string,
            cast(null as bigint) as col_bigint,
            10.3 as col_double,
            True as col_boolean,
            cast('2016-07-03 10:00:00' as datetime) as col_datetime
        from dual
    union all
        select
            '03' as col_string,
            12 as col_bigint,
            cast(null as double) as col_double,
            False as col_boolean,
            cast('2016-07-04 10:00:00' as datetime) as col_datetime
        from dual
    union all
        select
            '04' as col_string,
            13 as col_bigint,
            10.4 as col_double,
            cast(null as boolean) as col_boolean,
            cast('2016-07-05 10:00:00' as datetime) as col_datetime
        from dual
    union all
        select
            '05' as col_string,
            14 as col_bigint,
            10.5 as col_double,
            True as col_boolean,
            cast(null as datetime) as col_datetime
        from dual
) tmp;
  • PAI commands
    drop table if exists standardize_test_input_output;
    drop table if exists standardize_test_input_model_output;
    PAI -name Standardize
        -project algo_public
        -DoutputParaTableName="standardize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="standardize_test_input_output"
        -DinputTableName="standardize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    drop table if exists standardize_test_input_output_using_model;
    drop table if exists standardize_test_input_output_using_model_model_output;
    PAI -name Standardize
        -project algo_public
        -DoutputParaTableName="standardize_test_input_output_using_model_model_output"
        -DinputParaTableName="standardize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="standardize_test_input_output_using_model"
        -DinputTableName="standardize_test_input";
  • Input
    standardize_test_input
    col_string col_bigint col_double col_boolean col_datetime
    01 10 10.1 true 2016-07-01 10:00:00
    NULL 11 10.2 false 2016-07-02 10:00:00
    02 NULL 10.3 true 2016-07-03 10:00:00
    03 12 NULL false 2016-07-04 10:00:00
    04 13 10.4 NULL 2016-07-05 10:00:00
    05 14 10.5 true NULL
  • Output
    • standardize_test_input_output
      col_string col_bigint col_double col_boolean col_datetime stdized_col_bigint stdized_col_double
      01 10 10.1 true 2016-07-01 10:00:00 -1.2649110640673518 -1.2649110640683832
      NULL 11 10.2 false 2016-07-02 10:00:00 -0.6324555320336759 -0.6324555320341972
      02 NULL 10.3 true 2016-07-03 10:00:00 NULL 0.0
      03 12 NULL false 2016-07-04 10:00:00 0.0 NULL
      04 13 10.4 NULL 2016-07-05 10:00:00 0.6324555320336759 0.6324555320341859
      05 14 10.5 true NULL 1.2649110640673518 1.2649110640683718
    • standardize_test_input_model_output
      feature json
      col_bigint {"name": "standardize", "type":"bigint", "paras":{"mean":12, "std": 1.58113883008419}}
      col_double {"name": "standardize", "type":"double", "paras":{"mean":10.3, "std": 0.1581138830082909}}
    • standardize_test_input_output_using_model
      col_string col_bigint col_double col_boolean col_datetime
      01 -1.2649110640673515 -1.264911064068383 true 2016-07-01 10:00:00
      NULL -0.6324555320336758 -0.6324555320341971 false 2016-07-02 10:00:00
      02 NULL 0.0 true 2016-07-03 10:00:00
      03 0.0 NULL false 2016-07-04 10:00:00
      04 0.6324555320336758 0.6324555320341858 NULL 2016-07-05 10:00:00
      05 1.2649110640673515 1.2649110640683716 true NULL
    • standardize_test_input_output_using_model_model_output
      feature json
      col_bigint {"name": "standardize", "type":"bigint", "paras":{"mean":12, "std": 1.58113883008419}}
      col_double {"name": "standardize", "type":"double", "paras":{"mean":10.3, "std": 0.1581138830082909}}