The Missing Data Imputation component allows you to use a list of values to impute missing values in the input table.

Background information

  • This component can replace a numeric null value with a maximum, minimum, average, or custom value.
  • This component can also replace a null string, empty string, null and empty string, or specified value with a custom value.
  • The missing values you want to impute can be null values, empty strings, or custom values.

    If you use empty strings to impute missing values, the destination column for which you want to impute missing values must be of the STRING type.

  • You can replace a numeric null value with a custom value. Alternatively, you can directly replace the numeric null value with a maximum, minimum, or average value.

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    Tab Parameter Description
    Fields Setting Columns to Impute By default, all columns are selected. Extra columns do not affect the prediction result.
    Original Value
    • Null (Numerical and String)
    • Empty String
    • Null and Empty String (String)
    • Custom (String)
    Replace With
    • Minimum (Numerical)
    • Maximum (Numerical)
    • Custom (Numerical)
    • Custom (Numerical and String)
    configs The ID column.
    Note This parameter is available only when you select Advanced Options.
    Tuning Cores
    Memory Size per Core
  • PAI command
    PAI -name FillMissingValues
        -project algo_public
        -Dconfigs="poutcome,null-empty,testing" \
        -DoutputTableName="test_3"
        -DinputPartitions="pt=20150501"
        -DinputTableName="bank_data_partition";
    Parameter Required Description Default value
    inputTableName Yes The name of the input table. No default value
    inputTablePartitions No The partitions selected from the input table for training. The following formats are supported:
    • Partition_name=value
    • name1=value1/name2=value2: multiple-level partitions
    Note If you specify multiple partitions, separate these partitions with commas (,).
    All partitions
    outputTableName Yes The name of the output table. No default value
    configs Yes The values that you want to use to impute missing values.
    For example, in col1, null, 3.14; col2, empty, hello; col3, empty-null, world, null indicates a null value, and empty indicates an empty string.
    • If you use empty strings, the destination column for which you want to impute missing values must be of the STRING type.
    • If you use a maximum, minimum, or average value, variables can be used and named as min, max, or mean.
    • If you use custom values, user-defined is used in a specified format, such as col4,user-defined,str,str123.
    No default value
    outputParaTableName No The name of the output table. Output table 1 is a non-partitioned table.
    inputParaTableName Yes The name of the input table. No default value
    lifecycle No The lifecycle of the output table. Valid values: [1,3650]. No default value
    coreNum No The number of cores, which must be a positive integer. Automatically allocated
    memSizePerCore No The memory size of each core, in MB. Valid values: (1, 65536). Automatically allocated

Example

  1. Execute the following SQL statements to generate test data:
    drop table if exists fill_missing_values_test_input;
    create table fill_missing_values_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime);
    insert overwrite table fill_missing_values_test_input
    select
        *
    from
    (
        select
            '01' as col_string,
            10 as col_bigint,
            10.1 as col_double,
            True as col_boolean,
            cast('2016-07-01 10:00:00' as datetime) as col_datetime
        from dual
        union all
            select
                cast(null as string) as col_string,
                11 as col_bigint,
                10.2 as col_double,
                False as col_boolean,
                cast('2016-07-02 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '02' as col_string,
                cast(null as bigint) as col_bigint,
                10.3 as col_double,
                True as col_boolean,
                cast('2016-07-03 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '03' as col_string,
                12 as col_bigint,
                cast(null as double) as col_double,
                False as col_boolean,
                cast('2016-07-04 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '04' as col_string,
                13 as col_bigint,
                10.4 as col_double,
                cast(null as boolean) as col_boolean,
                cast('2016-07-05 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '05' as col_string,
                14 as col_bigint,
                10.5 as col_double,
                True as col_boolean,
                cast(null as datetime) as col_datetime
            from dual
    ) tmp;
    Input data:
    +------------+------------+------------+-------------+--------------+
    | col_string | col_bigint | col_double | col_boolean | col_datetime |
    +------------+------------+------------+-------------+--------------+
    | 04         | 13         | 10.4       | NULL        | 2016-07-05 10:00:00 |
    | 02         | NULL       | 10.3       | true        | 2016-07-03 10:00:00 |
    | 03         | 12         | NULL       | false       | 2016-07-04 10:00:00 |
    | NULL       | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
    | 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
    | 05         | 14         | 10.5       | true        | NULL         |
    +------------+------------+------------+-------------+--------------+
  2. Run the following commands:
    drop table if exists fill_missing_values_test_input_output;
    drop table if exists fill_missing_values_test_input_model_output;
    PAI -name FillMissingValues
    -project algo_public
    -Dconfigs="col_double,null,mean;col_string,null-empty,str_type_empty;col_bigint,null,max;col_boolean,null,true;col_datetime,null,2016-07-06 10:00:00"
    -DoutputParaTableName="fill_missing_values_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="fill_missing_values_test_input_output"
    -DinputTableName="fill_missing_values_test_input";
    drop table if exists fill_missing_values_test_input_output_using_model;
    drop table if exists fill_missing_values_test_input_output_using_model_model_output;
    PAI -name FillMissingValues
    -project algo_public
    -DoutputParaTableName="fill_missing_values_test_input_output_using_model_model_output"
    -DinputParaTableName="fill_missing_values_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="fill_missing_values_test_input_output_using_model"
    -DinputTableName="fill_missing_values_test_input";
  3. View the return results.
    • fill_missing_values_test_input_output
      +------------+------------+------------+-------------+--------------+
      | col_string | col_bigint | col_double | col_boolean | col_datetime |
      +------------+------------+------------+-------------+--------------+
      | 04         | 13         | 10.4       | true        | 2016-07-05 10:00:00 |
      | 02         | 14         | 10.3       | true        | 2016-07-03 10:00:00 |
      | 03         | 12         | 10.3       | false       | 2016-07-04 10:00:00 |
      | str_type_empty | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
      | 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
      | 05         | 14         | 10.5       | true        | 2016-07-06 10:00:00 |
      +------------+------------+------------+-------------+--------------+
    • fill_missing_values_test_input_model_output
      +------------+------------+
      | feature    | json       |
      +------------+------------+
      | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty",  "replaced_value": "str_type_empty"}} |
      | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null",  "replaced_value": 14}} |
      | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null",  "replaced_value": 10.3}} |
      | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null",  "replaced_value": 1}} |
      | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null",  "replaced_value": 1467770400000}} |
      +------------+------------+
    • fill_missing_values_test_input_output_using_model
      +------------+------------+------------+-------------+--------------+
      | col_string | col_bigint | col_double | col_boolean | col_datetime |
      +------------+------------+------------+-------------+--------------+
      | 04         | 13         | 10.4       | true        | 2016-07-05 10:00:00 |
      | 02         | 14         | 10.3       | true        | 2016-07-03 10:00:00 |
      | 03         | 12         | 10.3       | false       | 2016-07-04 10:00:00 |
      | str_type_empty | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
      | 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
      | 05         | 14         | 10.5       | true        | 2016-07-06 10:00:00 |
      +------------+------------+------------+-------------+--------------+
    • fill_missing_values_test_input_output_using_model_model_output
      +------------+------------+
      | feature    | json       |
      +------------+------------+
      | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty",  "replaced_value": "str_type_empty"}} |
      | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null",  "replaced_value": 14}} |
      | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null",  "replaced_value": 10.3}} |
      | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null",  "replaced_value": 1}} |
      | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null",  "replaced_value": 1467770400000}} |
      +------------+------------+