The Missing Data Imputation component allows you to use a list of values to impute missing values in the input table.
Background information
This component can replace a numeric null value with a maximum, minimum, average, or custom value.
This component can also replace a null string, empty string, null and empty string, or specified value with a custom value.
The missing values you want to impute can be null values, empty strings, or custom values.
If you use empty strings to impute missing values, the destination column for which you want to impute missing values must be of the STRING type.
You can replace a numeric null value with a custom value. Alternatively, you can directly replace the numeric null value with a maximum, minimum, or average value.
Configure the component
You can configure the component by using one of the following methods:
Method 1: Using the Machine Learning Platform for AI console
Configure the component parameters on the pipeline configuration page of Machine Learning Designer.
Tab | Parameter | Description |
Fields Setting | Columns to Impute | By default, all columns in the input table are selected. Specific columns may not be used for training. These columns do not affect the prediction result. |
Original Value |
| |
Replace With |
| |
Configs | The ID column. Note This parameter is displayed only if you select Advanced Options. | |
Tuning | Cores | |
Memory Size per Core |
Method 2: Using Machine Learning Platform for AI (PAI) commands
Configure the component parameters by using a PAI command. You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following table describes the parameters of the PAI command.
PAI -name FillMissingValues
-project algo_public
-Dconfigs="poutcome,null-empty,testing"
-DoutputParaTableName="test_input_model_output"
-DoutputTableName="test_3"
-DinputTablePartitions="pt=20150501"
-DinputTableName="bank_data_partition";
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | No default value |
inputTablePartitions | No | The partitions that are selected from the input table for training. The following formats are supported:
Note If you specify multiple partitions, separate them with commas (,). | All partitions |
outputTableName | Yes | The name of the output table. | No default value |
configs | Yes | The values that you want to use to impute missing values. For example, in
| No default value |
outputParaTableName | Yes | The name of the output parameter table. | No default value |
inputParaTableName | No | The name of the input parameter table. | No default value |
lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | No default value |
coreNum | No | The number of cores used in computing. The value must be a positive integer. | Determined by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: (1,65536). | Determined by the system |
Example
Execute the following SQL statements to generate test data:
drop table if exists fill_missing_values_test_input; create table fill_missing_values_test_input( col_string string, col_bigint bigint, col_double double, col_boolean boolean, col_datetime datetime); insert overwrite table fill_missing_values_test_input select * from ( select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime from dual union all select cast(null as string) as col_string, 11 as col_bigint, 10.2 as col_double, False as col_boolean, cast('2016-07-02 10:00:00' as datetime) as col_datetime from dual union all select '02' as col_string, cast(null as bigint) as col_bigint, 10.3 as col_double, True as col_boolean, cast('2016-07-03 10:00:00' as datetime) as col_datetime from dual union all select '03' as col_string, 12 as col_bigint, cast(null as double) as col_double, False as col_boolean, cast('2016-07-04 10:00:00' as datetime) as col_datetime from dual union all select '04' as col_string, 13 as col_bigint, 10.4 as col_double, cast(null as boolean) as col_boolean, cast('2016-07-05 10:00:00' as datetime) as col_datetime from dual union all select '05' as col_string, 14 as col_bigint, 10.5 as col_double, True as col_boolean, cast(null as datetime) as col_datetime from dual ) tmp;
Input data:
+------------+------------+------------+-------------+--------------+ | col_string | col_bigint | col_double | col_boolean | col_datetime | +------------+------------+------------+-------------+--------------+ | 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | | 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | | 03 | 12 | NULL | false | 2016-07-04 10:00:00 | | NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | | 05 | 14 | 10.5 | true | NULL | +------------+------------+------------+-------------+--------------+
Run the following commands:
drop table if exists fill_missing_values_test_input_output; drop table if exists fill_missing_values_test_input_model_output; PAI -name FillMissingValues -project algo_public -Dconfigs="col_double,null,mean;col_string,null-empty,str_type_empty;col_bigint,null,max;col_boolean,null,true;col_datetime,null,2016-07-06 10:00:00" -DoutputParaTableName="fill_missing_values_test_input_model_output" -Dlifecycle="28" -DoutputTableName="fill_missing_values_test_input_output" -DinputTableName="fill_missing_values_test_input"; drop table if exists fill_missing_values_test_input_output_using_model; drop table if exists fill_missing_values_test_input_output_using_model_model_output; PAI -name FillMissingValues -project algo_public -DoutputParaTableName="fill_missing_values_test_input_output_using_model_model_output" -DinputParaTableName="fill_missing_values_test_input_model_output" -Dlifecycle="28" -DoutputTableName="fill_missing_values_test_input_output_using_model" -DinputTableName="fill_missing_values_test_input";
View the return results.
fill_missing_values_test_input_output
+------------+------------+------------+-------------+--------------+ | col_string | col_bigint | col_double | col_boolean | col_datetime | +------------+------------+------------+-------------+--------------+ | 04 | 13 | 10.4 | true | 2016-07-05 10:00:00 | | 02 | 14 | 10.3 | true | 2016-07-03 10:00:00 | | 03 | 12 | 10.3 | false | 2016-07-04 10:00:00 | | str_type_empty | 11 | 10.2 | false | 2016-07-02 10:00:00 | | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | | 05 | 14 | 10.5 | true | 2016-07-06 10:00:00 | +------------+------------+------------+-------------+--------------+
fill_missing_values_test_input_model_output
+------------+------------+ | feature | json | +------------+------------+ | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} | | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} | | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} | | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} | | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} | +------------+------------+
fill_missing_values_test_input_output_using_model
+------------+------------+------------+-------------+--------------+ | col_string | col_bigint | col_double | col_boolean | col_datetime | +------------+------------+------------+-------------+--------------+ | 04 | 13 | 10.4 | true | 2016-07-05 10:00:00 | | 02 | 14 | 10.3 | true | 2016-07-03 10:00:00 | | 03 | 12 | 10.3 | false | 2016-07-04 10:00:00 | | str_type_empty | 11 | 10.2 | false | 2016-07-02 10:00:00 | | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | | 05 | 14 | 10.5 | true | 2016-07-06 10:00:00 | +------------+------------+------------+-------------+--------------+
fill_missing_values_test_input_output_using_model_model_output
+------------+------------+ | feature | json | +------------+------------+ | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} | | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} | | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} | | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} | | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} | +------------+------------+