全部產品
Search
文件中心

Platform For AI:缺失值填充

更新時間:Dec 31, 2024

缺失值填充是一種處理資料集中缺失資料的方法,旨在通過推斷和替換缺失值來提高資料完整性和模型效能。常見的填充方法包括使用最小值、最大值、平均數和自訂值進行填補。這些方法能夠協助減少資料不完整對模型訓練和預測的影響。

配置組件

方式一:可視化方式

在Designer工作流程頁面添加缺失值填充組件,並在介面右側配置相關參數:

參數類型

參數

描述

欄位設定

填充的欄位

選擇待填充的欄位。

原值

待填充欄位原值,取值:

  • Null(數值和string)

  • Null 字元串(string)

  • Null和Null 字元串(string)

  • 自訂(string)

替換為

替換欄位,取值:

  • Min(數值型)

  • Max(數值型)

  • Mean(數值型)

  • 自訂(數值型和string)

進階選項

自訂替換策略config。格式為:欄位名1, 原值, 替換值; 欄位名2, 原值, 替換值; ......

執行調優

計算核心數

計算核心數。

每個核記憶體數

每個核記憶體數,單位MB。

方式二:PAI命令方式

使用PAI命令配置缺失值填充組件參數。您可以使用SQL指令碼組件進行PAI命令調用,詳情請參見情境4:在SQL指令碼組件中執行PAI命令

PAI -name FillMissingValues
    -project algo_public
    -Dconfigs="poutcome,null-empty,testing" 
    -DoutputParaTableName="test_input_model_output"
    -DoutputTableName="test_3"
    -DinputTablePartitions="pt=20150501"
    -DinputTableName="bank_data_partition";

參數名稱

是否必選

預設值

描述

inputTableName

輸入表的表名。

inputTablePartitions

所有分區

輸入表中,參與訓練的分區。支援以下格式:

  • Partition_name=value

  • name1=value1/name2=value2:多級格式

說明

如果指定多個分區,則使用英文逗號(,)分隔。

outputTableName

輸出結果表。

configs

缺失值填充的配置。

例如格式col1, null, 3.14; col2, empty, hello; col3, empty-null, world,其中null表示空值,empty表示Null 字元。

  • 如果選擇Null 字元,則填充的目標列應是STRING類型。

  • 如果採用最大值、最小值、均值,可以採用變數,其命名規範形如:min, max, mean。

  • 如果使用者自訂替換值,則使用user-defined,格式例如col4,user-defined,str,str123

outputParaTableName

輸出表1為非分區表

配置輸出表。

inputParaTableName

配置輸入表。

lifecycle

輸出表的生命週期,取值範圍為[1,3650]

coreNum

系統自動分配

計算的核心數目,取值為正整數。

memSizePerCore

系統自動分配

每個核心的記憶體(單位是兆),取值範圍為(1, 65536)

使用樣本

  1. 使用SQL語句,產生測試資料。

    drop table if exists fill_missing_values_test_input;
    create table fill_missing_values_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime);
    insert overwrite table fill_missing_values_test_input
    select
        *
    from
    (
        select
            '01' as col_string,
            10 as col_bigint,
            10.1 as col_double,
            True as col_boolean,
            cast('2016-07-01 10:00:00' as datetime) as col_datetime
        union all
            select
                cast(null as string) as col_string,
                11 as col_bigint,
                10.2 as col_double,
                False as col_boolean,
                cast('2016-07-02 10:00:00' as datetime) as col_datetime
        union all
            select
                '02' as col_string,
                cast(null as bigint) as col_bigint,
                10.3 as col_double,
                True as col_boolean,
                cast('2016-07-03 10:00:00' as datetime) as col_datetime
        union all
            select
                '03' as col_string,
                12 as col_bigint,
                cast(null as double) as col_double,
                False as col_boolean,
                cast('2016-07-04 10:00:00' as datetime) as col_datetime
        union all
            select
                '04' as col_string,
                13 as col_bigint,
                10.4 as col_double,
                cast(null as boolean) as col_boolean,
                cast('2016-07-05 10:00:00' as datetime) as col_datetime
        union all
            select
                '05' as col_string,
                14 as col_bigint,
                10.5 as col_double,
                True as col_boolean,
                cast(null as datetime) as col_datetime
    ) tmp;

    輸入資料說明:

    +------------+------------+------------+-------------+--------------+
    | col_string | col_bigint | col_double | col_boolean | col_datetime |
    +------------+------------+------------+-------------+--------------+
    | 04         | 13         | 10.4       | NULL        | 2016-07-05 10:00:00 |
    | 02         | NULL       | 10.3       | true        | 2016-07-03 10:00:00 |
    | 03         | 12         | NULL       | false       | 2016-07-04 10:00:00 |
    | NULL       | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
    | 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
    | 05         | 14         | 10.5       | true        | NULL         |
    +------------+------------+------------+-------------+--------------+
  2. 運行PAI命令。

    drop table if exists fill_missing_values_test_input_output;
    drop table if exists fill_missing_values_test_input_model_output;
    PAI -name FillMissingValues
    -project algo_public
    -Dconfigs="col_double,null,mean;col_string,null-empty,str_type_empty;col_bigint,null,max;col_boolean,null,true;col_datetime,null,2016-07-06 10:00:00"
    -DoutputParaTableName="fill_missing_values_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="fill_missing_values_test_input_output"
    -DinputTableName="fill_missing_values_test_input";
    drop table if exists fill_missing_values_test_input_output_using_model;
    drop table if exists fill_missing_values_test_input_output_using_model_model_output;
    PAI -name FillMissingValues
    -project algo_public
    -DoutputParaTableName="fill_missing_values_test_input_output_using_model_model_output"
    -DinputParaTableName="fill_missing_values_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="fill_missing_values_test_input_output_using_model"
    -DinputTableName="fill_missing_values_test_input";
  3. 運行結果。

    • fill_missing_values_test_input_output

      +------------+------------+------------+-------------+--------------+
      | col_string | col_bigint | col_double | col_boolean | col_datetime |
      +------------+------------+------------+-------------+--------------+
      | 04         | 13         | 10.4       | true        | 2016-07-05 10:00:00 |
      | 02         | 14         | 10.3       | true        | 2016-07-03 10:00:00 |
      | 03         | 12         | 10.3       | false       | 2016-07-04 10:00:00 |
      | str_type_empty | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
      | 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
      | 05         | 14         | 10.5       | true        | 2016-07-06 10:00:00 |
      +------------+------------+------------+-------------+--------------+
    • fill_missing_values_test_input_model_output

      +------------+------------+
      | feature    | json       |
      +------------+------------+
      | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty",  "replaced_value": "str_type_empty"}} |
      | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null",  "replaced_value": 14}} |
      | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null",  "replaced_value": 10.3}} |
      | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null",  "replaced_value": 1}} |
      | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null",  "replaced_value": 1467770400000}} |
      +------------+------------+
    • fill_missing_values_test_input_output_using_model

      +------------+------------+------------+-------------+--------------+
      | col_string | col_bigint | col_double | col_boolean | col_datetime |
      +------------+------------+------------+-------------+--------------+
      | 04         | 13         | 10.4       | true        | 2016-07-05 10:00:00 |
      | 02         | 14         | 10.3       | true        | 2016-07-03 10:00:00 |
      | 03         | 12         | 10.3       | false       | 2016-07-04 10:00:00 |
      | str_type_empty | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
      | 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
      | 05         | 14         | 10.5       | true        | 2016-07-06 10:00:00 |
      +------------+------------+------------+-------------+--------------+
    • fill_missing_values_test_input_output_using_model_model_output

      +------------+------------+
      | feature    | json       |
      +------------+------------+
      | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty",  "replaced_value": "str_type_empty"}} |
      | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null",  "replaced_value": 14}} |
      | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null",  "replaced_value": 10.3}} |
      | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null",  "replaced_value": 1}} |
      | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null",  "replaced_value": 1467770400000}} |
      +------------+------------+