Platform for AI: GBDT Binary Classification

Last Updated: Feb 06, 2024

The GBDT Binary Classification component performs binary classification based on a threshold. If a feature value is greater than the threshold, the sample is classified as a positive example. Otherwise, the sample is classified as a negative example.
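The threshold rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and the threshold of 0.5 are assumptions made for the example and are not documented defaults of the component.

```python
def classify(value: float, threshold: float) -> int:
    """Return 1 (positive example) if value exceeds the threshold, else 0 (negative example)."""
    return 1 if value > threshold else 0


# Hypothetical values; 0.5 is an illustrative threshold, not a documented default.
labels = [classify(v, threshold=0.5) for v in (0.2, 0.5, 0.9)]
print(labels)  # [0, 0, 1]
```

A value equal to the threshold is treated as a negative example here because the rule requires the value to be strictly greater than the threshold.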

Configure the component

You can use one of the following methods to configure the GBDT Binary Classification component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the GBDT Binary Classification component on the pipeline page of Machine Learning Designer of Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | The feature columns that are selected from the input table for training. Columns of the DOUBLE and BIGINT types are supported. Note: A maximum of 800 feature columns can be selected. |
| Fields Setting | Label Column | The label column. Only columns of the BIGINT type are supported. |
| Fields Setting | Stratified Column | Columns of the DOUBLE and BIGINT types are supported. By default, the full table is a group. |
| Parameters Setting | Metric Type | The metric type. Valid values: NDCG and DCG. |
| Parameters Setting | Decision Tree Quantity | The number of trees. Valid values: 1 to 10000. |
| Parameters Setting | Learning Rate | The learning rate. Valid values: (0,1). |
| Parameters Setting | Rate of Samples for Training | The proportion of samples that are selected for training. Valid values: (0,1]. |
| Parameters Setting | Ratio of Features for Training | The proportion of features that are selected for training. Valid values: (0,1]. |
| Parameters Setting | Maximum Leaf Quantity | The maximum number of leaf nodes on each tree. Valid values: 1 to 1000. |
| Parameters Setting | Testing Data Ratio | The proportion of data that is selected for testing. Valid values: [0,1). |
| Parameters Setting | Maximum Decision Tree Depth | The maximum depth of each tree. Valid values: 1 to 100. |
| Parameters Setting | Minimum Number of Samples on a Leaf Node | The minimum number of samples on each leaf node. Valid values: 1 to 1000. |
| Parameters Setting | Random Seed | The random seed. Valid values: [0,10]. |
| Parameters Setting | Maximum Feature Split Times | The maximum number of splits of each feature. Valid values: 1 to 1000. |
| Tuning | Number of Cores | The number of cores. The system automatically allocates cores used for training based on the volume of input data. |
| Tuning | Memory | The memory size of each core. The system automatically allocates the memory based on the volume of input data. Valid values: 1024 to 64 × 1024. Unit: MB. |
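The Metric Type parameter accepts NDCG and DCG. As a rough illustration of what these two metrics measure, the following Python sketch computes both for a ranked list of labels. It assumes the common textbook form with a linear gain and a log2 position discount. The exact formula that PAI uses is not stated in this topic, and the function names `dcg` and `ndcg` are chosen only for this example.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), with positions i starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if there is no gain at all."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical labels of four samples, listed in the order the model ranked them.
ranked_labels = [1, 0, 1, 0]
print(dcg(ranked_labels))   # 1.5
print(ndcg(ranked_labels))  # below 1.0 because a negative sample is ranked above a positive one
```

NDCG normalizes DCG to the range [0,1], which makes the values comparable across groups of different sizes.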

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name gbdt_lr
    -project algo_public
    -DfeatureSplitValueMaxSize="500"
    -DrandSeed="0"
    -Dshrinkage="0.5"
    -DmaxLeafCount="32"
    -DlabelColName="y"
    -DinputTableName="bank_data_partition"
    -DminLeafSampleCount="500"
    -DgroupIDColName="nr_employed"
    -DsampleRatio="0.6"
    -DmaxDepth="11"
    -DmodelName="xlab_m_GBDT_LR_21208"
    -DmetricType="2"
    -DfeatureRatio="0.6"
    -DinputTablePartitions="pt=20150501"
    -DtestRatio="0.0"
    -DfeatureColNames="age,previous,cons_conf_idx,euribor3m"
    -DtreeCount="500";
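If you generate PAI commands from a script, the command text can be assembled from a dictionary of parameters. The following Python sketch shows one way to do that. The helper `build_pai_command` is hypothetical and is not part of any PAI SDK. It only formats a string in the same layout as the preceding command and uses a subset of its parameters.

```python
def build_pai_command(name: str, project: str, params: dict) -> str:
    """Render a PAI command with one -D<key>="<value>" option per line, ending with a semicolon."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    lines += [f'    -D{key}="{value}"' for key, value in params.items()]
    return "\n".join(lines) + ";"


# Parameter names and values are copied from the sample command in this topic.
command = build_pai_command(
    "gbdt_lr",
    "algo_public",
    {
        "inputTableName": "bank_data_partition",
        "inputTablePartitions": "pt=20150501",
        "featureColNames": "age,previous,cons_conf_idx,euribor3m",
        "labelColName": "y",
        "modelName": "xlab_m_GBDT_LR_21208",
        "treeCount": 500,
    },
)
print(command)
```

The resulting string can be pasted into the SQL Script component. The sketch does not validate parameter names or values.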

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | N/A |
| featureColNames | No | The feature columns that are selected from the input table for training. | All columns of numeric data types |
| labelColName | Yes | The name of the label column in the input table. | N/A |
| inputTablePartitions | No | The partitions that are selected from the input table for training. Specify this parameter in one of the following formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate these partitions with commas (,). | All partitions |
| modelName | Yes | The name of the output model. | N/A |
| outputImportanceTableName | No | The name of the table that provides feature importance. | N/A |
| groupIDColName | No | The name of the group column. | Full table |
| lossType | No | The type of the loss function. Valid value: 4, which indicates LOG_LIKELIHOOD. | 4 |
| metricType | No | The metric type. Valid values: 0: normalized discounted cumulative gain (NDCG). 1: discounted cumulative gain (DCG). 2: area under the curve (AUC). This metric type is suitable only for the scenario where the value of label is set to 0 or 1 (Deprecated). | 0 |
| treeCount | No | The number of trees. Valid values: 1 to 10000. | 500 |
| shrinkage | No | The learning rate. Valid values: (0,1). | 0.05 |
| maxLeafCount | No | The maximum number of leaf nodes on each tree. Valid values: 1 to 1000. | 32 |
| maxDepth | No | The maximum depth of each tree. Valid values: 1 to 100. | 10 |
| minLeafSampleCount | No | The minimum number of samples on each leaf node. Valid values: 1 to 1000. | 500 |
| sampleRatio | No | The proportion of samples that are selected for training. Valid values: (0,1]. | 0.6 |
| featureRatio | No | The proportion of features that are selected for training. Valid values: (0,1]. | 0.6 |
| tau | No | The Tau parameter for the GBRank loss function. Valid values: [0,1]. | 0.6 |
| p | No | The p parameter for the GBRank loss function. Valid values: [1,10]. | 1 |
| randSeed | No | The random seed. Valid values: [0,10]. | 0 |
| newtonStep | No | Specifies whether to use Newton's method. Valid values: 0 and 1. | 1 |
| featureSplitValueMaxSize | No | The maximum number of splits of each feature. Valid values: 1 to 1000. | 500 |
| lifecycle | No | The lifecycle of the output table. Unit: days. | N/A |
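The valid ranges listed for the numeric parameters can be checked before a command is submitted. The following Python sketch transcribes those ranges into a lookup table and reports the parameters whose values fall outside them. The helper `out_of_range` and the `RANGES` table are hypothetical names used only for this example. PAI performs its own validation, and its behavior for out-of-range values is not described in this topic.

```python
# Ranges transcribed from the parameter descriptions: (low, high, low_inclusive, high_inclusive).
RANGES = {
    "treeCount": (1, 10000, True, True),
    "shrinkage": (0, 1, False, False),
    "maxLeafCount": (1, 1000, True, True),
    "maxDepth": (1, 100, True, True),
    "minLeafSampleCount": (1, 1000, True, True),
    "sampleRatio": (0, 1, False, True),
    "featureRatio": (0, 1, False, True),
    "tau": (0, 1, True, True),
    "p": (1, 10, True, True),
    "randSeed": (0, 10, True, True),
    "featureSplitValueMaxSize": (1, 1000, True, True),
}


def out_of_range(params):
    """Return the names of parameters whose values fall outside the ranges in RANGES."""
    bad = []
    for name, value in params.items():
        if name not in RANGES:
            continue  # Parameters without a numeric range are not checked.
        low, high, low_inc, high_inc = RANGES[name]
        above_low = value >= low if low_inc else value > low
        below_high = value <= high if high_inc else value < high
        if not (above_low and below_high):
            bad.append(name)
    return bad


print(out_of_range({"treeCount": 500, "shrinkage": 0.05, "sampleRatio": 0.6}))  # []
print(out_of_range({"shrinkage": 1.5, "maxDepth": 11}))  # ['shrinkage']
```

The four-element tuples distinguish open bounds such as (0,1) from closed bounds such as [0,10], which matters for values that sit exactly on a boundary.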

Note

  • By default, gradient boosting decision tree (GBDT) and GBDT_LR use different loss functions. GBDT uses a regression loss (mean squared error loss), and GBDT_LR uses logistic regression loss. Therefore, you do not need to configure a loss function for GBDT_LR.

  • The feature column, label column, and stratification column of GBDT must be of numeric data types.

  • You must specify the objective reference value for the Prediction component to generate a receiver operating characteristic (ROC) curve.
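To illustrate the difference between the two loss functions named in the first note, the following Python sketch evaluates both for a single sample. It assumes the standard textbook definitions of squared error and negative log-likelihood. The function names are chosen for this example, and the internal implementation in PAI is not shown in this topic.

```python
import math


def squared_error(y: float, pred: float) -> float:
    """Squared error between the label and the prediction for one sample."""
    return (y - pred) ** 2


def log_loss(y: int, prob: float) -> float:
    """Negative log-likelihood of label y (0 or 1), where prob is the predicted probability of class 1."""
    return -(y * math.log(prob) + (1 - y) * math.log(1 - prob))


# A positive sample (label 1) with a good prediction and with a confidently wrong prediction.
for prob in (0.9, 0.01):
    print(prob, round(squared_error(1, prob), 4), round(log_loss(1, prob), 4))
# 0.9 0.01 0.1054
# 0.01 0.9801 4.6052
```

Squared error is bounded by 1 for probabilities in [0,1], whereas the log-likelihood loss grows without bound as a confident prediction becomes wrong.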

Example

  1. Execute the following SQL statements to generate training data:

    drop table if exists gbdt_lr_test_input;
    create table gbdt_lr_test_input
    as
    select
        *
    from
    (
        select
            cast(1 as double) as f0,
            cast(0 as double) as f1,
            cast(0 as double) as f2,
            cast(0 as double) as f3,
            cast(0 as bigint) as label
        from dual
        union all
        select
            cast(0 as double) as f0,
            cast(1 as double) as f1,
            cast(0 as double) as f2,
            cast(0 as double) as f3,
            cast(0 as bigint) as label
        from dual
        union all
        select
            cast(0 as double) as f0,
            cast(0 as double) as f1,
            cast(1 as double) as f2,
            cast(0 as double) as f3,
            cast(1 as bigint) as label
        from dual
        union all
        select
            cast(0 as double) as f0,
            cast(0 as double) as f1,
            cast(0 as double) as f2,
            cast(1 as double) as f3,
            cast(1 as bigint) as label
        from dual
        union all
        select
            cast(1 as double) as f0,
            cast(0 as double) as f1,
            cast(0 as double) as f2,
            cast(0 as double) as f3,
            cast(0 as bigint) as label
        from dual
        union all
        select
            cast(0 as double) as f0,
            cast(1 as double) as f1,
            cast(0 as double) as f2,
            cast(0 as double) as f3,
            cast(0 as bigint) as label
        from dual
    ) a;

    The following training data table gbdt_lr_test_input is generated.

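    | f0 | f1 | f2 | f3 | label |
    | --- | --- | --- | --- | --- |
    | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
    | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
    | 0.0 | 0.0 | 0.0 | 1.0 | 1 |
    | 0.0 | 1.0 | 0.0 | 0.0 | 0 |
    | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
    | 0.0 | 1.0 | 0.0 | 0.0 | 0 |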

  2. Run the following PAI command to submit the training parameters configured for the GBDT Binary Classification component:

    drop offlinemodel if exists gbdt_lr_test_model;
    PAI -name gbdt_lr
        -project algo_public
        -DfeatureSplitValueMaxSize="500"
        -DrandSeed="1"
        -Dshrinkage="1"
        -DmaxLeafCount="30"
        -DlabelColName="label"
        -DinputTableName="gbdt_lr_test_input"
        -DminLeafSampleCount="1"
        -DsampleRatio="1"
        -DmaxDepth="10"
        -DmodelName="gbdt_lr_test_model"
        -DmetricType="0"
        -DfeatureRatio="1"
        -DtestRatio="0"
        -DfeatureColNames="f0,f1,f2,f3"
        -DtreeCount="5";

  3. Run the following PAI command to submit the parameters configured for the Prediction component:

    drop table if exists gbdt_lr_test_prediction_result;
    PAI -name prediction
        -project algo_public
        -DdetailColName="prediction_detail"
        -DmodelName="gbdt_lr_test_model"
        -DitemDelimiter=","
        -DresultColName="prediction_result"
        -Dlifecycle="28"
        -DoutputTableName="gbdt_lr_test_prediction_result"
        -DscoreColName="prediction_score"
        -DkvDelimiter=":"
        -DinputTableName="gbdt_lr_test_input"
        -DenableSparse="false"
        -DappendColNames="label";

  4. View the prediction result table gbdt_lr_test_prediction_result.

    | label | prediction_result | prediction_score | prediction_detail |
    | --- | --- | --- | --- |
    | 0 | 0 | 0.9984308925552831 | {"0": 0.9984308925552831, "1": 0.001569107444716943} |
    | 0 | 0 | 0.9984308925552831 | {"0": 0.9984308925552831, "1": 0.001569107444716943} |
    | 1 | 1 | 0.9982721832240973 | {"0": 0.001727816775902724, "1": 0.9982721832240973} |
    | 1 | 1 | 0.9982721832240973 | {"0": 0.001727816775902724, "1": 0.9982721832240973} |
    | 0 | 0 | 0.9984308925552831 | {"0": 0.9984308925552831, "1": 0.001569107444716943} |
    | 0 | 0 | 0.9984308925552831 | {"0": 0.9984308925552831, "1": 0.001569107444716943} |
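In the rows above, prediction_result matches the class with the highest probability in prediction_detail, and prediction_score matches that probability. The following Python sketch reproduces this relationship for one row. This reading is an assumption that is consistent with the sample output and is not stated explicitly in this topic. The helper name `summarize` is hypothetical.

```python
import json


def summarize(detail_json: str):
    """Return (label, score) for the most probable class in a prediction_detail JSON string."""
    probs = json.loads(detail_json)
    label = max(probs, key=probs.get)  # Key of the class with the highest probability.
    return int(label), probs[label]


# prediction_detail value copied from the first row of the result table.
detail = '{"0": 0.9984308925552831, "1": 0.001569107444716943}'
result, score = summarize(detail)
print(result, score)
```

For this row, the sketch returns label 0 together with the probability of class 0, which agrees with the prediction_result and prediction_score columns.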