Gradient boosting decision tree (GBDT) is an algorithm that iteratively trains an ensemble of decision trees, and is suitable for both linear and nonlinear regression scenarios.
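At its core, GBDT builds an additive model: each new tree is fit to the residuals (the negative gradient of the loss) of the current ensemble, and its output is added with a learning rate. The following minimal Python sketch illustrates the idea for least-squares regression with depth-1 trees (stumps). It is an illustration of the general algorithm, not the component's actual implementation; all function names are our own.

```python
def fit_stump(X, residuals):
    """Fit a depth-1 regression tree: pick the (feature, threshold) split
    that minimizes squared error on the residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # skip degenerate splits
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lmean, rmean)
    _, j, t, lmean, rmean = best
    return lambda row: lmean if row[j] <= t else rmean

def gbdt_fit(X, y, tree_count=10, shrinkage=0.5):
    """Boosting loop: each stump fits the current residuals, and its
    prediction is added with the learning rate (shrinkage)."""
    stumps = []
    pred = [0.0] * len(y)  # initial prediction is 0
    for _ in range(tree_count):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        pred = [pi + shrinkage * stump(row) for pi, row in zip(pred, X)]
        stumps.append(stump)
    return lambda row: sum(shrinkage * s(row) for s in stumps)

# Toy data, separable on the single feature: labels 0, 0, 1, 1
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 1.0, 1.0]
model = gbdt_fit(X, y, tree_count=10, shrinkage=0.5)
print(model([3.0]))  # converges toward 1: 1 - 0.5**10 = 0.9990234375
```

Because each stump fits the residual exactly on this data, the prediction for a label-1 row after k rounds is 1 - (1 - shrinkage)^k, which shows how a larger learning rate trades faster convergence for less regularization.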

Configure the component

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    Fields Setting tab:
    • Input Columns: The feature columns that are selected from the input table for training. Columns of the DOUBLE and BIGINT types are supported.
      Note: A maximum of 800 feature columns can be selected.
    • Label Column: The label column. Columns of the DOUBLE and BIGINT types are supported.
    • Group Column: The column that groups samples. Columns of the DOUBLE and BIGINT types are supported. By default, the entire table is treated as a single group.
    Parameters Setting tab:
    • Loss Function Type: The type of the loss function. Valid values: Gbrank Loss, Lambdamart DCG Loss, Lambdamart NDCG Loss, and Regression Loss.
    • Tau in gbrank loss: The tau parameter. This parameter is required only if the Loss Function Type parameter is set to Gbrank Loss. Valid values: [0,1].
    • Exponent Base of Gbrank and Regression Loss: This parameter is required only if the Loss Function Type parameter is set to Gbrank Loss or Regression Loss. Valid values: [1,10].
    • Metric Type: The metric type. Valid values: NDCG and DCG.
    • Number of Decision Trees: The number of trees. Valid values: 1 to 10000.
    • Learning Rate: The learning rate. Valid values: (0,1).
    • Maximum Leaf Quantity: The maximum number of leaf nodes on each tree. Valid values: 1 to 1000.
    • Maximum Decision Tree Depth: The maximum depth of each tree. Valid values: 1 to 100.
    • Minimum Sample Quantity on a Leaf Node: The minimum number of samples on each leaf node. Valid values: 1 to 1000.
    • Sample Ratio: The proportion of samples that are selected for training. Valid values: (0,1).
    • Feature Ratio: The proportion of features that are selected for training. Valid values: (0,1).
    • Test Sample Ratio: The proportion of samples that are reserved for testing. Valid values: [0,1).
    • Random Seed: The random seed. Valid values: [0,10].
    • Use Newton-Raphson Method: Specifies whether to use Newton's method.
    • Maximum Feature Split Times: The maximum number of splits of each feature. Valid values: 1 to 1000.
    Tuning tab:
    • Number of Computing Cores: The number of cores. The system automatically allocates cores based on the volume of input data.
    • Memory Size per Core: The memory size of each core. Unit: MB. The system automatically allocates the memory based on the volume of input data.
  • Use commands
    PAI -name gbdt
        -project algo_public
        -DfeatureSplitValueMaxSize="500"
        -DlossType="0"
        -DrandSeed="0"
        -DnewtonStep="0"
        -Dshrinkage="0.05"
        -DmaxLeafCount="32"
        -DlabelColName="campaign"
        -DinputTableName="bank_data_partition"
        -DminLeafSampleCount="500"
        -DsampleRatio="0.6"
        -DgroupIDColName="age"
        -DmaxDepth="11"
        -DmodelName="xlab_m_GBDT_83602"
        -DmetricType="2"
        -DfeatureRatio="0.6"
        -DinputTablePartitions="pt=20150501"
        -Dtau="0.6"
        -Dp="1"
        -DtestRatio="0.0"
        -DfeatureColNames="previous,cons_conf_idx,euribor3m"
        -DtreeCount="500"
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input table. | N/A
    featureColNames | No | The feature columns that are selected from the input table for training. Columns of the DOUBLE and BIGINT types are supported. | All columns of numeric data types
    labelColName | Yes | The label column in the input table. Columns of the DOUBLE and BIGINT types are supported. | N/A
    inputTablePartitions | No | The partitions that are selected from the input table for training. Specify this parameter in the format partition_name=value, or name1=value1/name2=value2 for multi-level partitions. If you specify multiple partitions, separate them with commas (,). | All partitions
    modelName | Yes | The name of the output model. | N/A
    outputImportanceTableName | No | The name of the table that provides feature importance. | N/A
    groupIDColName | No | The name of the group column. | Full table
    lossType | No | The type of the loss function. Valid values: 0 (GBRANK), 1 (LAMBDAMART_DCG), 2 (LAMBDAMART_NDCG), 3 (LEAST_SQUARE), and 4 (LOG_LIKELIHOOD). | 0
    metricType | No | The metric type. Valid values: 0 (NDCG, normalized discounted cumulative gain), 1 (DCG, discounted cumulative gain), and 2 (AUC, area under the curve; suitable only when the label value is 0 or 1). | 0
    treeCount | No | The number of trees. Valid values: 1 to 10000. | 500
    shrinkage | No | The learning rate. Valid values: (0,1). | 0.05
    maxLeafCount | No | The maximum number of leaf nodes on each tree. Valid values: 1 to 1000. | 32
    maxDepth | No | The maximum depth of each tree. Valid values: 1 to 100. | 10
    minLeafSampleCount | No | The minimum number of samples on each leaf node. Valid values: 1 to 1000. | 500
    sampleRatio | No | The proportion of samples that are selected for training. Valid values: (0,1). | 0.6
    featureRatio | No | The proportion of features that are selected for training. Valid values: (0,1). | 0.6
    testRatio | No | The proportion of samples that are reserved for testing. Valid values: [0,1). | 0
    tau | No | The tau parameter for the GBRank loss function. Valid values: [0,1]. | 0.6
    p | No | The exponent base for the GBRank and regression loss functions. Valid values: [1,10]. | 1
    randSeed | No | The random seed. Valid values: [0,10]. | 0
    newtonStep | No | Specifies whether to use Newton's method. Valid values: 0 and 1. | 1
    featureSplitValueMaxSize | No | The maximum number of splits of each feature. Valid values: 1 to 1000. | 500
    lifecycle | No | The lifecycle of the output table. | N/A
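The NDCG and DCG metric types above are standard ranking metrics: DCG rewards placing highly relevant items near the top of a ranked list, and NDCG normalizes DCG by the best achievable ordering so that a perfect ranking scores 1. As a reference for what they measure, here is a small Python sketch using a common formulation; the component's exact formula is not documented here and may differ (for example, in the gain or discount definition).

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance grades:
    gain (2^rel - 1) discounted by log2 of the 1-based rank plus one."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalized DCG: divide by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1]))  # already ideally ordered -> 1.0
print(ndcg([0, 0, 1]))  # the only relevant item ranked last -> 0.5
```

AUC, by contrast, is a binary classification metric, which is why the reference states it applies only when the label is 0 or 1.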

Example

  1. Execute the following SQL statements to generate test data:
    drop table if exists gbdt_ls_test_input;
    create table gbdt_ls_test_input
    as
    select
        *
    from
    (
        select
            cast(1 as double) as f0,
            cast(0 as double) as f1,
            cast(0 as double) as f2,
            cast(0 as double) as f3,
            cast(0 as bigint) as label
        from dual
        union all
            select
                cast(0 as double) as f0,
                cast(1 as double) as f1,
                cast(0 as double) as f2,
                cast(0 as double) as f3,
                cast(0 as bigint) as label
        from dual
        union all
            select
                cast(0 as double) as f0,
                cast(0 as double) as f1,
                cast(1 as double) as f2,
                cast(0 as double) as f3,
                cast(1 as bigint) as label
        from dual
        union all
            select
                cast(0 as double) as f0,
                cast(0 as double) as f1,
                cast(0 as double) as f2,
                cast(1 as double) as f3,
                cast(1 as bigint) as label
        from dual
        union all
            select
                cast(1 as double) as f0,
                cast(0 as double) as f1,
                cast(0 as double) as f2,
                cast(0 as double) as f3,
                cast(0 as bigint) as label
        from dual
        union all
            select
                cast(0 as double) as f0,
                cast(1 as double) as f1,
                cast(0 as double) as f2,
                cast(0 as double) as f3,
                cast(0 as bigint) as label
        from dual
    ) a;
    The following test data table gbdt_ls_test_input is generated.
    f0 f1 f2 f3 label
    1.0 0.0 0.0 0.0 0
    0.0 0.0 1.0 0.0 1
    0.0 0.0 0.0 1.0 1
    0.0 1.0 0.0 0.0 0
    1.0 0.0 0.0 0.0 0
    0.0 1.0 0.0 0.0 0
  2. Run the following PAI command to submit the training parameters configured for the GBDT Regression component:
    drop offlinemodel if exists gbdt_ls_test_model;
    PAI -name gbdt
        -project algo_public
        -DfeatureSplitValueMaxSize="500"
        -DlossType="3"
        -DrandSeed="0"
        -DnewtonStep="1"
        -Dshrinkage="0.5"
        -DmaxLeafCount="32"
        -DlabelColName="label"
        -DinputTableName="gbdt_ls_test_input"
        -DminLeafSampleCount="1"
        -DsampleRatio="1"
        -DmaxDepth="10"
        -DmetricType="0"
        -DmodelName="gbdt_ls_test_model"
        -DfeatureRatio="1"
        -Dp="1"
        -Dtau="0.6"
        -DtestRatio="0"
        -DfeatureColNames="f0,f1,f2,f3"
        -DtreeCount="10"
  3. Run the following PAI command to submit the parameters configured for the Prediction component:
    drop table if exists gbdt_ls_test_prediction_result;
    PAI -name prediction
        -project algo_public
        -DdetailColName="prediction_detail"
        -DmodelName="gbdt_ls_test_model"
        -DitemDelimiter=","
        -DresultColName="prediction_result"
        -Dlifecycle="28"
        -DoutputTableName="gbdt_ls_test_prediction_result"
        -DscoreColName="prediction_score"
        -DkvDelimiter=":"
        -DinputTableName="gbdt_ls_test_input"
        -DenableSparse="false"
        -DappendColNames="label"
  4. View the prediction result table gbdt_ls_test_prediction_result.
    label prediction_result prediction_score prediction_detail
    0 NULL 0.0 {"label": 0}
    0 NULL 0.0 {"label": 0}
    1 NULL 0.9990234375 {"label": 0.9990234375}
    1 NULL 0.9990234375 {"label": 0.9990234375}
    0 NULL 0.0 {"label": 0}
    0 NULL 0.0 {"label": 0}
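The prediction_score values can be checked against the shrinkage arithmetic. On this perfectly separable one-hot data, each tree can fit the remaining residual exactly, so starting from an initial prediction of 0, a label-1 row's score after k trees is 1 - (1 - shrinkage)^k; label-0 rows stay at 0. The following sketch reproduces the score under that exact-fit assumption (which the output above confirms for this data):

```python
# Reproduce prediction_score for a label-1 row using only the training
# parameters from step 2: treeCount=10, shrinkage=0.5, lossType=3 (least squares).
pred = 0.0                     # initial prediction
for _ in range(10):            # treeCount = 10
    residual = 1.0 - pred      # label = 1; assume the tree fits it exactly
    pred += 0.5 * residual     # shrinkage = 0.5
print(pred)                    # 0.9990234375, matching the table above
```

Equivalently, 1 - 0.5**10 = 1 - 1/1024 = 0.9990234375, which shows why the scores approach but never exactly reach the label values.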