Platform For AI: GBDT Binary Classification V2

Last Updated: Apr 23, 2024

Gradient Boosting Decision Trees (GBDT) Binary Classification is a classic supervised learning model based on gradient boosting. It is suitable for binary classification scenarios.

Supported computing resources

GBDT Binary Classification V2 runs only on MaxCompute computing resources.

Algorithm

GBDT Binary Classification is a classic supervised learning model based on gradient boosting. It can be used in binary classification scenarios.

Principle

The gradient boosting decision tree model consists of multiple decision trees. Each decision tree corresponds to a weak learner. Combining these weak learners together can achieve better classification and regression results.

The basic recursive structure of gradient boosting is:

$$F_m(x) = F_{m-1}(x) + \rho_m \, h(x; \theta_m)$$

In most cases, $h(x; \theta_m)$ is a CART decision tree, $\theta_m$ are the parameters of the decision tree, and $\rho_m$ is the step size. Each decision tree optimizes the objective function on the basis of the previous decision tree. After the preceding process, a model that contains multiple decision trees is obtained.
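The recursive update described above, in which each new tree is added to the previous model scaled by a step size, can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: it uses one-split trees (stumps) on a single numerical feature, fits each stump to the first-order gradient of the logistic loss, and uses a fixed step size. The function names (`fit_stump`, `fit_gbdt`, `predict_proba`) are hypothetical, and the sketch does not reproduce the second-order or histogram optimizations that the component uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, targets):
    """Fit a one-split regression tree (a stump) to the targets by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbdt(xs, ys, num_trees=20, step_size=0.5):
    """Toy gradient boosting for labels in {0, 1}.

    Assumes both classes are present and xs has at least two distinct values.
    """
    positive_rate = sum(ys) / len(ys)
    initial_score = math.log(positive_rate / (1.0 - positive_rate))
    scores = [initial_score] * len(xs)  # current model output for each sample
    trees = []
    for _ in range(num_trees):
        # Negative gradient of the logistic loss with respect to the score.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Recursive update: new model = previous model + step size * new tree.
        scores = [s + step_size * tree(x) for s, x in zip(scores, xs)]
    return initial_score, step_size, trees


def predict_proba(model, x):
    initial_score, step_size, trees = model
    return sigmoid(initial_score + step_size * sum(tree(x) for tree in trees))


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_gbdt(xs, ys)
print(predict_proba(model, 2.0), predict_proba(model, 7.0))
```

Each iteration only sees the errors left by the trees before it, which is why the final model is a sum of many weak learners rather than one large tree.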

Scenario

The algorithm incorporates optimizations such as the second-order optimization introduced by XGBoost and the histogram approximation introduced by LightGBM. It delivers high performance and interpretability, and is suitable for common binary classification tasks.

GBDT Binary Classification V2 supports input in the sparse vector format and the multiple feature column format. If you use the sparse vector format, you can select only one column of the string type. Each data entry consists of key-value pairs that are separated by spaces, and the key and value in each pair are separated by a colon. Example: 1:0.3 3:0.9. If you use the multiple feature column format, you can select multiple columns of the double, bigint, or string type, including numerical features and categorical features. The algorithm bins the data of numerical features and uses a many-vs-many splitting strategy to process categorical features. You do not need to perform one-hot encoding for categorical features.
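To make the sparse vector format concrete, the following sketch parses an entry such as `1:0.3 3:0.9` into a mapping from feature ID to value and expands it into a dense list. The helper names are hypothetical, and the sketch assumes that keys are non-negative integer feature IDs and values are numeric. It illustrates the format only and is not the component's own parser.

```python
def parse_sparse_vector(entry):
    """Parse a string such as "1:0.3 3:0.9" into {feature_id: value}."""
    features = {}
    for pair in entry.split():  # key-value pairs are separated by spaces
        key, _, value = pair.partition(":")  # key and value are separated by a colon
        features[int(key)] = float(value)
    return features


def to_dense(features, num_features):
    """Expand a sparse mapping into a dense list; missing features become 0.0."""
    dense = [0.0] * num_features
    for feature_id, value in features.items():
        dense[feature_id] = value
    return dense


row = parse_sparse_vector("1:0.3 3:0.9")
print(row)               # {1: 0.3, 3: 0.9}
print(to_dense(row, 4))  # [0.0, 0.3, 0.0, 0.9]
```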

The algorithm requires the labels of the two classes to be 0 and 1.
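If the label column uses other values, such as strings, it has to be remapped to 0 and 1 before training. The following is a minimal preprocessing sketch, assuming the labels are available as a Python list. The function name and the choice of which value counts as positive are illustrative assumptions and are not part of the component.

```python
def encode_binary_labels(labels, positive_label):
    """Map a two-class label list to 0/1, with positive_label mapped to 1."""
    distinct = set(labels)
    if len(distinct) != 2 or positive_label not in distinct:
        raise ValueError(f"expected exactly two classes including {positive_label!r}, got {distinct}")
    return [1 if label == positive_label else 0 for label in labels]


print(encode_binary_labels(["yes", "no", "yes"], positive_label="yes"))  # [1, 0, 1]
```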

This component is used together with the GBDT Binary Classification Prediction V2 component. After you run both components, you can deploy the trained model as an online service. For more information, see Deploy a pipeline as an online service.

Configure the component in Machine Learning Designer

Input ports

| Port (from left to right) | Data type | Recommended upstream component | Required |
| --- | --- | --- | --- |
| Input Data | N/A | Read Table | Yes |

Component parameters

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Use Sparse Vector Format | No | Specifies whether the feature columns used for training in the input table are in the sparse vector format. In the sparse vector format, each data entry consists of key-value pairs separated by spaces, and the key and value in each pair are separated by a colon. Example: 1:0.3 3:0.9. | No |
| Fields Setting | Select Feature Columns | Yes | The names of the feature columns that are selected from the input table for training. If the sparse vector format is not selected, you can select columns of the double, bigint, or string type. If the sparse vector format is selected, you can select only one column of the string type. | None |
| Fields Setting | Select Categorical Feature Columns | No | The columns to be processed as categorical features. Columns that are not selected are processed as numerical features. This parameter takes effect only when the sparse vector format is not selected. | None |
| Fields Setting | Select Label Column | Yes | The name of the label column that is selected from the input table for training. | None |
| Fields Setting | Select Weight Column | No | The name of the weight column that is selected from the input table for training. | None |
| Parameter Setting | Number of Trees | No | The number of trees in the model. | 1 |
| Parameter Setting | Maximum Number of Leaf Nodes | No | The maximum number of leaf nodes on each tree. | 32 |
| Parameter Setting | Learning Rate | No | The learning rate. | 0.05 |
| Parameter Setting | Ratio of Samples | No | The proportion of samples that are selected for training. Valid values: (0,1]. | 0.6 |
| Parameter Setting | Ratio of Features | No | The proportion of features that are selected for training. Valid values: (0,1]. | 0.6 |
| Parameter Setting | Minimum Number of Samples in a Leaf Node | No | The minimum number of samples on each leaf node. | 500 |
| Parameter Setting | Maximum Number of Bins | No | The maximum number of bins allowed when continuous features are discretized. A larger value allows more precise splits but incurs higher costs. | 32 |
| Parameter Setting | Maximum Number of Distinct Categories | No | The maximum number of distinct categories allowed for each categorical feature. For each categorical feature, the categories are sorted by frequency, and the categories whose rank exceeds this value are combined into one bucket. A larger value allows more precise splits but increases the risk of overfitting and incurs higher costs. | 1024 |
| Parameter Setting | Number of features | No | This parameter takes effect only when the sparse vector format is selected. Set it to the maximum feature ID + 1. If this parameter is left empty, the system scans the data and calculates the value automatically. | Automatically calculated based on the input data. |
| Parameter Setting | Initial Prediction | No | The probability of positive samples. If this parameter is left empty, the system scans the data and estimates the value automatically. | Automatically calculated based on the input data. |
| Parameter Setting | Random Seed | No | The random seed, which is used for sampling. | 0 |
| Tuning | Choose Running Mode | No | The running mode. Valid values: MaxCompute and Flink. | MaxCompute |
| Tuning | Number of Instances | No | The number of instances that are used to run the job. | Automatically calculated based on the input data. |
| Tuning | Memory Per Instance | No | The memory size of each instance. Unit: MB. | Automatically calculated based on the input data. |
| Tuning | Num of Threads | No | The number of threads. Multi-threading incurs higher costs. In most cases, performance does not increase linearly with the number of threads, and it decreases if more threads than the optimum are used. | 1 |
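Several of the parameters above are derived from the data when left empty or are described in terms of simple data statistics. The sketch below shows one plausible reading of three of them: Number of features as the maximum feature ID + 1, Initial Prediction as the fraction of positive samples, and Maximum Number of Distinct Categories as keeping the most frequent categories and merging the rest into one bucket. The function names, the `__OTHER__` bucket name, and the tie-breaking order are assumptions made for illustration. The component's internal implementation is not documented here and may differ.

```python
from collections import Counter


def number_of_features(sparse_rows):
    """Maximum feature ID + 1 over rows given as {feature_id: value} dicts."""
    return max(max(row) for row in sparse_rows if row) + 1


def initial_prediction(labels):
    """Fraction of positive samples, for labels in {0, 1}."""
    return sum(labels) / len(labels)


def bucket_rare_categories(values, max_categories, other="__OTHER__"):
    """Keep the max_categories most frequent categories; merge the rest into one bucket."""
    counts = Counter(values)
    kept = {category for category, _ in counts.most_common(max_categories)}
    return [value if value in kept else other for value in values]


rows = [{1: 0.3, 3: 0.9}, {0: 1.0, 7: 0.5}]
print(number_of_features(rows))          # 8
print(initial_prediction([1, 0, 0, 1]))  # 0.5
print(bucket_rare_categories(["a", "a", "a", "b", "b", "c", "d"], max_categories=2))
```

In the last call, `a` and `b` are kept because they are the two most frequent values, and `c` and `d` fall into the shared bucket.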

Output ports

| Output port | Storage location | Recommended downstream component | Data type |
| --- | --- | --- | --- |
| Output Model | N/A | GBDT Binary Classification Prediction V2 | MaxCompute table |
| Output Feature Importance | N/A | N/A | MaxCompute table |

Comparison with PS-SMART Binary Classification Training

If you encounter issues that are hard to resolve when you use the PS-SMART Binary Classification Training component, you can use the GBDT Binary Classification V2 component instead. For more information about that component, see PS-SMART Binary Classification Training. The following table maps the parameters of PS-SMART Binary Classification Training to the parameters of GBDT Binary Classification V2.

| Parameter of PS-SMART Binary Classification Training | Corresponding parameter of GBDT Binary Classification V2 |
| --- | --- |
| Use Sparse Format | Use Sparse Vector Format |
| Feature Columns | Feature Columns |
| Label Column | Label Column |
| Weight Column | Select Weight Column |
| Evaluation Indicator Type | Not supported. Area under curve (AUC) is used by default. You can view the metrics in the worker log. |
| Trees | Number of Trees |
| Maximum Tree Depth | Maximum Number of Leaf Nodes: Maximum Number of Leaf Nodes = 2 ^ (Maximum Tree Depth - 1). |
| Data Sampling Fraction | Ratio of Samples |
| Feature Sampling Fraction | Ratio of Features |
| L1 Penalty Coefficient | Not supported |
| L2 Penalty Coefficient | Not supported |
| Learning Rate | Learning Rate |
| Sketch-based Approximate Precision | Maximum Number of Bins: Maximum Number of Bins = 1/Sketch-based Approximate Precision. |
| Minimum Split Loss Change | Minimum Number of Samples in a Leaf Node: cannot be converted from Minimum Split Loss Change directly, but both parameters can be used to prevent overfitting. |
| Features | Features |
| Global Offset | Global Offset |
| Random Seed | Random Seed |
| Feature Importance Type | N/A. Default value: gain. |
| Cores | Number of Instances: not the same value as Cores. We recommend that you adjust the number based on the value automatically generated by the system. |
| Memory Size per Core | Memory per Instance: not the same value as Memory Size per Core. We recommend that you adjust the number based on the value automatically generated by the system. |
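The two conversion formulas in the comparison above can be applied mechanically when migrating a PS-SMART configuration. The following sketch encodes only those two formulas. The function names and dictionary keys are hypothetical, and rounding the bin count to the nearest integer is an assumption, because the source gives only the ratio.

```python
def leaf_nodes_from_depth(max_tree_depth):
    """Maximum Number of Leaf Nodes = 2 ^ (Maximum Tree Depth - 1)."""
    return 2 ** (max_tree_depth - 1)


def bins_from_sketch_precision(sketch_precision):
    """Maximum Number of Bins = 1 / Sketch-based Approximate Precision (rounded)."""
    return round(1.0 / sketch_precision)


def convert_ps_smart_params(ps_smart):
    """Translate the two PS-SMART settings that have a direct formula."""
    return {
        "Maximum Number of Leaf Nodes": leaf_nodes_from_depth(ps_smart["Maximum Tree Depth"]),
        "Maximum Number of Bins": bins_from_sketch_precision(
            ps_smart["Sketch-based Approximate Precision"]
        ),
    }


print(convert_ps_smart_params({
    "Maximum Tree Depth": 6,
    "Sketch-based Approximate Precision": 0.03125,
}))
# {'Maximum Number of Leaf Nodes': 32, 'Maximum Number of Bins': 32}
```

Parameters without a direct formula, such as Minimum Split Loss Change, Cores, and Memory Size per Core, are left out on purpose, because the comparison states that they cannot be converted one to one.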