
PolarDB:LightGBM algorithm

Last Updated: Mar 26, 2024

This topic describes the light gradient boosting machine (LightGBM) algorithm.

Background information

LightGBM is a distributed gradient boosting framework based on decision tree algorithms. It is designed to be fast, efficient, memory-light, and accurate, and it supports parallel and large-scale data processing. LightGBM reduces the amount of memory occupied by data, lowers communication costs, improves the efficiency of multi-node elastic parallel query (ePQ), and achieves near-linear speedup in data computing.

Scenarios

LightGBM is an algorithm framework that includes gradient-boosted decision tree (GBDT), random forest, and logistic regression models. It is primarily used in scenarios such as binary classification, multiclass classification, and ranking.

For example, most personalized commodity recommendation scenarios require click estimation models. Historical user behaviors, such as clicks, non-clicks, and purchases, can be used as training data to predict the probability that a user clicks or purchases a commodity. The following types of features are extracted from user behaviors and user attributes (see the table sketch after this list):

  • Categorical features: string values, such as gender (male or female) or commodity category (clothing, toys, or electronics).

  • Numerical features: integer or floating-point values, such as user activity or commodity prices.
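
For reference, the following sketch shows what a training table for such a scenario might look like. The table and column names are hypothetical and only illustrate how categorical (string) and numerical columns coexist in one training set:

CREATE TABLE user_behavior (
    gender    VARCHAR(8),   -- categorical feature: 'male' or 'female'
    category  VARCHAR(32),  -- categorical feature: 'clothing', 'toys', or 'electronics'
    activity  FLOAT,        -- numerical feature: user activity score
    price     FLOAT,        -- numerical feature: commodity price
    purchased INT           -- label: 1 if the user purchased the commodity, else 0
);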

Parameters

The parameters described below are specified in the model_parameter parameter of the CREATE MODEL statement that is used to create a model. You can configure them based on your business requirements.
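
For example, the following minimal sketch passes several of these parameters at once. It assumes that the parameters are comma-separated inside model_parameter, as in the automatic tuning example later in this topic, and that numeric values can be written without quotation marks; the model name is hypothetical:

/*polar4ai*/
CREATE MODEL my_gbm WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay',
model_parameter = (boosting_type='gbdt', n_estimators=200, learning_rate=0.05))
AS (SELECT * FROM db4ai.airlines)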

boosting_type

The type of the weak learner. Valid values:

  • gbdt (default): uses the gradient-boosted decision tree model.

  • gblinear: uses the linear model.

  • rf: uses the random forest model.

  • dart: uses the dropout technique, which drops some trees during training to prevent overfitting.

  • goss: uses the gradient-based one-side sampling (GOSS) algorithm. This type is fast but may cause underfitting.

Note

When you specify a value for this parameter, enclose the value in single quotation marks ('). Example: boosting_type='gbdt'

n_estimators

The number of iterations. The value must be an integer. Default value: 100.

loss

The learning task and its objective. Valid values (a usage sketch follows this list):

  • binary (default): binary classification.

  • regression: uses the regression model with the L2 loss function.

  • regression_l1: uses the regression model with the L1 loss function.

  • multiclass: multiclass classification.
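
For example, the following minimal sketch selects the multiclass objective. It reuses the airlines table from the Examples section purely to illustrate the syntax, and the model name is hypothetical:

/*polar4ai*/
CREATE MODEL my_multiclass_gbm WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay', model_parameter = (loss='multiclass'))
AS (SELECT * FROM db4ai.airlines)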

num_leaves

The number of leaves. The value must be an integer. Default value: 128.

max_depth

The maximum depth of the tree. The value must be an integer. Default value: 7.

Note

If this parameter is set to -1, the depth of the tree is not limited. A greater value provides higher precision but may cause overfitting. To prevent overfitting, we recommend that you specify an appropriate value for this parameter.

learning_rate

The learning rate. The value must be a floating-point number. Default value: 0.06.

max_leaf_nodes

The maximum number of leaf nodes in the tree. The value can be an integer or left empty. By default, this parameter is left empty, which indicates that the number of leaf nodes in the tree is not limited.

min_samples_leaf

The minimum number of samples required at a leaf node. The value must be an integer. Default value: 20.

Note

If a leaf node contains fewer samples than this value, the node is pruned together with its sibling nodes.

subsample

The ratio of training samples to all samples. The value must be a floating-point number. Valid values: 0 to 1. Default value: 1.

Note

If the value of this parameter is less than 1, only that proportion of the samples is used for training.

max_features

The ratio of training features to all features. The value must be a floating-point number. Valid values: 0 to 1. Default value: 1.

random_state

The random number seed. The value must be an integer. Default value: 1.

Note

Different seed values affect how the tree is constructed and how the training data is split.

model_type

The storage type of the model. Valid values:

  • pkl (default): PKL file.

  • pmml: PMML file. This type can display tree-related information such as the structure of the tree.

n_jobs

The number of threads used for training. The value must be an integer. Default value: 4.

Note

A larger number of threads can increase the training speed.

is_unbalance

Specifies whether to increase the weight of the category with a small number of samples to address sample imbalance. Valid values:

  • False (default): does not increase the weight of the category with a small number of samples.

  • True: increases the weight of the category with a small number of samples.

categorical_feature

The categorical features. The value must be a string array. In most cases, LightGBM determines the data type and automatically configures the categorical_feature parameter. You can also configure this parameter manually.

For example, if the categorical_feature parameter is set to 'AirportTo,DayOfWeek', the value indicates two categorical features: AirportTo and DayOfWeek.
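
The following minimal sketch manually specifies categorical features when a model is created. It assumes that categorical_feature is combined with other options inside model_parameter in the same comma-separated form as the other examples in this topic; the model name is hypothetical:

/*polar4ai*/
CREATE MODEL airline_gbm_cat WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay',
model_parameter = (boosting_type='gbdt', categorical_feature='AirportTo,DayOfWeek'))
AS (SELECT * FROM db4ai.airlines)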

automl

Specifies whether to enable the automatic parameter tuning feature. Valid values:

  • False (default): The automatic parameter tuning feature is disabled.

  • True: The automatic parameter tuning feature is enabled. By default, after this feature is enabled, the early stopping technique is used to stop the iteration when the value of the objective specified by the loss parameter no longer changes.

automl_train_tag

The tag that identifies the training set.

automl_test_tag

The tag that identifies the test set.

automl_column

The name of the column that the automatic parameter tuning feature uses to distinguish the training set from the test set. If you configure this parameter, you must also configure the automl_train_tag and automl_test_tag parameters. The amount of data tagged with automl_train_tag must be 4 to 9 times the amount of data tagged with automl_test_tag.

Note

After you configure the automl_column parameter, automatic search for the optimal parameter combination is enabled. In this case, you can add the automl_ prefix to parameters such as learning_rate and subsample to provide candidate values from which the optimal value is searched. Example:

automl_column='automl_column', automl_train_tag='train', automl_test_tag='test', automl_learning_rate='0.05,0.04,0.03,0.01', automl_subsample='0.6,0.5'

In this example, the optimal combination is searched from the learning rates 0.05, 0.04, 0.03, and 0.01 and the subsample ratios 0.6 and 0.5.
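
Put together, a sketch of a complete statement with automatic parameter tuning enabled might look like the following. It reuses the airlines table from the Examples section and assumes that the data contains a tag column named tag_col whose values are train or test, and that automl accepts a quoted value; the column and model names are hypothetical:

/*polar4ai*/
CREATE MODEL airline_gbm_automl WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay',
model_parameter = (automl='True', automl_column='tag_col',
automl_train_tag='train', automl_test_tag='test',
automl_learning_rate='0.05,0.04,0.03,0.01', automl_subsample='0.6,0.5'))
AS (SELECT * FROM db4ai.airlines)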

Examples

Create a model and an offline training task.

/*polar4ai*/
CREATE MODEL airline_gbm WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay', model_parameter = (boosting_type='gbdt'))
AS (SELECT * FROM db4ai.airlines)

Use the model for prediction.

/*polar4ai*/
SELECT Airline FROM PREDICT(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols='Delay')