The light gradient boosting machine (LightGBM) is a distributed gradient boosting framework built on decision trees. In PolarDB for MySQL, you run LightGBM models directly through SQL using the CREATE MODEL, EVALUATE, and PREDICT statements — no external ML environment required.
In PolarDB, LightGBM reduces the memory footprint of training data, lowers communication costs across nodes, and exploits multi-node elastic parallel query (ePQ) to scale data computation near-linearly.
Use cases
LightGBM covers three model families — gradient boosted decision trees (GBDT), random forests, and logistic regression — and is well-suited for:
- Binary classification: predict a binary outcome, such as whether a user will click or purchase
- Multiclass classification: assign items to one of several categories, such as product types
- Ranking: order results by predicted relevance
Example: click prediction in a recommendation system
A personalized product recommendation system needs to predict whether a user will click or buy an item, based on past behavior (clicks, non-clicks, and purchases). The input features typically include:
- Categorical features: string values such as gender (`male`, `female`) or product category (`clothing`, `toys`, `electronics`)
- Numerical features: integers or floating-point values such as user activity score or product price
Configure a LightGBM model
Work through these decisions in order before setting parameters:
1. Choose the learning objective (`loss`): Select `binary` for two-class problems, `multiclass` for multi-class problems, or a regression variant for continuous outputs.
2. Choose the weak learner (`boosting_type`): Start with `gbdt` (the default). Switch to `dart` if you need to prevent overfitting, `goss` for faster training, or `gblinear` for a linear model.
3. Set tree complexity (`num_leaves`, `max_depth`): Keep `num_leaves` below `2^max_depth` to avoid overfitting.
4. Tune regularization (`learning_rate`, `subsample`, `min_samples_leaf`): A lower `learning_rate` produces a more stable model but requires more iterations (`n_estimators`). Setting `subsample` below 1 uses only a proportion of samples for model creation.
5. Enable AutoML if unsure (`automl`): Set `automl=True` to let PolarDB search for the best parameter combination automatically.
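Applied to the airlines dataset used later in this topic, these decisions might translate into a statement like the following. This is a sketch: the table and column names come from the Examples section, and the specific parameter values are illustrative rather than recommendations.

```sql
/*polar4ai*/CREATE MODEL airline_gbm_tuned WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 model_parameter = (
   loss = 'binary',         -- two-class target: delayed or not
   boosting_type = 'gbdt',  -- the default weak learner
   max_depth = 7,
   num_leaves = 100,        -- kept below 2^7 = 128 to limit overfitting
   learning_rate = 0.03,    -- lower step size for stability...
   n_estimators = 300,      -- ...compensated by more iterations
   subsample = 0.8))        -- each iteration trains on 80% of samples
AS (SELECT * FROM db4ai.airlines);
```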
Parameters
The following parameters map to model_parameter in the CREATE MODEL statement.
For string-valued parameters such as boosting_type, enclose the value in single quotation marks. Example: boosting_type='gbdt'.
| Parameter | Type | Default | Description |
|---|---|---|---|
| boosting_type | string | gbdt | The type of weak learner. gbdt: gradient boosted decision tree (recommended default). rf: random forest. dart: uses dropout to reduce overfitting. goss: gradient-based one-side sampling; fast, but may cause underfitting. gblinear: linear model. |
| n_estimators | integer | 100 | The number of boosting iterations. |
| loss | string | binary | The learning objective. binary: binary classification. multiclass: multiclass classification. regression: regression with L2 loss. regression_l1: regression with L1 loss. |
| num_leaves | integer | 128 | The maximum number of leaves per tree. Controls model complexity. |
| max_depth | integer | 7 | The maximum tree depth. Set to -1 to remove the depth limit. A deeper tree can capture more patterns but risks overfitting. |
| learning_rate | float | 0.06 | The step size at each iteration. |
| max_leaf_nodes | integer or blank | blank | The maximum number of leaf nodes. Blank means no limit. |
| min_samples_leaf | integer | 20 | The minimum number of samples required at a leaf node. Leaf nodes (and their siblings) are pruned if the sample count falls below this value. |
| subsample | float | 1 | The fraction of training samples used per iteration. Valid range: 0–1. Values below 1 use only the specified proportion of samples for model creation. |
| max_features | float | 1 | The fraction of features considered when splitting a node. Valid range: 0–1. |
| random_state | integer | 1 | The random number seed. Changing this value affects tree construction and data splitting, which can produce different results. |
| model_type | string | pkl | The storage format for the trained model. pkl: PKL file. pmml: PMML (Predictive Model Markup Language) file; includes the full tree structure, which is useful for inspection. |
| n_jobs | integer | 4 | The number of threads used for training. More threads reduce training time. |
| is_unbalance | boolean | False | Whether to upweight the minority class to address class imbalance. Set to True when one class has significantly fewer samples than the other. |
| categorical_feature | string array | — | The names of categorical feature columns, as a comma-separated string. LightGBM detects categorical features automatically in most cases. Override this parameter when auto-detection is insufficient. Example: categorical_feature='AirportTo,DayOfWeek'. |
| automl | boolean | False | Whether to enable automatic parameter tuning. When set to True, early stopping is applied when the metric specified by loss stops improving. |
| automl_train_tag | string | — | The label value that identifies training rows in the automl_column. |
| automl_test_tag | string | — | The label value that identifies test rows in the automl_column. The training set must be 4 to 9 times the size of the test set. |
| automl_column | string | — | The column name used to split rows into training and test sets for AutoML. When set, prefix parameters with automl_ to define the search space. For example: automl_learning_rate='0.05,0.04,0.03,0.01' searches across four values. Requires automl_train_tag and automl_test_tag. |
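Putting the AutoML parameters together, a CREATE MODEL statement that searches a small learning-rate grid might look like the following. This is a sketch: it assumes the db4ai.airlines table has an additional column named sample_tag (hypothetical, not part of the dataset shown in the Examples section) whose values 'train' and 'test' split the rows roughly 8:2, satisfying the required 4–9:1 ratio.

```sql
/*polar4ai*/CREATE MODEL airline_gbm_automl WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 model_parameter = (
   automl = True,
   automl_column = 'sample_tag',    -- hypothetical split column
   automl_train_tag = 'train',      -- rows used for training
   automl_test_tag = 'test',        -- rows used for evaluation
   automl_learning_rate = '0.05,0.04,0.03,0.01'))  -- search space
AS (SELECT * FROM db4ai.airlines);
```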
Examples
The following examples use the db4ai.airlines dataset. All SQL statements use the /*polar4ai*/ hint to route queries to the PolarDB AI engine.
Create a LightGBM model
/*polar4ai*/CREATE MODEL airline_gbm WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols='Delay',model_parameter=(boosting_type='gbdt'))
AS (SELECT * FROM db4ai.airlines);

Evaluate the model
/*polar4ai*/SELECT Airline FROM EVALUATE(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',y_cols='Delay',metrics='acc');

Run predictions
/*polar4ai*/SELECT Airline FROM PREDICT(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length');
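As a variant, if automatic detection of categorical columns proved insufficient and the Delay classes were heavily skewed, categorical_feature and is_unbalance could be set explicitly at model creation. This is a sketch built on the same dataset; which columns to mark as categorical and whether upweighting helps depend on the actual data.

```sql
/*polar4ai*/CREATE MODEL airline_gbm_cat WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 model_parameter = (boosting_type = 'gbdt',
   -- treat these string columns as categorical features
   categorical_feature = 'Airline,AirportFrom,AirportTo,DayOfWeek',
   -- upweight the minority class if Delay labels are imbalanced
   is_unbalance = True))
AS (SELECT * FROM db4ai.airlines);
```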