The light gradient boosting machine (LightGBM) is a distributed gradient boosting framework built on decision trees. In PolarDB for MySQL, you run LightGBM models directly through SQL using the CREATE MODEL, EVALUATE, and PREDICT statements — no external ML environment required.
In PolarDB, LightGBM reduces the memory footprint of training data, lowers communication costs across nodes, and exploits multi-node elastic parallel query (ePQ) to scale data computation near-linearly.
Use cases
LightGBM covers three model families — gradient boosted decision trees (GBDT), random forests, and logistic regression — and is well-suited for:
- Binary classification: predict a binary outcome, such as whether a user will click or purchase
- Multiclass classification: assign items to one of several categories, such as product types
- Ranking: order results by predicted relevance
Example: click prediction in a recommendation system
A personalized product recommendation system needs to predict whether a user will click or buy an item, based on past behavior (clicks, non-clicks, and purchases). The input features typically include:
- Categorical features: string values such as gender (`male`, `female`) or product category (`clothing`, `toys`, `electronics`)
- Numerical features: integers or floating-point values such as user activity score or product price
Configure a LightGBM model
Work through these decisions in order before setting parameters:
1. Choose the learning objective (`loss`): Select `binary` for two-class problems, `multiclass` for multi-class problems, or a regression variant for continuous outputs.
2. Choose the weak learner (`boosting_type`): Start with `gbdt` (the default). Switch to `dart` if you need to prevent overfitting, `goss` for faster training, or `gblinear` for a linear model.
3. Set tree complexity (`num_leaves`, `max_depth`): Keep `num_leaves` below `2^max_depth` to avoid overfitting.
4. Tune regularization (`learning_rate`, `subsample`, `min_samples_leaf`): A lower `learning_rate` produces a more stable model but requires more iterations (`n_estimators`). Setting `subsample` below 1 uses only a proportion of samples for model creation.
5. Enable AutoML if unsure (`automl`): Set `automl=True` to let PolarDB search for the best parameter combination automatically.
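Applied to the airlines dataset used later in this topic, these decisions might translate into a statement like the following. This is a sketch: the table and column names come from the Examples section, and the specific parameter values are illustrative rather than recommendations.

```sql
/*polar4ai*/CREATE MODEL airline_gbm_tuned WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 model_parameter = (
   loss = 'binary',         -- two-class target: delayed or not
   boosting_type = 'gbdt',  -- the default weak learner
   max_depth = 7,
   num_leaves = 100,        -- kept below 2^7 = 128 to limit overfitting
   learning_rate = 0.03,    -- lower step size for stability...
   n_estimators = 300,      -- ...compensated by more iterations
   subsample = 0.8))        -- each iteration trains on 80% of samples
AS (SELECT * FROM db4ai.airlines);
```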
Parameters
The following parameters map to model_parameter in the CREATE MODEL statement.
For string-valued parameters such as boosting_type, enclose the value in single quotation marks. Example: boosting_type='gbdt'.
| Parameter | Type | Default | Description |
|---|---|---|---|
| boosting_type | string | gbdt | The type of weak learner. gbdt: gradient boosted decision tree (recommended default). rf: random forest. dart: uses dropout to reduce overfitting. goss: gradient-based one-side sampling; fast, but may cause underfitting. gblinear: linear model. |
| n_estimators | integer | 100 | The number of boosting iterations. |
| loss | string | binary | The learning objective. binary: binary classification. multiclass: multiclass classification. regression: regression with L2 loss. regression_l1: regression with L1 loss. |
| num_leaves | integer | 128 | The maximum number of leaves per tree. Controls model complexity. |
| max_depth | integer | 7 | The maximum tree depth. Set to -1 to remove the depth limit. A deeper tree can capture more patterns but risks overfitting. |
| learning_rate | float | 0.06 | The step size at each iteration. |
| max_leaf_nodes | integer or blank | blank | The maximum number of leaf nodes. Blank means no limit. |
| min_samples_leaf | integer | 20 | The minimum number of samples required at a leaf node. Leaf nodes (and their siblings) are pruned if the sample count falls below this value. |
| subsample | float | 1 | The fraction of training samples used per iteration. Valid range: 0–1. Values below 1 use only the specified proportion of samples for model creation. |
| max_features | float | 1 | The fraction of features considered when splitting a node. Valid range: 0–1. |
| random_state | integer | 1 | The random number seed. Changing this value affects tree construction and data splitting, which can produce different results. |
| model_type | string | pkl | The storage format for the trained model. pkl: PKL file. pmml: PMML (Predictive Model Markup Language) file; includes the full tree structure, which is useful for inspection. |
| n_jobs | integer | 4 | The number of threads used for training. More threads reduce training time. |
| is_unbalance | boolean | False | Whether to upweight the minority class to address class imbalance. Set to True when one class has significantly fewer samples than the other. |
| categorical_feature | string array | — | The names of categorical feature columns, as a comma-separated string. LightGBM detects categorical features automatically in most cases. Override this parameter when auto-detection is insufficient. Example: categorical_feature='AirportTo,DayOfWeek'. |
| automl | boolean | False | Whether to enable automatic parameter tuning. When set to True, early stopping is applied when the metric specified by loss stops improving. |
| automl_train_tag | string | — | The label value that identifies training rows in the automl_column. |
| automl_test_tag | string | — | The label value that identifies test rows in the automl_column. The training set must be 4 to 9 times the size of the test set. |
| automl_column | string | — | The column name used to split rows into training and test sets for AutoML. When set, prefix parameters with automl_ to define the search space. For example: automl_learning_rate='0.05,0.04,0.03,0.01' searches across four values. Requires automl_train_tag and automl_test_tag. |
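Putting the AutoML parameters together, a CREATE MODEL statement that searches a small learning-rate grid might look like the following. This is a sketch: it assumes the db4ai.airlines table has an additional column named sample_tag (hypothetical, not part of the dataset shown in the Examples section) whose values 'train' and 'test' split the rows roughly 8:2, satisfying the required 4–9:1 ratio.

```sql
/*polar4ai*/CREATE MODEL airline_gbm_automl WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 model_parameter = (
   automl = True,
   automl_column = 'sample_tag',    -- hypothetical split column
   automl_train_tag = 'train',      -- rows used for training
   automl_test_tag = 'test',        -- rows used for evaluation
   automl_learning_rate = '0.05,0.04,0.03,0.01'))  -- search space
AS (SELECT * FROM db4ai.airlines);
```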
Examples
The following examples use the db4ai.airlines dataset. All SQL statements use the /*polar4ai*/ hint to route queries to the PolarDB AI engine.
Create a LightGBM model
/*polar4ai*/CREATE MODEL airline_gbm WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols='Delay',model_parameter=(boosting_type='gbdt'))
AS (SELECT * FROM db4ai.airlines);

Evaluate the model
/*polar4ai*/SELECT Airline FROM EVALUATE(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',y_cols='Delay',metrics='acc');

Run predictions
/*polar4ai*/SELECT Airline FROM PREDICT(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length');
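As a variant, if automatic detection of categorical columns proved insufficient and the Delay classes were heavily skewed, categorical_feature and is_unbalance could be set explicitly at model creation. This is a sketch built on the same dataset; which columns to mark as categorical and whether upweighting helps depend on the actual data.

```sql
/*polar4ai*/CREATE MODEL airline_gbm_cat WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 model_parameter = (boosting_type = 'gbdt',
   -- treat these string columns as categorical features
   categorical_feature = 'Airline,AirportFrom,AirportTo,DayOfWeek',
   -- upweight the minority class if Delay labels are imbalanced
   is_unbalance = True))
AS (SELECT * FROM db4ai.airlines);
```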