
PolarDB:LightGBM algorithm

Last Updated: Mar 28, 2026

The light gradient boosting machine (LightGBM) is a distributed gradient boosting framework built on decision trees. In PolarDB for MySQL, you run LightGBM models directly through SQL using the CREATE MODEL, EVALUATE, and PREDICT statements — no external ML environment required.

LightGBM reduces the memory footprint of training data, lowers cross-node communication costs, and uses multi-node elastic parallel query (ePQ) to achieve near-linear speedup in computation.

Use cases

LightGBM covers three model families — gradient boosted decision trees (GBDT), random forests, and logistic regression — and is well-suited for:

  • Binary classification: predict a binary outcome, such as whether a user will click or purchase

  • Multiclass classification: assign items to one of several categories, such as product types

  • Ranking and sorting: order results by predicted relevance

Example: click prediction in a recommendation system

A personalized product recommendation system needs to predict whether a user will click or buy an item, based on past behavior (clicks, non-clicks, and purchases). The input features typically include:

  • Categorical features: string values such as gender (male, female) or product category (clothing, toys, electronics)

  • Numerical features: integers or floating-point values such as user activity score or product price
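As a sketch, the feature setup above can be expressed directly in a CREATE MODEL statement. The table demo.user_clicks, its column names, and the clicked label below are hypothetical placeholders for illustration; the categorical_feature parameter explicitly marks the string-valued columns:

```sql
/*polar4ai*/CREATE MODEL click_gbm WITH
(model_class = 'lightgbm',
x_cols = 'gender,category,activity_score,price',
y_cols = 'clicked',
model_parameter = (loss='binary', categorical_feature='gender,category'))
AS (SELECT * FROM demo.user_clicks);
```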

Configure a LightGBM model

Work through these decisions in order before setting parameters:

  1. Choose the learning objective (`loss`): Select binary for two-class problems, multiclass for multi-class problems, or a regression variant for continuous outputs.

  2. Choose the weak learner (`boosting_type`): Start with gbdt (the default). Switch to dart if you need to prevent overfitting, goss for faster training, or gblinear for a linear model.

  3. Set tree complexity (`num_leaves`, `max_depth`): Keep num_leaves below 2^max_depth to avoid overfitting. For example, with the default max_depth of 7, keep num_leaves under 2^7 = 128.

  4. Tune regularization (`learning_rate`, `subsample`, `min_samples_leaf`): A lower learning_rate produces a more stable model but requires more iterations (n_estimators). Setting subsample below 1 uses only a proportion of samples for model creation.

  5. Enable AutoML if unsure (`automl`): Set automl=True to let PolarDB search for the best parameter combination automatically.
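Taken together, the decisions above might produce a statement like the following sketch against the db4ai.airlines dataset used later in this topic. The model name and specific values are illustrative, not recommendations; note that num_leaves=64 stays below 2^7 = 128 for max_depth=7:

```sql
/*polar4ai*/CREATE MODEL airline_gbm_tuned WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay',
model_parameter = (loss='binary', boosting_type='gbdt',
num_leaves=64, max_depth=7,
learning_rate=0.03, n_estimators=200, subsample=0.8))
AS (SELECT * FROM db4ai.airlines);
```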

Parameters

The following parameters map to model_parameter in the CREATE MODEL statement.

For string-valued parameters such as boosting_type, enclose the value in single quotation marks. Example: boosting_type='gbdt'.

  • boosting_type (string; default: gbdt): The type of weak learner. gbdt: gradient boosted decision tree (recommended default). rf: random forest. dart: uses dropout to reduce overfitting. goss: gradient-based one-side sampling; fast, but may cause underfitting. gblinear: linear model.

  • n_estimators (integer; default: 100): The number of boosting iterations.

  • loss (string; default: binary): The learning objective. binary: binary classification. multiclass: multiclass classification. regression: L2-regularized regression. regression_l1: L1-regularized regression.

  • num_leaves (integer; default: 128): The maximum number of leaves per tree. Controls model complexity.

  • max_depth (integer; default: 7): The maximum tree depth. Set to -1 to remove the depth limit. A deeper tree can capture more patterns but risks overfitting.

  • learning_rate (float; default: 0.06): The step size at each iteration.

  • max_leaf_nodes (integer; default: blank): The maximum number of leaf nodes. Blank means no limit.

  • min_samples_leaf (integer; default: 20): The minimum number of samples required at a leaf node. Leaf nodes (and their siblings) are pruned if the sample count falls below this value.

  • subsample (float; default: 1): The fraction of training samples used per iteration. Valid range: 0–1. Values below 1 use only the specified proportion of samples for model creation.

  • max_features (float; default: 1): The fraction of features considered when splitting a node. Valid range: 0–1.

  • random_state (integer; default: 1): The random number seed. Changing this value affects tree construction and data splitting, which can produce different results.

  • model_type (string; default: pkl): The storage format for the trained model. pkl: PKL file. pmml: PMML (Predictive Model Markup Language) file; includes the full tree structure, which is useful for inspection.

  • n_jobs (integer; default: 4): The number of threads used for training. More threads reduce training time.

  • is_unbalance (boolean; default: False): Whether to upweight the minority class to address class imbalance. Set to True when one class has significantly fewer samples than the other.

  • categorical_feature (string; no default): The names of categorical feature columns, as a comma-separated string. LightGBM detects categorical features automatically in most cases. Override this parameter when auto-detection is insufficient. Example: categorical_feature='AirportTo,DayOfWeek'.

  • automl (boolean; default: False): Whether to enable automatic parameter tuning. When set to True, early stopping is applied when the metric specified by loss stops improving.

  • automl_train_tag (string; no default): The label value that identifies training rows in the column specified by automl_column.

  • automl_test_tag (string; no default): The label value that identifies test rows in the column specified by automl_column. The training set must be 4 to 9 times the size of the test set.

  • automl_column (string; no default): The column name used to split rows into training and test sets for AutoML. When set, prefix parameters with automl_ to define the search space. For example: automl_learning_rate='0.05,0.04,0.03,0.01' searches across four values. Requires automl_train_tag and automl_test_tag.
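To illustrate how the AutoML parameters fit together, the following sketch assumes a hypothetical split_tag column in the training table whose values mark each row as 'train' or 'test' (with roughly 4 to 9 training rows per test row); the automl_learning_rate list defines the search space:

```sql
/*polar4ai*/CREATE MODEL airline_gbm_automl WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay',
model_parameter = (automl=True,
automl_column='split_tag',
automl_train_tag='train',
automl_test_tag='test',
automl_learning_rate='0.05,0.04,0.03,0.01'))
AS (SELECT * FROM db4ai.airlines);
```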

Examples

The following examples use the db4ai.airlines dataset. All SQL statements use the /*polar4ai*/ hint to route queries to the PolarDB AI engine.

Create a LightGBM model

/*polar4ai*/CREATE MODEL airline_gbm WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols='Delay',model_parameter=(boosting_type='gbdt'))
AS (SELECT * FROM db4ai.airlines);

Evaluate the model

/*polar4ai*/SELECT Airline FROM EVALUATE(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',y_cols='Delay',metrics='acc');

Run predictions

/*polar4ai*/SELECT Airline FROM PREDICT(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length');