All Products
Search
Document Center

PolarDB:Forecast sales with LightGBM

Last Updated:Apr 01, 2026

Use a gradient boosting regression tree (GBRT) algorithm — such as XGBoost or LightGBM — to predict monthly sales for the next several months and identify peak and off-peak seasons. Features are built from three dimensions: time series attributes, business logic, and external factors.

After completing this guide, you will have:

  • Designed a feature set for time-series sales forecasting

  • Trained a LightGBM regression model using CREATE MODEL

  • Evaluated model accuracy with the R² score

  • Generated sales predictions and labeled each month as peak, off-peak, or normal

Feature engineering design

Features fall into four categories. The table below lists each category with representative examples.

Category Subcategory Examples
Time-related Basic time features Month (1–12); cyclical encoding: sin(2π × month/12) and cos(2π × month/12) to capture periodicity; quarter (Q1–Q4); week number or date range (beginning/end of month); holiday flag (Spring Festival, Double 11, Christmas); weekday/weekend flag
Time series statistics Lag features for 1, 3, 6, and 12 months to capture short-term fluctuations and long-term trends; sliding window statistics (mean, variance, maximum over the past 3 months) to smooth noise; year-over-year and month-over-month growth rate; year-to-date cumulative sales
Business-related Promotional activities Promotion intensity (discount rate), number of promotion days, promotion type (spend-and-save promotions or flash sales)
Product life cycle New product launch flag, delisting warning flag, lifecycle stage (introduction, growth, or maturity)
Inventory and price Inventory level (low inventory can suppress sales), price changes
External factors Macroeconomic Consumer Confidence Index, Consumer Price Index (CPI) for inflation, industry growth data
Market competition Competitor promotion activity and market share changes (sourced via web crawlers or third-party data)
Weather and season Temperature and rainfall (for example, ice cream sales correlate with temperature); holiday seasonality (for example, mooncake demand peaks before the Mid-Autumn Festival)
Natural cycles Fixed cycles Fourier terms to extract annual or quarterly cycles
Floating cycles Moving holiday effect to adjust for holidays whose Gregorian calendar dates shift each year, such as the Spring Festival

Build a model

Step 1: Prepare the data

Split the data chronologically — not randomly. Random splitting lets future data leak into training, which inflates evaluation scores without reflecting real prediction performance. Use the following split:

  • Training dataset: 2018–2021

  • Validation set: 2022

  • Test set: 2023

Handle missing values. For missing values in external data such as promotions or weather, fill with the column mean or use interpolation.

Normalize continuous features. Apply standardization or normalization to continuous features such as price and temperature. For categorical variables, use one-hot encoding or target encoding.

The following table lists the features and target variable used in the model:

Type Name Description
Feature month_sin Sine cyclical encoding for month: sin(2π × month/12)
month_cos Cosine cyclical encoding for month: cos(2π × month/12)
lag_1 Sales from the previous month, capturing short-term trends
lag_3 Sales from 3 months ago, reflecting medium-term fluctuations
promotion_flag Binary flag (0/1) indicating whether a promotion is active
temperature Temperature value, for weather-sensitive products such as beverages or apparel
Target sales Continuous sales value to predict

Step 2: Create the model

LightGBM handles missing values natively and works well with categorical features.

The following table describes the key model parameters. For the full parameter reference, see LightGBM algorithm.

Parameter Description Recommended value
loss Loss function for the regression task. Mean Squared Error (MSE) penalizes large errors more heavily; Mean Absolute Error (MAE) is more robust to outliers. regression (MSE)
learning_rate Controls how much each tree corrects the previous prediction. Lower values reduce overfitting but require more trees to converge. Pair with early stopping to tune the number of training iterations automatically. 0.05–0.1
n_estimators Maximum number of weak learners (trees). Keep this low — 5 to 8 — when your dataset is small. Adding more trees than the data supports causes overfitting to noise rather than learning generalizable patterns. 5–8
/*polar4ai*/CREATE MODEL sales_gbrt WITH
(model_class = 'lightgbm',
x_cols = 'month_sin,month_cos,lag_1,lag_3,promotion_flag,temperature',
y_cols='sales',model_parameter=(n_estimators=8,boosting_type='gbdt',loss='regression',automl='True'))
AS (SELECT * FROM train_data);

Step 3: Evaluate the model

Run the following statement to start an asynchronous evaluation job. A task ID is returned immediately.

/*polar4ai*/SELECT target FROM EVALUATE(MODEL sales_gbrt,
SELECT * FROM test_data) WITH
(x_cols = 'month_sin,month_cos,lag_1,lag_3,promotion_flag,temperature', y_cols='sales',metrics='r2_score',mode='async');

View the results by running:

/*polar4ai*/SHOW TASK `<task_id>`;

Replace <task_id> with the ID returned by the evaluation statement.

The output includes an R² score (coefficient of determination). A higher R² score indicates better model fit.

Step 4: Generate predictions

Run the following statement to start an asynchronous prediction job. A task ID is returned immediately.

/*polar4ai*/SELECT target FROM PREDICT(MODEL sales_gbrt,
SELECT * FROM predict_data) WITH
(x_cols = 'month_sin,month_cos,lag_1,lag_3,promotion_flag,temperature', mode='async');

View the results by running:

/*polar4ai*/SHOW TASK `<task_id>`;

When the task completes, the output includes an OSS address. Download the file at that address to get the full prediction results.

Step 5: Identify peak and off-peak seasons

Compare the predicted sales for each future month against the historical average:

  • Peak season: predicted sales more than 20% above the historical average

  • Off-peak season: predicted sales more than 20% below the historical average

  • Normal: within ±20% of the average

Step 6: Inspect feature importance

Run the following statement to see which features contributed most to the model's predictions:

/*polar4ai*/describe sales_gbrt;

Use this output to verify that time-related features such as month encoding and holiday flags rank as dominant predictors. If they don't, revisit your feature engineering.

Output results

A completed model produces three outputs:

  • Sales predictions: Predicted values for the next six months to support inventory preparation and resource allocation.

  • Peak/off-peak season tags: Each month is labeled peak, off-peak, or normal based on comparison with the historical average.

  • Key driver analysis: Feature importance scores identify which factors most influence sales — for example, the contribution of promotions or the impact of holidays.

Limitations and maintenance

Cold start problem. New products with no sales history lack the lag features the model depends on. As a workaround, substitute lag features from similar products or use industry-level averages.

Exogenous variable availability. If predictions rely on external data such as weather forecasts or economic indicators, confirm that this data can be retrieved on time before each prediction run.

Model monitoring. Retrain the model periodically to reflect shifting market conditions, such as changes in consumer behavior or seasonal patterns that evolve year over year.

What's next

A GBRT model can predict sales with measurable accuracy. Use feature importance analysis to identify what drives peak and off-peak seasons, providing a quantitative basis for inventory and business decisions.