Use a gradient boosting regression tree (GBRT) algorithm — such as XGBoost or LightGBM — to predict monthly sales for the next several months and identify peak and off-peak seasons. Features are built from three dimensions: time series attributes, business logic, and external factors.
After completing this guide, you will have:
-
Designed a feature set for time-series sales forecasting
-
Trained a LightGBM regression model using
CREATE MODEL -
Evaluated model accuracy with the R² score
-
Generated sales predictions and labeled each month as peak, off-peak, or normal
Feature engineering design
Features fall into four categories. The table below lists each category with representative examples.
| Category | Subcategory | Examples |
|---|---|---|
| Time-related | Basic time features | Month (1–12); cyclical encoding: sin(2π × month/12) and cos(2π × month/12) to capture periodicity; quarter (Q1–Q4); week number or date range (beginning/end of month); holiday flag (Spring Festival, Double 11, Christmas); weekday/weekend flag |
| Time series statistics | Lag features for 1, 3, 6, and 12 months to capture short-term fluctuations and long-term trends; sliding window statistics (mean, variance, maximum over the past 3 months) to smooth noise; year-over-year and month-over-month growth rate; year-to-date cumulative sales | |
| Business-related | Promotional activities | Promotion intensity (discount rate), number of promotion days, promotion type (spend-and-save promotions or flash sales) |
| Product life cycle | New product launch flag, delisting warning flag, lifecycle stage (introduction, growth, or maturity) | |
| Inventory and price | Inventory level (low inventory can suppress sales), price changes | |
| External factors | Macroeconomic | Consumer Confidence Index, Consumer Price Index (CPI) for inflation, industry growth data |
| Market competition | Competitor promotion activity and market share changes (sourced via web crawlers or third-party data) | |
| Weather and season | Temperature and rainfall (for example, ice cream sales correlate with temperature); holiday seasonality (for example, mooncake demand peaks before the Mid-Autumn Festival) | |
| Natural cycles | Fixed cycles | Fourier terms to extract annual or quarterly cycles |
| Floating cycles | Moving holiday effect to adjust for holidays whose Gregorian calendar dates shift each year, such as the Spring Festival |
Build a model
Step 1: Prepare the data
Split the data chronologically — not randomly. Random splitting lets future data leak into training, which inflates evaluation scores without reflecting real prediction performance. Use the following split:
-
Training dataset: 2018–2021
-
Validation set: 2022
-
Test set: 2023
Handle missing values. For missing values in external data such as promotions or weather, fill with the column mean or use interpolation.
Normalize continuous features. Apply standardization or normalization to continuous features such as price and temperature. For categorical variables, use one-hot encoding or target encoding.
The following table lists the features and target variable used in the model:
| Type | Name | Description |
|---|---|---|
| Feature | month_sin |
Sine cyclical encoding for month: sin(2π × month/12) |
month_cos |
Cosine cyclical encoding for month: cos(2π × month/12) | |
lag_1 |
Sales from the previous month, capturing short-term trends | |
lag_3 |
Sales from 3 months ago, reflecting medium-term fluctuations | |
promotion_flag |
Binary flag (0/1) indicating whether a promotion is active | |
temperature |
Temperature value, for weather-sensitive products such as beverages or apparel | |
| Target | sales |
Continuous sales value to predict |
Step 2: Create the model
LightGBM handles missing values natively and works well with categorical features.
The following table describes the key model parameters. For the full parameter reference, see LightGBM algorithm.
| Parameter | Description | Recommended value |
|---|---|---|
loss |
Loss function for the regression task. Mean Squared Error (MSE) penalizes large errors more heavily; Mean Absolute Error (MAE) is more robust to outliers. | regression (MSE) |
learning_rate |
Controls how much each tree corrects the previous prediction. Lower values reduce overfitting but require more trees to converge. Pair with early stopping to tune the number of training iterations automatically. | 0.05–0.1 |
n_estimators |
Maximum number of weak learners (trees). Keep this low — 5 to 8 — when your dataset is small. Adding more trees than the data supports causes overfitting to noise rather than learning generalizable patterns. | 5–8 |
/*polar4ai*/CREATE MODEL sales_gbrt WITH
(model_class = 'lightgbm',
x_cols = 'month_sin,month_cos,lag_1,lag_3,promotion_flag,temperature',
y_cols='sales',model_parameter=(n_estimators=8,boosting_type='gbdt',loss='regression',automl='True'))
AS (SELECT * FROM train_data);
Step 3: Evaluate the model
Run the following statement to start an asynchronous evaluation job. A task ID is returned immediately.
/*polar4ai*/SELECT target FROM EVALUATE(MODEL sales_gbrt,
SELECT * FROM test_data) WITH
(x_cols = 'month_sin,month_cos,lag_1,lag_3,promotion_flag,temperature', y_cols='sales',metrics='r2_score',mode='async');
View the results by running:
/*polar4ai*/SHOW TASK `<task_id>`;
Replace <task_id> with the ID returned by the evaluation statement.
The output includes an R² score (coefficient of determination). A higher R² score indicates better model fit.
Step 4: Generate predictions
Run the following statement to start an asynchronous prediction job. A task ID is returned immediately.
/*polar4ai*/SELECT target FROM PREDICT(MODEL sales_gbrt,
SELECT * FROM predict_data) WITH
(x_cols = 'month_sin,month_cos,lag_1,lag_3,promotion_flag,temperature', mode='async');
View the results by running:
/*polar4ai*/SHOW TASK `<task_id>`;
When the task completes, the output includes an OSS address. Download the file at that address to get the full prediction results.
Step 5: Identify peak and off-peak seasons
Compare the predicted sales for each future month against the historical average:
-
Peak season: predicted sales more than 20% above the historical average
-
Off-peak season: predicted sales more than 20% below the historical average
-
Normal: within ±20% of the average
Step 6: Inspect feature importance
Run the following statement to see which features contributed most to the model's predictions:
/*polar4ai*/describe sales_gbrt;
Use this output to verify that time-related features such as month encoding and holiday flags rank as dominant predictors. If they don't, revisit your feature engineering.
Output results
A completed model produces three outputs:
-
Sales predictions: Predicted values for the next six months to support inventory preparation and resource allocation.
-
Peak/off-peak season tags: Each month is labeled peak, off-peak, or normal based on comparison with the historical average.
-
Key driver analysis: Feature importance scores identify which factors most influence sales — for example, the contribution of promotions or the impact of holidays.
Limitations and maintenance
Cold start problem. New products with no sales history lack the lag features the model depends on. As a workaround, substitute lag features from similar products or use industry-level averages.
Exogenous variable availability. If predictions rely on external data such as weather forecasts or economic indicators, confirm that this data can be retrieved on time before each prediction run.
Model monitoring. Retrain the model periodically to reflect shifting market conditions, such as changes in consumer behavior or seasonal patterns that evolve year over year.
What's next
A GBRT model can predict sales with measurable accuracy. Use feature importance analysis to identify what drives peak and off-peak seasons, providing a quantitative basis for inventory and business decisions.