Build a model with the random forest regression algorithm - PolarDB

Random forest regression is a PolarDB AI (Polar4AI) algorithm for predicting continuous numerical values. It builds multiple independent decision trees in parallel, each on a random sample of the data and features, then averages their outputs to produce the final prediction.

When to use this algorithm

Use random forest regression when the prediction target is a continuous number, such as a price, count, or score, and when the dataset has tens of feature dimensions and the task demands high accuracy. For targets limited to a fixed set of categories, use a classification algorithm instead.

Example: Predict the hourly popularity of a social media topic. Input features include the number of discussion groups, participant count, and engagement level. The output is the average number of active discussion groups per hour — a positive floating-point value.

How it works

1. Randomly select samples and features from the training data.
2. Build multiple independent decision trees in parallel, each producing its own prediction.
3. Average the predictions of all trees to get the final regression output.

Using random selection for both samples and features reduces overfitting and improves generalization compared to a single decision tree.
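The steps above can be sketched in plain Python. This is a minimal illustration, not the Polar4AI implementation: each "tree" is a one-split stump, and the names `fit_stump`, `fit_forest`, and `predict` are hypothetical helpers introduced only for this example. Because a stump has a single split, choosing the feature subset once per tree is equivalent here to choosing it per split.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature_ids):
    """Fit a one-split regression tree using only the candidate features."""
    best = None  # (squared_error, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue  # a split must leave samples on both sides
            lm, rm = mean(left), mean(right)
            err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or err < best[0]:
                best = (err, f, threshold, lm, rm)
    if best is None:
        # No valid split exists (all sampled rows identical): predict the mean.
        m = mean(targets)
        return lambda row: m
    _, f, threshold, lm, rm = best
    return lambda row: lm if row[f] <= threshold else rm


def fit_forest(rows, targets, n_estimators=100, max_features=1, random_state=1):
    """Step 1 and 2: sample rows and features at random, then fit one tree each."""
    rng = random.Random(random_state)
    n_features = len(rows[0])
    trees = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample of rows
        feats = rng.sample(range(n_features), max_features)  # random feature subset
        trees.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feats))
    return trees


def predict(trees, row):
    """Step 3: average the predictions of all trees."""
    return mean(tree(row) for tree in trees)


# Made-up illustration data: two features, one continuous target.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
forest = fit_forest(X, y, n_estimators=25, max_features=1)
print(predict(forest, [2.5, 4.0]))
```

Fixing `random_state` makes the row and feature sampling repeatable, so two runs with the same seed and data produce the same forest.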

Input requirements

Column	Required type	Notes
`x_cols` (feature columns)	Floating-point or integer	All feature columns must be numeric
`y_cols` (target column)	Floating-point or integer	The model predicts a continuous numeric value

Parameters

Pass parameters through the `model_parameter` option in a `CREATE MODEL` statement.

Parameter	Type	Default	Description
`n_estimators`	Positive integer	`100`	Number of decision trees. A higher value generally improves fitting but increases training time.
`objective`	String	`mse`	Loss function used during training. Valid values: `mse` (mean squared error) and `mae` (mean absolute error).
`max_features`	String, integer, or float	`"sqrt"`	Maximum number of features considered at each split. See the table below for accepted values.
`max_depth`	Positive integer or `None`	`None`	Maximum depth of each tree. When set to `None`, the tree depth is not limited.
`n_jobs`	Positive integer	`4`	Number of parallel threads. A higher value speeds up model creation.
`random_state`	Positive integer	`1`	Random seed for reproducibility.
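To make the two `objective` values concrete, the following sketch computes both losses with their standard definitions on a small made-up example. How Polar4AI applies the objective internally is not described here, so treat this only as an illustration of what `mse` and `mae` measure. The third sample has a large error, which dominates the squared loss far more than the absolute loss.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values; the errors are 1, 0, and 4.
y_true = [3.0, 5.0, 10.0]
y_pred = [2.0, 5.0, 6.0]
print(mse(y_true, y_pred))  # (1 + 0 + 16) / 3
print(mae(y_true, y_pred))  # (1 + 0 + 4) / 3
```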

Accepted values for max_features

Value	Maximum features used
`"sqrt"` (default)	`sqrt(n_features)`
`"log2"`	`log2(n_features)`
Integer	The specified integer, between `0` and `n_features` (inclusive)
Float	`max_features × n_features`
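The table above can be read as a small lookup rule. The sketch below shows one way to turn a `max_features` setting into a feature count; `resolve_max_features` is a hypothetical helper, and truncating fractional results to an integer with a minimum of 1 is an assumption, since the rounding rule is not stated here.

```python
import math


def resolve_max_features(max_features, n_features):
    """Map a max_features setting to a number of features per split."""
    if isinstance(max_features, bool):
        raise TypeError("max_features must be a string, integer, or float")
    if max_features == "sqrt":
        value = math.sqrt(n_features)
    elif max_features == "log2":
        value = math.log2(n_features)
    elif isinstance(max_features, int):
        # Integers are used as given, within the documented range.
        if not 0 <= max_features <= n_features:
            raise ValueError("integer max_features must be between 0 and n_features")
        return max_features
    elif isinstance(max_features, float):
        value = max_features * n_features
    else:
        raise ValueError(f"unsupported max_features: {max_features!r}")
    # Assumption: truncate to an integer and use at least one feature.
    return max(1, int(value))


for setting in ("sqrt", "log2", 3, 0.5):
    print(setting, resolve_max_features(setting, 16))
```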

Examples

The following examples walk through the full workflow: create a model, evaluate it, then run predictions. All statements use the `/*polar4ai*/` prefix to route the SQL to the Polar4AI engine.

Create a model

/*polar4ai*/CREATE MODEL randomforestreg1 WITH
( model_class = 'randomforestreg', x_cols = 'dx1,dx2', y_cols='y',
 model_parameter=(objective='mse')) AS (SELECT * FROM db4ai.testdata1);

Evaluate the model

Run `EVALUATE` to score the model against labeled data. The `metrics='r2_score'` option returns the R² score (coefficient of determination), which measures how well the model explains variance in the target variable.

/*polar4ai*/SELECT dx1,dx2 FROM EVALUATE(MODEL randomforestreg1,
SELECT * FROM db4ai.testdata1 LIMIT 10) WITH
(x_cols = 'dx1,dx2',y_cols='y',metrics='r2_score');
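For reference, the R² score has a standard definition: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements that definition in plain Python on made-up values; it assumes Polar4AI follows the standard formula and is not taken from its implementation.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    if ss_tot == 0:
        raise ValueError("R² is undefined when all target values are equal")
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y_true, y_true)               # exact predictions
baseline = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])  # always predicting the mean
print(perfect, baseline)  # 1.0 0.0
```

A score of 1 means the predictions match the targets exactly, 0 means the model does no better than always predicting the mean of the targets, and negative values mean it does worse.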

Run predictions

/*polar4ai*/SELECT dx1,dx2 FROM
PREDICT(MODEL randomforestreg1, SELECT * FROM db4ai.testdata1 LIMIT 10)
WITH (x_cols = 'dx1,dx2');