Random forest regression is a PolarDB AI (Polar4AI) algorithm for predicting continuous numerical values. It builds multiple independent decision trees in parallel, each trained on a random sample of the data and features, then averages their outputs to produce the final prediction.
When to use this algorithm
Use random forest regression when the prediction target is a continuous number, such as a price, count, or score, and the dataset has tens of dimensions and requires high accuracy. For targets limited to a fixed set of categories, use a classification algorithm instead.
Example: Predict the hourly popularity of a social media topic. Input features include the number of discussion groups, participant count, and engagement level. The output is the average number of active discussion groups per hour — a positive floating-point value.
How it works
1. Randomly select samples and features from the training data.
2. Build multiple independent decision trees in parallel, each producing its own prediction.
3. Average the predictions of all trees to get the final regression output.
Using random selection for both samples and features reduces overfitting and improves generalization compared to a single decision tree.
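The three steps above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not Polar4AI's implementation: each "tree" is a depth-1 stump on one randomly chosen feature, trees are built sequentially rather than in parallel, and the names `fit_stump`, `fit_forest`, and `predict` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Fit a depth-1 regression tree on a single feature.

    The split threshold is the feature's mean (an assumption made to keep
    the sketch short); each side predicts the mean target of its rows.
    """
    threshold = mean(row[feature] for row in rows)
    left = [y for row, y in zip(rows, targets) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, targets) if row[feature] > threshold]
    overall = mean(targets)
    left_value = mean(left) if left else overall
    right_value = mean(right) if right else overall
    return feature, threshold, left_value, right_value


def fit_forest(rows, targets, n_estimators=100, random_state=1):
    """Train n_estimators stumps, each on its own random sample."""
    rng = random.Random(random_state)
    n_rows, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_estimators):
        # Step 1: draw rows with replacement and pick a random feature.
        picks = [rng.randrange(n_rows) for _ in range(n_rows)]
        feature = rng.randrange(n_features)
        sample_rows = [rows[i] for i in picks]
        sample_targets = [targets[i] for i in picks]
        # Step 2: build one tree from that sample.
        forest.append(fit_stump(sample_rows, sample_targets, feature))
    return forest


def predict(forest, row):
    """Step 3: average the predictions of all trees."""
    votes = [
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    ]
    return mean(votes)


# Tiny made-up dataset: two numeric features, one numeric target.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
targets = [1.5, 2.5, 3.5, 4.5]

forest = fit_forest(rows, targets, n_estimators=25, random_state=1)
prediction = predict(forest, [2.5, 25.0])
```

Because every tree sees a different sample, individual trees disagree, and averaging smooths out their individual errors. Fixing `random_state` makes the sampling, and therefore the trained forest, repeatable.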
Input requirements
| Column | Required type | Notes |
|---|---|---|
| x_cols (feature columns) | Floating-point or integer | All feature columns must be numeric |
| y_cols (target column) | Floating-point or integer | The model predicts a continuous numeric value |
Parameters
Pass parameters through the model_parameter option in a CREATE MODEL statement.
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_estimators | Positive integer | 100 | Number of decision trees. A higher value generally improves the fit but increases training time. |
| objective | String | mse | Loss function used during training. Valid values: mse (mean squared error) and mae (mean absolute error). |
| max_features | String, integer, or float | "sqrt" | Maximum number of features considered at each split. See the table below for accepted values. |
| max_depth | Positive integer or None | None | Maximum depth of each tree. When set to None, no depth limit is specified. |
| n_jobs | Positive integer | 4 | Number of parallel threads. A higher value speeds up model creation. |
| random_state | Positive integer | 1 | Random seed for reproducibility. |
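To make the two objective values concrete, the following sketch computes both losses on the same made-up predictions. The function names `mse` and `mae` are illustrative helpers, not part of Polar4AI, and the formulas are the standard textbook definitions.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values: the errors are 1, 0, and -4.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 11.0]

mse_value = mse(y_true, y_pred)  # (1 + 0 + 16) / 3
mae_value = mae(y_true, y_pred)  # (1 + 0 + 4) / 3
```

The single large error dominates the squared loss far more than the absolute loss, so mae may be worth trying when the target contains outliers.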
Accepted values for max_features
| Value | Maximum features used |
|---|---|
| "sqrt" (default) | sqrt(n_features) |
| "log2" | log2(n_features) |
| Integer | The specified integer, between 0 and n_features (inclusive) |
| Float | max_features × n_features |
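The table above can be read as a small lookup rule. The sketch below is one way to express it; `resolve_max_features` is a hypothetical helper, and truncating fractional results to an integer is an assumption, since the table does not state how they are rounded.

```python
import math


def resolve_max_features(value, n_features):
    """Return the feature count implied by a max_features setting.

    Assumption: fractional results are truncated toward zero.
    """
    if isinstance(value, bool):
        raise TypeError("max_features must be a string, integer, or float")
    if value == "sqrt":
        return math.isqrt(n_features)
    if value == "log2":
        return int(math.log2(n_features))
    if isinstance(value, int):
        if not 0 <= value <= n_features:
            raise ValueError("integer max_features must be in [0, n_features]")
        return value
    if isinstance(value, float):
        return int(value * n_features)
    raise ValueError(f"unsupported max_features value: {value!r}")


# With 16 feature columns:
sqrt_count = resolve_max_features("sqrt", 16)
log2_count = resolve_max_features("log2", 16)
int_count = resolve_max_features(3, 16)
float_count = resolve_max_features(0.5, 16)
```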
Examples
The following examples walk through the full workflow: create a model, evaluate it, then run predictions. All statements use the /*polar4ai*/ prefix to route the SQL to the Polar4AI engine.
Create a model
/*polar4ai*/CREATE MODEL randomforestreg1 WITH
( model_class = 'randomforestreg', x_cols = 'dx1,dx2', y_cols='y',
model_parameter=(objective='mse')) AS (SELECT * FROM db4ai.testdata1);
Evaluate the model
Run EVALUATE to score the model against labeled data. The metrics='r2_score' option returns the R² (coefficient of determination), which measures how well the model explains variance in the target variable.
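For reference, R² is conventionally defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below follows that standard definition; whether Polar4AI computes the metric in exactly this way is an assumption, and the function name `r2_score` is used here only for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined when every value in y_true is identical (SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Perfect predictions explain all of the variance.
perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])

# Always predicting the mean explains none of it.
baseline = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A value close to 1 indicates a good fit; a value at or below 0 means the model does no better than predicting the mean of the target. The statement below requests this metric for randomforestreg1.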
/*polar4ai*/SELECT dx1,dx2 FROM EVALUATE(MODEL randomforestreg1,
SELECT * FROM db4ai.testdata1 LIMIT 10) WITH
(x_cols = 'dx1,dx2',y_cols='y',metrics='r2_score');
Run predictions
/*polar4ai*/SELECT dx1,dx2 FROM
PREDICT(MODEL randomforestreg1, SELECT * FROM db4ai.testdata1 LIMIT 10)
WITH (x_cols = 'dx1,dx2');