AnalyticDB for PostgreSQL V7.0 includes an in-database AI/ML feature that integrates Hugging Face models for AI tasks, and XGBoost, LightGBM, CatBoost, Linfa, and all scikit-learn algorithms for machine learning. You can train, fine-tune, deploy, and run inference directly on data in your tables or views.
Training
Use pgml.train to train a model. This function runs on coordinator nodes only.
Syntax
```sql
CREATE FUNCTION pgml.train(
    "project_name" TEXT,
    "task" TEXT DEFAULT NULL,
    "relation_name" TEXT DEFAULT NULL,
    "y_column_name" TEXT DEFAULT NULL,
    "algorithm" pgml.Algorithm DEFAULT 'linear',
    "hyperparams" jsonb DEFAULT '{}',
    "search" pgml.Search DEFAULT NULL,
    "search_params" jsonb DEFAULT '{}',
    "search_args" jsonb DEFAULT '{}',
    "test_size" real DEFAULT 0.25,
    "test_sampling" pgml.Sampling DEFAULT 'stratified',
    "runtime" pgml.Runtime DEFAULT NULL,
    "automatic_deploy" bool DEFAULT true,
    "materialize_snapshot" bool DEFAULT false,
    "preprocess" jsonb DEFAULT '{}'
) RETURNS TABLE (
    "project" TEXT,
    "task" TEXT,
    "algorithm" TEXT,
    "deployed" bool
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'train_wrapper';
```
Parameters
| Parameter | Description | Example |
|---|---|---|
| project_name | The project name. | 'Breast Cancer Detection' |
| task | The task type. | 'classification' |
| relation_name | The training table or view. Use SCHEMA.TABLE format to reference tables outside the current schema. | 'pgml.breast_cancer' |
| y_column_name | The target column to predict. | 'malignant' |
| algorithm | The algorithm to use. See Supported algorithms. | 'xgboost' |
| hyperparams | Hyperparameters to pass to the algorithm, in JSON format. | {"n_estimators": 25, "nthread": 4} |
| search | The hyperparameter search method. See Hyperparameter search. | 'grid' |
| search_params | The hyperparameter ranges to search. See the documentation for scikit-learn or XGBoost for valid parameters. | {"max_depth": [1, 2, 3, 4]} |
| search_args | Supplementary parameters for the search method. See the documentation for scikit-learn or XGBoost for details. | {"n_iter": 10} |
| test_size | The fraction of data to hold out as a test set. Default: 0.25. | 0.25 |
| test_sampling | The method used to sample the test set. Default: stratified. | 'random' |
| runtime | The runtime environment. Valid values: python, rust. | 'python' |
| automatic_deploy | Whether to automatically deploy the best model after training. Default: true. | true |
| materialize_snapshot | Whether to materialize the training data snapshot. Default: false. | true |
| preprocess | Preprocessing steps applied to each column before training. See Preprocessing. | {"col_name": {"impute": "mean", "scale": "standard"}} |
Training uses a single core by default. To use multiple cores, pass the relevant hyperparameter for your algorithm — for example, nthread for XGBoost.
To review trained models and their metrics after training, query pgml.trained_models.
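For example, the following sketch trains a classifier and then inspects the training record. The project, table, and column names are illustrative, and the exact columns returned by pgml.trained_models may vary by version:

```sql
-- Train an XGBoost classifier (names are illustrative)
SELECT * FROM pgml.train(
    'Breast Cancer Detection',
    'classification',
    'pgml.breast_cancer',
    'malignant',
    algorithm => 'xgboost',
    hyperparams => '{"n_estimators": 25, "nthread": 4}'
);

-- Review trained models and their metrics
SELECT * FROM pgml.trained_models;
```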
Supported algorithms
The following tables list the valid values of the pgml.algorithm, pgml.sampling, and pgml.runtime enumeration types.
pgml.algorithm
| Algorithm | Description |
|---|---|
| linear | Linear model for regression. |
| xgboost | Optimized distributed gradient boosting. |
| xgboost_random_forest | XGBoost tuned for the random forest algorithm. |
| svm | Support vector machine (SVM) for classification and regression. |
| lasso | Lasso regression. |
| elastic_net | Elastic net regression (combines L1 and L2 regularization). |
| ridge | Ridge regression (L2 regularization). |
| kmeans | k-means clustering. |
| dbscan | Density-based clustering for clusters of various shapes. |
| knn | k-nearest neighbors (KNN) for regression. |
| random_forest | Ensemble learning for classification and regression. |
| least_angle | Least angle regression (LARS), compatible with lasso. |
| lasso_least_angle | Lasso least angle regression (combines LARS and lasso). |
| orthogonal_matching_pursuit | Greedy algorithm for sparse signal recovery. |
| bayesian_ridge | Bayesian ridge regression. |
| automatic_relevance_determination | Bayesian regression. |
| stochastic_gradient_descent | Stochastic gradient descent for regression and classification. |
| perceptron | Linear model for binary classification. |
| passive_aggressive | Online learning for large-scale tasks. |
| ransac | Random sample consensus for robust linear regression. |
| theil_sen | Theil-Sen estimator for robust linear regression. |
| huber | Huber regression, robust to outliers. |
| quantile | Quantile regression for predicting conditional quantiles. |
| kernel_ridge | Kernel ridge regression (KRR) for nonlinear relationships. |
| gaussian_process | Gaussian process for probabilistic regression and classification. |
| nu_svm | SVM variant for classification and regression. |
| ada_boost | AdaBoost for classification. |
| bagging | Bagging ensemble to reduce overfitting. |
| extra_trees | Extra trees, an extension of random forest. |
| gradient_boosting_trees | Gradient boosting for regression and classification. |
| hist_gradient_boosting | Histogram-based gradient boosting for large datasets. |
| linear_svm | Linear SVM for classification. |
| lightgbm | LightGBM gradient boosting framework. |
| affinity_propagation | Message-passing clustering based on data similarity. |
| birch | Clustering for large datasets. |
| feature_agglomeration | Clustering-based feature selection. |
| mini_batch_kmeans | Mini-batch variant of k-means for large or online datasets. |
| mean_shift | Centroid-based clustering for any cluster shape. |
| optics | Clustering for datasets with varying densities. |
| spectral | Spectral clustering. |
| spectral_bi | Spectral bi-clustering across two dimensions (rows and columns). |
| spectral_co | Spectral co-clustering across two dimensions. |
| catboost | CatBoost gradient boosting for classification and regression. |
| pca | Principal component analysis (PCA) for dimensionality reduction. |
pgml.sampling
| Sampling method | Description |
|---|---|
| random | Random sampling. |
| last | Sequential scan from the end of the dataset. |
| stratified | Stratified sampling (default). |
pgml.runtime
| Runtime | Description |
|---|---|
| python | Python runtime. |
| rust | Rust runtime. |
Hyperparameter search
Use search, search_params, and search_args together to find the optimal hyperparameters for your model.
| Parameter | What it controls | Example |
|---|---|---|
| search | Search method: grid (exhaustive) or random (sampled). | 'grid' |
| search_params | Hyperparameter ranges to evaluate. | {"max_depth": [1, 2, 3], "n_estimators": [20, 80]} |
| search_args | Number of iterations (n_iter) or cross-validation folds (cv). | {"cv": 3} |

- grid: Trains every combination in search_params using the Cartesian product.
- random: Randomly samples combinations from search_params for up to n_iter iterations.
For regression tasks, the optimal hyperparameter set is selected by R2 score; for classification tasks, by F1 score.
Example: Grid search with 3-fold cross-validation
The following query trains an XGBoost classifier with grid search across max_depth and n_estimators. The total number of combinations evaluated is len(max_depth) × len(n_estimators) × cv = 3 × 3 × 3 = 27.
```sql
SELECT * FROM pgml.train(
    'Handwritten Digit Image Classifier',
    'classification',
    'pgml.digits',
    'target',
    algorithm => 'xgboost',
    search => 'grid',
    search_params => '{
        "max_depth": [1, 3, 5],
        "n_estimators": [20, 80, 160]
    }',
    search_args => '{"cv": 3}'
);
```
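A random search can be sketched the same way: instead of evaluating all 9 combinations, it samples up to n_iter of them. The n_iter value below is an illustrative choice:

```sql
SELECT * FROM pgml.train(
    'Handwritten Digit Image Classifier',
    'classification',
    'pgml.digits',
    'target',
    algorithm => 'xgboost',
    search => 'random',
    search_params => '{
        "max_depth": [1, 3, 5],
        "n_estimators": [20, 80, 160]
    }',
    search_args => '{"n_iter": 5}'
);
```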
Preprocessing
Pass a preprocess JSON object in pgml.train to configure preprocessing for each column. Preprocessing settings are saved in pgml.snapshots after training and automatically applied to new data during inference — no extra configuration needed at inference time.
Enumeration type encoding
Converts categorical columns to numeric values.
| Method | Description |
|---|---|
| native | Converts non-text types to 32-bit float (default). |
| target | Replaces each category with the mean of the target variable for that category. |
| one_hot | One-hot encoding. |
| ordinal | Assigns each category a unique integer based on a predefined ordinal array. NULL values are assigned index 0 by default. |
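As an illustration, the encodings above could be combined in one preprocess object. All project, table, and column names here are hypothetical, and the nested JSON shape shown for the ordinal array is an assumption — verify the accepted form in your version:

```sql
SELECT * FROM pgml.train(
    'Encoding Demo',                 -- hypothetical project
    'regression',
    'pgml.my_table',                 -- hypothetical relation
    'label',
    preprocess => '{
        "city":  {"encode": "one_hot"},
        "grade": {"encode": {"ordinal": ["low", "medium", "high"]}},
        "brand": {"encode": "target"}
    }'
);
```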
Missing value imputation
Replaces NULL or NaN values with a computed substitute.
| Method | Description |
|---|---|
| error | Returns an error if NULL or NaN values are detected (default). |
| mean | Replaces with the training data mean. |
| median | Replaces with the training data median. |
| mode | Replaces with the most frequent value in the training data. |
| min | Replaces with the training data minimum. |
| max | Replaces with the training data maximum. |
| zero | Replaces with 0. |
Scaling
Scales features to a standard range to improve model generalization.
| Method | Description |
|---|---|
| preserve | No scaling (default). |
| standard | Standardizes to zero mean and unit variance. |
| min-max | Scales to a specified range, typically 0–1. |
| max-abs | Divides by the maximum absolute value, so the max absolute value in the training set becomes 1. |
| robust | Scales based on the interquartile range (Q1–Q3), more robust to outliers. |
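For instance, imputation and scaling can be configured per column in one preprocess object. The project, table, and column names below are illustrative:

```sql
SELECT * FROM pgml.train(
    'Preprocessing Demo',            -- hypothetical project
    'regression',
    'pgml.sensor_readings',          -- hypothetical relation
    'target',
    preprocess => '{
        "temperature": {"impute": "mean",   "scale": "standard"},
        "humidity":    {"impute": "median", "scale": "min-max"},
        "pressure":    {"impute": "zero",   "scale": "robust"}
    }'
);
```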
Deployment
After training, project metadata, preprocessing settings, and model parameters are stored in internal tables. At inference time, reference the project by name or ID — no need to pass model details again.
automatic_deploy (default true) automatically deploys the best-performing model after each training run. For classification, the model with the highest F1 score is deployed. For regression, the model with the highest R2 score is deployed.
To deploy a model manually, use one of the following pgml.deploy overloads.
Deploy by model ID
```sql
CREATE FUNCTION pgml."deploy"(
    "model_id" bigint
) RETURNS TABLE (
    "project" TEXT,
    "strategy" TEXT,
    "algorithm" TEXT
)
STRICT EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'deploy_model_wrapper';
```
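For example, to deploy a specific model by ID (the ID below is illustrative; look up real IDs in pgml.trained_models first):

```sql
SELECT * FROM pgml.deploy(42);
```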
Deploy by strategy
```sql
CREATE FUNCTION pgml."deploy"(
    "project_name" TEXT,
    "strategy" pgml.Strategy,
    "algorithm" pgml.Algorithm DEFAULT NULL
) RETURNS TABLE (
    "project" TEXT,
    "strategy" TEXT,
    "algorithm" TEXT
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'deploy_strategy_wrapper';
```
Deployment strategies (pgml.strategy)
| Strategy | Description |
|---|---|
| best_score | Deploys the model with the best evaluation metrics. |
| most_recent | Deploys the most recently trained model, regardless of metrics. |
| rollback | Rolls back to the previous deployment. |
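For example, deploying by strategy and rolling back (the project name is illustrative):

```sql
-- Deploy the best-scoring xgboost model for the project
SELECT * FROM pgml.deploy('Handwritten Digit Image Classifier', 'best_score', 'xgboost');

-- Roll back to the previous deployment
SELECT * FROM pgml.deploy('Handwritten Digit Image Classifier', 'rollback');
```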
Inference
Single-row inference with predict
pgml.predict returns a FLOAT4 value: the predicted class for classification tasks, or the predicted numeric value for regression tasks. The model resolved by project name is always the most recently deployed model for that project.
Call pgml.predict directly inside a SELECT statement:
```sql
-- Predict using project name (resolves the most recently deployed model)
SELECT target_column, pgml.predict('My Project', row_data) AS prediction
FROM my_table
LIMIT 10;

-- Predict using a specific model ID
SELECT target_column, pgml.predict(model_id, row_data) AS prediction
FROM my_table
LIMIT 10;
```
pgml.predict accepts the following input types:
| Function signature | Input type |
|---|---|
| pgml.predict(project_name TEXT, row anyelement) | Table row |
| pgml.predict(model_id bigint, row anyelement) | Table row, by model ID |
| pgml.predict(project_name TEXT, features real[]) | FLOAT4 array |
| pgml.predict(model_id bigint, features real[]) | FLOAT4 array, by model ID |
| pgml.predict(project_name TEXT, features bigint[]) | BIGINT array |
| pgml.predict(project_name TEXT, features INT[]) | INT array |
| pgml.predict(project_name TEXT, features smallint[]) | SMALLINT array |
| pgml.predict(project_name TEXT, features double precision[]) | FLOAT8 array |
| pgml.predict(project_name TEXT, features bool[]) | BOOL array |
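For example, the array overloads accept raw feature values directly. The project name and feature values below are illustrative; the array must match the order and number of feature columns the model was trained on:

```sql
-- FLOAT4 array by project name
SELECT pgml.predict('My Project', ARRAY[0.1, 2.0, 5.0]::real[]);

-- INT array (implicit integer array type)
SELECT pgml.predict('My Project', ARRAY[1, 2, 3]);
```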
Batch inference
For batch inference on multiple rows, use predict_batch, predict_proba, or predict_joint. All batch functions take a flattened one-dimensional FLOAT4 array as input and return FLOAT4 predictions (as a set or an array, depending on the overload).
predict_batch
Use predict_batch for standard batch inference.
```sql
-- By project name
CREATE FUNCTION pgml."predict_batch"(
    "project_name" TEXT,
    "features" real[]
) RETURNS SETOF real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_batch_wrapper';

-- By model ID
CREATE FUNCTION pgml."predict_batch"(
    "model_id" bigint,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_batch_wrapper';
```
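A usage sketch, assuming a model trained on two features: the four values below flatten two rows of two features each, and one prediction is returned per row. The project name is illustrative:

```sql
SELECT pgml.predict_batch(
    'My Project',
    ARRAY[5.1, 3.5,          -- row 1
          4.9, 3.0]::real[]  -- row 2
);
```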
predict_proba
Use predict_proba when you need the confidence probability for each class, not just the predicted label. It takes the same inputs as predict_batch but returns the probability of each class.
```sql
-- By project name
CREATE FUNCTION pgml."predict_proba"(
    "project_name" TEXT,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_proba_wrapper';

-- By model ID
CREATE FUNCTION pgml."predict_proba"(
    "model_id" bigint,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_proba_wrapper';
```
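A sketch, assuming a classifier trained on two features: the returned array holds one probability per class for each input row. The project name and values are illustrative:

```sql
SELECT pgml.predict_proba(
    'My Project',
    ARRAY[5.1, 3.5]::real[]  -- one row, two features
);
```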
predict_joint
Use predict_joint for multi-label regression tasks. This function returns multi-dimensional predictions for each input row. Classification tasks are not supported.
```sql
-- By project name
CREATE FUNCTION pgml."predict_joint"(
    "project_name" TEXT,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_joint_wrapper';

-- By model ID
CREATE FUNCTION pgml."predict_joint"(
    "model_id" bigint,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_joint_wrapper';
```
Examples
The following examples show end-to-end workflows for classification, regression, clustering, dimensionality reduction, and preprocessing. Each example follows the same pattern: load data, train a baseline model, run predictions, then experiment with additional algorithms and deployment strategies.
Preprocessing
This example uses the diamonds dataset to demonstrate encoding categorical variables and scaling numeric features before training.
```sql
-- Step 1: Load the diamonds dataset (contains categorical columns)
SELECT pgml.load_dataset('jdxcosta/diamonds');

-- Inspect the data
SELECT * FROM pgml."jdxcosta/diamonds" LIMIT 10;

-- Remove the unnamed index column
ALTER TABLE pgml."jdxcosta/diamonds" DROP COLUMN "Unnamed: 0";

-- Step 2: Train with preprocessing: standard scaling for numeric columns,
-- target encoding for categoricals
SELECT pgml.train(
    project_name => 'Diamond prices',
    task => 'regression',
    relation_name => 'pgml.jdxcosta/diamonds',
    y_column_name => 'price',
    algorithm => 'lightgbm',
    preprocess => '{
        "carat": {"scale": "standard"},
        "depth": {"scale": "standard"},
        "table": {"scale": "standard"},
        "cut": {"encode": "target", "scale": "standard"},
        "color": {"encode": "target", "scale": "standard"},
        "clarity": {"encode": "target", "scale": "standard"}
    }'
);

-- Step 3: Run predictions
-- Pass a heterogeneous row (tuple) instead of a homogeneous ARRAY[]
SELECT price, pgml.predict('Diamond prices', (carat, cut, color, clarity, depth, "table", x, y, z)) AS prediction
FROM pgml."jdxcosta/diamonds"
LIMIT 10;
```