
AnalyticDB: Machine learning user guide

Last Updated: Mar 30, 2026

AnalyticDB for PostgreSQL V7.0 includes an in-database AI/ML feature. It integrates Hugging Face models for AI tasks, along with XGBoost, LightGBM, CatBoost, Linfa, and scikit-learn algorithms for machine learning, so you can train, fine-tune, deploy, and run inference directly on data in your tables or views.

Training

Use pgml.train to train a model. This function runs on coordinator nodes only.

Syntax

CREATE FUNCTION pgml.train(
    "project_name" TEXT,
    "task" TEXT DEFAULT NULL,
    "relation_name" TEXT DEFAULT NULL,
    "y_column_name" TEXT DEFAULT NULL,
    "algorithm" pgml.Algorithm DEFAULT 'linear',
    "hyperparams" jsonb DEFAULT '{}',
    "search" pgml.Search DEFAULT NULL,
    "search_params" jsonb DEFAULT '{}',
    "search_args" jsonb DEFAULT '{}',
    "test_size" real DEFAULT 0.25,
    "test_sampling" pgml.Sampling DEFAULT 'stratified',
    "runtime" pgml.Runtime DEFAULT NULL,
    "automatic_deploy" bool DEFAULT true,
    "materialize_snapshot" bool DEFAULT false,
    "preprocess" jsonb DEFAULT '{}'
) RETURNS TABLE (
    "project" TEXT,
    "task" TEXT,
    "algorithm" TEXT,
    "deployed" bool
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'train_wrapper';

Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| project_name | The project name. | 'Breast Cancer Detection' |
| task | The task type. | 'classification' |
| relation_name | The training table or view. Use SCHEMA.TABLE format to reference tables outside the current schema. | 'pgml.breast_cancer' |
| y_column_name | The target column to predict. | 'malignant' |
| algorithm | The algorithm to use. See Supported algorithms. | 'xgboost' |
| hyperparams | Hyperparameters to pass to the algorithm, in JSON format. | {"n_estimators": 25, "nthread": 4} |
| search | The hyperparameter search method. See Hyperparameter search. | 'grid' |
| search_params | The hyperparameter ranges to search. See the documentation for scikit-learn or XGBoost for valid parameters. | {"max_depth": [1, 2, 3, 4]} |
| search_args | Supplementary parameters for the search method. See the documentation for scikit-learn or XGBoost for details. | {"n_iter": 10} |
| test_size | The fraction of data to hold out as a test set. Default: 0.25. | 0.25 |
| test_sampling | The method used to sample the test set. Default: stratified. | 'random' |
| runtime | The runtime environment. Valid values: python, rust. | 'python' |
| automatic_deploy | Whether to automatically deploy the best model after training. Default: true. | true |
| materialize_snapshot | Whether to materialize the training data snapshot. Default: false. | true |
| preprocess | Preprocessing steps applied to each column before training. See Preprocessing. | {"col_name": {"impute": "mean", "scale": "standard"}} |
Training uses a single core by default. To use multiple cores, pass the relevant hyperparameter for your algorithm, such as nthread for XGBoost.

To review trained models and their metrics after training, query pgml.trained_models.

Supported algorithms

The following tables list the valid values of the pgml.algorithm, pgml.sampling, and pgml.runtime enumeration types.

pgml.algorithm

| Algorithm | Description |
| --- | --- |
| linear | Linear model for regression. |
| xgboost | Optimized distributed gradient boosting. |
| xgboost_random_forest | XGBoost tuned for the random forest algorithm. |
| svm | Support vector machine (SVM) for classification and regression. |
| lasso | Lasso regression. |
| elastic_net | Elastic net regression (combines L1 and L2 regularization). |
| ridge | Ridge regression (L2 regularization). |
| kmeans | k-means clustering. |
| dbscan | Density-based clustering for clusters of various shapes. |
| knn | k-nearest neighbors (KNN) for regression. |
| random_forest | Ensemble learning for classification and regression. |
| least_angle | Least angle regression (LARS), compatible with lasso. |
| lasso_least_angle | Lasso least angle regression (combines LARS and lasso). |
| orthogonal_matching_pursuit | Greedy algorithm for sparse signal recovery. |
| bayesian_ridge | Bayesian ridge regression. |
| automatic_relevance_determination | Bayesian regression. |
| stochastic_gradient_descent | Stochastic gradient descent for regression and classification. |
| perceptron | Linear model for binary classification. |
| passive_aggressive | Online learning for large-scale tasks. |
| ransac | Random sample consensus for robust linear regression. |
| theil_sen | Theil-Sen estimator for robust linear regression. |
| huber | Huber regression, robust to outliers. |
| quantile | Quantile regression for predicting conditional quantiles. |
| kernel_ridge | Kernel ridge regression (KRR) for nonlinear relationships. |
| gaussian_process | Gaussian process for probabilistic regression and classification. |
| nu_svm | SVM variant for classification and regression. |
| ada_boost | AdaBoost for classification. |
| bagging | Bagging ensemble to reduce overfitting. |
| extra_trees | Extra trees, an extension of random forest. |
| gradient_boosting_trees | Gradient boosting for regression and classification. |
| hist_gradient_boosting | Histogram-based gradient boosting for large datasets. |
| linear_svm | Linear SVM for classification. |
| lightgbm | LightGBM gradient boosting framework. |
| affinity_propagation | Message-passing clustering based on data similarity. |
| birch | Clustering for large datasets. |
| feature_agglomeration | Clustering-based feature selection. |
| mini_batch_kmeans | Mini-batch variant of k-means for large or online datasets. |
| mean_shift | Centroid-based clustering for any cluster shape. |
| optics | Clustering for datasets with varying densities. |
| spectral | Spectral clustering. |
| spectral_bi | Spectral bi-clustering across two dimensions (rows and columns). |
| spectral_co | Spectral co-clustering across two dimensions. |
| catboost | CatBoost gradient boosting for classification and regression. |
| pca | Principal component analysis (PCA) for dimensionality reduction. |

pgml.sampling

| Sampling method | Description |
| --- | --- |
| random | Random sampling. |
| last | Sequential scan from the end of the dataset. |
| stratified | Stratified sampling (default). |

pgml.runtime

| Runtime | Description |
| --- | --- |
| python | Python runtime. |
| rust | Rust runtime. |

Hyperparameter search

Use search, search_params, and search_args together to find the optimal hyperparameters for your model.

| Parameter | What it controls | Example |
| --- | --- | --- |
| search | Search method: grid (exhaustive) or random (sampled) | 'grid' |
| search_params | Hyperparameter ranges to evaluate | {"max_depth": [1, 2, 3], "n_estimators": [20, 80]} |
| search_args | Number of iterations (n_iter) or cross-validation folds (cv) | {"cv": 3} |

  • grid: Trains every combination in search_params using the Cartesian product.

  • random: Randomly samples combinations from search_params for up to n_iter iterations.

For regression tasks, the R2 score selects the optimal hyperparameter set. For classification tasks, the F1 score is used.
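
As an illustration, the two selection metrics can be written out in plain Python. This is a sketch of the standard F1 and R2 formulas, not the extension's internal code:

```python
# F1 (classification) and R2 (regression): the metrics used to pick
# the best hyperparameter set. Plain-Python versions of the standard
# formulas, shown for illustration only.

def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def r2_score(y_true, y_pred):
    """R2: 1 minus residual sum of squares over total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

print(f1_score([1, 0, 1, 1], [1, 0, 0, 1]))        # 0.8
print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```

Both metrics are "higher is better", which is why a single argmax over candidate models suffices for both task types.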

Example: Grid search with 3-fold cross-validation

The following query trains an XGBoost classifier with grid search across max_depth and n_estimators. The total number of combinations evaluated is len(max_depth) × len(n_estimators) × cv = 3 × 3 × 3 = 27.

SELECT * FROM pgml.train(
    'Handwritten Digit Image Classifier',
    'classification',
    'pgml.digits',
    'target',
    algorithm => 'xgboost',
    search => 'grid',
    search_params => '{
        "max_depth": [1, 3, 5],
        "n_estimators": [20, 80, 160]
    }',
    search_args => '{"cv": 3}'
);
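
The 27-fit count in this example can be reproduced with a short sketch (plain Python standing in for the search the extension performs):

```python
from itertools import product

# Grid search trains one model per point in the Cartesian product of
# search_params, and cross-validation repeats each fit once per fold.
search_params = {"max_depth": [1, 3, 5], "n_estimators": [20, 80, 160]}
cv = 3

grid = list(product(*search_params.values()))
print(len(grid))       # 9 hyperparameter combinations
print(len(grid) * cv)  # 27 total model fits
```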

Preprocessing

Pass a preprocess JSON object in pgml.train to configure preprocessing for each column. Preprocessing settings are saved in pgml.snapshots after training and automatically applied to new data during inference; no extra configuration is needed at inference time.

Enumeration type encoding

Converts categorical columns to numeric values.

| Method | Description |
| --- | --- |
| native | Converts non-text types to 32-bit float (default). |
| target | Replaces each category with the mean of the target variable for that category. |
| one_hot | One-hot encoding. |
| ordinal | Assigns each category a unique integer based on a predefined ordinal array. NULL values are assigned index 0 by default. |

Missing value imputation

Replaces NULL or NaN values with a computed substitute.

| Method | Description |
| --- | --- |
| error | Returns an error if NULL or NaN values are detected (default). |
| mean | Replaces with the training data mean. |
| median | Replaces with the training data median. |
| mode | Replaces with the most frequent value in the training data. |
| min | Replaces with the training data minimum. |
| max | Replaces with the training data maximum. |
| zero | Replaces with 0. |
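
An illustrative plain-Python sketch of these methods (None stands in for NULL/NaN; the actual extension computes the statistic over the training data and reuses it at inference time):

```python
# Each method replaces missing values with a statistic computed
# over the non-missing values.
def impute(values, method):
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no non-missing values")
    if method == "mean":
        fill = sum(present) / len(present)
    elif method == "median":
        s = sorted(present)
        mid = len(s) // 2
        fill = s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2
    elif method == "mode":
        fill = max(set(present), key=present.count)
    elif method == "min":
        fill = min(present)
    elif method == "max":
        fill = max(present)
    elif method == "zero":
        fill = 0
    else:  # "error", the default: refuse to impute
        raise ValueError("NULL or NaN values detected")
    return [fill if v is None else v for v in values]

print(impute([1.0, None, 3.0], "mean"))  # [1.0, 2.0, 3.0]
```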

Scaling

Scales features to a standard range to improve model generalization.

| Method | Description |
| --- | --- |
| preserve | No scaling (default). |
| standard | Standardizes to zero mean and unit variance. |
| min-max | Scales to a specified range, typically 0–1. |
| max-abs | Divides by the maximum absolute value, so the max absolute value in the training set becomes 1. |
| robust | Scales based on the interquartile range (Q1–Q3), more robust to outliers. |
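
The scaling methods can be sketched in plain Python. This is an illustration only; the extension computes these statistics on the training data, stores them in the snapshot, and reapplies them at inference time. The quartiles in the robust branch use a crude index rule for brevity:

```python
# Illustrative sketch of the scaling methods on a list of floats.
def scale(values, method):
    if method == "preserve":
        return list(values)
    if method == "standard":
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return [(v - mean) / var ** 0.5 for v in values]
    if method == "min-max":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    if method == "max-abs":
        m = max(abs(v) for v in values)
        return [v / m for v in values]
    if method == "robust":
        s = sorted(values)
        q1, med, q3 = s[len(s) // 4], s[len(s) // 2], s[(3 * len(s)) // 4]
        return [(v - med) / (q3 - q1) for v in values]
    raise ValueError(method)

print(scale([0.0, 5.0, 10.0], "min-max"))  # [0.0, 0.5, 1.0]
```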

Deployment

After training, project metadata, preprocessing settings, and model parameters are stored in internal tables. At inference time, reference the project by name or model ID; you do not need to pass model details again.

automatic_deploy (default true) automatically deploys the best-performing model after each training run. For classification, the model with the highest F1 score is deployed. For regression, the model with the highest R2 score is deployed.

To deploy a model manually, use one of the following pgml.deploy overloads.

Deploy by model ID

CREATE FUNCTION pgml."deploy"(
    "model_id" bigint
) RETURNS TABLE (
    "project" TEXT,
    "strategy" TEXT,
    "algorithm" TEXT
)
STRICT EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'deploy_model_wrapper';

Deploy by strategy

CREATE FUNCTION pgml."deploy"(
    "project_name" TEXT,
    "strategy" pgml.Strategy,
    "algorithm" pgml.Algorithm DEFAULT NULL
) RETURNS TABLE (
    "project" TEXT,
    "strategy" TEXT,
    "algorithm" TEXT
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'deploy_strategy_wrapper';

Deployment strategies (pgml.strategy)

| Strategy | Description |
| --- | --- |
| best_score | Deploys the model with the best evaluation metrics. |
| most_recent | Deploys the most recently trained model, regardless of metrics. |
| rollback | Rolls back to the previous deployment. |

Inference

Single-row inference with predict

pgml.predict returns a FLOAT4 value: the predicted class for classification tasks, or the predicted numeric value for regression tasks. The model resolved by project name is always the most recently deployed model for that project.

Call pgml.predict directly inside a SELECT statement:

-- Predict using project name (resolves the most recently deployed model)
SELECT target_column, pgml.predict('My Project', row_data) AS prediction
FROM my_table
LIMIT 10;

-- Predict using a specific model ID
SELECT target_column, pgml.predict(model_id, row_data) AS prediction
FROM my_table
LIMIT 10;

pgml.predict accepts the following input types:

| Function signature | Input type |
| --- | --- |
| pgml.predict(project_name TEXT, row anyelement) | Table row |
| pgml.predict(model_id bigint, row anyelement) | Table row, by model ID |
| pgml.predict(project_name TEXT, features real[]) | FLOAT4 array |
| pgml.predict(model_id bigint, features real[]) | FLOAT4 array, by model ID |
| pgml.predict(project_name TEXT, features bigint[]) | BIGINT array |
| pgml.predict(project_name TEXT, features INT[]) | INT array |
| pgml.predict(project_name TEXT, features smallint[]) | SMALLINT array |
| pgml.predict(project_name TEXT, features double precision[]) | FLOAT8 array |
| pgml.predict(project_name TEXT, features bool[]) | BOOL array |

Batch inference

For batch inference on multiple rows, use predict_batch, predict_proba, or predict_joint. All batch functions accept a flattened one-dimensional FLOAT4 array as input and return a FLOAT4 array.
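
The flattened layout means rows are concatenated in order, and the feature count of the trained model determines where each row starts. A plain-Python sketch of that layout:

```python
# Batch functions take one flat array; with n_features columns per row,
# row i occupies elements [i * n_features, (i + 1) * n_features).
def flatten(rows):
    return [x for row in rows for x in row]

def unflatten(flat, n_features):
    assert len(flat) % n_features == 0
    return [flat[i:i + n_features] for i in range(0, len(flat), n_features)]

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat = flatten(rows)
print(flat)               # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(unflatten(flat, 2))  # back to the original three rows
```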

predict_batch

Use predict_batch for standard batch inference.

-- By project name
CREATE FUNCTION pgml."predict_batch"(
    "project_name" TEXT,
    "features" real[]
) RETURNS SETOF real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_batch_wrapper';

-- By model ID
CREATE FUNCTION pgml."predict_batch"(
    "model_id" bigint,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_batch_wrapper';

predict_proba

Use predict_proba when you need the confidence probability for each class, not just the predicted label. It takes the same inputs as predict_batch but returns the probability of each class.

-- By project name
CREATE FUNCTION pgml."predict_proba"(
    "project_name" TEXT,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_proba_wrapper';

-- By model ID
CREATE FUNCTION pgml."predict_proba"(
    "model_id" bigint,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_proba_wrapper';
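
Because predict_proba returns one probability per class, recovering the predicted label is an argmax over each row's class probabilities. A plain-Python sketch, assuming (as an illustration) a row-major flat array with n_classes values per row:

```python
# Split the flat probability array into rows of n_classes values,
# then take the index of the largest probability in each row.
def labels_from_proba(flat_proba, n_classes):
    rows = [flat_proba[i:i + n_classes]
            for i in range(0, len(flat_proba), n_classes)]
    return [max(range(n_classes), key=row.__getitem__) for row in rows]

# Two rows, three classes each.
print(labels_from_proba([0.1, 0.7, 0.2, 0.8, 0.1, 0.1], 3))  # [1, 0]
```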

predict_joint

Use predict_joint for multi-label regression tasks. This function returns multi-dimensional predictions for each input row. Classification tasks are not supported.

-- By project name
CREATE FUNCTION pgml."predict_joint"(
    "project_name" TEXT,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_joint_wrapper';

-- By model ID
CREATE FUNCTION pgml."predict_joint"(
    "model_id" bigint,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_joint_wrapper';

Examples

The following examples show end-to-end workflows for classification, regression, clustering, dimensionality reduction, and preprocessing. Each example follows the same pattern: load data, train a baseline model, run predictions, then experiment with additional algorithms and deployment strategies.

Classification

This example trains a cancer detection model on the pgml.breast_cancer dataset, runs predictions, and compares multiple algorithms.

\timing on

-- Step 1: Load the dataset
SELECT pgml.load_dataset('breast_cancer');

-- Inspect the data
SELECT * FROM pgml.breast_cancer LIMIT 10;

-- Step 2: Train a baseline model
SELECT * FROM pgml.train('Breast Cancer Detection', 'classification', 'pgml.breast_cancer', 'malignant');

-- Step 3: Run predictions using the deployed model
SELECT malignant, pgml.predict(
    'Breast Cancer Detection',
    (
        "mean radius", "mean texture", "mean perimeter", "mean area",
        "mean smoothness", "mean compactness", "mean concavity",
        "mean concave points", "mean symmetry", "mean fractal dimension",
        "radius error", "texture error", "perimeter error", "area error",
        "smoothness error", "compactness error", "concavity error",
        "concave points error", "symmetry error", "fractal dimension error",
        "worst radius", "worst texture", "worst perimeter", "worst area",
        "worst smoothness", "worst compactness", "worst concavity",
        "worst concave points", "worst symmetry", "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

-- Get class probabilities instead of predicted labels
SELECT malignant, pgml.predict_proba(
    'Breast Cancer Detection',
    ARRAY[
        "mean radius", "mean texture", "mean perimeter", "mean area",
        "mean smoothness", "mean compactness", "mean concavity",
        "mean concave points", "mean symmetry", "mean fractal dimension",
        "radius error", "texture error", "perimeter error", "area error",
        "smoothness error", "compactness error", "concavity error",
        "concave points error", "symmetry error", "fractal dimension error",
        "worst radius", "worst texture", "worst perimeter", "worst area",
        "worst smoothness", "worst compactness", "worst concavity",
        "worst concave points", "worst symmetry", "worst fractal dimension"
    ]
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

-- After initial training, omitted parameters are reused from the previous run.
-- The following calls reuse the training data snapshot from above.

-- Linear models
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ridge');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'perceptron');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'passive_aggressive');

-- Support vector machines
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'nu_svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'linear_svm');

-- Ensembles
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ada_boost');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'bagging');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'random_forest', hyperparams => '{"n_estimators": 10}');

-- Gradient boosting
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');

-- Review the top models by F1 score
SELECT trained_models.* FROM pgml.trained_models
JOIN pgml.models ON models.id = trained_models.id
ORDER BY (models.metrics->>'f1')::float DESC LIMIT 5;

-- Deploy a specific algorithm
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent', 'random_forest');

-- View deployed models
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- Run hyperparameter search on XGBoost
SELECT pgml.train(
    'Breast Cancer Detection',
    algorithm => 'xgboost',
    search => 'grid',
    search_params => '{
        "n_estimators": [2, 4],
        "max_depth": [1, 2, 3]
    }'
);

-- Deploy using different strategies
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'rollback');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'svm');

-- Check predictions after redeployment
SELECT malignant, pgml.predict(
    'Breast Cancer Detection',
    (
        "mean radius", "mean texture", "mean perimeter", "mean area",
        "mean smoothness", "mean compactness", "mean concavity",
        "mean concave points", "mean symmetry", "mean fractal dimension",
        "radius error", "texture error", "perimeter error", "area error",
        "smoothness error", "compactness error", "concavity error",
        "concave points error", "symmetry error", "fractal dimension error",
        "worst radius", "worst texture", "worst perimeter", "worst area",
        "worst smoothness", "worst compactness", "worst concavity",
        "worst concave points", "worst symmetry", "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

Regression

This example trains a diabetes progression model and compares multiple algorithms.

-- Step 1: Load the dataset
SELECT pgml.load_dataset('diabetes');

-- Inspect the data
SELECT * FROM pgml.diabetes LIMIT 10;

-- Step 2: Train a baseline model
SELECT * FROM pgml.train('Diabetes Progression', 'regression', 'pgml.diabetes', 'target');

-- Step 3: Run predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
LIMIT 10;

-- Check predictions for a specific model ID
SELECT model_id, target, pgml.predict(model_id, (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
CROSS JOIN LATERAL (
    SELECT pgml.models.id AS model_id FROM pgml.models
    INNER JOIN pgml.projects
    ON pgml.models.project_id = pgml.projects.id
    WHERE pgml.projects.name = 'Diabetes Progression'
    LIMIT 1
) models
LIMIT 10;

-- After initial training, omitted parameters are reused from the previous run.

-- Linear models
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'elastic_net');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso_least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'orthogonal_matching_pursuit');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bayesian_ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'automatic_relevance_determination');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'passive_aggressive');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ransac');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'theil_sen', hyperparams => '{"max_iter": 10, "max_subpopulation": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'huber');

-- Support vector machines
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'svm', hyperparams => '{"max_iter": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'nu_svm', hyperparams => '{"max_iter": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hyperparams => '{"max_iter": 100}');

-- Ensembles
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ada_boost', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bagging', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'random_forest', hyperparams => '{"n_estimators": 5}');

-- Gradient boosting
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'catboost', hyperparams => '{"n_estimators": 10}');

-- Compare runtimes
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'python');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'rust');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'rust', hyperparams => '{"n_estimators": 10}');

-- Review the top models by mean squared error (lower is better)
SELECT trained_models.* FROM pgml.trained_models
JOIN pgml.models ON models.id = trained_models.id
ORDER BY (models.metrics->>'mean_squared_error')::float ASC LIMIT 5;

-- Deploy and check
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent', 'random_forest');
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- Hyperparameter search
SELECT pgml.train(
    'Diabetes Progression',
    algorithm => 'xgboost',
    hyperparams => '{"eval_metric": "rmse"}'::JSONB,
    search => 'grid',
    search_params => '{
        "max_depth": [1, 2],
        "n_estimators": [20, 40]
    }'
);

-- Deploy using different strategies
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score');
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent');
SELECT * FROM pgml.deploy('Diabetes Progression', 'rollback');
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score', 'svm');

-- Verify predictions after redeployment
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
LIMIT 10;

Clustering

This example clusters handwritten digit images using the pgml.digits dataset. It demonstrates using a single array feature column for unsupervised learning.

The pgml.digits dataset is derived from the UCI ML hand-written digits dataset.

\timing on

-- Step 1: Load the dataset
SELECT pgml.load_dataset('digits');

-- Create an unlabeled view for unsupervised learning
CREATE VIEW pgml.digit_vectors AS
SELECT image FROM pgml.digits;

-- Inspect the data
SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;

-- Step 2: Train a k-means clustering model
SELECT * FROM pgml.train('Handwritten Digit Clusters', 'clustering', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');

-- Step 3: Check cluster assignments
SELECT target, pgml.predict('Handwritten Digit Clusters', image) AS prediction
FROM pgml.digits
LIMIT 10;

-- Try other clustering algorithms
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'affinity_propagation');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'birch', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'kmeans', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'mini_batch_kmeans', hyperparams => '{"n_clusters": 10}');

Dimensionality reduction

This example reduces 64-dimensional digit image vectors using principal component analysis (PCA).

\timing on

-- Step 1: Load the dataset
SELECT pgml.load_dataset('digits');

-- Inspect the data
SELECT left(image::text, 40) || ',...}', target FROM pgml.digits LIMIT 10;

-- Create a view of image vectors without labels
CREATE VIEW digit_vectors AS
SELECT image FROM pgml.digits;

-- Step 2: Train a PCA decomposition model
SELECT * FROM pgml.train('Handwritten Digits Reduction', 'decomposition', 'digit_vectors');

-- Step 3: View the decomposed vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

-- Reduce to 3 components
-- After initial training, omitted parameters are reused from the previous run.
SELECT * FROM pgml.train('Handwritten Digits Reduction', hyperparams => '{"n_components": 3}');

-- View the reduced vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

-- Review models by cumulative explained variance
SELECT trained_models.* FROM pgml.trained_models
JOIN pgml.models ON models.id = trained_models.id
ORDER BY (models.metrics->>'cumulative_explained_variance')::float DESC LIMIT 5;

-- Deploy the PCA model
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent', 'pca');
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- Switch deployment strategy
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'rollback');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score', 'pca');

-- Run predictions after redeployment
SELECT target, pgml.predict('Handwritten Digits Reduction', image) AS prediction
FROM pgml.digits
LIMIT 10;

Preprocessing

This example uses the diamonds dataset to demonstrate encoding categorical variables and scaling numeric features before training.

-- Step 1: Load the diamonds dataset (contains categorical columns)
SELECT pgml.load_dataset('jdxcosta/diamonds');

-- Inspect the data
SELECT * FROM pgml."jdxcosta/diamonds" LIMIT 10;

-- Remove the unnamed index column
ALTER TABLE pgml."jdxcosta/diamonds" DROP COLUMN "Unnamed: 0";

-- Step 2: Train with preprocessing: standard scaling for numeric columns, target encoding for categoricals
SELECT pgml.train(
    project_name => 'Diamond prices',
    task => 'regression',
    relation_name => 'pgml.jdxcosta/diamonds',
    y_column_name => 'price',
    algorithm => 'lightgbm',
    preprocess => '{
        "carat":   {"scale": "standard"},
        "depth":   {"scale": "standard"},
        "table":   {"scale": "standard"},
        "cut":     {"encode": "target", "scale": "standard"},
        "color":   {"encode": "target", "scale": "standard"},
        "clarity": {"encode": "target", "scale": "standard"}
    }'
);

-- Step 3: Run predictions
-- Pass a heterogeneous row (tuple) instead of a homogeneous ARRAY[]
SELECT price, pgml.predict('Diamond prices', (carat, cut, color, clarity, depth, "table", x, y, z)) AS prediction
FROM pgml."jdxcosta/diamonds"
LIMIT 10;