AnalyticDB for PostgreSQL V7.0 includes an in-database AI/ML feature that integrates Hugging Face models for AI tasks, and XGBoost, LightGBM, CatBoost, Linfa, and all scikit-learn algorithms for machine learning. You can train, fine-tune, deploy, and run inference directly on data in your tables or views.
Training
Use pgml.train to train a model. This function runs on coordinator nodes only.
Syntax
```sql
CREATE FUNCTION pgml.train(
    "project_name" TEXT,
    "task" TEXT DEFAULT NULL,
    "relation_name" TEXT DEFAULT NULL,
    "y_column_name" TEXT DEFAULT NULL,
    "algorithm" pgml.Algorithm DEFAULT 'linear',
    "hyperparams" jsonb DEFAULT '{}',
    "search" pgml.Search DEFAULT NULL,
    "search_params" jsonb DEFAULT '{}',
    "search_args" jsonb DEFAULT '{}',
    "test_size" real DEFAULT 0.25,
    "test_sampling" pgml.Sampling DEFAULT 'stratified',
    "runtime" pgml.Runtime DEFAULT NULL,
    "automatic_deploy" bool DEFAULT true,
    "materialize_snapshot" bool DEFAULT false,
    "preprocess" jsonb DEFAULT '{}'
) RETURNS TABLE (
    "project" TEXT,
    "task" TEXT,
    "algorithm" TEXT,
    "deployed" bool
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'train_wrapper';
```
Parameters
| Parameter | Description | Example |
|---|---|---|
| project_name | The project name. | 'Breast Cancer Detection' |
| task | The task type. | 'classification' |
| relation_name | The training table or view. Use SCHEMA.TABLE format to reference tables outside the current schema. | 'pgml.breast_cancer' |
| y_column_name | The target column to predict. | 'malignant' |
| algorithm | The algorithm to use. See Supported algorithms. | 'xgboost' |
| hyperparams | Hyperparameters to pass to the algorithm, in JSON format. | {"n_estimators": 25, "nthread": 4} |
| search | The hyperparameter search method. See Hyperparameter search. | 'grid' |
| search_params | The hyperparameter ranges to search. See the documentation for scikit-learn or XGBoost for valid parameters. | {"max_depth": [1, 2, 3, 4]} |
| search_args | Supplementary parameters for the search method. See the documentation for scikit-learn or XGBoost for details. | {"n_iter": 10} |
| test_size | The fraction of data to hold out as a test set. Default: 0.25. | 0.25 |
| test_sampling | The method used to sample the test set. Default: stratified. | 'random' |
| runtime | The runtime environment. Valid values: python, rust. | 'python' |
| automatic_deploy | Whether to automatically deploy the best model after training. Default: true. | true |
| materialize_snapshot | Whether to materialize the training data snapshot. Default: false. | true |
| preprocess | Preprocessing steps applied to each column before training. See Preprocessing. | {"col_name": {"impute": "mean", "scale": "standard"}} |
Training uses a single core by default. To use multiple cores, pass the relevant hyperparameter for your algorithm — for example, nthread for XGBoost.
To review trained models and their metrics after training, query pgml.trained_models.
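For example, the following sketch trains a classifier and then inspects the training record. The project, table, and column names are illustrative, and the exact columns returned by pgml.trained_models may vary by version:

```sql
-- Train an XGBoost classifier (names are illustrative)
SELECT * FROM pgml.train(
    'Breast Cancer Detection',
    'classification',
    'pgml.breast_cancer',
    'malignant',
    algorithm => 'xgboost',
    hyperparams => '{"n_estimators": 25, "nthread": 4}'
);

-- Review trained models and their metrics
SELECT * FROM pgml.trained_models;
```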
Supported algorithms
The following tables list the valid values of the pgml.algorithm, pgml.sampling, and pgml.runtime enumeration types.
pgml.algorithm
| Algorithm | Description |
|---|---|
| linear | Linear model for regression. |
| xgboost | Optimized distributed gradient boosting. |
| xgboost_random_forest | XGBoost tuned for the random forest algorithm. |
| svm | Support vector machine (SVM) for classification and regression. |
| lasso | Lasso regression. |
| elastic_net | Elastic net regression (combines L1 and L2 regularization). |
| ridge | Ridge regression (L2 regularization). |
| kmeans | k-means clustering. |
| dbscan | Density-based clustering for clusters of various shapes. |
| knn | k-nearest neighbors (KNN) for regression. |
| random_forest | Ensemble learning for classification and regression. |
| least_angle | Least angle regression (LARS), compatible with lasso. |
| lasso_least_angle | Lasso least angle regression (combines LARS and lasso). |
| orthogonal_matching_pursuit | Greedy algorithm for sparse signal recovery. |
| bayesian_ridge | Bayesian ridge regression. |
| automatic_relevance_determination | Bayesian regression. |
| stochastic_gradient_descent | Stochastic gradient descent for regression and classification. |
| perceptron | Linear model for binary classification. |
| passive_aggressive | Online learning for large-scale tasks. |
| ransac | Random sample consensus for robust linear regression. |
| theil_sen | Theil-Sen estimator for robust linear regression. |
| huber | Huber regression, robust to outliers. |
| quantile | Quantile regression for predicting conditional quantiles. |
| kernel_ridge | Kernel ridge regression (KRR) for nonlinear relationships. |
| gaussian_process | Gaussian process for probabilistic regression and classification. |
| nu_svm | SVM variant for classification and regression. |
| ada_boost | AdaBoost for classification. |
| bagging | Bagging ensemble to reduce overfitting. |
| extra_trees | Extra trees, an extension of random forest. |
| gradient_boosting_trees | Gradient boosting for regression and classification. |
| hist_gradient_boosting | Histogram-based gradient boosting for large datasets. |
| linear_svm | Linear SVM for classification. |
| lightgbm | LightGBM gradient boosting framework. |
| affinity_propagation | Message-passing clustering based on data similarity. |
| birch | Clustering for large datasets. |
| feature_agglomeration | Clustering-based feature selection. |
| mini_batch_kmeans | Mini-batch variant of k-means for large or online datasets. |
| mean_shift | Centroid-based clustering for any cluster shape. |
| optics | Clustering for datasets with varying densities. |
| spectral | Spectral clustering. |
| spectral_bi | Spectral bi-clustering across two dimensions (rows and columns). |
| spectral_co | Spectral co-clustering across two dimensions. |
| catboost | CatBoost gradient boosting for classification and regression. |
| pca | Principal component analysis (PCA) for dimensionality reduction. |
pgml.sampling
| Sampling method | Description |
|---|---|
| random | Random sampling. |
| last | Sequential scan from the end of the dataset. |
| stratified | Stratified sampling (default). |
pgml.runtime
| Runtime | Description |
|---|---|
| python | Python runtime. |
| rust | Rust runtime. |
Hyperparameter search
Use search, search_params, and search_args together to find the optimal hyperparameters for your model.
| Parameter | What it controls | Example |
|---|---|---|
| search | Search method: grid (exhaustive) or random (sampled). | 'grid' |
| search_params | Hyperparameter ranges to evaluate. | {"max_depth": [1, 2, 3], "n_estimators": [20, 80]} |
| search_args | Number of iterations (n_iter) or cross-validation folds (cv). | {"cv": 3} |

- grid: Trains every combination in search_params using the Cartesian product.
- random: Randomly samples combinations from search_params for up to n_iter iterations.
For regression tasks, the optimal hyperparameter set is selected by R2 score; for classification tasks, by F1 score.
Example: Grid search with 3-fold cross-validation
The following query trains an XGBoost classifier with grid search across max_depth and n_estimators. The total number of combinations evaluated is len(max_depth) × len(n_estimators) × cv = 3 × 3 × 3 = 27.
```sql
SELECT * FROM pgml.train(
    'Handwritten Digit Image Classifier',
    'classification',
    'pgml.digits',
    'target',
    algorithm => 'xgboost',
    search => 'grid',
    search_params => '{
        "max_depth": [1, 3, 5],
        "n_estimators": [20, 80, 160]
    }',
    search_args => '{"cv": 3}'
);
```
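A random search can be sketched the same way: instead of evaluating all 9 combinations, it samples up to n_iter of them. The n_iter value below is an illustrative choice:

```sql
SELECT * FROM pgml.train(
    'Handwritten Digit Image Classifier',
    'classification',
    'pgml.digits',
    'target',
    algorithm => 'xgboost',
    search => 'random',
    search_params => '{
        "max_depth": [1, 3, 5],
        "n_estimators": [20, 80, 160]
    }',
    search_args => '{"n_iter": 5}'
);
```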
Preprocessing
Pass a preprocess JSON object in pgml.train to configure preprocessing for each column. Preprocessing settings are saved in pgml.snapshots after training and automatically applied to new data during inference — no extra configuration needed at inference time.
Enumeration type encoding
Converts categorical columns to numeric values.
| Method | Description |
|---|---|
| native | Converts non-text types to 32-bit float (default). |
| target | Replaces each category with the mean of the target variable for that category. |
| one_hot | One-hot encoding. |
| ordinal | Assigns each category a unique integer based on a predefined ordinal array. NULL values are assigned index 0 by default. |
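As an illustration, the encodings above could be combined in one preprocess object. All project, table, and column names here are hypothetical, and the nested JSON shape shown for the ordinal array is an assumption — verify the accepted form in your version:

```sql
SELECT * FROM pgml.train(
    'Encoding Demo',                 -- hypothetical project
    'regression',
    'pgml.my_table',                 -- hypothetical relation
    'label',
    preprocess => '{
        "city":  {"encode": "one_hot"},
        "grade": {"encode": {"ordinal": ["low", "medium", "high"]}},
        "brand": {"encode": "target"}
    }'
);
```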
Missing value imputation
Replaces NULL or NaN values with a computed substitute.
| Method | Description |
|---|---|
| error | Returns an error if NULL or NaN values are detected (default). |
| mean | Replaces with the training data mean. |
| median | Replaces with the training data median. |
| mode | Replaces with the most frequent value in the training data. |
| min | Replaces with the training data minimum. |
| max | Replaces with the training data maximum. |
| zero | Replaces with 0. |
Scaling
Scales features to a standard range to improve model generalization.
| Method | Description |
|---|---|
| preserve | No scaling (default). |
| standard | Standardizes to zero mean and unit variance. |
| min-max | Scales to a specified range, typically 0–1. |
| max-abs | Divides by the maximum absolute value, so the max absolute value in the training set becomes 1. |
| robust | Scales based on the interquartile range (Q1–Q3), more robust to outliers. |
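For instance, imputation and scaling can be configured per column in one preprocess object. The project, table, and column names below are illustrative:

```sql
SELECT * FROM pgml.train(
    'Preprocessing Demo',            -- hypothetical project
    'regression',
    'pgml.sensor_readings',          -- hypothetical relation
    'target',
    preprocess => '{
        "temperature": {"impute": "mean",   "scale": "standard"},
        "humidity":    {"impute": "median", "scale": "min-max"},
        "pressure":    {"impute": "zero",   "scale": "robust"}
    }'
);
```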
Deployment
After training, project metadata, preprocessing settings, and model parameters are stored in internal tables. At inference time, reference the project by name or ID — no need to pass model details again.
automatic_deploy (default true) automatically deploys the best-performing model after each training run. For classification, the model with the highest F1 score is deployed. For regression, the model with the highest R2 score is deployed.
To deploy a model manually, use one of the following pgml.deploy overloads.
Deploy by model ID
```sql
CREATE FUNCTION pgml."deploy"(
    "model_id" bigint
) RETURNS TABLE (
    "project" TEXT,
    "strategy" TEXT,
    "algorithm" TEXT
)
STRICT EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'deploy_model_wrapper';
```
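For example, to deploy a specific model by ID (the ID below is illustrative; look up real IDs in pgml.trained_models first):

```sql
SELECT * FROM pgml.deploy(42);
```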
Deploy by strategy
```sql
CREATE FUNCTION pgml."deploy"(
    "project_name" TEXT,
    "strategy" pgml.Strategy,
    "algorithm" pgml.Algorithm DEFAULT NULL
) RETURNS TABLE (
    "project" TEXT,
    "strategy" TEXT,
    "algorithm" TEXT
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'deploy_strategy_wrapper';
```
Deployment strategies (pgml.strategy)
| Strategy | Description |
|---|---|
| best_score | Deploys the model with the best evaluation metrics. |
| most_recent | Deploys the most recently trained model, regardless of metrics. |
| rollback | Rolls back to the previous deployment. |
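For example, deploying by strategy and rolling back (the project name is illustrative):

```sql
-- Deploy the best-scoring xgboost model for the project
SELECT * FROM pgml.deploy('Handwritten Digit Image Classifier', 'best_score', 'xgboost');

-- Roll back to the previous deployment
SELECT * FROM pgml.deploy('Handwritten Digit Image Classifier', 'rollback');
```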
Inference
Single-row inference with predict
pgml.predict returns a FLOAT4 value: the predicted class for classification tasks, or the predicted numeric value for regression tasks. The model resolved by project name is always the most recently deployed model for that project.
Call pgml.predict directly inside a SELECT statement:
```sql
-- Predict using project name (resolves the most recently deployed model)
SELECT target_column, pgml.predict('My Project', row_data) AS prediction
FROM my_table
LIMIT 10;

-- Predict using a specific model ID
SELECT target_column, pgml.predict(model_id, row_data) AS prediction
FROM my_table
LIMIT 10;
```
pgml.predict accepts the following input types:
| Function signature | Input type |
|---|---|
| pgml.predict(project_name TEXT, row anyelement) | Table row |
| pgml.predict(model_id bigint, row anyelement) | Table row, by model ID |
| pgml.predict(project_name TEXT, features real[]) | FLOAT4 array |
| pgml.predict(model_id bigint, features real[]) | FLOAT4 array, by model ID |
| pgml.predict(project_name TEXT, features bigint[]) | BIGINT array |
| pgml.predict(project_name TEXT, features INT[]) | INT array |
| pgml.predict(project_name TEXT, features smallint[]) | SMALLINT array |
| pgml.predict(project_name TEXT, features double precision[]) | FLOAT8 array |
| pgml.predict(project_name TEXT, features bool[]) | BOOL array |
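For example, the array overloads accept raw feature values directly. The project name and feature values below are illustrative; the array must match the order and number of feature columns the model was trained on:

```sql
-- FLOAT4 array by project name
SELECT pgml.predict('My Project', ARRAY[0.1, 2.0, 5.0]::real[]);

-- INT array (implicit integer array type)
SELECT pgml.predict('My Project', ARRAY[1, 2, 3]);
```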
Batch inference
For batch inference on multiple rows, use predict_batch, predict_proba, or predict_joint. All batch functions take a flattened one-dimensional FLOAT4 array as input and return FLOAT4 predictions (as a set or an array, depending on the overload).
predict_batch
Use predict_batch for standard batch inference.
```sql
-- By project name
CREATE FUNCTION pgml."predict_batch"(
    "project_name" TEXT,
    "features" real[]
) RETURNS SETOF real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_batch_wrapper';

-- By model ID
CREATE FUNCTION pgml."predict_batch"(
    "model_id" bigint,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_batch_wrapper';
```
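A usage sketch, assuming a model trained on two features: the four values below flatten two rows of two features each, and one prediction is returned per row. The project name is illustrative:

```sql
SELECT pgml.predict_batch(
    'My Project',
    ARRAY[5.1, 3.5,          -- row 1
          4.9, 3.0]::real[]  -- row 2
);
```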
predict_proba
Use predict_proba when you need the confidence probability for each class, not just the predicted label. It takes the same inputs as predict_batch but returns the probability of each class.
```sql
-- By project name
CREATE FUNCTION pgml."predict_proba"(
    "project_name" TEXT,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_proba_wrapper';

-- By model ID
CREATE FUNCTION pgml."predict_proba"(
    "model_id" bigint,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_proba_wrapper';
```
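A sketch, assuming a classifier trained on two features: the returned array holds one probability per class for each input row. The project name and values are illustrative:

```sql
SELECT pgml.predict_proba(
    'My Project',
    ARRAY[5.1, 3.5]::real[]  -- one row, two features
);
```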
predict_joint
Use predict_joint for multi-label regression tasks. This function returns multi-dimensional predictions for each input row. Classification tasks are not supported.
```sql
-- By project name
CREATE FUNCTION pgml."predict_joint"(
    "project_name" TEXT,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_joint_wrapper';

-- By model ID
CREATE FUNCTION pgml."predict_joint"(
    "model_id" bigint,
    "features" real[]
) RETURNS real[]
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_joint_wrapper';
```
Examples
The following examples show end-to-end workflows for classification, regression, clustering, dimensionality reduction, and preprocessing. Each example follows the same pattern: load data, train a baseline model, run predictions, then experiment with additional algorithms and deployment strategies.
Preprocessing
This example uses the diamonds dataset to demonstrate encoding categorical variables and scaling numeric features before training.
```sql
-- Step 1: Load the diamonds dataset (contains categorical columns)
SELECT pgml.load_dataset('jdxcosta/diamonds');

-- Inspect the data
SELECT * FROM pgml."jdxcosta/diamonds" LIMIT 10;

-- Remove the unnamed index column
ALTER TABLE pgml."jdxcosta/diamonds" DROP COLUMN "Unnamed: 0";

-- Step 2: Train with preprocessing: standard scaling for numeric columns,
-- target encoding for categoricals
SELECT pgml.train(
    project_name => 'Diamond prices',
    task => 'regression',
    relation_name => 'pgml.jdxcosta/diamonds',
    y_column_name => 'price',
    algorithm => 'lightgbm',
    preprocess => '{
        "carat": {"scale": "standard"},
        "depth": {"scale": "standard"},
        "table": {"scale": "standard"},
        "cut": {"encode": "target", "scale": "standard"},
        "color": {"encode": "target", "scale": "standard"},
        "clarity": {"encode": "target", "scale": "standard"}
    }'
);

-- Step 3: Run predictions
-- Pass a heterogeneous row (tuple) instead of a homogeneous ARRAY[]
SELECT price, pgml.predict('Diamond prices', (carat, cut, color, clarity, depth, "table", x, y, z)) AS prediction
FROM pgml."jdxcosta/diamonds"
LIMIT 10;
```