
AnalyticDB: Machine Learning User Guide

Last updated: Apr 24, 2025

AnalyticDB for PostgreSQL 7.0 supports in-database AI/ML. The AI part integrates Hugging Face models, and the ML part integrates XGBoost, LightGBM, CatBoost, Linfa, and all Scikit-Learn algorithms. With this feature you can train, fine-tune, deploy, and run inference with AI/ML models directly on data in tables or views. This topic describes how to train, deploy, and run inference with machine learning models in AnalyticDB for PostgreSQL 7.0.

Training

The training interface is provided as a UDF named pgml.train. Currently, training is supported only on the master node.

Syntax

CREATE FUNCTION pgml.train(
    "project_name" TEXT,
    "task" TEXT DEFAULT NULL,
    "relation_name" TEXT DEFAULT NULL,
    "y_column_name" TEXT DEFAULT NULL,
    "algorithm" pgml.Algorithm DEFAULT 'linear',
    "hyperparams" jsonb DEFAULT '{}',
    "search" pgml.Search DEFAULT NULL,
    "search_params" jsonb DEFAULT '{}',
    "search_args" jsonb DEFAULT '{}',
    "test_size" real DEFAULT 0.25,
    "test_sampling" pgml.Sampling DEFAULT 'stratified',
    "runtime" pgml.Runtime DEFAULT NULL,
    "automatic_deploy" bool DEFAULT true,
    "materialize_snapshot" bool DEFAULT false, 
    "preprocess" jsonb DEFAULT '{}'
) RETURNS TABLE (
    "project" TEXT,
    "task" TEXT,
    "algorithm" TEXT,
    "deployed" bool
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'train_wrapper';

Parameters

  • project_name: The project name. Example: 'Breast Cancer Detection'

  • task: The task name. Example: 'classification'

  • relation_name: The table name. Use '$SCHEMA.$TABLE' to specify a table in another schema. Example: 'pgml.breast_cancer'

  • y_column_name: The name of the label column to train on. Example: 'malignant'

  • algorithm: The algorithm name. See the pgml custom types below for supported values. Example: 'xgboost'

  • hyperparams: The training hyperparameters. Example: '{"n_estimators": 25, "nthread": 4}'

  • search: The hyperparameter search strategy. See Hyperparameter search below for usage. Example: 'grid'

  • search_params: The ranges or candidate values of the hyperparameters to search. For details, see the official Scikit-Learn/XGBoost documentation. Example: '{"max_depth": [1, 2, 3, 4]}'

  • search_args: Additional arguments for the search strategy. For details, see the official Scikit-Learn/XGBoost documentation. Example: '{"n_iter": 10}'

  • test_size: The fraction of the data held out as the test set. Example: 0.25

  • test_sampling: The test set sampling method. See the pgml custom types below for supported values. Example: 'random'

  • runtime: The runtime of the algorithm implementation, either Python or Rust. Example: 'python'

  • automatic_deploy: Whether to deploy automatically. Example: true

  • materialize_snapshot: Whether to materialize the training data snapshot. Example: true

  • preprocess: The preprocessing configuration. Example: '{"col_name": {"impute": "mean", "scale": "standard"}}'

Note

Unless otherwise specified, a task trains on a single core by default. You can control this through hyperparams; see the parameter documentation of the specific model. For example, for XGBoost the nthread hyperparameter controls the number of cores used for training.
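As a minimal sketch of the note above (assuming the pgml.digits sample table used later in this topic has been loaded with pgml.load_dataset), nthread can be passed through hyperparams like this:

```sql
-- Train an XGBoost classifier on 4 cores instead of the single-core default.
SELECT * FROM pgml.train(
    'Handwritten Digit Image Classifier',
    'classification',
    'pgml.digits',
    'target',
    algorithm   => 'xgboost',
    hyperparams => '{"n_estimators": 25, "nthread": 4}'
);
```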

pgml custom types

The following lists the algorithms and values currently supported by the three custom parameter types pgml.algorithm, pgml.sampling, and pgml.runtime.

  • pgml.algorithm is a custom enum type that supports the following algorithms.

    linear: Linear model for regression analysis.
    xgboost: An optimized distributed gradient boosting library.
    xgboost_random_forest: A version of XGBoost tuned specifically for random forests.
    svm: Support vector machine algorithm for classification and regression.
    lasso: A regression analysis algorithm used in statistics and machine learning.
    elastic_net: Elastic net regression, combining L1 and L2 regularization.
    ridge: Ridge regression, which prevents overfitting through L2 regularization.
    kmeans: A clustering algorithm.
    dbscan: A density-based clustering algorithm that can discover clusters of arbitrary shape.
    knn: The k-nearest neighbors algorithm for regression.
    random_forest: An ensemble learning algorithm for classification and regression.
    least_angle: Least angle regression (LARS), compatible with lasso.
    lasso_least_angle: Lasso least angle regression, combining the characteristics of LARS and lasso.
    orthogonal_matching_pursuit: A greedy algorithm for sparse signal recovery.
    bayesian_ridge: Bayesian ridge regression, which estimates regression parameters through prior distributions.
    automatic_relevance_determination: A Bayesian regression algorithm.
    stochastic_gradient_descent: A simple and efficient optimization algorithm for regression and classification.
    perceptron: A linear model algorithm for binary classification.
    passive_aggressive: An online learning algorithm suitable for large-scale learning.
    ransac: Random sample consensus, for robust linear regression.
    theil_sen: The Theil-Sen estimator, for robust linear regression.
    huber: Huber regression, robust to outliers.
    quantile: Quantile regression, which predicts conditional quantiles (such as the median).
    kernel_ridge: Kernel ridge regression, which learns nonlinear relationships through the kernel trick.
    gaussian_process: Gaussian processes for probabilistic regression and classification.
    nu_svm: A support vector machine variant for classification and regression.
    ada_boost: A boosting algorithm for classification problems.
    bagging: An ensemble algorithm that reduces overfitting.
    extra_trees: An extension of the random forest algorithm.
    gradient_boosting_trees: A boosting algorithm for regression and classification problems.
    hist_gradient_boosting: A gradient boosting method suited to large datasets.
    linear_svm: A linear support vector machine for classification problems.
    lightgbm: A gradient boosting framework developed by Microsoft for large-scale data.
    affinity_propagation: A message-passing clustering algorithm based on similarity between data points.
    birch: A clustering algorithm particularly suited to large datasets.
    feature_agglomeration: A clustering method for feature selection.
    mini_batch_kmeans: A variant of k-means clustering for large or online datasets.
    mean_shift: A centroid-based clustering algorithm that can discover clusters of any shape.
    optics: An algorithm that identifies clusters of varying density.
    spectral: Clustering that exploits the spectral properties of the data.
    spectral_bi: Clusters both dimensions of the data (such as rows and columns) simultaneously.
    spectral_co: Co-clusters both dimensions simultaneously using spectral methods.
    catboost: An efficient gradient boosting machine for classification and regression problems.
    pca: A technique for dimensionality reduction.

  • pgml.sampling is a custom enum type that supports the following values.

    random: The test data is sampled at random.
    last: The test data is taken from the tail of a sequential scan.
    stratified: The test data is sampled with stratification.

  • pgml.runtime is a custom enum type that supports the following values.

    Python: The algorithm implementation is Python.
    Rust: The algorithm implementation is Rust.

Hyperparameter search

During model training, you can use hyperparameter search and cross-validation to find an optimal set of training hyperparameters. The relevant training interface parameters are search, search_params, and search_args.

  • pgml.search is a custom enum type that supports the following values.

    grid: Trains every permutation of the parameters in search_params using the Cartesian product.
    random: Randomly samples parameter combinations from search_params for at most n_iter iterations, as specified in search_args.

  • search_params: Defines the ranges or candidate values of the hyperparameters to search. For hyperparameter details, see the official Scikit-Learn/XGBoost documentation.

  • search_args supports the following JSONB formats.

    '{ "n_iter": K }': Applies to the random search strategy.
    '{ "cv": K }': The K for k-fold cross-validation.

    Example

    In the following example, grid search trains len(max_depth) * len(n_estimators) * k-fold = 3 * 3 * 3 = 27 combinations to compare all possible parameter permutations in search_params.

    Note
    • For regression, the R2 score is used by default to select the optimal hyperparameters.

    • For classification, the F1 score is used by default to select the optimal hyperparameters.

    SELECT * FROM pgml.train(
        'Handwritten Digit Image Classifier', 
        'classification',
        'pgml.digits',
        'target',
        algorithm => 'xgboost', 
    search => 'grid', 
        search_params => '{
            "max_depth": [1, 3, 5], 
            "n_estimators": [20, 80, 160]
        }',
        search_args => '{ "cv": 3 }'
    );

Preprocessing

The training function provides the preprocess parameter for preprocessing data. Preprocessors can be configured centrally for each column based on the training data. pgml supports the following three kinds of preprocessing.

Note

The preprocessor information is stored in the pgml.snapshots metadata table after training, and the same preprocessing is applied to the input data in subsequent pgml.predict() calls.

Categorical encoding

Encodes categorical values as numeric types.

native: Default. Non-text types are cast to 32-bit floating point.

target: Text types are encoded as the mean of the target within each category.

one_hot: Converts categorical values to one-hot format.

ordinal: Converts a value to its index in the array specified by ordinal; NULL maps to index 0 by default.

Missing value imputation

Replaces NULL or NaN with a numeric value.

error: Default. Raises an error when a NULL or NaN value is found during training or inference.

mean: Imputes the mean of the training data.

median: Imputes the median of the training data.

mode: Imputes the most frequent value in the training data.

min: Imputes the minimum of the training data.

max: Imputes the maximum of the training data.

zero: Imputes 0.

Scaling

Scales all variables into a standard range, which helps improve model generalization.

preserve: Default. No scaling is applied.

standard: Standardizes features by removing the mean and scaling to unit variance.

min-max: Scales features to a given range between a minimum and maximum (usually 0 and 1).

max-abs: Scales data by dividing by each feature's maximum absolute value, so that the maximum absolute value of each feature in the training set is 1.

robust: Scales data based on the range between the first and third quartiles of the column's statistics. This method is more robust to outliers than the others.
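Putting the three preprocessors together, here is a hedged sketch of a combined preprocess configuration. The table and column names (my_schema.students, "age", "grade", "score") are hypothetical, and the nested ordinal form is an assumption based on the description of ordinal above:

```sql
-- Hypothetical table my_schema.students with a numeric "age" column,
-- a text "grade" column, and a numeric "score" label column.
SELECT * FROM pgml.train(
    'Preprocessing Sketch',
    'regression',
    'my_schema.students',
    'score',
    algorithm  => 'xgboost',
    preprocess => '{
        "age":   {"impute": "mean", "scale": "standard"},
        "grade": {"encode": {"ordinal": ["F", "D", "C", "B", "A"]}}
    }'
);
```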

Deployment

Deployment stores the trained project information, preprocessing methods, and model parameters in metadata tables. Subsequent inference then only needs to specify the project ID or name, which simplifies the inference workflow.

The deployment strategy type parameter is pgml.strategy, which has the following values.

best_score: The model with the best metric is deployed.

most_recent: The most recently trained model is deployed, regardless of metrics.

rollback: Reverts to the deployment before the current model.

Training provides the automatic_deploy parameter (default true) to enable automatic deployment. Across multiple training runs, the best model is selected and deployed automatically: the model with the highest F1 score for classification tasks, or the highest R2 score for regression tasks. You can also manually deploy the model from a specific training run using either of the following two UDFs.

  • Force-deploy the model with the specified model_id.

    CREATE FUNCTION pgml."deploy"(
            "model_id" bigint
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    STRICT EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_model_wrapper';
  • Deploy a model selected from all models of the current project by the specified algorithm and strategy.

    CREATE FUNCTION pgml."deploy"(
            "project_name" TEXT,
            "strategy" pgml.Strategy,
            "algorithm" pgml.Algorithm DEFAULT NULL
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_strategy_wrapper';
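For example (a sketch assuming a project named 'Breast Cancer Detection' has already been trained, as in the classification example below; 42 is a placeholder model_id), the two manual-deployment UDFs can be called like this:

```sql
-- Force-deploy a specific model by ID.
SELECT * FROM pgml.deploy(42);

-- Deploy the best-scoring svm model from the project.
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'svm');
```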

Inference

Single-row inference

Single-row inference is implemented by predict, which runs inference on a single set of features based on a project name or model ID. The return value is a float4: the predicted class for classification, or the predicted value for regression. The different ways to call it are listed below.

Note

The model resolved from a project name is the most recently deployed model of that project.

  • Run inference on table rows with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_row_wrapper';
  • Run inference on table rows with a specified model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_row_wrapper';
  • Run inference on float4 features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f32_wrapper';
  • Run inference on float4 features with a specified model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_wrapper';
  • Run inference on bigint features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bigint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i64_wrapper';
  • Run inference on int features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" INT[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i32_wrapper';
  • Run inference on smallint features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" smallint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i16_wrapper';
  • Run inference on float8 features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" double precision[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f64_wrapper';
  • Run inference on bool features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bool[]
    ) RETURNS real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_bool_wrapper';
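As a usage sketch of the row and array forms (assuming the pgml.diabetes sample dataset from the regression example below; the literal feature values in the array form are illustrative only):

```sql
-- Row form: pass a heterogeneous tuple of feature columns.
SELECT target,
       pgml.predict('Diabetes Progression',
                    (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
LIMIT 5;

-- Array form: pass a homogeneous real[] of features.
SELECT pgml.predict(
    'Diabetes Progression',
    ARRAY[0.04, 0.05, 0.06, 0.02, -0.04, -0.03, -0.04, 0.0, 0.02, -0.02]::real[]
) AS prediction;
```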

Batch inference

Batch inference runs inference on multiple sets of float4 features based on a project name or model ID. It is implemented by predict_batch, predict_proba, and predict_joint.

Note

The input feature sets must be flattened into a one-dimensional array; the return value is an array of float4.

predict_batch

In most cases, use predict_batch for batch inference. The two ways to call it are listed below.

  • Run inference on table rows with a specified model ID.

    CREATE FUNCTION pgml."predict_batch"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_batch_wrapper';
  • Run inference on table rows with a specified project name.

    CREATE FUNCTION pgml."predict_batch"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS SETOF real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_batch_wrapper';
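A sketch of the flattening convention described in the note above (the project name and feature values are hypothetical): two rows of three features each become one six-element array, and one prediction is returned per row.

```sql
SELECT pgml.predict_batch(
    'My Project',
    ARRAY[
        1.0, 2.0, 3.0,   -- features of row 1
        4.0, 5.0, 6.0    -- features of row 2
    ]::real[]
) AS predictions;
```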

predict_proba

When you need confidence probabilities, predict_proba is the right choice. Its usage is similar to predict_batch, except that it returns the raw model output: the confidence probability of each class.

  • Run inference on table rows with a specified model ID.

    CREATE FUNCTION pgml."predict_proba"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_proba_wrapper';
  • Run inference on table rows with a specified project name.

    CREATE FUNCTION pgml."predict_proba"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_proba_wrapper';

predict_joint

When you need to run inference over multiple label columns, use predict_joint. Classification scenarios are currently not supported. predict_joint returns a multidimensional prediction for each set of features.

  • Run inference on multiple sets of float4 features with a specified model ID.

     CREATE FUNCTION pgml."predict_joint"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_joint_wrapper';
    
  • Run inference on multiple sets of float4 features with a specified project name.

    CREATE FUNCTION pgml."predict_joint"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_joint_wrapper';

Examples

The following are complete SQL examples for classification, regression, clustering, dimensionality reduction, and preprocessing.

Classification

Classification example

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('breast_cancer');

-- view the dataset
SELECT * FROM pgml.breast_cancer LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Breast Cancer Detection', 'classification', 'pgml.breast_cancer', 'malignant');

-- check out the predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

-- view raw class probabilities
SELECT malignant, pgml.predict_proba(
    'Breast Cancer Detection',
    ARRAY[
        "mean radius",
        "mean texture",
        "mean perimeter",
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    ]
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ridge');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'perceptron');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'passive_aggressive');

-- support vector machines
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'nu_svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'linear_svm');

-- ensembles
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ada_boost');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'bagging');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'random_forest', hyperparams => '{"n_estimators": 10}');

-- other
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gaussian_process', hyperparams => '{"max_iter_predict": 100, "warm_start": true}');

-- Gradient Boosting
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');


-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'f1' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Breast Cancer Detection', 
    algorithm => 'xgboost', 
    search => 'grid', 
    search_params => '{
        "n_estimators": [2, 4],
        "max_depth": [1, 2, 3]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'rollback');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'svm');

-- check out the improved predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer 
LIMIT 10;

Regression

Complete regression example

SELECT pgml.load_dataset('diabetes');

-- view the dataset
SELECT * FROM pgml.diabetes LIMIT 10;

-- train a simple model on the data
SELECT * FROM pgml.train('Diabetes Progression', 'regression', 'pgml.diabetes', 'target');

-- check out the predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

-- Check predictions against a specific model id
SELECT model_id, target, pgml.predict(model_id, (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
CROSS JOIN LATERAL (
    SELECT pgml.models.id AS model_id FROM pgml.models
    INNER JOIN pgml.projects
    ON pgml.models.project_id = pgml.projects.id
    WHERE pgml.projects.name = 'Diabetes Progression'
    LIMIT 1
) models
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'elastic_net');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso_least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'orthogonal_matching_pursuit');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bayesian_ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'automatic_relevance_determination');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'passive_aggressive');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ransac');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'theil_sen', hyperparams => '{"max_iter": 10, "max_subpopulation": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'huber');
-- Quantile Regression too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'quantile');

-- support vector machines
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'svm', hyperparams => '{"max_iter": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'nu_svm', hyperparams => '{"max_iter": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hyperparams => '{"max_iter": 100}');

-- ensembles
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ada_boost', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bagging', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'random_forest', hyperparams => '{"n_estimators": 5}');

-- other
-- Kernel Ridge is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'kernel_ridge');
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gaussian_process');

-- gradient boosting
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'catboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}');

-- runtimes
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'python');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'rust');

--SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'python', hyperparams => '{"n_estimators": 1}'); -- too slow
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'rust', hyperparams => '{"n_estimators": 10}');

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'mean_squared_error' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Diabetes Progression', 
    algorithm => 'xgboost',
    hyperparams => '{"eval_metric": "rmse"}'::JSONB,
    search => 'grid', 
    search_params => '{
        "max_depth": [1, 2], 
        "n_estimators": [20, 40]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score');
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent');
SELECT * FROM pgml.deploy('Diabetes Progression', 'rollback');
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score', 'svm');

-- check out the improved predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

Clustering

Complete clustering example

-- This example trains models on the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for clustering. You could do something similar with a vector column
--

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- create an unlabeled table of the images for unsupervised learning
CREATE VIEW pgml.digit_vectors AS
SELECT image FROM pgml.digits;

-- view the dataset
SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Handwritten Digit Clusters', 'clustering', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');

-- check out the predictions
SELECT target, pgml.predict('Handwritten Digit Clusters', image) AS prediction
FROM pgml.digits
LIMIT 10;

SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'affinity_propagation');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'birch', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'kmeans', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'mini_batch_kmeans', hyperparams => '{"n_clusters": 10}');

Dimensionality reduction

Complete dimensionality reduction example

-- This example reduces the dimensionality of images in the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for decomposition to reduce dimensionality.
--
-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- view the dataset
SELECT left(image::text, 40) || ',...}', target FROM pgml.digits LIMIT 10;

-- create a view of just the vectors for decomposition, without any labels
CREATE VIEW digit_vectors AS
SELECT image FROM pgml.digits;

SELECT * FROM pgml.train('Handwritten Digits Reduction', 'decomposition', 'digit_vectors');

-- check out the decomposed vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- We can reduce the image vectors from 64 dimensions to 3 components
SELECT * FROM pgml.train('Handwritten Digits Reduction', hyperparams => '{"n_components": 3}');

-- check out the reduced vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models
                                 JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'cumulative_explained_variance' DESC LIMIT 5;

-- deploy the PCA model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent', 'pca');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'rollback');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score', 'pca');

-- check out the improved predictions
SELECT target, pgml.predict('Handwritten Digits Reduction', image) AS prediction
FROM pgml.digits
LIMIT 10;

Preprocessing

Complete preprocessing example

-- load the diamonds dataset, that contains text categorical variables
SELECT pgml.load_dataset('jdxcosta/diamonds');

-- view the data
SELECT * FROM pgml."jdxcosta/diamonds" LIMIT 10;

-- drop the "Unnamed: 0" column, since it's not useful for training (you could create a view instead)
ALTER TABLE pgml."jdxcosta/diamonds" DROP COLUMN "Unnamed: 0";

-- train a model using preprocessors to scale the numeric variables, and target encode the categoricals
SELECT pgml.train(
       project_name => 'Diamond prices',
       task => 'regression',
       relation_name => 'pgml.jdxcosta/diamonds',
       y_column_name => 'price',
       algorithm => 'lightgbm',
       preprocess => '{
                      "carat": {"scale": "standard"},
                      "depth": {"scale": "standard"},
                      "table": {"scale": "standard"},
                      "cut": {"encode": "target", "scale": "standard"},
                      "color": {"encode": "target", "scale": "standard"},
                      "clarity": {"encode": "target", "scale": "standard"}
                  }'
);

-- run some predictions, notice we're passing a heterogeneous row (tuple) as input, rather than a homogenous ARRAY[].
SELECT price, pgml.predict('Diamond prices', (carat, cut, color, clarity, depth, "table", x, y, z)) AS prediction
FROM pgml."jdxcosta/diamonds"
LIMIT 10;

-- This is a difficult dataset for more algorithms, which makes it a good challenge for preprocessing, and additional
-- feature engineering. What's next?