
AnalyticDB: Machine Learning User Guide

Last updated: Apr 24, 2025

AnalyticDB for PostgreSQL 7.0 supports in-database AI/ML. The AI part integrates Hugging Face models, and the ML part integrates XGBoost, LightGBM, CatBoost, Linfa, and all Scikit-Learn algorithms. With this feature you can train, fine-tune, deploy, and run inference with AI/ML models directly on data in tables or views. This topic describes how to train, deploy, and run inference with machine learning models in AnalyticDB for PostgreSQL 7.0.

Training

The training interface is provided as a UDF named pgml.train. Currently, training is supported only on the master node.

Syntax

CREATE FUNCTION pgml.train(
    "project_name" TEXT,
    "task" TEXT DEFAULT NULL,
    "relation_name" TEXT DEFAULT NULL,
    "y_column_name" TEXT DEFAULT NULL,
    "algorithm" pgml.Algorithm DEFAULT 'linear',
    "hyperparams" jsonb DEFAULT '{}',
    "search" pgml.Search DEFAULT NULL,
    "search_params" jsonb DEFAULT '{}',
    "search_args" jsonb DEFAULT '{}',
    "test_size" real DEFAULT 0.25,
    "test_sampling" pgml.Sampling DEFAULT 'stratified',
    "runtime" pgml.Runtime DEFAULT NULL,
    "automatic_deploy" bool DEFAULT true,
    "materialize_snapshot" bool DEFAULT false, 
    "preprocess" jsonb DEFAULT '{}'
) RETURNS TABLE (
    "project" TEXT,
    "task" TEXT,
    "algorithm" TEXT,
    "deployed" bool
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'train_wrapper';

Parameters

  • project_name: The project name. Example: 'Breast Cancer Detection'

  • task: The task name. Example: 'classification'

  • relation_name: The table name. Use '$SCHEMA.$TABLE' to specify a table in another schema. Example: 'pgml.breast_cancer'

  • y_column_name: The name of the label column to train on. Example: 'malignant'

  • algorithm: The algorithm name. See the pgml custom types below for supported values. Example: 'xgboost'

  • hyperparams: The training hyperparameters. Example: '{"n_estimators": 25, "nthread": 4}'

  • search: The hyperparameter search strategy. See Hyperparameter search below for usage. Example: 'grid'

  • search_params: The ranges or candidate values of the hyperparameters to search. For details, see the official Scikit-Learn/XGBoost documentation. Example: '{"max_depth": [1, 2, 3, 4]}'

  • search_args: Additional arguments for the search strategy. For details, see the official Scikit-Learn/XGBoost documentation. Example: '{"n_iter": 10}'

  • test_size: The fraction of the data held out as the test set. Example: 0.25

  • test_sampling: The test set sampling method. See the pgml custom types below for supported values. Example: 'random'

  • runtime: The runtime of the algorithm implementation, either Python or Rust. Example: 'python'

  • automatic_deploy: Whether to deploy automatically. Example: true

  • materialize_snapshot: Whether to materialize the training data snapshot. Example: true

  • preprocess: The preprocessing configuration. Example: '{"col_name": {"impute": "mean", "scale": "standard"}}'

Note

Unless otherwise specified, a task trains on a single core by default. You can control this through hyperparams; see the parameter documentation of the specific model. For example, for XGBoost the nthread hyperparameter controls the number of cores used for training.
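As a minimal sketch of the note above (assuming the pgml.digits sample table used later in this topic has been loaded with pgml.load_dataset), nthread can be passed through hyperparams like this:

```sql
-- Train an XGBoost classifier on 4 cores instead of the single-core default.
SELECT * FROM pgml.train(
    'Handwritten Digit Image Classifier',
    'classification',
    'pgml.digits',
    'target',
    algorithm   => 'xgboost',
    hyperparams => '{"n_estimators": 25, "nthread": 4}'
);
```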

pgml custom types

The following lists the algorithms and values currently supported by the three custom parameter types pgml.algorithm, pgml.sampling, and pgml.runtime.

  • pgml.algorithm is a custom enum type that supports the following algorithms.

    linear: Linear model for regression analysis.
    xgboost: An optimized distributed gradient boosting library.
    xgboost_random_forest: A version of XGBoost tuned specifically for random forests.
    svm: Support vector machine algorithm for classification and regression.
    lasso: A regression analysis algorithm used in statistics and machine learning.
    elastic_net: Elastic net regression, combining L1 and L2 regularization.
    ridge: Ridge regression, which prevents overfitting through L2 regularization.
    kmeans: A clustering algorithm.
    dbscan: A density-based clustering algorithm that can discover clusters of arbitrary shape.
    knn: The k-nearest neighbors algorithm for regression.
    random_forest: An ensemble learning algorithm for classification and regression.
    least_angle: Least angle regression (LARS), compatible with lasso.
    lasso_least_angle: Lasso least angle regression, combining the characteristics of LARS and lasso.
    orthogonal_matching_pursuit: A greedy algorithm for sparse signal recovery.
    bayesian_ridge: Bayesian ridge regression, which estimates regression parameters through prior distributions.
    automatic_relevance_determination: A Bayesian regression algorithm.
    stochastic_gradient_descent: A simple and efficient optimization algorithm for regression and classification.
    perceptron: A linear model algorithm for binary classification.
    passive_aggressive: An online learning algorithm suitable for large-scale learning.
    ransac: Random sample consensus, for robust linear regression.
    theil_sen: The Theil-Sen estimator, for robust linear regression.
    huber: Huber regression, robust to outliers.
    quantile: Quantile regression, which predicts conditional quantiles (such as the median).
    kernel_ridge: Kernel ridge regression, which learns nonlinear relationships through the kernel trick.
    gaussian_process: Gaussian processes for probabilistic regression and classification.
    nu_svm: A support vector machine variant for classification and regression.
    ada_boost: A boosting algorithm for classification problems.
    bagging: An ensemble algorithm that reduces overfitting.
    extra_trees: An extension of the random forest algorithm.
    gradient_boosting_trees: A boosting algorithm for regression and classification problems.
    hist_gradient_boosting: A gradient boosting method suited to large datasets.
    linear_svm: A linear support vector machine for classification problems.
    lightgbm: A gradient boosting framework developed by Microsoft for large-scale data.
    affinity_propagation: A message-passing clustering algorithm based on similarity between data points.
    birch: A clustering algorithm particularly suited to large datasets.
    feature_agglomeration: A clustering method for feature selection.
    mini_batch_kmeans: A variant of k-means clustering for large or online datasets.
    mean_shift: A centroid-based clustering algorithm that can discover clusters of any shape.
    optics: An algorithm that identifies clusters of varying density.
    spectral: Clustering that exploits the spectral properties of the data.
    spectral_bi: Clusters both dimensions of the data (such as rows and columns) simultaneously.
    spectral_co: Co-clusters both dimensions simultaneously using spectral methods.
    catboost: An efficient gradient boosting machine for classification and regression problems.
    pca: A technique for dimensionality reduction.

  • pgml.sampling is a custom enum type that supports the following values.

    random: The test data is sampled at random.
    last: The test data is taken from the tail of a sequential scan.
    stratified: The test data is sampled with stratification.

  • pgml.runtime is a custom enum type that supports the following values.

    Python: The algorithm implementation is Python.
    Rust: The algorithm implementation is Rust.

Hyperparameter search

During model training, you can use hyperparameter search and cross-validation to find an optimal set of training hyperparameters. The relevant training interface parameters are search, search_params, and search_args.

  • pgml.search is a custom enum type that supports the following values.

    grid: Trains every permutation of the parameters in search_params using the Cartesian product.
    random: Randomly samples parameter combinations from search_params for at most n_iter iterations, as specified in search_args.

  • search_params: Defines the ranges or candidate values of the hyperparameters to search. For hyperparameter details, see the official Scikit-Learn/XGBoost documentation.

  • search_args supports the following JSONB formats.

    '{ "n_iter": K }': Applies to the random search strategy.
    '{ "cv": K }': The K for k-fold cross-validation.

    Example

    In the following example, grid search trains len(max_depth) * len(n_estimators) * k-fold = 3 * 3 * 3 = 27 combinations to compare all possible parameter permutations in search_params.

    Note
    • For regression, the R2 score is used by default to select the optimal hyperparameters.

    • For classification, the F1 score is used by default to select the optimal hyperparameters.

    SELECT * FROM pgml.train(
        'Handwritten Digit Image Classifier', 
        'classification',
        'pgml.digits',
        'target',
        algorithm => 'xgboost', 
    search => 'grid', 
        search_params => '{
            "max_depth": [1, 3, 5], 
            "n_estimators": [20, 80, 160]
        }',
        search_args => '{ "cv": 3 }'
    );

Preprocessing

The training function provides the preprocess parameter for preprocessing data. Preprocessors can be configured centrally for each column based on the training data. pgml supports the following three kinds of preprocessing.

Note

The preprocessor information is stored in the pgml.snapshots metadata table after training, and the same preprocessing is applied to the input data in subsequent pgml.predict() calls.

Categorical encoding

Encodes categorical values as numeric types.

native: Default. Non-text types are cast to 32-bit floating point.

target: Text types are encoded as the mean of the target within each category.

one_hot: Converts categorical values to one-hot format.

ordinal: Converts a value to its index in the array specified by ordinal; NULL maps to index 0 by default.

Missing value imputation

Replaces NULL or NaN with a numeric value.

error: Default. Raises an error when a NULL or NaN value is found during training or inference.

mean: Imputes the mean of the training data.

median: Imputes the median of the training data.

mode: Imputes the most frequent value in the training data.

min: Imputes the minimum of the training data.

max: Imputes the maximum of the training data.

zero: Imputes 0.

Scaling

Scales all variables into a standard range, which helps improve model generalization.

preserve: Default. No scaling is applied.

standard: Standardizes features by removing the mean and scaling to unit variance.

min-max: Scales features to a given range between a minimum and maximum (usually 0 and 1).

max-abs: Scales data by dividing by each feature's maximum absolute value, so that the maximum absolute value of each feature in the training set is 1.

robust: Scales data based on the range between the first and third quartiles of the column's statistics. This method is more robust to outliers than the others.
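Putting the three preprocessors together, here is a hedged sketch of a combined preprocess configuration. The table and column names (my_schema.students, "age", "grade", "score") are hypothetical, and the nested ordinal form is an assumption based on the description of ordinal above:

```sql
-- Hypothetical table my_schema.students with a numeric "age" column,
-- a text "grade" column, and a numeric "score" label column.
SELECT * FROM pgml.train(
    'Preprocessing Sketch',
    'regression',
    'my_schema.students',
    'score',
    algorithm  => 'xgboost',
    preprocess => '{
        "age":   {"impute": "mean", "scale": "standard"},
        "grade": {"encode": {"ordinal": ["F", "D", "C", "B", "A"]}}
    }'
);
```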

Deployment

Deployment stores the trained project information, preprocessing methods, and model parameters in metadata tables. Subsequent inference then only needs to specify the project ID or name, which simplifies the inference workflow.

The deployment strategy type parameter is pgml.strategy, which has the following values.

best_score: The model with the best metric is deployed.

most_recent: The most recently trained model is deployed, regardless of metrics.

rollback: Reverts to the deployment before the current model.

Training provides the automatic_deploy parameter (default true) to enable automatic deployment. Across multiple training runs, the best model is selected and deployed automatically: the model with the highest F1 score for classification tasks, or the highest R2 score for regression tasks. You can also manually deploy the model from a specific training run using either of the following two UDFs.

  • Force-deploy the model with the specified model_id.

    CREATE FUNCTION pgml."deploy"(
            "model_id" bigint
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    STRICT EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_model_wrapper';
  • Deploy a model selected from all models of the current project by the specified algorithm and strategy.

    CREATE FUNCTION pgml."deploy"(
            "project_name" TEXT,
            "strategy" pgml.Strategy,
            "algorithm" pgml.Algorithm DEFAULT NULL
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_strategy_wrapper';
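For example (a sketch assuming a project named 'Breast Cancer Detection' has already been trained, as in the classification example below; 42 is a placeholder model_id), the two manual-deployment UDFs can be called like this:

```sql
-- Force-deploy a specific model by ID.
SELECT * FROM pgml.deploy(42);

-- Deploy the best-scoring svm model from the project.
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'svm');
```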

Inference

Single-row inference

Single-row inference is implemented by predict, which runs inference on a single set of features based on a project name or model ID. The return value is a float4: the predicted class for classification, or the predicted value for regression. The different ways to call it are listed below.

Note

The model resolved from a project name is the most recently deployed model of that project.

  • Run inference on table rows with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_row_wrapper';
  • Run inference on table rows with a specified model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_row_wrapper';
  • Run inference on float4 features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f32_wrapper';
  • Run inference on float4 features with a specified model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_wrapper';
  • Run inference on bigint features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bigint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i64_wrapper';
  • Run inference on int features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" INT[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i32_wrapper';
  • Run inference on smallint features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" smallint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i16_wrapper';
  • Run inference on float8 features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" double precision[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f64_wrapper';
  • Run inference on bool features with a specified project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bool[]
    ) RETURNS real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_bool_wrapper';
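As a usage sketch of the row and array forms (assuming the pgml.diabetes sample dataset from the regression example below; the literal feature values in the array form are illustrative only):

```sql
-- Row form: pass a heterogeneous tuple of feature columns.
SELECT target,
       pgml.predict('Diabetes Progression',
                    (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
LIMIT 5;

-- Array form: pass a homogeneous real[] of features.
SELECT pgml.predict(
    'Diabetes Progression',
    ARRAY[0.04, 0.05, 0.06, 0.02, -0.04, -0.03, -0.04, 0.0, 0.02, -0.02]::real[]
) AS prediction;
```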

Batch inference

Batch inference runs inference on multiple sets of float4 features based on a project name or model ID. It is implemented by predict_batch, predict_proba, and predict_joint.

Note

The input feature sets must be flattened into a one-dimensional array; the return value is an array of float4.

predict_batch

In most cases, use predict_batch for batch inference. The two ways to call it are listed below.

  • Run inference on table rows with a specified model ID.

    CREATE FUNCTION pgml."predict_batch"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_batch_wrapper';
  • Run inference on table rows with a specified project name.

    CREATE FUNCTION pgml."predict_batch"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS SETOF real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_batch_wrapper';
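A sketch of the flattening convention described in the note above (the project name and feature values are hypothetical): two rows of three features each become one six-element array, and one prediction is returned per row.

```sql
SELECT pgml.predict_batch(
    'My Project',
    ARRAY[
        1.0, 2.0, 3.0,   -- features of row 1
        4.0, 5.0, 6.0    -- features of row 2
    ]::real[]
) AS predictions;
```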

predict_proba

When you need confidence probabilities, predict_proba is the right choice. Its usage is similar to predict_batch, except that it returns the raw model output: the confidence probability of each class.

  • Run inference on table rows with a specified model ID.

    CREATE FUNCTION pgml."predict_proba"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_proba_wrapper';
  • Run inference on table rows with a specified project name.

    CREATE FUNCTION pgml."predict_proba"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_proba_wrapper';

predict_joint

When you need to run inference over multiple label columns, use predict_joint. Classification scenarios are currently not supported. predict_joint returns a multidimensional prediction for each set of features.

  • Run inference on multiple sets of float4 features with a specified model ID.

     CREATE FUNCTION pgml."predict_joint"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_joint_wrapper';
    
  • Run inference on multiple sets of float4 features with a specified project name.

    CREATE FUNCTION pgml."predict_joint"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_joint_wrapper';

Examples

The following are complete SQL examples for classification, regression, clustering, dimensionality reduction, and preprocessing.

Classification

Classification example

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('breast_cancer');

-- view the dataset
SELECT * FROM pgml.breast_cancer LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Breast Cancer Detection', 'classification', 'pgml.breast_cancer', 'malignant');

-- check out the predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

-- view raw class probabilities
SELECT malignant, pgml.predict_proba(
    'Breast Cancer Detection',
    ARRAY[
        "mean radius",
        "mean texture",
        "mean perimeter",
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    ]
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ridge');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'perceptron');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'passive_aggressive');

-- support vector machines
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'nu_svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'linear_svm');

-- ensembles
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ada_boost');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'bagging');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'random_forest', hyperparams => '{"n_estimators": 10}');

-- other
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gaussian_process', hyperparams => '{"max_iter_predict": 100, "warm_start": true}');

-- Gradient Boosting
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');


-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'f1' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Breast Cancer Detection', 
    algorithm => 'xgboost', 
    search => 'grid', 
    search_params => '{
        "n_estimators": [2, 4],
        "max_depth": [1, 2, 3]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'rollback');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'svm');

-- check out the improved predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer 
LIMIT 10;

Regression

Complete regression example

SELECT pgml.load_dataset('diabetes');

-- view the dataset
SELECT * FROM pgml.diabetes LIMIT 10;

-- train a simple model on the data
SELECT * FROM pgml.train('Diabetes Progression', 'regression', 'pgml.diabetes', 'target');

-- check out the predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

-- Check predictions against a specific model id
SELECT model_id, target, pgml.predict(model_id, (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
CROSS JOIN LATERAL (
    SELECT pgml.models.id AS model_id FROM pgml.models
    INNER JOIN pgml.projects
    ON pgml.models.project_id = pgml.projects.id
    WHERE pgml.projects.name = 'Diabetes Progression'
    LIMIT 1
) models
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'elastic_net');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso_least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'orthogonal_matching_pursuit');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bayesian_ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'automatic_relevance_determination');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'passive_aggressive');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ransac');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'theil_sen', hyperparams => '{"max_iter": 10, "max_subpopulation": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'huber');
-- Quantile Regression too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'quantile');

-- support vector machines
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'svm', hyperparams => '{"max_iter": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'nu_svm', hyperparams => '{"max_iter": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hyperparams => '{"max_iter": 100}');

-- ensembles
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ada_boost', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bagging', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'random_forest', hyperparams => '{"n_estimators": 5}');

-- other
-- Kernel Ridge is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'kernel_ridge');
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gaussian_process');

-- gradient boosting
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'catboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}');

-- runtimes
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'python');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'rust');

--SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'python', hyperparams => '{"n_estimators": 1}'); -- too slow
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'rust', hyperparams => '{"n_estimators": 10}');

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'mean_squared_error' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Diabetes Progression', 
    algorithm => 'xgboost',
    hyperparams => '{"eval_metric": "rmse"}'::JSONB,
    search => 'grid', 
    search_params => '{
        "max_depth": [1, 2], 
        "n_estimators": [20, 40]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score');
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent');
SELECT * FROM pgml.deploy('Diabetes Progression', 'rollback');
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score', 'svm');

-- check out the improved predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

Clustering

Complete clustering example

-- This example trains models on the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for clustering. You could do something similar with a vector column
--

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- create an unlabeled table of the images for unsupervised learning
CREATE VIEW pgml.digit_vectors AS
SELECT image FROM pgml.digits;

-- view the dataset
SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Handwritten Digit Clusters', 'clustering', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');

-- check out the predictions
SELECT target, pgml.predict('Handwritten Digit Clusters', image) AS prediction
FROM pgml.digits
LIMIT 10;

SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'affinity_propagation');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'birch', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'kmeans', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'mini_batch_kmeans', hyperparams => '{"n_clusters": 10}');

Dimensionality reduction

Complete dimensionality reduction example

-- This example reduces the dimensionality of images in the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for decomposition to reduce dimensionality.
--
-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- view the dataset
SELECT left(image::text, 40) || ',...}', target FROM pgml.digits LIMIT 10;

-- create a view of just the vectors for decomposition, without any labels
CREATE VIEW digit_vectors AS
SELECT image FROM pgml.digits;

SELECT * FROM pgml.train('Handwritten Digits Reduction', 'decomposition', 'digit_vectors');

-- check out the decomposed vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- We can reduce the image vectors from 64 dimensions to 3 components
SELECT * FROM pgml.train('Handwritten Digits Reduction', hyperparams => '{"n_components": 3}');

-- check out the reduced vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models
                                 JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'cumulative_explained_variance' DESC LIMIT 5;

-- deploy the PCA model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent', 'pca');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'rollback');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score', 'pca');

-- check out the improved predictions
SELECT target, pgml.predict('Handwritten Digits Reduction', image) AS prediction
FROM pgml.digits
LIMIT 10;

Preprocessing

Complete preprocessing example

-- load the diamonds dataset, that contains text categorical variables
SELECT pgml.load_dataset('jdxcosta/diamonds');

-- view the data
SELECT * FROM pgml."jdxcosta/diamonds" LIMIT 10;

-- drop the "Unnamed: 0" column, since it's not useful for training (you could create a view instead)
ALTER TABLE pgml."jdxcosta/diamonds" DROP COLUMN "Unnamed: 0";

-- train a model using preprocessors to scale the numeric variables, and target encode the categoricals
SELECT pgml.train(
       project_name => 'Diamond prices',
       task => 'regression',
       relation_name => 'pgml.jdxcosta/diamonds',
       y_column_name => 'price',
       algorithm => 'lightgbm',
       preprocess => '{
                      "carat": {"scale": "standard"},
                      "depth": {"scale": "standard"},
                      "table": {"scale": "standard"},
                      "cut": {"encode": "target", "scale": "standard"},
                      "color": {"encode": "target", "scale": "standard"},
                      "clarity": {"encode": "target", "scale": "standard"}
                  }'
);

-- run some predictions, notice we're passing a heterogeneous row (tuple) as input, rather than a homogenous ARRAY[].
SELECT price, pgml.predict('Diamond prices', (carat, cut, color, clarity, depth, "table", x, y, z)) AS prediction
FROM pgml."jdxcosta/diamonds"
LIMIT 10;

-- This is a difficult dataset for more algorithms, which makes it a good challenge for preprocessing, and additional
-- feature engineering. What's next?