
Machine learning

Last Updated: Aug 15, 2018

Linear SVM

Developed in the mid-1990s, the Support Vector Machine (SVM) is a machine learning method based on statistical learning theory. It improves the generalization ability of the learning machine through structural risk minimization, minimizing both the empirical risk and the confidence range, so good statistical results can be obtained even from small samples. For more information about SVM, see the Wikipedia entry.

This linear SVM is implemented without kernel functions. For implementation details, see section 6, "Trust Region Method for L2-SVM," in http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf. This algorithm supports only binary classification models.
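To make the objective concrete, here is a minimal NumPy sketch of the L2-SVM formulation referenced above (L2 regularization plus squared hinge loss), trained with plain gradient descent rather than the trust region method used by the component; all names and hyperparameters are illustrative.

import numpy as np

def train_l2_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Gradient descent on the L2-SVM objective:
    0.5*||w||^2 + C*sum(max(0, 1 - y*(X@w + b))**2), with y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # hinge violations
        # Gradient of the squared hinge term (differentiable everywhere).
        w -= lr * (w - 2.0 * C * (X.T @ (slack * y)))
        b -= lr * (-2.0 * C * np.sum(slack * y))
    return w, b

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_l2_svm(X, y)
print(np.sign(X @ w + b))   # [ 1.  1. -1. -1.]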

Algorithm component

  1. Set field parameters of the component.

    image

    • Input column: You can select only columns of the bigint or double type.

    • Label column: The data type of the label column can be bigint, double, or string. This component supports only binary classification models.

  2. Set algorithm parameters.

    image

    • Penalty factor: Defaults to 1.

    • Base value: (Optional) Positive value. If this parameter is not specified, the system selects a random value. We recommend that you specify this parameter when the positive and negative samples are significantly different.

    • Positive weight: (Optional) Positive penalty factor. The default value is 1.0, and the value range is (0, +∞).

    • Negative weight: (Optional) Negative penalty factor. The default value is 1.0, and the value range is (0, +∞).

    • Convergence coefficient: (Optional) Convergence deviation. The default value is 0.001, and the value range is (0, 1).

      NOTE: If the base value is not specified, the positive weight and negative weight must be the same.

PAI command

PAI -name LinearSVM -project algo_public \
    -DnegativeCost="1.0" \
    -DmodelName="xlab_m_LinearSVM_6143" \
    -DpositiveCost="1.0" \
    -Depsilon="0.001" \
    -DlabelColName="y" \
    -DfeatureColNames="pdays,emp_var_rate,cons_conf_idx" \
    -DinputTableName="bank_data" \
    -DpositiveLabel="0";

Parameter settings

Parameter | Description | Option | Default value
inputTableName | (Required) Name of the input table | NA | NA
inputTablePartitions | (Optional) Partitions used for training in the input table, in the format of partition_name=value. The multilevel format is name1=value1/name2=value2; multiple partitions are separated by commas. | NA | All partitions in the input table
modelName | (Required) Name of the output model | NA | NA
featureColNames | (Required) Names of the feature columns used for training in the input table | NA | NA
labelColName | (Required) Name of the label column in the input table | NA | NA
positiveLabel | (Optional) Positive value | NA | Random value selected among label values
negativeCost | (Optional) Negative weight (negative penalty factor) | (0, +∞) | 1.0
positiveCost | (Optional) Positive weight (positive penalty factor) | (0, +∞) | 1.0
epsilon | (Optional) Convergence coefficient | (0, 1) | 0.001

Example

Training data

id y f0 f1 f2 f3 f4 f5 f6 f7
1 -1 -0.294118 0.487437 0.180328 -0.292929 -1 0.00149028 -0.53117 -0.0333333
2 +1 -0.882353 -0.145729 0.0819672 -0.414141 -1 -0.207153 -0.766866 -0.666667
3 -1 -0.0588235 0.839196 0.0491803 -1 -1 -0.305514 -0.492741 -0.633333
4 +1 -0.882353 -0.105528 0.0819672 -0.535354 -0.777778 -0.162444 -0.923997 -1
5 -1 -1 0.376884 -0.344262 -0.292929 -0.602837 0.28465 0.887276 -0.6
6 +1 -0.411765 0.165829 0.213115 -1 -1 -0.23696 -0.894962 -0.7
7 -1 -0.647059 -0.21608 -0.180328 -0.353535 -0.791962 -0.0760059 -0.854825 -0.833333
8 +1 0.176471 0.155779 -1 -1 -1 0.052161 -0.952178 -0.733333
9 -1 -0.764706 0.979899 0.147541 -0.0909091 0.283688 -0.0909091 -0.931682 0.0666667
10 -1 -0.0588235 0.256281 0.57377 -1 -1 -1 -0.868488 0.1

Test data

id y f0 f1 f2 f3 f4 f5 f6 f7
1 +1 -0.882353 0.0854271 0.442623 -0.616162 -1 -0.19225 -0.725021 -0.9
2 +1 -0.294118 -0.0351759 -1 -1 -1 -0.293592 -0.904355 -0.766667
3 +1 -0.882353 0.246231 0.213115 -0.272727 -1 -0.171386 -0.981213 -0.7
4 -1 -0.176471 0.507538 0.278689 -0.414141 -0.702128 0.0491804 -0.475662 0.1
5 -1 -0.529412 0.839196 -1 -1 -1 -0.153502 -0.885568 -0.5
6 +1 -0.882353 0.246231 -0.0163934 -0.353535 -1 0.0670641 -0.627669 -1
7 -1 -0.882353 0.819095 0.278689 -0.151515 -0.307329 0.19225 0.00768574 -0.966667
8 +1 -0.882353 -0.0753769 0.0163934 -0.494949 -0.903073 -0.418778 -0.654996 -0.866667
9 +1 -1 0.527638 0.344262 -0.212121 -0.356974 0.23696 -0.836038 -0.8
10 +1 -0.882353 0.115578 0.0163934 -0.737374 -0.56974 -0.28465 -0.948762 -0.933333

  1. Create an experiment svm_example.

    image

  2. Select feature columns.

    image

  3. Select the label column.

    image

  4. Set SVM parameters.

    image

  5. Run the experiment.

  • The following model is generated:

    image

  • The following figure shows the predicted result.

    ex_svm_predict_result

Logistic regression

  • Classic logistic regression is a binary classification algorithm. Logistic regression on the algorithm platform supports multiclass classification.

  • The logistic regression component supports two data formats: sparse and dense.

  • Logistic regression for multiclass classification supports a maximum of 100 classes.
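As a reference for what the component computes at prediction time, here is a minimal NumPy sketch of multiclass logistic regression scoring: a softmax over linear scores yields per-class probabilities like those shown later in the prediction_detail column. The weights W, bias b, and class names are illustrative placeholders, not the component's actual model format.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_detail(X, W, b, classes):
    """Per-class probabilities, analogous to the prediction_detail output."""
    proba = softmax(X @ W + b)             # shape: (n_samples, n_classes)
    return [dict(zip(classes, row)) for row in proba]

X = np.eye(4)                              # four toy one-hot input rows
W = np.array([[2.0, -1, -1], [2, -1, -1], [-1, -1, 2], [-1, 2, -1]])
b = np.zeros(3)
print(predict_detail(X, W, b, classes=["0", "1", "2"]))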

Parameter settings

Parameters of the logistic regression component:

  • Support for sparse matrix: The component supports the sparse matrix format.

  • Base value: (Optional) For binary classification, the label value treated as the positive class during training. If this parameter is left blank, the system selects a random value.

  • Maximum iterations: (Optional) Maximum number of L-BFGS iterations. The default value is 100.

  • Convergence deviation: (Optional) Condition for L-BFGS termination, that is, the log-likelihood deviation between two iterations. The default value is 1.0e-06.

  • Regularization type: (Optional) Options are 'l1', 'l2', and 'None'. The default value is 'l1'.

  • Regularization coefficient: (Optional) The default value is 1.0. This parameter is ignored when regularizedType is set to None.

PAI command (Not inheriting the type setting node)

PAI -name LogisticRegression -project algo_public \
    -DmodelName="xlab_m_logistic_regression_6096" \
    -DregularizedLevel="1" \
    -DmaxIter="100" \
    -DregularizedType="l1" \
    -Depsilon="0.000001" \
    -DlabelColName="y" \
    -DfeatureColNames="pdays,emp_var_rate" \
    -DgoodValue="1" \
    -DinputTableName="bank_data";

  • name: Name of the component.

  • project: Name of the project. This parameter specifies the space where an algorithm is located. The default value is algo_public. If you change the default value, the system returns an error.

  • modelName: Name of the output model.

  • regularizedLevel: (Optional) Regularization coefficient. The default value is 1.0. This parameter is ignored if regularizedType is set to None.

  • maxIter: (Optional) Maximum iterations. It specifies the maximum number of L-BFGS iterations. The default value is 100.

  • regularizedType: (Optional) Regularization type. Options are 'l1', 'l2', and 'None'. The default value is 'l1'.

  • epsilon: (Optional) Convergence deviation. It is the condition for L-BFGS termination, that is, the log-likelihood deviation between two iterations. The default value is 1.0e-06.

  • labelColName: Name of the label column in the input table.

  • featureColNames: Names of the feature columns used for training in the input table.

  • goodValue: (Optional) Base value. For binary classification, the label value treated as the positive class during training. If this parameter is left blank, the system selects a random value.

  • inputTableName: Name of the input table for training.

Example

Binary classification

Test data

SQL statement for data generation

drop table if exists lr_test_input;
create table lr_test_input
as
select
    *
from
(
    select
        cast(1 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(1 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(1 as double) as f2,
        cast(0 as double) as f3,
        cast(1 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(1 as double) as f3,
        cast(1 as bigint) as label
    from dual
    union all
    select
        cast(1 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(1 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
) a;

Input data description

+------------+------------+------------+------------+------------+
| f0 | f1 | f2 | f3 | label |
+------------+------------+------------+------------+------------+
| 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 1 |
| 0.0 | 1.0 | 0.0 | 0.0 | 0 |
| 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 0.0 | 1.0 | 0.0 | 0.0 | 0 |
+------------+------------+------------+------------+------------+

Running command

drop offlinemodel if exists lr_test_model;
drop table if exists lr_test_prediction_result;
PAI -name logisticregression_binary -project algo_public -DmodelName="lr_test_model" -DitemDelimiter="," -DregularizedLevel="1" -DmaxIter="100" -DregularizedType="None" -Depsilon="0.000001" -DkvDelimiter=":" -DlabelColName="label" -DfeatureColNames="f0,f1,f2,f3" -DenableSparse="false" -DgoodValue="1" -DinputTableName="lr_test_input";
PAI -name prediction -project algo_public -DdetailColName="prediction_detail" -DmodelName="lr_test_model" -DitemDelimiter="," -DresultColName="prediction_result" -Dlifecycle="28" -DoutputTableName="lr_test_prediction_result" -DscoreColName="prediction_score" -DkvDelimiter=":" -DinputTableName="lr_test_input" -DenableSparse="false" -DappendColNames="label";

Running result

lr_test_prediction_result

+------------+-------------------+------------------+-------------------+
| label | prediction_result | prediction_score | prediction_detail |
+------------+-------------------+------------------+-------------------+
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
| 1 | 1 | 0.999999799574135 | {"0": 2.004258650156743e-07, "1": 0.999999799574135} |
| 1 | 1 | 0.999999799574135 | {"0": 2.004258650156743e-07, "1": 0.999999799574135} |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
+------------+-------------------+------------------+-------------------+

Multiclass classification

Test data

SQL statement for data generation

drop table if exists multi_lr_test_input;
create table multi_lr_test_input
as
select
    *
from
(
    select
        cast(1 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(1 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(1 as double) as f2,
        cast(0 as double) as f3,
        cast(2 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(1 as double) as f3,
        cast(1 as bigint) as label
    from dual
) a;

Input data description

+------------+------------+------------+------------+------------+
| f0 | f1 | f2 | f3 | label |
+------------+------------+------------+------------+------------+
| 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 0.0 | 0.0 | 1.0 | 0.0 | 2 |
| 0.0 | 0.0 | 0.0 | 1.0 | 1 |
| 0.0 | 1.0 | 0.0 | 0.0 | 0 |
+------------+------------+------------+------------+------------+

Running command

drop offlinemodel if exists multi_lr_test_model;
drop table if exists multi_lr_test_prediction_result;
PAI -name logisticregression_multi -project algo_public -DmodelName="multi_lr_test_model" -DitemDelimiter="," -DregularizedLevel="1" -DmaxIter="100" -DregularizedType="None" -Depsilon="0.000001" -DkvDelimiter=":" -DlabelColName="label" -DfeatureColNames="f0,f1,f2,f3" -DenableSparse="false" -DinputTableName="multi_lr_test_input";
PAI -name prediction -project algo_public -DdetailColName="prediction_detail" -DmodelName="multi_lr_test_model" -DitemDelimiter="," -DresultColName="prediction_result" -Dlifecycle="28" -DoutputTableName="multi_lr_test_prediction_result" -DscoreColName="prediction_score" -DkvDelimiter=":" -DinputTableName="multi_lr_test_input" -DenableSparse="false" -DappendColNames="label";

Running result

multi_lr_test_prediction_result

+------------+-------------------+------------------+-------------------+
| label | prediction_result | prediction_score | prediction_detail |
+------------+-------------------+------------------+-------------------+
| 0 | 0 | 0.9999997274902165 | {"0": 0.9999997274902165, "1": 2.324679066261573e-07, "2": 2.324679066261569e-07} |
| 0 | 0 | 0.9999997274902165 | {"0": 0.9999997274902165, "1": 2.324679066261573e-07, "2": 2.324679066261569e-07} |
| 2 | 2 | 0.9999999155958832 | {"0": 2.018833979850994e-07, "1": 2.324679066261573e-07, "2": 0.9999999155958832} |
| 1 | 1 | 0.9999999155958832 | {"0": 2.018833979850994e-07, "1": 0.9999999155958832, "2": 2.324679066261569e-07} |
+------------+-------------------+------------------+-------------------+

GBDT binary classification

This component solves binary classification problems based on GBDT regression and ranking. Scores greater than the preset threshold are classified as positive, and scores smaller than the threshold as negative.
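A minimal sketch of the thresholding rule just described: the per-tree scores are summed, squashed to a probability with the logistic function (GBDT_LR uses the logistic regression loss, as the important notes below point out), and compared with the threshold. The function name and scores are illustrative, not the component's internals.

import math

def gbdt_lr_classify(tree_scores, threshold=0.5):
    """Sum tree outputs, map to a probability, then apply the threshold:
    above the threshold is positive, below it is negative."""
    p = 1.0 / (1.0 + math.exp(-sum(tree_scores)))
    return (1 if p > threshold else 0), p

print(gbdt_lr_classify([0.8, 0.4, -0.1]))   # label 1, p ~ 0.75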

PAI command

PAI -name gbdt_lr
-project algo_public
-DfeatureSplitValueMaxSize="500"
-DrandSeed="0"
-Dshrinkage="0.5"
-DmaxLeafCount="32"
-DlabelColName="y"
-DinputTableName="bank_data_partition"
-DminLeafSampleCount="500"
-DgroupIDColName="nr_employed"
-DsampleRatio="0.6"
-DmaxDepth="11"
-DmodelName="xlab_m_GBDT_LR_21208"
-DmetricType="2"
-DfeatureRatio="0.6"
-DinputTablePartitions="pt=20150501"
-DtestRatio="0.0"
-DfeatureColNames="age,previous,cons_conf_idx,euribor3m"
-DtreeCount="500";

Parameter description

Parameter | Description | Value range | Required/Optional, default value
inputTableName | Input table | Table name | Required
featureColNames | Names of the feature columns used for training in the input table | Column names | Optional, default: all columns with values
labelColName | Name of the label column in the input table | Column name | Required
inputTablePartitions | Partitions used for training in the input table, in the format of partition_name=value. The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas (,). | NA | Optional, default: all partitions
modelName | Name of the output model | NA | Required
outputImportanceTableName | Name of the output feature importance table | NA | Optional
groupIDColName | Name of a grouping column | Column name | Optional, default: full table
lossType | Loss function type. 0: GBRANK; 1: LAMBDAMART_DCG; 2: LAMBDAMART_NDCG; 3: LEAST_SQUARE; 4: LOG_LIKELIHOOD | 0, 1, 2, 3, 4 | Optional, default: 0
metricType | Metric type. 0 (NDCG): normalized discounted cumulative gain; 1 (DCG): discounted cumulative gain; 2 (AUC): applicable only to 0/1 labels | 0, 1, 2 | Optional, default: 2
treeCount | Number of trees | [1, 10,000] | Optional, default: 500
shrinkage | Learning rate | (0, 1] | Optional, default: 0.05
maxLeafCount | Maximum number of leaves, which must be an integer | [2, 1000] | Optional, default: 32
maxDepth | Maximum depth of a tree, which must be an integer | [1, 11] | Optional, default: 11
minLeafSampleCount | Minimum number of samples on a leaf node, which must be an integer | [100, 1000] | Optional, default: 500
sampleRatio | Ratio of samples collected during training | (0, 1] | Optional, default: 0.6
featureRatio | Ratio of features collected during training | (0, 1] | Optional, default: 0.6
tau | Parameter tau in the gbrank loss | [0, 1] | Optional, default: 0.6
p | Parameter p in the gbrank loss | [1, 10] | Optional, default: 1
randSeed | Random seed | [0, 10] | Optional, default: 0
newtonStep | Whether to use Newton's method | 0, 1 | Optional, default: 1
featureSplitValueMaxSize | Number of records that a feature can split into | [1, 1000] | Optional, default: 500
lifecycle | Lifecycle of the output table | NA | Optional, default: not set

Example

Data generation

drop table if exists gbdt_lr_test_input;
create table gbdt_lr_test_input
as
select
    *
from
(
    select
        cast(1 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(1 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(1 as double) as f2,
        cast(0 as double) as f3,
        cast(1 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(1 as double) as f3,
        cast(1 as bigint) as label
    from dual
    union all
    select
        cast(1 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(1 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
) a;

PAI command

Training

drop offlinemodel if exists gbdt_lr_test_model;
PAI -name gbdt_lr
-project algo_public
-DfeatureSplitValueMaxSize="500"
-DrandSeed="1"
-Dshrinkage="1"
-DmaxLeafCount="30"
-DlabelColName="label"
-DinputTableName="gbdt_lr_test_input"
-DminLeafSampleCount="1"
-DsampleRatio="1"
-DmaxDepth="10"
-DmodelName="gbdt_lr_test_model"
-DmetricType="0"
-DfeatureRatio="1"
-DtestRatio="0"
-DfeatureColNames="f0,f1,f2,f3"
-DtreeCount="5";

Prediction

drop table if exists gbdt_lr_test_prediction_result;
PAI -name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DmodelName="gbdt_lr_test_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="gbdt_lr_test_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DinputTableName="gbdt_lr_test_input"
-DenableSparse="false"
-DappendColNames="label";

Input description

gbdt_lr_test_input

f0 f1 f2 f3 label
1.0 0.0 0.0 0.0 0
0.0 0.0 1.0 0.0 1
0.0 0.0 0.0 1.0 1
0.0 1.0 0.0 0.0 0
1.0 0.0 0.0 0.0 0
0.0 1.0 0.0 0.0 0

Output description

gbdt_lr_test_prediction_result

label | prediction_result | prediction_score | prediction_detail
0 | 0 | 0.9984308925552831 | {"0": 0.9984308925552831, "1": 0.001569107444716943}
0 | 0 | 0.9984308925552831 | {"0": 0.9984308925552831, "1": 0.001569107444716943}
1 | 1 | 0.9982721832240973 | {"0": 0.001727816775902724, "1": 0.9982721832240973}
1 | 1 | 0.9982721832240973 | {"0": 0.001727816775902724, "1": 0.9982721832240973}
0 | 0 | 0.9984308925552831 | {"0": 0.9984308925552831, "1": 0.001569107444716943}
0 | 0 | 0.9984308925552831 | {"0": 0.9984308925552831, "1": 0.001569107444716943}

Important notes

  • GBDT and GBDT_LR have different default loss functions. The default loss function of GBDT is the regression loss (mean squared error), and that of GBDT_LR is the logistic regression loss. The system automatically sets the loss function for GBDT_LR, so you do not need to set a loss function type for it.

  • For GBDT binary classification, the label column can only be a binary classification column and does not support data of the string type.

  • When connecting to the ROC curve component, select the custom mode for the prediction component and select a base value.

K-NN

For each row of the prediction table, this component selects the K records in the training table nearest to that row and takes the class with the largest number of occurrences among those K records as the class of the row.

The K-NN algorithm solves classification issues.
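The voting rule is easy to state in code. Below is a minimal NumPy sketch (not the component's distributed implementation) that classifies one prediction row by majority vote among its k nearest training rows, using the same toy data as the example later in this section.

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=2):
    """Majority vote among the k training rows closest to x (Euclidean)."""
    nearest = np.argsort(np.linalg.norm(train_X - x, axis=1))[:k]
    label, count = Counter(train_y[i] for i in nearest).most_common(1)[0]
    return label, count / k                 # class and its vote share

train_X = np.array([[1, 2], [1, 3], [1, 4], [0, 3], [0, 4]])
train_y = np.array(['good', 'good', 'bad', 'good', 'bad'])
print(knn_predict(train_X, train_y, np.array([1, 2]), k=2))   # ('good', 1.0)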

PAI command

PAI -name knn
-DtrainTableName=pai_knn_test_input
-DtrainFeatureColNames=f0,f1
-DtrainLabelColName=class
-DpredictTableName=pai_knn_test_input
-DpredictFeatureColNames=f0,f1
-DoutputTableName=pai_knn_test_output
-Dk=2;

Parameter description

Parameter | Description | Option | Default value
trainTableName | (Required) Name of the input table | NA | NA
trainFeatureColNames | (Required) Names of the feature columns in the training table | NA | NA
trainLabelColName | (Required) Name of the label column in the training table | NA | NA
trainTablePartitions | (Optional) Partitions used for training in the training table | NA | All partitions
predictTableName | (Required) Name of the prediction table | NA | NA
outputTableName | (Required) Name of the output table | NA | NA
predictFeatureColNames | (Optional) Names of feature columns in the prediction table | NA | Same as trainFeatureColNames
predictTablePartitions | (Optional) Partitions used for prediction in the prediction table | NA | All partitions
appendColNames | (Optional) Names of prediction table columns appended to the output table | NA | Same as predictFeatureColNames
outputTablePartition | (Optional) Partitions in the output table | NA | Output table not partitioned
k | (Optional) Number of nearest neighbors | Positive integer in the range of [1, 1000] | 100
enableSparse | (Optional) Whether data in the input table is in sparse format | true, false | false
itemDelimiter | (Optional) Delimiter used between key-value pairs when data in the input table is in sparse format | NA | Space
kvDelimiter | (Optional) Delimiter used between keys and values when data in the input table is in sparse format | NA | Colon
coreNum | (Optional) Number of nodes, used together with the memSizePerCore parameter | Positive integer in the range of [1, 20000] | Automatically calculated
memSizePerCore | (Optional) Memory size of each node, in MB | Positive integer in the range of [1024, 64*1024] | Automatically calculated
lifecycle | (Optional) Lifecycle of the output table | Positive integer | No lifecycle

Example

Test data

create table pai_knn_test_input as
select * from
(
    select 1 as f0, 2 as f1, 'good' as class from dual
    union all
    select 1 as f0, 3 as f1, 'good' as class from dual
    union all
    select 1 as f0, 4 as f1, 'bad' as class from dual
    union all
    select 0 as f0, 3 as f1, 'good' as class from dual
    union all
    select 0 as f0, 4 as f1, 'bad' as class from dual
) tmp;

PAI command

pai -name knn
-DtrainTableName=pai_knn_test_input
-DtrainFeatureColNames=f0,f1
-DtrainLabelColName=class
-DpredictTableName=pai_knn_test_input
-DpredictFeatureColNames=f0,f1
-DoutputTableName=pai_knn_test_output
-Dk=2;

Output description

f0 and f1 are appended columns in the output table.

  • prediction_result: classification result.
  • prediction_score: probability of the classification result.
  • prediction_detail: classes of the K nearest neighbors and their probabilities.

    image

Random forest

A random forest is a classifier that contains multiple decision trees. The output class is the mode of classes of the individual trees.

ID3, C4.5, or CART can be used as the single decision tree algorithm.
For more information about the random forest, see wiki.
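The mode-of-classes rule can be illustrated with a few lines of Python. This is a hand-written sketch of majority voting (matching the multipleModelMethod="majorityVote" element in the PMML output below), with toy stand-in "trees" rather than real trained ones.

from collections import Counter

def forest_predict(trees, x):
    """Return the class predicted by the most individual trees."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Three toy stand-ins for single decision trees over features (f0, f1):
trees = [
    lambda x: 'good' if x[1] in (2, 3) else 'bad',
    lambda x: 'good' if x[1] <= 3 else 'bad',
    lambda x: 'good' if x[0] > 0.5 else 'bad',
]
print(forest_predict(trees, (1, 2)))   # good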

PAI command

PAI -name randomforests
-project algo_public
-DinputTableName="pai_rf_test_input"
-DmodelName="pai_rf_test_model"
-DforceCategorical="f1"
-DlabelColName="class"
-DfeatureColNames="f0,f1"
-DmaxRecordSize="100000"
-DminNumPer="0"
-DminNumObj="2"
-DtreeNum="3";

Parameter description

Parameter | Description | Value range | Required/Optional, default value
inputTableName | Input table | Table name | Required
inputTablePartitions | Partitions used for training in the input table, in the format of partition_name=value. The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas (,). | NA | Optional, default: all partitions
labelColName | Name of the label column in the input table | Column name | Required
modelName | Name of the output model | NA | Required
treeNum | Number of trees in the random forest | Positive integer in the range of (0, 1000] | Required
weightColName | Name of the weight column in the input table | NA | Optional, default: no weight column
featureColNames | Names of feature columns used for training in the input table | NA | Optional, default: all columns except labelColName and weightColName
excludedColNames | Names of feature columns excluded from training in the input table, mutually exclusive with featureColNames | NA | Optional, default: empty
forceCategorical | By default, columns of the string, boolean, and datetime types are parsed as discrete columns, and columns of the double and bigint types as continuous columns. Specify forceCategorical to parse bigint columns as categorical columns. | NA | Optional, default: int columns are parsed as continuous columns
algorithmTypes | Positions of the single decision tree algorithms in the forest. The value contains two numbers [a, b]: if the forest has n trees, trees [0, a) use ID3, trees [a, b) use CART, and trees [b, n) use C4.5. For example, if this parameter is set to [2, 4] for a forest with five trees, trees [0, 2) use ID3, trees [2, 4) use CART, and tree 4 uses C4.5. If the value is None, the algorithms are evenly allocated in the forest. | NA | Optional, default: algorithms evenly allocated in the forest
randomColNum | Number of random features selected in each split during generation of a single decision tree | [1, N], where N is the number of features | Optional, default: log2N
minNumObj | Minimum number of samples on a leaf node | Positive number | Optional, default: 2
minNumPer | Minimum ratio of samples on a leaf node to those on its parent node | [0, 1] | Optional, default: 0.0
maxTreeDeep | Maximum depth of a tree | [1, ∞) | Optional, default: ∞
maxRecordSize | Number of randomly selected records used for each tree in the forest | (1000, 1,000,000] | Optional, default: 100,000

Example

Test data

create table pai_rf_test_input as
select * from
(
    select 1 as f0, 2 as f1, "good" as class from dual
    union all
    select 1 as f0, 3 as f1, "good" as class from dual
    union all
    select 1 as f0, 4 as f1, "bad" as class from dual
    union all
    select 0 as f0, 3 as f1, "good" as class from dual
    union all
    select 0 as f0, 4 as f1, "bad" as class from dual
) tmp;

PAI command

PAI -name randomforests
-project algo_public
-DinputTableName="pai_rf_test_input"
-DmodelName="pai_rf_test_model"
-DforceCategorical="f1"
-DlabelColName="class"
-DfeatureColNames="f0,f1"
-DmaxRecordSize="100000"
-DminNumPer="0"
-DminNumObj="2"
-DtreeNum="3";

Output description

Model PMML

<?xml version="1.0" encoding="utf-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.2" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
<Header copyright="Copyright (c) 2014, Alibaba Inc." description="">
<Application name="ODPS/PMML" version="0.1.0"/>
<Timestamp>Tue, 12 Jul 2016 07:04:48 GMT</Timestamp>
</Header>
<DataDictionary numberOfFields="2">
<DataField name="f0" optype="continuous" dataType="integer"/>
<DataField name="f1" optype="continuous" dataType="integer"/>
<DataField name="class" optype="categorical" dataType="string">
<Value value="bad"/>
<Value value="good"/>
</DataField>
</DataDictionary>
<MiningModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
<MiningSchema>
<MiningField name="f0" usageType="active"/>
<MiningField name="f1" usageType="active"/>
<MiningField name="class" usageType="target"/>
</MiningSchema>
<Segmentation multipleModelMethod="majorityVote">
<Segment id="0">
<True/>
<TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
<MiningSchema>
<MiningField name="f0" usageType="active"/>
<MiningField name="f1" usageType="active"/>
<MiningField name="class" usageType="target"/>
</MiningSchema>
<Node id="1">
<True/>
<ScoreDistribution value="bad" recordCount="2"/>
<ScoreDistribution value="good" recordCount="3"/>
<Node id="2" score="good">
<SimplePredicate field="f1" operator="equal" value="2"/>
<ScoreDistribution value="good" recordCount="1"/>
</Node>
<Node id="3" score="good">
<SimplePredicate field="f1" operator="equal" value="3"/>
<ScoreDistribution value="good" recordCount="2"/>
</Node>
<Node id="4" score="bad">
<SimplePredicate field="f1" operator="equal" value="4"/>
<ScoreDistribution value="bad" recordCount="2"/>
</Node>
</Node>
</TreeModel>
</Segment>
<Segment id="1">
<True/>
<TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
<MiningSchema>
<MiningField name="f0" usageType="active"/>
<MiningField name="f1" usageType="active"/>
<MiningField name="class" usageType="target"/>
</MiningSchema>
<Node id="1">
<True/>
<ScoreDistribution value="bad" recordCount="2"/>
<ScoreDistribution value="good" recordCount="3"/>
<Node id="2" score="good">
<SimpleSetPredicate field="f1" booleanOperator="isIn">
<Array n="2" type="integer">2 3</Array>
</SimpleSetPredicate>
<ScoreDistribution value="good" recordCount="3"/>
</Node>
<Node id="3" score="bad">
<SimpleSetPredicate field="f1" booleanOperator="isNotIn">
<Array n="2" type="integer">2 3</Array>
</SimpleSetPredicate>
<ScoreDistribution value="bad" recordCount="2"/>
</Node>
</Node>
</TreeModel>
</Segment>
<Segment id="2">
<True/>
<TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
<MiningSchema>
<MiningField name="f0" usageType="active"/>
<MiningField name="f1" usageType="active"/>
<MiningField name="class" usageType="target"/>
</MiningSchema>
<Node id="1">
<True/>
<ScoreDistribution value="bad" recordCount="2"/>
<ScoreDistribution value="good" recordCount="3"/>
<Node id="2" score="bad">
<SimplePredicate field="f0" operator="lessOrEqual" value="0.5"/>
<ScoreDistribution value="bad" recordCount="1"/>
<ScoreDistribution value="good" recordCount="1"/>
</Node>
<Node id="3" score="good">
<SimplePredicate field="f0" operator="greaterThan" value="0.5"/>
<ScoreDistribution value="bad" recordCount="1"/>
<ScoreDistribution value="good" recordCount="2"/>
</Node>
</Node>
</TreeModel>
</Segment>
</Segmentation>
</MiningModel>
</PMML>

Model visualization

image

Naive Bayes

The naive Bayes classifier is a simple probabilistic classifier that applies Bayes' theorem with strong (naive) independence assumptions between the features. In more precise probabilistic terms, the underlying model would be described as an "independent feature model." For details about this algorithm, see Naive Bayes classifier.
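To make the "independent feature model" concrete, here is a minimal sketch of a naive Bayes classifier for discrete features, estimating P(class) and P(feature value | class) by counting with simple add-one smoothing. The component itself also handles continuous features, which this illustrative sketch does not.

import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class priors and per-feature value counts per class."""
    prior = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(Counter)      # (class, feature index) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(c, j)][v] += 1
    return prior, cond

def predict_nb(prior, cond, row):
    """Pick the class maximizing log P(c) + sum_j log P(x_j | c)."""
    def logpost(c):
        lp = math.log(prior[c])
        for j, v in enumerate(row):
            counts = cond[(c, j)]
            lp += math.log((counts[v] + 1) / (sum(counts.values()) + len(counts) + 1))
        return lp
    return max(prior, key=logpost)

rows = [('sunny', 'hot'), ('sunny', 'mild'), ('rainy', 'mild'), ('rainy', 'hot')]
labels = ['no', 'no', 'yes', 'no']
print(predict_nb(*train_nb(rows, labels), ('rainy', 'mild')))   # yes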

Algorithm component

image

  • Feature column: Data type of feature columns can be string, double, or bigint.

  • Label column: You can select only a column other than feature columns. Its data type can be double, string, or bigint.

PAI command

PAI -name NaiveBayes -project algo_public -DmodelName="xlab_m_NaiveBayes_23772" \
    -DinputTablePartitions="pt=20150501" -DlabelColName="poutcome" \
    -DfeatureColNames="age,previous,cons_conf_idx,euribor3m" \
    -DisFeatureContinuous="1,1,1,1" \
    -DinputTableName="bank_data_partition";

Parameter description

Parameter | Description | Option | Default value
inputTableName | (Required) Name of the input table | NA | NA
inputTablePartitions | (Optional) Partitions used for training in the input table | Format: partition_name=value. The multilevel format is name1=value1/name2=value2; multiple partitions are separated by commas. | All partitions in the input table
modelName | (Required) Name of the output model | NA | NA
labelColName | (Required) Name of the label column in the input table | NA | NA
featureColNames | (Optional) Names of feature columns used for training in the input table | NA | All columns except the label column
excludedColNames | (Optional) Names of feature columns excluded from training in the input table, mutually exclusive with featureColNames | NA | Empty
forceCategorical | (Optional) By default, columns of the string, boolean, and datetime types are parsed as discrete columns, and columns of the double and bigint types as continuous columns. Specify forceCategorical to parse bigint columns as categorical columns. | NA | int columns are parsed as continuous columns

Example

Training data

id y f0 f1 f2 f3 f4 f5 f6 f7
1 -1 -0.294118 0.487437 0.180328 -0.292929 -1 0.00149028 -0.53117 -0.0333333
2 +1 -0.882353 -0.145729 0.0819672 -0.414141 -1 -0.207153 -0.766866 -0.666667
3 -1 -0.0588235 0.839196 0.0491803 -1 -1 -0.305514 -0.492741 -0.633333
4 +1 -0.882353 -0.105528 0.0819672 -0.535354 -0.777778 -0.162444 -0.923997 -1
5 -1 -1 0.376884 -0.344262 -0.292929 -0.602837 0.28465 0.887276 -0.6
6 +1 -0.411765 0.165829 0.213115 -1 -1 -0.23696 -0.894962 -0.7
7 -1 -0.647059 -0.21608 -0.180328 -0.353535 -0.791962 -0.0760059 -0.854825 -0.833333
8 +1 0.176471 0.155779 -1 -1 -1 0.052161 -0.952178 -0.733333
9 -1 -0.764706 0.979899 0.147541 -0.0909091 0.283688 -0.0909091 -0.931682 0.0666667
10 -1 -0.0588235 0.256281 0.57377 -1 -1 -1 -0.868488 0.1

Test data

id y f0 f1 f2 f3 f4 f5 f6 f7
1 +1 -0.882353 0.0854271 0.442623 -0.616162 -1 -0.19225 -0.725021 -0.9
2 +1 -0.294118 -0.0351759 -1 -1 -1 -0.293592 -0.904355 -0.766667
3 +1 -0.882353 0.246231 0.213115 -0.272727 -1 -0.171386 -0.981213 -0.7
4 -1 -0.176471 0.507538 0.278689 -0.414141 -0.702128 0.0491804 -0.475662 0.1
5 -1 -0.529412 0.839196 -1 -1 -1 -0.153502 -0.885568 -0.5
6 +1 -0.882353 0.246231 -0.0163934 -0.353535 -1 0.0670641 -0.627669 -1
7 -1 -0.882353 0.819095 0.278689 -0.151515 -0.307329 0.19225 0.00768574 -0.966667
8 +1 -0.882353 -0.0753769 0.0163934 -0.494949 -0.903073 -0.418778 -0.654996 -0.866667
9 +1 -1 0.527638 0.344262 -0.212121 -0.356974 0.23696 -0.836038 -0.8
10 +1 -0.882353 0.115578 0.0163934 -0.737374 -0.56974 -0.28465 -0.948762 -0.933333

  1. Create an experiment.

    image

  2. Select feature columns.

    image

  3. Select the label column.

    image

  4. Run the experiment.

  • The following model is generated.

    image

  • The following figure shows the predicted result.

    ex_naive_bayes_prediction

K-means clustering

K-means clustering is the most widely used clustering algorithm. It divides n objects into k clusters so that objects within each cluster have high similarity; the similarity is calculated based on the average value of the objects in a cluster.

This algorithm randomly selects k objects, each of which originally represents the average value or center of a cluster. Then, the algorithm assigns the remaining objects to the nearest clusters based on their distances from the center of each cluster, and re-calculates the average value of each cluster. This process repeats until the criterion function converges.

The K-means clustering algorithm assumes that object attributes come from a vector space, and its objective is to minimize the sum of squared errors within each group.
For more information about K-means clustering, see wiki.
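The iteration loop described above is short to write down. Below is a minimal NumPy sketch of Lloyd's algorithm with random initialization; the parameter names (loop, accuracy, seed) mirror the component's parameters, but this is an illustrative single-machine version, not the component itself.

import numpy as np

def kmeans(X, k, loop=100, accuracy=1e-5, seed=1):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(loop):
        # Assign each point to its nearest center (Euclidean distance).
        idx = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new = np.array([X[idx == j].mean(axis=0) if (idx == j).any() else centers[j]
                        for j in range(k)])
        if np.linalg.norm(new - centers) < accuracy:   # converged
            break
        centers = new
    return centers, idx

X = np.array([[1, 2], [1, 3], [1, 4], [0, 3], [0, 4]], dtype=float)
centers, idx = kmeans(X, k=3)
print(centers, idx)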

PAI command

pai -name kmeans
-project algo_public
-DinputTableName=pai_kmeans_test_input
-DselectedColNames=f0,f1
-DcenterCount=3
-Dloop=10
-Daccuracy=0.00001
-DdistanceType=euclidean
-DinitCenterMethod=random
-Dseed=1
-DmodelName=pai_kmeans_test_input_output_model
-DidxTableName=pai_kmeans_test_input_output_idx
-DclusterCountTableName=pai_kmeans_test_input_output_cc;

Parameter description

Parameter | Description | Value range | Required/Optional, default value
inputTableName | Input table | Table name | Required
selectedColNames | Names of the columns used for training in the input table, separated by commas; columns of the int and double types are supported | NA | Optional, default: all columns in the input table
inputTablePartitions | Partitions used for training in the input table, in the format of partition_name=value. The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas (,). | NA | Optional, default: all partitions
centerCount | Number of clusters | Positive integer in the range of [1, 1000] | Required
loop | Maximum number of iterations | Positive integer in the range of [1, 1000] | Optional, default: 100
accuracy | Termination condition of the algorithm. The algorithm terminates if the variation between two iterations is smaller than this value. | NA | Optional, default: 0.0
distanceType | Distance measurement method | euclidean (Euclidean distance), cosine (included angle cosine), cityblock (Manhattan distance) | Optional, default: euclidean
initCenterMethod | Centroid initialization method | random (random sampling), topk (first k rows of the input table), uniform (even distribution), kmpp (k-means++), external (specified initial centroid table) | Optional, default: random
initCenterTableName | Name of the initial centroid table | Table name | Valid only when initCenterMethod is set to external
seed | Initial random seed | Positive integer | Optional, default: current time. If the seed is set to a fixed value, clustering results do not fluctuate greatly.
enableSparse | Whether data in the input table is in sparse format | true, false | Optional, default: false
itemDelimiter | Delimiter used between key-value pairs when data in the input table is in sparse format | NA | Optional, default: space
kvDelimiter | Delimiter used between keys and values when data in the input table is in sparse format | NA | Optional, default: colon
appendColNames | Names of columns in inputTableName to be appended to the idxTableName table, separated by commas | NA | Optional, default: no appended column
modelName | Name of the output model | Model name | Required
idxTableName | Clustering result table corresponding to the input table, giving the cluster number of each record after clustering | Table name | Required
idxTablePartition | Partition of the output cluster table | Partition name | Optional, default: no partition
clusterCountTableName | Output cluster count table, giving the number of points in each cluster | Table name | Optional, default: no output
centerTableName | Output cluster center table | NA | Optional; about to be deprecated, the modelName parameter is recommended

Distance measurement method

Parameter | Description
euclidean | image
cosine | image
cityblock | image

Centroid initialization method

Parameter | Description
random | Randomly sample K initial centers from the input table. The initial random seed can be specified using the seed parameter.
topk | Read the first K rows of the input table as the initial centers.
uniform | Calculate K evenly distributed initial centers from the input table in ascending order of values.
kmpp | Use the k-means++ algorithm to select K initial centers. For details about this algorithm, see wiki.
external | Specify an external initial center table.

Example

Test data

create table pai_kmeans_test_input as
select * from
(
    select 1 as f0, 2 as f1 from dual
    union all
    select 1 as f0, 3 as f1 from dual
    union all
    select 1 as f0, 4 as f1 from dual
    union all
    select 0 as f0, 3 as f1 from dual
    union all
    select 0 as f0, 4 as f1 from dual
) tmp;

PAI command

pai -name kmeans
-project algo_public
-DinputTableName=pai_kmeans_test_input
-DselectedColNames=f0,f1
-DcenterCount=3
-Dloop=10
-Daccuracy=0.00001
-DdistanceType=euclidean
-DinitCenterMethod=random
-Dseed=1
-DmodelName=pai_kmeans_test_input_output_model
-DidxTableName=pai_kmeans_test_input_output_idx
-DclusterCountTableName=pai_kmeans_test_input_output_cc;

Output description

Model PMML

<?xml version="1.0" encoding="utf-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.2" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
<Header copyright="Copyright (c) 2014, Alibaba Inc." description="">
<Application name="ODPS/PMML" version="0.1.0"/>
<Timestamp>Fri, 15 Jul 2016 03:09:38 GMT</Timestamp>
</Header>
<DataDictionary numberOfFields="2">
<DataField name="f0" optype="continuous" dataType="integer"/>
<DataField name="f1" optype="continuous" dataType="integer"/>
<DataField name="cluster_index" optype="continuous" dataType="integer"/>
</DataDictionary>
<ClusteringModel modelName="xlab_m_KMeans_2_76889_v0" functionName="clustering" algorithmName="kmeans" modelClass="centerBased" numberOfClusters="3">
<MiningSchema>
<MiningField name="f0" usageType="active"/>
<MiningField name="f1" usageType="active"/>
</MiningSchema>
<ComparisonMeasure kind="distance" compareFunction="absDiff">
<squaredEuclidean/>
</ComparisonMeasure>
<ClusteringField field="f0" compareFunction="absDiff"/>
<ClusteringField field="f1" compareFunction="absDiff"/>
<Cluster>
<Array n="2" type="real">0 3.5</Array>
</Cluster>
<Cluster>
<Array n="2" type="real">1 4</Array>
</Cluster>
<Cluster>
<Array n="2" type="real">1 2.5</Array>
</Cluster>
</ClusteringModel>
</PMML>

Model visualization

image

Output cluster table: The number of rows equals the total number of rows in the input table. The values in each row represent the cluster numbers of the points in the corresponding row of the input table.

image

Output cluster count table: The number of rows equals the number of clusters. The values in each row represent the number of points in the corresponding clusters.

image

Linear regression

Linear regression is a model used to analyze the linear relationship between a dependent variable and multiple independent variables. For details, see https://en.wikipedia.org/wiki/Linear_regression.
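For dense, well-conditioned data, the fitted coefficients can be reproduced with ordinary least squares. The sketch below runs NumPy's least-squares solver on the example data from the test-data section later in this section; the printed coefficients should be close to the Intercept, x1, and x2 values in the lm_test_input_conf_out table shown there.

import numpy as np

# The (y, x1, x2) values from the lm_test_input example below.
x1 = np.array([1.84, 2.13, 3.89, 4.19, 5.76, 6.68, 7.58, 8.01, 9.02, 10.56])
x2 = np.array([1, 0, 0, 0, 0, 2, 0, 0, 3, 0], dtype=float)
y = np.arange(10.0, 101.0, 10.0)

X = np.column_stack([np.ones_like(x1), x1, x2])   # intercept, x1, x2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # approximately [-6.42, 10.26, 0.35]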

PAI command

PAI -name linearregression
-project algo_public
-DinputTableName=lm_test_input
-DfeatureColNames=x
-DlabelColName=y
-DmodelName=lm_test_input_model_out;

Algorithm parameters

Parameter | Description | Option | Default value
inputTableName | (Required) Name of the input table | NA | NA
modelName | (Required) Name of the output model | NA | NA
outputTableName | (Optional) Name of the output model evaluation table; required when enableFitGoodness is true | NA | ""
labelColName | (Required) Dependent variable | Double or bigint type, only one column allowed | NA
featureColNames | (Required) Independent variables | double or bigint in dense format, string in sparse format; multiple columns allowed | NA
inputTablePartitions | (Optional) Partitions in the input table | NA | ""
maxIter | (Optional) Maximum number of iterations | NA | 100
epsilon | (Optional) Minimum likelihood deviation | NA | 0.000001
enableSparse | (Optional) Whether the input table data is in sparse format | true, false | false
enableFitGoodness | (Optional) Whether to perform model evaluation, based on metrics including R-squared, adjusted R-squared, AIC, degree of freedom, residual standard deviation, and deviance | true, false | false
enableCoefficientEstimate | (Optional) Whether to perform regression coefficient evaluation, based on metrics including the t-value, p-value, and confidence interval [2.5%, 97.5%]. Valid only when enableFitGoodness is true; treated as false otherwise. | true, false | false
itemDelimiter | (Optional) Delimiter used between key-value pairs in sparse format, valid only when enableSparse is true | NA | Space on the command line interface and comma (,) on web pages
kvDelimiter | (Optional) Delimiter used between keys and values in sparse format, valid only when enableSparse is true | NA | Colon (:)
lifecycle | (Optional) Lifecycle of the model evaluation output table | NA | -1
coreNum | (Optional) Total number of instances | [1, 800) | Automatically calculated
memSizePerCore | (Optional) Memory size per core | [1024, 20*1024] | Automatically calculated

Example

Test data

SQL statement for data generation:

drop table if exists lm_test_input;
create table lm_test_input as
select
    *
from
(
    select 10 as y, 1.84 as x1, 1 as x2, '0:1.84 1:1' as sparsecol1 from dual
    union all
    select 20 as y, 2.13 as x1, 0 as x2, '0:2.13' as sparsecol1 from dual
    union all
    select 30 as y, 3.89 as x1, 0 as x2, '0:3.89' as sparsecol1 from dual
    union all
    select 40 as y, 4.19 as x1, 0 as x2, '0:4.19' as sparsecol1 from dual
    union all
    select 50 as y, 5.76 as x1, 0 as x2, '0:5.76' as sparsecol1 from dual
    union all
    select 60 as y, 6.68 as x1, 2 as x2, '0:6.68 1:2' as sparsecol1 from dual
    union all
    select 70 as y, 7.58 as x1, 0 as x2, '0:7.58' as sparsecol1 from dual
    union all
    select 80 as y, 8.01 as x1, 0 as x2, '0:8.01' as sparsecol1 from dual
    union all
    select 90 as y, 9.02 as x1, 3 as x2, '0:9.02 1:3' as sparsecol1 from dual
    union all
    select 100 as y, 10.56 as x1, 0 as x2, '0:10.56' as sparsecol1 from dual
) tmp;

Running command

PAI -name linearregression
-project algo_public
-DinputTableName=lm_test_input
-DlabelColName=y
-DfeatureColNames=x1,x2
-DmodelName=lm_test_input_model_out
-DoutputTableName=lm_test_input_conf_out
-DenableCoefficientEstimate=true
-DenableFitGoodness=true
-Dlifecycle=1;
PAI -name prediction
-project algo_public
-DmodelName=lm_test_input_model_out
-DinputTableName=lm_test_input
-DoutputTableName=lm_test_input_predict_out
-DappendColNames=y;

Running result

lm_test_input_conf_out

+------------+------------+------------+------------+--------------------+------------+
| colname | value | tscore | pvalue | confidenceinterval | p |
+------------+------------+------------+------------+--------------------+------------+
| Intercept | -6.42378496687763 | -2.2725755951390028 | 0.06 | {"2.5%": -11.964027, "97.5%": -0.883543} | coefficient |
| x1 | 10.260063429838898 | 23.270944360826963 | 0.0 | {"2.5%": 9.395908, "97.5%": 11.124219} | coefficient |
| x2 | 0.35374498323846265 | 0.2949247320997519 | 0.81 | {"2.5%": -1.997160, "97.5%": 2.704650} | coefficient |
| rsquared | 0.9879675667384592 | NULL | NULL | NULL | goodness |
| adjusted_rsquared | 0.9845297286637332 | NULL | NULL | NULL | goodness |
| aic | 59.331109494251805 | NULL | NULL | NULL | goodness |
| degree_of_freedom | 7.0 | NULL | NULL | NULL | goodness |
| standardErr_residual | 3.765777749448906 | NULL | NULL | NULL | goodness |
| deviance | 99.26757440771128 | NULL | NULL | NULL | goodness |
+------------+------------+------------+------------+--------------------+------------+

lm_test_input_predict_out

+------------+-------------------+------------------+-------------------+
| y | prediction_result | prediction_score | prediction_detail |
+------------+-------------------+------------------+-------------------+
| 10 | NULL | 12.808476727264404 | {"y": 12.8084767272644} |
| 20 | NULL | 15.43015013867922 | {"y": 15.43015013867922} |
| 30 | NULL | 33.48786177519568 | {"y": 33.48786177519568} |
| 40 | NULL | 36.565880804147355 | {"y": 36.56588080414735} |
| 50 | NULL | 52.674180388994415 | {"y": 52.67418038899442} |
| 60 | NULL | 62.82092871092313 | {"y": 62.82092871092313} |
| 70 | NULL | 71.34749583130122 | {"y": 71.34749583130122} |
| 80 | NULL | 75.75932310613193 | {"y": 75.75932310613193} |
| 90 | NULL | 87.1832221199846 | {"y": 87.18322211998461} |
| 100 | NULL | 101.92248485222113 | {"y": 101.9224848522211} |
+------------+-------------------+------------------+-------------------+

GBDT regression

GBDT, short for gradient boosting decision tree, is an iterative algorithm that builds multiple decision trees; the final output is the sum of the conclusions of all trees. GBDT applies to almost all regression problems (linear or nonlinear), giving it a wider scope of application than linear models. For details, see the following references: (a) A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments, (b) From RankNet to LambdaRank to LambdaMART: An Overview.
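The "sum of conclusions of all trees" can be illustrated with least-squares boosting on decision stumps. This hand-written sketch (fit each new stump to the residual of the current ensemble, scale it by the learning rate, and add it to the prediction) is a toy stand-in for the component, and all names are illustrative.

import numpy as np

def boost_stumps(X, y, tree_count=10, shrinkage=0.5):
    """Least-squares gradient boosting with one-split stumps."""
    pred, stumps = np.zeros(len(y)), []
    for _ in range(tree_count):
        residual, best = y - pred, None
        for j in range(X.shape[1]):               # search feature/threshold
            for t in np.unique(X[:, j])[:-1]:     # skip the trivial all-left split
                left = X[:, j] <= t
                lv, rv = residual[left].mean(), residual[~left].mean()
                err = ((residual - np.where(left, lv, rv)) ** 2).sum()
                if best is None or err < best[0]:
                    best = (err, j, t, lv, rv)
        _, j, t, lv, rv = best
        pred += shrinkage * np.where(X[:, j] <= t, lv, rv)
        stumps.append((j, t, shrinkage * lv, shrinkage * rv))
    return stumps, pred

X = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
              [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
y = np.array([0, 0, 1, 1, 0, 0], dtype=float)
_, pred = boost_stumps(X, y)
print(pred.round(3))   # approaches the labels as rounds accumulate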

PAI command

PAI -name gbdt
-project algo_public
-DfeatureSplitValueMaxSize="500"
-DlossType="0"
-DrandSeed="0"
-DnewtonStep="0"
-Dshrinkage="0.05"
-DmaxLeafCount="32"
-DlabelColName="campaign"
-DinputTableName="bank_data_partition"
-DminLeafSampleCount="500"
-DsampleRatio="0.6"
-DgroupIDColName="age"
-DmaxDepth="11"
-DmodelName="xlab_m_GBDT_83602"
-DmetricType="2"
-DfeatureRatio="0.6"
-DinputTablePartitions="pt=20150501"
-Dtau="0.6"
-Dp="1"
-DtestRatio="0.0"
-DfeatureColNames="previous,cons_conf_idx,euribor3m"
-DtreeCount="500";

Parameter description

Parameter | Description | Value range | Required/Optional, default value
inputTableName | Input table | Table name | Required
featureColNames | Names of the feature columns used for training in the input table | Column names | Optional, default: all columns with values
labelColName | Name of the label column in the input table | Column name | Required
inputTablePartitions | Partitions used for training in the input table, in the format of partition_name=value. The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas (,). | NA | Optional, default: all partitions
modelName | Name of the output model | NA | Required
outputImportanceTableName | Name of the output feature importance table | NA | Optional
groupIDColName | Name of a grouping column | Column name | Optional, default: full table
lossType | Loss function type. 0: GBRANK; 1: LAMBDAMART_DCG; 2: LAMBDAMART_NDCG; 3: LEAST_SQUARE; 4: LOG_LIKELIHOOD | 0, 1, 2, 3, 4 | Optional, default: 0
metricType | Metric type. 0 (NDCG): normalized discounted cumulative gain; 1 (DCG): discounted cumulative gain; 2 (AUC): applicable only to 0/1 labels | 0, 1, 2 | Optional, default: 2
treeCount | Number of trees | [1, 10,000] | Optional, default: 500
shrinkage | Learning rate | (0, 1] | Optional, default: 0.05
maxLeafCount | Maximum number of leaves, which must be an integer | [2, 1000] | Optional, default: 32
maxDepth | Maximum depth of a tree, which must be an integer | [1, 11] | Optional, default: 11
minLeafSampleCount | Minimum number of samples on a leaf node, which must be an integer | [100, 1000] | Optional, default: 500
sampleRatio | Ratio of samples collected during training | (0, 1] | Optional, default: 0.6
featureRatio | Ratio of features collected during training | (0, 1] | Optional, default: 0.6
tau | Parameter tau in the gbrank loss | [0, 1] | Optional, default: 0.6
p | Parameter p in the gbrank loss | [1, 10] | Optional, default: 1
randSeed | Random seed | [0, 10] | Optional, default: 0
newtonStep | Whether to use Newton's method | 0, 1 | Optional, default: 1
featureSplitValueMaxSize | Number of records that a feature can split into | [1, 1000] | Optional, default: 500
lifecycle | Lifecycle of the output table | NA | Optional, default: not set

Example

Data generation

drop table if exists gbdt_ls_test_input;
create table gbdt_ls_test_input
as
select
    *
from
(
    select
        cast(1 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(1 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(1 as double) as f2,
        cast(0 as double) as f3,
        cast(1 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(1 as double) as f3,
        cast(1 as bigint) as label
    from dual
    union all
    select
        cast(1 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
    union all
    select
        cast(0 as double) as f0,
        cast(1 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    from dual
) a;

PAI command

Training

drop offlinemodel if exists gbdt_ls_test_model;
PAI -name gbdt
-project algo_public
-DfeatureSplitValueMaxSize="500"
-DlossType="3"
-DrandSeed="0"
-DnewtonStep="1"
-Dshrinkage="0.5"
-DmaxLeafCount="32"
-DlabelColName="label"
-DinputTableName="gbdt_ls_test_input"
-DminLeafSampleCount="1"
-DsampleRatio="1"
-DmaxDepth="10"
-DmetricType="0"
-DmodelName="gbdt_ls_test_model"
-DfeatureRatio="1"
-Dp="1"
-Dtau="0.6"
-DtestRatio="0"
-DfeatureColNames="f0,f1,f2,f3"
-DtreeCount="10";

Prediction

drop table if exists gbdt_ls_test_prediction_result;
PAI -name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DmodelName="gbdt_ls_test_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="gbdt_ls_test_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DinputTableName="gbdt_ls_test_input"
-DenableSparse="false"
-DappendColNames="label";

Input description

gbdt_ls_test_input

f0 f1 f2 f3 label
1.0 0.0 0.0 0.0 0
0.0 0.0 1.0 0.0 1
0.0 0.0 0.0 1.0 1
0.0 1.0 0.0 0.0 0
1.0 0.0 0.0 0.0 0
0.0 1.0 0.0 0.0 0

Output description

gbdt_ls_test_prediction_result

label | prediction_result | prediction_score | prediction_detail
0 | NULL | 0.0 | {"label": 0}
0 | NULL | 0.0 | {"label": 0}
1 | NULL | 0.9990234375 | {"label": 0.9990234375}
1 | NULL | 0.9990234375 | {"label": 0.9990234375}
0 | NULL | 0.0 | {"label": 0}
0 | NULL | 0.0 | {"label": 0}

Collaborative filtering (etrec)

  • etrec is an item-based collaborative filtering algorithm that takes two input columns (user and item) and outputs the top K items with the highest similarity; a minimal sketch follows this list.

  • For more information about Jaccard similarity, see Jaccard_index.
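The following is an illustrative single-machine sketch of item-based similarity over (user, item) pairs, using plain cosine similarity over the sets of users per item (one of several similarity types the component supports) and keeping the topN most similar items per item; run on the example data below, it reproduces the etrec_test_result output.

from collections import defaultdict
from itertools import combinations

def item_similarity(pairs, top_n=2000):
    users = defaultdict(set)                  # item -> set of users who touched it
    for user, item in pairs:
        users[item].add(user)
    sims = defaultdict(list)
    for a, b in combinations(users, 2):       # every pair of items
        common = len(users[a] & users[b])
        if common:
            s = common / (len(users[a]) * len(users[b])) ** 0.5   # cosine
            sims[a].append((b, s))
            sims[b].append((a, s))
    return {i: sorted(v, key=lambda kv: -kv[1])[:top_n] for i, v in sims.items()}

pairs = [('0', '0'), ('0', '1'), ('1', '0'), ('1', '1')]
print(item_similarity(pairs))   # {'0': [('1', 1.0)], '1': [('0', 1.0)]}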

PAI command

PAI -name pai_etrec
-project algo_public
-DsimilarityType="wbcosine"
-Dweight="1"
-DminUserBehavior="2"
-Dlifecycle="28"
-DtopN="2000"
-Dalpha="0.5"
-DoutputTableName="etrec_test_result"
-DmaxUserBehavior="500"
-DinputTableName="etrec_test_input"
-Doperator="add"
-DuserColName="user"
-DitemColName="item";

Parameter description

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
userColName (Required) Name of the selected user column in the input table NA NA
itemColName (Required) Name of the selected item column in the input table NA NA
payloadColName (Optional) Name of the selected payload column in the input table NA No payload column
inputTablePartitions (Optional) Names of the selected partitions in the input table NA All partitions in the input table
outputTableName (Required) Name of the output table NA NA
outputTablePartition (Optional) Partition of the output table NA No partition
similarityType (Optional) Type of similarity wbcosine, asymcosine, jaccard wbcosine
topN (Optional) N items with the highest similarity [1, 10,000] 2000
minUserBehavior (Optional) Minimum user behavior [2, ∞) 2
maxUserBehavior (Optional) Maximum user behavior [2, 100,000] 500
itemDelimiter (Optional) Delimiter used between items in the output table NA " " (space)
kvDelimiter (Optional) Delimiter used between keys and values in the output table NA ":"
alpha (Optional) Value of the smoothing factor for asymcosine NA 0.5
weight (Optional) Weighting exponent for asymcosine NA 1.0
operator (Optional) Action taken when the same items exist for one user add, mul, min, max add
lifecycle (Optional) Lifecycle of the output table NA 1

Example

Data generation

  1. drop table if exists etrec_test_input;
  2. create table etrec_test_input
  3. as
  4. select
  5. *
  6. from
  7. (
  8. select
  9. cast(0 as string) as user,
  10. cast(0 as string) as item
  11. from dual
  12. union all
  13. select
  14. cast(0 as string) as user,
  15. cast(1 as string) as item
  16. from dual
  17. union all
  18. select
  19. cast(1 as string) as user,
  20. cast(0 as string) as item
  21. from dual
  22. union all
  23. select
  24. cast(1 as string) as user,
  25. cast(1 as string) as item
  26. from dual
  27. ) a;

PAI command

  1. drop table if exists etrec_test_result;
  2. PAI -name pai_etrec
  3. -project algo_public
  4. -DsimilarityType="wbcosine"
  5. -Dweight="1"
  6. -DminUserBehavior="2"
  7. -Dlifecycle="28"
  8. -DtopN="2000"
  9. -Dalpha="0.5"
  10. -DoutputTableName="etrec_test_result"
  11. -DmaxUserBehavior="500"
  12. -DinputTableName="etrec_test_input"
  13. -Doperator="add"
  14. -DuserColName="user"
  15. -DitemColName="item";

Input description

etrec_test_input

user item
0 0
0 1
1 0
1 1

Output description

etrec_test_result

itemid similarity
0 1:1
1 0:1

Confusion matrix

The confusion matrix is a visualization tool typically used in supervised learning. (In unsupervised learning, it is called a matching matrix.) It compares classification results with the actual measured values, so the accuracy of the classifier can be read directly off the matrix. For more information about the confusion matrix, see Confusion matrix.
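
As a minimal sketch of the tally itself, the following Python snippet builds a confusion matrix from a label column and a prediction column (the same ten rows as in the Example below):

    from collections import Counter

    def confusion_matrix(labels, predictions):
        classes = sorted(set(labels) | set(predictions))
        counts = Counter(zip(labels, predictions))
        # Rows are actual classes; columns are predicted classes.
        return classes, [[counts[(a, p)] for p in classes] for a in classes]

    labels      = list("AAAABBBBBA")
    predictions = list("ABAABBABAA")
    classes, matrix = confusion_matrix(labels, predictions)
    print(classes)  # ['A', 'B']
    print(matrix)   # [[4, 1], [2, 3]]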

Component connection: Generally, the classification prediction component is on the upper layer. The following figure shows an example.

image

Parameter settings

image

NOTE: Select either the label column or the detail column of the predicted result table; the two cannot be selected together.

Evaluation report

Right-click the confusion matrix to view the evaluation report.

  • Confusion matrix

    image

  • The following figure shows per-label information, including metrics such as accuracy and recall.

    image

PAI command

  • Sample of a PAI command without a threshold specified
  1. pai -name confusionmatrix -project algo_public \
  2. -DinputTableName=wpbc_pred \
  3. -DlabelColName=label \
  4. -DpredictionColName=prediction_result \
  5. -DoutputTableName=wpbc_confu;
  • Sample of a PAI command with a threshold specified
  1. pai -name confusionmatrix -project algo_public \
  2. -DinputTableName=wpbc_pred \
  3. -DlabelColName=label \
  4. -DpredictionDetailColName=prediction_detail \
  5. -DgoodValue=N \
  6. -Dthreshold=0.8 \
  7. -DoutputTableName=wpbc_confu;

Parameter description

Parameter key Description Option Default value
inputTableName (Required) Name of the input table (prediction output table) NA NA
labelColName (Required) Name of the original label column NA NA
outputTableName (Required) Name of the output table that stores the confusion matrix NA NA
inputTablePartition (Optional) Partition of the input table NA All partitions in the input table
predictionColName (Optional) Name of the label column of the predicted result table, required when no threshold is specified NA NA
predictionDetailColName (Optional) Name of the detail column of the predicted result table, required when a threshold is specified NA NA
threshold (Optional) Threshold for a good value NA 0.5
goodValue (Optional) Label value of a training coefficient during binary classification, required when a threshold is specified NA NA
lifecycle (Optional) Lifecycle of the output table Positive integer No lifecycle

Example

Test data

id label prediction_result
0 A A
1 A B
2 A A
3 A A
4 B B
5 B B
6 B A
7 B B
8 B A
9 A A
  1. Create an experiment.

    image

  2. Configure the parameters.

    image

  3. Run the experiment.

  • The confusion matrix is as follows:

    image

  • The following figure shows statistics about each label.

    image

Multiclass classification evaluation

This component evaluates a multiclass classification algorithm model based on the accuracy, kappa, F1 score, and other metrics in the predicted result and expected result of the classification model.

Component connection: The multiclass classification evaluation component needs to be connected to the prediction component and does not support regression models. The following figure shows an example of connection between components.

image

Parameter settings

image

  • In the Expected Classification Result Column drop-down list, you can select the Original Label column.

NOTE: A maximum of 1000 classes are supported.

  • In the Predicted Classification Result Column drop-down list, you can select the Predicted Label column. The default value is prediction_result.

  • In the Predicted Result Probability Column drop-down list, you can select the Predicted Label Probability column. Generally, the name of this field is prediction_detail.

NOTE: This parameter applies only to random forest prediction.

Evaluation report description

  • Metrics of each label are calculated using the one-vs-all method, as shown in the following figure and sketched in the code after this list.
    image

  • The following figure shows the summary of metrics, among which MacroAveraged indicates the average value of each label’s metrics.
    image

  • Confusion matrix.
    image
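
For reference, here is a hedged Python sketch of the one-vs-all computation behind the per-label metrics and the MacroAveraged summary, using the same ten labels and predictions as the confusion matrix example earlier; only sensitivity and specificity are shown, and the other metrics follow the same pattern:

    def one_vs_all(labels, predictions, positive):
        # Treat one class as positive and all others as negative.
        tp = sum(l == positive and p == positive for l, p in zip(labels, predictions))
        fn = sum(l == positive and p != positive for l, p in zip(labels, predictions))
        fp = sum(l != positive and p == positive for l, p in zip(labels, predictions))
        tn = sum(l != positive and p != positive for l, p in zip(labels, predictions))
        return {"Sensitivity": tp / (tp + fn), "Specificity": tn / (tn + fp)}

    labels      = list("AAAABBBBBA")
    predictions = list("ABAABBABAA")
    per_label = {c: one_vs_all(labels, predictions, c) for c in ("A", "B")}
    macro = {m: sum(v[m] for v in per_label.values()) / len(per_label)
             for m in ("Sensitivity", "Specificity")}
    print(per_label)  # A: 0.8 / 0.6, B: 0.6 / 0.8
    print(macro)      # MacroAveraged: 0.7 / 0.7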

PAI command

  1. PAI -name MultiClassEvaluation -project algo_public \
  2. -DinputTableName="test_input" \
  3. -DoutputTableName="test_output" \
  4. -DlabelColName="label" \
  5. -DpredictionColName="prediction_result" \
  6. -Dlifecycle=30;

Parameter settings

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for calculation in the input table NA All partitions in the input table
outputTableName (Required) Name of the output table NA NA
labelColName (Required) Name of the original label column in the input table NA NA
predictionColName (Required) Name of the label column of the predicted result table NA NA
predictionDetailColName (Optional) Name of the predicted result probability column, in the format of {"A":0.2, "B":0.3, "C":0.5} NA Empty
lifecycle (Optional) Lifecycle of the output table Positive integer No lifecycle
coreNum (Optional) Number of cores for calculation Positive integer Automatically assigned
memSizePerCore (Optional) Memory size for each core, in MB Positive integer in the range of (0, 65536) Automatically assigned

Description of the output table in JSON format

  1. {
  2. "LabelNumber": 3,
  3. "LabelList": ["A", "B", "C"],
  4. "ConfusionMatrix": [ // Confusion matrix [actual][predict]
  5. [100, 10, 20],
  6. [30, 50, 9],
  7. [7, 40, 90] ],
  8. "ProportionMatrix": [ // Proportion in each row [actual][predict]
  9. [0.6, 0.2, 0.2],
  10. [0.3, 0.6, 0.1],
  11. [0.1, 0.4, 0.5] ],
  12. "ActualLabelFrequencyList": [ // Actual frequency of each label
  13. 200, 300, 600],
  14. "ActualLabelProportionList": [ // Actual proportion of each label
  15. 0.1, 0.2, 0.7],
  16. "PredictedLabelFrequencyList": [ // Predicted frequency of each label
  17. 300, 400, 400],
  18. "PredictedLabelProportionList": [ // Predicted proportion of each label
  19. 0.2, 0.1, 0.7],
  20. "OverallMeasures": { // Overall metrics
  21. "Accuracy": 0.70,
  22. "Kappa" : 0.3,
  23. "MacroList": { // Average value of each label's metrics
  24. "Sensitivity": 0.4,
  25. "Specificity": 0.3,
  26. },
  27. "MicroList": { // Metrics calculated based on the sum of TP, TN, FP, and FN values of each label
  28. "Sensitivity": 0.4,
  29. "Specificity": 0.3,
  30. },
  31. "LabelFrequencyBasedMicro": { // Frequency-based weighted average value of each label's metrics
  32. "Sensitivity": 0.4,
  33. "Specificity": 0.3,
  34. },
  35. },
  36. "LabelMeasuresList": [ // Metrics of each label
  37. {
  38. "Accuracy": 0.6,
  39. "Sensitivity": 0.4,
  40. "Specificity": 0.3,
  41. "Kappa": 0.3
  42. },
  43. {
  44. "Accuracy": 0.6,
  45. "Sensitivity": 0.4,
  46. "Specificity": 0.3,
  47. "Kappa": 0.3
  48. }
  49. ]
  50. }

Example

Test data

data table example_input_mc_eval

id label prediction_result prediction_detail
0 A A {"A": 0.6, "B": 0.4}
1 A B {"A": 0.45, "B": 0.55}
2 A A {"A": 0.7, "B": 0.3}
3 A A {"A": 0.9, "B": 0.1}
4 B B {"A": 0.2, "B": 0.8}
5 B B {"A": 0.1, "B": 0.9}
6 B A {"A": 0.52, "B": 0.48}
7 B B {"A": 0.4, "B": 0.6}
8 B A {"A": 0.6, "B": 0.4}
9 A A {"A": 0.75, "B": 0.25}
  1. Create an experiment.

    image

  2. Configure the parameters.

    image

  3. Run the experiment.

  • The overall report is as follows:

    image

  • Statistics about each label are as follows:
    image

  • The output table in JSON format is as follows:

  1. {
  2. "ActualLabelFrequencyList": [5,
  3. 5],
  4. "ActualLabelProportionList": [0.5,
  5. 0.5],
  6. "ConfusionMatrix": [[4,
  7. 1],
  8. [2,
  9. 3]],
  10. "LabelList": ["A",
  11. "B"],
  12. "LabelMeasureList": [{
  13. "Accuracy": 0.7,
  14. "Auc": 0.9,
  15. "F1": 0.7272727272727273,
  16. "FalseDiscoveryRate": 0.3333333333333333,
  17. "FalseNegative": 1,
  18. "FalseNegativeRate": 0.2,
  19. "FalsePositive": 2,
  20. "FalsePositiveRate": 0.4,
  21. "Kappa": 0.3999999999999999,
  22. "NegativePredictiveValue": 0.75,
  23. "Precision": 0.6666666666666666,
  24. "Sensitivity": 0.8,
  25. "Specificity": 0.6,
  26. "TrueNegative": 3,
  27. "TruePositive": 4},
  28. {
  29. "Accuracy": 0.7,
  30. "Auc": 0.9,
  31. "F1": 0.6666666666666666,
  32. "FalseDiscoveryRate": 0.25,
  33. "FalseNegative": 2,
  34. "FalseNegativeRate": 0.4,
  35. "FalsePositive": 1,
  36. "FalsePositiveRate": 0.2,
  37. "Kappa": 0.3999999999999999,
  38. "NegativePredictiveValue": 0.6666666666666666,
  39. "Precision": 0.75,
  40. "Sensitivity": 0.6,
  41. "Specificity": 0.8,
  42. "TrueNegative": 4,
  43. "TruePositive": 3}],
  44. "LabelNumber": 2,
  45. "OverallMeasures": {
  46. "Accuracy": 0.7,
  47. "Kappa": 0.3999999999999999,
  48. "LabelFrequencyBasedMicro": {
  49. "Accuracy": 0.7,
  50. "F1": 0.696969696969697,
  51. "FalseDiscoveryRate": 0.2916666666666666,
  52. "FalseNegative": 1.5,
  53. "FalseNegativeRate": 0.3,
  54. "FalsePositive": 1.5,
  55. "FalsePositiveRate": 0.3,
  56. "Kappa": 0.3999999999999999,
  57. "NegativePredictiveValue": 0.7083333333333333,
  58. "Precision": 0.7083333333333333,
  59. "Sensitivity": 0.7,
  60. "Specificity": 0.7,
  61. "TrueNegative": 3.5,
  62. "TruePositive": 3.5},
  63. "LogLoss": 0.4548640449724484,
  64. "MacroAveraged": {
  65. "Accuracy": 0.7,
  66. "F1": 0.696969696969697,
  67. "FalseDiscoveryRate": 0.2916666666666666,
  68. "FalseNegative": 1.5,
  69. "FalseNegativeRate": 0.3,
  70. "FalsePositive": 1.5,
  71. "FalsePositiveRate": 0.3,
  72. "Kappa": 0.3999999999999999,
  73. "NegativePredictiveValue": 0.7083333333333333,
  74. "Precision": 0.7083333333333333,
  75. "Sensitivity": 0.7,
  76. "Specificity": 0.7,
  77. "TrueNegative": 3.5,
  78. "TruePositive": 3.5},
  79. "MicroAveraged": {
  80. "Accuracy": 0.7,
  81. "F1": 0.7,
  82. "FalseDiscoveryRate": 0.3,
  83. "FalseNegative": 3,
  84. "FalseNegativeRate": 0.3,
  85. "FalsePositive": 3,
  86. "FalsePositiveRate": 0.3,
  87. "Kappa": 0.3999999999999999,
  88. "NegativePredictiveValue": 0.7,
  89. "Precision": 0.7,
  90. "Sensitivity": 0.7,
  91. "Specificity": 0.7,
  92. "TrueNegative": 7,
  93. "TruePositive": 7}},
  94. "PredictedLabelFrequencyList": [6,
  95. 4],
  96. "PredictedLabelProportionList": [0.6,
  97. 0.4],
  98. "ProportionMatrix": [[0.8,
  99. 0.2],
  100. [0.4,
  101. 0.6]]}

Binary classification evaluation

The evaluation module can calculate the AUC, KS, and F1 score, and provide output data for drawing the KS curve, PR curve, ROC curve, LIFT chart, and gain chart. The module also supports grouping evaluation.
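As background, the following hedged Python sketch computes the KS statistic in its exact form, the maximum gap between the cumulative positive-rate and negative-rate curves over score thresholds; the component approximates this over the equal-frequency bins set by binCount:

    def ks_statistic(labels, scores, positive=1):
        # Assumes at least one positive and one negative sample.
        ranked = sorted(zip(scores, labels), reverse=True)
        pos_total = sum(1 for _, l in ranked if l == positive)
        neg_total = len(ranked) - pos_total
        tpr = fpr = best = 0.0
        for _, label in ranked:  # sweep the threshold from high to low scores
            if label == positive:
                tpr += 1 / pos_total
            else:
                fpr += 1 / neg_total
            best = max(best, tpr - fpr)
        return best

    print(ks_statistic([1, 0, 1, 0, 1], [0.9, 0.8, 0.7, 0.3, 0.2]))  # ~0.33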

PAI command

  1. pai -name=evaluate -project=algo_public
  2. -DoutputMetricTableName=output_metric_table
  3. -DoutputDetailTableName=output_detail_table
  4. -DinputTableName=input_data_table
  5. -DlabelColName=label
  6. -DscoreColName=score

Parameter description

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions in the input table NA All partitions in the input table
labelColName (Required) Name of the label column NA NA
scoreColName (Required) Name of the score column NA NA
groupColName (Optional) Name of the group column used in grouping evaluation. NA NA
binCount (Optional) Number of equal-frequency bins used to calculate metrics such as KS and PR NA 1000
outputMetricTableName (Required) Name of the output metric table, consisting of two columns (Metric and Value) and three rows (AUC, KS, and F1 Score) NA NA
outputDetailTableName (Optional) Name of the output detail table used for chart drawing NA NA
positiveLabel (Optional) Class of the positive sample NA 1
lifecycle (Optional) Lifecycle of the output table NA Unspecified by default
coreNum (Optional) Number of cores NA Automatically calculated by default
memSizePerCore (Optional) Memory size per core NA Automatically calculated by default

Parameter settings

image

  • In the Original Label Column drop-down list, you can select the Original Label column.
  • In the Score Column box, you can select the Prediction Score column. The default option is prediction_score.
  • In the Positive Sample Label box, you can select the label value corresponding to the positive sample.
  • In the Number of Bins with Same Frequency when Calculating Indexes such as KS and PR box, you can set the number of equal-frequency bins that the data is divided into. The default value is 1000.
  • In the Grouping Columns box, you can select the grouping columns used to divide data during grouping evaluation.

Evaluation result display

Right-click the Binary Classification Evaluation node and select View Evaluation Report from the shortcut menu to view the evaluation result, as shown in the following figure.

image

image

Regression model evaluation

This component evaluates a regression algorithm model based on metrics and residual histograms computed from the predicted results and expected results. The metrics include SST, SSE, SSR, R2, R, MSE, RMSE, MAE, MAD, MAPE, count, yMean, and predictionMean.

PAI command

  1. PAI -name regression_evaluation -project algo_public
  2. -DinputTableName=input_table
  3. -DyColName=y_col
  4. -DpredictionColName=prediction_col
  5. -DindexOutputTableName=index_output_table
  6. -DresidualOutputTableName=residual_output_table;
Parameter description

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for calculation in the input table NA All partitions in the input table
yColName (Required) Name of the original dependent variable column in the input table, numerical value NA NA
predictionColName (Required) Name of the predicted dependent variable column, numerical value NA NA
indexOutputTableName (Required) Name of the regression metric output table NA NA
residualOutputTableName (Required) Name of the residual histogram output table NA NA
intervalNum (Optional) Number of intervals in a histogram NA 100
lifecycle (Optional) Lifecycle of the output table Positive integer No lifecycle
coreNum (Optional) Number of instances NA Not set
memSizePerCore (Optional) Memory size for each core NA Not set

Output result

Regression metric output table

The output table is in JSON format. JSON fields are described as follows:

Field Description
SST Total sum of squares
SSE Sum of squared errors
SSR Sum of squares due to regression
R2 Coefficient of determination
R Coefficient of multiple correlation
MSE Mean squared error
RMSE Root-mean-square error
MAE Mean absolute error
MAD Mean absolute deviation
MAPE Mean absolute percentage error
count Number of rows
yMean Mean of original dependent variables
predictionMean Mean of predicted results
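
The following Python sketch computes the listed metrics by their standard definitions (an illustration, not the component's code):

    def regression_metrics(y, pred):
        n = len(y)
        y_mean = sum(y) / n
        sse = sum((a - b) ** 2 for a, b in zip(y, pred))  # sum of squared errors
        sst = sum((a - y_mean) ** 2 for a in y)           # total sum of squares
        ssr = sum((b - y_mean) ** 2 for b in pred)        # regression sum of squares
        return {
            "SST": sst, "SSE": sse, "SSR": ssr,
            "R2": 1 - sse / sst,
            "MSE": sse / n,
            "RMSE": (sse / n) ** 0.5,
            "MAE": sum(abs(a - b) for a, b in zip(y, pred)) / n,
            "MAPE": 100 * sum(abs((a - b) / a) for a, b in zip(y, pred)) / n,
            "count": n, "yMean": y_mean, "predictionMean": sum(pred) / n,
        }

    print(regression_metrics([3.0, 5.0, 7.0], [2.8, 5.3, 6.6]))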

Prediction

The prediction component is used to make model-based predictions. The component has two inputs (a trained model and prediction data) and one output (the predicted result).
Traditional data mining algorithms use this component to make predictions.

PAI command

  1. PAI -name prediction
  2. -DmodelName=nb_model
  3. -DinputTableName=wpbc
  4. -DoutputTableName=wpbc_pred
  5. -DappendColNames=label;

Parameter description

Parameter Description Value range Required/Optional, default value/act
inputTableName Name of the input table Table name Required
modelName Name of a model Model name Required
outputTableName Name of the output table Table name Required
featureColNames Names of the feature columns used for prediction in the input table Column names Optional, default: all columns
appendColNames Names of the columns in the prediction input table to be appended to the output table Column names Optional, default: no appended column
inputTablePartitions Partitions used for prediction in the input table, in the format of partition_name=value. The multilevel format is name1=value1/name2=value2; multiple partitions are separated by commas NA Optional, default: all partitions in the input table
outputTablePartition Partition in the output table NA Optional, default: no partition
resultColName Name of the result column in the output table NA Optional, default: prediction_result
scoreColName Name of the score column in the output table NA Optional, default: prediction_score
detailColName Name of the detail column in the output table NA Optional, default: prediction_detail
enableSparse Whether data in the input table is in sparse format true, false Optional, default: false
itemDelimiter Delimiter used between key-value pairs when data in the input table is in sparse format NA Optional, default: space
kvDelimiter Delimiter used between keys and values when data in the input table is in sparse format NA Optional, default: colon
lifecycle Lifecycle of the output table Positive integer Optional, default: no lifecycle
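
For clarity, here is a small sketch of the sparse input convention the table describes: key-value pairs separated by itemDelimiter (space by default), keys and values separated by kvDelimiter (colon by default):

    def parse_sparse(row, item_delim=" ", kv_delim=":"):
        # Parse one sparse-format row into a {feature_id: value} dict.
        features = {}
        for pair in row.strip().split(item_delim):
            key, value = pair.split(kv_delim)
            features[int(key)] = float(value)
        return features

    print(parse_sparse("1:0.3 3:0.9"))  # {1: 0.3, 3: 0.9}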

Prediction formula

Naive Bayes

Prediction_result formula: image

Prediction_score formula: image

Classification variables: image

Continuous variables: image
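
Because the formulas above are shown as images, here is a hedged sketch of the standard naive Bayes scoring they refer to (the priors and per-feature likelihoods are hypothetical numbers for one input row): prediction_result is the label with the highest score, and prediction_score is that label's log probability, matching the log form in the output description below.

    import math

    def nb_log_score(prior, feature_likelihoods):
        # log P(label) + sum_i log P(x_i | label) for one candidate label.
        return math.log(prior) + sum(math.log(p) for p in feature_likelihoods)

    scores = {"good": nb_log_score(0.6, [0.7, 0.5]),   # hypothetical values
              "bad":  nb_log_score(0.4, [0.2, 0.9])}
    result = max(scores, key=scores.get)  # prediction_result
    print(result, scores[result])         # prediction_score (log probability)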

K-means

Prediction_result formula: image

Prediction_score formulas:

euclidean: image

cosine: image

cityblock: image
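
Likewise, here is a hedged sketch of the three distance options by their standard definitions; prediction_score is the distance between the sample and its assigned center (for cosine, one minus the cosine similarity is a common convention, though the exact form may differ):

    def euclidean(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5

    def cityblock(x, c):
        return sum(abs(a - b) for a, b in zip(x, c))

    def cosine_distance(x, c):
        dot = sum(a * b for a, b in zip(x, c))
        norm = lambda v: sum(a * a for a in v) ** 0.5
        return 1 - dot / (norm(x) * norm(c))

    sample, center = [1.0, 2.0], [0.0, 2.5]
    print(euclidean(sample, center), cityblock(sample, center))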

Output description

Classification Model Prediction_result Prediction_score Prediction_detail
Binary classification Logistic regression model Predicted label Predicted label probability Each label and its probability
Linear SVM model Predicted label Predicted label probability Each label and its probability
Random forest model Predicted label Predicted label probability Each label and its probability
GBDT_LR model Predicted label Predicted label probability Each label and its probability
Naive Bayes model Predicted label Log (predicted label probability) Each label and its log (probability)
XGboost model Predicted label Predicted label probability Each label and its probability
Multiclass classification Logistic regression model Predicted label Predicted label probability Each label and its probability
Random forest model Predicted label Predicted label probability Label of each leaf node and its probability
Naive Bayes model Predicted label Log (predicted label probability) Each label and its log (probability)
Regression Linear regression model Empty Regression value Label column name: regression value
GBDT model Empty Regression value Label column name: regression value
XGboost model Empty Regression value Label column name: regression value
Cluster K-means model Predicted center sequence number Distance to the predicted center Distance to each center

Example

Test data

  1. create table pai_rf_test_input as
  2. select * from
  3. (
  4. select 1 as f0,2 as f1, "good" as class from dual
  5. union all
  6. select 1 as f0,3 as f1, "good" as class from dual
  7. union all
  8. select 1 as f0,4 as f1, "bad" as class from dual
  9. union all
  10. select 0 as f0,3 as f1, "good" as class from dual
  11. union all
  12. select 0 as f0,4 as f1, "bad" as class from dual
  13. )tmp;

Modeling

  1. PAI -name randomforests
  2. -project algo_public
  3. -DinputTableName="pai_rf_test_input"
  4. -DmodelName="pai_rf_test_model"
  5. -DforceCategorical="f1"
  6. -DlabelColName="class"
  7. -DfeatureColNames="f0,f1"
  8. -DmaxRecordSize="100000"
  9. -DminNumPer="0"
  10. -DminNumObj="2"
  11. -DtreeNum="3";

PAI prediction

  1. PAI -name prediction
  2. -project algo_public
  3. -DinputTableName=pai_rf_test_input
  4. -DmodelName=pai_rf_test_model
  5. -DoutputTableName=pai_rf_test_prediction_result
  6. -DresultColName=predict;

PS-SMART binary classification

PS stands for parameter server, a framework for online and offline training of large-scale models. Scalable Multiple Additive Regression Tree (SMART) is an implementation of the gradient boosting decision tree (GBDT) algorithm on PS. PS-SMART can run training tasks with up to tens of billions of samples and hundreds of thousands of features on thousands of nodes, and it supports failover to maintain high stability. Additionally, PS-SMART supports various data formats, training and evaluation targets, feature importance output, and training acceleration (such as histogram approximation).

Quick Start

image

As shown in the figure, a PS-SMART binary classification model is learned from the training data. The component has three output ports:

  • Output model: offline model, which is connected to the unified prediction component. This model does not support output of leaf node numbers.

  • Output model table: a binary table that is not readable. To ensure compatibility with the PS-SMART prediction component, the table provides outputs such as leaf node numbers and evaluation metrics. However, the output table has strict requirements on data formats, resulting in poor user experiences. It will be improved gradually or be replaced by another component.

  • Output feature importance table: lists importance of each feature. Three importance types are supported (see parameter description).

PAI command

Training

  1. PAI -name ps_smart
  2. -project algo_public
  3. -DinputTableName="smart_binary_input"
  4. -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
  5. -DoutputTableName="pai_temp_24515_545859_2"
  6. -DoutputImportanceTableName="pai_temp_24515_545859_3"
  7. -DlabelColName="label"
  8. -DfeatureColNames="f0,f1,f2,f3,f4,f5"
  9. -DenableSparse="false"
  10. -Dobjective="binary:logistic"
  11. -Dmetric="error"
  12. -DfeatureImportanceType="gain"
  13. -DtreeCount="5"
  14. -DmaxDepth="5"
  15. -Dshrinkage="0.3"
  16. -Dl2="1.0"
  17. -Dl1="0"
  18. -Dlifecycle="3"
  19. -DsketchEps="0.03"
  20. -DsampleRatio="1.0"
  21. -DfeatureRatio="1.0"
  22. -DbaseScore="0.5"
  23. -DminSplitLoss="0";

Prediction

  1. PAI -name prediction
  2. -project algo_public
  3. -DinputTableName="smart_binary_input";
  4. -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
  5. -DoutputTableName="pai_temp_24515_545860_1"
  6. -DfeatureColNames="f0,f1,f2,f3,f4,f5"
  7. -DappendColNames="label,qid,f0,f1,f2,f3,f4,f5"
  8. -DenableSparse="false"
  9. -Dlifecycle="28";

Parameter description

Data parameters

Command option Parameter Description Value range Required/Optional, default value
featureColNames Feature Column Names of the feature columns selected from the input table for training The columns must be bigint or double type in dense format, or string type in sparse KV format. If the sparse KV format is used, the keys and values must be numerical. Required
labelColName Label Column Name of the label column in the input table The column can be string or numerical type, but the values are stored as numbers. For binary classification, the values are 0 and 1. Required
weightCol Weight Column You can specify a weight for each row of samples. Column name, numerical type (Optional) It is left blank by default.
enableSparse Sparse Format Whether data is in sparse format, in which key-value pairs are separated by spaces whereas keys and values are separated by colons, for example, 1:0.3 3:0.9 [true, false] (Optional) The default value is false.
inputTableName Name of the input table NA NA Required
modelName Name of the output model NA NA Required
outputImportanceTableName Output Feature Importance Table Name NA NA (Optional) It is left blank by default.
inputTablePartitions Input Table Partitions NA NA Optional, in the format of ds=1/pt=1
outputTableName Output Model Table Name The output table is an ODPS table in binary format and is not readable. It is used by the self-contained PS-SMART prediction component, which can output leaf node numbers. NA (Optional) It is left blank by default.
lifecycle Output Table Lifecycle NA Positive integer (Optional) The default value is 3.

Algorithm parameters

Command option Parameter Description Value range Required/Optional, default value
objective Objective Function Type The objective function type directly affects learning and must be selected correctly. Select "binary:logistic" for binary classification. NA Required
metric Evaluation Metric Type Evaluation metrics in the training set, which are output to stdout of the coordinator in a logview logloss, error, auc (Optional) It is left blank by default.
treeCount Number of Decision Trees Number of decision trees. The training time is in direct proportion to this number. Positive integer (Optional) The default value is 1.
maxDepth Maximum Decision Tree Depth Maximum depth of a decision tree, recommended value: 5 (a maximum of 32 leaf nodes) Positive integer, [1, 20] (Optional) The default value is 5.
sampleRatio Data Sampling Ratio Data sampling ratio for creating a weak learner to accelerate training when building each decision tree (0, 1] (Optional) The default value is 1.0, indicating that data sampling is disabled.
featureRatio Feature Sampling Ratio Feature sampling ratio for creating a weak learner to accelerate training when building each decision tree (0, 1] (Optional) The default value is 1.0, indicating that feature sampling is disabled.
l1 L1 Penalty Coefficient This parameter controls the number of leaf nodes. The larger the value is, the fewer the leaf nodes are. Set a larger value for overfitting. Non-negative real number (Optional) The default value is 0.
l2 L2 Penalty Coefficient This parameter controls distribution of leaf nodes. The larger the value is, the more evenly the leaf nodes are distributed. Set a larger value for overfitting. Non-negative real number (Optional) The default value is 1.0.
shrinkage Learning Rate Step size applied to each tree's contribution; smaller values make training more conservative (0, 1] (Optional) The default value is 0.3.
sketchEps Sketch Precision Threshold of the splitting point when creating a sketch. The number of bins is O (1.0/sketchEps). The smaller the value is, the more bins are divided. This value does not need to be adjusted under normal conditions. (0, 1) (Optional) The default value is 0.03.
minSplitLoss Minimum Split Loss Minimum split loss required for splitting a node. The larger the value is, the more conservatively the node splits. Non-negative real number (Optional) The default value is 0.
featureNum Feature Quantity Number of features or the largest feature ID. Specify this parameter for resource usage estimation. Positive integer Optional
baseScore Global Offset Initial predicted value of all samples Real number (Optional) The default value is 0.5.
featureImportanceType Feature Importance Type Type of feature importance. “weight” indicates the number of times features are split. “gain” indicates information gain provided by the feature. “cover” indicates the number of samples that feature covers on the splitting node. “weight”, “gain”, “cover” (Optional) The default value is “gain”.

NOTE:

  • Specify different values for the objective parameter in different learning models. On the binary classification web GUI, the objective function is automatically specified and hidden from users. On the command line interface, set the objective parameter to “binary:logistic”.
  • Mappings between metrics and objective functions are: "logloss" for the negative log-likelihood of logistic regression, "error" for the binary classification error, and "auc" for the area under the ROC curve. The first two are sketched below.
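
A minimal sketch of the first two metrics, assuming the scores are predicted positive-class probabilities:

    import math

    def logloss(labels, probs):
        # Negative log-likelihood of logistic regression outputs.
        return -sum(math.log(p) if y == 1 else math.log(1 - p)
                    for y, p in zip(labels, probs)) / len(labels)

    def error(labels, probs, threshold=0.5):
        # Fraction of samples whose thresholded prediction misses the label.
        return sum(int((p > threshold) != (y == 1))
                   for y, p in zip(labels, probs)) / len(labels)

    print(logloss([1, 0, 1], [0.9, 0.2, 0.7]))  # ~0.228
    print(error([1, 0, 1], [0.9, 0.2, 0.7]))    # 0.0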

Execution optimization

Command option Parameter Description Value range Required/Optional, default value
coreNum Number of Cores Number of cores used for computing. The larger the value is, the faster the computing algorithm runs. Positive integer (Optional) Cores are automatically assigned by default.
memSizePerCore Memory Size per Core (MB) Memory size used by each core, where 1024 represents 1 GB of memory Positive integer (Optional) Memory is automatically assigned by default.

Example

Data generation

The following example generates data in dense format:

  1. drop table if exists smart_binary_input;
  2. create table smart_binary_input lifecycle 3 as
  3. select
  4. *
  5. from
  6. (
  7. select 0.72 as f0, 0.42 as f1, 0.55 as f2, -0.09 as f3, 1.79 as f4, -1.2 as f5, 0 as label from dual
  8. union all
  9. select 1.23 as f0, -0.33 as f1, -1.55 as f2, 0.92 as f3, -0.04 as f4, -0.1 as f5, 1 as label from dual
  10. union all
  11. select -0.2 as f0, -0.55 as f1, -1.28 as f2, 0.48 as f3, -1.7 as f4, 1.13 as f5, 1 as label from dual
  12. union all
  13. select 1.24 as f0, -0.68 as f1, 1.82 as f2, 1.57 as f3, 1.18 as f4, 0.2 as f5, 0 as label from dual
  14. union all
  15. select -0.85 as f0, 0.19 as f1, -0.06 as f2, -0.55 as f3, 0.31 as f4, 0.08 as f5, 1 as label from dual
  16. union all
  17. select 0.58 as f0, -1.39 as f1, 0.05 as f2, 2.18 as f3, -0.02 as f4, 1.71 as f5, 0 as label from dual
  18. union all
  19. select -0.48 as f0, 0.79 as f1, 2.52 as f2, -1.19 as f3, 0.9 as f4, -1.04 as f5, 1 as label from dual
  20. union all
  21. select 1.02 as f0, -0.88 as f1, 0.82 as f2, 1.82 as f3, 1.55 as f4, 0.53 as f5, 0 as label from dual
  22. union all
  23. select 1.19 as f0, -1.18 as f1, -1.1 as f2, 2.26 as f3, 1.22 as f4, 0.92 as f5, 0 as label from dual
  24. union all
  25. select -2.78 as f0, 2.33 as f1, 1.18 as f2, -4.5 as f3, -1.31 as f4, -1.8 as f5, 1 as label from dual
  26. ) tmp;

The following figure shows the generated data.

image

The table contains six feature columns.

Training

Configure training data and the training component according to Quick start. Select the label column as the target column and columns f0, f1, f2, f3, f4, f5 as feature columns. The following figure shows the algorithm parameter settings page.

image

You do not need to set the number of features because this number is calculated automatically by the algorithm. If you have a large number of features and want the algorithm to accurately estimate the amount of resources required, enter the actual number of features here.

To accelerate the training, you can set the number of cores on the execution optimization page. The larger the number is, the faster the algorithm runs. Generally, you do not need to enter the memory size per core because the algorithm can accurately estimate the memory size. In addition, the PS algorithm starts to run only when all hosts obtain resources. Therefore, you may need to wait for a longer time when the cluster is busy and requires many resources.

image

You can view the metric output values in stdout of the coordinator in a logview (an http link starting with "http://logview.odps.aliyun-inc.com:8080/logview"). One PS-SMART training job has multiple tasks; therefore, multiple logviews are available. Select the logview whose logs start with PS, as shown in the following figure.

image

The one in the red box is the logview of the PS job. You can identify different tasks by information in the green circle.

Then, perform operations in the logview according to the following figure.

image

Prediction

Use the unified prediction component

The output model obtained after training is saved in binary format and can be used for prediction. Configure the input model and test data for the prediction component according to Quick start and set parameters according to the following figure.

image

If the dense format is used, you only need to select the feature columns. (All columns are selected by default, and extra columns do not affect the prediction.) If the KV format is used, set the data format to sparse and select the correct delimiters. In the SMART model, key-value pairs are separated by space characters; therefore, the delimiter must be set to space or "\u0020" (the escape expression of a space).

The following figure shows the prediction result.

image

In the “prediction_detail” column, the value 1 indicates a positive sample, and the value 0 indicates a negative sample. The values following 0 and 1 indicate probabilities of the corresponding classes.

Use the PS-SMART prediction component

The output model table obtained after training is saved in binary format and can be used by the PS-SMART prediction component for prediction. Configure the input model and test data for the prediction component according to Quick start and set parameters, including the data format, feature columns, target column, and number of classes. The ID column can only be a string type column other than feature columns and the target column. The loss function must be explicitly set to “binary:logistic”. The following figure shows the prediction result.

image

The “prediction_score” column lists probabilities of predicted positive samples. A sample is predicted as a positive sample if its score is greater than 0.5. Otherwise, it is predicted as a negative sample. The “leaf_index” column lists the predicted leaf node numbers. Each sample has N numbers, where N is the number of decision trees. Each tree maps to a number, which indicates the number of the leaf node on this tree.

NOTE:

  • The output model table is a binary table that is not readable. To ensure compatibility with the PS-SMART prediction component, the table provides outputs such as leaf node numbers and evaluation metrics. However, the output table has strict requirements on data formats, resulting in poor user experiences. It will be improved gradually or be replaced by another component.
  • A string type column must be selected as the ID column. The column can contain character strings but cannot be left blank or contain NULL values. A feature column can be converted into the string type by using the data type conversion component and then used as the ID column.
  • The loss function must be explicitly set to "binary:logistic"; it is not applied by default.

View feature importance

To view feature importance, you can export the third output port to an output table, or right-click the PS-SMART component and choose View Data > Output Feature Importance Table. The following figure shows the output feature importance table.

image

In the table, the id column lists the numbers of the input features. In this example, the data is in dense format and the input features are "f0, f1, f2, f3, f4, f5"; therefore, ID 0 represents f0, and ID 4 represents f4. If the KV format is used, the IDs represent the keys in key-value pairs. Each value column corresponds to a feature importance type. The default type is "gain", indicating the sum of information gains a feature contributes to the model. The preceding figure shows only three features because only these three features are used in tree splits. The importance of unused features is considered 0.

Important notes

  • The target column in a PS-SMART binary classification model supports only numerical values (0 for negative samples and 1 for positive samples). Even if the values in the ODPS table are strings, they are stored as numerical values. If the classification target is a string such as "Good" or "Bad", convert it to 1 or 0 first.

  • In the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If feature IDs are character strings, use the serialization component to serialize them; if feature values are strings, apply feature engineering, such as discretization, first. (A serialization sketch follows this list.)

  • Although PS-SMART supports tasks with hundreds of thousands of features, such tasks consume many resources and run slowly, so we do not recommend that many features. The GBDT algorithm is suitable for training with continuous features: continuous numerical features can be used for training directly, whereas categorical features require one-hot encoding (filtering out infrequent values) before they can be used. Discretization is not recommended for numerical features.

  • The PS-SMART algorithm applies randomness in many places. For example, the data_sample_ratio and fea_sample_ratio items trigger data or feature sampling, and the algorithm uses histogram approximation; when multiple workers run in a cluster in distributed mode, local sketches are merged into global sketches in a random order. Although different merging orders result in different tree structures, this does not bring great variation in the output model in theory. Therefore, it is normal to obtain different results when the algorithm runs multiple times on the same data with the same parameter settings.
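
The serialization mentioned above can be as simple as the following hedged sketch (the feature-ID mapping and helper are hypothetical, not a PAI component), which turns named numerical features into the KV rows PS-SMART accepts:

    def to_kv_row(row, feature_ids):
        # feature_ids maps string feature names to positive integer IDs.
        return " ".join(f"{feature_ids[name]}:{value}" for name, value in row.items())

    feature_ids = {"age": 1, "income": 2}  # hypothetical name-to-ID mapping
    print(to_kv_row({"age": 0.37, "income": 0.82}, feature_ids))  # 1:0.37 2:0.82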

PS-SMART multiclass classification

PS stands for parameter server, a framework for online and offline training of large-scale models. Scalable Multiple Additive Regression Tree (SMART) is an implementation of the gradient boosting decision tree (GBDT) algorithm on PS. PS-SMART can run training tasks with up to tens of billions of samples and hundreds of thousands of features on thousands of nodes, and it supports failover to maintain high stability. Additionally, PS-SMART supports various data formats, training and evaluation targets, feature importance output, and training acceleration (such as histogram approximation).

Quick Start

image

As shown in the figure, a PS-SMART multiclass classification model is learned from the training data. The component has three output ports:

  • Output model: offline model, which is connected to the unified prediction component. This model does not support output of leaf node numbers.

  • Output model table: a binary table that is not readable. To ensure compatibility with the PS-SMART prediction component, the table provides outputs such as leaf node numbers and evaluation metrics. However, the output table has strict requirements on data formats, resulting in poor user experiences. It will be improved gradually or be replaced by another component.

  • Output feature importance table: lists importance of each feature. Three importance types are supported (see parameter description).

PAI command

Training

  1. PAI -name ps_smart
  2. -project algo_public
  3. -DinputTableName="smart_multiclass_input"
  4. -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
  5. -DoutputTableName="pai_temp_24515_545859_2"
  6. -DoutputImportanceTableName="pai_temp_24515_545859_3"
  7. -DlabelColName="label"
  8. -DfeatureColNames="features"
  9. -DenableSparse="true"
  10. -Dobjective="multi:softprob"
  11. -Dmetric="mlogloss"
  12. -DfeatureImportanceType="gain"
  13. -DtreeCount="5"
  14. -DmaxDepth="5"
  15. -Dshrinkage="0.3"
  16. -Dl2="1.0"
  17. -Dl1="0"
  18. -Dlifecycle="3"
  19. -DsketchEps="0.03"
  20. -DsampleRatio="1.0"
  21. -DfeatureRatio="1.0"
  22. -DbaseScore="0.5"
  23. -DminSplitLoss="0";

Prediction

  1. PAI -name prediction
  2. -project algo_public
  3. -DinputTableName="smart_multiclass_input";
  4. -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
  5. -DoutputTableName="pai_temp_24515_545860_1"
  6. -DfeatureColNames="features"
  7. -DappendColNames="label,features"
  8. -DenableSparse="true"
  9. -DkvDelimiter=":"
  10. -Dlifecycle="28";

Parameter description

Data parameters

Command option Parameter Description Value range Required/Optional, default value
featureColNames Feature Column Names of the feature columns selected from the input table for training The columns must be bigint or double type in dense format, or string type in sparse KV format. If the sparse KV format is used, the keys and values must be numerical. Required
labelColName Label Column Name of the label column in the input table The column can be string or numerical type, but the values are stored as numbers. For multiclass classification, the label values are 0, 1, 2, …, n-1, where n is the number of classes. Required
weightCol Weight Column You can specify a weight for each row of samples. Column name, numerical type (Optional) It is left blank by default.
enableSparse Sparse Format Whether data is in sparse format, in which key-value pairs are separated by spaces whereas keys and values are separated by colons, for example, 1:0.3 3:0.9 [true, false] (Optional) The default value is false.
inputTableName Name of the input table NA NA Required
modelName Name of the output model NA NA Required
outputImportanceTableName Output Feature Importance Table Name NA NA (Optional) It is left blank by default.
inputTablePartitions Input Table Partitions NA NA Optional, in the format of ds=1/pt=1
outputTableName Output Model Table Name The output table is an ODPS table in binary format and is not readable. It is used by the self-contained PS-SMART prediction component, which can output leaf node numbers. NA (Optional) It is left blank by default.
lifecycle Output Table Lifecycle NA Positive integer (Optional) The default value is 3.

Algorithm parameters

Command option Parameter Description Value range Required/Optional, default value
classNum Number of Classes Number of classes in multiclass classification. If the number of classes is n, the label values are 0, 1, 2, …, n-1 Integer greater than or equal to 3 Required
objective Objective Function Type The objective function type directly affects learning and must be selected correctly. Set it to "multi:softprob" for multiclass classification. NA Required
metric Evaluation Metric Type Evaluation metrics in the training set, which are output to stdout of the coordinator in a logview "mlogloss", "merror" (Optional) It is left blank by default.
treeCount Number of Decision Trees Number of decision trees. The training time is in direct proportion to this number. Positive integer (Optional) The default value is 1.
maxDepth Maximum Decision Tree Depth Maximum depth of a decision tree, recommended value: 5 (a maximum of 32 leaf nodes) Positive integer, [1, 20] (Optional) The default value is 5.
sampleRatio Data Sampling Ratio Data sampling ratio for creating a weak learner to accelerate training when building each decision tree (0, 1] (Optional) The default value is 1.0, indicating that data sampling is disabled.
featureRatio Feature Sampling Ratio Feature sampling ratio for creating a weak learner to accelerate training when building each decision tree (0, 1] (Optional) The default value is 1.0, indicating that feature sampling is disabled.
l1 L1 Penalty Coefficient This parameter controls the number of leaf nodes. The larger the value is, the fewer the leaf nodes are. Set a larger value for overfitting. Non-negative real number (Optional) The default value is 0.
l2 L2 Penalty Coefficient This parameter controls distribution of leaf nodes. The larger the value is, the more evenly the leaf nodes are distributed. Set a larger value for overfitting. Non-negative real number (Optional) The default value is 1.0.
shrinkage Learning Rate Step size applied to each tree's contribution; smaller values make training more conservative (0, 1] (Optional) The default value is 0.3.
sketchEps Sketch Precision Threshold of the splitting point when creating a sketch. The number of bins is O(1.0/sketchEps). The smaller the value is, the more bins are divided. This value does not need to be adjusted under normal conditions. (0, 1) (Optional) The default value is 0.03.
minSplitLoss Minimum Split Loss Minimum split loss required for splitting a node. The larger the value is, the more conservatively the node splits. Non-negative real number (Optional) The default value is 0.
featureNum Feature Quantity Number of features or the largest feature ID. Specify this parameter for resource usage estimation. Positive integer Optional
baseScore Global Offset Original predicted values of all samples Real number (Optional) The default value is 0.5.
featureImportanceType Feature Importance Type Type of feature importance. “weight” indicates the number of times features are split. “gain” indicates information gain provided by the feature. “cover” indicates the number of samples that feature covers on the splitting node. “weight”, “gain”, “cover” (Optional) The default value is “gain”.

NOTE:

  • Specify different values for the objective parameter in different learning models. On the multiclass classification web GUI, the objective function is automatically specified and hidden from users. On the command line interface, set the objective parameter to “multi:softprob”.

  • Mappings between metrics and objective functions are: "mlogloss" for the multiclass negative log-likelihood and "merror" for the multiclass classification error, as sketched below.
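
A minimal sketch of both metrics, assuming each row carries one predicted probability per class:

    import math

    def mlogloss(labels, prob_rows):
        # Multiclass negative log-likelihood of the true class's probability.
        return -sum(math.log(row[y]) for y, row in zip(labels, prob_rows)) / len(labels)

    def merror(labels, prob_rows):
        # Fraction of samples whose argmax class misses the label.
        return sum(max(range(len(row)), key=row.__getitem__) != y
                   for y, row in zip(labels, prob_rows)) / len(labels)

    probs = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
    print(mlogloss([2, 0, 1], probs))  # ~0.520
    print(merror([2, 0, 1], probs))    # 0.0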

Execution optimization

Command option Parameter Description Value range Required/Optional, default value
coreNum Number of Cores Number of cores used for computing. The larger the value is, the faster the computing algorithm runs. Positive integer (Optional) Cores are automatically assigned by default.
memSizePerCore Memory Size per Core (MB) Memory size used by each core, where 1024 represents 1 GB of memory Positive integer (Optional) Memory is automatically assigned by default.

Example

Data generation

The following example uses the KV data format.

  1. drop table if exists smart_multiclass_input;
  2. create table smart_multiclass_input lifecycle 3 as
  3. select
  4. *
  5. from
  6. (
  7. select 2 as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features from dual
  8. union all
  9. select 1 as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features from dual
  10. union all
  11. select 1 as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features from dual
  12. union all
  13. select 2 as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features from dual
  14. union all
  15. select 1 as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features from dual
  16. union all
  17. select 1 as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features from dual
  18. union all
  19. select 0 as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features from dual
  20. union all
  21. select 1 as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features from dual
  22. union all
  23. select 0 as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features from dual
  24. union all
  25. select 1 as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features from dual
  26. ) tmp;

The following figure shows the generated data.

image

The table contains five dimensions of features.

Training

Configure training data and the training component according to Quick start. Select the “label” column as the target column and “features” column as the feature column. The following figures show the data parameter settings page and algorithm parameter settings page.

image

image

You do not need to set the number of features because this number is calculated automatically by the algorithm. If you have a large number of features and want the algorithm to accurately estimate the amount of resources required, enter the actual number of features here.

To accelerate the training, you can set the number of cores on the execution optimization page. The larger the number is, the faster the algorithm runs. Generally, you do not need to enter the memory size per core because the algorithm can accurately estimate it. In addition, the PS algorithm starts to run only when all hosts obtain resources; therefore, you may need to wait for a longer time when the cluster is busy and the job requires many resources.

image

You can view the metric output values in stdout of the coordinator in a logview (an http link starting with "http://logview.odps.aliyun-inc.com:8080/logview"). One PS-SMART training job has multiple tasks; therefore, multiple logviews are available. Select the logview whose logs start with PS, as shown in the following figure.

image

The one in the red box is the logview of the PS job. You can identify different tasks by information in the green circle.

Then, perform operations in the logview according to the following figure.

image

Prediction

Use the unified prediction component

The output model obtained after training is saved in binary format and can be used for prediction. Configure the input model and test data for the prediction component according to Quick start and set parameters according to the following figure.

image

If the dense format is used, you only need to select the feature columns. (All columns are selected by default, and extra columns do not affect the prediction.) If the KV format is used, set the data format to sparse and select the correct delimiters. In the SMART model, key-value pairs are separated by space characters; therefore, the delimiter must be set to space or "\u0020" (the escape expression of a space).

The following figure shows the prediction result.

image

In the “prediction_detail” column, values 0, 1, and 2 indicate classes, and the values following them indicate probabilities of the corresponding classes. The “predict_result” column lists the selected classes with the highest probability, and the “predict_score” column lists the probability of each selected class.

Use the PS-SMART prediction component

The output model table obtained after training is saved in binary format and can be used by the PS-SMART prediction component for prediction. Configure the input model and test data for the prediction component according to Quick start and set the parameters, including the data format, feature columns, target column, and number of classes. The ID column can only be a string type column other than the feature columns and the target column. The loss function must be explicitly set to "multi:softprob". The following figure shows the prediction result.

image

The "score_class_k" columns list the probability of each class k; the class with the highest probability is the predicted class. The "leaf_index" column lists the predicted leaf node numbers. Each sample has N*M numbers, where N is the number of decision trees and M is the number of classes. In this example, each sample has 15 numbers (5*3 = 15). Each tree maps to one number, which indicates the leaf node that the sample reaches on that tree.

NOTE:

  • The output model table is a binary table that is not readable. To ensure compatibility with the PS-SMART prediction component, the table provides outputs such as leaf node numbers and evaluation metrics. However, the output table has strict requirements on data formats, resulting in poor user experiences. It will be improved gradually or be replaced by another component.

  • A string type column must be selected as the ID column. The column can contain character strings but cannot be left blank or contain NULL values. A feature column can be converted into the string type by using the data type conversion component and then used as the ID column.

  • The loss function must be explicitly set to "multi:softprob"; it is not applied by default.

View feature importance

To view feature importance, you can export the third output port to an output table, or right-click the PS-SMART component and choose View Data > Output Feature Importance Table. The following figure shows the output feature importance table.

image

In the table, the id column lists the numbers of the input features. If the KV format is used, the IDs represent the keys in key-value pairs. If the dense format is used and the input features are "f0, f1, f2, f3, f4, f5", ID 0 represents f0, and ID 4 represents f4. Each value column corresponds to a feature importance type. The default type is "gain", indicating the sum of information gains a feature contributes to the model. The preceding figure shows only four features because only these four features are used in tree splits. The importance of unused features is considered 0.

Important notes

  • The target column in a PS-SMART multiclass classification model supports only non-negative integer class IDs (the class numbers are 0, 1, 2, …, n-1, where n is the number of classes). Even if the values in the ODPS table are strings, they are stored as numerical values. If the classification target is a string such as "Good", "Medium", or "Bad", convert it to a numerical value (0, 1, 2, …, n-1) first.

  • In the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If feature IDs are character strings, use the serialization component to serialize them; if feature values are strings, apply feature engineering, such as discretization, first.

  • Although PS-SMART supports tasks with hundreds of thousands of features, such tasks consume many resources and run slowly, so we do not recommend that many features. The GBDT algorithm is suitable for training with continuous features: continuous numerical features can be used for training directly, whereas categorical features require one-hot encoding (filtering out infrequent values) before they can be used. Discretization is not recommended for numerical features.

  • The PS-SMART algorithm applies randomness in many places. For example, the data_sample_ratio and fea_sample_ratio items trigger data or feature sampling, and the algorithm uses histogram approximation; when multiple workers run in a cluster in distributed mode, local sketches are merged into global sketches in a random order. Although different merging orders result in different tree structures, this does not bring great variation in the output model in theory. Therefore, it is normal to obtain different results when the algorithm runs multiple times on the same data with the same parameter settings.

PS-SMART regression

PS stands for parameter server, a framework for online and offline training of large-scale models. Scalable Multiple Additive Regression Tree (SMART) is an implementation of the gradient boosting decision tree (GBDT) algorithm on PS. PS-SMART can run training tasks with up to tens of billions of samples and hundreds of thousands of features on thousands of nodes, and it supports failover to maintain high stability. Additionally, PS-SMART supports various data formats, training and evaluation targets, feature importance output, and training acceleration (such as histogram approximation).

Quick Start

image

As shown in the figure, a PS-SMART regression model is learned from the training data. The component has three output ports:

  • Output model: offline model, which is connected to the unified prediction component. This model does not support output of leaf node numbers.

  • Output model table: a binary table that is not readable. To ensure compatibility with the PS-SMART prediction component, the table provides outputs such as leaf node numbers and evaluation metrics. However, the output table has strict requirements on data formats, resulting in poor user experiences. It will be improved gradually or be replaced by another component.

  • Output feature importance table: lists the importance of each feature. Three importance types are supported (see the parameter description).

PAI command

Training

  PAI -name ps_smart
  -project algo_public
  -DinputTableName="smart_regression_input"
  -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
  -DoutputTableName="pai_temp_24515_545859_2"
  -DoutputImportanceTableName="pai_temp_24515_545859_3"
  -DlabelColName="label"
  -DfeatureColNames="features"
  -DenableSparse="true"
  -Dobjective="reg:linear"
  -Dmetric="rmse"
  -DfeatureImportanceType="gain"
  -DtreeCount="5"
  -DmaxDepth="5"
  -Dshrinkage="0.3"
  -Dl2="1.0"
  -Dl1="0"
  -Dlifecycle="3"
  -DsketchEps="0.03"
  -DsampleRatio="1.0"
  -DfeatureRatio="1.0"
  -DbaseScore="0.5"
  -DminSplitLoss="0";

Prediction

  PAI -name prediction
  -project algo_public
  -DinputTableName="smart_regression_input"
  -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
  -DoutputTableName="pai_temp_24515_545860_1"
  -DfeatureColNames="features"
  -DappendColNames="label,features"
  -DenableSparse="true"
  -Dlifecycle="28";

Parameter description

Data parameters

Command option Parameter Description Value range Required/Optional, default value
featureColNames Feature Column Name of the feature column selected from the input table for training The column must be bigint or double type in dense format, or string type in sparse KV format. In sparse KV format, both keys and values must be numerical. Required
labelColName Label Column Name of the label column in the input table The column can be a string or a numeric value, but it is saved as a numerical value in internal storage. For regression, the label is a numerical value. Required
weightCol Weight Column You can specify a weight for each row of samples. Column name, numerical type (Optional) It is left blank by default.
enableSparse Sparse Format Whether data is in sparse format, in which key-value pairs are separated by spaces whereas keys and values are separated by colons, for example, “1:0.3 3:0.9” true, false (Optional) The default value is false.
inputTableName Name of the input table NA NA Required
modelName Name of the output model NA NA Required
outputImportanceTableName Output Feature Importance Table Name NA NA (Optional) It is left blank by default.
inputTablePartitions Input Table Partitions NA NA Optional, in the format of ds=1/pt=1
outputTableName Output Model Table Name The output table is an ODPS table in binary format and is not human-readable. The self-contained prediction component of SMART can use it to provide outputs such as leaf node numbers. NA (Optional) It is left blank by default.
lifecycle Output Table Lifecycle NA Positive integer (Optional) The default value is 3.

Algorithm parameters

Command option Parameter Description Value range Required/Optional, default value
objective Objective Function Type The objective function type directly affects learning and must be selected correctly. Multiple loss functions are available for regression; see the notes below. NA (Required) The default type is linear regression (reg:linear).
metric Evaluation Metric Type Evaluation metric on the training set, which must correspond to the objective function type; values are written to stdout of the coordinator in the logview. See the notes and samples below. NA (Optional) It is left blank by default.
treeCount Number of Decision Trees Number of decision trees. The training time is in direct proportion to this number. Positive integer (Optional) The default value is 1.
maxDepth Maximum Decision Tree Depth Maximum depth of a decision tree, recommended value: 5 (a maximum of 32 leaf nodes) Positive integer, [1, 20] (Optional) The default value is 5.
sampleRatio Data Sampling Ratio Data sampling ratio for creating a weak learner to accelerate training when building each decision tree (0, 1] (Optional) The default value is 1.0, indicating that data sampling is disabled.
featureRatio Feature Sampling Ratio Feature sampling ratio for creating a weak learner to accelerate training when building each decision tree (0, 1] (Optional) The default value is 1.0, indicating that feature sampling is disabled.
l1 L1 Penalty Coefficient This parameter controls the number of leaf nodes. The larger the value is, the fewer leaf nodes there are. Increase the value to mitigate overfitting. Non-negative real number (Optional) The default value is 0.
l2 L2 Penalty Coefficient This parameter controls the distribution of leaf nodes. The larger the value is, the more evenly the leaf nodes are distributed. Increase the value to mitigate overfitting. Non-negative real number (Optional) The default value is 1.0.
shrinkage Learning Rate Step size that scales the contribution of each tree. (0, 1] (Optional) The default value is 0.3.
sketchEps Sketch Precision Threshold for choosing split candidates when building a sketch. The number of bins is O(1.0/sketchEps); the smaller the value, the more bins are generated. This value normally does not need to be adjusted. (0, 1) (Optional) The default value is 0.03.
minSplitLoss Minimum Split Loss Minimum loss reduction required to split a node. The larger the value is, the more conservative the splitting is. Non-negative real number (Optional) The default value is 0.
featureNum Feature Quantity Number of features or the largest feature ID. Specify this parameter for resource usage estimation. Positive integer Optional
baseScore Global Offset Initial prediction value for all samples Real number (Optional) The default value is 0.5.
featureImportanceType Feature Importance Type Type of feature importance: “weight” indicates the number of times a feature is used in splits; “gain” indicates the information gain brought by the feature; “cover” indicates the number of samples covered by the splits that use the feature. “weight”, “gain”, “cover” (Optional) The default value is “gain”.
tweedieVarPower Tweedie Distribution Index Index of the Tweedie distribution, specifying the relationship between the variance and the mean: $\mathrm{Var}(y) \propto \mathrm{E}(y)^{\rho}$, where $\rho$ is this parameter. (1, 2) (Optional) The default value is 1.5.

NOTE:

  • Specify different values for the objective parameter in different learning models. The regression web GUI provides multiple objective functions, which are described as follows:
  1. reg:linear (Linear regression) Note: The range of label numbers is (-∞, +∞).
  2. reg:logistic (Logistic regression) Note: The range of label numbers is [0, 1].
  3. count:poisson (Poisson regression for count data, output mean of poisson distribution) Note: Label numbers must be greater than 0.
  4. reg:gamma (Gamma regression for modeling insurance claims severity, or for any outcome that might be [gamma-distributed](https://en.wikipedia.org/wiki/Gamma_distribution#Applications)) Note: Label numbers must be greater than 0.
  5. reg:tweedie (Tweedie regression for modeling total loss in insurance, or for any outcome that might be [Tweedie-distributed](https://en.wikipedia.org/wiki/Tweedie_distribution#Applications).) Note: Label numbers must be greater than 0.
  • Metrics for these objective functions are:
  1. rmse (root mean square error, corresponding to objective reg:linear)
  2. mae (mean absolute error, corresponding to objective reg:linear)
  3. poisson-nloglik (negative loglikelihood for poisson regression, corresponding to objective count:poisson)
  4. gamma-deviance (Residual deviance for gamma regression, corresponding to objective reg:gamma)
  5. gamma-nloglik (Negative log-likelihood for gamma regression, corresponding to objective reg:gamma)
  6. tweedie-nloglik (tweedie-nloglik@1.5, negative log-likelihood for Tweedie regression, at a specified value of the tweedie_variance_power parameter)
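
Because a mismatched metric is easy to submit by accident, the following Python sketch encodes the objective-to-metric pairings listed above and validates a configuration before a job is launched. The helper name is illustrative and not part of PAI.

  # Objective -> metrics listed above (a metric may carry an "@param" suffix).
  VALID_METRICS = {
      "reg:linear": {"rmse", "mae"},
      "count:poisson": {"poisson-nloglik"},
      "reg:gamma": {"gamma-deviance", "gamma-nloglik"},
      "reg:tweedie": {"tweedie-nloglik"},
  }

  def check_metric(objective, metric):
      """Raise if the metric does not correspond to the objective."""
      base = metric.split("@", 1)[0]  # "tweedie-nloglik@1.5" -> "tweedie-nloglik"
      if base not in VALID_METRICS.get(objective, set()):
          raise ValueError("metric %r does not match objective %r" % (metric, objective))

  check_metric("reg:linear", "rmse")                  # passes
  check_metric("reg:tweedie", "tweedie-nloglik@1.5")  # passes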

Execution optimization

Command option Parameter Description Value range Required/Optional, default value
coreNum Number of Cores Number of cores used for computing. The larger the value is, the faster the computing algorithm runs. Positive integer (Optional) Cores are automatically assigned by default.
memSizePerCore Memory Size per Core (MB) Memory size used by each core, where 1024 represents 1 GB Positive integer (Optional) Memory is automatically assigned by default.

Example

Data generation

The following example uses the sparse KV data format.

  drop table if exists smart_regression_input;
  create table smart_regression_input as
  select * from
  (
      select 2.0 as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features from dual
      union all
      select 1.0 as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features from dual
      union all
      select 1.0 as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features from dual
      union all
      select 2.0 as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features from dual
      union all
      select 1.0 as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features from dual
      union all
      select 1.0 as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features from dual
      union all
      select 0.0 as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features from dual
      union all
      select 1.0 as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features from dual
      union all
      select 0.0 as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features from dual
      union all
      select 1.0 as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features from dual
  ) tmp;

The following figure shows the generated data.

image

In the table, feature IDs are numbered starting from 1, and the maximum feature ID is 5.
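
To illustrate the KV format requirements (positive integer IDs, real-number values, a colon between key and value, a space between pairs), here is a minimal Python sketch that converts a dense feature vector into the 1-based KV string used above. It is an illustration only, not part of the PAI toolchain; omitting zero-valued features is a common sparse-format convention.

  def to_kv(features):
      """Convert a dense feature vector to the sparse KV string format:
      1-based positive integer IDs, ':' between key and value,
      ' ' between pairs. Zero-valued features are omitted."""
      return " ".join(
          f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0
      )

  print(to_kv([0.55, -0.15, 0.82, -0.99, 0.17]))
  # -> "1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17"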

Training

Select the “label” column as the target column and “features” column as the feature column. The following figure shows the algorithm parameter settings page for linear regression.

image

You do not need to set the number of features because it is calculated automatically by the algorithm. If you have a large number of features and want the algorithm to estimate the required resources accurately, enter the actual number of features here.

To accelerate the training, you can set the number of cores on the execution optimization page. The larger the number is, the faster the algorithm runs. Generally, you do not need to enter the memory size per core because the algorithm can accurately estimate it. In addition, a PS algorithm starts to run only after all hosts have obtained resources, so you may need to wait longer when the cluster is busy and the task requests many resources.

image

You can view metric output values in stdout of the coordinator in a logview (an HTTP link starting with http://logview.odps.aliyun-inc.com:8080/logview). One PS-SMART training job consists of multiple tasks, so multiple logviews are available. Select the logview whose logs start with PS, as shown in the following figure.

image

The one in the red box is the logview of the PS job. You can identify different tasks by information in the green circle.

Then, perform the operations shown in the following figure in the logview to locate the metric output.

Prediction

Use the unified prediction component

The output model obtained after training is saved in binary format and can be used for prediction. Configure the input model and test data for the prediction component according to Quick Start, and set parameters according to the following figure.

image

If the dense format is used, you only need to select feature columns. (All columns are selected by default, and extra columns do not affect the prediction.) If the KV format is used, set the data format to sparse and select the correct delimiter. In the SMART model, key-value pairs are separated by spaces, so the delimiter must be set to space or “\u0020” (the escape expression for space).

The following figure shows the prediction result.

image

The “predict_result” column lists the predicted values.

Use the PS-SMART prediction component

The output model table obtained after training is saved in binary format and can be used by the PS-SMART prediction component for prediction. Configure the input model and test data for the prediction component according to Quick Start, and set the parameters, including the data format, feature columns, target column, and number of classes. The ID column can only be a string type column other than the feature columns and the target column. The loss function must be explicitly set to the objective function used for training. The following figure shows the prediction result.

image

The “prediction_score” column lists the predicted values. The “leaf_index” column lists the predicted leaf node numbers: each sample has N numbers, where N is the number of decision trees, and each number identifies the leaf node reached on the corresponding tree.

NOTE:

  • The output model table is a binary table that is not human-readable. It exists for compatibility with the PS-SMART prediction component, providing outputs such as leaf node numbers and evaluation metrics. However, the table has strict requirements on data formats, which results in a poor user experience; it will be improved gradually or replaced by another component.

  • A string type column must be selected as the label column. The column can contain any strings, but it cannot be empty or contain NULL values. A feature column can be converted to string type with the data type conversion component and then used as the label column.

  • The loss function must be explicitly set to the objective function used for training; the default value does not take effect.

View feature importance

To view feature importance, you can write the third output port to an output table, or right-click the PS-SMART component and choose View Data > Output Feature Importance Table. The following figure shows the output feature importance table.

image

The values in the id column are input feature IDs. If the KV format is used, an ID is the key in a key-value pair. If the dense format is used and the input features are “f0, f1, f2, f3, f4, f5”, ID 0 represents f0 and ID 4 represents f4. The value column lists the importance under the selected feature importance type. The default type is “gain”, indicating the sum of information gains that a feature brings to the model. The preceding figure shows only two features because only these two features are used in tree splits; the importance of unused features is considered 0.

Important notes

  • The target column in a PS-SMART regression model supports only numerical values. Even if values in the ODPS table are strings, they are saved as numerical values.

  • In the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If feature IDs are strings, use the serialization component to serialize them. If feature values are strings, perform feature engineering, such as discretization, to convert them to numbers.

  • Although PS-SMART supports tasks with hundreds of thousands of features, such tasks consume many resources and run slowly, so we do not recommend using that many features. The GBDT algorithm is suited to training with continuous features: categorical features require one-hot encoding (filtering out infrequent values) before they can be used for training, whereas continuous numerical features can be used directly. Discretization is not recommended for numerical features.

  • The PS-SMART algorithm applies randomness in several places. For example, data_sample_ratio and fea_sample_ratio trigger data or feature sampling. In addition, PS-SMART uses histogram approximation: when multiple workers run in distributed mode, local sketches are merged into global sketches in a random order. Different merging orders can yield different tree structures, but in theory this does not cause large variation in the output model. Therefore, it is normal to obtain different results across runs with the same data and parameter settings.

PS linear regression

Linear regression is a classic regression algorithm used to analyze the linear relationship between a dependent variable and multiple independent variables. PS stands for parameter server, a framework for online and offline training of large-scale models. PS linear regression can efficiently run training tasks with hundreds of billions of samples and billions of features, and supports L1 and L2 regularization. To run tasks at larger scales, contact the algorithm's author.

Quick Start

image

PAI command

Training

  PAI -name ps_linearregression
  -project algo_public
  -DinputTableName="lm_test_input"
  -DmodelName="linear_regression_model"
  -DlabelColName="label"
  -DfeatureColNames="features"
  -Dl1Weight=1.0
  -Dl2Weight=0.0
  -DmaxIter=100
  -Depsilon=1e-6
  -DenableSparse=true;

Prediction

  drop table if exists linear_regression_predict;
  PAI -name prediction
  -DmodelName="linear_regression_model"
  -DoutputTableName="linear_regression_predict"
  -DinputTableName="lm_test_input"
  -DappendColNames="label,features"
  -DfeatureColNames="features"
  -DenableSparse=true;

Parameter description

Data parameters

Command option Parameter Description Value range Required/Optional, default value
featureColNames Feature Columns Feature columns selected from the input table for training The columns must be bigint or double type in dense format, or string type in sparse KV format. Required
labelColName Label Column Name of the label column in the input table The column must be numerical (bigint or double type). Required
enableSparse Sparse Format Whether data is in sparse KV format. If so, do not use feature ID 0; we recommend numbering feature IDs starting from 1. true, false (Optional) The default value is false.
itemDelimiter KV Pair Delimiter Delimiter used between key-value pairs when data in the input table is in sparse format NA (Optional) The default delimiter is space.
kvDelimiter Key and Value Delimiter Delimiter used between keys and values when data in the input table is in sparse format NA (Optional) The default delimiter is colon.
inputTableName Name of the input table NA NA Required
modelName Name of the output model NA NA Required
inputTablePartitions Input Table Partitions NA NA Optional, in the format of ds=1/pt=1
enableModelIo Output to Offline Model When this parameter is set to false, output data is exported to the ODPS table, where you can view weights of models true, false (Optional) The default value is true.

Algorithm parameters

Command option Parameter Description Value range Required/Optional, default value
l1Weight L1 weight L1 regularization coefficient. The larger the value is, the more zero-valued weights the model has (the sparser the model is). Increase the value to mitigate overfitting. Non-negative real number (Optional) The default value is 1.0.
l2Weight L2 weight L2 regularization coefficient. The larger the value is, the smaller the absolute values of the model parameters are. Increase the value to mitigate overfitting. Non-negative real number (Optional) The default value is 0.
maxIter Maximum Iterations Maximum number of L-BFGS/OWL-QN iterations. The value 0 indicates that the number of iterations is unlimited. Non-negative integer (Optional) The default value is 100.
epsilon Minimum Convergence Deviation Mean relative loss change rate over ten iterations, used as the termination condition of the optimization algorithm. The smaller the value is, the stricter the condition and the longer the algorithm runs. Real number between 0 and 1 (Optional) The default value is 1.0e-06.
modelSize Largest Feature ID Largest feature ID among all features (feature dimension). It can be larger than the actual largest feature ID. The larger the value is, the higher the memory usage is. If you leave this parameter blank, the system starts an SQL task to calculate the largest feature ID automatically. Non-negative integer (Optional) The default value is 0.

Both the maximum iterations and the minimum convergence deviation determine when the algorithm stops; if both are set, the algorithm stops as soon as either condition is triggered.
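
The following is a minimal Python sketch of this stopping rule, assuming loss values are collected per iteration. The ten-iteration window follows the epsilon description above; all names are illustrative.

  def should_stop(losses, max_iter=100, epsilon=1e-6, window=10):
      """Return True when either stopping condition triggers:
      - the iteration count reaches max_iter (0 means unlimited), or
      - the mean relative loss change over the last `window`
        iterations falls below epsilon."""
      n = len(losses)
      if max_iter > 0 and n >= max_iter:
          return True
      if n < window + 1:
          return False  # not enough history for the convergence check
      recent = losses[-(window + 1):]
      rel_changes = [
          abs(prev - cur) / max(abs(prev), 1e-12)
          for prev, cur in zip(recent, recent[1:])
      ]
      return sum(rel_changes) / window < epsilon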

Execution optimization

Command option Parameter Description Value range Required/Optional, default value
coreNum Number of Cores Number of cores used for computing. The larger the value is, the faster the computing algorithm runs. Positive integer (Optional) Cores are automatically assigned by default.
memSizePerCore Memory Size per Core (MB) Memory size used by each core, where 1024 represents 1 GB Positive integer (Optional) Memory is automatically assigned by default. Generally, you do not need to set this parameter because the algorithm can accurately estimate the required memory size.

Example

Data generation

The following example uses the sparse KV data format.

  drop table if exists lm_test_input;
  create table lm_test_input as
  select * from
  (
      select 2 as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features from dual
      union all
      select 1 as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features from dual
      union all
      select 1 as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features from dual
      union all
      select 2 as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features from dual
      union all
      select 1 as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features from dual
      union all
      select 1 as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features from dual
      union all
      select 0 as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features from dual
      union all
      select 1 as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features from dual
      union all
      select 0 as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features from dual
      union all
      select 1 as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features from dual
  ) tmp;

The following figure shows the generated data.

image

In the table, feature IDs are numbered starting from 1, and the maximum feature ID is 5.

Training

Configure the training data and the training component according to Quick Start. Select the “label” column as the target column and the “features” column as the feature column, and select the sparse data format. The following figure shows the algorithm parameter settings page.

image

You can retain the default value 0 for the largest feature ID; the algorithm then starts an SQL task to calculate it automatically. If you do not want to start the SQL task, enter a value no smaller than 5 for this example. In dense format, this value is the number of feature columns; in KV format, it is the largest feature ID.
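
If you prefer to supply this value yourself, the following illustrative Python sketch scans KV-format feature strings and reports the largest feature ID, under the space/colon delimiters used in this example; it only mirrors what the automatic SQL task computes.

  def max_feature_id(kv_rows, item_delim=" ", kv_delim=":"):
      """Largest feature ID across KV-format rows such as
      '1:0.55 2:-0.15 3:0.82'."""
      return max(
          int(pair.split(kv_delim, 1)[0])
          for row in kv_rows
          for pair in row.split(item_delim)
          if pair
      )

  rows = ["1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17",
          "1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41"]
  print(max_feature_id(rows))  # 5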

To accelerate the training, you can set the number of cores on the execution optimization page. The larger the number is, the faster the algorithm runs. Generally, you do not need to enter the memory size per core because the algorithm can accurately estimate it. In addition, a PS algorithm starts to run only after all hosts have obtained resources, so you may need to wait longer when the cluster is busy and the task requests many resources.

image

Prediction

The model obtained after training is saved in binary format and can be used for prediction. Configure the input model and test data for the prediction component according to Quick Start, and set parameters according to the following figure.

image

Select the KV format used for training and set the correct delimiter. In the KV format, key-value pairs are separated by spaces, so the delimiter must be set to space or “\u0020” (the escape expression for space).

The following figure shows the prediction result.

image

You only need to view the predict_result column.

Important notes

In the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If feature IDs are strings, use the serialization component to serialize them. If feature values are strings, perform feature engineering, such as discretization, to convert them to numbers.

Clustering model evaluation

This component evaluates clustering models based on metrics and charts.

PAI command

  PAI -name cluster_evaluation
  -project algo_public
  -DinputTableName=pai_cluster_evaluation_test_input
  -DselectedColNames=f0,f3
  -DmodelName=pai_kmeans_test_model
  -DoutputTableName=pai_ft_cluster_evaluation_out;

Parameter description

Parameter Description Value range Required/Optional, default value
inputTableName Input table Table name Required
selectedColNames Names of the columns used for evaluation in the input table, which are separated by commas. The column names must be the same as feature names saved in the model. Column names (Optional) All columns in the input table are selected by default.
inputTablePartitions Partitions used for evaluation in the input table, in the format of name1=value1/name2=value2. Multiple partition names are separated by commas. NA (Optional) All partitions are selected by default.
enableSparse Whether data in the input table is in sparse format true, false Optional, default: false
itemDelimiter Delimiter used between key-value pairs when data in the input table is in sparse format NA Optional, default: space
kvDelimiter Delimiter used between keys and values when data in the input table is in sparse format NA Optional, default: colon
modelName Name of the input clustering model Model name Required
outputTableName Name of the output table Table name Required
lifecycle Lifecycle of the output table Positive integer (Optional) No lifecycle by default

Evaluation formula

The Calinski-Harabasz metric is also known as the variance ratio criterion (VRC). It is defined as follows:

$VRC_k = \dfrac{SS_B / (k - 1)}{SS_W / (N - k)}$

  • $SS_B$ is the inter-cluster variance.

  • $SS_W$ is the intra-cluster variance.

  • N is the total number of records, and k is the number of cluster centers.

  • $SS_B$ is defined as $SS_B = \sum_{i=1}^{k} n_i \lVert m_i - m \rVert^2$, where:

    • k is the number of cluster centers, and $n_i$ is the number of points in cluster i;
    • $m_i$ is the center of cluster i, and m is the mean of the input data.
  • $SS_W$ is defined as $SS_W = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2$, where:

    • k is the number of cluster centers, and x is a data point;
    • $C_i$ denotes cluster i;
    • $m_i$ is the center of cluster i.
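
As an illustration of this formula (a minimal sketch, not the component's implementation), the following Python function computes the VRC of a labeled clustering with NumPy; the toy data and cluster assignment below are hypothetical.

  import numpy as np

  def calinski_harabasz(X, labels):
      """Variance ratio criterion: (SS_B / (k-1)) / (SS_W / (N-k))."""
      X = np.asarray(X, dtype=float)
      labels = np.asarray(labels)
      clusters = np.unique(labels)
      N, k = len(X), len(clusters)
      m = X.mean(axis=0)                  # global mean
      ss_b = ss_w = 0.0
      for c in clusters:
          Xc = X[labels == c]             # points in cluster c
          mc = Xc.mean(axis=0)            # cluster center
          ss_b += len(Xc) * np.sum((mc - m) ** 2)
          ss_w += np.sum((Xc - mc) ** 2)
      return (ss_b / (k - 1)) / (ss_w / (N - k))

  # Toy usage on the 5-point test data below, with a hypothetical assignment:
  X = [[1, 2], [1, 3], [1, 4], [0, 3], [0, 4]]
  print(calinski_harabasz(X, [0, 0, 1, 2, 2]))  # -> 3.0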

Example

Test data

  create table if not exists pai_cluster_evaluation_test_input as
  select * from
  (
      select 1 as id, 1 as f0, 2 as f3 from dual
      union all
      select 2 as id, 1 as f0, 3 as f3 from dual
      union all
      select 3 as id, 1 as f0, 4 as f3 from dual
      union all
      select 4 as id, 0 as f0, 3 as f3 from dual
      union all
      select 5 as id, 0 as f0, 4 as f3 from dual
  ) tmp;

Build a clustering model

  PAI -name kmeans
  -project algo_public
  -DinputTableName=pai_cluster_evaluation_test_input
  -DselectedColNames=f0,f3
  -DcenterCount=3
  -Dloop=10
  -Daccuracy=0.00001
  -DdistanceType=euclidean
  -DinitCenterMethod=random
  -Dseed=1
  -DmodelName=pai_kmeans_test_model
  -DidxTableName=pai_kmeans_test_idx;
PAI command

  PAI -name cluster_evaluation
  -project algo_public
  -DinputTableName=pai_cluster_evaluation_test_input
  -DselectedColNames=f0,f3
  -DmodelName=pai_kmeans_test_model
  -DoutputTableName=pai_ft_cluster_evaluation_out;

Output description

Output table outputTableName, with the following fields:

Column name Comment
count Total number of records
centerCount Number of cluster centers
calinhara Calinski-Harabasz metric (see the evaluation formula above)
clusterCounts Total number of points in each cluster

Display on the Machine Learning Platform For AI web GUI:

image

PaiWeb-Pipeline:

image
