All Products
Search
Document Center

Platform For AI:PS-SMART Regression

Last Updated:Mar 19, 2024

A parameter server (PS) is used to process a large number of offline and online training tasks. SMART is short for scalable multiple additive regression tree. PS-SMART is an iteration algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT). The PS-SMART Binary Classification Training component supports training tasks for tens of billions of samples and hundreds of thousands of features. It can run training tasks on thousands of nodes. This component also supports multiple data formats and optimization technologies such as histogram-based approximation.

Limits

The input data of the PS-SMART Regression component must meet the following requirements:

  • Only columns of numeric data types can be used by the PS-SMART Regression component. If the type of data in the MaxCompute table is STRING, the data type must be converted first.

  • If data is in the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If the data type of feature IDs is STRING, you must use the serialization component to serialize the data. If feature values are categorical strings, you must perform feature engineering such as feature discretization to process the values.

  • The PS-SMART Regression component can support hundreds of thousands of feature tasks. However, these tasks are resource-intensive and time-consuming. To resolve this issue, you can use GBDT algorithms to train the model. GBDT algorithms are suitable for the scenario where continuous features are used for training. You can perform one-hot encoding on categorical features to filter low-frequency features. However, we recommend that you do not perform feature discretization on continuous features of numeric data types.

  • The PS-SMART algorithm may introduce randomness. For example, randomness may be introduced in the following scenarios: data and feature sampling based on data_sample_ratio and fea_sample_ratio, optimization of the PS-SMART algorithm by using histograms for approximation, and merge of a local sketch into a global sketch. The structures of trees are different when jobs run on multiple workers in distributed mode. However, the training effect of the model is theoretically the same. It is normal if you use the same data and parameters during training but obtain different results.

  • If you want to accelerate training, you can set the Cores parameter to a larger value. The PS-SMART algorithm starts training tasks after the required resources are provided. Therefore, the more the resources are requested, the longer you must wait.

Precautions

When you use the PS-SMART Regression component, you must take note of the following items:

  • The PS-SMART Regression component can support hundreds of thousands of feature tasks. However, these tasks are resource-intensive and time-consuming. To resolve this issue, you can use GBDT algorithms to train the model. GBDT algorithms are suitable for the scenario where continuous features are used for training. You can perform one-hot encoding on categorical features to filter low-frequency features. However, we recommend that you do not perform feature discretization on continuous features of numeric data types.

  • The PS-SMART algorithm may introduce randomness. For example, randomness may be introduced in the following scenarios: data and feature sampling based on data_sample_ratio and fea_sample_ratio, optimization of the PS-SMART algorithm by using histograms for approximation, and merge of a local sketch into a global sketch. The structures of trees are different when jobs run on multiple workers in distributed mode. However, the training effect of the model is theoretically the same. It is normal if you use the same data and parameters during training but obtain different results.

  • If you want to accelerate training, you can set the Cores parameter to a larger value. The PS-SMART algorithm starts training tasks after the required resources are provided. Therefore, the more the resources are requested, the longer you must wait.

Configure the component

You can configure the PS-SMART Regression component by using one of the following methods:

Method 1: Configure the component in the Platform for AI (PAI) console

Configure the component parameters in Machine Learning Designer. The following table describes the parameters.

Tab

Parameter

Description

Fields Setting

Use Sparse Format

Specifies whether the input data is in the sparse format. If data in the sparse format is presented by using key-value pairs, separate the key-value pairs with spaces and separate keys and values with colons (:). Example: 1:0.3 3:0.9.

Feature Columns

Select the feature columns for training from the input table. If data in the input table is in the dense format, only the columns of the BIGINT and DOUBLE types are supported. If data in the input table is in the sparse format presented by using key-value pairs, and keys and values are of numeric data types, only columns of the STRING type are supported.

Label Column

Select the label column from the input table. The columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, column values can be 0 or 1 in binary classification.

Weight Column

Select the column that contains the weight of each row of samples. The columns of numeric data types are supported.

Parameters Setting

Objective Function Type

Valid values:

  • Linear Regression (default)

  • Logistic Regression

  • Poisson Regression

  • Gamma Regression

  • Tweedie Regression

Tweedie Distribution Index

The index of the relationship between the variance and mean value of Tweedie distribution. This parameter is supported only when you set the Objective Function Type parameter to Tweedie Regression.

Evaluation Indicator Type

Valid values:

  • Rooted Mean Square Error

  • Mean Absolute Error

  • Negative Loglikelihood for Logistic Regression

  • Negative Loglikelihood for Poisson Regression

  • Residual Deviance for Gamma Regression

  • Negative Log-likelihood for Gamma Regression

  • Negative Log-likelihood for Tweedie Regression

  • None

Trees

Enter the number of trees. The value of this parameter must be an integer. The number of trees is proportional to the amount of training time.

Maximum Tree Depth

The default value is 5, which specifies that up to 32 leaf nodes can be configured.

Data Sampling Fraction

Enter the data sampling ratio when trees are built. The sample data is used to build a weak learner to accelerate training.

Feature Sampling Fraction

Enter the feature sampling ratio when trees are built. The sample features are used to build a weak learner to accelerate training.

L1 Penalty Coefficient

Control the size of a leaf node. A larger value results in a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.

L2 Penalty Coefficient

Control the size of a leaf node. A larger value results in a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.

Learning Rate

Enter the learning rate. Valid values: (0,1).

Sketch-based Approximate Precision

Enter the threshold for selecting quantiles when you build a sketch. A smaller value specifies that more bins can be obtained. In most cases, the default value 0.03 is used.

Minimum Split Loss Change

Enter the minimum loss change required for splitting a node. A larger value specifies that node splitting is less likely to occur.

Features

Enter the number of features or the maximum feature ID. If this parameter is not specified for resource usage estimation, the system automatically runs an SQL task to calculate the number of features or the maximum feature ID.

Global Offset

Enter the initial prediction values of all samples.

Random Seed

Enter the random seed. The value of this parameter must be an integer.

Feature Importance Type

Valid values:

  • Weight: indicates the number of splits of the feature.

  • Gain: indicates the information gain provided by the feature. This is the default value.

  • Cover: indicates the number of samples covered by the feature on the split node.

Tuning

Cores

Select the number of cores. By default, the system determines the value.

Memory Size per Core

Select the memory size of each core. Unit: MB. In most cases, the system determines the memory size.

Method 2: Configure the component by using PAI commands

Configure the component parameters by using PAI commands. The following section describes the parameters. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

# Training 
PAI -name ps_smart
    -project algo_public
    -DinputTableName="smart_regression_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545859_2"
    -DoutputImportanceTableName="pai_temp_24515_545859_3"
    -DlabelColName="label"
    -DfeatureColNames="features"
    -DenableSparse="true"
    -Dobjective="reg:linear"
    -Dmetric="rmse"
    -DfeatureImportanceType="gain"
    -DtreeCount="5"
    -DmaxDepth="5"
    -Dshrinkage="0.3"
    -Dl2="1.0"
    -Dl1="0"
    -Dlifecycle="3"
    -DsketchEps="0.03"
    -DsampleRatio="1.0"
    -DfeatureRatio="1.0"
    -DbaseScore="0.5"
    -DminSplitLoss="0"
# Prediction 
PAI -name prediction
    -project algo_public
    -DinputTableName="smart_regression_input";
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545860_1"
    -DfeatureColNames="features"
    -DappendColNames="label,features"
    -DenableSparse="true"
    -Dlifecycle="28"

Module

Parameter

Required

Description

Default value

Data parameters

featureColNames

Yes

The feature columns that are selected from the input table for training. If data in the input table is in the dense format, only the columns of the BIGINT and DOUBLE data types are supported. If data in the input table is in the sparse format presented by using key-value pairs, and keys and values are of numeric data types, only columns of the STRING data type are supported.

No default value

labelColName

Yes

The label column in the input table. The columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, column values can be 0 or 1 in binary classification.

No default value

weightCol

No

The column that contains the weight of each row of samples. The columns of numeric data types are supported.

No default value

enableSparse

No

Specifies whether the input data is in the sparse format. Valid values: true and false. Specifies whether the input data is in the sparse format. If data in the sparse format is presented by using key-value pairs, separate the key-value pairs with spaces and separate keys and values with colons (:). Example: 1:0.3 3:0.9.

false

inputTableName

Yes

The name of the input table.

No default value

modelName

Yes

The name of the output model.

No default value

outputImportanceTableName

No

The name of the table that provides feature importance.

No default value

inputTablePartitions

No

The partitions that are selected from the input table for training. Format: ds=1/pt=1.

No default value

outputTableName

No

The name of the table that is exported to MaxCompute. The table name must be in the binary format.

No default value

lifecycle

No

The lifecycle of the output table. Unit: days.

3

Algorithm parameters

objective

Yes

The type of the objective function. Valid values:

  • reg:linear: Linear Regression

  • reg:logistic: Logistic Regression

  • count:poisson: Poisson Regression

  • reg:gamma: Gamma Regression

  • reg:tweedie: Tweedie Regression

reg:linear

metric

No

The type of the evaluation metric in the training set, which is included in stdout of the coordinator in a logview. Valid values:

  • rmse: the Rooted Mean Square Error value of the Evaluation Index Type parameter in the Machine Learning Platform for AI console.

  • mae: the Mean Absolute Error value of the Evaluation Index Type parameter in the Machine Learning Platform for AI console.

  • logistic-nloglik: the Negative Loglikelihood for Logistic Regression value of the Evaluation Index Type parameter in the Machine Learning Platform for AI console.

  • poisson-nloglik: the Negative Loglikelihood for Poisson Regression value of the Evaluation Index Type parameter in the Machine Learning Platform for AI console.

  • gamma-deviance: the Residual Deviance for Gamma Regression value of the Evaluation Index Type parameter in the Machine Learning Platform for AI console.

  • gamma-nloglik: the Negative Log-likelihood for Gamma Regression value of the Evaluation Index Type parameter in the Machine Learning Platform for AI console.

  • tweedie-nloglik: the Negative Log-likelihood for Tweedie Regression value of the Evaluation Index Type parameter in the Machine Learning Platform for AI console.

No default value

treeCount

No

The number of trees. The value is proportional to the training time.

1

maxDepth

No

The maximum depth of a tree. Valid values: 1 to 20.

5

sampleRatio

No

The data sampling ratio. Value range: (0,1]. If this parameter is set to 1.0, sampling is not performed.

1.0

featureRatio

No

The feature sampling ratio. Value range: (0,1]. If this parameter is set to 1.0, no data is sampled.

1.0

l1

No

The L1 penalty coefficient. A larger value of this parameter indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.

0

l2

No

The L2 penalty coefficient. A larger value of this parameter indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.

1.0

shrinkage

No

The learning rate. Value range: (0,1).

0.3

sketchEps

No

The threshold for selecting quantiles when you build a sketch. The number of bins is O(1.0/sketchEps). A smaller value specifies that more bins can be obtained. In most cases, the default value is used. Value range: (0,1).

0.03

minSplitLoss

No

Enter the minimum loss change required for splitting a node. A larger value specifies that node splitting is less likely to occur.

0

featureNum

No

The number of features or the maximum feature ID. If this parameter is not specified for resource usage estimation, the system automatically runs an SQL task to calculate the number of features or the maximum feature ID.

No default value

baseScore

No

The initial prediction values of all samples.

0.5

randSeed

No

The random seed. The value of this parameter must be an integer.

No default value

featureImportanceType

No

The feature importance type. Valid values:

  • weight: indicates the number of splits of the feature.

  • gain: indicates the information gain provided by the feature.

  • cover: indicates the number of samples covered by the feature on the splitting node.

gain

tweedieVarPower

No

The index of the relationship between the variance and mean value of Tweedie distribution.

1.5

Tuning parameters

coreNum

No

The number of cores that are used by the computing algorithm. The larger the value, the faster the computing algorithm runs.

Specified by the system

memSizePerCore

No

The memory size of each core. Unit: MB.

Specified by the system

Example

  1. Execute the following SQL statements to generate input data on an ODPS SQL node. In this example, the input data in the key-value pairs is generated.

    drop table if exists smart_regression_input;
    create table smart_regression_input as
    select
    *
    from
    (
    select 2.0 as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features
        union all
    select 1.0 as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features
        union all
    select 1.0 as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features
        union all
    select 2.0 as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features
        union all
    select 1.0 as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features
        union all
    select 1.0 as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features
        union all
    select 0.0 as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features
        union all
    select 1.0 as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features
        union all
    select 0.0 as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features
        union all
    select 1.0 as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features
    ) tmp;

    The following figure shows the generated training data.image

  2. Create a pipeline shown in the following figure and run the component. For more information, see Algorithm modeling.image

    1. In the left-side component list of Machine Learning Designer, separately search for the Read Table, PS-SMART Regression, Prediction, and Write Table components, and drag the components to the canvas on the right.

    2. Configure the component parameters.

      • On the canvas, click the Read Table-1 component. On the Select Table tab in the right pane, set Table Name to smart_regression_input.

      • On the canvas, click the PS-SMART Regression-1 component and configure the parameters in the right pane. The following table describes the parameters. Retain the default values for the parameters that are not described in the table.

        Tab

        Parameter

        Description

        Fields Setting

        Use Sparse Format

        Select Use Sparse Format.

        Feature Columns

        Select the features column.

        Label Column

        Select the label column.

        Parameters Setting

        Objective Function Type

        Set the parameter to Linear Regression.

        Evaluation Indicator Type

        Set the parameter to Rooted Mean Square Error.

        Trees

        Set the parameter to 5.

      • On the canvas, click the Prediction-1 component and configure the parameters. The following table describes the parameters. Retain the default values for the parameters that are not described in the table.

        Tab

        Parameter

        Description

        Fields Setting

        Feature Columns

        By default, all columns in the input table are selected. Specific columns may not be used for training. These columns do not affect the prediction result.

        Reserved Columns

        Select the label column.

        Sparse Matrix

        Select Sparse Matrix.

        KV Delimiter

        Set this parameter to a colon (:).

        KV Pair Delimiter

        Set this parameter to a space.

      • On the canvas, click the Write Table-1 component. On the Select Table tab in the right pane, set New Table Name to smart_regression_output.

    3. After the parameter configuration is complete, click the image button to run the pipeline.

  3. Right-click the Prediction-1 component and choose View Data > Prediction Result Output Port to view the prediction results.image

  4. Right-click the PS-SMART Regression-1 component and choose View Data > Output Feature Importance Table to view the feature importance table. image

    id: the ID of a passed feature. In this example, the input data is in the key-value format. The values in the id column indicate the keys in the key-value pairs. The feature importance table has only two features. This indicates that only these two features are used in the split of the tree. The feature importance of other features can be considered as 0. value: the feature importance type. The default value of this parameter is gain, which indicates the sum of the information gains provided by the feature for the model.

PS-SMART model deployment

If you want to deploy the model generated by the PS-SMART Binary Classification Training component to EAS as an online service, you must add the Model export component as a downstream node for the PS-SMART Binary Classification Training component and configure the Model export component. For more information, see Model export.

After the Model export component is successfully run, you can deploy the generated model to EAS as an online service on the EAS-Online Model Services page. For more information, see Model service deployment by using the PAI console.

References