
Platform For AI:PS-SMART Binary Classification Training

Last Updated:Feb 28, 2024

A parameter server (PS) is used to process a large number of offline and online training tasks. SMART is short for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that implements gradient boosting decision trees (GBDT) on a parameter server. The PS-SMART Binary Classification Training component supports training tasks with tens of billions of samples and hundreds of thousands of features, and can run training tasks on thousands of nodes. The component also supports multiple data formats and optimization techniques such as histogram-based approximation.

Limits

You can use this component only with the computing resources of MaxCompute.

Usage notes

  • Only columns of numeric data types can be used by the PS-SMART Binary Classification Training component. The value 0 indicates a negative example, and the value 1 indicates a positive example. If the label data in the MaxCompute table is of the STRING type, you must convert it first. For example, convert Good/Bad to 1/0.

  • If data is in the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If the data type of feature IDs is STRING, you must use the serialization component to serialize the data. If feature values are categorical strings, you must perform feature engineering such as feature discretization to process the values.

  • The PS-SMART Binary Classification Training component supports training tasks that involve hundreds of thousands of features. However, such tasks are resource-intensive and time-consuming. GBDT algorithms are suitable for scenarios where continuous features are used for training. You can perform one-hot encoding on categorical features to filter out low-frequency features. However, we recommend that you do not discretize continuous features of numeric data types.

  • The PS-SMART algorithm may introduce randomness, for example, when data and features are sampled based on data_sample_ratio and fea_sample_ratio, when histograms are used for approximation, or when local sketches are merged into a global sketch. Tree structures can differ between runs when tasks run on multiple workers in distributed mode, but the training effect of the model is theoretically the same. It is therefore normal to obtain different results even if you use the same data and parameters during training.

  • To accelerate training, you can set the Cores parameter to a larger value. The PS-SMART algorithm starts training only after all requested resources are provided. Therefore, the more resources you request, the longer you may wait for them to be allocated.
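As an illustration of the sparse key-value format described in the notes above, the following Python sketch parses a row such as 1:0.3 3:0.9 into a dense feature vector. It assumes 1-based feature IDs and is only a local illustration, not part of the PAI toolchain:

```python
def parse_kv_row(row, num_features):
    """Parse a sparse key-value row such as '1:0.3 3:0.9' into a dense vector.

    Feature IDs must be positive integers and feature values real numbers,
    as required by the PS-SMART components. IDs are assumed to be 1-based.
    """
    vec = [0.0] * num_features
    for pair in row.split():
        key, value = pair.split(":")
        fid = int(key)  # raises ValueError if the ID is not an integer
        if fid <= 0:
            raise ValueError("feature IDs must be positive integers")
        vec[fid - 1] = float(value)
    return vec

# Example: a row that sets features 1 and 3, out of 4 features in total.
print(parse_kv_row("1:0.3 3:0.9", 4))  # [0.3, 0.0, 0.9, 0.0]
```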

Configure the component

You can use one of the following methods to configure the component.

Method 1: Use the console

Configure the component parameters in Machine Learning Designer. The following table describes the parameters.

Fields Setting

  • Use Sparse Format: Specifies whether the input data is sparse. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9.

  • Feature Columns: The feature columns selected from the input table for training. If data in the input table is dense, only columns of the BIGINT and DOUBLE types are supported. If data in the input table is in the sparse key-value format and both keys and values are numeric, only columns of the STRING type are supported.

  • Label Column: The label column in the input table. Columns of the STRING type and numeric data types are supported, but the data they contain must be numeric, such as the values 0 and 1 used in binary classification.

  • Weight Column: The column that contains the weight of each row of samples. Columns of numeric data types are supported.

Parameters Setting

  • Evaluation Indicator Type: The evaluation metric type of the training set. Valid values:

      • Negative Loglikelihood for Logistic Regression

      • Binary Classification Error

      • AUC for Classification

  • Trees: The number of trees. The value must be a positive integer. Training time grows in proportion to the number of trees.

  • Maximum Tree Depth: The maximum tree depth. The value must be a positive integer. The default value is 5, which allows a maximum of 32 leaf nodes.

  • Data Sampling Fraction: The ratio of data sampled when each tree is built. The sampled data is used to build a weak learner, which accelerates training.

  • Feature Sampling Fraction: The ratio of features sampled when each tree is built. The sampled features are used to build a weak learner, which accelerates training.

  • L1 Penalty Coefficient: Controls the size of leaf nodes. A larger value results in a more even distribution of leaf outputs. Increase this value if overfitting occurs.

  • L2 Penalty Coefficient: Controls the size of leaf nodes. A larger value results in a more even distribution of leaf outputs. Increase this value if overfitting occurs.

  • Learning Rate: The learning rate. Valid values: (0,1).

  • Sketch-based Approximate Precision: The threshold for selecting quantiles when a sketch is built. A smaller value produces more bins. In most cases, the default value 0.03 is sufficient.

  • Minimum Split Loss Change: The minimum loss change required to split a node. A larger value makes node splitting less likely.

  • Features: The number of features or the maximum feature ID. This parameter is used to estimate resource usage. If it is not specified, the system automatically runs an SQL task to calculate the value.

  • Global Offset: The initial prediction value of all samples.

  • Random Seed: The random seed. The value must be an integer.

  • Feature Importance Type: The feature importance type. Valid values:

      • Weight: the number of times the feature is used to split a node.

      • Gain: the information gain provided by the feature. This is the default value.

      • Cover: the number of samples covered by the feature on split nodes.

Tuning

  • Cores: The number of cores. By default, the system determines the value.

  • Memory Size per Core: The memory size of each core. Unit: MB. In most cases, the system determines the value.
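The three evaluation metric types listed above can be computed on a handful of predictions to build intuition. The following Python sketch implements them directly (a simplified single-machine illustration, not the component's internal implementation; AUC uses the rank-comparison formulation without tie handling):

```python
import math

def logloss(y_true, p_pred):
    # Negative log-likelihood for logistic regression.
    return -sum(math.log(p) if y == 1 else math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

def error(y_true, p_pred, threshold=0.5):
    # Binary classification error: fraction of misclassified samples.
    return sum((p >= threshold) != (y == 1)
               for y, p in zip(y_true, p_pred)) / len(y_true)

def auc(y_true, p_pred):
    # AUC: fraction of (positive, negative) pairs ranked correctly.
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1 for pp in pos for pn in neg if pp > pn)
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(round(auc(y, p), 2))  # 0.75
```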

Method 2: Use PAI commands

Configure the component parameters by using Platform for AI (PAI) commands. The following section describes the parameters. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

# Training 
PAI -name ps_smart
    -project algo_public
    -DinputTableName="smart_binary_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545859_2"
    -DoutputImportanceTableName="pai_temp_24515_545859_3"
    -DlabelColName="label"
    -DfeatureColNames="f0,f1,f2,f3,f4,f5"
    -DenableSparse="false"
    -Dobjective="binary:logistic"
    -Dmetric="error"
    -DfeatureImportanceType="gain"
    -DtreeCount="5"
    -DmaxDepth="5"
    -Dshrinkage="0.3"
    -Dl2="1.0"
    -Dl1="0"
    -Dlifecycle="3"
    -DsketchEps="0.03"
    -DsampleRatio="1.0"
    -DfeatureRatio="1.0"
    -DbaseScore="0.5"
    -DminSplitLoss="0";

# Prediction 
PAI -name prediction
    -project algo_public
    -DinputTableName="smart_binary_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545860_1"
    -DfeatureColNames="f0,f1,f2,f3,f4,f5"
    -DappendColNames="label,qid,f0,f1,f2,f3,f4,f5"
    -DenableSparse="false"
    -Dlifecycle="28";
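The commands above follow a uniform pattern: a component name, a project, and a list of -D key-value flags. If you generate such commands from a script, a small string-formatting helper keeps the parameters in one place. This is only an illustrative helper, not a PAI API; parameter names must still match those the component accepts:

```python
def build_pai_command(name, project, params):
    """Assemble a PAI command string from a parameter dict.

    Purely a string-formatting sketch; it does not validate or submit
    anything. Keys must be valid component parameters, e.g. inputTableName.
    """
    lines = [f"PAI -name {name}", f"    -project {project}"]
    lines += [f'    -D{key}="{value}"' for key, value in params.items()]
    return "\n".join(lines) + ";"

cmd = build_pai_command("ps_smart", "algo_public", {
    "inputTableName": "smart_binary_input",
    "labelColName": "label",
    "objective": "binary:logistic",
    "treeCount": 5,
})
print(cmd)
```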

Data parameters

  • featureColNames (required; no default): The feature columns selected from the input table for training. If data in the input table is dense, only columns of the BIGINT and DOUBLE types are supported. If data in the input table is in the sparse key-value format and both keys and values are numeric, only columns of the STRING type are supported.

  • labelColName (required; no default): The label column in the input table. Columns of the STRING type and numeric data types are supported, but the data they contain must be numeric, such as the values 0 and 1 used in binary classification.

  • weightCol (optional; no default): The column that contains the weight of each row of samples. Columns of numeric data types are supported.

  • enableSparse (optional; default: false): Specifies whether the input data is sparse. Valid values: true and false. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9.

  • inputTableName (required; no default): The name of the input table.

  • modelName (required; no default): The name of the output model.

  • outputImportanceTableName (optional; no default): The name of the table that stores feature importance.

  • inputTablePartitions (optional; no default): The partitions selected from the input table for training. Format: ds=1/pt=1.

  • outputTableName (optional; no default): The name of the generated MaxCompute table. The table stores the model in binary format, cannot be read directly, and can be used only by the PS-SMART prediction component.

  • lifecycle (optional; default: 3): The lifecycle of the output table. Unit: days.

Algorithm parameters

  • objective (required; no default): The type of the objective function. For binary classification training, set this parameter to binary:logistic.

  • metric (optional; no default): The evaluation metric type of the training set, which is written to the stdout of the coordinator in LogView. Valid values:

      • logloss: corresponds to Negative Loglikelihood for Logistic Regression in the PAI console.

      • error: corresponds to Binary Classification Error in the PAI console.

      • auc: corresponds to AUC for Classification in the PAI console.

  • treeCount (optional; default: 1): The number of trees. Training time grows in proportion to this value.

  • maxDepth (optional; default: 5): The maximum depth of each tree. The value must be a positive integer. Valid values: 1 to 20.

  • sampleRatio (optional; default: 1.0): The data sampling ratio. Valid values: (0,1]. A value of 1.0 disables data sampling.

  • featureRatio (optional; default: 1.0): The feature sampling ratio. Valid values: (0,1]. A value of 1.0 disables feature sampling.

  • l1 (optional; default: 0): The L1 penalty coefficient. A larger value results in a more even distribution of leaf outputs. Increase this value if overfitting occurs.

  • l2 (optional; default: 1.0): The L2 penalty coefficient. A larger value results in a more even distribution of leaf outputs. Increase this value if overfitting occurs.

  • shrinkage (optional; default: 0.3): The learning rate. Valid values: (0,1).

  • sketchEps (optional; default: 0.03): The threshold for selecting quantiles when a sketch is built. The number of bins is O(1.0/sketchEps). A smaller value produces more bins. In most cases, the default value is sufficient. Valid values: (0,1).

  • minSplitLoss (optional; default: 0): The minimum loss change required to split a node. A larger value makes node splitting less likely.

  • featureNum (optional; no default): The number of features or the maximum feature ID. This parameter is used to estimate resource usage. If it is not specified, the system automatically runs an SQL task to calculate the value.

  • baseScore (optional; default: 0.5): The initial prediction value of all samples.

  • randSeed (optional; no default): The random seed. The value must be an integer.

  • featureImportanceType (optional; default: gain): The feature importance type. Valid values:

      • weight: the number of times the feature is used to split a node.

      • gain: the information gain provided by the feature.

      • cover: the number of samples covered by the feature on split nodes.

Tuning parameters

  • coreNum (optional; default: automatically allocated): The number of cores used in computing. A larger value speeds up the algorithm.

  • memSizePerCore (optional; default: automatically allocated): The memory size of each core. Unit: MB.
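The sketchEps parameter above controls how many histogram bins each feature is summarized into: on the order of 1.0/sketchEps candidate split points, so the default 0.03 yields roughly 33. The following Python sketch approximates this on a single machine (it is an illustration only, not the distributed quantile sketch PS-SMART actually uses):

```python
def quantile_bins(values, eps):
    """Pick roughly 1/eps evenly spaced quantiles as candidate split points.

    Simplified single-machine illustration of histogram construction;
    eps plays the role of sketchEps, so eps = 0.03 gives ~33 candidates.
    """
    num_bins = int(1.0 / eps)
    ordered = sorted(values)
    return [ordered[int(q * (len(ordered) - 1) / num_bins)]
            for q in range(1, num_bins + 1)]

feature = [float(v) for v in range(1000)]
splits = quantile_bins(feature, eps=0.03)
print(len(splits))  # 33
```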

Example

  1. Execute the following SQL statements on an ODPS SQL node to generate training data. In this example, dense training data is generated.

    drop table if exists smart_binary_input;
    create table smart_binary_input lifecycle 3 as
    select
    *
    from
    (
    select 0.72 as f0, 0.42 as f1, 0.55 as f2, -0.09 as f3, 1.79 as f4, -1.2 as f5, 0 as label
    union all
    select 1.23 as f0, -0.33 as f1, -1.55 as f2, 0.92 as f3, -0.04 as f4, -0.1 as f5, 1 as label
    union all
    select -0.2 as f0, -0.55 as f1, -1.28 as f2, 0.48 as f3, -1.7 as f4, 1.13 as f5, 1 as label
    union all
    select 1.24 as f0, -0.68 as f1, 1.82 as f2, 1.57 as f3, 1.18 as f4, 0.2 as f5, 0 as label
    union all
    select -0.85 as f0, 0.19 as f1, -0.06 as f2, -0.55 as f3, 0.31 as f4, 0.08 as f5, 1 as label
    union all
    select 0.58 as f0, -1.39 as f1, 0.05 as f2, 2.18 as f3, -0.02 as f4, 1.71 as f5, 0 as label
    union all
    select -0.48 as f0, 0.79 as f1, 2.52 as f2, -1.19 as f3, 0.9 as f4, -1.04 as f5, 1 as label
    union all
    select 1.02 as f0, -0.88 as f1, 0.82 as f2, 1.82 as f3, 1.55 as f4, 0.53 as f5, 0 as label
    union all
    select 1.19 as f0, -1.18 as f1, -1.1 as f2, 2.26 as f3, 1.22 as f4, 0.92 as f5, 0 as label
    union all
    select -2.78 as f0, 2.33 as f1, 1.18 as f2, -4.5 as f3, -1.31 as f4, -1.8 as f5, 1 as label
    ) tmp;

    The following figure shows the generated training data.

  2. Create the pipeline shown in the following figure and run the component. For more information, see Algorithm modeling.

    1. In the left-side component list of Machine Learning Designer, separately search for the Read Table, PS-SMART Binary Classification Training, Prediction, and Write Table components, and drag the components to the canvas on the right.

    2. Connect nodes by drawing lines to organize the nodes into a pipeline that includes upstream and downstream relationships based on the preceding figure.

    3. Configure the component parameters.

      • On the canvas, click the Read Table-1 component. On the Select Table tab in the right pane, set Table Name to smart_binary_input.

      • On the canvas, click the PS-SMART Binary Classification Training-1 component and configure the parameters listed in the following table in the right pane. Retain the default values for the parameters that are not listed in the table.

        Fields Setting:

          • Feature Columns: Select the f0, f1, f2, f3, f4, and f5 feature columns.

          • Label Column: Select the label column.

        Parameters Setting:

          • Evaluation Indicator Type: Select AUC for Classification.

          • Trees: Set this parameter to 5.

      • On the canvas, click the Prediction-1 component. On the Field Settings tab in the right pane, select Select All for Reserved Columns. Retain the default values for the remaining parameters.

      • On the canvas, click the Write Table-1 component. On the Select Table tab in the right pane, set New Table Name to smart_binary_output.

    4. After the parameter configuration is complete, click the Run button to run the pipeline.

  3. Right-click the Prediction-1 component and choose View Data > Prediction Result Output Port to view the prediction results. In the prediction_detail column, 1 indicates a positive example, and 0 indicates a negative example.

  4. Right-click the PS-SMART Binary Classification Training-1 component and choose View Data > Output Feature Importance Table to view the feature importance table. The table contains the following fields:

    • id: the ID of an input feature. In this example, the f0, f1, f2, f3, f4, and f5 features are passed in. Therefore, in the id column, 0 represents feature column f0, and 4 represents feature column f4. If data in the input table is in the key-value format, the id column lists the keys of the key-value pairs.

    • value: the feature importance. The default feature importance type is gain, which indicates the sum of the information gains that the feature provides to the model.

    • The feature importance table in this example contains only three features, which indicates that only these three features are used to split the trees. The importance of the other features can be considered 0.
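The 0/1 labels shown in the prediction results come from thresholding a predicted probability. With the binary:logistic objective, a raw tree score is mapped through a sigmoid, with baseScore (default 0.5) acting as the initial prediction. The following Python sketch illustrates this relationship; the tree scores below are made-up values, and folding baseScore in on the margin scale is a simplifying assumption for illustration:

```python
import math

def predict_label(tree_score, base_score=0.5, threshold=0.5):
    """Convert a raw GBDT score to a probability and a 0/1 label.

    base_score is folded in on the margin (log-odds) scale before the
    sigmoid; with the default 0.5 its contribution is zero.
    """
    margin = math.log(base_score / (1 - base_score)) + tree_score
    prob = 1.0 / (1.0 + math.exp(-margin))
    return (1 if prob >= threshold else 0), prob

label, prob = predict_label(2.0)  # made-up positive tree score
print(label)  # 1
```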

PS-SMART model deployment

If you want to deploy the model generated by the PS-SMART Binary Classification Training component to EAS as an online service, you must add the Model export component as a downstream node for the PS-SMART Binary Classification Training component and configure the Model export component. For more information, see Model export.

After the Model export component is successfully run, you can deploy the generated model to EAS as an online service on the EAS-Online Model Services page. For more information, see Model service deployment by using the PAI console.
