
Platform for AI: PS-SMART Multiclass Classification

Last Updated: Apr 03, 2024

A parameter server (PS) is used to process a large number of offline and online training jobs. Scalable Multiple Additive Regression Tree (SMART) is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT). The PS-SMART Multiclass Classification component of Platform for AI (PAI) supports training jobs for tens of billions of samples and hundreds of thousands of features. The component can run training jobs on thousands of nodes. The component also supports multiple data formats and optimization technologies, such as approximation by using histograms.

Limits

The input data of the PS-SMART Multiclass Classification component must meet the following requirements:

  • Data in the label column must be of a numeric data type. If the column in the MaxCompute table is of the STRING type, you must convert the values to a numeric type. For example, if the classification labels are strings such as Good/Medium/Bad, you must convert them to 0/1/2.

  • If the data is in the key-value format, feature IDs must be positive integers and feature values must be real numbers. If the feature IDs are of the STRING type, you must use the serialization component to serialize the data. If the feature values are categorical strings, you must perform feature engineering, such as feature discretization, to process the values.

  • The PS-SMART Multiclass Classification component supports training jobs that involve hundreds of thousands of features, but such jobs are resource-intensive and time-consuming. GBDT algorithms are suitable for training with continuous features. You can perform one-hot encoding on categorical features to filter out low-frequency features. We recommend that you do not perform feature discretization on continuous features of numeric data types.

  • The PS-SMART algorithm may introduce randomness, for example, when data and features are sampled based on the data_sample_ratio and fea_sample_ratio parameters, when the algorithm is optimized by using histograms for approximation, or when local sketches are merged into a global sketch. Tree structures also vary when a job runs on multiple worker nodes in distributed mode. Therefore, you may obtain different results even if you use the same data and parameters, although the training effect of the model is theoretically the same.

  • If you want to accelerate training, you can set the Cores parameter to a larger value. The PS-SMART algorithm starts training jobs after the required resources are provided. The waiting period increases with the amount of the requested resources.
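
The label conversion and key-value serialization described above can be sketched in a few lines of Python. This is an illustration only: the label mapping and feature IDs are made up, and in practice you would perform these conversions with MaxCompute SQL or the serialization component.

```python
# Hypothetical mapping from string labels to numeric class IDs (0/1/2).
LABEL_MAP = {"Good": 0, "Medium": 1, "Bad": 2}

def encode_label(label):
    """Map a categorical string label to its numeric class ID."""
    return LABEL_MAP[label]

def to_kv(features):
    """Serialize {feature_id: value} into the sparse key-value format:
    pairs separated by spaces, keys and values separated by colons."""
    return " ".join("%d:%g" % (k, v) for k, v in sorted(features.items()))

print(encode_label("Medium"))    # 1
print(to_kv({1: 0.3, 3: 0.9}))   # 1:0.3 3:0.9
```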

Configure the component

You can use one of the following methods to configure the PS-SMART Multiclass Classification component.

Method 1: Configure the component in the PAI console

Configure the component on the pipeline page of Machine Learning Designer. The parameters are described below by tab.

Fields Setting

  • Use Sparse Format: Specifies whether the input data is in the sparse key-value format. If it is, separate key-value pairs with spaces and separate keys and values with colons (:). Example: 1:0.3 3:0.9.

  • Feature Columns: The feature columns that are selected from the input table for training. If the data in the input table is in the dense format, only columns of the BIGINT and DOUBLE types are supported. If the data is sparse data in the key-value format and the keys and values are of numeric data types, only columns of the STRING type are supported.

  • Label Column: The label column in the input table. Columns of the STRING type and numeric data types are supported, but the column values must be numeric. For example, the values are {0,1,2,...,n-1} in multiclass classification with n classes.

  • Weight Column: The column that contains the weight of each row of samples. Columns of numeric data types are supported.

Parameters Setting

  • Classes: The number of classes for multiclass classification. If you set this parameter to n, the values of the label column must be {0,1,2,...,n-1}.

  • Evaluation Indicator Type: The evaluation metric. Valid values: Multiclass Negative Log Likelihood and Multiclass Classification Error.

  • Trees: The number of trees. The value must be a positive integer. Training duration increases in proportion to the number of trees.

  • Maximum Decision Tree Depth: The maximum depth of each tree. The default value is 5, which indicates that each tree has at most 32 (2^5) leaf nodes.

  • Data Sampling Ratio: The ratio of data sampled when trees are built. The sampled data is used to build a weak learner, which accelerates training.

  • Feature Sampling Fraction: The ratio of features sampled when trees are built. The sampled features are used to build a weak learner, which accelerates training.

  • L1 Penalty Coefficient: The L1 penalty coefficient. A larger value indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.

  • L2 Penalty Coefficient: The L2 penalty coefficient. A larger value indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.

  • Learning Rate: The learning rate. Valid values: (0,1).

  • Sketch-based Approximate Precision: The threshold for selecting quantiles when a sketch is built. A smaller value produces more bins. In most cases, the default value 0.03 is used.

  • Minimum Split Loss Change: The minimum loss change required to split a node. A larger value indicates a lower probability of node splitting.

  • Features: The number of features or the maximum feature ID. Configure this parameter if you want to assess resource usage.

  • Global Offset: The initial prediction value of all samples.

  • Random Seed: The random seed. The value must be an integer.

  • Feature Importance Type: The type of feature importance. Valid values: Weight, Gain, and Cover. Weight indicates the number of times the feature is used to split nodes. Gain indicates the information gain provided by the feature. Cover indicates the number of samples covered by the feature at split nodes.

Tuning

  • Cores: The number of cores. By default, the system determines the value.

  • Memory Size per Core (MB): The memory size of each core. Unit: MB. In most cases, the system determines the value.
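
As a quick check on the Maximum Decision Tree Depth note above: a binary tree of depth d has at most 2^d leaf nodes, which is where the figure of 32 leaves for the default depth of 5 comes from.

```python
def max_leaf_nodes(depth):
    # A binary decision tree of the given depth has at most 2**depth leaves.
    return 2 ** depth

print(max_leaf_nodes(5))   # 32
print(max_leaf_nodes(10))  # 1024
```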

Method 2: Configure the component by using PAI commands

The parameters that are used in PAI commands are described after the sample commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

# Training 
PAI -name ps_smart
    -project algo_public
    -DinputTableName="smart_multiclass_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545859_2"
    -DoutputImportanceTableName="pai_temp_24515_545859_3"
    -DlabelColName="label"
    -DfeatureColNames="features"
    -DenableSparse="true"
    -DclassNum="3"
    -Dobjective="multi:softprob"
    -Dmetric="mlogloss"
    -DfeatureImportanceType="gain"
    -DtreeCount="5"
    -DmaxDepth="5"
    -Dshrinkage="0.3"
    -Dl2="1.0"
    -Dl1="0"
    -Dlifecycle="3"
    -DsketchEps="0.03"
    -DsampleRatio="1.0"
    -DfeatureRatio="1.0"
    -DbaseScore="0.5"
    -DminSplitLoss="0";
# Prediction 
PAI -name prediction
    -project algo_public
    -DinputTableName="smart_multiclass_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545860_1"
    -DfeatureColNames="features"
    -DappendColNames="label,features"
    -DenableSparse="true"
    -DkvDelimiter=":"
    -Dlifecycle="28";

Data parameters

  • featureColNames (required): The feature columns that are selected from the input table for training. If the data in the input table is in the dense format, only columns of the BIGINT and DOUBLE types are supported. If the data is sparse data in the key-value format and the keys and values are of numeric data types, only columns of the STRING type are supported.

  • labelColName (required): The label column in the input table. Columns of the STRING type and numeric data types are supported, but the column values must be numeric. For example, the values are {0,1,2,...,n-1} in multiclass classification, where n indicates the number of classes.

  • weightCol (optional): The column that contains the weight of each row of samples. Columns of numeric data types are supported.

  • enableSparse (optional; default: false): Specifies whether the input data is in the sparse format. Valid values: true and false. If the input data is sparse data in the key-value format, separate key-value pairs with spaces and separate keys and values with colons (:). Example: 1:0.3 3:0.9.

  • inputTableName (required): The name of the input table.

  • modelName (required): The name of the output model.

  • outputImportanceTableName (optional): The name of the table that contains feature importance.

  • inputTablePartitions (optional): The partitions that are selected from the input table for training. Format: ds=1/pt=1.

  • outputTableName (optional): The name of the generated MaxCompute table. The table stores the model in a binary format that cannot be read directly and can be used only by the PS-SMART prediction component.

  • lifecycle (optional; default: 3): The lifecycle of the output table.

Algorithm parameters

  • classNum (required): The number of classes for multiclass classification. If you set this parameter to n, the values of the label column must be {0,1,2,...,n-1}.

  • objective (required): The type of the objective function. For multiclass classification training, specify the multi:softprob objective function.

  • metric (optional): The evaluation metric type of the training set, which is displayed in the stdout of the coordinator in the LogView. Valid values:

      • mlogloss: corresponds to the Multiclass Negative Log Likelihood value of the Evaluation Indicator Type parameter in the console.

      • merror: corresponds to the Multiclass Classification Error value of the Evaluation Indicator Type parameter in the console.

  • treeCount (optional; default: 1): The number of trees. Training duration increases in proportion to the number of trees.

  • maxDepth (optional; default: 5): The maximum depth of a tree. Valid values: 1 to 20.

  • sampleRatio (optional; default: 1.0): The data sampling ratio. Valid values: (0,1]. If you set this parameter to 1.0, no data is sampled.

  • featureRatio (optional; default: 1.0): The feature sampling ratio. Valid values: (0,1]. If you set this parameter to 1.0, no features are sampled.

  • l1 (optional; default: 0): The L1 penalty coefficient. A larger value indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.

  • l2 (optional; default: 1.0): The L2 penalty coefficient. A larger value indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.

  • shrinkage (optional; default: 0.3): The learning rate. Valid values: (0,1).

  • sketchEps (optional; default: 0.03): The threshold for selecting quantiles when a sketch is built. The number of bins is O(1.0/sketchEps). A smaller value produces more bins. In most cases, the default value is used. Valid values: (0,1).

  • minSplitLoss (optional; default: 0): The minimum loss change required to split a node. A larger value indicates a lower probability of node splitting.

  • featureNum (optional): The number of features or the maximum feature ID. Configure this parameter if you want to assess resource usage.

  • baseScore (optional; default: 0.5): The initial prediction value of all samples.

  • randSeed (optional): The random seed. The value must be an integer.

  • featureImportanceType (optional; default: gain): The type of feature importance. Valid values:

      • weight: the number of times the feature is used to split nodes.

      • gain: the information gain provided by the feature.

      • cover: the number of samples covered by the feature at split nodes.

Tuning parameters

  • coreNum (optional; default: automatically allocated): The number of cores used in computing. Computing speed increases with the value of this parameter.

  • memSizePerCore (optional; default: automatically allocated): The memory size of each core. Unit: MB.
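
The relationship between sketchEps and the number of histogram bins stated above can be made concrete: with the number of bins on the order of 1.0/sketchEps, the default value of 0.03 gives roughly 33 bins. The following is a rough illustration only; the exact binning inside PS-SMART may differ.

```python
def approx_bin_count(sketch_eps):
    # The number of bins is on the order of 1.0 / sketchEps.
    return int(1.0 / sketch_eps)

print(approx_bin_count(0.03))  # 33
print(approx_bin_count(0.1))   # 10
```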

Examples

  1. Create a table named smart_multiclass_input by using the ODPS SQL node. For more information, see Develop a MaxCompute SQL task. In this example, input data in the key-value format is generated.

    drop table if exists smart_multiclass_input;
    create table smart_multiclass_input lifecycle 3 as
    select
    *
    from
    (
    select '2' as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features 
        union all
    select '1' as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features 
        union all
    select '1' as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features 
        union all
    select '2' as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features 
        union all
    select '1' as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features 
        union all
    select '1' as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features 
        union all
    select '0' as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features 
        union all
    select '1' as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features 
        union all
    select '0' as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features 
        union all
    select '1' as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features 
    ) tmp;

    The following figure shows the generated data.

  2. Create a pipeline as shown in the following figure. For more information, see Generate a model.

  3. Configure the component parameters.

    1. Click the Read Table -1 component on the canvas. On the Select Table tab on the right, set the Table Name parameter to smart_multiclass_input.

    2. Configure the parameters for the PS-SMART Multiclass Classification component. Set the following parameters and use the default values for the parameters that are not listed.

      Fields Setting

        • Feature Columns: Select the features column.

        • Label Column: Select the label column.

        • Use Sparse Format: Select Use Sparse Format.

      Parameters Setting

        • Classes: Set the parameter to 3.

        • Evaluation Indicator Type: Select Multiclass Negative Log Likelihood from the drop-down list.

        • Trees: Set the parameter to 5.

    3. Configure the parameters for the Prediction-1 component. Set the following parameters and use the default values for the parameters that are not listed.

      Fields Setting

        • Feature Columns: Select the features column.

        • Reserved Columns: Select the label and features columns.

        • Sparse Matrix: Select Sparse Matrix.

        • KV Delimiter: Set the parameter to a colon (:).

        • KV Pair Delimiter: If you leave this field empty, a space is used as the delimiter.

    4. Click the Write Table -1 component on the canvas. On the Select Table tab on the right, set the Table Name parameter to smart_multiclass_output.

  4. Click the Run icon to run the pipeline.

  5. After you run the pipeline, right-click the Prediction-1 component and choose View Data > Prediction Result Output Port to view the prediction results. The prediction results contain the following columns:

    • prediction_detail: the classes used for multiclass classification. Valid values: 0, 1, and 2.

    • prediction_result: the classes of the prediction results.

    • prediction_score: the probabilities of classes in the prediction_result column.
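
Because training uses the multi:softprob objective, the model produces one probability per class for each sample; prediction_result is the class with the highest probability and prediction_score is that probability. The following is a minimal sketch of that relationship with made-up probabilities; the actual parsing is handled by the Prediction component.

```python
def pick_class(probs):
    """Given per-class probabilities, return (predicted class, its probability)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]

# Hypothetical probabilities for classes 0, 1, and 2 of one sample.
cls, score = pick_class([0.1, 0.7, 0.2])
print(cls, score)  # 1 0.7
```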

  6. On the canvas, right-click the PS-SMART Multiclass Classification component and choose View Data > View Output Port 3 to view the feature importance result.

    The feature importance table contains the following columns:

    • id: the ID of a passed feature. In this example, the input data is in the key-value format. The values in the id column indicate the keys in the key-value pairs.

    • value: the feature importance value. The default feature importance type is gain, which indicates the sum of the information gains that a feature provides to the model.

PS-SMART model deployment

If you want to deploy the model generated by the PS-SMART Multiclass Classification component to EAS as an online service, you must add the Model export component as a downstream node of the PS-SMART Multiclass Classification component and configure it. For more information, see Model export.

After the Model export component is successfully run, you can deploy the generated model to EAS as an online service on the EAS-Online Model Services page. For more information, see Model service deployment by using the PAI console.