PS-SMART Binary Classification Training - Platform For AI

A parameter server (PS) is used to process a large number of offline and online training tasks. SMART is short for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that implements gradient boosting decision tree (GBDT) on a PS. The PS-SMART Binary Classification Training component supports training tasks for tens of billions of samples and hundreds of thousands of features, and can run training tasks on thousands of nodes. The component also supports multiple data formats and optimization technologies such as histogram-based approximation.

Limits

This component can run only on the computing resources of MaxCompute.

Usage notes

Only columns of numeric data types can be used by the PS-SMART Binary Classification Training component. 0 indicates a negative example, and 1 indicates a positive example. If the type of data in the MaxCompute table is STRING, the data type must be converted first. For example, you must convert Good/Bad to 1/0.
If data is in the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If the data type of feature IDs is STRING, you must use the serialization component to serialize the data. If feature values are categorical strings, you must perform feature engineering such as feature discretization to process the values.
The PS-SMART Binary Classification Training component supports tasks with hundreds of thousands of features. However, such tasks are resource-intensive and time-consuming. To mitigate this, take advantage of the fact that GBDT algorithms are suited to training on continuous features. You can one-hot encode categorical features and filter out the resulting low-frequency features. However, we recommend that you do not perform feature discretization on continuous features of numeric data types.
The PS-SMART algorithm may introduce randomness in the following scenarios: data and feature sampling based on data_sample_ratio and fea_sample_ratio, histogram-based approximation used to optimize the algorithm, and the merging of local sketches into a global sketch. When tasks run on multiple workers in distributed mode, the structures of the trees can differ, although the training effect of the model is theoretically the same. It is therefore normal to obtain different results when you train with the same data and parameters.
If you want to accelerate training, you can set the Cores parameter to a larger value. However, the PS-SMART algorithm starts a training task only after all requested resources are available. Therefore, the more resources you request, the longer you may have to wait.
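The data preparation described in the notes above can be tried locally before the data is uploaded to MaxCompute. The following Python sketch is illustrative only: the helper names (encode_label, build_feature_ids, to_key_value) and the sample records are hypothetical and are not part of PAI. It maps Good/Bad labels to 1/0 and assigns positive integer IDs to string feature names so that each row can be written in the key-value format.

```python
# Illustrative sketch only: helper names and sample data are hypothetical.

LABEL_MAP = {"Good": 1, "Bad": 0}  # 1 = positive example, 0 = negative example


def encode_label(raw):
    """Map a string label to the numeric label the component expects."""
    return LABEL_MAP[raw]


def build_feature_ids(names):
    """Assign a positive integer ID to each distinct feature name, starting at 1."""
    return {name: idx for idx, name in enumerate(sorted(set(names)), start=1)}


def to_key_value(features, feature_ids):
    """Render {name: value} as 'id:value' pairs separated by spaces, ordered by ID."""
    pairs = sorted((feature_ids[name], float(value)) for name, value in features.items())
    return " ".join(f"{fid}:{value}" for fid, value in pairs)


records = [
    {"label": "Good", "features": {"age": 0.3, "income": 0.9}},
    {"label": "Bad", "features": {"income": 0.1}},
]
feature_ids = build_feature_ids(name for r in records for name in r["features"])
rows = [(encode_label(r["label"]), to_key_value(r["features"], feature_ids)) for r in records]
```

Sorting the feature names before numbering keeps the ID assignment stable across runs, which matters because the same mapping must be reused at prediction time.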

Configure the component

You can use one of the following methods to configure the component.

Method 1: Use the console

Configure the component parameters in Machine Learning Designer. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Use Sparse Format	Specifies whether the input data is sparse. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9.
	Feature Columns	Select the feature columns for training from the input table. If data in the input table is dense, only the columns of the BIGINT and DOUBLE types are supported. If data in the input table is sparse key-value pairs and keys and values are of numeric data types, only columns of the STRING type are supported.
	Label Column	Select the label column from the input table. The columns of the STRING type and numeric data types are supported. Data contained in the columns must be of numeric types, such as 0 and 1 that are used in binary classification.
	Weight Column	Select the column that contains the weight of each row of samples. The columns of numeric data types are supported.
Parameters Setting	Evaluation Indicator Type	Select the evaluation metric type of the training set. Valid values: Negative Loglikelihood for Logistic Regression, Binary Classification Error, and AUC for Classification (area under the curve).
	Trees	Enter the number of trees. The value of this parameter must be an integer. The number of trees is proportional to the amount of training time.
	Maximum Tree Depth	Enter the maximum tree depth. The value of this parameter must be a positive integer. The default value is 5, which allows a maximum of 32 leaf nodes per tree.
	Data Sampling Fraction	Enter the data sampling ratio when trees are built. The sample data is used to build a weak learner to accelerate training.
	Feature Sampling Fraction	Enter the feature sampling ratio when trees are built. The sample features are used to build a weak learner to accelerate training.
	L1 Penalty Coefficient	Control the size of a leaf node. A larger value results in a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.
	L2 Penalty Coefficient	Control the size of a leaf node. A larger value results in a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.
	Learning Rate	Enter the learning rate. Valid values: (0,1).
	Sketch-based Approximate Precision	Enter the threshold for selecting quantiles when you build a sketch. A smaller value specifies that more bins can be obtained. In most cases, the default value 0.03 is used.
	Minimum Split Loss Change	Enter the minimum loss change required for splitting a node. A larger value specifies that node splitting is less likely to occur.
	Features	Enter the number of features or the maximum feature ID. If this parameter is not specified for resource usage estimation, the system automatically runs an SQL task to calculate the number of features or the maximum feature ID.
	Global Offset	Enter the initial prediction values of all samples.
	Random Seed	Enter the random seed. The value of this parameter must be an integer.
	Feature Importance Type	Select the feature importance type. Valid values: Weight: indicates the number of splits of the feature. Gain: indicates the information gain provided by the feature. This is the default value. Cover: indicates the number of samples covered by the feature on the split node.
Tuning	Cores	Select the number of cores. By default, the system determines the value.
Tuning	Memory Size per Core	Select the memory size of each core. Unit: MB. In most cases, the system determines the memory size.
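As a local sanity check for the sparse format described for Use Sparse Format, the following Python sketch parses a key-value string such as 1:0.3 3:0.9 and rejects feature IDs that are not positive integers. The function name parse_sparse_row is hypothetical, and this is not the parser that the component uses.

```python
def parse_sparse_row(text):
    """Parse 'id:value id:value ...' into a dict {int id: float value}."""
    parsed = {}
    for pair in text.split():  # key-value pairs are separated by spaces
        key, sep, value = pair.partition(":")  # keys and values are separated by a colon
        if not sep:
            raise ValueError(f"missing ':' in {pair!r}")
        feature_id = int(key)  # raises ValueError for non-integer keys
        if feature_id <= 0:
            raise ValueError(f"feature ID must be a positive integer: {key!r}")
        parsed[feature_id] = float(value)
    return parsed


row = parse_sparse_row("1:0.3 3:0.9")
```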

Method 2: Use PAI commands

Configure the component parameters by using Platform for AI (PAI) commands. The following section describes the parameters. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

# Training 
PAI -name ps_smart
    -project algo_public
    -DinputTableName="smart_binary_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545859_2"
    -DoutputImportanceTableName="pai_temp_24515_545859_3"
    -DlabelColName="label"
    -DfeatureColNames="f0,f1,f2,f3,f4,f5"
    -DenableSparse="false"
    -Dobjective="binary:logistic"
    -Dmetric="error"
    -DfeatureImportanceType="gain"
    -DtreeCount="5"
    -DmaxDepth="5"
    -Dshrinkage="0.3"
    -Dl2="1.0"
    -Dl1="0"
    -Dlifecycle="3"
    -DsketchEps="0.03"
    -DsampleRatio="1.0"
    -DfeatureRatio="1.0"
    -DbaseScore="0.5"
    -DminSplitLoss="0";

# Prediction 
PAI -name prediction
    -project algo_public
    -DinputTableName="smart_binary_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545860_1"
    -DfeatureColNames="f0,f1,f2,f3,f4,f5"
    -DappendColNames="label,f0,f1,f2,f3,f4,f5"
    -DenableSparse="false"
    -Dlifecycle="28";
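
When the training command is generated by a script, it can help to assemble the -D options from a dictionary. The following Python sketch is a plain string builder, assuming a hypothetical helper named build_pai_command. It does not submit anything to MaxCompute and only reproduces the command syntax shown above.

```python
def build_pai_command(name, project, params):
    """Return a PAI command string with one -Dkey="value" option per parameter."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        lines.append(f'    -D{key}="{value}"')
    return "\n".join(lines) + ";"  # the command ends with a semicolon


# Parameter names and values are taken from the training command above.
training_params = {
    "inputTableName": "smart_binary_input",
    "modelName": "xlab_m_pai_ps_smart_bi_545859_v0",
    "labelColName": "label",
    "featureColNames": "f0,f1,f2,f3,f4,f5",
    "enableSparse": "false",
    "objective": "binary:logistic",
    "metric": "error",
    "treeCount": 5,
    "maxDepth": 5,
}
command = build_pai_command("ps_smart", "algo_public", training_params)
print(command)
```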

Module	Parameter	Required	Description	Default value
Data parameters	featureColNames	Yes	The feature columns that are selected from the input table for training. If data in the input table is dense, only the columns of the BIGINT and DOUBLE types are supported. If data in the input table is sparse data in the key-value format, and keys and values are of numeric data types, only columns of the STRING data type are supported.	None
	labelColName	Yes	The label column in the input table. The columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, column values can be 0 or 1 in binary classification.	None
	weightCol	No	The column that contains the weight of each row of samples. The columns of numeric data types are supported.	None
	enableSparse	No	Specifies whether the input data is sparse. Valid values: true and false. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9.	false
	inputTableName	Yes	The name of the input table.	None
	modelName	Yes	The name of the output model.	None
	outputImportanceTableName	No	The name of the table that provides feature importance.	None
	inputTablePartitions	No	The partitions that are selected from the input table for training. Format: ds=1/pt=1.	None
	outputTableName	No	The name of the MaxCompute table that is generated. The table stores binary data. It is not human-readable and can be consumed only by the PS-SMART prediction component.	None
	lifecycle	No	The lifecycle of the output table. Unit: days.	3
Algorithm parameters	objective	Yes	The type of the objective function. If the training is performed by using binary classification components, set the parameter to binary:logistic.	None
	metric	No	The evaluation metric type of the training set, which is contained in stdout of the coordinator in LogView. Valid values: logloss: corresponds to the Negative Loglikelihood for Logistic Regression value of the Evaluation Indicator Type parameter in the PAI console. error: corresponds to the Binary Classification Error value of the Evaluation Indicator Type parameter in the PAI console. auc: corresponds to the AUC for Classification value of the Evaluation Indicator Type parameter in the PAI console.	None
	treeCount	No	The number of trees. The value is proportional to the amount of training time.	1
	maxDepth	No	The maximum depth of the tree. The value must be a positive integer. Valid values: 1 to 20.	5
	sampleRatio	No	The data sampling ratio. Valid values: (0,1]. If this parameter is set to 1.0, data sampling is disabled and all data is used.	1.0
	featureRatio	No	The feature sampling ratio. Valid values: (0,1]. If this parameter is set to 1.0, feature sampling is disabled and all features are used.	1.0
	l1	No	The L1 penalty coefficient. A larger value of this parameter indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.	0
	l2	No	The L2 penalty coefficient. A larger value of this parameter indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.	1.0
	shrinkage	No	The learning rate. Valid values: (0,1).	0.3
	sketchEps	No	The threshold for selecting quantiles when you build a sketch. The number of bins is O(1.0/sketchEps). A smaller value specifies that more bins can be obtained. In most cases, the default value is used. Valid values: (0,1).	0.03
	minSplitLoss	No	The minimum loss change required for splitting a node. A larger value specifies that node splitting is less likely to occur.	0
	featureNum	No	The number of features or the maximum feature ID. If this parameter is not specified for resource usage estimation, the system automatically runs an SQL task to calculate the number of features or the maximum feature ID.	None
	baseScore	No	The initial prediction values of all samples.	0.5
	randSeed	No	The random seed. The value of this parameter must be an integer.	None
	featureImportanceType	No	The feature importance type. Valid values: weight: indicates the number of splits of the feature. gain: indicates the information gain provided by the feature. cover: indicates the number of samples covered by the feature on the splitting node.	gain
Tuning parameters	coreNum	No	The number of cores used in computing. A larger value indicates faster running of the computing algorithm.	Automatically allocated
Tuning parameters	memSizePerCore	No	The memory size of each core. Unit: MB.	Automatically allocated
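
The value ranges in the preceding table can be checked before a job is submitted. The following Python sketch is a hypothetical pre-flight check (the function check_params is not part of PAI) that encodes only the ranges stated in the table: maxDepth from 1 to 20, sampleRatio and featureRatio in (0,1], and shrinkage and sketchEps in (0,1). It also computes 1.0/sketchEps as a rough indication of the bin count, which the table gives only as an order of magnitude.

```python
def check_params(params):
    """Return a list of messages for values outside the documented ranges."""
    problems = []
    depth = params.get("maxDepth", 5)
    if not (isinstance(depth, int) and 1 <= depth <= 20):
        problems.append("maxDepth must be an integer from 1 to 20")
    for key in ("sampleRatio", "featureRatio"):
        if not 0 < params.get(key, 1.0) <= 1:
            problems.append(f"{key} must be in (0,1]")
    for key, default in (("shrinkage", 0.3), ("sketchEps", 0.03)):
        if not 0 < params.get(key, default) < 1:
            problems.append(f"{key} must be in (0,1)")
    return problems


ok = check_params({"maxDepth": 5, "shrinkage": 0.3, "sketchEps": 0.03})
bad = check_params({"maxDepth": 25, "sampleRatio": 0.0, "shrinkage": 1.0})

# Rough bin count implied by O(1.0/sketchEps); the constant factor is not documented.
approx_bins = round(1.0 / 0.03)
```

Missing keys fall back to the defaults listed in the table, so an empty dictionary passes the check.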

Example

Execute the following SQL statements on an ODPS SQL node to generate training data. In this example, dense training data is generated.

drop table if exists smart_binary_input;
create table smart_binary_input lifecycle 3 as
select
*
from
(
select 0.72 as f0, 0.42 as f1, 0.55 as f2, -0.09 as f3, 1.79 as f4, -1.2 as f5, 0 as label
union all
select 1.23 as f0, -0.33 as f1, -1.55 as f2, 0.92 as f3, -0.04 as f4, -0.1 as f5, 1 as label
union all
select -0.2 as f0, -0.55 as f1, -1.28 as f2, 0.48 as f3, -1.7 as f4, 1.13 as f5, 1 as label
union all
select 1.24 as f0, -0.68 as f1, 1.82 as f2, 1.57 as f3, 1.18 as f4, 0.2 as f5, 0 as label
union all
select -0.85 as f0, 0.19 as f1, -0.06 as f2, -0.55 as f3, 0.31 as f4, 0.08 as f5, 1 as label
union all
select 0.58 as f0, -1.39 as f1, 0.05 as f2, 2.18 as f3, -0.02 as f4, 1.71 as f5, 0 as label
union all
select -0.48 as f0, 0.79 as f1, 2.52 as f2, -1.19 as f3, 0.9 as f4, -1.04 as f5, 1 as label
union all
select 1.02 as f0, -0.88 as f1, 0.82 as f2, 1.82 as f3, 1.55 as f4, 0.53 as f5, 0 as label
union all
select 1.19 as f0, -1.18 as f1, -1.1 as f2, 2.26 as f3, 1.22 as f4, 0.92 as f5, 0 as label
union all
select -2.78 as f0, 2.33 as f1, 1.18 as f2, -4.5 as f3, -1.31 as f4, -1.8 as f5, 1 as label
) tmp;

The following figure shows the generated training data.

Create the pipeline shown in the following figure and run the component. For more information, see Algorithm modeling.

In the left-side component list of Machine Learning Designer, separately search for the Read Table, PS-SMART Binary Classification Training, Prediction, and Write Table components, and drag the components to the canvas on the right.
Connect nodes by drawing lines to organize the nodes into a pipeline that includes upstream and downstream relationships based on the preceding figure.

Configure the component parameters.

On the canvas, click the Read Table-1 component. On the Select Table tab in the right pane, set Table Name to smart_binary_input.

On the canvas, click the PS-SMART Binary Classification Training-1 component and configure the parameters listed in the following table in the right pane. Retain the default values for the parameters that are not listed in the table.

Tab	Parameter	Description
Fields Setting	Feature Columns	Select the feature columns. Select the f0, f1, f2, f3, f4, and f5 columns.
Fields Setting	Label Column	Select the label column.
Parameters Setting	Evaluation Indicator Type	Select the evaluation metric type. Set the parameter to AUC for Classification.
Parameters Setting	Trees	Set this parameter to 5.

On the canvas, click the Prediction-1 component. On the Field Settings tab in the right pane, select Select All for Reserved Columns. Retain the default values for the remaining parameters.
On the canvas, click the Write Table-1 component. On the Select Table tab in the right pane, set New Table Name to smart_binary_output.

After the parameter configuration is complete, click the run button to run the pipeline.

Right-click the Prediction-1 component and choose View Data > Prediction Result Output Port to view the prediction results. 1 in the prediction_detail column indicates a positive example, and 0 indicates a negative example.
Right-click the PS-SMART Binary Classification Training-1 component and choose View Data > Output Feature Importance Table to view the feature importance table. The table contains the following columns:
- id: the ID of a passed feature. In this example, the f0, f1, f2, f3, f4, and f5 features are passed. Therefore, in the id column, 0 represents the f0 feature column, and 4 represents the f4 feature column. If data in the input table is key-value pairs, the id column lists the keys in the key-value pairs.
- value: the importance of the feature, measured according to the feature importance type. The default type is gain, which indicates the sum of the information gains provided by the feature for the model.
- The preceding feature importance table has only three features. This indicates that only these three features are used to split the trees. The feature importance of the other features can be considered 0.
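
To read the feature importance table by feature name rather than by positional ID, the mapping described above can be applied in a few lines. The following Python sketch assumes dense input with the feature order f0 to f5. The importance values are made-up placeholders, not output from this example, and importance_by_name is a hypothetical helper. Features that are absent from the table are given an importance of 0.

```python
def importance_by_name(importance_rows, feature_names):
    """Map positional feature IDs to names; features never used in a split get 0.0."""
    by_id = dict(importance_rows)
    return {name: by_id.get(idx, 0.0) for idx, name in enumerate(feature_names)}


feature_names = ["f0", "f1", "f2", "f3", "f4", "f5"]
# Placeholder (id, value) rows; real values come from the output importance table.
importance_rows = [(0, 1.5), (3, 4.0), (4, 0.25)]
result = importance_by_name(importance_rows, feature_names)
```

For key-value input, the id column already holds the feature keys, so the positional lookup would be replaced by a lookup on those keys.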

PS-SMART model deployment

If you want to deploy the model generated by the PS-SMART Binary Classification Training component to EAS as an online service, you must add the Model export component as a downstream node for the PS-SMART Binary Classification Training component and configure the Model export component. For more information, see Model export.

After the Model export component is successfully run, you can deploy the generated model to EAS as an online service on the EAS-Online Model Services page. For more information, see Model service deployment by using the PAI console.

References

For more information about the components provided by Machine Learning Designer, see Overview of Machine Learning Designer.
Machine Learning Designer provides various preset algorithm components. You can select a component to process data based on your actual business scenario. For more information, see Component reference: Overview of all components.