A parameter server (PS) is used to process a large number of offline and online training tasks. SMART is short for Scalable Multiple Additive Regression Tree. PS-SMART is an iterative algorithm that implements gradient boosting decision trees (GBDT) on top of a parameter server. The PS-SMART Binary Classification Training component supports training tasks with tens of billions of samples and hundreds of thousands of features, and it can run training tasks on thousands of nodes. This component also supports multiple data formats and optimization techniques such as histogram-based approximation.
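To make the GBDT idea concrete, the following from-scratch Python sketch boosts depth-1 trees (stumps) against the negative gradient of the logistic loss. This is an illustration only, not the PS-SMART implementation: PS-SMART additionally uses second-order leaf weights, histogram approximation, and distributed parameter-server updates, none of which appear here, and all names (`fit_stump`, `train_gbdt`) are invented for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (stump) to the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def train_gbdt(xs, ys, tree_count=5, shrinkage=0.3):
    """Boost tree_count stumps; shrinkage plays the role of the learning rate."""
    scores = [0.0] * len(xs)  # raw scores F(x), before the sigmoid
    trees = []
    for _ in range(tree_count):
        # Negative gradient of the logistic loss: y - sigmoid(F(x)).
        residuals = [y - sigmoid(f) for y, f in zip(ys, scores)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        scores = [f + shrinkage * tree(x) for f, x in zip(scores, xs)]
    return trees

def predict(trees, x, shrinkage=0.3):
    """Sum the shrunken tree outputs and map them to a probability."""
    return sigmoid(sum(shrinkage * t(x) for t in trees))

# Toy one-dimensional data: class 1 for large x, class 0 for small x.
xs = [0.1, 0.4, 0.9, 1.3, 1.7, 2.0]
ys = [0, 0, 0, 1, 1, 1]
trees = train_gbdt(xs, ys)
print(predict(trees, 0.2), predict(trees, 1.8))  # first below 0.5, second above
```

Each boosting round fits a new tree to the current residuals, so adding trees lengthens training; this is why the treeCount parameter below is proportional to training time.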

Configure the component

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    Tab Parameter Description
    Fields Setting Use Sparse Format Specifies whether the input data is in the sparse format. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9.
    Feature Columns The feature columns that are selected from the input table for training. If data in the input table is in the dense format, only the columns of the BIGINT and DOUBLE types are supported. If data in the input table is key-value pairs in the sparse format, and keys and values are of numeric data types, only columns of the STRING type are supported.
    Label Column The label column in the input table. The columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, column values can be 0 or 1 in binary classification.
    Weight Column The column that contains the weight of each row of samples. The columns of numeric data types are supported.
    Parameters Setting Evaluation Indicator Type The evaluation metric type of the training set. Valid values: Negative Loglikelihood for Logistic Regression, Binary Classification Error, and AUC for Classification.
    Trees The number of trees. The training time is proportional to the number of trees.
    Maximum Tree Depth The maximum depth of a tree. The default value is 5, which indicates that a tree can have a maximum of 32 (2^5) leaf nodes.
    Data Sampling Fraction The data sampling ratio when trees are built. The sample data is used to build a weak learner to accelerate training.
    Feature Sampling Fraction The feature sampling ratio when trees are built. The sample features are used to build a weak learner to accelerate training.
    L1 Penalty Coefficient Controls the size of a leaf node. A larger value of this parameter indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.
    L2 Penalty Coefficient Controls the size of a leaf node. A larger value of this parameter indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value.
    Learning Rate The learning rate. Valid values: (0,1).
    Sketch-based Approximate Precision The threshold for selecting quantiles when you build a sketch. A smaller value indicates that more bins can be obtained. In most cases, the default value 0.03 is used.
    Minimum Split Loss Change The minimum loss change required for splitting a node. A larger value indicates that node splitting is less likely to occur.
    Features The number of features or the maximum feature ID. This value is used to assess resource usage. If it is not specified, the system automatically runs an SQL task to calculate the number of features or the maximum feature ID.
    Global Offset The initial prediction values of all samples.
    Feature Importance Type The feature importance type. Valid values: Weight, Gain, and Cover. Weight indicates the number of splits of the feature. Gain indicates the information gain provided by the feature. Cover indicates the number of samples covered by the feature on the split node.
    Tuning Cores The number of cores. By default, the system determines the value.
    Memory Size per Core The memory size of each core. Unit: MB. In most cases, the system determines the memory size.
  • Use commands
    # Training 
    PAI -name ps_smart
        -project algo_public
        -DinputTableName="smart_binary_input"
        -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
        -DoutputTableName="pai_temp_24515_545859_2"
        -DoutputImportanceTableName="pai_temp_24515_545859_3"
        -DlabelColName="label"
        -DfeatureColNames="f0,f1,f2,f3,f4,f5"
        -DenableSparse="false"
        -Dobjective="binary:logistic"
        -Dmetric="error"
        -DfeatureImportanceType="gain"
        -DtreeCount="5"
        -DmaxDepth="5"
        -Dshrinkage="0.3"
        -Dl2="1.0"
        -Dl1="0"
        -Dlifecycle="3"
        -DsketchEps="0.03"
        -DsampleRatio="1.0"
        -DfeatureRatio="1.0"
        -DbaseScore="0.5"
        -DminSplitLoss="0";
    
    # Prediction 
    PAI -name prediction
        -project algo_public
        -DinputTableName="smart_binary_input"
        -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
        -DoutputTableName="pai_temp_24515_545860_1"
        -DfeatureColNames="f0,f1,f2,f3,f4,f5"
        -DappendColNames="label,qid,f0,f1,f2,f3,f4,f5"
        -DenableSparse="false"
        -Dlifecycle="28";
    Module Parameter Required Description Default value
    Data parameters featureColNames Yes The feature columns that are selected from the input table for training. If data in the input table is in the dense format, only the columns of the BIGINT and DOUBLE types are supported. If data in the input table is sparse data in the key-value format, and keys and values are of numeric data types, only columns of the STRING data type are supported. N/A
    labelColName Yes The label column in the input table. The columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, column values can be 0 or 1 in binary classification. N/A
    weightCol No The column that contains the weight of each row of samples. The columns of numeric data types are supported. N/A
    enableSparse No Specifies whether the input data is in the sparse format. Valid values: true and false. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9. false
    inputTableName Yes The name of the input table. N/A
    modelName Yes The name of the output model. N/A
    outputImportanceTableName No The name of the table that provides feature importance. N/A
    inputTablePartitions No The partitions that are selected from the input table for training. Format: ds=1/pt=1. N/A
    outputTableName No The MaxCompute table that is generated to store the model. The table is in a binary format that cannot be read directly and can be used only by the PS-SMART prediction component. N/A
    lifecycle No The lifecycle of the output table. 3
    Algorithm parameters objective Yes The type of the objective function. Set the parameter to binary:logistic if the training is performed by using binary classification components. N/A
    metric No The evaluation metric type of the training set, which is contained in stdout of the coordinator in a logview. The following types are supported:
    • logloss: corresponds to the Negative Loglikelihood for Logistic Regression value of the Evaluation Indicator Type parameter in the console.
    • error: corresponds to the Binary Classification Error value of the Evaluation Indicator Type parameter in the console.
    • auc: corresponds to the AUC for Classification value of the Evaluation Indicator Type parameter in the console.
    N/A
    treeCount No The number of trees. The training time is proportional to this value. 1
    maxDepth No The maximum depth of a tree. Valid values: 1 to 20. 5
    sampleRatio No The data sampling ratio. Valid values: (0,1]. If this parameter is set to 1.0, data sampling is not performed. 1.0
    featureRatio No The feature sampling ratio. Valid values: (0,1]. If the parameter value is set to 1.0, feature sampling is not performed. 1.0
    l1 No The L1 penalty coefficient. A larger value of this parameter indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value. 0
    l2 No The L2 penalty coefficient. A larger value of this parameter indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value. 1.0
    shrinkage No The learning rate. Valid values: (0,1). 0.3
    sketchEps No The threshold for selecting quantiles when you build a sketch. The number of bins is O(1.0/sketchEps). A smaller value indicates that more bins can be obtained. In most cases, the default value is used. Valid values: (0,1). 0.03
    minSplitLoss No The minimum loss change required for splitting a node. A larger value indicates that node splitting is less likely to occur. 0
    featureNum No The number of features or the maximum feature ID. This value is used to assess resource usage. If it is not specified, the system automatically runs an SQL task to calculate the number of features or the maximum feature ID. N/A
    baseScore No The initial prediction values of all samples. 0.5
    featureImportanceType No The feature importance type. Valid values:
    • weight: indicates the number of splits of the feature.
    • gain: indicates the information gain provided by the feature.
    • cover: indicates the number of samples covered by the feature on the splitting node.
    gain
    Tuning parameters coreNum No The number of cores used in computing. A larger value makes the algorithm run faster. Determined by the system
    memSizePerCore No The memory size of each core. Unit: MB. Determined by the system
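The sketchEps parameter controls the granularity of the histogram approximation: the number of candidate split points per feature is on the order of 1/sketchEps, so the default 0.03 yields roughly 33 candidates. The Python sketch below computes such candidates on a single machine by exact sorting; the real component uses a distributed quantile sketch instead, and `quantile_candidates` is a hypothetical helper for illustration only.

```python
import random

def quantile_candidates(values, sketch_eps=0.03):
    """Return approximate split candidates, one per quantile step of width sketch_eps.

    With n values and a step of n * sketch_eps, this yields about 1 / sketch_eps
    candidates. PS-SMART builds the same kind of candidate set with a distributed
    quantile sketch; exact sorting works only when the data fits in memory.
    """
    ordered = sorted(values)
    n = len(ordered)
    step = max(1, int(n * sketch_eps))
    return ordered[step - 1 :: step]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]
candidates = quantile_candidates(data, 0.03)
print(len(candidates))  # about 1 / 0.03, i.e. 33 candidates
```

A smaller sketchEps therefore produces more bins and a finer (but slower) search for split points, which matches the parameter description above.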

Example

  1. Execute the following SQL statements to generate training data. In this example, training data in the dense format is generated.
    drop table if exists smart_binary_input;
    create table smart_binary_input lifecycle 3 as
    select
    *
    from
    (
    select 0.72 as f0, 0.42 as f1, 0.55 as f2, -0.09 as f3, 1.79 as f4, -1.2 as f5, 0 as label from dual
    union all
    select 1.23 as f0, -0.33 as f1, -1.55 as f2, 0.92 as f3, -0.04 as f4, -0.1 as f5, 1 as label from dual
    union all
    select -0.2 as f0, -0.55 as f1, -1.28 as f2, 0.48 as f3, -1.7 as f4, 1.13 as f5, 1 as label from dual
    union all
    select 1.24 as f0, -0.68 as f1, 1.82 as f2, 1.57 as f3, 1.18 as f4, 0.2 as f5, 0 as label from dual
    union all
    select -0.85 as f0, 0.19 as f1, -0.06 as f2, -0.55 as f3, 0.31 as f4, 0.08 as f5, 1 as label from dual
    union all
    select 0.58 as f0, -1.39 as f1, 0.05 as f2, 2.18 as f3, -0.02 as f4, 1.71 as f5, 0 as label from dual
    union all
    select -0.48 as f0, 0.79 as f1, 2.52 as f2, -1.19 as f3, 0.9 as f4, -1.04 as f5, 1 as label from dual
    union all
    select 1.02 as f0, -0.88 as f1, 0.82 as f2, 1.82 as f3, 1.55 as f4, 0.53 as f5, 0 as label from dual
    union all
    select 1.19 as f0, -1.18 as f1, -1.1 as f2, 2.26 as f3, 1.22 as f4, 0.92 as f5, 0 as label from dual
    union all
    select -2.78 as f0, 2.33 as f1, 1.18 as f2, -4.5 as f3, -1.31 as f4, -1.8 as f5, 1 as label from dual
    ) tmp;
    The generated training data is shown in the following figure (Input data).
  2. Create the experiment shown in the following figure (Experiment of PS-SMART Binary Classification Training). For more information, see Generate a model by using an algorithm.
  3. Configure the parameters listed in the following table for the PS-SMART Binary Classification component. Retain the default values of the parameters that are not listed in the table.
    Tab Parameter Description
    Fields Setting Feature Columns The feature columns. Select the f0, f1, f2, f3, f4, and f5 columns.
    Label Column Select the label column.
    Parameters Setting Evaluation Indicator Type The evaluation metric type. Set the parameter to AUC for Classification.
    Trees Set the parameter to 5.
  4. View the prediction result of the unified prediction component. In the prediction_detail column, 1 indicates a positive example, and 0 indicates a negative example.
  5. View the prediction result of the PS-SMART prediction component. The result contains the following columns:
    • prediction_score: the probability that the sample is a positive example. If the value is greater than 0.5, the prediction result is a positive example. Otherwise, the prediction result is a negative example.
    • leaf_index: the IDs of the leaf nodes used for prediction. Each sample has one number per tree, and each number indicates the ID of the leaf node onto which the sample falls in that tree.
    Note The PS-SMART prediction component requires a STRING-type column as the label column. The values of the column cannot be empty or NULL. You can use the Data Type Conversion component to convert the column to the STRING type.
  6. Right-click the PS-SMART Binary Classification Training component and choose View Data > View Output Port 3. Then, view the feature importance table. The feature importance result contains the following columns:
    • id: the ID of an input feature. In this example, the f0, f1, f2, f3, f4, and f5 features are passed in order. Therefore, in the id column, 0 represents feature column f0, and 4 represents feature column f4. If data in the input table is key-value pairs, the id column lists the keys in the key-value pairs.
    • value: the feature importance value. The default importance type is gain, which indicates the sum of the information gains that the feature provides to the model.
    Note: The preceding feature importance table contains only three features, which indicates that only these three features are used for tree splitting. The feature importance of the other features can be considered 0.
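The three importance types (weight, gain, and cover) can be illustrated by aggregating per-split statistics. In the Python sketch below, the split records are invented toy data and `feature_importance` is a hypothetical helper, but the aggregation mirrors the definitions above: weight counts splits, gain sums information gain, and cover sums covered samples.

```python
from collections import defaultdict

# Hypothetical split records: (feature_id, information_gain, samples_covered).
# Feature 0 splits twice; features 4 and 5 split once each; other features never split.
splits = [
    (0, 1.8, 10), (0, 0.6, 4),
    (4, 2.5, 10),
    (5, 0.9, 6),
]

def feature_importance(splits, importance_type="gain"):
    """Aggregate per-split statistics into one of the three importance types."""
    totals = defaultdict(float)
    for feature_id, gain, cover in splits:
        if importance_type == "weight":
            totals[feature_id] += 1        # number of times the feature is used to split
        elif importance_type == "gain":
            totals[feature_id] += gain     # total information gain from the feature
        elif importance_type == "cover":
            totals[feature_id] += cover    # samples covered at the feature's split nodes
    return dict(totals)

print(feature_importance(splits, "gain"))
print(feature_importance(splits, "weight"))
```

Features that never appear in a split record get no entry at all, which is why the feature importance table in the example above lists only three features: the importance of the others is effectively 0.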
Additional information:
  • Only columns of numeric data types can be used by the PS-SMART Binary Classification Training component. In the label column, 0 indicates a negative example and 1 indicates a positive example. If the data in the MaxCompute table is of the STRING type, you must convert it first. For example, you must convert Good/Bad labels to 1/0.
  • If data is in the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If the data type of feature IDs is STRING, you must use the serialization component to serialize the data. If feature values are categorical strings, you must perform feature engineering such as feature discretization to process the values.
  • The PS-SMART Binary Classification Training component supports tasks with hundreds of thousands of features, but such tasks are resource-intensive and time-consuming. GBDT algorithms are best suited to training with continuous features. You can perform one-hot encoding on categorical features to filter out low-frequency features. However, we recommend that you do not perform feature discretization on continuous features of numeric data types.
  • The PS-SMART algorithm introduces randomness in several places: data and feature sampling based on the sampleRatio and featureRatio parameters, histogram-based approximation, and the merging of local sketches into a global sketch. Tree structures differ when tasks run on multiple workers in distributed mode, although the training effect of the model is theoretically the same. Therefore, it is normal to obtain different results from the same data and parameters across training runs.
  • If you want to accelerate training, you can set the Cores parameter to a larger value. However, the PS-SMART algorithm starts training only after all requested resources are allocated. Therefore, the more resources you request, the longer you may wait for the task to start.
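The label conversion and feature-ID serialization described above can be sketched in Python. Both `encode_labels` and `serialize_feature_ids` are hypothetical helpers invented for this illustration, not PAI components, and the ID numbering scheme (first-seen order, starting at 1) is an assumption; the point is only the shape of the required transformations: STRING labels become 1/0 integers, and STRING feature names become positive-integer keys in the space-separated key:value format.

```python
def encode_labels(labels, positive="Good"):
    """Convert STRING labels to the 1/0 integers that PS-SMART requires."""
    return [1 if label == positive else 0 for label in labels]

def serialize_feature_ids(rows):
    """Map STRING feature names to positive integer IDs for key-value input.

    rows is a list of samples, each a list of (feature_name, value) pairs.
    Returns the rows in PS-SMART sparse format ("id:value" pairs separated by
    spaces) plus the name-to-ID mapping, which must be kept so that prediction
    data is serialized the same way.
    """
    ids = {}
    encoded = []
    for row in rows:
        pairs = []
        for name, value in row:
            if name not in ids:
                ids[name] = len(ids) + 1  # assign positive integer IDs in first-seen order
            pairs.append(f"{ids[name]}:{value}")
        encoded.append(" ".join(pairs))
    return encoded, ids

print(encode_labels(["Good", "Bad", "Good"]))  # [1, 0, 1]
rows = [[("age", 0.3), ("income", 0.9)], [("income", 0.5)]]
print(serialize_feature_ids(rows)[0])          # ['1:0.3 2:0.9', '2:0.5']
```

Note that missing features are simply absent from a sparse row, which is what makes the key-value format compact for high-dimensional data.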