Linear regression models the linear relationship between a dependent variable and one or more independent variables. A parameter server (PS) is used to run large-scale online and offline training tasks. PS Linear Regression supports large-scale linear training tasks with hundreds of billions of samples and billions of features.
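
For reference, a linear model and a regularized least-squares objective can be written as follows. This is a generic textbook formulation, not necessarily the exact objective that PS Linear Regression optimizes. The symbols (w for the weights, b for the intercept, n for the number of samples) are notation introduced here, and mapping λ1 and λ2 to the L1 and L2 regularization coefficients of the component is an assumption.

```latex
\hat{y} = w^\top x + b, \qquad
\min_{w,\,b}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(w^\top x_i + b - y_i\bigr)^2
  + \lambda_1 \lVert w \rVert_1
  + \frac{\lambda_2}{2}\lVert w \rVert_2^2
```

Under this formulation, a larger λ1 pushes more entries of w to exactly zero, and a larger λ2 shrinks the absolute values of all entries of w.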

Configure the component

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Fields Setting | Feature Columns | The feature columns that are selected from the input table for training. |
    | Fields Setting | Label Column | The label column. The columns of the DOUBLE and BIGINT types are supported. |
    | Fields Setting | Use Sparse Format | Specifies whether the input data is in the sparse format. Data in the sparse format is presented by using key-value pairs. |
    | Fields Setting | KV Pair Delimiter | The delimiter that is used to separate key-value pairs. Spaces are used by default. |
    | Fields Setting | Key and Value Delimiter | The delimiter that is used to separate keys and values. Colons (:) are used by default. |
    | Parameters Setting | L1 Weight | The L1 regularization coefficient. The larger the value of this parameter is, the fewer non-zero elements the model has. If overfitting occurs, increase the parameter value. |
    | Parameters Setting | L2 Weight | The L2 regularization coefficient. The larger the value of this parameter is, the smaller the absolute values of the model parameters are. If overfitting occurs, increase the parameter value. |
    | Parameters Setting | Maximum Number of Iterations | The maximum number of iterations performed by the algorithm. If this parameter is set to 0, the number of iterations is unlimited. |
    | Parameters Setting | Minimum Convergence Deviation | The conditions for algorithm termination. |
    | Parameters Setting | Largest Feature ID | The largest feature ID or feature dimension. The value of this parameter can be greater than the actual value. If this parameter is not specified, the system automatically runs an SQL task to calculate the maximum feature ID or feature dimension. |
    | Tuning | Number of Cores | The number of cores used in computing. By default, the system determines the value. |
    | Tuning | Memory Size per Core | The memory size of each core. Unit: MB. By default, the system determines the value. |
  • Use commands
    # Training 
    PAI -name ps_linearregression
        -project algo_public
        -DinputTableName="lm_test_input"
        -DmodelName="linear_regression_model"
        -DlabelColName="label"
        -DfeatureColNames="features"
        -Dl1Weight=1.0
        -Dl2Weight=0.0
        -DmaxIter=100
        -Depsilon=1e-6
        -DenableSparse=true
    # Prediction 
    drop table if exists linear_regression_predict;
    PAI -name prediction
        -DmodelName="linear_regression_model"
        -DoutputTableName="linear_regression_predict"
        -DinputTableName="lm_test_input"
        -DappendColNames="label,features"
        -DfeatureColNames="features"
        -DenableSparse=true
    | Parameter | Required | Description | Default value |
    | --- | --- | --- | --- |
    | inputTableName | Yes | The name of the input table. | N/A |
    | modelName | Yes | The name of the output model. | N/A |
    | outputTableName | No | The name of the output model evaluation table. This parameter is required if the enableFitGoodness parameter is set to true. | N/A |
    | labelColName | Yes | The label column that is selected from the input table. The columns of the DOUBLE and BIGINT types are supported. | N/A |
    | featureColNames | Yes | The feature columns that are selected from the input table for training. If data in the input table is in the dense format, the columns of the DOUBLE and BIGINT types are supported. If the input data is in the sparse format, only the columns of the STRING type are supported. | N/A |
    | inputTablePartitions | No | The partitions that are selected from the input table for training. | N/A |
    | enableSparse | No | Specifies whether data in the input table is in the sparse format. Valid values: true and false. | false |
    | itemDelimiter | No | The delimiter that is used to separate key-value pairs. This parameter is valid if the enableSparse parameter is set to true. | Space |
    | kvDelimiter | No | The delimiter that is used to separate keys and values. This parameter is valid if the enableSparse parameter is set to true. | : |
    | enableModelIo | No | Specifies whether the model is generated as an offline model. If the enableModelIo parameter is set to false, the model is generated in a MaxCompute table. Valid values: true and false. | true |
    | maxIter | No | The maximum number of iterations performed by the algorithm. The value of this parameter must be a non-negative integer. | 100 |
    | epsilon | No | The conditions for algorithm termination. Valid values: [0,1]. | 0.000001 |
    | l1Weight | No | The L1 regularization coefficient. The larger the value of this parameter is, the fewer non-zero elements the model has. If overfitting occurs, increase the parameter value. | 1.0 |
    | l2Weight | No | The L2 regularization coefficient. The larger the value of this parameter is, the smaller the absolute values of the model parameters are. If overfitting occurs, increase the parameter value. | 0 |
    | modelSize | No | The largest feature ID or feature dimension. The value of this parameter can be greater than the actual value. If this parameter is not specified, the system automatically runs an SQL task to calculate the maximum feature ID or feature dimension. The value of this parameter must be a non-negative integer. | 0 |
    | coreNum | No | The number of cores used in computing. | Determined by the system |
    | memSizePerCore | No | The memory size of each core. Unit: MB. | Determined by the system |
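
The roles of l1Weight, l2Weight, maxIter, and epsilon can be illustrated with a small local sketch. The Python code below is not the PS Linear Regression implementation. It is a minimal proximal gradient descent loop under assumed conventions: the objective scaling, the fixed step size, the stopping rule (change in objective smaller than epsilon), the toy data, and the names fit, soft_threshold, and objective are all hypothetical.

```python
# Toy dense data (hypothetical): the label depends mostly on the first feature.
ROWS = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.2], [4.0, -0.1]]
LABELS = [2.1, 3.9, 6.2, 7.8]


def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal operator of t * |v|)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def predict(w, b, x):
    """Linear model output for one dense row."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def objective(rows, labels, w, b, l1_weight, l2_weight):
    """Assumed objective: mean squared error / 2 + L1 penalty + L2 penalty / 2."""
    n = len(rows)
    squared = sum((predict(w, b, x) - y) ** 2 for x, y in zip(rows, labels))
    l1 = l1_weight * sum(abs(wj) for wj in w)
    l2 = 0.5 * l2_weight * sum(wj * wj for wj in w)
    return squared / (2 * n) + l1 + l2


def fit(rows, labels, l1_weight=1.0, l2_weight=0.0, max_iter=100,
        epsilon=1e-6, step=0.05):
    """Proximal gradient descent; the intercept b is not regularized."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    previous = objective(rows, labels, w, b, l1_weight, l2_weight)
    for _ in range(max_iter):
        residuals = [predict(w, b, x) - y for x, y in zip(rows, labels)]
        # Gradient of the smooth part (squared error plus L2 penalty).
        grad_w = [
            sum(r * x[j] for r, x in zip(residuals, rows)) / n + l2_weight * w[j]
            for j in range(d)
        ]
        grad_b = sum(residuals) / n
        # Gradient step, then soft-thresholding to apply the L1 penalty.
        w = [soft_threshold(wj - step * gj, step * l1_weight)
             for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
        current = objective(rows, labels, w, b, l1_weight, l2_weight)
        if abs(previous - current) < epsilon:  # assumed termination condition
            break
        previous = current
    return w, b


if __name__ == "__main__":
    for l1 in (0.0, 1.0, 100.0):
        weights, intercept = fit(ROWS, LABELS, l1_weight=l1)
        non_zero = sum(1 for v in weights if v != 0.0)
        print(f"l1_weight={l1}: non-zero weights={non_zero}")
```

In this sketch, a sufficiently large l1_weight keeps every weight at exactly zero, which is consistent with the description that a larger L1 coefficient yields fewer non-zero elements in the model.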

Example

  1. Execute the following SQL statements to generate input data. In this example, input data in the key-value format is generated.
    drop table if exists lm_test_input;
    create table lm_test_input as
    select
    *
    from
    (
    select 2 as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features from dual
        union all
    select 1 as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features from dual
        union all
    select 1 as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features from dual
        union all
    select 2 as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features from dual
        union all
    select 1 as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features from dual
        union all
    select 1 as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features from dual
        union all
    select 0 as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features from dual
        union all
    select 1 as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features from dual
        union all
    select 0 as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features from dual
        union all
    select 1 as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features from dual
    ) tmp;
    The generated data is shown in the following figure.
    Note: If data is in the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If the data type of feature IDs is STRING, you must use the serialization component to serialize the data. If feature values are categorical strings, you must perform feature discretization to process the features.
  2. Create the experiment shown in the following figure. For more information, see Generate a model by using an algorithm.
    Figure: Experiment of PS Linear Regression
  3. Configure the parameters listed in the following table for the PS Linear Regression component. Retain the default values of the parameters that are not listed in the table.
    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Fields Setting | Use Sparse Format | Set the parameter to true. |
    | Fields Setting | Feature Columns | Select feature columns. |
    | Fields Setting | Label Column | Select the label column. |
    | Tuning | Number of Cores | Set the parameter to 3. |
    | Tuning | Memory Size per Core | Set the parameter to 1024. Unit: MB. |
  4. Configure the parameters listed in the following table for the Prediction component. Retain the default values of the parameters that are not listed in the table.
    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Fields Setting | Feature Columns | By default, all columns in the input table are selected. Columns that were not used for training do not affect the prediction result. |
    | Fields Setting | Reserved Output Column | Select the label column. |
    | Fields Setting | Sparse Matrix | Select Sparse Matrix. |
    | Fields Setting | KV Delimiter | Use colons (:). |
    | Fields Setting | KV Pair Delimiter | Use \u0020 (a space). |
  5. Run the experiment and view the prediction results.
    Figure: Prediction results of PS Linear Regression
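
The key-value rows generated in step 1 can be sanity-checked locally before they are uploaded. The sketch below is a hypothetical helper, not part of PAI: parse_sparse_row and largest_feature_id are names introduced here. It splits a row on the key-value pair delimiter and the key-value delimiter, enforces the rule from the note in step 1 that feature IDs are positive integers and feature values are real numbers, and reports the largest feature ID, which is the kind of value that modelSize describes.

```python
# Two of the rows from step 1, in the default sparse format
# (space between pairs, colon between key and value).
SAMPLE_ROWS = [
    "1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17",
    "1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41",
]


def parse_sparse_row(text, item_delimiter=" ", kv_delimiter=":"):
    """Parse one sparse row into a {feature_id: value} dictionary.

    Raises ValueError if a pair has no key-value delimiter, if a feature ID
    is not a positive integer, or if a value is not a number.
    """
    features = {}
    for item in text.split(item_delimiter):
        if not item:  # tolerate repeated delimiters
            continue
        key, separator, value = item.partition(kv_delimiter)
        if not separator:
            raise ValueError(f"missing {kv_delimiter!r} in pair {item!r}")
        feature_id = int(key)  # ValueError for non-integer IDs
        if feature_id <= 0:
            raise ValueError(f"feature ID must be positive: {feature_id}")
        features[feature_id] = float(value)  # ValueError for non-numeric values
    return features


def largest_feature_id(rows, item_delimiter=" ", kv_delimiter=":"):
    """Return the largest feature ID that appears in any non-empty row."""
    parsed = (parse_sparse_row(r, item_delimiter, kv_delimiter) for r in rows)
    return max(max(p) for p in parsed if p)


if __name__ == "__main__":
    for row in SAMPLE_ROWS:
        print(parse_sparse_row(row))
    print("largest feature ID:", largest_feature_id(SAMPLE_ROWS))
```

A row whose feature IDs are strings, such as "a:1", fails this check with a ValueError. That is the case in which the note in step 1 calls for serializing the data first.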