Run PS-SMART Distributed Regression Training on Billions of Samples - Platform For AI

Train regression models on datasets with billions of samples and hundreds of thousands of features using the PS-SMART algorithm.

Limitations

Input data requirements:

Target column must be numeric. Convert STRING types in MaxCompute tables before training.
For key-value format data, feature IDs must be positive integers and feature values must be real numbers. Use a serialization component for STRING feature IDs or perform feature engineering (discretization) for categorical strings.
Tasks with hundreds of thousands of features consume large amounts of resources and run slowly. For such tasks, consider GBDT algorithms, which train directly on continuous features. One-hot encode categorical features and filter out low-frequency values, but avoid discretizing continuous numeric features.
The algorithm introduces randomness through data sampling (data_sample_ratio), feature sampling (fea_sample_ratio), histogram approximation, and sketch merging. Tree structures may differ across distributed workers, but model performance remains similar. Inconsistent results across runs with identical data and parameters are expected.
Increase computing cores to accelerate training. Training starts only after all resources are allocated. Requesting more resources increases waiting time when the cluster is busy.
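The key-value rules above can be checked before a table is loaded. The following Python sketch is illustrative only: `parse_sparse_row` and `assign_feature_ids` are hypothetical helper names, not part of PAI, and the ID mapping is only a stand-in for what a serialization component would do.

```python
import math


def parse_sparse_row(row):
    """Parse 'id:value id:value ...' into a dict, enforcing the stated rules.

    Feature IDs must be positive integers and values must be real numbers.
    """
    features = {}
    for pair in row.split():
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in pair {pair!r}")
        if not (key.isascii() and key.isdigit()) or int(key) <= 0:
            raise ValueError(f"feature ID must be a positive integer: {key!r}")
        number = float(value)  # raises ValueError for non-numeric text
        if not math.isfinite(number):
            raise ValueError(f"feature value must be a finite number: {value!r}")
        features[int(key)] = number
    return features


def assign_feature_ids(names):
    """Map STRING feature names to positive integer IDs (1, 2, 3, ...).

    Hypothetical stand-in for a serialization step; IDs follow first appearance.
    """
    ids = {}
    for name in names:
        ids.setdefault(name, len(ids) + 1)
    return ids


print(parse_sparse_row("1:0.3 3:0.9"))              # {1: 0.3, 3: 0.9}
print(assign_feature_ids(["city", "age", "city"]))  # {'city': 1, 'age': 2}
```

Rows that fail the check raise ValueError, so bad feature IDs are caught before a training job is submitted.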

Configuration

Visual interface

Add the PS-SMART Regression component to the Designer canvas and configure parameters in the right pane.

Parameter type	Parameter	Description
Fields setting	Is sparse format	Whether the input data is in sparse key-value format. In this format, key-value pairs are separated by spaces, and each key is separated from its value by a colon (:). Example: 1:0.3 3:0.9.
	Select feature columns	Feature columns from the input table for training. For dense format, select only BIGINT or DOUBLE columns. For sparse key-value format with numeric keys and values, select only STRING columns.
	Select label column	Label column from the input table. The column type can be STRING or numeric, but the stored values must be numeric, such as 0 and 1 for binary classification.
	Select weight column	Column for weighting each sample row. Supports numeric types only.
Parameters setting	Objective function type	Available types: Linear regression (default), Logistic regression, Poisson regression, Gamma regression, Tweedie regression.
	Tweedie distribution index	Available only when Objective function type is Tweedie regression. Specifies the index of the relationship between variance and mean of the Tweedie distribution.
	Evaluation metric type	Available types: rooted mean square error, mean absolute error, negative loglikelihood for logistic regression, negative loglikelihood for poisson regression, residual deviance for gamma regression, negative log-likelihood for gamma regression, negative log-likelihood for Tweedie regression, None.
	Number of trees	Number of trees. Must be a positive integer. Training time increases proportionally with tree count.
	Maximum tree depth	Maximum depth of each tree. Default: 5, which allows at most 32 leaf nodes per tree.
	Data sampling ratio	When building each tree, samples a portion of data to build a weak learner, accelerating training.
	Feature sampling ratio	When building each tree, samples a portion of features to build a weak learner, accelerating training.
	L1 penalty coefficient	Controls leaf node size. Larger values produce more uniform distribution. Increase to reduce overfitting.
	L2 penalty coefficient	Controls leaf node size. Larger values produce more uniform distribution. Increase to reduce overfitting.
	Learning rate	Range: (0,1).
	Approximate sketch precision	Quantile threshold for splitting when constructing a sketch. Smaller values produce more buckets. Use default value 0.03.
	Minimum split loss change	Minimum loss change required to split a node. Larger values produce more conservative splits.
	Number of features	Number of features or maximum feature ID. If not configured, the system automatically calculates the value by starting an SQL task.
	Global bias	Initial prediction value for all samples.
	Random number generator seed	Seed for the random number generator. Must be an integer.
	Feature importance type	Available types: Number of times the feature is used as a split feature in the model; Information gain brought by the feature in the model (default); Number of samples covered by the feature at split nodes in the model.
Execution tuning	Number of cores	System-allocated by default.
	Memory per core (MB)	Memory per core in MB. System-allocated by default.
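The leaf-node figure quoted for Maximum tree depth follows from binary splitting: each extra level can at most double the number of leaves. The short Python sketch below shows the arithmetic. It assumes binary trees, which matches the 5-to-32 figure in the table, and the function names are hypothetical.

```python
def max_leaf_nodes(max_depth):
    """Upper bound on leaves of one binary tree with the given depth."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    return 2 ** max_depth


def max_total_leaves(tree_count, max_depth):
    """Upper bound on leaves across the whole ensemble."""
    return tree_count * max_leaf_nodes(max_depth)


print(max_leaf_nodes(5))       # 32
print(max_total_leaves(5, 5))  # 160
```

These are upper bounds only. Regularization and the minimum split loss change can stop a tree from growing to its full depth.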

PAI commands

Configure parameters for the PS-SMART Regression component by calling PAI commands from the SQL script component. For more information, see SQL Script.

-- Train the model.
PAI -name ps_smart
    -project algo_public
    -DinputTableName="smart_regression_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545859_2"
    -DoutputImportanceTableName="pai_temp_24515_545859_3"
    -DlabelColName="label"
    -DfeatureColNames="features"
    -DenableSparse="true"
    -Dobjective="reg:linear"
    -Dmetric="rmse"
    -DfeatureImportanceType="gain"
    -DtreeCount="5"
    -DmaxDepth="5"
    -Dshrinkage="0.3"
    -Dl2="1.0"
    -Dl1="0"
    -Dlifecycle="3"
    -DsketchEps="0.03"
    -DsampleRatio="1.0"
    -DfeatureRatio="1.0"
    -DbaseScore="0.5"
    -DminSplitLoss="0";
-- Make predictions.
PAI -name prediction
    -project algo_public
    -DinputTableName="smart_regression_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545860_1"
    -DfeatureColNames="features"
    -DappendColNames="label,features"
    -DenableSparse="true"
    -Dlifecycle="28";
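When these commands are generated by a script rather than typed by hand, a small helper keeps the option formatting consistent. The Python sketch below is an assumption-laden illustration: `build_pai_command` is a hypothetical name, and it assumes that a PAI command is one statement terminated by a single semicolon after the last -D option.

```python
def build_pai_command(name, params, project="algo_public"):
    """Render a PAI command: one -D option per line, one trailing semicolon."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        text = str(value)
        # A stray semicolon or quote inside a value would break the statement.
        if ";" in text or '"' in text:
            raise ValueError(f"invalid character in value for {key}: {text!r}")
        lines.append(f'    -D{key}="{text}"')
    return "\n".join(lines) + ";"


prediction_params = {
    "inputTableName": "smart_regression_input",
    "modelName": "xlab_m_pai_ps_smart_bi_545859_v0",
    "outputTableName": "pai_temp_24515_545860_1",
    "featureColNames": "features",
    "appendColNames": "label,features",
    "enableSparse": "true",
    "lifecycle": 28,
}
print(build_pai_command("prediction", prediction_params))
```

The helper only builds the command text. Submitting it still goes through the SQL script component or another MaxCompute client.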

Parameter type	Parameter	Required	Default	Description
Data parameters	featureColNames	Yes	None	Feature columns from the input table for training. For dense format, select only BIGINT or DOUBLE columns. For sparse key-value format with numeric keys and values, select only STRING columns.
	labelColName	Yes	None	Label column from the input table. The column type can be STRING or numeric, but the stored values must be numeric, such as 0 and 1 for binary classification.
	weightCol	No	None	Column for weighting each sample row. Supports numeric types only.
	enableSparse	No	false	Whether the input data is in sparse format. Valid values: true and false. In sparse format, key-value pairs are separated by spaces, and each key is separated from its value by a colon (:). Example: 1:0.3 3:0.9.
	inputTableName	Yes	None	Name of the input table.
	modelName	Yes	None	Name of the output model.
	outputImportanceTableName	No	None	Name of the output table containing feature importance information.
	inputTablePartitions	No	None	Format: ds=1/pt=1.
	outputTableName	No	None	Output table in MaxCompute, in binary format.
	lifecycle	No	3	Output table lifecycle in days.
Algorithm parameters	objective	Yes	reg:linear	Objective function type. Available types: reg:linear (linear regression), reg:logistic (logistic regression), count:poisson (Poisson regression), reg:gamma (gamma regression), reg:tweedie (Tweedie regression).
	metric	No	None	Evaluation metric type for the training dataset. Output is written to the stdout file in the Logview coordinator area. Available types: rmse (root mean square error), mae (mean absolute error), logistic-nloglik (negative log-likelihood for logistic regression), poisson-nloglik (negative log-likelihood for Poisson regression), gamma-deviance (residual deviance for gamma regression), gamma-nloglik (negative log-likelihood for gamma regression), tweedie-nloglik (negative log-likelihood for Tweedie regression).
	treeCount	No	1	Number of trees. Training time increases proportionally with this value.
	maxDepth	No	5	Maximum depth of a tree. Range: 1 to 20.
	sampleRatio	No	1.0	Data sampling ratio. Range: (0,1]. Value of 1.0 indicates no sampling.
	featureRatio	No	1.0	Feature sampling ratio. Range: (0,1]. Value of 1.0 indicates no sampling.
	l1	No	0	L1 penalty coefficient. Larger values produce more uniform leaf node distribution. Increase to reduce overfitting.
	l2	No	1.0	L2 penalty coefficient. Larger values produce more uniform leaf node distribution. Increase to reduce overfitting.
	shrinkage	No	0.3	Learning rate. Range: (0,1).
	sketchEps	No	0.03	Quantile threshold for splitting when constructing a sketch. The number of buckets is O(1.0/sketchEps). Smaller values produce more buckets. Use default value. Range: (0,1).
	minSplitLoss	No	0	Minimum loss change required to split a node. Larger values produce more conservative splits.
	featureNum	No	None	Number of features or maximum feature ID. If this parameter is not configured when estimating resource usage, the system automatically calculates the value by starting an SQL task.
	baseScore	No	0.5	Initial prediction value for all samples.
	randSeed	No	None	Seed for the random number generator. Must be an integer.
	featureImportanceType	No	gain	Method to calculate feature importance. Available methods: weight: Number of times the feature is used as a split feature in the model. gain: Information gain brought by the feature in the model. cover: Number of samples covered by the feature at split nodes in the model.
	tweedieVarPower	No	1.5	Index of the relationship between the variance and mean of the Tweedie distribution.
Tuning parameters	coreNum	No	System allocated	Number of cores. Larger values accelerate the algorithm.
		No	System allocated	Memory per core in MB.
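The ranges in the table above can be checked on the client side before a job is queued. The Python sketch below is a minimal illustration with hypothetical names (`DEFAULTS`, `validate_params`, `approx_bucket_count`). It covers only the parameters whose ranges are stated in this document, and the bucket estimate is simply the 1.0/sketchEps order of magnitude, not an exact count.

```python
import math

# Defaults copied from the parameter table.
DEFAULTS = {
    "treeCount": 1,
    "maxDepth": 5,
    "sampleRatio": 1.0,
    "featureRatio": 1.0,
    "shrinkage": 0.3,
    "sketchEps": 0.03,
}


def validate_params(overrides):
    """Return a list of range violations; an empty list means all checks pass."""
    p = {**DEFAULTS, **overrides}
    problems = []
    if not (isinstance(p["treeCount"], int) and p["treeCount"] >= 1):
        problems.append("treeCount must be a positive integer")
    if not (isinstance(p["maxDepth"], int) and 1 <= p["maxDepth"] <= 20):
        problems.append("maxDepth must be an integer from 1 to 20")
    for name in ("sampleRatio", "featureRatio"):
        if not 0 < p[name] <= 1:
            problems.append(f"{name} must be in (0,1]")
    for name in ("shrinkage", "sketchEps"):
        if not 0 < p[name] < 1:
            problems.append(f"{name} must be in (0,1)")
    return problems


def approx_bucket_count(sketch_eps):
    """Rough bucket count implied by O(1.0/sketchEps)."""
    return math.ceil(1.0 / sketch_eps)


print(validate_params({}))                                 # []
print(validate_params({"maxDepth": 32, "shrinkage": 1.0}))
print(approx_bucket_count(0.03))                           # 34
```

Halving sketchEps roughly doubles the bucket estimate, which is why the default is recommended unless finer splits are clearly needed.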

Example

Use the ODPS SQL node to run the SQL statement below and generate input data. This example uses key-value format.

drop table if exists smart_regression_input;
create table smart_regression_input as
select
*
from
(
select 2.0 as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features
    union all
select 1.0 as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features
    union all
select 1.0 as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features
    union all
select 2.0 as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features
    union all
select 1.0 as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features
    union all
select 1.0 as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features
    union all
select 0.0 as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features
    union all
select 1.0 as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features
    union all
select 0.0 as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features
    union all
select 1.0 as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features
) tmp;

The generated table contains ten rows, each with a numeric label column and a features column in key-value format.
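For sparse data such as this, the Number of features setting (featureNum) corresponds to the largest feature ID in the table. The Python sketch below derives that value from a few of the sample rows. `max_feature_id` is a hypothetical helper, and in practice the system can compute the value itself when the parameter is left unset.

```python
# Three of the sample rows from the SQL statement above (label, features).
sample_rows = [
    (2.0, "1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17"),
    (1.0, "1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41"),
    (0.0, "1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30"),
]


def max_feature_id(feature_strings):
    """Largest key across all 'id:value' pairs in the given rows."""
    return max(
        int(pair.split(":", 1)[0])
        for features in feature_strings
        for pair in features.split()
    )


feature_num = max_feature_id(features for _, features in sample_rows)
print(feature_num)  # 5
```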

Build a workflow with the following steps and run the components. For more information, see Algorithm modeling.

In the component list on the left of the Designer canvas, search for Read Table, PS-SMART Regression, Prediction, and Write Table components, then drag them to the canvas.

Configure component parameters.

Click the Read Table-1 component on the canvas. On the Select Table tab in the right pane, set Table Name to smart_regression_input.

Click the PS-SMART Regression-1 component on the canvas. In the right pane, configure parameters as shown in the table below. Use default values for remaining parameters.

Parameter type	Parameter	Description
Fields setting	Is sparse format	Select the Is sparse format check box.
	Feature columns	Select the features column.
	Label column	Select the label column.
Parameters setting	Objective function type	Set to Linear regression.
	Evaluation metric type	Select rooted mean square error.
	Number of trees	Set to 5.

Click the Prediction-1 component on the canvas. In the right pane, configure parameters as shown in the table below. Use default values for remaining parameters.

Parameter type	Parameter	Description
Fields setting	Feature columns	By default, all columns are selected. Extra columns do not affect prediction results.
	Pass-through columns	Select the label column.
	Sparse matrix	Select the Sparse matrix check box.
	Key-value separator	Set to a colon (:).
	Separator between key-value pairs	Set to a space.

Click the Write Table-1 component on the canvas. In the right pane, on the Select Table tab, set Table Name for Output to smart_regression_output.

After configuring parameters, click the run button to run the workflow.

To view the prediction results, right-click the Prediction-1 component and choose View Data > Prediction Result Output.
To view feature importance, right-click the PS-SMART Regression-1 component and select View Data > Output Feature Importance Table from the shortcut menu.

The id column indicates the ordinal number of the input feature. Because the input data in this example is in key-value format, the id column holds the key of the key-value pair. The feature importance table contains only two features, which means only these two features were used to split trees; all other features have an importance of 0. The value column holds the importance score for the configured feature importance type. The default type is gain, which is the sum of the information gain that the feature contributes to the model.
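The three feature importance types differ only in what is accumulated per split. The Python sketch below illustrates this on made-up split records. The record layout, the numbers, and the `feature_importance` function are all hypothetical and do not reflect how PS-SMART stores its trees internally.

```python
from collections import defaultdict

# Hypothetical split nodes: which feature was used, the gain of the split,
# and how many samples reached the node.
splits = [
    {"feature": 1, "gain": 2.0, "cover": 10},
    {"feature": 1, "gain": 0.5, "cover": 4},
    {"feature": 4, "gain": 1.0, "cover": 6},
]


def feature_importance(split_nodes, kind="gain"):
    """Aggregate per-feature importance as weight, gain, or cover."""
    if kind not in ("weight", "gain", "cover"):
        raise ValueError(f"unknown importance type: {kind}")
    scores = defaultdict(float)
    for node in split_nodes:
        # weight counts splits; gain and cover sum the recorded quantity.
        scores[node["feature"]] += 1.0 if kind == "weight" else node[kind]
    return dict(scores)


print(feature_importance(splits, "gain"))    # {1: 2.5, 4: 1.0}
print(feature_importance(splits, "weight"))  # {1: 2.0, 4: 1.0}
print(feature_importance(splits, "cover"))   # {1: 14.0, 4: 6.0}
```

Features that never appear in a split node get no entry, which is consistent with the table listing only the features actually used for splitting.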

Model deployment

To deploy the model generated by the PS-SMART component as an online service, add the General-purpose Model Export component downstream of the PS-SMART component. Configure component parameters the same way as other PS-series components. For more information, see General-purpose Model Export.

Upon successful execution, deploy the model service on the PAI-EAS Model Online Service page. For more information, see Deploy a service in the console.

References

For more information about Designer components, see Designer overview.
Designer provides a variety of algorithm components. Select appropriate components to process data for different scenarios. For more information, see Designer component overview.