This topic describes how to train an XGBoost model and use it to make predictions with MaxCompute.
Key concepts
XGBoost (eXtreme Gradient Boosting) is an ensemble learning algorithm that uses a gradient boosting framework. It iteratively builds decision tree models, uses second-order derivatives to optimize the objective function, and gradually corrects prediction errors. XGBoost is designed to improve the accuracy of classification and regression tasks on structured data. Its core idea is to combine the prediction results of multiple decision trees into an additive model. This process iteratively minimizes the loss function to efficiently model complex, non-linear relationships.
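The additive-model idea can be illustrated with a minimal local sketch that uses the open source xgboost package on synthetic data. This example is illustrative only and does not involve MaxCompute or MaxFrame; the feature and target values are made up for demonstration:

# A minimal local sketch of gradient-boosted regression trees (illustrative only).
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))                          # 200 samples, 3 synthetic features
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 200)    # non-linear target

# Each of the 50 trees corrects the residual errors of the trees trained before it.
model = XGBRegressor(n_estimators=50, learning_rate=0.1, objective="reg:squarederror")
model.fit(X, y)
print(model.predict(X[:3]))    # each prediction is the sum of the outputs of all trees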
Scope
This scenario uses features from the upgraded MaxFrame engine. These features are currently in invitational preview. The sample code in this topic is for reference only. To try these features, you can submit a ticket.
Benefits
In real-world business scenarios, cross-team collaboration is common. Algorithm teams often prefer to use the Python programming language for model training. In contrast, business analytics teams tend to use SQL for data analytics. MaxCompute allows the MaxFrame and SQL engines to use the same model object. This reduces integration costs for cross-language and cross-technology stack collaboration. It also allows users to flexibly apply MaxCompute's AI capabilities to business analytics.
Dataset overview
This topic uses the classic Boston housing dataset. This dataset contains information about housing conditions in the Boston area during the 1970s. The training dataset contains 500 samples, with 12 features and one target variable. The prediction dataset contains 6 samples with the same 12 features as the training dataset.
The following table describes the fields in the dataset.
| No. | Field name | Type | Meaning | Description |
| --- | --- | --- | --- | --- |
| 1 | CRIM | Float | Per capita crime rate by town | This field indicates the per capita crime rate for each town. Areas with higher crime rates generally have lower housing prices. |
| 2 | ZN | Float | Proportion of residential land zoned for lots >25,000 sq. ft. | This field indicates the proportion of residential land in a town that is zoned for lots larger than 25,000 sq. ft. |
| 3 | INDUS | Float | Proportion of non-retail business acres per town | This field represents the proportion of land in a town used for non-retail businesses, such as factories and warehouses. |
| 4 | CHAS | Integer | Bounds river (1=yes, 0=no) | This is a binary field that indicates whether the area is adjacent to the Charles River. |
| 5 | NOX | Float | Nitric oxides concentration (ppm) | This field indicates the concentration of nitric oxides in the air, measured in parts per million (ppm). |
| 6 | RM | Float | Average number of rooms per dwelling | This field represents the average number of rooms in a dwelling. Dwellings with more rooms are generally more expensive. |
| 7 | AGE | Float | Proportion of owner-occupied units built before 1940 | This field indicates the proportion of owner-occupied homes built before 1940. |
| 8 | DIS | Float | Weighted distances to five Boston employment centers | This field indicates the weighted distance to five major employment centers in Boston. Areas closer to these centers generally have higher housing prices. |
| 9 | RAD | Integer | Index of accessibility to radial highways | This field indicates the accessibility index for radial highways. A higher index means better accessibility. |
| 10 | TAX | Float | Full-value property-tax rate per USD 10,000 | This field indicates the tax amount per USD 10,000 of property value. Higher tax rates can negatively affect housing prices. |
| 11 | PTRATIO | Float | Pupil-teacher ratio by town | This field indicates the ratio of students to teachers in public schools in a town. A higher ratio may suggest that educational resources are relatively scarce. |
| 12 | LSTAT | Float | Percentage of lower status of the population | This field indicates the percentage of the population that is considered lower status. Areas with a higher percentage of lower-status population generally have lower housing prices. |
| 13 | MEDV | Float | Median value of owner-occupied homes in thousands of USD | Target variable: This field indicates the median value of owner-occupied homes in the area, in thousands of USD. This is the value that the model needs to predict. |
Prerequisites
Activate MaxCompute, create a MaxCompute project, and ensure that schema syntax is enabled for the project.
Ensure that the latest versions of the MaxFrame, scikit-learn, and xgboost packages are installed in your local Python development environment. For more information, see Use MaxFrame in a local environment.
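To confirm that your local environment is ready, you can run a quick version check. This is a minimal sketch that only verifies that the three packages listed above can be imported; it assumes each package exposes a standard __version__ attribute:

# Quick check that the required packages are installed in the local Python environment.
import maxframe
import sklearn
import xgboost

print("maxframe:", maxframe.__version__)
print("scikit-learn:", sklearn.__version__)
print("xgboost:", xgboost.__version__)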
Prepare the data
Create the required tables and populate them with data:
-- Create a training data table.
CREATE TABLE IF NOT EXISTS demo_xgboost_train (
    CRIM    FLOAT COMMENT 'Per capita crime rate by town',
    ZN      FLOAT COMMENT 'Proportion of residential land zoned for lots >25,000 sq. ft.',
    INDUS   FLOAT COMMENT 'Proportion of non-retail business acres per town',
    CHAS    INT   COMMENT 'Bounds river (1=yes, 0=no)',
    NOX     FLOAT COMMENT 'Nitric oxides concentration (ppm)',
    RM      FLOAT COMMENT 'Average number of rooms per dwelling',
    AGE     FLOAT COMMENT 'Proportion of owner-occupied units built before 1940',
    DIS     FLOAT COMMENT 'Weighted distances to five Boston employment centers',
    RAD     FLOAT COMMENT 'Index of accessibility to radial highways',
    TAX     FLOAT COMMENT 'Full-value property-tax rate per USD 10,000',
    PTRATIO FLOAT COMMENT 'Pupil-teacher ratio by town',
    LSTAT   FLOAT COMMENT 'Percentage of lower status of the population',
    MEDV    FLOAT COMMENT 'Median value of owner-occupied homes in thousands of USD'
);

-- Create a prediction data table that is shared by SQL and MaxFrame.
CREATE TABLE IF NOT EXISTS demo_xgboost_predict (
    CRIM    FLOAT COMMENT 'Per capita crime rate by town',
    ZN      FLOAT COMMENT 'Proportion of residential land zoned for lots >25,000 sq. ft.',
    INDUS   FLOAT COMMENT 'Proportion of non-retail business acres per town',
    CHAS    INT   COMMENT 'Bounds river (1=yes, 0=no)',
    NOX     FLOAT COMMENT 'Nitric oxides concentration (ppm)',
    RM      FLOAT COMMENT 'Average number of rooms per dwelling',
    AGE     FLOAT COMMENT 'Proportion of owner-occupied units built before 1940',
    DIS     FLOAT COMMENT 'Weighted distances to five Boston employment centers',
    RAD     FLOAT COMMENT 'Index of accessibility to radial highways',
    TAX     FLOAT COMMENT 'Full-value property-tax rate per USD 10,000',
    PTRATIO FLOAT COMMENT 'Pupil-teacher ratio by town',
    LSTAT   FLOAT COMMENT 'Percentage of lower status of the population'
);

-- Create a table to store the MaxFrame prediction results.
CREATE TABLE IF NOT EXISTS demo_xgboost_predict_result (
    CRIM    DOUBLE COMMENT 'Per capita crime rate by town',
    ZN      DOUBLE COMMENT 'Proportion of residential land zoned for lots >25,000 sq. ft.',
    INDUS   DOUBLE COMMENT 'Proportion of non-retail business acres per town',
    CHAS    BIGINT COMMENT 'Bounds river (1=yes, 0=no)',
    NOX     DOUBLE COMMENT 'Nitric oxides concentration (ppm)',
    RM      DOUBLE COMMENT 'Average number of rooms per dwelling',
    AGE     DOUBLE COMMENT 'Proportion of owner-occupied units built before 1940',
    DIS     DOUBLE COMMENT 'Weighted distances to five Boston employment centers',
    RAD     DOUBLE COMMENT 'Index of accessibility to radial highways',
    TAX     DOUBLE COMMENT 'Full-value property-tax rate per USD 10,000',
    PTRATIO DOUBLE COMMENT 'Pupil-teacher ratio by town',
    LSTAT   DOUBLE COMMENT 'Percentage of lower status of the population',
    RESULT  DOUBLE COMMENT 'Predicted value of owner-occupied homes'
);

Upload the data by using Tunnel commands:
TUNNEL UPLOAD xgboost_train_data.csv demo_xgboost_train;
TUNNEL UPLOAD xgboost_predict.csv demo_xgboost_predict;
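After the upload completes, you can optionally verify the row counts from Python with PyODPS. This is a minimal sketch; it assumes the ODPS entry object `o` is configured as shown in the training code below, and the expected counts follow the dataset overview (500 training rows, 6 prediction rows):

# Optional check: confirm that the uploaded tables contain the expected number of rows.
for table_name, expected in [("demo_xgboost_train", 500), ("demo_xgboost_predict", 6)]:
    with o.execute_sql(f"SELECT COUNT(*) AS cnt FROM {table_name}").open_reader() as reader:
        for record in reader:
            print(table_name, record["cnt"], "rows (expected:", expected, ")")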
Train and save the model
Train an XGBoost regression model using MaxFrame.
## Session initialization
from odps import ODPS
import maxframe
from maxframe import options

# Replace the placeholders with your AccessKey pair, project name, and endpoint.
o = ODPS(
    'LT******************',
    'Cz************************',
    'project_name',
    'endpoint',
)

## Parameter configuration
from maxframe.config import options

options.sql.enable_mcqa = False
# Prefer the DPE engine.
options.dag.settings = {
    "engine_order": ["DPE", "MCSQL", "SPE"]
}
options.dpe.settings = {
    "odps.catalog_api_endpoint": "catalog_api_endpoint",
}
options.sql.settings = {
    "odps.session.image": "common",
}
options.service_role_arn = "acs:ram::13933481********:role/aliyunodpsdefaultrole"
options.object_cache_url = "oss://oss-cn-beijing-internal.aliyuncs.com/models-*******/mfdemo"

sess = maxframe.new_session(o)
print(sess.get_logview_address())

## Model training and generation
import numpy as np
import maxframe.dataframe as md
from maxframe.learn.contrib.xgboost import XGBRegressor

# Define the table name and fields.
table_name = "demo_xgboost_train"

# Feature columns and the target column.
features = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
            'rad', 'tax', 'ptratio', 'lstat']
label = 'medv'

# Read the training data.
df = md.read_odps_table(
    table_name,
    unknown_as_string=True  # Prevents errors caused by unstructured data.
)

# Feature and label processing.
X_train = df[features].fillna(-1)  # Fill missing values.
y_train = df[label]

# Initialize the XGBoost regression model.
model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    colsample_bytree=0.8,
    n_jobs=-1,
    tree_method="hist",
    enable_categorical=True,
    objective='reg:squarederror'
)

# Train the model and save it as a MaxCompute model object.
model.fit(X_train, y_train).to_odps_model(
    model_name="demo_model_xgboost_regressor",
    model_version="v01",
).execute()

View the trained model:
DESC model demo_model_xgboost_regressor;

+------------------------------------------------------------------------------------+
| Model Information                                                                   |
+------------------------------------------------------------------------------------+
| Owner: ALIYUN$***********************                                               |
| Project: pd_test_model                                                              |
| Schema: default                                                                     |
| Model Name: demo_model_xgboost_regressor                                            |
| Model Type: BOOSTED_TREE_REGRESSOR                                                  |
| Source Type: INTERNAL_TRAIN                                                         |
| Default Version: v01                                                                |
| CreateTime: 2025-09-12 13:15:16                                                     |
| LastModifiedTime: 2025-09-12 13:15:16                                               |
| Model ID: 389c90e355264079923b89**********                                          |
+------------------------------------------------------------------------------------+
| Version Information                                                                 |
+------------------------------------------------------------------------------------+
| Owner: ALIYUN$***********************                                               |
| Project: pd_test_model                                                              |
| Schema: default                                                                     |
| Model Name: demo_model_xgboost_regressor                                            |
| Model Type: BOOSTED_TREE_REGRESSOR                                                  |
| Source Type: INTERNAL_TRAIN                                                         |
| Version Name: v01                                                                   |
| Version ID: 99acd6cb93f845c8a4a9977*********                                        |
| Path:                                                                               |
| CreateTime: 2025-09-12 13:15:16                                                     |
| LastModifiedTime: 2025-09-12 13:15:16                                               |
+------------------------------------------------------------------------------------+
| Input      | Type     | Comment                                                     |
+------------------------------------------------------------------------------------+
| crim       | float    |                                                             |
| zn         | float    |                                                             |
| indus      | float    |                                                             |
| chas       | int      |                                                             |
| nox        | float    |                                                             |
| rm         | float    |                                                             |
| age        | float    |                                                             |
| dis        | float    |                                                             |
| rad        | float    |                                                             |
| tax        | float    |                                                             |
| ptratio    | float    |                                                             |
| lstat      | float    |                                                             |
+------------------------------------------------------------------------------------+
Predict data using MaxFrame
After you train the model, run the following program in MaxFrame to make predictions with the trained model:
# Define the prediction data table and the sink table.
predict_table = "demo_xgboost_predict"
predict_result_table = "demo_xgboost_predict_result"

# Read the prediction data.
predict_data = md.read_odps_table(
    predict_table,
    unknown_as_string=True
)
X_predict = predict_data[features].fillna(-1)
y_predict = model.predict(X_predict)

# Build the prediction result DataFrame.
df_predict = predict_data[features].copy()
df_predict['result'] = y_predict.astype(np.float64)  # Add the predicted value.

# Write the prediction results to the sink table.
df_predict.to_odps_table(
    predict_result_table,
    overwrite=True,
    index=False,
    unknown_as_string=True
).execute()

View the sink table:
SELECT * FROM demo_xgboost_predict_result LIMIT 10;

+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------------------+
| crim       | zn         | indus      | chas       | nox        | rm         | age        | dis        | rad        | tax        | ptratio    | lstat      | result             |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------------------+
| 0.22438    | 0.0        | 9.69       | 0          | 0.585      | 6.027      | 79.7       | 2.4982     | 6.0        | 391.0      | 19.2       | 14.33      | 20.03961181640625  |
| 0.06263    | 0.0        | 11.93      | 0          | 0.573      | 6.593      | 69.1       | 2.4786     | 1.0        | 273.0      | 21.0       | 9.67       | 24.491479873657227 |
| 0.04527    | 0.0        | 11.93      | 0          | 0.573      | 6.12       | 76.7       | 2.2875     | 1.0        | 273.0      | 21.0       | 9.08       | 24.683202743530273 |
| 0.06076    | 0.0        | 11.93      | 0          | 0.573      | 6.976      | 91.0       | 2.1675     | 1.0        | 273.0      | 21.0       | 5.64       | 32.33962631225586  |
| 0.10959    | 0.0        | 11.93      | 0          | 0.573      | 6.794      | 89.3       | 2.3889     | 1.0        | 273.0      | 21.0       | 6.48       | 28.45917510986328  |
| 0.04741    | 0.0        | 11.93      | 0          | 0.573      | 6.03       | 80.8       | 2.505      | 1.0        | 273.0      | 21.0       | 7.88       | 24.43267822265625  |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------------------+
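You can also view the sink table directly from Python instead of SQL. The following is a minimal sketch that reuses the MaxFrame session and the md module imported in the training code:

# Optional: read the prediction results back through MaxFrame instead of SQL.
results = md.read_odps_table("demo_xgboost_predict_result")
print(results.head(10).execute())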
Predict data using a SQL AI function
Use the created model and the ML_PREDICT SQL function to predict housing prices.
SET odps.sql.type.system.odps2=true;
SET odps.task.major.version=sqlml_master;
SET odps.sql.machine.learning.enable=true;
SELECT ML_PREDICT(demo_model_xgboost_regressor,v01,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,lstat)
FROM demo_xgboost_predict;
-- The following result is returned:
+------------+
| _c0 |
+------------+
| 20.039612 |
| 24.49148 |
| 24.683203 |
| 32.339626 |
| 28.459175 |
| 24.432678 |
+------------+

When you use the same model and prediction data, the inference results from the MaxFrame and SQL engines are consistent.
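If you want to automate this cross-engine comparison, you can run the same ML_PREDICT query from Python with PyODPS and compare its output with the MaxFrame results. This is a minimal sketch; it assumes the ODPS entry object `o` from the training code is available:

# Optional: run the ML_PREDICT query from Python and print the results for comparison.
sql = """
SELECT ML_PREDICT(demo_model_xgboost_regressor,v01,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,lstat)
FROM demo_xgboost_predict
"""
hints = {
    "odps.sql.type.system.odps2": "true",
    "odps.task.major.version": "sqlml_master",
    "odps.sql.machine.learning.enable": "true",
}
with o.execute_sql(sql, hints=hints).open_reader() as reader:
    for record in reader:
        print(record[0])  # Predicted MEDV for each row in demo_xgboost_predict.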