This document shows how to train an XGBoost (eXtreme Gradient Boosting) regression model using the MaxFrame Python API and run inference from two engines — Python and SQL — using the same saved model object.
This workflow uses the upgraded MaxFrame engine, which is currently in invitational preview. The sample code is for reference only. To try this feature, submit a ticket.
How it works
This document covers two complementary workflows:
| Workflow | Engine | Target audience |
|---|---|---|
| Train and predict with MaxFrame | Python (MaxFrame API) | Algorithm engineers who write Python |
| Predict with SQL | SQL (ML_PREDICT function) | Analytics teams who write SQL |
Both workflows use the same MaxCompute model object. Results are identical regardless of which engine you use. This eliminates integration overhead in cross-team environments where algorithm engineers prefer Python and analytics teams prefer SQL.
Key concepts
XGBoost is an ensemble learning algorithm built on a gradient boosting framework. It iteratively builds decision trees, uses second-order derivatives to optimize the objective function, and combines their outputs into an additive model — efficiently capturing complex, non-linear relationships in structured data.
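The additive structure described above can be illustrated with a deliberately tiny sketch: each boosting round fits a one-split "stump" to the current residuals and adds its shrunken output to the running prediction. This is plain Python on fabricated toy data, for intuition only; real XGBoost additionally uses second-order gradient information, regularization, and full decision trees.

```python
# Toy gradient boosting for squared error: repeatedly fit residuals with
# a one-split stump and accumulate the (shrunken) outputs additively.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

def fit_stump(xs, residuals):
    """Find the split threshold minimizing squared error of two leaf means."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for x, r in zip(xs, residuals) if x <= t)
               + sum((r - rm) ** 2 for x, r in zip(xs, residuals) if x > t))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

learning_rate = 0.5
pred = [sum(ys) / len(ys)] * len(ys)   # start from the mean prediction
trees = []
for _ in range(20):                    # n_estimators
    residuals = [y - p for y, p in zip(ys, pred)]
    stump = fit_stump(xs, residuals)
    trees.append(stump)
    pred = [p + learning_rate * stump(x) for p, x in zip(pred, xs)]

# Training error shrinks as trees are added to the ensemble.
mse = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
print(round(mse, 4))
```

The final model is the sum of all stump outputs, which is exactly the additive form XGBoost produces at full scale.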
Dataset
This document uses the Boston Housing dataset, which describes housing conditions in the Boston area during the 1970s.
- Training data: 500 samples, 12 feature columns, 1 target column (MEDV)
- Prediction data: 6 samples, same 12 feature columns
| No. | Field | Type | Description |
|---|---|---|---|
| 1 | CRIM | Float | Per capita crime rate by town |
| 2 | ZN | Float | Proportion of residential land zoned for lots >25,000 sq. ft. |
| 3 | INDUS | Float | Proportion of non-retail business acres per town |
| 4 | CHAS | Integer | Bounds river (1=yes, 0=no) |
| 5 | NOX | Float | Nitric oxides concentration (ppm) |
| 6 | RM | Float | Average number of rooms per dwelling |
| 7 | AGE | Float | Proportion of owner-occupied units built before 1940 |
| 8 | DIS | Float | Weighted distances to five Boston employment centers |
| 9 | RAD | Float | Index of accessibility to radial highways |
| 10 | TAX | Float | Full-value property-tax rate per USD 10,000 |
| 11 | PTRATIO | Float | Pupil-teacher ratio by town |
| 12 | LSTAT | Float | Percentage of lower status of the population |
| 13 | MEDV | Float | Median value of owner-occupied homes in thousands of USD (target variable) |
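For reference, the field table above can be captured as plain Python data and reused wherever a column list is needed, which keeps feature names consistent across the training and prediction scripts. A minimal sketch (the `SCHEMA` name is illustrative, not part of any MaxFrame API):

```python
# The 13 Boston Housing fields as (name, type) pairs; MEDV is the target.
SCHEMA = [
    ("CRIM", "FLOAT"), ("ZN", "FLOAT"), ("INDUS", "FLOAT"), ("CHAS", "INT"),
    ("NOX", "FLOAT"), ("RM", "FLOAT"), ("AGE", "FLOAT"), ("DIS", "FLOAT"),
    ("RAD", "FLOAT"), ("TAX", "FLOAT"), ("PTRATIO", "FLOAT"), ("LSTAT", "FLOAT"),
    ("MEDV", "FLOAT"),
]

# Derive the lowercase feature list and label used later in this document.
features = [name.lower() for name, _ in SCHEMA[:-1]]
label = SCHEMA[-1][0].lower()
print(len(features), label)  # 12 medv
```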
Prerequisites
Before you begin, ensure that you have:
- An active MaxCompute project with schema syntax enabled. If you haven't created a project yet, see Create a MaxCompute project.
- The latest versions of MaxFrame, scikit-learn, and xgboost installed in your local Python environment. For setup instructions, see Use MaxFrame in a local environment.
Prepare the data
- Create the training, prediction input, and prediction result tables.

-- Training data table
CREATE TABLE IF NOT EXISTS demo_xgboost_train (
    CRIM FLOAT COMMENT 'Per capita crime rate by town',
    ZN FLOAT COMMENT 'Proportion of residential land zoned for lots >25,000 sq. ft.',
    INDUS FLOAT COMMENT 'Proportion of non-retail business acres per town',
    CHAS INT COMMENT 'Bounds river (1=yes, 0=no)',
    NOX FLOAT COMMENT 'Nitric oxides concentration (ppm)',
    RM FLOAT COMMENT 'Average number of rooms per dwelling',
    AGE FLOAT COMMENT 'Proportion of owner-occupied units built before 1940',
    DIS FLOAT COMMENT 'Weighted distances to five Boston employment centers',
    RAD FLOAT COMMENT 'Index of accessibility to radial highways',
    TAX FLOAT COMMENT 'Full-value property-tax rate per USD 10,000',
    PTRATIO FLOAT COMMENT 'Pupil-teacher ratio by town',
    LSTAT FLOAT COMMENT 'Percentage of lower status of the population',
    MEDV FLOAT COMMENT 'Median value of owner-occupied homes in thousands of USD'
);

-- Prediction input table (no MEDV column)
CREATE TABLE IF NOT EXISTS demo_xgboost_predict (
    CRIM FLOAT COMMENT 'Per capita crime rate by town',
    ZN FLOAT COMMENT 'Proportion of residential land zoned for lots >25,000 sq. ft.',
    INDUS FLOAT COMMENT 'Proportion of non-retail business acres per town',
    CHAS INT COMMENT 'Bounds river (1=yes, 0=no)',
    NOX FLOAT COMMENT 'Nitric oxides concentration (ppm)',
    RM FLOAT COMMENT 'Average number of rooms per dwelling',
    AGE FLOAT COMMENT 'Proportion of owner-occupied units built before 1940',
    DIS FLOAT COMMENT 'Weighted distances to five Boston employment centers',
    RAD FLOAT COMMENT 'Index of accessibility to radial highways',
    TAX FLOAT COMMENT 'Full-value property-tax rate per USD 10,000',
    PTRATIO FLOAT COMMENT 'Pupil-teacher ratio by town',
    LSTAT FLOAT COMMENT 'Percentage of lower status of the population'
);

-- MaxFrame prediction result table (uses DOUBLE types and adds a RESULT column)
CREATE TABLE IF NOT EXISTS demo_xgboost_predict_result (
    CRIM DOUBLE COMMENT 'Per capita crime rate by town',
    ZN DOUBLE COMMENT 'Proportion of residential land zoned for lots >25,000 sq. ft.',
    INDUS DOUBLE COMMENT 'Proportion of non-retail business acres per town',
    CHAS BIGINT COMMENT 'Bounds river (1=yes, 0=no)',
    NOX DOUBLE COMMENT 'Nitric oxides concentration (ppm)',
    RM DOUBLE COMMENT 'Average number of rooms per dwelling',
    AGE DOUBLE COMMENT 'Proportion of owner-occupied units built before 1940',
    DIS DOUBLE COMMENT 'Weighted distances to five Boston employment centers',
    RAD DOUBLE COMMENT 'Index of accessibility to radial highways',
    TAX DOUBLE COMMENT 'Full-value property-tax rate per USD 10,000',
    PTRATIO DOUBLE COMMENT 'Pupil-teacher ratio by town',
    LSTAT DOUBLE COMMENT 'Percentage of lower status of the population',
    RESULT DOUBLE COMMENT 'Predicted value of owner-occupied homes'
);

- Download the sample CSV files and upload them to MaxCompute using the Tunnel API.
Download:
Upload:
TUNNEL UPLOAD xgboost_train_data.csv demo_xgboost_train;
TUNNEL UPLOAD xgboost_predict.csv demo_xgboost_predict;
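Before uploading, it can be useful to sanity-check the CSV files locally so that a malformed row does not fail partway through the Tunnel upload. A minimal sketch using the standard csv module; it assumes header-less files with 13 columns for training data (12 for prediction data), and the two sample rows below are fabricated for illustration:

```python
import csv
import io

def check_csv(text, expected_cols):
    """Return the row count, raising if any row has the wrong column count."""
    rows = list(csv.reader(io.StringIO(text)))
    for i, row in enumerate(rows):
        if len(row) != expected_cols:
            raise ValueError(f"row {i}: expected {expected_cols} columns, got {len(row)}")
    return len(rows)

# Two illustrative rows in the training layout (12 features + MEDV).
sample = (
    "0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,4.98,24.0\n"
    "0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,9.14,21.6\n"
)
print(check_csv(sample, expected_cols=13))  # 2
```

In practice you would pass the contents of `xgboost_train_data.csv` (13 columns) and `xgboost_predict.csv` (12 columns) through the same check.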
Train and save the model
Run the following Python script to initialize a MaxFrame session, train an XGBRegressor on the Boston Housing training data, and save the model as a versioned MaxCompute model object.
# --- Session initialization ---
from odps import ODPS
import maxframe
from maxframe.config import options

o = ODPS(
    '<your-access-key-id>',
    '<your-access-key-secret>',
    '<your-project-name>',
    '<your-endpoint>',
)

# --- Engine configuration ---
options.sql.enable_mcqa = False
# Use the DPE engine first, fall back to MCSQL, then SPE.
options.dag.settings = {
    "engine_order": ["DPE", "MCSQL", "SPE"]
}
options.dpe.settings = {
    "odps.catalog_api_endpoint": "<catalog-api-endpoint>",
}
options.sql.settings = {
    "odps.session.image": "common",
}
options.service_role_arn = "acs:ram::13933481********:role/aliyunodpsdefaultrole"
options.object_cache_url = "oss://oss-cn-beijing-internal.aliyuncs.com/models-*******/mfdemo"

sess = maxframe.new_session(o)
print(sess.get_logview_address())

# --- Model training ---
import numpy as np
import maxframe.dataframe as md
from maxframe.learn.contrib.xgboost import XGBRegressor

table_name = "demo_xgboost_train"
features = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'lstat']
label = 'medv'

# Read training data from MaxCompute.
df = md.read_odps_table(table_name, unknown_as_string=True)
# Separate features and label; fill missing values with -1.
X_train = df[features].fillna(-1)
y_train = df[label]

# Train the XGBoost regression model.
model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    colsample_bytree=0.8,
    n_jobs=-1,
    tree_method="hist",
    enable_categorical=True,
    objective='reg:squarederror'
)

# Save the trained model as a versioned MaxCompute model object.
model.fit(X_train, y_train).to_odps_model(
    model_name="demo_model_xgboost_regressor",
    model_version="v01",
).execute()
Replace the placeholders with your actual values:
| Placeholder | Description | Example |
|---|---|---|
| `<your-access-key-id>` | Alibaba Cloud AccessKey ID | LTAI5tXxx |
| `<your-access-key-secret>` | Alibaba Cloud AccessKey secret | xXxXxXx |
| `<your-project-name>` | MaxCompute project name | my_project |
| `<your-endpoint>` | MaxCompute endpoint | http://service.cn-hangzhou.maxcompute.aliyun.com/api |
| `<catalog-api-endpoint>` | Catalog API endpoint for the DPE engine | Provided when you activate the feature |
Verify the saved model
Run the DESC MODEL command to confirm that the model was saved correctly.
DESC model demo_model_xgboost_regressor;
Expected output:
+------------------------------------------------------------------------------------+
| Model Information |
+------------------------------------------------------------------------------------+
| Owner: ALIYUN$*********************** |
| Project: pd_test_model |
| Schema: default |
| Model Name: demo_model_xgboost_regressor |
| Model Type: BOOSTED_TREE_REGRESSOR |
| Source Type: INTERNAL_TRAIN |
| Default Version: v01 |
| CreateTime: 2025-09-12 13:15:16 |
| LastModifiedTime: 2025-09-12 13:15:16 |
| Model ID: 389c90e355264079923b89********** |
+------------------------------------------------------------------------------------+
| Version Information |
+------------------------------------------------------------------------------------+
| Owner: ALIYUN$*********************** |
| Project: pd_test_model |
| Schema: default |
| Model Name: demo_model_xgboost_regressor |
| Model Type: BOOSTED_TREE_REGRESSOR |
| Source Type: INTERNAL_TRAIN |
| Version Name: v01 |
| Version ID: 99acd6cb93f845c8a4a9977********* |
| Path: |
| CreateTime: 2025-09-12 13:15:16 |
| LastModifiedTime: 2025-09-12 13:15:16 |
+------------------------------------------------------------------------------------+
| Input | Type | Comment |
+------------------------------------------------------------------------------------+
| crim | float | |
| zn | float | |
| indus | float | |
| chas | int | |
| nox | float | |
| rm | float | |
| age | float | |
| dis | float | |
| rad | float | |
| tax | float | |
| ptratio | float | |
| lstat | float | |
+------------------------------------------------------------------------------------+
Predict with MaxFrame
After training, use the same MaxFrame session to run inference and write results to a sink table.
# Define the prediction input and output tables.
predict_table = "demo_xgboost_predict"
predict_result_table = "demo_xgboost_predict_result"

# Read prediction data.
predict_data = md.read_odps_table(predict_table, unknown_as_string=True)
X_predict = predict_data[features].fillna(-1)
y_predict = model.predict(X_predict)

# Build the result DataFrame with the predicted column.
df_predict = predict_data[features].copy()
df_predict['result'] = y_predict.astype(np.float64)

# Write results to MaxCompute.
df_predict.to_odps_table(
    predict_result_table,
    overwrite=True,
    index=False,
    unknown_as_string=True
).execute()
Query the sink table to review the output:
SELECT * FROM demo_xgboost_predict_result LIMIT 10;
Expected output:
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | result |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| 0.22438 | 0.0 | 9.69 | 0 | 0.585 | 6.027 | 79.7 | 2.4982 | 6.0 | 391.0 | 19.2 | 14.33 | 20.03961181640625 |
| 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 9.67 | 24.491479873657227 |
| 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.12 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 9.08 | 24.683202743530273 |
| 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 5.64 | 32.33962631225586 |
| 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 6.48 | 28.45917510986328 |
| 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.03 | 80.8 | 2.505 | 1.0 | 273.0 | 21.0 | 7.88 | 24.43267822265625 |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
Predict with SQL
Use ML_PREDICT to run inference directly in SQL, with no Python environment needed. The function takes the model name, model version, and feature columns as arguments.
SET odps.sql.type.system.odps2=true;
SET odps.task.major.version=sqlml_master;
SET odps.sql.machine.learning.enable=true;
SELECT ML_PREDICT(demo_model_xgboost_regressor, v01, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, lstat)
FROM demo_xgboost_predict;
Pass feature columns in the same order as the training data. The function returns a single prediction column named _c0.
Expected output:
+------------+
| _c0 |
+------------+
| 20.039612 |
| 24.49148 |
| 24.683203 |
| 32.339626 |
| 28.459175 |
| 24.432678 |
+------------+
The ML_PREDICT results match the MaxFrame predictions for the same input data, confirming that both engines produce consistent inference from the same model object.
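Note that the SQL output displays fewer decimal places than the MaxFrame result table, so when verifying the match programmatically it is safer to compare within a tolerance rather than for exact equality. A quick sketch using the first row of each output above:

```python
import math

maxframe_pred = 20.03961181640625   # first row of demo_xgboost_predict_result
sql_pred = 20.039612                # first row of the ML_PREDICT output

# The two values agree to roughly six decimal places; only the
# displayed precision differs between the two engines.
print(math.isclose(maxframe_pred, sql_pred, abs_tol=1e-5))  # True
```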