This topic describes how to train an XGBoost model and use it to make predictions with MaxCompute.
Key concepts
XGBoost (eXtreme Gradient Boosting) is an ensemble learning algorithm that uses a gradient boosting framework. It iteratively builds decision tree models, uses second-order derivatives to optimize the objective function, and gradually corrects prediction errors. XGBoost is designed to improve the accuracy of classification and regression tasks on structured data. Its core idea is to combine the prediction results of multiple decision trees into an additive model. This process iteratively minimizes the loss function to efficiently model complex, non-linear relationships.
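The additive-model idea can be illustrated with a minimal local sketch that uses the open source xgboost package on synthetic data. This example is illustrative only and does not involve MaxCompute or MaxFrame; the feature and target values are made up for demonstration:

# A minimal local sketch of gradient-boosted regression trees (illustrative only).
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))                          # 200 samples, 3 synthetic features
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 200)    # non-linear target

# Each of the 50 trees corrects the residual errors of the trees trained before it.
model = XGBRegressor(n_estimators=50, learning_rate=0.1, objective="reg:squarederror")
model.fit(X, y)
print(model.predict(X[:3]))    # each prediction is the sum of the outputs of all trees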
Scope
This scenario uses features from the upgraded MaxFrame engine. These features are currently in invitational preview. The sample code in this topic is for reference only. To try these features, you can submit a ticket.
Benefits
In real-world business scenarios, cross-team collaboration is common. Algorithm teams often prefer to use the Python programming language for model training. In contrast, business analytics teams tend to use SQL for data analytics. MaxCompute allows the MaxFrame and SQL engines to use the same model object. This reduces integration costs for cross-language and cross-technology stack collaboration. It also allows users to flexibly apply MaxCompute's AI capabilities to business analytics.
Dataset overview
This topic uses the classic Boston housing dataset. This dataset contains information about housing conditions in the Boston area during the 1970s. The training dataset contains 500 samples, with 12 features and one target variable. The prediction dataset contains 6 samples with the same 12 features as the training dataset.
The following table describes the fields in the dataset.
| No. | Field name | Type | Meaning | Description |
| --- | --- | --- | --- | --- |
| 1 | CRIM | Float | Per capita crime rate by town | This field indicates the per capita crime rate for each town. Areas with higher crime rates generally have lower housing prices. |
| 2 | ZN | Float | Proportion of residential land zoned for lots >25,000 sq. ft. | This field indicates the proportion of residential land in a town that is zoned for lots larger than 25,000 sq. ft. |
| 3 | INDUS | Float | Proportion of non-retail business acres per town | This field represents the proportion of land in a town used for non-retail businesses, such as factories and warehouses. |
| 4 | CHAS | Integer | Bounds river (1=yes, 0=no) | This is a binary field that indicates whether the area is adjacent to the Charles River. |
| 5 | NOX | Float | Nitric oxides concentration (ppm) | This field indicates the concentration of nitric oxides in the air, measured in parts per million (ppm). |
| 6 | RM | Float | Average number of rooms per dwelling | This field represents the average number of rooms in a dwelling. Dwellings with more rooms are generally more expensive. |
| 7 | AGE | Float | Proportion of owner-occupied units built before 1940 | This field indicates the proportion of owner-occupied homes built before 1940. |
| 8 | DIS | Float | Weighted distances to five Boston employment centers | This field indicates the weighted distance to five major employment centers in Boston. Areas closer to these centers generally have higher housing prices. |
| 9 | RAD | Integer | Index of accessibility to radial highways | This field indicates the accessibility index for radial highways. A higher index means better accessibility. |
| 10 | TAX | Float | Full-value property-tax rate per USD 10,000 | This field indicates the tax amount per USD 10,000 of property value. Higher tax rates can negatively affect housing prices. |
| 11 | PTRATIO | Float | Pupil-teacher ratio by town | This field indicates the ratio of students to teachers in public schools in a town. A higher ratio may suggest that educational resources are relatively scarce. |
| 12 | LSTAT | Float | Percentage of lower status of the population | This field indicates the percentage of the population that is considered lower status. Areas with a higher percentage of lower-status population generally have lower housing prices. |
| 13 | MEDV | Float | Median value of owner-occupied homes in thousands of USD | Target variable: This field indicates the median value of owner-occupied homes in the area, in thousands of USD. This is the value that the model needs to predict. |
Prerequisites
Activate MaxCompute, create a MaxCompute project, and ensure that schema syntax is enabled for the project.
Ensure that the latest versions of the MaxFrame, scikit-learn, and xgboost packages are installed in your local Python development environment. For more information, see Use MaxFrame in a local environment.
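To confirm that your local environment is ready, you can run a quick version check. This is a minimal sketch that only verifies that the three packages listed above can be imported; it assumes each package exposes a standard __version__ attribute:

# Quick check that the required packages are installed in the local Python environment.
import maxframe
import sklearn
import xgboost

print("maxframe:", maxframe.__version__)
print("scikit-learn:", sklearn.__version__)
print("xgboost:", xgboost.__version__)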
Prepare the data
Create the required tables and populate them with data:
-- Create a training data table.
CREATE TABLE IF NOT EXISTS demo_xgboost_train (
    CRIM    FLOAT COMMENT 'Per capita crime rate by town',
    ZN      FLOAT COMMENT 'Proportion of residential land zoned for lots >25,000 sq. ft.',
    INDUS   FLOAT COMMENT 'Proportion of non-retail business acres per town',
    CHAS    INT   COMMENT 'Bounds river (1=yes, 0=no)',
    NOX     FLOAT COMMENT 'Nitric oxides concentration (ppm)',
    RM      FLOAT COMMENT 'Average number of rooms per dwelling',
    AGE     FLOAT COMMENT 'Proportion of owner-occupied units built before 1940',
    DIS     FLOAT COMMENT 'Weighted distances to five Boston employment centers',
    RAD     FLOAT COMMENT 'Index of accessibility to radial highways',
    TAX     FLOAT COMMENT 'Full-value property-tax rate per USD 10,000',
    PTRATIO FLOAT COMMENT 'Pupil-teacher ratio by town',
    LSTAT   FLOAT COMMENT 'Percentage of lower status of the population',
    MEDV    FLOAT COMMENT 'Median value of owner-occupied homes in thousands of USD'
);

-- Create a prediction data table that is shared by SQL and MaxFrame.
CREATE TABLE IF NOT EXISTS demo_xgboost_predict (
    CRIM    FLOAT COMMENT 'Per capita crime rate by town',
    ZN      FLOAT COMMENT 'Proportion of residential land zoned for lots >25,000 sq. ft.',
    INDUS   FLOAT COMMENT 'Proportion of non-retail business acres per town',
    CHAS    INT   COMMENT 'Bounds river (1=yes, 0=no)',
    NOX     FLOAT COMMENT 'Nitric oxides concentration (ppm)',
    RM      FLOAT COMMENT 'Average number of rooms per dwelling',
    AGE     FLOAT COMMENT 'Proportion of owner-occupied units built before 1940',
    DIS     FLOAT COMMENT 'Weighted distances to five Boston employment centers',
    RAD     FLOAT COMMENT 'Index of accessibility to radial highways',
    TAX     FLOAT COMMENT 'Full-value property-tax rate per USD 10,000',
    PTRATIO FLOAT COMMENT 'Pupil-teacher ratio by town',
    LSTAT   FLOAT COMMENT 'Percentage of lower status of the population'
);

-- Create a table to store the MaxFrame prediction results.
CREATE TABLE IF NOT EXISTS demo_xgboost_predict_result (
    CRIM    DOUBLE COMMENT 'Per capita crime rate by town',
    ZN      DOUBLE COMMENT 'Proportion of residential land zoned for lots >25,000 sq. ft.',
    INDUS   DOUBLE COMMENT 'Proportion of non-retail business acres per town',
    CHAS    BIGINT COMMENT 'Bounds river (1=yes, 0=no)',
    NOX     DOUBLE COMMENT 'Nitric oxides concentration (ppm)',
    RM      DOUBLE COMMENT 'Average number of rooms per dwelling',
    AGE     DOUBLE COMMENT 'Proportion of owner-occupied units built before 1940',
    DIS     DOUBLE COMMENT 'Weighted distances to five Boston employment centers',
    RAD     DOUBLE COMMENT 'Index of accessibility to radial highways',
    TAX     DOUBLE COMMENT 'Full-value property-tax rate per USD 10,000',
    PTRATIO DOUBLE COMMENT 'Pupil-teacher ratio by town',
    LSTAT   DOUBLE COMMENT 'Percentage of lower status of the population',
    RESULT  DOUBLE COMMENT 'Predicted value of owner-occupied homes'
);

Upload the data by using Tunnel commands:
TUNNEL UPLOAD xgboost_train_data.csv demo_xgboost_train;
TUNNEL UPLOAD xgboost_predict.csv demo_xgboost_predict;
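After the upload completes, you can optionally verify the row counts from Python with PyODPS. This is a minimal sketch; it assumes the ODPS entry object `o` is configured as shown in the training code below, and the expected counts follow the dataset overview (500 training rows, 6 prediction rows):

# Optional check: confirm that the uploaded tables contain the expected number of rows.
for table_name, expected in [("demo_xgboost_train", 500), ("demo_xgboost_predict", 6)]:
    with o.execute_sql(f"SELECT COUNT(*) AS cnt FROM {table_name}").open_reader() as reader:
        for record in reader:
            print(table_name, record["cnt"], "rows (expected:", expected, ")")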
Train and save the model
Train an XGBoost regression model using MaxFrame.
## Session initialization
from odps import ODPS
import maxframe
from maxframe import options

# Replace the placeholders with your AccessKey pair, project name, and endpoint.
o = ODPS(
    'LT******************',
    'Cz************************',
    'project_name',
    'endpoint',
)

## Parameter configuration
from maxframe.config import options

options.sql.enable_mcqa = False
# Prefer the DPE engine.
options.dag.settings = {
    "engine_order": ["DPE", "MCSQL", "SPE"]
}
options.dpe.settings = {
    "odps.catalog_api_endpoint": "catalog_api_endpoint",
}
options.sql.settings = {
    "odps.session.image": "common",
}
options.service_role_arn = "acs:ram::13933481********:role/aliyunodpsdefaultrole"
options.object_cache_url = "oss://oss-cn-beijing-internal.aliyuncs.com/models-*******/mfdemo"

sess = maxframe.new_session(o)
print(sess.get_logview_address())

## Model training and generation
import numpy as np
import maxframe.dataframe as md
from maxframe.learn.contrib.xgboost import XGBRegressor

# Define the table name and fields.
table_name = "demo_xgboost_train"

# Feature columns and the target column.
features = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
            'rad', 'tax', 'ptratio', 'lstat']
label = 'medv'

# Read the training data.
df = md.read_odps_table(
    table_name,
    unknown_as_string=True  # Prevents errors caused by unstructured data.
)

# Feature and label processing.
X_train = df[features].fillna(-1)  # Fill missing values.
y_train = df[label]

# Initialize the XGBoost regression model.
model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    colsample_bytree=0.8,
    n_jobs=-1,
    tree_method="hist",
    enable_categorical=True,
    objective='reg:squarederror'
)

# Train the model and save it as a MaxCompute model object.
model.fit(X_train, y_train).to_odps_model(
    model_name="demo_model_xgboost_regressor",
    model_version="v01",
).execute()

View the trained model:
DESC model demo_model_xgboost_regressor;

+------------------------------------------------------------------------------------+
| Model Information                                                                   |
+------------------------------------------------------------------------------------+
| Owner: ALIYUN$***********************                                               |
| Project: pd_test_model                                                              |
| Schema: default                                                                     |
| Model Name: demo_model_xgboost_regressor                                            |
| Model Type: BOOSTED_TREE_REGRESSOR                                                  |
| Source Type: INTERNAL_TRAIN                                                         |
| Default Version: v01                                                                |
| CreateTime: 2025-09-12 13:15:16                                                     |
| LastModifiedTime: 2025-09-12 13:15:16                                               |
| Model ID: 389c90e355264079923b89**********                                          |
+------------------------------------------------------------------------------------+
| Version Information                                                                 |
+------------------------------------------------------------------------------------+
| Owner: ALIYUN$***********************                                               |
| Project: pd_test_model                                                              |
| Schema: default                                                                     |
| Model Name: demo_model_xgboost_regressor                                            |
| Model Type: BOOSTED_TREE_REGRESSOR                                                  |
| Source Type: INTERNAL_TRAIN                                                         |
| Version Name: v01                                                                   |
| Version ID: 99acd6cb93f845c8a4a9977*********                                        |
| Path:                                                                               |
| CreateTime: 2025-09-12 13:15:16                                                     |
| LastModifiedTime: 2025-09-12 13:15:16                                               |
+------------------------------------------------------------------------------------+
| Input      | Type     | Comment                                                     |
+------------------------------------------------------------------------------------+
| crim       | float    |                                                             |
| zn         | float    |                                                             |
| indus      | float    |                                                             |
| chas       | int      |                                                             |
| nox        | float    |                                                             |
| rm         | float    |                                                             |
| age        | float    |                                                             |
| dis        | float    |                                                             |
| rad        | float    |                                                             |
| tax        | float    |                                                             |
| ptratio    | float    |                                                             |
| lstat      | float    |                                                             |
+------------------------------------------------------------------------------------+
Predict data using MaxFrame
After you train the model, run the following program in MaxFrame to make predictions with the trained model:
# Define the prediction data table and the sink table.
predict_table = "demo_xgboost_predict"
predict_result_table = "demo_xgboost_predict_result"

# Read the prediction data.
predict_data = md.read_odps_table(
    predict_table,
    unknown_as_string=True
)
X_predict = predict_data[features].fillna(-1)
y_predict = model.predict(X_predict)

# Build the prediction result DataFrame.
df_predict = predict_data[features].copy()
df_predict['result'] = y_predict.astype(np.float64)  # Add the predicted value.

# Write the prediction results to the sink table.
df_predict.to_odps_table(
    predict_result_table,
    overwrite=True,
    index=False,
    unknown_as_string=True
).execute()

View the sink table:
SELECT * FROM demo_xgboost_predict_result LIMIT 10;

+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------------------+
| crim       | zn         | indus      | chas       | nox        | rm         | age        | dis        | rad        | tax        | ptratio    | lstat      | result             |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------------------+
| 0.22438    | 0.0        | 9.69       | 0          | 0.585      | 6.027      | 79.7       | 2.4982     | 6.0        | 391.0      | 19.2       | 14.33      | 20.03961181640625  |
| 0.06263    | 0.0        | 11.93      | 0          | 0.573      | 6.593      | 69.1       | 2.4786     | 1.0        | 273.0      | 21.0       | 9.67       | 24.491479873657227 |
| 0.04527    | 0.0        | 11.93      | 0          | 0.573      | 6.12       | 76.7       | 2.2875     | 1.0        | 273.0      | 21.0       | 9.08       | 24.683202743530273 |
| 0.06076    | 0.0        | 11.93      | 0          | 0.573      | 6.976      | 91.0       | 2.1675     | 1.0        | 273.0      | 21.0       | 5.64       | 32.33962631225586  |
| 0.10959    | 0.0        | 11.93      | 0          | 0.573      | 6.794      | 89.3       | 2.3889     | 1.0        | 273.0      | 21.0       | 6.48       | 28.45917510986328  |
| 0.04741    | 0.0        | 11.93      | 0          | 0.573      | 6.03       | 80.8       | 2.505      | 1.0        | 273.0      | 21.0       | 7.88       | 24.43267822265625  |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------------------+
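You can also view the sink table directly from Python instead of SQL. The following is a minimal sketch that reuses the MaxFrame session and the md module imported in the training code:

# Optional: read the prediction results back through MaxFrame instead of SQL.
results = md.read_odps_table("demo_xgboost_predict_result")
print(results.head(10).execute())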
Predict data using a SQL AI function
Use the created model and the ML_PREDICT SQL function to predict housing prices.
SET odps.sql.type.system.odps2=true;
SET odps.task.major.version=sqlml_master;
SET odps.sql.machine.learning.enable=true;
SELECT ML_PREDICT(demo_model_xgboost_regressor,v01,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,lstat)
FROM demo_xgboost_predict;
-- The following result is returned:
+------------+
| _c0 |
+------------+
| 20.039612 |
| 24.49148 |
| 24.683203 |
| 32.339626 |
| 28.459175 |
| 24.432678 |
+------------+

When you use the same model and prediction data, the inference results from the MaxFrame and SQL engines are consistent.
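If you want to automate this cross-engine comparison, you can run the same ML_PREDICT query from Python with PyODPS and compare its output with the MaxFrame results. This is a minimal sketch; it assumes the ODPS entry object `o` from the training code is available:

# Optional: run the ML_PREDICT query from Python and print the results for comparison.
sql = """
SELECT ML_PREDICT(demo_model_xgboost_regressor,v01,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,lstat)
FROM demo_xgboost_predict
"""
hints = {
    "odps.sql.type.system.odps2": "true",
    "odps.task.major.version": "sqlml_master",
    "odps.sql.machine.learning.enable": "true",
}
with o.execute_sql(sql, hints=hints).open_reader() as reader:
    for record in reader:
        print(record[0])  # Predicted MEDV for each row in demo_xgboost_predict.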