All Products
Search
Document Center

MaxCompute:Train and predict with an XGBoost model using MaxCompute

Last Updated:Oct 25, 2025

This topic describes how to train an XGBoost model and use it to make predictions with MaxCompute.

Key concepts

XGBoost (eXtreme Gradient Boosting) is an ensemble learning algorithm that uses a gradient boosting framework. It iteratively builds decision tree models, uses second-order derivatives to optimize the objective function, and gradually corrects prediction errors. XGBoost is designed to improve the accuracy of classification and regression tasks on structured data. Its core idea is to combine the prediction results of multiple decision trees into an additive model. This process iteratively minimizes the loss function to efficiently model complex, non-linear relationships.

Scope

This scenario uses features from the upgraded MaxFrame engine. These features are currently in invitational preview. The sample code in this topic is for reference only. To try these features, you can submit a ticket.

Benefits

In real-world business scenarios, cross-team collaboration is common. Algorithm teams often prefer to use the Python programming language for model training. In contrast, business analytics teams tend to use SQL for data analytics. MaxCompute allows the MaxFrame and SQL engines to use the same model object. This reduces integration costs for cross-language and cross-technology stack collaboration. It also allows users to flexibly apply MaxCompute's AI capabilities to business analytics.

Dataset overview

This topic uses the classic Boston housing dataset. This dataset contains information about housing conditions in the Boston area during the 1970s. The training dataset contains 500 samples, with 12 features and one target variable. The prediction dataset contains 6 samples with the same 12 features as the training dataset.

The following table describes the fields in the dataset.

No.

Field name

Type

Meaning

Description

1

CRIM

Float

Per capita crime rate by town

This field indicates the per capita crime rate for each town. Areas with higher crime rates generally have lower housing prices.

2

ZN

Float

Proportion of residential land zoned for lots >25,000 sq. ft.

This field indicates the proportion of residential land in a town that is zoned for lots larger than 25,000 sq. ft.

3

INDUS

Float

Proportion of non-retail business acres per town

This field represents the proportion of land in a town used for non-retail businesses, such as factories and warehouses.

4

CHAS

Integer

Bounds river (1=yes, 0=no)

This is a binary field that indicates whether the area is adjacent to the Charles River.

5

NOX

Float

Nitric oxides concentration (ppm)

This field indicates the concentration of nitric oxides in the air, measured in parts per million (ppm).

6

RM

Float

Average number of rooms per dwelling

This field represents the average number of rooms in a dwelling. Dwellings with more rooms are generally more expensive.

7

AGE

Float

Proportion of owner-occupied units built before 1940

This field indicates the proportion of owner-occupied homes built before 1940.

8

DIS

Float

Weighted distances to five Boston employment centers

This field indicates the weighted distance to five major employment centers in Boston. Areas closer to these centers generally have higher housing prices.

9

RAD

Integer

Index of accessibility to radial highways

This field indicates the accessibility index for highways. A higher index means better accessibility.

10

TAX

Float

Full-value property-tax rate per USD 10,000

This field indicates the tax amount per USD 10,000 of property value. Higher tax rates can negatively affect housing prices.

11

PTRATIO

Float

Pupil-teacher ratio by town

This field indicates the ratio of students to teachers in public schools in a town. A higher ratio may suggest that educational resources are relatively scarce.

12

LSTAT

Float

Percentage of lower status of the population

This field indicates the percentage of the population that is considered lower status. Areas with a higher percentage of lower-status population generally have lower housing prices.

13

MEDV

Float

Median value of owner-occupied homes in thousands of USD

Target variable: This field indicates the median value of owner-occupied homes in the area, in thousands of USD. This is the value that the model needs to predict.

Prerequisites

Prepare the data

  1. Create the required tables and populate them with data:

    -- Create a training data table.
    CREATE TABLE IF NOT EXISTS demo_xgboost_train
    (
     CRIM         FLOAT comment 'Per capita crime rate by town',
     ZN           FLOAT comment 'Proportion of residential land zoned for lots >25,000 sq. ft.',
     INDUS        FLOAT comment 'Proportion of non-retail business acres per town',
     CHAS         INT   comment 'Bounds river (1=yes, 0=no)',
     NOX          FLOAT comment 'Nitric oxides concentration (ppm)',
     RM           FLOAT comment 'Average number of rooms per dwelling',
     AGE          FLOAT comment 'Proportion of owner-occupied units built before 1940',
     DIS          FLOAT comment 'Weighted distances to five Boston employment centers',
     RAD          FLOAT comment 'Index of accessibility to radial highways',
     TAX          FLOAT comment 'Full-value property-tax rate per USD 10,000',
     PTRATIO      FLOAT comment 'Pupil-teacher ratio by town',
     LSTAT        FLOAT comment 'Percentage of lower status of the population',
     MEDV         FLOAT comment 'Median value of owner-occupied homes in thousands of USD'
    );
    
    -- Create a prediction data table for SQL and MaxFrame.
    CREATE TABLE IF NOT EXISTS demo_xgboost_predict
    (
     CRIM         FLOAT comment 'Per capita crime rate by town',
     ZN           FLOAT comment 'Proportion of residential land zoned for lots >25,000 sq. ft.',
     INDUS        FLOAT comment 'Proportion of non-retail business acres per town',
     CHAS         INT   comment 'Bounds river (1=yes, 0=no)',
     NOX          FLOAT comment 'Nitric oxides concentration (ppm)',
     RM           FLOAT comment 'Average number of rooms per dwelling',
     AGE          FLOAT comment 'Proportion of owner-occupied units built before 1940',
     DIS          FLOAT comment 'Weighted distances to five Boston employment centers',
     RAD          FLOAT comment 'Index of accessibility to radial highways',
     TAX          FLOAT comment 'Full-value property-tax rate per USD 10,000',
     PTRATIO      FLOAT comment 'Pupil-teacher ratio by town',
     LSTAT        FLOAT comment 'Percentage of lower status of the population'
    );
    
    
    
    -- Create a MaxFrame prediction data table.
    CREATE TABLE IF NOT EXISTS demo_xgboost_predict_result
    (
     CRIM         DOUBLE comment 'Per capita crime rate by town',
     ZN           DOUBLE comment 'Proportion of residential land zoned for lots >25,000 sq. ft.',
     INDUS        DOUBLE comment 'Proportion of non-retail business acres per town',
     CHAS         BIGINT comment 'Bounds river (1=yes, 0=no)',
     NOX          DOUBLE comment 'Nitric oxides concentration (ppm)',
     RM           DOUBLE comment 'Average number of rooms per dwelling',
     AGE          DOUBLE comment 'Proportion of owner-occupied units built before 1940',
     DIS          DOUBLE comment 'Weighted distances to five Boston employment centers',
     RAD          DOUBLE comment 'Index of accessibility to radial highways',
     TAX          DOUBLE comment 'Full-value property-tax rate per USD 10,000',
     PTRATIO      DOUBLE comment 'Pupil-teacher ratio by town',
     LSTAT        DOUBLE comment 'Percentage of lower status of the population',
     RESULT       DOUBLE comment 'Predicted value of owner-occupied homes'
    );
  2. Upload the data using the Tunnel API:

    xgboost_train_data.csv

    xgboost_predict.csv

    TUNNEL UPLOAD xgboost_train_data.csv demo_xgboost_train;
    TUNNEL UPLOAD xgboost_predict.csv demo_xgboost_predict;

Train and save the model

  1. Train an XGBoost regression model using MaxFrame.

    ## Session initialization
    from odps import ODPS
    import maxframe
    from maxframe import options
    
    o = ODPS('LT******************',
             'Cz************************',
             'project_name',
             'endpoint',
            )
    
    ## Parameter configuration
    from maxframe.config import options
    options.sql.enable_mcqa = False
    # Use the DPI engine "DPE". 
    options.dag.settings = {
        "engine_order" : ["DPE", "MCSQL", "SPE"]
    }
    options.dpe.settings = {
        "odps.catalog_api_endpoint": "catalog_api_endpoint",
    }
    options.sql.settings = {
        "odps.session.image": "common",
    }
    options.service_role_arn =  "acs:ram::13933481********:role/aliyunodpsdefaultrole"
    options.object_cache_url = "oss://oss-cn-beijing-internal.aliyuncs.com/models-*******/mfdemo"
    
    sess = maxframe.new_session(o)  
    print(sess.get_logview_address())
    
    ## Model training and generation
    import numpy as np
    from sklearn.datasets import make_classification
    import maxframe.dataframe as md, maxframe.tensor as mt
    from maxframe.learn.contrib.xgboost import XGBRegressor
    
    # Define the table name and fields.
    table_name = "demo_xgboost_train"
    # Feature columns and the target column.
    features = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'lstat']
    label = 'medv'
    # Read the training data.
    df = md.read_odps_table(
        table_name,
        unknown_as_string=True  # Prevents errors caused by unstructured data.
    )
    # Feature and label processing.
    X_train = df[features].fillna(-1)  # Fill missing values.
    y_train = df[label]
    # Initialize and train the XGBoost regression model.
    model = XGBRegressor(
        n_estimators=300,
        learning_rate=0.1,
        colsample_bytree=0.8,
        n_jobs=-1,
        tree_method="hist",
        enable_categorical=True,
        objective='reg:squarederror'
    )
    # Save the generated XGBoost regression model.
    model.fit(X_train, y_train).to_odps_model(
        model_name="demo_model_xgboost_regressor",
        model_version="v01",
    ).execute()
  2. View the trained model.

    DESC model demo_model_xgboost_regressor;
    
    +------------------------------------------------------------------------------------+
    |                  Model Information                                                 |
    +------------------------------------------------------------------------------------+
    | Owner:                    ALIYUN$***********************                           |
    | Project:                  pd_test_model                                            |
    | Schema:                   default                                                  |
    | Model Name:               demo_model_xgboost_regressor                                   |
    | Model Type:               BOOSTED_TREE_REGRESSOR                                   |
    | Source Type:              INTERNAL_TRAIN                                           |
    | Default Version:          v01                                                      |
    | CreateTime:               2025-09-12 13:15:16                                      |
    | LastModifiedTime:         2025-09-12 13:15:16                                      |
    | Model ID:                 389c90e355264079923b89**********                         |
    +------------------------------------------------------------------------------------+
    |                Version Information                                                 |
    +------------------------------------------------------------------------------------+
    | Owner:                    ALIYUN$***********************                           |
    | Project:                  pd_test_model                                            |
    | Schema:                   default                                                  |
    | Model Name:               demo_model_xgboost_regressor                                    |
    | Model Type:               BOOSTED_TREE_REGRESSOR                                   |
    | Source Type:              INTERNAL_TRAIN                                           |
    | Version Name:             v01                                                      |
    | Version ID:               99acd6cb93f845c8a4a9977*********                         |
    | Path:                                                                              |
    | CreateTime:               2025-09-12 13:15:16                                      |
    | LastModifiedTime:         2025-09-12 13:15:16                                      |
    +------------------------------------------------------------------------------------+
    | Input           | Type       | Comment                                             |
    +------------------------------------------------------------------------------------+
    | crim            | float      |                                                     |
    | zn              | float      |                                                     |
    | indus           | float      |                                                     |
    | chas            | int        |                                                     |
    | nox             | float      |                                                     |
    | rm              | float      |                                                     |
    | age             | float      |                                                     |
    | dis             | float      |                                                     |
    | rad             | float      |                                                     |
    | tax             | float      |                                                     |
    | ptratio         | float      |                                                     |
    | lstat           | float      |                                                     |
    +------------------------------------------------------------------------------------+

Predict data using MaxFrame

  1. After you train the model, run the following program in MaxFrame to make predictions with the trained model:

    # Define the prediction data table and the sink table.
    predict_table = "demo_xgboost_predict_mf"
    predict_result_table = "demo_xgboost_predict_result"
    
    # Read the prediction data.
    predict_data = md.read_odps_table(
        predict_table,
        unknown_as_string=True
    )
    X_predict = predict_data[features].fillna(-1)
    y_predict = model.predict(X_predict)
    
    # Build the prediction result DataFrame.
    df_predict = predict_data[features].copy()  
    df_predict['predicted_medv'] = y_predict.astype(np.float64)  # Add the predicted value.
    
    # Write the prediction results to an ODPS table.
    df_predict.to_odps_table(
        predict_result_table,
        overwrite=True,
        index=False,
        unknown_as_string=True
    ).execute()
  2. View the sink table:

    SELECT * FROM demo_xgboost_predict_result limit 10;
    
    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
    | crim       | zn         | indus      | chas       | nox        | rm         | age        | dis        | rad        | tax        | ptratio    | lstat      | result     | 
    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
    | 0.22438    | 0.0        | 9.69       | 0          | 0.585      | 6.027      | 79.7       | 2.4982     | 6.0        | 391.0      | 19.2       | 14.33      | 20.03961181640625 | 
    | 0.06263    | 0.0        | 11.93      | 0          | 0.573      | 6.593      | 69.1       | 2.4786     | 1.0        | 273.0      | 21.0       | 9.67       | 24.491479873657227 | 
    | 0.04527    | 0.0        | 11.93      | 0          | 0.573      | 6.12       | 76.7       | 2.2875     | 1.0        | 273.0      | 21.0       | 9.08       | 24.683202743530273 | 
    | 0.06076    | 0.0        | 11.93      | 0          | 0.573      | 6.976      | 91.0       | 2.1675     | 1.0        | 273.0      | 21.0       | 5.64       | 32.33962631225586 | 
    | 0.10959    | 0.0        | 11.93      | 0          | 0.573      | 6.794      | 89.3       | 2.3889     | 1.0        | 273.0      | 21.0       | 6.48       | 28.45917510986328 | 
    | 0.04741    | 0.0        | 11.93      | 0          | 0.573      | 6.03       | 80.8       | 2.505      | 1.0        | 273.0      | 21.0       | 7.88       | 24.43267822265625 | 
    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+

Predict data using a SQL AI function

Use the created model and the ML_PREDICT SQL function to predict housing prices.

SET odps.sql.type.system.odps2=true;
SET odps.task.major.version=sqlml_master;
SET odps.sql.machine.learning.enable=true;

SELECT ML_PREDICT(demo_model_xgboost_regressor,v01,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,lstat)
FROM demo_xgboost_predict;

-- The following result is returned:
+------------+
| _c0        | 
+------------+
| 20.039612  | 
| 24.49148   | 
| 24.683203  | 
| 32.339626  | 
| 28.459175  | 
| 24.432678  | 
+------------+

When you use the same model and prediction data, the inference results from the MaxFrame and SQL engines are consistent.