Extreme Gradient Boosting (XGBoost) is an efficient gradient boosting tree algorithm and is widely used for classification and regression. The XGBoost gbtree model uses decision trees as base learners. This topic describes how to use a real estate dataset to train the XGBoost gbtree model and use the trained XGBoost gbtree model to predict the housing price.
Prepare data
For example, the Boston dataset is used as the training dataset. The following table describes the data structure of the dataset.
Column name | Description | Data type |
crim | The urban crime rate per capita. | FLOAT |
zn | The proportion of residential properties that exceed 25,000 square feet. | FLOAT |
indus | The proportion of non-retail commercial properties per town. | FLOAT |
chas | Specifies whether a river channel is nearby. | INT |
nox | The nitrogen oxide concentration (one in ten million). | FLOAT |
rm | The average number of rooms per residential building. | FLOAT |
age | The proportion of principal residences built before the year of 1940. | FLOAT |
dis | The average distance to five employment centers. | FLOAT |
rad | The distance to the expressway. | INT |
tax | The total property tax rate per USD 10,000. | INT |
ptratio | The proportion of teachers and students per town. | FLOAT |
b | The proportion of African-American urban residents. | FLOAT |
lstat | The proportion of residents that have low income. | FLOAT |
medv | The average value of principal residences. | FLOAT |
Train a model
Use the boston.train table to train a model and the boston.test table to predict data. During model training, RDS SQLFlow automatically classifies training datasets to generate a training dataset and a validation dataset.
In the RDS SQLFlow command-line interface of the RDS SQLFlow client, execute the following SQL statements to train a model:
SELECT * FROM boston.train TO TRAIN xgboost.gbtree WITH objective="reg:squarederror", train.num_boost_round = 30 COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat LABEL medv INTO sqlflow_models.my_xgb_regression_model;The
SELECTstatement extracts training data from the boston.train table. TheTO TRAINstatement specifies the model to be used, which is xgboost.gbtree. TheWITHstatement configures training parameters. TheCOLUMNstatement specifies feature columns. TheLABELstatement specifies the target column. Finally, theINTOstatement defines the storage location of the trained model. For more information, see Training syntax.
Use a model to predict data
After the xgboost.gbtree model is trained, use the model and execute the following SQL statements to predict housing prices:
SELECT * FROM boston.test TO PREDICT boston.predict.medv USING sqlflow_models.my_xgb_regression_model;The
SELECTstatement specifies the dataset required for prediction. TheTO PREDICTstatement specifies the table that stores the prediction results. TheUSINGstatement specifies the model to be used. For more information, see Prediction syntax.After the prediction is complete, execute the following SQL statement to view the prediction results:
SELECT * FROM boston.predict;