【DSW Gallery】 XGBoost: How to use XGBoost to solve regression problems
Use the XGBoost algorithm to predict house prices
This article uses a data set containing various attributes of houses, with the house price as the label, and trains an XGBoost model to predict house prices.
Among the DSW sample Notebooks there is another Notebook that performs regression analysis on housing prices with the same data set; the difference is that it uses an ordinary linear regression algorithm.
Interested readers can take a look at that linear regression Notebook. Link
The final result is that XGBoost achieves higher accuracy (92% vs. 86%).
Preparation
The software packages that this article relies on are pre-installed in the DSW image. If any of them are missing from your environment, you can install them with pip install xxx.
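For reference, a typical one-line install might look like the following (a sketch only: package names are the usual PyPI names, versions are not pinned, and your environment may already provide most of them):

pip install numpy pandas matplotlib seaborn scipy scikit-learn xgboost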
Let's first import the required Python libraries.
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn
from scipy import stats
from scipy.stats import norm, skew
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
Data loading
Use Pandas to read in the data and view the raw records. The train.csv file was downloaded from the Internet and prepared in advance. This article does not evaluate on the test samples, but you can download the corresponding test.csv file from the Internet as well.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head(5)
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0   1          60       RL           65     8450   Pave   NaN      Reg         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2   2008       WD        Normal    208500
1   2          20       RL           80     9600   Pave   NaN      Reg         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      5   2007       WD        Normal    181500
2   3          60       RL           68    11250   Pave   NaN      IR1         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      9   2008       WD        Normal    223500
3   4          70       RL           60     9550   Pave   NaN      IR1         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2   2006       WD       Abnorml    140000
4   5          60       RL           84    14260   Pave   NaN      IR1         Lvl    AllPub  ...         0    NaN   NaN         NaN       0     12   2008       WD        Normal    250000

5 rows × 81 columns
test.head(5)
     Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities  ...  ScreenPorch PoolArea PoolQC  Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0  1461          20       RH           80    11622   Pave   NaN      Reg         Lvl    AllPub  ...          120        0    NaN  MnPrv         NaN       0      6   2010       WD        Normal
1  1462          20       RL           81    14267   Pave   NaN      IR1         Lvl    AllPub  ...            0        0    NaN    NaN        Gar2   12500      6   2010       WD        Normal
2  1463          60       RL           74    13830   Pave   NaN      IR1         Lvl    AllPub  ...            0        0    NaN  MnPrv         NaN       0      3   2010       WD        Normal
3  1464          60       RL           78     9978   Pave   NaN      IR1         Lvl    AllPub  ...            0        0    NaN    NaN         NaN       0      6   2010       WD        Normal
4  1465         120       RL           43     5005   Pave   NaN      IR1         HLS    AllPub  ...          144        0    NaN    NaN         NaN       0      1   2010       WD        Normal

5 rows × 80 columns
Data cleaning and preprocessing
Generally, the raw data we get has various problems that hinder analysis and training, so it has to go through a cleaning and preprocessing stage: deduplication, missing-value handling, outlier handling, and so on.
We saw above that the original data has 81 columns and 1,460 records in total. The Id column is meaningless for training, so we remove it first.
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']
#Now drop the 'Id' colum since it's unnecessary for the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
#check again the data size after dropping the 'Id' variable
print(" The train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))
The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)
The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)
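Besides dropping the Id column, the deduplication mentioned above can be verified with a quick check. The snippet below is a small sketch (not part of the original Notebook) that counts duplicate rows and columns with missing values in the training set:

# Quick sanity checks before further processing (illustrative only)
print("Duplicate rows in train: {}".format(train.duplicated().sum()))
print("Columns with missing values in train: {}".format(train.isnull().any().sum()))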
Feature engineering
The feature engineering is explained in detail in the Notebook linked at the beginning of this article, so it is not described at length here.
# drop these two abnormal points
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
#Use np.log1p to smooth the label, making it closer to the standard normal distribution
train["SalePrice"] = np. log1p(train["SalePrice"])
ntrain = train. shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
# Get the feature with missing values, which will be dealt with separately below
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
corrmat = train.corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)
Missing value handling
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
# Utilities is almost constant (nearly all records are AllPub), so drop it
all_data = all_data.drop(['Utilities'], axis=1)
all_data["Functional"] = all_data["Functional"].fillna("Typ")
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
# type conversion
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
# encoding
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold')
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
print('Shape all_data: {}'.format(all_data.shape))
# Create a new feature based on relevant industry knowledge
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
# How many bathrooms are there in total
all_data['TotalBath'] = all_data[['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath']].sum(axis=1)
# The area of the porch
all_data['TotalPorchSF'] = all_data[['OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','WoodDeckSF']].sum(axis=1)
# Calculate the skewness of the feature
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print(" Skewness of numeric type feature: ")
skewness = pd.DataFrame({'Skew': skewed_feats})
# Smooth features whose skewness is greater than 0.75
skewness = skewness[abs(skewness) > 0.75]
print("A total of {} features need to be processed".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)
# One-hot encode the remaining categorical features
all_data = pd.get_dummies(all_data)
# Generate the final dataset
train = all_data[:ntrain]
test = all_data[ntrain:]
Shape all_data: (2917, 78)
Skewness of numeric type feature:
A total of 61 features need to be processed
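With the final feature matrices ready, an XGBoost regressor can be trained and evaluated. The snippet below is a minimal sketch, assuming the xgboost package is available; the hyperparameters are illustrative placeholders rather than the tuned values behind the accuracy figure quoted at the beginning, and the reported score is the cross-validated RMSE of the log-transformed SalePrice.

import xgboost as xgb
from sklearn.model_selection import cross_val_score, KFold

# Illustrative hyperparameters, not the tuned settings from the linked Notebook
model_xgb = xgb.XGBRegressor(
    n_estimators=2000,      # number of boosting rounds
    learning_rate=0.05,     # shrinkage applied to each round
    max_depth=3,            # depth of each tree
    subsample=0.7,          # row sampling per tree
    colsample_bytree=0.7,   # column sampling per tree
    random_state=42)

# 5-fold cross-validated RMSE on the log-transformed SalePrice
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse = np.sqrt(-cross_val_score(model_xgb, train.values, y_train,
                                scoring="neg_mean_squared_error", cv=kf))
print("CV RMSE: {:.4f} (+/- {:.4f})".format(rmse.mean(), rmse.std()))

# Fit on the full training set and predict on the test set;
# np.expm1 inverts the earlier np.log1p transform of the label.
model_xgb.fit(train.values, y_train)
test_pred = np.expm1(model_xgb.predict(test.values))

Because the test set here carries no SalePrice labels, cross-validation on the training set is used to estimate model quality before predicting on test.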