House Prices Prediction - XGBRegressor Model and Pipeline

Below is the first machine learning pipeline I implemented after completing the machine learning section of Kaggle Learn. It uses XGBRegressor, with SimpleImputer to fill in the missing values and one-hot encoding to convert the categorical data (like SaleCondition) into binary columns. Many data analysis and visualization techniques could be applied to this dataset to get better results. Because my knowledge of those methods is limited for now, I will only use the methods taught in the Kaggle Learn course.
Steps:
  • Extract the data from the .csv files and modify it to suit our model's requirements.
  • Impute and encode the training and test sets if needed.
  • Define the machine learning model.
  • Train (fit) the model with the training data.
  • Make predictions on the testing data.
  • Evaluate the model; if an improvement is possible, apply it and refit the model.
First, load the train and test data into pandas dataframes. We print the columns to take a first look at their names and at the potential features for our machine learning model.
In [1]:
import pandas as pd

# extract data into dataframe
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

# print columns
train.columns
Out[1]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
The shape of the training set gives its dimensions: 1460 rows (data points) and 81 columns. The test data has 1459 rows and only 80 columns, because the remaining column (SalePrice) is the one we have to predict using our model.
The head method prints the first five rows of our dataset.
In [2]:
# print the shape, which is (1460, 81); only the last expression in a cell is displayed automatically
print(train.shape)
# show the first five rows of the dataset with their columns
train.head()
The describe method gives us a clear overview of our data. It summarizes basic statistics such as the mean and standard deviation, which are useful in themselves and also let us check whether the data makes sense in real life. For example, the mean of the YrSold column is 2007.82, so a model trained on this data should not be trusted for sales in 2018: prices have moved since then, and its predictions would likely be too low or too high.
In [3]:
# description
train.describe()
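As a small illustration of reading the describe output, here is a sketch on a made-up dataframe (the values below are placeholders, not rows from train.csv). The count row only counts non-missing values, so comparing it with the number of rows is one quick way to spot columns that will need imputing later.

```python
import pandas as pd

# Toy stand-in for the training data; all values are invented for illustration.
toy = pd.DataFrame({
    "YrSold": [2006, 2007, 2008, 2009, 2010],
    "LotFrontage": [65.0, None, 80.0, None, 70.0],  # two missing entries
})

summary = toy.describe()

# The mean of a column can be read straight from the summary table.
mean_year = summary.loc["mean", "YrSold"]

# 'count' ignores missing values, so rows minus count = number of missing entries.
missing = len(toy) - summary.loc["count"]

print(mean_year)
print(missing)
```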
Next we split our dataframes further into X and y. As in most mathematical equations, X is the variable used to calculate the value of y, and y depends on X: a change in X changes y. The same applies here. The SalePrice (target) column goes into y, and all the other columns that help predict the target go into X.
X must not contain the SalePrice column, so drop is used to remove it. From the previous step, I realized that the Id column carries no useful information for predicting SalePrice, so I dropped it too.
Now we have the training input (X_train), the training target (y_train) and the testing input (X_test). The testing target, y_test, is what the model will output.
In [4]:
# target 
y_train = train.SalePrice

# drops the Id and SalePrice columns
X_train = train.drop(['Id','SalePrice'], axis= 1)

# drops Id column from test
X_test = test.drop(['Id'], axis= 1)
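The same split can be tried on a tiny made-up dataframe (the values are placeholders). This sketch also shows that drop returns a new dataframe and leaves the original untouched, which is why train still contains Id and SalePrice afterwards.

```python
import pandas as pd

# Toy stand-in for the training data; the numbers are invented.
toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8000, 9500, 11000],
    "SalePrice": [200000, 180000, 220000],
})

y = toy_train["SalePrice"]                       # target
X = toy_train.drop(["Id", "SalePrice"], axis=1)  # inputs only

print(list(X.columns))          # columns left in X
print(list(toy_train.columns))  # the original dataframe is unchanged
```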
Our dataset does not only include numerical values; there are categorical columns as well, like SaleCondition. Most machine learning models cannot take in text values directly and will produce errors unless we encode the categorical data as numbers, which can increase the number of columns. Here we use one-hot encoding to perform the conversion. It is available in pandas as get_dummies.
In [5]:
# code to encode object data(text data) using one-hot encoding(commonly used)
one_hot_encoded_training_data = pd.get_dummies(X_train)
one_hot_encoded_testing_data = pd.get_dummies(X_test)

# align with join='inner' keeps only the columns present in both datasets, in the same order
final_train, final_test = one_hot_encoded_training_data.align(one_hot_encoded_testing_data,join='inner',axis=1)
# to check the increased number of columns
final_train.shape
Out[5]:
(1460, 270)
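To see what align does, here is a sketch on two made-up dataframes whose categories do not fully match (the category names are borrowed from SaleCondition, the rows are invented). With join='inner', any dummy column that appears in only one of the two sets is dropped. The join='left' variant with fill_value=0 is an alternative I did not use above; it keeps every training column and fills the ones missing from the test set with zeros.

```python
import pandas as pd

# Invented rows; 'Abnorml' and 'Partial' appear only in train, 'Family' only in test.
toy_train = pd.DataFrame({"LotArea": [8000, 9500, 11000],
                          "SaleCondition": ["Normal", "Abnorml", "Partial"]})
toy_test = pd.DataFrame({"LotArea": [7000, 10000],
                         "SaleCondition": ["Normal", "Family"]})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# Inner join: only the columns shared by both encoded dataframes survive.
inner_train, inner_test = train_enc.align(test_enc, join="inner", axis=1)

# Left join (alternative): keep all training columns, fill the gaps in test with 0.
left_train, left_test = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(inner_train.columns))
print(list(left_test.columns))
```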
We have handled the categorical values, but some data points are still empty (for example, houses without garages have no GarageYrBlt). The simplest solution is to drop the affected columns, but this might throw away critical information that could have helped the model improve its accuracy. Below I have used SimpleImputer, which replaces each missing value with the mean of the available values in the same column.
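A minimal sketch of mean imputation on a single made-up column (the numbers are placeholders): the missing entry is replaced by the mean of the three known values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One invented numeric column with a missing entry in the second row.
column = np.array([[60.0], [np.nan], [80.0], [70.0]])

imputer = SimpleImputer()  # the default strategy is "mean"
filled = imputer.fit_transform(column)

print(filled.ravel())
```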
I compared the cross-validation scores of different models to check which one performs better. XGBRegressor turned out to be much more accurate than RandomForestRegressor.
The for loop tries different values of max_leaf_nodes in RandomForestRegressor to find the one that gives the minimum mean absolute error (MAE).
I manually tweaked the parameters of XGBRegressor to get the MAE as low as possible. I think tuning some other parameters could make it even more accurate.
Both the imputer and the model are defined inside a pipeline, which makes them easier and more flexible to apply to different datasets. It also reduces bugs and improves readability.
In [6]:
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

# RandomForestRegressor:
for n in range(10,200,10):
    # define pipeline
    pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor(max_leaf_nodes=n,random_state=1))
    # cross validation score
    scores = cross_val_score(pipeline, final_train, y_train, scoring= 'neg_mean_absolute_error')
    print(n,scores)

# XGBRegressor:
# define pipeline
pipeline = make_pipeline(SimpleImputer(), XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                            nthread = -1, random_state=1))
# cross validation score
scores = cross_val_score(pipeline,final_train, y_train, scoring= 'neg_mean_absolute_error')
print('Mean Absolute Error %2f' %(-1 * scores.mean()))


# GradientBoostingRegressor:(just another model)
# define pipeline
my_pipeline = make_pipeline(SimpleImputer(), GradientBoostingRegressor())
# cross validation score
score = cross_val_score(my_pipeline,final_train, y_train, scoring= 'neg_mean_absolute_error')
print('Mean Absolute Error %2f' %(-1 * score.mean()))

Mean Absolute Error 15178.358732

Mean Absolute Error 16301.944887
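The RandomForestRegressor loop above prints the raw scores, which are negative because scikit-learn reports neg_mean_absolute_error (higher is better). As a sketch of how the best max_leaf_nodes could be picked automatically, the example below flips the sign and keeps the value with the lowest MAE. It runs on synthetic data from make_regression, so the candidate values, n_estimators=20 and the resulting numbers are assumptions for illustration and say nothing about the house price data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data standing in for final_train / y_train.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=1)

mae_by_leaves = {}
for n in (5, 20, 50):
    pipe = make_pipeline(
        SimpleImputer(),
        RandomForestRegressor(n_estimators=20, max_leaf_nodes=n, random_state=1),
    )
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5,
                             scoring="neg_mean_absolute_error")
    # Scores are negated MAE values, so flip the sign to get the MAE itself.
    mae_by_leaves[n] = -scores.mean()

# The candidate with the smallest mean MAE across the folds.
best_n = min(mae_by_leaves, key=mae_by_leaves.get)
print(mae_by_leaves, best_n)
```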
Finally, the step we have all been waiting for: fitting and predicting.
Training, or fitting, is the heart of any machine learning exercise. Training the model means allowing its algorithms to find the patterns in the training data. It is the job of the model to figure out y_train (target) from X_train (input).
To predict means the model uses the insights and patterns it gained in the training step to produce an output that is accurate (at least according to the model; I don't want to hurt its feelings). The output, y_test, might not be satisfactory at all. It is the job of the machine learning engineer to evaluate and improve it.
In [7]:
# fit and make predictions
pipeline.fit(final_train,y_train)
predictions= pipeline.predict(final_test)

print(predictions)
[128155.4  161458.44 186407.11 ... 172886.42 113008.18 208297.67]
The evaluation step cannot be run on the test data because it does not come with y_test. We did evaluate the models with cross-validation on the training data, which helped us make better predictions, as seen previously.
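One common workaround, which I did not use above, is to hold back part of the training data and treat it as a stand-in for the missing y_test. The sketch below does this on synthetic data with train_test_split; the 25% split, the GradientBoostingRegressor model and the variable names are all assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for final_train / y_train.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Keep 25% of the labelled data aside as a validation set.
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo,
                                              test_size=0.25, random_state=1)

holdout_pipeline = make_pipeline(SimpleImputer(),
                                 GradientBoostingRegressor(random_state=1))
holdout_pipeline.fit(X_fit, y_fit)

# MAE on rows the model never saw during fitting.
val_mae = mean_absolute_error(y_val, holdout_pipeline.predict(X_val))
print(val_mae)
```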
In [8]:
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predictions})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)
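Before uploading, it can be worth checking that the file has the expected header and one row per test Id. This is a sketch on a made-up three-row submission (the Ids and prices are placeholders), written to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Placeholder Ids and predictions, not taken from the real test set.
toy_submission = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [128155.4, 161458.44, 186407.11],
})

buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
lines = buffer.getvalue().splitlines()

header = lines[0]
n_rows = len(lines) - 1  # data rows, excluding the header
has_missing = toy_submission["SalePrice"].isnull().any()

print(header, n_rows, has_missing)
```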
Below is a partial dependence plot, which can provide useful information for further improving our model.
In this version of the kernel, partial dependence plots can only be produced for GradientBoostingRegressor.



In [9]:
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
def get_some_data():
    cols_to_use = ['YearBuilt', 'TotRmsAbvGrd', 'LotArea']
    data = pd.read_csv('../input/train.csv')
    y = data.SalePrice
    X = data[cols_to_use]
    my_imputer = SimpleImputer()
    imputed_X = my_imputer.fit_transform(X)
    return imputed_X, y


# get_some_data is defined above.
X, y = get_some_data()
# scikit-learn originally implemented partial dependence plots only for Gradient Boosting models
# this was due to an implementation detail, and a future release will support all model types.
my_model = GradientBoostingRegressor()
# fit the model as usual
my_model.fit(X, y)
# Here we make the plot
my_plots = plot_partial_dependence(my_model,       
                                   features=[0,2], # column numbers of plots we want to show
                                   X=X,            # raw predictors data.
                                   feature_names=['YearBuilt', 'TotRmsAbvGrd', 'LotArea'], # labels on graphs
                                   grid_resolution=10) # number of values to plot on x axis
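As far as I know, more recent scikit-learn releases moved this functionality to sklearn.inspection, where it also works with other model types, and the sklearn.ensemble.partial_dependence import used above no longer exists there. The sketch below computes the partial dependence values (without plotting) for a RandomForestRegressor on synthetic data. The data, the model settings and the variable names are assumptions for illustration; for the plot itself, PartialDependenceDisplay.from_estimator in the same module would be the place to look.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

# Synthetic data: the target rises with the first feature and ignores the other two.
rng = np.random.RandomState(1)
X_demo = rng.uniform(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_demo, y_demo)

# Average prediction as feature 0 is varied over a grid of 10 values.
result = partial_dependence(forest, X_demo, features=[0],
                            grid_resolution=10, kind="average")
averaged = result["average"]  # shape: (number of outputs, grid points)

print(averaged.shape)
```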



