Linear regression example on housing data


In this tutorial, we walk through linear regression on housing data and then compare it with two ensemble regressors.

Step 1: Import the libraries and metrics.

Step 2: Import the housing data.

Step 3: Pre-process the data by dropping the columns that should not be used as features: price (the target), id, and date.

Step 4: Assign the price variable to Y.

Step 5: Split the data into training and test sets using the train_test_split method.

Step 6: In this step, I pass the data to the LinearRegression() algorithm, then fit and predict the values. I got a score of 0.7044808067489784. This is the R² on the training data as returned by reg.score; the R² on the test set is 0.69. I am not satisfied with this score.

Step 7: Next, I move to RandomForestRegressor with 500 trees and a maximum depth of 10. I feed the data to this algorithm and get a training score of 0.9361980772317255 (0.87 on the test set). Pretty good.

Step 8: I am somewhat satisfied with that score, but I tried to improve the model further. Finally, I tried GradientBoostingRegressor with 500 trees and a maximum depth of 10. I feed the data to this algorithm and achieve a training score of 0.9990719047561639 (0.89 on the test set).

Step 9: Plot the results.
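
The scores quoted in the steps above come from the score method, which for scikit-learn regressors returns the coefficient of determination R², not a classification accuracy. As a minimal sketch of what that number measures, the snippet below computes R² by hand. The function name r2_by_hand and the sample prices are made up for illustration and are not part of the original code.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices, chosen only to illustrate the formula
actual = [200000.0, 350000.0, 500000.0, 650000.0]
predicted = [220000.0, 330000.0, 520000.0, 640000.0]

r2 = r2_by_hand(actual, predicted)

# A model that always predicts the mean price scores exactly zero
mean_price = sum(actual) / len(actual)
r2_mean_only = r2_by_hand(actual, [mean_price] * len(actual))

print("R^2 of the predictions: %.4f" % r2)
print("R^2 of predicting the mean: %.4f" % r2_mean_only)
```

A score of 1 means the predictions match exactly, 0 means the model does no better than predicting the mean price, and negative values are possible for models that do worse than that.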

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv('kc_house_data.csv')

X= data.drop(['price','id','date'], axis=1)
Y= data['price']

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

from sklearn.linear_model import LinearRegression

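# Fit on the training set, then predict prices for the test set
reg = LinearRegression()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)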

print("Linear Regression Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
print('Linear Regression r2 score: %.2f' % r2_score(Y_test, Y_pred))
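# reg.score returns R^2 on the data passed in, here the training set (not a classification accuracy)
print('Accuracy score',reg.score(X_train, Y_train))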

from sklearn import ensemble

# Random forest: 500 trees, each limited to depth 10
reg = ensemble.RandomForestRegressor(max_depth=10, random_state=0, n_estimators=500)
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)

print("RandomForestRegressor Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
print('RandomForestRegressor r2 score: %.2f' % r2_score(Y_test, Y_pred))
print('RandomForestRegressor Accuracy score',reg.score(X_train, Y_train))

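# Gradient boosting: 500 trees of depth 10. The loss argument is left at its default,
# squared error, which older scikit-learn releases called 'ls'.
reg = ensemble.GradientBoostingRegressor(n_estimators=500, max_depth=10,
                                         min_samples_split=2,
                                         learning_rate=0.1)
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)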

print("GradientBoostingRegressor Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
print('GradientBoostingRegressor r2 score: %.2f' % r2_score(Y_test, Y_pred))
print('GradientBoostingRegressor Accuracy score',reg.score(X_train, Y_train))

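# Plot the first 20 actual test prices against the most recent predictions
plt.scatter(Y_test[:20], Y_pred[:20], color='black')
plt.plot(Y_test[:20], Y_pred[:20], color='blue', linewidth=3)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.show()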


Linear Regression Mean squared error: 42863880415.46
Linear Regression r2 score: 0.69
Accuracy score 0.7044808067489784
RandomForestRegressor Mean squared error: 17501164817.95
RandomForestRegressor r2 score: 0.87
RandomForestRegressor Accuracy score 0.9361980772317255
GradientBoostingRegressor Mean squared error: 15377672448.99
GradientBoostingRegressor r2 score: 0.89
GradientBoostingRegressor Accuracy score 0.9990719047561639
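
Two things are easier to read from the output above after a little arithmetic: the square root of the mean squared error (RMSE), which is in the same units as the price column, and the gap between the training score and the test R². The sketch below only reuses the numbers printed above. The dictionary layout and variable names are my own choices for illustration.

```python
import math

# Numbers copied from the output above: (test MSE, test R^2, training R^2)
results = {
    "LinearRegression": (42863880415.46, 0.69, 0.7044808067489784),
    "RandomForestRegressor": (17501164817.95, 0.87, 0.9361980772317255),
    "GradientBoostingRegressor": (15377672448.99, 0.89, 0.9990719047561639),
}

summary = {}
for name, (mse, test_r2, train_r2) in results.items():
    rmse = math.sqrt(mse)      # typical error, in the units of the price column
    gap = train_r2 - test_r2   # a large gap hints at overfitting
    summary[name] = (rmse, gap)
    print(f"{name}: RMSE = {rmse:,.0f}, train-test R^2 gap = {gap:.2f}")
```

The gradient boosting model has the lowest test error but also the largest gap between training and test scores, so the test R² of 0.89 is the fairer number to compare against the other two models.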

We cannot be sure in advance which algorithm will produce the best score for a given data set, so we have to use trial and error.
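
One way to organize that trial and error is to loop over candidate models and score each with cross-validation. The snippet below is a rough sketch, not the code used for the results above: it assumes synthetic data from make_regression as a stand-in for the housing data, and it uses smaller models than in the steps so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing features and prices (an assumption)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(
        n_estimators=50, max_depth=5, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(
        n_estimators=50, max_depth=3, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated R^2: every fold is scored on rows the model did not train on
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")

best_name = max(cv_scores, key=cv_scores.get)
print("Best on this synthetic data:", best_name)
```

The ranking on synthetic data says nothing about the housing data. The point is only the pattern: same data, same metric, several models, scored on rows they were not trained on.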

I tried different algorithms and fine-tuned their parameters to get the best results.
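
Parameter tuning can also be automated. As a hedged sketch, the snippet below uses scikit-learn's GridSearchCV to try a few values of n_estimators and max_depth for a random forest. The grid values and the synthetic data are assumptions made for illustration. They are not the settings behind the scores reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing data (an assumption)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Small illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [25, 50], "max_depth": [3, 6]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2")
search.fit(X_train, y_train)  # tunes on the training set only

test_r2 = search.score(X_test, y_test)  # R^2 of the refitted best model

print("Best parameters:", search.best_params_)
print("Best cross-validated R^2: %.3f" % search.best_score_)
print("Held-out test R^2: %.3f" % test_r2)
```

Keeping the test set out of the search means the final test R² is still an honest estimate for the chosen parameters.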

Best of luck.

