Linear regression example on housing data
In this tutorial, we will walk through linear regression on housing data.
Step 1: Import the libraries and metrics.
Step 2: Import the housing data.
Step 3: Pre-process the data by dropping the columns that are not used as features: price (the target), id, and date.
Step 4: Assign the price variable to Y.
Step 5: Split the data into training and test sets using the train_test_split method.
Step 6: In this step, I provide the data to the LinearRegression() algorithm, then fit it and predict the values. I got a score of 0.7044808067489784. This is the value returned by reg.score on the training data; the r2 score on the test set is 0.69. I am not satisfied with this score.
Step 7: Now I move to RandomForestRegressor with 500 trees and a maximum depth of 10. I feed the same data to this algorithm and get a training score of 0.9361980772317255 (r2 score on the test set: 0.87). Pretty good.
Step 8: I am somewhat satisfied with this score, but I want to improve the model further. Finally, I try GradientBoostingRegressor with 500 trees and a maximum depth of 10. I feed the data to this algorithm and achieve a training score of 0.9990719047561639. The r2 score on the test set is 0.89, so the gap between the two suggests the model is overfitting the training data.
Step 9: Plot the results.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

# Load the data and separate the features (X) from the target (Y)
data = pd.read_csv('kc_house_data.csv')
X = data.drop(['price', 'id', 'date'], axis=1)
Y = data['price']

# Hold out 30% of the rows as a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

# Linear regression
reg = LinearRegression()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
print("Linear Regression Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
print('Linear Regression r2 score: %.2f' % r2_score(Y_test, Y_pred))
# reg.score returns the r2 score, computed here on the training data
print('Accuracy score', reg.score(X_train, Y_train))

# Random forest: 500 trees with a maximum depth of 10
reg = ensemble.RandomForestRegressor(max_depth=10, random_state=0, n_estimators=500)
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
print("RandomForestRegressor Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
print('RandomForestRegressor r2 score: %.2f' % r2_score(Y_test, Y_pred))
print('RandomForestRegressor Accuracy score', reg.score(X_train, Y_train))

# Gradient boosting: 500 trees with a maximum depth of 10.
# The default loss is least squares ('ls' in older scikit-learn, 'squared_error' in newer
# releases), so it is left unset to work with both.
reg = ensemble.GradientBoostingRegressor(n_estimators=500, max_depth=10, min_samples_split=2, learning_rate=0.1)
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
print("GradientBoostingRegressor Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
print('GradientBoostingRegressor r2 score: %.2f' % r2_score(Y_test, Y_pred))
print('GradientBoostingRegressor Accuracy score', reg.score(X_train, Y_train))

# Plot actual against predicted prices for the first 20 test rows (last model fitted)
plt.scatter(Y_test[:20], Y_pred[:20], color='black')
plt.plot(Y_test[:20], Y_pred[:20], color='blue', linewidth=3)
plt.show()
Output:
Linear Regression Mean squared error: 42863880415.46
Linear Regression r2 score: 0.69
Accuracy score 0.7044808067489784
RandomForestRegressor Mean squared error: 17501164817.95
RandomForestRegressor r2 score: 0.87
RandomForestRegressor Accuracy score 0.9361980772317255
GradientBoostingRegressor Mean squared error: 15377672448.99
GradientBoostingRegressor r2 score: 0.89
GradientBoostingRegressor Accuracy score 0.9990719047561639
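The mean squared errors above are in squared price units, which makes them hard to read. As a rough sketch, the snippet below takes their square roots to get errors in the same units as the price, and spells out the formula behind the r2 score. The helper name r2 and the toy numbers are my own illustration, not part of the original code.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Mean squared errors copied from the output above (squared price units).
reported_mse = {
    "LinearRegression": 42863880415.46,
    "RandomForestRegressor": 17501164817.95,
    "GradientBoostingRegressor": 15377672448.99,
}

# The square root (RMSE) is in the same units as the price itself.
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}
for name, value in rmse.items():
    print(f"{name} RMSE: {value:,.0f}")

# Toy check of r2 on made-up numbers (not taken from the housing data).
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.0, 2.0, 3.0, 5.0]
print("toy r2:", r2(toy_true, toy_pred))
```

Read this way, the ranking of the three models on the test set is the same as with the r2 score: gradient boosting has the smallest error, linear regression the largest.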
We can't be sure in advance which algorithm will produce the best score for a given data set, so we have to use trial and error.
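One way to make that trial and error more systematic is to score each candidate with cross-validation, so every model is judged on data it was not trained on. This is only a sketch: it uses synthetic data from make_regression as a stand-in for the housing file, and smaller tree counts than in the tutorial so it runs quickly. Both choices are my assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing features and prices (an assumption,
# so the sketch runs without kc_house_data.csv).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(
        n_estimators=50, max_depth=10, random_state=0
    ),
    "GradientBoostingRegressor": GradientBoostingRegressor(
        n_estimators=50, max_depth=3, random_state=0
    ),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated r2: each fold is scored on rows held out from fitting.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean r2 = {scores.mean():.3f} (std {scores.std():.3f})")

best_name = max(cv_scores, key=cv_scores.get)
print("Best by cross-validated r2:", best_name)
```

The winner on synthetic data says nothing about the housing data; the point is only the loop, which can be pointed at X and Y from the tutorial instead.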
I tried different algorithms and fine-tuned their parameters to get the best results.
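The fine-tuning can also be automated. The sketch below uses scikit-learn's GridSearchCV to try a small grid of GradientBoostingRegressor settings and then checks the best one on a held-out test set. The grid values and the synthetic data are assumptions chosen to keep the run short; they are not the settings I used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing data (an assumption for a runnable sketch).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical grid: every combination of these values is tried.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training set only
    scoring="r2",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated r2: %.3f" % search.best_score_)
# The test set is touched only once, after the parameters are chosen.
print("Held-out test r2: %.3f" % search.score(X_test, y_test))
```

Keeping the test set out of the search is the design choice that matters here: the parameters are picked on cross-validation folds, so the final test score is not inflated by the tuning itself.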
Best of luck.