Linear Regression Implementation - By Saurabh





We discussed the concept of Linear Regression in the previous blog with a layman's example. Now let's discuss the Linear Regression model and its mechanism.

The Linear Regression model makes use of two sets of variables:-
  • Independent variables (termed as x)
  • Dependent variable (termed as y)

Independent variables are the ones that are used to predict (calculate) the dependent variable.

If we take into consideration the example from the previous blog, then the independent variable will be 'Kilometers' and the dependent variable will be 'Amount'.
So in a nutshell:- you are predicting the amount based on the kilometers, i.e. 'if I want to travel x km, then how much will it cost me (y)?'
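To make this concrete, here is a minimal sketch of Simple Linear Regression in Python, using made-up kilometer/amount pairs (the numbers are purely illustrative and not from the dataset used later in this blog):

# A minimal Simple Linear Regression sketch with made-up data
# x = kilometers travelled (independent), y = amount paid (dependent)
import numpy as np
from sklearn.linear_model import LinearRegression

km = np.array([[5], [10], [15], [20], [25]])      # independent variable x
amount = np.array([60, 110, 160, 210, 260])       # dependent variable y

model = LinearRegression()
model.fit(km, amount)

# The model learns a straight line: amount = intercept + coefficient * km
print(model.intercept_, model.coef_)              # roughly 10 and [10] for this made-up data
print(model.predict([[30]]))                      # estimated cost of a 30 km trip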
 
Linear Regression can be further broken down into two types:-
  •  Simple Linear Regression
  •  Multiple Linear Regression 

Simple Linear Regression is the one we just discussed above where we have one independent variable (km) and based on that we have to predict the dependent variable (Amount).

Multiple Linear Regression contains more than one independent variable and based on those variables we have to predict the dependent variable.

Example:- an algorithm that predicts a patient's blood pressure based on several variables from the patient's data.

The one we are going to implement here is Multiple Linear Regression.
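Before diving into the data, it helps to see the equation the model is fitting: y = b0 + b1*x1 + b2*x2 + ... + bn*xn, where b0 is the intercept and b1...bn are the coefficients learned for each independent variable. A tiny sketch with made-up numbers, just to show that a prediction is nothing more than this weighted sum:

# Purely illustrative: a prediction is intercept + sum(coefficient * feature)
import numpy as np

intercept = 2.0                             # b0 (made-up)
coefficients = np.array([0.5, -0.2, 3.0])   # b1, b2, b3 (made-up)
features = np.array([10.0, 20.0, 1.0])      # x1, x2, x3 for one hypothetical record

prediction = intercept + np.dot(coefficients, features)
print(prediction)                           # 2.0 + 0.5*10 - 0.2*20 + 3.0*1 = 6.0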

Description of the dataset:-  
This data set contains the information of cars listed on www.cardekho.com and it can be downloaded from here. The data can be used for many purposes, such as price prediction, to exemplify the use of Linear Regression in Machine Learning.
Here the variables in this data set are:- Car_Name, Year, Selling_Price, Present_Price, Kms_Driven, Fuel_Type, Seller_Type, Transmission, Owner.

Goal:- To create a model which can predict the price of a car based on the given variables.


FYI, I'm using a Jupyter Notebook and I'll be coding in Python. So let's get started...

In[1]:-

#importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In[2]:-
## Importing data into a DataFrame
df=pd.read_csv('cardata.csv',header=0)
df.head()

Out[2]:-

In[3]:-
#get to know your data

print(df.dtypes)


Out[3]:-


In[4]:-

# Check missing / null values
df.isnull().sum()

Out[4]:-



In[5]:-
#The Car_Name variable is of no use to the Linear Regression model
## Removing Car Name (and taking a copy, so that adding columns later does not raise a SettingWithCopyWarning)
final_dataset=df[['Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
       'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner']].copy()
final_dataset.head()

In[6]:-

# Finding the age of the cars
final_dataset['Current Year']=2020
final_dataset['Ageing']=final_dataset['Current Year']-final_dataset['Year']
final_dataset.head()

Out[6]:-


In[7]:-
# Removing Year and Current Year as they are of no use now
final_dataset.drop(['Year','Current Year'],axis=1,inplace=True)
final_dataset.head()

Out[7]:-

In[8]:-
#Storing all the categorical variables together
colname=[]
for x in final_dataset.columns:
    if final_dataset[x].dtype=='object':
        colname.append(x)

colname
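
As a side note, the same list of object-type (categorical) columns can be collected in one line with pandas; this is just an equivalent sketch of the loop above:

# Equivalent one-liner using pandas' select_dtypes
colname = final_dataset.select_dtypes(include='object').columns.tolist()
colname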

In[9]:-
# Converting all the categorical variables into numerical values, as our model only understands numbers
from sklearn import preprocessing

le=preprocessing.LabelEncoder()

for x in colname:
    final_dataset[x]=le.fit_transform(final_dataset[x])
    
final_dataset.head()    

Out[9]:-
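
To see what LabelEncoder is actually doing, here is a small sketch on a hypothetical list of fuel-type labels (the exact categories and the integers assigned depend on your own data, since the encoder sorts the distinct labels alphabetically before numbering them):

# Hypothetical illustration of LabelEncoder: each distinct label gets an integer code
from sklearn import preprocessing

le_demo = preprocessing.LabelEncoder()
print(le_demo.fit_transform(['Petrol', 'Diesel', 'CNG', 'Petrol']))   # [2 1 0 2]
print(le_demo.classes_)                                               # ['CNG' 'Diesel' 'Petrol']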



In[10]:-
# Splitting into Dependent and Independent features
X=final_dataset.iloc[:,1:]
Y=final_dataset.iloc[:,0]

In[11]:-

corr_final_dataset=X.corr(method="pearson")
print(corr_final_dataset)

sns.heatmap(corr_final_dataset,vmax=1.0,vmin=-1.0,annot=True)   #correlation values range from -1 to +1

Out[11]:-



In[12]:-
#dropping Owner from the features (X) because of high multicollinearity
X=X.drop(['Owner'],axis=1)
X.head()

Out[12]:-


In[13]:-
#Checking the VIF score of the variables

from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
vif_final_dataset = pd.DataFrame()
vif_final_dataset["features"] = X.columns
vif_final_dataset["VIF Factor"] = [vif(X.values, i) for i in range(X.shape[1])]
vif_final_dataset.round(2)

Out[13]:-
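
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features; values well above 5-10 are usually read as a sign of strong multicollinearity. A rough sketch of that idea for one column of X (for illustration only; it may not exactly reproduce the statsmodels numbers above, since variance_inflation_factor does not add an intercept when no constant column is present):

# Rough illustration of the VIF formula for a single feature of X
from sklearn.linear_model import LinearRegression

feature = 'Kms_Driven'                      # any one column of X
others = X.drop(columns=[feature])

r2_aux = LinearRegression().fit(others, X[feature]).score(others, X[feature])
print(1 / (1 - r2_aux))                     # VIF of 'Kms_Driven'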


In[14]:-

#Dropping Fuel_Type from the features (X) because of its high VIF score
X=X.drop(['Fuel_Type'],axis=1)
X.head()

In[15]:-

from sklearn.model_selection import train_test_split
#Split the data into test and train
X_train,X_test, Y_train, Y_test=train_test_split(X,Y, test_size=0.2,random_state=10)
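
A quick sanity check that the split behaves as expected (roughly 80% of the rows for training and 20% for testing) is to print the shapes:

# Sanity check on the sizes of the train and test sets
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)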

In[16]:-

from sklearn.linear_model import LinearRegression
#create a model object
lm = LinearRegression()
#train the model object
lm.fit(X_train,Y_train)

#print the intercept and coefficients
print (lm.intercept_)
print (lm.coef_)

Out[16]:-
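
Raw coefficients are hard to interpret on their own; here is a small optional sketch that pairs each coefficient with the feature it belongs to:

# Pairing each coefficient with its feature name for readability
print(pd.Series(lm.coef_, index=X_train.columns))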



In[17]:-
#predicting using the model
Y_pred=lm.predict(X_test)                                        #we only pass X_test in the predict function
print(Y_pred) 

Out[17]:-


In[18]:-
#Framing up the output
new_df=X_test.copy()                                             #take a copy so that X_test itself is not modified

new_df["Actual Selling Price"]=Y_test
new_df["Predicted Selling Price"]=Y_pred
new_df

Out[18]:-


In[19]:-
#Checking the performance of the model created (R2 score, RMSE and adjusted R2)
from sklearn.metrics import r2_score,mean_squared_error
import numpy as np

r2=r2_score(Y_test,Y_pred)
print(r2)

rmse=np.sqrt(mean_squared_error(Y_test,Y_pred))
print(rmse)

adjusted_r_squared = 1 - (1-r2)*(len(Y_test)-1)/(len(Y_test)-X_test.shape[1]-1)   #use the test-set size, since r2 was computed on the test set
print(adjusted_r_squared)

Out[19]:-





At last, we have created a model (machine) that can predict the price of a car from the data we pass to it. We get an R2 score of about 0.84 (84%), which is pretty good. Note that I haven't discussed how to handle outliers in this code; I will discuss outlier imputation in another blog.
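
To actually use the model on fresh data, you just pass a row with the same feature columns that went into training. A small sketch with made-up values (the column names assume the features we kept above, and the encoded values for Seller_Type and Transmission are hypothetical; they must follow the same encoding that was used during training):

# Predicting the selling price of a hypothetical car with the trained model
new_car = pd.DataFrame([{
    'Present_Price': 6.5,       # made-up value
    'Kms_Driven': 30000,        # made-up value
    'Seller_Type': 0,           # label-encoded value (hypothetical)
    'Transmission': 1,          # label-encoded value (hypothetical)
    'Ageing': 5                 # made-up car age in years
}])

print(lm.predict(new_car))      # predicted Selling_Price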

                
                                                   THANKS FOR READING...













