Linear Regression Implementation
By: Saurabh
A Linear Regression model makes use of two sets of variables:-
- Independent variables (termed as x)
- Dependent variable (termed as y)
Independent variables are the ones used to predict (calculate) the dependent variable.
If we take the example from the previous blog, the independent variable is 'Kilometers' and the dependent variable is 'Amount'.
So in a nutshell:- you are predicting the amount based on the kilometers, i.e. 'If I want to travel x km, how much will it cost me (y)?'
Linear Regression can be further broken down into two types:-
- Simple Linear Regression
- Multiple Linear Regression
Simple Linear Regression is the case we just discussed: there is one independent variable (km), and we use it to predict the dependent variable (Amount).
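To make this concrete, here is a minimal sketch of Simple Linear Regression. The trip data is made up for illustration (it is not the data from the previous blog) and is built to lie exactly on amount = 10 * km + 20. The helper name predict_amount is also just an assumption for this example.

```python
import numpy as np

# Hypothetical trip data (invented for this sketch): distance in km and amount paid.
km = np.array([2.0, 5.0, 8.0, 12.0, 20.0])
amount = np.array([40.0, 70.0, 100.0, 140.0, 220.0])

# Fit amount = slope * km + intercept by ordinary least squares (degree-1 polynomial).
slope, intercept = np.polyfit(km, amount, deg=1)

def predict_amount(distance_km):
    """Predict the amount for a trip of the given length using the fitted line."""
    return slope * distance_km + intercept

print(round(slope, 2), round(intercept, 2))
print(round(predict_amount(10.0), 2))
```

With real data the points will not sit exactly on a line, so the fitted slope and intercept are the values that minimise the squared prediction error.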
Multiple Linear Regression has more than one independent variable, and we use all of them together to predict the dependent variable.
Example:- Predicting a patient's blood pressure from several fields in the patient's data.
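In symbols, the two types can be written as follows. The notation is the usual textbook one rather than something taken from this blog: the β values are the coefficients the model learns, p is the number of independent variables, and ε is the error term.

```latex
\begin{aligned}
\text{Simple:}\quad   & y = \beta_0 + \beta_1 x + \varepsilon \\
\text{Multiple:}\quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
\end{aligned}
```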
The one we are going to implement here is Multiple Linear Regression.
Description of the dataset:-
This data set contains the information listed on www.cardekho.com and it can be downloaded from here. The data can be used for many purposes, such as price prediction, to exemplify the use of Linear Regression in Machine Learning.
The variables in this data set are:- Car_Name, Year, Selling_Price, Present_Price, Kms_Driven, Fuel_Type, Seller_Type, Transmission, Owner.
Goal:- To create a model that can predict the selling price of a car based on the given variables.
FYI, I'm using Jupyter Notebook and I'll be coding in Python. So let's get started...
In [1]:-
#importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:-
## Importing Data into a Dataframe
df=pd.read_csv('cardata.csv',header=0)
df.head()
Out[1]:-
In[3]:-
#get to know your data
print(df.dtypes)
Out[2]:-
In[4]:-
# Check missing / null values
df.isnull().sum()
Out[3]:-
In[5]:-
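# The Car_Name variable is of no use to the Linear Regression model
# Removing Car_Name; .copy() gives an independent DataFrame, so the column
# assignments in the next cells do not trigger SettingWithCopyWarning
final_dataset=df[['Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
                  'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner']].copy()
final_dataset.head()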
In[6]:-
# Finding the age of the cars
final_dataset['Current Year']=2020
final_dataset['Ageing']=final_dataset['Current Year']-final_dataset['Year']
final_dataset.head()
Out[4]:-
In[7]:-
# Removing Year and Current Year, as they are no longer needed once Ageing is computed
final_dataset.drop(['Year','Current Year'],axis=1,inplace=True)
final_dataset.head()
Out[5]:-
In[8]:-
# Storing the names of all the categorical variables together
colname=[]
for x in final_dataset.columns:
    if final_dataset[x].dtype=='object':
        colname.append(x)
colname
In[9]:-
# Converting all the categorical variables into numerical ones, as our model only understands numerical values
from sklearn import preprocessing
le=preprocessing.LabelEncoder()
for x in colname:
    final_dataset[x]=le.fit_transform(final_dataset[x])
final_dataset.head()
Out[6]:-
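One thing to keep in mind: LabelEncoder assigns integers in the sorted order of the labels, so a linear model will treat a column like Fuel_Type as if its categories had a numeric order. A common alternative is one-hot encoding. The sketch below is not what this notebook uses; it runs on a few made-up rows that only mimic the column names of the data set.

```python
import pandas as pd

# Hypothetical rows (invented for this sketch) that mimic the categorical columns.
sample = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
    "Kms_Driven": [27000, 43000, 6900, 5200],
})

# One 0/1 column per category; drop_first=True drops one category per variable
# so the dummy columns are not perfectly collinear with each other.
encoded = pd.get_dummies(
    sample,
    columns=["Fuel_Type", "Transmission"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```

Whether this improves the model depends on the data, so treat it as an option to try rather than a required change.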
In[10]:-
# Splitting into Dependent and Independent features
X=final_dataset.iloc[:,1:]
Y=final_dataset.iloc[:,0]
In[11]:-
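# Pearson correlation between the independent variables
corr_final_dataset=X.corr(method="pearson")
print(corr_final_dataset)
# vmin must be -1.0 (not 1.0) so the colour scale covers the full correlation range
sns.heatmap(corr_final_dataset,vmax=1.0,vmin=-1.0,annot=True)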
Out[7]:-
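Reading a heatmap by eye gets harder as the number of columns grows, so it can help to list the strongly correlated pairs directly. Here is a small sketch of that idea. The function name correlated_pairs and the 0.8 threshold are my own assumptions, and the demo data is synthetic, not the car data.

```python
import numpy as np
import pandas as pd

def correlated_pairs(features, threshold=0.8):
    """Return (col_a, col_b, r) for every pair of columns with |Pearson r| >= threshold."""
    corr = features.corr(method="pearson")
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic demo: b is almost a scaled copy of a, while c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(correlated_pairs(demo))
```

On the notebook's data the call would be correlated_pairs(X), and any pair it reports is a candidate for dropping one of the two columns.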
In[11]:-
# Dropping Owner because of high multicollinearity
final_dataset.drop(['Owner'],axis=1,inplace=True)
# X was created before this drop, so remove the column from X as well
X=X.drop(columns=['Owner'])
final_dataset.head()
Out[8]:-
In[12]:-
#Checking the VIF score of the variables.
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
vif_final_dataset = pd.DataFrame()
vif_final_dataset["features"] = X.columns
vif_final_dataset["VIF Factor"] = [vif(X.values, i) for i in range(X.shape[1])]
vif_final_dataset.round(2)
Out[9]:-
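To see what the VIF score actually measures, here is a minimal sketch that computes it by hand: each column is regressed on all the other columns, and VIF = 1 / (1 - R²) of that regression. The function name vif_table and the demo data are assumptions for illustration. Note that this version fits an intercept in every auxiliary regression, whereas the statsmodels call above uses the columns exactly as passed, so the two can give different numbers on the same data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(features):
    """Compute VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    rows = []
    for col in features.columns:
        others = features.drop(columns=[col])
        target = features[col]
        r2 = LinearRegression().fit(others, target).score(others, target)
        rows.append({"features": col, "VIF Factor": 1.0 / (1.0 - r2)})
    return pd.DataFrame(rows)

# Synthetic demo: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(42)
x1 = rng.normal(size=300)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=300),
    "x3": rng.normal(size=300),
})
print(vif_table(demo).round(2))
```

In the demo, x1 and x2 get large VIF values because each is almost fully explained by the other, while x3 stays close to 1.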
In[13]:-
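# Dropping Fuel_Type because of its high VIF score
final_dataset.drop(['Fuel_Type'],axis=1,inplace=True)
# Keep X in sync so that the model is trained without Fuel_Type
X=X.drop(columns=['Fuel_Type'])
final_dataset.head()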
In[14]:-
from sklearn.model_selection import train_test_split
#Split the data into test and train
X_train,X_test, Y_train, Y_test=train_test_split(X,Y, test_size=0.2,random_state=10)
In[15]:-
from sklearn.linear_model import LinearRegression
#create a model object
lm = LinearRegression()
#train the model object
lm.fit(X_train,Y_train)
#print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
Out[10]:-
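The raw coef_ array is hard to read because it does not say which number belongs to which variable. A small sketch of one way to label it is shown below. It uses synthetic data built from a known formula, so the numbers it prints are not results from the car data set; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic demo: the target is exactly 3 + 2 * Present_Price - 0.5 * Ageing.
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({
    "Present_Price": rng.uniform(1, 20, 50),
    "Ageing": rng.integers(1, 15, 50),
})
demo_y = 3 + 2 * demo_X["Present_Price"] - 0.5 * demo_X["Ageing"]

model = LinearRegression().fit(demo_X, demo_y)

# Pair each coefficient with the name of the column it multiplies.
coef_table = pd.Series(model.coef_, index=demo_X.columns, name="coefficient")
print(coef_table.round(3))
print(round(float(model.intercept_), 3))
```

In the notebook the same idea would be pd.Series(lm.coef_, index=X_train.columns). Each coefficient is the change in the predicted selling price for a one-unit increase in that variable, with the other variables held fixed.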
In[16]:-
#predicting using the model
Y_pred=lm.predict(X_test) #we only pass X_test in the predict function
print(Y_pred)
Out[11]:-
In[17]:-
#Framing up the Output
# copy X_test so the two extra columns are not added to the test features themselves
new_df=X_test.copy()
new_df["Actual Selling Price"]=Y_test
new_df["Predicted Selling Price"]=Y_pred
new_df
Out[12]:-
In[18]:-
#Checking how well the model fits the test data
from sklearn.metrics import r2_score,mean_squared_error
r2=r2_score(Y_test,Y_pred)
print(r2)
rmse=np.sqrt(mean_squared_error(Y_test,Y_pred))
print(rmse)
# adjusted R-squared: n is the number of rows r2 was computed on (the test set)
n=len(Y_test)
adjusted_r_squared = 1 - (1-r2)*(n-1)/(n-X.shape[1]-1)
print(adjusted_r_squared)
Out[13]:-
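For reference, the three scores above can also be computed by hand, which shows what each one means. This is a minimal sketch on a handful of made-up actual and predicted values; the function name regression_report is an assumption for this example.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return (R^2, RMSE, adjusted R^2) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    r2 = 1 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / n)
    # Adjusted R^2 penalises R^2 for the number of independent variables used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return r2, rmse, adj_r2

# Made-up values for illustration only.
r2_demo, rmse_demo, adj_demo = regression_report(
    [3, 5, 7, 9, 11], [2.5, 5.5, 7, 9.5, 10.5], n_features=2
)
print(round(r2_demo, 3), round(rmse_demo, 3), round(adj_demo, 3))
```

RMSE is in the same unit as the selling price, so it is the easiest of the three to interpret as a typical size of the prediction error.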
At last, we have created a model (machine) that can predict the price of a car from the data we pass in. The R-squared score is about 0.84, meaning the model explains roughly 84% of the variation in selling price, which is pretty good. Note that I haven't discussed how to handle outliers in this code. I will discuss outlier handling in another blog.
THANKS FOR READING...