What is Linear Regression?
Linear regression is a supervised learning algorithm used to predict a continuous numerical output from a set of inputs. It assumes that the response and the independent variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x).
The equation of the regression line for a single independent variable is:
$$Y_i = b_0 + b_1 X_i + e_i$$
$b_0$ = intercept of the equation
$b_1$ = slope of the regression line
$e_i$ = residual error (vertical distance between the regression line and the data point)
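The equation of the regression line for more than one independent variable is:
$$Y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + b_3 X_{i3} + \dots + b_n X_{in} + e_i$$
Here $X_{ij}$ is the value of the j-th independent variable for observation i.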
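As a rough illustration of the two equations above, the sketch below fits a single-variable and a multi-variable model with lm(). It uses R's built-in mtcars data set purely as a stand-in (it is not the house price data used later), and the variable names are chosen only for this example.

```r
# Stand-in data: R's built-in mtcars (an assumption, not the article's data set)
data(mtcars)

# Single independent variable: mpg = b0 + b1 * wt + e
simple_model <- lm(mpg ~ wt, data = mtcars)
b0 <- coef(simple_model)[["(Intercept)"]]  # intercept of the equation
b1 <- coef(simple_model)[["wt"]]           # slope of the regression line
e  <- residuals(simple_model)              # one residual error per observation

# More than one independent variable: mpg = b0 + b1*wt + b2*hp + b3*qsec + e
multi_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

print(coef(simple_model))
print(coef(multi_model))
```

Each coefficient printed for the second model corresponds to one of the b terms in the multi-variable equation.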
Assumptions on data set:
Before building the model, we make some basic assumptions about the data set:
1. Linear Relationship:
The relationship between the response and the feature variables should be linear. Otherwise, the model will underfit or overfit the data. The model underfits when its bias is high, so that the regression line does not follow the data points. The model overfits when its variance is high. Usually, this happens when we include unnecessary independent variables. So, use only significant variables.
Bias: how much, on average, the predicted values differ from the actual values.
Variance: how much the predictions of the model at the same point would differ if different samples were taken from the same population.
High variance arises because we train the model on a sample rather than on the full historical data. To overcome this problem, we use the K-fold cross-validation method, which trains and tests the model on different subsets of the data (explained in the example).
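A minimal sketch of K-fold cross-validation is given here in base R, so it runs without extra packages. The choice of k = 5, the mtcars stand-in data, and the formula mpg ~ wt + hp are all assumptions made for illustration.

```r
# Stand-in data and settings (assumptions for this sketch)
data(mtcars)
set.seed(42)
k <- 5

# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

fold_rmse <- numeric(k)
for (i in 1:k) {
  train_rows <- mtcars[folds != i, ]  # k - 1 folds for training
  test_rows  <- mtcars[folds == i, ]  # the held-out fold for testing
  fit  <- lm(mpg ~ wt + hp, data = train_rows)
  pred <- predict(fit, newdata = test_rows)
  fold_rmse[i] <- sqrt(mean((test_rows$mpg - pred)^2))
}

# Average error over the k held-out folds
cv_rmse <- mean(fold_rmse)
print(fold_rmse)
print(cv_rmse)
```

If the per-fold errors vary widely, that is a hint that the model's predictions depend strongly on which sample it was trained on.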
2. No Multicollinearity:
Multicollinearity occurs when the independent variables are not independent of each other. To measure multicollinearity, we use the Variance Inflation Factor (VIF). For each independent variable, the VIF is calculated as $1 / (1 - R^2)$, where $R^2$ is the R-squared value obtained by regressing that variable on all the other independent variables. If the VIF of a variable is more than 5, remove that variable before training the model.
Ex: In the example below, all the variables shown in dark blue are highly correlated (VIF > 5).
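The VIF calculation can be sketched by hand in base R as follows. The predictor names and the mtcars stand-in data are assumptions for illustration; packaged VIF functions exist, but this version only relies on lm().

```r
# Stand-in data and an illustrative set of predictors (assumptions)
data(mtcars)
predictors <- c("wt", "hp", "disp", "qsec")

# For each predictor, regress it on all the other predictors
# and convert the resulting R-squared into a VIF
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))

# Variables above the threshold of 5 used in this article
print(names(vif_manual)[vif_manual > 5])
```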
An outlier is an observation that lies far from the other points. Remove outliers from the data, because they affect the performance of the model. Outliers can be identified by plotting the data.
After preparing your data, build a regression model using the lm() function.
Select the model that has a low AIC (Akaike Information Criterion) value and a high adjusted R-squared. Both measures penalize additional parameters that are not useful in the model. To avoid this penalty, we use stepwise regression.
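One possible way to flag such points numerically is the box plot rule, sketched here with base R's boxplot.stats(). The 1.5 × IQR rule, the mtcars stand-in data, and the column hp are assumptions for this example; the article itself only says that outliers are found by plotting.

```r
# Stand-in data (assumption): horsepower column of mtcars
data(mtcars)

# boxplot.stats() returns, in $out, the points lying beyond
# 1.5 * IQR from the box (the same points a box plot draws separately)
outlier_values <- boxplot.stats(mtcars$hp)$out
print(outlier_values)

# Keep only the rows whose hp value is not flagged as an outlier
clean <- mtcars[!mtcars$hp %in% outlier_values, ]
print(nrow(mtcars) - nrow(clean))  # number of rows removed

# boxplot(mtcars$hp) would show the same points visually
```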
Stepwise regression is a variable selection procedure for independent variables. The selection is done by an automatic process, without human intervention.
There are 3 types of stepwise regression models:
- Forward stepwise regression
- Backward stepwise regression and
- Standard stepwise regression
Forward Stepwise Regression:
It is also called step-up selection. It starts with no variables in the model and adds the most significant variable at each iteration until the AIC value stops improving.
Backward stepwise regression:
It is also called step-down selection. It first adds all the variables to the model and then removes the least significant variable at each iteration until the AIC value stops improving.
Standard stepwise regression:
It is a combination of the two methods above: at each iteration, a variable can be either added or removed.
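The three variants can be sketched with base R's step() function, which selects variables by AIC. The mtcars stand-in data and the object names are assumptions for illustration only.

```r
# Stand-in data (assumption): predict mpg from the other mtcars columns
data(mtcars)
predictors <- setdiff(names(mtcars), "mpg")

null_model <- lm(mpg ~ 1, data = mtcars)  # no independent variables
full_model <- lm(mpg ~ ., data = mtcars)  # all independent variables

# Search range: from the intercept-only model up to all predictors
search_scope <- list(lower = ~ 1, upper = reformulate(predictors))

# Forward: start empty, add one variable per iteration
forward <- step(null_model, scope = search_scope,
                direction = "forward", trace = 0)

# Backward: start full, remove one variable per iteration
backward <- step(full_model, direction = "backward", trace = 0)

# Standard (both directions): add or remove at each iteration
both <- step(null_model, scope = search_scope,
             direction = "both", trace = 0)

print(formula(forward))
print(formula(backward))
print(formula(both))
```

Setting trace = 0 only hides the per-iteration log; leaving it at the default prints the AIC value at every step.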
Now, let’s go to a practical demo. Here, I am using a house price data set, which consists of parameters like area, baths, city, floor type, etc., to predict the price of a house.
House Price Prediction with R:
Open RStudio and set the working directory to your data set location.
Load the data set into RStudio.
read.csv() is used to load CSV files, its na.strings argument specifies which values are replaced with NA, and str() is used to check the structure of the data set.
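A minimal sketch of this loading step follows. Since the real file is not available here, the example writes a tiny made-up CSV to a temporary file first; the column names (other than Prices), the values, and the object name houseprice are all hypothetical.

```r
# Hypothetical sample file, created only so the sketch is self-contained.
# With a real file you would call setwd() on its folder and pass its name.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Prices,Area,Baths,City",
  "250000,1200,2,Pune",
  "NULL,950,,Delhi",
  "310000,1500,3,"
), csv_path)

# na.strings lists the raw values that should be read as NA
houseprice <- read.csv(csv_path,
                       na.strings = c("", "NA", "NULL"),
                       stringsAsFactors = FALSE)

# Check the structure: column names, types, and first values
str(houseprice)
```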
Here, the price variable is our dependent variable (which needs to be predicted) and all the remaining variables are independent variables. Variables with fewer than 5 classes can be changed to the factor data type, which is mainly used for categorical data.
In our data set, columns 1 and 16 have more than 5 classes. So, I’m excluding those columns from the data type conversion. The data structure after conversion is as follows:
R has different packages that are used for statistical calculations. Install the required package from the CRAN repository with install.packages() and load it into RStudio with library() in the following way:
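The conversion rule can be sketched as follows: count the distinct values in each column and convert only the columns with fewer than 5. The small data frame and its column names are made up for illustration and are not the article's data set.

```r
# Hypothetical miniature data set (values invented for this sketch)
df <- data.frame(
  Prices = c(250000, 310000, 180000, 420000, 275000, 390000),
  Baths  = c(2, 3, 1, 3, 2, 3),
  Floor  = c("tile", "wood", "tile", "marble", "wood", "tile"),
  Area   = c(1200, 1500, 800, 2100, 1300, 1900),
  stringsAsFactors = FALSE
)

# Number of distinct values (classes) in each column
n_classes <- sapply(df, function(col) length(unique(col)))

# Convert only the columns with fewer than 5 classes to factor
to_convert <- names(n_classes)[n_classes < 5]
df[to_convert] <- lapply(df[to_convert], as.factor)

str(df)
```

Columns with many distinct values, such as the price and area here, stay numeric and are left out of the conversion.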
Now, split the data into 2 parts, training and testing, as shown below:
There are many ways to split data. Here, I am using the caret package. After splitting the data, build your model using the training data set as shown below:
The lm() function is used for building a linear regression model. ‘Prices’ is my target column and ‘houseprice_train’ is my training data set. summary(model) gives a summary of our model, such as the R-squared value, the significance of variables, etc.
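The split-then-fit workflow might look like the sketch below. Note that it swaps in base R's sample() for the caret package so that it runs without extra installs, and it uses mtcars with mpg as the target in place of the house price data; the 80/20 ratio and the object names are assumptions.

```r
# Stand-in data (assumption): mtcars, with mpg playing the role of 'Prices'
data(mtcars)
set.seed(123)

# Randomly pick 80% of the rows for training, keep the rest for testing
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]

# Build the linear regression model on the training data only
model <- lm(mpg ~ wt + hp, data = train_set)

# Summary: coefficients, p-values, significance stars, R-squared
model_summary <- summary(model)
print(model_summary)
print(model_summary$r.squared)
```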
Here, you can see that the last column has stars; the variables with stars are the most significant variables for prediction. The p-value should be less than 0.05 (5%). In order to keep only the significant columns, we use stepwise regression. It checks the significance of the variables and gives the final formula as follows:
I already explained that there are 3 methods in stepwise regression. Here, I am using the backward method. Now, predict on our testing data set.
My testing data set already has a prices column. So, I removed that column before prediction. After prediction, check the errors using RMSE (Root Mean Squared Error) as shown below:
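The prediction and RMSE steps can be sketched as follows, again on the mtcars stand-in with mpg as the target instead of the house price data. The split, the formula, and the object names are assumptions made so that the example runs on its own.

```r
# Stand-in data and split (assumptions for this sketch)
data(mtcars)
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]
model <- lm(mpg ~ wt + hp, data = train_set)

# Drop the target column from the test data before predicting
test_features <- test_set[, setdiff(names(test_set), "mpg")]
predicted <- predict(model, newdata = test_features)

# RMSE: square root of the mean squared difference between
# the actual and the predicted values
rmse <- sqrt(mean((test_set$mpg - predicted)^2))
print(rmse)
```

RMSE is expressed in the same units as the target, so it should be judged against the typical size of the values being predicted.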
This is my error rate, which is in fractions:
Get the source code and data set from the link below.