I believe data scientists are similar to doctors, who try to understand what the body is telling them in order to identify a disease. In our case, data is like the body of our business, and it is trying to communicate something about the business’s performance. To understand whether a new change is good or bad for our business, we need to perform various tests using statistical models and try to find the culprits causing the change. Finding these culprits will help us understand the shift in our customers’ preferences and change our strategies accordingly to serve them better and improve our profits.
Out of all the statistical models, I like Regression the most because it is simple and straightforward. It gives us all the indicators we need to come to a conclusion quickly compared to other models.
Regression models can mainly be categorized into two types:
i. Linear Regression
ii. Logistic Regression
The main difference between these two is that Linear Regression is used to predict a continuous variable like revenue, while Logistic Regression is used to predict a binary variable like purchased (represented by 1) or not purchased (represented by 0).
In this blog, I want to focus mainly on Linear Regression.
So let’s start with a fundamental question:
What is Linear Regression?
If you did some research on Google, you might have read something like: Linear regression attempts to model a relationship between two variables by fitting a linear equation to observed data. One variable is considered an explanatory variable, and the other is considered a dependent variable. (Linear Regression, 1997).
That’s a lot of terminology, right?
What if you want to explain this to a 10 or 12-year-old kid? Obviously, they can’t understand all these terms, right?
Let’s simplify this definition into layman’s terms. For example, say we want to predict the sales price of a house (dependent variable). If we don’t have any other information about the house, like the number of rooms, the location, or the square feet it covers (independent variables), our best estimate of the house price will be the average sales price of a house in that state or country.
But let’s say we now learn the square footage of the house. Then we will adjust our expected price of the house accordingly.
All that Regression does here is take this mean line and give it an angle (i.e., a slope) such that the total squared distance between the points and the line is minimized. This line is called the best fit line.
Now it sounds simple, right?
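To make the idea of tilting the mean line concrete, here is a minimal sketch in plain Python. The areas and prices are made-up numbers, not values from any real dataset, and the helper names (fit_line, sum_squared_errors) are my own. It compares the squared error of predicting the average price for every house with the squared error of the least-squares line.

```python
# Hypothetical toy data: area in square feet and sale price (made-up numbers).
areas = [1000, 1500, 2000, 2500, 3000]
prices = [200000, 260000, 330000, 370000, 450000]


def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_errors(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Baseline: with no other information, predict the average price for every house.
baseline = [mean(prices)] * len(prices)

# Best fit line: tilt the mean line so that it follows the square-feet information.
intercept, slope = fit_line(areas, prices)
fitted = [intercept + slope * a for a in areas]

sse_mean = sum_squared_errors(prices, baseline)
sse_line = sum_squared_errors(prices, fitted)
print(f"slope: {slope:.2f} per square foot, intercept: {intercept:.2f}")
print(f"squared error around the mean line: {sse_mean:.0f}")
print(f"squared error around the best fit line: {sse_line:.0f}")
```

On this toy data, the tilted line leaves far less squared error than the flat mean line, which is exactly what the best fit line is meant to achieve.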
Basically, a linear regression algorithm works by finding, among all the lines that could be drawn through the data points, the one with the least error. Whenever we do Regression, along with the line we get an equation for the line of best fit, which will be something like
Y = Beta 0 + Beta 1 * X1 + Beta 2 * X2 + … + Beta p * Xp + Error Term
Here X1, X2, …, Xp are the independent variables
Y is the dependent variable
Beta 1, Beta 2, …, Beta p are coefficient values that indicate how much the dependent value (Y) will increase or decrease for a one-unit increase in the corresponding independent variable, with the other variables held constant.
The intercept value (Beta 0) is the value that we predict for Y when all the independent variables are zero. Sometimes it can be negative, and that is also fine.
Coming to the Error Term, independent variables can never predict the dependent variable’s value 100 percent accurately in real life. The error term captures the part of Y that the independent variables do not explain. The prediction that we make is based on the available data, so the size of the errors tells us how certain we can be about this formula: the smaller the errors, the more confident we are about our prediction.
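As a small illustration of how the equation is read, the sketch below plugs values into it. The coefficients, feature names, and the observed price are hypothetical, chosen only to show what a one-unit increase and an error look like.

```python
# Hypothetical coefficients, chosen only to illustrate the equation.
intercept = 50000.0                                          # Beta 0
coefficients = {"square_feet": 120.0, "bedrooms": 10000.0}   # Beta 1, Beta 2


def predict(features):
    """Y = Beta 0 + Beta 1 * X1 + ... + Beta p * Xp (without the error term)."""
    return intercept + sum(coefficients[name] * value
                           for name, value in features.items())


house = {"square_feet": 1500, "bedrooms": 3}
bigger_house = {"square_feet": 1501, "bedrooms": 3}

price = predict(house)
price_bigger = predict(bigger_house)

# One extra square foot, everything else unchanged, moves the prediction by Beta 1.
print(price_bigger - price)

# The error for a house is its observed price minus the predicted price.
actual_price = 265000.0  # made-up observed sale price
print(actual_price - price)
```

The first printed value equals the square_feet coefficient, which is the "one unit increase" reading of a coefficient described above.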
What happens if you remove the intercept value?
Basically, we are fitting a best fit line, and if we remove the intercept, we are forcing that line to pass through the origin instead of starting where the data says it should, which will distort our predictions. So it is better to leave the intercept in regardless of its value.
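A quick way to see this distortion is to fit the same made-up points twice, once with an intercept and once forced through the origin. This is only a sketch on toy numbers that follow y = 10 + 2x exactly; the function names are mine.

```python
# Made-up points that follow y = 10 + 2x exactly.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12.0, 14.0, 16.0, 18.0, 20.0]


def fit_with_intercept(x, y):
    """Least-squares line with an intercept: returns (intercept, slope)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def fit_through_origin(x, y):
    """Least-squares line forced through (0, 0): returns the slope only."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def sum_squared_errors(x, y, intercept, slope):
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


intercept_full, slope_full = fit_with_intercept(x, y)
slope_origin = fit_through_origin(x, y)

sse_full = sum_squared_errors(x, y, intercept_full, slope_full)
sse_origin = sum_squared_errors(x, y, 0.0, slope_origin)

print(f"with intercept:    y = {intercept_full:.2f} + {slope_full:.2f} * x, SSE = {sse_full:.2f}")
print(f"without intercept: y = {slope_origin:.2f} * x, SSE = {sse_origin:.2f}")
```

Without the intercept, the slope has to become steeper to compensate, and the squared error goes up even though the data is perfectly linear.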
What is the goal of doing Regression?
The goal of Regression, like that of any other statistical model, is to draw a random sample from a population and use it to estimate the whole population’s properties.
In the regression equation, the coefficients are estimates of the actual population values. To get a reasonable prediction of the dependent variable, we need to have the following two things:
i. To get an estimation that is right on target, we should make sure that there is no bias in our analysis.
ii. There will always be an error term in our analysis. To minimize the discrepancy between the estimated value and actual value, we should ensure that the error term is as small as possible.
What does Regression Analysis do?
Regression analysis will sort out all the independent variables according to their impact on the dependent variable. It basically answers the questions like which factors matter the most? Which can we ignore? How do these factors interact with each other? And most importantly, how certain are we about all these factors? (Gallo, 2015)
Our next question will be: OK, now we know what Regression is, but how do we measure it?
The measure here is how much of the variance (error) around our mean line is explained once we add a slope to it.
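This measure is usually written as R-squared = 1 - (error left around the fitted line) / (error around the mean line). A minimal sketch, assuming made-up actual and predicted prices in thousands:

```python
def r_squared(actual, predicted):
    """Share of the variance around the mean that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)                # error around the mean line
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # error left after fitting
    return 1 - ss_residual / ss_total


# Made-up sale prices (in thousands) and predictions from a hypothetical fitted line.
actual = [200.0, 260.0, 330.0, 370.0, 450.0]
predicted = [200.0, 261.0, 322.0, 383.0, 444.0]

r2_line = r_squared(actual, predicted)

# Predicting the mean for every house explains nothing, so R-squared is 0.
mean_price = sum(actual) / len(actual)
r2_mean = r_squared(actual, [mean_price] * len(actual))

print(f"R-squared of the fitted line: {r2_line:.4f}")
print(f"R-squared of the mean line:   {r2_mean:.4f}")
```

A value near 1 means the sloped line removed most of the error that the flat mean line had, and a value of 0 means it did no better than the mean.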
Doing Linear Regression using Python on the Ames Housing Dataset, a popular learning dataset from Kaggle
(Note: We will be using the Ordinary Least Squares (OLS) method to find the best fit line. This is the most common method for finding the best fit line, and it creates a straight line by minimizing the sum of squares of the errors, that is, the differences between the observed values and the values predicted by the model.)
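The following code will generate the linear regression model in Python
# importing the regression model package
import statsmodels.formula.api as smf

# blueprinting a model type
# (the remaining predictors are cut off in the source and are left as "...";
#  `housing` is an assumed name for the DataFrame holding the Ames data)
lm_full = smf.ols(formula = """Sale_Price ~ Lot_Area + ...""",
                  data = housing)

# telling Python to run the data through the blueprint
results_full = lm_full.fit()

# printing the results
print(results_full.summary())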
This code can be found here and will give us output as shown below
The main values that we need to concentrate on from this output are:
- R-Squared: This is the coefficient of determination, which tells us how much of the variance in our dependent variable our model explains (e.g., from the above example we can say our model explains 85% of the variance in the sales price). This answers the question of how accurate our model is.
- Adjusted R-Squared: This is similar to the R-squared value, with a twist: it penalizes the model whenever we add variables that don’t add any value to it. In a nutshell, the R-squared value can never go down, even when we add useless variables to the model, but the adjusted R-squared value goes down, telling us that there is no use in adding these variables.
- AIC & BIC: R-squared and adjusted R-squared are metrics for the amount of variance that the model can explain. In contrast, AIC & BIC tell us how much information we are losing by using this model, so lower values are better. (For example, when we treat all three-bedroom houses as the same, we might lose the special features of individual rooms that can increase the value of a home, which we miss by keeping them all in the same box.)
- Probability of F-statistic: If this value is greater than our level of significance, which in most cases is 0.05, then we have no evidence that our model is better than having no model at all.
- F-Statistic: This is the number behind that probability, and it acts like one giant test of whether our model is better than nothing. To put it in simple terms, it asks whether the error we are left with after giving a slope to our mean line beats the error around the plain mean line. If it does not, then it is better not to have this model.
- Degrees of Freedom Residuals: DF Residuals = Number of observations - Number of coefficients including the intercept. This is very important because every house in our example is different, and we want to give our data the freedom to be different. If we have a very low DF value like 30, it means that we are restricting our model by putting all our data into a tight box where we have actually accounted for almost every single case in our dataset. When we give our model new data and it meets a case that it never saw before, it will predict very poorly. This is why it is good to have a higher DF residuals value.
- P-value of coefficients: If the p-value of a coefficient is greater than our level of significance, which is 0.05 (i.e., 5%), we will remove that variable from our model because we have no evidence that it adds any value.
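Several of the values in this list are tied together by simple formulas. The sketch below uses the standard textbook formulas for a model with an intercept, not code from the original analysis. It reuses the 85% R-squared mentioned above, while the observation and variable counts are made up. It does not compute the probability of the F-statistic, which needs the F distribution.

```python
def summary_metrics(r_squared, n_observations, n_predictors):
    """Derive DF residuals, adjusted R-squared and the F-statistic from R-squared."""
    df_model = n_predictors
    df_residuals = n_observations - n_predictors - 1   # minus 1 for the intercept
    adjusted = 1 - (1 - r_squared) * (n_observations - 1) / df_residuals
    f_statistic = (r_squared / df_model) / ((1 - r_squared) / df_residuals)
    return {
        "df_residuals": df_residuals,
        "adjusted_r_squared": adjusted,
        "f_statistic": f_statistic,
    }


# Hypothetical sizes: 1000 houses, 10 variables, R-squared of 0.85.
small_model = summary_metrics(0.85, n_observations=1000, n_predictors=10)

# Same R-squared but with 40 extra variables that added nothing.
padded_model = summary_metrics(0.85, n_observations=1000, n_predictors=50)

for name, metrics in [("10 variables", small_model), ("50 variables", padded_model)]:
    print(name, metrics)
```

With the same R-squared, the padded model ends up with fewer DF residuals, a lower adjusted R-squared, and a smaller F-statistic, which is the penalty for useless variables described in the list.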
We have been talking a lot about the level of significance, so what is it?
It is basically the risk we are willing to accept of believing in a pattern that actually occurred by random chance. Choosing a number as low as 0.05 (i.e., 5%) increases our confidence that the results we call significant did not occur by random chance.
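One way to get a feel for "occurred by random chance" is a permutation test. This is a different technique from the t-test behind the p-values in the regression output, and I use it here only as an illustration on made-up data. The idea is to shuffle Y many times to break any real relationship, then count how often pure chance produces a slope at least as steep as the one we observed.

```python
import random


def slope(x, y):
    """Least-squares slope for one predictor."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Fraction of shuffled datasets whose slope is at least as steep as the observed one."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = abs(slope(x, y))
    shuffled = list(y)
    as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed:
            as_extreme += 1
    return as_extreme / n_shuffles


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_related = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2]   # made up, clearly rising with x
y_unrelated = [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]                          # made up, no clear pattern

significance_level = 0.05
p_related = permutation_p_value(x, y_related)
p_unrelated = permutation_p_value(x, y_unrelated)

print(f"related:   p = {p_related:.3f}, significant: {p_related < significance_level}")
print(f"unrelated: p = {p_unrelated:.3f}, significant: {p_unrelated < significance_level}")
```

A p-value below the level of significance means that chance alone rarely produces a pattern this strong, so we treat the variable as significant.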
R-squared & Adjusted R-squared vs AIC & BIC: when to use them?
- The R-squared value is the final value that we consider as the accuracy of our model.
- Adjusted R-squared can be used to compare models of different sizes. When we compare an OLS regression with 10 variables against an OLS regression with 20 variables, this is the metric that gives an apples-to-apples comparison between those models.
- AIC & BIC talk about the overall health of the model and are used when comparing models of different types that are fitted to the same response variable, like when we are comparing OLS regression with another likelihood-based model such as a generalized linear model. The model with the lower value is the better one. The reason for using AIC & BIC in this case is that once we step outside the regression model assumptions, R-squared & adjusted R-squared stop being meaningful.
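For an OLS model, AIC and BIC can be sketched from the sum of squared errors using the Gaussian log-likelihood. The numbers below are made up, and k counts the coefficients including the intercept. That is one common convention, and some texts also count the error variance, which shifts every value by a constant.

```python
import math


def gaussian_log_likelihood(sse, n):
    """Maximized log-likelihood of an OLS fit with n observations and error sum sse."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sse / n) + 1)


def aic(sse, n, k):
    return 2 * k - 2 * gaussian_log_likelihood(sse, n)


def bic(sse, n, k):
    return k * math.log(n) - 2 * gaussian_log_likelihood(sse, n)


# Two hypothetical models fitted to the same 100 houses (made-up error sums).
n = 100
small = {"sse": 520.0, "k": 3}   # 2 variables + intercept
large = {"sse": 515.0, "k": 8}   # 7 variables + intercept, barely lower error

aic_small, bic_small = aic(small["sse"], n, small["k"]), bic(small["sse"], n, small["k"])
aic_large, bic_large = aic(large["sse"], n, large["k"]), bic(large["sse"], n, large["k"])

print(f"small model: AIC = {aic_small:.2f}, BIC = {bic_small:.2f}")
print(f"large model: AIC = {aic_large:.2f}, BIC = {bic_large:.2f}")
```

The larger model reduces the error only slightly, so the penalty for its extra coefficients makes both its AIC and its BIC higher, and the smaller model is preferred.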
“We won’t get insights from any of the statistics explained above. They only tell us whether our model is good or bad. To gain insights, we need to take our coefficient values and put them into action to see how our business is affected by them. Then, we start getting ideas on how to use that information to improve our decisions.”
Linear Regression is my favourite and preferred machine learning model because everything is straightforward. The statistics that it generates talk to us directly. For example, the R-squared value and the probability of the F-statistic talk about the accuracy of the model and tell us whether it is better than nothing, the p-value of each variable tells us how significant that variable is, and the coefficient values help us understand how much effect each of them has on the value that we are predicting. So coupling all these values from our model with the business knowledge we have will help us understand our data better, leading to better decision making. If we can solve our problem using a simple model like Regression, which gives us straightforward results, it is better to use it than a complicated model like neural nets.
Gallo, A. (2015, Nov 04). A Refresher on Regression Analysis. Retrieved from Harvard Business Review: https://hbr.org/2015/11/a-refresher-on-regression-analysis
Linear Regression. (1997). Retrieved from stat.yale.edu: http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm