Linear Regression from Scratch (Including Python Code)


LINEAR REGRESSION 

In this article, we will discuss the linear regression algorithm in detail, along with its practical implementation, in layman's terms.

 • Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. 

• It’s used to predict values within a continuous range (e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog). 

• There are two types of linear regression models: 

• Simple Linear Regression 

• Multiple Linear Regression 

 

SIMPLE LINEAR REGRESSION 

• Simple linear regression uses the traditional slope-intercept form, where a and b are the variables our algorithm will try to “learn” to produce the most accurate predictions. Here we have only one input column. 

• Formula: 𝑦 = a𝑥 + b

Here ‘𝑥’ represents our input data, ‘𝑦’ represents our prediction, ‘a’ represents the coefficient (the weight of the input variable on the output variable), and ‘b’ represents the constant/bias.

The values of ‘a’ and ‘b’ can be found using two approaches: 

Closed form solution (Direct Formula) 

Non-closed form solution (Gradient Descent) 

 

Closed form solution:

The formulas for a and b are:

a = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)^2

b = Ȳ − a·X̄

where X̄ and Ȳ are the mean values of the input and output variables, and Xi, Yi are the individual data points.

The sklearn LinearRegression implementation also uses a closed-form (non-iterative) solution.
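
A minimal NumPy sketch of this closed-form calculation, using made-up placeholder data (the sklearn comparison at the end is optional):

```python
import numpy as np

# toy placeholder data, just to illustrate the formula
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# a = sum((Xi - X_mean)(Yi - Y_mean)) / sum((Xi - X_mean)^2)
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# b = Y_mean - a * X_mean
b = y_mean - a * x_mean

print(f"slope a = {a:.3f}, intercept b = {b:.3f}")

# the same fit with sklearn, for comparison
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x.reshape(-1, 1), y)
print("sklearn:", model.coef_[0], model.intercept_)
```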

We will discuss the gradient descent approach in a later part of this article. 

 

BEST FIT LINE

 

 

  • The dark black line in the figure represents the best fit line (the line that makes the minimum error).
  • In order to find the best fit line, we have to find the minimum value of the cost function/loss function. 
  • The line for which the error between the predicted values and the observed values is minimum is called the best fit line or the regression line. 
  • These errors are also called residuals. 
  • The residuals can be visualized as vertical lines from each observed data point to the regression line, as sketched below.
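
A small matplotlib sketch of the best fit line and its residuals, reusing the closed-form a and b from above on placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt

# placeholder data; a and b come from the closed-form formulas above
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
y_pred = a * x + b

plt.scatter(x, y, label="observed values")
plt.plot(x, y_pred, color="black", label="best fit line")
# residuals: vertical lines from each observed point to the regression line
plt.vlines(x, y_pred, y, colors="red", linestyles="dashed", label="residuals")
plt.legend()
plt.show()
```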

 MULTIPLE LINEAR REGRESSION 

A more complex, multi-variable linear equation might look like this, where 𝑤 represents the coefficients, or weights, our model will try to learn.

• F(𝑥, 𝑦, 𝑧) = (𝑤1 * 𝑥) + (𝑤2 * 𝑦) + (𝑤3 * 𝑧)

 • The variables 𝑥, 𝑦, 𝑧 represent the attributes, or distinct pieces of information, we have about each observation. 

• For sales predictions, these attributes might include a company’s advertising spend on radio, TV, and newspapers. 

• Sales = (𝑤1 * Radio) + (𝑤2 * TV) + (𝑤3 * News)
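
A quick sketch of this multi-variable prediction with NumPy; the weights and spend figures are made-up placeholders:

```python
import numpy as np

# hypothetical learned weights for radio, TV, and newspaper spend
w = np.array([0.05, 0.18, 0.02])

# advertising spend for one observation: [radio, tv, news] (placeholder numbers)
spend = np.array([37.8, 230.1, 69.2])

# sales = w1*radio + w2*tv + w3*news
sales = np.dot(w, spend)
print(f"predicted sales: {sales:.2f}")
```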

 

COST FUNCTION: gives you an estimate of the error that your model is making.

 We always try to keep the value of the cost function as small as possible. 

NOTE: Cost function and loss function are often used interchangeably, but they are different. A loss function is for a single training example/input, while a cost function is the average loss over the entire training dataset.

The (mean squared error) cost function is

J(θ) = (1/2m) Σ (h(xi) - yi)^2

where

θ --> the model parameters 

m --> total number of data points 

h(xi) --> predicted value for the i-th example 

yi --> actual value for the i-th example

(The factor of 1/2 is a common convention that simplifies the derivative.)

WE TAKE THE SQUARE OF THE RESIDUALS AND NOT THE ABSOLUTE VALUE OF THE RESIDUALS BECAUSE :

• We want to penalize the points which are farther from the regression line much more than the points which lie close to the line. (Penalizing means giving more weight to the points that are farther from the best fit line.) 

• Our objective is to find the model parameters so that the cost function is minimum.
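
A small sketch of this cost for a simple linear model, written against the formula above (the 1/(2m) factor and the parameter names theta0/theta1 are assumed conventions):

```python
import numpy as np

def cost_function(theta0, theta1, x, y):
    """Mean squared error cost: J(theta) = (1/2m) * sum((h(xi) - yi)^2)."""
    m = len(y)                   # total number of data points
    h = theta0 + theta1 * x      # predicted values h(xi)
    return np.sum((h - y) ** 2) / (2 * m)

# placeholder data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)
print(cost_function(0.0, 1.0, x, y))
```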

 GRADIENT DESCENT 

  • Gradient descent is a generic optimization algorithm used in many machine learning algorithms. (In a nutshell, it is a method of finding the minimum.) 
  • It iteratively tweaks the parameters of the model in order to minimize the cost function. 

STEPS IN GRADIENT DESCENT

  •  We first initialize the model parameters with some random values. This is also called random initialization.
  •  Now we need to measure how the cost function changes with a change in its parameters. Therefore, we compute the partial derivatives of the cost function with respect to the parameters θ₀, θ₁, …, θₙ. 

PARTIAL DERIVATIVE OF COST FUNCTION W.R.T. ANY PARAMETER

∂J(θ)/∂θⱼ = (1/m) Σ (h(xi) - yi) * xij    (with xi0 = 1 for the bias term θ₀)

PARTIAL DERIVATIVES OF ALL PARAMETERS

∇J(θ) = [ ∂J/∂θ₀, ∂J/∂θ₁, …, ∂J/∂θₙ ]

UPDATING THE PARAMETERS

θⱼ := θⱼ - α * ∂J(θ)/∂θⱼ

UPDATING ALL PARAMETERS

θ := θ - α * ∇J(θ)

 LEARNING RATE (α) 

α is the learning rate.

• If the value of α is too small, the cost function takes a long time to converge; with a very small learning rate, gradient descent needs many more iterations to find the best fit line.

• If α is too large, gradient descent may overshoot the minimum and may ultimately fail to converge. 

NOTE: A learning rate of 0.01 is a common default. 

In order to reach the global minimum, we have to repeat this update step many times. 

WHEN TO STOP THIS PROCESS: 

1. When the difference between θ_old and θ_new becomes very small, e.g. ≤ 0.0001. 

2. By predefining the number of iterations (epochs) in advance, e.g. 100 or 1000. 
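
Putting the update rule, the learning rate, and both stopping criteria together, here is a minimal gradient descent sketch for simple linear regression; the defaults for alpha, epochs, and the tolerance are assumptions, and the data is a placeholder:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, epochs=1000, tol=1e-4):
    theta0, theta1 = 0.0, 0.0            # (random) initialization
    m = len(y)
    for _ in range(epochs):              # stop after a fixed number of epochs...
        h = theta0 + theta1 * x          # current predictions h(xi)
        # partial derivatives of the cost w.r.t. theta0 and theta1
        d_theta0 = np.sum(h - y) / m
        d_theta1 = np.sum((h - y) * x) / m
        new_theta0 = theta0 - alpha * d_theta0
        new_theta1 = theta1 - alpha * d_theta1
        # ...or stop early when the parameters barely change
        if max(abs(new_theta0 - theta0), abs(new_theta1 - theta1)) <= tol:
            theta0, theta1 = new_theta0, new_theta1
            break
        theta0, theta1 = new_theta0, new_theta1
    return theta0, theta1

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)
print(gradient_descent(x, y))
```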

BEST PARAMETERS TO EVALUATE PERFORMANCE OF LINEAR REGRESSION MODEL 

• We will be using Root mean squared error (RMSE) and Coefficient of Determination (R² score) to evaluate our model. 

• RMSE is the square root of the average of the squared residuals. 

• The R² score, or coefficient of determination, explains how much the total variance of the dependent variable can be reduced by using least squares regression. 

• SSₜ is the total sum of squared errors if we take the mean of the observed values as the predicted value. 

• SSᵣ is the sum of the squared residuals. 

RMSE FORMULA

RMSE = √( (1/m) Σ (yi - ŷi)^2 )

R² SCORE FORMULA

R² = 1 - (SSᵣ / SSₜ)

where the total sum of squares is SSₜ = Σ (yi - ȳ)^2, with ȳ the mean of the target variable, and the sum of squared residuals is SSᵣ = Σ (yi - ŷi)^2.
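
These two metrics, written out in NumPy against the formulas above (the function names are my own; sklearn.metrics also provides equivalents):

```python
import numpy as np

def rmse(y_true, y_pred):
    # square root of the mean of the squared residuals
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2_score(y_true, y_pred):
    ss_t = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    ss_r = np.sum((y_true - y_pred) ** 2)            # sum of squared residuals
    return 1 - ss_r / ss_t

# placeholder values
y_true = np.array([2, 4, 5, 4, 6], dtype=float)
y_pred = np.array([2.4, 3.3, 4.2, 5.1, 6.0])
print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```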

 ASSUMPTIONS

  •  There should be a linear relationship between the input and output variables. 
  • No multicollinearity between the input variables. 
  • Normality: For any fixed value of X, Y is normally distributed. 
  • Homoscedasticity: The variance of residual is the same for any value of X.
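
A rough sketch of how these assumptions might be eyeballed in practice, using made-up advertising data (column names and figures are placeholders, not a prescribed procedure):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# made-up data with two input columns and one target
df = pd.DataFrame({
    "radio": [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
    "tv":    [230.1, 44.5, 17.2, 151.5, 180.8, 8.7],
    "sales": [22.1, 10.4, 9.3, 18.5, 12.9, 7.2],
})

# 1. Linearity: scatter each input against the target, look for a straight-line trend
df.plot.scatter(x="radio", y="sales", title="radio vs sales")
df.plot.scatter(x="tv", y="sales", title="tv vs sales")

# 2. Multicollinearity: correlation between the inputs should be low
print(df[["radio", "tv"]].corr())

# 3. Homoscedasticity: residuals vs. predictions should show no funnel shape
X = np.column_stack([np.ones(len(df)), df[["radio", "tv"]].to_numpy()])
y = df["sales"].to_numpy()
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # quick least-squares fit
residuals = y - X @ w
plt.figure()
plt.scatter(X @ w, residuals)
plt.xlabel("predicted sales")
plt.ylabel("residual")
plt.show()
```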

 ADVANTAGES 

  •  Simple Implementation 
  •  Highly interpretable 
  •  Scientifically acceptable
  •  Widespread availability 
  •  Performs well when the relationship between inputs and output is approximately linear 
  •  Overfitting can be reduced by regularization 

DISADVANTAGES

  •   Prone to underfitting 
  •  Sensitive to outliers 
  •  Linear regression models the relationship between the mean of the dependent variable and the independent variables. Just as the mean is not a complete description of a single variable, linear regression is not a complete description of the relationships among variables. 

WHERE CAN LINEAR REGRESSION BE USED ? 

• It is a very powerful technique and can be used to understand the factors that influence profitability. 

• It can be used to forecast sales in the coming months by analysing the sales data for previous months. 

• It can also be used to gain various insights about customer behavior.

 Practical implementation of Simple Linear Regression
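
Below is a minimal end-to-end sketch of simple linear regression from scratch, on made-up data (hours studied vs. exam score), using the closed-form fit and the metrics discussed above:

```python
import numpy as np

# made-up training data: hours studied vs. exam score (placeholder values)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([35, 42, 50, 55, 61, 68, 74, 80], dtype=float)

# fit with the closed-form formulas: a = slope, b = intercept
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
y_pred = a * x + b

# evaluate with RMSE and R^2
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"model: y = {a:.2f}x + {b:.2f}")
print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")

# predict for a new input
print("predicted score for 9 study hours:", a * 9 + b)
```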

  • Ankit Yadav
  • Dec 27, 2022
