Understanding Logistic Regression

Introduction

A typical machine learning problem falls into one of two types: regression or classification. A regression problem deals with continuous values, where you model a relationship between dependent and independent variables, for example predicting a stock price or time-wise sales for an organization. Classification deals with discrete values: for a given entry you have to find the class to which it belongs with greater probability than the other classes, for example spam email classification or deciding whether a transaction is fraudulent.

Note - Regression algorithms can also be used for classification and vice versa, though this is not recommended.

What is Logistic Regression?

From the name you might think it is a regression algorithm, but logistic regression is popular and widely used for classification. What logistic regression does is try to separate the available data points into distinct classes. The line that separates the classes is known as the decision boundary. This boundary is generated during the training phase of the model: training picks the boundary that gives the minimum error, and that boundary is then used to predict whether a test point lies in class 1, class 2, …, or class N.

Mathematical Intuition behind Logistic Regression

First, we require a hypothesis function, which maps the independent variables to the dependent variable. As in linear regression, it is a weighted sum of the inputs:

z = θ0 + θ1x1 + θ2x2 + … + θnxn

Here x1, x2, …, xn are the independent variables (features), and θ0, θ1, …, θn are parameters that are learned during model training.

This function outputs a continuous (numeric) value, but for classification we need it to output y = 0 or y = 1 (i.e. whether a test point belongs to a class or not). For this, we pass the output of the linear equation through the sigmoid function (also known as the logistic function, hence the name logistic regression), a non-linear activation function that compresses the output to a value between 0 and 1:

σ(x) = 1 / (1 + e^(−x))

Then, using a threshold value, we can decide to which class a specific entry belongs.

As the value of x gets larger and larger, σ(x) gets closer to 1, and for large negative values of x, σ(x) gets closer to 0. We can then set the threshold value according to our problem, for example: if σ(x) ≥ 0.5 then y = 1, else if σ(x) < 0.5 then y = 0.
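As a minimal sketch (assuming NumPy; this is not the post's notebook code), the sigmoid and the threshold rule can be expressed like this:

```python
import numpy as np

def sigmoid(z):
    """Compress any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, theta, threshold=0.5):
    """Apply the linear model, squash with sigmoid, then threshold.

    X is an (m, n) feature matrix with a leading column of ones
    for the intercept; theta is the (n,) parameter vector.
    """
    probabilities = sigmoid(X @ theta)
    return (probabilities >= threshold).astype(int)

# Tiny illustration: two points, one feature, theta = [intercept, slope]
X = np.array([[1.0, -3.0], [1.0, 2.0]])
theta = np.array([0.5, 1.0])
print(predict(X, theta))  # -> [0 1]
```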

This is what the sigmoid function looks like: an S-shaped curve rising from 0 to 1.

If you are interested in the code for the above visualization, please refer to my notebook here
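The notebook itself is not reproduced here, but a minimal Matplotlib sketch of that visualization could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the sigmoid over a symmetric range around zero
z = np.linspace(-10, 10, 200)
sigma = 1.0 / (1.0 + np.exp(-z))

plt.plot(z, sigma)
plt.axhline(0.5, linestyle="--", color="gray")  # the 0.5 threshold
plt.xlabel("x")
plt.ylabel("sigmoid(x)")
plt.title("Sigmoid (logistic) function")
plt.show()
```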

Cost Function

To find out how far our current predictions deviate from the true labels, we need a way to measure error. The cost function does this: it computes the overall error of the model for the current values of the model parameters, which an optimizer then uses to reduce the error until the model converges or a set number of iterations is reached.

In the case of logistic regression, we cannot use the cost function of linear regression, because the hypothesis here is non-linear due to the sigmoid, which would make the squared-error cost non-convex. So we’ll need a different formula. Writing hθ(x) = σ(θ0 + θ1x1 + … + θnxn) for the predicted probability, the cost of a single prediction is:

Cost(hθ(x), y) = −log(hθ(x))       if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0

For the overall cost we average the sum of the costs over all m training examples:

J(θ) = −(1/m) Σ [ y log(hθ(x)) + (1 − y) log(1 − hθ(x)) ]
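As a hedged illustration (not the post's notebook code), the averaged cost can be computed with NumPy as follows:

```python
import numpy as np

def compute_cost(X, y, theta, eps=1e-12):
    """Average binary cross-entropy cost for logistic regression.

    X: (m, n) feature matrix with a leading column of ones,
    y: (m,) vector of 0/1 labels, theta: (n,) parameters.
    """
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted probabilities
    h = np.clip(h, eps, 1 - eps)             # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```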
 

Example

Let’s walk through an example of logistic regression step by step.

1. Generating a random dataset with 2 classes (not completely random, since the two classes should be easily separable); a sketch of this step follows below.

Here is how the DataFrame looks after shuffling.
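The original post shows this step as a notebook screenshot. A minimal sketch, assuming scikit-learn's make_blobs as the generator (the post does not name one), could look like this:

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Two well-separated clusters, one per class
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=42)

df = pd.DataFrame(X, columns=["x1", "x2"])
df["label"] = y
df = df.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle rows
print(df.head())
```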

2. Visualizing the points for a clearer understanding (see the plotting sketch below).

As you can see, the points belonging to the two different classes (colored orange and blue) are clearly separated.
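Continuing with the df built in the previous sketch, a minimal Matplotlib version of that scatter plot might be:

```python
import matplotlib.pyplot as plt

colors = df["label"].map({0: "tab:blue", 1: "tab:orange"})
plt.scatter(df["x1"], df["x2"], c=colors)
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Two separable classes")
plt.show()
```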

3. Now we need to find and plot the decision boundary so that we can classify new data points. With two features, the boundary is the line

b + w1·x1 + w2·x2 = 0

where b is the intercept and w1 and w2 are the coefficients learned during training.

Using this equation, we can plot the decision boundary as in the sketch below.
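A hedged sketch that learns b, w1, and w2 with scikit-learn's LogisticRegression and draws the line (the original notebook may train the model differently):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Continuing from the df built above
clf = LogisticRegression().fit(df[["x1", "x2"]], df["label"])
b = clf.intercept_[0]
w1, w2 = clf.coef_[0]

# Solve b + w1*x1 + w2*x2 = 0 for x2 to draw the boundary line
x1_vals = np.linspace(df["x1"].min(), df["x1"].max(), 100)
x2_vals = -(b + w1 * x1_vals) / w2

plt.scatter(df["x1"], df["x2"], c=df["label"], cmap="coolwarm")
plt.plot(x1_vals, x2_vals, "k--", label="decision boundary")
plt.legend()
plt.show()
```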

So, whenever a new data point arrives, we can assign it a class depending on which side of the boundary it falls.

For more complex datasets, where a linear decision boundary cannot separate the classes, you can add higher-degree polynomial terms of the features. This yields a decision boundary appropriate to the dataset, such as a circle, an ellipse, or a more irregular region.
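A minimal sketch of this idea, using scikit-learn's PolynomialFeatures on a toy circular dataset (an illustration, not the post's code):

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Concentric circles: impossible to separate with a straight line
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

# Degree-2 terms (x1^2, x1*x2, x2^2) let the model learn a circular boundary
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))  # close to 1.0 on this toy data
```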

One vs All Classification

Until now we were discussing binary classification, where there are only two possibilities: yes or no, true or false, 0 or 1, and so on. But when you have more than 2 classes, how does logistic regression work? Let’s understand this with an example. Consider three classes named square, circle, and triangle. A single binary model cannot classify items of all these different classes at once. So what we do is train k classifiers, one for each class, where k is the number of classes.

One classifier tells whether a shape is a square or not, another tells whether a shape is a circle or not, and similarly for every distinct class. When predicting on a new test point, we run all k classifiers and assign the test point the class whose classifier gives the highest probability.
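A hedged sketch of one-vs-all with scikit-learn's OneVsRestClassifier, which makes the k-classifier idea explicit (scikit-learn's LogisticRegression can also handle multiclass data directly):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three classes standing in for square, circle, and triangle
X, y = make_blobs(n_samples=300, centers=3, random_state=1)

ovr = OneVsRestClassifier(LogisticRegression())  # trains k binary classifiers
ovr.fit(X, y)

# Each column is one class's probability; the predicted class is the argmax
print(ovr.predict_proba(X[:3]))
print(ovr.predict(X[:3]))
```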

  • Aniket Jalasakare
  • Jul 03, 2022
