How Does Calculus Work In Machine Learning?
Data Science



  1. Who can read this article? (and understand it)

  2. A very brief Intro to Calculus

  3. What is linear optimization and how it differs from non-linear optimization

  4. Gradient Descent/Ascent

  5. Probability theory

  6. So where can I apply this knowledge?

  7. References

1. Who can read this article? (and understand it)

I wrote this article so that anyone can grasp what calculus is actually doing amid all the Machine Learning hype. Calculus rarely takes centre stage until you start seeking your own answers about what Machine Learning really is.

With that said, anyone who has just heard about Machine Learning can benefit from this article: it gives an intuitive explanation of Calculus and how it functions (no pun intended) hand-in-hand with Machine Learning concepts.

I will not go too deep into the mathematics, but I can’t just ignore it when it is sitting in the middle of the room. Therefore, I’ll try to keep it minimal.

Now with that out of the way, here’s a very summarized intuitive explanation about what Calculus is.

2. A very brief intro to Calculus

First of all, don’t worry, I won’t be going into too much Calculus.

Second of all, if you have already strengthened your concepts in calculus, you can skip this part. There’s not much in this section for you.

Calculus (aka Analysis) is a very formal mathematical discipline, developed independently by Isaac Newton and Gottfried Leibniz in the 17th century.

It is a mathematical tool that lets us study the rate of change of any quantity.

The rates of change are basically slopes, which can be calculated by dividing the Rise by the Run between two selected points of a function on a graph. Here is what my gibberish looks like visually:

The green line represents the function at every instance of “x”.

The two dots marked on the green line are just random points.

The point ( x, f(x) ) is the initial point we marked.

The point ( x+𝚫x, f(x+𝚫x) ) is a point marked on the function which is 𝚫x units farther away from the initial point.

The red line is found by dividing Rise over the Run, which is basically the slope of the selected part of the curve.

Calculus has two branches of study protruding from it. One is called Differential Calculus and the other is called Integral Calculus.

The general formula we use in Differential Calculus (base of the shortcut method we usually employ) is as follows:

[ f(x+𝚫x) - f(x) ] / [ (x+𝚫x) - x ]

The above formula, if we look closely, is just the Rise over Run that the figure showed. No rocket science there. And this is what differential calculus is all about: it breaks a function into smaller components through this formula (just like the yellow dotted box shows).
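To see this formula in action, here is a small Python sketch (the function f(x) = x² is just an illustrative choice) that computes the Rise over Run at a point for shrinking values of 𝚫x:

```python
# Numerically estimate the slope of f(x) = x**2 at x = 2 using the
# "Rise over Run" formula from the text. The true derivative is 2x,
# so the estimates should approach 4 as dx shrinks.
def slope(f, x, dx):
    return (f(x + dx) - f(x)) / ((x + dx) - x)

f = lambda x: x ** 2

for dx in (1.0, 0.1, 0.001):
    print(dx, slope(f, 2.0, dx))
```

Notice how the smaller we make 𝚫x, the closer the computed slope gets to the exact derivative at that point.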

Of course we could make the (x+𝚫x) the base point and move ahead to another point along the x-axis to take another component(another yellow dotted box). The following diagram shows exactly this in pink rectangles.

Hence the notation

d / dx

where “d” stands for the delta in 𝚫x as it becomes infinitesimally small.

Integral Calculus, on the other hand, collects all those differentiated pieces back together into one “whole” and gives us a number for the area of that “whole”. You can safely say that it is the Anti-Derivative.

It is important to note that joining those pink rectangles together will not give us the precise area under the curve; it will only give us an approximation. We can reduce this error by making the rectangles narrower, or in other words, by differentiating with smaller values of 𝚫x. While this surely reduces the error, it never removes it completely. The upside is that we get a closer and closer approximation to the function.
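The rectangle idea above can be sketched in a few lines of Python. This is just an illustration with f(x) = x² on [0, 1], whose exact area is 1/3, so you can watch the error shrink as the rectangles get narrower:

```python
# Approximate the area under f(x) = x**2 between 0 and 1 by summing
# n rectangles of width dx (a left Riemann sum). The exact area is
# 1/3; narrower rectangles shrink the error but never remove it.
def riemann_sum(f, a, b, n):
    dx = (b - a) / n
    return sum(f(a + i * dx) * dx for i in range(n))

f = lambda x: x ** 2

for n in (10, 100, 10000):
    print(n, riemann_sum(f, 0.0, 1.0, n))
```

Each run gets closer to 1/3, but no finite number of rectangles ever lands on it exactly, which is precisely the point made above.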

To get a visual explanation of how Calculus works and really absorb its essence, I know a guy who does it best. Try this YouTube channel: 3blue1brown

This wraps up our Introduction to Calculus.

3. What is linear optimization and how it differs from non-linear optimization

Again, if you already know this, you can skip it.

Linear optimization asks what values of the variables in a function will make its output minimum or maximum, given some constraints or restrictions. There are only two types of optimization (minimization or maximization). For example, suppose we have a function:

f(x1, x2, …, xn) = x1.a1 + x2.a2 + … + xn.an

x1, x2, ..., xn = variables

a1, a2, ..., an = constants

Since no variable here is raised to a power of 2 or more, we can say that this is a linear function. Therefore, finding the values of x1, x2, ..., xn that minimize or maximize f(x1, x2, …, xn) under some constraints is optimizing a linear function.
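As a tiny concrete sketch (the numbers here are made up for illustration), consider maximizing f = 3·x1 + 2·x2 subject to x1 + x2 ≤ 4, x1 ≤ 3, and x1, x2 ≥ 0. A handy fact about linear functions is that the optimum sits at a corner (vertex) of the feasible region, so we only need to compare the corner points:

```python
# Maximize f = 3*x1 + 2*x2 subject to x1 + x2 <= 4, x1 <= 3,
# x1 >= 0, x2 >= 0. For a linear objective the optimum lies at a
# vertex of the feasible region, so comparing vertices is enough.
vertices = [(0, 0), (3, 0), (3, 1), (0, 4)]  # corners of the region

def f(x1, x2):
    return 3 * x1 + 2 * x2

best = max(vertices, key=lambda v: f(*v))
print(best, f(*best))  # (3, 1) gives the maximum value 11
```

Real solvers (the simplex method, for instance) automate exactly this kind of vertex search for problems with many variables.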

Non-linear optimization works in a similar way, but since the function now contains higher-order terms, the optimization can become messy. Therefore, we take the help of Calculus to optimize the variables with respect to the constraints.
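Here is what “taking the help of Calculus” looks like in miniature, with a made-up function f(x) = -x² + 4x. Its derivative is f'(x) = -2x + 4, and the maximum sits where the derivative equals zero; the sketch locates that zero with a simple bisection search:

```python
# Calculus-style optimization of a non-linear function: to maximize
# f(x) = -x**2 + 4*x we look for the point where its derivative
# f'(x) = -2*x + 4 crosses zero, using a simple bisection search.
def fprime(x):
    return -2 * x + 4

lo, hi = 0.0, 10.0          # f' is positive at lo and negative at hi
for _ in range(60):
    mid = (lo + hi) / 2
    if fprime(mid) > 0:     # still climbing: the maximum is to the right
        lo = mid
    else:                   # past the peak: the maximum is to the left
        hi = mid

x_star = (lo + hi) / 2
print(x_star)  # ≈ 2.0, where f reaches its maximum of 4
```

Solving f'(x) = 0 by hand gives x = 2 directly; the numerical search just shows how a computer finds the same point when the algebra is too messy to do by hand.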

You can find more about how to perform optimization using Calculus here: Optimization in Calculus - CalcWorkshop

4. Gradient Descent/Ascent

We will first talk about Gradient Descent. Gradient Ascent is just the opposite.

Gradient Descent is an algorithm that uses a plot of slopes (taken from the derivatives) at every point on the x-axis.

In a very good sandbox environment, the graphical representation of this plot would look like this:

Suppose our objective is to find the minimum cost on this slope graph. We can see that the red slope is close to the minimum cost. With a few more iterations we might reach the point where d/dx = 0, which is the minimum cost.

Now let’s try to make sense of this pointless looking gibberish.

This graph is plotted by differentiating a function and noting down its slope value, i.e. the value of the derivative at a point x = p (“p” is just some point on the x-axis). This means the x-axis of this plot is the same as the x-axis of the function plot, but the y-axis now contains values of the slope instead of f(x).

By finding the optimum value on this plot, we can optimize many of our Machine Learning algorithms, which of course leads to better models.
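A minimal sketch of the idea, using a made-up cost function f(x) = (x - 3)² whose derivative is f'(x) = 2(x - 3): at each step we move x a little way against the slope, “descending” toward the point where the derivative is zero.

```python
# Minimal gradient descent on the cost function f(x) = (x - 3)**2.
# Its derivative is f'(x) = 2 * (x - 3); stepping against the slope
# moves x toward the minimum at x = 3, where d/dx = 0.
def dfdx(x):
    return 2 * (x - 3)

x = 0.0               # arbitrary starting point
learning_rate = 0.1   # how far to step along the negative slope

for step in range(100):
    x = x - learning_rate * dfdx(x)

print(x)  # ≈ 3.0, the point of minimum cost
```

The learning rate controls the step size: too small and convergence is slow, too large and the iterates can overshoot the minimum and bounce around.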

Ok, that was fun, but where is Calculus here? Well, the slope values themselves come from derivatives, which is Calculus at work. A closely related numerical technique, the Newton-Raphson Method, also exploits slopes (to find where a function crosses zero) and is said to be the most efficient among its brethren; using slope information in this way can make the overall process faster.

But remember, I said that this visualization comes from a sandbox environment. In practice we may not get such a good-looking slope plot, let alone such a clean function. It can be very messy and hard to track without the use of calculus. Not to mention that this is a 2D plot; we usually deal with planes and hyper-surfaces in real-life projects. There can also be several local minima, so we could easily get lost and never find the global minimum.

Gradient Descent/Ascent can be thought of as a tool that enables us to optimize our models for better results. It is the fine-tuner behind good models and what actually makes a model learn. “Cost” here is simply the quantity being minimized or maximized in the optimization.

Of course I have skipped a great deal of theory just to make this article beginner friendly.

5. Probability Theory

Now, I would rather not use any fancy terms here, but avoiding them entirely would kill the gist of what I am going to talk about.

As far as Integrals are concerned, they are utilized by “expectation maximization” and “variational Bayesian inference” to solve two of the most common problems in Machine Learning, i.e. maximum likelihood estimation and Bayesian inference. Both fall under the banner of Probability Theory.

Recall I said that an Integral just collects the pieces of a differentiated function and calculates the “whole”. Well, if you are aware of what a Probability Density Function and a Probability Mass Function are, then you already know how integrals sum up those small differentiated pieces of the function.
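As a small illustration of summing pieces of a density, here is a sketch that integrates the standard normal PDF with the same rectangle idea from the calculus section. Over a wide enough range, the total area (the total probability) should come out close to 1:

```python
import math

# Integrate the standard normal Probability Density Function by
# summing thin rectangles. The area under a PDF over its whole
# range is the total probability, which must equal 1.
def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def integrate(f, a, b, n=100000):
    dx = (b - a) / n
    return sum(f(a + i * dx) * dx for i in range(n))

total = integrate(normal_pdf, -10, 10)
print(total)  # ≈ 1.0: all of the probability mass accounted for
```

Integrating the same PDF between two finite points gives the probability of landing in that interval, which is exactly how density functions are used in practice.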

One channel I like the most for Statistics and Machine Learning concepts is StatQuest by Josh Starmer.

6. So where can I apply this knowledge?

There’s good news and slightly bad news.

The good news is that these concepts are abstracted away from developers. This means that if you just want to build a Machine Learning application, there are tons of libraries and tools for that; you only have to supply a pre-made model with inputs and train it. No calculus required here.

The slightly bad news is that if you are pursuing research in Machine Learning and Artificial Intelligence, almost all the papers on these topics have Integrals and Differentials all over the place. So you had better master Calculus before you start reading research papers. Also, if you choose this route, you might have to build your own models. It’s not really bad news, actually, since most people pursuing research already know their stuff.

7. References

I mainly got this knowledge from Imperial College London’s course “Mathematics for Machine Learning” on Coursera. I built about 70% of my understanding from this course; the other 30% came from the sources linked throughout this article.

  • Muhammad Hammad Hassan
  • Mar, 25 2022
