How Does Calculus Work In Machine Learning?


Contents:

  1. Who can read this article? (and understand it)

  2. A very brief Intro to Calculus

  3. What is linear optimization and how it differs from non-linear optimization

  4. Gradient Descent

  5. Probability theory

  6. So where can I apply this knowledge?

  7. References



1. Who can read this article? (and understand it)


I wrote this article so that anyone can grasp the gist of what calculus is actually doing amid all the Machine Learning hype; it rarely steps on stage until you start seeking your own answer to what Machine Learning really is.

With that said, anyone who has only just heard about Machine Learning can benefit from this article and come away with an intuitive explanation of Calculus and how it functions (no pun intended) hand-in-hand with Machine Learning concepts.

I will not go too deep into the mathematics, but I can’t just ignore it when it is sitting in the middle of the room. Therefore, I’ll try to keep it minimal.

Now with that out of the way, here’s a very condensed, intuitive explanation of what Calculus is.


2. A very brief intro to Calculus


First of all, don’t worry, I won’t be going into too much Calculus.

Second of all, if you have already strengthened your calculus concepts, you can skip this part. There’s not much in this section for you.


Calculus (also known as Analysis) is a very formal mathematical discipline, developed by Isaac Newton and Gottfried Leibniz in the 17th century.

It is a mathematical tool which lets us study the rate of change of any quantity.

The rates of change are basically slopes, which can be calculated by dividing the Rise by the Run between two selected points of a function on a graph. Here is what my gibberish looks like visually:



The green line represents the function at every instance of “x”.

The two dots marked on the green line are just random points.

The point ( x, f(x) ) is the initial point we marked.

The point ( x+𝚫x, f(x+𝚫x) ) is a point marked on the function which is 𝚫x units farther away from the initial point.

The slope of the red line is found by dividing the Rise by the Run, and it is basically the slope of the selected part of the curve.
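
To put actual numbers on that picture, here is a tiny Python sketch (the function f, the point x and the step 𝚫x are all made up for illustration; none of them come from the figure) that computes the slope of the red line as Rise over Run:

# A made-up example function; any smooth function would do.
def f(x):
    return x ** 2

x = 1.0      # the initial point ( x, f(x) )
dx = 0.5     # 𝚫x: how far the second point ( x + 𝚫x, f(x + 𝚫x) ) is along the x-axis

rise = f(x + dx) - f(x)   # vertical change between the two marked points
run = dx                  # horizontal change between them
slope = rise / run        # the slope of the red line

print(slope)   # 2.5 for this particular f, x and 𝚫x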


Calculus has two branches of study protruding from it. One is called Differential Calculus and the other is called Integral Calculus.


The general formula we use in Differential Calculus (the basis of the shortcut rules we usually employ) is as follows:

( f(x+𝚫x) - f(x) ) / ( (x+𝚫x) - x )


The above formula, if we look closely, is just the Rise over the Run that the figure described. No rocket science there. And this is what differential calculus is all about: it breaks a function into smaller components through this formula (just like the yellow dotted box shows).

Of course, we could make (x+𝚫x) the new base point and move ahead to another point along the x-axis to take another component (another yellow dotted box). The following diagram shows exactly this with pink rectangles.

Hence the notation

d / dx

where “d” stands for a vanishingly small delta, i.e. the 𝚫 in 𝚫x shrunk towards zero.
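
To see why the “d” hints at a vanishingly small 𝚫x, here is a small sketch (again with a made-up f, whose true derivative at x = 1 is 2) that evaluates the formula above with smaller and smaller 𝚫x:

def f(x):
    return x ** 2   # made-up example; d/dx of x^2 is 2x

x = 1.0
for dx in [1.0, 0.1, 0.01, 0.001]:
    # The difference quotient from the general formula above.
    slope = (f(x + dx) - f(x)) / ((x + dx) - x)
    print(dx, slope)   # the slope approaches 2.0 as 𝚫x shrinks

The smaller 𝚫x gets, the closer the quotient gets to the true derivative, which is exactly what the d/dx notation expresses.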


Integral Calculus, on the other hand, collects all those differentiated pieces back together into one “whole” and gives us a number for the area of that “whole”. You can safely say that it is the Anti-Derivative.

It is important to note that joining those pink rectangles together will not give us the precise area under the curve; it only gives an approximation. We can reduce the error by making the rectangles narrower, or in other words, by differentiating with smaller values of 𝚫x. While this surely shrinks the error, it never removes it completely. The upside is that we get a closer and closer approximation to the area under the function.
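
Here is a minimal sketch of that idea (a made-up function and interval; the exact area under x^2 on [0, 1] is 1/3): summing rectangle areas approximates the integral, and a smaller 𝚫x gives a smaller error.

def f(x):
    return x ** 2   # made-up example; the exact area under it on [0, 1] is 1/3

def riemann_sum(f, a, b, dx):
    # Add up the areas of the "pink rectangles" of width dx between a and b.
    n = int(round((b - a) / dx))   # number of rectangles
    return sum(f(a + i * dx) * dx for i in range(n))

for dx in [0.1, 0.01, 0.001]:
    print(dx, riemann_sum(f, 0.0, 1.0, dx))   # approaches 1/3 but never reaches it exactly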


To get a visual explanation of how Calculus works and really absorb its essence, I know a guy who does it best. Try this YouTube channel: 3blue1brown


This wraps up our Introduction to Calculus.


3. What is linear optimization and how it differs from non-linear optimization


Again, if you already know this, you can skip it.


Linear optimization simply asks which values of the variables of a function will make its output minimum or maximum, given some constraints or restrictions. There can only be two types of optimization (min or max). For example, say we have a function:


f(x1, x2, …, xn) = x1·a1 + x2·a2 + … + xn·an

Where;

x1,x2,...,xn = variables

a1,a2,...,an = constants


Since no variable here is raised to the power 2 or higher, we can say that this is a linear function. Therefore, asking which values of x1, x2, …, xn will minimize or maximize f(x1, x2, …, xn) under some constraints is optimizing a linear function.
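
As a concrete sketch (the numbers, the constraint and the use of SciPy’s linprog are my own illustration, not something from the article), a linear objective with linear constraints can be handed to a linear-programming solver directly; no calculus is involved:

# Minimize f(x1, x2) = 3·x1 + 2·x2 subject to x1 + x2 >= 4 and x1, x2 >= 0 (made-up numbers).
from scipy.optimize import linprog

c = [3, 2]           # the constants a1, a2 of the linear objective (linprog always minimizes)
A_ub = [[-1, -1]]    # the constraint x1 + x2 >= 4, rewritten as -x1 - x2 <= -4
b_ub = [-4]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, res.fun)   # the minimizing x1, x2 (here 0 and 4) and the minimum value (8)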


Non-linear optimization works in much the same way, but since the function now contains higher-order terms, the optimization can become messy. Therefore, we take the help of Calculus to optimize the variables with respect to the constraints.

You can find more about how to perform optimization using Calculus here: Optimization in Calculus - CalcWorkshop
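
To make the calculus connection concrete, here is a hedged sketch (a made-up quadratic objective; SciPy’s minimize is just one possible solver) where the gradient, i.e. the partial derivatives, is supplied to the optimizer explicitly:

import numpy as np
from scipy.optimize import minimize

# A made-up non-linear objective: f(x1, x2) = (x1 - 3)^2 + (x2 + 1)^2
def f(x):
    return (x[0] - 3) ** 2 + (x[1] + 1) ** 2

# Its gradient, worked out with Differential Calculus.
def grad_f(x):
    return np.array([2 * (x[0] - 3), 2 * (x[1] + 1)])

res = minimize(f, x0=np.array([0.0, 0.0]), jac=grad_f, method="BFGS")
print(res.x)   # close to [3, -1], where both partial derivatives are zero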


4. Gradient Descent/Ascent


We will first talk about Gradient Descent. Gradient Ascent is just the opposite.


Gradient Descent is an algorithm which works on a plot of slopes (taken from the derivatives) at every point along “x”.

In a very good sandbox environment, the graphical representation of this plot would look like this:

We are assuming that our objective is to find the minimum cost on this slope graph, and we can see that the red slope is already close to the minimum. With a few more iterations we might reach the point where d/dx = 0, which is the minimum cost.


Now let’s try to make sense of this pointless looking gibberish.


This graph is plotted by differentiating a function and noting down its slope value, i.e. the value of the derivative at a point x = p (“p” is just some random point on the x-axis). This means that the x-axis on this plot is the same as the x-axis on the function’s plot, but the y-axis now contains slope values instead of f(x).


By finding the optimum value on this plot, we can optimize many of our Machine Learning algorithms, which of course leads us to better models.
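
Here is a minimal sketch of the idea in one variable (the cost function, starting point and learning rate are all made up): start somewhere, read off the slope, and keep stepping against it until the slope is nearly zero.

def cost(x):
    return (x - 2) ** 2   # made-up cost; its minimum sits at x = 2

def slope(x):
    return 2 * (x - 2)    # d/dx of the cost, i.e. the value plotted on the slope graph

x = 10.0                  # arbitrary starting point
learning_rate = 0.1       # how large a step to take against the slope

for step in range(50):
    x = x - learning_rate * slope(x)   # a positive slope pushes x left, a negative one pushes it right

print(x)   # very close to 2, where d/dx is (almost) zero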


Ok, that was fun, but where is Calculus here? Well, the slopes the algorithm follows are derivatives, so the whole procedure rests on Differential Calculus. A closely related numerical technique, the Newton-Raphson method, also works with slopes (and second derivatives) to find where the derivative hits zero; it is said to be among the most efficient of its kind and can make the overall process faster.
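
For contrast, here is a sketch of a Newton-Raphson style update on the same made-up cost from the previous snippet: instead of a fixed learning rate, it divides the slope by the second derivative, which is why it can converge so quickly.

def slope(x):
    return 2 * (x - 2)    # first derivative of the made-up cost (x - 2)^2

def curvature(x):
    return 2.0            # second derivative of that cost

x = 10.0
for step in range(5):
    x = x - slope(x) / curvature(x)   # Newton-Raphson update applied to the slope
    print(step, x)                    # for a quadratic cost it lands on 2.0 in a single step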


But remember, I said that this visualization comes from a sandbox environment. In practice we may not get such a good-looking slope plot, or even a well-behaved function to begin with. Things can be very messy and hard to track without the use of calculus. Not to mention that this is a 2D plot; in real-life projects we usually deal with planes and hypersurfaces. There can also be several local minima, so we could easily get stuck and never find the global minimum.


Gradient Descent/Ascent can be thought of as a tool which enables us to optimize our models for better results. It is the fine-tuner behind good models and what actually makes a model learn. “Cost” here is just the quantity being minimized or maximized in the optimization.


Of course I have skipped a great deal of theory just to make this article beginner friendly.


5. Probability Theory


Now, I don’t want to use any fancy terms here, but avoiding them entirely would kill the gist of what I am going to talk about.


As far as Integrals are concerned, they are utilized by “expectation maximization” and “variational Bayesian inference” to solve two of the most common problems in Machine Learning, i.e. maximum likelihood estimation and Bayesian inference. Both fall under the banner of Probability Theory.


Recall I said that an Integral just collects the pieces of a differentiated function and calculates the “whole”. Well, if you are aware of what a Probability Density Function and a Probability Mass Function are, then you already know how integrals sum up those small pieces of a function.
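
As a small sketch (using the standard normal density as a made-up example), integrating a Probability Density Function over its whole range is exactly that “collect the pieces into a whole” idea, and the total should come out to about 1:

import numpy as np

def normal_pdf(x, mean=0.0, std=1.0):
    # Probability Density Function of a normal (Gaussian) distribution.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

x = np.linspace(-10.0, 10.0, 100_001)   # a fine grid of points along the x-axis
dx = x[1] - x[0]                        # the width of each small piece
total = np.sum(normal_pdf(x) * dx)      # add the pieces up, just like the rectangles earlier
print(total)                            # approximately 1.0: the total probability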


One channel I like the most related to Statistics and Machine Learning concepts is StatQuest by Josh Starmer


6. So where can I apply this knowledge?


There’s good news and slightly bad news.


The good news is that these concepts are abstracted away from developers. This means that if you just want to build a Machine Learning application, there are tons of libraries and tools for that, and you only have to supply a pre-made model with inputs and train it. No calculus required here.
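
For example, here is a hedged sketch with a made-up toy dataset (scikit-learn is just one of many such libraries): all the derivative work happens inside fit, and you never write any calculus yourself.

from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: two features per sample, binary labels.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.5], [3.0, 0.0]]
y = [0, 0, 1, 1]

model = LogisticRegression()   # the calculus-driven optimization lives inside this object
model.fit(X, y)                # gradients are computed and used under the hood

print(model.predict([[2.5, 0.2]]))   # just use the trained model; no calculus required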


The slightly bad news is that if you are pursuing research in Machine Learning and Artificial Intelligence, almost all the papers on these topics have Integrals and Differentials all over the place. So you had better master Calculus before you start reading research papers. Also, if you choose this route, you might have to build your own models. It isn’t really bad news, though, since most people pursuing research already know their stuff.


7. References


I mainly got this knowledge from Imperial College London’s course “Mathematics for Machine Learning” on Coursera; I built about 70% of my understanding from it. The remaining 30% came from other sources, including the channels and links mentioned throughout this article.



