Intuition behind Random Forest Algorithm and Why is it Popular

Random Forest falls under Bagging, which is an “Ensemble Technique”.

First, let's look at what an Ensemble Technique is.

It is a technique of grouping multiple models, training each of them on the data, and combining their outputs to get the desired result.

There are two types of Ensemble Techniques: Bagging and Boosting. Let's focus solely on Bagging here.

What is Bagging?

It is also known as Bootstrap Aggregation. Let us understand why it is called so.

We have a cleaned dataset with a specific number of features and multiple models (from now on, let's call them Base Learners, as each model takes the data and learns patterns from it).

Step 1:

From the dataset, a relatively small sample of data is drawn and sent to the first Base Learner.

Step 2:

Another sample of data (which may contain a few rows from the previous sample) is drawn and sent to the second Base Learner.

Step 2 is repeated for “N” Base Learners.

 

This process is called “Row Sampling with Replacement”: samples of rows are drawn with replacement, so a few rows get repeated across samples, but most of the rows in each sample are unique.

Now the Base Learners get trained on the sample data they have received.

After training, new data is sent to the trained Base Learners, and each Base Learner predicts an output.

The output predicted by the majority of the Base Learners is taken as the final output.

“Row Sampling with Replacement” is called BOOTSTRAPPING, and taking the “Majority Output as the Final Output” is called AGGREGATION.
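Here is a minimal sketch of this bagging procedure in Python, assuming a synthetic dataset and scikit-learn decision trees as the Base Learners (the dataset, sample size and number of learners are illustrative choices, not part of the original example):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Illustrative dataset; any cleaned feature matrix X and labels y would do.
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    n_learners = 10          # "N" Base Learners
    rng = np.random.default_rng(42)
    base_learners = []

    for _ in range(n_learners):
        # Bootstrapping: row sampling WITH replacement, so some rows repeat.
        rows = rng.integers(0, len(X), size=len(X))
        base_learners.append(DecisionTreeClassifier().fit(X[rows], y[rows]))

    # Aggregation: every Base Learner predicts, and the majority class wins.
    new_data = X[:5]
    all_preds = np.array([m.predict(new_data) for m in base_learners])
    final_output = [np.bincount(col).argmax() for col in all_preds.T]
    print(final_output)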

Let us understand this better with a diagram.

As per the above diagram, 

  • B1, B2 and B3 samples are picked from the training data [Bootstrapping].
  • B1, B2 and B3 samples are sent to Base Learners M1, M2 and M3 respectively.
  • Let's assume this model is used for classification, predicting either “Good” or “Bad”.
  • M1 predicts “Good”; M2 predicts “Bad”; M3 predicts “Good”.
  • “Good” is predicted twice and “Bad” only once, so the majority is “Good” [Aggregation].
  • Thus, the final output is considered to be “Good”.

Now, let us get an idea about “Random Forest”.

  • The Base Learners are Decision Trees.
  • Row sampling and feature (column) sampling with replacement are both done here. (Feature sampling: a sample of columns is selected as part of the bootstrap sample; see the sketch after this list.)
  • Each Decision Tree is fed a sample of the data and trained on it.
  • New data is then sent to the Decision Trees for prediction.
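A rough sketch of the extra feature-sampling step, again assuming a synthetic dataset; note that this samples a fixed column subset per tree for simplicity, whereas typical Random Forest implementations sample features at every split:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    rng = np.random.default_rng(0)

    trees = []
    for _ in range(10):
        # Row sampling with replacement (as in plain bagging).
        rows = rng.integers(0, len(X), size=len(X))
        # Feature (column) sampling: each tree sees only a random subset of columns.
        cols = rng.choice(X.shape[1], size=4, replace=False)
        trees.append((DecisionTreeClassifier().fit(X[rows][:, cols], y[rows]), cols))

    # Prediction: each tree looks only at its own columns of the new data.
    new_data = X[:3]
    preds = np.array([tree.predict(new_data[:, cols]) for tree, cols in trees])
    print(preds)   # one row of predictions per tree, ready to be aggregated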

In Random Forest Classifier:

  • For example, each Decision Tree predicts either “Good” or “Bad”; the class predicted by the majority of the trees is taken as the final outcome.

In Random Forest Regressor:

  • The predicted values are continuous values.
  • The mean (or median) of the predicted values is taken as the final outcome, as in the usage sketch below.
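Both variants ship with scikit-learn; a short usage sketch on synthetic data (the datasets and hyperparameters are arbitrary choices for illustration):

    from sklearn.datasets import make_classification, make_regression
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Classifier: the trees' votes are aggregated into a single class label
    # (scikit-learn averages the trees' predicted probabilities, a "soft" majority vote).
    Xc, yc = make_classification(n_samples=500, n_features=10, random_state=1)
    clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(Xc, yc)
    print(clf.predict(Xc[:5]))       # class labels, e.g. "Good"/"Bad" encoded as 1/0

    # Regressor: the trees' continuous predictions are averaged (mean).
    Xr, yr = make_regression(n_samples=500, n_features=10, random_state=1)
    reg = RandomForestRegressor(n_estimators=100, random_state=1).fit(Xr, yr)
    print(reg.predict(Xr[:5]))       # continuous values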

 

Why does the Random Forest algorithm perform well?

  1. The overfitting problem is avoided.

How?

An Individual Decision Tree has Low Bias and High Variance.

But when we combine multiple Decision Trees, the high variance is converted to low variance: we are no longer dependent on a single Decision Tree to predict the outcome, because we perform “Aggregation”.
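One way to see this effect empirically is to compare cross-validation scores of a single decision tree against a forest on the same data; the forest's scores are usually higher on average and spread less. A sketch, assuming a synthetic dataset:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

    tree_scores = cross_val_score(DecisionTreeClassifier(random_state=7), X, y, cv=10)
    forest_scores = cross_val_score(RandomForestClassifier(random_state=7), X, y, cv=10)

    # The forest typically shows a higher mean accuracy and a smaller standard
    # deviation across folds, i.e. aggregation has traded high variance for low variance.
    print("Tree  :", tree_scores.mean(), tree_scores.std())
    print("Forest:", forest_scores.mean(), forest_scores.std())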

  2. Decision Trees can handle both categorical and continuous features.

  3. Feature scaling is not necessary, as Decision Trees are being used here.

  4. Robustness to outliers.

How?

Since we are using Decision Trees, the data is split into groups by checking whether each value is above or below a selected threshold on a specific feature.

Input outliers therefore do not have extra influence: the split only asks which side of the threshold a value falls on, not how far away from it the value is.
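A tiny illustration of that point, with a made-up one-feature dataset containing one extreme value:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # One feature whose values separate the classes cleanly around 5.0,
    # plus one extreme input outlier (1000.0) in the last row.
    X = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [1000.0]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
    # The chosen threshold stays near 5.0; making the outlier even larger does not
    # move it, because the split only checks "above or below?", not distance.
    print(tree.tree_.threshold[0])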


Agasthya NU
Dec 27, 2022
