Introduction to Classification Algorithms in Machine Learning
What is classification?
Classification is a supervised learning technique for organizing labeled data into distinct classes or categories. Given a set of data points, a classification model predicts the class each point belongs to; the classes are often referred to as targets, labels, or categories. Classification can be performed on both structured and unstructured data.
Binary classification refers to predicting one of two classes, while multi-class classification involves predicting one of more than two classes. One of the most common applications of classification is filtering emails into “spam” or “non-spam”, as done by today’s top email service providers. In short, classification is the process of recognizing, understanding, and grouping objects and ideas into preset categories known as sub-populations.
Classification models include logistic regression, decision tree, random forest, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naïve Bayes.
Examples of classification:
1. Predicting the gender of a person by their handwriting style.
2. Predicting whether a patient has a disease, given a list of symptoms.
3. Predicting whether next year’s monsoon will be normal.
4. Predicting whether a house’s price falls above or below a given value, based on its area.
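Tasks like these can be sketched in a few lines with scikit-learn. The snippet below is a minimal illustration, not part of the original article: it trains a binary classifier on an invented toy feature (a count of suspicious words in an email) to separate “spam” from “non-spam”.

```python
# Minimal binary-classification sketch (assumes scikit-learn is installed;
# the toy data below is invented for illustration).
from sklearn.linear_model import LogisticRegression

# Feature: number of suspicious words in an email; label: 1 = spam.
X = [[0], [1], [2], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[1], [8]]))  # a clean email and a suspicious one
```

The same fit/predict pattern applies to every model discussed below; only the estimator class changes.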
Some classification models:
● Naïve Bayes:
Naïve Bayes is a classification algorithm based on Bayes’ theorem, which assumes that the predictors in a dataset are independent, i.e. that the features are unrelated to each other. Even if the features do depend on each other, each property is treated as contributing independently to the probability. A Naïve Bayes model is easy to build and is particularly useful for comparatively large data sets.
For example, given a banana, the classifier sees that the fruit is yellow, oblong-shaped, long, and tapered. Each of these features contributes independently to the probability of it being a banana, without regard to any dependence between them.
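The fruit example can be sketched as follows; this is an illustrative toy, with invented measurements, using scikit-learn’s Gaussian Naïve Bayes implementation.

```python
# Toy Gaussian Naive Bayes sketch (assumes scikit-learn; the fruit
# measurements below are invented for illustration).
from sklearn.naive_bayes import GaussianNB

# Features: [length_cm, width_cm]; labels: 0 = apple, 1 = banana.
X = [[7, 7], [8, 7], [7, 8], [18, 4], [20, 3], [19, 4]]
y = [0, 0, 0, 1, 1, 1]

model = GaussianNB()
model.fit(X, y)
print(model.predict([[19, 3]]))  # a long, narrow fruit -> banana
```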
● Decision trees:
A decision tree is an algorithm that visually represents decision making. It builds the classification model in the form of a tree structure, using if-then rules that are mutually exclusive and collectively exhaustive. A decision tree can be built by asking a yes/no question and splitting on the answer to lead to another decision: each question sits at a node, and the resulting decisions are placed below it, ending at the leaves. A classic example is a tree that decides whether or not to play tennis, based on the weather conditions, the humidity, and the wind. Conventionally, False branches go to the left of the tree and True branches to the right. The topmost node in the decision tree, corresponding to the best predictor, is called the root node. A notable strength of decision trees is that they can handle both categorical and numerical data.
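A small play-tennis-style tree can be fitted and printed as text. This is a sketch with invented, numerically encoded weather data; the learned splits come from the toy data, not from the classic play-tennis data set.

```python
# Decision tree sketch (assumes scikit-learn; categories are encoded
# as integers for this toy example).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [outlook (0=sunny, 1=overcast, 2=rain), humidity (0=normal, 1=high)]
X = [[0, 1], [0, 0], [1, 1], [2, 0], [2, 1]]
y = [0, 1, 1, 1, 0]  # 1 = play tennis

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity"]))
print(tree.predict([[0, 1]]))  # sunny + high humidity -> don't play
```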
● K-Nearest Neighbors:
K-Nearest Neighbors is a classification and prediction algorithm that divides data into classes based on the distance between data points. It is a lazy learning algorithm: it simply stores all training instances in n-dimensional space. The “K” is the number of neighbors it checks. K-Nearest Neighbors assumes that data points close to one another must be similar, so a new data point is assigned the majority class among its K closest neighbors.
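A sketch of the idea, with two well-separated invented clusters (assumes scikit-learn):

```python
# K-Nearest Neighbors sketch: classify a point by the majority label
# of its k closest training points (toy data, k = 3).
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))  # each point joins its nearby cluster
```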
● Random forest:
Random forest is an ensemble learning method for classification, regression, and related tasks. A random forest is a meta-estimator that fits a number of decision trees on various subsamples of the data set and then averages their predictions to improve accuracy. A random forest is typically more accurate than a single decision tree because the averaging reduces over-fitting.
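In code, the ensemble looks exactly like a single tree from the outside; only the estimator changes. A sketch on the same kind of toy data (assumes scikit-learn; `n_estimators` is the number of trees in the forest):

```python
# Random forest sketch: an ensemble of decision trees fitted on
# bootstrap subsamples, predicting by majority vote (toy data).
from sklearn.ensemble import RandomForestClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X, y)
print(rf.predict([[1, 2], [9, 9]]))
```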
● Evaluating a classifier:
After building a classifier, the most important step is to check its accuracy and efficiency. There are many ways to evaluate a classifier; some are given below.
1. Holdout method:
This is the most common method to evaluate a classifier. The given data set is divided into two parts, a train set and a test set, typically 80% and 20% respectively. The train set is used to train the model, and the unseen test set is used to test its predictive power.
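The 80/20 split can be sketched with scikit-learn’s `train_test_split` helper (toy data; the fixed `random_state` is only for reproducibility):

```python
# Holdout sketch: 80% of the data trains the model, 20% tests it.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(len(X_train), len(X_test))        # 8 2
print(clf.score(X_test, y_test))        # accuracy on the unseen test set
```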
2. Cross-validation:
Over-fitting is one of the most common problems in machine learning models, and k-fold cross-validation can be conducted to check whether the model is over-fitted. In this method, the data set is randomly partitioned into k mutually exclusive subsets of equal size. One subset is kept for testing while the others are used to train the model, and the same process is repeated for all k folds.
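The steps above can be sketched with `KFold` and `cross_val_score` (toy data; k = 4 here is an arbitrary choice):

```python
# K-fold cross-validation sketch: the data is split into k folds and
# each fold serves exactly once as the test set.
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X = [[i] for i in range(12)]
y = [0] * 6 + [1] * 6

kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print(len(scores))  # one accuracy score per fold -> 4
```

A large gap between training accuracy and the cross-validated scores is the usual sign of over-fitting.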
● Classification report:
1. Accuracy:
Accuracy is the ratio of correctly predicted observations to the total observations. A correct prediction that an occurrence is positive is known as a true positive, and a correct prediction that an occurrence is negative is known as a true negative.
2. F1-score:
The F1-score is the harmonic mean of precision and recall.
3. Precision and recall:
Precision is the fraction of relevant instances among the retrieved instances; it is computed by dividing the number of correctly classified data points by the total number of data points assigned that class label. Recall is the fraction of relevant instances that were retrieved out of the total number of relevant instances. Both are essentially measures of relevance.
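All four metrics can be computed directly from true and predicted labels; the small worked example below uses invented labels so each value can be checked by hand (TP = 3, FP = 2, FN = 1, TN = 2).

```python
# Metrics sketch (assumes scikit-learn): accuracy, precision, recall,
# and F1 computed from true vs. predicted labels.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

print(accuracy_score(y_true, y_pred))   # 5 correct of 8   -> 0.625
print(precision_score(y_true, y_pred))  # TP=3, FP=2       -> 0.6
print(recall_score(y_true, y_pred))     # TP=3, FN=1       -> 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of 0.6 and 0.75
```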
- Rasika Bhandarkavathe
- March 28, 2022