Introduction to Classification Algorithms in Machine Learning

What is classification?



Classification is a supervised learning technique that organizes labeled data into distinct classes or categories. It can be performed on both structured and unstructured data, and the process starts with predicting the class of given data points. The classes are often referred to as targets, labels, or categories.

Binary classification refers to predicting one of two classes, while multi-class classification involves choosing among more than two classes. One of the most common applications of classification is filtering emails into “spam” or “non-spam”, as done by today’s top email service providers. In short, classification is the process of recognizing, understanding, and grouping objects and ideas into preset categories known as sub-populations.

Common classification models include logistic regression, decision trees, random forests, gradient-boosted trees, multilayer perceptrons, one-vs-rest, and Naïve Bayes.


Examples of classification:

1.     Predicting the gender of a person by his/her handwriting style.

2.     Given a list of symptoms, predicting whether a patient has a disease or not.

3.     Predicting whether next year’s monsoon will be normal.

4.     Predicting whether a house’s price will fall above or below a certain value based on its area (predicting the exact price itself is a regression task, not classification).


Some classification models:

●      Naïve Bayes:

Naïve Bayes is a classification algorithm based on Bayes’ theorem, which assumes that the predictors in a dataset are independent of one another. Even if the features actually depend on each other, each property is treated as contributing independently to the probability. A Naïve Bayes model is easy to build and is particularly useful for comparatively large data sets.

For example, if a banana is given, then the classifier will see that the fruit is of yellow color, oblong-shaped and long and tapered. All of these features will contribute independently to the probability of it being a banana and are not dependent on each other.
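As a minimal sketch of this idea, assuming scikit-learn is available, a Gaussian Naïve Bayes classifier can be fit in a few lines (the Iris dataset here is just an illustrative stand-in):

```python
# Sketch: Gaussian Naive Bayes with scikit-learn (assumes scikit-learn is installed)
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# The model treats each feature as an independent contributor
# to the class probability, per the "naive" independence assumption.
model = GaussianNB()
model.fit(X, y)

print(model.score(X, y))  # training accuracy
```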


●      Decision trees:

A decision tree is an algorithm that visually represents decision making by building the classification model in the form of a tree structure. It uses if-then rules that are mutually exclusive and collectively exhaustive. A decision tree is built by asking a yes/no question and splitting on the answer, which leads to further decisions: the question sits at a node, and the resulting decisions branch below it, ending at the leaves. A classic example is a tree that decides whether or not to play tennis.

In such a tree, the decision about whether to play tennis is made based on the weather conditions, the humidity, and the wind. Conventionally, the False branches lie on the left of the tree and the True branches go to the right. The topmost node in the decision tree, corresponding to the best predictor, is called the root node. The best thing about a decision tree is that it can handle both categorical and numerical data.
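A small sketch, assuming scikit-learn is available, shows how such a tree is fit; `max_depth=3` is an illustrative choice to keep the tree shallow and readable:

```python
# Sketch: a shallow decision tree classifier (assumes scikit-learn is installed)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each internal node asks a yes/no question about one feature;
# leaves hold the resulting class decisions.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.get_depth())   # depth of the fitted tree
print(tree.score(X, y))   # training accuracy
```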


●      K-Nearest Neighbors:

K-Nearest Neighbors is a classification and prediction algorithm that assigns data points to classes based on the distance between them. It is a lazy learning algorithm: rather than building an explicit model, it stores all instances corresponding to the training data in n-dimensional space.

The “K” is the number of neighbors the algorithm checks. K-Nearest Neighbors assumes that data points which are close to one another are likely to be similar, so a new data point is assigned the class that is most common among its K closest neighbors.
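A minimal sketch, assuming scikit-learn is available, with a tiny hand-made dataset (the points and labels are invented for illustration):

```python
# Sketch: K-Nearest Neighbors with K=3 (assumes scikit-learn is installed)
from sklearn.neighbors import KNeighborsClassifier

# Two small clusters of points: class 0 near the origin, class 1 near (5, 5)
X = [[0, 0], [0, 1], [5, 5], [6, 5]]
y = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Each query point is assigned the majority class of its 3 nearest neighbors
print(knn.predict([[0.5, 0.5], [5.5, 5.0]]))  # → [0 1]
```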


●      Random forest:

Random forest is an ensemble learning method for classification, regression, and other tasks. It is a meta-estimator that fits a number of decision trees on various subsamples of the data set and then aggregates their predictions (majority vote for classification, averaging for regression) to improve accuracy. Random forests are usually more accurate than single decision trees because the ensemble reduces over-fitting.
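A short sketch, assuming scikit-learn is available; `n_estimators=100` (the number of trees) is that library's default and is used here just to make the ensemble size explicit:

```python
# Sketch: a random forest classifier (assumes scikit-learn is installed)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each fit on a bootstrap subsample of the data;
# the final prediction is the majority vote across trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))  # number of trees in the ensemble
print(forest.score(X, y))       # training accuracy
```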


●      Evaluating a classifier:

After building a classifier, the most important step is to check its accuracy and efficiency. There are many ways to evaluate a classifier; some are given below.


1.    Holdout method:

This is the most common method to evaluate a classifier. The given data set is divided into two parts, a training set and a test set, typically 80% and 20% of the data respectively.

The training set is used to fit the model, and the unseen test set is used to measure its predictive power.
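The 80/20 holdout split described above can be sketched as follows, assuming scikit-learn is available (the choice of classifier and `random_state` are illustrative):

```python
# Sketch: 80/20 holdout evaluation (assumes scikit-learn is installed)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Reserve 20% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = clf.score(X_test, y_test)  # accuracy on data the model never saw
print(acc)
```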


2.    Cross-validation:

One of the most common problems with machine learning models is over-fitting. K-fold cross-validation can be used to check whether a model is over-fitted.

In this method, the data set is randomly partitioned into k mutually exclusive subsets of equal size. One subset is held out for testing while the others are used to train the model, and the process is repeated so that each of the k folds serves once as the test set.
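The k-fold procedure above can be sketched with scikit-learn's `cross_val_score` helper, assuming that library is available; k=5 is an illustrative choice:

```python
# Sketch: 5-fold cross-validation (assumes scikit-learn is installed)
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# The data is split into 5 folds; each fold is used once as the
# test set while the other 4 train the model, giving 5 scores.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # average accuracy across folds
```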


●      Classification report:

1.     Accuracy:

Accuracy is the ratio of correctly predicted observations to the total number of observations. A correct prediction that an instance is positive is called a true positive, and a correct prediction that an instance is negative is called a true negative.


2.     F1 -score:

It is the harmonic mean of precision and recall, balancing the two into a single score.


3.     Precision and recall:

Precision is the fraction of relevant instances among the retrieved instances. It is computed by dividing the number of true positives for a class by the total number of instances predicted as that class (true positives plus false positives).

Recall is the fraction of relevant instances that were retrieved out of the total number of relevant instances (true positives plus false negatives). Together, precision and recall are used as measures of relevance.
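The metrics above can be verified on a tiny invented example, assuming scikit-learn is available; the labels below are made up so the counts are easy to check by hand:

```python
# Sketch: computing the classification-report metrics by hand
# (assumes scikit-learn is installed; labels are invented for illustration)
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
# True positives = 2, false negatives = 1, false positives = 1, true negatives = 4

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 6/8 = 0.75
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(acc, prec, rec, f1)
```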





  • Rasika Bhandarkavathe
  • Mar, 28 2022
