Data Science

Complete guide to Principal Component Analysis (PCA)

The Curse of Dimensionality: This occurs when a dataset has a very large number of features, which can cause overfitting and computational costs that grow exponentially with the number of features. K-Nearest-Neighbors is one of the learning algorithms most affected by the curse of dimensionality.
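One way to see why K-Nearest-Neighbors suffers is to look at how distances behave as features are added. The sketch below is only an illustration on uniformly random points (the function name distance_contrast and the sample sizes are my own assumptions, not part of any library): as the number of features grows, the nearest and the farthest point end up at almost the same distance, so "nearest" carries less and less meaning.

```python
import numpy as np

def distance_contrast(n_features, n_points=500, seed=0):
    """Ratio of the nearest to the farthest distance from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_features))  # uniform points in the unit cube
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

# A ratio close to 1 means the nearest and farthest neighbours are barely distinguishable.
for n_features in (2, 10, 100, 1000):
    print(n_features, round(distance_contrast(n_features), 3))
```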

Dimensionality Reduction: This is a very important concept in machine learning. It is a technique used by machine learning engineers to reduce the number of features in the input dataset; another way to think of it is as transforming the input data from a high-dimensional space to a lower-dimensional one. Of course, this transformation results in some loss of information, but most of the information is preserved, and the part that is lost is usually treated as noise.

There are many methods of Dimensionality Reduction:

Principal Component Analysis
Linear Discriminant Analysis
Generalized Discriminant Analysis, etc.

In this blog I will explain Principal Component Analysis (PCA):

Principal Component Analysis is an unsupervised machine learning technique that reduces the number of features of a given dataset. It does this by considering the variance of the dataset: it selects the axes that preserve the maximum amount of variance, since projections onto these axes retain more information than projections onto any other axes.
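To make the idea of "the axis that preserves the most variance" concrete, here is a small sketch on a made-up two-feature dataset (the data and the variable names are assumptions for illustration only). It compares the variance of the data projected on the first principal axis with the variance along each original feature.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D dataset: the second feature is strongly correlated with the first.
x = rng.normal(size=1000)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=1000)])
centred = data - data.mean(axis=0)

# eigh returns eigenvalues in ascending order, so the last eigenvector
# points along the direction of largest variance.
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(centred, rowvar=False))
first_axis = eigen_vectors[:, -1]

var_on_first_axis = np.var(centred @ first_axis, ddof=1)
var_on_feature_0 = np.var(centred[:, 0], ddof=1)
var_on_feature_1 = np.var(centred[:, 1], ddof=1)
print(var_on_first_axis, var_on_feature_0, var_on_feature_1)
```

The variance along the first principal axis equals the largest eigenvalue of the covariance matrix, and it is never smaller than the variance along any single original feature.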

Principal Component Analysis steps:

First, we subtract the mean of each feature from the dataset to centre the data around the origin.

Second, we calculate the covariance matrix of the centred data.

Third, we compute the eigendecomposition of the covariance matrix. It returns a vector of eigenvalues and a matrix whose columns are the corresponding eigenvectors.

Fourth, we choose how many components to keep using a factor 'Fr', the fraction of the information (variance) that is retained after projecting the data. Sort the eigenvalues in descending order, let 'S' be the sum of all eigenvalues and 'R' be the sum of the first 'd' eigenvalues; then 'Fr' = R/S. For example, if we want 'Fr' to be 0.9, we pick the smallest 'd' for which R/S reaches at least 0.9. These 'd' eigenvalues are the main eigenvalues of the dataset.

Fifth, we take the eigenvectors that correspond to the main eigenvalues and place them as columns in a matrix 'W'.

Finally, we project the dataset onto the new axes by multiplying the centred dataset by 'W'.
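The same steps can be written compactly as equations. The symbols Fr, S, R, d, W and X_projected come from the description above; n (number of samples), p (number of features), \mu (the vector of feature means), \lambda_i and v_i (the eigenvalues and eigenvectors) are notation I am assuming here.

```latex
\begin{align}
  \tilde{X} &= X - \mathbf{1}\mu^{\top}
      && \text{centre each feature} \\
  C &= \frac{1}{n}\,\tilde{X}^{\top}\tilde{X}
      && \text{covariance matrix } (p \times p) \\
  C\,v_i &= \lambda_i v_i, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p
      && \text{eigendecomposition, sorted} \\
  S &= \sum_{i=1}^{p} \lambda_i, \qquad R = \sum_{i=1}^{d} \lambda_i, \qquad
  \mathrm{Fr} = \frac{R}{S}
      && \text{retained fraction} \\
  W &= \begin{bmatrix} v_1 & v_2 & \cdots & v_d \end{bmatrix}
      && \text{main eigenvectors } (p \times d) \\
  X_{\text{projected}} &= \tilde{X}\,W
      && \text{projection } (n \times d)
\end{align}
```

As a small made-up example, suppose the sorted eigenvalues are 5, 3, 1.5 and 0.5, so S = 10. With d = 2 we get R = 8 and Fr = 0.8, which is below a target of 0.9; with d = 3 we get R = 9.5 and Fr = 0.95, so three components are kept.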

After projection:

After projection, the new points in 'X_projected' retain most of the information in the data, while the number of dimensions has been reduced considerably. As I said before, PCA is an unsupervised technique, so after all these steps we still have not built a model that predicts anything. However, thanks to PCA, the dataset is now considerably easier to work with and requires much less computational power. The next step is to train a machine learning algorithm on the projected dataset, such as:

K-NN
Linear Regression
Decision Trees
Random Forest
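As a rough sketch of this last step, the example below chains PCA with K-NN using scikit-learn. The digits dataset, the choice of 16 components and the 5 neighbours are assumptions I picked only to have something runnable; they are not tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Stand-in dataset (assumption): 8x8 digit images, 64 features per sample.
features, labels = load_digits(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

# PCA is fitted on the training split only; K-NN is then trained on the projected points.
model = make_pipeline(PCA(n_components=16), KNeighborsClassifier(n_neighbors=5))
model.fit(train_x, train_y)

accuracy = model.score(test_x, test_y)
print(f"test accuracy with 16 of 64 dimensions: {accuracy:.3f}")
```

Putting both steps in one pipeline means the test data is projected with the axes learned from the training data, which avoids leaking information from the test split.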

Code for PCA:

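import numpy as np

train_x , test_x = load_dataset()

# Centre each feature (column) around zero
x_mean = np.mean(train_x, axis=0)
centred_data = train_x - x_mean

# Covariance matrix of the features
cov_matrix = (1 / train_x.shape[0]) * np.matmul(np.transpose(centred_data), centred_data)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)

# Sort the eigenvalues (and their eigenvectors) in descending order
idx = eigen_values.argsort()[::-1]
eigen_values = eigen_values[idx]
eigen_vectors = eigen_vectors[:, idx]

sum_of_eigen_values = eigen_values.sum()

# Keep adding eigenvalues until at least 80% of the variance is retained
fr = 0
counter = 0
for i in eigen_values:
    fr = i + fr
    counter = counter + 1
    if fr / sum_of_eigen_values >= 0.8:
        break

# W: the eigenvectors of the 'counter' largest eigenvalues
reduced_eigen_vectors = eigen_vectors[:, :counter]

# Project the centred data onto the new axes
reduced_data = np.dot(centred_data, reduced_eigen_vectors)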
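One detail that is easy to miss is how to treat the test set: it should be centred with the training mean and projected with the same 'W'. The sketch below wraps the steps in a hypothetical helper fit_pca and runs it on synthetic data (both are my own assumptions, since load_dataset() is not defined here). It also reconstructs the test points to measure how much information the projection discards.

```python
import numpy as np

def fit_pca(train_x, target_fr=0.8):
    """Return the training mean and the projection matrix W (hypothetical helper)."""
    x_mean = train_x.mean(axis=0)
    centred = train_x - x_mean
    cov_matrix = centred.T @ centred / train_x.shape[0]
    eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)
    order = np.argsort(eigen_values)[::-1]          # descending order
    eigen_values, eigen_vectors = eigen_values[order], eigen_vectors[:, order]
    fr = np.cumsum(eigen_values) / eigen_values.sum()
    # Smallest d whose retained fraction reaches target_fr
    d = min(int(np.searchsorted(fr, target_fr)) + 1, len(eigen_values))
    return x_mean, eigen_vectors[:, :d]

# Synthetic stand-in data: 5 features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
data = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(300, 5))
train_x, test_x = data[:200], data[200:]

x_mean, W = fit_pca(train_x)

# Reuse the training mean and W on the test set
test_projected = (test_x - x_mean) @ W

# Map back to the original space to see what was lost
test_reconstructed = test_projected @ W.T + x_mean
reconstruction_error = np.mean((test_x - test_reconstructed) ** 2)
baseline_error = np.mean((test_x - x_mean) ** 2)
print(W.shape, reconstruction_error, baseline_error)
```

The reconstruction error is the information discarded by the projection; comparing it with the error of predicting only the mean gives a feel for how much of the data the kept components explain.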

Luckily for us, the scikit-learn library already implements PCA, so we can skip writing the algorithm ourselves and simply use it:

from sklearn.decomposition import PCA

train_x , test_x = load_dataset()

# Keep the first 2 principal components
model = PCA(n_components = 2)

x2d = model.fit_transform(train_x)
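scikit-learn can also pick the number of components from a target fraction of variance, similar to the 'Fr' factor described earlier. A minimal sketch, assuming synthetic data in place of load_dataset(): passing a float between 0 and 1 as n_components together with svd_solver="full" keeps just enough components to exceed that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for train_x (assumption): 200 samples, 10 features driven by 3 latent factors.
latent = rng.normal(size=(200, 3))
train_x = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# Keep enough components to explain at least 80% of the variance
model = PCA(n_components=0.8, svd_solver="full")
x_reduced = model.fit_transform(train_x)

print(model.n_components_)                       # number of components kept
print(model.explained_variance_ratio_.cumsum())  # retained fraction per added component
```

After fitting, model.transform(test_x) would project new data with the same axes.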

Conclusion:

In this blog I have explained how PCA works, how to implement it, and how useful it is as a preprocessing step for machine learning techniques.

Karim Akmal
Mar 27, 2022