K-Means Clustering (With Python Implementation)

  • It is used to divide data into groups of similar data points.
  • As it is an unsupervised machine learning algorithm, it is not used for predictive modelling.

 

CLUSTERING

  • Clustering falls into the category of unsupervised learning.
  • Clustering is the task of dividing the population or data points into groups such that points in the same group are more similar to each other than to points in other groups.

 

We also have K-medians clustering, which is quite similar to K-means, but there are a few differences:

K-Means: 1. The mean of the cluster's data points is taken as the centroid.
2. The centroid may or may not belong to the sample.

K-Medians: 1. The median of the cluster's data points is taken as the centroid.
2. The centroid always belongs to the sample.
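The mean-versus-median distinction can be seen on a tiny hypothetical 1-D sample with one outlier (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical 1-D sample with one outlier
points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

mean_centroid = points.mean()        # 22.0 -- pulled toward the outlier,
                                     # and not itself a sample value
median_centroid = np.median(points)  # 3.0  -- a value from the sample itself
print(mean_centroid, median_centroid)
```

Note the mean lands far from most of the data, while the median stays on an actual sample value, which is why K-medians is more robust to outliers.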

 

WHAT K-MEANS DOES FOR YOU

 

Step-wise Working of K-Means Clustering

STEP 1: Choose the number K of clusters.

STEP 2: Select at random K points, the centroids(not necessarily from your dataset).

STEP 3: Assign each data point to the closest centroid – That forms K clusters.

STEP 4: Compute and place the new centroid of each cluster.

STEP 5: Reassign each data point to the new closest centroid.

If any reassignment took place, go to STEP 4; otherwise the model is finished.
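The five steps above can be sketched from scratch with NumPy. This is a minimal illustration, not production code: it initializes centroids from the data itself (the algorithm permits arbitrary points), and the toy blob coordinates are invented for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal from-scratch K-means following the five steps above."""
    rng = np.random.default_rng(seed)
    # STEP 2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # STEP 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # STEP 4: recompute each cluster's centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # STEP 5: stop when the centroids no longer move (no reassignment)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two obvious 2-D blobs; K = 2 should separate them
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centroids, labels = kmeans(X, k=2)
print(labels)
```

On this data the first three points end up in one cluster and the last three in the other, whichever labels the run assigns.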

 

(Figures here illustrated steps 1 to 5 on a sample dataset with K = 2, ending with the finished model.)

 

PERFORMANCE OF K-MEANS DEPENDS ON:
1. The number K of clusters.
2. The initial positions of the centroids.

 

RANDOM INITIALIZATION TRAP 

                                                                                   

The random initialization trap is a problem that occurs in the K-means algorithm: because the final clusters depend on where the initial centroids are placed, a poor random (or user-chosen) starting position can make the algorithm converge to an inconsistent, suboptimal clustering. The random initialization trap may therefore prevent us from discovering the correct clusters.

  • To avoid the random initialization trap, the usual solution is K-Means++, which seeds the centroids so that:
  • The first centroid is picked uniformly at random from the data.
  • Each subsequent centroid is picked with probability proportional to its squared distance from the nearest centroid already chosen, so the initial centroids end up spread far apart from each other.
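A minimal sketch of the K-means++ seeding rule described above (the toy coordinates are invented for the example):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of K-means++ seeding: each later centroid is chosen with
    probability proportional to its squared distance from the nearest
    centroid picked so far, spreading the seeds apart."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform random
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()              # far-away points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
print(kmeans_pp_init(X, k=2))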

WITHIN CLUSTERS SUM OF SQUARES(WCSS)

WCSS measures the compactness of the clusters: it is the sum, over all clusters, of the squared distances of each data point to its cluster's centroid. The lower the WCSS, the tighter the clusters, and increasing the number of clusters always decreases WCSS. (If there are 10 data points in a dataset and the number of clusters is 10, then WCSS = 0, since every point is its own centroid.)
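The definition translates directly into code. Here is a small worked example with two invented clusters of three points each (scikit-learn exposes the same quantity as `inertia_`):

```python
import numpy as np

# WCSS = sum over clusters of squared distances of points to their centroid.
# Hypothetical assignment: two clusters of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

wcss = 0.0
for j in np.unique(labels):
    cluster = X[labels == j]
    centroid = cluster.mean(axis=0)            # mean of the cluster's points
    wcss += ((cluster - centroid) ** 2).sum()  # squared distances to it
print(wcss)  # 8/3 ~= 2.667
```

Each cluster contributes 4/3 here, giving a total WCSS of 8/3.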

 


 

CHOOSING THE RIGHT NUMBER OF CLUSTERS

THE ELBOW METHOD

  • It is a graph of the distortion score (WCSS) against k.
  • Distortion is measured by WCSS, as discussed above.
  • The appropriate number of clusters k is chosen at the point where the curve bends like an elbow: WCSS drops steeply up to that k and only slowly beyond it.
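The elbow method can be sketched with NumPy alone by running a basic K-means for each k and printing the WCSS; the three synthetic blobs below are invented so the elbow should appear near k = 3. (In practice one would plot these values, e.g. with matplotlib, or use scikit-learn's `inertia_`.)

```python
import numpy as np

def wcss_for_k(X, k, n_iter=50, seed=0):
    """Run a basic K-means and return its WCSS (a sketch, not production code)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(1)
# Three well-separated blobs, so the curve should "elbow" near k = 3
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2))
               for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])

for k in range(1, 7):
    print(k, round(wcss_for_k(X, k), 1))
```

The printed WCSS falls sharply from k = 1 to k = 3 and then flattens, which is the elbow.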

 

ASSUMPTIONS

  • K-means rests on three assumptions; if any one of them is violated, it will perform poorly:
  • The variance of the distribution of each attribute (variable) is spherical;
  • All variables have the same variance;
  • The prior probability for all k clusters is the same, i.e. each cluster has roughly the same number of observations.
  • Clusters in K-means are defined by taking the mean of all the data points in the cluster.

ADVANTAGES

  • Relatively simple to implement.
  • Scales to large datasets.
  • Guarantees convergence.
  • Can warm-start the positions of the centroids.
  • Easily adapts to new examples.
  • Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

DISADVANTAGES

  • The user has to specify k (the number of clusters) at the start.
  • K-means can only handle numerical data.
  • The result depends on the initial centroid positions.
  • It struggles to cluster data of varying sizes and densities.
  • It is sensitive to outliers.
  • It scales poorly with the number of dimensions.

APPLICATIONS

  • Marketing: It can be used to characterize and discover customer segments for marketing purposes.
  • Biology: It can be used to group different species of plants and animals.
  • Libraries: It is used to cluster different books on the basis of topics and information.
  • Insurance: It is used to segment customers and their policies and to identify fraud.

 

Python Implementation of K-Means Clustering
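A short end-to-end sketch using scikit-learn's `KMeans` (assuming scikit-learn is installed). The two "customer segments" below are synthetic data invented for the example; `init="k-means++"` is the seeding discussed above, and `n_init=10` reruns the algorithm from ten different seeds and keeps the best result, guarding against the random initialization trap.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# Synthetic data standing in for a real dataset: two customer segments,
# e.g. (age, annual spend) -- invented values for illustration
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[20, 500], scale=[2, 50], size=(100, 2)),
               rng.normal(loc=[50, 100], scale=[2, 50], size=(100, 2))])

model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = model.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
print("centroids:\n", model.cluster_centers_)
print("WCSS (inertia):", round(model.inertia_, 1))
```

On this well-separated data the model recovers the two 100-point segments, and `inertia_` reports the WCSS defined earlier.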

 

  • Ankit Yadav
  • Dec, 27 2022
