We are given a data set of items, with certain features, and values for these features (like a vector). The task is to categorize those items into groups. To achieve this, we will use the kMeans algorithm; an unsupervised learning algorithm.
(It will help if you think of items as points in an n-dimensional space). The algorithm will categorize the items into k groups of similarity. To calculate that similarity, we will use the euclidean distance as measurement.
The algorithm works as follows:
- First we initialize k points, called means, randomly.
- We categorize each item to its closest mean and we update the mean’s coordinates, which are the averages of the items categorized in that mean so far.
- We repeat the process for a given number of iterations and at the end, we have our clusters.
The “points” mentioned above are called means, because they hold the mean values of the items categorized in it. To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature x the items have values in [0,3], we will initialize the means with values for x at [0,3]).
The above algorithm in pseudocode:
Initialize k means with random values For a given number of iterations: Iterate through items: Find the mean closest to the item Assign item to mean Update mean
We receive input as a text file (‘data.txt’). Each line represents an item, and it contains numerical values (one for each feature) split by commas. You can find a sample data set here.
We will read the data from the file, saving it into a list. Each element of the list is another list containing the item values for the features. We do this with the following function:
We want to initialize each mean’s values in the range of the feature values of the items. For that, we need to find the min and max for each feature. We accomplish that with the following function:
The variables minima, maxima are lists containing the min and max values of the items respectively. We initialize each mean’s feature values randomly between the corresponding minimum and maximum in those above two lists:
We will be using the euclidean distance as a metric of similarity for our data set (note: depending on your items, you can use another similarity metric).
To update a mean, we need to find the average value for its feature, for all the items in the mean/cluster. We can do this by adding all the values and then dividing by the number of items, or we can use a more elegant solution. We will calculate the new average without having to re-add all the values, by doing the following:
m = (m*(n-1)+x)/n
where m is the mean value for a feature, n is the number of items in the cluster and x is the feature value for the added item. We do the above for each feature to get the new mean.
Now we need to write a function to classify an item to a group/cluster. For the given item, we will find its similarity to each mean, and we will classify the item to the closest one.
To actually find the means, we will loop through all the items, classify them to their nearest cluster and update the cluster’s mean. We will repeat the process for some fixed number of iterations. If between two iterations no item changes classification, we stop the process as the algorithm has found the optimal solution.
The below function takes as input k (the number of desired clusters), the items and the number of maximum iterations, and returns the means and the clusters. The classification of an item is stored in the array belongsTo and the number of items in a cluster is stored in clusterSizes.
Finally we want to find the clusters, given the means. We will iterate through all the items and we will classify each item to its closest cluster.
The other popularly used similarity measures are:-
1. Cosine distance: It determines the cosine of the angle between the point vectors of the two points in the n dimensional space
2. Manhattan distance: It computes the sum of the absolute differences between the co-ordinates of the two data points.
3. Minkowski distance: It is also known as the generalised distance metric. It can be used for both ordinal and quantitative variables
You can find the entire code on my GitHub, along with a sample data set and a plotting function. Thanks for reading.
This article is contributed by Antonis Maronikolakis. If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to firstname.lastname@example.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above.
Attention reader! Don’t stop learning now. Get hold of all the important DSA concepts with the DSA Self Paced Course at a student-friendly price and become industry ready. To complete your preparation from learning a language to DS Algo and many more, please refer Complete Interview Preparation Course.
In case you wish to attend live classes with industry experts, please refer DSA Live Classes