Complete Playlist of Unsupervised Machine Learning https://www.youtube.com/playlist?list=PLfQLfkzgFi7azUjaXuU0jTqg03kD-ZbUz

The K-means algorithm requires as one of its inputs K, the number of clusters you want it to find, but how do you decide how many clusters to use? Do you want two clusters, or three clusters, or five clusters, or 10 clusters? Let's take a look. For a lot of clustering problems, the right value of K is truly ambiguous. If I were to show different people the same data set and ask, how many clusters do you see? There will definitely be people who will say it looks like there are two distinct clusters, and they will be right. There would also be others who will see four distinct clusters. They would also be right. Because clustering is an unsupervised learning algorithm, you're not given the "right answers" in the form of specific labels to try to replicate. There are lots of applications where the data itself does not give a clear indicator of how many clusters there are in it. I think it truly is ambiguous whether this data has two or four, or maybe three clusters, if you take, say, the red one here and the two blue ones here. If you look at the academic literature on K-means, there are a few techniques to try to automatically choose the number of clusters to use for a certain application. I'll briefly mention one here that you may see others refer to, although I have to say, I personally do not use this method myself. One way to try to choose the value of K is called the elbow method: you run K-means with a variety of values of K and plot the cost function, or distortion function, J as a function of the number of clusters. What you find is that when you have very few clusters, say one cluster, the distortion function or cost function J will be high, and as you increase the number of clusters, it will go down, maybe as follows. If the curve looks like this, you say, well, it looks like the cost function is decreasing rapidly until we get to three clusters, but it decreases more slowly after that.
Let's choose K equals 3. This is called an elbow, by the way, because you can think of it as analogous to an arm, where that's your hand and that's your elbow over here. Plotting the cost function as a function of K could help you gain some insight. I personally hardly ever use the elbow method myself to choose the right number of clusters, because I think for a lot of applications the right number of clusters is truly ambiguous, and you find that a lot of cost functions look like this, where it just decreases smoothly and doesn't have a clear elbow by which you could pick the value of K. By the way, one technique that does not work is to choose K so as to minimize the cost function J, because doing so would cause you to almost always just choose the largest possible value of K, since having more clusters will pretty much always reduce the cost function J. Choosing K to minimize the cost function J is not a good technique. How do you choose the value of K in practice? Often you're running K-means in order to get clusters to use for some later or downstream purpose. That is, you're going to take the clusters and do something with those clusters. What I usually do, and what I recommend you do, is to evaluate K-means based on how well it performs for that later downstream purpose. Let me illustrate with the example of t-shirt sizing. One thing you could do is run K-means on this data set to find three clusters, in which case you may find clusters like that, and this would be how you size your small, medium, and large t-shirts. But how many t-shirt sizes should there be? Well, it's ambiguous. If you were to also run K-means with five clusters, you might get clusters that look like this. This would let you size t-shirts according to extra small, small, medium, large, and extra large.
Both of these are completely valid and completely fine groupings of the data into clusters, but whether you want to use three clusters or five clusters can now be decided based on what makes sense for your t-shirt business. There's a trade-off between how well the t-shirts will fit, depending on whether you have three sizes or five sizes, and the extra costs associated with manufacturing and shipping five types of t-shirts instead of three.
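
To make the elbow discussion concrete, here is a minimal sketch in plain Python of computing the cost function J for a range of values of K. Everything in it is an assumption for illustration and is not from the lecture: the synthetic three-blob data set, the helper names `make_blobs` and `kmeans_cost`, and taking J to be the average squared distance from each point to its nearest centroid.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, k, n_init=10, n_iter=50, seed=0):
    """Run K-means from n_init random starts and return the lowest J found.

    J is assumed here to be the average squared distance from each point
    to its nearest centroid.
    """
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_init):
        # Initialize centroids at k distinct, randomly chosen training points.
        centroids = rng.sample(points, k)
        for _ in range(n_iter):
            # Assignment step: put each point in the cluster of its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                idx = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                clusters[idx].append(p)
            # Update step: move each centroid to the mean of its points
            # (an empty cluster keeps its previous centroid).
            new_centroids = [
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[j]
                for j, members in enumerate(clusters)
            ]
            if new_centroids == centroids:
                break  # converged
            centroids = new_centroids
        cost = sum(
            min(squared_distance(p, c) for c in centroids) for p in points
        ) / len(points)
        best = min(best, cost)
    return best


def make_blobs(seed=1):
    """Hypothetical 2-D data: three Gaussian blobs of 50 points each."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
    return [
        (rng.gauss(cx, 0.7), rng.gauss(cy, 0.7))
        for cx, cy in centers
        for _ in range(50)
    ]


points = make_blobs()
costs = {k: kmeans_cost(points, k) for k in range(1, 9)}
for k, j in costs.items():
    print(f"K={k}: J={j:.3f}")
```

On data like this you would expect J to drop sharply up to K equals 3 and then keep shrinking more slowly as K grows. That continued shrinking is why picking K purely by minimizing J tends to select the largest K you try, and why the final choice is better made by how the clusters serve the downstream purpose.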
