How does PCA work? Suppose you have a dataset with two features, x_1 and x_2. Initially your data is plotted, or represented, using the axes x_1 and x_2. But you want to replace these two features with just one feature. How can you choose a new axis, let's call it the z-axis, that is somehow a good axis for capturing, or representing, the data? Let's take a look at how PCA does this.

Here's a dataset with five training examples. Remember, this is an unsupervised learning algorithm, so we just have x_1 and x_2; there is no label y. An example here might have coordinates x_1 = 10 and x_2 = 8. If we don't want to use the x_1, x_2 axes, how can we pick some different axis with which to capture, or represent, what's in the data?

One note on preprocessing: before applying the next few steps of PCA, the features should first be normalized to have zero mean, and I've already done that here. If the features x_1 and x_2 take on very different scales, you should also perform feature scaling first. For example, remember our housing example: if x_1 was the size of a house in square feet, it could be 1,000 or a couple of thousand, whereas x_2, the number of bedrooms, is a small number. So assume the features have been normalized to have zero mean (subtract the mean from each feature) and, if needed, feature scaled as well, so the ranges are not too far apart.

What does PCA do next? To examine what PCA does, let me remove the x_1 and x_2 axes so that we're just left with the five training examples. This dot here represents the origin, the position of zero on this plot. What PCA has to do now is pick one axis, instead of the two axes we had previously, with which to capture what's important about these five examples.

Suppose we choose this axis to be our new z-axis; it happens to be the same as the x_1 axis in this example. Then for this example we capture just this value, its coordinate on the z-axis; for the second example we capture this value; and so on for all five examples. Another way of saying this is that we take each of these examples and project it down to a point on the z-axis. The word "project" means taking the example and bringing it to the z-axis along a line segment that meets the z-axis at a 90-degree angle; the little box here denotes that the line segment is at 90 degrees to the z-axis.

Picking this direction as the z-axis is not a bad choice, but there are some even better choices. This choice isn't too bad because when you project the examples onto the z-axis, you still capture quite a lot of the spread of the data: these five projected points are pretty spread apart, so the variance, or variation, among the projections onto the z-axis is decently large. That means we're still capturing quite a lot of the information in the original five examples.
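To make the projection step concrete, here is a minimal sketch in NumPy. The five points are made-up values just for illustration (only the first example's coordinates, x_1 = 10 and x_2 = 8, come from the lecture); projecting a zero-mean point onto a unit-length direction is simply their dot product.

import numpy as np

# Five made-up training examples with two features, x_1 and x_2.
X = np.array([[10.0, 8.0],
              [ 2.0, 1.0],
              [ 7.0, 6.0],
              [ 1.0, 2.0],
              [ 5.0, 8.0]])

# Preprocessing: normalize each feature to have zero mean.
X_centered = X - X.mean(axis=0)

# A candidate z-axis; here it is simply the x_1 direction (a unit vector).
u = np.array([1.0, 0.0])

# Project each example onto the z-axis: the projection of a zero-mean
# point onto a unit vector is just their dot product.
z = X_centered @ u

print("projections onto z:", z)
print("variance of projections:", z.var())

Try replacing u with other unit vectors: the variance of z changes with the direction, which is exactly the comparison the different choices of z-axis below illustrate.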
Let's look at some other possible choices for the axis z. Here's another choice, and it's actually not a great one. If I were to choose this as my z-axis and take those same five examples and project them down onto it, I'd end up with these five points. Notice that, compared to the previous choice, these five points are quite squished together: the amount they differ from each other, their variance or variation, is much less. What this means is that with this choice of z, you're capturing much less of the information in the original dataset, because you've partially squished all five examples together.

Let's look at one last choice, which is choosing this to be the z-axis. This is actually a better choice than the previous two, because if we take the data's projections onto this z-axis, these dots over here are quite far apart. We're capturing a lot of the variation, a lot of the information, in the original dataset, even though we're now using just one coordinate, one number, to represent each training example instead of two coordinates, x_1 and x_2. In the PCA algorithm, this axis is called the principal component.
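The lecture doesn't spell out how PCA actually finds this best axis, but one standard way to compute the principal component is to take the eigenvector of the data's covariance matrix with the largest eigenvalue; projections onto that direction have the largest possible variance. Here is a minimal sketch under that assumption, reusing the same made-up points as before.

import numpy as np

# The same made-up, two-feature points as in the earlier sketch.
X = np.array([[10.0, 8.0],
              [ 2.0, 1.0],
              [ 7.0, 6.0],
              [ 1.0, 2.0],
              [ 5.0, 8.0]])
X_centered = X - X.mean(axis=0)

# Covariance matrix of the centered data (2 x 2 for two features).
cov = np.cov(X_centered, rowvar=False)

# eigh handles symmetric matrices; the columns of eigvecs are eigenvectors.
eigvals, eigvecs = np.linalg.eigh(cov)
pc = eigvecs[:, np.argmax(eigvals)]  # direction with the largest eigenvalue

# Projecting onto the principal component maximizes projection variance.
z = X_centered @ pc
print("principal component direction:", pc)
print("variance of projections:", z.var())

In practice you would usually call a library routine such as scikit-learn's PCA(n_components=1) rather than writing this out by hand; it finds the same direction (up to sign).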