In this video, we'll take a look at how you can use the scikit-learn library to implement PCA. These are the main steps. First, if your features take on very different ranges of values, you can perform pre-processing to scale the features so they take on comparable ranges of values. If you were looking at the features of different countries, those features take on very different ranges of values: GDP could be in trillions of dollars, whereas other features are less than 100. Feature scaling in applications like that is important to help PCA find a good choice of axes for you.

The next step is to run the PCA algorithm to "fit" the data and obtain two or three new axes, Z_1, Z_2, and maybe Z_3. Here I'm assuming you want two or three axes because you want to visualize the data in 2D or 3D. If you have an application where you want more than two or three axes, the PCA implementation can give you those as well; it's just that the result is then harder to visualize. In scikit-learn, you use the fit function, or the fit method, to do this. The fit function in PCA automatically carries out mean normalization, meaning it subtracts out the mean of each feature, so you don't need to perform mean normalization separately. After running the fit function, you get the new axes Z_1, Z_2, and maybe Z_3. In PCA, we also call these the principal components, where Z_1 is the first principal component, Z_2 the second principal component, and Z_3 the third principal component.

After that, I recommend taking a look at how much each of these new axes, or principal components, explains the variance in your data. I'll show a concrete example of what this means on the next slide, but this lets you get a sense of whether projecting the data onto these axes retains most of the variability, or most of the information, in the original dataset. This is done using the explained_variance_ratio_ attribute. Finally, you can transform, meaning project, the data onto the new axes, onto the new principal components, which you do with the transform method. Then each training example is reduced to just two or three numbers, and you can plot those two or three numbers to visualize your data.

In detail, this is what PCA in code looks like. Here's the dataset X with six examples: X equals a NumPy array with the six examples shown here. To run PCA and reduce this data from two numbers, X_1 and X_2, down to just one number Z, you run PCA and ask it to fit one principal component, so n_components here is equal to one, and then fit PCA to X. pca_1 here is my notation for PCA with a single principal component, that is, with a single axis. It turns out that if you print out pca_1.explained_variance_ratio_, it is 0.992. This tells you that in this example, when you choose one axis, it captures 99.2 percent of the variability, or of the information, in the original dataset. Finally, if you want to take each of these training examples and project it down to a single number, you call the transform method, and it outputs an array with six numbers corresponding to your six training examples. For example, the first training example, (1, 1), projected onto the Z-axis gives this number, 1.383, and so on. So if you were to visualize this dataset using just one dimension, this is the number I would use to represent the first example, the second example is projected to be the next number, and so on.
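Here is a minimal sketch of the workflow just described. The transcript only shows that the first training example is (1, 1); the remaining five rows below are an assumption, chosen to be consistent with the quoted results of roughly 0.992 explained variance and a first projection of about 1.383.

```python
import numpy as np
from sklearn.decomposition import PCA

# Six 2-D training examples (assumed values; only the first, (1, 1), appears in the transcript).
# If the features were on very different scales, you would scale them first
# (for example with sklearn.preprocessing.StandardScaler) before fitting PCA.
X = np.array([[ 1,  1],
              [ 2,  1],
              [ 3,  2],
              [-1, -1],
              [-2, -1],
              [-3, -2]])

# Fit PCA with a single principal component (Z_1).
# fit() also carries out mean normalization internally (subtracts each feature's mean).
pca_1 = PCA(n_components=1)
pca_1.fit(X)

# Fraction of the variance in X captured by the single new axis (about 0.992 here).
print(pca_1.explained_variance_ratio_)

# Project each training example onto Z_1: six 2-D examples become six single numbers.
X_trans_1 = pca_1.transform(X)
print(X_trans_1)  # the first example, (1, 1), maps to roughly 1.383
```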
I hope you take a look at the optional lab, where you'll see that these six examples have been projected down onto this axis, onto this line, so that all six examples now lie on the line shown here. The first training example, which was (1, 1), has been mapped to this point, which is at a distance of 1.38 from the origin; that's why its value is 1.38. Just one more quick example. This data is two-dimensional, and we reduced it to one dimension. What if you were to compute two principal components? You would start with two dimensions and also end up with two dimensions. This isn't that useful for visualization, but it might help us better understand how PCA, and the code for PCA, works. Here's the same code, except that I've changed n_components to two.
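A sketch of the two-component variant just described, reusing the same assumed dataset as above. With n_components=2 the two ratios together account for all of the variance (roughly [0.992, 0.008] for this data, an assumption based on the quoted 0.992), and transform returns two coordinates per example instead of one.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same assumed six-example dataset as in the previous snippet.
X = np.array([[1, 1], [2, 1], [3, 2], [-1, -1], [-2, -1], [-3, -2]])

# Keep both principal components this time.
pca_2 = PCA(n_components=2)
pca_2.fit(X)

# One ratio per component; together they sum to 1.0.
print(pca_2.explained_variance_ratio_)

# Each example now gets two coordinates (Z_1, Z_2). This is a rotation of the data
# rather than a reduction, which is why it is less useful for visualization.
print(pca_2.transform(X))
```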