I'd like to share with you some practical tips for developing an anomaly detection system. One of the key ideas is that if you have a way to evaluate a system, even as it's being developed, you'll be able to make decisions, change the system, and improve it much more quickly. Let's take a look at what that means.

When you are developing a learning algorithm, say choosing different features or trying different values of parameters like epsilon, making decisions about whether or not to change a feature in a certain way, or to increase or decrease epsilon or other parameters, is much easier if you have a way of evaluating the learning algorithm. This is sometimes called real-number evaluation, meaning that if you can quickly change the algorithm in some way, such as changing a feature or a parameter, and have a way of computing a number that tells you whether the algorithm got better or worse, then it becomes much easier to decide whether or not to keep that change. This is how it's often done in anomaly detection.

Even though we've mainly been talking about unlabeled data, I'm going to change that assumption a bit and assume that we have some labeled data, including a small number of previously observed anomalies. Maybe after making airplane engines for a few years, you've seen a few airplane engines that were anomalous. For examples that you know are anomalous, I'm going to assign the label y equals 1 to indicate that they are anomalous, and for examples that we think are normal, I'm going to assign the label y equals 0. The training set that the anomaly detection algorithm learns from is still the unlabeled training set x1 through xm, and I'm going to think of all of these examples as ones that we'll just assume are normal and not anomalous, so y is equal to 0. In practice, if a few anomalous examples happen to slip into this training set, your algorithm will still usually do okay.

To evaluate your algorithm, that is, to come up with a real-number evaluation, it turns out to be very useful to have a small number of anomalous examples so that you can create a cross-validation set, which I'm going to denote x_cv^(1), y_cv^(1) through x_cv^(m_cv), y_cv^(m_cv). This is similar notation to what you saw in the second course of this specialization. Similarly, have a test set of some number of examples, where both the cross-validation and test sets hopefully include a few anomalous examples. In other words, the cross-validation and test sets will have a few examples with y equals 1, but also a lot of examples where y is equal to 0. Again, in practice, the anomaly detection algorithm will work okay even if a few examples that are actually anomalous were accidentally labeled with y equals 0.

Let's illustrate this with the aircraft engine example. Say you have been manufacturing aircraft engines for years, and you've collected data from 10,000 good or normal engines, but over the years you have also collected data from 20 flawed or anomalous engines. Usually the number of anomalous engines, that is, examples with y equals 1, will be much smaller. It would not be atypical to apply this type of algorithm with anywhere from, say, 2 to 50 known anomalies. We're going to take this dataset and break it up into a training set, a cross-validation set, and a test set. Here's one example: I'm going to put 6,000 good engines into the training set.
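To make this concrete, here is a minimal sketch in NumPy of what such a split might look like, using the 10,000 good and 20 anomalous engines just mentioned. The placeholder feature arrays, the two-feature shape, and the exact allocation (which follows the breakdown described in the next paragraph) are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Placeholder data standing in for real engine features (2 features per engine);
# in practice these arrays would come from your own measurements.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(10000, 2))   # 10,000 good engines (y = 0)
X_anomalous = rng.normal(loc=4.0, scale=1.0, size=(20, 2))   # 20 flawed engines (y = 1)

rng.shuffle(X_normal)  # with real data, shuffle before splitting

# Training set: 6,000 good engines; the model is fit on these without using labels.
X_train = X_normal[:6000]

# Cross-validation set: 2,000 good engines plus 10 known anomalies, with labels y.
X_cv = np.vstack([X_normal[6000:8000], X_anomalous[:10]])
y_cv = np.concatenate([np.zeros(2000), np.ones(10)])

# Test set: a separate 2,000 good engines plus the remaining 10 anomalies.
X_test = np.vstack([X_normal[8000:10000], X_anomalous[10:]])
y_test = np.concatenate([np.zeros(2000), np.ones(10)])
```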
Again, if a couple of anomalous engines slipped into this set, that's actually okay; I wouldn't worry too much about that. Then let's put 2,000 good engines and 10 of the known anomalies into the cross-validation set, and a separate 2,000 good and 10 anomalous engines into the test set. What you can do then is train the algorithm on the training set, that is, fit the Gaussian distributions to these 6,000 examples, and then on the cross-validation set see how many of the anomalous engines it correctly flags. For example, you could use the cross-validation set to tune the parameter epsilon, setting it higher or lower depending on whether the algorithm reliably detects these 10 anomalies without flagging too many of the 2,000 good engines as anomalies.
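Continuing from the split sketch above, here is one possible way to fit per-feature Gaussians on the training set and use the labeled cross-validation set to choose epsilon. The helper function names and the use of an F1 score as the single evaluation number are assumptions for illustration, not the only reasonable choices.

```python
def fit_gaussian(X):
    """Estimate the mean and variance of each feature from the unlabeled training set."""
    return X.mean(axis=0), X.var(axis=0)

def gaussian_prob(X, mu, var):
    """p(x) under an independent (diagonal) Gaussian model: product over the features."""
    coeff = 1.0 / np.sqrt(2.0 * np.pi * var)
    exponent = -((X - mu) ** 2) / (2.0 * var)
    return np.prod(coeff * np.exp(exponent), axis=1)

def select_epsilon(p_cv, y_cv):
    """Sweep candidate thresholds and keep the one with the best F1 score on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = (p_cv < eps).astype(int)           # flag as anomaly when p(x) < epsilon
        tp = np.sum((preds == 1) & (y_cv == 1))
        fp = np.sum((preds == 1) & (y_cv == 0))
        fn = np.sum((preds == 0) & (y_cv == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

mu, var = fit_gaussian(X_train)                    # fit on the 6,000 good engines only
eps, f1 = select_epsilon(gaussian_prob(X_cv, mu, var), y_cv)
print(f"chosen epsilon = {eps:.3e}, CV F1 = {f1:.3f}")
```

With epsilon chosen on the cross-validation set, the held-out test set can then be scored once with the same threshold to estimate how the detector would perform on engines it has never seen.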