In the learning algorithm we developed, even while you're still learning how to approximate Q(s,a), you need to take some actions in the lunar lander. How do you pick those actions while you're still learning? The most common way to do so is to use something called an epsilon-greedy policy. Let's take a look at how that works.

Here's the algorithm that you saw earlier. One of the steps in the algorithm is to take actions in the lunar lander. While the learning algorithm is still running, we don't really know the best action to take in every state. If we did, we'd already be done learning. But even while we're still learning and don't have a very good estimate of Q(s,a) yet, how do we take actions in this step of the learning algorithm? Let's look at some options.

When you're in some state s, we might not want to take actions totally at random, because that will often be a bad action. One natural option, call it option 1, would be: whenever you're in state s, pick the action a that maximizes Q(s,a). We may say, even if Q(s,a) is not a great estimate of the Q function, let's just do our best, use our current guess of Q(s,a), and pick the action a that maximizes it. It turns out this may work okay, but it isn't the best option.

Instead, here's what is commonly done. Here's option 2: most of the time, let's say with probability 0.95, pick the action that maximizes Q(s,a). Most of the time we try to pick a good action using our current guess of Q(s,a). But a small fraction of the time, let's say five percent of the time, we'll pick an action a randomly.

Why do we want to occasionally pick an action randomly? Well, here's why. Suppose that, for some strange reason, Q(s,a) was initialized randomly so that the learning algorithm thinks that firing the main thruster is never a good idea. Maybe the neural network parameters were initialized so that Q(s, main) is always very low. If that's the case, then because the neural network is trying to pick the action a that maximizes Q(s,a), it will never ever try firing the main thruster. And because it never ever tries firing the main thruster, it will never learn that firing the main thruster is actually sometimes a good idea. If, because of the random initialization, the neural network just by chance gets it into its head that some actions are a bad idea, then under option 1 it will never try out those actions and discover that maybe it's actually a good idea to take them, like firing the main thruster sometimes. Under option 2, on every step we have some small probability of trying out different actions, so that the neural network can learn to overcome its own possible preconceptions about what might be a bad idea when that turns out not to be the case.

This idea of picking actions randomly is sometimes called an exploration step, because we're trying out something that may not be the best idea; we're going to take some action in some circumstance, explore, and learn more about an action in a circumstance where we may not have had as much experience before. Taking the action that maximizes Q(s,a) is sometimes called a greedy action, because we're trying to maximize our return by picking it. Or, in the reinforcement learning literature, you'll also sometimes hear this called an exploitation step. I know that exploitation is not a good thing; nobody should ever exploit anyone else.
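To make option 2 concrete, here is a minimal sketch of epsilon-greedy action selection in Python. The `q_network` callable, the state shape, and the four discrete lunar lander actions (do nothing, left thruster, main thruster, right thruster) are assumptions for illustration, not the course's exact implementation.

```python
import numpy as np

def epsilon_greedy_action(q_network, state, epsilon=0.05, num_actions=4):
    """Pick an action with an epsilon-greedy policy.

    With probability epsilon, explore: choose one of the num_actions
    actions uniformly at random. Otherwise, exploit: choose the action
    that maximizes the current estimate of Q(s, a).
    """
    if np.random.rand() < epsilon:
        # Exploration step: a random action lets the agent discover
        # actions its current Q estimates happen to undervalue.
        return np.random.randint(num_actions)
    # Greedy (exploitation) step: trust the current guess of Q(s, a).
    q_values = q_network(state[np.newaxis, :])  # assumed output shape: (1, num_actions)
    return int(np.argmax(q_values))
```

Setting epsilon to 0 recovers option 1, the purely greedy policy that can get stuck never trying an action its initial Q estimates happen to undervalue.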
But historically, this was the term used in reinforcement learning to say, let's exploit everything we've learned to do the best we can. In the reinforcement learning literature, you'll sometimes hear people talk about the exploration versus exploitation trade-off, which refers to how often you take actions randomly, or take actions that may not be the best, in order to learn more, versus trying to maximize your return by, say, taking the action that maximizes Q(s,a). This approach, that is, option 2, has a name: it's called an epsilon-greedy policy, where here epsilon = 0.05 is the probability of picking an action randomly. This is the most common way to make your reinforcement learning algorithm explore a little bit, even while most of the time taking greedy actions.
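As a rough sanity check on the split described above, the short sketch below simulates how often a policy with epsilon = 0.05 would explore versus act greedily over many steps. The numbers come from this section; the snippet is self-contained and does not use the course's actual training loop.

```python
import numpy as np

# Simulate the coin flip behind an epsilon-greedy policy: with
# probability epsilon we explore (random action), otherwise we act
# greedily. Over many steps the fractions approach 0.05 and 0.95.
rng = np.random.default_rng(0)
epsilon, steps = 0.05, 100_000
explored = rng.random(steps) < epsilon
print(f"exploratory fraction: {explored.mean():.3f}")      # roughly 0.050
print(f"greedy fraction:      {1 - explored.mean():.3f}")  # roughly 0.950
```

Raising epsilon shifts the balance toward exploration; lowering it shifts the balance toward exploitation.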