Let me summarize where we are. If you can compute the state action value function Q of S, A, then it gives you a way to pick a good action from every state: just pick the action A that gives you the largest value of Q of S, A. The question is, how do you compute these values Q of S, A? In reinforcement learning, there's a key equation called the Bellman equation that will help us to compute the state action value function. Let's take a look at what this equation is. As a reminder, this is the definition of Q of S, A: it's the return if we start in state S, take the action A once, and then behave optimally after that.

In order to describe the Bellman equation, I'm going to use the following notation. I'm going to use S to denote the current state. Next, I'm going to use R of S to denote the reward of the current state. For our little MDP example, we have that R of State 1 is 100, the reward of State 2 is 0, and so on, and the reward of State 6 is 40. I'm going to use the letter A to denote the current action, the action that you take in the state S. After you take the action A, you get to some new state. For example, if you're in State 4 and you take the action left, then you get to State 3. I'm going to use S prime to denote the state you get to after taking the action A from the current state S. I'm also going to use A prime to denote the action that you might take in state S prime, the new state you got to. The notation convention, by the way, is that S, A correspond to the current state and action, and when we add the prime, that's the next state and the next action.

The Bellman equation is the following. It says that Q of S, A, that is, the return under this set of assumptions, is equal to R of S, the reward you get for being in that state, plus the discount factor Gamma times the max over all possible actions A prime of Q of S prime, the new state you just got to, comma A prime. There's a lot going on in this equation. Let's first take a look at some examples, and we'll come back to see why this equation might make sense.

Let's look at Q of State 2 and the action right, and apply the Bellman equation to see what value it gives us. If the current state is State 2 and the action is to go right, then the next state you get to after going right, S prime, would be State 3. The Bellman equation says Q of 2, right is R of S, that's R of State 2, which is just the reward zero, plus the discount factor Gamma, which we set to 0.5 in this example, times the max of the Q values in state S prime, in State 3. This is going to be the max of 25 and 6.25, since this is the max over A prime of Q of S prime comma A prime, and it takes the larger of 25 or 6.25 because those are the two choices for State 3. This turns out to be equal to zero plus 0.5 times 25, which is equal to 12.5, which, fortunately, matches the value of Q of 2 and the action right that we had.

Let's look at just one more example. Let me take State 4 and see what Q of State 4 is if you decide to go left. In this case, the current state is 4 and the current action is to go left. The next state, if you start from State 4 and go left, is again State 3, so S prime is 3. Again, the Bellman equation says this is equal to R of S, that's R of State 4, which is zero, plus 0.5, the discount factor Gamma, times the max over A prime of Q of S prime, that is State 3 again, comma A prime. Once again, the Q values for State 3 are 25 and 6.25, and the larger of these is 25. This works out to be R of 4, which is zero, plus 0.5 times 25, which is again equal to 12.5.
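To make the two checks above concrete, here is a minimal sketch in Python. It assumes the six-state MDP from this example (rewards 100, 0, 0, 0, 0, 40 and discount factor 0.5), takes the State 3 values of 25 and 6.25 as given, and uses names of my own choosing such as `bellman` and `q_state3`, which are not from the lecture.

```python
# Minimal sketch of the two Bellman-equation checks above.
# Assumes the six-state MDP from the example: rewards 100, 0, 0, 0, 0, 40,
# discount factor gamma = 0.5, and the State 3 Q values quoted in the lecture.

gamma = 0.5
rewards = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}

# Q values for the next state S' = 3, taken as given: 25 for left, 6.25 for right.
q_state3 = {"left": 25.0, "right": 6.25}

def bellman(state, next_state_q):
    """Q(s, a) = R(s) + gamma * max over a' of Q(s', a')."""
    return rewards[state] + gamma * max(next_state_q.values())

print(bellman(2, q_state3))  # Q(2, right): 0 + 0.5 * 25 = 12.5
print(bellman(4, q_state3))  # Q(4, left):  0 + 0.5 * 25 = 12.5
```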
That's why Q of 4 with the action left is also equal to 12.5. Just one note: if you're in a terminal state, then the Bellman equation simplifies to Q of S, A equals R of S, because there's no next state S prime, and so that second term goes away. That's why Q of S, A in the terminal states is just 100, 100 or 40, 40. If you wish, feel free to pause the video and apply the Bellman equation to any other state action pair in this MDP and check for yourself if the math works out.

Just to recap, this is how we had defined Q of S, A. We saw earlier that the best possible return from any state S is the max over A of Q of S, A. In fact, just to rename S and A, it turns out that the best possible return from a state S prime is the max over A prime of Q of S prime, A prime. I didn't really do anything other than rename S to S prime and A to A prime, but this will make some of the intuitions a little bit easier later. So for any state S prime, like State 3, the best possible return from, say, State 3 is the max over all possible actions of Q of S prime, A prime. Here again is the Bellman equation.
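As an aside, one way to check the whole table of Q values used in this lecture is to apply the Bellman equation repeatedly, starting from all zeros, until the numbers stop changing. The sketch below is not from the lecture itself; it assumes deterministic moves (left goes to the state numbered one lower, right to the state numbered one higher), treats States 1 and 6 as terminal, and the variable names are my own.

```python
# Sketch: repeatedly applying the Bellman equation to the six-state example
# until the Q values stop changing.
# Assumes deterministic moves: "left" goes to state - 1, "right" goes to state + 1,
# and States 1 and 6 are terminal, so Q(s, a) = R(s) there.

gamma = 0.5
rewards = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
terminal = {1, 6}
actions = {"left": -1, "right": +1}

# Start all Q values at zero and iterate the Bellman update.
Q = {(s, a): 0.0 for s in rewards for a in actions}
for _ in range(100):
    for s in rewards:
        for a, step in actions.items():
            if s in terminal:
                Q[(s, a)] = rewards[s]  # terminal case: Q(s, a) = R(s)
            else:
                s_next = s + step
                Q[(s, a)] = rewards[s] + gamma * max(Q[(s_next, b)] for b in actions)

for s in rewards:
    print(s, Q[(s, "left")], Q[(s, "right")])
# Expected: State 2 -> 50, 12.5   State 3 -> 25, 6.25   State 4 -> 12.5, 10
#           State 5 -> 6.25, 20   States 1 and 6 -> 100, 100 and 40, 40
```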