When we start to develop reinforcement learning algorithms later this week, you'll see that there's a key quantity that reinforcement learning algorithms will try to compute, and that's called the state-action value function. Let's take a look at what this function is.

The state-action value function is a function typically denoted by the uppercase letter Q. It's a function of a state you might be in, as well as an action you might choose to take in that state, and Q(s, a) will give a number that equals the return if you start in that state s, take the action a just once, and after taking action a once, you then behave optimally after that. So after that, you take whatever actions will result in the highest possible return.

Now you might be thinking there's something a little bit strange about this definition, because how do we know what the optimal behavior is? And if we already knew the best action to take in every state, why would we still need to compute Q(s, a), since we'd already have the optimal policy? So I do want to acknowledge that there's something a little bit strange, almost circular, about this definition. But rest assured, when we look at specific reinforcement learning algorithms later, we'll resolve this slightly circular definition, and we'll come up with a way to compute the Q function even before we've come up with the optimal policy. You'll see that in a later video, so don't worry about this for now.

Let's look at an example. We saw previously that this is a pretty good policy: go left from states 2, 3, and 4, and go right from state 5. It turns out that this is actually the optimal policy for the Mars rover application when the discount factor gamma is 0.5. So Q(s, a) will be equal to the total return if you start from that state, take the action a, and then behave optimally after that, meaning take actions according to this policy shown over here. Let's figure out what Q(s, a) is for a few different states.

Let's look at, say, Q of state 2 with the action to go right. Well, if you're in state 2 and you go right, then you end up at state 3, and after that you behave optimally: you go left from state 3, then left from state 2, and eventually you get the reward of 100. In this case, the rewards you get are 0 from state 2, 0 when you get to state 3, 0 when you get back to state 2, and then 100 when you finally reach the terminal state 1. So the return is 0 plus 0.5 times 0, plus 0.5 squared times 0, plus 0.5 cubed times 100, and this turns out to be 12.5. So Q of state 2, going right, is equal to 12.5. Note that this passes no judgment on whether going right is a good idea or not. It's actually not that good an idea to go right from state 2, but Q just faithfully reports the return if you take action a and then behave optimally afterwards.

Here's another example. If you're in state 2 and you were to go left, then the sequence of rewards you get is 0 when you're in state 2, followed by 100. So the return is 0 plus 0.5 times 100, which is equal to 50. To write down the values of Q(s, a) in this diagram, I'm going to write 12.5 here on the right to denote that this is Q of state 2 going right, and I'll write 50 here on the left to denote that this is Q of state 2 going left.
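Since these returns are just short discounted sums, here is a minimal Python sketch of the computation (not from the video; the function and variable names are my own): it takes an action once from a state, then follows the policy described above until a terminal state, accumulating discounted rewards. It reproduces the two values just worked out, Q(2, right) = 12.5 and Q(2, left) = 50.

```python
# Minimal sketch of the Mars rover example described above (names are my own):
# six states, terminal rewards 100 (state 1) and 40 (state 6), all other
# rewards 0, discount factor gamma = 0.5.  Q(s, a) = return from taking
# action a once in state s and then acting optimally afterwards.

GAMMA = 0.5
REWARD = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}
POLICY = {2: "left", 3: "left", 4: "left", 5: "right"}  # optimal for gamma = 0.5


def step(state, action):
    """Deterministic move: 'left' goes to state - 1, 'right' goes to state + 1."""
    return state - 1 if action == "left" else state + 1


def q_value(state, action):
    """Return R1 + gamma*R2 + gamma^2*R3 + ... obtained by taking `action`
    once from `state`, then following POLICY until a terminal state."""
    total = REWARD[state]           # reward collected in the starting state
    discount = GAMMA
    while state not in TERMINAL:
        state = step(state, action)
        total += discount * REWARD[state]
        discount *= GAMMA
        if state not in TERMINAL:
            action = POLICY[state]  # behave optimally after the first action
    return total


print(q_value(2, "right"))  # 0 + 0.5*0 + 0.25*0 + 0.125*100 = 12.5
print(q_value(2, "left"))   # 0 + 0.5*100 = 50.0
```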
To take one more example, what if we're in state 4 and we decide to go left? Well, if you're in state 4 and you go left, you get a reward of 0, and then, because we follow the optimal policy afterwards, you take the action left again for a reward of 0, left again for 0, and then 100. So Q of state 4 going left results in the rewards 0, 0, 0, 100, and the return is 0 plus 0.5 times 0, plus 0.5 squared times 0, plus 0.5 cubed times 100, which is therefore equal to 12.5. So Q of state 4 going left is 12.5, and I'm going to write this here as 12.5.

It turns out that if you were to carry out this exercise for all of the other states and all of the other actions, you end up with these values of Q(s, a) for the different states and different actions. And finally, at the terminal states, it doesn't matter what you do; you just get the terminal reward, 100 or 40, so we just write down those terminal rewards over here.
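Carrying out the same exercise in code for every non-terminal state and both actions gives the following values (again just a sketch, computed under the same assumptions and reusing the q_value helper defined above):

```python
# Print Q(s, left) and Q(s, right) for every non-terminal state.
for s in (2, 3, 4, 5):
    print(f"state {s}:  Q(s, left) = {q_value(s, 'left'):5.2f}   "
          f"Q(s, right) = {q_value(s, 'right'):5.2f}")
# state 2:  Q(s, left) = 50.00   Q(s, right) = 12.50
# state 3:  Q(s, left) = 25.00   Q(s, right) =  6.25
# state 4:  Q(s, left) = 12.50   Q(s, right) = 10.00
# state 5:  Q(s, left) =  6.25   Q(s, right) = 20.00
```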