You saw in the last video what the states of a reinforcement learning application are, how the actions you take move you through different states, and how you collect different rewards along the way. But how do you know if a particular set of rewards is better or worse than a different set of rewards? The return in reinforcement learning, which we'll define in this video, allows us to capture that.

As we go through this, one analogy you might find helpful is this: imagine there's a five-dollar bill at your feet that you can reach down and pick up, or a ten-dollar bill half an hour's walk across town. Which one would you rather go after? Ten dollars is better than five dollars, but if you have to walk for half an hour to get that ten-dollar bill, then maybe it's more convenient to just pick up the five-dollar bill instead. The concept of a return captures the idea that rewards you can get quickly are more attractive than rewards that take a long time to reach. Let's take a look at exactly how that works.

Here's the Mars rover example. If, starting from state 4, you keep going left, we saw that the rewards you get are zero on the first step from state 4, zero from state 3, zero from state 2, and then 100 at state 1, the terminal state. The return is defined as the sum of these rewards, weighted by one additional factor called the discount factor. The discount factor is a number a little bit less than 1; let me pick 0.9 as the discount factor. I'm going to weight the reward on the first step by 1 (so it's just zero), the reward on the second step by the discount factor 0.9, the reward on the third step by the discount factor squared, and the reward on the fourth step by the discount factor cubed. If you calculate this out, it turns out to be 0.9^3, which is 0.729, times 100, which is 72.9.

The more general formula for the return is that if your robot goes through some sequence of states and gets reward R_1 on the first step, R_2 on the second step, R_3 on the third step, and so on, then the return is R_1 plus the discount factor Gamma (the Greek letter Gamma, which I've set to 0.9 in this example) times R_2, plus Gamma^2 times R_3, plus Gamma^3 times R_4, and so on, until you reach the terminal state.

What the discount factor Gamma does is make the reinforcement learning algorithm a little bit impatient. The return gives full credit to the first reward (1 times R_1), a little less credit to the reward you get on the second step (0.9 times R_2), even less credit to the reward on the next time step R_3, and so on. Getting rewards sooner therefore results in a higher value for the total return.

In many reinforcement learning algorithms, a common choice for the discount factor is a number pretty close to 1, like 0.9, 0.99, or even 0.999. But for illustrative purposes in the running example, I'm actually going to use a discount factor of 0.5. This very heavily down-weights, or as we say, very heavily discounts, rewards in the future, because with every additional passing timestep you get only half as much credit as for rewards you would have gotten one step earlier. If Gamma were equal to 0.5, the return in the example above would have been 0 plus 0.5 times 0, plus 0.5^2 times 0, plus 0.5^3 times 100. That last reward appears because state 1 is the terminal state, and this works out to a return of 12.5.
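As a quick sanity check of this arithmetic, here is a minimal Python sketch; the `discounted_return` helper is a hypothetical name of mine, not from the lecture, but the reward sequence and both discount factors come straight from the example above.

```python
# Minimal sketch: compute the return G = R_1 + gamma*R_2 + gamma^2*R_3 + ...
# The helper name `discounted_return` is hypothetical, not from the lecture.
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its step index."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Mars rover, starting at state 4 and always going left:
# reward 0 at states 4, 3, and 2, then 100 at terminal state 1.
rewards = [0, 0, 0, 100]

print(discounted_return(rewards, gamma=0.9))  # 0.9^3 * 100 = 72.9
print(discounted_return(rewards, gamma=0.5))  # 0.5^3 * 100 = 12.5
```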
In financial applications, the discount factor also has a very natural interpretation as the interest rate, or the time value of money. A dollar today may be worth a little bit more than a dollar you could only get in the future, because a dollar today you can put in the bank, earn some interest, and end up with a little bit more money a year from now. In financial applications, the discount factor often represents how much less a dollar in the future is worth compared to a dollar today. Let's look at some concrete examples of returns. The return you get depends on the rewards, and the rewards depend on the actions you take, so the return depends on the actions you take.
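As a small illustration of that dependence, here is a hedged sketch: it assumes the Mars rover from this example, a fixed behavior of always going left, Gamma = 0.5, and the single reward of 100 at terminal state 1; the function name `return_going_left` is mine.

```python
GAMMA = 0.5            # discount factor used in the running example
TERMINAL_REWARD = 100  # reward collected at terminal state 1

def return_going_left(start_state):
    """Return from start_state when the rover always moves left:
    it collects reward 0 at every intermediate state, then 100 at
    state 1, so G = GAMMA^(steps to reach state 1) * 100."""
    steps_to_terminal = start_state - 1
    return (GAMMA ** steps_to_terminal) * TERMINAL_REWARD

for s in range(2, 6):
    print(f"start at state {s}: return = {return_going_left(s)}")
# start at state 2: return = 50.0
# start at state 3: return = 25.0
# start at state 4: return = 12.5   <- matches the 12.5 computed above
# start at state 5: return = 6.25
```

Starting closer to the reward gives a larger return, which is exactly the impatience the discount factor builds in; an action sequence that wandered before heading left would push the 100 further into the future and shrink the return even more.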