We've developed a reinforcement learning formalism using the six-state Mars rover example. Let's do a quick review of the key concepts and also see how this set of concepts can be used for other applications as well. Some of the concepts we've discussed are the states of a reinforcement learning problem, the set of actions, the rewards, a discount factor, then how the rewards and the discount factor are used together to compute the return, and finally a policy whose job it is to help you pick actions so as to maximize the return. For the Mars rover example, we had six states that we numbered 1-6, and the actions were to go left or to go right. The rewards were 100 for the leftmost state, 40 for the rightmost state, and zero in between, and I was using a discount factor of 0.5. The return was the sum of the rewards weighted by successive powers of the discount factor, and we could have different policies Pi that pick actions depending on what state you're in.

This same formalism of states, actions, rewards, and so on can be used for many other applications as well. Take the problem of flying an autonomous helicopter. The set of states would be the set of possible positions, orientations, speeds, and so on of the helicopter. The possible actions would be the set of possible ways to move the control sticks of the helicopter, and the rewards might be plus one if it's flying well, and negative 1,000 if it flies really badly or crashes. That gives a reward function that tells you how well the helicopter is flying. The discount factor would be a number slightly less than one, maybe say 0.99, and then based on the rewards and the discount factor, you compute the return using the same formula. The job of a reinforcement learning algorithm would be to find some policy Pi of s so that, given as input the position of the helicopter s, it tells you what action to take. That is, it tells you how to move the control sticks.

Here's one more example, a game-playing one. Say you want to use reinforcement learning to learn to play chess. The state of this problem would be the position of all the pieces on the board. By the way, if you play chess and know the rules well, you know that a little bit more information than just the position of the pieces is important for chess, but I'll simplify it a little bit for this video. The actions are the possible legal moves in the game, and a common choice of reward would be to give your system a reward of plus one if it wins a game, minus one if it loses a game, and zero if it ties a game. For chess, a discount factor very close to one will usually be used, maybe 0.99 or even 0.995 or 0.999, and the return uses the same formula as the other applications. Once again, the goal is, given a board position, to pick a good action using a policy Pi.

This formalism of a reinforcement learning application actually has a name. It's called a Markov decision process, and I know that sounds like a big, technical, complicated term. But if you ever hear the term Markov decision process, or MDP for short, that's just the formalism we've been talking about in the last few videos. The term Markov in Markov decision process refers to the fact that the future depends only on the current state and not on anything that might have occurred prior to getting to the current state. In other words, in a Markov decision process, the future depends only on where you are now, not on how you got here.
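To tie these pieces together, here is a minimal Python sketch of the Mars rover example as an MDP and of the return computation. It is an illustration only; the names used here (REWARDS, compute_return, and so on) are assumptions for this sketch, not code from the course.

```python
# Minimal sketch (assumed names) of the six-state Mars rover MDP from the lecture.
GAMMA = 0.5                                            # discount factor
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}      # reward associated with each state
TERMINAL_STATES = {1, 6}                               # the rover stops at either end

def step(state, action):
    """Deterministic transition: 'left' moves to state - 1, 'right' to state + 1."""
    return state - 1 if action == "left" else state + 1

def compute_return(start_state, policy):
    """Return = R1 + GAMMA*R2 + GAMMA^2*R3 + ..., following the given policy."""
    total, discount, state = 0.0, 1.0, start_state
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL_STATES:
            return total
        state = step(state, policy[state])
        discount *= GAMMA

# A policy maps each non-terminal state to an action; here: always go left.
policy = {2: "left", 3: "left", 4: "left", 5: "left"}
print(compute_return(4, policy))   # 0 + 0.5*0 + 0.25*0 + 0.125*100 = 12.5
```

For example, starting in state 4 and always going left, the only nonzero reward is the 100 received in state 1, three steps away, so the return is 0.5 cubed times 100, which is 12.5.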
One other way to think of the Markov decision process formalism is that we have a robot, or some other agent, that we wish to control, and what we get to do is choose actions a. Based on those actions, something will happen in the world or in the environment, such as our position in the world changing, or we get to sample a piece of rock and execute the science mission. The way we choose the action a is with a policy Pi, and based on what happens in the world, we then get to see, or we observe back, what state we're in, as well as what reward we get. You sometimes see different authors use a diagram like this to represent the Markov decision process or MDP formalism, but this is just another way of illustrating the set of concepts that you learned about in the last few videos. You now know how a reinforcement learning problem works. In the next video, we'll start to develop an algorithm for picking good actions. The first step toward that will be to define and then eventually learn to compute the state-action value function. This turns out to be one of the key quantities when we want to develop a learning algorithm. Let's go on to the next video to see what this state-action value function is.
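As a rough illustration of that agent-environment loop, here is a short Python sketch in which the agent repeatedly picks an action with its policy, the environment reports the resulting state and reward, and the discounted rewards accumulate into the return. The interface names (reset, step, and so on) are assumptions loosely modeled on common reinforcement learning toolkits, not anything defined in this lecture.

```python
# Hypothetical agent-environment loop; interface names are illustrative assumptions.
class RoverEnv:
    """Six-state rover world, included only to make the loop runnable."""
    REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}

    def reset(self, start_state=4):
        self.state = start_state
        return self.state, self.REWARDS[self.state], self.state in (1, 6)

    def step(self, action):
        self.state += -1 if action == "left" else 1
        return self.state, self.REWARDS[self.state], self.state in (1, 6)

def run_episode(env, policy, gamma=0.5):
    """The agent picks actions with its policy Pi; the environment reports the new
    state and reward; the discounted rewards sum up to the return."""
    state, reward, done = env.reset()      # observe the initial state and its reward R1
    total_return, discount = reward, gamma
    while not done:
        action = policy(state)                     # Pi(s): choose an action
        state, reward, done = env.step(action)     # observe what happened in the world
        total_return += discount * reward
        discount *= gamma
    return total_return

print(run_episode(RoverEnv(), policy=lambda s: "left"))   # 12.5, matching the earlier sketch
```

The same loop structure carries over to the helicopter and chess examples; only the environment, the reward function, and the discount factor change.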