
The lunar lander lets you land a simulated vehicle on the moon. It's like a fun little video game that's been used by a lot of reinforcement learning researchers. Let's take a look at what it is. In this application you're in command of a lunar lander that is rapidly approaching the surface of the moon, and your job is to fire thrusters at the appropriate times to land it safely on the landing pad. To give you a sense of what it looks like, this is the lunar lander landing successfully: it fires thrusters downward and to the left and right to position itself to land between these two yellow flags. Or, if the reinforcement learning algorithm's policy does not do well, this is what it might look like, where the lander unfortunately has crashed on the surface of the moon.

In this application you have four possible actions on every time step. You can do nothing, in which case the forces of inertia and gravity pull you toward the surface of the moon. You can fire the left thruster (when you see a little red dot come out on the left, that's the left thruster firing), which will tend to push the lunar lander to the right. You can fire the main engine, which thrusts downward from the bottom here. Or you can fire the right thruster, which will push you to the left. Your job is to keep on picking actions over time so as to land the lunar lander safely between these two flags on the landing pad. To give the actions shorter names, I'm sometimes going to call them nothing (do nothing), left (fire the left thruster), main (fire the main engine downward), or right (fire the right thruster). So I'll call the actions nothing, left, main, and right for short later in this video.

How about the state space of this MDP? The states are its position x and y (how far to the left or right it is and how high up it is), its velocities x dot and y dot (how fast it is moving in the horizontal and vertical directions), its angle theta (how far the lunar lander is tilted to the left or to the right), and its angular velocity theta dot. Finally, because a small difference in positioning makes a big difference in whether or not it has landed, we're going to have two other variables in the state vector, which we call l and r. l corresponds to whether the left leg is grounded, meaning whether or not the left leg is sitting on the ground, and r corresponds to whether or not the right leg is sitting on the ground. So whereas x, y, x dot, y dot, theta, and theta dot are numbers, l and r are binary valued and can take on only the values zero or one, depending on whether the left and right legs are touching the ground.

Finally, here is the reward function for the lunar lander. If it manages to get to the landing pad, it receives a reward between 100 and 140, depending on how well it has flown and how close it has gotten to the center of the landing pad. We also give it an additional reward for moving toward or away from the pad: if it moves closer to the pad it receives a positive reward, and if it drifts away it receives a negative reward. If it crashes, it gets a large -100 reward. If it achieves a soft landing, that is, a landing that is not a crash, it gets a +100 reward. For each leg, the left leg or the right leg, that gets grounded, it receives a +10 reward. And finally, to encourage it not to waste too much fuel by firing thrusters when it isn't necessary, every time it fires the main engine we give it a -0.3 reward, and every time it fires the left or the right side thruster we give it a -0.03 reward.
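If you want to try this environment yourself, the open-source Gym/Gymnasium "LunarLander-v2" environment (not part of this lecture) implements essentially the MDP described above: an 8-dimensional state vector and 4 discrete actions. Here is a minimal sketch, assuming the gymnasium package with its Box2D extra is installed (newer releases may use the id "LunarLander-v3"); the random policy is only there to illustrate the interaction loop.

```python
# Minimal interaction loop with the lunar lander environment.
# Assumes: pip install "gymnasium[box2d]"
import gymnasium as gym

env = gym.make("LunarLander-v2")

obs, info = env.reset(seed=0)
# obs is an 8-vector roughly matching [x, y, x_dot, y_dot, theta, theta_dot, l, r]
print("initial state:", obs)

# Discrete actions (integer coding is the environment's convention):
# 0 = nothing, 1 = fire left thruster, 2 = fire main engine, 3 = fire right thruster
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random policy, just to show the loop
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print("episode reward (undiscounted):", total_reward)
env.close()
```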
Notice that this is a moderately complex reward function. The designers of the lunar lander application actually put some thought into exactly what behavior they wanted and codified it in the reward function, to incentivize more of the behaviors you want and fewer of the behaviors, like crashing, that you don't want. You'll find that when you're building your own reinforcement learning application, it usually takes some thought to specify exactly what you want or don't want and to codify that in the reward function. But specifying the reward function should still turn out to be much easier than specifying the exact right action to take from every single state, which is much harder for this and many other reinforcement learning applications. So the lunar lander problem is as follows: our goal is to learn a policy pi that, when given a state s (the eight numbers we just described), picks an action a = pi(s) so as to maximize the return, the sum of discounted rewards. Usually for the lunar lander we would use a fairly large value for gamma; in fact, we'll use a value of gamma equal to 0.985, so pretty close to one.
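To make the return concrete, here is a small sketch (the function name and the sample reward sequence are illustrative, not from the lecture) of computing the sum of discounted rewards R1 + gamma*R2 + gamma^2*R3 + ... for one episode, using gamma = 0.985 as in the video.

```python
def discounted_return(rewards, gamma=0.985):
    """Sum of discounted rewards: rewards[0] + gamma*rewards[1] + gamma^2*rewards[2] + ..."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Hypothetical reward sequence: a few thruster firings, both legs grounded,
# then a soft landing on the pad.
print(discounted_return([-0.3, -0.03, -0.3, 10, 10, 100]))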
