Let's see how we can use reinforcement learning to control the Lunar Lander, or for other reinforcement learning problems. The key idea is that we're going to train a neural network to compute, or to approximate, the state-action value function Q of s, a, and that in turn will let us pick good actions. Let's see how it works. The heart of the learning algorithm is that we're going to train a neural network that inputs the current state and the current action and computes or approximates Q of s, a. In particular, for the Lunar Lander, we're going to take the state s and any action a and put them together. Concretely, the state was that list of eight numbers that we saw previously: x, y, x dot, y dot, Theta, Theta dot, and then l and r for whether the left and right legs are grounded. That's the list of eight numbers that describes the state. Then, finally, we have four possible actions: do nothing, fire the left thruster, fire the main engine, and fire the right thruster. We can encode any of those four actions using a one-hot feature vector. If the action were the first action, we may encode it using 1, 0, 0, 0, or if it was the second action, to fire the left thruster, we may encode it as 0, 1, 0, 0. This list of 12 numbers, eight numbers for the state and then four numbers for a one-hot encoding of the action, is the input we'll have to the neural network, and I'm going to call this x. We'll then take these 12 numbers and feed them to a neural network with, say, 64 units in the first hidden layer, 64 units in the second hidden layer, and then a single unit in the output layer. The job of the neural network is to output Q of s, a, the state-action value function for the Lunar Lander, given the inputs s and a.

Because we'll be using neural network training algorithms in a little bit, I'm also going to refer to this value Q of s, a as the target value y that we're training the neural network to approximate. Notice that I did say reinforcement learning is different from supervised learning, but what we're going to do is not input a state and have it output an action. What we're going to do is input a state-action pair and have it try to output Q of s, a, and using a neural network inside the reinforcement learning algorithm this way will turn out to work pretty well. We'll see the details in a little bit, so don't worry about it if it doesn't make sense yet.

But if you can train a neural network, with appropriate choices of parameters in the hidden layers and in the output layer, to give you good estimates of Q of s, a, then whenever your Lunar Lander is in some state s, you can use the neural network to compute Q of s, a for all four actions. You can compute Q of s, nothing, Q of s, left, Q of s, main, and Q of s, right, and then, finally, whichever of these has the highest value, you pick the corresponding action a. For example, if out of these four values Q of s, main is largest, then you would decide to go and fire the main engine of the Lunar Lander. The question becomes, how do you train a neural network to output Q of s, a? It turns out the approach will be to use the Bellman equation to create a training set with lots of examples x and y, and then we'll use supervised learning, exactly as you learned in the second course when we talked about neural networks, to learn a mapping from x to y, that is, a mapping from the state-action pair to this target value Q of s, a. But how do you get the training set with values for x and y that you can then train a neural network on? Let's take a look.
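Here is a minimal sketch of the network architecture and action-selection step just described, assuming TensorFlow/Keras. The names (q_network, pick_action) and details like activations are illustrative choices, not the course's exact code; the key points from the passage are the 12-number input (8 state values plus a 4-value one-hot action), the 64-64-1 layer sizes, and picking the action whose Q of s, a is largest.

```python
import numpy as np
import tensorflow as tf

STATE_SIZE = 8    # x, y, x_dot, y_dot, theta, theta_dot, left leg, right leg
NUM_ACTIONS = 4   # do nothing, fire left thruster, fire main engine, fire right thruster

# 12 inputs (state + one-hot action) -> 64 -> 64 -> 1 output approximating Q(s, a)
q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(STATE_SIZE + NUM_ACTIONS,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),  # single unit: the scalar estimate of Q(s, a)
])

def pick_action(state):
    """Evaluate Q(s, a) for all four actions and return the index of the best one."""
    actions = np.eye(NUM_ACTIONS, dtype=np.float32)                  # the 4 one-hot action vectors
    states = np.tile(np.asarray(state, np.float32), (NUM_ACTIONS, 1))  # repeat the state 4 times
    x = np.concatenate([states, actions], axis=1)                    # shape (4, 12)
    q_values = q_network(x, training=False).numpy().flatten()        # Q(s, nothing), Q(s, left), ...
    return int(np.argmax(q_values))                                  # action with the highest Q
```

For example, if `pick_action` returns 2 here, that corresponds to Q of s, main being the largest of the four values, so you would fire the main engine.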
Here's the Bellman equation: Q(s, a) = R(s) + Gamma * max over a' of Q(s', a'). The right-hand side is what you want Q of s, a to be equal to, so I'm going to call this value on the right-hand side y, and the input to the neural network is a state and an action, so I'm going to call that x. The job of the neural network is to input x, that is, to input the state-action pair, and try to accurately predict what the value on the right will be. So each time the Lunar Lander takes an action and you observe the reward and the next state, that experience gives you one training example: x is the state-action pair, and y is the right-hand side of the Bellman equation computed from that reward and next state; with enough of these examples, you can train the network with supervised learning. The algorithm you just saw is sometimes called the DQN algorithm, which stands for Deep Q-Network, because you're using deep learning and a neural network to train a model to learn the Q function. Hence DQN, or DQ using a neural network. If you use the algorithm as I described it, it will work okay on the Lunar Lander. Maybe it'll take a long time to converge, maybe it won't land perfectly, but it'll work. But it turns out that with a couple of refinements, the algorithm can work much better. In the next few videos, let's take a look at some refinements to the algorithm that you just saw.
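Continuing the earlier sketch (it reuses q_network, NUM_ACTIONS, GAMMA's role as the discount factor, and numpy from above), here is one way a single training example (x, y) might be built from an observed transition using the Bellman equation. The function name, the 0.995 discount value, and the terminal-state (done) check are assumptions for illustration, not details given in the passage.

```python
GAMMA = 0.995  # illustrative discount factor; the passage doesn't specify a value

def make_training_example(state, action, reward, next_state, done):
    """Turn one observed transition (s, a, R(s), s') into a supervised example (x, y)."""
    # x: the 8 state numbers concatenated with the one-hot encoding of the action taken.
    action_one_hot = np.eye(NUM_ACTIONS, dtype=np.float32)[action]
    x = np.concatenate([np.asarray(state, np.float32), action_one_hot])

    # y: R(s) + Gamma * max over a' of Q(s', a'), per the Bellman equation.
    # (Using only the reward when the episode has ended is a standard detail
    #  not spelled out in the passage above.)
    actions = np.eye(NUM_ACTIONS, dtype=np.float32)
    next_states = np.tile(np.asarray(next_state, np.float32), (NUM_ACTIONS, 1))
    next_q = q_network(np.concatenate([next_states, actions], axis=1),
                       training=False).numpy().flatten()
    y = reward if done else reward + GAMMA * np.max(next_q)
    return x, y
```

A set of such (x, y) pairs is exactly the kind of training set the passage describes fitting with ordinary supervised learning.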