Tuesday, 8 August 2017

Reinforcement learning: Demon Attack


Overview

In this post we discuss the inner workings of a simple RL agent which learns how to play the classic Atari game Demon Attack.
The algorithm is a basic implementation of Policy gradient, a type of reinforcement learning algorithm. For more details, Karpathy and Juliani provide very good introductions to Policy gradients. The full source code of the algorithm is available as a Jupyter notebook on GitHub. The agent operates within an OpenAI Gym environment and its policy is implemented in TensorFlow. Hence the requirements:

  • ipython-jupyter
  • numpy
  • TensorFlow
  • OpenAI Gym
  • matplotlib (optional)
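
To make the setup concrete, here is a minimal sketch of how such an environment can be created. The environment id "DemonAttack-ram-v0" (the RAM-observation variant of the game) is our assumption; the exact id may differ between gym versions.

    import gym

    # RAM variant of the game: observations are the 128-byte memory snapshot
    # of the Atari machine rather than screen pixels (environment id assumed).
    env = gym.make("DemonAttack-ram-v0")

    obs = env.reset()               # array of 128 integer values
    print(env.observation_space)    # Box(128,)   -- one value per RAM cell
    print(env.action_space)         # Discrete(6) -- the six possible actions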

Idea of the algorithm

At a high level, the algorithm can be represented by the following block diagram.
It has two main stages:
  1. Dimensionality reduction
  2. Training

Dimensionality reduction

This first stage is essentially a preprocessing stage. Its goal is to estimate the effective dimension of the input data. Every state of the game is represented by an array of 128 integer values. Our agent uses 2 (or optionally 3) consecutive frames when making a decision. The reason for using more than one frame is to also provide dynamic information: the difference between two consecutive frames contains information about all moving objects (bullets, enemies, etc.). In total, 2 consecutive observations of the environment form an \( n_d = 256 \) dimensional state. We could feed such states directly into our neural network. However, we can first use Principal Component Analysis to estimate the effective dimensionality of the input data. An observation of the environment is an opaque representation of the state of the game, encoding information about the positions of the agent and the enemies, the current lives, the bullets, etc. Intuitively, we can suspect that not all possible states are actually observed; for example, there are restrictions on the relative positions of the agents on the screen.
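
As a small illustration, the 256-dimensional state can be built by simply concatenating two consecutive observations. This is a sketch; the helper name make_state is ours.

    import numpy as np

    def make_state(prev_obs, obs):
        """Stack two consecutive 128-value observations into one 256-dim state.

        The current frame carries the static picture; keeping the previous
        frame as well implicitly encodes motion (bullets, enemies, ...).
        """
        return np.concatenate([np.asarray(prev_obs, dtype=np.float32),
                               np.asarray(obs, dtype=np.float32)])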

In order to "measure" the dimensionality of an observation, we need to obtain a collection of many observations. The easiest way to do so is to let an uninitialized or a random walk agent to play several games while we collect the history of the encountered states. In our runs we have used typically \( n_g=200 \) games with maximum number of allowed steps \( 3000 \). These leads to a maximum of \( n_s=600000 \) collected states. We represent them as the matrix \( S \) of size \( 600000 \times 256 \). Each row of the matrix is a separate state and each column represents the evolution in time of the state entry enumerated by the number of the column. In what follows we will be more interested in the rows of the matrix.
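
The collection step can be sketched as follows, reusing env and make_state from the snippets above. Since games usually end before the step limit, the actual number of rows is typically below the maximum of 600000.

    import numpy as np

    n_games, max_steps = 200, 3000
    states = []

    for _ in range(n_games):
        prev_obs = env.reset()
        for _ in range(max_steps):
            action = env.action_space.sample()          # random-walk agent
            obs, reward, done, info = env.step(action)
            states.append(make_state(prev_obs, obs))    # 256-dim state
            prev_obs = obs
            if done:
                break

    S = np.vstack(states)    # shape: (n_s, 256) with n_s <= 600000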

After the data is collected, we perform Singular Value Decomposition (SVD) on the matrix \( S \). A more detailed explanation of SVD can be found at www.deeplearningbook.org. It is essentially a decomposition of the matrix into a product of three matrices with known properties:


\( S = U \, s \, V^{T} \)

  • \( U \) is a unitary matrix whose columns are the eigenvectors of the matrix \( SS^{T} \). Its size is \( n_s \times n_s \).
  • \( s \) is zero everywhere except on its main diagonal, whose entries are positive and sorted in descending order: the so-called singular values (SVs), which we denote by \( \lambda \).
  • \( V \) is a unitary matrix whose columns are the eigenvectors of the matrix \( S^{T}S \). Its size is \( n_d \times n_d \).
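
In numpy the decomposition amounts to a single call. We use the economy-size form so that only the first \( n_d \) columns of \( U \) are computed, which is all we need here.

    import numpy as np

    # Economy-size SVD of the collected state matrix S (shape (n_s, 256)):
    # U has shape (n_s, 256), lam holds the 256 singular values in descending
    # order, and Vt (= V^T) has shape (256, 256).
    U, lam, Vt = np.linalg.svd(S, full_matrices=False)

    print(lam[0] / lam[-1])    # ratio between the largest and smallest SV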

We are interested in the relative magnitudes of the singular values and their corresponding vectors from \( V \). The larger the SV corresponding to a vector from \( V \), the larger the contribution of that mode. It turns out that the magnitudes of the different SVs differ by several orders of magnitude:


As we can see, the difference between the biggest and the smallest SV is about \( 18 \) orders of magnitude. This is to be expected, as intuitively we know that not all possible states are observed. This allows us to approximate our observations to a very high degree of accuracy using only the vectors from \( V \) which correspond to the biggest SVs. In our experiments we chose to use the first \( 160 \) vectors. This reduces the dimension of the input data from \( 256 \) to \( 160 \) while still working with an accuracy of about \( 10^{-6} \).
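
Concretely, the reduction is a projection onto the first \( 160 \) columns of \( V \) (rows of \( V^{T} \)), continuing from the SVD call above.

    n_keep = 160
    V_keep = Vt[:n_keep].T        # shape (256, 160): the leading vectors of V

    def reduce_state(state):
        """Project a 256-dim state onto the leading 160 singular vectors."""
        return state @ V_keep     # shape (160,)

    # Sanity check of the reconstruction error on the collected data;
    # it should be around the 1e-6 level quoted above.
    S_reduced = S @ V_keep
    S_approx = S_reduced @ V_keep.T
    print(np.max(np.abs(S - S_approx)) / np.max(np.abs(S)))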


Training

After we have reduced the dimensionality of the input data, we can proceed with implementing the agent and the training algorithm.
The agent is implemented as a simple neural network with one hidden layer of \( 150 \) neurons. The degrees of freedom of the network are the parameters of the following matrices and vectors: \( W^{1}_{160, 150} \), \( b^{1}_{150} \), \( W^2_{150, 6} \), and \( b^2_{6} \). The matrices \( W^1 \) and \( W^2 \) hold the connections between the input layer and the hidden layer and between the hidden layer and the output layer, respectively. The vectors \( b^1 \) and \( b^2 \) are the corresponding biases. The activation function of the hidden layer is ReLU. The output layer uses softmax activations, so the output can be interpreted as a probability distribution over the action space.

We start with an agent whose policy network is randomly initialized. We need to make sure that our agent, even though randomly initialized, assigns nonzero probability to each of the \( 6 \) possible actions in most situations: an agent which always moves left and never shoots is doomed to never make a decision which generates points. The ideal initial condition would be for the network to assign equal probabilities to all actions in every situation. We estimate the probability distribution over the actions of an untrained agent from the history of a batch of 200 games:
The histogram shows good agreement with a uniform distribution over the action space. Therefore our agent will explore all possible actions in many different situations.
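
As a sketch, the policy network described above can be written in the TensorFlow 1.x API that was current at the time. The small initialization scale is our assumption; it keeps the initial policy close to the uniform distribution discussed above.

    import tensorflow as tf

    n_input, n_hidden, n_actions = 160, 150, 6

    x = tf.placeholder(tf.float32, [None, n_input], name="state")

    # Small random initial weights -> nearly uniform initial policy.
    W1 = tf.Variable(tf.random_normal([n_input, n_hidden], stddev=0.01))
    b1 = tf.Variable(tf.zeros([n_hidden]))
    W2 = tf.Variable(tf.random_normal([n_hidden, n_actions], stddev=0.01))
    b2 = tf.Variable(tf.zeros([n_actions]))

    hidden = tf.nn.relu(tf.matmul(x, W1) + b1)    # hidden layer, ReLU
    logits = tf.matmul(hidden, W2) + b2
    policy = tf.nn.softmax(logits)                # distribution over the 6 actions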

We start the training with \( 200 \) games played by the agent, storing in memory every observation together with the action performed by the agent. For every game we also store the total points scored during the game; for us this plays the role of the reward function. After that we compute the average reward generated by our agent: every game with a higher score than the average is marked as "good" and every game with a lower score is marked as "bad". For each game we define a least-squares loss function \( f(x, y, W^1, b^1, W^2, b^2) \) using the game history as a training set. Then we compute the gradients of the loss function with respect to the parameters of the neural network:
\( \frac{\partial f}{\partial W^1}, \quad \frac{\partial f}{\partial b^1}, \quad \frac{\partial f}{\partial W^2}, \quad \frac{\partial f}{\partial b^2} \)
Then we update the weights of our neural network according to:
\( W^1 \to W^1 + c \frac{\partial f}{\partial W^1} \)
\( b^1 \to b^1 + c \frac{\partial f}{\partial b^1} \)
\( W^2 \to W^2 + c \frac{\partial f}{\partial W^2} \)
\( b^2 \to b^2 + c \frac{\partial f}{\partial b^2} \)
The parameter \( c \) is \( -1 \) for a good game and \( +1 \) for a bad game. This way the probabilities of the decisions made during a good game increase, and vice versa. After we process all the games in the current batch, we plot the progress. We can run as many generations as we like and observe the results:
As we can see from the figure, our trained agent statistically starts to perform significantly better after a couple of thousand games. While the trained agent may still occasionally perform worse, on average it is better, as shown by the "moving averages" plot. Here we also demonstrate a game played by the initial-generation agent and a game played by a later generation:
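
For concreteness, here is a minimal sketch of the update described above, again in TensorFlow 1.x style and continuing from the network definition. The learning rate of 0.01 is our assumption, and y holds the one-hot encoded actions that were actually taken.

    y = tf.placeholder(tf.float32, [None, n_actions], name="taken_actions")
    c = tf.placeholder(tf.float32, [], name="c")    # -1 for a good game, +1 for a bad one

    loss = tf.reduce_mean(tf.square(policy - y))    # least-squares loss f over one game

    # minimize(-c * f) performs W -> W - lr * d(-c*f)/dW = W + lr * c * df/dW,
    # which matches the update rule above (learning rate assumed to be 0.01).
    train_step = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(-c * loss)

    # After a batch of games, with mean_score the batch average:
    # for game_states, game_actions, score in batch:
    #     sign = -1.0 if score > mean_score else 1.0
    #     sess.run(train_step, feed_dict={x: game_states, y: game_actions, c: sign})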

Further improvements

There are many ways we can improve the algorithm while preserving its basic idea. Here we state the ones that come to mind:
  • Changing the topology of the neural network. We can either increase the number of neurons in the hidden layer or add more layers to the network. While this looks promising, it will inevitably increase the number of games needed to train the agent.
  • Introducing a reward discount. We can introduce a parameter \( \gamma < 1 \) which exponentially discounts rewards received later in time (see the short sketch after this list). This is a very common technique and it might significantly improve the performance of the training algorithm.
  • Tuning the hyper-parameters of the model: the size of the training batch, the learning rate, the maximum number of steps per game, etc. Finding the right values is very important, as in many cases a good algorithm will not work until a sensible choice of hyper-parameters has been made.
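
To illustrate the reward-discount idea from the list above, here is a small sketch of how discounted returns are usually computed; the function name and the \( \gamma = 0.9 \) used in the example are ours.

    import numpy as np

    def discount_rewards(rewards, gamma=0.99):
        """Discounted returns: R_t = sum_k gamma**k * r_{t+k}."""
        discounted = np.zeros(len(rewards), dtype=np.float32)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            discounted[t] = running
        return discounted

    # A point scored at the last step contributes less to earlier decisions:
    print(discount_rewards([0.0, 0.0, 1.0], gamma=0.9))    # approx. [0.81, 0.9, 1.0]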

Conclusions

We described a very simple implementation of a Policy gradient algorithm. The algorithm is far from perfect: most likely the convergence can be significantly improved by some of the techniques mentioned above. The implementation is a generic RL setup and does not rely heavily on the particular game; one can adapt the same algorithm to a different game by simply changing the dimensions of the input and output layers of the neural network.
