Thursday, October 4, 2018

Solving Atari's Lunar Lander with TensorFlow and Evolution


Overview
In this post we discuss the implementation of a very simple evolutionary machine learning algorithm which aims to solve Atari's Lunar Lander game using OpenAI Gym.
The algorithm is a very basic implementation of an evolution strategy. The idea is inspired by hardmaru's A Visual Guide to Evolution Strategies and Donal Byrne's article PyTorch Landing A Rocket With Simple Reinforcement Learning. Here we provide a very simple TensorFlow implementation. The complete source code can be found on GitHub. The requirements of this project are:
  • IPython/Jupyter
  • NumPy
  • TensorFlow
  • OpenAI Gym
  • matplotlib (optional)
Idea of the algorithm
The algorithm can be seen as a very "extreme" version of an evolution algorithm: only the best agent survives. Here is how it works:
  1. Generate \( N_a \) agents.
  2. Every agent plays \( N_g \) games. Calculate the average score per agent.
  3. Only the agent with the highest score survives. Produce the next generation by mutating the best-performing agent. In our case we use Gaussian noise centered around \( 0 \) with standard deviation \( \sigma \).
  4. Repeat steps 2 and 3 until the desired score is reached.
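Before diving into the actual implementation, here is a tiny self-contained toy that illustrates the loop structure of steps 1-4. It is only an illustration, not the Lunar Lander code: the "agent" is just a parameter vector and its "score" is how close it gets to a hidden target vector.

```python
import numpy as np

N_a, sigma, target = 10, 0.03, np.array([0.3, -0.7, 1.2])

def evaluate(agent):
    # Higher is better: negative distance to the hidden target.
    return -np.linalg.norm(agent - target)

def mutate(agent):
    # Gaussian mutation centered around 0 with standard deviation sigma.
    return agent + np.random.normal(0.0, sigma, size=agent.shape)

agents = [np.random.randn(3) for _ in range(N_a)]            # step 1
for generation in range(200):                                # steps 2-4 (fixed budget here)
    scores = [evaluate(a) for a in agents]                   # step 2
    best = agents[int(np.argmax(scores))]                    # step 3
    agents = [best] + [mutate(best) for _ in range(N_a - 1)]
print("best score:", max(evaluate(a) for a in agents))
```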
Implementation
The Agent class
The agents are implemented as Python classes. Each agent is an instance of the Agent class. The class contains all the necessary infrastructure of an agent: it defines the agent's neural network together with the main operations (initialization, inheritance, mutation) and some helpers.
First off we define the structure of the neural network in the constructor __init__() of the class. It consists of an input layer, two hidden dense layers with ReLU activation functions, and a softmax output layer. The softmax layer always outputs non-negative values which sum to \( 1 \). We will interpret these outputs as probabilities for taking a certain action from the action space of the game. Note that all TensorFlow variables of an agent are defined within a variable scope named after the agent; we can later use that naming to address the agent's trainable variables.
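A constructor along these lines, using the TensorFlow 1.x API that was current at the time, could be sketched as follows. The argument and attribute names are illustrative, not necessarily those used in the repository.

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x API


class Agent:
    def __init__(self, name, obs_dim, n_actions, hsizes=(24, 24)):
        self.name = name
        # Everything lives under a variable scope named after the agent,
        # so the variables can later be retrieved by that name.
        with tf.variable_scope(self.name):
            self.obs = tf.placeholder(tf.float32, [None, obs_dim], name="obs")
            h1 = tf.layers.dense(self.obs, hsizes[0], activation=tf.nn.relu)
            h2 = tf.layers.dense(h1, hsizes[1], activation=tf.nn.relu)
            # Softmax: non-negative outputs that sum to 1, interpreted as
            # probabilities over the discrete actions of the environment.
            self.action_probs = tf.layers.dense(
                h2, n_actions, activation=tf.nn.softmax)
```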

Next, we need some helpers: the zero() method nullifies all the entries within the neural network of an agent, and the train_vars() method returns the TensorFlow trainable variables which represent the agent's neural network.
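Continuing the sketch of the class (and assuming a TensorFlow session is passed in explicitly, which is a choice made for this example), the two helpers could look like this:

```python
    def train_vars(self):
        # All trainable variables created under this agent's variable scope.
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                 scope=self.name)

    def zero(self, sess):
        # Nullify every weight and bias of this agent's network.
        for var in self.train_vars():
            var.load(np.zeros(var.shape.as_list()), sess)
```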

Notable methods are: from_population(), which replaces the data associated with an agent with a weighted combination of the data of a collection of predecessor agents (the previous generation). Each predecessor is assigned a weight. In this particular version of the algorithm we always supply only one predecessor: the best-scoring agent with weight one.
from_agent() initializes the agent with the data of another agent, with the option to specify a noise function. The noise function is our mutation mechanism; different noise functions can be specified. In this particular version we use a Gaussian distribution centered around zero with \( \sigma \) specified as a hyperparameter.
compute() takes an observation vector and returns the action probabilities as calculated by the agent's internal neural network.
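Sketches of these three methods, continuing the same class and explicit-session convention (again illustrative rather than verbatim):

```python
    def from_population(self, sess, agents, weights):
        # Replace this agent's weights with a weighted combination of the
        # predecessors' weights. In this post we always pass a single
        # predecessor (the best agent) with weight 1.0, i.e. a plain copy.
        for i, mine in enumerate(self.train_vars()):
            combined = sum(w * sess.run(a.train_vars()[i])
                           for a, w in zip(agents, weights))
            mine.load(combined, sess)

    def from_agent(self, sess, other, noise_fn=None):
        # Copy another agent's weights, optionally perturbed by a noise
        # function (the mutation step). noise_fn maps a shape to a noise array.
        for mine, theirs in zip(self.train_vars(), other.train_vars()):
            value = sess.run(theirs)
            if noise_fn is not None:
                value = value + noise_fn(value.shape)
            mine.load(value, sess)

    def compute(self, sess, observation):
        # Action probabilities for a single observation vector.
        return sess.run(self.action_probs,
                        feed_dict={self.obs: observation[None, :]})[0]
```

With this convention, the Gaussian mutation amounts to something like noise_fn = lambda shape: np.random.normal(0.0, sigma, shape).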
Play
The play() function accepts an OpenAI Gym environment, an Agent object, a number of games, a maximum number of steps per game, and a boolean render flag. The function returns the accumulated score from all games. This routine uses Agent.compute() to choose the agent's actions during a sequence of games.
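A sketch of play(), assuming the classic Gym API of that time (reset() returning an observation and step() returning a 4-tuple) and the same explicit-session convention as above:

```python
import numpy as np

def play(env, agent, sess, n_games=1, max_steps=300, render=False):
    """Return the total score accumulated by `agent` over `n_games` episodes."""
    total_score = 0.0
    for _ in range(n_games):
        obs = env.reset()
        for _ in range(max_steps):
            if render:
                env.render()
            probs = agent.compute(sess, np.asarray(obs, dtype=np.float32))
            # Sample an action according to the network's output probabilities.
            action = np.random.choice(len(probs), p=probs)
            obs, reward, done, _ = env.step(action)
            total_score += reward
            if done:
                break
    return total_score
```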
The learning algorithm
Now that we have defined the behaviour of our agents, it is time to implement the learning routine. It follows closely the basic idea of the algorithm described in the previous section. We use the play() routine to generate a cumulative score for each agent within a generation. After that we use Agent.from_population() and Agent.from_agent() to seed the next generation with the best-scoring agent. We also store the average and the maximum score for each generation; the history of these provides a good benchmark of the learning process. The rest of the code is timekeeping and some "interactive" control, i.e. we use a file called `1.ctl` to indicate when the learning is to be interrupted.
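Putting the pieces together, the outer loop could be sketched like this. The environment id, the extra "seed" agent and the exact `1.ctl` check are assumptions made for the sake of the example; the hyperparameter values are the ones listed in the next section.

```python
import os
import gym

num_agents, num_games_per_agent, max_steps, sigma = 10, 50, 300, 0.03
hsizes = [24, 24]

env = gym.make("LunarLander-v2")
obs_dim = env.observation_space.shape[0]      # 8 observation features
n_actions = env.action_space.n                # 4 discrete actions

agents = [Agent("agent_%d" % i, obs_dim, n_actions, hsizes)
          for i in range(num_agents)]
seed = Agent("seed", obs_dim, n_actions, hsizes)
noise_fn = lambda shape: np.random.normal(0.0, sigma, shape)

avg_history, max_history = [], []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    generation = 0
    while not os.path.exists("1.ctl"):        # create this file to stop training
        # Average score per agent over its games (step 2).
        scores = [play(env, a, sess, num_games_per_agent, max_steps)
                  / num_games_per_agent for a in agents]
        avg_history.append(np.mean(scores))
        max_history.append(np.max(scores))
        best = agents[int(np.argmax(scores))]
        # Seed the next generation with the best agent, then mutate it (step 3).
        seed.from_population(sess, [best], [1.0])
        for a in agents:
            a.from_agent(sess, seed, noise_fn)
        generation += 1
        print("generation %d: avg %.1f, max %.1f"
              % (generation, avg_history[-1], max_history[-1]))
```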
Learning curve
As we have all the infrastructure, neural networks and algorithms in place, the last ingredient is the set of hyperparameters for the training. For this particular run the following values were chosen:
  • num_agents = 10 (corresponds to \( N_a \) )
  • num_games_per_agent = 50 (corresponds to \( N_g \) )
  • max_steps = 300
  • hsizes = [24, 24] the sizes of the two hidden layers within the neural network of an agent
  • sigma = 0.03 the standard deviation of the Gaussian noise we use for mutations
With the above hyperparameters the model was run for approximately 300 generations. The progress of the learning is depicted in the figure below.
The plot clearly shows that the agents "evolve" towards better performance. We demonstrate an agent in action in the animation below. It is visible that the agent is able to control the speed and the direction of the lander and to point it towards the landing zone.
Space for improvements
As mentioned before, this is a vanilla implementation of an evolution algorithm, so there is plenty of room for improvement. Here we list a few ideas:
  • \( N_g \) is too big. We might increase the speed of the training if we decrease the number of games every agent plays. We keep \( N_g > 1 \) only because we need statistical certainty about the relative performance of agents within the same generation.
  • We throw away \( 90\% \) of the information. We eliminate \( 9 \) out of \( 10 \) agents per generation. The second and third best agents also contain useful information, and even the worst-performing agent carries information about what kind of behavior should be avoided. hardmaru describes modifications of the algorithm which produce the next generation as a mixture of multiple parents.
  • \( \sigma \) is kept constant during the whole training process. Maybe we can use an adaptive \( \sigma \): keeping it high at the beginning of the training and decreasing it as the learning curve saturates (see the sketch after this list). This would allow for faster training at the beginning and more precise adjustments towards the end.
  • The algorithm might perform comparably with smaller hidden layers. That would make the training faster, but the only way to know for sure is to experiment.
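For the adaptive \( \sigma \) idea, one simple possibility (purely an illustration, not something implemented in the repository) is an exponential decay schedule:

```python
def sigma_for(generation, sigma_start=0.1, sigma_min=0.005, decay=0.99):
    # Large mutations early on, smaller ones as the learning curve saturates.
    return max(sigma_min, sigma_start * decay ** generation)

# Inside the training loop, rebuild the noise function each generation:
noise_fn = lambda shape: np.random.normal(0.0, sigma_for(generation), shape)
```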
Final remarks
In comparison to other RL approaches like policy gradients and deep Q-learning, this evolution strategy algorithm is easier to implement and test, which makes it a good candidate for starters. Another important advantage of the algorithm is that it is very easy to parallelize: each agent can run in its own process or on its own machine.
