<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Jake Grigsby on Medium]]></title>
        <description><![CDATA[Stories by Jake Grigsby on Medium]]></description>
        <link>https://medium.com/@__jakegrigsby__?source=rss-ee2f345033c9------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*gbMWt27iTJHX3v7AzrjOQw.gif</url>
            <title>Stories by Jake Grigsby on Medium</title>
            <link>https://medium.com/@__jakegrigsby__?source=rss-ee2f345033c9------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 20 Apr 2026 04:23:08 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@__jakegrigsby__/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Multivariate Time Series Forecasting with Transformers]]></title>
            <link>https://medium.com/data-science/multivariate-time-series-forecasting-with-transformers-384dc6ce989b?source=rss-ee2f345033c9------2</link>
            <guid isPermaLink="false">https://medium.com/p/384dc6ce989b</guid>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[time-series-forecasting]]></category>
            <dc:creator><![CDATA[Jake Grigsby]]></dc:creator>
            <pubDate>Thu, 28 Oct 2021 18:56:53 GMT</pubDate>
            <atom:updated>2021-10-29T15:42:11.026Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://towardsdatascience.com/tagged/thoughts-and-theory">Thoughts and Theory</a></h4><h4>Spatiotemporal Learning Without a Graph</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TpXlg5KrdDFIIJphqJmFLw.png" /><figcaption>Temporal Self-Attention (left) and Spatiotemporal Self-Attention (right). Splitting each timestep into separate time series variables lets us learn attention patterns between each variable across time. [All images in this post by the authors]</figcaption></figure><p>Many real-world applications of Machine Learning involve making predictions about the outcomes of a group of related variables based on historical context. We might want to forecast the traffic conditions on connected roads, the weather at nearby locations, or the demand for similar products. By modeling multiple time series together, we hope that changes in one variable may reveal key information about the behavior of related variables. Multivariate Time Series Forecasting (TSF) datasets have two axes of difficulty: we need to learn <em>temporal</em> relationships to understand how values change over time and <em>spatial</em> relationships to know how variables impact one another.</p><p>Popular statistical approaches to TSF can struggle to interpret long context sequences and scale to complex variable relationships. Deep Learning models overcome these challenges by making use of large datasets to predict rare events far into the future. Many methods focus on learning temporal patterns across long timespans and are <a href="https://arxiv.org/abs/2004.13408">based on Recurrent or Convolutional layers</a>. In highly spatial domains, <a href="https://arxiv.org/abs/1901.00596">Graph Neural Networks (GNNs)</a> can analyze the relationships amongst variables as a graph of connected nodes. 
That graph is often pre-defined, e.g., a map of roads and intersections in traffic forecasting.</p><p>In this post, we hope to explain our recent work on a hybrid model that learns a graph across both space and time purely from data. We convert multivariate TSF into a super-long sequence prediction problem that is solvable with recent improvements to the Transformer architecture. The approach leads to competitive results in domains ranging from temperature prediction to traffic and energy forecasting.</p><p>This is an informal summary of our research paper, <strong><em>“Long-Range Transformers for Dynamic Spatiotemporal Forecasting,”</em></strong> Grigsby, Wang, and Qi, 2021. <a href="https://arxiv.org/abs/2109.12218">The paper is available on arXiv</a>, and all the code necessary to replicate the experiments and apply the model to new problems <a href="https://github.com/QData/spacetimeformer">can be found on GitHub</a>.</p><h3>Transformers and Time Series Forecasting</h3><p>Transformers are a state-of-the-art solution to Natural Language Processing (NLP) tasks. They are based on the Multi-head Self-Attention (MSA) mechanism, in which each token along the input sequence is compared to every other token in order to gather information and learn dynamic contextual representations. <a href="https://thegradient.pub/transformers-are-graph-neural-networks/">The Transformer learns an information-passing graph between its inputs</a>. Because they do not analyze their input sequentially, Transformers largely solve the vanishing gradient problem that hinders Recurrent Neural Networks (RNNs) in long-term prediction. For this reason, Transformers have been applied to datasets with long historical information, <a href="https://arxiv.org/abs/2012.07436">including TSF</a>.</p><p>Multivariate TSF datasets are usually organized by time: the values of all <strong><em>N</em></strong> variables are represented as a single vector. 
However, this only allows Transformers to learn relationships between the entire stack of variables across time. In complex Multivariate TSF problems, each variable has meaningful relationships to its history as well as different events in the history of other variables. A standard application of Transformers to TSF data can’t learn this because it treats the values of every variable at a given timestep as a single token in its graph; each variable cannot have its own opinion on the context it should prioritize. This is unlike the NLP tasks where Transformers are so popular, where every token represents a unified idea (a single word).</p><p>We address this by creating a new prediction problem in which each token represents the value of a single variable per timestep. Transformers are then free to attend to the values of any variable at any time in order to make more accurate predictions. The diagram at the top of this post shows the difference between these two types of attention.</p><h3>Spacetimeformer</h3><p>We use an input format in which <strong><em>N</em></strong> variables at <strong><em>T </em></strong>timesteps are flattened into a sequence of <strong>(<em>N </em>x <em>T</em>)</strong> tokens. The value of each variable is projected to a high-dimensional space with a feed-forward layer. We then add information about the timestep and variable corresponding to each token. The time and variable embeddings are initialized randomly and trained with the rest of the model to improve our representation of temporal and spatial relationships. The values at future timesteps we want to predict are set to zero, and we tell the model which ones are missing with a binary “given” embedding. The different components are summed and laid out such that Transformer MSA constructs a <em>spatiotemporal </em>graph across time and variable space. 
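To make the input format concrete, here is a minimal NumPy sketch of the token layout (the shapes, random "learned" embeddings, and function name are illustrative; the real embedding in the spacetimeformer repo also adds Time2Vec and the binary "given" embedding):

```python
import numpy as np

rng = np.random.default_rng(0)

def spatiotemporal_tokens(x, d_model, rng):
    # x: (T, N) multivariate series -> (N*T, d_model) token sequence.
    # Hypothetical simplification: a linear value projection plus learned
    # time and variable embeddings, summed per token.
    t, n = x.shape
    w_val = rng.normal(size=(1, d_model))      # value projection weights
    time_emb = rng.normal(size=(t, d_model))   # one vector per timestep
    var_emb = rng.normal(size=(n, d_model))    # one vector per variable
    values = x.T.reshape(n * t, 1)             # variable-major token order
    time_idx = np.tile(np.arange(t), n)        # 0..T-1 repeated per variable
    var_idx = np.repeat(np.arange(n), t)       # variable id per token
    return values @ w_val + time_emb[time_idx] + var_emb[var_idx]

# 5 timesteps of 3 variables become a sequence of 15 tokens.
tokens = spatiotemporal_tokens(rng.normal(size=(5, 3)), d_model=16, rng=rng)
print(tokens.shape)  # (15, 16)
```

Because time and variable identity live in the embeddings rather than the sequence layout, attention is free to relate any variable at any timestep to any other.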
The embedding pipeline is visualized in the figure below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cFtCSzr_JgJdz8MoCD6rcg.png" /><figcaption><strong>Spatiotemporal Sequence Inputs: </strong>(1) The multivariate input format with time information included. Decoder inputs have missing (“?”) values set to zero where predictions will be made. (2) The time sequence is passed through a <a href="https://arxiv.org/abs/1907.05321">Time2Vec</a> layer to generate a frequency embedding that represents periodic input patterns. (3) A binary embedding indicates whether this value is given as context or needs to be predicted. (4) The integer index of each time series is mapped to a “spatial” representation with a lookup-table embedding. (5) The Time2Vec embedding and variable values of each time series are projected with a feed-forward layer. (6) Value&amp;Time, Variable, and Given embeddings are summed and laid out such that MSA attends to relationships across both time and variable space at the cost of a longer input sequence.</figcaption></figure><p>Standard Transformers compare each token to every other token to find relevant information in the sequence. This means that the model’s runtime and memory use grow quadratically with the total length of its input. Our method greatly exacerbates this problem by making the sequence <strong><em>N</em></strong> times longer than the time series itself. The rest of our approach deals with the engineering challenge of making it possible to train this model without the highest-end GPU/TPUs.</p><p>Efficient Transformers are an active area of research in applications with long input sequences. These “Long-Range Transformers” look to fit the gradient computation of longer sequences in GPU memory. 
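To get a feel for the scale of the problem, compare the attention-matrix sizes of the temporal and flattened spatiotemporal layouts (the numbers are illustrative, e.g. a 160-step window over a 137-variable dataset):

```python
# Self-attention compares every token to every other token, so memory for
# the attention matrix scales with (sequence length) ** 2.
def attn_entries(seq_len):
    return seq_len ** 2

T = 160   # timesteps (illustrative forecast window)
N = 137   # variables (illustrative, e.g. 137 solar sites)

temporal = attn_entries(T)            # one token per timestep
spatiotemporal = attn_entries(N * T)  # one token per variable per timestep

print(temporal)                    # 25600
print(spatiotemporal)              # 480486400
print(spatiotemporal // temporal)  # 18769 == N ** 2
```

Flattening multiplies the attention cost by N squared, which is why a linear-complexity attention approximation matters here.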
They often do this by <a href="https://ai.googleblog.com/2021/03/constructing-transformers-for-longer.html">adding heuristics to make the attention graph sparse</a>, but those assumptions don’t always hold up outside of NLP. We use the <a href="https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html">Performer attention mechanism</a>, which linearly approximates MSA with a kernel of random features. Performer is efficient enough to fit sequences of thousands of tokens, and lets us train our model in a few hours on one node with 10GB GPUs.</p><p>The context sequence of historical data and the target timestamps we’d like to predict are converted to long spatiotemporal sequences. A Performer-based encoder-decoder architecture processes the sequence and predicts the value of each variable at future timesteps as separate tokens. We can then re-stack the predictions to their original format and train to minimize prediction-error metrics like mean squared error. The model can also create a range of forecasts by outputting both the mean and standard deviation of a normal distribution — in which case we train to maximize the probability of the ground-truth sequence. The full model architecture is shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*fp-qsFSLpMkFCnEYnvfwbw.png" /><figcaption><strong>Spacetimeformer Architecture: </strong><em>Target and Context Embeddings are passed through a sequence of efficient attention layers. “Global Attention” modules look at every timestep of every variable while “Local Attention” sees the timesteps of each variable separately. We found this to be helpful in datasets with large </em>N.</figcaption></figure><h3>Applications</h3><p>We compare the model against more standard TSF and GNN methods. 
<em>Linear</em> <em>AR</em> is a basic linear model trained to make auto-regressive predictions, meaning it outputs one token at a time and recycles its output as an input for the next prediction. <em>LSTM</em> is a standard RNN-based encoder-decoder model without attention. <a href="https://arxiv.org/abs/1703.07015"><em>LSTNet</em></a> is an auto-regressive model based on Conv1D layers and RNNs with skip connections to remember long-term context. <a href="https://arxiv.org/abs/1707.01926"><em>DCRNN</em></a><em> </em>is a graph-based model that can be used when a pre-defined variable graph is available. Like our method, <a href="https://arxiv.org/abs/2005.11650"><em>MTGNN</em></a> is a TSF/GNN hybrid that learns its graph structure from data but does not use Transformers for temporal forecasting. Finally, we include a version of our own model that does not separate the tokens into a spatiotemporal graph; the values of each variable remain stacked together as usual. This “<em>Temporal” </em>ablation is a stand-in for <a href="https://arxiv.org/abs/2012.07436"><em>Informer</em></a><em>, </em>but it uses all of the rest of our engineering tricks and training process to isolate the benefits of spatiotemporal attention.</p><p>First, we’ll look at a weather forecasting task. We used the <a href="https://www.faa.gov/air_traffic/weather/asos/">ASOS Network</a> to put together a large dataset of temperature readings from airports in Texas and New York. The geographic separation between the two groups makes spatial relationships more important, and those relationships have to be learned from experience because we do not provide any location information. We predict 40, 80, and 160 hours into the future and compare Mean Squared Error (MSE), Mean Absolute Error (MAE) and Root Relative Squared Error (RRSE). 
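For reference, the three metrics can be sketched as follows (RRSE here uses the common definition, relative to a predict-the-mean baseline; this is a sketch, not our exact evaluation code):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error
    return np.mean(np.abs(y_true - y_pred))

def rrse(y_true, y_pred):
    # Root Relative Squared Error: squared error relative to always
    # predicting the mean of the targets.
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((y_true - y_true.mean()) ** 2)
    return np.sqrt(num / den)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
print(mse(y_true, y_pred))   # 0.125
print(mae(y_true, y_pred))   # 0.25
print(rrse(y_true, y_pred))  # ~0.316 (1.0 would match the mean baseline)
```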
This experiment focuses on TSF models because a graph is not available.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*E-xaKGJZ7Fxy71hvAmScxw.png" /></figure><p>Spacetimeformer outperforms the baselines, and its advantage over the <em>Temporal</em> attention version appears to increase with the length of our predictions. Our goal is to learn a spatiotemporal attention graph, and we can verify that this is what Spacetimeformer accomplishes by visualizing its attention network. Attention matrices visualize MSA by revealing the attention given by each token to the full sequence; each row is one token, and the columns of that row show the token’s attention to the other tokens in the sequence, including itself. The figure below shows the weather station variables and attention matrices of Spacetimeformer and the Temporal-only variant, where darker blue coloring corresponds to more attention. The standard Temporal mechanism learns a sliding wave-like pattern where each token focuses mostly on itself (along the diagonal) and on the very end of the sequence. Spacetimeformer flattens that sequence into separate variables, with each variable having its own sub-sequence of tokens (indicated by a green arrow and the variable shape). This results in a ‘block-structured’ attention matrix where all the tokens of a variable tend to prioritize the timesteps of a subset of the other variables. We can interpret these patterns to understand the spatial relationships the model is learning. In this case, the model can correctly cluster the Texas and New York stations together — and if you zoom in, you can see the same wave-like temporal patterns within each subsequence.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*w2XWE-TWeg2r_rRLWKJCiw.png" /></figure><p>Next, we look at three benchmark datasets in traffic and energy forecasting. 
AL Solar measures the solar output of 137 sites in Alabama, while the Metr-LA and Pems-Bay datasets measure vehicle speeds at 200+ road sensors around Los Angeles and San Francisco, respectively. We generate 4-hour solar forecasts and 1-hour traffic forecasts. The results are shown in the tables below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IC-KmcM9hF4ZtN_IWSMAiA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*trImzhf5-gCRxk3t6QaP5A.png" /></figure><p>Spacetimeformer learns an accurate prediction model in all cases. The traffic results are interesting because the complex road network makes these problems a common benchmark in Graph Neural Network research, where the map can be turned into a graph and provided to the model in advance. Our model has comparable predictive power while implicitly learning a roadmap from data.</p><p>If you’d like to apply this method to new problems, the source code for the model and training process is released on <a href="https://github.com/QData/spacetimeformer">GitHub at QData/spacetimeformer</a>. A more detailed explanation with additional background and related work can be found in our <a href="https://arxiv.org/abs/2109.12218">paper</a>.</p><hr><p>Written by Jake Grigsby, Zhe Wang, and Yanjun Qi. 
This research was done by the <a href="https://qdata.github.io/qdata-page/">QData Lab at the University of Virginia</a>.</p><hr><p><a href="https://medium.com/data-science/multivariate-time-series-forecasting-with-transformers-384dc6ce989b">Multivariate Time Series Forecasting with Transformers</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Arti Canon: Neural Text Generation]]></title>
            <link>https://medium.com/data-science/the-arti-canon-neural-text-generation-2a8f032c2a68?source=rss-ee2f345033c9------2</link>
            <guid isPermaLink="false">https://medium.com/p/2a8f032c2a68</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[sequence-modeling]]></category>
            <category><![CDATA[neural-text-generation]]></category>
            <category><![CDATA[buddhism]]></category>
            <dc:creator><![CDATA[Jake Grigsby]]></dc:creator>
            <pubDate>Mon, 22 Oct 2018 13:08:16 GMT</pubDate>
            <atom:updated>2018-12-03T01:53:13.480Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9CJIGAVoXXNzhFUGEZb6Tg.gif" /></figure><blockquote>“Why should I be angry with the childish pain of conditioned existence?”</blockquote><blockquote>— A computer</blockquote><p>Generating text with neural networks trained on comprehensive datasets has been possible for a number of years. One of the most popular applications of machine learning, this process (neural text generation) involves building a statistical model of a given piece of text and using that model to output similar writings of its own. Existing applications of neural text generation have been able to produce entertaining adaptations of <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">LaTeX and C code</a>.</p><p>Examples like those highlight these models’ ability to master highly structured, syntax-intensive domains. That got us thinking about the opposite end of the spectrum — about genres with simple structures but complicated ideas and deep insights. Could a neural network be trained on both classical and contemporary Buddhist writings in order to produce its own canonical work?</p><p>Thus, our server began its own mathematical progression down the path to awakening.</p><h4>Technical Overview</h4><p>The core of this text-generation method is a time series model: a function that predicts the next token in a sequence given the <strong><em>context sequence</em></strong> (the last several steps).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/413/0*3JdO_6VtH5ZPi48k" /></figure><p>In our case, these tokens are individual characters. 
So the function is predicting the probability of the next character given the sequence of characters that came before it.</p><p>This prediction function can be approximated with deep learning, using a training technique called <strong><em>teacher forcing</em></strong> that lets the neural network learn the statistical structure of the language. In teacher forcing, a text corpus is broken down into (<em>x, y</em>) pairs where <em>x</em> is a slice of characters of length <em>t </em>and <em>y</em> is the <em>t+1 </em>character. The network outputs probabilities for each character and is looking to minimize the cross entropy loss:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/467/1*Fxo_qRwtnHzNJsgzsKbzLA.png" /><figcaption>The cross entropy loss measures the similarity between a network’s predictions and the true labels. It comes up often in prediction problems, like NTG and image classification.</figcaption></figure><p>There are a couple ways to do this (including some complicated GAN and RL-based approaches), but we went with a recurrent encoder-decoder architecture. The encoder reads each time step in <em>x </em>as input, and keeps track of past inputs in the hidden/recurrent state of the RNN cell (we are using LSTMs; <a href="https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714">how LSTMs work</a>). After all <em>t </em>steps in the context sequence, we end up with a sequence of vectors meant to distill the content of the whole sentence into numbers. The Arti Canon actually runs through <em>x </em>forwards and backwards, concatenating the resulting vectors in hopes that including the information both before and after each character will help it capture more of its context (a <strong><em>bidirectional</em> </strong>layer).</p><p>The decoder takes the vector representation of the context sequence and outputs predictions for the next character. 
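The teacher-forcing data preparation described above can be sketched in a few lines (the corpus string and window length t here are illustrative):

```python
def make_pairs(corpus: str, t: int):
    # Slice a corpus into (context, next-char) training pairs:
    # each x is t consecutive characters and y is the character that
    # follows, which is exactly the setup teacher forcing trains on.
    pairs = []
    for i in range(len(corpus) - t):
        pairs.append((corpus[i:i + t], corpus[i + t]))
    return pairs

pairs = make_pairs("the dhammapada", t=4)
print(pairs[0])   # ('the ', 'd')
print(pairs[1])   # ('he d', 'h')
print(len(pairs)) # 10
```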
In this model, we’re using two LSTM layers, followed by a dense layer with a softmax activation. The problem is, the decoder needs to figure out which timesteps are actually relevant to its prediction; the characters towards the end are important if we want to preserve spelling and grammar, but the beginning might hold important information about the direction the sentence is headed. Here’s a good (made up)¹ example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/432/1*-bHSK7yTWnDbRNDlPw3_gw.png" /><figcaption>An example attention vector. Orange represents more attention. This vector is broadcasted and multiplied by the encoder’s output state, giving the decoder a representation of each step that is proportional to that step’s importance.</figcaption></figure><p>The decoder would clearly need to be focused on the last letter ‘g’, which lets it figure out it’s reached the end of the word ‘clapping’; it’s expecting a whitespace or some kind of punctuation. But there are still several options — how does it decide? The key lies all the way back at the start of the sequence, with ‘What’, which would indicate a question. This ability to adjust focus between disjoint sequence fragments is called <strong><em>attention</em></strong><em>. </em>It’s used all over the place in sequence modeling, most famously in translation, where it helps align languages with different conventions for sentence structure and word order.</p><p>There are a couple more tricks we use to increase prediction accuracy:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/695/1*DpBPp5RK3ylWpgoTWqUjPA.png" /></figure><ol><li><strong>Dropout<em> </em></strong>is a common technique that handicaps the network during training in hopes that the added difficulty can help boost results during prediction. 
We use it on the LSTM layers.</li><li><strong>Stripping the source text of rare characters<em>.</em></strong> There can be a lot of obscure characters in a text corpus, especially when mining from pdfs. The network has a hard enough job as it is — we don’t make it learn to distinguish between obscure pdf artifacts. The same principle applies to capital letters. If the dataset is small, removing capitals is a great way to increase generalization. The network has no concept of the relation between lowercase and uppercase letters; ‘G’ is just a different index in its one-hot-encoded input vector from ‘g’. Leaving them in splits training examples that would otherwise be teaching the network the same concept².</li><li><strong>Lots of data. </strong>Most of the development time on this project was spent in the loop of training the model, not being satisfied with the results, and then adding more data before training again. An interesting way to make sense of the impact that larger and larger datasets have on overall performance is to compare versions of the Arti Canon generated earlier in the project. We set out to directly imitate the Dhammapada, so the first model trained on nothing but the 400 or so lines that make up that text. Soon we expanded to 7,000 lines, which was enough to give the outputs some variety, but the syntax was choppy and hard to read. The final version reads more than 20,000 lines of text, and that extra data let the model learn sentence structures that flow well, even if they don’t always make sense. You can read full pdf outputs from earlier models in the output section of the repository.</li></ol><p>All we really have now is a way of predicting the next character based on previous ones. How is that useful? 
It forms the core of a text generation loop, where we can grow a seed to infinite length using repeated calls to the neural network.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgiphy.com%2Fembed%2F5nxuCdLxJYlpyR7O45%2Ftwitter%2Fiframe&amp;url=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F5nxuCdLxJYlpyR7O45%2Fgiphy.gif&amp;image=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F5nxuCdLxJYlpyR7O45%2Fgiphy.gif&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=giphy" width="435" height="290" frameborder="0" scrolling="no"><a href="https://medium.com/media/649355e7361338f846d2f7b0b07d5801/href">https://medium.com/media/649355e7361338f846d2f7b0b07d5801/href</a></iframe><p>So for some seed <em>s</em>, we can use the trained model to calculate next-character predictions in a vector of probabilities.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/234/1*lIj8EDnldHiPQCI8g2yyDw.png" /><figcaption>Example output from the softmax layer, with corresponding encoding indices.</figcaption></figure><p>But how do we pick which character to add to the seed we will use for the next calculation? The naive way, <strong><em>greedy selection</em></strong>, is just to take the highest value, in this case ‘c’ at 25%. That may seem ok (after all, that’s exactly what we trained the language model to try and do), but in practice this method leads to super conservative, boring outputs that will favor whatever was most prominent in the training set.</p><p>The next thing you could try is a random sample from the prediction distribution; the most likely character would be picked <em>most</em> of the time, but not all. 
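A quick sketch of the two selection strategies, using an illustrative five-character distribution where ‘c’ has the top probability of 25%:

```python
import numpy as np

rng = np.random.default_rng(0)

chars = np.array(['a', 'b', 'c', 'd', 'e'])
probs = np.array([0.20, 0.15, 0.25, 0.22, 0.18])  # softmax output (illustrative)

# Greedy selection: always take the argmax. Conservative, repetitive text.
greedy = chars[np.argmax(probs)]
print(greedy)  # c

# Random sampling: draw from the distribution, so 'c' wins only about a
# quarter of the time and lower-probability characters still get picked.
samples = rng.choice(chars, size=1000, p=probs)
print((samples == 'c').mean())  # close to 0.25, not 1.0
```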
This can be extended with a <strong><em>temperature</em></strong><em> </em>setting that can skew the distribution towards otherwise low-probability choices, overcoming some of the training set bias and increasing complexity and creativity; it’s a balance — too adventurous and you’re asking for spelling mistakes.</p><p>The better solution is <strong><em>beam search</em></strong>. By considering the top <em>several </em>options, we can score based on the best sentences and phrases, rather than the best individual character choices. This saves us from mistakes where the highest recommendation writes the sentence into a corner of nonsense and repetition; by keeping some backup options in mind, we give alternate solutions the chance to make a comeback and lead to better sentences. The beam width parameter determines how many other options we are considering (avoiding the space and time complexity issues that come with the branching factor of this tree).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/901/1*5hF0sFBMdcbvYchAccnaOA.png" /><figcaption>Example beam search. The top 2 options are retained at every prediction layer, giving the generation loop an alternative route when one path ends up stuck on low-probability characters.</figcaption></figure><p>For a more comprehensive description of beam search, see <a href="https://www.coursera.org/lecture/nlp-sequence-models/beam-search-4EtHZ">this video lesson</a>. Our program adds a few extra features, like a final score evaluation that normalizes for length and checks for spelling, as well as a repetition reduction technique. Using a verse-by-verse beam search with a high beam width leads to a pretty remarkable improvement in results. Unfortunately it also leads to an equally large increase in runtime.</p><p>Despite beam search’s best efforts, there are still some challenges with long generation loops like this one. 
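A bare-bones version of the beam search idea (beam width 2, log-probability scoring, and a toy next-character model are all illustrative; our real implementation adds length normalization, spell checking, and repetition reduction):

```python
import math

def beam_search(step_probs, beam_width=2):
    # Minimal character-level beam search. step_probs(prefix) returns a
    # dict of {char: probability} for the next character. We keep the
    # beam_width highest-scoring prefixes at every step instead of
    # committing to the single best character.
    beams = [("", 0.0)]                      # (prefix, log-probability)
    for _ in range(3):                       # generate 3 characters
        candidates = []
        for prefix, score in beams:
            for ch, p in step_probs(prefix).items():
                candidates.append((prefix + ch, score + math.log(p)))
        # Retain only the top beam_width hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

def toy_model(prefix):
    # After "a" the greedy choice "b" writes itself into a corner, so the
    # kept-alive alternative "ac" comes back to win.
    table = {
        "":   {"a": 0.9, "b": 0.1},
        "a":  {"b": 0.6, "c": 0.4},
        "ab": {"x": 0.1, "y": 0.1},   # dead end: both options unlikely
        "ac": {"d": 0.9, "e": 0.1},
    }
    return table.get(prefix, {"z": 1.0})

print(beam_search(toy_model))  # acd
```

Greedy selection on this toy model would commit to "ab" and get stuck; the beam recovers the higher-scoring "acd".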
Because teacher forcing uses ground truth sequences and labels for every step, the model never has to recover from mistakes; it predicts the next character, compares it to the actual text, and moves on to the next example. But when it has to start predicting on its own output, mistakes can pile up. When they do, the sequence strays from the training distribution, and we start to expose the model’s poor knowledge of general language; it knows how to predict the space it was trained in, but hasn’t seen spelling and grammar mistakes, or any of the content outside the classic Buddhist texts we fed it. You can see that in action by seeding the generator with something random (seed in bold):</p><blockquote>“<strong>What is the runtime complexity of a doubly linked list</strong>a, it o tieiiu or me, when she do you think, and he who sits, ill live within the annihilation of suffering, and when they return to make what they do not require and depend upon the goodness of the destiny of fault.</blockquote><p>It turns out Arti Canon has no interest in doing our data structures homework. The random topic throws it off for a few characters before snapping right back to pseudo-Buddhism.</p><p>There is also an issue with verse and chapter repetition. The Arti Canon’s text generation loop makes decisions based only on its current input, not the history of inputs that came before it. It has some go-to phrases, which is fine, but occasionally those phrases get long enough to fill up the entire input window. Then, because the input state is the same as something it’s written before, it makes the same predictions it’s made before. While a temperature selection strategy introduces enough randomness to eventually push the generator in a new direction, beam search’s deterministic calculation of the best possible verse will spit out the same thing it wrote the last time that common phrase came up. 
And because each verse is seeded based on the previous one, the rest of the output will be a duplicate. We contain the problem by reseeding the first verse of each chapter, so the longest it can repeat itself is about a page. We also experimented with introducing some random noise to beam search’s character selections, but found that to be ineffective, as those selections were just getting tossed out by the hypothesis scoring function. A more effective way to handle this would probably be to add noise directly to the model’s confidence output vectors, but we decided against that, as generating shorter chapters (and therefore reseeding more often) was effective enough and didn’t require additional hyperparameters.</p><p>Most of the <a href="https://github.com/jakegrigsby/articanon">code base</a> deals with the tedious process of automatically assembling the output in a fancy-looking pdf.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*D-msvEPbrCoi7sPTjHZUnQ.png" /><figcaption>The pipeline from start to finish. A neural network generates a raw text file. Then the rest of the program handles automatic conversion from that text file to a pdf.</figcaption></figure><h4>The Arti Canon</h4><p>Unsurprisingly, the generator spends most of its time pondering life’s great questions:</p><blockquote>“What is so special about me?”</blockquote><blockquote>“My body is an awakening mind.”</blockquote><blockquote>“My mind does not exist.”</blockquote><blockquote>“I shall not be bothered by anything.”</blockquote><blockquote>“He who understands satisfies.”</blockquote><p>Many of its passages are structured around the words and actions of the Buddha:</p><blockquote>“The buddha says: I shall not perceive the truth that thy blessed one accept things as they are.”</blockquote><blockquote>“When the blessed one accepted his own refuge, the buddha explained the rainy season.”</blockquote><blockquote>“Look upon the air, for the sake of compassion. 
The disciples of gotama are always well awake, and their thoughts day and night are always set on the road.”</blockquote><p>The Dhammapada makes frequent use of metaphors to shorten and simplify its ideas. The Arti Canon doesn’t have time for variety, so it’s settled on the ultimate metaphor for human existence, and repeats it over and over again just to drive the point home:</p><blockquote>“I should remain like a piece of wood.”</blockquote><blockquote>“The king becomes free from ailments! Let us live happily then, free from good for their sight on beings should remain like a piece of wood.”</blockquote><blockquote>“A noble youth should remain like a piece of wood.”</blockquote><p>It also makes pretty good use of Buddhist vocabulary words, like ‘Gotama’, ‘Brahmana’, ‘Tathagata’, and ‘the Four Noble Truths’.</p><blockquote>“Sacrifice the lobster.”</blockquote><p>Our <a href="https://articanon.com">website</a> has the full output from the latest model, while the <a href="https://github.com/jakegrigsby/articanon">github repo</a> has a second book as well as a few legacy versions for comparison.</p><h4>What Does it All Mean?</h4><p>The truth, of course, is that the Arti Canon has no <em>understanding</em> of its thoughts. It’s a mathematical model that learned the statistical structure of a couple books we slapped in a text file. It can’t be insightful because it doesn’t know what insight is; it contradicts itself left and right because it doesn’t know what an opinion is. And yet… every once in a while it writes something that makes <em>sense</em> — or at least as much sense as the proverbs it learned to read. There’s something there that speaks to the fine line between genius and randomness; great ideas are easy to imitate but hard to come up with.</p><p>You would think Buddhism would be a layup for text generation, if only because a lot of the actual Buddhist texts come across as sort of nonsensical, but in hindsight it wasn’t the best choice. 
Texts like the Dhammapada build the essence of a religion from short words and simple phrasing — they take basic building blocks and design a masterpiece. But as we’ve discussed, the neural network has no concept of meaning, so it masters those same basic building blocks but resorts to the writing equivalent of scattering them around on the floor. You could cover that up a lot better by picking a language space with bigger words and a sentence structure that more closely resembles the way we usually read and write.</p><h4>Sequence Generation in the Real World</h4><p>As much as we love pseudo-intellectual silicon Buddhism, neural text generation has some applications that are actually important.<a href="https://cs.stanford.edu/people/karpathy/deepimagesent/"> Image captioning</a> uses it to write descriptions for input images. Conversational models are starting to use it to come up with genuinely new responses for every question. Text generation is also just a subset of the broader sequence generation problem — many of the underlying ideas carry over to seemingly unrelated domains: if you replace strings and characters with images you have video generation. Then swap out the images for MIDI keyboard notes and you have music generation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*JLgGHBKcgVgEgknd4LjyEw.png" /><figcaption>Example of image captioning. These are taken from <a href="https://cs.stanford.edu/people/karpathy/neuraltalk2/demo.html">neuraltalk2</a> by Andrej Karpathy.</figcaption></figure><p>There is still a lot of research being done and a lot of problems to solve. The Arti Canon is a pretty carefully selected application that dodges some of those bigger challenges. For example, recurrent neural networks have difficulty making decisions over long time horizons; they can’t keep track of what happened 90,000 steps ago. 
There’s a reason you don’t see much ‘Neural Story Generation’ — this technique has trouble including long-term story arcs and keeping track of key details it made up a few pages back. By copying the Dhammapada’s short verse format, Arti Canon only has to remember the paragraph it’s writing right now, and it still doesn’t do a very convincing job.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FLY7x2Ihqjmc%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DLY7x2Ihqjmc&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FLY7x2Ihqjmc%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/83e8ea65f24f682b3ba546d6f53b8605/href">https://medium.com/media/83e8ea65f24f682b3ba546d6f53b8605/href</a></iframe><p>On that note, you can read the entire Arti Canon at: <a href="https://articanon.com">www.articanon.com</a>.</p><p>— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —</p><h4>By Jake Grigsby and Hugh Jones</h4><p><em>The Arti Canon was developed by Jake Grigsby and Zabih Yousuf (</em><a href="https://github.com/jakegrigsby/articanon"><em>GitHub</em></a><em>)</em></p><p><em>The website was developed by Hugh Jones (</em><a href="https://articanon.com"><em>ArtiCanon.com</em></a><em>)</em></p><p><a href="https://cavml.com"><strong><em>Cavalier Machine Learning at the University of Virginia</em></strong></a></p><p>— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —-</p><p>Notes:</p><ol><li>The github repo contains a script that outputs very similar attention diagrams using the actual model. 
In reality the attention tends to exclusively focus on the end of the input sequence, but there is some variance in how far back that window extends.</li><li>Capitalization is restored using regular expressions.</li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2a8f032c2a68" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/the-arti-canon-neural-text-generation-2a8f032c2a68">The Arti Canon: Neural Text Generation</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Lunar Landings from Demonstrations]]></title>
            <link>https://medium.com/data-science/lunar-landings-from-demonstrations-27b553b00ce2?source=rss-ee2f345033c9------2</link>
            <guid isPermaLink="false">https://medium.com/p/27b553b00ce2</guid>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[dqfd]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Jake Grigsby]]></dc:creator>
            <pubDate>Tue, 28 Aug 2018 17:46:45 GMT</pubDate>
            <atom:updated>2018-09-28T22:02:27.594Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cdbuDqj_1rphrfWAiEzZSA.jpeg" /></figure><p>Deep reinforcement learning algorithms have achieved remarkable results on a number of problems once thought to be unsolvable without the aid of human intuition and creativity. RL agents can learn to master tasks like <a href="https://arxiv.org/abs/1712.01815">chess</a> and <a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf">retro video games</a> without any prior instruction — often surpassing the performance of even the greatest human experts. But these methods are sample inefficient and rely on learning from hundreds or even thousands of complete failures before any progress is made. That’s a luxury we can afford when the task is simple or can be simulated, like an Atari screen or chess board, but is at least partially responsible for RL’s relatively short list of real-world applications. For example, it would be incredibly dangerous, expensive, and time inefficient to let a self-driving algorithm learn by smashing a real car into a real wall for the 1000 iterations it might take for it to figure out what the brakes do, or to learn to land a rocket by crashing the first 500 of them.</p><p>What would be nice is if we could fast-forward through that early phase of catastrophic failure and let the agent start real-world learning with a solid understanding from its very first attempt. One approach would be to let the agent burn through its failure stage in a simulated version of the task, where its actions wouldn’t have consequences. That’s fine, but most applications don’t have a perfect simulator and transferring learning from a simplified game-like environment to the real world comes with its own host of problems (stemming from the fact that neural networks don’t generalize as well as we like to give them credit for and often can’t make sense of the subtle differences between the two domains). 
Many tasks, however, do have examples of near-perfect controllers — an expert human who can provide demonstration data to guide the agent to success early and often.</p><p>That’s the solution given in <a href="https://arxiv.org/abs/1704.03732"><em>Deep Q-Learning from Demonstrations</em></a><em>. </em>Their algorithm is an imitation-learning variant of a Deep Q-Network (DQN) called DQfD.</p><p>DQfD works by adding an imitation loss on top of the standard TD-error. During a <em>pretraining</em> <em>phase, </em>the agent samples transitions from the expert’s demonstrations and is incentivized to replicate that expert’s actions in any given state.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kyKNmm_vKDf7OLNhx8gBzQ.png" /><figcaption>The large-margin classification loss used to push the agent’s actions towards the expert. <em>l(ae, a) is a zero matrix with positive entries where the agent chooses an action that the expert did not recommend.</em></figcaption></figure><p>Once the pretraining has been completed, the agent starts interacting with the environment by itself, generating its own transitions and exploring the larger state space. As a result, the loss is a combination of supervised, one-step and <em>n</em>-step TD error. That is the real advantage of this technique relative to the other imitation learning methods out there; most imitators just hope to become a watered-down version of the demonstrator, but DQfD accelerates learning up to the expert’s level and then uses the success of a standard DQN to try and push out beyond that benchmark.</p><p>The agent stores its own experiences alongside the expert’s in the replay buffer, using <a href="https://arxiv.org/abs/1511.05952">Prioritized Experience Replay</a> to weigh the demos higher than its own data. At each training step, it samples proportional to those priorities, making sure to set the imitation loss to 0 for self-generated transitions. 
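The large-margin classification loss in the figure above can be sketched with numpy (a minimal illustration; the helper name is ours and the 0.8 margin is one plausible setting, not necessarily the paper’s):

```python
import numpy as np

def large_margin_loss(q_values, expert_action, margin=0.8):
    """DQfD's supervised term: max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E).

    l(a_E, a) is 0 for the expert's action and a positive margin otherwise,
    so the loss is 0 only when the expert's action beats every other
    action's Q value by at least `margin`.
    """
    q = np.asarray(q_values, dtype=np.float64)
    penalties = np.full_like(q, margin)
    penalties[expert_action] = 0.0   # no penalty on the expert's choice
    return np.max(q + penalties) - q[expert_action]
```

During pretraining this term is added to the one-step and <em>n</em>-step TD errors; for self-generated transitions it is simply set to 0.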
When the replay buffer fills, it overwrites only its own data, leaving the demonstrations intact.</p><h3>Example: Safer Initial Control in Lunar Lander</h3><p>We can use a toy problem to think about the potential impact this research direction might have.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/596/1*xjtZC5tsT2ceJZC2PCF5yQ.png" /><figcaption>LunarLander-v2</figcaption></figure><p>The LunarLander Gym environment tasks a spacecraft equipped with main, left, and right side engines with landing in a target area surrounded by randomly generated terrain. The state space is continuous and a vector in ℝ⁸, with a discrete action space of {do nothing, left engine, right engine, main engine}. The reward is a function of distance to the target, velocity, and fuel spent, with a positive (or negative) constant upon successful or unsuccessful landings.</p><p><strong>Kickstarting Performance</strong></p><p>The idea here is to use the expert controller that is available to train new ML agents, and in most real world applications, that would mean using human experts. But it can just as easily be another artificial agent, and in this case is a Prioritized Double Dueling (PDD) DQN given plenty of time to train:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*cNMubLkRAmovWoReUf_PTg.png" /><figcaption>Prioritized Double Dueling performance over the first 15k episodes. Note that the reward is modified from the raw environment using a log scale that overcomes some of the shortcomings of DQN’s reward clipping approach. (See the DQfD paper for details).</figcaption></figure><p>While its end performance might be good enough, PDD DQN spends its first few thousand attempts smashing the lander to pieces. This is a product of the exploration/exploitation conflict at the core of trial-and-error optimization problems. 
This experiment uses an <em>e</em>-greedy approach, which assigns the greedy (current best) action a probability of:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/330/1*5IoFR9jUMVofud9zITPuLw.png" /><figcaption>Where |A(s)| is the size of the action space</figcaption></figure><p>Less-optimal actions are selected according to:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/136/1*v4WmhZw9KVRmvb9jO5Pibg.png" /></figure><p>This guarantees exploration… and total failure. That’s what makes demonstration data so valuable: DQfD can quickly learn the expert’s policy without having to stumble into it on its own. Instead of an annealed <em>epsilon</em> value starting at 1.0, we can use a constant value of just .01, which lets it explore just enough to surpass the demonstrator’s performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/713/1*_1bjiQ2_0-9k__PeNBp1Jg.png" /></figure><p>DQfD is landing successfully from its first interaction with the ‘real’ environment — bypassing all the obstacles that come with thousands of repeated failures. The pretraining phase is also extremely sample efficient: that initial gap is the result of just 15k pretraining steps on ~50k demonstration samples. Even the large convolutional network used on Atari needs only 750k steps.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/592/1*kQ820sZLj0Ktz23JpfSjwA.gif" /><figcaption>DQfD in action</figcaption></figure><p>One of the challenges that comes with this approach is the tendency to overfit on the demos during pretraining. That’s when the network starts to sacrifice its ability to generalize across the entire input space in favor of high accuracy on the specific examples in the training set. This is an issue in just about every supervised learning application, from speech recognition to language modeling and image classification. 
The obvious answer is just to make the training set bigger — and in the world of big data, most of the time that’s about as easy as it sounds. But we’re unlikely to have that same endless supply of data for imitation learning, especially when we need human demonstrators; the data just doesn’t come as cheap. DQfD has to find another answer, which is a carefully tuned amount of <a href="https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c">l2 regularization</a>, penalizing large weights in an effort to keep the network from chasing after every local minimum in the demo set’s loss.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*PvDEDoloPTK5Fk2x9YPYSg.png" /><figcaption>l2 regularization over the first 15K episodes</figcaption></figure><p>Here, the regularization term is adjusted while other hyperparameters are fixed. High coefficients result in <em>underfitting</em>, hurting initial performance and slowing/destabilizing overall learning progress, while properly-tuned values let the network mold tightly to the demo data without collapsing when given its own chance to fly the lander.</p><h3><strong>Sample Efficiency and Hard Exploration Problems</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/988/1*2pKQrhzxpXdwvZqFStYvQg.png" /><figcaption>Pretraining length makes almost no difference on a problem of this difficulty; DQfD learns the expert’s actions in as few as 15 thousand steps.</figcaption></figure><p>Lunar Lander is an example of using DQfD to solve a relatively easy exploration problem almost instantaneously, but it can also be used to make progress on problems that leave traditional methods hopeless by giving the network something to minimize when real reward signals are few and far between. This is actually the main focus of the paper, with tests run on games like <em>Pitfall!</em> and <em>Hero. 
</em>In fact, a more advanced algorithm called <a href="https://arxiv.org/abs/1805.11593">Ape-X DQfD</a> was the first RL agent to complete a level of the famously hard exploration game <em>Montezuma’s Revenge.</em></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F-0xOdnoxAFo%3Ffeature%3Doembed&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D-0xOdnoxAFo&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F-0xOdnoxAFo%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/ecae064ce9406cc5b90007c99ea6b06b/href">https://medium.com/media/ecae064ce9406cc5b90007c99ea6b06b/href</a></iframe><p><em>(</em><a href="https://medium.com/@awjuliani/on-solving-montezumas-revenge-2146d83f0bc3"><em>Interesting read about how much that matters</em></a><em>)</em></p><p>Another impact of this TD/imitation hybrid technique is its ability to accelerate training cycles. The pretraining phase can shave millions of timesteps and several GPU hours off of each training run, which would add up over hyperparameter grid searches or sweeps through large problem sets (like all 60 Atari games in the <a href="https://github.com/mgbellemare/Arcade-Learning-Environment">ALE</a>). For example, the paper found that DQfD saves roughly <em>83 million</em> frames over DQN on the Atari games they tested it on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/224/1*LwlUfLY0asj_YVpFaCkM-Q.png" /><figcaption>DQfD saving millions of timesteps in Asteroids</figcaption></figure><p>A large enough replay buffer would push the imitation loss near 0, basically reducing DQfD to a super-sample-efficient DQN. 
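The demonstration-preserving buffer described earlier might look like the following (a simplified sketch with uniform sampling; the real agent weights demos higher via prioritized replay, and the class name is ours):

```python
import random

class DemoPreservingBuffer:
    """Replay buffer where demo transitions are permanent and only
    self-generated transitions get overwritten once capacity is reached."""

    def __init__(self, capacity, demos):
        self.demos = list(demos)                # never overwritten
        self.agent_capacity = capacity - len(self.demos)
        self.agent_data = []
        self.next_idx = 0                       # circular write pointer

    def add(self, transition):
        if len(self.agent_data) < self.agent_capacity:
            self.agent_data.append(transition)
        else:                                   # overwrite oldest agent data
            self.agent_data[self.next_idx] = transition
            self.next_idx = (self.next_idx + 1) % self.agent_capacity

    def sample(self, batch_size):
        # uniform sampling over demos + agent data, for illustration only
        return random.sample(self.demos + self.agent_data, batch_size)
```

Letting `add` write over `self.demos` as well would recover the simpler replay-buffer-spiking behavior mentioned above.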
DeepMind even showed that the algorithm can get a fast start with human demonstrations and still be flexible enough to shift gears towards the non-human strategies it would find if it started from scratch — meaning this approach may work even when the demonstrator’s policy is known to be less than perfect.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/368/1*PPAmDb-uh1ugfLV1d8pPqw.png" /><figcaption>Agents typically beat Road Runner using a very non-human gameplay strategy. DQfD uses human demonstrations to increase learning speed, but is still able to find that ‘exploit’. (Demo data average was ~20k.)</figcaption></figure><p>You could also just let the replay buffer overwrite the demonstrations, in which case you’d have a more complicated version of the <a href="https://arxiv.org/abs/1608.05081">replay buffer spiking</a> technique.</p><p>You can access the source code for this project <a href="https://github.com/jakegrigsby/lunarDQfD">on Github.</a></p><p>— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —</p><p><em>Written by </em><a href="https://twitter.com/__jakegrigsby__"><em>Jake Grigsby</em></a><em> with </em><a href="https://twitter.com/hughjonesiv"><em>Hugh Jones</em></a></p><p><em>Cavalier Machine Learning, University of Virginia</em></p><p><a href="http://www.cavml.com">www.cavml.com</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=27b553b00ce2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/lunar-landings-from-demonstrations-27b553b00ce2">Lunar Landings from Demonstrations</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Advanced DQNs: Playing Pac-man with Deep Reinforcement Learning]]></title>
            <link>https://medium.com/data-science/advanced-dqns-playing-pac-man-with-deep-reinforcement-learning-3ffbd99e0814?source=rss-ee2f345033c9------2</link>
            <guid isPermaLink="false">https://medium.com/p/3ffbd99e0814</guid>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[dqn]]></category>
            <category><![CDATA[noisy-networks]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Jake Grigsby]]></dc:creator>
            <pubDate>Fri, 29 Jun 2018 19:30:01 GMT</pubDate>
            <atom:updated>2018-10-31T03:43:10.436Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/740/0*QyvqIsxKAmZkJjja.jpg" /><figcaption>art by <a href="https://www.deviantart.com/art/Pacman-Ghost-537722694">Yojama</a></figcaption></figure><p>In 2013, DeepMind <a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf">published</a> the first version of its Deep Q-Network (DQN), a computer program capable of human-level performance on a number of classic Atari 2600 games. Just like a human, the algorithm played based on its vision of the screen. Starting from scratch, it discovered gameplay strategies that let it meet (and in many cases, exceed) human benchmarks. In the years since, researchers have made a number of improvements that super-charge performance and solve games faster than ever before. We’ve been working to implement these advancements in Keras — the open source, highly accessible machine learning framework — and in this post, we’ll walk through the details of how they work and how they can be used to master Ms. Pac-man.</p><h4><strong>A Brief Introduction to Reinforcement Learning</strong></h4><p>DQN, and similar algorithms like AlphaGo and TRPO, fall under the category of <em>reinforcement learning</em> (RL), a subset of machine learning. In reinforcement learning, an agent exists within an environment and looks to maximize some kind of reward. It takes an action, which changes the environment and feeds it the reward associated with that change. Then it takes a look at its new state and settles on its next action, repeating the process endlessly or until the environment terminates. This decision-making loop is more formally known as a <em>Markov decision process</em> (MDP).</p><p>It’s easy to see how Atari games like Ms. Pac-man fit into the MDP framework. The player looks at the game screen and chooses from the different buttons and joystick positions in an effort to increase their score. 
But that simplicity can hide a daunting task that anyone who’s ever played a video game has probably taken for granted: making instant decisions based on one of the nearly infinite pixel combinations that can show up on screen at any given time, and one you are very unlikely to have ever encountered before. In case that wasn’t hard enough, video games are technically <em>partially observable </em>MDPs, in the sense that you are forced to make choices based on an indirect representation of the game (the screen) rather than the code/memory itself, hiding some of the information you need to make a fully-informed decision. Lucky for us, video games like Pac-man do provide two useful simplifications: frame rates and controllers. The finite number of buttons and joystick positions lets us map that observation space to a discrete and manageable action space, a mental function that gets simplified further when you realize it doesn’t need to be done continuously, because you can only press one button for every frame (most RL agents go further than that and only make new decisions every 3 or 4 frames). The challenge is then difficult but definable: how do we use our experience playing a game to make decisions that increase the score, and generalize that decision-making process to new positions we’ve never been in before?</p><h4><strong>The Basics of Deep Q-Networks</strong></h4><p>DQNs leverage the incredible success of neural networks to solve that challenge. The architecture can be split into two parts. First, a series of convolutional layers learns to detect increasingly abstract features of the game’s input screen (we’ll get into the details of how this works and what it learns to detect later). 
Then, a dense classifier maps the set of those features present in the current observation to an output layer with one node for every button combination on the controller.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/740/0*9xgjNQQbEzZbRiaC.jpg" /><figcaption>Neural network schematic from DeepMind’s<a href="https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf"> 2015 paper</a></figcaption></figure><p>These output values are the agent’s best prediction of the reward it expects if it were to take this action. This ‘state to predicted reward’ approach is known as <em>value-based learning</em>, and is centered around the action-value function Q(<em>s,a</em>), representing the total value of taking action <em>a </em>in state <em>s, </em>which is the sum of the future rewards <em>r</em>, adjusted by a discount factor <em>gamma</em>, indicating how far into the future the agent is expected to plan. The optimal strategy is then written:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/648/0*WinVpmWDI7P4xa3w.jpg" /></figure><p>Which is just to say that at each step we are taking the action that leads to the maximum reward score. Pretty straightforward once we can calculate those Q values, but of course our Pac-man agent can’t really see the future, and can’t assign a unique Q value to each of the possible states. That’s where the deep learning comes in. By mapping pixel images to Q values, our neural network acts as a Q function approximator, so that while it can’t see the future, with enough training it can learn to predict it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/676/0*LNODZERcBXjMgr2_.jpg" /></figure><p>That training is accomplished using gradient descent on the above loss function, which is a variation of temporal difference. 
This can be thought of as the difference between the ‘true’ or <em>target Q values </em>and our current estimation of them, where the target value is the immediate reward plus the Q value of the action we will take in the next state. Of course, that value is also calculated by our network, but the overall expression is inherently more accurate thanks to it having access to at least the first reward term. Even so, this is definitely the math equivalent of trying to hit a moving target, as the true values are calculated by the same network we are training. This is the big discrepancy between supervised learning — another subset of machine learning that includes tasks like image classification and sentiment analysis — and reinforcement learning. Supervised learning uses datasets that are labeled, meaning that the target values are manually set by humans and assumed to be accurate and unchanging. Reinforcement learning creates its own, ever-shifting dataset, both because the network generates its own target values and because the action choices of the network directly impact which states it will reach in its environment and therefore what it will have to learn about. To help manage that, we actually take two extra stabilization measures. First, we duplicate the neural network, using the second copy, or <em>target network</em> to generate the target values, and the original, or <em>online</em> network to generate the estimations. Only the online network is trained, but we copy its weights over to the target network every so often to refresh it with more optimized parameters. Second, we use something called an <em>experience buffer</em>. 
The buffer is a dataset of our agent’s past experiences, where an experience is defined as (<em>s, a, r, t, s’</em>), in which <em>s, a,</em> and <em>r</em> maintain their previous definitions, <em>t</em> is a boolean that lets the agent know if this was the terminal state of the episode, and <em>s’</em> represents the state that followed <em>s</em> when the agent took action <em>a</em>. You’ll notice that an experience entry contains all of the variables needed to compute the loss function. So, instead of learning as the agent plays Pac-man, it’s actually just moving Pac-man around the screen based on what it has already learned, but adding all of those experiences to the buffer. Then, we can take experiences from storage and replay them to the agent so that it can learn from them and take better actions in the future. This is conceptually similar to how humans replay memories in order to learn from them, and the process is fittingly named <em>experience replay</em>. Most importantly, it ensures that the agent is learning from its entire history (or at least as much of it as we can practically store before we start overwriting the oldest data), rather than its most recent trajectory.</p><p>It’s worth noting that these architecture decisions are what classify DQN as an<em> off-policy, model-free </em>algorithm<em>. </em>Model-free because it learns to predict the value associated with a position, but doesn’t attempt to build a model of the inner workings of its environment in order to make predictions about the next state it will see, and off-policy because its training examples are generated by a past version of itself (and therefore a past version of its decision-making policy).</p><p>There’s one more important detail to get out of the way, and that’s the exploration issue. There’s a never-ending debate within Markov agents over whether it’s best to take a risk and explore new ideas or to exploit what is already known. 
Imagine what would happen if the algorithm always pressed the button that it thought would lead to the highest reward. It would likely do that same action every single time, never trying anything else and therefore never learning and improving any further. For example, it might learn on its very first attempt that going right from the start zone can get it a couple points, and then just continue to go right no matter how many ghosts or walls stand in its way; it never experimented with going left and therefore has no way to accurately estimate the reward it would generate by doing so. The original DQN solves this problem in an arbitrary but surprisingly effective way. We initialize a variable, <em>epsilon</em>, to 1.0. For every step, we generate a random number between 0 and 1. If this number is less than epsilon, we take an action completely at random, regardless of what the agent has to say about that action’s Q value. Because epsilon is 1, we do this 100% of the time.<em> </em>But as training progresses, we anneal epsilon down to around 0.1, meaning we take the best action 90% of the time and explore a new, random direction the other 10%. Interestingly, we never let epsilon hit 0 in practice (even during testing/evaluation, where we typically use .05); this makes sure that Ms. Pac-man can never get stuck in a corner or stop moving indefinitely. This method is known as <em>epsilon greedy.</em></p><p>Now that we’re getting into the experiments/code, it’s probably a good time to mention that all of these results (including weight files for all of the trained models), and the scripts that generated them can be found on <a href="https://github.com/jakegrigsby/AdvancedPacmanDQNs">this project’s GitHub repo</a>. The inner workings of the DQN algorithm are implemented in the open source library <a href="https://github.com/keras-rl/keras-rl">keras-rl</a>. 
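The epsilon greedy procedure comes down to just a few lines (a generic sketch; the linear schedule and its constants are illustrative, not keras-rl’s exact implementation):

```python
import random

def epsilon_by_step(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from `start` down to `end`."""
    fraction = min(step / anneal_steps, 1.0)
    return start + fraction * (end - start)

def select_action(q_values, step):
    """With probability epsilon act randomly; otherwise act greedily."""
    if random.random() < epsilon_by_step(step):
        return random.randrange(len(q_values))          # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

Clamping `end` above zero is what keeps Ms. Pac-man from getting stuck in a corner forever during evaluation.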
However, it may be some time before the more advanced techniques we’ve been working on and will be describing below have been approved for a merge into the master version. In the meantime, you can access them on <a href="https://github.com/jakegrigsby/keras-rl/tree/n-step-dqn">our fork</a>.</p><p>Something to know about DQNs (and RL algorithms in general) is that they are sample inefficient and expensive to train. The standard approach in the DQN literature is to run 200 million frame training sessions. That’s about 930 hours of human-speed gameplay or roughly 20 days of training on a single P4000 GPU (at least for the final DQN variant we’ll be getting to). We don’t have the resources to pull something like that off for every version of the algorithm. Instead, we chose to run these comparison experiments for 10 million frames (enough to highlight the discrepancies), and to train our final, best-performing agent for a more extended 40 million step run.</p><p>With that out of the way, it’s time to see how the algorithm we’ve described performs on the task at hand.</p><h4><strong>Vanilla DQN Takes on Ms. Pac-man</strong></h4><p>An effective way to monitor and analyze an agent’s success during training is to chart its cumulative reward at the end of each episode.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/740/0*Ou-pxlL1NSL0HZRw.jpg" /></figure><p>The reward function is proportional to (but not equal to, and actually much less than) the in-game score, so it doesn’t mean all that much on its own. However, the agent is clearly able to learn something about the game in order to consistently improve over time. 
Also notice the diminishing return on investment, just like any learning curve — human or machine.</p><p>Let’s check out some gameplay and see how it does.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FLbWXGw1sWD0%3Ffeature%3Doembed&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DLbWXGw1sWD0&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FLbWXGw1sWD0%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/8c98e21ec381ec7a0bd016ed15e232f6/href">https://medium.com/media/8c98e21ec381ec7a0bd016ed15e232f6/href</a></iframe><p>Amazingly, the agent has learned to navigate around the maze. It even does a pretty good job of clearing large areas when there are still dense pockets to be collected. However, it has a very hard time finding its way back to the lonely few dots that slip away, and it seems to have total meltdowns whenever it finds itself stuck in the middle of two pockets and has to decide which one to go after first (like at 0:29). It also does not seem to actively hunt down the ghosts for points when they become vulnerable, and only triggers that behavior by accident, on its way to gathering up all the dots in each corner of the maze.</p><h4><strong>Double Q-Learning</strong></h4><p>In 2015, <a href="https://arxiv.org/abs/1509.06461">van Hasselt et al.</a> applied <em>double q-learning </em>to DQN, a modification that improved performance on some games. Notice that - in the loss function from above - we use the target network both to calculate the q values for each action in the next state and to select which of those actions we will want to take (the highest one). 
It turns out this can lead to some overestimation problems, particularly when the discrepancy between the target network and online network causes them to recommend totally different actions given the same state (as opposed to the same action with slightly different q values). To solve this, we take away the target network’s responsibility to determine the best action; it simply generates the Q values, and we let the online model decide which to use. More formally, Double DQN generates target values according to:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/668/0*pjndvNvm1usRG9dN.jpg" /></figure><h4><strong>Dueling Architecture</strong></h4><p>An important detail of RL theory is that the Q function can actually be split up into the sum of two independent terms: the value function V(<em>s</em>), which represents the value of being in the current state, and an advantage function A(<em>s,a</em>), which represents the relative importance of each action, and helps ensure that the agent takes the best action possible, even if that choice may not have any immediate impact on the game score.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/273/0*03T316qxTB1DhMFL.jpg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/349/0*kJebWmv28J_9L7ZQ.jpg" /></figure><p>In 2016, <a href="https://arxiv.org/abs/1511.06581">Wang et al.</a> published the <em>dueling network architecture</em>, which splits the neural network into two streams that share a convolutional base, one for estimating each function.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/350/0*280wCeKlu11zvztQ.jpg" /><figcaption>Vanilla DQN architecture (top) vs. Dueling DQN (bottom). Dueling estimates the advantage and value functions separately, before combining them to arrive at the overall Q function.</figcaption></figure><p>The green arrows in the architecture schematic above stand in for the method we use to combine them.
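In code, the difference between the vanilla and double targets is small. Here is a sketch with plain Python lists standing in for the two networks’ outputs at the next state (function names are our own):

```python
def vanilla_dqn_target(reward, terminal, next_q_target, gamma=0.99):
    """Standard DQN: the target net both selects and evaluates the action."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_target)

def double_dqn_target(reward, terminal, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online net selects the action, the target net evaluates it."""
    if terminal:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

When the two networks disagree about the best action, the double target evaluates the online network’s choice rather than trusting the target network’s own (possibly inflated) maximum.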
Unfortunately, the naive solution (adding A(<em>s,a</em>) to V(<em>s</em>)) isn’t effective because the network isn’t incentivized to optimize V and A independently, and therefore we can’t guarantee that those are what it’s learning at all. Instead, we use an alternative approach:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/565/0*NVVHQoQ4vJY7hoBu.jpg" /></figure><p>Where <em>alpha</em> and <em>beta</em> represent the parameters of the advantage and value streams, respectively. Now the network can use the knowledge of its chosen action to push the advantage function to 0, so that Q(<em>s,a</em>) is approximately equal to V(<em>s</em>).</p><h4><strong>Prioritized Experience Replay</strong></h4><p>When we look back to learn from our mistakes, we optimize the process to spend time where it’s most needed. When you fail a midterm and decide you need to do better on the final, you go in and look specifically at the questions you got wrong; you don’t skim through the pages or just check the even multiples of 3. Up until now, our agent has been sampling from its replay buffer uniformly at random. In the context of Pac-man, that means our agent is being presented with frames where it was coasting down a straightaway with no enemies in sight much more often than the one time it entered a four-way intersection with ghosts on three sides, chose wrong, and paid the price. But what if it could learn from the experiences where it is the most wrong?</p><p><a href="https://arxiv.org/abs/1511.05952">Schaul et al.</a>’s 2016 paper proposes a solution, known as <em>prioritized experience replay</em> (PER). In PER, we use an additional data structure that keeps a record of the priority of each transition.
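To make the dueling aggregation above concrete, here is a toy version of the combining step (a sketch with scalars instead of tensors; subtracting the max advantage is one of the variants in the Wang et al. paper, and many implementations subtract the mean advantage instead):

```python
def dueling_q_values(value, advantages):
    """Combine the value stream V(s) and advantage stream A(s, a) into
    Q values. Subtracting the max forces the best action's advantage
    to zero, so Q(s, a*) = V(s)."""
    max_adv = max(advantages)
    return [value + (a - max_adv) for a in advantages]
```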
Experiences are then sampled proportional to their priorities:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/424/0*5mjLj8NKGVoGAcsj.jpg" /></figure><p>Where <em>alpha</em> is a hyperparameter that indicates how aggressive we want this sampling bias to be. The priorities themselves are just the agent’s temporal difference error the last time it trained on that experience. In this way, our agent is more likely to learn from its most inaccurate predictions, smoothing out its trouble spots and generally being much more sample efficient.</p><p>There is also the detail of <em>importance sampling weights</em>, which are meant to compensate for the fact that we are now intentionally feeding the network data that will cause abnormally large gradient updates.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/534/0*zQ1iSDXL0OpvRZ08.jpg" /></figure><p>These importance weights are scaled by a hyperparameter <em>beta</em>, which controls how aggressively we want to compensate. This beta value is typically annealed through the duration of the run, rising from an initialization of ~0.6 and ending at 1.0 (full compensation). Even so, PER-equipped agents tend to use lower learning rates (typically around 0.25x) to safeguard against catastrophic collapse.</p><h4><strong>Prioritized Double Dueling vs Pac-man</strong></h4><p>While each of these three additions can make a significant improvement on its own, the great thing about them is that they can coexist in the same algorithm, somewhat longwindedly referred to as a Prioritized Double Dueling DQN (PDD). Here is the learning curve, plotted against our previous version:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7_KtBE0VurnGZk6s.jpg" /></figure><p>While their performance is equally poor during the heavy exploration phase of the first few thousand episodes, PDD begins to break away at the 2k episode mark.
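Both PER formulas above reduce to a few lines (an illustrative sketch; real implementations store priorities in a sum-tree so that sampling stays fast at buffer sizes in the millions):

```python
def per_sample_probs(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, beta=0.6):
    """w_i = (1 / (N * P(i)))^beta, normalized by the max weight
    so that updates are only ever scaled down."""
    n = len(probs)
    weights = [(1.0 / (n * p)) ** beta for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]
```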
And while that gap may seem tough to distinguish, the gameplay video tells a better story:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FBiHqZvh-k8Y%3Ffeature%3Doembed&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DBiHqZvh-k8Y&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FBiHqZvh-k8Y%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/a18cafca056e29569f0325a1acd0a558/href">https://medium.com/media/a18cafca056e29569f0325a1acd0a558/href</a></iframe><p>The new additions have allowed our agent to seek out isolated dashes with clear intent — no more random wandering around the maze. It also seems to have made the association between picking up the energy pills in the corners and earning a higher reward, most likely from accidentally running into ghosts once that behavior makes them vulnerable. Interestingly, it hasn’t fully fleshed out that strategy, as it typically makes a beeline to each corner of the maze without even waiting for its invincibility to expire and without targeting the fleeing ghosts, whereas an experienced human player might save those powers for dire situations or at the very least space them out to maximize their effect.</p><h4><strong>Noisy Networks for Exploration</strong></h4><p>Earlier, we talked about the debate between exploration and exploitation, and the somewhat arbitrary solution called epsilon greedy. The problem with this approach is its reliance on a scheduled hyperparameter. How do we know how much exploration we’ll need to solve a game like Ms. Pac-man? The only real way to find out is to test a whole bunch of different values, a time-consuming and expensive process. It would be much better if the agent could learn to control its exploration the same way it learns to control the rest of its decision-making — through gradient descent.
In 2017, <a href="https://arxiv.org/abs/1706.10295">Fortunato et al.</a> released the solution that made that possible: Noisy Networks.</p><p>Because we calculate the Q-values as normal but randomize (add noise to) our action selection, the epsilon greedy approach falls under the umbrella of <em>action space noise</em> exploration methods. In a Noisy Network, we instead inject noise into the weights of the neural network itself (the <em>parameter space</em>). As a result, the agent will consistently recommend actions that it didn’t ‘intend’ to, ensuring exploration, and we can make action decisions based solely on the highest Q value output.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/0*bcy5wJ_sxws34Dt5.jpg" /><figcaption>The difference between action and parameter space noise, from a very similar technique released by OpenAI around the same time.</figcaption></figure><p>Specifically, Noisy Networks replace the Dense classifiers in the model with Noisy Dense layers, defined by the operation:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/592/0*g-QKa94JNBgm0AQ3.jpg" /></figure><p>Where all of the parameters are learnable except for the epsilon terms, which are generated using factorized Gaussian noise before each training and inference step. The agent can then learn to manipulate the other variables, particularly the two <em>sigmas</em>, to tune out the noise — but it can do so at the pace its environment demands, without any arbitrary interference from us. To illustrate this, we can chart the average sigma weight over time. Here is the result from the 10 million step training run we’ll get more into in a little bit:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qadMngwaErMq2Hy3.jpg" /></figure><p>Because the noise injection is element-wise multiplied by this term, the agent can learn to ignore the noise by setting sigma to 0.
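A toy forward pass of such a layer might look like this (a pure-Python sketch of the factorized-Gaussian scheme from the paper; the real thing is a trainable Keras layer operating on tensors, and all names here are our own):

```python
import math
import random

def f(x):
    """Factorized-noise transform from Fortunato et al.: sign(x) * sqrt(|x|)."""
    return math.copysign(math.sqrt(abs(x)), x)

def noisy_dense(x, mu_w, sigma_w, mu_b, sigma_b):
    """y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b, where the
    weight noise is factorized: eps_w[i][j] = f(e_in[i]) * f(e_out[j])."""
    n_in, n_out = len(x), len(mu_b)
    e_in = [f(random.gauss(0, 1)) for _ in range(n_in)]
    e_out = [f(random.gauss(0, 1)) for _ in range(n_out)]
    y = []
    for j in range(n_out):
        total = mu_b[j] + sigma_b[j] * e_out[j]
        for i in range(n_in):
            total += (mu_w[i][j] + sigma_w[i][j] * e_in[i] * e_out[j]) * x[i]
        y.append(total)
    return y
```

Setting every sigma to 0 recovers an ordinary dense layer, which is exactly the "tune out the noise" behavior described above.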
It’s interesting to note that this method tends to naturally bottom-out above 0 just like we designed our epsilon schedule to — ensuring at least some level of exploration for the duration of training.</p><h4><strong>N-step TD</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/350/0*NUXsENUGlFVK_QFZ.jpg" /></figure><p>One last improvement we’ll make is the introduction of a new term in our loss function, which helps the agent more closely approximate the theoretical cumulative discounted reward. Basically, this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/676/0*EljQP1n-Wp-6Ejah.jpg" /></figure><p>becomes this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/595/0*Vn_-ALkkquELhu23.jpg" /></figure><p>in an effort to emulate this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/334/0*G5gLpCEZkLT_UHU9.jpg" /></figure><p>Where <em>n</em> is a hyperparameter typically set to either 3 or 5. The idea here is to improve target accuracy by bootstrapping from sequences of transitions — a method that lies somewhere in the middle of the Temporal Difference -&gt; Monte Carlo spectrum. You can read a lot more about this in <a href="http://incompleteideas.net/book/bookdraft2018jan1.pdf">Chapter 7 of this textbook</a>.</p><h4><strong>The Final Result</strong></h4><p>Once again, the best part of these changes is that we can stack them into the same, super-charged algorithm, presumably called a NoisyNet N-step Prioritized Double Dueling Deep Q-Network, which really rolls off the tongue.
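The n-step target described above amounts to a short function (a sketch; `rewards` holds the n rewards observed after the starting state, and `bootstrap_q` stands in for the target network's max Q value at the state n steps later):

```python
def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """Sum the first n discounted rewards, then bootstrap the rest of
    the return from a Q estimate discounted by gamma^n."""
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_q
```

With n = 1 this collapses back to the one-step target we have been using; larger n trades bootstrap bias for reliance on actual observed rewards, moving toward the Monte Carlo end of the spectrum.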
When DeepMind added one more thing to this late last year, they started calling it “<a href="https://arxiv.org/abs/1710.02298">Rainbow</a>”, either because they were throwing in the towel on the whole name-stacking thing or because a bunch of PhDs sat around a conference room looking at the colors on this chart and had an unfortunate lapse in creativity:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/350/0*V0j1Hj-8XF9-Q3zS.jpg" /><figcaption>from Rainbow: Combining Improvements in Deep Reinforcement Learning</figcaption></figure><p>For the record, that additional feature is <em>distributional RL</em>, in which the agent learns to predict reward distributions for each action. We hope to return to this in the future.</p><p>Here is the learning curve for our final DQN variant, plotted against the previous iterations:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/740/0*RxfzyhgTfJzUyDrR.jpg" /></figure><p>These last few extensions go a long way towards improving both overall performance and sample efficiency — with a wire-to-wire victory over both previous versions. One final reminder: 10 million steps is not enough to say anything definitive about overall performance when it comes to DQNs. However, all of the papers we’ve been referencing/implementing show similar results extrapolated over much larger time frames, and our shorter experiments show that the hierarchy holds true even on more manageable training runs.</p><p>We left NoisyNstepPDD running for a total of 40 million steps.
Now let’s see how it plays.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FoindbCWg2Xw%3Ffeature%3Doembed&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DoindbCWg2Xw&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FoindbCWg2Xw%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/f0efb6fb6f8f9db10480038c125a03a4/href">https://medium.com/media/f0efb6fb6f8f9db10480038c125a03a4/href</a></iframe><p>We’re now consistently clearing the first level, and pulling off tight maneuvers between ghosts while navigating towards left-over dashes with a purpose and making good use of the shortcut exits on each side of the screen. Its most common cause of death is running into invisible ghosts, a ‘glitch’ that is caused by the Atari’s strict sprite-render limits.</p><h4><strong>What Does DQN ‘See’?</strong></h4><p>Atari-playing DQNs make sense of the screen with the help of a <em>convolutional neural network</em> (CNN) — a series of convolutional layers that learn to detect relevant game information from the mess of pixel values. Unfortunately, we don’t give it the chance to learn directly from the game screen — compute resources demand some shortcuts. Instead, we pre-process the raw input by downsizing from the original size to a more manageable 84 x 84 and converting to grayscale.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9T2Q756DSjaTUEx5xkUP9w.png" /><figcaption>What we would see (left) vs. what our agent sees (right)</figcaption></figure><p>Of course, this does mean we are throwing away color information that could be helpful. For example, each colored ghost in Pac-man has its own AI, and expert players can use those differences to forecast where they will go, but it’s unlikely our agent can tell them apart at all.
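The grayscale-and-downsize step amounts to something like this (a pure-Python sketch for illustration; real pipelines use an image library with bilinear resizing, and Atari frames arrive as 210 x 160 RGB arrays):

```python
def to_grayscale(frame):
    """Standard luminance-weighted grayscale for an RGB frame
    (a height x width grid of (r, g, b) tuples)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]

def resize_nearest(frame, out_h=84, out_w=84):
    """Nearest-neighbour downsizing to the 84 x 84 input the CNN expects."""
    in_h, in_w = len(frame), len(frame[0])
    return [[frame[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]
```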
Converting to grayscale chops off two of the channels an RGB image would have, significantly reducing the load on our CNN. Also note that the resize operation even manages to delete the location of the dots to the left and right of the ghosts’ start box; luckily, these are typically picked up by chance as Ms. Pac-man moves to either edge of the maze. These preprocessing steps are a pretty standard practice across the Atari RL domain, so we decided to continue them here, even if they aren’t optimized for this specific game.</p><p>We fill the now-empty third dimension with a stack of the 4 most recent frames; this lets the agent get an understanding of the speed and direction of each sprite.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dQ9wRUl31hxpqrDjOIOgKg.jpeg" /><figcaption>The stack of frames that makes up one complete input.</figcaption></figure><p>In the example above, the agent would know that it is rounding a corner counter-clockwise and that a fruit has recently appeared and is moving away from it. A stack of frames also adds some protection against the sprite-flicker effect, assuming each ghost is rendered at least once every 4 frames (which doesn’t always seem to be the case).</p><p>All of our implementations use DeepMind’s CNN architecture from 2015, consisting of three layers. The first learns to recognize 32 low-level features. Conceptually, we might expect one feature channel to track the remaining dots, one to track the ghosts, another to track the player, etc. By accessing the layer’s output and laying its activations over the most recent frame in the input stack, we can get a sense of what they might actually be learning. Probably the most surprising thing we found was our agent’s ghost and player-tracking system; rather than using one channel to keep a constant watch on the ghosts, and another to get Ms.
Pac-man’s position in the maze, our agent has learned a radar-by-committee approach, in which a channel might track one ghost at a time, and even switch ghosts depending on the frame.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Rn8mPATGPdXpFjGGXEnV0A.jpeg" /><figcaption>Ghost-busting by committee: the agent uses an ensemble of feature channels to determine the position of the game’s sprites. Bright white spots indicate areas of high activation. The maze background is added for context.</figcaption></figure><p>In this example, channels 13 and 16 seem to track individual ghosts, while channel 22 tracks the player and channel 1 sees multiple sprites at once. The system is far from perfect, as there doesn’t seem to be a feature map that picks up the ghost near the center start zone. That ghost will likely be identified in a future frame, and a good portion of the agent’s success on any given episode seems to be connected to how long these ‘blind spots’ last.</p><p>There are other channels that pick up very specific features, even if it’s hard for humans to conceptualize what those features might be:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Yd4tV6PxKVQ8Ga0ckyPAOA.jpeg" /><figcaption>The agent has learned a number of well-defined features that are difficult to conceptualize and vary by frame. In this case, 29 and 0 might both be looking at the player; 28 and 11 are anyone’s guess.</figcaption></figure><p>Even less helpful are the channels of total noise, of which there are still a disappointing number after 40m steps. 
This would likely clear up as training continued, with noisy channels narrowing to look more like 11 and 28 from above, while 11 and 28 might develop into more useful representations themselves.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5ZAqaMy4S-kH-C2qC9qi4Q.jpeg" /><figcaption>High-noise activations</figcaption></figure><p>The great thing about CNNs is that their features get more and more abstract as you get deeper into the network. We can use gradient-ascent on the outputs of the third and final convolutional layer to generate <em>class activation maps</em>, indicating which parts of the image most contributed to the agent’s decision. This technique has some similarities to tracking the eye movement/attention of a human playing a video game or watching a movie. The results reveal a decision-making process that a human player might recognize.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*484TqfMVP_lz6D5nafMa-g.png" /><figcaption>The class activation map for the agent’s decision.</figcaption></figure><p>Here, the agent, located at (20, 30), is coming up on an intersection and needs to decide where to go next. The heatmap indicates that the agent’s attention is focused on the left side of the screen, and is clustered pretty tightly around the path of dots that remain. It’s also noticing the ghosts at (25,40) and in the upper-center as well as the smaller dot cluster on the far right side. It ends up deciding to go right and up, which would be the equivalent of pushing the Atari joystick diagonally between those two directions. Interestingly, our agent seems to prefer these in-between inputs to the four standard directions, probably because the walls of the maze typically make one of those directions invalid, so it ends up being more efficient to try both at once. It could also be that this allows Ms. 
Pac-man to round tight corners without frame-perfect inputs (the game forces her right until the first frame there isn’t a wall overhead, then usually turns her up), a useful tactic for an agent that only gets to make a decision once every 4 frames.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hCNn3Dtdwe6bZHsWTQo-fg.png" /></figure><p>In this example, from the end of the first level, we get some empirical proof of what we suspected when we first added PER — that the agent is capable of seeking out lone dots. It knows it’s safe from the three ghosts to the left and is only focused on the last remaining dot.</p><p>While their decision-making strategies may be similar on the surface, it’s important to remember that DQN doesn’t process information like a human. It learns to recognize certain groups of pixels as being ‘good’ or ‘not as good’. It can’t avoid ghosts, because it doesn’t even know what a ghost is; it learns to avoid clusters of numbers in a matrix that are similar to clusters of numbers that gave it a negative reward in the past. Where a human can lean on past experience and outside concepts (‘ghosts are bad, I should run’), DQN has nothing. <a href="https://arxiv.org/abs/1802.10217">Recent research</a> has shown just how much of an advantage this can be, and how it’s probably a big contributor to the slow learning progress of RL algorithms. Borrowing a diagram from that paper, imagine how difficult it would be to play a game if it looked like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*r1uumHtsuKeFaitHDhpG0A.png" /></figure><p>What’s going on? What am I controlling? What’s the goal? The only way to find out is to try something and see what happens. Welcome to the world your computer sees.</p><h4><strong>Where Does it Go From Here?</strong></h4><p>As great as DQN is, it has room to improve and many directions to branch out in. For one, it learns extremely slowly. 
There has been a lot of work on jumpstarting RL algorithms, using techniques from imitation and few-shot learning. Intuition says there’s also room for optimization in the memory mechanism; we’re already sampling based on importance, but we could be using similar metrics to decide which experiences to save and which ones to overwrite. The convolutional architecture that is widely used and that we held constant in this post has been mostly stagnant since its release in 2015; would recent advancements in the field of computer vision boost our agent’s ability to comprehend its environment? In addition, DQN has to learn each new game from scratch; what if it could take its knowledge from different-but-similar tasks, like Wizard of Wor, another maze-based Atari game, and transfer some of it over to Pac-man? That would take us from replicating human performance to replicating the human learning process — using a lifetime of continuous learning to tackle brand-new challenges.</p><p><em>— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —</em></p><p><em>Written by Jake Grigsby with Zabih Yousuf</em></p><p><em>Cavalier Machine Learning, University of Virginia</em></p><p><a href="http://www.cavml.com">www.cavml.com</a></p><p><em>June 2018</em></p><hr><p><a href="https://medium.com/data-science/advanced-dqns-playing-pac-man-with-deep-reinforcement-learning-3ffbd99e0814">Advanced DQNs: Playing Pac-man with Deep Reinforcement Learning</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>