<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Thomas Wolf on Medium]]></title>
        <description><![CDATA[Stories by Thomas Wolf on Medium]]></description>
        <link>https://medium.com/@Thomwolf?source=rss-167597463903------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*kQiyB3RRV3BUAq-EO1ur8w.jpeg</url>
            <title>Stories by Thomas Wolf on Medium</title>
            <link>https://medium.com/@Thomwolf?source=rss-167597463903------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 04 Apr 2026 05:22:21 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@Thomwolf/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[From TensorFlow to PyTorch]]></title>
            <link>https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/265f40ef2a28</guid>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[tensorflow]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Fri, 09 Aug 2019 13:05:31 GMT</pubDate>
            <atom:updated>2020-09-02T14:19:38.047Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KfD4ZOtBktmZcbBNNa6l4g.png" /><figcaption>By <a href="https://unsplash.com/@omairk">Omair Khan</a></figcaption></figure><p>Friends and users of our open-source tools are often surprised by how fast 🚀 we reimplement the latest SOTA pre-trained TensorFlow models to make them accessible for everyone in our libraries like <a href="https://github.com/huggingface/pytorch-transformers">PyTorch-Transformers</a> 👾 or <a href="https://github.com/huggingface/pytorch-pretrained-BigGAN">PyTorch-pretrained-BigGAN</a> 🦋</p><p>In this post, you’ll <strong>learn the main recipe to convert a pretrained TensorFlow model into a pretrained PyTorch model</strong>, in just a few hours.</p><p>We’ll take the example of a simple architecture like <a href="https://github.com/openai/gpt-2">OpenAI GPT-2</a> 🦄</p><blockquote>Doing such a conversion assumes a good familiarity with both TensorFlow and PyTorch, but it’s also one of the best ways to get to know both frameworks better!</blockquote><h3>Looking at the scope structure 🔎</h3><p>The first step is to retrieve the TensorFlow code and a pretrained checkpoint. Let’s get them from the OpenAI GPT-2 <a href="https://github.com/openai/gpt-2">official repository</a>:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a115aa6c9dcff38ed5e37376922a0bed/href">https://medium.com/media/a115aa6c9dcff38ed5e37376922a0bed/href</a></iframe><p>TensorFlow checkpoints are usually composed of three files named XXX.ckpt.data-YYY, XXX.ckpt.index, and XXX.ckpt.meta:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d0qkYPfPm4mJ7MvI1QvIsw.png" /></figure><blockquote>A trained NLP model should also be provided with a vocabulary that maps the tokens to the embedding indices (here encoder.json and vocab.bpe). 
We won’t go into much detail about the vocabulary and tokenizer here, since you can usually reuse their original Python code with minor modifications.</blockquote><p>First, we can have a look at the hyper-parameters file: hparams.json. It contains a few hyper-parameters like the number of layers/heads and so on:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_us0guZALAeGzxUVrn4GLw.png" /><figcaption>We can reuse this JSON file in a configuration class for our model.</figcaption></figure><p>Now, let’s have a look at the structure of the model. From now on, you’ll need to have <a href="https://www.tensorflow.org/install/">TensorFlow installed</a> on your computer (the CPU version is fine). Once TensorFlow is set up, open a Python interpreter and load the checkpoint to inspect the saved variables:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ad038dfd38f57856a4408bac4c2346ce/href">https://medium.com/media/ad038dfd38f57856a4408bac4c2346ce/href</a></iframe><p>The result is a (long) list of all the variables stored in the checkpoint with their names and shapes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eq8EAGKXnz3tFfV0zeKyKA.png" /></figure><p>Variables are stored as NumPy arrays that you can load with tf.train.load_variable(name).</p><p>Now, what we are particularly interested in here are the path-like names of the variables, like model/h0/ln_1/b, which reflect the organization of TensorFlow variables in <a href="https://www.tensorflow.org/guide/variables#sharing_variables">scopes</a>.</p><p><strong>Here is our first secret:</strong></p><blockquote>To build our PyTorch model as fast as possible, we will <strong>reuse exactly the same organization</strong>: for each sub-scope in the TensorFlow model, we’ll create a sub-class under the same name in PyTorch.</blockquote><p>This will let us <strong>load weights</strong> easily by jointly iterating on 
scopes &amp; classes.</p><p>As you can see, GPT-2 has three modules at the root of the model (at the end of the list): model/wte, model/wpe and model/ln_f, and the rest of the model is composed of a series of identical modules hXX, each comprising a self-attention sub-module attn, a feed-forward module mlp, and two layer-normalization modules ln_1 and ln_2.</p><p>Now that we know how the model is organized, let’s build our PyTorch model with a hierarchy that reproduces this organization of scopes.</p><h3>Building the PyTorch model skeleton 👩‍🎨</h3><p>It’s time to have a look at the TensorFlow code itself. We’ll start with the <a href="https://github.com/openai/gpt-2/blob/master/src/model.py#L147-L174">code for the main model</a> and reproduce the general organization in our PyTorch main model class:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6cdae8fad8b4cb5fb921355a05c65231/href">https://medium.com/media/6cdae8fad8b4cb5fb921355a05c65231/href</a></iframe><p>As you can see, we’ve given our main sub-modules names (wte, wpe, h, ln_f) that are identical to the first-level scopes of the variables we saw in the TensorFlow checkpoint.</p><p>We can also write the code for our forward pass by converting the <a href="https://github.com/openai/gpt-2/blob/master/src/model.py#L147-L174">code for the main model</a> from TensorFlow operations to PyTorch operations:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/89844e3a9294b6a2289ae9b683f71056/href">https://medium.com/media/89844e3a9294b6a2289ae9b683f71056/href</a></iframe><p>Now we dive deeper into the hierarchy, continuing to build our PyTorch model by adapting the rest of the TensorFlow code. 
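Because each sub-module keeps its TensorFlow scope name, a checkpoint variable path can later be resolved to a PyTorch attribute almost mechanically. Here is a toy, dependency-free sketch of that idea (the leaf-name translations b → bias and g/w → weight are illustrative assumptions, not the post’s actual loading code):

```python
import re
from types import SimpleNamespace

# Hypothetical leaf-name translations; the real mapping may differ.
LEAF_NAMES = {"b": "bias", "g": "weight", "w": "weight"}

def resolve(root, tf_name):
    """Walk a TF variable path like 'model/h0/ln_1/b' down a mirrored module tree."""
    obj = root
    for part in tf_name.split("/")[1:]:        # skip the root 'model' scope
        block = re.fullmatch(r"h(\d+)", part)  # 'h0', 'h1', ... index the block list
        if block:
            obj = obj.h[int(block.group(1))]
        else:
            obj = getattr(obj, LEAF_NAMES.get(part, part))
    return obj

# Tiny stand-in for the real model, mirroring the checkpoint scopes
ln_1 = SimpleNamespace(bias="ln_1/b", weight="ln_1/g")
model = SimpleNamespace(h=[SimpleNamespace(ln_1=ln_1)])
assert resolve(model, "model/h0/ln_1/b") == "ln_1/b"
```

This is only meant to illustrate why mirroring the scope hierarchy makes weight loading straightforward.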
Here is another example comparing the TensorFlow code for a “Block” module:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/763e6e6e7901d9c0cd69a2e031dd1f5c/href">https://medium.com/media/763e6e6e7901d9c0cd69a2e031dd1f5c/href</a></iframe><p>To the PyTorch equivalent nn.Module class:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/951393d1e27c986e99bd392b00ef4d81/href">https://medium.com/media/951393d1e27c986e99bd392b00ef4d81/href</a></iframe><p>Here again, the names of the class attributes containing the sub-modules (ln_1, ln_2, attn, mlp) are identical to the associated TensorFlow scope names that we saw in the checkpoint list above. Doing that ensures that the PyTorch hierarchical attribute structure will be identical to the TF scope structure.</p><h3>Beware of the details — section I 🕵️</h3><h4>The computation flow</h4><p>When you convert TensorFlow code to PyTorch code, you have to take care to <strong>reproduce the exact computation workflow</strong> of the TensorFlow model in PyTorch. For instance, you should take care of reimplementing all the operations, even the ones not associated with a Variable (i.e. not visible in the checkpoint), add the dropout modules at the same places as in the original, and carefully check how to convert each TensorFlow method into an equivalent PyTorch operation.</p><blockquote>It’s a good opportunity to <strong>dive into the internals of both frameworks</strong> to see how each operation is made under the hood. One example: TensorFlow &amp; PyTorch layer normalizations are slightly different from each other (go check them out!) so I usually reimplement layer normalization from scratch in PyTorch.</blockquote><h4>The initialization and defaults</h4><p>It’s also important to check the default parameters of each module, like epsilons, and make sure you are using the same ones in PyTorch as in TensorFlow. 
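To make the epsilon point concrete, here is a toy layer normalization in plain Python: with near-constant activations the result is dominated by epsilon, so two implementations silently using different default values (the values below are illustrative) can diverge badly:

```python
import math

def layer_norm(xs, eps):
    """Toy 1-D layer normalization; the eps values used below are illustrative."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return [(x - mu) / math.sqrt(var + eps) for x in xs]

x = [1.0, 1.0001, 0.9999]            # near-constant activations
a = layer_norm(x, eps=1e-5)
b = layer_norm(x, eps=1e-12)
print(abs(a[1] - b[1]))              # the two outputs diverge badly
```

Passing the TensorFlow model’s epsilon explicitly to your PyTorch module avoids this silent mismatch.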
Be especially careful about default values that may not be visible.</p><h3>Loading the weights 🏋️</h3><p>Once the code conversion step is finished and you can run a forward pass on dummy input without any errors with your newly defined PyTorch model, it’s time to load the TensorFlow weights into the newly created model 🐣</p><p>Having the same organization in both models makes loading very easy:</p><blockquote>We just jointly iterate on both the path-like names of TensorFlow variables &amp; our PyTorch model attributes.</blockquote><p>A commented <a href="https://github.com/huggingface/pytorch-transformers/blob/f2b300df6bd46ad16580f0313bc4b30ddde8515d/pytorch_transformers/modeling_gpt2.py#L45-L96">loading function</a> for GPT-2 looks like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ffdcab638e39bbe583c5ec1ba5193856/href">https://medium.com/media/ffdcab638e39bbe583c5ec1ba5193856/href</a></iframe><p>Let’s talk about a few things to keep in mind at this stage 👇</p><h3>Beware of the details — section II 🕵️</h3><h4>Transposing tensors from TensorFlow to PyTorch</h4><p>Some TensorFlow operations operate on weights that are transposed with regard to their PyTorch counterparts (or vice versa 😉). 
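Concretely, a TF dense layer stores its kernel with shape (in_features, out_features) and computes x @ kernel, while PyTorch’s nn.Linear stores its weight with shape (out_features, in_features) and computes x @ weight.T. A dependency-free sketch of the equivalence, with plain lists standing in for tensors:

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

kernel = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # TF dense kernel: (in=3, out=2)
x = [[1.0, 1.0, 1.0]]

tf_out = matmul(x, kernel)                      # TF dense: x @ kernel
pt_weight = transpose(kernel)                   # transpose once, when loading
pt_out = matmul(x, transpose(pt_weight))        # nn.Linear: x @ weight.T
assert tf_out == pt_out == [[9.0, 12.0]]
```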
In this case, your weight-loading method should take care of transposing the weights when loading them.</p><p>The main cases where this happens in practice are Keras modules like <a href="https://www.tensorflow.org/api_docs/python/tf/layers/dense">tf.layers.dense</a>, whose kernel is the transpose of PyTorch’s <a href="https://pytorch.org/docs/stable/nn.html#torch.nn.Linear">nn.Linear</a> weight.</p><p>This transposition issue can be especially tricky to detect for <strong>square</strong> matrices, which brings us to our last section 👇</p><h3>The final step — comparing the models 👭</h3><h4>Comparing hidden-states 🎼</h4><p>Now that your model runs and all the weights are initialized with their TensorFlow counterparts, it is time for the most important operation:</p><blockquote>a careful comparison of both models!</blockquote><p>The way I usually do it is by starting from a script running the TensorFlow model provided by the authors of the original implementation and:</p><ul><li><strong>modifying the TensorFlow model</strong> to output the hidden-states at regular locations along the depth of the model,</li><li><strong>modifying our PyTorch model</strong> to output the hidden-states at the same regular locations along the depth of the model,</li><li>loading the PyTorch model <strong>in parallel</strong> with the TensorFlow model and running them on the same inputs,</li><li><strong>comparing their behaviors</strong> during a forward pass to detect where an error may have been made.</li></ul><p>You should take care of deactivating the dropout modules and <strong>all nondeterministic</strong> modules to ensure maximal compatibility.</p><blockquote>If your script is a fine-tuning script and your model contains weights that are <strong>newly initialized</strong>, you should take care of fully initializing the PyTorch model from the newly initialized TensorFlow model for good comparison. 
<a href="https://github.com/thomwolf/xlnet/blob/master/run_classifier_gpu.py#L758-L767">Here</a> is an example of this process during the reimplementation of XLNet in <a href="https://github.com/huggingface/pytorch-transformers">pytorch-transformers</a>, where the new TensorFlow model is saved and then loaded in PyTorch.</blockquote><p>I usually compare the <strong>max absolute difference</strong> between the hidden-states after each layer of the models on a few real-life inputs:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3d9b2a6d1f493f1d68eeefa085d185da/href">https://medium.com/media/3d9b2a6d1f493f1d68eeefa085d185da/href</a></iframe><h4>Comparing on a down-stream task 🚣</h4><p>If your model is a pretrained model that can be <strong>fine-tuned</strong> on a down-stream task, you can further confirm the accuracy of the conversion by <strong>reproducing some results on a downstream task</strong>.</p><blockquote>This step can be quite long, as you will need to reproduce the pre-processing, optimization, and post-processing of the original authors’ work.</blockquote><p>In our experience, a discrepancy at this stage, in pretty much every case, doesn’t come from a difference inside the models but from a discrepancy in the way the inputs are prepared, in the optimization parameters (one of the most often overlooked ones being the <strong>batch size</strong>), or in the post-processing and evaluation metrics.</p><h3>That’s all folks 👭</h3><p>We’ve seen the main steps you can take to quickly and accurately reimplement a pretrained TensorFlow model in PyTorch.</p><p>This method has a few limits:</p><ul><li>the model may end up having a <strong>deeper hierarchy</strong> than necessary. 
In this case, you can rewrite the model to reduce the number of classes and use a mapping between the TensorFlow variables and the PyTorch attributes 🗺</li><li>the model is sometimes implemented with <strong>operations that are fast in TensorFlow</strong> or on TPUs (e.g. multiplication with one-hot matrices) but may be suboptimal in PyTorch. Here again, some rewriting and conversion afterward can help speed up the resulting model in some cases 🏎</li><li>you need <strong>access to the TensorFlow code</strong> for the conversion. It’s possible to convert a TensorFlow model without access to the code, e.g. a model only available on TensorFlow Hub, but it’s a far more difficult process. In <a href="https://github.com/huggingface/pytorch-pretrained-BigGAN">PyTorch-pretrained-BigGAN</a>, we did that by inspecting the raw computation graph and guessing the high-level operations involved 🙃</li></ul><p>👾 For detailed code examples of this process, you can have a look at the various models implemented in <a href="https://github.com/huggingface/pytorch-transformers">PyTorch-Transformers</a>.</p><p>… and if you feel like adding one of your own, we will probably be more than happy to <strong>welcome a Pull Request</strong> on the repository! 
Just ping us beforehand to be sure we are not already working on it 😉</p><hr><p><a href="https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28">🌓 From TensorFlow to PyTorch</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to build a State-of-the-Art Conversational AI with Transfer-Learning]]></title>
            <link>https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/2d818ac26313</guid>
            <category><![CDATA[chatbots]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Thu, 09 May 2019 12:18:36 GMT</pubDate>
            <atom:updated>2020-09-02T14:16:33.566Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pVvFPi5RKjhua6vjv8VkYQ.jpeg" /><figcaption>By <a href="https://unsplash.com/@mahiruysal">Mahir Uysal</a></figcaption></figure><h3>🦄 How to build a State-of-the-Art Conversational AI with Transfer Learning</h3><p>A few years ago, creating a chatbot (as limited as chatbots were back then) could take months 🗓, from designing the rules to actually writing thousands of answers to cover some of the conversation topics.</p><p>With the recent progress in deep-learning for NLP, we can now get rid of this tedious work and build much more powerful conversational AI 🌟 in just a matter of hours 🍃, as you will see in this tutorial.</p><blockquote>We’ve set up a demo running the pretrained model we’ll build together in this tutorial at <a href="https://convai.huggingface.co/">convai.huggingface.co</a>. Be sure to check it out! 🎮</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Fn0bcNyyEyqpGq-nCPyoYw.gif" /><figcaption>Online demo of the pretrained model we’ll build in this tutorial at <a href="https://convai.huggingface.co/">convai.huggingface.co</a>. 
The “suggestions” (bottom) are also powered by the model putting itself in the shoes of the user.</figcaption></figure><p>Here is what we will learn and play with today:</p><ul><li>How you can use <strong>Transfer Learning</strong> to build a <strong>State-of-the-Art dialog agent</strong> based on <strong>OpenAI GPT</strong> and <strong>GPT-2</strong> Transformer language models,</li><li>How you can <strong>reproduce</strong> the model we used in the NeurIPS 2018 dialog competition <a href="http://convai.io">ConvAI2</a>, which <strong>won the automatic metrics track</strong>,</li><li>How we distilled 3k+ lines of competition code into less than <strong>250 lines of commented training code</strong> (with distributed &amp; FP16 options!), and</li><li>How you can train this model for <strong>less than $20</strong> on a cloud instance, or just use our open-sourced <strong>pre-trained model.</strong></li></ul><blockquote>Together with this post, we released a clean and commented code base with a pretrained model! 
Check the <a href="https://github.com/huggingface/transfer-learning-conv-ai">GitHub repo here</a> ✈️</blockquote><p>The story of this post began a few months ago in Montreal 🇨🇦, where <a href="http://huggingface.co">Hugging Face</a> finished 1st 🏆 in the automatic track of the Conversational Intelligence Challenge 2 (<a href="http://convai.io">ConvAI2</a>), a dialog competition at <a href="https://nips.cc/Conferences/2018">NeurIPS 2018</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U6_HGnynoAxO3Y0dTm86MA.png" /></figure><p>Our secret sauce was a <strong>large-scale pre-trained language model, </strong><a href="https://openai.com/blog/language-unsupervised/"><strong>OpenAI GPT</strong></a><strong>,</strong> combined with a <a href="http://arxiv.org/abs/1901.08149"><strong>Transfer Learning fine-tuning technique</strong></a>.</p><p>With the fast pace of the competition, we ended up with over 3k lines of code exploring many training and architectural variants.</p><p>Clearly, publishing such raw code would not have been fair.</p><p>In the meantime, we had started to build and open-source a repository of transfer learning models called <a href="https://github.com/huggingface/pytorch-pretrained-BERT">pytorch-pretrained-BERT</a>, which ended up being downloaded more than 150,000 times and offered implementations of large-scale language models like OpenAI GPT and its successor GPT-2 🦄</p><p>A few weeks ago, I decided to refactor our competition code into a clean and commented code base built on top of <a href="https://github.com/huggingface/pytorch-pretrained-BERT">pytorch-pretrained-BERT</a> and to write a detailed blog post explaining our approach and code.</p><p>So here we are, let’s dive in 🚀</p><h3>An AI with a personality 🤠</h3><p>We’ll build a <strong>conversational AI </strong>with a <strong>persona</strong>.</p><p>Our dialog agent will have a <em>knowledge base</em> to store a few sentences describing who it is (<em>persona</em>) and a 
<em>dialog</em> <em>history.</em> When a <em>new utterance</em> is received from a user, the agent combines the content of this knowledge base with the newly received utterance to generate <em>a reply</em>.</p><p>Here is the general scheme:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sIRX4M3--Qrvo-RLqH-7GQ.png" /></figure><p>When we train deep-learning based dialog agents end-to-end, we face a major issue:</p><blockquote>Dialog datasets are small and it’s hard to learn enough about language and common-sense from them to be able to generate fluent and relevant responses.</blockquote><p>Some approaches try to solve this by filtering the output of the model to improve the quality using smart beam search. Here we’ll take another path that has gathered tremendous interest over recent months: <strong>Transfer Learning.</strong></p><p>The idea behind this approach is quite simple:</p><ul><li>start by <strong>pretraining</strong> a language model on a very large <strong>corpus</strong> of text to be able to generate long stretches of contiguous coherent text,</li><li><strong>fine-tune</strong> this language model to adapt it to our end-task: <strong>dialog</strong>.</li></ul><p>Pretraining a language model is an expensive operation, so it’s usually better to start from a model that has already been pretrained and open-sourced.</p><blockquote>What would be a good pretrained model for our purpose?</blockquote><p><em>The</em> <em>bigger the better</em>, but we also need a model that <em>can generate text</em>. The most commonly used pretrained NLP model, <a href="http://arxiv.org/abs/1810.04805">BERT</a>, is pretrained on full sentences only and is not able to complete unfinished sentences. 
Two other models, open-sourced by OpenAI, are more interesting for our use-case: <strong>GPT</strong> &amp; <strong>GPT-2</strong>.</p><p>Let’s have a quick look at them 🔎</p><h3>🦄 OpenAI GPT and GPT-2 models</h3><p>In 2018 and 2019, Alec Radford, Jeffrey Wu and their co-workers at OpenAI open-sourced two language models trained on a very large amount of data: <strong>GPT</strong> and <strong>GPT-2</strong> (where <em>GPT</em> stands for <em>Generative Pretrained Transformer</em>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YmND0Qj8O6b35J1yU_CPKQ.png" /><figcaption>A decoder/causal Transformer attends to the left context to generate next words</figcaption></figure><p>GPT and GPT-2 are two very similar <a href="http://arxiv.org/abs/1706.03762">Transformer</a>-based language models. These models are called <em>decoder</em> or <em>causal</em> models, which means that they use the left context to predict the next word (see left figure).</p><p>Many papers and blog posts describe Transformer models and how they use attention mechanisms to process sequential inputs, so I won’t spend time presenting them in detail. A few pointers if you are not familiar with these models: <a href="https://people.cs.umass.edu/~strubell/doc/lisa-final.key">Emma Strubell’s EMNLP slides</a> are my personal favorite, and Jay Alammar’s “<a href="https://jalammar.github.io/illustrated-transformer/">Illustrated Transformer</a>” is a very detailed introduction.</p><p>For our purpose, a language model will just be a model that takes as input a sequence of tokens and <strong>generates a probability distribution over the vocabulary for the next token following the input sequence</strong>. 
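As a minimal, framework-free illustration (toy vocabulary and made-up logits, not the actual GPT pipeline): a softmax over the logits the model produces for the next position yields that distribution over the vocabulary:

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

vocab = ["the", "cat", "sat"]             # toy vocabulary
logits = [2.0, 0.5, 0.1]                  # made-up model outputs for the next token
probs = softmax(logits)                   # a proper distribution: sums to 1
next_token = vocab[probs.index(max(probs))]  # greedy decoding picks the arg-max
print(next_token)
```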
Language models are usually trained in a parallel fashion, as illustrated in the figure above, by predicting the token following each token in a long input sequence.</p><p>Pretraining these models on a large corpus is a costly operation, so we’ll start from a model and tokenizer pretrained by OpenAI. The tokenizer will take care of splitting an input string into tokens (words/sub-words) and converting these tokens into the correct numerical indices of the model vocabulary.</p><p>In <a href="https://github.com/huggingface/pytorch-pretrained-BERT">pytorch-pretrained-BERT</a>, the OpenAI GPT model and its tokenizer can be easily created and loaded from the pretrained checkpoint like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/bfdfa85e951b2d80d66817db24fda451/href">https://medium.com/media/bfdfa85e951b2d80d66817db24fda451/href</a></iframe><p>You probably noticed we’ve loaded a model called <em>OpenAI GPT Double Heads Model</em>, which sounds a bit more complex than the language model we’ve just talked about, and you’re right!</p><p>This is because we need to adapt our model to <em>dialog</em>. 
Let’s see how this goes!</p><h3>👻 Adapting a language model to a dialog task</h3><p>Our language model is trained with a <strong>single</strong> input: a sequence of words<em>.</em></p><p>But as we saw earlier, in a dialog setting, our model will have to use <strong>several types</strong> of <em>contexts</em> to generate an output sequence:</p><ul><li>one or several <em>persona sentences</em>,</li><li>the <em>history of the dialog</em> with at least the last utterance from the user,</li><li>the <em>tokens of the output sequence</em> that have already been generated since we generate the output sequence word by word.</li></ul><blockquote>How can we build an input for our model from these various contexts?</blockquote><p>A simple answer is just to <strong>concatenate</strong> the context segments in a single sequence, putting the reply at the end. We can then <strong>generate a completion</strong> of the reply token by token by continuing the sequence:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RWEUB0ViLTdMjIQd61_WIg.png" /><figcaption>Input sequence: a concatenation of persona (blue), history (pink) and reply (green) with delimiters (light pink). Here we generate the word “you” to complete the reply.</figcaption></figure><p>There are two issues with this simple setup:</p><ul><li><em>Our transformer is color-blind!</em> The delimiter tokens only give it a weak idea of which segment each word belongs to. 
For example, the word “NYC” is indicated in blue (persona) in our illustration, but our model will have a hard time extracting this information from the delimiters alone: <strong>we should add more information about the segments.</strong></li><li><em>Our transformer is position-blind!</em> Attention is a symmetrical dot-product, so <strong>we should add position information for each token.</strong></li></ul><p>An easy way to add this information is to build three parallel input sequences for <em>word</em>, <em>position, </em>and <em>segments, </em>and fuse them into a single sequence, summing three types of embeddings: <em>word</em>, <em>position, </em>and <em>segment</em> embeddings:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*r7vi6tho6sfpVx-ZQLPDUA.png" /><figcaption>Summing three types of input embeddings indicating words (grey), position (gradient) and segments (blue/pink/green)</figcaption></figure><p>How do we implement this?</p><p>First, we’ll add <em>special tokens </em>to our vocabulary for delimiters and segment indicators. These tokens were not part of our model’s pretraining, so we will need to <strong>create</strong> and <strong>train new embeddings</strong> for them.</p><p>Adding special tokens and new embeddings to the vocabulary/model is quite simple with <a href="https://github.com/huggingface/pytorch-pretrained-BERT">pytorch-pretrained-BERT</a> classes. 
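Under the hood, the bookkeeping is just new vocabulary entries plus matching, freshly initialized embedding rows. A toy plain-Python sketch of that idea (this is not the library’s actual API, and the token strings below are illustrative):

```python
import random

SPECIAL_TOKENS = ["<bos>", "<eos>", "<speaker1>", "<speaker2>", "<pad>"]  # illustrative names

vocab = {"hello": 0, "world": 1}              # toy pretrained vocabulary
dim = 4
embeddings = [[0.0] * dim for _ in vocab]     # stand-in rows for pretrained embeddings

for tok in SPECIAL_TOKENS:
    vocab[tok] = len(vocab)                   # each special token gets the next free index
    # new, randomly initialized row that will be trained during fine-tuning
    embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])

assert len(vocab) == len(embeddings) == 7
```

The library calls hide exactly this: grow the tokenizer’s vocabulary and resize the model’s embedding matrix so indices stay aligned.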
Let’s add five special tokens to our tokenizer’s vocabulary and model’s embeddings:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/54fb710ccd3651e9605d41899a4be0bb/href">https://medium.com/media/54fb710ccd3651e9605d41899a4be0bb/href</a></iframe><blockquote>These special-tokens methods respectively add our five special tokens to the vocabulary of the tokenizer and create five additional embeddings in the model.</blockquote><p>Now we have all we need to build our input sequence from the <em>persona</em>, <em>history, </em>and <em>beginning of reply </em>contexts. Here is a simple example:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/eea69917fbab7fcc7323a3c717f4fa09/href">https://medium.com/media/eea69917fbab7fcc7323a3c717f4fa09/href</a></iframe><h3>👑 Multi-task losses</h3><p>We have now initialized our pretrained model and built our training inputs; all that remains is to choose a <strong>loss</strong> to <strong>optimize</strong> during the fine-tuning.</p><blockquote>We will use a multi-task loss combining language modeling with a next-sentence prediction objective.</blockquote><blockquote>The next-sentence prediction objective is a part of <a href="http://arxiv.org/abs/1810.04805">BERT pretraining</a>. It consists of randomly sampling distractors from the dataset and training the model to distinguish whether an input sequence ends with a gold reply or a distractor. It trains the model to look at the global meaning of the segments besides the local context.</blockquote><p>Now you see why we loaded a “<em>Double-Head</em>” model. One head will compute language modeling predictions while the other head will predict next-sentence classification labels. 
Let’s have a look at how losses are computed:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*945IpgUS9MGLB6gchoQXlw.png" /><figcaption>Multi-task training objective — the model is provided with two heads for language modeling prediction (orange) and next-sentence classification (blue)</figcaption></figure><p>The <strong>total loss</strong> will be the weighted sum of the <strong>language modeling loss</strong> and the <strong>next-sentence prediction loss</strong>, which are computed as follows:</p><ul><li><strong>Language modeling</strong>: we project the hidden-state onto the word embedding matrix to get logits and apply a <em>cross-entropy loss on the portion of the target</em> <em>corresponding to the gold reply</em> (green labels on the above figure).</li><li><strong>Next-sentence prediction</strong>: we pass the hidden-state of the <em>last</em> token (the <em>end-of-sequence</em> token) through a linear layer to get a score and apply a <em>cross-entropy loss to correctly classify a gold answer among distractors</em>.</li></ul><p>Let’s see how we can code this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2357e65682d831f57c490082b4404095/href">https://medium.com/media/2357e65682d831f57c490082b4404095/href</a></iframe><p>We now have all the inputs required by our model, and we can run a forward pass of the model to get the two losses and the total loss (as a weighted sum):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8bcb388af3f8f92e5e6895d9c2f9c63e/href">https://medium.com/media/8bcb388af3f8f92e5e6895d9c2f9c63e/href</a></iframe><p>We are ready to start the training 🎉</p><h3>🦊 Training on a dialog dataset</h3><p>The ConvAI2 competition used an interesting dataset released by Facebook last year: <a href="http://arxiv.org/abs/1801.07243">PERSONA-CHAT</a>.</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/661/1*VYFCi7ebNhPZy4iCIc7mOw.png" /></figure><p>It’s a rather large dialog dataset (10k dialogs) which was created by crowdsourcing <em>personality sentences</em> and asking paired crowd workers to <em>chit-chat</em> while playing the part of a given character (an example is given on the left figure).</p><p>This dataset is available in raw tokenized text format in the nice <a href="https://parl.ai/">Facebook’s ParlAI</a> library. To get you started, we also uploaded a JSON-formatted version that you can download and tokenize using GPT’s tokenizer like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ca0661d64d6c726c6c00740968822bc7/href">https://medium.com/media/ca0661d64d6c726c6c00740968822bc7/href</a></iframe><p>The JSON version of PERSONA-CHAT gives quick access to all the relevant inputs for training our model as a nested dictionary of lists:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bGsPdkezu_sxkbsgImD4tg.png" /><figcaption>Organization of the <a href="https://s3.amazonaws.com/datasets.huggingface.co/personachat/personachat_self_original.json">JSON version</a> of PERSONA-CHAT</figcaption></figure><p>Using the awesome <a href="https://github.com/pytorch/ignite">PyTorch ignite</a> framework and the <a href="https://nvidia.github.io/apex/amp.html">new API</a> for Automatic Mixed Precision (FP16/32) provided by <a href="https://github.com/NVIDIA/apex">NVIDIA’s apex</a>, we were able to distill our 3k+ lines of competition code into fewer than <strong>250 lines of training code </strong>with distributed and FP16 options!</p><p>We’ve covered the essential parts of the code in the above gists so I’ll just let you read the commented code to see how it all fits together.</p><blockquote>The training (train.py) code is <a href="https://github.com/huggingface/transfer-learning-conv-ai">here <em>➱ 🎮</em></a></blockquote><p>Training this model on 
an AWS instance with 8 V100 GPUs takes less than <strong>an hour</strong> (currently less than $25 on the biggest p3.16xlarge AWS instance) and gives results close to the SOTA obtained during the ConvAI2 competition with <strong>Hits@1 over 79, perplexity of 20.5 and F1 of 16.5.</strong></p><p>A few differences explain the slightly lower scores vs our competition model; they are detailed in the readme of the code repo <a href="https://github.com/huggingface/transfer-learning-conv-ai">here</a> and mostly consist of tweaking the position embeddings and using a different decoder.</p><h3>👻 Talking with the Model — the Decoder</h3><p>The amazing thing about dialog models is that you can talk with them 🤗</p><p>To interact with our model, we need to add one thing: a <strong>decoder </strong>that<strong> </strong>will build full sequences from the next-token predictions of our model.</p><p>Now there have been very interesting developments in decoders over the last few months and I wanted to present them quickly here to get you up-to-date.</p><p>The two most common decoders for language generation used to be <strong>greedy-decoding</strong> and <strong>beam-search</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/386/0*idyUDtTYmOJ_hGLb.gif" /><figcaption>Generating a sentence word by word (<a href="https://edux.pjwstk.edu.pl/mat/2144/lec/wyklad6/w6.htm">source</a>)</figcaption></figure><p><strong><em>Greedy-decoding</em></strong> is the simplest way to generate a sentence: at each time step, we select the most likely next token according to the model until we reach the end-of-sequence token. One risk with greedy decoding is that a <em>highly probable</em> token may be hiding behind a <em>low-probability</em> token and be missed.</p><p><strong><em>Beam-search</em></strong> tries to mitigate this issue by maintaining a beam of several possible sequences that we construct word-by-word. At the end of the process, we select the best sentence among the beams. 
Over the last few years, beam-search has been the <em>standard decoding algorithm</em> for almost all language generation tasks including dialog (see the recent <a href="#24e8">[1]</a>).</p><p>However, several developments happened in 2018/early-2019. First, there was growing evidence that beam-search was strongly <em>sensitive to the length</em> of the outputs and that the best results could be obtained when the output length was <em>predicted</em> before decoding ([<a href="#7370">2</a>, <a href="#119d">3</a>] at EMNLP 2018). While this makes sense for <strong>low-entropy tasks </strong>like translation where the output sequence length can be roughly predicted from the input, it seems arbitrary for <strong>high-entropy tasks</strong> like dialog and story generation where outputs of widely different lengths are usually equally valid.</p><p>In parallel, at least two influential papers ([<a href="#ad18">4</a>, <a href="#fbf9">5</a>]) on high-entropy generation tasks were published in which greedy/beam-search decoding was replaced by <strong><em>sampling</em></strong> from the next-token distribution at each time step. These papers used a variant of sampling called <strong><em>top-k sampling</em></strong> in which the decoder <em>samples only from the top-k most-probable tokens</em> (k is a hyper-parameter).</p><p>The latest addition to this line of work is the study recently published by Ari Holtzman et al. <a href="#7b60">[6]</a>, which showed that the distributions of words in texts generated using <em>beam-search</em> and <em>greedy decoding</em> are very different from the distributions of words in human-generated texts. 
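<p>Before going further, the two filtering steps can be made concrete with a small framework-free sketch, written over a plain Python list of probabilities for clarity (the decoding gist linked below operates on PyTorch logit tensors, and the function name here is illustrative):</p>

```python
def filter_next_token_distribution(probs, k=0, p=0.0):
    """Return the indices kept by top-k and/or nucleus (top-p) filtering of a
    next-token distribution given as a list of probabilities summing to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order)
    if k > 0:  # top-k: keep only the k most probable tokens
        kept &= set(order[:k])
    if p > 0.0:  # nucleus: smallest top set whose cumulative probability >= p
        cumulative, nucleus = 0.0, set()
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= p:
                break
        kept &= nucleus
    return kept

probs = [0.5, 0.25, 0.15, 0.07, 0.03]
assert filter_next_token_distribution(probs, k=2) == {0, 1}
assert filter_next_token_distribution(probs, p=0.9) == {0, 1, 2}
```

<p>After filtering, the next token is sampled from the distribution renormalized over the kept indices (e.g. with random.choices); setting k=0 or p=0.0 disables the corresponding filter.</p>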
Clearly, <em>beam-search</em> and <em>greedy decoding</em> fail to reproduce some distributional aspects of human texts, as has also been noted in [<a href="#70eb">7</a>, <a href="#dcd5">8</a>] in the context of dialog systems:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yEX1poMDsiEBisrJcdpifA.png" /><figcaption>Left: Probability assigned to tokens generated by humans and beam search using GPT-2 (Note the strong variance in human text not reproduced by beam-search). Right: N-gram distributions in human and machine-generated texts (Note the complete separation between greedy/beam-search and sampling decoding methods).</figcaption></figure><p>Currently, the two most promising candidates to succeed beam-search/greedy decoding are <strong><em>top-k</em></strong> and <strong><em>nucleus (or top-p) sampling</em></strong>. The general principle of these two methods is to sample from the next-token distribution after having filtered this distribution to keep only the top k tokens (<em>top-k</em>) or the smallest set of top tokens whose cumulative probability exceeds a threshold (<em>nucleus/top-p</em>).</p><p>Here is how we can decode using <strong><em>top-k</em></strong> and/or <strong><em>nucleus/top-p</em></strong> sampling:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e4ced14baefad690026d643514aa2a6a/href">https://medium.com/media/e4ced14baefad690026d643514aa2a6a/href</a></iframe><p>We are now ready to talk with our model 🚀</p><p>The interactive script is <a href="https://github.com/huggingface/transfer-learning-conv-ai">here</a> (interact.py) and if you don’t want to run the script you can also just play with our live demo which is here 🎮</p><p>Here is an example of dialog:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2dq-YSti7Ot1bwrOj80gGQ.png" /><figcaption>Example using the interactive scripts with default settings — Bot personality: <em>I read twenty books a year. 
I’m a stunt double as my second job. I only eat kosher. I was raised in a single parent household.</em></figcaption></figure><h3>👻 Conclusion</h3><p>We’ve come to the end of this post describing how you can build a simple state-of-the-art conversational AI using transfer learning and a large-scale language model like OpenAI GPT.</p><p>As we learned at <a href="http://huggingface.co/">Hugging Face</a>, getting your conversational AI up and running quickly is the best recipe for success so we hope it will help some of you do just that!</p><p>Be sure to check out the associated demo and code:</p><ul><li>the live demo is <a href="http://convai.huggingface.co">here</a> and</li><li>the open-sourced code and pretrained models are <a href="https://github.com/huggingface/transfer-learning-conv-ai">here</a>.</li></ul><p>As always, if you liked this post, give us a few 👏 to let us know and share the news with those around you!</p><h4>References:</h4><p>[1] <a href="#c7f2">^</a> <em>Importance of a Search Strategy in Neural Dialogue Modelling</em> by Ilya Kulikov, Alexander H. 
Miller, Kyunghyun Cho, Jason Weston (<a href="http://arxiv.org/abs/1811.00907">http://arxiv.org/abs/1811.00907</a>)</p><p>[2] <a href="#5aac">^</a> <em>Correcting Length Bias in Neural Machine Translation</em> by Kenton Murray, David Chiang (<a href="http://arxiv.org/abs/1808.10006">http://arxiv.org/abs/1808.10006</a>)</p><p>[3] <a href="#5aac">^</a> <em>Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation</em> by Yilin Yang, Liang Huang, Mingbo Ma (<a href="https://arxiv.org/abs/1808.09582">https://arxiv.org/abs/1808.09582</a>)</p><p>[4] <a href="#97f9">^</a> <em>Hierarchical Neural Story Generation</em> by Angela Fan, Mike Lewis, Yann Dauphin (<a href="https://arxiv.org/abs/1805.04833">https://arxiv.org/abs/1805.04833</a>)</p><p>[5] <a href="#97f9">^</a> <em>Language Models are Unsupervised Multitask Learners</em> by Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever (<a href="https://openai.com/blog/better-language-models/">https://openai.com/blog/better-language-models/</a>)</p><p>[6] <a href="#6ea4">^</a> <em>The Curious Case of Neural Text Degeneration</em> by Ari Holtzman, Jan Buys, Maxwell Forbes, Yejin Choi (<a href="https://arxiv.org/abs/1904.09751">https://arxiv.org/abs/1904.09751</a>)</p><p>[7] <a href="#6ea4">^</a> <em>Retrieve and Refine: Improved Sequence Generation Models For Dialogue</em> by Jason Weston, Emily Dinan, Alexander H. Miller (<a href="https://arxiv.org/abs/1808.04776">https://arxiv.org/abs/1808.04776</a>)</p><p>[8] <a href="#6ea4">^</a> <em>The Second Conversational Intelligence Challenge (ConvAI2)</em> by Emily Dinan et al. 
(<a href="https://arxiv.org/abs/1902.00098">https://arxiv.org/abs/1902.00098</a>)</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC%3Ftypeform-embed%3Doembed%26format%3Djson&amp;display_name=Typeform&amp;url=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC&amp;image=https%3A%2F%2Fimages.typeform.com%2Fimages%2FMpqBWPwLRZ7Z%2Fimage%2Fdefault&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=typeform" width="900" height="600" frameborder="0" scrolling="no"><a href="https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href">https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href</a></iframe><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2d818ac26313" width="1" height="1" alt=""><hr><p><a href="https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313">🦄 How to build a State-of-the-Art Conversational AI with Transfer-Learning</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups]]></title>
            <link>https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/ec88c3e51255</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[tutorial]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Mon, 15 Oct 2018 09:23:42 GMT</pubDate>
            <atom:updated>2020-09-02T14:15:17.823Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*f1Nz2zUOqpf5uUKk" /><figcaption>By <a href="https://unsplash.com/@davidmarcu">David Marcu</a></figcaption></figure><p>I’ve spent most of 2018 training neural networks that tackle the limits of my GPUs. Whether it was a 150-million-parameter language model like <a href="https://blog.openai.com/language-unsupervised/">OpenAI’s huge Generative Pre-trained Transformer</a> (or <a href="https://arxiv.org/abs/1810.04805">the recent and similar BERT model</a>) or a meta-learning neural net fed with 30-million-element inputs like the one in <a href="https://medium.com/huggingface/from-zero-to-research-an-introduction-to-meta-learning-8e16e677f78a">our ICLR ‘18 paper</a>, I could barely fit more than a few training samples on a GPU.</p><p>But most of the time, stochastic gradient descent algorithms require larger batches than just a handful of examples to get decent results.</p><blockquote>How can you train your model on large batches when your GPU can’t hold more than a few samples?</blockquote><p>There are several tools, tips and tricks you can use to do that and I thought it would be nice to gather all the things I use and learned in a post.</p><p>In this post I will mainly talk about <strong>the PyTorch framework</strong>. 
Some of these tools are not in PyTorch yet (as of 1.0) so I include some custom code as well.</p><p>In particular, we’ll talk about:</p><ul><li>How you can train a model on a single- or multi-GPU server with batches larger than the GPUs’ memory, or when even a single training sample won’t fit (!),</li><li>How you can make the most efficient use of a multi-GPU machine, and</li><li>The simplest way to train a model using several machines in a distributed setup.</li></ul><p>Let’s start with the simplest trick: gradient accumulation.</p><h3>⌛️Large batches on one or several GPU(s)</h3><p>So, you’ve built a nice model that might be the new SOTA on this neat task but every time you try to stack more than a few samples in a batch you get a CUDA RuntimeError: <em>out of memory</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H_BbsrJdq-VMZ6rLFJCcYg.png" /><figcaption>Adam confirms your predicament! 😱Oh no!</figcaption></figure><p>But you’re pretty sure that doubling the batch size will improve the results.</p><blockquote>How can you do that?</blockquote><p>There is an easy solution to this problem: <strong>accumulating gradients</strong>. 
Here is a quick reminder on how stochastic gradient descent works from <a href="https://medium.com/huggingface/from-zero-to-research-an-introduction-to-meta-learning-8e16e677f78a">my earlier post on meta-learning</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ySxl2GQu0g07R7gWF4rizg.gif" /><figcaption>The 5 steps of a gradient descent optimization algorithm</figcaption></figure><p>The PyTorch code equivalent of these 5 steps can also be written in 5 lines:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b938012b325208cb9638949e19be71d5/href">https://medium.com/media/b938012b325208cb9638949e19be71d5/href</a></iframe><p>During the loss.backward() operation, gradients are computed for each parameter (in green on our animation) and stored in a tensor associated with each parameter: parameter.grad (the middle graph on our animation).</p><p><em>Accumulating gradients</em> just means that, before calling optimizer.step() to perform a step of gradient descent, we will sum the gradients of several backward operations in the parameter.grad tensors. This is straightforward to do in PyTorch as the gradient tensors are not reset unless we call model.zero_grad() or optimizer.zero_grad(). We’ll also need to divide by the number of accumulation steps if our loss is averaged over the training samples.</p><p>Here is a simple gist for training a model using gradient accumulation. 
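<p>Before the full training-loop gist, here is a framework-free sanity check of the principle: with equally-sized micro-batches and a mean-reduced loss, summing each micro-batch gradient divided by the number of accumulation steps reproduces the full-batch gradient exactly. The one-parameter least-squares model below is purely illustrative:</p>

```python
# Toy model: prediction = w * x, loss = mean over the batch of (w*x - y)^2,
# with the hand-derived gradient d(loss)/dw = mean of 2 * x * (w*x - y).
def batch_grad(w, batch):
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
full_batch_grad = batch_grad(w, data)

accumulation_steps = 2
accumulated = 0.0
for micro_batch in (data[:2], data[2:]):  # two equal-sized micro-batches
    # In PyTorch, loss.backward() would add this quantity into parameter.grad;
    # dividing by accumulation_steps matches the mean-reduced full-batch loss.
    accumulated += batch_grad(w, micro_batch) / accumulation_steps

assert abs(accumulated - full_batch_grad) < 1e-12  # identical gradients
```

<p>With unequal micro-batch sizes, the 1/accumulation_steps weights would have to follow the per-micro-batch sample counts instead.</p>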
In this example we can train with a batch size that is accumulation_steps times larger than the maximum size that fits on our GPU(s):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9625526c69f9c993da9d9a16a77f048c/href">https://medium.com/media/9625526c69f9c993da9d9a16a77f048c/href</a></iframe><p><a href="https://medium.com/u/4e69fb1d25b3">Grzegorz Chlebus</a> made a nice post describing how to do <strong>gradient accumulation in TensorFlow</strong>, <a href="https://gchlebus.github.io/2018/06/05/gradient-averaging.html">check it out here</a>.</p><h3>😱 Pushing that to the extreme</h3><p>Can you train a model for which <em>not even a single sample</em> can fit on a GPU?</p><p>Well, if your architecture doesn’t have too many skip connections, yes, it’s possible! The solution is to trade compute for memory using <strong>gradient-checkpointing.</strong></p><p>Basically, the idea is to back-propagate the gradients in small chunks along the model, trading the memory needed to store a full back-propagation graph for the additional compute of a partial forward pass associated with each chunk. This is a rather slow method as we add additional compute to reduce the memory requirements, but it can be interesting in some settings, e.g. to train RNN models over very long sequences (see for example my previous <a href="https://medium.com/huggingface/from-zero-to-research-an-introduction-to-meta-learning-8e16e677f78a">introduction to meta-learning</a>).</p><p>I won’t go into more detail here and will just refer you to the relevant links:</p><ul><li>TensorFlow: <a href="https://github.com/openai/gradient-checkpointing">https://github.com/openai/gradient-checkpointing</a></li><li>PyTorch doc: <a href="https://pytorch.org/docs/stable/checkpoint.html">https://pytorch.org/docs/stable/checkpoint.html</a></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/0*udMSiPD0kZHum-sZ." 
/><figcaption>A “Memory-poor” strategy that needs O(1) memory (but requires O(n²) computation steps) — From <a href="https://medium.com/@yaroslavvb?source=post_header_lockup">Yaroslav Bulatov</a>’s nice post: <a href="https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9">https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9</a></figcaption></figure><h3>🕰 Making the best of a multi-GPU machine</h3><p>Now let’s talk more specifically about training models on multiple GPUs.</p><p>The go-to strategy to train a PyTorch model on a multi-GPU server is to use <a href="https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed">torch.nn.DataParallel</a>. It’s a container which parallelizes the application of a module by splitting the input across the specified devices, chunking along the batch dimension.</p><p><em>DataParallel</em> is very easy to use: we just add one line to encapsulate the model:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e91264ec8e4e49e45578c53cd79708d3/href">https://medium.com/media/e91264ec8e4e49e45578c53cd79708d3/href</a></iframe><p>However, one issue can arise with DataParallel: <strong>unbalanced GPU usage</strong>.</p><blockquote>Under some settings GPU-1 will be used a lot more than the other GPUs.</blockquote><p>Where does this come from? I made an illustration to better explain what DataParallel does under the hood:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FpDHkWJhkLL7KxU01Lf9Lw.png" /><figcaption>Forward and Backward passes with torch.nn.DataParallel</figcaption></figure><p>During step 4 of the Forward pass (top-right), the results of <em>all the parallel computations</em> are gathered on GPU-1. 
This is fine for a lot of classification problems but it can become problematic when you train a language model on large batches, for example.</p><p>Let’s quickly compute the size of the output for a language model:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fSHTlvLwaPdOt08I38M9BQ.png" /><figcaption>Number of elements in the output of a language model</figcaption></figure><p>If we assume a 40k vocabulary, 250 tokens in our sequences, 32 samples per batch and 4 bytes to store each element in memory, the output of our model takes<strong> </strong>about<strong> </strong>1.2 GB. We need to double that to store the associated gradient tensors; <strong>our model output thus requires 2.4 GB of memory!</strong></p><p>That’s a significant portion of a typical 10 GB GPU memory and means that GPU-1 will be over-used relative to the other GPUs, limiting the effect of the parallelization.</p><p>We cannot easily reduce the number of elements in this output without tweaking the model and/or the optimization scheme. But we can make sure the memory load is more evenly distributed among the GPUs.</p><h3>⚖️ Balanced load on a multi-GPU machine</h3><p>There are two main solutions to the imbalanced GPU usage issue:</p><ul><li>computing the loss <strong>in the forward pass</strong> of your model,</li><li>computing the loss <strong>in a parallel</strong> fashion.</li></ul><p>The first option is the easiest but sometimes you can’t use it or it’s not practical for various reasons (e.g. your forward pass becomes too complicated and slow because of <a href="https://wiki.python.org/moin/GlobalInterpreterLock">Python’s GIL</a>) so let’s talk a bit about the second solution. Along the way we’ll learn interesting things about how PyTorch multi-GPU modules work.</p><p>In that case, the solution is to keep each partial output on its GPU instead of gathering all of them to GPU-1. 
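<p>As a quick aside, the memory estimate above is easy to check in a couple of lines (the figures in the text are rounded: the unrounded values are 1.28 GB and 2.56 GB, which doesn’t change the conclusion):</p>

```python
# Size of a language model's output logits, with the assumptions used above.
vocab_size = 40_000      # 40k vocabulary
seq_len = 250            # tokens per sequence
batch_size = 32          # samples per batch
bytes_per_element = 4    # float32

output_gb = vocab_size * seq_len * batch_size * bytes_per_element / 1e9
print(output_gb)      # ~1.28 GB of logits
print(2 * output_gb)  # ~2.56 GB including the associated gradient tensor
```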
We will also need to distribute our loss criterion computation so that we can compute and back-propagate our loss.</p><p>Thankfully for us, <a href="https://hangzhang.org/">Hang Zhang (张航)</a> has open-sourced a nice PyTorch package called <a href="https://github.com/zhanghang1989/PyTorch-Encoding">PyTorch-Encoding</a> which comprises these custom parallelization functions.</p><p>I’ve extracted and slightly adapted this module and you can <a href="https://gist.github.com/thomwolf/7e2407fbd5945f07821adae3d9fd1312">download here</a> a gist (parallel.py) to include and call from your code. It mainly comprises two modules: <em>DataParallelModel </em>and<em> DataParallelCriterion</em> which are made to be used as follows:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/823ef640ddaf893fee7673b794c41e36/href">https://medium.com/media/823ef640ddaf893fee7673b794c41e36/href</a></iframe><p>The difference between <em>DataParallelModel </em>and<em> </em><a href="https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed"><em>torch.nn.DataParallel</em></a> is just that the output of the forward pass (predictions) is not gathered on GPU-1 and is thus <strong>a tuple</strong> of n_gpu tensors, each tensor being located on a respective GPU.</p><p>The <em>DataParallelCriterion</em> container encapsulates the loss function and <strong>takes as input</strong> the tuple of n_gpu tensors<strong> </strong>and the target labels tensor. 
It computes the loss function in parallel on each GPU, splitting the target label tensor the same way the model input was chunked by DataParallel.</p><p>I made an illustration of <em>DataParallelModel/DataParallelCriterion </em>internals<em>:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F6SXjBp6BCoFTZ26RKnz9A.png" /><figcaption><em>Using DataParallelModel and DataParallelCriterion</em></figcaption></figure><p>Here is how to handle two particular cases you may encounter:</p><ul><li><strong>Your model outputs several tensors:</strong> you likely want to disentangle them: output_1, output_2 = zip(*predictions)</li><li><strong>Sometimes you don’t want to use a parallel loss function</strong>: gather all the tensors on the CPU: gathered_predictions = parallel.gather(predictions)</li></ul><h3>⏰ Distributed training: training on several machines</h3><p>Now how can we harness the power of several servers to train on even larger batches?</p><p>The simplest option is to use PyTorch <a href="https://pytorch.org/docs/stable/nn.html#distributeddataparallel">DistributedDataParallel</a> which is meant to be <em>almost</em> a drop-in replacement for DataParallel discussed above.</p><p>But be careful: while the code looks similar, training your model in a <em>distributed setting</em> will change your workflow because you will actually have to start<strong> an independent Python training script on each node</strong> (these scripts are all identical).<strong> </strong>As we will see, once started, these training scripts will be synchronized by the PyTorch distributed backend.</p><p>In practice, this<strong> </strong>means that each training script will have:</p><ul><li><strong>its own optimizer</strong>, performing a complete optimization step at each iteration; no parameter broadcast (step 2 in DataParallel) is needed, and</li><li><strong>an independent Python interpreter:</strong> this will also avoid the <a 
href="https://wiki.python.org/moin/GlobalInterpreterLock">GIL-freeze</a> that can come from driving several parallel execution threads in a single Python interpreter.</li></ul><blockquote>Models that make heavy use of Python loops/calls in their forward passes can be slowed down by the Python interpreter’s GIL when several parallel forward calls are driven by a single interpreter. In these settings, <a href="https://pytorch.org/docs/stable/nn.html#distributeddataparallel">DistributedDataParallel</a> can advantageously replace DataParallel even on a single-machine setup.</blockquote><p>Now let’s just dive straight into the code and usage.</p><p>DistributedDataParallel is built on top of the <a href="https://pytorch.org/docs/stable/distributed.html">torch.distributed</a> package, which provides low-level primitives for synchronizing distributed operations and can make use of several backends (<em>tcp</em>, <em>gloo</em>, <em>mpi</em>, <em>nccl</em>) with different capabilities.</p><p>In this post I will select <strong>one simple way</strong> to use it out-of-the-box but you should <a href="https://pytorch.org/docs/stable/distributed.html">read the doc</a> and <a href="https://pytorch.org/tutorials/intermediate/dist_tuto.html">this nice tutorial</a> by <a href="http://seba1511.com/">Séb Arnold</a> to dive deeper into this module.</p><p>We will consider a simple but general setup with two 4-GPU servers (nodes):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/519/1*X23kPqRHuLM8hty-Y6bxyQ.png" /><figcaption>The main server (server 1) has an accessible IP and an open port for communication.</figcaption></figure><h4>🏃 Adapting our Python training script for distributed training</h4><p>First we need to adapt our script so that it can be run separately on each machine (node). 
We are actually going to go fully distributed and run a separate process for each GPU of each node, so 8 processes in total.</p><p>Our training script is a bit longer as we need to initialize the distributed backend for synchronization, encapsulate the model and prepare the data to train each process on a separate subset of the data (each process is independent so we have to take care of having each of them handle a different slice of the dataset ourselves). Here is the updated code:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/eb364572c28b87e9543a0f1db52e1d35/href">https://medium.com/media/eb364572c28b87e9543a0f1db52e1d35/href</a></iframe><h4>✨ Launching multiple instances of our Python training script</h4><p>We are almost done now. We just have to start an instance of our training script on each server.</p><blockquote>To run our script, we’ll use the <a href="https://pytorch.org/docs/stable/distributed.html#launch-utility">torch.distributed.<strong>launch</strong></a> utility of PyTorch. It will take care of setting the environment variables and calling each script with the right local_rank argument.</blockquote><p>The first machine will be our master; it needs to be accessible from all the other machines and thus have an accessible IP address (192.168.1.1 in our example) and an open port (1234 in our case). 
On this first machine, we run our training script using <a href="https://pytorch.org/docs/stable/distributed.html#launch-utility">torch.distributed.<strong>launch</strong></a>:</p><pre>python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=&quot;192.168.1.1&quot; --master_port=1234 OUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of our training script)</pre><p>On the second machine we similarly start our script:</p><pre>python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr=&quot;192.168.1.1&quot; --master_port=1234 OUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of our training script)</pre><p>These two commands are identical except for the --node_rank argument, which is set to 0 on the first machine and 1 on the second (and would be 2 on an additional server, etc.)</p><p>The process of running a bunch of almost identical commands on a cluster of machines might look a bit tedious. 
So now is probably a good time to learn about the magic of… <a href="https://www.gnu.org/software/parallel/"><strong>GNU parallel</strong></a><strong>:</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fvideoseries%3Flist%3DPL284C9FF2488BC6D1&amp;url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPL284C9FF2488BC6D1&amp;image=http%3A%2F%2Fi.ytimg.com%2Fvi%2FOpaiGYxkSuQ%2Fdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="853" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/d51fb6fada94437833c863e671529d56/href">https://medium.com/media/d51fb6fada94437833c863e671529d56/href</a></iframe><p>One exciting improvement of the coming <a href="https://github.com/pytorch/pytorch/releases/tag/v1.0rc1">PyTorch v1.0</a> is the release of a new <a href="https://github.com/pytorch/pytorch/releases/tag/v1.0rc1#torchdistributed-new-c10d-library">c10d backend</a> for the distributed module. 
I will update this short introduction when v1.0 is released, with more details on the new backend 🔥</p><p>This concludes our quick post on a few tips, tricks and tools to train your model on larger batches in a variety of settings.</p><p>I hope you enjoyed this more technical post!</p><p>Clap 👏 a couple of times if you liked it and want us to post more of these!</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC%3Ftypeform-embed%3Doembed%26format%3Djson&amp;display_name=Typeform&amp;url=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC&amp;image=https%3A%2F%2Fimages.typeform.com%2Fimages%2FMpqBWPwLRZ7Z%2Fimage%2Fdefault&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=typeform" width="900" height="600" frameborder="0" scrolling="no"><a href="https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href">https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href</a></iframe><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ec88c3e51255" width="1" height="1" alt=""><hr><p><a href="https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255">💥 Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU &amp; Distributed setups</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[⛵ Learning Meaning in Natural Language Processing - The Semantics Mega-Thread]]></title>
            <link>https://medium.com/huggingface/learning-meaning-in-natural-language-processing-the-semantics-mega-thread-9c0332dfe28e?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/9c0332dfe28e</guid>
            <category><![CDATA[twitter]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[meaning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Tue, 07 Aug 2018 15:31:15 GMT</pubDate>
            <atom:updated>2018-08-09T17:16:30.092Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4pIUBgHXsm4SkBdk" /><figcaption>Krabi, Thailand by <a href="https://unsplash.com/@playserious">Mo Baghdadi</a></figcaption></figure><h3>⛵ Learning Meaning in Natural Language Processing — The Semantics Mega-Thread</h3><h4>In which Twitter talked about meaning, semantics, language models, learning Thai and Java, entailment, co-reference — all in one fascinating thread.</h4><p>Last week <a href="https://twitter.com/jacobandreas/status/1023246560082063366">a tweet by Jacob Andreas</a> triggered a huge discussion on Twitter that many people have called the <em>meaning/semantics mega-thread</em>.</p><p>Twitter is a great medium for such a discussion: replying to any comment lets you revive the debate from its most promising point whenever it gets stuck in a dead-end.</p><p>Unfortunately, Twitter also makes the discussion very hard to read afterwards, so I made three entry points to explore this fascinating mega-thread:</p><ol><li>a summary of the discussion that you will find below,</li><li>an <a href="https://huggingface.co/meaning-mega-thread/">interactive view</a> to explore the trees of tweets, and</li><li>a <a href="https://huggingface.co/meaning-mega-thread/tree.png">commented map</a> to get an overview of the main points discussed:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*em05OJ02V81HItUj.png" /><figcaption>Full size image <a href="https://huggingface.co/meaning-mega-thread/tree.png">here</a></figcaption></figure><blockquote>This map and summary are obviously selective, biased by my personal interests and familiarity with the topics discussed, so if you notice anything wrong or mis-represented please (i) go back to the actual tweets <strong>and</strong> (ii) send me a message <em>📨</em></blockquote><h3>🏎 A crash course on Lexical Meaning &amp; Semantics</h3><p>If you already know what we mean by 
“<em>Meaning/Semantics</em>” in NLP, you can skip this part and go straight to <a href="#c785">the debate 🔥</a></p><p>For the CS/ML folks out there, here are a few words of introduction.</p><p>First, it’s important to state that Meaning in Natural Language is a <em>multi-faceted concept</em> with semantic, pragmatic, cognitive and social aspects. The discussion that happened on Twitter was mainly about lexical semantics and compositionality, so I will focus on this sub-field for brevity. You will find additional links to broaden this view at the end of this section.</p><blockquote><a href="https://en.wikipedia.org/wiki/Meaning_(linguistics)">Meaning</a> is the information that a sender intends to convey, or does convey, to a receiver.</blockquote><p>Now, we know that <strong>strings are already a representation of meaning,</strong> so why should we go any further than just raw text?</p><p>Well, there are several reasons we may want to distinguish meaning from raw text.</p><p>One reason is that the field of NLP/NLU aims at <strong>building systems</strong> that understand what you say to them, trigger actions based on that and convey back meaningful information. Let’s take a simple example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/444/1*9lcDx9SLPpdF4F0iS868CA.png" /></figure><p>Given some knowledge of math, we want our NLU system to produce an appropriate answer.</p><p>It’s difficult to (i) link raw text to a knowledge base of mathematical facts in our system and (ii) combine pieces of knowledge together to infer an answer. One solution is to define an <em>intermediate meaning representation</em> (sometimes called a Logical Form) that is easier to manipulate.</p><p>For example in our case:</p><blockquote>Meaning representation: 𝑚𝑎𝑥(<em>𝚙𝚛𝚒𝚖𝚎𝚜</em> ∩(−∞; 𝟣𝟢))</blockquote><p>We can then execute this expression with respect to a model of the world, like our database of knowledge, to get an answer. 
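Here is a toy sketch of what executing this logical form against a tiny world model might look like in Python (all names here are illustrative, not from the original example):

```python
# Toy world model: a small database of known primes.
PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23}

def interval(upper):
    """Denotation of (-inf, upper): a predicate over numbers."""
    return lambda x: x < upper

def execute(predicate, world):
    """Execute max(world ∩ predicate) against the world model."""
    return max(x for x in world if predicate(x))

# "What is the largest prime less than 10?" -> max(primes ∩ (-inf, 10))
answer = execute(interval(10), PRIMES)
print(answer)  # 7
```
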
This way, we have also factored out the understanding of language (called <em>semantic parsing</em>) from the world knowledge (the problem of <em>grounding</em> meaning in the real world).</p><p>Advantageously, our <em>representation of the meaning of a sentence</em> can thus:</p><ol><li>provide a way to <strong>link language </strong>to external<strong> knowledge bases</strong>, <strong>observations</strong>, and <strong>actions</strong>;</li><li>support <strong>computational inference</strong>, so that concepts can be <strong>combined</strong> to derive additional knowledge as humans do during a conversation.</li></ol><p>Two other nice requirements for this representation:</p><ol><li><strong>unambiguous</strong>: one meaning per statement (unlike natural language);</li><li><strong>expressive</strong> enough to cover the full range of things that people talk about.</li></ol><p>Natural language as raw text doesn’t fulfill most of these criteria!</p><p>A related line of research is <strong>Formal Semantics</strong>, which seeks to <em>understand </em>linguistic meaning by constructing models of the principles that speakers use to convey meaning.</p><p>The tools of formal semantics are similar to NLU/NLP tools but the aim is to understand how people construct meaning rather than any specific application.</p><p>Now there is a lot more to meaning than just logical forms and grounding. A few examples: “<strong>But I didn’t mean it literally!!”</strong> (speaker meaning ≠ literal meaning), “<a href="https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo"><strong>Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo</strong></a><strong>.”</strong> (yes, this is a real sentence with a meaning but you need to find the right sense for each buffalo!) and so on…</p><blockquote>A few pointers: Our simple example came from <a href="https://cs.stanford.edu/~pliang/papers/executable-cacm2016.pdf">this nice article by Percy Liang</a>. 
As a quick overview of the field, I would recommend chapters 12 and 13 of J. Eisenstein’s book “<a href="https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf">Natural Language Processing</a>”. They will take you through the main ideas and tools, up to recent research on Meaning in NLP. <a href="http://faculty.washington.edu/ebender/100things-sem_prag.html">Emily M. Bender’s ACL 2018 tutorial</a> is a nice way to see how Meaning can be a multi-headed monster 🐍 to say the least!</blockquote><p>Now back to our mega-thread!</p><h3>🔥 Triggering a debate on Meaning</h3><p>As often, the discussion was sparked by a mention of sentence embeddings:</p><h3>Jacob Andreas on Twitter</h3><p>@_shrdlu_ @emilymbender Are sentence embeddings &quot;meaning representations&quot;? Most of the work I&#39;ve seen is more about syntactic phenomena</p><h3>Emily M. Bender on Twitter</h3><p>@_shrdlu_ @jacobandreas What&#39;s the task the encodings are learned from? If it&#39;s language modeling, that&#39;s *not* a representation of meaning.</p><p>This argument was the main trigger for the mega-discussion that followed. Further in the thread, Emily M. Bender <a href="https://twitter.com/emilymbender/status/1024041833851015168">reformulates</a> her argument as:</p><blockquote>If all the learner gets is text the learner cannot learn meaning.</blockquote><h3><strong>Can a model </strong>trained only on raw text <strong>learn meaning?</strong></h3><p>This question was explored along two axes:</p><p><strong>What aspect of Meaning can a model learn?</strong></p><ul><li>Can it learn <em>meaning </em>or just learn <em>similarity in meaning </em>(i.e. learn that some expressions are similar without knowing what they mean, which is still very useful for transfer learning)?</li><li>Can it learn <em>grounded meaning</em> (learn the meaning of each expression as a state of the world) or learn <em>lexical meaning</em> (e.g. 
learn how the meanings of sub-expressions compose together, as in our logical forms)?</li></ul><p><strong>How can the model learn?</strong></p><ul><li>If the model cannot learn meaning from raw text alone, what would be the <em>minimal amount of additional supervision</em> needed? Should we add supervision from Logical Forms, Textual Entailment…?</li><li>Could we encode some <em>inductive bias</em> in the model so that it can learn aspects of meaning from raw text?</li></ul><h3>The Thai and Java Experiments</h3><p>Emily M. Bender proposed several interesting experiments which were discussed at length:</p><ul><li><a href="https://twitter.com/emilymbender/status/1024042044035985408">The Thai Room experiment</a>: “<em>Imagine [you] were given the sum total of all Thai literature in a huge library. (All in Thai, no translations.) Assuming you don’t already know Thai, you won’t learn it from that.”<br></em><strong>A real-life example of trying to learn from raw text only.</strong></li><li><a href="https://twitter.com/emilymbender/status/1025002835467890689">The Java Code experiment</a>: “<em>Give your NN all well-formed java code that’s ever been written, but only the surface form of the code. Then ask it to evaluate (i.e. execute) part of it.”<br></em><strong>Can we learn execution semantics from raw text only?</strong></li></ul><h3>Investigating Programming Language Semantics</h3><p>The <a href="https://twitter.com/emilymbender/status/1025002835467890689">Java Code proposal</a> triggered an interesting discussion on the difference between trying to learn meaning from Programming Language (PL) code and from Natural Language (NL) text.</p><p>It actually seems more difficult to learn from PL than NL:</p><h3>Matt Gardner on Twitter</h3><p>@yoavgo @emilymbender @gneubig @gchrupala @tallinzen @sleepinyourhat @jacobandreas @_shrdlu_ Third try :) I want to learn code execution. I have java code (with no tests). 
There is no information about execution in the code itself, because I never actually see the output of a function. I see functions called, but this only ever gives me return types.</p><h3>Matt Gardner on Twitter</h3><p>@yoavgo @emilymbender @gneubig @gchrupala @tallinzen @sleepinyourhat @jacobandreas @_shrdlu_ In order to learn execution, I need to either see execution (for some kind of supervised learning), or some indirect result of an execution (for some kind of weak / unsupervised learning). I get neither in just code, unless I have tests.</p><p>Learning meaning from Java code is like having a text composed only of orders/commands and without any descriptions. But descriptions are very important feedback for learning, as they allow one to compare one’s internal world state with the real world state.</p><h3>Language Models</h3><p>The discussion circled around Language Models. A language model is a model which can <strong>predict the next word in a sentence given a context</strong> (past words, external knowledge). Recently, these models have given <a href="https://blog.openai.com/language-unsupervised/">interesting</a> <a href="http://arxiv.org/abs/1806.02847">results</a> in commonsense reasoning. Language models were examined in two settings:</p><ul><li><strong>Independently of any training dataset</strong>: having a <em>human-level language model</em> implies that the model has a human-like notion of meaning and world state. 
As <a href="https://twitter.com/yoavgo/status/1025485692347056128">Yoav Goldberg mentioned</a> “<em>it’s just (one of the many) trivial examples where perfect LM entails solving not only all of language but also all of AI.</em>”</li><li>More interesting is the case of a language model <strong>trained from raw text only:</strong> here the question is how much the model can learn in terms of semantics without being given access to explicit meaning information!</li></ul><h3>Textual Entailment</h3><p>One way to give some information about meaning without departing too much from raw text is to train a model on Textual Entailment, the task of <strong>predicting whether a sentence implies another</strong>.</p><p>In a series of tweets, Sam Bowman explained his view that entailment could be used to learn “the meat of compositional” semantics almost from raw text:</p><h3>Sam Bowman on Twitter</h3><p>@jacobandreas @emilymbender @_shrdlu_ @tallinzen I&#39;d argue that it&#39;s possible to build a satisfying approach to semantics with only entailments and no grounding. This is a category fight, so I won&#39;t push to hard, but I&#39;m not the only one-have a look at work on natural logic for some very formal attempts.</p><p>He also suggested that a learner might be able to learn entailment in a simpler way than current setups, using a setup that could be close to LM:</p><h3>Sam Bowman on Twitter</h3><p>@emilymbender @tallinzen @IntuitMachine @jacobandreas @_shrdlu_ I&#39;m mostly on the same page, but I think it&#39;s plausible for a learner to learn to do something like RTE (which does get at something more meaning-like than bare forms) while learning almost entirely _from_ raw form.</p><h3>Sam Bowman on Twitter</h3><p>@emilymbender @tallinzen @IntuitMachine @jacobandreas @_shrdlu_ Something like language modeling, with some minimal additional guidance/data to teach it what the notion of entailment is (but not how any particular language works). 
I don&#39;t think it&#39;s something we can do now or soon, but I don&#39;t see a strong argument that it&#39;s impossible.</p><h3>An Inductive Bias Language Model</h3><p>Another way to learn meaning from a dataset as close as possible to raw text is to put a strong inductive bias in the model as discussed by Matt Gardner:</p><h3>Matt Gardner on Twitter</h3><p>@emilymbender @yoavgo @microth @gneubig @gchrupala @tallinzen @sleepinyourhat @jacobandreas @_shrdlu_ Step 1: design a model that explicitly tries to capture world state (&quot;meaning&quot;). Step 2: train that model using a LM signal. Yes, this is a weak signal, I know. But not zero. https://t.co/DgXKpSZtLN</p><p>One example is the <a href="https://arxiv.org/abs/1708.00781">Entity Language Model</a> which augments a classical LSTM by explicitly modeling an arbitrary number of entities, updating their representations along the hidden state, and using the mention representations to contextually generate the next word in the LM task.</p><blockquote>To read more about that, check <a href="https://sites.google.com/site/repl4nlp2018/">Yejin Choi’s talk at ACL 2018</a> and <a href="https://www.youtube.com/watch?v=7CcSm0PAr-Y">Percy Liang’s talk at AAAI 2018</a>.</blockquote><h3><strong>The Big Open Question</strong></h3><p>In the end, I feel like the main original question stayed open: can a model learn some aspects of lexical meaning from raw text alone?</p><p>Here is a discussion representative of the positions of the participants:</p><h3>Emily M. Bender on Twitter</h3><p>@gchrupala @yoavgo @tallinzen @sleepinyourhat @jacobandreas @_shrdlu_ First, yes, some constituent structure can be induced, but I&#39;m not convinced it all can. But even if it could be: How&#39;s the learner going to get active/passive equivalence? Dative alternation? Long distance dependencies? 
W/o these, you don&#39;t have predicate-argument structure.</p><h3>Matt Gardner on Twitter</h3><p>@emilymbender @gchrupala @yoavgo @tallinzen @sleepinyourhat @jacobandreas @_shrdlu_ By seeing &quot;She thought it was yummy&quot; after both the active and the passive versions. A stretch to actually learn this from a language modeling signal, yes, but not a priori impossible. Future utterances can give context that allows some induction of &quot;meaning&quot;.</p><h3>Emily M. Bender on Twitter</h3><p>@yoavgo @nlpmattg @gchrupala @tallinzen @sleepinyourhat @jacobandreas @_shrdlu_ Because I don&#39;t see how &quot;She thought it was yummy&quot; is going to help with mapping predicate argument structure for &quot;Kim at the cake&quot; and &quot;The cake was eaten by Kim&quot; without it.</p><h3>(((ل()(ل() &#39;yoav)))) on Twitter</h3><p>@emilymbender @nlpmattg @gchrupala @tallinzen @sleepinyourhat @jacobandreas @_shrdlu_ by seeing these two sentences in the exact same contexts, you could assume they share the meaning, and you also observe the symbols for cake and for Kim in different positions. a case-inflection system will make this harder.</p><p>We are done with our quick summary of the Meaning Mega-thread.</p><p>For more details you should check out the <a href="https://huggingface.co/meaning-mega-thread/tree.png">commented map</a> and <a href="https://huggingface.co/meaning-mega-thread/">navigate the tweet trees</a>!</p><p>As always, don’t hesitate to give this post a few claps 👏 if you enjoyed it!</p><h3>A word on Searle’s Chinese Room</h3><p><a href="https://en.wikipedia.org/wiki/Chinese_room">Searle’s room argument</a> came back often in the discussion, but the situation here was a bit different.</p><p>Searle’s argument was made in the Strong versus Weak AI debate: does the computer have a mind or consciousness? 
Here the question is less philosophical: can we extract a representation of meaning from form alone?</p><p>Still, <a href="https://twitter.com/jeremyphoward/status/1026881686359822337">as Jeremy Howard detailed a bit later</a>, Searle’s Chinese room experiment goes far beyond the question of Strong/Weak AI to the question of understanding/qualia, so please go <a href="https://twitter.com/jeremyphoward/status/1026881686359822337">check this thread</a>.</p><hr><p><a href="https://medium.com/huggingface/learning-meaning-in-natural-language-processing-the-semantics-mega-thread-9c0332dfe28e">⛵ Learning Meaning in Natural Language Processing - The Semantics Mega-Thread</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ 100 Times Faster Natural Language Processing in Python]]></title>
            <link>https://medium.com/huggingface/100-times-faster-natural-language-processing-in-python-ee32033bdced?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/ee32033bdced</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[optimisation]]></category>
            <category><![CDATA[chatbots]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Tue, 12 Jun 2018 06:46:20 GMT</pubDate>
            <atom:updated>2020-09-02T14:15:52.542Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hfwhejA1iwOt85ftb8gILQ.png" /><figcaption>SpaceX Falcon Heavy Launch — Credit <a href="https://www.flickr.com/photos/spacex">SpaceX</a></figcaption></figure><h4>How to take advantage of spaCy &amp; a bit of Cython for blazing fast NLP</h4><blockquote>I also published a <a href="https://github.com/huggingface/100-times-faster-nlp">Jupyter notebook</a> with the examples I describe in this post.</blockquote><p>When we published our Python <a href="https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30">coreference resolution package</a>✨ last year, we got amazing feedback from the community and people started to use it for many applications 📚, some very different from our original dialog use-case 👥.</p><p>And we discovered that, while the speed was totally fine for dialog messages, it could be <em>really</em> slow 🐌 on larger news articles.</p><p>I decided to investigate this in detail and the result is <a href="https://github.com/huggingface/neuralcoref/">NeuralCoref v3.0</a> which is about <strong>100 times faster</strong> 🚀 than the previous version (several thousand words per second) while retaining the same accuracy, along with the ease of use and ecosystem of a Python library.</p><p>In this post I wanted to share a few lessons learned on this project, and in particular:</p><ul><li>How you can <strong>design a high-speed module </strong>in Python,</li><li>How you can <strong>take advantage of spaCy</strong>’s internal data structures to efficiently design <strong>super fast NLP functions</strong>.</li></ul><p>So I am cheating a bit here because we will be talking about Python, but also about some <strong>Cython</strong> magic — but, you know what? 
Cython is <a href="http://cython.org/">a superset of Python</a>, so don’t let that scare you away!</p><blockquote>Your current Python program is already a Cython program.</blockquote><p>There are several cases where you may need such speed-ups, e.g.:</p><ul><li>you are developing a <strong>production module</strong> for NLP using Python,</li><li>you are <strong>computing analytics</strong> on a large NLP dataset using Python,</li><li>you are <strong>pre-processing a large training set </strong>for a deep learning framework like PyTorch/TensorFlow, or you have heavy processing logic in your deep learning <strong>batch loader</strong> that slows down your training.</li></ul><blockquote>One last thing before we start: I also <a href="https://github.com/huggingface/100-times-faster-nlp">published a Jupyter Notebook</a> with the working examples I talk about in this post. Try it out!</blockquote><h3>First step to rocket speed: Profiling</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6lC4bfxnSYHebgqP.png" /></figure><p>The first thing to know is that most of your code is probably just fine in pure Python, but there can be <strong>a few bottleneck functions</strong> that will make your code orders of magnitude faster if you give them some love.</p><p>You should thus start by profiling your Python code and find where the slow parts are located. 
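For instance, profiling a function with the standard library's cProfile and pstats modules might look like this (a minimal sketch; slow_function is a stand-in for your real workload):

```python
import cProfile
import io
import pstats

def slow_function():
    # Stand-in for your real workload.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# Print the 10 most time-consuming entries, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```
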
One option is to use <a href="https://docs.python.org/3/library/profile.html">cProfile</a> like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/25e8fee7b331fbfa23cdef2852f89d00/href">https://medium.com/media/25e8fee7b331fbfa23cdef2852f89d00/href</a></iframe><p>You’ll likely find that the slow parts are a few loops, and some NumPy array manipulations if you use neural networks (but I won’t spend time talking about NumPy here as there is already a <a href="http://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html">lot of information</a> written about that).</p><p>So, how can we speed up these loops?</p><h3>Fast Loops in Python with a bit of Cython</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/400/0*RA89oQ-0j3Rscipw.jpg" /></figure><p>Let’s work this out on a simple example. Say we have a large set of rectangles that we store as a list of Python objects, e.g. instances of a Rectangle class. The main job of our module is to iterate over this list in order to count how many rectangles have an area larger than a specific threshold.</p><p>Our Python module is quite simple and looks like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0b3be6080ecf6f7857d22fe639e8c206/href">https://medium.com/media/0b3be6080ecf6f7857d22fe639e8c206/href</a></iframe><p>The check_rectangles function is our bottleneck! 
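As a sketch, such a module might look roughly like this (illustrative; the exact code lives in the notebook linked above):

```python
from random import random

class Rectangle:
    def __init__(self, w, h):
        self.w = w
        self.h = h

    def area(self):
        return self.w * self.h

def check_rectangles(rectangles, threshold):
    # The bottleneck: a pure-Python loop over millions of objects.
    n_out = 0
    for rectangle in rectangles:
        if rectangle.area() > threshold:
            n_out += 1
    return n_out

rectangles = [Rectangle(random(), random()) for _ in range(1_000_000)]
n_out = check_rectangles(rectangles, threshold=0.25)
print(n_out)
```
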
It loops over a large number of Python objects and this can be rather slow as the Python interpreter does a lot of work under the hood at each iteration (looking for the area method in the class, packing and unpacking arguments, calling the Python API…).</p><blockquote>Here comes Cython to help us speed up our loop.</blockquote><p>The Cython language is a superset of Python that contains two kinds of objects:</p><ul><li><strong>Python objects</strong> are the objects we manipulate in <em>regular Python</em> like numbers, strings, lists, class instances…</li><li><strong>Cython C objects </strong>are C or C++ objects like <em>double</em>, <em>int</em>, <em>float</em>, <em>struct, vectors</em> that can be compiled by Cython into super fast low-level code.</li></ul><blockquote>A fast loop is simply a loop in a Cython program within which we only access <strong>Cython C objects</strong>.</blockquote><p>A straightforward approach to designing such a loop is to define C structures that will contain all the things we need during our computation: in our case, the lengths and widths of our rectangles.</p><p>We can then store our list of rectangles in a C array of such structures that we will pass to our check_rectangles function. 
This function now has to accept a <em>C array</em> as input and thus will be defined as a <em>Cython function</em> by using the cdef keyword instead of def (note that cdef is also used to define Cython C <em>objects</em>).</p><p>Here is what the fast Cython version of our Python module looks like:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/85795766c250d8f075fa85a96da49e47/href">https://medium.com/media/85795766c250d8f075fa85a96da49e47/href</a></iframe><p>Here we used a raw array of C pointers, but you can also choose other options, in particular <a href="http://cython.readthedocs.io/en/latest/src/userguide/wrapping_CPlusPlus.html#standard-library">C++ structures like vectors, pairs, queues and the like</a>. In this snippet, I also used the convenient Pool() memory management object of <a href="https://github.com/explosion/cymem">cymem</a> to avoid having to free the allocated C array manually. When Pool is garbage collected by Python, it automatically frees the memory we allocated using it.</p><blockquote>A good reference on the practical usage of Cython in NLP is the <a href="https://spacy.io/api/cython#conventions">Cython Conventions</a> page of spaCy’s API.</blockquote><h3>👩‍🎨 Let’s Try that Code!</h3><p>There are many ways you can test, compile and distribute Cython code! 
Cython can even be used <a href="http://cython.readthedocs.io/en/latest/src/reference/compilation.html#compiling-notebook">directly in a Jupyter Notebook</a> like Python.</p><p>First, install Cython with pip install cython.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/773/0*1MPkC8HxkRX_lHyl.png" /></figure><h4>First Tests in Jupyter</h4><p>Load the Cython extension in a Jupyter notebook with %load_ext Cython.</p><p>Now you can write Cython code like Python code by using <a href="http://cython.readthedocs.io/en/latest/src/reference/compilation.html#compiling-with-a-jupyter-notebook">the magic command</a> %%cython.</p><p>If you have a compilation error when you execute a Cython cell, be sure to check the Jupyter terminal output to see the full message.</p><p>Most of the time you’ll be missing a -+ flag after %%cython to compile to C++ (for example if you use the spaCy Cython API), or an import numpy if the compiler complains about NumPy.</p><p>As I mentioned in the beginning, check <a href="https://github.com/huggingface/100-times-faster-nlp">the Jupyter Notebook</a> accompanying this post; it has all the examples we discuss running in Jupyter.</p><h4>Writing, Using and Distributing Cython Code</h4><p>Cython code is written in .pyx files. These files are translated to C or C++ files by the Cython compiler and then compiled to native machine code by the system’s C compiler. The compiled extension modules can then be loaded by the Python interpreter.</p><p>You can load a .pyx file directly in Python by using pyximport:</p><pre>&gt;&gt;&gt; <strong>import</strong> <strong>pyximport</strong>; <strong>pyximport.install()</strong><br>&gt;&gt;&gt; <strong>import</strong> <strong>my_cython_module</strong></pre><p>You can also build your Cython code as a Python package and import/distribute it as a regular Python package as detailed <a href="http://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html#">here</a>. 
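A minimal setup.py for that could look like this (a sketch; the module name is illustrative):

```python
# setup.py -- minimal sketch for building a Cython extension
# ("my_cython_module" is an illustrative name).
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="my_cython_module",
    ext_modules=cythonize("my_cython_module.pyx"),
)
```

You would then build it in place with python setup.py build_ext --inplace.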
This can take some time to get working, in particular across all platforms. If you need a working example, <a href="https://github.com/explosion/spaCy/blob/master/setup.py">spaCy’s install script</a> is a rather comprehensive one.</p><p>Before we move to some NLP, let’s quickly talk about the def, cdef and cpdef keywords, because they are the main things you need to grasp to start using Cython.</p><p>You can use three types of functions in a Cython program:</p><ul><li><strong>Python functions</strong>, which are defined with the usual keyword <strong>def</strong>. They take as input and output <em>Python objects</em>. Internally they can use <strong><em>both</em></strong><em> Python and C/C++ objects</em> and can call <strong><em>both</em></strong> <em>Cython and Python functions</em>.</li><li><strong>Cython functions</strong> defined with the <strong>cdef</strong> keyword. They can take as input, use internally and output <strong><em>both</em></strong> <em>Python and C/C++ objects. </em>These functions are <strong><em>not accessible</em></strong><em> from the Python-space</em> (i.e. the Python interpreter and other pure Python modules that would import your Cython module) but they can be imported by other Cython modules.</li><li><strong>Cython functions</strong> defined with the <strong>cpdef</strong> keyword are like the cdef Cython functions but they are also provided with a Python wrapper so they can be called <em>from the Python-space</em> (with Python objects as inputs and outputs) <strong><em>as well as</em></strong> <em>from other Cython modules</em> (with <em>C/C++ or Python objects</em> as inputs).</li></ul><p>The cdef keyword has another use, which is to type <em>Cython</em> <em>C/C++ objects</em> in the code. Unless you type your objects with this keyword, they will be considered Python objects (and thus slow to access).</p><h3>💫 Using Cython with spaCy to speed up NLP</h3><p>This is all nice and fast but… we are still not doing NLP here! 
No string manipulations, no unicode encodings, none of the subtleties we are lucky to have in Natural Language Processing 🙃.</p><p>And the official Cython documentation even <a href="http://cython.readthedocs.io/en/latest/src/tutorial/strings.html">advises against</a> the use of C level strings:</p><blockquote>Generally speaking: unless you know what you are doing, avoid using C strings where possible and use Python string objects instead.</blockquote><p>So how can we design fast loops in Cython when we work with strings?</p><blockquote>💫 spaCy got us covered.</blockquote><p>The way spaCy tackles this problem is quite smart.</p><h4>Convert all strings to <strong>64-bit </strong>hashes</h4><p>All the unicode strings in spaCy (the text of a token, its lower case text, its lemma form, POS tag label, parse tree dependency label, Named-Entity tags…) are stored in a single data structure called the <a href="https://spacy.io/api/stringstore">StringStore</a> where they are indexed by <strong>64-bit hashes</strong>, i.e. C level <a href="https://www.badprog.com/c-type-what-are-uint8-t-uint16-t-uint32-t-and-uint64-t">uint64_t</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nxvhI7mEc9A75PwMH-PSBg.png" /></figure><p>The <a href="https://spacy.io/api/stringstore">StringStore</a> object implements a lookup between <strong>Python unicode strings</strong> and <strong>64-bit hashes</strong>.</p><p>It is accessible from everywhere in spaCy and from every object (see the figure above), for example as nlp.vocab.strings, doc.vocab.strings or span.doc.vocab.strings.</p><p>When a module needs to perform fast processing on some tokens, it simply uses the C level 64-bit hashes instead of the strings. 
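As a toy pure-Python illustration of this two-way lookup idea (this is not spaCy's actual implementation, which is a C-level structure using a different hash function):

```python
import hashlib

class ToyStringStore:
    """Two-way string <-> 64-bit hash table (a toy sketch only)."""

    def __init__(self):
        self._strings = {}

    def add(self, s):
        # Derive a stable 64-bit key from the string (blake2b as a stand-in).
        h = int.from_bytes(
            hashlib.blake2b(s.encode("utf8"), digest_size=8).digest(), "little"
        )
        self._strings[h] = s
        return h

    def __getitem__(self, h):
        return self._strings[h]

store = ToyStringStore()
run_hash = store.add("run")  # fast processing works with this integer key...
print(store[run_hash])       # ...and looks the string back up only when needed
```
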
A call to the <a href="https://spacy.io/api/stringstore">StringStore</a> lookup table will then give back the Python unicode strings associated with the hashes.</p><p>But spaCy does more than that and also gives us access to fully populated C level structures of the document and vocabulary, which we can use in Cython loops instead of having to build our own structures.</p><h4>SpaCy’s internal data structures</h4><p>The main data structure associated with a spaCy document is the <a href="https://spacy.io/api/cython-classes#section-doc">Doc</a> object which owns the sequence of tokens (“words”) of the processed string and all their annotations in a C level object called <a href="https://spacy.io/api/cython-classes#token_attributes">doc.c</a> which is an array of <strong>TokenC </strong>structures.</p><p>The <a href="https://spacy.io/api/cython-structs#section-tokenc"><strong>TokenC</strong></a><strong> </strong>structure contains all the information we need about each token. This information is stored as<strong> 64-bit hashes</strong> that can be re-associated with unicode strings as we’ve just seen.</p><p>To see exactly what’s in these nice C structures, just have a look at the freshly created <a href="https://spacy.io/api/cython">Cython API doc</a> of spaCy 💫.</p><p>Let’s see that in action on a simple example of NLP processing.</p><h3>🚀Fast NLP Processing with spaCy and Cython</h3><p>Let’s say we have a dataset of text documents we need to analyse.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4a81bd7254df8769bc901df5bfa79877/href">https://medium.com/media/4a81bd7254df8769bc901df5bfa79877/href</a></iframe><p>On the left I wrote a script that builds a list of 10 documents parsed by spaCy, each with ~170k words. We could also have 170k documents with 10 words in each (like a dialog dataset) but that’s slower to create so let’s stick with 10 docs.</p><p>We want to perform some NLP task on this dataset.
For example, we would like to count the number of times the word “<em>run”</em> is used as a noun in the dataset (i.e. tagged with a “<em>NN</em>” Part-Of-Speech tag by spaCy).</p><p>A Python loop to do that is short and straightforward:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d3babdc283b837b18701aa8d8ebd5678/href">https://medium.com/media/d3babdc283b837b18701aa8d8ebd5678/href</a></iframe><p>But it’s also quite slow! On my laptop this code takes about 1.4 seconds to get the answer. If we had a million documents it would take <strong>more than a day</strong> to give us the answer.</p><p>We could use multiprocessing but <a href="https://youtu.be/yJR3qCUB27I?t=19m29s">it’s often not such a great solution in Python</a> because you have to deal with <a href="https://wiki.python.org/moin/GlobalInterpreterLock">the GIL</a> 😕 Also, note that Cython can also <a href="https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html">use multi-threading</a>! And that may actually even be <strong>the best part of Cython</strong> because the GIL is released so we are at full speed 🏎 Cython basically calls OpenMP directly under the hood. I won’t have time to talk about parallelism here so check <a href="https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html">this link</a> for more details.</p><p>Now let’s try to speed up our Python code with spaCy and a bit of Cython.</p><p>First, we have to think about the data structure. We will need a C level array for the dataset, with pointers to each document’s TokenC array.
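</p><p>For reference, the slow pure-Python loop described above boils down to the following (a sketch that abstracts each parsed document as a list of (text, tag) pairs rather than a real spaCy Doc):</p>

```python
# Count how many times "run" is tagged as a noun ("NN").
# Each "document" here is a toy list of (token_text, part_of_speech_tag) pairs.
def count_noun_run(dataset):
    total = 0
    for doc in dataset:
        for text, tag in doc:
            if text == "run" and tag == "NN":
                total += 1
    return total

dataset = [
    [("I", "PRP"), ("run", "VBP"), ("a", "DT"), ("run", "NN")],
    [("the", "DT"), ("run", "NN"), ("was", "VBD"), ("long", "JJ")],
]
assert count_noun_run(dataset) == 2
```

<p>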
We’ll also need to convert the test strings we use (“<em>run</em>” and “<em>NN</em>”) to 64-bit hashes.</p><p>When all the data required for our processing is in C level objects, we can then iterate at full C speed over the dataset.</p><p>Here is how this example can be written in Cython with spaCy:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4b7e6240e3f4fbed885fadbdb80ae010/href">https://medium.com/media/4b7e6240e3f4fbed885fadbdb80ae010/href</a></iframe><p>The code is a bit longer because we have to declare and populate the C structures in main_nlp_fast before calling our Cython function <a href="#a220">[*]</a>.</p><p>But it is also a lot faster! In my Jupyter notebook, this Cython code takes about 20 milliseconds to run which is about<strong> 80 times faster<em> </em></strong>than our pure Python loop.</p><p>The absolute speed is also impressive for a module written in a Jupyter Notebook cell and which can interface natively with other Python modules and functions: scanning ~1.7 million words in 20 ms means we are processing a whopping <strong>80 million words per second</strong>.</p><p>This concludes our quick introduction to using Cython for NLP. I hope you enjoyed it.</p><p>There are a lot of other things to say about Cython but it would get us too far from this simple introduction. The best place to continue is probably the <a href="http://cython.readthedocs.io/en/latest/src/tutorial/index.html">Cython tutorials</a> for a general overview and <a href="https://spacy.io/api/cython">spaCy’s Cython page</a> for NLP.</p><p>Don’t hesitate to give us a few claps 👏 if you want more content like that!</p><ul><li>.
<a href="#c72f">^</a> If you use low level structures several times in your code, a more elegant option than populating the C structures each time is to design your Python code around the low level structures with <a href="http://cython.readthedocs.io/en/latest/src/userguide/extension_types.html">a Cython extension type</a> wrapping the C level structures. This is how most of spaCy is structured and it is a very elegant way to combine fast speed, low memory use and the ease of interfacing with external Python libraries and functions.</li></ul><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC%3Ftypeform-embed%3Doembed%26format%3Djson&amp;display_name=Typeform&amp;url=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC&amp;image=https%3A%2F%2Fimages.typeform.com%2Fimages%2FMpqBWPwLRZ7Z%2Fimage%2Fdefault&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=typeform" width="900" height="600" frameborder="0" scrolling="no"><a href="https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href">https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href</a></iframe><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ee32033bdced" width="1" height="1" alt=""><hr><p><a href="https://medium.com/huggingface/100-times-faster-natural-language-processing-in-python-ee32033bdced">🚀 100 Times Faster Natural Language Processing in Python</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Current Best of Universal Word Embeddings and Sentence Embeddings]]></title>
            <link>https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/ce48ddc8fc3a</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[chatbots]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Mon, 14 May 2018 15:13:53 GMT</pubDate>
            <atom:updated>2020-09-02T14:14:51.264Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4lAYlcSVrYmBzFcWg_2GlQ.png" /><figcaption>John Christian Fjellestad –<a href="https://www.flickr.com/photos/jkfjellestad/25509017594/in/photolist-W1hVoR-ES9odb-CL6HqD-Nc4e4C/">Distant road</a></figcaption></figure><blockquote>A Chinese version of this article can be <a href="https://www.jqr.com/article/000316">found here</a>, thanks to <a href="https://medium.com/u/c2d21d069ac">Jakukyo</a>.</blockquote><p>Word and sentence embeddings have become an essential part of any Deep-Learning-based natural language processing system.</p><p>They encode words and sentences 📜 in fixed-length dense vectors 📐 to drastically improve the processing of textual data.</p><p>A huge trend is the quest for <strong>Universal Embeddings: </strong>embeddings that are pre-trained on a large corpus and can be plugged into a variety of downstream task models (sentiment analysis, classification, translation…) to automatically improve their performance by incorporating some general word/sentence representations learned on the larger dataset.</p><p>It’s a form of <em>transfer learning</em>. Transfer learning has recently been shown to drastically increase the performance of NLP models on important tasks such as text classification. Go check <a href="http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html">the very nice work of Jeremy Howard and Sebastian Ruder</a> (ULMFiT) to see it in action.</p><p>While unsupervised representation learning of sentences had been the norm for quite some time, the last few months have seen a shift toward supervised and multi-task learning schemes with a number of very interesting proposals in late 2017/early 2018.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZZrMm_-SnUhATja5a_z7tg.png" /><figcaption><strong>Recent trend in Universal Word/Sentence Embeddings.
</strong>In this post, we describe the models indicated in black. Reference papers for all indicated models are listed at the end of the post.</figcaption></figure><p>This post is thus a brief primer on the current state-of-the-art in Universal Word and Sentence Embeddings, detailing a few</p><ul><li><strong>strong/fast baselines</strong>: <em>FastText, Bag-of-Words</em></li><li><strong>state-of-the-art models</strong>: <em>ELMo, Skip-Thoughts, Quick-Thoughts, InferSent, MILA/MSR’s General Purpose Sentence Representations &amp; Google’s Universal Sentence Encoder.</em></li></ul><p>If you want some background on what happened before 2017 😀, I recommend the <a href="http://ruder.io/word-embeddings-2017/">nice post on word embeddings</a> that Sebastian wrote last year and <a href="http://ruder.io/word-embeddings-1/">his intro posts</a>.</p><p>Let’s start with word embeddings.</p><h3>Recent Developments in Word Embeddings</h3><p>A wealth of possible ways to embed words has been proposed over the last five years.
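</p><p>Whatever the training scheme, the end product is the same kind of object: a table mapping each word to a fixed-length vector, in which similar words end up close together. A toy illustration with made-up 3-d vectors (real models use a few hundred dimensions):</p>

```python
import math

# Made-up 3-d vectors for illustration; real word2vec/GloVe/FastText
# vectors typically have 100-300 dimensions.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# "cat" should be closer to "dog" than to "car" in embedding space
assert cosine(embeddings["cat"], embeddings["dog"]) > cosine(embeddings["cat"], embeddings["car"])
```

<p>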
The most commonly used models are <a href="https://github.com/dav/word2vec/">word2vec</a> and <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> which are both unsupervised approaches based on the <a href="https://aclweb.org/aclwiki/Distributional_Hypothesis">distributional hypothesis</a> (<em>words that occur in the same contexts tend to have similar meanings</em>).</p><p>While <a href="https://arxiv.org/abs/1805.04032">several works</a> augment these unsupervised approaches by incorporating the supervision of semantic or syntactic knowledge, purely unsupervised approaches have seen interesting developments in 2017–2018, the most notable being <strong>FastText</strong> (an extension of word2vec) and <strong>ELMo</strong> (state-of-the-art contextual word vectors).</p><p><a href="https://github.com/facebookresearch/fastText"><strong>FastText</strong></a> was developed by the team of Tomas Mikolov who proposed the word2vec framework in 2013, triggering the explosion of research on universal word embeddings.</p><p>The main improvement of FastText over the original word2vec vectors is the inclusion of character n-grams, which allows computing word representations for words <em>that did not appear in the training data</em> (“out-of-vocabulary” words).</p><p><a href="https://github.com/facebookresearch/fastText">FastText vectors</a> are super-fast to train and are available in 157 languages trained on Wikipedia and Common Crawl. They are a great baseline.</p><p>The <a href="http://allennlp.org/elmo">Deep Contextualized Word Representations</a> (<strong>ELMo</strong>) have recently improved the state of the art in word embeddings by a noticeable amount.
They were developed by the Allen Institute for AI and will be <a href="https://arxiv.org/abs/1802.05365">presented at NAACL</a> 2018 in early June.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/302/0*RcKHZOLwtAm29QUE.jpg" /><figcaption>Elmo knows quite a lot about word context</figcaption></figure><p>In ELMo, each word is assigned a representation which is a function of the entire sentence to which it belongs. The embeddings are computed from the <em>internal states of a two-layer bidirectional Language Model (LM)</em>, hence the name “ELMo”: <strong><em>E</em></strong><em>mbeddings from </em><strong><em>L</em></strong><em>anguage </em><strong><em>Mo</em></strong><em>dels.</em></p><p>Specificities of ELMo:</p><ul><li><strong>ELMo’s inputs are characters</strong> rather than words. They can thus take advantage of sub-word units to compute meaningful representations even for out-of-vocabulary words (like FastText).</li><li><strong>ELMo representations are concatenations of the activations of several layers of the biLM. </strong>Different layers of a language model encode different kinds of information about a word (e.g. Part-Of-Speech tagging is well predicted by the lower level layers of a biLSTM while word-sense disambiguation is better encoded in higher levels). Concatenating all layers makes it possible to freely combine a variety of word representations for better performance on downstream tasks.</li></ul><p>Now, let’s turn to universal sentence embeddings.</p><h3>The Rise of Universal Sentence Embeddings</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XxOiBMiB_Ac4fylp." /></figure><p>There are currently many competing schemes for learning sentence embeddings.
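</p><p>A toy sketch of the most basic of these schemes, averaging a sentence’s word vectors into a single sentence vector (made-up 2-d vectors for illustration):</p>

```python
# Bag-of-Words sentence embedding: average the word vectors.
# Toy 2-d vectors; real systems plug in pretrained word vectors and
# refinements such as frequency weighting or principal component removal.
word_vectors = {
    "the": [0.1, 0.1],
    "cat": [0.9, 0.2],
    "sleeps": [0.3, 0.8],
}

def sentence_embedding(tokens):
    vecs = [word_vectors[t] for t in tokens]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

emb = sentence_embedding(["the", "cat", "sleeps"])
assert len(emb) == 2                               # same dimensionality as the word vectors
assert abs(emb[0] - (0.1 + 0.9 + 0.3) / 3) < 1e-9  # per-dimension average
```

<p>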
While simple baselines like averaging word embeddings consistently give strong results, a few novel unsupervised and supervised approaches, as well as multi-task learning schemes, have emerged in late 2017-early 2018 and led to interesting improvements.</p><p>Let’s go quickly through the four types of approaches currently studied: from <em>simple word vector averaging baselines </em>to <em>unsupervised/supervised</em> approaches and <em>multi-task learning schemes (as illustrated above)</em>.</p><p>There is a <a href="https://arxiv.org/abs/1804.07461">general</a> <a href="http://arxiv.org/abs/1805.01070">consensus</a> in the field that the simple approach of directly<strong> averaging a sentence’s word vectors</strong> (so-called Bag-of-Word approach) gives a strong baseline for many downstream tasks.</p><blockquote><em>A good algorithm for computing such a baseline is detailed in the work of Arora et al. published last year at ICLR, </em><a href="https://openreview.net/forum?id=SyK00v5xx"><em>A Simple but Tough-to-Beat Baseline for Sentence Embeddings</em></a><em>: </em>use a popular word embedding of your choice, encode a sentence as a linear weighted combination of the word vectors and perform a common component removal (remove the projection of the vectors on their first principal component).
<em>This general method has deep and powerful theoretical motivations that rely on a generative model which uses a random walk on a discourse vector to generate text (we won’t discuss the theoretical details here).</em></blockquote><p>A very recent implementation of a strong Bag-of-Word baseline (even stronger than Arora’s) is the <em>Concatenated p-mean Embeddings</em> from the University of Darmstadt that you will find <a href="https://github.com/UKPLab/arxiv2018-xling-sentence-embeddings">here</a> (thanks <a href="https://medium.com/u/b1ef74fb3b24">Yaser</a> for pointing that work out).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/400/1*zJCFqVxNimR5TZqNLmZxqg.png" /><figcaption>A plot of HuggingFace’s dialogs Bag-of-Words. Bag-of-Words approaches lose word ordering but keep a surprising amount of semantic and syntactic content. Interesting insights in <a href="http://arxiv.org/abs/1805.01070">Conneau et al.</a> ACL 2018.</figcaption></figure><p>Going beyond simple averaging, the first major proposals were using <strong>unsupervised</strong> training objectives, starting with the <a href="https://arxiv.org/abs/1506.06726"><em>Skip-thoughts vectors</em></a> proposed by Jamie Kiros and co-workers in 2015.</p><p><em>Unsupervised</em> schemes learn sentence embeddings as a byproduct of learning to predict a coherent succession of sentences or a coherent succession of clauses inside a sentence. These approaches can (in theory) make use of any text dataset as long as it contains sentences/clauses juxtaposed in a coherent way.</p><p><a href="https://arxiv.org/abs/1506.06726"><strong>Skip-thoughts vectors</strong></a> is the archetypical example of learning unsupervised sentence embeddings. It can be thought of as the equivalent for sentences of the skip-gram model developed for word embeddings: <em>rather than predicting the words surrounding a word, we try to predict the surrounding sentences of a given sentence</em>.
The model consists of an RNN-based encoder-decoder which is trained to reconstruct the surrounding sentences from the current sentence.</p><p>One interesting insight in the Skip-Thought paper was a <em>vocabulary expansion scheme</em>: Kiros et al. handled words not seen during training by learning a linear transformation between their RNN word embedding space and a larger word embedding such as word2vec.</p><p><a href="https://openreview.net/forum?id=rJvJXZb0W"><strong>Quick-thoughts vectors</strong></a><strong> </strong>are a recent development of the Skip-thoughts vectors, presented this year at ICLR. In this work, the task of predicting the next sentence given the previous one is reformulated as a classification task: <em>the decoder is replaced by a classifier which has to choose the next sentence among a set of candidates</em>. It can be interpreted as a discriminative approximation to the generation problem.</p><p>One strength of this model is its speed of training (an order of magnitude faster than the Skip-thoughts model), making it a competitive solution to exploit massive datasets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fU0pNrhPmk6N8QC4ShYAPg.png" /><figcaption>Quick-thoughts classification task. The classifier has to choose the following sentence from a set of sentence embeddings. Source: “An efficient framework for learning sentence representations” by Logeswaran et al.</figcaption></figure><p>For a long time, <strong>supervised</strong> learning of sentence embeddings was thought to give lower-quality embeddings than unsupervised approaches but this assumption has recently been overturned, in part following the publication of the <a href="https://arxiv.org/abs/1705.02364"><em>InferSent</em></a> results.</p><p>Unlike the <em>unsupervised </em>approaches detailed before, <em>supervised</em> learning requires a labelled dataset annotated for some task like Natural Language Inference (e.g.
with pairs of entailed sentences) or Machine Translation (with pairs of translated sentences) which poses the question of the specific task to choose and the related question of the size of the dataset required for good quality embeddings. We talk more about these questions in the next and last section on Multi-task learning but before that, let’s see what’s behind the InferSent breakthrough that was published in 2017.</p><p><a href="https://arxiv.org/abs/1705.02364"><strong>InferSent</strong></a><strong> </strong>is an interesting approach thanks to the simplicity of its architecture. It uses the <em>Stanford Natural Language Inference (SNLI)</em> <em>Corpus</em> (a set of 570k pairs of sentences labelled with 3 categories: neutral, contradiction and entailment) to train a classifier on top of a sentence encoder. Both sentences are encoded using the same encoder while the classifier is trained on a pair representation constructed from the two sentence embeddings. Conneau et al. adopt a bi-directional LSTM completed with a max-pooling operator as sentence encoder.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/448/1*jwfxGxmOn42VMZMT6TOaKA.png" /><figcaption>A supervised sentence embeddings model (InferSent) to learn from an NLI dataset. Source: “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data” by A.
Conneau et al.</figcaption></figure><p>The success of InferSent poses the following question in addition to the usual quest for selecting the best neural net model:</p><blockquote><em>Which supervised training task would learn sentence embeddings that better generalize on downstream tasks?</em></blockquote><p><strong>Multi-task learning </strong>can be seen as a generalization of Skip-Thoughts, InferSent, and the related unsupervised/supervised learning schemes, that answers this question by trying to combine several training objectives in one training scheme.</p><p>Several recent proposals for multi-task learning were published in early 2018. Let’s quickly go through MILA/MSR’s <strong>General Purpose Sentence Representation </strong>and Google’s <strong>Universal Sentence Encoder.</strong></p><p>In the paper describing MILA &amp; Microsoft Montreal’s work and presented at ICLR 2018 (<a href="https://arxiv.org/abs/1804.00079">Learning General Purpose Distributed Sentence Representation via Large Scale Multi-Task Learning</a>), Subramanian et al. observe that to be able to generalize over a wide range of diverse tasks, it is necessary to encode multiple aspects of the same sentence.</p><p>The authors thus leverage a <em>one-to-many multi-tasking learning framework</em> to learn a universal sentence embedding by switching between several tasks. The 6 tasks chosen (Skip-thoughts prediction of the next/previous sentence, neural machine translation, constituency parsing and natural language inference) share the same sentence embedding obtained by a bi-directional GRU.
Experiments suggest that syntactic properties are better learned when adding a multi-language neural machine translation task, length and word order are learned with a parsing task and training on natural language inference encodes syntax information.</p><p><strong>Google’s Universal Sentence Encoder, </strong><a href="https://arxiv.org/abs/1803.11175">published in early 2018</a>, follows the same approach. Their encoder uses a transformer-network that is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. A <a href="https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/2">pre-trained version</a> has been made available for TensorFlow.</p><p>This concludes our short summary on the current state of Universal Word and Sentence Embeddings.</p><p>The domain has seen a lot of interesting developments in the last few months together with great progress in the ways we assess and probe the performance of these embeddings and their inherent bias/fairness (a real issue when you talk about Universal Embeddings). We didn’t have time to talk about these latest topics but you can find a few links in the references.</p><p>I hope you enjoyed this brief!</p><p>Clap 👏 a couple of times if you liked it and want us to post more of these!</p><p><strong>Some references</strong></p><ul><li>Very recently, C. Perone and co-workers published a nice and extensive comparison between ELMo, InferSent, Google Universal Sentence Encoder, p-mean, Skip-thought, etc. Here is a link to the paper: <a href="https://arxiv.org/abs/1806.06259">https://arxiv.org/abs/1806.06259</a></li><li>A nice resource on traditional word embeddings like word2vec, GloVe and their supervised learning augmentations is the <a href="https://github.com/Hironsan/awesome-embedding-models">github repository of Hironsan</a>.
More recent developments are <a href="https://fasttext.cc/">FastText</a> and <a href="http://allennlp.org/elmo">ELMo</a>.</li><li>Sentence embeddings papers: <a href="https://arxiv.org/abs/1506.06726"><em>Skip-Thoughts</em></a><em>, </em><a href="https://openreview.net/forum?id=rJvJXZb0W"><em>Quick-Thoughts</em></a><em>, </em><a href="https://arxiv.org/abs/1705.00557"><em>DiscSent</em></a><em>, </em><a href="https://arxiv.org/abs/1705.02364"><em>InferSent</em></a><em>, </em><a href="https://arxiv.org/abs/1804.00079"><em>MILA/MSR’s General Purpose Sentence Representations</em></a><em>, </em><a href="https://arxiv.org/abs/1803.11175"><em>Google’s Universal Sentence Encoder</em></a><em> &amp; </em><a href="http://arxiv.org/abs/1804.07754"><em>Google Input-Output Sentence learning on dialog</em></a><em>.</em></li><li>If you’re interested in the way we evaluate sentence embeddings, you should definitely check the recent work of Facebook on <a href="https://github.com/facebookresearch/SentEval">SentEval</a> and its <a href="https://github.com/facebookresearch/SentEval/tree/master/data/probing">probing tasks</a> as well as the recently published <a href="https://gluebenchmark.com/">GLUE benchmark</a> by NYU, UW and DeepMind researchers.</li></ul><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC%3Ftypeform-embed%3Doembed%26format%3Djson&amp;display_name=Typeform&amp;url=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC&amp;image=https%3A%2F%2Fimages.typeform.com%2Fimages%2FMpqBWPwLRZ7Z%2Fimage%2Fdefault&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=typeform" width="900" height="600" frameborder="0" scrolling="no"><a href="https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href">https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href</a></iframe><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ce48ddc8fc3a" width="1" height="1"
alt=""><hr><p><a href="https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a">📚The Current Best of Universal Word Embeddings and Sentence Embeddings</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ From zero to research — An introduction to Meta-learning]]></title>
            <link>https://medium.com/huggingface/from-zero-to-research-an-introduction-to-meta-learning-8e16e677f78a?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/8e16e677f78a</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[meta-learning]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Tue, 03 Apr 2018 15:29:12 GMT</pubDate>
            <atom:updated>2020-09-02T14:16:11.229Z</atom:updated>
            <content:encoded><![CDATA[<h3>🐣 From zero to research — An introduction to Meta-learning</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1004/1*d7V-bAzElJ2XjEO-T4_TnQ.gif" /></figure><p>Meta-learning is an exciting trend of research in the machine-learning community which tackles the problem of <em>learning to learn</em>.</p><p>The traditional paradigm in machine learning research is to get a huge dataset on a specific task, and train a model from scratch using this dataset. Obviously that’s very far from how humans leverage past experiences to quickly learn a new task from only a handful of examples.</p><p>That’s because humans <em>learn to learn </em><a href="#9c51">[1]</a>.</p><p>Over the last few months, I have been playing and experimenting quite a lot with meta-learning models for Natural Language Processing and will be presenting <a href="https://arxiv.org/abs/1803.10631">some of this work</a> at <a href="https://iclr.cc/">ICLR, next month in Vancouver</a> 🇨🇦 — come say hi! 👋</p><p>In this post, I will start by explaining what meta-learning is in a very visual and intuitive way. Then, we will code a meta-learning model in PyTorch and I will share some of the lessons learned on this project.</p><h3>What’s learning in the first place?</h3><p>Let’s have a quick look at what happens when we train a simple neural net to classify images of dogs and cats. Let’s say we have a single training image of a cat together with a label indicating that this image represents a cat <a href="#3593">[2]</a>. I made a quick animation of a training step to save us a few thousand sentences.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T5ppr8fb0chwz0wC7oYPDA.gif" /><figcaption>Single step of the training process of a neural network. The neural net is trained to classify an image as representing a dog or a cat</figcaption></figure><p>The <em>backward pass</em> (“backprop”) is a key step when we train a neural net.
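</p><p>The whole step (forward pass, loss, backward pass, optimizer update) can be condensed into a one-parameter toy model, with the gradient written by hand instead of computed by a framework’s autograd:</p>

```python
# One-parameter toy "neural net": model(x) = w * x, squared loss.
# The gradient is hand-derived here; frameworks like PyTorch compute it for you.
def training_step(w, x, target, lr=0.1):
    pred = w * x                       # forward pass
    loss = (pred - target) ** 2        # loss function (value shown for clarity)
    grad_w = 2 * (pred - target) * x   # backward pass: d(loss)/d(w)
    return w - lr * grad_w             # optimizer step (plain SGD)

w = 0.0
for _ in range(50):                    # repeat the step until w converges
    w = training_step(w, x=1.0, target=3.0)
assert abs(w - 3.0) < 1e-3             # the toy net has learned w ≈ 3
```

<p>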
Since the computations performed by the neural network and the loss are differentiable functions <a href="#5574">[3]</a>, we can compute the gradient that should be applied to each parameter of the neural net to reduce the difference between the label currently predicted by the neural net and the real/target label (this difference is measured by the loss function). After the backpropagation comes the <em>optimizer</em> which computes updated parameters for the model. This is where training a neural net becomes more of an art than a science as there are so many possible optimizers and optimization settings (hyper-parameters).</p><p>Let’s represent our single training step in a more compact way.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/378/1*xEnBZTYJXoTMO3gxfYt4bA.png" /></figure><p>The training image is now a 🐈 and the label indicating that the picture represents a cat is a 🔺. Large △s are our neural net with ■ parameters and gradients. The <em>loss function </em>is the L-box and the <em>optimizer </em>the O-box.</p><p>The learning process then simply consists of repeatedly applying the optimization step until we converge to good parameters for our neural net.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1004/1*d7V-bAzElJ2XjEO-T4_TnQ.gif" /><figcaption>3 steps of a neural net training process where the neural net (large △s) is trained to classify dogs/cats images.</figcaption></figure><h3>Let’s turn to meta-learning</h3><p>The idea of meta-learning is to <em>learn the learning process.</em></p><p>There are several ways to implement meta-learning <a href="#0f06">[4]</a> but the two I want to describe here are about learning a learning process that resembles the one we’ve just seen.</p><p>In our training process, there are two things in particular we can learn:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/272/1*Np91zhHmTqZ4d4j3rwaNvQ.png" /></figure><ul><li>the <strong>initial parameters </strong>of the neural
net (blue ■) and</li><li>the <strong>parameters</strong> <strong>of the optimizer </strong>(pink ★)<strong>.</strong></li></ul><p>I will describe a combination of the two cases but each case is also very interesting on its own and can lead to simplifications, speedups and sound theoretical results <a href="#dfe6">[5]</a>.</p><p>So now, we have two modules to train:</p><ul><li>What I will call the<strong> model (M) </strong>which is our previous neural net. It can now be seen as a low-level<em> </em>network. It is sometimes called an <a href="https://arxiv.org/abs/1606.04474"><em>optimizee</em></a> or a <a href="https://openreview.net/forum?id=rJY0-Kcll"><em>learner</em></a>. The weights of the <strong>model</strong> are the ■ on the drawings.</li><li>The<strong> optimizer (O)</strong> or <strong>meta-learner </strong>is a <em>higher-level</em> <em>model</em> which updates the weights of the lower-level network (the model). The weights of the <strong>optimizer</strong> are the ★ on the drawings.</li></ul><h4>How do we learn these meta-parameters?</h4><p>Well it turns out we can back-propagate a meta-loss gradient along the training process itself, back to the initial weights of the model and/or to the parameters of the optimizer <a href="#5d3c">[6]</a>.</p><p>We now have two, nested, training processes: the <em>meta-training process</em> of the optimizer/meta-learner in which the <em>(meta-)forward pass</em> includes <em>several</em> <em>training steps </em>of the model (with forward, backward and optimization steps as we saw previously).</p><p>Let’s take a look at the meta-training step:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AcaPiikZErVv_iFJzWekQg.gif" /><figcaption>A meta-training step (training the optimizer O) comprising 3 steps of training the model (M)</figcaption></figure><p>Here, a single step of the <em>meta-training process</em> is represented horizontally.
It includes two steps of the <em>training process</em> of the model (vertically, in the meta-forward and meta-backward boxes). The <em>training </em>process of the model is exactly the same training process that we’ve just seen.</p><p>As we can see, the input of the <em>meta-forward pass</em> is a list of examples/labels (or a list of batches) that are used successively during the model training pass.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/286/1*AOU2sGOniJvJTY8CrWoF8Q.png" /><figcaption>The input of a meta-training step is a list of examples (🐈, 🐕) with associated labels (🔺,🔻)</figcaption></figure><p>Now what <em>meta-loss</em> can we use to train the meta-learner? In the case of model training, we could simply compare the model prediction to the target label to get an error signal.</p><blockquote>In the case of the meta-learner, we would like a meta-loss that is indicative of how well the meta-learner is performing its task: training the model.</blockquote><p>One possibility is then to compute the loss of the model on some training data: the lower the loss, the better the training was. We can compute a meta-loss at the end or even just combine the losses of the model that we already compute during the training (e.g. by summing them).</p><p>We also need a <em>meta-optimizer</em> to update the weights of the optimizer. Here it starts to get very meta as we could use another meta-learner to optimize the meta-learner and so on, but in the end we will need a hand-defined optimizer like SGD or ADAM (it can’t be <a href="https://en.wikipedia.org/wiki/Turtles_all_the_way_down">turtles all the way down</a>).</p><p>There are a few important remarks regarding the implementation that we may as well discuss now:</p><ul><li><strong>Second-order derivatives</strong>: back propagating the meta-loss through the model’s gradients involves computing derivatives of derivatives, i.e.
second derivatives (when the green ▲ passes through the green ■ on the meta-backward pass of our last animation). We can compute that in modern frameworks like TensorFlow or PyTorch but in practice we often drop the second derivatives and only back-propagate through the model weights (the yellow ■ of the meta-backward pass) to reduce the complexity.</li><li><strong>Coordinate sharing</strong>: a recent deep-learning model can have a very large number of parameters (easily around 30–200 million in NLP). With current GPU memory, it is not possible to have such a number of parameters as separate inputs to the optimizer. What we often do is called <em>coordinate-sharing </em><a href="#43f0">[7]</a>: we design the optimizer for a single parameter of the model and duplicate it for all parameters (i.e. share its weights along the input dimension associated with the model parameters). This way the number of parameters of the meta-learner is not a function of the number of parameters of the model. When the meta-learner is a network with a memory like an RNN, we can still keep a separate hidden state for each model parameter to keep separate memories of the evolution of each model parameter.</li></ul><h3>Meta-learning in PyTorch 🔥</h3><p>Let’s try some code to see how this looks in practice.</p><p>So we have a model with a set of weights that we want to train and use for two tasks:</p><ul><li>during the <strong>meta-forward pass</strong>: we use our model to compute <em>gradients</em> (from the loss) that are fed as inputs to the optimizer to update the model parameters, and</li><li>during the <strong>meta-backward pass</strong>: we use our model as a <em>path for back propagating</em> the gradients of the optimizer’s parameters (computed from the meta-loss).</li></ul><p>The easiest way to do that in PyTorch is to have two duplicate modules that represent the model, one for each task.
Let’s call <strong>forward model</strong> the module responsible for storing the model gradients used during the meta-forward pass and <strong>backward model</strong> the module responsible for keeping parameters as a continuous path for back propagating the optimizer gradients during the meta-backward pass.</p><p>The two modules will share their Tensors to avoid duplicating memory (tensors are the real meat in memory) but will keep separate Variables to cleanly separate the gradients of the model and the gradients used for the meta-learner.</p><h4>A simple meta-learner class in PyTorch</h4><p>Sharing Tensors in PyTorch is rather straightforward: we just need to update the pointers in the <em>Variable</em> class to point to the same <em>Tensors</em>. One difficulty comes when our model is already a memory-optimized model like an <a href="https://github.com/salesforce/awd-lstm-lm">AWD-LSTM or AWD-QRNN model</a> with shared Tensors (input and output embeddings). Then we need to be careful to keep the right pointers when we update the model parameters of the two modules.</p><p>One way to do that is to write a simple helper that handles the task of looping through the parameters, sending back all the information needed to update the Parameter pointers (and not only the Tensors), and keeping shared parameters synced. Here is such a function:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2c142644ce3d6360ae6a8cbb377037e6/href">https://medium.com/media/2c142644ce3d6360ae6a8cbb377037e6/href</a></iframe><p>Using this function, we can plug in any model and loop over the model parameters in our meta-learner in a clean way <a href="#039f">[8]</a>.</p><p>Now let’s draft a simple meta-learner class.
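As a warm-up, here is a toy, self-contained version of the idea (all names are illustrative and the meta-learner is deliberately tiny — the real class is in the gist below). It shows the two points that matter: coordinate sharing (one small network applied to every parameter coordinate), and producing updates as a differentiable function of the gradients so a meta-loss can backpropagate to the meta-learner’s weights:

```python
import torch
import torch.nn as nn

class ToyMetaLearner(nn.Module):
    """Coordinate-shared optimizer: one 1-in/1-out linear net applied to
    every parameter coordinate (its weight is the ★ of the drawings)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(1, 1, bias=False)  # maps a gradient to an update

    def forward(self, params, grads):
        # Returns *new* parameter tensors; keeping them in the autograd
        # graph is what lets the meta-loss reach self.net's weights.
        return [p - self.net(g.reshape(-1, 1)).reshape(g.shape)
                for p, g in zip(params, grads)]

# One meta-forward step on a toy quadratic "model" with parameters w:
w = torch.randn(3, requires_grad=True)
meta_learner = ToyMetaLearner()
loss = (w ** 2).sum()
(grad,) = torch.autograd.grad(loss, w, create_graph=True)
(new_w,) = meta_learner([w], [grad])
meta_loss = (new_w ** 2).sum()  # how well did the update work?
meta_loss.backward()            # gradients flow to the meta-learner's ★
```

Because the same tiny net is broadcast over every coordinate, the number of meta-parameters stays constant no matter how big the model is.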
Our optimizer is a module that takes as inputs, during the forward pass, the <strong>forward model</strong> (with gradients) and the <strong>backward model</strong>, and loops over their parameters to update the backward model parameters in a way that allows meta-gradients to back-propagate (by updating Parameter pointers and not only Tensors).</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/32e7b3f283337dc35ca1777512dee961/href">https://medium.com/media/32e7b3f283337dc35ca1777512dee961/href</a></iframe><p>We can now train this optimizer as we saw in the first part. Here is a simple gist that illustrates the meta-training process that we have been describing:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c605cc20e0b5d0aacfd6b990601bb400/href">https://medium.com/media/c605cc20e0b5d0aacfd6b990601bb400/href</a></iframe><h4>Avoid memory blow-up — Hidden State Memorization</h4><p>Sometimes we want to learn an optimizer that can operate on very large models with several tens of millions of parameters and at the same time we would like to unroll the meta-training over a large number of steps to get good-quality gradients <a href="#2373">[9]</a> like we did in <a href="https://arxiv.org/abs/1803.10631">our work</a>.</p><p>In practice, it means we want to include a long training process during the meta-forward pass, with many time-steps, and we’ll have to keep in memory the parameters (yellow ■) and gradients (green ■) data for each step that are used for the meta-backward pass.</p><p>How can we do that without blowing up our GPU’s memory?</p><p>One way is to trade some memory for computation by using <em>gradient checkpointing</em>, also called <em>hidden state memorization </em><a href="#19da">[10]</a>.
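In recent PyTorch releases this trade is available natively as torch.utils.checkpoint. Here is a generic sketch of the memory/compute trade on a small stand-in module (an illustration, not the meta-training code):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A segment whose intermediate activations we don't want to keep in memory.
segment = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
)

x = torch.randn(8, 64, requires_grad=True)
# Activations inside `segment` are freed after the forward pass and
# recomputed during the backward pass: memory is traded for computation.
y = checkpoint(segment, x)
y.sum().backward()
```

Only the segment boundaries are kept; everything inside a segment is recomputed when the backward pass needs it.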
In our case gradient checkpointing consists of slicing the meta-forward and meta-backward passes into segments that we compute successively.</p><p>A good introduction to gradient checkpointing is given in the nice blog post of <a href="https://medium.com/u/5511064b4364">Yaroslav Bulatov</a> of OpenAI. If you are interested in this, you should go and check it out:</p><p><a href="https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9">Fitting larger networks into memory.</a></p><p>This post is already quite long so I won’t include a full gist of gradient checkpointing code. Instead, I’ll point you to the nice PyTorch <a href="https://github.com/tshadley/examples/tree/master/word_language_model_bptt_hsm">implementation</a> of TSHadley and the <a href="https://github.com/pytorch/pytorch/pull/4594">current active work</a> to include gradient checkpointing natively in PyTorch.</p><h4>Other approaches in Meta-learning 🐣</h4><p>There are two other trends of research in meta-learning that I haven’t had time to cover but which are also very promising. I’ll just give you a few pointers so you can go check them out for yourself now that you know the general idea:</p><ul><li><strong>Recurrent networks</strong>: We have built upon the standard training process of neural nets. An alternative is to consider the succession of tasks as a sequential series of inputs and build a recurrent model that can ingest and build a representation of this sequence for a new task. In this case we typically have a single training process with a recurrent network with memory or attention. This approach also gives good results, in particular when the embeddings are adequately designed for the task.
A good example is the recent <a href="https://openreview.net/forum?id=B1DmUzWAW">SNAIL paper</a>.</li><li><strong>Reinforcement learning</strong>: The computation made by the optimizer during the meta-forward pass is very similar to the computation of a recurrent network: repeatedly applying the same parameters to a sequence of inputs (the succession of weights and gradients of the model during learning). In practice this means we run into the usual issues of recurrent nets: the models have trouble returning to a safe path when they make errors as they are not trained to recover from training errors, and the models have difficulties generalizing to longer sequences than the ones used during the meta-training. To tackle these issues, one can turn to <a href="http://bair.berkeley.edu/blog/2017/09/12/learning-to-optimize-with-rl/">reinforcement learning approaches</a> where the model learns an action policy associated with the current state of training.</li></ul><h3>Meta-learning in Natural Language Processing 🗣</h3><p>There is an interesting parallel between meta-learning and neural net models used in Natural Language Processing (NLP) like recurrent neural networks (RNN) that we have just started mentioning in the previous paragraph:</p><blockquote>A meta-learner optimizing a neural net model behaves similarly to a recurrent neural network.</blockquote><p>Like an RNN, the meta-learner ingests a series of parameters and gradients of the model during training, as an input sequence, and computes a sequential output (the series of updated model parameters) from this input sequence.</p><p>We develop this analogy in <a href="https://arxiv.org/abs/1803.10631">our paper</a> and study how a meta-learner can be used to implement a medium-term memory in a neural net language model: the meta-learner learns to encode a medium-term memory in the weights of a standard RNN like an LSTM (in addition to the way short-term memories are conventionally encoded in the hidden state of the
LSTM).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/1*sYAdqp2x3sWcq4QIzpn2sA.png" /></figure><p>Our <em>meta-learning language model</em> has a hierarchy of memories with 3 levels, from bottom to top: a standard LSTM, a meta-learner updating the weights of the LSTM to store medium-term memories, and a long-term static memory.</p><p>We discovered that the meta-learning language model could be trained to encode a memory of recent inputs, like the beginning of a Wikipedia article, that will be useful to predict the end of an article.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jsFquG2MF69elbtdAoJVQg.png" /><figcaption>The curves indicate how good the model is at predicting the words of a Wikipedia article given the beginning (A, …, H are successive Wikipedia articles), colored words indicate the same for single words, blue is better, red is worse. As the model reads through an article, it learns from the beginning and becomes better at predicting the end (for more details see <a href="https://arxiv.org/abs/1803.10631">our paper</a>).</figcaption></figure><p>Well I guess now you are ready to have a look at <a href="https://arxiv.org/abs/1803.10631">our paper</a> for more details on this story.</p><p>This concludes my introduction to Meta-Learning. Congratulations on reaching the end of this long post!</p><p>I hope you liked it!</p><p>Don’t forget to give us a few claps 👏 if you want more content like that!</p><ol><li><a href="#afeb">^</a> As such, meta-learning can be seen as a generalization of “transfer learning” and is related to the techniques for fine-tuning a model on a task as well as techniques for hyper-parameter optimization.
There was an interesting <a href="https://nips.cc/Conferences/2017/Schedule?showEvent=8767">workshop on meta-learning</a> at NIPS 2017 last December.</li><li><a href="#dc5a">^</a> Of course, in real training we would use a mini-batch of examples.</li><li><a href="#e0bb">^</a> More precisely: “most of” these operations are differentiable.</li><li><a href="#d640">^</a> Good blog posts introducing the relevant literature are the BAIR posts: <a href="http://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/">Learning to learn</a> by Chelsea Finn and <a href="http://bair.berkeley.edu/blog/2017/09/12/learning-to-optimize-with-rl/">Learning to Optimize with Reinforcement Learning</a> by Ke Li.</li><li><a href="#930c">^</a> Good examples of <em>learning the model initial parameters</em> are <a href="https://arxiv.org/abs/1703.03400">Model-Agnostic Meta-Learning</a> of UC Berkeley and its <a href="https://openreview.net/forum?id=BJ_UL-k0b">recent developments</a> as well as the <a href="https://blog.openai.com/reptile/">Reptile algorithm</a> of OpenAI. A good example of <em>learning the optimizer’s parameters</em> is the <a href="https://arxiv.org/abs/1606.04474">Learning to learn by gradient descent by gradient descent</a> paper of DeepMind. A paper combining the two is the work <a href="https://openreview.net/forum?id=rJY0-Kcll">Optimization as a Model for Few-Shot Learning</a> by Sachin Ravi and Hugo Larochelle. A nice and very recent overview can be found in <a href="https://arxiv.org/abs/1804.00222">Learning Unsupervised Learning Rules</a>.</li><li><a href="#d094">^</a> Similarly to the way we back-propagate through time in an unrolled recurrent network.</li><li><a href="#725d">^</a> Initially described in DeepMind’s <a href="https://arxiv.org/abs/1606.04474">Learning to learn by gradient descent by gradient descent</a> paper.</li><li><a href="#4e23">^</a> We are using coordinate-sharing in our meta-learner as mentioned earlier.
In practice, it means we simply iterate over the model parameters and apply our optimizer broadcast over each parameter (no need to flatten and gather parameters like in L-BFGS for instance).</li><li><a href="#d029">^</a> There is a surprising understatement of how important back-propagating over very long sequences can be to get good results. The recent paper <a href="https://arxiv.org/abs/1803.08240">An Analysis of Neural Language Modeling at Multiple Scales</a> from Salesforce research is a good pointer in that direction.</li><li><a href="#6c6f">^</a> Gradient checkpointing is described for example in <a href="https://arxiv.org/abs/1606.03401">Memory-Efficient Backpropagation Through Time</a> and the nice <a href="https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9">blog post</a> of Yaroslav Bulatov.</li></ol><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC%3Ftypeform-embed%3Doembed%26format%3Djson&amp;display_name=Typeform&amp;url=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC&amp;image=https%3A%2F%2Fimages.typeform.com%2Fimages%2FMpqBWPwLRZ7Z%2Fimage%2Fdefault&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=typeform" width="900" height="600" frameborder="0" scrolling="no"><a href="https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href">https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href</a></iframe><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8e16e677f78a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/huggingface/from-zero-to-research-an-introduction-to-meta-learning-8e16e677f78a">🐣 From zero to research — An introduction to Meta-learning</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[✨How to train a neural coreference model— Neuralcoref 2]]></title>
            <link>https://medium.com/huggingface/how-to-train-a-neural-coreference-model-neuralcoref-2-7bb30c1abdfe?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/7bb30c1abdfe</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[chatbots]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Fri, 23 Mar 2018 11:40:04 GMT</pubDate>
            <atom:updated>2018-06-16T23:01:03.146Z</atom:updated>
            <content:encoded><![CDATA[<blockquote><strong>Links: </strong><a href="https://huggingface.co/coref/">Online demo</a> <strong>Github repo</strong>: <a href="https://github.com/huggingface/neuralcoref">https://github.com/huggingface/neuralcoref</a> and our <a href="https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30">previous Medium post</a>.</blockquote><p>The last months have been quite intense at HuggingFace 🤗 with crazy usage growth 🚀 and everybody hard at work to keep up with it 🏇, but we finally managed to free some time and update our open-source library ✨<a href="https://github.com/huggingface/neuralcoref">Neuralcoref</a> while publishing the training code at the same time.</p><p>Since we launched v1 last summer, more than ten million 💯 coreferences have been resolved on Hugging Face. Also, we are stoked that our library is now used in production by a few other companies and some really smart researchers, and our work was featured in the latest session of <a href="http://web.stanford.edu/class/cs224n/syllabus.html">Stanford’s NLP course</a>! 💪</p><p>The training code has been updated to work with the latest releases of both <a href="http://pytorch.org/">PyTorch (v0.3)</a> and <a href="https://spacy.io/usage/">spaCy v2.0</a> while the pre-trained model only depends on Numpy and spaCy v2.0.</p><blockquote>This release’s major milestone: You will now be able to train ✨ Neuralcoref on your own dataset — e.g., another language than English! 
— provided you have an annotated dataset.</blockquote><p>We have added a <a href="https://github.com/huggingface/neuralcoref/blob/master/readme.md#re-train-the-model--extend-to-another-language">special section</a> to the readme about training on another language, as well as <a href="https://github.com/huggingface/neuralcoref/blob/master/training.md">detailed instructions</a> on how to get, process and train the model on the English OntoNotes 5.0 dataset.</p><p>As before, ✨Neuralcoref is designed to strike a good balance between accuracy and speed/simplicity, using a rule-based mention detection module, a constrained number of features and a simple feed-forward neural network that can be implemented easily in Numpy.</p><p>In the rest of this blog post, I will describe how the coreference resolution system works and how to train it. Coreference resolution is a rather complicated NLP task 🐉 so bear with me; you won’t regret it!</p><h3>Let’s have a quick look at a (public) dataset 📚</h3><p>A good-quality public dataset you can use to train the model on English is the CoNLL 2012 dataset. It is one of the largest freely available datasets with coreference annotations, with more than 1.5M tokens spanning many fields like newswire, broadcast and telephone conversations as well as web data (blogs, newsgroups …).</p><p>In the repo we explain <a href="https://github.com/huggingface/neuralcoref/blob/master/training.md#get-the-data">how to download and prepare this dataset</a> if you want to use it.
Once you are done with that, a typical CoNLL file will look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1017/1*kGyDWKCYp91HKSEaMWtIMA.png" /><figcaption>Extract of CoNLL 2012 dataset file “cctv_0005.v4_gold_conll”</figcaption></figure><p>This extract contains 2 sentences: “<em>Yes, I noticed that many friends, around me received it</em>” and “<em>It seems that almost everyone received this SMS</em>”.</p><p>The sentences are tokenized and annotated with the tokens in column 4 and a large number of annotations: POS tags (col 5), parse tree (col 6), verb lemmas (col 7), speaker (col 10) and, what we are especially interested in, <strong>co-reference labels</strong> in the last column (labels 12, 119 on lines 5, 12, 14, 23 &amp; 24). In this extract, “<em>I</em>” is annotated as co-referring with “<em>me</em>” (they have the same entity label 12) and “<em>it</em>” as co-referring with “<em>this SMS</em>” (label 119).</p><p>You can also notice that <em>only the mentions that have at least one co-reference are labelled in the dataset </em>(i.e. at least a pair of mentions referring to the same entity)<em>.</em> Single mentions of an entity, with no other mention referring to the same entity, are not labelled.</p><p>This is somewhat annoying as it means we cannot fully evaluate (and easily train) the mention identification module of our coreference system (with precision, recall and F1 metrics). However, we can still have a look at the recall of co-referring mentions<em> </em><a href="https://github.com/huggingface/neuralcoref/blob/master/neuralcoref/train/training.md#some-details-on-the-training">as we mention in the github repo</a>.</p><h3>The coreference module workflow 🌊</h3><p>The first step of our process is to extract potential mentions.
✨Neuralcoref uses a <em>rule-based mention-extraction </em>function for this operation and gets, in our two-sentence example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2NGtO-DQ2DPTQ5rD8-Zp9A.png" /></figure><p>Depending on the selectivity of the rule-based mention extractor and the parse tree for our sentence, it may also capture a few bigger mentions like “<em>many friends, around me received it</em>” or “<em>almost everyone received this SMS</em>”. Let’s keep only the short mentions here for the sake of simplicity.</p><p>Each mention can co-refer with a varying number of previous mentions. We can gather all the mentions in a mention-pairs table to highlight the co-referring pairs (in the table <strong>∅</strong> means that a mention doesn’t corefer with any previous mention).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-jpy11OAViGz2aYZais3Pg.png" /><figcaption>A table of coreferring mention-pairs for our two-sentence example (positive labels in red)</figcaption></figure><p>Note that there may be more than one co-referring antecedent for a given mention (i.e.
several red boxes on a single line of our table), forming clusters of co-referring mentions (co-reference resolution is a clustering task).</p><p>We can already see some of the issues that will arise when training a model on such data, namely that (i) each mention will have a varying number of potential antecedents, complicating the batching (the size of our mention-pairs vector will span the whole range from 0 to <em>N</em>, the total number of mentions in a document), and (ii) the table of mention-pairs will typically scale as <em>cN</em> where <em>c</em> is the average number of mentions in each document of the dataset, and this can become quite large.</p><blockquote>In practice, our rule-based mention-extractor identifies about 1 million potential mentions on the CoNLL 2012 training set resulting in about 130 million mention-pairs to train our model on.</blockquote><p>Once we have identified potential mentions and labels for them (the red boxes in our table), we can extract a set of features for each mention and each pair of mentions. Let’s see the features we extract:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*THl-UXeSLePo7bgE_HYkAg.png" /><figcaption>Extracted features for mentions and pairs of mentions. Span vectors are pre-computed averages of word vectors.</figcaption></figure><p>It may seem like we need a lot of features but one of the advantages of ✨Neuralcoref is actually its reduced number of features — some coreference resolution systems use more than 120 features! Another advantage is that most of these features don’t depend on the parser or additional databases (like word gender/number) and they are easy/fast to compute.</p><p>Practically, the features are a bunch of <em>real-valued vectors</em> (e.g. the span vectors, which are averages of word vectors and won’t be trained), <em>integers </em>(e.g. word indices in a dictionary, categorical indices), and <em>booleans</em> (e.g.
“nested?” indicates whether one mention of a pair is contained in the other).</p><p>The mix of real-valued, integer and boolean features can give rise to large numpy arrays if we simply gather them in a single array (integers and booleans would be converted to floats). So we store them in separate arrays and build the feature arrays on the fly as we feed the neural net (see the DataLoader code in <a href="https://github.com/huggingface/neuralcoref/blob/master/neuralcoref/dataset.py">dataset.py</a>).</p><blockquote>We are done with the pre-processing steps. These steps are implemented in <a href="https://github.com/huggingface/neuralcoref/blob/master/neuralcoref/conllparser.py">conllparser.py</a> and <a href="https://github.com/huggingface/neuralcoref/blob/master/neuralcoref/document.py">document.py</a> in the ✨Neuralcoref code.</blockquote><p>Now, let’s use these features to train our model!</p><h3>A quick look at the neural net model 🔮</h3><p>As always, the neural net model is a pleasure to write in pyTorch so I copy it here in full (I just removed the weight initialization/loading functions).</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/74cdda70ddc8cc1b83b5590a280f9b44/href">https://medium.com/media/74cdda70ddc8cc1b83b5590a280f9b44/href</a></iframe><p>The model comprises a common embedding layer, <em>self.embed, </em>that<em> </em>transforms <em>word indices</em> into word vectors and feeds two parallel feed-forward networks:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UwhXKcJ3O5vVI5cpzrhC-Q.png" /></figure><ul><li><em>self.single</em> takes as inputs <em>word vectors</em>, <em>spans</em> and <em>additional features </em>(see above) of a mention and computes the score that it has no other co-referring mention (score of ∅ as label),</li><li><em>self.pairs</em> takes as inputs <em>word vectors</em>, <em>spans</em> and <em>features</em> of a mention and an antecedent, together with <em>pair
features,</em> and computes the score that the pair of mentions is co-referring.</li></ul><p>So, how do we train this beauty?</p><h3>Training the coreference neural net 🌀</h3><p>First, a word about mini-batching. We talked about the problem of having a varying number of pairs for each mention. One way to use mini-batching in such conditions is to pack mini-batches as follows:</p><ul><li>Sort the mentions in the training set by their number of potential antecedents (the length of each line in our pair table),</li><li>Define a maximum number of pairs P for a mini-batch, and</li><li>Slice the sorted training set into mini-batches of size ≅P, padding the mention-pairs in each mini-batch to the maximum number of pairs in the mini-batch (the longest line, i.e. the last one in our sorted dataset).</li></ul><p>In ✨Neuralcoref, this is done by the <a href="https://github.com/huggingface/neuralcoref/blob/master/neuralcoref/dataset.py"><em>dataset.py</em></a> module, which loads and constructs a Dataset and a Dataloader with such padded mini-batches.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3dVcnGQq_iiLmltklFp8KQ.png" /><figcaption>Example of Neuralcoref evaluation metric during training</figcaption></figure><p>Once our mini-batches are ready, we can start training.</p><p>The training goes through three successive training phases: <em>All pairs</em>, <em>Top pairs</em> and <em>Ranking</em>.</p><p>We set up a very simple scheduling scheme to keep the training fast: each time our metric on the dev set stops increasing, we move on to the next stage.</p><p>The first two phases use a probabilistic loss (cross-entropy) while the last phase uses a slack-rescaled ranking loss.
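The mini-batch packing scheme described earlier (sort by pair count, cap each batch at roughly P pairs, pad inside each batch) can be sketched in a few lines of plain Python — an illustration only, with made-up names; the real logic lives in dataset.py:

```python
def make_minibatches(pair_counts, max_pairs):
    """pair_counts: number of potential antecedents per mention (the length
    of each line in the pair table). Returns batches of mention indices
    holding roughly max_pairs pairs each; mentions are sorted first so that
    each batch only needs padding up to its own longest line."""
    order = sorted(range(len(pair_counts)), key=lambda i: pair_counts[i])
    batches, current, pairs_in_batch = [], [], 0
    for idx in order:
        current.append(idx)
        pairs_in_batch += pair_counts[idx]
        if pairs_in_batch >= max_pairs:   # batch is full: start a new one
            batches.append(current)
            current, pairs_in_batch = [], 0
    if current:                           # flush the last partial batch
        batches.append(current)
    return batches

# Six mentions with 0..6 potential antecedents, at most ~5 pairs per batch:
batches = make_minibatches([3, 0, 6, 1, 2, 4], max_pairs=5)
```

Because sorting groups mentions with similar pair counts together, the padding overhead within each mini-batch stays small. With batching settled, we can turn back to the three phases and their losses.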
More precisely, our losses for each training phase look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6ZfY_mW7tqs_X1K5Y_jVqg.png" /><figcaption>𝓣(m) is the set of true antecedents of a mention m, 𝓕(m) the false antecedents and 𝓐(m) all antecedents (including <strong>∅)</strong>.</figcaption></figure><p>The <em>All pairs loss</em> is a standard cross-entropy loss on the full set of mention-pairs. The <em>Top pairs</em> loss is also a cross-entropy loss but is restricted to the (currently) top-scoring true and false antecedents of a mention. Finally, the <em>Ranking loss</em> is a max-margin loss with a slack-rescaled cost Δ.</p><p>To get more information, you should check the very nice work published in 2016 by Kevin Clark and Christopher Manning (see <a href="http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf">“Deep Reinforcement Learning for Mention-Ranking Coreference Models”</a> by Kevin Clark and Christopher D. Manning, EMNLP 2016, <a href="http://cs.stanford.edu/people/kevclark/resources/clark-manning-acl16-improving.pdf">Improving Coreference Resolution by Learning Entity-Level Distributed Representations</a> by Kevin Clark and Christopher D.
Manning, ACL 2016, and the references therein), which our model is an adaptation of.</p><p>The full details and more are given in these publications, which you should definitely read if you are interested in this model.</p><blockquote>This training is implemented in <a href="https://github.com/huggingface/neuralcoref/blob/master/neuralcoref/learn.py">learn.py</a> in the ✨Neuralcoref code.</blockquote><p>So I hope this gives you some good intuitions on how this rather uncommon beast works.</p><p>Most importantly, we set up a really nice and fast demo, so don’t hesitate to <a href="https://huggingface.co/coref/">try the coreference system for yourself</a>!</p><p>And don’t hesitate to <a href="https://github.com/huggingface/neuralcoref">fork the code</a> and use it in your projects. Hope you liked it and let us know how you use it 🚀</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7bb30c1abdfe" width="1" height="1" alt=""><hr><p><a href="https://medium.com/huggingface/how-to-train-a-neural-coreference-model-neuralcoref-2-7bb30c1abdfe">✨How to train a neural coreference model— Neuralcoref 2</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding emotions — from Keras to pyTorch]]></title>
            <link>https://medium.com/huggingface/understanding-emotions-from-keras-to-pytorch-3ccb61d5a983?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/3ccb61d5a983</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[keras]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Wed, 04 Oct 2017 15:06:28 GMT</pubDate>
            <atom:updated>2020-09-02T14:18:49.441Z</atom:updated>
            <content:encoded><![CDATA[<h4>Introducing torchMoji, a PyTorch implementation of DeepMoji</h4><p>Detecting emotions, sentiments &amp; sarcasm is a critical element of our natural language understanding pipeline at HuggingFace 🤗. Recently, we have switched to <a href="https://github.com/huggingface/torchMoji">an integrated system based on an NLP model from the MIT Media Lab</a>.</p><blockquote>Update: We’ve open sourced it! <a href="https://github.com/huggingface/torchMoji"><strong>Repo on GitHub</strong></a></blockquote><p>The model was initially designed in TensorFlow/Theano/Keras, and we ported it to pyTorch. Compared to Keras, pyTorch gives us more freedom to develop and test custom neural network modules and uses easy-to-read, numpy-style code. In this post, I will detail several interesting points that arose during the reimplementation:</p><ul><li>how to <strong>make a custom pyTorch LSTM </strong>with custom activation functions,</li><li>how the<strong> PackedSequence object works</strong> and is built,</li><li>how to <strong>convert an attention layer </strong>from Keras to pyTorch,</li><li>how to load your data in pyTorch: <strong>DataSets and</strong> <strong>smart Batching,</strong></li><li>how to reproduce Keras<strong> weights initialization</strong> in pyTorch.</li></ul><p>First, let’s look at the torchMoji/DeepMoji model. It is a fairly standard and robust NLP neural net with two bi-LSTM layers followed by an attention layer and a classifier:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/505/1*coSzKXLqaspA8f3B7rlWpg.png" /><figcaption>torchMoji/DeepMoji model</figcaption></figure><h3>How to build a custom pyTorch LSTM module</h3><p>A very nice feature of DeepMoji is that Bjarke Felbo and co-workers were able to train the model on a massive dataset of 1.6 billion tweets.
The pre-trained model thus carries a very rich representation of the emotions and sentiments in the training set, and we would like to use the pre-trained weights.</p><p>However, the model was trained with Theano/Keras’ default activation for the recurrent kernel of the LSTM: a <a href="https://www.tensorflow.org/api_docs/python/tf/contrib/keras/backend/hard_sigmoid">hard sigmoid</a>, while pyTorch is tightly modeled around <a href="https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/">NVIDIA’s cuDNN</a> library for efficient GPU acceleration, which <a href="http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnRNNMode_t">natively supports only LSTMs</a> with standard sigmoid recurrent activations:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vDnpd5RrhC9fdYgycHfpDw.png" /><figcaption>Keras default LSTM VS pyTorch default LSTM</figcaption></figure><p>I thus wrote a custom LSTM layer with hard sigmoid recurrent activation functions:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5a92520a29c0877015c2022048b830aa/href">https://medium.com/media/5a92520a29c0877015c2022048b830aa/href</a></iframe><p>This LSTM cell has to be integrated into a full module that can make use of all the pyTorch facilities (variable number of layers and directions, inputs as PackedSequences). This integration is quite long, so I’ll refer you directly to <a href="https://github.com/huggingface/torchMoji/blob/master/torchmoji/lstm.py">the relevant file</a> of the repo.</p><p>Writing a custom LSTM cell also means that we lose some of the easy and fast GPU capabilities of cuDNN.
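</p><p>As a rough sketch (hypothetical, heavily simplified code; the real cell lives in the repo’s lstm.py), the hard sigmoid and a single cell step can be written with standard PyTorch ops:</p>

```python
import torch

def hard_sigmoid(x: torch.Tensor) -> torch.Tensor:
    # Piecewise-linear approximation used by Theano/Keras
    # recurrent kernels: clip(0.2 * x + 0.5, 0, 1).
    return torch.clamp(0.2 * x + 0.5, min=0.0, max=1.0)

def lstm_cell_step(x, h, c, w_ih, w_hh, b):
    # One LSTM time step with hard-sigmoid gates (illustrative
    # shapes: w_ih is (4*hidden, input), w_hh is (4*hidden, hidden)).
    gates = x @ w_ih.t() + h @ w_hh.t() + b
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = hard_sigmoid(i), hard_sigmoid(f), hard_sigmoid(o)
    g = torch.tanh(g)
    c_next = f * c + i * g
    h_next = o * torch.tanh(c_next)
    return h_next, c_next
```

<p>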
As we mainly want to use the pre-trained model in production on a CPU and maybe fine-tune a small classifier on top of it, this is not a problem for us, but it means that the model should be further adapted to make use of the GPU more efficiently if you would like to re-train it from scratch.</p><h3>Attention layer: side-by-side Keras &amp; pyTorch</h3><p>The attention layer of our model is an interesting module where we can do a direct one-to-one comparison between the Keras and the pyTorch code:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d221231a918f17069b6b8e4cc359d8e8/href">https://medium.com/media/d221231a918f17069b6b8e4cc359d8e8/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f2bc9526a04d2f61861439fe2c33b23e/href">https://medium.com/media/f2bc9526a04d2f61861439fe2c33b23e/href</a></iframe><p>As you can see, the general algorithm is roughly identical, but most of the lines in the pyTorch implementation are comments, while the Keras implementation requires you to write several additional functions and reshaping calls.</p><blockquote>When it comes to writing and debugging custom modules and layers, pyTorch is the faster option, while Keras is clearly the fastest track when you need to quickly train and test a model built from standard layers.</blockquote><h3>How the PackedSequence object works</h3><p>Keras has a nice masking feature to deal with variable-length sequences. How do we do that in pyTorch? We use <a href="http://pytorch.org/docs/master/nn.html#packedsequence">PackedSequences</a>!
PackedSequence is not covered in much detail in the pyTorch doc, so I will spend some time describing it in greater detail.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/426/1*7qaPfwTbd6RBUWC5QNHxNw.png" /><figcaption>A typical NLP batch with five sequences and a total of 18 tokens</figcaption></figure><p>Let’s say we have a batch of sequences with variable lengths (as is often the case in NLP applications). To parallelize the computation of such a batch on the GPU we would like:</p><ul><li>to process the sequences in parallel as much as possible, given that the LSTM hidden state needs to depend on the previous time step of each sequence, and</li><li>to stop the computation of each sequence at the right time step (the end of each sequence).</li></ul><p>This can be done by using the <strong>PackedSequence</strong> pyTorch class as follows. We first sort the sequences by decreasing lengths and gather them in a (padded) tensor. Then we call the <a href="http://pytorch.org/docs/master/nn.html#pack-padded-sequence">pack_padded_sequence function</a> on the padded Tensor and the list of sequence lengths:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c1d647caa378911e9f222180f625661c/href">https://medium.com/media/c1d647caa378911e9f222180f625661c/href</a></iframe><p>The <strong>PackedSequence</strong> object comprises:</p><ul><li>a `<strong>data</strong>` object: a torch.Variable of shape (total # of tokens, dims of each token), in our simple case with five sequences of tokens (represented by integers): (18, 1)</li><li>a `<strong>batch_sizes</strong>` object: a list of the number of tokens per time-step, in our case: [5, 4, 3, 3, 2, 1]</li></ul><p>How the <a href="http://pytorch.org/docs/master/nn.html#pack-padded-sequence">pack_padded_sequence</a> function constructs this object is simple:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/668/1*WF93EuCOGU834ENSnnofZg.png" /><figcaption>How
to construct a <strong>PackedSequence</strong> object (with batch_first=True)</figcaption></figure><p>One nice property of the PackedSequence object is that we can perform many operations<strong> directly on the PackedSequence data variable</strong> without having to unpack the sequence (which is a slow sequential operation). In particular, we can perform any operation which is local in the tokens (i.e. insensitive to the token order/context). Of course, we can also apply any pyTorch Modules that accept PackedSequence inputs.</p><p>In our NLP model, we can, for example, <a href="https://github.com/huggingface/torchMoji/blob/master/torchmoji/model_def.py#L226">concatenate the outputs of the two LSTM modules without unpacking</a> the PackedSequence object and apply an LSTM to this object. We could also perform some operations of our attention layer without unpacking (like vector product, exponentiation).</p><p>Another thing to be careful about is the ordering of the labels: as you have now sorted the input sequences by length, you should sort the labels as well, using the permutation indices you got when you sorted the inputs:</p><p>labels = labels[perm_index]</p><h3>Smart data loading in pyTorch: DataSets &amp; Batches</h3><p>In Keras, data loading and batching are often hidden in the <a href="https://keras.io/models/sequential/#sequential-model-methods">fit_generator function</a>.
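</p><p>Putting the packing and label-sorting steps above together, a minimal sketch (using the standard torch.nn.utils.rnn helpers, with the five sequences and 18 tokens from the figure) might look like this:</p>

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

# Five variable-length sequences of token ids, 18 tokens in total.
seqs = [torch.tensor([1, 2, 3, 4]),
        torch.tensor([5, 6, 7, 8, 9, 10]),
        torch.tensor([11]),
        torch.tensor([12, 13, 14, 15, 16]),
        torch.tensor([17, 18])]
labels = torch.tensor([0, 1, 2, 3, 4])

# Sort by decreasing length and remember the permutation.
lengths = torch.tensor([len(s) for s in seqs])
lengths, perm_index = lengths.sort(descending=True)
seqs = [seqs[i] for i in perm_index]
labels = labels[perm_index]  # keep the labels aligned with the inputs

# Pad into one (batch, max_len) tensor, then pack it.
padded = pad_sequence(seqs, batch_first=True)
packed = pack_padded_sequence(padded, lengths, batch_first=True)

print(packed.batch_sizes.tolist())  # → [5, 4, 3, 3, 2, 1]
```

<p>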
Again, this is nice when you want to quickly test a model, but it also means we don’t fully control what is happening in this rather critical part of the model.</p><p>In pyTorch, we will combine three nice classes to do this task:</p><ul><li>a DataSet to hold, pre-process and index the dataset,</li><li>a BatchSampler to control how the samples are gathered in batches, and</li><li>a DataLoader that will take care of feeding these batches to our model.</li></ul><p>Our DataSet class is very simple:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3679f8e013ac480840b5d79cbd0675ea/href">https://medium.com/media/3679f8e013ac480840b5d79cbd0675ea/href</a></iframe><p>Our BatchSampler is more interesting.</p><p>We have several small NLP datasets that we would like to use to fine-tune our model on emotion, sentiment and sarcasm detection. These datasets have <strong>varying lengths</strong> and sometimes <strong>unbalanced classes</strong>, so we would like to design a batch sampler that could:</p><ul><li>gather batches into epochs of a pre-defined number of samples, so our training process can be independent of the batch lengths, and</li><li>be able to sample in a balanced way from the unbalanced datasets.</li></ul><p>In pyTorch, a BatchSampler is a class on which you can iterate to yield batches; each batch yielded by the BatchSampler comprises a list with the indices of the samples to pick from the DataSet.</p><p>We can thus define a BatchSampler that will be initialized using a dataset class label vector to construct a list of batches fulfilling our needs:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b877c3bf87a2795b556f7f57bcc0f8ad/href">https://medium.com/media/b877c3bf87a2795b556f7f57bcc0f8ad/href</a></iframe><h3>From Keras to pyTorch: don’t forget the initialization</h3><p>One last thing you have to be careful about when porting Keras/TensorFlow/Theano code to pyTorch is
the initialization of the weights.</p><p>Another powerful feature of Keras in terms of development speed is that its layers come with default initializations that make a lot of sense.</p><p>On the contrary, pyTorch does not initialize the weights for you but leaves you free to do as you please. To get consistent results when fine-tuning the weights, we thus copy the default Keras initialization of the weights as follows:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/488c761dfad6f22a736ced56d8581660/href">https://medium.com/media/488c761dfad6f22a736ced56d8581660/href</a></iframe><h3>Conclusion</h3><p>Keras and pyTorch have differing philosophies and goals that we can feel when we compare the two frameworks directly on a single model.</p><p>In my opinion and experience:</p><ul><li>Keras is great for quickly testing various ways to combine standard neural network blocks on a given task,</li><li>pyTorch is great to quickly develop and test a custom neural network module with great freedom and easy-to-read, numpy-style code.</li></ul><p>I took care to add a lot of comments in <a href="https://github.com/huggingface/torchMoji">my pyTorch code</a>, and the original <a href="https://github.com/bfelbo/DeepMoji">Keras implementation</a> of DeepMoji is also well commented, so don’t hesitate to walk through them, use them, and modify them.</p><p>Also, clap if you want us to share more of these!
🤗🤗🤗</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC%3Ftypeform-embed%3Doembed%26format%3Djson&amp;display_name=Typeform&amp;url=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC&amp;image=https%3A%2F%2Fimages.typeform.com%2Fimages%2FMpqBWPwLRZ7Z%2Fimage%2Fdefault&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=typeform" width="900" height="600" frameborder="0" scrolling="no"><a href="https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href">https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href</a></iframe><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3ccb61d5a983" width="1" height="1" alt=""><hr><p><a href="https://medium.com/huggingface/understanding-emotions-from-keras-to-pytorch-3ccb61d5a983">Understanding emotions — from Keras to pyTorch</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[State-of-the-art neural coreference resolution for chatbots]]></title>
            <link>https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30?source=rss-167597463903------2</link>
            <guid isPermaLink="false">https://medium.com/p/3302365dcf30</guid>
            <category><![CDATA[chatbots]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[natural-language]]></category>
            <dc:creator><![CDATA[Thomas Wolf]]></dc:creator>
            <pubDate>Fri, 07 Jul 2017 14:48:44 GMT</pubDate>
            <atom:updated>2020-09-02T14:16:54.485Z</atom:updated>
            <content:encoded><![CDATA[<blockquote><strong>TL;DR, Links: </strong><a href="https://huggingface.co/coref/">Online demo</a> at <a href="https://huggingface.co/coref">https://huggingface.co/coref</a>, <a href="https://github.com/huggingface/neuralcoref">Github repo for Neuralcoref</a>: <a href="https://github.com/huggingface/neuralcoref">https://github.com/huggingface/neuralcoref</a></blockquote><p>At Hugging Face 🤗 we work on the most amazing and challenging subset of natural language: millennial language. Full of uncertainties 🤔, implicit references 👾, emojis 😭, jokes 😂 and constantly creating novel expressions…</p><p>To navigate these stormy waters 🚤, we have developed a number of specific NLP tools based on the latest research in the field. One of these tools is a <a href="https://huggingface.co/coref/?text=My%20sister%20has%20a%20friend%20called%20John.%20Really%2C%20tell%20me%20more%20about%20him%20%3F%20She%20think%20he%20is%20so%20funny%20%F0%9F%A4%A3%20!"><em>coreference system</em></a> we use to keep track of short term references already at the front end of our AI 🤗 brains.</p><p>We couldn’t find a tool we could easily integrate in our conversational agent so we decided to develop it internally and <a href="https://github.com/huggingface/neuralcoref">open-source it</a>.</p><p>You can also <a href="https://huggingface.co/coref/">try the coreference system for yourself</a> on the demo we setup!</p><blockquote>But, what is coreference?</blockquote><p>Let’s have a look at <strong>Bob</strong> 👦 who is talking with his AI friend <strong>Alice </strong>🤖</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qU9OSJPiWqzEG_yH3b1MlA.png" /></figure><p>There are several implicit references in the last message from Bob 👦</p><ul><li>“<strong>she</strong>” refers to the same entity as “<strong>My sister</strong>”: <em>Bob’s sister</em><strong>.</strong></li><li>“<strong>he</strong>” refers to the same entity as “<strong>a friend called 
John</strong>”: <em>Bob’s sister’s friend.</em></li></ul><p>The process of linking together mentions that relate to real-world entities is called <em>coreference resolution </em><a href="#96fa">[1]</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YestCzcJ4zvlioClwmVNsQ.png" /><figcaption>Hugging face coreference system in operation. <a href="https://huggingface.co/coref/?text=My%20sister%20has%20a%20friend%20called%20John.%20Really%2C%20tell%20me%20more%20about%20him%20%3F%20She%20think%20he%20is%20so%20funny%20%F0%9F%A4%A3%20!">Try it for yourself</a>!</figcaption></figure><p>Humans naturally associate these references together — but for an AI 🤖 brain, it is much more difficult! And when we say it is hard, we mean it:</p><blockquote>Coreference resolution is the basis of the <a href="https://en.wikipedia.org/wiki/Winograd_Schema_Challenge">Winograd Schema Challenge</a>, a test of machine intelligence … built to <a href="https://motherboard.vice.com/en_us/article/wnjxxm/this-alternative-to-the-turing-test-aims-to-find-common-sense-in-ai">defeat the AIs who’ve beaten the Turing Test</a>! 🔥</blockquote><blockquote>So how do we solve this problem without creating a <a href="https://en.wikipedia.org/wiki/AI-complete">strong-AI</a>?</blockquote><p>Coreference is a rather old NLP research topic <a href="#96fa">[1]</a>. It<a href="#d0ca">*</a> has seen a revival of interest in the past two years as several research groups <a href="#06e5">[2]</a> applied cutting-edge deep-learning and reinforcement-learning techniques to it. Work published earlier this year suggests that coreference resolution may be instrumental in improving the performance of NLP neural architectures like RNNs and LSTMs (see “<a href="https://arxiv.org/abs/1703.02620">Linguistic Knowledge as Memory for Recurrent Neural Networks</a>” by B. Dhingra, Z. Yang, W. W. Cohen, and R.
Salakhutdinov).</p><p>A typical coreference resolution algorithm goes like this:</p><ul><li>We extract a series of <em>mentions</em>, words that are potentially referring to real-world entities.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8XT3PKiddmJ-o3lZ5rQrGQ.png" /><figcaption>In our previous dialogue, we can identify six mentions</figcaption></figure><ul><li>For each mention and each pair of mentions<em>, </em>we compute a set of <em>features</em> (more on that soon).</li><li>Then, we find the most likely antecedent for each mention (if there is one) based on this set of features. This last step is called <em>pairwise ranking </em><a href="#7f43">[3]</a>.</li></ul><p>Traditionally, the set of <em>features</em> was hand-engineered from linguistic features and it could be huge. Some high-quality systems use 120+ features <a href="#ce28">[4]</a>!</p><p>Here comes the nice thing about modern NLP techniques like word vectors and neural nets. They allow us to automatically learn a lot of these hand-engineered features and reduce the set of hand-designed features by almost an order of magnitude while keeping good or even better accuracy.</p><p>Let’s see that in action on a simple example<em>.</em></p><p><em>Her</em> is a feminine pronoun and should have a higher probability of referring to <em>my sister </em>👧<em> </em>than to <em>my brother </em>👦.</p><p>We could encode that by hand, listing the gender and other properties of every word in our vocabulary — but we can also assign a <em>vector</em> to every word in the vocabulary and let our model learn vectors that are adapted for our task on a training dataset.</p><p>So we trained our word embeddings on <a href="https://www.gabormelli.com/RKB/OntoNotes_Corpus">a large coreference annotated dataset</a> without supplying any information regarding word gender.
And here is what we got.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WFvjhR8KbtSZ2g99tnvlgw.png" /><figcaption>Left: <strong>Initial</strong> word embeddings (PCA of pre-trained word2vec) — Right: <strong><em>trained</em></strong> word embeddings (PCA)</figcaption></figure><p>On the left, the original word2vec word vectors don’t specifically care about gender association <a href="#0bcf">[5]</a>. On the right side, after training, our word vectors show feminine and masculine nouns nicely separated along the principal components of the vectors even though we didn’t supply any information regarding word gender (<em>gender</em> has become a direction of main variance of our trained word vectors).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/878/1*Cv82f8nm38u6326GZIiOug.png" /></figure><p>Obviously, the quality of our trained embeddings depends a lot on the training dataset. Our embeddings are trained on the <a href="https://www.gabormelli.com/RKB/OntoNotes_Corpus">OntoNotes 5.0 dataset</a> (the largest coreference annotated corpus).</p><p>But clearly this dataset has its flaws.
In particular, as is often the case in NLP datasets, it is built mainly from news and web articles, hence with more formal language than the usual chatbot user.</p><p>We can see on our PCA projection that pairs of words like <em>dad/mum or brother/sister </em>seem less clearly separated than a pair of more formal words like <em>woman</em>/<em>man</em> (along the first components of the PCA).</p><p>And, in fact, you can construct sentences for which our coreference system will <a href="https://huggingface.co/coref/?text=My%20father%20and%20my%20mother%20are%20working%20hard.%20She%20is%20always%20nice%20but%20he%20is%20sometimes%20rude.">work nicely on some pairs</a> of mentions but <a href="https://huggingface.co/coref/?text=My%20dad%20and%20my%20mum%20are%20working%20hard.%20She%20is%20always%20nice%20but%20he%20is%20sometimes%20rude.">fail on another, less formal pair of mentions</a>. We give some hacks to circumvent that in a minute, but the cleanest way is of course to gather a labeled dataset that is more representative of your production data (like the over 10M messages already exchanged by our users with their Huggingface AIs).</p><p>So is that all? If we can learn every linguistic feature relevant to our words, then how can coreference resolution be more difficult than the Turing Test?</p><p>Well, in the <a href="https://en.wikipedia.org/wiki/Winograd_Schema_Challenge">Winograd Schema Challenge</a>, your AI has to solve questions like:</p><blockquote>The trophy would not fit in the brown suitcase because <strong>it</strong> was too big. What was too big?
the trophy or the suitcase?</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*P_ZwcBaMoRS45onlFgta2A.png" /><figcaption>Terry Winograd (<a href="https://www.flickr.com/photos/lisap/379993291/in/photolist-4F45Jv-d83EV-aR1zYF-aR1A8c-aR1zWP-aR1AaP-b2hQb-aR1A4t-7Zq8vn-h4vGqR-7LsEwP-ekwg6y-ekquYR-9Hndsi-ekwg2N-t8i5X-t8ioz-t8hUJ-t8ix7-avu4tb-4vqpBz-zzyLt-68tjoE-68uE8x-2f7Jam">Flickr/Lisa Padilla</a>)</figcaption></figure><p>Our carefully gender-tuned word embeddings will not be enough to help us solve these questions because we actually have to <em>take the context of our mentions into account </em>and maybe even <em>external common knowledge</em>!</p><p>In our model, we add some simple context information by averaging word vectors surrounding each mention, but there are many ways you can add some context information <a href="#06e5">[2]</a>. The good thing is, because we’re building an entertaining AI, we don’t need 100% accuracy for it to work for users. And a high accuracy can be very hard to reach: the best team competing for the <a href="https://en.wikipedia.org/wiki/Winograd_Schema_Challenge">Winograd Schema Challenge</a> last year reached only 58% success!</p><p>Our model then goes very roughly as follows: we take word embeddings for several words inside and around each mention, average them if needed, and add some simple integer features (length of the mention, speaker information, location of the mentions…) to obtain a feature representation for each mention and its surroundings. Then we plug these representations into two neural nets. A first neural net gives us a score for each pair of a mention and a possible antecedent, while a second neural net gives us a score for a mention having no antecedent (sometimes a mention is the first reference to an entity in a text).
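</p><p>The two scoring nets and the final comparison can be sketched as follows (a hypothetical, heavily simplified snippet with made-up layer sizes; the real model lives in the Neuralcoref repo):</p>

```python
import torch
import torch.nn as nn

EMB, FEAT, HID = 50, 4, 64  # illustrative sizes, not the real ones

# One net scores (mention, antecedent) pairs; another scores "no antecedent".
pair_net = nn.Sequential(nn.Linear(2 * (EMB + FEAT), HID), nn.ReLU(), nn.Linear(HID, 1))
single_net = nn.Sequential(nn.Linear(EMB + FEAT, HID), nn.ReLU(), nn.Linear(HID, 1))

def best_antecedent(mention_repr, antecedent_reprs):
    """Return the index of the highest-scoring antecedent,
    or None if the 'no antecedent' score wins."""
    scores = [single_net(mention_repr)]  # score for having no antecedent
    for ant in antecedent_reprs:
        scores.append(pair_net(torch.cat([mention_repr, ant])))
    best = int(torch.argmax(torch.stack(scores)))
    return None if best == 0 else best - 1
```

<p>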
We can then simply compare all these scores together and take the highest score to determine whether a mention has an antecedent and which one it should be.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xcs9SyDOtUmKIvkTvHt9bw.png" /><figcaption>Rough sketch of the coreference scoring model.</figcaption></figure><p>The neural model is trained on a non-probabilistic slack-rescaled max-margin objective. It means that the system computes a score for each pair of mentions and a score for each individual mention, but these scores are not probabilities, just scores in arbitrary units (the higher the better).</p><p>This scoring system is an adaptation of the very nice work published last year by Kevin Clark and Christopher Manning (see <a href="http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf">“Deep Reinforcement Learning for Mention-Ranking Coreference Models”</a> by Kevin Clark and Christopher D. Manning, EMNLP 2016, <a href="http://cs.stanford.edu/people/kevclark/resources/clark-manning-acl16-improving.pdf">Improving Coreference Resolution by Learning Entity-Level Distributed Representations</a> by Kevin Clark and Christopher D. Manning, ACL 2016, and the references therein). The full details and even more are given in these publications, which you should definitely read if you are interested in this model.</p><p>The scoring model of Clark and Manning is <a href="https://github.com/clarkkev/deep-coref">open-source</a> and a full implementation (with mention detection and features computation) has been integrated into <a href="https://stanfordnlp.github.io/CoreNLP/">Stanford’s CoreNLP</a>.</p><p>We initially used this implementation in our system. However, we found that several important features were missing for our needs:</p><ol><li>Easy integration into our current NLP processing pipeline.
CoreNLP is an extensive tool, but it is also a large monolithic Java block that is hard to integrate into a high-throughput distributed system like ours.</li><li>Capability to evolve with our users’ language and take into account user-specific information. New word vectors have to be dynamically learned to use the fact that “Kendall Jenner” is a woman and a model even though she is not mentioned in our training dataset.</li><li>Taking care of the speakers in a conversation. The quality of the coreference resolution depends to a large extent on the speaker information associated with each mention.</li></ol><p>Here is how we solved that:</p><ol><li>Our current pipeline is based on a set of deep-learning python tools and the high-speed parsing is done by <a href="https://spacy.io/">spaCy</a>. We are big fans of spaCy’s ultra-fast parser and of the work of Matthew and Ines at <a href="https://explosion.ai/">Explosion.ai</a>. The coreference system with mention detection, feature extraction and neural net computation is thus implemented on top of spaCy and Numpy (in the future we could easily switch to <a href="https://github.com/explosion/thinc">Thinc</a> when Thinc’s API is stabilized).</li><li>We make use of <a href="https://arxiv.org/abs/1706.00286">recent work on word embeddings</a> to compute embeddings for unknown words on the fly from definitions or information that you can provide (it’s very simple in fact: you can compute a word embedding for “Kendall Jenner” simply by averaging the vectors for “woman” and “model”, for example).</li><li>We input and take care of the various speakers in the conversation when computing the features and resolving the coreferences.</li></ol><p>You can <a href="https://huggingface.co/coref/">try the coreference system for yourself</a>!</p><p>You can also <a href="https://github.com/huggingface/neuralcoref">fork the code</a> and use it in your projects. Hope you liked it and let us know how you use it 🚀</p><p>*.
<a href="#116e">^</a> Coreference pun!</p><ol><li><a href="#c1eb">^</a> Good introductions to the subject can be found in <a href="http://web.stanford.edu/class/cs224n/">the great NLP class of Chris Manning</a> and in “<em>Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition</em>” by Daniel Jurafsky &amp; James H. Martin, 2nd edition 2007 (Chapter 21 in particular). Linguistics is a fascinatingly huge domain, so you may also have heard of <a href="https://en.wikipedia.org/wiki/Anaphora_(linguistics)">anaphors and cataphors</a>. <a href="https://en.wikipedia.org/wiki/Coreference">Coreferences</a>, anaphors and cataphors describe different relationships that sometimes overlap but not always. In particular, <em>coreferences</em> are mentions that all relate to <em>the same object of the outside world</em>.</li><li><a href="#116e">^</a> See in particular: <a href="http://nlp.seas.harvard.edu/papers/corefmain.pdf">“Learning Global Features for Coreference Resolution”</a> by Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber, NAACL 2016, <a href="http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf">“Deep Reinforcement Learning for Mention-Ranking Coreference Models”</a> by Kevin Clark and Christopher D. Manning, EMNLP 2016, and “<a href="http://www.aclweb.org/anthology/Q/Q15/Q15-1029.pdf">Latent Structures for Coreference Resolution</a>” by Sebastian Martschat and Michael Strube, ACL 2015<em>.</em></li><li><a href="#b027">^</a> There are other ways you can link entities (entity-mention models, for example, which construct clusters of mentions and consider global features; coreference is a clustering operation), but we won’t talk about them here.
You should read the <a href="#4821">references above</a>, check <a href="http://web.stanford.edu/class/cs224n/">the great NLP class of Chris Manning</a>, or refer to “<em>Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition</em>” by Daniel Jurafsky &amp; James H. Martin, for example, if you want to know more about that.</li><li><a href="#7c50">^</a> See <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43407.pdf">Modeling the lifespan of discourse entities with application to coreference resolution</a> by Marneffe et al. (2015) and <a href="http://www.aclweb.org/anthology/N16-1115.pdf">Search space pruning: A simple solution for better coreference resolvers</a> by Nafise Sadat Moosavi and Michael Strube (2016).</li><li><a href="#b9bf">^</a> Well, they do somehow, but they also have to encode many other semantic/syntactic features of the words (many of these being more important than gender) to predict the surrounding words (the objective in word2vec training).</li></ol><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC%3Ftypeform-embed%3Doembed%26format%3Djson&amp;display_name=Typeform&amp;url=https%3A%2F%2Fhuggingface.typeform.com%2Fto%2FP4EATC&amp;image=https%3A%2F%2Fimages.typeform.com%2Fimages%2FMpqBWPwLRZ7Z%2Fimage%2Fdefault&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=typeform" width="900" height="600" frameborder="0" scrolling="no"><a href="https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href">https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href</a></iframe><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3302365dcf30" width="1" height="1" alt=""><hr><p><a href="https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30">State-of-the-art neural coreference
resolution for chatbots</a> was originally published in <a href="https://medium.com/huggingface">HuggingFace</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>