<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Amy Pajak on Medium]]></title>
        <description><![CDATA[Stories by Amy Pajak on Medium]]></description>
        <link>https://medium.com/@pajakamy?source=rss-b1190baed26------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*V1rVbYn_fvaCwavnj9ImYQ.jpeg</url>
            <title>Stories by Amy Pajak on Medium</title>
            <link>https://medium.com/@pajakamy?source=rss-b1190baed26------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 16:27:23 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@pajakamy/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[ALiBi: Attention with Linear Biases]]></title>
            <link>https://medium.com/@pajakamy/alibi-attention-with-linear-biases-942abe042e9f?source=rss-b1190baed26------2</link>
            <guid isPermaLink="false">https://medium.com/p/942abe042e9f</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[attention]]></category>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[positional-encoding]]></category>
            <dc:creator><![CDATA[Amy Pajak]]></dc:creator>
            <pubDate>Mon, 03 Jul 2023 07:02:13 GMT</pubDate>
            <atom:updated>2023-07-08T04:04:43.407Z</atom:updated>
<content:encoded><![CDATA[<h4>Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation</h4><p>This <a href="https://arxiv.org/pdf/2108.12409.pdf">paper</a> was published at ICLR 2022 by <a href="https://arxiv.org/search/cs?searchtype=author&amp;query=Press%2C+O">Ofir Press</a>, <a href="https://arxiv.org/search/cs?searchtype=author&amp;query=Smith%2C+N+A">Noah A. Smith</a>, and <a href="https://arxiv.org/search/cs?searchtype=author&amp;query=Lewis%2C+M">Mike Lewis</a> from the University of Washington, Facebook AI Research, and the Allen Institute for AI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*btrWDJG9FOLn-d2o4xwYIA.png" /></figure><h4>Summary</h4><p>At a high level, this paper replaces the positional encodings of transformers with a new and very simple scheme that enables transformers to extrapolate at inference time to much longer sequences than they were trained on. ALiBi allows training on shorter sequences without degrading performance, even when the inference sequence length is 2x, 8x, or more times the training sequence length.</p><p>How much better ALiBi performs than techniques using positional embeddings depends on the size of the dataset a model is trained on: models trained on smaller datasets using ALiBi show comparatively greater extrapolation abilities.</p><p>It’s a simple technique to use. The code and implementation steps are available on <a href="https://github.com/ofirpress/attention_with_linear_biases">GitHub</a>.</p><p><strong>TL;DR:</strong> this technique lets transformer-based models run inference on longer sequences than they were trained on, i.e., extrapolate.</p><h4>Background</h4><p>A quick bit of background about this technique: it’s already seen some great adoption in industry. 
Here’s a list of popular models that use ALiBi:</p><ul><li>MPT-7B</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eBAtSl_gFvo4xdjBtW7xvw.png" /><figcaption>From MosaicML’s MPT-7B <a href="https://www.mosaicml.com/blog/mpt-7b">blog</a></figcaption></figure><ul><li>MPT-30B</li><li>BLOOM</li><li>BloombergGPT</li></ul><p>MPT is part of <a href="https://www.mosaicml.com/">MosaicML’s</a> foundation series. MPT-30B was released in June 2023, and a few days later MosaicML announced a US$1.3 billion <a href="https://www.databricks.com/company/newsroom/press-releases/databricks-signs-definitive-agreement-acquire-mosaicml-leading-generative-ai-platform">acquisition</a> by Databricks, so ALiBi is definitely seeing some success.</p><h4>Introduction</h4><p>The problem this paper works on: <strong>positional encodings</strong>.</p><p>The transformer architecture was introduced in the 2017 <a href="https://arxiv.org/pdf/1706.03762.pdf">paper</a> ‘Attention Is All You Need’ by Vaswani et al. In it, the authors had to deal with the question of positional encodings, because technically the transformer isn’t a sequence model per se: it processes all elements of the input simultaneously, which means it treats the input more like a set than a sequence. So it’s actually a set model that has been adapted to handle sequence data.</p><p>The ALiBi paper deals exclusively with the text generation task (it may also be useful in other areas), where the goal is to predict the next token from a sequence of tokens: with 5 tokens you want to predict the 6th, then the 7th, and so on…</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/940/1*KFShLeMJsMkl51tvBVWv7Q.gif" /><figcaption>Next word prediction. 
<a href="https://developers.reinfer.io/blog/2022/05/04/prompting">Source</a></figcaption></figure><p>Since a transformer essentially transforms a sequence of inputs into an equally sized sequence of outputs in every layer, the transformer itself doesn’t really know where a particular item sits in the sequence.</p><p>Recognising this, the original paper came up with sinusoidal position embeddings to represent multiple dimensions of positional encoding using a vector.</p><h4>Sinusoidal Embedding</h4><p>The sinusoidal embedding uses sine and cosine waves of different frequencies across multiple dimensions (y-axis), which we index all the tokens against (x-axis), to return a unique vector that we can assign to represent an input’s position. An advantage of this, hypothesized in the original paper, was that the transformer can use these vectors to learn and reason about the relative distances between tokens.</p><p>We can choose a dimension and reason that if two tokens have a close value then they are somewhat close together in the input sequence. And if we look at the rest of the dimensions in their vectors and they’re very similar-valued, then they’re likely right next to each other. This allows us to inject some information about tokens’ absolute and relative positions in the sequence.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/839/0*uUw94JqPcUYkvjUD.png" /><figcaption>Visualizing the Sinusoidal Positional Matrix. <a href="https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/">Source</a></figcaption></figure><p>(The sine and cosine functions were chosen in tandem because they have linear properties the model can easily learn to attend to.)</p><h4>Transformer Architecture</h4><p>Positional encodings are usually added to the input embeddings (which then become the Query, Key, and Value vectors) before they are fed into the self-attention mechanism. 
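</p><p>The sinusoidal scheme described above can be sketched in a few lines of NumPy (a minimal illustration, not the original implementation; the function name is my own):</p>

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return the (seq_len, d_model) sinusoidal positional encoding matrix."""
    positions = np.arange(seq_len)[:, None]        # token index (x-axis)
    dims = np.arange(0, d_model, 2)[None, :]       # even embedding dimensions (y-axis)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                   # cosine on odd dimensions
    return pe

pe = sinusoidal_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64): one unique vector per position
```

<p>Nearby positions get similar vectors while distant ones diverge, which is exactly the property the relative-distance reasoning above relies on.</p><p>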
In this way, the position of each word or token in the sequence can affect the attention scores and the subsequent processing by the transformer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*l5_zNrTvt4BIf7Bc-qdR0g.png" /><figcaption>Self Attention. <a href="https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms">Source</a></figcaption></figure><p>The Q, K, and V vectors are crucial components of the self-attention mechanism. Here’s how they work:</p><ol><li><strong>Query</strong>: A representation of a word that we’re trying to understand in a certain context. The Transformer generates a Query vector for each word in the input. The purpose of the Query vector is to score how well each word (Key) in the context correlates with the word that the Query vector represents.</li><li><strong>Key</strong>: A representation of a word in a given context that we use to score against the Query. The Transformer generates a Key vector for each word in the input. The dot product of a Query and a Key gives us a score, indicating the compatibility or relevance of that particular Key to the Query.</li><li><strong>Value</strong>: A representation of the actual content of a word. The Transformer generates a Value vector for each word in the input. These Value vectors are used in the final weighted sum of the attention output, where the weights are determined by the scores calculated using Query and Key vectors.</li></ol><p>In practice, the Transformer takes an input sequence, creates Q, K, and V vectors for each word in the sequence through linear transformations (which are learnable parameters), and then uses these vectors to compute a new representation for each word that takes the entire context into account. 
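</p><p>In code, a single attention head reduces to a few matrix operations (a bare-bones sketch; names and dimensions are illustrative):</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # learnable linear projections
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # query-key compatibility scores
    weights = softmax(scores)                 # each row is a distribution over tokens
    return weights @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): a new context-aware vector per token
```

<p>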
The self-attention mechanism takes into account all words in the sequence, assigning different weights based on their relevance, as determined by the scores calculated from the Query and Key vectors. These individual vectors for each token are stacked together to form matrices when considering the entire input sequence.</p><h4>Results on WikiText-103</h4><p>The authors first demonstrate the results of ALiBi on the WikiText language modeling dataset. This dataset is a collection of over 100 million tokens from 28,588 Wikipedia articles verified as ‘Good’ or ‘Featured’, collected by Salesforce Research.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JuM2TISf-3tW9qTgaSpjMQ.png" /><figcaption>Source: <a href="https://arxiv.org/pdf/2108.12409.pdf">Paper</a></figcaption></figure><p>Here they compare different embeddings across four 247M parameter models:</p><ul><li>Sinusoidal</li><li>Rotary embeddings — used in GPT-J</li><li>T5 Bias — used in the T5 collection</li><li>ALiBi</li></ul><p>In figure 1 (right), the models are trained on sequences of 1024 tokens; however, when they perform inference (x-axis) on larger inputs than they were trained on, the perplexity (y-axis) explodes. This is especially the case for sinusoidal and rotary embeddings, but perplexity remains very low for ALiBi, even when extrapolating to inputs of up to 16,000 tokens!</p><p>A lower perplexity means the model’s predictions are closer to the expected outcomes, i.e., it’s less “surprised” by the data it sees, indicating better predictive performance. For example, if a model predicts the next word of a sentence accurately, this means the model has a good understanding of the language structure and context, resulting in a lower perplexity score.</p><p>T5 Bias performs a bit better than the first two techniques, but it’s a learned embedding, so it requires more memory and takes longer to compute and train. 
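</p><p>As an aside on the metric itself: perplexity is just the exponential of the average negative log-likelihood the model assigns to the true next tokens. A toy illustration (the function is mine, not from the paper):</p>

```python
import numpy as np

def perplexity(true_token_probs):
    """Perplexity from the probabilities a model assigned to each true next token."""
    nll = -np.log(np.asarray(true_token_probs))
    return float(np.exp(nll.mean()))

# Uniform guessing among 4 tokens gives a perplexity of 4: as "surprised"
# as a 4-way coin flip.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ~4.0

# A more confident model is less surprised, so its perplexity is lower.
assert perplexity([0.9, 0.8, 0.95]) < perplexity([0.3, 0.2, 0.4])
```

<p>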
Learned embeddings are part of the model itself and are learned during the training process. They start from random initialization (or some pretraining) and are then updated via backpropagation and gradient descent alongside all the other parameters of the model. The authors explain that T5 Bias drops off around 12,000 tokens due to running out of memory on their 32GB GPU. Regardless of this, it also shows significant degradation over larger input sequences.</p><p>ALiBi is a fixed-value (or precomputed) approach, as are sinusoidal and rotary embeddings. These embeddings are “fixed” in the sense that they do not change during the training of the model in which they are used. Because of this they can deal with much longer sequences and maintain speed, since there are no embeddings to learn: no wasted memory or compute.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_6moNT0QcH1MnSLylaZl7A.png" /><figcaption>Source: <a href="https://arxiv.org/pdf/2108.12409.pdf">Paper</a></figcaption></figure><p>(If you’re wondering why ALiBi’s inference speed in the chart above is slightly higher: the authors later explained that there’s no systematic reason for this. It was just run-to-run variance on their hardware, and, as explained in the paper, the speeds of sinusoidal and ALiBi are essentially identical.)</p><h4>How does ALiBi work?</h4><p>Introducing ALiBi:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9D_DIgDJz5WQbreZ1a9xTA.png" /><figcaption>ALiBi. Source: <a href="https://arxiv.org/pdf/2108.12409.pdf">Paper</a></figcaption></figure><p>We’re working with autoregressive language modeling, which involves ‘causal attention’, i.e. next word prediction. This is why the upper-triangular area is masked out; we don’t want to extend any attention to the future, only to past words. (We could fill in the rest of the matrix for full self-attention.) 
This means a current query node can only pay attention to the keys of the nodes that come before it.</p><p>So query 2 would only be multiplied by key 1 and key 2, and not key 3, etc., because it can’t peek into the future:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HOVnEwTg41MGjM7ZrcLzZw.png" /><figcaption>q2 can only be multiplied by keys of current or previous tokens to determine an attention score</figcaption></figure><p>If it were just this calculation, there would be no notable difference between q_2*k_1 and q_2*k_2: the result depends only on the content of the key, not on its position at all. So what the authors do is pretty simple… They subtract the distance between the two positions, multiplied by a number ‘m’, from the attention score.</p><p>m is a constant predefined number, e.g., 0.4. It’s needed because the position distances can grow large quickly, so m, a float between 0 and 1, scales the penalty down.</p><p>We can see that the further in the past a given key is, the larger the value subtracted from its attention value (numbers in the left matrix are attention values). If numbers in the left matrix are high, it means that key is really relevant to the query.</p><p>Whatever value we compute, however important it is, the further in the past it is, the more we subtract from it, and we do this in a linear fashion. For example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1014/1*jfDQX3dvhrS9GoQg42ZtFQ.png" /><figcaption><a href="https://www.youtube.com/watch?v=Pp61ShI9VGc&amp;t=704s">Source</a></figcaption></figure><p>So it degrades linearly (hence the name: Attention with Linear Biases) and can go to a negative value for dropoff. We then apply softmax to the biased query-key dot products, which gives us a distribution.</p><p>This technique doesn’t require additional parameters but does result in a slight slowdown due to injecting the positional encodings into every attention head within each layer. 
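</p><p>Putting this together, the bias matrix can be written down directly. A minimal single-head sketch (function names are mine; the multi-head version in the paper uses a different slope m per head):</p>

```python
import numpy as np

def alibi_bias(seq_len, m):
    """Causal ALiBi bias: subtract m times the distance to each past key."""
    idx = np.arange(seq_len)
    distance = idx[:, None] - idx[None, :]     # how far in the past each key is
    bias = -m * distance.astype(float)
    bias[distance < 0] = -np.inf               # mask the future (upper triangle)
    return bias

def alibi_attention_weights(scores, m):
    """Add the linear bias to raw query-key scores, then softmax."""
    biased = scores + alibi_bias(scores.shape[0], m)
    e = np.exp(biased - biased.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

print(alibi_bias(4, m=0.4))
# Each row penalizes keys linearly by recency: 0 for the current token,
# -0.4 one step back, -0.8 two steps back, and -inf for future tokens.
```

<p>With equal raw scores, softmax over these biased values gives more weight to recent keys, which is the whole trick.</p><p>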
This is only applied to the query and key computation, not the value.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nDaLFbUnvPbHtStaB59MdA.png" /><figcaption>Source: <a href="https://arxiv.org/pdf/2108.12409.pdf">Paper</a></figcaption></figure><p>Additional information about m:<br>m is different for each head, controlling the slope of the bias so that each head penalizes distance at a different rate. This diversity acts like an ensemble: some heads focus on recent tokens while others keep a longer view, and the model can learn which ones to rely on most. Please refer to the paper for more detail on this.</p><h4>Experimental Results</h4><p>The authors expand their experiments to compare ALiBi models trained and evaluated on varying input subsequence lengths to the sinusoidal baseline.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nFtQSGfRtYyPVS9PqwAW1Q.png" /><figcaption>Source: <a href="https://arxiv.org/pdf/2108.12409.pdf">Paper</a></figcaption></figure><p>Here they show that ALiBi is efficient and enables training models on short input subsequences that outperform strong baselines, even when the ALiBi models extrapolate to more than six times the number of tokens they were trained on.</p><p>In figure 4 above, the square dots represent models trained using classic sinusoidal embeddings. These are always tested on sequences as long as they were trained on because, as we saw in figure 1, if we make the sequence longer they fail catastrophically.</p><p>The dotted purple line represents a model trained using the ALiBi technique with input lengths of 512. As we can see during validation, it already achieves better perplexity than all baseline sinusoidal models.</p><p>We can see in general that on longer texts the perplexity decreases, so as we train ALiBi on longer sequences, such as 1024 and 2048, we’ll see a gain in performance. 
However, it’s not too bad if we stick to training on short sequences like 512 and then extrapolate to longer ones. The ALiBi models surpass the sinusoidal baseline when not extrapolating, while also outperforming it when extrapolating to longer sequences.</p><h4>Results on the CC100 + RoBERTa Corpus</h4><p>The final set of experiments investigates whether ALiBi’s performance transfers to a larger model trained on a larger dataset than the ones previously used.</p><p>Since WikiText-103 is considered quite small (100 million tokens) compared to more recently available datasets, a model with a strong inductive bias will easily achieve great results on it, but that advantage almost disappears when you train on a much larger dataset with a much greater compute budget.</p><p>The datasets used in this experiment are the 161-gigabyte RoBERTa training corpus (Toronto Book Corpus, English Wikipedia, CC-News, OpenWebText and Stories), and the 300-gigabyte English part of the CC-100 corpus, totalling 461GB. The validation set contains 649K tokens.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9fUMC0wcPGBEwYgzEgBkBQ.png" /></figure><p>The models used for this dataset consist of 25 transformer layers with 16 heads and have 1.3B parameters. <br>The only difference between the models, other than the positional encoding method, is the length L of the input sequences used during training. The authors decided to train the ALiBi models on only half the L of the sinusoidal model. 
This was done to demonstrate the savings in time and compute without sacrificing any extrapolation power.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/958/1*KNfdAl7YxF-5V0KcMC6s3A.png" /><figcaption>Source: <a href="https://arxiv.org/pdf/2108.12409.pdf">Paper</a></figcaption></figure><p>Figure 5 (left) compares the validation perplexity for L_valid = 1024 throughout the training process for an ALiBi model trained with L = 512 against a sinusoidal model trained with L = 1024. The two show very similar results, and because the ALiBi model is trained on shorter sequences, it is 7% faster and uses 1.6 GB less memory.</p><p>Figure 5 (right) is even more impressive: the ALiBi model trained on L = 1024 outperforms the sinusoidal model trained on L = 2048 by 0.09 perplexity (when evaluating with L_valid = 2048), even though the ALiBi model uses 3.1 GB less memory. The ALiBi model maintains a lead in perplexity over the sinusoidal model during the entire training process.</p><p>The authors were able to show that the ALiBi L = 1024 model reaches a given perplexity value, on average, 11% faster than the sinusoidal model does. Stacking more layers would further improve performance (with negligible, if any, runtime cost).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/962/1*La1dH_ela2lmzTnA6nNrHQ.png" /><figcaption>Source: <a href="https://arxiv.org/pdf/2108.12409.pdf">Paper</a></figcaption></figure><p>Figure 6 shows that the ALiBi models trained on L = 512 and L = 1024 achieve the best results when extrapolating to about double the tokens that they were trained on. 
After that, their performance starts to degrade.<br>The sinusoidal model cannot extrapolate at all in this setting, with its performance degrading for both the L = 512 and 1024 models as soon as even one token more than L is added during evaluation.</p><p>In this experiment the authors showed that their method achieves strong results in this more challenging setting, obtaining performance similar to the sinusoidal baseline while using significantly less memory, since they train on shorter subsequences.</p><h4>Conclusion</h4><p>Overall, the extrapolation gains on smaller datasets are huge. On larger datasets ALiBi is still a bit better, but the gap is more modest.</p><p>Eliminating the positional embeddings and calculating ALiBi across the heads can save a lot of computation and time in training. If we train a model on smaller sequences using ALiBi, then at inference time it has the potential to understand much longer texts, such as books or long reports. This helps to explain why the technique has been adopted by models such as MPT and BLOOM.</p><h4>Resources</h4><p>Paper: <a href="https://arxiv.org/pdf/2108.12409.pdf">https://arxiv.org/pdf/2108.12409.pdf</a></p><p>Github: <a href="https://github.com/ofirpress/attention_with_linear_biases">https://github.com/ofirpress/attention_with_linear_biases</a></p><p>Video Talk by Author: <a href="https://www.youtube.com/watch?v=Pp61ShI9VGc">https://www.youtube.com/watch?v=Pp61ShI9VGc</a></p><p><a href="https://docs.google.com/presentation/d/1dYRrbgu0ZrwtCCkl6h7KY_aLEPopi4o0iy7kc0Cy_vo/edit?usp=sharing">Slides</a> I created to give a presentation on this paper</p><p>What’s next? (If you found this interesting): <a href="https://arxiv.org/abs/2306.15595">Extending Context Window of Large Language Models via Positional Interpolation</a>, 27 Jun 2023. 
Meta.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[t-SNE: t-distributed stochastic neighbor embedding]]></title>
            <link>https://medium.com/@pajakamy/dimensionality-reduction-t-sne-7865808b4e6a?source=rss-b1190baed26------2</link>
            <guid isPermaLink="false">https://medium.com/p/7865808b4e6a</guid>
            <category><![CDATA[unsupervised-learning]]></category>
            <category><![CDATA[exploratory-data-analysis]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[dimensionality-reduction]]></category>
            <dc:creator><![CDATA[Amy Pajak]]></dc:creator>
            <pubDate>Tue, 27 Jun 2023 17:37:59 GMT</pubDate>
            <atom:updated>2023-06-28T11:53:31.534Z</atom:updated>
<content:encoded><![CDATA[<h4>An overview of t-SNE as a dimensionality reduction technique</h4><h4><strong>Summary</strong></h4><p>t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction <strong>tool</strong> used to <strong>help visualize</strong> high dimensional data.</p><p>It’s not typically used as the primary method for training a model. Instead, it’s often a first step for exploratory data analysis, visualization, and clustering. t-SNE provides an intuitive way to understand high-dimensional data by reducing its complexity, which can guide the selection and application of subsequent techniques for more detailed and focused analysis.</p><p>The 2D/3D mapping created by t-SNE allows us (as humans) to see whether there are strong relationships and from there decide <em>ourselves</em> the best machine learning algorithm to apply to the data, such as clustering, classification, or deep learning, i.e. to learn those relationships within the mapping and identify outliers. This can help to improve the performance of our ML algorithm and reduce overfitting.</p><h4>Introduction — High Dimensional Data</h4><p>t-SNE was introduced because practitioners have lots of high dimensional data they want to visualize. For example:</p><ol><li>Financial data: stock prices, trading volumes, and economic indicators can be represented as high-dimensional data sets.</li><li>Medical imaging: technologies such as MRI and CT scans generate high-dimensional data with intensity values representing different features.</li><li>Genomics: DNA sequences of organisms are represented as high-dimensional data sets with each gene or nucleotide base representing a dimension. 
The human genome has approximately 3 billion base pairs, which correspond to 3 billion features in the DNA.</li><li>Image and video data, where each pixel represents a dimension.</li><li>Social media platforms with user profiles, posts, likes, comments, and other interactions.</li><li>Text data, such as news articles, tweets, and customer reviews, can be represented with each word or token as a dimension.</li><li>Robotics: robots generate data from sensory inputs such as cameras, microphones, and other sensors used in their control systems.</li></ol><p>High dimensional data is everywhere. We need to visualize and explore this complex data in a more intuitive and understandable way. t-SNE serves as a powerful tool to achieve this by transforming the high-dimensional data into a low-dimensional representation without losing significant information.</p><h4>MNIST Digits</h4><p>A great way to explain t-SNE is to show how it works on the MNIST Digits dataset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*TGz3umQUNlFhiVwp" /><figcaption><a href="https://www.mdpi.com/2076-3417/9/15/3169/htm">A Survey of Handwritten Character Recognition with MNIST and EMNIST</a> (2019)</figcaption></figure><p>This is a publicly available labeled dataset of 60,000 28x28 grayscale training images of handwritten digits from 0–9, along with a test set of 10,000 images.</p><p>We will use this dataset to demonstrate different dimensionality reduction techniques.</p><h4>Introducing Principal Component Analysis (PCA)</h4><p>One of the first, and most popular, dimensionality reduction techniques is Principal Component Analysis (PCA), published in 1901 by Karl Pearson. It is a linear technique that finds a linear projection, or a new representation, of the original high-dimensional data points onto a lower-dimensional subspace in a way that maximizes the variance of the data, i.e. preserves as much information as possible. 
These projected axes/directions are referred to as the principal components of the data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8VzZmYrziUtqTCet" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RzD3fN0cZVikU7bEVo_nwQ.png" /><figcaption>View the notebook version <a href="https://colab.research.google.com/drive/1znYpKviaBQ7h0HgfACcVxnP26Ud1cwKO?usp=sharing">here</a></figcaption></figure><p>If we visualize PCA on MNIST Digits, the results will look like what we see above: a visualization of around 5,000 images. They’ve been laid out in 2 dimensions, where each point corresponds to a digit in the dataset and its color indicates which digit the point represents. What we see here is that PCA captures some of the structure of the data: for example, the red points on the right form a cluster of 0’s, and the oranges on the left form a cluster of 1’s. This happens to be the first principal component, so the main variation between digits is between the 0’s and 1’s. That makes sense in terms of pixel values, where 0’s and 1’s have very few overlapping pixels.</p><p>The second principal component runs along the top of the visualization, where we see 4’s, 7’s, and 9’s clustered, which are slightly more similar in terms of pixel values, and on the bottom we’ve got 3’s, 5’s, and 8’s clustered, which are also more similar, e.g. a 3 will have many overlapping pixels with an 8. So that’s our second source of maximum variation in the data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/736/1*yxq5GjMWP7LtqF5ATcM-GQ.png" /><figcaption>We can see 4, 7, 9 and 3, 5, 8 have similar overlapping pixel structure</figcaption></figure><p>This is great, but a problem arises when the data is unlabelled, as we can see on the right. The colors/labels tell us some information about the relationships in the data; without them we see no clear clusters, just many points in a 2D space. 
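</p><p>A projection like the one above takes only a few lines. Here is a minimal PCA via the SVD in plain NumPy (random data stands in for the flattened digit images):</p>

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto its top principal components (directions of max variance)."""
    Xc = X - X.mean(axis=0)                        # center the data first
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                # coordinates in the new subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))                     # 200 samples, 64 "pixels"
coords = pca(X, n_components=2)
print(coords.shape)  # (200, 2): each image becomes one point on the 2D map
```

<p>Each row of coords is what gets plotted; the first column captures the most variance, the second the next most.</p><p>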
So we run into a problem: with unlabelled data we are unable to interpret these results.</p><p>Can we do better?</p><h4>Linear vs Non-Linear Data</h4><p>PCA is good, but it’s a linear algorithm.</p><ul><li>It cannot represent complex relationships between features</li><li>It is concerned with preserving large distances in the map (to capture the maximum amount of variance). But are these distances reliable?</li></ul><p>Linear techniques focus on keeping the low-dimensional representations of dissimilar data points far apart (e.g. the 0’s and 1’s we’ve just seen). But is that what we want in a visual representation? And how reliable is it?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/808/0*TpOijEW_tlI-nasJ" /><figcaption><a href="https://seoulai.com/presentations/t-SNE.pdf">Dimensionality reduction with t-SNE</a> (2018)</figcaption></figure><p>If we look at the Swiss-roll nonlinear manifold above (a), we can see that a Euclidean (straight-line) distance between two points in the red and blue clusters would suggest that the data points are quite close.</p><p>If we consider the entire structure represented in a 2D plane (i.e. rolling it out into a flat 2D shape), the red points would actually be on the opposite end to the blue points: one would have to traverse the entire length of the roll to get from one point to the other. PCA would attempt to capture the variance along the longest axis of the roll, essentially flattening it out. 
This would fail to preserve the spiral structure inherent to the Swiss roll data, where points that are close together in the spiral (and thus should be close together in a good 2D representation) end up being placed far apart.</p><p>So we can see PCA doesn’t work very well for visualization of nonlinear data because it preserves these large distances; we need to consider not only the straight-line distance but also the surrounding structure of each data point.</p><h4>Introducing t-SNE</h4><p>Stochastic Neighbor Embedding was first developed and published in 2002 by <a href="https://www.cs.toronto.edu/~fritz/absps/sne.pdf">Hinton et al.</a>, and was then modified in 2008 into what we’re looking at today: t-SNE (t-Distributed Stochastic Neighbor Embedding).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Mp95O4ptzxdjdUFa-r6R-Q.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YYz8lNa_eQxMzQb38Cy6ug.png" /></figure><p>For anyone curious, another variation was published in 2014 named <a href="https://jmlr.org/papers/volume15/vandermaaten14a/vandermaaten14a.pdf">Barnes-Hut t-SNE</a>, which improves the efficiency of the algorithm via a tree-based implementation.</p><h4>How t-SNE works</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vjfZSFizkdUJM7EzKQTd3g.png" /><figcaption><a href="https://seoulai.com/presentations/t-SNE.pdf">Dimensionality reduction with t-SNE</a> (2018)</figcaption></figure><p>In the high dimensional space (left) we measure similarities between points, in such a way that we only look at local similarities, i.e. nearby points.</p><p>The red dot in the high dimensional space is xi. We first center a Gaussian over this point (shown as the purple circle) and then measure the density of all the other points under this Gaussian (e.g. xj). We then renormalise over all pairs of points that involve the point xi (the denominator/bottom part of the fraction). 
This gives us the conditional probability pj|i, which basically measures the similarity between the pair of points i and j. We can think of this as a probability distribution over pairs of points, where the probability of picking a particular pair of points is proportional to their similarity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ILFSLJmj-blmVDOa" /><figcaption><a href="https://www.enjoyalgorithms.com/blog/tsne-algorithm-in-ml">t-SNE (t-Distributed Stochastic Neighbor Embedding) Algorithm</a></figcaption></figure><p>This can be visualized as follows. If two points are close together in the original high dimensional space, we’re going to have a large value for pj|i. If two points are dissimilar (far apart) in the high dimensional space, we’re going to get a small pj|i.</p><h4>Perplexity</h4><p>Looking at the same equation, perplexity tells us the density of points relative to a particular point. If four points with similar characteristics are clustered together, they will have a higher perplexity than points that are not clustered together. Points with less density around them have flatter (wider) normal curves than points in denser regions. In the figure below, the purple points are sparse.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fD0NMvzn95pjypXcf0kwhQ.png" /><figcaption><a href="https://www.enjoyalgorithms.com/blog/tsne-algorithm-in-ml">t-SNE (t-Distributed Stochastic Neighbor Embedding) Algorithm</a></figcaption></figure><p>We compute the conditional distribution between points because this allows us to set a different bandwidth (sigma i) for each point, such that the conditional distribution has a fixed perplexity. This is basically scaling the bandwidth of the Gaussian in such a way that a fixed number of points fall in the range of this Gaussian. 
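</p><p>The bandwidth-scaling trick can be sketched as a binary search: perplexity is 2 to the power of the Shannon entropy of pj|i, and we search for the sigma that hits a target value. This is a NumPy sketch under my own naming, not the authors’ implementation:</p>

```python
# Sketch: pick sigma_i by binary search so the conditional distribution
# around point i has a chosen perplexity (an effective neighbour count).
import numpy as np

def perplexity(p):
    # Perplexity = 2^H(p), with H the Shannon entropy in bits
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def cond_probs(d2, i, sigma):
    logits = -d2 / (2.0 * sigma ** 2)
    logits[i] = -np.inf       # exclude the point itself
    logits -= logits.max()    # numerical stability for tiny sigma
    w = np.exp(logits)
    return w / w.sum()

def find_sigma(X, i, target=30.0, iters=100):
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = (lo + hi) / 2.0
        if perplexity(cond_probs(d2, i, sigma)) > target:
            hi = sigma        # too many effective neighbours: narrow
        else:
            lo = sigma        # too few: widen
    return (lo + hi) / 2.0

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
sigma0 = find_sigma(X, i=0, target=15.0)
```

<p>In the full algorithm this search runs once per data point, which is one reason t-SNE gets expensive on large datasets.</p><p>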
We do this because different parts of the space may have different densities, and this trick allows us to adapt to those different densities.</p><h4>Mapping the lower dimensional space</h4><p>Next we’re going to look at the low dimensional space which will be our final map.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eg4WZoZRchNwgi0t" /><figcaption>[2]</figcaption></figure><p>We start by laying the points out randomly on this map. Each high dimensional object will be represented by a point here.</p><p>We then repeat the same process as before: center a kernel (this time a Student-t distribution rather than a Gaussian) over the point yi and measure the density of all the other points yj under that distribution. We then renormalise by dividing over all pairs of points. This gives us a probability qij, which measures the similarity of two points in the low dimensional map.</p><p>Now, we want these probabilities qij to reflect the similarities pij which we computed in the high dimensional space, as closely as possible. If the qij’s are identical to the pij’s, then evidently the structure of the map closely matches the structure of the data in the original high dimensional space.</p><p>We will measure the difference between these pij values in the high dimensional space and the qij values in the low dimensional map by using the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback–Leibler divergence</a>.</p><h4>Stochastic Neighbor Embedding</h4><p>KL divergence is the standard measure of the difference between two probability distributions. 
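</p><p>Both the low-dimensional similarities and the KL cost that compares them fit in a few lines of NumPy (a sketch with my own helper names; the heavy-tailed Student-t kernel with one degree of freedom is the one t-SNE uses in the map):</p>

```python
# Sketch: low-dimensional similarities q_ij (Student-t kernel) and the
# KL-divergence cost between the high- and low-dimensional distributions.
import numpy as np

def q_matrix(Y):
    # Pairwise squared distances between map points y_i, y_j
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    w = 1.0 / (1.0 + d2)      # heavy-tailed Student-t kernel
    np.fill_diagonal(w, 0.0)  # q_ii is defined to be zero
    return w / w.sum()        # renormalise over all pairs

def kl_cost(P, Q, eps=1e-12):
    # sum_ij p_ij * log(p_ij / q_ij): zero when the map mirrors the data
    return np.sum(P * np.log((P + eps) / (Q + eps)))

rng = np.random.default_rng(1)
Y = rng.normal(scale=1e-2, size=(20, 2))  # random initial map
Q = q_matrix(Y)
```

<p>Gradient descent then nudges the yi so that kl_cost(P, Q) shrinks, which is the optimisation described next.</p><p>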
It’s shown below in the cost function as the sum over all pairs of points of pj|i times log pj|i over qj|i.</p><ul><li>Similarity of data points in High Dimension</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/462/1*rRDIkXTb17iWyJCSNmyUYA.png" /></figure><ul><li>Similarity of data points in Low Dimension</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/380/1*OyeY1BEMnflCJ57cOSyeFg.png" /></figure><ul><li>Cost function</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/478/1*1jDSaxmLpWcLiud29ZwnMQ.png" /><figcaption>[1]</figcaption></figure><p>Our goal now is to lay out the points in the low dimensional space such that the KL divergence is minimized, i.e. the map is as similar as possible to the high dimensional structure. In order to do that we’re going to do gradient descent on this KL divergence, which essentially moves the points around in such a way that the KL divergence becomes small.</p><h4>Mapping the lower dimensional space</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0sDGtZlIt-Bh_SpW" /><figcaption>[2]</figcaption></figure><p>Here we can see the resultant mapping which has been rearranged to be as similar to the higher dimension as possible.</p><p>KL divergence is useful as it measures the difference between two probability distributions. Note that it is not symmetric: placing nearby points (large pij) far apart in the map (small qij) is penalized heavily, while placing distant points close together incurs only a small cost. This asymmetry is what makes t-SNE focus on preserving local structure.</p><h4>t-SNE Algorithm</h4><p>Let’s take a final look at the overall algorithm.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wtq6RVlYyAbJtD64g5-KUw.png" /><figcaption>[1]</figcaption></figure><p>Combining what we’ve seen:</p><p>1. Calculate the pairwise affinities (conditional probabilities) in the high-dimensional space, using a Gaussian distribution. The perplexity parameter defines the effective number of neighbors each point has and helps to balance the focus on local and global aspects of the data.</p><p>2. 
Symmetrize the probabilities. This ensures that if point i considers point j a neighbor, then j also considers i one; it is achieved by taking the average of the two conditional probabilities for each pair of points. Then normalize by dividing by the total number of points.</p><p>3. Randomly initialize the position of each data point in the low-dimensional space, usually by drawing from a normal distribution with mean 0 and small variance.</p><p>4. Start a loop that will iterate for a fixed number of iterations T. Each iteration updates the position of the points:</p><p>4.1. Calculate the pairwise affinities (similarities) in the low-dimensional space using a Student-t distribution.</p><p>4.2. Calculate the gradient of the cost function with respect to the position of the points. The cost function is the Kullback-Leibler divergence between the high-dimensional and low-dimensional distributions.</p><p>4.3. Update the position of the points in the low-dimensional space. This update step is composed of three parts: the old position Y(t-1), a term proportional to the gradient that helps minimize the cost function, and a momentum term that helps accelerate convergence and avoid local minima.</p><p>5. End of the t-SNE algorithm. The final positions of the points in the low-dimensional space should now provide a useful visualization of the high-dimensional data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y2Pxb9BgsbI5GehzK5Bt4Q.png" /><figcaption>Visual output of t-SNE applied to 1000 MNIST Digits images</figcaption></figure><p>If you want to see this algorithm implemented in detail as code, please take a look at the original author’s <a href="https://lvdmaaten.github.io/tsne/">GitHub</a> [3] or this great step-by-step <a href="https://towardsdatascience.com/t-sne-from-scratch-ft-numpy-172ee2a61df7">article</a>.</p><h4>Alternatively…</h4><p><em>“To deal with hyperplanes in a 14-dimensional space, visualize a 3D space and say ‘fourteen’ to yourself very loudly. 
Everyone does it.” </em>— Geoffrey Hinton, A geometrical view of perceptrons, 2018</p><h4>Advantages</h4><ol><li>Visualization: t-SNE can help visualize high-dimensional data that has non-linear relationships, as well as outliers.</li><li>Good for clustering: t-SNE is often used for clustering and can help identify groups of similar data points within the data.</li></ol><h4>Limitations</h4><ol><li>Computational Complexity: t-SNE involves complex calculations as it computes the pairwise conditional probability for each point. Due to this, it takes more time as the number of data points increases. <a href="https://jmlr.org/papers/volume15/vandermaaten14a/vandermaaten14a.pdf">Barnes-Hut t-SNE</a> was later developed to improve on this.</li><li>Non-Deterministic: Due to the randomness in the algorithm, even with identical code and data we may get different results across runs (unless the random seed is fixed).</li></ol><h4>Demo</h4><p>In the following notebook I use Python to implement PCA and t-SNE on the MNIST Digits dataset via the sklearn library:</p><p><a href="https://colab.research.google.com/drive/1znYpKviaBQ7h0HgfACcVxnP26Ud1cwKO?usp=sharing">https://colab.research.google.com/drive/1znYpKviaBQ7h0HgfACcVxnP26Ud1cwKO?usp=sharing</a></p><h4>Conclusion and Next Steps</h4><p>To conclude, t-SNE visualization is just the first step in the data analysis process. 
The insights gained from the visualization need to be followed up with further analysis: deeper exploration to better understand the data, an appropriate ML algorithm to build predictive models, or statistical analysis methods to test specific hypotheses about the data.</p><p>Other popular dimensionality reduction techniques include:</p><ul><li>Non-negative matrix factorization (NMF)</li><li>Kernel PCA</li><li>Linear discriminant analysis (LDA)</li><li>Autoencoders</li><li>Uniform manifold approximation and projection (UMAP)</li></ul><p>You can read more about them <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction">here</a>.</p><h4>Resources</h4><ul><li>[1] Original paper “<a href="https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">Visualizing Data using t-SNE</a>”, 2008, by L. Maaten and G. Hinton</li><li>[2] YouTube video <a href="https://www.youtube.com/watch?v=RJVL80Gg3lA">Visualizing Data Using t-SNE </a>— Google Tech Talks, L. Maaten, 2013</li><li>[3] <a href="https://lvdmaaten.github.io/tsne/">t-SNE implementations, examples and FAQ</a> — L. Maaten GitHub</li><li><a href="https://docs.google.com/presentation/d/1MYjoN-qCSPEuf3TIViXitjteeW2gC8DjJOoIrZ0-EA8/edit?usp=sharing">Slides</a> I created to give a presentation on this topic</li></ul>]]></content:encoded>
        </item>
    </channel>
</rss>