<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Bilal Kabas on Medium]]></title>
        <description><![CDATA[Stories by Bilal Kabas on Medium]]></description>
        <link>https://medium.com/@bilalkabas?source=rss-cbf4f5f5c8de------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*iCyOGkxImW4ojhIg-48wdg.png</url>
            <title>Stories by Bilal Kabas on Medium</title>
            <link>https://medium.com/@bilalkabas?source=rss-cbf4f5f5c8de------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 05 Apr 2026 22:02:45 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@bilalkabas/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Attention Mechanism in Transformers]]></title>
            <link>https://medium.com/@bilalkabas/self-attention-in-transformers-9d631f2a076d?source=rss-cbf4f5f5c8de------2</link>
            <guid isPermaLink="false">https://medium.com/p/9d631f2a076d</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[attention]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Bilal Kabas]]></dc:creator>
            <pubDate>Sun, 02 Apr 2023 18:15:35 GMT</pubDate>
            <atom:updated>2023-04-02T18:15:35.906Z</atom:updated>
            <content:encoded><![CDATA[<p>Today’s advanced language models are based on the transformer architecture [1]. Beyond otherwise standard feed-forward (MLP) layers, the transformer relies on the attention mechanism. Attention gives the network the ability to extract long-range relationships within word sequences, which is crucial for natural language understanding. Prior to the main discussion, I will briefly explain word embeddings. To keep things simple, I will first introduce “<em>the basic self-attention</em>”, which I call it that way because it is the simplest form of self-attention. Then, I will elaborate on <em>scaled dot product attention</em> and <em>multi-head attention</em>.</p><blockquote>Attention is a general term and comes in different forms, e.g. additive attention [2]. Throughout this text, attention refers to “dot product attention”. It should also be noted that self-attention is a form of attention focusing on the relationships “within” a sequence.</blockquote><h3>Word Embedding</h3><p>Word embedding is a representation of words (or tokens) as vectors in a high-dimensional space. This representation can be learned from a large corpus, for example using word2vec [3], so that the resulting embedding vectors encode semantic relationships within the corpus. Specifically, the embedding vectors of two semantically related words are expected to be similar in direction, assuming that all embedding vectors are of unit length. Figure 1 illustrates word vectors in three-dimensional space. Such a learned representation of words can also reflect indirect relationships. For example, the vector differences (woman − man) and (queen − king), shown in red in Figure 1, are expected to be similar.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/425/1*S_nI8AAmLlU3llrZLDmfSg@2x.jpeg" /><figcaption>Figure 1. 
Illustration of word vectors in three-dimensional embedding space.</figcaption></figure><h3>Basic Self-attention</h3><p>I use the term <em>basic self-attention</em> to refer to the simplest form of self-attention. Equation (1) is the mathematical expression of basic self-attention, where <em>x</em> represents a sequence of words as illustrated in Figure 2, dₛ is the sequence length, i.e. the number of words (or tokens) in the input sequence, and dₘ is the dimensionality of the word embedding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/415/1*SpBrcDUHny0oJk7uzfA-6A@2x.png" /><figcaption>Equation 1</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/375/1*aHzOCmEhsNwbBv0KpmPasA@2x.png" /><figcaption>Figure 2. An example input sequence to transformers. For simplicity, the example sequence is at the word level, not the token level.</figcaption></figure><p>The expression <strong><em>softmax(xxᵀ)x</em></strong> can be understood in three steps:</p><p>(i) <strong><em>xxᵀ</em></strong>: similarity scores between each pair of words.<br>(ii) <strong><em>softmax(xxᵀ)</em></strong>: convert similarity scores into weights, i.e. values between 0 and 1 that sum to 1 along each row.<br>(iii) <strong><em>softmax(xxᵀ)x</em></strong>: weighted sum of features from the input itself (self-attention).</p><p>Let’s explore each step in detail.</p><p>Equation (2) shows the matrix multiplication that computes similarity scores between words. <strong>xⁱxʲᵀ</strong> gives the similarity score between the <em>i</em>th and <em>j</em>th words in the sequence. Thus, for dₛ words, there are dₛ × dₛ scores. Note that <strong><em>xxᵀ</em></strong> is a symmetric matrix.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/816/1*QJUszEsKgNkjnUXUT2FSPQ@2x.png" /><figcaption>Equation 2</figcaption></figure><p>The cosine similarity given in Equation (3) equals the dot product for vectors of unit length. 
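</p><p>As a quick sanity check that the dot product of unit-length vectors equals their cosine similarity (the vectors below are arbitrary stand-ins, not learned embeddings):</p>

```python
import numpy as np

# Two arbitrary word vectors (stand-ins, not learned embeddings),
# normalized to unit length.
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 1.0, 2.0])
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

dot = a @ b  # dot product
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity
print(np.isclose(dot, cos))  # True: identical for unit-length vectors
```

<p>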
Therefore, <strong>xⁱxʲᵀ</strong> is essentially the cosine similarity between two word vectors.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/177/1*E6nFoYEwPuCRyvre7jhfrA@2x.png" /><figcaption>Equation 3</figcaption></figure><blockquote>Cosine similarity is a measure of how similar two vectors are. When two vectors are similar, the angle (θ) between them is small, meaning that cos(θ) is large (it is bounded above by 1). Therefore, for unit-length vectors, the dot product directly measures vector similarity, as given in Equation (3).</blockquote><p>In the second step, we apply softmax to the similarity scores as given in Equation (4), where <em>P</em> is a dₛ × dₛ matrix. <em>Pⁱ</em> holds the similarities of the <em>i</em>th word to each word in the sequence, and the elements of <em>Pⁱ</em> sum to 1. In that sense, similarity scores are converted into weights using <em>softmax</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/261/1*BynGaoNz97A9OJ8Y7kMB4A@2x.png" /><figcaption>Equation 4</figcaption></figure><p>The last step is basically a <em>weighted sum</em>. <strong>Using the similarity weights, each word is reconstructed considering its relationship with all words in the sequence.</strong> Specifically, the <em>k</em>th feature (or element) of the <em>i</em>th word in the sequence is given by the dot product between <em>Pⁱ</em> and <em>xₖ</em>, where <em>xₖ</em> is the <em>k</em>th column of the input <em>x</em>. This dot product is basically a weighted average of the <em>k</em>th feature over all words in the sequence, using the similarity weights. In this way, <strong>the new representation of the <em>i</em>th word encodes its relationships to each word in the sequence.</strong> Figure 3 summarizes the weighted sum step.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/786/1*_PrRji6tkSUzCfakyoHebg@2x.png" /><figcaption>Figure 3. 
Weighted sum step in basic self-attention.</figcaption></figure><p>For example, considering the input sequence given in Figure 2, the first feature of the resulting representation of the word <em>attention</em> is the weighted sum of the first features of all words in the sequence, using the similarity scores as weights.</p><p>Figure 4 demonstrates a numerical example of the three steps of basic self-attention. The embedding space is assumed to be three-dimensional. In the example, the word vectors do not represent learned embeddings; they are random.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/759/1*P-s4SilSHyI5E7iOhZnVUA@2x.png" /><figcaption>Figure 4. A numerical example of the three steps of basic self-attention.</figcaption></figure><h3>Dot Product Attention</h3><p>The mathematical expression of dot product attention is given in Equation (5), where <em>Q</em>, <em>K</em>, and <em>V</em> are called <em>query</em>, <em>key</em>, and <em>value</em>, respectively, terminology inspired by databases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/206/1*Fat4uLEfGhfKBzFKzKQBuw@2x.png" /><figcaption>Equation 5</figcaption></figure><p>This is the generalized form of self-attention: when <em>Q = K = V</em>, the operation becomes self-attention and focuses on the relationships <em>within</em> the sequence. In the generalized form, however, attention can be performed across two different sequences. For example, assume that <em>Q</em> is from <em>sequence 1</em>, and <em>K</em> and <em>V</em> are from <em>sequence 2</em>. 
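</p><p>Under hypothetical shapes (4 query positions from <em>sequence 1</em>, 6 key/value positions from <em>sequence 2</em>, with random values standing in for learned representations), softmax(<em>QKᵀ</em>)<em>V</em> can be sketched in NumPy as:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

d_m = 8                            # embedding dimensionality
Q = rng.standard_normal((4, d_m))  # sequence 1: 4 positions (queries)
K = rng.standard_normal((6, d_m))  # sequence 2: 6 positions (keys)
V = rng.standard_normal((6, d_m))  # sequence 2: 6 positions (values)

weights = softmax(Q @ K.T)  # shape (4, 6): one weight row per query
out = weights @ V           # shape (4, 8): one output row per position of sequence 1
print(out.shape)            # (4, 8)
```

<p>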
This implies that <em>sequence 1</em> is going to be reconstructed using its relationships with <em>sequence 2</em>: each query from <em>sequence 1</em> is answered by a weighted sum of the values from <em>sequence 2</em>, which is what happens in the encoder-decoder attention layers of the transformer.</p><h3>Scaled Dot Product Attention</h3><p>Scaled dot product attention, given in Equation (6), is a slightly modified version of dot product attention in which the similarity scores (<em>QKᵀ</em>) are divided by <em>√dₖ</em>. The reason is that, assuming <em>Q</em> and <em>K</em> are of shape dₛ × dₖ with entries that are independent random variables with mean 0 and variance 1, each entry of their dot product, being a sum of dₖ products, has mean 0 and variance <em>dₖ</em>. Note that <em>dₛ</em> is the sequence length and <em>dₖ</em> is the feature dimensionality. With this scaling, the scores keep unit variance regardless of <em>dₖ</em>, so a larger feature dimensionality does not push the softmax into saturation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/379/1*YVLpHFTtzvTWpCE3Syj4MQ@2x.png" /><figcaption>Equation 6</figcaption></figure><h3>Multi-Head Attention</h3><p>Multi-head attention is an extended form of scaled dot product attention with learnable parameters, applied in a multi-headed manner. Figure 5 shows attention with prior linear transformations. Wᵢ<sup>Q</sup>, Wᵢ<sup>K</sup>, and Wᵢ<sup>V</sup> are learnable parameters and are realized as linear layers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/475/1*mepyYBL8nUyV43GqKbkyyg@2x.png" /><figcaption>Figure 5. Attention with prior linear transformations.</figcaption></figure><p>Intuitively, the matrices resulting from the linear transformations are richer representations of the input sequences, so that the <em>similarity scores</em> and the <em>weighted sum</em> step can be improved via learning. Figure 6 illustrates the overall multi-head attention. The input sequences, <em>Q</em>, <em>K</em>, and <em>V</em>, are of shape dₛ × dₘ. After the linear transformations, the resulting matrices have <em>dₖ</em> dimensions in feature space (generally dᵥ = dₖ). 
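</p><p>A single head of this computation can be sketched as follows; the dimensions (dₛ = 4, dₘ = 8, dₖ = 2) and the random projection matrices are hypothetical stand-ins for the learned ones:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

d_s, d_m, d_k = 4, 8, 2              # sequence length, model dim, head dim
x = rng.standard_normal((d_s, d_m))  # input sequence (random stand-in)

# Random stand-ins for the learned projections W_i^Q, W_i^K, W_i^V.
W_q = rng.standard_normal((d_m, d_k))
W_k = rng.standard_normal((d_m, d_k))
W_v = rng.standard_normal((d_m, d_k))

Q, K, V = x @ W_q, x @ W_k, x @ W_v         # each of shape (d_s, d_k)
head = softmax(Q @ K.T / np.sqrt(d_k)) @ V  # scaled dot product attention
print(head.shape)                           # (4, 2)
```

<p>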
After performing attention over the transformed matrices, the resulting matrix is of shape dₛ × dₖ, and there are <em>h</em> of them, indexed by i ∈ {1,…,h}. They are concatenated along the feature dimension, so the shape dₛ × (h · dₖ) becomes dₛ × dₘ (with h · dₖ = dₘ). Finally, another linear transformation is applied to the resulting matrix.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/525/1*5cFVJuBbuvzciwH4t5jZOw@2x.png" /><figcaption>Figure 6. Multi-head attention.</figcaption></figure><p>It should be noted that multi-head attention is implemented in the form of self-attention in the encoder and decoder parts of the transformer architecture, as the <em>queries</em> (<em>Q</em>), <em>keys</em> (<em>K</em>), and <em>values</em> (<em>V</em>) are the same. In the encoder-decoder attention layers, the <em>keys</em> and <em>values</em> come from the encoder, and the <em>queries</em> come from the decoder, as illustrated by the example in the dot product attention section.</p><p>The <em>multi-head</em> form of attention provides the transformer with the ability to focus on associations within sequences (or between sequences) <strong>considering different aspects</strong>, e.g. semantic and syntactic ones.</p><h3>Conclusion</h3><p>In this text, the attention mechanism of the transformer architecture, i.e. <em>multi-head attention</em>, was explained. The method relies on the cosine similarity between word vectors. Multi-head attention can extract long-range relationships within or between sequences. Attention can be applied not only to text but also to images [4].</p><h3>References</h3><p>[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008.</p><p>[2] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.</p><p>[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.</p><p>[4] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.</p>]]></content:encoded>
        </item>
    </channel>
</rss>