<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Bilal Kabas on Medium]]></title>
        <description><![CDATA[Stories by Bilal Kabas on Medium]]></description>
        <link>https://medium.com/@bilalkabas?source=rss-cbf4f5f5c8de------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*iCyOGkxImW4ojhIg-48wdg.png</url>
            <title>Stories by Bilal Kabas on Medium</title>
            <link>https://medium.com/@bilalkabas?source=rss-cbf4f5f5c8de------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 05 Apr 2026 22:02:45 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@bilalkabas/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Attention Mechanism in Transformers]]></title>
            <link>https://medium.com/@bilalkabas/self-attention-in-transformers-9d631f2a076d?source=rss-cbf4f5f5c8de------2</link>
            <guid isPermaLink="false">https://medium.com/p/9d631f2a076d</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[attention]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Bilal Kabas]]></dc:creator>
            <pubDate>Sun, 02 Apr 2023 18:15:35 GMT</pubDate>
            <atom:updated>2023-04-02T18:15:35.906Z</atom:updated>
            <content:encoded><![CDATA[<p>Today’s advanced language models are based on the transformer architecture [1]. Beyond otherwise standard feed-forward (MLP) layers, the transformer relies on the attention mechanism. Attention gives the network the ability to extract long-range relationships within word sequences, which is crucial for natural language understanding. Prior to the main discussion, I will briefly explain word embeddings. To keep things simple, I will first introduce “<em>the basic self-attention</em>”, which I call it that way because it is the simplest form of self-attention. Then, I will elaborate on <em>scaled dot product attention</em> and <em>multi-head attention</em>.</p><blockquote>Attention is a general term and comes in different forms, e.g. additive attention [2]. Throughout this text, attention refers to “dot product attention”. It should also be noted that self-attention is a form of attention focusing on the relationships “within” a sequence.</blockquote><h3>Word Embedding</h3><p>Word embedding is a representation of words (or tokens) as vectors in a high-dimensional space. This representation can be learned from a large corpus, for example using word2vec [3], so that the resulting embedding vectors encode semantic relationships within the corpus. Specifically, the embedding vectors of two semantically related words are expected to be similar in direction, assuming that all embedding vectors are of unit length. Figure 1 illustrates word vectors in three-dimensional space. Such a learned representation of words can also reflect indirect relationships. For example, the vector differences (woman − man) and (queen − king), shown in red in Figure 1, are expected to be similar.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/425/1*S_nI8AAmLlU3llrZLDmfSg@2x.jpeg" /><figcaption>Figure 1. 
Illustration of word vectors in three-dimensional embedding space.</figcaption></figure><h3>Basic Self-attention</h3><p>I use the term <em>basic self-attention</em> to refer to the simplest form of self-attention. Equation (1) is the mathematical expression of basic self-attention, where <em>x</em> represents a sequence of words as illustrated in Figure 2, dₛ is the sequence length, i.e. the number of words (or tokens) in the input sequence, and dₘ is the dimensionality of the word embedding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/415/1*SpBrcDUHny0oJk7uzfA-6A@2x.png" /><figcaption>Equation 1</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/375/1*aHzOCmEhsNwbBv0KpmPasA@2x.png" /><figcaption>Figure 2. An example input sequence to transformers. For simplicity, the example sequence is at the word level, not the token level.</figcaption></figure><p>The expression <strong><em>softmax(xxᵀ)x</em></strong> can be understood in three steps:</p><p>(i) <strong><em>xxᵀ</em></strong>: similarity scores between each pair of words.<br>(ii) <strong><em>softmax(xxᵀ)</em></strong>: convert similarity scores into weights, i.e. values between 0 and 1 that sum to 1 along each row.<br>(iii) <strong><em>softmax(xxᵀ)x</em></strong>: weighted sum of features from the input itself (self-attention).</p><p>Let’s explore each step in detail.</p><p>Equation (2) shows the matrix multiplication that computes similarity scores between words. <strong>xⁱxʲᵀ</strong> gives the similarity score between the <em>i</em>th and <em>j</em>th words in the sequence. Thus, for dₛ words, there are dₛ × dₛ scores. Note that <strong><em>xxᵀ</em></strong> is a symmetric matrix.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/816/1*QJUszEsKgNkjnUXUT2FSPQ@2x.png" /><figcaption>Equation 2</figcaption></figure><p>The cosine similarity given in Equation (3) equals the dot product for vectors of unit length. 
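</p><p>As a quick sanity check that the dot product of unit-length vectors equals their cosine similarity (the vectors below are arbitrary stand-ins, not learned embeddings):</p>

```python
import numpy as np

# Two arbitrary word vectors (stand-ins, not learned embeddings),
# normalized to unit length.
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 1.0, 2.0])
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

dot = a @ b  # dot product
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity
print(np.isclose(dot, cos))  # True: identical for unit-length vectors
```

<p>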
Therefore, <strong>xⁱxʲᵀ</strong> is essentially the cosine similarity between two word vectors.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/177/1*E6nFoYEwPuCRyvre7jhfrA@2x.png" /><figcaption>Equation 3</figcaption></figure><blockquote>Cosine similarity is a measure of how similar two vectors are. When two vectors are similar, the angle (θ) between them is small, meaning that cos(θ) is large (it is bounded above by 1). Therefore, for unit-length vectors, the dot product directly measures vector similarity, as given in Equation (3).</blockquote><p>In the second step, we apply softmax to the similarity scores as given in Equation (4), where <em>P</em> is a dₛ × dₛ matrix. <em>Pⁱ</em> holds the similarities of the <em>i</em>th word to each word in the sequence, and the elements of <em>Pⁱ</em> sum to 1. In that sense, similarity scores are converted into weights using <em>softmax</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/261/1*BynGaoNz97A9OJ8Y7kMB4A@2x.png" /><figcaption>Equation 4</figcaption></figure><p>The last step is basically a <em>weighted sum</em>. <strong>Using the similarity weights, each word is reconstructed considering its relationship with all words in the sequence.</strong> Specifically, the <em>k</em>th feature (or element) of the <em>i</em>th word in the sequence is given by the dot product between <em>Pⁱ</em> and <em>xₖ</em>, where <em>xₖ</em> is the <em>k</em>th column of the input <em>x</em>. This dot product is basically a weighted average of the <em>k</em>th feature over all words in the sequence, using the similarity weights. In this way, <strong>the new representation of the <em>i</em>th word encodes its relationships to each word in the sequence.</strong> Figure 3 summarizes the weighted sum step.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/786/1*_PrRji6tkSUzCfakyoHebg@2x.png" /><figcaption>Figure 3. 
Weighted sum step in basic self-attention.</figcaption></figure><p>For example, considering the input sequence given in Figure 2, the first feature of the resulting representation of the word <em>attention</em> is the weighted sum of the first features of all words in the sequence, using the similarity scores as weights.</p><p>Figure 4 demonstrates a numerical example of the three steps of basic self-attention. The embedding space is assumed to be three-dimensional. In the example, the word vectors do not represent learned embeddings; they are random.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/759/1*P-s4SilSHyI5E7iOhZnVUA@2x.png" /><figcaption>Figure 4. A numerical example of the three steps of basic self-attention.</figcaption></figure><h3>Dot Product Attention</h3><p>The mathematical expression of dot product attention is given in Equation (5), where <em>Q</em>, <em>K</em>, and <em>V</em> are called <em>query</em>, <em>key</em>, and <em>value</em>, respectively, terminology inspired by databases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/206/1*Fat4uLEfGhfKBzFKzKQBuw@2x.png" /><figcaption>Equation 5</figcaption></figure><p>This is the generalized form of self-attention: when <em>Q = K = V</em>, the operation becomes self-attention and focuses on the relationships <em>within</em> the sequence. In the generalized form, however, attention can be performed across two different sequences. For example, assume that <em>Q</em> is from <em>sequence 1</em>, and <em>K</em> and <em>V</em> are from <em>sequence 2</em>. 
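</p><p>Under hypothetical shapes (4 query positions from <em>sequence 1</em>, 6 key/value positions from <em>sequence 2</em>, with random values standing in for learned representations), softmax(<em>QKᵀ</em>)<em>V</em> can be sketched in NumPy as:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

d_m = 8                            # embedding dimensionality
Q = rng.standard_normal((4, d_m))  # sequence 1: 4 positions (queries)
K = rng.standard_normal((6, d_m))  # sequence 2: 6 positions (keys)
V = rng.standard_normal((6, d_m))  # sequence 2: 6 positions (values)

weights = softmax(Q @ K.T)  # shape (4, 6): one weight row per query
out = weights @ V           # shape (4, 8): one output row per position of sequence 1
print(out.shape)            # (4, 8)
```

<p>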
This implies that <em>sequence 1</em> is going to be reconstructed using its relationships with <em>sequence 2</em>: each query from <em>sequence 1</em> is answered by a weighted sum of the values from <em>sequence 2</em>, which is what happens in the encoder-decoder attention layers of the transformer.</p><h3>Scaled Dot Product Attention</h3><p>Scaled dot product attention, given in Equation (6), is a slightly modified version of dot product attention in which the similarity scores (<em>QKᵀ</em>) are divided by <em>√dₖ</em>. The reason is that, assuming <em>Q</em> and <em>K</em> are of shape dₛ × dₖ with entries that are independent random variables with mean 0 and variance 1, each entry of their dot product, being a sum of dₖ products, has mean 0 and variance <em>dₖ</em>. Note that <em>dₛ</em> is the sequence length and <em>dₖ</em> is the feature dimensionality. With this scaling, the scores keep unit variance regardless of <em>dₖ</em>, so a larger feature dimensionality does not push the softmax into saturation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/379/1*YVLpHFTtzvTWpCE3Syj4MQ@2x.png" /><figcaption>Equation 6</figcaption></figure><h3>Multi-Head Attention</h3><p>Multi-head attention is an extended form of scaled dot product attention with learnable parameters, applied in a multi-headed manner. Figure 5 shows attention with prior linear transformations. Wᵢ<sup>Q</sup>, Wᵢ<sup>K</sup>, and Wᵢ<sup>V</sup> are learnable parameters and are realized as linear layers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/475/1*mepyYBL8nUyV43GqKbkyyg@2x.png" /><figcaption>Figure 5. Attention with prior linear transformations.</figcaption></figure><p>Intuitively, the matrices resulting from the linear transformations are richer representations of the input sequences, so that the <em>similarity scores</em> and the <em>weighted sum</em> step can be improved via learning. Figure 6 illustrates the overall multi-head attention. The input sequences, <em>Q</em>, <em>K</em>, and <em>V</em>, are of shape dₛ × dₘ. After the linear transformations, the resulting matrices have <em>dₖ</em> dimensions in feature space (generally dᵥ = dₖ). 
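</p><p>A single head of this computation can be sketched as follows; the dimensions (dₛ = 4, dₘ = 8, dₖ = 2) and the random projection matrices are hypothetical stand-ins for the learned ones:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

d_s, d_m, d_k = 4, 8, 2              # sequence length, model dim, head dim
x = rng.standard_normal((d_s, d_m))  # input sequence (random stand-in)

# Random stand-ins for the learned projections W_i^Q, W_i^K, W_i^V.
W_q = rng.standard_normal((d_m, d_k))
W_k = rng.standard_normal((d_m, d_k))
W_v = rng.standard_normal((d_m, d_k))

Q, K, V = x @ W_q, x @ W_k, x @ W_v         # each of shape (d_s, d_k)
head = softmax(Q @ K.T / np.sqrt(d_k)) @ V  # scaled dot product attention
print(head.shape)                           # (4, 2)
```

<p>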
After performing attention over the transformed matrices, the resulting matrix is of shape dₛ × dₖ, and there are <em>h</em> of them, indexed by i ∈ {1,…,h}. They are concatenated along the feature dimension, so the shape dₛ × (h · dₖ) becomes dₛ × dₘ (with h · dₖ = dₘ). Finally, another linear transformation is applied to the resulting matrix.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/525/1*5cFVJuBbuvzciwH4t5jZOw@2x.png" /><figcaption>Figure 6. Multi-head attention.</figcaption></figure><p>It should be noted that multi-head attention is implemented in the form of self-attention in the encoder and decoder parts of the transformer architecture, as the <em>queries</em> (<em>Q</em>), <em>keys</em> (<em>K</em>), and <em>values</em> (<em>V</em>) are the same. In the encoder-decoder attention layers, the <em>keys</em> and <em>values</em> come from the encoder, and the <em>queries</em> come from the decoder, as illustrated by the example in the dot product attention section.</p><p>The <em>multi-head</em> form of attention provides the transformer with the ability to focus on associations within sequences (or between sequences) <strong>considering different aspects</strong>, e.g. semantic and syntactic ones.</p><h3>Conclusion</h3><p>In this text, the attention mechanism of the transformer architecture, i.e. <em>multi-head attention</em>, was explained. The method relies on the cosine similarity between word vectors. Multi-head attention can extract long-range relationships within or between sequences. Attention can be applied not only to text but also to images [4].</p><h3>References</h3><p>[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008.</p><p>[2] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.</p><p>[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.</p><p>[4] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.</p>]]></content:encoded>
        </item>
    </channel>
</rss>