Stories by Ray Flanagan on Medium

Word Vectors & Word2Vec

Ray Flanagan — Thu, 21 Dec 2023 16:13:07 GMT

This article focuses on Word Vectors and the Word2Vec algorithm.

Word vectors are numerical representations (vectors) of words.

What’s the point? Can’t we just use raw words?

Before word-vectors, this was the approach to NLP problems. However, computers had a hard time understanding the relationships between words and phrases. Because word vectors are in a computer’s native language (numbers), language models can find a sense of meaning and relationship between words.

Let’s look at an example:

Here, the words in the graph are represented by a vector (like the one shown on the left).

Now, the amazing thing about word embeddings is that they have relationships with one another. Similar words are close together in vector space. For example, come and go are related words and have similar vectors.

Since word embeddings are vectors, we can also perform mathematical operations on them, which can yield some pretty interesting results. One classic example is the following:

Here, we have three vectors: one for king, one for man, and one for woman. And we are performing the operation: king - man + woman, which yields a vector similar to queen.

This demonstrates the capability of word vectors as they can retain a sense of semantic meaning and relationship.
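To make the arithmetic concrete, here is a minimal sketch using NumPy. The three-dimensional vectors are hand-picked, hypothetical values chosen so that the analogy works out exactly; real embeddings are learned from data, have far more dimensions, and only land near queen rather than exactly on it.

```python
import numpy as np

# Hypothetical toy vectors, chosen by hand for illustration only.
vectors = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man": np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
    "apple": np.array([0.0, 0.1, 0.1]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest(target, exclude=()):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


result = vectors["king"] - vectors["man"] + vectors["woman"]
# The query words themselves are excluded, as is common for analogy lookups.
closest = nearest(result, exclude=("king", "man", "woman"))
print(closest)
```

Cosine similarity is used here as the measure of closeness; other distance measures would also work for a sketch like this.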

How can we create word vectors? Word2Vec

Let’s outline how Word2Vec creates word vectors.

1. Each word in the corpus is given a random vector.
2. For each word in the corpus, calculate the probability of the current word (o) given the context words (c). This is done by calculating the similarity between the word vector representing o and the word vectors representing c.
3. The loss is then calculated based on these probabilities. The objective is to minimize the loss by maximizing the probability in Step 2.

Loss function for Word2Vec

4. Then, use gradient descent to update the word vectors.
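The four steps can be sketched in NumPy as follows. This is a simplified illustration rather than the reference Word2Vec implementation: it assumes a skip-gram-style setup that pairs each word with its neighbours, keeps two vector tables (one per role), uses a full softmax over a tiny made-up corpus, and picks the dimension, window size, and learning rate arbitrarily.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


def train(corpus, dim=8, window=2, lr=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    vocab = sorted(set(corpus))
    index = {w: i for i, w in enumerate(vocab)}

    # Step 1: every word starts with small random vectors.
    V = rng.normal(scale=0.1, size=(len(vocab), dim))  # vectors for c
    U = rng.normal(scale=0.1, size=(len(vocab), dim))  # vectors for o

    # (c, o) index pairs: each position paired with its window neighbours.
    pairs = [
        (index[corpus[i]], index[corpus[j]])
        for i in range(len(corpus))
        for j in range(max(0, i - window), min(len(corpus), i + window + 1))
        if i != j
    ]

    losses = []
    for _ in range(epochs):
        total = 0.0
        for c, o in pairs:
            v_c = V[c].copy()
            # Step 2: dot-product similarity turned into a probability.
            probs = softmax(U @ v_c)
            # Step 3: negative log-likelihood of the observed word.
            total -= np.log(probs[o])
            # Step 4: gradient descent on both vector tables.
            grad_scores = probs.copy()
            grad_scores[o] -= 1.0
            V[c] -= lr * (U.T @ grad_scores)
            U -= lr * np.outer(grad_scores, v_c)
        losses.append(total / len(pairs))
    return vocab, V, losses


corpus = "the king rules the land and the queen rules the land".split()
vocab, V, losses = train(corpus)
print(f"average loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

The average loss should fall as training proceeds, which is the "minimize the loss" objective from Step 3 in action.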

Transformer Architecture in NLP

Ray Flanagan — Sun, 11 Dec 2022 18:34:45 GMT

The Transformer, based solely on attention, is an alternative to an RNN model architecture. Many recent popular NLP models use the Transformer architecture (GPT and BERT, for example).

This article focuses on the Transformer Architecture as outlined in the paper Attention is All You Need (Vaswani et al 2017).

Model Architecture

Transformer Architecture (Vaswani et al 2017)

The Transformer in the diagram represents an encoder-decoder model.

Encoder

The base architecture for the Encoder is what is boxed to the left. It can be thought of as “encoding” or understanding information from the input.

The two main parts of the stack are

Multi-head attention layer
Feed-forward network

Decoder Architecture

The base architecture for the Decoder is what is boxed to the right and is used to generate output (in an encoder-decoder architecture, the decoder takes the encoder's output as one of its inputs).

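The stack is very similar to the encoder. The main difference is an extra layer at the beginning: Masked Multi-head Attention. The attention layer that follows it attends over the encoder's output rather than only over the decoder's own input.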

Attention

With previous RNN model architectures, it was hard to pay “attention” to future and previous words in the input sentence. Transformers use self-attention to fix this problem. For each word, self-attention calculates a weight for every other word in the sequence, both before and after the current word.

Multi-head Attention (Vaswani et al 2017)

Each of the Multi-head attention layers is made up of several Attention Layers (Scaled Dot-Product Attention). Let’s look into these layers and see how they work.

Scaled Dot-Product Attention

Although there are many ways to calculate attention, the method that is proposed in the paper is called Scaled Dot-Product attention.

Here, the similarity measure is the dot product between Q and K. The dot product is then divided by the square root of the dimension of the key vectors (d_k). Vaswani et al. explain that for large d_k the dot products grow large in magnitude, which pushes the softmax into regions with very small gradients; the scaling counteracts this.

The softmax of the scaled dot product returns a probability distribution: these are the attention weights. The weights are then multiplied by V to get the output of the attention layer.
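As a rough sketch of the calculation, the NumPy code below computes softmax(Q K^T / sqrt(d_k)) V. The sequence length, dimensions, and random inputs are made up for illustration; in a real model Q, K, and V come from learned projections of the input.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # one distribution per query
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 positions, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))  # d_v = 16

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)           # one d_v-sized vector per position
print(weights.sum(axis=-1))   # each row of weights sums to 1
```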

Masked Attention

As mentioned previously, the decoder adds an extra masked-attention layer. Masked attention applies a mask (a binary tensor) to the attention scores before the softmax, so that the decoder can only “pay attention” to previous words.
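One common way to implement the mask, sketched below with NumPy, is to set the scores for later positions to negative infinity before the softmax so that their weights become zero. The function name and the random inputs are assumptions made for this example, and it assumes self-attention where Q and K cover the same sequence.

```python
import numpy as np


def masked_self_attention(Q, K, V):
    seq_len, d_k = Q.shape[0], K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)

    # Lower-triangular binary mask: position i may see positions 0..i.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(mask, scores, -np.inf)

    # Softmax; exp(-inf) is 0, so masked positions receive zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 positions, made-up inputs
output, weights = masked_self_attention(x, x, x)
print(np.round(weights, 2))  # everything above the diagonal is zero
```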

Now that we’ve seen what a Transformer is and what it is made up of, let's see some examples of how it's used.

Encoder and Decoder models

The best example of an Encoder and Decoder model is for translation. The Encoder understands the input text and the decoder takes the encoded input and translates it into the output language.

Encoder models

Because Encoders attempt to understand the meaning of text, they are used in text analysis/classification.

Decoder models

Probably the best example of a decoder model is the GPT family. These models are primarily used for text generation.

Python Walrus Operator

Ray Flanagan — Thu, 24 Nov 2022 15:50:46 GMT

The walrus operator in Python was first proposed in PEP 572 and introduced in Python 3.8. Because it resembles a walrus (: being the eyes and = being the tusks), it is known as the walrus operator.

But what does := do, why is it useful, and when should you use it?

What does it do?

From the Python docs:

:= assigns values to variables as part of a larger expression

But what does this mean? Put simply, it lets you assign a value to a variable inside an expression, such as the condition of an if statement.

Let’s see an example from the docs with and without using the walrus operator.

if len(array) > 10:
  print(f'List is too long ({ len(array) } elements, expected <= 10)')

# Versus

if (l := len(array)) > 10:
  print(f'List is too long ({ l } elements, expected <= 10)')

Rather than calling len twice, we reuse the variable l, which the walrus operator defines inside the condition.

Although the change here seems small, the benefit of using walrus operators is more significant when more complex expressions are used.

For example, with Django:

# example 1

if Users.objects.filter(age__gte=10).exists():
  users = Users.objects.filter(age__gte=10)
  print(users.count())
  ...

# versus example 2

if (users := Users.objects.filter(age__gte=10)).exists():
  print(users.count())

Here we can see the real benefit of walrus operators. Rather than writing the filter twice, we only have to write it once. This is extremely beneficial if you want to change the filter; it reduces the number of places where you need to edit the statement.

You can also avoid repeating the statement without the walrus operator, by defining a variable outside of the if statement. For example:

# example 3

users = Users.objects.filter(age__gte=10)

if users.exists():
  print(users.count())

I find two problems with this solution.

1. It is an extra line compared to the walrus operator solution.
2. If you are only planning to use the users variable within the if statement, the intention of the variable is not immediately clear as it is defined outside of the if statement.

Why is this useful?

1. The walrus operator reduces the number of repeated expressions in and around an if statement.
2. Compared to example 3, the walrus operator solution makes it clear that the variable will be used inside of the if statement.

When should you use it?

When you find yourself repeating statements within the condition and inside of an if statement, you should consider using a walrus operator.
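A typical case of this pattern outside Django is regular-expression matching, sketched below with the standard library's re module. The log line and variable names are made up for illustration.

```python
import re

# Hypothetical log line used only for this example.
LOG_LINE = "2022-11-24 ERROR disk usage at 93%"

# Without the walrus operator, the search is written (and run) twice.
if re.search(r"(\d+)%", LOG_LINE):
    usage = int(re.search(r"(\d+)%", LOG_LINE).group(1))

# With it, the match object is bound and tested in a single expression.
if (match := re.search(r"(\d+)%", LOG_LINE)):
    usage_walrus = int(match.group(1))

print(usage, usage_walrus)
```

Both branches extract the same number; the second version simply avoids repeating the pattern and the search call.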

However, I would recommend resorting to example 3 if you plan to use the variable outside of the if statement as well. For example:

users = Users.objects.filter(age__gte=10)

if users.exists():
  print(users.count())

...

for user in users:
  ...

As compared to

if (users := Users.objects.filter(age__gte=10)).exists():
  print(users.count())

...

for user in users:
  ...

Although a walrus operator could be used here, I find that using it makes it less clear as to what your intentions are for the users variable.
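One detail worth keeping in mind here: a name bound with := is not scoped to the if statement. It lives in the enclosing function or module scope, so it stays available afterwards even when the condition is false. The small sketch below shows this with a hypothetical helper function and plain lists rather than a Django queryset.

```python
def first_long_word(words, min_length=6):
    """Return the first word with at least min_length characters, or None."""
    if (matches := [w for w in words if len(w) >= min_length]):
        print(f"{len(matches)} long word(s) found")

    # `matches` is still bound here, whether or not the branch above ran.
    return matches[0] if matches else None


print(first_long_word(["walrus", "if", "operator"]))
print(first_long_word(["a", "bc"]))
```

Because the variable outlives the if statement either way, the choice between the two styles is about signalling intent to the reader rather than about scope.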