Stories by Goyalpramod on Medium

GPT from scratch

Goyalpramod — Sun, 11 Aug 2024 04:35:32 GMT


We are finally here. I have been waiting for this moment for a long time. This is a continuation of the series in which I explain everything Andrej talks about in his Zero to Hero series. Kindly watch the video first, then return here to pick up anything you may have missed. I will assume that you already know the basics of ML and are here for the more intermediate and advanced ideas, because those are my main focus. For the more basic material, consider going back to the beginning of the series.

Now let us PROCEED!!!!

https://medium.com/media/2fec4f05487e91998e1efd8cd7f76e5c/href

Bi-gram generation

(the link to the colab notebook)

import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

This code block raised a few questions for me. Let's go through them one by one.

1. logits, loss = self(idx): forward is not a magic method, so why does this work?

Ans. When we create a class that inherits from nn.Module (as BigramLanguageModel does), we are leveraging a lot of functionality from PyTorch's neural network module.

Here’s what’s happening:

In Python, when we use parentheses after an object (like self(idx)), Python calls the object's __call__ method.
The nn.Module class, which BigramLanguageModel inherits from, implements a __call__ method.
This __call__ method in nn.Module is set up to call the forward method of your class (it also runs any registered hooks along the way).
So when we write self(idx), the result is the same as calling self.forward(idx).
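The dispatch described above can be imitated in plain Python. This is only a stripped-down stand-in, not PyTorch's actual code: TinyModule and Doubler are hypothetical names, and the real nn.Module.__call__ does more work (hooks, for example) before and after forward.

```python
class TinyModule:
    """Hypothetical, simplified imitation of how nn.Module dispatches calls."""

    def __call__(self, *args, **kwargs):
        # The real nn.Module.__call__ also runs hooks; here we only dispatch.
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class Doubler(TinyModule):
    def forward(self, x):
        return 2 * x


model = Doubler()
print(model(21))          # goes through TinyModule.__call__, then Doubler.forward
print(model.forward(21))  # same value, but skips anything __call__ adds
```

Because __call__ may do extra work, calling the module itself is generally preferred over calling forward directly.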

2. logits = logits[:, -1, :] what is this syntax?

Ans. This line uses NumPy-style array indexing in PyTorch, which is a powerful and concise way to select specific parts of a tensor. Let’s break it down:

logits[:, -1, :] is selecting data from the logits tensor, which is a 3D tensor with shape (B, T, C), where:

B is the batch size
T is the sequence length (or time steps)
C is the number of classes (or vocabulary size in this case)

Here’s what each part means:

: in the first position: This selects all elements along the first dimension (batch dimension). It keeps all batches.
-1 in the second position: This selects only the last element along the second dimension (time dimension). In Python, -1 as an index refers to the last element of a sequence.
: in the third position: This selects all elements along the third dimension (class dimension).

So, logits[:, -1, :] is effectively saying: "For all batches, take the last time step, and for that last time step, take all class scores (logits)."

This operation transforms the tensor from the shape (B, T, C) to (B, C), because:

It keeps all B batches
It selects only one time step (-1, the last one)
It keeps all C classes

This is typically done in language models when you only need the prediction that follows the last token in the sequence, which is the case for our bigram model. (We will change this in the future)
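To see the shape change without PyTorch, here is the same selection on nested Python lists. The numbers are made up, and the lists stand in for a (B, T, C) = (2, 3, 4) tensor.

```python
# A (B, T, C) = (2, 3, 4) "tensor" written as nested Python lists.
logits = [
    [[0.1, 0.2, 0.3, 0.4], [1.1, 1.2, 1.3, 1.4], [2.1, 2.2, 2.3, 2.4]],
    [[3.1, 3.2, 3.3, 3.4], [4.1, 4.2, 4.3, 4.4], [5.1, 5.2, 5.3, 5.4]],
]

# Equivalent of logits[:, -1, :]: for every batch item, keep the last time step.
last_step = [sequence[-1] for sequence in logits]

print(len(last_step), len(last_step[0]))  # the shape is now (B, C)
print(last_step)
```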

3. probs = F.softmax(logits, dim=-1) why is dim = -1?

Ans. This is a common practice in PyTorch and has a specific purpose:

It applies softmax across the class dimension, which is typically the last dimension in classification tasks.
This ensures that we get a probability distribution for each item in the batch over all classes.
It’s more flexible and less error-prone than hardcoding a specific dimension number.
If the tensor shape changes (e.g., from (B, C) to (B, T, C)), dim=-1 softmax will still be applied correctly over the class dimension.
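A small plain-Python sketch of what "softmax over the last dimension" means: each row of class scores is normalized independently, so every row becomes a probability distribution. The softmax helper here is my own illustration, not PyTorch's implementation.

```python
import math

def softmax(row):
    """Softmax over one list of logits (the last dimension)."""
    m = max(row)                            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]   # shape (B, C) = (2, 3)
probs = [softmax(row) for row in logits]      # one distribution per batch item
row_sums = [sum(row) for row in probs]
print(probs)
print(row_sums)
```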

It’s all about Adam

For some reason everyone is obsessed with Adam (sorry Eve), but why? People keep using it never bothering to explain the reason, yes it is the best. But what makes it the best? This section is going to be a deep dive into the workings and origins of our beloved Adam. Feel free to skip this part, it’s mostly for my own curiosity.

What was done in the biblical times (a bit of history on optimizers)

Gradient Descent:

This is the beginning of everything. We compute the gradient of the loss with respect to each weight, multiply it by a learning rate, and subtract the result from the weight. It works well for small datasets, where it converges and computes quickly. But it does not scale: computing the loss over an entire huge dataset for every update is slow, and it can get stuck in local minima or at saddle points.

One of the biggest problems optimizers have to fix is wiggling: when the learning rate is large, the weights keep jumping back and forth across the minimum. To fix this we have something called the exponential moving average.

Exponential Moving Average:

The Exponential Moving Average is a technique used in optimizers to smooth out fluctuations in the optimization process. It’s particularly useful in stochastic optimization where we deal with noisy gradients.

e_t = β * e_{t-1} + (1 - β) * x_t

Where:

e_t is the exponential average at time t
x_t is the new data point at time t
β (beta) is a parameter between 0 and 1 that determines the weight of past observations
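The formula translates almost line for line into code. In this sketch the starting value e_0 = 0 and the sign-flipping input are my own choices for illustration.

```python
def ema(values, beta=0.9):
    """e_t = beta * e_{t-1} + (1 - beta) * x_t, starting from e_0 = 0."""
    e = 0.0
    out = []
    for x in values:
        e = beta * e + (1 - beta) * x
        out.append(e)
    return out

noisy = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # a signal that flips sign every step
smoothed = ema(noisy, beta=0.9)
print(smoothed)
```

A larger beta gives more weight to the past and therefore a smoother, but slower to react, average.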

Taken from Medium article by Maciej Balawejder

As we can see, EMA smooths out the trajectory of our loss.

Momentum

Shamelessly copied from the author of the original article (by Maciej Balawejder)

Momentum combines EMA with Gradient Descent. This reduces oscillations and lowers the chance of getting stuck in a local minimum or at a saddle point.

How?

Momentum works by adding a fraction of the previous update vector to the current update. This creates a sort of ‘velocity’ for the parameter updates, which allows the optimization to build up speed in directions with consistent gradients. Here’s how this addresses the two main issues:

Reducing Oscillations: In areas where the gradient changes rapidly (like narrow ravines in the loss landscape), standard Gradient Descent tends to oscillate back and forth across the slopes. Momentum dampens these oscillations by averaging the gradients over time. This allows for faster progress along the ravine while reducing the back-and-forth movement.
Escaping Local Minima and Saddle Points: When the algorithm encounters a local minimum or a saddle point, the gradients become very small or zero. Standard Gradient Descent might get stuck in these areas. However, momentum allows the optimization to build up velocity. This means that even if the current gradient is small, the accumulated momentum from previous updates can help the algorithm push through these flat regions, potentially finding a better minimum.
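Here is a toy comparison of plain gradient descent and momentum. The quadratic "ravine", the learning rate and the momentum coefficient are all assumptions picked for illustration; the velocity update used is the common v = mu * v + g form.

```python
def grad(p):
    # Gradient of f(x, y) = 0.5 * (x**2 + 50 * y**2), a narrow "ravine".
    x, y = p
    return (x, 50.0 * y)

def loss(p):
    x, y = p
    return 0.5 * (x ** 2 + 50.0 * y ** 2)

def run_gd(p, lr=0.02, steps=100):
    for _ in range(steps):
        g = grad(p)
        p = (p[0] - lr * g[0], p[1] - lr * g[1])
    return p

def run_momentum(p, lr=0.02, mu=0.9, steps=100):
    v = (0.0, 0.0)                                   # the accumulated "velocity"
    for _ in range(steps):
        g = grad(p)
        v = (mu * v[0] + g[0], mu * v[1] + g[1])     # keep a fraction of the old update
        p = (p[0] - lr * v[0], p[1] - lr * v[1])
    return p

start = (1.0, 1.0)
loss_gd = loss(run_gd(start))
loss_momentum = loss(run_momentum(start))
print(loss_gd, loss_momentum)
```

Along the shallow x direction the gradient is small but always points the same way, so the velocity builds up and momentum ends with a lower loss after the same number of steps.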

A nice visualization will be (from popular stories on momentum)

“SGD is a walking man downhill, slowly but steady. Momentum is a heavy ball running downhill, smooth and fast.”

The problem with this, though, is that the learning rate stays constant, which is not ideal. As we discussed, the optimizer has velocity, so even when it is close to the minimum it can overshoot. Ideally we would like the learning rate to be adaptive. This is fixed by AdaGrad.

AdaGrad

Continuing my shamelessness because this is too beautifully made (if someone could teach me how to create these I would be really grateful) (P.S I found it, Use LaTeX)

AdaGrad has an adaptive learning rate. When a parameter's gradients have been large, it shrinks that parameter's learning rate a lot, which prevents overshooting; when the gradients have been small, it shrinks the learning rate only a little.

But AdaGrad faces a major issue: the learning rate keeps decreasing, and there may be times when we are unable to reach the minimum because the learning rate has come very close to zero. This is fixed by RMSProp.

RMSProp

“RMSProp is an upgraded version of AdaGrad that leverages mighty EMA(again). Instead of only accumulating the squared gradients, we control the amount of previous information. Thus the denominator won’t get large, and the learning rate won’t disappear!” [4]
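The difference between the two denominators is easy to see numerically. This is a sketch, not library code: the constant gradient of 1.0 and the hyperparameters are assumptions, and "effective learning rate" here means lr divided by the square root of the accumulator.

```python
import math

def effective_lrs(grads, lr=0.1, beta=0.9, eps=1e-8):
    """Return the per-step effective learning rates of AdaGrad and RMSProp."""
    acc_sum = 0.0   # AdaGrad: running SUM of squared gradients
    acc_ema = 0.0   # RMSProp: running EMA of squared gradients
    adagrad, rmsprop = [], []
    for g in grads:
        acc_sum += g * g
        acc_ema = beta * acc_ema + (1 - beta) * g * g
        adagrad.append(lr / (math.sqrt(acc_sum) + eps))
        rmsprop.append(lr / (math.sqrt(acc_ema) + eps))
    return adagrad, rmsprop

adagrad, rmsprop = effective_lrs([1.0] * 1000)
print(adagrad[-1], rmsprop[-1])
```

With a steady gradient, AdaGrad's rate keeps shrinking like 1/sqrt(t), while RMSProp's settles near the base learning rate.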

Adam

“Essentially Adam is a combination of Momentum and RMSProp. It has reduced oscillation, a more smoothed path, and adaptive learning rate capabilities. Combining those abilities makes it the most powerful and suitable for different problem optimizers.

The good starting configuration is learning rate 0.0001, momentum 0.9, and squared gradient 0.999.” [4]
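To make the "Momentum plus RMSProp" description concrete, here is a minimal single-parameter sketch of the Adam update with bias correction. The toy loss (w - 3)^2 and the learning rate of 0.1 are assumptions chosen so it converges quickly; they are not the recommended settings quoted above.

```python
import math

def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # Momentum part: EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g    # RMSProp part: EMA of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0)
print(w_final)
```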

AdamW (The man of the hour)

The added term here, alpha*lambda*Weight, is the decay term. An obvious question follows: if Adam was so perfect, why do we need Adam with weight decay?

Because there was a problem with Adam.

“Adaptive optimizers like Adam have become a default choice for training neural networks. However, when aiming for state-of-the-art results, researchers often prefer stochastic gradient descent (SGD) with momentum because models trained with Adam have been observed to not generalize as well.”[6]

I suggest you read reference 6 for a better and more in-depth understanding of AdamW, but what it essentially does is:

AdamW applies weight decay directly to the weights, separate from the gradient update.
This decoupling ensures that weight decay behaves consistently regardless of the adaptive learning rates.
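The two points above can be sketched for a single weight. Both step functions are simplified illustrations written for this article (adam_l2_step and adamw_step are hypothetical names), and the case shown has a zero loss gradient so that only the decay acts on the weight.

```python
import math

def adam_l2_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    g = g + wd * w                     # L2 penalty folded into the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    w = w - lr * wd * w                # decoupled decay: shrink the weight directly
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# One step from w = 1.0 with a zero loss gradient, for two decay strengths.
l2_small = adam_l2_step(1.0, 0.0, 0.0, 0.0, 1, wd=0.01)[0]
l2_big = adam_l2_step(1.0, 0.0, 0.0, 0.0, 1, wd=0.1)[0]
adamw_small = adamw_step(1.0, 0.0, 0.0, 0.0, 1, wd=0.01)[0]
adamw_big = adamw_step(1.0, 0.0, 0.0, 0.0, 1, wd=0.1)[0]
print(l2_small, l2_big, adamw_small, adamw_big)
```

In the L2 version the decay term is divided by sqrt(v_hat) like any other gradient, so in this example a ten times larger wd produces almost the same step. In the decoupled version the shrinkage is exactly lr * wd * w.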

Now back to our video

SELF-ATTENTION

This part absolutely baffled me, first question. Which looks easier to you and makes more sense?

version 3

# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

or version 2

# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)

Unless you are a maniac, you will say version 2. So why are we even doing version 3? (By the way, if they do not look the same to you, write out the equation of softmax and calculate the values one by one. DO IT. YOU WANT TO BE A BETTER AI ENGINEER.)

It didn't make any sense to me initially, but then it clicked. Version 2 is a straightforward normalization: every earlier token gets the same weight, so none of the richness of individual words is carried. This gets confusing because both versions produce exactly the same weights here.

But think of it more like this

#VERSION 2 

1 0 0
0.5 0.5 0
0.33 0.33 0.33

#then this gets multiplied with the actual embeddings

#VERSION 3 
{some_embedding value} 0 0
{some_embedding value} {some_embedding value} 0
{some_embedding value} {some_embedding value} {some_embedding value}

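Now using softmax on them introduces non-linearity, which we have learned so far is very important. It also means the weights no longer have to be a fixed average: whatever scores we put in place of the zeros decide how much each earlier token counts. (If it doesn't make sense now, I promise it will by the time you get to version 4.)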
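If you would rather check the equivalence in code than by hand, here is a plain-Python sketch for T = 3. The softmax helper and the example scores in the last line are my own illustration.

```python
import math

T = 3

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Version 2: lower-triangular ones, each row divided by its sum.
tril = [[1.0 if j <= i else 0.0 for j in range(T)] for i in range(T)]
wei_v2 = [[v / sum(row) for v in row] for row in tril]

# Version 3: start from zeros, mask the future with -inf, then softmax.
scores = [[0.0 if j <= i else float("-inf") for j in range(T)] for i in range(T)]
wei_v3 = [softmax(row) for row in scores]

# With non-zero scores, the same softmax gives non-uniform weights.
learned_row = softmax([0.0, 1.0, float("-inf")])
print(wei_v2)
print(wei_v3)
print(learned_row)
```

The first two matrices match. The last row shows what version 3 makes possible: once the scores are not all zero, the weights stop being a plain average.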

Now off to version 4 we go, the actual self-attention block. And oh boy, is it a tough nut to crack. What is going on is pretty easy to follow, but my question is more fundamental: why is this going on? Why does it work? What was the rationale of the creators, and how did they even think of this? Let us find out together.

# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

I found some great articles and videos that explain the concept far better than I can. (I have added a recommended watches and reads section.)

But the theory behind Q, K and V can be understood from the following Stack Overflow answer:

“The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on Youtube, the search engine will map your query (text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in their database, then present you the best matched videos (values).”

To see how this maps onto attention: each token emits a query (what it is looking for) and a key (what it contains). The model compares a token's query against the keys of the tokens it is allowed to see, and the closer the match, the more of that token's value it mixes into the output. (Watch the videos by StatQuest for a more in-depth understanding, here is a recommended one)
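The retrieval analogy can be turned into a tiny single-head example in plain Python. The q, k and v vectors are made-up numbers standing in for the outputs of the query, key and value layers, and, like the version 4 snippet, this sketch does not scale the scores by the head size.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) is 0.0, so masked slots vanish
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head causal attention over lists of vectors (no batch dimension)."""
    T = len(q)
    out = []
    for i in range(T):
        # Affinity between query i and each key j; future positions are masked.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) if j <= i else float("-inf")
            for j in range(T)
        ]
        weights = softmax(scores)
        # Output i is the weighted sum of the value vectors.
        out.append([
            sum(weights[j] * v[j][d] for j in range(T))
            for d in range(len(v[0]))
        ])
    return out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
print(out)
```

The first token can only see itself, so its output is exactly its own value vector. The second token's query matches its own key better than the first token's, so its output leans toward v[1].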

Conclusion

I have wanted to watch and fully understand this video since the day it came out, and I am happy that today I could. I won't say I understand 100 percent of it, but I believe I understand a good chunk. Thank you for reading this article. Here's to striving to be a better developer.

Some recommended watches and reads:

https://www.youtube.com/watch?v=zxQyTK8quyY
https://medium.com/@geetkal67/attention-networks-a-simple-way-to-understand-self-attention-f5fb363c736d
https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0
https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html (Honestly, read everything this guy has ever written)

References

1. Andrej and his amazing playlist link

2. Claude (hehe, thanks anthropic)

3. https://en.wikipedia.org/wiki/Stochastic_gradient_descen

4. https://medium.com/nerd-for-tech/optimizers-in-machine-learning-f1a9c549f8b4

5. https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html

6. https://towardsdatascience.com/why-adamw-matters-736223f31b5d

Andrej’s Zero to Hero for Dummies Part 5

Goyalpramod — Thu, 01 Aug 2024 08:42:42 GMT

Welp, happy to see you again. It is tough but it’s worth it. Let’s continue understanding ML in depth with the help of our lord and savior andrej DADDY. As always, watch his video first and come back here to help in your understanding.

https://medium.com/media/399f90ed2ed0073e2871f2a0635050f6/href

This one will be a bit different, as the lecture itself is filled with exercises. I will dive more into how I went about thinking through and trying to solve the exercises than into explaining things. It is more about the approach than the actual answer per se.

Exercise 1

The hardest thing for me to figure out was… how to even start? What is dlogprobs? Is it dlogprobs/dloss or the other way around?

The fundamental thing that we need to understand is

“Direction of gradients: In backpropagation, we’re calculating dloss/dx for each variable x, not dx/dloss. This is because we want to know how changes in each variable affect the loss, to update our parameters in the direction that reduces the loss.”

I ran into the second problem when we had to calculate dhprebn. In my head we were applying the chain rule in the traditional sense and just calculating the gradients, but it is more effective to think of each variable as a node in a graph.

Let me show you the question I had and the nice answer claude-3.5-sonnet gave me, which helped my understanding.

“I do not understand how dhprebn and dbnmeani were calculated

if we try to calculate for bhprebn

dloss/dbndiff*dbndiff/dbnprebn = 1 - dbnmeani/dbhprebn

and then we calculate using the above, how has andrej done it?”

I understand your confusion. Let's break this down step by step to clarify how Andrej derived these gradients.

First, recall the forward pass equations:
bnmeani = 1/n * hprebn.sum(0, keepdim=True)
bndiff = hprebn - bnmeani

In the backward pass, we're calculating gradients with respect to hprebn and bnmeani.
The key insight here is that hprebn affects the loss through two paths:
a) Directly through bndiff
b) Indirectly through bnmeani
For dhprebn:

The direct effect is simply dbndiff (because bndiff = hprebn - bnmeani)
There's no need to multiply by anything because the local derivative of bndiff with respect to hprebn is 1
The indirect contribution through bnmeani is added on top of this once dbnmeani is known


For dbnmeani:

We need to consider how changes in bnmeani affect bndiff
From the equation bndiff = hprebn - bnmeani, we see that increasing bnmeani decreases bndiff by the same amount
This is why we have a negative sign: -dbndiff
We sum over axis 0 because bnmeani is computed as a mean across this axis


The full equations:
dhprebn = dbndiff.clone()
dbnmeani = (-dbndiff).sum(0)


Your equation dloss/dbndiff * dbndiff/dbnprebn = 1 - dbnmeani/dbhprebn is mixing up the concepts a bit. We're not calculating dbnmeani/dhprebn, but rather how changes in bnmeani affect the loss through bndiff.
The key to understanding this is to focus on how each variable directly affects the next one in the computational graph, rather than trying to compute complex relationships across multiple steps at once. This is the essence of the backpropagation algorithm - it breaks down complex derivative calculations into simpler, local computations.
Andrej's approach here is correctly applying the chain rule in a way that's consistent with how backpropagation works in neural networks, considering all paths through which a variable can affect the loss.
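The two-path idea can be checked numerically on a single column of values. This is a simplified sketch and not the notebook's code: the toy loss is a hypothetical stand-in for everything downstream of bndiff, and the indirect path adds dbnmeani / n to every element because bnmeani is the mean of hprebn.

```python
def forward(hprebn, c):
    n = len(hprebn)
    bnmeani = sum(hprebn) / n
    bndiff = [h - bnmeani for h in hprebn]
    loss = sum(ci * d * d for ci, d in zip(c, bndiff))   # hypothetical toy loss
    return bnmeani, bndiff, loss

def backward(hprebn, c):
    n = len(hprebn)
    _, bndiff, _ = forward(hprebn, c)
    dbndiff = [2 * ci * d for ci, d in zip(c, bndiff)]   # dloss/dbndiff
    dbnmeani = -sum(dbndiff)                             # path through the mean
    # Direct path (dbndiff) plus the indirect path through bnmeani.
    return [g + dbnmeani / n for g in dbndiff]

def numeric_grad(hprebn, c, eps=1e-6):
    grads = []
    for i in range(len(hprebn)):
        up, down = list(hprebn), list(hprebn)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, c)[2] - forward(down, c)[2]) / (2 * eps))
    return grads

hprebn = [1.0, 2.0, 4.0]
c = [1.0, 0.5, 2.0]
analytic = backward(hprebn, c)
numeric = numeric_grad(hprebn, c)
print(analytic)
print(numeric)
```

The analytic gradient only matches the finite-difference estimate when both paths are included, which is the whole point of thinking in nodes.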

This gave me an aha moment, and everything became clearer.

After this, everything was crystal clear to me. If you still run into issues with the later exercises, feel free to drop a comment and I will try to help you out.

Andrej’s Zero to Hero for Dummies Part 4

Goyalpramod — Sat, 27 Jul 2024 06:16:24 GMT

Here we are again, continuing our journey of becoming better developers. This article will be a bit different. By now you must be accustomed to the basic things, so I will skip over a few sections and only talk about the ones that I believe are hard to grasp or can seem a bit perplexing to first-time “developers”.

As always, kindly watch the original video first. Then come back here, to aid with your understanding.

https://medium.com/media/866fb52e7ecbac5e41a392b7b6c089cd/href

The first problem

Random initialization gives too high a loss at the start, because the logits take extreme values and the network is confidently wrong. We would like the initial predictions to be close to a uniform distribution over the characters. That is why we multiply W2 and b2 by small numbers (this is the last layer, the one that feeds the softmax, which is why it is called the softmax issue).
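A quick way to see what a "good" initial loss looks like: with 27 possible characters and no knowledge at all, a uniform prediction gives a loss of -log(1/27). The cross_entropy helper and the example logits below are my own illustration.

```python
import math

def cross_entropy(logits, target):
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

vocab_size = 27
expected_initial_loss = -math.log(1.0 / vocab_size)

# All-equal logits give the uniform distribution and exactly that loss.
uniform_loss = cross_entropy([0.0] * vocab_size, target=3)

# One extreme logit on the wrong class: confidently wrong, much higher loss.
confident_wrong_loss = cross_entropy([10.0] + [0.0] * (vocab_size - 1), target=3)

print(expected_initial_loss, uniform_loss, confident_wrong_loss)
```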

Fixing the tanh problem

Tanh is a squashing function: it pushes large inputs to the extremes (-1 and 1), where its gradient is close to zero, so the information in those values is lost. To avoid this we scale down the weights, which keeps the values sent to tanh small and stops them from saturating.
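The saturation is visible directly in the local derivative of tanh, which is 1 - tanh(x)^2. The sample inputs here are arbitrary.

```python
import math

def tanh_grad(x):
    """Local derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.5, 2.0, 5.0):
    print(x, round(math.tanh(x), 4), round(tanh_grad(x), 4))
```

For small inputs most of the gradient passes through. For large inputs the output is pinned near 1 and almost no gradient flows back.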

calculating the init scale: “Kaiming init”

I found this part rather interesting because I was never introduced to all this, I followed the modern tutorials for ML while learning and I never knew about this issue.

So essentially, to keep the distribution of the activations close to a unit Gaussian, we scale the randomly initialized weights by the following factor:

“gain/square_root(fan_in)”

(But as Andrej mentioned one need not worry about these problems any more due to the modern innovations in ML)
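Here is a rough simulation of why the scale factor matters, using only the standard library. The fan_in of 100, the sample count and the seed are assumptions; the gain of 5/3 is the usual value for tanh.

```python
import math
import random

random.seed(0)
fan_in, samples = 100, 2000

def preactivation_std(weight_scale):
    """Std of sum_i x_i * w_i with x_i ~ N(0, 1) and w_i ~ N(0, 1) * weight_scale."""
    ys = []
    for _ in range(samples):
        y = sum(random.gauss(0, 1) * random.gauss(0, 1) * weight_scale
                for _ in range(fan_in))
        ys.append(y)
    mean = sum(ys) / samples
    return math.sqrt(sum((y - mean) ** 2 for y in ys) / samples)

std_unscaled = preactivation_std(1.0)                     # grows like sqrt(fan_in)
std_scaled = preactivation_std(1.0 / math.sqrt(fan_in))   # stays close to 1
gain = 5 / 3
std_tanh = preactivation_std(gain / math.sqrt(fan_in))    # close to the gain
print(std_unscaled, std_scaled, std_tanh)
```

Without scaling, the pre-activations spread out by roughly sqrt(fan_in), which is exactly what pushes tanh into its flat regions.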

Batch Normalization

When I went into the lecture I was pretty sure of my understanding of Batch Normalization, but by the end of it I was more confused than ever.

But then I went through all the code in the colab, and it brought me back to my understanding.

Google Colab

Let’s have a look at the following code from colab

  ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
  Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y
  
  # forward pass
  emb = C[Xb] # embed the characters into vectors
  embcat = emb.view(emb.shape[0], -1) # concatenate the vectors
  # Linear layer
  hpreact = embcat @ W1 #+ b1 # hidden layer pre-activation
  # BatchNorm layer
  # -------------------------------------------------------------
  bnmeani = hpreact.mean(0, keepdim=True)
  bnstdi = hpreact.std(0, keepdim=True)
  hpreact = bngain * (hpreact - bnmeani) / bnstdi + bnbias
  with torch.no_grad():
    bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
    bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi
  # -------------------------------------------------------------
  # Non-linearity
  h = torch.tanh(hpreact) # hidden layer
  logits = h @ W2 + b2 # output layer
  loss = F.cross_entropy(logits, Yb) # loss function

Batch normalization is nothing more than taking a batch, subtracting its mean, and dividing by its standard deviation, which normalizes that particular batch. The learned bngain and bnbias then scale and shift the result.

Now why are we keeping a running mean and std?

Quite simply, during inference (running the model) we are often not feeding a batch, but a single input. And we cannot calculate a meaningful mean and std from a single input (well, you can, but it would be pretty pointless).

So at inference we use the running mean and std accumulated during training. We nudge them only slightly at every step, because each batch has a slightly different mean and std.
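A standalone sketch of the running statistics, for one feature and without gain or bias. The data distribution (mean 3, std 2), the batch size and the number of steps are assumptions; the 0.999 / 0.001 update mirrors the colab code above.

```python
import math
import random

random.seed(0)

def batch_stats(batch):
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / (n - 1)   # unbiased, like torch's default std
    return mean, math.sqrt(var)

running_mean, running_std = 0.0, 1.0
for _ in range(5000):
    batch = [random.gauss(3.0, 2.0) for _ in range(32)]
    mean, std = batch_stats(batch)
    normalized = [(x - mean) / std for x in batch]        # training-time normalization
    running_mean = 0.999 * running_mean + 0.001 * mean    # slow-moving estimates
    running_std = 0.999 * running_std + 0.001 * std

# Inference on a single input: use the running statistics instead of batch ones.
single_input = 5.0
single_normalized = (single_input - running_mean) / running_std
print(running_mean, running_std, single_normalized)
```

After enough steps the running values settle close to the statistics of the whole dataset, which is what lets us normalize a lone example sensibly.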

Andrej’s Zero to Hero for Dummies Part 3

Goyalpramod — Mon, 22 Jul 2024 10:54:57 GMT

Continuing our epic adventure of understanding Andrej’s playlist in detail. As always I will urge you to go through the original video first. Then come back and join me on the journey to AI discovery.

https://medium.com/media/ac95e3aaf7b4bc9a34cd3098784e98a1/href

Intro

The idea I found confusing initially was why the number of possibilities grows exponentially when we move from bigrams to longer sequences. So let's break that down:

When we are considering bigram models, we only look at one character, the previous one, and predict the next character from it. Take the name “emma”: we break it down into “.e”, “em”, “mm”, “ma”, “a.”. Now mask the character we need to predict: “._”, “e_”, “m_”, “m_”, “a_”. So we predict _ from only one previous character, and that context can take 27 possible values.
Now take a trigram for “emma”. That becomes “.em”, “emm”, “mma”, “ma.”. Masking the last character gives “.e_”, “em_”, “mm_”, “ma_”. The first position of the context can be any of 27 characters and so can the second, giving 27*27 = 729 possible contexts. Every extra character of context multiplies the count by 27 again.

Bengio et al. 2003 (MLP language model) paper walkthrough

Link to the paper

They build a word-level language model
They embed 17,000 words in a 30-dimensional space
The objective is to maximise the log likelihood
The magic behind the model is that similar words end up close together in the embedding space
We convert each of these 17,000 words into a vector of dimension 30, so essentially we have a matrix of 17,000 rows and 30 columns. The researchers then take 3 words and try to predict the 4th one.
So the input layer has 3 * 30 = 90 neurons to begin with.
We have a fully connected hidden layer.
The output layer has 17,000 neurons, one per word, to show the predictions.

(re-)building our training dataset

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline

words = open('names.txt', 'r').read().splitlines()
words[:8]
len(words)

# build the vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)

block_size = 3 # context length: how many characters do we take to predict the next one?
X, Y = [], []
for w in words[:5]:
    print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '--->', itos[ix])
        context = context[1:] + [ix] # crop and append

X = torch.tensor(X)
Y = torch.tensor(Y)

(If at any point you believe I have not explained any point in detail, consider going through my previous blogs.)

The part I found confusing was for ch in w + '.'. Let's take the example of emma to understand this. It appends the end marker to turn emma into "emma." and then iterates over it character by character.

(I have talked about all the other code pieces in my previous blogs)

Implementing the embedding lookup table

This took me an hour to finally wrap my head around. Bring your A game and focus.

# Shape and dtype information
X.shape, X.dtype, Y.shape, Y.dtype
# Output: (torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

# Creating a random tensor
C = torch.randn((27, 2))

# Accessing a specific element
C[5]
# Output: tensor([0.1615, 1.3169])

# Shape of C[X]
C[X].shape
# Output: torch.Size([32, 3, 2])

# Accessing a specific element
X[13,2]
# Output: tensor(1)

# Using X to index into C
C[X][13,2]
# Output: tensor([ 1.0815, -0.3502])

The following code block will make everything crystal clear.

# If X contains:
X = torch.tensor([[0, 1, 2],
                  [3, 4, 5],
                  ...  # 30 more rows
                 ])

# And C contains:
C = torch.tensor([[a, b],
                  [c, d],
                  [e, f],
                  ...  # 24 more rows
                 ])

# Then C[X] would result in:
C[X] = torch.tensor([[[a, b], [c, d], [e, f]],
                     [[g, h], [i, j], [k, l]],
                     ...  # 30 more 3x2 matrices
                    ])

We are using the rows of C to embed each of the integers in X. That is all there is to it. But this did raise a question in my mind: why is it C[X] and not X[C] or something else? That is when I encountered something called “Fancy Indexing” or “Advanced Indexing”: indexing a tensor with a tensor of integers returns, for every integer, the corresponding row of the indexed tensor.

Implementing the hidden layer + internals of torch.Tensor: storage, views

# Assuming emb, W1, and b1 are already defined

# Line 1
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)

# Line 2
result1 = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)
print(result1.shape)  # torch.Size([32, 6])

# Line 3
result2 = torch.cat(torch.unbind(emb, 1), 1)
print(result2.shape)  # torch.Size([32, 6])

We learn that .cat is inefficient because it creates a whole new tensor in memory, whereas .view only changes how the existing storage is interpreted, without copying any data.

What I was more curious about was how .view(-1, 6) works and why. What I found was:

The -1 is a special argument that tells PyTorch to automatically calculate the size of this dimension.
PyTorch will figure out what number to put in place of -1 to make the reshape possible, given the total number of elements and the other dimension (6 in this case).
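The arithmetic behind the -1 can be written out in a few lines. infer_shape is a hypothetical helper for illustration, not a PyTorch function.

```python
def infer_shape(total_elements, shape):
    """Resolve a single -1 in `shape`, the way view(-1, 6) picks the missing size."""
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if shape.count(-1) != 1 or total_elements % known != 0:
        raise ValueError("shape is not compatible with the number of elements")
    return tuple(total_elements // known if dim == -1 else dim for dim in shape)

# emb has shape (32, 3, 2), so 32 * 3 * 2 = 192 elements in total.
print(infer_shape(32 * 3 * 2, (-1, 6)))
```

Since 192 / 6 = 32, the -1 becomes 32. A target such as (-1, 5) would fail, because 192 is not divisible by 5.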

The blog Andrej mentions.

Implementing the output layer

W2 = torch.randn((100,27))
b2 = torch.randn(27)

logits = h @ W2 + b2

Implementing the negative log likelihood loss

# Calculate probabilities
prob = counts / counts.sum(1, keepdim=True)

# Check the shape of prob
print(prob.shape)  # torch.Size([32, 27])

# Calculate loss
loss = -prob[torch.arange(32), Y].log().mean()
print(loss)  # tensor(17.2955)

This calculates the negative log-likelihood.
torch.arange(32) creates a tensor [0, 1, 2, ..., 31].
prob[torch.arange(32), Y] selects elements from prob using two indices: the first from torch.arange(32) and the second from Y. This effectively selects the predicted probability for the correct label for each item in the batch.
.log() takes the natural logarithm of these probabilities.
.mean() calculates the mean of these log probabilities.
The negative sign at the beginning turns this into a loss (we want to maximize log probability or minimize negative log probability).
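The same steps on a tiny hand-written example, with plain lists in place of tensors. The probabilities and labels are made up.

```python
import math

# Predicted probabilities for 3 examples over 3 classes (each row sums to 1).
prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
Y = [0, 1, 2]   # the correct class for each example

# Equivalent of prob[torch.arange(3), Y]: pick the probability of the correct class.
picked = [prob[i][Y[i]] for i in range(len(Y))]

# Negative mean log-probability.
loss = -sum(math.log(p) for p in picked) / len(picked)
print(picked, loss)
```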

Introducing F.cross_entropy and why

# Dataset shapes
X.shape, Y.shape  # dataset
# Output: (torch.Size([32, 3]), torch.Size([32]))

# Setting up reproducibility and initializing parameters
g = torch.Generator().manual_seed(2147483647)  # for reproducibility
C = torch.randn((27, 2), generator=g)
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]

# Count total number of parameters
sum(p.nelement() for p in parameters)  # number of parameters in total
# Output: 3481

# Forward pass
emb = C[X]  # (32, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (32, 100)
logits = h @ W2 + b2  # (32, 27)
counts = logits.exp()
# prob = counts / counts.sum(1, keepdim=True)
# loss = -prob[torch.arange(32), Y].log().mean()
loss = F.cross_entropy(logits, Y)
# Output: tensor(17.7697)

# Example of softmax calculation
logits = torch.tensor([-5, -3, 0, 100]) - 100
counts = logits.exp()
probs = counts / counts.sum()
probs
# Output: tensor([0.0000e+00, 1.4013e-45, 3.7835e-44, 1.0000e+00])
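One reason to prefer a built-in cross entropy is numerical stability. The sketch below is not PyTorch's implementation; it only shows the standard trick of subtracting the largest logit before exponentiating, with made-up logits.

```python
import math

def naive_nll(logits, target):
    counts = [math.exp(v) for v in logits]          # overflows for large logits
    return -math.log(counts[target] / sum(counts))

def stable_nll(logits, target):
    m = max(logits)                                 # shifting logits does not change softmax
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

logits = [-5.0, -3.0, 0.0, 1000.0]
try:
    naive = naive_nll(logits, 3)
except OverflowError:
    naive = None                                    # exp(1000) does not fit in a float
stable = stable_nll(logits, 3)
print(naive, stable)
```

On well-behaved logits the two functions agree; only the stable one survives a very large positive logit.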

Implementing the training loop, overfitting one batch

for p in parameters:
  p.requires_grad = True

for _ in range(1000):
    # forward pass
    emb = C[X]  # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (32, 100)
    logits = h @ W2 + b2  # (32, 27)
    loss = F.cross_entropy(logits, Y)
    
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # update
    for p in parameters:
        p.data += -0.1 * p.grad

print(loss.item())

Training on the full dataset, minibatches

for _ in range(1000):
    # minibatch construct
    ix = torch.randint(0, X.shape[0], (32,))
    
    # forward pass
    emb = C[X[ix]]  # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (32, 100)
    logits = h @ W2 + b2  # (32, 27)
    loss = F.cross_entropy(logits, Y[ix])
    
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # update
    for p in parameters:
        p.data += -0.1 * p.grad

print(loss.item())

When I saw this, the first question that came to my mind was: “32 is such a small subset of the entire sample space, how does iterating over this make our entire pipeline better?” So out I went to find the answer.

I found that the key is that each mini-batch gives a noisy but usually good-enough estimate of the gradient over the whole dataset, so every step still moves the parameters in a roughly useful direction. And while each iteration sees only a small subset, over enough iterations the model is exposed to a large portion of the dataset, possibly several times. One full pass over the entire dataset is called an “epoch”; here the batches are sampled at random, so the passes are only approximate. Multiple epochs allow the model to see each sample several times, refining its understanding with each pass.
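A small simulation of how much of a dataset random mini-batches touch. The dataset size of 1000 is a hypothetical number chosen to keep the run fast, and sampling is with replacement, like torch.randint.

```python
import random

random.seed(0)
dataset_size, batch_size, steps = 1000, 32, 1000

seen_counts = [0] * dataset_size
for _ in range(steps):
    for _ in range(batch_size):
        seen_counts[random.randrange(dataset_size)] += 1   # one random example

coverage = sum(1 for c in seen_counts if c > 0) / dataset_size
average_visits = sum(seen_counts) / dataset_size
print(coverage, average_visits)
```

How much gets covered depends on steps * batch_size relative to the dataset size, so a much larger dataset needs proportionally more steps before every example has been seen.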

Finding a good initial learning rate

Here Andrej shows a brute-force method of finding a good starting lr, and he also briefly shows learning rate decay: once the loss plateaus and we keep getting the same value, we decrease the learning rate so the loss can settle lower.
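Both ideas fit in a few lines of plain Python. The exponent range, the number of candidates and the point at which the rate drops are illustrative assumptions, and candidate_lrs and step_decay are hypothetical helper names.

```python
def candidate_lrs(low_exp=-3.0, high_exp=0.0, steps=1000):
    """Learning rates spaced evenly in the exponent, from 10**low_exp to 10**high_exp."""
    return [10 ** (low_exp + (high_exp - low_exp) * i / (steps - 1))
            for i in range(steps)]

def step_decay(iteration, base_lr=0.1, drop_at=100_000, factor=0.1):
    """Keep base_lr until `drop_at` iterations, then multiply it by `factor`."""
    return base_lr if iteration < drop_at else base_lr * factor

lrs = candidate_lrs()
print(lrs[0], lrs[-1])
print(step_decay(0), step_decay(150_000))
```

For the search, you would run one training step per candidate, record the loss, and pick a rate from the region where the loss is lowest before it blows up.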

// Created using Claude 3.5 Sonnet

import React, { useState, useEffect, useRef } from 'react';
import { Button } from '@/components/ui/button';
import { Slider } from '@/components/ui/slider';
import { Label } from '@/components/ui/label';

const LearningRateDecayOptimizationVisualization = () => {
  const canvasRef = useRef(null);
  const [learningRate, setLearningRate] = useState(0.1);
  const [decayRate, setDecayRate] = useState(0.01);
  const [isRunning, setIsRunning] = useState(false);
  const [iteration, setIteration] = useState(0);
  const [position, setPosition] = useState({ x: 75, y: 75 });

  const drawParabola = (ctx, width, height) => {
    ctx.beginPath();
    for (let x = 0; x < width; x++) {
      const y = 0.01 * (x - width / 2) ** 2 + 20;
      ctx.lineTo(x, y);
    }
    ctx.strokeStyle = 'blue';
    ctx.stroke();
  };

  const drawPoint = (ctx, x, y) => {
    ctx.beginPath();
    ctx.arc(x, y, 5, 0, 2 * Math.PI);
    ctx.fillStyle = 'red';
    ctx.fill();
  };

  useEffect(() => {
    const canvas = canvasRef.current;
    const ctx = canvas.getContext('2d');
    const width = canvas.width;
    const height = canvas.height;

    ctx.clearRect(0, 0, width, height);
    drawParabola(ctx, width, height);
    drawPoint(ctx, position.x, position.y);
  }, [position]);

  useEffect(() => {
    let animationId;

    const updatePosition = () => {
      if (isRunning) {
        setPosition((prevPos) => {
          const currentLR = learningRate / (1 + decayRate * iteration);
          const gradient = 0.02 * (prevPos.x - 150);
          const newX = prevPos.x - currentLR * gradient;
          const newY = 0.01 * (newX - 150) ** 2 + 20;
          return { x: newX, y: newY };
        });
        setIteration((prev) => prev + 1);
        animationId = requestAnimationFrame(updatePosition);
      }
    };

    if (isRunning) {
      animationId = requestAnimationFrame(updatePosition);
    }

    return () => cancelAnimationFrame(animationId);
  }, [isRunning, learningRate, decayRate, iteration]);

  const handleToggle = () => setIsRunning(!isRunning);

  const handleReset = () => {
    setIsRunning(false);
    setIteration(0);
    setPosition({ x: 75, y: 75 });
  };

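  // NOTE: the original markup was lost in extraction. The JSX below is a
  // minimal reconstruction; the canvas size and slider ranges are assumptions.
  return (
    <div>
      <h2>Learning Rate Decay Optimization</h2>
      <canvas ref={canvasRef} width={300} height={250} />
      <Label>Learning rate: {learningRate}</Label>
      <Slider value={[learningRate]} min={0.01} max={1} step={0.01}
        onValueChange={(v) => setLearningRate(v[0])} />
      <Label>Decay rate: {decayRate}</Label>
      <Slider value={[decayRate]} min={0} max={0.1} step={0.001}
        onValueChange={(v) => setDecayRate(v[0])} />
      <Button onClick={handleToggle}>{isRunning ? 'Pause' : 'Start'}</Button>
      <Button onClick={handleReset}>Reset</Button>
      <p>Iteration: {iteration}</p>
    </div>
  );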
};

export default LearningRateDecayOptimizationVisualization;

Run the above code to see how lr and decay affect loss in real-time.
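If you do not have a React setup at hand, here is a small Python sketch of the same idea. It reuses the decay rule (lr / (1 + decay * iteration)) and the parabola from the component; the base learning rate of 10 and the decay rates are values I picked for illustration, not anything from the lecture.

```python
def inverse_time_lr(base_lr, decay_rate, iteration):
    # Same rule as currentLR in the component above.
    return base_lr / (1 + decay_rate * iteration)

def descend(base_lr, decay_rate, steps, x=75.0):
    # Gradient descent on f(x) = 0.01 * (x - 150)**2 + 20 (minimum at x = 150).
    for i in range(steps):
        grad = 0.02 * (x - 150)
        x -= inverse_time_lr(base_lr, decay_rate, i) * grad
    return x

no_decay = descend(base_lr=10.0, decay_rate=0.0, steps=100)
with_decay = descend(base_lr=10.0, decay_rate=0.5, steps=100)
print(no_decay, with_decay)
```

On this simple bowl the run without decay gets closer to 150, which shows the flip side of decay: if the learning rate shrinks too early, progress stalls before we reach the minimum.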

Splitting up the dataset into train/val/test splits and why

Andrej mentions that “if train loss and dev loss are same, then the size of the hidden layer can be increased”. I had never heard about this, nor did I have any clue.

I found a rather interesting answer

“When the training and dev losses are very close, it often indicates that the model has reached its capacity to learn from the given data. The model is fitting the training data as well as it can, but it’s not overfitting (which would be indicated by a much lower training loss compared to the dev loss).”
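For reference, here is a minimal sketch of how such a split can be made. The 80/10/10 proportions and the placeholder word list are assumptions for illustration; swap in the real list of names.

```python
import random

# Hypothetical stand-in for the list of names.
words = [f"name{i}" for i in range(1000)]

random.seed(42)
random.shuffle(words)          # shuffle so each split is a random sample

n1 = int(0.8 * len(words))     # end of the training split
n2 = int(0.9 * len(words))     # end of the dev (validation) split

train = words[:n1]             # used to fit the parameters
dev = words[n1:n2]             # used to tune hyperparameters like hidden size
test = words[n2:]              # touched rarely, for the final evaluation

print(len(train), len(dev), len(test))
```

The important part is that the three lists never overlap, so the dev and test losses tell us how the model does on names it has not trained on.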

Experiment: larger hidden layer

Andrej checks how the loss changes when the hidden layer is made larger. The train and dev losses still stay close to each other.

Visualizing the character embeddings

Screenshot from Andrej’s lecture

Experiment: larger embedding size

We observe that the training and dev losses slowly start to diverge, a sign that the model is beginning to overfit. Another possible way to improve the performance of the model is to give it more than 3 characters of context.

Summary of our final code, conclusion

Andrej summarises the entire video!!

Sampling from the model

# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
      emb = C[torch.tensor([context])] # (1,block_size,d)
      h = torch.tanh(emb.view(1, -1) @ W1 + b1)
      logits = h @ W2 + b2
      probs = F.softmax(logits, dim=1)
      ix = torch.multinomial(probs, num_samples=1, generator=g).item()
      context = context[1:] + [ix]
      out.append(ix)
      if ix == 0:
        break
    
    print(''.join(itos[i] for i in out))

Google collab (new!!) notebook advertisement

https://colab.research.google.com/drive/1YIfmkftLrz6MPTOO9Vwqrop2Q5llHIGK?usp=sharing The google collab link!!

Andrej’s Zero to Hero for Dummies Part 2

Goyalpramod — Tue, 02 Jul 2024 08:16:44 GMT

This is in continuation of my Zero to Hero for Dummies series, where I explain everything in Andrej’s series, so even a beginner can pick it up.

As always I urge you first to watch the original video, after which you can return to the blog to understand it better.

https://medium.com/media/f42d471a186af00b8035ecd204999c5d/href

Intro

GitHub - karpathy/makemore: An autoregressive character-level language model for making more things

“makemore takes one text file as input, where each line is assumed to be one training thing, and generates more things like it. Under the hood, it is an autoregressive character-level language model, with a wide choice of models from bigrams all the way to a Transformer (exactly as seen in GPT). For example, we can feed it a database of names, and makemore will generate cool baby name ideas that all sound name-like, but are not already existing names. Or if we feed it a database of company names then we can generate new ideas for a name of a company. Or we can just feed it valid scrabble words and generate english-like babble.”

Makemore is a character-to-character sequence model that generates characters by taking characters as input.

Autoregressive

“Auto” means “self” or “own.”
“Regressive” relates to regression, which involves predicting values.

So, an autoregressive model is one that regresses (predicts) based on its own past values.

Reading and exploring the dataset

words = open('names.txt', 'r').read().splitlines()

open is Python’s built-in function for opening a file. We read its entire contents with .read(), and .splitlines() breaks that string into a list at each newline. Otherwise, we would have a single string that looks like emma\nolivia….

words[:10] # the first 10 names, i.e. indices 0 to 9 (Python is 0-indexed and the slice stops before index 10)

min(len(w) for w in words) # get the length of the shortest word

max(len(w) for w in words) # get the length of the longest word

We have to understand that a lot of information is packed into a single word. Take the example of “isabella”, as Andrej mentioned.

What the model will learn is that a name is likely to start with i, the letter s is likely to follow i, the letter a is likely to follow is, and so on. And after isabella, the word is likely to end.

for w in words[:3]:
    for ch1,ch2 in zip(w,w[1:]):
        print(ch1,ch2)

I found this beautiful, simple, and yet complex. Let us go step by step thinking about what is going on.

Let us just work with the first word emma so internally what happens is

w = "emma" and w[1:] = "mma"

If you remember, when you apply a for loop to a string, Python iterates over each character.

Which looks like the following:

for ch in "emma":
    print(ch)

#e
#m
#m
#a

So when we do the following

for ch1,ch2 in zip('emma','mma'):

Here zip pairs up the characters position by position: ch1 comes from ‘emma’ and ch2 from ‘mma’. It stops as soon as the shorter of the two runs out.
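A quick way to see this for yourself is to turn the zip into a list:

```python
w = "emma"

# zip pairs the i-th character of w with the i-th character of w[1:]
# and stops when the shorter string ("mma") is exhausted.
pairs = list(zip(w, w[1:]))
print(pairs)  # [('e', 'm'), ('m', 'm'), ('m', 'a')]
```

A 4-letter word gives 3 pairs, because the last letter has nothing after it to pair with.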

Counting bigrams in a Python dictionary

b = {}
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1,ch2 in zip(chs,chs[1:]):
        bigram = (ch1,ch2)
        b[bigram] = b.get(bigram, 0) + 1

Here we initialise b as an empty dictionary. b[bigram] = ... stores a value under the key bigram. On the right side, b.get(bigram, 0) retrieves the current count for that key, or 0 if the key does not exist yet, and we add 1 to it.
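Here is the same counting pattern on a tiny list of two names that I made up, next to collections.Counter from the standard library, which does the same bookkeeping for us. Counter is my addition for comparison; Andrej uses the plain dictionary.

```python
from collections import Counter

words = ['emma', 'ava']  # tiny stand-in for the real dataset

b = {}
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)
        # current count (0 if we have never seen this bigram) plus one
        b[bigram] = b.get(bigram, 0) + 1

# The same counts, built with Counter.
c = Counter()
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    c.update(zip(chs, chs[1:]))

print(b[('a', '<E>')])  # 2, since both names end in 'a'
```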

sorted(b.items(), key = lambda kv: -kv[1])

~~Let's break this down into its individual components~~

~~sorted() is a built-in Python function that returns a new sorted list from the given iterable.~~
The b.items() method returns a view of the dictionary’s items as (key, value) tuples. For example, if b = {('a', 'b'): 3, ('b', 'c'): 2}, then list(b.items()) would be [(('a', 'b'), 3), (('b', 'c'), 2)] .
~~The key parameter in sorted() specifies a function to be called on each list element prior to making comparisons~~
~~lambda kv: -kv[1] is an anonymous function that takes one argument kv (each key-value pair from b.items()).~~
~~kv[1] accesses the second element of each tuple (the value, which is the count in this case). The minus sign - negates this value.~~
~~In the end, each value is negated. And we are returned a new list which is sorted with the key value pairs, with the greatest number coming first and so on.~~

Counting bigrams in a 2D torch tensor (“training the model”)

N = torch.zeros((28,28), dtype=torch.int32)

~~Here we use pytorch to create a 2d matrix with 28 rows and columns, and of the datatype int32.~~

chars = sorted(list(set(''.join(words))))
stoi = {s:i for i,s in enumerate(chars)}
stoi['<S>'] = 26
stoi['<E>'] = 27

~~''.join(words) joins all the words together with no separation between them. So for example if we take words=['emma','jhon'] and did '_'.join(words) the output will be emma_jhon .~~

~~set gets rid of duplicates, list converts it into a list, and sorted returns a new final sorted list.~~

~~To create stoi we are doing list comprehension. It is a short hand that does a for-loop iteration over a list, enumerate is an in-built Python that iterates over a list and keeps a count as well.~~

b = {}
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1,ch2 in zip(chs,chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1,ix2] += 1

The simplest way to understand this is to start small. Let’s say you have 4 pairs of numbers, (0,1), (1,0), (0,0) and (1,1), each having occurred once. Now I want to create a matrix telling me how many times each pair occurs. So I make something like the following:

~~Created using Claude 3.5 sonnet~~

~~Now if I had to find out the number of occurrences of the (0,1) tuple. I will go to the 1st row (0) and to the 2nd column (1), Which will give me back 1.~~
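The small example above can be written out with plain Python lists (a sketch using nested lists instead of a torch tensor):

```python
pairs = [(0, 1), (1, 0), (0, 0), (1, 1)]

# 2x2 matrix of zeros: the row is the first number, the column is the second.
N = [[0, 0],
     [0, 0]]

for first, second in pairs:
    N[first][second] += 1

print(N)        # [[1, 1], [1, 1]]
print(N[0][1])  # 1, the number of times the pair (0, 1) occurred
```

The 28x28 tensor works the same way, except that the indices come from stoi instead of being the numbers themselves.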

Visualizing the bigram tensor

itos = {i:s for s,i in stoi.items()}

~~I have laid out everything for you to be able to understand this on your own. Take some time, and think what is going here step by step.~~

~~Hint: Think what will stoi.items return!!~~

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')
for i in range(28):
    for j in range(28):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray')
        plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray')
plt.axis('off');

~~%matplotlib inline is a Jupyter notebook magic command that ensures plots are displayed inline in the notebook.~~

~~plt.figure(figsize=(16,16)) creates a new figure with a size of 16x16 inches.~~

~~plt.imshow(N, cmap='Blues')displays the 2D array N as an image. cmap='Blues' sets the color map to shades of blue, where darker blue represents higher values.~~

~~plt.text(j, i, chstr, ha="center", va="bottom", color='gray'):~~

~~Adds text to the plot at position (j, i).~~
~~The text is the character pair chstr.~~
~~It’s centered horizontally and aligned to the bottom vertically, in gray color.~~

~~plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray') The text is the value from N[i, j], converted to a Python scalar with .item().~~

~~plt.axis('off') Turns off the axis labels and ticks.~~

Deleting spurious (S) and (E) tokens in favor of a single . token

chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}

for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray')
        plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray')
plt.axis('off');

~~This section is pretty much self-explanatory and we have already covered most of the code, so we will be moving forward. (Feel free to drop a comment if you still do not get it though!!)~~

Sampling from the model

 N[0]

~~This returns the occurrence of each bigram in our matrix~~

p = N[0].float()
p = p / p.sum()
p

~~We convert the int32 to float and normalize it by dividing it by the sum of occurrences. Now we have the probability of each bigram occurring.~~

g = torch.Generator().manual_seed(2147483647)
p = torch.rand(3, generator=g)
p = p / p.sum()
p

g is a generator with a manual seed given (this makes the result reproducible). Then torch.rand() generates 3 random numbers between 0 and 1. We normalize them, to get the probability of each one occurring.

torch.multinomial(p, num_samples=100, replacement=True, generator=g)

~~The final line uses torch.multinomial to sample from this distribution:~~

~~p: the probability distribution to sample from~~
~~num_samples=100: it will draw 100 samples~~
~~replacement=True: sampling with replacement (can pick the same index multiple times), (otherwise the number of samples keeps decreasing)~~
~~generator=g: uses the seeded random generator for reproducibility~~
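If you want to play with the idea without PyTorch, the standard library has a close analogue in random.choices. This is only a sketch: it is not torch.multinomial, the counts are made up, and the seeded stream will not match the numbers Andrej gets.

```python
import random

counts = [1, 2, 7]                 # hypothetical counts for 3 outcomes
total = sum(counts)
p = [c / total for c in counts]    # normalize to probabilities

rng = random.Random(2147483647)    # seeded, so the draws are reproducible
# Draw 100 indices with replacement, index i being picked with probability p[i].
samples = rng.choices(range(len(p)), weights=p, k=100)

freq = [samples.count(i) for i in range(len(p))]
print(p, freq)
```

With 100 draws the frequencies should land in the neighbourhood of the probabilities, and they get closer as you draw more samples.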

g = torch.Generator().manual_seed(2147483647)

for i in range(20):
    out = []
    ix = 0
    while True:
        p = N[ix].float()
        p = p / p.sum()
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))

~~We start with ix = 0 because we replaced the <S> start token with ., which has index 0. After that we keep updating ix by using the generator to sample the next index from the distribution p.~~

Efficiency! vectorized normalization of the rows, tensor broadcasting

P = N.float()
P /= P.sum(1, keepdims=True)

~~These 2 lines are pretty information-packed. Even Andrej took his time explaining it and showing what happens if we are not careful. Let us break this down.~~

If we just did P.sum(), we would get the sum of the entire 2D matrix and a single number would be returned. For example, take a 2D tensor like arr_ = torch.tensor([[1,1],[1,1]]); arr_.sum() gives back 4. To avoid this, we pass the dimension along which to sum: P.sum(1) adds up the entries of each row. keepdims=True keeps that dimension with size 1, so the result has shape (27, 1) rather than (27,). This matters for broadcasting: with shape (27, 1) each row is divided by its own sum, whereas with shape (27,) the sums would be lined up against the columns and we would silently normalize the wrong way.

Also, we write P /= ... instead of P = P / ... because the in-place version is more memory efficient: P = P / ... would allocate a new tensor for the result. We want to be as efficient as possible, which adds up in the long run.
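To make the keepdims pitfall concrete, here is a sketch that imitates the two broadcasting outcomes with plain nested lists. The 2x2 counts are made up, and the list comprehensions only mimic what PyTorch does; they are not PyTorch code.

```python
N = [[1, 3],
     [6, 2]]
row_sums = [sum(row) for row in N]   # [4, 8]

# keepdims=True: the sums form a column, so entry (i, j) is divided
# by the sum of ROW i. Every row becomes a probability distribution.
correct = [[N[i][j] / row_sums[i] for j in range(2)] for i in range(2)]

# keepdims=False: the sums form a flat vector that lines up with the
# columns, so entry (i, j) is divided by the sum of row j. No error is
# raised, but the rows no longer sum to 1.
wrong = [[N[i][j] / row_sums[j] for j in range(2)] for i in range(2)]

print([sum(r) for r in correct])  # [1.0, 1.0]
print([sum(r) for r in wrong])    # [0.625, 1.75]
```

This is the bug Andrej warns about: the code runs fine either way, so the only way to catch it is to check that the rows of P sum to 1.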

Loss function (the negative log likelihood of the data under our model)

# GOAL: maximize likelihood of the data w.r.t. model parameters (statistical modeling)
# equivalent to maximizing the log likelihood (because log is monotonic)
# equivalent to minimizing the negative log likelihood
# equivalent to minimizing the average negative log likelihood

# log(a*b*c) = log(a) + log(b) + log(c)

~~Let us first talk about log and some of its properties then we can proceed with talking about the Maximum Likelihood Estimation.~~

~~From the definition of log, we know if a = b^c then log(a) base b = c . (The following visualization makes it simple to understand)~~

~~Created using Claude 3.5 Sonnet~~

~~log(a*b) = log(a) + log(b)~~

~~The reason is quite simple, let us break it down.~~

# Proof of Logarithm Product Rule: log(a*b) = log(a) + log(b)

Let's prove that log(a*b) = log(a) + log(b) for any positive real numbers a and b, and for any positive base c (where c ≠ 1).

## Given:
- Let y = log_c(a*b)
- Let u = log_c(a)
- Let v = log_c(b)

## Step 1: Express a and b in terms of u and v
By the definition of logarithms:
- c^u = a
- c^v = b

## Step 2: Express (a*b) in terms of u and v
(a*b) = c^u * c^v

## Step 3: Use the property of exponents
c^u * c^v = c^(u+v)

## Step 4: Express y in terms of u and v
Now we have:
y = log_c(a*b) = log_c(c^(u+v))

## Step 5: Apply the definition of logarithms
By the definition of logarithms, if c^x = z, then log_c(z) = x
Therefore:
y = u + v

## Step 6: Substitute back the original expressions
y = u + v
log_c(a*b) = log_c(a) + log_c(b)

Thus, we have proven that log(a*b) = log(a) + log(b).
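A quick numerical check of the rule, plus the practical reason we care about it. The probabilities here are made up for illustration.

```python
import math

a, b = 0.2, 0.5
lhs = math.log(a * b)
rhs = math.log(a) + math.log(b)
print(lhs, rhs)  # the two agree up to floating-point rounding

# Why it matters: the likelihood is a product of many small probabilities.
probs = [0.01] * 1000
product = 1.0
for p in probs:
    product *= p                 # underflows to exactly 0.0

log_sum = sum(math.log(p) for p in probs)   # stays a perfectly usable number
print(product, log_sum)
```

Multiplying a thousand small probabilities is too small for a float to represent, while the sum of their logs is just a moderately large negative number. That is why we work with the log likelihood.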

~~Now what do we mean when we say log is monotonic. It simply means that the function either keeps increasing or decreasing.~~

Now what do we mean by Maximum Likelihood Estimation? In simple terms, Maximum Likelihood Estimation (MLE) is a method used to find the best values for the parameters of a statistical model. MLE tries to answer the question: “What parameter values would make our observed data most likely to occur?”

~~Here’s how it works:~~

~~We start with some observed data.~~
~~We have a model with parameters that we want to estimate.~~
~~MLE adjusts these parameters to find the values that make our observed data the most probable.~~
~~It does this by maximizing a special function called the likelihood function, which measures how well our model explains the data.~~
~~The parameter values that give the highest likelihood are our best estimates.~~

So, instead of minimizing error, MLE is about maximizing the probability of seeing our actual data. It’s like fine-tuning the model to match our observations as closely as possible, given the type of model we’re using.

log_likelihood = 0.0
n = 0

for w in words[:3]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n=}')
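To see that counting really is the “maximum likelihood” choice, here is a plain Python sketch that compares the average negative log likelihood of the count-based probabilities with a model that guesses uniformly. The three names are a tiny stand-in I picked for the dataset.

```python
import math
from collections import Counter

words = ['emma', 'olivia', 'ava']  # tiny stand-in for names.txt

pairs = []
for w in words:
    chs = ['.'] + list(w) + ['.']
    pairs.extend(zip(chs, chs[1:]))

counts = Counter(pairs)                          # bigram counts, like N
row_totals = Counter(ch1 for ch1, _ in pairs)    # row sums of N
vocab = sorted(set(ch for pair in pairs for ch in pair))

def avg_nll(prob):
    # average negative log likelihood of the data under `prob`
    return -sum(math.log(prob(c1, c2)) for c1, c2 in pairs) / len(pairs)

nll_counts = avg_nll(lambda c1, c2: counts[(c1, c2)] / row_totals[c1])
nll_uniform = avg_nll(lambda c1, c2: 1 / len(vocab))
print(nll_counts, nll_uniform)
```

The count-based model gets the lower loss on the data it was counted from, and the uniform model scores exactly log(vocabulary size), which is a handy baseline to remember.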

Model smoothing with fake counts

P = (N+1).float()
P /= P.sum(1, keepdims=True)

Smoothing a model by adding fake counts, often called “additive smoothing” or “Laplace smoothing,” is a technique used to handle the problem of zero probabilities in statistical models, especially in natural language processing and machine learning.

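This takes care of bigrams that never occur in the data, like jq in our sample space. Their probability would be exactly zero, the log of zero is negative infinity, and so the loss would be infinite for any name containing that bigram. By adding 1 to every count, every bigram gets a small nonzero probability and the whole distribution is smoothed.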
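Here is a small sketch of that effect on a single row of counts. The counts are made up, and the helper name normalize is mine.

```python
import math

counts = [0, 3, 1]   # hypothetical row: the first bigram was never seen

def normalize(row, fake=0):
    # add `fake` to every count, then divide by the new total
    total = sum(c + fake for c in row)
    return [(c + fake) / total for c in row]

raw = normalize(counts)             # [0.0, 0.75, 0.25]
smooth = normalize(counts, fake=1)  # [1/7, 4/7, 2/7]

# With raw[0] == 0 the log is undefined (torch.log would give -inf).
# After smoothing, the loss for the unseen bigram is large but finite.
nll_unseen = -math.log(smooth[0])
print(raw, smooth, nll_unseen)
```

The more fake counts you add, the closer every row gets to a uniform distribution.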

PART 2: the neural network approach: intro

Creating the bigram dataset for the neural net

# create the training set of bigrams (x,y)
xs, ys = [], []

for w in words[:1]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        print(ch1, ch2)
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)

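Now there is a difference between torch.tensor and torch.Tensor. The difference Andrej points out is that torch.tensor (small t) infers the dtype from the data, so our list of integers becomes an int64 tensor, whereas torch.Tensor (capital T) always returns float32. We want integers here, so we use torch.tensor. More details here.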

Feeding integers into neural nets? one-hot encodings

import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes = 27).float()

One-hot encoding is a technique used to convert individual classes into row vectors. For example, let us say I have a basket with 3 kinds of fruit inside it: apples, oranges and bananas. A model does not take words but rather numbers. So if we keep that order and one-hot encode, taking a banana out of the basket gives us 0,0,1 . There is another kind of encoding for classes, label encoding, but we are not talking about that right now. More details here.
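The fruit basket example, written out as a plain Python sketch (the one_hot helper here is my own, not F.one_hot):

```python
fruits = ['apple', 'orange', 'banana']   # the order fixes each class index

def one_hot(index, num_classes):
    # a row of zeros with a single 1 at position `index`
    return [1 if i == index else 0 for i in range(num_classes)]

banana = one_hot(fruits.index('banana'), len(fruits))
print(banana)  # [0, 0, 1]
```

F.one_hot does the same thing for a whole tensor of indices at once, which is why xenc has one row per character and 27 columns.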

~~Created using Claude 3.5 Sonnet~~

The “neural net”: one linear layer of neurons implemented with matrix multiplication

W = torch.randn((27,27))
xenc @ W

We have created 27 neurons that take 27 inputs each, stored their weights in W, and matrix multiplied our input xenc with W. Matrix multiplication is quite different from how normal multiplication works. The following illustration may help you understand.

~~Created using Claude 3.5 Sonnet~~

~~A simple rule to remember is nXm @ mXn = nXn the columns of the 1st matrix should match the number of rows of the 2nd matrix.~~
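If the picture does not click, here is matrix multiplication spelled out with nested lists. This is a teaching sketch; the matmul helper is mine and the @ operator in PyTorch does all of this for you.

```python
def matmul(A, B):
    # A is n x m, B is m x p, the result is n x p.
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "columns of A must match rows of B"
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 2, 3],
     [4, 5, 6]]        # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]           # 3 x 2

C = matmul(A, B)       # 2 x 2
print(C)               # [[4, 5], [10, 11]]
```

In our case xenc is (5, 27) and W is (27, 27), so xenc @ W is (5, 27): one row of 27 outputs for each of the 5 inputs.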

Transforming neural net outputs into probabilities: the softmax

logits = xenc @ W  # log-counts
counts = logits.exp()  # equivalent N
probs = counts / counts.sum(1, keepdims=True)
probs

~~Now to a more accurate representation~~

logits = xenc @ W  # raw scores, not actually log-counts
counts = logits.exp()  # exponentiating the raw scores
probs = counts / counts.sum(1, keepdims=True)  # normalizing to get probabilities

Andrej interprets the logits as log-counts, but they are not literally the logarithms of any counts we observed; they are the raw outputs of the matrix multiplication. We exponentiate them so that every value is greater than 0, which makes them behave like counts. Then we normalize each row so that it sums to 1, which gives us probabilities.
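The same two steps (exponentiate, then normalize) as a small plain Python sketch. The logits are made-up numbers.

```python
import math

def softmax(logits):
    counts = [math.exp(l) for l in logits]   # always positive, like counts
    total = sum(counts)
    return [c / total for c in counts]       # normalize so they sum to 1

probs = softmax([-1.0, 0.0, 2.0])
print(probs)

# Adding the same constant to every logit does not change the result,
# because the constant cancels out in the normalization.
shifted = softmax([99.0, 100.0, 102.0])
print(shifted)
```

Negative logits turn into small probabilities and large logits into big ones, but the order is always preserved.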

Summary, preview to next steps, reference to micrograd

xs

ys


# randomly initialize 27 neurons' weights. each neuron receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)

xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding
logits = xenc @ W # predict log-counts
counts = logits.exp() # counts, equivalent to N
probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
# btw: the last 2 lines here are together called a 'softmax'

probs.shape


nlls = torch.zeros(5)
for i in range(5):
  # i-th bigram:
  x = xs[i].item() # input character index
  y = ys[i].item() # label character index
  print('--------')
  print(f'bigram example {i+1}: {itos[x]}{itos[y]} (indexes {x},{y})')
  print('input to the neural net:', x)
  print('output probabilities from the neural net:', probs[i])
  print('label (actual next character):', y)
  p = probs[i, y]
  print('probability assigned by the net to the the correct character:', p.item())
  logp = torch.log(p)
  print('log likelihood:', logp.item())
  nll = -logp
  print('negative log likelihood:', nll.item())
  nlls[i] = nll

~~My writing cannot do justice to this part. Everything has been explained by Andrej. Even if you do not watch the whole video, watch this part again.~~

Vectorized loss

xs #tensor([ 0,  5, 13, 13,  1])
ys #tensor([ 5, 13, 13,  1,  0])

# randomly initialize 27 neurons' weights. each neuron receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)

# forward pass
xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding
logits = xenc @ W # predict log-counts
counts = logits.exp() # counts, equivalent to N
probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
loss = -probs[torch.arange(5), ys].log().mean()

~~I believe the question that all of us had at this point was, why use torch.arange(5) and not just xs.~~

Let us understand step by step. We have turned xs into probs, where row i is the probability distribution over the next character for the input xs[i]. For xs[0] we want the probability stored at column ys[0], because when xs[0] is given as input we want the model to predict ys[0].

So probs[torch.arange(5), ys] pairs row i with column ys[i] and picks out the probability the model assigned to each correct next character. We want to maximize these probabilities, which is why the loss is the mean of their negative logs. A simple way to think about it: torch.arange(5) supplies the row numbers (one per example) and ys supplies the column to read in each row. Indexing with xs would instead select rows by character index, which is not what we want.
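Here is the same indexing written as a plain Python loop, on a made-up probs with 2 examples and 2 classes, so you can see exactly which entries get picked:

```python
import math

probs = [[0.5, 0.5],      # row 0: distribution for example 0
         [0.25, 0.75]]    # row 1: distribution for example 1
ys = [0, 1]               # correct class for each example

# Equivalent of probs[torch.arange(2), ys]:
# from row i, take the entry in column ys[i].
picked = [probs[i][y] for i, y in enumerate(ys)]

# Equivalent of -picked.log().mean()
loss = -sum(math.log(p) for p in picked) / len(picked)
print(picked, loss)
```

The loss only looks at the probability given to the correct answer in each row; everything else in the row matters only through the normalization.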

Backward and update, in PyTorch

W.grad = None
loss.backward()
W.data += -0.1 * W.grad

~~I have covered this portion of backpropagation in my previous post, kindly consider reading that. Thank you!!~~

Putting everything together

# create the dataset
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)

# initialize the 'network'
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)


# gradient descent
for k in range(100):
    
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
    loss = -probs[torch.arange(num), ys].log().mean()
    print(loss.item())
    
    # backward pass
    W.grad = None # set to zero the gradient
    loss.backward()
    
    # update
    W.data += -50 * W.grad

Note 1: one-hot encoding really just selects a row of the next Linear layer’s weight matrix

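Here Andrej mentions something important: the W we end up with (the weights of the model) is essentially equivalent to what we arrived at by counting, since W.exp() plays the role of the count matrix N. The reason is in the heading: multiplying a one-hot vector by W simply selects the matching row of W, just as ix selected a row of N. Remember that counting is not a scalable approach, which is why we are using a gradient-based approach.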
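You can check the row-selection claim with a few lines of plain Python. The 3x3 weights are made up, and the helper names are mine.

```python
W = [[0.1, 0.2, 0.3],
     [1.0, 2.0, 3.0],
     [-1.0, 0.5, 4.0]]

def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def vec_matmul(x, W):
    # row vector x (length n) times matrix W (n x p) gives a length-p vector
    return [sum(x[k] * W[k][j] for k in range(len(x))) for j in range(len(W[0]))]

ix = 1
selected = vec_matmul(one_hot(ix, 3), W)
print(selected)   # [1.0, 2.0, 3.0], which is exactly W[1]
```

Every row except row ix is multiplied by zero, so the matrix multiplication is doing a table lookup in disguise.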

Note 2: model smoothing as regularization loss

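There are many kinds of regularization in ML. Here we are talking about L2 regularization: we add a term proportional to the mean of W squared to the loss, which pulls the weights towards zero. With all weights at zero every logit is equal and the predicted distribution is uniform, which is the same effect that adding fake counts had on the counting model. Regularization is like putting training wheels on a bicycle when you’re learning to ride. It helps prevent the model from going too wild or becoming overly specific to the training data.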

~~Key points about regularization:~~

~~Prevents overfitting: It stops the model from memorizing the training data and helps it generalize better to new, unseen data.~~
~~Adds a penalty: It discourages the model from using overly complex solutions by adding a cost to complexity.~~
~~Simplifies the model: It pushes the model towards simpler explanations of the data.~~
~~Balances bias and variance: It helps find a sweet spot between underfitting (too simple) and overfitting (too complex).~~
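As a rough sketch, the penalty term looks like this in plain Python. The strength of 0.01 and the two example weight matrices are assumptions for illustration; in the training loop the penalty would simply be added to the negative log likelihood before calling backward.

```python
def l2_penalty(W, strength=0.01):
    # strength times the mean of the squared weights
    values = [w for row in W for w in row]
    return strength * sum(w * w for w in values) / len(values)

W_zero = [[0.0, 0.0],
          [0.0, 0.0]]
W_large = [[3.0, -3.0],
           [2.0, -2.0]]

print(l2_penalty(W_zero))    # 0.0, nothing to penalize
print(l2_penalty(W_large))   # larger weights cost more

# total loss would be: data_loss + l2_penalty(W)
```

The sign of a weight does not matter, only its size, so the penalty keeps nudging every weight back towards zero while the data loss pulls it away.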

Sampling from the neural net

# finally, sample from the 'neural net' model
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
    out = []
    ix = 0
    while True:
        # -----------
        # BEFORE:
        #p = P[ix]
        # -----------
        # NOW:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W # predict log-counts
        counts = logits.exp() # counts, equivalent to N
        p = counts / counts.sum(1, keepdims=True) # probabilities for next character
        # -----------
        
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))

Conclusion

~~CONGRATULATIONS you have finished this blog.~~

A note

~~If you enjoyed the content, consider following.~~

Andrej’s Zero to Hero for Dummies Part 1

Goyalpramod — Wed, 26 Jun 2024 13:20:35 GMT

We all love Andrej and everything he has done for the community. If it weren’t for him, we would probably be a decade behind.
He is an exceptional teacher, with a very clear and concise method of teaching. When I watched his video as a complete beginner, it took me roughly 3 weeks to make entire sense of it.
So I have created this blog series, where I will be going through his playlist. Explaining the most minute things he talks about, so even an absolute beginner can understand it.
I recommend you have a basic understanding of Python first; that is the only prerequisite. There are plenty of amazing resources on the web, but if you are confused, consider going through my previous post on the topic.
(This blog post is inspired by his original video. I urge you to complete that first and then come back here to aid your understanding. I will be following the same chapter sections he uses on YouTube.)
https://medium.com/media/7984633a71e6c800d1122feb003713d0/href
Intro
Andrej talks a bit about himself
Micrograd Overview
https://github.com/karpathy/micrograd -> the repository we are going to be talking about, kindly have this opened.
“A tiny Autograd engine (with a bite! :)). Implements backpropagation (reverse-mode autodiff) over a dynamically built DAG and a small neural networks library on top of it with a PyTorch-like API. Both are tiny, with about 100 and 50 lines of code respectively. The DAG only operates over scalar values, so e.g. we chop up each neuron into all of its individual tiny adds and multiplies. However, this is enough to build up entire deep neural nets doing binary classification, as the demo notebook shows. Potentially useful for educational purposes.”
I have highlighted the scariest terms, let us begin by analyzing them first.
Backpropagation
Throughout this tutorial I want you to be sure of two things.
We know what the input is (what we are giving to our machine learning black box) and what we want the output to be (what the machine learning black box returns.)
Created with Claude 3.5 Sonnet
Imagine you are driving a car through unknown territory. The only thing that you know is the end location.
Let us imagine you first go down path 2 (the yellow line) and end up in a dead end. What do you do? You backtrack to where you began, now knowing that this path won't lead to the desired output (the end location), so you change your steering direction and go down another path. You keep doing this back and forth (iteration) till you reach your goal.
Notice a few things,
Not all paths will lead you to the desired location
Some paths take longer than others but you still reach the desired goal
Path 1 (red line) is the best path and if you were lucky enough to start with this, you wouldn't have to take so many detours. (the ideal case)
This car navigation analogy, while simplified, gives us a good intuition about backpropagation. Now, let’s explore a related concept that’s crucial to understanding how neural networks learn: reverse-mode auto differentiation.
Reverse-Mode Autodiff
Reverse-mode automatic differentiation, often simply called ‘reverse-mode autodiff’, is a powerful mathematical technique that underlies backpropagation. While backpropagation gives us the big picture of how neural networks adjust their parameters, reverse-mode autodiff provides a detailed mechanism for calculating gradients efficiently.
Let us visualize this.
Created with Claude 3.5 Sonnet
Differentiation measures the rate of change of one variable with respect to another. In essence: how fast does one variable (an arbitrary quantity) change when I tweak another?
Let’s say you want to make some delicious pizza (yum!!). If I add a little more dough, how much will that affect the output, in our case the pizza?
Backpropagation is reverse-mode autodiff applied to a neural network, but the two analogies may seem different, so let me connect them for you. In the car analogy we are changing only one variable, the steering. In the pizza example we are tweaking multiple variables, like dough, sauce, cheese, etc.
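To make “how much does the output change when I tweak the input” concrete, here is a tiny numerical sketch. The function f and the step size h are arbitrary choices for illustration.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def numerical_derivative(f, x, h=1e-6):
    # nudge the input by a tiny amount h and see how much the output moves
    return (f(x + h) - f(x)) / h

slope = numerical_derivative(f, 3.0)
print(slope)  # close to 14, the exact derivative 6x - 4 at x = 3
```

A slope of about 14 means that nudging x up by a tiny amount pushes the output up about 14 times as much. Autodiff gives us these slopes exactly and for every input at once, instead of nudging them one by one.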
DAG (Directed Acyclic Graph)
Created with Claude 3.5 Sonnet
Directed Acyclic Graphs (DAGs) are a fundamental concept in computer science, particularly in data structures and algorithms. While a deep dive into DAGs isn’t essential for our current discussion, understanding their basic idea can enhance your grasp of how neural networks process information.
Think of a DAG as a flowchart where information moves in one direction without ever looping back. This structure is crucial in representing the flow of data and computations in neural networks. If you’re curious to learn more about DAGs, Here is a nice guide on it.
For now, let’s focus on neural networks.
Neural Networks
Created with Claude 3.5 Sonnet
Neural networks are a branch of machine learning which is in part inspired by how our brains work using neurons. I have provided a very basic illustration of a neural network. We will be diving deep into it, and explaining everything in detail as we move forward.
PyTorch
GitHub - pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Do not reinvent the wheel!! PyTorch is a library that makes it easier for us to create and test machine-learning models. Python is riddled with multiple libraries that are our bread and butter. And if you plan on becoming a serious data scientist, you will be using them daily.
Scalar Values
A scalar is a single number that represents a magnitude (like 5, 12, 42, etc.), as opposed to a vector or a matrix, which hold many numbers. The term comes from linear algebra. If you want to brush up on your linear algebra, I recommend 3Blue1Brown.
Binary Classification
Bi-nary: two. Binary classification means sorting things into one of two categories. For example, looking at something and deciding whether it is a potato or not.
If this is your first time hearing about binary classification, I recommend that you pause here and first get done with the following two things:
Learn and get used to Python
Learn basic ML, I will recommend Andrew Ng’s lectures
On to the fun part i.e. THE CODE!!!
from micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f
print(f'{g.data:.4f}') # prints 24.7041, the outcome of this forward pass
g.backward()
print(f'{a.grad:.4f}') # prints 138.8338, i.e. the numerical value of dg/da
print(f'{b.grad:.4f}') # prints 645.5773, i.e. the numerical value of dg/db
When we type a = Value(-4.0) it means we are initializing an object of the class Value with a value of -4.0. Let us have a deeper look at the class Value.
class Value:
    """ stores a single scalar value and its gradient """

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0
        # internal variables used for autograd graph construction
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op # the op that produced this node, for graphviz / debugging / etc

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        out = Value(self.data**other, (self,), f'**{other}')

        def _backward():
            self.grad += (other * self.data**(other-1)) * out.grad
        out._backward = _backward

        return out

    def relu(self):
        out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')

        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward

        return out

    def backward(self):

        # topological order all of the children in the graph
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # go one variable at a time and apply the chain rule to get its gradient
        self.grad = 1
        for v in reversed(topo):
            v._backward()

    def __neg__(self): # -self
        return self * -1

    def __radd__(self, other): # other + self
        return self + other

    def __sub__(self, other): # self - other
        return self + (-other)

    def __rsub__(self, other): # other - self
        return other + (-self)

    def __rmul__(self, other): # other * self
        return self * other

    def __truediv__(self, other): # self / other
        return self * other**-1

    def __rtruediv__(self, other): # other / self
        return other * self**-1

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"
Do not be scared. It is quite long, but easy to break down. Let us go step by step:
“””
The triple-double quotes show a docstring. A docstring is a way to explain what the class does, the parameters it takes, and the result it returns. It does not affect the program, it is written to increase the readability of the code.
def __init__(self, data, _children=(), _op=''):
__init__ is a dunder method that initializes (not a constructor, read more) an object of the class. Dunder methods (magic methods) are special methods in Python, characterized by a double underscore on each side.
self -> This is the specific instance of the class. It’s how Python gives you access to the object’s attributes and other methods from inside the class. In simple terms, it gives each object its own identity: if you write self.name = name, the value of name is stored on that particular object.
data -> This is the value that is passed in when the object is created. We store it on the instance as self.data.
_children=() -> a parameter named _children that defaults to an empty tuple
_op=’’ -> a parameter named _op that defaults to an empty string.
quick note -> if the term lambda seems new to you, I recommend going through my ‘5 years of Python in 5 minutes’, where I talk about it.
def __add__(self, other):
Let us start with plain integers. When we type 5 + 2, the method __add__ gets called and the parameters are filled like this: (5).__add__(2). Here 5 calls the method __add__ and sends 2 as a parameter.
Both 5 and 2 are objects of the class int. Inside the method, 5 is self (the instance the method is called on) and 2 is other (a separate object that is simply passed in as an argument).
Let us go step by step:
num_1 = Value(4.0) # Initializing an instance of the class. We now have an object named num_1, which is an instance of Value. If you look at __init__, num_1.data has the value 4.0
num_2 = Value(2.0) # same as above
Now num_1 + num_2 works because both of these instances have the attribute data. What happens internally is num_1.__add__(num_2), which computes num_1.data + num_2.data
But look at what happens when we try the following with an __add__ that reads other.data directly:
result = num_1 + 2
print(result)
What is happening internally is
num_1.__add__(2), which tries to read (2).data
We would run into an AttributeError, because 2 is of class int, and the class int does not have an attribute named data.
To fix this issue, __add__ first runs the following line
other = other if isinstance(other, Value) else Value(other)
This checks whether other is already an instance of Value (the same class as the object calling the method). If it is not, it wraps it in a Value so that other.data exists.
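To see the isinstance line at work in isolation, here is a minimal sketch. The class name Scalar is made up for this illustration (it is not part of micrograd) and it keeps only what is needed for addition. It also shows why the full class defines __radd__: that is the method Python falls back to when the plain number is on the left.

```python
class Scalar:
    """Hypothetical stand-in for Value: it only stores data and supports +."""

    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # Wrap plain numbers so that other.data always exists
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.data + other.data)

    def __radd__(self, other):
        # Python falls back to this for `2 + Scalar(...)`, after int.__add__ gives up
        return self + other


num_1 = Scalar(4.0)
left = num_1 + 2    # calls num_1.__add__(2)
right = 2 + num_1   # int cannot add a Scalar, so Python calls num_1.__radd__(2)
print(left.data, right.data)  # 6.0 6.0
```

If you delete the isinstance line, num_1 + 2 fails with an AttributeError, exactly as described above.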
You can understand the rest of the dunder functions following the provided links, but they all essentially follow the same principles that have been defined.
Also quickly before I forget
print(f'{b.grad:.4f}')
:.{num}f is a format specifier that tells Python how many digits to show after the decimal point.
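A quick sketch of the same specifier with different precisions. The number used here is just an example value chosen for this illustration.

```python
value = 138.833819  # example number, picked only for illustration

four_places = f'{value:.4f}'
two_places = f'{value:.2f}'
no_places = f'{value:.0f}'

print(four_places)  # 138.8338
print(two_places)   # 138.83
print(no_places)    # 139
```

Note that the value is rounded for display only. The variable itself keeps its full precision.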
Derivative of a simple function with one input
Let us start by going through the code that Andrej writes, one piece at a time
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Here we are importing the important libraries. When we type “as” that gives an alias to a library, so instead of typing numpy.{method} we can simply write np.{method}.
%matplotlib inline allows you to directly show the graph below the code that you write
def f(x):
    return 3*x**2 - 4*x + 5

f(3.0) #prints 20.0
Here we are defining a function, we can see that it is a quadratic equation
xs = np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)
np.arange(-5, 5, 0.25) generates evenly spaced values over an interval. It starts at -5 and goes up in steps of 0.25 (-5, -4.75, -4.5, and so on), stopping before 5, so the last value is 4.75.
ys stores the output of the function for every value in xs,
and plt.plot plots it, which looks something like this
Curve of the equation in the interval -5 to 5
When we differentiate a curve at a point, we are trying to find the slope of the curve at that particular point.
By looking at the definition given in Wikipedia
Taken from Wikipedia
If we break the definition down, it measures how much the value of f(a) changes when we nudge a by a very small amount h, divided by that same h (h tending to zero means h is very small).
Created using Claude 3.5 Sonnet
The slope at any point can be simply understood as, if you were at that point and you moved in any direction for a very tiny amount, how much will you rise up or fall down relative to your original position.
h = 0.00000001
x = -3.0
(f(x + h) - f(x)) / h #prints -22.00000039920269, the slope
(as Andrej pointed out, we do not manually calculate the derivative every time. Keep that in your head)
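As a sanity check, we can compare the nudge-based slope with the derivative we get by applying the power rule by hand, which gives 6x - 4 for this f. This is a small sketch and the helper names df_exact and df_numeric are made up for this illustration.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def df_exact(x):
    # Power rule applied by hand: d/dx (3x^2 - 4x + 5) = 6x - 4
    return 6 * x - 4

def df_numeric(x, h=1e-6):
    # Forward difference: (f(x + h) - f(x)) / h
    return (f(x + h) - f(x)) / h

x = -3.0
print(df_exact(x))    # -22.0
print(df_numeric(x))  # very close to -22, with a small error from the finite h
```

The two values agree up to a tiny error. Making h smaller shrinks that error at first, but if h gets too small, floating point rounding starts to dominate.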
Derivative of a function with multiple inputs
h = 0.0001

# inputs
a = 2.0
b = -3.0
c = 10.0

d1 = a*b + c
a += h
d2 = a*b + c

print('d1', d1) #4.0
print('d2', d2) #3.999699999999999
print('slope', (d2 - d1)/h) #-3.000000000010772
I have been explaining things with illustrations and visuals, but I will not do so for the above equation. Why?
Because it is an equation with 3 input variables, its graph is no longer a curve on a 2D plot but a surface in a higher-dimensional space, which is hard to draw.
If you google the slope of a plane in 3D you will get a lot of scary-looking images, but the simplest way to think about it is this: if I pick one direction (one variable) and move a tiny bit along it, how much does the final value change relative to the initial value?
That is what we are finding out with the above code. When the value of a is nudged forward a bit, how much does the new output change relative to the old one, per unit of nudge? That is the slope.
I recommend going through some of the Differentiation rules and formulas. Do not worry if they don't make sense; we will keep making things simpler as we go through the article. They are meant as a refresher for people who are already used to calculus.
Andrej continues to show how changing the other variables will change d, but I believe you have the intuition, so I will not repeat it. (You are smart! Don't let anyone tell you otherwise.)
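If you do want to check the other two inputs yourself, here is a small sketch that nudges one input at a time. The helper nudge_slope is a made-up name for this illustration.

```python
def d(a, b, c):
    return a * b + c

def nudge_slope(func, inputs, index, h=1e-4):
    # Bump only the input at position `index` by h and measure the change in output
    bumped = list(inputs)
    bumped[index] += h
    return (func(*bumped) - func(*inputs)) / h

inputs = (2.0, -3.0, 10.0)  # a, b, c from the snippet above
slope_a = nudge_slope(d, inputs, 0)  # about -3.0, which is the value of b
slope_b = nudge_slope(d, inputs, 1)  # about  2.0, which is the value of a
slope_c = nudge_slope(d, inputs, 2)  # about  1.0
print(slope_a, slope_b, slope_c)
```

The pattern to notice: for a product, the slope with respect to one factor is the other factor, and for a sum the slope is 1.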
Starting the core Value object of micrograd and its visualization
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0 # initially we assume there is no effect on the output
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'
L
print(d) #Value(data=4.0)
print(d._prev) #{Value(data=-6.0), Value(data=10.0)}
print(d._op) #+
We have already built an intuition about the code in the beginning. Now let us understand what _children and _op do, and how those parameters get filled even though we did not pass any such arguments when we initialized a, b or c.
Let us start by having a deeper look at __add__
new_var = a + b
print(type(new_var)) #__main__.Value
As we know, the value on the right is assigned to the variable on the left. So what happens on the right internally?
This is what goes on
Value(a.data + b.data, (a,b), '+')
And now this Value instance is assigned to new_var, with instances a and b as its children. We got this parent through the addition operation.
(If we go by intuition, new_var was created from a and b, so a and b should be the parents of new_var. But it is easier to picture the other way around in a tree structure, so that is the convention we follow.)
So now if we have to understand d, we can break it down like so
d = a*b + c. On the right-hand side, what happens is:
Value((a*b).data + c.data, ((a*b), c), '+'), where a*b is itself Value(a.data * b.data, (a, b), '*')
Take a bit of time to think about why the first child is (a*b) and not (a, b)
Hint: what is the new instance of the class, or what is self now?
Answer> Value(a.data * b.data, (a, b), '*') becomes the new instance that __add__ is called on, so it is self now, and the value of self.data for this instance is a.data * b.data
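We can make the answer concrete by calling the dunder methods by hand instead of using * and +. This sketch uses a trimmed-down copy of the Value class (no gradients, no labels) so that it can run on its own.

```python
class Value:
    # Trimmed-down copy of the class above: data, children and op only
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


a, b, c = Value(2.0), Value(-3.0), Value(10.0)

e = a.__mul__(b)   # exactly what a * b does
d = e.__add__(c)   # exactly what e + c does; here self is e and other is c

print(d.data)              # 4.0
print(d._op)               # +
print(d._prev == {e, c})   # True: the children of d are e and c, not a and b
print(e._prev == {a, b})   # True: a and b are one level further down
```

So a and b are not lost. They are simply the children of the intermediate node, which is how the whole expression ends up stored as a graph.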
d=e + c; d.label = ‘d’
This was my first time seeing ; in Python, so I was extremely confused as to what it did. Let us find out together.
The semicolon in Python allows you to write multiple statements on a single line.
So Andrej could have written
d = e + c
d.label = 'd'
This would have been valid as well. As for d.label: this works because label is an attribute of the Value class, initialized in the __init__ method.
In Python, you can access and modify object attributes using dot notation.
from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and edges in a graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) # LR = left to right

    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # for any value in the graph, create a rectangular ('record') node for it
        dot.node(name = uid, label = "{%s | data %.4f | grad %.4f}" % (n.label, n.data, n.grad), shape='record')
        if n._op:
            # if this value is a result of some operation, create an op node for it
            dot.node(name = uid + n._op, label = n._op)
            # and connect this node to it
            dot.edge(uid + n._op, uid)

    for n1, n2 in edges:
        # connect n1 to the op node of n2
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)

    return dot
(I had an issue creating the graph on my setup, so I will be attaching screenshots from the YouTube video itself. I am sorry for the inconvenience caused)
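If graphviz does not work on your setup either, a plain-text printout is a workable fallback for small graphs. The sketch below is my own stand-in, not something from the video: describe is a made-up helper, and the Value class is a trimmed-down copy so the snippet runs on its own.

```python
class Value:
    # Trimmed-down copy of the class above (no gradients needed for printing)
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


def describe(node, depth=0):
    # Hypothetical helper: returns one indented line per node, root first
    op = f" (from {node._op})" if node._op else ""
    lines = [f"{'  ' * depth}{node.label} = {node.data}{op}"]
    # Sort children by label so the order is stable (sets are unordered)
    for child in sorted(node._prev, key=lambda v: v.label):
        lines.extend(describe(child, depth + 1))
    return lines


a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'

print("\n".join(describe(L)))
```

Each level of indentation is one step deeper into the graph, so the root L is at the top and the leaves a and b are the most indented. A node that feeds into several parents would be printed once per parent, which is fine for a graph this small.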
Screenshot from Youtube
We go with the assumption that initial derivative of L with respect to d is zero
Manual backpropagation example #1: simple expression
{To be continued..}

5 years of Python in 5 minutes

Goyalpramod — Mon, 24 Jun 2024 12:45:36 GMT

Photo by Chris Ried on Unsplash
Python is one of the most beginner-friendly and versatile languages. I plan to make a blog series explaining multiple ML concepts and ideas in depth with illustrations. But I believe before we can start with that, we have to get some prerequisites cleared. This is less of a guide and more of a crash course. In the end, I will also point to the direction I would go to learn some more complex ideas if I were a complete beginner.
What’s in a name! (Variables & Data Type)
_str = "a"
_str = "Pramod"
_int = 1
_float = 1.0
_bool = True
Variables in programming are just like the ones we had in grade 6 mathematics. They are names that you assign a value to.
There are rules and conventions to follow while naming variables. For example, a variable name cannot begin with a number; that will throw an error.
After naming a variable, the value on the right is assigned to the name on the left. Different values have different data types. I have covered the most basic ones that are used often, and each variable above is named after its data type.
Why was 6 afraid of 7? Because 7 8 9 (Arithmetic & Logical Operations)
_num1 = 2
_num2 = 1

# Arithmetic Operations
print(_num1 + _num2) # Add, Answer = 3
print(_num1 - _num2) # Subtract, Answer = 1
print(_num1 * _num2) # Multiply, Answer = 2
print(_num1 / _num2) # Float divide, Answer = 2.0
print(_num1 // _num2) # Int divide, Answer = 2
print(_num1 ** _num2) # Pow, Answer = 2

# Logical Operations
print("\nLogical Operations:")
print(_num1 > _num2) # Greater than, Answer = True
print(_num1 < _num2) # Less than, Answer = False
print(_num1 >= _num2) # Greater than or equal to, Answer = True
print(_num1 <= _num2) # Less than or equal to, Answer = False
print(_num1 == _num2) # Equal to, Answer = False
print(_num1 != _num2) # Not equal to, Answer = True

# Boolean Operations
print("\nBoolean Operations:")
print(True and False) # Logical AND, Answer = False
print(True or False) # Logical OR, Answer = True
print(not True) # Logical NOT, Answer = False

# = is known as the assignment operator and it assigns the value right to left
When multiple operations are present in the same expression, the program follows an order of operations. The important thing to remember is that Python does not silently convert between unrelated data types such as str and int. You cannot write “pramod” + 5 and expect to get back pramod5 (it raises a TypeError). However, you can do “pramod” + “5” and get back “pramod5”.
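Here is a short sketch of what happens with “pramod” + 5 and how an explicit conversion fixes it. It also shows two mixed-type operations that Python does allow, so the rule is about unrelated types rather than all type mixing.

```python
name = "pramod"

try:
    result = name + 5          # str + int is not defined
except TypeError as err:
    result = None
    print("TypeError:", err)

print(name + str(5))   # pramod5, after converting the int to a str explicitly
print(name * 2)        # pramodpramod, str * int is defined as repetition
print(1 + 1.0)         # 2.0, int and float mix freely
```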
Go with the flow (Control Flow)
# If-Else Statement
_num = 2
if _num >= 1: # Evaluates a boolean expression and checks if the given condition is true
    print("Number is Positive")
elif _num <= -1: # Check if this condition is true
    print("Number is Negative")
else: # If the given conditions are false, execute the else statement
    print("Number is Zero")

# For Loop
fruits = ["apple", "banana", "mango"]
for fruit in fruits:
    print(f"I like {fruit}")

# While Loop
count = 0
while count < 5:
    print(f"Count is {count}")
    count += 1 # Always have an exit loop condition. Otherwise you will be stuck in an infinite loop

# Break and Continue
for number in range(10):
    if number == 3:
        continue # Skip 3, if the above condition is true. Move on.
    if number == 8:
        break # Stop at 8, If the above condition is true. Get out of the loop!
    print(number)
A few important things to remember: you can nest these control statements, so you can have a for loop inside a for loop, and so on.
There is a less used control flow construct called the switch statement (Python's version is the match statement, added in Python 3.10), but I have seldom seen it used in practice.
Another important fact to remember is that in Python, for loops act like iterators: they walk over the items of an iterable, which is different from the counter-based for loops in other languages like C/C++ or Java. For more details read this thread.
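To make the iterator point concrete, here is a sketch of roughly what a for loop does behind the scenes, written out by hand with iter and next.

```python
fruits = ["apple", "banana", "mango"]

# Roughly what `for fruit in fruits:` does behind the scenes
iterator = iter(fruits)
seen = []
while True:
    try:
        fruit = next(iterator)   # ask for the next item
    except StopIteration:        # raised when nothing is left
        break
    seen.append(fruit)

print(seen)  # ['apple', 'banana', 'mango']

# If you do need the index, as in a C-style loop, enumerate provides it
for index, fruit in enumerate(fruits):
    print(index, fruit)
```

There is no counter variable anywhere in the first loop. The loop just keeps asking for the next item until the iterator says it is done.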
Functions are cool! (Functions)
# Simple function
def greet(name):
    return f"Hello, {name}!"

print(greet("Pramod")) # This prints Hello, Pramod!

# Function with default parameter
def power(base, exponent=2):
    return base ** exponent

print(power(3)) # 3^2 = 9
print(power(3, 3)) # 3^3 = 27

# Function with multiple returns
def divide(a, b):
    if b == 0:
        return "Error: Division by zero"
    return a / b

print(divide(10, 2)) # 5.0
print(divide(5, 0)) # Take a guess!

# Lambda function (anonymous function)
square = lambda x: x ** 2
print(square(4)) # 16
In the divide example we return “Error: Division by zero” ourselves. But what if dividing two numbers produces a result too large to represent, or we run into some other issue at runtime that we could not account for in advance? Such cases are dealt with using exception handling.
The anonymous function I introduced is one thing Python is famous for. Lambdas are often used in the famous (sometimes infamous) Python one-liners (here is a list of some of my favourites).
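A minimal sketch of exception handling applied to the same division problem. The function name safe_divide is made up for this illustration, and the two exception types handled here are just examples of what you might choose to catch.

```python
def safe_divide(a, b):
    # Hypothetical variant of divide() that uses exception handling
    try:
        return a / b
    except ZeroDivisionError:
        return "Error: Division by zero"
    except TypeError as err:
        # e.g. dividing a string by a number
        return f"Error: {err}"

print(safe_divide(10, 2))    # 5.0
print(safe_divide(5, 0))     # Error: Division by zero
print(safe_divide("5", 2))   # an error message describing the unsupported types
```

Instead of checking for every bad input up front, we try the operation and react to whatever goes wrong. In real code it is usually better to catch only the exceptions you know how to handle.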
Hold it all together (Lists, Tuples & Dictionaries)
# Lists
fruits = ["apple", "banana", "mango"]
print(fruits[0]) # apple, python is a zero indexed language
fruits.append("date") # append to the list
print(fruits) # ['apple', 'banana', 'mango', 'date']

# Tuples
coordinates = (10, 20)
x, y = coordinates
print(f"X: {x}, Y: {y}")

# Dictionaries
person = {
"name": "John",
"age": 30,
"city": "New York"
}
print(person["name"]) # John
person["job"] = "Engineer"
print(person)

# List comprehension
squares = [x**2 for x in range(5)]
print(squares) # [0, 1, 4, 9, 16]
The fundamental difference between lists and tuples is that the former is mutable while the latter is not. Essentially, mutable objects can be edited, while immutable objects cannot.
Dictionaries are Python's hash maps; they are your best friends.
List comprehension is a Pythonic, compact way of iterating over a list (or any iterable) and applying an expression to each item (think of the things you can do with lambda here).
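The sketch below shows both points: what mutable versus immutable means in practice, and the same list of squares built with a loop, a list comprehension, and map with a lambda.

```python
fruits = ["apple", "banana", "mango"]
fruits[0] = "avocado"          # lists are mutable, so this works

coordinates = (10, 20)
try:
    coordinates[0] = 99        # tuples are immutable, so this raises
except TypeError as err:
    print("TypeError:", err)

# The same result three ways: loop, list comprehension, map + lambda
squares_loop = []
for x in range(5):
    squares_loop.append(x ** 2)

squares_comp = [x ** 2 for x in range(5)]
squares_map = list(map(lambda x: x ** 2, range(5)))

print(squares_loop == squares_comp == squares_map)  # True
```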
It is all data in, data out (File I/O)
# Writing to a file
with open("example.txt", "w") as file:
    file.write("Hello, World!\n")
    file.write("This is a test file.\n")

# Reading from a file
with open("example.txt", "r") as file:
    content = file.read()
    print(content)

# Appending to a file
with open("example.txt", "a") as file:
    file.write("This line is appended.\n")
I see programming as Input -> Magic -> Output. As long as you have clearly defined the inputs and what you want as the output, writing the “Magic” behind it isn’t that tough.
Back to CLASS!! (classes and Object Oriented Programming)
# Define a simple class
class Dog:
    # Class attribute
    species = "Canis familiaris"

    # Constructor method
    def __init__(self, name, age):
        self.name = name # Instance attribute
        self.age = age # Instance attribute

    # Instance method
    def description(self):
        return f"{self.name} is {self.age} years old"

    # Another instance method
    def speak(self, sound):
        return f"{self.name} says {sound}"

# Create instances of the Dog class
buddy = Dog("Buddy", 9)
miles = Dog("Miles", 4)
OOP is the essence of Python. Different languages have different essences, and that is one of the reasons (among a plethora of others) why different languages are used for different purposes.
There are multiple concepts within classes that are beyond the purview of this article, like inheritance, static methods, decorators, etc.
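The Dog class above creates two instances but never calls their methods, so here is a short usage sketch. The class is repeated so the snippet runs on its own, and the sound passed to speak is just an example.

```python
class Dog:
    species = "Canis familiaris"   # class attribute, shared by every Dog

    def __init__(self, name, age):
        self.name = name           # instance attributes, one copy per Dog
        self.age = age

    def description(self):
        return f"{self.name} is {self.age} years old"

    def speak(self, sound):
        return f"{self.name} says {sound}"


buddy = Dog("Buddy", 9)
miles = Dog("Miles", 4)

print(buddy.description())               # Buddy is 9 years old
print(miles.speak("Woof"))               # Miles says Woof
print(buddy.species == miles.species)    # True, both read the same class attribute
```

Notice that we never pass self ourselves. Python fills it in with the object on the left of the dot.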
The most important thing of it all
“One does not need to know everything. One only needs to know where to find it, when one needs it.”
This is the dogma of most programmers (or at least the ones I have met). This article wasn’t meant to turn you into a Python wiz in 5 minutes, but rather to show you how simple things are, and where to find the right resources to learn and use them. With the information I have given you, you can make most simple applications.
Now, as promised, here are a few places I recommend for people looking to upskill themselves in Python.
Automate the boring stuff (A beautiful book that captures everything in detail about Python, and it's FREE!!!)
Programming with Mosh (I love his style of pedagogy and I personally started from his tutorials)
Tech with Tim (One of the few youtubers who talks about complex and advanced Python topics)
End Note
I hope I was able to teach you something, if you liked the article consider following me and sharing it with anyone who may find it useful.

How I Wasted 2 Years Learning AI/ML and Then Finally Landed My Dream Job

Goyalpramod — Sun, 23 Jun 2024 14:03:11 GMT

Photo by AltumCode on Unsplash
I started learning ML in my second year of college. The first thing I did was watch Andrew Ng’s videos for days. 3 months, to be exact. After that, I stopped watching his videos and switched to another tutorial series. I was stuck in this loop for 2 years before I finally realized my mistake and fixed it.
In this article I will share the things that I got right and the things I did not, so that you do not make the same mistakes I made and can land your dream job (hopefully sooner than me).
Tutorial hell
“What if I miss out on some information if I do not watch this tutorial?” All of us have that fear of missing out. We are afraid that we will fail if we do not have 100% of the information.
The first thing to realize is that one can never have all the information. And even if one did have all the information, that would not guarantee success. As soon as we accept this and let go of the fear of information scarcity, we take our first steps to success.
If you keep watching tutorials, it puts your mind on autopilot. We feed ourselves copium that we are learning something. Meanwhile, effective learning is about 30%.
When you stop watching videos and start applying them, you discover the gaps in your knowledge. That forces you to read up on them again, which makes your learning more effective.
The turning point
After 2 years of continuously watching different tutorials, I was still not confident. There were 2 things that I was consistent with: watching videos and talking to people about ML. So I had gotten my name out as a person who was interested in ML.
My friends heard about a big national ML hackathon and asked me to join their team, as they knew how much I loved ML. I said yes on a whim. I spent a few hours building a simple LSTM model for predicting words (something very similar to ChatGPT but on a very small scale), writing everything from scratch. And surprisingly, WE WON!!! I did not even apply 50 percent of everything I had learned, nor did I remember half of it.
As soon as you take the leap of faith from the fear of missing out to the thrill of application, your hard work will start bearing fruit.
After winning the hackathon, I had confidence. I started taking part in more competitions; we won a few and lost a lot. But this was crucial for teaching me the real-world applications of ML and its limitations as well. As I got good at it, I became more passionate about it, and explored more areas in ML like research and teaching ML to others.
By teaching others, I was forcing myself to get a very good grasp of the concepts behind how everything is built and made. This made my foundation strong.
Things to Do and Not to Do
Do NOT just learn ML, apply it. Think of something real-world you face every day. Make it fun. We used to have a game where we would try to predict the temperature of the next day, and whoever was the closest got a “samosa” from whoever was the furthest from the ground truth.
Share what you learn. This is the quickest way to have the information imprinted in your own head for a long time.
You cannot love something you hate. If you have put genuine effort into this and still do not like it, stop. I wasted 2 years and I have no remorse, because I still liked what I did.
Network, build in public. Tell people what you are doing and what you want to do. This will connect you to the right people, as well as teach you things you could not have learned alone.
How did I get my job?
I had been dabbling in open source for a while, talking to a lot of people, meeting strangers, and asking friends if they knew anyone who worked in open source. That is when a friend introduced me to Ronit. Ronit had started 2 companies and was now working at a company, making more in a month than most people make in a whole year. And no one knew about him!!
He was humble and down to earth, and told me about what he does and how he does it. Taking inspiration from him, I started applying to different companies, working in open source, and solving issues.
After doing this for two months, I told Ronit about everything I had accomplished. He was astounded, and told me… “If you are so good, why not apply to the company I work at?” I took it up as a challenge and drafted a mail that instantly captured the CEO’s attention. We had an introduction call in the evening.
Surprise, surprise: it was not an intro call but rather a technical test. This is when my countless hours of learning and applying ML helped me. I was able to ace it, and 1 week later I got the job.
Key takeaways
What I want you to take from this article, in a nutshell, is to keep trying different things till you find something that you genuinely find interesting. If you are in the top 1% in anything, you will be well off. And if you find something interesting, you will keep doing it while someone half-hearted drops it. I urge you to go out there in the world and explore whatever you want to be doing with this life of yours.

What you NEED to know as a beginner developer!!

Goyalpramod — Sat, 22 Jun 2024 12:49:12 GMT

Photo by Danial Igdery on Unsplash
I recently got a new PC and had to set up my programming environment from scratch, a task that seemed daunting when I first started programming 4 years ago back in college. This time, the seemingly hard task was over in a mere 30 minutes.
This experience made me realize that there are fundamental skills and dev-tools every aspiring developer should master to accelerate their growth.
Git Gud
Photo by Roman Synkevych on Unsplash
The first and foremost thing a good developer needs to know how to use is Git and GitHub (yes, both of them are different).
Git is the software you use to “version control” your code (a fancy term that means taking a snapshot of the history of everything you commit). GitHub is the place where it is all stored.
Plenty of videos and articles explain the difference in detail and how to set it up. Here is one of my favorites: Git and GitHub for beginners by FreeCodeCamp. When in doubt, go with FreeCodeCamp.
If you went through the video, you must have seen some convoluted commands. No one remembers them all (well, at least I don't), so here is a cheat sheet I swear by.
There is no place like your own IDE
An Integrated Development Environment (IDE) is the place where you will be doing most of your programming, so you must learn how to navigate through it well. Every IDE has its own pros and cons. For most beginners, the one that I suggest is Visual Studio Code.*
The extensions in it are easy to set up and it has relatively easy shortcuts to remember.
A few of my favorite extensions:
Peacock
Prettier
Material theme icons
GitHub Copilot
Code Runner
And a few of my favorite shortcuts that I cannot live without:
Ctrl + left mouse click (Do this over any function and it will take you to the function definition)
Ctrl + Shift + F (find anything present inside the directory)
Middle mouse click (MULTIPLE CURSORS!!!!!)
Alt + Up/Down arrow key (move the line of code)
Alt + Shift + Up/Down key (Copy the line of code up or down)
Del key (start deleting from the right, opposite of backspace)
Ctrl + Backspace (Delete whole words)
Ctrl + X (Cut a whole line)
Home key (start of a line)
End key (end of a line)
This can become an endless list, really. I urge you to go and explore it yourself.
*VS Code is a text editor and not an IDE. Find more details about the difference here.
Do you even “language” bro
“Which programming language should I learn first?”
If I had a penny for every time someone asked me this question, I probably wouldn't need to work anymore. The answer is….. it depends. On a lot of factors like, “What are you hoping to develop?”, “How experienced are you with programming?”, “Who is going to be your end user?” yada yada ya, and so on.
But if you are a complete beginner, I recommend starting with Python if you would like to start with machine learning, or some simple GUIs and games.
If you are interested in making websites and apps, you should start with JavaScript. JS and Python are very versatile, and they can help you make 90% of the things you can think of.
As you grow and become better, you should also learn a lower-level or statically typed language like C/C++, Java, Rust, Go, etc. These will help you understand the basics of how the memory stack, control flow, pointers, etc. work. These are essential if you plan to stay in the developer world for long.
If you are starting out with Python, I urge you to go through this repo created by me, which shows the best practices around the language.
Know your whale friend (Docker)
Photo by Ian Taylor on Unsplash
Don’t let Docker intimidate you: it’s simpler than it looks and designed to make your development life easier. As you grow and keep shipping products, you will stumble into the age-old saying “Works on my machine bro”. Docker is built to take care of this.
Docker packages your program into an image, which you can upload to DockerHub. Other people can then pull it and start running it on their system. It is also important to know that Docker is different from virtual environments. You can find the difference here.
This video by fireship.io (one of the best developer channels out there, consider subscribing to him) does an amazing job of teaching it.
Linux is life (at least to a developer)
Photo by Gabriel Heinzer on Unsplash
Linux is a necessary evil that every developer has to learn eventually. But why? Because most of what runs on the cloud runs on Linux. So if you ever plan to build something that can be served to other people, it is easier if you understand how and where it runs.
Here are some good and easy resources to start learning Linux (start with an easy distro like Ubuntu):
https://linuxjourney.com/ A nice interactive website
https://training.linuxfoundation.org/training/introduction-to-linux/ Linux by the people who built it
Intro to Linux by FCC
AI is the new black
Photo by Steve Johnson on Unsplash
If you do not use AI in your workflow in this day and age, you will be left behind. You NEED to start using a general-purpose AI like ChatGPT or Claude to help you learn, understand, and write code. And to do this well, you have to learn something called prompt engineering.
Now there is too much “buzz” around prompt engineering, but it is just expressing yourself clearly so the model can perform well.
Here are a few tricks and prompts that I use myself:
Act as X (tell the model what you want it to act as: a Python developer, a CTO, an IT expert, etc.)
Think step by step (giving it time to think makes it better)
The audience is X (tell it who is going to read it)
Be clear and explain your task very well.
Here is a prompt that I often use:
“Act as a Python developer and help me debug this code. Think step by step and ask me any doubts you may have. Do not make any assumptions. The audience is an experienced developer himself. I will provide you with some documentation in triple backticks and the code in triple double quotes.”
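If you reuse this prompt a lot, it can help to build it from a small template instead of retyping it. Below is a minimal sketch of that idea. The `build_prompt` helper and its parameter names are my own assumptions for illustration and are not part of any library. It only assembles the text, so you would still paste the result into ChatGPT or Claude yourself.

```python
# Hypothetical helper: assembles a prompt from the tricks listed above.
FENCE = "`" * 3   # triple backticks, used to wrap the documentation
QUOTES = '"' * 3  # triple double quotes, used to wrap the code


def build_prompt(role, task, audience, documentation="", code=""):
    """Return a prompt string that applies the role, step-by-step, and audience tricks."""
    lines = [
        f"Act as {role} and {task}.",                              # Act as X
        "Think step by step and ask me any doubts you may have.",  # give it time to think
        "Do not make any assumptions.",
        f"The audience is {audience}.",                            # The audience is X
    ]
    if documentation:
        lines.append(f"Documentation:\n{FENCE}\n{documentation}\n{FENCE}")
    if code:
        lines.append(f"Code:\n{QUOTES}\n{code}\n{QUOTES}")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        role="a Python developer",
        task="help me debug this code",
        audience="an experienced developer",
        documentation="sorted(iterable, /, *, key=None, reverse=False)",
        code="print(sorted([3, 1, 2], reversed=True))",
    )
    print(prompt)
```

Keeping the role, task, and audience as separate arguments makes it easy to swap one of them (say, “a CTO” instead of “a Python developer”) without touching the rest of the prompt.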
The end of the beginning (and a bit about myself)
And with this, you have come to the end of this little intro to being a good developer. I plan to keep releasing content about how to become an AI developer and be a good one while at it.
I am Pramod, a founding AI developer at Dimension. I am from India and I love to build... anything really, as long as it is creative and can keep me up at night.
Consider following me on my socials to stay up to date with my articles.

How I made my first contribution to C4GT as someone who has never worked in open-source

Goyalpramod — Wed, 31 May 2023 09:20:13 GMT

Photo by Markus Spiske on Unsplash
Introduction
Hi there!!! I am your average college student who was intimidated by open source. I always heard the stories of people who got into programs like GSoC, MLH fellowship, and GitHub Winter of Code and wondered if I could ever contribute to any of the projects and get accepted into such prestigious programs.
I had worked on multiple projects of my own, but working with someone else’s code was just too frightening: “What if I mess something up?”, “What if I am not capable enough?”, “I do not even know where to start, this is just a waste of time.”
That is when I heard about C4GT.
C4GT
If you haven’t heard about C4GT, it is an initiative by Samagra, a governance consulting firm, to upskill the next generation of programmers. It has many perks, like one-to-one mentorship from professional programmers, working on a complete government project, and, to top it all off, a stipend of 1 lakh for the 2 months of the program.
Code for GovTech - Digital Public Goods
The heart of open source is working together: making small contributions and developing an idea with like-minded people. The mentors of C4GT stay true to this. All of them are easy to approach and guide you from the very beginning on how to make your first contribution, even if you have no idea what you are doing (like me).
Making my first contribution
Joining the Discord server was my first step. This introduced me to the community and showed me all the other enthusiastic people working and helping each other to make contributions. It was amazing to see so many people from so many different backgrounds. I remember even seeing a kid in 10th grade enquiring about how to make a contribution. This lit a fire inside of me, and I decided that whether or not I got into the program, I would make a contribution.
I was fortunate enough to be clear about my own strengths. I knew I had worked in Python and had skills in the ML, DS, and AI domains. So I went through the whole list of projects provided by C4GT and opened each project one by one.
Each project listed its tech stack, which helped me pinpoint the exact projects I could help with. (Any project that mentioned JS, React, Angular, etc. was of no use to me, as I had little to no idea about them, so I specifically targeted the projects that had Python in their stack.)
So I chose to contribute to the Text2SQL project. If that is not the case for you, you can always contact the mentors, tell them what you are good at, and ask which project would be good for you.
I went to the project and went through its open issues. I was looking for the tag “good first issue” or “beginner”, which I knew marked easy issues that anyone can fix. Luckily, I found an issue dated February 22 that I believed I could contribute to.
But it was so old that I was uncertain whether they were still looking for contributions to it. I contacted the mentor through Discord about it, and he was very helpful and happy to guide me through it.
He made me realize every little contribution counts. So I decided to work on it.
I made a few directory changes and added a list for the literature review. As small as this seems, it helped create a place where everyone wishing to contribute to the project in the future could write down their findings.
And finally, I got my first PR merged!
Getting my PR merged yehhh!!!!
Why you should contribute
There are multiple reasons to contribute to open source, whether or not you get selected into the program. A few of them:
Learning and Knowledge Sharing: Open source projects promote a culture of learning and knowledge sharing. By contributing, you have the opportunity to explore new technologies, programming languages, and frameworks. You can also learn from the code reviews and feedback provided by other contributors, improving your understanding of best practices and coding standards.
Building a Portfolio: Open source contributions serve as evidence of your skills and commitment to the software development community. They can be valuable additions to your professional portfolio, demonstrating your ability to work on collaborative projects and showcasing your code quality to potential employers.
Networking and Collaboration: Engaging in open-source projects enables you to connect and collaborate with developers from around the nation. This network can lead to valuable professional connections, mentorship opportunities, and exposure to diverse perspectives and approaches to problem-solving.
Conclusion
If you are a programmer or someone who is just getting into programming, this is the perfect opportunity for you to learn and grow. The community is extremely helpful and the learning opportunity is tremendous. So I urge you: if you have the slightest bit of interest, come join us.