English | MP4 | AVC 1280×720 | AAC 44KHz 2ch | 329 lectures (91h 3m) | 66.41 GB
Build and train LLM/NLP transformers and attention mechanisms in PyTorch, and explore them with mechanistic interpretability tools
Deep Understanding of Large Language Models (LLMs): Architecture, Training, and Mechanisms
Large Language Models (LLMs) like ChatGPT, GPT-4, GPT-5, Claude, Gemini, and LLaMA are transforming artificial intelligence, natural language processing (NLP), and machine learning. But most courses only teach you how to use LLMs. This 90+ hour intensive course teaches you how they actually work — and how to dissect them using machine-learning and mechanistic interpretability methods.
This is a deep, end-to-end exploration of transformer architectures, self-attention mechanisms, embedding layers, training pipelines, and inference strategies — with hands-on Python and PyTorch code at every step.
Whether your goal is to build your own transformer from scratch, fine-tune existing models, or understand the mathematics and engineering behind state-of-the-art generative AI, this course will give you the foundation and tools you need.
What You’ll Learn
- The complete architecture of LLMs — tokenization, embeddings, encoders, decoders, attention heads, feedforward networks, and layer normalization
- Mathematics of attention mechanisms — dot-product attention, multi-head attention, positional encoding, causal masking, and probabilistic token selection (a short attention sketch follows this list)
- Training LLMs — optimization (Adam, AdamW), loss functions, gradient accumulation, batch processing, learning-rate schedulers, regularization (L1, L2, decorrelation), gradient clipping
- Fine-tuning, instruction tuning, and prompt engineering for downstream NLP tasks
- Evaluation metrics and benchmarks — perplexity, accuracy, MAUVE, HellaSwag, and SuperGLUE, plus how to assess bias and fairness
- Practical PyTorch implementations of transformers, attention layers, and language-model training loops, including custom classes and custom loss functions
- Inference techniques — greedy decoding, beam search, top-k sampling, temperature scaling
- Scaling laws and trade-offs between model size, training data, and performance
- Limitations and biases in LLMs — interpretability, ethical considerations, and responsible AI
- Decoder-only transformers
- Embeddings, including token embeddings and positional embeddings
- Sampling techniques — methods for generating new text, including top-p, top-k, multinomial, and greedy (a minimal sampling sketch also follows below)
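To make the attention bullet above concrete, here is a minimal PyTorch sketch of single-head scaled dot-product attention with a causal mask. The toy shapes, random projection matrices, and variable names are illustrative assumptions, not the course's exact code.

```python
import torch
import torch.nn.functional as F

T, d = 5, 8                       # toy sequence length and embedding dimension
x = torch.randn(T, d)             # token embeddings for one toy sequence

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))  # query/key/value projections
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / d**0.5                            # scaled dot products, shape (T, T)
mask = torch.tril(torch.ones(T, T)).bool()           # causal mask: each token sees only the past
scores = scores.masked_fill(~mask, float("-inf"))

attn = F.softmax(scores, dim=-1)                     # attention weights per token
out = attn @ V                                       # weighted mix of value vectors
print(out.shape)                                     # torch.Size([5, 8])
```

The course builds this up step by step (and extends it to multiple heads and full transformer blocks); this snippet only previews the core computation.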
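Likewise, here is a minimal sketch of the sampling strategies listed above (temperature scaling plus greedy, top-k, and top-p selection). The logits are random placeholders and the settings are arbitrary, so treat this as an illustration rather than the course's own implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(50)                 # pretend next-token logits over a 50-token vocabulary
temperature, k, p = 0.8, 5, 0.9

probs = F.softmax(logits / temperature, dim=-1)      # temperature reshapes the distribution

greedy_id = torch.argmax(probs).item()               # greedy: always pick the most likely token

topk = torch.topk(probs, k)                          # top-k: sample only among the k best tokens
topk_id = topk.indices[torch.multinomial(topk.values, 1)].item()

sorted_p, sorted_idx = torch.sort(probs, descending=True)  # top-p (nucleus) sampling
keep = torch.cumsum(sorted_p, dim=0) <= p
keep[0] = True                                       # always keep at least the best token
nucleus = sorted_p[keep] / sorted_p[keep].sum()
topp_id = sorted_idx[keep][torch.multinomial(nucleus, 1)].item()

print(greedy_id, topk_id, topp_id)
```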
Why This Course Is Different
- 93+ hours of HD video lectures — blending theory, code, and practical application
- Code challenges in every section — with full, downloadable solutions
- Builds from first principles — starting from basic Python/NumPy implementations and progressing to full PyTorch LLMs
- Suitable for researchers, engineers, and advanced learners who want to go beyond “black box” API usage
- Clear explanations without dumbing down the content — intensive but approachable
Who Is This Course For?
- Machine learning engineers and data scientists
- AI researchers and NLP specialists
- Software developers interested in deep learning and generative AI
- Graduate students or self-learners with intermediate Python skills and basic ML knowledge
Technologies & Tools Covered
- Python and PyTorch for deep learning
- NumPy and Matplotlib for numerical computing and visualization
- Google Colab for free GPU access
- Hugging Face Transformers for working with pre-trained models (see the loading sketch after this list)
- Tokenizers and text preprocessing tools
- Implement Transformers in PyTorch, fine-tune LLMs, decode with attention mechanisms, and probe model internals
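As a taste of the Hugging Face workflow referenced above, the following sketch loads a pretrained GPT-2 and generates a short continuation. The model name and sampling settings are common defaults chosen for illustration, not values prescribed by the course.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,       # length of the continuation
    do_sample=True,          # sample instead of greedy decoding
    top_k=50,                # restrict sampling to the 50 most likely tokens
    temperature=0.8,         # soften/sharpen the distribution
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The same pattern (tokenize, run the model, decode) recurs throughout the course, where it is also reimplemented from scratch in plain PyTorch.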
By the end of this course, you won’t just know how to work with LLMs — you’ll understand why they work the way they do, and be able to design, train, evaluate, and deploy your own transformer-based language models.
Enroll now and start mastering Large Language Models from the ground up.
Who this course is for:
- AI engineers
- Scientists interested in modern autoregressive modeling
- Natural language processing enthusiasts
- Students in a machine-learning or data science course
- Graduate students or self-learners
- Undergraduates interested in large language models
- Machine-learning or data science practitioners
- Researchers in explainable AI
Table of Contents
Introductions
1 [IMPORTANT] Prerequisites and how to succeed in this course
2 Using the Udemy platform
3 Getting the course code, and the detailed overview
4 Do you need a Colab Pro subscription
5 About the CodeChallenge videos
—— —— Part 1 Tokenizations and embeddings —— ——
6 Tokenizations and embeddings
Words to tokens to numbers
7 Why text needs to be numbered
8 Parsing text to numbered tokens
9 CodeChallenge Create and visualize tokens (part 1)
10 CodeChallenge Create and visualize tokens (part 2)
11 Preparing text for tokenization
12 CodeChallenge Tokenizing The Time Machine
13 Tokenizing characters vs. subwords vs. words
14 Byte-pair encoding algorithm
15 CodeChallenge Byte-pair encoding to a desired vocab size
16 Exploring ChatGPT4’s tokenizer
17 CodeChallenge Token count by subword length (part 1)
18 CodeChallenge Token count by subword length (part 2)
19 How many rs in strawberry
20 CodeChallenge Create your algorithmic rapper name
21 Tokenization in BERT
22 CodeChallenge Character counts in BERT tokens
23 Translating between tokenizers
24 CodeChallenge More on token translation
25 CodeChallenge Tokenization compression ratios
26 Tokenization in different languages
27 CodeChallenge Zipf’s law in characters and tokens
28 Word variations in Claude tokenizer
Embeddings spaces
29 Word2Vec vs. GloVe vs. GPT vs. BERT… oh my!
30 Exploring GloVe pretrained embeddings
31 CodeChallenge Wikipedia vs. Twitter embeddings (part 1)
32 CodeChallenge Wikipedia vs. Twitter embeddings (part 2)
33 Exploring GPT2 and BERT embeddings
34 CodeChallenge Math with tokens and embeddings
35 Cosine similarity (and relation to correlation)
36 CodeChallenge GPT2 cosine similarities
37 CodeChallenge Unembeddings (vectors to tokens)
38 Position embeddings
39 CodeChallenge Exploring position embeddings
40 Training embeddings from scratch
41 Create a data loader to train a model
42 Build a model to learn the embeddings
43 Loss function to train the embeddings
44 Train and evaluate the model
45 CodeChallenge How the embeddings change
46 CodeChallenge How stable are embeddings
—— —— Part 2 Large language models —— ——
47 Large language models
Build a GPT
48 Why build when you can download
49 Model 1 Embedding (input) and unembedding (output)
50 Understanding nn.Embedding and nn.Linear
51 CodeChallenge GELU vs. ReLU
52 Softmax (and temperature) math, numpy, and pytorch
53 Randomly sampling words with torch.multinomial
54 Other token sampling methods greedy, top-k, and top-p
55 CodeChallenge More softmax explorations
56 What, why, when, and how to layernorm
57 Model 2 Position embedding, layernorm, tied output, temperature
58 Temporal causality via linear algebra (theory)
59 Averaging the past while ignoring the future (code)
60 The attention algorithm (theory)
61 CodeChallenge Code Attention manually and in Pytorch
62 Model 3 One attention head
63 The Transformer block (theory)
64 The Transformer block (code)
65 Model 4 Multiple Transformer blocks
66 Multihead attention theory and implementation
67 Working on the GPU
68 Model 5 Complete GPT2 on the GPU
69 CodeChallenge Time model5 on CPU and GPU
70 Inspecting OpenAI’s GPT2
71 Summarizing GPT using equations
72 Visualizing nano-GPT
73 CodeChallenge How many parameters (part 1)
74 CodeChallenge How many parameters (part 2)
75 CodeChallenge GPT2 trained weights distributions
76 CodeChallenge Do we really need Q
Pretrain LLMs
77 What is pretraining and is it necessary
78 Introducing huggingface.co
79 The AdamW optimizer
80 CodeChallenge SGD vs. Adam vs. AdamW
81 Train model 1
82 CodeChallenge Add a test set
83 CodeChallenge Train model 1 with GPT2’s embeddings
84 CodeChallenge Train model 5 with modifications
85 Create a custom loss function
86 CodeChallenge Train a model to like X
87 CodeChallenge Numerical scaling issues in DL models
88 Weight initializations
89 CodeChallenge Train model 5 with weight inits
90 Dropout in theory and in Pytorch
91 Should you output logits or log-softmax(logits)
92 The FineWeb dataset
93 CodeChallenge Fine dropout in model 5 (part 1)
94 CodeChallenge Fine dropout in model 5 (part 2)
95 CodeChallenge What happens to unused tokens
96 Optimization options
Fine-tune pretrained models
97 What does fine-tuning mean
98 Fine-tune a pretrained GPT2
99 CodeChallenge Gulliver’s learning rates
100 On generating text from pretrained models
101 CodeChallenge Maximize the X factor
102 Alice in Wonderland and Edgar Allan Poe (with GPT-neo)
103 CodeChallenge Quantify the Alice/Edgar fine-tuning
104 CodeChallenge A chat between Alice and Edgar
105 Partial fine-tuning by freezing attention weights
106 CodeChallenge Fine-tuning and targeted freezing (part 1)
107 CodeChallenge Fine-tuning and targeted freezing (part 2)
108 Parameter-efficient fine-tuning (PEFT)
109 CodeGen for code completion
110 CodeChallenge Fine-tune codeGen for calculus
111 Fine-tuning BERT for classification
112 CodeChallenge IMDB sentiment analysis using BERT
113 Gradient clipping and learning rate scheduler (part 1)
114 Gradient clipping and learning rate scheduler (part 2)
115 CodeChallenge Clip, freeze, and schedule BERT
116 Saving and loading trained models
117 BERT decides Alice or Edgar
118 CodeChallenge Evolution of Alice and Edgar (part 1)
119 CodeChallenge Evolution of Alice and Edgar (part 2)
120 Why fine-tune when you can use AGI
Instruction tuning
121 What is instruction tuning
122 Some datasets for instruction tuning
123 Training a chatbot with system-user-assistant
124 Instruction tuning with GPT2
125 CodeChallenge Instruction tuning GPT2-large (part 1)
126 CodeChallenge Instruction tuning GPT2-large (part 2)
127 Reinforcement learning from human feedback (RLHF)
—— —— Part 3 Evaluating LLMs —— ——
128 Evaluating LLMs
Quantitative evaluations
129 Promises and challenges of quantitative evaluations
130 Numerical issues in logits and softmax
131 Perplexity
132 CodeChallenge Perplexing perplexities
133 Masked word prediction accuracy
134 HellaSwag
135 Import large models using bitsandbytes
136 CodeChallenge HellaSwag evals in two models (part 1)
137 CodeChallenge HellaSwag evals in two models (part 2)
138 KL (Kullback-Leibler) divergence
139 MAUVE
140 CodeChallenge Large and small MAUVE explorations
141 SuperGLUE and other amalgamations
142 Assessing bias and fairness
143 Non-technical benchmarks
Qualitative evaluations
144 Black box evals
145 Red-teaming
146 Accuracy, coherence, and relevance
147 Distributions of hidden-state activations
148 Heatmaps of tokens for qualitative inspection
149 CodeChallenge Visualize single-token predictions
—— —— Part 4 AI safety and mechanistic interpretability —— ——
150 Overview of AI safety and mechanistic interpretability
AI safety
151 AI safety and alignment
152 Why can’t AI just be safe and moral
153 In-context and few-shot learning
154 Scaling and AI safety
155 Hands-on Hack an AI to steal a password!
156 How to get involved in AI safety
Interpretability
157 What is mech interp (mechanistic interpretability)
158 How does mech interp relate to AI safety
159 Concepts and terms in mech interp
160 Theoretical and empirical approaches in research and teaching
161 General criticisms of mechanistic interpretability
—— —— Part 5 Observation (non-causal) mech interp —— ——
162 Observation (non-causal) mechanistic interpretability
Investigating token embeddings (part 1)
163 CodeChallenge Cosine similarity (advanced) (part 1)
164 CodeChallenge Cosine similarity (advanced) (part 2)
165 CodeChallenge Cosine similarity in word sequences
166 CodeChallenge Coloring cosine similarity
167 CodeChallenge Can random embeddings be interpreted
168 T-SNE projection and DBSCAN clustering (theory)
169 T-SNE projection and DBSCAN clustering (Python)
170 CodeChallenge Cluster the x terms
171 CodeChallenge Tokenize, embed, and cluster happy emojis
172 RSA (representational similarity analysis)
173 CodeChallenge Compare embeddings with RSA (part 1)
174 CodeChallenge Compare embeddings with RSA (part 2)
175 CodeChallenge Word2vec vs. GPT2
176 CodeChallenge Graph representation of cosine similarities
177 Embeddings arithmetic and analogies
178 CodeChallenge Soft-coded analogies in word2vec
179 Creating and interpreting linear semantic axes
180 kNN for synonym-searching in BERT
181 CodeChallenge BERT v GPT kNN kompetition
182 Research on translating embeddings spaces
183 Singular value spectrum of embeddings submatrices
184 CodeChallenge SVD projections of related embeddings
Investigating neurons and dimensions
185 Activation maximization via gradient ascent (theory)
186 Activation maximization (code)
187 Activation maximization via data sampling
188 CodeChallenge Reproducibility of activation maximization
189 Extracting activations using hooks
190 Relation between hooks and output.hidden_states
191 Clarification of final hidden_states output
192 CodeChallenge Grammar tuning in MLP neurons (part 1)
193 CodeChallenge Grammar tuning in MLP neurons (part 2)
194 CodeChallenge Context-modulated activation in MLP
195 CodeChallenge Activation histograms by token length (part 1)
196 CodeChallenge Activation histograms by token length (part 2)
197 CodeChallenge Activation histograms by token length (part 3)
198 Dealing with multitoken word embeddings
199 CodeChallenge Category-tuned MLP projections (part 1)
200 CodeChallenge Category-tuned MLP projections (part 2)
201 Classification via logistic regression theory and code
202 Logistic regression vs. t-test assumptions and applications
203 Proper noun tuning in GPT2-medium
204 CodeChallenge Negation tuning in MLP neurons (part 1)
205 CodeChallenge Negation tuning in MLP neurons (part 2)
206 CodeChallenge Negation tuning in MLP neurons (part 3)
207 CodeChallenge Negation tuning in QVK neurons
Investigating layers
208 Token-related similarities within and across Q, K, V matrices (part 1)
209 Token-related similarities within and across Q, K, V matrices (part 2)
210 CodeChallenge Token-related similarities across layers
211 Grouping and RSA in Q and K matrices
212 CodeChallenge Laminar profile of RSA and category selectivity
213 Effective dimensionality analysis with PCA
214 CodeChallenge Dimensionalities in Pythia 2.3B
215 Mutual information theory and code
216 Pairwise mutual information through the LLM
217 Mutual information vs. covariance
218 CodeChallenge Attention to coffee MI and token distances (part 1)
219 CodeChallenge Attention to coffee MI and token distances (part 2)
220 CodeChallenge Clusters in internal vs. terminal punctuation (part 1)
221 CodeChallenge Clusters in internal vs. terminal punctuation (part 2)
222 The Logit Lens
223 CodeChallenge Logit Lens in BERT (part 1)
224 CodeChallenge Logit Lens in BERT (part 2)
Investigating token embeddings (part 2)
225 Calculating rotations of embeddings vectors
226 CodeChallenge Laminar evolution of sequential angular adjustments
227 Path length and logit token prediction
228 CodeChallenge Residual stream path length decomposition (part 1)
229 CodeChallenge Residual stream path length decomposition (part 2)
230 State-space trajectories through embedding space
231 Parts of speech with SpaCy library
232 CodeChallenge Do nouns or adjectives have longer trajectories (part 1)
233 CodeChallenge Do nouns or adjectives have longer trajectories (part 2)
Identifying circuits and components
234 What is a circuit in a DL model
235 Isolating and investigating attention heads
236 CodeChallenge Laminar profile of attention head weights
237 Are circuits clustered in low-dimensional space
238 Sparse probing theory and code
239 Challenges with sparse logistic regression in large datasets
240 Latent vs. manifest variables
241 Sparse autoencoders theory and code
242 SAE in GPT2 learns about Hungarian Palinka
243 CodeChallenge Laminar profile of autoencoder sparsity
244 Non-orthogonal latent components via eigendecomposition (theory and demo)
245 Generalized eigendecomposition separates him from her in MLP
246 CodeChallenge GED for category isolation across layers (part 1)
247 CodeChallenge GED for category isolation across layers (part 2)
—— —— Part 6 Intervention (causal) mech interp —— ——
248 Intervention (causal) mech interp
How to modify activations
249 Introduction to causal mech interp
250 Activation editing Code implementations
251 CodeChallenge Replacing attention, MLP, and hidden states
Editing hidden states
252 Downstream impact of early layer scaling
253 CodeChallenge Hidden-state scaling and token loss
254 CodeChallenge Noisy and shuffled BERT predictions
255 CodeChallenge Measure and correct BERT’s bias
256 Activation patching with indirect object identification
257 Skip a layer
Interfering with attention
258 Head ablation and token prediction
259 CodeChallenge Token prediction after head ablations (part 1)
260 CodeChallenge Token prediction after head ablations (part 2)
261 Impact of head-silencing on cosine similarity
262 CodeChallenge Does GPT2 like pineapple pizza
263 Attention head patching in IOI
264 CodeChallenge Head and token patching in IOI
Modifying MLP
265 Successive median-replacement of MLP neurons
266 Statistics-based lesioning MLP neurons
267 CodeChallenge Laminar profile of MLP t-lesions
268 Explorations in subspace removal
—— —— Part 7 Python tutorial —— ——
269 Python tutorial
Python intro Colab and notebooks
270 Creating, working with, and saving Colab notebooks
Python intro Data types
271 Arithmetic and comments
272 Variables
273 Lists
274 Booleans
275 Dictionaries
Python intro Indexing and slicing
276 Indexing
277 Slicing
Python intro Functions
278 Inputs and outputs
279 The numpy library
280 Getting help on functions
281 Creating functions
282 Copying (duplicating) variables
283 Generating random numbers
Python intro Flow control
284 For loops
285 If-else statements
286 List comprehension (single-line loops)
287 Initializing variables
288 Enumerate iterables
289 Zip multiple iterables
Python intro Data visualization
290 Plotting dots and lines
291 Subplot geometry
292 Making graphs look nice
Python intro Strings and texts
293 String interpolation and f-strings
294 Importing text from the web
295 Processing text
Python intro Pytorch
296 Working with classes
297 Creating custom classes
298 Datatypes, tensors, and dimensions
299 Reshaping tensors
300 Random numbers
—— —— Part 8 Deep learning intro —— ——
301 Deep learning intro
Math of deep learning
302 Terms and datatypes in math and computers
303 Vector and matrix transpose
304 Linear weighted combinations
305 The dot product
306 Matrix multiplication
307 Softmax
308 Logarithms
309 Entropy and cross-entropy
310 Min/max and argmin/argmax
311 Mean and variance
312 Random sampling and sampling variability
313 The t-test
314 Derivatives intuition and polynomials
315 Derivatives find minima
316 Derivatives product and chain rules
How models learn gradient descent
317 Overview of gradient descent
318 What about local minima
319 Gradient descent in 1D
320 Gradient descent in 2D
321 CodeChallenge Fixed vs. dynamic learning rate
Essence of deep learning modeling
322 The perceptron and ANN architecture
323 A geometric view of ANNs
324 ANN math part 1 (forward prop)
325 ANN math part 2 (errors, loss, cost)
326 ANN math part 3 (backprop)
327 Forward pass in Pytorch
328 Backprop in Pytorch
Bonus content
329 Bonus material
