Stories by Void on Medium

Cancer — How does it work and how can we use AI to eliminate it

Void — Sun, 07 Dec 2025 10:21:36 GMT

1. What exactly is cancer?

a. What Cancer Fundamentally Is

Cancer is not a single disease but a failure mode of multicellular life. In a healthy body, cells follow strict rules: they divide only when needed, repair DNA damage, self-destruct when severely compromised, and cooperate with surrounding tissue. Cancer arises when mutations accumulate that disable these controls. The result is a population of cells that act selfishly, prioritizing their own survival and reproduction over the organism as a whole.

At its core, cancer is what happens when the body’s control systems stop working together properly. It isn’t something foreign invading the body; it’s normal cells that have gone wrong and stopped following the rules.

b. How Cancer Forms and Evolves

Cancer develops through a gradual evolutionary process. DNA replication is highly accurate but imperfect, and errors accumulate over time due to random chance, environmental exposure, chronic inflammation, or infection. Most mutations are harmless, but some enhance a cell’s ability to divide, survive stress, or evade control. Cells harboring these mutations gain a selective advantage and expand.

Over many years, this process creates many different kinds of cancer cells inside a single tumor. That variety is what makes cancer flexible, tough, and hard to destroy. Even treatment can make this worse by killing weaker cells and allowing resistant ones to survive and grow.

c. When Cells Become Cancerous

Cells don’t become cancer all at once. They change step by step. Early changes can cause unusual growth but not real danger. Later failures in DNA repair, cell death signals, and tissue limits allow true cancer behavior — spreading, invading other tissues, and avoiding the immune system.

Cancer appears only when many control systems break down at the same time. It is a system-wide failure, not just one bad gene.

d. The Role of Proteins and Cellular Pathways

Genes encode proteins, and proteins execute cellular decisions. Cancer-driving mutations alter the structure, activity, or expression of proteins that regulate growth, repair, and survival. Some proteins become stuck in “on” states that continuously signal division, while others that enforce limits or repair damage are disabled.

The true damage lies not in isolated proteins, but in the disruption of entire regulatory networks.

e. Why Reversing DNA Mutations Is Not the Solution

While it seems intuitive to “fix” cancer by reversing mutations, this approach is neither practical nor sufficient. By the time cancer exists, cells contain thousands of mutations and deeply altered regulatory states. Editing every relevant mutation in every malignant cell without error is currently infeasible.

More importantly, cancer behavior arises from altered system dynamics, not just corrupted DNA sequences. Correcting mutations alone would not restore normal cellular coordination. Effective treatment focuses on neutralizing cancer’s consequences, not rewriting its genome.

f. Cancer-Specific Drug Design and Precision Therapy

Designing drugs for specific cancers means targeting the weaknesses that cancer depends on to survive. If a tumor is mainly driven by a few key mutations, drugs aimed at those pathways can work very well. This approach is called precision oncology.

But one drug usually doesn’t work forever. Cancer adapts by using backup pathways or by letting resistant cells take over. So effective precision treatment uses combinations of therapies that shut down the cancer’s entire working network, not just one target.

g. What 2–3 Drugs for One Cancer Must Do

For a specific cancer, a small combination of drugs can theoretically approach complete elimination if they jointly close all survival and escape paths. One drug typically suppresses the dominant growth driver, another blocks resistance or backup signaling, and a third induces irreversible death or immune recognition.

Success depends on fully disabling all viable cancer states — dividing, dormant, stem-like, or migratory. If you miss even one of these states, it could mean a relapse.

h. When Near-100% Elimination Is Possible

Near-complete elimination can happen when a cancer is genetically simple, not very diverse, and depends on only a few pathways. In these cases, combination treatments can remove most cancer cells, and the immune system can handle what remains. This is often called a functional cure.

In more complex cancers, even the best drug combinations may fail because some cells are already resistant or hidden in hard-to-reach places.

Here, biology, not drug quality, limits what treatment can achieve.

2. Role of AI in eliminating cancer

a. Core AI framing: cancer as a learnable dynamical system

AI does not approach cancer as a static classification task. It treats cancer as a high-dimensional dynamical system whose future behavior depends on interventions.

Formally, AI models learn a transition function:

x_{t+1} = F(x_t, a_t) + ϵ_t

where

x_t: latent state of the tumor (clone composition, pathway activity, stress state)
a_t: therapy action (which drugs, what dose, when)
F: unknown nonlinear dynamics
ϵ_t: biological noise + unmodeled effects

Deep models approximate F, which enables simulating therapy trajectories before applying them in reality.

b. Representation learning: how AI “sees” cancer

Raw biological data (genomes, transcriptomes, proteomes, metabolomics) is too high-dimensional and noisy to use directly, so AI models first have to learn compressed latent representations.

Models used

Variational Autoencoders (VAEs)
Contrastive learning models
Graph Neural Networks (for signaling / gene interaction graphs)

Mathematical role

These models learn embeddings z such that distance in latent space ≈ functional similarity.
Two cancer states far apart in z-space respond differently to drugs.

This representation step is critical as everything else depends on it.

c. Modeling drug action as transformations in latent space

Rather than modeling every protein interaction explicitly, AI treats drug action as state transformation:

z' = z + Δz(z, drug)

Deep neural operators learn Δz from perturbation data (drug screens, CRISPR, time-series assays).

This allows AI to answer:

What state does the tumor move to after Drug A?
Which escape states become reachable?
Which states collapse entirely?

Drugs are evaluated by their geometric impact on the cancer state manifold.

d. Predicting resistance as reachable states

Resistance is modeled as future reachable regions in latent space under therapy pressure.

AI uses:

Sequence models (Transformers, RNNs) for temporal tumor evolution
Graph neural nets for pathway rewiring
Evolution-aware predictors trained on longitudinal data

Note: Longitudinal data is data collected from the same person repeatedly over time, capturing how something changes instead of just what it looks like at one moment.

Mathematically, AI approximates a reachable set:

R = { z : z is reachable from the current tumor state under the therapy }

A therapy fails if R contains any viable cancer state.
Cure requires that R collapses to empty or non-viable regions.

e. Combination therapy as coverage optimization

Selecting 2–3 drugs is a coverage problem over latent vulnerabilities.

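Each drug induces a transformation vector Δz.
The combination aims to minimize the volume of viable latent space:

minimize over S: Vol({ z ∈ R(S) : z is viable })

AI searches for drug sets S such that:

|S| ≤ 3 and R(S), the set of states reachable under S, contains no viable state

This is done using: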

Bayesian optimization
Policy gradient methods
Differentiable subset selection
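
To make the coverage idea concrete, here is a minimal sketch that swaps the optimizers listed above for plain exhaustive search over a tiny, made-up example. The drug names and the sets of states each drug eliminates are hypothetical; only the four state names come from this article.

```python
from itertools import combinations

# Hypothetical toy data: which tumor states each drug eliminates.
# The drug names and kill sets are invented for illustration only.
KILLS = {
    "drug_A": {"dividing"},
    "drug_B": {"dividing", "migratory"},
    "drug_C": {"dormant", "stem-like"},
    "drug_D": {"stem-like"},
}
ALL_STATES = {"dividing", "dormant", "stem-like", "migratory"}


def surviving_states(drug_set):
    """States that no drug in the set eliminates."""
    covered = set()
    for drug in drug_set:
        covered |= KILLS[drug]
    return ALL_STATES - covered


def best_combinations(max_drugs=3):
    """Smallest drug sets (up to max_drugs) that leave no surviving state."""
    for size in range(1, max_drugs + 1):
        hits = [set(combo) for combo in combinations(sorted(KILLS), size)
                if not surviving_states(combo)]
        if hits:
            return hits
    return []


print(best_combinations())
```

With real data the number of candidate sets explodes, which is why the methods listed above replace brute force.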

f. Generative AI for drug creation

Molecule generation

Models:

Graph diffusion models
SMILES transformers
Energy-based models

They learn a probability distribution:

where

x: candidate molecule
c: desired properties (kill cancer state, low toxicity)

Optimization occurs in latent space by gradient ascent on expected reward:

Reward models combine:

predicted cancer kill
predicted resistance suppression
predicted toxicity

g. Closing the loop: Bayesian updating during treatment

Cancer models are uncertain. AI maintains probabilistic beliefs over its own parameters.

Bayesian neural nets / ensemble methods maintain:

As therapy progresses and real responses are observed, AI updates its posterior and adjusts policy accordingly.

This is crucial for elimination because static plans will fail.
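
As a rough illustration of the update step, the sketch below applies Bayes' rule to a belief over two hypotheses instead of a full Bayesian neural network. The prior and the likelihood numbers are assumptions chosen for the example, not clinical values.

```python
def bayes_update(prior, likelihoods):
    """Posterior over hypotheses after one observation (Bayes' rule)."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical belief about whether the dominant clone is drug-sensitive.
belief = {"sensitive": 0.7, "resistant": 0.3}

# Assumed probability of each scan result under each hypothesis.
LIKELIHOOD = {
    "shrank": {"sensitive": 0.9, "resistant": 0.2},
    "no_shrink": {"sensitive": 0.1, "resistant": 0.8},
}

for observation in ["shrank", "no_shrink", "no_shrink"]:
    belief = bayes_update(belief, LIKELIHOOD[observation])
    print(observation, {h: round(p, 3) for h, p in belief.items()})
```

Two disappointing scans in a row are enough to flip the belief toward resistance, which is the signal a treatment policy would use to switch drugs.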

3. Lung Cancer — How AI Helps Find a Cure

1. Framing the Problem

AI treats lung cancer as a dynamical system under intervention, not a static disease.
Goal: design 2–3 cancer-specific drugs + schedules that eliminate all tumor states (sensitive + resistant) while staying under toxicity constraints.

2. Target Discovery (What to Hit)

AI analyzes lung-cancer-specific genomic and proteomic data to learn which pathways are essential for survival in each subtype (e.g., EGFR-mutant, KRAS-mutant).

Methods:

Representation learning (autoencoders, contrastive models) to cluster tumors by biological mechanism.
Graph Neural Networks on signaling networks to identify non-bypassable dependencies and synthetic lethal targets.
Causal modeling to separate true drivers from correlations.

Output: A short, ranked list of lung-cancer-specific vulnerabilities.

3. Drug Design (What Molecules to Make)

AI designs drugs for those specific targets, not generic cytotoxics.

Methods:

Structure-based models (3D GNNs, equivariant networks) to predict binding to lung-cancer targets and resistance mutants.
Generative models (graph diffusion, transformers) to propose molecules optimized for:
• tumor kill
• resistance suppression
• low toxicity
• synthesizability

Optimization happens in learned latent spaces, using multi-objective reward functions.

4. Combination Therapy (2–3 Drugs, Not One)

AI treats drug combos as a coverage problem over tumor states.

Each drug induces a transformation in latent cancer state:

z → z + Δz_d(z)

Combination seeks:

S* = argmin over drug sets S with |S| ≤ 3 of max over z ∈ R(S) of g(z)

Where:

R = reachable resistant states
g(z) = viability classifier (learned neural surrogate)

Methods:

Latent-space modeling of how each drug transforms cancer state.
Optimization to select 2–3 drugs that together collapse all viable tumor states.
Explicit modeling of likely resistance paths to ensure they are closed preemptively.

Output: Minimal drug sets with maximal elimination coverage.

Thank you for reading, hope you liked it!

Lorentz Transformations & Relativistic effects

Void — Sat, 30 Aug 2025 16:04:58 GMT

When Newton was alive, the universe seemed simple. Space was just space, time was just time, and they ticked away independently of each…


How understanding our brains could be a key factor in achieving AGI?

Void — Tue, 24 Jun 2025 16:11:37 GMT

What is AGI?


AES Encryption I — How it works

Void — Fri, 18 Apr 2025 07:26:10 GMT

What is AES?

AES (Advanced Encryption Standard) is a symmetric encryption algorithm, meaning the same key is used for both encryption and decryption. It’s widely used across the world (banks, government, apps like WhatsApp) because it’s fast, secure, and resistant to most attacks when used properly.

Before starting, let’s discuss some of the topics we’ll need in this!

Galois Field

A Galois Field, or finite field, is a set of a fixed number of elements (finite), with two operations:

Addition
Multiplication

In AES, we use the field GF(2⁸):

It contains exactly 256 elements, i.e., all possible 8-bit byte values (from 0x00 to 0xFF).
Coefficients are added modulo 2 (XOR), and products are reduced modulo an irreducible polynomial.

So instead of doing regular math, you’re doing math on bytes that “wrap around” in a structured way.

Why use GF(2⁸) in AES?

Because bytes are 8 bits, and every operation in AES is on bytes, like substitution, mixing, and XORing.
GF(2⁸) gives us:

A structured mathematical system over bytes
Invertible operations (important for decryption)
Good diffusion and confusion properties (important for cryptographic strength)

Now, let’s start!

Finite Field Arithmetic (GF(2⁸))

AES is based on operations over the Galois Field GF(2⁸). We treat bytes as elements of GF(2⁸), which means all values are between 0 and 255 (i.e., 8 bits). Addition is a plain XOR, and multiplication is reduced modulo x⁸ + x⁴ + x³ + x + 1 (an irreducible polynomial).

1 Addition (XOR)

In GF(2⁸), addition is simply the XOR operation:

Example:
0x57 ⊕ 0x83 = 0xD4 (i.e., binary XOR between corresponding bits).

2 Multiplication

Multiplication is performed modulo the irreducible polynomial x⁸ + x⁴ + x³ + x + 1. Multiplying two elements a and b in GF(2⁸) involves the following steps:

Multiply the elements as polynomials with coefficients in GF(2), so coefficients are combined with XOR and there are no carries.
Reduce the result modulo the irreducible polynomial x⁸ + x⁴ + x³ + x + 1.

For example:
Multiplying 0x57 × 0x13 in GF(2⁸):
1. Start with binary representations:
0x57 = 01010111, 0x13 = 00010011.
2. Split 0x13 into powers of x:
0x13 = 0x10 ⊕ 0x02 ⊕ 0x01.
3. Multiply 0x57 by x repeatedly, XORing with 0x1B whenever a bit overflows out of the byte:
0x57 × 0x02 = 0xAE
0x57 × 0x04 = 0x47
0x57 × 0x08 = 0x8E
0x57 × 0x10 = 0x07
4. XOR the multiples that are needed:
0x57 × 0x13 = 0x57 ⊕ 0xAE ⊕ 0x07 = 0xFE (11111110).
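
The two operations can be sketched in a few lines of Python. The function name gf_mul is my own choice; the check values are the FIPS-197 examples {57}•{13} = {fe} and {57}•{83} = {c1}.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        a <<= 1              # multiply a by x
        if a & 0x100:
            a ^= 0x11B       # reduce once the degree reaches 8
        b >>= 1
    return result


print(hex(0x57 ^ 0x83))         # addition is XOR
print(hex(gf_mul(0x57, 0x13)))  # multiplication with reduction
```

This is the shift-and-XOR form of the two steps above: the reduction is applied after every shift instead of once at the end, which keeps every intermediate value inside one byte.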

Key Expansion

The first step in AES is key expansion, where the original key (16 bytes for AES-128) is expanded into round keys. AES-128 has 10 rounds and needs 11 round keys: one for the initial AddRoundKey and one for each round.

Key Expansion Steps:

The original key is divided into 4 words (each 4 bytes): W[0], W[1], W[2], W[3].
Each new word is the XOR of the word four positions earlier and the previous word: W[i] = W[i−4] ⊕ W[i−1]. For the first word of each round key (i a multiple of 4), W[i−1] first goes through the following transformation:

RotWord: Rotate the word left by one byte.
SubWord: Apply the S-box to each byte.
XOR with Round Constant: XOR the result with a round constant specific to that round.

Initial Round

The first round is slightly different from the others. In the Initial Round, the data undergoes AddRoundKey only. Rounds 1 to 9 then apply SubBytes, ShiftRows, MixColumns, and AddRoundKey, and the final round (round 10) skips MixColumns.

1 AddRoundKey

Each byte in the state is XOR-ed with the corresponding byte of the round key.

Example:

Plain text (more on this later):
54 77 6F 20
4F 6E 65 20
4E 69 6E 65
54 77 6F 20
Round Key:
54 68 61 74
73 20 6D 79
4B 75 6E 67
20 46 75 00
XOR Result
00 1F 0E 54
3C 4E 08 59
05 1C 00 02
74 31 1A 20

SubBytes (Substitution)

The SubBytes step applies a non-linear transformation to each byte using the S-box (substitution box). This is what provides confusion: it obscures the relationship between the key and the ciphertext. The substitution itself is invertible, which is what decryption relies on.

S-box (Substitution Box)

The S-box is a pre-defined 16x16 table that substitutes each byte with another byte. The high 4 bits of the input select the row and the low 4 bits select the column.

For example, using the S-box:

0x57 → 0x5B
0x6F → 0xA8
0x20 → 0xB7
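
The table does not have to be memorized. FIPS-197 defines each entry as the multiplicative inverse in GF(2⁸) followed by a fixed affine transformation. Here is a minimal sketch of that construction; the helper names are mine and the inverse is found by brute force for clarity, not speed.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox(byte: int) -> int:
    """AES S-box entry: inverse in GF(2^8), then the affine transformation."""
    # Multiplicative inverse by brute force; 0 has no inverse and maps to 0.
    inv = next((c for c in range(1, 256) if gf_mul(byte, c) == 1), 0)
    # Affine step: XOR the inverse with its four left rotations and 0x63.
    out = inv
    for shift in range(1, 5):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63


print(hex(sbox(0x00)))  # 0x63, the first entry of the table
print(hex(sbox(0x53)))  # 0xed, the example used in FIPS-197
```

Real implementations precompute all 256 entries once and look them up.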

ShiftRows

In the ShiftRows step, the rows of the state matrix are cyclically shifted. The amount of shift depends on the row:

Row 0: No shift.
Row 1: Shift left by 1.
Row 2: Shift left by 2.
Row 3: Shift left by 3.

For example, if the state matrix is:

63 A7 B5 09
5F 9D 01 E2
37 FE C6 1B
A9 B7 23 D1

It becomes:

63 A7 B5 09
9D 01 E2 5F
C6 1B 37 FE
D1 A9 B7 23
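
In code, ShiftRows is a one-liner when the state is stored as a list of rows. A small sketch using the example matrix above (the function name is my own):

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state (given as a list of rows) left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


state = [
    [0x63, 0xA7, 0xB5, 0x09],
    [0x5F, 0x9D, 0x01, 0xE2],
    [0x37, 0xFE, 0xC6, 0x1B],
    [0xA9, 0xB7, 0x23, 0xD1],
]

for row in shift_rows(state):
    print(" ".join(f"{b:02X}" for b in row))
```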

MixColumns

In the MixColumns step, each column of the state is mixed. This is done by multiplying each column by a fixed matrix in GF(2⁸).

The matrix used in MixColumns is:

| 02 03 01 01 |
| 01 02 03 01 |
| 01 01 02 03 |
| 03 01 01 02 |

Each column, treated as a 4-byte vector, is multiplied by this matrix in GF(2⁸).

For example, suppose one column of the state is:

63
9D
C6
D1

The result after MixColumns is obtained by performing the matrix multiplication in GF(2⁸).
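
Here is a minimal sketch of that multiplication for a single column. The helper names xtime and mix_column are my own; multiplication by 02 is a shift with a conditional XOR of 0x1B, and multiplication by 03 is "times 02, then XOR the original".

```python
def xtime(a: int) -> int:
    """Multiply a byte by 02 in GF(2^8)."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def mix_column(col):
    """Multiply one 4-byte column by the fixed MixColumns matrix."""
    a0, a1, a2, a3 = col

    def mul3(a):
        return xtime(a) ^ a

    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,   # row 02 03 01 01
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,   # row 01 02 03 01
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),   # row 01 01 02 03
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),   # row 03 01 01 02
    ]


column = [0x63, 0x9D, 0xC6, 0xD1]
print(" ".join(f"{b:02X}" for b in mix_column(column)))
```

A handy sanity check is that a column of four equal bytes comes out unchanged, because 02 ⊕ 03 ⊕ 01 ⊕ 01 = 01.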

Now, let’s do a dry run for this!

Dry Run

Block Size: 128 bits = 16 bytes
Key Size: 128 bits = 16 bytes
Rounds: 10
Initial AddRoundKey + 9 rounds + Final round

We’ll use:

A plaintext block: 00112233445566778899aabbccddeeff
A key: 000102030405060708090a0b0c0d0e0f

We’ll trace encryption using hex to keep it byte-friendly.

I pulled a standard AES-128 test vector from the FIPS-197 document (the official AES specification). These values are used worldwide to test AES implementations for correctness.

When this 16-byte (128-bit) plaintext is arranged into the State Matrix (column-wise), it looks like:

[ 00 44 88 cc ]
[ 11 55 99 dd ]
[ 22 66 aa ee ]
[ 33 77 bb ff ]

For example, to convert “hello”, you would convert each of its characters to ASCII and then to hex. So:

h → 104 (ascii) → 68 (hex)
e → 101 (ascii) → 65 (hex)
l → 108 (ascii) → 6C (hex)
l → 108 (ascii) → 6C (hex)
o → 111 (ascii) → 6F (hex)
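
Both conversions are easy to check in Python. In this small sketch, to_state is a helper name I made up for the column-wise arrangement.

```python
def to_state(block: bytes):
    """Arrange 16 bytes column by column into a 4x4 matrix (list of rows)."""
    if len(block) != 16:
        raise ValueError("an AES block is exactly 16 bytes")
    return [[block[4 * col + row] for col in range(4)] for row in range(4)]


block = bytes.fromhex("00112233445566778899aabbccddeeff")
for row in to_state(block):
    print(" ".join(f"{b:02x}" for b in row))

# Text to hex, one byte per character for plain ASCII.
print("hello".encode("ascii").hex(" "))
```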

1. Key Expansion

AES generates 11 round keys (each 16 bytes) from the initial key using rotations, substitution (S-box), and XOR with round constants (Rcon).
Let’s denote:

RoundKey0 = original key
RoundKey1 to RoundKey10 = derived via key schedule

The key 000102030405060708090a0b0c0d0e0f split into 4 words (W0–W3), 4 bytes each:

W0 → 00 01 02 03
W1 → 04 05 06 07
W2 → 08 09 0a 0b
W3 → 0c 0d 0e 0f

2. Generate W4 (First word of RoundKey1)

W4 = W0 ⊕ g(W3)
Where g() is a special function that includes:

RotWord: rotate bytes of W3
SubWord: apply S-box to each byte
XOR with Rcon

2.1: RotWord(W3)

W3 = 0c 0d 0e 0f

Rotate left → 0d 0e 0f 0c

2.2 SubWord

Apply AES S-box (we’ll use known substitutions from standard AES S-box table):

0D → D7
0E → AB
0F → 76
0C → FE

So,

SubWord = D7 AB 76 FE

2.3 XOR with Rcon[1]

Rcon[1] = 01 00 00 00

So,

D7 ⊕ 01 = D6
AB ⊕ 00 = AB
76 ⊕ 00 = 76
FE ⊕ 00 = FE

→ g(W3) = D6 AB 76 FE

3. Compute W4

W4 = W0 ⊕ g(W3)
= 00 01 02 03 ⊕ D6 AB 76 FE
= D6 AA 74 FD

4. Compute W5

W5 = W4 ⊕ W1
= D6 AA 74 FD ⊕ 04 05 06 07
= D2 AF 72 FA

5. Compute W6

W6 = W5 ⊕ W2
= D2 AF 72 FA ⊕ 08 09 0A 0B
= DA A6 78 F1

6. Compute W7

W7 = W6 ⊕ W3
= DA A6 78 F1 ⊕ 0C 0D 0E 0F
= D6 AB 76 FE

These form RoundKey1.

We can keep going till W43, generating RoundKey2 through RoundKey10 the same way.
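
The first round key can be reproduced with a short script. This is a sketch for the key 000102030405060708090a0b0c0d0e0f from the FIPS-197 example only; the helper names are mine, and the S-box is computed from its definition (inverse plus affine map) instead of being typed in as a table.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox(byte: int) -> int:
    """AES S-box entry: inverse in GF(2^8), then the affine transformation."""
    inv = next((c for c in range(1, 256) if gf_mul(byte, c) == 1), 0)
    out = inv
    for shift in range(1, 5):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63


def g(word, rcon):
    """RotWord, SubWord, then XOR the first byte with the round constant."""
    rotated = word[1:] + word[:1]
    substituted = [sbox(b) for b in rotated]
    substituted[0] ^= rcon
    return substituted


def xor_words(a, b):
    return [x ^ y for x, y in zip(a, b)]


key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
w = [list(key[4 * i:4 * i + 4]) for i in range(4)]   # W0..W3
w.append(xor_words(w[0], g(w[3], 0x01)))             # W4 = W0 xor g(W3)
for i in range(5, 8):
    w.append(xor_words(w[i - 4], w[i - 1]))          # W5, W6, W7

round_key_1 = "".join(bytes(word).hex() for word in w[4:8])
print(round_key_1)
```

Later round keys follow the same pattern, with the round constant doubling in GF(2⁸) each time: 01, 02, 04, and so on.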

7. Initial AddRoundKey

We XOR the state with RoundKey0 (which is the original key):

[ 00⊕00 44⊕04 88⊕08 cc⊕0c ] = [ 00 40 80 c0 ]
[ 11⊕01 55⊕05 99⊕09 dd⊕0d ] = [ 10 50 90 d0 ]
[ 22⊕02 66⊕06 aa⊕0a ee⊕0e ] = [ 20 60 a0 e0 ]
[ 33⊕03 77⊕07 bb⊕0b ff⊕0f ] = [ 30 70 b0 f0 ]

→ New State after Initial AddRoundKey:

[ 00 40 80 c0 ]
[ 10 50 90 d0 ]
[ 20 60 a0 e0 ]
[ 30 70 b0 f0 ]

8. Round 1 (of 10 total rounds)

We now go through the 4 main steps of a typical AES round:

SubBytes
ShiftRows
MixColumns
AddRoundKey

Current State (after Initial AddRoundKey)
[ 00 40 80 C0 ]
[ 10 50 90 D0 ]
[ 20 60 A0 E0 ]
[ 30 70 B0 F0 ]

Step 1: SubBytes

Using the standard AES S-box on each byte of the state:

(Rather than writing out a long table, I’ll give the new state directly.)

This:

[ 00 40 80 C0 ]
[ 10 50 90 D0 ]
[ 20 60 A0 E0 ]
[ 30 70 B0 F0 ]

becomes this:

[ 63 09 CD BA ]
[ CA 53 60 70 ]
[ B7 D0 E0 E1 ]
[ 04 51 E7 8C ]

Step 2: ShiftRows

We shift the rows leftwards, with each row shifted by its index:

Row 0: no shift
Row 1: shift left by 1
Row 2: shift left by 2
Row 3: shift left by 3

Row 0: 63 09 CD BA → 63 09 CD BA
Row 1: CA 53 60 70 → 53 60 70 CA
Row 2: B7 D0 E0 E1 → E0 E1 B7 D0
Row 3: 04 51 E7 8C → 8C 04 51 E7

New state:

[ 63 09 CD BA ]
[ 53 60 70 CA ]
[ E0 E1 B7 D0 ]
[ 8C 04 51 E7 ]

Step 3: MixColumns

This step is the trickiest. Each column is transformed using matrix multiplication in GF(2⁸):

Each column is multiplied by:

| 02 03 01 01 |
| 01 02 03 01 |
| 01 01 02 03 |
| 03 01 01 02 |

Let’s do it for Column 0 (63, 53, E0, 8C):

We’ll use the standard Rijndael multiplication rules:

Multiply by 1: same value
Multiply by 2: left-shift + conditional XOR with 0x1B if overflow
Multiply by 3: multiply by 2, then XOR with original

The first output byte is 02·63 ⊕ 03·53 ⊕ E0 ⊕ 8C = C6 ⊕ F5 ⊕ E0 ⊕ 8C = 5F. The other three rows of the matrix give 72, 64 and 15.

For the remaining columns I’ll just provide the result directly for brevity (verified against the FIPS-197 test vector):

The MixColumns step gives:

[ 5F 57 F7 1D ]
[ 72 F5 BE B9 ]
[ 64 BC 3B F9 ]
[ 15 92 29 1A ]

Step 4: AddRoundKey

Now XOR with RoundKey1. For the key 000102030405060708090a0b0c0d0e0f, the key schedule gives:

W4: D6 AA 74 FD
W5: D2 AF 72 FA
W6: DA A6 78 F1
W7: D6 AB 76 FE

So, RoundKey1 (column-wise):

[ D6 D2 DA D6 ]
[ AA AF A6 AB ]
[ 74 72 78 76 ]
[ FD FA F1 FE ]

Do XOR column by column:

Column 0:

5F ⊕ D6 = 89
72 ⊕ AA = D8
64 ⊕ 74 = 10
15 ⊕ FD = E8

Column 1:

57 ⊕ D2 = 85
F5 ⊕ AF = 5A
BC ⊕ 72 = CE
92 ⊕ FA = 68

Column 2:

F7 ⊕ DA = 2D
BE ⊕ A6 = 18
3B ⊕ 78 = 43
29 ⊕ F1 = D8

Column 3:

1D ⊕ D6 = CB
B9 ⊕ AB = 12
F9 ⊕ 76 = 8F
1A ⊕ FE = E4

New State After Round 1:

[ 89 85 2D CB ]
[ D8 5A 18 12 ]
[ 10 CE 43 8F ]
[ E8 68 D8 E4 ]

We now repeat exactly the same process for Rounds 2–9. Round 10 is the same too, except that we don’t apply MixColumns in that round.

Thus, to summarize:

From Round 1 to Round 9:

Apply these 4 steps:

SubBytes (byte-wise substitution via S-box)
ShiftRows (row-wise left shifts)
MixColumns (matrix multiplication in GF(2⁸))
AddRoundKey (XOR with round key)

Round 10:

Same as above except:

No MixColumns
Only:
SubBytes
ShiftRows
AddRoundKey

The final output after Round 10 is your ciphertext.
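
To check the whole dry run end to end, here is a compact sketch of AES-128 encryption for a single block. It is written for readability and is not constant-time or otherwise hardened; the function names and the trace argument are my own. The state is a flat list of 16 bytes in column order.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """S-box = multiplicative inverse in GF(2^8) followed by the affine map."""
    table = []
    for x in range(256):
        inv = next((c for c in range(1, 256) if gf_mul(x, c) == 1), 0)
        out = inv
        for shift in range(1, 5):
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        table.append(out ^ 0x63)
    return table


SBOX = build_sbox()


def expand_key(key):
    """AES-128 key schedule: 11 round keys of 16 bytes each."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = [SBOX[b] for b in temp[1:] + temp[:1]]  # RotWord, SubWord
            temp[0] ^= rcon
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return [sum(words[4 * r:4 * r + 4], []) for r in range(11)]


def mix_columns(state):
    out = []
    for c in range(4):
        a = state[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gf_mul(2, a[r]) ^ gf_mul(3, a[(r + 1) % 4])
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out


def encrypt_block(plaintext, key, trace=None):
    round_keys = expand_key(key)
    state = [p ^ k for p, k in zip(plaintext, round_keys[0])]  # initial AddRoundKey
    for rnd in range(1, 11):
        state = [SBOX[b] for b in state]                         # SubBytes
        state = [state[4 * ((c + r) % 4) + r]                    # ShiftRows
                 for c in range(4) for r in range(4)]
        if rnd != 10:
            state = mix_columns(state)                           # skipped in round 10
        state = [s ^ k for s, k in zip(state, round_keys[rnd])]  # AddRoundKey
        if trace is not None:
            trace.append(bytes(state).hex())
    return bytes(state)


plaintext = bytes.fromhex("00112233445566778899aabbccddeeff")
key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
trace = []
ciphertext = encrypt_block(plaintext, key, trace)
print("after round 1:", trace[0])
print("ciphertext:   ", ciphertext.hex())
```

For this plaintext and key, FIPS-197 lists the ciphertext 69c4e0d86a7b0430d8cdb78070b4c55a, so the printed value can be compared against the specification.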

Conclusion

So, this was AES, a sophisticated encryption algorithm with no known practical attack on the full cipher when it is used properly, and one that takes real care to implement correctly! In the next blog we’ll code it up in C++. Hope you liked it :D

Deep Equilibrium Models: Neural Networks Without Layers

Void — Wed, 02 Apr 2025 09:07:42 GMT

What are Deep Equilibrium Models?

Deep Equilibrium Models (DEQs) redefine deep learning by replacing explicit layer-wise transformations with an implicit function that finds an equilibrium state.

Mathematical Formulation of DEQs

A standard deep neural network with L layers is defined recursively as:

z_{l+1} = f(z_l, x), for l = 0, 1, …, L − 1

where f is a transformation function (e.g., a ResNet block) and x is the input.

Instead of explicitly computing multiple layers, a DEQ finds a fixed point:

z* = f(z*, x)

Here, z* is the representation at equilibrium, meaning applying f further does not change z*.

Finding the Equilibrium: Root-Finding Methods

The equilibrium equation z* = f(z*, x) can be rewritten as a root-finding problem:

g(z) = f(z, x) − z = 0

1.1 Fixed-Point Iteration

One way to solve this equation is through iterative updates:

zᵗ⁺¹ = f(zᵗ, x)

This continues until zᵗ⁺¹ ≈ zᵗ, meaning convergence has been reached. However, simple fixed-point iteration can be slow, and it is only guaranteed to converge when f is a contraction in z.
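
A scalar toy version makes the iteration easy to follow. The function f below is a hypothetical one-dimensional "layer" chosen so that it is a contraction; it is not taken from any DEQ implementation.

```python
import math


def f(z, x, w=0.5):
    """Hypothetical scalar layer; |w| < 1 makes it a contraction in z."""
    return math.tanh(w * z + x)


def fixed_point(x, z0=0.0, tol=1e-10, max_iter=100):
    """Iterate z <- f(z, x) until successive values agree within tol."""
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z, x)
        if abs(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter


z_star, steps = fixed_point(x=0.3)
print(z_star, steps)
```

Each step shrinks the error by at most a factor of w here, so the number of steps grows as the tolerance gets tighter.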

1.2 Broyden’s Method (Quasi-Newton)

To accelerate convergence, DEQs use Broyden’s method, an efficient quasi-Newton method that approximates the Jacobian inverse without explicitly computing it. It updates the estimate of z* using:

zᵗ⁺¹ = zᵗ − B g(zᵗ), with B ≈ J⁻¹

where J=∂g/∂z is the Jacobian. Broyden’s method avoids expensive matrix inversions by iteratively approximating J⁻¹, leading to faster convergence.
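
In one dimension Broyden’s update reduces to the secant method, which is enough to show the idea. The sketch below uses a hypothetical scalar g(z) = tanh(w·z + x) − z; real DEQs apply the same idea to high-dimensional z with low-rank updates of the inverse Jacobian estimate.

```python
import math


def g(z, x=0.3, w=0.5):
    """Residual of a hypothetical scalar layer: g(z) = f(z, x) - z."""
    return math.tanh(w * z + x) - z


def broyden_1d(z0=0.0, tol=1e-10, max_iter=50):
    """Quasi-Newton root finding with a secant estimate of the inverse Jacobian."""
    z_prev, g_prev = z0, g(z0)
    inv_jac = -1.0                      # initial guess; first step equals z <- f(z)
    z = z_prev - inv_jac * g_prev
    for step in range(1, max_iter + 1):
        g_curr = g(z)
        if abs(g_curr) < tol:
            return z, step
        # Update the inverse Jacobian estimate from the last two iterates.
        inv_jac = (z - z_prev) / (g_curr - g_prev)
        z_prev, g_prev = z, g_curr
        z = z - inv_jac * g_curr
    return z, max_iter


root, steps = broyden_1d()
print(root, steps)
```

On this toy problem the quasi-Newton iteration needs far fewer steps than plain fixed-point iteration to reach the same tolerance.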

Training DEQs with Implicit Differentiation

Unlike traditional deep networks that require storing activations for backpropagation, DEQs train using implicit differentiation.

1.1 Loss Function

A DEQ is trained by defining a loss function L over the equilibrium state:

L = ℓ(z*, y)

where y is the ground truth. The goal is to compute the gradient:

∂L/∂θ

directly without storing the full forward pass.

1.2 Implicit Function Theorem

The equilibrium condition satisfies:

z* = f(z*, x; θ)

Taking the total derivative w.r.t. θ:

dz*/dθ = J dz*/dθ + ∂f/∂θ, so (I − J) dz*/dθ = ∂f/∂θ

where J=∂f/∂z is the Jacobian evaluated at z*. The gradient of the loss w.r.t. θ is:

∂L/∂θ = (∂L/∂z*) (I − J)⁻¹ ∂f/∂θ

which is computed by solving the linear system:

(I − J)ᵀ v = (∂L/∂z*)ᵀ

for v, and then computing:

∂L/∂θ = vᵀ ∂f/∂θ

This allows backpropagation without storing intermediate activations, drastically reducing memory consumption.
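
The formula can be sanity-checked numerically on a scalar example. In this sketch f(z, x; θ) = tanh(θ·z + x) is a hypothetical layer, and the implicit gradient is compared with a finite-difference estimate.

```python
import math


def f(z, x, theta):
    return math.tanh(theta * z + x)


def solve(x, theta, iters=200):
    """Fixed point z* = f(z*, x; theta) by plain iteration."""
    z = 0.0
    for _ in range(iters):
        z = f(z, x, theta)
    return z


def implicit_grad(x, theta):
    """dz*/dtheta = (1 - J)^(-1) * df/dtheta, evaluated at the fixed point."""
    z = solve(x, theta)
    slope = 1.0 - z * z           # tanh'(u) = 1 - tanh(u)^2, and tanh(u) = z at z*
    df_dz = slope * theta         # J
    df_dtheta = slope * z
    return df_dtheta / (1.0 - df_dz)


x, theta, h = 0.3, 0.5, 1e-6
numeric = (solve(x, theta + h) - solve(x, theta - h)) / (2 * h)
print(implicit_grad(x, theta), numeric)
```

The two numbers agree closely, even though the implicit formula never differentiates through the 200 solver iterations.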

Implementation in PyTorch

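import torch
import torch.nn as nn


class DEQFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, f, z0, x, *params):
        # params are not used directly here; passing them as inputs lets
        # autograd deliver the gradients computed in backward() to them.
        with torch.no_grad():
            z_star = z0
            for _ in range(50):
                z_next = f(z_star, x)
                converged = torch.norm(z_next - z_star) < 1e-5
                z_star = z_next
                if converged:
                    break
        ctx.save_for_backward(z_star, x, *params)
        ctx.f = f
        return z_star

    @staticmethod
    def backward(ctx, grad_output):
        z_star, x, *params = ctx.saved_tensors
        with torch.enable_grad():
            # Re-apply f once at the fixed point so autograd can supply
            # vector-Jacobian products w.r.t. z, x and the parameters.
            z = z_star.detach().requires_grad_()
            x_in = x.detach().requires_grad_()
            f_out = ctx.f(z, x_in)
        # Solve (I - J^T) v = grad_output with the iteration
        # v <- J^T v + grad_output, where J = df/dz at z_star.
        # This avoids building the full Jacobian; like the forward pass,
        # it assumes f is a contraction in z.
        v = grad_output
        for _ in range(50):
            v_next = torch.autograd.grad(f_out, z, v, retain_graph=True)[0] + grad_output
            converged = torch.norm(v_next - v) < 1e-5
            v = v_next
            if converged:
                break
        # Chain rule through the single application of f:
        # v^T df/dx and v^T df/dtheta.
        grads = torch.autograd.grad(f_out, (x_in, *params), v)
        # No gradient for f (not a tensor) or z0 (z_star does not depend on it).
        return (None, None, *grads)


class DEQModel(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.f_theta = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

    def f(self, z, x):
        return self.f_theta(z + x)

    def forward(self, x):
        z0 = torch.zeros_like(x)
        return DEQFunction.apply(self.f, z0, x, *self.parameters())


model = DEQModel(hidden_dim=128)
x = torch.randn(32, 128)
z_star = model(x)
z_star.sum().backward()  # gradients reach model.f_theta via implicit differentiation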

Here’s what happened:

1. DEQFunction (Custom Autograd Function)

This class defines the fixed-point iteration and implicit differentiation for backpropagation.

a. forward(ctx, f, z0, x)

Iteratively refines z until it converges to a fixed point.
Stops when ∣z_next − z∣ is small enough.
Saves z_star and x for backward pass.

b. backward(ctx, grad_output)

Computes gradients using implicit differentiation (bypassing full backprop).
Uses the Jacobian J of f at z_star to efficiently compute gradients.

2. DEQModel (The Neural Network)

This is the main equilibrium model.

a. __init__(self, hidden_dim)

Defines f_theta, a small feedforward function (MLP with ReLU).

b. f(self, z, x)

Defines how the function f_theta uses z and x.
Uses z + x as input before passing it through f_theta (ensuring both inputs are considered).

c. forward(self, x)

Initializes z0 (starting point).
Calls DEQFunction.apply(self.f, z0, x), solving for z_star.

Conclusion

Deep Equilibrium Models (DEQs) are a cool new way of building AI models without stacking tons of layers: they just keep updating themselves until they reach a stable state. This makes them super memory-efficient to train and a good fit for complex tasks like language processing and computer vision.

But they’re not perfect. Training them can be tricky, they sometimes struggle with stability, and debugging is harder since there aren’t clear layers to analyze. Still, if we can make them faster and more reliable, DEQs could change the future of AI by making models smarter instead of just bigger.

DreamFusion — A Method Used For 3D Model Generation

Void — Tue, 01 Apr 2025 07:43:02 GMT

What is DreamFusion?

DreamFusion is an approach that generates 3D models from text prompts using a neural radiance field (NeRF) framework. The idea is to take a text description (like “a red car in a desert”) and generate a detailed 3D model that can be viewed from any angle, making it a significant advancement in 3D generation.

NeRF (Neural Radiance Fields) is a novel technique for representing 3D scenes using neural networks. It essentially models a 3D scene using a continuous volumetric scene representation that can generate high-quality views of the scene by learning a function that maps 3D coordinates (x, y, z) and viewing direction (θ, φ) to color and density values at each point.

In this article, I’ll be covering the entire maths of it!

How DreamFusion and NeRF Work Together

Text-to-Image to 3D Conversion: DreamFusion builds on a pretrained text-to-image model. The original paper uses the Imagen diffusion model; CLIP, by contrast, scores how well an image matches a text and does not generate images by itself. In the simplified picture used in this article, the model turns the text prompt into a 2D image, similar to how DALL·E or Midjourney work, and DreamFusion pushes this further by using the NeRF method for 3D reconstruction.
NeRF for 3D Scene Representation: Once the 2D image is generated, the system employs NeRF to optimize a 3D scene representation. NeRF uses the pixel colors and their positions in 3D space to reconstruct the entire 3D geometry of the scene, even generating photorealistic lighting, shadows, and texture effects.
Optimization Process: DreamFusion leverages a differentiable renderer to generate 2D projections from 3D space and compare these with the 2D target produced by the text-to-image model. This feedback loop is used to refine the 3D scene generation in a way that is consistent with the initial text prompt.
Final Output: The result is a 3D model that can be rotated and viewed from various angles, ready for applications like VR, AR, or 3D printing.

Now, let’s get into the maths.

Text-to-Image Generation (Vision-Language Models)

DreamFusion first converts the text prompt T into a 2D image I. This step is generally performed by a text-to-image model such as DALL·E or Imagen, which maps text to images through a deep learning framework that leverages large amounts of training data. (I covered this in the last two blogs)
Let’s define the mapping function f:

I = f(T)

Here:

T∈ℝᵈ represents the vectorized text prompt, where d is the dimensionality of the text embedding.
f: ℝᵈ → ℝᴴˣᵂˣ³ is a function that maps the text vector to an image of height H, width W, and 3 color channels (RGB).

The goal here is to produce a 2D image I that aligns with the textual description T.

Neural Radiance Fields (NeRF)

NeRF represents 3D scenes as a continuous volumetric scene where the density and radiance are modeled at each point in space. A neural network learns to predict the radiance and density at each point, which is key to the 3D scene rendering process.

2.1 Representation of 3D Space with NeRF

In NeRF, each 3D point x is represented as:

x = (x, y, z) ∈ ℝ³

We can define the neural network F as:

F: x ↦ (c(x), σ(x))

This network takes the 3D position x = (x,y,z) as input and outputs:

Color c(x)=(R,G,B) at the point x.
Density σ(x) at the point x.

2.2 View-dependent Rendering

To simulate a 3D scene from a specific camera viewpoint, NeRF extends the 3D neural network F(x) by also considering the viewing direction d. The viewing direction corresponds to the angle from which the camera is observing the scene.

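Let the viewing direction be represented by a unit vector:

d = (d_x, d_y, d_z), with ‖d‖ = 1

This direction is incorporated into the network as:

F(x, d) = (c(x, d), σ(x))

where c(x, d) is the color at x seen from direction d and σ(x) is the density at that point. In NeRF the density depends only on position, while the color may change with the viewing direction.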

2.3 Camera Rays and Volume Rendering

The key operation in NeRF is volume rendering, which calculates the color of a pixel by integrating over all points along the ray cast from the camera into the scene.

For a camera ray, we parameterize the ray as:

r(t) = o + t d

where:

o is the camera origin (position),
d is the direction of the ray,
t∈[0,∞) is the distance along the ray.

Each ray is cast through the 3D scene, and we compute the accumulated color using volume rendering:

C(r) = ∫₀^∞ T(t) σ(r(t)) c(r(t), d) dt

where:

T(t) = exp(−∫₀ᵗ σ(r(s)) ds) is the transmittance (the fraction of light that travels from the origin to distance t without being absorbed). It models how light is attenuated by the density at each point along the ray.
σ(r(t)) is the density at r(t) along the ray at distance t.
c(r(t),d) is the color emitted from the point r(t) in direction d.

The integral accumulates the contributions of each point along the ray, combining both color and density to compute the final pixel color.
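
In practice the integral is approximated by sampling points along the ray. The sketch below uses the standard discretization from the NeRF paper, where each segment gets an opacity 1 − exp(−σδ); the sample values themselves are made up for the example.

```python
import math


def render_ray(sigmas, colors, deltas):
    """Discretized volume rendering along one ray.

    sigmas[i]: density at sample i
    colors[i]: (r, g, b) emitted at sample i
    deltas[i]: length of the ray segment that sample i represents
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0                          # T at the ray origin
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha             # light surviving past the segment
    return pixel, transmittance


# Made-up example: uniform orange fog of density 0.5 along a ray of length 4.
n = 100
pixel, remaining = render_ray([0.5] * n, [(1.0, 0.5, 0.0)] * n, [4.0 / n] * n)
print(pixel, remaining)
```

For uniform fog the result has a closed form, color × (1 − exp(−σ·length)), which makes this case a convenient check.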

Optimizing the 3D Model

To align the 3D model with the original text prompt, DreamFusion optimizes the NeRF model to minimize the error between the rendered 3D views and the 2D image I generated from the text description.

3.1 Loss Function for Optimization

The optimization process involves minimizing a loss function L, which measures the difference between the rendered 3D view and the target 2D image.

The loss is typically a pixel-wise loss, such as mean squared error (MSE):

L = (1/N) Σᵢ ‖Îᵢ − Iᵢ‖²

where:

Îᵢ is the predicted color of pixel i in the rendered 3D image.
Iᵢ is the corresponding pixel color from the 2D image I.
N is the total number of pixels in the image.

3.2 Backpropagation

Now that we have the loss function, the goal is to minimize this loss by adjusting the parameters of the neural network F(x, d), the weights of the network that predict the radiance and density. To do this, we need to compute the gradients of the loss function with respect to the parameters θ of the network.

The backpropagation process is used to compute these gradients. Backpropagation works by using the chain rule of calculus to propagate the error backward through the network. Here’s how it works in more detail:

Compute the Loss:
We first calculate the loss L between the predicted rendered image Î and the target image I.
Compute the Gradient of the Loss:
The gradients of the loss L with respect to the network parameters θ are computed by propagating the error backward through the network. The gradient ∂L/∂θ tells us how to adjust the parameters of the network to reduce the error.
Gradient Calculation Using the Chain Rule:
The error is propagated through each layer of the neural network. For the network F(x,d), this involves calculating the gradient of the color and density predictions with respect to the parameters of the network.
For example, in the volume rendering process, the contribution of each point to the rendered pixel color C(r) depends on both the density σ(x) and the color c(x,d). These quantities are outputs of the neural network, and we need to compute how changes in the network parameters affect the final rendered image.
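The chain rule is used to compute how changes in the network weights affect the final loss. For each pixel i, we calculate the gradient of L with respect to the model parameters:

∂L/∂θ = Σᵢ (∂L/∂Îᵢ) · (∂Îᵢ/∂θ), where ∂Îᵢ/∂θ = (∂Îᵢ/∂σ)(∂σ/∂θ) + (∂Îᵢ/∂c)(∂c/∂θ), accumulated over the sample points along the ray.
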
3.3 Optimization

Once we have the gradients, we use an optimization algorithm to adjust the parameters θ of the neural network. The simplest method is gradient descent; Adam is a widely used variant of it.

In gradient descent, the parameters are updated as follows:

θₜ₊₁ = θₜ − η · ∂L/∂θ

Where:

θₜ is the parameter vector at iteration t,
η is the learning rate, which controls how much the parameters are adjusted in each step,
∂L/∂θ is the gradient of the loss with respect to the parameters.
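
The update rule can be sketched in a few lines of plain Python. The toy loss and the helper name `gradient_descent` are assumptions made for illustration only.

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * gradient."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta = gradient_descent(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(theta)  # very close to 3.0, the minimum of the toy loss
```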

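For Adam optimization, the updates are slightly more involved because it uses both momentum and adaptive learning rates for each parameter:

mₜ = β₁ mₜ₋₁ + (1 − β₁) gₜ
vₜ = β₂ vₜ₋₁ + (1 − β₂) gₜ²
θₜ₊₁ = θₜ − η · m̂ₜ / (√v̂ₜ + ε)

Here gₜ = ∂L/∂θ is the gradient at iteration t, m̂ₜ = mₜ / (1 − β₁ᵗ) and v̂ₜ = vₜ / (1 − β₂ᵗ) are bias-corrected estimates, and ε is a small constant for numerical stability. Adam tracks the first moment (a running mean) and the second moment (a running mean of squares, i.e. the uncentered variance) of the gradients and scales each parameter's update accordingly.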
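
A minimal NumPy sketch of the Adam update, assuming the commonly used default values for β₁, β₂ and ε. The function name `adam_minimize` and the toy quadratic loss are illustrative.

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = np.zeros_like(theta)   # first moment (running mean of gradients)
    v = np.zeros_like(theta)   # second moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy loss: squared distance to a target vector
target = np.array([3.0, -2.0])
theta = adam_minimize(lambda th: 2.0 * (th - target), np.zeros(2))
print(theta)
```

Because each parameter's step is divided by the square root of its own second moment, parameters with large and small gradients move at comparable speeds.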

Iterative Optimization Process

The above steps (rendering, loss calculation, backpropagation, and parameter updates) are repeated iteratively. In each iteration, the neural network improves its predictions by minimizing the error between the rendered 3D scene and the target 2D image.

As the optimization progresses:

The network learns better representations of the color and density values in the scene.
Over time, the rendered images become more realistic and aligned with the 2D target image, eventually resulting in a 3D scene that matches the input prompt.

Final 3D Model

After optimization, the output is a 3D model that can be rendered from any viewpoint. The 3D scene is stored as a continuous volumetric representation, and the resulting model can be rotated and viewed from any angle.

Stable Diffusion II — Implementing It From Scratch in Python

Void — Thu, 20 Mar 2025 12:38:07 GMT

Quick Recap

Stable Diffusion is an AI model that creates images from text by starting with pure noise (like static) and gradually removing the noise step by step until the image matches the description. It learns how to turn noise into meaningful images using millions of examples, combining deep neural networks with attention mechanisms to understand both visual patterns and text prompts.

Things this will cover

In the previous article, we covered the mathematics behind Stable Diffusion. In this article, we’ll implement it in Python, using PyTorch and NumPy for the model and training code.

Importing libraries

import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
import os
from PIL import Image
import torchvision.transforms as transforms
from tqdm import tqdm
import json
import glob
import torch.nn as nn
import torch.nn.functional as F

Here:

PyTorch: For neural network models and training.
NumPy: For numerical computations.
PIL (Pillow): For image processing.
torchvision.transforms: To apply transformations to images.
tqdm: To show progress bars during training.
glob, os, json: For file handling.

Cross Attention

The CrossAttention class is a module used within the U-Net architecture to integrate textual information with image features. It lets every spatial location in the feature map attend to the text tokens, so the text description can steer what is generated at each position.

Here is the code:


class CrossAttention(nn.Module):
    def __init__(self, channels, text_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.scale = (channels // heads) ** -0.5
        
        self.norm = nn.LayerNorm([channels])
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(text_dim, channels)
        self.to_v = nn.Linear(text_dim, channels)
        self.to_out = nn.Linear(channels, channels)
        
    def forward(self, x, text_embedding):
        # Make sure both inputs are float32
        x = x.float()
        text_embedding = text_embedding.float()
        
        # Flatten the spatial dimensions so that each pixel becomes one query token
        b, c, h, w = x.shape
        x_flat = x.reshape(b, c, h * w).permute(0, 2, 1)  # [B, H*W, C]
        
        # Apply layer norm
        x_norm = self.norm(x_flat)
        
        # Project to queries, keys, values
        q = self.to_q(x_norm).reshape(b, h * w, self.heads, c // self.heads).permute(0, 2, 1, 3)  # [B, heads, H*W, C//heads]
        k = self.to_k(text_embedding).reshape(b, -1, self.heads, c // self.heads).permute(0, 2, 1, 3)  # [B, heads, seq_len, C//heads]
        v = self.to_v(text_embedding).reshape(b, -1, self.heads, c // self.heads).permute(0, 2, 1, 3)  # [B, heads, seq_len, C//heads]
        
        # Attention
        attn = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        attn = F.softmax(attn, dim=-1)
        
        # Apply attention to values
        out = torch.matmul(attn, v).permute(0, 2, 1, 3).reshape(b, h * w, c)
        out = self.to_out(out).permute(0, 2, 1).reshape(b, c, h, w)
        
        # Residual connection
        return x + out

Let’s understand what exactly happened here:

1. Initialization (__init__):

channels: The number of channels in the input image features.
text_dim : The dimensionality of the text embeddings.
heads: The number of attention heads. The channels are split evenly across the heads, so each head can attend to the text in a different subspace (channels must be divisible by heads).

2. Normalization and Linear Layers:

norm: A layer normalization module to normalize the input features.
to_q, to_k, to_v: Linear layers that project the input features and text embeddings into queries, keys, and values for attention.
to_out: A linear layer to transform the output of the attention mechanism.

3. Forward Pass (forward):

Input: The module takes in image features (x) and text embeddings (text_embedding).
Processing:

Reshape and Normalize: Reshape the spatial dimensions of the image features and apply layer normalization.
Project to Queries, Keys, Values: Transform the normalized features and text embeddings into queries, keys, and values.
Attention Mechanism: Compute attention scores between queries and keys, apply softmax, and then apply these scores to values.
Output Transformation: Transform the output of the attention mechanism using the to_out layer.
Residual Connection: Add the original input (x) to the transformed output to maintain information flow.
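
To make the attention step concrete, here is a single-head NumPy sketch with random inputs. The sizes are made up, and the learned projections (to_q, to_k, to_v) are skipped, so q, k and v are simply assumed to be already projected.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_pixels, seq_len, dim = 16, 5, 8      # 4x4 feature map, 5 text tokens

q = rng.normal(size=(num_pixels, dim))   # queries come from the image features
k = rng.normal(size=(seq_len, dim))      # keys come from the text embeddings
v = rng.normal(size=(seq_len, dim))      # values come from the text embeddings

scores = q @ k.T * dim ** -0.5           # [num_pixels, seq_len]
attn = softmax(scores, axis=-1)          # each pixel's weights over the text tokens
out = attn @ v                           # [num_pixels, dim]

print(attn.shape, out.shape)             # (16, 5) (16, 8)
print(attn.sum(axis=-1))                 # every row sums to 1
```

Note the direction of the attention: the softmax runs over the text tokens, so each image location gets its own weighted mix of text information.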

SimpleUNet Class

The SimpleUNet is used within a diffusion model to predict noise added during the forward diffusion process.
It helps the model learn how to remove noise step-by-step, guided by text descriptions, allowing for text-conditioned image generation.

class SimpleUNet(nn.Module):
    """
    Simplified U-Net architecture
    """
    def __init__(self, in_channels=4, out_channels=4, time_dim=256, text_dim=768):
        super().__init__()
        
        # Time embedding
        self.time_mlp = nn.Sequential(
            nn.Linear(1, time_dim),
            nn.SiLU(),
            nn.Linear(time_dim, time_dim)
        )
        
        # Downsampling path
        self.down1 = nn.Conv2d(in_channels, 64, 3, padding=1)
        self.down2 = nn.Conv2d(64, 128, 3, padding=1, stride=2)
        self.down3 = nn.Conv2d(128, 256, 3, padding=1, stride=2)
        
        # Cross-attention layer
        self.cross_attn = CrossAttention(256, text_dim)
        
        # Middle
        self.mid = nn.Sequential(
            nn.Conv2d(256, 512, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(512, 256, 3, padding=1)
        )
        
        # Upsampling path
        self.up1 = nn.ConvTranspose2d(512, 128, 4, padding=1, stride=2)
        self.up2 = nn.ConvTranspose2d(256, 64, 4, padding=1, stride=2)
        self.out = nn.Conv2d(128, out_channels, 3, padding=1)
        
    def forward(self, x, t, text_embedding):
        # Cast all inputs to float32
        x = x.float()
        t = t.float()
        text_embedding = text_embedding.float()
        
        # Time embedding
        t_emb = self.time_mlp(t.unsqueeze(-1))  # [B, time_dim]
        
        # Downsampling
        d1 = F.silu(self.down1(x))
        
        # Add time embedding to d1: broadcast t_emb over the spatial dimensions
        t_emb_d1 = t_emb.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, d1.shape[2], d1.shape[3])
        t_emb_d1 = t_emb_d1[:, :d1.shape[1], :, :]  # Keep only the first d1.shape[1] of the time_dim channels so the shapes match
        d1 = d1 + t_emb_d1
        
        d2 = F.silu(self.down2(d1))
        d3 = F.silu(self.down3(d2))
        
        # Cross-attention with text embeddings
        d3 = self.cross_attn(d3, text_embedding)
        
        # Middle block (no additional time or text conditioning is applied here)
        mid = self.mid(d3)
        
        # Upsampling with skip connections
        u1 = torch.cat([mid, d3], dim=1)
        u1 = F.silu(self.up1(u1))
        u2 = torch.cat([u1, d2], dim=1)
        u2 = F.silu(self.up2(u2))
        
        # Output
        out = self.out(torch.cat([u2, d1], dim=1))
        return out

Here’s what happened in this:

1. Time Embedding (time_mlp):

This module converts the time-step (t) into a higher-dimensional embedding (t_emb) that can be used throughout the network.
It helps the model understand at which step of the diffusion process it is operating.

2. Downsampling Path:

down1, down2, down3: These are convolutional layers that increase the number of channels (4 → 64 → 128 → 256). down1 keeps the spatial size, while down2 and down3 each halve it by using a stride of 2.
They help extract features from the input image at different scales.

3. Cross-Attention Layer (cross_attn):

This layer integrates the text embeddings with the image features.
It is applied at the lowest resolution of the network and lets each spatial location attend to the most relevant text tokens.

4. Middle Layers (mid):

These layers process the features after downsampling and cross-attention.
They further refine the features before upsampling.

5. Upsampling Path:

up1, up2: These are transposed convolutional layers that increase the spatial dimensions while reducing the number of channels.
They help restore the original image size.

6. Output Layer (out):

This layer produces the final output of the U-Net, which is used in the diffusion process to predict noise or denoise the input.

Forward Pass:

Input: The network takes in an image (x), a time-step (t), and a text embedding.
Processing:

Time Embedding: Convert t into a higher-dimensional embedding.
Downsampling: Apply convolutional layers to reduce spatial dimensions.
Cross-Attention: Integrate text embeddings with image features.
Middle Layers: Process features.
Upsampling: Restore original spatial dimensions with skip connections.

Output: The predicted noise, with the same shape as the input latent.
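
The spatial sizes along the U-Net can be checked with the standard output-size formulas for convolutions and transposed convolutions. This is a plain-Python sketch, assuming the 64x64 input used later in training; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of nn.Conv2d along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding):
    """Output size of nn.ConvTranspose2d (no output_padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel

h = 64                                   # input resolution
d1 = conv_out(h, 3, 1, 1)                # down1: 64 channels
d2 = conv_out(d1, 3, 2, 1)               # down2: 128 channels
d3 = conv_out(d2, 3, 2, 1)               # down3: 256 channels
u1 = conv_transpose_out(d3, 4, 2, 1)     # up1: 128 channels
u2 = conv_transpose_out(u1, 4, 2, 1)     # up2: 64 channels
print(d1, d2, d3, u1, u2)                # 64 32 16 32 64
```

The sizes of u1 and d2, and of u2 and d1, match, which is what allows the skip connections to concatenate them along the channel dimension.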

Diffusion Model

The DiffusionModel is used for generating images conditioned on text. It uses the U-Net to predict noise at each step of the diffusion process, allowing it to iteratively refine the image based on textual descriptions.

Here’s the code for this class:


class DiffusionModel:
    def __init__(self, model, beta_start=1e-4, beta_end=0.02, timesteps=1000):
        self.model = model
        self.timesteps = timesteps
        
        # Linear noise schedule
        self.betas = np.linspace(beta_start, beta_end, timesteps, dtype=np.float32)
        self.alphas = 1 - self.betas
        self.alphas_cumprod = np.cumprod(self.alphas, dtype=np.float32)

        self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod).astype(np.float32)
        self.sqrt_one_minus_alphas_cumprod = np.sqrt(1 - self.alphas_cumprod).astype(np.float32)
        self.sqrt_recip_alphas = np.sqrt(1 / self.alphas).astype(np.float32)
        # alphas_cumprod shifted by one step (with 1.0 before step 0), so that index t lines up with timestep t
        self.alphas_cumprod_prev = np.append(1.0, self.alphas_cumprod[:-1]).astype(np.float32)
        self.posterior_variance = (self.betas * (1 - self.alphas_cumprod_prev) /
                                (1 - self.alphas_cumprod)).astype(np.float32)
        self.posterior_log_variance = np.log(np.maximum(self.posterior_variance, 1e-20)).astype(np.float32)
        self.posterior_mean_coef1 = (self.betas * np.sqrt(self.alphas_cumprod_prev) /
                                    (1 - self.alphas_cumprod)).astype(np.float32)
        self.posterior_mean_coef2 = ((1 - self.alphas_cumprod_prev) * np.sqrt(self.alphas) /
                                    (1 - self.alphas_cumprod)).astype(np.float32)

    
    def q_sample(self, x_0, t):
        """
        Forward diffusion process q(x_t | x_0).
        Apply noise to the initial image according to the diffusion schedule.
        """
        # Convert PyTorch tensor to NumPy if needed
        if isinstance(x_0, torch.Tensor):
            x_0_np = x_0.cpu().numpy()
        else:
            x_0_np = x_0
            
        # Generate random noise (float32, to match the schedule arrays)
        noise = np.random.randn(*x_0_np.shape).astype(np.float32)
        
        # Scale the signal and the noise according to the schedule
        mean = self.sqrt_alphas_cumprod[t][:, None, None, None] * x_0_np
        std = self.sqrt_one_minus_alphas_cumprod[t][:, None, None, None]
        x_t = mean + std * noise
        
        # Convert back to PyTorch tensor if needed
        if isinstance(x_0, torch.Tensor):
            return torch.from_numpy(x_t).to(x_0.device), torch.from_numpy(noise).to(x_0.device)
        return x_t, noise
    
    def p_mean_variance(self, x_t, t, text_embedding):
        """
        Calculate the parameters of the posterior distribution p(x_{t-1} | x_t).
        """
        # Use the model to predict noise
        if not isinstance(x_t, torch.Tensor):
            x_t = torch.from_numpy(x_t).float()
            convert_back = True
        else:
            x_t = x_t.float()
            convert_back = False
        
        text_embedding = text_embedding.float()
            
        t_tensor = torch.tensor(t, dtype=torch.float, device=x_t.device)
        
        # Predict noise with the model
        predicted_noise = self.model(x_t, t_tensor, text_embedding)
        predicted_noise = predicted_noise.float()
        
        # Calculate mean of the posterior
        batch_size = x_t.shape[0]
        posterior_mean = torch.zeros_like(x_t)
        posterior_log_variance = torch.zeros((batch_size, 1, 1, 1), device=x_t.device)
        
        for i in range(batch_size):
            coef1 = torch.tensor(self.posterior_mean_coef1[t[i]], device=x_t.device, dtype=torch.float32)
            coef2 = torch.tensor(self.posterior_mean_coef2[t[i]], device=x_t.device, dtype=torch.float32)

            sqrt_one_minus_alpha_cumprod = torch.tensor(self.sqrt_one_minus_alphas_cumprod[t[i]], device=x_t.device, dtype=torch.float32)
            sqrt_alpha_cumprod = torch.tensor(self.sqrt_alphas_cumprod[t[i]], device=x_t.device, dtype=torch.float32)

            # Calculate predicted x_0
            x_0_pred = (x_t[i] - sqrt_one_minus_alpha_cumprod * predicted_noise[i]) / sqrt_alpha_cumprod

            posterior_mean[i] = coef1 * x_0_pred + coef2 * x_t[i]

            # Convert posterior log variance explicitly
            posterior_log_variance[i] = torch.tensor(self.posterior_log_variance[t[i]], device=x_t.device, dtype=torch.float32)
        
        
        if convert_back:
            posterior_mean = posterior_mean.cpu().numpy()
            posterior_log_variance = posterior_log_variance.cpu().numpy()
            
        return posterior_mean, posterior_log_variance
    
    def p_sample(self, x_t, t, text_embedding):
        """
        Sample from p(x_{t-1} | x_t) using the reparameterization trick.
        """
        mean, log_var = self.p_mean_variance(x_t, t, text_embedding)
        
        # Sample using the reparameterization trick
        if isinstance(mean, np.ndarray):
            noise = np.random.randn(*mean.shape)
            std = np.exp(0.5 * log_var)
            x_t_1 = mean + std * noise
        else:
            noise = torch.randn_like(mean)
            std = torch.exp(0.5 * log_var)
            x_t_1 = mean + std * noise
            
        return x_t_1
    
    def p_sample_loop(self, shape, text_embedding, device="cpu"):
        """
        Generate a sample by iteratively sampling from p(x_{t-1} | x_t).
        """
        text_embedding = text_embedding.float()
        # Start from pure noise
        x_t = torch.randn(shape, device=device, dtype=torch.float32)
        
        # Iteratively denoise
        for t in reversed(range(self.timesteps)):
            print(f"Sampling timestep {t}/{self.timesteps}")
            t_batch = np.full(shape[0], t)
            x_t = self.p_sample(x_t, t_batch, text_embedding)
            
        return x_t
    
    def train_step(self, x_0, text_embedding, optimizer):
        """
        Perform a single training step.
        """
        optimizer.zero_grad()
        x_0 = x_0.float()
        # Sample a random timestep for each image
        batch_size = x_0.shape[0]
        t = np.random.randint(0, self.timesteps, size=batch_size)
        t_tensor = torch.tensor(t, dtype=torch.float, device=x_0.device)
        
        # Forward diffusion process (add noise)
        x_t, noise = self.q_sample(x_0, t)
        
        # Predict the noise using the model
        predicted_noise = self.model(x_t, t_tensor, text_embedding)
        # Make sure the prediction and the target noise are float32 tensors on the same device
        predicted_noise = predicted_noise.float()
        noise_tensor = noise if isinstance(noise, torch.Tensor) else torch.from_numpy(noise).to(x_0.device)
        noise_tensor = noise_tensor.float()
        
        loss = F.mse_loss(predicted_noise, noise_tensor)
        
        # The full variational bound (a sum of KL terms) is rarely implemented explicitly;
        # the simplified objective used in practice is the noise-prediction MSE above
        
        # Backpropagate and update weights
        loss.backward()
        
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        
        optimizer.step()
        
        return loss.item()

Breakdown:

a. Variables

model: The U-Net model used for predicting noise in the diffusion process.
beta_start, beta_end: Parameters defining the start and end of the noise schedule.
timesteps: The number of steps in the diffusion process.
betas: An array of noise variances at each step, linearly interpolated between beta_start and beta_end.
alphas: The fraction of the signal variance that is kept at each step, calculated as 1 - betas.
alphas_cumprod: The cumulative product of alphas, i.e. the fraction of the original image's signal variance that remains after t steps.
sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod: Square roots of alphas_cumprod and 1 - alphas_cumprod. They scale the clean image and the noise, respectively, in the forward process q(x_t | x_0).
posterior_variance: The variance of the true posterior q(x_{t-1} | x_t, x_0), which is used as the variance of the reverse step p(x_{t-1} | x_t). It is calculated from betas and alphas_cumprod.
posterior_log_variance: The logarithm of posterior_variance, used for numerical stability.
posterior_mean_coef1, posterior_mean_coef2: Coefficients used to calculate the mean of the posterior distribution.
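
To get a feel for these arrays, the sketch below recomputes the linear schedule with the same default values and prints how much signal and noise are present at a few timesteps. The names `signal_scale` and `noise_scale` are introduced here for readability.

```python
import numpy as np

timesteps = 1000
betas = np.linspace(1e-4, 0.02, timesteps)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

signal_scale = np.sqrt(alphas_cumprod)          # multiplies the clean image x_0
noise_scale = np.sqrt(1.0 - alphas_cumprod)     # multiplies the Gaussian noise

for t in (0, 250, 500, 999):
    print(t, round(float(signal_scale[t]), 4), round(float(noise_scale[t]), 4))
```

The squares of the two scales always add up to 1. At t = 0 the image is almost untouched, and by the last step it is almost pure noise.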

b. Components

1. Initialization (__init__):

model: The U-Net model used for predicting noise.
beta_start, beta_end: Parameters defining the noise schedule.
timesteps: The number of steps in the diffusion process.

2. Noise Schedule:

betas, alphas, alphas_cumprod: These define how noise is added at each step.
sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod: Scale the clean image and the noise in the forward process. They are also used to recover an estimate of x_0 from the predicted noise.

3. Forward Diffusion (q_sample):

Adds noise to an initial image (x_0) according to the diffusion schedule at time-step t.
Returns the noisy image (x_t) and the added noise.

4. Posterior Distribution (p_mean_variance):

Calculates the mean and variance of the posterior distribution p(x_{t-1} | x_t).
Uses the U-Net model to predict noise and compute these parameters.
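
The sketch below works through the posterior computation in NumPy, using the standard DDPM formulas indexed so that entry t belongs to timestep t. The array names mirror the class above, `alphas_cumprod_prev` is a helper array for the shifted schedule, and the true noise stands in for the U-Net prediction.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)
alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])   # value at t-1, with 1.0 before step 0

# Posterior mean = coef1 * x_0 + coef2 * x_t
coef1 = betas * np.sqrt(alphas_cumprod_prev) / (1.0 - alphas_cumprod)
coef2 = (1.0 - alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - alphas_cumprod)
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

# If the predicted noise equals the true noise, x_0 is recovered exactly from x_t
rng = np.random.default_rng(0)
x_0 = rng.normal(size=(4, 4))
noise = rng.normal(size=(4, 4))
t = 500
x_t = np.sqrt(alphas_cumprod[t]) * x_0 + np.sqrt(1.0 - alphas_cumprod[t]) * noise
x_0_pred = (x_t - np.sqrt(1.0 - alphas_cumprod[t]) * noise) / np.sqrt(alphas_cumprod[t])

print(np.allclose(x_0_pred, x_0))                 # True
print(coef1[0], coef2[0], posterior_variance[0])  # approximately 1, 0 and 0
```

At t = 0 the posterior collapses onto the predicted x_0: the mean puts all of its weight on x_0 and the variance is zero, so the last sampling step adds no noise.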

5. Sampling from Posterior (p_sample):

Samples from the posterior distribution using the reparameterization trick.
Generates x_{t-1} from x_t by taking the posterior mean and adding Gaussian noise scaled by the posterior standard deviation.

6. Iterative Sampling (p_sample_loop):

Starts with pure noise and iteratively samples from the posterior to generate an image.
Uses text embeddings to condition the generation process.

7. Training Step (train_step):

Performs a single training step by:

Sampling a random timestep for each image.
Adding noise to the images using forward diffusion.
Predicting the added noise using the U-Net model.
Computing the loss (MSE between predicted and actual noise).
Backpropagating and updating the model weights.
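
The five steps above can be sketched on a toy problem that trains in seconds. Everything here is a stand-in chosen for illustration: `toy_model` is a tiny MLP instead of the U-Net, the data is a single repeated 2-D point instead of images, there is no text conditioning, and the timestep is scaled to [0, 1) before being fed to the model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

timesteps = 100
betas = torch.linspace(1e-4, 0.02, timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Hypothetical stand-in for the U-Net: input is a noisy 2-D point plus the timestep
toy_model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-3)

def toy_train_step(x_0):
    optimizer.zero_grad()
    t = torch.randint(0, timesteps, (x_0.shape[0],))           # 1. random timestep per sample
    noise = torch.randn_like(x_0)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x_0 + (1.0 - a_bar).sqrt() * noise     # 2. forward diffusion
    t_input = (t.float() / timesteps).unsqueeze(-1)
    predicted = toy_model(torch.cat([x_t, t_input], dim=-1))    # 3. predict the noise
    loss = F.mse_loss(predicted, noise)                         # 4. MSE between predicted and true noise
    loss.backward()                                             # 5. backpropagate and update
    optimizer.step()
    return loss.item()

data = torch.tensor([[2.0, -1.0]]).repeat(256, 1)   # toy dataset: one repeated point
losses = [toy_train_step(data) for _ in range(300)]
print(losses[0], losses[-1])
```

The loss is noisy from step to step because every batch draws new timesteps and new noise, so it is more informative to compare averages over many steps than individual values.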

Text Encoder

The TextEncoder class is a neural network designed to process textual input (captions) and convert it into meaningful numerical representations (embeddings).

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=50257, embed_dim=768, max_seq_len=77):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Parameter(torch.zeros(1, max_seq_len, embed_dim))
        
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=embed_dim, 
                nhead=12, 
                dim_feedforward=4*embed_dim,
                batch_first=True
            ),
            num_layers=12
        )
        
    def forward(self, input_ids, attention_mask=None):
        # Token + Position embeddings
        input_ids = input_ids.long()
        embeddings = self.token_embedding(input_ids) + self.position_embedding[:, :input_ids.size(1), :]
        
        # Pass through transformer
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
            
        # Create causal mask for transformer
        seq_length = input_ids.size(1)
        causal_mask = torch.triu(torch.ones(seq_length, seq_length) * float('-inf'), diagonal=1).to(input_ids.device)
        
        # Pass through transformer
        output = self.transformer(embeddings, mask=causal_mask, src_key_padding_mask=~attention_mask.bool())
        
        return output

Components of TextEncoder:

1. Initialization (__init__)

The constructor initializes the components of the text encoder:

vocab_size: The size of the vocabulary, representing the number of unique tokens that can be embedded.
embed_dim: The dimensionality of token embeddings, determining the size of each embedding vector.
max_seq_len: The maximum length of input sequences (captions).
token_embedding: An embedding layer that maps token IDs to dense vectors of size embed_dim.
position_embedding: A learnable tensor that encodes positional information for each token in a sequence. This helps the model understand the order of tokens.
transformer: A stack of transformer encoder layers (12 layers in this case), which processes embeddings to capture complex relationships between tokens.

2. Forward Pass (forward)

This method processes input captions and generates embeddings:

1. Inputs:

input_ids: A tensor containing token IDs for a batch of captions.
attention_mask: An optional mask indicating which tokens are valid (not padding).

2. Token + Position Embeddings:

Each token ID is mapped to a dense vector using token_embedding.
Positional information is added using position_embedding.

3. Attention Mask:

If no attention mask is provided, it defaults to all ones (indicating all tokens are valid).
A causal mask is created so that each token attends only to itself and earlier tokens.

4. Transformer Processing:

The combined embeddings are passed through the transformer encoder layers, which process them using self-attention and feedforward networks.

5. Output:

The final output is a tensor containing embeddings for each token in the input sequence.
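
The combined effect of the causal mask and the padding mask is easiest to see on a tiny example. This NumPy sketch uses a made-up sequence of five positions where the last two are padding, and prints which key positions each query position is allowed to attend to.

```python
import numpy as np

seq_len = 5
attention_mask = np.array([1, 1, 1, 0, 0])        # last two positions are padding

# Causal part: position i may not look at positions j > i
causal_blocked = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
# Padding part: nobody may look at padded key positions
padding_blocked = np.broadcast_to(attention_mask == 0, (seq_len, seq_len))

allowed = ~(causal_blocked | padding_blocked)
print(allowed.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]]
```

Rows are query positions and columns are key positions. The padded rows still produce outputs, but those outputs are computed only from real tokens.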

Simple Tokenizer

The SimpleTokenizer class is a basic tokenizer designed to convert text (e.g., captions) into numerical tokens that can be understood by a neural network. It also handles padding, special tokens, and attention masks to prepare text inputs for models like the TextEncoder.

Here’s the code


class SimpleTokenizer:
    def __init__(self, vocab_size=50257, max_length=77):
        self.vocab_size = vocab_size
        self.max_length = max_length
        
    def encode(self, text, device="cpu"):
        tokens = []
        for char in text[:self.max_length-2]:  # Leave room for BOS/EOS
            # Simple hash function to map characters to token IDs
            token_id = (ord(char) * 17) % (self.vocab_size - 3) + 3  # Reserve 0,1,2 for special tokens
            tokens.append(token_id)
            
        # Add special tokens (0=PAD, 1=BOS, 2=EOS)
        tokens = [1] + tokens + [2]
        
        # Pad to max length, remembering how many real tokens there are
        num_real_tokens = len(tokens)
        padding = [0] * (self.max_length - num_real_tokens)
        tokens = tokens + padding
        
        # Create attention mask (1 for real tokens, 0 for padding)
        attention_mask = [1] * num_real_tokens + [0] * len(padding)
        
        # Convert to tensors (token IDs are integer indices, so use long)
        input_ids = torch.tensor(tokens[:self.max_length], dtype=torch.long, device=device)
        attention_mask = torch.tensor(attention_mask[:self.max_length], dtype=torch.long, device=device)
        
        return input_ids.unsqueeze(0), attention_mask.unsqueeze(0)  # Add batch dimension

Quick breakdown:

1. Initialization (__init__)

The constructor initializes the tokenizer with:

vocab_size: The size of the vocabulary, i.e., the number of unique token IDs available (default is 50,257).
max_length: The maximum length of the tokenized sequence (default is 77).

These parameters define the tokenizer’s constraints, such as how many tokens it can represent and how long a sequence can be.

2. Tokenization Method (encode)

This method converts a string of text into a numerical representation that includes:

Token IDs for each character in the text.
Special tokens for padding (0), beginning-of-sequence (1), and end-of-sequence (2).
An attention mask to indicate which tokens are valid (not padding).
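
Here is a plain-Python sketch of the same encoding scheme on a short string, with a small max_length so the output is easy to read. The helper names `char_to_token` and `encode_ids` are hypothetical, and lists are returned instead of tensors.

```python
def char_to_token(char, vocab_size=50257):
    """Character hash used by the tokenizer: IDs 0, 1 and 2 stay reserved."""
    return (ord(char) * 17) % (vocab_size - 3) + 3

def encode_ids(text, max_length=8):
    ids = [1] + [char_to_token(c) for c in text[:max_length - 2]] + [2]  # BOS ... EOS
    num_real = len(ids)
    ids = ids + [0] * (max_length - num_real)                             # PAD
    mask = [1] * num_real + [0] * (max_length - num_real)
    return ids, mask

ids, mask = encode_ids("hi")
print(ids)   # [1, 1771, 1788, 2, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Because this works character by character, the same character always maps to the same ID and words have no tokens of their own. That is enough for a from-scratch demo, but production text encoders typically use a learned subword vocabulary.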

Creating Dataset

The SubsetDataset class is a custom dataset loader designed to load paired image-caption data from a subset of the dataset we’re using. It ensures that each image has a corresponding caption and applies necessary preprocessing (e.g., resizing, normalization) to prepare the data for training.


class SubsetDataset(Dataset):
    def __init__(self, root_folder, img_size=64, transform=None):
        self.root_folder = root_folder
        self.images_folder = os.path.join(root_folder, "images")
        self.captions_folder = os.path.join(root_folder, "captions")
        
        # Set up the dataset items
        self.dataset_items = []
        
        # Find all image files
        image_files = []
        for ext in ['*.jpg', '*.jpeg', '*.png']:
            image_files.extend(glob.glob(os.path.join(self.images_folder, ext)))
        
        for image_path in image_files:
            image_filename = os.path.basename(image_path)
            image_id = os.path.splitext(image_filename)[0]  # Remove extension
            
            # Look for a corresponding caption file
            caption_path = os.path.join(self.captions_folder, f"{image_id}.txt")
            
            if os.path.exists(caption_path):
                with open(caption_path, 'r') as f:
                    caption = f.read().strip()
                
                self.dataset_items.append({
                    'image_path': image_path,
                    'caption': caption
                })
        
        print(f"Created dataset with {len(self.dataset_items)} matched image-caption pairs")
        
        # Set up transformation
        if transform is None:
            self.transform = transforms.Compose([
                transforms.Resize((img_size, img_size)),
                transforms.ToTensor(),
                transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])  # Scale to [-1, 1]
            ])
        else:
            self.transform = transform
    
    def __len__(self):
        return len(self.dataset_items)
    
    def __getitem__(self, idx):
        item = self.dataset_items[idx]
        
        # Load and transform image
        image = Image.open(item['image_path']).convert('RGB')
        if self.transform:
            image = self.transform(image)
        
        return {
            'image': image,
            'caption': item['caption']
        }

Quick breakdown of this code:

1. Initialization (__init__)

This method sets up the dataset by locating images and their captions, applying transformations, and preparing the dataset items.

root_folder: The directory containing the dataset. It expects two subfolders:
1. images: Contains image files (e.g., .jpg, .png).
2. captions: Contains text files with captions corresponding to the images (same base filename, with a .txt extension).
img_size: The size to which images will be resized (default is 64x64 pixels).
transform: A set of transformations applied to the images. If not provided, default transformations are used:
1. Resize the image to img_size.
2. Convert it to a tensor.
3. Normalize pixel values to the range [-1, 1].

2. Length Method (__len__)

Returns the total number of matched image-caption pairs in the dataset.
PyTorch’s DataLoader uses this to know how many samples there are when building batches.

3. Get Item Method (__getitem__)

This method retrieves a single item (image and caption) from the dataset based on its index (idx).
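
The folder layout and the matching rule can be tried out without any real images. The sketch below builds a throwaway folder with the standard library and pairs files by their base name, in the same spirit as the dataset class; `match_pairs` is a hypothetical helper, and the empty .jpg files are placeholders rather than valid images.

```python
import os
import tempfile

def match_pairs(root_folder):
    """Return (image_path, caption) pairs matched by base filename."""
    images_folder = os.path.join(root_folder, "images")
    captions_folder = os.path.join(root_folder, "captions")
    pairs = []
    for name in sorted(os.listdir(images_folder)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        caption_path = os.path.join(captions_folder, stem + ".txt")
        if os.path.exists(caption_path):   # images without a caption are skipped
            with open(caption_path, "r", encoding="utf-8") as f:
                pairs.append((os.path.join(images_folder, name), f.read().strip()))
    return pairs

# Build a throwaway dataset folder to show the expected layout
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "images"))
    os.makedirs(os.path.join(root, "captions"))
    for stem in ("0001", "0002"):
        open(os.path.join(root, "images", stem + ".jpg"), "wb").close()
    with open(os.path.join(root, "captions", "0001.txt"), "w", encoding="utf-8") as f:
        f.write("a red bucket on a beach\n")
    pairs = match_pairs(root)

print(len(pairs), pairs[0][1])   # 1 a red bucket on a beach
```

Only 0001.jpg has a caption file, so 0002.jpg is silently dropped. That is why the dataset class prints the number of matched pairs after scanning the folders.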

Training

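The train function is the main training loop for a text-conditioned diffusion model. It integrates all the components (dataset, tokenizer, U-Net, text encoder, and diffusion model) to train the model on a subset of the COCO dataset. The goal is to generate images conditioned on textual captions. Two simplifications are worth noting: there is no VAE, so the 4-channel “latents” are just the RGB image with an extra all-zero channel, and the text encoder stays at its random initialization (it runs under torch.no_grad() and only the U-Net parameters are passed to the optimizer).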


def train(
    root_folder="./coco_subset",
    num_epochs=50,
    batch_size=16,
    learning_rate=1e-4,
    img_size=64,
    timesteps=1000,
    device="cuda" if torch.cuda.is_available() else "cpu"
):
    # Initialize dataset and dataloader
    dataset = SubsetDataset(root_folder, img_size=img_size)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    
    unet = SimpleUNet(in_channels=4, out_channels=4, time_dim=256, text_dim=768).to(device)
    text_encoder = TextEncoder().to(device).eval()  # never trained here, so switch off dropout
    diffusion = DiffusionModel(unet, timesteps=timesteps)
    
    # Initialize optimizer
    optimizer = torch.optim.AdamW(unet.parameters(), lr=learning_rate)
    
    # Initialize tokenizer
    tokenizer = SimpleTokenizer()
    
    # Create output directory for results
    output_dir = os.path.join(root_folder, "training_results")
    os.makedirs(output_dir, exist_ok=True)
    
    # Training loop
    for epoch in range(num_epochs):
        print(f"Starting epoch {epoch+1}/{num_epochs}")
        epoch_loss = 0.0
        
        for batch in tqdm(dataloader, desc=f"Epoch {epoch+1}"):
            images = batch['image'].to(device).float()  # [B, 3, H, W]
            captions = batch['caption']  # List of strings
            
            latents = torch.cat([images, torch.zeros_like(images[:, :1], dtype=torch.float32)], dim=1)  # [B, 4, H, W]
            
            # Tokenize captions
            all_input_ids = []
            all_attention_masks = []
            
            for caption in captions:
                input_ids, attention_mask = tokenizer.encode(caption, device)
                all_input_ids.append(input_ids)
                all_attention_masks.append(attention_mask)
                
            input_ids = torch.cat(all_input_ids, dim=0)
            attention_masks = torch.cat(all_attention_masks, dim=0)
            
            # Encode text with the text encoder
            with torch.no_grad():  # Freeze text encoder during training
                text_embeddings = text_encoder(input_ids, attention_masks)
            
            # Train diffusion model
            loss = diffusion.train_step(latents, text_embeddings, optimizer)
            epoch_loss += loss
            print(f"Epoch {epoch+1}, Loss: {loss}")  # per-batch loss; epoch is 1-based like the other log lines
            
        # Print epoch statistics
        avg_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")
        
        # Save checkpoint
        if (epoch + 1) % 1 == 0 or epoch == num_epochs - 1:
            checkpoint_path = os.path.join(output_dir, f"diffusion_checkpoint_epoch_{epoch+1}.pt")
            torch.save({
                'epoch': epoch,
                'unet_state_dict': unet.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': avg_loss,
            }, checkpoint_path)
            print(f"Saved checkpoint to {checkpoint_path}")
            
            # Generate a sample image
            with torch.no_grad():
                sample_text = "a photograph of plastic bucket"
                sample_ids, sample_mask = tokenizer.encode(sample_text, device)
                sample_embedding = text_encoder(sample_ids, sample_mask)
                
                shape = (1, 4, img_size, img_size)
                sample = diffusion.p_sample_loop(shape, sample_embedding, device)
                
                # Convert to image (assuming the first 3 channels are RGB)
                sample_image = sample[0, :3].permute(1, 2, 0).cpu().numpy()
                sample_image = (sample_image + 1) / 2.0  # Scale from [-1, 1] to [0, 1]
                sample_image = np.clip(sample_image, 0, 1)
                
                # Save sample image
                sample_path = os.path.join(output_dir, f"sample_epoch_{epoch+1}.png")
                sample_pil = Image.fromarray((sample_image * 255).astype(np.uint8))
                sample_pil.save(sample_path)
                print(f"Saved sample image to {sample_path}")
    
    print("Training completed!")
    return unet, text_encoder, diffusion

train()

Quick breakdown:

1. Dataset and DataLoader:

Loads paired image-caption data using SubsetDataset.
Prepares batches of data with DataLoader.

2. Model Initialization:

U-Net (SimpleUNet): Predicts noise during the denoising process.
Text Encoder (TextEncoder): Converts captions into embeddings.
Diffusion Model (DiffusionModel): Handles the forward and reverse diffusion processes.

3. Optimizer and Tokenizer:

Uses AdamW optimizer for U-Net training.
SimpleTokenizer converts captions into token IDs and attention masks.

4. Training Loop:

For each epoch:
1. Batch Processing: Loads images and captions, generates latents, tokenizes captions, and encodes them into embeddings.
2. Train Diffusion Model: Adds noise to latents and trains U-Net to predict the noise using MSE loss.
3. Log Loss: Tracks and prints average loss per epoch.

5. Checkpoint Saving:

Saves U-Net weights, optimizer state, and loss after every epoch.

6. Sample Image Generation:

After each checkpoint, generates an image from random noise conditioned on a sample caption (e.g., "a photograph of plastic bucket").
Saves generated images as .png files for visual inspection.
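
diffusion.train_step is not shown in this part of the article, so here is a rough sketch of what "add noise, then predict it with an MSE loss" amounts to, using plain Python lists instead of tensors. The names noisy_sample, mse and training_loss, and the stand-in predict_noise callable, are assumptions made for illustration and are not taken from the repository.

```python
import math
import random

def noisy_sample(x0, alpha_bar_t, eps):
    """Closed-form forward step: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * e for x, e in zip(x0, eps)]

def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def training_loss(x0, alpha_bar_t, predict_noise, rng):
    """Noise x0, ask the (hypothetical) model for the noise, compare with the truth."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = noisy_sample(x0, alpha_bar_t, eps)
    eps_hat = predict_noise(x_t)
    return mse(eps_hat, eps)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]  # made-up "image" with four pixels
# An untrained stand-in that always predicts zero noise.
loss_zero_model = training_loss(x0, 0.5, lambda x_t: [0.0] * len(x_t), rng)
print(loss_zero_model)
```

In the real loop the stand-in would be the U-Net, the loss would be backpropagated, and the optimizer would take a step.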

Downloading Dataset

This is the only working dataset I could find, but feel free to use another one. If you do, share the results with me too!

from datasets import load_dataset
import os
from tqdm import tqdm

output_dir = "coco_subset"
os.makedirs(os.path.join(output_dir, "images"), exist_ok=True)
os.makedirs(os.path.join(output_dir, "captions"), exist_ok=True)

try:
    dataset = load_dataset("VikramSingh178/Products-10k-BLIP-captions", split="test")
    start_idx = 0

    for i, example in tqdm(enumerate(dataset), total=len(dataset)):
        global_idx = start_idx + i
        image = example["image"]  
        filename = f"{global_idx:08d}.jpg"

        if image is not None:
            image.save(os.path.join(output_dir, "images", filename))

        caption_field = "text"  # the original conditional picked "text" in both branches
        
        caption = example[caption_field]
        if isinstance(caption, list):
            caption = caption[0]

        with open(os.path.join(output_dir, "captions", f"{global_idx:08d}.txt"), "w", encoding="utf-8") as f:
            f.write(caption)
    
    print("Dataset downloaded successfully")
    
except Exception as e:
    print(f"Error: {e}")
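
The loop above writes a caption file even when image is None, so the two folders can drift out of sync. A small check like the following can catch that before training. It is only a sketch: find_unpaired is a hypothetical helper, and it assumes the images/ and captions/ layout with matching file stems that the script creates.

```python
import os

def find_unpaired(root):
    """Return (stems of images without a caption, stems of captions without an image)."""
    image_dir = os.path.join(root, "images")
    caption_dir = os.path.join(root, "captions")
    image_stems = {os.path.splitext(f)[0] for f in os.listdir(image_dir) if f.endswith(".jpg")}
    caption_stems = {os.path.splitext(f)[0] for f in os.listdir(caption_dir) if f.endswith(".txt")}
    return sorted(image_stems - caption_stems), sorted(caption_stems - image_stems)

# Only runs the report if the dataset folder already exists.
if os.path.isdir("coco_subset"):
    missing_captions, missing_images = find_unpaired("coco_subset")
    print(f"{len(missing_captions)} images lack captions, {len(missing_images)} captions lack images")
```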

Results

To generate meaningful images, we’d have to train this for far longer (on the order of 100k+ epochs), but for now I’ve trained it for just 8 epochs, since training is very slow. Here are some images our model generated:

I mean it’s pure noise, but still pretty cool!

Conclusion

So, this was it! There is a lot of room for improvement, but this is essentially how image generation models work. Hopefully you liked this article!

To share results or ask questions, reach out to me on X.

Code for this article: https://github.com/Atulit23/stable_diffusion_from_scratch

Stable Diffusion I — Mathematics Behind It

Void — Fri, 07 Mar 2025 12:31:52 GMT

What is Stable Diffusion?

Before starting, there are some topics we need to cover, so let’s start with them.

KL Divergence

KL divergence (Kullback-Leibler divergence) is a concept in information theory and probability that measures how different one probability distribution is from another.

Think of KL divergence as answering the question: How much information is lost if we approximate one probability distribution with another?

It quantifies how much one probability distribution Q differs from another probability distribution P. The more different they are, the higher the KL divergence.

Formula

The KL divergence between two distributions P and Q is given by:

$$D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Where:

P is the true distribution (what the data really looks like)
Q is the approximate distribution (what we’re trying to model)
x represents the possible outcomes

What does it actually mean?

If P(x) and Q(x) are identical, then every log term is log 1 = 0, so:

$$D_{KL}(P \parallel Q) = 0$$

If Q is very different from P, the divergence will be large.
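
To make this concrete, here is a minimal sketch of KL divergence for discrete distributions, using the standard definition in nats. The function name kl_divergence and the two example distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # a term with P(x) = 0 contributes nothing
        if q_x == 0.0:
            return math.inf  # Q rules out an outcome that P allows
        total += p_x * math.log(p_x / q_x)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # 0.0
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```

The last two values differ, which shows that KL divergence is not symmetric: swapping P and Q changes the answer.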

How is KL Divergence Used in Stable Diffusion?

In Stable Diffusion (and VAEs), KL divergence is used during the training phase to:

Push the encoded latent space close to a normal Gaussian distribution.
Ensure that the noise added during denoising follows the prior distribution.

The loss function often looks like:
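
As an illustration of the first point, assume a VAE-style encoder that outputs a mean μ_j and a variance σ_j² for each latent dimension j (an assumption about the setup, not something stated above). The KL term against a standard normal prior then has a well-known closed form, and it is typically added to a reconstruction loss with a weight β (also an assumed symbol):

```latex
D_{KL}\big(\mathcal{N}(\mu_j, \sigma_j^2) \,\|\, \mathcal{N}(0, 1)\big)
  = \tfrac{1}{2}\left(\mu_j^2 + \sigma_j^2 - 1 - \ln \sigma_j^2\right)

\mathcal{L} = \mathcal{L}_{\text{recon}}
  + \beta \sum_j D_{KL}\big(\mathcal{N}(\mu_j, \sigma_j^2) \,\|\, \mathcal{N}(0, 1)\big)
```

The KL term is zero exactly when μ_j = 0 and σ_j² = 1, which is what pushing the latent space toward a standard Gaussian means.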

1. Forward Diffusion Process

The forward process is like breaking down an ordered system into chaos.
We gradually add Gaussian noise over T timesteps.
At every step, the image becomes noisier.

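We define a Markovian forward diffusion that gradually adds Gaussian noise to an image x₀ over time t, creating a noised version xₜ. This process is modeled as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t) I\right)$$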

Where:

x₀ → The original clean image
xₜ → Noisy image at timestep t
αₜ→ Noise scaling factor at timestep t
N → Gaussian distribution

→ What is Markovian Forward Diffusion?

A Markovian process simply means that the future state depends only on the present state — not on the entire history.
Mathematically:

$$q(x_t \mid x_{t-1}, x_{t-2}, \dots, x_0) = q(x_t \mid x_{t-1})$$

So, the probability of xₜ only depends on the previous step xₜ₋₁, making the process memoryless.

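By leveraging the reparametrization trick, we can write the diffusion process directly as:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$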

→ What is reparametrization trick?

In the forward diffusion process, we start with an image x₀, and after T timesteps, it becomes pure noise:

$$x_T \sim \mathcal{N}(0, I)$$

Now the whole game is about reversing this process.
What we are actually training the model to do is predict the noise at each step:

$$\epsilon_\theta(x_t, t) \approx \epsilon$$

But the catch is, the whole reverse diffusion process is stochastic (random) because noise is involved at each step.
If the process is random, how can the model be trained through backpropagation?
You cannot directly backpropagate through random noise sampling because neural networks are deterministic — they can’t optimize parameters through pure randomness.

That’s Where the Reparameterization Trick Comes In:

Instead of sampling noise directly like this:

We now reparameterize the noise:

Where:

ϵθ(xₜ, t) is the deterministic neural network prediction.
z∼N(0,I) is the pure Gaussian noise.

Now rewrite the whole diffusion equation like this:

Now the entire process becomes differentiable.

The randomness is pushed into the separate z term, which does not depend on the model parameters.

Now, during backpropagation, only the parameters θ of ϵθ(xₜ, t) get updated, while the pure Gaussian noise z stays fixed.

→ Full Pipeline

Start with pure noise x_T∼N(0,I)
For each timestep t:

Predict the noise: ϵθ(xₜ, t)
Sample new noise: z∼N(0,I)
Reconstruct the clean image:

2. Reverse Diffusion Process

In the reverse process, we want to denoise the pure Gaussian noise step by step until we regenerate the original data x₀.

The Reverse Diffusion Equation

The forward diffusion process follows this Markov chain:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

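The Reverse Diffusion Process runs in the opposite direction, with each step modeled as a learned Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$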

Let’s break this down

a. The Reverse Mean μθ(xₜ, t)

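The reverse mean is:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$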

We start with the noisy sample xₜ
Predict what the noise should have been using the neural network
Subtract the noise to get a slightly denoised version
Scale everything back by the inverse diffusion factor

b. The Variance Σθ(xₜ, t)

The variance controls how much randomness we inject at each step.
For most diffusion models, it’s fixed as:

Where:

The model can also predict this variance, but most papers keep it fixed.

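Thus, the final reverse diffusion equation is:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I)$$

Here σₜ² I is the fixed variance Σθ(xₜ, t).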

Recap:

Start with random Gaussian noise
At each step, subtract the predicted noise (order from chaos)
Add a little bit of fresh Gaussian noise (to maintain stochasticity)
Repeat this for T steps
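
The recap can be sketched as a tiny one-dimensional sampler. Everything here is an assumption made for illustration: make_schedule uses a linear β schedule, σₜ² = βₜ is one common choice for the fixed variance, and oracle_eps stands in for the trained network ϵθ by returning the exact noise for a dataset whose every sample equals target.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus alpha_t = 1 - beta_t and the running product alpha_bar_t."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars

def oracle_eps(x_t, t, alpha_bars, target):
    """Stand-in for eps_theta: the exact noise when every data point equals `target`."""
    return (x_t - math.sqrt(alpha_bars[t]) * target) / math.sqrt(1.0 - alpha_bars[t])

def reverse_sample(T=50, target=0.7, seed=0):
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.gauss(0.0, 1.0)  # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = oracle_eps(x, t, alpha_bars, target)
        # Subtract the predicted noise and rescale (the reverse mean).
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alphas[t])
        # Add fresh noise on every step except the last one.
        z = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * z
    return x

print(reverse_sample())
```

Because the stand-in predictor is exact, the final value lands on target (up to floating-point error). A trained network only approximates the noise, so real samples are approximate too.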

3. Variational Lower Bound & Training Objective

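Given data x₀, we want to model the probability distribution:

$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$$

But directly maximizing this likelihood is intractable, because the integral runs over every possible noising trajectory x₁, …, x_T.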

Instead, we turn to Variational Inference — where we approximate this likelihood using a Variational Lower Bound (VLB).

The full training objective is maximizing the log-likelihood of the data:

$$\max_\theta\ \mathbb{E}_{x_0 \sim q(x_0)} \left[ \log p_\theta(x_0) \right]$$

But we can’t compute this directly.

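Instead, we derive a lower bound using Jensen’s Inequality:

$$\log p_\theta(x_0) = \log \mathbb{E}_{q(x_{1:T} \mid x_0)} \left[ \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right] \geq \mathbb{E}_{q(x_{1:T} \mid x_0)} \left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right]$$

This lower bound is called the ELBO (Evidence Lower Bound):

$$\text{ELBO} = \mathbb{E}_{q(x_{1:T} \mid x_0)} \left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right]$$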

Let’s break this down

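First, rewrite the joint probability of the entire Markov chain:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

And the forward process is:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$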

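Now the ELBO becomes:

$$\text{ELBO} = \mathbb{E}_q\left[\log p_\theta(x_0 \mid x_1)\right] - \sum_{t=2}^{T} \mathbb{E}_q\left[ D_{KL}\big(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t)\big) \right] - D_{KL}\big(q(x_T \mid x_0) \parallel p(x_T)\big)$$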

Now the ELBO is three separate losses:

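→ Reconstruction Loss

$$L_0 = -\mathbb{E}_q\left[\log p_\theta(x_0 \mid x_1)\right]$$

The main loss is the Kullback-Leibler Divergence between the true forward posterior and the predicted reverse distribution:

$$L_{t-1} = \mathbb{E}_q\left[ D_{KL}\big(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t)\big) \right]$$

Finally, at the last timestep we compute Prior Matching Loss:

$$L_T = D_{KL}\big(q(x_T \mid x_0) \parallel p(x_T)\big)$$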

But since both are Gaussian, this KL divergence has a closed form:

This is so genius because:

The model never predicts the image directly
It only predicts the noise at each step

That’s why the entire training objective is:

The full loss is:

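But the simple version (used in most papers) is:

$$L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t} \left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]$$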

Example Walkthrough of Stable Diffusion

1. Dataset & Input Preparation

Let’s say we have a dataset of cat images x∈R³ˣ²⁵⁶ˣ²⁵⁶

2. Forward Diffusion (Markovian Process)

We gradually add Gaussian noise to the image in T=1000 steps using the forward diffusion process:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$

where:

βₜ is the noise schedule (small noise at the beginning, larger noise at the end)
t is the time step
√(1 − βₜ) scales down the image
βₜI is the covariance of the Gaussian noise added at each step

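Instead of sampling one step at a time, we can directly jump to any time step t:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)$$

with:

$$\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$$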

Example:

Let’s say:

βₜ=0.01
t=100
x₀ is an image of a cat

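Then, taking βₜ to be the same at every step:

$$\bar{\alpha}_{100} = (1 - 0.01)^{100} \approx 0.366$$

The noisy image becomes:

$$x_{100} = \sqrt{0.366}\, x_0 + \sqrt{1 - 0.366}\, \epsilon \approx 0.605\, x_0 + 0.796\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$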
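
Assuming βₜ is the same at every step (the example does not say how it varies), the coefficients in front of x₀ and the noise can be computed directly. The helper name closed_form_coefficients is made up for this sketch.

```python
import math

def closed_form_coefficients(beta, t):
    """Signal and noise coefficients of x_t when beta is constant across steps."""
    alpha_bar = (1.0 - beta) ** t  # product of t identical (1 - beta) factors
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

signal, noise = closed_form_coefficients(beta=0.01, t=100)
print(f"x_t = {signal:.3f} * x_0 + {noise:.3f} * eps")  # x_t = 0.605 * x_0 + 0.796 * eps
```

So after 100 steps at this rate, roughly 60% of the original signal amplitude is left and the rest is noise.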

3. Reverse Diffusion Process (Model Training Objective)

Now the goal is to reverse the noise and recover the original image.

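We train a neural network ϵθ(xₜ, t) to predict the noise at each step:

$$\epsilon_\theta(x_t, t) \approx \epsilon$$

The denoising step is:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z$$

Where:

αₜ = 1 − βₜ, ᾱₜ is the product of α₁ through αₜ, σₜ is the fixed noise scale for step t, and z∼N(0,I)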

4. Training Objective (Variational Lower Bound)

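The network is trained to minimize the simplified loss derived from the evidence lower bound (ELBO):

$$L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t} \left[ \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2 \right]$$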

This means the model is trained to predict the noise added at each timestep.

Example Training Step

Let,

Image = Cat
x₀ = Cat Image
t = 500
βₜ = 0.02

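→ Forward Diffusion

$$x_{500} = \sqrt{\bar{\alpha}_{500}}\, x_0 + \sqrt{1 - \bar{\alpha}_{500}}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

→ Neural Network Prediction:

$$\hat{\epsilon} = \epsilon_\theta(x_{500}, 500)$$

→ Loss Function

$$L = \left\| \epsilon - \hat{\epsilon} \right\|^2$$

→ Gradient Descent

$$\theta \leftarrow \theta - \eta\, \nabla_\theta L$$

where η is the learning rate.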

After training, we generate images starting from pure Gaussian noise:

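Start with x_T∼N(0,I)
Sample iteratively:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I)$$

until t=0.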

Conclusion

To summarize:

Sample an image x₀ from the dataset
Randomly choose a timestep t∼U(1,T)
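Generate noisy image:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Predict the noise:

$$\hat{\epsilon} = \epsilon_\theta(x_t, t)$$

Compute Loss:

$$L = \left\| \epsilon - \hat{\epsilon} \right\|^2$$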

So, this was it! Hopefully you liked this. In the next article, we’ll code this up using PyTorch!

7 Machine Learning Projects in under 7 minutes

Void — Sun, 23 Feb 2025 10:48:25 GMT

What are we gonna do

In the previous blogs I covered all the mathematics required for Machine Learning, so in this blog we’re gonna be doing some Machine Learning projects before ultimately moving onto Deep Learning.

1. House Price Prediction — Regression (XGboost)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

data = pd.read_csv("./datasets/house-price.csv")  

data.drop(columns=['Id'], inplace=True)

data.fillna(data.median(numeric_only=True), inplace=True)
data.fillna("None", inplace=True) 

categorical_cols = data.select_dtypes(include=['object']).columns
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

X = data.drop(columns=['SalePrice'])
y = data['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"MAE: {mae}")
print(f"RMSE: {rmse}")

1. Load and Preprocess the Data

The dataset (house-price.csv) is loaded using pandas.
The Id column is removed since it’s just an identifier and doesn’t help in prediction.
Missing values:
Numerical columns are filled with their median.
Categorical columns are filled with "None", so no missing values remain.

2. Encode Categorical Variables

The script finds all categorical columns (text-based data).
It uses Label Encoding to convert them into numerical values since ML models can’t work directly with text.

3. Split the Data

The dataset is split into training (80%) and testing (20%) sets using train_test_split().
Features (X): All columns except SalePrice.
Target (y): The SalePrice column.

4. Scale the Features

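Standardization (StandardScaler) is applied to bring all numerical features to a similar scale. Tree-based models such as XGBoost are largely insensitive to feature scale, so this step matters mainly if you swap in a linear or distance-based model.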

5. Train the Model

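An XGBoost regressor (100 trees) is trained on the X_train and y_train data.

6. Evaluate the Model

Predictions on X_test are scored with Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).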
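
The script reports MAE and RMSE for its predictions. As a minimal sketch of what those two numbers measure, here they are computed by hand on three made-up house prices (the values are invented for illustration, not results from the dataset).

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average size of the error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Like MAE, but squaring first makes large misses count for more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [200_000, 150_000, 320_000]
y_pred = [210_000, 140_000, 290_000]
mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

RMSE comes out larger than MAE here because the single 30,000 miss is penalized more heavily once squared.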

2. Spam Email Classification (SVM)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

data = pd.read_csv("spam_dataset.csv")  

X = data["text"]
y = data["label"]

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_vectorized = vectorizer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

This code:

Loads a dataset (spam_dataset.csv) with text messages (text) and their labels (label).
Converts the text into numerical features using TF-IDF vectorization (removing stop words, limiting to 5000 features).
Splits the data into training (80%) and testing (20%) sets.
Trains an SVM classifier with a linear kernel.
Makes predictions on the test set and evaluates performance using accuracy and a classification report (precision, recall, F1-score).
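
To show what TF-IDF does to text, here is a bare-bones version that uses the plain tf × ln(N / df) weighting. This is a simplification for illustration: scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ. The function name tfidf and the three example messages are made up.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["win free prize now", "meeting at noon", "free lunch at noon"]
vectors = tfidf(docs)
for vector in vectors:
    print(vector)
```

Words that appear in only one message ("win", "prize") get a higher weight than words shared across messages ("free"), which is what lets a classifier pick up on distinctive vocabulary.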

3. Sentiment Analysis (LogisticRegression)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

data = pd.read_csv("./datasets/sentiment.csv", encoding="utf-8")  # assumed encoding; try "latin-1" if your copy fails to decode
data = data.dropna()

X = data["selected_text"]
y = data["sentiment"]

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_vectorized = vectorizer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

This code:

Loads a sentiment dataset (sentiment.csv), removes missing values.
Converts text (selected_text) into numerical features using TF-IDF vectorization (removing stop words, limiting to 5000 features).
Splits the data into training (80%) and testing (20%) sets.
Trains a Logistic Regression model on the training data.
Makes predictions on the test set and evaluates performance using accuracy and a classification report (precision, recall, F1-score).
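
The classification report lists precision, recall, and F1 for each class. A minimal by-hand version for one class looks like this. The label names and predictions below are invented for illustration and are not taken from sentiment.csv.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treating all other classes as 'negative'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "positive", "neutral", "negative", "negative"]
print(precision_recall_f1(y_true, y_pred, positive="positive"))  # (0.5, 0.5, 0.5)
```

Precision asks "of everything predicted positive, how much really was?", while recall asks "of everything that really was positive, how much did we find?".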

4. Stock Price Prediction (XGBoost)

import pandas as pd
import numpy as np
import yfinance as yf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error

ticker = "AAPL"
data = yf.download(ticker, start="2020-01-01", end="2024-01-01")

data["Return"] = data["Close"].pct_change()
data.dropna(inplace=True)

X = data[["Open", "High", "Low", "Volume", "Return"]]
y = data["Close"]

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

model = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=150, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"MAE: {mae}")
print(f"RMSE: {rmse}")

This code:

Downloads AAPL stock data from Yahoo Finance (2020–2024).
Calculates the daily return (Close price percentage change).
Uses Open, High, Low, Volume, and Return as input features (X) and Close price as the target (y).
Scales the input features using MinMaxScaler.
Splits the data into training (80%) and testing (20%) sets.
Trains an XGBoost regressor with 150 estimators.
Predicts stock prices and evaluates performance using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
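
One caveat with the script above: train_test_split shuffles the rows, so the model trains on days that come after some of the test days, and the same day's High and Low already bracket the Close it is asked to predict. For time series, a common alternative is to hold out the most recent rows instead. The helper below is a hypothetical sketch of that idea, not part of the original code.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows so that every test row comes after every training row."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    cut = int(len(rows) * (1.0 - test_fraction))
    return rows[:cut], rows[cut:]

days = list(range(10))  # stand-in for ten consecutive trading days
train_days, test_days = chronological_split(days)
print(train_days, test_days)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

Scores from a split like this are usually worse than from a shuffled split, but they are closer to what the model would achieve on genuinely unseen future prices.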

5. Movie Recommender System

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv('./datasets/movies.csv')

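# Assign the result back: calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df['overview'] = df['overview'].fillna("")
df['tagline'] = df['tagline'].fillna("")
df['original_title'] = df['original_title'].fillna("")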

df['combined'] = df['original_title'] + ' ' + df['overview'] + ' ' + df['tagline']

df['combined']

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df["combined"])

similarity_matrix = cosine_similarity(tfidf_matrix)


def recommend_movies(title, num_recommendations=5):
    if title not in df["title"].values:
        return "Movie not found in dataset."
    
    idx = df[df["title"] == title].index[0]
    similar_movies = list(enumerate(similarity_matrix[idx]))
    similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:num_recommendations+1]
    
    recommendations = [df.iloc[i[0]]["title"] for i in similar_movies]
    return recommendations

movie_title = "Inception"
recommended_movies = recommend_movies(movie_title)
print(f"Movies similar to {movie_title}: {recommended_movies}")

This code:

Loads movie data (movies.csv) and fills missing values.
Combines text features (title, overview, tagline) into a single string.
Converts text into vectors using TF-IDF, removing common words.
Computes cosine similarity, which measures how similar two movies are from the angle between their TF-IDF vectors.
Finds the most similar movies to a given title by sorting similarity scores.
Recommends the top 5 movies similar to "Inception".
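
Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. Here is a minimal sketch with three made-up three-dimensional vectors standing in for TF-IDF rows (real rows have thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

heist = [0.9, 0.1, 0.0]    # hypothetical movie heavy on the first term
dream = [0.8, 0.3, 0.0]    # similar vocabulary
romance = [0.0, 0.1, 0.9]  # mostly different vocabulary
print(cosine_similarity(heist, dream))
print(cosine_similarity(heist, romance))
```

The first score is close to 1 and the second close to 0, so a recommender sorting by this score would rank dream well above romance for someone who picked heist.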

6. Churn Prediction (Decision Trees)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

data = pd.read_csv("./datasets/churn.csv")
data = data.dropna()  

label_encoders = {}
categorical_cols = ["Gender", "Subscription Type", "Contract Length"]
for col in categorical_cols:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

X = data.drop(columns=["CustomerID", "Churn"])
y = data["Churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

This code:

Loads Data — Reads churn.csv and removes missing values.
Encodes Categorical Data — Converts Gender, Subscription Type, and Contract Length into numbers using Label Encoding.
Prepares Features & Labels — Drops CustomerID, sets "Churn" as the target variable.
Splits Data — Divides into 80% training and 20% testing sets.
Trains Model — Fits a Decision Tree Classifier on the training data.
Makes Predictions — Predicts churn for the test data.
Evaluates Performance — Computes accuracy and classification report.
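
Label encoding is simple enough to write by hand, which also makes its main caveat visible. The sketch below assigns codes in sorted order, as scikit-learn's LabelEncoder does. The subscription names are invented examples, not values read from churn.csv.

```python
def label_encode(values):
    """Map each distinct value to an integer, assigning codes in sorted order."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], classes

codes, classes = label_encode(["Premium", "Basic", "Standard", "Basic"])
print(codes)    # [1, 0, 2, 0]
print(classes)  # ['Basic', 'Premium', 'Standard']
```

Note that the codes follow alphabetical order, so "Premium" ends up between "Basic" and "Standard". Decision trees can still split on such codes, but models that treat the numbers as magnitudes may be misled, and one-hot encoding is often preferred for them.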

7. MNIST Digits Classification

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
import tensorflow as tf

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

X_train = X_train / 255.0
X_test = X_test / 255.0

model = XGBClassifier(eval_metric='mlogloss')  # use_label_encoder is deprecated in recent XGBoost releases, so it is dropped here
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

This code:

Loads MNIST Dataset — Imports the handwritten digits dataset (28x28 grayscale images labeled 0–9).
Reshapes Images — Flattens each 28x28 image into a 1D array of 784 pixels for model compatibility.
Normalizes Pixel Values — Divides by 255.0 to scale values between 0 and 1, improving training efficiency.
Initializes XGBoost Model — Uses XGBClassifier with mlogloss (multi-class log loss) for classification.
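
The reshape and normalize steps can be pictured on a tiny 2x2 "image". The helper flatten_and_scale is a made-up name for this sketch, and it works on nested lists rather than NumPy arrays.

```python
def flatten_and_scale(image, max_value=255.0):
    """Turn a 2D grid of pixel intensities into a flat list scaled to [0, 1]."""
    return [pixel / max_value for row in image for pixel in row]

tiny_image = [[0, 255], [51, 102]]
flat = flatten_and_scale(tiny_image)
print(flat)  # [0.0, 1.0, 0.2, 0.4]
```

A real MNIST image goes through the same two steps, just with 28 rows of 28 pixels, giving the 784 features per sample that the classifier sees.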

Conclusion

Hope you liked this and found it helpful! If you wanna understand these algorithms in depth, check out my previous blogs:

Also code for these projects: https://github.com/Atulit23/mini-ml-projects

Mathematics for Machine Learning — Theory & Implementation II

Void — Mon, 17 Feb 2025 14:23:11 GMT

Things this will cover
