<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Void on Medium]]></title>
        <description><![CDATA[Stories by Void on Medium]]></description>
        <link>https://medium.com/@atulit23?source=rss-f0f45706a5be------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*nDxPXklIBYvfGcbCeMK7Iw.jpeg</url>
            <title>Stories by Void on Medium</title>
            <link>https://medium.com/@atulit23?source=rss-f0f45706a5be------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 19 May 2026 09:05:40 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@atulit23/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Cancer — How does it work and how can we use AI to eliminate it]]></title>
            <link>https://medium.com/@atulit23/cancer-how-does-it-work-and-how-can-we-use-ai-to-eliminate-it-6fe7eedd351b?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/6fe7eedd351b</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[cancer]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Sun, 07 Dec 2025 10:21:36 GMT</pubDate>
            <atom:updated>2025-12-07T10:21:36.524Z</atom:updated>
            <content:encoded><![CDATA[<h3>Cancer — How does it work and how can we use AI to eliminate it</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/612/1*W6wKwmfJX_hrVwzu0_sIKA.jpeg" /></figure><h3>1. What exactly is cancer?</h3><h4>a. What Cancer Fundamentally Is</h4><p>Cancer is not a single disease but a failure mode of multicellular life. In a healthy body, cells follow strict rules: they divide only when needed, repair DNA damage, self-destruct when severely compromised, and cooperate with surrounding tissue. Cancer arises when mutations accumulate that disable these controls. The result is a population of cells that act selfishly, prioritizing their own survival and reproduction over the organism as a whole.</p><p>At its core, cancer is when the body’s control systems stop working together properly. It isn’t something foreign invading the body; it’s normal cells that have gone wrong and stopped following the rules.</p><h4>b. How Cancer Forms and Evolves</h4><p>Cancer develops through a gradual evolutionary process. DNA replication is highly accurate but imperfect, and errors accumulate over time through random chance, environmental exposure, chronic inflammation, or infection. Most mutations are harmless, but some enhance a cell’s ability to divide, survive stress, or evade control. Cells harboring these mutations gain a selective advantage and expand.</p><p>Over many years, this process creates many different kinds of cancer cells inside a single tumor. That variety is what makes cancer flexible, tough, and hard to destroy. Even treatment can make this worse by killing weaker cells and allowing resistant ones to survive and grow.</p><h4>c. When Cells Become Cancerous</h4><p>Cells don’t become cancer all at once. They change step by step. Early changes can cause unusual growth but not real danger. 
Later failures in DNA repair, cell death signals, and tissue limits allow true cancer behavior — spreading, invading other tissues, and avoiding the immune system.</p><p>Cancer appears only when many control systems break down at the same time. It is a system-wide failure, not just one bad gene.</p><h4>d. The Role of Proteins and Cellular Pathways</h4><p>Genes encode proteins, and proteins execute cellular decisions. Cancer-driving mutations alter the structure, activity, or expression of proteins that regulate growth, repair, and survival. Some proteins become stuck in “on” states that continuously signal division, while others that enforce limits or repair damage are disabled.</p><p>The true damage lies not in isolated proteins, but in the disruption of entire regulatory networks.</p><h4>e. Why Reversing DNA Mutations Is Not the Solution</h4><p>While it seems intuitive to “fix” cancer by reversing mutations, this approach is neither practical nor sufficient. By the time cancer exists, cells contain thousands of mutations and deeply altered regulatory states. Editing every relevant mutation in every malignant cell without error is currently infeasible.</p><p>More importantly, cancer behavior arises from altered system dynamics, not just corrupted DNA sequences. Correcting mutations alone would not restore normal cellular coordination. Effective treatment focuses on neutralizing cancer’s consequences, not rewriting its genome.</p><h4>f. Cancer-Specific Drug Design and Precision Therapy</h4><p>Designing drugs for specific cancers means targeting the weaknesses that cancer depends on to survive. If a tumor is mainly driven by a few key mutations, drugs aimed at those pathways can work very well. This approach is called precision oncology.</p><p>But one drug usually doesn’t work forever. Cancer adapts by using backup pathways or by letting resistant cells take over. 
So effective precision treatment uses combinations of therapies that shut down the cancer’s entire working network, not just one target.</p><h4>g. What 2–3 Drugs for One Cancer Must Do</h4><p>For a specific cancer, a small combination of drugs can theoretically approach complete elimination if they jointly close all survival and escape paths. One drug typically suppresses the dominant growth driver, another blocks resistance or backup signaling, and a third induces irreversible death or immune recognition.</p><p>Success depends on fully disabling all viable cancer states — dividing, dormant, stem-like, or migratory. If you miss even one of these states, it could mean a relapse.</p><h4>h. When Near-100% Elimination Is Possible</h4><p>Near-complete elimination can happen when a cancer is genetically simple, not very diverse, and depends on only a few pathways. In these cases, combination treatments can remove most cancer cells, and the immune system can handle what remains. This is often called a functional cure.</p><p>In more complex cancers, even the best drug combinations may fail because some cells are already resistant or hidden in hard-to-reach places.</p><p>Here, biology, not drug quality, limits what treatment can achieve.</p><h3>2. Role of AI in eliminating cancer</h3><h4>a. Core AI framing: cancer as a learnable dynamical system</h4><p>AI does not approach cancer as a static classification task. 
It treats cancer as a <strong>high-dimensional dynamical system</strong> whose future behavior depends on interventions.</p><p>Formally, AI models learn a transition function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/804/1*6ClyrNH24leVX2Dz_g81qA.png" /></figure><p>where</p><ul><li><strong>x_t</strong>: latent state of the tumor (clone composition, pathway activity, stress state)</li><li><strong>a_t</strong>: therapy action (which drugs, what dose, when)</li><li><strong>F</strong>: unknown nonlinear dynamics</li><li><strong>ϵ_t</strong>: biological noise + unmodeled effects</li></ul><p>Deep models approximate F, which enables simulating therapy trajectories before applying them in reality.</p><h4><strong>b. Representation learning: how AI “sees” cancer</strong></h4><p>Raw biological data (genomes, transcriptomes, proteomes, metabolomics) is not usable directly, so AI models must first learn <strong>compressed latent representations</strong>.</p><h4>Models used</h4><ul><li>Variational Autoencoders (VAEs)</li><li>Contrastive learning models</li><li>Graph Neural Networks (for signaling / gene interaction graphs)</li></ul><h4>Mathematical role</h4><p>These models learn embeddings z such that:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/804/1*Pk1AuXZCA5QdWJZanQoLfg.png" /></figure><p>where distance in latent space ≈ functional similarity.<br>Two cancer states far apart in z-space are expected to respond differently to drugs.</p><p>This representation step is critical because everything else depends on it.</p><h4>c. 
Modeling drug action as transformations in latent space</h4><p>Rather than modeling every protein interaction explicitly, AI treats drug action as <strong>state transformation</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/804/1*9sx3An-x5KslY3vDyG00TQ.png" /></figure><p>Deep neural operators learn Δz from perturbed data (drug screens, CRISPR, time-series assays).</p><p>This allows AI to answer:</p><ul><li>What state does the tumor move to after Drug A?</li><li>Which escape states become reachable?</li><li>Which states collapse entirely?</li></ul><p>Drugs are evaluated by their <strong>geometric impact</strong> on the cancer state manifold.</p><h4>d. Predicting resistance as reachable states</h4><p>Resistance is modeled as <strong>future reachable regions</strong> in latent space under therapy pressure.</p><p>AI uses:</p><ul><li>Sequence models (Transformers, RNNs) for temporal tumor evolution</li><li>Graph neural nets for pathway rewiring</li><li>Evolution-aware predictors trained on longitudinal data</li></ul><p>Note: <strong>Longitudinal data</strong> is data collected <strong>from the same person repeatedly over time</strong>, capturing <strong>how something changes</strong> instead of just what it looks like at one moment.</p><p>Mathematically, AI approximates a <strong>reachable set</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/804/1*gEV_iwwMLgjIrQycP445NQ.png" /></figure><p>A therapy fails if R contains any viable cancer state.<br>Cure requires: R collapses to empty or non-viable regions.</p><h4>e. 
Combination therapy as coverage optimization</h4><p>Selecting 2–3 drugs is a <strong>coverage problem over latent vulnerabilities</strong>.</p><p>Each drug induces a transformation vector Δz.<br>The combination aims to minimize the volume of viable latent space:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/804/1*NKYNziGbDKFd0aZ2v2zUDg.png" /></figure><p>AI searches for drug sets S such that:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/1*yFysgV4yMN6dF-NsRZzHTQ.png" /></figure><p>This is done using:</p><ul><li>Bayesian optimization</li><li>Policy gradient methods</li><li>Differentiable subset selection</li></ul><h4>f. Generative AI for drug creation</h4><ol><li><strong>Molecule generation</strong></li></ol><p>Models:</p><ul><li>Graph diffusion models</li><li>SMILES transformers</li><li>Energy-based models</li></ul><p>They learn a probability distribution:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/801/1*NzyNLZ9tfIRTrSruUKe68A.png" /></figure><p>where</p><ul><li>x: candidate molecule</li><li>c: desired properties (kill cancer state, low toxicity)</li></ul><p>Optimization occurs <strong>in latent space</strong> by gradient ascent on expected reward:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/1*esz2XsFLk4Z0eh4n4agjuA.png" /></figure><p>Reward models combine:</p><ul><li>predicted cancer kill</li><li>predicted resistance suppression</li><li>predicted toxicity</li></ul><h4>g. Closing the loop: Bayesian updating during treatment</h4><p>Cancer models are uncertain. 
AI maintains <strong>probabilistic beliefs</strong> over its own parameters.</p><p>Bayesian neural nets / ensemble methods maintain:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/1*QRRL5gcMoYwwtqqXRTU0XA.png" /></figure><p>As therapy progresses and real responses are observed, AI updates its posterior and adjusts policy accordingly.</p><p>This is crucial for elimination because static plans will fail.</p><h3>3. Lung Cancer — How AI Helps Find a Cure</h3><h4>1. Framing the Problem</h4><p>AI treats lung cancer as a <strong>dynamical system under intervention</strong>, not a static disease.<br>Goal: design <strong>2–3 cancer-specific drugs + schedules</strong> that eliminate all tumor states (sensitive + resistant) while staying under toxicity constraints.</p><h3>2. Target Discovery (What to Hit)</h3><p>AI analyzes lung-cancer-specific genomic and proteomic data to learn <strong>which pathways are essential</strong> for survival in each subtype (e.g., EGFR-mutant, KRAS-mutant).</p><p>Methods:</p><ul><li>Representation learning (autoencoders, contrastive models) to cluster tumors by biological mechanism.</li><li>Graph Neural Networks on signaling networks to identify <strong>non-bypassable dependencies</strong> and synthetic lethal targets.</li><li>Causal modeling to separate true drivers from correlations.</li></ul><p>Output: A short, ranked list of <strong>lung-cancer-specific vulnerabilities</strong>.</p><h3>3. 
Drug Design (What Molecules to Make)</h3><p>AI designs drugs <em>for those specific targets</em>, not generic cytotoxics.</p><p>Methods:</p><ul><li>Structure-based models (3D GNNs, equivariant networks) to predict binding to lung-cancer targets and resistance mutants.</li><li>Generative models (graph diffusion, transformers) to propose molecules optimized for:<br>• tumor kill<br>• resistance suppression<br>• low toxicity<br>• synthesizability</li></ul><p>Optimization happens in learned latent spaces, using multi-objective reward functions.</p><h3>4. Combination Therapy (2–3 Drugs, Not One)</h3><p>AI treats drug combos as a <strong>coverage problem</strong> over tumor states.</p><p>Each drug induces a transformation in latent cancer state:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/1*Am6cgN6PFLmJfxwdlA0a-Q.png" /></figure><p><strong>Combination seeks:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/1*qqtoMZo7QmRCRJL_bWFVzA.png" /></figure><p><strong>Where</strong>:</p><ul><li>R = reachable resistant states</li><li>g(z) = viability classifier (learned neural surrogate)</li></ul><p><strong>Methods</strong>:</p><ul><li>Latent-space modeling of how each drug transforms cancer state.</li><li>Optimization to select 2–3 drugs that together collapse all viable tumor states.</li><li>Explicit modeling of likely resistance paths to ensure they are closed preemptively.</li></ul><p>Output: <strong>Minimal drug sets</strong> with maximal elimination coverage.</p><h3>Thank you for reading, hope you liked it!</h3><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6fe7eedd351b" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Lorentz Transformations & Relativistic effects]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@atulit23/lorentz-transformations-relativistic-effects-7b86a1421509?source=rss-f0f45706a5be------2"><img src="https://cdn-images-1.medium.com/max/600/1*NMGJyMtgpcuAW5lXb1AsYw.png" width="600"></a></p><p class="medium-feed-snippet">When Newton was alive, the universe seemed simple. Space was just space, time was just time, and they ticked away independently of each&#x2026;</p><p class="medium-feed-link"><a href="https://medium.com/@atulit23/lorentz-transformations-relativistic-effects-7b86a1421509?source=rss-f0f45706a5be------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@atulit23/lorentz-transformations-relativistic-effects-7b86a1421509?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/7b86a1421509</guid>
            <category><![CDATA[physics]]></category>
            <category><![CDATA[special-relativity]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Sat, 30 Aug 2025 16:04:58 GMT</pubDate>
            <atom:updated>2025-08-30T16:04:58.576Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[How understanding our brains could be a key factor in achieving AGI?]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@atulit23/how-understanding-our-brains-could-be-a-key-factor-in-achieving-agi-d346753af767?source=rss-f0f45706a5be------2"><img src="https://cdn-images-1.medium.com/max/1453/1*Fvh7NNDPRq44fjEaTChLJQ.png" width="1453"></a></p><p class="medium-feed-snippet">What is AGI?</p><p class="medium-feed-link"><a href="https://medium.com/@atulit23/how-understanding-our-brains-could-be-a-key-factor-in-achieving-agi-d346753af767?source=rss-f0f45706a5be------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@atulit23/how-understanding-our-brains-could-be-a-key-factor-in-achieving-agi-d346753af767?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/d346753af767</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[neuroscience]]></category>
            <category><![CDATA[agi]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Tue, 24 Jun 2025 16:11:37 GMT</pubDate>
            <atom:updated>2025-06-24T16:11:37.189Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[AES Encryption I — How it works]]></title>
            <link>https://medium.com/@atulit23/aes-encryption-i-how-it-works-d93f8cc8193e?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/d93f8cc8193e</guid>
            <category><![CDATA[mathematics]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[encryption]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Fri, 18 Apr 2025 07:26:10 GMT</pubDate>
            <atom:updated>2025-04-18T07:26:10.526Z</atom:updated>
            <content:encoded><![CDATA[<h3>AES Encryption I — How it works</h3><blockquote><strong>What is AES?</strong></blockquote><p>AES (Advanced Encryption Standard) is a <strong>symmetric encryption algorithm</strong>, meaning the same key is used for both encryption and decryption. It’s widely used across the world (banks, government, apps like WhatsApp) because it’s fast, secure, and resistant to most attacks when used properly.</p><p>Before starting, let’s cover some of the background we’ll need!</p><blockquote><strong>Galois Field</strong></blockquote><p>A <strong>Galois Field</strong>, or <strong>finite field</strong>, is a set with a finite number of elements and two operations:</p><ul><li><strong>Addition</strong></li><li><strong>Multiplication</strong></li></ul><p>In AES, we use the field <strong>GF(2⁸)</strong>:</p><ul><li>It contains exactly <strong>256 elements</strong>, i.e., all possible 8-bit byte values (from 0x00 to 0xFF).</li><li>Each byte is treated as a polynomial whose coefficients are taken <strong>modulo 2</strong>, and products are reduced <strong>modulo an irreducible polynomial</strong>.</li></ul><p>So instead of doing regular math, you’re doing math on bytes that “wrap around” in a structured way.</p><h4>Why use GF(2⁸) in AES?</h4><p>Because bytes are <strong>8 bits</strong>, and every operation in AES is on <strong>bytes</strong>, like substitution, mixing, and XORing.<br> GF(2⁸) gives us:</p><ul><li>A <strong>structured mathematical system</strong> over bytes</li><li><strong>Invertible operations</strong> (important for decryption)</li><li><strong>Good diffusion and confusion</strong> properties (important for cryptographic strength)</li></ul><p>Now, let’s start!</p><blockquote><strong>Finite Field Arithmetic (GF(2⁸))</strong></blockquote><p>AES is based on operations over the <strong>Galois Field GF(2⁸)</strong>. In this field, addition is a bitwise XOR, and multiplication results are reduced modulo <strong>x⁸ + x⁴ + x³ + x + 1</strong> (an irreducible polynomial). 
Essentially, we treat bytes as elements of GF(2⁸), which means all values are between 0 and 255 (i.e., 8 bits).</p><h4>1 Addition (XOR)</h4><p>In GF(2⁸), <strong>addition</strong> is simply the <strong>XOR</strong> operation:</p><ul><li>Example:<br> 0x57 ⊕ 0x83 = 0xD4 (i.e., binary XOR between corresponding bits).</li></ul><h4>2 Multiplication</h4><p>Multiplication is performed modulo the irreducible polynomial <strong>x⁸ + x⁴ + x³ + x + 1</strong>. Multiplying two elements a and b in GF(2⁸) involves the following steps:</p><ol><li><strong>Multiply</strong> the elements as polynomials with coefficients modulo 2 (so partial products are combined with XOR, not ordinary addition).</li><li><strong>Reduce</strong> the result modulo the irreducible polynomial x⁸ + x⁴ + x³ + x + 1.</li></ol><ul><li>For example, multiplying 0x57 × 0x13 in GF(2⁸):<br>1. Start with binary representations:<br> 0x57 = 01010111, 0x13 = 00010011, so 0x13 = 0x10 ⊕ 0x02 ⊕ 0x01.<br>2. Multiply 0x57 by each of those powers of two, reducing modulo x⁸ + x⁴ + x³ + x + 1 whenever a doubling overflows 8 bits:<br> 0x57 × 0x01 = 0x57, 0x57 × 0x02 = 0xAE, 0x57 × 0x10 = 0x07.<br>3. XOR the partial products:<br> 0x57 ⊕ 0xAE ⊕ 0x07 = 0xFE.</li></ul><blockquote><strong>Key Expansion</strong></blockquote><p>The first step in AES is <strong>key expansion</strong>, where the <strong>original key</strong> (16 bytes for AES-128) is expanded into <strong>round keys</strong> that will be used in the rounds. AES-128 has 10 rounds and therefore needs 11 round keys: one for the initial AddRoundKey and one for each round.</p><h4>Key Expansion Steps:</h4><ol><li>The <strong>original key</strong> is divided into 4 words (each 4 bytes): <strong>W[0], W[1], W[2], W[3]</strong>.</li><li>To derive the first word of each new round key, apply the following transformation to the last word of the previous round key, then XOR the result with the first word of the previous round key:</li></ol><ul><li><strong>RotWord</strong>: Rotate the word’s four bytes left by one position.</li><li><strong>SubWord</strong>: Apply the <strong>S-box</strong> to each byte.</li><li><strong>XOR with Round Constant</strong>: XOR the result with a round constant specific to that round.</li></ul><blockquote><strong>Initial Round</strong></blockquote><p>The first round is slightly different from the others. 
In the <strong>Initial Round</strong>, the data undergoes <strong>AddRoundKey</strong> only. All subsequent rounds also apply <strong>SubBytes</strong>, <strong>ShiftRows</strong>, and <strong>MixColumns</strong> before AddRoundKey (the final round skips MixColumns).</p><h4>1 AddRoundKey</h4><p>Each byte in the state is XOR-ed with the corresponding byte of the round key.</p><p>Example:</p><ul><li><strong>Plain text (more on this later):</strong><br>54 77 6F 20 <br>4F 6E 65 20 <br>4E 69 6E 65 <br>54 77 6F 20</li><li><strong>Round Key:</strong><br>54 68 61 74 <br>73 20 6D 79 <br>4B 75 6E 67 <br>20 46 75 00</li><li><strong>XOR Result</strong><br>00 1F 0E 54 <br>3C 4E 08 59 <br>05 1C 00 02 <br>74 31 1A 20</li></ul><blockquote><strong>SubBytes (Substitution)</strong></blockquote><p>The <strong>SubBytes</strong> step applies a <strong>non-linear transformation</strong> to each byte using the <strong>S-box</strong> (substitution box). This is what provides <strong>confusion</strong>: it obscures the relationship between the key and the ciphertext.</p><h4>S-box (Substitution Box)</h4><p>The <strong>S-box</strong> is a pre-defined 16×16 lookup table that substitutes each byte with another byte. The high nibble of the input byte selects the row and the low nibble selects the column.</p><p>For example, using the <strong>S-box</strong>:</p><ul><li>0x57 → 0x5B</li><li>0x6F → 0xA8</li><li>0x20 → 0xB7</li></ul><blockquote><strong>ShiftRows</strong></blockquote><p>In the <strong>ShiftRows</strong> step, the rows of the state matrix are cyclically shifted. The amount of shift depends on the row:</p><ul><li><strong>Row 0</strong>: No shift.</li><li><strong>Row 1</strong>: Shift left by 1.</li><li><strong>Row 2</strong>: Shift left by 2.</li><li><strong>Row 3</strong>: Shift left by 3.</li></ul><p>For example, if the state matrix is:</p><p>63 A7 B5 09 <br>5F 9D 01 E2 <br>37 FE C6 1B <br>A9 B7 23 D1</p><p>It becomes:</p><p>63 A7 B5 09 <br>9D 01 E2 5F <br>C6 1B 37 FE <br>D1 A9 B7 23</p><blockquote><strong>MixColumns</strong></blockquote><p>In the <strong>MixColumns</strong> step, each column of the state is mixed. 
This is done by multiplying each column by a fixed matrix in <strong>GF(2⁸)</strong>.</p><p>The matrix used in <strong>MixColumns</strong> is:</p><p>| 02 03 01 01 |<br>| 01 02 03 01 |<br>| 01 01 02 03 |<br>| 03 01 01 02 |</p><p>Each column, treated as a 4-byte vector, is multiplied by this matrix in <strong>GF(2⁸)</strong>.</p><p>For example, suppose one column of the state is:</p><p>63 <br>9D <br>C6 <br>D1</p><p>The result after <strong>MixColumns</strong> is obtained by performing the matrix multiplication in <strong>GF(2⁸)</strong>.</p><p>Now, let’s do a dry run for this!</p><blockquote><strong>Dry Run</strong></blockquote><ul><li><strong>Block Size</strong>: 128 bits = 16 bytes</li><li><strong>Key Size</strong>: 128 bits = 16 bytes</li><li><strong>Rounds</strong>: 10</li><li><strong>Initial AddRoundKey + 9 rounds + Final round</strong></li></ul><p>We’ll use:</p><ul><li>A <strong>plaintext</strong> block: <strong>00112233445566778899aabbccddeeff</strong></li><li>A <strong>key</strong>: <strong>000102030405060708090a0b0c0d0e0f</strong></li></ul><p>We’ll trace encryption using <strong>hex</strong> to keep it byte-friendly.</p><p>I pulled a <strong>standard AES-128 test vector</strong> from the <a href="https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf">FIPS-197</a> document (the official AES specification). These values are used worldwide to test AES implementations for correctness.</p><p>When this <strong>16-byte (128-bit)</strong> plaintext is arranged into the <strong>State Matrix</strong> (column-wise), it looks like:</p><p>[ 00 44 88 cc ]<br>[ 11 55 99 dd ]<br>[ 22 66 aa ee ]<br>[ 33 77 bb ff ]</p><p>To encrypt text such as “hello”, you would first convert each character to its ASCII code and then to hex. So:</p><ol><li>h → 104 (ascii) → 68 (hex)</li><li>e → 101 (ascii) → 65 (hex)</li><li>l → 108 (ascii) → 6C (hex)</li><li>l → 108 (ascii) → 6C (hex)</li><li>o → 111 (ascii) → 6F (hex)</li></ol><h4>1. 
Key Expansion</h4><p>AES generates 11 round keys (each 16 bytes) from the initial key using rotations, substitution (S-box), and XOR with round constants (Rcon).<br> Let’s denote:</p><ul><li><strong>RoundKey0</strong> = original key</li><li><strong>RoundKey1</strong> to <strong>RoundKey10 </strong>= derived via key schedule</li></ul><p>The key <strong>000102030405060708090a0b0c0d0e0f</strong> splits into 4 words (W0–W3) of 4 bytes each:</p><ul><li><strong>W0 → 00 01 02 03</strong></li><li><strong>W1 → 04 05 06 07</strong></li><li><strong>W2 → 08 09 0a 0b</strong></li><li><strong>W3 → 0c 0d 0e 0f</strong></li></ul><h4>2. Generate W4 (First word of RoundKey1)</h4><p><strong>W4 = W0 ⊕ g(W3)</strong><br> Where g() is a special function that includes:</p><ol><li><strong>RotWord</strong>: rotate bytes of W3</li><li><strong>SubWord</strong>: apply S-box to each byte</li><li><strong>XOR with Rcon</strong></li></ol><p><strong>2.1: RotWord(W3)</strong></p><p>W3 = <strong>0c 0d 0e 0f</strong></p><p>Rotate left → <strong>0d 0e 0f 0c</strong></p><p><strong>2.2 SubWord</strong></p><p>Apply the AES S-box (using the substitutions from the standard AES S-box table):</p><ul><li>0D → D7</li><li>0E → AB</li><li>0F → 76</li><li>0C → FE</li></ul><p>So,</p><p>SubWord = D7 AB 76 FE</p><p><strong>2.3 XOR with Rcon[1]</strong></p><p>Rcon[1] = <strong>01 00 00 00</strong></p><p>So,</p><p>D7 ⊕ 01 = D6 <br>AB ⊕ 00 = AB <br>76 ⊕ 00 = 76 <br>FE ⊕ 00 = FE</p><p>→ g(W3) = D6 AB 76 FE</p><h4>3. Compute W4</h4><p>W4 = W0 ⊕ g(W3)<br> = 00 01 02 03 ⊕ D6 AB 76 FE<br> = D6 AA 74 FD</p><h4>4. Compute W5</h4><p>W5 = W4 ⊕ W1<br> = D6 AA 74 FD ⊕ 04 05 06 07<br> = D2 AF 72 FA</p><h4>5. Compute W6</h4><p>W6 = W5 ⊕ W2<br> = D2 AF 72 FA ⊕ 08 09 0A 0B<br> = DA A6 78 F1</p><h4>6. Compute W7</h4><p>W7 = W6 ⊕ W3<br> = DA A6 78 F1 ⊕ 0C 0D 0E 0F<br> = D6 AB 76 FE</p><p>W4–W7 form <strong>RoundKey1</strong>.</p><p>We can keep going in the same way: W8–W11 form RoundKey2, and so on up to W43.</p><h4>7. 
Initial AddRoundKey</h4><p>We XOR the state with RoundKey0 (which is the original key):</p><p>[ 00⊕00 44⊕04 88⊕08 cc⊕0c ] = [ 00 40 80 c0 ]<br>[ 11⊕01 55⊕05 99⊕09 dd⊕0d ] = [ 10 50 90 d0 ]<br>[ 22⊕02 66⊕06 aa⊕0a ee⊕0e ] = [ 20 60 a0 e0 ]<br>[ 33⊕03 77⊕07 bb⊕0b ff⊕0f ] = [ 30 70 b0 f0 ]</p><p>→ New State after Initial AddRoundKey:</p><p>[ 00 40 80 c0 ]<br>[ 10 50 90 d0 ]<br>[ 20 60 a0 e0 ]<br>[ 30 70 b0 f0 ]</p><h4><strong>8. Round 1 (of 10 total rounds)</strong></h4><p>We now go through the 4 main steps of a typical AES round:</p><ol><li><strong>SubBytes</strong></li><li><strong>ShiftRows</strong></li><li><strong>MixColumns</strong></li><li><strong>AddRoundKey</strong></li></ol><ul><li>Current State (after Initial AddRoundKey)</li><li>[ 00 40 80 C0 ]<br>[ 10 50 90 D0 ]<br>[ 20 60 A0 E0 ]<br>[ 30 70 B0 F0 ]</li></ul><p><strong>Step 1: SubBytes</strong></p><p>Using the standard AES S-box on each byte of the state:</p><p>(I am not going to make a long lookup table here; I am writing the new state directly.)</p><p>The state:</p><ul><li>[ 00 40 80 C0 ]<br>[ 10 50 90 D0 ]<br>[ 20 60 A0 E0 ]<br>[ 30 70 B0 F0 ]</li></ul><p>becomes:</p><ul><li>[ 63 09 CD BA ]<br>[ CA 53 60 70 ]<br>[ B7 D0 E0 E1 ]<br>[ 04 51 E7 8C ]</li></ul><p><strong>Step 2: ShiftRows</strong></p><p>We shift the rows <strong>leftwards</strong>, with each row shifted by its index:</p><ul><li>Row 0: no shift</li><li>Row 1: shift left by 1</li><li>Row 2: shift left by 2</li><li>Row 3: shift left by 3</li></ul><p>Row 0: 63 09 CD BA → 63 09 CD BA <br>Row 1: CA 53 60 70 → 53 60 70 CA <br>Row 2: B7 D0 E0 E1 → E0 E1 B7 D0 <br>Row 3: 04 51 E7 8C → 8C 04 51 E7</p><p>New state:</p><p>[ 63 09 CD BA ]<br>[ 53 60 70 CA ]<br>[ E0 E1 B7 D0 ]<br>[ 8C 04 51 E7 ]</p><p><strong>Step 3: MixColumns</strong></p><p>This step is the trickiest. 
Each column is transformed using matrix multiplication in GF(2⁸):</p><p>Each column is multiplied by:</p><p>| 02 03 01 01 |<br>| 01 02 03 01 |<br>| 01 01 02 03 |<br>| 03 01 01 02 |</p><p>Let’s do it for Column 0 (63, 53, E0, 8C):</p><p>We’ll use the standard Rijndael multiplication rules:</p><ul><li>Multiply by 1: same value</li><li>Multiply by 2: left-shift + conditional XOR with 0x1B if overflow</li><li>Multiply by 3: multiply by 2, then XOR with original</li></ul><p>The first output byte of Column 0 is (02 × 63) ⊕ (03 × 53) ⊕ E0 ⊕ 8C = C6 ⊕ F5 ⊕ E0 ⊕ 8C = 5F. For brevity, I’ll provide the remaining bytes directly (they match the FIPS-197 test vector):</p><p>The MixColumns step gives:</p><p>[ 5F 57 F7 1D ]<br>[ 72 F5 BE B9 ]<br>[ 64 BC 3B F9 ]<br>[ 15 92 29 1A ]</p><p><strong>Step 4: AddRoundKey</strong></p><p>Now XOR with <strong>RoundKey1</strong>, which we computed as:</p><ul><li>W4: D6 AA 74 FD</li><li>W5: D2 AF 72 FA</li><li>W6: DA A6 78 F1</li><li>W7: D6 AB 76 FE</li></ul><p>So, RoundKey1 (column-wise):</p><p>[ D6 D2 DA D6 ]<br>[ AA AF A6 AB ]<br>[ 74 72 78 76 ]<br>[ FD FA F1 FE ]</p><p>Do XOR column by column:</p><p>Column 0:</p><p>5F ⊕ D6 = 89 <br>72 ⊕ AA = D8 <br>64 ⊕ 74 = 10 <br>15 ⊕ FD = E8</p><p>Column 1:</p><p>57 ⊕ D2 = 85 <br>F5 ⊕ AF = 5A <br>BC ⊕ 72 = CE <br>92 ⊕ FA = 68</p><p>Column 2:</p><p>F7 ⊕ DA = 2D <br>BE ⊕ A6 = 18 <br>3B ⊕ 78 = 43 <br>29 ⊕ F1 = D8</p><p>Column 3:</p><p>1D ⊕ D6 = CB <br>B9 ⊕ AB = 12 <br>F9 ⊕ 76 = 8F <br>1A ⊕ FE = E4</p><p>New State After Round 1:</p><p>[ 89 85 2D CB ]<br>[ D8 5A 18 12 ]<br>[ 10 CE 43 8F ]<br>[ E8 68 D8 E4 ]</p><p>We repeat exactly the same process for Rounds 2–9. Round 10 is also the same, except that MixColumns is not applied in that round.</p><p>Thus, to summarize:</p><p>From <strong>Round 1 to Round 9</strong>:</p><p>Apply these <strong>4 steps</strong>:</p><ol><li><strong>SubBytes</strong> (byte-wise substitution via S-box)</li><li><strong>ShiftRows</strong> (row-wise left shifts)</li><li><strong>MixColumns</strong> (matrix multiplication in GF(2⁸))</li><li><strong>AddRoundKey</strong> 
(XOR with round key)</li></ol><p>In <strong>Round 10</strong>:</p><p>Same as above <strong>except</strong>:</p><ul><li><strong>No MixColumns</strong></li><li>Only:</li><li>SubBytes</li><li>ShiftRows</li><li>AddRoundKey</li></ul><p>The <strong>final output</strong> after <strong>Round 10</strong> is your <strong>ciphertext</strong>.</p><blockquote><strong>Conclusion</strong></blockquote><p>So, this was AES, a sophisticated encryption algorithm with no known practical attack when used properly, and one that takes real care to implement correctly! In the next blog we’ll code it up in C++. Hope you liked it :D</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d93f8cc8193e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deep Equilibrium Models: Neural Networks Without Layers]]></title>
            <link>https://medium.com/@atulit23/deep-equilibrium-models-neural-networks-without-layers-4dd1b1095503?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/4dd1b1095503</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Wed, 02 Apr 2025 09:07:42 GMT</pubDate>
            <atom:updated>2025-04-02T09:07:42.089Z</atom:updated>
            <content:encoded><![CDATA[<blockquote><strong>What are Deep Equilibrium Models?</strong></blockquote><p>Deep Equilibrium Models (DEQs) redefine deep learning by replacing explicit layer-wise transformations with an <strong>implicit function</strong> that finds an equilibrium state.</p><blockquote><strong>Mathematical Formulation of DEQs</strong></blockquote><p>A standard deep neural network with <strong><em>L</em></strong> layers is defined recursively as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/1*ZoccqpELSk4XJwvzSI9nEQ.png" /></figure><p>where f is a transformation function (e.g., a ResNet block) and x is the input.</p><p>Instead of explicitly computing multiple layers, a DEQ finds a <strong>fixed point</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/1*JOrmvRGiPLZL1g1RKyGh_g.png" /></figure><p>Here, <strong>z*</strong> is the representation at equilibrium, meaning that applying <strong>f</strong> again does not change <strong>z*</strong>.</p><blockquote><strong>Finding the Equilibrium: Root-Finding Methods</strong></blockquote><p>The equilibrium equation <strong>z* = f(z*, x)</strong> can be rewritten as a root-finding problem:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/1*G-gEk8Vzh9n-qypMgLxFXw.png" /></figure><h4><strong>1.1 Fixed-Point Iteration</strong></h4><p>One way to solve this equation is through iterative updates:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/1*UPsfCcOfS6lydr796-StuQ.png" /></figure><p>This continues until <strong>zᵗ⁺¹ </strong>≈ <strong>zᵗ</strong>, meaning convergence has been reached. However, simple fixed-point iteration can be slow to converge, and it is not guaranteed to converge at all.</p><h4>1.2 Broyden’s Method (Quasi-Newton)</h4><p>To accelerate convergence, DEQs use <strong>Broyden’s method</strong>, an efficient <strong>quasi-Newton</strong> method that approximates the Jacobian inverse without explicitly computing it. 
It updates <strong>z*</strong> using:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/873/1*H7CtN7sd1ec00aisz5KkzQ.png" /></figure><p>where J=∂g/∂z​ is the Jacobian. Broyden’s method avoids expensive matrix inversions by iteratively approximating <strong>J⁻¹</strong>​, leading to faster convergence.</p><blockquote><strong>Training DEQs with Implicit Differentiation</strong></blockquote><p>Unlike traditional deep networks that require storing activations for backpropagation, DEQs train using <strong>implicit differentiation</strong>.</p><h4>1.1 Loss Function</h4><p>A DEQ is trained by defining a loss function L over the equilibrium state:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/873/1*ECwGbHiJlzwUkIzBPSVjPQ.png" /></figure><p>where y is the ground truth. The goal is to compute the gradient:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/855/1*8uzjgMk5Y3dckdreqbvNxQ.png" /></figure><p>directly <strong>without storing</strong> the full forward pass.</p><h4>1.2 Implicit Function Theorem</h4><p>The equilibrium condition satisfies:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/855/1*DNKqNna7_v7iIkt60GsZBg.png" /></figure><p>Taking the total derivative w.r.t. θ:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/855/1*W0ahAD5PDTm3vjHIUovfnQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/855/1*YQe5z5_CXw7gj8jyahWG6Q.png" /></figure><p>where J=∂f/∂z is the Jacobian. The gradient of the loss w.r.t. 
θ is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/855/1*Nm1RgUI-UmYAiMo4f-M4Yw.png" /></figure><p>which is computed by solving the linear system:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/855/1*G-ux_pRwLl7lWvLCAG5LDA.png" /></figure><p>for v, and then computing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/855/1*ekLXSvFSSpGp4A4lI0ywjw.png" /></figure><p>This allows <strong>backpropagation without storing</strong> intermediate activations, drastically reducing memory consumption.</p><blockquote><strong>Implementation in PyTorch</strong></blockquote><pre>import torch<br>import torch.nn as nn<br><br>class DEQFunction(torch.autograd.Function):<br>    @staticmethod<br>    def forward(ctx, f, z0, x, *params):<br>        # params are passed in only so that autograd routes gradients to them<br>        with torch.no_grad():<br>            z_star = z0<br>            for _ in range(20):<br>                z_next = f(z_star, x)<br>                converged = torch.norm(z_next - z_star) &lt; 1e-5<br>                z_star = z_next<br>                if converged:<br>                    break<br>        ctx.save_for_backward(z_star, x, *params)<br>        ctx.f = f<br>        return z_star<br><br>    @staticmethod<br>    def backward(ctx, grad_output):<br>        z_star, x, *params = ctx.saved_tensors<br>        # Re-apply f once at the fixed point so autograd can differentiate through it<br>        with torch.enable_grad():<br>            z = z_star.detach().requires_grad_()<br>            fz = ctx.f(z, x)<br>        # Solve (I - J^T) v = grad_output by iterating v = grad_output + J^T v,<br>        # using vector-Jacobian products instead of building J explicitly<br>        v = grad_output<br>        for _ in range(20):<br>            v_next = grad_output + torch.autograd.grad(fz, z, v, retain_graph=True)[0]<br>            converged = torch.norm(v_next - v) &lt; 1e-5<br>            v = v_next<br>            if converged:<br>                break<br>        # Gradients for the parameters of f; f, z0 and x receive None<br>        # (the gradient w.r.t. the input x is left out of this sketch)<br>        param_grads = torch.autograd.grad(fz, params, v)<br>        return (None, None, None, *param_grads)<br><br><br>class DEQModel(nn.Module):<br>    def __init__(self, hidden_dim):<br>        super().__init__()<br>        self.f_theta = nn.Sequential(<br>            nn.Linear(hidden_dim, hidden_dim),<br>            nn.ReLU(),<br>            nn.Linear(hidden_dim, hidden_dim)<br>        )<br><br>    def f(self, z, x):<br>        return self.f_theta(z + x)<br><br>    def forward(self, x):<br>        z0 = torch.zeros_like(x)<br>        return DEQFunction.apply(self.f, z0, x, *self.parameters())<br><br><br>model = DEQModel(hidden_dim=128)<br>x = torch.randn(32, 128) <br>z_star = 
model(x)</pre><p>Here’s what happens:</p><h4>1. DEQFunction (Custom Autograd Function)</h4><p>This class defines the <strong>fixed-point iteration</strong> and <strong>implicit differentiation</strong> for backpropagation.</p><p><strong>a. forward(ctx, f, z0, x)</strong></p><ul><li>Iteratively refines <strong><em>z </em></strong>until it converges to a fixed point (or the iteration limit is reached).</li><li>Stops when ‖z_next − z_star‖ falls below the tolerance.</li><li>Saves z_star and x for the backward pass.</li></ul><p><strong>b. backward(ctx, grad_output)</strong></p><ul><li>Computes gradients using <strong>implicit differentiation</strong> (bypassing full backprop through every iteration).</li><li>Solves a linear system involving the Jacobian J of f at z_star to obtain the gradients.</li></ul><h4>2. DEQModel (The Neural Network)</h4><p>This is the main <strong>equilibrium model</strong>.</p><p><strong>a. __init__(self, hidden_dim)</strong></p><ul><li>Defines <strong>f_theta</strong>, a small <strong>feedforward function</strong> (MLP with ReLU).</li></ul><p><strong>b. f(self, z, x)</strong></p><ul><li>Defines how the function f_theta uses z and x.</li><li>Uses z + x as the input to f_theta, so that both the current state and the input are taken into account.</li></ul><p><strong>c. forward(self, x)</strong></p><ul><li>Initializes z0 (the starting point) to zeros.</li><li>Calls DEQFunction.apply to solve for z_star.</li></ul><blockquote><strong>Conclusion</strong></blockquote><p>Deep Equilibrium Models (DEQs) are a cool new way of building AI models without stacking tons of layers: they keep applying a single block until its output reaches a stable state. This makes them very memory-efficient, and they have been applied to tasks like language processing and computer vision.</p><p>But they’re not perfect. Training them can be tricky, they sometimes struggle with stability, and debugging is harder since there aren’t clear layers to analyze. 
Still, if we can make them faster and more reliable, DEQs could change the future of AI by making models <strong>smarter instead of just bigger</strong>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[DreamFusion — A Method Used For 3D Model Generation]]></title>
            <link>https://medium.com/@atulit23/dreamfusion-a-method-used-for-3d-model-generation-fa5eddb92050?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/fa5eddb92050</guid>
            <category><![CDATA[mathematics]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[3d-modeling]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Tue, 01 Apr 2025 07:43:02 GMT</pubDate>
            <atom:updated>2025-04-01T07:43:02.331Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>DreamFusion</strong> — A Method Used For 3D Model Generation</h3><blockquote><strong>What is DreamFusion?</strong></blockquote><p>DreamFusion is an approach that generates 3D models from text prompts using a neural radiance field (NeRF) framework. The idea is to take a textual description (like “a red car in a desert”) and generate a detailed 3D model that can be viewed from any angle, making it a significant advancement in 3D generation.</p><p><strong>NeRF (Neural Radiance Fields)</strong> is a technique for representing 3D scenes using neural networks. It models a scene as a continuous volumetric function, learned by a neural network, that maps a 3D coordinate (x, y, z) and a viewing direction (θ, φ) to a color and a density at that point. High-quality views of the scene are then rendered from this function.</p><p>In this article, I’ll walk through the maths behind it!</p><blockquote><strong>How DreamFusion and NeRF Work Together</strong></blockquote><ul><li><strong>Text-to-Image to 3D Conversion</strong>: In the simplified picture used in this article, DreamFusion starts from a text-to-image model (the original paper uses the Imagen diffusion model) to obtain 2D image guidance from a given text prompt. This is similar to how models like DALL·E or MidJourney work, but DreamFusion pushes this further by using the NeRF method for 3D reconstruction.</li><li><strong>NeRF for 3D Scene Representation</strong>: Once the 2D image is available, the system employs NeRF to optimize a 3D scene representation. NeRF uses the pixel colors and their positions in 3D space to reconstruct the 3D geometry of the scene, including lighting, shadows, and texture effects.</li><li><strong>Optimization Process</strong>: DreamFusion leverages a differentiable renderer to generate 2D projections from 3D space and compare these with the target images. 
This feedback loop is used to refine the 3D scene generation in a way that is consistent with the initial text prompt.</li><li><strong>Final Output</strong>: The result is a 3D model that can be rotated and viewed from various angles, ready for applications like VR, AR, or 3D printing.</li></ul><p>Now, let’s get into the maths.</p><blockquote><strong>Text-to-Image Generation (Vision-Language Models)</strong></blockquote><ul><li>DreamFusion first converts the text prompt T into a 2D image I. This step is generally performed by a text-to-image model such as <strong>DALL·E</strong>, often paired with a text encoder such as <strong>CLIP</strong>, which maps text to images through a deep learning framework that leverages large amounts of training data. (I covered this in the last two blogs)</li><li>Let’s define the <strong>mapping function</strong> f:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/1*kr4BBBSvWQUxlGrMA-3-2A.png" /></figure><p>Here:</p><ul><li>T∈<strong>ℝᵈ</strong> represents the vectorized text prompt, where <strong><em>d</em></strong> is the dimensionality of the text embedding.</li><li>f: <strong>ℝᵈ → ℝᴴˣᵂˣ³</strong> is a function that maps the text vector to an image of height <strong><em>H</em></strong>, width <strong><em>W</em></strong>, and 3 color channels (RGB).</li></ul><p>The goal here is to produce a 2D image I that aligns with the textual description T.</p><blockquote><strong>Neural Radiance Fields (NeRF)</strong></blockquote><p>NeRF represents a 3D scene as a continuous volume in which the density and radiance are modeled at each point in space. 
A neural network learns to predict the radiance and density at each point, which is key to the 3D scene rendering process.</p><h4>2.1 Representation of 3D Space with NeRF</h4><p>In NeRF, each 3D point <strong><em>x</em></strong><em> </em>is represented as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/1*Z4onljZ6PUjLvgqH6Eyuzw.png" /></figure><p>We can define the neural network <strong><em>F</em></strong> as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/1*TRNpzj9MnhZGgPFk1ZJdOg.png" /></figure><p>This network takes the 3D position x = (x,y,z) as input and outputs:</p><ul><li><strong>Color</strong> c(x)=(R,G,B) at the point <strong><em>x</em></strong>.</li><li><strong>Density</strong> σ(x) at the point <strong><em>x.</em></strong></li></ul><h4>2.2 <strong>View-dependent Rendering</strong></h4><p>To simulate a 3D scene from a specific camera viewpoint, NeRF extends the 3D neural network <strong><em>F(x)</em></strong> by also considering the viewing direction <strong><em>d</em></strong>. 
The viewing direction corresponds to the angle from which the camera is observing the scene.</p><p>Let the viewing direction be represented by a unit vector:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/1*nMYlLhOGoEL9DcQGrbQnJQ.png" /></figure><p>This direction is incorporated into the network as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/1*3ROfPFxq2kQNob8FTGU_dA.png" /></figure><p>where <strong>c(x, d)</strong> is the color at <strong>x</strong> from direction <strong>d</strong> and <strong>σ(x, d)</strong> is the density at that point.</p><h4>2.3 Camera Rays and Volume Rendering</h4><p>The key operation in NeRF is <strong>volume rendering</strong>, which calculates the color of a pixel by integrating over all points along the ray cast from the camera into the scene.</p><p>For a camera ray, we parameterize the ray as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/1*C4j-Eqb0iLv9nSQrVFe5nA.png" /></figure><p>where:</p><ul><li><strong>o</strong> is the camera origin (position),</li><li><strong>d</strong> is the direction of the ray,</li><li>t∈[0,∞) is the distance along the ray.</li></ul><p>Each ray is cast through the 3D scene, and we compute the accumulated color using <strong>volume rendering</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*d9-E4kv_HNPrDaqEkydDSA.png" /></figure><p>where:</p><ul><li><strong>T(t)</strong> is the <strong>transmittance</strong> (the fraction of light that reaches the point <strong>t</strong> without being absorbed).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*3WnWjaLtBlLNmvKTfZDvhw.png" /></figure><ul><li>It models how light interacts with the medium and is changed based on the density at each point along the ray.</li><li>σ(r(t)) is the <strong>density</strong> at r(t) along the ray at distance t.</li><li>c(r(t),d) is the <strong>color</strong> emitted from the point r(t) in direction 
d.</li></ul><p>The integral accumulates the contributions of each point along the ray, combining both color and density to compute the final pixel color.</p><blockquote><strong>Optimizing the 3D Model</strong></blockquote><p>To align the 3D model with the original text prompt, DreamFusion optimizes the NeRF model to minimize the error between the rendered 3D views and the 2D image I generated from the text description.</p><h4>3.1 Loss Function for Optimization</h4><p>The optimization process involves minimizing a loss function L, which measures the difference between the rendered 3D view and the target 2D image.</p><p>The loss is typically a pixel-wise loss, such as <strong>mean squared error (MSE)</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*xwcSXMeOTvc8AiNXoTqF_Q.png" /></figure><p>where:</p><ul><li><strong>Îᵢ</strong>​ is the predicted color of pixel i in the rendered 3D image.</li><li><strong>Iᵢ</strong>​ is the corresponding pixel color from the 2D image I.</li><li><strong>N</strong> is the total number of pixels in the image.</li></ul><h4>3.2 <strong>Backpropagation</strong></h4><p>Now that we have the loss function, the goal is to <strong>minimize</strong> this loss by adjusting the parameters of the neural network <strong>F(x, d)</strong>, the weights of the network that predict the radiance and density. To do this, we need to compute the gradients of the loss function with respect to the parameters θ of the network.</p><p>The <strong>backpropagation</strong> process is used to compute these gradients. Backpropagation works by using the <strong>chain rule</strong> of calculus to propagate the error backward through the network. 
Here’s how it works in more detail:</p><ul><li><strong>Compute the Loss</strong>:<br> We first calculate the loss L between the predicted rendered image <strong>Îᵢ</strong>​ and the target image I.</li><li><strong>Compute the Gradient of the Loss</strong>:<br> The gradients of the loss L with respect to the network parameters θ are computed by propagating the error backward through the network. The gradient ∂L/∂θ​ tells us how to adjust the parameters of the network to reduce the error.</li><li><strong>Gradient Calculation Using the Chain Rule</strong>:<br> The error is propagated through each layer of the neural network. For the network <strong>F(x,d)</strong>, this involves calculating the gradient of the color and density predictions with respect to the parameters of the network.</li><li>For example, in the volume rendering process, the contribution of each point to the rendered pixel color <strong>C(r)</strong> depends on both the density <strong>σ(x) </strong>and the color <strong>c(x,d)</strong>. These quantities are outputs of the neural network, and we need to compute how changes in the network parameters affect the final rendered image.</li><li>The chain rule is used to compute how changes in the network weights affect the final loss. For each pixel i, we calculate the gradient of <strong>L </strong>with respect to the model parameters:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*rOdP7Y0HHCgrcPl68strpQ.png" /></figure><h4>3.3 Optimization</h4><p>Once we have the gradients, we use an optimization algorithm to adjust the parameters θ of the neural network. 
The most basic method is <strong>gradient descent</strong>, and <strong>Adam</strong> is a widely used variant of it.</p><p>In <strong>gradient descent</strong>, the parameters are updated as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*_bcH_caWVcVarGwIwldKkQ.png" /></figure><p>Where:</p><ul><li>θ<strong>ₜ</strong> is the parameter vector at iteration t,</li><li>η is the learning rate, which controls how much the parameters are adjusted in each step,</li><li>∂L/∂θ is the gradient of the loss with respect to the parameters.</li></ul><p>For <strong>Adam</strong> optimization, the updates are slightly more involved because it uses both momentum and adaptive learning rates for each parameter:</p><ul><li>Adam keeps running estimates of the <strong>first moment (mean)</strong> and <strong>second moment (uncentered variance)</strong> of the gradients and scales the parameter updates accordingly.</li></ul><blockquote><strong>Iterative Optimization Process</strong></blockquote><p>The above steps (rendering, loss calculation, backpropagation, and parameter updates) are repeated iteratively. In each iteration, the neural network improves its predictions by minimizing the error between the rendered 3D scene and the target 2D image.</p><p>As the optimization progresses:</p><ul><li>The network learns better representations of the color and density values in the scene.</li><li>Over time, the rendered images become more realistic and aligned with the 2D target image, eventually resulting in a 3D scene that matches the input prompt.</li></ul><blockquote><strong>Final 3D Model</strong></blockquote><p>After optimization, the output is a 3D model that can be rendered from any viewpoint. The 3D scene is stored as a continuous volumetric representation, and the resulting model can be rotated and viewed from any angle.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stable Diffusion II — Implementing It From Scratch in Python]]></title>
            <link>https://medium.com/@atulit23/stable-diffusion-ii-implementing-it-from-scratch-in-python-a646156414f8?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/a646156414f8</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[image-generator]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Thu, 20 Mar 2025 12:38:07 GMT</pubDate>
            <atom:updated>2025-03-20T12:38:07.003Z</atom:updated>
            <content:encoded><![CDATA[<h3>Stable Diffusion II — Implementing It From Scratch in Python</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/770/1*6EXwnBzTcvQ7WsEC0zMeKg.jpeg" /></figure><blockquote><strong>Quick Recap</strong></blockquote><p>Stable Diffusion is an AI model that creates images from text by starting with pure noise (like static) and gradually <strong>removing the noise step by step</strong> until the image matches the description. It learns how to turn noise into meaningful images using millions of examples, combining deep neural networks with attention mechanisms to understand both visual patterns and text prompts.</p><blockquote><strong>Things this will cover</strong></blockquote><p>In the previous article, we covered the mathematics behind Stable Diffusion, so in this article we’ll implement it in Python, building the model itself with just PyTorch &amp; NumPy!</p><blockquote><strong>Importing libraries</strong></blockquote><pre>import torch<br>from torch.utils.data import Dataset, DataLoader<br>import numpy as np<br>import os<br>from PIL import Image<br>import torchvision.transforms as transforms<br>from tqdm import tqdm<br>import json<br>import glob<br>import torch.nn as nn<br>import torch.nn.functional as F</pre><p>Here:</p><ul><li><strong>PyTorch</strong>: For neural network models and training.</li><li><strong>NumPy</strong>: For numerical computations.</li><li><strong>PIL (Pillow)</strong>: For image processing.</li><li><strong>torchvision.transforms</strong>: To apply transformations to images.</li><li><strong>tqdm</strong>: To show progress bars during training.</li><li><strong>glob, os, json</strong>: For file handling.</li></ul><blockquote><strong>Cross Attention</strong></blockquote><p>The CrossAttention class is a module used within the U-Net architecture to integrate textual information with image features. 
It allows the model to focus on specific parts of the image based on the text description.</p><p>Here is the code:</p><pre><br>class CrossAttention(nn.Module):<br>    def __init__(self, channels, text_dim, heads=8):<br>        super().__init__()<br>        self.heads = heads<br>        self.scale = (channels // heads) ** -0.5<br>        <br>        self.norm = nn.LayerNorm([channels])<br>        self.to_q = nn.Linear(channels, channels)<br>        self.to_k = nn.Linear(text_dim, channels)<br>        self.to_v = nn.Linear(text_dim, channels)<br>        self.to_out = nn.Linear(channels, channels)<br>        <br>    def forward(self, x, text_embedding):<br>        # Reshape spatial dimensions for attention<br>        x = x.float()<br>        text_embedding = text_embedding.float()<br>        <br>        b, c, h, w = x.shape<br>        x_flat = x.reshape(b, c, h * w).permute(0, 2, 1)  # [B, H*W, C]<br>        <br>        # Apply layer norm<br>        x_norm = self.norm(x_flat)<br>        <br>        # Project to queries, keys, values<br>        q = self.to_q(x_norm).reshape(b, h * w, self.heads, c // self.heads).permute(0, 2, 1, 3)  # [B, heads, H*W, C//heads]<br>        k = self.to_k(text_embedding).reshape(b, -1, self.heads, c // self.heads).permute(0, 2, 1, 3)  # [B, heads, seq_len, C//heads]<br>        v = self.to_v(text_embedding).reshape(b, -1, self.heads, c // self.heads).permute(0, 2, 1, 3)  # [B, heads, seq_len, C//heads]<br>        <br>        # Attention<br>        attn = torch.matmul(q, k.transpose(-1, -2)) * self.scale<br>        attn = F.softmax(attn, dim=-1)<br>        <br>        # Apply attention to values<br>        out = torch.matmul(attn, v).permute(0, 2, 1, 3).reshape(b, h * w, c)<br>        out = self.to_out(out).permute(0, 2, 1).reshape(b, c, h, w)<br>        <br>        # Residual connection<br>        return x + out</pre><p>Let’s understand what exactly happened here:</p><ol><li><strong>Initialization 
(__init__):</strong></li></ol><ul><li><strong>channels:</strong> The number of channels in the input image features.</li><li><strong>text_dim :</strong> The dimensionality of the text embeddings.</li><li><strong>heads:</strong> The number of attention heads, which helps in parallelizing attention computations.</li></ul><p><strong>2. Normalization and Linear Layers:</strong></p><ul><li><strong>norm:</strong> A layer normalization module to normalize the input features.</li><li><strong>to_q, to_k, to_v:</strong> Linear layers that project the input features and text embeddings into queries, keys, and values for attention.</li><li><strong>to_out:</strong> A linear layer to transform the output of the attention mechanism.</li></ul><p><strong>3. Forward Pass (forward):</strong></p><ul><li><strong>Input:</strong> The module takes in image features (<strong>x</strong>) and text embeddings (<strong>text_embedding</strong>).</li><li><strong>Processing:</strong></li></ul><ol><li><strong>Reshape and Normalize:</strong> Reshape the spatial dimensions of the image features and apply layer normalization.</li><li><strong>Project to Queries, Keys, Values:</strong> Transform the normalized features and text embeddings into queries, keys, and values.</li><li><strong>Attention Mechanism:</strong> Compute attention scores between queries and keys, apply softmax, and then apply these scores to values.</li><li><strong>Output Transformation:</strong> Transform the output of the attention mechanism using the <strong>to_out </strong>layer.</li><li><strong>Residual Connection:</strong> Add the original input (<strong>x</strong>) to the transformed output to maintain information flow.</li></ol><blockquote><strong>SimpleUNet Class</strong></blockquote><ul><li>The <strong>SimpleUNet </strong>is used within a diffusion model to predict noise added during the forward diffusion process.</li><li>It helps the model learn how to remove noise step-by-step, guided by text descriptions, allowing for 
text-conditioned image generation.</li></ul><pre>class SimpleUNet(nn.Module):<br>    &quot;&quot;&quot;<br>    Simplified U-Net architecture<br>    &quot;&quot;&quot;<br>    def __init__(self, in_channels=4, out_channels=4, time_dim=256, text_dim=768):<br>        super().__init__()<br>        <br>        # Time embedding<br>        self.time_mlp = nn.Sequential(<br>            nn.Linear(1, time_dim),<br>            nn.SiLU(),<br>            nn.Linear(time_dim, time_dim)<br>        )<br>        <br>        # Downsampling path<br>        self.down1 = nn.Conv2d(in_channels, 64, 3, padding=1)<br>        self.down2 = nn.Conv2d(64, 128, 3, padding=1, stride=2)<br>        self.down3 = nn.Conv2d(128, 256, 3, padding=1, stride=2)<br>        <br>        # Cross-attention layer<br>        self.cross_attn = CrossAttention(256, text_dim)<br>        <br>        # Middle<br>        self.mid = nn.Sequential(<br>            nn.Conv2d(256, 512, 3, padding=1),<br>            nn.SiLU(),<br>            nn.Conv2d(512, 256, 3, padding=1)<br>        )<br>        <br>        # Upsampling path<br>        self.up1 = nn.ConvTranspose2d(512, 128, 4, padding=1, stride=2)<br>        self.up2 = nn.ConvTranspose2d(256, 64, 4, padding=1, stride=2)<br>        self.out = nn.Conv2d(128, out_channels, 3, padding=1)<br>        <br>    def forward(self, x, t, text_embedding):<br>        # Make sure all inputs are float32<br>        x = x.float()<br>        t = t.float()<br>        text_embedding = text_embedding.float()<br>        # Time embedding<br>        t_emb = self.time_mlp(t.unsqueeze(-1))  # [B, time_dim]<br>        <br>        # Downsampling<br>        d1 = F.silu(self.down1(x))<br>        <br>        # Add time embedding to d1 (need to reshape t_emb to match d1&#39;s dimensions)<br>        t_emb_d1 = t_emb.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, d1.shape[2], d1.shape[3])<br>        t_emb_d1 = t_emb_d1[:, :d1.shape[1], :, :]  # Keep only the first d1.shape[1] channels of the embedding<br>        d1 = 
d1 + t_emb_d1<br>        <br>        d2 = F.silu(self.down2(d1))<br>        d3 = F.silu(self.down3(d2))<br>        <br>        # Cross-attention with text embeddings<br>        d3 = self.cross_attn(d3, text_embedding)<br>        <br>        # Middle<br>        mid = self.mid(d3)<br>        <br>        # Upsampling with skip connections<br>        u1 = torch.cat([mid, d3], dim=1)<br>        u1 = F.silu(self.up1(u1))<br>        u2 = torch.cat([u1, d2], dim=1)<br>        u2 = F.silu(self.up2(u2))<br>        <br>        # Output<br>        out = self.out(torch.cat([u2, d1], dim=1))<br>        return out</pre><p>Here’s what happens in this class:</p><ol><li><strong>Time Embedding (time_mlp):</strong></li></ol><ul><li>This module converts the time-step (t) into a higher-dimensional embedding (<strong>t_emb</strong>), which is added to the features after the first convolution.</li><li>It helps the model understand at which step of the diffusion process it is operating.</li></ul><p><strong>2. Downsampling Path:</strong></p><ul><li><strong>down1, down2, down3:</strong> These are convolutional layers that increase the number of channels. down1 keeps the spatial resolution, while down2 and down3 (stride 2) each halve it.</li><li>They help extract features from the input image at different scales.</li></ul><p><strong>3. Cross-Attention Layer (cross_attn):</strong></p><ul><li>This layer integrates the text embeddings with the image features.</li><li>It allows the model to focus on specific parts of the image based on the text description.</li></ul><p><strong>4. Middle Layers (mid):</strong></p><ul><li>These layers process the features after downsampling and cross-attention.</li><li>They further refine the features before upsampling.</li></ul><p><strong>5. Upsampling Path:</strong></p><ul><li><strong>up1, up2:</strong> These are transposed convolutional layers that increase the spatial dimensions while reducing the number of channels.</li><li>They help restore the original image size.</li></ul><p><strong>6. 
Output Layer (out):</strong></p><ul><li>This layer produces the final output of the U-Net: the predicted noise, which the diffusion process uses to denoise the input.</li></ul><h4>Forward Pass:</h4><ul><li><strong>Input:</strong> The network takes in a noisy image or latent (x), a time-step (t), and a text embedding.</li><li><strong>Processing:</strong></li></ul><ol><li><strong>Time Embedding:</strong> Convert t into a higher-dimensional embedding.</li><li><strong>Downsampling:</strong> Apply convolutional layers to reduce spatial dimensions.</li><li><strong>Cross-Attention:</strong> Integrate text embeddings with image features.</li><li><strong>Middle Layers:</strong> Process features.</li><li><strong>Upsampling:</strong> Restore original spatial dimensions with skip connections.</li></ol><ul><li><strong>Output:</strong> The predicted noise, with the same spatial size as the input.</li></ul><blockquote><strong>Diffusion Model</strong></blockquote><p>The <strong>DiffusionModel</strong> is used for generating images conditioned on text. 
It uses the U-Net to predict noise at each step of the diffusion process, allowing it to iteratively refine the image based on textual descriptions.</p><p>Here’s the code for this class:</p><pre><br>class DiffusionModel:<br>    def __init__(self, model, beta_start=1e-4, beta_end=0.02, timesteps=1000):<br>        self.model = model<br>        self.timesteps = timesteps<br>        <br>        # Linear noise schedule<br>        self.betas = np.linspace(beta_start, beta_end, timesteps, dtype=np.float32)<br>        self.alphas = 1 - self.betas<br>        self.alphas_cumprod = np.cumprod(self.alphas, dtype=np.float32)<br>        # alphas_cumprod shifted right by one step, with 1.0 prepended for t = 0<br>        self.alphas_cumprod_prev = np.append(1.0, self.alphas_cumprod[:-1]).astype(np.float32)<br><br>        self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod).astype(np.float32)<br>        self.sqrt_one_minus_alphas_cumprod = np.sqrt(1 - self.alphas_cumprod).astype(np.float32)<br>        self.sqrt_recip_alphas = np.sqrt(1 / self.alphas).astype(np.float32)<br>        # Posterior q(x_{t-1} | x_t, x_0): index t now holds the value for timestep t<br>        # (appending 0 at the end instead would shift every entry by one step and zero out the first sampling step)<br>        self.posterior_variance = (self.betas * (1 - self.alphas_cumprod_prev) / <br>                                (1 - self.alphas_cumprod)).astype(np.float32)<br>        self.posterior_log_variance = np.log(np.maximum(self.posterior_variance, 1e-20)).astype(np.float32)<br>        self.posterior_mean_coef1 = (self.betas * np.sqrt(self.alphas_cumprod_prev) /<br>                                    (1 - self.alphas_cumprod)).astype(np.float32)<br>        self.posterior_mean_coef2 = ((1 - self.alphas_cumprod_prev) * np.sqrt(self.alphas) /<br>                                    (1 - self.alphas_cumprod)).astype(np.float32)<br><br>    <br>    def q_sample(self, x_0, t):<br>        &quot;&quot;&quot;<br>        Forward diffusion process q(x_t | x_0).<br>        Apply noise to the 
initial image according to the diffusion schedule.<br>        &quot;&quot;&quot;<br>        # Convert PyTorch tensor to NumPy if needed<br>        if isinstance(x_0, torch.Tensor):<br>            x_0_np = x_0.cpu().numpy()<br>        else:<br>            x_0_np = x_0<br>            <br>        # Generate random noise<br>        noise = np.random.randn(*x_0_np.shape)<br>        <br>        # Apply noise according to schedule<br>        mean = self.sqrt_alphas_cumprod[t][:, None, None, None] * x_0_np<br>        var = self.sqrt_one_minus_alphas_cumprod[t][:, None, None, None]<br>        x_t = mean + var * noise<br>        <br>        # Convert back to PyTorch tensor if needed<br>        if isinstance(x_0, torch.Tensor):<br>            return torch.from_numpy(x_t).to(x_0.device), torch.from_numpy(noise).to(x_0.device)<br>        return x_t, noise<br>    <br>    def p_mean_variance(self, x_t, t, text_embedding):<br>        &quot;&quot;&quot;<br>        Calculate the parameters of the posterior distribution p(x_{t-1} | x_t).<br>        &quot;&quot;&quot;<br>        # Use the model to predict noise<br>        if not isinstance(x_t, torch.Tensor):<br>            x_t = torch.from_numpy(x_t).float()<br>            convert_back = True<br>        else:<br>            x_t = x_t.float()<br>            convert_back = False<br>        <br>        text_embedding = text_embedding.float()<br>            <br>        t_tensor = torch.tensor(t, dtype=torch.float, device=x_t.device)<br>        <br>        # Predict noise with the model<br>        predicted_noise = self.model(x_t, t_tensor, text_embedding)<br>        predicted_noise = predicted_noise.float()<br>        <br>        # Calculate mean of the posterior<br>        batch_size = x_t.shape[0]<br>        posterior_mean = torch.zeros_like(x_t)<br>        posterior_log_variance = torch.zeros((batch_size, 1, 1, 1), device=x_t.device)<br>        <br>        for i in range(batch_size):<br>            coef1 = 
torch.tensor(self.posterior_mean_coef1[t[i]], device=x_t.device, dtype=torch.float32)<br>            coef2 = torch.tensor(self.posterior_mean_coef2[t[i]], device=x_t.device, dtype=torch.float32)<br><br>            sqrt_one_minus_alpha_cumprod = torch.tensor(self.sqrt_one_minus_alphas_cumprod[t[i]], device=x_t.device, dtype=torch.float32)<br>            sqrt_alpha_cumprod = torch.tensor(self.sqrt_alphas_cumprod[t[i]], device=x_t.device, dtype=torch.float32)<br><br>            # Calculate predicted x_0<br>            x_0_pred = (x_t[i] - sqrt_one_minus_alpha_cumprod * predicted_noise[i]) / sqrt_alpha_cumprod<br><br>            posterior_mean[i] = coef1 * x_0_pred + coef2 * x_t[i]<br><br>            # Convert posterior log variance explicitly<br>            posterior_log_variance[i] = torch.tensor(self.posterior_log_variance[t[i]], device=x_t.device, dtype=torch.float32)<br>        <br>        <br>        if convert_back:<br>            posterior_mean = posterior_mean.cpu().numpy()<br>            posterior_log_variance = posterior_log_variance.cpu().numpy()<br>            <br>        return posterior_mean, posterior_log_variance<br>    <br>    def p_sample(self, x_t, t, text_embedding):<br>        &quot;&quot;&quot;<br>        Sample from p(x_{t-1} | x_t) using the reparameterization trick.<br>        &quot;&quot;&quot;<br>        mean, log_var = self.p_mean_variance(x_t, t, text_embedding)<br>        <br>        # Sample using the reparameterization trick<br>        if isinstance(mean, np.ndarray):<br>            noise = np.random.randn(*mean.shape)<br>            std = np.exp(0.5 * log_var)<br>            x_t_1 = mean + std * noise<br>        else:<br>            noise = torch.randn_like(mean)<br>            std = torch.exp(0.5 * log_var)<br>            x_t_1 = mean + std * noise<br>            <br>        return x_t_1<br>    <br>    def p_sample_loop(self, shape, text_embedding, device=&quot;cpu&quot;):<br>        &quot;&quot;&quot;<br>        Generate a sample by 
iteratively sampling from p(x_{t-1} | x_t).<br>        &quot;&quot;&quot;<br>        text_embedding = text_embedding.float()<br>        # Start from pure noise<br>        x_t = torch.randn(shape, device=device, dtype=torch.float32) <br>        # x_t = torch.randn(shape, device=device)<br>        <br>        # Iteratively denoise<br>        for t in reversed(range(self.timesteps)):<br>            print(f&quot;Sampling timestep {t}/{self.timesteps}&quot;)<br>            t_batch = np.full(shape[0], t)<br>            x_t = self.p_sample(x_t, t_batch, text_embedding)<br>            <br>        return x_t<br>    <br>    def train_step(self, x_0, text_embedding, optimizer):<br>        &quot;&quot;&quot;<br>        Perform a single training step.<br>        &quot;&quot;&quot;<br>        optimizer.zero_grad()<br>        x_0 = x_0.float()<br>        # Sample a random timestep for each image<br>        batch_size = x_0.shape[0]<br>        t = np.random.randint(0, self.timesteps, size=batch_size)<br>        t_tensor = torch.tensor(t, dtype=torch.float, device=x_0.device)<br>        <br>        # Forward diffusion process (add noise)<br>        x_t, noise = self.q_sample(x_0, t)<br>        <br>        # Predict the noise using the model<br>        predicted_noise = self.model(x_t, t_tensor, text_embedding)<br>        # Predict the noise using the model<br>        predicted_noise = predicted_noise.float()<br>        noise_tensor = noise if isinstance(noise, torch.Tensor) else torch.from_numpy(noise).to(x_0.device)<br>        noise_tensor = noise_tensor.float()<br>        <br>        loss = F.mse_loss(predicted_noise, noise_tensor)<br>        <br>        # KL divergence component is rarely implemented explicitly in practice<br>        # This is approximated by the MSE loss above in the diffusion objective<br>        <br>        # Backpropagate and update weights<br>        loss.backward()<br>        <br>        # gradient clipping<br>        
torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)<br>        <br>        optimizer.step()<br>        <br>        return loss.item()</pre><h4>Breakdown:</h4><p><strong>a. Variables</strong></p><ul><li><strong>model:</strong> The U-Net model used for predicting noise in the diffusion process.</li><li><strong>beta_start, beta_end:</strong> Parameters defining the start and end of the noise schedule.</li><li><strong>timesteps:</strong> The number of steps in the diffusion process.</li><li><strong>betas:</strong> An array of noise variances at each step, linearly interpolated between <strong>beta_start </strong>and <strong>beta_end</strong>.</li><li><strong>alphas:</strong> An array giving the fraction of the signal kept at each step, calculated as <strong>1 - betas</strong>.</li><li><strong>alphas_cumprod:</strong> The cumulative product of <strong>alphas</strong>, i.e. the fraction of the original image’s signal that survives up to each step.</li><li><strong>sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod:</strong> Square roots of <strong>alphas_cumprod </strong>and <strong>1 - alphas_cumprod</strong>. In the forward process q(x_t | x_0), the first scales the clean image and the second is the standard deviation of the added noise.</li><li><strong>posterior_variance:</strong> The variance used for the reverse step <strong>p(x_{t-1} | x_t)</strong>, calculated from <strong>betas </strong>and <strong>alphas_cumprod</strong>.</li><li><strong>posterior_log_variance:</strong> The logarithm of <strong>posterior_variance</strong>, clamped to a small minimum before taking the log for numerical stability.</li><li><strong>posterior_mean_coef1, posterior_mean_coef2:</strong> Coefficients that weight the predicted x_0 and the current x_t when calculating the mean of the posterior distribution.</li></ul><p><strong>b. 
Components</strong></p><ol><li><strong>Initialization (__init__):</strong></li></ol><ul><li><strong>model:</strong> The U-Net model used for predicting noise.</li><li><strong>beta_start, beta_end:</strong> Parameters defining the noise schedule.</li><li><strong>timesteps:</strong> The number of steps in the diffusion process.</li></ul><p><strong>2. Noise Schedule:</strong></p><ul><li><strong>betas, alphas, alphas_cumprod:</strong> These define how noise is added at each step.</li><li><strong>sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod:</strong> Scale the clean image and the noise, respectively, in the forward process q(x_t | x_0).</li></ul><p><strong>3. Forward Diffusion (q_sample):</strong></p><ul><li>Adds noise to an initial image (x_0) according to the diffusion schedule at time-step t.</li><li>Returns the noisy image (x_t) and the added noise.</li></ul><p><strong>4. Posterior Distribution (p_mean_variance):</strong></p><ul><li>Calculates the mean and log-variance of the posterior distribution <strong>p(x_{t-1} | x_t)</strong>.</li><li>Uses the noise predicted by the U-Net to estimate x_0, then combines that estimate with x_t to get the mean.</li></ul><p><strong>5. Sampling from Posterior (p_sample):</strong></p><ul><li>Samples from the posterior distribution using the reparameterization trick.</li><li>Generates <strong>x_{t-1}</strong> from <strong>x_t</strong> as mean + std * noise, where the noise is drawn from a standard Gaussian.</li></ul><p><strong>6. Iterative Sampling (p_sample_loop):</strong></p><ul><li>Starts with pure noise and iteratively samples from the posterior to generate an image.</li><li>Uses text embeddings to condition the generation process.</li></ul><p><strong>7. 
Training Step (train_step):</strong></p><ul><li>Performs a single training step by:</li></ul><ol><li>Sampling a random timestep for each image.</li><li>Adding noise to the images using forward diffusion.</li><li>Predicting the added noise using the U-Net model.</li><li>Computing the loss (MSE between predicted and actual noise).</li><li>Backpropagating and updating the model weights.</li></ol><blockquote><strong>Text Encoder</strong></blockquote><ul><li>The <strong>TextEncoder</strong> class is a neural network designed to process textual input (<strong>captions</strong>) and convert it into meaningful numerical representations (<strong>embeddings</strong>).</li></ul><pre>class TextEncoder(nn.Module):<br>    def __init__(self, vocab_size=50257, embed_dim=768, max_seq_len=77):<br>        super().__init__()<br>        self.token_embedding = nn.Embedding(vocab_size, embed_dim)<br>        self.position_embedding = nn.Parameter(torch.zeros(1, max_seq_len, embed_dim))<br>        <br>        self.transformer = nn.TransformerEncoder(<br>            nn.TransformerEncoderLayer(<br>                d_model=embed_dim, <br>                nhead=12, <br>                dim_feedforward=4*embed_dim,<br>                batch_first=True<br>            ),<br>            num_layers=12<br>        )<br>        <br>    def forward(self, input_ids, attention_mask=None):<br>        # Token + Position embeddings<br>        input_ids = input_ids.long()<br>        embeddings = self.token_embedding(input_ids) + self.position_embedding[:, :input_ids.size(1), :]<br>        <br>        # Pass through transformer<br>        if attention_mask is None:<br>            attention_mask = torch.ones_like(input_ids)<br>            <br>        # Create causal mask for transformer<br>        seq_length = input_ids.size(1)<br>        causal_mask = torch.triu(torch.ones(seq_length, seq_length) * float(&#39;-inf&#39;), diagonal=1).to(input_ids.device)<br>        <br>        # Pass through transformer<br>        
output = self.transformer(embeddings, mask=causal_mask, src_key_padding_mask=~attention_mask.bool())<br>        <br>        return output</pre><h4>Components of TextEncoder:</h4><p><strong>1. Initialization (__init__)</strong></p><p>The constructor initializes the components of the text encoder:</p><ul><li><strong>vocab_size:</strong> The size of the vocabulary, representing the number of unique tokens that can be embedded.</li><li><strong>embed_dim:</strong> The dimensionality of token embeddings, determining the size of each embedding vector.</li><li><strong>max_seq_len:</strong> The maximum length of input sequences (captions).</li><li><strong>token_embedding:</strong> An embedding layer that maps token IDs to dense vectors of size <strong>embed_dim</strong>.</li><li><strong>position_embedding:</strong> A learnable tensor that encodes positional information for each token in a sequence. This helps the model understand the order of tokens.</li><li><strong>transformer:</strong> A stack of transformer encoder layers (12 layers in this case), which processes embeddings to capture complex relationships between tokens.</li></ul><p><strong>2. Forward Pass (forward)</strong></p><p>This method processes input captions and generates embeddings:</p><ol><li><strong>Inputs:</strong></li></ol><ul><li><strong>input_ids</strong>: A tensor containing token IDs for a batch of captions.</li><li><strong>attention_mask</strong>: An optional mask indicating which tokens are valid (not padding).</li></ul><p><strong>2. Token + Position Embeddings:</strong></p><ul><li>Each token ID is mapped to a dense vector using <strong>token_embedding</strong>.</li><li>Positional information is added using <strong>position_embedding</strong>.</li></ul><p><strong>3. 
Attention Mask:</strong></p><ul><li>If no attention mask is provided, it defaults to all ones (indicating all tokens are valid).</li><li>A causal mask is created to ensure that each token only attends to itself and earlier tokens during processing.</li></ul><p><strong>4. Transformer Processing:</strong></p><ul><li>The combined embeddings are passed through the transformer encoder layers, which process them using self-attention and feedforward networks.</li></ul><p><strong>5. Output:</strong></p><ul><li>The final output is a tensor containing embeddings for each token in the input sequence.</li></ul><blockquote><strong>Simple Tokenizer</strong></blockquote><ul><li>The <strong>SimpleTokenizer</strong> class is a basic tokenizer designed to convert text (e.g., captions) into numerical tokens that can be understood by a neural network. It also handles padding, special tokens, and attention masks to prepare text inputs for models like the TextEncoder.</li></ul><p>Here’s the code:</p><pre><br>class SimpleTokenizer:<br>    def __init__(self, vocab_size=50257, max_length=77):<br>        self.vocab_size = vocab_size<br>        self.max_length = max_length<br>        <br>    def encode(self, text, device=&quot;cpu&quot;):<br>        tokens = []<br>        for char in text[:self.max_length-2]:  # Leave room for BOS/EOS<br>            # Simple hash function to map characters to token IDs<br>            token_id = (ord(char) * 17) % (self.vocab_size - 3) + 3  # Reserve 0,1,2 for special tokens<br>            tokens.append(token_id)<br>            <br>        # Add special tokens (0=PAD, 1=BOS, 2=EOS)<br>        tokens = [1] + tokens + [2]<br>        <br>        # Pad to max length<br>        padding = [0] * (self.max_length - len(tokens))<br>        tokens = tokens + padding<br>        <br>        # Create attention mask (1 for tokens, 0 for padding)<br>        # tokens already includes the padding here, so subtract it to count only the real tokens<br>        attention_mask = [1] * (len(tokens) - len(padding))<br>        attention_mask = attention_mask + [0] * len(padding)<br>        <br>  
      # Convert to tensors<br>        input_ids = torch.tensor(tokens[:self.max_length], dtype=torch.float, device=device)<br>        attention_mask = torch.tensor(attention_mask[:self.max_length], dtype=torch.float, device=device)<br>        <br>        return input_ids.unsqueeze(0), attention_mask.unsqueeze(0)  # Add batch dimension</pre><p>Quick breakdown:</p><p><strong>1. Initialization (__init__)</strong></p><p>The constructor initializes the tokenizer with:</p><ul><li><strong>vocab_size:</strong> The size of the vocabulary, i.e., the number of unique token IDs available (default is 50,257).</li><li><strong>max_length:</strong> The maximum length of the tokenized sequence (default is 77).</li></ul><p>These parameters define the tokenizer’s constraints, such as how many tokens it can represent and how long a sequence can be.</p><p><strong>2. Tokenization Method (encode)</strong></p><p>This method converts a string of text into a numerical representation that includes:</p><ul><li>Token IDs for each character in the text.</li><li>Special tokens for padding (0), beginning-of-sequence (1), and end-of-sequence (2).</li><li>An attention mask to indicate which tokens are valid (not padding).</li></ul><blockquote><strong>Creating Dataset</strong></blockquote><ul><li>The <strong>SubsetDataset</strong> class is a custom dataset loader designed to load paired image-caption data from a subset of the dataset we’re using. 
It ensures that each image has a corresponding caption and applies necessary preprocessing (e.g., resizing, normalization) to prepare the data for training.</li></ul><pre><br>class SubsetDataset(Dataset):<br>    def __init__(self, root_folder, img_size=64, transform=None):<br>        self.root_folder = root_folder<br>        self.images_folder = os.path.join(root_folder, &quot;images&quot;)<br>        self.captions_folder = os.path.join(root_folder, &quot;captions&quot;)<br>        <br>        # Set up the dataset items<br>        self.dataset_items = []<br>        <br>        # Find all image files<br>        image_files = []<br>        for ext in [&#39;*.jpg&#39;, &#39;*.jpeg&#39;, &#39;*.png&#39;]:<br>            image_files.extend(glob.glob(os.path.join(self.images_folder, ext)))<br>        <br>        for image_path in image_files:<br>            image_filename = os.path.basename(image_path)<br>            image_id = os.path.splitext(image_filename)[0]  # Remove extension<br>            <br>            # Look for a corresponding caption file<br>            caption_path = os.path.join(self.captions_folder, f&quot;{image_id}.txt&quot;)<br>            <br>            if os.path.exists(caption_path):<br>                with open(caption_path, &#39;r&#39;) as f:<br>                    caption = f.read().strip()<br>                <br>                self.dataset_items.append({<br>                    &#39;image_path&#39;: image_path,<br>                    &#39;caption&#39;: caption<br>                })<br>        <br>        print(f&quot;Created dataset with {len(self.dataset_items)} matched image-caption pairs&quot;)<br>        <br>        # Set up transformation<br>        if transform is None:<br>            self.transform = transforms.Compose([<br>                transforms.Resize((img_size, img_size)),<br>                transforms.ToTensor(),<br>                transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])  # Scale to [-1, 1]<br>            ])<br>   
     else:<br>            self.transform = transform<br>    <br>    def __len__(self):<br>        return len(self.dataset_items)<br>    <br>    def __getitem__(self, idx):<br>        item = self.dataset_items[idx]<br>        <br>        # Load and transform image<br>        image = Image.open(item[&#39;image_path&#39;]).convert(&#39;RGB&#39;)<br>        if self.transform:<br>            image = self.transform(image)<br>        <br>        return {<br>            &#39;image&#39;: image,<br>            &#39;caption&#39;: item[&#39;caption&#39;]<br>        }</pre><p>Quick breakdown of this code:</p><p><strong>1. Initialization (__init__)</strong></p><p>This method sets up the dataset by locating images and their captions, pairing them into dataset items, and configuring the image transformations.</p><ul><li><strong>root_folder:</strong> The directory containing the dataset. It expects two subfolders:<br>1. <strong>images</strong>: Contains image files (e.g., .jpg, .png).<br>2. <strong>captions:</strong> Contains text files with captions corresponding to the images.</li><li><strong>img_size:</strong> The size to which images will be resized (default is 64x64 pixels).</li><li><strong>transform:</strong> A set of transformations applied to the images. If not provided, default transformations are used:<br>1. Resize the image to <strong>img_size</strong>.<br>2. Convert it to a tensor.<br>3. Normalize pixel values to the range [-1, 1].</li></ul><p><strong>2. Length Method (__len__)</strong></p><ul><li>Returns the total number of matched image-caption pairs in the dataset.</li><li>PyTorch’s DataLoader uses this to know how many samples it can draw from.</li></ul><p><strong>3. 
Get Item Method (__getitem__)</strong></p><ul><li>This method retrieves a single item (<strong>image and caption</strong>) from the dataset based on its index (<strong>idx</strong>).</li></ul><blockquote><strong>Training</strong></blockquote><p>The <strong>train </strong>function is the main training loop for a text-conditioned diffusion model. It integrates all the components (dataset, tokenizer, U-Net, text encoder, and diffusion model) to train the model on the image-caption pairs stored in the <strong>coco_subset</strong> folder. Despite the folder name, the download script further down fills it with the Products-10k BLIP-captions dataset, not COCO. The goal is to generate images conditioned on textual captions.</p><pre><br>def train(<br>    root_folder=&quot;./coco_subset&quot;,<br>    num_epochs=50,<br>    batch_size=16,<br>    learning_rate=1e-4,<br>    img_size=64,<br>    timesteps=1000,<br>    device=&quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;<br>):<br>    # Initialize dataset and dataloader<br>    dataset = SubsetDataset(root_folder, img_size=img_size)<br>    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)<br>    <br>    unet = SimpleUNet(in_channels=4, out_channels=4, time_dim=256, text_dim=768).to(device)<br>    text_encoder = TextEncoder().to(device)<br>    diffusion = DiffusionModel(unet, timesteps=timesteps)<br>    <br>    # Initialize optimizer<br>    optimizer = torch.optim.AdamW(unet.parameters(), lr=learning_rate)<br>    <br>    # Initialize tokenizer<br>    tokenizer = SimpleTokenizer()<br>    <br>    # Create output directory for results<br>    output_dir = os.path.join(root_folder, &quot;training_results&quot;)<br>    os.makedirs(output_dir, exist_ok=True)<br>    <br>    # Training loop<br>    for epoch in range(num_epochs):<br>        print(f&quot;Starting epoch {epoch+1}/{num_epochs}&quot;)<br>        epoch_loss = 0.0<br>        <br>        for batch in tqdm(dataloader, desc=f&quot;Epoch {epoch+1}&quot;):<br>            images = batch[&#39;image&#39;].to(device).float()  # [B, 3, H, W]<br>            captions = batch[&#39;caption&#39;]  # List of 
strings<br>            <br>            latents = torch.cat([images, torch.zeros_like(images[:, :1], dtype=torch.float32)], dim=1)  # [B, 4, H, W]<br>            <br>            # Tokenize captions<br>            all_input_ids = []<br>            all_attention_masks = []<br>            <br>            for caption in captions:<br>                input_ids, attention_mask = tokenizer.encode(caption, device)<br>                all_input_ids.append(input_ids)<br>                all_attention_masks.append(attention_mask)<br>                <br>            input_ids = torch.cat(all_input_ids, dim=0)<br>            attention_masks = torch.cat(all_attention_masks, dim=0)<br>            <br>            # Encode text with the text encoder<br>            with torch.no_grad():  # Freeze text encoder during training<br>                text_embeddings = text_encoder(input_ids, attention_masks)<br>            <br>            # Train diffusion model<br>            loss = diffusion.train_step(latents, text_embeddings, optimizer)<br>            epoch_loss += loss<br>            print(f&quot;Epoch {epoch}, Loss: {loss}&quot;)<br>            <br>        # Print epoch statistics<br>        avg_loss = epoch_loss / len(dataloader)<br>        print(f&quot;Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}&quot;)<br>        <br>        # Save checkpoint<br>        if (epoch + 1) % 1 == 0 or epoch == num_epochs - 1:<br>            checkpoint_path = os.path.join(output_dir, f&quot;diffusion_checkpoint_epoch_{epoch+1}.pt&quot;)<br>            torch.save({<br>                &#39;epoch&#39;: epoch,<br>                &#39;unet_state_dict&#39;: unet.state_dict(),<br>                &#39;optimizer_state_dict&#39;: optimizer.state_dict(),<br>                &#39;loss&#39;: avg_loss,<br>            }, checkpoint_path)<br>            print(f&quot;Saved checkpoint to {checkpoint_path}&quot;)<br>            <br>            # Generate a sample image<br>            with torch.no_grad():<br>      
          sample_text = &quot;a photograph of plastic bucket&quot;<br>                sample_ids, sample_mask = tokenizer.encode(sample_text, device)<br>                sample_embedding = text_encoder(sample_ids, sample_mask)<br>                <br>                shape = (1, 4, img_size, img_size)<br>                sample = diffusion.p_sample_loop(shape, sample_embedding, device)<br>                <br>                # Convert to image (assuming the first 3 channels are RGB)<br>                sample_image = sample[0, :3].permute(1, 2, 0).cpu().numpy()<br>                sample_image = (sample_image + 1) / 2.0  # Scale from [-1, 1] to [0, 1]<br>                sample_image = np.clip(sample_image, 0, 1)<br>                <br>                # Save sample image<br>                sample_path = os.path.join(output_dir, f&quot;sample_epoch_{epoch+1}.png&quot;)<br>                sample_pil = Image.fromarray((sample_image * 255).astype(np.uint8))<br>                sample_pil.save(sample_path)<br>                print(f&quot;Saved sample image to {sample_path}&quot;)<br>    <br>    print(&quot;Training completed!&quot;)<br>    return unet, text_encoder, diffusion<br><br>train()</pre><p>Quick breakdown:</p><ol><li><strong>Dataset and DataLoader:</strong></li></ol><ul><li>Loads paired image-caption data using <strong>SubsetDataset</strong>.</li><li>Prepares batches of data with <strong>DataLoader</strong>.</li></ul><p><strong>2. Model Initialization:</strong></p><ul><li><strong>U-Net (SimpleUNet)</strong>: Predicts noise during the denoising process.</li><li><strong>Text Encoder (TextEncoder)</strong>: Converts captions into embeddings.</li><li><strong>Diffusion Model (DiffusionModel)</strong>: Handles the forward and reverse diffusion processes.</li></ul><p><strong>3. 
Optimizer and Tokenizer:</strong></p><ul><li>Uses the <strong>AdamW</strong> optimizer for U-Net training.</li><li><strong>SimpleTokenizer</strong> converts captions into token IDs and attention masks.</li></ul><p><strong>4. Training Loop:</strong></p><ul><li>For each epoch:<br><strong>1. Batch Processing:</strong> Loads images and captions, builds 4-channel latents by appending a zero channel to each RGB image, tokenizes captions, and encodes them into embeddings.<br><strong>2. Train Diffusion Model:</strong> Adds noise to latents and trains U-Net to predict the noise using MSE loss.<br><strong>3. Log Loss:</strong> Tracks and prints average loss per epoch.</li></ul><p><strong>5. Checkpoint Saving:</strong></p><ul><li>Saves U-Net weights, optimizer state, and loss after every epoch.</li></ul><p><strong>6. Sample Image Generation:</strong></p><ul><li>After every epoch, generates an image from random noise conditioned on a sample caption (e.g., &quot;a photograph of plastic bucket&quot;).</li><li>Saves generated images as .png files for visual inspection.</li></ul><blockquote><strong>Downloading Dataset</strong></blockquote><p>This is the only working dataset I could find, so feel free to use another one! 
If you do, share the results with me too!</p><pre>from datasets import load_dataset<br>import os<br>from tqdm import tqdm<br><br>output_dir = &quot;coco_subset&quot;<br>os.makedirs(os.path.join(output_dir, &quot;images&quot;), exist_ok=True)<br>os.makedirs(os.path.join(output_dir, &quot;captions&quot;), exist_ok=True)<br><br>try:<br>    dataset = load_dataset(&quot;VikramSingh178/Products-10k-BLIP-captions&quot;, split=&quot;test&quot;)<br>    start_idx = 0<br><br>    for i, example in tqdm(enumerate(dataset), total=len(dataset)):<br>        global_idx = start_idx + i<br>        image = example[&quot;image&quot;]<br>        if image is None:<br>            continue  # Skip examples without an image so no orphan caption is written<br><br>        filename = f&quot;{global_idx:08d}.jpg&quot;<br>        # JPEG cannot store alpha or palette modes, so convert to RGB first<br>        image.convert(&quot;RGB&quot;).save(os.path.join(output_dir, &quot;images&quot;, filename))<br><br>        caption = example[&quot;text&quot;]<br>        if isinstance(caption, list):<br>            caption = caption[0]<br><br>        with open(os.path.join(output_dir, &quot;captions&quot;, f&quot;{global_idx:08d}.txt&quot;), &quot;w&quot;, encoding=&quot;utf-8&quot;) as f:<br>            f.write(caption)<br>    <br>    print(&quot;Dataset downloaded successfully&quot;)<br>    <br>except Exception as e:<br>    print(f&quot;Error: {e}&quot;)</pre><blockquote><strong>Results</strong></blockquote><p>To generate meaningful images, we’d have to train this for 100k+ epochs, but for now I’ve trained it for just 8 epochs, since training takes a very long time. 
Here are some images our model generated:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/64/1*i_pwgu18ynVB5452TOdlkA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/64/1*XN900fsjHqgoWjZs-NzXYQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/64/1*gZiWMCYbCXyAwbJxlqoLPg.png" /></figure><p>I mean, it’s pure noise, but still pretty cool!</p><blockquote><strong>Conclusion</strong></blockquote><p>So, that’s it! There is a lot of room for improvement, but this is, in essence, how diffusion-based image generation models work. Hopefully you liked this article!</p><p>To share results or ask questions, reach out to me on <a href="https://x.com/atulit_gaur">X</a>.</p><p>Code for this article: <a href="https://github.com/Atulit23/stable_diffusion_from_scratch">https://github.com/Atulit23/stable_diffusion_from_scratch</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stable Diffusion I — Mathematics Behind It]]></title>
            <link>https://medium.com/@atulit23/stable-diffusion-i-mathematics-behind-it-2957f5839e80?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/2957f5839e80</guid>
            <category><![CDATA[mathematics]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Fri, 07 Mar 2025 12:31:52 GMT</pubDate>
            <atom:updated>2025-03-07T12:31:52.641Z</atom:updated>
            <content:encoded><![CDATA[<h3>Stable Diffusion I — Mathematics Behind It</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CBzJPumf6tLktlNvhQVL4w.png" /></figure><blockquote>What is Stable Diffusion?</blockquote><p>Stable Diffusion is an AI model that creates images from text by starting with pure noise (like static) and gradually <strong>removing the noise step by step</strong> until the image matches the description. It learns how to turn noise into meaningful images using millions of examples, combining deep neural networks with attention mechanisms to understand both visual patterns and text prompts.</p><p>Before starting, there are some topics we need to cover, so let’s start with them.</p><blockquote>KL Divergence</blockquote><p>KL divergence (Kullback-Leibler divergence) is a concept in information theory and probability that measures how different one probability distribution is from another.</p><p>Think of KL divergence as answering the question: <em>How much information is lost if we approximate one probability distribution with another?</em></p><p>It quantifies how much one probability distribution Q differs from another probability distribution P. 
The more different they are, the higher the KL divergence.</p><h4>Formula</h4><p>The KL divergence between two distributions P and Q is given by:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/849/1*YkSj2kMlga2MMpC-8hRYCg.png" /></figure><p>Where:</p><ul><li>P is the <strong>true distribution</strong> (what the data really looks like)</li><li>Q is the <strong>approximate distribution</strong> (what we’re trying to model)</li><li>x represents the possible outcomes</li></ul><p><strong>What does it actually mean?</strong></p><ul><li>If P(x) and Q(x) are <strong>identical</strong>, then:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/827/1*4g2o-UHNnHqa6JrbIcCJGQ.png" /></figure><ul><li>If Q is <strong>very different</strong> from P, the divergence will be large.</li></ul><p><strong>How is KL Divergence Used in Stable Diffusion?</strong></p><p>In <strong>Stable Diffusion</strong> (and VAEs), KL divergence is used during the training phase to:</p><ol><li>Push the encoded latent space close to a standard Gaussian distribution N(0, I).</li><li>Ensure that the fully noised sample at the final timestep matches the prior distribution.</li></ol><p>The loss function often looks like:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/827/1*id6YIPGjeeO09ZUuK5Im3A.png" /></figure><p>More on this later.</p><blockquote>Jensen’s Inequality</blockquote><p>Jensen’s Inequality tells us how a <strong>convex function </strong>(a function is convex if its curve always bends upwards, like a bowl) 
interacts with the <strong>average of a random variable</strong>.</p><p>In simple words:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/827/1*zw-2GGTHxmgxxp51RmAC4A.png" /></figure><p>where:</p><ul><li>X is a random variable (like position, energy, or vibration)</li><li>f is a convex function</li><li>E[X] is the <strong>average</strong> value of X (expected value)</li><li>E[f(X)] is the average of the function applied to X</li></ul><p>What all of this basically means is that if you first <strong>average the raw data</strong> and then apply the function, you’ll always get a value that is <strong>less than or equal to</strong> the one you get if you apply the function first and then average, that is, f(E[X]) ≤ E[f(X)].</p><blockquote>Markov chain</blockquote><p>A <strong>Markov Chain</strong> is a mathematical model that describes a sequence of events where the <strong>next state depends only on the current state</strong>, not on the entire history of previous states.</p><p>Markov Chains are used whenever you need to <strong>model randomness over time</strong>, especially in:</p><ul><li>Text Generation (like LLMs)</li><li>Image Generation (like Stable Diffusion)</li><li>Reinforcement Learning</li><li>Time Series Forecasting</li><li>Hidden Markov Models (HMMs) in speech recognition</li></ul><blockquote>Working</blockquote><h4>1. Forward Diffusion Process</h4><ul><li>The forward process is like breaking down an ordered system into chaos.</li><li>We gradually add Gaussian noise over <strong>T</strong> timesteps.</li><li>At every step, the image becomes noisier.</li></ul><p>We define a <strong>Markovian forward diffusion</strong> that gradually adds Gaussian noise to an image <strong>x₀</strong> over time t, creating a noised version <strong>xₜ</strong>. 
This process is modeled as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/844/1*UjEpQ-QKfQFX-fm4Ea8hAg.png" /></figure><p>Where:</p><ul><li><strong>x₀</strong> → The original clean image</li><li><strong>xₜ</strong> → Noisy image at timestep <strong>t</strong></li><li><strong>αₜ</strong> → Noise scaling factor at timestep <strong>t</strong></li><li>N → Gaussian distribution</li></ul><p><strong>→ What is Markovian Forward Diffusion?</strong></p><ul><li>A <strong>Markovian process</strong> simply means that the <strong>future state</strong> depends <strong>only on the present state</strong>, not on the entire history.</li><li>Mathematically:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*6fbuW2QmFWt7yF9HRFqxoA.png" /></figure><p>So, the probability of <strong>xₜ</strong> only depends on the <strong>previous step</strong> <strong>xₜ₋₁</strong>, making the process <strong>memoryless</strong>.</p><p>By leveraging the <strong>reparameterization trick</strong>, we can write the diffusion process directly as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*dvugUadAKDZYKLteu69grA.png" /></figure><p><strong>→ What is the reparameterization trick?</strong></p><ul><li>In the forward diffusion process, we start with an image <strong>x₀</strong>, and after T timesteps, it becomes pure noise:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*ClB0daBUA-TWb5IsezJwfA.png" /></figure><ul><li>Now the whole game is about <strong>reversing this process</strong>.</li><li>What we are actually training the model to do is predict the <strong>noise</strong> at each step:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*yEKfcKi7ENHXqLbdbYLJ6g.png" /></figure><ul><li>But the catch is, the whole reverse diffusion process is stochastic (random) because noise is involved at each step.</li><li>If the process is random, how can the model be trained through backpropagation?</li><li>You cannot 
directly backpropagate through <strong>random noise sampling</strong> because the sampling operation is <strong>not differentiable</strong>: gradients cannot flow through a purely random draw back to the model parameters.</li></ul><p>That’s where the <strong>Reparameterization Trick</strong> comes in:</p><p>Instead of sampling noise directly like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*WFFT8UBumoNztjuAVKfERg.png" /></figure><p>We now <strong>reparameterize</strong> the noise:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*3g33z7lQpn3KaW_J7f6_dQ.png" /></figure><p>Where:</p><ul><li><strong>ϵθ(xₜ, t)</strong> is the <strong>deterministic neural network prediction</strong>.</li><li>z∼N(0,I) is the <strong>pure Gaussian noise</strong>.</li></ul><p>Now rewrite the whole diffusion equation like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*aG-jm-VAmua2y1NLTRNFEA.png" /></figure><p>Now the entire process becomes <strong>differentiable</strong>.</p><p>The randomness is pushed into the separate z term, which <strong>does not depend on the model parameters</strong>.</p><p>Now, during backpropagation, gradients flow only through <strong>ϵθ(xₜ, t)</strong>, so only the network’s parameters get updated, while the sampled Gaussian noise is treated as a fixed input.</p><p><strong>→ Full Pipeline</strong></p><ol><li>Start with pure noise at the final timestep t = T: <strong>xₜ</strong>∼N(0,I)</li><li>For each timestep t:</li></ol><ul><li>Predict the noise: <strong>ϵθ(xₜ, t)</strong></li><li>Sample new noise: z∼N(0,I)</li><li>Reconstruct the clean image:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/786/1*LI3HmarRilFVoTd2fLo7sg.png" /></figure><h4><strong>2. 
Reverse Diffusion Process</strong></h4><p>In the reverse process, we want to <strong>denoise the pure Gaussian noise</strong> step by step until we regenerate the original data <strong>x₀</strong>​.</p><p><strong>The Reverse Diffusion Equation</strong></p><p>The forward diffusion process follows this Markov chain:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/828/1*mSCFG_XRdPgX3CyGy_2nKg.png" /></figure><p>The <strong>Reverse Diffusion Process</strong> is the exact opposite:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/828/1*ulBKeB1B6EcURdMI0ceH-w.png" /></figure><p>Let’s break this down</p><p><strong>a. The Reverse Mean μθ(xₜ, t)</strong></p><ul><li>The reverse mean is:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*APg99AbPz9FI8VGTRnrRdg.png" /></figure><ul><li>We start with the noisy sample <strong>xₜ</strong>​</li><li>Predict what the noise <strong>should have been</strong> using the neural network</li><li>Subtract the noise to get a slightly denoised version</li><li>Scale everything back by the inverse diffusion factor</li></ul><p><strong>b. 
The Variance Σθ(xₜ, t)</strong></p><ul><li>The variance controls <strong>how much randomness</strong> we inject at each step.</li><li>For most diffusion models, it’s fixed as:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/754/1*86dwCey--gw3EtKC1GSX1w.png" /></figure><ul><li>Where:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/782/1*JQ-oDGsyNvFLbO9JemtGhw.png" /></figure><p>The model can also predict this variance, but many papers keep it fixed.</p><p>Thus, the final reverse diffusion equation is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/783/1*7PFHtWTF8IL9WIfNMJH_KA.png" /></figure><p><strong>Recap:</strong></p><ul><li>Start with random Gaussian noise</li><li>At each step, subtract the predicted noise (order from chaos)</li><li>Add a little bit of fresh Gaussian noise (to maintain stochasticity)</li><li>Repeat this for <strong>T</strong> steps</li></ul><h4>3. Variational Lower Bound &amp; Training Objective</h4><p>Given <strong>data</strong> <strong>x₀</strong>, we want to model the probability distribution:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/783/1*uDZQDV_OvZjk8gUCmlHCaw.png" /></figure><p>But directly maximizing this likelihood is <strong>intractable</strong>, because it requires integrating over every possible sequence of intermediate noisy samples.</p><p>Instead, we turn to <strong>Variational Inference</strong>, where we bound this likelihood from below using a <strong>Variational Lower Bound (VLB)</strong>.</p><p>The full training objective is maximizing the <strong>log-likelihood</strong> of the data:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/783/1*H-HDNz4lcxtM9W9ADOBpoA.png" /></figure><p>But we can’t compute this directly.</p><p>Instead, we derive a lower bound using <strong>Jensen’s Inequality</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/783/1*0Y9Gjc0y-QOE6p1dwpLYTQ.png" /></figure><p>This lower bound is called the <strong>ELBO</strong> (Evidence Lower Bound):</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/783/1*JzP72txe-Wl2EiuOjbT7vg.png" /></figure><p>Let’s break this down.</p><p>First, rewrite the joint probability of the entire Markov chain:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/789/1*QoPr4FnDuCg3bDCj6AU71w.png" /></figure><p>And the forward process is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/789/1*OJn7buI5L4m69yfCjnM5sQ.png" /></figure><p>Now the ELBO becomes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/792/1*XVeDzsWwRnbkkrWHutYBwg.png" /></figure><p>The ELBO now splits into <strong>three separate loss terms</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/879/1*Cf9JsHsyQcFfoumWutPvug.png" /></figure><p><strong>→ Reconstruction Loss</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/818/1*ieZwGky-RgVilplcwE00WA.png" /></figure><p>The main loss is the <strong>Kullback-Leibler Divergence</strong> between the true posterior of the forward process and the predicted reverse distribution at each intermediate timestep:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/818/1*zf-upIs9MtxuQn_pIcoXDA.png" /></figure><p>Finally, at the last timestep we compute the <strong>Prior Matching Loss:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/818/1*ExmNN7qkbvoB1TIAPfyvdQ.png" /></figure><p>But since both distributions inside each KL term are Gaussian, the KL divergence has a <strong>closed form</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/818/1*io2Pa8eDY_sfvpE9m6HWJQ.png" /></figure><p>This is really clever because:</p><ul><li>The model <strong>never</strong> predicts the image directly</li><li>It only predicts the <strong>noise</strong> at each step</li></ul><p>That’s why the entire training objective is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/818/1*EVoFmLQFGAwEYXi0kh49Kw.png" /></figure><p>The full loss is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/827/1*8_Offrlv2VIWO4bkxpZDDA.png" /></figure><p>But 
the simple version (used in most papers) is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/828/1*qThQaXVO3YwXhNWLyzs1lw.png" /></figure><blockquote>Example Walkthrough of Stable Diffusion</blockquote><h4>1. <strong>Dataset &amp; Input Preparation</strong></h4><p>Let’s say we have a dataset of <strong>cat images</strong> x∈<strong>R³ˣ²⁵⁶ˣ²⁵⁶</strong></p><h4>2. <strong>Forward Diffusion (Markovian Process)</strong></h4><p>We gradually add Gaussian noise to the image in <strong>T=1000</strong> steps using the forward diffusion process:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/820/1*SdzcceT-xWXSCvyfoKHK_Q.png" /></figure><p>where:</p><ul><li><strong>βₜ </strong>is the noise variance at step t, set by the noise schedule (small noise at the beginning, larger noise at the end)</li><li><strong>t</strong> is the time step</li><li><strong>√(1 − βₜ) </strong>scales down the image from the previous step</li><li><strong>βₜ</strong>I is the covariance of the Gaussian noise added at each step</li></ul><p>Instead of sampling one step at a time, we can directly jump to any time step <strong>t</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/762/1*ZtVAv4fpZBTW_XOSwdITUQ.png" /></figure><p>with:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/813/1*kd_-yaX4j3QgXGbfGaKsyw.png" /></figure><p><strong>Example:</strong></p><p>Let’s say:</p><ul><li><strong>βₜ</strong>=0.01</li><li>t=100</li><li><strong>x₀</strong> is an image of a cat</li></ul><p>Then:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/813/1*gmb6_S0SjZ4K6cEpYWElgQ.png" /></figure><p>The noisy image becomes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*z81-_x4OFf3KJQN6E0Fc_g.png" /></figure><h4>3. 
<strong>Reverse Diffusion Process (Model Training Objective)</strong></h4><p>Now the goal is to <strong>reverse the noise</strong> and recover the original image.</p><p>We train a neural network <strong>ϵθ(xₜ, t)</strong> to predict the noise at each step:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*qX_AbEHc_AEWPBbNI4YKQg.png" /></figure><p>The denoising step is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*eMeLwjbtnMJiQgGJT3MKdw.png" /></figure><p>Where:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*s2v76gBKOem24dvnpBg1uQ.png" /></figure><h4>4. <strong>Training Objective (Variational Lower Bound)</strong></h4><p>The network is trained to minimize a <strong>simplified loss derived from the evidence lower bound (ELBO)</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*NJeD8cWn5PlVJk7tNEKhvQ.png" /></figure><p>This means the model is trained to <strong>predict the noise</strong> added at each timestep.</p><h4>Example Training Step</h4><p>Let:</p><ul><li><strong>Image</strong> = Cat</li><li><strong>x₀ </strong>= Cat Image</li><li><strong>t </strong>= 500</li><li><strong>βₜ </strong>= 0.02</li></ul><p><strong>→ Forward Diffusion</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*rvsFaWtv_lCYjjmscfe5dA.png" /></figure><p><strong>→ Neural Network Prediction:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*lGdU3wTiMTOS31vsWF5jtw.png" /></figure><p><strong>→ Loss Function</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*lYX5FzcfC9_wOFFHkdGHsA.png" /></figure><p><strong>→ Gradient Descent</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*lSMzxeUkqinttbdsv_wRSQ.png" /></figure><p>After training, we generate images starting from pure Gaussian noise:</p><ul><li>Start at the final timestep t = T with x<strong>ₜ</strong>∼N(0,I)</li><li>Sample iteratively:</li></ul><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/842/1*vSrBFt7-1Fcg17Vnol0OBg.png" /></figure><p>until t=0.</p><blockquote>Conclusion</blockquote><p><strong>To summarize, each training step does the following:</strong></p><ul><li>Sample an image x<strong>₀</strong> from the dataset</li><li>Randomly choose a timestep t∼U(1,T)</li><li>Generate the noisy image:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*uQaP86sU23NxqBzKbKyh1w.png" /></figure><ul><li>Predict the noise:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*nHtJBBuf-eBYAFFB_-IvPg.png" /></figure><ul><li>Compute the loss:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*yqXnl9HpEKEGDIyapt0HMQ.png" /></figure><p>So, this was it! Hopefully you liked this. In the next article, we’ll code this up using PyTorch!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[7 Machine Learning Projects in under 7 minutes]]></title>
            <link>https://medium.com/@atulit23/7-machine-learning-projects-in-under-7-minutes-b9ab9f35a171?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/b9ab9f35a171</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Sun, 23 Feb 2025 10:48:25 GMT</pubDate>
            <atom:updated>2025-02-23T10:55:16.300Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/850/1*5DjlYonctRxInl6oIA5wEQ.jpeg" /></figure><blockquote>What are we gonna do</blockquote><p>In the previous blogs I covered all the mathematics required for Machine Learning, so in this blog we’re gonna be doing some Machine Learning projects before ultimately moving on to Deep Learning.</p><h3>1. House Price Prediction — Regression (XGBoost)</h3><pre>import pandas as pd<br>import numpy as np<br>from sklearn.model_selection import train_test_split<br>from sklearn.preprocessing import LabelEncoder, StandardScaler<br>from xgboost import XGBRegressor<br>from sklearn.metrics import mean_absolute_error, mean_squared_error<br><br>data = pd.read_csv(&quot;./datasets/house-price.csv&quot;)<br><br># Id is only a row identifier, so it carries no predictive signal<br>data.drop(columns=[&#39;Id&#39;], inplace=True)<br><br># Fill numeric gaps with the column median, then the remaining (text) gaps with &quot;None&quot;<br>data.fillna(data.median(numeric_only=True), inplace=True)<br>data.fillna(&quot;None&quot;, inplace=True)<br><br># Label-encode every text column and keep the encoders for later reuse<br>categorical_cols = data.select_dtypes(include=[&#39;object&#39;]).columns<br>label_encoders = {}<br>for col in categorical_cols:<br>    le = LabelEncoder()<br>    data[col] = le.fit_transform(data[col])<br>    label_encoders[col] = le<br><br>X = data.drop(columns=[&#39;SalePrice&#39;])<br>y = data[&#39;SalePrice&#39;]<br><br>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)<br><br># Fit the scaler on the training split only, then apply it to the test split<br>scaler = StandardScaler()<br>X_train = scaler.fit_transform(X_train)<br>X_test = scaler.transform(X_test)<br><br>model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)<br>model.fit(X_train, y_train)<br><br>y_pred = model.predict(X_test)<br><br>mae = mean_absolute_error(y_test, y_pred)<br>mse = mean_squared_error(y_test, y_pred)<br>rmse = np.sqrt(mse)<br><br>print(f&quot;MAE: {mae}&quot;)<br>print(f&quot;RMSE: {rmse}&quot;)</pre><h4>1. 
Load and Preprocess the Data</h4><ul><li>The dataset (house-price.csv) is loaded using <strong>pandas</strong>.</li><li>The <strong>Id</strong> column is removed since it’s just an identifier and doesn’t help in prediction.</li><li>Missing values: numerical columns are filled with their <strong>median</strong>, and categorical columns are filled with &quot;None&quot;, so no missing values remain.</li></ul><h4>2. Encode Categorical Variables</h4><ul><li>The script finds all categorical columns (text-based data).</li><li>It uses <strong>Label Encoding</strong> to convert them into numerical values since ML models can’t work directly with text.</li></ul><h4>3. Split the Data</h4><ul><li>The dataset is split into <strong>training (80%)</strong> and <strong>testing (20%)</strong> sets using train_test_split().</li><li><strong>Features (</strong><strong>X)</strong>: All columns except SalePrice.</li><li><strong>Target (</strong><strong>y)</strong>: The SalePrice column.</li></ul><h4>4. Scale the Features</h4><ul><li>Standardization (StandardScaler) is applied to bring all numerical features to a similar scale. Tree-based models such as XGBoost are largely insensitive to feature scale, so this step matters more if you swap in a linear or distance-based model.</li></ul><h4>5. Train the Model</h4><ul><li>An <strong>XGBoost Regressor</strong> (100 trees) is trained on the X_train and y_train data.</li></ul><h4>6. Evaluate the Model</h4><ul><li>Predictions on X_test are scored with <strong>Mean Absolute Error (MAE)</strong> and <strong>Root Mean Squared Error (RMSE)</strong>.</li></ul><h3>2. 
Spam Email Classification (SVM)</h3><pre>import pandas as pd<br>import numpy as np<br>from sklearn.model_selection import train_test_split<br>from sklearn.feature_extraction.text import TfidfVectorizer<br>from sklearn.svm import SVC<br>from sklearn.metrics import accuracy_score, classification_report<br><br>data = pd.read_csv(&quot;spam_dataset.csv&quot;)  <br><br>X = data[&quot;text&quot;]<br>y = data[&quot;label&quot;]<br><br>vectorizer = TfidfVectorizer(stop_words=&#39;english&#39;, max_features=5000)<br>X_vectorized = vectorizer.fit_transform(X)<br><br>X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)<br><br>model = SVC(kernel=&#39;linear&#39;)<br>model.fit(X_train, y_train)<br><br>y_pred = model.predict(X_test)<br><br>accuracy = accuracy_score(y_test, y_pred)<br>print(f&quot;Accuracy: {accuracy}&quot;)<br>print(classification_report(y_test, y_pred))</pre><p>This code:</p><ul><li>Loads a dataset (spam_dataset.csv) with text messages (text) and their labels (label).</li><li>Converts the text into numerical features using <strong>TF-IDF vectorization</strong> (removing stop words, limiting to 5000 features).</li><li>Splits the data into training (80%) and testing (20%) sets.</li><li>Trains an <strong>SVM classifier</strong> with a linear kernel.</li><li>Makes predictions on the test set and evaluates performance using <strong>accuracy</strong> and a <strong>classification report</strong> (precision, recall, F1-score).</li></ul><h3>3. 
Sentiment Analysis (Logistic Regression)</h3><pre>import pandas as pd<br>from sklearn.model_selection import train_test_split<br>from sklearn.feature_extraction.text import TfidfVectorizer<br>from sklearn.linear_model import LogisticRegression<br>from sklearn.metrics import accuracy_score, classification_report<br><br># Assumed encoding; change this if your CSV uses a different one<br>encoding = &quot;utf-8&quot;<br>data = pd.read_csv(&quot;./datasets/sentiment.csv&quot;, encoding=encoding)<br>data = data.dropna()<br><br>X = data[&quot;selected_text&quot;]<br>y = data[&quot;sentiment&quot;]<br><br># Turn each text into a sparse TF-IDF vector over the 5000 most frequent terms<br>vectorizer = TfidfVectorizer(stop_words=&#39;english&#39;, max_features=5000)<br>X_vectorized = vectorizer.fit_transform(X)<br><br>X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)<br><br>model = LogisticRegression()<br>model.fit(X_train, y_train)<br><br>y_pred = model.predict(X_test)<br><br>accuracy = accuracy_score(y_test, y_pred)<br>print(f&quot;Accuracy: {accuracy}&quot;)<br>print(classification_report(y_test, y_pred))</pre><p>This code:</p><ul><li>Loads a sentiment dataset (sentiment.csv) and removes rows with missing values.</li><li>Converts text (selected_text) into numerical features using <strong>TF-IDF vectorization</strong> (removing stop words, limiting to 5000 features).</li><li>Splits the data into <strong>training (80%)</strong> and <strong>testing (20%)</strong> sets.</li><li>Trains a <strong>Logistic Regression</strong> model on the training data.</li><li>Makes predictions on the test set and evaluates performance using <strong>accuracy</strong> and a <strong>classification report</strong> (precision, recall, F1-score).</li></ul><h3>4. 
Stock Price Prediction (XGBoost)</h3><pre>import pandas as pd<br>import numpy as np<br>import yfinance as yf<br>from sklearn.model_selection import train_test_split<br>from sklearn.preprocessing import MinMaxScaler<br>import xgboost as xgb<br>from sklearn.metrics import mean_absolute_error, mean_squared_error<br><br>ticker = &quot;AAPL&quot;<br>data = yf.download(ticker, start=&quot;2020-01-01&quot;, end=&quot;2024-01-01&quot;)<br><br># Daily return = percentage change of the closing price<br>data[&quot;Return&quot;] = data[&quot;Close&quot;].pct_change()<br>data.dropna(inplace=True)<br><br>X = data[[&quot;Open&quot;, &quot;High&quot;, &quot;Low&quot;, &quot;Volume&quot;, &quot;Return&quot;]]<br>y = data[&quot;Close&quot;]<br><br>scaler = MinMaxScaler()<br>X_scaled = scaler.fit_transform(X)<br><br># Chronological split: shuffling a time series would leak future prices into training<br>X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=False)<br><br>model = xgb.XGBRegressor(objective=&quot;reg:squarederror&quot;, n_estimators=150, random_state=42)<br>model.fit(X_train, y_train)<br><br>y_pred = model.predict(X_test)<br><br>mae = mean_absolute_error(y_test, y_pred)<br>mse = mean_squared_error(y_test, y_pred)<br>rmse = np.sqrt(mse)<br><br>print(f&quot;MAE: {mae}&quot;)<br>print(f&quot;RMSE: {rmse}&quot;)</pre><p>This code:</p><ul><li>Downloads <strong>AAPL stock data</strong> from Yahoo Finance (2020–2024).</li><li>Calculates the <strong>daily return</strong> (Close price percentage change).</li><li>Uses <strong>Open, High, Low, Volume, and Return</strong> as input features (X) and <strong>Close price</strong> as the target (y).</li><li><strong>Scales</strong> the input features using <strong>MinMaxScaler</strong>.</li><li>Splits the data chronologically into <strong>training (first 80%)</strong> and <strong>testing (last 20%)</strong> sets, so the model is never trained on days that come after the ones it is tested on.</li><li>Trains an <strong>XGBoost regressor</strong> with 150 estimators.</li><li>Predicts stock prices and evaluates performance using <strong>Mean Absolute Error (MAE)</strong> and <strong>Root Mean Squared Error (RMSE)</strong>.</li></ul><h3>5. 
Movie Recommender System</h3><pre>import pandas as pd<br>from sklearn.feature_extraction.text import TfidfVectorizer<br>from sklearn.metrics.pairwise import cosine_similarity<br><br>df = pd.read_csv(&#39;./datasets/movies.csv&#39;)<br><br># Assign the result back: calling fillna(inplace=True) on a selected column may not update df<br>df[&#39;overview&#39;] = df[&#39;overview&#39;].fillna(&quot;&quot;)<br>df[&#39;tagline&#39;] = df[&#39;tagline&#39;].fillna(&quot;&quot;)<br>df[&#39;original_title&#39;] = df[&#39;original_title&#39;].fillna(&quot;&quot;)<br><br># One text field per movie: title + overview + tagline<br>df[&#39;combined&#39;] = df[&#39;original_title&#39;] + &#39; &#39; + df[&#39;overview&#39;] + &#39; &#39; + df[&#39;tagline&#39;]<br><br>vectorizer = TfidfVectorizer(stop_words=&#39;english&#39;)<br>tfidf_matrix = vectorizer.fit_transform(df[&quot;combined&quot;])<br><br># Pairwise similarity between every pair of movies<br>similarity_matrix = cosine_similarity(tfidf_matrix)<br><br><br>def recommend_movies(title, num_recommendations=5):<br>    if title not in df[&quot;title&quot;].values:<br>        return &quot;Movie not found in dataset.&quot;<br>    <br>    idx = df[df[&quot;title&quot;] == title].index[0]<br>    similar_movies = list(enumerate(similarity_matrix[idx]))<br>    # Sort by score, then skip the first hit, which is the movie itself<br>    similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:num_recommendations+1]<br>    <br>    recommendations = [df.iloc[i[0]][&quot;title&quot;] for i in similar_movies]<br>    return recommendations<br><br>movie_title = &quot;Inception&quot;<br>recommended_movies = recommend_movies(movie_title)<br>print(f&quot;Movies similar to {movie_title}: {recommended_movies}&quot;)</pre><p>This code:</p><ul><li><strong>Loads movie data</strong> (movies.csv) and fills missing values.</li><li><strong>Combines text features</strong> (title, overview, tagline) into a single string.</li><li><strong>Converts text into vectors</strong> using <strong>TF-IDF</strong>, removing common words.</li><li><strong>Computes cosine similarity</strong>, which measures how similar two movies are by taking the cosine of the angle between their TF-IDF vectors.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/808/1*PX8L6zvBOvhLHg1ZN8Mzew.png" /></figure><ul><li><strong>Finds the most similar movies</strong> to a given title by sorting similarity scores.</li><li><strong>Recommends the top 5 movies</strong> similar to &quot;Inception&quot;.</li></ul><h3>6. Churn Prediction (Decision Trees)</h3><pre>import pandas as pd<br>from sklearn.model_selection import train_test_split<br>from sklearn.preprocessing import LabelEncoder<br>from sklearn.tree import DecisionTreeClassifier<br>from sklearn.metrics import accuracy_score, classification_report<br><br>data = pd.read_csv(&quot;./datasets/churn.csv&quot;)<br>data = data.dropna()  <br><br>label_encoders = {}<br>categorical_cols = [&quot;Gender&quot;, &quot;Subscription Type&quot;, &quot;Contract Length&quot;]<br>for col in categorical_cols:<br>    le = LabelEncoder()<br>    data[col] = le.fit_transform(data[col])<br>    label_encoders[col] = le<br><br>X = data.drop(columns=[&quot;CustomerID&quot;, &quot;Churn&quot;])<br>y = data[&quot;Churn&quot;]<br><br>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)<br><br>model = DecisionTreeClassifier(random_state=42)<br>model.fit(X_train, y_train)<br><br>y_pred = model.predict(X_test)<br><br>accuracy = accuracy_score(y_test, y_pred)<br>print(f&quot;Accuracy: {accuracy}&quot;)<br>print(classification_report(y_test, y_pred))</pre><p>This code:</p><ul><li><strong>Loads Data</strong> — Reads churn.csv and removes missing values.</li><li><strong>Encodes Categorical Data</strong> — Converts Gender, Subscription Type, and Contract Length into numbers using Label Encoding.</li><li><strong>Prepares Features &amp; Labels</strong> — Drops CustomerID, sets &quot;Churn&quot; as the target variable.</li><li><strong>Splits Data</strong> — Divides into 80% training and 20% testing sets.</li><li><strong>Trains Model</strong> — Fits a 
<strong>Decision Tree Classifier</strong> on the training data.</li><li><strong>Makes Predictions</strong> — Predicts churn for the test data.</li><li><strong>Evaluates Performance</strong> — Computes <strong>accuracy</strong> and <strong>classification report</strong>.</li></ul><h3>7. MNIST Digits Classification</h3><pre>from xgboost import XGBClassifier<br>from sklearn.metrics import accuracy_score, classification_report<br>import tensorflow as tf<br><br>(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()<br><br># Flatten each 28x28 image into a 784-long feature vector<br>X_train = X_train.reshape(X_train.shape[0], -1)<br>X_test = X_test.reshape(X_test.shape[0], -1)<br><br># Scale pixel values from [0, 255] to [0, 1]<br>X_train = X_train / 255.0<br>X_test = X_test / 255.0<br><br># use_label_encoder is deprecated in recent XGBoost releases, so it is not passed here<br>model = XGBClassifier(eval_metric=&#39;mlogloss&#39;)<br>model.fit(X_train, y_train)<br><br>y_pred = model.predict(X_test)<br><br>accuracy = accuracy_score(y_test, y_pred)<br>print(f&quot;Accuracy: {accuracy}&quot;)<br>print(classification_report(y_test, y_pred))</pre><p>This code:</p><ul><li><strong>Loads MNIST Dataset</strong> — Imports the <strong>handwritten digits dataset</strong> (28x28 grayscale images labeled 0–9).</li><li><strong>Reshapes Images</strong> — Flattens each <strong>28x28 image</strong> into a <strong>1D array of 784 pixels</strong> for model compatibility.</li><li><strong>Normalizes Pixel Values</strong> — Divides by <strong>255.0</strong> to scale values between <strong>0 and 1</strong>, improving training efficiency.</li><li><strong>Initializes XGBoost Model</strong> — Uses <strong>XGBClassifier</strong> with mlogloss (multi-class log loss) for classification.</li></ul><p>The model is then trained on the flattened images, and its predictions on the test set are scored with <strong>accuracy</strong> and a <strong>classification report</strong>.</p><blockquote>Conclusion</blockquote><p>Hope you liked this and found it helpful! 
If you want to understand these algorithms in depth, check out my previous blogs:</p><ol><li><a href="https://medium.com/@atulit23/mathematics-for-machine-learning-theory-implementation-i-d7e3b1815e4b">Mathematics &amp; Machine Learning — I</a></li><li><a href="https://medium.com/@atulit23/mathematics-for-machine-learning-theory-implementation-ii-df90abc95e66">Mathematics &amp; Machine Learning — II</a></li></ol><p>The code for these projects is also available here: <a href="https://github.com/Atulit23/mini-ml-projects">https://github.com/Atulit23/mini-ml-projects</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Mathematics for Machine Learning — Theory & Implementation II]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@atulit23/mathematics-for-machine-learning-theory-implementation-ii-df90abc95e66?source=rss-f0f45706a5be------2"><img src="https://cdn-images-1.medium.com/max/869/1*vyaAtOICTaBotlLAQK2SBA.jpeg" width="869"></a></p><p class="medium-feed-snippet">Things this will cover</p><p class="medium-feed-link"><a href="https://medium.com/@atulit23/mathematics-for-machine-learning-theory-implementation-ii-df90abc95e66?source=rss-f0f45706a5be------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@atulit23/mathematics-for-machine-learning-theory-implementation-ii-df90abc95e66?source=rss-f0f45706a5be------2</link>
            <guid isPermaLink="false">https://medium.com/p/df90abc95e66</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[mathematics]]></category>
            <dc:creator><![CDATA[Void]]></dc:creator>
            <pubDate>Mon, 17 Feb 2025 14:23:11 GMT</pubDate>
            <atom:updated>2025-02-17T14:23:11.882Z</atom:updated>
        </item>
    </channel>
</rss>