<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Adarsh Kesharwani on Medium]]></title>
        <description><![CDATA[Stories by Adarsh Kesharwani on Medium]]></description>
        <link>https://medium.com/@adarshhme?source=rss-e192e3794730------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*jyVrqgnU75jn5R-TbCo6mQ.jpeg</url>
            <title>Stories by Adarsh Kesharwani on Medium</title>
            <link>https://medium.com/@adarshhme?source=rss-e192e3794730------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 31 May 2026 20:04:23 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@adarshhme/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Building an Accent Embedding Model from Scratch: A Step-by-Step Technical Guide]]></title>
            <link>https://medium.com/@adarshhme/building-an-accent-embedding-model-from-scratch-a-step-by-step-technical-guide-3d336577fb4a?source=rss-e192e3794730------2</link>
            <guid isPermaLink="false">https://medium.com/p/3d336577fb4a</guid>
            <category><![CDATA[audio-ml]]></category>
            <category><![CDATA[computational-linguistics]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[neural-embedding]]></category>
            <category><![CDATA[speech-processing]]></category>
            <dc:creator><![CDATA[Adarsh Kesharwani]]></dc:creator>
            <pubDate>Fri, 23 May 2025 19:33:29 GMT</pubDate>
            <atom:updated>2025-05-23T19:35:23.180Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction: Why Accents Matter in AI</h3><p>Have you ever noticed how Siri struggles with strong accents? Or how Zoom’s auto-captions mishear non-native speakers? Behind the scenes, <strong>accent embedding models</strong> are working to solve these problems. In this post, I’ll break down how I built a system that converts raw speech into numerical “accent fingerprints” and explain every component.</p><h3>1. Audio Preprocessing: The Quality Control Pipeline</h3><p>Before analyzing accents, we need clean, standardized audio. This preprocessing workflow ensures consistent input quality:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6TB4qZP3990-AdCz7z0JmA.png" /><figcaption>Audio Preprocessing Pipeline</figcaption></figure><h4><strong>1.1 Gatekeeper: Duration Validation</strong></h4><pre>if len(y) &lt; self.min_duration * sr:  # 0.5 second minimum<br>    raise ValueError(f&quot;Audio too short: {len(y)/sr:.2f}s&quot;)</pre><ul><li><em>Why?</em> Rejects clips too short to contain usable speech (isolated coughs, clicks)</li><li><em>Tradeoff:</em> 0.5s balances minimum phoneme length vs data loss</li></ul><h4><strong>1.2 Sample Rate Harmonization</strong></h4><pre>y = librosa.resample(y, orig_sr=sr, target_sr=16000, res_type=&#39;kaiser_fast&#39;)</pre><ul><li><strong>16 kHz standard</strong> preserves formants while reducing compute</li><li><strong>Kaiser window</strong> minimizes aliasing artifacts</li></ul><h4><strong>1.3 Silence Trimming (Energy-Based)</strong></h4><pre>y, _ = librosa.effects.trim(y, top_db=20, frame_length=2048, hop_length=512)</pre><ul><li><strong>20 dB threshold</strong> trims leading and trailing silence (frames more than 20 dB below the peak) without cutting consonants; pauses inside the utterance are kept</li><li><strong>2048-sample window</strong> (128 ms at 16 kHz) captures speech transitions</li></ul><h4><strong>1.4 High-Pass Filter Cascade</strong></h4><ul><li><strong>Primary:</strong> 20 Hz cutoff (-3 dB point) removes DC offset and low-frequency rumble</li><li><strong>Fallback:</strong> Pre-emphasis (0.97 coefficient) when filter 
fails</li></ul><h4><strong>1.5 Loudness Normalization</strong></h4><pre>def _normalize_lufs(self, y, sr, target_lufs=-23.0):<br>    meter = pyln.Meter(sr)<br>    loudness = meter.integrated_loudness(y)<br>    gain_db = target_lufs - loudness<br>    return y * (10 ** (gain_db / 20))</pre><ul><li><strong>EBU R128 standard</strong> (-23 LUFS) matches broadcast levels</li><li><strong>Anti-clipping:</strong> Caps at 0.99 peak amplitude</li></ul><h3>2. Feature Extraction: Decoding Accent Signatures</h3><p>With clean audio, we extract distinctive accent markers:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mC3WpaNuRuLhF28Haii9XA.png" /><figcaption>Feature Extraction Process</figcaption></figure><h4>2.1 Spectral Identity (Mel &amp; MFCC)</h4><pre>mel = librosa.feature.melspectrogram(<br>    y=y, sr=sr, n_mels=80,<br>    n_fft=1024, hop_length=256,<br>    fmin=80, fmax=7600<br>)<br>mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)</pre><ul><li><strong>Mel Bands:</strong> 80 channels mimic human cochlear resolution</li><li><strong>MFCCs:</strong> 13 coefficients capture vowel timbre</li><li><strong>Delta Features:</strong> Augment with temporal derivatives</li></ul><h4>2.2 Prosody Patterns</h4><pre>f0, voiced_flag, _ = librosa.pyin(<br>    y, fmin=80, fmax=400,<br>    frame_length=2048, hop_length=256, sr=sr  # hop matches Mel/RMS<br>)<br>energy = librosa.feature.rms(y=y, frame_length=2048, hop_length=256)</pre><ul><li><strong>Pitch Tracking:</strong> 80–400 Hz range covers most adult speech fundamentals</li><li><strong>Energy:</strong> RMS uses the same hop length (256) as the Mel spectrogram, so frames line up in time</li></ul><h4>2.3 Formant Tracking</h4><pre>snd = parselmouth.Sound(y, sr)<br>formants = snd.to_formant_burg(max_number_of_formants=3)<br>f1 = [formants.get_value_at_time(1, t) for t in formants.ts()]</pre><ul><li><strong>Burg Method:</strong> Superior for high-pitched voices</li><li><strong>F1-F3 Focus:</strong> Most accent-distinctive formants</li></ul><h4>2.4 Voice Quality Metrics</h4><pre>y_harm = librosa.effects.harmonic(y, margin=8)  # harmonic component only<br>hnr = 10 * np.log10(np.sum(y_harm**2) / (np.sum((y - y_harm)**2) + 1e-10))  # HNR proxy in dB<br>shimmer = np.mean(np.abs(np.diff(np.abs(y))))  # crude amplitude-variation proxy</pre><ul><li><strong>HNR:</strong> Distinguishes breathy vs modal voices</li><li><strong>Shimmer:</strong> Detects vocal instability</li></ul><h3>Why This Order Matters</h3><p>The pipeline works from the raw signal up to higher-level descriptors:</p><ol><li><strong>Time-Domain First</strong> (trimming, normalization)</li><li><strong>Frequency-Domain Next</strong> (spectral features)</li><li><strong>High-Level Features Last</strong> (prosody, formants)</li></ol><h3>3. Core Architecture: AccentEncoder</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tCrOZ_bvljEMpX4VOrpx5w.png" /><figcaption>Accent Encoder Architecture</figcaption></figure><h4>3.1 Input Processing Stage</h4><pre>def forward(self, mel, prosody, spectral):<br>    # Feature-specific encoding<br>    mel_feat = self.mel_encoder(mel)  # [B, 256, T]<br>    prosody_feat = self.prosody_encoder(prosody)<br>    <br>    # Spectral feature fusion<br>    mfcc_feat = F.gelu(self.mfcc_encoder(spectral[&#39;mfcc&#39;]))<br>    chroma_feat = F.gelu(self.chroma_encoder(spectral[&#39;chroma&#39;]))<br>    spectral_feat = torch.cat([mfcc_feat, chroma_feat, ...], dim=1)</pre><h4><strong>Key Design Choices:</strong></h4><p><strong>3.1.1 Heterogeneous Encoders</strong></p><ul><li>Mel: 4-layer residual CNN (captures timbral patterns)</li><li>Prosody: 2D CNN-LSTM hybrid (models pitch contours)</li><li>Spectral: Parallel 1D convolutions with late fusion</li></ul><p><strong>3.1.2. 
GELU Activation</strong></p><pre>self.activation = nn.GELU()</pre><ul><li>Smoother gradients for voice feature learning</li><li>1.8% better accuracy vs ReLU in ablation tests</li></ul><h4><strong>3.2 Critical Parameters:</strong></h4><pre>encoder_layer = nn.TransformerEncoderLayer(<br>    d_model=256,<br>    nhead=8,  # 256/8 = 32 dim per head<br>    dim_feedforward=1024,<br>    dropout=0.1,<br>    activation=&#39;gelu&#39;,<br>    batch_first=True<br>)</pre><ul><li><strong>Head Dimension:</strong> 32 preserves local attention patterns</li><li><strong>FFN Ratio:</strong> 4:1 (1024/256) balances capacity/compute</li></ul><h4>3.3 Embedding Projection</h4><pre>self.fusion = nn.Sequential(<br>    nn.Linear(768, 512),  # 256*3 features<br>    nn.LayerNorm(512),<br>    nn.GELU(),<br>    nn.Dropout(0.1),<br>    nn.Linear(512, 64)  # Final embedding<br>)</pre><p><strong>3.3.1 Bottleneck Design:</strong></p><ul><li>768 → 512 → 64 gradual compression</li><li>LayerNorm stabilizes training</li></ul><p><strong>3.3.2. Embedding Properties:</strong></p><ul><li>L2-normalized (‖e‖=1)</li><li>64D achieves 98% of 128D performance</li></ul><h3>4. 
Loss Functions: The Training Signal</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vwSOk4aSwQ5GBLQUgSiAvw.png" /><figcaption>Loss Functions</figcaption></figure><h4>4.1 Contrastive Loss (Temperature-Scaled)</h4><pre>def contrastive_loss(embeddings, labels, temp=0.07):<br>    # L2-normalize so the dot product is cosine similarity<br>    embeddings = F.normalize(embeddings, p=2, dim=1)  # Critical!<br>    <br>    # Similarity matrix<br>    sim = torch.mm(embeddings, embeddings.t()) / temp<br>    <br>    # Positive pairs (same accent)<br>    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)<br>    pos_mask.fill_diagonal_(False)  # Exclude self<br>    <br>    # Exclude self-similarity from the softmax denominator<br>    self_mask = torch.eye(len(labels), dtype=torch.bool, device=sim.device)<br>    sim = sim.masked_fill(self_mask, float(&#39;-inf&#39;))<br>    <br>    # Log-softmax over all other samples in the batch<br>    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)<br>    <br>    # Mean positive log-likelihood (assumes at least one positive pair in the batch)<br>    loss = -log_prob[pos_mask].mean()<br>    return loss</pre><p><strong>Key Mechanisms:</strong></p><ol><li><strong>Temperature (τ=0.07)</strong></li></ol><ul><li>Controls separation sharpness: a lower τ penalizes hard negatives more strongly</li><li>Auto-tuned version: τ = 0.05 + 0.1*sigmoid(embeddings.std())</li></ul><p><strong>2. 
Normalization</strong></p><ul><li>Prevents embedding magnitude from dominating the similarity</li><li>Enforces angular similarity</li></ul><h4>4.2 Triplet Loss (Adaptive Margin)</h4><pre>class TripletLoss(nn.Module):<br>    def __init__(self, margin=1.0):<br>        super().__init__()<br>        self.margin = margin<br>        self.softplus = nn.Softplus()<br>    def forward(self, anchor, positive, negative):<br>        # Cosine similarity: higher means closer<br>        pos_sim = F.cosine_similarity(anchor, positive)<br>        neg_sim = F.cosine_similarity(anchor, negative)<br>        # Penalize negatives that are not at least `margin` less similar than positives<br>        return self.softplus(neg_sim - pos_sim + self.margin).mean()</pre><p><strong>Innovations:</strong></p><ul><li><strong>Softplus</strong> instead of ReLU for smoother gradients</li><li><strong>Dynamic Margin:</strong></li></ul><pre>margin = 1.0 + 0.5*torch.sigmoid(self.margin_learner(anchor))</pre><h4>4.3 Reconstruction Loss</h4><pre>recon_loss = (<br>    0.5*F.mse_loss(mel_recon, mel_orig) +<br>    0.3*F.mse_loss(prosody_recon, prosody_orig) +<br>    0.2*F.mse_loss(spectral_recon, spectral_orig)<br>)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*KbyOLJlKd3nfTB-HR34DzA.png" /><figcaption>Weighting Strategy</figcaption></figure><h3>5. Training Orchestration: The Learning Process</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*V3-yszD6QQDWFFJ_wPdcPA.png" /><figcaption>Accent Encoder Training Pipeline</figcaption></figure><h4>5.1 Dynamic Loss Balancing: The Training Compass</h4><p>This triple-loss system acts like an orchestra conductor:</p><p><strong>5.1.1 Triplet Loss (Weight=1.0):<br></strong>The lead violin forcing accent groups apart. 
It uses hard negative mining:</p><pre># Pick the hardest negative (closest different-accent sample) within the batch<br>neg_dist = 1 - F.cosine_similarity(anchor.unsqueeze(1), embeddings, dim=2)<br>neg_dist[labels == anchor_label] = float(&#39;inf&#39;)  # Mask positives<br>hardest_neg = embeddings[neg_dist.argmin(dim=1)]</pre><ul><li><em>Why it works</em>: Training on the negatives that are easiest to confuse creates clear decision boundaries between accents.</li></ul><p><strong>5.1.2 Contrastive Loss (Weight=0.1):<br></strong>The percussion section maintaining rhythm. It uses temperature scaling:</p><pre>sim_matrix = torch.mm(embeddings, embeddings.t()) / 0.07 # τ=0.07</pre><ul><li><em>Pro Tip</em>: τ=0.07 worked best for 64D embeddings (validated through linear probing).</li></ul><p><strong>5.1.3 Reconstruction Loss (Weight=0.1):<br></strong>The bassline preserving signal integrity. It is weighted by feature importance:</p><pre>0.5*mel_loss + 0.3*prosody_loss + 0.2*spectral_loss</pre><h4>5.2 Gradient Flow Optimization: The Learning Engine</h4><p>The backward pass relies on three mechanisms:</p><p><strong>5.2.1 Gradient Clipping (1.0 norm):</strong></p><pre>torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2)</pre><p>Prevents exploding gradients in the transformer layers.</p><p><strong>5.2.2 Mixed Precision Training:</strong></p><pre>with torch.autocast(device_type=&#39;cuda&#39;):<br>    embeddings = model(batch)</pre><p>Achieves 1.8x speedup without accuracy loss.</p><p><strong>5.2.3 OneCycle LR Scheduling:</strong></p><pre>scheduler = torch.optim.lr_scheduler.OneCycleLR(<br>    optimizer,<br>    max_lr=1e-3,<br>    steps_per_epoch=len(train_loader),<br>    epochs=50,<br>    div_factor=10  # Initial LR = max_lr / div_factor = 1e-4<br>)</pre><p><em>Why it works</em>: The superconvergence phenomenon, where a large peak learning rate lets the optimizer move quickly through the loss landscape.</p><h4>5.3 Validation-Driven Checkpointing: The Quality Gate</h4><p>The evaluation protocol does more than monitor loss:</p><pre>if val_metrics[&#39;acc&#39;] &gt; best_acc:<br>    torch.save({<br>        &#39;epoch&#39;: epoch,<br>        
&#39;embedding_std&#39;: embeddings.std(dim=1).mean().item(),  # Critical!<br>        &#39;contrastive_align&#39;: (embeddings @ embeddings.T).mean().item()<br>    }, &#39;best_model.pt&#39;)</pre><p><strong>Key Validation Metrics</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yPjnsZTOWltvHw8kF4abAA.png" /></figure><h3>Conclusion: Key Takeaways &amp; Next Steps</h3><p>This end-to-end accent embedding system transforms raw audio into discriminative 64D representations through a hybrid CNN-Transformer architecture and multi-task learning. The modular pipeline handles all stages from robust audio preprocessing to feature fusion, producing embeddings that effectively cluster accent characteristics. While optimized for Hindi/Spanish accents in L2-ARCTIC, the architecture is designed for easy extension to new languages.</p><h4>Get Started with the Code</h4><p><a href="https://github.com/Adarshh9/Accent-Embedding-Model-From-Scratch">GitHub - Adarshh9/Accent-Embedding-Model-From-Scratch</a></p><h4>Dataset</h4><p><a href="https://www.kaggle.com/datasets/divyamagg/l2-arctic-data">L2 Arctic Data</a></p><h3>Where To Go Next?</h3><ol><li><strong>Extend to New Accents</strong><br>Try adding Mandarin or Arabic speakers from CommonVoice dataset</li><li><strong>Build a Real-Time Demo</strong></li><li><strong>Explore Applications</strong></li></ol><ul><li>Accent-aware ASR</li><li>Pronunciation coaching</li><li>Forensic voice analysis</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Physics-Informed Neural Networks (PINNs): Modeling Coffee Cooling]]></title>
            <link>https://medium.com/@adarshhme/physics-informed-neural-networks-pinns-modeling-coffee-cooling-ed7e9365b411?source=rss-e192e3794730------2</link>
            <guid isPermaLink="false">https://medium.com/p/ed7e9365b411</guid>
            <category><![CDATA[auto-differentiation]]></category>
            <category><![CDATA[smart-modeling]]></category>
            <category><![CDATA[ai-physics]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[physics]]></category>
            <dc:creator><![CDATA[Adarsh Kesharwani]]></dc:creator>
            <pubDate>Sat, 26 Apr 2025 09:00:31 GMT</pubDate>
            <atom:updated>2025-04-26T09:07:47.006Z</atom:updated>
            <content:encoded><![CDATA[<h3>1. Introduction to PINNs</h3><p>Physics-Informed Neural Networks (PINNs) are a powerful blend of deep learning and physics. They train neural networks to <strong>not just fit data</strong>, but also <strong>respect the underlying physical laws</strong> governing the system.</p><p>In this blog, we’ll explore how PINNs work by modeling a <strong>hot cup of coffee cooling down over time</strong> — a simple, real-world example that follows Newton’s Law of Cooling.</p><h3>2. Traditional Neural Networks vs. PINNs</h3><h3>Standard Neural Networks (Pure Data-Driven)</h3><p><strong>Input → Output Mapping</strong>: Learns patterns purely from data. <strong>No Physics Knowledge</strong>: Treats the problem as a black box.</p><p><strong>Limitations</strong>: Needs <strong>large amounts of data</strong>. May produce <strong>physically unrealistic predictions</strong> (e.g., negative temperatures).</p><h3>Physics-Informed Neural Networks (PINNs)</h3><p><strong>Combines Data + Physics</strong>: Uses known physical laws to guide learning. <strong>How?</strong> By <strong>embedding physics into the loss function</strong>.</p><p><strong>Advantages</strong>: Works with <strong>small/noisy data</strong>. Produces <strong>physically consistent predictions</strong>.</p><h3>3. How Physics is Introduced in PINNs</h3><p>The key idea is to <strong>constrain the neural network to obey known physics</strong> while still fitting observed data. This is done through <strong>three types of losses</strong>:</p><h3>I. Data Loss: Fitting Observed Measurements</h3><p><strong>Purpose</strong>: Ensures the neural network’s predictions match real-world measurements.</p><p>Formula:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*wh6pB5er1pHglgVaq1he3A.png" /><figcaption>Data Loss</figcaption></figure><p><strong>How It Works</strong>:</p><p>The network predicts temperature 𝑇(𝑡) at given time points 𝑡𝑖​. 
The loss penalizes deviations from actual measured temperatures 𝑇𝑜𝑏𝑠𝑒𝑟𝑣𝑒𝑑(𝑡𝑖). This is identical to standard supervised learning in neural networks.</p><p><strong>Example (Coffee Cooling)</strong>:</p><p>If we measure the coffee at 𝑡=2 min to be 80°C, the network should predict a value close to 80°C at that time.</p><h3>II. Physics Loss: Enforcing Governing Equations</h3><p><strong>Purpose: </strong>Pushes the neural network to obey the underlying physical law (e.g., Newton’s Law of Cooling).</p><p>Formula for coffee cooling (Newton’s Law):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*BN1IHTKVBKymN_2MvSwIiQ.png" /><figcaption>Physics Loss</figcaption></figure><p><strong>How It Works</strong>:</p><p>The neural network predicts 𝑇(𝑡).</p><p><strong>Automatic differentiation</strong> computes 𝑑𝑇/𝑑𝑡 (the exact derivative of the network output, with no finite-difference approximation). The term inside the loss (𝑑𝑇/𝑑𝑡+𝑘(𝑇−𝑇𝑒𝑛𝑣)) is the <strong>residual</strong> of Newton’s Law of Cooling. If the physics is perfectly satisfied, this residual should be <strong>zero</strong>. The loss penalizes deviations from zero (i.e., violations of physics).</p><p><strong>Why This Matters</strong>:</p><p>Unlike traditional curve-fitting, the network <strong>cannot</strong> just memorize data, because it is also penalized for violating the physics. Even with <strong>noisy or sparse data</strong>, the physics term keeps predictions realistic.</p><h3>III. Initial/Boundary Condition Loss: Ensuring Correct Starting Behavior</h3><p><strong>Purpose</strong>: Encourages the solution to satisfy known initial or boundary conditions.</p><p>Formula for coffee cooling:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/754/1*BkD4O7HqMP_45JGamyBg4g.png" /><figcaption>IC Loss</figcaption></figure><p><strong>How It Works</strong>:</p><p>The network must predict the correct initial temperature 𝑇0 at 𝑡=0. 
Without this, the differential equation alone admits a whole family of cooling curves, so the solution could start at a wrong value (e.g., 50°C instead of 90°C) and still satisfy the physics.</p><p><strong>Example (Coffee Cooling)</strong>:</p><p>At 𝑡=0, the coffee is freshly brewed at 90°C. The loss pulls 𝑇𝑝𝑟𝑒𝑑(0) toward 90°𝐶.</p><h3>4. Combining the Losses: Total Training Objective</h3><p>The <strong>total loss</strong> is a weighted sum:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/810/1*TgPqvxV5BY1diM0ggvW2RA.png" /><figcaption>Total Loss</figcaption></figure><p><strong>Default Weights</strong>:</p><p>𝜆𝑑𝑎𝑡𝑎 = 𝜆𝑝ℎ𝑦𝑠𝑖𝑐𝑠 = 𝜆𝐼𝐶 = 1 (can be adjusted if needed).</p><h3>5. Training Process</h3><ol><li>The neural network predicts 𝑇(𝑡).</li><li>The <strong>data loss</strong> pushes it to fit the measurements.</li><li>The <strong>physics loss</strong> pushes it to follow Newton’s Law.</li><li>The <strong>initial condition loss</strong> pushes it to start at 𝑇0=90°𝐶.</li><li>The optimizer (e.g., Adam) updates the network weights to minimize 𝐿𝑡𝑜𝑡𝑎𝑙.</li></ol><h3>6. Why This Approach Works</h3><ul><li><strong>Automatic Differentiation</strong>: Computes exact derivatives of 𝑇(𝑡) (no numerical approximations).</li><li><strong>Physics as a Soft Constraint</strong>: The network is nudged toward physically plausible solutions.</li><li><strong>Balanced Learning</strong>: The three losses work together.</li></ul><p><strong>Result</strong>: A neural network that can <strong>generalize better</strong> than pure data-driven methods, even with limited/noisy data.</p><h3><strong>PINNs aren’t magic: they’re just <em>smart</em> neural networks.</strong> Time to see them in action!</h3><h3>1. 
Generate Synthetic Data</h3><p>We simulate noisy temperature measurements from a cooling coffee cup.</p><pre>import torch<br>import torch.nn as nn<br>import numpy as np<br>import matplotlib.pyplot as plt<br><br># Physics parameters<br>T_env = 25.0  # Room temperature (°C)<br>T0 = 90.0     # Initial coffee temperature (°C)<br>k = 0.1       # Cooling rate constant (1/min)<br><br># True analytical solution<br>def true_solution(t):<br>    return T_env + (T0 - T_env)*np.exp(-k*t)<br><br># Generate some time points (every 2 minutes for 20 minutes)<br>t_min, t_max = 0.0, 20.0<br>t_data = np.array([0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])<br><br># Generate noisy temperature measurements<br>np.random.seed(0)<br>noise_level = 2.0  # Gaussian measurement noise, standard deviation 2°C<br>T_data_exact = true_solution(t_data)<br>T_data_noisy = T_data_exact + noise_level*np.random.randn(len(t_data))<br><br># Convert to PyTorch tensors<br>t_data_tensor = torch.tensor(t_data, dtype=torch.float32).view(-1, 1)<br>T_data_tensor = torch.tensor(T_data_noisy, dtype=torch.float32).view(-1, 1)</pre><h3>2. Define the Neural Network</h3><p>A simple neural network that takes time t and predicts temperature T.</p><pre>class CoffeePINN(nn.Module):<br>    def __init__(self):<br>        super().__init__()<br>        self.net = nn.Sequential(<br>            nn.Linear(1, 20),  # 1 input (time), 20 hidden neurons<br>            nn.Tanh(),<br>            nn.Linear(20, 20),<br>            nn.Tanh(),<br>            nn.Linear(20, 1)   # 1 output (temperature)<br>        )<br>        <br>    def forward(self, t):<br>        return self.net(t)<br><br>model = CoffeePINN()</pre><h3>3. 
Define the Physics Loss</h3><p>This penalizes violations of Newton’s Law of Cooling.</p><pre>def derivative(y, x):<br>    &quot;&quot;&quot;Computes dy/dx using autograd&quot;&quot;&quot;<br>    return torch.autograd.grad(y, x,<br>                             grad_outputs=torch.ones_like(y),<br>                             create_graph=True)[0]<br><br>def physics_loss(model, t):<br>    &quot;&quot;&quot;Physics loss: dT/dt = -k*(T - T_env)&quot;&quot;&quot;<br>    # Work on a copy so the caller&#39;s tensor is not modified in place<br>    t = t.clone().detach().requires_grad_(True)<br>    <br>    # Get prediction and derivative<br>    T = model(t)<br>    dT_dt = derivative(T, t)<br>    <br>    # Governing equation residual<br>    residual = dT_dt + k*(T - T_env)<br>    return torch.mean(residual**2)</pre><h3>4. Define Data &amp; Initial Condition Losses</h3><p><strong>Data Loss</strong>: Fits noisy measurements.</p><p><strong>IC Loss</strong>: Penalizes a wrong initial temperature.</p><pre>def data_loss(model, t_data, T_data):<br>    &quot;&quot;&quot;MSE between predictions and measurements&quot;&quot;&quot;<br>    T_pred = model(t_data)<br>    return torch.mean((T_pred - T_data)**2)<br><br>def initial_condition_loss(model):<br>    &quot;&quot;&quot;Initial condition: T(0) = T0&quot;&quot;&quot;<br>    t0 = torch.zeros(1, 1, dtype=torch.float32)<br>    T_pred = model(t0)<br>    return (T_pred - T0).pow(2).mean()</pre><h3>5. 
Train the PINN</h3><p>We optimize all three losses together.</p><pre>optimizer = torch.optim.Adam(model.parameters(), lr=0.01)<br><br># Loss weights (you can adjust these)<br>lambda_data = 1.0<br>lambda_ode = 1.0<br>lambda_ic = 1.0<br><br>num_epochs = 2000<br>print_every = 200<br><br>model.train()<br>for epoch in range(num_epochs):<br>    optimizer.zero_grad()<br>    <br>    # Compute all loss components<br>    l_data = data_loss(model, t_data_tensor, T_data_tensor)<br>    l_ode = physics_loss(model, t_data_tensor)<br>    l_ic = initial_condition_loss(model)<br>    <br>    # Combined loss<br>    loss = lambda_data*l_data + lambda_ode*l_ode + lambda_ic*l_ic<br>    <br>    # Backpropagation<br>    loss.backward()<br>    optimizer.step()<br>    <br>    # Print progress<br>    if (epoch + 1) % print_every == 0:<br>        print(f&quot;Epoch {epoch+1}/{num_epochs}, &quot;<br>              f&quot;Total Loss = {loss.item():.4f}, &quot;<br>              f&quot;Data Loss = {l_data.item():.4f}, &quot;<br>              f&quot;ODE Loss = {l_ode.item():.4f}, &quot;<br>              f&quot;IC Loss = {l_ic.item():.4f}&quot;)</pre><h3>6. 
Visualize the Results</h3><pre>model.eval()<br>t_plot = np.linspace(t_min, t_max, 100).reshape(-1, 1)<br>t_plot_tensor = torch.tensor(t_plot, dtype=torch.float32)<br>T_pred_plot = model(t_plot_tensor).detach().numpy()<br><br># True solution for comparison<br>T_true_plot = true_solution(t_plot)<br><br># Plot results<br>plt.figure(figsize=(10, 6))<br>plt.scatter(t_data, T_data_noisy, color=&#39;red&#39;, label=&#39;Noisy Measurements&#39;)<br>plt.plot(t_plot, T_true_plot, &#39;k--&#39;, label=&#39;Analytical Solution&#39;)<br>plt.plot(t_plot, T_pred_plot, &#39;b&#39;, label=&#39;PINN Prediction&#39;)<br>plt.xlabel(&#39;Time (minutes)&#39;)<br>plt.ylabel(&#39;Temperature (°C)&#39;)<br>plt.title(&#39;Coffee Cooling: PINN vs Physics&#39;)<br>plt.legend()<br>plt.grid(True)<br>plt.show()</pre><h3>Expected Output</h3><p>The PINN will:</p><ol><li>Fit the noisy data points</li><li>Obey Newton’s Law of Cooling</li><li>Start at the correct initial temperature</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/841/1*3KAKcJED1l8VlP8HFndSGA.png" /><figcaption>Plot Result</figcaption></figure><h3>Conclusion</h3><p>PINNs are a <strong>game-changer</strong> for scientific machine learning. By embedding physics into neural networks, they produce <strong>more reliable and interpretable models</strong> than pure data-driven approaches.</p><h3>Quick Experiments to Explore</h3><h4>1. Tune Loss Weights (λ)</h4><ul><li>Increase <strong>λ_physics</strong>: Forces stricter physics compliance (smoother curves)</li><li>Increase <strong>λ_data</strong>: Prioritizes fitting measurements (may overfit noise)</li><li>Try <strong>λ_IC = 0</strong>: See how wrong initial conditions affect results</li></ul><h4>2. Test Noisy Data</h4><ul><li>Add <strong>more noise</strong>: Does the physics term still keep predictions realistic?</li><li>Remove <strong>some data points</strong>: Can PINNs fill gaps using physics?</li></ul><h4>3. 
New Physics Scenarios</h4><ul><li><strong>Bouncing ball</strong>: Gravity + energy loss on impact</li><li><strong>Room heating</strong>: Thermostat-controlled temperature change</li><li><strong>Simple pendulum</strong>: Swing dynamics with friction</li></ul><h4>4. Change the Model</h4><ul><li>Fewer/more <strong>neural network layers</strong> → How does complexity affect results?</li><li>Swap <strong>Tanh for ReLU</strong> → Does activation choice matter?</li></ul><p><strong>Try tweaking one thing at a time and observe the changes!</strong> 🧪</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Train the SD3 Model Using DreamBooth LoRA]]></title>
            <link>https://medium.com/@adarshhme/train-the-sd3-model-using-dreambooth-lora-56fd5f230652?source=rss-e192e3794730------2</link>
            <guid isPermaLink="false">https://medium.com/p/56fd5f230652</guid>
            <category><![CDATA[dreambooth]]></category>
            <category><![CDATA[ai-generated-image]]></category>
            <category><![CDATA[stable-diffusion-3]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[lora]]></category>
            <dc:creator><![CDATA[Adarsh Kesharwani]]></dc:creator>
            <pubDate>Wed, 08 Jan 2025 12:50:03 GMT</pubDate>
            <atom:updated>2025-01-08T12:50:03.193Z</atom:updated>
            <content:encoded><![CDATA[<h4>A Step-by-Step Guide</h4><h3><strong><em>DreamBooth And LoRA: A Quick Overview</em></strong></h3><p>Think of <strong>DreamBooth</strong> as a way to teach an AI model something special, like your favorite art style, a unique character, or even a specific object. You only need a few images to fine-tune the model, and it starts generating outputs tailored to what you taught it. This makes it perfect for adding a personal touch to AI-generated content.</p><p>Now, let’s talk about <strong>LoRA (Low-Rank Adaptation)</strong>. Fully fine-tuning a large AI model can be time-consuming and resource-heavy. LoRA avoids this by freezing the original weights and training small low-rank matrices that are added to selected layers, instead of updating every parameter. It’s like adjusting the screws on a machine rather than rebuilding the entire thing.</p><p>Combining DreamBooth and LoRA gives you a powerful and efficient way to personalize huge models like Stable Diffusion (SD). It’s fast, flexible, and doesn’t require a supercomputer to get amazing results.</p><h3><strong><em>Why Does SD3 Need Different Scripts Than SD1/SD2?</em></strong></h3><p>Stable Diffusion 3 (SD3) isn’t just an upgrade: it changes both the architecture and the training objective compared to SD1 and SD2. These changes make SD3 more powerful but also mean it requires specially tailored scripts for DreamBooth training. Let’s break down why.</p><h4><strong><em>1. Key Architectural Differences</em></strong></h4><p><strong>SD1/SD2 Architecture</strong>:<br>Think of this as a simple pipeline. 
It uses a CLIP text encoder and a single U-Net for generating images.</p><pre># SD1/2 basic flow<br>text_encoder -&gt; CLIP embedding<br>image -&gt; VAE encoding -&gt; U-Net -&gt; VAE decoding</pre><p><strong>SD3 Architecture</strong>:<br>SD3 replaces the U-Net with a <strong>Multimodal Diffusion Transformer (MMDiT)</strong>, conditions on <strong>three text encoders</strong> (two CLIP models and T5-XXL), and uses a <strong>16-channel VAE</strong>.</p><pre># SD3 flow<br>text_encoders -&gt; CLIP-L + CLIP-G + T5-XXL embeddings<br>image -&gt; 16-channel VAE encoding -&gt; MMDiT -&gt; VAE decoding</pre><h4><strong><em>2. Loss Functions</em></strong></h4><p><strong>SD1/SD2 Loss</strong>:<br>A straightforward loss function focused on noise prediction and preserving prior knowledge.</p><pre>loss = MSE(predicted_noise, random_noise) + prior_preservation_loss</pre><p><strong>SD3 Loss</strong>:<br>SD3 is trained with a <strong>rectified-flow (flow-matching)</strong> objective: instead of predicting the added noise, the model predicts the velocity between the clean latents and the noise. Prior preservation can still be added on top.</p><pre>loss = MSE(predicted_velocity, noise - latents) +<br>       prior_preservation_loss  # optional, as in SD1/SD2</pre><h4><strong><em>3. Learning Rate Handling</em></strong></h4><p><strong>SD1/SD2</strong>: <br>Full DreamBooth fine-tuning typically uses a very small learning rate, optionally with a scheduler such as cosine annealing.</p><pre>lr_scheduler = CosineAnnealingLR(initial_lr=1e-6, min_lr=1e-7)</pre><p><strong>SD3</strong>: <br>The LoRA script used in this guide trains only the adapter weights, so it can use a much higher learning rate (1e-4) with 8-bit Adam and a constant schedule.</p><pre>optimizer = AdamW8bit(lr=1e-4)<br>lr_scheduler = constant</pre><h4><strong><em>4. 
Memory Management -</em></strong></h4><p><strong>SD1/SD2</strong>: <br>Basic gradient checkpointing to save memory.</p><pre>gradient_checkpointing = True</pre><p><strong>SD3</strong>: SD3 is considerably larger, so the Colab script used in this guide combines several savings: gradient checkpointing, 8-bit Adam, fp16 mixed precision, and embeddings that are computed once before training.</p><pre>gradient_checkpointing = True  <br>use_8bit_adam = True  <br>mixed_precision = &quot;fp16&quot;  <br>precomputed_embeddings = True  # produced by compute_embeddings.py</pre><h3><strong><em>Let’s Get to the Code!</em></strong></h3><ol><li>Install Necessary Libraries</li></ol><pre>!pip install -q -U git+https://github.com/huggingface/diffusers<br>!pip install -q -U \<br>    transformers \<br>    accelerate \<br>    bitsandbytes \<br>    peft</pre><p>2. Authenticate with Hugging Face</p><pre>!huggingface-cli login</pre><p>3. Clone the Diffusers Repository</p><pre>!git clone https://github.com/huggingface/diffusers<br>%cd diffusers/examples/research_projects/sd3_lora_colab</pre><p>4. Upload Training Images</p><ul><li>Place your training images in a folder (e.g., XYZ_folder) within the sd3_lora_colab directory.</li><li>Ensure your images have a consistent format (e.g., .png).</li></ul><p>5. Configure the Script</p><p>Open the compute_embeddings.py file in the same directory and make the following changes:</p><ul><li><strong>Line 28:</strong> Update the PROMPT to describe the concept you&#39;re training (e.g., &quot;employees in a modern office&quot;).</li><li><strong>Line 30:</strong> Set LOCAL_DATA_DIR to the folder containing your images (e.g., &quot;XYZ_folder&quot;).</li><li><strong>Line 79:</strong> Adjust the image file extension if needed (e.g., .png).<br>Save the changes.</li></ul><p>6. Compute Embeddings<br>The script encodes the prompt ahead of training and writes the result to sample_embeddings.parquet, which the training command reads through --data_df_path.</p><pre>!python compute_embeddings.py</pre><p>7. Clear GPU Memory</p><pre>import torch<br>import gc<br><br>def flush():<br>    # Drop unreachable Python objects first, then release cached GPU blocks<br>    gc.collect()<br>    torch.cuda.empty_cache()<br><br>flush()</pre><p>8. 
Train the Model<br>Feel free to play with hyperparameters!</p><pre>!accelerate launch train_dreambooth_lora_sd3_miniature.py \<br>  --pretrained_model_name_or_path=&quot;stabilityai/stable-diffusion-3-medium-diffusers&quot;  \<br>  --instance_data_dir=&quot;XYZ_folder&quot; \<br>  --data_df_path=&quot;sample_embeddings.parquet&quot; \<br>  --output_dir=&quot;trained-sd3-lora-miniature&quot; \<br>  --mixed_precision=&quot;fp16&quot; \<br>  --instance_prompt=&quot;employees in modern office&quot; \<br>  --resolution=1024 \<br>  --train_batch_size=1 \<br>  --gradient_accumulation_steps=4 --gradient_checkpointing \<br>  --use_8bit_adam \<br>  --learning_rate=1e-4 \<br>  --lr_scheduler=&quot;constant&quot; \<br>  --lr_warmup_steps=0 \<br>  --max_train_steps=500 \<br>  --seed=&quot;0&quot;</pre><p>The LoRA weights will be saved in the trained-sd3-lora-miniature directory.</p><p>9. Perform Inference</p><pre>from diffusers import DiffusionPipeline<br>import torch<br><br>pipeline = DiffusionPipeline.from_pretrained(<br>    &quot;stabilityai/stable-diffusion-3-medium-diffusers&quot;,<br>    torch_dtype=torch.float16<br>)<br>lora_output_path = &quot;trained-sd3-lora-miniature&quot;<br>pipeline.load_lora_weights(lora_output_path)<br><br>pipeline.enable_sequential_cpu_offload()<br><br>image = pipeline(&quot;employees in modern office using their laptops&quot;).images[0]<br>image.save(&quot;output.png&quot;)</pre><p>10. Save and Reuse Weights<br>You can download the trained-sd3-lora-miniature folder to store the LoRA weights and reuse them later.</p><h3>Wrapping Up</h3><p>Training Stable Diffusion 3 with LoRA makes creating personalized, high-quality images easier and more efficient. With these steps, you can fine-tune the model, save your LoRA weights, and reuse them whenever you need. 
Now it’s your chance to get creative and see what amazing results you can achieve!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Distilling Knowledge: Making Large Models Smaller And Smarter]]></title>
            <link>https://medium.com/@adarshhme/distilling-knowledge-b8fe2e972eb5?source=rss-e192e3794730------2</link>
            <guid isPermaLink="false">https://medium.com/p/b8fe2e972eb5</guid>
            <category><![CDATA[large-langauge-model]]></category>
            <category><![CDATA[ai-model-compression]]></category>
            <category><![CDATA[knowledge-distillation]]></category>
            <category><![CDATA[model-optimization]]></category>
            <dc:creator><![CDATA[Adarsh Kesharwani]]></dc:creator>
            <pubDate>Thu, 28 Nov 2024 20:00:15 GMT</pubDate>
            <atom:updated>2025-01-08T12:57:27.783Z</atom:updated>
            <content:encoded><![CDATA[<p>Large Language Models (LLMs), like GPT-4, have revolutionized AI, unlocking new possibilities, but they come with significant challenges. These models demand immense computational power and storage, making them costly and impractical for standard devices. Their complexity introduces latency, causing frustrating delays in real-time responses, and their overparameterization leads to inefficiencies, with many parameters adding little value.<br>Accessibility is another concern, as only resource-rich organizations can afford to deploy them. Moreover, their high energy consumption raises serious environmental issues. While LLMs are undeniably powerful, addressing these challenges is essential for their broader adoption and sustainable use.</p><p><strong><em>Why Do We Need Knowledge Distillation?</em></strong></p><p>Knowledge Distillation (KD) offers a game-changing solution by transferring the knowledge of large models into smaller, more efficient ones. These compact models retain most of the original’s performance but are faster, lighter, and less resource-intensive.<br>With KD, AI becomes more accessible and sustainable, enabling us to leverage the power of LLMs without their limitations. It’s a step toward making advanced AI both practical and scalable, paving the way for smarter, greener innovations.</p><p><strong><em>What is Knowledge Distillation?</em></strong></p><p>Knowledge Distillation (KD) is a machine learning technique where a large, powerful teacher model trains a smaller, more efficient student model by passing on its knowledge. Unlike traditional training that relies only on true labels, KD uses the teacher’s outputs — soft probabilities or logits — to guide the student.<br>This approach helps the student model learn not just the correct answers but also the nuanced relationships between classes, enabling it to mimic the teacher’s behavior effectively. The result? 
A compact model that’s faster and lighter, yet still retains the teacher’s expertise.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/975/1*YPO7k7bht5pGKlBJx-wRmw.jpeg" /><figcaption>Training student model using KD</figcaption></figure><p><strong><em>How Does Knowledge Distillation Work?</em></strong></p><p>Knowledge Distillation transfers knowledge from a large <strong>teacher model</strong> to a smaller <strong>student model</strong>. The key idea is to train the student model on the true labels of the dataset and the <strong>soft predictions (logits)</strong> generated by the teacher model. These logits carry rich information about the relationships between different classes that aren’t captured by hard labels.</p><p><strong><em>The Process in Simple Terms:</em></strong></p><ol><li><strong>Teacher Model Training:</strong><br> The teacher model is first trained on the dataset using traditional methods, achieving high accuracy by learning complex patterns and relationships in the data.</li><li><strong>Generating Soft Labels:</strong><br> During inference, the teacher generates <strong>soft labels</strong> — probabilities for each class instead of a single “hard” label. For example, instead of predicting “cat” with 100% certainty, the teacher might output a distribution like cat: 0.7, dog: 0.2, rabbit: 0.1, reflecting its nuanced understanding.</li><li><strong>Student Model Training:</strong><br> The student model is then trained using two losses:</li></ol><ul><li><strong>Cross-Entropy Loss (CE)</strong>: This loss compares the student’s predicted probabilities to the true class labels, ensuring the student learns to classify correctly based on the actual data.</li><li><strong>Kullback-Leibler Divergence Loss (Distillation Loss)</strong>: This loss measures the difference between the teacher’s softened logits (probabilities) and the student’s predictions. 
Dividing the logits by a <strong>temperature</strong> parameter before the softmax makes both distributions softer, so the student can pick up the teacher’s relative preferences among the incorrect classes as well as its top prediction.</li></ul><p><strong><em>The Role of Temperature:</em></strong></p><p>The <strong>temperature parameter (T)</strong> divides the logits before the softmax is applied. Values of T above 1 flatten the resulting distribution, raising the smaller probabilities and exposing subtle relationships between classes. For example, raising T from 1 to 2 turns class probabilities of [0.7, 0.2, 0.1] into roughly [0.52, 0.28, 0.20], making it easier for the student to grasp these relationships.<br>This approach was popularized by the paper <a href="https://arxiv.org/abs/1503.02531"><em>Distilling the Knowledge in a Neural Network</em></a>, which showed that a smaller, simpler model can recover much of the performance of a larger model or ensemble.</p><p><strong><em>Intuition:</em></strong></p><p>Imagine the teacher as a skilled mentor who doesn’t just provide correct answers but also explains why other options are less likely. This nuanced guidance helps the student develop a deeper understanding, enabling them to perform well without requiring the same level of complexity as the teacher.<br>By blending direct supervision (true labels) with this informed guidance (teacher logits), the student learns to generalize effectively, achieving performance close to the teacher’s while being faster, smaller, and more efficient.</p><p><strong><em>Phew, enough theory! 
Time to roll up our sleeves and dive into the code 💻</em></strong></p><p><strong><em>Install Dependencies</em></strong><em><br></em> Start by installing torch, torchvision for image datasets, models, and transformations.</p><pre>!pip install -q torch torchvision</pre><p><strong><em>Import Libraries &amp; Setup Device</em></strong><em><br></em> Import the necessary libraries and set up the device (CPU/GPU) for training.</p><pre>import torch<br>import torch.nn as nn<br>import torch.optim as optim<br>import torchvision.transforms as transforms<br>import torchvision.datasets as datasets<br><br>device = torch.device(&quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;)</pre><p><strong><em>Load CIFAR-10 Dataset</em></strong><em><br></em> We load the CIFAR-10 dataset with image transformations (e.g., normalization) for preprocessing and split it into training and testing datasets.</p><pre>transform = transforms.Compose([<br>    transforms.ToTensor(),<br>    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])<br>])<br><br>train_dataset = datasets.CIFAR10(root=&#39;./data&#39;, train=True, download=True, transform=transform)<br>test_dataset = datasets.CIFAR10(root=&#39;./data&#39;, train=False, download=True, transform=transform)<br><br>train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)<br>test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=2)</pre><p><strong><em>Define Teacher and Student Networks</em></strong><em><br></em> The teacher (DeepNN) is a larger, more complex network, while the student (LightNN) is lightweight, making it suitable for resource-constrained scenarios.</p><pre># Teacher Model (DeepNN)<br>class DeepNN(nn.Module):<br>    def __init__(self, num_classes=10):<br>        super(DeepNN, self).__init__()<br>        self.features = nn.Sequential(<br>            nn.Conv2d(3, 128, kernel_size=3, padding=1),<br>            
nn.ReLU(),<br>            nn.Conv2d(128, 64, kernel_size=3, padding=1),<br>            nn.ReLU(),<br>            nn.MaxPool2d(kernel_size=2, stride=2),<br>            nn.Conv2d(64, 64, kernel_size=3, padding=1),<br>            nn.ReLU(),<br>            nn.Conv2d(64, 32, kernel_size=3, padding=1),<br>            nn.ReLU(),<br>            nn.MaxPool2d(kernel_size=2, stride=2)<br>        )<br>        self.classifier = nn.Sequential(<br>            nn.Linear(2048, 512),  # 32 channels x 8 x 8 after two 2x2 poolings<br>            nn.ReLU(),<br>            nn.Dropout(0.1),<br>            nn.Linear(512, num_classes)<br>        )<br><br>    def forward(self, x):<br>        x = self.features(x)<br>        x = torch.flatten(x, 1)<br>        x = self.classifier(x)<br>        return x<br><br># Student Model (LightNN)<br>class LightNN(nn.Module):<br>    def __init__(self, num_classes=10):<br>        super(LightNN, self).__init__()<br>        self.features = nn.Sequential(<br>            nn.Conv2d(3, 16, kernel_size=3, padding=1),<br>            nn.ReLU(),<br>            nn.MaxPool2d(kernel_size=2, stride=2),<br>            nn.Conv2d(16, 16, kernel_size=3, padding=1),<br>            nn.ReLU(),<br>            nn.MaxPool2d(kernel_size=2, stride=2)<br>        )<br>        self.classifier = nn.Sequential(<br>            nn.Linear(1024, 256),  # 16 channels x 8 x 8 after two 2x2 poolings<br>            nn.ReLU(),<br>            nn.Dropout(0.1),<br>            nn.Linear(256, num_classes)<br>        )<br><br>    def forward(self, x):<br>        x = self.features(x)<br>        x = torch.flatten(x, 1)<br>        x = self.classifier(x)<br>        return x</pre><p><strong><em>Define Training and Testing Functions</em></strong><em><br></em> This section defines the functions that train and test both the <strong>Teacher</strong> (DeepNN) and <strong>Student</strong> (LightNN) models using <strong>only Cross-Entropy (CE) loss</strong>. 
The <strong>Teacher model</strong> is trained first to minimize the CE loss, and then the <strong>Student model</strong> is trained with the same loss, serving as a baseline. In real-world scenarios, the <strong>Teacher model</strong> is typically pre-trained, and <strong>Knowledge Distillation (KD)</strong> is used to transfer its knowledge to a <strong>lighter version</strong> of the <strong>Teacher model</strong>, creating a more efficient <strong>Student model</strong> that is ideal for resource-constrained environments.</p><pre># Training Function<br>def train(model, train_loader, epochs, learning_rate, device):<br>    criterion = nn.CrossEntropyLoss()<br>    optimizer = optim.Adam(model.parameters(), lr=learning_rate)<br><br>    model.train()<br>    for epoch in range(epochs):<br>        running_loss = 0.0<br>        for inputs, labels in train_loader:<br>            inputs, labels = inputs.to(device), labels.to(device)<br><br>            optimizer.zero_grad()<br>            outputs = model(inputs)<br>            loss = criterion(outputs, labels)<br>            loss.backward()<br>            optimizer.step()<br>            running_loss += loss.item()<br><br>        print(f&quot;Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}&quot;)<br><br># Testing Function<br>def test(model, test_loader, device):<br>    model.to(device)<br>    model.eval()<br><br>    correct, total = 0, 0<br>    with torch.no_grad():<br>        for inputs, labels in test_loader:<br>            inputs, labels = inputs.to(device), labels.to(device)<br>            outputs = model(inputs)<br>            _, predicted = torch.max(outputs.data, 1)<br>            total += labels.size(0)<br>            correct += (predicted == labels).sum().item()<br>    <br>    accuracy = 100 * correct / total<br>    print(f&quot;Test Accuracy: {accuracy:.2f}%&quot;)<br>    return accuracy</pre><p><strong><em>Train Teacher and Student Models</em></strong><em><br></em> The teacher (DeepNN) is 
trained first, followed by the student (LightNN) using only cross-entropy loss for comparison.</p><pre>torch.manual_seed(42)<br>teacher = DeepNN(num_classes=10).to(device)<br>train(teacher, train_loader, epochs=10, learning_rate=0.001, device=device)<br>test_accuracy_teacher = test(teacher, test_loader, device)<br><br>student = LightNN(num_classes=10).to(device)<br>train(student, train_loader, epochs=10, learning_rate=0.001, device=device)<br>test_accuracy_student = test(student, test_loader, device)</pre><p><strong><em>Train Student Model with Knowledge Distillation</em></strong><em><br></em> The student model is trained with <strong>KD</strong>, using two losses: <strong>Kullback-Leibler divergence loss</strong> and <strong>Cross-Entropy loss</strong>. The teacher’s <strong>logits</strong> are passed through a temperature-scaled softmax to create <strong>soft targets</strong>, which the student’s own temperature-scaled predictions learn to match. The KL term is multiplied by T², because softening the distributions shrinks its gradients by roughly 1/T². Simultaneously, the student’s predictions are compared to the <strong>true labels</strong> using CE loss. 
Both losses are combined to update the student model, allowing it to benefit from the teacher’s knowledge while still learning from the actual data, helping it improve performance with fewer parameters.</p><pre>def train_kd(teacher, student, train_loader, epochs, learning_rate, T, soft_target_loss_weight, ce_loss_weight, device):<br>    ce_loss = nn.CrossEntropyLoss()<br>    optimizer = optim.Adam(student.parameters(), lr=learning_rate)<br><br>    teacher.eval()<br>    student.train()<br><br>    for epoch in range(epochs):<br>        running_loss = 0.0<br>        for inputs, labels in train_loader:<br>            inputs, labels = inputs.to(device), labels.to(device)<br><br>            optimizer.zero_grad()<br><br>            # Get teacher logits (no gradients needed for the teacher)<br>            with torch.no_grad():<br>                teacher_logits = teacher(inputs)<br><br>            # Get student predictions<br>            student_logits = student(inputs)<br><br>            # Distillation loss: KL(teacher || student), averaged over the batch and scaled by T^2<br>            soft_targets = nn.functional.softmax(teacher_logits / T, dim=1)<br>            soft_prob = nn.functional.log_softmax(student_logits / T, dim=1)<br>            soft_targets_loss = torch.sum(soft_targets * (soft_targets.log() - soft_prob)) / soft_prob.size()[0] * (T**2)<br><br>            # Compute cross-entropy loss<br>            label_loss = ce_loss(student_logits, labels)<br><br>            # Combine losses<br>            loss = soft_target_loss_weight * soft_targets_loss + ce_loss_weight * label_loss<br><br>            loss.backward()<br>            optimizer.step()<br>            running_loss += loss.item()<br><br>        print(f&quot;Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}&quot;)</pre><p><strong><em>Apply KD and Compare Results</em></strong><em><br></em> Finally, the student is trained with KD, and its accuracy is compared against the CE-only baseline.</p><pre># Note: `student` was already trained with CE above, so KD continues from those weights.<br># Re-create it with LightNN(num_classes=10).to(device) to distill from scratch instead.<br>train_kd(teacher=teacher, student=student, 
train_loader=train_loader, epochs=10, learning_rate=0.001, T=2, soft_target_loss_weight=0.25, ce_loss_weight=0.75, device=device)<br>test_accuracy_student_kd = test(student, test_loader, device)<br><br>print(f&quot;Teacher accuracy: {test_accuracy_teacher:.2f}%&quot;)<br>print(f&quot;Student accuracy without KD: {test_accuracy_student:.2f}%&quot;)<br>print(f&quot;Student accuracy with KD: {test_accuracy_student_kd:.2f}%&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/822/1*-0kEJaHlkPggQEmpWW0DSw.png" /></figure><p>KD improves accuracy only slightly on CIFAR-10 here (70.63% vs. 70.22%), which is plausibly due to the relatively simple dataset and the small models involved; a difference this small may also fall within run-to-run variation. When the student already performs well on its own, the gains from KD are often marginal. On more complex datasets (e.g., ImageNet) or with larger, deeper models, KD can provide larger improvements, because the teacher’s knowledge helps the student learn more complex features, resulting in better generalization.</p><p>In real-world scenarios, using only KD loss without cross-entropy is generally not ideal. While KD loss helps the student learn from the teacher’s logits, cross-entropy loss ensures the student also learns from the true labels. Combining both lets the student benefit from the teacher’s guidance while staying anchored to the actual data. Training with only KD loss is possible, particularly when no ground truth is available (as in distillation on unlabeled data), provided the teacher is strong enough to supply meaningful soft labels. 
However, a student that relies solely on the teacher’s predictions also inherits the teacher’s errors, which matters most when the data is noisy. In most cases, particularly for general-purpose tasks like classification, combining KD loss with cross-entropy loss is the more effective choice.</p>]]></content:encoded>
        </item>
    </channel>
</rss>