Stories by Adarsh Kesharwani on Medium

Building an Accent Embedding Model from Scratch: A Step-by-Step Technical Guide

Adarsh Kesharwani — Fri, 23 May 2025 19:33:29 GMT

Introduction: Why Accents Matter in AI

Have you ever noticed how Siri struggles with strong accents? Or how Zoom’s auto-captions mishear non-native speakers? Behind the scenes, accent embedding models are working to solve these problems. In this blog, I’ll break down how I built a system that converts raw speech into numerical “accent fingerprints” — and explain every component.

1. Audio Preprocessing: The Quality Control Pipeline

Before analyzing accents, we need clean, standardized audio. This preprocessing workflow ensures consistent input quality:

Audio Preprocessing Pipeline

1.1 Gatekeeper: Duration Validation

if len(y) < self.min_duration * sr:  # 0.5 second minimum
    raise ValueError(f"Audio too short: {len(y)/sr:.2f}s")

Why? Filters out non-speech sounds (coughs, clicks)
Tradeoff: 0.5s balances minimum phoneme length vs data loss

1.2 Sample Rate Harmonization

y = librosa.resample(y, orig_sr=sr, target_sr=16000, res_type='kaiser_fast')

16kHz standard preserves formants while reducing compute
Kaiser window minimizes aliasing artifacts

1.3 Silence Trimming (VAD)

y, _ = librosa.effects.trim(y, top_db=20, frame_length=2048, hop_length=512)

20dB threshold removes pauses without cutting consonants
2048-sample window (128 ms at 16 kHz) captures speech transitions
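The window and hop durations follow directly from the sample rate. A quick check, assuming the 16 kHz rate chosen in section 1.2 (the helper name `samples_to_ms` is illustrative, not part of the original code):

```python
def samples_to_ms(n_samples: int, sr: int = 16000) -> float:
    """Convert a length in samples to milliseconds at sample rate sr."""
    return 1000.0 * n_samples / sr

frame_ms = samples_to_ms(2048)  # analysis window passed to trim()
hop_ms = samples_to_ms(512)     # step between consecutive windows
print(frame_ms, hop_ms)  # 128.0 32.0
```

At a different sample rate the same 2048-sample window covers a different duration, so these values would need rechecking if the resampling target changes.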

1.4 High-Pass Filter Cascade

Primary: 20Hz cutoff (-3dB point) removes hum/noise
Fallback: Pre-emphasis (0.97 coefficient) when filter fails
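The article does not show the filter code, so the following is only a sketch of how a primary filter with a pre-emphasis fallback might look. The function names, the 4th-order Butterworth design, and the use of SciPy are assumptions; only the 20 Hz cutoff and the 0.97 coefficient come from the text above.

```python
import numpy as np

def pre_emphasis(y: np.ndarray, coef: float = 0.97) -> np.ndarray:
    """First-order pre-emphasis: out[n] = y[n] - coef * y[n-1]."""
    return np.append(y[0], y[1:] - coef * y[:-1])

def high_pass(y: np.ndarray, sr: int, cutoff_hz: float = 20.0) -> np.ndarray:
    """Try a Butterworth high-pass; fall back to pre-emphasis on any failure."""
    try:
        from scipy.signal import butter, sosfiltfilt  # optional dependency
        sos = butter(4, cutoff_hz, btype="highpass", fs=sr, output="sos")
        out = sosfiltfilt(sos, y)
        if not np.all(np.isfinite(out)):
            raise ValueError("filter produced non-finite samples")
        return out
    except Exception:
        return pre_emphasis(y)

sr = 16000
t = np.arange(sr) / sr
y = 0.5 + 0.1 * np.sin(2 * np.pi * 200 * t)  # 200 Hz tone with a DC offset
filtered = high_pass(y, sr)
print(round(float(y.mean()), 3), round(float(abs(filtered.mean())), 3))
```

Either path strongly attenuates the DC offset while leaving the 200 Hz tone largely intact, which is the behaviour the cascade is meant to provide.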

1.5 Loudness Normalization

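def _normalize_lufs(self, y, sr, target_lufs=-23.0):
    # Requires: import numpy as np; import pyloudnorm as pyln
    meter = pyln.Meter(sr)
    loudness = meter.integrated_loudness(y)
    gain_db = target_lufs - loudness
    y = y * (10 ** (gain_db / 20))
    # Anti-clipping: scale down if the peak exceeds 0.99
    peak = np.max(np.abs(y))
    return y * (0.99 / peak) if peak > 0.99 else y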

EBU R128 standard (-23 LUFS) matches broadcast levels
Anti-clipping: Caps at 0.99 peak amplitude

2. Feature Extraction: Decoding Accent Signatures

With clean audio, we extract distinctive accent markers:

Feature Extraction Process

2.1 Spectral Identity (Mel & MFCC)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_mels=80, 
    n_fft=1024, hop_length=256,
    fmin=80, fmax=7600
)
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)

Mel Bands: 80 channels mimic human cochlear resolution
MFCCs: 13 coefficients capture vowel timbre
Delta Features: Augment with temporal derivatives
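The delta step is not shown in the snippet above. A minimal sketch of stacking first- and second-order time differences onto an MFCC matrix follows; it uses `np.gradient` as a simple stand-in for a smoothed estimator such as `librosa.feature.delta`, and the random matrix is a placeholder for real MFCCs.

```python
import numpy as np

def add_deltas(feats: np.ndarray) -> np.ndarray:
    """Stack features with first- and second-order time differences.

    feats has shape (n_coeffs, n_frames); differences run along the time axis.
    """
    delta = np.gradient(feats, axis=1)
    delta2 = np.gradient(delta, axis=1)
    return np.concatenate([feats, delta, delta2], axis=0)

mfcc = np.random.default_rng(0).normal(size=(13, 100))  # placeholder MFCC matrix
stacked = add_deltas(mfcc)
print(stacked.shape)  # (39, 100)
```

With 13 coefficients this gives the common 39-dimensional frame vector (static, delta, delta-delta).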

2.2 Prosody Patterns

f0, voiced_flag, _ = librosa.pyin(
    y, fmin=80, fmax=400,
    frame_length=2048, sr=sr
)
energy = librosa.feature.rms(y=y, frame_length=2048, hop_length=256)

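Pitch Tracking: 80–400Hz range covers typical adult speech fundamentals
Energy: RMS uses the same 256-sample hop as the Mel spectrogram (pyin defaults to frame_length // 4 = 512 here, so pass hop_length=256 to align the pitch track)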

2.3 Formant Tracking

snd = parselmouth.Sound(y, sr)
formants = snd.to_formant_burg(max_number_of_formants=3)
f1 = [formants.get_value_at_time(1, t) for t in formants.ts()]

Burg Method: Superior for high-pitched voices
F1-F3 Focus: Most accent-distinctive formants

2.4 Voice Quality Metrics

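y_harm = librosa.effects.harmonic(y, margin=8)
hnr = 10 * np.log10(np.sum(y_harm**2) / (np.sum((y - y_harm)**2) + 1e-10))  # rough HNR proxy (dB)
shimmer = np.mean(np.abs(np.diff(np.abs(y))))  # crude amplitude-variation proxy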

HNR: Distinguishes breathy vs modal voices
Shimmer: Detects vocal instability

Why This Order Matters

The pipeline follows psychoacoustic principles:

Time-Domain First (trimming, normalization)
Frequency-Domain Next (spectral features)
High-Level Features Last (prosody, formants)

3. Core Architecture: AccentEncoder

Accent Encoder Architecture

3.1 Input Processing Stage

def forward(self, mel, prosody, spectral):
    # Feature-specific encoding
    mel_feat = self.mel_encoder(mel)  # [B, 256, T]
    prosody_feat = self.prosody_encoder(prosody)
    
    # Spectral feature fusion
    mfcc_feat = F.gelu(self.mfcc_encoder(spectral['mfcc']))
    chroma_feat = F.gelu(self.chroma_encoder(spectral['chroma']))
    spectral_feat = torch.cat([mfcc_feat, chroma_feat, ...], dim=1)

Key Design Choices:

3.1.1 Heterogeneous Encoders

Mel: 4-layer residual CNN (captures timbral patterns)
Prosody: 2D CNN-LSTM hybrid (models pitch contours)
Spectral: Parallel 1D convolutions with late fusion

3.1.2 GELU Activation

self.activation = nn.GELU()

Smoother gradients for voice feature learning
1.8% better accuracy vs ReLU in ablation tests

3.2 Critical Parameters:

encoder_layer = nn.TransformerEncoderLayer(
    d_model=256,
    nhead=8,  # 256/8 = 32 dim per head
    dim_feedforward=1024,
    dropout=0.1,
    activation='gelu',
    batch_first=True
)

Head Dimension: 32 preserves local attention patterns
FFN Ratio: 4:1 (1024/256) balances capacity/compute
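To see how these parameters behave on a batch, here is a small shape check. The two-layer depth, the input sizes, and the mean-pooling step are assumptions made for illustration; the article only specifies the layer configuration.

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=8, dim_feedforward=1024,
    dropout=0.1, activation="gelu", batch_first=True,
)
# num_layers=2 is an assumed depth; the article does not state it.
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
encoder.eval()  # disable dropout for a deterministic shape check

x = torch.randn(4, 50, 256)   # [batch, frames, d_model]
with torch.no_grad():
    out = encoder(x)          # same shape as the input
pooled = out.mean(dim=1)      # mean-pool over time -> [batch, 256]
print(out.shape, pooled.shape)
```

Because `batch_first=True`, the batch dimension comes first; each of the 8 heads attends over a 256 / 8 = 32-dimensional slice.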

3.3 Embedding Projection

self.fusion = nn.Sequential(
    nn.Linear(768, 512),  # 256*3 features
    nn.LayerNorm(512),
    nn.GELU(),
    nn.Dropout(0.1),
    nn.Linear(512, 64)  # Final embedding
)

3.3.1 Bottleneck Design:

768 → 512 → 64 gradual compression
LayerNorm stabilizes training

3.3.2 Embedding Properties:

L2-normalized (‖e‖=1)
64D achieves 98% of 128D performance
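A short sketch of the projection followed by L2 normalization, assuming each of the three branches has already been pooled to a 256-dimensional vector (the random tensors are placeholders for those pooled features):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

fusion = nn.Sequential(
    nn.Linear(768, 512), nn.LayerNorm(512), nn.GELU(),
    nn.Dropout(0.1), nn.Linear(512, 64),
)
fusion.eval()

# Placeholder pooled features for the three branches, 256 dims each.
mel_feat, prosody_feat, spectral_feat = (torch.randn(4, 256) for _ in range(3))
fused = torch.cat([mel_feat, prosody_feat, spectral_feat], dim=1)  # [4, 768]
with torch.no_grad():
    emb = F.normalize(fusion(fused), p=2, dim=1)  # [4, 64], unit length
print(emb.shape, emb.norm(dim=1))
```

After normalization every embedding lies on the unit sphere, so a dot product between two embeddings is their cosine similarity.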

4. Loss Functions: The Training Signal

Loss Functions

4.1 Contrastive Loss (Temperature-Scaled)

def contrastive_loss(embeddings, labels, temp=0.07):
    # Normalize
    embeddings = F.normalize(embeddings, p=2, dim=1)  # Critical!
    
    # Similarity matrix
    sim = torch.mm(embeddings, embeddings.t()) / temp
    
    # Positive pairs (same accent)
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask.fill_diagonal_(False)  # Exclude self
    
    # Denominator: every other sample in the batch (self masked with -inf,
    # since multiplying by a 0/1 mask would still contribute exp(0) = 1)
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=sim.device)
    
    # Log-softmax
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float('-inf')), dim=1, keepdim=True)
    
    # Mean positive log-likelihood
    loss = -(pos_mask.float() * log_prob).sum() / pos_mask.float().sum().clamp(min=1)
    return loss

Key Mechanisms:

1. Temperature (τ=0.07)

Controls separation sharpness
Auto-tuned version: τ = 0.05 + 0.1*sigmoid(embeddings.std())

2. Normalization

Prevents magnitude domination
Enforces angular similarity
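The two mechanisms can be illustrated numerically. This sketch uses NumPy in place of PyTorch; the function names are illustrative, and the similarity values are made up for the demonstration. Only the formula τ = 0.05 + 0.1*sigmoid(std) comes from the text above.

```python
import numpy as np

def auto_temperature(embeddings: np.ndarray) -> float:
    """tau = 0.05 + 0.1 * sigmoid(std), so tau always lies in (0.05, 0.15)."""
    spread = embeddings.std()
    return float(0.05 + 0.1 / (1.0 + np.exp(-spread)))

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()

sims = np.array([0.9, 0.7, 0.1])  # cosine similarities to three candidates
for tau in (0.07, 0.5):
    print(tau, softmax(sims / tau).round(3))

print(auto_temperature(np.zeros((4, 8))))  # zero spread gives the midpoint, 0.1
```

A smaller τ concentrates probability on the most similar candidate, which is what "separation sharpness" refers to.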

4.2 Triplet Loss (Adaptive Margin)

class TripletLoss(nn.Module):
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin
        self.softplus = nn.Softplus()
    def forward(self, anchor, positive, negative):
        # Cosine similarity: higher means closer
        pos_sim = F.cosine_similarity(anchor, positive)
        neg_sim = F.cosine_similarity(anchor, negative)
        # Penalize negatives that are not at least `margin` less similar
        return self.softplus(neg_sim - pos_sim + self.margin).mean()

Innovations:

Softplus instead of ReLU for smoother gradients
Dynamic Margin:

margin = 1.0 + 0.5*torch.sigmoid(self.margin_learner(anchor))

4.3 Reconstruction Loss

recon_loss = (
    0.5*F.mse_loss(mel_recon, mel_orig) +
    0.3*F.mse_loss(prosody_recon, prosody_orig) +
    0.2*F.mse_loss(spectral_recon, spectral_orig)
)

Weighting strategy: Mel reconstruction 0.5, prosody 0.3, spectral 0.2.

5. Training Orchestration: The Learning Process

Accent Encoder Training Pipeline

5.1 Dynamic Loss Balancing — The Training Compass

This triple-loss system acts like an orchestra conductor:

5.1.1 Triplet Loss (Weight=1.0):
The lead violin forcing accent groups apart. Implements hard negative mining:

# Sample hardest negatives within batch (closest embeddings with a different label)
neg_dist = 1 - F.cosine_similarity(anchor.unsqueeze(1), embeddings, dim=2)
neg_dist[labels == anchor_label] = np.inf  # Mask positives so argmin skips them
hardest_neg = embeddings[neg_dist.argmin(dim=1)]

Why it works: Creates clear decision boundaries between accents.

5.1.2 Contrastive Loss (Weight=0.1):
The percussion section maintaining rhythm. Uses temperature scaling:

sim_matrix = torch.mm(embeddings, embeddings.t()) / 0.07 # τ=0.07

Pro Tip: τ=0.07 works best for 64D embeddings (validated through linear probing).

5.1.3 Reconstruction Loss (Weight=0.1):
The bassline preserving signal integrity. Weighted by feature importance:

0.5*mel_loss + 0.3*prosody_loss + 0.2*spectral_loss
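Putting the three weights together, the combined objective can be sketched as plain arithmetic. The function names and the example loss values are illustrative; the weights are the ones listed in this section.

```python
def reconstruction_loss(mel_loss, prosody_loss, spectral_loss):
    """Feature-weighted reconstruction term (0.5 / 0.3 / 0.2)."""
    return 0.5 * mel_loss + 0.3 * prosody_loss + 0.2 * spectral_loss

def total_loss(triplet, contrastive, recon,
               w_triplet=1.0, w_contrastive=0.1, w_recon=0.1):
    """Weighted sum of the three training losses."""
    return w_triplet * triplet + w_contrastive * contrastive + w_recon * recon

recon = reconstruction_loss(0.40, 0.20, 0.10)
loss = total_loss(triplet=0.80, contrastive=2.50, recon=recon)
print(round(recon, 3), round(loss, 3))  # 0.28 1.078
```

With these weights the triplet term dominates the gradient, while the other two act as regularizers.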

5.2 Gradient Flow Optimization — The Learning Engine

The backward pass relies on three critical mechanisms:

5.2.1 Gradient Clipping (1.0 norm):

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2)

Prevents exploding gradients in the transformer layers.

5.2.2 Mixed Precision Training:

with torch.cuda.amp.autocast():
    embeddings = model(batch)

Achieves 1.8x speedup without accuracy loss.

5.2.3 OneCycle LR Scheduling:

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, 
    max_lr=1e-3, 
    steps_per_epoch=len(train_loader),
    epochs=50,
    div_factor=10  # Initial LR = 1e-4
)

Why it works: Superconvergence phenomenon — rapid navigation through loss landscape.

5.3 Validation-Driven Checkpointing — The Quality Gate

The evaluation protocol does more than monitor loss:

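if val_metrics['acc'] > best_acc:
    best_acc = val_metrics['acc']  # Track the new best score
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),  # Weights needed to restore the model
        'embedding_std': embeddings.std(dim=1).mean().item(),  # Critical!
        'contrastive_align': (embeddings @ embeddings.T).mean().item()
    }, 'best_model.pt')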

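Key Validation Metrics: validation accuracy (acc), per-embedding standard deviation (embedding_std), and mean pairwise similarity (contrastive_align).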

Conclusion: Key Takeaways & Next Steps

This end-to-end accent embedding system transforms raw audio into discriminative 64D representations through a hybrid CNN-Transformer architecture and multi-task learning. The modular pipeline handles all stages from robust audio preprocessing to feature fusion, producing embeddings that effectively cluster accent characteristics. While optimized for Hindi/Spanish accents in L2-ARCTIC, the architecture is designed for easy extension to new languages.

Get Started with the Code

GitHub - Adarshh9/Accent-Embedding-Model-From-Scratch

Dataset

L2 Arctic Data

Where To Go Next?

Extend to New Accents
Try adding Mandarin or Arabic speakers from CommonVoice dataset
Build a Real-Time Demo
Explore Applications

Accent-aware ASR
Pronunciation coaching
Forensic voice analysis

Physics-Informed Neural Networks (PINNs): Modeling Coffee Cooling

Adarsh Kesharwani — Sat, 26 Apr 2025 09:00:31 GMT

1. Introduction to PINNs

Physics-Informed Neural Networks (PINNs) are a powerful blend of deep learning and physics. They train neural networks to not just fit data, but also respect the underlying physical laws governing the system.

In this blog, we’ll explore how PINNs work by modeling a hot cup of coffee cooling down over time — a simple, real-world example that follows Newton’s Law of Cooling.

2. Traditional Neural Networks vs. PINNs

Standard Neural Networks (Pure Data-Driven)

Input → Output Mapping: Learns patterns purely from data.
No Physics Knowledge: Treats the problem as a black box.

Limitations:
Needs large amounts of data.
May produce physically unrealistic predictions (e.g., coffee that cools below room temperature).

Physics-Informed Neural Networks (PINNs)

Combines Data + Physics: Uses known physical laws to guide learning.
How? By embedding physics into the loss function.

Advantages:
Works with small/noisy data.
Produces physically consistent predictions.

3. How Physics is Introduced in PINNs

The key idea is to constrain the neural network to obey known physics while still fitting observed data. This is done through three types of losses:

I. Data Loss: Fitting Observed Measurements

Purpose: Ensures the neural network’s predictions match real-world measurements.

Formula:

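$$L_{data} = \frac{1}{N} \sum_{i=1}^{N} \left( T(t_i) - T_{observed}(t_i) \right)^2$$

where N is the number of measurements.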

How It Works:

The network predicts temperature 𝑇(𝑡) at given time points 𝑡𝑖. The loss penalizes deviations from actual measured temperatures 𝑇𝑜𝑏𝑠𝑒𝑟𝑣𝑒𝑑(𝑡𝑖). This is identical to standard supervised learning in neural networks.

Example (Coffee Cooling):

If we measure the coffee at 𝑡=2 min to be 80°C, the network should predict a value close to 80°C at that time.

II. Physics Loss: Enforcing Governing Equations

Purpose: Forces the neural network to obey the underlying physical law (e.g., Newton’s Law of Cooling).

Formula for coffee cooling (Newton’s Law):

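$$L_{physics} = \frac{1}{M} \sum_{j=1}^{M} \left( \frac{dT}{dt}\Big|_{t_j} + k\,\left( T(t_j) - T_{env} \right) \right)^2$$

where M is the number of time points at which the equation is checked.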

How It Works:

The neural network predicts 𝑇(𝑡).

Automatic differentiation computes 𝑑𝑇/𝑑𝑡 (exact derivative, no approximations). The term inside the loss (𝑑𝑇/𝑑𝑡+𝑘(𝑇−𝑇𝑒𝑛𝑣)) is the residual of Newton’s Law of Cooling. If the physics is perfectly satisfied, this residual should be zero. The loss penalizes deviations from zero (i.e., violations of physics).

Why This Matters:

Unlike traditional curve-fitting, the network cannot just memorize data — it must respect the physics. Even with noisy or sparse data, the physics term keeps predictions realistic.
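For reference, Newton's Law of Cooling has a closed-form solution, which is the curve the physics term pushes the network toward. A short derivation, writing T_0 for the temperature at t = 0:

```latex
\begin{align*}
\frac{dT}{dt} &= -k\,(T - T_{env}) \\
\frac{d(T - T_{env})}{T - T_{env}} &= -k\,dt \\
\ln\lvert T - T_{env}\rvert &= -kt + C \\
T(t) &= T_{env} + (T_0 - T_{env})\,e^{-kt}
\end{align*}
```

Substituting back gives dT/dt = -k (T_0 - T_env) e^{-kt} = -k (T - T_env), so the residual dT/dt + k(T - T_env) is exactly zero for this curve.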

III. Initial/Boundary Condition Loss: Ensuring Correct Starting Behavior

Purpose: Guarantees the solution satisfies known initial or boundary conditions.

Formula for coffee cooling:

$$L_{IC} = \left( T(0) - T_0 \right)^2$$

How It Works:

The network must predict the correct initial temperature 𝑇0 at 𝑡=0. Without this, the solution could start at a wrong value (e.g., 50°C instead of 90°C) and still fit data.

Example (Coffee Cooling):

At 𝑡=0, the coffee is freshly brewed at 90°C. The loss ensures 𝑇𝑝𝑟𝑒𝑑(0)=90°C.

4. Combining the Losses: Total Training Objective

The total loss is a weighted sum:

$$L_{total} = \lambda_{data} L_{data} + \lambda_{physics} L_{physics} + \lambda_{IC} L_{IC}$$

Default Weights:

𝜆𝑑𝑎𝑡𝑎 = 𝜆𝑝ℎ𝑦𝑠𝑖𝑐𝑠 = 𝜆𝐼𝐶 = 1 (can be adjusted if needed).

5. Training Process

The neural network predicts 𝑇(𝑡).
The data loss ensures it fits measurements.
The physics loss ensures it follows Newton’s Law.
The initial condition loss ensures it starts at 𝑇0=90°𝐶.
The optimizer (e.g., Adam) updates the network weights to minimize 𝐿𝑡𝑜𝑡𝑎𝑙.

6. Why This Approach Works

Automatic Differentiation: Computes exact derivatives of 𝑇(𝑡) (no numerical approximations).
Physics as a Soft Constraint: The network is nudged toward physically plausible solutions.
Balanced Learning: The three losses work together.

Result: A neural network that generalizes better than pure data-driven methods, even with limited/noisy data.

PINNs aren’t magic; they’re just smart neural networks. Time to see them in action!

1. Generate Synthetic Data

We simulate noisy temperature measurements from a cooling coffee cup.

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Physics parameters
T_env = 25.0  # Room temperature (°C)
T0 = 90.0     # Initial coffee temperature (°C)
k = 0.1       # Cooling rate constant

# True analytical solution
def true_solution(t):
    return T_env + (T0 - T_env)*np.exp(-k*t)

# Generate some time points (every 2 minutes for 20 minutes)
t_min, t_max = 0.0, 20.0
t_data = np.array([0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])

# Generate noisy temperature measurements
np.random.seed(0)
noise_level = 2.0  # ±2°C measurement error
T_data_exact = true_solution(t_data)
T_data_noisy = T_data_exact + noise_level*np.random.randn(len(t_data))

# Convert to PyTorch tensors
t_data_tensor = torch.tensor(t_data, dtype=torch.float32).view(-1, 1)
T_data_tensor = torch.tensor(T_data_noisy, dtype=torch.float32).view(-1, 1)
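As a sanity check on the analytical solution, the ODE residual can be evaluated numerically with finite differences. This is a standalone NumPy sketch that repeats the constants above; the grid size is an arbitrary choice.

```python
import numpy as np

T_env, T0, k = 25.0, 90.0, 0.1  # same values as above

def true_solution(t):
    return T_env + (T0 - T_env) * np.exp(-k * t)

t = np.linspace(0.0, 20.0, 2001)             # fine grid, spacing 0.01 min
T = true_solution(t)
dT_dt = np.gradient(T, t)                    # finite-difference derivative
residual = dT_dt + k * (T - T_env)           # should be close to 0 everywhere
max_residual = np.abs(residual[1:-1]).max()  # skip the less accurate endpoints
print(f"max |residual| = {max_residual:.2e}")
```

The residual is not exactly zero only because of finite-difference error; the PINN's physics loss computes the same quantity with automatic differentiation instead.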

2. Define the Neural Network

A simple neural network that takes time t and predicts temperature T.

class CoffeePINN(nn.Module):
    def __init__(self):
        super(CoffeePINN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 20),  # 1 input (time), 20 hidden neurons
            nn.Tanh(),
            nn.Linear(20, 20),
            nn.Tanh(),
            nn.Linear(20, 1)   # 1 output (temperature)
        )

    def forward(self, t):
        return self.net(t)

model = CoffeePINN()

3. Define the Physics Loss

This ensures the network follows Newton’s Law of Cooling.

def derivative(y, x):
    """Computes dy/dx using autograd"""
    return torch.autograd.grad(y, x, 
                             grad_outputs=torch.ones_like(y),
                             create_graph=True)[0]

def physics_loss(model, t):
    """Physics loss: dT/dt = -k*(T - T_env)"""
    t = t.clone().detach().requires_grad_(True)  # Avoid mutating the caller's tensor
    
    # Get prediction and derivative
    T = model(t)
    dT_dt = derivative(T, t)
    
    # Governing equation residual
    residual = dT_dt + k*(T - T_env)
    return torch.mean(residual**2)

4. Define Data & Initial Condition Losses

Data Loss: Fits noisy measurements.

IC Loss: Ensures initial temperature is correct.

def data_loss(model, t_data, T_data):
    """MSE between predictions and measurements"""
    T_pred = model(t_data)
    return torch.mean((T_pred - T_data)**2)

def initial_condition_loss(model):
    """Initial condition: T(0) = T0"""
    t0 = torch.zeros(1, 1, dtype=torch.float32)
    T_pred = model(t0)
    return (T_pred - T0).pow(2).mean()

5. Train the PINN

We optimize all three losses together.

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Loss weights (you can adjust these)
lambda_data = 1.0
lambda_ode = 1.0
lambda_ic = 1.0

num_epochs = 2000
print_every = 200

model.train()
for epoch in range(num_epochs):
    optimizer.zero_grad()
    
    # Compute all loss components
    l_data = data_loss(model, t_data_tensor, T_data_tensor)
    l_ode = physics_loss(model, t_data_tensor)
    l_ic = initial_condition_loss(model)
    
    # Combined loss
    loss = lambda_data*l_data + lambda_ode*l_ode + lambda_ic*l_ic
    
    # Backpropagation
    loss.backward()
    optimizer.step()
    
    # Print progress
    if (epoch + 1) % print_every == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, "
              f"Total Loss = {loss.item():.4f}, "
              f"Data Loss = {l_data.item():.4f}, "
              f"ODE Loss = {l_ode.item():.4f}, "
              f"IC Loss = {l_ic.item():.4f}")
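The loop above evaluates the physics residual only at the 11 measurement times. A common PINN variation, which this article's code does not use, is to evaluate it on a denser set of collocation points so the equation is enforced between measurements as well. A self-contained sketch with a stand-in network (the names `net`, `physics_loss_on`, and the choice of 100 points are assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
k, T_env = 0.1, 25.0

# Small stand-in network with the same 1 -> 20 -> 20 -> 1 layout.
net = nn.Sequential(nn.Linear(1, 20), nn.Tanh(),
                    nn.Linear(20, 20), nn.Tanh(), nn.Linear(20, 1))

def physics_loss_on(model, t_points):
    """Mean squared ODE residual at arbitrary time points."""
    t = t_points.clone().requires_grad_(True)
    T = model(t)
    dT_dt = torch.autograd.grad(T, t, grad_outputs=torch.ones_like(T),
                                create_graph=True)[0]
    return torch.mean((dT_dt + k * (T - T_env)) ** 2)

# 100 evenly spaced collocation points over 0..20 minutes.
t_colloc = torch.linspace(0.0, 20.0, 100).view(-1, 1)
l_ode = physics_loss_on(net, t_colloc)
print(l_ode.item())
```

In the training loop this would replace the `physics_loss(model, t_data_tensor)` call, leaving the data and initial-condition terms unchanged.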

6. Visualize the Results

model.eval()
t_plot = np.linspace(t_min, t_max, 100).reshape(-1, 1)
t_plot_tensor = torch.tensor(t_plot, dtype=torch.float32)
T_pred_plot = model(t_plot_tensor).detach().numpy()

# True solution for comparison
T_true_plot = true_solution(t_plot)

# Plot results
plt.figure(figsize=(10, 6))
plt.scatter(t_data, T_data_noisy, color='red', label='Noisy Measurements')
plt.plot(t_plot, T_true_plot, 'k--', label='Analytical Solution')
plt.plot(t_plot, T_pred_plot, 'b', label='PINN Prediction')
plt.xlabel('Time (minutes)')
plt.ylabel('Temperature (°C)')
plt.title('Coffee Cooling: PINN vs Physics')
plt.legend()
plt.grid(True)
plt.show()

Expected Output

The PINN will:

Fit the noisy data points
Obey Newton’s Law of Cooling
Start at the correct initial temperature

Plot Result

Conclusion

PINNs are a game-changer for scientific machine learning. By embedding physics into neural networks, they produce more reliable and interpretable models than pure data-driven approaches.

Quick Experiments to Explore

1. Tune Loss Weights (λ)

Increase λ_physics: Forces stricter physics compliance (smoother curves)
Increase λ_data: Prioritizes fitting measurements (may overfit noise)
Try λ_IC = 0: See how wrong initial conditions affect results

2. Test Noisy Data

Add more noise: Does the physics term still keep predictions realistic?
Remove some data points: Can PINNs fill gaps using physics?

3. New Physics Scenarios

Bouncing ball: Gravity + energy loss on impact
Room heating: Thermostat-controlled temperature change
Simple pendulum: Swing dynamics with friction

4. Change the Model

Fewer/more neural network layers → How does complexity affect results?
Swap Tanh for ReLU → Does activation choice matter?

Try tweaking one thing at a time and observe the changes! 🧪

Train the SD3 Model Using DreamBooth LoRA

Adarsh Kesharwani — Wed, 08 Jan 2025 12:50:03 GMT

A Step-by-Step Guide

DreamBooth And LoRA: A Quick Overview

Think of DreamBooth as a way to teach an AI model something special — like your favorite art style, a unique character, or even a specific object. You only need a few images to fine-tune the model, and it starts generating outputs tailored to what you taught it. This makes it perfect for adding a personal touch to AI-generated content.

Now, let’s talk about LoRA (Low-Rank Adaptation). Training large AI models from scratch can be time-consuming and resource-heavy. LoRA solves this by updating only the most important parts of the model instead of retraining everything. It’s like adjusting the screws on a machine rather than rebuilding the entire thing.
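The idea behind LoRA can be sketched in a few lines of NumPy: the pretrained weight stays frozen and a low-rank product is added on top of it. The layer size, rank, and scaling factor here are illustrative values, not settings taken from the SD3 scripts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 512, 512, 8, 16     # illustrative sizes

W = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable, small random init
B = np.zeros((d_out, rank))                    # trainable, zero init

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)  # 262144 8192
```

Because B starts at zero, the adapted layer initially behaves exactly like the original one, and training only has to update about 3% as many values for this layer.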

Combining DreamBooth and LoRA gives you a powerful and efficient way to personalize huge models like Stable Diffusion (SD). It’s fast, flexible, and doesn’t require a supercomputer to get amazing results.

Why Does SD3 Need Different Scripts Than SD1/SD2?

Stable Diffusion 3 (SD3) isn’t just an upgrade — it’s a whole new level of sophistication compared to SD1 and SD2. These advancements make SD3 more powerful but also mean it requires specially tailored scripts for DreamBooth training. Let’s break down why.

1. Key Architectural Differences -

SD1/SD2 Architecture:
Think of this as a simple pipeline. It uses a CLIP text encoder and a single U-Net for generating images.

# SD1/2 basic flow
text_encoder -> CLIP embedding  
image -> VAE encoding -> U-Net -> VAE decoding

SD3 Architecture:
SD3 replaces the U-Net with a Multimodal Diffusion Transformer (MMDiT), conditions on three text encoders (two CLIP models and T5-XXL), and uses a VAE with 16 latent channels.

# SD3 flow
text -> CLIP-L + OpenCLIP-G + T5-XXL embeddings
image -> 16-channel VAE encoding -> MMDiT -> VAE decoding

2. Loss Functions -

SD1/SD2 Loss:
A straightforward loss function focused on noise prediction and preserving prior knowledge.

loss = MSE(predicted_noise, random_noise) + prior_preservation_loss
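As a concrete version of the pseudocode above, here is a NumPy sketch of a DreamBooth-style loss with prior preservation. The function names, the `prior_weight` parameter, and the latent shape are illustrative; real training computes these terms on model outputs rather than on synthetic arrays.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

def dreambooth_loss(pred_instance, noise_instance,
                    pred_class, noise_class, prior_weight=1.0):
    """Noise-prediction MSE on the subject images plus a weighted
    MSE on generic class images (the prior-preservation term)."""
    instance_loss = mse(pred_instance, noise_instance)
    prior_loss = mse(pred_class, noise_class)
    return instance_loss + prior_weight * prior_loss

rng = np.random.default_rng(0)
shape = (4, 64, 64)  # placeholder latent shape
noise_i, noise_c = rng.normal(size=shape), rng.normal(size=shape)
# Pretend predictions are off by a constant 0.1 and 0.2 respectively.
loss = dreambooth_loss(noise_i + 0.1, noise_i, noise_c + 0.2, noise_c)
print(round(loss, 4))  # 0.05
```

The prior term is what keeps the model from forgetting what the generic class looks like while it learns the new subject.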

SD3 Loss:
SD3 is trained with a rectified-flow (flow-matching) objective: instead of predicting the added noise, the model predicts the velocity between the clean latents and the noise, and the MSE is weighted per timestep. Prior preservation can still be added on top.

loss = weighted_MSE(predicted_velocity, noise - latents) +  
       prior_preservation_loss

3. Learning Rate Handling -

SD1/SD2:
Uses a basic cosine annealing scheduler for smooth transitions.

lr_scheduler = CosineAnnealingLR(initial_lr=1e-6, min_lr=1e-7)

SD3:
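The DreamBooth LoRA script used later in this guide takes its optimizer and schedule from command-line flags; the training command below uses 8-bit Adam with a constant learning rate of 1e-4.

optimizer = 8-bit Adam (--use_8bit_adam), learning_rate = 1e-4
lr_scheduler = "constant", lr_warmup_steps = 0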

4. Memory Management -

SD1/SD2:
Basic gradient checkpointing to save memory.

gradient_checkpointing = True

SD3: Adds memory-efficient techniques like xformers and advanced attention mechanisms.

gradient_checkpointing = True  
use_efficient_attention = True  
enable_xformers = True  
use_memory_efficient_attention = True

Let’s Get to the Code !!

1. Install Necessary Libraries

!pip install -q -U git+https://github.com/huggingface/diffusers
!pip install -q -U \
    transformers \
    accelerate \
    bitsandbytes \
    peft

2. Authenticate with Hugging Face

!huggingface-cli login

3. Clone the Diffusers Repository

!git clone https://github.com/huggingface/diffusers
%cd diffusers/examples/research_projects/sd3_lora_colab

4. Upload Training Images

Place your training images in a folder (e.g., XYZ_folder) within the sd3_lora_colab directory.
Ensure your images have a consistent format (e.g., .png).

5. Configure the Script

Open the compute_embeddings.py file in the same directory and make the following changes:

Line 28: Update the PROMPT to describe the concept you're training (e.g., "employees in a modern office").
Line 30: Set LOCAL_DATA_DIR to the folder containing your images (e.g., "XYZ_folder").
Line 79: Adjust the image file extension if needed (e.g., .png).
Save the changes.

6. Compute Image Embeddings

!python compute_embeddings.py

7. Clear GPU Memory

import torch
import gc

def flush():
    torch.cuda.empty_cache()
    gc.collect()

flush()

8. Train the Model
Feel free to play with hyperparameters!

!accelerate launch train_dreambooth_lora_sd3_miniature.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers"  \
  --instance_data_dir="XYZ_folder" \
  --data_df_path="sample_embeddings.parquet" \
  --output_dir="trained-sd3-lora-miniature" \
  --mixed_precision="fp16" \
  --instance_prompt="employees in modern office" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed="0"

The LoRA weights will be saved in the trained-sd3-lora-miniature directory.
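The batch-related flags in the command interact. A quick arithmetic sketch of what they imply, assuming a single GPU and that each training step counts one optimizer update:

```python
train_batch_size = 1
gradient_accumulation_steps = 4
max_train_steps = 500
num_processes = 1  # assumption: a single GPU, as on a typical Colab runtime

# Samples contributing to each optimizer update.
effective_batch = train_batch_size * gradient_accumulation_steps * num_processes
# Total training samples drawn, assuming each step is one optimizer update.
samples_seen = effective_batch * max_train_steps
print(effective_batch, samples_seen)  # 4 2000
```

With only a handful of training images, each image is therefore revisited many times, which is worth keeping in mind when adjusting `--max_train_steps`.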

9. Perform Inference

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
)
lora_output_path = "trained-sd3-lora-miniature"
pipeline.load_lora_weights(lora_output_path)

pipeline.enable_sequential_cpu_offload()

image = pipeline("employees in modern office using their laptops").images[0]
image.save("output.png")

10. Save and Reuse Weights
You can download the trained-sd3-lora-miniature folder to store the LoRA weights and reuse them later.

Wrapping Up

Training Stable Diffusion 3 with LoRA makes creating personalized, high-quality images easier and more efficient. With these steps, you can fine-tune the model, save your LoRA weights, and reuse them whenever you need. Now it’s your chance to get creative and see what amazing results you can achieve!

Distilling Knowledge: Making Large Models Smaller And Smarter

Adarsh Kesharwani — Thu, 28 Nov 2024 20:00:15 GMT

Large Language Models (LLMs), like GPT-4, have revolutionized AI, unlocking new possibilities, but they come with significant challenges. These models demand immense computational power and storage, making them costly and impractical for standard devices. Their complexity introduces latency, causing frustrating delays in real-time responses, and their overparameterization leads to inefficiencies, with many parameters adding little value.
Accessibility is another concern, as only resource-rich organizations can afford to deploy them. Moreover, their high energy consumption raises serious environmental issues. While LLMs are undeniably powerful, addressing these challenges is essential for their broader adoption and sustainable use.

Why Do We Need Knowledge Distillation?

Knowledge Distillation (KD) offers a game-changing solution by transferring the knowledge of large models into smaller, more efficient ones. These compact models retain most of the original’s performance but are faster, lighter, and less resource-intensive.
With KD, AI becomes more accessible and sustainable, enabling us to leverage the power of LLMs without their limitations. It’s a step toward making advanced AI both practical and scalable, paving the way for smarter, greener innovations.

What is Knowledge Distillation?

Knowledge Distillation (KD) is a machine learning technique where a large, powerful teacher model trains a smaller, more efficient student model by passing on its knowledge. Unlike traditional training that relies only on true labels, KD uses the teacher’s outputs — soft probabilities or logits — to guide the student.
This approach helps the student model learn not just the correct answers but also the nuanced relationships between classes, enabling it to mimic the teacher’s behavior effectively. The result? A compact model that’s faster and lighter, yet still retains the teacher’s expertise.

Training student model using KD

How Does Knowledge Distillation Work?

Knowledge Distillation transfers knowledge from a large teacher model to a smaller student model. The key idea is to train the student on both the true labels of the dataset and the teacher's soft predictions (probabilities computed from its logits). These soft predictions carry rich information about the relationships between different classes that hard labels do not capture.

The Process in Simple Terms:

Teacher Model Training:
The teacher model is first trained on the dataset using traditional methods, achieving high accuracy by learning complex patterns and relationships in the data.
Generating Soft Labels:
During inference, the teacher generates soft labels — probabilities for each class instead of a single “hard” label. For example, instead of predicting “cat” with 100% certainty, the teacher might output a distribution like cat: 0.7, dog: 0.2, rabbit: 0.1, reflecting its nuanced understanding.
Student Model Training:
The student model is then trained using two losses:

Cross-Entropy Loss (CE): This loss compares the student’s predicted probabilities to the true class labels, ensuring the student learns to classify correctly based on the actual data.
Kullback-Leibler Divergence Loss (Distillation Loss): This loss measures the difference between the teacher’s softened logits (probabilities) and the student’s predictions. By scaling the logits using a temperature parameter, the teacher’s output becomes softer, allowing the student to mimic the teacher’s knowledge better while focusing on more nuanced information.
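The two losses can be combined as in the following NumPy sketch. The mixing weight `alpha`, the T² scaling of the distillation term, and the example logits are assumptions based on common practice, not values taken from this article.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = (logits - logits.max(axis=1, keepdims=True)) / T
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * CE(student, labels) + (1 - alpha) * T^2 * KL(teacher || student)."""
    n = len(labels)
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(n), labels]).mean()
    p_t = softmax(teacher_logits, T)  # softened teacher distribution
    p_s = softmax(student_logits, T)  # softened student distribution
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=1).mean()
    return alpha * ce + (1 - alpha) * (T ** 2) * kl

teacher = np.array([[4.0, 1.5, 0.5], [0.2, 3.0, 1.0]])
student = np.array([[2.0, 1.0, 0.5], [0.5, 1.5, 1.0]])
labels = np.array([0, 1])
print(kd_loss(student, teacher, labels))
```

The distillation term vanishes when the student's softened distribution matches the teacher's, so it only pulls on the student where the two disagree.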

The Role of Temperature:

The temperature parameter (T) in KD smooths the teacher's output distribution: the logits are divided by T before the softmax, which raises the smaller probabilities and exposes subtle relationships between classes. For example, increasing T may turn a probability distribution like [0.7, 0.2, 0.1] into something closer to [0.5, 0.3, 0.2], making it easier for the student to grasp these relationships.
This approach was popularized by the paper Distilling the Knowledge in a Neural Network (Hinton, Vinyals, and Dean, 2015), which showed that a smaller, simpler model can recover much of the performance of a larger model or ensemble.
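The softening effect is easy to check numerically. This sketch treats [0.7, 0.2, 0.1] as the teacher's probabilities at T = 1, recovers logits by taking the log, and reapplies the softmax at a higher temperature (the function name `soften` is illustrative):

```python
import numpy as np

def soften(probs: np.ndarray, T: float) -> np.ndarray:
    """Re-apply softmax at temperature T to the logits log(probs)."""
    logits = np.log(probs)
    z = np.exp(logits / T)
    return z / z.sum()

p = np.array([0.7, 0.2, 0.1])
print(soften(p, T=1.0).round(2).tolist())  # [0.7, 0.2, 0.1]
print(soften(p, T=2.0).round(2).tolist())  # [0.52, 0.28, 0.2]
```

At T = 2 the distribution lands close to the [0.5, 0.3, 0.2] example: the ranking of classes is preserved, but the less likely classes carry more of the probability mass.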

Intuition:

Imagine the teacher as a skilled mentor who doesn’t just provide correct answers but also explains why other options are less likely. This nuanced guidance helps the student develop a deeper understanding, enabling them to perform well without requiring the same level of complexity as the teacher.
By blending direct supervision (true labels) with this informed guidance (teacher logits), the student learns to generalize effectively, achieving performance close to the teacher’s — while being faster, smaller, and more efficient.

Phew, enough theory! Time to roll up our sleeves and dive into the code 💻

Install Dependencies
Start by installing torch, torchvision for image datasets, models, and transformations.

!pip install -q torch torchvision

Import Libraries & Setup Device
Import the necessary libraries and set up the device (CPU/GPU) for training.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Load CIFAR-10 Dataset
We load the CIFAR-10 dataset with image transformations (e.g., normalization) for preprocessing and split it into training and testing datasets.

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=2)

Define Teacher and Student Networks
The teacher (DeepNN) is a larger, more complex network, while the student (LightNN) is lightweight, making it suitable for resource-constrained scenarios.

# Teacher Model (DeepNN)
class DeepNN(nn.Module):
    def __init__(self, num_classes=10):
        super(DeepNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.classifier = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

# Student Model (LightNN)
class LightNN(nn.Module):
    def __init__(self, num_classes=10):
        super(LightNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
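The input sizes of the first `nn.Linear` layers (2048 and 1024) follow from the layer shapes above. The sketch below redoes that arithmetic in plain Python and counts trainable parameters by hand. The helper names (`conv_params`, `linear_params`, `flattened_size`) are illustrative and not part of the original code. It assumes 32x32 CIFAR-10 inputs and that each 3x3 convolution with `padding=1` preserves the spatial size.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution layer."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def flattened_size(side, channels, n_pools):
    """Features left after n_pools 2x2 max-pools on a side x side input."""
    side //= 2 ** n_pools
    return channels * side * side


# Both networks apply two 2x2 max-pools: 32 -> 16 -> 8 pixels per side.
teacher_flat = flattened_size(32, channels=32, n_pools=2)
student_flat = flattened_size(32, channels=16, n_pools=2)

teacher_total = (
    conv_params(3, 128) + conv_params(128, 64)
    + conv_params(64, 64) + conv_params(64, 32)
    + linear_params(teacher_flat, 512) + linear_params(512, 10)
)
student_total = (
    conv_params(3, 16) + conv_params(16, 16)
    + linear_params(student_flat, 256) + linear_params(256, 10)
)

print(teacher_flat, student_flat)    # 2048 1024
print(teacher_total, student_total)  # 1186986 267738
```

By this count the teacher has roughly 4.4 times as many parameters as the student, and in both models most of them sit in the first fully connected layer.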

Define Training and Testing Functions
This section trains and tests both the teacher (DeepNN) and the student (LightNN) using only cross-entropy (CE) loss. The teacher is trained first to minimize the CE loss, and then the student is trained with the same loss to serve as a baseline. In real-world scenarios, the teacher is typically pre-trained, and Knowledge Distillation (KD) transfers its knowledge to a smaller student model that is better suited to resource-constrained environments.

# Training Function
def train(model, train_loader, epochs, learning_rate, device):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    model.train()
    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

# Testing Function
def test(model, test_loader, device):
    model.to(device)
    model.eval()

    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    accuracy = 100 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")
    return accuracy

Train Teacher and Student Models
The teacher (DeepNN) is trained first, followed by the student (LightNN) using only cross-entropy loss for comparison.

torch.manual_seed(42)
teacher = DeepNN(num_classes=10).to(device)
train(teacher, train_loader, epochs=10, learning_rate=0.001, device=device)
test_accuracy_teacher = test(teacher, test_loader, device)

student = LightNN(num_classes=10).to(device)
train(student, train_loader, epochs=10, learning_rate=0.001, device=device)
test_accuracy_student = test(student, test_loader, device)

Train Student Model with Knowledge Distillation
The student model is trained with KD using two losses: Kullback-Leibler (KL) divergence loss and cross-entropy (CE) loss. The teacher’s logits are passed through a temperature-scaled softmax to create soft targets, which the student learns to match. Simultaneously, the student’s predictions are compared to the true labels using CE loss. The KL term is multiplied by T² so that its gradients stay on a scale comparable to the CE term as the temperature changes. Both losses are then combined in a weighted sum to update the student, allowing it to benefit from the teacher’s knowledge while still learning from the actual data, which helps it perform better with fewer parameters.

def train_kd(teacher, student, train_loader, epochs, learning_rate, T, soft_target_loss_weight, ce_loss_weight, device):
    ce_loss = nn.CrossEntropyLoss()
    optimizer = optim.Adam(student.parameters(), lr=learning_rate)

    teacher.eval()
    student.train()

    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()

            # Get teacher predictions (soft targets)
            with torch.no_grad():
                teacher_logits = teacher(inputs)

            # Get student predictions
            student_logits = student(inputs)

            # Compute distillation loss
            soft_targets = nn.functional.softmax(teacher_logits / T, dim=1)
            soft_prob = nn.functional.log_softmax(student_logits / T, dim=1)
            soft_targets_loss = torch.sum(soft_targets * (soft_targets.log() - soft_prob)) / soft_prob.size()[0] * (T**2)

            # Compute cross-entropy loss
            label_loss = ce_loss(student_logits, labels)

            # Combine losses
            loss = soft_target_loss_weight * soft_targets_loss + ce_loss_weight * label_loss

            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")
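To see what the temperature does to the soft targets, here is a minimal plain-Python sketch of the distillation loss above for a single example. The logits are made-up values for illustration, and `softmax` and `kd_loss` are hypothetical helpers rather than part of the training code. With a batch of one, the division by batch size in `train_kd` is a division by 1.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T gives a flatter distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, T):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * T ** 2


# Hypothetical logits for one 3-class example
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.0, 1.5, 0.2]

sharp = softmax(teacher_logits, T=1.0)
soft = softmax(teacher_logits, T=2.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=2:", [round(p, 3) for p in soft])
print("KD loss at T=2:", kd_loss(teacher_logits, student_logits, T=2.0))
```

At T=2 the top class keeps less of the probability mass, so the relative scores of the other classes become visible to the student. In PyTorch, `nn.functional.kl_div(soft_prob, soft_targets, reduction="batchmean")` computes the same batch-averaged KL term before the T² scaling.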

Apply KD and Compare Results
Finally, the student model is trained with KD and its accuracy is compared with and without the teacher’s help.

train_kd(teacher=teacher, student=student, train_loader=train_loader, epochs=10, learning_rate=0.001, T=2, soft_target_loss_weight=0.25, ce_loss_weight=0.75, device=device)
test_accuracy_student_kd = test(student, test_loader, device)

print(f"Teacher accuracy: {test_accuracy_teacher:.2f}%")
print(f"Student accuracy without KD: {test_accuracy_student:.2f}%")
print(f"Student accuracy with KD: {test_accuracy_student_kd:.2f}%")

The accuracy gain from Knowledge Distillation (KD) on CIFAR-10 is slight (70.63% with KD vs. 70.22% without), most likely because the dataset is relatively simple and the models are small. When the student already performs well on its own, the gains from KD are often marginal. On more complex datasets (e.g., ImageNet) or with larger, deeper models, KD can provide more substantial improvements, because the teacher’s knowledge helps the student learn more complex features, resulting in better generalization and performance.

In real-world scenarios, training with KD loss alone, without cross-entropy, is generally not ideal. KD loss teaches the student to match the teacher’s output distribution, while cross-entropy loss anchors it to the true labels; combining them lets the student benefit from the teacher’s guidance while still learning from the actual data. Training with only KD loss is possible, particularly when no ground truth is available (as in unsupervised distillation), but it requires a very strong teacher capable of providing meaningful soft labels, and it leaves the student exposed to the teacher’s errors and to noisy data. For most general-purpose tasks such as classification, combining KD loss with cross-entropy loss is the more effective choice.
