<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ypdu.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ypdu.github.io/" rel="alternate" type="text/html" /><updated>2026-04-06T16:28:57-07:00</updated><id>https://ypdu.github.io/feed.xml</id><title type="html">Yupei Du</title><subtitle>PhD student working on NLP and ML - Attribution, Memorization, and Language Models</subtitle><author><name>Yupei Du</name></author><entry><title type="html">Welcome to My Blog!</title><link href="https://ypdu.github.io/posts/2025/01/welcome-to-my-blog/" rel="alternate" type="text/html" title="Welcome to My Blog!" /><published>2025-01-15T00:00:00-08:00</published><updated>2025-01-15T00:00:00-08:00</updated><id>https://ypdu.github.io/posts/2025/01/welcome-to-my-blog</id><content type="html" xml:base="https://ypdu.github.io/posts/2025/01/welcome-to-my-blog/"><![CDATA[<p>Welcome to my new blog! I’ve finally decided to start writing about my research journey, thoughts on NLP and machine learning, and various technical topics that interest me.</p>

<h2 id="what-to-expect">What to Expect</h2>

<p>You can expect posts about:</p>

<ul>
  <li><strong>Research insights</strong>: Sharing lessons learned from my work on attribution, memorization, and language models</li>
  <li><strong>Technical tutorials</strong>: How-to guides on various NLP and ML techniques</li>
  <li><strong>Paper reviews</strong>: My thoughts on interesting papers in the field</li>
  <li><strong>Academic life</strong>: Tips and experiences from graduate school and research</li>
</ul>

<h2 id="why-start-a-blog">Why Start a Blog?</h2>

<p>As a PhD student working on NLP and ML, I often come across interesting ideas, debugging stories, and “aha!” moments that I think could be valuable to share. This blog will serve as both a personal record and hopefully a resource for others in the field.</p>

<p>I’m particularly excited to write about:</p>
<ul>
  <li>Attribution methods for language models</li>
  <li>The fascinating relationship between memorization and generalization</li>
  <li>Practical tips for ML research and experimentation</li>
</ul>

<h2 id="stay-tuned">Stay Tuned</h2>

<p>I’ll be posting regularly about my research adventures. Feel free to reach out if you have any questions or topics you’d like me to cover!</p>

<p><img src="https://media.giphy.com/media/l0HlBO7eyXzSZkJri/giphy.gif" alt="Research GIF" /></p>

<p><em>Happy researching!</em></p>]]></content><author><name>Yupei Du</name></author><category term="welcome" /><category term="personal" /><category term="research" /><summary type="html"><![CDATA[Welcome to my new blog! I’ve finally decided to start writing about my research journey, thoughts on NLP and machine learning, and various technical topics that interest me.]]></summary></entry><entry><title type="html">Understanding Attribution in Language Models: A Research Overview</title><link href="https://ypdu.github.io/posts/2025/01/understanding-attribution/" rel="alternate" type="text/html" title="Understanding Attribution in Language Models: A Research Overview" /><published>2025-01-10T00:00:00-08:00</published><updated>2025-01-10T00:00:00-08:00</updated><id>https://ypdu.github.io/posts/2025/01/understanding-attribution</id><content type="html" xml:base="https://ypdu.github.io/posts/2025/01/understanding-attribution/"><![CDATA[<p>Attribution has become one of the most crucial topics in making language models more transparent and trustworthy. In this post, I’ll share some insights from my research on attribution methods and why they matter for building safe AI systems.</p>

<h2 id="what-is-attribution">What is Attribution?</h2>

<p>Attribution, in the context of language models, refers to the process of identifying which parts of the training data, model components, or input contributed most to a particular prediction. Think of it as asking: “Why did the model produce this specific output?”</p>

<h2 id="types-of-attribution">Types of Attribution</h2>

<p>There are several flavors of attribution that researchers work on:</p>

<h3 id="1-data-attribution">1. Data Attribution</h3>
<ul>
  <li><strong>Question</strong>: Which training examples influenced this prediction?</li>
  <li><strong>Methods</strong>: Influence functions, TracIn, gradient-based methods</li>
  <li><strong>Applications</strong>: Data selection, debugging, privacy</li>
</ul>
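<p>To make the gradient-based flavor concrete, here is a minimal single-checkpoint sketch in the spirit of TracIn: each training example is scored by the dot product between its loss gradient and the test example's loss gradient. The logistic-regression setup, toy data, and function names are my own illustration for this post, not code from any of these papers:</p>

```python
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of the logistic loss w.r.t. weights w for one example (x, y)."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (p - y) * x

def tracin_influence(w, train_examples, test_example, lr=0.1):
    """Score each training example by the dot product of its loss gradient
    with the test example's gradient (a single-checkpoint TracIn sketch)."""
    g_test = logistic_grad(w, *test_example)
    return [lr * np.dot(logistic_grad(w, x, y), g_test)
            for x, y in train_examples]

# Toy data: the first training example points the same way as the test
# example, so it should receive the higher influence score.
w = np.zeros(2)
train = [(np.array([1.0, 0.0]), 1), (np.array([-1.0, 0.0]), 1)]
test = (np.array([2.0, 0.0]), 1)
scores = tracin_influence(w, train, test)
```

<p>In the toy run, the example whose gradient aligns with the test gradient gets a positive score and its mirror image a negative one, which is exactly the signal these methods exploit at scale.</p>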

<h3 id="2-feature-attribution">2. Feature Attribution</h3>
<ul>
  <li><strong>Question</strong>: Which input tokens/features matter most?</li>
  <li><strong>Methods</strong>: Gradients, attention weights (though their faithfulness as explanations is debated), LIME, SHAP</li>
  <li><strong>Applications</strong>: Model interpretation, bias detection</li>
</ul>
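<p>A simple way to build intuition for feature attribution, without any gradients at all, is occlusion: mask each input token in turn and measure how much the model's score drops. The scorer and tokens below are a toy illustration I made up for this sketch, standing in for a real model:</p>

```python
def occlusion_attribution(score_fn, tokens, mask_token="[MASK]"):
    """Attribute score_fn's output to each token by masking it out and
    measuring how much the score drops (leave-one-out attribution)."""
    base = score_fn(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        scores.append(base - score_fn(masked))
    return scores

# Toy scorer: counts occurrences of "good", so masking "good" drops the
# score by 1 and every other token gets zero attribution.
score_fn = lambda toks: float(toks.count("good"))
attr = occlusion_attribution(score_fn, ["the", "movie", "was", "good"])
```

<p>Real implementations apply the same loop to a model's output probability; the cost is one forward pass per token, which is why gradient methods are often preferred for long inputs.</p>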

<h3 id="3-component-attribution">3. Component Attribution</h3>
<ul>
  <li><strong>Question</strong>: Which model parameters/layers are responsible?</li>
  <li><strong>Methods</strong>: Probing, circuit analysis, mechanistic interpretability</li>
  <li><strong>Applications</strong>: Model understanding, targeted editing</li>
</ul>
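<p>Component attribution can be sketched the same way with ablation: zero out one hidden unit at a time and measure how much the output changes. The tiny two-layer network below is a made-up toy, not a real transformer circuit, but the recipe is the same one used in circuit analysis:</p>

```python
import numpy as np

def mlp_forward(x, W1, W2, ablate_unit=None):
    """Two-layer ReLU MLP; optionally zero out one hidden unit."""
    h = np.maximum(0.0, W1 @ x)
    if ablate_unit is not None:
        h[ablate_unit] = 0.0
    return W2 @ h

def component_attribution(x, W1, W2):
    """Score each hidden unit by how much ablating it changes the output."""
    base = mlp_forward(x, W1, W2)
    return [abs(base - mlp_forward(x, W1, W2, ablate_unit=i))
            for i in range(W1.shape[0])]

# Toy network where only the first hidden unit carries the signal.
x = np.array([1.0, 1.0])
W1 = np.array([[1.0, 1.0], [0.0, 0.0]])   # unit 0 active, unit 1 dead
W2 = np.array([3.0, 5.0])
scores = component_attribution(x, W1, W2)
```

<p>The dead unit gets an ablation score of zero and the active one absorbs all the credit; targeted model editing builds on exactly this kind of localization.</p>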

<h2 id="why-attribution-matters">Why Attribution Matters</h2>

<p>From my research experience, I’ve found that attribution is crucial for:</p>

<ol>
  <li><strong>Building Trust</strong>: Users need to understand why models make certain decisions</li>
  <li><strong>Debugging Models</strong>: Finding and fixing problematic behaviors</li>
  <li><strong>Data Quality</strong>: Identifying low-quality or biased training data</li>
  <li><strong>Regulatory Compliance</strong>: Many domains require explainable AI</li>
</ol>

<h2 id="challenges-in-attribution">Challenges in Attribution</h2>

<p>Working on attribution research has taught me that there are several fundamental challenges:</p>

<ul>
  <li><strong>Ground Truth</strong>: How do we know if our attributions are “correct”?</li>
  <li><strong>Scalability</strong>: Many methods don’t scale to modern large language models</li>
  <li><strong>Faithfulness</strong>: Do the attributions actually reflect the model’s reasoning?</li>
</ul>
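<p>The faithfulness question can at least be probed empirically with a deletion test: mask tokens from most- to least-attributed and check that the score falls faster than under the reverse ordering. The scorer and attributions below are toy stand-ins I chose so the effect is obvious:</p>

```python
def deletion_curve(score_fn, tokens, order, mask="[MASK]"):
    """Mask tokens one by one in the given order, recording the score after
    each step; a faster-falling curve means the ordering found the
    important tokens sooner."""
    toks = list(tokens)
    curve = [score_fn(toks)]
    for i in order:
        toks[i] = mask
        curve.append(score_fn(toks))
    return curve

# Toy scorer and attributions: "good" is the only token that matters.
score_fn = lambda toks: float(toks.count("good"))
tokens = ["the", "movie", "was", "good"]
attributions = [0.0, 0.0, 0.0, 1.0]
best_first = sorted(range(4), key=lambda i: -attributions[i])
worst_first = best_first[::-1]
faithful = deletion_curve(score_fn, tokens, best_first)
unfaithful = deletion_curve(score_fn, tokens, worst_first)
```

<p>A large gap between the two curves is evidence (though not proof) that the attributions track what the model actually uses.</p>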

<h2 id="current-research-directions">Current Research Directions</h2>

<p>The field is rapidly evolving with exciting developments in:</p>
<ul>
  <li><strong>Mechanistic Interpretability</strong>: Understanding the circuits within transformers</li>
  <li><strong>Efficient Attribution</strong>: Methods that work with billion-parameter models</li>
  <li><strong>Multimodal Attribution</strong>: Extending attribution to vision-language models</li>
</ul>

<h2 id="example-simple-gradient-based-attribution">Example: Simple Gradient-Based Attribution</h2>

<p>Here’s a quick example of how you might compute simple gradient-based attribution:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def compute_input_attribution(model, input_ids, target_token_id):
    """
    Compute gradient-based attribution for input tokens.
    """
    # Embed the input, then detach so the embeddings become a leaf
    # tensor -- otherwise backward() never populates embeddings.grad
    embeddings = model.get_input_embeddings()(input_ids)
    embeddings = embeddings.detach().requires_grad_(True)

    # Forward pass using the embeddings directly
    outputs = model(inputs_embeds=embeddings)
    logits = outputs.logits

    # Probability of the target token at the last position
    target_prob = torch.softmax(logits[0, -1], dim=-1)[target_token_id]

    # Backward pass populates embeddings.grad
    target_prob.backward()

    # Attribution is the gradient magnitude per token
    attribution = embeddings.grad.norm(dim=-1)

    return attribution.detach()
</code></pre></div></div>

<p>This is just scratching the surface, but it gives you an idea of how we can start understanding what drives model predictions.</p>

<h2 id="looking-forward">Looking Forward</h2>

<p>As language models become more powerful and ubiquitous, the need for robust attribution methods will only grow. I’m excited to continue working on making these models more interpretable and trustworthy.</p>

<p>What aspects of attribution are you most interested in? Feel free to reach out if you’d like to discuss any of these topics further!</p>

<hr />

<p><em>This post is based on insights from my ongoing research on attribution methods. Stay tuned for more technical deep-dives!</em></p>]]></content><author><name>Yupei Du</name></author><category term="attribution" /><category term="language-models" /><category term="interpretability" /><category term="research" /><summary type="html"><![CDATA[Attribution has become one of the most crucial topics in making language models more transparent and trustworthy. In this post, I’ll share some insights from my research on attribution methods and why they matter for building safe AI systems.]]></summary></entry></feed>