Inspiration
An LLM can't learn from corrections, which is kind of absurd when you think about it. You tell it it's wrong, it apologizes, and then next conversation it makes the exact same mistake because nothing about the model actually changed. The SDFT paper (Shenfeld et al., ICLR 2025) showed that self-distillation lets you fine-tune without catastrophic forgetting, but it was all benchmark experiments on curated datasets. We wanted to see what happens when you put that in a real chat loop where a user just talks to the model and corrects it naturally.
What it does
You talk to a local model, and if you correct it, the system picks up on that, generates training data around the correction, and fine-tunes in the background while you keep chatting. The header shows a progress bar during training and flips green when it's done. Multiple chat sessions are stored in SQLite but the corrections carry across all of them because it's the model weights themselves that changed, not some per-session prompt injection.
How we built it
FastAPI backend running on a single GPU, vanilla HTML/CSS/JS frontend with a sidebar for sessions and a live training indicator. No React, no framework, nothing fancy.
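For a rough sense of how the pieces connect, here's a minimal sketch of the backend shape: a chat endpoint that answers immediately and kicks training off the request path, plus a status endpoint the frontend polls to drive the progress bar. The route names, helper functions, and the correction check below are illustrative placeholders, not our exact code.

```python
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()
training_state = {"phase": "idle", "progress": 0.0}

class ChatRequest(BaseModel):
    session_id: int
    message: str

def generate_reply(session_id: int, message: str) -> str:
    return "(local model inference goes here)"   # placeholder

def looks_like_correction(message: str) -> bool:
    return False                                  # placeholder for the correction detector

def run_sdft_update(session_id: int) -> None:
    training_state["phase"] = "training"          # placeholder: kicks off the SDFT pipeline

@app.post("/chat")
async def chat(req: ChatRequest, background: BackgroundTasks):
    reply = generate_reply(req.session_id, req.message)
    if looks_like_correction(req.message):
        # fine-tuning runs off the request path so the chat stays responsive
        background.add_task(run_sdft_update, req.session_id)
    return {"reply": reply}

@app.get("/training/status")
async def training_status():
    # polled by the frontend header to fill the progress bar and flip it green
    return training_state
```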
The training pipeline is where the work went. We adapted HuggingFace TRL's DistilTrainer to do SDFT: when a correction fires, GPT-4o-mini handles the structured parts (detecting what went wrong, generating prompt variations, writing expert demos), and then the local Qwen2.5-7B trains against a demo-conditioned copy of itself using reverse KL divergence. Teacher weights are an EMA of the student so they stay current, full fine-tuning in bfloat16, no LoRA. We serve chat during training by hooking into the trainer callback and running inference between optimizer steps, which is hacky but means the user never has to sit there waiting for training to finish.
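The core of the update is small once the data is generated. Below is a minimal sketch of the reverse-KL distillation loss and the EMA teacher update, assuming `student_ids` is the prompt plus the sampled completion and `teacher_ids` is the expert demo plus the same prompt and completion; the tensor names, slicing, and decay value are illustrative, and the real pipeline wraps this logic in TRL's trainer rather than a hand-rolled loop.

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits, teacher_logits):
    """Token-level KL(student || teacher), averaged over positions.

    The expectation is taken under the student's own distribution, so vocab
    entries the student puts almost no mass on contribute almost nothing.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    """Keep the teacher tracking the student instead of freezing it at init."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def sdft_step(student, teacher, optimizer, student_ids, teacher_ids, completion_len):
    """One update. student_ids = prompt + completion; teacher_ids = demo + prompt
    + the same completion (names here are illustrative)."""
    # Logits at these positions are the next-token predictions for the completion.
    student_logits = student(student_ids).logits[:, -completion_len - 1:-1]
    with torch.no_grad():
        teacher_logits = teacher(teacher_ids).logits[:, -completion_len - 1:-1]
    loss = reverse_kl_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    ema_update(teacher, student)
    return loss.item()
```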
Challenges we ran into
The model kept getting worse at unrelated tasks after corrections, which is exactly what SDFT is supposed to prevent, so we had to figure out where our implementation diverged from the paper.
The biggest issue was that our KL divergence was pointing in the wrong direction. Forward KL weights the loss by the teacher's probabilities across the entire vocabulary at every token position, so even positions that have nothing to do with your correction get pushed around, and over multiple corrections those small shifts compound into real degradation. Reverse KL weights by the student's own probabilities instead, which means that if the student and teacher already agree on a token, basically nothing happens and the update stays focused on what actually matters (the toy example below makes this concrete).

There were a couple of other things too: the teacher model was frozen at init instead of being an EMA that tracks the student, and the teacher prompt said "short, direct response", which made it generate in a totally different style from the student, widening the gap between their distributions in ways that had nothing to do with the actual correction.
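Here's that toy example: a single vocabulary position where the student is confident, and the teacher mostly agrees but spreads a little extra mass over the tail (the kind of mismatch a style-shifted teacher prompt creates). The logits are made up purely for illustration. The forward-KL gradient on the student's logits works out to (student prob − teacher prob), so every tail token the teacher likes gets pulled; the reverse-KL gradient at each entry is scaled by the student's own probability there, so the tail barely moves.

```python
import torch
import torch.nn.functional as F

def forward_kl(s_logits, t_logits):
    s, t = F.log_softmax(s_logits, -1), F.log_softmax(t_logits, -1)
    return (t.exp() * (t - s)).sum()      # expectation under the teacher

def reverse_kl(s_logits, t_logits):
    s, t = F.log_softmax(s_logits, -1), F.log_softmax(t_logits, -1)
    return (s.exp() * (s - t)).sum()      # expectation under the student

# A position unrelated to the correction: student is confident in token 0,
# teacher also prefers token 0 but spreads some extra mass over the tail.
student = torch.tensor([6.0, 0.0, 0.0, 0.0, 0.0], requires_grad=True)
teacher = torch.tensor([4.0, 1.5, 1.5, 1.5, 1.5])

forward_kl(student, teacher).backward()
print(student.grad)   # every entry gets a noticeable push (student prob - teacher prob)
student.grad = None
reverse_kl(student, teacher).backward()
print(student.grad)   # tail entries barely move: each gradient is scaled by the student's own prob
```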
Accomplishments that we're proud of
The model genuinely accumulates corrections. You can correct it on three different topics and it retains all three while still being normal on everything else, which is the core promise of the paper and we got it working in a live interactive setting. The background training UX is also something we're happy with since gradient descent is literally running behind your conversation and the only indication is a progress bar filling up.
What we learned
The direction of the KL divergence matters way more than we expected. The paper writes reverse KL in its equations but the reference code actually ships forward KL, and that works fine when you train on hundreds of diverse examples because the noise at irrelevant token positions cancels out across the dataset. With ten examples about one fact it doesn't cancel, everything drifts, and you get exactly the forgetting you were trying to avoid. The other lesson is that on-policy learning only prevents forgetting when the teacher distribution stays close to the student's, and there are a surprising number of ways to accidentally violate that (bad prompt template, frozen teacher weights, wrong KL direction), each of which independently brings the forgetting back.
What's next for LifeLearn
Adding a replay buffer so each training run mixes in some data from past corrections, since reverse KL handles most of the forgetting but an explicit anchor would help as you get into dozens of updates. After that we want to extend beyond factual corrections to style and reasoning, and let the model flag low-confidence answers proactively so it can ask for help before getting something wrong.
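Here's a sketch of what that replay buffer might look like, assuming each correction's generated training examples are kept as JSONL on disk; the file name, sampling ratio, and API are placeholders rather than a committed design.

```python
import json
import os
import random

class ReplayBuffer:
    """Keeps every past correction's training examples on disk (hypothetical)."""

    def __init__(self, path="corrections.jsonl"):
        self.path = path

    def add(self, examples):
        with open(self.path, "a") as f:
            for ex in examples:
                f.write(json.dumps(ex) + "\n")

    def sample(self, k):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            past = [json.loads(line) for line in f]
        return random.sample(past, min(k, len(past)))

def build_training_set(new_examples, buffer, replay_ratio=0.5):
    """Mix replayed examples into each SDFT run as an explicit anchor."""
    replayed = buffer.sample(int(len(new_examples) * replay_ratio))
    buffer.add(new_examples)          # the new correction joins the buffer for next time
    mixed = new_examples + replayed
    random.shuffle(mixed)
    return mixed
```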