BioModal

Taking the Mask Off: What Protein Binder RL Is Actually Learning

Utkarsh Singh — Wed, 15 Apr 2026 17:24:16 GMT

This post builds on ideas from Nick Boyd’s essay on RL and PDB-derived reward signals, particularly his analysis of structural diversity and reward bias.

The models have been getting better at generating first-pass binders. Brian Naughton’s guide walks you through designing a VHH from scratch: pick a target, run your generative model of choice, filter by your in-silico structure metric of choice, and send it away to the lab for in vitro testing.

As Nick Boyd (founder of Escalante) notes, there are currently two main approaches to computational protein binder design: optimization (exemplified by BindCraft) and generative models (like BoltzGen). BoltzGen is faster per design, but its per-binder quality is much lower than BindCraft's; BindCraft has the opposite profile, so the net computational cost of reaching a good binder is roughly the same. In his essay, Boyd shows how you can improve on BoltzGen by borrowing the LLM post-training playbook: finetune on high-quality hallucinated binders from optimization-based methods, apply GRPO, and watch your binder quality improve.

It would seem as if the binder generation problem has been solved, but that’s far from the case.

Leash Bio showed that when we analyze what ChEMBL-trained models actually learn, it’s less about molecular binding and more about which medicinal chemists tend to make which kinds of molecules. Their model Hermes, a lightweight 50M-parameter sequence-only transformer trained entirely on combinatorially synthesized molecules with no human design intent, outperforms Boltz2 on out-of-distribution chemistry despite being several orders of magnitude smaller.

There are uncomfortable parallels to be observed. ChEMBL is biased by what medicinal chemists chose to synthesize: their preferences, their intuitions, their career-long optimization toward molecules that look drug-like to a human expert. PDB has the same problem. It’s biased by what structural biologists chose to crystallize: proteins that are stable enough to survive purification, well-behaved enough to form crystals, interesting enough to justify the grant money. Both are human-curated snapshots of a tiny corner of an astronomically vast space.

As Nick Boyd emphasizes in his essay, BoltzGen, Protenix, and ProteinMPNN are all downstream of PDB: structures that structural biologists chose to crystallize. When you apply GRPO with a reward signal that's just PDB all the way down, can you confidently say you’re learning what makes a good binder in a general sense? Or are you optimizing for something a structural biologist would have crystallized?

In this article, we explore whether applying reinforcement learning causes a collapse in generated binders. We generated 200 structures per condition across three model checkpoints — base BoltzGen, an SFT checkpoint finetuned on high-quality Mosaic hallucinated binders, and an RL checkpoint trained on top of that — and evaluated structural diversity using Foldseek across five targets spanning in-distribution (ACE2, CCL2, PDL1), near-OOD (EGFR), and far-OOD (KRAS) regimes.

Methods

Trimming the targets

For PDL1, CCL2, and KRAS, full protein sequences were used. For EGFR and ACE2, truncations were necessary to fit within GPU memory constraints on an RTX 5090. For EGFR, residues 190–505 were retained, preserving domains II and III — the core dimerization arm and primary ligand-binding domain. For ACE2, the full M2 peptidase domain was kept, covering the entire SARS-CoV-2 binding interface. Both truncated structures were refolded with AlphaFold3 to confirm iPTM did not drop meaningfully.
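As a minimal sketch of the truncation step, the snippet below slices a sequence by a 1-indexed, inclusive residue range. The helper name `truncate_by_residue_range` and the placeholder sequence are hypothetical, and it assumes the residue numbers line up with positions in the sequence string being sliced, which depends on the numbering convention used for the target.

```python
def truncate_by_residue_range(sequence: str, start: int, end: int) -> str:
    """Return residues start..end of a sequence (1-indexed, inclusive)."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(f"range {start}-{end} is outside 1-{len(sequence)}")
    # Python slices are 0-indexed and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


# Placeholder sequence only; this is NOT the real EGFR sequence.
placeholder = "A" * 600
fragment = truncate_by_residue_range(placeholder, 190, 505)
print(len(fragment))  # 316 residues, i.e. 505 - 190 + 1
```

A 190–505 window keeps 316 residues, which is what makes the complex small enough to fit in memory in the first place.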

Evaluating Structural Diversity amongst the Binders

Once we had binders for all targets, we used Foldseek to perform three steps: database construction, intra-set structural clustering at TM-score threshold 0.5, and easy-search against PDB100. We report Shannon entropy of the cluster distribution as our primary diversity metric, cluster diversity (unique clusters / N) as a secondary metric, and mean TM-score to nearest PDB hit as our PDB sociology signal.
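The secondary metric can be sketched directly from a cluster assignment table. The snippet below assumes a two-column, tab-separated file with one row per structure (cluster representative, then member), which is the layout Foldseek's cluster TSV output uses; the function names and the toy rows are illustrative and are not taken from our pipeline.

```python
import io
from collections import Counter


def cluster_sizes(tsv_lines):
    """Count members per cluster from 'representative<TAB>member' rows."""
    sizes = Counter()
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue
        representative, _member = line.split("\t")[:2]
        sizes[representative] += 1
    return sizes


def cluster_diversity(sizes):
    """Unique clusters divided by the number of structures (N)."""
    n = sum(sizes.values())
    return len(sizes) / n if n else 0.0


# Toy input: four binders, three in one cluster and one singleton.
toy_tsv = "b1\tb1\nb1\tb2\nb1\tb3\nb4\tb4\n"
sizes = cluster_sizes(io.StringIO(toy_tsv))
print(dict(sizes))               # {'b1': 3, 'b4': 1}
print(cluster_diversity(sizes))  # 0.5
```

Cluster diversity only counts how many clusters exist, so it cannot tell one dominant cluster plus many singletons apart from an even split; that is why entropy is the primary metric.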

Measuring Collapse: Two Lenses on the Same Problem

To measure structural diversity within each generated set, we compute the Shannon entropy of the Foldseek cluster distribution:

H = -\sum_i p_i \log p_i

where p_i is the fraction of structures belonging to cluster i. Higher entropy means the generated set spans many distinct structural solutions. Lower entropy means the set has collapsed toward a small number of dominant folds; in the extreme case of the PDL1 base model, every structure falls into a single cluster, giving H = 0.
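A minimal sketch of the entropy calculation from cluster sizes follows. The function name is hypothetical, and the natural logarithm is an assumption made for the example: the base only rescales H and does not change comparisons between checkpoints.

```python
import math


def shannon_entropy(cluster_sizes, base=math.e):
    """Shannon entropy of a cluster-size distribution.

    cluster_sizes: iterable of member counts, one per cluster.
    base: logarithm base (natural log here, which is an assumption).
    """
    sizes = [s for s in cluster_sizes if s > 0]
    total = sum(sizes)
    if total == 0:
        raise ValueError("need at least one structure")
    entropy = 0.0
    for size in sizes:
        p = size / total
        entropy -= p * math.log(p, base)
    return entropy


# One cluster holding all 200 structures: fully collapsed.
print(shannon_entropy([200]))  # 0.0

# Four equally sized clusters: entropy equals log(4).
print(round(shannon_entropy([50, 50, 50, 50]), 4))  # 1.3863
```

For k clusters the entropy is at most log(k), reached only when the clusters are equally populated, so a set with many clusters can still score low if one cluster dominates.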

To measure how PDB-like the generated structures are, we run each binder against PDB100 using Foldseek’s structural search and record the TM-score of the nearest hit. TM-score ranges from 0 to 1, where scores above 0.5 indicate the same overall fold and scores above 0.8 indicate near-identical structures.
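The nearest-hit summary can be sketched as below, assuming the search results have already been parsed into (query, target, TM-score) rows. The row layout, function names, and toy hits are illustrative assumptions; the 0.5 and 0.8 cutoffs are the ones described above.

```python
from statistics import mean


def nearest_hit_tm(rows):
    """Keep the highest TM-score per query from (query, target, tm) rows."""
    best = {}
    for query, _target, tm in rows:
        if tm > best.get(query, -1.0):
            best[query] = tm
    return best


def summarize(best):
    """Mean nearest-hit TM-score plus the fraction above each cutoff."""
    scores = list(best.values())
    return {
        "mean_nearest_tm": mean(scores),
        "frac_same_fold": sum(s > 0.5 for s in scores) / len(scores),
        "frac_near_identical": sum(s > 0.8 for s in scores) / len(scores),
    }


# Toy hits (made-up identifiers and scores), two PDB hits per binder.
hits = [
    ("b1", "hitA", 0.91),
    ("b1", "hitB", 0.62),
    ("b2", "hitA", 0.47),
    ("b2", "hitC", 0.55),
]
best = nearest_hit_tm(hits)
summary = summarize(best)
print(best)  # {'b1': 0.91, 'b2': 0.55}
print(round(summary["mean_nearest_tm"], 2))  # 0.73
```

Taking the maximum per query rather than the mean over all hits is what makes this a "nearest neighbor in PDB" measure.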

Results

PDL1 - Clones all the way down

On PDL1, base BoltzGen produced 200 binders that Foldseek collapsed into a single cluster, giving a Shannon entropy of 0.0. The mean TM-score to the nearest PDB hit was 0.946. The model had already memorized the answer before any RL ever happened.

One possible explanation is overrepresentation: PDL1 is one of the most studied drug targets in structural biology, and antibody drugs such as atezolizumab and durvalumab have crystal structures in PDB. This is consistent with our hypothesis that the training data is so saturated with examples that the model can’t generate anything else.

For less PDB-saturated targets, RL induces measurable collapse.

On KRAS, an oncology target with far fewer known binders in PDB, base BoltzGen generates a wide range of binders: 29 clusters from 200 structures. After RL finetuning, this collapses to 10 clusters, a 68% reduction in structural entropy. We see the same pattern on CCL2, where the RL checkpoint generates a less diverse set of structures.

On EGFR, SFT finetuning increases structural diversity relative to base BoltzGen (entropy 1.06 → 2.14), before RL collapses it back down. The PDB TM-score after RL actually decreases on EGFR, as it does on PDL1. One reading is that finetuning breaks the base model’s memorized mode and recovers diversity, and RL then imposes a new, different form of collapse. Whether this recovered diversity is meaningful or simply a different memorized mode is an open question we can't resolve without experimental validation.
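To put the figures quoted above on a common scale, here is a small arithmetic sketch. It uses only numbers stated in the text (29 and 10 clusters out of 200 on KRAS, entropy 1.06 and 2.14 on EGFR); the helper name is hypothetical.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from before to after."""
    return (after - before) / before


n_structures = 200

# KRAS: cluster diversity (unique clusters / N) before and after RL.
print(f"{29 / n_structures:.3f}")  # 0.145
print(f"{10 / n_structures:.3f}")  # 0.050

# KRAS: change in the number of clusters, base -> RL.
print(f"{relative_change(29, 10):+.1%}")  # -65.5%

# EGFR: change in entropy, base -> SFT.
print(f"{relative_change(1.06, 2.14):+.1%}")  # +101.9%
```

The KRAS cluster count falls by about 65.5%, close to the reported 68% drop in entropy, while SFT roughly doubles the EGFR entropy.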

Conclusion

For targets heavily represented in PDB, the base model is already collapsed; RL can’t push it further and, if anything, only slightly breaks the memorized mode. It is possible that running RL for longer horizons would change this. For less represented targets (KRAS, CCL2), the base model does seem to have diversity that RL destroys. This runs counter to what RL is supposed to do. The reward signal, Protenix iPTM, is itself PDB-trained, so it reinforces structures that resemble PDB proteins, and the severity of collapse tracks how well the target is represented in that training data.

This may have a grim therapeutic implication for hard-to-drug targets. KRAS is one of the most clinically important undruggable oncology targets because its shallow binding pocket requires non-helical contacts: beta-strand mimetics, cyclic peptides, and designed loops that can engage the flat RAS surface.¹ An RL-finetuned model that has collapsed toward helical bundle solutions is less likely to design binders specific to KRAS and more likely to generate its favorite motif and hope it sticks.

As for whether to use RL at all: it may be worth doing, depending on how similar your target is to the data your model was trained on.

What comes next

The next step, then, is to find reward signals that are qualitatively different from yet another PDB-derived term in the loss function, or to look at entropy regularizers. As Nick Boyd discusses, one alternative is physics-based reward signals (e.g., Rosetta energy). Whether bringing such a signal into the RL loop resolves the mode collapse is what I want to explore next.

Thanks to Brian Naughton and Dr Aaron Ring for feedback on an early version of this post.

100 Open questions to think about

Utkarsh Singh — Mon, 20 Feb 2023 13:49:02 GMT

Here’s a list of a few questions I came up with while filling out an application for SPARC. Most of them revolve around rationality, AI and human decision-making in general. Do let me know if you find these interesting to think about, have any comments, or know the answers to any of these questions!

Does Effective Altruism’s emphasis on future generations belittle the needs of the current one, and if so, is this morally appropriate?
Why are calls to people's emotions more ‘effective’ than rationality?
How does effective forecasting like Fermicasting provide any plausible benefit? Are people reasonably confident to make meaningful decisions based on the heuristics of other people?
Does crypto really have any inherent value? And what is something we can do with crypto that we’d be unable to do with ordinary money?
Artificial intelligence is trained on human data. Why then are we outraged when a word-predicting model outputs something outrageous?
It is not impossible that artificial intelligence might be better at decision-making than humans. If so, would it be better to align AI to human values, or to leave it some areas for independent decision-making?
Several people argue that AI would create rather than reduce net jobs, including for unskilled blue-collar workers. What are some of these jobs?
Is it possible to satisfy the human craving for companionship solely through artificial intelligence?
What's the best way to make a positive impact on the world?
What's the end goal of humans? Is it to optimize individual happiness?
What's the best way to form meaningful relationships with people?

What are some of the main issues plaguing AI Alignment?
Some of the current language models are owned by companies as these are expensive to run. Will Open Source models ever come to be as competitive as them?
How should developing countries optimize for development and progress while ensuring that they’re not accelerating climate change?
How does one build depth in a specific niche of AI while staying ‘dangerous’ in other fields?
What are the best ways to solve problems of distribution like hunger and poverty?
What's the best solution to the current problem of ChatGPT making up nonexistent information and sources?
To what extent should freedom of speech be allowed?
With students using ChatGPT for almost all essay prompts, what are examples of areas, if any, where only humans will be able to make intelligible responses?
What are some of the best ways to learn about opinions in philosophy, and try to answer questions?
With OpenAI constantly patching the different jailbreaks being used to bypass its content policies,
What’s more likely to make people mad: something that’s false or true?
Is there a reproducible process for making pop songs that AI can replicate?
Will AI ever be able to better understand us than we do ourselves?
Is dissociation from emotions better or worse for making decisions, particularly in areas like friendship or family?
Should powerful AI systems behave in the way users want or the way their creators intend?
What's the best way to depolarise society from its current state?
The current language models are being trained on data that don't accurately reflect all strata of society, what are the best ways to overcome this?
What are some of the best ways to form contrarian ideas that are right?
What safeguards does crypto have if, hypothetically, prop shops start trading it and intentionally inflate or deflate its value?
What are some of the best ways to get better at forecasting?
Would universal basic income promote innovation or slow it down?
How do we get better at noticing things overlooked by others?
Would it be possible to build a programming language that generates additional syntax to solve different needs?
What are the best ways to help solve the rising issue of loneliness?
How does one get to regularly interact and learn from successful people?
Is it ever truly possible to overcome our biases, and how can we do this?
What are some of the positive outcomes we can get through gene editing, and how can we make an impact in this field?
Is there a thing like free thought?
Do Animals Dream?
How do various religions differ in the nature and magnitude of their effects?
What influences when people act in accordance with their self-interest and when they don't?
How does mental imagery work? How do we improve its function?
Do people have different levels of self-control or do they just experience temptation differently?
What makes a good life? How do we study this?
We remember dreams almost perfectly right after waking up and then the memory rapidly recedes and disappears completely, unless we write them down. This isn’t how normal memories function. So, why the difference?
What is “personal productivity” and why does it vary so much from day to day (e.g., Weinberger et al. 2018)? And why does it not seem to correlate with environmental variables like weather or sleep quality?
Does listening to music improve or worsen memory?
What is consciousness?
What would happen if we could travel faster than the speed of light?
How much of our behavior is determined by nature versus nurture?
How does language shape the way we think?
What makes some memories more vivid than others?
What does it really mean to be ‘self-aware’?
What laws should be imposed by governments on generative AI, if any?
Is rationality a universal trait, or are there cultural differences in what is considered rational behavior?
What determines how we perceive time? Is it the same for everyone?
How do people make major decisions in their lives? When and why does it come up, and how do they go about making those decisions?
Are we all fundamentally selfish, even when we do things for others' benefit? Or are there truly settings (intrinsic and/or exogenous) where we do things that are good for others but bad for both our short-term future and long-term future selves?
Is there a way to conduct research without bias in funding? How?
Would it be feasible for prop trading shops to be owned by the government to ensure market liquidity?
What would be the best way to go about building a large language model to rival that of GPT-3?
What basically goes on in the brain, when we design or think of something ‘new’ or never seen before?
Are people able to concentrate more effectively under total silence?
What are the factors that influence the speed and accuracy of learning a new language?
What is the best way to factor in risk while making uncertain decisions?
How does trusting the ‘gut’ work?
If we ever find a way to significantly extend the human health span or reverse ageing, what could that post-death society look like?
What are the neural mechanisms underlying consciousness, and how can we study and manipulate these mechanisms?
Would it be possible to prevent shrinkage of the brain?
How does neuroplasticity differ between the developing brain and the adult brain?
What is happening in the brain when a human questions?
What is the probability that there is microbial-like life (other than from Earth) in our solar system?
Is string theory more closely correct than any other current theory of physics?
What's the best way to determine if someone would be a good friend for you?
How do you ask the right questions?
How do I get people to like me?
How do you tell the difference between a preference and a bias?
What is the probability that I might be sleep deprived if I wake up before my alarm goes off more than 95% of the time?
What do other people subjectively experience when they are thinking? To me it’s like talking to myself (in verbal English sentences) but I'm told that isn't universal.
When is self-denial useful in altering your desires, vs satisfying them so you can devote time to other things?
How does one define wisdom?
What happens to consciousness once you fall asleep?
Can charisma be taught?
Why is it so hard to predict success?
Why are we so fascinated by coincidences?
Is It Wrong to Enjoy Yourself While the World Is Burning?
Is it more important to help society or to help yourself?
How can we stop confusing correlation with causation?
Why Do We Want What We Can’t Have?
Which Matters More, a First or Last Impression?
How do I improve my ability to simulate/guess other people's internal states and future behaviours?
How do I work out what I want and what I should do?
Would the human race be eradicated in a worst-possible-scenario nuclear incident, or would merely a lot of people die?
What could be the potential downsides of building a universal sign language?
How do people ascertain emotions in certain songs?
Do animals ever 'ask questions'?
When you forget a thought, where does this thought go?