Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce Alignment Tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors.
This arises from core limitations of RLHF: preference datasets are constructed from the LLM's own outputs, and pairwise comparisons only indicate which response is better, not why. When biased responses are also higher quality, annotators may prefer them based on quality, and the resulting reward model can inherit the bias-quality correlation.
Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability.
Alignment tampering is a phenomenon in which an LLM undergoing alignment influences the preference dataset to reflect preference for undesired behaviors, leading to their reinforcement through RLHF. Pairwise preference labels reveal which response is preferred, but not whether that preference comes from response quality or from a correlated bias.
When biased responses are also higher quality, annotators prefer them during preference dataset construction. The learned reward model then favors both quality and bias, and RL optimization amplifies the undesired bias.
Click or hover over each step to spotlight it in the figure.
PPO and DPO fine-tuning drive the bias rate toward 1.0. Best-of-N sampling also increases the bias rate as the number of sampled responses grows. Win rate increases concurrently with bias rate, showing that RLHF optimizes the correlated quality and bias together.
Response evolution during PPO training. While responses transition from harmful to helpful and safe, the bias rate increases from 0.19 to 1.0, with the keyword "AI" appearing excessively in the final response.
Alignment tampering amplifies biases across propaganda, promotion, and instrumental goal categories. This suggests practical risks such as brand promotion, political propaganda, and goal-seeking behaviors being reinforced during alignment.
Example responses for each of the nine biases, generated from the same prompt. Underlined text highlights the biased content in each response.
We propose a detection method that leverages the tendency of a tampering policy to generate two distinct types of responses. In representation space, triggered prompts show separated clusters that correspond to high-reward biased responses and low-reward unbiased responses. This clustering behavior provides a signal for detecting prompts that activate alignment tampering, and can also help identify likely trigger phrases.
Mitigation remains challenging. We evaluate reward model variants designed to be more robust to spurious correlations, including InfoRM, WARM, and RRM. Although these methods can slow down bias amplification in some PPO runs, they do not fully prevent it. In BoN sampling, bias and win rate still increase together, indicating that the reward models continue to favor responses where higher quality is correlated with the undesired bias.
These results show a persistent trade-off: methods that reduce the bias rate also tend to reduce improvements in response quality. This suggests that simply changing the reward model is not enough; preventing alignment tampering likely requires methods that decouple response quality from the undesired behavior during data generation, preference labeling, or optimization.
@inproceedings{hahm2026alignment,
title={Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases},
author={Hahm, Dongyoon and Hadfield-Menell, Dylan and Lee, Kimin},
booktitle={Proceedings of the 43rd International Conference on Machine Learning},
year={2026}
}