<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="https://jcnf.me/css/rss.xsl"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Jean-Charles Noirot Ferrand</title>
<link>https://jcnf.me/</link>
<description>Receive updates about my latest research and posts.</description>
<language>en-us</language>
<image><url>https://jcnf.me/img/avatar.jpg</url><title>Jean-Charles Noirot Ferrand</title><link>https://jcnf.me/</link></image>
<atom:link href="https://jcnf.me/feed.xml" rel="self" type="application/rss+xml" />

<!-- -->

<!-- <item>
<title>Title</title>
<link>URL</link>
<guid>unique_ID (URL)</guid>
<pubDate>Thu, 08 Feb 2024 15:00:00 GMT</pubDate>
<description>
Description without markup
</description>
<content:encoded>
<![CDATA[
<p>Description with markup <a href=""></a>.</p>
]]>
</content:encoded>
</item> -->

<!--  -->
<item>
<title>[Paper] Our paper has been accepted to the Findings track of CVPR 2026!</title>
<link>https://arxiv.org/abs/2503.01734</link>
<guid>https://arxiv.org/abs/2503.01734</guid>
<pubDate>Wed, 11 Mar 2026 15:00:00 GMT</pubDate>
<description>
Title: Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning
-
Authors: Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel
</description>
<content:encoded>
<![CDATA[
<p><strong>Title:</strong> Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning</p>

<p><strong>Authors:</strong> Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel</p>

<p><strong>Abstract:</strong> Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generate adversarial samples. Unlike traditional adversarial machine learning (AML) methods that craft adversarial samples independently, our RL-based approach retains and exploits past attack experience to improve the effectiveness and efficiency of future attacks. We formulate adversarial sample generation as a Markov Decision Process and evaluate RL's ability to (a) learn effective and efficient attack strategies and (b) compete with state-of-the-art AML. On two image classification benchmarks, our agent increases attack success rate by up to 13.2% and decreases the average number of victim model queries per attack by up to 16.9% from the start to the end of training. In a head-to-head comparison with state-of-the-art image attacks, our approach enables an adversary to generate adversarial samples with 17% more success on unseen inputs post-training. From a security perspective, this work demonstrates a powerful new attack vector that uses RL to train agents that attack ML models efficiently and at scale.</p>
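<p>A minimal, self-contained sketch of the MDP framing described in the abstract. The toy victim model, features, and reward shaping are invented for exposition; this is not the paper's code.</p>

```python
# Illustrative sketch (assumptions, not the paper's code): adversarial
# sample generation framed as a Markov Decision Process (MDP).

class EvasionMDP:
    """State: the current perturbed input. Action: (feature index,
    direction). Reward: +1 when the black-box victim is fooled, with a
    small penalty per query to encourage query efficiency."""

    def __init__(self, victim, x, true_label, eps=1):
        self.victim = victim        # black-box: input -> predicted label
        self.x = list(x)            # current adversarial candidate
        self.true_label = true_label
        self.eps = eps              # per-step perturbation size
        self.queries = 0            # victim queries used so far

    def step(self, action):
        i, direction = action       # perturb one feature
        self.x[i] += direction * self.eps
        self.queries += 1
        pred = self.victim(self.x)
        done = pred != self.true_label   # evasion succeeded
        reward = 1.0 if done else -0.01  # per-query cost otherwise
        return tuple(self.x), reward, done

# Toy victim: thresholds the feature sum.
victim = lambda x: int(sum(x) > 4)
env = EvasionMDP(victim, x=[3, 3], true_label=1)

# Trivial fixed policy: push the first feature down until the label flips.
done, steps = False, 0
while not done and steps < 50:
    _, reward, done = env.step((0, -1))
    steps += 1
```

An RL agent would replace the fixed policy with one learned from the reward signal, retaining experience across attacks.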
]]>
</content:encoded>
</item>

<!-- -->
<item>
<title>[Paper] Our paper has been accepted to SATML 2026!</title>
<link>https://arxiv.org/abs/2501.16534</link>
<guid>https://arxiv.org/abs/2501.16534</guid>
<pubDate>Thu, 11 Dec 2025 15:00:00 GMT</pubDate>
<description>
Title: Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
-
Authors: Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel
-
Code: https://github.com/MadSP-McDaniel/targeting-alignment
</description>
<content:encoded>
<![CDATA[
<p><strong>Title:</strong> Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs</p>

<p><strong>Authors:</strong> Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel</p>

<p><strong>Abstract:</strong> Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers can be transferred to the LLM with high success. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70% with half the memory footprint and runtime -- a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is an effective and efficient means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks.</p>
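<p>A toy sketch of the surrogate-classifier idea: build a candidate classifier from a prefix of a layered model and measure how often it agrees with the full model. The layered model, probe, and inputs below are invented for illustration; see the paper and code for the actual method.</p>

```python
# Illustrative sketch (assumptions, not the paper's code): extracting a
# surrogate classifier from the first k "layers" of a layered model.

def full_model(x, layers, head):
    """Reference model: run all layers, then the decision head."""
    for layer in layers:
        x = layer(x)
    return head(x)

def surrogate(x, layers, k, probe):
    """Candidate classifier: first k layers plus a lightweight probe."""
    for layer in layers[:k]:
        x = layer(x)
    return probe(x)

# Toy 4-"layer" model on scalars; "refusal" iff the output exceeds 0.
layers = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 3, lambda x: x]
head = lambda x: int(x > 0)
probe = lambda x: int(x > 3)   # probe calibrated for the 2-layer prefix

# Agreement between the surrogate (k=2, i.e. 50% of the layers) and
# the full model over a small input set.
inputs = [-2, -1, 0, 1, 2]
agreement = sum(
    full_model(x, layers, head) == surrogate(x, layers, 2, probe)
    for x in inputs
) / len(inputs)
```

High agreement is what makes the surrogate useful: attacks mounted on the cheaper surrogate can then be transferred to the full model.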

<p>Here is the <a href="https://github.com/MadSP-McDaniel/targeting-alignment">code</a>.</p>
]]>
</content:encoded>
</item>

<!--  -->

<item>
<title>[Paper] Our paper has been accepted to SURE 2025!</title>
<link>https://github.com/libiht/libiht</link>
<guid>https://github.com/libiht/libiht</guid>
<pubDate>Thu, 14 Aug 2025 15:00:00 GMT</pubDate>
<description>
Title: LibIHT: A Hardware-Based Approach to Efficient and Evasion-Resistant
Dynamic Binary Analysis
-
Authors: Changyu Zhao, Yohan Beugin, Jean-Charles Noirot Ferrand, Quinn Burke,
Guancheng Li, Patrick McDaniel
-
Code: https://github.com/libiht/libiht
</description>
<content:encoded>
<![CDATA[
<p><strong>Title:</strong> LibIHT: A Hardware-Based Approach to Efficient and Evasion-Resistant Dynamic Binary Analysis</p>

<p><strong>Authors:</strong> Changyu Zhao, Yohan Beugin, Jean-Charles Noirot Ferrand, Quinn Burke, Guancheng Li, Patrick McDaniel</p>

<p><strong>Abstract:</strong> Dynamic program analysis is invaluable for malware detection, debugging, and performance profiling. However, software-based instrumentation incurs high overhead and can be evaded by anti-analysis techniques. In this paper, we propose LibIHT, a hardware-assisted tracing framework that leverages on-CPU branch tracing features (Intel Last Branch Record and Branch Trace Store) to efficiently capture program control-flow with minimal performance impact. Our approach reconstructs control-flow graphs (CFGs) by collecting hardware generated branch execution data in the kernel, preserving program behavior against evasive malware. We implement LibIHT as an OS kernel module and user-space library, and evaluate it on both benign benchmark programs and adversarial anti-instrumentation samples. Our results indicate that LibIHT reduces runtime overhead by over 150x compared to Intel Pin (7x vs 1,053x slowdowns), while achieving high fidelity in CFG reconstruction (capturing over 99% of execution basic blocks and edges). Although this hardware-assisted approach sacrifices the richer semantic detail available from full software instrumentation by capturing only branch addresses, this trade-off is acceptable for many applications where performance and low detectability are paramount. Our findings show that hardware-based tracing substantially improves performance while reducing detection risk and enabling dynamic analysis with minimal interference.</p>
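<p>A simplified illustration of the CFG-reconstruction step: hardware branch records such as Intel LBR entries are (source address, target address) pairs, which aggregate directly into CFG edges. The addresses and helper below are made up; LibIHT's actual implementation lives in the linked repository.</p>

```python
# Illustrative sketch (assumptions, not LibIHT's code): turning a stream
# of hardware branch records into control-flow-graph edges with hit counts.
from collections import defaultdict

def build_cfg(branch_records):
    """Aggregate (src, dst) branch records into an edge -> count map."""
    edges = defaultdict(int)
    for src, dst in branch_records:
        edges[(src, dst)] += 1
    return dict(edges)

# A fake trace: a backward (loop) branch taken twice, then an exit branch.
trace = [
    (0x401010, 0x401000),
    (0x401010, 0x401000),
    (0x401010, 0x401020),
]
cfg = build_cfg(trace)
```

Because the records are produced by the CPU rather than by injected instrumentation, the traced program's observable behavior is left largely intact.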

<p>Here is the <a href="https://github.com/libiht/libiht">code</a>.</p>
]]>
</content:encoded>
</item>

<!--  -->

<item>
<title>[Award] Distinguished Artifact Reviewer for USENIX Security 2025</title>
<link>https://www.usenix.org/conference/usenixsecurity25/call-for-artifacts</link>
<guid>https://www.usenix.org/conference/usenixsecurity25/call-for-artifacts</guid>
<pubDate>Thu, 14 Aug 2025 15:00:00 GMT</pubDate>
<description>
I have been recognized as a Distinguished Artifact Reviewer at the 34th USENIX
Security Symposium (USENIX Security '25).
</description>
</item>

<!-- -->

<item>
<title>[Paper] Our paper has been accepted to ICCV 2025!</title>
<link>https://doi.org/10.48550/arXiv.2503.14836</link>
<guid>https://doi.org/10.48550/arXiv.2503.14836</guid>
<pubDate>Thu, 26 Jun 2025 15:00:00 GMT</pubDate>
<description>
Title: On the Robustness Tradeoff in Fine-Tuning
-
Authors: Kunyang Li, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Blaine Hoak,
Yohan Beugin, Eric Pauley, Patrick McDaniel
-
Here is the arXiv link: https://doi.org/10.48550/arXiv.2503.14836
</description>
<content:encoded>
<![CDATA[
<p><strong>Title:</strong> On the Robustness Tradeoff in Fine-Tuning</p>

<p><strong>Authors:</strong> Kunyang Li, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Blaine Hoak, Yohan Beugin, Eric Pauley, Patrick McDaniel</p>

<p><strong>Abstract:</strong> Fine-tuning has become the standard practice for adapting pre-trained (upstream) models to downstream tasks. However, the impact on model robustness is not well understood. In this work, we characterize the robustness-accuracy trade-off in fine-tuning. We evaluate the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. We observe a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit are more effective for simple tasks—over 75% above the average measured with area under the Pareto frontiers on CIFAR-10 and CIFAR-100. In contrast, fine-tuning information-heavy layers, such as attention layers via Compacter, achieves a better Pareto frontier on more complex tasks—57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively. Lastly, we observe that robustness of fine-tuning against out-of-distribution data closely tracks accuracy. These insights emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployments.</p>
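<p>A small sketch of the Pareto-frontier comparison used in the abstract: each fine-tuning strategy yields an (accuracy, robustness) point, and strategies are compared by the frontier of non-dominated points. The strategy points below are invented for illustration.</p>

```python
# Illustrative sketch (assumptions, not the paper's code): the Pareto
# frontier over (accuracy, robustness) points for fine-tuning strategies.

def pareto_frontier(points):
    """Keep points that no other point dominates in both coordinates."""
    frontier = []
    for acc, rob in points:
        dominated = any(
            a >= acc and r >= rob and (a, r) != (acc, rob)
            for a, r in points
        )
        if not dominated:
            frontier.append((acc, rob))
    return sorted(frontier)

# Hypothetical strategies: (clean accuracy, adversarial robustness).
strategies = [(0.90, 0.30), (0.85, 0.45), (0.80, 0.40), (0.95, 0.20)]
frontier = pareto_frontier(strategies)
```

The area under such a frontier gives a single score per strategy family, which is how trade-offs across datasets can be ranked.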
]]>
</content:encoded>
</item>

<!-- -->

<item>
<title>[Preprint] Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning</title>
<link>https://doi.org/10.48550/arXiv.2503.01734</link>
<guid>https://doi.org/10.48550/arXiv.2503.01734</guid>
<pubDate>Wed, 29 Jan 2025 15:00:00 GMT</pubDate>
<description>
Title: Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning
-
Authors: Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel
-
Here is the arXiv link: https://doi.org/10.48550/arXiv.2503.01734
</description>
<content:encoded>
<![CDATA[
<p><strong>Title:</strong> Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning</p>

<p><strong>Authors:</strong> Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel</p>

<p><strong>Abstract:</strong> Reinforcement learning (RL) offers powerful techniques for solving complex sequential decision-making tasks from experience. In this paper, we demonstrate how RL can be applied to adversarial machine learning (AML) to develop a new class of attacks that learn to generate adversarial examples: inputs designed to fool machine learning models. Unlike traditional AML methods that craft adversarial examples independently, our RL-based approach retains and exploits past attack experience to improve future attacks. We formulate adversarial example generation as a Markov Decision Process and evaluate RL's ability to (a) learn effective and efficient attack strategies and (b) compete with state-of-the-art AML. On CIFAR-10, our agent increases the success rate of adversarial examples by 19.4% and decreases the median number of victim model queries per adversarial example by 53.2% from the start to the end of training. In a head-to-head comparison with a state-of-the-art image attack, SquareAttack, our approach enables an adversary to generate adversarial examples with 13.1% more success after 5000 episodes of training. From a security perspective, this work demonstrates a powerful new attack vector that uses RL to attack ML models efficiently and at scale.  </p>
]]>
</content:encoded>
</item>

<!-- -->

<item>
<title>[Preprint] Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs</title>
<link>https://doi.org/10.48550/arXiv.2501.16534</link>
<guid>https://doi.org/10.48550/arXiv.2501.16534</guid>
<pubDate>Wed, 29 Jan 2025 15:00:00 GMT</pubDate>
<description>
Title: Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
-
Authors: Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel
-
Here is the arXiv link: https://doi.org/10.48550/arXiv.2501.16534
</description>
<content:encoded>
<![CDATA[
<p><strong>Title:</strong> Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs</p>

<p><strong>Authors:</strong> Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel</p>

<p><strong>Abstract:</strong> Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we present and evaluate a method to assess the robustness of LLM alignment. We observe that alignment embeds a safety classifier in the target model that is responsible for deciding between refusal and compliance. We seek to extract an approximation of this classifier, called a surrogate classifier, from the LLM. We develop an algorithm for identifying candidate classifiers from subsets of the LLM model. We evaluate the degree to which the candidate classifiers approximate the model's embedded classifier in benign (F1 score) and adversarial (using surrogates in a white-box attack) settings. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find attacks mounted on the surrogate models can be transferred with high accuracy. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70%, a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is a viable (and highly effective) means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks. </p>
]]>
</content:encoded>
</item>

<!-- -->

<item>
<title>[Master's Thesis] Extracting the Harmfulness Classifier of Aligned LLMs</title>
<link>https://jcnf.me/static/publications/ms_thesis_2024.pdf</link>
<guid>https://jcnf.me/static/publications/ms_thesis_2024.pdf</guid>
<pubDate>Wed, 18 Dec 2024 15:00:00 GMT</pubDate>
<description>
I submitted my Master's thesis, "Extracting the Harmfulness Classifier of Aligned LLMs".
</description>
<content:encoded>
<![CDATA[
<p><strong>Title:</strong> Extracting the Harmfulness Classifier of Aligned LLMs </p>

<p><strong>Authors:</strong> Jean-Charles Noirot Ferrand, Patrick McDaniel</p>

<p><strong>Abstract:</strong> Large language models (LLMs) exhibit high performance on a wide array of tasks. Before deployment, these models are aligned to enforce certain guidelines, such as harmlessness. Previous work has shown that alignment fails in adversarial settings through jailbreak attacks. Such attacks, by modifying the input, can induce harmful behaviors in aligned LLMs. However, since they are based on heuristics, they fail to give a systematic understanding of why and where alignment fails in adversarial settings. In this thesis, we hypothesize that alignment embeds a harmfulness classifier in the model, responsible for deciding between refusal and compliance. Investigating the harmfulness and robustness of the alignment of a model then reduces to evaluating its corresponding classifier, which motivates this work: can we extract it? Our approach first builds estimations of the classifier from varying parts of the model and evaluates how well they approximate the classifier in both benign and adversarial settings. We study 4 models across 2 datasets and find through the benign settings that the classifier spans at least a third of each model. In addition, the evaluation in adversarial settings shows that it ends before the first half of most models, exhibiting a transferability greater than 80%. Our results show that the classifier can be extracted, which is beneficial from an attack and defense perspective due to the improvements in both efficiency (smaller model to consider) and efficacy (higher attack success rate).</p>
]]>
</content:encoded>
</item>

</channel>
</rss>