[{"content":" Open Source View all of my projects on GitHub!\nSWE-agent Autonomous Software Engineering → View research project SWE-agent enables your language model of choice to autonomously use tools to fix issues in real GitHub repositories, find cybersecurity vulnerabilities, or perform any custom task.\nState of the art on SWE-bench among open-source projects Free-flowing \u0026 generalizable: Leaves maximal agency to the LM Configurable \u0026 fully documented: Governed by a single yaml file Made for research: Simple \u0026 hackable by design swe-agent/swe-agent v1.1.0 19.2k 2.1k mini-swe-agent Autonomous Software Engineering → View research project In 2024, SWE-bench \u0026 SWE-agent helped kickstart the coding agent revolution. We now ask: What if SWE-agent was 100x smaller, and still worked nearly as well?\nmini-swe-agent is a 100-line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo, but it scores \u003e74% on SWE-bench Verified!\nswe-agent/mini-swe-agent v2.2.8 4.4k 608 ProgramBench Can language models rebuild entire programs from scratch? → View research project ProgramBench is a benchmark for evaluating whether AI agents can architect and implement a complete codebase given only a compiled reference binary and its documentation. The 200 tasks range from small CLI tools to large applications like FFmpeg, SQLite, and PHP, and are scored against more than 248,000 behavioral tests. Current frontier models fully resolve 0% of tasks.\nfacebookresearch/programbench v1.0.2 596 38 SWE-ReX Remote Code Execution for Software Engineering Agents → View research project SWE-ReX is a runtime interface for interacting with sandboxed shell environments, allowing you to effortlessly let your AI agent run any command on any environment. SWE-ReX came out of our experiences with SWE-agent and SWE-agent EnIGMA. 
Using SWE-ReX, we\n🦖 Support fast, massively parallel agent runs (which made evaluating on large benchmarks a breeze). 🦖 Support a broad range of platforms, including non-Linux machines without Docker. 🦖 Disentangle agent logic from infrastructure concerns, making SWE-agent more stable and easier to maintain. swe-agent/SWE-ReX v1.4.0 502 110 SWE-smith Generate tens of thousands of tasks for autonomous software engineering agents → View research project SWE-smith is a toolkit for training software engineering (SWE) agents. With SWE-smith, you can:\nCreate an unlimited number of SWE-bench style task instances for any Python repository. Generate trajectories of SWE-agent solving those task instances. Train local LMs on these trajectories to improve their software engineering capabilities (SWE-agent-LM-32B). swe-bench/SWE-smith 648 118 CodeClash Benchmarking Goal-Oriented Software Engineering → View research project CodeClash is a benchmark for evaluating AI systems on goal-oriented software engineering. In CodeClash, LM agents compete via their codebases across multi-round tournaments to achieve high-level goals. LMs are still vastly inferior to human baselines on CodeClash and show a multitude of weaknesses.\ncodeclash-ai/codeclash 154 16 Graph Neural Network Tracking Reconstruct charged particle trajectories in particle detectors → View research project This open source project uses graph neural networks to reconstruct charged particle trajectories in particle detectors (\"tracking\"). Batteries are included: The package implements the whole processing pipeline, several models and approaches, and the evaluation of the final performance metrics. Built around PyTorch Lightning, the models are easy to train and to restore. 
By using hooks and callbacks, everything remains modular and maintainable.\ngnn-tracking/gnn_tracking v23.12.1 50 19 Teaching Material Paradigms \u0026 Patterns: Lecture slides (both as LaTeX source code and as rendered PDFs) and exercises for a course taught originally at iCSC 2020 to more than 500 participants. Everything you didn't know you needed: A collection of tips and tricks for Python, the command line and more. Collaborative Programming with GitHub HEP Fitting Tutorial: Jupyter notebooks for tutorials on fitting for high energy physics. Old projects Various projects Sunburst (2016): sunburst is a Python package for matplotlib that creates sunburst plots (\"hierarchical pie charts\" similar to disk space visualizations). RandomFileTree (2019): Create a random file/directory tree/structure in Python for testing purposes. jekyll-relative-url-check (2021): Enforce that all URLs in your Jekyll setup are relative to site.baseurl. Verzettler (2020): Non-linear, non-hierarchical knowledge management: Helper scripts for your Zettelkasten. video-frame-merger (2018): Overlay the moving elements of video frames to condense a whole motion into one still image. Electronics Piezo Puzzle (2021): An interactive birthday puzzle with a piezo buzzer operated by an ATmega8 MCU. Different values can be selected on a rotary dial. For each selected value, letter combinations are communicated in Morse code. Once their meaning is understood, they can be put in the right order to get a code. After this code has been entered, three dial values play different birthday songs. View of the main board before packaging.\nAnki Plug-ins Anki is a spaced repetition flashcard program that boasts high customizability with Add-ons. Writing Anki add-ons was my first serious foray into open-source software development.\nTemplate tester (2017): The styling of Anki's flashcards is governed by templates written in HTML and CSS. 
This is a small tool to batch generate previews of templates for different user input cases, which comes in handy when maintaining multiple complicated templates. Ignore duplicates (2015): Customize how and when Anki flags cards as duplicates. Sync fields (2015): Add-on to synchronize information/field values between different cards/notes, e.g. including information/mnemonics about the kanji used in Japanese words also on the cards of Japanese words that use them (and add these as exemplary use cases to the kanji cards). Requires substantial configuration. Merge notes (2015): Plug-in to merge a set of notes (flashcards) into another set of notes. Rudimentary Add-on intended for one-time use! cbcImport (2015): Adds a new toolbar to Anki's Add Card dialog to load .csv files and then cycle through them, adding cards/vocabulary items step by step. Readings Audio (2016): Add Kunyomi/Onyomi audio to Kanji readings flashcards in Anki. Currently not completely functional. Reset Fields (2015): Adds a button to reset all fields in the editor window in Anki. Templates (2015): HTML/CSS templates for Anki flashcards that I used for learning Japanese. Other tools for learning Japanese RTK Lookup (2014): For people who learn kanji with the books from James Heisig (Remembering The Kanji). A little command line interface that lets you look up multiple kanji by keyword (or parts of it), by parts of the story/mnemonic or by frame number. rtk-table-tools (2019): Generates beautiful posters of all JLPT kanji! Also includes a web scraper to get additional information for that purpose. 
","permalink":"https://lieret.net/opensource/","summary":"\u003ch1\u003eOpen Source\u003c/h1\u003e\n\n\u003cdiv style=\"text-align: center; margin-bottom: 3rem;\"\u003e\n\n\n\n\u003cstyle\u003e\n.github-link {\n    display: inline-flex;\n    align-items: center;\n    flex-direction: column;\n    text-decoration: none;\n    transition: color 0.2s;\n}\n\n.github-link svg {\n    width: 100px;\n    height: 100px;\n    margin-bottom: 1rem;\n    fill: var(--primary);\n    transition: fill 0.2s;\n}\n\n.github-link p {\n    margin: 0;\n    font-size: 1.2rem;\n    color: var(--primary);\n    transition: color 0.2s;\n}\n\n.github-link:hover svg,\n.github-link:hover p {\n    fill: var(--secondary);\n    color: var(--secondary);\n}\n\u003c/style\u003e\n\n\n\u003ca href=\"https://github.com/klieret/\" class=\"github-link\"\u003e\n    \u003csvg viewBox=\"0 0 24 24\" xmlns=\"http://www.w3.org/2000/svg\"\u003e\u003cpath d=\"M12 .297c-6.63 0-12 5.373-12 12 0 5.303 3.438 9.8 8.205 11.385.6.113.82-.258.82-.577 0-.285-.01-1.04-.015-2.04-3.338.724-4.042-1.61-4.042-1.61C4.422 18.07 3.633 17.7 3.633 17.7c-1.087-.744.084-.729.084-.729 1.205.084 1.838 1.236 1.838 1.236 1.07 1.835 2.809 1.305 3.495.998.108-.776.417-1.305.76-1.605-2.665-.3-5.466-1.332-5.466-5.93 0-1.31.465-2.38 1.235-3.22-.135-.303-.54-1.523.105-3.176 0 0 1.005-.322 3.3 1.23.96-.267 1.98-.399 3-.405 1.02.006 2.04.138 3 .405 2.28-1.552 3.285-1.23 3.285-1.23.645 1.653.24 2.873.12 3.176.765.84 1.23 1.91 1.23 3.22 0 4.61-2.805 5.625-5.475 5.92.42.36.81 1.096.81 2.22 0 1.606-.015 2.896-.015 3.286 0 .315.21.69.825.57C20.565 22.092 24 17.592 24 12.297c0-6.627-5.373-12-12-12\"/\u003e\u003c/svg\u003e\n    \u003cp\u003eView all of my projects on GitHub!\u003c/p\u003e","title":""},{"content":"","permalink":"https://lieret.net/photography/","summary":"","title":""},{"content":"CV 📄 Download PDF LinkedIn My research develops AI systems that autonomously perform complex problem-solving tasks in software engineering and beyond. 
My multidisciplinary background includes postdoctoral research on Graph Neural Networks, petabyte-scale data analysis in experimental high energy physics, and dual degrees in mathematics and physics.\nProfessional Experience Feb 2026 – present AI Research Scientist Meta FAIR Feb 2024 – Jan 2026 Research Software Engineer II Princeton University, Princeton Language \u0026 Intelligence Initiative Show details Adviser: Karthik Narasimhan\nAgentic AI for Software Engineering: lead developer of SWE-agent since Mar 2024; repeatedly achieved SotA on SWE-bench; refactored and built around SWE-ReX for 10x execution time speedup and cloud capabilities Contributed to achieving SotA on various cybersecurity benchmarks (SWE-agent EnIGMA), and open-weight SotA on SWE-bench by large-scale generation of agent trajectories for synthetic issues (SWE-smith) July 2022 – Jan 2024 Associate Research Scholar / Postdoctoral Research Associate Princeton University, Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP) Show details Adviser: Peter Elmer\nMachine learning for high-throughput algorithms in High Energy Physics Learned-clustering with graph neural networks and transformers (more information) Leadership \u0026 Service 2020, 2023 – 2024 Group Convener HEP Software Foundation Training Working Group (HSF Training) More information 2020 – 2022 Group Convener Belle II Software Training \u0026 Documentation Working Group More information Education Oct 2018 – May 2022 Ph.D. 
in Experimental High Energy Physics Ludwig Maximilian University of Munich (LMU) Show details Adviser: Thomas Kuhr\nThesis: Calibration of Machine Learning-based Hadronic Tagging in Preparation for a |Vcb| Measurement and Clustering of Kinematic Distributions\nGraduated Summa Cum Laude Calibration \u0026 debiasing of machine learning algorithms for the reconstruction of particle decays (more information) Cluster analyses of kinematic distributions of particle decays (more information) Part of the Belle II Software Team; responsible for software performance testing Oct 2014 – Sep 2018 Master of Science in Theoretical \u0026 Mathematical Physics Elite master program at Ludwig Maximilian University of Munich (LMU) and Technical University of Munich (TUM) Thesis: Construction of Angular Observables Sensitive to New Physics in B̄→D*τ⁻ν̄τ Decays and Measurements of Differential Cross Sections of B̄→D*ℓ⁻ν̄ℓ Decays with Hadronic Tagging at Belle (more information) Oct 2011 – Sep 2015 Bachelor of Science in Physics Ludwig Maximilian University of Munich (LMU) Thesis: Truth-Level Based Estimation of the Sensitivity to Phenomenological Minimal Supersymmetric Standard Models in Events With One Hard Lepton (more information) Oct 2011 – Aug 2014 Bachelor of Science in Mathematics Ludwig Maximilian University of Munich (LMU) Thesis: Elliptic Functions\nGraduated top of my class Research Stays Dec 2017 – Feb 2018 Research Stay University of Tokyo Project: Construction of Angular Observables Sensitive to New Physics in B̄→D*τ⁻ν̄τ Decays (more information) Jul 2017 – Sep 2017 Research Summer Program Tokyo Institute of Technology (TITECH) Project: Complex Organic Molecules in Protoplanetary Disks Sep 2015 – Sep 2016 Exchange Year Nagoya University Jul 2015 – Sep 2015 Summer Student CERN Project: Data Acquisition Performance Analysis for the LHCb group Languages German Native English Near native (C2), TOEFL iBT: 115/120 (Nov 2014) Japanese Upper-intermediate (B2/C1), JLPT N2 (Jul 
2016) French Intermediate (B2) Personal Details Nationality German Date of birth March 1993 ","permalink":"https://lieret.net/cv/","summary":"\u003ch1\u003eCV\u003c/h1\u003e\n\n\n\u003cdiv style=\"text-align: center; margin-bottom: 2rem; display: flex; gap: 1rem; justify-content: center; flex-wrap: wrap;\"\u003e\n    \u003ca class=\"strong-button\" href=\"/cv.pdf\" rel=\"noopener\" title=\"Download PDF\" download\u003e\n        📄 Download PDF\n    \u003c/a\u003e\n    \u003ca class=\"strong-button\" href=\"https://www.linkedin.com/in/klieret\" target=\"_blank\" rel=\"noopener\" title=\"LinkedIn Profile\"\u003e\n        \u003csvg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" style=\"height: 1em; width: 1em; vertical-align: -0.1em; margin-right: 0.3em;\"\u003e\n            \u003cpath d=\"M16 8a6 6 0 0 1 6 6v7h-4v-7a2 2 0 0 0-2-2 2 2 0 0 0-2 2v7h-4v-7a6 6 0 0 1 6-6z\"\u003e\u003c/path\u003e\n            \u003crect x=\"2\" y=\"9\" width=\"4\" height=\"12\"\u003e\u003c/rect\u003e\n            \u003ccircle cx=\"4\" cy=\"4\" r=\"2\"\u003e\u003c/circle\u003e\n        \u003c/svg\u003e\n        LinkedIn\n    \u003c/a\u003e\n\u003c/div\u003e \n\n\u003cdiv class=\"cv-statement\"\u003e\n\u003cp\u003eMy \u003ca href=\"/research\"\u003eresearch\u003c/a\u003e develops AI systems that autonomously perform complex problem-solving tasks in software engineering and beyond. My multidisciplinary background includes postdoctoral research on Graph Neural Networks, petabyte-scale data analysis in experimental high energy physics, and dual degrees in mathematics and physics.\u003c/p\u003e","title":"Curriculum Vitae"},{"content":"","permalink":"https://lieret.net/icon-test/","summary":"","title":"Icon Test Page"},{"content":" Research All papers at:\nGoogle Scholar ORCID My current work focuses on AI for software engineering. 
The SWE-agent project was the first open source system to demonstrate how modern language models can effectively utilize tools to solve complex repository-level tasks (as measured on the SWE-bench benchmark). SWE-agent EnIGMA showed that the same system with different tools can set state of the art performance for red-teaming cybersecurity applications. Our most recent benchmark, ProgramBench, asks whether language models can rebuild entire programs from scratch given only a compiled binary and its documentation, a task on which today's frontier models score essentially zero. Before that, SWE-smith introduced a pipeline for generating software engineering training data at scale, allowing us to set state of the art performance for open source models on the SWE-bench Verified benchmark. I also continue to support research into Graph Neural Networks for High Energy Physics, research I started as a Postdoc at Princeton University. During my PhD and studies, I focused on various aspects of data analysis, software engineering, maths and physics, including calibrating machine learning algorithms, clustering analyses, integration testing, anomaly detection, differential equations, data acquisition simulations, supersymmetry, and elliptic functions. Recent News May 2026 (software): We just released ProgramBench, a 0% software engineering benchmark. May 2026 (paper): CodeClash was accepted to ICML! Feb 2026: I joined Meta FAIR as an AI Research Scientist. Nov 2025 (software): We just released CodeClash, our extremely challenging new benchmark for autonomous software engineering agents. Sep 2025 (paper): SWE-smith was accepted as a NeurIPS 2025 Spotlight paper. July 2025 (software): We just released mini-swe-agent, a radically minimal AI agent that scores \u003e74% on SWE-bench Verified. 
Jun 2025 (talk): From code completion to autonomous software engineering agents at the Databricks Data+AI summit in SF (link). May 2025 (software): SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified (SWE-smith project). Apr 2025 (paper): I presented our poster SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? at ICLR in Singapore (poster link, more information). Apr 2025 (talk): As part of the week at GenAI collective NY, I joined a technical fireside chat: Key lessons from pushing AI beyond autocomplete. Apr 2025 (talk): Beyond Code Completion: Building Next-Gen AI Engineering Agents at Daytona AI Builder's day at GitHub HQ in SF (video). Apr 2025 (talk): Interview/podcast together with C.E. Jimenez on Databrew by Databricks (video). Mar 2025 (talk): Virtual talk at MLOps Agent hour: From Code Completion to Autonomous Software Engineering Agents. Feb 2025 (software): SWE-agent 1.0 sets State of the Art on SWE-bench Lite, Verified, and Full. ProgramBench: Can Language Models Rebuild Programs From Scratch? Research at Meta FAIR Since 2026 Audience: General AI/Software Benchmarks like SWE-bench have shown that AI agents can fix isolated bugs in large existing codebases. But real software engineering also involves the opposite end of the spectrum: starting from a blank directory and building an entire program, choosing the language, designing the architecture, organizing the files, and getting all the details right. ProgramBench asks a deceptively simple question: given only a working executable and its documentation, can a language model rebuild the program from scratch? The benchmark contains 200 tasks ranging from very simple command line tools to complex projects that have been developed over more than a decade by hundreds of developers. AI agents have no access to the original source code (or to the internet) and must architect a complete codebase that reproduces the reference program's behavior. 
The headline result: none of the 9 evaluated language models fully resolve a single task. The strongest model passes 95% of tests on only 3% of instances. Model-written code also looks structurally very different from human-written code, with a strong bias toward monolithic single-file implementations. Benchmarks like SWE-bench have shown that AI agents can fix isolated bugs in existing codebases. But real software engineering also involves the opposite end of the spectrum: starting from a blank directory and building an entire program, choosing the language, designing the architecture, organizing the files, and getting all the details right. ProgramBench asks a deceptively simple question: given only a working executable and its documentation, can a language model rebuild the program from scratch? The benchmark contains 200 tasks ranging from compact command-line utilities like jq and ripgrep to large, mature projects like FFmpeg, SQLite, and PHP. Agents have no access to the original source code and must architect a complete codebase that reproduces the reference program's behavior, judged against more than 248,000 behavioral tests generated through agent-driven fuzzing. The headline result: none of the 9 evaluated language models fully resolve a single task. The strongest model passes 95% of tests on only 3% of instances. Model-written code also looks structurally very different from human-written code, with a strong bias toward monolithic single-file implementations. Website GitHub arXiv Dataset SWE-agent: Autonomous Software Engineering Research at Princeton University with PLI NeurIPS '24 Since 2024 Audience: General AI/Software In early 2024, most software engineers were using large AI models as chatbot assistants, answering questions or generating small pieces of code on request. In addition, smaller language models were powering code autocompletion tools in code editors, speeding up the writing of new code. 
However, software engineers spend most of their time not writing new code, but rather fixing bugs and implementing features in existing codebases, often comprising tens or hundreds of thousands of lines of code. This means that a large amount of time is spent understanding the codebase, finding the correct place to make changes, and making small-scale modifications. The simple AI tools of early 2024 were not well-suited to this task. To address this gap, we built SWE-agent. Mimicking the workflow of human engineers, SWE-agent uses tools to perform tasks, working incrementally toward complicated goals. In order to autonomously fix bugs and implement complex features in large software repositories, SWE-agent takes time to navigate the codebase, read select files, and finally make modifications, before testing and validating the changes. SWE-agent was the first open-source system to score significantly on the SWE-bench benchmark that assesses the performance of AI systems on real-world software engineering tasks. Since the initial release in April 2024, SWE-agent has continued to evolve and regularly ranks at the top of the SWE-bench leaderboard while maintaining a lightweight and accessible design. 
SWE-agent enables your language model of choice to use tools to fix issues in real GitHub repositories, find cybersecurity vulnerabilities, or perform any custom task.\nIt was the first open-source system to score significantly on SWE-bench, far outperforming RAG baselines and creating a breakthrough for agentic AI in software engineering.\nReleased just days after the commercial equivalent project Devin showed its first public demo, SWE-agent demonstrated that a simple open-source system with optimized agent tooling could perform as well as (if not better than) a well-funded company's demo, democratizing access to AI-powered software engineering capabilities.\nThe central innovation discussed in our paper is the design and optimization of the agent-computer interface (ACI) that allows the language model to effectively navigate, understand, and modify large codebases. This includes custom shell commands, file editing interfaces, and feedback mechanisms.\nSince the initial release in April 2024, development has never stopped, and SWE-agent still regularly ranks at the top of the SWE-bench leaderboard while maintaining a lightweight, modular architecture that makes it easy to extend and customize for different use cases.\nGitHub Website arXiv Videos \u0026 Interviews Mar '25 Keynote talk: Beyond Code Completion: Building Next-Gen AI Engineering Agents Apr '25 Databricks podcast interview Dec '24 SWE-agent Team interview NeurIPS Hackercup AI: SWE-agent CodeClash: Benchmarking Goal-Oriented Software Engineering Research at Princeton University with PLI ICML '26 Since Nov 2025 LMs have gotten pretty good at solving GitHub issues. But real software development isn't a series of isolated tasks. It's driven by goals: Improve user retention, increase revenue, reduce costs. We build to achieve outcomes, not to close tickets. What if AI evaluations reflected this dynamism of real-world software development? 
In CodeClash, LM agents compete via their codebases across multi-round tournaments to achieve high-level goals. In every round, agents get to improve their codebase as they see fit. Write notes, analyze past rounds, run test suites, refactor code: whatever helps. Then, they face off against an opponent in one of multiple arenas. CodeClash shows how far AI agents still are from being able to lead software development. Not only are they inferior to human baselines, but our paper reveals various ways in which models produce messy codebases, hallucinate about past rounds, and fail to work towards a higher-level goal. Website GitHub arXiv SWE-smith: Scaling Data for Software Engineering Agents Research at Princeton University with PLI NeurIPS '25 Spotlight Since 2025 Audience: General AI/Software In order to train language models to be good software engineers, we need large amounts of high-quality training data: examples of software bugs and the steps that were taken to fix them. But collecting such data is difficult: it often requires hours of manual work and complex setups that are hard to scale. That's where SWE-smith comes in. SWE-smith is a system that automatically creates realistic training data for AI agents that work with code. Given any existing software project, it builds a runnable version of the code and then generates hundreds or thousands of small tasks, for example by artificially introducing a bug. We can then use existing language models to attempt to fix these bugs, keep only the successful attempts, and then use this data to train a new language model. We used SWE-smith to generate over 50,000 tasks from 128 popular open-source repositories and trained SWE-agent-LM-32B on this data. This language model achieved the top scores among open-source models, and outperformed GPT-4o on SWE-bench Verified, a benchmark that tests AI agents on real-world software engineering tasks. 
Despite recent progress in Language Models for software engineering, collecting training data remains a significant pain point. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability.\nTo address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing tests in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works.\nWe train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, outperforming GPT-4o and setting state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering.\nGitHub Website arXiv Dataset SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? Research at Princeton University with PLI ICLR '25 Since 2024 SWE-bench has become the industry-standard evaluation framework for benchmarking autonomous software engineering agents, utilized extensively by all major language model providers. It presents language models with real-world software engineering tasks drawn from GitHub repositories, challenging them to resolve complex issues requiring deep codebase understanding and reasoning beyond typical code generation. I was particularly involved with our follow up project, SWE-bench multimodal. 
SWE-bench Multimodal generalizes the SWE-bench framework to typical frontend engineering tasks, shifting focus from Python to JavaScript and requiring visual understanding in addition to reasoning and agentic abilities. Agents evaluated under this new benchmark must effectively interpret images provided within task descriptions and use visual feedback during issue resolution and validation processes. As a result, current state-of-the-art models continue to find this benchmark exceptionally difficult, successfully solving fewer than 25% of the included tasks. Leaderboard GitHub arXiv EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities Research at Princeton University with PLI ICML '25 2024 Proactively identifying and resolving cybersecurity vulnerabilities through red-teaming exercises where systems are tested from an attacker’s perspective is critical to securing modern infrastructure. However, attack vectors are diverse and require a broad set of skills, tools, and knowledge, making them very challenging to execute for AI systems. We build on the generalist capabilities of SWE-agent to create EnIGMA, an AI agent equipped with various cybersecurity tools. In particular, we enable the agent to use interactive terminal applications, including debuggers and real-time server interactions. Evaluated across leading Capture The Flag (CTF) cybersecurity benchmarks including NYU CTF, Intercode-CTF, and CyBench, EnIGMA sets a new state of the art, significantly outperforming existing approaches and marking a notable leap forward in agent-driven cybersecurity. GitHub Website arXiv Graph Neural Networks for Charged Particle Tracking [+] Research at Princeton University with IRIS-HEP 2022-2024 Audience: General AI/Physics Modern particle physics experiments are among the most computationally demanding scientific efforts, generating vast amounts of data that must be processed in real time to capture rare and interesting events. 
One of the hardest challenges is reconstructing the paths of charged particles (known as \"tracking\") as they move through detectors. This step is so complex that it can limit how much high-quality data experiments like the CMS experiment at the Large Hadron Collider can record in the first place.\nTracking in this setting is unlike typical trajectory problems. Particle collisions happen millions of times per second, each producing thousands of new particles. These particles move so fast that we can't measure when they pass through the detector, and instead of a smooth trail, we only get 5–15 individual \"hits\" in different detector layers. The task is like solving an extremely difficult 3D connect-the-dots puzzle—starting from a cloud of scattered points, we must infer around a thousand particle trajectories for each collision.\nTraditional algorithms struggle as the number of particles and collisions increases, but recent advances in machine learning offer a promising alternative. In this project, we explore the use of Graph Neural Networks to tackle this problem more efficiently.\nThe reconstruction of charged particle trajectories (\"tracking\") in particle physics detectors is one of the computationally most challenging tasks of the field, limiting the amount of high-quality data that can even be recorded. Applied to particle collider experiments such as the CMS experiment, this task is different from many other problems that involve trajectories: There are millions of particle collisions per second, each with thousands of individual particles that need to be tracked, there is no time information (the particles travel too fast), and we do not observe a continuous trajectory but instead only 5-15 points (\"hits\") along the way in different detector layers. 
The task can be described as a combinatorially very challenging \"connect-the-dots\" problem, essentially turning a cloud of points (hits) in 3D space into a set of $\\mathcal O(1000)$ trajectories.\nUnlike traditional tracking algorithms built around Kalman filters, this project uses graph neural networks for significant speed increases. A conceptually simple way to turn tracking into a machine learning task is to create a fully connected graph of all points and then train an edge classifier to reject any edge that doesn't connect points that belong to the same particle. In this way, only the individual trajectories remain as connected components of the initial fully connected graph. In this project, we instead explore the idea of object condensation or learned clustering, where a network maps all hits to a latent space, learning to place hits from the same track close to each other, such that simple operations can recover the hits belonging to the same tracks.\nGitHub arXiv ▼ Talks ▼ High Pileup Particle Tracking with Learned Clustering 2312.03823: High Pileup Particle Tracking with Object Condensation 2309.16754: An Object Condensation Pipeline for Charged Particle Tracking at the High Luminosity LHC ACAT 24: High Pileup Particle Tracking with Learned Clustering CTD23: High Pileup Particle Tracking with Object Condensation CHEP 23: An Object Condensation Pipeline for Charged Particle Tracking Charged particle tracking as an embedding task: The left side shows a t-SNE embedding of all hit features, with hits belonging to some (randomly selected) particles colored. 
Our embedding maps hits belonging to the same particle to the same place (right picture), such that tracks can be recovered by a simple clustering operation.\nPast projects Calibration of Machine Learning Algorithms for the Reconstruction of $B$ Mesons [+] PhD research at LMU Munich 2018 - June 2022 Audience: General Physics While the Standard Model of particle physics describes particles and their interactions to astonishing accuracy, there is a range of shortcomings. One of the avenues to probe and test the Standard Model further is to investigate differences between similar particles of different \"flavor\". Studies of B mesons are particularly sensitive to this. In this project, data from the Belle experiment is analyzed with the software of its successor, the Belle II experiment. One of the key ingredients is the Full Event Interpretation, a machine learning algorithm that reconstructs B mesons from their decay products as recorded by the detector. Crucial to the physics goals of the experiments is to ensure that the reconstruction algorithms perform exactly the same on data and Monte Carlo simulation. However, because Monte Carlo simulations are never perfect, small differences can be picked up and exaggerated by the reconstruction algorithms. Furthermore, biases in the training data can lead to biases in reconstruction efficiencies. Therefore, it is paramount to calibrate the reconstruction algorithms. The decay $\\bar B\\longrightarrow D^*\\ell^-\\bar\\nu_\\ell$ is used to precisely determine the CKM matrix element $|V_{cb}|$, an important ingredient for tests of the flavor sector of the Standard Model. It is also the normalization channel for measurements of $R(D^*)$, one of the key quantities of the flavor anomalies that recently sparked a flurry of interest in the field. Improving our understanding of $\\bar B\\longrightarrow D^*\\ell^-\\bar\\nu_\\ell$ might help to understand and improve analyses of $R(D^*)$ as well. 
Reconstruction of a tag side B meson in addition to the semileptonically decaying B allows for a very clean data sample. Using the large Belle dataset but applying Belle II software for analysis, we can improve upon previous studies: The Belle II Full Event Interpretation, a machine learning algorithm to reconstruct the tag side B meson is almost two times more efficient than previously used algorithms. However, careful calibration studies are needed to address inconsistencies in its efficiency between data and Monte Carlo simulation. Ph.D. Thesis Calibration factors for the Belle II Full Event Interpretation algorithm.\nClustering of Kinematic Graphs [+] PhD research at LMU Munich 2018-2019 New Physics can manifest itself in kinematic distributions of particle decays. The parameter space defining the shape of such distributions can be large which is challenging for both theoretical and experimental studies. Using clustering algorithms, the parameter space can however be dissected into subsets (clusters) which correspond to similar kinematic distributions. Clusters can then be represented by benchmark points, which allow for less involved studies and a concise presentation of the results. To demonstrate this concept, I have written the Python package ClusterKinG, an easy to use framework for the clustering of distributions that particularly aims to make these techniques more accessible in a High Energy Physics context. As a physics use case its application has been demonstrated for the kinematic distributions of $\\bar B \\longrightarrow D^{(*)}\\tau^-\\bar\\nu_\\tau$. JHEP Publication Ph.D. Thesis GitHub Belle II Software Integration and Performance Testing [+] 2018-2022 As the maintainer of the Belle II validation framework I was responsible for tests covering the working and overall performance of the Belle II software. Each software package provides a selection of scripts (Python or C++) that run on small scale realistic data samples. 
The validation framework resolves dependencies between these scripts, executes them on a central server and uses different metrics to detect inconsistencies and performance degradations. The results are visualized on a dynamic website.\nWeb server reporting on the detailed results of the latest validation run.\nCoordinating Software Training and Education Efforts [+] at LMU Munich, Princeton University, and IRIS-HEP 2018-2024 Experimental high energy physics at large experiments is tasked with analyzing petabytes of data, necessitating an ever-evolving, ever more complex software stack. Delivering the best possible science depends crucially on the software skills of a large workforce of researchers. Keeping up with the latest big data tools and technology requires extensive training, covering everything from programming best practices to the latest industry tools and experiment-specific software frameworks.\nFrom 2020 to 2022, I led the Belle II Software Training and Documentation group that organizes training events and provides training material, primarily focusing on getting researchers up to speed with the Belle II software framework. In 2020 and since 2022, I have also been coordinating software training across experiments as one of the conveners of the HSF Training Group. I have also taught the basics of programming paradigms and software design patterns to more than 500 participants.\nConstruction of Angular Observables Sensitive to New Physics in $\\bar B\\longrightarrow D^* \\tau^-\\bar\\nu_\\tau$ Decays and Measurements of Differential Cross Sections of $\\bar B\\longrightarrow D^*\\ell^-\\bar\\nu_\\ell$ Decays with Hadronic Tagging at Belle [+] Thesis (M. Sc.) at LMU Munich, TU Munich 2017-2018 Audience: General Physics Most of our current understanding of elementary particle physics is encoded in the \"Standard Model\", a mathematically consistent description of all known particles and their interactions (except gravitation). 
There are a number of shortcomings of the Standard Model and most notably, astrophysical observations suggest that the currently known particle content accounts for but 5% of the total mass content of the universe (the rest being called dark matter). However, at the same time, accelerator experiments have confirmed the predictions of the Standard Model with astonishing accuracy. One of the few exceptions are the so called \u0026ldquo;anomalies in semileptonic B decays\u0026rdquo;, anomalies found in decays of B mesons into other mesons (particles made up of two quarks) and leptons (e.g. electrons, neutrinos). Deviations have been seen by three independent experiments (Belle, BaBar and LHCb) and taken together, the measurements challenge the Standard Model like few before.\nIn my Master thesis project, I give an overview over possible models of physics beyond the Standard Model, develop new observables that could help to distinguish between them and finally present continuing work on an analysis of one kind of semileptonic B decays that plays an important role in the anomalies.\nRecent measurements of $\\bar B\\longrightarrow D^{(*)}\\ell^-\\bar\\nu_\\ell$ at Belle, BaBar and LHCb challenge lepton universality and thus the Standard Model at a combined confidence level close to four standard deviations. New measurements of differential decay rates could contribute to the understanding of these anomalies. The differential cross section of the decay $\\bar B\\longrightarrow D^*(\\rightarrow D\\pi)\\ell^-\\bar\\nu_\\ell$ is parametrized according to different dependencies on the three decay angles and the coupling constants of potential new physics contributions. Observables using binned measurements of the differential cross section are characterized and explicitly constructed. Based on an estimate for the obtainable sensitivity, optimal binnings for such measurements are discussed. 
The discriminatory power of the thus constructed observables is discussed using a basis of dimension six operators with renormalizable couplings contributing to $\\bar B\\longrightarrow D^*\\ell^-\\bar\\nu_\\ell$. Furthermore, continuing work on an analysis of the $\\bar B\\longrightarrow D^*(\\rightarrow D\\pi)\\ell^-\\bar\\nu_\\ell$ decay channel for $\\ell = e, \\mu$ using data from the Belle detector at KEKB is presented. The events are selected from 772 million $e^+e^- \\longrightarrow \\Upsilon(4S) \\longrightarrow B\\bar B$ events, where one $B$ meson is fully reconstructed in hadronic modes. Unfolded differential decay rates in four kinematic variables are presented separately for $\\ell= e, \\mu$ and a combined fit, allowing for precise calculations of $|V_{cb}|$ and $B\\longrightarrow D^*$ form factors. The new lepton flavor specific results are also expected to impact the discussion about potential light lepton flavor universality violations prompted by measurements of $B\\longrightarrow K^{(*)}\\ell\\ell$ decays. Master\u0026#39;s Thesis The world average for the measurements of the observables $R(D^{(*)})$ currently shows a $4\\sigma$ deviation from the Standard Model. Result of the Heavy Flavor Averaging Group from 2017.\nComplex Organic Molecules in Protoplanetary Disks [+] Summer Project at TITECH July 2017 till September 2017 Audience: General Astrophysics The spectroscopic study of interstellar and circumstellar molecules has been long ongoing, with researchers concluding already in the 1970s that interstellar dust contains large numbers of complex organic molecules (COMs). Since then, ever improving searches have found about 50 such COMs. 
Besides being of great interest for astrochemistry and some researchers even pointing out their potential role regarding the origin of life, COMs also serve as valuable probes for the physical conditions of the surrounding medium.\nThe build-up of molecular complexity in a given system can be studied with chemical reaction networks (CRNs), mathematical models of the concentrations of various molecules based on a fixed set of reactions and an initial set of reactant concentrations. By expanding the previously studied CRNs with additional dust grain-surface reactions, we tried to improve the description of COM formation in protoplanetary disks. Trying to automate some time-consuming manual tasks necessary for studies of the influence of physical and chemical parameters, I wrote an analysis framework that will enable future students to conduct similar studies much more efficiently, thereby opening new research possibilities.\nComplex Organic Molecules (COMs) in protoplanetary disks have been the subject of extensive studies using chemical reaction networks (CRNs) (e.g. Walsh et al., 2014). The accuracy of these models depends on our knowledge of the relevant chemical processes. Some classes of reactions have been comprehensively studied, resulting in large databases like the UMIST database of astrochemistry, which lists more than 6000 gas-phase reactions. However, other classes of reactions, such as grain-surface reactions, still pose challenges.\nBy expanding the previously studied CRNs with additional grain-surface reactions that are currently studied in new laboratory experiments (and have so far mostly been considered in the context of meteorites), we tried to improve the description of COM formation in protoplanetary disks. More specifically, I have been using the existing simulation code to investigate the influence of physical and chemical parameters, such as temperature, density and activation energies, on the time evolution of the chemistry found on grains. 
Trying to automate some time-consuming manual tasks necessary for such studies, I wrote a framework to repeatedly run the simulation with different settings and to visualize the resulting datasets. This framework will enable future students to conduct similar studies much more efficiently, thereby opening new research possibilities.\nExperience Report Data Acquisition Pipeline Performance Analysis [+] Summer Student Project at CERN/LHCb July 2015 till September 2015 Audience: General Physics/Computing The LHCb experiment is one of the particle physics detector experiments located at the LHC at CERN. At the collision points at the LHC, millions of particle collisions happen every second. Due to limitations in computing power and storage capacity, so far not every one of these events could be recorded and processed. Rather, only a fraction of the events were picked out to be processed by so called triggers. From 2020, the LHCb experiment plans to proceed to a trigger-free readout, where all of the events can be processed, thereby increasing the amount of data available to physicists. This requires an update of the LHCb Data Acquisition (DAQ) systems, which are responsible for the recording and processing of the events. DAQPIPE (Data Acquisition Protocol Independent Performance Evaluator) is a tool to simulate and evaluate the performance of such a DAQ system.\nThe aim of this 10-week summer student project was to implement network monitoring for a more detailed performance evaluation of different transport protocols and to spot potential bottlenecks. First, several existing performance monitors were tested. To that end DAQPIPE was run together with Tau and the obtained performance data was plotted with ParaProf, JumpShot and Vampir. In the second stage of the project, a light-weight performance analysis tool was written from scratch in C++.\nIn 2020 the Data Acquisition (DAQ) of the LHCb experiment will be updated to feature a trigger-free readout. 
This requires an event builder network consisting of about 500 nodes with a total network capacity of 4 TBytes/s. DAQPIPE (Data Acquisition Protocol Independent Performance Evaluator) is a tool to simulate and evaluate the performance of such a DAQ system. The current implementation of DAQPIPE only gives rough feedback about the event building rate.\nThe aim of this 10-week summer student project was to implement network monitoring for a more detailed performance evaluation of different transport protocols and to spot potential bottlenecks. First, several existing performance monitors were tested. To that end DAQPIPE was run together with Tau and the obtained performance data was plotted with ParaProf, JumpShot and Vampir. In the second stage of the project, a light-weight performance analysis tool was written from scratch by wrapping around the C++ MPI communication library to collect data.\nMonitoring the data sent by two readout units (RUs). RUs collect incoming data fragments from different subdetectors and send them to builder units (BUs), which process the information.\nTruth-level based estimation of the sensitivity to pMSSM models in events with one hard lepton [+] Thesis (B.Sc. in Physics) at LMU Munich 2015 Audience: General Physics Most of our current understanding of elementary particle physics is encoded in the \"Standard Model\", a mathematically consistent description of all known particles and their interactions (except gravitation). However, there are a number of shortcomings of the Standard Model and most notably, astrophysical observations suggest that the currently known particle content can account for but 5% of the total mass content of the universe (the rest being called dark matter). One of the most popular theoretical concepts that introduce additional particles is supersymmetry (SUSY).\nCurrent searches for SUSY particles are for example conducted with the ATLAS detector at the LHC at CERN. 
However, as SUSY theories depend on many unknown parameters, computation power becomes a limiting resource for the study (and exclusion) of possible concrete SUSY scenarios. To address this, two types of analysis methods are used: Truth level analysis (fast but unreliable) and reco level analysis (slow but reliable). Because truth level analysis is a shortcut, it has to be validated by comparing it with the reliable reco level analysis results.\nFor my thesis I performed such a comparison for a specific setup. Unfortunately I found but low levels of agreement between the results of both analysis strategies. I ruled out several sources of error and showed the necessity of a more detailed study of the underlying assumptions.\nBased on the search for supersymmetry in final states containing one isolated lepton, jets and missing transverse momentum with proton-proton collision data recorded with the ATLAS detector at a center-of-mass energy of $\\sqrt s = 8\\, \\mathrm{TeV}$ in 2012, I looked into the estimation of the sensitivity to phenomenological MSSM models using the signal shape of truth level signal samples. These were then compared to the sensitivity as calculated with MC samples on which a full detector simulation and reconstruction had been performed. The agreement was found to be generally low. Several sources of error were ruled out, showing the necessity of a more detailed study of the underlying truth- and reco-level signal samples. Bachelor\u0026#39;s Thesis Comparing the CLs values obtained by reco level analysis (y axis) and truth level analysis (x axis). Ideally both values should roughly agree (resulting in the red line with $x=y$), but this is obviously not the case here.\nElliptic Functions [+] Thesis (B.Sc. in Mathematics) at LMU Munich 2014 Audience: General Mathematical Central subject of this thesis are so called elliptic functions. 
Elliptic functions are a special type of meromorphic functions (complex-valued functions in one complex variable, which are holomorphic apart from a discrete set of poles) that are periodic in two directions on the complex plane, i.e. $ f(x) = f(x+a) $ and $ f(x) = f(x+b) $ for any $ x $ out of the domain with two complex numbers $ a $ and $ b $ (which are required to be non-collinear on the complex plane).\nAmong others, elliptic functions are of great use in number theory, in particular there are interesting connections to sums of divisors of natural numbers. Furthermore they are used in the theory of elliptic curves and elliptic integrals.\nSubject of the thesis are so called elliptic functions, meromorphic functions that are periodic in two directions, i.e. invariant under a translation of their argument by two linearly independent complex numbers.\nAmong others, elliptic functions are of great use in number theory, in particular there are interesting connections to sums of divisors of natural numbers. Furthermore they are used in the theory of elliptic curves and elliptic integrals.\nImaginary part of the Weierstrass p function, an example of an elliptic function. 
Clearly visible are the two periods $ p(x+2) = p(x) = p(x+2i) $ throughout the domain.\nAll papers at:\nGoogle Scholar ORCID ","permalink":"https://lieret.net/research/","summary":"\u003clink rel=\"stylesheet\" href=\"https://cdn.jsdelivr.net/npm/katex@0.16.8/dist/katex.min.css\" crossorigin=\"anonymous\"\u003e\n\u003cscript defer src=\"https://cdn.jsdelivr.net/npm/katex@0.16.8/dist/katex.min.js\" crossorigin=\"anonymous\"\u003e\u003c/script\u003e\n\u003cscript defer src=\"https://cdn.jsdelivr.net/npm/katex@0.16.8/dist/contrib/auto-render.min.js\" crossorigin=\"anonymous\"\u003e\u003c/script\u003e\n\u003cscript\u003e\n  document.addEventListener(\"DOMContentLoaded\", function() {\n    renderMathInElement(document.body, {\n      delimiters: [\n        {left: \"$$\", right: \"$$\", display: true},\n        {left: \"$\", right: \"$\", display: false}\n      ]\n    });\n  });\n\u003c/script\u003e\n\n\u003ch1\u003eResearch\u003c/h1\u003e\n\n\u003cdiv style=\"text-align: center; margin-bottom: 3.5rem;\"\u003e\n\u003cstyle\u003e\n.research-social-icons {\n    display: flex;\n    align-items: center;\n    justify-content: center;\n    gap: 1rem;\n    margin: 1rem 0;\n}\n\n.research-social-icons p {\n    margin: 0;\n    font-weight: normal;\n}\n\n.research-social-icons .social-icons {\n    display: flex;\n    gap: 0.5rem;\n    align-items: center;\n}\n\n.research-social-icons .social-icons a {\n    text-decoration: none;\n    display: flex;\n    align-items: center;\n    gap: 0.3rem;\n    padding: 0.2rem;\n    transition: opacity 0.2s;\n}\n\n.research-social-icons .social-icons a:hover {\n    opacity: 0.7;\n}\n\n.research-social-icons .social-icons a svg {\n    width: 1.2rem;\n    height: 1.2rem;\n}\n\n.research-social-icons .social-icons a span {\n    font-size: 0.9rem;\n    color: var(--primary);\n}\n\n \n@media (max-width: 768px) {\n    .research-social-icons {\n        flex-direction: column;\n        gap: 0.5rem;\n    }\n}\n\u003c/style\u003e\n\n\u003cdiv 
class=\"research-social-icons\"\u003e\n    \u003cp\u003eAll papers at:\u003c/p\u003e","title":"Research"}]