Benjamin Minixhofer

b{lastname}@gmail.com

Hello there! I am a third-year PhD student at the Language Technology Lab of the University of Cambridge. I do research in Natural Language Processing. My research focusses on compute allocation (a.k.a. tokenization) and building language models like we build software, using methods which scale to all the world’s languages.

Previously, I obtained a BSc in Artificial Intelligence from Johannes Kepler University Linz in 2023. I interned at Ai2, Google DeepMind, Cohere, H2O.ai and Huawei Noah’s Ark Lab (2x). My research journey began with Kaggle.

If you’d like to chat about any of these topics (or about 🐱⛵🌳), drop me an email!

News

Jan 21, 2026 I wrote a (my first!) blog post on Four Ingredients for Successful Retrofitting.
Dec 19, 2025 Our new preprint Bolmo: Byteifying the Next Generation of Language Models is up on arXiv!
Sep 18, 2025 Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching is accepted at NeurIPS! See you (probably) at EurIPS!
Jul 7, 2025 Started a summer internship at Ai2 in Seattle to work with Luca Soldaini and Valentin Hofmann!
Nov 29, 2024 Gave a talk on «The Past, Present and Future of Tokenization» at the NLIP Seminar in Cambridge. This talk was based on an invited lecture at the University of Göttingen in early November. Slides.
Nov 27, 2024 Attended the ELLIS NLP Workshop at Dagstuhl. Some nice photos.
Sep 25, 2024 Zero-Shot Tokenizer Transfer is accepted at NeurIPS 2024. See you in Vancouver! 🇨🇦
Jul 24, 2024 I presented Zero-Shot Tokenizer Transfer at Google DeepMind and Mozilla. Slides.

Selected Publications

  1. Bolmo: Byteifying the Next Generation of Language Models
    Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, and Valentin Hofmann
    In arXiv preprint, Dec 2025
  2. Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
    Benjamin Minixhofer, Ivan Vulić, and Edoardo Maria Ponti
    In The Thirty-ninth Annual Conference on Neural Information Processing Systems, Dec 2025
  3. Zero-Shot Tokenizer Transfer
    Benjamin Minixhofer, Edoardo Ponti, and Ivan Vulić
    In The Thirty-eighth Annual Conference on Neural Information Processing Systems, Dec 2024
  4. Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
    Markus Frohmann, Igor Sterner, Ivan Vulić*, Benjamin Minixhofer*, and Markus Schedl*
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024