Ben Clavié (@bclavie) / X

Ben Clavié

3,609 posts

@bclavie

regressing linearly on a daily basis. wife guy who does retrieval. research @mixedbreadai, prev answerdotai

Mitaka-shi, Tokyo

Joined April 2016

Pinned
Ben Clavié
@bclavie
Mar 12
I'm so excited to introduce this! We've worked on a million different moving parts to produce this. I'm fairly confident it's the best multimodal model that exists, period -- and it's not too shabby at pushing back the LIMITs of retrieval either...
Mixedbread
@mixedbreadai
Mar 12
Introducing Mixedbread Wholembed v3, our new SOTA retrieval model across all modalities and 100+ languages. Wholembed v3 brings best-in-class search to text, audio, images, PDFs, videos... You can now get the best retrieval performance on your data, no matter its format.
145K
Ben Clavié
@bclavie
Sep 5, 2024
RAG is increasingly going multi-modal, but document retrieval is tough, and layout gets in your way. But it shouldn't! Introducing 🪤RAGatouille's Vision-equipped, ColPali-powered sibling: 🐭Byaldi. With just a few lines of code, search through documents, with no pre-processing.
150K
Ben Clavié
@bclavie
Jan 4, 2024
The RAG wave is here to stay, but in practice, it's hard to retrieve the right docs w/ embeddings, & better IR models are hard to use! Let's fix that: Introducing 🪤RAGatouille, a lib to train & use a SotA retrieval model, ColBERT, in just a few lines of code! github.com/bclavie/RAGato…
203K
Ben Clavié
@bclavie
Feb 10, 2025
What if a [MASK] was all you needed? ModernBERT is great, but we couldn't stop wondering if it could be greater than previous encoders in different ways. Maybe we don't need task-specific heads? Maybe it can do all sorts of tasks with only its generative head? Spoilers: Yes
165K
Ben Clavié
@bclavie
Aug 13, 2024
🎉Happy to finally release answerai-colbert-small-v1: the small but mighty @answerdotai ColBERT. It might not be able to count the number of "r"s in words, but it can definitely find the instructions on how to do that. With just 33M params, it beats even `bge-base` on BEIR!
146K
Ben Clavié
@bclavie
Feb 8, 2025
36K
Ben Clavié
@bclavie
Dec 14, 2024
Turn on early stopping, and all I see is a successful training loop
40K
Ben Clavié
@bclavie
Jun 9, 2025
Multimodal RAG: Just use ColPali/DSE, then pass your screenshots to the LLM. This is the dream, but how well do LLMs read text contained in images? We wanted to know, so we tried a simple thing: do results change on evals when using screenshots rather than text as input? Yes.
70K
Ben Clavié
@bclavie
Mar 6, 2024
"Just use a reranker for better retrieval" ... Yes, but which one? Someone asked me recently what reranker they should use (with no data to fine-tune it), and I realised just how loaded that question actually was, so I made this (mostly English) "cheatsheet"
40K
Ben Clavié
@bclavie
Mar 14, 2024
Document reranking is powerful, but daunting to get started with. Moreover, trying a new approach requires modifying your pipeline, even though it does the same thing! Introducing 🔧rerankers: a lightweight library to provide a unified way to use various reranking methods🧵1/?
92K
Ben Clavié
@bclavie
Dec 19, 2024
It's finally out! We at @answerdotai, @LightOnIO and friends are releasing ModernBERT 🎉 It does exactly what it says on the tin: It's BERT, but not 2018 BERT, no, it's 2024 BERT, with all the 2024 bells and whistles. They're slot-in replacements for BERT, at both model sizes.
88K
Ben Clavié
@bclavie
Jun 27, 2024
🥁🥁 New blog post out (link in thread), w/ two aims: 🤓 Providing a clear, hopefully easy-to-read intro to ColBERT, without assuming you've ever used it. 🏊Introducing ColBERT Token Pooling ✨: You can reduce the size of ColBERT indexes by 66% with barely any performance hit!
80K
Ben Clavié
@bclavie
Sep 16, 2024
Time for a new ✨Information Retrieval Blogpost✨ It's about our rerankers library, and the why&how of it. It features this updated "what model should I start with" cheatsheet, as well as an intro to what reranking is and why you should embrace it (and a lot more cool stuff!)
27K
Ben Clavié
@bclavie
Sep 4, 2024
Full slides for this talk are here: docs.google.com/presentation/d… Expect a lot of ColBERT and ColPali, with a tiny SPLADE and BM25 cameo to give some context. Thanks @jxnlco and @dan_s_becker for having me!
Ben Clavié
@bclavie
Sep 4, 2024
Replying to @mervenoyann
Couldn't agree more, I literally just ended my talk at @jxnlco's RAG course with this slide an hour ago 😄 Normalise accepting that ColPali+VLM is more amazing than it has any right to be, and that we don't need overly complex pipelines to do the job.
59K