William Berrios (@w33lliam) / X

William Berrios

2,366 posts

@w33lliam

Engineer @UNIoficial 🇵🇪

San Francisco, CA

Joined January 2020

William Berrios
@w33lliam
Jul 14, 2025
Tired of seeing O3 hallucinate? 😵‍💫 Today, I am excited to share how we built the least hallucinatory LLM in the 🌍 Our GLMv2, developed at @ContextualAI, just claimed 1st place 🥇 on the FACTS Grounded leaderboard by Google DeepMind — outperforming Gemini-2.5-pro, Claude 4, and
581K
William Berrios
@w33lliam
Jun 29, 2023
Announcing LENS 🔎, a framework for vision-augmented language models. - Outperforms Flamingo by 9% (56->65%) on VQAv2 - Eliminates the additional cost of multimodal pre-training Demo: lens.contextual.ai Blog+Paper+Code: contextual.ai/introducing-le… A 🧵 [1/N]
78K
William Berrios
@w33lliam
Jun 23, 2025
Excited to share 🤯 that our LMUnit models with @ContextualAI just claimed the top spots on RewardBench2 🥇 How did we manage to rank +5% higher than models like Gemini, Claude 4, and GPT4.1? More in the details below: 🧵 1/11
77K
William Berrios
@w33lliam
Jun 30, 2023
If you want to augment your favorite LLM with vision capabilities like GPT-4, take a look at the following: Blog+Paper: contextual.ai/introducing-le… Demo: lens.contextual.ai Code: github.com/ContextualAI/l…
AK
@_akhaliq
Jun 29, 2023
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language paper page: huggingface.co/papers/2306.16… propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a
8.3K
William Berrios
@w33lliam
Jul 22, 2025
📢 As promised ✨, we're open-sourcing LMUnit! Our SoTA generative model for fine-grained criteria evaluation of your LLM responses 🎯 ✅ SoTA on Flask & BigGbench ✅ SoTA generative reward model on RewardBench2 🤗 Models available on @huggingface: tiny.cc/qjzp001 💻
7.1K
William Berrios
@w33lliam
Mar 18, 2022
Happy to share that the fantastic teamwork with @ArtDeza resulted in our new paper which shows that a Vision Transformer trained in an adversarial manner and coupled with rotation invariance achieves new SOTA in Area V4 at @brain_score competition🤯. #COSYNE22
Arturo Deza
@ArtDeza
Mar 18, 2022
1/ Excited to share our new paper showing that training a Transformer adversarially with rotation invariance achieves new SOTA in Area V4 in @brain_score! We also scored 2nd in the aggregate competition. This epic tour-de-force was led by @W33lliam96 openreview.net/forum?id=SOulr…
William Berrios
@w33lliam
Jun 23, 2025
Replying to @w33lliam
As a quick recap, in LMUnit, we utilize "natural language unit tests," which decompose response quality into explicit, testable criteria. Instead of relying on opaque metrics like "pick the better response," each quality aspect becomes a specific question that humans can
2.1K
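The idea of decomposing response quality into explicit, testable criteria can be sketched roughly as follows. This is a minimal illustration, not the LMUnit API: `UnitTest`, `toy_scorer`, and `evaluate` are hypothetical names, and the keyword-based scorer is only a stand-in for a model that would answer each question.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class UnitTest:
    """One explicit, testable criterion phrased as a natural-language question."""
    name: str
    question: str


# Hypothetical scorer signature: (query, response, question) -> score in [0, 1].
Scorer = Callable[[str, str, str], float]


def toy_scorer(query: str, response: str, question: str) -> float:
    """Stand-in judge for illustration only; a real setup would call a model."""
    q = question.lower()
    if "cite" in q:
        # Treat a bracketed marker such as "[1]" as a citation.
        return 1.0 if "[" in response and "]" in response else 0.0
    if "concise" in q:
        return 1.0 if len(response.split()) <= 40 else 0.0
    return 0.5  # unknown criterion: neutral score


def evaluate(query: str, response: str, tests: List[UnitTest], scorer: Scorer) -> Dict[str, float]:
    """Score one response against every criterion separately."""
    return {t.name: scorer(query, response, t.question) for t in tests}


TESTS = [
    UnitTest("cites_source", "Does the response cite a source for its claim?"),
    UnitTest("concise", "Is the response concise?"),
]

scores = evaluate(
    "When was the paper published?",
    "It was published in 2024 [1].",
    TESTS,
    toy_scorer,
)
print(scores)
```

Each criterion gets its own score, so a failure points at a specific quality aspect rather than a single overall preference.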
William Berrios
@w33lliam
Jun 23, 2025
Replying to @w33lliam
Everything is available 👀 🏆 Leaderboard: huggingface.co/spaces/allenai… 🔧 API: contextual.ai/request-lmunit… 💻 RewardBench2 code submission: github.com/ContextualAI/e… 📄 Paper: arxiv.org/abs/2412.13091 ⭐ LMUnit-llama3.1 available now, LMUnit-qwen2.5 coming soon! 9/11
Reward Bench Leaderboard - a Hugging Face Space by allenai
From huggingface.co
882
William Berrios
@w33lliam
Aug 9, 2022
Would it be a good idea to extend the discussion period by 1 week? Hopefully, most authors could get a response to their rebuttal 🙃 #NeurIPS2022
William Berrios
@w33lliam
Apr 3, 2025
LMUnit in CI/CD pipelines for catching regressions ❤️
Contextual AI
@ContextualAI
Apr 3, 2025
🔥 Introducing the most reliable way to evaluate LLMs and agents in production! It's time to stop “vibe testing” your AI systems. Our latest developer's guide shows you how to rigorously test AI systems so that they hold up in production, using Contextual AI's LMUnit evaluation
131
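One way a CI/CD regression check over evaluation scores might look is sketched below. The criterion names, the score values, and the `find_regressions` helper are all assumptions made for illustration; they are not taken from the LMUnit guide, and the scores would come from whatever evaluator the pipeline actually runs.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Return the criteria whose score dropped by more than `tolerance`."""
    return sorted(
        name
        for name, base in baseline.items()
        # A criterion missing from the candidate run counts as a score of 0.
        if base - candidate.get(name, 0.0) > tolerance
    )


# Hypothetical per-criterion scores from the previous release and the new build.
baseline = {"grounded": 0.92, "concise": 0.80}
candidate = {"grounded": 0.84, "concise": 0.81}

regressions = find_regressions(baseline, candidate)
print(regressions)
```

A CI job could fail the build whenever the returned list is non-empty, for example by exiting with a non-zero status.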
William Berrios
@w33lliam
Jun 23, 2025
Replying to @w33lliam
But here is also one of our most exciting results: When humans evaluate using our unit tests instead of traditional preference judgments, inter-annotator agreement jumps from 71% to 86%! That's a 15-percentage-point improvement in human consensus, just by asking better questions. 6/11
1.3K
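The 71% to 86% jump can be read two ways, and a short calculation separates them. The `agreement_gain` helper is a hypothetical name used only for this arithmetic.

```python
from typing import Tuple


def agreement_gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


points, relative = agreement_gain(71.0, 86.0)
print(f"{points:.0f} percentage points, {relative:.1f}% relative")
# prints "15 percentage points, 21.1% relative"
```

The absolute gain is 15 percentage points; relative to the 71% starting point, that is an increase of about 21%.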
William Berrios
@w33lliam
Dec 13, 2023
🌟 Gen AI advances fairness in AI models with 🔁Diffusion Perturbations! Explore our demographic-balanced dataset for fair AI evaluation, led by the remarkable @niclui97 and @bryanchiaw! Let's ensure AI fairness prevails! 🤖✨ #AIForFairness #AAAI24
Nicholas Lui
@niclui97
Dec 13, 2023
Can Gen AI help us evaluate the fairness of AI models? The answer is YES. Excited to announce 🔁Diffusion Perturbations, a diffusion-based approach to create datasets balanced across demographic traits. Paper: arxiv.org/abs/2311.15108 Dataset: huggingface.co/datasets/Diffu… 🧵👇! 1/N
693
William Berrios
@w33lliam
Jun 29, 2023
Replying to @w33lliam
Make your favorite LLM vision-augmented with just a pip install and a few lines of Python! [9/N]
609
William Berrios
@w33lliam
Mar 19, 2024
Excited to share what we have been working on at @ContextualAI! RAG 2.0, our end-to-end system for developing production-grade AI 🚀 Check out our post with benchmarks and long-context experiments!
Contextual AI
@ContextualAI
Mar 19, 2024
Today, we’re excited to announce RAG 2.0, our end-to-end system for developing production-grade AI. Using RAG 2.0, we’ve created Contextual Language Models (CLMs), which achieve state-of-the-art performance on a variety of industry benchmarks. CLMs outperform strong RAG
360