Benchmarks have never been less useful for telling us which models are best.
They are good for giving a general sense of the landscape. They definitely paint a picture. But if you’re comparing top models, like GPT-5.4 against Opus 4.6 against Gemini 3.1 Pro, you have to use the models, talk to the models, get reports from those who have, and form a gestalt. The reports will contradict each other, and you have to work through that. There’s no other way.
Thus, I try to gather and sort a reasonably comprehensive set of reactions, so you can browse whichever sections make you most curious.