Pinned
Scaling expert-annotated image captions is expensive. Supervised distillation from VLMs helps but has a diversity ceiling: models memorize the teacher's style and generalize poorly. Can RL fix this without a verifiable "ground truth"?
Introducing RubiCap: arxiv.org/abs/2603.09160






