Ok, it is yesterday's news already, but a good night's sleep is important.
After 7 amazing years at Google Brain/DM, I am joining OpenAI. Together with @XiaohuaZhai and @giffmana, we will establish the OpenAI Zurich office. Proud of our past work and looking forward to the future.
Alexander Kolesnikov
- Vision meets RL! We reveal that policy gradient can be used for tuning vision models to optimize complex metrics, such as mAP, PQ or “color diversity”, observing large performance boosts on tasks like object detection, panoptic segmentation, etc. arxiv.org/abs/2302.08242
- I've always been frustrated that, beyond image classification, computer vision is full of complex and task-specific components. Thus, very excited to share our new work, where we propose a unified modeling approach for vision: arxiv.org/abs/2205.10337. More in the thread🧵.
- Let me introduce big_vision: an original home of ViT, MLP-Mixer, LiT and many more. These days I do all my research in this codebase: it is great for doing vision research with emphasis on large-scale pretraining and transfer. Highlights in 🧵 ⬇️ Link:
- We release pre-trained vision transformer models and code for inference/fine-tuning: github.com/google-researc…. There is still a long way to go towards understanding transformers in vision and I am looking forward to future research. Hope this release will be a good starting point.
- We just released PaliGemma-3B, a very capable Vision-Language Model. Do not waste any time, finetune it for your task: Code: github.com/google-researc… Colab: colab.research.google.com/github/google-… Kaggle: kaggle.com/models/google/… HF: huggingface.co/collections/go… Vertex AI: console.cloud.google.com/vertex-ai/publ…
- MLP-Mixer (a new vision architecture based on MLP only) code and pretrained models are now available: github.com/google-researc…. Looking forward to community contributions that will shed some light on how Mixer works and how to make it even better. paper: arxiv.org/abs/2105.01601.
- We've landed a big revamp of github.com/google-researc…. The main new feature is support for flexible weight sharding, which doesn't get in the way of cutting-edge research code. Scaling ViTs, ResNets, MLP-Mixers, SigLIPs (and so on) beyond single GPU/TPU device memory becomes easy.
- I always dreamed of a model that simultaneously 1. optimizes NLL of raw pixel data, 2. generates competitive high-res. natural images, 3. is practical. But it seemed too good to be true. Until today! Our new JetFormer model (arxiv.org/abs/2411.19722) ticks all of these boxes. 🧵
  Quoted post: Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)? We have been pondering this during summer and developed a new model: JetFormer 🌊🤖 arxiv.org/abs/2411.19722 A thread 👇 1/
- With some delay, JetFormer's *prequel* paper is finally out on arXiv: a radically simple ViT-based normalizing flow (NF) model that achieves SOTA results in its class. Jet is one of the key components of JetFormer, deserving a standalone report. Let's unpack: 🧵⬇️
- We have open-sourced UViM models and complete training/inference/eval code. You can now train new models yourself and explore the released models (and UViM guiding codes) in the interactive colabs. All available at github.com/google-researc…. UViM paper: arxiv.org/abs/2205.10337.
- Also an interesting survey on MLP-Mixer and concurrent/follow-up research: arxiv.org/abs/2111.04060 Crazy that all of it happened in only ~6 months.
  Quoted post: An incredibly thorough-looking survey of Vision Transformers! It has only been just over a year since we published ViT. I thought it would be useful, but didn't imagine this much cool innovation would happen. arxiv.org/abs/2111.06091
- Do not want to miss out on the recent trend, so I officially announce that 1. All my ICML 2022 papers were rejected. 2. All my ICML 2022 papers were accepted. 3. Both statements above are true.
- Our PaliGemma technical report is finally out: arxiv.org/abs/2407.07726. We share many insights that we learned while cooking the PaliGemma-3B model. Both about pretraining and transfer.
  Quoted post: ✨PaliGemma report will hit arxiv tonight. We tried hard to make it interesting, and not "here model. sota results. kthxbye." So here's some of the many interesting ablations we did, check the paper tomorrow for more! 🧶
  Link preview (arxiv.org): PaliGemma: A versatile 3B VLM for transfer. PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base...