Alexander Kolesnikov (@__kolesnikov__)

444 posts

@__kolesnikov__

Joined January 2019

Alexander Kolesnikov
@__kolesnikov__
Dec 4, 2024
OK, it is yesterday's news already, but a good night's sleep is important. After 7 amazing years at Google Brain/DM, I am joining OpenAI. Together with @XiaohuaZhai and @giffmana, we will establish the OpenAI Zurich office. Proud of our past work and looking forward to the future.
Alexander Kolesnikov
@__kolesnikov__
Feb 17, 2023
Vision meets RL! We reveal that policy gradient can be used for tuning vision models to optimize complex metrics, such as mAP, PQ or “color diversity”, observing large performance boosts on tasks like object detection, panoptic segmentation, etc. arxiv.org/abs/2302.08242
Alexander Kolesnikov
@__kolesnikov__
May 23, 2022
I've always been frustrated that, beyond image classification, computer vision is full of complex and task-specific components. Thus, very excited to share our new work, where we propose a unified modeling approach for vision: arxiv.org/abs/2205.10337. More in the thread🧵.
Alexander Kolesnikov
@__kolesnikov__
May 4, 2022
Let me introduce big_vision: the original home of ViT, MLP-Mixer, LiT and many more. These days I do all my research in this codebase: it is great for doing vision research with an emphasis on large-scale pretraining and transfer. Highlights in 🧵 ⬇️ Link:
github.com/google-research/big_vision: Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Alexander Kolesnikov
@__kolesnikov__
Oct 23, 2020
We release pre-trained vision transformer models and code for inference/fine-tuning: github.com/google-researc…. There is still a long way towards understanding transformers in vision and I am looking forward to future research. Hope this release will be a good starting point.
Alexander Kolesnikov
@__kolesnikov__
May 14, 2024
We just released PaliGemma-3B, a very capable Vision-Language Model. Do not waste any time, finetune it for your task: Code: github.com/google-researc… Colab: colab.research.google.com/github/google-… Kaggle: kaggle.com/models/google/… HF: huggingface.co/collections/go… Vertex AI: console.cloud.google.com/vertex-ai/publ…
Alexander Kolesnikov
@__kolesnikov__
May 5, 2021
MLP-Mixer (a new vision architecture based on MLP only) code and pretrained models are now available: github.com/google-researc…. Looking forward to community contributions that will shed some light on how Mixer works and how to make it even better. paper: arxiv.org/abs/2105.01601.
Alexander Kolesnikov
@__kolesnikov__
Nov 1, 2023
We've landed a big revamp of github.com/google-researc…. The main new feature is support for flexible weight sharding, which doesn't get in the way of cutting-edge research code. Scaling ViTs, ResNets, MLP-Mixers, SigLIPs (and so on) beyond single GPU/TPU device memory becomes easy.
github.com/google-research/big_vision: Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Alexander Kolesnikov
@__kolesnikov__
Dec 2, 2024
I always dreamed of a model that simultaneously 1. optimizes NLL of raw pixel data, 2. generates competitive high-res. natural images, 3. is practical. But it seemed too good to be true. Until today! Our new JetFormer model (arxiv.org/abs/2411.19722) ticks all of these boxes. 🧵
Michael Tschannen
@mtschannen
Dec 2, 2024
Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)? We have been pondering this during summer and developed a new model: JetFormer 🌊🤖 arxiv.org/abs/2411.19722 A thread 👇 1/
Alexander Kolesnikov
@__kolesnikov__
Dec 20, 2024
With some delay, JetFormer's *prequel* paper is finally out on arXiv: a radically simple ViT-based normalizing flow (NF) model that achieves SOTA results in its class. Jet is one of the key components of JetFormer, deserving a standalone report. Let's unpack: 🧵⬇️
Alexander Kolesnikov
@__kolesnikov__
Jul 28, 2022
We have open-sourced UViM models and complete training/inference/eval code. You can now train new models yourself and explore the released models (and UViM guiding codes) in the interactive colabs. All available at github.com/google-researc…. UViM paper: arxiv.org/abs/2205.10337.
Alexander Kolesnikov
@__kolesnikov__
Nov 15, 2021
Also an interesting survey on MLP-Mixer and concurrent/follow-up research: arxiv.org/abs/2111.04060 Crazy that all of it happened in only ~6 months.
Neil Houlsby
@neilhoulsby
Nov 15, 2021
An incredibly thorough-looking survey of Vision Transformers! It has only been just over a year since we published ViT. I thought it would be useful, but didn't imagine this much cool innovation would happen. arxiv.org/abs/2111.06091
Alexander Kolesnikov
@__kolesnikov__
May 16, 2022
Do not want to miss out on the recent trend, so I officially announce that 1. All my ICML 2022 papers were rejected. 2. All my ICML 2022 papers were accepted. 3. Both statements above are true.
Alexander Kolesnikov
@__kolesnikov__
Jul 11, 2024
Our PaliGemma technical report is finally out: arxiv.org/abs/2407.07726. We share many insights that we learned while cooking the PaliGemma-3B model. Both about pretraining and transfer.
Lucas Beyer (bl16)
@giffmana
Jul 10, 2024
✨PaliGemma report will hit arxiv tonight. We tried hard to make it interesting, and not "here model. sota results. kthxbye." So here's some of the many interesting ablations we did, check the paper tomorrow for more! 🧶
arxiv.org
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base...