<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Leon Sick on Medium]]></title>
        <description><![CDATA[Stories by Leon Sick on Medium]]></description>
        <link>https://medium.com/@leon.sick?source=rss-fa7caddf2d75------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*ViqkMq-xDLZBaFo0EZI7tQ.jpeg</url>
            <title>Stories by Leon Sick on Medium</title>
            <link>https://medium.com/@leon.sick?source=rss-fa7caddf2d75------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 04 Apr 2026 02:08:36 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@leon.sick/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Paper explained: Exploring Plain Vision Transformer Backbones for Object Detection]]></title>
            <link>https://medium.com/data-science/paper-explained-exploring-plain-vision-transformer-backbones-for-object-detection-a84483ac83b6?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/a84483ac83b6</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[self-supervised-learning]]></category>
            <category><![CDATA[object-detection]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Sun, 20 Nov 2022 09:29:18 GMT</pubDate>
            <atom:updated>2022-11-21T17:11:00.750Z</atom:updated>
            <content:encoded><![CDATA[<h4>The power of ViTs as object detection backbones</h4><p>In this story, we will take a closer look at a paper recently published by researchers from Meta AI, in which the authors explore how a standard ViT can be re-purposed as an object detection backbone. In short, their detection architecture is called <strong>ViTDet</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XcZacP2BLipc8sie6RkBPg.png" /><figcaption>A visualization of a feature pyramid network (FPN) used for a plain backbone, like the one in ViTDet, and a “classic” hierarchical backbone, commonly represented as a CNN. Source: [1]</figcaption></figure><h4><strong>Pre-requisite: Object Detection Backbones</strong></h4><p>Previously, backbones for object detectors have profited from different resolutions at different stages of the network. As displayed in the figure above, the feature maps have different resolutions, from which the detection heads performing the actual object detection step greatly benefit. These backbones are commonly called <strong>hierarchical backbones</strong> in the scientific literature. Commonly, ResNets or other CNNs are referred to as hierarchical backbones, but certain ViTs like the Swin Transformer also have hierarchical structures. The paper we look at today deals with a different backbone structure: Since ViTs are made up of a number of transformer blocks that all output features of the same dimensionality, they never naturally output feature maps of different resolutions. The authors tackle this issue in their paper and explore different strategies to construct a multi-resolution FPN.</p><h4><strong>Generating multi-resolution features from a single-resolution backbone</strong></h4><p>Since ViTs naturally only provide one resolution for their feature maps, the authors explore how to convert this map to different resolutions using an FPN.
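As a side note, the idea of deriving a multi-resolution pyramid from the single-scale map a plain ViT outputs can be sketched in a few lines of toy Python. This is my own illustration, not the authors' code: real implementations use learned convolutions and deconvolutions, for which plain average pooling and nearest-neighbor upsampling stand in here, and all helper names are made up.

```python
# Toy sketch (not the ViTDet implementation): build a small feature
# pyramid from the single 1/16-scale map a plain ViT outputs.
# Learned (de)convolutions are replaced by pooling/upsampling.

def upsample2x(fmap):
    """Nearest-neighbor upsampling: every value becomes a 2x2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def downsample2x(fmap):
    """2x2 average pooling, halving each spatial dimension."""
    h, w = len(fmap), len(fmap[0])
    return [[(fmap[i][j] + fmap[i][j + 1]
              + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def simple_pyramid(vit_map):
    """Derive 1/8-, 1/16- and 1/32-scale maps from the 1/16 ViT map."""
    return {"1/8": upsample2x(vit_map),
            "1/16": vit_map,
            "1/32": downsample2x(vit_map)}

feat = [[1.0, 2.0], [3.0, 4.0]]  # toy 2x2 "feature map", channels omitted
pyramid = simple_pyramid(feat)
```

The point of the sketch is only that every scale is computed from one and the same backbone output, which is exactly what the most minimalistic variant in the paper does.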
To ease memory constraints and add global context to the feature outputs, the authors do not compute global self-attention in all ViT blocks. Instead, they opt to divide the transformer into 4 even sections, e.g., for a ViT-L with 24 blocks, each section comprises 6 blocks. At the end of each section, they compute global self-attention, whose output is used as a feature map for the FPNs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PUesn506Z0g6lg2aeqifoQ.png" /><figcaption>The different FPNs explored for ViTDet. Source: [1]</figcaption></figure><p>For <strong>approach (a)</strong>, they attempt to construct an FPN-like solution by up- or downsampling the 1/16-scale global-attention outputs of the individual sections, using convolutions or deconvolutions. They also add lateral connections, visualized by the arrows connecting the blue blocks.</p><p>For <strong>approach (b)</strong>, they construct the FPN by up- and downscaling only the last feature map from the global self-attention module. This means all features in the FPN are constructed from a single output. Again, they add the lateral connections.</p><p>For <strong>approach (c)</strong>, they propose a very simple, puristic solution: Up- and downsampling the final global-attention output without adding any lateral connections. This is by far the most minimalistic approach, but as we will see now, it works remarkably well.</p><h4><strong>Performance comparison of different FPN approaches</strong></h4><p>Let’s get right into it!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XEYqniYZmLaGT7gL_tYzwg.png" /><figcaption>The results on COCO for different FPNs.
Source: [1]</figcaption></figure><p>Remarkably, the <strong>simple FPN, approach (c), works best</strong> across two ViT sizes, for bounding box regression and instance segmentation on the MS COCO detection benchmark.</p><p>But why even attempt such a simple solution to enable plain ViTs as detection backbones when there already are ViT-based detection networks? The answer will become apparent now.</p><h4>Comparison against state-of-the-art (SOTA) ViT detection networks</h4><p>Recent research in the field of self-supervised pre-training has started to unlock incredible capabilities in ViTs. One of the most promising tasks in this domain challenges a network to reconstruct the masked parts of an image, implemented in the Masked Autoencoders (MAE) paper. We have covered this paper on my blog; feel free to refresh your knowledge <a href="https://medium.com/towards-data-science/paper-explained-masked-autoencoders-are-scalable-vision-learners-9dea5c5c91f0"><strong>here</strong></a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MJy5uBb8KrNNH40SBCtQZA.png" /><figcaption>Results on COCO of ViTDet (the three bottom rows) compared to previous ViT-based detectors. Source: [1]</figcaption></figure><p>The MAE pre-trains a standard ViT to reconstruct the masked parts of an image. This has proven to be a successful strategy for pre-training. To transfer this advantage to object detection, the authors create the ViTDet architecture. This is the entire purpose of the paper: Unlock the power of pre-trained ViTs for object detection. And the results tell the story.</p><p>As you can see from the results table, pre-training the backbone with the MAE and then using their simple FPN on top yields SOTA results for ViT-based detection backbones. Since the Swin Transformer and MViT are not compatible with self-supervised pre-training strategies without modifications, they are pre-trained supervised on ImageNet.
Astonishingly, MAE pre-training unlocks much more performance than standard supervised pre-training. Therefore, the authors hint at where future improvements in object detection research will come from: <strong>Not the detection architecture itself, but more powerful pre-training of the backbone in a self-supervised manner.</strong></p><p>In my eyes, this represents a key shift in object detection research. If you would like to read more about the paradigm shift self-supervised pre-training brings to the computer vision domain, feel free to read my story detailing the transition <a href="https://medium.com/towards-data-science/from-supervised-to-unsupervised-learning-a-paradigm-shift-in-computer-vision-ae19ada1064d"><strong>here</strong></a>.</p><h4>Wrapping it up</h4><p>We have explored the ViTDet architecture, a simple yet powerful adaptation of traditional FPNs to plain ViTs that unlocks the power of self-supervised vision transformers for object detection. Not only that, but this research paves the way for a new direction of object detection research in which the focus shifts from the architecture to the pre-training technique.</p><p>While I hope this story gave you a good first insight into ViTDet, there is still so much more to discover. Therefore, I would encourage you to read the papers yourself, even if you are new to the field. You’ll have to start somewhere ;)</p><p>If you are interested in more details on the method presented in the paper, feel free to drop me a message on Twitter; my account is linked on my Medium profile.</p><p>I hope you’ve enjoyed this paper explanation. If you have any comments on the article or if you see any errors, feel free to leave a comment.</p><p><strong>And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>.
I try to post a story here and there and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] Li, Yanghao, et al. “Exploring plain vision transformer backbones for object detection.” <em>arXiv preprint arXiv:2203.16527</em> (2022).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a84483ac83b6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/paper-explained-exploring-plain-vision-transformer-backbones-for-object-detection-a84483ac83b6">Paper explained: Exploring Plain Vision Transformer Backbones for Object Detection</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Paper explained: Pushing The Limits Of Self-Supervised ResNets: Can We Outperform Supervised…]]></title>
            <link>https://medium.com/data-science/paper-explained-pushing-the-limits-of-self-supervised-resnets-can-we-outperform-supervised-2ba322dd6409?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/2ba322dd6409</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[research]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Sat, 26 Feb 2022 15:37:29 GMT</pubDate>
            <atom:updated>2022-03-09T05:41:05.551Z</atom:updated>
            <content:encoded><![CDATA[<h3>Paper explained: Pushing The Limits Of Self-Supervised ResNets: Can We Outperform Supervised Learning Without Labels On ImageNet?</h3><h4>Exploring the novel approaches in ReLICv2</h4><p>In this story, we will look at a recent paper that pushes the state of self-supervised learning forward, published by DeepMind and nicknamed ReLICv2.</p><p>In their publication <em>“Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?”</em>, Tomasev et al. present an improvement over the technique presented in the paper behind ReLIC, named <em>“Representation learning via invariant causal mechanisms”</em>. At the core of their method stands the addition of a Kullback-Leibler divergence loss that is calculated using a probabilistic formulation of the classical contrastive learning objective. Not only that, but they also use a refined augmentation scheme and learn from the successes of other relevant publications.</p><p>I’ve tried to keep the article simple so that even readers with little prior knowledge can follow along. Without further ado, let’s dive in!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Pv8i_1XXjgp-Xg5luFLHDA.png" /><figcaption>An illustration of the ReLIC training pipeline. Source: <a href="https://arxiv.org/pdf/2010.07922.pdf">[1]</a></figcaption></figure><h3>Pre-requisite: Self-supervised &amp; unsupervised pre-training for computer vision</h3><p>Before we go deeper into the paper, it’s worth quickly re-visiting what self-supervised pre-training is all about. <strong>If you have been reading other self-supervised learning stories from me or you are familiar with self-supervised pre-training, feel free to skip this part.</strong></p><p>Traditionally, computer vision models have always been trained using <strong>supervised learning</strong>.
That means humans looked at the images and created all sorts of <strong>labels</strong> for them, so that the model could learn the patterns of those labels. For example, a human annotator would assign a class label to an image or draw bounding boxes around objects in the image. But as anyone who has ever been in contact with labeling tasks knows, the effort to create a sufficient training dataset is high.</p><p>In contrast, <strong>self-supervised learning does not require any human-created labels</strong>. As the name suggests, <strong>the model learns to supervise itself</strong>. In computer vision, the most common way to model this self-supervision is to take different crops of an image or apply different augmentations to it and pass the modified inputs through the model. Even though the modified images no longer look the same, <strong>we let the model learn that they still contain the same visual information</strong>, i.e., the same object. <strong>This leads to the model learning a similar latent representation (an output vector) for the same objects.</strong></p><p>We can later apply transfer learning to this pre-trained model. Usually, these models are then trained on 10% of the data with labels to perform downstream tasks such as object detection and semantic segmentation.</p><h3>A combination of novel contributions and learnings</h3><p>As is the case for many other self-supervised pre-training techniques, the first step in the ReLICv2 training process is all about data augmentations. In the paper, the authors first mention the use of previously successful augmentation schemes.</p><p>The first are the augmentations used in SwAV. Contrary to previous work, SwAV creates not just two different crops of the input image, but multiple croppings. These can be made in different sizes such as 224x224 and 96x96; the most successful configuration is two large crops and six small crops.
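To make the multi-crop idea concrete, here is a toy sketch in plain Python. This is my own code, not SwAV's: an image is represented as a nested list of pixel values and the helper names are made up.

```python
# Toy sketch of SwAV-style multi-crop (not the SwAV implementation):
# sample two large and six small random crops of one image.
import random

def random_crop(img, size):
    """Cut a random size x size window out of a nested-list image."""
    h, w = len(img), len(img[0])
    top = random.randrange(h - size + 1)
    left = random.randrange(w - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def multi_crop(img, large=224, small=96, n_large=2, n_small=6):
    """Return the large crops first, then the small crops."""
    crops = [random_crop(img, large) for _ in range(n_large)]
    crops += [random_crop(img, small) for _ in range(n_small)]
    return crops

img = [[0] * 256 for _ in range(256)]  # toy 256x256 "image"
crops = multi_crop(img)                # 8 views of the same image
```

All eight views are then treated as containing the same visual content during training, which is what makes the augmentation a supervision signal.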
If you would like to learn more about the augmentation scheme of SwAV, make sure to <a href="https://medium.com/towards-data-science/paper-explained-unsupervised-learning-of-visual-features-by-contrasting-cluster-assignments-f9e87db3cb9b">read my story on it</a>.</p><p>The second set of previously described augmentations comes from SimCLR. This scheme is now used by pretty much all papers in this space. The image is manipulated by applying a random horizontal flip, color distortion, a Gaussian blur and solarization. If you would like to read more about SimCLR, make sure to <a href="https://medium.com/towards-data-science/paper-explained-a-simple-framework-for-contrastive-learning-of-visual-representations-6a2a63bfa703">take a look at my article</a>.</p><p>But ReLICv2 also uses another augmentation technique: Removing the background from the object in the image. To achieve this, they train a <strong>salient background removal</strong> model on some of the ImageNet data in an unsupervised way. The authors found this augmentation to be most effective when applied with a probability of 10%.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JZZ6yW6vGgfXODW6aGkX8g.png" /><figcaption>The salient background removal augmentation using an unsupervised DeepUSPS. Source: <a href="https://arxiv.org/pdf/2201.05119.pdf">[2]</a></figcaption></figure><p>Once the image is augmented and multiple crops are made, the outputs are passed through an encoder network and a target network, both of which output feature vectors of the same dimension.
<strong>While the encoder network is updated using backpropagation, the target network receives updates through a momentum calculation</strong> similar to <a href="https://towardsdatascience.com/paper-explained-momentum-contrast-for-unsupervised-visual-representation-learning-ff2b0e08acfb">the MoCo framework</a>.</p><p>The overall goal of ReLICv2 is to train the encoder network to produce consistent output vectors for the same classes. To achieve this, the authors formulate a novel loss function. They start with the <strong>standard contrastive negative log-likelihood</strong>, which at its core has a similarity function that compares the anchor views (the main input image) to the positive examples (augmented versions of the image) and the negative examples (other images in the same batch).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dL5TcDyjC5ciDKFDEYUiow.png" /><figcaption>The ReLICv2 loss function consisting of the negative log likelihood and the Kullback-Leibler divergence of the anchor view and the positive views. Source: <a href="https://arxiv.org/pdf/2201.05119.pdf">[2]</a></figcaption></figure><p>This loss is extended by <strong>a probabilistic formulation of a contrastive objective</strong>: <strong>The Kullback-Leibler divergence between the likelihood of the anchor image and a positive</strong>. This <strong>does not simply push positives to be close together and negatives to be farther apart, but creates a more balanced landscape between the clusters, avoiding the extreme clustering that could lead to a collapse in learning</strong>. Therefore, this additional loss term can be seen as a form of <strong>regularization</strong>. The two terms are accompanied by alpha and beta hyperparameters that allow an individual weighting of the two loss terms.</p><p>The addition of all these novelties proves to be successful.
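As a rough illustration of this two-term objective, here is a heavily simplified toy sketch. This is my own code and naming, not the paper's: similarity scores are taken as given instead of being computed from real embeddings, and the exact placement of the alpha/beta weights is my simplification.

```python
# Toy sketch (not the ReLICv2 implementation): a contrastive negative
# log-likelihood plus a KL-divergence regularizer between the anchor's
# and the positive view's similarity distributions.
import math

def softmax(scores, temp=0.1):
    """Turn raw similarity scores into a probability distribution."""
    exps = [math.exp(s / temp) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll(probs, pos_index=0):
    """Contrastive NLL: the positive should receive all probability."""
    return -math.log(probs[pos_index])

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def relic_style_loss(anchor_sims, positive_sims, alpha=1.0, beta=1.0):
    p_anchor = softmax(anchor_sims)
    p_positive = softmax(positive_sims)
    return alpha * nll(p_anchor) + beta * kl(p_anchor, p_positive)

# Similarities of each view to [positive, negative, negative] examples:
loss = relic_style_loss([0.9, 0.1, 0.2], [0.8, 0.2, 0.1])
```

The KL term is zero when anchor and positive produce identical similarity distributions, which is what makes it act like a regularizer rather than a second contrastive force.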
To find out in which ways, let’s have a closer look at the results presented in the paper.</p><h3>Results</h3><p><strong>The main point ReLICv2 is trying to prove, as the title of the paper says, is that self-supervised pre-training methods are only comparable if they all use the same network architecture for the encoder network.</strong> For their work, they have opted to use the classic ResNet-50.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HP0PtEEfpE2nwosi8EIqMQ.png" /><figcaption>Results of using the different pre-trained ResNet-50 under the ImageNet linear evaluation protocol. Source: <a href="https://arxiv.org/pdf/2201.05119.pdf">[2]</a></figcaption></figure><p>When using the same ResNet-50 and training its linear layer on ImageNet-1K while freezing all other weights, ReLICv2 outperforms existing methods by a considerable margin. The introduced improvements even lead to a performance advantage over the original ReLIC paper.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hbbG2e-gyGEyoZjWUj7WCg.png" /><figcaption>Increase in accuracy versus a supervised pre-trained model on different datasets. Source: <a href="https://arxiv.org/pdf/2201.05119.pdf">[2]</a></figcaption></figure><p>When comparing the transfer learning performance on other datasets, ReLICv2 continues to show impressive performance versus other methods such as NNCLR and BYOL. This further establishes ReLICv2 as a novel state-of-the-art method for self-supervised pre-training. <strong>Evaluation on other datasets is not often mentioned in other papers.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*O9aa0GAiQvw4TB_2b_85qA.png" /><figcaption>An illustration of learned clusters by ReLICv2 and BYOL. The bluer the point, the closer it is learned to the corresponding class cluster.
Source: <a href="https://arxiv.org/pdf/2201.05119.pdf">[2]</a></figcaption></figure><p>Another telling graphic shows the classes learned by ReLICv2 to be much closer together than they are for other frameworks such as BYOL. This again shows that the technique has the potential to create much more fine-grained clusters than other methods.</p><h3>Wrapping it up</h3><p>In this article, you have learned about ReLICv2, a novel method for self-supervised pre-training that has shown promising experimental results.</p><p>By incorporating a probabilistic formulation of the contrastive learning objective and by adding proven augmentation schemes, this technique has been able to push the space of self-supervised pre-training in vision forward.</p><p>While I hope this story gave you a good first insight into ReLICv2, there is still so much more to discover. Therefore, I would encourage you to read the papers yourself, even if you are new to the field. You’ll have to start somewhere ;)</p><p>If you are interested in more details on the method presented in the paper, feel free to drop me a message on Twitter; my account is linked on my Medium profile.</p><p>I hope you’ve enjoyed this paper explanation. If you have any comments on the article or if you see any errors, feel free to leave a comment.</p><p><strong>And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>. I try to post a story once a week and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] Mitrovic, Jovana, et al. “Representation learning via invariant causal mechanisms.” <em>arXiv preprint arXiv:2010.07922</em> (2020). <a href="https://arxiv.org/pdf/2010.07922.pdf">https://arxiv.org/pdf/2010.07922.pdf</a></p><p>[2] Tomasev, Nenad, et al.
“Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?.” <em>arXiv preprint arXiv:2201.05119</em> (2022). <a href="https://arxiv.org/pdf/2201.05119.pdf">https://arxiv.org/pdf/2201.05119.pdf</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2ba322dd6409" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/paper-explained-pushing-the-limits-of-self-supervised-resnets-can-we-outperform-supervised-2ba322dd6409">Paper explained: Pushing The Limits Of Self-Supervised ResNets: Can We Outperform Supervised…</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Paper explained: Momentum Contrast for Unsupervised Visual Representation Learning]]></title>
            <link>https://medium.com/data-science/paper-explained-momentum-contrast-for-unsupervised-visual-representation-learning-ff2b0e08acfb?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/ff2b0e08acfb</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[research]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[computer-vision]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Mon, 10 Jan 2022 14:21:02 GMT</pubDate>
            <atom:updated>2022-01-10T16:15:19.610Z</atom:updated>
            <content:encoded><![CDATA[<h4>Going over the principles from the MoCo v1 &amp; v2 papers</h4><p>In this story, we will go over the MoCo papers, an unsupervised approach to pre-training computer vision models using Momentum Contrast (MoCo) that has been iteratively improved by the authors, from version one up to version three.</p><p>MoCo was first presented in the paper <a href="https://arxiv.org/pdf/1911.05722.pdf"><em>“Momentum Contrast for Unsupervised Visual Representation Learning”</em></a> by He et al. in 2020 and further improved to MoCo v2 in <a href="https://arxiv.org/pdf/2003.04297.pdf"><em>“Improved Baselines with Momentum Contrastive Learning”</em></a>. There is another iteration, MoCo v3, presented in <a href="https://arxiv.org/pdf/2104.02057.pdf">“An Empirical Study of Training Self-Supervised Vision Transformers”</a> by Chen et al., which introduces a fundamental adaptation of the training process and model architecture; we will cover it in a future story.</p><p>Since this method has been continuously improved, we will first go over the initial ideas behind the paper, followed by an overview of the improvements introduced in newer versions. I’ve tried to keep the article simple so that even readers with little prior knowledge can follow along. Without further ado, let’s dive in!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1016/1*AURmxepRI4G6WT0DirxZDw.png" /><figcaption>A simplified illustration of the key idea behind the initial MoCo paper. Source: <a href="https://arxiv.org/pdf/1911.05722.pdf">[1]</a></figcaption></figure><h3>Pre-requisite: Self-supervised &amp; unsupervised pre-training for computer vision</h3><p>Before we go deeper into the MoCo paper, it’s worth quickly re-visiting what self-supervised pre-training is all about.
<strong>If you have been reading other self-supervised learning stories from me or you are familiar with self-supervised pre-training, feel free to skip this part.</strong></p><p>Traditionally, computer vision models have always been trained using <strong>supervised learning</strong>. That means humans looked at the images and created all sorts of <strong>labels</strong> for them, so that the model could learn the patterns of those labels. For example, a human annotator would assign a class label to an image or draw bounding boxes around objects in the image. But as anyone who has ever been in contact with labeling tasks knows, the effort to create a sufficient training dataset is high.</p><p>In contrast, <strong>self-supervised learning does not require any human-created labels</strong>. As the name suggests, <strong>the model learns to supervise itself</strong>. In computer vision, the most common way to model this self-supervision is to take different crops of an image or apply different augmentations to it and pass the modified inputs through the model. Even though the modified images no longer look the same, <strong>we let the model learn that they still contain the same visual information</strong>, i.e., the same object. <strong>This leads to the model learning a similar latent representation (an output vector) for the same objects.</strong></p><p>We can later apply transfer learning to this pre-trained model. Usually, these models are then trained on 10% of the data with labels to perform downstream tasks such as object detection and semantic segmentation.</p><h3>Using Momentum Contrast for Unsupervised Learning</h3><p>In the MoCo paper, the unsupervised learning process is framed as a dictionary look-up: <strong>Each view or image is assigned a key</strong>, just like in a dictionary.
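As a toy illustration of this look-up framing (my own sketch, not the authors' code), a query embedding can be matched to the key with the highest cosine similarity; the helper names here are made up:

```python
# Toy sketch (not the MoCo implementation): match an encoded query
# to the closest key embedding via cosine similarity.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(query, keys):
    """Return the index of the key closest to the encoded query."""
    sims = [cosine(query, k) for k in keys]
    return max(range(len(keys)), key=lambda i: sims[i])

# Toy encodings: the query should match key 1, a near-duplicate view.
keys = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.5, 0.9]
best = lookup(query, keys)
```

In the real method, both the query and the keys are of course produced by learned encoder networks rather than written down by hand, as the article describes next.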
<strong>This key is generated by encoding each image</strong> using a convolutional neural network, with the output being a vector representation of the image. Now, <strong>if a query is presented to this dictionary in the form of another image, this query image is also encoded</strong> into a vector representation and <strong>will belong to one of the keys in the dictionary, the one with the lowest distance</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WerzRQgq8gZgo3w-L5oWkg.png" /><figcaption>On the left is the query image and encoder, on the right is the queue of mini-batches of keys and the momentum encoder. Source: <a href="https://arxiv.org/pdf/1911.05722.pdf">[1]</a></figcaption></figure><p>The training process is designed as follows: A query image is selected and processed by the encoder network to compute <em>q</em>, the encoded query image. Since the goal of the model is to learn to differentiate between a large number of different images, this query image encoding is not only compared to one mini-batch of encoded key images, but to multiple of them. To achieve that, MoCo forms a queue of mini-batches that are encoded by the momentum encoder network. As a new mini-batch is selected, its encodings are enqueued and the oldest encodings in the data structure are dequeued. <strong>This decouples the dictionary size, represented by the queue, from the batch size and enables a much larger dictionary to query from.</strong></p><p>If the encoding of the query image matches a key in the dictionary, these two views are deemed to be from the same image (e.g. multiple different crops).</p><p>The training objective is formulated via the InfoNCE loss function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7gw975iVCkURNV48emKs1g.png" /><figcaption>The InfoNCE contrastive loss function.
Source: <a href="https://arxiv.org/pdf/1911.05722.pdf">[1]</a></figcaption></figure><p>This loss function allows the model to learn a smaller distance for views from the same image and a larger distance between different images. To achieve this, the loss resembles a softmax-based classifier loss that aims to classify <em>q</em> as <em>k+</em>. The sum is computed over one positive key and <em>K</em> negative keys for query <em>q</em>.</p><p>Finally, <strong>backpropagation is performed through the encoder network, as can be seen in the first illustration, but not through the momentum encoder.</strong></p><p>With the momentum encoder, the authors try to achieve a relatively consistent encoding for each image. Since keeping the weights frozen would not reflect the model’s learning progress, <strong>they propose using a momentum update</strong> instead of copying the encoder’s weights. Specifically, they set the momentum encoder network to be updated via a <strong>momentum-based moving average of the query encoder</strong>. This update is performed at every training iteration.</p><h3>Further improvements in MoCo v2</h3><p>In the second version of the model, <strong>MoCo v2</strong>, presented in their paper <a href="https://arxiv.org/pdf/2003.04297.pdf"><em>“Improved Baselines with Momentum Contrastive Learning”</em></a><em>,</em> the authors transfer some of the contributions from the <a href="https://towardsdatascience.com/paper-explained-a-simple-framework-for-contrastive-learning-of-visual-representations-6a2a63bfa703">SimCLR paper</a> to their model. Initially, for training a linear classifier after pre-training MoCo, they added a single linear layer to the model. In MoCo v2, this layer has been replaced by an MLP projection head, leading to better classification performance. Another significant improvement came with the introduction of stronger data augmentations, similar to those in SimCLR.
They extend the v1 augmentations by adding blur and stronger color distortion.</p><p>The latest version, <strong>MoCo v3</strong>, which was introduced in the <a href="https://arxiv.org/pdf/2104.02057.pdf">“An Empirical Study of Training Self-Supervised Vision Transformers”</a> paper, adapts the model to a different training process. Since these changes are more significant than a simple extension, we will cover this paper in a future story.</p><h3>Wrapping it up</h3><p>In this article, you have learned about MoCo, one of the most popular self-supervised frameworks, which paved the way for more promising papers in this space.</p><p><em>Usually in my paper stories, I present the results and performance of the method towards the end. This time, we will skip this part since both methods are not state-of-the-art anymore, but MoCo v3 is. Nevertheless, I found it of great importance to learn the principles behind MoCo as a foundation for the state-of-the-art MoCo v3 paper.</em></p><p>While I hope this story gave you a good first insight into MoCo, there is still so much more to discover. Therefore, I would encourage you to read the papers yourself, even if you are new to the field. You’ll have to start somewhere ;)</p><p>If you are interested in more details on the method presented in the paper, feel free to drop me a message on Twitter; my account is linked on my Medium profile.</p><p>I hope you’ve enjoyed this paper explanation. If you have any comments on the article or if you see any errors, feel free to leave a comment.</p><p><strong>And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>. I try to post a story once a week and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] He, Kaiming, et al.
“Momentum contrast for unsupervised visual representation learning.” <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>. 2020. <a href="https://arxiv.org/pdf/1911.05722.pdf">https://arxiv.org/pdf/1911.05722.pdf</a></p><hr><p><a href="https://medium.com/data-science/paper-explained-momentum-contrast-for-unsupervised-visual-representation-learning-ff2b0e08acfb">Paper explained: Momentum Contrast for Unsupervised Visual Representation Learning</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Supervised To Unsupervised Learning: A Paradigm Shift In Computer Vision]]></title>
            <link>https://medium.com/data-science/from-supervised-to-unsupervised-learning-a-paradigm-shift-in-computer-vision-ae19ada1064d?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/ae19ada1064d</guid>
            <category><![CDATA[self-supervised-learning]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[unsupervised-learning]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[research]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Sat, 01 Jan 2022 17:15:24 GMT</pubDate>
            <atom:updated>2022-01-01T17:28:28.838Z</atom:updated>
<content:encoded><![CDATA[<h4>Slowly removing the injection of human knowledge from the training process</h4><p>Since the inception of modern computer vision methods, success in the application of these techniques could only be seen in the supervised domain. In order to make a model useful for performing tasks such as image recognition, object detection or semantic segmentation, human supervision used to be necessary. In a major shift, the last few years of computer vision research have changed the focus of the field: away from the guaranteed success of human supervision and onto new frontiers: self-supervised and unsupervised learning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/848/1*cWME7QKVDOzlPaGRRlXGtQ.gif" /><figcaption>An animation of the clustering of different classes in an unsupervised way. Source: <a href="https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training/">[1]</a></figcaption></figure><p>Let’s go on a journey towards a new era that has already begun.</p><h3>The success of supervised learning</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K_0xzsEYZjBFMsNyNh_Zlg.png" /><figcaption>An illustration of the original AlexNet architecture. Source: <a href="https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf">[2]</a></figcaption></figure><p>AlexNet marked the first breakthrough in the application of neural networks for image tasks, more specifically the ImageNet challenge. From there, it was game on and the computer vision research community stormed towards perfecting supervised techniques for many kinds of computer vision tasks.</p><p>For image classification, many variations of models have emerged since the original AlexNet paper. ResNet has unarguably become the classic among convolutional neural networks. Efficient architectures such as EfficientNet have followed. 
There are even networks optimized for mobile devices, such as the MobileNet architecture. More recently, Vision Transformers have gained increasing attention (unintended joke) and have been shown to outperform convolutional neural networks under the right settings (lots of data and compute). Originally invented for language tasks, their application for computer vision has been a huge success. Another interesting approach, called RegNet, has been to design network design spaces in which a quantized linear function defines the network architecture.</p><p>The next tasks that were tackled successfully with supervised learning were object detection and semantic segmentation. R-CNNs have made the first big splash in the former domain, followed by many advances in computational efficiency and accuracy. Notable approaches are the Fast, Faster and Mask R-CNN but also the YOLO algorithms and single-shot detectors such as the SSD MobileNet. A milestone in the domain of semantic segmentation was the U-Net architecture.</p><p>Also not to forget are the benchmark datasets that made the supervised techniques more comparable. ImageNet set the standard for image classification and MS COCO is still important for object detection and segmentation tasks.</p><p>All of these techniques have one thing in common: They rely on distilled human knowledge and skill in the form of labeled data to perform well. In fact, they are built around this resource and depend on it.</p><p>In some way, all these techniques employ artificial neural networks that model the biological neural network in humans. But still, these models learn to perceive very differently from the way humans do. Why only mimic the human brain in its biological form and not the cognitive process behind learning to recognize and classify?</p><p>This is where the next evolution comes in: self-supervised learning.</p><h3>Introducing self-supervision into the process</h3><p>Think about how you have learned to see. 
How you learned to recognize an apple. When you were younger, you saw many apples, but not all of them had a sign on them that said “This is an apple” and no one told you it was an apple every time you saw one. The way you learned was by similarity: You have seen this object time and time again, multiple times per week, maybe even per day. You recognized: Hey… this is the same thing!</p><p>Then, one day, someone taught you this is an apple. All of a sudden, this abstract object, this visual representation, it now became known to you as “apple”. A similar process is used in self-supervised learning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RFILaYrPribmV3rP-lwA2w.png" /><figcaption>An illustration of the SimCLR training process. Source: <a href="https://arxiv.org/pdf/2002.05709.pdf">[3]</a></figcaption></figure><p>State-of-the-art techniques such as <a href="https://towardsdatascience.com/paper-explained-a-simple-framework-for-contrastive-learning-of-visual-representations-6a2a63bfa703">SimCLR</a> or <a href="https://towardsdatascience.com/paper-explained-unsupervised-learning-of-visual-features-by-contrasting-cluster-assignments-f9e87db3cb9b">SwAV</a> mimic this process. For pre-training, all labels are discarded and the models train without the use of human knowledge. The models are shown two versions of the same image, be it cropped, color-distorted or rotated, and start to learn that despite their differing visual representations, these objects are the same “thing”. In fact, this is visible in their similar latent vector representations (remember this for later). So, the model learns to produce a consistent vector for each class of object.</p><p>Next is the “teaching” step: The pre-trained model is shown some images with labels this time. 
And it learns much more quickly and effectively to classify different kinds of objects.</p><p>So much of the human knowledge has been removed from the training process, but not all of it. But the next step is just around the corner.</p><h3>Towards unsupervised learning</h3><p>To make a model fully unsupervised, it has to be trained without human supervision (labels) and still be able to achieve the tasks it is expected to do, such as classifying images.</p><p>Remember that the self-supervised models already take a step in this direction: Before they are shown any labels, they are already able to compute consistent vector representations for different objects. This is key to removing all human supervision.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OpMOF2vd3dWikyhUNnzeEw.png" /><figcaption>An illustration of the clustering of different classes in an unsupervised way. Source: <a href="https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training/">[1]</a></figcaption></figure><p>What this vector generally represents is an image reduced in its dimensionality. In fact, autoencoders can be trained to recreate the image pixels. Because of its reduced dimension, we can use a technique long ignored (for good reasons) in computer vision: A k-nearest-neighbors classifier. If our vector representations are good enough that the same objects form a cluster and different objects are clustered far away, we can feed the model a new, unknown image and the model will assign it to the cluster of the correct class. The model will not be able to tell you what the class name is, but what group of images it belongs to. If you assign a class name to this group, all objects in the group can be classified. 
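</p><p><em>As a small illustration of this idea, here is a NumPy sketch. The hand-made 2-D “embeddings” are an assumption for brevity; a real model would output high-dimensional latent vectors. A new vector is assigned to the majority group among its nearest neighbors.</em></p>

```python
import numpy as np

# Toy 2-D "latent vectors" (assumed for illustration) with cluster memberships;
# note that the groups carry no class names, only identities
embeddings = np.array([[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]])
group_ids = np.array([0, 0, 1, 1])


def knn_assign(query, embeddings, group_ids, k=3):
    """Assign a new embedding to the majority group among its k nearest neighbors."""
    dists = np.linalg.norm(embeddings - query, axis=1)  # distance to every stored vector
    nearest = np.argsort(dists)[:k]                     # indices of the k closest
    votes = np.bincount(group_ids[nearest])             # count group memberships
    return int(np.argmax(votes))


group = knn_assign(np.array([0.95, 0.05]), embeddings, group_ids)  # lands in group 0
```

<p>Whatever name you later attach to group 0 (say, “apple”) then applies to every image whose embedding lands in that cluster.</p><p>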
After all, class names are artificial creations by humans (someone defined that this thing is called apple) and are only assigned meaning by humans.</p><p>Since all labels are removed from the training process and the results in papers like <a href="https://towardsdatascience.com/paper-explained-dino-emerging-properties-in-self-supervised-vision-transformers-f9386df266f1">DINO</a> are quite promising, this is the closest we have come to removing all supervision from the training process of computer vision models.</p><p>But there is still more to come, more room for improvement.</p><h3>Wrapping it up</h3><p>If you have been reading up to this point, I highly appreciate you taking your time. I have purposely not included any images in this story since they divert your attention away from the meaning of this text. I mean, we all want to be a good transformer, right? (This time it was intended)</p><p>I sincerely thank you for reading this article. If you are interested in self-supervised learning, have a look at other stories of mine where I try to explain state-of-the-art papers in the space to anyone interested. <strong>And if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>. I try to post a story once a week and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] Meta AI Research Blog Post. <a href="https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training/">https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training/</a></p><p>[2] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” <em>Advances in neural information processing systems</em> 25 (2012): 1097–1105. 
<a href="https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf">https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf</a></p><p>[3] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” <em>International conference on machine learning</em>. PMLR, 2020. <a href="https://arxiv.org/pdf/2002.05709.pdf">https://arxiv.org/pdf/2002.05709.pdf</a></p><hr><p><a href="https://medium.com/data-science/from-supervised-to-unsupervised-learning-a-paradigm-shift-in-computer-vision-ae19ada1064d">From Supervised To Unsupervised Learning: A Paradigm Shift In Computer Vision</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Paper explained: Masked Autoencoders Are Scalable Vision Learners]]></title>
            <link>https://medium.com/data-science/paper-explained-masked-autoencoders-are-scalable-vision-learners-9dea5c5c91f0?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/9dea5c5c91f0</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[research]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Wed, 29 Dec 2021 12:42:05 GMT</pubDate>
            <atom:updated>2021-12-29T14:23:35.141Z</atom:updated>
<content:encoded><![CDATA[<h4>How reconstructing masked parts of an image can be beneficial</h4><p>Autoencoders have a history of success for Natural Language Processing tasks. The BERT model started masking words in different parts of a sentence and tried to reconstruct the full sentence by predicting the words to be filled into the blanks. Recent work has aimed to transfer this idea to the computer vision domain.</p><p>In this story, we will have a look at the recently published paper <a href="https://arxiv.org/pdf/2111.06377.pdf"><em>“Masked Autoencoders Are Scalable Vision Learners”</em></a> by He et al. from 2021. Kaiming He is one of the most influential researchers in the field of computer vision, having produced breakthroughs such as the ResNet, Faster R-CNN and Mask R-CNN along with other researchers at Meta AI Research. In their latest paper, they presented a novel approach for using autoencoders for self-supervised pre-training of computer vision models, specifically vision transformers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*l8zPV1sSDmEPbwHTDh5Rzw.png" /><figcaption>A high-level visualization of the masked autoencoder training pipeline. Source: <a href="https://arxiv.org/pdf/2111.06377.pdf">[1]</a></figcaption></figure><p>Before we dive deeper into the method presented by them, it is important that we quickly revisit self-supervised pre-training to set the context right. If you are already familiar with self-supervised pre-training, feel free to skip this part. I’ve tried to keep the article simple so that even readers with little prior knowledge can follow along. Without further ado, let’s dive in!</p><h3>Pre-requisites: Self-supervised pre-training for computer vision</h3><p>Before we go deeper into the paper, it’s worth quickly re-visiting what self-supervised pre-training is all about. 
</p><p>Traditionally, computer vision models have always been trained using <strong>supervised learning</strong>. That means humans looked at the images and created all sorts of <strong>labels</strong> for them, so that the model could learn the patterns of those labels. For example, a human annotator would assign a class label to an image or draw bounding boxes around objects in the image. But as anyone who has ever been in contact with labeling tasks knows, the effort to create a sufficient training dataset is high.</p><p>In contrast, <strong>self-supervised learning does not require any human-created labels</strong>. As the name suggests, <strong>the model learns to supervise itself</strong>. In computer vision, the most common way to model this self-supervision is to take different crops of an image or apply different augmentations to it and pass the modified inputs through the model. Even though the modified images no longer look the same, <strong>we let the model learn that they still contain the same visual information</strong>, i.e., the same object. <strong>This leads to the model learning a similar latent representation (an output vector) for the same objects.</strong></p><p>We can later apply transfer learning on this pre-trained model. Usually, these models are then trained on 10% of the data with labels to perform downstream tasks such as object detection and semantic segmentation.</p><h3>Use masking to make autoencoders understand the visual world</h3><p>A key novelty in this paper is already included in the title: The masking of an image. Before an image is fed into the encoder transformer, a certain set of masks is applied to it. The idea here is to remove pixels from the image and therefore feed the model an incomplete picture. 
The model’s task is now to learn what the full, original image looked like.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/564/1*0Vd0bkvc5ta0yQFDOxz19w.png" /><figcaption>On the left, the masked image can be seen. On the right, the original image is displayed. And the center column shows the image reconstructed by the autoencoder. Source: <a href="https://arxiv.org/pdf/2111.06377.pdf">[1]</a></figcaption></figure><p>The authors found a very high masking ratio to be most effective. In these examples, they have covered 75% of the image with masks. This brings along two benefits:</p><ol><li>Training the model is 3x faster since it has to process far fewer image patches</li><li>The accuracy increases since the model has to learn the visual world from the images thoroughly</li></ol><p>The masking is always applied randomly, so multiple versions of the same image can be used as input.</p><p>Now that the images have been pre-processed, let’s have a look at the model architecture. In their paper, He et al. decide on using an asymmetric encoder-decoder design. That means their encoder can be much deeper, while they opt for a rather lightweight decoder.</p><p>The encoder divides the image into patches that are assigned positional encodings (i.e. the squares in the images above) and only processes the non-masked parts of the image. The output of the encoder is a latent vector representation of the input image patches.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*GczhOMZJz0DUHjDvwkpQNw.png" /><figcaption>A visualization of the encoder receiving the non-masked image patches and outputting the latent vector representation. Source: <a href="https://arxiv.org/pdf/2111.06377.pdf">[1]</a></figcaption></figure><p>Following this, the mask tokens are introduced since the next step is for the decoder to reconstruct the initial image. Each mask token is a shared, learned vector that indicates the presence of a missing patch. 
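</p><p><em>The random masking described above can be sketched in a few lines of NumPy. This is a toy illustration, not the authors’ implementation; the 14×14 grid of flattened patch tokens is an assumed shape.</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)


def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens and report which positions were masked.

    patches: array of shape (num_patches, patch_dim)
    """
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = rng.permutation(num_patches)       # random order of patch positions
    keep_idx = np.sort(shuffled[:num_keep])       # positions fed to the encoder
    mask_idx = np.sort(shuffled[num_keep:])       # positions hidden from the encoder
    return patches[keep_idx], keep_idx, mask_idx


# Assumed toy shapes: a 14x14 grid of patches (196 tokens), each flattened to 768 values
patches = rng.normal(size=(196, 768))
visible, keep_idx, mask_idx = random_masking(patches)
print(visible.shape)  # (49, 768): only a quarter of the patches reach the encoder
```

<p>At decoding time, the <code>mask_idx</code> positions are the ones filled with the shared mask token before the decoder reconstructs the image.</p><p>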
Positional encodings are again applied to communicate to the decoder where the individual patches are located in the original image.</p><p>The decoder receives the latent representation along with the mask tokens as input and outputs the pixel values for each of the patches, including the masks. From this information, the original image can be pieced together to form the predicted version of the full image from the masked image that served as the input.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*Z6RqFr6FlGrt06upLgAM8g.png" /><figcaption>The decoding process from the latent representation of the masked input image to the reconstructed target image. Source: <a href="https://arxiv.org/pdf/2111.06377.pdf">[1]</a></figcaption></figure><p>Adding the mask tokens after the computation of the latent vector in blue is an important design decision. It reduces the computational cost of the encoder arriving at the vector output since it has to process fewer patches. This makes the model faster during training.</p><p>Once the target image has been reconstructed, its difference from the original input image is measured and used as the loss.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/716/1*Asf08MKwx0suN9luT7cFKA.png" /><figcaption>A comparison of a reconstructed image with the original image. Source: <a href="https://arxiv.org/pdf/2111.06377.pdf">[1]</a></figcaption></figure><p>After the model has been trained, the decoder is discarded and only the encoder, i.e., the vision transformer, is kept for further use. It is now capable of computing latent representations of images for further processing.</p><p>Now that we have gone over the methodology introduced by the paper, let’s look at some results.</p><h3>Results</h3><p>Since the masked autoencoder makes use of transformers, it makes sense for the authors to compare its performance to other transformer-based self-supervised methods. 
And their improvements show in the first comparison:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1012/1*AnmBCsfsP4VL82zEELKgpg.png" /><figcaption>Pre-training with fine-tuning results on ImageNet-1K. Source: <a href="https://arxiv.org/pdf/2111.06377.pdf">[1]</a></figcaption></figure><p>In their comparisons with other methods, when pre-training the model on ImageNet-1K and then fine-tuning it end-to-end, the MAE (masked autoencoder) shows superior performance compared to other approaches such as <a href="https://towardsdatascience.com/paper-explained-dino-emerging-properties-in-self-supervised-vision-transformers-f9386df266f1">DINO</a>, MoCo v3 or BEiT. The improvements stay steady even with increasing model size; performance is best with a ViT-H (Vision Transformer Huge). MAE achieves an incredible accuracy of 87.8%.</p><p>This performance holds true for transfer learning on downstream tasks as well:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/928/1*41zLgru2DgtXdxgHqpPQ3w.png" /><figcaption>Transfer learning applied to different transformer-based pre-training methods. These results are for the COCO detection and segmentation dataset. Source: <a href="https://arxiv.org/pdf/2111.06377.pdf">[1]</a></figcaption></figure><p>When using the pre-trained transformer as a backbone for a Mask R-CNN that is trained on the MS COCO detection and segmentation dataset, the MAE again outperforms all other transformer-based methods. It achieves an incredible 53.3 AP (average precision) for the boxes. The Mask R-CNN also outputs a segmentation mask of the object. For this evaluation, the MAE again tops all other methods with up to 47.2 AP for the mask. 
The method even outperforms fully-supervised training of the Mask R-CNN, again showing the benefit of self-supervised pre-training.</p><h3>Wrapping it up</h3><p>In this article, you have learned about masked autoencoders (MAE), a paper that leverages transformers and autoencoders for self-supervised pre-training and adds another simple but effective concept to the self-supervised pre-training toolbox. It even outperforms fully-supervised approaches on some tasks. While I hope this story gave you a good first insight into the paper, there is still so much more to discover. Therefore, I would encourage you to read the paper yourself, even if you are new to the field. You’ll have to start somewhere ;)</p><p>If you are interested in more details on the method presented in the paper, feel free to drop me a message on Twitter; my account is linked on my Medium profile.</p><p>I hope you’ve enjoyed this paper explanation. If you have any comments on the article or if you see any errors, feel free to leave a comment.</p><p><strong>And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>. I try to post a story once a week and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] He, Kaiming, et al. “Masked autoencoders are scalable vision learners.” <em>arXiv preprint arXiv:2111.06377</em> (2021). 
<a href="https://arxiv.org/pdf/2111.06377.pdf">https://arxiv.org/pdf/2111.06377.pdf</a></p><hr><p><a href="https://medium.com/data-science/paper-explained-masked-autoencoders-are-scalable-vision-learners-9dea5c5c91f0">Paper explained: Masked Autoencoders Are Scalable Vision Learners</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Two Simple Ways To Measure Your Model’s Uncertainty]]></title>
            <link>https://medium.com/data-science/2-easy-ways-to-measure-your-image-classification-models-uncertainty-1c489fefaec8?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/1c489fefaec8</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[research]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Tue, 28 Dec 2021 15:23:46 GMT</pubDate>
            <atom:updated>2021-12-28T18:09:05.921Z</atom:updated>
<content:encoded><![CDATA[<h4>The key to better understanding your model’s predictions</h4><p>In this article, we will go over 2 methods that allow you to obtain your model’s uncertainty: <strong>Monte Carlo Dropout and Deep Ensembles</strong>. They are applicable for a wide variety of tasks, but in this article, we will show an example for <strong>image classification</strong>. Both of them are relatively easy to understand and implement, and both can easily be applied to any existing convolutional neural network architecture (e.g. ResNet, VGG, <a href="https://medium.com/towards-data-science/regnet-the-most-flexible-network-architecture-for-computer-vision-2fd757f9c5cd">RegNet</a>, etc.). To help you with your fast and easy application of these techniques, I will provide the complementary <strong>code for these techniques written in PyTorch</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FYjnwwX-SoNINT7oL8OJUg.png" /><figcaption>Given an image of two cups, how certain is your model of its prediction? Image by Author.</figcaption></figure><p>Before we start, let’s go over what measuring model uncertainty means and how it can be useful for your machine learning project.</p><h3>What is model uncertainty?</h3><p>Just like humans, a machine learning model can display a degree of confidence in its predictions. In general, when talking about model uncertainty, the distinction is made between <strong>epistemic and aleatoric uncertainty</strong>.</p><p><strong>Epistemic uncertainty</strong> is the <strong>uncertainty represented in the model parameters</strong> and captures the ignorance about the models most suitable to explain our data. <strong>This type of uncertainty can be reduced with additional training data</strong> and therefore carries the alternative name <strong>“reducible uncertainty”</strong>. 
A model will broadcast high epistemic uncertainty for inputs far away from the training data and low epistemic uncertainty for data points near the training data.</p><p><strong>Aleatoric uncertainty</strong> captures <strong>noise inherent to the environment</strong>, i.e., the observation. Compared to epistemic uncertainty, this type cannot be reduced with more data, but only with more precise sensor output.</p><p>The third type is called <strong>predictive uncertainty</strong>, which is the uncertainty conveyed in the model’s output. <strong>Predictive uncertainty can combine epistemic and aleatoric uncertainty</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*amdgG1leSZZbp7XoORWgFA.png" /><figcaption>An example of a softmax output for a classifier of 4 classes. This value is not suitable for uncertainty estimation. Image by Author.</figcaption></figure><p>If you have already trained simple neural networks yourself, the most intuitive thing to think about is the <strong>softmax output</strong> of your model, i.e., the percentage values you often see displayed as a result of the model’s prediction.</p><p>But using the <strong>softmax output</strong> as a measure of model uncertainty can be misleading and is not very useful. This is because all that the softmax function does is compute a sort of “relation” between the different activation values of the model. So, your model can have <strong>low activation values in all of the neurons of its output layer</strong> and <strong>still arrive at a high softmax value</strong>. This is not what we are aiming for. But thankfully, there are multiple more effective techniques to estimate a model’s uncertainty, such as Monte Carlo Dropout and Deep Ensembles.</p><h3>Why is model uncertainty useful?</h3><p>There are two major aspects that make estimating your model’s uncertainty useful:</p><p>The first is <strong>transparency</strong>. 
Imagine you are building a machine learning model that is applied in medical image analysis. So, the doctors using your tool heavily depend on its capabilities to make a correct diagnosis. If your model now makes a prediction it is actually highly uncertain about but does not communicate this information to the doctor, the consequences for the treatment of the patient can be fatal. Therefore, having an estimation of the model’s uncertainty can help the doctor massively when judging the model’s prediction.</p><p>The second is showing <strong>room for improvement</strong>. No machine learning model is perfect. Therefore, knowing the uncertainties and weaknesses of your model can actually inform you what improvements to make to your model. There is actually an entire discipline dedicated to that called <strong>Active Learning</strong>. Let’s say you have trained your ConvNet with 1000 images and 10 classes. But you still have 9000 more images that are not labeled yet. If you now use your trained model to predict which images it is most uncertain about, you can label only those and re-train the model. It has been shown that this type of uncertainty sampling is much more effective for model improvement compared to random sampling of these images.</p><p>Alright, enough of the prerequisites, let’s get to the two techniques.</p><h3>Technique 1: Monte Carlo Dropout</h3><p>Monte Carlo Dropout, or MC Dropout for short, is a technique that uses dropout layers in your model to create variation in the model’s outputs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EKI5ufCZNNZvsQKx90hLFA.png" /><figcaption>A simplified visualization of dropout applied to a neural network. Image by Author.</figcaption></figure><p><strong>Dropout layers</strong> are usually used as a <strong>regularization technique during training</strong>. Some neurons are <strong>randomly dropped out at a certain probability</strong> during a forward pass through the network. 
This has been shown to make the model more robust against overfitting. Usually, these dropout layers are disabled after training so as not to interfere with the forward pass on a new image. So, to use this technique, make sure to have at least one dropout layer implemented in your model. It can look something like this:</p><pre>import torch<br>import torch.nn as nn<br>import torch.nn.functional as F<br><br><br>class Model(nn.Module):<br>    def __init__(self):<br>        super().__init__()<br>        self.conv1 = nn.Conv2d(3, 6, 5)<br>        self.pool = nn.MaxPool2d(2, 2)<br>        self.conv2 = nn.Conv2d(6, 16, 5)<br>        self.fc1 = nn.Linear(16 * 5 * 5, 120)<br>        self.fc2 = nn.Linear(120, 84)<br>        self.fc3 = nn.Linear(84, 10)<br>        <br>        # Dropout layer defined with 0.25 dropout probability<br>        self.dropout = nn.Dropout(0.25)<br><br>    def forward(self, x):<br>        x = self.pool(F.relu(self.conv1(x)))<br>        x = self.pool(F.relu(self.conv2(x)))<br>        x = torch.flatten(x, 1)<br>        x = F.relu(self.fc1(x))<br>        x = F.relu(self.fc2(x))<br>        <br>        # Dropout applied<br>        x = self.dropout(x)<br>        <br>        x = self.fc3(x)<br><br>        return x<br><br><br>model = Model()</pre><p>But for MC Dropout, the dropout layers are still activated, meaning neurons can still randomly drop out. This results in a variation of the softmax results of the model. To turn dropout on during inference or testing, use the following code:</p><pre>for module in model.modules():<br>    if module.__class__.__name__.startswith(&#39;Dropout&#39;):<br>        module.train()</pre><p>Now dropout is still applied since we have put all dropout layers into training mode!</p><p>Let’s say we wanted to obtain the <strong>model’s uncertainty on one image</strong> now. 
To do this, <strong>we will not predict on the image just once, but multiple times, </strong>and<strong> analyze the different outputs generated by the multiple forward passes</strong>. I would recommend <strong>letting the model predict on one image 3 or 5 times</strong>. I will go over how to combine the 3 or 5 outputs at the end of this article.</p><h3>Technique 2: Deep Ensembles</h3><p>The second technique to estimate model uncertainty takes advantage of creating an ensemble of models. <strong>Instead of using one model and predicting 5 times with it, the idea is to use multiple models of the same type, randomly initialize their weights and train them on the same data.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UqepZov8tCAZnRtGYou4gQ.png" /><figcaption>A visualization of an ensemble of multiple neural networks, similar to the deep ensembles approach. Image by Author.</figcaption></figure><p>This will also create a variation in the model parameters. If the model is trained robustly and is certain about an image, it will output similar values for each forward pass. To initialize the models, it is best to save them as a list of models:</p><pre># How many models do we want to have in our ensemble<br>n_ensemble = 5<br><br># Initialize the ensemble<br>ensemble = [Model() for _ in range(n_ensemble)]</pre><p>Following the initialization, all models are trained on the same training data. <strong>Just like for MC Dropout, 3 or 5 models are a good choice.</strong> To obtain the model’s uncertainty on a given image, it is passed through each of the models in the ensemble and the predictions are combined for analysis.</p><h3>Combining the model outputs from multiple forward passes</h3><p>Assume we have defined 5 forward passes for MC Dropout and an ensemble size of 5 for the deep ensemble. 
<strong>We now expect some variation between these outputs that exhibits the model’s uncertainty.</strong> To arrive at a final value for uncertainty, these outputs have to be stacked first. This code is an example of how this can be achieved for MC Dropout:</p><pre>import numpy as np<br>import torch.nn.functional as F<br><br>fwd_passes = 5<br>predictions = []<br><br>for fwd_pass in range(fwd_passes):<br>    # Softmax turns the logits into the probabilities we stack<br>    output = F.softmax(model(image), dim=-1)<br><br>    np_output = output.detach().cpu().numpy()<br><br>    if fwd_pass == 0:<br>        predictions = np_output<br>    else:<br>        predictions = np.vstack((predictions, np_output))</pre><p>First, we define the number of forward passes to perform as well as an empty list to save all predictions to. Then we perform 5 forward passes. The first output serves as the initialization of the numpy array of results, all other outputs are stacked on top.</p><p>This is the code for deep ensembles with the same underlying principle:</p><pre>import numpy as np<br>import torch.nn.functional as F<br><br>predictions = []<br><br>for i_model, model in enumerate(ensemble):<br>    # Softmax turns the logits into the probabilities we stack<br>    output = F.softmax(model(image), dim=-1)<br><br>    np_output = output.detach().cpu().numpy()<br><br>    if i_model == 0:<br>        predictions = np_output<br>    else:<br>        predictions = np.vstack((predictions, np_output))</pre><p>Now that we have combined all outputs, let’s look at how to calculate the model’s uncertainty from these outputs.</p><h3>Obtaining model uncertainty</h3><p>To keep it simple, we will use the predictive entropy to estimate the uncertainty of the model on a given image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*znjdupGX9HtjxqE-F3drrg.png" /><figcaption>The mathematical formulation of the predictive entropy given y (label), x (input image), Dtrain (training data), c (class), p (probability). Source: <a href="http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf">[1]</a></figcaption></figure><p>In general, <strong>the predictive uncertainty tells you how “surprised” your model is to see this image</strong>. 
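</p><p>To build some intuition with made-up numbers: if the stacked softmax outputs agree across passes, the entropy of their mean is low; if they disagree, it is high. A small numpy sketch (all values are illustrative):</p>

```python
import numpy as np

def entropy(mean_probs, eps=1e-12):
    # Predictive entropy of the averaged softmax outputs
    return -np.sum(mean_probs * np.log(mean_probs + eps))

# Five agreeing, confident outputs over 3 classes
certain = np.array([[0.97, 0.02, 0.01]] * 5)

# Five disagreeing outputs: the model "changes its mind" every pass
uncertain = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8],
                      [0.3, 0.4, 0.3],
                      [0.2, 0.3, 0.5]])

low = entropy(certain.mean(axis=0))     # roughly 0.15 nats: low surprise
high = entropy(uncertain.mean(axis=0))  # roughly 1.10 nats: high surprise
```

<p>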
If the value is low, the model is certain about its prediction. If the result is high, the model does not know what is in the image.</p><p>Calculating the predictive uncertainty can be achieved with this piece of code that receives the <em>predictions </em>array from earlier as input.</p><pre>import sys<br>import numpy as np<br><br>def predictive_entropy(predictions):<br>    epsilon = sys.float_info.min<br>    predictive_entropy = -np.sum( np.mean(predictions, axis=0) * np.log(np.mean(predictions, axis=0) + epsilon),<br>            axis=-1)<br><br>    return predictive_entropy</pre><p>The epsilon in the equation prevents taking the logarithm of 0, which is mathematically not defined.</p><p>Alright! Now, you have your uncertainty value for one image. <strong>As previously mentioned, the higher the value, the more uncertain your model is.</strong></p><h3>Wrapping it up</h3><p>In this article, you have learned to estimate your model’s uncertainty. This technique can also be applied to object detection with a couple of tweaks and it is very powerful. While I hope this story gave you a good first insight into the topic, there is still so much more to discover.</p><p><strong>And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>. I try to post a story once a week and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] Uncertainty in Deep Learning, Yarin Gal. 
<a href="http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf">http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf</a></p><hr><p><a href="https://medium.com/data-science/2-easy-ways-to-measure-your-image-classification-models-uncertainty-1c489fefaec8">Two Simple Ways To Measure Your Model’s Uncertainty</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Paper explained: Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View…]]></title>
            <link>https://medium.com/data-science/paper-explained-semi-supervised-learning-of-visual-features-by-non-parametrically-predicting-view-1fcbf91517a0?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/1fcbf91517a0</guid>
            <category><![CDATA[research]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Mon, 20 Dec 2021 13:02:05 GMT</pubDate>
            <atom:updated>2022-06-02T08:47:48.468Z</atom:updated>
            <content:encoded><![CDATA[<h3>Paper explained: Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples</h3><h4>Taking advantage of labeled support samples for semi-supervised learning</h4><p>In this story, we will have a closer look at PAWS (<strong>p</strong>redicting view <strong>a</strong>ssignments <strong>w</strong>ith <strong>s</strong>upport labels), a novel method for applying semi-supervised learning to computer vision problems.</p><p>This method has been presented as part of the recent paper <a href="https://arxiv.org/pdf/2104.13963.pdf"><em>“Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples”</em></a> by Assran et al. at ICCV 2021. In contrast to some of the other papers I have written about, this method makes limited use of labeled data, projecting this information onto a larger pool of unlabeled data to learn from. Therefore, <strong>the labeled data becomes even more valuable than it would be if the model were to be trained fully supervised</strong>. As always, I’ve tried to keep the article simple so that even readers with little prior knowledge can follow along. Without further ado, let’s dive in!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fFPTomfx0V5TL8n_wD66hg.png" /><figcaption>A simplified visualization of the PAWS training process. Most notably, labels are introduced to the training process along with the support samples. Source: <a href="https://github.com/facebookresearch/suncet">[1]</a></figcaption></figure><h3>Pre-requisite: Self-Supervised vs. Semi-Supervised Learning</h3><p>An important distinction to make is the difference between self-supervised and semi-supervised learning.</p><p>When training a model in a self-supervised way, the supervision does not come from labeled data. Rather, <strong>the model, as the name suggests, supervises itself</strong>. 
What might that supervision look like? Simply put, the model can be fed the same image with different data augmentations applied and the goal would be to learn that the image is still the same. In that way, the learning process is guided.</p><p>In contrast, semi-supervised learning does rely on labeled data, but very little of it. When using a traditional supervised training approach, all data has to be labeled. <strong>Semi-supervised models only use a fraction of the labeled data and distill the labels to reason and learn about the unlabeled training data</strong>. This will become clearer throughout this article.</p><h3>Learning from few labeled images</h3><p>In the past, supervised computer vision models have always been evaluated on the ImageNet data. To perform the evaluation, the models were trained on the dataset using only labeled images. PAWS only has 1% or 10% of the labels available for training and still achieves incredible performance. Let’s have a look at why!</p><p>To start with, each image is randomly augmented to form 2 views: an anchor view and a positive view. While the content of the image is still the same, e.g. it contains a dog, <strong>the augmentations have now distorted the visual representation of it, i.e., they look different</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/678/1*rJkAPO8rNDsrnmAvDirxiw.png" /><figcaption>An example of a data augmentation. On the right, the original image, on the left, the augmented image. Source: <a href="https://arxiv.org/pdf/2002.05709.pdf">[2]</a></figcaption></figure><p>These two views of the same image are now encoded using a convolutional neural network, in this case a ResNet-50. This network takes each image as an input and outputs a vector representation of it.<strong> In self-supervised learning, we would now formulate a similarity loss or something similar between the two views of the same image</strong>. 
But <strong>in semi-supervised learning</strong>, before any loss function is formulated to learn from these outputs, <strong>we can benefit from the labeled images we have available</strong>.</p><p>The labeled images are called support samples in the paper. These labeled support samples are also encoded, using the same ResNet-50, to form a vector representation for each of them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sd-qEWdGcZm-dU-reaAS7w.png" /><figcaption>An illustration of the encoder forming vector representations of the input images. Source: <a href="https://arxiv.org/pdf/2104.13963.pdf">[3]</a></figcaption></figure><p>Now that we have the vector representations with their labels available, we can use those to measure their similarity to the anchor and positive image encodings. This is achieved by using a so-called <strong>soft nearest-neighbor strategy</strong>. This means the <strong>anchor and positive view are classified according to their similarity to the support samples</strong>, and they are assigned a <strong>soft pseudo-label</strong>, i.e., a combination of the support sample labels weighted by similarity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Pcmlq8QFo-vRnfyhnZKBDw.png" /><figcaption>The full training process visualized. The arrows just before the bar charts represent the soft nearest neighbor classifier. Source: <a href="https://arxiv.org/pdf/2104.13963.pdf">[3]</a></figcaption></figure><p>Now that the soft pseudo-labels have been generated for the positive and anchor view, the cross entropy is calculated between them as a formulation of the loss term.</p><p>Note that, in order to prevent the learning from collapsing, <strong>temperature parameters are introduced to the nearest neighbor classifier</strong>, but also to the target prediction displayed in the bottom-right corner of the illustration. The temperature parameter acts as a sharpening tool in this case. 
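</p><p>To make the mechanism concrete, here is a small numpy sketch of a soft nearest-neighbor assignment followed by sharpening. The embedding size, number of support samples, similarity measure (cosine) and temperature values are illustrative assumptions, not the paper’s exact settings:</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Toy setup: 4 labeled support samples, 3 classes, 8-D embeddings
support_z = rng.normal(size=(4, 8))   # support sample embeddings
support_y = np.eye(3)[[0, 1, 2, 0]]   # their one-hot labels
view_z = rng.normal(size=8)           # embedding of one augmented view

# Cosine similarity between the view and each support sample
sims = support_z @ view_z / (
    np.linalg.norm(support_z, axis=1) * np.linalg.norm(view_z))

tau = 0.1                             # temperature of the classifier
weights = softmax(sims / tau)         # soft nearest-neighbor weights
pseudo_label = weights @ support_y    # soft pseudo-label over the classes

# Sharpening: raise to 1/T and renormalize, which pushes probability
# mass towards the most likely class (closer to one-hot)
T = 0.25
sharpened = pseudo_label ** (1 / T)
sharpened /= sharpened.sum()
```

<p>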
Simply put, <strong>in the output distribution, high values are changed to be even higher and low values are changed to be even lower</strong>, increasing the contrast between them. This makes for a learning target closer to a one-hot encoding (the label is encoded into an array of zeros and a one).</p><p>With that, we will conclude our short overview of the techniques behind the paper. Now, let’s look at some results!</p><h3>Results</h3><p>As previously discussed, compared to self-supervised learning, semi-supervised learning can make use of information in the form of labels to learn to understand the visual world around it. As part of the evaluations, 1% or 10% of the ImageNet training data was labeled for PAWS during pre-training.<strong> The other methods were trained without supervision and then fine-tuned using 10% of the labeled ImageNet data</strong>. This makes the first advantage of PAWS rather intuitive.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T5SwUtQK7QurR-ODE37NaQ.png" /><figcaption>PAWS compared to other pre-training techniques for Top 1 ImageNet classification. Source: <a href="https://arxiv.org/pdf/2104.13963.pdf">[3]</a></figcaption></figure><p>The first chart shows that <strong>PAWS performs much better on this classification benchmark with fewer training epochs since its training data is much more information-rich</strong>. Using labels during training thus removes the need for the long training times that SwAV or SimCLRv2 require. This makes sense since SwAV, a self-supervised method, has to figure out a good model of the visual world without any human help, whereas PAWS, in this case, uses 10% of the data with labels.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/948/1*-IOCtYE--0DZCUzjXbshlg.png" /><figcaption>A table comparing PAWS’ performance to other pre-training techniques. 
All models use a ResNet-50 and were fine-tuned on either 1% or 10% of the labels, except for PAWS-NN, which uses nearest neighbor classification on the learned embeddings. Source: <a href="https://arxiv.org/pdf/2104.13963.pdf">[3]</a></figcaption></figure><p><strong>The performance advantage of PAWS holds true when comparing it to other pre-training methods, including state-of-the-art self-supervised pre-training</strong>. With fewer training epochs, PAWS is able to outperform all of them, using either 1% or 10% of the labeled ImageNet data as support samples. Even more impressive are the results seen under PAWS-NN. For this, no fine-tuning of PAWS was done and all images were classified based on a nearest neighbor classification of their raw output embeddings. This is remarkable and shows the true potential of semi-supervised training.</p><p>This shows that <strong>distilled human knowledge in the form of labels, when incorporated correctly, can be extremely powerful</strong>. As shown by the lower training times compared to self-supervised learning, by letting the human assist the learning process, information can be absorbed much more efficiently by the artificial neural network. In some sense, it is two neural networks working together: one biological, one artificial.</p><h3>Wrapping it up</h3><p>In this article, you have learned about PAWS, a paper using semi-supervised learning to profit from few labeled images and transfer that knowledge onto all other unlabeled images. This has been shown to exhibit desirable properties such as lower pre-training time and great performance. While I hope this story gave you a good first insight into the paper, there is still so much more to discover. Therefore, I would encourage you to read the paper yourself, even if you are new to the field. 
You’ll have to start somewhere ;)</p><p>If you are interested in more details on the method presented in the paper, feel free to drop me a message on Twitter, my account is linked on my Medium profile.</p><p>I hope you’ve enjoyed this paper explanation. If you have any comments on the article or if you see any errors, feel free to leave a comment.</p><p><strong>And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>. I try to post a story once a week and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] PAWS GitHub Repository: <a href="https://github.com/facebookresearch/suncet">https://github.com/facebookresearch/suncet</a></p><p>[2] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” <em>International conference on machine learning</em>. PMLR, 2020. <a href="https://arxiv.org/pdf/2002.05709.pdf">https://arxiv.org/pdf/2002.05709.pdf</a></p><p>[3] Assran, Mahmoud, et al. “Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples.” <em>arXiv preprint arXiv:2104.13963</em> (2021). <a href="https://arxiv.org/pdf/2104.13963.pdf">https://arxiv.org/pdf/2104.13963.pdf</a></p><hr><p><a href="https://medium.com/data-science/paper-explained-semi-supervised-learning-of-visual-features-by-non-parametrically-predicting-view-1fcbf91517a0">Paper explained: Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View…</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Well Does Self-Supervised Learning Perform In The Real World?]]></title>
            <link>https://medium.com/data-science/how-well-does-self-supervised-learning-perform-in-the-real-world-ece18b2d45f6?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/ece18b2d45f6</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[research]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Wed, 15 Dec 2021 15:36:28 GMT</pubDate>
            <atom:updated>2021-12-15T15:45:36.653Z</atom:updated>
            <content:encoded><![CDATA[<h4>Pre-training a model on random internet images instead of ImageNet</h4><p>If you have been reading recent publications on self-supervised pre-training, you might have noticed that all of the novel methods and techniques were mostly evaluated on ImageNet. The ImageNet dataset is highly diverse, large and contains an enormous number of classes. It has been curated specifically to evaluate the performance of image processing models, so it is unquestionably well suited for this task. <strong>But relatively little emphasis has been placed on how these self-supervised techniques perform on other image datasets</strong>: datasets that are uncurated and contain large amounts of random images. In their paper <em>“Self-supervised Pretraining of Visual Features in the Wild”</em>, Goyal et al. set out to investigate <strong>whether the perceived performances of self-supervised pre-training techniques hold true when trained on a set of random, uncurated images</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-qx84N7ropPj7BQwMtf3cQ.png" /><figcaption>This graph shows SEER, the main model of the paper, outperforming all other architectures. Source: <a href="https://arxiv.org/pdf/2103.01988.pdf">[1]</a></figcaption></figure><h3>Pre-requisites</h3><p>The SEER model introduced in the paper combines multiple recent advances in computer vision.</p><p>First, it takes advantage of a novel and scalable architecture called <strong>RegNet</strong>. A RegNet is defined by a <strong>quantized linear function to form a network of multiple blocks with optimal widths and depths</strong>. RegNet has two variants: RegNetX, which uses the residual block from the classic ResNet, and RegNetY, which takes advantage of squeeze-and-excite blocks. 
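</p><p>The “quantized linear function” can be sketched in a few lines of numpy: a linear ramp of widths is snapped to powers of a multiplier and rounded to whole channel counts. The parameter values below are illustrative choices, not a specific published RegNet variant:</p>

```python
import numpy as np

def regnet_widths(w0, wa, wm, depth, q=8):
    """Quantized linear widths from the RegNet design space:
    u_j = w0 + wa * j is snapped to the nearest power of wm,
    then rounded to a multiple of q channels."""
    j = np.arange(depth)
    u = w0 + wa * j                            # linear block widths
    s = np.round(np.log(u / w0) / np.log(wm))  # nearest power of wm
    w = w0 * np.power(wm, s)                   # quantized widths
    return (np.round(w / q) * q).astype(int)

# Illustrative parameters: initial width 24, slope 36, multiplier 2.5
widths = regnet_widths(w0=24, wa=36.0, wm=2.5, depth=16)
```

<p>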
I’ve written an entire article about the RegNet architecture, feel free to read it <a href="https://towardsdatascience.com/regnet-the-most-flexible-network-architecture-for-computer-vision-2fd757f9c5cd">here</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7PGJzWFw-A50MUCULP-dwA.png" /><figcaption>An illustration of a RegNet in all its parts. It is composed of a stem, body and a head, as displayed by (a). Illustrations (b) and (c) show a more detailed view of the stages and blocks. Source: <a href="https://arxiv.org/pdf/2003.13678.pdf">[2]</a></figcaption></figure><p>Another important component of the SEER paper is a self-supervised pre-training technique called <strong>SwAV</strong>. This technique is used for the SEER model and to compare against. <strong>SwAV uses data augmentations to form multiple different versions of the same image</strong>. These are then passed through a convolutional neural network to create a latent representation.<strong> This vector is then learned to be assigned to one of <em>K</em> prototype vectors</strong> by formulating a <strong>swapped prediction problem</strong>. If you would like to refresh your knowledge on SwAV, feel free to read my story on the paper <a href="https://medium.com/towards-data-science/paper-explained-unsupervised-learning-of-visual-features-by-contrasting-cluster-assignments-f9e87db3cb9b">here</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cTlnpvwz2LbdR_NwgPkCwA.png" /><figcaption>An illustration of the SwAV training process. Source: <a href="https://arxiv.org/pdf/2006.09882.pdf">[3]</a></figcaption></figure><p>And last, the SEER paper compares its performance against <strong>SimCLR</strong>, another technique for self-supervised pre-training. SimCLR, just like SwAV, uses <strong>data augmentations to form pairs of augmented versions of the same image</strong>. These are then passed into a convolutional neural network to form a feature vector. 
This vector then goes into an MLP to form the final network output. SimCLR uses a novel loss function called <strong>NT-Xent</strong>, which seeks to attract different representations of the same object. Again, if you would like to dive deeper into SimCLR, I have an article on the paper which you can read <a href="https://medium.com/towards-data-science/paper-explained-a-simple-framework-for-contrastive-learning-of-visual-representations-6a2a63bfa703">here</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RFILaYrPribmV3rP-lwA2w.png" /><figcaption>A figure showing the training process of SimCLR. Source: <a href="https://arxiv.org/pdf/2002.05709.pdf">[4]</a></figcaption></figure><h3>Developing a model that can benefit from large uncurated image datasets</h3><p>Now on to the main contribution of the paper. As mentioned before, <strong>a key goal for this paper was to find out how a large, uncurated image dataset would impact the performance of the self-supervised method</strong>. Also, the authors aimed to develop a method for outperforming other current state-of-the-art techniques.</p><p>To achieve this, <strong>they used the SwAV technique</strong> to define the pre-training process. More specifically, they created multi-crop views of each image, two crops at 224 x 224 resolution and four at 96 x 96, with many different data augmentations applied before passing them into the model. They also defined SwAV to have 16K prototype vectors, an important hyperparameter to set for this technique.</p><p>For the model architecture, they opted for the before-mentioned <strong>RegNet</strong>. Specifically, they experiment with a range of networks, namely <strong>RegNetY-{8, 16, 32, 64, 128, 256}GF</strong>, that use the <strong>squeeze-and-excite blocks</strong> mentioned earlier. This range of specifications is only enabled by the great flexibility of the RegNet architecture. 
On top of this RegNet, they defined a 3-layer MLP projection head to form an output vector of 256-D.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wq-V9-g5FAft60QJT7qUQg.png" /><figcaption>“SEER combines a recent architecture family, RegNet, with an online self-supervised training to scale pretraining to billion parameters on billions of random images.”. Source: <a href="https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision/">[5]</a></figcaption></figure><p>The entire SEER model (SwAV with RegNet) was trained on <strong>multiple different datasets</strong>, which we will get to in the “Results” section; <strong>the most notable one is one billion uncurated images from Instagram</strong>. To train the model, the authors used a stunning <strong>512 NVIDIA V100 32GB GPUs</strong> and trained for 122K iterations. Now, let’s see how the SEER model measures against other techniques and for different datasets.</p><h3>Results</h3><p>There’s a lot to unpack here. Let’s start with the classical evaluation for a self-supervised learning model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wDfnO5m0BlmWIi7BGGF9iA.png" /><figcaption>A table of results for different self-supervised pre-training methods when pre-trained on different datasets and finetuned on ImageNet. Source: <a href="https://arxiv.org/pdf/2103.01988.pdf">[1]</a></figcaption></figure><p>As part of the first experiment, <strong>SEER was pre-trained on one billion random images from Instagram and then fine-tuned on ImageNet</strong>. Incredibly,<strong> SEER is able to outperform all other methods in ImageNet top-1 accuracy</strong>. Notably, it can outperform the original SwAV paper even though it uses its self-supervised pre-training technique, just with a different network architecture. Also, <strong>it outperforms the SimCLRv2 model, which has an increased parameter size over its predecessor</strong>. 
There also seems to be a correlation between top-1 accuracy and parameter count: the larger the model, the better it performs. <strong>It is also interesting that SEER outperforms all other methods even though it is the only method pre-trained on random images</strong>. SimCLRv2 was even pre-trained on ImageNet, which was later used for evaluation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*noWXeCq0feX5ele8874_Uw.png" /><figcaption>A table showing results on a low-shot learning scenario. Source: <a href="https://arxiv.org/pdf/2103.01988.pdf">[1]</a></figcaption></figure><p>The authors also defined a so-called low-shot learning scenario, i.e., following the pre-training, <strong>the model was only fine-tuned using 1% or 10% of the ImageNet dataset</strong> (compared to 100% for the first evaluation). While SimCLRv2 seems to be the best performing model, being pre-trained on ImageNet, <strong>SEER is able to almost match its performance despite not having seen any images from ImageNet before (pre-training on random images)</strong>. This again shows that SEER is able to learn enough about the visual world it has seen during pre-training to transfer its knowledge well enough to the ImageNet classification task.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/964/1*h51r1IbF2h2TvI9krMdeNQ.png" /><figcaption>A graph showing the RegNetY classification accuracy plotted against the number of parameters in the model. Source: <a href="https://arxiv.org/pdf/2103.01988.pdf">[1]</a></figcaption></figure><p>Another highly relevant finding from the paper is that, <strong>as the number of parameters in the RegNet increases, the advantage of a pre-trained model versus a RegNet trained from scratch increases dramatically</strong>. 
In other words, if you are training a very large model, it is more likely to benefit from (self-supervised) pre-training compared to a smaller model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/964/1*AMUdvX9zP2I2iVIFcEa4aw.png" /><figcaption>A table showing the performance of pre-training with SEER for downstream tasks such as object detection and semantic segmentation. Source: <a href="https://arxiv.org/pdf/2103.01988.pdf">[1]</a></figcaption></figure><p>And last but not least, let’s have a look at <strong>SEER’s impact on downstream tasks</strong>. The authors also trained a <strong>Mask R-CNN with a pre-trained RegNet backbone on the MS COCO dataset for object detection and semantic segmentation</strong>. They show that, in comparison to training the model from scratch with labels, <strong>a model using the SEER RegNet backbone</strong> that was pre-trained on random internet images <strong>leads to increased performance for both downstream tasks</strong>.</p><h3>Wrapping it up</h3><p>In this article, you have learned about SEER and how self-supervised pre-training can be effective even when it is not used in combination with a curated dataset. The implications of this are quite profound: <strong>We could be one step closer to completely unsupervised training of image models</strong>. While I hope this story gave you a good first insight into the paper, there is still so much more to discover, especially in terms of the results and the ablation studies. Therefore, I would encourage you to read the paper yourself, even if you are new to the field. You’ll have to start somewhere ;)</p><p>If you are interested in more details on the method presented in the paper, feel free to drop me a message on Twitter, my account is linked on my Medium profile.</p><p>I hope you’ve enjoyed this paper explanation. 
If you have any comments on the article or if you see any errors, feel free to leave a comment.</p><p><strong>And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>. I try to post a story once a week and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] Goyal, Priya, et al. “Self-supervised pretraining of visual features in the wild.” <em>arXiv preprint arXiv:2103.01988</em> (2021). <a href="https://arxiv.org/pdf/2103.01988.pdf">https://arxiv.org/pdf/2103.01988.pdf</a></p><p>[2] Radosavovic, Ilija, et al. “Designing network design spaces.” <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>. 2020. <a href="https://arxiv.org/pdf/2003.13678.pdf">https://arxiv.org/pdf/2003.13678.pdf</a></p><p>[3] Caron, Mathilde, et al. “Unsupervised learning of visual features by contrasting cluster assignments.” <em>arXiv preprint arXiv:2006.09882</em> (2020). <a href="https://arxiv.org/pdf/2006.09882.pdf">https://arxiv.org/pdf/2006.09882.pdf</a></p><p>[4] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” <em>International conference on machine learning</em>. PMLR, 2020. <a href="https://arxiv.org/pdf/2002.05709.pdf">https://arxiv.org/pdf/2002.05709.pdf</a></p><p>[5] Facebook AI Research Blog Post: SEER: The start of a more powerful, flexible, and accessible era for computer vision. 
<a href="https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision/">https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision/</a></p><hr><p><a href="https://medium.com/data-science/how-well-does-self-supervised-learning-perform-in-the-real-world-ece18b2d45f6">How Well Does Self-Supervised Learning Perform In The Real World?</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[RegNet: The Most Flexible Network Architecture For Computer Vision]]></title>
            <link>https://medium.com/data-science/regnet-the-most-flexible-network-architecture-for-computer-vision-2fd757f9c5cd?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/2fd757f9c5cd</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[research]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Mon, 13 Dec 2021 08:43:35 GMT</pubDate>
            <atom:updated>2021-12-13T13:30:36.457Z</atom:updated>
<content:encoded><![CDATA[<h4>A model design that scales for high-efficiency or high-accuracy</h4><p>Traditionally, convolutional neural network architectures have been designed and optimized for one specific purpose. For example, the ResNet model family was optimized for the highest accuracy on ImageNet at the time of its initial release. MobileNets, as the name suggests, are optimized to run on mobile devices. And lastly, EfficientNet was designed to be highly efficient for visual recognition tasks.</p><p>In their paper “Designing Network Design Spaces”, Radosavovic et al. decided to set a very unusual but highly interesting goal: They set out to explore and design a highly flexible network architecture. One that can be adapted to be highly efficient or run on mobile devices, but also be highly accurate when adapted for the best classification performance. This adaptation is supposed to be <strong>controlled by setting the right parameters in a quantized linear function</strong> (a set of formulas with specific parameters) to <strong>determine the width and depth of the network</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-tWyCOuvO5dFeVVGeasckw.png" /><figcaption>A visualization of network design spaces. The space is constantly optimized to arrive at a smaller design space with the best models. Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>The approach they took was also very non-traditional: Instead of hand-crafting the model architecture, they set up what they call <strong>Network Design Spaces</strong>.</p><h3><strong>Deriving the RegNet model from network design spaces</strong></h3><p>If you are only here to see a description of the RegNet model, feel free to skip this part. 
I would still recommend reading the full article, though, since it is difficult to understand RegNet without an understanding of network design spaces. <strong>In the end, RegNet is actually not an architecture, but a network design space</strong>.</p><p>A <strong>network design space</strong> is not only, as the name might suggest, made up of different model architectures, but of <strong>different parameters that define a space of possible model architectures</strong>. This is different from neural architecture search, where all you do is try different architectures and search for the most suitable one. Such parameters can be the width, depth, groups, etc. of the network. RegNet also uses only one type of network block out of the many already available, namely the bottleneck block.</p><p>To arrive at the final RegNet design space, the authors first define a space of all possible models which they call <strong>AnyNet.</strong> This space creates all kinds of models from combinations of the different parameters. All of these models are trained and evaluated on the ImageNet dataset using a consistent training regime (epochs, optimizer, weight decay, learning rate scheduler).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*muYLEc49TIeSYPnJmlMpmg.png" /><figcaption>Statistics for the AnyNet design space, W4 stands for the width of the network at stage 4. 
Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>From this AnyNet space, <strong>they create progressively simplified versions of the initial AnyNet design space</strong> by analyzing what parameters are responsible for the good performance of the best models in the AnyNet design space.<strong> Basically, they are experimenting with the importance of different parameters to narrow down the design space to only the good models.</strong></p><p>These improvements from a current design space to a narrower design space include setting a shared bottleneck ratio and a shared group width, and parameterizing the width and depth to increase in later stages.</p><p>Finally, they arrive at the optimized <strong>RegNet design space </strong>that contains only good models and also the quantized linear function necessary to define the models!</p><h3>The RegNet design space</h3><p>The network is composed of multiple stages consisting of multiple blocks, forming a stem (start), body (main part) and head (end).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7PGJzWFw-A50MUCULP-dwA.png" /><figcaption>The RegNet is composed of a stem, body and head. Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>Inside the body, multiple stages are defined, and each of the stages is composed of multiple blocks. As mentioned before, there is only one type of block used in the RegNet, which is the standard residual bottleneck block with group convolution.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S2dQ6kYfshd5r7P-wBiFNg.png" /><figcaption>The residual bottleneck block with group convolution. On the right, a stride of 2 is applied. 
Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>As mentioned above, the design of the RegNet model is not defined by fixed parameters such as depth and width, but rather a <strong>quantized linear function controlled with the chosen parameters. </strong>Following the optimizations, the <strong>block widths</strong> are calculated as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*j5Bd4FvPxY0FBc9C4i3Crw.png" /><figcaption>uj is the block width at stage j, w0 the initial width and wa the slope parameter. j starts at 0 and ends at the depth of the network. Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>It is notable that the width increases by <em>wa</em> with each additional block.<br>The authors then introduce an additional parameter <em>wm</em> (this can be set by you) and calculate <em>sj</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cFhbA0431Tqq58nY6koBbA.png" /><figcaption>The formula necessary to calculate sj with the introduction of <em>wm</em>. Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>Finally, to quantize <em>uj</em>, the authors round <em>sj</em> and compute the quantized per-block widths:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jL9Y-BV_vWn3c9EKBEFdSA.png" /><figcaption>Per-block widths computed using the initial width and wm to the power of sj. 
Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>Now that the per-block width has been computed, let’s move to the stage level. To arrive at the width for each stage <em>i</em>, all blocks with the same width are simply counted to form one stage, since all blocks in one stage should be of the same width.</p><p>To now create a RegNet out of the RegNet design space, the parameters <em>d</em> (depth), <em>w0</em> (initial width), <em>wa</em> (slope), <em>wm</em> (width parameter), <em>b</em> (bottleneck) and <em>g</em> (group) have to be set.</p><p>The authors now set these parameters differently to obtain different RegNets with different properties:</p><ul><li>A RegNet optimized for mobile use</li><li>An efficient RegNet</li><li>A highly-accurate RegNet</li></ul><p>Let’s see how well these networks perform compared to other architectures.</p><h3>Results</h3><p>First, let’s examine RegNet’s mobile performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OcUjyC37voJ_jXkE2UI_lg.png" /><figcaption>RegNets compared to other architectures. The X or Y stand for residual block or squeeze-and-excite block, the rest represents the FLOPS required for the network. Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>At the same number of FLOPS required, both RegNets outperform the other mobile-optimized nets or show similar performance. But it does not stop there.</p><p>As mentioned in the introduction, <strong>the RegNet was designed to be highly flexible</strong>. 
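</p><p>As a side note, the quantized linear parameterization described earlier can be sketched in a few lines. The following is an illustrative Python sketch of my own with made-up parameter values, not the reference implementation; rounding widths to a multiple of 8 is my assumption, chosen to keep the widths group-convolution-friendly:</p>

```python
import numpy as np

def regnet_widths(d, w0, wa, wm, q=8):
    """Per-block widths from the quantized linear function:
    u_j = w0 + wa * j, s_j = round(log(u_j / w0) / log(wm)), w_j = w0 * wm**s_j."""
    j = np.arange(d)
    u = w0 + wa * j                            # linear widths u_j
    s = np.round(np.log(u / w0) / np.log(wm))  # quantized exponents s_j
    w = w0 * np.power(wm, s)                   # quantized per-block widths w_j
    return (np.round(w / q) * q).astype(int)   # round to a multiple of q (assumption)

def to_stages(widths):
    """Merge consecutive blocks of equal width into stages: (width, depth) pairs."""
    stages = []
    for w in widths:
        if stages and stages[-1][0] == w:
            stages[-1][1] += 1
        else:
            stages.append([int(w), 1])
    return [tuple(s) for s in stages]

# Illustrative parameters (not from the paper): depth 13, w0=24, slope 36, wm=2.5
widths = regnet_widths(d=13, w0=24, wa=36.0, wm=2.5)
print(to_stages(widths))  # a few stages of non-decreasing width
```

Note how grouping equal widths naturally yields a small number of stages with increasing width, exactly the structure the design space converged to.
<p>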
This is shown perfectly in the next two evaluations.</p><p>First, RegNet’s efficiency-focused configuration versus the EfficientNet architecture.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pKBw2ox6UT86t1tyn5lViQ.png" /><figcaption>RegNet versus EfficientNet. Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>Impressively, for all comparisons, RegNet has an advantage. <strong>Either it has similar accuracy at much higher training and inference speeds, or it is more accurate AND faster, especially towards the lower end. </strong>Additionally, the authors claim that <strong>RegNetX-F8000 is about <em>5</em>× <em>faster </em>than EfficientNet-B5</strong>. This is an incredible leap!</p><p>When RegNet is configured for high accuracy, the results also look good.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sZ19tRvffjJ-k1KSNAAYhA.png" /><figcaption>RegNet compared to ResNet and ResNe(X)t. Source: <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">[1]</a></figcaption></figure><p>This again shows the flexibility of RegNet: <strong>The model can be specified to be highly efficient and fast or highly accurate.</strong> This has not been possible in a single architecture before.</p><h3>Wrapping it up</h3><p>In this article, you have learned about RegNet, a model design space that is highly flexible and takes a very different approach. RegNet is not a single architecture; it is a design space defined by a quantized linear function. While I hope this story gave you a good first insight into the paper, there is still so much more to discover. Therefore, I would encourage you to read the paper yourself, even if you are new to the field. 
You’ll have to start somewhere ;)</p><p>If you are interested in more details on the method presented in the paper, feel free to drop me a message on Twitter, my account is linked on my Medium profile.</p><p>I hope you’ve enjoyed this paper explanation. If you have any comments on the article or if you see any errors, feel free to leave a comment.</p><p><strong>And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>. I try to post a story once a week and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] Radosavovic, Ilija, et al. “Designing network design spaces.” <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>. 2020. <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf">https://openaccess.thecvf.com/content_CVPR_2020/papers/Radosavovic_Designing_Network_Design_Spaces_CVPR_2020_paper.pdf</a></p><hr><p><a href="https://medium.com/data-science/regnet-the-most-flexible-network-architecture-for-computer-vision-2fd757f9c5cd">RegNet: The Most Flexible Network Architecture For Computer Vision</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Paper explained: A Simple Framework for Contrastive Learning of Visual Representations]]></title>
            <link>https://medium.com/data-science/paper-explained-a-simple-framework-for-contrastive-learning-of-visual-representations-6a2a63bfa703?source=rss-fa7caddf2d75------2</link>
            <guid isPermaLink="false">https://medium.com/p/6a2a63bfa703</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[research]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Leon Sick]]></dc:creator>
            <pubDate>Thu, 09 Dec 2021 10:33:39 GMT</pubDate>
            <atom:updated>2021-12-09T14:03:32.443Z</atom:updated>
<content:encoded><![CDATA[<h4>Going over the ideas presented in the SimCLR paper</h4><p>In this story, we will take a look at SimCLR: The architecture that led the computer vision research community to new heights in self-supervised pre-training for vision tasks.</p><p>SimCLR was presented in the paper <a href="https://arxiv.org/pdf/2002.05709.pdf"><em>“A Simple Framework for Contrastive Learning of Visual Representations”</em></a> by Chen et al. from Google Research in 2020. The ideas in this paper are relatively simple and intuitive, but there is also a novel loss function that is key to achieving great performance in self-supervised pre-training. I’ve tried to keep the article simple so that even readers with little prior knowledge can follow along. Without further ado, let’s dive in!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NjdVYtL4C2HmV1r22XIweg.png" /><figcaption>An illustration of the SimCLR training procedure. Source: <a href="https://github.com/google-research/simclr">[1]</a></figcaption></figure><h3>Pre-requisite: Self-supervised pre-training for computer vision</h3><p>Before we go deeper into the SimCLR paper, it’s worth quickly re-visiting what self-supervised pre-training is all about. <strong>If you have been reading other self-supervised learning stories from me or you are familiar with self-supervised pre-training, feel free to skip this part.</strong></p><p>Traditionally, computer vision models have always been trained using <strong>supervised learning</strong>. That means humans looked at the images and created all sorts of <strong>labels</strong> for them, so that the model could learn the patterns of those labels. For example, a human annotator would assign a class label to an image or draw bounding boxes around objects in the image. 
But as anyone who has ever been in contact with labeling tasks knows, the effort to create a sufficient training dataset is high.</p><p>In contrast, <strong>self-supervised learning does not require any human-created labels</strong>. As the name suggests, <strong>the model learns to supervise itself</strong>. In computer vision, the most common way to model this self-supervision is to take different crops of an image or apply different augmentations to it and to pass the modified inputs through the model. Even though the modified images do not look the same, <strong>we let the model learn that these images still contain the same visual information</strong>, i.e., the same object. <strong>This leads to the model learning a similar latent representation (an output vector) for the same objects.</strong></p><p>We can later apply transfer learning on this pre-trained model. Usually, these models are then trained on 10% of the data with labels to perform downstream tasks such as object detection and semantic segmentation.</p><h3>Learning image similarity with SimCLR</h3><p>A key contribution by the paper is the use of <strong>data augmentations</strong>. SimCLR creates pairs of images to learn the similarity from. If we input the same image twice, there would be no learning effect. Therefore, each pair of images is created by applying <strong>augmentations or transformations to the image.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Um_2B7Nh4Pq-bykxqfGKJQ.png" /><figcaption>Different data augmentations applied to an image of a dog. Source: <a href="https://arxiv.org/pdf/2002.05709.pdf">[2]</a></figcaption></figure><p>As can be seen in this excerpt from the paper, the authors apply different augmentations such as resizing, color distortion, blurring, noising and much more. They also take crops from different parts of the image, which is important for the model to learn a consistent representation. 
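</p><p>To make the idea of a positive pair concrete, here is a toy sketch of my own (not the authors’ code) in which two independently augmented views of the same image are produced by a stochastic transform. Only random crop and horizontal flip are used here; the paper’s actual augmentation policy is much richer:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def random_view(img, crop=24):
    """Toy augmentation: random crop plus random horizontal flip.
    (A stand-in for the richer policy in the paper.)"""
    h, w = img.shape[:2]
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    view = img[y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        view = view[:, ::-1]  # horizontal flip
    return view

def positive_pair(img):
    """Two independently augmented views of the same image form a positive pair."""
    return random_view(img), random_view(img)

img = rng.random((32, 32, 3))      # a dummy 32x32 RGB image
v1, v2 = positive_pair(img)        # same underlying image, different views
```

Every image in a batch is expanded into such a pair, and all views from other images serve as negatives.
<p>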
The image can be cropped into a global and local view (full image and cropped part of the image) or adjacent views can be used (two crops from different parts of the image). Each pair is formulated as a <strong>positive pair, </strong>i.e., both augmented images contain the same object.</p><p>Next, these pairs are passed into a <strong>convolutional neural network </strong>to create a feature representation for each of the images. In the paper, the authors opted to use the popular <strong>ResNet architecture</strong> for their experiments. The pairs of images are always fed to the model in batches. A special emphasis is put on the size of the batch, which the authors vary from 256 to 8192. To this batch, the data augmentations are applied, leading to the batch doubling in size, so from 512 to 16384 input images.</p><p>Once the vector representation for an input image is computed by the ResNet, this output is passed to a <strong>projection head </strong>for further processing. In the paper, this projection head is an <strong>MLP </strong>(Multi Layer Perceptron) with <strong>one hidden layer</strong>. This MLP is only used during training and further refines the feature representation of the input images.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/906/1*DjfAT0CBvObo49ckZzdiXg.png" /><figcaption>The SimCLR training process from the raw input image to the representation computed by the MLP. Source: <a href="https://github.com/google-research/simclr">[1]</a></figcaption></figure><p>Once the MLP computation is completed, the result serves as the input to the loss function. <strong>The learning goal of SimCLR is to maximize agreement between different augmentations of the same image</strong>. That means the model tries to <strong>minimize the distance between images that contain the same object</strong> and<strong> maximize the distance between images that contain vastly different objects</strong>. 
This mechanism is also called <strong>contrastive learning</strong>.</p><p>One major contribution of the SimCLR paper is the formulation of its <strong>NT-Xent loss</strong>. NT-Xent stands for normalized temperature-scaled cross entropy loss. This novel loss function has a property that is especially desirable: <strong>Different examples are weighted effectively</strong>, allowing the model to <strong>learn much more from vector representations that are far away from each other</strong> even though they originate from the same image. These examples the model perceives to be very different from each other are called <strong>hard negatives</strong>.</p><p>This loss effectively achieves an attraction of similar images, i.e., similar images are learned to be mapped closer together.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RFILaYrPribmV3rP-lwA2w.png" /><figcaption>Similar images are attracted to each other. Source: <a href="https://github.com/google-research/simclr">[1]</a></figcaption></figure><h3>Results</h3><p>Once the network is fully trained, the MLP projection head is discarded and only the convolutional neural network is used for evaluation. In their paper, the authors performed different evaluations:</p><p>First, they measure the performance of SimCLR as a linear classifier on the ImageNet dataset. Their results show <strong>SimCLR outperforming all other self-supervised methods</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_C0on6-VkZQ8ZzPDoaapvQ.png" /><figcaption>Results of SimCLR and other self-supervised methods on ImageNet linear classification. Source: <a href="https://arxiv.org/pdf/2002.05709.pdf">[2]</a></figcaption></figure><p>Please bear in mind that these results are not up-to-date anymore, since novel methods with better performance have come along. 
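</p><p>As an aside, the NT-Xent objective described earlier can be written down compactly. The following is a simplified NumPy sketch of my own, not the reference implementation; rows 2k and 2k+1 of the embedding matrix are assumed to be the two augmented views of image k:</p>

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """Simplified NT-Xent loss. z: (2N, d) embeddings; rows 2k and 2k+1
    are the two augmented views of image k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = (z @ z.T) / tau                              # temperature-scaled similarity matrix
    np.fill_diagonal(sim, -np.inf)                     # a view is never its own negative
    m = sim.max(axis=1, keepdims=True)                 # stabilized log-sum-exp
    log_denom = m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    log_softmax = sim - log_denom
    pos = np.arange(len(z)) ^ 1                        # partner index: 0<->1, 2<->3, ...
    return -log_softmax[np.arange(len(z)), pos].mean()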
Feel free to read my other articles where I go over other self-supervised pre-training models.</p><p>Second, they evaluated the <strong>performance of SimCLR on different image datasets versus training the same ResNet with labels</strong>, i.e., with supervised learning. Again, <strong>SimCLR performs very well, beating the supervised training method on many datasets</strong>. In the same table, they also looked at results from fine-tuning the self-supervised model with labeled data. In this row, they show that <strong>SimCLR outperforms the supervised training approach on almost all datasets</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hoHoiiNkCnf3SfJlN8T4DQ.png" /><figcaption>Evaluation of SimCLR vs. supervised training of the ResNet. Source: <a href="https://arxiv.org/pdf/2002.05709.pdf">[2]</a></figcaption></figure><h3>Wrapping it up</h3><p>In this article, you have learned about SimCLR, one of the most popular self-supervised frameworks, with a simple concept and promising results. SimCLR has been continuously improved and there is even a second version of this architecture. While I hope this story gave you a good first insight into the paper, there is still so much more to discover. Therefore, I would encourage you to read the paper yourself, even if you are new to the field. You’ll have to start somewhere ;)</p><p>If you are interested in more details on the method presented in the paper, feel free to drop me a message on Twitter, my account is linked on my Medium profile.</p><p>I hope you’ve enjoyed this paper explanation. If you have any comments on the article or if you see any errors, feel free to leave a comment.</p><p><strong>And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine</strong>. 
I try to post a story once a week and keep you and anyone else interested up-to-date on what’s new in computer vision research!</p><p>References:</p><p>[1] SimCLR GitHub Implementation: <a href="https://github.com/google-research/simclr">https://github.com/google-research/simclr</a></p><p>[2] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” <em>International conference on machine learning</em>. PMLR, 2020. <a href="https://arxiv.org/pdf/2002.05709.pdf">https://arxiv.org/pdf/2002.05709.pdf</a></p><hr><p><a href="https://medium.com/data-science/paper-explained-a-simple-framework-for-contrastive-learning-of-visual-representations-6a2a63bfa703">Paper explained: A Simple Framework for Contrastive Learning of Visual Representations</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>