<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Divam Gupta on Medium]]></title>
        <description><![CDATA[Stories by Divam Gupta on Medium]]></description>
        <link>https://medium.com/@divamgupta?source=rss-d955ffaf8a54------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/2*0pLM5IT4vVacO0FweXDxFQ.png</url>
            <title>Stories by Divam Gupta on Medium</title>
            <link>https://medium.com/@divamgupta?source=rss-d955ffaf8a54------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 06 Apr 2026 06:03:53 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@divamgupta/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[An Introduction to Pseudo-semi-supervised Learning for Unsupervised Clustering]]></title>
            <link>https://medium.com/data-science/an-introduction-to-pseudo-semi-supervised-learning-for-unsupervised-clustering-fb6c31885923?source=rss-d955ffaf8a54------2</link>
            <guid isPermaLink="false">https://medium.com/p/fb6c31885923</guid>
            <dc:creator><![CDATA[Divam Gupta]]></dc:creator>
            <pubDate>Sat, 31 Oct 2020 00:00:00 GMT</pubDate>
            <atom:updated>2020-11-15T17:59:51.823Z</atom:updated>
            <content:encoded><![CDATA[<p>This post gives an overview of our deep learning based technique for performing unsupervised clustering by leveraging semi-supervised models. An unlabeled dataset is taken and a subset of the dataset is labeled using pseudo-labels generated in a completely unsupervised way. The pseudo-labeled dataset combined with the complete unlabeled data is used to train a semi-supervised model.</p><p>This is a re-post of the original post: <a href="https://divamgupta.com/unsupervised-learning/2020/10/31/pseudo-semi-supervised-learning-for-unsupervised-clustering.html">https://divamgupta.com/unsupervised-learning/2020/10/31/pseudo-semi-supervised-learning-for-unsupervised-clustering.html</a></p><p>This work was published at ICLR 2020; the paper can be found <a href="https://openreview.net/pdf?id=rJlnxkSYPS">here</a> and the source code <a href="https://github.com/divamgupta/deep_clustering_kingdra">here</a>.</p><h3>Introduction</h3><p>In the past 5 years, several methods have shown tremendous success in semi-supervised classification. These models work very well when they are given a large amount of unlabeled data along with a small amount of labeled data.</p><p>The unlabeled data helps the model to discover new patterns in the dataset and learn high-level information. The labeled data helps the model to classify the data-points using the learned information. For example, Ladder Networks can yield 98% test accuracy with just 100 labeled data-points and the rest unlabeled.</p><p>In order to use a semi-supervised classification model for completely unsupervised clustering, we need to somehow generate a small number of labeled samples in a purely unsupervised way. These automatically generated labels are called pseudo-labels.</p><p>It is very important that the pseudo-labels used to train the semi-supervised model are of good quality. The classification performance drops if there is a large amount of noise in the labels. 
Hence, we are fine with having fewer pseudo-labeled data points, as long as the noise in the pseudo-labels is low.</p><p>A straightforward way to do this is the following:</p><ol><li>Start with an unlabeled dataset.</li><li>Take a subset of the dataset and generate pseudo-labels for it, while ensuring the pseudo-labels are of good quality.</li><li>Train a semi-supervised model by feeding it the complete unlabeled dataset combined with the small pseudo-labeled dataset.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/0*cXRiXuUPs8c6R3Sf" /><figcaption>Image by Author</figcaption></figure><blockquote><em>This approach uses some elements of semi-supervised learning but no actual labeled data-points are used. Hence, we call this approach pseudo-semi-supervised learning.</em></blockquote><h3>Generating pseudo-labels</h3><p>Generating high-quality pseudo-labels is the trickiest and the most important step for getting good overall clustering performance.</p><p>The naive ways to generate a pseudo-labeled dataset are:</p><ol><li>Run a standard clustering model on the entire dataset and make the pseudo-labels equal to the cluster IDs from the model.</li><li>Run a standard clustering model with many more clusters than needed. 
Then only keep a few clusters to label the corresponding data-points while discarding the rest.</li><li>Run a standard clustering model and only keep the data-points for which the model’s confidence is above a certain threshold.</li></ol><p>In practice, none of the ways described above work.</p><p>The first method is not useful because the pseudo-labels are just the clusters returned by the standard clustering model, hence we can’t expect the semi-supervised model to perform better than that.</p><p>The second way does not work because there is no good way to select distinct clusters.</p><p>The third way does not work because, in practice, the confidence of a single model is not a good indicator of label quality.</p><p>After experimenting with several ways to generate a pseudo-labeled dataset, we observed that the consensus of multiple unsupervised clustering models is generally a good indicator of quality. The clusters of the individual models are not perfect. But if a large number of clustering models assign a subset of data-points to the same cluster, then there is a good chance that they actually belong to the same class.</p><p>In the following illustration, the data points which are in the intersection of the cluster assignments of the two models could be assigned the same pseudo-label with high confidence. The rest can be ignored in the pseudo-labeled subset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/435/0*wYBi8lj_Fzh6RjBc" /><figcaption>Image by Author</figcaption></figure><h3>Using a graph to generate the pseudo-labels</h3><p>There is a more formal way to generate the pseudo-labeled dataset. 
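</p><p>As an illustrative sketch (the names and toy example here are my own, not the paper’s exact procedure), the consensus of several clustering models can be quantified with a pairwise agreement matrix:</p>

```python
import numpy as np

def agreement_matrix(assignments):
    """assignments: (num_models, num_points) array of cluster IDs.
    Returns a (num_points, num_points) matrix with the fraction of
    models that place each pair of points in the same cluster."""
    assignments = np.asarray(assignments)
    num_models, num_points = assignments.shape
    agree = np.zeros((num_points, num_points))
    for a in assignments:
        # Pairs with identical cluster IDs under this model agree.
        agree += (a[:, None] == a[None, :])
    return agree / num_models

# Two toy models: both put points 0 and 1 together, but disagree on point 2.
models = [[0, 0, 0, 1, 1],
          [2, 2, 5, 5, 5]]
A = agreement_matrix(models)
# A[0, 1] == 1.0 (both models agree), A[1, 2] == 0.5 (only one agrees)
```

<p>Pairs with agreement close to 1 would get a strong positive edge, and pairs with agreement close to 0 a strong negative edge.</p><p>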
We first construct a graph of all the data-points modeling the pairwise agreement of the models.</p><p>The graph contains two types of edges:</p><ol><li>Strong positive edge — when a large percentage of the models think that the two data-points should be in the same cluster.</li><li>Strong negative edge — when a large percentage of the models think that the two data-points should be in different clusters.</li></ol><p>It is possible that there is neither a strong positive edge nor a strong negative edge between two data-points. That means the confidence of the cluster assignments of those data-points is low.</p><p>After constructing the graph, we need to pick K mini-clusters such that data-points within a cluster are connected with strong positive edges and data-points of different clusters are connected with strong negative edges.</p><p>An example of the graph is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/949/0*Kj2xvKNHCx0TcnFl" /><figcaption><em>Example of a constructed graph. Strong positive edge — green, strong negative edge — red. Image by the Author</em></figcaption></figure><p>We first pick the node with the maximum number of strong positive edges. That node is circled in the example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/955/0*lILFwIHgTI9JpUlx" /><figcaption><em>The selected node is circled. 
</em>Image by Author</figcaption></figure><p>We then assign a pseudo-label to the neighbors connected to the selected node with strong positive edges:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/751/0*lQPRc6i9khv0w-ZG" /><figcaption>Image by Author</figcaption></figure><p>Nodes which are neither connected with a strong positive edge nor a strong negative edge are removed because we can’t assign any label with high confidence:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/958/0*JqjrudhFBJkwZIGr" /><figcaption>Image by Author</figcaption></figure><p>We then repeat the steps K more times to get K mini-clusters. All data-points in one mini-cluster are assigned the same pseudo-label:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/974/0*uGO0UF0WWEBRkmkg" /><figcaption><em>The final pseudo-labeled subset. </em>Image by Author</figcaption></figure><p>We can see that a lot of data-points will be discarded in this step, hence it’s ideal to send these pseudo-labeled data points to a semi-supervised learning model for the next step.</p><h3>Using pseudo-labels to train semi-supervised models</h3><p>Now we have a pruned pseudo-labeled dataset along with the complete unlabeled dataset which is used to train a semi-supervised classification network. The output of the network is a softmaxed vector which can be seen as the cluster assignment.</p><p>If the pseudo labels are of good quality, then this multi-stage training yields better clustering performance compared to the individual clustering models.</p><p>Rather than having separate clustering and semi-supervised models, we can have a single model that is capable of performing unsupervised clustering and semi-supervised classification. 
An easy way to do this is to have a common neural network architecture and apply both the clustering losses and the semi-supervised classification losses.</p><p>We decided to use a semi-supervised ladder network combined with an information maximization loss for clustering. You can read more about different deep learning clustering methods <a href="https://divamgupta.com/unsupervised-learning/2019/03/08/an-overview-of-deep-learning-based-clustering-techniques.html">here</a>.</p><h3>Putting everything together</h3><p>In the first stage, only the clustering loss is applied. After getting the pseudo-labels, both clustering and classification losses are applied to the model.</p><p>After the semi-supervised training, we can extract more pseudo-labeled data points using the updated models. This process of generating the pseudo-labels and semi-supervised training can be repeated multiple times.</p><p>The overall algorithm is as follows:</p><ol><li>Train multiple independent models using the clustering loss</li><li>Construct a graph modeling pairwise agreement of the models</li><li>Generate the pseudo-labeled data using the graph</li><li>Train each model using unlabeled + pseudo-labeled data by applying both clustering and classification loss</li><li>Repeat from step 2</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/715/0*LzJrYwVPcvnnIB6W" /><figcaption><em>Overview of the final system. </em>Image by Author</figcaption></figure><h3>Evaluation</h3><p>We want our clusters to be close to the ground truth labels. But because the model is trained in a completely unsupervised manner, there is no fixed mapping between the ground truth classes and the clusters. Hence, we first find the one-to-one mapping between the ground truth classes and the model clusters that has maximum overlap. Then we can apply standard metrics like accuracy to evaluate the clusters. 
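</p><p>As a minimal sketch of this evaluation (the helper and example below are my own; for a large number of clusters the Hungarian algorithm, e.g. scipy.optimize.linear_sum_assignment, is the usual way to find the best mapping):</p>

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred, k):
    """Search over all one-to-one mappings of the k cluster IDs to the
    k class IDs and return the accuracy of the best mapping
    (brute force, fine for small k)."""
    best = 0.0
    for perm in permutations(range(k)):
        correct = sum(1 for t, p in zip(y_true, y_pred) if perm[p] == t)
        best = max(best, correct / len(y_true))
    return best

# The cluster IDs are a relabeling of the classes, with one mistaken point:
acc = clustering_accuracy([0, 0, 1, 1, 2], [1, 1, 2, 2, 2], k=3)  # -> 0.8
```

<p>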
This is a very standard metric for the quantitative evaluation of clusters.</p><p>We can visualize the clusters by randomly sampling images from the final clusters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/491/0*0r96Ukc-9HHXStVb" /><figcaption><em>Visualizing the clusters of the MNIST dataset. Source: original paper.</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/525/0*x2tBBvrYywKOF1K9" /><figcaption><em>Visualizing the clusters of the CIFAR10 dataset. Source: original paper.</em></figcaption></figure><p>In this post, we discussed a deep learning based technique for performing unsupervised clustering by leveraging pseudo-semi-supervised models. This technique outperforms several other deep learning based clustering techniques. If you have any questions or want to suggest any changes feel free to contact me or write a comment below.</p><p><strong>Get the full source code from </strong><a href="https://github.com/divamgupta/deep-clustering-kingdra"><strong>here</strong></a></p><p><em>Originally published at </em><a href="https://divamgupta.com/unsupervised-learning/2020/10/31/pseudo-semi-supervised-learning-for-unsupervised-clustering.html"><em>https://divamgupta.com</em></a><em> on October 31, 2020.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fb6c31885923" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/an-introduction-to-pseudo-semi-supervised-learning-for-unsupervised-clustering-fb6c31885923">An Introduction to Pseudo-semi-supervised Learning for Unsupervised Clustering</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Virtual Adversarial Training]]></title>
            <link>https://medium.com/@divamgupta/an-introduction-to-virtual-adversarial-training-540e744cf227?source=rss-d955ffaf8a54------2</link>
            <guid isPermaLink="false">https://medium.com/p/540e744cf227</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[semi-supervised-learning]]></category>
            <category><![CDATA[unsupervised-learning]]></category>
            <dc:creator><![CDATA[Divam Gupta]]></dc:creator>
            <pubDate>Fri, 31 May 2019 00:00:00 GMT</pubDate>
            <atom:updated>2019-06-10T13:54:17.272Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GnmGyQXt5u9Nr6ZAfiXYlw.png" /></figure><p>Virtual Adversarial Training is an effective regularization technique which has given good results in supervised learning, semi-supervised learning, and unsupervised clustering.</p><p>This is a re-post of the original post: <a href="https://divamgupta.com/unsupervised-learning/semi-supervised-learning/2019/05/31/introduction-to-virtual-adversarial-training.html">https://divamgupta.com/unsupervised-learning/semi-supervised-learning/2019/05/31/introduction-to-virtual-adversarial-training.html</a></p><p><strong>Get the source code used in this post from </strong><a href="https://gist.github.com/divamgupta/c778c17459c1f162e789560d5e0b2f0b"><strong>here</strong></a></p><p>Virtual adversarial training has been used for:</p><ol><li>Improving supervised learning performance</li><li>Semi-supervised learning</li><li>Deep unsupervised clustering</li></ol><p>There are several regularization techniques which prevent overfitting and help the model generalize better to unseen examples. Regularization helps the model parameters to be less dependent on the training data. The two most commonly used regularization techniques are Dropout and L1/L2 regularization.</p><p>In L1/L2 regularization, we add a loss term which tries to reduce the L1 norm or the L2 norm of the weights matrix. Small weight values result in simpler models which are less prone to overfitting.</p><p>In Dropout, we randomly ignore some neurons while training. This makes the network more robust to noise and variation in the input.</p><p>Neither of these two techniques takes the input data distribution into account.</p><h3>Local distributional smoothness</h3><p>Local distributional smoothness (LDS) can be defined as the smoothness of the output distribution of the model, with respect to the input. 
We do not want the model to be sensitive to small perturbations in the inputs. In other words, small changes in the model input should not cause large changes in the model output.</p><p>In LDS regularization, smoothness of the model distribution is rewarded. It is invariant to the parameterization of the network and only depends on the model outputs. Having a smooth model distribution should help the model generalize better because the model would give similar outputs for unseen data-points which are close to data-points in the training set. Several studies show that making the model robust against small random perturbations is effective for regularization.</p><p>An easy way to apply LDS regularization is to generate artificial data-points by applying small random perturbations to real data-points. The model is then encouraged to have similar outputs for the real and perturbed data-points. Domain knowledge can also be used to make better perturbations. For example, if the inputs are images, various image augmentation techniques such as flipping, rotating, and color transformation can be used.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/0*qoXCNxruRnRKIopu" /><figcaption>Example of input data transformation</figcaption></figure><h3>Virtual adversarial training</h3><p>Virtual adversarial training is an effective technique for local distributional smoothness. Pairs of data-points are taken which are very close in the input space, but very far apart in the model output space. Then the model is trained to make their outputs close to each other. To do that, a given input is taken and a perturbation is found for which the model gives a very different output. 
Then the model is penalized for its sensitivity to that perturbation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/477/0*FfaGnuwgd8q9jKdT" /><figcaption>Step 1: Generate the adversarial image</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/477/0*RMoLJdrf3BUuEk_5" /><figcaption>Step 2: Minimize the KL divergence</figcaption></figure><p>The key steps for virtual adversarial training are:</p><ol><li>Begin with an input data point <em>x</em>.</li><li>Transform <em>x</em> by adding a small perturbation <em>r</em>; the transformed data point will be <em>T(x) = x + r</em>.</li><li>The perturbation <em>r</em> should be in the adversarial direction — the model output of the perturbed input T(x) should be different from the output of the non-perturbed input. In particular, the KL divergence between the two output distributions should be maximal, while keeping the L2 norm of <em>r</em> small. From all the perturbations <em>r</em>, let <em>r_vadv</em> be the one in the adversarial direction.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/630/0*Uo6TDd3ArEdGfiCg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/498/0*OwtvGrnVV__2BRUG" /></figure><p>After finding the adversarial perturbation and the transformed input, update the weights of the model such that the KL divergence is minimized. This makes the model robust towards different perturbations. The following loss is minimized via gradient descent:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/498/0*2oM8e_6bUjYMuV45" /></figure><p>During virtual adversarial training, the model becomes more robust against different input perturbations. As the model becomes more robust, it becomes harder to generate adversarial perturbations, and a drop in the loss is observed.</p><p>This method can be seen as similar to generative adversarial networks. 
But there are several differences:</p><ol><li>Rather than having a generator to fool the discriminator, a small perturbation is added to the input, in order to fool the model into thinking they are two vastly different inputs.</li><li>Rather than discriminating between fake and real, the KL divergence between the model outputs is used. While training the model (which is analogous to training the discriminator), we minimize the KL divergence.</li></ol><p>Virtual adversarial training can be thought of as an effective data augmentation technique where we do not need prior domain knowledge. It can be applied to all kinds of input distributions, hence it is useful for true “unsupervised learning”.</p><h4>How is virtual adversarial training different from adversarial training?</h4><p>In adversarial training, labels are also used to generate the adversarial perturbations. The perturbation is generated such that the classifier’s predicted label <em>y’</em> becomes different from the actual label <em>y</em>.</p><p>In virtual adversarial training, no label information is used and the perturbation is generated using just the model outputs. The perturbation is generated such that the output of the perturbed input is different from the model output of the original input (as opposed to the ground truth label).</p><h3>Implementing virtual adversarial training</h3><p>Now we will implement basic virtual adversarial training using Tensorflow and Keras. 
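</p><p>The snippets below call two helper functions, make_unit_norm and compute_kld, that are not defined in this excerpt. One plausible implementation (a sketch inferred from how they are used, not necessarily the original gist’s code):</p>

```python
import tensorflow as tf

def make_unit_norm(x):
    # Scale each row of x to unit L2 norm; the epsilon avoids division by zero.
    return x / (tf.sqrt(tf.reduce_sum(tf.square(x), axis=1, keepdims=True)) + 1e-8)

def compute_kld(p_logit, q_logit):
    # Per-example KL( softmax(p_logit) || softmax(q_logit) ).
    p = tf.nn.softmax(p_logit)
    return tf.reduce_sum(
        p * (tf.nn.log_softmax(p_logit) - tf.nn.log_softmax(q_logit)), axis=1)
```

<p>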
The full code can be found <a href="https://gist.github.com/divamgupta/c778c17459c1f162e789560d5e0b2f0b">here</a>.</p><p>First, define the neural network in Keras:</p><pre>network = Sequential()<br>network.add( Dense(100 ,activation=&#39;relu&#39; ,  input_shape=(2,)))<br>network.add( Dense( 2  ))</pre><p>Define the model_input, the logits p_logit (by applying the input to the network), and the probability scores p (by applying the softmax activation on the logits).</p><pre>model_input = Input((2,))<br>p_logit = network( model_input )<br>p = Activation(&#39;softmax&#39;)( p_logit )</pre><p>To generate the adversarial perturbation, start with a random perturbation r and make it unit norm.</p><pre>r = tf.random_normal(shape=tf.shape( model_input ))<br>r = make_unit_norm( r )</pre><p>The output logits of the perturbed input would be p_logit_r:</p><pre>p_logit_r = network( model_input + 10*r  )</pre><p>Now compute the KL divergence between the logits of the input and the perturbed input.</p><pre>kl = tf.reduce_mean(compute_kld( p_logit , p_logit_r ))</pre><p>To get the adversarial perturbation, we need an r such that the KL-divergence is maximized. Hence take the gradient of kl with respect to r; the adversarial perturbation would be that gradient. We use the stop_gradient function because we want to keep r_vadv fixed during back-propagation.</p><pre>grad_kl = tf.gradients( kl , [r ])[0]<br>r_vadv = tf.stop_gradient( grad_kl )</pre><p>Finally, normalize the adversarial perturbation. We set the norm of r_vadv to a small value, which is the distance we want to go along the adversarial direction.</p><pre>r_vadv = make_unit_norm( r_vadv )/3.0</pre><p>Now we have the adversarial perturbation r_vadv , for which the model gives a very large difference in outputs. 
We need to add a loss to the model which penalizes it for a large KL-divergence between the outputs of the original inputs and the perturbed inputs.</p><pre>p_logit_r_adv = network( model_input  + r_vadv )<br>vat_loss =  tf.reduce_mean(compute_kld( tf.stop_gradient(p_logit), p_logit_r_adv ))</pre><p>Finally, build the model and attach the vat_loss .</p><pre>model_vat = Model(model_input , p )<br>model_vat.add_loss( vat_loss   )<br>model_vat.compile( &#39;sgd&#39; ,  &#39;categorical_crossentropy&#39;  ,  metrics=[&#39;accuracy&#39;])</pre><p>Now let’s use some synthetic data to train and test the model. This dataset is two dimensional and has two classes. Class 1 data-points lie in the outer ring and class 2 data-points lie in the inner ring. We are using only 8 data-points per class for training, and 1000 data-points for testing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/385/0*qVWxyU6F_slLdhmw" /><figcaption><em>The plot of the synthetic dataset on a 2D plane</em></figcaption></figure><p>Let’s train the model by calling the fit function.</p><pre>model_vat.fit( X_train , Y_train_cat )</pre><h3>Visualizing model outputs</h3><p>Now, let’s visualize the output space of the model along with the training and testing data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/377/0*C_XQ0t6Gz9qUGrYl" /><figcaption><em>Model decision boundary with virtual adversarial training</em></figcaption></figure><p>For this example dataset, it is pretty evident that the model with virtual adversarial training has generalized better and its decision boundary lies within the bounds of the test data as well.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/377/0*9RWcL82o1baxABE3" /><figcaption><em>Model decision boundary without virtual adversarial training</em></figcaption></figure><p>For the model without virtual adversarial training, we see some overfitting on the training data-points. 
The decision boundary, in this case, is poor and overlaps with the other class.</p><h3>Applications of virtual adversarial training</h3><p>Virtual adversarial training has shown impressive results for various applications in semi-supervised learning and unsupervised learning.</p><p><strong>VAT for semi-supervised learning:</strong> Virtual adversarial training has shown good results in semi-supervised learning. Here, we have a large number of unlabeled data-points and a few labeled data-points. Applying the vat_loss on the unlabeled set and the supervised loss on the labeled set gives a boost in testing accuracy. The authors show the superiority of the method over several other semi-supervised learning methods. You can read more in the paper <a href="https://arxiv.org/abs/1704.03976">here</a>.</p><p><strong>Virtual adversarial ladder networks</strong>: <a href="https://arxiv.org/abs/1507.02672">Ladder networks</a> have shown promising results for semi-supervised classification. There, random noise is added at each layer and a decoder is trained to denoise the outputs of each layer. In virtual adversarial ladder networks, rather than using random noise, adversarial noise is used. You can read more in the paper <a href="https://arxiv.org/abs/1711.07476">here</a>.</p><p><strong>Unsupervised clustering using self-augmented training</strong>: Here the goal is to cluster the data-points into a fixed number of clusters without using any labeled samples. <a href="https://papers.nips.cc/paper/4154-discriminative-clustering-by-regularized-information-maximization">Regularized Information Maximization</a> is a technique for unsupervised clustering. Here the mutual information between the input and the model output is maximized. <a href="https://arxiv.org/abs/1702.08720">IMSAT</a> has extended the approach by adding virtual adversarial training. Along with the mutual information loss, the authors apply the vat_loss. 
They show great improvements after adding virtual adversarial training. You can read more in the <a href="https://arxiv.org/abs/1702.08720">paper</a> and my earlier <a href="https://divamgupta.com/unsupervised-learning/2019/03/08/an-overview-of-deep-learning-based-clustering-techniques.html">blog post.</a></p><h3>Conclusion</h3><p>In this post, we discussed an effective regularization technique called virtual adversarial training. We also dived into the implementation using Tensorflow and Keras. We observed that the model with VAT performs better when there are very few training samples. We also discussed various other works which use virtual adversarial training. If you have any questions or want to suggest any changes feel free to contact me or write a comment below.</p><p><strong>Get the full source code from </strong><a href="https://gist.github.com/divamgupta/c778c17459c1f162e789560d5e0b2f0b"><strong>here</strong></a></p><h3>References</h3><ul><li><a href="https://arxiv.org/abs/1412.6572">Explaining and Harnessing Adversarial Examples</a></li><li><a href="https://arxiv.org/abs/1507.00677">Distributional Smoothing with Virtual Adversarial Training</a></li><li><a href="https://arxiv.org/abs/1704.03976">Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning</a></li><li><a href="https://arxiv.org/abs/1711.07476">Virtual Adversarial Ladder Networks For Semi-supervised Learning</a></li><li><a href="https://arxiv.org/abs/1702.08720">Learning Discrete Representations via Information Maximizing Self-Augmented Training</a></li></ul><p><em>Originally published at </em><a href="https://divamgupta.com/unsupervised-learning/semi-supervised-learning/2019/05/31/introduction-to-virtual-adversarial-training.html"><em>https://divamgupta.com</em></a><em> on May 31, 2019.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=540e744cf227" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Overview of Deep Learning Based Clustering Techniques]]></title>
            <link>https://medium.com/@divamgupta/an-overview-of-deep-learning-based-clustering-techniques-ff640f108b5d?source=rss-d955ffaf8a54------2</link>
            <guid isPermaLink="false">https://medium.com/p/ff640f108b5d</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[unsupervised-learning]]></category>
            <dc:creator><![CDATA[Divam Gupta]]></dc:creator>
            <pubDate>Fri, 08 Mar 2019 00:00:00 GMT</pubDate>
            <atom:updated>2019-06-08T14:19:26.067Z</atom:updated>
            <content:encoded><![CDATA[<p>This post gives an overview of various deep learning based clustering techniques. I will be explaining the latest advances in unsupervised clustering which achieve state-of-the-art performance by leveraging deep learning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R7A0qNVuOjgGh6pj4WUq5A.png" /></figure><p>This is a re-post of the original article: <a href="https://divamgupta.com/unsupervised-learning/2019/03/08/an-overview-of-deep-learning-based-clustering-techniques.html">https://divamgupta.com/unsupervised-learning/2019/03/08/an-overview-of-deep-learning-based-clustering-techniques.html</a></p><p>Unsupervised learning is an active field of research and has always been a challenge in deep learning. Finding meaningful patterns in large datasets without the presence of labels is extremely helpful for many applications. Advances in unsupervised learning are crucial for artificial general intelligence. Performing unsupervised clustering is equivalent to building a classifier without using labeled samples.</p><p>In the past 3–4 years, several papers have improved unsupervised clustering performance by leveraging deep learning. Several models achieve more than 96% accuracy on the MNIST dataset without using a single labeled datapoint. However, we are still very far away from getting good accuracy for harder datasets such as CIFAR-10 and ImageNet.</p><p>In this post, I will be covering all the latest clustering techniques which leverage deep learning. The goal of most of these techniques is to cluster the data-points such that the data-points of the same ground truth class are assigned the same cluster. 
The deep learning based clustering techniques are different from traditional clustering techniques as they cluster the data-points by finding complex patterns rather than using simple pre-defined metrics like intra-cluster euclidean distance.</p><h3>Clustering with unsupervised representation learning</h3><p>One method to do deep learning based clustering is to learn good feature representations and then run any classical clustering algorithm on the learned representations. There are several deep unsupervised learning methods available which can map data-points to meaningful low dimensional representation vectors. The representation vector contains all the important information of the given data-point, hence clustering on the representation vectors yields better results.</p><p>One popular method to learn meaningful representations is deep auto-encoders. Here the input is fed into a multilayer encoder which has a low dimensional output. That output is fed to a decoder which produces an output of the same size as the input. The training objective of the model is to reconstruct the given input. In order to do that successfully, the learned representations from the encoder must contain all the useful information compressed in a low dimensional vector.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/430/0*HEu3X5G9rJaSSYbW.png" /></figure><p>Running K-means on representation vectors learned by deep autoencoders tends to give better results compared to running K-means directly on the input vectors. 
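</p><p>A minimal sketch of this pipeline (the layer sizes and training settings here are illustrative choices of mine, using Keras and scikit-learn):</p>

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from sklearn.cluster import KMeans

x = np.random.rand(1000, 784)  # stand-in for e.g. flattened MNIST images

# Encoder: compress the input to a low dimensional representation.
inp = Input((784,))
h = Dense(128, activation='relu')(inp)
z = Dense(10, activation='relu')(h)  # the learned representation
# Decoder: reconstruct the input from the representation.
d = Dense(128, activation='relu')(z)
out = Dense(784, activation='sigmoid')(d)

autoencoder = Model(inp, out)
encoder = Model(inp, z)
autoencoder.compile('adam', 'mse')
autoencoder.fit(x, x, epochs=3, verbose=0)  # reconstruction objective

# Run K-means in the representation space rather than on the raw inputs.
clusters = KMeans(n_clusters=10, n_init=10).fit_predict(encoder.predict(x, verbose=0))
```

<p>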
For example on MNIST, the clustering accuracy of K-means on the raw inputs is 53.2%, while running K-means on the representations learned by an auto-encoder yields an accuracy of 78.9%.</p><p>Other techniques to learn meaningful representations include:</p><ul><li><a href="https://jaan.io/what-is-variational-autoencoder-vae-tutorial/">Variational Autoencoders</a></li><li><a href="https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf">Sparse Autoencoders</a></li><li><a href="https://openreview.net/pdf?id=Bklr3j0cKX">Deep InfoMax</a></li><li><a href="https://openreview.net/pdf?id=B1ElR4cgg">BiGAN</a></li></ul><h3>Clustering via information maximization</h3><p>Regularized Information Maximization is an information theoretic approach to clustering which takes care of class separation, class balance, and classifier complexity. The method uses a differentiable loss function which can be used to train multi-logit regression models via backpropagation. The training objective is to maximize the mutual information between the input <em>x</em> and the model output <em>y</em> while imposing a regularization penalty on the model parameters.</p><p>Mutual information can be written as the difference between the marginal entropy and the conditional entropy. Hence the training objective to minimize is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/342/0*5_l4fl6cxdU0c8u0.png" /></figure><p>Here we are maximizing the marginal entropy <em>H(Y)</em> and minimizing the conditional entropy <em>H(Y|X)</em>.</p><p>By maximizing <em>H(Y)</em>, the cluster assignments are kept diverse, so the model cannot degenerate by assigning a single cluster to all the input data-points. In fact, it will try to make the distribution of clusters as uniform as possible, because entropy is maximized when the probability of each cluster is the same.</p><p>The neural network model with the softmax activation estimates the conditional probability <em>p(y|x)</em>. 
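Both entropy terms can be estimated from a mini-batch of softmax outputs, with <em>p(y)</em> obtained by averaging <em>p(y|x)</em> over the batch. A minimal NumPy sketch (the two-cluster batch is a made-up example):

```python
import numpy as np

def rim_terms(probs):
    """probs: (batch, k) softmax outputs p(y|x) for a mini-batch."""
    eps = 1e-12
    p_y = probs.mean(axis=0)                       # estimate of the marginal p(y)
    h_y = -np.sum(p_y * np.log(p_y + eps))         # marginal entropy H(Y)
    h_y_x = -np.sum(probs * np.log(probs + eps), axis=1).mean()  # H(Y|X)
    return h_y, h_y_x

# Confident, balanced assignments: H(Y) is maximal, H(Y|X) is near zero.
probs = np.array([[0.99, 0.01],
                  [0.01, 0.99]])
h_y, h_y_x = rim_terms(probs)
loss = h_y_x - h_y   # entropy part of the RIM objective to minimize
                     # (the weight-regularization penalty is omitted here)
```

A degenerate model that outputs the uniform distribution for every input makes H(Y|X) exactly as large as H(Y), so the loss cannot go below zero; confident, balanced assignments drive it negative.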
By minimizing <em>H(Y|X)</em>, we ensure that the cluster assignment of each data-point is made with high confidence. If <em>H(Y|X)</em> were not minimized and only <em>H(Y)</em> maximized, the model could degenerate by assigning an equal conditional probability to every cluster for any input.</p><p>In the implementation, in order to compute <em>H(Y)</em>, <em>p(y)</em> is approximated by marginalizing <em>p(y|x)</em> over a mini-batch. For a given <em>x</em>, <em>p(y|x)</em> is the output of the network after the softmax activation.</p><h3>Information maximization with self-augmented training</h3><p>The method described above assigns clusters while trying to balance the number of data-points in the clusters. The only thing that pushes the cluster assignments to be meaningful is the regularization penalty on the model parameters. A better way to ensure meaningful cluster assignments is to make similar data-points go to the same clusters.</p><p>Information Maximization with Self-Augmented Training (IMSAT) is an approach which uses data augmentation to generate similar data-points. Given a data-point <em>x</em>, an augmented training example T(<em>x</em>) is generated, where T : <em>X</em> → <em>X</em> denotes a pre-defined data augmentation function. The cross entropy between <em>p(y|x)</em> and <em>p(y|T(x))</em> is minimized.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/457/0*rF7P8GKHlZTDQzqK.png" /></figure><p>The authors of IMSAT propose two ways to augment the data and train the model: 1) Random Perturbation Training and 2) Virtual Adversarial Training.</p><p><strong>Random Perturbation Training (RPT)</strong>: Here a random perturbation <em>r</em> drawn from a pre-defined noise distribution is added to the input. Hence, the augmentation function is <em>T(x) = x + r</em>. The perturbation <em>r</em> is chosen randomly from a hyper-sphere. 
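A minimal sketch of the RPT consistency loss, assuming a hypothetical model that maps inputs to cluster logits (the random linear "model", the radius, and the dimensions are all illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def random_perturbation(shape, radius=0.1):
    # Draw r uniformly on a hyper-sphere of the given radius
    # (a normalized Gaussian is uniform on the sphere).
    r = rng.normal(size=shape)
    return radius * r / np.linalg.norm(r, axis=-1, keepdims=True)

def rpt_loss(model, x, radius=0.1):
    """Cross entropy between p(y|x) and p(y|T(x)), with T(x) = x + r."""
    p = softmax(model(x))                                         # p(y|x)
    q = softmax(model(x + random_perturbation(x.shape, radius)))  # p(y|T(x))
    return -np.sum(p * np.log(q + 1e-12), axis=-1).mean()

# Hypothetical stand-in "model": a random linear map from 4-D inputs
# to 3 cluster logits (a real model would be a trained network).
W = rng.normal(size=(4, 3))
x = rng.normal(size=(8, 4))
loss = rpt_loss(lambda inp: inp @ W, x)
```

In IMSAT this term is added to the information-maximization objective and minimized jointly with it.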
As you can see, this is a fairly naive way to do augmentation.</p><p><strong>Virtual Adversarial Training (VAT)</strong>: Here, rather than choosing the perturbation randomly, it is chosen such that the model fails to assign the original and the perturbed input to the same cluster. A limit is imposed on the norm of the perturbation <em>r</em> so that the input is not changed by much.</p><p>This training is somewhat similar to how GANs are trained. Rather than having a generator fool the discriminator, we generate a perturbation such that the model is fooled into assigning the pair to different clusters. Simultaneously, the model improves so that it does not make the same mistake in the future. The paper reports significant improvements of VAT over RPT on some datasets.</p><p>You can read more about virtual adversarial training <a href="https://arxiv.org/abs/1704.03976">here</a>.</p><h3>Deep Adaptive Clustering</h3><p>Deep Adaptive Clustering (DAC) uses a pairwise binary classification framework. Given two input data-points, the model outputs whether the inputs belong to the same cluster or not. Concretely, a network with a softmax activation takes an input data-point and produces a vector with the probabilities of the input belonging to each of the given clusters. Given two input data-points, the dot product of the model outputs for both inputs is taken. When the cluster assignments of the two inputs are different, the dot product will be close to zero, and when they are the same, the dot product will be close to one. As the dot product is a differentiable operation, we can train the model with backpropagation using pairwise training labels.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/726/0*zQ_lNAs3pf-YzIfD.png" /></figure><p>As the ground truth data is not available, the features of the same network are used to create binary labels for the pairwise training. The cosine similarity between the features of the two data-points is used. 
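A minimal sketch of this pairwise setup, with made-up softmax outputs and features; the threshold values are illustrative (the real method anneals them over training):

```python
import numpy as np

def dac_pair_loss(probs, feats, upper=0.95, lower=0.3):
    """probs: (n, k) softmax cluster probabilities.
    feats: (n, d) features from the same network, used only to
    generate pseudo pair-labels via cosine similarity."""
    eps = 1e-12
    # Predicted pair similarity: dot product of cluster-probability
    # vectors (~1 for same-cluster pairs, ~0 for different clusters).
    sim = probs @ probs.T
    # Pseudo pair-labels from the cosine similarity of the features.
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    cos = f @ f.T
    pos = cos > upper            # assumed same-cluster pairs
    neg = cos < lower            # assumed different-cluster pairs
    # Binary cross entropy on the selected pairs; in-between pairs are ignored.
    loss = -np.log(sim[pos] + eps).sum() - np.log(1 - sim[neg] + eps).sum()
    return loss / max(pos.sum() + neg.sum(), 1)

# Three made-up points: the first two look alike, the third differs.
feats = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
probs = np.array([[0.98, 0.02], [0.97, 0.03], [0.02, 0.98]])
loss = dac_pair_loss(probs, feats)  # small: assignments agree with pseudo-labels
```

Because everything here is a differentiable function of probs, the same quantity can serve directly as a training loss for the network that produces the probabilities.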
Given an input pair, if the cosine similarity is greater than the upper threshold, the pair is considered a positive pair (meaning both should be in the same cluster). Similarly, if the cosine similarity is less than the lower threshold, the pair is considered a negative pair (meaning both should be in different clusters). If the similarity lies between the lower and the upper threshold, the pair is ignored. After getting the positive and the negative pairs, the pairwise loss is minimized.</p><p>As the pairwise loss is minimized, the model becomes better at classifying pairs of data-points, and the features of the network become more meaningful. With the features becoming more meaningful, the binary labels obtained from the cosine similarity of the features become more accurate.</p><p>You may think of this as a chicken-and-egg problem and ask how to get a good start. The solution is a good random initialization distribution. With standard initialization techniques, even with random model weights the output is correlated with the inputs (behaving like an <a href="https://en.wikipedia.org/wiki/Extreme_learning_machine">extreme learning machine</a>). Hence the cosine similarity of the features is somewhat meaningful at the beginning. Initially, the upper threshold is set to a large value, as the similarity measure is not yet very accurate. Over the iterations, the upper threshold is decreased.</p><h3>Conclusion</h3><p>I hope this post was able to give you an insight into various deep learning based clustering techniques. Deep learning based methods have outperformed traditional clustering techniques on many benchmarks. Most of the methods discussed are promising, and there is huge potential for improvement on several datasets. 
If you have any questions or want to suggest any changes, feel free to contact me or write a comment below.</p><h3>References</h3><ul><li><a href="https://arxiv.org/abs/1702.08720">Learning Discrete Representations via Information Maximizing Self-Augmented Training</a></li><li><a href="http://openaccess.thecvf.com/content_ICCV_2017/papers/Chang_Deep_Adaptive_Image_ICCV_2017_paper.pdf">Deep Adaptive Image Clustering</a></li><li><a href="https://www.jeremyjordan.me/autoencoders/">Introduction to autoencoders</a></li><li><a href="https://papers.nips.cc/paper/4154-discriminative-clustering-by-regularized-information-maximization">Discriminative Clustering by Regularized Information Maximization</a></li></ul><p><em>Originally published at </em><a href="https://divamgupta.com/unsupervised-learning/2019/03/08/an-overview-of-deep-learning-based-clustering-techniques.html"><em>https://divamgupta.com</em></a><em> on March 8, 2019.</em></p>]]></content:encoded>
        </item>
    </channel>
</rss>