<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by DECILE on Medium]]></title>
        <description><![CDATA[Stories by DECILE on Medium]]></description>
        <link>https://medium.com/@decile-research?source=rss-8bf00d290651------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*r5tu7fHSvoHhoZR7WYOSSg.png</url>
            <title>Stories by DECILE on Medium</title>
            <link>https://medium.com/@decile-research?source=rss-8bf00d290651------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 06 Apr 2026 22:54:32 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@decile-research/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[CORDS — Energy and Time Efficient Toolkit for training Large Datasets]]></title>
            <link>https://decile-research.medium.com/cords-energy-and-time-efficient-toolkit-for-training-large-datasets-a973769f30c?source=rss-8bf00d290651------2</link>
            <guid isPermaLink="false">https://medium.com/p/a973769f30c</guid>
            <category><![CDATA[data-subset-selection]]></category>
            <category><![CDATA[efficient-model-training]]></category>
            <category><![CDATA[hyperparameter-tuning]]></category>
            <category><![CDATA[reduce-training-time]]></category>
            <category><![CDATA[reduce-resources]]></category>
            <dc:creator><![CDATA[DECILE]]></dc:creator>
            <pubDate>Sun, 20 Jun 2021 07:49:24 GMT</pubDate>
            <atom:updated>2021-06-20T08:02:00.898Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>CORDS — Energy and Time Efficient Toolkit for Training Large Datasets</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JUKFTbqvrAScX3si_i-Clg.png" /></figure><h3><strong>In this Article</strong></h3><ol><li>Introduction</li><li>What is CORDS?</li><li>Getting started with CORDS</li><li>CORDS Demo</li><li>Code Walk-through</li><li>CORDS Results</li><li>Conclusion</li><li>Publications</li></ol><h3><strong>1. Introduction</strong></h3><p>As computing has grown, so has data. Machine learning tasks today often involve training large neural networks on large datasets, which requires access to expensive high-end GPUs and takes a long time to reach good accuracy. When resources and time are limited, this poses a real problem. This tutorial presents a solution for handling these large datasets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/505/1*cBa6GpNPESyE47YZov5NmA.gif" /><figcaption><em>So why waste a lot of time and resources training deep learning models when you can do the same even with fewer resources? </em>😉</figcaption></figure><h3><strong>2. What is CORDS?</strong></h3><p>CORDS is a toolkit built on the PyTorch library that allows researchers and developers to reduce model training time from days to hours (or hours to minutes) and cut energy requirements and costs by an order of magnitude, using coresets and data selection. The goal of CORDS is to make machine learning more energy-, cost-, resource- and time-efficient without sacrificing accuracy. 
The idea behind CORDS is to select the right representative subsets of large datasets using state-of-the-art subset selection and coreset algorithms such as:</p><ul><li>GLISTER [1]</li><li>GradMatch [2]</li><li>CRAIG [2,3]</li><li>Submodular Selection [4,5,6] (Facility Location, Feature Based Functions, Coverage, Diversity)</li><li>Random Selection</li></ul><h3><strong>3. Getting started with CORDS</strong></h3><p>To get started with CORDS, <a href="https://github.com/decile-team/cords">click here</a> to access the GitHub repository and follow the installation instructions.</p><h3><strong>4. CORDS Demo</strong></h3><p>For this demo, we will cover the basics of CORDS by training with the GLISTER strategy on the CIFAR-10 dataset. CIFAR-10 is a classic dataset for deep learning, consisting of 32x32 images belonging to 10 different classes, such as dog, frog, truck, ship, and so on. Training on the complete dataset with a ResNet architecture takes a couple of hours. The mission is to reduce the training time and costs without sacrificing accuracy.</p><p><em>NOTE: For this example, we use the GLISTER strategy with the CIFAR-10 dataset and a ResNet architecture. Any of the strategies listed above can be run in the same way.</em></p><h4><strong>Step 1</strong></h4><p>Modify the model architecture in the following way:</p><ol><li>The forward method should take two additional arguments:</li></ol><ul><li>A boolean argument ‘last’ which,</li></ul><p>▹ If true: returns the model output and the output of the second-last layer</p><p>▹ If false: returns only the model output</p><ul><li>A boolean argument ‘freeze’ which,</li></ul><p>▹ If true: disables the tracking of any calculations required to later compute a gradient, i.e., skips gradient calculation over the weights</p><p>▹ If false: tracks gradients as usual</p><p>2. 
A get_embedding_dim() method which returns the number of hidden units in the last layer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/433/1*PnY313hPl_3BPl-y0_vepA.png" /><figcaption><strong>Modifying Model Architecture</strong></figcaption></figure><h4><strong>Step 2</strong></h4><p>Use a configuration file with the required parameters, as shown below (<a href="https://github.com/decile-team/cords/blob/main/configs/config_glister_cifar10.py">config_glister_cifar10.py</a>):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/699/1*rBey9GLvPAIanB4eXVX-xg.png" /><figcaption><strong>Training Configuration File</strong></figcaption></figure><h4><strong>Step 3</strong></h4><p>Run the code below:</p><pre>from train import TrainClassifier

config_file = "configs/config_glister_cifar10.py"
classifier = TrainClassifier(config_file)
classifier.train()</pre><blockquote><strong>VOILA! Training has never been as simple as this!</strong></blockquote><p>In the next section, we go through the complete training procedure step-by-step for the code <a href="https://github.com/decile-team/cords/blob/main/train.py">train.py</a>.</p><h3><strong>5. 
Code Walk-through</strong></h3><p>The highly modularized code of CORDS makes it simple for developers and researchers to understand it and make changes as required for their needs.</p><p>Initially, define the model with the required model architecture:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/690/1*oGTUQ70V4mVdfV4wJa8Ujg.png" /></figure><p>Define the loss function and optimizer in the same way:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/659/0*PJ1sz4gdbojLMTwd" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MVdDO68ZUZdJuLW9" /></figure><p>The train method incorporates the following steps:</p><p>▹ Load the training, testing and validation datasets and declare batch sizes for the corresponding datasets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QAi8XEs8eWcUhTie" /></figure><p>▹ Wrap each dataset in a PyTorch DataLoader object for training, testing and validation, which takes care of shuffling the data (if required) and constructing the batches.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/792/0*K7lyFgUbkKnTqSVo" /></figure><p>▹ The next step is to initialize the data subset selection algorithm.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*o4dymcVLnZd8xgZ7" /></figure><p>▹ Next, we train the network on the subset of the training data. We simply loop over the data iterator for the required number of epochs, feed the inputs to the network, and optimize. The subset selection algorithm returns the right representative data subset indices and corresponding gradients. 
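</p><p>As a minimal sketch of this loop (illustrative only: the dataset, model, and the random stand-in for the GLISTER selection step are toy assumptions, not CORDS’s actual API):</p>

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

torch.manual_seed(0)

# Toy stand-ins for the dataset and model used in the walk-through.
full_train = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

def select_subset(dataset, budget):
    # Placeholder for a subset selection strategy such as GLISTER:
    # here we pick indices at random; the real strategies score
    # candidate points using gradient information.
    return torch.randperm(len(dataset))[:budget].tolist()

budget, select_every, num_epochs = 100, 5, 10
for epoch in range(num_epochs):
    if epoch % select_every == 0:
        # Re-select the representative subset and rebuild the DataLoader.
        subset_indices = select_subset(full_train, budget)
        subset_loader = DataLoader(Subset(full_train, subset_indices),
                                   batch_size=32, shuffle=True)
    for inputs, targets in subset_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```

<p>The point is simply that training touches only the selected indices in each epoch, which is where the time and energy savings come from.</p><p>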
These are then loaded into a PyTorch DataLoader object for further evaluation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*P1kniPFdW3lVic2ni9dOfQ.png" /></figure><p>▹ In the next step, we evaluate the model losses on the subset of the training data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*G9LBmz3UXoH7ELhBju_MhQ.png" /></figure><p>▹ Finally, we calculate the validation and test losses on the validation and test datasets loaded via PyTorch DataLoader objects.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*vATjg6HRuzuYwjzA" /></figure><h3><strong>6. CORDS Results</strong></h3><p>Currently we see between 3<em>x</em> and 7<em>x</em> improvements in energy and runtime with around a 1–2% drop in accuracy. We expect to push the Pareto-optimal frontier even further over time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jLGN30-fYspgVt5X" /><figcaption><strong>CORDS Results</strong></figcaption></figure><h3><strong>7. Conclusion</strong></h3><p>As seen, CORDS is an effort to make deep learning more energy-, cost-, resource- and time-efficient without sacrificing accuracy.</p><p>The following are the goals CORDS tries to achieve:</p><ol><li><em>Data Efficiency</em></li><li><em>Reducing End to End Training Time</em></li><li><em>Reducing Energy Requirement</em></li><li><em>Faster Hyper-parameter tuning</em></li><li><em>Reducing Resource (GPU) Requirement and Costs</em></li></ol><p>In this part, we gave a basic introduction to CORDS and showed how it can be used for subset selection on massive datasets. In the next tutorial, we will talk about hyper-parameter tuning to obtain SOTA results using CORDS (<a href="https://docs.google.com/document/d/1L8ictn66l2HkGG8mMOMcpC13wga8gE3ELrGGyxh0kaI/edit">click here</a>).</p><p>For more examples and tutorials, visit the <a href="https://github.com/decile-team/cords">CORDS GitHub repository</a>.</p><h3><strong>8. 
Publications</strong></h3><p>[1] Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer, <a href="https://arxiv.org/abs/2012.10630">GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning</a>, 35th AAAI Conference on Artificial Intelligence, AAAI 2021</p><p>[2] Krishnateja Killamsetty, Durga Sivasubramanian, Abir De, Ganesh Ramakrishnan, Baharan Mirzasoleiman, Rishabh Iyer, <a href="https://arxiv.org/abs/2103.00123">Grad-Match: A Gradient Matching based Data Selection Framework for Efficient Learning</a>, 2021</p><p>[3] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec, <a href="https://arxiv.org/abs/1906.01827">Coresets for Data-efficient Training of Machine Learning Models</a>, International Conference on Machine Learning (ICML), July 2020</p><p>[4] Kai Wei, Rishabh Iyer, Jeff Bilmes, <a href="http://proceedings.mlr.press/v37/wei15-supp.pdf">Submodularity in Data Subset Selection and Active Learning</a>, International Conference on Machine Learning (ICML) 2015</p><p>[5] Vishal Kaushal, Rishabh Iyer, Suraj Kothawade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan, <a href="https://arxiv.org/abs/1901.01151">Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision</a>, 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, Hawaii, USA</p><p>[6] Kai Wei, et al., <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.496.6287&amp;rep=rep1&amp;type=pdf">Submodular subset selection for large-scale speech training data</a>, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 
IEEE, 2014.</p><h4><strong>Author:</strong></h4><p><a href="https://www.linkedin.com/in/dheerajnbhat/"><em>Dheeraj Bhat</em></a></p><p><em>Research &amp; Development Intern</em></p><p><a href="https://decile.org/"><em>DECILE Research Group</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a973769f30c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Active Learning Strategies & DISTIL]]></title>
            <link>https://decile-research.medium.com/active-learning-strategies-distil-62ee9fc166f9?source=rss-8bf00d290651------2</link>
            <guid isPermaLink="false">https://medium.com/p/62ee9fc166f9</guid>
            <category><![CDATA[al-strategy]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[reduce-labeling-cost]]></category>
            <category><![CDATA[distil]]></category>
            <category><![CDATA[active-learning]]></category>
            <dc:creator><![CDATA[DECILE]]></dc:creator>
            <pubDate>Wed, 05 May 2021 16:52:05 GMT</pubDate>
            <atom:updated>2021-05-09T09:07:17.010Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="DISTIL Logo" src="https://cdn-images-1.medium.com/max/557/1*JwzpzLPD5FC1073TWAnuUA.png" /></figure><h4>In this Article</h4><ol><li>Introduction</li><li>Deep dIverSified inTeractIve Learning (DISTIL)</li><li>Various Active Learning Strategies</li></ol><ul><li>Uncertainty Sampling</li><li>Coreset</li><li>FASS</li><li>BADGE</li><li>GLISTER-ACTIVE</li><li>Adversarial Techniques</li><li>BALD</li></ul><p>4. Video Explanation</p><p>5. Resources</p><h4>1. INTRODUCTION</h4><p>Deep learning models, often deemed the state of the art, are specially equipped to find hidden patterns in large datasets, as they learn to craft their own features. However, training these deep learning models is very demanding, both in terms of computational resources and training data. The deeper the model, the more parameters there are to learn. This makes models more and more data-hungry to achieve good generalization, which raises the question: what is the cost of acquiring the data? Are datasets always labelled, and if not, what is the cost incurred in getting unlabelled datasets labelled?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/624/0*aSz2O3kCciefs8sT" /><figcaption>Price for labelling 1000 data points. (Source: <a href="https://cloud.google.com/ai-platform/data-labeling/pricing">https://cloud.google.com/ai-platform/data-labeling/pricing</a>). Similar rates can be found at <a href="https://aws.amazon.com/sagemaker/groundtruth/pricing/">https://aws.amazon.com/sagemaker/groundtruth/pricing/</a></figcaption></figure><p>Though the cost of labelling varies with the underlying task, it is clear that even for the simplest of tasks the labelling cost can be staggering if one wants to label enough data points to train modern-day deep models. It is also important to note that these prices are for a single annotator, on tasks that do not even require a domain expert. 
Often, more annotators are needed to improve reliability.</p><p>Can something be done to reduce this staggering labelling cost when a labelled dataset is unavailable? Are all data points needed to achieve good performance?</p><p>It turns out that large datasets often contain a lot of redundancy. Therefore, if the data points are chosen carefully, models can reach good accuracy with only a few of them. This is where active learning comes into play. Active learning allows machine learning algorithms to achieve greater accuracy with fewer training labels: the algorithm chooses the data from which it wants to learn and gets it labelled by an oracle (e.g., a human annotator). Active learning is useful where unlabelled data is abundant or easily obtained, but labels are difficult, time-consuming, or expensive to obtain.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I4O6D-AuGfyMdEql" /></figure><h4>2. Deep dIverSified inTeractIve Learning (DISTIL)</h4><p>Active learning can be easily incorporated with the new DISTIL toolkit. <a href="https://github.com/decile-team/distil">DISTIL</a> is a library that features many state-of-the-art active learning algorithms. Implemented in PyTorch, it provides fast and efficient implementations of these strategies. DISTIL employs mini-batch adaptive active learning, which is more appropriate for deep neural networks: in each of n rounds, a DISTIL strategy selects a mini-batch of size k. Now let’s look at the various strategies present in DISTIL.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*GjMIwg32buNwUpp6" /></figure><h4>3. Various Active Learning Strategies</h4><p><strong>1. Uncertainty Sampling</strong></p><p>One way to reduce labelling cost is to identify the data points that the underlying model finds most difficult to classify, and provide labels only for those. 
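</p><p>Before looking at the formulas, here is a minimal sketch of how such uncertainty scores can be computed from softmax outputs (a plain PyTorch illustration with fake logits, not DISTIL’s API):</p>

```python
import torch

torch.manual_seed(0)

# Fake logits for 5 unlabelled points and ncl = 3 classes.
logits = torch.randn(5, 3)
probs = torch.softmax(logits, dim=1)

# Least confidence: most uncertain = lowest top-class probability.
top_prob = probs.max(dim=1).values
least_confidence_pick = top_prob.topk(2, largest=False).indices

# Margin: most uncertain = smallest gap between the two best classes.
top2 = probs.topk(2, dim=1).values
margin = top2[:, 0] - top2[:, 1]
margin_pick = margin.topk(2, largest=False).indices

# Entropy: most uncertain = highest entropy over all classes.
entropy = -(probs * probs.log()).sum(dim=1)
entropy_pick = entropy.topk(2).indices
```

<p>Each strategy then sends its k picked points to the oracle for labelling.</p><p>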
We score a data point as simple or complex based on the softmax output for that point. Suppose the model has <strong><em>ncl</em></strong> output nodes and each output node is denoted by Z<strong><em>j</em></strong>, so <strong><em>j ∈ [1,ncl]</em></strong>. Then for an output node Z<strong><em>i</em></strong> from the model, the corresponding softmax would be</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/136/1*1aX5QGcYcfLr9D1NcI5S_g.png" /></figure><p><em>A. Least Confidence </em><br>The softmax output can be used to pick the k elements for which the model has the lowest confidence, as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/371/1*69nYBXZJNqwUiXiIgYk3Kg.png" /></figure><p>where <strong><em>U</em></strong> denotes the unlabelled data.</p><p><em>B. Margin Sampling</em><br>Margin sampling picks the k elements with the smallest margin between the two most likely classes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/392/1*srr2BF7fos6YgAjRVY9FVg.png" /></figure><p>where <strong><em>U</em></strong> denotes the unlabelled data.</p><p><em>C. Entropy</em><br>Entropy sampling picks the k elements with the highest predictive entropy:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/484/1*drMLSxAa-1QC29LbresH2Q.png" /></figure><p>where <strong><em>U</em></strong> denotes the unlabelled data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/627/1*bX3UCsTYdKIRVJHR_BAWEw.png" /></figure><p>Interestingly, both least confidence sampling and margin sampling pick some data points that have pairwise confusion, whereas entropy focuses on the data points that have confusion among most of the labels.</p><p><strong>2. Coreset</strong></p><p>This technique tries to find data points that can represent the entire dataset. For this, it solves a k-Center Problem on the set of points represented by the embeddings obtained from the penultimate layer of the model. 
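</p><p>The k-Center step is usually solved with a greedy 2-approximation; here is a small self-contained sketch on toy 2-D points (illustrative, not DISTIL’s implementation):</p>

```python
import math

def k_center_greedy(points, k):
    # Start from an arbitrary point, then repeatedly add the point
    # farthest from the centres chosen so far (greedy 2-approximation).
    centres = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(centres) < k:
        farthest = max(range(len(points)), key=dist.__getitem__)
        centres.append(farthest)
        # Keep, for every point, its distance to the nearest centre.
        dist = [min(d, math.dist(p, points[farthest]))
                for d, p in zip(dist, points)]
    return centres

# Toy 2-D "embeddings": two tight clusters plus one outlier.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centres = k_center_greedy(embeddings, 3)
```

<p>In DISTIL the points would be penultimate-layer embeddings rather than raw 2-D coordinates.</p><p>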
Embeddings from the penultimate layer can be thought of as extracted features; therefore, solving the k-Center Problem in this new feature space can help us get representative points. The idea in the Coreset strategy is that if those representative points are labelled, the model will have enough information. For example, the Coreset strategy would select the blue points if the union of red and blue points were given as input and the budget was 4.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/824/0*asbznez7D_rG7svZ" /><figcaption>4 centres chosen. Source: <a href="https://arxiv.org/abs/1708.00489">Core-Set Paper</a></figcaption></figure><p><strong>3. FASS</strong></p><p>Filtered Active Submodular Selection (FASS) combines the uncertainty sampling idea with the Coreset idea to select the most representative points, using a submodular data subset selection framework.<br>Here we first select a subset F of size β based on uncertainty sampling, such that β ≥ k.<br>Then, using one of the submodular functions (‘facility location’, ‘graph cut’, ‘saturated coverage’, ‘sum redundancy’, ‘feature based’), we select a subset S of size k.</p><p>Submodular functions are often used to obtain the most representative or diverse subsets.</p><p><strong>4. BADGE</strong></p><p><strong>Batch Active learning by Diverse Gradient Embeddings (BADGE) </strong>samples groups of points that are disparate and of high magnitude when represented in a hallucinated gradient space, a strategy designed to incorporate both predictive uncertainty and sample diversity into every selected batch. This allows it to trade off between uncertainty and diversity without requiring any hand-tuned hyperparameters. At each round of selection, loss gradients are computed using the hypothesized labels.</p><p><strong>5. 
GLISTER-ACTIVE</strong></p><p>GLISTER-ACTIVE performs data selection jointly with parameter learning by trying to solve a bi-level optimization problem,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/406/1*DamLuRs7k4rfYfWTnx6FRw.png" /></figure><p>Inner-level optimization: This is very similar to the problem encountered while training a model, except that here the data points used come from a subset. It therefore tries to maximize the log-likelihood (LLT) on the given subset.<br>Outer-level optimization: This is also a log-likelihood maximization problem. The objective here is to select a subset S that maximizes the log-likelihood on the validation set, given the model parameters.</p><p>This bi-level optimization is often expensive or impractical to solve for general loss functions, especially when the inner optimization problem cannot be solved in closed form. Therefore, instead of solving the inner optimization problem completely, a one-step approximation is made as follows,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/304/1*8aHVoZfg6IW_I7tWOzkQtA.png" /></figure><p>while solving the outer optimization.</p><p><strong>6. Adversarial Techniques</strong></p><p>These techniques are motivated by the fact that computing the distance from the decision boundary is often difficult and intractable for margin-based methods. Adversarial techniques such as DeepFool and BIM (Basic Iterative Method) have been tried in the active learning setting to estimate how much adversarial perturbation is required to cross the boundary: the smaller the required perturbation, the closer the point is to the boundary.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/346/0*nTPKugEe9AuI_ICQ" /><figcaption>Choosing samples in an adversarial setting. Paper:<a href="https://arxiv.org/abs/1802.09841"> Adversarial active learning for deep networks: a margin based approach</a></figcaption></figure><p><strong>7. 
BALD</strong></p><p>Bayesian Active Learning by Disagreement (BALD) assumes a Bayesian setting, so the parameters are probability distributions. This allows the model to quantify its beliefs: a wide distribution for a parameter means that the model is uncertain about its true value, whereas a narrow one indicates high certainty. BALD scores a data point x based on how well the model’s predictions y inform us about the model parameters. For this, it uses mutual information,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/501/1*7iY8qiHiGLaifkWiqQOo_Q.png" /></figure><p>The mutual information can be re-written as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/568/1*vB5jqEckllNnf1FmyvUkdw.png" /></figure><p>Looking at the two terms in the equation, for the mutual information to be high, the left term has to be high and the right term low. The left term is the entropy of the model prediction, which is high when the model’s prediction is uncertain. The right term is an expectation of the entropy of the model prediction over the posterior of the model parameters, and is low when the model is certain for each individual draw of model parameters from the posterior. Both can only happen when the model has many possible ways to explain the data, which means that the posterior draws disagree among themselves. Therefore, in each round, k points are selected as follows,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/620/1*q4qJKyjAyDHEv0fdUaBuwg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/432/0*0UDYLwXppSvce-kN" /><figcaption>The intuition behind BALD. Areas in grey contribute to the BALD score. Paper: <a href="https://arxiv.org/abs/1906.08158">Efficient and diverse batch acquisition for deep Bayesian active learning</a></figcaption></figure><h4>4. 
Video Explanation</h4><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FtBhjq1gUAv4&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DtBhjq1gUAv4&amp;image=http%3A%2F%2Fi.ytimg.com%2Fvi%2FtBhjq1gUAv4%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/ba5718f68f869cf0f8b3e580fee6bb5a/href">https://medium.com/media/ba5718f68f869cf0f8b3e580fee6bb5a/href</a></iframe><h4>5. Resources</h4><p><em>More about Active Learning &amp; DISTIL:</em></p><ul><li><a href="https://decile-research.medium.com/getting-started-with-distil-active-learning-ba7fafdbe6f3">Getting Started With DISTIL &amp; Active Learning</a></li><li><a href="https://decile-research.medium.com/cut-down-on-labeling-costs-with-distil-77bec5c2e864">Cut Down on Labeling Costs with DISTIL</a></li></ul><p><em>YouTube Playlist:</em></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fvideoseries%3Flist%3DPLIQ2KoP-CQ5HU4hjT2S-HNewam8sEW-9c&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPLIQ2KoP-CQ5HU4hjT2S-HNewam8sEW-9c&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FOGgGqk3seaw%2Fhqdefault.jpg%3Fsqp%3D-oaymwEWCKgBEF5IWvKriqkDCQgBFQAAiEIYAQ%3D%3D%26rs%3DAOn4CLCAdQdwT8Yaqckk9Ov2BGp5SoHBTg%26days_since_epoch%3D18756&amp;key=d04bfffea46d4aeda930ec88cc64b87c&amp;type=text%2Fhtml&amp;schema=youtube" width="853" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/e99cd4629e9eece787082ec3df802888/href">https://medium.com/media/e99cd4629e9eece787082ec3df802888/href</a></iframe><p>Author:</p><p><strong><em>Durga Subramanian</em></strong></p><p><a href="https://decile.org/"><strong>DECILE Research Group</strong></a></p><img 
src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=62ee9fc166f9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cut Down on Labeling Costs with DISTIL]]></title>
            <link>https://decile-research.medium.com/cut-down-on-labeling-costs-with-distil-77bec5c2e864?source=rss-8bf00d290651------2</link>
            <guid isPermaLink="false">https://medium.com/p/77bec5c2e864</guid>
            <category><![CDATA[distil]]></category>
            <category><![CDATA[active-learning]]></category>
            <category><![CDATA[annotations]]></category>
            <category><![CDATA[labeling-cost]]></category>
            <category><![CDATA[robustness-and-redundancy]]></category>
            <dc:creator><![CDATA[DECILE]]></dc:creator>
            <pubDate>Mon, 03 May 2021 18:11:51 GMT</pubDate>
            <atom:updated>2021-05-09T09:02:07.059Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="Distil Logo" src="https://cdn-images-1.medium.com/max/557/1*JwzpzLPD5FC1073TWAnuUA.png" /></figure><p><strong>In this Article</strong></p><ol><li>Introduction</li><li>Reducing Labeling Costs</li><li>DISTIL</li><li>Robustness against Redundancy</li><li>Video Explanation</li><li>Conclusion</li><li>Resources</li></ol><h4>1. Introduction</h4><p>Much of deep learning owes its success to the staggering amount of data used in model training. While throwing data at these deep models has been shown to improve their accuracies time and time again, it comes at the great expense of data labeling. Indeed, mid-size datasets of tens of thousands of points may cost anywhere from a couple thousand USD to a couple hundred thousand USD to label.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/499/0*txAaWmvl7cwAA8fv.png" /></figure><p>For example, if your labeling task does not require specialist knowledge, Google’s AI Platform Data Labeling Service can be used to procure labeled data. In that instance, labeling 50,000 units (see Google’s per-1000-unit pricing chart) could cost up to $43,500 for one annotation per data point. Typically, multiple people annotate the data to ensure the quality of the labels, so this large cost is made even worse by a factor of the number of annotations required! This example does not even consider the possibility that your data needs specialist knowledge to label. For example, a medical dataset of images often requires specialist knowledge for most labeling tasks. If you end up needing to label a very large dataset with difficult labels, well… I hope you have some spare pallets of cash lying around.</p><h4><strong>2. Reducing Labeling Costs</strong></h4><p>If you are like most people, you do not have a couple hundred thousand USD to shell out on labeling. A natural question to ask is how you can alleviate your labeling costs. 
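</p><p>One answer, developed in the rest of this section, is to label data iteratively rather than all at once. The whole loop fits in a few lines; here is a toy, self-contained sketch with a 1-D “dataset” and a stand-in annotator (every name here is illustrative, not DISTIL’s API):</p>

```python
import random

random.seed(0)

# Toy pool: 1-D points in [-1, 1]; the true label is the sign.
pool = [random.uniform(-1, 1) for _ in range(200)]
oracle = lambda x: int(x > 0)  # stand-in for a human annotator

labeled = {x: oracle(x) for x in random.sample(pool, 5)}
unlabeled = [x for x in pool if x not in labeled]

def fit_threshold(labeled):
    # "Training": put the decision threshold midway between the largest
    # known negative and the smallest known positive point.
    neg = max((x for x, y in labeled.items() if y == 0), default=-1.0)
    pos = min((x for x, y in labeled.items() if y == 1), default=1.0)
    return (neg + pos) / 2

budget, rounds = 5, 4
for _ in range(rounds):
    t = fit_threshold(labeled)
    # Query strategy: ask the oracle about points nearest the boundary.
    unlabeled.sort(key=lambda x: abs(x - t))
    batch, unlabeled = unlabeled[:budget], unlabeled[budget:]
    labeled.update({x: oracle(x) for x in batch})

threshold = fit_threshold(labeled)
```

<p>With only 25 labels out of 200 points, the learned threshold ends up close to the true boundary at 0, which is the whole promise of active learning.</p><p>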
A promising area of machine learning is active learning, which serves to answer the following question: Based on my model performance so far, what data should I have labeled so that my model’s performance is maximized once I train on the new collection of labeled data? The answer to this question effectively allows you to cut to the chase: instead of labeling <em>all</em> your data, you can label only the most important data to achieve good model performance. In essence, active learning aims to <em>distil</em> the large amount of unlabeled data at your disposal so that you get the best labeling efficiency allowable under current methods.</p><h4><strong>3. Deep dIverSified inTeractIve Learning (DISTIL)</strong></h4><p>Luckily, an open-source Python library exists to make active learning easy and accessible! <a href="https://github.com/decile-team/distil">DISTIL</a> is a library that features many state-of-the-art active learning algorithms. Implemented in PyTorch, it gives fast and efficient implementations of these active learning algorithms. It allows users to modularly insert active learning selection into their pre-existing training loops with minimal change. Most importantly, it features promising results in achieving high model performance with less labeled data. By comparing the performance of these active learning algorithms against the strategy of randomly selecting points to label, the labeling efficiency of these active learning algorithms becomes clear. Here are some of the results obtained on common datasets using some of the active learning algorithms in DISTIL:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/818/0*J8SjC6YfWu65iq9q.png" /><figcaption>The best strategies show 2x labeling efficiency compared to random sampling. 
BADGE does better than entropy sampling with a larger budget, and all strategies do better than random sampling.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/845/0*ln6hQqc2IgL3l4bT.png" /><figcaption>All strategies exhibit a gain over random sampling, but the per-batch version of BADGE performs similarly to random sampling. (Regular BADGE does not scale to CIFAR-100!)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/845/0*iJc-r8tutSPu-Vwn.png" /><figcaption>All strategies exhibit a gain over random sampling, and both entropy sampling and BADGE achieve a 4x labeling efficiency compared to random sampling.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/845/0*lPzEHzzhuThRyz0S.png" /><figcaption>All strategies exhibit a gain over random sampling, and both entropy sampling and BADGE achieve a 4x labeling efficiency compared to random sampling.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/845/0*2zGOCKLwSUacjDZP.png" /><figcaption>All strategies exhibit a gain over random sampling, and both entropy sampling and BADGE achieve a 3x labeling efficiency compared to random sampling.</figcaption></figure><p><strong>4. Robustness against Redundancy</strong></p><p>A valid criticism of the above results might be that these datasets are not representative of real-world datasets. Indeed, many datasets used in industry feature an astronomical amount of data. In fact, much of this data is often redundant. A natural question to ask, then, is whether active learning is robust against redundancy. 
An answer to this question would give some evidence of the effectiveness of active learning on real-world datasets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/353/0*MoM4QIPv8AYZPpPc.png" /><figcaption>Very large datasets have many redundant data instances.</figcaption></figure><p>Luckily, DISTIL offers a wide repertoire of active learning algorithms, and some of them are robust against redundancy. In particular, we can examine how entropy sampling and BADGE perform on redundant data versus random sampling. The following shows some results on a modified CIFAR-10 dataset, where only a few unique points are drawn and increasingly duplicated:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/845/0*lgxxIv2zPgJhDF8Q.png" /><figcaption>BADGE and entropy sampling perform better than random sampling. The labeling efficiency is not pronounced with few unique points to select.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/845/0*_X-SPB4BdkfC8wpS.png" /><figcaption>With more redundancy, BADGE begins to perform better than before. Entropy sampling begins to perform worse than random.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/850/0*Y5NucwFBxI1hyCp9.png" /><figcaption>With even more redundancy, BADGE continues to do better than random sampling, while entropy sampling continues to do worse than random sampling.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/845/0*zBEfQD9v-wIPBAeL.png" /><figcaption>Takeaway: Compared to random sampling, entropy sampling handles redundant data poorly while BADGE handles redundant data proficiently.</figcaption></figure><p><strong>5.
Video Explanation</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FOGgGqk3seaw&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DOGgGqk3seaw&amp;image=http%3A%2F%2Fi.ytimg.com%2Fvi%2FOGgGqk3seaw%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/036c948c936a179db863998aef7c152d/href">https://medium.com/media/036c948c936a179db863998aef7c152d/href</a></iframe><p><strong>6. Conclusion</strong></p><p>As you can see, the active learning algorithms in DISTIL show promise in greatly reducing the number of labeled data points required for your model, and DISTIL offers a wide enough range of active learning algorithms to handle your problem instance. Hence, DISTIL can save you the cost of labeling significant portions of your data, allowing you to deploy your final models sooner and saving development costs! Better yet, DISTIL is actively expanding its repertoire of active learning algorithms to ensure state-of-the-art performance. As such, if you are looking to cut down on labeling costs, DISTIL should be your go-to for getting the most out of your data.</p><p><strong>7.
Resources</strong></p><p><em>More about Active Learning &amp; DISTIL:</em></p><ul><li><a href="https://decile-research.medium.com/active-learning-strategies-distil-62ee9fc166f9">Active Learning Strategies &amp; DISTIL</a></li><li><a href="https://decile-research.medium.com/getting-started-with-distil-active-learning-ba7fafdbe6f3">Getting Started With DISTIL &amp; Active Learning</a></li></ul><p><em>YouTube Playlist:</em></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fvideoseries%3Flist%3DPLIQ2KoP-CQ5HU4hjT2S-HNewam8sEW-9c&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPLIQ2KoP-CQ5HU4hjT2S-HNewam8sEW-9c&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FOGgGqk3seaw%2Fhqdefault.jpg%3Fsqp%3D-oaymwEWCKgBEF5IWvKriqkDCQgBFQAAiEIYAQ%3D%3D%26rs%3DAOn4CLCAdQdwT8Yaqckk9Ov2BGp5SoHBTg%26days_since_epoch%3D18756&amp;key=d04bfffea46d4aeda930ec88cc64b87c&amp;type=text%2Fhtml&amp;schema=youtube" width="853" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/e99cd4629e9eece787082ec3df802888/href">https://medium.com/media/e99cd4629e9eece787082ec3df802888/href</a></iframe><p>Author:</p><p><strong><em>Nathan Beck</em></strong></p><p><a href="https://decile.org/"><strong><em>DECILE Research Group</em></strong></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=77bec5c2e864" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Getting Started With DISTIL & Active Learning]]></title>
            <link>https://decile-research.medium.com/getting-started-with-distil-active-learning-ba7fafdbe6f3?source=rss-8bf00d290651------2</link>
            <guid isPermaLink="false">https://medium.com/p/ba7fafdbe6f3</guid>
            <category><![CDATA[distil]]></category>
            <category><![CDATA[reducing-labeling-cost]]></category>
            <category><![CDATA[active-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[active-learning-toolkit]]></category>
            <dc:creator><![CDATA[DECILE]]></dc:creator>
            <pubDate>Thu, 22 Apr 2021 07:35:30 GMT</pubDate>
            <atom:updated>2021-05-09T08:55:07.196Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="DISTIL_LOGO" src="https://cdn-images-1.medium.com/max/557/1*JwzpzLPD5FC1073TWAnuUA.png" /></figure><h4>In this Article</h4><ol><li>Introduction</li><li>Incorporating Custom Models &amp; Data with DISTIL</li><li>DISTIL Workflow</li><li>Code Walk-through</li><li>Conclusion</li><li>Video Explanation</li><li>Resources</li><li>Publications</li></ol><h4><strong>1. Introduction</strong></h4><p>Distil is a toolkit in PyTorch which provides access to different active learning algorithms. Active Learning (AL) helps in reducing labeling cost and also reduces training time and resources. AL helps in selecting only the required data and experiments show that using only 10% of data for training can reach accuracy levels close to the levels reached when using the entire dataset.</p><p>This article provides a step by step explanation of how to use DISTIL along with your existing pipelines. Different active strategies supported by DISTIL are listed below:</p><ol><li>Uncertainty Sampling</li><li>Margin Sampling</li><li>Least Confidence Sampling</li><li>FASS</li><li>BADGE</li><li>GLISTER ACTIVE</li><li>CoreSets based Active Learning</li><li>Random Sampling</li><li>Submodular Sampling</li><li>Adversarial Bim</li><li>Adversarial DeepFool</li><li>Baseline Sampling</li><li>BALD</li><li>Kmeans Sampling</li></ol><p>The documentation for the same can be found at: <a href="https://decile-team-distil.readthedocs.io/en/latest/ActStrategy/distil.active_learning_strategies.html">DISTIL Documentation</a></p><h4>2. 
Incorporating Custom Models &amp; Data with DISTIL</h4><p>There are two main things that need to be incorporated into the code before using DISTIL.</p><ul><li><strong><em>Model</em></strong></li></ul><figure><img alt="DISTIL Code Snippet 1" src="https://cdn-images-1.medium.com/max/434/1*55mVLhmNy1V6J6PV-ZyHnw.png" /></figure><ol><li>The model should have a function <em>get_embedding_dim</em> which returns the number of hidden units in the last layer.</li><li>The forward function should have a boolean flag “last” where:</li></ol><p>if True: it should return the model output and the output of the second-to-last layer<br>if False: it should return only the model output.</p><ul><li><strong><em>Data Handler</em></strong></li></ul><p>Since active learning works with unlabeled data, default data handlers cannot be used. The custom data handler should have the following support:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/488/1*OnWrZCF8CldmEGjbyzh8Pw.png" /></figure><ol><li>Your data handler class should have a Boolean parameter “use_test_transform” with default value False. This parameter identifies whether the handler is used for test data.</li><li>The data handler class should have a Boolean parameter “select” with default value True:</li></ol><p>If True: it should return only X and not Y (used by active learning strategies)</p><p>If False: it should return both X and Y (used while training the model)</p><p>That’s it folks! Just a couple of changes to get your model ready for DISTIL.</p><h4><strong>3. DISTIL Workflow</strong></h4><p>Now we are ready to work with DISTIL. This section describes the step-by-step workflow of DISTIL and how active learning works.</p><p><em>Budget</em>: It is the number of points added to the training set after every iteration.
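</p><p>To make the model requirements above concrete, here is a minimal toy sketch of the expected interface. It is written with NumPy purely to illustrate the contract (a real DISTIL model would be a PyTorch nn.Module); the layer sizes and names are ours, not DISTIL’s:</p>

```python
import numpy as np

class ToyModel:
    """Toy stand-in for a PyTorch model exposing the interface DISTIL expects."""

    def __init__(self, in_dim=32, hidden_dim=16, n_classes=10, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((in_dim, hidden_dim))
        self.w2 = rng.standard_normal((hidden_dim, n_classes))
        self.hidden_dim = hidden_dim

    def get_embedding_dim(self):
        # Number of hidden units feeding the last layer.
        return self.hidden_dim

    def forward(self, x, last=False):
        emb = np.maximum(x @ self.w1, 0.0)  # second-to-last layer output (ReLU)
        out = emb @ self.w2                 # model output (logits)
        return (out, emb) if last else out
```

<p>Only the two interface points matter here: get_embedding_dim reports the size of the second-to-last layer, and forward returns that embedding alongside the output when last is True. The budget, in turn, fixes how many points each selection round may request.</p><p>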
This needs to be decided before the training is initiated.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kPYhMOiaG722JYTwJwLiug.png" /></figure><p>The yellow boxes in the flow chart denote the initial loop, which runs only once at the start of the process. Let the budget be denoted by n.</p><ol><li>There is a set of unlabeled data to be used for training the model.</li><li>First, n random points are selected for the initial round of training.</li><li>These points need to be manually labeled.</li><li>The model is trained on this labeled data.</li><li>After the training is completed, DISTIL selects the next set of n data points based on hypothesized labels, gradient embeddings, etc., depending on the active learning algorithm chosen.</li><li>These newly selected points are labeled and added to the training data.</li><li>The model is trained again with the new training data.</li><li>Steps 5–7 are repeated until the model reaches the desired test accuracy or the training set reaches the decided threshold.</li></ol><h4><strong>4. Code Walk-through</strong></h4><p>Based on the above steps, let’s go through the code step by step based on the example provided here: <a href="https://github.com/decile-team/distil/blob/main/examples/example.py">DISTIL Example Code</a>.</p><p><em>Step 1:</em></p><figure><img alt="DISTIL Code Snippet 3" src="https://cdn-images-1.medium.com/max/505/0*fS_VkDZj5lT7FlTA" /></figure><p>Loading the unlabeled data. Lines 60–63 load the data, which is in libsvm format. This is just an example; you can load data of your choice.</p><p><em>Step 2:</em></p><figure><img alt="DISTIL Code Snippet 4" src="https://cdn-images-1.medium.com/max/498/0*jD5g1F9WUzARrB9k" /></figure><p>In lines 96–99, the first set of random points is selected for the initial round of training.
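</p><p>That initial random split can be sketched as follows (a NumPy illustration with made-up sizes and variable names, not the example’s actual code):</p>

```python
import numpy as np

# Hypothetical sizes: 1,000 unlabeled points, an initial budget of 100.
n_unlabeled, initial_budget = 1000, 100
X_unlabeled = np.random.default_rng(0).standard_normal((n_unlabeled, 32))

# Draw the first round of points uniformly at random (no trained model exists yet).
rng = np.random.default_rng(42)
start_idxs = rng.choice(n_unlabeled, size=initial_budget, replace=False)

X_train = X_unlabeled[start_idxs]                    # initial labeled pool
X_pool = np.delete(X_unlabeled, start_idxs, axis=0)  # remaining unlabeled pool
```

<p>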
In line 100, the points selected for training are removed from the unlabeled set.</p><p><em>Step 3:</em></p><figure><img alt="DISTIL Code Snippet 5" src="https://cdn-images-1.medium.com/max/435/0*-q_ulI9sgNqRjjHR" /></figure><p><em>Note: Here we have assumed we already have the labels of the dataset. In a real scenario, these labels won’t be present, and the points will need to be labeled manually.</em></p><p>In line 102, the selected data points are labeled.</p><p><em>Step 4:</em></p><figure><img alt="DISTIL Code Snippet 6" src="https://cdn-images-1.medium.com/max/819/0*AQKCgngKARAzzhhm" /></figure><p>In lines 108–110, a DISTIL object is instantiated with the GLISTER strategy. DISTIL provides support for various active learning strategies such as FASS, Margin Sampling, BADGE, BALD, etc., and the strategy is selected in this step.</p><figure><img alt="DISTIL Code Snippet 7" src="https://cdn-images-1.medium.com/max/531/0*WzaoIJlck12wwNSN" /></figure><p>In lines 120–122, the model is trained on the current labeled training set. DISTIL focuses on decoupling training from active learning: the training loop is completely in the hands of the user, with no restrictions on the way the model is trained. Thus, after training the model, DISTIL needs to be made aware of the current model state. In line 123, the model state in DISTIL is updated.</p><p><em>Step 5:</em></p><figure><img alt="DISTIL Code Snippet 8" src="https://cdn-images-1.medium.com/max/409/0*K9T6v0tG_Q6jom9t" /></figure><p>In line 135, DISTIL is called to choose a new set of points using the select function, passing the budget, i.e., the number of points to be selected. Since labeling the newly selected points might take some time and the loop is not continuous, the state of DISTIL needs to be saved before the model is trained on the new training set.
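</p><p>Schematically, this selection-and-checkpoint step looks like the sketch below. The strategy object is a toy least-confidence sampler standing in for a DISTIL strategy: the method names select and save_state mirror the example, but the internals are ours and the real signatures may differ:</p>

```python
import numpy as np

class ToyStrategy:
    """Toy stand-in for a DISTIL strategy: least-confidence sampling."""

    def __init__(self, probs):
        self.probs = probs  # predicted class probabilities for the unlabeled pool

    def select(self, budget):
        # Confidence = max class probability; pick the `budget` least confident points.
        confidence = self.probs.max(axis=1)
        return np.argsort(confidence)[:budget].tolist()

    def save_state(self, path):
        # DISTIL persists its internal state here; the toy just saves its inputs.
        np.save(path, self.probs)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=500)  # fake softmax outputs for 500 points
strategy = ToyStrategy(probs)

budget = 50
idxs = strategy.select(budget)           # points to send out for labeling
strategy.save_state("distil_state.npy")  # checkpoint before the slow labeling step
```

<p>While the selected points wait for labels, the saved state lets the loop resume later without repeating the selection.</p><p>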
In line 137, the save_state function is called, which saves the current state of DISTIL so that it can be loaded again before starting the next training iteration.</p><p><em>Step 6:</em></p><figure><img alt="DISTIL Code Snippet 9" src="https://cdn-images-1.medium.com/max/529/0*qPM_biUhmwl5H6NU" /></figure><p>In lines 140–141, the newly selected data points are added to the training set and deleted from the unlabeled data pool. In lines 144–145, the newly chosen points are labeled. As explained in the step above, since labeling might take some time, the DISTIL state is saved beforehand. In line 151, the previous DISTIL state is loaded, and since training and active learning are decoupled, in lines 152–153 the new training data is updated both in DISTIL and in the training class.</p><p><em>Step 7:</em></p><figure><img alt="DISTIL Code Snippet 10" src="https://cdn-images-1.medium.com/max/323/0*MIQfpYpAtqk3tJVe" /></figure><p>In line 155, the model is trained on the updated training set, and in line 156, the new model state is updated in DISTIL using the update_model method of the DISTIL object.</p><p><em>Step 8:</em></p><figure><img alt="DISTIL Code Snippet 11" src="https://cdn-images-1.medium.com/max/586/0*ne4I9cDuuLE5z1Ag" /></figure><p>Steps 5–7 are repeated until a stopping criterion is met. In example.py, the stopping criterion is a fixed number of rounds or test accuracy crossing 98%.</p><h4><strong>5.
Conclusion</strong></h4><p>Thus, DISTIL can be easily incorporated into your code, as it focuses on the following principles:</p><ol><li>Minimal changes to add it to the existing training structure.</li><li>Independence from the training strategy used.</li><li>Achieving similar test accuracy with much less training data.</li><li>A huge reduction in labeling cost and time.</li><li>Access to various active learning strategies with just one line of code.</li></ol><p>For the latest discussions, join the <a href="https://groups.google.com/forum/#!forum/Decile_DISTIL_Dev/join">Decile_DISTIL_Dev</a> group.</p><p>You can also refer to the video below for a DISTIL tutorial based on this blog.</p><h4>6. Video Explanation</h4><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FnnNvMBUJdwc%3Flist%3DPLIQ2KoP-CQ5HU4hjT2S-HNewam8sEW-9c&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DnnNvMBUJdwc&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FnnNvMBUJdwc%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/208249235ecb5f082cf5026dc3cc3e36/href">https://medium.com/media/208249235ecb5f082cf5026dc3cc3e36/href</a></iframe><h4><strong>7.
Resources</strong></h4><p><em>DISTIL Documentation.</em></p><p><a href="https://decile-team-distil.readthedocs.io/en/latest/">https://decile-team-distil.readthedocs.io/en/latest/</a></p><p><em>Code Repository:</em></p><p><a href="https://github.com/decile-team/distil">https://github.com/decile-team/distil</a></p><p><em>Colab Examples:</em></p><p><a href="https://github.com/decile-team/distil#demo-notebooks">https://github.com/decile-team/distil#demo-notebooks</a></p><p><em>Complete Code to the example discussed in the article:</em></p><p><a href="https://github.com/decile-team/distil/blob/main/examples/example.py">decile-team/distil</a></p><p><em>More about Active Learning &amp; DISTIL:</em></p><ul><li><a href="https://decile-research.medium.com/cut-down-on-labeling-costs-with-distil-77bec5c2e864">Cut Down on Labeling Costs with DISTIL</a></li><li><a href="https://decile-research.medium.com/active-learning-strategies-distil-62ee9fc166f9">Active Learning Strategies &amp; DISTIL</a></li></ul><p><em>YouTube Playlist:</em></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fvideoseries%3Flist%3DPLIQ2KoP-CQ5HU4hjT2S-HNewam8sEW-9c&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPLIQ2KoP-CQ5HU4hjT2S-HNewam8sEW-9c&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FOGgGqk3seaw%2Fhqdefault.jpg%3Fsqp%3D-oaymwEWCKgBEF5IWvKriqkDCQgBFQAAiEIYAQ%3D%3D%26rs%3DAOn4CLCAdQdwT8Yaqckk9Ov2BGp5SoHBTg%26days_since_epoch%3D18756&amp;key=d04bfffea46d4aeda930ec88cc64b87c&amp;type=text%2Fhtml&amp;schema=youtube" width="853" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/e99cd4629e9eece787082ec3df802888/href">https://medium.com/media/e99cd4629e9eece787082ec3df802888/href</a></iframe><p><strong>8. Publications</strong></p><p>[1] Settles, Burr. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2009.</p><p>[2] Wang, Dan, and Yi Shang. 
“A new active labeling method for deep learning.” 2014 International Joint Conference on Neural Networks (IJCNN). IEEE, 2014.</p><p>[3] Kai Wei, Rishabh Iyer, Jeff Bilmes. Submodularity in data subset selection and active learning. International Conference on Machine Learning (ICML), 2015.</p><p>[4] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. CoRR, 2019. URL: <a href="http://arxiv.org/abs/1906.03671">http://arxiv.org/abs/1906.03671</a>, arXiv:1906.03671.</p><p>[5] Sener, Ozan, and Silvio Savarese. “Active learning for convolutional neural networks: A core-set approach.” ICLR 2018.</p><p>[6] Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning. 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.</p><p>[7] Vishal Kaushal, Rishabh Iyer, Suraj Kothiwade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan. Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision. 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, Hawaii, USA.</p><p>[8] Wei, Kai, et al. “Submodular subset selection for large-scale speech training data.” 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.</p><p>Author:</p><p><a href="https://www.linkedin.com/in/apurvadani98/"><strong><em>Apurva Dani</em></strong></a></p><p><em>AI Research &amp; Development</em></p><p><a href="https://decile.org/"><strong>DECILE Research Group</strong></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ba7fafdbe6f3" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>