<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Bharath Raj on Medium]]></title>
        <description><![CDATA[Stories by Bharath Raj on Medium]]></description>
        <link>https://medium.com/@thatbrguy?source=rss-7d6e83a807b8------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*0iNUvZ5Z4MC85pAS.</url>
            <title>Stories by Bharath Raj on Medium</title>
            <link>https://medium.com/@thatbrguy?source=rss-7d6e83a807b8------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 04 Apr 2026 00:40:55 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@thatbrguy/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[How to deploy ONNX models on NVIDIA Jetson Nano using DeepStream]]></title>
            <link>https://medium.com/data-science/how-to-deploy-onnx-models-on-nvidia-jetson-nano-using-deepstream-b2872b99a031?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/b2872b99a031</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[iot]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[video-analytics]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Thu, 05 Dec 2019 05:38:19 GMT</pubDate>
            <atom:updated>2019-12-05T05:38:19.412Z</atom:updated>
            <content:encoded><![CDATA[<h4>An experiment to test the multi-stream neural network inference performance of DeepStream on Jetson Nano.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u8j1T5880DsonQJRSK_jLA.jpeg" /><figcaption>Jetson Nano. (<a href="https://developer.nvidia.com/embedded/jetson-nano-developer-kit">Source</a>)</figcaption></figure><p>Deploying complex deep learning models onto small embedded devices is challenging. Even with hardware optimized for deep learning such as the <a href="https://developer.nvidia.com/embedded/jetson-nano-developer-kit"><strong>Jetson Nano</strong></a> and inference optimization tools such as <a href="https://developer.nvidia.com/tensorrt"><strong>TensorRT</strong></a>, bottlenecks can still present themselves in the I/O pipeline. These bottlenecks can compound if the model has to handle multiple input and output streams. Wouldn’t it be great to have a tool that can take care of all bottlenecks in an end-to-end fashion?</p><h3>Say Hello to DeepStream</h3><p>It turns out there is an SDK that attempts to mitigate this problem. <a href="https://developer.nvidia.com/deepstream-sdk">DeepStream</a> is an SDK optimized for NVIDIA Jetson and T4 platforms that provides a seamless end-to-end service to convert raw streaming data into actionable insights. It is built on top of the <a href="https://gstreamer.freedesktop.org/">GStreamer</a> framework. Here, “raw streaming data” is typically continuous (and multiple) video streams and “actionable insights” are the final outputs of your deep learning or other analytics algorithms.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SL0-s417iMMXcgMOxvRXCQ.png" /><figcaption>The DeepStream pipeline. 
(<a href="https://developer.nvidia.com/deepstream-sdk">Source</a>)</figcaption></figure><p>DeepStream SDK uses its custom <a href="https://developer.nvidia.com/deepstream-sdk#plugins">GStreamer Plugins</a> to provide various functionalities. Notably, it has plugins for <a href="https://developer.nvidia.com/tensorrt">TensorRT</a> based inference and object tracking. The image below lists the capabilities of these plugins. For an exhaustive technical guide about the plugins, you can refer to the <a href="https://docs.nvidia.com/metropolis/deepstream/plugin-manual/index.html">Plugin Manual</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D9uLhdeZVagy0U5QSKG6gQ.png" /><figcaption>Plugins available in DeepStream. (<a href="https://developer.nvidia.com/deepstream-sdk#plugins">Source</a>)</figcaption></figure><p>One feature I particularly liked about DeepStream is that it takes care of the entire I/O processing in a pipelined fashion. We can also stack multiple deep learning algorithms to process information asynchronously. This allows you to increase throughput without the hassle of manually creating and managing a multiprocessing system design.</p><p>The best part is, for some supported applications such as object detection, tracking, classification or semantic segmentation, DeepStream is easy to use! For such an application, as long as you have a deep learning model in a compatible format, you can launch DeepStream by just setting a few parameters in some text files.</p><p>In this blog, we will design and run an experiment on DeepStream to test its features and to see if it is easy to use on the Jetson Nano.</p><h3>The Experiment</h3><p>To test the features of DeepStream, let&#39;s deploy a pre-trained object detection algorithm on the <a href="https://developer.nvidia.com/embedded/jetson-nano-developer-kit">Jetson Nano</a>. 
This is an ideal experiment for a couple of reasons:</p><ul><li>DeepStream is optimized for inference on NVIDIA T4 and Jetson platforms.</li><li>DeepStream has a plugin for inference using TensorRT that supports object detection. Moreover, it automatically converts models in the ONNX format to an optimized TensorRT engine.</li><li>It has plugins that support multiple streaming inputs. It also has plugins to save the output in multiple formats.</li></ul><p>The ONNX <a href="https://github.com/onnx/models">model zoo</a> has a bunch of pre-trained object detection models. I chose the <strong>Tiny YOLO v2</strong> <a href="https://github.com/onnx/models/tree/master/vision/object_detection_segmentation/tiny_yolov2">model</a> from the zoo as it was readily compatible with DeepStream and was also light enough to run fast on the Jetson Nano.</p><blockquote><strong>Note:</strong> I did try using the SSD and YOLO v3 models from the <a href="https://github.com/onnx/models">zoo</a>. But there were some compatibility issues. These issues are discussed in my <a href="https://github.com/thatbrguy/Deep-Stream-ONNX/blob/master/FAQ.md">GitHub repository</a>, along with tips to verify and handle such cases. I ended up using Tiny YOLO v2 as it was readily compatible without any additional effort.</blockquote><p>Now, the features we want to investigate are as follows:</p><ol><li><strong>Multiple input streams:</strong> Run DeepStream to perform inference on multiple video streams simultaneously. Specifically, we will try using up to 4 video streams.</li><li><strong>Multiple output sinks: </strong>Display the result on screen and stream it using RTSP. The stream will be accessed by another device connected to the network.</li></ol><p>The performance (Frames per Second, FPS) and ease of use will be evaluated in the experiment. The next few sections will guide you through how to set up DeepStream on Jetson Nano to run this experiment. 
All code used for this experiment is available on my <a href="https://github.com/thatbrguy/Deep-Stream-ONNX">GitHub repository</a>. If you are just curious about how it turned out, feel free to skip to the <strong>results</strong> section.</p><h3>Getting Started</h3><p>In this section, we will walk through some instructions to set things up for our experiment.</p><h4>Part 1: Setting up your Jetson Nano</h4><p>Follow the instructions on the <a href="https://docs.nvidia.com/metropolis/deepstream/dev-guide/index.html">Getting Started With Jetson Nano Developer Kit</a> to set up and boot your Jetson Nano. In case you face some issues with the setup, I would highly recommend following <a href="https://www.hackster.io/news/getting-started-with-the-nvidia-jetson-nano-developer-kit-43aa7c298797">these</a> <a href="https://medium.com/@ageitgey/build-a-hardware-based-face-recognition-system-for-150-with-the-nvidia-jetson-nano-and-python-a25cb8c891fd">resources</a>.</p><p>I would like to highlight some pointers that might save you some trouble:</p><ul><li>It is recommended to use at least a 32GB MicroSD card (I used 64GB).</li><li>You need a wired ethernet connection. To connect your Jetson Nano over WiFi instead, you need a dongle such as the Edimax EW-7811Un.</li><li>You need a monitor that directly accepts HDMI input. I could not use my VGA monitor with a VGA-HDMI adapter.</li></ul><h4>Part 2: Installing the DeepStream SDK</h4><p>Now that you have your Jetson Nano up and running, we can install DeepStream. Nvidia has put together the <a href="https://docs.nvidia.com/metropolis/deepstream/dev-guide/index.html">DeepStream quick start guide</a> where you can follow the instructions under the section <strong>Jetson Setup</strong>.</p><p>Before you go ahead and install DeepStream using the above link, I would like to highlight a few points from my setup experience:</p><ul><li>The setup would suggest you install Jetpack using the Nvidia SDK Manager. 
I skipped that step as I realized the OS image from Part-1 (above) had most of the required dependencies by default.</li><li>In the sub-section “To install the DeepStream SDK” of the quick start guide, I used Method-2.</li></ul><p>After installing DeepStream and boosting the clocks (as mentioned in the guide), we can run one of their samples to verify that the installation is done properly. Move (<strong>cd</strong>) into your DeepStream installation folder and run the following command:</p><pre>deepstream-app -c ./samples/configs/deepstream-app/source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt</pre><p>On execution, you should see something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6_2ILhsdpOEgubZ1SbBAzw.png" /><figcaption>Output on executing DeepStream using the sample configuration file.</figcaption></figure><p>If you see something similar, congrats! You can play around with more samples if you would like. The guide has a section named “Reference Application Source Details” which provides a description of the samples.</p><h4>Part 3: Setting up the Experiment</h4><p>Now that you have installed and tested DeepStream, we can go ahead with our experiment. I have bundled up all the files required for the experiment in my <a href="https://github.com/thatbrguy/Deep-Stream-ONNX">GitHub repository</a>. You can follow the step-by-step setup instructions in the repository’s <a href="https://github.com/thatbrguy/Deep-Stream-ONNX/blob/master/README.md">readme</a> file.</p><p>Before moving on with the experiment, if you have not used GStreamer before, it would be worth your time to go through their <a href="https://gstreamer.freedesktop.org/documentation/application-development/introduction/basics.html?gi-language=c"><strong>foundations</strong></a> page. 
This helps with a better understanding of some of the jargon used in DeepStream’s documentation.</p><h3>Interfacing your custom ONNX model with DeepStream</h3><p>In this section, we will explore how to interface the output of our ONNX model with DeepStream. More specifically, we will walk through the process of creating a custom processing function in <strong>C++</strong> to extract bounding box information from the output of the ONNX model and provide it to DeepStream.</p><h4>Part 1: Understanding the Output of Tiny YOLOv2</h4><p>The ONNX model outputs a tensor of shape <strong>(125, 13, 13)</strong> in the <strong>channels-first</strong> format. However, when used with DeepStream, we obtain the <strong>flattened </strong>version of the tensor, which has shape <strong>(21125)</strong>. Our goal is to manually extract the bounding box information from this flattened tensor.</p><p>Let us first try to visually understand the tensor produced by the ONNX model. Consider the output tensor to be a cuboid of dimensions <strong>(B, H, W)</strong>, where in our case B=<strong>125</strong>, H=<strong>13</strong>, W=<strong>13</strong>. We can consider the axes X, Y and B along the width (W), height (H) and depth (B) respectively. Now, each location in the XY plane represents a single grid cell.</p><p>Let us visualize a single grid cell <strong>(X=0, Y=0)</strong>. There are 125 values along the depth axis (B) for this given (X,Y) location. Let us rearrange the 125 values in groups of 25 as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZfE5tog6D7Xs319zRUJFWA.png" /><figcaption>Figure A: Interpreting the meaning of the 125 b-values along the B-axis for the grid cell (X = 0, Y = 0).</figcaption></figure><p>As we see here, each contiguous group of 25 values belongs to a separate bounding box. Among each set of 25 values, the first 5 values are the bounding box parameters and the last 20 values are class probabilities. 
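</p><p>This grouping can be verified quickly with NumPy. The sketch below uses a random tensor in place of the real model output (the values are meaningless; only the indexing matters):</p>

```python
import numpy as np

# Stand-in for the real (125, 13, 13) channels-first output tensor.
out = np.random.rand(125, 13, 13).astype(np.float32)

# Split the 125 channels into 5 boxes x 25 values for every grid cell.
boxes = out.reshape(5, 25, 13, 13)

# For grid cell (X=0, Y=0): box 0 occupies channels 0..24.
cell = out[:, 0, 0]        # all 125 values along the B-axis
box0 = boxes[0, :, 0, 0]   # the first 25 of them
assert np.array_equal(cell[:25], box0)

tx, ty, tw, th, tc = box0[:5]  # the 5 bounding box parameters
class_scores = box0[5:]        # the 20 class probabilities
print(class_scores.shape)      # (20,)
```

<p>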
Using this, we can extract the coordinates and confidence score for each of the 5 bounding boxes as shown here:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/585/1*7_z6sinutiwfddCFtp2xtg.png" /><figcaption>Formulae for extracting the bounding box parameters. (<a href="https://pjreddie.com/media/files/papers/YOLO9000.pdf">Source</a>)</figcaption></figure><p>Note that we have only performed this operation at one grid cell (X=0, Y=0). We must iterate over all combinations of X and Y to find the 5 bounding box predictions at each grid cell.</p><p>Now that we have a visual idea of how the information is stored, let us try to extract it using indexing. After <strong>flattening</strong> the output tensor, we get a single array in which information is stored as shown in the image below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6maY2VB9rHD5oaWj4Qr9Lw.png" /><figcaption>Figure B: Flattened representation of the output tensor.</figcaption></figure><p>The flattened array has 125 * 13 * 13 = <strong>21125</strong> elements. As shown above, each location in the array corresponds to the indices <strong>(b, y, x)</strong>. We can observe that for a given <strong>(y, x)</strong> value, the corresponding <strong>b</strong> values are separated by 13 * 13 = <strong>169</strong>.</p><p>The following code snippet in <strong>Python</strong> shows how we can obtain the locations of <strong>b</strong> values corresponding to each of the 5 bounding boxes in a given <strong>(y, x)</strong> location. 
Do note that, as shown in <strong>Figure A</strong>, there are 25 <strong>b</strong> values for <strong>each </strong>bounding box for a given <strong>(y, x)</strong> location.</p><pre>## Let <strong>arr</strong> be the flattened array.<br>## The array <strong>values</strong> contains the value of <strong>arr</strong> at the 25 b_values<br>## per bbox,x,y combination.<br><strong>num_anchors</strong> = 5<br><strong>num_classes</strong> = 20<br><strong>xy_offset</strong> = y * 13 + x<br><strong>b_offset</strong> = 13 * 13<br><strong>bbox_offset</strong> = 5 + num_classes<br>for <strong>bbox</strong> in range(<strong>num_anchors</strong>):<br>  <strong>values</strong> = []<br>  for <strong>b</strong> in range(<strong>bbox_offset</strong>):<br>     value = <strong>arr</strong>[xy_offset + b_offset * (b + bbox * bbox_offset)]<br>     values.append(value)</pre><p>All that is left to do is to write the <strong>C++</strong> equivalent of the same.</p><h4>Part 2: Writing the Bounding Box Parsing Function</h4><p>Now that we understand how the output is stored and can be extracted, we need to write a function in C++ to do the same. DeepStream expects a function with arguments as shown below:</p><pre>extern &quot;C&quot; bool <strong>NvDsInferParseCustomYoloV2Tiny</strong>(<br>    std::vector&lt;NvDsInferLayerInfo&gt; const&amp; <strong>outputLayersInfo</strong>,<br>    NvDsInferNetworkInfo const&amp; <strong>networkInfo</strong>,<br>    NvDsInferParseDetectionParams const&amp; <strong>detectionParams</strong>,<br>    std::vector&lt;NvDsInferParseObjectInfo&gt;&amp; <strong>objectList</strong><br>);</pre><p>In the above function prototype, <strong>outputLayersInfo</strong> is a <strong>std::vector</strong> containing information and data about each output layer of our ONNX model. In our case, since we have just one output layer, we can access the data using <strong>outputLayersInfo[0].buffer</strong>. 
The variable <strong>networkInfo</strong> has information about the height and width expected by the model, and <strong>detectionParams</strong> has information about some configurations such as <strong>numClassesConfigured</strong>.</p><p>The variable <strong>objectList</strong> should be updated with a <strong>std::vector</strong> of bounding box information stored as objects of type <strong>NvDsInferParseObjectInfo</strong> at every call of the function. Since the variable is passed by reference, we don’t need to return it; the changes will be reflected at the source. However, the function must return <strong>true</strong> at the end of its execution.</p><p>For our use case, we create <strong>NvDsInferParseCustomYoloV2Tiny</strong> such that it first decodes the output of the ONNX model as described in Part-1 of this section. For each bounding box, we create an object of type <strong>NvDsInferParseObjectInfo</strong> to store its information. We then apply <a href="https://www.coursera.org/lecture/convolutional-neural-networks/non-max-suppression-dvrjH">non-maximum suppression</a> to remove duplicate bounding box detections of the same object. Finally, we add the resulting bounding boxes to the <strong>objectList</strong> vector.</p><p>My <a href="https://github.com/thatbrguy/Deep-Stream-ONNX">GitHub repository</a> has <strong>nvdsparsebbox_tiny_yolo.cpp</strong> inside the directory <strong>custom_bbox_parser</strong> with the function already written for you. The <strong>flowchart</strong> below explains the flow of logic within the <a href="https://github.com/thatbrguy/Deep-Stream-ONNX/blob/master/custom_bbox_parser/nvdsparsebbox_tiny_yolo.cpp">file</a>. 
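</p><p>The non-maximum suppression step can be sketched in a few lines of Python (a simplified greedy NMS on corner-format boxes; the 0.45 threshold is an illustrative value, and the C++ file implements its own version):</p>

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) tuples.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, thresh=0.45):
    # Greedily keep the highest-scoring box, drop boxes overlapping it.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep

# Two overlapping detections and one distant one.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```

<p>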
The code may seem long, but that is only because it is heavily documented and commented for your understanding!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/957/1*R9UV41Cx0Ilr_gQ4LnG9VA.png" /><figcaption>Flowchart approximately describing the flow of logic in the code file.</figcaption></figure><h4>Part 3: Compiling the Function</h4><p>All that’s left now is to compile the function into a <strong>.so</strong> file so that DeepStream can load and use it. Before you compile it, you may need to set some variables inside the Makefile. You can refer to step 4 of the ReadMe in my <a href="https://github.com/thatbrguy/Deep-Stream-ONNX">GitHub repository</a> for instructions. Once that is done, <strong>cd</strong> into the GitHub repository and run the following command:</p><pre>make -C custom_bbox_parser</pre><h3>Setting the Configuration Files</h3><p>The good news is that most of the heavy lifting is done. All that is left is to set up some configuration files which will tell DeepStream how to run the experiments. A configuration file has a set of “groups”, each of which has a set of “properties” that are written in the <a href="https://specifications.freedesktop.org/desktop-entry-spec/latest/">key-file format</a>.</p><p>For our experiment, we need to set up two configuration files. In this section, we will explore some important properties within these configuration files.</p><h4>Part 1: Configuration file for Tiny YOLOv2</h4><p>Our ONNX model is used by the <strong>Gst-Nvinfer</strong> plugin of DeepStream. We need to set up some properties to tell the plugin information such as the location of our ONNX model, the location of our compiled bounding box parser and so on.</p><p>In the GitHub repository, the configuration file named <strong>config_infer_custom_yolo.txt</strong> is already set up for our experiment. Comments are given in the file with reasoning for each property setting. 
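</p><p>To give a flavor of the key-file format, a few representative entries of such a Gst-nvinfer configuration might look like the sketch below (the file paths and values here are illustrative, not copied from the repository):</p>

```ini
[property]
# Model to load; DeepStream builds and caches a TensorRT engine from it.
onnx-file=model/tiny_yolov2.onnx
model-engine-file=model/tiny_yolov2.onnx_b1_fp16.engine
# Tiny YOLO v2 predicts 20 classes; network-mode=2 selects FP16.
num-detected-classes=20
network-mode=2
# The custom bounding box parser compiled earlier.
parse-bbox-func-name=NvDsInferParseCustomYoloV2Tiny
custom-lib-path=custom_bbox_parser/libnvdsinfer_custom_bbox_tiny_yolo.so
```

<p>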
For a detailed list of all the supported properties, check out this <a href="https://docs.nvidia.com/metropolis/deepstream/plugin-manual/index.html#page/DeepStream_Plugin_Manual%2Fdeepstream_plugin_details.02.01.html%23wwpID0E0IZ0HA">link</a>.</p><p>Some interesting properties that we have not used are the “net-scale-factor” and the “offset” properties. They essentially scale the input (x) using the formula: net_scale_factor * (x - mean). We did not use those properties as our network directly takes the unscaled image as the input.</p><h4>Part 2: Configuration file for DeepStream</h4><p>We also need to set a configuration file for DeepStream to enable and configure the various plugins that it will use. As mentioned before, the GitHub repository contains the configuration file <strong>deepstream_app_custom_yolo.txt</strong>, which is already set up for our experiment.</p><p>Unlike the previous part, this configuration has a lot of groups such as “osd” (On Screen Display), “primary-gie” (Primary GPU Inference Engine) and so on. This <a href="https://docs.nvidia.com/metropolis/deepstream/dev-guide/index.html#page/DeepStream_Development_Guide%2Fdeepstream_app_config.3.2.html%23">link</a> has information about all possible groups that can be configured and the properties supported for each group.</p><p>For our experiment, we define a single source group (source0) and three sink groups (sink0, sink1 and sink2). The single source group is responsible for reading <strong>four</strong> input video streams in parallel. The three sink groups are used for displaying output on screen, streaming output using <a href="https://en.wikipedia.org/wiki/Real_Time_Streaming_Protocol">RTSP</a> and saving output to disk, respectively. We provide the path of the Tiny YOLOv2 configuration file in the primary-gie group. We also set the tiled-display and osd groups to control how the output appears on screen.</p><h3>Running DeepStream</h3><p>This is the easiest part. 
All you have to do is run the following command:</p><pre>deepstream-app -c ./config/deepstream_app_custom_yolo.txt</pre><p>Launching DeepStream for the first time takes a while, as the ONNX model needs to be converted to a TensorRT engine. It is recommended to close memory-hungry apps such as Chromium during this process. Once the engine file is created, subsequent launches will be fast, provided the path of the engine file is defined in the Tiny YOLOv2 configuration file.</p><h3>Results</h3><p>On running DeepStream, once the engine file is created, we are presented with a 2x2 tiled display as shown in the video below. Each unit in the tiled display corresponds to a different streaming input. As expected, all <strong>four different inputs</strong> are processed simultaneously.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FbeX7RqX_FFo%3Ffeature%3Doembed&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DbeX7RqX_FFo&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FbeX7RqX_FFo%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/486e46fd52126c761103ee84ee45235e/href">https://medium.com/media/486e46fd52126c761103ee84ee45235e/href</a></iframe><p>Since we also enabled RTSP, we can access the stream at rtsp://localhost:8554/ds-test. I used VLC and the RTSP address (after replacing localhost with the IP address of my Jetson Nano) to access the stream on my laptop, which was connected to the same network. Note that another sink is also used to save the output stream to disk. 
It is impressive to note that the console periodically logs an FPS of nearly <strong>6.7</strong> per video stream!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/606/1*J224jhQU4TVhbD_5TASn1w.png" /><figcaption>FPS per video stream while simultaneously using four video streams.</figcaption></figure><p>If we had a single input stream, our FPS should ideally be about four times that of this four-video case. I tested this out by changing the values in the configuration files and launching DeepStream once again. As expected, we get nearly <strong>27 FPS</strong> for the single video stream! The performance is impressive considering it is still sending output to three different sinks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/191/1*8yMyc8kBqtm96QHFWurAng.png" /><figcaption>FPS while using a single video stream.</figcaption></figure><p>We do, however, note that the detection accuracy of Tiny YOLOv2 is not as phenomenal as the FPS. This is primarily because the model was optimized for speed at the cost of some accuracy. Moreover, the people in the video had blurred faces, and the model might not have encountered this blurriness during training. Hence, the model might have faced additional difficulty for that class.</p><h3>Verdict and Thoughts</h3><p>DeepStream is blazingly fast. Even though Tiny YOLOv2 is optimized for speed rather than accuracy, a stable high-FPS performance while providing features such as seamless multi-stream processing and an RTSP stream is something to be appreciated.</p><p>However, using DeepStream may not be straightforward, especially if your model is not completely compatible with TensorRT. In such cases, manually writing your own TensorRT layers might be a more viable (albeit tedious) option. 
Moreover, readily available ONNX models may have an opset version higher than what is currently accepted by DeepStream.</p><p>Nevertheless, I do feel that the functionality offered by DeepStream is worth the effort. I would recommend giving it a shot by replicating my <a href="https://github.com/thatbrguy/Deep-Stream-ONNX">experiment</a>!</p><hr><p><a href="https://medium.com/data-science/how-to-deploy-onnx-models-on-nvidia-jetson-nano-using-deepstream-b2872b99a031">How to deploy ONNX models on NVIDIA Jetson Nano using DeepStream</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Super Resolution using Deep Learning]]></title>
            <link>https://medium.com/beyondminds/an-introduction-to-super-resolution-using-deep-learning-f60aff9a499d?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/f60aff9a499d</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[super-resolution]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Mon, 01 Jul 2019 17:17:57 GMT</pubDate>
            <atom:updated>2019-07-01T17:17:57.667Z</atom:updated>
            <content:encoded><![CDATA[<h4>An elaborate discussion on the various Components, Loss Functions and Metrics used for Super Resolution using Deep Learning.</h4><p><em>Written by </em><a href="https://medium.com/u/7d6e83a807b8"><em>Bharath Raj</em></a><em> with feedback from </em><a href="https://www.linkedin.com/in/yoni-osin-41791aa5/"><em>Yoni Osin</em></a><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CuhKY2XQHitDB5vQjuJZOQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@jeremythomasphoto?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Jeremy Thomas</a> on <a href="https://unsplash.com/search/photos/space?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><h3>Introduction</h3><p>Super Resolution is the process of recovering a High Resolution (HR) image from a given Low Resolution (LR) image. An image may have a “lower resolution” due to a smaller spatial resolution (i.e. size) or as a result of degradation (such as blurring). We can relate the HR and LR images through the following equation: <strong>LR = degradation(HR)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5_BEtRwVWKFwwbNOX0pZ1g.jpeg" /><figcaption>A low resolution image kept beside its high resolution version. (Photo by <a href="https://unsplash.com/@maviccbr?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Jarrad Horne</a> on <a href="https://unsplash.com/search/photos/beach-hut?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>)</figcaption></figure><p>Clearly, on applying a degradation function, we obtain the LR image from the HR image. But can we do the inverse? In the ideal case, yes! If we know the exact degradation function, by applying its inverse to the LR image, we can recover the HR image.</p><p>But therein lies the problem. 
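</p><p>The forward direction is easy to simulate. Below is a minimal NumPy sketch of one possible degradation, a box blur followed by subsampling, where the kernel size and scale factor are arbitrary illustrative choices:</p>

```python
import numpy as np

def degrade(hr, factor=2, k=3):
    """One possible degradation: k x k box blur, then subsample by `factor`."""
    h, w = hr.shape
    pad = k // 2
    padded = np.pad(hr, pad, mode="edge")
    blurred = np.zeros_like(hr)
    for dy in range(k):          # accumulate the k x k neighborhood
        for dx in range(k):
            blurred += padded[dy:dy + h, dx:dx + w]
    blurred /= k * k             # average the sum to get a box blur
    return blurred[::factor, ::factor]

hr = np.random.rand(64, 64)      # stand-in for a real HR image
lr = degrade(hr)
print(lr.shape)  # (32, 32)
```

<p>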
We usually do not know the degradation function beforehand. Directly estimating the inverse degradation function is an <a href="https://en.wikipedia.org/wiki/Well-posed_problem">ill-posed</a> problem. In spite of this, Deep Learning techniques have proven to be effective for Super Resolution.</p><p>This blog primarily focuses on providing an introduction to performing Super Resolution using Deep Learning with supervised training methods. Some important loss functions and metrics are also discussed. A lot of the content is derived from this <a href="https://arxiv.org/abs/1902.06068">literature review</a>, which the reader can refer to.</p><h3>Supervised Methods</h3><p>As mentioned before, deep learning can be used to estimate the High Resolution (HR) image given a Low Resolution (LR) image. By using the HR image as a target (or ground-truth) and the LR image as an input, we can treat this as a supervised learning problem.</p><p>In this section, we group various deep learning approaches according to how their convolution layers are organized. Before we move on to the groups, a primer on data preparation and types of convolutions is presented. Loss functions used to optimize the model are presented separately towards the end of this blog.</p><h4>Preparing the Data</h4><p>One easy method of obtaining LR data is to degrade HR data. This is often done by blurring or adding noise. Images of lower spatial resolution can also be scaled by a classic upsampling method such as <a href="https://en.wikipedia.org/wiki/Bilinear_interpolation">Bilinear</a> or <a href="https://en.wikipedia.org/wiki/Bicubic_interpolation">Bicubic</a> interpolation. JPEG and quantization artifacts can also be introduced to degrade the image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4ucP01fc7XmjcP64ZmtPtA.jpeg" /><figcaption>Degrading a high resolution image to obtain a low resolution version of it. 
(Photo by <a href="https://unsplash.com/@maviccbr?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Jarrad Horne</a> on <a href="https://unsplash.com/search/photos/beach-hut?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>)</figcaption></figure><p>One important thing to note is that it is recommended to store the HR image in an uncompressed (or losslessly compressed) format. This is to prevent degradation of the quality of the HR image due to lossy compression, which may give sub-optimal performance.</p><h4>Types of Convolutions</h4><p>Besides classic 2D Convolutions, several interesting variants can be used in networks for improved results. Dilated (<a href="https://arxiv.org/pdf/1706.05587.pdf">Atrous</a>) convolutions can provide a greater effective field of view, hence using information separated by a large distance. <a href="https://arxiv.org/abs/1512.03385">Skip connections</a>, <a href="https://arxiv.org/abs/1406.4729">Spatial Pyramid Pooling</a> and <a href="https://arxiv.org/abs/1608.06993">Dense Blocks</a> motivate combining both low-level and high-level features to enhance performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NbvviwZ4N3c9FrrkNrJkVw.png" /><figcaption>Network design strategies. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>The above image mentions a number of network design strategies. You can refer to this <a href="https://arxiv.org/abs/1902.06068">paper</a> for more information. For a primer on the different types of convolutions commonly used in deep learning, you may refer to this <a href="https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d">blog</a>.</p><h4>Group 1 — Pre-Upsampling</h4><p>In this method, the low resolution images are first interpolated to obtain a “coarse” high resolution image. 
Now, CNNs are used to learn an end-to-end mapping from the interpolated low resolution images to the high resolution images. The intuition was that it may be easier to first upsample the low-resolution images using traditional methods (such as Bilinear interpolation) and then refine the result than to learn a direct mapping from a low-dimensional space to a high-dimensional space.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/521/1*0g8smWyC3SfrzGVNyArCWA.png" /><figcaption>A typical pre-upsampling network. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>You can refer to page 5 of this <a href="https://arxiv.org/abs/1902.06068">paper</a> for some models using this technique. The advantage is that since the upsampling is handled by traditional methods, the CNN only needs to learn how to refine the coarse image, which is simpler. Moreover, since we are not using transposed convolutions here, <a href="https://distill.pub/2016/deconv-checkerboard/">checkerboard artifacts</a> may be circumvented. However, the downside is that the predefined upsampling methods may amplify noise and cause blurring.</p><h4>Group 2 — Post-Upsampling</h4><p>In this case, the low resolution images are passed to the CNN as-is. Upsampling is performed in the last layer using a learnable layer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/419/1*bfLS2BU_d7HMkzwF8aUbDg.png" /><figcaption>A typical post-upsampling network. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>The advantage of this method is that feature extraction is performed in the lower dimensional space (before upsampling) and hence the computational complexity is reduced. Furthermore, by using a learnable upsampling layer, the model can be trained end-to-end.</p><h4>Group 3 — Progressive Upsampling</h4><p>In the above group, even though the computational complexity was reduced, only a single upsampling convolution was used. 
This makes the learning process harder for large scaling factors. To address this drawback, a progressive upsampling framework was adopted by works such as Laplacian Pyramid SR Network (<a href="https://arxiv.org/abs/1710.01992">LapSRN</a>) and Progressive SR (<a href="http://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w13/Wang_A_Fully_Progressive_CVPR_2018_paper.pdf">ProSR</a>). The models in this case use a cascade of CNNs to progressively reconstruct high resolution images at smaller scaling factors at each step.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/563/1*3BXi4oKHRjLSJRqcHXwG5Q.png" /><figcaption>A typical progressive-upsampling network. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>By decomposing a difficult task into simpler tasks, the learning difficulty is greatly reduced and better performance can be obtained. Moreover, learning strategies like <a href="https://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf">curriculum learning</a> can be integrated to further reduce learning difficulty and improve final performance.</p><h4>Group 4 — Iterative Up and Down Sampling</h4><p>Another popular model architecture is the hourglass (or <a href="https://arxiv.org/abs/1505.04597">U-Net</a>) structure. Some variants such as the <a href="https://arxiv.org/abs/1603.06937">Stacked Hourglass</a> network use several hourglass structures in series, effectively alternating between the process of upsampling and downsampling.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/608/1*ps5_DIIrnY6TEw5q7xYjSQ.png" /><figcaption>A typical iterative up-and-down sampling network. 
(<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>The models under this framework can better mine the deep relations between the LR-HR image pairs and thus provide higher quality reconstruction results.</p><h3>Loss Functions</h3><p>Loss functions are used to measure the difference between the generated High Resolution image and the ground truth High Resolution image. This difference (error) is then used to optimize the supervised learning model. Several classes of loss functions exist, each of which penalizes a different aspect of the generated image.</p><p>Often, more than one loss function is used, by weighting and summing the errors obtained from each loss function individually. This enables the model to focus on aspects contributed by multiple loss functions simultaneously.</p><p><strong>total_loss</strong> = weight_1 * loss_1 + weight_2 * loss_2 + weight_3 * loss_3</p><p>In this section, we will explore some popular classes of loss functions used for training the models.</p><h4>Pixel Loss</h4><p>Pixel-wise loss is the simplest class of loss functions, where each pixel in the generated image is directly compared with the corresponding pixel in the ground-truth image. Popular loss functions such as the L1 or L2 loss or advanced variants such as the Smooth L1 loss are used.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/1*7dxfmmRmBqAyos9paJYG3g.png" /><figcaption>Plot of Smooth L1 Loss. (<a href="https://www.researchgate.net/figure/Plots-of-the-L1-L2-and-smooth-L1-loss-functions_fig4_321180616">Source</a>)</figcaption></figure><p>The PSNR metric (discussed below) is highly correlated with the pixel-wise difference, and hence minimizing the pixel loss directly maximizes the PSNR metric value (indicating good performance). 
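<p><em>(Added for illustration, not from the original article.)</em> The link between the pixel losses and PSNR can be made concrete in a few lines of NumPy: PSNR is a decreasing function of the MSE, so driving the L2 pixel loss down pushes the PSNR up.</p>

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between the generated and ground-truth images."""
    return np.mean(np.abs(pred - target))

def l2_loss(pred, target):
    """Mean squared error (MSE)."""
    return np.mean((pred - target) ** 2)

def psnr(pred, target, L=255.0):
    """PSNR in dB; L is the maximum possible pixel value."""
    return 10.0 * np.log10(L ** 2 / l2_loss(pred, target))

hr = np.full((4, 4), 100.0)  # toy ground-truth patch
sr = hr + 10.0               # generated patch that is off by 10 everywhere
# l1_loss(sr, hr) == 10.0 and l2_loss(sr, hr) == 100.0
```
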
However, pixel loss does not take into account the image quality, and the model often outputs perceptually unsatisfying results (often lacking high frequency details).</p><h4>Content Loss</h4><p>This loss evaluates the perceptual quality of the generated image. An interesting way to do this is by comparing the high level features of the generated image and the ground truth image. We can obtain these high level features by passing both of these images through a pre-trained image classification network (such as a VGG-Net or a ResNet).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/584/1*hJRcoMUajhDz8VGB8z1VQw.png" /><figcaption>Content loss between a ground truth image and a generated image. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>The equation above calculates the content loss between a ground-truth image and a generated image, given a pre-trained network (Φ) and a layer (<em>l</em>) of this pre-trained network at which the loss is computed. This loss encourages the generated image to be perceptually similar to the ground-truth image. For this reason, it is also known as the <a href="https://arxiv.org/abs/1603.08155">Perceptual loss</a>.</p><h4>Texture Loss</h4><p>To enable the generated image to have the same style (texture, color, contrast etc.) as the ground truth image, texture loss (or style reconstruction loss) is used. The texture of an image, as described by <a href="https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf">Gatys et al.</a>, is defined as the correlation between different feature channels. The feature channels are usually obtained from a feature map extracted using a pre-trained image classification network (Φ).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/393/1*zFDiql0MwKTEr3zwD6e38A.png" /><figcaption>Computing the Gram Matrix. 
(<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>The correlation between the feature maps is represented by the Gram matrix (G), which is the inner product between the vectorized feature maps <strong>i</strong> and <strong>j</strong> on layer <strong><em>l </em></strong>(shown above). Once the Gram matrix is calculated for both images, calculating the texture loss is straightforward, as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/533/1*soSxUrEjvqiRP-bThQebwQ.png" /><figcaption>Computing the Texture Loss. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>By using this loss, the model is motivated to create realistic textures and visually more satisfying results.</p><h4>Total Variation Loss</h4><p>The Total Variation (TV) loss is used to suppress noise in the generated images. It takes the sum of the absolute differences between neighboring pixels and measures how much noise is in the image. For a generated image, the TV loss is calculated as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*-Pj6nDqnkNM38oOi00DURA.png" /><figcaption>Total Variation Loss used on a generated High Resolution image. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>Here, i, j and k iterate over the height, width and channels respectively.</p><h4>Adversarial Loss</h4><p>Generative Adversarial Networks (GANs) have been increasingly used for several image based applications including Super Resolution. GANs typically consist of a system of two neural networks — the Generator and the Discriminator — dueling each other.</p><p>Given a set of target samples, the Generator tries to produce samples that can fool the Discriminator into believing they are real. The Discriminator tries to distinguish real (target) samples from fake (generated) samples. 
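<p><em>(Added for illustration, not from the original article.)</em> This adversarial game can be sketched numerically with the standard binary cross-entropy formulation: when the discriminator confidently separates real from fake, its own loss is low while the generator's loss is high, and it is exactly this pressure that pushes the generator to produce more realistic samples.</p>

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy between discriminator probabilities and a label."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))

# Hypothetical discriminator outputs: probability that a sample is real.
d_real = np.array([0.9, 0.8, 0.95])  # scores on real HR images
d_fake = np.array([0.1, 0.2, 0.05])  # scores on generated images

# The discriminator wants real -> 1 and fake -> 0.
d_loss = bce(d_real, 1.0) + bce(d_fake, 0.0)

# The (non-saturating) generator wants its fakes to be scored as real.
g_loss = bce(d_fake, 1.0)
```

<p>Here the discriminator is winning, so <code>g_loss</code> is much larger than <code>d_loss</code>; training nudges the generator's samples until the scores (and losses) even out.</p>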
Using this iterative training approach, we eventually end up with a Generator that is really good at generating samples similar to the target samples. The following image shows the structure of a typical GAN.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/926/1*lhPwxtvyEntvCYnuRQd1nQ.png" /><figcaption>GANs in action. (<a href="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/generative_models.html">Source</a>)</figcaption></figure><p>Advances to the basic GAN architecture were introduced for improved performance. For instance, <a href="http://openaccess.thecvf.com/content_ECCV_2018/papers/Seong-Jin_Park_SRFeat_Single_Image_ECCV_2018_paper.pdf">Park et al.</a> used a feature-level discriminator to capture more meaningful potential attributes of real High Resolution images. You can check out this <a href="https://medium.com/beyondminds/advances-in-generative-adversarial-networks-7bad57028032?source=friends_link&amp;sk=9f420a2c96c228a3f6d1a9f52b382109">blog</a> for a more elaborate survey about the advances in GANs.</p><p>Typically, models trained with adversarial loss have better perceptual quality even though they might lose out on PSNR compared to those trained on pixel loss. One minor downside is that the training process of GANs is difficult and unstable. However, methods to stabilize GAN training are an active area of research.</p><h3>Metrics</h3><p>One big question is how to quantitatively evaluate the performance of our model. A number of Image Quality Assessment (IQA) techniques (or metrics) are used for the same. These metrics can be broadly classified into two categories — <strong>Subjective</strong> metrics and <strong>Objective</strong> metrics.</p><p>Subjective metrics are based on the human observer’s perceptual evaluation whereas objective metrics are based on computational models that try to assess the image quality. 
Subjective metrics are often more “perceptually accurate”; however, some of these metrics are inconvenient, time-consuming or expensive to compute. Another issue is that these two categories of metrics may not be consistent with each other. Hence, researchers often display results using metrics from both categories.</p><p>In this section, we will briefly explore a couple of the widely used metrics to evaluate the performance of our super resolution model.</p><h4>PSNR</h4><p>Peak Signal-to-Noise Ratio (PSNR) is a commonly used objective metric to measure the reconstruction quality of a lossy transformation. PSNR is inversely proportional to the logarithm of the Mean Squared Error (MSE) between the ground truth image and the generated image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/351/1*aPRl6RGliFrJotbih8lmaQ.png" /><figcaption>Calculation of PSNR. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>In the above formula, L is the maximum possible pixel value (for 8-bit RGB images, it is 255). Unsurprisingly, since PSNR only cares about the difference between the pixel values, it does not represent perceptual quality that well.</p><h4>SSIM</h4><p>Structural Similarity (SSIM) is an objective metric used for measuring the structural similarity between images, based on three relatively independent comparisons, namely luminance, contrast, and structure. Abstractly, the SSIM formula can be shown as a weighted product of the comparison of luminance, contrast and structure computed independently.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/485/1*O6z44G7t7mxx4LGRtiyfkQ.png" /><figcaption>SSIM is a weighted product of comparisons as described above. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>In the above formula, alpha, beta and gamma are the weights of the luminance, contrast and structure comparison functions respectively. 
The commonly used representation of the SSIM formula is as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/501/1*M39etNqTAQ2V3RaXVh6PUg.png" /><figcaption>Commonly used representation of the SSIM formula. (<a href="https://arxiv.org/abs/1902.06068">Source</a>)</figcaption></figure><p>In the above formula, <strong>μ(I)</strong> represents the mean of a particular image, <strong>σ(I)</strong> represents the standard deviation of a particular image, <strong>σ(I,I’)</strong> represents the covariance between two images, and <strong>C1, C2</strong> are constants set for avoiding instability. For brevity, the significance of the terms and the exact derivation is not explained in this blog, and the interested reader can check out Section 2.3.2 in this <a href="https://arxiv.org/abs/1902.06068">paper</a>.</p><p>Because image statistical features or distortions may be unevenly distributed, assessing image quality locally is more reliable than assessing it globally. Mean SSIM (<strong>MSSIM</strong>), which splits the image into multiple windows and averages the SSIM obtained at each window, is one such method of assessing quality locally.</p><p>In any case, since SSIM evaluates the reconstruction quality from the perspective of the Human Visual System, it better meets the requirements of perceptual assessment.</p><h4>Other IQA Scores</h4><p>Without explanation, some other methods of assessing image quality are listed below. The interested reader can refer to this <a href="https://arxiv.org/abs/1902.06068">paper</a> for more details.</p><ul><li>Mean Opinion Score (MOS)</li><li>Task-based Evaluation</li><li>Information Fidelity Criterion (IFC)</li><li>Visual Information Fidelity (VIF)</li></ul><h3>Conclusion</h3><p>This blog article covered some introductory material and procedures for training deep learning models for Super Resolution. 
There are indeed more advanced techniques introduced by state-of-the-art research which may yield better performance. Furthermore, research on avenues such as unsupervised super resolution, better normalization techniques and more representative metrics could greatly further this field. The interested reader is encouraged to experiment with their innovative ideas by participating in challenges such as the <a href="http://openaccess.thecvf.com/content_ECCVW_2018/papers/11133/Blau_2018_PIRM_Challenge_on_Perceptual_Image_Super-resolution_ECCVW_2018_paper.pdf">PIRM Challenge</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f60aff9a499d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/beyondminds/an-introduction-to-super-resolution-using-deep-learning-f60aff9a499d">An Introduction to Super Resolution using Deep Learning</a> was originally published in <a href="https://medium.com/beyondminds">BeyondMinds</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Overview of Human Pose Estimation with Deep Learning]]></title>
            <link>https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/d49eb656739b</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[pose-estimation]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Sun, 28 Apr 2019 16:50:19 GMT</pubDate>
            <atom:updated>2019-05-01T14:09:06.190Z</atom:updated>
<content:encoded><![CDATA[<h4>An introduction to the techniques used in Human Pose Estimation based on Deep Learning.</h4><p><em>Written by </em><a href="https://medium.com/u/7d6e83a807b8"><em>Bharath Raj</em></a><em> with feedback from </em><a href="https://www.linkedin.com/in/yoni-osin-41791aa5/"><em>Yoni Osin</em></a><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ggI0hgH-psWNqKzg_AVxUQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/P_qvsF7Yodw?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Alain Pham</a> on <a href="https://unsplash.com/search/photos/lines?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>A Human Pose Skeleton represents the orientation of a person in a graphical format. Essentially, it is a set of coordinates that can be connected to describe the pose of the person. Each coordinate in the skeleton is known as a part (or a joint, or a keypoint). A valid connection between two parts is known as a pair (or a limb). Note that not all part combinations give rise to valid pairs. A sample human pose skeleton is shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Q8sk8FwMIYuLoLmptsT-MA.png" /><figcaption>Left: COCO keypoint format for human pose skeletons. Right: Rendered human pose skeletons. (<a href="https://github.com/CMU-Perceptual-Computing-Lab/openpose">Source</a>)</figcaption></figure><p>Knowing the orientation of a person opens avenues for several real-life applications, some of which are discussed towards the end of this blog. Several approaches to Human Pose Estimation were introduced over the years. The earliest (and slowest) methods typically estimated the pose of a single person in an image that only had one person to begin with. 
These methods often identify the individual parts first, followed by forming connections between them to create the pose.</p><p>Naturally, these methods are not particularly useful in many real-life scenarios where images contain multiple people.</p><h4>Multi-Person Pose Estimation</h4><p>Multi-Person pose estimation is more difficult than the single person case as the location and the number of people in an image are unknown. Typically, we can tackle the above issue using one of two approaches:</p><ul><li>The simple approach is to incorporate a person detector first, followed by estimating the parts and then calculating the pose for each person. This method is known as the <strong>top-down</strong> approach.</li><li>Another approach is to detect all parts in the image (i.e. parts of every person), followed by associating/grouping parts belonging to distinct persons. This method is known as the <strong>bottom-up</strong> approach.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*DMdb6SwPEeQBvqbFF6bXNg.jpeg" /><figcaption>Top: Typical Top-Down approach. Bottom: Typical Bottom-Up approach. (<a href="https://unsplash.com/photos/XuN44TajBGo?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Image Source</a>)</figcaption></figure><p>Typically, the top-down approach is easier to implement than the bottom-up approach, as adding a person detector is much simpler than adding associating/grouping algorithms. It is hard to judge which approach has better performance overall, as it really comes down to whether the person detector or the associating/grouping algorithm is better.</p><p>In this blog, we will focus on multi-person human pose estimation using deep learning techniques. In the next section, we will review some of the popular top-down and bottom-up approaches for the same.</p><h3>Deep Learning Methods</h3><h4>1. 
OpenPose</h4><p><a href="https://arxiv.org/pdf/1812.08008.pdf">OpenPose</a> is one of the most popular bottom-up approaches for multi-person human pose estimation, partly because of its well-documented <a href="https://github.com/CMU-Perceptual-Computing-Lab/openpose">GitHub</a> implementation.</p><p>As with many bottom-up approaches, OpenPose first detects parts (keypoints) belonging to every person in the image, followed by assigning parts to distinct individuals. Shown below is the architecture of the OpenPose model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/728/1*AZjnPjb-Uuj-H_moHicFFw.png" /><figcaption>Flowchart of the OpenPose architecture. (<a href="https://arxiv.org/pdf/1611.08050.pdf">Source</a>)</figcaption></figure><p>The OpenPose network first extracts features from an image using the first few layers (VGG-19 in the above flowchart). The features are then fed into two parallel branches of convolutional layers. The first branch predicts a set of 18 confidence maps, with each map representing a particular part of the human pose skeleton. The second branch predicts a set of 38 Part Affinity Fields (PAFs) which represent the degree of association between parts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WClpOwPiG4Glg6WOWlNK_Q.png" /><figcaption>Steps involved in human pose estimation using OpenPose. (<a href="https://arxiv.org/pdf/1812.08008.pdf">Source</a>)</figcaption></figure><p>Successive stages are used to refine the predictions made by each branch. Using the part confidence maps, bipartite graphs are formed between pairs of parts (as shown in the above image). Using the PAF values, weaker links in the bipartite graphs are pruned. Through the above steps, human pose skeletons can be estimated and assigned to every person in the image. 
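<p><em>(Added for illustration; this is a caricature, not OpenPose's actual matching, which integrates PAF values along candidate limbs.)</em> The pruning of weak links can be thought of as bipartite matching on an association-score matrix. A greedy sketch, where <code>scores[i, j]</code> is a hypothetical association score between candidate i of one part and candidate j of another:</p>

```python
import numpy as np

def greedy_assign(scores):
    """Greedily pair part-A candidates with part-B candidates by
    descending association score, never reusing a candidate."""
    pairs, used_a, used_b = [], set(), set()
    for i, j in sorted(np.ndindex(*scores.shape), key=lambda ij: -scores[ij]):
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return pairs

scores = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
pairs = greedy_assign(scores)  # -> [(0, 0), (1, 1)]
```

<p>Each surviving pair connects two parts into a limb; chaining limbs that share candidates then yields one skeleton per person.</p>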
For a more thorough explanation of the algorithm, you may refer to their <a href="https://arxiv.org/pdf/1812.08008.pdf">paper</a> and to this <a href="https://arvrjourney.com/human-pose-estimation-using-openpose-with-tensorflow-part-2-e78ab9104fc8">blog post</a>.</p><h4>2. DeepCut</h4><p><a href="https://arxiv.org/abs/1511.06645">DeepCut</a> is a bottom-up approach for multi-person human pose estimation. The authors approached the task by defining the following problems:</p><ol><li>Produce a set of <strong>D </strong>body part candidates. This set represents all possible locations of body parts for every person in the image. Select a subset of body parts from the above set of body part candidates.</li><li>Label each selected body part with one of <strong>C</strong> body part classes. The body part classes represent the types of parts, such as “arm”, “leg”, “torso” etc.</li><li>Partition body parts that belong to the same person.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/556/1*UG4TgYMB1Irof8v9h3ifVg.png" /><figcaption>Pictorial representation of the approach. (<a href="https://arxiv.org/pdf/1511.06645.pdf">Source</a>)</figcaption></figure><p>The above problems were jointly solved by modeling it into an <a href="https://en.wikipedia.org/wiki/Integer_programming">Integer Linear Programming</a> (ILP) problem. It is modeled by considering triples <strong>(x, y, z)</strong> of binary random variables with domains as stated in the images below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/880/1*O_Qcua8fa3q7fW2_NBnYXw.png" /><figcaption>Domains of the binary random variables. (<a href="https://arxiv.org/pdf/1511.06645.pdf">Source</a>)</figcaption></figure><p>Consider two body part candidates d and d&#39; from the set of body part candidates D and classes c and c&#39; from the set of classes C. The body part candidates were obtained through a <a href="https://arxiv.org/abs/1506.01497">Faster RCNN</a> or a Dense CNN. 
Now, we can develop the following set of statements.</p><ul><li>If x(d,c) = 1 then it means that body part candidate d belongs to class c.</li><li>Also, y(d,d&#39;) = 1 indicates that body part candidates d and d&#39; belong to the same person.</li><li>They also define z(d,d’,c,c’) = x(d,c) * x(d’,c’) * y(d,d’). If the above value is 1, then it means that body part candidate d belongs to class c, body part candidate d&#39; belongs to class c&#39;, and finally body part candidates d,d’ belong to the same person.</li></ul><p>The last statement can be used to partition poses belonging to different people. Clearly, the above statements can be formulated in terms of linear equations as functions of (x,y,z). In this way, the Integer Linear Program (ILP) is set up, and the pose of multiple persons can be estimated. For the exact set of equations and much more detailed analysis, you can check out their paper <a href="https://arxiv.org/pdf/1511.06645.pdf">here</a>.</p><h4>3. RMPE (AlphaPose)</h4><p><a href="https://arxiv.org/abs/1612.00137">RMPE</a> is a popular top-down method of Pose Estimation. The authors posit that top-down methods are usually dependent on the accuracy of the person detector, as pose estimation is performed on the region where the person is located. Hence, errors in localization and duplicate bounding box predictions can cause the pose extraction algorithm to perform sub-optimally.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*P9ZlPvIEohqpUTzrnwqv0g.png" /><figcaption>Effect of duplicate predictions (left) and low confidence bounding boxes (right). (<a href="https://arxiv.org/pdf/1612.00137.pdf">Source</a>)</figcaption></figure><p>To resolve this issue, the authors proposed the usage of Symmetric Spatial Transformer Network (SSTN) to extract a high-quality single person region from an inaccurate bounding box. 
A Single Person Pose Estimator (SPPE) is used in this extracted region to estimate the human pose skeleton for that person. A Spatial De-Transformer Network (SDTN) is used to remap the estimated human pose back to the original image coordinate system. Finally, a parametric pose Non-Maximum Suppression (NMS) technique is used to handle the issue of redundant pose detections.</p><p>Furthermore, the authors introduce a Pose Guided Proposals Generator to augment training samples that can better help train the SPPE and SSTN networks. The salient feature of RMPE is that this technique can be extended to any combination of a person detection algorithm and an SPPE.</p><h4>4. Mask RCNN</h4><p><a href="https://arxiv.org/abs/1703.06870">Mask RCNN</a> is a popular architecture for performing semantic and instance segmentation. The model predicts, in parallel, both the bounding box locations of the various objects in the image and a mask that semantically segments the object. The basic architecture can be quite easily extended for human pose estimation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/764/1*LskT_JGTVkw162xeRJ8cFg.png" /><figcaption>Flowchart describing the Mask RCNN Architecture. (<a href="https://medium.com/@jonathan_hui/image-segmentation-with-mask-r-cnn-ebe6d793272">Source</a>)</figcaption></figure><p>The basic architecture first extracts feature maps from an image using a CNN. These feature maps are used by a Region Proposal Network (RPN) to get bounding box candidates for the presence of objects. The bounding box candidates select an area (region) from the feature map extracted by the CNN. Since the bounding box candidates can be of various sizes, a layer called RoIAlign is used to reduce the size of the extracted features so that they are all of a uniform size. 
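<p><em>(Added for illustration; a crude stand-in, not Mask RCNN's actual RoIAlign, which samples at sub-pixel locations with bilinear interpolation.)</em> The core idea of producing a fixed-size feature from a variable-sized region can be sketched as adaptive average pooling:</p>

```python
import numpy as np

def adaptive_avg_pool(feat, out=2):
    """Pool a variable-sized 2D feature map down to a fixed out x out grid
    by averaging over (roughly) equal-sized blocks."""
    h, w = feat.shape
    ys = np.linspace(0, h, out + 1).astype(int)  # row block boundaries
    xs = np.linspace(0, w, out + 1).astype(int)  # column block boundaries
    return np.array([[feat[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                      for j in range(out)] for i in range(out)])

roi = np.arange(16.0).reshape(4, 4)    # a hypothetical cropped region feature
fixed = adaptive_avg_pool(roi, out=2)  # always 2x2, whatever the input size
```

<p>Whatever the region's size, the output grid is fixed, which is what lets the downstream box and mask branches use ordinary fixed-shape layers.</p>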
Now, this extracted feature is passed into the parallel branches of CNNs for the final prediction of the bounding boxes and the segmentation masks.</p><p>Let us focus on the branch that performs segmentation. Suppose an object in our image can belong to one among K classes. The segmentation branch outputs <strong>K</strong> binary masks of size <strong>m x m</strong>, where each binary mask represents all objects belonging to that class alone. We can extract keypoints belonging to every person in the image by modeling each type of keypoint as a distinct class and treating this like a segmentation problem.</p><p>In parallel, the object detection algorithm can be trained to identify the location of the persons. By combining the information of the location of the person as well as their set of keypoints, we obtain the human pose skeleton for every person in the image.</p><p>This method closely resembles the top-down approach, but the person detection stage is performed in parallel to the part detection stage. In other words, the keypoint detection stage and person detection stage are independent of each other.</p><h4>5. Other Methods</h4><p>Multi-Person Human Pose Estimation is a vast field with a plethora of approaches to tackle the problem. For brevity, only a select few approaches are explained here. For a more exhaustive list of approaches, you may check out the following links:</p><ul><li><a href="https://github.com/cbsudux/awesome-human-pose-estimation">Awesome Human Pose Estimation</a></li><li><a href="https://paperswithcode.com/sota/multi-person-pose-estimation-on-mpii-multi">Papers with Code</a></li></ul><h3>Applications</h3><p>Pose Estimation has applications in myriad fields, some of which are listed below.</p><h4>1. Activity Recognition</h4><p>Tracking the variations in the pose of a person over a period of time can also be used for activity, gesture and gait recognition. 
There are several use cases for the same, including:</p><ul><li>Applications to detect if a person has fallen down or is sick.</li><li>Applications that can autonomously teach proper workout regimes, sports techniques and dance activities.</li><li>Applications that can understand full-body sign language. (Ex: Airport runway signals, traffic policemen signals, etc.).</li><li>Applications that can enhance security and surveillance.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/519/1*Y09ZWoCf_EiwtA34S85-6Q.png" /><figcaption>Tracking the gait of the person is useful for security and surveillance purposes. (<a href="http://www.ee.oulu.fi/~gyzhao/research/gait_recognition.htm">Image source</a>)</figcaption></figure><h4>2. Motion Capture and Augmented Reality</h4><p>An interesting application of human pose estimation is for CGI applications. Graphics, styles, fancy enhancements, equipment and artwork can be superimposed on the person if their human pose can be estimated. By tracking the variations of this human pose, the rendered graphics can “naturally fit” the person as they move.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*aOJl0KLejxpBcu2rAq1suw.jpeg" /><figcaption>Example of CGI Rendering. (<a href="https://i.kym-cdn.com/photos/images/facebook/001/012/571/0a4.jpg">Source</a>)</figcaption></figure><p>A good visual example of what is possible can be seen through <a href="https://www.wired.com/story/all-the-face-tracking-tech-behind-apples-animoji/">Animoji</a>. Even though the above only tracks the structure of a face, the idea can be extrapolated for the keypoints of a person. The same concepts can be leveraged to render Augmented Reality (AR) elements that can mimic the movements of a person.</p><h4>3. Training Robots</h4><p>Instead of manually programming robots to follow trajectories, robots can be made to follow the trajectories of a human pose skeleton that is performing an action. 
A human instructor can effectively teach the robot certain actions simply by demonstrating them. The robot can then calculate how to move its articulators to perform the same action.</p><h4>4. Motion Tracking for Consoles</h4><p>An interesting application of pose estimation is for tracking the motion of human subjects for interactive gaming. Notably, the Kinect used 3D pose estimation (using IR sensor data) to track the motion of the human players and to use it to render the actions of the virtual characters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/610/1*OLazaVAiH1vlCKhSS6OR0g.png" /><figcaption>The Kinect sensor in action. (<a href="https://appleinsider.com/articles/14/07/11/apples-secret-plans-for-primesense-3d-tech-hinted-at-by-new-itseez3d-ipad-app">Source</a>)</figcaption></figure><h3>Conclusion</h3><p>Great strides have been made in the field of human pose estimation, which enables us to better serve the myriad applications that are possible with it. Moreover, research in related fields such as Pose Tracking can greatly enhance its productive utilization in several fields. The concepts listed in this blog are not exhaustive; rather, they introduce some popular variants of these algorithms and their real-life applications.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d49eb656739b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b">An Overview of Human Pose Estimation with Deep Learning</a> was originally published in <a href="https://medium.com/beyondminds">BeyondMinds</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How To Easily Classify Food Using Deep Learning And TensorFlow]]></title>
            <link>https://medium.com/nanonets/how-to-easily-classify-food-using-deep-learning-and-tensorflow-cbe9b1dc302c?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/cbe9b1dc302c</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[tensorflow]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Mon, 18 Mar 2019 09:33:01 GMT</pubDate>
            <atom:updated>2019-05-06T16:56:46.832Z</atom:updated>
            <content:encoded><![CDATA[<h4><em>An in-depth tutorial on creating Deep Learning models for Multi-Label Classification.</em></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*371WToG5habYs4np4tJOZA.jpeg" /></figure><p>By now you would have heard about Convolutional Neural Networks (CNNs) and their efficacy in classifying images. The accuracy of CNNs in image classification is quite remarkable, and their real-life applications through APIs are quite profound.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nR5QCdmqUnvU2JFBu2Xa-Q.png" /><figcaption>Examples of Classification, Localization, Object Detection and Instance Segmentation. (<a href="https://www.google.com/url?sa=i&amp;rct=j&amp;q=&amp;esrc=s&amp;source=images&amp;cd=&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwiR16LF1vzgAhWXfX0KHaVZCPkQjRx6BAgBEAU&amp;url=https%3A%2F%2Fwww.analyticsindiamag.com%2Ftop-5-image-classification-research-papers-every-data-scientist-should-know%2F&amp;psig=AOvVaw0dWw_kL93mm_VWvWVylrdE&amp;ust=1552482609158420">Source</a>)</figcaption></figure><p>But sometimes, this technique may not be adequate. An image may represent multiple attributes. For instance, all of the following tags are valid for the below image. A simple classifier would get confused about which label to provide in such a scenario.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*Wd-6gLJfo6mAHkUxigaN3g.jpeg" /><figcaption>An image with multiple possible correct labels. (<a href="https://pixabay.com/photos/scenery-lake-man-boy-male-looking-1183363/">Source</a>)</figcaption></figure><p>This problem is known as <strong>Multi-Label classification</strong>.</p><h3>Why Multi-Label Classification?</h3><p>There are many applications where assigning multiple attributes to an image is necessary. In fact, it is more natural to think of images as belonging to multiple classes rather than a single class. Below are some applications of Multi Label Classification.</p><h4>1. 
Scene Understanding</h4><p>Multi Label Classification provides an easy to calculate prior for complex Scene Understanding algorithms. Identifying various possible tags for an image can help the Scene Understanding algorithm to create multiple vivid descriptions for the image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1010/1*zxGKvMlgMQKE-T1YRUHTqA.png" /><figcaption>Multiple descriptions can be created for a scene based on the labels identified from the image. (<a href="http://visualgenome.org/static/paper/Visual_Genome.pdf">Source</a>)</figcaption></figure><h4>2. Content-Based Retrieval</h4><p>Multi Label tags can enhance the ability of search engines to retrieve very specific queries of a given product. For instance, we could provide multiple tags for an image of a fashion model wearing branded attire. A search engine can retrieve this result when you search for any one of the tags. A Multi Label Classification engine can automatically build up a database for the search engine.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*chlpimcsWvhFRSIS2DE5iw.png" /><figcaption>Content Based Image Retrieval in action. The Multi Label classifier performs the function of the Feature Extraction module in the above flowchart. (<a href="https://www.google.com/url?sa=i&amp;rct=j&amp;q=&amp;esrc=s&amp;source=images&amp;cd=&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwjJ3vmB1_zgAhWWfn0KHRCOBCwQjRx6BAgBEAU&amp;url=https%3A%2F%2Fblog.sicara.com%2Fkeras-tutorial-content-based-image-retrieval-convolutional-denoising-autoencoder-dc91450cc511&amp;psig=AOvVaw18Cl8gNqkRuk541o9Q9mvA&amp;ust=1552482639525637">Source</a>)</figcaption></figure><p>Moreover, we can use the tags to recommend related products based on the user’s activity or preferences. For instance, you can recommend similar songs or movies based on the user’s activity. 
A Multi Label Classifier can be used to automatically index such songs and movies.</p><h3>How does it work?</h3><p>If you are familiar with Machine Learning algorithms for classification, some minor modifications are enough to make the same algorithm work for a multi label problem. In any case, let us do a small review of how classification works, and how it can be expanded to a multi label scenario. For the rest of this blog, we will focus on implementing the same for images.</p><h4>Single Label Classification</h4><p>Neural Networks are among the most powerful (and popular) algorithms used for classification. They take inputs in the form of a vector, perform some computations and then produce an output vector. The output vector is then compared with the ground truth labels and the computation process is tweaked (i.e. trained) to yield better results. To train the Neural Network, we feed our input data in the form of feature vectors that represent the important gist of the data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/479/1*NIOcHiCiGejigjGvq1iuUQ.png" /><figcaption>A Multi Layer Perceptron. (<a href="https://medium.com/@curiousily/tensorflow-for-hackers-part-iv-neural-network-from-scratch-1a4f504dfa8">Source</a>)</figcaption></figure><p>One hurdle you might have noticed is the issue of encoding images into a feature vector. Convolutional Neural Networks (CNNs) are used for this purpose. Convolutions extract important features from the images and convert them into a vector representation for further processing. The rest of the processing in a CNN is similar to that of a Multi Layered Perceptron. This, in a nutshell, is how single label classification is performed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/866/1*VMYAQF9SCqJDlRnUI4RZWQ.png" /><figcaption>A Convolutional Neural Network. 
(<a href="https://sds-platform-private.s3-us-east-2.amazonaws.com/uploads/35_blog_image_34.png">Source</a>)</figcaption></figure><h4>Multi Label Classification</h4><p>Now, how do we adapt this model for Multi Label Classification? There are several strategies for doing so.</p><h4>Method 1 — Problem Transformation</h4><p>In this case, we will transform the Multi Label problem into a Multi Class problem. One way of doing this is by training a separate classifier for each label. This method has the obvious downside of training too many classifiers. It also ignores possible correlations between labels.</p><p>Another method is by encoding each possible combination of labels as a separate class, thereby creating a label powerset. This method works well for a small number of label combinations, but it is hard to scale to a large number of labels. For just 10 labels, we would get a powerset of size 1024 (2 raised to the power 10)!</p><h4>Method 2 — Adapting the Algorithm</h4><p>Sometimes, making some minor modifications to the algorithm would be enough for tackling a Multi Label Classification problem. For instance, in the case of a Neural Network, we can replace the final softmax layer with a Sigmoid layer and then use Binary Cross Entropy to optimize the model.</p><p>Clearly, there are a lot of strategies that can be explored. Often, one strategy may not work best for all kinds of data and hence requires lots of experimentation.</p><h3>Multi Label Food Classification</h3><p>The theory sounds alright, but how do we implement it? In this section, we will build our own Multi Label Food Classification algorithm using Keras (with TensorFlow backend). We will modify a simple CNN model to enable multi label classification. We will then do a comparison with Nanonets Multi Label Classification API.</p><p>All the code is available on GitHub over <a href="https://github.com/thatbrguy/Multilabel-Classification">here</a>. 
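To make the label-powerset explosion described above concrete, here is a toy sketch using the 10 food tags from this article (the variable names are illustrative):

```python
from itertools import combinations

# Label-powerset transformation: every subset of the label set
# becomes its own class in the transformed multi-class problem.
labels = ["Soups", "Mains", "Appetizer", "Dessert", "Protein",
          "Fats", "Carbs", "Healthy", "Junk", "Meat"]

# Enumerate every subset of the label set (including the empty set).
powerset = [combo for r in range(len(labels) + 1)
            for combo in combinations(labels, r)]

print(len(powerset))  # 1024 classes for just 10 labels (2 ** 10)
```

This is exactly why the transformation does not scale: the number of classes doubles with every label added.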
You can follow the GitHub repository for an in-depth guide to replicate the experiments.</p><h4>Problem Description</h4><p>Let us work on a possible real-life application of Multi Label Classification. Given a food item, we would like to identify possible tags for the image. For instance, given an image of a cake, we would like our model to provide tags such as “carbs” and “dessert”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*YfYF6SC6MvjFwF2k5_md7w.png" /><figcaption>Sample images and their respective tags.</figcaption></figure><p>Such a model is extremely useful for Content Based Retrieval for businesses in the food industry. For instance, we can create an automated dietary planner app based on the requirements of the user and retrieve relevant images and recipes for the appropriate food items.</p><h4>Part 1 — Data Collection</h4><p>The first step is to collect and clean the data. I sampled around 2000 images from the <a href="http://www.ub.edu/cvub/recipes5k/">Recipes5k</a> dataset and resized them to size 224 x 224. The original dataset had annotations of the ingredients of a food item. However, there were more than 1000 possible ingredients (i.e. labels) and this would have created highly sparse label vectors. Hence, I created my own set of annotations for the same images.</p><p>In our case, an image can have at most 10 possible labels. The list of labels is: [“Soups”, “Mains”, “Appetizer”, “Dessert”, “Protein”, “Fats”, “Carbs”, “Healthy”, “Junk”, “Meat”]. To encode the labels in a format that can be utilized by the neural network, we create a 10-dimensional vector such that there is a “1” if a label is present in the image and “0” if a label is absent.</p><p>To make the annotation process simpler, I made some bold assumptions such as: “All cake images are Desserts and have Carbs”. This greatly simplified the annotation process, and I wrote a simple Python function to carry most of the heavy lifting. 
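The multi-hot encoding described above can be sketched as follows (the helper name is illustrative; the tag list is the one from this article):

```python
LABELS = ["Soups", "Mains", "Appetizer", "Dessert", "Protein",
          "Fats", "Carbs", "Healthy", "Junk", "Meat"]

def encode_tags(tags):
    """Return a 10-dim multi-hot vector: 1 if a label is present, else 0."""
    return [1 if label in tags else 0 for label in LABELS]

# A cake image might be annotated as a carb-rich dessert:
print(encode_tags(["Dessert", "Carbs", "Junk"]))
# [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]
```

Unlike one-hot encoding, several positions may be 1 at once, which is what lets the network predict multiple tags per image.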
While this strategy makes the process simpler, it may create some noisy annotations (i.e. slightly wrong) and could impact the final accuracy. Nevertheless, for this toy experiment, we proceed as such. A sample annotation for a cake image and its label is shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2Q5cUnsA_-PXSzTwvEDe0Q.jpeg" /><figcaption>A sample image and its vector format.</figcaption></figure><p>Clearly, we are restricted by the quantity of data in hand. To better enhance the training ability of our CNN, we can perform <a href="https://medium.com/nanonets/how-to-use-deep-learning-when-you-have-limited-data-part-2-data-augmentation-c26971dc8ced">Data Augmentation</a> and <a href="https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab">Transfer Learning</a>.</p><h4>Part 2 — Building the Model</h4><p>We will define the model using Keras as follows. The below model is a pretrained ResNet-50 with two Dense layers at the end. Notice that we used a sigmoid activation rather than softmax for the final Dense layer. We use Binary Cross Entropy as our loss function. To calculate the accuracy of the model, we use the F1 score averaged by samples (or similar) as the metric.</p><pre>from keras.applications.resnet50 import ResNet50<br>from keras.backend import clear_session<br>from keras.layers import Input, Dense<br>from keras.models import Model</pre><pre>clear_session()<br>img = Input(shape = (224, 224, 3))</pre><pre>model = ResNet50(<br>weights = &#39;imagenet&#39;,<br>include_top = False, <br>input_tensor = img, <br>input_shape = None, <br>pooling = &#39;avg&#39;<br>)</pre><pre>final_layer = model.layers[-1].output</pre><pre>dense_layer_1 = Dense(128, activation = &#39;relu&#39;)(final_layer)<br>output_layer = Dense(10, activation = &#39;sigmoid&#39;)(dense_layer_1)</pre><pre>model = Model(inputs = img, outputs = output_layer)<br>model.compile(optimizer = &#39;adam&#39;, metrics = [&#39;accuracy&#39;], <br>loss = &#39;binary_crossentropy&#39;)</pre><h4>Part 3 — Training</h4><p>The data was split into train, validation and test sets. 
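The split itself can be sketched along these lines; this is a minimal illustration assuming 80/10/10 ratios (the helper name, ratios and array shapes are illustrative, not taken from the article's repository):

```python
import numpy as np

def train_val_test_split(X, Y, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the data and split it into train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val, n_test = int(len(X) * val_frac), int(len(X) * test_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], Y[train_idx], X[val_idx], Y[val_idx],
            X[test_idx], Y[test_idx])

# Dummy "images" with 10-dim multi-hot labels (small shapes for speed).
X = np.zeros((100, 8, 8, 3), dtype=np.float32)
Y = np.zeros((100, 10), dtype=np.float32)
trainX, trainY, valX, valY, testX, testY = train_val_test_split(X, Y)
print(len(trainX), len(valX), len(testX))  # 80 10 10
```

The resulting `trainX`/`trainY` and `valX`/`valY` arrays are what get passed to `model.fit()` below.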
The data is normalized channel-wise before being fed into the CNN. Since our dataset is relatively small, we can directly use model.fit() to train our model. This is shown in the following code snippet:</p><pre>model.fit(<br>trainX, trainY, <br>batch_size = 32, <br>epochs = 50, <br>validation_data = (valX, valY))</pre><h4>Part 4 — Inference</h4><p>Now that we have a trained model, we can visualize its performance using model.predict(). This will output an array with each element representing the probability of a tag (or label). We can obtain a binary vector by rounding the predicted array such that a 1 signifies the presence of a tag and a 0 signifies the absence of a tag. We can use this binary vector to decode the predicted tags as shown in the image below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HPWQgriX-0y6b2h6SxzVjw.png" /><figcaption>Decoding the output from the Neural Network</figcaption></figure><p>To analyze the performance, we repeat the experiment with different models pre-trained on the ImageNet dataset. Overall, the following pre-trained models were used:</p><ul><li>ResNet-50</li><li>DenseNet-121</li><li>Xception</li><li>MobileNet</li></ul><h3>That was great! But…</h3><p>The above example works pretty well. But there are some issues:</p><ul><li>As mentioned earlier, there are several strategies to perform Multi Label Classification. Lots of experimentation is required.</li><li>We need to perform a hyperparameter search to optimize performance.</li><li>We need to manually handle transfer learning and data augmentation.</li><li>Training requires a powerful GPU and lots of time.</li><li>Moving this model to production would take additional time, effort and skill.</li></ul><p>The above issues pose a great limitation to moving such models quickly into deployment. Luckily, there is an easy alternative.</p><h3>Nanonets to the rescue!</h3><p>Nanonets provides an easy-to-use API to train a Multi Label classifier. 
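The round-and-decode step described earlier (thresholding the per-label sigmoid outputs, then mapping the 1s back to tag names) can be sketched as follows; the probability values shown are a hypothetical model output, not real predictions:

```python
import numpy as np

LABELS = ["Soups", "Mains", "Appetizer", "Dessert", "Protein",
          "Fats", "Carbs", "Healthy", "Junk", "Meat"]

def decode_prediction(probs, threshold=0.5):
    """Round per-label sigmoid probabilities and return the predicted tags."""
    binary = (np.asarray(probs) >= threshold).astype(int)
    return [label for label, bit in zip(LABELS, binary) if bit == 1]

# A hypothetical output of model.predict() for one cake image:
probs = [0.02, 0.10, 0.05, 0.91, 0.30, 0.12, 0.88, 0.07, 0.76, 0.04]
print(decode_prediction(probs))  # ['Dessert', 'Carbs', 'Junk']
```

A threshold of 0.5 corresponds to plain rounding; it can be tuned per label on the validation set if some tags are rarer than others.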
It takes care of all of the heavy lifting, including Data Augmentation, Transfer Learning and Hyper Parameter search on their GPU clusters. It does all of this within an hour, and provides a REST API to integrate the model with your services. They also provide an annotation service if required.</p><p>It is pretty easy to get started with the Nanonets API. This section gives an overview about the steps involved in setting up the API to perform the same Multi Label food classification experiment. For a more detailed set of instructions, check out the GitHub repository over <a href="https://github.com/thatbrguy/Multilabel-Classification">here</a>.</p><h4>Part 1 — Setup</h4><p>Clone the GitHub <a href="https://github.com/thatbrguy/Multilabel-Classification">repository</a>. Obtain a free API key from <a href="http://app.nanonets.com/user/api_key">Nanonets,</a> set the appropriate environment variables, and run create_model.py as explained in the <a href="https://github.com/thatbrguy/Multilabel-Classification">repository</a>.</p><p>Note: In create_model.py we have to specify the list of possible labels (in our case, 10 food categories). I have already specified the labels in the code so you can directly run the above step. If you are using it for any other application, edit the list of possible labels inside this file.</p><h4>Part 2 — Upload the Dataset</h4><p>Nanonets requires the dataset to be provided in the following directory structure:</p><pre>-multilabel_data<br>|-ImageSets<br>||-image1.jpg<br>||-image2.jpg<br>|-Annotations<br>||-image1.txt<br>||-image2.txt</pre><p>I have already created the dataset in this format and provided a download link (and some instructions) in the GitHub <a href="https://github.com/thatbrguy/Multilabel-Classification">repository</a>. 
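A hypothetical sketch of building that layout programmatically (the image/annotation file names and the contents of the .txt file are illustrative assumptions, not the Nanonets specification):

```python
from pathlib import Path
import tempfile

# Recreate the directory layout shown above: image files under
# ImageSets/, one label .txt per image under Annotations/.
root = Path(tempfile.mkdtemp()) / "multilabel_data"
(root / "ImageSets").mkdir(parents=True)
(root / "Annotations").mkdir(parents=True)

# One hypothetical image/annotation pair (annotation format assumed).
(root / "ImageSets" / "image1.jpg").touch()
(root / "Annotations" / "image1.txt").write_text("Dessert\nCarbs\n")

print(sorted(p.name for p in root.rglob("*") if p.is_file()))
# ['image1.jpg', 'image1.txt']
```

The key invariant is that every image in ImageSets/ has a matching .txt of the same stem in Annotations/.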
By running upload_training.py , the data is automatically pushed to Nanonets.</p><h4>Part 3 — Training and Inference</h4><p>Once the dataset is uploaded, you can execute train_model.py to start the training process. The script model_state.py will keep you updated about the current state of the model. You can also check out the status of your model from your user page at Nanonets as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/935/1*Fowed8k7LyEbZxER1n8anw.png" /><figcaption>Status of your model in Nanonets</figcaption></figure><p>Once your model is trained, you can run prediction.py to use the deployed model! You can also observe the sample JSON output from your user page as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mS4CMw7mo3i84j6luarh0g.png" /><figcaption>Sample JSON response as shown by Nanonets.</figcaption></figure><h4>Performance</h4><p>Let us first perform a rudimentary analysis of the training time of the various Keras models. The training time for 100 epochs in minutes is plotted in the below bar graph.</p><p>The MobileNet is the fastest to train owing to its efficient architecture. Unsurprisingly, the Xception network takes a lot of time as it is the most complex network among the ones we compared.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/623/1*Yfm-LuSX6zTdjnsksedrMA.png" /><figcaption>Training Time in Minutes.</figcaption></figure><p>Do note that the training time <strong>does not</strong> account for the time incurred for hyperparameter search, model tuning, and model deployment. These factors greatly add on to the time required to move a model to production. However, Nanonets provided a production ready model within 30 minutes, even after accounting for all of the above factors.</p><p>Without a doubt, Nanonets trained faster than the Keras models. But how does it fare performance wise? 
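For reference, the samples-averaged F1 score used to compare the models can be computed along these lines (a minimal NumPy sketch, equivalent in spirit to scikit-learn's f1_score with average='samples'; the toy label vectors are illustrative):

```python
import numpy as np

def samples_f1(y_true, y_pred):
    """Per-sample F1 over binary label vectors, averaged across samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = (y_true & y_pred).sum(axis=1)              # true positives per sample
    denom = y_true.sum(axis=1) + y_pred.sum(axis=1)  # |true| + |pred|
    # F1 = 2*TP / (|true| + |pred|); define 0/0 as 0 for empty rows.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return f1.mean()

y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1]]
print(samples_f1(y_true, y_pred))  # 2/3 for both samples, so 0.666...
```

Averaging per sample (rather than per label) rewards getting the full tag set of each image right, which matches how the predictions are consumed.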
Below we plot the F1 score obtained by the various Keras models and Nanonets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/627/1*kwhfKnRZvr7pwfoT34VJpA.png" /><figcaption>F1 Score of various models.</figcaption></figure><p>Nanonets clearly has a higher score than the Keras models. Surprisingly, the MobileNet model came very close to catching up. Due to its parameter-efficient architecture, it can mitigate overfitting better compared to the other Keras models. The relatively lower score of all models can either be attributed to the complexity and limited size of the dataset, or to noisy annotations. Let us visually inspect the output as well.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mDGQtFKyoXteHp37-2SpVQ.png" /><figcaption>Predicted vs Actual labels</figcaption></figure><p>Looks like our training was pretty successful! By using a larger dataset we could achieve better performance. You can also further experiment by using different datasets for other innovative applications by applying the concepts discussed above.</p><p><em>Originally published at </em><a href="https://blog.nanonets.com/multi-label-classification-using-deep-learning/"><em>https://blog.nanonets.com</em></a><em> on March 18, 2019.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cbe9b1dc302c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nanonets/how-to-easily-classify-food-using-deep-learning-and-tensorflow-cbe9b1dc302c">How To Easily Classify Food Using Deep Learning And TensorFlow</a> was originally published in <a href="https://medium.com/nanonets">NanoNets</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Simple Guide to Semantic Segmentation]]></title>
            <link>https://medium.com/beyondminds/a-simple-guide-to-semantic-segmentation-effcf83e7e54?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/effcf83e7e54</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[neural-networks]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Mon, 04 Mar 2019 07:37:17 GMT</pubDate>
            <atom:updated>2019-03-04T07:37:17.365Z</atom:updated>
            <content:encoded><![CDATA[<h4>A comprehensive review of Classical and Deep Learning methods for Semantic Segmentation</h4><p><em>Written by </em><a href="https://medium.com/u/7d6e83a807b8"><em>Bharath Raj</em></a><em> with feedback from </em><a href="https://www.linkedin.com/in/noyshu/"><em>Noy Shulman</em></a><em> and </em><a href="https://www.linkedin.com/in/rotem-alaluf-b37509137/"><em>Rotem Alaluf</em></a><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NqVnXjHpYZk-BFJjOIFy3g.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/35sVnCCynWA?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">JFL</a> on <a href="https://unsplash.com/search/photos/texture?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>Semantic Segmentation is the process of assigning a label to every pixel in the image. This is in stark contrast to classification, where a single label is assigned to the entire picture. Semantic segmentation treats multiple objects of the same class as a single entity. On the other hand, instance segmentation treats multiple objects of the same class as distinct individual objects (or instances). Typically, instance segmentation is harder than semantic segmentation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/795/1*USmkmfMq7C5mrrLpYCXxUA.png" /><figcaption>Comparison between semantic and instance segmentation. (<a href="http://www.robots.ox.ac.uk/~tvg/publications/2017/CRFMeetCNN4SemanticSegmentation.pdf">Source</a>)</figcaption></figure><p>This blog explores some methods to perform semantic segmentation using classical as well as deep learning based approaches. Moreover, popular loss function choices and applications are discussed.</p><h3>Classical Methods</h3><p>Before the deep learning era kicked in, a good number of image processing techniques were used to segment an image into regions of interest. 
Some of the popular methods used are listed below.</p><h4>Gray Level Segmentation</h4><p>The simplest form of semantic segmentation involves assigning hard-coded rules or properties a region must satisfy for it to be assigned a particular label. The rules can be framed in terms of the pixel’s properties such as its gray level intensity. One such method that uses this technique is the <a href="https://en.wikipedia.org/wiki/Split_and_merge_segmentation">Split and Merge</a> algorithm. This algorithm recursively splits an image into sub-regions until a label can be assigned, and then combines adjacent sub-regions with the same label by merging them.</p><p>The problem with this method is that rules must be hard-coded. Moreover, it is extremely difficult to represent complex classes such as humans with just gray level information. Hence, feature extraction and optimization techniques are needed to properly learn the representations required for such complex classes.</p><h4>Conditional Random Fields</h4><p>Consider segmenting an image by training a model to assign a class per pixel. In case our model is not perfect, we may obtain noisy segmentation results that may be impossible in nature (such as dog pixels mixed with cat pixels, as shown in the image).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rMebznseZwQPzGu2OwUeiw.png" /><figcaption>Pixels with label dog mixed with pixels with label cat (image c). A more realistic segmentation is shown in image d. (<a href="http://www.robots.ox.ac.uk/~tvg/publications/2017/CRFMeetCNN4SemanticSegmentation.pdf">Source</a>)</figcaption></figure><p>These can be avoided by considering a prior relationship among pixels, such as the fact that objects are continuous and hence nearby pixels tend to have the same label. To model these relationships, we use Conditional Random Fields (CRFs).</p><p>CRFs are a class of statistical modelling methods used for structured prediction. 
Unlike discrete classifiers, CRFs can consider “neighboring context” such as relationships between pixels before making predictions. This makes them ideal candidates for semantic segmentation. This section explores the usage of CRFs for semantic segmentation.</p><p>Each pixel in the image is associated with a finite set of possible states. In our case, the target labels are the set of possible states. The cost of assigning a state (or label, <strong>u</strong>) to a single pixel <strong>(x)</strong> is known as its unary cost. To model relationships between pixels, we also consider the cost of assigning a pair of labels <strong>(u,v)</strong> to a pair of pixels <strong>(x,y)</strong>, known as the pairwise cost. We can consider pairs of pixels that are immediate neighbors (Grid CRF) or we can consider all pairs of pixels in the image (Dense CRF).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*_XexR4cVQU0ebeLqTV28jw.jpeg" /><figcaption>Dense vs Grid CRF. (<a href="https://slideplayer.com/slide/784090/3/images/2/Dense+CRF+construction.jpg">Source</a>)</figcaption></figure><p>The sum of the unary and pairwise cost of all pixels is known as the energy (or cost/loss) of the CRF. This value can be minimized to obtain a good segmentation output.</p><h3>Deep Learning Methods</h3><p>Deep Learning has greatly simplified the pipeline to perform semantic segmentation and is producing results of impressive quality. In this section, we discuss popular model architectures and loss functions used to train these deep learning methods.</p><h4>1. Model Architectures</h4><p>One of the simplest and most popular architectures used for semantic segmentation is the Fully Convolutional Network (FCN). In the paper <a href="https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf"><em>FCN for Semantic Segmentation</em></a>, the authors use the FCN to first downsample the input image to a smaller size (while gaining more channels) through a series of convolutions. 
This set of convolutions is typically called the <strong>encoder</strong>. The encoded output is then upsampled either through bilinear interpolation or a series of transpose-convolutions. This set of transposed-convolutions is typically called the <strong>decoder</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*edkNzGBBDBXtpZMq-pnSng.png" /><figcaption>Downsampling and Upsampling in an FCN. (<a href="https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf">Source</a>)</figcaption></figure><p>This basic architecture, despite being effective, has a number of drawbacks. One such drawback is the presence of checkerboard artifacts due to uneven overlap of the output of the transpose-convolution (or deconvolution) operation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/995/1*yec82Kg1lwRiHbF8lTd-sA.png" /><figcaption>Formation of Checkerboard Artifacts. (<a href="https://distill.pub/2016/deconv-checkerboard/">Source</a>)</figcaption></figure><p>Another drawback is poor resolution at the boundaries due to loss of information from the process of encoding.</p><p>Several solutions were proposed to improve the performance quality of the basic FCN model. Below are some of the popular solutions that proved to be effective:</p><h4>U-Net</h4><p>The <a href="https://arxiv.org/abs/1505.04597">U-Net</a> is an upgrade to the simple FCN architecture. It has skip connections from the output of convolution blocks to the corresponding input of the transposed-convolution block at the same level.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/1*DxXHzO7JIZs24g-UQmZ8Hw.png" /><figcaption>U-Net. (<a href="https://arxiv.org/pdf/1505.04597.pdf">Source</a>)</figcaption></figure><p>These skip connections allow gradients to flow better and provide information from multiple scales of the image size. Information from larger scales (upper layers) can help the model classify better. 
Information from smaller scales (deeper layers) can help the model segment/localize better.</p><h4>Tiramisu Model</h4><p>The <a href="https://arxiv.org/abs/1611.09326">Tiramisu</a> Model is similar to the U-Net except for the fact that it uses Dense Blocks for convolution and transposed-convolutions as done in the <a href="https://arxiv.org/pdf/1608.06993.pdf">DenseNet</a> paper. A Dense Block consists of several layers of convolutions where the feature-maps of all preceding layers are used as inputs for all subsequent layers. The resultant network is extremely parameter efficient and can better access features from older layers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/459/1*yX-Sgvyfn7izUKZjFO3BHA.png" /><figcaption>Tiramisu Network. (<a href="https://arxiv.org/abs/1611.09326">Source</a>)</figcaption></figure><p>A downside of this method is that due to the nature of the concatenation operations in several ML frameworks, it is not very memory efficient (requires a large GPU to run).</p><h4>MultiScale methods</h4><p>Some Deep Learning models explicitly introduce methods to incorporate information from multiple scales. For instance, the Pyramid Scene Parsing Network (<a href="https://arxiv.org/pdf/1612.01105.pdf">PSPNet</a>) performs the pooling operation (max or average) using four different kernel sizes and strides to the output feature map of a CNN such as the ResNet. It then upsamples the size of all the pooling outputs and the CNN output feature map using bilinear interpolation, and concatenates all of them along the channel axis. A final convolution is performed on this concatenated output to generate the prediction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SHicQ4LCjn04aHpuKIqC-Q.png" /><figcaption>PSPNet. 
(<a href="https://arxiv.org/pdf/1612.01105.pdf">Source</a>)</figcaption></figure><p>Atrous (Dilated) Convolutions present an efficient method to combine features from multiple scales without increasing the number of parameters by a large amount. By adjusting the dilation rate, the same filter has its weight values spread out farther in space. This enables it to learn more global context.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_UXhPeEeeOP667o9PLihhA.png" /><figcaption>Cascaded Atrous Convolutions. (<a href="https://arxiv.org/pdf/1706.05587.pdf">Source</a>)</figcaption></figure><p>The <a href="https://arxiv.org/pdf/1706.05587.pdf">DeepLabv3</a> paper uses Atrous Convolutions with different dilation rates to capture information from multiple scales, without significant loss in image size. They experiment with using Atrous convolutions in a cascaded manner (as shown above) and also in a parallel manner in the form of Atrous Spatial Pyramid Pooling (as shown below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*WeIGBdZkYXzFUWicX8bueg.png" /><figcaption>Parallel Atrous Convolutions. (<a href="https://arxiv.org/pdf/1706.05587.pdf">Source</a>)</figcaption></figure><h4>Hybrid CNN-CRF methods</h4><p>Some methods use a CNN as a feature extractor and then use the features as unary cost (potential) input to a Dense CRF. This hybrid CNN-CRF method offers good results due to the ability of CRFs to model inter-pixel relationships.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WnyZFykBL2s3Emn6JU0M_g.png" /><figcaption>Methods using combinations of CNN and CRF. 
(<a href="http://www.robots.ox.ac.uk/~tvg/publications/2017/CRFMeetCNN4SemanticSegmentation.pdf">Source</a>)</figcaption></figure><p>Certain methods incorporate the CRF within the neural network itself, as presented in <a href="https://www.robots.ox.ac.uk/~szheng/papers/CRFasRNN.pdf">CRF-as-RNN</a>, where the Dense CRF is modelled as a Recurrent Neural Network. This enables end-to-end training, as illustrated in the above image.</p><h4>2. Loss Functions</h4><p>Unlike normal classification, semantic segmentation requires a different choice of loss function. Below are some of the popular loss functions used for semantic segmentation:</p><h4>Pixel-wise Softmax with Cross Entropy</h4><p>Labels for semantic segmentation are of the same size as the original image. The label can be represented in one-hot encoded form as depicted below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/450/1*6H-CaeUZC-QExm0WfcX3jg.jpeg" /><figcaption>One-Hot format for semantic segmentation. (<a href="http://ronny.rest/media/tutorials/segmentation/ZZZ_IMAGES_DIR/label_formats.jpeg">Source</a>)</figcaption></figure><p>Since the label is in a convenient one-hot form, it can be directly used as the ground truth (target) for calculating cross-entropy. However, softmax must be applied pixel-wise on the predicted output before applying cross entropy, as each pixel can belong to any one of our target classes.</p><h4>Focal Loss</h4><p>Focal Loss, introduced in the <a href="https://arxiv.org/pdf/1708.02002.pdf">RetinaNet</a> paper, proposes an upgrade to the standard cross-entropy loss for usage in cases with extreme class imbalance.</p><p>Consider the plot of the standard cross entropy loss equation as shown below (Blue color). Even in the case where our model is pretty confident about a pixel’s class (say 80%), it has a tangible loss value (here, around 0.3).
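Both curves are easy to evaluate directly. A minimal Python sketch of the standard cross-entropy value alongside the focal variant discussed next (gamma = 2 matches the purple curve; the 80% confidence figure is the one from the example above):

```python
import math

def cross_entropy(p):
    """Standard cross-entropy loss when the true class is predicted with confidence p."""
    return -math.log(p)

def focal_loss(p, gamma=2.0):
    """Focal loss: the (1 - p)**gamma factor damps the loss for confident predictions."""
    return (1 - p) ** gamma * -math.log(p)

p = 0.8  # the model is 80% confident about the correct class
print(round(cross_entropy(p), 3))  # 0.223 -- small, but still tangible
print(round(focal_loss(p), 4))     # 0.0089 -- nearly zero
```

The exact cross-entropy value at 80% confidence is −ln(0.8) ≈ 0.22; with gamma = 2 the focusing factor (1 − 0.8)² = 0.04 shrinks it by a factor of 25.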
On the other hand, Focal Loss (Purple color, with gamma=2) does not penalize the model to such a large extent when the model is confident about a class (i.e. loss is nearly 0 for 80% confidence).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/567/1*IHAAgRjtH4HZ_Lgu3OIbAg.png" /><figcaption>Standard Cross Entropy (Blue) vs Focal Loss with various values of gamma. (<a href="https://arxiv.org/pdf/1708.02002.pdf">Source</a>)</figcaption></figure><p>Let us explore why this is significant with an intuitive example. Assume we have an image with 10000 pixels, with only two classes: Background class (0 in one-hot form) and Target class (1 in one-hot form). Let us assume 97% of the image is the background and 3% of the image is the target. Now, say our model is 80% sure about pixels that are background, but only 30% sure about pixels that are the target class.</p><p>While using cross-entropy, loss due to background pixels is equal to <strong>(97% of 10000) * 0.3</strong> which equals <strong>2910</strong>, and loss due to target pixels is equal to <strong>(3% of 10000) * 1.2</strong> which equals <strong>360</strong>. Clearly, the loss due to the more confident class dominates, and there is very low incentive for the model to learn the target class. Comparatively, with focal loss, loss due to background pixels is approximately <strong>(97% of 10000) * 0</strong>, which is nearly 0. This allows the model to learn the target class better.</p><h4>Dice Loss</h4><p>Dice Loss is another popular loss function used for semantic segmentation problems with extreme class imbalance. Introduced in the <a href="https://arxiv.org/pdf/1606.04797.pdf">V-Net</a> paper, the Dice Loss is used to calculate the overlap between the predicted class and the ground truth class. The Dice Coefficient (D) is represented as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/257/1*ftK1SeMgOIG9FWjOcINGbw.png" /><figcaption>Dice Coefficient.
(<a href="https://arxiv.org/pdf/1606.04797.pdf">Source</a>)</figcaption></figure><p>Our objective is to maximize the overlap between the predicted and ground truth class (i.e. to maximize the Dice Coefficient). Hence, we generally minimize <strong>(1-D)</strong> instead to obtain the same objective, as most ML libraries provide options for minimization only.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/534/1*HFLXmW2wMKB_xDN-HgEtrw.png" /><figcaption>Derivative of Dice Coefficient. (<a href="https://arxiv.org/pdf/1606.04797.pdf">Source</a>)</figcaption></figure><p>Even though Dice Loss works well for samples with class imbalance, the formula for calculating its derivative (shown above) has squared terms in the denominator. When those values are small, we could get large gradients, leading to training instability.</p><h3>Applications</h3><p>Semantic Segmentation is used in various real-life applications. Following are some of the significant use cases of semantic segmentation.</p><h4>Autonomous Driving</h4><p>Semantic segmentation is used to identify lanes, vehicles, people and other objects of interest. The resulting segmentation map is used to make intelligent decisions to guide the vehicle properly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QgHLw45KG7-LR_ZSkAqRsw.png" /><figcaption>Semantic segmentation for autonomous vehicles. (<a href="https://wiki.tum.de/download/attachments/23561833/sms.png?version=1&amp;modificationDate=1483619907233&amp;api=v2">Source</a>)</figcaption></figure><p>One constraint on autonomous vehicles is that performance must be real-time. A solution to the above problem is to integrate a GPU locally along with the vehicle.
To enhance the performance of the above solution, lighter (low parameters) neural networks can be used or techniques to fit neural networks on the <a href="https://heartbeat.fritz.ai/how-to-fit-large-neural-networks-on-the-edge-eb621cdbb33">edge</a> can be implemented.</p><h4>Medical Image Segmentation</h4><p>Semantic Segmentation is used to identify salient elements in medical scans. It is especially useful to identify abnormalities such as tumors. High accuracy and recall are of critical importance for these applications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*Kyv4okvercqpYPh1fSPieA.png" /><figcaption>Segmentation of medical scans. (<a href="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/1b699b098ec7a5e539afd8370d71a82d41e3370d/3-Figure1-1.png">Source</a>)</figcaption></figure><p>We can also automate less critical operations such as estimating the volume of organs from 3D semantically segmented scans.</p><h4>Scene Understanding</h4><p>Semantic segmentation usually forms the base for more complex tasks such as Scene Understanding and Visual Question Answering (VQA). A scene graph or a caption is usually the output of scene understanding algorithms.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/994/1*Q28zA8NBPPwTWmm8o2Lstw.png" /><figcaption>Scene Understanding in action. (<a href="https://arxiv.org/pdf/1606.04797.pdf">Source</a>)</figcaption></figure><h4>Fashion Industry</h4><p>Semantic Segmentation is used in the Fashion Industry to extract clothing items from an image to provide similar suggestions from retail shops. More advanced algorithms can “re-dress” particular items of clothing in an image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R7ZO8eLo240C9GCETFyS9g.png" /><figcaption>Semantic segmentation used as an intermediate step to redress a human based on text input.
(<a href="https://arxiv.org/abs/1710.07346">Source</a>)</figcaption></figure><h4>Satellite (Or Aerial) Image Processing</h4><p>Semantic Segmentation is used to identify types of land from satellite imagery. Typical use cases involve segmenting water bodies to provide accurate map information. Other advanced use cases involve mapping roads, identifying types of crops, identifying free parking spaces and so on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/882/1*7eMYJ7vCZzE88bZcQdTkdw.png" /><figcaption>Semantic segmentation of satellite/aerial images. (<a href="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/59cbe15b43e6ca172fce40786be68340f50be541/12-Figure1.1-1.png">Source</a>)</figcaption></figure><h3>Conclusion</h3><p>Deep Learning greatly enhanced and simplified Semantic Segmentation algorithms and paved the way for greater adoption in real-life applications. The concepts listed in this blog are not exhaustive as research communities continuously strive to enhance the accuracy and real-time performance of these algorithms. Nevertheless, this blog introduces some popular variants of these algorithms and their real-life applications.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=effcf83e7e54" width="1" height="1" alt=""><hr><p><a href="https://medium.com/beyondminds/a-simple-guide-to-semantic-segmentation-effcf83e7e54">A Simple Guide to Semantic Segmentation</a> was originally published in <a href="https://medium.com/beyondminds">BeyondMinds</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Depth Estimation]]></title>
            <link>https://medium.com/beyondminds/depth-estimation-cad24b0099f?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/cad24b0099f</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Sat, 16 Feb 2019 14:05:21 GMT</pubDate>
            <atom:updated>2019-02-17T11:14:49.731Z</atom:updated>
            <content:encoded><![CDATA[<h4>A comprehensive review of techniques used to estimate depth using Machine Learning and classical methods.</h4><p><em>Written by </em><a href="https://medium.com/u/7d6e83a807b8"><em>Bharath Raj</em></a><em> with feedback from </em><a href="https://www.linkedin.com/in/rotem-alaluf-b37509137/"><em>Rotem Alaluf</em></a><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k15IG4yYYk_cgsmXMNpeBg.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/zjPasgivhcc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Osman Rana</a> on <a href="https://unsplash.com/search/photos/far?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>Conventional displays are two dimensional. A picture or a video of the three dimensional world is encoded to be stored in two dimensions. Needless to say, we lose the information corresponding to the third dimension: depth.</p><p>2D representation is good enough for most applications. However, there are applications that require information to be provided in three dimensions. An important application is robotics, where information in three dimensions is required to accurately move the actuators. Clearly, some provisions have to be made to incorporate the lost depth information, and this blog explores such concepts.</p><h4>How do we estimate depth?</h4><p>Our eyes estimate depth by comparing the images obtained by our left and right eyes. The minor displacement between both viewpoints is enough to calculate an approximate depth map. We call the pair of images obtained by our eyes a stereo pair.
This, combined with our variable-focal-length lenses and our general experience of “seeing things”, allows us to have seamless 3D vision.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/425/1*9vMU9qDgaqt67SdJU0o06g.jpeg" /><figcaption>Stereo image pair formed due to the different viewpoints with respect to the left eye and the right eye. (<a href="http://img.tomshardware.com/de/2005/05/18/dreidimensionales_sehen_ohne_brille_monster_springen_dich_an/wahrnehmung.jpg">Source</a>)</figcaption></figure><p>Engineers and researchers have tried to emulate this mechanism to extract depth information from the environment. There are numerous approaches to reach the same outcome. We will explore the hardware and software approaches separately.</p><h3>Hardware:</h3><h4>1. Dual camera technology</h4><p>Some devices have two cameras separated by a small distance (usually a few millimeters) to capture images from different viewpoints. These two images form a stereo pair, which is used to compute depth information.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/891/1*rK8OMRzOEBk9ZR0SnRFaIQ.jpeg" /><figcaption>Dual camera separated by a small distance on a mobile phone. (<a href="https://images.dazeinfo.com/wp-content/uploads/2017/11/dual-camera-smartphone-shipments-india-Q2-2017.jpg">Source</a>)</figcaption></figure><h4><strong>2. Dual pixel technology</strong></h4><p>An alternative solution to the Dual Camera technology is Dual Pixel Autofocus (<a href="https://www.usa.canon.com/internet/portal/us/home/learn/education/topics/article/2018/July/Intro-to-Dual-Pixel-Autofocus-(DPAF)/Intro-to-Dual-Pixel-Autofocus-(DPAF)">DPAF</a>) technology.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/400/1*Z-r1Vu8I5YlvAJgqrqX2JQ.png" /><figcaption>Calculation of depth using DPAF on the Google Pixel 2.
(<a href="https://ai.googleblog.com/2017/10/portrait-mode-on-pixel-2-and-pixel-2-xl.html">Source</a>)</figcaption></figure><p>Each pixel comprises two photodiodes, which are separated by a very small distance (less than a millimeter). Each photodiode captures its image signal separately, and the two signals are then analyzed. This distance of separation is surprisingly sufficient for the images produced by the photodiodes to be considered as a stereo-image pair. Notably, the <a href="https://ai.googleblog.com/2017/10/portrait-mode-on-pixel-2-and-pixel-2-xl.html">Google Pixel 2</a> uses this technology to calculate depth information.</p><h4>3. Sensors</h4><p>A good alternative to multiple cameras is to use sensors that can infer distance. For instance, the first version of <a href="https://courses.engr.illinois.edu/cs498dh/fa2011/lectures/Lecture%2025%20-%20How%20the%20Kinect%20Works%20-%20CP%20Fall%202011.pdf">Kinect</a> used an Infra-Red (IR) projector to achieve this. A pattern of IR dots is projected onto the environment, and a monochrome CMOS sensor (placed a few centimeters apart) receives the reflected rays. The difference between the expected and received IR dot positions is calculated to produce the depth information.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/610/1*ZcE9tqy6unvIxrrVtm6uUQ.png" /><figcaption>Kinect sensor in action. (<a href="https://appleinsider.com/articles/14/07/11/apples-secret-plans-for-primesense-3d-tech-hinted-at-by-new-itseez3d-ipad-app">Source</a>)</figcaption></figure><p>LIDAR systems fire laser pulses at the objects in the environment, and measure the time it takes for these pulses to get reflected back (also known as time of flight). They additionally measure the change in wavelength of these laser pulses. This can give accurate depth information.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/616/1*P1bTz2TsAmAjjvienVxtIw.jpeg" /><figcaption>LIDAR in action.
(<a href="https://img.newatlas.com/velodyne-lidar-vls-128-sensor-2.jpg?auto=format%2Ccompress&amp;ch=Width%2CDPR&amp;fit=crop&amp;h=347&amp;q=60&amp;rect=0%2C45%2C1000%2C562&amp;w=616&amp;s=8d952db4007819de462bb3464de5c5fc">Source</a>)</figcaption></figure><p>An alternative and inexpensive solution would be to use Ultrasonic sensors. These sensors usually include a transmitter that projects ultrasonic sound waves towards the target. The waves are reflected by the target back to the sensor. By measuring the time the waves take to return to the sensor, we can measure the distance to the target. However, sound-based sensors may perform <a href="https://www.bannerengineering.com/in/en/company/expert-insights/ultrasonic-sensors-101.html">poorly</a> in noisy environments.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/300/1*NcYOIgf1FIeskIG1rqnfvQ.jpeg" /><figcaption>A typical low cost ultrasonic sensor. (<a href="https://www.makerlab-electronics.com/my_uploads/2016/05/ultrasonic-sensor-HCSR04-1.jpg">Source</a>)</figcaption></figure><h3>Software:</h3><p>Using additional hardware not only increases the cost of production, but also makes the depth estimation methods incompatible with other devices. Fortunately, methods to estimate depth using software-only techniques do exist, and they are an active research topic. Below are some of the popular methods to estimate depth using software:</p><h4>1. Multiple image methods</h4><p>The easiest way to calculate depth information without using additional hardware is to take multiple images of the same scene with slight displacements. By matching keypoints that are common to the images, we can reconstruct a 3D model of the scene.
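The matching idea can be demonstrated with even a crude scheme. A toy NumPy sketch that recovers the horizontal displacement between two one-dimensional "views" (the signal, shift, and search range are made up for illustration; real pipelines match 2D keypoints instead):

```python
import numpy as np

# Toy stand-in for two views of the same scene: the second view is the
# first one shifted horizontally by a known amount.
rng = np.random.default_rng(0)
left = rng.random(64)
true_shift = 5
right = np.roll(left, -true_shift)

def estimate_shift(a, b, max_shift=10):
    """Try candidate offsets and keep the one with the smallest
    sum of squared differences -- a bare-bones matching scheme."""
    errors = [np.sum((a - np.roll(b, s)) ** 2) for s in range(max_shift)]
    return int(np.argmin(errors))

# The recovered shift plays the role of disparity in a real stereo pair;
# with known camera geometry it can then be converted to depth.
print(estimate_shift(left, right))  # -> 5
```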
Algorithms such as Scale-Invariant Feature Transform (<a href="https://en.wikipedia.org/wiki/Scale-invariant_feature_transform">SIFT</a>) are excellent at this task.</p><p>To make this method more robust, we can measure the change in orientation of the device to calculate the physical distance between the two images. This can be done by measuring the accelerometer and gyroscope data of the device. For instance, Visual-Inertial Odometry is used in Apple’s ARKit to calculate the depth and other attributes of the scene. User experience is refined as even slight motions of the device are enough to create stereo image information.</p><h4>2. Single image methods</h4><p>There are several single-image depth estimation methods as well. These methods usually involve a neural network trained on pairs of images and their depth maps. Such methods are easy to interpret and construct, and provide decent accuracy. Below are examples of some popular learning-based methods.</p><h4>A. Supervised Learning based methods</h4><p>Supervised methods require some sort of labels to be trained. Usually, the labels are pixel-wise RGB-D depth maps. In such cases, the trained model can directly output the depth map. Commonly used depth datasets include the <a href="https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html">NYUv2</a> dataset, which contains RGB-D depth maps for indoor images, and the <a href="http://make3d.cs.cornell.edu/data.html">Make3D</a> dataset, which contains RGB-D depth maps for outdoor images. You can check out this GitHub <a href="https://github.com/scott89/awesome-depth">repo</a> for information on more datasets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*4PH2J5iGG1wgnIj4XzQEmg.jpeg" /><figcaption>Sample image (left) and its depth annotation in RGB-D (right).
(<a href="https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html">Source</a>)</figcaption></figure><p>Target labels need not necessarily be pure depth maps, but can also be a function of depth maps, such as hazy images. Hence, we can use hazy and haze-free image pairs for training the model, and then the depth can be extracted using a function that relates a hazy image with its depth value. For this discussion, we will only concentrate on methods that use depth maps as target labels.</p><p>Autoencoders are among the simplest types of networks used to extract depth information. Popular variants involve using <a href="https://arxiv.org/abs/1505.04597">U-Nets</a>, which are convolutional autoencoders with residual skip connections connecting feature maps from the downsampling (output of convolutions) and upsampling (output of transposed convolutions) arms.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lvXoKMHoPJMKpKK7keZMEA.png" /><figcaption>Standard U-Net architecture. (<a href="https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/u-net-architecture.png">Source</a>)</figcaption></figure><p>Improvements can be made over the basic structure. For instance, in the <a href="https://arxiv.org/pdf/1411.4734v4.pdf">paper</a> “Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture”, multiple neural networks are used, with each network operating on the input at a different scale. The parameters of each network, such as kernel size and stride, are different. The authors claim that extracting information from multiple scales yields higher-quality depth maps than single-scale extraction.</p><p>An improvement over the above method is <a href="https://arxiv.org/abs/1803.11029">presented</a> in “Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation”.
Here they use a single end-to-end trainable model, but they fuse feature maps of different scales using structured attention guided Conditional Random Fields (<a href="https://en.wikipedia.org/wiki/Conditional_random_field">CRFs</a>) before feeding the result as input to the last convolution operation.</p><p>Other methods treat depth extraction as an image-to-image translation problem. Conventional image translation methods are based on the pix2pix <a href="https://phillipi.github.io/pix2pix/">paper</a>. These methods directly extract the depth map given an input image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Iw8XCmw7VccR_9V1HCgbcQ.jpeg" /><figcaption>Image translation in action. (<a href="https://phillipi.github.io/pix2pix/">Source</a>)</figcaption></figure><p>Similarly, improvements can be made over this structure as well. The performance can be enhanced by improving GAN stability and output quality, by using methods like <a href="https://arxiv.org/pdf/1704.00028.pdf">gradient penalty</a>, <a href="https://arxiv.org/abs/1805.08318">self-attention</a> and <a href="https://arxiv.org/abs/1603.08155">perceptual loss</a>.</p><h4>B. Unsupervised Learning based methods</h4><p>It is hard to obtain depth datasets of high quality that account for all possible background conditions. Unsurprisingly, enhancing the performance of supervised methods beyond some point is difficult due to the lack of accurate data. Semi-supervised and unsupervised methods remove the requirement of a target depth image, and hence are not limited by this constraint.</p><p>The <a href="https://arxiv.org/pdf/1603.04992.pdf">method</a> introduced by “Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue” involves generating the right image for a given left image in a stereo image pair (or vice versa). This can be performed by training an autoencoder as in the supervised scenario. Our trained model can output right-side images for any left-side image.
Now, we calculate the <a href="https://en.wikipedia.org/wiki/Binocular_disparity">disparity</a> between the two images, which in our case is the displacement of a pixel (or block) in the right-image with respect to its location in the left-image. Using the value of disparity, we can calculate the depth, given the focal length of the camera and the distance between the two images.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/728/1*AZ3E4tTcOtc1dhtafNGLrw.jpeg" /><figcaption>Calculation of depth using disparity. Baseline is the distance between the two cameras (right and left images). (<a href="https://image.slidesharecdn.com/lecture09-110524023509-phpapp02/95/lecture09-3-728.jpg?cb=1306204632">Source</a>)</figcaption></figure><p>The above method is considered to be truly unsupervised when our algorithm can adapt to non-stereo image pairs as well. This can be done by keeping track of the distance between the two images by checking the sensor data on the device. Improvements can be made over this method, as done in this <a href="https://arxiv.org/pdf/1609.03677.pdf">work</a> “Unsupervised Monocular Depth Estimation with Left-Right Consistency”, where the disparity is calculated both with respect to the left image and the right image, and then the depth is calculated by considering both values.</p><h4>Limitations</h4><p>The limitation of using learning-based methods, especially that of supervised methods, is that they may not generalize well to all use-cases. Analytical methods may not have enough information to create a robust depth map from a single image. However, incorporating domain knowledge can aid extraction of depth information in some cases.</p><p>For instance, consider <a href="https://www.robots.ox.ac.uk/~vgg/rg/papers/hazeremoval.pdf">Dark Channel Prior</a> based haze removal. The authors observed that most local patches of haze-free images have low-intensity pixels in at least one channel.
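That per-patch statistic, the so-called dark channel, is simple to compute. A minimal NumPy sketch (the patch size and the synthetic images are illustrative choices, not values from the paper):

```python
import numpy as np

def dark_channel(image, patch=15):
    """Dark channel: per-pixel minimum over the color channels, followed by
    a minimum filter over a local patch (plain sliding-window loop for clarity)."""
    per_pixel_min = image.min(axis=2)
    h, w = per_pixel_min.shape
    pad = patch // 2
    padded = np.pad(per_pixel_min, pad, mode="edge")
    out = np.empty_like(per_pixel_min)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

# A haze-free patch tends to have a dark channel near zero; adding uniform
# "haze" (airlight) lifts it, which is what ties the prior to depth.
clear = np.random.default_rng(1).random((32, 32, 3)) * 0.8
hazy = 0.5 * clear + 0.5  # crude haze model: blend the image toward white airlight
print(dark_channel(clear).mean() < dark_channel(hazy).mean())  # -> True
```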
Using this information, they created an analytical haze removal method. Since haze is a function of depth, by comparing the dehazed image with the original, the depth can be easily recovered.</p><p>A clear limitation of unsupervised methods is that they require additional domain information such as camera focal length and sensor data to measure image displacement. However, they do offer better generalization than supervised methods, at least in theory.</p><h3>Applications of depth estimation</h3><h4>1. Augmented reality</h4><p>One of the key applications of depth estimation is Augmented Reality (AR). A fundamental problem in AR is to place an object in 3D space such that its orientation, scale and perspective are properly calibrated. Depth information is vital for such processes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/796/1*6auQT4Sar0HmDyRVS5_9Ow.png" /><figcaption>An AR app that can measure the dimensions of objects. (<a href="https://cdn0.tnwcdn.com/wp-content/blogs.dir/1/files/2018/06/Screenshot_20180621-183426-796x450.png">Source</a>)</figcaption></figure><p>One impressive application is IKEA’s <a href="https://www.youtube.com/watch?v=vDNzTasuYEw">demo</a>, where you can visualize products in your home using an AR module before actually purchasing them. Using this method, we can visualize a product’s dimensions, as well as view it from multiple angles.</p><h4>2. Robotics and object trajectory estimation</h4><p>Objects in real life move in 3D space. However, since our displays are limited to two dimensions, we cannot accurately calculate motion along the third dimension.</p><p>With depth information, we can estimate the trajectory along the third dimension. Moreover, knowing the scale values, we can calculate the distance, velocity and acceleration values of the object with reasonable accuracy. This is especially useful for robots to reach or track objects in 3D space.</p><h4>3.
Haze and Fog removal</h4><p>Haze and Fog are natural phenomena that are a function of depth. Distant objects are obscured to a greater extent.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/520/1*HmwmvnbmxgjHpATZ-zDoZQ.png" /><figcaption>Example of haze removal. (<a href="https://github.com/thatbrguy/Dehaze-GAN">Source</a>)</figcaption></figure><p>Hence, image processing methods that aim to remove haze must estimate the depth information first. Haze removal is an active research topic, and several quantitative and learning-based solutions have been proposed.</p><h4>4. Portrait mode</h4><p>Portrait mode on certain smartphone devices involves focusing on certain objects of interest, and blurring other regions. Blur applied as a function of depth creates a much more appealing image than using just uniform blur.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*SdgFNXID6Z1Ooh0wYOUD7A.jpeg" /><figcaption>Blurred image (right) created using portrait mode. (<a href="https://2.bp.blogspot.com/-1plbwPnt2vg/WeV-zkSr_AI/AAAAAAAACFI/zDWndZXuAp0QyPGVjygbzz7pTuYa7X6jACLcBGAs/s640/flower-comp-s.jpg">Source</a>)</figcaption></figure><h3>Conclusion</h3><p>Depth Estimation is a challenging problem with numerous applications. Through efforts taken by the research community, powerful and inexpensive solutions using Machine Learning are becoming more commonplace. These and many other related solutions would greatly pave the way for innovative applications using Depth Estimation in many domains.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cad24b0099f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/beyondminds/depth-estimation-cad24b0099f">Depth Estimation</a> was originally published in <a href="https://medium.com/beyondminds">BeyondMinds</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Advances in Generative Adversarial Networks]]></title>
            <link>https://medium.com/beyondminds/advances-in-generative-adversarial-networks-7bad57028032?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/7bad57028032</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Thu, 31 Jan 2019 09:32:03 GMT</pubDate>
            <atom:updated>2019-02-11T13:08:12.167Z</atom:updated>
            <content:encoded><![CDATA[<h3>Advances in Generative Adversarial Networks (GANs)</h3><h4>A summary of the latest advances in Generative Adversarial Networks</h4><p><em>Written by </em><a href="https://medium.com/u/7d6e83a807b8"><em>Bharath Raj</em></a><em> with feedback from </em><a href="https://www.linkedin.com/in/rotem-alaluf-b37509137/"><em>Rotem Alaluf</em></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZfejbNRvN9qHI9de3ugEjQ.jpeg" /><figcaption><a href="https://unsplash.com/photos/gKlnwCvghTU?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Art by Lønfeldt</a> on <a href="https://unsplash.com/search/photos/abstract?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>Generative Adversarial Networks are a powerful class of neural networks with remarkable applications. They essentially consist of a system of two neural networks — the <strong>Generator</strong> and the <strong>Discriminator </strong>— dueling each other.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/926/1*lhPwxtvyEntvCYnuRQd1nQ.png" /><figcaption>GANs in action. (<a href="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/generative_models.html">Source</a>)</figcaption></figure><p>Given a set of target samples, the <strong>Generator</strong> tries to produce samples that can <strong>fool</strong> the <strong>Discriminator</strong> into believing they are real. The <strong>Discriminator</strong> tries to <strong>distinguish</strong> <strong>real</strong> (target) samples from <strong>fake</strong> (generated) samples. Using this iterative training approach, we eventually end up with a Generator that is really good at generating samples similar to the target samples.</p><p>GANs have a plethora of applications, as they can learn to mimic data distributions of almost any kind.
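The duel described above boils down to two concrete loss terms. A minimal NumPy sketch of the standard (non-saturating) GAN losses, with made-up discriminator outputs for illustration:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """The Discriminator wants D(x) -> 1 on real samples and D(G(z)) -> 0 on fakes."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating Generator loss: push D(G(z)) toward 1."""
    return -np.mean(np.log(d_fake))

# Made-up probabilities for a discriminator that is currently "winning" the duel.
d_real = np.array([0.9, 0.8])  # real samples judged as probably real
d_fake = np.array([0.1, 0.2])  # generated samples judged as probably fake
print(discriminator_loss(d_real, d_fake))  # small: the discriminator is doing well
print(generator_loss(d_fake))              # large: the generator is failing to fool it
```

Training alternates gradient steps on these two losses; the tension between them is exactly what makes the procedure powerful but unstable.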
Popularly, GANs are used for removing artefacts, super resolution, pose transfer, and literally any kind of image translation, as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aZTXGhMpdOGUf4Z60ZpNcA.png" /><figcaption>Image translation using GANs. (<a href="https://phillipi.github.io/pix2pix/">Source</a>)</figcaption></figure><p>However, they are <strong>excruciatingly difficult</strong> to work with, owing to their <strong>fickle stability</strong>. Needless to say, many researchers have proposed brilliant solutions to mitigate some of the problems involved with training GANs. However, the research in this area evolved so fast that it became hard to keep track of interesting ideas. This blog makes an effort to list out some popular techniques that are commonly used to make GAN training stable.</p><h3>Drawbacks of using GANs — An Overview</h3><p>GANs are difficult to work with for a bunch of reasons. Some of them are listed below in this section.</p><h4>1. Mode collapse</h4><p><strong>Natural</strong> <strong>data</strong> distributions are highly complex and <strong>multimodal</strong>. That is, the data distribution has a lot of “peaks” or “modes”. Each mode represents a concentration of similar data samples, but is distinct from other modes.</p><p>During mode collapse, the generator produces samples that belong to a <strong>limited set of modes</strong>. This happens when the generator believes that it can fool the discriminator by locking on to a single mode. That is, the generator produces samples exclusively from this mode.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/841/1*cGfM4M9PGijW9pvg32BZzg.png" /><figcaption>The image at the top represents the output of a GAN without mode collapse. The image at the bottom represents the output of a GAN with mode collapse. 
(<a href="https://arxiv.org/pdf/1611.02163.pdf">Source</a>)</figcaption></figure><p>The discriminator eventually figures out that samples from this mode are fake. As a result, the generator simply locks on to another mode. This cycle repeats indefinitely, and this essentially limits the diversity of the generated samples. For a more detailed explanation, you can check out this <a href="http://aiden.nibali.org/blog/2017-01-18-mode-collapse-gans/">blog</a>.</p><h4>2. Convergence</h4><p>A common question in GAN training is “when do we stop training them?”. Since the Generator loss improves when the Discriminator loss degrades (and vice-versa), we cannot judge convergence based on the value of the loss function. This is illustrated by the image below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/393/1*FlJ8ijfnwfpsy5GJGQaAkQ.png" /><figcaption>Plot of a typical GAN loss function. Note how convergence cannot be interpreted from this plot. (<a href="https://medium.com/@sanketgujar95/gans-in-tensorflow-261649d4f18d">Source</a>)</figcaption></figure><h4>3. Quality</h4><p>As with the previous problem, it is difficult to quantitatively tell when the generator produces high quality samples. Additional perceptual regularization added to the loss function can help mitigate the situation to some extent.</p><h4>4. Metrics</h4><p>The GAN objective function explains how well the Generator or the Discriminator is performing with respect to its opponent. It does not, however, represent the quality or the diversity of the output. Hence, we need distinct metrics that can measure output quality and diversity.</p><h3>Terminologies</h3><p>Before we dive deep into techniques that can aid performance, let us review some terminologies. This will simplify explanations of the techniques presented in the next section.</p><h4>1. Infimum and Supremum</h4><p>Put simply, Infimum is the largest lower bound of a set. Supremum is the smallest upper bound of a set. 
They differ from minimum and maximum in the sense that the infimum and supremum need not belong to the set.</p><h4>2. Divergence Measures</h4><p>Divergence measures represent the distance between two distributions. Conventional GANs essentially minimize the <a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">Jensen Shannon divergence</a> between the real data distribution and the generated data distribution. GAN loss functions can be modified to minimize other divergence measures such as the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback Leibler divergence</a> or <a href="https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures">Total Variation Distance</a>. Popularly, the Wasserstein GAN minimizes the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">Earth Mover distance</a>.</p><h4>3. Kantorovich Rubinstein Duality</h4><p>Some divergence measures are intractable to optimize in their naive form. However, their dual form (replacing infimum with supremum or vice-versa) may be tractable to optimize. The duality principle lays a framework for transforming one form to another. For a very detailed explanation, you can check out this <a href="https://vincentherrmann.github.io/blog/wasserstein/">blog post</a>.</p><h4>4. Lipschitz continuity</h4><p>A Lipschitz continuous function is limited in how fast it can change. For a function to be Lipschitz continuous, the absolute value of the slope of the function’s graph (for any pair of points) cannot be more than a real value K. Such functions are also known as K-Lipschitz continuous.</p><p>Lipschitz continuity is desired in GANs as it bounds the gradients of the discriminator, essentially preventing the exploding gradient problem. 
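</p><p>To make this concrete, here is a small numerical check (an illustrative sketch added for this discussion, with a hypothetical helper <code>lipschitz_estimate</code>): sample random point pairs and verify that the largest difference quotient of a 1-Lipschitz function such as sine never exceeds 1.</p>

```python
import numpy as np

def lipschitz_estimate(f, n_pairs=10000, low=-10.0, high=10.0, seed=0):
    """Estimate the Lipschitz constant of f on [low, high] by sampling
    random point pairs and taking the largest difference quotient."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, n_pairs)
    y = rng.uniform(low, high, n_pairs)
    mask = x != y  # guard against (vanishingly unlikely) zero denominators
    return np.max(np.abs(f(x[mask]) - f(y[mask])) / np.abs(x[mask] - y[mask]))

# sin is 1-Lipschitz: |sin(x) - sin(y)| <= |x - y| for every pair of points.
print(lipschitz_estimate(np.sin) <= 1.0 + 1e-9)  # True
```

<p>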
Moreover, the Kantorovich-Rubinstein duality requires it for a Wasserstein GAN, as mentioned in this excellent <a href="https://christiancosgrove.com/blog/2018/01/04/spectral-normalization-explained.html">blog post</a>.</p><h3>Techniques for Improving Performance</h3><p>There are a plethora of tricks and techniques that can be used for making GANs more stable and powerful. To keep this blog concise, I’ve only explained techniques that are either relatively new or complex. I’ve listed out other miscellaneous tricks and techniques at the end of this section.</p><h4>1. Alternative Loss Functions</h4><p>One of the most popular fixes to the shortcomings of GANs is the <a href="https://arxiv.org/abs/1701.07875"><strong>Wasserstein GAN</strong></a>. It essentially replaces the Jensen Shannon divergence of conventional GANs with the <strong>Earth Mover distance</strong> (Wasserstein-1 distance or EM distance). The original form of the EM distance is intractable, and hence we use its dual form (obtained via the Kantorovich Rubinstein duality). This requires the discriminator to be <strong>1-Lipschitz</strong>, which is maintained by <strong>clipping the weights</strong> of the discriminator.</p><p>The advantage of using Earth Mover distance is that it is <strong>continuous</strong> even when the real and generated data distributions are disjoint, unlike JS or KL divergence. Also, there is a correlation between the generated image quality and the loss value (<a href="https://arxiv.org/pdf/1701.07875.pdf">Source</a>). The disadvantage is that we need to perform several discriminator updates per generator update (as per the original implementation). 
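</p><p>As a minimal sketch (illustrative code, not the reference implementation): after every critic update, the original WGAN clamps each weight into a small interval [-c, c], with c = 0.01 in the paper.</p>

```python
import numpy as np

def clip_critic_weights(weights, c=0.01):
    """Clamp every critic parameter array to [-c, c], as done after each
    critic (discriminator) update in the original WGAN training loop."""
    return [np.clip(w, -c, c) for w in weights]

# Toy parameter list: large entries get clamped, small ones pass through.
weights = [np.array([[0.5, -0.002], [-0.9, 0.007]])]
clipped = clip_critic_weights(weights)
print(clipped[0].min() >= -0.01 and clipped[0].max() <= 0.01)  # True
```

<p>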
Moreover, the authors claim that weight clipping is a terrible way to ensure the 1-Lipschitz constraint.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/826/1*VlaZmJjPfi99qXtMAXWHNQ.png" /><figcaption>The earth mover distance (left) is continuous, even if the distributions are not continuous, unlike the Jensen Shannon divergence (right). Refer to this <a href="https://arxiv.org/pdf/1701.07875.pdf">paper</a> for a detailed explanation.</figcaption></figure><p>Another interesting solution is to use mean squared loss instead of log loss. The authors of the <a href="https://arxiv.org/abs/1611.04076"><strong>LSGAN</strong></a> argue that the conventional GAN loss function does not provide much incentive to “pull” the generated data distribution close to the real data distribution.</p><p>The log loss in the original GAN loss function does not account for the distance of the generated data from the decision boundary (the decision boundary separates real and fake data). LSGAN, on the other hand,<strong> penalizes</strong> <strong>generated samples</strong> that are<strong> far away</strong> from the <strong>decision boundary</strong>, essentially “pulling” the generated data distribution closer to the real data distribution. It does this by replacing the log loss with mean squared loss. For a detailed explanation, check out this <a href="https://wiseodd.github.io/techblog/2017/03/02/least-squares-gan/">blog</a>.</p><h4>2. Two Timescale Update Rule (TTUR)</h4><p>In this method, we use a <strong>different learning rate</strong> for the discriminator and the generator (<a href="https://arxiv.org/pdf/1706.08500.pdf">Source</a>). Typically, a slower update rule is used for the generator and a faster update rule is used for the discriminator. Using this method, we can perform generator and discriminator updates in a 1:1 ratio, and just tinker with the learning rates. 
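</p><p>In code, TTUR is nothing more than two optimizers with different step sizes. The sketch below uses plain SGD on toy scalar parameters; the 1e-4 / 4e-4 pair mirrors the values reported for SAGAN-style training, but treat them as illustrative.</p>

```python
GEN_LR, DISC_LR = 1e-4, 4e-4  # generator updates slowly, discriminator faster

def sgd_step(params, grads, lr):
    """One vanilla SGD update: p <- p - lr * g for each parameter."""
    return [p - lr * g for p, g in zip(params, grads)]

# Hypothetical scalar parameters and gradients, purely for illustration.
gen_params, disc_params = [1.0], [1.0]
gen_grads, disc_grads = [0.5], [0.5]

# One 1:1 update round: no extra critic steps, just different step sizes.
gen_params = sgd_step(gen_params, gen_grads, GEN_LR)
disc_params = sgd_step(disc_params, disc_grads, DISC_LR)
print(gen_params[0], disc_params[0])  # 0.99995 0.9998
```

<p>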
Notably, the <a href="https://arxiv.org/abs/1805.08318">SAGAN</a> implementation uses this method.</p><h4>3. Gradient Penalty</h4><p>In the paper <a href="https://arxiv.org/abs/1704.00028">Improved Training of WGANs</a>, the authors claim that <strong>weight clipping</strong> (as originally performed in WGANs) leads to <strong>optimization issues</strong>. They claim that weight clipping forces the neural network to learn “simpler approximations” to the optimal data distribution, leading to <strong>lower quality results</strong>. They also claim that weight clipping leads to the <strong>exploding or vanishing gradient problem</strong>, if the clipping hyperparameter is not set properly. The authors introduce a simple gradient penalty that is added to the loss function such that the above problems are mitigated. Moreover, <strong>1-Lipschitz </strong>continuity is maintained, as in the original WGAN implementation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/617/1*413LI5UXQFbRp77q89Qpew.png" /><figcaption>Gradient penalty added as regularizer, as in the original WGAN-GP paper. (<a href="https://arxiv.org/pdf/1704.00028.pdf">Source</a>)</figcaption></figure><p>The authors of <a href="https://arxiv.org/abs/1705.07215"><strong>DRAGAN</strong></a> claim that mode collapse occurs when the game played by the GAN (i.e. discriminator and generator going against each other) reaches a “local equilibrium state”. They also claim that the gradients contributed by the discriminator around such states are “sharp”. Naturally, using a gradient penalty will help us circumvent these states, greatly enhancing stability and reducing mode collapse.</p><h4>4. Spectral Normalization</h4><p>Spectral normalization is a <strong>weight normalization technique</strong> that is typically used on the Discriminator to enhance the training process. 
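</p><p>Concretely, spectral normalization divides a weight matrix by an estimate of its largest singular value, usually obtained with a few steps of power iteration. A minimal numpy sketch (illustrative, not the actual SAGAN implementation) follows; after the division, the spectral norm of the matrix is close to 1.</p>

```python
import numpy as np

def spectral_normalize(W, n_iters=20, seed=0):
    """Divide W by an estimate of its spectral norm (largest singular
    value), computed with power iteration as in spectral normalization."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimate of the largest singular value
    return W / sigma

W = np.array([[3.0, 0.0], [4.0, 5.0]])
W_sn = spectral_normalize(W)
print(np.linalg.norm(W_sn, 2))  # ~1.0: the normalized matrix is 1-Lipschitz
```

<p>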
This essentially ensures that the Discriminator is <strong>K-Lipschitz</strong> continuous.</p><p>Some implementations like the <a href="https://arxiv.org/abs/1805.08318">SAGAN</a> used spectral normalization on the Generator as well. It is also stated that this method is computationally more efficient than Gradient Penalty (<a href="https://christiancosgrove.com/blog/2018/01/04/spectral-normalization-explained.html">Source</a>).</p><h4>5. Unrolling and Packing</h4><p>As stated in this excellent <a href="http://aiden.nibali.org/blog/2017-01-18-mode-collapse-gans/">blog</a>, one way to prevent mode hopping is to peek into the future and <strong>anticipate counterplay </strong>when updating parameters. Unrolled GANs enable the Generator to fool the Discriminator, after the discriminator has had a chance to respond (taking counterplay into account).</p><p>Another way of preventing mode collapse is to “pack” several samples belonging to the same class before passing them to the Discriminator. This method is incorporated in <a href="https://arxiv.org/abs/1712.04086"><strong>PacGAN</strong></a>, which reports a decent reduction in mode collapse.</p><h4>6. Stacking GANs</h4><p>A single GAN may not be powerful enough to handle a task effectively. We could instead use multiple GANs placed consecutively, where each GAN solves an easier version of the problem. For instance, <a href="https://www.cs.toronto.edu/~urtasun/publications/zhu_etal_iccv17.pdf"><strong>FashionGAN</strong></a> used two GANs to perform localized image translation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KXwyQG2HMQ8EvMqSNVnUYw.png" /><figcaption>FashionGAN used two GANs to perform localized image translation. (<a href="https://www.cs.toronto.edu/~urtasun/publications/zhu_etal_iccv17.pdf">Source</a>)</figcaption></figure><p>Taking this concept to the extreme, we can gradually increase the difficulty of the problem presented to our GANs. 
For instance, Progressive GANs (<a href="https://arxiv.org/abs/1710.10196">ProGANs</a>) can generate high-quality images at high resolution.</p><h4>7. Relativistic GANs</h4><p>Conventional GANs measure the probability of the generated data being real. Relativistic GANs measure the probability of the generated data being “more realistic” than the real data. We can measure this “relative realism” using an appropriate distance measure, as mentioned in the <a href="https://arxiv.org/abs/1807.00734">RGAN paper</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*INq5F_CDRUmZOo7ASt6a6g.png" /><figcaption>Output of the discriminator when using the standard GAN loss (image B). Image C represents how the output curve should actually look. Image A represents the optimal solution to the JS divergence. (<a href="https://arxiv.org/abs/1807.00734">Source</a>)</figcaption></figure><p>The authors also mention that the discriminator output should converge to <strong>0.5</strong> when it has reached the optimal state. However, conventional GAN training algorithms force the discriminator to output “real” (i.e. <strong>1</strong>) for any image. This, in a way, prevents the discriminator from reaching its optimal value. The relativistic method solves this issue as well, and has pretty remarkable results, as shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nMzTMXmBoVcRcb0ey34Cvw.jpeg" /><figcaption>Output of a standard GAN (left) and a relativistic GAN (right) after 5000 iterations. (<a href="https://medium.com/@jonathan_hui/gan-rsgan-ragan-a-new-generation-of-cost-function-84c5374d3c6e">Source</a>)</figcaption></figure><h4>8. Self Attention Mechanism</h4><p>The authors of <a href="https://arxiv.org/abs/1805.08318">Self Attention GANs</a> claim that <strong>convolutions</strong> used for generating images look at <strong>information</strong> that is <strong>spread locally</strong>. 
That is, they <strong>miss out</strong> on <strong>relationships</strong> that <strong>span globally</strong> due to their restrictive receptive field.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Vvu3U3aKgFMFMF2zBj0Byg.jpeg" /><figcaption>Adding the attention map (calculated in the yellow box) to the standard convolution operation. (<a href="https://arxiv.org/pdf/1805.08318.pdf">Source</a>)</figcaption></figure><p>Self-Attention Generative Adversarial Network allows attention-driven, long-range dependency modeling for image generation tasks. The <strong>self-attention</strong> mechanism is <strong>complementary</strong> to the normal convolution operation. The global information (long range dependencies) aids in generating images of higher quality. The network can <strong>choose to ignore</strong> the attention mechanism, or consider it along with normal convolutions. For a detailed explanation, you can check out their <a href="https://arxiv.org/pdf/1805.08318.pdf">paper</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tuIw6Iys9MYJPHa_XBTxYQ.png" /><figcaption>Visualization of the attention map for the location marked by the red dot. (<a href="https://medium.com/@jonathan_hui/gan-self-attention-generative-adversarial-networks-sagan-923fccde790c">Source</a>)</figcaption></figure><h4>9. Miscellaneous Techniques</h4><p>Here is a list of some additional techniques (not exhaustive!) that are used to improve GAN training:</p><ul><li>Feature Matching</li><li>Mini Batch Discrimination</li><li>Historical Averaging</li><li>One-sided Label Smoothing</li><li>Virtual Batch Normalization</li></ul><p>You can read up more about these techniques in this <a href="https://arxiv.org/pdf/1606.03498.pdf">paper</a>, and from this <a href="https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html#improved-gan-training">blog post</a>. 
A lot more techniques are listed in this <a href="https://github.com/soumith/ganhacks">GitHub repository</a>.</p><h3>Metrics</h3><p>Now that we have established methods to improve training, we need to quantitatively verify that they work. The following metrics are often used to measure the performance of a GAN:</p><h4>1. Inception Score</h4><p>The inception score measures how “real” the generated data is.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/444/1*yDUeYnwIjJsLcaWHARKLbQ.png" /><figcaption>The Inception Score. (<a href="https://arxiv.org/pdf/1801.01973.pdf">Source</a>)</figcaption></figure><p>The equation has two components p(y|x) and p(y). Here, x is the image that is produced by the Generator, and p(y|x) is the probability distribution obtained when you pass image x through a pre-trained Inception Network (pretrained on the ImageNet dataset, as in the <a href="https://arxiv.org/pdf/1801.01973.pdf">original implementation</a>). Also, p(y) is the marginal probability distribution, which can be calculated by averaging p(y|x) over a few distinct samples of generated images (x). These two terms represent <strong>two different qualities</strong> that are desirable in real images:</p><ol><li>The generated image must have objects that are “<strong>meaningful</strong>” (objects are clear, and not blurry). This means that p(y|x) should have “<strong>low entropy</strong>”. In other words, our Inception Network must be strongly confident that the generated image belongs to a particular class.</li><li>The generated images should be “<strong>diverse</strong>”. This means that p(y) should have “<strong>high entropy</strong>”. In other words, the generator should produce images such that each image represents a different class label (ideally).</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7qF9eTRu41frGGUDMK59sA.png" /><figcaption>Ideal plots of p(y|x) and p(y). Such a pair would have a really large KL divergence. 
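</figcaption></figure><p>Given the class posteriors p(y|x) for a batch of generated images (in practice, the softmax outputs of a pre-trained Inception network), the score can be computed in a few lines. The sketch below is an illustration added for this discussion and uses made-up toy distributions:</p>

```python
import numpy as np

def inception_score(p_yx):
    """p_yx: (n_images, n_classes) array whose rows are p(y|x) from a
    pre-trained classifier. Returns exp(mean_x KL(p(y|x) || p(y)))."""
    p_y = p_yx.mean(axis=0)  # marginal distribution p(y)
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)
    return float(np.exp(kl.mean()))

# Sharp and diverse posteriors (near one-hot, each image a different class)
# score higher than blurry, uniform posteriors.
sharp = np.eye(3) * 0.97 + 0.01      # rows like (0.98, 0.01, 0.01)
blurry = np.full((3, 3), 1.0 / 3.0)  # every row uniform over classes
print(inception_score(sharp) > inception_score(blurry))  # True
```

<figure><figcaption>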
(<a href="https://medium.com/@jonathan_hui/gan-how-to-measure-gan-performance-64b988c47732">Source</a>)</figcaption></figure><p>If a random variable is <strong>highly predictable</strong>, it has <strong>low entropy</strong> (i.e. p(y|x) must be a distribution with a sharp peak). On the contrary, if it is <strong>unpredictable</strong>, it has <strong>high entropy</strong> (i.e. p(y) must be a uniform distribution). If both these traits are satisfied, we should expect a <strong>large KL divergence</strong> between p(y|x) and p(y). Naturally, a large Inception Score (IS) is better. For a deeper analysis of the Inception Score, you can check out this <a href="https://arxiv.org/pdf/1801.01973.pdf">paper</a>.</p><h4>2. Fréchet Inception Distance (FID)</h4><p>A<strong> drawback</strong> of the Inception Score is that <strong>statistics</strong> of the real data are <strong>not compared</strong> with the statistics of the generated data (<a href="https://arxiv.org/pdf/1706.08500.pdf">Source</a>). Fréchet distance resolves the drawback by comparing the mean and covariance of the real and generated images. Fréchet Inception Distance (FID) performs the same analysis, but on the feature maps produced by passing the real and generated images through a pre-trained Inception-v3 Network (<a href="https://nealjean.com/ml/frechet-inception-distance/">Source</a>). The equation is described as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/535/1*Pz1TEP1UWByU0sQW7psnjQ.png" /><figcaption>FID compares the mean and covariance of the real and generated data distributions. Tr stands for Trace. 
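</figcaption></figure><p>For two Gaussians the distance takes this closed form. The sketch below is an added illustration; for simplicity it assumes commuting (e.g. identity or diagonal) covariance matrices, so that the matrix square root can come from an eigendecomposition of the symmetric product:</p>

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 sqrt(C1 C2)).
    Sketch only: assumes cov1 @ cov2 is symmetric (commuting covariances),
    so the matrix square root can be taken via eigendecomposition."""
    diff = mu1 - mu2
    vals, vecs = np.linalg.eigh(cov1 @ cov2)
    covmean = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

mu, cov = np.zeros(2), np.eye(2)
print(frechet_distance(mu, cov, mu, cov) < 1e-8)       # True: identical stats
print(frechet_distance(mu, cov, mu + 1.0, cov) > 0.0)  # True: lower is better
```

<figure><figcaption>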
(<a href="https://nealjean.com/ml/frechet-inception-distance/">Source</a>)</figcaption></figure><p>A <strong>lower FID score is better</strong>, as it indicates that the statistics of the generated images are very similar to those of the real images.</p><h3>Conclusion</h3><p>The research community has produced numerous solutions and hacks to overcome the shortcomings of GAN training. However, it is difficult to keep track of significant contributions due to the sheer volume of new research. The details shared in this blog are not exhaustive for the same reason, and may become outdated in the near future. Nevertheless, I hope this blog serves as a guideline for people looking for methods to improve the performance of their GANs.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7bad57028032" width="1" height="1" alt=""><hr><p><a href="https://medium.com/beyondminds/advances-in-generative-adversarial-networks-7bad57028032">Advances in Generative Adversarial Networks</a> was originally published in <a href="https://medium.com/beyondminds">BeyondMinds</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Fit Large Neural Networks on the Edge]]></title>
            <link>https://heartbeat.comet.ml/how-to-fit-large-neural-networks-on-the-edge-eb621cdbb33?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/eb621cdbb33</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[heartbeat]]></category>
            <category><![CDATA[mlops]]></category>
            <category><![CDATA[neural-networks]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Tue, 21 Aug 2018 14:37:09 GMT</pubDate>
            <atom:updated>2021-09-23T14:23:15.776Z</atom:updated>
            <content:encoded><![CDATA[<h4>Exploring techniques used to fit neural networks in memory-constrained edge settings</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GKgHihXIRl3EN_Q4Xf_y2w.jpeg" /></figure><p>Deploying memory-hungry deep learning algorithms is a challenge for anyone who wants to create a scalable service. Cloud services are expensive in the long run. <a href="https://heartbeat.fritz.ai/machine-learning-on-ios-and-android-bd77f6e92c7b">Deploying models offline on edge devices</a> is cheaper, and has <a href="https://towardsdatascience.com/deep-learning-on-the-edge-9181693f466c">other benefits</a> as well. The only disadvantage is that they have a paucity of memory and compute power.</p><p>This blog explores a few techniques that can be used to fit neural networks in memory-constrained settings. Different techniques are used for the “training” and “inference” stages, and hence they are discussed separately.</p><h3>Training</h3><p>Certain applications require online learning. That is, the model improves based on feedback or additional data. Deploying such applications on the edge places a tangible resource constraint on your model. Here are 4 ways you can reduce the memory consumption of such models.</p><h4>1. Gradient Checkpointing</h4><p>Frameworks such as TensorFlow consume a lot of memory for training. During a forward pass, the value at every node in the graph is evaluated and saved in memory. This is required for calculating the gradient during backprop.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/541/1*Xko6SWaBYjGdCyUcuzCOOg.gif" /><figcaption>Value of every node is saved after a forward pass to calculate the gradient in a single backward pass. (<a href="https://github.com/openai/gradient-checkpointing">Source</a>)</figcaption></figure><p>Normally this is okay, but when models get deeper and more complex, the memory consumption increases drastically. 
A neat sidestep solution to this is to recompute the values of the nodes when needed, instead of saving them to memory.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/541/1*WfizUwplzHUMVQ9m5nWDAw.gif" /><figcaption>Recomputing the node values to calculate the gradient. Note that we need to do several partial forward passes to complete a single backward pass. (<a href="https://github.com/openai/gradient-checkpointing">Source</a>)</figcaption></figure><p>However, as shown above, the computational cost increases significantly. A good trade-off is to save only some nodes in memory while recomputing the others when needed. These saved nodes are called checkpoints. This drastically reduces deep neural network memory consumption. This is illustrated below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/541/1*cGCJWpGSW67XFaeKf2k2TQ.gif" /><figcaption>The second node from the left is a checkpoint node. It reduces memory consumption while providing a reasonable time penalty. (<a href="https://github.com/openai/gradient-checkpointing">Source</a>)</figcaption></figure><h4>2. Trade speed for memory (Recomputation)</h4><p>Extending on the above idea, we can recompute certain operations to save memory. A good example of this is the <a href="https://arxiv.org/pdf/1707.06990.pdf">Memory Efficient DenseNet</a> implementation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/427/1*ffeZyApPiJ_RCjCICeJyGg.jpeg" /><figcaption>A dense block in a DenseNet. (<a href="https://arxiv.org/abs/1608.06993">Source</a>)</figcaption></figure><p>DenseNets are very parameter efficient, but are also memory inefficient. The paradox arises because of the nature of the concatenation and batchnorm operations.</p><p>To make convolution efficient on the GPU, the values must be placed contiguously. Hence, after concatenation, cuDNN arranges the values contiguously on the GPU. This involves a lot of redundant memory allocation. 
Similarly, batchnorm involves excess memory allocation, as explained in this <a href="https://arxiv.org/pdf/1707.06990.pdf">paper</a>. Both operations contribute to a quadratic growth in memory. DenseNets have a large number of concatenations and batchnorms, and hence they are memory inefficient.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/976/1*Od0_xGWp_r_8tPYIALlgeg.png" /><figcaption>Comparing naive concat and batchnorm operations to their memory efficient counterparts. (<a href="https://arxiv.org/pdf/1707.06990.pdf">Source</a>)</figcaption></figure><p>A neat solution to the above involves two key observations.</p><p>Firstly, concatenation and batchnorm operations are not time intensive. Hence, we can just recompute the values when needed, instead of keeping all the redundant copies in memory. Secondly, instead of allocating “new” memory space for the output, we can use a “shared memory space” to dump the output.</p><p>We can overwrite this shared space to store the output of other concatenation operations. We can recompute the concatenation operation for gradient calculation when needed. Similarly, we can extend this for the batchnorm operation. This simple trick saves a lot of GPU memory, in exchange for slightly increased compute time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/952/1*JqY8-B1zJWMt9t54iB-JyQ.png" /></figure><h4>3. Reduce Precision</h4><p>In an excellent <a href="https://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/">blog</a>, Pete Warden explains how neural networks can be trained with 8-bit float values. 
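</p><p>The core idea can be sketched with a generic affine quantization scheme (an illustration added here, not Warden's exact method): store one floating-point scale and offset, and map every weight to an 8-bit code.</p>

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine quantization sketch: map floats to uint8 codes using a
    per-tensor scale and offset (zero point at x.min())."""
    qmax = 2 ** num_bits - 1
    scale = (x.max() - x.min()) / qmax
    offset = x.min()
    codes = np.round((x - offset) / scale).astype(np.uint8)
    return codes, scale, offset

def dequantize(codes, scale, offset):
    """Approximate reconstruction of the original floats."""
    return codes.astype(np.float32) * scale + offset

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
codes, scale, offset = quantize(w)
err = np.max(np.abs(dequantize(codes, scale, offset) - w))
print(err < scale)  # True: round-trip error stays below one quantization step
```

<p>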
There are a number of issues that accrue because of <a href="https://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd">reduction in precision</a>, some of which are listed below:</p><ul><li>As stated in this <a href="https://arxiv.org/pdf/1412.7024.pdf">paper</a>, “activations, gradients, and parameters” have very different ranges. A fixed-point representation would not be ideal. The paper claims that a “dynamic fixed point” representation would be an excellent fit for low precision neural networks.</li><li>As stated in Pete Warden’s other <a href="https://petewarden.com/2017/06/22/what-ive-learned-about-neural-network-quantization/">blog</a>, lower precision implies a greater deviation from the exact value. Normally, if the errors are totally random, they have a good chance of canceling each other out. However, zeros are used extensively for padding, dropout, and ReLU. An exact representation of zero in the lower precision float format may not be possible, and hence might introduce an overall bias in the performance.</li></ul><h4>4. Architecture Engineering for Neural Networks</h4><p>Architecture engineering involves designing the neural network structure that best optimizes accuracy, memory, and speed. There are several ways by which convolutions can be optimized space-wise and time-wise.</p><ul><li>Factorize NxN convolutions into a combination of Nx1 and 1xN convolutions. This conserves a lot of space while also boosting computational speed. This and several other optimization tricks were used in newer versions of the Inception network. 
For a more detailed discussion, check out this <a href="https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202">blog post</a>.</li><li>Use <a href="https://heartbeat.fritz.ai/building-an-image-recognition-model-for-mobile-using-depthwise-convolutions-643d70e0f7e2">Depthwise Separable convolutions</a> as in <a href="https://arxiv.org/pdf/1704.04861.pdf">MobileNet</a> and <a href="https://arxiv.org/abs/1610.02357">Xception Net</a>. For an elaborate discussion on the types of convolutions, check out this <a href="https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d">blog post</a>.</li><li>Use 1x1 convolutions as a bottleneck to reduce the number of incoming channels. This technique is used in several popular neural networks.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/796/1*ayowbX4wpOAcZsN8gW5znQ.png" /><figcaption>Illustration of Google’s AutoML. (<a href="https://www.youtube.com/watch?v=Y2VF8tmLFHw">Source</a>)</figcaption></figure><p>An interesting solution is to <strong>let the machine decide the best architecture</strong> for a particular problem. <a href="https://arxiv.org/pdf/1707.07012.pdf">Neural Architecture Search</a> uses machine learning to find the best neural network architecture for a given classification problem. When used on ImageNet, the network formed as a result (NASNet) was among the best performing models created so far. Google’s <a href="https://ai.googleblog.com/2017/11/automl-for-large-scale-image.html">AutoML</a> works on the same principle.</p><blockquote>Deep learning — For experts, by experts. We’re using our decades of experience to deliver <a href="https://www.deeplearningweekly.com/?utm_campaign=dlweekly-newsletter-expertise4&amp;utm_source=heartbeat">the best deep learning resources to your inbox each week</a>.</blockquote><h3>Inference</h3><p>Fitting models for edge inference is relatively easy. 
This section covers techniques that can be used to optimize your neural network for such edge devices.</p><h4>1. Removing “Bloatware”</h4><p>Machine learning frameworks such as TensorFlow consume a lot of memory space for creating graphs. This additional space is useful for speeding up the training processes, but it isn’t used for inference. Hence, the part of the graph used exclusively for training can be pruned off. Let’s call this part of the graph <em>bloatware</em>.</p><p>For TensorFlow, it’s recommended to convert model checkpoints to frozen inference graphs. This process automatically removes the memory-hungry bloatware. Graphs from model checkpoints that throw <em>Resource Exhausted Error</em> can sometimes be fit into memory when converted to a frozen inference graph.</p><h4>2. Pruning Features</h4><p>Some machine learning models on Scikit-Learn (such as Random Forest and <a href="https://heartbeat.fritz.ai/boosting-your-machine-learning-models-using-xgboost-d2cabb3e948f">XGBoost</a>) output an attribute named feature_importances_. This attribute represents the significance of each feature for the classification or <a href="https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0">regression</a> task. We can simply prune the features with the least significance. This can be extremely useful if your model has an excessively large number of features that you cannot reduce by any other method.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*8bg_EHw_ZJ5qzXuHRYuPEg.png" /><figcaption>Example of a feature importance plot. (<a href="https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/">Source</a>)</figcaption></figure><p>Similarly, in neural networks, a lot of weight values are close to zero. We can simply prune those connections. However, removing individual connections between layers creates sparse matrices. 
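</p><p>A minimal sketch of magnitude-based pruning (illustrative code, with a hypothetical <code>magnitude_prune</code> helper): zero out the smallest-magnitude weights until a target sparsity is reached.</p>

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero out the smallest-magnitude entries of w so that roughly
    `sparsity` fraction of them become zero (a sketch, not a framework API)."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.random.default_rng(0).standard_normal((100, 100))
pruned = magnitude_prune(w)
print(float(np.mean(pruned == 0.0)))  # ~0.9: the weight matrix is mostly zeros
```

<p>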
There is work being done on creating <a href="https://arxiv.org/pdf/1602.01528.pdf">efficient inference engines</a> (hardware), which can handle sparse operations seamlessly. However, most machine learning frameworks convert sparse matrices to their dense form before sending them to the GPU.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/467/1*52TyqTAvDt2NiCU4913a-Q@2x.png" /><figcaption>Pruning an insignificant filter. (<a href="http://machinethink.net/blog/compressing-deep-neural-nets/">Source</a>)</figcaption></figure><p>Instead, we can remove insignificant neurons and slightly retrain the model. For <a href="https://heartbeat.fritz.ai/a-beginners-guide-to-convolutional-neural-networks-cnn-cf26c5ee17ed">CNNs</a>, we can remove entire filters, too. <a href="https://arxiv.org/pdf/1510.00149.pdf">Research</a> and <a href="http://machinethink.net/blog/compressing-deep-neural-nets/">experiments</a> have shown that we could retain most of the accuracy, while obtaining a massive reduction in size, by using this method.</p><h4>3. Weight Sharing</h4><p>To best illustrate weight sharing, consider the example given in this <a href="https://arxiv.org/pdf/1510.00149.pdf">Deep Compression paper</a>. Consider a 4x4 weight matrix. It has 16 32-bit float values. We require 512 bits (16 * 32) to represent the matrix.</p><p>Let us quantize the weight values to 4 levels, but let’s preserve their 32-bit nature. Now, the 4x4 weight matrix has only 4 unique values. The 4 unique values are stored in a separate (shared) memory space. We can give each of the 4 unique values a 2-bit address (Possible address values being 0, 1, 2, and 3).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/392/1*HrNhTvX9JiP5SG5ikA6QvA.png" /><figcaption>Weight sharing illustrated. (<a href="https://arxiv.org/pdf/1510.00149.pdf">Source</a>)</figcaption></figure><p>We can reference the weight values by using the 2-bit addresses. 
Hence, we obtain a new 4x4 matrix with 2-bit addresses, with each location in the matrix referring to a location in the shared memory space. This method requires 160 bits (16 * 2 + 4 * 32) for the entire representation. We obtain a size reduction factor of 3.2.</p><p>Needless to say, this reduction in size comes with an increase in time complexity. However, the cost of accessing shared memory should not impose a severe time penalty.</p><h4>4. Quantization and Lower Precision (Inference)</h4><p>Recall that we covered reduction in precision in the training part of this blog. For inference, reduction in precision is not as cumbersome as for training. The weights can just be converted to a <a href="https://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd">lower precision</a> format and be shipped off for inference. However, a sharp decrease in precision might require a slight readjustment of weight values.</p><h4>5. Encoding</h4><p>The pruned and quantized weights can be size-optimized further by using encoding. <a href="https://www.cs.ucsb.edu/~franklin/20/assigns/prog6files/HuffmanEncoding.htm">Huffman encoding</a> can represent the most frequent weight values with a lower number of bits. Hence, at the bit level, a Huffman-encoded string takes up less space than a normal string.</p><p>Deep Compression explores encoding using lossless compression techniques such as Huffman. However, <a href="https://arxiv.org/abs/1711.04686">research</a> has explored the use of lossy compression techniques as well. The downside to either method is the overhead of translation.</p><h4>6. Inference Optimizers</h4><p>We’ve discussed some great ideas so far, but implementing them from scratch would take quite some time. This is where inference optimizers kick in. 
For instance, Nvidia’s <a href="https://developer.nvidia.com/tensorrt">TensorRT</a> incorporates all of these great ideas (and more) and provides an <em>optimized inference engine</em> given a trained neural network.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*asvCqn1JXuzL0dDaqDTVkw.png" /><figcaption>TensorRT. (<a href="https://developer.nvidia.com/tensorrt">Source</a>)</figcaption></figure><p>Moreover, TensorRT can optimize the model such that it can make better use of Nvidia’s hardware. Below is an example where a model optimized with TensorRT uses Nvidia’s V100 more efficiently.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AhFT3awnyhzrou8zJZz3rA.png" /><figcaption>Using model optimized by TensorRT on Nvidia’s V100. (<a href="https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/">Source</a>)</figcaption></figure><h4>7. Knowledge Distillation</h4><p>Instead of performing fancy optimization techniques, we can teach smaller models to mimic the performance of beefy, larger models. This technique is called knowledge distillation, and it’s an integral part of Google’s <a href="https://ai.googleblog.com/2018/05/custom-on-device-ml-models.html">Learn2Compress</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*Id6ZQBu9VgANtd8frCQK-w.png" /><figcaption>Teacher-Student models. (<a href="https://ai.googleblog.com/2018/05/custom-on-device-ml-models.html">Source</a>)</figcaption></figure><p>By using this method, we can force smaller models that can fit on edge devices to reach the performance levels of larger models. The drop in accuracy is reported to be minimal. You can refer to Hinton’s <a href="https://arxiv.org/pdf/1503.02531.pdf">paper</a> on the same for more information.</p><blockquote>Thank you for reading this article! Hope you found it interesting. Hit the clap button if you did! 
If you have any questions, you could hit me up on <a href="https://thatbrguy.github.io/">social media</a> or send me an email (bharathrajn98[at]gmail[dot]com).</blockquote><p><strong>Discuss this post on </strong><a href="https://www.reddit.com/r/MachineLearning/comments/993yve/d_how_to_fit_large_neural_networks_on_the_edge/"><strong>Reddit</strong></a><strong> and </strong><a href="https://news.ycombinator.com/item?id=17810371"><strong>Hacker News</strong></a><strong>.</strong></p><p><em>Editor’s Note:</em><a href="http://heartbeat.fritz.ai/"><em> Heartbeat</em></a><em> is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.</em></p><p><em>Editorially independent, Heartbeat is sponsored and published by </em><a href="http://comet.ml/?utm_campaign=heartbeat-statement&amp;utm_source=blog&amp;utm_medium=medium"><em>Comet</em></a><em>, an MLOps platform that enables data scientists &amp; ML teams to track, compare, explain, &amp; optimize their experiments. We pay our contributors, and we don’t sell ads.</em></p><p><em>If you’d like to contribute, head on over to our</em><a href="https://heartbeat.fritz.ai/call-for-contributors-october-2018-update-fee7f5b80f3e"><em> call for contributors</em></a><em>. 
You can also sign up to receive our weekly newsletters (</em><a href="https://www.deeplearningweekly.com/"><em>Deep Learning Weekly</em></a><em> and the </em><a href="https://info.comet.ml/newsletter-signup/"><em>Comet Newsletter</em></a><em>), join us on</em><a href="https://join.slack.com/t/fritz-ai-community/shared_invite/enQtNTY5NDM2MTQwMTgwLWU4ZDEwNTAxYWE2YjIxZDllMTcxMWE4MGFhNDk5Y2QwNTcxYzEyNWZmZWEwMzE4NTFkOWY2NTM0OGQwYjM5Y2U"><em> </em></a><a href="https://join.slack.com/t/cometml/shared_invite/zt-49v4zxxz-qHcTeyrMEzqZc5lQb9hgvw"><em>Slack</em></a><em>, and follow Comet on </em><a href="https://twitter.com/Cometml"><em>Twitter</em></a><em> and </em><a href="https://www.linkedin.com/company/comet-ml/"><em>LinkedIn</em></a><em> for resources, events, and much more that will help you build better ML models, faster.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb621cdbb33" width="1" height="1" alt=""><hr><p><a href="https://heartbeat.comet.ml/how-to-fit-large-neural-networks-on-the-edge-eb621cdbb33">How to Fit Large Neural Networks on the Edge</a> was originally published in <a href="https://heartbeat.comet.ml">Heartbeat</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Automate Surveillance Easily with Deep Learning]]></title>
            <link>https://medium.com/nanonets/how-to-automate-surveillance-easily-with-deep-learning-4eb4fa0cd68d?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/4eb4fa0cd68d</guid>
            <category><![CDATA[surveillance]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[tensorflow]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[computer-vision]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Fri, 03 Aug 2018 08:23:14 GMT</pubDate>
            <atom:updated>2018-08-10T08:30:05.416Z</atom:updated>
<content:encoded><![CDATA[<blockquote>This article is a quick tutorial for implementing a surveillance system using Object Detection based on Deep Learning. It also compares the performance of different Object Detection models using GPU multiprocessing for inference on Pedestrian Detection.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*csWLch8bSk4ic_tp6WSPJw.jpeg" /></figure><p>Surveillance is an integral part of security and patrol. For the most part, the job entails extended periods of looking out for something undesirable to happen. It is crucial that we do this, but it is also a very mundane task.</p><p>Wouldn’t life be much simpler if there were something that could do the “watching and waiting” for us? Well, you’re in luck. With the advancements in technology over the past few years, we can write some scripts to automate the above tasks rather easily. But before we dive deeper, let us ask ourselves:</p><h4>Are machines as good as humans?</h4><p>Anyone familiar with Deep Learning would know that image classifiers have surpassed human-level accuracy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*Oejgc8IivM4ZRQHVEgjqcw.png" /><figcaption>Error rate on the ImageNet dataset over time, for Humans, Traditional Computer Vision (CV) and Deep Learning. (Image source: <a href="https://www.dsiac.org/resources/journals/dsiac/winter-2017-volume-4-number-1/real-time-situ-intelligent-video-analytics">Link</a>)</figcaption></figure><p>So yes, a machine can keep a lookout for objects at the same standard as a human, or better. That being said, using technology to perform surveillance is also much more efficient.</p><ul><li>Surveillance is a repetitive and mundane task. This may cause performance dips for us human beings. 
By letting technology do the surveillance, we could focus on taking action if something goes amiss.</li><li>To survey a large strip of land, you need lots of personnel. Stationary cameras also have a limited range of view. With mobile surveillance bots (such as micro drones), these problems can be mitigated.</li></ul><p>Moreover, the same technology can be used for a variety of applications which are not limited to security, such as baby monitors or automated product delivery.</p><h3>Fair enough. But how do we automate it?</h3><p>Before we contrive complicated theories, let us think about how surveillance works normally. We look at a video feed, and if we spot something abnormal, we take action. So in essence, our technology should peruse every frame of the video, hoping to spot something abnormal. Does this process ring a bell?</p><p>As you may have guessed, this is the very essence of <strong>Object Detection with Localization. </strong>It is slightly different from classification in that we need to know the exact location of the object. Moreover, we may have multiple objects in a single image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*ObF6-zNVK-1-q6w63IM2ug.jpeg" /></figure><p>To find the exact location, our algorithm should inspect every portion of the image to find the existence of a class. It is harder than it sounds. But since 2014, continuous iterative research in Deep Learning has introduced heavily engineered neural networks that can detect objects in real time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Bfb2EoLQFGqa4gBrON98LA.png" /><figcaption>Look at how performance increased over just a span of 2 years!</figcaption></figure><p>There are several Deep Learning architectures that use different methods internally to perform the same task. 
The most popular variants are the Faster RCNN, YOLO and SSD networks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sNH_XnRaFxGKBJbLEXgwuQ.png" /><figcaption>Speed vs accuracy trade-off. A higher mAP and a lower GPU Time is optimal.</figcaption></figure><p>Each model depends on a base classifier, which greatly affects the final accuracy and model size. Moreover, the choice of the object detector can heavily influence computational complexity and final accuracy.</p><blockquote>There is always a Speed vs Accuracy vs Size trade-off when choosing an Object Detection algorithm.</blockquote><p>In this blog post, we will learn how to build a simple but effective surveillance system using Object Detection. Let us first discuss the constraints we are bound to because of the nature of the surveillance task.</p><h3>Constraints for Deep Learning in Surveillance</h3><p>Often we would like to keep a look-out over a large stretch of land. This brings forth a couple of factors that we may need to consider before automating surveillance.</p><h4>1. Video Feed</h4><p>Naturally, to keep a look-out over a large area, we may require multiple cameras. Moreover, these cameras need to store this data somewhere, either locally or at a remote location.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dW9pMZ7kw0Qa8gU7PiS2kg.jpeg" /><figcaption>Typical surveillance cameras. (Photo by <a href="https://unsplash.com/photos/yekGLpc3vro?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Scott Webb</a> on <a href="https://unsplash.com/search/photos/surveillance?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>)</figcaption></figure><p>A higher quality video will take a lot more memory than a lower quality one. Moreover, an RGB input stream is 3x larger than a BW input stream. 
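</p><p>A quick back-of-the-envelope calculation shows why this matters; the camera parameters below are hypothetical, purely for illustration:</p>

```python
# Rough uncompressed storage estimate for a single camera.
width, height = 1920, 1080  # hypothetical resolution
fps = 30                    # hypothetical frame rate
channels = 3                # 3 for RGB, 1 for BW

bytes_per_frame = width * height * channels
bytes_per_day = bytes_per_frame * fps * 60 * 60 * 24

print(f"RGB: ~{bytes_per_day / 1e12:.1f} TB of raw footage per day")
print(f"BW:  ~{bytes_per_day / channels / 1e12:.1f} TB of raw footage per day")
```

<p>Real systems compress the stream heavily, of course; the point is only the relative sizes. 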
Since we can only store a finite amount of the input stream, the quality is often lowered to maximize storage.</p><p>Therefore, a scalable surveillance system should be able to interpret low quality images. Hence, our Deep Learning algorithm must be trained on such low quality images as well.</p><h4>2. Processing Power</h4><p>Now that we have resolved the input constraint, we can answer a bigger question. Where do we process the data obtained from camera sources? There are two methods of doing this.</p><ul><li><strong>Processing on a centralized server:</strong></li></ul><p>The video streams from the cameras are processed frame by frame on a remote server or a cluster. This method is robust, and enables us to reap the benefits of complex models with high accuracies. The obvious problem is latency; you need a fast Internet connection for limited delay. Moreover, if you are not using a commercial API, the server setup and maintenance costs can be high.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/690/1*AQAuo7fY13C-5ywAeD4hPg.png" /><figcaption>Memory consumption vs Inference GPU Time (milliseconds). Most high performance models consume a lot of memory. (<a href="https://arxiv.org/pdf/1611.10012.pdf">Source</a>)</figcaption></figure><ul><li><strong>Processing on the edge:</strong></li></ul><p>By attaching a small microcontroller, we can perform real-time inference on the camera itself. There is no transmission delay, and abnormalities can be reported faster than the previous method. Moreover, this is an excellent add-on for mobile bots (such as microdrones), as they need not be constrained by the range of WiFi/Bluetooth available.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*91IEBjndluej6k7wt17GFw.png" /><figcaption>FPS capability of various object detectors. 
(<a href="https://medium.com/@jonathan_hui/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359">Source</a>)</figcaption></figure><p>The disadvantage is that microcontrollers aren’t as powerful as GPUs, and hence you may be forced to use models with lower accuracy. This issue can be circumvented by using onboard GPUs, but that is an expensive solution. An interesting solution would be to use software such as TensorRT, which can optimize your program for inference.</p><h3>Training a Surveillance System</h3><p>In this section, we will check out how to identify pedestrians using Object Detection. We’ll use the TensorFlow Object Detection API to create our Object Detection module. We will briefly explore how to set up the API and train it for our surveillance task. For a more detailed explanation, you can check out this <a href="https://medium.freecodecamp.org/how-to-play-quidditch-using-the-tensorflow-object-detection-api-b0742b99065d">blog post</a>.</p><h4>The entire process can be summarized in three phases:</h4><ol><li>Data preparation</li><li>Training the model</li><li>Inference</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*D85zUhvj5u7dHecM0XlRFA.jpeg" /><figcaption>The workflow involved in training an Object Detection model.</figcaption></figure><p>If you feel like seeing the results would motivate you more to try it out, feel free to scroll down to Phase 3!</p><h3>Phase 1: Data Preparation</h3><h4>Step 1: Obtain the dataset</h4><p>Surveillance footage taken in the past is probably the most accurate dataset you can get. But it’s often hard to obtain such surveillance footage in most cases. 
In that case, we can train our object detector to generally recognize our targets from normal images.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dUiiquKeRQFRhZ82ix4ICw.jpeg" /><figcaption>Sample annotated image from our dataset.</figcaption></figure><p>As discussed before, the images in your camera feed may be of lower quality. So you must train your model to work in such conditions. A very elegant way of doing that is by performing data augmentation, which is explained in detail <a href="https://medium.com/nanonets/how-to-use-deep-learning-when-you-have-limited-data-part-2-data-augmentation-c26971dc8ced">here</a>. Essentially, we have to add some noise to degrade the image quality of the dataset. We could also experiment with blur and erosion effects.</p><p>We’ll use the <a href="http://www.robots.ox.ac.uk/ActiveVision/Research/Projects/2009bbenfold_headpose/project.html#datasets">TownCentre Dataset</a> for our object detection task. We’ll use the first 3600 frames of the video for training and validation, and the remaining 900 for testing. You can use the scripts in my <a href="https://github.com/thatbrguy/Pedestrian-Detector">GitHub repo</a> to extract the dataset.</p><h4>Step 2: Annotate the dataset</h4><p>You could use a tool such as LabelImg to perform the annotations. This is a tedious task, but important all the same. The annotations are saved as XML files.</p><p>Luckily, the owners of the <a href="http://www.robots.ox.ac.uk/ActiveVision/Research/Projects/2009bbenfold_headpose/project.html#datasets">TownCentre Dataset</a> have provided annotations in CSV format. I wrote a quick script to convert the annotations to the required XML format, which you can find in my <a href="https://github.com/thatbrguy/Pedestrian-Detector">GitHub repo</a>.</p><h4>Step 3: Clone the repository</h4><p>Clone the <a href="https://github.com/thatbrguy/Pedestrian-Detector">repository</a>. 
Run the following commands to install requirements, compile some Protobuf libraries and set path variables:</p><pre>pip install -r requirements.txt<br>sudo apt-get install protobuf-compiler<br>protoc object_detection/protos/*.proto --python_out=.<br>export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim</pre><h4>Step 4: Prepare the supporting inputs</h4><p>We need to assign an ID to our target. We define the ID in a file called label_map.pbtxt as follows:</p><pre>item {<br> id: 1<br> name: 'target'<br>}</pre><p>Next, you must create a text file with the names of the XML and image files. For instance, if you have img1.jpg, img2.jpg and img1.xml, img2.xml in your dataset, your trainval.txt file should look like this:</p><pre>img1<br>img2</pre><p>Separate your dataset into two folders, namely <strong>images</strong> and <strong>annotations</strong>. Place the <strong>label_map.pbtxt</strong> and <strong>trainval.txt </strong>inside your annotations folder. Create a folder named <strong>xmls </strong>inside the annotations folder and place all your XMLs inside that. Your directory hierarchy should look something like this:</p><pre>-base_directory<br>|-images<br>|-annotations<br>||-xmls<br>||-label_map.pbtxt<br>||-trainval.txt</pre><h4>Step 5: Create TF Records</h4><p>The API accepts inputs in the <strong>TFRecords</strong> file format. Use the <strong>create_tf_record.py </strong>file provided in my <a href="https://github.com/thatbrguy/Pedestrian-Detector">repo</a> to convert your dataset into TFRecords. You should execute the following command in your base directory:</p><pre>python create_tf_record.py \<br>    --data_dir=`pwd` \<br>    --output_dir=`pwd`</pre><p>You will find two files, <strong>train.record</strong> and <strong>val.record</strong>, after the program finishes its execution.</p><h3>Phase 2: Training the model</h3><h4>Step 1: Model Selection</h4><p>As mentioned before, there is a trade-off between speed and accuracy. 
Also, building and training an object detector from scratch would be extremely time consuming. So, the TensorFlow Object Detection API provides a bunch of pre-trained models, which you can fine tune to your use case. This process is known as <a href="https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab">Transfer Learning</a>, and it speeds up your training process by an enormous amount.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YfELRLCcGoNYCI-tV260zg.png" /><figcaption>A bunch of models pre-trained on the MS COCO Dataset</figcaption></figure><p>Download one of these models, and extract the contents into your base directory. You will receive the model checkpoints, a frozen inference graph, and a pipeline.config file.</p><h4>Step 2: Defining the training job</h4><p>You have to define the “training job” in the <strong>pipeline.config</strong> file. Place the file in the base directory. What really matters is the last few lines of the file — you only need to set the highlighted values to your respective file locations.</p><pre>gradient_clipping_by_norm: 10.0<br>  fine_tune_checkpoint: &quot;<strong>model.ckpt</strong>&quot;<br>  from_detection_checkpoint: true<br>  num_steps: 200000<br>}<br>train_input_reader {<br>  label_map_path: &quot;<strong>annotations/label_map.pbtxt</strong>&quot;<br>  tf_record_input_reader {<br>    input_path: &quot;<strong>train.record</strong>&quot;<br>  }<br>}<br>eval_config {<br>  num_examples: 8000<br>  max_evals: 10<br>  use_moving_averages: false<br>}<br>eval_input_reader {<br>  label_map_path: &quot;<strong>annotations/label_map.pbtxt</strong>&quot;<br>  shuffle: false<br>  num_epochs: 1<br>  num_readers: 1<br>  tf_record_input_reader {<br>    input_path: &quot;<strong>val.record</strong>&quot;<br>  }<br>}</pre><h4>Step 3: Commence training</h4><p>Execute the below command to start the training job. 
It’s recommended to use a machine with a large enough GPU (provided you installed the GPU version of TensorFlow) to accelerate the training process.</p><pre>python object_detection/train.py \<br>--logtostderr \<br>--pipeline_config_path=pipeline.config \<br>--train_dir=train</pre><h3>Phase 3: Inference</h3><h4>Step 1: Export the trained model</h4><p>Before you can use the model, you need to export the trained checkpoint files to a frozen inference graph. It’s actually easier done than said — just execute the code below (Replace ‘xxxxx’ with the checkpoint number):</p><pre>python object_detection/export_inference_graph.py \<br>--input_type=image_tensor \<br>--pipeline_config_path=pipeline.config \<br>--trained_checkpoint_prefix=train/model.ckpt-<strong>xxxxx</strong> \<br>--output_directory=output</pre><p>You will obtain a file named <strong>frozen_inference_graph.pb</strong>, along with a bunch of checkpoint files.</p><h4>Step 2: Use it on a video stream</h4><p>We need to extract individual frames from our video source. It can be done using OpenCV’s VideoCapture method, as follows:</p><pre>cap = cv2.VideoCapture(0)  # a camera index or a video file path<br>flag = True</pre><pre>while(flag):<br>    flag, frame = cap.read()<br>    ## -- Object Detection Code --<br>cap.release()</pre><p>The data extraction code used in Phase 1 automatically creates a folder ‘test_images’ with our test set images. We can run our model on the test set by executing the following:</p><pre>python object_detection/inference.py \<br>--input_dir=<strong>{PATH}</strong> \<br>--output_dir=<strong>{PATH}</strong> \<br>--label_map=<strong>{PATH}</strong> \<br>--frozen_graph=<strong>{PATH}</strong> \<br>--num_output_classes=1<strong> \<br></strong>--n_jobs=1 \<br>--delay=0</pre><h3>Experiments</h3><p>As mentioned earlier, there is a trade-off between speed and accuracy while choosing an object detection model. I ran some experiments which measured the FPS and count accuracy of the people detected using three different models. 
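</p><p>The FPS measurement itself can be sketched in a few lines (this is not the exact benchmarking code used here; the dummy detector below, which just sleeps, is a stand-in for a real model’s per-frame inference call):</p>

```python
import time

def measure_fps(detect, frames):
    """Time a per-frame inference function and return frames per second."""
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

def dummy_detect(frame):
    # Stand-in for real inference; sleeps to simulate model latency.
    time.sleep(0.001)

fps = measure_fps(dummy_detect, frames=[None] * 50)
print(f"{fps:.1f} FPS")
```

<p>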
Moreover, the experiments were run under different resource constraints (GPU parallelism constraints). The outcome of these experiments can give you some valuable insights while selecting an object detection model.</p><h4>Setup</h4><p>The following models were selected for our experiment. These are available in the TensorFlow Object Detection API’s Model Zoo.</p><ul><li>Faster RCNN with ResNet 50</li><li>SSD with MobileNet v1</li><li>SSD with InceptionNet v2</li></ul><p>All models were trained on Google Colab for 10k steps (or until their loss saturated). For inference, an AWS p2.8xlarge instance was used. The count accuracy was measured by comparing the number of people detected by the model and the ground truth. The inference speed in Frames per Second (FPS) was tested under the following constraints:</p><ul><li>Single GPU</li><li>Two GPUs in parallel</li><li>Four GPUs in parallel</li><li>Eight GPUs in parallel</li></ul><h3>Results</h3><p>Here’s an excerpt from the output produced by using FasterRCNN on our test set. I’ve also attached a video comparing the output produced by each model near the end of this blog. Feel free to scroll down and check it out!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/480/1*-WkySYuR7koWY3g_Ikec2A.gif" /></figure><h4>Training Time</h4><p>The plot below shows the time needed to train each model for 10k steps (in hours). This excludes the time required for a hyperparameter search.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PonvbFYwkzO_b2op8j7m8w.png" /></figure><p>When your application is very different from the pretrained model you use for transfer learning, you may need to heavily adjust the hyperparameters. However, when your application is similar, you wouldn’t need to do an extensive search. 
Nonetheless, you may still need to experiment with training parameters such as the learning rate and choice of optimizer.</p><h4>Speed (Frames per Second)</h4><p>This was the most interesting part of our experiment. As stated earlier, we measured the FPS performance of our three models under different resource constraints. The results are shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DDoaea3tiL_hjVYQieQneA.png" /></figure><p>SSDs are extremely fast, easily beating Faster RCNN’s speed when we use a single GPU. However, Faster RCNN quickly catches up with SSD when we increase the number of GPUs (working in parallel). Needless to say, SSD with MobileNet is much faster than SSD with InceptionNet in a low-GPU environment.</p><p>One notable feature from the above graph is that FPS slightly decreases when we increase the number of GPUs for SSD with MobileNet. There’s actually a simple answer to this apparent paradox. It turns out that our setup processed the images faster than they were being supplied by the image read function!</p><blockquote>The speed of your video processing system cannot be greater than the speed at which images are fed to the system.</blockquote><p>To prove my hypothesis, I gave the image read function a head-start. The plot below shows the improvement in FPS for SSD with MobileNet when a delay was added. The slight reduction in FPS in the earlier graph is because of the overhead involved due to multiple GPUs requesting input.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jft44ZvgB4TLm5wsta0OKQ.png" /></figure><p>Needless to say, we observe a sharp increase in FPS if we introduce delays. The bottom line is that we need to have an optimized image transfer pipeline to prevent a bottleneck for speed. But since our intended use case is surveillance, we have an additional bottleneck. 
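</p><p>Before getting to that, here is one simple way to optimize the image transfer pipeline itself: a producer-consumer setup, where a reader thread keeps a bounded buffer topped up so the inference workers never starve. A minimal sketch, with a simulated frame source standing in for cv2.VideoCapture:</p>

```python
import queue
import threading

def frame_reader(source, buffer):
    """Producer: push frames into the buffer, then a sentinel at end of stream."""
    for frame in source:
        buffer.put(frame)  # blocks when the buffer is full
    buffer.put(None)       # sentinel: no more frames

# Simulated stream of 100 frames; a real reader would wrap cv2.VideoCapture.
source = range(100)
buffer = queue.Queue(maxsize=16)
threading.Thread(target=frame_reader, args=(source, buffer), daemon=True).start()

processed = 0
while (frame := buffer.get()) is not None:
    processed += 1  # inference on `frame` would happen here
print(f"processed {processed} frames")
```

<p>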
The FPS of the surveillance camera sets the upper limit for the FPS of our system.</p><h4>Count Accuracy</h4><p>We define count accuracy as the percentage of people correctly recognized by our object detection system. I felt it’s more apt for surveillance. Here’s how each of our models performed:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QW7jFFQTIke5SwmBBY2quA.png" /></figure><p>Needless to say, Faster RCNN is the most accurate model. Also, surprisingly, MobileNet performs better than InceptionNet.</p><p>Based on the experiments, it is evident that there is indeed a speed vs accuracy trade-off. However, we can use a model with high accuracy at a good FPS rate if we have enough resources. We observe that Faster RCNN with ResNet-50 offers the best accuracy, and a very good FPS rating when deployed on 4+ GPUs in parallel.</p><h3>That was a lot of steps!</h3><p>Well… I wouldn’t argue. It is indeed a lot of steps. Moreover, setting up a cloud instance for this model to work in real time would be burdensome and expensive.</p><p>A better solution would be to use an API service that is already deployed on servers so that you can just worry about developing your product. That’s where <a href="https://nanonets.com/object-detection-api/?utm_source=Medium&amp;utm_campaign=surveillance%20blog/"><strong>Nanonets</strong></a> kicks in. They have their API deployed on quality hardware with GPUs such that you get insane performance with none of the hassle!</p><p>I converted my existing XML annotations to JSON format and fed it to the Nanonets API. As a matter of fact, if you don’t want to manually annotate your dataset, you can request them to annotate it for you. 
Here’s the reduced workflow when Nanonets takes care of the heavy lifting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*Yv4isduxmtLunpSZbBl_Ag.jpeg" /><figcaption>Reduced workflow with Nanonets</figcaption></figure><p>Earlier, I mentioned how mobile surveillance units such as micro drones can greatly enhance efficiency. We can create such drones quite easily using microcontrollers such as the Raspberry Pi, and we can use API calls to perform inference.</p><p>It’s pretty simple to get started with the Nanonets API for Object Detection, but for a well-explained guide, you can check out this <a href="https://medium.com/nanonets/how-to-easily-detect-objects-with-deep-learning-on-raspberrypi-225f29635c74">blog post</a>.</p><h3>Results with Nanonets</h3><p>It took about 2 hours for Nanonets to finish the training process. This includes the time required for hyperparameter search. In terms of time taken for training, Nanonets is the clear winner. Nanonets also defeated FasterRCNN in terms of count accuracy.</p><pre>FasterRCNN Count Accuracy = 88.77%<br>Nanonets Count Accuracy = <strong>89.66%</strong></pre><p>Here is the performance of all four models on our test dataset. It is evident that both SSD models are a bit unstable and have lower accuracy. 
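</p><p>For reference, count accuracy as described above can be computed along these lines (a sketch with made-up per-frame counts, not the exact evaluation script used here):</p>

```python
def count_accuracy(predicted_counts, true_counts):
    """Mean per-frame fraction of people correctly counted, as a percentage.

    Overcounting is not penalized in this simplified version.
    """
    scores = []
    for pred, true in zip(predicted_counts, true_counts):
        if true == 0:
            scores.append(1.0 if pred == 0 else 0.0)
        else:
            scores.append(min(pred, true) / true)
    return 100 * sum(scores) / len(scores)

# Made-up per-frame counts, for illustration only.
predicted = [11, 9, 10, 12]
ground_truth = [12, 10, 10, 12]
print(f"{count_accuracy(predicted, ground_truth):.2f}%")
```

<p>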
Moreover, even though FasterRCNN and Nanonets have comparable accuracies, the latter has bounding boxes that are more stable.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F0hWW6FVcFAo%3Ffeature%3Doembed&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0hWW6FVcFAo&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F0hWW6FVcFAo%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/ae27df7c6c53eeb6e98272efaab42b77/href">https://medium.com/media/ae27df7c6c53eeb6e98272efaab42b77/href</a></iframe><h3>Is automated surveillance accountable?</h3><p>Deep learning is an amazing tool that provides exemplary results with ease. But to what extent can we trust our surveillance system to act on its own? There are a few instances where automation is questionable.</p><blockquote><strong>Update:</strong> In light of GDPR and the reasons stated below, it is <strong>imperative</strong> that we ponder the legality and ethical issues concerning the automation of surveillance. This blog is for educational purposes only, and it used a publicly available dataset. It is your responsibility to make sure that your automated system complies with the law in your region.</blockquote><h4>1. Dubious Conclusions</h4><p>We do not know how a deep learning algorithm arrives at a conclusion. Even if the data feeding process is impeccable, there may be a lot of spurious hits. For instance, an AI profanity filter used by British police kept removing pictures of <a href="https://gizmodo.com/british-cops-want-to-use-ai-to-spot-porn-but-it-keeps-m-1821384511?utm_campaign=socialflow_gizmodo_twitter&amp;utm_source=gizmodo_twitter&amp;utm_medium=socialflow">sand dunes, thinking they were obscene images</a>. 
Techniques such as <a href="http://www.cs.toronto.edu/~guerzhoy/321/lec/W07/HowConvNetsSee.pdf">guided backpropagation</a> can explain decisions to some extent, but we still have a long way to go.</p><h4>2. Adversarial Attacks</h4><p>Deep Learning systems are fragile. <a href="https://blog.openai.com/adversarial-example-research/">Adversarial attacks</a> are akin to optical illusions for image classifiers. The scary part is that a calculated, imperceptible perturbation can force a deep learning model to misclassify. Using the same principle, researchers have been able to circumvent surveillance systems based on deep learning by using “<a href="https://www.digitaltrends.com/cool-tech/facial-recognition-glasses-security/">adversarial glasses</a>”.</p><h4>3. False positives</h4><p>Another problem is what to do in the event of a false positive. The severity of the issue depends on the application itself. For instance, a false positive on a border patrol system may be more significant than one on a garden monitoring system. There should be some amount of human intervention to avoid mishaps.</p><h4>4. Similar faces</h4><p>Sadly, your face is not as unique as your fingerprint. It is possible for two (or more) people to look very similar. Identical twins are a prime example. It was reported that Apple Face ID <a href="https://www.mirror.co.uk/tech/apple-accused-racism-after-face-11735152">failed to distinguish</a> between two unrelated Chinese coworkers. This could make surveillance and identification harder.</p><h4>5. Lack of diversity in datasets</h4><p>Deep Learning algorithms are only as good as the data you provide them. Most popular datasets of human faces predominantly contain samples of white people. While it may seem obvious to a child that humans can exist in various colors, Deep Learning algorithms are sort of dumb. 
In fact, Google got into trouble because it classified a black person <a href="https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/">incorrectly as a gorilla</a>.</p><blockquote><strong>About Nanonets: </strong>Nanonets is building APIs to simplify deep learning for developers. Visit us at<a href="https://nanonets.com/object-detection-api/?utm_source=Medium&amp;utm_campaign=surveillance%20blog/"> https://www.nanonets.com</a> for more.</blockquote><hr><p><a href="https://medium.com/nanonets/how-to-automate-surveillance-easily-with-deep-learning-4eb4fa0cd68d">How to Automate Surveillance Easily with Deep Learning</a> was originally published in <a href="https://medium.com/nanonets">NanoNets</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deep Learning on the Edge]]></title>
            <link>https://medium.com/data-science/deep-learning-on-the-edge-9181693f466c?source=rss-7d6e83a807b8------2</link>
            <guid isPermaLink="false">https://medium.com/p/9181693f466c</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Bharath Raj]]></dc:creator>
            <pubDate>Sun, 24 Jun 2018 17:28:24 GMT</pubDate>
            <atom:updated>2018-12-03T14:29:01.983Z</atom:updated>
            <content:encoded><![CDATA[<h4>An overview of performing Deep Learning on mobile and edge devices.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iuSKK3d30UFIb_vNPdftjQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/FO7JIlwjOtU?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Alexandre Debiève</a> on <a href="https://unsplash.com/search/photos/electronics?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>Scalable Deep Learning services are subject to several constraints. Depending on your target application, you may require low latency, enhanced security or long-term cost effectiveness. Hosting your Deep Learning model on the cloud may not be the best solution in such cases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/950/1*In6L9g43W0RmYOXASC1vtA.jpeg" /><figcaption>Computing on the Edge (<a href="https://www.ptgrey.com/edge-computing">Source</a>)</figcaption></figure><p><strong>Deep Learning on the edge</strong> alleviates the above issues, and provides other benefits. Edge here refers to computation performed locally on the consumer’s device. This blog explores the <strong>benefits</strong> of using edge computing for Deep Learning, and the<strong> problems</strong> associated with it.</p><h3>Why edge? Why not use the cloud?</h3><p>There is a plethora of compelling reasons to favor edge computing over cloud computing.</p><h4>1. Bandwidth and Latency</h4><p>There’s no doubt that there is a tangible Round Trip Time (RTT) associated with API calls to a remote server. 
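To put that round trip in perspective, here is a quick back-of-the-envelope calculation; the 120 ms figure below is a hypothetical RTT for illustration, not a measurement.

```python
def frame_interval_ms(fps):
    """Milliseconds between consecutive frames at a given stream rate."""
    return 1000.0 / fps

def frames_in_flight(rtt_ms, fps):
    """Whole frame intervals that elapse while one request/response is in flight."""
    return int(rtt_ms * fps // 1000)

# A 30 FPS camera produces a frame every ~33 ms, so with a 120 ms round
# trip to a remote server, 3 new frames arrive before the first answer returns.
print(round(frame_interval_ms(30), 1))  # 33.3
print(frames_in_flight(120, 30))        # 3
```

In other words, unless the round trip fits inside one frame interval, a cloud-hosted model is always answering about stale frames.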
Applications that demand near-instantaneous inference cannot function properly under such latency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WIqaQsKbqWTPlClBqWzHNQ.jpeg" /><figcaption>Latency and Power consumption stats for Object Detection (DET), Tracking (TRA) and Localization (LOC) on four different edge devices (<a href="https://blog.acolyer.org/2018/04/20/the-architectural-implications-of-autonomous-driving-constraints-and-acceleration/">Source</a>)</figcaption></figure><p>Take <strong>self driving cars</strong> for example. A large enough <strong>latency</strong> could significantly increase the <strong>risk </strong>of accidents. Moreover, unexpected events such as an animal crossing or jaywalking can unfold over just a few frames. In these cases, response time is extremely critical. This is why Nvidia has their custom <a href="https://www.nvidia.com/en-us/self-driving-cars/drive-platform/">on-board compute devices</a> to perform inference on the edge.</p><p>Moreover, when you have a large number of devices connected to the same network, the effective bandwidth is reduced because of the inherent competition to use the communication channel. This can be significantly mitigated if computation is done on the edge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/947/1*KX645CSfemBCM6mi5xfV1A.png" /><figcaption>Bandwidth requirement for various applications. (<a href="https://siliconupdates.blogspot.com/2017/07/augmented-reality-and-virtual-reality.html">Source</a>)</figcaption></figure><p>Take the case of processing 4K HD videos on multiple devices. Processing them locally would greatly reduce bandwidth usage, since we do not need to upload data to the cloud for inference. This also makes the network relatively easy to scale.</p><h4>2. Security and Decentralization</h4><p>Commercial servers are prone to attacks and hacks. Of course, the risk is negligible if you use a trusted vendor. 
But you are required to trust a third party with the security of the data you collect and your intellectual property (IP). Having devices on the edge gives you <strong>absolute control</strong> over your IP.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/767/1*VwgFhNj-vmbd6VZx-NyKIA.png" /><figcaption>Centralized vs Decentralized vs Distributed. (<a href="https://blog.ethfinex.com/the-significance-of-decentralisation-b7f72655484e">Source</a>)</figcaption></figure><p>If you’ve heard about blockchain, decentralization or distribution may be familiar to you. Nonetheless, having several devices on the edge reaps all the benefits of decentralization. It’s <strong>harder</strong> to bring down an entire network of hidden devices using a <strong>single DDoS attack</strong> than a centralized server. This is especially useful for applications such as using drones for border patrol.</p><h4>3. Job Specific Usage (Customization)</h4><p>Imagine you have a factory that produces toys. It has a couple of hundred workstations. You require an image classification service at every workstation. The problem is that each workstation has a different set of objects, and training a single classifier may not be effective. Moreover, hosting <strong>multiple classifiers</strong> on the <strong>cloud</strong> would be <strong>expensive</strong>.</p><p>The cost effective solution is to train classifiers specific to each workstation on the cloud, and ship the <strong>trained models</strong> to the <strong>edge devices</strong>. Now, these devices are customized to their workstation. They would have better performance than a single classifier predicting across all workstations.</p><h4>4. Swarm Intelligence</h4><p>Continuing with the idea mentioned above, edge devices can aid in training machine learning models too. 
This is especially useful for <strong>Reinforcement Learning</strong>, for which you could simulate a large number of “episodes” in <strong>parallel</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*KYomzpKV5Ub3zjVB6hNqAg.jpeg" /><figcaption>Multiple agents trying to grasp objects. (<a href="http://robohub.org/deep-learning-in-robotics/">Source</a>)</figcaption></figure><p>Moreover, edge devices can be used to collect data for <strong>Online Learning</strong> (or <strong>Continuous Learning</strong>). For instance, we can use multiple drones to survey an area for classification. Using optimization techniques such as Asynchronous SGD, a single model can be <strong>trained in parallel</strong> across all edge devices. They can also simply be used to aggregate and process data from various sources.</p><h4>5. Redundancy</h4><p>Redundancy is extremely vital for robust memory and network architectures. Failure of one node in a network could have serious impacts on the other nodes. In our case, edge devices can provide a good level of redundancy. If one of our edge devices (here, a node) fails, its neighbor can take over temporarily. This ensures reliability and heavily reduces downtime.</p><h4>6. Cost effective in the long run</h4><p>In the long run, cloud services will turn out to be more expensive than having a dedicated set of inference devices. This is especially true if your devices have a large duty cycle (that is, they are working most of the time). Moreover, edge devices become much cheaper when fabricated in bulk.</p><h3>Constraints for Deep Learning on the Edge</h3><p><strong>Deep Learning </strong>models are known for being <strong>large</strong> and <strong>computationally expensive</strong>. It’s a challenge to fit these models into <strong>edge devices</strong>, which usually have <strong>frugal memory</strong>. 
There are a number of ways by which we can approach these problems.</p><h4>1. Parameter Efficient Neural Networks</h4><p>A striking feature of neural networks is their enormous size. Edge devices typically cannot handle large neural networks. This motivated researchers to minimize the size of neural networks while maintaining accuracy. Two popular parameter-efficient neural networks are <a href="https://arxiv.org/abs/1704.04861">MobileNet</a> and <a href="https://arxiv.org/abs/1602.07360">SqueezeNet</a>.</p><p><strong>SqueezeNet</strong> incorporates a number of strategies, such as late down-sampling and filter count reduction, to achieve high performance at a low parameter count. It introduces “Fire modules” with “squeeze” and “expand” layers that optimize the parameter efficiency of the network.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/628/1*c5ZTXzakIVkDjC5DGlBhyQ.png" /><figcaption>Fire module in the SqueezeNet. (<a href="https://arxiv.org/pdf/1602.07360.pdf">Source</a>)</figcaption></figure><p><strong>MobileNet</strong> factorizes normal convolutions into a combination of depthwise convolutions and 1x1 pointwise convolutions. This arrangement greatly reduces the number of parameters involved.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/712/1*7EJDG7ypufhx14nIu1Vw9Q.jpeg" /><figcaption>Top 1 accuracy on the ImageNet dataset with respect to the number of Multiply-Accumulates (MACs). (<a href="https://ai.googleblog.com/2017/06/mobilenets-open-source-models-for.html">Source</a>)</figcaption></figure><h4>2. Pruning and Truncation</h4><p>A large number of neurons in trained networks are redundant and do not contribute to the final accuracy. In this case, we can <strong>prune</strong> such neurons to save some space. 
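The core idea behind magnitude pruning fits in a few lines of plain Python. This toy function is illustrative only; real toolchains (such as TensorFlow's Model Optimization Toolkit) prune per layer and fine-tune the network afterwards to recover accuracy.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the smallest-magnitude weights.

    A toy version of magnitude pruning: weights whose absolute value
    falls in the bottom `fraction` are set to 0, after which they can
    be stored sparsely or skipped at inference time.
    """
    ranked = sorted(weights, key=abs)
    # Threshold = magnitude of the largest weight we intend to drop.
    cutoff = abs(ranked[int(len(ranked) * fraction) - 1]) if fraction > 0 else -1.0
    return [0.0 if abs(w) <= cutoff else w for w in weights]

w = [0.02, -1.3, 0.005, 0.9, -0.04, 2.1]
print(prune_by_magnitude(w, 0.5))  # [0.0, -1.3, 0.0, 0.9, 0.0, 2.1]
```

Half the weights vanish while the large-magnitude ones, which carry most of the signal, survive; a real pipeline would then retrain briefly to recover the small accuracy drop.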
Google’s <a href="https://ai.googleblog.com/2018/05/custom-on-device-ml-models.html">Learn2Compress</a> has found that we can obtain a<strong> size</strong> <strong>reduction by a factor of 2</strong> while retaining 97% of the accuracy.</p><p>Moreover, most neural network parameters are 32-bit float values. Edge devices, on the other hand, can be designed to work on 8-bit values or less. Reducing precision can significantly reduce the model size. For instance, reducing a <strong>32-bit model</strong> to an <strong>8-bit model</strong> ideally reduces the model size by a <strong>factor of 4</strong>.</p><h4>3. Distillation</h4><p>Distillation is the process of teaching smaller networks using a larger “teacher” network. Google’s <a href="https://ai.googleblog.com/2018/05/custom-on-device-ml-models.html">Learn2Compress</a> incorporates this in its size reduction process. Combined with transfer learning, this becomes a powerful method to reduce model size without losing much accuracy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*Id6ZQBu9VgANtd8frCQK-w.png" /><figcaption>Joint training and distillation approach to learn compact student models. (<a href="https://ai.googleblog.com/2018/05/custom-on-device-ml-models.html">Source</a>)</figcaption></figure><h4>4. Optimized Microprocessor Designs</h4><p>So far we have discussed ways to scale down neural networks to fit our edge devices. An alternate (or complementary) method is to scale up the performance of the microprocessor.</p><p>The simplest solution would be to have a GPU on a microprocessor, such as the popular <a href="https://developer.nvidia.com/embedded/buy/jetson-tx2">Nvidia Jetson</a>. 
However, these devices may not be cost effective when deployed on a large scale.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/676/1*KlSkL6KOtxml6s5jgqA7Ag.png" /><figcaption>Nvidia Jetson (<a href="https://developer.nvidia.com/embedded/buy/jetson-tx2">Source</a>)</figcaption></figure><p>A more interesting solution would be to use Vision Processing Units (VPUs). Intel claims that their Movidius VPUs offer “high speed performance at an ultra low power consumption”. Google’s <a href="https://www.movidius.com/news/google-launches-aiy-vision-kit-featuring-intel-movidius-vpu">AIY kits</a> and Intel’s <a href="https://www.movidius.com/news/intel-movidius-neural-compute-stick-honored-with-ces-best-of-innovation-award-2018">Neural Compute Stick</a> internally use this VPU.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/770/1*KSItXzlVVbP37XZ7WW-SJw.jpeg" /><figcaption>Google AIY’s Vision Bonnet using a Movidius VPU. (<a href="https://www.zdnet.com/article/google-offers-raspberry-pi-owners-this-new-ai-vision-kit-to-spot-cats-people-emotions/">Source</a>)</figcaption></figure><p>Alternatively, we could use FPGAs. They have lower power consumption than GPUs and can accommodate lower-bit (&lt; 32 bit) architectures. However, there could be a slight drop in performance compared to GPUs owing to their lower FLOPs rating.</p><p>For large scale deployment, custom ASICs would be the best solution. Fabricating micro-architectures similar to Nvidia’s V100 to accelerate matrix multiplications could greatly boost performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/560/1*gzozPNb4SFcWPNkOZeNDZw.png" /><figcaption>Pascal vs Volta architecture; Nvidia. 
(<a href="https://www.nvidia.com/en-us/data-center/tensorcore/">Source</a>)</figcaption></figure><hr><p><a href="https://medium.com/data-science/deep-learning-on-the-edge-9181693f466c">Deep Learning on the Edge</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>