Stories by Max Ehrlich on Medium

Compression in the ImageNet Dataset

Max Ehrlich — Fri, 14 May 2021 18:08:29 GMT

Making Sense of Big Data

A deep dive into the compression settings of deep learning’s most popular benchmark

Frankenstein’s Monster of datasets? Read on to find out. Image by author.

I’ve been on a mission lately to get more people thinking about how lossy compression affects their deep learning models [1]. In the process I spent a lot of time with ImageNet [2], which consists entirely of JPEG files, and I started noticing some peculiar compression settings. To see how systemic these odd settings are, I decided to survey the compression settings over the entire dataset. In this post, I report what I saw, including why I think some of these settings are weird, and show the statistics I computed for each of the relevant compression settings. At the end I show that by plotting a 2D projection of these compression settings, it’s actually possible to see graphically that there were several different sources involved in the creation of ImageNet.

Methodology

To examine the images, I used the read_coefficients function from torchjpeg. This function reads DCT coefficients directly from a JPEG file without decoding them, allowing me to examine low level details like chroma subsampling and quantization. This process revealed that one image in the training set is actually not a JPEG at all. It’s a PNG that someone renamed to have a .JPEG extension. In all it took around 4.5 hours to process the training set and around 10 minutes for the validation set. After processing I plotted the results. I’ve made all the data collection and plotting code available in this gist along with full size PDF plots.

Note that the Y axis on all charts uses a log scale.

Overview

Overall most of the images in ImageNet are compressed “lightly” and are quite small. The most common image size appears to be around 500 by 500, though there are some outlier sizes that are very large or oddly proportioned, for example a 2592 by 3888 image (very large) or a 500 by 33 image (oddly proportioned). This is important because most people resize images to 224 by 224 during preprocessing, and weird aspect ratios or much larger/smaller images will create artifacts from resampling.

The vast majority of the images are color images, are not chroma subsampled, and were quantized at quality 96. This indicates very light compression which would not have a noticeable effect on the images. Of course there are exceptions here, such as 4:1:1 subsampling (which is VERY bad) and some quality < 10 images (also VERY bad). There also appears to be some disparity between the training and validation sets for all of these parameters.

If you’re unfamiliar with any of the terms I used above, don’t worry: I’ll explain everything as we dive into the results in detail.

What About That PNG?

Before we go any further: yes, “n02105855/n02105855_2933.JPEG” is actually a PNG that someone renamed to .JPEG. Here it is:

“n02105855/n02105855_2933.JPEG” Image credit: ImageNet [2].

Opening it in a hex editor gives it away pretty clearly:

Big oof. Image by author.

Not much else to say about it, except to note that it’s about an order of magnitude larger than the other images of comparable size.
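The check a hex editor makes by eye can also be scripted. The following is a minimal sketch, not the code from the gist: it compares the first bytes of a file against the well-known PNG and JPEG signatures. The function names `sniff_image_type` and `sniff_file` are made up for illustration.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte PNG signature
JPEG_MAGIC = b"\xff\xd8\xff"      # JPEG start-of-image marker plus the first byte of the next marker


def sniff_image_type(header: bytes) -> str:
    """Guess the real container format from the first bytes of a file."""
    if header.startswith(PNG_MAGIC):
        return "png"
    if header.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of `path` and classify them (hypothetical helper)."""
    with open(path, "rb") as handle:
        return sniff_image_type(handle.read(8))
```

Running something like `sniff_file` over every file before decoding would flag a renamed PNG regardless of its extension.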

Color

Let’s start simple: how many images are color and how many are grayscale?

Color results. Image by author.

Fairly straightforward: almost all the images are color. One thing to note, though, is that JPEG makes a distinction between a grayscale image stored with three identical color channels and an actual grayscale image stored in one channel. What we’re counting here is the latter: images which returned only one channel when loaded.

Image Sizes

Next, let’s look at the image sizes, this section has some of the most interesting results. Image sizes are tricky to visualize, and plotting them graphically is very hard to interpret (I do have these graphs along with the plotting code in the gist), so instead I plotted the width and height on heatmaps. Because these are quite large, I cropped them to 1000x1000 for clarity (full size heatmaps are in the gist). Here’s the training set:

Training set heatmap. Image by author.

Each pixel in this map represents a size, for example the pixel at position (10, 70) shows the count of images that have a width of 10 and a height of 70. Brighter colors indicate more images.
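As a rough illustration of how such a map can be accumulated (a sketch only; the gist may do this differently), the snippet below counts images per (width, height) pair and drops anything outside the cropped region. The row/column orientation and the name `size_heatmap` are assumptions.

```python
from collections import Counter


def size_heatmap(sizes, limit=1000):
    """Count images per (width, height), cropped to a limit x limit grid.

    `sizes` is an iterable of (width, height) tuples. The result is indexed as
    grid[height][width], so column = width and row = height (an assumed layout).
    """
    counts = Counter((w, h) for w, h in sizes if w < limit and h < limit)
    grid = [[0] * limit for _ in range(limit)]
    for (w, h), n in counts.items():
        grid[h][w] = n
    return grid


sizes = [(500, 375), (500, 375), (10, 70), (2592, 3888)]  # made-up sample
grid = size_heatmap(sizes)
```

Here `grid[70][10]` holds the number of images that are 10 wide and 70 high, and the 2592 by 3888 outlier falls outside the crop.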

We can see some interesting behavior. There’s a clear preference for a width or height of 500, as well as some other intervals, and there are some interesting diagonal lines going from the top left to the bottom right of the map. Here’s the same image with some things labeled:

Training heatmap annotated. Image by author.

To make the interpretation of the diagonal lines easier, I overlaid a set of lines indicating aspect ratios 1:1 (red), 4:3 (green), and 3:2 (blue).

Training heatmap with aspect ratios. Image by author.

So we can see the lines correspond to these aspect ratios. 1:1 and 4:3 make sense, but 3:2 I only know of from 35mm film, so frankly I’m not sure how it ended up in here in such quantity.

Let’s briefly look at the same heatmap for the validation set:

Validation set heatmap. Image by author.

Not only is it significantly more sparse (in fact almost all the images are in that 500 width or height area), but the aspect ratios are much more sensible. This is concerning because the size distribution in the validation set doesn’t reflect the training set.

Time for some pathological examples. Here’s an example of a small image from the training set, its size is only 20 by 17:

“n07760859/n07760859_5275.JPEG” Image credit: ImageNet [2]

I have no idea what this is supposed to be, zooming doesn’t help, and I doubt your neural network could figure this out either.

Here’s one with a crazy aspect ratio, it’s 500 by 32:

“n04228054/n04228054_11471.JPEG” Image credit: ImageNet [2]

I think it’s a ski? It’s sure to look weird after resizing to 224x224 with or without center cropping:

“n04228054/n04228054_11471.JPEG” after center crop and resize (left) and only resize (right). Image credit: ImageNet [2]
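To put numbers on why this image resamples so badly, here is a small arithmetic sketch. It assumes the two preprocessing variants described in the caption (crop the largest centered square and then resize, or resize directly); the helper names are illustrative only.

```python
def resize_scales(width, height, target=224):
    """Per-axis scale factors when an image is squashed straight to target x target."""
    return target / width, target / height


def center_crop_then_resize(width, height, target=224):
    """Crop the largest centered square, then resize: returns (crop side, scale factor)."""
    side = min(width, height)
    return side, target / side


# The 500 by 32 image from the text.
sx, sy = resize_scales(500, 32)             # shrinks horizontally, stretches vertically
side, scale = center_crop_then_resize(500, 32)
kept_fraction = side / 500                  # share of the original width that survives the crop
```

Resizing directly stretches the height by a factor of 7 while shrinking the width to under half, and cropping first keeps only about 6% of the width before upsampling it 7 times.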

Chroma Subsampling

Next we can look at the chroma subsampling settings. I’ll first explain what chroma subsampling is, feel free to skip this section if you’re familiar, then I’ll go into the results.

What is chroma subsampling?
Human vision is less sensitive to small changes in color than it is to small changes in brightness. JPEG compression leverages this to save additional space by subsampling color information, in other words, it stores less color information than brightness information. The algorithm does this by converting the standard RGB image that it is given into the YCbCr color space. This color space separates the brightness or luma of a pixel from the color or chroma. The Y channel stores brightness, and is saved at full resolution. The Cb and Cr channels store color information (roughly blueness and redness respectively) and are often downsampled.

When we talk about how the downsampling is done, we use the following notation: “4:a:b”. This scheme refers to a 4 column, 2 row block of pixels. “a” indicates the number of color samples in the first row, and “b” indicates the number of these samples which change in the second row. So if we have 4:2:2 subsampling, we are saying that for every 4 luma samples, the first row only has 2 chroma samples, and both of them change in the second row. We interpret this as the chroma channels being half the width, but the same height as the luma channel.

The notation is strange at first but makes sense when you’re used to looking at it and I’ll fully explain the interpretation of the schemes when discussing results in the next section.
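One way to get used to the notation is to turn it into downsampling factors. The sketch below follows the reading given above ("a" samples in the first row, "b" of them changing in the second row) and covers only the common schemes; `chroma_factors` is a hypothetical helper name.

```python
def chroma_factors(scheme: str):
    """Return (horizontal, vertical) chroma downsampling factors for a '4:a:b' string."""
    j, a, b = (int(part) for part in scheme.split(":"))
    horizontal = j // a              # a chroma samples for every j luma samples in a row
    vertical = 2 if b == 0 else 1    # b == 0: the second row reuses the first row's chroma
    return horizontal, vertical


for scheme in ("4:4:4", "4:2:2", "4:2:0", "4:1:1"):
    print(scheme, chroma_factors(scheme))
```

A factor of (2, 2) for 4:2:0 means the chroma planes have half the width and half the height of the luma plane, while (4, 1) for 4:1:1 means a quarter of the width at full height.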

Results

Chroma subsampling results. Image by author.

Above you can see the chroma subsampling results. There are a couple of interesting things to note here, the first is that the vast majority of images are using “4:4:4” which means there is no subsampling. Around 10% use “4:2:0” meaning that the chroma channels are half the width and height. This is the most common setting in practice because it’s the default in many JPEG implementations, so if you’re deploying a system that’s going to work on real images, ImageNet might not be representative enough for you.

One thing that really stands out is the number of “4:1:1” images. This is a weird one (uncommon in practice), and it indicates that the chroma channels have only 1/4th the width of the luma channel (but the height is the same). This is going to incur a very large and noticeable degradation to the image. Also note that there are around an order of magnitude more of these in the validation set than there are in the training set, although they still make up a small fraction of the total images.

Here’s an example of a 4:1:1 image from the training set:

“n02445715/n02445715_2673.JPEG”. Image credit ImageNet [2].

Note how it looks terrible and the colors largely don’t make sense.

Quality

The setting that has the largest effect on the size and fidelity of a JPEG is its quality setting. This is actually non-standard but fairly common, anyone who has exported a JPEG file may be familiar with the slider that comes up asking for a quality from 0 to 100. Lower quality images look worse but are considerably smaller than high quality images. As in the last section I’ll first explain what this quality actually is, then we’ll look at the results.

What is JPEG Quality?
When a JPEG file is saved, it isn’t actually storing pixels; it’s storing coefficients of the Discrete Cosine Transform (DCT). The DCT is applied to the pixels to produce transform coefficients, and these coefficients are then quantized by rounding them to save space. This rounding is the primary source of information loss in JPEG compression and is also responsible for the majority of its space saving. Essentially, the quality is used to control the amount of rounding, so high quality means less rounding, resulting in a larger file. JPEG controls the rounding by computing a matrix from the quality factor which is used to element-wise divide the coefficients. Larger entries in this matrix mean smaller coefficients after dividing and therefore more rounding. The rounding allows the coefficients to be represented as integers and creates runs of zeros and repeated elements (lower entropy representations).
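The divide-and-round step is small enough to sketch directly. The coefficient values below are made up, and the eight table entries are simply the first row of the example luminance table from the JPEG standard, so treat this as an illustration rather than real encoder output.

```python
def quantize(coefficients, table):
    """Element-wise divide by the table entries and round to integers (the lossy step)."""
    return [round(c / q) for c, q in zip(coefficients, table)]


def dequantize(levels, table):
    """Multiply back; this only approximates the original coefficients."""
    return [level * q for level, q in zip(levels, table)]


coeffs = [-415.4, 30.2, -61.2, 27.2, 56.1, -20.1, -2.4, 0.5]  # made-up DCT coefficients
table = [16, 11, 10, 16, 24, 40, 51, 61]                      # larger entries = more rounding

levels = quantize(coeffs, table)
restored = dequantize(levels, table)
errors = [abs(c - r) for c, r in zip(coeffs, restored)]
```

The small coefficients at the end of the list collapse to zero, which is where the runs of zeros come from, and the restored values differ from the originals by up to half a table entry.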

Since quality is non-standard it is not stored in the JPEG file, and estimating quality is not always straightforward. I used the torchjpeg.quantization.ijg library to compute quantization matrices for every quality from 0 to 100 for each image until I found one that matched exactly the quantization matrix stored in the file. This is time consuming and it only works if the images were compressed using libjpeg, which luckily they all were.
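The search described above can be sketched without torchjpeg. The code below is a stand-in, not the torchjpeg API: it uses the quality scaling rule as I understand it from libjpeg (a scale factor of 5000/q below 50 and 200 - 2q otherwise, applied to the standard example luminance table and clamped to 1..255), then tries each quality until the table matches exactly. Only the luminance table is checked here.

```python
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]


def ijg_luma_table(quality: int):
    """Luminance quantization table for a libjpeg-style quality setting."""
    quality = max(1, min(100, quality))
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [max(1, min(255, (base * scale + 50) // 100)) for base in BASE_LUMA]


def estimate_quality(table):
    """Return the quality whose table matches exactly, or None if nothing matches."""
    table = list(table)
    for quality in range(1, 101):
        if ijg_luma_table(quality) == table:
            return quality
    return None
```

A `None` result would suggest the file was written by an encoder that does not use this scaling rule, which is the case the exhaustive search cannot handle.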

Results

Quality results. Image by author.

Above are the quality results. We can see a large spike at quality 96, indicating that the vast majority of images were compressed at this quality. 96 is very high and wouldn’t noticeably affect the images. Also interesting is the small proportion of very low quality (generally less than 10) images in the training set; these images would be almost completely destroyed by compression. Note the sparsity of the validation set as well: the training set covers a wide range of qualities (albeit in small proportion), and these are generally not represented in the validation set.

Here’s an example of a quality 3 image from the training set.

“n02441942/n02441942_6428.JPEG” Image credit ImageNet [2]

Note how it’s only somewhat recognizable and the colors are largely missing.

Exploring the Space of Images

One thing that immediately stands out to me is that ImageNet seems to have been assembled from several very different sources, kind of like the Frankenstein’s monster of datasets. There was clearly a source that was assembled with great care for the compression settings, changing the defaults to quality 96 and 4:4:4 subsampling and using 500 by 500 images. Then there are some others which seem to lack that kind of intentional design but which are present in enough numbers that they appear to be related in some way. This may have been supplemented with single images from various sources, which would explain some of the outliers. This can likely be corroborated by someone who knows the history of the dataset.

We can actually visualize this graphically. To do this, I stored the compression settings as 4D vectors (chroma subsampling type, width, height, quality) and projected them into 2D using UMAP [8]. I computed this on the training set and I used only 10% of the images with a width or height of 500 since those tend to dominate the signal otherwise. Here’s what that looks like, after I colored in some very clear clusters:

Image space with prominent clusters highlighted. Image by author.
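For readers who want to try something similar, here is a minimal sketch of the data preparation only: packing each image's settings into a 4D vector and keeping 10% of the images with a width or height of 500. The integer codes for the subsampling schemes, the fixed seed, and the function name are all assumptions; the resulting list would then be handed to a projection library such as umap-learn.

```python
import random

# Hypothetical integer encoding of the subsampling scheme.
SUBSAMPLING_CODES = {"4:4:4": 0, "4:2:2": 1, "4:2:0": 2, "4:1:1": 3}


def build_vectors(records, keep_fraction=0.1, seed=0):
    """Turn (subsampling, width, height, quality) records into 4D vectors.

    Images with a width or height of 500 are thinned to `keep_fraction`
    so that they do not dominate the projection.
    """
    rng = random.Random(seed)
    dominant, rest = [], []
    for subsampling, width, height, quality in records:
        vector = (SUBSAMPLING_CODES[subsampling], width, height, quality)
        (dominant if 500 in (width, height) else rest).append(vector)
    kept = rng.sample(dominant, k=int(len(dominant) * keep_fraction))
    return rest + kept


records = [("4:4:4", 500, 375, 96)] * 20 + [("4:2:0", 640, 480, 75)] * 3  # made-up sample
vectors = build_vectors(records)
```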

Examining these clusters gives us an idea of why they are grouped together. The orange cluster contains only size 500 by 375 images compressed at quality 96 and with 4:4:4 chroma subsampling. The green cluster contains 375 by 500 images (transposed of the orange cluster), with otherwise the same settings. The red cluster is again the same, but with 333 by 500 images.

Next let’s color the points by chroma subsampling scheme:

Image space colored by chroma subsampling scheme. Image by author.

We get a nice, clear separation on this one. Yellow points are 4:2:0 and purple points are 4:4:4; the rest are in between. A cluster of 4:2:2 (blue) shows up on the left hand side. Looking back at the original plot above, this cluster sticks out a little more now that we’ve identified it.

Coloring the points by quality gives another interesting result:

Image space colored by quality. Image by author.

We can see lower qualities heavily represented on the lower right. This is the same region where 4:2:0 chroma subsampling featured prominently.

If I had to guess based on these plots, I would say that the smaller clusters on the left hand side represent some initial sources of data. They have similar parameters, differ only in their dimensions, and are small in number. Around them on the left hand side are images which were gathered from other sources but with similar parameters. The right hand side represents a large departure in the method of data collection, with very different parameters represented. Take this with a grain of salt, because projection techniques like UMAP are not guaranteed to perfectly model the space, and this is just my speculation.

Conclusion

Although ImageNet remains the most popular computer vision dataset, it is generally becoming known that it has some major issues with its labels [3, 4], its widespread use [5], and its potential for societal impact [6, 7]. I’d like to echo those concerns while raising one of my own: data quality. In my most recent paper [1] I showed that compression settings can have a large and sometimes unexpected effect on deep networks. While most of the compression is light, there are enough outliers to cause concern and there are disparities between the training and validation sets. Also the image sizes are quite varied and contain extreme aspect ratios that could cause problems when resizing images for input to the network. Based on this analysis I definitely recommend thinking about these issues and whether they will affect your performance the next time you consider using ImageNet. It’s not that ImageNet is objectively a bad dataset, it has served the community quite well over the years, and it may even help to have this kind of variation in some cases. But as deep learning advances into a more precise science it’s good to take a proactive approach to these issues and figure out early whether they’re important for your particular application.

Acknowledgement

This post was inspired by my research [1] that was graciously supported by independent grants from DARPA MediFor, DARPA SemaFor, and Facebook AI. I want to also thank my co-authors, Professors Abhinav Shrivastava and Larry Davis of UMD and Dr. Ser-Nam Lim of Facebook AI for their contributions.

References

[1] Ehrlich, Max, et al. “Analyzing and Mitigating Compression Defects in Deep Learning.” arXiv preprint arXiv:2011.08932 (2020).
[2] Deng, Jia, et al. “ImageNet: A large-scale hierarchical image database.” 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
[3] Beyer, Lucas, et al. “Are we done with ImageNet?” arXiv preprint arXiv:2006.07159 (2020).
[4] Yun, Sangdoo, et al. “Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels.” arXiv preprint arXiv:2101.05022 (2021).
[5] Tuggener, Lukas, Jürgen Schmidhuber, and Thilo Stadelmann. “Is it enough to optimize CNN architectures on ImageNet?” arXiv preprint arXiv:2103.09108 (2021).
[6] Birhane, Abeba, and Vinay Uday Prabhu. “Large Image Datasets: A Pyrrhic Win for Computer Vision?” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021.
[7] Yang, Kaiyu, et al. “A Study of Face Obfuscation in ImageNet.” arXiv preprint arXiv:2103.06191 (2021).
[8] McInnes, Leland, John Healy, and James Melville. “UMAP: Uniform manifold approximation and projection for dimension reduction.” arXiv preprint arXiv:1802.03426 (2018).

Compression in the ImageNet Dataset was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Revisiting DCT Domain Deep Learning

Max Ehrlich — Wed, 27 Jan 2021 00:41:24 GMT

Opinion

Deep learning on JPEG and DCT domain data represents a promising new direction for research.

While working on my dissertation proposal, I had the opportunity to revisit my ICCV 2019 paper “Deep Residual Learning in the JPEG Transform Domain”. It was an interesting experience to look back on it after about a year and see how the field has evolved since then. In this article, I will give some details about the method we presented in the paper, then talk a little about the latest advances in DCT domain deep learning.

JPEG vs spatial throughput, more on this later. Image by author.

Overview

I expect many readers will be familiar with deep learning, the ubiquitous technique for modern machine learning. Fewer people, however, will be familiar with DCT domain techniques and how they apply here. These ideas were extremely popular in the late 1980s and early 1990s when decompressing a JPEG was a time-consuming process for a computer. The idea is simple: redefine the operations you want to perform so that they can be done on the compressed data rather than on the pixels themselves. This is possible because the JPEG transform is (mostly) linear. We usually refer to these techniques as JPEG, DCT, or compressed domain operations.

While these ideas fell out of style as compute power increased, in my opinion they are still highly applicable to computer vision, where saving a few milliseconds per image can add up to massive time savings over the long run (for example, when training a convolutional neural network). This is exactly what I wanted to show in the ICCV paper [1]. 2019 was an extremely active year for research in this field and it’s been exciting to look back at the progress, with four orthogonal techniques eventually emerging as the foundation.

Up front, here are some additional resources:

The paper itself is freely available on arXiv.
I took extensive notes while developing the technique and you can read them here; these notes are also available as runnable Jupyter notebooks.
The code used in the paper is public; you can get it from my GitLab.

JPEG and the DCT

There’s a lot written on how JPEG works and on the discrete cosine transform (DCT) so I’m going to only cover them briefly. JPEG encoding consists of roughly the following steps, starting from an RGB image:

1. Convert the image to the YCbCr color space, which separates the brightness (grayscale or Y) data from the color data (Cb and Cr).
2. Pad and center the image. Most encoders will do chroma subsampling (4:2:0 is the most common), which means that the color channels are halved in each dimension. Since the DCT will be taken in 8 by 8 blocks, this means that the image needs to be padded to a multiple of 16 so that after subsampling the color channels the size is divisible by 8. The pixels are then centered by subtracting 128 from each pixel.
3. Compute the 2D type II discrete cosine transform (often just called the DCT) of each of the blocks. This is the real meat of the JPEG algorithm, and most of the work in compressed domain operations deals with the DCT rather than the other steps of the algorithm, which are often trivial to derive operations for.
4. The results of the above transform are called DCT coefficients; they are then quantized using a precomputed quantization table. This is just an element-wise divide of each of the 8 by 8 blocks, with rounding applied to the result. This is the main lossy step in JPEG: all the complex JPEG artifacts are caused by this relatively simple operation on the DCT coefficients.
5. The coefficients are vectorized in a zig-zag order which places low frequencies in the beginning and high frequencies at the end. This is because the quantization tends to zero out high frequencies, which our brain doesn’t perceive clearly anyway. By concentrating the zeroes at the end of the vector, they can be efficiently run-length encoded.
6. The RLE vectors are then entropy coded.
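The zig-zag vectorization is the least obvious of these steps, so here is a small sketch of it. It generates the scan order by walking the anti-diagonals of the block in alternating directions, which reproduces the usual JPEG ordering; the helper names are illustrative, not taken from any library.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zig-zag scan order."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Sort by anti-diagonal; odd diagonals run top-right to bottom-left,
    # even diagonals run bottom-left to top-right.
    return sorted(
        cells,
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]),
    )


def zigzag(block):
    """Flatten a square block into a vector, low frequencies first."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def trailing_zeros(vector):
    """Length of the run of zeros at the end of the vector."""
    count = 0
    for value in reversed(vector):
        if value != 0:
            break
        count += 1
    return count


# A quantized block where only the four lowest frequencies survived (made-up values).
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 26, -3, 2, 1
vector = zigzag(block)
```

With only the top-left 2 by 2 corner nonzero, everything after the fifth entry of the vector is zero, which is exactly the kind of run that run-length encoding collapses.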

JPEG decoding is essentially the reverse process. Note that we say that step 4 is lossy because it can’t really be undone. We know what the quantization matrix used to divide the coefficients was, but because the coefficients were rounded, when we multiply we only get an approximation of the original coefficients.

For reference the DCT is given by the following equation:

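$$D_{\alpha\beta} = \frac{1}{4} V(\alpha) V(\beta) \sum_{m=0}^{7} \sum_{n=0}^{7} P_{mn} \cos\left(\frac{(2m+1)\alpha\pi}{16}\right) \cos\left(\frac{(2n+1)\beta\pi}{16}\right)$$

for blocks of pixels P, block offsets m, n and spatial frequencies α, β. The function V() computes a scale factor which makes the transform orthonormal: V(0) = 1/√2 and V(u) = 1 otherwise. There are two important results to note here. The first is that each coefficient is a function of all the pixels in the block; the second is that the first frequency (α = β = 0) reduces to:

$$D_{00} = \frac{1}{8} \sum_{m=0}^{7} \sum_{n=0}^{7} P_{mn}$$

In other words, it is an unweighted sum of the pixels and is proportional to the mean of the block.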
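Both observations are easy to check numerically. The sketch below assumes the usual orthonormal 8 by 8 type-II DCT, with a leading factor of 1/4 and V(0) = 1/√2, and is written for clarity rather than speed.

```python
import math


def V(u):
    """Scale factor that makes the transform orthonormal."""
    return 1 / math.sqrt(2) if u == 0 else 1.0


def dct2(block):
    """Orthonormal type-II DCT of an 8x8 block given as a list of 8 rows."""
    out = [[0.0] * 8 for _ in range(8)]
    for a in range(8):          # vertical spatial frequency (alpha)
        for b in range(8):      # horizontal spatial frequency (beta)
            total = 0.0
            for m in range(8):
                for n in range(8):
                    total += (
                        block[m][n]
                        * math.cos((2 * m + 1) * a * math.pi / 16)
                        * math.cos((2 * n + 1) * b * math.pi / 16)
                    )
            out[a][b] = 0.25 * V(a) * V(b) * total
    return out


flat = [[10] * 8 for _ in range(8)]   # a constant block with mean 10
coeffs = dct2(flat)
```

For a constant block, `coeffs[0][0]` comes out as 8 times the block mean and every other coefficient is zero up to floating point error.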

Linearity of JPEG

As I mentioned above, one way to model JPEG compression is as a linear map. We can do this for JPEG encoding as long as we ignore the rounding in step 4 in the previous section, and it works exactly for JPEG decoding. This is useful because any other linear map can be multiplied through the JPEG linear map to create a new map which operates on JPEG data. For a simple illustration of this, consider the following two linear functions of real numbers:

$$f(x) = ax \qquad g(x) = bx$$

If I wanted to apply f then g to the same input, I could simply compute f(x) and then g(f(x)), which would take two multiplies, or I could compute

$$(g \circ f)(x) = (ab)x$$

which only takes a single multiply once the product ab is known. This is the idea behind compressed domain techniques: if f is JPEG and g is some operation on pixels, then you can multiply them out to make an operation on JPEG data.

So what exactly does this look like? Brian Smith figured this out in 1993 [2] and I attempted to formalize it a little using ideas from multi-linear algebra. If you don’t know what multi-linear algebra is, it’s just an extension of linear algebra to arbitrarily shaped tensors (note: the proper mathematical term for a tensor’s shape is its type). Fair warning: this gets a little mathy.

First, let's go over some notation. All equations in this section are going to be in what's called “Einstein notation” (which he developed while working on general relativity). This is essentially a short-form way of writing summations and it makes tensor products much more readable. In Einstein notation, there are upper and lower indices and it’s important to remember that if you see a superscript, it is an upper index not a power. Unsurprisingly, a subscript is called a lower index. Any time an index appears as an upper index for one tensor in an expression, and a lower index for another tensor, the elements are multiplied and summed. Any time an index appears in the same (upper or lower) position in two tensors, the elements are multiplied but not summed. Any indices that were not summed out carry over to the result. Here’s an example of what a matrix-vector product looks like for a matrix Q and a vector x:

$$q_j = Q^i_j x_i$$

So we multiply over the index i and sum, since it appears in the top of Q and the bottom of x, and we leave j alone, yielding a vector q indexed by j.

Next we need to define what an image looks like under this model. We’ll say that a single plane image (e.g., a grayscale image) is a type (0, 2) tensor meaning that it has no upper indices and two lower indices (essentially it’s a matrix). We denote this as:

$$I \in H^* \otimes W^*$$

The circled times there is a tensor product (sometimes called the outer product), the H and W are vector spaces, and the * denotes a dual space, which for our purposes just means that the result of the tensor product has lower indices.

Finally, we can derive the JPEG compression and decompression tensors. For brevity, I won’t put the derivation in here (it is quite long), but you can find it in Section 3.2 of the paper. The result of the derivation is two linear maps, the first is:

$$J \in H \otimes W \otimes X^* \otimes Y^* \otimes K^*$$

J performs JPEG compression. Note that H, W here are without the *, which means they are upper indices, so they get multiplied out when we apply this to I. Also note that X and Y index a block in the image and K indexes the stored coefficient (there are 64 per block). The second map is:

$$\tilde{J} \in H^* \otimes W^* \otimes X \otimes Y \otimes K$$

which performs JPEG decompression. With these two maps, we can take a new linear map which computes a linear function of pixels, which looks like:

$$C : H^* \otimes W^* \rightarrow H^* \otimes W^*$$

and compute the corresponding JPEG domain map as:

$$\Xi^{xyk}_{x'y'k'} = \tilde{J}^{xyk}_{hw} C^{hw}_{h'w'} J^{h'w'}_{x'y'k'}$$

Reading the right hand side of this last equation we can see what the new map does. It decompresses the image, applies C, and then compresses the result. But since we multiplied all this out beforehand, it does all three of these in a single step.
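The "multiply it out beforehand" idea can be demonstrated on a much smaller problem. The sketch below is a 1D analogue and not the paper's tensors: a length-8 orthonormal DCT stands in for compression, its transpose for decompression, and a 3-tap averaging filter stands in for the pixel-domain map C. All names are illustrative.

```python
import math

N = 8


def dct_matrix(n=N):
    """Rows are orthonormal type-II DCT basis vectors, so coefficients = D @ pixels."""
    return [
        [math.sqrt((1 if k == 0 else 2) / n) * math.cos((2 * m + 1) * k * math.pi / (2 * n))
         for m in range(n)]
        for k in range(n)
    ]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


D = dct_matrix()                          # "compress": pixels -> coefficients
D_inv = [list(col) for col in zip(*D)]    # "decompress": the transpose, since D is orthonormal
# C: a linear map on pixels, here a 3-tap average with zero padding at the edges.
C = [[1 / 3 if abs(i - j) <= 1 else 0.0 for j in range(N)] for i in range(N)]

# Precompute the compressed-domain map once: decompress, apply C, recompress.
Xi = matmul(matmul(D, C), D_inv)

pixels = [52, 55, 61, 66, 70, 61, 64, 73]                 # made-up samples
coeffs = matvec(D, pixels)
via_compressed_domain = matvec(Xi, coeffs)                # one step on coefficients
via_pixels = matvec(D, matvec(C, pixels))                 # filter pixels, then compress
```

Both routes give the same coefficients up to floating point error, but the first never touches the pixels once `Xi` has been computed.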

Before continuing, I’ll show another important tensor derived in the paper that we will make use of in the next section: the DCT tensor. This can be used to perform both the forward and inverse DCT:

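$$D^{mn}_{\alpha\beta} = \frac{1}{4} V(\alpha) V(\beta) \cos\left(\frac{(2m+1)\alpha\pi}{16}\right) \cos\left(\frac{(2n+1)\beta\pi}{16}\right)$$

As in the DCT equation in the previous section, m, n index a block of pixels and α, β index the spatial frequency. V is a constant used to make the transform orthonormal. Note the missing summation: this is taken care of by the tensor product.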

JPEG Domain Deep Learning

It turns out that this gets us most of the way to JPEG domain deep learning. For the ICCV paper we had a simple goal: use the analysis in the previous section to make a ResNet [3] which operates on JPEG domain data but gets as close as possible to the same result as if it was operating on pixels. ResNet has several operations and we need to define each of them.

Convolutions: This one is pretty straightforward and follows directly from the previous section. Convolutions are linear maps on pixels, so they are essentially the C we discussed prior. Using the same technique, we can define Ξ as our compressed domain convolution. There are still some tricks we can do here; see Section 4.1 of the paper if you’re interested.

Batch Normalization: This one is also straightforward. Batch norm defines learnable affine parameters γ and β and measures the running mean and variance of the batches. These values are then applied using the following formula:

$$\mathrm{BN}(x) = \gamma \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta$$

As we discussed previously, the 0,0 coefficient is proportional to the mean of the block, which makes it easy to extract and significantly faster than computing the pixel mean (an unconditional read vs 64 sums). Computing the variance is also easy: because the DCT is orthonormal, if the pixels have a zero mean, then the mean of the squared DCT coefficients is equal to the variance of the pixels. Similarly, applying β is as simple as adding it to the 0,0 coefficient and applying γ is as simple as multiplying each coefficient by it. See Section 4.3 of the paper for more details on this procedure.
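The mean and variance shortcuts can be checked on a single block. This sketch assumes an orthonormal 8 by 8 DCT, for which the 0,0 coefficient equals 8 times the block mean and the sum of squared coefficients equals the sum of squared pixels; it is an illustration of the statistics only, not the paper's batch norm layer.

```python
import math


def dct_matrix(n=8):
    """Orthonormal type-II DCT basis, one basis vector per row."""
    return [
        [math.sqrt((1 if k == 0 else 2) / n) * math.cos((2 * m + 1) * k * math.pi / (2 * n))
         for m in range(n)]
        for k in range(n)
    ]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def dct2(block):
    """Separable 2D DCT: D @ block @ D^T."""
    D = dct_matrix(len(block))
    return matmul(matmul(D, block), [list(col) for col in zip(*D)])


def block_mean(coeffs):
    """Mean of the pixels, read straight off the 0,0 coefficient."""
    return coeffs[0][0] / 8


def block_variance(coeffs):
    """Pixel variance from the coefficients; dropping the 0,0 term removes the mean."""
    energy = sum(v * v for row in coeffs for v in row)
    return (energy - coeffs[0][0] ** 2) / 64


block = [[(3 * r + 5 * c) % 17 for c in range(8)] for r in range(8)]  # made-up pixels
coeffs = dct2(block)

flat = [v for row in block for v in row]
pixel_mean = sum(flat) / 64
pixel_variance = sum((v - pixel_mean) ** 2 for v in flat) / 64
```

When the pixels already have zero mean, the 0,0 coefficient is zero and `block_variance` reduces to the mean of the squared coefficients.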

Global Average Pooling: This one carries over a lot from batch normalization. The idea is to take only the average of each channel for each feature map. This is as simple as reading off the 0,0 coefficient for each block, which already contains the mean. See the figure below and Section 4.5 of the paper. Note that from this point, ResNet uses fully connected layers, and the GAP output is exactly what these fully connected layers expect as input, so there are no further compressed domain operations that need to be derived.

DCT domain global average pooling. Image by author.

Model Conversion: One last thing to mention in this section is that nothing in our formulation depends on starting from random weights; it is just as valid to start from a pre-trained model and derive a JPEG domain model from it. We call this model conversion and it basically means that you can get a model that operates on JPEG domain data without having to retrain anything.

While this gets us most of the way to a JPEG domain ResNet, there is still one big piece missing: ReLU [4].

The ReLU Problem

CNNs are inherently non-linear which is what allows them to learn such complex mappings. In modern architectures, ReLU is almost exclusively used to introduce non-linearity. The previous analysis only works for linear functions, so we need to rely on an approximation technique.

Recall that ReLU is defined as

$$\mathrm{ReLU}(x) = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases}$$

setting all negative values to zero. This can instead be accomplished with a binary mask defined as

$$G(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}$$

which is then multiplied with the original feature map. The advantage of this is that it’s a lot easier to approximate the mask than it is to approximate the ReLU itself (see figure below).

Example blocks from our approximation technique. Green indicates negative values, red indicates positive values, blue indicates zero values. Image by author.

We compute the approximate mask using a subset of the DCT coefficients from each JPEG block. By only using a subset of the coefficients, we still have higher throughput than doing a full decompression.

The most interesting part of this by far is the method we came up with for applying the mask, which is a pixel mask, to DCT coefficients, and this is a direct result of the JPEG linearity we discussed in the previous section. For this part, we can use the DCT tensor D since we don’t have to worry about cross-block interactions. We will operate with a DCT block F, its pixel domain block I, and the mask G.

If we were to do this naively, we would:

Decompress the image
Pixelwise multiply the mask
Recompress the image

All of which are either linear or bilinear maps:

so we can multiply these steps out to get a single bilinear map which does all three steps:

We can then make the following definition:

and H is a bilinear map which applies the pixel mask to DCT coefficients. This operation is a little more involved than the previous ones, and the full details are in Section 4.2 of the paper.
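To make the bilinear map concrete, here is a 1D analogue on a single length-8 block. It is a sketch under simplifying assumptions: the transform is a 1D orthonormal DCT rather than the 8 by 8 block DCT, and the mask is computed exactly from the pixels, whereas the paper approximates it from a subset of the coefficients. The array `H` below plays the role of the map H, with one index left open for the mask.

```python
import math

N = 8
D = [
    [math.sqrt((1 if k == 0 else 2) / N) * math.cos((2 * m + 1) * k * math.pi / (2 * N))
     for m in range(N)]
    for k in range(N)
]  # orthonormal DCT: D[k][m] maps pixel m to frequency k

# H[a][m][a2] = D[a][m] * D[a2][m]: decompress (a -> m), leave m open for the mask,
# then recompress (m -> a2). This can be precomputed once.
H = [[[D[a][m] * D[a2][m] for a2 in range(N)] for m in range(N)] for a in range(N)]


def forward_dct(pixels):
    return [sum(D[k][m] * pixels[m] for m in range(N)) for k in range(N)]


def apply_mask_in_dct_domain(F, G):
    """F'[a2] = sum over a and m of F[a] * H[a][m][a2] * G[m]."""
    return [
        sum(F[a] * H[a][m][a2] * G[m] for a in range(N) for m in range(N))
        for a2 in range(N)
    ]


pixels = [3, -1, 4, -1, 5, -9, 2, -6]            # made-up feature values
F = forward_dct(pixels)
G = [1 if p > 0 else 0 for p in pixels]           # exact mask, for illustration
masked_coeffs = apply_mask_in_dct_domain(F, G)
reference = forward_dct([max(p, 0) for p in pixels])  # ReLU on pixels, then DCT
```

The coefficients produced through `H` match the DCT of the ReLU'd pixels up to floating point error, without an explicit decompress and recompress at run time.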

Results

So how does all of this work? Fairly well actually. We tested this technique on MNIST [5] and CIFAR [6]. Here’s the toy network we used:

Toy network structure. Image by author.

which is basically a tiny ResNet [3].

The first interesting thing to talk about is the accuracy of the approximation. We can start by noting that if you do no approximation at all (e.g., by using all spatial frequencies to compute the mask, something which is quite slow), we match the pixel domain network to within floating point error. When we start approximating the ReLU, we get the following accuracy, comparing the naive method (APX) with our method (ASM):

Left: RMSE for individual 8 by 8 blocks. Middle: network accuracy after model conversion. Right: network accuracy with retraining. Image by author.

Even with fewer frequencies used for the reconstruction, our method does quite well especially when retraining.

The next thing to look at is throughput. Our goal was to make a network that runs faster because it doesn’t need to decompress anything, did we succeed? Sort of:

Throughput results. Image by author.

We can see that JPEG training is a tiny bit faster, while JPEG inference is almost 4 times faster, a great improvement. So why does training benefit so much less? Inference allows us to precompute some of the maps we need offline, and the backward pass of JPEG training is significantly more complex because it needs gradients of the JPEG compression and decompression tensors.

Further Advances

While my paper was neat, there were some other great advances in 2019 for this kind of work. First, we should talk about the major drawback of my method: the weights are significantly larger than in a traditional CNN, so much so that it is essentially impossible to apply this to a “real” architecture or to a dataset like ImageNet. Weights for the method presented here have size in O(HW), versus the nice O(1) weights in a traditional CNN. I did come up with an extension of this work that has O(1) weights, but it hasn’t been published and there is still significant memory overhead. One of the goals of this method was to get as close as possible to the pixel domain operations, which is why we only used one approximation. If you relax this constraint, there are some really interesting things you can do. Three other foundational techniques have now been developed in this area.

Do Nothing: Probably the best known technique is by Gueguen et al. [7], which basically says “hand the ResNet DCT coefficients with no modification to the network structure”. It turns out that this works with only a mild accuracy penalty, and you can remove initial layers from the ResNet to get high throughput. However, it doesn’t seem to work well outside of whole-image classification (e.g., for detection or segmentation).

Block Representation: The next technique exploits the DCT block structure by defining a “block representation” [8] for each of the DCT blocks. The authors do this with an 8 by 8 stride-8 convolution, which gives an image that is one eighth the size in each dimension. They used this idea for object detection.
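The shape arithmetic can be sketched in NumPy as follows. This is only an illustration of what an 8 by 8 stride-8 convolution does to a single-channel coefficient image. The function name, the number of output channels, and the random weights are assumptions for the example, not details taken from [8].

```python
import numpy as np

def block_conv(coeffs, weight):
    """8x8, stride-8 convolution (cross-correlation) over a coefficient image.

    coeffs: (H, W) array, H and W multiples of 8, one DCT block per 8x8 tile
    weight: (out_channels, 8, 8) array of filter weights
    returns: (out_channels, H // 8, W // 8)
    """
    h, w = coeffs.shape
    blocks = coeffs.reshape(h // 8, 8, w // 8, 8)  # (block row, i, block col, j)
    # Each filter sees exactly one block at a time because stride == kernel size.
    return np.einsum("yixj,cij->cyx", blocks, weight)

rng = np.random.default_rng(0)
coeffs = rng.standard_normal((32, 48))     # stand-in for real DCT coefficients
weight = rng.standard_normal((16, 8, 8))   # 16 output channels, chosen arbitrarily
features = block_conv(coeffs, weight)      # shape (16, 4, 6)
```

Because the stride equals the kernel size, no filter ever straddles two blocks, which is what lets each output position act as a learned summary of one block.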

Frequency Coefficient Rearrangement: The final technique exploits the frequency structure of the coefficients [9]. The authors rearrange the frequencies to be channel-wise, so you again end up with an image that is one eighth the size in each dimension, this time with 64 channels. They used this idea for semantic segmentation.
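A minimal NumPy sketch of this rearrangement is below. It assumes the coefficients are stored as a single-channel image in which each 8 by 8 tile holds one block, with frequencies in row-major order (not zigzag order). The function name and that layout are assumptions of the sketch.

```python
import numpy as np

def rearrange_frequencies(coeffs):
    """Move the 64 frequencies of every 8x8 block into the channel dimension.

    coeffs: (H, W) array, H and W multiples of 8
    returns: (64, H // 8, W // 8), where channel i * 8 + j holds frequency (i, j)
    """
    h, w = coeffs.shape
    blocks = coeffs.reshape(h // 8, 8, w // 8, 8)  # (block row, i, block col, j)
    return blocks.transpose(1, 3, 0, 2).reshape(64, h // 8, w // 8)

coeffs = np.arange(32 * 48, dtype=float).reshape(32, 48)  # stand-in coefficients
channels = rearrange_frequencies(coeffs)                   # shape (64, 4, 6)
```

Under this layout, channel 0 collects the DC coefficient of every block, so it is effectively a one-eighth scale thumbnail of the image, and higher channels hold progressively finer detail.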

Overall, it’s been exciting to watch this field develop. I think this is really important and really interesting, so I’m always encouraging people to get involved. JPEG files are everywhere, so there is no shortage of applications for this kind of math. As another example, at ECCV 2020 I used both of the last two techniques to achieve state-of-the-art results in JPEG artifact correction [10].

References

M. Ehrlich and L. Davis, Deep Residual Learning in the JPEG Transform Domain (2019), Proceedings of the International Conference on Computer Vision.
B. Smith, Fast software processing of motion JPEG video (1994), Proceedings of the Second ACM International Conference on Multimedia.
K. He et al., Deep residual learning for image recognition (2016), Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
V. Nair and G. Hinton, Rectified linear units improve restricted Boltzmann machines (2010), International Conference on Machine Learning.
Y. LeCun, The MNIST database of handwritten digits.
A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images (2009).
L. Gueguen et al., Faster neural networks straight from JPEG (2018), Advances in Neural Information Processing Systems.
B. Deguerre et al., Fast object detection in compressed JPEG images (2019), IEEE Intelligent Transportation Systems Conference.
S.-Y. Lo and H.-M. Hang, Exploring semantic segmentation on the DCT representation (2019), Proceedings of the ACM Multimedia Asia.
M. Ehrlich et al., Quantization Guided JPEG Artifact Correction (2020), Proceedings of the European Conference on Computer Vision.

Revisiting DCT Domain Deep Learning was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.