<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Jimmy Ye</title>
    <description>Blog about programming, math, software engineering, science, and whatever else</description>
    <link>https://rationalis.github.io/</link>
    <atom:link href="https://rationalis.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sun, 09 Jul 2023 22:37:33 -0700</pubDate>
    <lastBuildDate>Sun, 09 Jul 2023 22:37:33 -0700</lastBuildDate>
    <generator>Jekyll v3.9.3</generator>
    
      <item>
        <title>Metric Compression Part 1</title>
        <description>&lt;h1 id=&quot;prologue-what-is-space&quot;&gt;Prologue: What is space?&lt;/h1&gt;

&lt;p&gt;In the field of machine learning, there’s a lot of focus on dealing with
high-dimensional vector spaces. Usually, though not always, this is the standard
Euclidean space \(\mathbb{R}^d\), and if we want to measure the distance between
two vectors in \(\mathbb{R}^d\), we typically choose the familiar Euclidean
distance: \(d(x, y) = \sqrt{(x_1-y_1)^2 + ... + (x_d-y_d)^2}\). The combination
of an ambient “space” and distance “metric” is aptly referred to as a &lt;em&gt;metric
space&lt;/em&gt;.&lt;sup id=&quot;fnref:space&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:space&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;I would be missing out on audience engagement if I didn’t somehow mention Deep
Learning™, so here’s the tie-in: an &lt;em&gt;artificial neural network&lt;/em&gt; is just a
bunch of “dumb” &lt;em&gt;differentiable&lt;/em&gt; transformations composed together. Some are
linear, some are non-linear, and the “deep” part just means we’re composing a
lot of ‘em in a big chain like \(g(x) = f_1(f_2(f_3(...f_{1000}(x)...)))\). The
way we train them is with &lt;em&gt;backpropagation&lt;/em&gt;, which is just a technical term for
a way of efficiently calculating the derivative and optimizing a loss function
\(L(x, g(x))\).&lt;/p&gt;

&lt;p&gt;The requirement here is that our functions are all differentiable - which means
we need to be able to do calculus somehow. The bog-standard metric space to do
calculus in is, you guessed it, Euclidean space.&lt;/p&gt;

&lt;p&gt;Now let’s say we chop a trained neural network in half, perhaps one that
classifies images, to see what the intermediate function outputs are. We get
some points which don’t resemble the inputs or the final outputs. The idea of
&lt;em&gt;representation learning&lt;/em&gt; is that these intermediate outputs do mean something -
the neural network has learned to “see” things like different lines, shapes, and
color patches in the image. Our intermediate vector might have coordinates
which are \(1\) if it “sees” a vertical line in the image, and \(0\) if not. But
when we try to interpret these intermediate network outputs in general, there is
a simple question which is very difficult to answer concretely. What does the
Euclidean distance between two points in this intermediate space
&lt;em&gt;mean&lt;/em&gt;?&lt;sup id=&quot;fnref:embedding&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:embedding&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Even though we rely on a distance metric for the very idea of backpropagation
to make sense, the question remains unanswered. Doing the math to outline the algorithm
is one thing. Doing the math (and experiments!) to figure out what are the
&lt;em&gt;structures&lt;/em&gt; that result in the “representation space” when we work with
real-world data, and explaining the &lt;em&gt;meaning&lt;/em&gt; of any such structures, appears to
be orders of magnitude more difficult. I’ll leave it at this: the question “what
is space?” might seem like it’s only for mathematicians or physicists, or maybe
even philosophers - but it can actually be a hard question even in applications
of machine learning.&lt;/p&gt;

&lt;p&gt;That said, we’ll sweep that under the rug and pretend that as long as we’re
working with Euclidean space, everything is cozy and well-understood.&lt;/p&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;One commonly encountered fundamental problem is &lt;em&gt;nearest neighbor search&lt;/em&gt; (NNS).
In an “Intro to Machine Learning” course the idea of nearest neighbors might be
discussed as part of a simple classifier called &lt;em&gt;k-nearest neighbors&lt;/em&gt; (k-NN),
which is conceptually straightforward: if you have a bunch of training data \((x_i, y_i)\),
and you see a new input \(x'\), then you can classify \(x'\) by looking at the
\(k\) nearest \(x_i\) points in your training data and seeing what their
corresponding labels \(y_i\) are. Related applications are in recommendations
and information retrieval, where the goal is just to find some of the closest
\(x_i\) without any particular labels required.&lt;/p&gt;

&lt;p&gt;However, to actually implement these, well, you have to &lt;em&gt;find&lt;/em&gt; those \(k\)
nearest neighbors. In other words, you need a solution to the NNS problem. The
naive solution is to do an exhaustive search every time - compute the distances
between \(x'\) and &lt;em&gt;all&lt;/em&gt; \(x_i\). Problem is, if you have \(d\)-dimensional data
and \(n\) training points, that’s \(O(nd)\) scaling. If you’re looking at more
than a thousand dimensions and a billion points, things quickly get out of hand.&lt;/p&gt;
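To make the baseline concrete, here’s a minimal sketch of exhaustive nearest neighbor search in Python with numpy (names like `knn_brute_force` are my own, not from any particular library):

```python
import numpy as np

def knn_brute_force(train, query, k):
    """Exhaustive k-NN: computes all n distances, O(n*d) work per query."""
    dists = np.sum((train - query) ** 2, axis=1)  # squared Euclidean distances
    return np.argsort(dists)[:k]                  # indices of the k nearest points

rng = np.random.default_rng(0)
train = rng.standard_normal((1000, 64))  # n = 1000 training points in d = 64
query = rng.standard_normal(64)
print(knn_brute_force(train, query, k=5))
```

Every query touches every training point, which is exactly the \(O(nd)\) scaling that becomes untenable at a billion points.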

&lt;p&gt;In this post, we won’t be looking at direct solutions to the NNS problem.
Instead I’ll explore some background and theory for a related problem,
compressing metric spaces, concluding with a powerful data structure and
algorithm with both practical effectiveness and strong theoretical
justification.&lt;sup id=&quot;fnref:original&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:original&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h1 id=&quot;fundamentals&quot;&gt;Fundamentals&lt;/h1&gt;

&lt;p&gt;Let’s outline our goal a bit more clearly: what we want is a data structure and
algorithm, AKA an index, that lets us query for the distance between any two
data points in a given data set \(X\), and scales up to thousands of dimensions
and billions or even trillions of points.&lt;/p&gt;

&lt;p&gt;That means we have to be really efficient - both constructing and
querying the index have to be close to linear in computation and storage. In
terms of main memory (RAM) we would like it if, similar to classic
databases&lt;sup id=&quot;fnref:db&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:db&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, we could operate with much less RAM than the whole index.&lt;sup id=&quot;fnref:mem&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:mem&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;
The storage size is of particular interest, as with all DBs, because reading
from storage and the network is slow but cheap, while RAM is fast but expensive.&lt;/p&gt;

&lt;p&gt;To achieve this, we’re generally willing to accept a reasonable &lt;em&gt;approximation&lt;/em&gt;
to the true distances, as long as it’s still practically useful.&lt;/p&gt;

&lt;h2 id=&quot;jl-lemma&quot;&gt;&lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; Lemma&lt;/h2&gt;

&lt;p&gt;A famous classical result in this space is the &lt;em&gt;Johnson-Lindenstrauss (&lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt;)
lemma&lt;/em&gt;, first published in 1984. Since that first publication, there have been
various different proofs of the lemma, as well as variants of it applicable to
different contexts. We’ll just look at applying the original result in Euclidean
space.&lt;/p&gt;

&lt;p&gt;Essentially, the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma tells us that if we have \(n\) points in
\(\mathbb{R}^d\), we can project those points down to a lower dimensional space
\(\mathbb{R}^k\), while preserving the distances between any pair of the \(n\)
points.&lt;/p&gt;

&lt;p&gt;Specifically, we can choose any “tolerance” \(\epsilon\) - normally referred to
in this context as the &lt;em&gt;distortion&lt;/em&gt; - which is basically a percentage delta
between \(0\%\) and \(50\%\). Then there exists a \(k \times d\) matrix \(A\)
that projects our \(d\)-dimensional data into \(k\)-dimensional space, while
preserving distances up to a relative error of \(\pm \epsilon\). Crucially,
\(k\) scales &lt;em&gt;logarithmically&lt;/em&gt; with \(n\), and is completely &lt;em&gt;independent&lt;/em&gt; of
the original dimensionality \(d\).&lt;sup id=&quot;fnref:tight&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tight&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Here’s the mathematical statement:&lt;/p&gt;

&lt;p&gt;Let \(\epsilon \in (0, 1/2)\), and let \(X = \{x_1,...,x_n\} \subset \mathbb{R}^d\) be
a set of high-dimensional points. For some positive integer \(k = O(\log
(n)/\epsilon^2)\), there exists a matrix \(A \in \mathbb{R}^{k \times d}\) such that for any
\(x, y \in X\), we have&lt;/p&gt;

\[(1 - \epsilon) \|x - y\|_2^2 \leq \|Ax - Ay\|_2^2 \leq (1 + \epsilon)\|x - y\|_2^2\]

&lt;h3 id=&quot;intuition&quot;&gt;Intuition&lt;/h3&gt;

&lt;p&gt;To review some linear algebra, \(A\) is a \(k \times d\) matrix which is
projecting a higher-dimensional space into a lower-dimensional one. There are
two interesting views of this matrix: the column view and the row view.&lt;/p&gt;

&lt;p&gt;The column vectors of \(A\) are what the standard basis vectors of
\(\mathbb{R}^d\) get mapped to, forming a new basis in \(\mathbb{R}^k\). You can
see how that works by multiplying \(A\) by the standard basis vectors, or in
aggregate, the \(d\)-dimensional identity \(I\). Since we want \(d\) to be much
larger than \(k\), it’s impossible for the column vectors to be linearly
independent. In other words, a projection to lower dimensions is inherently
&lt;em&gt;lossy&lt;/em&gt; - some pairs of distinct points in the original higher-dimensional space
will get mapped to the same lower-dimensional points.&lt;/p&gt;

&lt;p&gt;The row vectors of \(A\) are \(k\) different \(d\)-dimensional vectors that we
are projecting each data point \(x_i\) onto. In this case, it’s possible for
these row vectors to be linearly independent - or, stronger still,
geometrically orthogonal/perpendicular. In fact, the larger \(d\) gets, the
more likely that a “small” set of random “almost”-unit vectors will be
“nearly” orthogonal. This is good for us - although the mapping is lossy, we are
extracting the maximum amount of information from the points as we can get for a
given \(k\).&lt;/p&gt;
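That near-orthogonality claim is easy to check numerically; here’s a quick sketch with numpy (the dot product of two independent random unit directions in \(\mathbb{R}^d\) has typical size about \(1/\sqrt{d}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
# Two independent random directions in R^d, normalized to unit length.
u = rng.standard_normal(d)
v = rng.standard_normal(d)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)
# Their dot product concentrates near 0 as d grows.
print(abs(u @ v))  # on the order of 0.01 for d = 10,000
```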

&lt;h2 id=&quot;jl-lemma-applied&quot;&gt;&lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; Lemma Applied&lt;/h2&gt;

&lt;p&gt;OK, so conceptually we could maybe turn our 10,000-dimensional data, even with a
trillion data points, into 100-dimensional data. We haven’t answered some
crucial questions, though. How do we actually compute this projection matrix
\(A\)? What are the constants involved here - will we actually end up with
10-dimensional data, or 1000-dimensional data? Does this projection require each
individual coordinate to be more precise (e.g. by going from 32-bit
single-precision to 64-bit double precision), nullifying some of the
compression?&lt;/p&gt;

&lt;p&gt;I won’t review the entire body of work here, but there is a paper published 2003
from Dimitris Achlioptas from Microsoft Research &lt;sup id=&quot;fnref:randomproj&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:randomproj&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; which provides us
with a particularly nice result in applying the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma. The basic idea, which
is common across most applications of the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma, is to construct \(A\) by
random sampling. The interesting contribution of this result is that we can
sample from very simple distributions, and with high probability get a “good”
\(A\).&lt;/p&gt;

&lt;p&gt;More formally:&lt;/p&gt;

&lt;p&gt;Let \(\beta &amp;gt; 0\) and&lt;/p&gt;

\[k_0 = \frac{4 + 2\beta}{\epsilon^2/2-\epsilon^3/3} \log n\]

&lt;p&gt;Let \(k &amp;gt; k_0\) and \(A\) be constructed by randomly sampling from either one of
the following two distributions: \(a_{ij} = \pm 1\) with probability \(1/2\)
each, or \(a_{ij}=\pm 1\) with probability \(1/6\) each and \(a_{ij} = 0\) with
probability \(2/3\).&lt;/p&gt;

&lt;p&gt;Then after scaling by \(1/\sqrt{k}\) (and additionally by \(\sqrt{3}\) for
the latter distribution), this \(A\) projects \(X\) with \((1 \pm
\epsilon)\)-distortion with probability \(&amp;gt; 1 - n^{-\beta}\).&lt;/p&gt;
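To make the construction concrete, here’s a sketch in Python with numpy (the function name and parameters are my own choices; the sampling and scaling follow the statement above):

```python
import numpy as np

def achlioptas_projection(X, k, sparse=False, seed=0):
    """Project rows of X into k dimensions via Achlioptas-style sampling.

    sparse=False: entries are +1 or -1, one fair coin flip each.
    sparse=True:  entries are +1 or -1 with probability 1/6 each and 0 with
                  probability 2/3, scaled by sqrt(3) to keep unit variance.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    if sparse:
        A = np.sqrt(3.0) * rng.choice([-1.0, 0.0, 1.0], size=(k, d), p=[1/6, 2/3, 1/6])
    else:
        A = rng.choice([-1.0, 1.0], size=(k, d))
    # Dividing by sqrt(k) preserves squared distances in expectation.
    return (X @ A.T) / np.sqrt(k)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2000))
Y = achlioptas_projection(X, k=800)
orig = np.sum((X[0] - X[1]) ** 2)
proj = np.sum((Y[0] - Y[1]) ** 2)
print(proj / orig)  # close to 1
```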

&lt;p&gt;Now that we have a pretty well-specified algorithm with some nice properties, we can
take a look at the practical consequences.&lt;/p&gt;

&lt;h3 id=&quot;computation&quot;&gt;Computation&lt;/h3&gt;

&lt;p&gt;Since the distributions are pretty simple, we need very few random bits. In the
distribution with only \(\pm 1\), we only need a single bit of randomness per
matrix entry. For the distribution that includes \(0\), we only need \(\approx
2.6\) bits per entry.&lt;/p&gt;

&lt;p&gt;The original paper suggests an interesting possibility for specializing the
required computation: for each of the \(k\) coordinates, we can randomly throw
away \(2/3\) of the \(d\) inputs. Then we randomly partition the remaining inputs
into two halves, sum each half, and take the difference of the two sums. Originally
they intended this to exploit the performance of SQL databases, but in the
modern day we might consider whether this has efficient implementations on GPUs
or even further specialized hardware.&lt;/p&gt;
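That trick can be sketched in a few lines (scaling constants omitted; `project_coordinate` is a hypothetical name): for one output coordinate, instead of a dot product we only bucket inputs and sum.

```python
import numpy as np

def project_coordinate(x, rng):
    """One output coordinate of the sparse projection, with no multiplications.

    Each input survives with probability 1/3; survivors are split uniformly
    into a 'plus' bucket and a 'minus' bucket, and the result is the
    difference of the two bucket sums (scaling constants omitted here).
    """
    signs = rng.choice([-1, 0, 1], size=x.shape, p=[1/6, 2/3, 1/6])
    plus_sum = x[signs == 1].sum()
    minus_sum = x[signs == -1].sum()
    return plus_sum - minus_sum

x = np.arange(10.0)
print(project_coordinate(x, np.random.default_rng(42)))
```

Bucketed sums like this map naturally onto grouped aggregation, which is why the original paper framed it in terms of SQL databases.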

&lt;p&gt;Just on the face of it, I find it incredibly interesting that we can get away
with avoiding the &lt;em&gt;multiplication&lt;/em&gt; part of matrix multiplication, and have this
elegant construction of \(A\) which takes as little as a single “coin flip” per
entry. You don’t even need to do anything special to maintain the length of the
vectors.&lt;/p&gt;

&lt;h3 id=&quot;choice-of-beta&quot;&gt;Choice of \(\beta\)&lt;/h3&gt;

&lt;p&gt;For any \(n\) big enough that we don’t want to do brute force search, \(\beta =
1\) is good enough – the chance of getting a bad \(A\) would be \(1/n\). In
fact, if \(n &amp;gt; 10^8\), we’re probably even fine with \(\beta = \frac{1}{2}\). Of
course, the \(2 \beta\) term only accounts for \(1/3\) of \(k\), so while going
down to \(\beta = \frac{1}{2}\) reduces \(k\) by \(\approx 17\%\), trying to get
it down much further will reduce our probability of success for tiny marginal
savings.&lt;/p&gt;
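Plugging numbers into the formula for \(k_0\) makes the trade-off concrete (a quick sketch; the logarithm is natural, per the bound above):

```python
import numpy as np

def k0(n, eps, beta):
    """Lower bound on the target dimension k from the Achlioptas result."""
    return (4 + 2 * beta) / (eps**2 / 2 - eps**3 / 3) * np.log(n)

# n = 10^8, eps = 1/2: compare beta = 1 against beta = 1/2.
print(k0(1e8, 0.5, 1.0))  # approximately 1326
print(k0(1e8, 0.5, 0.5))  # approximately 1105, about 17% smaller
```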

&lt;h3 id=&quot;values-of-k&quot;&gt;Values of \(k\)&lt;/h3&gt;

&lt;p&gt;Let’s say we have \(n = 10^8\), and we choose \(\beta = \epsilon =
\frac{1}{2}\). Then \(k_0 \approx 1105\). OK, that might be interesting if the
original data is 10,000-dimensional, but if it was under 1,000 dimensions to
begin with, we’re not achieving much. This is already with a fairly small
\(\beta\) and large \(\epsilon\), so it seems like this is a limitation of the
&lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; transform…&lt;/p&gt;

&lt;h3 id=&quot;unluckiness&quot;&gt;Unluckiness&lt;/h3&gt;

&lt;p&gt;One trade-off we have to consider is whether it’s necessary to maintain \((1 \pm
\epsilon)\)-distortion across all points simultaneously. In other words, a
single pair of points which gets mapped to a too-wrong distance is considered a
violation of the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma. If we’re willing to accept that &lt;em&gt;some&lt;/em&gt; pairs of
points will have distances that are slightly outside the \(\epsilon\) error
range, what exactly is that trade-off? That is, if we intentionally choose a
too-small \(k\), how often do we get bad pairs of points, and how off is the
mapped distance between them?&lt;/p&gt;

&lt;p&gt;If the answer is not too often and not too far off, then maybe we could do a lot
better (smaller \(k\)), if we’re willing to accept that, say, \(1\%\) of the
time, a pair of points would have an error between \(\epsilon\) and
\(2\epsilon\). More generally, can we quantify the probability distribution of
the relative error between randomly chosen points?&lt;/p&gt;

&lt;p&gt;I’m not aware of any published results in this context, but I’m fairly certain
that it’s a straightforward exercise&lt;sup id=&quot;fnref:exercise&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:exercise&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; to compute this distribution, so
it’s likely been done.&lt;/p&gt;

&lt;p&gt;An advantage of this construction compared to other methods for nearest neighbor
indices is that we get a smooth fall-off of the error, so in that sense the
output (probabilistically) gets gradually worse with smaller \(k\). However,
using the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma has a marked disadvantage in that, if many of the distances
in the metric are close to each other, \(\epsilon\) relative error could give
wrong neighbors frequently.&lt;/p&gt;

&lt;h3 id=&quot;precision&quot;&gt;Precision&lt;/h3&gt;

&lt;p&gt;Since this approach only involves a handful of multiplications by constants,
it’s pretty clear that the coordinates we end up with have the same precision as
the ones we started with. Plus with only additions and subtractions (and scaling
by relatively small constants) we don’t have to worry about numerical stability
or anything like that.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In this post, we covered the idea of &lt;em&gt;metric spaces&lt;/em&gt; and &lt;em&gt;nearest neighbor
search&lt;/em&gt;, particularly in a high-dimensional context, to motivate the need for a
compressed index that does a good job of approximately preserving distances.
Next time, we’ll dig deeper into a concrete analysis of the storage
requirements, and a more sophisticated data structure which is provably
near-optimal as a compressed index.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:space&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Many of the results referenced in this post apply generally to all sorts of metric spaces. For simplicity we’ll only cover the “standard” Euclidean \(\mathbb{R}^n\). &lt;a href=&quot;#fnref:space&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:embedding&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This also applies for various models which produce &lt;em&gt;embeddings&lt;/em&gt; or any form of &lt;em&gt;latent representation&lt;/em&gt;. &lt;a href=&quot;#fnref:embedding&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:original&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Most of the content of this post series I originally wrote for &lt;a href=&quot;https://github.com/rationalis/cse-291-final-report/blob/master/report_postsubmit.pdf&quot;&gt;one of my graduate topics reports&lt;/a&gt; &lt;a href=&quot;#fnref:original&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:db&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Nowadays databases solving this particular problem are referred to as &lt;em&gt;vector databases&lt;/em&gt;. &lt;a href=&quot;#fnref:db&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:mem&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If this system is part of a major feature which makes a ton of money, like say Amazon’s recommendation engine, you can probably justify keeping everything in RAM. The rest of us probably can’t. &lt;a href=&quot;#fnref:mem&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:tight&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;One important note here is that the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; bound is tight in the sense that, for any set size \(n\), there’s some pathological cases where you can’t use any fewer than \(k\) dimensions and still keep a low distortion. In fact, &lt;a href=&quot;https://arxiv.org/abs/1609.02094&quot;&gt;the bound even applies to &lt;em&gt;nonlinear&lt;/em&gt; embeddings&lt;/a&gt;, not just the linear ones guaranteed by the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma. To do any better in terms of compression, we have to give up the full distortion guarantee, or we have to look outside of pure dimensionality reduction. &lt;a href=&quot;#fnref:tight&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:randomproj&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Beyond just the result, &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S0022000003000254&quot;&gt;this paper&lt;/a&gt; is also quite readable, outlines the motivations and intuitions quite well, and reviews specific instances of prior work. Technically it was published in 2001 but officially in a journal only in 2003. &lt;a href=&quot;#fnref:randomproj&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:exercise&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The original proof of the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma involves starting from the &lt;em&gt;distributional &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma&lt;/em&gt;, which is a consequence of the concentration of measure (AKA &lt;a href=&quot;https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/&quot;&gt;the soap bubble&lt;/a&gt;), and taking a union bound over pairs of points. &lt;a href=&quot;#fnref:exercise&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Tue, 04 Jul 2023 00:00:00 -0700</pubDate>
        <link>https://rationalis.github.io/articles/2023-07/compression-of-metric-spaces</link>
        <guid isPermaLink="true">https://rationalis.github.io/articles/2023-07/compression-of-metric-spaces</guid>
        
        
        <category>computer science</category>
        
        <category>machine learning</category>
        
        <category>curse of dimensionality</category>
        
        <category>metric spaces</category>
        
        <category>math</category>
        
      </item>
    
      <item>
        <title>On modern computation</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Welcome to 2020. In this day and age, there are &lt;a href=&quot;https://www.anandtech.com/show/15483/&quot;&gt;64-core consumer
CPUs&lt;/a&gt;,&lt;a href=&quot;https://www.theverge.com/2019/10/30/20938653/&quot;&gt;smartphones that fold in
half&lt;/a&gt;, and &lt;a href=&quot;https://www.theverge.com/2020/3/9/21171230/&quot;&gt;retail stores where
you just walk out with what you want to
buy.&lt;/a&gt; There have been weak-AI
systems that can beat humans at Go, StarCraft, and DotA. We’re just
a bit shy of cars that drive themselves (though, sadly, not ones that fly). In
short, 20 years into the 21st century, we are surrounded by technology which was
the stuff of dreams not so long ago.&lt;/p&gt;

&lt;p&gt;And yet, it all runs on software and hardware which is, by comparison, quite
ancient. The trinity of operating systems, Windows, Linux, and macOS, originated
in the 80s and 90s. C, the predominant systems programming language, still used
as the foundation of each OS, not to mention embedded applications on pretty
much every electronic device manufactured in the last decade, was born in 1972.
C++? 1983. Python? 1991. Java, JavaScript, PHP, Ruby? 1995. Not only that, among
many experienced and skilled programmers, Emacs (1985) and Vim (1991) are the
text editors of choice.&lt;/p&gt;

&lt;p&gt;Are these the &lt;em&gt;only&lt;/em&gt; options? No, of course not, there’s certainly newer
software being used. And every piece of software evolves over time. Modern C++
certainly looks almost nothing like its early days as “C with Classes.” But on
the other hand, many of these examples were incremental improvements over their
predecessors, doing little to advance the state-of-the-art. No C++ programmer
would say that the invention of Java, for all the benefits it might have, was
like achieving enlightenment.&lt;/p&gt;

&lt;p&gt;So, what gives? Let’s take a whirlwind tour of computer history.&lt;/p&gt;

&lt;h1 id=&quot;whats-inside-a-computer&quot;&gt;What’s inside a computer?&lt;/h1&gt;

&lt;p&gt;Once upon a time, computers were simple. They were just a collection of
glorified on/off switches with simple circuits to manipulate them. A marvel of
engineering at the time, but by today’s standards, childish toys.&lt;/p&gt;

&lt;p&gt;Then, the invention of the transistor. With the advent of microprocessors came a
complete revolution in the capabilities of computers. What’s more, they
discovered they were consistently able to make smaller and smaller transistors,
enabling faster, more efficient processors. Soon they realized that a simple CPU
design that simply did one operation at a time was inefficient - the rate that
the CPU clock ticked at was fixed, and some operations are simply inherently
much faster than others. So they invented a trick called pipelining. And
out-of-order execution. Multiprocessors (multi-core).&lt;/p&gt;

&lt;p&gt;And as a result, what happened on the software side? People invented
higher-level languages like C, so that they wouldn’t have to be bogged down by
having to translate their thoughts into machine code. Compilers, software
designed to translate higher-level languages into machine code, grew ever more
complex. At one point, the system grew complex enough that the CPU architects
couldn’t change the machine language of their CPUs without breaking the
compilers. So they had a bright idea: they would design their CPUs using
whatever they wanted internally, and add some extra circuitry to translate the
common machine language into the hardware-level instructions their CPUs could
understand.&lt;/p&gt;

&lt;p&gt;This process continued in this manner for decades. It’s still ongoing now. The
general trend can be described as continually increasing abstraction. The clear
benefit is that programmers can write programs more quickly and easily, not only
increasing the number of programmers and programs, but also enabling programs of
greater scale and sophistication. The downside is that there are a huge number
of inefficiencies due to a countless number of translations between what the
programmer intended to do, and what the hardware actually does.&lt;/p&gt;

&lt;h1 id=&quot;operating-systems-and-user-interfaces&quot;&gt;Operating systems and user interfaces&lt;/h1&gt;

&lt;p&gt;Before the development of microprocessors, computers had little in the way of
“user interfaces.” They were basically complicated calculators where the state
of the machine was almost directly manipulated by the operator. Considering that
these were a tiny number of specialists, the designs had little in common with
the modern concept of UI.&lt;/p&gt;

&lt;p&gt;But when computers became more widespread, professionals from many other walks
of life began using them. Anybody who wanted to store or communicate text or
numbers. Computers were now &lt;em&gt;general-purpose&lt;/em&gt;. They had to be taught how to do
more than just one thing well and how to do more than one thing at a time. So
programmers invented operating systems, programs that had the sole purpose of
managing other programs. The beginnings of what we would call user interfaces
sprang into life: keyboards, printers, monitors. No longer were these
electromechanical devices to be operated by a handful of specialists; they were
now a proper appliance, to be used, even if not fully understood, by essentially
untrained individuals.&lt;/p&gt;

&lt;p&gt;Computers became cheaper and more powerful, so suddenly there was a potential
for a much broader audience. Entrepreneurs like Bill Gates and Steve Jobs
capitalized on the opportunity by providing hardware and software that could be
marketed towards the average Joe, ushering in the age of the Personal Computer.
A key part of their early successes was their development of proper Graphical
User Interfaces (GUIs).&lt;sup id=&quot;fnref:M&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:M&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; The popularity of computers grew, as their
applicability expanded from an arcane incantation-accepting device into a
simple-to-understand interactive canvas.&lt;/p&gt;

&lt;h1 id=&quot;text-editors-and-programming-languages&quot;&gt;Text editors and programming languages&lt;/h1&gt;

&lt;p&gt;Alongside processors and operating systems, there was the development of text
editors. Text was originally, of course, the &lt;em&gt;only&lt;/em&gt; human-accessible mode of
input and output for a computer. But even as graphics began to play a larger
role, most people focused on productivity found that text was simply
irreplaceable. While there was some experimentation, ultimately programmers also
found that the most effective way of representing their programs was in text.&lt;/p&gt;

&lt;p&gt;The venerable vi and Emacs both originated in other programs of a bygone era. vi
embraced a system for maximum efficiency of basic text operations in an
extremely constrained environment. For its simplicity and minimalism, it became
nearly universal, in one form or another, on pretty much every non-Windows
computer. On the other hand, Emacs sought to provide the ultimate flexibility,
wanting to exploit their interface with the computer for what it was: the
ability to create and execute arbitrary programs.&lt;/p&gt;

&lt;h2 id=&quot;free--open-source-software&quot;&gt;Free &amp;amp; open source software&lt;/h2&gt;

&lt;p&gt;The most popular open source operating system today is, of course, Linux, or, as
a certain someone would insist, GNU/Linux. And the only popular Emacs variant
today is GNU Emacs. That certain someone is Richard Stallman, who religiously
champions free-as-in-speech (libre) software. Regardless of one’s personal
opinions on the matter, it’s undeniable that this philosophy and the software
which came of it has profoundly influenced the modern landscape. It is this
philosophy which one might say inspired GNU Emacs to be what it is: a program
which can be modified in almost any way imaginable by the user.&lt;/p&gt;

&lt;h2 id=&quot;the-lisp-machine&quot;&gt;The Lisp machine&lt;/h2&gt;

&lt;p&gt;In those days, programming language research was quite fruitful, in the sense
that the space had room for exploration in almost every direction. Many
languages with very different ideas came to be, among them functional languages
(Haskell, OCaml), truly dynamic languages (Lisp, Smalltalk), logic languages
(Prolog), and a wide variety of others, better- and lesser-known (APL, Forth,
etc.).&lt;/p&gt;

&lt;p&gt;Lisp and Prolog, in particular, received considerable study in the pursuit
of developing AI. In the 70s and 80s, processors were slower and compilers were
not particularly advanced. General-purpose computing hadn’t really hit its
stride. Consequently, specialized hardware was developed at the MIT AI Lab to
efficiently execute Lisp, namely the Lisp machines. While they pioneered a
variety of extremely influential technologies, they were ultimately
unsuccessful, due to general-purpose PCs coming of age and the AI winter.&lt;/p&gt;

&lt;p&gt;As something of a twist of fate, Stallman encountered Lisp at MIT. As a dynamic,
high-level language with very advanced features for its time, and a lineage
already measured in decades, it was an unsurprising choice for an Emacs. It is
for this reason that Emacs uses Emacs Lisp.&lt;/p&gt;

&lt;h2 id=&quot;this-was-going-somewhere-after-all&quot;&gt;This was going somewhere after all&lt;/h2&gt;

&lt;p&gt;GNU Emacs has famously been referred to, a little tongue-in-cheek, as both “the
infinitely extensible editor” and “a great operating system, only in need of a
decent text editor.” It’s true, Emacs is a piece of software with a depth beyond
any single human being, one that comes with its own programming language and has
lasted nigh on 40 years. One can spend a decade using it and not come remotely
close to mastering it.&lt;sup id=&quot;fnref:E10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:E10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; I personally can attest to at least a fraction of
that, having used it for several years.&lt;/p&gt;

&lt;p&gt;Given the historical context, this is no coincidence. Emacs is perhaps the last
remaining descendant of the Lisp machines. Like other Lisp implementations in
both software and hardware, and the languages which took inspiration from them,
Emacs pairs a core runtime written in C with a much larger labyrinth of Emacs
Lisp, resembling the bytecode interpreter of Python. It is a system which
defines a large set of primitives
for displaying and manipulating text, interactions with the filesystem and
network, and an interface with the underlying operating system. Upon this
foundation, Emacs Lisp programs of all kinds have been written, including but
not limited to: music players, PDF viewers, web browsers, email clients,
terminal emulators, and Tetris.&lt;/p&gt;

&lt;p&gt;What is the significance of this? It’s a little hard to put into words, but I
will try nonetheless. Emacs is something of a monstrosity of a program. However,
it represents an unbroken line to a time when computers looked completely
different than their modern-day counterparts. Not only does it still function,
it is still evolving. Its language and its core are sufficiently flexible to
allow that, and as a result, the community is able to modify and extend it to
support features which might not have been dreamt of at its creation. In short -
while Linux, for example, might be a remarkably engineered program, Emacs is a
testament to organic growth that defies expectations.&lt;/p&gt;

&lt;h1 id=&quot;what-the-future-holds&quot;&gt;What the future holds&lt;/h1&gt;

&lt;p&gt;Many people have been remarkably pessimistic about the current state of software
development, remarking that it is crude, obsolete, inefficient, and just plain
dumb. &lt;sup id=&quot;fnref:D&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:D&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:OS&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:OS&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:S&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:S&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;However, in my view, in spite of all these shortcomings, we have so much to
appreciate, and so much to look forward to.&lt;/p&gt;

&lt;p&gt;In the programming language space, I believe the likes of Rust will usher in a
new era of safer systems programming, without paying any penalty in performance.
As a side effect, many ideas, like algebraic data types, which were only popular
in more niche programming languages, will continue to gain traction. Looking out
even further, I’m interested in seeing much more powerful type systems being
developed, such as that of F*, which will allow programmers to write code
that can be rigorously reasoned about. Rather than being subject to the
limitations of our foresight in writing the right tests, we can confidently
claim that our software will &lt;em&gt;never&lt;/em&gt; encounter certain errors. We can state
with certainty that code adheres to the permissions it requests and is granted.
And all this with only a minimal burden imposed on the programmer.&lt;/p&gt;

&lt;p&gt;In the operating systems space, I am quite certain that innovation will continue
at a fairly rapid pace. Today, there are operating systems for smartphones which
consistently achieve a UI that renders at 60FPS. With better languages for
systems programming and the right goals in mind, I would not be surprised to see
operating systems which handle concurrency much more robustly, encounter none of
those seemingly arbitrary failures, run on multiple devices of different form
factors effortlessly and perhaps even in unison, and attain interface latency
low and consistent enough to be indistinguishable from real physical
interaction. When this goal is achieved, we will have many forms of
ubiquitous computing to look forward to - I’m certainly excited for AR/VR
applications to become commonplace.&lt;sup id=&quot;fnref:N&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:N&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Finally, for lack of a better term, the Emacs-y space. This really has nothing
to do with text editors, it is simply that I do not have a better example in
mind. Today, many of us write programs which must be compiled before they are
executed, a process that can take between a few seconds and several hours. More
dynamic (i.e. interpreted and JIT-compiled) languages have a shorter lag time,
but at a severe cost in performance. Programs are frequently &lt;em&gt;not&lt;/em&gt;
cross-platform, compatible only with a specific OS or CPU architecture. And one
of the few languages which opposes these trends to an extent, I shudder to
admit, is JavaScript.&lt;/p&gt;

&lt;p&gt;But I believe that this is only a growing pain. We will soon see a day when the
distinction between compiled and interpreted becomes blurred. Not only will
compilation get faster, other hybrid approaches will be developed. The trade-off
between performance and dynamism is not fundamental; it is an engineering
problem. And when we solve those problems, every program will be something of an
Emacs. Long before we achieve general AI (although I do believe that’s coming
within the century), every person will be able to communicate with the programs
on any of their computing devices, instructing them as precisely as they wish.
Computers will be tools, not appliances.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:M&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;“The Mother of All Demos” demo’d (duh) these concepts in 1968, decades ahead of the first releases of Windows and the Mac OS. &lt;a href=&quot;#fnref:M&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:E10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;http://edward.oconnor.cx/2009/07/learn-emacs-in-ten-years &lt;a href=&quot;#fnref:E10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:D&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;https://tonsky.me/blog/disenchantment/ &lt;a href=&quot;#fnref:D&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:OS&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;http://blog.rfox.eu/en/Programmer_s_critique_of_missing_structure_of_oper.html &lt;a href=&quot;#fnref:OS&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:S&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;http://doc.cat-v.org/bell_labs/utah2000/utah2000.html &lt;a href=&quot;#fnref:S&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:N&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;https://jnd.org/natural_user_interfaces_are_not_natural/ &lt;a href=&quot;#fnref:N&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 29 Feb 2020 00:00:00 -0800</pubDate>
        <link>https://rationalis.github.io/articles/2020-02/on-modern-computation</link>
        <guid isPermaLink="true">https://rationalis.github.io/articles/2020-02/on-modern-computation</guid>
        
        
        <category>computer science</category>
        
        <category>history of computers</category>
        
        <category>programming language theory</category>
        
        <category>user interfaces</category>
        
        <category>operating systems</category>
        
        <category>processors</category>
        
      </item>
    
      <item>
        <title>Everything is information</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;As humans, we tend to see the world at our level. Which is to say, at the size,
mass, and energy scales common to our day-to-day lives. Tables and chairs, cars
and bikes, etc. However, a physicist would probably tell you that this
intuitive, anthropocentric conception of the universe is only a reasonable
&lt;em&gt;approximation&lt;/em&gt; for the very small physical regime that we live in. What the
universe really looks like at small scales is the weirdness of quantum physics;
at very large scales, we encounter relativistic effects due to gravity; and in
some cases, like around black holes, some unknown combination of both.&lt;/p&gt;

&lt;p&gt;This isn’t an article about conventional physics, though - there is another
interesting way to view the universe. And that is in terms of &lt;em&gt;information&lt;/em&gt;.&lt;/p&gt;

&lt;h1 id=&quot;what-is-information&quot;&gt;What is information?&lt;/h1&gt;

&lt;p&gt;I don’t want to get too philosophical, but we do have to start somewhere. When
we think of information, we probably think in terms of communicating it, via
speech or text. We also record and retrieve information, but for the purposes of
this post, we can just think of that as a form of communication between the
recorder and retriever (even if, say, they’re the same person at different
points in time).&lt;/p&gt;

&lt;p&gt;OK, but what exactly &lt;em&gt;is it&lt;/em&gt; that we’re communicating, and why? Fundamentally,
when two people communicate, there’s something that one person knows that the
other doesn’t, and the &lt;em&gt;knowledge&lt;/em&gt; is what’s communicated. Even if it’s as
trivial as “what I ate for breakfast.” So, putting aside the philosophical
questions about what “knowledge” is, we can think of information as “new
knowledge.” If the knowledge is already available, then no information is
communicated.&lt;/p&gt;

&lt;p&gt;When framed this way, it’s natural for us to describe information in terms of a
&lt;em&gt;question&lt;/em&gt; and an &lt;em&gt;answer&lt;/em&gt;. In the simplest possible case, the space of
&lt;em&gt;possible&lt;/em&gt; answers must have at least two elements, i.e. a yes/no question.
After all, if there’s only one possible answer, then we would always know the
answer without any communication.&lt;/p&gt;

&lt;p&gt;This describes the basic, implicit premise of a huge swath of computer science,
which we usually take for granted. We measure the storage of data in terms of
bits (bytes being 8 bits), most modern CPUs use either 32-bit or 64-bit memory
addresses, the rate of communication (bandwidth) of a network connection is
measured in bits per second, and so on and so forth.&lt;/p&gt;

&lt;p&gt;Still, the title is “&lt;em&gt;everything&lt;/em&gt; is information,” not just data manipulated by
computers…&lt;/p&gt;

&lt;h1 id=&quot;the-many-forms-of-information&quot;&gt;The many forms of information&lt;/h1&gt;

&lt;p&gt;The most familiar example of quantifiable information, to most people, and most
especially computer scientists, will be digital data. However, information takes
many other forms. Every answer to every question is a specific amount of
information - as mentioned before, a single bit of information for a yes/no
question. We can think of any number as a certain amount of information, namely
the number of bits it takes to specify it in binary, and of course, pretty much
anything else we can think of can be converted to numbers and back.&lt;sup id=&quot;fnref:L&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:L&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
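&lt;p&gt;As a rough sketch of this idea, here’s a tiny Python helper (the function name &lt;code&gt;bits_needed&lt;/code&gt; is just an illustrative invention, not anything standard) that counts the bits required to single out one answer from several equally likely possibilities:&lt;/p&gt;

```python
import math

def bits_needed(num_states):
    """Bits of information needed to single out one of num_states
    equally likely possibilities, rounded up to a whole bit."""
    return math.ceil(math.log2(num_states))

# A yes/no question has 2 possible answers: 1 bit.
print(bits_needed(2))      # 1
# A single letter a-z is one of 26 possibilities: 5 bits suffice.
print(bits_needed(26))     # 5
# Any number up to a billion fits in 30 bits.
print(bits_needed(10**9))  # 30
```

&lt;p&gt;Note the logarithmic growth: doubling the number of possible answers costs exactly one more bit.&lt;/p&gt;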

&lt;p&gt;This already seems fairly comprehensive. Every spoken word is some number of
bits of information, as is every piece of data touched by a computer, and
everything ever written down. These are familiar enough, but the interesting
part is that we can think of all of them in terms of bits.&lt;/p&gt;

&lt;h2 id=&quot;all-things-physical&quot;&gt;All things physical&lt;/h2&gt;

&lt;p&gt;But one might be tempted to think that, surely, there are some things which
&lt;em&gt;can’t&lt;/em&gt; be described as just a bunch of bits. For example, what something smells
like, or how it feels to laugh until your sides hurt. Certainly, we know of no
way to precisely describe these things on paper or a computer. We can describe
them approximately with words, but ultimately, language does not convey all the
complexities of a smell, or an emotion.&lt;sup id=&quot;fnref:I&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:I&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;We’ll skirt around more philosophical issues by asserting that everything ever
experienced by any human corresponds exactly one-to-one with the state of a
physical system. In other words, we assert that there is nothing special
about consciousness, and that every subjective experience corresponds to some
configuration of neurons firing in our brains. This will of course be a rather
controversial assertion, but probably not an unfamiliar one.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Under this assumption that subjective experience really just reduces down to
some physical states, how do we proceed? Well, we already know how to talk about
physical systems, as alluded to in the introduction. It’s the domain of
physicists, and the details aren’t too important for us. What matters is that we
could, in theory, quantitatively describe physical states in some way, with a
bunch of numbers that we could write down.&lt;/p&gt;

&lt;p&gt;Then, for some theoretical data format, we could describe the &lt;em&gt;entire universe&lt;/em&gt;
in terms of this concept of information. If we play God just for the sake of
argument, we could describe the position and velocity of every particle in the
universe at a subatomic level, and that would capture everything physical,
including the brains of humans.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h1 id=&quot;whats-the-point&quot;&gt;What’s the point?&lt;/h1&gt;

&lt;p&gt;“OK,” you might say, “I get it, we can describe stuff with bits and bytes, but
that doesn’t seem that interesting. Isn’t it just theoretical wiffle-waffling?”&lt;/p&gt;

&lt;p&gt;Well, actually, information theory has many applications.&lt;/p&gt;

&lt;h2 id=&quot;compression&quot;&gt;Compression&lt;/h2&gt;

&lt;p&gt;One interesting application is compression. By definition, compression is about
taking some information, and somehow making it smaller. It’s applied everywhere,
for pictures, videos, programs, etc. With the information theoretic lens we can
make an observation about compression: the only way we can compress information
is to rely on the mutual assumptions of communicating parties. We must assume
certain kinds of redundancies in the data, and ways of encoding that redundancy.
Otherwise, if we reduce the amount of information communicated, we must always
lose something. This is intricately linked with a basic mathematical principle -
we can’t make an invertible mapping between sets of different sizes. It also
demonstrates why there is no compression algorithm that will work well for all
possible data (e.g. random data) - for compression to work at all, we make
certain &lt;em&gt;biased&lt;/em&gt; assumptions, and by definition, some data will violate those
assumptions.&lt;/p&gt;
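&lt;p&gt;A toy sketch makes this concrete. Run-length encoding (shown here in a deliberately minimal Python form, not any production compressor) bakes in the biased assumption that data contains long runs of repeated characters; data that matches the assumption shrinks, and data that violates it actually grows:&lt;/p&gt;

```python
def rle_encode(s):
    """Toy run-length encoding: store each run as [character, run length].
    This embodies a biased assumption: the data contains long runs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return runs

def rle_decode(runs):
    return ''.join(ch * n for ch, n in runs)

repetitive = 'a' * 50 + 'b' * 50  # matches the assumption
alternating = 'ab' * 50           # violates it

for data in (repetitive, alternating):
    runs = rle_encode(data)
    assert rle_decode(runs) == data  # lossless, i.e. invertible
    print(len(data), 'characters in', len(runs), 'runs')
```

&lt;p&gt;Both inputs are 100 characters, but the repetitive one collapses to 2 runs while the alternating one needs 100 runs - each run costing more to store than the single character it encodes. Crucially, the mapping stays invertible either way, as the decode check verifies.&lt;/p&gt;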

&lt;h2 id=&quot;cryptography&quot;&gt;Cryptography&lt;/h2&gt;

&lt;p&gt;Cryptography, being the study of secure communication, is unsurprisingly all
about information. It’s no coincidence that cryptographic algorithms tend to be
described with some extra metadata about how many bits they use (AES-256,
SHA-512, etc.). How much information do we need to authenticate, or verify the
author of a message? How many bits can be protected with a given number of
“secret” bits? Exactly how random do we need random numbers to be to ensure
security? These questions are at the heart of cryptographic methods used in
modern digital security.&lt;/p&gt;

&lt;h2 id=&quot;physics-again&quot;&gt;Physics, again&lt;/h2&gt;

&lt;p&gt;As it turns out, the physical interpretation of information has very real
consequences. A famous example is the second law of thermodynamics:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The second law of thermodynamics states that the total entropy of an isolated
system can never decrease over time&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At first glance this doesn’t say anything about information, but that’s just
because I’ve avoided using the technical terminology of &lt;em&gt;entropy&lt;/em&gt;. Those who
remember entropy from their chemistry or physics classes might think of it as a
measure of how “disordered” a system is, in terms of its indistinguishable
microstates. As it turns out, the thermodynamic definition of entropy is
precisely equivalent to the ideas of information outlined above. In fact, the
two differ only by a constant conversion factor, measured in joules per kelvin
per bit!&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
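&lt;p&gt;For the curious, the correspondence can be stated with the standard textbook formulas: Shannon’s information entropy of a source with outcome probabilities \(p_i\) is \(H = -\sum_i p_i \log_2 p_i\) bits, while Boltzmann’s thermodynamic entropy of a system with \(W\) equally likely microstates is \(S = k_B \ln W\). For \(W\) equally likely outcomes, \(H = \log_2 W\), so the two agree up to a constant factor: \(S = (k_B \ln 2) \, H\).&lt;/p&gt;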

&lt;p&gt;Another interesting principle is the law of conservation of information. In
classical physics, we can seemingly create or destroy information at will. That
is, it is physically possible to copy any arbitrary sequence of bits perfectly,
at least in theory, and likewise we can take information and irreversibly
destroy it.&lt;/p&gt;

&lt;p&gt;However, with quantum mechanics, information takes a form quite different from
classical bits. In particular, quantum information can
neither be created nor destroyed. The late Stephen Hawking once made a famous
bet &lt;em&gt;against&lt;/em&gt; this law, based on a belief that black holes destroy information -
and he lost that bet. Why this should be the case is rather technical, and
anyways, to paraphrase Feynman, a technical explanation would not really be a
“why” so much as a “how.”&lt;/p&gt;

&lt;p&gt;The key point is that, much as physics forbids faster-than-light travel, it
appears to also forbid non-conservative operations on information at the lowest
scale. (And as an aside, we can rephrase the restriction of FTL as
“&lt;em&gt;information&lt;/em&gt; cannot travel FTL.”)&lt;/p&gt;

&lt;p&gt;Based on similar ideas about black holes, there is a proven upper bound on how
much information can be contained in finite space with finite energy, known as
the Bekenstein bound. This places fundamental theoretical limits on how fast
computers can be, and how dense their storage can be.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; In fact, it is this
result which confirms the possibility of the previously mentioned idea of
describing any physical system - if a certain volume of space can only contain a
certain finite amount of information, then it is possible to describe that
volume using at most that much information.&lt;/p&gt;

&lt;p&gt;There are whole theories of physics which are founded upon physical information.
So, I think, one should begin to suspect that information is a genuinely
powerful and pervasive concept.&lt;/p&gt;

&lt;h2 id=&quot;and-much-much-more&quot;&gt;And much, much more&lt;/h2&gt;

&lt;p&gt;That I’m not going to list out for you! Wikipedia has plenty of information
(heh) about the topic.&lt;/p&gt;

&lt;h1 id=&quot;the-conclusion-that-maybe-shouldve-been-the-introduction&quot;&gt;The conclusion that maybe should’ve been the introduction&lt;/h1&gt;

&lt;p&gt;In 1948, information theory was essentially invented by Claude Shannon, with the
paper “A Mathematical Theory of Communication”. Of course, he wasn’t the only
person to have these ideas or develop them, but he’s the one known as the father
of information theory. At the time, he was mainly concerned with cryptography
and literal, electrical/mechanical communication. In the better part of a
century since then, the theory has found many applications, across too many
fields to count.&lt;/p&gt;

&lt;p&gt;As far as we can tell, anything in the universe &lt;em&gt;is&lt;/em&gt;, in a sense, the
information which describes it.&lt;sup id=&quot;fnref:U&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:U&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; Viewing things this way is no less correct
than, for example, viewing everything as particles. So, as it turns out,
information is a fundamental principle, not just of human activities, but,
apparently, of the very fabric of the universe.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:L&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The specific usage of binary isn’t necessary, though it is fundamental in the sense of being the smallest discrete base. The important thing is that information is &lt;em&gt;logarithmic&lt;/em&gt; in the number of states. &lt;a href=&quot;#fnref:L&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:I&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;One heuristic argument for this is that it’s impossible to imagine what something you’ve never encountered smells or tastes like based purely on a verbal description. &lt;a href=&quot;#fnref:I&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;At any rate, it’s certainly much harder to start talking about what quality subjective, conscious feelings (qualia) have which is beyond merely physical. For those who wish to disagree with the assertion, well, I will simply say that in your view of the world, not everything is information - but on the other hand, the stuff that’s not information isn’t a physically measurable thing, so you could never detect it in other people. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;You can nitpick that it’s a little more complex thanks to quantum physics, and possibly other physics stuff I’m not familiar with. Or maybe physics we don’t even know about yet. It’s not relevant to the argument, so long as we can assume that physical states are completely quantifiable. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Up to some conversion factors. SI units treat bits as dimensionless. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We’re nowhere close to those limits, and honestly, we might never get anywhere close. But the fact remains that such limits exist. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:U&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Again, philosophical arguments about definitionally unobservable state would evade this principle. And again, it is my personal opinion that such philosophical considerations are of no particular interest. &lt;a href=&quot;#fnref:U&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Of course this is deliberately grandiose. The caveat is that you could really say this about a lot of things, like mathematics, or computers. In fact, this post is only the first of a series, hopefully, which covers many such ideas. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Wed, 01 Jan 2020 00:00:00 -0800</pubDate>
        <link>https://rationalis.github.io/articles/2020-01/everything-is-information</link>
        <guid isPermaLink="true">https://rationalis.github.io/articles/2020-01/everything-is-information</guid>
        
        
        <category>information theory</category>
        
        <category>musings</category>
        
        <category>everything is</category>
        
      </item>
    
  </channel>
</rss>
