<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Jimmy Ye</title>
    <description>Blog about programming, math, software engineering, science, and whatever else</description>
    <link>https://rationalis.github.io/</link>
    <atom:link href="https://rationalis.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sun, 09 Jul 2023 22:37:33 -0700</pubDate>
    <lastBuildDate>Sun, 09 Jul 2023 22:37:33 -0700</lastBuildDate>
    <generator>Jekyll v3.9.3</generator>
    
      <item>
        <title>Metric Compression Part 1</title>
        <description>&lt;h1 id=&quot;prologue-what-is-space&quot;&gt;Prologue: What is space?&lt;/h1&gt;

&lt;p&gt;In the field of machine learning, there’s a lot of focus on dealing with
high-dimensional vector spaces. Usually, though not always, this is the standard
Euclidean space \(\mathbb{R}^d\), and if we want to measure the distance between
two vectors in \(\mathbb{R}^d\), we typically choose the familiar Euclidean
distance: \(d(x, y) = \sqrt{(x_1-y_1)^2 + ... + (x_d-y_d)^2}\). The combination
of an ambient “space” and distance “metric” is aptly referred to as a &lt;em&gt;metric
space&lt;/em&gt;.&lt;sup id=&quot;fnref:space&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:space&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;I would be missing out on audience engagement if I didn’t somehow mention Deep
Learning™, so here’s the tie-in: an &lt;em&gt;artificial neural network&lt;/em&gt; is just a
bunch of “dumb” &lt;em&gt;differentiable&lt;/em&gt; transformations composed together. Some are
linear, some are non-linear, and the “deep” part just means we’re composing a
lot of ‘em in a big chain like \(g(x) = f_1(f_2(f_3(...f_{1000}(x)...)))\). The
way we train them is with &lt;em&gt;backpropagation&lt;/em&gt;, which is just a technical term for
a way of efficiently calculating the derivative and optimizing a loss function
\(L(x, g(x))\).&lt;/p&gt;

&lt;p&gt;The requirement here is that our functions are all differentiable - which means
we need to be able to do calculus somehow. The bog-standard metric space to do
calculus in is, you guessed it, Euclidean space.&lt;/p&gt;

&lt;p&gt;Now let’s say we chop a trained neural network in half, perhaps one that
classifies images, to see what the intermediate function outputs are. We get
some points which don’t resemble the inputs or the final outputs. The idea of
&lt;em&gt;representation learning&lt;/em&gt; is that these intermediate outputs do mean something -
the neural network has learned to “see” things like different lines, shapes, and
color patches in the image. Our intermediate vector might have coordinates
which are \(1\) if it “sees” a vertical line in the image, and \(0\) if not. But
when we try to interpret these intermediate network outputs in general, there is
a simple question which is very difficult to answer concretely. What does the
Euclidean distance between two points in this intermediate space
&lt;em&gt;mean&lt;/em&gt;?&lt;sup id=&quot;fnref:embedding&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:embedding&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Even though we rely on a distance metric for the very idea of backpropagation
to make sense, the question remains unanswered. Doing the math to outline the algorithm
is one thing. Doing the math (and experiments!) to figure out what are the
&lt;em&gt;structures&lt;/em&gt; that result in the “representation space” when we work with
real-world data, and explaining the &lt;em&gt;meaning&lt;/em&gt; of any such structures, appears to
be orders of magnitude more difficult. I’ll leave it at this: the question “what
is space?” might seem like it’s only for mathematicians or physicists, or maybe
even philosophers - but it can actually be a hard question even in applications
of machine learning.&lt;/p&gt;

&lt;p&gt;That said, we’ll sweep that under the rug and pretend that as long as we’re
working with Euclidean space, everything is cozy and well-understood.&lt;/p&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;One commonly encountered fundamental problem is &lt;em&gt;nearest neighbor search&lt;/em&gt; (NNS).
In an “Intro to Machine Learning” course the idea of nearest neighbors might be
discussed as part of a simple classifier called &lt;em&gt;k-nearest neighbors&lt;/em&gt; (k-NN),
which is conceptually straightforward: if you have a bunch of training data \((x_i, y_i)\),
and you see a new input \(x'\), then you can classify \(x'\) by looking at the
\(k\) nearest \(x_i\) points in your training data and seeing what their
corresponding labels \(y_i\) are. Related applications are in recommendations
and information retrieval, where the goal is just to find some of the closest
\(x_i\) without any particular labels required.&lt;/p&gt;

&lt;p&gt;However, to actually implement these, well, you have to &lt;em&gt;find&lt;/em&gt; those \(k\)
nearest neighbors. In other words, you need a solution to the NNS problem. The
naive solution is to do an exhaustive search every time - compute the distances
between \(x'\) and &lt;em&gt;all&lt;/em&gt; \(x_i\). Problem is, if you have \(d\)-dimensional data
and \(n\) training points, that’s \(O(nd)\) scaling. If you’re looking at more
than a thousand dimensions and a billion points, things quickly get out of hand.&lt;/p&gt;
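To make the baseline concrete, here’s a minimal sketch of exhaustive nearest neighbor search in Python with numpy (names like `knn_brute_force` are my own, not from any particular library):

```python
import numpy as np

def knn_brute_force(train, query, k):
    """Exhaustive k-NN: computes all n distances, O(n*d) work per query."""
    dists = np.sum((train - query) ** 2, axis=1)  # squared Euclidean distances
    return np.argsort(dists)[:k]                  # indices of the k nearest points

rng = np.random.default_rng(0)
train = rng.standard_normal((1000, 64))  # n = 1000 training points in d = 64
query = rng.standard_normal(64)
print(knn_brute_force(train, query, k=5))
```

Every query touches every training point, which is exactly the \(O(nd)\) scaling that becomes untenable at a billion points.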

&lt;p&gt;In this post, we won’t be looking at direct solutions to the NNS problem.
Instead I’ll explore some background and theory for a related problem,
compressing metric spaces, concluding with a powerful data structure and
algorithm with both practical effectiveness and strong theoretical
justification.&lt;sup id=&quot;fnref:original&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:original&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h1 id=&quot;fundamentals&quot;&gt;Fundamentals&lt;/h1&gt;

&lt;p&gt;Let’s outline our goal a bit more clearly: what we want is a data structure and
algorithm, AKA an index, that lets us query for the distance between any two
data points in a given data set \(X\), and scales up to thousands of dimensions
and billions or even trillions of points.&lt;/p&gt;

&lt;p&gt;That means we have to be really efficient - both constructing and
querying the index have to be close to linear in computation and storage. In
terms of main memory (RAM) we would like it if, similar to classic
databases&lt;sup id=&quot;fnref:db&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:db&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, we could operate with much less RAM than the whole index.&lt;sup id=&quot;fnref:mem&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:mem&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;
The storage size is of particular interest, as with all DBs, because reading
from storage and the network is slow but cheap, while RAM is fast but expensive.&lt;/p&gt;

&lt;p&gt;To achieve this, we’re generally willing to accept a reasonable &lt;em&gt;approximation&lt;/em&gt;
to the true distances, as long as it’s still practically useful.&lt;/p&gt;

&lt;h2 id=&quot;jl-lemma&quot;&gt;&lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; Lemma&lt;/h2&gt;

&lt;p&gt;A famous classical result in this space is the &lt;em&gt;Johnson-Lindenstrauss (&lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt;)
lemma&lt;/em&gt;, first published in 1984. Since that first publication, there have been
various different proofs of the lemma, as well as variants of it applicable to
different contexts. We’ll just look at applying the original result in Euclidean
space.&lt;/p&gt;

&lt;p&gt;Essentially, the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma tells us that if we have \(n\) points in
\(\mathbb{R}^d\), we can project those points down to a lower dimensional space
\(\mathbb{R}^k\), while preserving the distances between any pair of the \(n\)
points.&lt;/p&gt;

&lt;p&gt;Specifically, we can choose any “tolerance” \(\epsilon\) - normally referred to
in this context as the &lt;em&gt;distortion&lt;/em&gt; - which is basically a percentage delta
between \(0\%\) and \(50\%\). Then there exists a \(k \times d\) matrix \(A\)
that projects our \(d\)-dimensional data into \(k\)-dimensional space, while
preserving distances up to a relative error of \(\pm \epsilon\). Crucially,
\(k\) scales &lt;em&gt;logarithmically&lt;/em&gt; with \(n\), and is completely &lt;em&gt;independent&lt;/em&gt; of
the original dimensionality \(d\).&lt;sup id=&quot;fnref:tight&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tight&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Here’s the mathematical statement:&lt;/p&gt;

&lt;p&gt;Let \(\epsilon \in (0, 1/2)\), and let \(X = \{x_1,...,x_n\} \subset \mathbb{R}^d\) be
a set of high-dimensional points. For some positive integer \(k = O(\log
(n)/\epsilon^2)\), there exists a matrix \(A \in \mathbb{R}^{k \times d}\) such that for any
\(x, y \in X\), we have&lt;/p&gt;

\[(1 - \epsilon) \|x - y\|_2^2 \leq \|Ax - Ay\|_2^2 \leq (1 + \epsilon)\|x - y\|_2^2\]

&lt;h3 id=&quot;intuition&quot;&gt;Intuition&lt;/h3&gt;

&lt;p&gt;To review some linear algebra, \(A\) is a \(k \times d\) matrix which is
projecting a higher-dimensional space into a lower-dimensional one. There are
two interesting views of this matrix: the column view and the row view.&lt;/p&gt;

&lt;p&gt;The column vectors of \(A\) are what the standard basis vectors of
\(\mathbb{R}^d\) get mapped to, forming a new basis in \(\mathbb{R}^k\). You can
see how that works by multiplying \(A\) by the standard basis vectors, or in
aggregate, the \(d\)-dimensional identity \(I\). Since we want \(d\) to be much
larger than \(k\), it’s impossible for the column vectors to be linearly
independent. In other words, a projection to lower dimensions is inherently
&lt;em&gt;lossy&lt;/em&gt; - some pairs of distinct points in the original higher-dimensional space
will get mapped to the same lower-dimensional points.&lt;/p&gt;

&lt;p&gt;The row vectors of \(A\) are \(k\) different \(d\)-dimensional vectors that we
are projecting each data point \(x_i\) onto. In this case, it’s possible for
these row vectors to be linearly independent - or, stronger still,
geometrically orthogonal/perpendicular. In fact, the larger \(d\) gets, the
more likely that a “small” set of random “almost”-unit vectors will be
“nearly” orthogonal. This is good for us - although the mapping is lossy, we are
extracting the maximum amount of information from the points as we can get for a
given \(k\).&lt;/p&gt;
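That near-orthogonality claim is easy to check numerically; here’s a quick sketch with numpy (the dot product of two independent random unit directions in \(\mathbb{R}^d\) has typical size about \(1/\sqrt{d}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
# Two independent random directions in R^d, normalized to unit length.
u = rng.standard_normal(d)
v = rng.standard_normal(d)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)
# Their dot product concentrates near 0 as d grows.
print(abs(u @ v))  # on the order of 0.01 for d = 10,000
```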

&lt;h2 id=&quot;jl-lemma-applied&quot;&gt;&lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; Lemma Applied&lt;/h2&gt;

&lt;p&gt;OK, so conceptually we could maybe turn our 10,000-dimensional data, even with a
trillion data points, into 100-dimensional data. We haven’t answered some
crucial questions, though. How do we actually compute this projection matrix
\(A\)? What are the constants involved here - will we actually end up with
10-dimensional data, or 1000-dimensional data? Does this projection require each
individual coordinate to be more precise (e.g. by going from 32-bit
single-precision to 64-bit double precision), nullifying some of the
compression?&lt;/p&gt;

&lt;p&gt;I won’t review the entire body of work here, but there is a paper published 2003
from Dimitris Achlioptas from Microsoft Research &lt;sup id=&quot;fnref:randomproj&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:randomproj&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; which provides us
with a particularly nice result in applying the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma. The basic idea, which
is common across most applications of the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma, is to construct \(A\) by
random sampling. The interesting contribution of this result is that we can
sample from very simple distributions, and with high probability get a “good”
\(A\).&lt;/p&gt;

&lt;p&gt;More formally:&lt;/p&gt;

&lt;p&gt;Let \(\beta &amp;gt; 0\) and&lt;/p&gt;

\[k_0 = \frac{4 + 2\beta}{\epsilon^2/2-\epsilon^3/3} \log n\]

&lt;p&gt;Let \(k &amp;gt; k_0\) and \(A\) be constructed by randomly sampling from either one of
the following two distributions: \(a_{ij} = \pm 1\) with probability \(1/2\)
each, or \(a_{ij}=\pm 1\) with probability \(1/6\) each and \(a_{ij} = 0\) with
probability \(2/3\).&lt;/p&gt;

&lt;p&gt;Then after scaling by \(1/\sqrt{k}\) (and additionally by \(\sqrt{3}\) for
the latter distribution), this \(A\) projects \(X\) with \((1 \pm
\epsilon)\)-distortion with probability \(&amp;gt; 1 - n^{-\beta}\).&lt;/p&gt;
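To make the construction concrete, here’s a sketch in Python with numpy (the function name and parameters are my own choices; the sampling and scaling follow the statement above):

```python
import numpy as np

def achlioptas_projection(X, k, sparse=False, seed=0):
    """Project rows of X into k dimensions via Achlioptas-style sampling.

    sparse=False: entries are +1 or -1, one fair coin flip each.
    sparse=True:  entries are +1 or -1 with probability 1/6 each and 0 with
                  probability 2/3, scaled by sqrt(3) to keep unit variance.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    if sparse:
        A = np.sqrt(3.0) * rng.choice([-1.0, 0.0, 1.0], size=(k, d), p=[1/6, 2/3, 1/6])
    else:
        A = rng.choice([-1.0, 1.0], size=(k, d))
    # Dividing by sqrt(k) preserves squared distances in expectation.
    return (X @ A.T) / np.sqrt(k)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2000))
Y = achlioptas_projection(X, k=800)
orig = np.sum((X[0] - X[1]) ** 2)
proj = np.sum((Y[0] - Y[1]) ** 2)
print(proj / orig)  # close to 1
```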

&lt;p&gt;Now that we have a pretty well-specified algorithm with some nice properties, we can
take a look at the practical consequences.&lt;/p&gt;

&lt;h3 id=&quot;computation&quot;&gt;Computation&lt;/h3&gt;

&lt;p&gt;Since the distributions are pretty simple, we need very few random bits. In the
distribution with only \(\pm 1\), we only need a single bit of randomness per
matrix entry. For the distribution that includes \(0\), we only need \(\approx
2.6\) bits per entry.&lt;/p&gt;

&lt;p&gt;The original paper suggests an interesting possibility for specializing the
required computation: for each of the \(k\) coordinates, we can randomly throw
away \(2/3\) of the \(d\) inputs. Then we randomly partition the remaining inputs
into two halves, sum each half, and take the difference of the two sums. Originally
they intended this to exploit the performance of SQL databases, but in the
modern day we might consider whether this has efficient implementations on GPUs
or even further specialized hardware.&lt;/p&gt;
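That trick can be sketched in a few lines (scaling constants omitted; `project_coordinate` is a hypothetical name): for one output coordinate, instead of a dot product we only bucket inputs and sum.

```python
import numpy as np

def project_coordinate(x, rng):
    """One output coordinate of the sparse projection, with no multiplications.

    Each input survives with probability 1/3; survivors are split uniformly
    into a 'plus' bucket and a 'minus' bucket, and the result is the
    difference of the two bucket sums (scaling constants omitted here).
    """
    signs = rng.choice([-1, 0, 1], size=x.shape, p=[1/6, 2/3, 1/6])
    plus_sum = x[signs == 1].sum()
    minus_sum = x[signs == -1].sum()
    return plus_sum - minus_sum

x = np.arange(10.0)
print(project_coordinate(x, np.random.default_rng(42)))
```

Bucketed sums like this map naturally onto grouped aggregation, which is why the original paper framed it in terms of SQL databases.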

&lt;p&gt;Just on the face of it, I find it incredibly interesting that we can get away
with avoiding the &lt;em&gt;multiplication&lt;/em&gt; part of matrix multiplication, and have this
elegant construction of \(A\) which takes as little as a single “coin flip” per
entry. You don’t even need to do anything special to maintain the length of the
vectors.&lt;/p&gt;

&lt;h3 id=&quot;choice-of-beta&quot;&gt;Choice of \(\beta\)&lt;/h3&gt;

&lt;p&gt;For any \(n\) big enough that we don’t want to do brute force search, \(\beta =
1\) is good enough – the chance of getting a bad \(A\) would be \(1/n\). In
fact, if \(n &amp;gt; 10^8\), we’re probably even fine with \(\beta = \frac{1}{2}\). Of
course, the \(2 \beta\) term only accounts for \(1/3\) of \(k\), so while going
down to \(\beta = \frac{1}{2}\) reduces \(k\) by \(\approx 17\%\), trying to get
it down much further will reduce our probability of success for tiny marginal
savings.&lt;/p&gt;
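Plugging numbers into the formula for \(k_0\) makes the trade-off concrete (a quick sketch; the logarithm is natural, per the bound above):

```python
import numpy as np

def k0(n, eps, beta):
    """Lower bound on the target dimension k from the Achlioptas result."""
    return (4 + 2 * beta) / (eps**2 / 2 - eps**3 / 3) * np.log(n)

# n = 10^8, eps = 1/2: compare beta = 1 against beta = 1/2.
print(k0(1e8, 0.5, 1.0))  # approximately 1326
print(k0(1e8, 0.5, 0.5))  # approximately 1105, about 17% smaller
```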

&lt;h3 id=&quot;values-of-k&quot;&gt;Values of \(k\)&lt;/h3&gt;

&lt;p&gt;Let’s say we have \(n = 10^8\), and we choose \(\beta = \epsilon =
\frac{1}{2}\). Then \(k_0 \approx 1105\). OK, that might be interesting if the
original data is 10,000-dimensional, but if it was under 1,000 dimensions to
begin with, we’re not achieving much. This is already with a fairly small
\(\beta\) and large \(\epsilon\), so it seems like this is a limitation of the
&lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; transform…&lt;/p&gt;

&lt;h3 id=&quot;unluckiness&quot;&gt;Unluckiness&lt;/h3&gt;

&lt;p&gt;One trade-off we have to consider is whether it’s necessary to maintain \((1 \pm
\epsilon)\)-distortion across all points simultaneously. In other words, a
single pair of points which gets mapped to a too-wrong distance is considered a
violation of the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma. If we’re willing to accept that &lt;em&gt;some&lt;/em&gt; pairs of
points will have distances that are slightly outside the \(\epsilon\) error
range, what exactly is that trade-off? That is, if we intentionally choose a
too-small \(k\), how often do we get bad pairs of points, and how off is the
mapped distance between them?&lt;/p&gt;

&lt;p&gt;If the answer is not too often and not too far off, then maybe we could do a lot
better (smaller \(k\)), if we’re willing to accept that, say, \(1\%\) of the
time, a pair of points would have an error between \(\epsilon\) and
\(2\epsilon\). More generally, can we quantify the probability distribution of
the relative error between randomly chosen points?&lt;/p&gt;

&lt;p&gt;I’m not aware of any published results in this context, but I’m fairly certain
that it’s a straightforward exercise&lt;sup id=&quot;fnref:exercise&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:exercise&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; to compute this distribution, so
it’s likely been done.&lt;/p&gt;

&lt;p&gt;An advantage of this construction compared to other methods for nearest neighbor
indices is that we get a smooth fall-off of the error, so in that sense the
output (probabilistically) gets gradually worse with smaller \(k\). However,
using the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma has a marked disadvantage in that, if many of the distances
in the metric are close to each other, \(\epsilon\) relative error could give
wrong neighbors frequently.&lt;/p&gt;

&lt;h3 id=&quot;precision&quot;&gt;Precision&lt;/h3&gt;

&lt;p&gt;Since this approach only involves a handful of multiplications by constants,
it’s pretty clear that the coordinates we end up with have the same precision as
the ones we started with. Plus with only additions and subtractions (and scaling
by relatively small constants) we don’t have to worry about numerical stability
or anything like that.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In this post, we covered the idea of &lt;em&gt;metric spaces&lt;/em&gt; and &lt;em&gt;nearest neighbor
search&lt;/em&gt;, particularly in a high-dimensional context, to motivate the need for a
compressed index that does a good job of approximately preserving distances.
Next time, we’ll dig deeper into a concrete analysis of the storage
requirements, and a more sophisticated data structure which is provably
near-optimal as a compressed index.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:space&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Many of the results referenced in this post apply generally to all sorts of metric spaces. For simplicity we’ll only cover the “standard” Euclidean \(\mathbb{R}^n\). &lt;a href=&quot;#fnref:space&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:embedding&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This also applies for various models which produce &lt;em&gt;embeddings&lt;/em&gt; or any form of &lt;em&gt;latent representation&lt;/em&gt;. &lt;a href=&quot;#fnref:embedding&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:original&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Most of the content of this post series I originally wrote for &lt;a href=&quot;https://github.com/rationalis/cse-291-final-report/blob/master/report_postsubmit.pdf&quot;&gt;one of my graduate topics reports&lt;/a&gt; &lt;a href=&quot;#fnref:original&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:db&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Nowadays databases solving this particular problem are referred to as &lt;em&gt;vector databases&lt;/em&gt;. &lt;a href=&quot;#fnref:db&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:mem&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If this system is part of a major feature which makes a ton of money, like say Amazon’s recommendation engine, you can probably justify keeping everything in RAM. The rest of us probably can’t. &lt;a href=&quot;#fnref:mem&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:tight&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;One important note here is that the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; bound is tight in the sense that, for any set size \(n\), there’s some pathological cases where you can’t use any fewer than \(k\) dimensions and still keep a low distortion. In fact, &lt;a href=&quot;https://arxiv.org/abs/1609.02094&quot;&gt;the bound even applies to &lt;em&gt;nonlinear&lt;/em&gt; embeddings&lt;/a&gt;, not just the linear ones guaranteed by the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma. To do any better in terms of compression, we have to give up the full distortion guarantee, or we have to look outside of pure dimensionality reduction. &lt;a href=&quot;#fnref:tight&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:randomproj&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Beyond just the result, &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S0022000003000254&quot;&gt;this paper&lt;/a&gt; is also quite readable, outlines the motivations and intuitions quite well, and reviews specific instances of prior work. Technically it was published in 2001 but officially in a journal only in 2003. &lt;a href=&quot;#fnref:randomproj&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:exercise&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The original proof of the &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma involves starting from the &lt;em&gt;distributional &lt;abbr title=&quot;Johnson-Lindenstrauss&quot;&gt;JL&lt;/abbr&gt; lemma&lt;/em&gt;, which is a consequence of the concentration of measure (AKA &lt;a href=&quot;https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/&quot;&gt;the soap bubble&lt;/a&gt;), and taking a union bound over pairs of points. &lt;a href=&quot;#fnref:exercise&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Tue, 04 Jul 2023 00:00:00 -0700</pubDate>
        <link>https://rationalis.github.io/articles/2023-07/compression-of-metric-spaces</link>
        <guid isPermaLink="true">https://rationalis.github.io/articles/2023-07/compression-of-metric-spaces</guid>
        
        
        <category>computer science</category>
        
        <category>machine learning</category>
        
        <category>curse of dimensionality</category>
        
        <category>metric spaces</category>
        
        <category>math</category>
        
      </item>
    
      <item>
        <title>On modern computation</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Welcome to 2020. In this day and age, there are &lt;a href=&quot;https://www.anandtech.com/show/15483/&quot;&gt;64-core consumer
CPUs&lt;/a&gt;,&lt;a href=&quot;https://www.theverge.com/2019/10/30/20938653/&quot;&gt;smartphones that fold in
half&lt;/a&gt;, and &lt;a href=&quot;https://www.theverge.com/2020/3/9/21171230/&quot;&gt;retail stores where
you just walk out with what you want to
buy.&lt;/a&gt; There have been weak-AI
systems that can beat humans at Go, StarCraft, and DotA. We’re just
a bit shy of cars that drive themselves (though, sadly, not ones that fly). In
short, 20 years into the 21st century, we are surrounded by technology which was
the stuff of dreams not so long ago.&lt;/p&gt;

&lt;p&gt;And yet, it all runs on software and hardware which is, by comparison, quite
ancient. The trinity of operating systems, Windows, Linux, and macOS, originated
in the 80s and 90s. C, the predominant systems programming language, still used
as the foundation of each OS, not to mention embedded applications on pretty
much every electronic device manufactured in the last decade, was born in 1972.
C++? 1983. Python? 1991. Java, JavaScript, PHP, Ruby? 1995. Not only that, among
many experienced and skilled programmers, Emacs (1985) and Vim (1991) are the
text editors of choice.&lt;/p&gt;

&lt;p&gt;Are these the &lt;em&gt;only&lt;/em&gt; options? No, of course not, there’s certainly newer
software being used. And every piece of software evolves over time. Modern C++
certainly looks almost nothing like its early days as “C with Classes.” But on
the other hand, many of these examples were incremental improvements over their
predecessors, doing little to advance the state-of-the-art. No C++ programmer
would say that the invention of Java, for all the benefits it might have, was
like achieving enlightenment.&lt;/p&gt;

&lt;p&gt;So, what gives? Let’s take a whirlwind tour of computer history.&lt;/p&gt;

&lt;h1 id=&quot;whats-inside-a-computer&quot;&gt;What’s inside a computer?&lt;/h1&gt;

&lt;p&gt;Once upon a time, computers were simple. They were just a collection of
glorified on/off switches with simple circuits to manipulate them. A marvel of
engineering at the time, but by today’s standards, childish toys.&lt;/p&gt;

&lt;p&gt;Then, the invention of the transistor. With the advent of microprocessors came a
complete revolution in the capabilities of computers. What’s more, they
discovered they were consistently able to make smaller and smaller transistors,
enabling faster, more efficient processors. Soon they realized that a simple CPU
design that simply did one operation at a time was inefficient - the rate that
the CPU clock ticked at was fixed, and some operations are simply inherently
much faster than others. So they invented a trick called pipelining. And
out-of-order execution. Multiprocessors (multi-core).&lt;/p&gt;

&lt;p&gt;And as a result, what happened on the software side? People invented
higher-level languages like C, so that they wouldn’t have to be bogged down by
having to translate their thoughts into machine code. Compilers, software
designed to translate higher-level languages into machine code, grew ever more
complex. At one point, the system grew complex enough that the CPU architects
couldn’t change the machine language of their CPUs without breaking the
compilers. So they had a bright idea: they would design their CPUs using
whatever they wanted internally, and add some extra circuitry to translate the
common machine language into the hardware-level instructions their CPUs could
understand.&lt;/p&gt;

&lt;p&gt;This process continued in this manner for decades. It’s still ongoing now. The
general trend can be described as continually increasing abstraction. The clear
benefit is that programmers can write programs more quickly and easily, not only
increasing the number of programmers and programs, but also enabling programs of
greater scale and sophistication. The downside is that there are a huge number
of inefficiencies due to a countless number of translations between what the
programmer intended to do, and what the hardware actually does.&lt;/p&gt;

&lt;h1 id=&quot;operating-systems-and-user-interfaces&quot;&gt;Operating systems and user interfaces&lt;/h1&gt;

&lt;p&gt;Before the development of microprocessors, computers had little in the way of
“user interfaces.” They were basically complicated calculators where the state
of the machine was almost directly manipulated by the operator. Considering that
these were a tiny number of specialists, the designs had little in common with
the modern concept of UI.&lt;/p&gt;

&lt;p&gt;But when computers became more widespread, professionals from many other walks
of life began using them. Anybody who wanted to store or communicate text or
numbers. Computers were now &lt;em&gt;general-purpose&lt;/em&gt;. They had to be taught how to do
more than just one thing well and how to do more than one thing at a time. So
programmers invented operating systems, programs that had the sole purpose of
managing other programs. The beginnings of what we would call user interfaces
sprang into life: keyboards, printers, monitors. No longer were these
electromechanical devices to be operated by a handful of specialists; they were
now a proper appliance, to be used, even if not fully understood, by essentially
untrained individuals.&lt;/p&gt;

&lt;p&gt;Computers became cheaper and more powerful, so suddenly there was a potential
for a much broader audience. Entrepreneurs like Bill Gates and Steve Jobs
capitalized on the opportunity by providing hardware and software that could be
marketed towards the average Joe, ushering in the age of the Personal Computer.
A key part of their early successes was their development of proper Graphical
User Interfaces (GUIs).&lt;sup id=&quot;fnref:M&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:M&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; The popularity of computers grew, as their
applicability expanded from an arcane incantation-accepting device into a
simple-to-understand interactive canvas.&lt;/p&gt;

&lt;h1 id=&quot;text-editors-and-programming-languages&quot;&gt;Text editors and programming languages&lt;/h1&gt;

&lt;p&gt;Alongside processors and operating systems, there was the development of text
editors. Text was originally, of course, the &lt;em&gt;only&lt;/em&gt; human-accessible mode of
input and output for a computer. But even as graphics began to play a larger
role, most people focused on productivity found that text was simply
irreplaceable. While there was some experimentation, ultimately programmers also
found that the most effective way of representing their programs was in text.&lt;/p&gt;

&lt;p&gt;The venerable vi and Emacs both originated in other programs of a bygone era. vi
embraced a system for maximum efficiency of basic text operations in an
extremely constrained environment. For its simplicity and minimalism, it became
nearly universal, in one form or another, on pretty much every non-Windows
computer. On the other hand, Emacs sought to provide the ultimate flexibility,
wanting to exploit their interface with the computer for what it was: the
ability to create and execute arbitrary programs.&lt;/p&gt;

&lt;h2 id=&quot;free--open-source-software&quot;&gt;Free &amp;amp; open source software&lt;/h2&gt;

&lt;p&gt;The most popular open source operating system today is, of course, Linux, or, as
a certain someone would insist, GNU/Linux. And the only popular Emacs variant
today is GNU Emacs. That certain someone is Richard Stallman, who religiously
champions free-as-in-speech (libre) software. Regardless of one’s personal
opinions on the matter, it’s undeniable that this philosophy and the software
which came of it has profoundly influenced the modern landscape. It is this
philosophy which one might say inspired GNU Emacs to be what it is: a program
which can be modified in almost any way imaginable by the user.&lt;/p&gt;

&lt;h2 id=&quot;the-lisp-machine&quot;&gt;The Lisp machine&lt;/h2&gt;

&lt;p&gt;In those days, programming language research was quite fruitful, in the sense
that the space had room for exploration in almost every direction. Many
languages with very different ideas came to be, among them functional languages
(Haskell, OCaml), truly dynamic languages (Lisp, Smalltalk), logic languages
(Prolog), and a wide variety of others, better- and lesser-known (APL, Forth,
etc.).&lt;/p&gt;

&lt;p&gt;Lisp and Prolog, in particular, received considerable study in the pursuit
of developing AI. In the 70s and 80s, processors were slower and compilers were
not particularly advanced. General-purpose computing hadn’t really hit its
stride. Consequently, specialized hardware was developed at the MIT AI Lab to
efficiently execute Lisp, namely the Lisp machines. While they pioneered a
variety of extremely influential technologies, they were ultimately
unsuccessful, due to general-purpose PCs coming of age and the AI winter.&lt;/p&gt;

&lt;p&gt;As something of a twist of fate, Stallman encountered Lisp at MIT. As a dynamic,
high-level language with very advanced features for its time, and a lineage
already measured in decades, it was an unsurprising choice for an Emacs. It is
for this reason that Emacs uses Emacs Lisp.&lt;/p&gt;

&lt;h2 id=&quot;this-was-going-somewhere-after-all&quot;&gt;This was going somewhere after all&lt;/h2&gt;

&lt;p&gt;GNU Emacs has famously been referred to, a little tongue-in-cheek, as both “the
infinitely extensible editor” and “a great operating system, only in need of a
decent text editor.” It’s true, Emacs is a piece of software with a depth beyond
any single human being, one that comes with its own programming language and has
lasted nigh on 40 years. One can spend a decade using it and not come remotely
close to mastering it.&lt;sup id=&quot;fnref:E10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:E10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; I personally can attest to at least a fraction of
that, having used it for several years.&lt;/p&gt;

&lt;p&gt;Given the historical context, this is no coincidence. Emacs is perhaps the last
remaining descendant of the Lisp machines. Like other Lisp implementations in
both software and hardware, and the languages which took inspiration from them,
Emacs pairs a core runtime written in C with a much larger labyrinth of Emacs
Lisp, resembling the bytecode interpreter of Python. It is a system which
defines a large set of primitives
for displaying and manipulating text, interactions with the filesystem and
network, and an interface with the underlying operating system. Upon this
foundation, Emacs Lisp programs of all kinds have been written, including but
not limited to: music players, PDF viewers, web browsers, email clients,
terminal emulators, and Tetris.&lt;/p&gt;

&lt;p&gt;What is the significance of this? It’s a little hard to put into words, but I
will try nonetheless. Emacs is something of a monstrosity of a program. However,
it represents an unbroken line to a time when computers looked completely
different than their modern-day counterparts. Not only does it still function,
it is still evolving. Its language and its core are sufficiently flexible to
allow that, and as a result, the community is able to modify and extend it to
support features which might not have been dreamt of at its creation. In short -
while Linux, for example, might be a remarkably engineered program, Emacs is a
testament to organic growth that defies expectations.&lt;/p&gt;

&lt;h1 id=&quot;what-the-future-holds&quot;&gt;What the future holds&lt;/h1&gt;

&lt;p&gt;Many people have been remarkably pessimistic about the current state of software
development, remarking that it is crude, obsolete, inefficient, and just plain
dumb. &lt;sup id=&quot;fnref:D&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:D&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:OS&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:OS&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:S&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:S&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;However, in my view, in spite of all these shortcomings, we have so much to
appreciate, and so much to look forward to.&lt;/p&gt;

&lt;p&gt;In the programming language space, I believe the likes of Rust will usher in a
new era of safer systems programming, without paying any penalty in performance.
As a side effect, many ideas, like algebraic data types, which were only popular
in more niche programming languages, will continue to gain traction. Looking out
even further, I’m interested in seeing much more powerful type systems being
developed, such as that of F*, which will allow programmers to write code
that can be rigorously reasoned about. Rather than being subject to the
limitations of our foresight in writing the right tests, we can confidently
claim that our software will &lt;em&gt;never&lt;/em&gt; encounter certain errors. We can state
with certainty that code adheres to the permissions it requests and is granted.
And all this with only a minimal burden imposed on the programmer.&lt;/p&gt;

&lt;p&gt;In the operating systems space, I am quite certain that innovation will continue
at a fairly rapid pace. Today, there are operating systems for smartphones which
consistently achieve a UI that renders at 60FPS. With better languages for
systems programming and the right goals in mind, I would not be surprised to see
operating systems which handle concurrency much more robustly, encounter none of
those seemingly arbitrary failures, run on multiple devices of different form
factors effortlessly and perhaps even in unison, and attain interface latency
low and consistent enough to be indistinguishable from real physical
interaction. When this goal is achieved, we will have many forms of
ubiquitous computing to look forward to - I’m certainly excited for AR/VR
applications to become commonplace.&lt;sup id=&quot;fnref:N&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:N&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Finally, for lack of a better term, the Emacs-y space. This really has nothing
to do with text editors, it is simply that I do not have a better example in
mind. Today, many of us write programs which must be compiled before they are
executed, a process that can take between a few seconds and several hours. More
dynamic (i.e. interpreted and JIT-compiled) languages have a shorter lag time,
but at a severe cost in performance. Programs are frequently &lt;em&gt;not&lt;/em&gt;
cross-platform, compatible only with a specific OS or CPU architecture. And one
of the few languages which opposes these trends to an extent, I shudder to
admit, is JavaScript.&lt;/p&gt;

&lt;p&gt;But I believe that this is only a growing pain. We will soon see a day when the
distinction between compiled and interpreted becomes blurred. Not only will
compilation get faster, other hybrid approaches will be developed. The trade-off
between performance and dynamism is not fundamental; it is an engineering
problem. And when we solve those problems, every program will be something of an
Emacs. Long before we achieve general AI (although I do believe that’s coming
within the century), every person will be able to communicate with the programs
on any of their computing devices, instructing them as precisely as they wish.
Computers will be tools, not appliances.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:M&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;“The Mother of All Demos” demo’d (duh) these concepts in 1968, decades ahead of the first releases of Windows and the Mac OS. &lt;a href=&quot;#fnref:M&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:E10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;http://edward.oconnor.cx/2009/07/learn-emacs-in-ten-years &lt;a href=&quot;#fnref:E10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:D&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;https://tonsky.me/blog/disenchantment/ &lt;a href=&quot;#fnref:D&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:OS&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;http://blog.rfox.eu/en/Programmer_s_critique_of_missing_structure_of_oper.html &lt;a href=&quot;#fnref:OS&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:S&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;http://doc.cat-v.org/bell_labs/utah2000/utah2000.html &lt;a href=&quot;#fnref:S&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:N&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;https://jnd.org/natural_user_interfaces_are_not_natural/ &lt;a href=&quot;#fnref:N&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 29 Feb 2020 00:00:00 -0800</pubDate>
        <link>https://rationalis.github.io/articles/2020-02/on-modern-computation</link>
        <guid isPermaLink="true">https://rationalis.github.io/articles/2020-02/on-modern-computation</guid>
        
        
        <category>computer science</category>
        
        <category>history of computers</category>
        
        <category>programming language theory</category>
        
        <category>user interfaces</category>
        
        <category>operating systems</category>
        
        <category>processors</category>
        
      </item>
    
      <item>
        <title>Everything is information</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;As humans, we tend to see the world at our level. Which is to say, at the size,
mass, and energy scales common to our day-to-day lives. Tables and chairs, cars
and bikes, etc. However, a physicist would probably tell you that this
intuitive, anthropocentric conception of the universe is only a reasonable
&lt;em&gt;approximation&lt;/em&gt; for the very small physical regime that we live in. What the
universe really looks like at small scales is the weirdness of quantum physics;
at very large scales, we encounter relativistic effects due to gravity; and in
some cases, like around black holes, some unknown combination of both.&lt;/p&gt;

&lt;p&gt;This isn’t an article about conventional physics, though - there is another
interesting way to view the universe. And that is in terms of &lt;em&gt;information&lt;/em&gt;.&lt;/p&gt;

&lt;h1 id=&quot;what-is-information&quot;&gt;What is information?&lt;/h1&gt;

&lt;p&gt;I don’t want to get too philosophical, but we do have to start somewhere. When
we think of information, we probably think in terms of communicating it, via
speech or text. We also record and retrieve information, but for the purposes of
this post, we can just think of that as a form of communication between the
recorder and retriever (even if, say, they’re the same person at different
points in time).&lt;/p&gt;

&lt;p&gt;OK, but what exactly &lt;em&gt;is it&lt;/em&gt; that we’re communicating, and why? Fundamentally,
when two people communicate, there’s something that one person knows that the
other doesn’t, and the &lt;em&gt;knowledge&lt;/em&gt; is what’s communicated. Even if it’s as
trivial as “what I ate for breakfast.” So, putting aside the philosophical
questions about what “knowledge” is, we can think of information as “new
knowledge.” If the knowledge is already available, then no information is
communicated.&lt;/p&gt;

&lt;p&gt;When framed this way, it’s natural for us to describe information in terms of a
&lt;em&gt;question&lt;/em&gt; and an &lt;em&gt;answer&lt;/em&gt;. In the simplest possible case, the space of
&lt;em&gt;possible&lt;/em&gt; answers must have at least two elements, i.e. a yes/no question.
After all, if there’s only one possible answer, then we would always know the
answer without any communication.&lt;/p&gt;

&lt;p&gt;This describes the basic, implicit premise of a huge swath of computer science,
which we usually take for granted. We measure the storage of data in terms of
bits (bytes being 8 bits), most modern CPUs use either 32-bit or 64-bit memory
addresses, the rate of communication (bandwidth) of a network connection is
measured in bits per second, and so on and so forth.&lt;/p&gt;

&lt;p&gt;Still, the title is “&lt;em&gt;everything&lt;/em&gt; is information,” not just data manipulated by
computers…&lt;/p&gt;

&lt;h1 id=&quot;the-many-forms-of-information&quot;&gt;The many forms of information&lt;/h1&gt;

&lt;p&gt;The most familiar example of quantifiable information, to most people, and most
especially computer scientists, will be digital data. However, information takes
many other forms. Every answer to every question is a specific amount of
information - as mentioned before, a single bit of information for a yes/no
question. We can think of any number as a certain amount of information, namely
the number of bits it takes to specify it in binary, and of course, pretty much
anything else we can think of can be converted to numbers and back.&lt;sup id=&quot;fnref:L&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:L&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
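&lt;p&gt;As a rough sketch of this idea, here’s a tiny Python helper (the function name &lt;code&gt;bits_needed&lt;/code&gt; is just an illustrative invention, not anything standard) that counts the bits required to single out one answer from several equally likely possibilities:&lt;/p&gt;

```python
import math

def bits_needed(num_states):
    """Bits of information needed to single out one of num_states
    equally likely possibilities, rounded up to a whole bit."""
    return math.ceil(math.log2(num_states))

# A yes/no question has 2 possible answers: 1 bit.
print(bits_needed(2))      # 1
# A single letter a-z is one of 26 possibilities: 5 bits suffice.
print(bits_needed(26))     # 5
# Any number up to a billion fits in 30 bits.
print(bits_needed(10**9))  # 30
```

&lt;p&gt;Note the logarithmic growth: doubling the number of possible answers costs exactly one more bit.&lt;/p&gt;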

&lt;p&gt;This already seems fairly comprehensive. Every spoken word is some number of
bits of information, as is every piece of data touched by a computer, and
everything ever written down. These are familiar enough, but the interesting
part is that we can think of all of them in terms of bits.&lt;/p&gt;

&lt;h2 id=&quot;all-things-physical&quot;&gt;All things physical&lt;/h2&gt;

&lt;p&gt;But one might be tempted to think that, surely, there are some things which
&lt;em&gt;can’t&lt;/em&gt; be described as just a bunch of bits. For example, what something smells
like, or how it feels to laugh until your sides hurt. Certainly, we know of no
way to precisely describe these things on paper or a computer. We can describe
them approximately with words, but ultimately, language does not convey all the
complexities of a smell, or an emotion.&lt;sup id=&quot;fnref:I&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:I&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;We’ll skirt around more philosophical issues by asserting that everything ever
experienced by any human corresponds exactly one-to-one with the state of a
physical system. In other words, we assert that there is nothing special
about consciousness, and that every subjective experience corresponds to some
configuration of neurons firing in our brains. This will of course be a rather
controversial assertion, but probably not an unfamiliar one.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Under this assumption that subjective experience really just reduces down to
some physical states, how do we proceed? Well, we already know how to talk about
physical systems, as alluded to in the introduction. It’s the domain of
physicists, and the details aren’t too important for us. What matters is that we
could, in theory, quantitatively describe physical states in some way, with a
bunch of numbers that we could write down.&lt;/p&gt;

&lt;p&gt;Then, for some theoretical data format, we could describe the &lt;em&gt;entire universe&lt;/em&gt;
in terms of this concept of information. If we play God just for the sake of
argument, we could describe the position and velocity of every particle in the
universe at a subatomic level, and that would capture everything physical,
including the brains of humans.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h1 id=&quot;whats-the-point&quot;&gt;What’s the point?&lt;/h1&gt;

&lt;p&gt;“OK,” you might say, “I get it, we can describe stuff with bits and bytes, but
that doesn’t seem that interesting. Isn’t it just theoretical wiffle-waffling?”&lt;/p&gt;

&lt;p&gt;Well, actually, information theory has many applications.&lt;/p&gt;

&lt;h2 id=&quot;compression&quot;&gt;Compression&lt;/h2&gt;

&lt;p&gt;One interesting application is compression. By definition, compression is about
taking some information, and somehow making it smaller. It’s applied everywhere,
for pictures, videos, programs, etc. With the information theoretic lens we can
make an observation about compression: the only way we can compress information
is to rely on the mutual assumptions of communicating parties. We must assume
certain kinds of redundancies in the data, and ways of encoding that redundancy.
Otherwise, if we reduce the amount of information communicated, we must always
lose something. This is intricately linked with a basic mathematical principle -
we can’t make an invertible mapping between sets of different sizes. It also
demonstrates why there is no compression algorithm that will work well for all
possible data (e.g. random data) - for compression to work at all, we make
certain &lt;em&gt;biased&lt;/em&gt; assumptions, and by definition, some data will violate those
assumptions.&lt;/p&gt;
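&lt;p&gt;A toy sketch makes this concrete. Run-length encoding (shown here in a deliberately minimal Python form, not any production compressor) bakes in the biased assumption that data contains long runs of repeated characters; data that matches the assumption shrinks, and data that violates it actually grows:&lt;/p&gt;

```python
def rle_encode(s):
    """Toy run-length encoding: store each run as [character, run length].
    This embodies a biased assumption: the data contains long runs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return runs

def rle_decode(runs):
    return ''.join(ch * n for ch, n in runs)

repetitive = 'a' * 50 + 'b' * 50  # matches the assumption
alternating = 'ab' * 50           # violates it

for data in (repetitive, alternating):
    runs = rle_encode(data)
    assert rle_decode(runs) == data  # lossless, i.e. invertible
    print(len(data), 'characters in', len(runs), 'runs')
```

&lt;p&gt;Both inputs are 100 characters, but the repetitive one collapses to 2 runs while the alternating one needs 100 runs - each run costing more to store than the single character it encodes. Crucially, the mapping stays invertible either way, as the decode check verifies.&lt;/p&gt;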

&lt;h2 id=&quot;cryptography&quot;&gt;Cryptography&lt;/h2&gt;

&lt;p&gt;Cryptography, being the study of secure communication, is unsurprisingly all
about information. It’s no coincidence that cryptographic algorithms tend to be
described with some extra metadata about how many bits they use (AES-256,
SHA-512, etc.). How much information do we need to authenticate, or verify the
author of a message? How many bits can be protected with a given number of
“secret” bits? Exactly how random do we need random numbers to be to ensure
security? These questions are at the heart of cryptographic methods used in
modern digital security.&lt;/p&gt;

&lt;h2 id=&quot;physics-again&quot;&gt;Physics, again&lt;/h2&gt;

&lt;p&gt;As it turns out, the physical interpretation of information has very real
consequences. A famous example is the second law of thermodynamics:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The second law of thermodynamics states that the total entropy of an isolated
system can never decrease over time&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At first glance this doesn’t say anything about information, but that’s just
because I’ve avoided using the technical terminology of &lt;em&gt;entropy&lt;/em&gt;. Those who
remember entropy from their chemistry or physics classes might think of it as a
measure of how “disordered” a system is, in terms of its indistinguishable
microstates. As it turns out, the thermodynamic definition of entropy is
precisely equivalent to the ideas of information outlined above. In fact, the
two differ only by a constant conversion factor, measured in joules per kelvin
per bit!&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
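&lt;p&gt;For the curious, the correspondence can be stated with the standard textbook formulas: Shannon’s information entropy of a source with outcome probabilities \(p_i\) is \(H = -\sum_i p_i \log_2 p_i\) bits, while Boltzmann’s thermodynamic entropy of a system with \(W\) equally likely microstates is \(S = k_B \ln W\). For \(W\) equally likely outcomes, \(H = \log_2 W\), so the two agree up to a constant factor: \(S = (k_B \ln 2) \, H\).&lt;/p&gt;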

&lt;p&gt;Another interesting principle is the law of conservation of information. In
classical physics, we can seemingly create or destroy information at will. That
is, it is physically possible to copy any arbitrary sequence of bits perfectly,
at least in theory, and likewise we can take information and irreversibly
destroy it.&lt;/p&gt;

&lt;p&gt;However, with quantum mechanics, information takes a form quite different from
classical bits. In particular, quantum information can
neither be created nor destroyed. The late Stephen Hawking once made a famous
bet &lt;em&gt;against&lt;/em&gt; this law, based on a belief that black holes destroy information -
and he lost that bet. Why this should be the case is rather technical, and
anyways, to paraphrase Feynman, a technical explanation would not really be a
“why” so much as a “how.”&lt;/p&gt;

&lt;p&gt;The key point is that, much as physics forbids faster-than-light travel, it
appears to also forbid non-conservative operations on information at the lowest
scale. (And as an aside, we can rephrase the restriction of FTL as
“&lt;em&gt;information&lt;/em&gt; cannot travel FTL.”)&lt;/p&gt;

&lt;p&gt;Based on similar ideas about black holes, there is a proven upper bound on how
much information can be contained in finite space with finite energy, known as
the Bekenstein bound. This places fundamental theoretical limits on how fast
computers can be, and how dense their storage can be.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; In fact, it is this
result which confirms the possibility of the previously mentioned idea of
describing any physical system - if a certain volume of space can only contain a
certain finite amount of information, then it is possible to describe that
volume using at most that much information.&lt;/p&gt;

&lt;p&gt;There are whole theories of physics which are founded upon physical information.
So, I think, one should begin to suspect that information is a genuinely
powerful and pervasive concept.&lt;/p&gt;

&lt;h2 id=&quot;and-much-much-more&quot;&gt;And much, much more&lt;/h2&gt;

&lt;p&gt;That I’m not going to list out for you! Wikipedia has plenty of information
(heh) about the topic.&lt;/p&gt;

&lt;h1 id=&quot;the-conclusion-that-maybe-shouldve-been-the-introduction&quot;&gt;The conclusion that maybe should’ve been the introduction&lt;/h1&gt;

&lt;p&gt;In 1948, information theory was essentially invented by Claude Shannon, with the
paper “A Mathematical Theory of Communication”. Of course, he wasn’t the only
person to have these ideas or develop them, but he’s the one known as the father
of information theory. At the time, he was mainly concerned with cryptography
and literal, electrical/mechanical communication. In the better part of a
century since then, the theory has found many applications, across too many
fields to count.&lt;/p&gt;

&lt;p&gt;As far as we can tell, anything in the universe &lt;em&gt;is&lt;/em&gt;, in a sense, the
information which describes it.&lt;sup id=&quot;fnref:U&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:U&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; Viewing things this way is no less correct
than, for example, viewing everything as particles. So, as it turns out,
information is a fundamental principle, not just of human activities, but,
apparently, of the very fabric of the universe.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:L&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The specific usage of binary isn’t necessary, though it is fundamental in the sense of being the smallest discrete base. The important thing is that information is &lt;em&gt;logarithmic&lt;/em&gt; in the number of states. &lt;a href=&quot;#fnref:L&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:I&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;One heuristic argument for this is that it’s impossible to imagine what something you’ve never encountered smells or tastes like based purely on a verbal description. &lt;a href=&quot;#fnref:I&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;At any rate, it’s certainly much harder to start talking about what quality subjective, conscious feelings (qualia) have which is beyond merely physical. For those who wish to disagree with the assertion, well, I will simply say that in your view of the world, not everything is information - but on the other hand, the stuff that’s not information isn’t a physically measurable thing, so you could never detect it in other people. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;You can nitpick that it’s a little more complex thanks to quantum physics, and possibly other physics stuff I’m not familiar with. Or maybe physics we don’t even know about yet. It’s not relevant to the argument, so long as we can assume that physical states are completely quantifiable. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Up to some conversion factors. SI units treat bits as dimensionless. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We’re nowhere close to those limits, and honestly, we might never get anywhere close. But the fact remains that such limits exist. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:U&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Again, philosophical arguments about definitionally unobservable state would evade this principle. And again, it is my personal opinion that such philosophical considerations are of no particular interest. &lt;a href=&quot;#fnref:U&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Of course this is deliberately grandiose. The caveat is that you could really say this about a lot of things, like mathematics, or computers. In fact, this post is only the first of a series, hopefully, which covers many such ideas. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Wed, 01 Jan 2020 00:00:00 -0800</pubDate>
        <link>https://rationalis.github.io/articles/2020-01/everything-is-information</link>
        <guid isPermaLink="true">https://rationalis.github.io/articles/2020-01/everything-is-information</guid>
        
        
        <category>information theory</category>
        
        <category>musings</category>
        
        <category>everything is</category>
        
      </item>
    
  </channel>
</rss>
