Sorting and Searching

Need a PRNG? Use a CSPRNG

2023-11-25T00:00:00-08:00

This post is about Random Number Generators, RNGs for short.

We often want computers to behave randomly, for various reasons:

Cryptography. Encryption keys need to be random to prevent attackers from guessing them.
Monte Carlo simulations. We want to calculate the statistics of some random phenomenon by random sampling.
Monte Carlo algorithms. Certain computational problems are easier to solve with high probability using randomness than deterministically with certainty. Primality testing is an example.
Las Vegas algorithms. These always produce the correct answer, but their efficiency depends on using randomness. Quicksort is an example.
Machine learning. Neural network weights are initialized randomly, training uses random sampling.
Games with randomness. An online backgammon server needs random dice.
Video games. We want certain elements of the game to behave randomly.
Art. Predictable patterns are boring.

There are several ways to generate random numbers. You can use a Hardware RNG, or you can use a Pseudorandom Number Generator, PRNG for short. There are several PRNG algorithms available. The question I will try to answer is: which one should you use?

I have already given away my answer in the title. You should always use a Cryptographically Secure PRNG, CSPRNG for short, and ignore all other PRNGs. There is no good reason to use anything else. I know it sounds simplistic, but I will explain why it is the only reasonable way to do it.

This advice goes somewhat against the common practice and the usual advice. What people typically say is: use a CSPRNG if you are doing cryptography and need security against adversarial attackers. For all other purposes pick among the “classic” non-cryptographic PRNGs. It sounds reasonable. After all, it’s in the name: CSPRNG stands for “cryptographically secure” PRNG. If you don’t need to worry about security, why bother doing that?

But I will argue that always using CSPRNGs is the only thing that really makes sense.

A motivating example

Consider the following problem in linear algebra:

What is the probability that a random $n \times n$ matrix in $\mathbb{Z}_2$ (modulo 2) is full rank, i.e. invertible?

We can test this experimentally. I wrote a C++ program that calculates this for $n \le 700$ by sampling 1000 random matrices for each $n$ .

It looks like we get about 30% for $n \le 491$ , then we suddenly start flipping between 0% and 30% for $492 \le n \le 526$ , then we get 100% for $n = 527$ , and then 0% for $528 \le n \le 700$ .

You can try running this code yourself and see if you get the same answers. This experiment is repeatable.

Such interesting results! We have some sort of phase transition around 500. You can easily imagine somebody writing a scientific paper based on an experiment similar to this.

But the effect is not real. The true answer is that the probability is:

\prod_{i=1}^n (1 - 2^{-i}) \rightarrow \prod_{i=1}^\infty (1 - 2^{-i}) = 0.2887\ldots

I used the rand() function from the GNU C library to generate the random matrices. This PRNG uses a Lagged Fibonacci Generator. The results are an artifact of this particular PRNG. There are detectable dependencies between the pseudo-random bits it generates, which result in matrix rows being linearly dependent for large $n$ (and, surprisingly, always independent for $n = 527$ ).

What should I have used instead? How can I know that if I use a different PRNG, I will get more trustworthy results?

I can’t if I use any old PRNG. But if I use a CSPRNG, then I know by definition of cryptographic security that one of two things will happen:

Either the results will be statistically correct, or
I will have broken the cryptographic security of that particular CSPRNG.

Both cases are pretty neat!

If, hypothetically, my results are statistically off with a CSPRNG, I will have just found a way to break the cryptographic security of the cryptographic primitive used by that CSPRNG. By definition, that’s what it means to break cryptographic security. That’s even better than learning about matrices!

That scenario is very unlikely in practice. A lot of people have deliberately tried and failed to break the security of standard CSPRNGs – that’s why we treat them as cryptographically secure. It’s unlikely that, just by messing around with matrices, I would stumble upon a way to break one without even trying.
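The matrix experiment is easy to repeat with a CSPRNG. Here is a minimal Python sketch (not the C++ program mentioned above) that evaluates the product formula and estimates the same probability by sampling. The function names are made up for illustration, each matrix row is stored as an n-bit integer, and the random bits come from the standard library's `secrets` module, which draws on the operating system's CSPRNG.

```python
import secrets


def full_rank_probability(n):
    """Exact probability that a uniformly random n x n matrix over Z_2 is invertible."""
    p = 1.0
    for i in range(1, n + 1):
        p *= 1.0 - 2.0 ** -i
    return p


def rank_gf2(rows):
    """Rank over Z_2 of a matrix whose rows are given as integers (bit j = column j)."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low_bit = pivot & -pivot  # lowest set bit of the pivot row
            # Clear that bit from every remaining row (row addition modulo 2).
            rows = [r ^ pivot if r & low_bit else r for r in rows]
    return rank


def estimate_full_rank(n, trials=1000):
    """Fraction of sampled n x n matrices that are full rank, using OS randomness."""
    hits = 0
    for _ in range(trials):
        matrix = [secrets.randbits(n) for _ in range(n)]
        hits += rank_gf2(matrix) == n
    return hits / trials


if __name__ == "__main__":
    for n in (8, 64, 128):
        print(n, full_rank_probability(n), estimate_full_rank(n, trials=200))
```

With a sound generator the estimate should stay near 29% for every $n$, up to sampling noise, with no phase transition.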

Hardware RNGs

Maybe we should just use true randomness rather than settling for pseudorandomness? There are hardware devices that generate random bits. Some modern processors have them built in: for instance, modern x86-64 CPUs have the RDRAND instruction. One could also use /dev/random in Linux, or the online service random.org.

There are a few reasons PRNGs are usually preferred:

Truly random bits are hard to obtain efficiently.
It’s difficult to make them unbiased.
With PRNGs, we can easily replay the whole random sequence without having to store all the bits.

In fact, for the first two reasons, the RDRAND instruction is implemented in hardware with a hybrid of a hardware generator and a CSPRNG. /dev/random also does this.

Hardware generators are still very important however. We need them to seed PRNGs. A PRNG takes a small truly random seed and uses it to generate a long sequence of pseudorandom bits.

Non-cryptographic PRNGs

There are several non-cryptographic PRNGs in common use:

Lehmer generator. Available in C++ as minstd_rand0 and minstd_rand.
Linear congruential generator. Previously used by the rand function in the GNU C library.
Lagged Fibonacci generator. Currently used by the rand function in the GNU C library.
Linear-feedback shift register.
Mersenne Twister. Very popular. Available in C++ as mt19937, and the default PRNG used by the Python random module.
PCG, Permuted Congruential Generator.
Xorshift / Xoroshiro. A popular recent generator. Used by default in JavaScript engines in Chrome, Firefox, Safari.

I am arguing: these should never be used! They are weak generators. They are all easily predictable and have statistical weaknesses. Of course they do – if they didn’t, they would be considered CSPRNGs!
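To make "easily predictable" concrete, here is a hedged Python sketch of the well-known state-recovery trick for the Mersenne Twister. It assumes the observer sees 624 consecutive raw 32-bit outputs of CPython's `random` module (obtained with `getrandbits(32)`), and it relies on the internal state layout that CPython's `Random.setstate` happens to accept, which is an implementation detail. The function names are illustrative.

```python
import random

MASK32 = 0xFFFFFFFF


def undo_right_shift_xor(y, shift):
    """Invert y = x ^ (x >> shift) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x


def undo_left_shift_xor(y, shift, mask):
    """Invert y = x ^ ((x << shift) & mask) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x & MASK32


def untemper(y):
    """Recover the internal MT19937 state word behind one 32-bit output."""
    y = undo_right_shift_xor(y, 18)
    y = undo_left_shift_xor(y, 15, 0xEFC60000)
    y = undo_left_shift_xor(y, 7, 0x9D2C5680)
    y = undo_right_shift_xor(y, 11)
    return y


def clone_mersenne_twister(outputs):
    """Build a random.Random that continues the sequence after 624 observed outputs."""
    if len(outputs) != 624:
        raise ValueError("need exactly 624 consecutive 32-bit outputs")
    state = tuple(untemper(y) for y in outputs)
    clone = random.Random()
    # Assumption: CPython layout (version 3, 624 state words + position index, gauss_next).
    clone.setstate((3, state + (624,), None))
    return clone


if __name__ == "__main__":
    victim = random.Random()  # seeded from the OS; the seed is never revealed
    observed = [victim.getrandbits(32) for _ in range(624)]
    clone = clone_mersenne_twister(observed)
    print(all(clone.getrandbits(32) == victim.getrandbits(32) for _ in range(1000)))
```

No seed is guessed here: the outputs themselves leak the full internal state, because the output transformation is invertible.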

It is a bad state of affairs that they are provided as the default PRNG in various programming languages.

The Dark Art of choosing a weak PRNG

Donald Knuth in his classic books The Art of Computer Programming devotes a good part of Volume 2 just to this topic. He analyzes various parameters of Linear Congruential Generators and Lagged Fibonacci Generators and considers their strengths and weaknesses.

He never considers CSPRNGs at all. That’s because modern cryptographic primitives on which CSPRNGs are built didn’t exist yet when the books were written in the 1960s.

There are a lot of papers analyzing the pros and cons of various algorithms and their weak spots.

Sebastiano Vigna, one of the authors of xoroshiro, has written a paper about why one shouldn’t use the Mersenne Twister and has argued online about why PCG is flawed.

The arguments got rather heated when PCG’s author M.E. O’Neill wrote in defense of PCG.

My take: just use a CSPRNG and you will never have to worry about any of this. CSPRNGs don’t have any such weaknesses, by design! As long as their security claims hold up.

CSPRNGs

Cryptographically secure PRNGs are closely related to encryption. Any CSPRNG can be turned into a stream cipher, and vice-versa. A stream cipher is really just a CSPRNG whose output you xor with your message in order to encrypt it.

To build a CSPRNG from a block cipher, you can simply run the block cipher in counter mode. In other words, you encrypt the numbers 0, 1, 2, 3, etc. using a random encryption key, and the resulting ciphertexts give you the pseudo-random bits.

There are other ways to do this in order to get some extra protection when your key leaks but let’s not get into that because it’s not really relevant outside of cryptography.

Two common CSPRNGs are:

AES-CTR, based on AES encryption.
ChaCha20.

I recommend ChaCha20. It’s rather simple. It’s easy to implement from scratch in not that many lines of code. It doesn’t require special hardware support, whereas AES relies on special hardware instructions to be efficient.
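To give a sense of how little code is involved, here is a minimal Python sketch of the ChaCha20 block function run in counter mode, following the state layout of RFC 8439 (32-byte key, 32-bit block counter, 12-byte nonce). The class name `ChaCha20Rng` and its `random_bytes` method are illustrative. Pure Python is far slower than an optimized implementation, and for anything security-sensitive a vetted library is the safer choice.

```python
import os
import struct

MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """One ChaCha quarter round on the 16-word state s, in place."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def chacha20_block(key, counter, nonce):
    """Return the 64-byte keystream block for (key, counter, nonce)."""
    init = list(struct.unpack("<4I", b"expand 32-byte k"))
    init += struct.unpack("<8I", key)
    init.append(counter & MASK32)
    init += struct.unpack("<3I", nonce)
    s = list(init)
    for _ in range(10):  # 10 double rounds = 20 rounds
        quarter_round(s, 0, 4, 8, 12)  # columns
        quarter_round(s, 1, 5, 9, 13)
        quarter_round(s, 2, 6, 10, 14)
        quarter_round(s, 3, 7, 11, 15)
        quarter_round(s, 0, 5, 10, 15)  # diagonals
        quarter_round(s, 1, 6, 11, 12)
        quarter_round(s, 2, 7, 8, 13)
        quarter_round(s, 3, 4, 9, 14)
    return struct.pack("<16I", *((x + y) & MASK32 for x, y in zip(s, init)))


class ChaCha20Rng:
    """Hypothetical wrapper: a key is the seed, blocks 0, 1, 2, ... are the output."""

    def __init__(self, key=None, nonce=bytes(12)):
        # Without an explicit key, seed from the operating system's entropy source.
        self.key = os.urandom(32) if key is None else key
        self.nonce = nonce
        self.counter = 0

    def random_bytes(self, n):
        """Return n pseudo-random bytes; leftover bytes of the last block are discarded."""
        out = bytearray()
        while len(out) < n:
            out += chacha20_block(self.key, self.counter, self.nonce)
            self.counter += 1  # the 32-bit counter wraps after 2**32 blocks (256 GiB)
        return bytes(out[:n])
```

Passing the same key twice replays the same sequence, which is the property that makes PRNGs convenient for repeatable experiments.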

The de-facto standard Rust library for random numbers, rand, uses ChaCha12, a reduced-round variant of ChaCha20, as its default PRNG.

What makes a good PRNG

A good PRNG generates a sequence that “looks random”. This is usually taken as “it passes as many various statistical tests as possible”. Statistical tests try to distinguish pseudo-random numbers from truly random numbers.

The definition of a CSPRNG is: given the first n bits, there is no efficient algorithm that will predict the next bit with success rate non-negligibly better than 50%.

This implies that a CSPRNG will pass all statistical tests that run in reasonable time.

Can you see the similarity between what makes a good PRNG and what makes a CSPRNG? A CSPRNG is the ultimate PRNG, by definition.

CSPRNGs are better

Suppose you are creating an online backgammon server. Backgammon uses dice. You need fair dice. What PRNG should you choose?

If you use a weak PRNG, players can relatively easily reverse engineer what is going on and predict future dice based on previous dice rolls. Clearly a bad situation all around!
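In Python, for instance, dice backed by the operating system's CSPRNG take one line with the standard `secrets` module. The function name `roll_dice` is illustrative. Note that this source cannot be seeded, so it does not give replayable sequences.

```python
import secrets


def roll_dice(count=2):
    """Roll `count` six-sided dice using the operating system's CSPRNG."""
    # randbelow(6) is uniform on 0..5, so adding 1 gives a fair die face.
    return [secrets.randbelow(6) + 1 for _ in range(count)]


if __name__ == "__main__":
    print(roll_dice())
```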

How about if you are running a scientific Monte Carlo simulation. We already saw an example with matrices. You can’t trust the results if you use a weak PRNG!

Can you trust the results if you use a CSPRNG? Sort of.

Hypothetically you could get biased results because there is no proof that any particular CSPRNG really is secure. But if you do see biased results, you will have accidentally broken the security of that CSPRNG, by the very definition of cryptographic security. You would be able to read encrypted messages by exploiting this bias.

So for practical purposes: yes, you can trust the results obtained with a CSPRNG.

What if you’re just running some Las Vegas algorithm such as quicksort? Do you really need a CSPRNG?

Yes! The analysis of quicksort assumes that the numbers are truly random. If they can be distinguished from truly random, that analysis doesn’t hold any more. Your quicksort might become quadratically slow for certain types of data. You never know. Why risk it?

With a CSPRNG, if you notice that your quicksort becomes quadratically slow, again you would have accidentally broken the cryptographic security of that PRNG.

What if you’re just building a video game? Do you really care about the quality of the random numbers used for the behavior of characters in the game? Well, in this case maybe not. But then, why bother even thinking about this? Just plug in a CSPRNG and forget about it. There is no harm.

What about performance?

By this point many of you are probably thinking: what about performance? Aren’t CSPRNGs less efficient than simple weak PRNGs?

Let’s look at some numbers.

If you use a weak PRNG such as Xoshiro256PlusPlus, you are getting 7 GB/s.

If you use ChaCha20, you are getting 1.8 GB/s. About 4x less.

This seems like a reasonable argument for weak PRNGs: 4x speed-up is a lot!

But that would only be true if generating random bits was the hot spot, the bottleneck of your program. It never is in practice.

1.8 GB/s is about 14 billion random bits per second. Do you really need 14 billion random bits per second in your video game, or in your Monte Carlo simulation? Probably not. Probably many orders of magnitude less. Whatever you do with those random bits most likely takes orders of magnitude more effort than just generating those bits.

And if you ever do need billions or trillions of bits, you probably care a lot about the quality of those bits. Any bias will show with such a large sample.

“But I don’t care about quality”

So you might say: in my program I don’t care whether the numbers are really all that random, so shouldn’t I just use some weak PRNG?

I have two answers to this.

The first answer is: even if you don’t care, there is no harm in using a CSPRNG. So why not just always use it?

My second answer is: if you really don’t care, why not just do this?

int getRandomNumber() {
    return 4;
}

Or more realistically, if you only care that the numbers don’t repeat too often, why not use a simple counter that returns 0, 1, 2, 3, …?

If you say “that’s not random enough”, that indicates that you do care about quality, and so you should just use a CSPRNG and be done.

“My favorite language doesn’t provide CSPRNGs”

That’s the only reasonable argument for using a weaker PRNG. I hope this situation changes in the future.

You can still use a third-party library. Or you can just implement ChaCha20 yourself. It’s not that complicated. Wikipedia has an implementation of the core algorithm in 31 lines of C code.

Weak PRNGs are poor man’s wanna-be CSPRNGs

If you look at how a weak PRNG such as Xoroshiro128** is implemented, you will see a bunch of xors, bit shifts, and multiplications.

If you look at how ChaCha20 is implemented, you will see a bunch of xors and bit shifts and additions. There are just more iterations of them.

The way I see it: weak PRNGs are wanna-be CSPRNGs that don’t quite get there. They do the same kinds of things that CSPRNGs do, they just don’t quite get the job done fully. They don’t mix the bits enough that the output becomes indistinguishable from truly random.

Just use the real thing!

Summary

I hope that this convincingly argues that CSPRNGs are the way to go. We should all just stop using weaker PRNGs altogether.

It is a shame that major programming languages provide weak PRNGs as the default in their standard libraries.

Rust’s third-party crate rand, the de-facto Rust standard for PRNGs, is a notable exception: it uses a CSPRNG as the default generator.

I hope everybody else follows suit and does the same.

Roger Penrose’s AI skepticism

2021-07-18T00:00:00-07:00

Despite recent advances in Artificial Intelligence, I sometimes encounter the claim that while computers can do many tasks well, human-level AI is not possible for fundamental reasons. Skeptics claim that computers can never be as smart, creative, generally intelligent as humans.

Accomplished mathematician and physicist Roger Penrose has been very outspoken about his AI skepticism. I once had the privilege of attending a talk he gave on the subject. Here is a presentation he gave in 2020 with a very similar structure.

Penrose believes that humans can do something no classical computer can do, i.e. compute non-computable functions. In other words, he thinks we can solve problems no Turing machine can solve. This implies no standard computer can solve them, and not even quantum computers can solve them. The advantage of quantum computers is that they can solve some problems faster, but in principle they too can be simulated on Turing machines.

One has to admire the fact that he realizes, and goes along with, the full implications of this view.

If we believe standard neuroscience, human thoughts are encoded in the strength of interactions between neurons, which act as summing and thresholding devices. In other words, our brains are a kind of deep-learning neural network. Neural networks however obviously can be simulated by classical computers. So if we believe Penrose’s arguments, this implies that neuroscience is completely wrong about this.

He understands this very well, and thus, together with Stuart Hameroff, has developed an alternative hypothesis of how human thinking works. It has to do with supposed quantum entanglements between microtubules in our brain cells.

But that’s not enough: even quantum physics can be simulated on classical computers. Thus Penrose goes further, and posits a whole new physical hypothesis, Orchestrated Objective Reduction, that adds a non-computational component to quantum physics that somehow our thoughts must be tapping into.

In the first 25 minutes of the talk he discusses evidence that human brains supposedly exhibit non-computational elements that cannot be simulated by computers. This is what I will address. I do not find any of these arguments convincing.

The remainder of the talk discusses the hypotheses for new physics and new neuroscience that would explain non-computability, which I am not going to attempt to discuss.

Computer chess

At 2:40 of the presentation he discusses a chess position.

Human chess players quickly realize that the position is a draw. All white has to do is move the king around on black squares, and black can’t make progress.

Computer programs don’t immediately realize this. They think black has a massive advantage.

Penrose cites “Fritz at grandmaster level” as the computer program. The same is true for Stockfish. It takes Stockfish a long time to realize it’s a draw.

If you don’t give Stockfish a very long time to think, it is likely to blunder by giving up its free bishop in order to avoid a draw, thinking it still has a big advantage, but in fact this will allow white to win.

In other talks, Penrose has shown other similar positions.

Humans quickly find the drawing move: Bb4. After that just move the king around and black will never get through the wall.

Stockfish takes a very long time to find this move, and even longer to realize it is a draw.

Stockfish thinks white is losing badly. Humans quickly realize that white can just move the king around on white squares and never move the pawns, and the black pieces are forever trapped, resulting in a draw.

Penrose claims these examples demonstrate that humans have something no computer can ever have. But do they really?

No, they don’t. What they do show is that Fritz and Stockfish have still not reached the human level at chess for these particular weird situations. Humans still have a better algorithm than these computer chess programs, for these positions. It does not mean that no computer program will ever be able to catch up and understand these positions as quickly as humans can.

Fritz and Stockfish are chess programs that perform a game tree search and evaluate positions using a hand-coded evaluation function. All these examples have the same pattern in common. They have a bunch of pieces forever trapped behind their own pawns. Apparently it’s not a pattern that the programmers of Fritz and Stockfish have implemented in their programs. They could implement this sort of topological reasoning, but they have decided not to. It’s a lot of work to implement, and the situation occurs rarely in games, so they didn’t bother. Maybe they will in a future version of Stockfish. Or maybe some other computer program, like a future version of AlphaZero, will be able to figure this out by itself.

50 years ago computers were worse at chess than human amateurs. Some people were claiming it would be very hard or impossible for computers to beat humans. 20 years ago the same was true of go. Many people claimed that go was the kind of game that was inherently very hard for computers and it was either impossible for computers to beat humans, or that it would take centuries of AI research. Now computers are better than humans in the vast majority of chess and go situations, but we can still construct some exceptional positions that computers analyze worse than humans. Penrose is repeating the same mistake as chess skeptics made 50 years ago when he suggests that these positions are inherently hard for any possible computer chess program.

The point that Penrose is making is that these computer chess programs do not exhibit the kind of general pattern-recognition and intelligence that humans possess. That is true. Of course they don’t. Yet. Nobody claims that Fritz or Stockfish has achieved general AI. They haven’t. This doesn’t demonstrate that general AI can’t ever be achieved.

The argument from Gödel’s first incompleteness theorem

At 8:27, Penrose switches to his main argument from mathematical logic. He says “this is the key to what I want to say”.

In the past many people have criticized this argument, and Penrose has responded with various variants of the argument, some of them more complicated than others. All of them are faulty in various ways. In trying to fix one problem somebody has pointed out, he introduces other problems.

Nevertheless, in this talk, he returned to the most basic version of the argument, which is great, because it is also the one that makes it easiest to see where the mistake lies.

Here is the argument as presented in his slide:

Turing’s version of Gödel’s theorem tells us that, for any set of mechanical theorem-proving rules R, we can construct a mathematical statement G(R) which, if we believe in the validity of R, we must accept as true; yet G(R) cannot be proved using R alone.

He then goes on to say that this shows we humans can do something the theorem-proving machine cannot do.

Something should immediately seem fishy about this. We listen to a quick one-slide argument and that already shows we’re smarter than any future AI? That’s not even the height of our potential as humans! Surely a robot could grasp the gist of the argument he’s making! But let’s not be so quick to dismiss.

First a couple quick comments.

When he says “Gödel’s theorem”, he refers to Gödel’s first incompleteness theorem. There are a few other famous and relevant theorems of Gödel: Gödel’s completeness theorem and Gödel’s second incompleteness theorem, but it’s clear he’s talking about the first one.

I am a bit confused about “Turing’s version of” Gödel’s theorem. I don’t know what that is. There is a proof of Gödel’s incompleteness theorem that uses Turing machines, but then it’s still the same theorem. There is also a different argument that Penrose has used in the past, based on the Halting Problem, which Turing proved to be non-computable, rather than on Gödel’s incompleteness theorem. But the Halting Problem doesn’t refer to mathematical proofs in a formal system of logic. So I’m just going to assume we’re talking about the actual Gödel’s first incompleteness theorem, which is consistent with the rest of the slide and what Penrose says about it.

When talking about subtle theorems in mathematical logic, one has to be very careful and precise. It’s easy to fall into contradictions and paradoxes if one is not careful.

To make the concepts of “theorem proving” and “mathematical statements” precise, let’s be specific here. There is no 100% unanimous agreement as to what the foundations of mathematics really are, but the most standard approach that most mathematicians are fine with is to take axioms of set theory and rules of first order logic as the foundation of mathematics. Specifically, most mathematicians tend to assume Zermelo-Fraenkel set theory with the axiom of choice, abbreviated as ZFC, as the foundations of mathematics. Let’s go with that. I have certainly never seen a mathematical proof that couldn’t be formalized in ZFC. So I think it’s a pretty good definition of what we mean today when we say “a mathematical proof”.

So let’s take R = ZFC in Penrose’s slide. The theorem-proving rules are certainly mechanical. We can indeed write an algorithm that will enumerate all proofs that follow from the ZFC axioms. Gödel’s incompleteness theorem then implies that there is a certain statement, G(ZFC), that has no mathematical proof if ZFC is consistent. It also implies that if ZFC is consistent then G(ZFC) is true.

Seems almost like a contradiction, which is why we have to be very careful and precise.

When Penrose’s slide says “if we believe in the validity of R”, what he really means is that R (i.e. ZFC) is consistent (or a stronger property such as arithmetic soundness, which in turn implies consistency). We write that as Cons(ZFC). Statement “X is provable in ZFC” is usually written as $\text{ZFC} \vdash \text{X}$ . So what we get from Gödel’s theorem are the following two statements:

\text{Cons}(\text{ZFC}) \Rightarrow (\text{ZFC} \not\vdash \text{G}(\text{ZFC})) \\ \text{Cons}(\text{ZFC}) \Rightarrow \text{G}(\text{ZFC})

The first statement says that if ZFC is consistent, then there is no proof of G(ZFC). The second statement says that if ZFC is consistent, then G(ZFC) is true.

Penrose’s argument starts with: we believe ZFC is consistent, otherwise it would make no sense to use this as a foundation of mathematics. Let’s grant that this is something we believe in.

As a side note: the history of set theory is that we first had a set of axioms, so called “naive set theory”, that later turned out to be inconsistent. ZFC is our latest iteration at formalizing mathematics. We hope that this time there are no more inconsistencies. We have been using it for a while and nobody has found any contradictions. If there is a contradiction lurking, a lot of time invested into developing set theory will have been wasted. So hopefully it is indeed consistent this time.

But it’s not really 100% certain, we don’t have a proof of this. In fact, Gödel’s second incompleteness theorem says we can’t have a proof of this if it’s true, at least as long as ZFC remains our foundation of mathematics.

But let’s grant that we believe this to be true. So the argument goes: given that we believe Cons(ZFC) to be true, the first statement says that the theorem-proving machine can never prove G(ZFC). The second statement says that G(ZFC) is true.

And that’s it. We know something the machine doesn’t!

But wait a second. We asked the machine to prove things to us mathematically! Did we prove mathematically that G(ZFC) is true? No we didn’t! There is a big difference between the following two statements:

\text{Cons}(\text{ZFC}) \Rightarrow \text{G}(\text{ZFC}) \\ \text{G}(\text{ZFC})

We proved the first one. We didn’t prove the second one! We may believe that the second one is true based on our belief in Cons(ZFC). A belief is very different from a mathematical proof however!

The theorem-proving machine cannot produce a mathematical proof of G(ZFC), and neither can humans.

The theorem-proving machine can produce a mathematical proof of the first statement (Gödel’s theorem), and so can humans. That’s what Gödel did. The machine produces all possible proofs, so it will also produce the proof that Gödel found.

So in the end this doesn’t demonstrate any difference between the machine’s and humans’ abilities to produce mathematical proofs.

What about beliefs? We may believe G(ZFC). The theorem-proving machine will never state a belief in G(ZFC). Does it mean we are smarter than machines?

No, the machine doesn’t state a belief in G(ZFC) because we’re not asking the machine to tell us about its beliefs. It’s a theorem-proving machine, not a theorem-believing machine, by assumption. We are asking it to produce proofs. The assumption of the whole argument was that the machine produces formal proofs of mathematical statements, not that it tells us its beliefs without proof.

Stating a belief in G(ZFC) is not non-computable. I can easily write a Python program that can do the same trick that Penrose does and state its belief in G(ZFC):

print('I am 100% convinced that G(ZFC) is true.')

The fact that Roger Penrose states this belief without proof is thus not evidence of his brain’s non-computable abilities.

Plane tilings

At 14:45, having finished the argument from Gödel’s theorem, the talk switches to the example of tiling the plane with polyominoes.

The question of whether a given set of polyominoes can be used to tile the plane is a non-computable problem. There is no computer algorithm that, given a set of polyominoes, decides whether it is possible to fully tile the plane using them.

Penrose then presents a set of 3 polyominoes that can be used to tile the whole plane, but in a non-trivial way. It’s an example of a so-called “Penrose tiling”, Penrose having studied this kind of tiling extensively.

The suggestion here seems to be this is an example of something Penrose can do which computers cannot do. But again this argument doesn’t work!

What the cited theorem implies is that there are some sets of polyominoes that computers can’t figure out. It doesn’t mean that this particular 3-polyomino set is one of them. There is also no reason to think that Penrose could figure this out for every set of polyominoes, demonstrating his non-computable ability.

As Penrose says later in the talk, this 3-polyomino set tiling actually works according to a regular pattern. It’s just that it’s not the most trivial, translationally-symmetric pattern. It is possible to write a computer program to search through such patterns and eventually find this one that works.

He also states that the brute force algorithm would be really slow for this when trying to cover even a finite area with these tiles. OK, but the brute force algorithm is not the only possible algorithm! A better computer program could do what Penrose did, look for patterns, and find the one that works.

Goodstein’s theorem

At 18:30 he switches to Goodstein’s theorem. It is a theorem about natural numbers. It has been shown that it doesn’t follow from Peano’s Arithmetic axioms (which he calls “ordinary induction”), but it does follow from set theory (ZFC) axioms.

It is an interesting fact, but honestly I don’t understand the relevance of this. Neither humans nor computers can prove it from Peano’s Arithmetic axioms since it simply doesn’t follow from them. Both humans and computers can prove the theorem from ZFC axioms of set theory, since it does follow from them.
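For readers who have not seen it: Goodstein’s theorem says that every Goodstein sequence eventually reaches 0. The Python sketch below, with illustrative function names, generates the first terms of such a sequence. To get the next term, write the current one in hereditary base-b notation, replace every b with b + 1, subtract 1, and move on to the next base.

```python
def bump_base(m, base):
    """Write m in hereditary base-`base` notation, then replace `base` by `base + 1`."""
    result = 0
    exponent = 0
    while m > 0:
        m, digit = divmod(m, base)
        if digit:
            # Exponents are rewritten recursively, which is what "hereditary" means.
            result += digit * (base + 1) ** bump_base(exponent, base)
        exponent += 1
    return result


def goodstein(m, max_terms):
    """First terms of the Goodstein sequence starting at m (stops early at 0)."""
    terms = [m]
    base = 2
    while m > 0 and len(terms) < max_terms:
        m = bump_base(m, base) - 1
        base += 1
        terms.append(m)
    return terms


if __name__ == "__main__":
    print(goodstein(3, 10))
    print(goodstein(4, 6))
```

Starting from 3 the sequence hits 0 after a handful of steps. Starting from 4 it also reaches 0 eventually, but only after an astronomically large number of steps, which is why the program takes a cap on the number of terms.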

Conclusion

General human-level AI is in our future. None of Penrose’s arguments do anything to convince me that humans have abilities that computers can’t have. Recent advances in deep learning continuously shrink the set of abilities that are unique to humans, and the trend will continue.

Zero to the power of zero

2020-11-11T00:00:00-08:00

The controversy

The value of $0^0$ is controversial.

Wikipedia says:

Zero to the power of zero, denoted by 0⁰, is a mathematical expression with no agreed-upon value. The most common possibilities are 1 or leaving the expression undefined, with justifications existing for each, depending on context.

Mathematica and WolframAlpha refuse to compute the value.

Some textbooks on mathematical analysis, when defining exponentiation, explicitly leave $0^0$ undefined as an exception.

On the other hand, the IEEE 754 floating point standard specifies that $0.0^{0.0} = 1.0$ and as a result, most programming languages implement it that way.

Spoiler: I will argue that the value of 1 is clearly “correct”. Of course it’s a matter of definition, one can in theory define the operation to do anything, but it is “correct” in the sense that it is the only sensible value, consistent with all applications, and moreover, it is a very important value. It is also implicitly assumed to be 1 in various formulas even by those people who insist it should not be 1.

I will also show that the argument against defining $0^0$ essentially relies on a mistake: an incorrect algorithm for computing limits that is unfortunately often taught in schools. Refusal to define $0^0$ is a futile attempt to salvage the correctness of the algorithm, which however does not actually solve the problem in general.

Definition

The simplest way to resolve the issue seems to be to start with a definition, plug in zeroes, and see what we get.

A semigroup is a set of objects with an associative multiplication operation. This is a very general concept: it can be natural numbers, real numbers, square matrices, linear operators, all kinds of things.

In any semigroup we can define exponentiation to any positive integral power:

\begin{aligned} x^1 &= x \\ x^{n+1} &= x^n \cdot x \end{aligned}

Often a semigroup has an identity element $I$ such that $I\cdot x = x\cdot I = x$ . For numbers, it’s just the number 1. For matrices, it’s the identity matrix. Such a semigroup is called a monoid.

In a monoid we expand and simplify the definition of exponentiation to include the 0 exponent:

\begin{aligned} x^0 &= I \\ x^{n+1} &= x^n \cdot x \end{aligned}

Well, now just plug in $x=0$ into that definition and what do you get: $0^0 = 1$ .
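The same definition can be written down almost verbatim as code. This is a minimal sketch in Haskell, assuming a non-negative exponent; `power` and `zeroToTheZero` are names invented for this illustration, while `Monoid`, `mempty`, `<>` and `Product` come from the standard library.

```haskell
import Data.Monoid (Product (..))

-- Exponentiation in an arbitrary monoid, following the two-line definition:
-- x^0 is the identity element, and x^(n+1) = x^n * x. Assumes n >= 0.
power :: Monoid m => m -> Int -> m
power _ 0 = mempty
power x n = power x (n - 1) <> x

-- Integers under multiplication form a monoid whose identity is 1,
-- so plugging in x = 0 and n = 0 gives the identity, i.e. 1.
zeroToTheZero :: Integer
zeroToTheZero = getProduct (power (Product 0) 0)
```

Nothing in `power` treats a zero base specially: the first equation fires for every `x`. The same function works for other monoids too, for example `power "ab" 2` concatenates the string with itself and `power "ab" 0` is the empty string.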

This definition can then be extended to negative exponents, rational exponents, even irrational exponents. But since we’re only concerned with $0^0$ here, we’re not going to go further.

The 0ⁿ function

Since $0^n = 0$ for all $n > 0$, one might think that the most natural thing to expect is that it would also be true for $n = 0$. However, we see from the definition that it is not so:

0^n = \begin{cases} 1 &\text{if } n = 0 \\ 0 &\text{if } n > 0 \end{cases} = [n=0]

We have here used the Iverson bracket notation.

This seems like a strangely complicated formula for $0^n$ , but we will see that it is in fact a very nice and useful function.

Combinatorics

What is the number of sequences of n letters, selected from an alphabet of size A? It’s $A^n$ .

What if the alphabet is empty, $A=0$ ? Then the number of sequences is:

0^n = [n = 0] = \begin{cases} 1 &\text{if } n = 0 \\ 0 &\text{if } n > 0 \end{cases}

Does this make sense? Yes! If $n > 0$ , we can’t form a sequence, because we will get stuck when trying to write the first letter. But when $n=0$ , there is no problem! We don’t have to write any letters, so it’s fine if the alphabet is empty. There is exactly one way to do it: write an empty sequence of letters.
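The counting argument can be checked by brute force. A minimal sketch in Haskell, with `sequencesOver` and `countSequences` as hypothetical helper names; `replicateM` is the standard function from `Control.Monad`, used here in the list monad to enumerate all sequences.

```haskell
import Control.Monad (replicateM)

-- All sequences of n letters drawn from the given alphabet.
sequencesOver :: [a] -> Int -> [[a]]
sequencesOver alphabet n = replicateM n alphabet

-- Number of such sequences; this should equal A^n for an alphabet of size A.
countSequences :: [a] -> Int -> Int
countSequences alphabet n = length (sequencesOver alphabet n)
```

With an empty alphabet, `countSequences "" 0` counts the single empty sequence, while `countSequences "" 2` finds nothing to count.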

The exp function

The exponential function has the following basic property, often taken to define exp in the first place:

\exp x = \sum_{n=0}^{\infty} \frac{x^n}{n!}

Let’s plug in $x=0$ :

\exp 0 = \sum_{n=0}^{\infty} \frac{0^n}{n!} = \sum_{n=0}^{\infty} \frac{[n=0]}{n!} = \frac{1}{0!} = 1

The $0^n$ function played an essential role in this calculation.
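A direct transcription of the series into code relies on the same fact. This is a sketch only, assuming a truncated sum is good enough for illustration; `expSeries` is a made-up name and not a replacement for the library function `exp`.

```haskell
-- Partial sum of the exponential series using the given number of terms.
-- The n = 0 term is x^0 / 0!, so at x = 0 the code depends on 0^0 = 1.
expSeries :: Int -> Double -> Double
expSeries terms x =
  sum [ x ^ n / fromIntegral (factorial n) | n <- [0 .. terms - 1] ]
  where
    factorial :: Int -> Integer
    factorial k = product [1 .. toInteger k]
```

If `0 ^ 0` were anything other than 1, `expSeries 20 0` would not return 1, and the series would need a special case for its first term.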

The binomial distribution

The binomial distribution is a probability distribution of the number of successes in n independent trials, each successful with probability p. The formula is:

\Pr(X = k) = \binom{n}{k}p^k(1-p)^{n-k}

What if $p=0$ ?

\begin{aligned} \Pr(X = k) &= \binom{n}{k}0^k1^{n-k} = \binom{n}{k}[k=0] = [k=0] \\ &= \begin{cases} 1 &\text{if } k = 0 \\ 0 &\text{if } k > 0 \end{cases} \end{aligned}

This makes sense! $k=0$ successes is certain, any other outcome is impossible.

Even-cardinality subsets

Given a set of n elements, how many more even-cardinality subsets are there than odd-cardinality subsets?

We can calculate it like this:

\begin{aligned} \sum_{k=0}^n \binom{n}{k}(-1)^k &= \sum_{k=0}^n \binom{n}{k}(-1)^k1^{n-k} = (-1 + 1)^n = 0^n \\ &= \begin{cases} 1 &\text{if } n = 0 \\ 0 &\text{if } n > 0 \end{cases} \end{aligned}

And indeed, for $n=0$ we have 1 even-cardinality subset (the empty set), and no odd-cardinality subsets, while for $n>0$ there are as many even as odd cardinality subsets.
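For small n the identity can be confirmed by enumerating every subset. A minimal sketch in Haskell; `evenMinusOdd` is a hypothetical name, and `subsequences` from `Data.List` lists all subsets of a list.

```haskell
import Data.List (subsequences)

-- Number of even-cardinality subsets minus number of odd-cardinality subsets
-- of an n-element set, counted by brute force over all 2^n subsets.
evenMinusOdd :: Int -> Int
evenMinusOdd n = sum [ sign (length s) | s <- subsequences [1 .. n] ]
  where
    sign k = if even k then 1 else -1
```

Comparing `evenMinusOdd n` with `0 ^ n` for n from 0 to 10 gives the same value each time, including at n = 0.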

Möbius function

The Möbius function $\mu(n)$ is a useful multiplicative function in number theory.

One important property of it concerns sums over divisors of a positive integer n:

S(n) = \sum_{d|n} \mu(d)

It can be shown that since $\mu$ is multiplicative, S is also multiplicative.

Also for prime $p$ and $\alpha > 0$ :

S(p^\alpha) = \mu(1) + \mu(p) + \mu(p^2) + \ldots + \mu(p^\alpha) = 1 - 1 + 0 + \ldots + 0 = 0

Let’s factor n into prime numbers:

n = \prod_{i=1}^{k} p_i^{\alpha_i}

and then we have:

\begin{aligned} S(n) &= S\left(\prod_{i=1}^{k} p_i^{\alpha_i}\right) = \prod_{i=1}^k S(p_i^{\alpha_i}) = \prod_{i=1}^k 0 = 0^k = [k=0] = [n=1] \\ &= \begin{cases} 1 &\text{if } n = 1 \\ 0 &\text{if } n > 1 \end{cases} \end{aligned}
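The result is easy to test numerically. The sketch below is in Haskell and assumes small positive inputs; `mobius` uses plain trial division and `divisorSum` is a naive loop, both written for this example rather than taken from a library.

```haskell
-- Moebius function by trial division (adequate for small n >= 1).
mobius :: Int -> Int
mobius = go 2
  where
    go p m
      | m == 1 = 1
      | p * p > m = -1                    -- m is a single remaining prime
      | m `mod` (p * p) == 0 = 0          -- a squared prime factor
      | m `mod` p == 0 = negate (go (p + 1) (m `div` p))
      | otherwise = go (p + 1) m

-- S(n): the sum of the Moebius function over all divisors of n.
divisorSum :: Int -> Int
divisorSum n = sum [ mobius d | d <- [1 .. n], n `mod` d == 0 ]
```

Checking `divisorSum n` for n from 1 to a few hundred gives 1 at n = 1 and 0 everywhere else, matching $[n=1]$.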

Fractional exponents

What about the $0^x$ function for real (rather than natural) exponents $x \ge 0$ ?

Some people argue that while the case for $0^n=[n=0]$ is convincing, the case for $0^x=[x=0]$ is less convincing, and $0^0$ should only be defined for the integral exponent 0, and left undefined for the real exponent 0.0.

I have three ways to answer that.

Natural numbers are real numbers

A ubiquitous convention in mathematics is that natural numbers are a subset of integer numbers, which in turn are a subset of rational numbers, which are a subset of real numbers.

\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R}

This lets us mix and match integers with rational numbers and irrational numbers in expressions without having to worry about converting between these types.

If so, it makes no sense to say that $0^0=1$ but $0^{0.0}$ is undefined, because the natural number 0 is the same number as the real number 0.0.

One reason to doubt this is how numbers are constructed from sets in set theory. Natural numbers are constructed first. Then integers are constructed as equivalence classes of pairs of natural numbers. Similarly rational numbers are then constructed as equivalence classes of pairs of integers. Finally real numbers are constructed from rational numbers using Dedekind cuts or Cauchy sequences.

If we literally follow such a construction, then indeed the natural number 0, the integer 0, the rational number 0, and the real number 0 will be four different objects. However, there is an easy fix. When constructing integers as certain equivalence classes of pairs of natural numbers, we can simply replace the non-negative integers with the actual natural numbers. Similarly, we can replace the “integral rationals” with actual integers, and “rational reals” with the actual rationals. After we do that, the 0 number is the same object belonging to all four sets.

Consistency is good

Even if one were to treat integers as disjoint from reals, it would be nice to know that if the notation $a^b$ means something for integers a and b, then it also means the equivalent thing for the real equivalents of a and b. Technically what it means is that it would be nice if the integer-to-real mapping was a homomorphism for the $a^b$ operation.

Otherwise, if the notation changed meaning between the “integer context” and “real context”, we would have to be extremely careful about which context we are in! And it wouldn’t be clear from notation such as $x^0$ . It would be a mess. We don’t want notation to be ambiguous.

0^x is sometimes useful for fractional exponents

What is the (right-sided) derivative of $x^p$ at $x = 0$ for $p\ge 1$ ? Let’s calculate:

\begin{aligned} \left.{\frac{d}{dx}x^p}\right\vert_{x=0} &= \left.{p x^{p-1}}\right\vert_{x=0} = p\cdot 0^{p-1} = p[p=1] = [p=1] \\ &= \begin{cases} 1 &\text{if } p = 1 \\ 0 &\text{if } p > 1 \end{cases} \end{aligned}

And indeed this is correct! The derivative at $x = 0$ is 1 for $p = 1$ , and 0 for $p > 1$ . The derivative at 0 discontinuously “jumps” from 1 to 0 as soon as we increase the exponent p even slightly above 1.

The naive limit algorithm

Given all these nice uses of $0^0 = 1$ , why do some people resist defining it like this?

The only reason I have seen has to do with what I call the “naive limit algorithm”.

Suppose we want to calculate this limit:

\lim_{n\to\infty} \left(n^2 3^{-n}\right)^{1/n}

The argument goes that somebody could calculate it like this:

\lim_{n\to\infty} n^2 3^{-n} = 0 \\ \lim_{n\to\infty} \frac{1}{n} = 0 \\ \lim_{n\to\infty} \left(n^2 3^{-n}\right)^{1/n} = 0^0 = 1

Which would give an incorrect answer. The correct answer is 1/3.
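To see where 1/3 comes from, one can rewrite the expression so that only continuous operations are applied to convergent pieces:

\begin{aligned} \left(n^2 3^{-n}\right)^{1/n} &= n^{2/n} \cdot 3^{-1} = \frac{1}{3}\exp\left(\frac{2\ln n}{n}\right) \\ \lim_{n\to\infty} \frac{2\ln n}{n} &= 0 \quad\Rightarrow\quad \lim_{n\to\infty} \left(n^2 3^{-n}\right)^{1/n} = \frac{1}{3}\exp(0) = \frac{1}{3} \end{aligned}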

However, the mistake is not in the step $0^0 = 1$ . The mistake already happened in the previous step, where we simplified the limit to $0^0$ .

A common (incorrect) way of thinking about this is: we allow calculating limits separately for sub-expressions only if the resulting expression makes sense. If it does not make sense, then doing that is not allowed. If only we declare that $0^0$ is not a valid expression, the reduction to $0^0$ will not be allowed, so it solves the problem. If however we do define $0^0$ to mean something, the reduction would be allowed.

That’s what I call the “naive limit algorithm”. It doesn’t work.

Let’s apply the same algorithm to a different limit:

\lim_{n\to\infty} 10 + \frac{1}{n} = 10 \\ \lim_{n\to\infty} \left\lceil{10 + \frac{1}{n}}\right\rceil = \lceil 10 \rceil = 10

There is an error here. The correct value of the last limit is not 10, it is 11. But this time we can’t fix it the same way: we can’t say “let’s just leave $\lceil 10 \rceil$ undefined”. Everybody agrees that is a valid expression and has to be defined!

The naive limit algorithm simply doesn’t always work.
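The ceiling example is simple enough to evaluate directly. A small Haskell sketch, with `ceilingTerms` and `naiveAnswer` as names chosen for this illustration and a handful of arbitrarily picked values of n:

```haskell
-- Terms of the sequence ceil(10 + 1/n) for a few values of n.
ceilingTerms :: [Integer]
ceilingTerms =
  [ ceiling (10 + 1 / fromIntegral n :: Double) | n <- [1 :: Integer, 10, 100, 1000000] ]

-- What the naive limit algorithm computes: the ceiling of the inner limit.
naiveAnswer :: Integer
naiveAnswer = ceiling (10 :: Double)
```

Every term in `ceilingTerms` is 11, because 10 + 1/n is strictly greater than 10 for every n, while `naiveAnswer` is 10.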

In general, the algorithm can be described as follows. If:

\lim_{n\to\infty} a_n = a \\ \lim_{n\to\infty} b_n = b \\ \lim_{n\to\infty} c_n = c \\ \ldots

and $f(a, b, c, \ldots)$ is a valid expression, then:

\lim_{n\to\infty} f(a_n, b_n, c_n, \ldots) = f(a, b, c, \ldots)

Is this true? It’s not always true! What we wrote here is precisely the definition of continuity of $f$ at the point $(a, b, c, \ldots)$ . Some functions are not continuous!

Therefore the appropriate condition shouldn’t have been “ $f(a, b, c, \ldots)$ is a valid expression”, it should have been “ $f$ is continuous at $(a, b, c, \ldots)$ ”.

Well, $x^y$ is simply not continuous at $(0, 0)$ . As we saw, even $0^x$ is not continuous at 0. It’s inherently so, it reflects deep mathematical reality.

Refusing to define the operation there doesn’t really help the situation at all. If we don’t define it at $(0, 0)$ , it’s still not going to be continuous there, it would not even be defined there, which is worse! We can’t use the naive limit algorithm at that point either way.

Conclusion

I think we should just all agree that:

0^0 = 1

It follows directly from definitions, and it’s a nice and consistent and useful property of exponentiation. There is no convincing reason to make an exception.

Let me know what you think!

How to pick a hash function, part 2

2020-06-28T00:00:00-07:00

In part 1 we discussed universal hashing and introduced a classic family of universal hash functions, the Carter-Wegman hash function for integers and its generalizations for bigger data structures.

While that works fine, there is a simpler way that is equally good. We don’t need to deal with prime numbers and modulo, we can make do with just multiplications, additions and bit shifts, and get an equally good universal hash function family!

Hashing integers

We want a function that hashes a w-bit integer into m bits ( $m \le w$ ).

Philipp Wölfel¹ defined the following function in his Ph.D. thesis:

Pick two random w-bit integers a and b, with odd a. Then use:

h_1(x) = (ax + b) \bmod 2^w \text{ div } 2^{w-m}

In other words: compute $ax+b$ , ignore overflow, and take the top m bits. Can’t get much simpler than this!

Incredibly, this is a universal hash function family with optimal collision rate. For any pair of different x and y, the probability of collision is $\Pr(h_1(x) = h_1(y)) \le 2^{-m}$.

In C:

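#include <stdint.h>
static uint32_t a, b;  /* random, chosen once; a must be odd */
/* w = 32 here; requires 1 <= m <= 32 (a shift by 32 bits is undefined in C) */
uint32_t hash(uint32_t x, int m) {
  return (a * x + b) >> (32 - m);
}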

Hashing bigger data

Now suppose we have a bigger data structure consisting of n w-bit words, $(x_1, \ldots, x_n)$ .

Pick random 2w-bit numbers $a_1, \ldots, a_n, b$ and use:

h_2(x_1,\ldots,x_n) = (a_1x_1 + \ldots + a_nx_n + b) \bmod 2^{2w} \text{ div } 2^{2w-m}

There is an extra optimization possible: we can replace some multiplications with additions by taking the input numbers in pairs. Suppose n is even.

h_3(x_1,\ldots,x_n) = ((x_1 + a_2)(x_2 + a_1) + \ldots + (x_{n-1} + a_n)(x_n + a_{n-1}) + b) \bmod 2^{2w} \text{ div } 2^{2w-m}

Both of these hash functions are universal and guarantee collision probability of at most $2^{-m}$ for any given two inputs.
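As an illustration, here is how the two formulas might look in Haskell for w = 32, where `Word64` arithmetic wraps modulo $2^{64}$ on its own. This is a sketch under assumptions: the parameter order, the list representation and the names `h2` and `h3` are mine, the multipliers `ks` and `b` are expected to be chosen at random by the caller, and m is assumed to satisfy 1 ≤ m ≤ 32.

```haskell
import Data.Bits (shiftR)
import Data.Word (Word32, Word64)

-- h2 for w = 32: one 64-bit multiplier per input word, top m bits of the sum.
-- Assumes at least as many multipliers as input words.
h2 :: [Word64] -> Word64 -> Int -> [Word32] -> Word64
h2 ks b m xs = (sum (zipWith (*) ks (map fromIntegral xs)) + b) `shiftR` (64 - m)

-- h3: same parameters, but inputs are consumed in pairs, which needs
-- one multiplication per two input words. Assumes an even number of inputs.
h3 :: [Word64] -> Word64 -> Int -> [Word32] -> Word64
h3 ks b m xs = (go ks (map fromIntegral xs) + b) `shiftR` (64 - m)
  where
    go (a1 : a2 : ks') (x1 : x2 : xs') = (x1 + a2) * (x2 + a1) + go ks' xs'
    go _ _ = 0
```

The pairing in `go` follows the formula for $h_3$ term by term: the first input is offset by the second multiplier and the second input by the first.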

Proof for h₁

Wölfel¹ has a rather complicated proof in his thesis of the collision probability, but we can see this in a simpler way.

First we need a little lemma:

Lemma: If r is an odd number, then $f(x) = rx \bmod 2^w$ is a 1-1 correspondence between w-bit numbers. It also maps odd numbers to odd numbers.

Proof: If $rx \equiv ry \pmod {2^w}$ , then $2^w | r(x-y)$ and, since r is odd, $2^w | (x-y)$ , and so $x \equiv y \pmod{2^w}$ . Thus f is a 1-1 function. Also it clearly maps odd numbers to odd numbers. QED.

Take two different numbers x and y. We want to bound hash collision probability for these two inputs. Let k be the smallest bit position in which x and y differ. Therefore: $x-y = r2^k$ , where r is an odd number.

Let $g(x) = (ax+b) \bmod 2^w$ . The hash function $h_1(x)$ is the top m bits of $g(x)$ . Also define D as follows:

D \equiv g(x) - g(y) \equiv (ax + b) - (ay+b) \equiv a(x-y) \equiv ar2^k \pmod{2^w}

By the lemma, since a is random odd and r is odd, D is a random odd integer shifted left by k bits.

Also since b does not appear in this formula for D, the random variable $g(y) = (ay+b)\bmod 2^w$ is independent of D. This is why we needed this +b term in the hash function.

Let’s split D into the top m bits and the rest: $D = H 2^{w-m} + L$ .

g(x) \equiv g(y) + D \equiv g(y) + H 2^{w-m} + L

If $k \ge w-m$ , then $H \neq 0$ and L = 0. Hashes differ by H and are therefore always different with 100% certainty in this case.

If $k \lt w-m$ then H is a uniformly random m-bit random variable independent of $g(y) + L$ . Exactly one value of H makes the hashes equal, so collision probability is exactly $2^{-m}$ .
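For a toy word size the bound can also be verified exhaustively, by trying every odd a and every b for every pair of inputs. A brute-force sketch in Haskell; `h1`, `collisionRate` and `worstRate` are names made up for this check, and it is only practical for very small w.

```haskell
import Data.Bits (shiftR, (.&.))

-- h1 for a toy word size of w bits, computed in ordinary Int arithmetic.
h1 :: Int -> Int -> Int -> Int -> Int -> Int
h1 w m a b x = ((a * x + b) .&. (2 ^ w - 1)) `shiftR` (w - m)

-- Fraction of (a, b) choices, with a odd, under which x and y collide.
collisionRate :: Int -> Int -> Int -> Int -> Double
collisionRate w m x y = fromIntegral hits / fromIntegral (length choices)
  where
    choices = [ (a, b) | a <- [1, 3 .. 2 ^ w - 1], b <- [0 .. 2 ^ w - 1] ]
    hits = length [ () | (a, b) <- choices, h1 w m a b x == h1 w m a b y ]

-- Largest collision rate over all pairs of distinct w-bit inputs.
worstRate :: Int -> Int -> Double
worstRate w m =
  maximum [ collisionRate w m x y | x <- [0 .. 2 ^ w - 1], y <- [x + 1 .. 2 ^ w - 1] ]
```

For example, `worstRate 5 2` should not exceed $2^{-2} = 0.25$, in line with the case analysis in the proof.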

Proof for h₂

Suppose we have two different inputs: $(x_1,\ldots,x_n)$ , $(y_1,\ldots,y_n)$ .

Since they are different at some position, we may as well assume without loss of generality that $x_1\ne y_1$. Again let k be the first bit where they differ: $x_1-y_1 = r2^k$.

Define $g(x) = (a_1x_1 + \ldots + a_nx_n + b)\bmod 2^{2w}$ . The hash function $h_2(x)$ is the top m bits of g(x).

g(x) - g(y) \equiv a_1(x_1-y_1) + (a_2(x_2-y_2) + \ldots + a_n(x_n-y_n)) \equiv a_1r2^k + E \equiv D + E \pmod{2^{2w}}

By the lemma, D is a random integer shifted left by k bits. It is also independent of E (because E doesn’t depend on $a_1$ ) and of g(y) (because g(y) has the independent +b term).

Split D into the top m bits and the rest: $D = H2^{2w-m} + L$ . We know that $k \lt w \le {2w-m}$ , and therefore H is a uniformly random number, independent of E, g(y) and L.

g(x) \equiv g(y) + D + E \equiv g(y) + H 2^{2w-m} + L + E

There is exactly one value of H that will make the top m bits of g(x) and g(y) match, therefore collision probability is always exactly $2^{-m}$ .

Proof for h₃

The proof is very similar to the previous case.

g(x) - g(y) \equiv (x_1+a_2)(x_2+a_1) - (y_1+a_2)(y_2+a_1) + (\ldots) \\ \equiv a_1(x_1-y_1) + a_2(x_2-y_2) + (x_1x_2 - y_1y_2) + (\ldots) \equiv a_1r2^k + E

And the same proof works. We just incorporated the extra (non-random) term $(x_1x_2 - y_1y_2)$ into E, which doesn’t change anything that follows.

References

Wölfel, Philipp. Über die Komplexität der Multiplikation in eingeschränkten Branchingprogrammmodellen. Diss. Universität Dortmund, 2004.

Faster than radix sort: Kirkpatrick-Reisch sorting

2020-06-06T00:00:00-07:00

Radix sort sorts n w-bit integers by splitting them up into chunks of $\log n$ bits each, and sorting each chunk in linear time. Thus it achieves $O(nw/\log n)$ time.

In 1983 Kirkpatrick and Reisch¹ published an algorithm that improves on this. It achieves time that has an exponentially smaller factor next to n:

O\left(n + n \log\frac{w}{\log n}\right)

As originally published, the algorithm is deterministic, at the cost of using a huge $\left(\Theta(2^{w/2})\right)$ amount of memory. Instead, it is more practical to combine the idea with universal hashing to get a randomized algorithm with that expected time, and linear space.

Step 1. Build depth-2 trie.

Suppose we want to sort this list of 10 numbers:

We split each number into the top half of bits and bottom half (in our case we will take 4 decimal digits each). Then we add the number to the trie as a length-2 path: at the first level we have the more significant bits, in the leaves we have the least significant bits.

Note that when building the trie we have to be able to look up nodes at the first level by value, to avoid duplicating them. This is where hashing (and hence randomization) comes in. In the leaves duplicates are OK.

Step 2. Find the smallest leaf in each subtree.

We find the minimum leaf and make it the first child in each subtree.

Step 3. Sort remaining nodes.

We take all the nodes other than root and the minimum leaves, and sort them recursively.

There are n leaves. We added some number of level-1 nodes, but skip the same number of minimum leaves. Thus the recursive sort also sorts n numbers, with half as many bits each.

Nodes that we need to sort (in breadth-first order):

After sorting:

Step 4. Sort children edges.

Using the sorted order of nodes computed in the previous step we can reorder all the edges so that they are in sorted order.

We simply detach all the children (except the minimum leaves), and then walk all the nodes in sorted order and re-attach them to their original parent.

Step 5. Walk the sorted trie.

We now simply walk the trie left-to-right and re-combine high bits with low bits to get the final sorted answer:

Time complexity

Steps 1, 2, 4, 5 take linear time. Step 3 requires recursive sorting of numbers that are half as long.

If we let $T(n, b)$ be the time to sort n b-bit numbers, we get the recurrence:

T(n, b) = T(n, \lceil b/2 \rceil) + O(n)

We stop the recursion once the numbers have at most $\log n$ bits. At that point, we can just sort in linear time by counting each value. Thus:

T(n, \lfloor \log n \rfloor) = O(n)

We start with w bits each, and want to get to $\log n$ bits each. Each level of recursion halves the number of bits, so we need $\log (w / \log n)$ levels.

Thus the total time complexity is:

O\left(n + n \log\frac{w}{\log n}\right)
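The five steps can be put together in a short sketch. This is a Haskell illustration under loud assumptions, not the published algorithm as such: `Data.Map` (a balanced tree from the `containers` package) stands in for the hash table, and `sortOn` stands in for the counting sort in the base case, so the sketch does not achieve the stated time bound. The names `krSort`, `splitMin` and `sortKeys` and the cutoff of 8 bits are choices made for this example. Keys are assumed to lie in the range $[0, 2^{bits})$.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.List (sortOn)
import qualified Data.Map.Strict as M

-- Sort records by a key of the given bit width; the payload rides along.
krSort :: Int -> [(Integer, a)] -> [(Integer, a)]
krSort bits recs
  | bits <= 8 = sortOn fst recs          -- base case: stand-in for counting sort
  | otherwise = concatMap emit his
  where
    half = (bits + 1) `div` 2
    mask = (1 `shiftL` half) - 1
    -- Step 1: depth-2 trie, first level keyed by the high half of each key.
    trie = M.fromListWith (++) [ (k `shiftR` half, [(k .&. mask, p)]) | (k, p) <- recs ]
    -- Step 2: pull the minimum leaf out of every subtree.
    groups = M.map splitMin trie
    -- Step 3: recursively sort level-1 nodes and non-minimum leaves together.
    nodes = [ (hi, Left hi) | hi <- M.keys groups ]
         ++ [ (lo, Right (hi, p)) | (hi, (_, rest)) <- M.toList groups, (lo, p) <- rest ]
    sorted = krSort half nodes
    -- Step 4: re-attach leaves to their parents, smallest first.
    his = [ hi | (_, Left hi) <- sorted ]
    kids = M.fromListWith (++) [ (hi, [(lo, p)]) | (lo, Right (hi, p)) <- reverse sorted ]
    -- Step 5: walk the trie and recombine high and low halves.
    emit hi = [ ((hi `shiftL` half) .|. lo, p) | (lo, p) <- least : M.findWithDefault [] hi kids ]
      where (least, _) = groups M.! hi

-- Split a non-empty list into its minimum-key element and the rest.
splitMin :: [(Integer, a)] -> ((Integer, a), [(Integer, a)])
splitMin [] = error "splitMin: empty subtree"
splitMin (c : cs) = foldl step (c, []) cs
  where
    step (m, rest) x
      | fst x < fst m = (x, m : rest)
      | otherwise = (m, x : rest)

-- Convenience wrapper for sorting bare keys.
sortKeys :: Int -> [Integer] -> [Integer]
sortKeys bits = map fst . krSort bits . map (\k -> (k, ()))
```

The recursive call in step 3 receives exactly as many records as the original input: one per level-1 node plus one per leaf that is not a subtree minimum. Its keys are half as wide, which is what the recurrence for $T(n, b)$ expresses.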

References

Kirkpatrick, David, and Stefan Reisch. “Upper bounds for sorting integers on random access machines.” Theoretical Computer Science 28.3 (1983): 263-276.

Static perfect hashing in minimal memory

2020-05-26T00:00:00-07:00

Sometimes we want to build a static dictionary data structure. This is when we have a set of data that doesn’t have to change dynamically. When it changes, we can just rebuild the dictionary from scratch. So we first build the data structure, and then perform lookups without changing it any more.

Example scenarios where we might want to do this:

Spell checker. We have a set of words that doesn’t change, and want to perform lookups to see if a word is spelled correctly.
Search engine. We periodically crawl a lot of web pages and build an index containing all the words, with links to the pages. Once the index is built, we don’t change it until we do a new crawl.

What we want is:

1. Linear build time in expectation. So for $n$ elements of size $S$ words each, we want to be able to build the data structure in $O(nS)$ expected time. If elements are of constant size, this is $O(n)$ time.
2. Guaranteed quick access time to all elements. We want $O(S)$ time per access. If the element size is constant, this is $O(1)$ time.
3. Not much memory overhead. We are aiming for memory use of $nS + o(n)$. Note that $nS$ is just the space required to store the elements. So this means that for large $n$ the fraction of extra memory required for bookkeeping is small compared to the actual data.

As usual, we are using the “word RAM” computational model. What we mean by a “word” is a piece of memory that can store a pointer or a number in the range of 0 to $n$ . So a word has at least $\log n$ bits. Note that if we have $n$ distinct elements to store in the hash table, they necessarily have at least $\log n$ bits each, or else they couldn’t be distinct.

For a while it was an open problem whether such a data structure (or even just one with properties 1 and 2) is possible at all. In 1984 Fredman, Komlós, and Szemerédi¹ described a data structure that does this.

One could try to just use a regular hash table with universal hashing. It sort of works, but it fails properties 2 and 3. We get constant access time, but only in expectation. No matter how many hash functions we try, there will almost surely be some buckets that have more than a few entries in them. In fact it can be shown that if we use a completely random hash function to store $n$ elements in $n$ buckets, then with high probability the largest bucket will have $\Theta(\log n / \log \log n)$ entries. For universal hashing it could be a lot worse. A few customers could get angry that their queries are always slow! We don’t want that. We would also have significant memory overhead to store all the buckets and pointers.

Guaranteeing worst case access time

Start by defining a family of universal hash functions such that the probability of a collision when hashing into $m$ buckets is bounded by $c/m$ for some constant $c$ . Normally we can make $c \approx 1$ .

Now let’s create $n$ buckets and randomly select a hash function $H$ that maps elements to those buckets. The expected number of collisions, i.e. the number of pairs of data elements that hash to the same bucket, is bounded by:

\sum_{0 \le i < j < n} \Pr(H(x_i) = H(x_j)) \le \binom{n}{2} \frac{c}{n} < \frac{n^2}{2} \frac{c}{n} = \frac{cn}{2}

By Markov’s inequality, the probability that the number of collisions exceeds twice this bound on the expectation, i.e. $cn$, is at most 50% (otherwise the expected value would be larger). If that happens, we just try again and pick a different $H$. It will take on average at most 2 tries to get a valid $H$.

Thus we have at most $cn$ collisions.

Let $b_i$ be the number of elements in bucket $i$ . Then the number of collisions can also be expressed as:

\sum_{i=0}^{n-1} \binom{b_i}{2} \le cn

Now comes the main trick: in each bucket let’s make another, second-level mini hash table!

We size a given secondary table based on its number of elements. Make it $\max\left(2c\binom{b_i}{2}, 1\right) = \max(cb_i(b_i-1), 1)$ (note that this is quadratic in the number of elements). Select a random hash function $h_i$ . We want no collisions in the second level hash table.

The expected number of collisions in a given second level hash table is bounded by:

\binom{b_i}{2} \frac{c}{\max\left(2c\binom{b_i}{2},1\right)} \le \frac{1}{2}

So with probability at least 50% we don’t get any collision at all! In case we get a collision, just try a different $h_i$ . After an average of 2 attempts we will get zero collisions in that table.

So we don’t need any secondary collision resolution method. Just store pointers to elements directly in mini-buckets.

A lookup is now constant time:

Calculate $i = H(x)$
Calculate $j = h_i(x)$
The element, if it exists, will be in bucket $i$ , entry $j$ .

Total space used for second level hash tables is:

\sum_{i=0}^{n-1} \max\left(2c\binom{b_i}{2}, 1\right) \le 2c\sum_{i=0}^{n-1}\binom{b_i}{2} + n \le 2c \cdot cn + n = (2c^2+1)n = O(n)

Memory overhead is therefore $\Theta(n)$ . It is not negligible, especially if the actual data elements are small.
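The two-level construction can be sketched in code. This is a Haskell illustration under several assumptions that are not in the text above: keys are distinct integers in the range $[0, 2^{61}-1)$, the hash family is the classic $((ax + b) \bmod p) \bmod m$ family with $c = 1$, and a small deterministic generator (`candidates`) stands in for a real source of randomness, so the probabilistic guarantees are only imitated. All names (`Hash`, `apply`, `candidates`, `collisions`, `build`, `member`) are invented for this example.

```haskell
import Data.Array (Array, accumArray, elems, (!))

type Key = Integer

-- One member of the family ((a*x + b) mod p) mod m.
data Hash = Hash Integer Integer Int

prime :: Integer
prime = 2 ^ (61 :: Int) - 1

apply :: Hash -> Key -> Int
apply (Hash a b m) x = fromInteger (((a * x + b) `mod` prime) `mod` toInteger m)

-- Deterministic stand-in for a random source: a stream of candidate functions.
candidates :: Int -> [Hash]
candidates m = go 12345
  where
    next s = (s * 6364136223846793005 + 1442695040888963407) `mod` prime
    go s = let a = next s; b = next a in Hash (1 + a `mod` (prime - 1)) b m : go b

-- Number of colliding pairs when hashing xs with h into its m buckets.
collisions :: Hash -> [Key] -> Int
collisions h@(Hash _ _ m) xs = sum [ c * (c - 1) `div` 2 | c <- elems counts ]
  where counts = accumArray (+) 0 (0, m - 1) [ (apply h x, 1) | x <- xs ] :: Array Int Int

-- Second-level table: its own hash function plus an array of slots.
type Bucket = (Hash, Array Int (Maybe Key))

data Table = Table Hash (Array Int Bucket)

build :: [Key] -> Table
build xs = Table top (fmap secondLevel groups)
  where
    n = max 1 (length xs)
    -- Retry until the top level has at most c*n collisions (c = 1 here).
    top = head [ h | h <- candidates n, collisions h xs <= n ]
    groups = accumArray (flip (:)) [] (0, n - 1) [ (apply top x, x) | x <- xs ] :: Array Int [Key]
    secondLevel ys = (h, accumArray (\_ y -> Just y) Nothing (0, size - 1) [ (apply h y, y) | y <- ys ])
      where
        size = max 1 (length ys * (length ys - 1))   -- quadratic in bucket size
        -- Retry until this bucket has no collisions at all.
        h = head [ g | g <- candidates size, collisions g ys == 0 ]

-- Lookup: two hash evaluations and two array reads.
member :: Table -> Key -> Bool
member (Table top buckets) x = slots ! apply h x == Just x
  where (h, slots) = buckets ! apply top x
```

For instance, after `build [10, 20 .. 1000]`, `member` answers True for each of those keys and False for keys that were never inserted, without any collision resolution at lookup time.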

Compression

Now we are aiming to reduce the memory overhead to $o(n)$ , i.e. something that is relatively small for large n.

First, let’s increase the number of main buckets by a factor of $z$ (to be determined later), so we have $zn$ buckets. Seems like this will only make it worse, but bear with me. We will pack them really tightly!

The expected number of collisions is now smaller:

\sum_{i=0}^{zn-1} \binom{b_i}{2} \le \binom{n}{2} \frac {c}{zn} < \frac{n^2}{2} \frac{c}{zn} = \frac{cn}{2z}

Again, we make sure that we don’t get more than twice that, i.e. $cn/z$ . If we do, just try a different hash function $H$ .

If $b_i = 1$ , we are not going to have a second level hash table. Instead, we will store a pointer to the element directly in the main bucket.

For $b_i \ge 2$ , we make a second level hash table of size $2c\binom{b_i}{2} = cb_i(b_i-1)$ . This again means that we can easily get zero collisions in each such table.

The total size of all second level hash tables (and thus also the number of such tables) is bounded by:

\sum_{i=0}^{zn-1} 2c\binom{b_i}{2} \le 2c \frac{cn}{z} = 2c^2 \frac{n}{z} = O\left(\frac{n}{z}\right)

Now let’s put all the data elements into a dedicated array. Order them so that the ones that appear in single-element buckets are ordered first, in the same order as they appear in the buckets. Also put all the secondary table hash functions and pointers into a separate array, again in the same order as they appear in main buckets. In the main bucket table we just need to store what kind of entry it is (none, single element, or secondary table), and an index into the appropriate array.

Finally, we group the main buckets into groups of size $g$ . In each group we store just one index of the first single element in the group (if any), and one index of the first secondary table in the group (if any). In each individual bucket we just store an offset from those. The above bucket table now looks like this:

There are $zn/g$ groups. In each group, we need 2 words for the first element index and the first table index, for a total of $2zn/g$ words.

We also have $zn$ individual buckets. Each bucket has the type of bucket (there are 3 types, so 2 bits), and an offset. However, those offsets are small, between 0 and $g$ . So buckets only need $2 + \log g$ bits each. We can pack them into $O(zn \log g / \log n)$ words, because we know we can fit $\log n$ bits in a word.

The total space usage for all the elements, all the secondary tables, and all the buckets is therefore:

nS + O\left(\frac{n}{z} + \frac{zn}{g} + \frac{zn\log g}{\log n}\right)

Now we’ll pick $g$ and $z$ to make this as small as possible. Start with the group size $g$ . We want the last two terms to be approximately equal, which means $g\log g \approx \log n$ , or $g=\Theta(\log n / \log \log n)$ .

This makes the memory usage:

nS + O\left(\frac{n}{z} + \frac{zn\log \log n}{\log n}\right)

Finally optimize the number of buckets $zn$ . Again we want the two terms to be approximately equal, or $z = \Theta(\sqrt{\log n / \log \log n})$ .

The final memory usage then is:

nS + O\left(n \sqrt{\frac{\log \log n}{\log n}}\right) = nS + o(n)

Asymptotically smaller than linear overhead! Just as we wanted.

References

Fredman, Michael L., János Komlós, and Endre Szemerédi. “Storing a sparse table with O(1) worst case access time.” Journal of the ACM (JACM) 31.3 (1984): 538-544.

Implementing 2-3 trees

2020-05-23T00:00:00-07:00

Today I show how I have implemented 2-3 trees in a straightforward way.

I consider 2-3 trees to be perhaps the simplest possible kind of balanced search tree data structure. At least conceptually.

A while ago I showed how to implement AVL trees in not too many lines of code. However, I had to be very careful, and there was one weird special case that I had to make sure to handle appropriately to avoid getting into an infinite loop.

The code for 2-3 trees is, I think, simpler. It’s not necessarily shorter, but it’s more straightforward. There are no exceptional cases to handle, there are no tree rotations, etc. Just an intuitive recursive implementation of each operation.

What are 2-3 trees

2-3 trees are a kind of balanced search tree. They have, in this implementation, the following properties:

Every internal node has 2 or 3 children.
All leaves are always at the same depth.
The above guarantees that the height is $\Theta(\log n)$ .
All keys are stored in the leaves. This makes it simpler, since we can just discard and create internal nodes at will.
Internal nodes store the height of the subtree, and the smallest element of the subtree.
The tree is ordered. For a node with two children, the left subtree contains elements less than or equal to the right subtree. For a node with three children, the left subtree contains elements less than or equal to the middle subtree, and the middle subtree contains elements less than or equal to the right subtree.

Define the data type

Let’s start implementing this in Haskell.

The data type Tree t is the type of 2-3 trees containing elements of type t.

A tree is either empty, or it is a leaf, or it starts with an internal node at the root. We make separate constructors for 2-children nodes and 3-children nodes.

data Tree t =
    Empty
  | Leaf t
  | Node2 Int t (Tree t) (Tree t)          -- Node2 height smallest a b
  | Node3 Int t (Tree t) (Tree t) (Tree t) -- Node3 height smallest a b c

Also define a couple helper functions to extract the height and the smallest element of a tree. The height of a leaf is 0.

height :: Tree t -> Int
height (Leaf _) = 0
height (Node2 h _ _ _) = h
height (Node3 h _ _ _ _) = h

smallest :: Tree t -> t
smallest (Leaf x) = x
smallest (Node2 _ s _ _) = s
smallest (Node3 _ s _ _ _) = s

Now a couple functions for building 2-nodes and 3-nodes out of subtrees, which are assumed to be of equal height. These functions calculate the height and the smallest element from the left-most child.

node2 :: Tree t -> Tree t -> Tree t
node2 a b = Node2 (height a + 1) (smallest a) a b

node3 :: Tree t -> Tree t -> Tree t -> Tree t
node3 a b c = Node3 (height a + 1) (smallest a) a b c

Merging trees

Our basic operation is merge. It takes two trees, where all elements in one are no larger than all elements in the other, and creates one tree that contains their union. Every other operation will be defined in terms of merge.

The following helper function will be useful: take between 2 and 4 trees of the same height, containing elements in sorted order (i.e. the first tree contains the smallest elements, etc), and “level up”, creating between 1 and 2 trees of height one larger.

For 2 or 3 subtrees we end up with one tree, for 4 subtrees we end up with 2 trees.

levelUp :: [Tree t] -> [Tree t]
levelUp [a,b] = [node2 a b]
levelUp [a,b,c] = [node3 a b c]
levelUp [a,b,c,d] = [node2 a b, node2 c d]

Next comes the recursive helper function, mergeToSameHeight. Given two non-empty trees to merge, it returns either 1 or 2 trees. The height of the output(s) is always equal to the maximum of the heights of the inputs.

If the two inputs are already the same height, we just return them.

If the first tree is smaller, we merge it with the left-most subtree of the second tree, which generates either 1 or 2 subtrees to replace that left-most subtree. So we get between 2 and 4 subtrees, and we “level up” to get the output(s).

Similarly, if the second tree is smaller, we merge it with the right-most subtree of the first tree, and “level up”.

mergeToSameHeight :: Tree t -> Tree t -> [Tree t]
mergeToSameHeight a b
  | height a < height b =
    case b of
      Node2 _ _ b1 b2 -> levelUp (mergeToSameHeight a b1 ++ [b2])
      Node3 _ _ b1 b2 b3 -> levelUp (mergeToSameHeight a b1 ++ [b2, b3])
  | height a > height b =
    case a of
      Node2 _ _ a1 a2 -> levelUp ([a1] ++ mergeToSameHeight a2 b)
      Node3 _ _ a1 a2 a3 -> levelUp ([a1,a2] ++ mergeToSameHeight a3 b)
  | otherwise = [a, b]

merge just calls mergeToSameHeight. If two trees are generated at the top level, we add an extra level at the top. This is how 2-3 trees grow: they grow at the root!

merge :: Tree t -> Tree t -> Tree t

merge a Empty = a
merge Empty b = b

merge a b =
  case mergeToSameHeight a b of
    [t] -> t
    [t, u] -> node2 t u

The runtime of merge is $O(1 + d)$, where $d$ is the difference between the heights of the inputs.

Splitting trees

We define the split operation that takes a function to split the elements by (e.g. “all elements larger than 5 go to the right”) and a tree, and returns two trees. The function f returns True if the element should go to the right, and False if it should go to the left.

By looking at the smallest element in each subtree, we can figure out which subtree needs to be split. Then we use merge to merge the pieces of the subtree with the other subtrees.

split :: (t -> Bool) -> Tree t -> (Tree t, Tree t)

split _ Empty = (Empty, Empty)

split f (Leaf x)
  | f x   = (Empty, Leaf x)
  | otherwise  = (Leaf x, Empty)

split f (Node2 _ _ a b)
  | f (smallest b) =
    let (a1,a2) = split f a in (a1, merge a2 b)
  | otherwise =
    let (b1,b2) = split f b in (merge a b1, b2)

split f (Node3 _ _ a b c)
  | f (smallest b) =
    let (a1,a2) = split f a in (a1, merge a2 (node2 b c))
  | f (smallest c) =
    let (b1,b2) = split f b in (merge a b1, merge b2 c)
  | otherwise =
    let (c1,c2) = split f c in (merge (node2 a b) c1, c2)

The runtime of split is $O(\log n)$ . All the merging going on starts from small trees and works its way up to larger and larger trees. Because the time to merge only depends on the difference of heights, the total time adds up to the height of the tree, which is $O(\log n)$ .

Contains, insert, delete

These functions are now easy. We just split the tree around the element we are interested in, do the operation we want, and merge things back as appropriate.

contains :: Ord t => Tree t -> t -> Bool
contains a x =
  case split (>= x) a of
    (_, Empty) -> False
    (_, a2) -> smallest a2 == x

insert :: Ord t => Tree t -> t -> Tree t
insert a x =
  let (a1, a2) = split (>= x) a
  in a1 `merge` (Leaf x) `merge` a2

delete :: Ord t => Tree t -> t -> Tree t
delete a x =
  let (a1, a2) = split (>= x) a
      (_, a3) = split (>x) a2
  in merge a1 a3

Converting from and to lists

Just to make it easier to create trees and inspect trees, we add conversion functions to and from lists.

To create a tree from an unsorted list of elements, just insert all the elements starting from an empty tree.

fromList :: Ord t => [t] -> Tree t
fromList = foldl insert Empty

To convert to a list, we could recursively convert subtrees to lists and merge. But that would be $\Theta(n \log n)$ . We can do it better, in $\Theta(n)$ , by creating a helper function that prepends all the elements in front of a list. This way we don’t have to merge lists.

prepend :: Tree t -> [t] -> [t]
prepend Empty xs = xs
prepend (Leaf x) xs = x : xs
prepend (Node2 _ _ a b) xs = prepend a (prepend b xs)
prepend (Node3 _ _ a b c) xs = prepend a (prepend b (prepend c xs))

toList :: Tree t -> [t]
toList a = prepend a []

That’s all the basic operations. If you need more, they should be easy to add. It works, I’ve tested it.

How to pick a hash function, part 1

2020-05-21T00:00:00-07:00

Summary

If you don’t read the rest of the article, it can be summarized as:

Use universal hashing. It is simple, efficient, and provably generates minimal collisions in expectation regardless of your data.
Most hash table implementations don’t do this, unfortunately. Standard libraries of common programming languages don’t do this. Instead, they use inferior ad-hoc functions. That creates problems.

Note: This article only talks about hash functions for use in hash tables. Hash functions for use in cryptographic applications are a very different topic that we don’t cover here.

Hash tables

A hash table is a great data structure for unordered sets of data. Whenever you have a set of values where you want to be able to look up arbitrary elements quickly, a hash table is a good default data structure.

It typically looks something like this:

On the left we have $m$ buckets. Each bucket contains a pointer to a linked list of data elements.

We also need a hash function $h$ that maps data elements to buckets.

In the above example we have 10 buckets, data elements are numbers, and the hash function is the last digit of a number: $m=10$ , $h(x) = x \bmod m$ .

To find an element, we first compute the hash function, and then scan the list in the appropriate bucket. This is quick as long as we don’t have too many elements in the bucket.
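As a rough illustration of the lookup just described, here is a minimal chained hash set with $h(x) = x \bmod m$. The class name `ToyHashSet` and its methods are made up for this sketch. It is not how any particular standard library does it.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <iterator>
#include <vector>

// Toy chained hash set for non-negative integers (hypothetical, for illustration).
class ToyHashSet {
 public:
  explicit ToyHashSet(std::size_t m) : buckets_(m) { assert(m > 0); }

  // Insert x unless it is already present.
  void insert(unsigned long x) {
    if (!contains(x)) buckets_[hash(x)].push_front(x);
  }

  // Compute the bucket, then scan its list.
  bool contains(unsigned long x) const {
    for (unsigned long y : buckets_[hash(x)])
      if (y == x) return true;
    return false;
  }

  // Length of the list that a lookup of x has to scan.
  std::size_t chainLength(unsigned long x) const {
    const auto& b = buckets_[hash(x)];
    return static_cast<std::size_t>(std::distance(b.begin(), b.end()));
  }

 private:
  std::size_t hash(unsigned long x) const { return x % buckets_.size(); }
  std::vector<std::forward_list<unsigned long>> buckets_;
};
```

With 10 buckets, the keys 13 and 23 share a bucket, so a lookup of either one scans a list of length 2.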

The worry is: what if the hash function is not well-suited for our data and most elements end up in the same bucket? Then the access time will be bad.

For instance, if the keys are prices of products in a grocery store, then most prices will end with the digits .00 or .99. If we use the last two digits as the hash function, we would store everything in two buckets. This wouldn’t work well!

This is the reason many programmers are afraid of using hash tables. Is my hash function good for the data? Should I switch to the newest and fanciest hash function that somebody has recently published? How do I test it with my data?

I will argue that with appropriate implementation, these are non-issues. We will see that a hash table should really be a randomized data structure. If we do this correctly (and it’s not hard to do), the performance will be good in expectation regardless of the data.

Unfortunately, common libraries do not do this correctly (yet).

Ad-hoc hash functions are bad

Let’s solve the following artificial problem in a brute force way using hash tables:

Given $A$ and $B$ , compute:

$$B + 2B + 3B + \ldots + AB$$

There are better ways to solve the problem but we just want to use a hash table as an experiment.

OK, let’s write some C++ code.

#include <cstdlib>
#include <iostream>
#include <unordered_set>

// Builds the set {B, 2B, ..., AB}.
std::unordered_set<long> generate(long A, long B) {
  std::unordered_set<long> data;
  for (long i=1; i<=A; ++i) data.insert(i * B);
  return data;
}

// Adds up all elements of the set.
long sum(const std::unordered_set<long> &data) {
  long res = 0;
  for (long x : data) res += x;
  return res;
}

int main(int argc, char **argv) {
  if (argc < 3) {
    std::cerr << "usage: hashing A B\n";
    return 1;
  }
  long A = std::atol(argv[1]);
  long B = std::atol(argv[2]);
  auto data = generate(A, B);
  std::cout << sum(data) << "\n";
}

And run it:

$ time ./hashing 1000000 123
61500061500000

real  0m0.135s
user  0m0.119s
sys 0m0.016s
$ time ./hashing 1000000 3141592
1570797570796000000

real  0m0.153s
user  0m0.136s
sys 0m0.017s
$ time ./hashing 1000000 1056323
^C

real  1m10.137s
user  1m10.107s
sys 0m0.024s

In the first two cases we get an answer in 0.15 seconds. In the last case, A=1000000, B=1056323, we never got an answer. I just killed the program after a minute. It would take about 45 minutes to complete.

It’s not that we were extremely unlucky. This will happen every time we run with these inputs. If you want to know the answer for A=1000000, B=1056323 you have no choice but to wait for 45 minutes! (Or you modify the program, but that’s not the point.)

Well this is ridiculous. We just want to add a million numbers. It shouldn’t be that hard.

I’m sure the readers already suspect what the issue is. Hash table collisions!

It turns out that my implementation of the C++ standard library uses 1056323 buckets for a hash table of size 1000000, and the hash function it uses is simply $h(x) = x \bmod m$ . Since our numbers are all divisible by 1056323, everything ends up in bucket 0.
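To see the effect without waiting for the slow run, we can simulate just the bucket assignment. The helper `bucketLoads` below is a made-up name for this sketch. It only models $h(x) = x \bmod m$ and does not touch the real `std::unordered_set`, whose bucket count depends on the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many of the keys B, 2B, ..., AB fall into each of the m buckets
// under h(x) = x mod m. Hypothetical helper, for illustration only.
std::vector<std::size_t> bucketLoads(long A, long B, long m) {
  assert(A >= 0 && B >= 0 && m > 0);
  std::vector<std::size_t> load(static_cast<std::size_t>(m), 0);
  for (long i = 1; i <= A; ++i) ++load[static_cast<std::size_t>((i * B) % m)];
  return load;
}
```

With `B` equal to `m`, every key lands in bucket 0, so each insert has to scan one ever-growing list. The total work is then quadratic in `A`.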

Obfuscated hash functions: not a real solution

A solution many people use in practice is to, instead of a simple ad-hoc hash function like $h(x) = x \bmod m$ , use a much more complicated ad-hoc function, with a lot of arbitrary arithmetic instructions thrown together: shift bits around, add things together, multiply things, xor things, etc etc.

Examples of this are: MurmurHash, FNV Hash, PJW Hash, Jenkins Hash, Spooky Hash, etc etc. People come up with them all the time.

There isn’t really any secret sauce behind these functions. Various authors simply pick an arbitrary sequence of arithmetic operations to make the function more complicated.

This doesn’t really solve the underlying problem. What it does is sweep the problem under the rug. Because the functions are so complicated, it is much less obvious what kind of data is bad data. It is also less likely your data will accidentally happen to be worst case data.

But it’s not guaranteed. It could just so happen that your type of data interacts with one of these hash functions in a bad way.

Perhaps more importantly, it is very easy to deliberately search for and find bad data. This leads to a denial of service attack against your program. If, for example, you’re running a website and the backend of your website uses a hash table, an evil user can deliberately send you data that will cause your server to run very long computations due to hash collisions.

Utopia

What if we use a random function as the hash function? Just pick any function randomly out of the space of all possible functions.

It looks like this would work. Let’s say you have $n$ elements in the hash table, $\{a_1, a_2, a_3, \ldots, a_n\}$. Let’s also assume we’re looking up an element $x$ that does not exist in the table (worst case scenario for lookup). The expected number of elements in the same bucket as $x$ is:

$$\sum_{i=1}^n \Pr(h(a_i) = h(x)) = n\cdot\frac{1}{m} = \frac{n}{m}$$

Therefore the expected lookup time would be $O(1 + \frac{n}{m})$ .

As long as we have enough buckets with $m\ge n$ , lookup time is $O(1)$ , i.e. constant time in expectation.

Great! Can’t get any better than that.

There is a problem with this solution however. There are a lot of possible hash functions! If there are $U$ possible keys, there are $m^U$ possible hash functions.

Just to store a description of a randomly chosen hash function, we need at least $\log_2 m^U = U \log_2 m$ bits. In other words, we would need to store a huge array of hash values, one entry for each possible key. But if we do that, then we could just as well not use a hash table at all, and just store the set elements directly in that array! The whole purpose of having a hash table is to avoid having an array with one entry for each possible key.

So this doesn’t work. But there is a better way.

Universal hashing

The idea behind universal hashing is similar to the idea behind Utopia. We will still choose a random hash function. But we limit the set of possible hash functions, so that we can store it compactly in a very small amount of memory.

The only property of random hash functions that we really needed in the Utopia proof was: for two different keys $x$ and $y$ :

$$\Pr(h(x) = h(y)) \le \frac{1}{m}$$

It would also be OK to have a slightly larger bound, say $\frac{2}{m}$ .

Fortunately, this is achievable! If a family of hash functions satisfies this property, we call it a “universal family of hash functions”, and call a randomly chosen function from that family a “universal hash function”.

Hashing an integer

If your keys are integers in some range, do this:

Pick a prime number $p$ that is at least as large as the range of keys.
Pick random $0 \le a, b < p$ , $a\ne 0$ .

The Carter-Wegman hash function is:

$$h(x) = ((ax + b) \bmod p) \bmod m$$

This is a universal hash function: for any two different keys $x \ne y$, $\Pr(h(x)=h(y)) \le \frac{1}{m}$.

Proof sketch. For a given pair of keys, $x\neq y$ , $(ax+b) - (ay+b) = a(x-y)$ is not divisible by $p$ , because $a, x, y$ are all smaller than $p$ . Therefore $ax+b \not\equiv ay+b \pmod p$ . For a random choice of $a,b$ the pair $(ax+b, ay+b)$ is in fact a uniformly random pair of non-equal numbers modulo $p$ . There are $p(p-1)$ such pairs, and less than $\frac{p(p-1)}{m}$ of them match modulo $m$ . Therefore collision probability is less than $\frac{1}{m}$ .

So, just use this function and you’ll be fine!
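Here is a minimal sketch of what this could look like in C++. It makes several assumptions that are mine: the prime is fixed to $p = 2^{61}-1$, the compiler supports `unsigned __int128` (a GCC/Clang extension), and `std::random_device` stands in for whatever source of randomness you prefer. The names `CarterWegman` and `randomCarterWegman` are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// h(x) = ((a*x + b) mod p) mod m for keys 0 <= x < p.
// Assumption: p = 2^61 - 1, and unsigned __int128 is available.
struct CarterWegman {
  static constexpr std::uint64_t p = (std::uint64_t{1} << 61) - 1;
  std::uint64_t a, b, m;

  CarterWegman(std::uint64_t a_, std::uint64_t b_, std::uint64_t m_)
      : a(a_), b(b_), m(m_) {
    assert(a > 0 && a < p && b < p && m > 0);
  }

  std::uint64_t operator()(std::uint64_t x) const {
    assert(x < p);
    // The product needs up to 122 bits, hence the 128-bit intermediate.
    unsigned __int128 t = static_cast<unsigned __int128>(a) * x + b;
    return static_cast<std::uint64_t>(t % p) % m;
  }
};

// Picks a in [1, p-1] and b in [0, p-1] at random.
CarterWegman randomCarterWegman(std::uint64_t m) {
  std::random_device rd;
  std::uniform_int_distribution<std::uint64_t> distA(1, CarterWegman::p - 1);
  std::uniform_int_distribution<std::uint64_t> distB(0, CarterWegman::p - 1);
  std::uint64_t a = distA(rd);
  std::uint64_t b = distB(rd);
  return CarterWegman(a, b, m);
}
```

A hash table would call `randomCarterWegman(m)` once when it is created (and again on every resize), then use the resulting object for all lookups.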

One possible complaint might be that this function involves two expensive modulo operations. However both of them can be avoided:

If you choose $m$ to be a power of 2, then mod m is just a cheap bitmask of the lowest bits.
If $p$ is a compile-time constant, then there is a way to compute mod p using multiplication instead of division. The idea is to multiply by a precomputed fixed-precision approximation to $\frac{1}{p}$ instead of dividing by $p$ . Good compilers do this automatically.
If $p = 2^{k}-1$ is a Mersenne prime, mod p can be computed even easier using just bitshifts and addition. Again, good compilers do this automatically.
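For the Mersenne case, the reduction can also be written out by hand. The sketch below assumes $p = 2^{61}-1$ and a compiler with `unsigned __int128` (a GCC/Clang extension). The names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMersenne61 = (std::uint64_t{1} << 61) - 1;

// Reduces t modulo p = 2^61 - 1 without division, for t < 2^123.
// Since 2^61 is congruent to 1 modulo p, t = hi * 2^61 + lo is congruent to hi + lo.
std::uint64_t modMersenne61(unsigned __int128 t) {
  assert((t >> 123) == 0);
  std::uint64_t lo = static_cast<std::uint64_t>(t) & kMersenne61;  // < 2^61
  std::uint64_t hi = static_cast<std::uint64_t>(t >> 61);          // < 2^62
  std::uint64_t s = lo + hi;                                       // < 2^63, no overflow
  s = (s & kMersenne61) + (s >> 61);                               // < 2^61 + 4
  return s >= kMersenne61 ? s - kMersenne61 : s;
}
```

The bound $t < 2^{123}$ is enough for $ax + b$ with $a, x, b < p$.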

See also part 2 for even better hash functions.

Hashing bigger data

Suppose we have a data structure consisting not of just one number $x$ , but of $n$ numbers $(x_1, x_2, \ldots, x_n)$ .

In that case we randomly select $n$ multipliers $a_1, a_2, \ldots, a_n$ , and use:

$$h(x_1, x_2, \ldots, x_n) = ((a_1 x_1 + a_2 x_2 + \ldots + a_n x_n + b) \bmod p) \bmod m$$

This guarantees collision probability of at most $\frac{1}{m} + \frac{1}{p} < \frac{2}{m}$ for different keys, by a very similar proof. It’s sufficient that at least one of the $x_i$ is different.
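A direct transcription of this formula might look like the sketch below. The function name and parameter layout are my own choices, and the code assumes $p \le 2^{63}$ and a compiler with `unsigned __int128` (a GCC/Clang extension) so that the intermediate products do not overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// h(x_1..x_n) = ((a_1 x_1 + ... + a_n x_n + b) mod p) mod m.
// Hypothetical helper; a and b are assumed to have been chosen at random.
std::uint64_t multilinearHash(const std::vector<std::uint64_t>& x,
                              const std::vector<std::uint64_t>& a,
                              std::uint64_t b, std::uint64_t p, std::uint64_t m) {
  assert(x.size() == a.size());
  assert(p > 0 && p <= (std::uint64_t{1} << 63) && m > 0);
  unsigned __int128 acc = b % p;
  for (std::size_t i = 0; i < x.size(); ++i) {
    // Reduce after every term so the accumulator stays below p.
    acc = (acc + static_cast<unsigned __int128>(a[i] % p) * (x[i] % p)) % p;
  }
  return static_cast<std::uint64_t>(acc) % m;
}
```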

Hashing variable-length data

Suppose that we have variable length data $(x_0, x_1, \ldots, x_{n-1})$ , so $n$ is not a constant. We only have some large limit $L$ on the length.

In this case what we can do is:

Pick a prime number $p > mL$ .
Pick a random number $0 \le a < p$.
Pick a random hash function $h$ for single integers in range 0 to $p-1$ .

The hash function for variable length data is:

$$H(x_0, x_1, \ldots, x_{n-1}) = h\left(\left(\sum_{i=0}^{n-1} x_i a^i + a^n\right) \bmod p\right)$$

Inside the parentheses we are evaluating a polynomial of degree at most $L$ modulo $p$ at a random point $a$ .

If we have two different variable-length keys $x$ and $y$ , then we are evaluating two different polynomials at the same random point. The difference between polynomials is itself a polynomial. A polynomial of degree at most $L$ can have at most $L$ roots. Therefore the probability that the two polynomials give the same value at $a$ is at most $\frac{L}{p}$ .

If the two polynomials are different at $a$ , then we’re applying $h$ to two different integers. In this case we know the probability of collision is at most $\frac{1}{m}$ .

Therefore the total collision probability is:

$$\Pr(H(x)=H(y)) \le \frac{L}{p} + \frac{1}{m} < \frac{L}{mL} + \frac{1}{m} = \frac{2}{m}$$

This is good enough to get $O(1)$ expected access time to the hash table.
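Here is a sketch of the variable-length scheme for byte strings, evaluating the polynomial with Horner's rule. The assumptions are mine: $p = 2^{61}-1$ (so you need $mL < 2^{61}-1$), the inner hash is $h(v) = ((cv + d) \bmod p) \bmod m$ with random $c, d$, and `unsigned __int128` is available (a GCC/Clang extension). The names `polyEval` and `hashBytes` are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr std::uint64_t kP = (std::uint64_t{1} << 61) - 1;  // assumed prime p

// Evaluates (x_0 + x_1 a + ... + x_{n-1} a^{n-1} + a^n) mod p by Horner's rule,
// where x_i are the bytes of the input.
std::uint64_t polyEval(const std::string& bytes, std::uint64_t a) {
  assert(a < kP);
  unsigned __int128 acc = 1;  // coefficient of the leading a^n term
  for (std::size_t i = bytes.size(); i-- > 0;) {
    acc = (acc * a + static_cast<unsigned char>(bytes[i])) % kP;
  }
  return static_cast<std::uint64_t>(acc);
}

// H(x) = h(polyEval(x)) with inner hash h(v) = ((c*v + d) mod p) mod m.
std::uint64_t hashBytes(const std::string& bytes, std::uint64_t a,
                        std::uint64_t c, std::uint64_t d, std::uint64_t m) {
  assert(c > 0 && c < kP && d < kP && m > 0);
  unsigned __int128 v = polyEval(bytes, a);
  return static_cast<std::uint64_t>((v * c + d) % kP) % m;
}
```

The extra $a^n$ term is what distinguishes a key from the same key padded with trailing zeros: without it, both would give the same polynomial.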

Tournament-winning gomoku AI

2020-05-18T00:00:00-07:00

Introduction
Rules
Example game 1
Threats
Example game 2
Precomputed patterns
Static evaluation
Board representation
Threat sequence search
Principal variation search
Opening book
Conclusion
References

Introduction

I’m going to describe my Gomoku playing program that I submitted for CodeCup 2020. The program (named “OOOOO”) ended up winning the tournament, placing first in a field of 58 entries, and winning 98 out of 100 games.

CodeCup is an excellent annual tournament for game AIs organized by the Dutch National Olympiad in Informatics. Every year they host a tournament for programs playing a game, with a different board game each time.

In January 2020 the game was Gomoku. It was a bit unusual in that the game chosen was a well-known game. Usually it’s a completely new game, or an obscure game not well known. So I thought that winning submissions might be some existing programs that were already developed previously over many years. This turned out not to be the case.

I enjoyed coding this game very much. The game has very simple rules, and allows very interesting algorithmic ideas.

You can see the results and all the games on the tournament website.

Rules

The game was simple “free-style” gomoku played on a 16x16 board. Players alternate placing black and white stones anywhere on the board, until somebody makes 5 stones in a row horizontally, vertically or diagonally, which wins.

To minimize first-player advantage, the “swap rule” is employed. One player makes the first three moves (black, white, black), and the other player may choose to continue with white, or swap colors.

Example game 1

Take a look at an example game played between OOOOO (black) and Leopold Tschanter’s “ltgmk” (white).

The game ended with the following sequence:

When OOOOO played the move marked as 1, it already knew it was going to win, 13 ply (7 moves) before the game ended. This despite the fact that the space of possible moves is very large: each player has almost all 256 intersections available each move.

The endgame is a forcing sequence of threats to which defenses are more or less forced.

Move 1 is a diagonal “simple four” threat. 2 is forced, otherwise black will play at 2 and win immediately.

Move 3 is a diagonal “broken three” threat.

This time however, white doesn’t have to respond immediately. Instead, he makes a stronger counter-threat with a simple four at 4. Black has to defend the counter-threat at 5.

Now white has to go back and address the threat at 3. If he ignored the threat, black would make a diagonal “open four” threat at 6 to which there would be no defense.

So white defends at 6.

Now black makes another simple four at 7. White has to respond at 8.

Now black makes a double “open three” threat at 9. White doesn’t have a good counter-threat, so defends one of these open threes at 10.

Black converts the other, undefended open three into an open four with 11. There is no defense to an open four (other than making an immediate five). White defends on one side with 12, and black finishes with a five on the other side at 13, winning the game.

Pretty much every gomoku game ends in one of these long forcing threat sequences. It is therefore essential to be able to find such sequences and defend against opponent’s sequences.

Threats

We categorize threats as follows:

Winning threats. These are fives and open fours.
Forcing threats, i.e. opponent must respond. These are: simple fours, open threes, and broken threes.
Non-forcing threats. These don’t require the opponent to respond, but may become forcing threats later when additional stones are added.

Winning and forcing threats can also be ordered by their severity:

Fives.
Fours.
Threes.

This ordering will be useful when thinking about counter-threats later. Instead of answering a forcing threat, one may counter with a more severe forcing counter-threat.

Five

Fives are pretty self-explanatory. You immediately make five in a row and win. 1, 2, 3 are fives.

Open four

A standard open four threat is a play like 1. Black has no defense. If black plays at a, white plays at b. If black plays at b, white plays at a. The only way to salvage the game for black would be to play a more severe counter-threat elsewhere, i.e. a five!

But 1 is not the only pattern like that. 2 and 3 work exactly the same way. They also create two ways for white to finish the game. Even though they involve 6 and 7 stones rather than 4, I also call such equivalent patterns “open fours” (because we have four stones out of five in a row).

Simple four

A simple four is a forcing threat, threatening a five next move, but only in one way. It doesn’t necessarily win, but forces a response. White plays at 1 or 2, black has to respond at a (or play a more severe threat, i.e. a five).

Open three

The most common open three looks like 1: three in a row, with at least two empty spaces on each side. Black has to defend at b or c (or play a four-threat elsewhere). Otherwise, white will make an open four at b or c and win.

Open three is a special kind of threat in that four empty spaces are required, but only two of them are valid defenses for black. For instance, if black tries to defend at a, white still makes an open four at c and wins. This will be relevant later, when we talk about sequences of threats: we have to make sure that a and d are still empty when this threat is played, even though black can’t defend there.

And again, there are other patterns like 2 and 3 above that we call “open threes” even though they involve more than three white stones. The defining characteristic is that there are four empty spaces of which white needs to fill any two consecutive ones to win.

The last pattern (3) is my favorite. It looks pretty. And it played a major role in example game 2 shown below.

Broken three

Finally we come to the weakest type of forcing threat, but probably the most common one: a broken three. Here only three empty spots are involved. Black can defend at any one of them, otherwise white will make an open four at b.

Non-forcing threats

All types of threats can be described by two numbers: $(a,b)$ where $1 \le a \le 5$ is the threat severity (we already have $a$ stones out of 5), and $1 \le b \le 6-a$ is the number of possible ways to make a five.

So all the threats types are as follows:

Winning threats:
- (5, 1): five
- (4, 2): open four
Forcing threats:
- (4, 1): simple four
- (3, 3): open three
- (3, 2): broken three
Non-forcing threats:
- (3, 1): simple three
- (2, 4), (2, 3), (2, 2), (2, 1): a two that can be extended to a five in 4, 3, 2, 1 ways respectively
- (1, 5), (1, 4), (1, 3), (1, 2), (1, 1): a single stone that can be extended in 5, 4, 3, 2, 1 ways respectively
No threat at all

All the above are non-forcing threats. Play at 1 is (3, 1), i.e. a simple three. Play at 2 is a (2, 3), a two that can be completed in 3 ways. Play at 3 is not a threat at all, because it can’t make five in a row in this direction.
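The $(a,b)$ classification is easy to write down as code. This is only a sketch: the enum, the function names, and the convention of using $a = 0$ for “no threat” are my own choices here, not taken from the bot.

```cpp
#include <cassert>

enum class ThreatClass { None, NonForcing, Forcing, Winning };

// (a, b): a stones out of five already in place, b ways to make a five.
// Assumption for this sketch: a == 0 encodes "no threat at all".
ThreatClass classify(int a, int b) {
  if (a == 0) return ThreatClass::None;
  assert(1 <= a && a <= 5 && 1 <= b && b <= 6 - a);
  if (a == 5 || (a == 4 && b == 2)) return ThreatClass::Winning;  // five, open four
  if (a == 4 || (a == 3 && b >= 2)) return ThreatClass::Forcing;  // simple four, open/broken three
  return ThreatClass::NonForcing;
}

// Counts all valid (a, b) pairs, plus one for "no threat".
int countThreatTypes() {
  int n = 1;
  for (int a = 1; a <= 5; ++a)
    for (int b = 1; b <= 6 - a; ++b) ++n;
  return n;
}
```

Counting the pairs gives 15 threat types, or 16 with “no threat” included.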

Example game 2

Let’s look at another game from the tournament. OOOOO plays black versus Sjoerd Hemminga’s SjoerdsGomokuPlayer playing white.

The final forcing sequence that OOOOO employed looks like this:

Black plays a diagonal broken three at 1, white defends at 2. Black plays another diagonal broken three at 3, white defends at 4. Black plays a horizontal open three at 5, white defends at 6.

And now comes a pretty double-threat at 7. A standard vertical broken three, and a nice non-standard horizontal “open three” consisting of four stones, all separated by empty spaces.

White defends the vertical threat at 8, and black converts the undefended horizontal “open three” into a non-standard “open four” (actually consisting of five stones) with 9. White defends one of the two spots at 10, and black finishes the job at 11.

Precomputed patterns

My bot has a lot of things precomputed, to enable a quick recognition of threat patterns.

There are 16 types of threats (from no threat to a five). There are a total of 65 distinct threat patterns. For each pattern we store the set of squares that must be occupied, the set of squares that must be empty, and the set of squares that are possible defenses.

For each line length from 5 to 16, and for each pattern of opponent stones on that line, we precompute how the line is split into sub-lines by opponent stones. Any threat has to be contained within a sub-line.

For each sub-line length from 5 to 16, and for each pattern of our stones, we precompute what threat patterns are available within the subline.

Using these tables we can quickly look up threats available on the board for each player.

Static evaluation

My static evaluation of a position is very simple. I tried some simple machine learning, but ended up just using a simple hand-made formula that works reasonably well.

For each player, and for each empty intersection of the board, I look at what threats are available for that player at this intersection in the four directions. I take the two best threats, and assign them numbers from 0 (no threat) to 16 (immediate five) based on the threat type.

If the two best threats at intersection i have values $a_i$ and $b_i$ , my evaluation for one side is:

$$\sum_{i\in\text{empty}} (1.5 \cdot 1.8^{a_i} + 1.8^{b_i})$$

Total static evaluation is the difference of single-side evaluations, plus a bonus for the side to move.
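In code, the formula could look roughly like this. The sketch assumes that $a_i$ is the better of the two threats, and takes the side-to-move bonus as a parameter, since its value isn't given here. The names are made up.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// One entry per empty intersection: values (a_i, b_i) of the two best threats.
using BestTwo = std::pair<int, int>;

// Single-side evaluation: sum of 1.5 * 1.8^a_i + 1.8^b_i over empty intersections.
double sideScore(const std::vector<BestTwo>& empties) {
  double total = 0.0;
  for (const BestTwo& t : empties) {
    assert(t.first >= t.second);  // assumption: a_i is the stronger threat
    total += 1.5 * std::pow(1.8, t.first) + std::pow(1.8, t.second);
  }
  return total;
}

// Difference of the two single-side scores plus a bonus for the side to move.
// tempoBonus is a hypothetical parameter.
double staticEval(const std::vector<BestTwo>& toMove,
                  const std::vector<BestTwo>& opponent, double tempoBonus) {
  return sideScore(toMove) - sideScore(opponent) + tempoBonus;
}
```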

Board representation

We want the board representation to be a data structure that supports some important operations efficiently:

See if the game is over and who won.
Make a move.
Un-make a move (to support recursive game tree search).
Compute the static evaluation.
Find all winning or forcing threats for each player (to start the search for a winning threat sequence).

Our board representation consists of:

Rotated bitboards.
Threat Boards.
Incremental static evaluation.

Rotated bitboards

A bitboard is a sequence of 16x16 = 256 bits. We could represent the board as two bitboards: one for black stones and one for white stones.

We add redundancy so that we can easily extract patterns of bits corresponding to lines on the board. For each player we store the board in four copies, rotated by 0, 90, 45 and 135 degrees. So we have:

A row-major bitboard, where intersections in the same row are consecutive.
A column-major bitboard, where intersections in the same column are consecutive.
A NW-SE bitboard, where intersections in the same NW-SE diagonal are consecutive.
A NE-SW bitboard, where intersections in the same NE-SW diagonal are consecutive.
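One possible way to lay out the four copies is sketched below: each function maps a (row, column) pair to a bit index, so that all intersections of one line end up next to each other. The exact packing of the diagonals is my own guess for illustration. The bot may well do it differently.

```cpp
#include <cassert>
#include <cstdlib>

constexpr int kBoardSize = 16;

int rowMajorIndex(int r, int c) { return r * kBoardSize + c; }
int colMajorIndex(int r, int c) { return c * kBoardSize + r; }

// Start offset of diagonal d (0..30) when diagonals are packed one after
// another. Diagonal d has kBoardSize - |d - (kBoardSize - 1)| cells.
int diagonalOffset(int d) {
  assert(0 <= d && d <= 2 * (kBoardSize - 1));
  int offset = 0;
  for (int k = 0; k < d; ++k) offset += kBoardSize - std::abs(k - (kBoardSize - 1));
  return offset;
}

// NW-SE diagonals: cells with the same r - c are consecutive.
int nwseIndex(int r, int c) {
  int d = r - c + (kBoardSize - 1);
  int pos = r < c ? r : c;  // steps from the diagonal's top-left end
  return diagonalOffset(d) + pos;
}

// NE-SW diagonals: cells with the same r + c are consecutive.
int neswIndex(int r, int c) {
  int d = r + c;
  int pos = d < kBoardSize ? r : r - (d - (kBoardSize - 1));  // steps from the top-right end
  return diagonalOffset(d) + pos;
}
```

Extracting a line is then a shift and a mask on the corresponding bitboard, and the result can index directly into the precomputed pattern tables.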

Threat Boards

For each player we maintain a Threat Board. A Threat Board contains information about each square in each direction. For empty intersections, we store the current threat pattern available in that square.

Every time we make or un-make a move, we update all threats in the neighborhood. We look up the pattern in each direction in our precomputed tables, and update all threats nearby. We have to go a distance of 4 in each direction, so we only have to update 8 * 4 = 32 nearby intersections.
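To make the count concrete, here is a small sketch that lists the intersections whose threats may change after a stone is placed or removed at (r, c). The function name is made up, and near the edges the list is shorter than 32 because part of the neighborhood falls off the board.

```cpp
#include <cassert>
#include <utility>
#include <vector>

constexpr int kBoardSize = 16;

// Intersections within distance 4 of (r, c) along the 8 line directions.
std::vector<std::pair<int, int>> affectedIntersections(int r, int c) {
  assert(0 <= r && r < kBoardSize && 0 <= c && c < kBoardSize);
  static const int dirs[8][2] = {{0, 1}, {0, -1}, {1, 0},  {-1, 0},
                                 {1, 1}, {-1, -1}, {1, -1}, {-1, 1}};
  std::vector<std::pair<int, int>> out;
  for (const auto& d : dirs) {
    for (int k = 1; k <= 4; ++k) {
      int rr = r + k * d[0];
      int cc = c + k * d[1];
      if (rr < 0 || rr >= kBoardSize || cc < 0 || cc >= kBoardSize) break;
      out.emplace_back(rr, cc);
    }
  }
  return out;
}
```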

We also keep track of whether the game is already over, so we can answer that question immediately.

Incremental static evaluation

Every time the current threat available at an intersection is updated, we recompute the score for that intersection, and update the total sum of these scores. This way we have the score always available without having to recompute it for the whole board.

Threat sequence search

This is the central and most compute-intensive part of the whole program. Every time we encounter a new position, we want to be able to tell whether there exists a winning combination of threats for the player to move. We also want to be able to tell whether there exists a winning combination of threats for the other player, and if so, how the current player can defend against it. In the tree search we will only consider those moves that defend against such combinations.

The algorithm for finding these threat sequences is inspired by the Ph.D. thesis of Victor Allis, “Searching for solutions in games and artificial intelligence.”¹

All-defenses trick

Consider a threat sequence like this:

White first plays a horizontal broken three at 1. Black can defend it at any of the three points marked as 2. Now white plays another horizontal broken three at 3, and again black can defend it at any of the three points marked as 4. Finally, white plays a vertical open four at 5 and wins the game next move.

If we were to search the game tree in a straightforward way, there are 9 different combinations here, because black can defend the first threat 3 ways, and the second threat 3 ways. But all of them are very similar, it’s essentially the same sequence.

We avoid checking all these combinations separately by using the trick discovered by Victor Allis. When searching for a winning threat sequence for white, we simply assume that black is allowed to play all the defenses to a single threat at once!

This is a conservative approach, but often good enough, and saves a lot of computation.

So above we simply say: first we play the first threat, which adds a white stone at 1 and three black stones at 2, 2, 2. Then we play the second threat, which adds a white stone at 3 and three black stones at 4, 4, 4. Then we play the final game-winning threat at 5.

This trick avoids some of the combinatorial explosion of the number of cases, and makes the threat search a single-player game.
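Applying a threat under this trick could be sketched like this. The `Threat` and `SearchPosition` structs are hypothetical simplifications (a real threat also records which squares must already be occupied or empty), and squares are plain indices 0..255.

```cpp
#include <bitset>
#include <cassert>
#include <vector>

using Bitboard = std::bitset<256>;

// Hypothetical, simplified threat: the square the attacker plays,
// and every square where the defender could answer.
struct Threat {
  int attackSquare;
  std::vector<int> defenses;
};

struct SearchPosition {
  Bitboard attacker;
  Bitboard defender;
};

// All-defenses trick: the attacker gets one stone, and the defender gets a
// stone on every possible defense at once.
void applyThreat(SearchPosition& pos, const Threat& t) {
  assert(!pos.attacker.test(t.attackSquare) && !pos.defender.test(t.attackSquare));
  pos.attacker.set(t.attackSquare);
  for (int d : t.defenses) {
    assert(!pos.attacker.test(d) && !pos.defender.test(d));
    pos.defender.set(d);
  }
}
```

Since the defender never has a choice, a threat sequence is just a list of `applyThreat` calls, with no branching on the defender's side.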

Dependency-based search

Another way to reduce the number of combinations to look through is to notice that in the combination above, the order of the first two threats doesn’t matter. We could play 1 first, or we could play 3 first. We don’t need to check both orders separately. For longer sequences, this helps a lot.

This algorithm is again due to Victor Allis’ Ph.D. thesis¹.

Instead of searching for sequences of threats, we search the dependency graph of threats.

In the example above, threat 5 depends on threats 1 and 3.

We start from all immediate threats. For each threat node, we look at whether new threats are enabled by this threat, and create those nodes.

We also combine threats that are on a single line (such as 1 and 3) into “combination nodes”, and look for new threats that are created as a result of such a combination.

Before we create a combination node we check whether the two threats that we are combining, and their dependencies, do not interfere with each other (i.e. reuse the same squares).

When combining threats, special care is taken for open threes. An open three allows only two defense points, but requires two additional intersections to be empty when the threat is executed. These empty intersections can later be used for other threats. This creates additional ordering dependencies between open threes and other threats. When combining threats, we use topological sorting to see whether we can order threats so that these ordering dependencies are satisfied.

We thus build a directed acyclic graph of threat nodes and combination nodes, until we run out of possibilities, or we find a game-winning threat (open four or five).

Counter-threats and refutations

If only things could be that simple…

Imagine we have found the following threat sequence:

White plays a horizontal broken three at 1 (it’s not an open three due to the presence of a black stone), then a vertical broken three at 3, then a horizontal open four at 5, which wins.

It looks good. But it doesn’t work. Black can use the left-most 2 as a defense to 1. Then, after white plays the threat at 3, black ignores the threat, and instead plays a more severe counter-threat at a, creating an open four and winning the game!

This is a refutation of a threat sequence. While playing out his own threats, white inadvertently helps black set up counter-threats that ultimately defeat the threat sequence.

There are two ways a threat sequence can be refuted by counter-threats:

The defender may win with his own counter-threats.
The defender may use his counter-threats to place a stone at a spot where it interferes with the original threat sequence.

After finding a threat sequence with the dependency-based search, we run another dependency-based search, this time for defender’s counter-threats, to see if we can find a refutation. Differences from regular search are:

We look at counter-threats after each move of the original threat sequence. These may combine with counter-threats made in response to other, earlier threats in the threat sequence.
We only consider counter-threats that are more severe than the original threat.
We declare victory for the defender when either he wins or manages to interfere with the original threat sequence.
We don’t consider refutations to refutations recursively. Instead, if we find a potential refutation, we just conservatively assume that it works, throwing away the potential threat sequence that we found.

Principal variation search

As the main game tree search algorithm, we use Principal Variation Search, which is a variant of alpha-beta pruning. In each node we run dependency-based search to see if a threat sequence is available, and also whether the current player has to defend against opponent’s possible threat sequence.

Defenses to potential threat sequences

In each node of the tree search, we run dependency-based search for the opponent of the player to move, to see if there is a potential danger we have to defend against.

If there is, we first want to determine the set of moves that defend against the danger.

We augment the dependency-based search algorithm to also return all moves that could potentially be defensive moves, whenever a winning sequence is found. These potential defensive moves are:

all intersections that are part of the threat sequence, and
all intersections that create additional counter-threats for the defender in the refutation search

Once we have this set of potential defenses, we try each one in turn, and again run the threat search for the other player.

If the threat search now returns no attack for the opponent, we have found a valid defense, so we add it to the set of valid defenses.
If it does return another winning threat sequence, then this potential defense doesn’t work. But also we then get another set of potentially valid defenses to this other threat sequence. A defense that works has to work against all threat sequences, so we take the intersection of the two sets of potential defenses.

We continue this until we have converged on the set of defensive moves that work. This set could be empty, which means that the position is lost, and we can score the node in the tree.

If it is not empty, we only consider those moves that are valid defenses as children of the tree node.
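The narrowing loop described above could be sketched roughly as follows. This is only an illustration, not the bot’s actual code: `Move`, `MoveSet`, `ThreatResult`, `ThreatSearch` and `valid_defenses` are hypothetical names, and the threat search itself is assumed to be supplied by the caller as a callback that plays one defensive move and then searches for the opponent.

```cpp
#include <functional>
#include <set>

using Move = int;                // assumption: an intersection index
using MoveSet = std::set<Move>;

// Hypothetical result of a threat search run after one defensive move.
struct ThreatResult {
  bool wins;         // true if the opponent still has a winning sequence
  MoveSet defenses;  // potential defenses against that sequence
};
using ThreatSearch = std::function<ThreatResult(Move)>;

// Starts from the potential defenses returned by the first threat search
// and narrows them down. An empty result means the position is lost.
MoveSet valid_defenses(const MoveSet& potential, const ThreatSearch& search_after) {
  MoveSet candidates = potential;
  MoveSet valid;
  while (!candidates.empty()) {
    const Move move = *candidates.begin();
    candidates.erase(candidates.begin());

    const ThreatResult other = search_after(move);
    if (!other.wins) {
      valid.insert(move);  // no attack left: this defense works
      continue;
    }
    // The move fails. Anything that works must also defend against the
    // newly found sequence, so intersect the remaining candidates with it.
    MoveSet narrowed;
    for (Move m : candidates) {
      if (other.defenses.count(m)) narrowed.insert(m);
    }
    candidates.swap(narrowed);
  }
  return valid;
}
```

Each failed candidate shrinks the remaining set, so the loop runs the threat search at most once per potential defense.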

Null move forward pruning

Some moves don’t create any potential winning threats. Often these are weak moves. We want to try to prune these from the search tree.

The way this is implemented is similar to the null-move heuristic often used in chess programs.

If there is no threat sequence for the opponent in a node, there is no immediate danger. In that case, we first try a null move, i.e. no move at all, and run a shallower search. If this shallow search returns a good score (a beta cutoff in the alpha-beta algorithm), we assume that the position is really good for the player to move. It probably is, since he doesn’t even have to move to get a good score. So, in that case, we skip the full-depth search of this subtree and simply give the node a good score.
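To make the idea concrete, here is a minimal negamax sketch with a null-move step. It is written against an assumed `Position` interface (listed in the comment) and uses plain alpha-beta rather than Principal Variation Search; the depth reduction `kNullReduction` is a made-up value, not the one used by the bot.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical interface assumed of Position:
//   int evaluate() const;             // score from the side to move
//   std::vector<int> moves() const;   // moves to consider
//   void make(int m);  void unmake(int m);
//   void make_null();  void unmake_null();   // pass the turn
//   bool opponent_has_threat() const;        // result of the threat search

constexpr int kNullReduction = 2;  // assumed depth reduction

template <class Position>
int search_with_null_move(Position& pos, int depth, int alpha, int beta,
                          bool allow_null = true) {
  if (depth <= 0) return pos.evaluate();
  const std::vector<int> moves = pos.moves();
  if (moves.empty()) return pos.evaluate();

  // No immediate danger: try passing and search a shallower tree.
  if (allow_null && depth > kNullReduction && !pos.opponent_has_threat()) {
    pos.make_null();
    const int score = -search_with_null_move(
        pos, depth - 1 - kNullReduction, -beta, -beta + 1, false);
    pos.unmake_null();
    if (score >= beta) return beta;  // good even without moving: prune
  }

  for (int m : moves) {
    pos.make(m);
    const int score = -search_with_null_move(pos, depth - 1, -beta, -alpha);
    pos.unmake(m);
    if (score >= beta) return beta;  // ordinary beta cutoff
    alpha = std::max(alpha, score);
  }
  return alpha;
}
```

The `allow_null` flag prevents two null moves in a row, which would otherwise just hand the turn back and forth.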

Panic mode

Suppose we completed our search to depth N, and found a drawish score. Then we start a depth-(N+1) search, and consider the currently best move. It turns out that the move loses! Oops! So we start trying other moves. But at this point we run out of allocated time for the move. What to do?

We enter panic mode.

In panic mode, we ignore whatever time we had allocated for this move, and continue searching other moves. We continue looking until we have found a move that doesn’t lose, or we have tried every possible move and they all lose, or we have really run out of almost all of the time on the clock.

Opening book

We only use an opening book when we are playing black. The black player chooses the opening, i.e. the first three moves. I picked that opening manually, trying to put the moves in the center of the board to create a nice fight, and to make chances for the two players as equal as possible, according to my bot’s evaluation (that’s what we want because of the swap rule).

Using a simple algorithm by Lincke², I automatically constructed an opening book for my chosen opening. The book contains 1379 positions that were analyzed off-line by the program.

Conclusion

This was a fun coding exercise!

Potential ideas I had that I didn’t have time to try:

Better static evaluation, using machine learning.
Proof number search instead of alpha-beta pruning. Proof number search is probably well suited for gomoku, because it is a very tactical game. It would allow searching some tactical lines much deeper than others. Interestingly, this algorithm was also designed by Victor Allis in the same Ph.D. thesis!¹
Instead of one evaluation function, use two evaluation functions, showing potential for each player separately. Some positions are very “offensive” (which would be a high score for both players) and some are “defensive”, locked-down (low score for both players). When looking for a win for one side (as in proof number search), one player should prioritize offensive positions, and the other player should try to defuse the situation by reducing the offensive potential of the opponent.

Maybe I’ll try these in the future if there is another Gomoku AI tournament.

References

1. Allis, Louis Victor. Searching for solutions in games and artificial intelligence. Wageningen: Ponsen & Looijen, 1994.
2. Lincke, Thomas R. “Strategies for the automatic construction of opening books.” International Conference on Computers and Games. Springer, Berlin, Heidelberg, 2000.

Radix sort: sorting integers (often) faster than std::sort.

2015-09-26T00:00:00-07:00

This post will describe a very simple integer sorting algorithm: radix sort. Despite its simplicity, it is a very practical algorithm. As we will see, even a simple implementation can easily outperform std::sort from the C++ standard library.

It is also interesting theoretically, since its running time is in some cases better than that of standard comparison-based sorting. As we’ll see below, the running time for sorting n w-bit integers is:

$$\Theta\left(n \frac{w}{\log n}\right)$$

Algorithm

Suppose we start with an array of numbers such as this:

853, 872, 265, 238, 199, 772, 584, 204, 480, 173,
499, 349, 308, 314, 317, 186, 825, 398, 899, 161

Counterintuitively, we begin by sorting it based on the least significant decimal digit:

480, 161, 872, 772, 853, 173, 584, 204, 314, 265,
825, 186, 317, 238, 308, 398, 199, 499, 349, 899

Now, we sort it based on the middle decimal digit. But we take care that we do this in a stable fashion, that is: for numbers that are tied on the middle digit, keep them in the current order.

204, 308, 314, 317, 825, 238, 349, 853, 161, 265,
872, 772, 173, 480, 584, 186, 398, 199, 499, 899

The numbers are now sorted by the last two digits. It is not hard to guess what we will do next. Once we have sorted them by the most significant digit, taking care not to change the order in case of ties, we will have sorted the array.
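As a quick way to convince yourself that the digit-by-digit approach works, here is a small sketch that uses `std::stable_sort` for each round. The function name `sort_by_decimal_digits` is made up for this illustration, and `std::stable_sort` only stands in for the per-digit sort; it is not how radix sort is implemented in practice.

```cpp
#include <algorithm>
#include <vector>

// Sorts non-negative integers with at most num_digits decimal digits by
// stably sorting on each digit, least significant first.
std::vector<int> sort_by_decimal_digits(std::vector<int> data, int num_digits) {
  int divisor = 1;
  for (int d = 0; d < num_digits; ++d) {
    std::stable_sort(data.begin(), data.end(), [divisor](int a, int b) {
      return (a / divisor) % 10 < (b / divisor) % 10;
    });
    divisor *= 10;
  }
  return data;
}
```

Replacing `std::stable_sort` with `std::sort` here would break the algorithm, since an unstable sort is free to reorder numbers that are tied on the current digit.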

I haven’t said how exactly we perform the sorting based on a single digit, so let’s do this last round slowly. We use count-sort. Here is how it works:

Step 1. Go through the data and count how many times each top digit appears. 0 appears 0 times, 1 appears 4 times, etc.:

count: 0, 4, 3, 5, 2, 1, 0, 1, 4, 0

Step 2. Compute prefix sums in count. This will give us, for each digit, the index of the first entry with that digit in the final sorted order.

position: 0, 0, 4, 7, 12, 14, 15, 15, 16, 20

For instance, we now know that numbers starting with the digit 4 will begin at index 12.

Step 3. Shuffle the data. For each number, we simply place it directly at the correct position! After placing a number we increment the position for the given digit.

X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 4, 7, 12, 14, 15, 15, 16, 20
X, X, X, X, 204, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 5, 7, 12, 14, 15, 15, 16, 20
X, X, X, X, 204, X, X, 308, X, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 5, 8, 12, 14, 15, 15, 16, 20
X, X, X, X, 204, X, X, 308, 314, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 5, 9, 12, 14, 15, 15, 16, 20
...
161, 173, 186, 199, 204, 238, 265, 308, 314, 317,
349, 398, 480, 499, 584, 772, 825, 853, 872, 899

Step 4. We have shuffled the data into a new, temporary array. Move it back to the original array. In practice, we can simply swap pointers here.
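The four steps can be sketched for decimal digits like this. It is an illustrative helper only (the name `count_sort_by_digit` and the `divisor` parameter are my own), and it assumes non-negative numbers.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One stable count-sort pass on the decimal digit selected by divisor
// (1 = last digit, 10 = middle digit, 100 = top digit of a 3-digit number).
std::vector<int> count_sort_by_digit(const std::vector<int>& data, int divisor) {
  // Step 1: count how many times each digit appears.
  std::array<std::size_t, 10> position{};
  for (int x : data) ++position[(x / divisor) % 10];

  // Step 2: prefix sums turn counts into starting positions.
  std::size_t sum = 0;
  for (std::size_t& p : position) {
    const std::size_t count = p;
    p = sum;
    sum += count;
  }

  // Step 3: place each number at its position, then advance that position.
  std::vector<int> shuffled(data.size());
  for (int x : data) shuffled[position[(x / divisor) % 10]++] = x;

  // Step 4: the caller moves (or swaps) the result back into place.
  return shuffled;
}
```

Calling it with divisors 1, 10 and 100 in turn reproduces the three rounds from the example above.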

Running time

Of course, in practice we don’t sort based on decimal digits. We could sort based on individual bits, but we can do better than that: sort based on groups of bits.

If we sort k bits at a time, there are $2^k$ possible “digits”. The count array will need to be of that length. Hence, let’s make $k \le \log_2 n$, so that the helper array isn’t longer than the data being sorted.

For added performance, it may be useful to make k somewhat smaller than $\log_2 n$. In our implementation below, we use $k = \lfloor \frac{1}{3}\log_2 n \rfloor$. This increases the number of rounds 3-fold, but has several advantages that outweigh that:

The count array only uses $n^{1/3}$ memory.
Computing prefix sums in step 2 takes negligible time.
Counting in step 1 doesn’t randomly increment counters all over memory. It randomly increments counters in a tiny section of memory, which is good for cache performance.
Shuffling in step 3 doesn’t randomly write all over memory. It writes consecutively in only $n^{1/3}$ different locations at a time, which also improves cache performance.

For the purposes of analyzing asymptotic performance, we simply say: $k = \Theta(\log n)$.
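To get a feeling for the numbers, the choice of k and the resulting number of rounds can be computed with a couple of small helpers. The names `bits_per_round` and `num_rounds` are made up for this sketch, and k is clamped to at least 1 so that tiny inputs still make progress.

```cpp
#include <cstddef>
#include <limits>

// k = floor(log2(n) / 3), but never less than 1.
int bits_per_round(std::size_t n) {
  constexpr int size_bits = std::numeric_limits<std::size_t>::digits;
  int k = 1;
  // Grow k while 2^(3(k+1)) <= n; the first test avoids an overflowing shift.
  while (3 * (k + 1) < size_bits &&
         (std::size_t(1) << (3 * (k + 1))) <= n) {
    ++k;
  }
  return k;
}

// Number of rounds needed to cover word_bits bits, k bits at a time.
int num_rounds(int word_bits, int k) {
  return (word_bits + k - 1) / k;
}
```

For example, with n = 1 000 000 this gives k = 6, so sorting 64-bit integers takes 11 rounds.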

If the numbers being sorted have w bits, we have $\Theta(\frac{w}{\log n})$ rounds. Each round is done in $\Theta(n)$ time, hence the total running time of radix sort is:

$$\Theta\left(n \frac{w}{\log n}\right)$$

Note what this means. The larger the n, the less time we spend per element! This is in contrast with comparison-based sorts, where we spend $\Theta(\log n)$ per element, which increases with n.

This indicates that there is a threshold: for small n it is better to use a comparison-based sort. For large n, radix sort is better.

What is the threshold? It should be about when $\frac{w}{\log n} \approx \log n$, that is, when $n \approx 2^{w^{1/2}}$.

For instance, when w=64, $n \approx 2^8 = 256$ or so should be the threshold. If n is significantly bigger than this, radix sort should start to dominate.

C++ implementation

#include <cstddef>
#include <limits>
#include <vector>

using namespace std;

template<class T>
void radix_sort(vector<T> &data) {
  static_assert(numeric_limits<T>::is_integer &&
                !numeric_limits<T>::is_signed,
                "radix_sort only supports unsigned integer types");
  constexpr int word_bits = numeric_limits<T>::digits;

  // max_bits = floor(log n / 3)
  // num_groups = ceil(word_bits / max_bits)
  int max_bits = 1;
  while ((size_t(1) << (3 * (max_bits+1))) <= data.size()) {
    ++max_bits;
  }
  const int num_groups = (word_bits + max_bits - 1) / max_bits;

  // Temporary arrays.
  vector<size_t> count;
  vector<T> new_data(data.size());

  // Iterate over bit groups, starting from the least significant.
  for (int group = 0; group < num_groups; ++group) {
    // The current bit range.
    const int start = group * word_bits / num_groups;
    const int end = (group+1) * word_bits / num_groups;
    const T mask = (size_t(1) << (end - start)) - T(1);

    // Count the values in the current bit range.
    count.assign(size_t(1) << (end - start), 0);
    for (const T &x : data) ++count[(x >> start) & mask];

    // Compute prefix sums in count.
    size_t sum = 0;
    for (size_t &c : count) {
      size_t new_sum = sum + c;
      c = sum;
      sum = new_sum;
    }

    // Shuffle data elements.
    for (const T &x : data) {
      size_t &pos = count[(x >> start) & mask];
      new_data[pos++] = x;
    }

    // Move the data to the original array.
    data.swap(new_data);
  }
}

Experiments

I generated arrays of random 64-bit integers and measured the time per element it takes to sort them using std::sort and radix_sort.

n	`std::sort`	`radix_sort`
10	3.3 ns	284.2 ns
100	6.1 ns	91.6 ns
1 000	19.3 ns	59.8 ns
10 000	54.8 ns	46.8 ns
100 000	66.9 ns	40.1 ns
1 000 000	81.1 ns	40.8 ns
10 000 000	95.1 ns	40.7 ns
100 000 000	108.4 ns	40.6 ns

We see the predicted effect: for std::sort, the running time per element increases with n, while for radix_sort it decreases with n and then levels off at about 40 ns. The trends are not exactly proportional and inversely proportional to $\log n$, respectively, due to various effects (mostly cache sizes), but they are there. Most importantly: for large n, radix_sort is clearly winning!
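A measurement like the one above could be set up along these lines. This is a sketch and not the harness that produced the table: the helper names, the use of `std::mt19937_64` as the data source, and the single-run timing are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fills a vector with n pseudorandom 64-bit integers (assumed generator).
std::vector<std::uint64_t> random_data(std::size_t n, std::uint64_t seed) {
  std::mt19937_64 rng(seed);
  std::vector<std::uint64_t> data(n);
  for (std::uint64_t& x : data) x = rng();
  return data;
}

// Runs sorter(data) once and returns the elapsed nanoseconds per element.
// The data is sorted in place so the caller can verify the result.
template <class Sorter>
double ns_per_element(std::vector<std::uint64_t>& data, Sorter sorter) {
  const auto start = std::chrono::steady_clock::now();
  sorter(data);
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return data.empty() ? 0.0 : elapsed.count() / static_cast<double>(data.size());
}
```

For small n a single run is too short to time reliably, so in practice one would repeat the measurement on fresh copies of the data and average.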

Further optimizations

Further optimizations could improve performance. Some ideas:

Optimize the number of rounds as a function of n. Taking $\frac{1}{3} \log n$ bits at a time is a rough guess at what should work well.
Currently we scan the data array twice in each iteration: once to count, a second time to shuffle. It can be reduced to a single scan: while shuffling based on the current digit, we could also be counting the next digit at the same time.

These tweaks might improve the algorithm by a constant factor. Some time in the future I will describe how to get a better asymptotic running time. Until then!
