cryptography

Experiment with side-channel attacks yourself!

In past articles I wrote about how Python is useless for side-channel-free programming and how you can debug code written with side-channel-free processing in mind. But you may ask how realistic such attacks are, or how side-channel attacks work in the first place. In this article I’ll show a simple network server, a script you can use to attack it, and a few ideas for expanding your side-channel knowledge.

I’ll skip the theory of statistical analysis and the practical details of executing the test script reliably; if you’re interested in those, see the debugging code article and the tlsfuzzer documentation. That being said, to understand what we’ll be doing, you need to know that small p-values indicate a departure from the expected result (where the expected result in our case is “no discernible difference based on input”) and what a 95% confidence interval means.

With that out of the way, let’s look at a server.

Server

In this example we’ll consider a simple server that expects a password and, if the password is correct, replies with some kind of secret data. To make execution of the tests easier and interpretation of the graphs clearer, let’s assume that the password in question contains only lower-case Latin letters, i.e. ‘a’ to ‘z’.

The server code and test client are available as a gist; you can compile the server with gcc:

gcc password_server.c -o password_server

Then execute it by running:

./password_server

It will listen on port 5001 on all network interfaces.

The critical part of it is the loop that compares the characters of the provided password (data) against the expected password (password) until either string ends (with a newline or a null character):

/* check if all the letters match */
for (a=data, b=password; *a != '\n' && *a != '\x00' && *b != '\n' && *b != '\x00'; a++, b++) {
    if (*a != *b) {
        goto end;
    }
}
/* check if length matches */
if (!((*a == '\n' || *a == '\x00') && (*b == '\n' || *b == '\x00'))) {
    goto end;
}

Now, since it aborts as soon as a letter differs, it will provide a side-channel signal. The question is: how big is it, and can we use it to attack the server?

Test client

While running the server is fairly simple, getting the client script running is a bit more involved. You’ll need to have Python available (new enough for scipy and numpy to work) and permission to run the tcpdump packet capture (the easiest way is to run the script as root).

First of all, download the tlsfuzzer repository:

git clone https://github.com/tlsfuzzer/tlsfuzzer.git

Then create a virtual environment and install the dependencies needed to run the script and the statistical analysis:

python3 -m venv tlsfuzz-venv
tlsfuzz-venv/bin/pip install -r tlsfuzzer/requirements.txt -r tlsfuzzer/requirements-timing.txt

Then, either as a user with permission to run tcpdump or as root, you can execute the test script against the server:

PYTHONPATH=./tlsfuzzer/ ./tlsfuzz-venv/bin/python3 ./test-passwd-guess.py -i lo -o /tmp --repeat 1000 --no-quickack

The PYTHONPATH environment variable tells Python where to find the tlsfuzzer library; we use python3 from inside the virtual environment so that it has access to the installed dependencies. The meaning of the options is as follows: -i lo specifies the interface on which the packet capture needs to run (you will need to inspect the output of ip addr ls to learn the name of the external-facing interface when running the test over a real network), -o /tmp specifies where to keep the temporary files and the results, --repeat 1000 specifies that every letter needs to be used in 1000 connections, and finally --no-quickack tells the script not to expect the TCP QUICKACK feature to be in use.

Internally, the script does something fairly simple: it connects to the server, sends a password guess, expects an error, and then a connection close. The password guesses are just the 26 letters of the alphabet. In addition to that we also have a canary (or placebo) probe: a second probe with the letter ‘a’. By sending the probes in random order and measuring the response time from the server, we can effectively see whether the server accepted the first letter of the password or not (as testing two letters for equality will take more time than testing just one).
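To get a feel for what such a probe looks like, here is a minimal sketch of the idea in plain Python. This is not the actual test-passwd-guess.py: the real script sends every probe thousands of times in random order and takes its timings from the tcpdump packet capture rather than from the client process. The localhost host name is an assumption (matching the -i lo run above); the port and the newline-terminated guess come from the server description.

import socket
import time

def time_guess(guess, host="localhost", port=5001):
    """Open a connection, send one guess, and time the server's reply."""
    with socket.create_connection((host, port)) as s:
        start = time.perf_counter()
        s.sendall(guess.encode("ascii") + b"\n")
        s.recv(1024)  # wait for the error reply (or the connection close)
        return time.perf_counter() - start

# the 26 letters plus a second 'a' acting as the canary/placebo probe
probes = ["a"] + [chr(c) for c in range(ord("a"), ord("z") + 1)]
timings = [(letter, time_guess(letter)) for letter in probes]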

Results

If you execute those commands on a typical system you will likely see the script report “No statistically significant difference detected”, or, in more detail, a summary that looks something like this:

Sign test mean p-value: 0.5348, median p-value: 0.5464, min p-value: 0.0009225
Friedman test (chisquare approximation) for all samples
p-value: 0.669298684947114
Worst pair: 5(e), 22(v)
Mean of differences: 6.99390e-06s, 95% CI: -1.42111e-06s, 2.193720e-05s (±1.168e-05s)
Median of differences: 4.00000e-08s, 95% CI: 1.00000e-08s, 7.000000e-08s (±3.000e-08s)
Trimmed mean (5%) of differences: 1.04511e-07s, 95% CI: 4.44867e-08s, 1.665733e-07s (±6.104e-08s)
Trimmed mean (25%) of differences: 7.88520e-08s, 95% CI: 2.81640e-08s, 1.227660e-07s (±4.730e-08s)
Trimmed mean (45%) of differences: 3.86200e-08s, 95% CI: 1.00900e-08s, 6.882000e-08s (±2.936e-08s)
Trimean of differences: 7.26250e-08s, 95% CI: 2.00000e-08s, 1.157937e-07s (±4.790e-08s)
For detailed report see /tmp/test-passwd-guess.py_v1_1697395170/report.csv
No statistically significant difference detected

With accompanying graph looking like this:

[Graph: confidence intervals of the differences for the 1000-connection run]

Does that mean that there is no side-channel leakage? No. As the 95% confidence intervals of both the median and the trimmed means are in the range of tens of nanoseconds (30ns and 29ns respectively), we cannot exclude the possibility of a side channel. This test was executed on an AMD Ryzen 5 5600X that can boost up to 4.6GHz, which means that the smallest side channel it can have (a single clock cycle) is on the order of 0.2ns. So a small side channel would be undetectable with such a small sample.

OK, let’s execute the same test again, but with the --repeat 100000 option. Now, after about 10 minutes of execution time, the test reports failure (it’s a failure as a correctly implemented server wouldn’t have a side-channel leak), with overall statistics that look like this:

Sign test mean p-value: 0.42, median p-value: 0.4144, min p-value: 1.545e-31
Friedman test (chisquare approximation) for all samples
p-value: 4.028330499541504e-44
Worst pair: 13(m), 15(o)
Mean of differences: 6.71613e-08s, 95% CI: -7.92283e-07s, 9.030701e-07s (±8.477e-07s)
Median of differences: -2.00000e-08s, 95% CI: -2.00000e-08s, -1.900000e-08s (±5.000e-10s)
Trimmed mean (5%) of differences: -1.81057e-08s, 95% CI: -2.57158e-08s, -1.146633e-08s (±7.125e-09s)
Trimmed mean (25%) of differences: -1.80656e-08s, 95% CI: -2.33658e-08s, -1.299852e-08s (±5.184e-09s)
Trimmed mean (45%) of differences: -1.89289e-08s, 95% CI: -2.21798e-08s, -1.605050e-08s (±3.065e-09s)
Trimean of differences: -1.77500e-08s, 95% CI: -2.25000e-08s, -1.675000e-08s (±2.875e-09s)
For detailed report see /tmp/test-passwd-guess.py_v1_1697385326/report.csv

The Friedman test has a statistically significant p-value, and the smallest sign test p-value is also small enough to be significant.

But the graphs of the confidence intervals of the means are still not clear:

[Graph: confidence intervals of the means for the 100000-connection run]

That’s because I’ve executed the script on a machine with no special configuration, with other workloads running at the same time (I had no core isolation set up and I was watching YouTube in the meantime). If we take a look at the ECDF of the collected measurements, we’ll see that there’s a huge spread in the calculated differences (over 7 orders of magnitude larger than the difference we want to measure):

[Graph: ECDF of the measured differences]

And looking at the heatmap of the differences, the data not only has long and heavy tails, it is also clearly multi-modal (in this case we see 3 clear modes):

[Graph: heatmap of the measured differences]

Thankfully, there are simple ways to work with such data: we can either look at the confidence interval of the median (here it’s limited by the resolution of the clock source the kernel decided to use):

[Graph: confidence intervals of the median differences]

Or at the confidence interval of the trimmed means:

[Graph: confidence intervals of the trimmed mean differences]

In both of those graphs, class 13 is clearly the outlier. By looking at the legend.csv file or at the report, we can deduce that class 13 corresponds to ‘m’, so the first letter of the password is “m”.

With this information we can ask the script to start looking for the second letter:

PYTHONPATH=./tlsfuzzer/ ./tlsfuzz-venv/bin/python3 ./test-passwd-guess.py -i lo -o /tmp --repeat 100000 --no-quickack --base m

That will provide us with a summary like this:

Sign test mean p-value: 0.4052, median p-value: 0.4143, min p-value: 3.179e-25
Friedman test (chisquare approximation) for all samples
p-value: 2.0606607949733857e-53
Worst pair: 1(a - canary), 20(t)
Mean of differences: -4.81858e-08s, 95% CI: -6.41017e-07s, 5.886469e-07s (±6.148e-07s)
Median of differences: -1.90000e-08s, 95% CI: -2.00000e-08s, -1.000000e-08s (±5.000e-09s)
Trimmed mean (5%) of differences: -1.38332e-08s, 95% CI: -2.02997e-08s, -7.578911e-09s (±6.360e-09s)
Trimmed mean (25%) of differences: -1.68099e-08s, 95% CI: -2.17857e-08s, -1.185134e-08s (±4.967e-09s)
Trimmed mean (45%) of differences: -1.55797e-08s, 95% CI: -1.82030e-08s, -1.287140e-08s (±2.666e-09s)
Trimean of differences: -1.62500e-08s, 95% CI: -2.00000e-08s, -1.000000e-08s (±5.000e-09s)
For detailed report see /tmp/test-passwd-guess.py_v1_1697386248/report.csv

And a graph like this:

[Graph: confidence intervals of the differences for the second-letter run]

Does that mean that all letters are valid? No. Since the script uses ‘a’ both as the baseline (class 0) and as the placebo/canary probe (class 1), and in the graph we’re comparing all classes to the baseline, only the canary probe will match it. Which means that the second letter of the password is ‘a’.

Exercises for the reader

We’ve cracked the first two letters of the password in a mere 20 minutes of machine time, despite the side channel being equivalent to around 75 CPU clock cycles. Given that, you can try a few things to get a better feel for side-channel attacks:

Try to recreate this setup yourself and decode the remaining letters of the password. As I wrote, the only special thing you need is permission to execute tcpdump; any additional configuration will just reduce the number of connections required.

Take the server application and execute it on a different system: either one connected to the same network switch or one a few network hops away (use the -h option of the script to specify a remote host). How does the sample size necessary to detect the side channel change?

Look at the statistical results of the sign test and the Wilcoxon signed-rank test in the report.csv file. How do they correlate with the graphs? Note: keep in mind the Bonferroni correction.

Run the server on a specific CPU core together with some other heavy load (like sha1sum /dev/zero). Does that hide the side channel?

Try to modify the server code so that it processes the password in a side-channel-free way (see the sketch after this list for the general idea).

Experiment with adding a random delay to processing. Why doesn’t that help to hide the side-channel signal?
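For the side-channel-free processing exercise, the usual approach is to make the amount of work independent of the input. Here is a minimal sketch of the idea, shown in Python only for brevity (the real fix would be made in the C server, and, as the earlier article argued, Python itself is a poor fit for truly side-channel-free code; the standard library’s hmac.compare_digest exists for this purpose):

def passwords_equal(guess: bytes, password: bytes, max_len: int = 64) -> bool:
    """Compare two passwords in time independent of where they first differ."""
    # pad both inputs to a fixed length so the loop always does the same work
    a = guess.ljust(max_len, b"\x00")[:max_len]
    b = password.ljust(max_len, b"\x00")[:max_len]
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y   # accumulate differences instead of returning early
    return diff == 0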

Safe primes in Diffie-Hellman

The Diffie-Hellman key agreement protocol uses modular exponentiation and calls for the use of special prime numbers. If you’ve ever wondered why, I’ll try to explain.

Diffie-Hellman key agreement

The “classical” Diffie-Hellman key exchange, also known as Finite Field Diffie-Hellman, uses one type of operation (modular exponentiation) and two secrets for two communication peers to arrive at a single shared secret.

The protocol requires a prime number and a so-called “generator” number to be known to both peers. This is usually achieved either by the server sending those values to the client (e.g. in TLS before 1.3 or in some SSH key exchange types) or by distributing them ahead of time to the peers (e.g. in IPsec and TLS 1.3).

When both peers know which parameters to use, each generates a random number, performs modular exponentiation with this random number, the group generator and the prime, and sends the result to the other party as a “key share”. The other party takes this key share and uses it as the generator to perform modular exponentiation again. The result of that second operation is the agreed shared secret.

If we define g as the generator, p as the prime, S_r as the server-selected random, S_k as the server key share, C_r as the client-selected random, C_k as the client key share, and SK as the agreed-upon secret, the server performs the following operations:

S_k = g^{S_r} mod p
SK = C_k^{S_r} mod p

The client performs the following operations:

C_k = g^{C_r} mod p
SK = S_k^{C_r} mod p

Because (g^{C_r})^{S_r} mod p = (g^{S_r})^{C_r} mod p = SK, both parties arrive at the same SK.
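Using Python’s built-in three-argument pow() we can check this on toy parameters (illustrative numbers only; real Diffie-Hellman uses primes of 2048 bits or more and randomly generated secrets):

p = 23       # the prime
g = 5        # the generator
S_r = 6      # server's secret random value
C_r = 15     # client's secret random value

S_k = pow(g, S_r, p)          # server key share, sent to the client
C_k = pow(g, C_r, p)          # client key share, sent to the server

SK_server = pow(C_k, S_r, p)  # server's view of the shared secret
SK_client = pow(S_k, C_r, p)  # client's view of the shared secret
assert SK_server == SK_client # both sides arrive at the same SK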

Unfortunately both peers need to operate on a value provided by the other party (not necessarily trusted or authenticated) and on their secret value at the same time. This calls for the prime number used to have some special properties.

Modular exponentiation

The basic operation we’ll be dealing with is modular exponentiation. A simple way to explain it is that we take a number and raise it to some power. Then we take that result and divide it by a third number. The remainder of that division is our result.

For 2^10 mod 12, the calculation goes as follows. First, the exponentiation:

2^10 = 1024

Then division:

1024 = 85*12 + 4

So the result is 4.
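In Python, the built-in pow() function accepts an optional third argument that performs exactly this operation:

print(pow(2, 10, 12))   # prints 4, same as (2 ** 10) % 12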

One of the interesting properties of modular exponentiation is that it is cyclic. If we take a base number and start raising it to higher and higher powers, we will keep getting the same numbers in the same order:

$ python powmod.py -g 3 -m 14 -e 28
3^0  mod 14 = 1
3^1  mod 14 = 3
3^2  mod 14 = 9
3^3  mod 14 = 13
3^4  mod 14 = 11
3^5  mod 14 = 5
3^6  mod 14 = 1
3^7  mod 14 = 3
3^8  mod 14 = 9
3^9  mod 14 = 13
3^10 mod 14 = 11
3^11 mod 14 = 5
3^12 mod 14 = 1
3^13 mod 14 = 3
3^14 mod 14 = 9
3^15 mod 14 = 13
3^16 mod 14 = 11
3^17 mod 14 = 5
3^18 mod 14 = 1
3^19 mod 14 = 3
3^20 mod 14 = 9
3^21 mod 14 = 13
3^22 mod 14 = 11
3^23 mod 14 = 5
3^24 mod 14 = 1
3^25 mod 14 = 3
3^26 mod 14 = 9
3^27 mod 14 = 13

This comes from the fact that in modular arithmetic, for addition, subtraction, multiplication and exponentiation, the order in which the modulo operations are performed does not matter: (a + b) mod c is equal to ((a mod c) + (b mod c)) mod c. Thus, if we try to calculate the example of 3^17 mod 14, we can write it down as ((3^6 mod 14) * (3^6 mod 14) * (3^5 mod 14)) mod 14. The calculation is then reduced to 1 * 1 * 3^5 mod 14.
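We can verify that decomposition quickly:

assert pow(3, 17, 14) == (pow(3, 6, 14) * pow(3, 6, 14) * pow(3, 5, 14)) % 14
print(pow(3, 17, 14))   # prints 5, i.e. 1 * 1 * 3^5 mod 14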

The inverse of modular exponentiation is the discrete logarithm, in which, for a given base and modulus, we look for the exponent that results in a given number:

g^x mod m = n

Where g, m and n are given, we’re looking for x.

The fact that there are no fast algorithms for calculating the discrete logarithm is one of the reasons we can use modular exponentiation as the basis of the Diffie-Hellman algorithm.
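For the toy moduli used in this article a brute-force search is enough; for the group sizes used in real Diffie-Hellman it is computationally infeasible, which is the whole point. A minimal sketch:

def discrete_log(g, n, m):
    """Find x such that g^x mod m == n, by trying every exponent in turn."""
    value = 1
    for x in range(m):
        if value == n:
            return x
        value = (value * g) % m
    return None   # n is not in the group generated by g

print(discrete_log(3, 5, 14))   # prints 5, since 3^5 mod 14 = 5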

Cyclic groups

Let’s see what happens if we start calculating the results of modular exponentiation modulo 14 with different bases:

$ python groups.py -m 14
Groups modulo 14
g:  0, [1, 0]
g:  1, [1]
g:  2, [1, 2, 4, 8]
g:  3, [1, 3, 9, 13, 11, 5]
g:  4, [1, 4, 2, 8]
g:  5, [1, 5, 11, 13, 9, 3]
g:  6, [1, 6, 8]
g:  7, [1, 7]
g:  8, [1, 8]
g:  9, [1, 9, 11]
g: 10, [1, 10, 2, 6, 4, 12, 8]
g: 11, [1, 11, 9]
g: 12, [1, 12, 4, 6, 2, 10, 8]
g: 13, [1, 13]

None of the numbers can generate all of the numbers smaller than the modulus. In other words, there is no generator (in the number-theoretic sense) that generates the whole group.
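As an aside, here is a minimal sketch of what a script like groups.py could be doing (an assumed re-implementation for illustration, not necessarily the original): for every base it lists the values produced by successive powers until the cycle starts repeating.

def cycle(g, m):
    """Return the values g^0, g^1, ... mod m until they start repeating."""
    seen, values, value = set(), [], 1
    while value not in seen:
        seen.add(value)
        values.append(value)
        value = (value * g) % m
    return values

m = 14
print(f"Groups modulo {m}")
for g in range(m):
    print(f"g: {g:2}, {cycle(g, m)}")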

To find such numbers, we need to start looking at prime numbers.

Cyclic groups modulo prime

Let’s see what happens for 13:

$ python groups.py -m 13
Groups modulo 13
g:  0, [1, 0]
g:  1, [1]
g:  2, [1, 2, 4, 8, 3, 6, 12, 11, 9, 5, 10, 7]
g:  3, [1, 3, 9]
g:  4, [1, 4, 3, 12, 9, 10]
g:  5, [1, 5, 12, 8]
g:  6, [1, 6, 10, 8, 9, 2, 12, 7, 3, 5, 4, 11]
g:  7, [1, 7, 10, 5, 9, 11, 12, 6, 3, 8, 4, 2]
g:  8, [1, 8, 12, 5]
g:  9, [1, 9, 3]
g: 10, [1, 10, 9, 12, 3, 4]
g: 11, [1, 11, 4, 5, 3, 7, 12, 2, 9, 8, 10, 6]
g: 12, [1, 12]

The obvious result is that we now have 4 generators: 2, 6, 7 and 11 generate the whole group.

But there is another result hiding here. Let’s look at the results for 19, but with the sizes of those groups shown:

$ python groups-ann.py -m 19
Groups modulo 19
g:   0, len:   2, [1, 0]
g:   1, len:   1, [1]
g:   2, len:  18, [1, 2, 4, 8, 16, 13, 7, 14, 9, 18, 17, 15, 11, 3, 6, 12, 5, 10]
g:   3, len:  18, [1, 3, 9, 8, 5, 15, 7, 2, 6, 18, 16, 10, 11, 14, 4, 12, 17, 13]
g:   4, len:   9, [1, 4, 16, 7, 9, 17, 11, 6, 5]
g:   5, len:   9, [1, 5, 6, 11, 17, 9, 7, 16, 4]
g:   6, len:   9, [1, 6, 17, 7, 4, 5, 11, 9, 16]
g:   7, len:   3, [1, 7, 11]
g:   8, len:   6, [1, 8, 7, 18, 11, 12]
g:   9, len:   9, [1, 9, 5, 7, 6, 16, 11, 4, 17]
g:  10, len:  18, [1, 10, 5, 12, 6, 3, 11, 15, 17, 18, 9, 14, 7, 13, 16, 8, 4, 2]
g:  11, len:   3, [1, 11, 7]
g:  12, len:   6, [1, 12, 11, 18, 7, 8]
g:  13, len:  18, [1, 13, 17, 12, 4, 14, 11, 10, 16, 18, 6, 2, 7, 15, 5, 8, 9, 3]
g:  14, len:  18, [1, 14, 6, 8, 17, 10, 7, 3, 4, 18, 5, 13, 11, 2, 9, 12, 16, 15]
g:  15, len:  18, [1, 15, 16, 12, 9, 2, 11, 13, 5, 18, 4, 3, 7, 10, 17, 8, 6, 14]
g:  16, len:   9, [1, 16, 9, 11, 5, 4, 7, 17, 6]
g:  17, len:   9, [1, 17, 4, 11, 16, 6, 7, 5, 9]
g:  18, len:   2, [1, 18]

Note that all the sizes of those groups are factors of 18 – that is, of p-1.

The third observation we can draw from those results is that, for any number, the size of the group it generates will be at most as large as the size of the group that the number itself belongs to.

With 19, if we take the generator 8, the size of its subgroup is 6. But the sizes of the subgroups generated by 7, 18, 11 and 12 (elements of that subgroup) are respectively 3, 2, 3 and 6.

Thus, not only is the subgroup much smaller than the full group, it is also impossible to “escape” from it.

Safe primes

We saw that for primes, some bases are better than others (look into finite fields to learn more).

As we noticed, the sizes of all groups are factors of the prime less one (see Fermat’s little theorem for a proof of this). Of course, with the exception of 2, all primes are odd numbers, so p-1 will always be divisible by two – for any prime larger than 3 it will be a composite number. But q = (p-1)/2 doesn’t have to be composite. Indeed, primes for which q is also prime are called safe primes.
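A quick way to list which small primes are safe primes (the trial-division primality test here is only suitable for such tiny numbers):

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def is_safe_prime(p):
    # p is a safe prime when q = (p - 1) / 2 is also prime
    return is_prime(p) and is_prime((p - 1) // 2)

print([p for p in range(3, 100) if is_safe_prime(p)])
# [5, 7, 11, 23, 47, 59, 83]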

Let’s see what happens if we calculate groups of such a prime:

$ python groups-ann.py -m 23
Group sizes modulo 23
g:   0, len:   2, [1, 0]
g:   1, len:   1, [1]
g:   2, len:  11, [1, 2, 4, 8, 16, 9, 18, 13, 3, 6, 12]
g:   3, len:  11, [1, 3, 9, 4, 12, 13, 16, 2, 6, 18, 8]
g:   4, len:  11, [1, 4, 16, 18, 3, 12, 2, 8, 9, 13, 6]
g:   5, len:  22, [1, 5, 2, 10, 4, 20, 8, 17, 16, 11, 9, 22, 18, 21, 13, 19, 3, 15, 6, 7, 12, 14]
g:   6, len:  11, [1, 6, 13, 9, 8, 2, 12, 3, 18, 16, 4]
g:   7, len:  22, [1, 7, 3, 21, 9, 17, 4, 5, 12, 15, 13, 22, 16, 20, 2, 14, 6, 19, 18, 11, 8, 10]
g:   8, len:  11, [1, 8, 18, 6, 2, 16, 13, 12, 4, 9, 3]
g:   9, len:  11, [1, 9, 12, 16, 6, 8, 3, 4, 13, 2, 18]
g:  10, len:  22, [1, 10, 8, 11, 18, 19, 6, 14, 2, 20, 16, 22, 13, 15, 12, 5, 4, 17, 9, 21, 3, 7]
g:  11, len:  22, [1, 11, 6, 20, 13, 5, 9, 7, 8, 19, 2, 22, 12, 17, 3, 10, 18, 14, 16, 15, 4, 21]
g:  12, len:  11, [1, 12, 6, 3, 13, 18, 9, 16, 8, 4, 2]
g:  13, len:  11, [1, 13, 8, 12, 18, 4, 6, 9, 2, 3, 16]
g:  14, len:  22, [1, 14, 12, 7, 6, 15, 3, 19, 13, 21, 18, 22, 9, 11, 16, 17, 8, 20, 4, 10, 2, 5]
g:  15, len:  22, [1, 15, 18, 17, 2, 7, 13, 11, 4, 14, 3, 22, 8, 5, 6, 21, 16, 10, 12, 19, 9, 20]
g:  16, len:  11, [1, 16, 3, 2, 9, 6, 4, 18, 12, 8, 13]
g:  17, len:  22, [1, 17, 13, 14, 8, 21, 12, 20, 18, 7, 4, 22, 6, 10, 9, 15, 2, 11, 3, 5, 16, 19]
g:  18, len:  11, [1, 18, 2, 13, 4, 3, 8, 6, 16, 12, 9]
g:  19, len:  22, [1, 19, 16, 5, 3, 11, 2, 15, 9, 10, 6, 22, 4, 7, 18, 20, 12, 21, 8, 14, 13, 17]
g:  20, len:  22, [1, 20, 9, 19, 12, 10, 16, 21, 6, 5, 8, 22, 3, 14, 4, 11, 13, 7, 2, 17, 18, 15]
g:  21, len:  22, [1, 21, 4, 15, 16, 14, 18, 10, 3, 17, 12, 22, 2, 19, 8, 7, 9, 5, 13, 20, 6, 11]
g:  22, len:   2, [1, 22]

The groups look very different from the ones we saw previously: with the exception of bases 0, 1 and p-1, all groups are relatively large – 11 or 22 elements.

One interesting observation we can make about bases that have a group order of 2q is that even exponents will land in the subgroup of size q, while odd exponents will produce elements outside of it. Thus we can say that use of a generator that is part of a group of order 2q will leak the least significant bit of the exponent.
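We can see this with p = 23 (so q = 11) and the full-group generator 5, using the fact that an element belongs to the order-q subgroup exactly when raising it to the power q gives 1 (Euler’s criterion):

p, q, g = 23, 11, 5
for exponent in range(1, 9):
    share = pow(g, exponent, p)
    in_small_group = pow(share, q, p) == 1   # True only for even exponents
    parity = "even" if exponent % 2 == 0 else "odd"
    print(f"exponent {exponent} ({parity}): share {share:2}, "
          f"in order-{q} subgroup: {in_small_group}")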

That’s why protecting against small subgroup attacks with safe primes is so easy: it requires comparing the peer’s key share against just 3 numbers (0, 1 and p-1). It’s also the reason why it’s impossible to “backdoor” the parameters (prime and generator), as every generator is a good generator.
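A sketch of that check (with a safe prime, any other value the peer can send generates a group of order at least q):

def valid_key_share(share, p):
    """Reject the three values that would land the secret in a tiny subgroup."""
    # real code would also verify that the share is in the range 0 < share < p
    return share not in (0, 1, p - 1)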

More nails in the RC4 coffin

Last week Christina Garman, Kenneth G. Paterson and Thyla van der Merwe published new attacks on RC4 in a paper titled Attacks Only Get Better: Password Recovery Attacks Against RC4 in TLS. In it they outline an attack which recovers user passwords in IMAP and HTTP Basic authentication using 2^26 ciphertexts. Previous attacks required about 2^34 ciphertexts.

The other attack, published yesterday at the Black Hat conference, is the Bar Mitzvah attack, which requires about 2^29 ciphertexts.

While connections to relatively few servers (~6% of the Alexa top 1 million TLS-enabled sites) will end up using an RC4 cipher, the 75% market share of RC4 in general is not reassuring.

Mozilla recommends disabling RC4

Mozilla currently recommends using 3DES ciphers instead of RC4 for backwards compatibility with very old systems like Android 2 or Internet Explorer on Windows XP.

The current recommendation comes after similar recommendations from the researchers that discovered the most recent flaws in RC4, from Cisco and Microsoft (including their proposal to the IETF), as well as from Qualys and Bruce Schneier.

The message is clear: don’t use RC4.

If you have, for some reason, been following the Mozilla guide, you don’t have to use this insecure, nearly 30-year-old cipher any more. While you’re changing the cipher suite defaults, consider also updating to a Perfect Forward Secrecy capable configuration.