Articles

Experiment with side-channel attacks yourself!

In past articles I wrote about how Python is useless for side-channel free programming and how you can debug code written with side-channel free processing in mind. But you may ask how realistic attacks like this are, or how side-channel attacks work in the first place. In this article I’ll show a simple network server, a script you can use to attack it, and a few ideas for expanding your side-channel knowledge.

I’ll skip the theory of statistical analysis and the practicalities of executing the test script reliably; if you’re interested in those, see the debugging code article and the tlsfuzzer documentation. That being said, to understand what we’ll be doing, you need to know that small p-values indicate a departure from the expected result (where the expected result in our case is “no discernible difference based on input”) and what a 95% confidence interval means.

With that out of the way, let’s look at a server.

Server

In this example we’ll consider a simple server that expects a password and, if the password is correct, replies with some kind of secret data. To make execution of the tests easier, and interpretation of the graphs clearer, let’s assume that the password in question contains only lower-case Latin letters, i.e. ‘a’ to ‘z’.

The server code and test client are available as a gist; you can compile the server with gcc:

gcc password_server.c -o password_server

Then execute it by running:

./password_server

It will listen on port 5001 on all network interfaces.

The critical part of it is the loop that compares the characters of the provided password (data) against the expected password (password) until either string ends (with a newline or a null character):

/* check if all the letters match */
for (a=data, b=password;
     *a != '\n' && *a != '\x00' && *b != '\n' && *b != '\x00';
     a++, b++) {
    if (*a != *b) {
        goto end;
    }
}
/* check if length matches */
if (!((*a == '\n' || *a == '\x00') && (*b == '\n' || *b == '\x00'))) {
    goto end;
}

Now, since it aborts as soon as a letter differs, it will provide a side-channel signal. The question is: how big is it, and can we attack it?

Test client

While running the server is fairly simple, getting the client script running is a bit more involved. You’ll need to have Python available (new enough for scipy and numpy to work) and permission to run the tcpdump packet capture (the easiest way is to run the script as root).

First of all, download the tlsfuzzer repository:

git clone https://github.com/tlsfuzzer/tlsfuzzer.git

Then create a virtual environment and install the dependencies needed to run the script and the statistical analysis:

python3 -m venv tlsfuzz-venv
tlsfuzz-venv/bin/pip install -r tlsfuzzer/requirements.txt -r tlsfuzzer/requirements-timing.txt

Then, as either a user with permission to run tcpdump or as root, you can execute the test script against the server:

PYTHONPATH=./tlsfuzzer/ ./tlsfuzz-venv/bin/python3 ./test-passwd-guess.py -i lo -o /tmp --repeat 1000 --no-quickack

The PYTHONPATH environment variable tells Python where to find the tlsfuzzer library; we use the python3 from inside the virtual environment so that the script has access to the installed dependencies. The meaning of the options is as follows: -i lo specifies the interface on which the packet capture needs to be executed (you will need to inspect the output of ip addr ls to learn the name of the external-facing interface when running the test over a real network), -o /tmp specifies where to keep the temporary files and the results, --repeat 1000 specifies that every letter needs to be used in 1000 connections, and finally --no-quickack specifies that we don’t expect the TCP QUICKACK feature to be in use.

Internally, the script does something fairly simple: it connects to the server, sends a password guess, expects an error, and then a connection close. The password guesses are just the 26 letters of the alphabet. In addition to those we also have the canary (or placebo) probe, that is, a second probe with the letter ‘a’. By sending the probes in random order and measuring the response time from the server, we can effectively see whether the server accepted the first letter of the password or not (as testing two letters for equality takes more time than testing just one).
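
Conceptually, a single probe boils down to something like the sketch below (an illustration only – the real test-passwd-guess.py drives the connections through tlsfuzzer and takes its timing from the tcpdump packet capture rather than from the client-side clock; the host and port match the example server):

import random
import socket
import time

def probe(guess, host='localhost', port=5001):
    # send one password guess and time how long the server takes to answer
    with socket.create_connection((host, port)) as s:
        start = time.perf_counter_ns()
        s.sendall(guess + b'\n')
        s.recv(1024)                      # the error reply (or connection close)
        return time.perf_counter_ns() - start

# the 26 letters plus a second 'a' acting as the canary/placebo probe
classes = [bytes([c]) for c in range(ord('a'), ord('z') + 1)] + [b'a']

# one round of measurements, probes sent in random order
order = list(range(len(classes)))
random.shuffle(order)
timings = [(i, probe(classes[i])) for i in order]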

Results

If you execute those commands on a typical system you will likely see the script report “No statistically significant difference detected” or, in more detail, a summary that looks something like this:

Sign test mean p-value: 0.5348, median p-value: 0.5464, min p-value: 0.0009225
Friedman test (chisquare approximation) for all samples
p-value: 0.669298684947114
Worst pair: 5(e), 22(v)
Mean of differences: 6.99390e-06s, 95% CI: -1.42111e-06s, 2.193720e-05s (±1.168e-05s)
Median of differences: 4.00000e-08s, 95% CI: 1.00000e-08s, 7.000000e-08s (±3.000e-08s)
Trimmed mean (5%) of differences: 1.04511e-07s, 95% CI: 4.44867e-08s, 1.665733e-07s (±6.104e-08s)
Trimmed mean (25%) of differences: 7.88520e-08s, 95% CI: 2.81640e-08s, 1.227660e-07s (±4.730e-08s)
Trimmed mean (45%) of differences: 3.86200e-08s, 95% CI: 1.00900e-08s, 6.882000e-08s (±2.936e-08s)
Trimean of differences: 7.26250e-08s, 95% CI: 2.00000e-08s, 1.157937e-07s (±4.790e-08s)
For detailed report see /tmp/test-passwd-guess.py_v1_1697395170/report.csv
No statistically significant difference detected

With accompanying graph looking like this:

Image

Does that mean that there is no side channel leakage? No. As the 95% confidence intervals of both the median and the trimmed means are in the range of tens of nanoseconds (30ns and 29ns respectively), we cannot exclude the possibility of a side channel. This test was executed on an AMD Ryzen 5 5600X that can boost up to 4.6GHz, which means that the smallest side channel it can have is on the order of 0.2ns (a single clock cycle). So a small side channel would be undetectable with such a small sample.
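
To put those numbers in perspective, a quick back-of-the-envelope calculation (the confidence interval width is taken from the summary above):

cycle_time = 1 / 4.6e9          # about 0.217 ns per cycle at 4.6 GHz
ci_halfwidth = 30e-9            # the ±30 ns from the median CI above
print(ci_halfwidth / cycle_time)  # about 138 cycles could still hide inside the CI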

OK, let’s execute the same test again, but with the --repeat 100000 option. Now, after about 10 minutes of execution time, the test reports failure (it’s a failure as a correctly implemented server wouldn’t have a side-channel leak), with overall statistics that look like this:

Sign test mean p-value: 0.42, median p-value: 0.4144, min p-value: 1.545e-31
Friedman test (chisquare approximation) for all samples
p-value: 4.028330499541504e-44
Worst pair: 13(m), 15(o)
Mean of differences: 6.71613e-08s, 95% CI: -7.92283e-07s, 9.030701e-07s (±8.477e-07s)
Median of differences: -2.00000e-08s, 95% CI: -2.00000e-08s, -1.900000e-08s (±5.000e-10s)
Trimmed mean (5%) of differences: -1.81057e-08s, 95% CI: -2.57158e-08s, -1.146633e-08s (±7.125e-09s)
Trimmed mean (25%) of differences: -1.80656e-08s, 95% CI: -2.33658e-08s, -1.299852e-08s (±5.184e-09s)
Trimmed mean (45%) of differences: -1.89289e-08s, 95% CI: -2.21798e-08s, -1.605050e-08s (±3.065e-09s)
Trimean of differences: -1.77500e-08s, 95% CI: -2.25000e-08s, -1.675000e-08s (±2.875e-09s)
For detailed report see /tmp/test-passwd-guess.py_v1_1697385326/report.csv

The Friedman test has a statistically significant p-value, and the smallest sign test p-value is also small enough to be significant.

But the graphs of the confidence intervals of the means are still not clear:

Image

That’s because I’ve executed the script on a machine with no special configuration, with other workloads running at the same time (I had no core isolation set up and I was watching YouTube in the meantime). If we take a look at the ECDF of the collected measurements, we’ll see that there’s a huge spread in the calculated differences (over 7 orders of magnitude larger than the difference we want to measure):

Image

And looking at the heatmap of the differences, the data not only has long and heavy tails, it is clearly multi-modal (in this case we see 3 clear modes):

Image

Thankfully, there are simple ways to work with such data: we can either look at the confidence interval of the median (here it’s limited by the resolution of the clock source the kernel decided to use):

Image

Or at the confidence interval of the trimmed means:

Image

In both of those graphs, class 13 is clearly the outlier. By looking at the legend.csv or at the report, we can deduce that class 13 corresponds to ‘m’, so the first letter of the password is ‘m’.

With this information we can ask the script to start looking for the second letter:

PYTHONPATH=./tlsfuzzer/ ./tlsfuzz-venv/bin/python3 ./test-passwd-guess.py -i lo -o /tmp --repeat 100000 --no-quickack --base m

That will provide us with a summary like this:

Sign test mean p-value: 0.4052, median p-value: 0.4143, min p-value: 3.179e-25
Friedman test (chisquare approximation) for all samples
p-value: 2.0606607949733857e-53
Worst pair: 1(a - canary), 20(t)
Mean of differences: -4.81858e-08s, 95% CI: -6.41017e-07s, 5.886469e-07s (±6.148e-07s)
Median of differences: -1.90000e-08s, 95% CI: -2.00000e-08s, -1.000000e-08s (±5.000e-09s)
Trimmed mean (5%) of differences: -1.38332e-08s, 95% CI: -2.02997e-08s, -7.578911e-09s (±6.360e-09s)
Trimmed mean (25%) of differences: -1.68099e-08s, 95% CI: -2.17857e-08s, -1.185134e-08s (±4.967e-09s)
Trimmed mean (45%) of differences: -1.55797e-08s, 95% CI: -1.82030e-08s, -1.287140e-08s (±2.666e-09s)
Trimean of differences: -1.62500e-08s, 95% CI: -2.00000e-08s, -1.000000e-08s (±5.000e-09s)
For detailed report see /tmp/test-passwd-guess.py_v1_1697386248/report.csv

And a graph like this:

Image

Does that mean that all letters are valid? No. Since the script uses ‘a’ as the baseline (class 0) and as the placebo/canary probe (class 1), and in the graph we’re comparing all classes to the baseline, only the canary probe will match it, which means that the second letter of the password is ‘a’.

Exercises for the reader

We’ve cracked the first two letters of the password in a mere 20 minutes of machine time, despite the side channel being equivalent to only around 75 CPU clock cycles. Given that, you can try a few things to get a better feel for side-channel attacks:

Try to recreate this setup yourself and decode the remaining letters of the password. As I wrote, the only special thing you need is permission to execute tcpdump; any additional system configuration will just reduce the number of connections required.

Take the server application and execute it on a different system: either one connected to the same network switch or one a few network hops away (use the -h option of the script to specify a remote host). How does the sample size necessary to detect the side channel change?

Look at the statistical results of the sign test and Wilcoxon signed-rank test in the report.csv file. How do they correlate with the graphs? Note: keep in mind Bonferroni correction.

Run the server on a specific CPU core together with some other heavy load (like sha1sum /dev/zero). Does that hide the side channel?

Try to modify the server code so that it processes the password in a side-channel free way.

Experiment with adding a random delay to the processing. Why doesn’t that help to hide the side-channel signal?

Debugging timing side-channel leaks

A few days ago I published research that used statistical methods to test for timing side-channel leaks in the RSA key exchange in TLS: the Marvin Attack paper. In this blog post I’ll describe how to debug which piece of code a timing signal is coming from.

Theoretical background

To be able to use statistical tests reliably, the collected data needs to meet a few criteria. Typically the tests require that the measurements are independent and identically distributed (IID). If that property isn’t satisfied, the tests become unreliable in either direction: depending on the kind of dependence, a test may report a difference where there isn’t one in reality, or it may fail to detect a difference despite there being a significant one.

Given that the measurements from timing data will be autocorrelated, the tests used must be able to handle this kind of data. The autocorrelation is caused by the inherent way that modern computers work: the fluctuations in CPU frequency as it boosts depending on its temperature, the general power state of the machine and the OS, etc. Thankfully, the fix is relatively simple: instead of collecting data for one kind of input, then for another input, and finally comparing those two distributions with each other, we perform tests on pairwise differences. As the name suggests, to perform this kind of test we execute the inputs in random order, but in pairs (or larger tuples), calculate the difference between them, and then perform the analysis on the distribution of those differences. The most basic tests that work in this way are the box test and the Wilcoxon signed-rank test. Tests that can work on longer tuples are the Friedman test and rANOVA (though rANOVA expects normally distributed data, which timing data typically doesn’t follow).
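
As an illustration of the pairwise approach (this is not the tlsfuzzer analysis code, and the file names are made up), with scipy it could look roughly like this:

import numpy as np
from scipy import stats

# a_times[i] and b_times[i] come from the same randomly ordered tuple, so
# subtracting them cancels out most of the slow drift of the machine state
a_times = np.loadtxt("timing_A.csv")
b_times = np.loadtxt("timing_B.csv")

# paired, non-parametric test: is the median of the differences zero?
stat, p_value = stats.wilcoxon(a_times, b_times)
print("Wilcoxon signed-rank p-value:", p_value)

# with more than two classes measured in every tuple, the Friedman test
# can compare all of them at once
c_times = np.loadtxt("timing_C.csv")
print(stats.friedmanchisquare(a_times, b_times, c_times))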

By running the tests with data classes in random order we also ensure one more thing: the code executed right before the code we’re interested in will be random and, what’s even more important, will have the same distribution for each input. Thus, we will be able to compare those results down to very small differences (single clock cycles).

That also means that the test harness needs to behave in a side-channel free manner, as otherwise the system state before test execution will be correlated with the probe and thus will show up as a signal in the system under test. Consider a pathological example: a test harness sends three identical messages, let’s call them A, B, and C, to the server. But just before sending message C, it flushes the entire CPU cache. In such a situation, the server code and runtime data will have to be fetched from main memory instead of the local cache, making their execution appear to take longer. Such a difference will be measurable, and thus will be detected by the statistical tests.

Basic setup

When testing a whole operation (like a TLS handshake, or a complete API call), we don’t have to do anything special to make the test work: we simply measure the time the whole operation took and perform the statistical tests on that data.

For the majority of tests, making the test harness side-channel free is fairly simple: as long as it can read same-sized byte strings from disk and pass them to the system under test without being affected by the particular values of the bytes being shuffled around, the whole testing process just needs to be performed in two steps. In the first step, the test data (inputs) are prepared and written to disk in random order, together with a record of that order. In the second step, code that’s oblivious to the order of the data reads same-sized chunks from disk and passes them to the system under test.
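
A minimal sketch of such a two-step harness (the file names, sizes and the measure() placeholder are made up for the example):

import random

# step 1: prepare same-sized inputs and write them to disk in random order,
# recording which class each chunk belongs to
inputs = {0: b"\x00" * 32, 1: b"\xff" * 32}   # two classes, same length
plan = [0, 1] * 10000
random.shuffle(plan)
with open("inputs.bin", "wb") as data, open("order.txt", "w") as log:
    for cls in plan:
        data.write(inputs[cls])
        log.write("%d\n" % cls)

# step 2: read fixed-size chunks and pass them to the system under test,
# never looking at which class a chunk belongs to
def measure(chunk):
    pass   # placeholder: call the system under test and record the time

with open("inputs.bin", "rb") as data:
    while True:
        chunk = data.read(32)
        if not chunk:
            break
        measure(chunk)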

Debugging

Unfortunately, when the test of a whole cryptographic operation shows a side channel (e.g. decryption of a message from a PKCS#1 v1.5 ciphertext), we won’t know where the timing signal is created.

To address that, we can employ a little bit of instrumentation together with a bisection strategy. As mentioned in the theory section, code executed earlier will influence code executed later. Thus, the first step, even if we have a suspicion where the side channel is coming from, is to make sure that no code executed before the code we’re interested in contributes to the timing signal.

Example: RSA PKCS#1 v1.5 in TLS

Let’s consider the example of a state-of-the-art implementation of the RSA key exchange in TLS. To perform it, the server needs to (a simplified sketch of steps 1 to 7 follows the list):

  1. Convert the ciphertext string into an integer
  2. Apply the blinding factor to the ciphertext
  3. Calculate a new set of blinding factors
  4. Apply the blinding factor to the private exponents in the modular exponentiation
  5. Calculate the modular exponentiation (ciphertext to the power of private exponent modulo public key)
  6. Apply the unblinding factor to the result of exponentiation
  7. Convert the result to a byte string to form the plaintext
  8. Generate a random 48 byte long key, to be used in case padding checks fail
  9. Verify that the padding in the plaintext is correct, that the encrypted message has the correct length, and that the first two bytes of it match the TLS version from the Client Hello
  10. In case the padding check doesn’t succeed, use the previously generated 48 byte long key
  11. Use either the message or the random key in the derivation of the master secret

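To make the order of those operations concrete, here is a much simplified Python sketch of steps 1 to 7 (no CRT, no exponent blinding from step 4, no constant-time code, and a fresh blinding factor is generated instead of the old one being updated – an illustration only, not how a real implementation should look):

import secrets

def rsa_decrypt_blinded(ciphertext, d, e, n):
    c = int.from_bytes(ciphertext, 'big')            # step 1: string to integer
    r = secrets.randbelow(n - 2) + 2                 # step 3 (simplified): pick a blinder
    c_blind = (c * pow(r, e, n)) % n                 # step 2: blind the ciphertext
    m_blind = pow(c_blind, d, n)                     # step 5: modular exponentiation
    m = (m_blind * pow(r, -1, n)) % n                # step 6: unblind (Python 3.8+)
    return m.to_bytes((n.bit_length() + 7) // 8, 'big')   # step 7: back to bytes
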
Obviously, that’s a lot of operations that require a lot of code.

Now, since we can expect that the blinding will use unbiased random numbers, we can assume that all code up until (but excluding) the application of the unblinding factor has access only to the ciphertext or to an obfuscated plaintext; the first code that needs to be constant time with respect to the plaintext is the application of the unblinding factor. So, to verify that no code before it has timing leaks, we need to start the measurement right before the parsing of the Client Key Exchange message, and stop the measurement right before the application of the unblinding factor.

Timing measurement is best done using the highest-precision and lowest-overhead time source available on a particular CPU. For the x86_64 platform, that’s the TSC register. Since we want to reduce the effect of the measurement as much as possible, for situations like TLS implementations it’s best to save the start and stop values of the TSC in global variables, and then write the difference between them to a file (as 8 raw bytes) after connection shutdown. Tlsfuzzer scripts execute just one connection at a time, so race conditions shouldn’t be an issue. Timing data collected this way can then be converted to the format expected by the tlsfuzzer analysis script using the extract.py script. The --raw-times, --binary, and -l options are the minimum you will need to use. Using the --clock-frequency option as well is likely to make interpretation of the results easier. More information about using the tlsfuzzer scripts is in the project’s documentation.

Once we verify that all the code up to and including the modular exponentiation doesn’t leak (by collecting enough data that the 95% confidence intervals of the trimmed means are in the few clock cycles range, i.e. below or around 1ns), we can then start applying the bisection.
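
One way to estimate such a confidence interval (not the tlsfuzzer implementation, just an illustration of the idea; diffs.csv is a made-up file of pairwise differences) is a simple bootstrap of the trimmed mean:

import numpy as np
from scipy import stats

rng = np.random.default_rng()
diffs = np.loadtxt("diffs.csv")    # pairwise differences between two classes

# bootstrap the 25% trimmed mean of the differences
boot = [stats.trim_mean(rng.choice(diffs, size=len(diffs), replace=True), 0.25)
        for _ in range(5000)]
low, high = np.quantile(boot, [0.025, 0.975])
print("trimmed mean (25%):", stats.trim_mean(diffs, 0.25))
print("95% CI:", low, high)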

Since we know that the whole operation leaks, and we’ve verified that the code up until the application of the unblinding factor doesn’t leak, we can do one of two things: either apply the bisection blindly and simply measure the first half of the remaining code, or consider what the most likely next place for the side-channel leak is. In the latter case, we can measure just the application of the unblinding factor and the conversion of the result to a byte string.

If we find that the application of the unblinding factor provides a signal when many (8+) of the most significant bytes of the result are zero, then the code in question has the exact same bug as OpenSSL’s CVE-2022-4304 and thus requires a similar fix: the multiplication and the modular reduction must be performed in a constant-time fashion with respect to the output.

Crucially, because execution of earlier code will influence the later code, there is no point in testing any code that comes after a leaky piece of code is found: because of the side effects of the leaky code, any code after it will appear leaky too.

After fixing the bug (or at least applying some code fixes), you can run the same timing tests against that particular snippet or, if you’re feeling optimistic, against the whole operation.

If the fixed part no longer shows a side channel, but the whole operation still does, you simply continue down the list of operations: check the conversion to a byte string, the depadding code, the selection between the decrypted message and the randomly generated one…

Summary

While using timing data isn’t as precise as formal analysis using tools like ctgrind in locating the exact leaky code, it doesn’t depend on any assumptions about the executed instructions (e.g. ctgrind won’t flag division operations that take secret data as inputs, despite most CPUs having variable execution time for division) and it can be performed using very crude methods that will work irrespective of the underlying architecture. By using bisection we can still achieve a very similar effect, and find the leaky instructions even inside large and complex code.

The same approach will work with any kind of cryptographic implementation that needs to perform operations in constant time with regard to its inputs: other public-key algorithms (ECDSA, FFDH, ECDH) or symmetric encryption (AES) are the obvious candidates.

Making RAID work (dm-integrity with MD-RAID)

Two years ago I wrote about how RAID doesn’t work, as it’s unable to detect silent data corruption. We tried to see what happens when data corruption is injected, and unfortunately Linux 4.16.6 wasn’t able to differentiate between hardware failures and soft failures coming from dm-integrity.

As those bugs are fixed now, let’s see how to configure Fedora or RHEL (with dracut) and Archlinux (with mkinitcpio) to automatically assemble the MD-RAID on dm-integrity volumes so that the root file system can reside on them.

Word of caution

As we’re modifying disk partitions and completely overwriting them, modifying boot-time configuration and in general performing rather complex and advanced configuration, make sure you have good and current backups of the data you care about!

Basic configuration

First, we need to prepare partitions for the volumes that will be used by dm-integrity. There are no special requirements for them, but it’s a good idea to keep them aligned to 1MiB boundaries. That’s especially important when working with SSDs or with HDDs that have 4KiB native sectors (so called Advanced Format disks). Current versions of tools like parted do that automatically.

As most hard disks now come with 4KiB sectors (and future replacements are only more likely to be like that), we need to format the partition with a 4096 B sector size.

The main feature of dm-integrity is the calculation and verification of checksums, so we need a hash that is fast enough. For a quick benchmark, you can use OpenSSL:

$ openssl speed md5 sha1 sha256 sha512
...
 type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
 md5             165603.90k   373152.90k   664617.35k   816564.66k   871069.01k   882553.30k
 sha1            185971.19k   427639.36k   846229.23k  1108107.61k  1228931.51k  1233747.97k
 sha256           99550.05k   219782.83k   413438.72k   517541.21k   556638.21k   559961.43k
 sha512           64932.77k   258782.93k   459379.29k   693776.04k   799465.47k   810766.90k

(Technically, we should benchmark the kernel hash performance as that’s what we’ll use, but I didn’t find an easy way to do that. Drop me a line if you know how to do it.)

As I wrote previously, by default dm-integrity uses a crc32 checksum; it’s small and possibly already used by the disks themselves to check for read errors, so we want to use something different. I’ll use SHA-1 in the following examples. You may also have good experience with BLAKE2b-256.

Finally, as the purpose of the checksums is just to detect accidental errors, not malicious changes, we don’t need the full output of the hash function. I’ve selected 8 bytes per sector (see the previous article for reasons why).

Let’s format the partitions (in this example a 4-disk RAID 6 array; replace sda1, sdb1, sdc1 and sdd1 with the partitions you are actually using).

integritysetup format /dev/sda1 --sector-size 4096 --tag-size 8 --integrity sha1
integritysetup format /dev/sdb1 --sector-size 4096 --tag-size 8 --integrity sha1
integritysetup format /dev/sdc1 --sector-size 4096 --tag-size 8 --integrity sha1
integritysetup format /dev/sdd1 --sector-size 4096 --tag-size 8 --integrity sha1

This process will run for a few hours on a modern multi-terabyte hard drive, so I suggest running the commands in parallel.

Once the devices are formatted we need to open them. Note: as the integritysetup superblock doesn’t save the algorithm used to calculate the checksums, you need to specify the --integrity option every time you open the device!

integritysetup open /dev/sda1 int1 -I sha1
integritysetup open /dev/sdb1 int2 -I sha1
integritysetup open /dev/sdc1 int3 -I sha1
integritysetup open /dev/sdd1 int4 -I sha1

This will create 4 block devices in the /dev/mapper directory named int1, int2, int3 and int4 (you can use other names too, like int-sda1 or such).

Opening the devices during boot

There are no ready-to-use mechanisms to open dm-integrity volumes on boot, so we need to modify the initramfs used by the kernel to open them ourselves.

Fedora

Fedora has been using the dracut initramfs for a long time now. It’s also used in RHEL 8 and CentOS 8, so the instructions for them are the same. I’ve tested this with Fedora 31 and CentOS 8.3.

Dracut uses a system of modules which automatically detect whether they need to be included in the initramfs or not. After that it uses udev to detect when new devices show up and what to do with them. As such, we need to create a set of files that will automatically detect when the dm-integrity block devices show up and describe what to do with them (how to name the integrity device and which hash to use).

I’ve created the necessary files in the dracut-dm-integrity project on github.

As the instructions in the README.md state, you need to copy the files from the scripts directory to the /usr/lib/dracut/modules.d/90integrity directory. After that, edit integrity-mount.sh to make it mount your integrity volumes. Finally, run dracut -f to include the new module in the initramfs.

Archlinux

Archlinux uses a fairly simple system for constructing its initramfs. The modules live in the /usr/lib/initcpio/ directory and are enabled using the /etc/mkinitcpio.conf file.

You can find ready to use scripts in the mkinitcpio-dm-integrity repository.

Don’t forget to run mkinitcpio -P every time you edit /etc/mkinitcpio.conf!

MD-RAID creation

After restarting the system to verify that the devices are automatically opened on startup, we can create an MD RAID array on top of them.

The one special option that is beneficial for arrays built on top of dm-integrity is the chunk size. When writing data to a dm-integrity device, both the sector with data and the sector with checksums need to be updated; performing writes large enough to update all the checksums in a sector means that the checksum sector doesn’t have to be read first just to modify part of it. In our example the checksums are 8 bytes large, which means that a 4096 B sector holds 512 checksums. So a write needs to be a multiple of 512 (checksums in a sector) * 4096 (sector size) = 2097152 B = 2 MiB to not cause a read-modify-write operation. The chunk option is specified in KiB, and by default it’s 512 KiB.
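
The same calculation as a quick script (assuming 4096 B sectors, as used above):

sector_size = 4096                             # bytes
tag_size = 8                                   # checksum bytes per data sector
tags_per_sector = sector_size // tag_size      # 512 checksums per checksum sector
full_write = tags_per_sector * sector_size     # data covered by one checksum sector
print(full_write, "B =", full_write // 1024, "KiB")   # 2097152 B = 2048 KiB = 2 MiB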

To create the MD-RAID use the following command:

mdadm -C /dev/md0 -n 4 -l 6 --chunk 2048 --assume-clean /dev/mapper/int[1234]

(we can use --assume-clean as integritysetup format creates volumes initialised to all zeros)

Such an MD-RAID will require the usual setup of adding its settings to /etc/mdadm.conf and regenerating the initramfs:

mdadm --detail --scan >> /etc/mdadm.conf

The reduced write speed caused by journaling at the dm-integrity level means that making sure the stripe_cache_size (in this case /sys/block/md0/md/stripe_cache_size) is tuned properly is even more important. In my experience setting it to at least 4096 is a good idea, and 8192 is where I see diminishing returns.

Usage

Such an MD-RAID can be used as usual: as a backing device for a file system or as a Physical Volume of an LVM2 setup. One thing to note is that, because we created the dm-integrity volumes with 4096 B sectors, the array presents a block device with 4096 B native sectors. From what I’ve seen, ext4 and btrfs don’t mind being migrated from a 512 B sector disk to a 4096 B sector disk. The file system that did mind was XFS: it refused to mount. If you still need to mount such a file system on such an array, you may want to use the dm-ebs module to emulate 512 B sectors. I haven’t tried it though.

If you want to create an XFS file system on such an array (4 disks in RAID 6, with 4KiB sectors and 8-byte checksums), you can use mkfs.xfs -d su=2M,sw=2 /dev/md0 to do so. That being said, with a standard MD-RAID, mkfs.xfs will detect those settings automatically.

Update 2020-10-03: the above mkfs.xfs command had an error; the sw option indicates the number of data disks in a stripe. As we’re using RAID 6 we lose two disks to parity, and since we’re using 4 disks total, we’re left with two data disks in a stripe.

“Constant time” compare in Python

Note: while the conclusions from this article are correct (as the side channels measured are relatively large), the statistical methods used for doing that are NOT. See the later article Debugging timing side-channel leaks for the scientifically correct approach.

You may be familiar with the following piece of code to implement the constant time comparison function for strings:

def constant_time_compare(val1, val2):
    if len(val1) != len(val2):
        return False
    result = 0
    for x, y in zip(val1, val2):
        result |= x ^ y
    return result == 0


The idea behind this code is to compare all the bytes of the input using a flag value that gets set if any of the comparisons fail. Only after all the bytes have been compared is the result of the function returned. This is used to thwart attacks that use the time of processing queries to guess secret values.

Unfortunately, because of CPython specifics, this code doesn’t work for its intended purpose.

Sensitive code should always use the hmac.compare_digest() function, and you should not write code that needs to be side-channel secure in Python.

With the tl;dr version out of the way, let’s investigate why.

Timing side channel

Many attacks against cryptographic implementations don’t actually use maths to compromise the system. The Bleichenbacher Million Message attack, POODLE and Lucky 13 all use some kind of a side channel to guess the contents of encrypted messages: either the timing of responses or the contents of responses (different TLS Alert description field values).

Side-channel attacks don’t impact only cryptographic protocols; other places where secret values need to be compared to attacker-controlled values, like checking password equality, API tokens or HMAC values, need to be performed in constant time too.

Already back in the 2000s, differences in timing as low as 100ns could be distinguished over a LAN. See the research by Crosby et al. in Opportunities And Limits Of Remote Timing Attacks and by Brumley and Boneh in Remote Timing Attacks are Practical. Currently we also have to worry about cross-VM or cross-process attacks, where distinguishing single cycles may be possible.

Measuring timing differences

Let’s see what happens if we use the simple way to compare two strings in Python: the == operator.

Benchmarking code:

import perf

setup = """
str_a = b'secret API key'

alt_s = b'XXXXXXXXXXXXXX'

str_b = str_a[:{0}] + alt_s[{0}:]

assert len(str_a) == len(str_b)
"""

fun = """str_a == str_b"""

if __name__ == "__main__":
    total_runs = 128
    runs_per_process = 4
    runner = perf.Runner(values=runs_per_process,
                         warmups=16,
                         processes=total_runs//runs_per_process)
    vals = list(range(14))  # length of str_a
    for delta in vals:
        runner.timeit("eq_cmp delta={0:#04x}".format(delta),
                      fun,
                      setup=setup.format(delta))

Running it simulates the timings the attacker sees when the difference from the expected value is at different positions in the attacker-provided string.

PYTHONHASHSEED=1 python3 timing-eq_cmp-perf.py \
-o timing-eq_cmp-perf-1.json --fast
.................
eq_cmp delta=0x00: Mean +- std dev: 18.4 ns +- 0.0 ns
.................
eq_cmp delta=0x01: Mean +- std dev: 20.8 ns +- 0.0 ns
.................
eq_cmp delta=0x02: Mean +- std dev: 20.8 ns +- 0.0 ns
.................
eq_cmp delta=0x03: Mean +- std dev: 20.8 ns +- 0.0 ns
.................
eq_cmp delta=0x04: Mean +- std dev: 20.8 ns +- 0.0 ns
.................
eq_cmp delta=0x05: Mean +- std dev: 20.8 ns +- 0.0 ns
.................
eq_cmp delta=0x06: Mean +- std dev: 20.8 ns +- 0.0 ns
.................
eq_cmp delta=0x07: Mean +- std dev: 20.8 ns +- 0.0 ns
.................
eq_cmp delta=0x08: Mean +- std dev: 21.3 ns +- 0.0 ns
.................
eq_cmp delta=0x09: Mean +- std dev: 21.3 ns +- 0.0 ns
.................
eq_cmp delta=0x0a: Mean +- std dev: 21.3 ns +- 0.0 ns
.................
eq_cmp delta=0x0b: Mean +- std dev: 21.3 ns +- 0.0 ns
.................
eq_cmp delta=0x0c: Mean +- std dev: 21.3 ns +- 0.0 ns
.................
eq_cmp delta=0x0d: Mean +- std dev: 21.3 ns +- 0.0 ns

Already we can see that the difference in timing between the first differing byte being at the first position and at the second position is quite large; looking at a box plot of the specific values makes it quite obvious:

a = read.csv(file="timing-eq_cmp-perf-1.csv", header=FALSE)
data = as.matrix(a)
boxplot(t(data))
timing-eq_cmp-perf-1-boxplot

Let’s see how it compares to the “constant time” compare. First the code:

import perf

setup = """
def constant_time_compare(val1, val2):
    if len(val1) != len(val2):
        return False
    result = 0
    for x, y in zip(val1, val2):
        result |= x ^ y
    return result == 0

str_a = b'secret API key'

alt_s = b'XXXXXXXXXXXXXX'

str_b = str_a[:{0}] + alt_s[{0}:]

assert len(str_a) == len(str_b)
"""

fun = """constant_time_compare(str_a, str_b)"""

if __name__ == "__main__":
    total_runs = 128
    runs_per_process = 4
    runner = perf.Runner(values=runs_per_process,
                         warmups=16,
                         processes=total_runs//runs_per_process)
    vals = list(range(14))  # length of str_a
    for delta in vals:
        runner.timeit("ct_eq_cmp delta={0:#04x}".format(delta),
                      fun,
                      setup=setup.format(delta))


The test run:

PYTHONHASHSEED=1 python3 timing-ct_eq_cmp-perf.py \
-o timing-ct_eq_cmp-perf-1.json --fast
.................
ct_eq_cmp delta=0x00: Mean +- std dev: 1.36 us +- 0.02 us
.................
ct_eq_cmp delta=0x01: Mean +- std dev: 1.37 us +- 0.01 us
.................
ct_eq_cmp delta=0x02: Mean +- std dev: 1.37 us +- 0.01 us
.................
ct_eq_cmp delta=0x03: Mean +- std dev: 1.37 us +- 0.01 us
.................
ct_eq_cmp delta=0x04: Mean +- std dev: 1.37 us +- 0.01 us
.................
ct_eq_cmp delta=0x05: Mean +- std dev: 1.37 us +- 0.00 us
.................
ct_eq_cmp delta=0x06: Mean +- std dev: 1.36 us +- 0.01 us
.................
ct_eq_cmp delta=0x07: Mean +- std dev: 1.35 us +- 0.01 us
.................
ct_eq_cmp delta=0x08: Mean +- std dev: 1.35 us +- 0.01 us
.................
ct_eq_cmp delta=0x09: Mean +- std dev: 1.34 us +- 0.00 us
.................
ct_eq_cmp delta=0x0a: Mean +- std dev: 1.35 us +- 0.00 us
.................
ct_eq_cmp delta=0x0b: Mean +- std dev: 1.33 us +- 0.01 us
.................
ct_eq_cmp delta=0x0c: Mean +- std dev: 1.33 us +- 0.01 us
.................
ct_eq_cmp delta=0x0d: Mean +- std dev: 1.32 us +- 0.01 us

The results don’t look too bad, but there’s definitely a difference between the first and last one, even accounting for one standard deviation between them. Let’s see the box plot:

a = read.csv(file="timing-ct_eq_cmp-perf-1.csv", header=FALSE)
data = as.matrix(a)
boxplot(t(data))
timing-ct_eq_cmp-perf-1

That doesn’t look good. Indeed, if we compare the distributions for the different delta values using the Kolmogorov–Smirnov test, we’ll see that results for all deltas are statistically different:

a = read.csv(file="timing-ct_eq_cmp-perf-1.csv", header=FALSE)
data = as.matrix(a)
r = c()
for (i in c(1:length(data[,1]))){
  r[i] = ks.test(data[1,], data[i,])$p.value}
which(unlist(r) < 0.05/(length(data[,1])-1))
 [1]  2  3  4  5  6  7  9 10 11 12 13 14


Which means that the distributions are statistically distinguishable. (The 0.05 p-value is divided by the number of performed tests because we’re applying the Bonferroni correction.)

To make sure, we re-run the test 4 more times and check for correlation between the medians (as the distributions are unimodal, the median is a robust statistic).

require(corrplot)

a = read.csv(file="timing-ct_eq_cmp-perf-1.csv", header=FALSE)
data = as.matrix(a)
vals = cbind(apply(data, 1, median))

for (i in 2:5) {
  name = paste("timing-ct_eq_cmp-perf-", i, ".csv", sep="")
  a = read.csv(file=name, header=FALSE)
  data = as.matrix(a)
  vals = cbind(vals, apply(data, 1, median))
}
corrplot(cor(vals, method="spearman"), method="ellipse")


timing-ct_eq_cmp-perf-corrplot
There is a very strong correlation between all the different runs, so it does indeed look like the function is leaking timing information.
Note: we’re using the "spearman" correlation statistic as the values are not normally distributed.

Let’s compare it to the hmac.compare_digest() method:

import perf

setup = """
from hmac import compare_digest

str_a = b'secret API key'

str_b = b''.join((str_a[:{0}], b'X' * (14 - {0})))

assert len(str_a) == len(str_b)
assert len(str_a) == 14
"""

fun = """compare_digest(str_a, str_b)"""

if __name__ == "__main__":
    total_runs = 128
    runs_per_process = 4
    runner = perf.Runner(values=runs_per_process,
                         warmups=64,
                         processes=total_runs//runs_per_process)
    vals = list(range(14))  # length of str_a
    for delta in vals:
        runner.timeit("compare_digest delta={0:#04x}".format(delta),
                      fun,
                      setup=setup.format(delta))

PYTHONHASHSEED=1 python3 timing-compare_digest-perf.py \
-o timing-compare_digest-perf-1.json --rigorous

a = read.csv(file="timing-compare_digest-perf-1.csv", header=FALSE)
data = as.matrix(a)
boxplot(t(data))

timing-compare_digest-perf-1

While there is some difference that depends on where the first differing byte is, there is no difference between the first and the second byte, and the “step” around the 8th byte is present only there (when comparing longer strings, I still see just one step at the beginning and one at the end). I have no good explanation for it. That being said, the difference between the medians of the 2nd byte and the 11th byte is 0.240 ns; for comparison, one cycle of the CPU (4GHz) on which the test is running takes 0.250 ns. So I’m assuming that it is not detectable over the network, but it may be detectable in cross-VM attacks.

To confirm the results, I’ve run the test with the simple == for 255-byte-long strings and with hmac.compare_digest().

Results for ==:

a = read.csv(file="timing-eq_cmp-2-perf-1.csv", header=FALSE)
data = as.matrix(a)
boxplot(t(data))


timing-eq_cmp-2-perf-1.png
As expected, there are obvious steps that are directly dependent on the amount of matching data between the two parameters of the operator.

Results for compare_digest():

a = read.csv(file="timing-compare_digest-8-perf-1.csv", header=FALSE)
data = as.matrix(a)
boxplot(t(data), ylim=c(min(data), quantile(data, 0.99)))


timing-compare_digest-8-perf-1
They are quite noisy, but what the grouping around 2.239e-7 hints at (the thick horizontal line composed of circles) is that the distribution is not unimodal (otherwise the outliers would look like the ones below the boxes). Let’s look at the counts for the different time bins, as in a histogram, in more detail:

require("lattice")
a = read.csv(file="timing-compare_digest-8-perf-1.csv", header=FALSE)
data = as.matrix(a)
h <- hist(data, breaks=200,plot=FALSE)
breaks = c(h$breaks)
mids = c(h$mids)
hm <- rbind(hist(data[1,], breaks=breaks, plot=FALSE)$counts)
for (i in c(2:length(data[,1]))) {
  hm <- rbind(hm, hist(data[i,], breaks=breaks, plot=FALSE)$counts)}

d = data.frame(x=rep(seq(1, nrow(hm), length=nrow(hm)), ncol(hm)),
               y=rep(mids, each=nrow(hm)),
               z=c(hm))
levelplot(z~x*y, data=d, xlab="delta", ylab="time (s)",
  ylim=c(min(data), quantile(data, 0.99)))


timing-compare_digest-8-perf-1-levelplot.png
We can now see that even though the measurements with deltas between 0 and 8 and between 249 and 255 look very different on the box plot, it’s more because a third mode was added to them rather than because one of the other two was removed. A statistical test confirms this:

a = read.csv(file="timing-compare_digest-8-perf-1.csv", header=FALSE)
data = as.matrix(a)
r = c()
for (i in c(1:length(data[,1]))){
  r[i] = ks.test(data[19,], data[i,])$p.value}
which(unlist(r) < 0.05/nrow(data))
 [1]   1   2   3   4   5   6   7   8  36  37  38  39  41  45  46  49  50  51  52
[20]  53  54 249 250 251 252 253 254 255


(the deltas between 36 and 54 are a fluke that subsequent quick runs didn’t show).

Note about benchmarking

You may have noticed that the data we have collected has a very low amount of noise. While that is partially the result of using the perf module instead of the timeit library module, it is mostly the result of careful system configuration.

On the benchmarking system, the following tasks were performed:

  • 3rd and 4th core were isolated
  • kernel RCU was disabled on the isolated cores
  • HyperThreading was disabled in BIOS
  • Intel TurboBoost was disabled
  • Intel power management was disabled (no C-states or P-states other than C0 were allowed)
  • CPU frequency was locked at 4GHz (the nominal frequency of the i7 4790K in the workstation used)
  • Maximum perf probe query rate was decreased to 1 per second
  • irqbalance was disabled and the default IRQ affinity was set to the non-isolated cores
  • ASLR was disabled
  • Python hash table seed was fixed

Those operations can be performed by:

  1. Adding isolcpus=2,3 rcu_nocbs=2,3 processor.max_cstate=1 idle=poll to the kernel command line
  2. Disabling HyperThreading in BIOS
  3. Running python3 -m perf system tune
  4. Disabling ASLR by running echo 0 > /proc/sys/kernel/randomize_va_space
  5. Exporting the PYTHONHASHSEED environment variable

The documentation of the perf module provides most of the explanation of the particular options, but we diverge in two places: ASLR and the Python hash seed. The purpose of the perf module is to test the overall performance of a piece of Python code (and compare it to either a compiled version or a different implementation). Because Python is a language that answers the question “what if everything was a hash table” ;), the names of variables, memory positions of variables and code, the number of variables, and the particular hash table keys all have a significant impact on performance. But because we are interested in whether an attacker is able to tell apart the behaviour of the code for two different inputs, and those two inputs will likely be processed by the same process, both the ASLR seed and the Python hash table seed will be constant from the point of view of the attacker. To speed up finding the expected value for particular inputs I thus opted out of those randomisation mechanisms.

Expectations of behaviour

You may wonder why the Python code is so unstable, so data dependent, if the implementation of hmac.compare_digest() is doing exactly the same thing (xor-ing the values together and then or-ing the result with a guard variable). The problem stems from the fact that the Python int and the C unsigned char are vastly different data types – one is used for arbitrary precision arithmetic while the other can store just 256 unique values. Thus, even such simple operations as xor or or with two small integers are data dependent in Python.

Let’s see how much time the Python VM needs to xor two small integers. (Unfortunately, it looks like perf uses the slow json module, and because it exports results after every loop iteration, after a few hundred results the export takes more time than the benchmarking. To make it fast enough, and not waste a few days on exporting the same data over and over again, we will use the timeit module.)

Script:

import timeit
import sys
import math

setup = """
val_a = {0}

val_b = {1}
"""

fun = """val_a ^ val_b"""

def std_dev(vals):
    avg = sum(vals)/len(vals)
    sum_sq = sum((i - avg)**2 for i in vals)
    return math.sqrt(sum_sq / (len(vals) - 1))

if __name__ == "__main__":
    total_runs = 20
    runs_per_process = 3
    warmups = 16

    runner = timeit.Timer(fun, setup=setup.format(0, 0))
    number, delay = runner.autorange()
    number //= 2
    delay /= 2

    print(("will do {0} iterations per process, "
           "expecting {1:7.2} s per process")
          .format(number, delay), file=sys.stderr)
    print("warmups:", file=sys.stderr, end='')
    sys.stderr.flush()
    for _ in range(warmups):
        timeit.repeat(fun, setup=setup.format(0, 0), repeat=1,
                      number=number)
        print(".", file=sys.stderr, end='')
        sys.stderr.flush()
    print(file=sys.stderr)

    for a in range(256):
        for b in range(256):
            res = []
            for _ in range(total_runs // runs_per_process):
                # drop the first result as a local warmup
                res.extend(i / number for i in
                           timeit.repeat(fun,
                                         setup=setup.format(a, b),
                                         repeat=runs_per_process + 1,
                                         number=number)[1:])
                print(".", file=sys.stderr, end='')
                sys.stderr.flush()
                if std_dev(res) < ...:
                    break
            # one CSV row per (a, b) pair
            print(",".join(str(i) for i in res))

Execution:

PYTHONHASHSEED=1 taskset -c 2 python3 \
-u timing-xor-2-timeit.py > timing-xor-2-timeit-1.csv

Code to create the graph:

require("lattice")
a = read.csv(file="timing-xor-2-timeit-1.csv",
             header=FALSE, col.names=seq(1, 20),
             fill=TRUE)
data = as.matrix(a)
med = apply(data, 1, median, na.rm=TRUE)
# full lines
len = length(med)
columns = ceiling(length(med) / 256)
d = data.frame(x=rep(seq(0, 255), length.out=len, 256),
               y=rep(seq(0, 255), length.out=len, each=256),
               z=med)
my.at = seq(min(med), max(med), length=40)
levelplot(z~x*y, data=d, xlab="b", ylab="a",
          at=my.at, aspect="iso",
          colorkey=list(at=my.at, labels=list(at=my.at)))


timing-xor-2-timeit-1.png
While there are a few repeating patterns, there are 4 things of particular importance – the behaviour when the two numbers are equal (the lighter diagonal), when both are zero, and when one of the operands is zero. The difference between the background and the diagonal is small, just 0.555 ns, but that translates to about 2 cycles at 4GHz. The difference between the 0, 0 case and the background is even smaller, just 0.114 ns, so about half a cycle. The difference between the background and the situations where the first operand is zero and the second is non-zero is about 2.24 ns, which translates to about 9 cycles. When the first operand is non-zero and the second is zero, the difference is about 1.39 ns, which is about 6 cycles. Here’s the zoomed-in part of the graph for the small numbers:
timing-xor-2-timeit-1-zoom.png

The binary or operator is similarly dependent on the values of its parameters:
timing-or-timeit-1

Both of those things put together mean that using the supposedly constant-time compare doesn’t actually protect against timing attacks, but rather makes them easier. The signal for different inputs is about 100 times stronger, likely making the attacks possible even over the Internet, not only over a LAN (as is the case for the == operator).

Anything else?

Because I started looking into those microbenchmarks to verify the “constant time” CBC MAC and pad check in tlslite-ng, needed to protect against Lucky 13 (see the very extensive article by Adam Langley on the topic), I’ve also checked whether it is possible to speed up the hashing of the data. Because at the Python level we don’t have the luxury of access to lower-level hash APIs, as the developers of OpenSSL do, to implement the CBC check I wrote code that in fact calculates 256 different HMACs for every record that contains at least 256 bytes of data + padding. That means that for every record processed, the client and server actually process 64 KiB of additional data. In theory (that is, if the HMAC itself is constant time), we could speed up the process of checking the MAC and de-padding in TLS dramatically if we could hash the data just once, as OpenSSL does in its TLS implementation. You may say, “but hashes are implemented in C, surely they are constant time!”. To which I’ll answer, “what did we say about trusting assumptions?”.
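
The workaround described above boils down to something like this (a rough sketch of the idea only, not the actual tlslite-ng code – the real implementation also has to account for the MAC and padding-length bytes and select the right result in a constant-time way):

import hmac
from hashlib import sha1

def all_possible_macs(key, record):
    # compute a MAC for every one of the 256 possible padding lengths, so
    # the amount of data hashed does not depend on the secret padding value
    return [hmac.new(key, record[:len(record) - pad], sha1).digest()
            for pad in range(256)]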

Let’s see how our assumptions hold. First, code that hashes all of the provided data, but also returns a hash from the “middle” of the data (in a TLS implementation that would be the real HMAC that we need to compare to the one from the record):

import perf

setup = """
import hmac
from hashlib import sha1

def fun(digest, data, split):
    digest.update(data[:split])
    ret = digest.copy().digest()
    digest.update(data[split:])
    return ret, digest.digest()

str_a = memoryview(b'X'*256)
key = b'a' * 32

val_b = {0}

mac = hmac.new(key, digestmod=sha1)
"""

fun = """fun(mac.copy(), str_a, val_b)"""

if __name__ == "__main__":
    total_runs = 128
    runs_per_process = 4
    runner = perf.Runner(values=runs_per_process,
                         warmups=16,
                         processes=total_runs//runs_per_process)
    vals = list(range(256))  # length of str_a
    for delta in vals:
        runner.timeit("hmac split delta={0:#04x}".format(delta),
                      fun,
                      setup=setup.format(delta))


Command to gather the statistics:

PYTHONHASHSEED=1 python3 timing-hmac-split-perf.py \
-o timing-hmac-split-perf-1.json


And a way to visualise them:

require("lattice")
a = read.csv(file="timing-hmac-split-perf-1.csv", header=FALSE)
data = as.matrix(a)
h <- hist(data, breaks=200,plot=FALSE)
breaks = c(h$breaks)
mids = c(h$mids)
hm <- rbind(hist(data[1,], breaks=breaks, plot=FALSE)$counts)
for (i in c(2:length(data[,1]))) {
  hm <- rbind(hm, hist(data[i,], breaks=breaks, plot=FALSE)$counts)}

d = data.frame(x=rep(seq(1, nrow(hm), length=nrow(hm)), ncol(hm)),
               y=rep(mids, each=nrow(hm)),
               z=c(hm))
levelplot(z~x*y, data=d, xlab="delta", ylab="time (s)",
  ylim=c(min(data), quantile(data, 0.99)))


timing-hmac-split-perf-1
Besides the obvious peaks from the 56th to the 64th byte of every 64-byte block (caused by an additional hash block that had to be padded to calculate the intermediate HMAC), there is also a dip for the first byte of each 64-byte block and a second dip for bytes between 20 and 55 of every block. Finally, when the split is about even (that is, when the intermediate hash is calculated over the first 120 bytes), the whole operation takes measurably longer. In short, if the position of the intermediate hash depends on the last byte of encrypted data (as it does in TLS), calculating the HMAC like this has a definite side-channel leak.

To confirm, let’s perform Kolomogorov-Smirnov test:

a = read.csv(file="timing-hmac-split-perf-1.csv", header=FALSE)
data = as.matrix(a)
r=c()
for (i in c(1:nrow(data))){
   r[i] = ks.test(data[2,], data[i,])$p.value}
which(unlist(r) < 0.05/(nrow(data)-1))


(we’re testing against the second row, as the first row (for a delta of 0) is obviously different from the others, so all tests failing wouldn’t be unexpected)

  [1]   1  21  23  26  44  57  58  59  60  61  62  63  64  69  71  75  77  78
 [19]  80  81  82  83  92  93  94  95  96  98  99 100 103 104 109 114 118 121
 [37] 122 123 124 125 126 127 128 130 131 132 133 134 135 136 137 138 139 140
 [55] 141 142 143 144 145 146 147 160 161 163 165 169 170 171 172 173 174 175
 [73] 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 215
 [91] 217 218 249 250 251 252 253 254 255 256


Quite obviously different, even with just 128 samples per delta value.

Summary

The moral of the story is: don’t use something without testing whether it behaves as it claims to. If it does have tests, verify that they check your expectations, not only those of the programmers that wrote it in the first place.

State your assumptions and test them. If values look similar, measure them multiple times, and use statistical methods to compare them.

Test setup

Tests were performed, as previously mentioned, on an Intel i7 4790K CPU. The system was running Linux 4.17.5-1-ARCH with Python 3.6.6-1 and perf 1.5.1 from Archlinux.

Conversion from json files to csv files was performed using the json-to-csv.py script available in the testing repo on github, together with the raw results.

Post scriptum

Other operations on integers, including equality, are also not constant time:

import timeit
import sys
import math

setup = """
val_a = {0}

val_b = {1}
"""

fun = """val_a == val_b"""

def std_dev(vals):
    avg = sum(vals)/len(vals)
    sum_sq = sum((i - avg)**2 for i in vals)
    return math.sqrt(sum_sq / (len(vals) - 1))

if __name__ == "__main__":
    total_runs = 3
    runs_per_process = 3
    warmups = 16

    runner = timeit.Timer(fun, setup=setup.format(0, 0))
    number, delay = runner.autorange()
    number //= 100
    delay /= 100

    print("will do {0} iterations per process, "
          "expecting {1:7.2} s per process"
          .format(number, delay), file=sys.stderr)
    print("warmups:", file=sys.stderr, end='')
    sys.stderr.flush()
    for _ in range(warmups):
        timeit.repeat(fun, setup=setup.format(0, 0), repeat=1,
                      number=number)
        print(".", file=sys.stderr, end='')
        sys.stderr.flush()
    print(file=sys.stderr)

    for a in range(256):
        for b in range(256):
            res = []
            for _ in range(total_runs // runs_per_process):
                # drop the first result as a local warmup
                res.extend(i / number for i in
                           timeit.repeat(fun,
                                         setup=setup.format(a, b),
                                         repeat=runs_per_process+1,
                                         number=number)[1:])
                print(".", file=sys.stderr, end='')
                sys.stderr.flush()
                if std_dev(res) < ...:
                    break
            # one CSV row per (a, b) pair
            print(",".join(str(i) for i in res))


Execution:

PYTHONHASHSEED=1 taskset -c 2 python3 \
-u timing-eq-timeit.py > timing-eq-timeit-1.csv


Code to create the graph:

require("lattice")
a = read.csv(file="timing-eq-timeit-1.csv", header=FALSE,
             col.names=seq(1, 20), fill=TRUE)
data = as.matrix(a)
med = apply(data, 1, median, na.rm=TRUE)
# full lines
len = length(med)
columns = ceiling(length(med) / 256)
d = data.frame(x=rep(seq(0, 255), length.out=len, 256),
               y=rep(seq(0, 255), length.out=len, each=256),
               z=med)
my.at = seq(min(med), max(med), length=40)
levelplot(z~x*y, data=d, xlab="b", ylab="a",
          at=my.at, colorkey=list(at=my.at, labels=list(at=my.at)))


timing-eq-timeit-1
So it looks to me like the xor operation is actually one of the more constant time primitives…

RAID doesn’t work!

Now that we have the clickbaity title out of the way, let’s talk about data integrity. Specifically, disk data integrity on Linux.

RAID, or as it is less well known, Redundant Array of Independent Disks, is a way to make stored data more resilient against disk failure. Unfortunately it does not protect against silent data corruption, which in studies from CERN was present at rates in the 10^-7 range; other studies have also shown non-negligible rates of data corruption.

You may say: OK, I understand that fixing it doesn’t work for RAID 1 with just two copies, or with RAID 5, as you don’t know which copy of the data is correct – any one of them can be – but surely the system is more clever if it has RAID 6 or 3 drives in a RAID 1! Again, unfortunately it isn’t so. Careful reading of the md(4) man page will reveal this fragment:

If check was used, then no action is taken to handle the mismatch, it is simply recorded. If repair was used, then a mismatch will be repaired in the same way that resync repairs arrays. For RAID5/RAID6 new parity blocks are written. For RAID1/RAID10, all but one block are overwritten with the content of that one block.

In other words, RAID depends on the disks telling the truth: if a disk can’t read data, it needs to return an I/O error, not return garbage. But as we’ve established, this isn’t how disks always behave.

Now you may say: but I use disk encryption! Surely encryption will detect this data modification and prevent the use of damaged or changed data! Again, this is not a property of either AES in XTS mode or in CBC mode – the standard modes of encryption for disk drives – those are so-called malleable encryption modes. There is no way to detect ciphertext modification for them in the general case.
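
To see what “malleable” means in practice, here is a small illustration using the Python cryptography package (a toy example with a random key, not a disk encryption setup): flipping one bit in a CBC ciphertext block garbles that block, flips exactly the corresponding bit in the next plaintext block, and decryption reports no error (with XTS the modified block just turns into unrelated random data, equally silently):

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key, iv = os.urandom(32), os.urandom(16)
plaintext = b'A' * 32                      # two 16-byte blocks

enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
ciphertext = enc.update(plaintext) + enc.finalize()

# flip one bit in the first ciphertext block
tampered = bytearray(ciphertext)
tampered[0] ^= 0x01

dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
result = dec.update(bytes(tampered)) + dec.finalize()

print(result[:16])   # pseudo-random garbage
print(result[16:])   # b'@AAAAAAAAAAAAAAA' -- one bit flipped, no error raised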

This was one of the main reasons behind Btrfs and ZFS: checksumming all data and metadata so that detection of such incorrect blocks is possible (so that, at the very least, the corruption is detected and doesn’t reach the application) and, with the addition of the built-in RAID levels, so that it can also be corrected.

What if you don’t want to (or, in the case of ZFS, likely can’t) use either of them? Until recently, there was not much you could do. The introduction of the dm-integrity target has changed that, though.

Using dm-integrity in LUKS

The dm-integrity target is best integrated with LUKS disk encryption.

To enable it, the device needs to be formatted as a LUKS2 device, and an integrity mechanism needs to be specified:

cryptsetup luksFormat --type luks2 --integrity hmac-sha256 \
--sector-size 4096 /dev/example/ciphertext

(Other options include --integrity hmac-sha512 and --cipher chacha20-random --integrity poly1305. Smaller tags will be discussed below.)

which then can be opened as a regular LUKS device:

cryptsetup open --type luks /dev/example/ciphertext plaintext

This will create a /dev/mapper/plaintext device that is encrypted and integrity protected, and a /dev/mapper/plaintext_dif device that provides storage for the authentication tags.

Note that the integrity device will report (none) as the integrity mechanism:

integritysetup status plaintext_dif
/dev/mapper/plaintext_dif is active and is in use.
type:    INTEGRITY
tag size: 32
integrity: (none)
device:  /dev/mapper/example-ciphertext
sector size:  4096 bytes
interleave sectors: 32768
size:    2056456 sectors
mode:    read/write
journal size: 8380416 bytes
journal watermark: 50%
journal commit time: 10000 ms

This is expected, as LUKS passes the authentication tags down from a higher layer and dm-integrity is only used to store them. This can be verified with cryptsetup:

cryptsetup status /dev/mapper/plaintext
/dev/mapper/plaintext is active.
type:    LUKS2
cipher:  aes-xts-plain64
keysize: 512 bits
key location: keyring
integrity: hmac(sha256)
integrity keysize: 256 bits
device:  /dev/mapper/example-ciphertext
sector size:  4096
offset:  0 sectors
size:    2056456 sectors
mode:    read/write

The device can be removed (while preserving the data, but making it inaccessible without providing the password again) using cryptsetup:

cryptsetup close /dev/mapper/plaintext

Testing

To test if the verification works correctly, first let’s verify that the whole device is readable:

dd if=/dev/mapper/plaintext of=/dev/null bs=$((4096*256)) \
status=progress
988807168 bytes (989 MB, 943 MiB) copied, 6 s, 165 MB/s
1004+1 records in
1004+1 records out
1052905472 bytes (1.1 GB, 1004 MiB) copied, 6.28939 s, 167 MB/s

Now, let’s close the device, check that a block in the middle looks random, and then overwrite it:

cryptsetup close /dev/mapper/plaintext
dd if=/dev/example/ciphertext bs=4096 skip=$((512*1024*1024/4096)) \
count=1 status=none | hexdump -C | head
00000000  70 a1 1d f7 da ae 04 d2  d5 f1 ed 6e ba 96 81 7a  |p..........n...z|
00000010  90 c9 7c e7 01 95 2b 12  ed fc 46 fb 0c d7 24 dd  |..|...+...F...$.|
00000020  48 a2 17 7a 17 9f 26 d8  ef ca 97 74 6e 56 2b 55  |H..z..&....tnV+U|
00000030  59 60 6c 72 e1 5d 14 b3  00 f9 70 e8 f3 31 5e 6f  |Y`lr.]....p..1^o|
00000040  c7 98 c8 e0 e0 f6 52 d3  36 07 34 93 59 42 98 12  |......R.6.4.YB..|
00000050  a8 44 f4 fa 13 94 d6 30  5d 88 ee 79 4c 99 7a a8  |.D.....0]..yL.z.|
00000060  cd 35 87 52 07 66 74 68  9e 61 2e 26 c3 74 67 91  |.5.R.fth.a.&.tg.|
00000070  33 57 21 61 44 b4 2e 31  a6 61 90 3f 04 d9 5e f3  |3W!aD..1.a.?..^.|
00000080  46 dc 2c c5 cb 50 1a b4  3a b5 4d 4d ee d3 0f fd  |F.,..P..:.MM....|
00000090  be 6c 5f 3a b6 f9 b3 f3  21 ac 6b cf dd f0 2e 3b  |.l_:....!.k....;|

Yep, looks random (and will look different for every newly formatted LUKS volume).

dd if=/dev/zero of=/dev/example/ciphertext bs=4096 \
seek=$((512*1024*1024/4096)) count=1
dd if=/dev/example/ciphertext bs=4096 skip=$((512*1024*1024/4096)) \
count=1 status=none | hexdump -C | head
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000

Not any more.

What happens when we try to read it now?

cryptsetup open --type luks /dev/example/ciphertext plaintext
dd if=/dev/mapper/plaintext of=/dev/null bs=$((4096*256)) \
status=progress
464519168 bytes (465 MB, 443 MiB) copied, 2 s, 232 MB/s
dd: error reading '/dev/mapper/example': Input/output error
496+1 records in
496+1 records out
520097792 bytes (520 MB, 496 MiB) copied, 2.43931 s, 213 MB/s

Exactly as expected, after reading about 0.5GiB of data, we get an I/O error.
(Re-writing the sector will cause checksum recalculation and will clear the error.)

Repeating the experiment without --integrity is left as an exercise for the reader.

Standalone dm-integrity setup

By default, integritysetup will use crc32, which is quite fast and has small overhead (just 4 bytes per block). This gives a probability of about 2^{-32} that a random corruption goes undetected (as the CRC value and the corrupted data are effectively independent). Please remember that this is on top of the silent corruption rate of the hard drive; i.e. if the HDD has a probability of returning malformed data of 10^{-7}, then the probability of malformed data reaching the upper layer is 10^{-7} \cdot 2^{-32} \approx 2^{-55} \approx 10^{-16}. There is a hidden assumption in this though – that the malformed data returned by the disk is uniformly distributed. I don’t know if that is typical and was unable to find more information on this topic; if it is not uniformly distributed, the failure rate for crc32 may be higher. More research is necessary in this area.
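To put that arithmetic into a quick sanity check, here is a minimal Python sketch (the 10^{-7} disk corruption rate is just the assumed figure from above):

# Back-of-the-envelope estimate, assuming independent events and uniformly
# distributed corruption (the assumption discussed above).
disk_corruption_rate = 1e-7     # assumed rate of the disk silently returning bad data
crc32_miss_rate = 2.0 ** -32    # chance a random corruption still matches the CRC

print(disk_corruption_rate * crc32_miss_rate)   # roughly 2.3e-17, in the 10^-16 ballpark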

In case that probability is unsatisfactory, it’s possible to use any of the hashes supported by the kernel and listed in the /proc/crypto file, with sha1, sha256, hmac-sha1 and hmac-sha256 being the more interesting ones.

Example configuration would look like this:

integritysetup format --progress-frequency 5 --integrity sha1 \
--tag-size 20 --sector-size 4096 /dev/example/raw-1
Formatted with tag size 20, internal integrity sha1.
Wiping device to initialize integrity checksum.
You can interrupt this by pressing CTRL+c (rest of not wiped device will contain invalid checksum).
Progress:   7.5%, ETA 01:04,   76 MiB written, speed  14.5 MiB/s
Progress:  17.2%, ETA 00:49,  173 MiB written, speed  16.8 MiB/s
Progress:  25.5%, ETA 00:45,  257 MiB written, speed  16.6 MiB/s
Progress:  32.5%, ETA 00:42,  328 MiB written, speed  16.0 MiB/s
Progress:  42.1%, ETA 00:35,  424 MiB written, speed  16.5 MiB/s
Progress:  51.2%, ETA 00:29,  516 MiB written, speed  16.8 MiB/s
Progress:  58.5%, ETA 00:25,  590 MiB written, speed  16.4 MiB/s
Progress:  68.6%, ETA 00:18,  692 MiB written, speed  16.9 MiB/s
Progress:  77.3%, ETA 00:13,  779 MiB written, speed  17.0 MiB/s
Progress:  84.3%, ETA 00:09,  850 MiB written, speed  16.6 MiB/s
Progress:  93.9%, ETA 00:03,  947 MiB written, speed  16.9 MiB/s
Finished, time 00:59.485, 1008 MiB written, speed  16.9 MiB/s

A new device is created with the open subcommand:

integritysetup open --integrity-no-journal --integrity sha1 \
/dev/example/raw-1 integr-1
integritysetup status integr-1
/dev/mapper/integr-1 is active.
type:    INTEGRITY
tag size: 20
integrity: sha1
device:  /dev/mapper/example-raw--1
sector size:  4096 bytes
interleave sectors: 32768
size:    2064688 sectors
mode:    read/write
journal: not active

(note that in line 4 we now have sha1 instead of (none))

If we want to cryptographically verify the integrity of the data, we will need to use an HMAC though. Setting it up is fairly similar, but it requires a key file (note: this key needs to remain secret for the construction to be cryptographically secure).

dd if=/dev/urandom of=hmac-key bs=1 count=20 status=none
integritysetup format --progress-frequency 5 --integrity hmac-sha1 \
--tag-size 20 --integrity-key-size 20 --integrity-key-file hmac-key \
--no-wipe --sector-size 4096 /dev/example/raw-1
integritysetup open --integrity-no-journal --integrity hmac-sha1 \
--integrity-key-size 20 --integrity-key-file hmac-key \
/dev/example/raw-1 integr-1
dd if=/dev/zero of=/dev/mapper/integr-1 bs=$((4096*32768))
integritysetup status integr-1
Formatted with tag size 20, internal integrity hmac-sha1.
dd: error writing '/dev/mapper/integr-1': No space left on device
8+0 records in
7+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 9.45404 s, 112 MB/s
/dev/mapper/integr-1 is active.
type:    INTEGRITY
tag size: 20
integrity: hmac(sha1)
device:  /dev/mapper/example-raw--1
sector size:  4096 bytes
interleave sectors: 32768
size:    2064688 sectors
mode:    read/write
journal: not active

(this time I’ve used --no-wipe as dd from /dev/zero is much faster)

Combined mdadm and dm-integrity

Now that we know how to set up dm-integrity devices and that they work as advertised, let’s see if that indeed will prevent silent data corruption and provide automatic recovery with Linux RAID infrastructure.

For this setup I’ll be using files to make it easier to reproduce the results (integritysetup configures loop devices automatically).

RAID 5

Setup

First let’s initialise the dm-integrity targets:

SIZE=$((1024*1024*1024))
COUNT=6
for i in $(seq $COUNT); do
truncate -s $SIZE "raw-$i"
integritysetup format --integrity sha1 --tag-size 16 \
--sector-size 4096 --no-wipe "./raw-$i"
integritysetup open --integrity-no-journal --integrity \
sha1 "./raw-$i" "integr-$i"
dd if=/dev/zero "of=/dev/mapper/integr-$i" bs=$((4096*512)) || :
done
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-1': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 5.119 s, 207 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-2': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 6.37586 s, 166 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-3': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 6.18465 s, 171 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-4': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 6.32175 s, 167 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-5': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 5.94098 s, 178 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-6': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 6.73871 s, 157 MB/s

Then set up the RAID-5 device and wait for initialisation.

mdadm --create /dev/md/robust -n$COUNT --level=5 \
$(seq --format "/dev/mapper/integr-%.0f" $COUNT)
mdadm --wait /dev/md/robust && echo ready
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/robust started.
ready

Single failure

To make sure that we do not work from cache, stop the array and close the integrity device, then overwrite half of one of the drives:

mdadm --stop /dev/md/robust
integritysetup close integr-1
tr '\000' '\377' < /dev/zero | dd of=raw-1 bs=4096 \
seek=$((SIZE/4096/2)) count=$((SIZE/4096/2-256)) \
conv=notrunc status=progress
mdadm: stopped /dev/md/robust
415428608 bytes (415 MB, 396 MiB) copied, 1 s, 413 MB/s
131008+0 records in
131008+0 records out
536608768 bytes (537 MB, 512 MiB) copied, 1.49256 s, 360 MB/s

(As we’re not encrypting data, zero bytes actually end up as zeros on the disk, so we need to write something else – 0xff in this case. Also, we’re skipping the last 1MiB of data as that’s where the RAID superblock lives, and recovering it is a completely different kettle of fish.)

Restart the array:

integritysetup open --integrity-no-journal --integrity \
sha1 "./raw-1" "integr-1"
mdadm --assemble /dev/md/robust $(seq --format \
"/dev/mapper/integr-%.0f" $COUNT)
mdadm: /dev/md/robust has been started with 6 drives.

Verify that all data is readable and that it has expected values (all zero):

hexdump -C /dev/md/robust
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
13ab00000

Additionally, you can verify that the errors were reported to the MD layer.
First, run ls -l /dev/md/robust to learn the number of the md device, then use that number in the following command instead of 127:

grep . /sys/block/md127/md/dev-*/errors
/sys/block/md127/md/dev-dm-14/errors:867024
/sys/block/md127/md/dev-dm-15/errors:0
/sys/block/md127/md/dev-dm-16/errors:0
/sys/block/md127/md/dev-dm-17/errors:0
/sys/block/md127/md/dev-dm-18/errors:0
/sys/block/md127/md/dev-dm-19/errors:0

Scrub the volume to fix all bad blocks:

mdadm --action=repair /dev/md/robust
mdadm --wait /dev/md/robust && echo ready
ready

And verify that all the incorrect blocks were rewritten/corrected:

dd if=/dev/mapper/integr-1 bs=4096 of=/dev/null status=progress
864800768 bytes (865 MB, 825 MiB) copied, 4 s, 216 MB/s
258086+0 records in
258086+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 4.68922 s, 225 MB/s

Double failure

Let’s see what happens if two drives return I/O error for the same area. Let’s introduce the errors:

mdadm --stop /dev/md/robust
integritysetup close integr-1
integritysetup close integr-2
tr '\000' '\377' < /dev/zero | dd of=raw-1 bs=4096 \
seek=$((SIZE/4096/2)) count=$((100*1025*1024/4096)) \
conv=notrunc status=none
tr '\000' '\377' < /dev/zero | dd of=raw-2 bs=4096 \
seek=$((SIZE/4096/2)) count=$((100*1025*1024/4096)) \
conv=notrunc status=none
mdadm: stopped /dev/md/robust

Then restart the array:

integritysetup open --integrity-no-journal --integrity \
sha1 "./raw-1" "integr-1"
integritysetup open --integrity-no-journal --integrity \
sha1 "./raw-2" "integr-2"
mdadm --assemble /dev/md/robust $(seq --format \
"/dev/mapper/integr-%.0f" $COUNT)
mdadm: /dev/md/robust has been started with 6 drives.

Let’s verify:

hexdump -C /dev/md/robust
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
hexdump: /dev/md/robust: Input/output error
9c026000

As expected, we’ve received an I/O error. Additionally if we look at the status of the array, we’ll see that it has become degraded:

Personalities : [raid10] [raid6] [raid5] [raid4]
md127 : active raid5 dm-14[0](F) dm-19[6] dm-18[4] dm-17[3] dm-16[2] dm-15[1]
5155840 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/5] [_UUUUU]

This is most likely because the kernel retries reading the sectors, and once it sees a set number of read errors it cannot correct (see max_read_errors in the md sysfs directory), it kicks the disk out of the array. That is probably suboptimal for this setup – the errors come from the integrity layer, not from failing hardware, so kicking the disk out of the RAID won’t change anything.

Clean-up

mdadm --stop /dev/md/robust
for i in $(seq $COUNT); do
integritysetup close "integr-$i"
done

RAID 6

Let’s now try with RAID 6, that is, with double redundancy.

Set-up:

for i in $(seq $COUNT); do
truncate -s $SIZE "raw-$i"
integritysetup format --integrity sha1 --tag-size 16 \
--sector-size 4096 --no-wipe "./raw-$i"
integritysetup open --integrity-no-journal --integrity \
sha1 "./raw-$i" "integr-$i"
dd if=/dev/zero "of=/dev/mapper/integr-$i" bs=$((4096*512)) || :
done
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-1': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 4.8998 s, 216 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-2': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 7.21848 s, 146 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-3': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 7.85458 s, 135 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-4': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 7.48937 s, 141 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-5': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 7.30369 s, 145 MB/s
Formatted with tag size 16, internal integrity sha1.
dd: error writing '/dev/mapper/integr-6': No space left on device
505+0 records in
504+0 records out
1057120256 bytes (1.1 GB, 1008 MiB) copied, 7.05441 s, 150 MB/s

RAID initialization:

mdadm --create /dev/md/robust -n$COUNT --level=6 \
$(seq --format "/dev/mapper/integr-%.0f" $COUNT)
mdadm --wait /dev/md/robust && echo ready
ready

Single failure

mdadm --stop /dev/md/robust
integritysetup close integr-1
tr '\000' '\377' < /dev/zero | dd of=raw-1 bs=4096 \
seek=$((SIZE/4096/2)) count=$((SIZE/4096/2-256)) \
conv=notrunc status=progress
mdadm: stopped /dev/md/robust
415428608 bytes (415 MB, 396 MiB) copied, 1 s, 413 MB/s
131008+0 records in
131008+0 records out
536608768 bytes (537 MB, 512 MiB) copied, 1.49256 s, 360 MB/s

Restart the array:

integritysetup open --integrity-no-journal --integrity \
sha1 "./raw-1" "integr-1"
mdadm --assemble /dev/md/robust $(seq --format \
"/dev/mapper/integr-%.0f" $COUNT)
mdadm: /dev/md/robust has been started with 6 drives.

Verify that all data is readable and that it has expected values (all zero):

hexdump -C /dev/md/robust
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
fbc00000

Good. And let’s check the status of the array:

Personalities : [raid10] [raid6] [raid5] [raid4]
md127 : active raid6 dm-19[5] dm-18[4] dm-17[3] dm-16[2] dm-15[1] dm-14[0](F)
4124672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [_UUUUU]

Not good: it looks like the raid6 target handles I/O errors differently than the raid5 one, and even though the failures are correctable, the array ends up degraded.

Double failure

Faults introduction:

mdadm --stop /dev/md/robust
integritysetup close integr-1
integritysetup close integr-2
tr '\000' '\377' < /dev/zero | dd of=raw-1 bs=4096 \
seek=$((SIZE/4096/2)) count=$((100*1025*1024/4096)) \
conv=notrunc status=none
tr '\000' '\377' < /dev/zero | dd of=raw-2 bs=4096 \
seek=$((SIZE/4096/2)) count=$((100*1025*1024/4096)) \
conv=notrunc status=none
mdadm: stopped /dev/md/robust

Restart:

integritysetup open --integrity-no-journal --integrity \
sha1 "./raw-1" "integr-1"
integritysetup open --integrity-no-journal --integrity \
sha1 "./raw-2" "integr-2"
mdadm --assemble /dev/md/robust $(seq --format \
"/dev/mapper/integr-%.0f" $COUNT)
mdadm: /dev/md/robust has been started with 6 drives.

Verify:

hexdump -C /dev/md/robust
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
fbc00000

So far so good. Now let’s check the status of the array:

Personalities : [raid10] [raid6] [raid5] [raid4]
md127 : active raid6 dm-14[0](F) dm-19[5] dm-18[4] dm-17[3] dm-16[2] dm-15[1](F)
4124672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/4] [__UUUU]

Not so good: both disks were marked as faulty in the array…

Clean-up

mdadm --stop /dev/md/robust
for i in $(seq $COUNT); do
integritysetup close "integr-$i"
rm "raw-$i"
done

Summary

While the functionality necessary to detect and correct silent data corruption is available in the Linux kernel, the implementation will likely need a few tweaks so that it doesn’t exacerbate situations where the hardware is physically failing rather than just returning garbage. Passing additional metadata about the I/O errors from the dm-integrity layer to the md layer could be a potential solution.

Update: This bug has been fixed in the upstream code in September 2019.

Also, this mechanism comes into play only when the hard drive is technically already failing, so at least you will know about the failure – and really, the failing disk does get kicked out 🙂

Learn more

Data integrity protection with cryptsetup tools presentation by Milan Brož at FOSDEM.

Note: Above tests were performed using 4.16.6-1-ARCH Linux kernel, mdadm 4.0-1 and cryptsetup 2.0.2-1 from ArchLinux.

Safe primes in Diffie-Hellman

The Diffie-Hellman key agreement protocol uses modular exponentiation and calls for the use of special prime numbers. If you ever wondered why, I’ll try to explain.

Diffie-Hellman key agreement

The “classical” Diffie-Hellman key exchange, also known as Finite Field Diffie-Hellman, uses one type of operation — modular exponentiation — and two secrets for two communication peers to arrive at a single shared secret.

The protocol requires a prime number and a number that is a so-called “generator” number to be known to both peers. This is usually achieved by either the server sending those values to the client (e.g. in TLS before v1.3 or in some SSH key exchange types) or by distributing them ahead of time to the peers (e.g. in IPSec and TLS 1.3).

When both peers know which parameters to use, they each generate a random number, perform modular exponentiation with this random number, the group generator and the prime, and send the result to the other party as a “key share”. The other party takes this key share and uses it as the base to perform modular exponentiation again with its own random number. The result of that second operation is the agreed-upon shared secret.

If we define g as the generator, p as the prime, S_r as the server-selected random, S_k as the server key share, C_r as the client-selected random, C_k as the client key share, and SK as the agreed-upon secret, the server performs the following operations:
S_k = g^{S_r} \mod p
SK = C_k^{S_r} \mod p

The client performs the following operations:
C_k = g^{C_r} \mod p
SK = S_k^{C_r} \mod p

Because (g^{C_r})^{S_r} = (g^{S_r})^{C_r} = SK \mod p, both parties agree on the same SK.
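As a toy illustration of the exchange in Python (the prime 23 and generator 5 below are just example values, far too small for real use):

import secrets

p = 23   # toy prime, far too small for real use
g = 5    # toy generator

S_r = secrets.randbelow(p - 2) + 1   # server's secret random
C_r = secrets.randbelow(p - 2) + 1   # client's secret random

S_k = pow(g, S_r, p)   # server key share, sent to the client
C_k = pow(g, C_r, p)   # client key share, sent to the server

# both sides compute the same shared secret
assert pow(C_k, S_r, p) == pow(S_k, C_r, p)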

Unfortunately, both peers need to operate on a value provided by the other party (not necessarily trusted or authenticated) together with their own secret value. This calls for the prime number used to have some special properties.

Modular exponentiation

The basic operation we’ll be dealing with is modular exponentiation. The simple way to explain it is that we take a number and raise it to some power. Then we divide that result by a third number. The remainder of that division is our result.

For 2^10 mod 12, the calculation will go as follows, first exponentiation:

2^10 = 1024

Then division:

1024 = 85*12 + 4

So the result is 4.
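In Python, the same calculation can be done either naively or with the three-argument pow(), which performs the reduction as it goes:

print((2 ** 10) % 12)   # exponentiate first, then reduce: 4
print(pow(2, 10, 12))   # built-in modular exponentiation: 4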

One of the interesting properties of modular exponentiation is that it is cyclic. If we take a base number and keep raising it to higher and higher powers, we will keep getting the same numbers in the same order:

$ python powmod.py -g 3 -m 14 -e 28
3^0  mod 14 = 1
3^1  mod 14 = 3
3^2  mod 14 = 9
3^3  mod 14 = 13
3^4  mod 14 = 11
3^5  mod 14 = 5
3^6  mod 14 = 1
3^7  mod 14 = 3
3^8  mod 14 = 9
3^9  mod 14 = 13
3^10 mod 14 = 11
3^11 mod 14 = 5
3^12 mod 14 = 1
3^13 mod 14 = 3
3^14 mod 14 = 9
3^15 mod 14 = 13
3^16 mod 14 = 11
3^17 mod 14 = 5
3^18 mod 14 = 1
3^19 mod 14 = 3
3^20 mod 14 = 9
3^21 mod 14 = 13
3^22 mod 14 = 11
3^23 mod 14 = 5
3^24 mod 14 = 1
3^25 mod 14 = 3
3^26 mod 14 = 9
3^27 mod 14 = 13

This comes from the fact that in modular arithmetic, for addition, subtraction, multiplication and exponentiation, the order in which the modulo reductions are performed does not matter; (a + b) mod c is equal to ((a mod c) + (b mod c)) mod c. Thus if we want to calculate 3^17 mod 14, we can write it down as ((3^6 mod 14) * (3^6 mod 14) * (3^5 mod 14)) mod 14, which reduces to 1 * 1 * 3^5 mod 14.
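A quick check of that shortcut for 3^17 mod 14, splitting the exponent as 6 + 6 + 5:

m = 14
full = pow(3, 17, m)
split = (pow(3, 6, m) * pow(3, 6, m) * pow(3, 5, m)) % m
print(full, split)   # both print 5
assert full == split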

The inverse of modular exponentiation is the discrete logarithm, in which, for a given base and modulus, we look for the exponent that produces a given number:

g^x mod m = n

Where g, m and n are given, we’re looking for x.

The fact that there are no fast algorithms for calculating the discrete logarithm is one of the reasons we can use modular exponentiation as the basis of the Diffie-Hellman algorithm.
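The only generic approach is essentially trial and error; a naive search like the hypothetical one below works for toy numbers, but becomes completely impractical once the modulus has hundreds or thousands of bits:

def brute_force_dlog(g, n, m):
    """Find x such that g^x mod m == n by trying every exponent in turn."""
    value = 1
    for x in range(m):
        if value == n:
            return x
        value = (value * g) % m
    raise ValueError("no solution")

print(brute_force_dlog(3, 13, 14))   # 3, since 3^3 mod 14 = 13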

Cyclic groups

Let’s see what happens if we start calculating results of modular exponentiation for 14 with different bases:

$ python groups.py -m 14
Groups modulo 14
g:  0, [1, 0]
g:  1, [1]
g:  2, [1, 2, 4, 8]
g:  3, [1, 3, 9, 13, 11, 5]
g:  4, [1, 4, 2, 8]
g:  5, [1, 5, 11, 13, 9, 3]
g:  6, [1, 6, 8]
g:  7, [1, 7]
g:  8, [1, 8]
g:  9, [1, 9, 11]
g: 10, [1, 10, 2, 6, 4, 12, 8]
g: 11, [1, 11, 9]
g: 12, [1, 12, 4, 6, 2, 10, 8]
g: 13, [1, 13]

None of the bases generates all of the numbers smaller than the modulus. In other words, there is no generator (in the number-theoretic sense) that generates the whole group.
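The groups.py script itself isn’t reproduced here, but a minimal sketch of the same idea – listing the cycle of values each base generates – could look like this:

def generated_elements(g, m):
    """Return the cycle g^0, g^1, ... mod m, stopping when a value repeats."""
    seen = []
    value = 1
    while value not in seen:
        seen.append(value)
        value = (value * g) % m
    return seen

for g in range(14):
    print("g: {:2d}, {}".format(g, generated_elements(g, 14)))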

To find such numbers, we need to start looking at prime numbers.

Cyclic groups modulo prime

Let’s see what happens for 13:

$ python groups.py -m 13
Groups modulo 13
g:  0, [1, 0]
g:  1, [1]
g:  2, [1, 2, 4, 8, 3, 6, 12, 11, 9, 5, 10, 7]
g:  3, [1, 3, 9]
g:  4, [1, 4, 3, 12, 9, 10]
g:  5, [1, 5, 12, 8]
g:  6, [1, 6, 10, 8, 9, 2, 12, 7, 3, 5, 4, 11]
g:  7, [1, 7, 10, 5, 9, 11, 12, 6, 3, 8, 4, 2]
g:  8, [1, 8, 12, 5]
g:  9, [1, 9, 3]
g: 10, [1, 10, 9, 12, 3, 4]
g: 11, [1, 11, 4, 5, 3, 7, 12, 2, 9, 8, 10, 6]
g: 12, [1, 12]

The obvious result is that we now have 4 generators — 2, 6, 7 and 11 generate the whole group.

But there is another result hiding here. Let’s look at the results for 19, this time with the sizes of the groups shown:

$ python groups-ann.py -m 19
Groups modulo 19
g:   0, len:   2, [1, 0]
g:   1, len:   1, [1]
g:   2, len:  18, [1, 2, 4, 8, 16, 13, 7, 14, 9, 18, 17, 15, 11, 3, 6, 12, 5, 10]
g:   3, len:  18, [1, 3, 9, 8, 5, 15, 7, 2, 6, 18, 16, 10, 11, 14, 4, 12, 17, 13]
g:   4, len:   9, [1, 4, 16, 7, 9, 17, 11, 6, 5]
g:   5, len:   9, [1, 5, 6, 11, 17, 9, 7, 16, 4]
g:   6, len:   9, [1, 6, 17, 7, 4, 5, 11, 9, 16]
g:   7, len:   3, [1, 7, 11]
g:   8, len:   6, [1, 8, 7, 18, 11, 12]
g:   9, len:   9, [1, 9, 5, 7, 6, 16, 11, 4, 17]
g:  10, len:  18, [1, 10, 5, 12, 6, 3, 11, 15, 17, 18, 9, 14, 7, 13, 16, 8, 4, 2]
g:  11, len:   3, [1, 11, 7]
g:  12, len:   6, [1, 12, 11, 18, 7, 8]
g:  13, len:  18, [1, 13, 17, 12, 4, 14, 11, 10, 16, 18, 6, 2, 7, 15, 5, 8, 9, 3]
g:  14, len:  18, [1, 14, 6, 8, 17, 10, 7, 3, 4, 18, 5, 13, 11, 2, 9, 12, 16, 15]
g:  15, len:  18, [1, 15, 16, 12, 9, 2, 11, 13, 5, 18, 4, 3, 7, 10, 17, 8, 6, 14]
g:  16, len:   9, [1, 16, 9, 11, 5, 4, 7, 17, 6]
g:  17, len:   9, [1, 17, 4, 11, 16, 6, 7, 5, 9]
g:  18, len:   2, [1, 18]

Note that all the sizes of those groups are factors of 18 – that is, of p-1.

The third observation we can draw from those results is that, for any element of a generated subgroup, the subgroup that element itself generates will be at most as large as the subgroup it came from.

With 19, if we take generator 8, the size of its subgroup is 6. The sizes of the subgroups generated by its elements 7, 18, 11 and 12 are, respectively, 3, 2, 3 and 6.

Thus, not only is the subgroup much smaller than the full group, it is also impossible to “escape” from it.

Safe primes

We saw that for primes, some bases are better than others (look into finite fields to learn more).

As we noticed, the sizes of all the groups are factors of the prime less one (see Fermat’s little theorem for a proof of this). Of course, with the exception of 2, all primes are odd numbers, so p-1 will always be divisible by two – it will be a composite number. But q = (p-1)/2 doesn’t have to be composite. Indeed, primes for which q is also prime are called safe primes.
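A small sketch of how one might check for a safe prime (with a naive trial-division primality test, good enough for toy numbers only):

def is_prime(n):
    """Naive trial-division primality test; fine for small numbers only."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

def is_safe_prime(p):
    """p is a safe prime if both p and q = (p - 1) / 2 are prime."""
    return is_prime(p) and is_prime((p - 1) // 2)

print(is_safe_prime(13))   # False: (13 - 1) / 2 = 6 is composite
print(is_safe_prime(23))   # True:  (23 - 1) / 2 = 11 is prime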

Let’s see what happens if we calculate groups of such a prime:

$ python groups-ann.py -m 23
Group sizes modulo 23
g:   0, len:   2, [1, 0]
g:   1, len:   1, [1]
g:   2, len:  11, [1, 2, 4, 8, 16, 9, 18, 13, 3, 6, 12]
g:   3, len:  11, [1, 3, 9, 4, 12, 13, 16, 2, 6, 18, 8]
g:   4, len:  11, [1, 4, 16, 18, 3, 12, 2, 8, 9, 13, 6]
g:   5, len:  22, [1, 5, 2, 10, 4, 20, 8, 17, 16, 11, 9, 22, 18, 21, 13, 19, 3, 15, 6, 7, 12, 14]
g:   6, len:  11, [1, 6, 13, 9, 8, 2, 12, 3, 18, 16, 4]
g:   7, len:  22, [1, 7, 3, 21, 9, 17, 4, 5, 12, 15, 13, 22, 16, 20, 2, 14, 6, 19, 18, 11, 8, 10]
g:   8, len:  11, [1, 8, 18, 6, 2, 16, 13, 12, 4, 9, 3]
g:   9, len:  11, [1, 9, 12, 16, 6, 8, 3, 4, 13, 2, 18]
g:  10, len:  22, [1, 10, 8, 11, 18, 19, 6, 14, 2, 20, 16, 22, 13, 15, 12, 5, 4, 17, 9, 21, 3, 7]
g:  11, len:  22, [1, 11, 6, 20, 13, 5, 9, 7, 8, 19, 2, 22, 12, 17, 3, 10, 18, 14, 16, 15, 4, 21]
g:  12, len:  11, [1, 12, 6, 3, 13, 18, 9, 16, 8, 4, 2]
g:  13, len:  11, [1, 13, 8, 12, 18, 4, 6, 9, 2, 3, 16]
g:  14, len:  22, [1, 14, 12, 7, 6, 15, 3, 19, 13, 21, 18, 22, 9, 11, 16, 17, 8, 20, 4, 10, 2, 5]
g:  15, len:  22, [1, 15, 18, 17, 2, 7, 13, 11, 4, 14, 3, 22, 8, 5, 6, 21, 16, 10, 12, 19, 9, 20]
g:  16, len:  11, [1, 16, 3, 2, 9, 6, 4, 18, 12, 8, 13]
g:  17, len:  22, [1, 17, 13, 14, 8, 21, 12, 20, 18, 7, 4, 22, 6, 10, 9, 15, 2, 11, 3, 5, 16, 19]
g:  18, len:  11, [1, 18, 2, 13, 4, 3, 8, 6, 16, 12, 9]
g:  19, len:  22, [1, 19, 16, 5, 3, 11, 2, 15, 9, 10, 6, 22, 4, 7, 18, 20, 12, 21, 8, 14, 13, 17]
g:  20, len:  22, [1, 20, 9, 19, 12, 10, 16, 21, 6, 5, 8, 22, 3, 14, 4, 11, 13, 7, 2, 17, 18, 15]
g:  21, len:  22, [1, 21, 4, 15, 16, 14, 18, 10, 3, 17, 12, 22, 2, 19, 8, 7, 9, 5, 13, 20, 6, 11]
g:  22, len:   2, [1, 22]

The groups look very different from the ones we saw previously: with the exception of bases 0, 1 and p-1, all groups are relatively large – 11 or 22 elements.

One interesting observation we can make about bases that have group order 2q is that even exponents will land in the subgroup of order q (the quadratic residues), while odd exponents will produce values outside of it. Thus we can say that use of a generator belonging to the group of order 2q will leak the least significant bit of the exponent.
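The leak is easy to demonstrate: raising a key share to the power q tells us whether it lies in the order-q subgroup, which for a generator of order 2q reveals the parity of the exponent. Using the toy prime 23 (so q = 11) and the generator 5 from the table above:

p = 23
q = (p - 1) // 2   # 11
g = 5              # a generator of the full group, i.e. of order 2q

for exponent in range(1, 8):
    share = pow(g, exponent, p)
    # share^q mod p equals 1 exactly when share is in the order-q subgroup,
    # which happens exactly when the exponent is even
    print(exponent, pow(share, q, p) == 1)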

That’s why protecting against small subgroup attacks with safe primes is so easy: it requires comparing the peer’s key share against just 3 numbers. It’s also the reason why it’s impossible to “backdoor” the parameters (prime and generator), as every generator is a good generator.
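A sketch of such a check, assuming the three forbidden values are 0, 1 and p-1 (the key shares that would confine the shared secret to a trivial subgroup):

def validate_key_share(share, p):
    """Reject key shares that would force the shared secret into a trivial subgroup."""
    if not 0 < share < p:
        raise ValueError("key share out of range")
    if share in (1, p - 1):
        raise ValueError("key share in a trivial subgroup")
    return share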

Testing for SLOTH

Researchers at INRIA have published a new attack against TLS they called SLOTH. More details about it can be found at http://sloth-attack.org.

The problematic part is that many frameworks (GnuTLS, OpenSSL, NSS), even if they don’t advertise support for MD5 hashes, would in fact accept messages signed with this obsolete and insecure hash.

Thus, to test properly if a server is vulnerable against this attack, we need a client that is misbehaving.

For easy writing of such test cases I have been working on tlsfuzzer. The just-released version has been extended to test servers for vulnerability to the SLOTH attack (to be more precise, just the client impersonation attack – the most severe of the described ones).

Client impersonation attack

To test vulnerability of a server to the client impersonation attack, you will need the TLS server, a client certificate and key trusted by the server, and Python (any version since 2.6 or 3.2 will do). The full procedure for testing a server is as follows.

Certificates:

For testing we will need a set of certificates trusted by the server, in this case we will cheat a little and tell the server to trust a certificate directly.

Client certificate:

openssl req -x509 -newkey rsa -keyout localuser.key \
-out localuser.crt -nodes -batch -subj /CN=Local\ User

Server certificate:

openssl req -x509 -newkey rsa -keyout localhost.key -out localhost.crt -nodes -batch -subj /CN=localhost

Server setup

The test client expects an HTTP server on localhost, on port 4433 that requests client certificates:

openssl s_server -key localhost.key -cert localhost.crt -verify 1 -www -tls1_2 -CAfile localuser.crt

Reproducer setup

The reproducer has a few dependencies on the system.

First, you will need the python pip command. In case your distribution doesn’t provide it, download it from https://bootstrap.pypa.io/get-pip.py and run it using python:

python get-pip.py

After that, install dependencies of tlsfuzzer:

pip install --pre tlslite-ng

Note: installation may print an error: “error: invalid command ‘bdist_wheel'”. It can be ignored; it doesn’t break installation of the package. In case you want to fix it anyway, upgrade the setuptools package installed on your system by running:

pip install --upgrade setuptools

Finally download the reproducer itself:

git clone https://github.com/tomato42/tlsfuzzer.git

Running reproducer

Once we have all pieces in place, we can run the reproducer as follows:

cd tlsfuzzer
PYTHONPATH=. python scripts/test-certificate-verify.py -k /tmp/localuser.key -c /tmp/localuser.crt

(assuming you generated the user certificates in the /tmp directory)

Results

If the execution finished with

MD5 CertificateVerify test version 4                              
MD5 forced...
OK
Sanity check...
OK
Test end
successful: 2
failed: 0

That means that the server is not vulnerable.

In case the “MD5 forced” failed, but “Sanity check” resulted in “OK”, it means that the server is vulnerable.

Example failure may look like this:

MD5 CertificateVerify (CVE-2015-7575 aka SLOTH) test version 4
MD5 forced...
Error encountered while processing node <tlsfuzzer.expect.ExpectClose object at 0xe3a410> with last message being: <tlslite.messages.Message object at 0xe3a8d0>
Error while processing
Traceback (most recent call last):
  File "scripts/test-certificate-verify.py", line 140, in main
    runner.run()
  File "/root/tlsfuzzer/tlsfuzzer/runner.py", line 139, in run
    msg.write()))
AssertionError: Unexpected message from peer: ChangeCipherSpec()

Sanity check...
OK
Test end
successful: 1
failed: 1

(if the error was caused by Unexpected message from peer: ChangeCipherSpec, as shown above, it means that the server is definitely vulnerable)

In case the Sanity check failed, that may mean one of few things:

  • the server is not listening on localhost on port 4433
  • the server does not support TLS v1.2 protocol, in that case it is not vulnerable (note: this is NOT a good workaround)
  • the server does not support TLS_RSA_WITH_AES_128_CBC_SHA cipher (AES128-SHA in OpenSSL naming system)
  • the server did not ask for certificate on first connection attempt

The cryptopocalypse is near(er)!

That’s at least what NIST, CNSS and NSA think.

The primary reason for deploying cryptographic systems is to protect secrets. When the system carries information with a very long life (like locations of nuclear silos or evidence of marital infidelity), you need to stop using it well before it is broken. That means the usable life of a crypto-system is shorter than the time it remains unbroken.

Suite B is a set of cryptographic algorithms in very specific configurations, originally published in 2005. Implementations certified by NIST in the FIPS program were allowed for protection of SECRET and TOP SECRET information, depending on the specific key sizes used. In practice, SECRET was equivalent to a 128 bit level of security, i.e. SHA-256 signatures, AES-128 and the P-256 curve; TOP SECRET required a 192 bit level of security, with SHA-384 signatures, the P-384 curve and AES-256 for encryption.

They now claim that quantum computers are much closer than we thought (a less than 10 year time frame), and as such the keys used for protection of secret information need to be increased in the short term (significantly in the case of ECC), and research into quantum resistant algorithms is now a priority.

New recommendations

That means we get a new set of recommendations.

To summarise:

If you’re using TLS or IPsec with Pre-Shared Keys (PSK) with AES-256 encryption, you’ll most likely be fine.

If you were planning deployment of ECC in near future, you should just increase key sizes of existing RSA and DH systems and prepare for deployment of quantum resistant crypto in near future instead.

For RSA and finite-field DH (a new addition to Suite B, but very old crypto systems in their own right) the recommended minimum is 3072 bit parameters. That is not particularly surprising, as that is the ENISA as well as NIST recommendation for a 128 bit level of security.

What is a bit surprising is that they have changed the minimum hash size from 256 to 384 bit.

For ECC systems the P-256 curve was downgraded to be secure enough only to protect unclassified information, so it was put together with 2048 bit RSA and DH. The minimum now is the P-384 curve.

So now the table with equivalent systems looks like this:

LoS      RSA key size  DH key size  ECC key size  Hash     AES key size
112 bit  2048 bit      2048 bit     256 bit       SHA-256  128 bit
128 bit  3072 bit      3072 bit     384 bit       SHA-384  256 bit

What does that mean?

Most commercial systems don’t need to perform key rotation and reconfiguration just yet, as the vast majority of them (nearly 90%) still use just 2048 bit RSA for authentication. What this does mean is that the recent migration to ECC (like ECDHE key exchange and ECDSA certificates) didn’t bring an increase in security, just in speed of key exchange. So if you’re an admin, you don’t need to do much, at least not until other groups of people do their part.

Software vendors need to make their software actually negotiate the curve used for ECDHE key exchange. A situation in which 86% of servers that can do ECDHE can do it only with P-256 is… unhealthy. The strongest mutually supported algorithms should be negotiated automatically and by default. That means stronger signatures on ECDHE and DHE key exchanges, bigger curves selected for ECDHE and bigger parameters selected for DHE (at least as soon as draft-ietf-tls-negotiated-ff-dhe-10 becomes a standard).

Finally, we need quantum computing resistant cryptography. It would also be quite nice if we didn’t have to wait 15 or even 10 years before it reaches 74% of the web server market because of patent litigation fears.

More nails to RC4 coffin

Last week Christina Garman, Kenneth G. Paterson and Thyla van der Merwe published new attacks on RC4 in a paper titled Attacks Only Get Better: Password Recovery Attacks Against RC4 in TLS. In it they outline an attack which recovers user passwords in IMAP and HTTP Basic authentication using 2^26 ciphertexts. Previous attacks required about 2^34 ciphertexts.

The other attack, published yesterday at the BlackHat conference, is the Bar-mitzvah attack, which requires about 2^29 ciphertexts.

While connections to relatively few servers (~6% of Alexa top 1 million TLS-enabled sites) will end up using the RC4 cipher, the 75% market share of RC4 support in general is not reassuring.