<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Sergey "Shnatsel" Davidoff on Medium]]></title>
        <description><![CDATA[Stories by Sergey "Shnatsel" Davidoff on Medium]]></description>
        <link>https://medium.com/@shnatsel?source=rss-fb9e4cd5e1ee------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*hhZ8-oatvBSbLtns74DrTw.png</url>
            <title>Stories by Sergey &quot;Shnatsel&quot; Davidoff on Medium</title>
            <link>https://medium.com/@shnatsel?source=rss-fb9e4cd5e1ee------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 09 Apr 2026 05:14:52 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@shnatsel/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[The state of SIMD in Rust in 2025]]></title>
            <link>https://shnatsel.medium.com/the-state-of-simd-in-rust-in-2025-32c263e5f53d?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/32c263e5f53d</guid>
            <category><![CDATA[simd]]></category>
            <category><![CDATA[rust-programming-language]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[rust]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Wed, 05 Nov 2025 15:14:40 GMT</pubDate>
            <atom:updated>2025-11-08T09:44:45.578Z</atom:updated>
            <content:encoded><![CDATA[<p>If you’re already familiar with SIMD, the table below is all you need.</p><p>And if you’re not, you will understand the table by the end of this article!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ERUaDSMPbto4ifv6EpaJuQ.png" /></figure><h3>What’s SIMD? Why SIMD?</h3><p>Hardware that does arithmetic is cheap, so any CPU made this century has plenty of it. But you still only have one instruction decoding block and it is hard to get it to go fast, so the arithmetic hardware is vastly underutilized.</p><p>To get around the instruction decoding bottleneck, you can feed the CPU a batch of numbers all at once for a single arithmetic operation like addition. Hence the name: “single instruction, multiple data,” or SIMD for short.</p><p>Instead of adding two numbers together, you can add two batches or “vectors” of numbers and it takes about the same amount of time.</p><p>On recent x86 chips these batches can be up to 512 bits in size, so in theory you can get an 8x speedup for math on u64 or a 64x speedup on u8!</p><h3>Instruction sets</h3><p>Historically, SIMD instructions were added after the CPU architecture was already designed, so SIMD is an extension with its own marketing name on each architecture.</p><p>ARM calls theirs “NEON”, and all 64-bit ARM CPUs have it.</p><p>WebAssembly doesn’t have a marketing department, so they just call theirs “WebAssembly 128-bit packed SIMD extension”.</p><p>64-bit x86 shipped with one called “SSE2” which has basic instructions for 128-bit vectors, but <em>later</em> they added a whole menagerie of extensions on top of that, with SSE 4.2 adding more operations, AVX and AVX2 adding 256-bit vectors and AVX-512 adding 512-bit vectors.</p><p>The word “later” in the above paragraph creates a problem.</p><h4>Does this CPU have that instruction?</h4><p>If you’re running a program on an x86 CPU, it’s not a given that the CPU has any particular SIMD extension. 
So by default the compiler isn’t allowed to use instructions beyond SSE2 because that won’t work on all x86 CPUs.</p><p>There are two ways around this problem.</p><p>If you work for a company that only ever runs their binaries on their own servers or on a public cloud, you can just assert that they’re all recent enough to at least have AVX2, which was introduced over 10 years ago, and have the program crash or misbehave if it ever runs on anything without AVX2:</p><pre>RUSTFLAGS=&#39;-C target-cpu=x86-64-v3&#39; cargo build --release</pre><p>However, if you are distributing the binaries for other people to run, that’s not really an option.</p><p>Instead you can do something called <strong>function multiversioning: </strong>compile the same function multiple times for different SIMD extensions, and when the program actually runs, check what features the CPU supports and select the appropriate version based on that.</p><p>Fortunately, this problem is largely confined to x86.</p><p>ARM made its NEON mandatory in all 64-bit CPUs and then didn’t bother expanding the width beyond 128 bits. (Technically <a href="https://en.wikipedia.org/wiki/AArch64#Scalable_Vector_Extension_(SVE)">SVE</a> exists, but in 2025 it is still mostly on paper, and Rust support for it is <a href="https://github.com/rust-lang/rust-project-goals/issues/270">still in progress</a>.)</p><p>WebAssembly makes you compile two different binaries, one with SIMD and one without, and use JavaScript to check if the browser supports SIMD.</p><h3>Solution space</h3><p>There are four approaches to SIMD in Rust, in ascending order of effort:</p><ol><li>Automatic vectorization</li><li>Fancy iterators</li><li>Portable SIMD abstractions</li><li>Raw intrinsics</li></ol><h3>Automatic vectorization</h3><p>The easiest approach to SIMD is letting the compiler do it for you.</p><p>It works surprisingly well, as long as you structure your code in a way that is amenable to vectorization. 
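</p><p>For example, here is a hand-written sketch (mine, not code from the original article) of a loop shape the autovectorizer handles well on stable Rust: fixed-size chunks via chunks_exact, so the chunk length is statically known and no bounds checks survive in the hot loop.</p>

```rust
// Summing u32s in fixed-size chunks: a shape the autovectorizer handles well.
fn sum_u32(data: &[u32]) -> u32 {
    let mut sums = [0u32; 8];
    let mut chunks = data.chunks_exact(8);
    for chunk in &mut chunks {
        for i in 0..8 {
            // `chunk` is always exactly 8 elements long, so indexing
            // with i in 0..8 needs no bounds check.
            sums[i] = sums[i].wrapping_add(chunk[i]);
        }
    }
    // Handle the leftover elements that didn't fill a whole chunk.
    let tail: u32 = chunks.remainder().iter().sum();
    sums.iter().copied().fold(tail, u32::wrapping_add)
}

fn main() {
    let data: Vec<u32> = (1..=100).collect();
    assert_eq!(sum_u32(&data), 5050);
}
```

Maintaining eight independent partial sums, rather than one accumulator, is what gives the compiler a whole vector's worth of independent work per iteration.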
This article covers it:</p><p><a href="https://matklad.github.io/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html">Can You Trust a Compiler to Optimize Your Code?</a></p><p>You can check if it’s working with <a href="https://crates.io/crates/cargo-show-asm">cargo-show-asm</a> or <a href="https://rust.godbolt.org/">godbolt.org</a>, but your benchmarks are the ultimate judge of the results.</p><p>Sadly there is a limit on the complexity of the code that the compiler will vectorize, and it may change between compiler versions. If something vectorizes today, that doesn’t necessarily mean it still will a year from now.</p><p>The other drawback of this method is that the optimizer won’t even touch anything involving floats (f32 and f64 types). It’s not permitted to change any observable outputs of the program, and reordering float operations may alter the result due to precision loss. (There is a way to tell the compiler not to worry about precision loss, but it’s currently <a href="https://doc.rust-lang.org/nightly/std/primitive.f64.html#method.algebraic_add">nightly-only</a>.)</p><p>So right now,<strong> if you need to process floats, autovectorization is a no-go </strong>unless you can use nightly builds of the Rust compiler.</p><p>(Floats are cursed even without SIMD. Something as simple as summing an array of them in a usable way turns out to be <a href="https://orlp.net/blog/taming-float-sums/">really hard</a>.)</p><p>There is no built-in way to multiversion functions, but the <a href="https://crates.io/crates/multiversion">multiversion</a> crate works great with autovectorization.</p><h3>Fancy iterators</h3><p>Just like <a href="https://crates.io/crates/rayon">rayon</a> lets you run your iterators in parallel by swapping .iter() with .par_iter(), there have been attempts to do the same for SIMD. 
After all, what is SIMD but another kind of parallelism?</p><p>This is the approach that the <a href="https://github.com/AdamNiederer/faster">faster</a> crate takes. That crate has been abandoned for years, and it doesn’t look like this approach has panned out.</p><h3>Portable SIMD abstractions</h3><p>The idea is to let you write your algorithm by explicitly operating on chunks of data, something like [f32; 8] but wrapped in a custom type, and then provide custom implementations of operations like + that compile down into SIMD instructions.</p><p><a href="https://doc.rust-lang.org/stable/std/simd/index.html">std::simd</a> is exactly that. It supports all instruction sets LLVM supports, so its platform support is unparalleled. It pairs well with the <a href="https://crates.io/crates/multiversion">multiversion</a> crate. Sadly it’s <strong>nightly-only</strong> and will remain such for the foreseeable future, so it’s unusable in most situations.</p><p>The <a href="https://crates.io/crates/wide">wide</a> crate is a mature, established option. It supports NEON, WASM and all the x86 instruction sets. But it <strong>doesn’t support multiversioning</strong> at all, save for very exotic and limited approaches like <a href="https://github.com/ronnychevalier/cargo-multivers">cargo-multivers</a>.</p><p>The <a href="https://github.com/sarah-quinones/pulp">pulp</a> crate has built-in multiversioning, and is reasonably mature and complete, if not as much as wide. It powers <a href="https://crates.io/crates/faer">faer</a>, so its performance is clearly proven. A major limitation is that it <strong>only operates on the native SIMD width,</strong> so you need your code to be able to handle variable width chunks as opposed to expressing everything in terms of something like [f32; 8] and letting the library lower it into the appropriate instructions like std::simd and wide do. 
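</p><p>To make the fixed-width style concrete, here is a dependency-free sketch (a toy of mine, not code from any of these crates) of the idea: a plain wrapper over [f32; 8] with elementwise arithmetic, which a real library lowers to actual SIMD instructions.</p>

```rust
use std::array;
use std::ops::Add;

// Toy stand-in for the f32x8 types offered by std::simd or wide.
#[derive(Clone, Copy, PartialEq, Debug)]
struct F32x8([f32; 8]);

impl Add for F32x8 {
    type Output = F32x8;
    fn add(self, rhs: F32x8) -> F32x8 {
        // Elementwise addition; with the right target features enabled,
        // the compiler turns this into a single vector instruction.
        F32x8(array::from_fn(|i| self.0[i] + rhs.0[i]))
    }
}

fn main() {
    let a = F32x8([1.0; 8]);
    let b = F32x8([2.0; 8]);
    assert_eq!((a + b).0, [3.0; 8]);
}
```

<p>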
And it’s difficult to write code that’s generic over type, so if you want both f32 and f64 there will be some code duplication. The architecture support is also limited: only NEON, AVX2 and AVX-512. AVX2 was introduced in 2012, but in the <a href="https://firefoxgraphics.github.io/telemetry/#view=system">Firefox hardware survey</a> only 75% of systems have it.</p><p>The <a href="https://crates.io/crates/macerator">macerator</a> crate is a fork of pulp with better support for generic programming and vastly expanded instruction set support. It supports all x86 extensions, WASM, NEON, and even the LoongArch SIMD extensions. It’s used only by <a href="https://crates.io/crates/burn-ndarray">burn-ndarray</a>, and even there it’s an optional dependency. It sounds great on paper, but it’s oddly obscure and therefore unproven.</p><p>The <a href="https://crates.io/crates/fearless_simd">fearless_simd</a> crate is inspired by pulp’s design, but also supports fixed-size chunks just like std::simd and wide. It’s far less mature than pulp, but it’s under active development. As of this writing it supports NEON, WASM and SSE4.2, but not the newer x86 extensions. It seems too immature just yet, but it’s something to keep an eye on.</p><p><a href="https://crates.io/crates/simdeez">simdeez</a> is a rather old crate that supports all instruction sets except AVX-512 and comes with built-in multiversioning. What gives me pause is that despite existing for many years, it’s still barely used. Everyone else who needed SIMD built their own instead of using it. 
And its README says:</p><blockquote>Currently things are well fleshed out for i32, i64, f32, and f64 types.</blockquote><p>So I guess the other types aren’t complete?</p><p><strong>TL;DR:</strong> use std::simd if you don’t mind nightly, wide if you don’t need multiversioning, and otherwise pulp or macerator.</p><p>If it’s not 2025 when you’re reading this, check out fearless_simd, because std::simd is <em>still</em> in nightly in your glorious future, isn’t it?</p><h3>Raw intrinsics</h3><p>If you want to get really close to the metal, there are always <a href="https://doc.rust-lang.org/stable/std/arch/index.html">the raw intrinsics</a>, just one step removed from the processor instructions.</p><p>The problem looming over any use of raw intrinsics is that you have to manually write them for every platform and instruction set you’re targeting. Whereas std::simd or wide let you write your logic once and compile it down to the assembly automatically, with intrinsics you have to write a separate implementation for every single platform and instruction set (SSE, AVX, NEON…) you care to support. That’s a lot of code!</p><p>It’s really not helped by the fact that they are all named something like _mm256_srli_epi32 and your code ends up as a long list of calls to these arcanely named functions. And wrappers that help readability introduce their own problems, such as <a href="https://crates.io/crates/safe_arch">clashes with multiversioning</a> or <a href="https://github.com/etemesi254/zune-image/blob/c77248891a4209d72f5d268b7208780692168606/crates/zune-jpeg/src/unsafe_utils_avx2.rs#L50-L60">unsafe code</a> or <a href="https://github.com/libjxl/jxl-rs/blob/46decb1f84f9bfb95e6661682f09270dade161d1/jxl_simd/src/x86_64/avx.rs#L225-L241">arcane macros</a>.</p><p>You also have to build your own multiversioning. Or rather, you have to manually dispatch to the dedicated implementation you have manually written for each instruction set. 
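</p><p>A minimal sketch of such manual dispatch (hypothetical function names; a real AVX2 body would be written with intrinsics rather than a plain loop):</p>

```rust
// On x86-64, a version of the function compiled with AVX2 enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[u32]) -> u32 {
    // A real implementation would use AVX2 intrinsics here;
    // a plain loop stands in for brevity.
    data.iter().sum()
}

fn sum(data: &[u32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Sound: we just verified at runtime that the CPU has AVX2.
            return unsafe { sum_avx2(data) };
        }
    }
    // Fallback for older x86 CPUs and for other architectures.
    data.iter().sum()
}

fn main() {
    assert_eq!(sum(&[1, 2, 3, 4]), 10);
}
```

<p>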
The <a href="https://doc.rust-lang.org/stable/std/macro.is_x86_feature_detected.html">std::is_x86_feature_detected!</a> macro takes care of the feature detection, but it is somewhat slow. In some cases it is beneficial to detect available features exactly once and then cache the results, but you have to implement that manually too.</p><p>On the bright side, this year writing intrinsics got markedly less awful. Most of them are no longer unsafe to call in Rust 1.87 and later, and the <a href="https://crates.io/crates/safe_unaligned_simd">safe_unaligned_simd</a> crate provides safe wrappers for the rest.</p><p>So at least this approach is no longer unsafe on top of all the other problems it has!</p><h3>Which one is right for you?</h3><p>The right tool for the job ultimately depends on the use case.</p><p>Want zero dependencies and little up-front hassle? Autovectorization. Porting existing C code or targeting very specific hardware? Intrinsics. Anything else? Portable SIMD abstraction.</p><p>And now that you’ve made it this far, you can understand the table at the top of the article, which will help guide your decision!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[I am stepping back from maintaining ‘cargo audit’]]></title>
            <link>https://shnatsel.medium.com/i-am-stepping-back-from-maintaining-cargo-audit-35bb5f832d43?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/35bb5f832d43</guid>
            <category><![CDATA[cargo]]></category>
            <category><![CDATA[maintenance]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[rust]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Wed, 05 Mar 2025 14:34:50 GMT</pubDate>
            <atom:updated>2025-03-05T14:34:50.608Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>TL;DR:</strong> I will no longer be actively working on <a href="https://crates.io/crates/cargo-audit">cargo audit</a> and the <a href="https://rustsec.org/">RustSec security advisory database</a>. I will continue working on my other Rust projects and contributing to the Rust ecosystem. Maintenance of cargo audit and its database will be carried on by <a href="https://www.rust-lang.org/governance/wgs/wg-secure-code">the other maintainers</a>.</p><h3>The catalyst</h3><p>Any command that goes through Rust’s build system, Cargo, <a href="https://shnatsel.medium.com/do-not-run-any-cargo-commands-on-untrusted-projects-4c31c89a78d6">can execute arbitrary code</a>. This applies to <em>any</em> command starting with cargo.</p><p>This is a pretty reasonable thing for a build system. After all, you’d expect the same from calling make or gradle or pretty much anything else.</p><p>It is not so reasonable when you consider external plugins for Cargo such as <a href="https://crates.io/crates/cargo-audit">cargo audit</a>. 
Similar programs like <a href="https://github.com/google/osv-scanner/">osv-scanner</a> or <a href="https://github.com/aquasecurity/trivy">trivy</a> do not execute arbitrary code from the repository, so why should cargo audit?</p><h3>So what do we do?</h3><p>The exact details of the issue are not really relevant here, but you can read my article about it if you’re curious:</p><p><a href="https://shnatsel.medium.com/do-not-run-any-cargo-commands-on-untrusted-projects-4c31c89a78d6">Do not run any Cargo commands on untrusted projects</a></p><p>The important bit is that there are three ways to resolve this issue:</p><h4>Change the behavior of Cargo</h4><p>While the Cargo team agrees this is desirable in the long run, this is also a breaking change for existing users that may be using these features for legitimate purposes.</p><p>And since this is a breaking change, it will need to go through a lengthy deprecation period first. So any solution here is not going to arrive soon enough, even if implementation on it started today. Plus there are always people out there running years-old versions of Cargo.</p><h4>Document the issue and pretend that this solves it</h4><p>So should we just accept this and document that cargo audit cannot be run on untrusted code?</p><p>Most people are running cargo audit either on CI, which is well equipped for running untrusted code, or on their own projects they trust. 
And yet, I’m certain that someone, somewhere, will quite reasonably assume that cargo audit isn’t different from other vulnerability scanners, and run it on untrusted code without sandboxing, and get pwned.</p><p>I even wrote <a href="https://crates.io/crates/binfarce">a custom binary format parser</a> to make its binary scanning 100% memory safe and robust, so it feels weird to get an arbitrary code execution vulnerability from a far simpler source.</p><h4>Rename ‘cargo audit’ and migrate all users</h4><p>So should we rename it to something like rust-audit and tell all users to migrate?</p><p>But if we’re migrating users anyway, does it have to be a migration to the new name of the same tool? Maybe it’s time to just port the remaining “special sauce” like binary scanning and automatic fixes to something else, like <a href="https://github.com/google/osv-scanner/">osv-scanner</a>, and just migrate people to that?</p><h3>Why do we even have ‘cargo audit’?</h3><p>When Tony started cargo audit eight years ago, the very concept of an actually machine-readable vulnerability database and a tool that could automatically check your dependencies against it was novel. There was <em>literally nothing else</em> that could do the job of cargo audit.</p><p>But in the last few years the rest of the industry has finally caught up. There is now a single machine-readable format and aggregator (<a href="https://osv.dev/">OSV</a>), a <a href="https://github.com/google/osv-scanner/">bunch</a> of <a href="https://github.com/aquasecurity/trivy">tools</a> that pull data from it, and even a <a href="https://github.com/advisories?query=ecosystem%3Arust">competing vulnerability database</a> under a permissive license that imports our data and exports all its data to OSV (<a href="https://en.wikipedia.org/wiki/Enshittification">at least for now</a>).</p><p>cargo audit still has special sauce, to be sure:</p><ol><li>It generates the Cargo.lock file if it’s not present or out of date. 
But it calls into Cargo to do it, so that’s <a href="https://shnatsel.medium.com/do-not-run-any-cargo-commands-on-untrusted-projects-4c31c89a78d6">unsafe on an untrusted project</a>.</li><li>It can scan even binaries built without <a href="https://github.com/rust-secure-code/cargo-auditable">cargo auditable</a> and recover an (incomplete) list of dependencies <a href="https://docs.rs/quitters/latest/quitters/">by parsing panic messages</a>.</li><li>You can run cargo audit fix to try and automatically upgrade your vulnerable dependencies to versions with the fixes.</li><li>It will notify you about unmaintained dependencies that do not receive security updates.</li></ol><p>But 1 through 3 are features that I’ve implemented (at least in their current form), and the code is quite straightforward. There is no reason why other tools, even ones not specifically built for Rust, could not adopt them.</p><p>And 4 has been a controversial feature. Some people love it, some people <a href="https://lucumr.pocoo.org/2024/3/26/rust-cdo/">hate</a> it. cargo audit is configured to only show a warning by default and not cause CI to fail, but lots of people use a GitHub action that treats these warnings as errors, and get upset when unmaintained crates break CI. This cannot be easily fixed because, ironically, that GitHub action is unmaintained. Even if this doesn’t fail CI, it can still <a href="https://web.archive.org/web/20250302001313/https://old.reddit.com/r/rust/comments/1bo5dle/we_lost_serdeyaml_whats_the_next_one/kwml1ti/">put unwanted pressure</a> on people.</p><p>Given that this feature is not unambiguously and universally great, I wouldn’t be too sad to let it go.</p><h3>The costs of maintenance</h3><p>Unlike some of my other Rust projects, cargo audit requires ongoing maintenance to keep up with the changes to Cargo and the ecosystem.</p><p>And that’s not mentioning the work and stress of maintaining <a href="https://rustsec.org/">a database of security advisories</a>. 
Every single one should be reviewed and published promptly, any day of the week, any day of the year, and contain information that is accurate and actionable.</p><p>Oh, and did I mention all of this has been done by only a small team of volunteers for years now?</p><p>cargo audit never had any funding (aside from that one time when Google paid me to bridge the advisory database to OSV. Thanks, Google!), and the volunteer maintainers are spread thin as it is. I maintain <a href="https://crates.io/crates/cargo-auditable">way</a> <a href="https://crates.io/crates/cargo-supply-chain">too</a> <a href="https://crates.io/crates/cargo-cyclonedx">many</a> Cargo plugins and <a href="https://github.com/image-rs/image-png/pull/512">make</a> <a href="https://github.com/image-rs/image-png/pull/539">other</a> <a href="https://github.com/image-rs/image-png/pull/503">contributions</a> <a href="https://github.com/image-rs/image-webp/pull/123">across</a> <a href="https://github.com/image-rs/image/pull/2299">the</a> <a href="https://github.com/rust-fuzz/cargo-fuzz/pull/376">ecosystem</a>. Tony maintains <a href="https://crates.io/users/tarcieri?sort=recent-downloads">so many crates</a> that just listing them takes over 10 pages. Alex started <a href="https://rust-for-linux.com/">Rust for Linux</a> and maintains Python’s <a href="https://github.com/pyca/cryptography">cryptography</a> package, among other things. And Alexis has multiple important projects of his own, and also seems to be the only one of us who actually has a life.</p><h3>Is it worth it?</h3><p>cargo audit is not as crucial and irreplaceable as it used to be. It still has features that are completely unique, but those are relatively straightforward to implement and require barely any maintenance.</p><p>The real maintenance hog is the boring parts, which are now implemented perfectly fine by multiple other tools. 
And sinking a lot of time and effort into maintaining yet another implementation that isn’t meaningfully better than others just so that we could add a thin layer of unique features on top doesn’t seem like a good use of anyone’s resources.</p><p>And it’s difficult to keep up with the times when everyone is spread so thin.</p><p>For example, right now the database of cargo audit is not as complete as that of some of the other tools. The <a href="https://github.com/advisories?query=ecosystem%3Arust">GitHub database</a> pulls data from us and exports to OSV, but we haven’t yet wired up importing data from the GitHub database. Alexis already did most of the work on that and I meant to get it over the finish line, but never did because most of my time goes into urgent maintenance.</p><p>Now that cargo audit is not vastly ahead of the industry anymore, I think it’s time to share the special sauce around, take a step back, and focus on things that <a href="https://github.com/rust-secure-code/cargo-auditable">remain genuinely novel</a> or that <a href="https://github.com/CycloneDX/cyclonedx-rust-cargo/">the industry is willing to fund</a>. So that’s exactly what I will be doing.</p><p>And if you disagree with my assessment, the remaining maintainers could definitely use some help. You can get in touch with them <a href="https://rust-lang.zulipchat.com/#narrow/channel/146229-wg-secure-code">on Zulip</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Do not run any Cargo commands on untrusted projects]]></title>
            <link>https://shnatsel.medium.com/do-not-run-any-cargo-commands-on-untrusted-projects-4c31c89a78d6?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/4c31c89a78d6</guid>
            <category><![CDATA[rust-programming-language]]></category>
            <category><![CDATA[rust]]></category>
            <category><![CDATA[security]]></category>
            <category><![CDATA[cargo]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Mon, 03 Mar 2025 12:57:47 GMT</pubDate>
            <atom:updated>2025-03-03T17:29:37.258Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>TL;DR:</strong> Treat <em>anything</em> starting with cargo as if it is cargo run.</p><h3>Background</h3><p><a href="https://doc.rust-lang.org/cargo/">Cargo</a> is the build system for the Rust programming language.</p><p>It is common knowledge that you should not invoke cargo run on random code from the internet, because that will run the code.</p><p>It is reasonably common knowledge that cargo build will also run arbitrary code. It is a build system, after all! They run code to build code.</p><p>I’m here to tell you that <em>any</em> command starting with cargo can run arbitrary code when operating on an untrusted repository, and should be treated the same as cargo run.</p><p>This shouldn’t be surprising to people coming from other languages. For example, anything starting with make also runs arbitrary code. Cargo just takes care of things by default so you hardly ever have to inject your own code - but the fact that you don’t <em>have</em> to doesn’t mean you <em>can’t.</em></p><h3>Attack vector #1</h3><p>Cargo reads its configuration from a <a href="https://doc.rust-lang.org/cargo/reference/config.html">number of files</a>, some of which can be defined right in the repository, e.g. in .cargo/config.toml.</p><p>Among other things, you can redefine the path to the Rust compiler:</p><pre>[build]<br>rustc = &quot;/usr/bin/rustc&quot; # path to the rust compiler</pre><p>So you can also put a malicious script in the repository and do this:</p><pre>[build]<br>rustc = &quot;.cargo/fake_malicious_rustc&quot;</pre><p>which will be called in more situations than you might think: not only on cargo build, but also cargo clean, cargo update, cargo metadata. In fact, I’m not aware of a single Cargo subcommand that doesn’t call rustc.</p><p>And rustc is not the only problematic field. 
There’s also rustc-wrapper, rustc-workspace-wrapper, the linker configuration, and the entire [env] section that can be used to override environment variables for processes that Cargo launches, so you could override $PATH and get arbitrary code execution that way.</p><p>“But wait,” you might think, “I’m using this cool third-party plugin for Cargo, and it’s not reading the Cargo configuration. Surely running <em>that</em> is fine?”</p><h3>Attack vector #2</h3><p>There are many Cargo plugins meant to help with supply chain security. There’s <a href="https://crates.io/crates/cargo-audit">cargo audit</a>, <a href="https://crates.io/crates/cargo-cyclonedx">cargo cyclonedx</a>, <a href="https://crates.io/crates/cargo-vet">cargo vet</a>, <a href="https://crates.io/crates/cargo-crev">cargo crev</a>, and so on. They aren’t even part of Cargo, and they shouldn’t read the config files.</p><p>This doesn’t mean that they are safe to run. When you run cargo audit, what gets called is Cargo, which then invokes the actual cargo-audit.</p><p>That is, unless you put this in .cargo/config.toml in the repository:</p><pre>[alias]<br>audit = &quot;run&quot;</pre><p>which makes it so that Cargo turns the cargo audit you typed into a cargo run behind the scenes, and it’s arbitrary code execution all over again.</p><h3>Disclosure</h3><p>I reached out to the Rust Security Response WG and a member of the Cargo team. The consensus was that this behavior is not an entirely new discovery and <strong>is not going to change in the near future,</strong> so it should just be better documented.</p><p>There is also a desire to eventually remove attack vector #2, but it will require a deprecation period since it’s a breaking change for people who override Cargo plugins for a legitimate reason, and there is no clear timeline for it as yet.</p><p>I also looked up which Cargo plugins had bug bounty programs. 
I reported attack vector #1 to Mozilla who maintain <a href="https://github.com/mozilla/cargo-vet">cargo vet</a>, but they <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1951018">concluded</a> it’s not really a problem for their intended use cases.</p><p>There’s also a bug bounty for <a href="https://crates.io/crates/cargo-cyclonedx">cargo cyclonedx</a>, but for obvious reasons it excludes maintainers, and I am a maintainer of <a href="https://crates.io/crates/cargo-cyclonedx">cargo cyclonedx</a> (along with <a href="https://crates.io/crates/cargo-audit">cargo audit</a>, which is why I’ve been using it as an example above. I maintain <a href="https://crates.io/crates/cargo-auditable">way</a> <a href="https://crates.io/crates/cargo-supply-chain">too</a> <a href="https://crates.io/crates/cargo-loc">many</a> Cargo plugins). So no bug bounty money for me.</p><h3>So is this a problem?</h3><p>This behavior is kind of by design, and isn’t really different from how build systems work in other programming languages.</p><p>It’s the existence of Cargo plugins that makes the situation murky. And whether it’s a problem or not should be decided on a case-by-case basis.</p><p>It is definitely a problem for <a href="https://crates.io/crates/cargo-crev"><strong>cargo crev</strong></a>, which explicitly treats the code it’s running on as untrusted and lets the user review it before running. The code under review being able to misrepresent itself is a problem. They will have to become simply crev or rust-crev or something else that doesn’t involve getting called through cargo.</p><p>It’s not really a problem for <a href="https://github.com/rust-lang/cargo/pull/13709"><strong>cargo cyclonedx</strong></a> because in the long term it should <a href="https://github.com/rust-lang/cargo/pull/13709">become a wrapper</a> for cargo build anyway. 
Although we’ll probably have to shutter the bug bounty program: if we assume the code you’re running it on is trusted, then there aren’t any attack vectors at all.</p><p>And this puts <a href="https://crates.io/crates/cargo-audit"><strong>cargo audit</strong></a> in a really awkward spot. Most people just run it on their own projects, which is entirely safe. On the other hand, similar tools like <a href="https://github.com/google/osv-scanner/">osv-scanner</a> or <a href="https://github.com/aquasecurity/trivy">trivy</a> do not execute arbitrary code from the repository, so why should cargo audit?</p><p>I even wrote <a href="https://crates.io/crates/binfarce">a custom binary format parser</a> to make its binary scanning 100% memory safe and robust, so it feels weird to get an arbitrary code execution vulnerability from a far simpler source.</p><p>I don’t really know what the right answer is for cargo audit.</p><h3>Workarounds</h3><p>The <strong>good news</strong> is that attack vector #2 can be avoided right now by calling a Cargo subcommand directly. For example, instead of cargo crev you would type cargo-crev crev (yes, ‘crev’ appears twice) and that fixes it.</p><p>It’s still a terrible default and completely untenable except as a quick and dirty patch, but it’s something.</p><p>The <strong>bad news</strong> is that there is no robust workaround for attack vector #1, and it affects most third-party subcommands too because they still call Cargo behind the scenes.</p><p>You can <a href="https://doc.rust-lang.org/cargo/reference/config.html#environment-variables">override individual configuration keys</a> in a way that takes precedence over config.toml files, but this will never be airtight. There are too many individual configuration keys that could cause problems (looking at you, [env] section), and more of them can be added over time.</p><p>You could try deleting all the configuration files in a given project. 
But it’s not safe to put them back once you’re done: if two instances of such a tool run in parallel, one instance can delete the files and then restore them before the other instance has finished, and you’re back to arbitrary code execution. You would also need to ensure that only you have permission to write to these files, otherwise it’s a local privilege escalation vulnerability. This approach is too brittle to be practical.</p><p>So the best advice I can give is in the title of the article: <strong>Do not run <em>any</em> Cargo commands on untrusted projects.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4c31c89a78d6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to avoid bounds checks in Rust (without unsafe!)]]></title>
            <link>https://shnatsel.medium.com/how-to-avoid-bounds-checks-in-rust-without-unsafe-f65e618b4c1e?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/f65e618b4c1e</guid>
            <category><![CDATA[rust]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[optimization]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Tue, 17 Jan 2023 14:29:38 GMT</pubDate>
            <atom:updated>2025-05-20T05:50:50.083Z</atom:updated>
<content:encoded><![CDATA[<p>You can often hear online that indexing into a slice, such as my_slice[i], is slow in Rust, and that you should do something else instead for performance.</p><p>The details, however, are murky. There’s little in the way of benchmarks, and hardly any documentation on removing this overhead without resorting to unsafe code.</p><p>So after optimizing a bunch of high-profile crates by removing bounds checks (and also removing unsafe code from some of them), I figured I should write down the results and the techniques I discovered.</p><p><strong>In this article I’m going to cover:</strong></p><ol><li>What is the typical runtime cost of bounds checks on indexing</li><li>How to avoid bounds checks without unsafe code</li><li>How to verify that bounds checks have been eliminated</li><li>How to benchmark and profile Rust code</li><li>When a bounds check is absolutely necessary, how to construct the cheapest possible bounds check</li></ol><h3>What are bounds checks?</h3><p>Consider the following code:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6768b245de1fa7ea4d7ae766ff092218/href">https://medium.com/media/6768b245de1fa7ea4d7ae766ff092218/href</a></iframe><p>Accessing the 7th element of the array is incorrect, because the array is only 4 elements long, so there is no 7th element.</p><p>In C this would lead to accessing some random memory outside the intended range, which is called a <strong>buffer overflow</strong>.</p><p>The notorious <a href="https://en.wikipedia.org/wiki/Heartbleed">Heartbleed</a> is a buffer overflow, but most vulnerabilities of this type don’t get flashy names because there are <a href="https://www.cvedetails.com/vulnerability-list/cweid-787/vulnerabilities.html">just so many of them</a>. 
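</p><p>For concreteness, here is a minimal sketch of the kind of failing access described above (the embedded snippet is the authoritative version; the function name here is made up for illustration):</p>

```rust
// The slice has 4 elements, so any index above 3 is out of bounds.
// This compiles fine, but fails the bounds check at runtime.
fn get_seventh(data: &[u64]) -> u64 {
    data[7]
}

fn main() {
    let data = [1u64, 2, 3, 4];
    let result = std::panic::catch_unwind(|| get_seventh(&data));
    // The access panicked instead of reading out-of-bounds memory:
    assert!(result.is_err());
    println!("out-of-bounds access panicked as expected");
}
```

<p>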
Yet writing outside the intended range is a very common way for an attacker to execute their code on your machine and do literally anything they want — steal credit card info, mine cryptocurrency, spy on you, etc. This is why buffer overflows are considered <a href="https://cwe.mitre.org/top25/archive/2022/2022_cwe_top25.html">the most dangerous software vulnerability</a>.</p><p>To avoid that, Rust inserts so-called <strong>bounds checks</strong> on every access to ensure that a buffer overflow never happens — something like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/def2081c05a86a225231110846d2ad90/href">https://medium.com/media/def2081c05a86a225231110846d2ad90/href</a></iframe><p>If you try accessing an invalid index, a Rust program will <a href="https://doc.rust-lang.org/stable/book/ch09-01-unrecoverable-errors-with-panic.html">panic</a> instead of creating a security vulnerability.</p><p>While this is great for security, it has a small cost in runtime performance, because now there is more code for the CPU to execute.</p><h3>Do bounds checks actually slow you down?</h3><p>The real-world performance impact of bounds checks is surprisingly low.</p><p>The greatest impact I’ve ever seen on real-world code from removing bounds checks alone was <strong>15%</strong>, but the typical gains are in the <strong>1% to 3%</strong> range, and even that only happens in code that does a lot of number crunching.</p><p>You can occasionally see greater impact (as we’ll see soon!) if removing bounds checks allows the compiler to perform other optimizations.</p><p>Still, performance of code that’s not doing large amounts of number crunching will probably <a href="https://blog.readyset.io/bounds-checks/">not be impacted by bounds checks</a> at all.</p><h3>Try it yourself!</h3><p>While I will be posting the results I got, there’s nothing quite like trying things for yourself. 
So I’ve prepared a repository with all the code and will be providing all the necessary commands so you can follow along.</p><p>If you have already <a href="https://rustup.rs/">installed Rust</a>, run this to get the code and all the tools:</p><pre>cargo install cargo-show-asm hyperfine<br>git clone https://github.com/Shnatsel/bounds-check-cookbook<br>cd bounds-check-cookbook<br>cargo build --release</pre><h3>Let’s see some bounds checks</h3><p>To have a simple example to experiment with, let’s write a function that calculates the <a href="https://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a> and writes them to a <a href="https://doc.rust-lang.org/stable/std/vec/struct.Vec.html">Vec</a>:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/35213c54ff786e01dd4b94f4caa52545/href">https://medium.com/media/35213c54ff786e01dd4b94f4caa52545/href</a></iframe><p>The compiler is really good at removing any code that’s not being called, and at precomputing everything it can in advance. I had to add a main with a lot of tricks in it to make sure it doesn’t happen, and so that we get to see the bounds checks at runtime.</p><p>Let’s look at the assembly and see what the bounds checks look like:</p><pre>cargo asm --rust --bin fibvec_naive_indexing fibonacci_vec</pre><p>This will print the optimized assembly of the fibonacci_vec function, i.e. the instructions the CPU will actually execute, along with the Rust code that produced them.</p><blockquote>You can do this even if you know nothing about assembly! 
Just eyeballing the amount of assembly produced and looking for function names is sufficient.</blockquote><p>Let’s look at the hot inner loop first, the fib[i] = fib[i-1] + fib[i-2]; part, simply by searching for it in the output of cargo asm:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4bf457f67e13dd6e921b8efd515a91a1/href">https://medium.com/media/4bf457f67e13dd6e921b8efd515a91a1/href</a></iframe><p>That’s it? That’s just two instructions!</p><figure><img alt="That can’t be it! Where’s the rest of it?" src="https://cdn-images-1.medium.com/max/1024/1*HEad6jq7-WU6092NGHukng.jpeg" /></figure><p>And indeed, if we scroll down a bit, we’ll see more code attributed to this line - it’s not all in one place:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0431830cf7afa55b133baf66070923d6/href">https://medium.com/media/0431830cf7afa55b133baf66070923d6/href</a></iframe><p>What happened here is the compiler <strong>outlining</strong> the code path that’s taken when the bounds check fails. That code path leads to a panic, and panics are rare. So the compiler shuffled the code in such a way that we <strong>don’t even load</strong> the code leading up to a panic until we actually need it. Clever!</p><p>Anyway, back to the assembly! core::panicking::panic_bounds_check appears to be the panic on bounds check failure, happening in assembly attributed to our line of code. <strong>So this is what they look like!</strong></p><p>Let’s see if the if length &gt; 1 { fib[1] = 1; } bit outside the hot loop also has a bounds check on it…</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ced5d303487125638afc40e56528bbfc/href">https://medium.com/media/ced5d303487125638afc40e56528bbfc/href</a></iframe><p>No bounds checks here! 
The compiler was smart enough to realize that when length is strictly greater than 1, it’s impossible for the bounds check to fail. Our Vec called fib also has length strictly greater than 1, and so fib[1] is always in bounds.</p><p>However, it didn’t seem to realize that the same holds for the loop, specifically the fib[i] = fib[i-1] + fib[i-2]; line.</p><p>Perhaps we can help it?</p><h3>Help the optimizer</h3><p>We’re going to make two changes to the code to make it easier for the optimizer to prove that the bounds checks never fail:</p><ol><li>Instead of indexing up to length, which is just some integer, we’ll index up to fib.len(), to make it obvious that the index is always in bounds.</li><li>Instead of using a Vec, we’ll make a slice of it once and index into the slice. This makes it more clear that the length doesn’t change.</li></ol><p>This gets us the following code:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6d7f8005e9819baa121f2c8c6f5e07b0/href">https://medium.com/media/6d7f8005e9819baa121f2c8c6f5e07b0/href</a></iframe><p>And let’s verify it with cargo asm — the command is in the code above:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c5de518211e9eac12c3aa42a91c5348d/href">https://medium.com/media/c5de518211e9eac12c3aa42a91c5348d/href</a></iframe><p>It’s again split in two parts for some reason, but the<strong> bounds check is gone!</strong></p><p><strong>But is it any faster?</strong> Let’s find out!</p><pre>$ hyperfine &#39;target/release/fibvec_naive_indexing 1000000000&#39; &#39;target/release/fibvec_clever_indexing 1000000000&#39;<br><br>Benchmark 1: target/release/fibvec_naive_indexing 1000000000<br>  Time (mean ± σ):      3.612 s ±  0.040 s    [User: 1.435 s, System: 2.132 s]<br>  Range (min … max):    3.546 s …  3.693 s    10 runs<br> <br>Benchmark 2: target/release/fibvec_clever_indexing 1000000000<br>  Time (mean ± σ):      
3.133 s ±  0.019 s    [User: 0.995 s, System: 2.103 s]<br>  Range (min … max):    3.106 s …  3.163 s    10 runs<br> <br>Summary<br>  &#39;target/release/fibvec_clever_indexing 1000000000&#39; ran<br>    1.15 ± 0.01 times faster than &#39;target/release/fibvec_naive_indexing 1000000000&#39;</pre><p><em>If you’re on Windows, you may have to add .exe to those paths.</em></p><p>It is faster, by a whopping 15%! That much is often no cause for celebration, but that’s the greatest boost from eliminating bounds checks that I’ve ever seen, so that’s just about <strong>the best we could have hoped for!</strong></p><p>And while this example was somewhat contrived, I used these techniques to <a href="https://github.com/fschutt/fastblur/pull/3">speed up the fastblur crate by 15%</a>. (Although I’ve shaved off 6x as much execution time <a href="https://github.com/fschutt/fastblur/pull/2">through other means</a> first).</p><blockquote><strong>Update:</strong> the fastblur example was also helped by indexing into a slice instead of a &amp;mut Vec, more info <a href="https://github.com/rust-lang/rust-clippy/issues/10269">here</a>. 
So the gain from removing bounds checks is actually less than 15%.</blockquote><h4>Aside: Compiler Optimizations</h4><p>Now let’s also try this on a 64-bit ARM CPU, just to confirm…</p><pre>$ hyperfine &#39;target/release/fibvec_naive_indexing 1000000000&#39; &#39;target/release/fibvec_clever_indexing 1000000000&#39; <br>Benchmark 1: target/release/fibvec_naive_indexing 1000000000<br>  Time (mean ± σ):      3.320 s ±  0.024 s    [User: 1.131 s, System: 2.179 s]<br>  Range (min … max):    3.263 s …  3.346 s    10 runs<br> <br>Benchmark 2: target/release/fibvec_clever_indexing 1000000000<br>  Time (mean ± σ):      3.226 s ±  0.019 s    [User: 1.092 s, System: 2.127 s]<br>  Range (min … max):    3.209 s …  3.263 s    10 runs<br> <br>Summary<br>  &#39;target/release/fibvec_clever_indexing 1000000000&#39; ran<br>    1.03 ± 0.01 times faster than &#39;target/release/fibvec_naive_indexing 1000000000&#39;</pre><p>Aaand it’s back in the expected 3% range. No huge 15% uplift here.</p><p>But the assembly on ARM is really short:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ddda537209cebe68825cca6c063d24af/href">https://medium.com/media/ddda537209cebe68825cca6c063d24af/href</a></iframe><p>No bounds checks in sight! And that’s just 3 instructions, which means very little work to do, so it should be very fast!</p><p><strong>What’s going on?</strong></p><p>Recall that removal of bounds checks by themselves doesn’t matter much. You can only see a big uplift if removing bounds checks allowed the compiler to perform <strong>other optimizations.</strong></p><p>If you go back and squint at the x86 assembly of fibonacci_vec without bounds checks, it’s almost the same lines repeated over and over, which looks suspiciously like <a href="https://en.wikipedia.org/wiki/Loop_unrolling">loop unrolling</a>.</p><p><strong>Why is it performed on x86 and not on ARM?</strong> I have no idea! 
It should be — this is a basic optimization that should not be related to the CPU in any way.</p><p>For comparison I tried this on a <a href="https://en.wikipedia.org/wiki/POWER9">POWER9</a> CPU, and the compiler seems to <a href="https://gist.github.com/Shnatsel/862cbed11b3d88fcea31db1633433b05">unroll the loop even more</a> and hyperfine reports a massive speedup of 1.78 ± 0.04 times, so I’m just going to <a href="https://github.com/rust-lang/rust/issues/105857">file a bug for rustc</a> and let people who know how compilers work deal with it.</p><p><strong>The important takeaway for us is this:</strong> optimizing compilers are solving an <a href="https://en.wikipedia.org/wiki/NP-hardness">NP-hard</a> problem in a really short time, and there are always some cases they don’t handle exactly right.</p><p>Worse, the exact details <strong>change between versions.</strong> Even if the compiler improves on average, it may regress your specific code! <a href="https://en.wikipedia.org/wiki/Automatic_vectorization">Automatic vectorization</a> for example is notoriously fickle, which is frustrating because it can get you much better performance when it works.</p><p>I’ve found the optimizations that remove bounds checks to be <strong>very reliable</strong> — once you get them working, they tend to keep working. So you can use these techniques and generally expect them not to break in the future. But that’s only the part responsible for the 3% uplift!</p><p>Since the loop unrolling responsible for the 15% uplift works on x86 but not on ARM, I wouldn’t bet on it working reliably in the future. 
Such is the sad reality of having something solve an NP-hard problem in a very short time.</p><p>Fortunately, in real programs that don’t spend all of the execution time in a single hot loop the differences are nowhere near this pronounced — regressions in one place are counterbalanced by improvements in another.</p><h4>Aside: Benchmarking</h4><p>So you may be wondering, why am I using hyperfine and going through all this trouble of writing a non-trivial main()?</p><p>Why don’t I just use cargo bench or <a href="https://github.com/bheisler/criterion.rs">criterion</a> or something else specifically designed for benchmarking?</p><p>That once again has to do with the compiler’s tendency to precompute everything it can, and eliminate all code that doesn’t result in any changes to the output.</p><blockquote>If the return value of a function is never used, and the function doesn’t panic, the compiler will simply remove it!</blockquote><p>This is great for production code and terrible for benchmarks.</p><p>You can try to combat this by wrapping inputs and outputs in <a href="https://doc.rust-lang.org/stable/std/hint/fn.black_box.html">std::hint::black_box()</a>, but <a href="https://gendignoux.com/blog/2022/01/31/rust-benchmarks.html">it’s difficult to wrap all the things correctly, and it’s not clear which optimizations it inhibits, exactly</a>.</p><p>I am sidestepping all this by making a real binary that reads the inputs, and <strong>the inputs are supplied only when the program is actually run,</strong> so there’s no way for the compiler to precompute anything. 
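</p><p>Concretely, a harness in the spirit described here might look like this (my sketch, not the repository’s exact code; the wrapping_add is my assumption, to keep large lengths from overflowing and panicking in debug builds):</p>

```rust
use std::env;

// The naive version from earlier: index up to `length`,
// so the fib[...] accesses in the loop carry bounds checks.
fn fibonacci_vec(length: usize) -> Vec<u64> {
    let mut fib = vec![0u64; length];
    if length > 1 {
        fib[1] = 1;
    }
    for i in 2..length {
        fib[i] = fib[i - 1].wrapping_add(fib[i - 2]);
    }
    fib
}

fn main() {
    // The length arrives at runtime, so the compiler
    // cannot precompute the result ahead of time...
    let length: usize = env::args()
        .nth(1)
        .and_then(|arg| arg.parse().ok())
        .unwrap_or(10); // default for convenience when run without arguments
    let fib = fibonacci_vec(length);
    // ...and printing part of the result keeps the whole computation
    // from being removed as dead code.
    println!("{:?}", fib.last());
}
```

<p>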
It also <strong>prints the result,</strong> so the compiler cannot remove the fibonacci_vec function as dead code.</p><p>And having standalone binaries also makes inspecting the assembly and profiling easier, as you will see shortly!</p><p>On the flip side, I have to crank the Vec lengths way up to get easily measurable execution times, and this may not be representative of how these functions perform on small Vecs due to <a href="http://igoro.com/archive/gallery-of-processor-cache-effects/">CPU cache effects</a>.</p><p>Now, back to experimenting with bounds checks…</p><h3>Just Use Iterators</h3><p>Experienced Rust users have probably been screaming at me the entire time they’ve been reading this article - up to this point, anyway!</p><p>Rust provides convenient <a href="https://doc.rust-lang.org/stable/book/ch13-02-iterators.html">iterators</a> that let you work with collections without worrying about bounds checks, off-by-one errors and the like.</p><p>They also let you express what you want to accomplish more clearly — as long as you can remember the names of all the iterators or always use an IDE that shows documentation, which is a pretty big catch if you ask me!</p><p>So let’s go ahead and rewrite our code using iterators. We need sliding windows over our Vec, and there’s a handy iterator for that called <a href="https://doc.rust-lang.org/stable/std/primitive.slice.html#method.windows">windows</a>, so let’s give it a shot:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b94efbc4155f11f447b047948785e9c9/href">https://medium.com/media/b94efbc4155f11f447b047948785e9c9/href</a></iframe><p>Uh oh! That doesn’t compile! The issue is that <a href="https://doc.rust-lang.org/stable/std/primitive.slice.html#method.windows">windows()</a> only gives us read-only slices; we cannot write through them. 
For most other iterators there is a corresponding _mut() version that gives mutable slices, but there is no windows_mut()! What gives?</p><p>Turns out windows_mut() cannot be implemented as a regular iterator in Rust, because if you were to .collect() it into, say, a Vec, you would end up with many <em>overlapping </em>and <em>mutable</em> slices — but Rust requires every part of memory to be mutable from only one place at a time!</p><p>What we’re looking for is a so-called <em>streaming iterator</em> that cannot be .collect()ed, but there doesn’t seem to be such a thing in the standard library yet. So we’ll have to change our code a bit:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4433ce48a3b4fd23c0aa9a42072912ff/href">https://medium.com/media/4433ce48a3b4fd23c0aa9a42072912ff/href</a></iframe><p>This works, and you can use cargo asm to confirm that there are no bounds checks going on here. On Rust 1.65 it benchmarks slightly faster than our earlier attempt on my machine, by about 2%. 
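</p><p>The reshaped iterator version might look something like this (my approximation of the repository’s fibvec_iterator; the real code may differ in details, and the wrapping_add is my assumption):</p>

```rust
fn fibonacci_vec_iter(length: usize) -> Vec<u64> {
    let mut fib = vec![0u64; length];
    let mut prev = 0u64;
    let mut curr = 1u64;
    // iter_mut() hands out each slot exactly once, so no indexing
    // is performed and no bounds checks are needed at all.
    for slot in fib.iter_mut().skip(1) {
        *slot = curr;
        let next = prev.wrapping_add(curr);
        prev = curr;
        curr = next;
    }
    fib
}
```

<p>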
On 1.66 it’s another big boost:</p><pre>$ hyperfine &#39;target/release/fibvec_naive_indexing 1000000000&#39; &#39;target/release/fibvec_clever_indexing 1000000000&#39; &#39;target/release/fibvec_iterator 1000000000&#39;<br>Benchmark 1: target/release/fibvec_naive_indexing 1000000000<br>  Time (mean ± σ):      3.530 s ±  0.056 s    [User: 1.452 s, System: 2.027 s]<br>  Range (min … max):    3.450 s …  3.616 s    10 runs<br> <br>Benchmark 2: target/release/fibvec_clever_indexing 1000000000<br>  Time (mean ± σ):      3.111 s ±  0.058 s    [User: 1.037 s, System: 2.038 s]<br>  Range (min … max):    3.039 s …  3.207 s    10 runs<br> <br>Benchmark 3: target/release/fibvec_iterator 1000000000<br>  Time (mean ± σ):      2.763 s ±  0.057 s    [User: 0.666 s, System: 2.078 s]<br>  Range (min … max):    2.686 s …  2.847 s    10 runs<br> <br>Summary<br>  &#39;target/release/fibvec_iterator 1000000000&#39; ran<br>    1.13 ± 0.03 times faster than &#39;target/release/fibvec_clever_indexing 1000000000&#39;<br>    1.28 ± 0.03 times faster than &#39;target/release/fibvec_naive_indexing 1000000000&#39;</pre><p>And it does provide a nice uplift on ARM as well:</p><pre>$ hyperfine &#39;target/release/fibvec_naive_indexing 1000000000&#39; &#39;target/release/fibvec_clever_indexing 1000000000&#39; &#39;target/release/fibvec_iterator 1000000000&#39;<br>Benchmark 1: target/release/fibvec_naive_indexing 1000000000<br>  Time (mean ± σ):      3.324 s ±  0.024 s    [User: 1.160 s, System: 2.154 s]<br>  Range (min … max):    3.285 s …  3.354 s    10 runs<br> <br>Benchmark 2: target/release/fibvec_clever_indexing 1000000000<br>  Time (mean ± σ):      3.257 s ±  0.022 s    [User: 1.112 s, System: 2.136 s]<br>  Range (min … max):    3.232 s …  3.297 s    10 runs<br> <br>Benchmark 3: target/release/fibvec_iterator 1000000000<br>  Time (mean ± σ):      2.968 s ±  0.025 s    [User: 0.782 s, System: 2.175 s]<br>  Range (min … max):    2.929 s …  3.011 s    10 runs<br> <br>Summary<br>  
&#39;target/release/fibvec_iterator 1000000000&#39; ran<br>    1.10 ± 0.01 times faster than &#39;target/release/fibvec_clever_indexing 1000000000&#39;<br>    1.12 ± 0.01 times faster than &#39;target/release/fibvec_naive_indexing 1000000000&#39;</pre><p>Fortunately this isn’t some sort of iterator secret sauce — a program with <a href="https://github.com/Shnatsel/bounds-check-cookbook/blob/main/src/bin/fibvec_clever_indexing_alt.rs">the same structure but using indexing with optimizer hints</a> is just as fast.</p><p>But notice that we had to <strong>significantly change</strong> how we implement the computation! Iterators are very handy if you are writing code from scratch, and <strong>you totally should use them</strong> — but they can be a pain to retrofit into existing code. And some patterns cannot be expressed with iterators at all!</p><blockquote><strong>Update:</strong> after the release of this article the standard library documentation has been <a href="https://github.com/rust-lang/rust/pull/106889/files">updated</a> with instructions for emulating windows_mut(). An example calculating Fibonacci numbers can be found <a href="https://old.reddit.com/r/rust/comments/10edmjf/how_to_avoid_bounds_checks_in_rust_without_unsafe/j4u91of/?context=3">here</a>.</blockquote><p>And finally, I’ve used a for loop with an iterator here, but in this case the compiler <strong>can miss some “obvious” optimizations</strong>. If you instead use .for_each() on the iterator, the compiler <a href="https://github.com/rust-lang/rust/issues/101814#issuecomment-1247184222">should optimize the code better</a>. This is especially relevant if you have a chain of iterator adapters, something like .skip().filter().take() or even longer.</p><p>So if you find yourself writing long iterator chains, it might be worth benchmarking them against an index-based implementation with optimizer hints, like the one I’ve described earlier. 
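</p><p>To make the long-adapter-chain case concrete, here is a contrived sketch (the function name and the particular chain are mine, not from the article’s repository):</p>

```rust
// A chain of adapters driven by for_each (internal iteration),
// which the compiler can sometimes optimize better than an
// external `for` loop over the same chain.
fn sum_some_evens(data: &[u64]) -> u64 {
    let mut total = 0u64;
    data.iter()
        .skip(1)                    // ignore the first element
        .filter(|&&x| x % 2 == 0)   // keep even values only
        .take(10)                   // at most ten of them
        .for_each(|&x| total += x);
    total
}
```

<p>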
Or like the following…</p><h3>Put an assert in front of the hot loop</h3><p>Let’s make use of this function we’ve written.</p><p>Now that we have a Vec full of Fibonacci numbers, we can write a function that checks if another Vec also has Fibonacci numbers simply by comparing the two. If our Vec is small enough to fit into the <a href="https://en.wikipedia.org/wiki/CPU_cache">CPU cache</a>, this could be faster than doing the math over and over!</p><p>A naive implementation could look like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b16277c161cf59251d5a5e567b569fe4/href">https://medium.com/media/b16277c161cf59251d5a5e567b569fe4/href</a></iframe><p>Let’s check the assembly:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/dd65f78d2b53904d41e812829f5afed4/href">https://medium.com/media/dd65f78d2b53904d41e812829f5afed4/href</a></iframe><p>Oh no — the bounds checks are back! The is_fibonacci() function has them in the hot loop!</p><p>We <em>have</em> to check bounds here, because we don’t know the lengths of either of these slices in advance. It’s required for correctness! 
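</p><p>A naive comparison along the lines described above might be (a sketch; the embedded snippet is the authoritative version, and this one assumes the reference slice is at least as long as the candidate):</p>

```rust
#[inline(never)]
fn is_fibonacci(candidate: &[u64], reference: &[u64]) -> bool {
    for i in 0..candidate.len() {
        // `candidate[i]` is provably in bounds, but the compiler
        // knows nothing about `reference`'s length here, so the
        // `reference[i]` access keeps a bounds check in the hot loop.
        if candidate[i] != reference[i] {
            return false;
        }
    }
    true
}
```

<p>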
But what we <em>can</em> do is perform the bounds check<strong> only once </strong>outside the loop, instead of for every element, which will make the cost negligible.</p><p>Let’s make sure the sizes are the same <strong>before we enter the loop</strong>, and use the trick of iterating only up to .len() from earlier:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2d69bc4df8ddbf4acc30fbe2fba41334/href">https://medium.com/media/2d69bc4df8ddbf4acc30fbe2fba41334/href</a></iframe><p>Et voila, no more bounds checks inside the hot loop:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d55f3c69320ef3b0c0444cf8794b2a7c/href">https://medium.com/media/d55f3c69320ef3b0c0444cf8794b2a7c/href</a></iframe><p>That’s it, that’s all the assembly attributed to the indexing line now!</p><p>This can <a href="https://github.com/Shnatsel/bounds-check-cookbook/blob/main/src/bin/comparison_iterator.rs">also be achieved with an iterator</a>, but realistically you’ll just use the <a href="https://github.com/Shnatsel/bounds-check-cookbook/blob/main/src/bin/comparison_realistic.rs">== operator to compare the slices</a>. The code <em>is</em> contrived - showing off the optimization technique is what’s important here.</p><p>I’ve <a href="https://github.com/image-rs/jpeg-decoder/pull/167">sped up the jpeg-decoder crate</a> using this approach, so check that out if you want to see applications to real-world code. 
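</p><p>The hoisted-check version described above might be sketched as follows (my reconstruction; the repository’s comparison_clever is the authoritative version):</p>

```rust
fn is_fibonacci(candidate: &[u64], reference: &[u64]) -> bool {
    // One up-front length check that the optimizer can see.
    if candidate.len() != reference.len() {
        return false;
    }
    // Re-slicing to the common length is an explicit hint that makes
    // the per-element checks in the loop below provably redundant.
    let reference = &reference[..candidate.len()];
    for i in 0..candidate.len() {
        if candidate[i] != reference[i] {
            return false;
        }
    }
    true
}
```

<p>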
In jpeg-decoder I used assert!s instead of slicing, but the principle is the same.</p><p>The great thing about this approach is that you <strong>cannot have too many</strong> assert!s —<strong> the redundant ones will be removed</strong> by the compiler, just like the bounds checks!</p><h3>Inlining propagates constraints across functions</h3><p>So you’ve used your shiny new is_fibonacci function for a while, and decided to split comparing elements into its own function for reuse:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/60fa936bec800bdda3ddbecde3a76175/href">https://medium.com/media/60fa936bec800bdda3ddbecde3a76175/href</a></iframe><p>And now the bounds checks are back! The elements_are_equal function is a separate entity and cannot make any assumptions about the way it is called (at least not when it has #[inline(never)] on it).</p><p><a href="https://matklad.github.io/2021/07/09/inline-in-rust.html">Inlining</a> is when the compiler copies the contents of a function to the place where it’s being called, instead of just putting a function call there. (Inlining is <a href="https://matklad.github.io/2021/07/09/inline-in-rust.html">its own rabbit hole</a> that goes pretty deep, just like the <a href="http://igoro.com/archive/gallery-of-processor-cache-effects/">CPU cache</a>.)</p><p>We use #[inline(never)] on functions whose assembly we want to view, so that they do not get inlined and become part of another function in the generated code. 
(While it is theoretically possible to attribute the code to inlined functions, cargo asm doesn’t do that yet.)</p><p>Instead we’re going to use #[inline(always)] for elements_are_equal() to make sure its contents are copied into is_fibonacci() and the two are optimized together — getting us the benefits of keeping them separate in the source code while acting as a single entity for the optimizer, so that knowledge of index constraints is propagated across functions.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/55f2b08e8c9a1d493f595c11e3f73f6e/href">https://medium.com/media/55f2b08e8c9a1d493f595c11e3f73f6e/href</a></iframe><p>We’ve swapped #[inline(never)] for #[inline(always)] and the bounds checks should be gone! Let’s verify:</p><pre>$ cargo asm --rust --bin comparison_split_inline elements_are_equal</pre><pre>Error: No matching functions, try relaxing your search request</pre><p>Right, we can’t view the assembly of elements_are_equal() because it no longer exists in the assembly as a separate function.</p><p>But we can still check the assembly of is_fibonacci and verify that it worked! 
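</p><p>For reference, the split-out shape being discussed might look like this (a sketch using the names from the text; the repository’s comparison_split_inline is the authoritative version):</p>

```rust
// Inlined into every caller, so the optimizer sees the caller's
// length checks and can drop the bounds checks on a[i] and b[i].
#[inline(always)]
fn elements_are_equal(a: &[u64], b: &[u64], i: usize) -> bool {
    a[i] == b[i]
}

fn is_fibonacci(candidate: &[u64], reference: &[u64]) -> bool {
    // The up-front length check from the previous section.
    if candidate.len() != reference.len() {
        return false;
    }
    for i in 0..candidate.len() {
        if !elements_are_equal(candidate, reference, i) {
            return false;
        }
    }
    true
}
```

<p>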
The bounds checks are gone again!</p><p>Out in the real world I’ve <a href="https://github.com/rust-random/rand/pull/960">sped up the </a><a href="https://github.com/rust-random/rand/pull/960">rand crate by 7%</a> with a few assert!s and an #[inline(always)] — the same techniques as shown here.</p><p>Let’s see how much of a difference this optimization actually made here:</p><pre>$ hyperfine --warmup 3 --min-runs 20 &#39;target/release/comparison_realistic 100000000 100000000&#39; &#39;target/release/comparison_naive 100000000 100000000&#39; &#39;target/release/comparison_clever 100000000 100000000&#39; &#39;target/release/comparison_iterator 100000000 100000000&#39;<br>Benchmark 1: target/release/comparison_realistic 100000000 100000000<br>  Time (mean ± σ):     729.8 ms ±  13.5 ms    [User: 193.8 ms, System: 532.5 ms]<br>  Range (min … max):   711.9 ms … 748.4 ms    20 runs<br> <br>Benchmark 2: target/release/comparison_naive 100000000 100000000<br>  Time (mean ± σ):     739.8 ms ±  12.8 ms    [User: 206.5 ms, System: 529.1 ms]<br>  Range (min … max):   725.9 ms … 761.7 ms    20 runs<br> <br>Benchmark 3: target/release/comparison_clever 100000000 100000000<br>  Time (mean ± σ):     736.2 ms ±  13.0 ms    [User: 210.0 ms, System: 521.7 ms]<br>  Range (min … max):   719.1 ms … 761.6 ms    20 runs<br> <br>Benchmark 4: target/release/comparison_iterator 100000000 100000000<br>  Time (mean ± σ):     734.6 ms ±  10.9 ms    [User: 201.8 ms, System: 528.3 ms]<br>  Range (min … max):   724.3 ms … 760.7 ms    20 runs<br> <br>Summary<br>  &#39;target/release/comparison_realistic 100000000 100000000&#39; ran<br>    1.01 ± 0.02 times faster than &#39;target/release/comparison_iterator 100000000 100000000&#39;<br>    1.01 ± 0.03 times faster than &#39;target/release/comparison_clever 100000000 100000000&#39;<br>    1.01 ± 0.03 times faster than &#39;target/release/comparison_naive 100000000 100000000&#39;</pre><p>Hm. 
Hardly any difference: this is below 1% and might as well be noise. I had to add warmup and crank up the number of runs to get past the noise.</p><p>Okay, we probably should have answered an important question about this function before we started optimizing it:</p><blockquote>Does this function even account for a large enough portion of the execution time to be worth optimizing?</blockquote><h4>Aside: Profiling</h4><p>If we make a function twice as fast, but it only accounted for 2% of the execution time of the program, we’ve only sped up the program by 1%!</p><p>We can use a <strong>profiler</strong> to find where time is spent in the program, and so which function would be a good target for optimization.</p><p>Profiling languages that compile to native code, such as Rust, is remarkably poorly documented. There’s a multitude of tools in various states of disrepair, most of which only work on a single OS, so the landscape can be difficult to navigate and is filled with gotchas. So here’s a guide for doing this with modern tools!</p><p>As a teaser, here’s the result we’re going to get:</p><figure><a href="https://share.firefox.dev/3G3nCiJ"><img alt="Profiler UI" src="https://cdn-images-1.medium.com/max/992/1*dMmeVx3mK8QWoZ834-6gag.png" /></a></figure><p>Yes, I know it’s not very readable. <strong>Click it!</strong></p><p>See? It’s a whole interactive UI for viewing profiles, right in the browser! What you’re looking at is <a href="https://profiler.firefox.com/">Firefox Profiler</a> — which actually works in <em>any</em> browser, and it’s one of the best profiler UIs I’ve ever used.</p><blockquote>The killer feature is sharing the results <strong>in two clicks.</strong></blockquote><p>I cannot overstate how awesome the sharing feature is, and how much easier it makes communicating performance results. 
If you include a profile in a bug report about performance, it saves <em>so much time</em> for both you and whoever is going to end up working on fixing it!</p><h4>Interpreting profiles</h4><p>What we’re looking at is a <a href="https://www.datadoghq.com/knowledge-center/distributed-tracing/flame-graph/"><strong>flame graph</strong></a>. Each horizontal bar represents a function, and it’s as wide as the function’s share of execution time.</p><p>The bars are stacked on top of each other; the bar on top is called by the function represented by a bar directly below it.</p><p>The yellow bars are in userspace, and the orange bars are in the kernel.</p><p>For example, I can tell that main calls fibonacci_vec which in turn calls into the kernel, which does something with “pages” — that must be referring to <a href="https://en.wikipedia.org/wiki/Page_(computer_memory)">memory pages</a>. We’re creating a Vec there, so this must be memory allocation.</p><p>So apparently 51% of the time of this program is spent allocating memory, and another 9% is spent deallocating it! Actually, hyperfine was already telling us earlier that we spend a lot of time in the kernel — it reports user and sys times, with user being our program and sys being the kernel.</p><p>Note that the order of the bars in a flame graph is meaningless; it only shows the <strong>aggregate</strong> time spent in a given function. If you want to see how the program execution actually went instead of viewing aggregates, switch to the “Stack Chart” tab:</p><figure><a href="https://profiler.firefox.com/public/j4ses9f8ca5qb1774ph4fztcpc0kkrm55te2s7g/stack-chart/?globalTrackOrder=0&amp;thread=0&amp;timelineType=category&amp;v=8"><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*8ruJjXaTMoYs3KznM3CvPg.png" /></a></figure><p>This shows that the calls into the kernel from fibonacci_vec are spread evenly across the execution time.
Apparently the kernel gradually provisions memory instead of serving us a big allocation up front when we request it. The deallocation at the end, however, happens in a single chunk.</p><p>Modern operating systems provision memory only when it’s actually being written to, which is what we’re seeing here. That’s also why you can try to allocate 100 TB of RAM and the initial allocation call will succeed — but your process will get killed if you try to actually write to <em>all</em> of that.</p><h4>Capturing execution profiles</h4><p>Here’s how to create one of those beautiful graphs for your own code.</p><p>First off, the profiler needs debug symbols to attribute chunks of machine code to functions. To enable debug symbols, add this to your Cargo.toml:</p><pre>[profile.release]<br>debug = true</pre><p>If you don’t do this, the compiler will generate very limited debug symbols, which will in turn provide very little visibility — although you’ll still get <em>some</em> info even if you haven’t done this.</p><p>The other steps are unfortunately platform-specific:</p><h4>Profiling on macOS &amp; Linux</h4><p>There’s a convenient profiler that shows results in <a href="https://profiler.firefox.com/">Firefox Profiler</a>:</p><pre>cargo install samply<br>samply record target/release/comparison_naive 100000000 100000000</pre><p>This will record the profile and open the web browser with the results.</p><h4><strong>Seeing into the OS kernel (on Linux)</strong></h4><p>There is one thing Samply cannot do yet, and that’s reporting on where the time is spent inside the kernel. This is only needed if Samply is showing wide orange bars on the flame graph, and you want to understand what they stand for.</p><p>In this case use perf script to visualize the profile.
It is slower and less accurate, but lets you <strong>see into the kernel</strong> as well:</p><pre>sudo perf record --call-graph dwarf target/release/comparison_naive 100000000 100000000<br>sudo perf script -F +pid &gt; profile.perf</pre><p>Now go to <a href="https://profiler.firefox.com">profiler.firefox.com</a> and upload the resulting profile.perf file, and you should see the same profiling UI showing the results.</p><p>The profiles I showed earlier were captured using this method.</p><h4>Profiling on Windows</h4><p>Here’s the list of good free profiling tools available on Windows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1/1*jjG4tHxhjtc-WzEBHR3gNw.png" /></figure><p><em>(This section was intentionally left blank by Microsoft).</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1/1*jjG4tHxhjtc-WzEBHR3gNw.png" /></figure><p><a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html">Intel VTune</a> and <a href="https://www.amd.com/en/developer/uprof.html">AMD uProf</a> are free of charge, but are not particularly great — the UIs are clunky, and it may be difficult to get them to work in the first place (e.g.
you may have to change some settings in the BIOS).</p><p>Fortunately, you can just use the Linux instructions in <a href="https://learn.microsoft.com/en-us/windows/wsl/install">WSL2</a> or any other VM with Linux.</p><blockquote><strong>Update:</strong> after valiant reverse-engineering of barely-documented Windows APIs, samply now also works on Windows!</blockquote><h4>The results</h4><p>So the question we wanted to answer was:</p><blockquote>Does is_fibonacci() even account for a large enough portion of the execution time to be worth optimizing?</blockquote><p>If you <a href="https://share.firefox.dev/3G3nCiJ">open my profile</a>, you can see in the “Flame Graph” or “Call Tree” views that the program spent 13% of the time in is_fibonacci, and if you subtract all the kernel time from fibonacci_vec it accounts for 23% of the time.</p><p>Since the execution times of these two functions are roughly comparable, it seems that eliminating bounds checks in is_fibonacci has indeed sped up this function very little.</p><p>To reiterate, seeing <strong>only a slight boost</strong> from eliminating bounds checks is <strong>the typical outcome</strong>. It’s the 15% improvement that’s the anomaly!</p><p>With that out of the way, let’s look at one final technique for dealing with bounds checks.</p><h3>What if I know literally nothing about the index?</h3><p>So far we’ve relied on some knowledge we had about the constraints of our index, or had a loop where we could check bounds once before it instead of doing it on every iteration. But what if…</p><p>What if you know <em>absolutely nothing</em> about the index?</p><p>Let’s say the index comes from untrusted user input, and you can assume absolutely nothing about it. You <em>have</em> to perform a bounds check. There is no other way. But the function is rather hot… what if you could speed it up? 
It might not gain you much, but then again it might, and it’s just one bounds check and how inescapable can it really be?!</p><p>You end up thinking about your conundrum at night. This bounds check haunts your dreams. Taunting you. You wake up in a cold sweat, and resolve to do something about it. You try panic = abort in Cargo.toml, and maybe it helps or maybe not, it could be just noise and benchmarks are a lie and oh god why did you decide to go into programming?!</p><p>Fear not, for I shall deliver you from this predicament.</p><p>We’re going to create. <strong>The.</strong> <strong>Cheapest. Possible. Bounds. Check.</strong></p><p>Observe.</p><h3>The. Cheapest. Possible. Bounds. Check.</h3><p>So you have your lookup table for Fibonacci numbers, and a function nth_fibonacci() that simply performs a lookup by index.</p><p>Easy peasy, except the function <em>must</em> perform a bounds check because its inputs are completely unpredictable:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b7197348a5f2435676e600d85e10348d/href">https://medium.com/media/b7197348a5f2435676e600d85e10348d/href</a></iframe><p>Look at the assembly! Just, ugh. A single line, creating so much of it:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d1ae52435c583280b2721ba8c5f81230/href">https://medium.com/media/d1ae52435c583280b2721ba8c5f81230/href</a></iframe><p>Since we know nothing about the index, we <em>have</em> to perform a bounds check. But we can do it <strong>on our own terms.</strong></p><p>The branching instruction created by ifs or bounds checks can be quite expensive, as far as CPU instructions go; we’ve seen its impact on benchmarks already.
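Since the embedded snippet may not render in a feed reader, here is a hypothetical sketch of the lookup function described above (names and table size assumed from the text):

```rust
// A 100-entry Fibonacci lookup table, built at startup.
// Note: Fibonacci numbers past index 93 overflow u64, so `wrapping_add`
// is used and those later slots are placeholders in this illustration.
const TABLE_LEN: usize = 100;

fn fibonacci_table() -> [u64; TABLE_LEN] {
    let mut t = [0u64; TABLE_LEN];
    t[1] = 1;
    for i in 2..TABLE_LEN {
        t[i] = t[i - 1].wrapping_add(t[i - 2]);
    }
    t
}

// The index is untrusted, so the compiler keeps the bounds check:
// a compare plus a conditional jump to the panic path on every call.
fn nth_fibonacci(table: &[u64; TABLE_LEN], i: usize) -> u64 {
    table[i]
}
```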
What if there is <strong>a cheaper alternative?</strong></p><p>Let’s recall why we need bounds checks in the first place: we want to make sure all lookups are confined to the Vec or slice we’re working with - because if they aren’t, we get <a href="https://en.wikipedia.org/wiki/Heartbleed">Heartbleed</a> or worse. But a panic on an invalid access that usual bounds checks create is not <em>strictly</em> mandatory; as long as all accesses are confined to the slice we’re working with, we’re good.</p><p>So technically we could use the <a href="https://en.wikipedia.org/wiki/Modulo_operation">modulo</a> operator, written as %, to confine all accesses to the slice, like this:</p><pre>fibonacci[i % fibonacci.len()]</pre><p>This won’t return an error on an invalid access like a regular bounds check would; it will just silently return an incorrect result… but is it cheaper?</p><p>Unfortunately the % operator is also an expensive instruction, but there is one <strong>very special case</strong> we can take advantage of.
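As a sketch of where the modulo idea ends up (hypothetical code with illustrative names; the article’s embedded snippets show the real version):

```rust
// Pad the 100-entry table to 128 (a power of two) with dummy zeroes,
// then confine every access with modulo. Because the length is a
// constant power of two, the compiler turns `i % 128` into `i & 127`:
// no branch and no panic; invalid indices just return a dummy value.
const PADDED_LEN: usize = 128;

fn padded_table() -> [u64; PADDED_LEN] {
    let mut t = [0u64; PADDED_LEN];
    t[1] = 1;
    for i in 2..100 {
        // wrapping_add: entries past index 93 overflow u64 and
        // serve only as placeholders here
        t[i] = t[i - 1].wrapping_add(t[i - 2]);
    }
    t // entries 100..128 remain 0: dummy values
}

fn nth_fibonacci_cheap(table: &[u64; PADDED_LEN], i: usize) -> u64 {
    table[i % PADDED_LEN]
}
```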
If the divisor <strong>is a constant</strong> and is known to be <strong>a power of two,</strong> any compiler worth its bytes will optimize it into a <a href="https://en.wikipedia.org/wiki/Bitwise_operation#AND">bitwise AND</a>, and bitwise operations are very cheap.</p><p>Our lookup table only holds 100 numbers, which is not a power of two, but we can <strong>extend it to the nearest power of two</strong> with dummy values to make it all work out — we’ve already accepted wrong results on invalid accesses, so we might as well go all the way!</p><p>And so our code becomes…</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/893d7c9ef7e08d21e4f990e042be492b/href">https://medium.com/media/893d7c9ef7e08d21e4f990e042be492b/href</a></iframe><p>Let’s check the assembly…</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8f81be3ffd260ab5ac4dae96cb2f7356/href">https://medium.com/media/8f81be3ffd260ab5ac4dae96cb2f7356/href</a></iframe><p><strong><em>Yes!</em></strong></p><p>We’ve done it!</p><p><strong>This is it! The cheapest possible bounds check!</strong></p><p>We have inflated the memory usage slightly and no longer report errors on invalid accesses, but we’ve achieved the goal of speeding up our function, <strong>even when we could assume nothing about the index.</strong></p><p>Whether it’s worth the trade-offs… that’s for you and your benchmarks to decide.</p><h3>Anti-patterns</h3><p>Finally I want to take a look at some patterns I keep seeing people write over and over, even though they hurt them instead of helping them.</p><h4>debug_assert! instead of assert!</h4><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5714d4e818f2bb292a18dc65c2aaba68/href">https://medium.com/media/5714d4e818f2bb292a18dc65c2aaba68/href</a></iframe><p>debug_assert! is removed in release mode. 
Ironically, this makes the code <strong>slower</strong>: the compiler now cannot assume anything about the length of the input, and has to insert all three bounds checks in release mode!</p><p><strong>Do this instead:</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a402cc2b72f13e51aa8ebc5f043ff144/href">https://medium.com/media/a402cc2b72f13e51aa8ebc5f043ff144/href</a></iframe><p>As a rule of thumb, assert!s checking lengths help performance instead of hindering it. They are also very difficult to overdo — any redundant ones will be removed by the optimizer, just like bounds checks!</p><p>If in doubt, use assert! instead of debug_assert!.</p><p>Or, if you want to be super-duper extra sure you’re not doing extra work:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/defc83d20c37f979bd168b1f0e3dd885/href">https://medium.com/media/defc83d20c37f979bd168b1f0e3dd885/href</a></iframe><p>This also only performs one bounds check instead of three.</p><h4>Unsafe indexing by a constant value</h4><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0d027e6d376ef57bd4dd29530cb78e47/href">https://medium.com/media/0d027e6d376ef57bd4dd29530cb78e47/href</a></iframe><p>This unsafe <strong>doesn’t</strong> make your code any faster. In fact, in debug mode this is <strong>slower</strong> than the safe version because get_unchecked creates function call overhead, and optimizations that would remove it are disabled.</p><p><strong>Do this instead:</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9d24fdda8e5e53fc489b958588c4ece9/href">https://medium.com/media/9d24fdda8e5e53fc489b958588c4ece9/href</a></iframe><p>The compiler <strong>always</strong> optimizes indexing by a constant into a slice of known length in release mode.
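To make the assert! rule of thumb above concrete, here is a hypothetical sketch of both safe patterns (function names are mine, since the embedded snippets may not render in a feed reader):

```rust
// Pattern 1: a single assert! up front. The optimizer can use the
// asserted length to elide the three per-index bounds checks below.
fn sum_first_three(v: &[u64]) -> u64 {
    assert!(v.len() >= 3);
    v[0] + v[1] + v[2]
}

// Pattern 2: reslice first. Exactly one bounds check happens at
// `&v[..3]`; constant indices into the known-length slice need none.
fn sum_first_three_sliced(v: &[u64]) -> u64 {
    let v = &v[..3];
    v[0] + v[1] + v[2]
}
```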
And in debug mode this is even faster than unsafe.</p><h3>Parting thoughts</h3><p>Hopefully this menagerie of techniques for dealing with bounds checks will serve you well, and you will never have to resort to get_unchecked and risk creating code execution vulnerabilities in your code.</p><p>If you would like to practice these techniques, you can <a href="https://github.com/search?l=Rust&amp;q=get_unchecked&amp;type=Code">search Github for unsafe indexing</a> and see if you can convert it into safe code without regressing performance.</p><p>Happy hacking!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f65e618b4c1e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The simpler alternative to GCC-RS]]></title>
            <link>https://shnatsel.medium.com/the-simpler-alternative-to-gcc-rs-90da2b3685d3?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/90da2b3685d3</guid>
            <category><![CDATA[compilers]]></category>
            <category><![CDATA[rustlang]]></category>
            <category><![CDATA[gcc]]></category>
            <category><![CDATA[rust]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Sun, 30 May 2021 14:07:13 GMT</pubDate>
            <atom:updated>2021-05-30T17:07:33.326Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
            <content:encoded><![CDATA[<p>The <a href="https://github.com/Rust-GCC/gccrs"><strong>GCC-RS</strong></a> project, which can be summed up as “Rewrite the Rust compiler in C++”, got a bit of media attention lately. In this post I’ll try to convince you that <strong>all the stated benefits</strong> of it can be achieved <strong>without necessitating a rewrite </strong>by leveraging <a href="https://github.com/antoyo/rustc_codegen_gcc"><strong>rustc_codegen_gcc</strong></a> instead.</p><p><em>All of the opinions expressed in this article are my own. They do not represent the opinions of any organizations I may be part of.</em></p><p><em>I am not directly affiliated with any of the projects discussed here.</em></p><h3>Background: LLVM vs GCC</h3><p><a href="https://github.com/rust-lang/rust">The official Rust compiler</a> currently uses <a href="https://en.wikipedia.org/wiki/LLVM">LLVM</a> for <a href="https://en.wikipedia.org/wiki/Code_generation_(compiler)">code generation</a>. LLVM is an open-source compiler and code generation library competing with <a href="https://en.wikipedia.org/wiki/GNU_Compiler_Collection">GCC</a>, the other major open-source compiler stack.</p><p>As a code generator, GCC has several advantages over LLVM:</p><ol><li>GCC can produce code that runs 10% or so faster <a href="https://www.phoronix.com/scan.php?page=article&amp;item=gcc10-clang10-x86&amp;num=1">on some x86 hardware</a> (but <a href="https://www.phoronix.com/scan.php?page=article&amp;item=gcc11-clang12-epyc7763&amp;num=1"><em>not</em> <em>all</em> x86 hardware</a>), at least when compiling C and C++</li><li>GCC supports more CPU architectures. 
LLVM already supports all desktop or server-grade CPUs manufactured in the last 15 years, but GCC also supports some hobbyist retrocomputing architectures, such as <a href="https://en.wikipedia.org/wiki/PA-RISC">HP PA</a>.</li></ol><p>So it would make sense to allow using GCC as the code generator when compiling Rust programs.</p><h3>Why GCC-RS?</h3><p>GCC-RS intends not only to use GCC for code generation, but also reimplement the <em>entire rest of the Rust compiler</em> from scratch, in C++.</p><p>For reference, the official Rust compiler is written in Rust. Rewriting Rust code in C++ seems a bit backwards. So why are they doing this?</p><p><a href="https://github.com/Rust-GCC/gccrs/wiki/Frequently-Asked-Questions#benefits">The FAQ for the project</a> lists the following benefits of GCC-RS:</p><h4>Support for more CPU architectures</h4><p>That’s true! But the official Rust compiler is not nailed down to LLVM. In fact, it supports <a href="https://doc.rust-lang.org/nightly/unstable-book/compiler-flags/codegen-backend.html">pluggable code generation backends</a>. (If you’re not sure what code generation is or what other parts are there in the Rust compiler, <a href="https://jason-williams.co.uk/a-possible-new-backend-for-rust">read this</a>).</p><p>And <a href="https://github.com/antoyo/rustc_codegen_gcc">rustc_codegen_gcc</a>, you guessed it, <strong>simply plugs GCC into the existing Rust compiler</strong> as a code generation backend. This allows compiling code for all the architectures supported by GCC, <strong>without rewriting the entire rest of the compiler from scratch.</strong></p><h4><strong>Cross-language LTO</strong></h4><p>In order to use <a href="https://en.wikipedia.org/wiki/Interprocedural_optimization">link-time optimization</a> (LTO) across C and Rust, you need to use the same code generation stack in both C and Rust. 
Aside from producing smaller binaries and slightly faster code, LTO is also a prerequisite for <a href="https://clang.llvm.org/docs/ControlFlowIntegrity.html">CFI</a>, a new exploit mitigation technique.</p><p>However, this <strong>would also work perfectly fine with </strong><a href="https://github.com/antoyo/rustc_codegen_gcc"><strong>rustc_codegen_gcc</strong></a>.</p><p>And besides, cross-language LTO is <a href="https://blog.llvm.org/2019/09/closing-gap-cross-language-lto-between.html">already possible</a> with the LLVM backend, provided you’re using the LLVM-based Clang compiler for C code. Firefox now <a href="https://twitter.com/eroc/status/1152351944649744384">uses it in production on all platforms</a>.</p><p>The <a href="https://github.com/Rust-GCC/gccrs/wiki/Frequently-Asked-Questions#benefits">GCC-RS FAQ</a> lists Linux as the motivating example. Ironically, Linux <a href="https://www.phoronix.com/scan.php?page=news_item&amp;px=Linux-5.12-Clang-LTO-Merged">supports LTO with LLVM but <em>not</em> GCC</a>!</p><h4><strong>GCC Plugins</strong></h4><blockquote>Existing GCC plugins such as those in the Linux Kernel project should be reusable since they target GIMPLE in the middle-end.</blockquote><p>Even ignoring how weird and niche this use case is, <a href="https://github.com/antoyo/rustc_codegen_gcc">rustc_codegen_gcc</a> also emits GIMPLE and would work just as well.</p><h4>Bootstrapping</h4><p>The Rust compiler is written in Rust. That presents a problem for CPU architectures that don’t have a Rust compiler built for them yet.
It’s a chicken-and-egg problem, and resolving it is called “bootstrapping”.</p><p>To bootstrap a C compiler, typically you’d write a super simple C compiler in assembly, which you use to compile a slightly more advanced C compiler written in C, which you then use to compile an early version of GCC, use that to compile a slightly newer GCC, and so on until you catch up to the latest version.</p><p>If you need C++ (and the latest GCC is written in C++, so you do need it), you do the same trick and use a simple C++ compiler written in C to get the chain going. Same for any other language, really.</p><p>The Rust bootstrap chain is quite long because you need to get from C to OCaml and then compile pre-release Rust to compile Rust 1.0 to compile Rust 1.1 to compile Rust 1.2 and so on until you catch up to 1.53 (or whatever the latest version is when you’re reading this). So if you can have a Rust compiler written in C++ that compiles 1.53 directly, you can save yourself some time.</p><p><strong>So GCC-RS could help with this, right?</strong></p><p>Not really. In reality <strong>you only need to walk the entire chain on <em>one</em> architecture. </strong>Then you can use your fully-functional Rust compiler on e.g. x86 to build a compiler for ARM, HP PA or whatever else you might need. This is called <a href="https://en.wikipedia.org/wiki/Cross_compiler">cross-compilation</a>, and is fully supported by Rust.</p><p>And shortening the chain on one architecture <strong>is a solved problem.</strong></p><p>You see, you don’t need the full-blown compiler with all the validation and error messages and whatnot, you just need it to compile things correctly. That’s what <a href="https://github.com/thepowersgang/mrustc"><strong>mrustc</strong></a> is: a minimal Rust compiler written in C++ designed for bootstrapping and nothing else.
It lets you bootstrap from C++ on x86 and cross-compile to any architecture from there.</p><h3>Other considerations</h3><p>As you can see, <strong>every single benefit</strong> <a href="https://github.com/Rust-GCC/gccrs/wiki/Frequently-Asked-Questions#benefits">that GCC-RS lists</a> can be provided by <a href="https://github.com/antoyo/rustc_codegen_gcc">rustc_codegen_gcc</a> and <a href="https://github.com/thepowersgang/mrustc">mrustc</a>, <strong>without rewriting the compiler from scratch </strong>and at dramatically lower development and maintenance costs<strong>.</strong></p><p>But what if they forgot to include some crucial benefit in the FAQ? Here are some other points people bring up in relation to GCC-RS:</p><p><strong><em>Isn’t having multiple implementations a good thing?</em></strong></p><p>Well, maybe? It <a href="https://www.reddit.com/r/rust/comments/njckp1/rustc_codegen_gcc_can_now_run_libcores_tests_and/gz9nzgk/">didn’t work out for C/C++</a>, but perhaps we can learn from that and do better. Still, the benefits of this are rather nebulous and I’m not convinced that they justify the costs.</p><p><strong><em>Wouldn’t having an alternative implementation help specify the language?</em></strong></p><p>Yes, that’s what <a href="https://github.com/rust-lang/miri/">miri</a> is for. You feed it some Rust code and it tells you if it’s valid and whether your unsafe code triggers any undefined behavior or not.</p><p><strong><em>Isn’t Rust vulnerable to the </em></strong><a href="https://wiki.c2.com/?TheKenThompsonHack"><strong><em>Ken Thompson hack</em></strong></a><strong><em>?</em></strong></p><p>No, it is not. 
The “trusting trust” problem is already solved by <a href="https://github.com/thepowersgang/mrustc">mrustc</a>.</p><p><a href="https://github.com/Rust-GCC/gccrs/wiki/Frequently-Asked-Questions#why-was-adding-gcc-backend-for-rustc----rejected"><strong><em>libgccjit.so is annoying to package for Linux!</em></strong></a></p><p>Speaking as a former Debian maintainer — yes, it is mildly annoying, but it has to be done anyway, so GCC-RS doesn’t help here.</p><p><strong><em>rustc_codegen_gcc relies on </em></strong><a href="https://blog.rust-lang.org/2016/04/19/MIR.html"><strong><em>MIR</em></strong></a><strong><em> which is unstable!</em></strong></p><p>Keeping up with the changes to <a href="https://blog.rust-lang.org/2016/04/19/MIR.html">MIR</a> is much easier than keeping up with the changes to the entire language. And that’s <em>ignoring</em> the enormous up-front investment that a full compiler rewrite would entail.</p><p><a href="https://github.com/Rust-GCC/gccrs/wiki/Frequently-Asked-Questions#why-not-write-the-front-end-in-rust-using-the-existing-compiler"><strong><em>Supporting multiple GCC versions in rustc_codegen_gcc will be difficult!</em></strong></a></p><p>Yeah, so just don’t do it! Every release of <a href="https://github.com/Rust-GCC/gccrs">GCC-RS</a> targets a single specific GCC version to avoid this issue. <a href="https://github.com/antoyo/rustc_codegen_gcc">rustc_codegen_gcc</a> could trivially do the same.</p><p><strong><em>Doesn’t GCC-RS reuse the borrow checker written in Rust?</em></strong></p><p>Not the production-ready borrow checker, but <a href="https://github.com/rust-lang/polonius">the experimental one.</a></p><p>But yes, it does! They’ve reused 5,000 lines of Rust. Only 465,000 lines to go!</p><p><strong><em>I hear GCC-RS has funding!</em></strong></p><p>Yes, <a href="https://opensrcsec.com/open_source_security_announces_rust_gcc_funding">one full-time developer and a part-time project manager</a> for one year. 
For rewriting the entire Rust compiler from scratch, that’s underwhelming.</p><p>The company providing the funding <a href="https://opensrcsec.com/open_source_security_announces_rust_gcc_funding">mentioned</a> that they’ve failed to get anyone else interested in funding GCC-RS. Coincidence? I think not!</p><h3>Conclusion</h3><p>I believe the rewrite of Rust compiler in C++ that the <a href="https://github.com/Rust-GCC/gccrs">GCC-RS</a> project is attempting is <strong>completely unjustified</strong>. The gargantuan effort required to make it a reality would be better spent elsewhere.</p><p>These projects will provide <a href="https://github.com/Rust-GCC/gccrs/wiki/Frequently-Asked-Questions#benefits"><strong>all the listed benefits</strong></a> at a dramatically lower cost:</p><ul><li><a href="https://github.com/antoyo/rustc_codegen_gcc">rustc_codegen_gcc</a> for portability to obscure platforms</li><li><a href="https://github.com/rust-lang/miri/">miri</a> and <a href="https://ferrous-systems.com/blog/sealed-rust-the-pitch/">ferrocene</a> for specifying behavior</li><li><a href="https://github.com/thepowersgang/mrustc">mrustc</a> for cutting down on bootstrapping time</li></ul><p>Ultimately, <a href="https://github.com/Rust-GCC/gccrs">GCC-RS</a> might provide some value that I’m not seeing. But if you care about portability to obscure platforms, language specification or bootstrapping time, <strong>I encourage you to support one of these projects </strong>rather than GCC-RS. 
They should provide a far greater return on investment.</p><p>I’m putting my money where my mouth is and will be supporting <a href="https://github.com/antoyo/rustc_codegen_gcc">rustc_codegen_gcc</a> on GitHub Sponsors starting in June.</p><p><a href="https://www.reddit.com/r/rust/comments/noby1t/t"><em>Discuss this article on Reddit</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=90da2b3685d3" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Smoke-testing Rust HTTP clients]]></title>
            <link>https://shnatsel.medium.com/smoke-testing-rust-http-clients-b8f2ee5db4e6?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/b8f2ee5db4e6</guid>
            <category><![CDATA[software-testing]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[rust]]></category>
            <category><![CDATA[reliability]]></category>
            <category><![CDATA[https]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Thu, 16 Jan 2020 19:42:35 GMT</pubDate>
            <atom:updated>2020-06-08T15:34:29.927Z</atom:updated>
            <content:encoded><![CDATA[<p>Back in 2014 I was fetching frontpages of the top million websites to scan them for a particular vulnerability. Not only have I found 99,9% websites to be vulnerable to a trivial attack, I’ve also found that curl command was randomly crashing with a segmentation fault, indicating a likely vulnerability in <a href="https://curl.haxx.se/libcurl/">libcurl</a> — the HTTP client library that the whole world seems to depend on.</p><p>By that time I was already disillusioned in the security of software written in C and the willingness of maintainers to fix it, so I never followed up on the bug. However, this year I decided to repeat the test with software written in a language that’s less broken by design: Rust.</p><p>Here’s how 7 different HTTP clients fared.</p><p><strong><em>Update:</em></strong><em> this article was published in January 2020 and no longer represents the current state of affairs.</em></p><h3>Baseline</h3><p>Before we start talking about specific software, let’s define what we’re comparing it against. I’m going to hold all software to the standard of being actually dependable and maybe even secure — which is an incredibly high bar that the vast majority of software currently in use fails to meet. 
To wit:</p><p>The <a href="https://en.wikipedia.org/wiki/Linux_kernel">Linux kernel</a> gets <a href="https://events19.linuxfoundation.org/wp-content/uploads/2017/11/Syzbot-and-the-Tale-of-Thousand-Kernel-Bugs-Dmitry-Vyukov-Google.pdf">thousands</a> of potentially exploitable memory safety bugs per year that are <a href="https://events19.linuxfoundation.org/wp-content/uploads/2017/11/Syzbot-and-the-Tale-of-Thousand-Kernel-Bugs-Dmitry-Vyukov-Google.pdf">largely ignored</a> — or at best silently fixed without any kind of security announcement, so the fixes don’t get into Linux distributions and are then found <a href="https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html">powering exploits in the wild</a>.</p><p>libcurl is fairly benign by comparison, with only <a href="https://bugs.chromium.org/p/oss-fuzz/issues/list?sort=proj&amp;colspec=ID%20Type%20Component%20Status%20Proj%20Reported%20Owner%20Summary&amp;num=100&amp;q=-status%3AWontFix%2CDuplicate%20-Infra%20proj%3Acurl&amp;can=1">9 publicly reported security bugs per year</a> <a href="https://www.cvedetails.com/product/25084/Haxx-Curl.html?vendor_id=12682">(no matter how you count</a>). Which is, you know, a new exploit every couple of months or so. But that’s just the vulnerabilities that were properly disclosed and widely publicized; dozens more are silently fixed every year, so all you need to do to find an exploit is look through the commit log. Don’t believe me? <a href="https://github.com/curl/curl/commit/68ffe6c17d6e44b459d60805813f646d244a186b">Here is a probably-exploitable bug</a> that is still unpatched in the latest release of libcurl. You’re welcome.</p><p>And in case you’re wondering, this trick works for <em>every</em> open-source C library.
Although if you want the exploit to work for the next 10 years, <a href="https://seclists.org/fulldisclosure/2013/Nov/83">look at the bug tracker instead</a>.</p><p>Oh, and all of that was just for libcurl <em>itself.</em> Underneath it there has to be a <a href="https://en.wikipedia.org/wiki/Transport_Layer_Security">TLS</a> implementation, which is usually OpenSSL.</p><p><a href="https://en.wikipedia.org/wiki/OpenSSL">OpenSSL</a> is infamous for its <a href="http://heartbleed.com/">Heartbleed</a>, but had <a href="https://www.cvedetails.com/product/383/Openssl-Openssl.html?vendor_id=217">numerous other bugs</a> and keeps getting more. The codebase has a custom reimplementation of most standard library functions, which makes it <a href="https://en.wikipedia.org/wiki/LibreSSL#Memory-related">intractable</a> to security analysis or exploit mitigation techniques. The quality of the code and documentation is said to be such that if there were a state agency program to sabotage publicly available cryptography, <a href="https://www.youtube.com/watch?v=fwcl17Q0bpk">OpenSSL would be its crown jewel</a>. There are even <a href="https://en.wikipedia.org/wiki/LibreSSL">multiple</a> <a href="https://boringssl.googlesource.com/boringssl/">forks</a> trying to fix OpenSSL that have failed to gain wide adoption.</p><p>I could go on like that about almost any widely used piece of C code. We live in a world where <a href="https://medium.com/message/everything-is-broken-81e5f33a24e1">all software is broken</a>. And while I wish maintainers of some projects would be more diligent, it would not fix the underlying problems: that the complexity of practical codebases exceeds the human ability to reason about it, and that <em>humans inevitably make mistakes.</em> When you’re writing software in a memory-unsafe language such as C, any trivial mistake can lead to a security vulnerability. This makes writing secure software in C about as easy as performing an appendectomy on yourself.
<a href="https://en.wikipedia.org/wiki/Leonid_Rogozov">This one guy did it once in Antarctica</a>, why can’t you?</p><p>Even if you don’t care about security, every one of those is a really tricky reliability issue too — even the non-exploitable ones! Those are the kinds of issues that get you paged in the middle of the night to deal with them in production and that then completely fail to reproduce in a test environment. The best kind.</p><blockquote>“Hopelessly broken” is the baseline I will be comparing software against.</blockquote><p>Yet there is still hope: C is no longer the only language you can write performant and reusable software in. <a href="https://www.rust-lang.org/">Rust</a> is a new kid on the block that does everything that C can, and exactly as fast, but makes computers perform all the safety checks instead of requiring humans to think about them. The vulnerabilities that plague C codebases are <em>impossible</em> in Rust!</p><p>…unless you explicitly opt in to unsafe operations for some parts of the code, at which point you’re effectively back to good old C in those places. But at least you can do that rarely and only when you <em>really</em> need to. Right?</p><h3>Methodology</h3><p>The test I ran was the simplest possible workload imaginable that curl has already failed: read the URL from the command line, fetch it and exit. No content parsing, no async I/O, no connection reuse, no authentication, nothing. Just the simplest possible thing in a real-world setting.</p><p>The default TLS backend was used for every library. 
The binary using each library (with code looking roughly like <a href="https://gist.github.com/deltaphc/2949ed292c7d1169e744e5ffa7fd0687">this</a>) was built with <a href="https://github.com/japaric/rust-san">Address Sanitizer</a> so that we’d notice a memory error if one happens instead of hoping the OS would suspect something is wrong and kill the process.</p><p>The list of the top million websites is <a href="https://blog.majestic.com/development/majestic-million-csv-daily/">from Majestic-12</a>. I’ve used 50 concurrent connections, which is fairly conservative — you can have way more than that on a mid-range desktop, let alone a server, but it should prevent us from being mistaken for a DDoS attack. One such run takes about 8 hours with the HTTP connection timeout set to 5 seconds.</p><p>I’ve used Google Cloud to run this (which conveniently provides $300 free credit), but it should also work from any regular server or public VPN. I do not recommend doing this from a plain old consumer ISP without protection — they tend to frown upon tons and tons of HTTPS connections.</p><p>I also briefly looked through the dependencies of each client to get an idea of the amount of unsafe code it relies on and what kind of failure modes I can expect.</p><h3>reqwest</h3><p><a href="https://crates.io/crates/reqwest">reqwest</a> is the premier Rust HTTP client. Its number of downloads on <a href="http://crates.io">crates.io</a> leaves everything else in the dust. It rides on top of a pure-Rust HTTP stack (with the notable exception of relying on OpenSSL by default, although <a href="https://github.com/ctz/rustls">rustls</a> is an option) and it just had a new major release that migrated it to futures-0.3 to support async/await.</p><p>First things first: <strong>it didn’t segfault!</strong></p><p>I am actually impressed because I had really low expectations going into this. 
<a href="https://github.com/anderejd/cargo-geiger">cargo-geiger</a> output on reqwest’s dependency tree does not instill confidence, and it relies on an HTTP stack that contains <a href="https://github.com/seanmonstar/httparse/issues/58">copies of standard library functions but with safety checks disabled</a> and where something labeled as “A set of types for representing HTTP requests and responses” also contains a <a href="https://github.com/hyperium/http/blob/master/src/header/map.rs">bespoke HashMap implementation</a> with 1500 LoC, 32 unsafe blocks and its own DoS protection because “it’s faster”, without ever mentioning any of that in the README. Plus that code seems to predate the migration of std::HashMap to a faster implementation, so it’s not clear if all of that custom code is even useful anymore; it could be harmful for all I can tell, since I couldn’t find any benchmarks comparing it against the standard library.</p><blockquote>I can’t help but wonder how many more bespoke unsafe primitives lurk in that HTTP stack.</blockquote><p>Writing bespoke unsafe code is a bad idea for the same reason that writing anything important in C is a bad idea: all human-written code has bugs, but bugs in unsafe code tend to be exploitable vulnerabilities. The aforementioned HashMap code was written by humans, so it is no exception:</p><ol><li><a href="https://github.com/hyperium/http/issues/354">https://github.com/hyperium/http/issues/354</a></li><li><a href="https://github.com/hyperium/http/issues/355">https://github.com/hyperium/http/issues/355</a></li></ol><p>The maintainers have not filed <a href="https://github.com/RustSec/advisory-db">security advisories</a> for these two bugs despite my call to do so (granted, you need some rather unusual client code to trigger them), but at least they were taken seriously and fixed <em>within days,</em> and that is already <em>incredibly</em> responsible by C standards.</p><p>I’m not sure why this HTTP stack is not getting a publicized exploit every couple of months. Are exploits truly rarer than that? Or is it because nobody’s looking for them? Or perhaps they’re just getting silently fixed and we simply never learn about them?</p><p>Anyway, reqwest didn’t segfault on a basic smoke test, which beats the state of the art from 5 years ago. It didn’t really <em>work</em> though. 6% of the time it downloaded and printed the data, <strong>then hung.</strong> I had to wrap my test binary in the timeout utility to keep things going.</p><p>That hang turned out to be a <a href="https://github.com/seanmonstar/reqwest/issues/746">known deadlock</a>. It’s not really surprising that it happened, because the Rust async/await ecosystem is so immature, with async/await being stabilized literally two months ago and <a href="https://github.com/rust-lang/rust/issues/66544">even the compiler itself failing to uphold its safety guarantees</a> right now. Plus the bug could easily be in some dependency, not in reqwest itself. What <em>is</em> surprising to me is that they issued a new stable release with a known deadlock.</p><p>The previous release (0.9 series) has been in use for a while, so it shouldn’t have such glaring bugs. But if you don’t need to have thousands of HTTP connections open at the same time, use something simpler. Like, <em>way</em> simpler, without any async in it.</p><h3>ureq</h3><blockquote>Minimal request library in rust.</blockquote><blockquote><strong>Motivation: </strong>Minimal dependency tree, Obvious API</blockquote><p>This, this is what my bitter, jaded eye likes to see.</p><p><a href="https://crates.io/crates/ureq">ureq</a> does not do any fancy cooperative multitasking or async I/O — just plain old blocking I/O that you can stuff into threads if you want concurrency.</p><p>You won’t be pulling your hair trying to catch a deadlock that happens on one request out of 100,000. 
<a href="https://lucumr.pocoo.org/2020/1/1/async-pressure/">Properly handling backpressure</a> is a breeze. And the threaded architecture scales to a few hundred concurrent connections just fine. <em>This is what the go-to HTTP client should look like.</em> The use of async I/O on the client should be a weird thing that you resort to for niche use cases, not the default.</p><p>According to <a href="https://github.com/anderejd/cargo-geiger">cargo-geiger</a>, ureq relies on 10 times less unsafe Rust than reqwest, and the only scary thing in its dependency tree is <a href="https://crates.io/crates/smallvec">SmallVec</a> — which, granted, is <a href="https://rustsec.org/advisories/RUSTSEC-2018-0003.html">still</a> <a href="https://rustsec.org/advisories/RUSTSEC-2019-0009.html">kinda</a> <a href="https://rustsec.org/advisories/RUSTSEC-2019-0012.html">scary</a>, but at least it’s the <em>only</em> such thing in there. Plus <a href="https://github.com/unicode-rs/unicode-normalization/commit/09091a3f313cfce5c9b08f3d6c376a93e02b080f">the subset of the SmallVec API actually in use</a> looks easy enough to replace with safe code. I’ve tried swapping SmallVec for the 100% safe <a href="https://github.com/Lokathor/tinyvec">TinyVec</a> and there was <a href="https://pastebin.com/zJnMAGrk">no difference in performance</a>, so maybe I should open a PR for that.</p><p>ureq also decisively ditches OpenSSL for <a href="https://github.com/ctz/rustls">rustls</a>. Using rustls may or may not be a good idea depending on what exactly you’re doing — <a href="https://github.com/ctz/rustls/issues/189">it has <em>not</em> been audited</a> for attacks such as <a href="https://mitls.org/pages/attacks/SMACK">SMACK</a> that don’t have anything to do with memory safety — but at least the Rust type system makes such mistakes easier to avoid.</p><p>As could have been expected, no segfaults! Also no deadlocks, because <em>there is basically nowhere for them to come from</em>. 
I did find a couple of panics though: one in <a href="https://github.com/algesten/ureq/issues/24">ureq DNS handling</a> on 13 websites out of a million (0.0013%) and also an assertion failure leading to a panic on 7 websites on <a href="https://github.com/briansmith/ring/issues/929">RSA validation in ring</a> (0.0007%), which leads me to believe that the Rust parts of <a href="https://github.com/briansmith/ring">ring</a> are less robust than I had hoped. It sure could use some <a href="https://rust-fuzz.github.io/book/introduction.html">fuzzing</a>.</p><p>I’m sure there are plenty more panics in there — I’ve only tested valid inputs, and feeding it invalid data should cause a whole lot more panics. Fortunately panics are designed to be possible to handle and recover from, so these conditions can be reasonably planned for, <em>unlike deadlocks.</em></p><p>There are also three unsafe blocks in ureq itself that I’m not sure are necessary. I would have to audit them before using ureq for anything serious. A quick glance at the bug tracker also reveals that <a href="https://github.com/algesten/ureq/issues/10">connection pooling is not really usable yet,</a> but if you don’t use that you should be good.</p><p>The only glaring omission that I see in ureq is that <a href="https://github.com/algesten/ureq/issues/28">it doesn’t allow setting a timeout for request completion</a>, so if the remote host keeps replying, the connection will stay open forever. This is a great vector for <a href="https://en.wikipedia.org/wiki/Denial-of-service_attack">denial-of-service attacks</a>: if you can get ureq to open connections to URLs you control, you can get your server to keep the connections open forever and easily exhaust some kind of resource on the client: thread pool, network ports, RAM, whatever runs out first.</p><h3>isahc</h3><p><a href="https://crates.io/crates/isahc">isahc</a> provides an idiomatic Rust interface to libcurl. 
Yeah, you’re talking to the devil, but it’s the devil everyone has already struck a deal with, so it’s probably fine! Plus it supports the deprecated and broken crypto protocol from 20 years ago!</p><p>Jokes aside, I am glad it exists, because it provides maximum interoperability and sometimes you need that above everything else — e.g. if you have to integrate with a weird legacy system and all connections are already inside a secure VPN such as <a href="https://www.wireguard.com/">WireGuard</a>.</p><p>The API exposed by isahc does feel pretty nice and Rusty. But I did run into an interesting gotcha with it: I was getting HTTP/2 protocol errors even though I had disabled HTTP/2. Turns out disabling the http2 feature <a href="https://github.com/sagebind/isahc/issues/147">doesn’t actually disable HTTP/2</a>.</p><p>I didn’t expect any fireworks from within libcurl this time around because Google <a href="https://github.com/google/oss-fuzz/tree/master/projects/curl">started continuously fuzzing it in 2017</a> as part of <a href="https://google.github.io/oss-fuzz/">OSS-fuzz</a> and found <a href="https://bugs.chromium.org/p/oss-fuzz/issues/list?sort=proj&amp;colspec=ID%20Type%20Component%20Status%20Proj%20Reported%20Owner%20Summary&amp;num=100&amp;q=-status%3AWontFix%2CDuplicate%20-Infra%20proj%3Acurl&amp;can=1">some 75 crashing bugs</a> since then, so the really low-hanging fruit detectable by a basic smoke test should have been picked by now.</p><p>You’d think cargo-geiger output would be totally benign, since all the complexity is inside libcurl, but no. Isahc pulls in the same crate from reqwest’s stack advertised as “HTTP types” but actually containing a bespoke HashMap implementation. 
And that crate in turn pulls in <em>yet another</em> <a href="https://crates.io/crates/bytes">bespoke and highly unsafe data structure</a> with an equally misleading description.</p><p>Test results are promising: <strong>no segfaults!</strong> This <em>is</em> quite impressive because the bindings to libcurl’s C API are inherently unsafe and leave plenty of room for error. On the flip side, I might not be seeing the full picture because I’ve only compiled isahc with Address Sanitizer, not libcurl. If the bindings have caused some kind of memory corruption in libcurl data structures, this setup wouldn’t detect it.</p><p>I also didn’t see any other kind of runtime malfunction, which is a first!</p><p>However, I have some gripes not with the library itself, but with the tech stack it relies on, just like with reqwest. Specifically, the curl-sys crate may choose to use its own bundled version of libcurl <em>even if explicitly asked not to do that</em>. This amounts to loading a gun, pointing it at your foot while you’re not looking and sticking a timer on the trigger.</p><p>You see, if the user has <em>explicitly requested</em> use of the shared library, they may reasonably assume that bug fixes or security patches in the system-wide library will also apply to the Rust application. 
But curl-sys may choose to violate that assumption depending on the build environment: if development headers for libcurl are not present, or if http2 support is requested and the system-wide libcurl is too old, curl-sys will statically link its own version of libcurl instead of erroring out.</p><p>Now your code will not only <em>unexpectedly receive no security patches, </em>which will make it trivially exploitable by the very next <em>publicly known vulnerability</em>, it will also be running a different version of libcurl in a wildly different configuration than everything else <em>when you don’t expect it to,</em> which is all kinds of fun to debug, especially in production at 3AM.</p><p>For example, you may think you made <a href="https://medium.com/@nate510/don-t-use-go-s-default-http-client-4804cb19f779">resource leaks</a> impossible by setting a default connection timeout in libcurl, and then curl-sys smuggles in a totally different libcurl behind your back, stealthily reopening this can of worms. Plus the probability that there are no version-specific bugs or behavioral differences between versions is basically nil, so good luck debugging <em>that!</em></p><blockquote>Bye-bye, security patches. Hello, debugging inexplicable production outages at 3AM.</blockquote><p>Initially I had a hard time communicating to the maintainer that this was an issue, but they eventually conceded that this could be improved and that PRs fixing this would be welcome.</p><p>I won’t even spend time commenting on the timeliness of security updates to the bundled libcurl (weeks after CVE disclosure) or the existence of <a href="https://github.com/RustSec/advisory-db">RustSec advisories</a> for such updates (none).</p><h3>http_req</h3><blockquote>Simple HTTP client with built-in HTTPS support. 
Currently it’s in heavy development and may frequently change.</blockquote><p>No bespoke unsafe, OpenSSL or rustls at your option, and no other dependencies with unsafe code in them!</p><p>No segfaults (duh!), <a href="https://github.com/jayjamesjay/http_req/issues/29">one panic</a>, some hangs… wait, those seem to be legitimate hangs: the server is just submitting chunks of data really slowly! Then why didn’t ureq also hang here? Well, <a href="https://github.com/algesten/ureq/issues/30">let’s report that to ureq</a>.</p><p>Other than the panic, this looks like my idea of a perfect library, which sounds too good to be true. And it <em>is</em> too good to be true: http_req is so basic that there’s hardly anything in there to break. For example, it doesn’t follow redirects, so it didn’t even get to the front pages of most websites I’ve thrown at it; it’s only seen a basic redirect response. There is also <a href="https://github.com/jayjamesjay/http_req/issues/28">no support for setting a timeout for the entire request,</a> so it’s susceptible to denial-of-service attacks.</p><p>In a nutshell: too basic for most uses right now, but perhaps something to keep an eye on.</p><h3>attohttpc</h3><p>I know I will have trouble telling anyone about this crate. Naming it something like “hytt” or “ht2p” would make it much easier to refer to.</p><p>It uses blocking I/O, which makes it a potential contender for being the go-to solution. The feature set is roughly on par with ureq, except it supports some features that are basically optimizations (like compression) and doesn’t support some features that are strictly required for some use cases (like cookies). Sadly it has no rustls option, so you’re locked into OpenSSL.</p><p>Dependencies are mostly sane and minimal, with one exception. 
That bespoke HashMap advertised as “HTTP types” <em>is in here too</em>, and it is of course pulling in <a href="https://crates.io/crates/bytes">that other data structure</a> which is 8,000 lines of code and full of unsafe. Using that thing makes sense if you want to pass small chunks of HTTP data across different threads and you have no idea what the access patterns are (so basically async code), but attohttpc does <em>none</em> of that insanity; it parses the entire response in the same thread. So all attohttpc gets for its trouble here is a slowdown due to non-contiguous data layout and atomic reference-counting, plus an extra 8,000 LoC of unsafe code added to its attack surface. This crate would be much better off using the standard library Vec.</p><p>Initially attohttpc lacked any kind of timeouts, but the maintainers have implemented connection and read timeouts upon my request. A timeout for the entire request <a href="https://github.com/sbstp/attohttpc/issues/35">is still missing</a> though, so DoS attacks remain possible.</p><p>A test of the version with timeouts triggered no segfaults, but did <a href="https://github.com/sbstp/attohttpc/issues/43">reveal one intermittent panic</a> and caused a few expected hangs due to the absence of a timeout for the entire request.</p><h3>minreq</h3><blockquote>Simple, minimal-dependency HTTP client. The library has a very minimal API, so you’ll probably know everything you need to after reading a few examples.</blockquote><p>Exactly what it says on the tin! No scary dependencies whatsoever, HTTPS via rustls and optional JSON through serde-json. The API is extremely minimal, and judging by the description that’s on purpose, so I do not expect this to be a viable choice in any kind of serious project. 
You will eventually need something not covered by the API and have to migrate to another library.</p><p>The documentation stated that it supported a timeout for the entire request, but upon reading the code <a href="https://github.com/neonmoe/minreq/issues/22">I’ve found that it was not the case</a>. Fortunately this is <a href="https://github.com/neonmoe/minreq/commit/41ef16e33501712e64a5711a7f4bc132d9bfbe40">fixed now, but may come with a significant performance hit</a>.</p><p>The test revealed no segfaults and <a href="https://github.com/neonmoe/minreq/issues/23">two</a> <a href="https://github.com/neonmoe/minreq/issues/24">panics</a>; I can’t really comment on hangs because I’ve tested it before the fix.</p><h3>cabot</h3><p>No, it’s not a certificate authority bot, it’s an HTTP client. And I thought “attohttpc” was a bad name!</p><p>It uses async I/O, but what makes it distinct from reqwest is that it uses <a href="https://github.com/async-rs/async-std">an entirely different async I/O stack</a> with dramatically less unsafe code. And by virtue of being a different implementation it’s also going to exhibit new and exciting failure modes!</p><p>Sadly cabot is currently too basic to exhibit anything exciting. It doesn’t even follow redirections, so it will actually process less than half of the websites I’ll throw at it. It also <a href="https://github.com/mardiros/cabot/issues/7">supports no timeouts of any kind</a>, which makes it unusable in automated systems where the user can’t decide they’ve had enough and kill the process.</p><p>The amount of unsafe code in the dependencies is at about 1/2 of that of reqwest. There is nothing particularly egregious in there except for two things. One, there is a hard dependency on <a href="https://crates.io/crates/pretty_env_logger">pretty_env_logger</a>, which alone pulls in more unsafe code than all of ureq’s dependencies combined. 
Two, <a href="https://github.com/mardiros/cabot/blob/24629157d03e9880ef9d3a06de32d49295d29ff9/src/constants.rs#L18">it uses regex</a> to parse some parts of HTTP — which is a bad idea, but given that it uses Rust’s supposedly DoS-resilient regex library and not PCRE, it’s not a <em>terrible</em> idea. Fortunately, both of these issues look simple enough to fix, and with these dependencies ditched it would go down to 1/3 of the unsafety of reqwest.</p><p>The test results are basically in line with what I expected. No segfaults, but <a href="https://github.com/mardiros/cabot/issues/10">one frequent panic</a>. I also got 25,000 hangs, and I have no idea if those are deadlocks or the upstream simply didn’t respond in time, because cabot doesn’t support any kind of timeouts.</p><p>So nothing to get excited about in here yet, but perhaps it will evolve into a viable contender for the weird use cases that require async I/O.</p><h3>Not tested</h3><h4>surf</h4><p><a href="https://github.com/http-rs/surf">surf</a> is a common interface on top of isahc and the HTTP stack underlying reqwest. I’ve already tried both backends, so not much to see here.</p><h4>yukikaze</h4><p><a href="https://crates.io/crates/yukikaze">yukikaze</a> is built on basically the same stack as reqwest, but provides a different API on top. Since all the complaints I had about reqwest were actually about the underlying HTTP stack, they also apply here.</p><h4>awc</h4><p>An HTTP client built on the <a href="https://actix.rs/">Actix framework</a>. The HTTP stack seems to be shared with actix-web. It uses async I/O and thus relies on Rust’s immature async/await ecosystem, which brings up the same issues with complexity and exciting new failure modes as with reqwest. Definitely should not be your first choice when shopping for an HTTP client. 
And cargo-geiger output is also not comforting.</p><p>A quick glance at the dependencies reveals that it relies on actix-service, which underpins all of Actix and has a bespoke and unsound Cell implementation. For example, <a href="https://github.com/actix/actix-net/blob/7dddeab2a8c4fdcd0c7de6aa4303aca8faffcd53/actix-service/src/cell.rs#L35">this method</a> violates memory safety by handing out multiple mutable references to the same data, which can lead to e.g. a use-after-free vulnerability. I have reported the issue to the maintainers, but <a href="https://github.com/actix/actix-net/issues/83">they have refused to investigate it</a>.</p><p>There are no comments on their bespoke Cell implementation — not only no comments to justify why it’s needed, but <em>no comments at all.</em> So I dug through the commit log to see <em>why</em> they rolled their own unsafe primitive instead of using Rc&lt;RefCell&gt;, which would be doing the <strong>exact same thing,</strong> except <strong>safely.</strong> Here’s the commit message justifying the change:</p><blockquote>add custom cell</blockquote><p><a href="https://github.com/actix/actix-net/commit/20b03a478063d6bb293a6636632f976113812e13">That’s it.</a></p><p>And while I probably could pressure the maintainers into fixing <em>this particular </em>bug (or maybe even dropping their bespoke cell altogether, if I’m <em>really</em> lucky!), that will do nothing to improve their general approach to safety, which determines the shape of all their future code. So I just give up and admit that I can’t put my money on the security or reliability of this thing.</p><p>I want to highlight that Actix is not unique in having its own bespoke and unsound unsafe primitives — the HTTP stack underlying reqwest faces largely the same issues, although its maintainers are much more willing to fix reported bugs. 
And the solution to this problem is very simple: <em>stop making bespoke unsafe primitives,</em> because for everything other than toy projects <strong><em>reliability trumps performance.</em></strong></p><h3>I need an HTTP client…</h3><p>So which library would I use if I needed one right now?</p><p>Well, I’ve found serious bugs in every single one of them, so none are usable as-is. Plus I would do more research than this before committing to one.</p><p>First I would check if the problem can be solved with blocking I/O, and if so, look no further. Clients with blocking I/O are simple, dependable, and boring — they have nothing in them that could break in an interesting way. I won’t be pulling my hair when debugging, and nobody’s going to be woken up in the middle of the night because something has deadlocked in production.</p><p>Based on this cursory glance, it seems that ureq and attohttpc could be hammered into a usable shape in a week or so each (assuming you’re willing to stick a panic catcher on them), plus however long it will take to add rustls as an optional TLS backend to attohttpc if you want to get rid of OpenSSL.</p><p>I’m letting the panics slide because they’re not a DoS vector: panics are trivial to guard against, especially if you’re spawning threads, and there is almost nothing in ureq’s dependency tree that could crash the entire process save for a panic-while-panicking.</p><p>But what if I had a use case that is not served by the clients with blocking I/O?</p><p>It pains me to say this, but… I wouldn’t go with reqwest. Credit where it’s due: unlike curl, it’s not <em>hopelessly</em> broken, and developers of the underlying async runtime <a href="https://tokio.rs/blog/2019-10-scheduler/#fearless-unsafe-concurrency-with-loom">go above and beyond due diligence</a> in some respects. 
But first, the async/await ecosystem as a whole is still immature, and second, I won’t be able to trust reqwest’s underlying HTTP stack until they ditch most of their bespoke unsafe primitives.</p><p>Sadly isahc is also not a great candidate. For starters, I would still need to rely on the immature async ecosystem if I need high performance, and would run into the same deadlocks as with reqwest (or cabot?). Also, when something goes wrong in libcurl it brings down the entire process, which aborts <em>all</em> requests currently in flight (unlike ureq or attohttpc, where you can abort just one), so the <a href="https://en.wikipedia.org/wiki/Denial-of-service_attack">DoS</a> resilience of libcurl is basically nil while the attack surface is enormous. Which is why there is no way I’m linking libcurl+OpenSSL into my main process; otherwise it would bring down not only all in-flight requests but the entirety of my backend as well. So I’d have to put all the code that does HTTP fetching into a separate process, sandbox it and communicate with it using something like RPC… Ewww.</p><p>Actually, why am I even trying to do this in Rust at this point? If I’m breaking the HTTP fetching into a separate process anyway, I might as well go for a mature async I/O stack in there.</p><p>I’m not sure how the HTTP client in Go is implemented, but at least Go gets async I/O mostly right, so that’s worth a look. And it comes with <a href="https://golang.org/pkg/crypto/tls/">its own TLS implementation</a>, so we won’t be stuck with OpenSSL. 
On the other hand, Go makes concurrency very easy to mess up (<a href="https://www.jtolio.com/2016/03/go-channels-are-bad-and-you-should-feel-bad/">and no, channels are not a solution</a>) and <a href="https://www.reddit.com/r/rust/comments/emyill/blog_why_rust_i_have_a_gc/fduq8dp/">error handling in Go is a minefield</a>, but hopefully fetching webpages would be simple enough not to run into these issues?</p><p>Alternatively, Erlang has a mature async I/O stack, its reliability is unparalleled, and it makes it really hard to mess up concurrency. But I don’t know if the HTTP client specifically is any good, and it may be hard to find people familiar with Erlang, so the decision to use it should not be taken lightly.</p><h3>Conclusion</h3><p>The place of the go-to Rust HTTP client is sadly vacant. Clients with async I/O will never be able to fill it due to the sheer operational complexity they bring, and none of the clients with blocking I/O have enough features to be viable contenders.</p><p>Instead of a single good HTTP client with blocking I/O, Rust has at least 4 passable ones. I wish their maintainers got together and made a single good client instead.</p><p>Two months in, the async/await ecosystem is as immature as you’d expect. If I needed an HTTP client with async I/O right now, I’d use a different memory-safe language.</p><p>Only libraries written in C and Rust can be integrated into code written in other languages, and the C libraries that the entire world relies on are hopelessly broken. Rust libraries are also broken, but not <em>hopelessly</em> so. Let’s fix them and usher in a new era of performant, secure and reliable software.</p><h4>Appendix: well, technically…</h4><p>In the interest of making an already long read more tractable I’ve omitted some technicalities that do not change the overall picture much. 
If you feel I’ve overlooked something, check if it’s covered in the <a href="https://medium.com/@shnatsel/smoke-testing-rust-http-clients-extended-notes-f4e674577a1a">extended notes</a>.</p><h4>Appendix: issues reported in the making of this article</h4><ul><li><a href="https://github.com/seanmonstar/reqwest/issues/746">https://github.com/seanmonstar/reqwest/issues/746</a></li><li><a href="https://github.com/algesten/ureq/issues/24">https://github.com/algesten/ureq/issues/24</a></li><li><a href="https://github.com/algesten/ureq/issues/28">https://github.com/algesten/ureq/issues/28</a></li><li><a href="https://github.com/algesten/ureq/issues/30">https://github.com/algesten/ureq/issues/30</a></li><li><a href="https://github.com/briansmith/ring/issues/929">https://github.com/briansmith/ring/issues/929</a></li><li><a href="https://github.com/sagebind/isahc/issues/147">https://github.com/sagebind/isahc/issues/147</a></li><li><a href="https://github.com/alexcrichton/curl-rust/issues/321">https://github.com/alexcrichton/curl-rust/issues/321</a></li><li><a href="https://github.com/jayjamesjay/http_req/issues/27">https://github.com/jayjamesjay/http_req/issues/27</a></li><li><a href="https://github.com/jayjamesjay/http_req/issues/28">https://github.com/jayjamesjay/http_req/issues/28</a></li><li><a href="https://github.com/jayjamesjay/http_req/issues/29">https://github.com/jayjamesjay/http_req/issues/29</a></li><li><a href="https://github.com/sbstp/attohttpc/issues/35">https://github.com/sbstp/attohttpc/issues/35</a></li><li><a href="https://github.com/sbstp/attohttpc/issues/36">https://github.com/sbstp/attohttpc/issues/36</a></li><li><a href="https://github.com/sbstp/attohttpc/issues/43">https://github.com/sbstp/attohttpc/issues/43</a></li><li><a href="https://github.com/actix/actix-net/issues/83">https://github.com/actix/actix-net/issues/83</a></li><li><a href="https://github.com/neonmoe/minreq/issues/22">https://github.com/neonmoe/minreq/issues/22</a></li><li><a 
href="https://github.com/neonmoe/minreq/issues/23">https://github.com/neonmoe/minreq/issues/23</a></li><li><a href="https://github.com/neonmoe/minreq/issues/24">https://github.com/neonmoe/minreq/issues/24</a></li><li><a href="https://github.com/mardiros/cabot/issues/7">https://github.com/mardiros/cabot/issues/7</a></li><li><a href="https://github.com/mardiros/cabot/issues/8">https://github.com/mardiros/cabot/issues/8</a></li><li><a href="https://github.com/mardiros/cabot/issues/9">https://github.com/mardiros/cabot/issues/9</a></li><li><a href="https://github.com/mardiros/cabot/issues/10">https://github.com/mardiros/cabot/issues/10</a></li><li><a href="https://github.com/anderejd/cargo-geiger/issues/79">https://github.com/anderejd/cargo-geiger/issues/79</a></li></ul><p><em>No crate maintainers were harmed in the making of this article.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b8f2ee5db4e6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Security as Rust 2019 goal]]></title>
            <link>https://shnatsel.medium.com/security-as-rust-2019-goal-6a060116ba39?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/6a060116ba39</guid>
            <category><![CDATA[security]]></category>
            <category><![CDATA[rust]]></category>
            <category><![CDATA[rust2019]]></category>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Fri, 18 Jan 2019 18:20:42 GMT</pubDate>
            <atom:updated>2019-02-02T16:36:00.241Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o8W2s52rkZPSynlxNzonQQ.jpeg" /><figcaption>Our vision for Rust. Image courtesy of <a href="https://www.flickr.com/photos/sreejithk2000/3328197299/">Sreejith K</a>.</figcaption></figure><p><em>Note: I am publishing this on behalf of Secure Code Working Group because we do not have a WG blog established yet. Multiple people have contributed to this post.</em></p><p><a href="https://github.com/rust-secure-code/wg">Rust Secure Code Working Group</a> is a bunch of curious people hanging out in a public chat on the Internet. You can click <a href="https://rust-lang.zulipchat.com/#narrow/stream/146229-wg-secure-code">here</a> to hang out with us.</p><p><strong>Our mission is to make it easy to write secure code in Rust.</strong></p><p>We have the following goals for the Rust language and ecosystem:</p><ul><li>Most tasks shouldn’t require dangerous features such as unsafe. This includes FFI.</li><li>Mistakes in security-critical code should be easily caught by machines or, failing that, humans aided by machines.</li><li>It should be clear to programmers how to perform security-sensitive tasks.</li><li>Security-critical code which is relied on by Rust programmers should be bug free.</li></ul><p>This article details what we have agreed on as especially critical areas that we would like to see improved upon in 2019.</p><h3>Security updates</h3><p>Safe Rust eliminates entire classes of security bugs, which makes it very promising for security-critical applications such as web servers. However, even memory-safe code may contain logic bugs leading to security breaches. No code is perfect, so security bugs will occur, and <a href="https://rustsec.org/advisories/">are already occurring</a>.</p><p>Rust needs a mechanism to deliver security updates to any kind of production deployments in a timely manner. 
This involves finding good answers to the following questions:</p><ol><li>If you run Rust code in production, how do you get notified that you need to apply a security update? How do you set up a pipeline to apply these updates automatically? This is exacerbated by Rust’s static linking, since every affected program needs to be updated individually, even if a vulnerability is in a transitive dependency. We need solutions both for software installed via cargo install and via complex deployment pipelines used for production servers.</li><li>How should fixes for compiler or standard library bugs be applied? Currently there is no “rebuild everything that was ever installed” command in Cargo. Also, how do we notify people that they need to rebuild everything? What if the code is non-trivially deployed, like a shared library linked into another language?</li><li>How should security updates to statically linked C libraries be handled? What if the build is for Windows, where the only reasonable way to build against C libraries is to bundle them with the -sys crate? Should the maintainer of a Rust -sys crate be responsible for security updates to the C code, and if so, how do we make that manageable for the maintainer?</li></ol><h4>Prior art</h4><p>The <a href="https://rustsec.org/">RustSec</a> project hosts a <a href="https://github.com/RustSec/advisory-db">Rust security advisory database</a> and provides a <a href="https://github.com/RustSec/cargo-audit">command-line tool</a> that checks Cargo.lock for vulnerable dependencies. This is a great start, but currently you need to run it manually on each of your projects to check them, and doing that every day is impractical. It also doesn’t handle compiled binaries.</p><p>There is also <a href="https://gitlab.com/zachreizner/crates-audit/">a tool</a> to cross-reference the <a href="https://github.com/rust-lang/crates.io-index">crates.io index</a> with the RustSec database. 
It has identified, for example, a crate with 2500+ downloads per month that depends on a grossly outdated and <a href="https://rustsec.org/advisories/RUSTSEC-2016-0001.html">trivially exploitable</a> version of OpenSSL. Right now crates.io itself does not present this info in any way, so the crate in question may keep accumulating unsuspecting users.</p><p><a href="https://github.com/rust-lang/rfcs/pull/1752">An RFC for some of this functionality</a> was proposed in 2016, but shelved. The issues already in the RustSec database are proof that it is needed. Reviving it is being discussed <a href="https://internals.rust-lang.org/t/pre-rfc-reviving-security-advisories-in-crates-io-rfc-pr-1752/9017">here</a>.</p><p>The Rust compiler encodes the rustc, LLVM and standard library versions into all binaries it produces. This makes it easy to check for binaries with vulnerable stdlib versions, regardless of deployment method. However, the versions of all the other libraries used to compile the binary are not encoded.</p><p><a href="https://theupdateframework.github.io/">The Update Framework</a> provides protocols for resilient and timely delivery of security updates, which is <a href="https://theupdateframework.github.io/security.html">harder than it sounds</a>. An implementation of it in Rust is <a href="https://github.com/heartsucker/rust-tuf">in progress</a>.</p><h3>Use of unsafe code</h3><p>Many widely used libraries use unsafe code where it’s not strictly necessary. Typically this is done for performance reasons, i.e. there are currently no safe abstractions to achieve the goal safely <em>and</em> efficiently.</p><p>The goal here is to reduce or eliminate the use of unsafe code throughout the ecosystem where it is not strictly necessary, without regressing correctness or performance. The action items for that include:</p><ol><li>Investigate why exactly people resort to unsafe code on a case-by-case basis. 
Compile a list of case studies so that we can identify missing safe abstractions or idioms.</li><li>Try to rewrite unsafe code into safe code without regressing performance. Document the patterns and anti-patterns, create guidelines and/or clippy warnings based on those.</li><li>Create safe abstractions to serve common cases that are currently served by unsafe code, such as <a href="https://github.com/rust-lang/rust/issues/54236">copying a part of a slice into itself.</a></li><li>Prioritize language and compiler work items that allow better verification at the compilation stage, such as better bounds check elision or <a href="https://github.com/rust-lang/rust/issues/44580">const generics</a>.</li></ol><p>The Rust ecosystem is fairly large these days, so there is a lot of code to cover. Perhaps a community effort akin to the <a href="https://blog.rust-lang.org/2017/05/05/libz-blitz.html">libs blitz</a> is required.</p><h4>Prior art</h4><p><a href="https://rust-lang-nursery.github.io/edition-guide/rust-2018/ownership-and-lifetimes/non-lexical-lifetimes.html">Non-lexical lifetimes</a>, which landed in the 2018 edition of Rust, made the borrow checker smarter, reducing the need for resorting to unsafe code. 
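To make the “copying a part of a slice into itself” case concrete, here is a sketch (mine, not from the post) contrasting the raw-pointer version with the safe, bounds-checked API that this kind of proposal grew into, stabilized in later Rust versions as slice::copy_within:

```rust
fn main() {
    // The unsafe way: an overlapping copy via raw pointers, where
    // correctness rests entirely on the programmer.
    let mut buf = [1u8, 2, 3, 4, 5, 0, 0, 0];
    unsafe {
        let ptr = buf.as_mut_ptr();
        // Copy buf[0..5] to buf[3..8]; the ranges overlap, like memmove.
        std::ptr::copy(ptr, ptr.add(3), 5);
    }
    assert_eq!(buf, [1, 2, 3, 1, 2, 3, 4, 5]);

    // The safe abstraction: the same operation, but out-of-range
    // indices cause a panic instead of silent memory corruption.
    let mut buf = [1u8, 2, 3, 4, 5, 0, 0, 0];
    buf.copy_within(0..5, 3);
    assert_eq!(buf, [1, 2, 3, 1, 2, 3, 4, 5]);
}
```

The safe version removes the unsafe block from user code entirely, and the bounds checks happen once per call rather than per element, so there is no meaningful performance cost.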
Kudos to everyone involved!</p><p>Some other highlights:</p><ul><li><a href="https://internals.rust-lang.org/t/pre-rfc-fixed-capacity-view-of-vec/8413">Analysis of unsafe code in decoding crates</a> that led to a safe abstraction proposal, a <a href="https://github.com/rust-lang/rust/issues/54628">documentation improvement</a>, a <a href="https://github.com/rust-lang/rust-clippy/issues/3237">clippy warning</a>, and <a href="https://internals.rust-lang.org/t/pre-rfc-fixed-capacity-view-of-vec/8413/31?u=shnatsel">discovery</a> of a <a href="https://blog.rust-lang.org/2018/09/21/Security-advisory-for-std.html">security bug</a> in the standard library.</li><li>Developers <a href="https://github.com/rust-lang/rust/pull/39271">uplifting</a> their only unsafe block into the standard library. It is now a <a href="https://doc.rust-lang.org/std/primitive.f64.html#method.from_bits">widely used function</a> with the unsafe block encapsulated in a safe external interface, verified once and for all.</li><li>Crates such as <a href="https://github.com/BurntSushi/byteorder">byteorder</a> providing safe abstractions for commonly used and potentially unsafe operations long before they appeared in the standard library.</li><li>The <a href="https://github.com/rust-rfcs/unsafe-code-guidelines">Unsafe Code Guidelines effort</a> that has resulted in safer abstractions such as <a href="https://github.com/rust-lang/rfcs/blob/master/text/1892-uninitialized-uninhabited.md">MaybeUninit</a>. It also paves the way for security-oriented static analysis of unsafe code and better compiler optimizations.</li></ul><h3>Verification of standard library</h3><p>The Rust standard library is a truly impressive piece of engineering. 
It sets the bar for Rust API design and incorporates the <a href="https://github.com/stjepang/pdqsort">latest advances</a> in algorithms and data structures, with more on the way.</p><p>Due to its role as the foundation of the language, providing essential safe abstractions over the hardware, it is also full of unsafe code.</p><p>Two <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=%20CVE-2018-1000657">serious</a> <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-1000810">vulnerabilities</a> have been discovered in libstd to date. Another one was introduced but reverted before release because it was so bad that it <a href="https://github.com/rust-lang/rust/issues/54477">caused crashes even on valid data</a>. All of these were introduced during optimization or refactoring, and had passed manual code review.</p><p>The fact that humans are no good at analyzing unsafe code<em> is the very reason for Rust’s existence.</em> <strong>We need computers to assist in verification of Rust’s standard library.</strong></p><p>There are several ways to go about that:</p><ol><li>Static analysis would be a relatively cheap and scalable way to gain more confidence in the code. Rust is much more amenable to static analysis than C/C++ or dynamically typed languages, but there is no go-to security-oriented static analyzer yet.</li><li><a href="https://en.wikipedia.org/wiki/Fuzzing">Fuzzing</a> or parametric testing could also scale well, assuming fuzzing harnesses could be automatically generated based on type definitions of stdlib functions. It would not find all the bugs, but it is easy to run continuously and feasible to scale to the entirety of the standard library with little maintenance burden.</li><li>Formal verification methods provide greater assurance of correctness, but require more effort and introduce a non-trivial maintenance burden. 
Even though verifying the entirety of the standard library this way is probably not practical at this time, it would be great to apply them to verify the most essential parts of it.</li></ol><p>One of the already discovered vulnerabilities was <a href="https://github.com/rust-lang/rust/pull/54397">trivial</a> and would have been flagged by a static analyzer or easily discovered via fuzzing — if any of those were actually employed.</p><h4>Prior art</h4><p>Parametrized testing is easy to use in Rust, with two mature frameworks available: <a href="https://github.com/BurntSushi/quickcheck">QuickCheck</a>, inspired by the <a href="https://en.wikipedia.org/wiki/QuickCheck">Haskell tool of the same name</a>, and <a href="https://github.com/altsysrq/proptest">Proptest</a>, inspired by <a href="https://hypothesis.works/">Hypothesis</a>.</p><p>The guided fuzzer trifecta — <a href="https://github.com/rust-fuzz/afl.rs">AFL</a>, <a href="https://github.com/rust-fuzz/cargo-fuzz">libfuzzer</a> and <a href="https://github.com/rust-fuzz/honggfuzz-rs">honggfuzz</a> — is already adapted to work with Rust and <a href="https://fuzz.rs/book/">takes 15 minutes to deploy</a>. The <a href="https://github.com/rust-fuzz/trophy-case">trophy case</a> is quite impressive.</p><p>A new fuzzer called <a href="https://github.com/AngoraFuzzer/Angora">Angora</a> has just been released; according to its authors, it is <a href="https://arxiv.org/abs/1803.01307">vastly superior</a> to the AFL-inspired status quo. 
It is itself written in Rust, but <a href="https://github.com/AngoraFuzzer/Angora/issues/10">cannot fuzz Rust code yet</a>.</p><p><a href="https://github.com/blt/bughunt-rust">Bughunt-rust</a> has experimented with probabilistic model checking for verifying standard library data structures, inspired by <a href="https://medium.com/@jlouis666/breaking-erlang-maps-4-4ebc3c64068c">similar work in Erlang</a>, but used guided fuzzers instead of QuickCheck’s RNG to improve coverage.</p><p>Fuzzing relies on dynamic analyzers to detect issues. Rust supports the venerable <a href="https://github.com/japaric/rust-san">LLVM sanitizers</a>, although Address Sanitizer currently requires <a href="https://github.com/rust-secure-code/wg/issues/20">some workarounds</a> and <a href="https://github.com/rust-secure-code/wg/issues/21">nobody’s really sure</a> how to use Memory Sanitizer, which led some people to build <a href="https://github.com/Shnatsel/libdiffuzz">custom tooling</a> instead. There is also a Rust-specific tool called <a href="https://github.com/solson/miri/">MIRI</a>, but so far it only supports a subset of Rust and does not compose well with fuzzing.</p><p><a href="https://github.com/rust-lang/rust-clippy">Clippy</a> is the go-to heuristic static analyzer for Rust, although it doesn’t have many safety lints yet. <a href="https://github.com/facebookexperimental/MIRAI">MIRAI</a> is a sound static analyzer for Rust based on the <a href="https://www.di.ens.fr/~cousot/AI/IntroAbsInt.html">theory of abstract interpretation</a>, but it is in the early stages of development.</p><p>On the formal verification front, the <a href="https://plv.mpi-sws.org/rustbelt/popl18/">RustBelt project</a> has proven certain properties of the Rust type system and verified correctness of several standard library primitives. 
The <a href="http://smackers.github.io/">SMACK</a> software verification toolchain works with Rust and <a href="http://soarlab.org/publications/atva2018-bhr.pdf">has been used to find real bugs</a>, but does not take advantage of the Rust type system, which makes it somewhat cumbersome to use.</p><p><a href="https://www.research-collection.ethz.ch/handle/20.500.11850/311092">Some promising work</a> has been done on proving the absence of overflows and panics, or even proving user-defined properties, on unmodified Rust code, without manually writing additional proofs in a verification language. This project is known as <a href="http://www.pm.inf.ethz.ch/research/prusti.html">Prusti</a>. It only works with a subset of safe Rust so far, but the prospect of formally verifying properties of Rust code with little to no additional effort is very exciting.</p><h3>Code authentication and trust</h3><p>Trust towards third-party code is a hot topic right now due to the recent <a href="https://blog.npmjs.org/post/180565383195/details-about-the-event-stream-incident">event-stream incident</a> in NodeJS. Ironically, security researchers <a href="https://hackernoon.com/im-harvesting-credit-card-numbers-and-passwords-from-your-site-here-s-how-9a8cb347c5b5">warned about this years ago</a>.</p><p>This is an important problem, and there is work being done on that front. For example, something like <a href="https://github.com/dpc/crev">cargo-crev</a> may solve it in some cases. But trust towards external code is an unsolved problem in general, even in programming languages with built-in sandboxing capabilities.</p><p><strong>As such, we do not expect it to be completely solved in 2019.</strong> However, there are improvements that we can make right now.</p><p>Adopting better code authentication practices is one. 
Someone is going to get their account compromised sooner or later, and <a href="https://eslint.org/blog/2018/07/postmortem-for-malicious-package-publishes">the recent ESLint compromise</a> is quite illustrative of why a strategy for mitigating this is needed. Even basics such as requiring signatures from several maintainers to upload a package are currently not supported.</p><p>This has been brought up <a href="https://github.com/rust-lang/crates.io/issues/75">as early as 2014</a>; the attitude towards it is generally positive, and <a href="https://github.com/heartsucker/rust-tuf">there is some work being done in this direction</a>, but nobody has stepped up to actually implement the remaining part of it yet.</p><h3>We need your help!</h3><p>Some of the items we’ve listed require participation from core Rust teams, but most of them really don’t.</p><p><strong>This is where you come in.</strong></p><p>Rust is a community-driven language. We are just random people on the Internet coming together to work on a shared goal.</p><p>If you feel that these goals are worthwhile, pick an interesting item from the <a href="https://github.com/rust-secure-code/wg/issues">WG issue tracker</a> and see if you can help. After all, it takes more than a village to build a successful programming language.</p><p>And stop by to say hello <a href="https://rust-lang.zulipchat.com/#narrow/stream/146229-wg-secure-code">on Zulip</a>!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6a060116ba39" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How I’ve found vulnerability in a popular Rust crate (and you can too)]]></title>
            <link>https://shnatsel.medium.com/how-ive-found-vulnerability-in-a-popular-rust-crate-and-you-can-too-3db081a67fb?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/3db081a67fb</guid>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[fuzzing]]></category>
            <category><![CDATA[security]]></category>
            <category><![CDATA[rust]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Fri, 28 Sep 2018 03:51:45 GMT</pubDate>
            <atom:updated>2018-09-28T12:21:28.481Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
<content:encoded><![CDATA[<p>I have recently discovered a <a href="https://rustsec.org/advisories/RUSTSEC-2018-0004.html">zero-day vulnerability</a> in a fairly popular and well-designed Rust crate. In this article I’m going to discuss how I did it and why it wasn’t discovered earlier, and introduce a new tool, <a href="https://github.com/Shnatsel/libdiffuzz">libdiffuzz</a>, that I’ve created for the job. A recently discovered <a href="https://blog.rust-lang.org/2018/09/21/Security-advisory-for-std.html">vulnerability</a> in the Rust standard library makes a cameo appearance.</p><p>In <a href="https://medium.com/@shnatsel/auditing-popular-rust-crates-how-a-one-line-unsafe-has-nearly-ruined-everything-fab2d837ebb1">my earlier article</a> about a one-line unsafe block that has nearly ruined everything, I explained how I used <a href="https://en.wikipedia.org/wiki/Fuzzing">fuzzing</a> to look for vulnerabilities in widely used Rust code. However, the titular one-line unsafe was found not through an automated process, but by manually reading the code. Why didn’t fuzzers discover it?</p><p>Fuzzers work by feeding your program random input and seeing what happens. They only detect that something is wrong if the program crashes. So in order to get fuzzers to actually discover memory issues that lead to vulnerabilities, you need some way to notice improper handling of memory when it happens. There have been many attempts to build such tools over the years, but the most practical and popular tool is <a href="https://clang.llvm.org/docs/AddressSanitizer.html">Address Sanitizer</a>. It reliably detects all sorts of bugs, is supported by the Rust compiler out of the box, and is in fact enabled by default in one of the Rust fuzzers, <a href="https://github.com/rust-fuzz/cargo-fuzz">cargo-fuzz</a>.</p><p>However, there is a class of memory issues that Address Sanitizer cannot detect: reading from uninitialized memory. 
If you can get a program to output contents of uninitialized memory, that’s called a “memory disclosure” vulnerability. There are <a href="http://seclists.org/fulldisclosure/2013/Nov/83">multiple examples</a> of such bugs in common C code, and in certain contexts they can be devastating: how about stealing cookies and passwords from a web browser simply by displaying an image and running a bit of JavaScript?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d8__uTp2m3N-ycc9VfkvVw.png" /><figcaption>All of these are the same image. Due to a bug in the GIF decoder, contents of browser memory show up in the decoded images.</figcaption></figure><p>There is a tool that can detect reads from uninitialized memory, called <a href="https://clang.llvm.org/docs/MemorySanitizer.html">Memory Sanitizer</a>, but it currently <a href="https://github.com/rust-lang/rust/issues/39610">doesn’t work</a> with the Rust standard library. So unless you completely avoid using the Rust standard library, there is no tool that lets you detect reads from uninitialized memory in Rust.</p><p>Well, bummer. That means I’ll have to build one.</p><h3>The birth of libdiffuzz</h3><p>Since I’m only interested in memory disclosure vulnerabilities, i.e. cases when contents of uninitialized memory show up in the program output, it should be sufficient to run the same operation twice and compare the results. If a program has decompressed the same zip file twice and got different results, that usually means that contents of uninitialized memory have shown up in the output.</p><p>With that in mind, I’ve written a simple test program that reads from uninitialized memory and tried to detect it using the “run twice, compare results” technique. 
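The essence of that technique fits in a few lines. Here is a self-contained sketch (mine, not the code from the post); decode_stub is a hypothetical stand-in whose output depends on hidden leftover state, the way a buggy decoder’s output depends on whatever happened to be in uninitialized memory:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical stand-in for a buggy decoder: its output depends on
// hidden leftover state, much like a real decoder's output can depend
// on the contents of uninitialized memory.
static LEFTOVER: AtomicU8 = AtomicU8::new(0);

fn decode_stub(input: &[u8]) -> Vec<u8> {
    let leak = LEFTOVER.fetch_add(1, Ordering::SeqCst);
    input.iter().map(|b| b.wrapping_add(leak)).collect()
}

// The "run twice, compare results" check: any dependence on prior
// memory contents shows up as a mismatch between the two runs.
fn is_deterministic(input: &[u8]) -> bool {
    decode_stub(input) == decode_stub(input)
}

fn main() {
    // The stub leaks state, so the check flags it.
    assert!(!is_deterministic(b"some input"));
    println!("mismatch detected: output depends on hidden state");
}
```

In the real setup the output difference is produced by libdiffuzz, which fills each fresh allocation with a different marker value, so two decodes of the same input only differ if the decoder actually read memory it never wrote.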
I wanted to be able to check if results differ between runs at a glance without comparing huge amounts of data by hand, so this is what I ended up with:</p><p><a href="https://medium.com/media/189a43a0a33ebe25a4eb80f068ba1718/href">https://medium.com/media/189a43a0a33ebe25a4eb80f068ba1718/href</a></p><p>This program will panic if the use of uninitialized memory is detected. Our goal here is to get it to panic reliably — we know it’s buggy, we just need to be able to detect the bug automatically.</p><p>Turns out it’s not that easy, because what uninitialized memory actually contains varies depending on the memory allocator in use. And no matter what memory allocator I tried, I couldn’t get it to crash. When built with Rust’s default <a href="http://jemalloc.net/">jemalloc</a>, sum_uninitialized() would always return 0. When built with the system allocator (as in the code above), the return value would differ between different runs of the process, but not between different invocations of the function within the same process. I have even tried <a href="https://lcamtuf.blogspot.com/2014/08/a-bit-more-about-american-fuzzy-lop.html">AFL</a>’s <a href="https://github.com/mirrorer/afl/tree/master/libdislocator">libdislocator</a>, which is basically a poor man’s address sanitizer implemented as a memory allocator (which makes it usable on black-box binaries), and even that didn’t work: my sum_uninitialized() always produced a stable result.</p><p>At this point I (mentally) screamed “How hard can it be?!”, opened the source code of libdislocator and <a href="https://github.com/Shnatsel/libdiffuzz/commit/c3d69d941fd17fb44a3b42928b51bd763260e8ae">trivially patched</a> it to fill every allocated buffer with a value that’s incremented on every allocation instead of a constant value. And it worked! 
This test program started crashing!</p><h3>From the lab to the real world</h3><p>Armed with my newly-minted abomination I went looking for a prospective real-world target to use it on. I picked <a href="https://github.com/ruuda/claxon">claxon</a>, a FLAC decoder written in Rust, for a few reasons:</p><ol><li>Code that does nontrivial binary parsing is the poster child for security vulnerabilities in memory management</li><li>It contains 8 unsafe blocks per ~2000 lines of code, which is entirely too many for my liking and cannot possibly be not exploitable in a library that does complicated binary parsing (seriously, don’t do unsafe)</li><li>The author claimed that the library has been extensively fuzzed</li><li>I have already fuzzed it myself for about 1 billion executions total, so I already had a bunch of automatically generated files that exercise many different execution paths — a great starting point for further fuzzing</li><li>Nobody has looked for this particular class of vulnerabilities in Claxon before — the only thing that could detect it, Memory Sanitizer, would not have worked with Claxon because it uses the Rust standard library</li><li>This code has defied me before (see point 4) and I took that as a challenge</li></ol><p>So I threw together a <a href="https://github.com/Shnatsel/claxon-differential-fuzzing/blob/master/src/main.rs">fuzz target</a> that decoded the same file twice and checked that the result is the same (if you’re craving fancy words, call this “differential fuzzing”), plugged it into <a href="https://github.com/rust-fuzz/afl.rs">AFL</a> and left it overnight. 
And lo and behold, I woke up to 3 automatically discovered crashes!</p><p>And just as expected, the crashes were indeed happening on the assert!() that was comparing results from two subsequent runs and failing, and this only happened under libdiffuzz; the bugs went completely unnoticed otherwise.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F0VkrUG3OrPc%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0VkrUG3OrPc&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F0VkrUG3OrPc%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/21907857864530273414c8490d715970/href">https://medium.com/media/21907857864530273414c8490d715970/href</a></iframe><p>I have <a href="https://github.com/ruuda/claxon/issues/10">reported</a> the vulnerability to the crate maintainer, who promptly investigated and fixed it, then audited the rest of the code for similar bugs and added a fuzzing target similar to mine as a CI job. Swift handling of security vulnerabilities by maintainers is always great to see, and Claxon’s maintainer went above and beyond the call of duty.</p><p><em>Side note: it later turned out that I forgot to disable checksum verification in Claxon, so most inputs generated by the fuzzer were rejected early because of checksum mismatch (random data doesn’t have valid CRC16 in it, duh). But thanks to the sheer amount of inputs AFL threw at Claxon, it generated some files with valid CRC16 anyway, by pure luck. To give you some context: AFL tests roughly 1 billion inputs per day on my mid-range CPU.</em></p><p><em>I’ve </em><a href="https://github.com/ruuda/claxon/pull/11"><em>opened a PR</em></a><em> to automatically disable checksum verification in Claxon during fuzzing so we wouldn’t have to deal with it anymore. 
With checksums disabled it only takes a few minutes to discover the bug using libdiffuzz.</em></p><p>I have also tried fuzzing with AFL + libdiffuzz on <a href="https://github.com/kornelski/lodepng-rust">lodepng-rust</a> and <a href="https://github.com/Frommi/miniz_oxide/">miniz-oxide</a>, but got nothing. lodepng-rust was created as a largely automated translation of a C codebase where these issues had already been discovered with AFL, and miniz-oxide actually comes with a “run twice, compare results” fuzz harness that compares Rust and C implementations. For those projects it was mostly about not triggering false alarms.</p><p>However, <em>the entire rest of the Rust ecosystem</em> has probably never been fuzzed with anything that could detect use of uninitialized memory. So if you want to claim some zero-day vulnerability discoveries to your name, just pick a crate that has unsafe blocks in it, ideally with something like mem::uninitialized() or vec.set_len(), and give it a spin in a “run twice, compare results” fuzzing harness with libdiffuzz. There should be plenty of low-hanging fruit because nobody’s tried picking any of it yet.</p><p>I have published a cleaned-up version of my tool on GitHub; check it out if you want to learn more or give it a spin: <a href="https://github.com/Shnatsel/libdiffuzz">https://github.com/Shnatsel/libdiffuzz</a></p><p>It comes with a quickstart guide for Rust and a <a href="https://github.com/Shnatsel/lodepng-afl-fuzz-differential">sample test harness</a> that’s simpler than <a href="https://github.com/Shnatsel/claxon-differential-fuzzing">the one I’ve used for Claxon</a>. Also, a list of caveats. Lots of them.</p><h3><strong>Why didn’t Rust prevent this?</strong></h3><p>The short answer is “Because people have deliberately opted out of its safety guarantees.” But <em>why</em> did they opt out?</p><p>In Claxon it was for the sake of optimization. 
Here’s the commit that introduced unsafe code:</p><p><a href="https://github.com/ruuda/claxon/commit/cfeb761dd4487aba62eee68faeb2d85a1d175916">Do not fill buffer with useless zeros initially · ruuda/claxon@cfeb761</a></p><p>Note that before this commit the buffer is diligently initialized with zeroes using buffer.extend(repeat(0).take(new_len - len)); — quite a mouthful! Not only is that complicated, it’s also slow — it compiles into something like a loop that fills the allocated memory with zeroes.</p><p>Other than the obvious issue with it being kinda slow on normal inputs, it can get excruciatingly slow on deliberately malformed inputs, which can be used to mount a denial-of-service attack. If the implementation is perfectly efficient and uses full memory bandwidth (roughly 100Gb/s for DDR4, i.e. 12.5GB/s), filling the entire 64-bit address space would take about 1,500,000,000 seconds, or almost 50 years. Even with memory usage limits it’s still not pretty, because a single file can do this over and over and over again.</p><p>However, modern operating systems let you request <em>already zeroed</em> memory, which not only is roughly 4x faster in my tests, but is also asynchronous and lazy: even if you allocate a lot of such memory, zeroing memory will not block your program until you actually try to access the relevant parts of it.</p><p>Can you ask your OS to do that from Rust? Yes! std::vec::from_elem() will simply request zeroed memory from the OS if you pass 0 as the element to fill the vector with. This function is not public, but that’s what the vec! macro <a href="https://doc.rust-lang.org/src/alloc/macros.rs.html#48-56">desugars into</a>, so the fastest way to initialize a vector of size max_len is actually vec![0; max_len];. After switching Claxon from using uninitialized memory to this macro, there was <a href="https://github.com/ruuda/claxon/commit/fec24678c7086aa4b2528bd7542f31d978b81b90">no measurable performance difference</a>.</p><p>Sadly, none of this is documented. The vec! 
macro is used all over the place in <a href="https://doc.rust-lang.org/std/vec/struct.Vec.html">Vec documentation</a>, but it does not mention that this is the fastest way by far to safely initialize a vector, or discuss efficient initialization at all.</p><blockquote>Documenting the fastest way to safely initialize a vector <strong>would have prevented this vulnerability.</strong></blockquote><p>I have <a href="https://github.com/rust-lang/rust/issues/54628">opened an issue</a> against Rust to document this more clearly.</p><h3>But wait, it gets weirder</h3><p>I have also investigated the vulnerability in inflate, <a href="https://medium.com/@shnatsel/auditing-popular-rust-crates-how-a-one-line-unsafe-has-nearly-ruined-everything-fab2d837ebb1">discussed here</a>.</p><p><em>Side note:</em> <em>inflate was not actually exploitable, since the code calling the vulnerable function was structured in such a way that it never passed it the specific values required to exploit it. Still, the vulnerable function is an example of a security bug in real-world code.</em></p><p>Unsafe code was used in inflate because there was no way to accomplish what they needed safely <em>and</em> efficiently. I have written <a href="https://internals.rust-lang.org/t/pre-rfc-fixed-capacity-view-of-vec/8413">a detailed analysis of it</a> on the Rust internals forum, which I will not duplicate here.</p><p>I have also included a proposal for a safe abstraction that would prevent such issues in the future. The day after writing the proposal I started contemplating how I would go about implementing it, and then found that somebody had <em>already</em> written and posted a working prototype. Overnight. I didn’t even have to do anything. <em>God, I love the Rust community.</em></p><p>In that thread Scott McMurray brought up a similar function in the standard library, which could be used to solve the problem if it were generalized a bit. 
Then he took a closer look at it and realized that <strong>the standard library function was vulnerable too:</strong></p><p><a href="https://github.com/rust-lang/rust/pull/54397">[stable] std: Check for overflow in `str::repeat` by alexcrichton · Pull Request #54397 · rust-lang/rust</a></p><p>This is the second-ever security vulnerability in the standard library. In case you missed it, I’ve written <a href="https://medium.com/@shnatsel/how-rusts-standard-library-was-vulnerable-for-years-and-nobody-noticed-aebf0503c3d6">an article detailing the first one</a>.</p><p>Just like the first stdlib vulnerability, this one was introduced during refactoring. Unlike the first one, it does not require a sequence of specific function calls, and would have been easily discovered via fuzzing if anyone had actually fuzzed that particular function.</p><p>This led me to contemplate <a href="https://www.reddit.com/r/rust/comments/9hssab/security_announcement_for_strrepeat/e6ed7ca/">automatically generating fuzzing harnesses</a> for the standard library functions, but I haven’t gotten around to actually prototyping that yet.</p><h3>Conclusions</h3><p>First things first: if you haven’t fuzzed your code yet, you should. Doesn’t have to be with libdiffuzz either — most bugs and almost all really severe vulnerabilities can be discovered without it. In Rust it’s <a href="https://fuzz.rs/book/introduction.html">stupidly easy</a> and won’t take you more than 15 minutes to set up.</p><p>As the fuzzing trophy cases filled with bugs from real-world projects in both <a href="https://github.com/rust-fuzz/trophy-case">Rust</a> and <a href="http://lcamtuf.coredump.cx/afl/">everything else</a> can attest, you don’t know your code until you’ve fuzzed it.</p><p>My pet <a href="https://github.com/Shnatsel/libdiffuzz">libdiffuzz</a> might also be of use. 
Feel free to borrow it and subject your unsafe code to its unrelenting jaws.</p><p>However, <strong>fuzzing won’t find all of the bugs.</strong> Do not rely on it as proof that your 2-line unsafe block is actually secure! And even if it is secure now, someone will refactor it later and it will become exploitable - just as happened in the standard library.</p><p>So if you can help it, <strong>try to refactor your unsafe code into safe code</strong>. And if you can’t, post on the rust-internals forum and describe what’s slow or what kind of safe abstractions you’re missing. For example, the <a href="https://github.com/RustAudio/lewton">lewton</a> crate is 100% safe code because it has upstreamed <a href="https://doc.rust-lang.org/std/primitive.f64.html#method.to_bits">its only unsafe function</a> into the standard library, where it got a lot more eyeballs. And it’s beneficial for others too: I have recently used this very function at work without having to worry about auditing a transmute by myself.</p><p>Also, there is a <a href="https://github.com/blt/bughunt-rust">project to verify the implementations of data structures in the Rust standard library</a>, and it could use all the help it can get. And if you’re interested in auto-generating fuzzing harnesses for stateless stdlib functions, let me know. I can handle generating fuzz harnesses, but I could use some help with listing stdlib functions and parsing parameter types.</p><p><a href="https://www.reddit.com/r/rust/comments/9jjovb/how_ive_found_vulnerability_in_a_popular_rust/"><em>Discuss this article on Reddit</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3db081a67fb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Rust’s standard library was vulnerable for years and nobody noticed]]></title>
            <link>https://shnatsel.medium.com/how-rusts-standard-library-was-vulnerable-for-years-and-nobody-noticed-aebf0503c3d6?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/aebf0503c3d6</guid>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[security]]></category>
            <category><![CDATA[vulnerability]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[rust]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Sat, 18 Aug 2018 02:47:33 GMT</pubDate>
            <atom:updated>2018-10-04T19:48:02.747Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
<content:encoded><![CDATA[<p><a href="https://www.rust-lang.org/">Rust</a> is a new systems programming language that prides itself on memory safety and speed. The gist of it is that if you write code in Rust, it goes as fast as C or C++, but you will not get mysterious intermittent crashes in production or horrific security vulnerabilities, unlike in the latter two.</p><p>That is, until you explicitly opt in to that kind of thing. Uh oh.</p><h3>Wait, what?</h3><p>You see, Rust provides safe abstractions that let you do useful stuff without having to deal with the complexities of memory layouts and other low-level arcana. But dealing with those things is necessary to run code on modern hardware, so something <em>has</em> to deal with it. In memory-safe languages like Python or Go this is usually handled by the language runtime — and Rust is no exception.</p><p>In Rust, the nitty-gritty of hazardous memory accesses is handled by the standard library. It implements the basic building blocks such as vectors that expose a safe interface to the outside, but perform potentially unsafe operations internally. To do that, they explicitly opt in to potentially unsafe operations (read: barely reproducible crashes, security vulnerabilities) by annotating a block with unsafe, like this: unsafe { Dragons::hatch(); }</p><p>However, Rust is different from languages like Python or Go in that it lets you use unsafe outside the standard library. On one hand, this means that you can write a library in Rust and call into it from other languages, e.g. Python. Language bindings are unsafe by design, so the ability to write such code in Rust is a major advantage over other memory-safe languages such as Go. On the other hand, this opens the floodgates for injudicious use of unsafe. In fact, a couple of months ago a promising library <a href="https://www.reddit.com/r/rust/comments/8s7gei/">caught some flak for engaging in precisely this sort of thing</a>. 
So when I was trying to gauge whether Rust actually delivers on its promise of memory safety, that’s where I started.</p><p>I’ve messed with popular Rust libraries over the course of a month and then described my findings in <a href="https://medium.com/@shnatsel/auditing-popular-rust-crates-how-a-one-line-unsafe-has-nearly-ruined-everything-fab2d837ebb1">Auditing popular Rust crates: how a one-line unsafe has nearly ruined everything</a>. The TL;DR version of it is that Rust crates do sometimes use unsafe when it’s not absolutely necessary, and bugs that lead to denial of service are abundant, but after poking six different crates I have failed to get an actual exploit.</p><p>Clearly, I had to kick it up a notch.</p><h3>Ultimate Power, Bad Guys Only</h3><p>There is a highly effective technique for discovering vulnerabilities that I haven’t applied to Rust yet. It beats everything else by a long shot, and can be used only by the bad guys who want to break stuff, not the good guys who fix it. It’s… <em>searching the bug tracker.</em></p><p>You see, most people writing code in C or C++ are not actually security-minded. They just want their code to work and go fast. When they encounter a bug that makes the program output garbage or crash, they simply fix it and go investigate the next bug. What else is there to do?</p><p>Well, turns out in C and C++ many of those bugs are caused by mistakes in memory management. It’s those kinds of bugs that present remote code execution vulnerabilities, and that safe Rust is designed to prevent. The proper way to handle them is to file them into a database called <a href="https://en.wikipedia.org/wiki/Common_Vulnerabilities_and_Exposures">Common Vulnerabilities and Exposures</a> (CVE for short) so that people who care about security are alerted to them and ship fixes to users. 
In practice such bugs are silently fixed in the next release at best, or remain open for years at worst, until either someone discovers them independently or the bug is caught powering some kind of malware in the wild.</p><blockquote>This leaves a lot of security vulnerabilities in plain sight on the public bug tracker, neatly documented, just waiting for someone to come along and weaponize them.</blockquote><p>I particularly like an <a href="http://seclists.org/fulldisclosure/2013/Nov/83">example of such a bug in libjpeg</a> that was discovered in 2003 but not recognized as a security issue. The patch to fix it ended up in limbo until 2013, at which point it was incorporated into an update so obscure that nobody received it anyway. The fix did not even get a changelog entry. It was independently discovered later in 2013 by Michal Zalewski, author of <a href="http://lcamtuf.coredump.cx/afl/">afl-fuzz</a>, and 10 years after the vulnerability was discovered the fix at last shipped.</p><p>That is, for 10 years anyone who bothered to just scroll through the bug tracker could steal cookies and passwords out of your web browser simply by loading an image and a bit of JavaScript.</p><p>Touché.</p><p>The worst part is, bugs that are already fixed are not eligible for bug bounties. So the Bugtracker Search technique will not get you bug bounty money; it will, however, get you real exploits for production systems. This is why it’s unrivaled if you want to break stuff, and useless if you want to fix it and not go broke in the process.</p><p>Also, getting maintainers to take your “this is a security vulnerability” comments seriously can be problematic, and actually exploiting the bug to prove it can be a lot of work, which further discourages pro bono applications of this technique.</p><h3>Into the woods</h3><p>Actually applying the Bugtracker Search™ to Rust code was even easier than I expected. 
Turns out GitHub lets you search all the projects written in a certain language, so I just typed “unsound” into the search query, selected “Rust” as the language and <a href="https://github.com/issues?utf8=%E2%9C%93&amp;q=is%3Aopen+is%3Aissue+unsound+archived%3Afalse+language%3Arust">off we go</a>! Bugs, bugs everywhere!</p><p>I did not have much time to spare at the moment, so typing “crash” instead of “unsound” in the search box is left as an exercise for the reader. Also, I’ve only searched for open bugs in recently updated projects and ignored the standard library (those guys <em>gotta</em> be responsible, right?).</p><p>This got me <a href="https://github.com/servo/rust-smallvec/issues/96">my first Rust zero-day exploit</a>! It was discovered two months before I found it through GitHub search and <a href="http://troubles.md/posts/improving-smallvec/">comes with its own blogpost,</a> albeit focusing on performance. After I pointed out that it is a security vulnerability, the crate maintainer fixed it within two hours, and then backported the fix to every affected series even though the crate is still in 0.x.x versions. Kudos!</p><p>Still, actually exploiting this bug in practice is tricky. It would be a good candidate for exploit chaining, but it’s hard to use by itself.</p><p>Okay, that was not ultimate enough. 
Time to kick it up <em>another</em> notch.</p><h3>In the belly of the beast</h3><p>At this point we’re looking for something that is straightforward to exploit (something like a buffer overflow with data an attacker can control) and has not been recognized as a security vulnerability yet.</p><p>It doesn’t matter if the bug is fixed in the latest version of the code: people lack incentives to update to the latest version as long as whatever they’re using works for them, and have a very clear incentive to not upgrade because whatever they’re using is known to work well, while the latest update is not.</p><p>So even if there is an update that fixed the issue, a lot of people will not actually install it, because there is no reason to — <em>unless</em> it is marked as a security update.</p><p>I was contemplating my course of action when I accidentally stumbled upon a <a href="https://www.reddit.com/r/rust/comments/909gsd">reddit thread</a> discussing the history of vulnerabilities in the Rust standard library, which pointed out this gem:</p><blockquote><strong>seg fault pushing on either side of a VecDeque</strong><br><a href="https://github.com/rust-lang/rust/issues/44800">https://github.com/rust-lang/rust/issues/44800</a></blockquote><p>This is a buffer overflow bug in the standard library’s implementation of a <a href="https://en.wikipedia.org/wiki/Double-ended_queue">double-ended queue</a>. The data written out of bounds is controlled by the attacker. This makes it a good candidate for a remote code execution exploit.</p><p>The bug affects Rust versions 1.3 to 1.21 inclusive. It causes a crash that is relatively easy to observe, yet it went unnoticed for two years. It did not even get a changelog entry in the <a href="https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1220-2017-11-22">release that fixed it</a>. 
No CVE was filed about this vulnerability.</p><p>As a result, Debian Stable <a href="https://packages.debian.org/stretch/rustc">still ships vulnerable Rust versions</a> for some architectures. I expect many enterprise users to have vulnerable versions as well.</p><p>As usual, bad guys win.</p><h3>Whooops!</h3><p>I did not expect to find something like this in the standard library because Rust has a <a href="https://www.rust-lang.org/en-US/security.html">very well thought out and responsible security policy</a> (other projects, take note!), and the Rust security team consists of people who regularly work on the compiler and standard library. The fix <em>should not</em> have gone unnoticed.</p><p>I have contacted the Rust security team about the issue, asking them to make an announcement and file a CVE. The reply was:</p><blockquote>Hey Sergey,</blockquote><blockquote>This was fixed way back in September; we don’t support old Rusts. As such, it’s not really eligible for this kind of thing at this point, as far as our current policies goes. 
I’ll bring it up at a future core team meeting, just in case.</blockquote><p>And then, shortly:</p><blockquote>&lt;snip&gt;</blockquote><blockquote>We talked about this Wednesday evening.</blockquote><blockquote>- We do want to change our policy here<br> — The current policy is that we only support the latest Rust<br> — The general feeling is “if it’s important enough for a point release, it’s important enough for a CVE”<br> — This specific patch does seem like it should have gotten more attention at the time<br> — This stuff also obviously ties into LTS stuff as well<br>- We don’t have the time or inclination to work on updating this policy until after the [2018] edition ships<br> — We’d rather take the time to get it right, but don’t have the time right now</blockquote><p>Okay, I have to admit that this sounds reasonable.</p><p>They have subsequently reaffirmed that they have no intention to file a CVE for this issue, so I went ahead and applied for one myself via <a href="http://iwantacve.org/">http://iwantacve.org/</a>. This is supposed to involve a confirmation by email, and I have yet to hear back. I have no clue how long this will take.</p><p><strong>Update:</strong> this issue has been assigned <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=%20CVE-2018-1000657">CVE-2018-1000657</a>.</p><h3>Bugs, bugs everywhere!</h3><p>This exposes a bigger issue with the standard library: insufficient verification. If this bug — which is relatively easy to observe! — has gone unnoticed for two years, surely something like it is still lurking in the depths of the standard library?</p><p>This problem is not unique to Rust. For example, Erlang — that funky language that people use to program systems with 99.9999999% uptime (<a href="https://web.archive.org/web/20120316112054/https://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf">no, that’s not an exaggeration</a>) — has repeatedly shipped with a broken implementation of the Map data structure in its standard library. 
There is a fascinating <a href="https://medium.com/@jlouis666/breaking-erlang-maps-1-31952b8729e6">series</a> <a href="https://medium.com/@jlouis666/breaking-erlang-maps-2-362730a91400">of</a> <a href="https://medium.com/@jlouis666/breaking-erlang-maps-3-f008b5f714c5">four</a> <a href="https://medium.com/@jlouis666/breaking-erlang-maps-4-4ebc3c64068c">articles</a> detailing a systemic approach used to discover those issues.</p><p>To actually deliver on the safety guarantees, the Rust standard library needs dramatically better testing and verification procedures. Some of its primitives were mathematically proven to be correct as part of the <a href="http://plv.mpi-sws.org/rustbelt/">RustBelt</a> project, but that did not extend to implementations of data structures.</p><p>One way to do that would be to use the same approach that was used for verifying the Map structure in Erlang — building a model of the behavior of the structure in question and automatically generating tests based on it, then verifying that the outputs of the model and the implementation match for certain automatically generated inputs. Rust already has the tooling for that in the form of <a href="https://github.com/BurntSushi/quickcheck">QuickCheck</a> and <a href="https://github.com/altsysrq/proptest">proptest</a>.</p><p>Another way to verify the implementations is to use a symbolic execution framework such as <a href="https://klee.github.io/">KLEE</a> or <a href="https://saw.galois.com/">SAW</a>. They work by analyzing the code and figuring out all possible program states for all possible execution paths. This lets you either generate inputs that trigger faulty behavior or make sure that certain behavior is impossible. Sadly, neither of those tools supports recent versions of Rust.</p><p>Alas, both of those approaches are time-consuming and would require coordinated effort. 
It’s not something one can do for the entire standard library over a couple of weekends — otherwise I’d be opening a pull request for the Rust standard library by now instead of writing this article.</p><p>Oh, and before you bring out the pitchforks and denounce Rust for all eternity: for reference, <a href="https://www.cvedetails.com/product/18230/Python-Python.html?vendor_id=10210">the Python runtime gets about 5 remote code execution vulnerabilities per year</a>. And that’s just the already discovered ones that got a CVE! How many were silently fixed or still lurk in the depths of the Python runtime? Only the bad guys know.</p><h3>Everything is broken</h3><p>I once reported a buffer overflow in a popular C library that is used in one of the major web browsers. It was the textbook example of a security vulnerability, and could be triggered simply by opening a webpage. I was told that the bug was silently fixed in a subsequent release that nobody had upgraded to yet. When I asked the maintainers to file a CVE, they said that if they filed one for every such bug they fixed they’d never get any actual work done.</p><p>Oh, and the worst thing? The vulnerability I’ve reported in that library was found by a fully automated tool in less than a day. All I did to discover the vulnerability was basically point and click. Imagine how many more exploitable bugs a dedicated security expert could discover!</p><blockquote>This was when I actually understood and internalized that <a href="https://medium.com/message/everything-is-broken-81e5f33a24e1">everything is broken</a>.</blockquote><p>The horrifying thing for me is that I still use that web browser. It’s not like I have any alternatives — every practical web browser relies on a huge mess of C code. And it is evident that humans are unable to write secure C code, unless they <a href="http://www.aosabook.org/en/gpsd.html">swear off dynamic memory allocation altogether</a>.</p><p>This is why I’m so hopeful about Rust. 
It is the only language in existence that could really, truly, completely and utterly supplant C and C++ <em>while providing memory safety.</em> There is a <a href="https://dl.acm.org/citation.cfm?doid=3177123.3158154">mathematical proof of correctness</a> for a practical subset of safe Rust and even some inherently unsafe standard library primitives, and ongoing work on expanding it to cover even more of the language.</p><p>So we know that safe Rust actually works. The really hard theoretical problems are solved. But the inherently unsafe parts of the implementation, such as the language runtime, could use more attention.</p><p><strong>Update: </strong>Brian Troutwine has kicked off a project to validate Rust standard library primitives using QuickCheck! Check out <a href="https://github.com/blt/bughunt-rust">bughunt-rust on GitHub</a>, and join the hunt!</p><p><a href="https://www.reddit.com/r/rust/comments/988euh/"><em>Discuss this article on Reddit</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=aebf0503c3d6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Auditing popular Rust crates: how a one-line unsafe has nearly ruined everything]]></title>
            <link>https://shnatsel.medium.com/auditing-popular-rust-crates-how-a-one-line-unsafe-has-nearly-ruined-everything-fab2d837ebb1?source=rss-fb9e4cd5e1ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/fab2d837ebb1</guid>
            <category><![CDATA[rust]]></category>
            <category><![CDATA[vulnerability]]></category>
            <category><![CDATA[security]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[cybersecurity]]></category>
            <dc:creator><![CDATA[Sergey "Shnatsel" Davidoff]]></dc:creator>
            <pubDate>Thu, 19 Jul 2018 01:38:11 GMT</pubDate>
            <atom:updated>2018-08-24T22:04:40.364Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
<content:encoded><![CDATA[<p>Following the <a href="https://www.reddit.com/r/rust/comments/8s7gei/">actix-web incident</a> (which is <a href="https://www.reddit.com/r/rust/comments/8wlkbe/actixweb_has_removed_all_unsound_use_of_unsafe_in/">fixed now</a>, at least mostly) I decided to poke other popular Rust libraries and see what comes of it.</p><p>The good news is I’ve poked at 6 popular crates now, and I’ve found not a single actually exploitable vulnerability. I am impressed. When I poked popular C libraries a few years ago it quickly ended in tears. The bad news is I’ve found one instance that was not a security vulnerability by sheer luck, plus a whole slew of denial-of-service bugs. And I can’t fix all of them by myself. Read on to find out how I did it, and how you can help!</p><p>My workflow was roughly like this:</p><ol><li>See if the crate has been <a href="https://en.wikipedia.org/wiki/Fuzz_testing">fuzzed</a> yet to identify low-hanging fruit.</li><li>If it has been fuzzed, check the sanity of its fuzzing harness.</li><li>If something is amiss, fuzz the crate.</li><li>In case fuzzing turns up no bugs, eyeball the unsafes and try to check them for memory errors.</li><li>If no horrific memory errors turn up, try to replace whatever’s under unsafe with safe code without sacrificing performance.</li></ol><p>Turns out the Rust community is awesome and not only has excellent integration for all <a href="https://github.com/rust-fuzz/afl.rs">three</a> <a href="https://github.com/rust-fuzz/cargo-fuzz">practical</a> <a href="https://github.com/rust-fuzz/honggfuzz-rs">fuzzers</a> along with a quick start guide for each, but also <a href="https://github.com/rust-fuzz/targets">a huge collection of fuzz targets</a> that covers a great deal of popular crates. Ack! Getting low-hanging fruit at step 1 is foiled!</p><p>So I’ve started checking whether fuzzing targets were written properly. 
Specifically, I’ve started looking for stuff that could block fuzzing — like checksums. A lot of formats have them internally, and PNG has not one but two — crc32 in png format and adler32 in deflate. And lo and behold, none of the crates were actually disabling checksums when fuzzing! This means that random input from the fuzzer was rejected early (random data does not have a valid checksum in it, duh) and never actually reached the interesting decoding bits. So I’ve opened PRs for disabling checksums during fuzzing in <a href="https://github.com/Frommi/miniz_oxide/issues/29">miniz_oxide</a>, <a href="https://github.com/PistonDevelopers/image-png/pull/81">png</a>, <a href="https://github.com/kornelski/lodepng-rust/pull/29">lodepng-rust</a>, and <a href="https://github.com/RustAudio/ogg/pull/6">ogg</a>, and then fuzzed them with checksums disabled. This got me:</p><ul><li><a href="https://github.com/PistonDevelopers/image-png/issues/79">4 distinct panics</a> and a <a href="https://github.com/PistonDevelopers/image-png/issues/80">memory exhaustion</a> in png</li><li><a href="https://github.com/kornelski/lodepng-rust/issues/28">Memory leak</a> in lodepng-rust</li><li>A <a href="https://github.com/RustAudio/lewton/issues/27">panic</a> in lewton, the Vorbis decoder in Rust. There are probably more panics hiding behind this one.</li></ul><p>The inflate crate was the first where fuzzing turned up nothing at all, so I&#39;ve started eyeballing its unsafes and trying to rewrite them into safe code. I&#39;ve added a <a href="https://github.com/Shnatsel/inflate/blob/benchmarking/benches/inflate_samples.rs">benchmarking harness</a> and started measuring whether reverting back to safe code hurts performance. cargo bench was too noisy, but I&#39;ve quickly discovered <a href="https://japaric.github.io/criterion.rs/book/index.html">criterion</a> which got me the precision I needed (did I mention Rust tooling is awesome?). 
I got lucky - there were two unsafes with a two-line safe equivalent commented out, and reverting back to safe code created no measurable performance difference. Apparently the compiler got smarter since that code was written, so I&#39;ve just reverted back to safe code.</p><p>This left <a href="https://github.com/PistonDevelopers/inflate/blob/1a810dba8467fc9cf1e8c60155d3266790072220/src/lib.rs#L663">just one unsafe</a> with a single line in it. <strong>Spot the security vulnerability.</strong> I would have missed it if the crate maintainer hadn’t pointed it out. If you can’t, there are hints at the end of this post.</p><p>By sheer luck the rest of the crate just so happens to be structured in a way that never passes input parameters that trigger the vulnerability, so it is not really exploitable. Probably. I could not find a way to exploit it, and the crate maintainer assures me it’s fine. Perhaps we just haven’t figured out how to do it yet. After all, <a href="https://googleprojectzero.blogspot.com/2014/08/the-poisoned-nul-byte-2014-edition.html">almost everything is exploitable if you try hard enough</a>.</p><p>Sadly, simply replacing the unsafe .set_len() with .resize() regressed the decompression performance by 10%, so instead I&#39;ve added an extra check preventing this particular exploit from happening, and then liberally sprinkled the function with asserts that panic on every other way this unsafe could go wrong that I could think of.</p><p>Is the function secure now? Well, maybe. Maybe not. Unless we either rewrite it in safe Rust or prove its correctness (which is <em>a lot</em> harder), we will never know.</p><p>The thing is, I’m pretty sure it’s possible to rewrite this in safe Rust without a performance penalty. I’ve tried <a href="https://github.com/PistonDevelopers/inflate/pull/43#discussion_r197629445">some local optimizations</a> briefly, to no avail. 
Just like with high-level languages, writing fast safe Rust requires staying on the optimizer’s happy paths, and I have not found any documentation or tooling for doing that. There is <a href="https://godbolt.org/">https://godbolt.org/</a>, which lets you inspect the LLVM IR as well as assembler and shows what line of Rust turned into what line of assembly, but you can’t feed your entire project to it. (As pointed out in comments, <a href="https://github.com/gnzlbg/cargo-asm">cargo-asm</a> can do that to an entire project). And you also need tools to understand why a certain optimization was not applied by rustc. LLVM flags -Rpass-missed and -Rpass-analysis seem to be capable of doing that, but there is literally no documentation on them in conjunction with Rust.</p><p>Discussing the vulnerability further would be spoilerrific (seriously, try to locate it yourself), so I’ll leave further technical discussion until the end of the post. I want to say that I was very satisfied with how the crate maintainer reacted to the potential vulnerability — he seemed to take it seriously and investigated it promptly. Coming from the C ecosystem it is refreshing to be taken seriously when you point out those things.</p><p>By contrast, nobody seems to care about denial of service vulnerabilities. In the 3 crates I’ve reported such vulnerabilities for, after 3 weeks not a single one was investigated or fixed by maintainers of those crates, or anyone else really. And the DoS bugs are not limited to panics that you can just isolate into another thread and forget about.</p><p>After not getting any reaction from crate maintainers for a while I tried fixing those bugs myself, starting with the png crate. In stark contrast to C, it is surprisingly easy to jump into an existing Rust codebase and start hacking on it, even if it does rather involved things like PNG parsing. 
I’ve fixed all the panics that fuzzers discovered based on nothing but debug-mode backtraces, and I don’t even know Rust all that well. Also, this is why there are 4 distinct panics listed for the png crate: I’d fix one and keep fuzzing until I discovered the next one. lewton probably has many more panics in it; I just didn’t get beyond the first one. Sadly, three weeks later my PR is still not merged, reinforcing the theme of &quot;nobody cares about denial of service&quot;. And png still has a <a href="https://github.com/PistonDevelopers/image-png/issues/80">much nastier DoS bug</a> that cannot be isolated in a thread.</p><p>(To be clear, this is not meant to bash any particular person or team; there may be perfectly valid reasons why it is so. But this does seem to be the trend throughout the ecosystem, and I needed some examples to illustrate it.)</p><p>Also, shoutout to <a href="https://github.com/snapview/tungstenite-rs">tungstenite</a> — it was the only crate that did not exhibit <em>any</em> kind of bug when being fuzzed for the first time. Kudos.</p><p>Conclusions:</p><ul><li>Unlike C libraries, Rust crates do not dispense security vulnerabilities when you poke them with a fuzzer for the first time (or sometimes even <a href="https://blog.fuzzing-project.org/45-ImageMagick-heap-overflow-and-out-of-bounds-read.html">the third time</a>). Humans make all the same mistakes, but Rust prevents them from turning into exploits. Mostly.</li><li>Rust tooling is diverse, high-quality and accessible.
<a href="https://github.com/rust-fuzz/afl.rs">afl.rs</a>, <a href="https://github.com/rust-fuzz/cargo-fuzz">cargo-fuzz</a>, <a href="https://github.com/rust-fuzz/honggfuzz-rs">honggfuzz-rs</a>, <a href="https://github.com/japaric/rust-san">sanitizers</a>, <a href="https://github.com/japaric/criterion.rs">criterion</a>, <a href="https://github.com/AltSysrq/proptest">proptest</a> and <a href="https://github.com/rust-lang-nursery/rust-clippy">clippy</a> not only exist, but also come with quickstart guides that make deploying any of them take less than 15 minutes.</li><li>Cargo and docs.rs, combined with Rust language features that allow expressively encoding application logic, make an existing complex codebase surprisingly easy to understand and hack on, making drive-by contributions a breeze. And I don’t even know Rust all that well.</li><li>Hardly anyone uses #![forbid(unsafe_code)]. Rust offers to rid you of paranoia and arbitrary code execution exploits, but people don’t seem to take up the offer. (Shoutout to the <a href="https://github.com/RustAudio/lewton">lewton</a> developers, who did.)</li><li>Safe Rust code can be as fast as code with unsafe (shoutout to <a href="https://github.com/serde-rs/json">serde-json</a>, the fastest JSON parser in the world, written in fully safe Rust), but squeezing out those last 20% requires you to adjust your code in arcane ways to hit the optimizer’s happy paths, kinda like with high-level languages. There is no documentation or tooling for doing such a thing, although the building blocks are there. Until such documentation and tooling are created, the only viable option is trial and error.</li><li>A lot of crates contain 2–3 unsafe blocks that can probably be refactored into safe code without losing performance. This is probably related to the lack of tooling. Rust isolates unsafe code, and that makes auditing easier, but in practice it is not actually audited.
We need a libs-blitz-like effort to get rid of such unsafes; I can’t process the entire ecosystem alone. (If you also put #![forbid(unsafe_code)] on the cleansed crate, I will love you forever.)</li><li>Fuzzing would not have discovered this vulnerability at all, unless you had a very specific fuzzing setup looking specifically for this kind of thing. Even then, the chances of ever hitting it were pretty darn low. Fuzzing is a very easy way to prove the presence of bugs, but it cannot prove their absence.</li><li>Symbolic execution tools like <a href="https://klee.github.io/">KLEE</a> or <a href="https://saw.galois.com/">SAW</a> that can be used to prove correctness do not have Rust integration, even though both operate on LLVM IR. KLEE <a href="https://github.com/jawline/klee-rust">used to have it</a>, but sadly the LLVM version used in KLEE is now grossly outdated. <strong>Update:</strong> <a href="https://github.com/smackers/smack">SMACK</a> has been <a href="http://soarlab.org/publications/atva2018-bhr.pdf">adapted</a> to work with Rust!</li><li>If you want to write DoS-critical code in Rust and use some existing libraries, you’re out of luck. Nobody cares about denial-of-service attacks. You can poke popular crates with a fuzzer and get lots of those. When you report them, they do not get fixed. <a href="https://github.com/Technolution/rustig">There is a linter to detect potential panics</a>, but if a linter for stuff like stack overflows or unbounded memory allocations exists, I am not aware of it.</li><li>Rust has no mechanism for propagating security updates through the ecosystem. I was surprised to find that Cargo does not alert you when you’re using an outdated library version with a security vulnerability, and <a href="https://crates.io">crates.io</a> does not reject uploads of new crates that depend on vulnerable library versions, and does not alert maintainers of existing crates that their dependencies are vulnerable.
A third-party tool to check for security vulnerabilities <a href="https://github.com/rustsec/cargo-audit">exists</a>, but you’ve never heard of it, and you have better things to do than run it on all of your crates every day anyway.</li></ul><p>Originally I thought this would be a fun exercise for a few weekends, but the scope of the work quickly grew way beyond what I can hope to achieve alone. This is where you come in, though! Here’s a list of things you can try, in addition to the hard tooling tasks listed above:</p><ol><li>Fuzz all the things! It takes 15 minutes to set up per crate; there is no reason not to. Also, there is a <a href="https://github.com/rust-fuzz/trophy-case">trophy case</a>.</li><li>Fix bugs already discovered. For example: <a href="https://github.com/RustAudio/lewton/issues/27">panic in lewton</a> (easy), <a href="https://github.com/PistonDevelopers/image-png/issues/80">unbounded memory consumption in png</a> (intermediate), <a href="https://github.com/kornelski/lodepng-rust/issues/28">lodepng memory leak</a> (C-hard). You can also fuzz lewton afterwards to get more panics; just don’t forget to use the ogg dependency from git. You can reuse my <a href="https://github.com/Shnatsel/lewton-afl">fuzz</a> <a href="https://github.com/Shnatsel/lewton-fuzz">harnesses</a> if you wish.</li><li>Refactor unsafes in popular crates into safe code, ideally without sacrificing performance. For example, the inflate crate has <a href="https://github.com/PistonDevelopers/inflate/blob/a36a517194b91b8279aa28f6ed84a0c3a87372e1/src/lib.rs#L656">just one</a> unsafe block remaining, and png has <a href="https://github.com/PistonDevelopers/image-png/issues/60">two</a>. There are many more crates like that out there.</li><li>There are easy tasks on docs and tooling too: the <a href="https://rust-fuzz.github.io/book/afl.html">AFL.rs documentation</a> is outdated and describes only version 0.3. Version 0.4 has added in-process fuzzing that’s ~10x faster; it needs to be mentioned.
Also, AFL could use more Rusty integration with Cargo, closer to what cargo-fuzz does. Also, disabling checksums is a common pitfall that needs to be mentioned.</li></ol><p>I’d love to keep fixing all the things, but at least in the coming month I will not be able to dedicate any time to the project. I hope I’ve managed to at least lead by example.</p><p>And now, details on that vulnerability! If you haven’t found it yourself, here’s a hint: <a href="http://seclists.org/fulldisclosure/2013/Nov/83">similar bugs in C libraries</a>.</p><p>If you still haven’t found it, see the <a href="https://github.com/PistonDevelopers/inflate/commit/11c73f2c545183b51a10c339267e16fc2dd23d60">fix</a>.</p><p><strong>Spoilerrific discussion of the vulnerability below.</strong></p><p><a href="https://github.com/PistonDevelopers/inflate/blob/1a810dba8467fc9cf1e8c60155d3266790072220/src/lib.rs#L625">Vulnerable code from git history for reference</a></p><p>The function run_len_dist() does a fairly trivial thing: it resizes a vector to fit a specified amount of data and copies data from element i to element i+dist until i+dist hits the end of the vector. For performance, the contents of the vector are not initialized to zeroes when resizing, as vec.resize() would have done; instead, vec.set_len() is used, creating a vector whose trailing elements are uninitialized memory.</p><p>The function never checks that dist is not zero. Indeed, if you call it with dist set to 0, it will simply read uninitialized memory and write it right back, exposing memory contents in the output.</p><p>If this vulnerability were actually exploitable from the external API (which it isn’t, probably), inflate would output the contents of uninitialized memory in the decompressed output. The inflate crate is used in the png crate to decompress PNGs. So if the png crate were used in a web browser (e.g.
servo) to decode images, an attacker could pass a crafted PNG to the client, then read the decoded image using JavaScript. This lets the attacker read memory contents from the browser: cookies, passwords, you name it. This is not quite as bad as Heartbleed or Meltdown, but it’s up there.</p><p>Sadly, regular fuzzing would not have discovered this vulnerability. If it were actually exploitable, at least one way to trigger it would involve setting several distinct bytes in the input to very specific values. And even the best current-generation fuzzers cannot trigger any behavior that requires changing more than one byte simultaneously, except <a href="https://lcamtuf.blogspot.com/2014/11/afl-fuzz-nobody-expects-cdata-sections.html">in rare cases</a> or if you explicitly tell them what consecutive byte strings to try. And there is nothing in the code that would guide the fuzzers to these specific values.</p><p>Even if fuzzers did discover such an input by random chance, they would not have recognized it as a vulnerability, unless you do either of these things:</p><ul><li>Fuzz your code under memory sanitizer (not to be confused with address sanitizer), which is impossible for any crate that links to C code and is compatible with only one fuzzer — AFL, and only in its slower stdin mode (possibly honggfuzz too in its slower binary-only instrumentation mode, but I haven’t checked).</li><li>Create a fuzz harness that decodes the same input twice and verifies that the output matches, <em>and</em> somehow ensure that the memory allocation is not reused. AFAIK Rust’s default jemalloc allocator can reuse allocated memory, so you’re probably again limited to AFL in stdin mode.</li></ul><p>This just goes to show that fuzzing unsafe code does not actually guarantee the absence of bugs.</p><p>Safe Rust, however, does guarantee the absence of memory errors that lead to arbitrary code execution exploits and other unspeakable horrors.
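</p><p>A rough sketch of that back-reference copy in fully safe Rust, with the missing dist check added up front. The signature and names are simplified for illustration; this is not the actual inflate code:</p>

```rust
// Append `len` bytes copied from `dist` bytes back in the buffer.
// In the safe version every element is written via push() before it
// can ever be read, so there is no uninitialized memory to leak.
fn run_len_dist(out: &mut Vec<u8>, len: usize, dist: usize) -> Result<(), &'static str> {
    // The check the vulnerable code was missing: with dist == 0 the
    // unsafe version read its own uninitialized tail and echoed it back.
    if dist == 0 || dist > out.len() {
        return Err("invalid back-reference distance");
    }
    let start = out.len() - dist;
    out.reserve(len);
    for i in 0..len {
        // Overlapping copies work because each source byte has
        // already been written by the time it is read (dist > 0).
        let byte = out[start + i];
        out.push(byte);
    }
    Ok(())
}
```

<p>The safe version trades a bounds check per byte for the guarantee that a zero distance is rejected as an error rather than silently exposing stale memory.</p><p>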
So let’s use it.</p><p><a href="https://www.reddit.com/r/rust/comments/8zpp5f/auditing_popular_crates_how_a_oneline_unsafe_has/"><em>Join the discussion of this article on Reddit</em></a></p>]]></content:encoded>
        </item>
    </channel>
</rss>