vorpal.se

3D printing with unconventional vase mode

2025-06-23T00:00:00+02:00

This article targets an advanced audience already familiar with 3D printing. In it I will try to collect some information I haven’t found written down in a single place yet. In particular, a lot of the information is seemingly only available in the form of YouTube videos that take a long time to get to the point.

This article assumes some familiarity with CAD and 3D printing. Basic concepts and terminology are not explained.

Basics of vase mode

With that out of the way, what is this about? Vase mode is a printing mode where the printer prints a spiral path, with no seams. This is fast and avoids the visual blemishes of the seam, but it also has some downsides:

Only a single perimeter. This potentially means weaker parts.
No disconnected areas (per layer), you have to print with a single path.
No internal geometry. No infill. No top layers.
No supports.

Typically, it gets used for vases and pots. Thus, the name. Here is a crude example (I’m not an aesthetics focused designer, so imagine something prettier than this. If it fits and functions, it ships in my book):

Of note here is that the model itself isn’t hollow, but the slicer will make it hollow for you (since it only prints a single perimeter). In PrusaSlicer this setting is found at “Print Settings” → “Layers and perimeters” → “Vertical shells” → “Spiral vase”. OrcaSlicer and similar slicers should have the same or a similar setting somewhere else. I have no idea about Cura.

But there are some ways to stretch this mode to the limits, and that is what this article is about. This will make vase mode useful for more than just simple vases. And that can often be the fastest and lightest way to print a part, if you can pull it off.

To understand the tricks you do need to understand how vase mode works though. It takes solid geometry, and takes the outline of it. What is inside doesn’t matter. It will be ignored:

As can be seen, while the hole exists in the bottom solid layers, the slicer ignores it above that point.

So what can we do above that?

Internal geometry via slits

The idea comes from the RC plane 3D printing community, where they want to print lightweight but strong parts. In particular wings with internal supporting geometry.²

There are two main tricks for unconventional vase mode prints. Let’s start with slits, as the next trick builds upon this first trick. As I’m no aircraft wing designer I will use other geometry for illustration purposes. The idea is useful in other contexts than RC wings, that is the whole point of this article.

Make a slit into the part. The left is for demonstration only; you need the slit to be really thin, 0.0001¹ mm or so, as shown on the right:

If we extrude this into a block and slice it, PrusaSlicer will see this slit and print an outer perimeter going into the part, making a sort of internal support. You are basically modelling the infill yourself now:

If you try this, it will not work for you. This is because you are missing a crucial setting in PrusaSlicer. By default, PrusaSlicer will merge together close parts of the model. You need to change “Print Settings” → “Advanced” → “Slicing” → “Slice gap closing radius”. Set it to 0.0.³ Otherwise, none of this will work.

For our example with a hole in the middle from the introduction we could get the following result:

Note that the slit will be visible and you can feel it with your fingers, but it will be a fairly smooth indentation, not a sharp edge.

Double walls

Now, let’s expand on this technique to make it even more useful: Have you ever wanted to use vase mode but with two perimeters? We can build upon the previous trick to make a double wall:

This is done by making a slit through into the hollow inside and making sure the part itself is exactly wide enough for two perimeters that touch. You can find the width you should use by going into PrusaSlicer (with the same settings that you plan to use to print with) and looking at the info text in “Print Settings” → “Layers and perimeters” → “Vertical shells”:

That is the value you want to use for this to work correctly.

We can build upon this to make our internal geometry touch the opposite wall, like so:

We can also use this to anchor a slit to the outside wall. This allows us to anchor internal geometry to the outside wall without poking through. In fact, to ensure we have a single continuous outline, all but one slit must be done like this. The following picture shows what you need to do (note that the double wall thickness is 0.87 mm in this example, it will change depending on other settings):

These two tricks presented so far form the basis of what I have seen called “unconventional vase mode”.⁴

But there are some more tricks related to vase mode that are worth knowing about.

Extrusion width

To make a vase mode stronger, you can increase the extrusion width. The general recommendation is that you can go to about 2x the nozzle diameter and keep good quality. This works, since the nozzle has a bit of a flat spot around the orifice.

However, British Youtuber “Lost in Tech” did some tests showing that you can go way further than that, but I haven’t tested this myself, and quality does eventually start going down. It might be worth looking into if this is useful to you.

In PrusaSlicer you can change this in “Print Settings” → “Advanced” → “Extrusion width”. For vase mode “External perimeters” is what matters (above the solid base layers, that is):

Remember to rescale any double walls to fit the new extrusion width. It might be a good idea to use a variable in your CAD model to make it easier to update (at least if you use parametric CAD like Onshape, FreeCAD or Fusion 360, which support variables).
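As a rough aid for such a variable, the double wall width can be estimated from the extrusion width and the layer height. The sketch below assumes the Slic3r-style rounded-rectangle extrusion model (spacing = width − layer height × (1 − π/4)), which is my understanding of how PrusaSlicer arrives at its thin wall hint. The function names are made up for illustration, and the info text in the slicer remains the value to trust.

```rust
use std::f64::consts::PI;

/// Centre-to-centre spacing of two adjacent extrusion lines.
/// ASSUMPTION: rounded-rectangle cross section, as in Slic3r-derived slicers.
fn extrusion_spacing(width: f64, layer_height: f64) -> f64 {
    width - layer_height * (1.0 - PI / 4.0)
}

/// Estimated wall thickness that fits exactly two touching perimeters.
/// ASSUMPTION: one full extrusion width plus one spacing.
fn double_wall_thickness(width: f64, layer_height: f64) -> f64 {
    width + extrusion_spacing(width, layer_height)
}
```

With a 0.45 mm extrusion width and 0.15 mm layers this gives roughly 0.87 mm. If the number disagrees with what the slicer reports for your profile, use the slicer's number.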

Fake vase mode

Finally, if you absolutely cannot print something in vase mode you can still get most of the benefits by what I have seen called “fake vase mode”⁵. To understand this, we should first consider exactly what settings vase mode changes. In PrusaSlicer vase mode changes the following settings:

Single perimeter (except for the first few bottom layers).
No top layers.
No infill (except for the first few bottom layers).
No supports.
Disables the setting “ensure vertical shell thickness”.
Prints in a continuous spiral path.

You can do all of those except the last one (the continuous spiral path) by hand in the slicer. And you can mix and match the first five as you see fit.

Let’s investigate this via a case study rather than simplified theoretical examples like we have done so far.

Case study: spheres on sticks

I needed some spheres on the end of wooden sticks, to hold up a bird net over my strawberries on my balcony. I didn’t want the net to lie on the plants directly, and I needed something on the end of the sticks so that the net wouldn’t tear. Thus, spheres (or rather: truncated spheres for print bed adhesion and overhang reasons) on sticks.

Here is the basic design in a section view:

This doesn’t quite work in vase mode, because the top of the sphere has very shallow overhangs. And the top needs to be smooth. (The “roof” of the internal hole is fine, thanks to the cone shape.) It is so close, we can almost use vase mode.

So first I designed this in CAD. We have a slit from the outside to the centre, as well as some slits from the centre that go almost to the outside. In fact, they go to the “recommended object thin wall thickness” mentioned before. (Note that the slits do not go down into the solid bottom layers, for some additional strength.)

This results in the following in PrusaSlicer:

Like true vase mode, I used zero infill. But I enabled “Ensure vertical shell thickness” and 1 top solid layer. This added a tiny bit of material just below the shallow top of the dome, making it printable, but still lighter than printing normally. Then I used a layer range modifier to disable “ensure vertical shell thickness” for the lower part of the print where it wasn’t needed, as PrusaSlicer wanted to add some material on the inside of the lower layers as well.

I also increased the extrusion width to 0.8 mm (with a 0.4 mm nozzle) to get additional strength, and I used scarf seams to make the outside seam almost invisible.

You can go further from true vase mode though: You could have an inner and outer perimeter like traditional non-vase slicing, but still model your own infill only where needed. You will get seams obviously, but you might still be able to print faster and save weight. We are moving further from true vase mode here, but only you can decide what exactly is best for your print:

In fact, when I printed some of these spheres, the version without a slit to the outside ended up the best looking⁶:

The slit is visible, but on the part printed without a slit extending to the outside there are no visible seams at all. The unevenness at the top is due to me filing away a small blob that the nozzle left behind as it pulled away at the end. It is smooth to the touch but reflects the light differently.

Conclusions

Vase mode and “fake vase mode” are often underused printing modes for functional parts, and they can be used to save weight and print time. The difference will be most noticeable on larger parts; on smaller parts, 10 vs 15 minutes might not be worth the extra design effort (unless you are planning to print many copies of the same part).

I’m a bit disappointed that the slit was as visible from the outside as it was. From the videos about RC aircraft wings that I saw I expected this to be less noticeable. But “fake vase mode” still comes to the rescue here, offering most of the benefits. And when combined with scarf joint seams (which I found truly impressive, first time I tried it), I don’t really see the need for true vase mode any more. You might as well get the best of both worlds.

I did not find any written resource online summarizing these techniques, so I hope this post is useful not just to remind myself in the future, but also to others looking for this information. With that in mind, below is a cheat sheet of the important points and settings to remember.

These techniques require tuning settings in your slicer. This may not be possible if you are printing at a commercial print farm, or targeting people slicing with a dumbed-down web-based slicer (as has recently been launched by both Printables and Makerworld). But it would be a shame if such dumbed-down slicers restricted what we could design and publish. I will always try to make the most of what both CAD and the slicer expose to me.⁷

Do you have some other tips or tricks for vase mode? Did I get something wrong? Comment on Reddit or on Lemmy and I will likely see it (eventually).

Cheat sheet

Want to quickly remind yourself of the core ideas of this article when you are designing your next part? Here is a quick cheat sheet:

Slits: Use slits to add internal geometry.
- 0.0001 mm wide (or 0.001 if your CAD software doesn’t like you that day).
- PrusaSlicer: Set “Print Settings” → “Advanced” → “Slicing” → “Slice gap closing radius” to 0.
Double walls: Use double walls for more strength and to connect slits to the opposite wall.
- PrusaSlicer: “Print Settings” → “Layers and perimeters” → “Vertical shells” (Look at info text to find width you need to use for your current print settings.)
Extrusion width: You can increase the extrusion width to 2x the nozzle diameter for additional strength with no quality downsides. You might be able to go even further, but eventually quality will start going down.
- PrusaSlicer: “Print Settings” → “Advanced” → “Extrusion width” → “External perimeters”
Fake vase mode: You don’t need to use vase mode to get most of the benefits. You can mix and match all parts of normal vase mode except for the continuous spiral path. But consider scarf joints to hide seams.

¹ You might need to experiment with the specific value to make your CAD program and slicer happy. With Onshape, sometimes 0.0001 mm works, sometimes only 0.001 mm works (or Onshape doesn’t see the slit), and I don’t know why exactly.
² The first mention I can find of this is in this video by Tom Stanton, but I cannot say for sure that this is where it originated.
³ I found this solution from this forum post.
⁴ Unconventional vase mode is briefly mentioned in this excellent but long blog post about designing for 3D printing. I strongly recommend reading it if you want to make your CAD designs portable between different printers, for general tips on how to design to avoid supports, and in general to make full use of the peculiarities of the manufacturing method.
⁵ I “semi-invented” this method myself, but then found out I wasn’t first. I was thinking “wouldn’t it be possible to…” and then I googled and found this video by “BV3D: Bryan Vines” that already discussed this idea, though it takes a while to get to the point.
⁶ Printed with AddNorth Economy PETG, white.
⁷ I even considered using Full Control XYZ for a minute to have true vase mode and then switch back to non-vase mode on top. In the end I came to my senses and decided not to write my model with Python code.

Filkoll - The fastest command-not-found handler

2025-03-25T00:00:00+01:00

I recently got annoyed at the command-not-found handler found in Arch Linux. So I wrote my own faster implementation. But first let’s back up for a second: what am I even talking about?

The problem

A command-not-found handler is a program that runs when you type a command in your terminal that is not found in your PATH. It will print some suggestions such as:

arch ❯ uv
uv may be found in the following packages:
  extra/uv 0.6.2-1        /usr/bin/uv

debian ❯ postfix                               
Command 'postfix' not found, but can be installed with:
sudo apt install postfix

debian ❯ postfffix
Command 'postfffix' not found, did you mean:
  command 'postfix' from deb postfix
Try: sudo apt install <deb name>

As can be seen, the output differs a bit between Linux distributions. I found the one in Arch Linux (provided by pkgfile) slow (taking ~1.6 s). It also doesn’t implement fuzzy matching on typos. To be fair, pkgfile does a lot more than just command-not-found handling, but for this use case that isn’t relevant.

I did some profiling and found that pkgfile was spending most of its time parsing a cache file for the “extra” repository. A very large file (478 MiB). I reported a bug about this, but while waiting for it to get fixed I decided “I can do better” and “this sounds like a fun weekend project”.

I should mention that pkgfile has improved immensely as a result of that bug report. It now only takes on average 128.3 ms to run. But by that point I had already finished my program. And my program only takes ~5 ms to do the same query. So now I’m 25.6x faster, instead of 320x faster. And I also use 102x less CPU time to do so. Oh, and I do fuzzy searching (which pkgfile doesn’t do).

If you want to use my program, look at the GitHub page. The rest of this blog post will be aimed at developers who are interested in the technical details of how I did this. I am using Rust, but the concepts should be applicable to other languages as well (especially C, C++, Zig and other low-level languages).

Design phase

The first step of any project like this is to figure out the design. And the first step of that is figuring out what you want to do. Pkgfile does a lot more than just command-not-found handling: It also does forward and reverse searching for which package provides a particular file. Unlike pacman -Qo it will provide this info even when the package isn’t installed. While handy, this is a general functionality I don’t need to solve if I want to write the fastest possible command-not-found handler. I’m more than happy to use pkgfile for those other use cases.

So I decided to hyper-specialise on command-not-found handling. Okay, that defines the scope. Next we need to know where to get the data to search. Pkgfile downloads the data from the pacman mirrors itself, but it turns out pacman can do this for us, using the pacman -F sub-command. In order to simplify I will piggyback on that. No need to check that I’m not redownloading unchanged files, pacman can do that for me. When I want to update my cache files I just first run pacman -Fy.

Wait a second? If pacman has this downloaded already, why do I need to make cache files? Well, as it turns out, those files are big compressed tar archives, and they are mapping from package, to the files that package provides. I need the opposite mapping. So I need to build some sort of cache file with the subset of data I’m interested in, organized such that it is quick to search.
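The inversion itself is simple to illustrate. The following is a minimal std-only sketch, not the actual filkoll code: it takes a list of (package, files) pairs and builds a map from binary name to the packages and directories providing it. The function name, the types and the example paths are assumptions made for illustration, and filtering on $PATH is left out here.

```rust
use std::collections::HashMap;

/// Map from binary name to (package, directory) pairs that provide it.
type ByBinary = HashMap<String, Vec<(String, String)>>;

/// Invert "package -> files" into "file name -> providers".
/// Hypothetical helper: directory entries (paths ending in '/') are skipped.
fn invert(packages: &[(&str, Vec<&str>)]) -> ByBinary {
    let mut by_binary = ByBinary::new();
    for (package, files) in packages {
        for file in files {
            // Split "/usr/bin/uv" into ("/usr/bin", "uv").
            let Some((dir, name)) = file.rsplit_once('/') else {
                continue;
            };
            if name.is_empty() {
                continue;
            }
            by_binary
                .entry(name.to_owned())
                .or_default()
                .push((package.to_string(), dir.to_owned()));
        }
    }
    by_binary
}
```

A lookup is then a single hash map access on the name the user typed, which is the property the cache file needs to preserve.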

At this point the general approach is clear, a command with two sub-commands: update (updates the cache) and binary (searches the cache for a binary).

We should also identify our “hot path”. That is, for which of these sub-commands does speed matter the most? The answer is binary: this will be run interactively. update will be run once per day on a background timer. So we should optimise for binary at the expense of update (though we want update to be as fast as possible within those constraints of course). Identifying this “hot path” early is something I have found very useful across many interactive command line tools I have written.

Implementation

The first step of the implementation is of course another design step: What data structures and libraries should we use. I wanted to try out rkyv for a while and this project should be a good fit: It is a library for zero-copy deserialisation. That is: once I load the cache file I don’t need to do any decoding of the data in the file. I can use it in memory as is.

One thing you have to think about when doing zero-copy deserialisation is that you can’t use compression. At least not traditional compression. If you have compressed files you need to decompress them first, so no zero-copy for you. As such we need to look for alternative approaches to shrink the data size. I will be using string interning (a technique I’m familiar with from earlier projects). More on that later.

Then there was the question of fuzzy search that I wanted to do. That I had to do some research on, and I ended up with a “simpler is better” solution. More on that later as well.

I will also use some of the usual suspects for a Rust project: clap for argument parsing, rayon for a thread pool, etc.

So let’s look at the sub-problems we have ahead of us:

Zero-copy deserialisation
String interning
Efficient data structures
Building the cache
Searching the cache
Fuzzy searching

Each of these will be covered in a section below.

Zero-copy deserialisation

You might naively think that you could just write a struct to a file and load it back into memory. This can work in specific circumstances, but if you have anything with an indirection (pointer), that breaks down. The struct will probably not be loaded at the same location as where you created it before saving it. So this is why I turned to rkyv, as it solves this problem for us. It will use offsets in the data instead of absolute pointers. Like any serialization library in Rust it is quite ergonomic to use thanks to Rust’s derive macro system:

#[derive(rkyv::Archive, rkyv::Serialize, rkyv::Deserialize)]
struct MyStruct {
    a: u32,
    b: String,
    c: Vec<String>,
}

Rust derive takes care of the hard work for you. For the benefit of those who don’t know Rust: what derive does is pass the annotated code to a pre-processor (a “proc macro” in Rust parlance) that can generate code based on the input it gets. The expanded code takes care of all the tricky details for you, in this case for serializing and deserializing the struct. Many other derives exist, both built into the standard library and as separate libraries.

Rkyv is not compatible with the popular serde library for Rust (another derive based library): serde fundamentally can’t do zero-copy except for a small subset of the data in very specific circumstances.

I’m not going to deep dive into how rkyv works, but some information is available in the official rkyv mdbook for those that are interested.

String interning

There are many strings that repeat in our data. Most binaries live in /usr/bin or a small handful of other directories. Many packages provide more than one binary (so package name also repeats many times). We can exploit this duplication to save on memory and file size.

String interning is the idea that you store one copy of a string and point to it many times instead of storing the same string many times. There are many libraries for this in Rust. None of them support Rkyv from what I could find. So I wrote a very simple single threaded interner intended to work well together with Rkyv.

Side note: String interning sounds great, why not use it all the time? Well, it has a cost: You need to look up the string in the interner to get at the data, the strings are read only, and if the string doesn’t live for long that can be wasted memory unless you implement garbage collection or reference counting (which also has a cost). So like almost everything in programming, it is situational.

Most string interners are intended for use in a single session of the program. As such they try to optimise for not reallocating memory too much etc, and internally have complex allocation strategies. I don’t need that. In fact, I don’t want that. My “hot path” is searching. That is what I should optimise for. What is the cheapest possible string interner for my use case (even if building it is more expensive)?

Well, I don’t know if it is the cheapest possible, but it is very cheap: A single Vec<u8>. An interned string is just a u32 offset into that Vec. It points to a u16 that is the length of the data (we don’t need massively long strings), followed by the data itself. It is very fast to do a lookup in. And very fast to extract a Rust &str from. It is a bit of a pain to build though.

For a start, how do we know when we have seen a string before? We could just do a linear search. But that is slow (complexity $O(n)$). And even though updating isn’t our fast path, I would like it to have reasonable complexity ($O(1)$). Instead, we need a hash map on the side, but only while building. So we end up with:

/// Use a newtype handle to avoid mixing up incompatible numbers.
#[derive(rkyv::Archive, rkyv::Serialize, rkyv::Deserialize)]
pub struct Handle {
    offset: u32,
}

/// The interner as it exists after creation, ready to be serialized.
#[derive(rkyv::Archive, rkyv::Serialize, rkyv::Deserialize)]
pub struct StringInterner {
    data: Vec<u8>,
}

/// The builder for a string interner
pub struct StringInternerBuilder {
    data: Vec<u8>,
    lookup: HashMap<String, Handle>,
}

impl StringInternerBuilder {
    pub fn new() -> Self { /*...*/ }

    /// Intern a new string
    /// (possibly getting back a handle to a previously interned string,
    ///  or a new handle)
    pub fn intern(&mut self, value: &str) -> Handle { /*...*/ }

    /// Finalize the builder and create the readonly version
    ///
    /// This consumes this builder.
    pub fn into_readonly(self) -> StringInterner { /*...*/ }
}

impl ArchivedStringInterner {
    pub fn get(&self, handle: ArchivedHandle) -> &str { /*...*/ }
}

Wait, what is going on with ArchivedStringInterner and ArchivedHandle? Well, that is how Rkyv works. Since it is zero-copy it cannot in general use the same data type when serializing and deserializing. So you get back an “archived” version of it, that uses relative offsets, rather than pointers.

Another point here is that of “newtypes”. This is a fairly common pattern in Rust. For those who don’t know Rust, it is just a wrapper around an inner type such that you will get a compiler error if you mix variables of different newtypes. Good to reduce the risk of bugs from accidentally mixing e.g. UserId(u64) with PostId(u64). The Handle is exactly that: a newtype around u32 (unsigned 32-bit integer).

Let’s look at the implementation of get:

/// Get the string for a given handle
pub fn get(&self, handle: ArchivedHandle) -> &str {
    let data = self.get_raw(handle);

    std::str::from_utf8(data).expect("Invalid utf8")
}

/// Get the raw bytes for a given handle
pub fn get_raw(&self, handle: ArchivedHandle) -> &[u8] {
    // Extract the offset into our Vec from the handle
    let offset = handle.offset.to_native() as usize;
    // Check bounds (though rust will panic later on the built in bounds check if
    // we are out of bounds, so this isn't strictly needed, just provides a more
    // obvious error).
    debug_assert!(self.data.len() >= (offset + 2));

    // Get the string size at offset (u16)
    let size = u16::from_le_bytes([self.data[offset], self.data[offset + 1]]) as usize;
    // Get the string at offset + 2 (based on the size we just read)
    &self.data[(offset + 2)..(offset + 2 + size)]
}

Nice, entirely safe Rust. And, as we are not going to be looking up many strings, it isn’t worth hyper optimising this to the point of using unsafe. In particular it wouldn’t be safe to not do UTF-8 validation: It is possible to mix up handles from different interners. So in general we can’t trust the handle to point to anything sensible.

What are the pros and cons of this interner? As I said above, it is very fast once built. But it is single threaded to build. And since it is a single Vec: if the vector needs to grow, the whole vector needs to be reallocated, which might result in an expensive memory copy if it can’t grow in place. That will affect our cache build speed. But that is not on the hot path.
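To make the layout concrete, here is a minimal std-only sketch of such an interner without the rkyv derives. The type and method names follow the skeleton above, but the bodies are a guess at one straightforward way to fill in the elided parts, not necessarily what filkoll does. The panics on oversized input are an assumption as well.

```rust
use std::collections::HashMap;

/// Offset into the interner's byte buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Handle {
    offset: u32,
}

/// Read-only interner: [u16 length (little endian)][bytes] repeated.
pub struct StringInterner {
    data: Vec<u8>,
}

#[derive(Default)]
pub struct StringInternerBuilder {
    data: Vec<u8>,
    /// Only needed while building, to find previously interned strings.
    lookup: HashMap<String, Handle>,
}

impl StringInternerBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn intern(&mut self, value: &str) -> Handle {
        if let Some(handle) = self.lookup.get(value) {
            return *handle;
        }
        // ASSUMPTION: panic if a string or the buffer outgrows its integer type.
        let len = u16::try_from(value.len()).expect("string too long to intern");
        let offset = u32::try_from(self.data.len()).expect("interner is full");
        self.data.extend_from_slice(&len.to_le_bytes());
        self.data.extend_from_slice(value.as_bytes());
        let handle = Handle { offset };
        self.lookup.insert(value.to_owned(), handle);
        handle
    }

    /// Drop the hash map and keep only the byte buffer.
    pub fn into_readonly(self) -> StringInterner {
        StringInterner { data: self.data }
    }
}

impl StringInterner {
    pub fn get(&self, handle: Handle) -> &str {
        let offset = handle.offset as usize;
        let size = u16::from_le_bytes([self.data[offset], self.data[offset + 1]]) as usize;
        std::str::from_utf8(&self.data[offset + 2..offset + 2 + size]).expect("Invalid utf8")
    }
}
```

Interning the same string twice returns the same handle, so repeated values such as /usr/bin are stored only once, at the price of the extra hash map during the build.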

Efficient data structures

Okay, we now have a string interner. What else do we need? Well, we need a root object to serialize. That needs to have a lookup table for binary names to packages. After a bit of back and forth I ended up on the following (excluding rkyv derives etc):

struct DataRoot {
    /// Name of repository (e.g. core or extra)
    repository: String,
    /// Interner for paths and package names
    interner: StringInterner,
    /// Mapping from binary name to data about binary
    binaries: HashMap<String, SmallVec<[Record; 2]>>,
}

struct Record {
    /// Which package provides the binary
    package: PackageRef,
    /// Which directory is the binary in (e.g. /usr/bin)
    directory: DirectoryRef,
}

struct PackageRef(Handle);
struct DirectoryRef(Handle);

Okay, what is going on here? Well, there are more newtypes around the Handle from earlier. Again to prevent mixing things up.

SmallVec is from a crate (Rust library) of the same name. It will store up to N elements inline. It will only allocate separately on the heap if it outgrows that. Since most binaries are only provided by a single package, this is a good fit. We expect the length to be 1. As it turns out length 2 is free: A SmallVec on a 64-bit build is always a minimum 24 bytes anyway, so we get two elements for free in this case.

So now we have a mapping from binary names to info about the binary.

Building the cache

When building we need to first download the data. As mentioned in the design section: we piggyback on pacman -Fy for that. That creates several files in /var/lib/pacman/sync. We need to process the *.files files. These are compressed tar archives. I decided to process one such file per thread. This naturally creates a cache file per input archive.

We only need to index binaries in $PATH so we can filter on that while building. This reduces the size of the data immensely. Instead of a few hundred MiB we end up with about 950 KiB of data (after the interning that is).

I did a few performance tricks here, such as delaying validating that the paths are valid UTF-8 until after the initial filtering on paths, and reusing allocations in hot loops. In the end the bottleneck is processing the extra.files archive (45 MiB compressed, 511 MiB uncompressed). But I manage in about 1-1.4 seconds on my Skylake era laptop. Not bad, and since this will run in the background that is fine(TM).

One performance trick I found quite generally useful in many Rust projects is to use the regex crate. From other languages (Python, C++, etc) I’m used to regexes being slow, and thus not ideal for high performance code. However, the Rust implementation is exceptional, and often compiles¹ down to very performant state machines, exploiting the specifics of your pattern. I used it to filter on $PATH. I build a regular expression that is basically all the elements from $PATH joined with | (the regex “or” operator). Based on what I have read about regex I suspect this compiles down to an Aho-Corasick automaton under the hood (though I don’t know how to get debug info such as the underlying regex engine out of it).
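The filtering step itself can be illustrated without the regex crate. The sketch below swaps in a plain HashSet of directories for the regular expression described above, so it shows the idea rather than the actual implementation. The function names and the use of absolute paths are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Split a PATH-style string ("/usr/bin:/usr/local/bin") into a set of
/// directories, ignoring empty entries and trailing slashes.
fn path_dirs(path_var: &str) -> HashSet<&str> {
    path_var
        .split(':')
        .map(|dir| dir.trim_end_matches('/'))
        .filter(|dir| !dir.is_empty())
        .collect()
}

/// Return (directory, binary name) if `file` sits directly in one of `dirs`.
fn binary_in_path<'a>(file: &'a str, dirs: &HashSet<&str>) -> Option<(&'a str, &'a str)> {
    let (dir, name) = file.rsplit_once('/')?;
    // Skip directory entries (empty name) and anything outside PATH.
    if name.is_empty() || !dirs.contains(dir) {
        return None;
    }
    Some((dir, name))
}
```

Only entries for which binary_in_path returns Some need to be interned and stored, which is what shrinks the data so drastically.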

Searching the cache

I wanted to make the search multi-threaded (obviously). But starting a thread takes longer than the entire runtime of the single threaded search! So I’ll take that.

How did I get there though? This is where we get into the unsafe code. (Insert spooky music here.)

Memory mapping

For a start, I memory-map the cache files. This uses the memmap2 Rust crate. Doing that memory map is unsafe. Scary! Well, not really. You just need to ensure the data doesn’t change out from under you while you are using it.

I think it is time for an aside on unsafe in Rust. And how the community deals with it. I have seen the entire spectrum from “unsafe outside of the standard library is evil” to “eh, no big deal”. I think it is best to take a middle way on this. You shouldn’t be scared of unsafe but also don’t use it without having a reason. And when you do use it, read the Safety notes for whatever function you are calling. If you are working with raw pointers, and especially on the border between raw pointers and references it gets much more complex (and this page is not the right guide for you). But most of the time it is just a matter of reading and being careful.

So let’s look at the safety documentation here for memmap2:

All file-backed memory map constructors are marked unsafe because of the potential for Undefined Behavior (UB) using the map if the underlying file is subsequently modified, in or out of process. Applications must consider the risk and take appropriate precautions when using file-backed maps. Solutions such as file permissions, locks or process-private (e.g. unlinked) files exist but are platform specific and limited.

Okay, what does that mean for us? Well, I needed to go back to the updating code and ensure it didn’t modify the cache files in place. Instead, I create new files and move them in place once done. On Linux and Unix at least, this is an atomic replacement. A concurrent reader gets either the old or new file. Not a mix of both.

So we can write our own safety comment:

// SAFETY:
// * When the file is written it is created anew, not overwritten in place. As
//   such it cannot change under us.
let mmap = unsafe { Mmap::map(&file) }?;

I also documented this in the update code:

    // Figure out the file names for the files we write, note that we will write to a temporary file.
    let cache_path = cache_path.with_extension("binaries");
    let tmp_path = cache_path.with_extension("binaries_new");

    // ... Serialising and writing here ...

    // INVARIANT: Rename to atomically update the file
    // This is a safety requirement to allow mmaping the file during lookup.
    std::fs::rename(&tmp_path, &cache_path)?;

This helps ensure that a future change to the writer doesn’t break this assumption by mistake.
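For readers who want to try the pattern in isolation, here is a minimal self-contained sketch of write-then-rename using only the standard library. The helper name and the temporary extension are hypothetical, and the sync_all call is my own addition rather than something the snippet above shows.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to a temporary sibling file, then rename it over
/// `final_path`. The final file is never modified in place.
fn write_atomically(final_path: &Path, contents: &[u8]) -> io::Result<()> {
    // Hypothetical temporary name next to the final file (same file system,
    // so the rename does not turn into a copy).
    let tmp_path = final_path.with_extension("binaries_new");

    let mut file = File::create(&tmp_path)?;
    file.write_all(contents)?;
    // ASSUMPTION: flush to disk before the rename so a crash cannot leave a
    // renamed but incomplete file behind.
    file.sync_all()?;

    // INVARIANT: rename to atomically replace the old file.
    fs::rename(&tmp_path, final_path)
}
```

A process that already has the old file mapped keeps seeing the old contents, since the old inode stays alive until the last mapping and file handle go away.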

(Side note: I’m not considering the case of an attacker modifying the file here, the file will be owned by root: the update job runs from a systemd timer as root. In my threat model I trust root.)

This handling is quite different from what you see with code in most other languages, including code I myself have written in e.g. C++ or Python before. Everything is basically an unclear soup of safe and unsafe in those languages (to varying degrees, there are many more unsafe things in C++ than in Python, but there are a few things you can mess up in Python too, though usually you don’t get a segfault, but get weird behaviour or an unexpected exception, etc). Rust forces you to think about these things, which I think is good.

Rkyv safety invariants

We are not out of the woods yet though. We still have to access the data. And I found another reason to use unsafe here:

Rkyv can be used in a safe or an unsafe way. In safe mode it validates the data when it is loaded. In unsafe mode it skips that, but you have to ensure the data is valid some other way.

I measured, and I could save a few milliseconds by using the unsafe load. And that is a sizable fraction of the total program runtime at this point! So what is our safety invariant here? Quoting the rkyv documentation:

The byte slice must represent a valid archived type when accessed at the default root position. See the module docs for more information.

The module docs say:

The safety requirements for accessing a byte slice will often state that a byte slice must “represent a valid archived type”. The specific validity requirements may vary widely depending on the types being accessed, and so in general the only way to guarantee that this call is safe is to have previously validated the byte slice.

Using techniques such as cryptographic signing can provide a more performant way to verify data integrity from trusted sources.

It is generally safe to assume that unchanged and properly-aligned serialized bytes are always safe to access without validation. By contrast, bytes from a potentially-malicious source should always be validated prior to access.

Well, what are we trying to protect against? As I hinted at above, since the file is owned by root in /var/cache/filkoll, I don’t consider malicious actors. If root is compromised there are bigger problems! So what do we need to protect against?

The answer is version upgrades! Maybe we change the file format in a new release. Or update to an incompatible version of rkyv? To deal with this I added a file header in front of the rkyv data:

#[derive(zerocopy::IntoBytes, zerocopy::FromBytes, zerocopy::Immutable, zerocopy::KnownLayout)]
pub(crate) struct Header {
    /// A magic number to identify the file format
    pub(crate) magic: u32,
    /// INVARIANT: A version number that is manually incremented if the format
    /// changes in ways that the hash cannot catch.
    pub(crate) version: u32,
    /// A hash of the type of the root object
    pub(crate) type_hash: u64,
}

There is a lot going on here, let’s break it down:

First a magic number to identify the file format. This is good practice in general for binary files and can help identify the file. I used 0x464b4f4c (FKOL in ASCII). This does not provide much safety though, just an indication that the file probably comes from the same program.
Then a format version number that is manually handled. Good, but there is a risk of me forgetting this when making changes. But it was needed to handle some edge cases that the final component couldn’t handle.
And, the final part is a hash computed using type_hash. This gives a hash on the data format. A hash for the Cargo.lock file is also mixed into that (to protect against rkyv upgrades etc). This includes the specific versions and checksums of all dependencies.

That should cover all eventualities, though it will have false positives (report a mismatch even when the format is compatible). But for a cache file that is fine. It is cheap to re-create. And better safe than sorry.

So we check the header, and only then load the rest of the file with rkyv’s unsafe accessor function.
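The header check can be sketched with only the standard library. This is a rough illustration, not the actual filkoll code: it parses the three fields by hand instead of using zerocopy, assumes a little-endian on-disk layout, and the names `checked_payload`, `VERSION` and the version value 1 are made up for the example.

```rust
use std::convert::TryInto;

/// Hypothetical std-only stand-in for the zerocopy-derived header.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct Header {
    magic: u32,
    version: u32,
    type_hash: u64,
}

const HEADER_LEN: usize = 16;
/// "FKOL" in ASCII, read as a big-endian u32.
const MAGIC: u32 = u32::from_be_bytes(*b"FKOL");
/// Assumed value for the example; bumped by hand on format changes.
const VERSION: u32 = 1;

impl Header {
    /// Serialises the header (little-endian is an assumption here).
    fn to_bytes(self) -> [u8; HEADER_LEN] {
        let mut out = [0u8; HEADER_LEN];
        out[0..4].copy_from_slice(&self.magic.to_le_bytes());
        out[4..8].copy_from_slice(&self.version.to_le_bytes());
        out[8..16].copy_from_slice(&self.type_hash.to_le_bytes());
        out
    }

    /// Returns None if the input is too short to contain a header.
    fn parse(bytes: &[u8]) -> Option<Header> {
        let head = bytes.get(..HEADER_LEN)?;
        Some(Header {
            magic: u32::from_le_bytes(head[0..4].try_into().ok()?),
            version: u32::from_le_bytes(head[4..8].try_into().ok()?),
            type_hash: u64::from_le_bytes(head[8..16].try_into().ok()?),
        })
    }
}

/// Returns the bytes after the header only if every field matches.
/// Only this slice would then be handed to the unchecked accessor.
fn checked_payload(file: &[u8], expected_type_hash: u64) -> Option<&[u8]> {
    let header = Header::parse(file)?;
    let expected = Header { magic: MAGIC, version: VERSION, type_hash: expected_type_hash };
    if header == expected {
        Some(&file[HEADER_LEN..])
    } else {
        None
    }
}

fn main() {
    let header = Header { magic: MAGIC, version: VERSION, type_hash: 0xabcd };
    let mut file = header.to_bytes().to_vec();
    file.extend_from_slice(b"payload");
    // A matching hash yields the payload, a mismatch yields None.
    println!("{:?}", checked_payload(&file, 0xabcd).map(|p| p.len()));
    println!("{:?}", checked_payload(&file, 0x1234).map(|p| p.len()));
}
```

On any mismatch the caller would treat the cache as stale and rebuild it, which is what makes false positives harmless.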

Actually searching the cache

Finally, it is time to search!

Recall from DataRoot above that we have binaries: HashMap<String, SmallVec<[Record; 2]>>. A hash map! We can just look up the binary name. Then resolve the interned strings and print out the match. And that is exactly what we do for an exact search.

Fuzzy searching

Fuzzy searching is a bit more complex. After much looking around and experimenting I used strsim for this. This is a library to compute the “edit distance” between two strings.

A simple example of a distance function could be “how many letters differ between the two strings”. This is known as the Hamming distance. However, that can only handle strings of the same length. The Levenshtein distance is a bit more general: it computes the minimum number of insertions, deletions, or substitutions needed to get from one string to the other. There are many more, that might handle e.g. “swapping” as a single edit or other features. However, after testing a few different ones I settled on the Levenshtein distance as it was both fast and gave good results.

However, this approach (comparing string distances) means I need to go through all keys in the hash map. I was planning to do something fancier to begin with (there is a symspell crate that looked interesting, but also unmaintained). But the naive approach here was fast enough that I just settled for that:

For each key in the hash map:
Compute the Levenshtein distance to the search string (the typoed binary).
If it is below a threshold, consider it a potential match
Sort the potential matches by Levenshtein distance
If there is an exact match just print it, otherwise print all possible matches.
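The steps above can be sketched as follows. This is a simplified illustration rather than the real implementation: it uses a hand-rolled Levenshtein function as a stand-in for strsim, the function name `fuzzy_search` and the threshold parameter are assumptions, it treats a distance equal to the threshold as a match, and it breaks ties alphabetically to get a stable order.

```rust
use std::collections::HashMap;

/// Classic two-row Levenshtein distance over Unicode scalar values.
/// A stand-in for the strsim crate, so the example needs no dependencies.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut cur = vec![0; b.len() + 1];
    for (i, ca) in a.chars().enumerate() {
        cur[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != *cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur[j + 1] = substitute.min(delete).min(insert);
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[b.len()]
}

/// Scans every key, keeps those within `max_distance`, sorts by distance.
/// If the best hit is an exact match, only that one is returned.
fn fuzzy_search<'a, V>(
    map: &'a HashMap<String, V>,
    query: &str,
    max_distance: usize,
) -> Vec<(&'a str, usize)> {
    let mut hits: Vec<(&str, usize)> = map
        .keys()
        .map(|key| (key.as_str(), levenshtein(key, query)))
        .filter(|&(_, distance)| distance <= max_distance)
        .collect();
    // Sort by distance, then by name so the output order is deterministic.
    hits.sort_by(|a, b| a.1.cmp(&b.1).then(a.0.cmp(b.0)));
    if hits.first().map_or(false, |hit| hit.1 == 0) {
        hits.truncate(1);
    }
    hits
}

fn main() {
    let mut binaries: HashMap<String, ()> = HashMap::new();
    for name in ["grep", "gdep", "ls", "prep"] {
        binaries.insert(name.to_string(), ());
    }
    for (name, distance) in fuzzy_search(&binaries, "gerp", 2) {
        println!("{name} (distance {distance})");
    }
}
```

Note that a plain Levenshtein distance counts a swapped pair of letters as two edits, which is why "gerp" is at distance 2 from "grep".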

Testing

I’ll admit: I have a tendency to just write tests for some tricky bits and leave the rest to manual exploration. Then I throw an integration test on it after a while, when it becomes difficult to maintain without one. This actually works surprisingly well for small command line tools like this.

However, writing this article spurred me to do the integration test early.

What I did is set up a podman container with an Arch Linux image and run a cache update followed by some test queries in it. Then I check the output against a “golden file” and report a failure if it differs. It is simple but effective.

Though you need to ensure you use a fixed version of the image and package database, as upgrades can definitely change small details in the output. Thankfully, Arch Linux has an archive where you can get the sync database for a specific date, making this sort of testing easy.

Benchmarking and profiling

And finally: your program isn’t fast unless you can prove it is. And this is where benchmarking and profiling come in. Much has been written on this topic already, so there isn’t much new I can add. Thus, I’ll be brief. Really, you should just go read the excellent free Rust Performance Book instead.

For command line programs with a fixed runtime (like the program this blog post is about) you should use hyperfine to compare the runtime of two different programs under test. This can be your program vs what came before. Or two copies of your program with slightly different implementations of an algorithm. If less readable code doesn’t make it measurably faster, it isn’t worth it. You can of course also compare different flags, etc.

But how do you identify your hotspots so you know where to try to tweak the code? Well, you use a profiler. For Linux (which is the only platform I know), use perf to record the data. Then I recommend hotspot for visualising the data. Every person swears that their favourite visualiser is the best / easiest / most powerful. I found hotspot to most clearly show me where the problems are. It not only has flamegraphs, top-down views, callee-caller views, etc, but also a time axis. It is easy to zoom and filter in on different phases of execution, etc. Hotspot probably isn’t the easiest to get up and going with though, since it has a lot of features. But I use it for both my hobby projects and for my day job. It gets the job done, and done well.

And then you need to consider memory usage too. For Rust the best option is bytehound, though it takes a bit of work figuring out the UI. Tip: Try right-clicking on various things that you don’t expect to be right clickable! Heaptrack is another option, that is good, but doesn’t know how to demangle Rust symbols. If you are heap profiling C or C++ it is however an excellent choice, with a UI very similar to Hotspot (though Bytehound is more powerful for advanced analysis, it is scriptable!).

Just go and try out a few different tools and see what you like. And read the Rust Performance Book. Seriously.

Note that when I say “compiles” about regular expressions, I refer to the regex being compiled into a state machine at runtime of the Rust program, not compilation of a Rust program itself. ↩

Reverse engineering ACPI functionality on a Toshiba Z830 Ultrabook

2022-08-21T00:00:00+02:00

I have inherited a Toshiba Satellite Z830-10W ultrabook. This is a very light and small laptop from around 2011. It had Windows 7 on it. With only 4 GB RAM and a 128 GB mSATA SSD, it is a bit on the weak side for running Windows 10. Plus I’m a Linux user. However, there were issues under Linux.

Initial investigation

Since the form factor and weight were nice (and the battery still had life in it) I decided it might be worth experimenting with. So I booted a USB stick with Xubuntu (not my choice to install, but quick to test out basic features from the live environment). I found several things that did not work: some LEDs, some buttons, and the backlight breaking after a sleep and resume cycle.

Some may be annoyed at this, but I have been interested in getting into reverse engineering and kernel programming for a long time, so I saw this as an opportunity and a good first project rather than a problem.

At this point I made a list of the problems I knew of so far. I also set up an external SSD with Linux, as 128 GB is not enough to work comfortably with both Windows and Linux, and I would need to trace things under Windows to see how they work. I used Arch Linux as that is my Linux distro of choice. The laptop has one USB 3 port, making this approach bearable.

Then I spent some time reading the Linux kernel driver toshiba_acpi, which provided some working features, to familiarise myself with how ACPI on Toshiba laptops works.

ACPI

A quick summary of ACPI is in order at this point: It is a standard (originally introduced in 1996) that lets the firmware describe the features of the hardware to the operating system. It is focused on describing things that cannot be auto-discovered and on power management. For example:

Addresses of memory mapped chips on the motherboard, such as the CMOS clock.
How to suspend and turn off the computer.
Notifications of the lid closing or opening on a laptop.

It also supports vendor specific extensions, and toshiba_acpi implements support for those under Linux. I spent some time reading up on ACPI as well. Some basic functionality is described in fixed format tables. However, most of the complicated features are described by a program that is provided as byte code to the OS. The OS implements a virtual machine that runs the ACPI byte code.

Tooling under Windows

After reading up on relevant background material I needed to figure out what tooling to use on Windows. One thing I quickly discovered was the AMLI debugger. This is bundled with Debugging Tools for Windows and lets you trace ACPI calls when kernel debugging. One snag though: On Windows 7 it needs a “checked build” of the operating system, specifically of ACPI.SYS. I tried to get this working, but it did not work out.

I ended up upgrading the laptop to Windows 10. While that is a much worse experience (the laptop takes minutes after boot to become usable, it is completely unusable while updates are downloading or installing, …) I do not plan to keep Windows 10 (or Windows in general) long term. For now, it works.

Using a local kernel debugging session on the Toshiba I was able to execute !amli set traceon spewon. This makes it output a lot of debug info to !dbgprint. A lot. This was way too much for my intended “press a button, see what happens” approach. I needed to log this to file and compare traces to make any headway finding the needles in the haystacks.

Sysinternals to the rescue! DebugView is a program in the Sysinternals suite of programs that lets me do exactly that (amongst other features).

Side note: User space tracing

I did also investigate the possibility of user space tracing, but that would look at the API between the user space programs and the kernel, which might be relevant, but might use a completely different API that is being translated in the driver. Some tools I came across that might be useful if this is what you want to do:

API Monitor seems to allow tracing a number of library calls and system calls. In some ways it is comparable to strace on Linux.
Spy++ is a tool to monitor window messages. It is a part of Microsoft Visual Studio. This might be useful as a complement.

In the end I had a brief look at these tools but did not end up using them for my actual reverse engineering work, other than to determine which user space programs and services were talking to the driver (so I could kill all but the one I wanted to investigate at the time and reduce noise). This was really needed, as especially the program that pops up an overlay when you press the Fn-key tended to perform a lot of repeated background queries for the same thing over and over again. And then there is the ECO process that keeps querying for what I believe is power usage all the time (as that seems to be the only thing it is used for). Both of these seem a bit wasteful on the battery in my eyes.

Interestingly, the handling of the extra hardware buttons (described further down) by user space was not done via the usual method of DeviceIoControl calls. Instead, these were received as window event notifications of the WM_POWERBROADCAST type.

The Toshiba driver for ACPI on Windows is named TVALZ.sys by the way (at least on this particular laptop), in case you want to figure out which processes open handles to it.

The process: reverse engineering

Next came a tedious task of establishing a baseline with all programs stopped. This involved grepping the various logs on a Linux computer. Then I just had to look at the new logs excluding all those things I filtered out. Each single ACPI function call typically generates hundreds of lines in the log, with details about function parameters and sub calls. Fortunately each new entry point is marked with the line AMLI:. I have included part of one call below as an example. This particular call queries the state of the keyboard backlight. The full log for this call is 168 lines, and as such I have left it out of here!

00015647    10:09:48    AMLI: FFFFD082FA3A2080: \_SB.VALZ.GHCI(Buffer(0x4){ 
00015648    10:09:48     0x00,0xfe,0x00,0x00},Buffer(0x4){ 
00015649    10:09:48     0x95,0x00,0x00,0x00},Buffer(0x4){ 
00015650    10:09:48     0x00,0x00,0x00,0x00},Buffer(0x4){ 
00015651    10:09:48     0x00,0x00,0x00,0x00},Buffer(0x4){ 
00015652    10:09:48     0x00,0x00,0x00,0x00},Buffer(0x4){ 
00015653    10:09:48     0x00,0x00,0x00,0x00})
00015654    10:09:48     
00015655    10:09:48    ffffd082f388c1aa: { 
00015656    10:09:48    ffffd082f388c1aa: CreateDWordField(Arg0=Buffer(0x4){ 
00015657    10:09:48     0x00,0xfe,0x00,0x00},Zero,REAX) 
00015658    10:09:48    ffffd082f388c1b1: CreateWordField(Arg1=Buffer(0x4){ 
00015659    10:09:48     0x95,0x00,0x00,0x00},Zero,R_BX) 
00015660    10:09:48    ffffd082f388c1b8: And(REAX,0xff00,Local0)=0xfe00 
00015661    10:09:48    ffffd082f388c1c1: If(LEqual(Local0=0xfe00,0xfe00)=0xffffffffffffffff) 
00015662    10:09:48    ffffd082f388c1c9: { 
00015663    10:09:48    ffffd082f388c1c9: If(LEqual(R_BX,0xc000)=0x0) 
00015664    10:09:48    ffffd082f388c1e1: If(LEqual(R_BX,0xc800)=0x0) 
00015665    10:09:48    ffffd082f388c1f9: If(LEqual(R_BX,0xc801)=0x0) 
00015666    10:09:48    ffffd082f388c211: } 
00015667    10:09:48    ffffd082f388c211: If(LEqual(Local0=0xfe00,0xff00)=0x0) 
00015668    10:09:48    ffffd082f388c248: Return(GCH0(Arg0=Buffer(0x4){ 
00015669    10:09:48     0x00,0xfe,0x00,0x00},Arg1=Buffer(0x4){
...

The log consists of line number, time stamp and log text. The log text contains some decompiled ACPI byte code (known as ACPI Machine Language or AML for short). AML defines both a nested structure of objects and methods and the instructions in those methods.
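Since each new entry point is marked with AMLI:, picking the calls out of a saved log is straightforward. The following is only a sketch of the idea (I used grep for the actual work): the function name `entry_points` is made up, and it assumes the line shape shown in the excerpt above, where a context pointer follows the AMLI: marker.

```rust
/// Extracts the ACPI method path from every log line that starts a new call.
/// Assumes lines of the form
/// `<line no> <time> AMLI: <context pointer>: <method>(<args...>`.
fn entry_points(log: &str) -> Vec<&str> {
    log.lines()
        .filter_map(|line| {
            // Only lines carrying the AMLI: marker begin a new call.
            let (_, rest) = line.split_once("AMLI: ")?;
            // Skip the context pointer that follows the marker.
            let (_, call) = rest.split_once(": ")?;
            // The method path ends where the argument list begins.
            let name = call.split('(').next()?;
            Some(name.trim())
        })
        .collect()
}

fn main() {
    let log = concat!(
        "00015647    10:09:48    AMLI: FFFFD082FA3A2080: \\_SB.VALZ.GHCI(Buffer(0x4){ \n",
        "00015648    10:09:48     0x00,0xfe,0x00,0x00},Buffer(0x4){ \n",
        "00015655    10:09:48    ffffd082f388c1aa: { \n",
    );
    for method in entry_points(log) {
        println!("{method}");
    }
}
```

Comparing the list of entry points from a baseline trace with the list from a trace where a button was pressed narrows down which calls are worth reading in full.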

The process: testing your hypotheses

Once I had a set of hypotheses based on the reverse engineering I had done I needed to test them. One way would be to actually change the Linux kernel driver, recompile it and then reload it via rmmod and insmod (or using modprobe).

However, a much quicker option is to use acpi_call. This is an out-of-tree kernel module that allows the root user to do direct ACPI method calls from the comfort of their own shell prompt. I ended up using the following helper functions in zsh to test these:

call_ghci() {
  echo "\\_SB_.VALZ.GHCI $1 ${2:-0} ${3:-0} ${4:-0} ${5:-0} ${6:-0}" | \
    tee /proc/acpi/call
  cat /proc/acpi/call; echo
}

set_hci() { call_ghci 0xff00 "$@" }
get_hci() { call_ghci 0xfe00 "$@" }
get_sci() { call_ghci 0xf300 "$@" }
set_sci() { call_ghci 0xf400 "$@" }
open_sci() { call_ghci 0xf100 "$@" }
close_sci() { call_ghci 0xf200 "$@" }

Some example usages:

$ # Turn on ECO LED
$ set_hci 0x97 1 1 0 0 0
\_SB_.VALZ.GHCI 0xff00 0x97 1 1 0 0
[0x0, 0x97, 0x1, 0x1, 0x0, 0x0]

$ # Get the BIOS boot order
$ open_sci; get_sci 0x157 0 0 0 0 0; close_sci
\_SB_.VALZ.GHCI 0xf100 0 0 0 0 0
[0x44, 0x0, 0x0, 0x0, 0x0, 0x0]
\_SB_.VALZ.GHCI 0xf300 0x157 0 0 0 0
[0x0, 0x8505, 0xfff30174, 0x5, 0xfff30741, 0x0]
\_SB_.VALZ.GHCI 0xf200 0 0 0 0 0
[0x44, 0x0, 0x0, 0x0, 0x0, 0x0]

This allowed me to test almost all the features without changing kernel code. The one exception is the notifications for the buttons.

Some other tooling that may be useful to know about under Linux is that many (but not all) ACPI events are sent to user space via Netlink. I found that the pyroute2 library allowed me to read this without resorting to coding in C.

The documentation was lacking for this particular feature of pyroute2, but that has since been fixed.

In the end reading the ACPI events over netlink was not useful to me. The things I needed were not there.

Results

In this section I present my findings. This is organised to help someone who wants to work on this, or perhaps extend my work. It is not for a general audience, but is intended as reference material for anyone working on kernel development in this area, or intending to improve Linux support for similar devices.

Background on Toshiba ACPI communication methods

This section is a short summary of the general protocol. This is already implemented in the toshiba_acpi Linux kernel driver. If you are already familiar with that you can skip this section.

Almost all vendor specific features work via the \_SB_.VALZ ACPI device defined in the DSDT. This device is also known as TOS6208 and VALZeneral. There are a handful of interesting methods on this object, but for the purposes of this write-up only GHCI is relevant. This method takes six 32-bit integer arguments and returns a buffer of six 32-bit integers.

The general format of queries is: {OPERATION, REGISTER, ARG1, ..., ARG4 }. The operation is one of HCI_GET/HCI_SET or SCI_GET/SCI_SET (plus SCI_OPEN and SCI_CLOSE). This allows for getting and setting various registers to control features or read out data.

The data returned varies a bit, but is generally of the form: {STATUS_CODE, REGISTER_FROM_QUERY, VAL1, ..., VAL4 }

What is the difference between HCI_* and SCI_* calls? The only important difference here is that for SCI_GET/SCI_SET you first need to call SCI_OPEN and then follow the get or set with a SCI_CLOSE call.
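To make the request format concrete, here is a small sketch that builds the six-word requests and wraps an SCI request in the required open/close pair. The operation codes are the ones from the shell helpers earlier in this post. The helper names (`request`, `sci_sequence`, `acpi_call_line`) are made up for the illustration, and nothing here talks to real hardware: it only produces the data you would send.

```rust
const HCI_SET: u32 = 0xff00;
const HCI_GET: u32 = 0xfe00;
const SCI_OPEN: u32 = 0xf100;
const SCI_CLOSE: u32 = 0xf200;
const SCI_GET: u32 = 0xf300;
const SCI_SET: u32 = 0xf400;

/// {OPERATION, REGISTER, ARG1, ..., ARG4}
type Request = [u32; 6];

/// Builds a request from an operation, a register and four arguments.
fn request(operation: u32, register: u32, args: [u32; 4]) -> Request {
    [operation, register, args[0], args[1], args[2], args[3]]
}

/// Wraps an SCI get/set request in the required SCI_OPEN / SCI_CLOSE pair.
fn sci_sequence(req: Request) -> Vec<Request> {
    vec![
        request(SCI_OPEN, 0, [0; 4]),
        req,
        request(SCI_CLOSE, 0, [0; 4]),
    ]
}

/// Formats a request as a line for /proc/acpi/call (acpi_call module).
fn acpi_call_line(req: &Request) -> String {
    let args: Vec<String> = req.iter().map(|v| format!("{:#x}", v)).collect();
    format!("\\_SB_.VALZ.GHCI {}", args.join(" "))
}

fn main() {
    // An HCI call stands on its own.
    println!("{}", acpi_call_line(&request(HCI_SET, 0x97, [1, 1, 0, 0])));
    println!("{}", acpi_call_line(&request(HCI_GET, 0x97, [0; 4])));
    // An SCI call is bracketed by open and close.
    for req in sci_sequence(request(SCI_GET, 0x157, [0; 4])) {
        println!("{}", acpi_call_line(&req));
    }
    for req in sci_sequence(request(SCI_SET, 0x157, [0xfff03174, 0, 0, 0])) {
        println!("{}", acpi_call_line(&req));
    }
}
```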

Much of the rest of this write-up consists of documenting registers previously not handled by the toshiba_acpi Linux driver.

The “Eco” LED

The toshiba_acpi driver has support for controlling some LEDs including the “Eco” LED (the eco LED is the one on the far right in the picture). Unfortunately that LED works differently on this laptop.

Apparently Toshiba has two formats for controlling this on different laptops:

// Format A
{HCI_SET, HCI_ECO_MODE, onoff, 0, 0, 0}
// Format B
{HCI_SET, HCI_ECO_MODE, onoff, 1, 0, 0}

The toshiba_acpi driver tries to use format B if it gets the error TOS_INPUT_DATA_ERROR when trying to use format A. On this laptop the error returned is TOS_NOT_SUPPORTED. Other than that format B works as expected.
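The fallback that would be needed can be sketched like this. It follows the description above rather than the actual driver code, the firmware is represented by a closure so the logic can be tried without hardware, and the numeric status codes are written down from my memory of the driver source, so double-check them before relying on them. The function name `set_eco_led` is made up.

```rust
const HCI_SET: u32 = 0xff00;
const HCI_ECO_MODE: u32 = 0x97;
// Status codes as I recall them from toshiba_acpi (an assumption, verify).
const TOS_SUCCESS: u32 = 0x0000;
const TOS_NOT_SUPPORTED: u32 = 0x8000;
const TOS_INPUT_DATA_ERROR: u32 = 0x8300;

/// Tries format A first and falls back to format B on either error code.
/// `call` stands in for the real GHCI method call.
fn set_eco_led(on: bool, mut call: impl FnMut([u32; 6]) -> [u32; 6]) -> Result<(), u32> {
    let onoff = u32::from(on);
    // Format A
    let status = call([HCI_SET, HCI_ECO_MODE, onoff, 0, 0, 0])[0];
    let status = match status {
        // Format B, also tried on TOS_NOT_SUPPORTED as seen on this laptop.
        TOS_INPUT_DATA_ERROR | TOS_NOT_SUPPORTED => {
            call([HCI_SET, HCI_ECO_MODE, onoff, 1, 0, 0])[0]
        }
        other => other,
    };
    if status == TOS_SUCCESS {
        Ok(())
    } else {
        Err(status)
    }
}

fn main() {
    // Fake firmware behaving like this laptop: format A is rejected with
    // TOS_NOT_SUPPORTED, format B succeeds.
    let fake_firmware = |req: [u32; 6]| {
        let status = if req[3] == 1 { TOS_SUCCESS } else { TOS_NOT_SUPPORTED };
        [status, req[1], 0, 0, 0, 0]
    };
    println!("{:?}", set_eco_led(true, fake_firmware));
}
```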

Battery charge mode

This laptop supports not charging the battery fully in order to prolong battery life. Unlike on, for example, ThinkPads, where this control is granular, here it is just off/on. When off it charges to 100%. When on it charges to about 80%.

According to the Windows program used to control the feature, the setting will not take effect until the battery has been discharged to around 50%. On Windows, Toshiba branded this feature as “Eco charging”.

In the following example ACPI calls I will use the following newly defined constants:

#define HCI_BATTERY_CHARGE_MODE 0xba
#define BATTERY_CHARGE_FULL 0
#define BATTERY_CHARGE_80_PERCENT 1

To set the feature: {HCI_SET, HCI_BATTERY_CHARGE_MODE, charge_mode, 0, 0, 0}
To query for the existence of the feature: {HCI_GET, HCI_BATTERY_CHARGE_MODE, 0, 0, 0, 0}
To read the feature: {HCI_GET, HCI_BATTERY_CHARGE_MODE, 0, 0, 0, 1}

The read may need to be retried if TOS_DATA_NOT_AVAILABLE is returned as the status code. This rarely happens (I never observed it on Linux), but I have seen it happen under Windows, and the Windows software did retry it.
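A retry loop for the read could look roughly like this. As before this is a sketch with the firmware replaced by a closure; the function name `read_charge_mode`, the attempt limit and the numeric value of TOS_DATA_NOT_AVAILABLE (from my memory of the driver source) are assumptions. The function returns the whole reply buffer and makes no claim about which word holds the charge mode.

```rust
const HCI_GET: u32 = 0xfe00;
const HCI_BATTERY_CHARGE_MODE: u32 = 0xba;
// As I recall it from toshiba_acpi (an assumption, verify before use).
const TOS_DATA_NOT_AVAILABLE: u32 = 0x8d20;

/// Issues the read request up to `max_attempts` times and returns the first
/// reply whose status is not TOS_DATA_NOT_AVAILABLE.
fn read_charge_mode(
    max_attempts: usize,
    mut call: impl FnMut([u32; 6]) -> [u32; 6],
) -> Option<[u32; 6]> {
    for _ in 0..max_attempts {
        let reply = call([HCI_GET, HCI_BATTERY_CHARGE_MODE, 0, 0, 0, 1]);
        if reply[0] != TOS_DATA_NOT_AVAILABLE {
            return Some(reply);
        }
    }
    None
}

fn main() {
    // Fake firmware that is not ready on the first attempt.
    // The reply contents are placeholders, not observed data.
    let mut attempts = 0;
    let reply = read_charge_mode(3, |req| {
        attempts += 1;
        if attempts == 1 {
            [TOS_DATA_NOT_AVAILABLE, req[1], 0, 0, 0, 0]
        } else {
            [0, req[1], 0, 0, 0, 0]
        }
    });
    println!("{:x?} after {} attempts", reply, attempts);
}
```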

Panel power control via HCI

The toshiba_acpi driver supports controlling the panel power via SCI calls (SCI_PANEL_POWER_ON). This laptop appears to support that (the calls return no errors), but nothing happens. Instead, HCI calls must be used.

#define HCI_PANEL_POWER_ON 0x2
#define PANEL_ON 1
#define PANEL_OFF 0

To read/query existence: {HCI_GET, HCI_PANEL_POWER_ON, 0, 0, 0, 0}
To write: {HCI_SET, HCI_PANEL_POWER_ON, panel_on, 0, 0, 0}

Hardware buttons

All the Fn+<key> hotkeys work. However, there are some hardware buttons that do not. These buttons are:

A button between space and the touchpad to turn off/on the touchpad.
Two buttons next to the power button, one is “eco-mode”, the other is “projector”.

The two buttons next to the power button both send Windows+X by default. The touchpad control button does nothing that Linux can detect.

To enable this functionality several changes are needed.

The toshiba_acpi driver currently uses {HCI_SET, HCI_HOTKEY_EVENT, HCI_HOTKEY_ENABLE, 0, ...} to enable the Fn+<key> hotkeys, where HCI_HOTKEY_ENABLE = 0x09. However on this laptop the value 0x05 must be used instead.

This is not the whole story however, as these keys do not work like any of the Fn-hotkeys (ACPI notification on \_SB_.VALZ). Instead, once enabled via the above method they start sending notifications on various PNP0C32 devices. These are currently not handled by Linux. According to a search PNP0C32 is “HIDACPI Button Device”.

The devices in question are:

PNP0C32 \_SB_.HS81 UID 0x03: Enable/disable trackpad
PNP0C32 \_SB_.HS87 UID 0x01: Eco button
PNP0C32 \_SB_.HS86 UID 0x02: Monitor/projector button

Only the “path” and the UID value in the ACPI DSDT tell these devices apart.

The notification always uses the value 0x80.

BIOS setting control from the OS

Several of the BIOS settings can be controlled from the OS. This all happens via SCI calls. On Windows the Hwsetup.exe program offers this control. I’m not sure how useful any of this is (as this is already available via the BIOS).

Still: it is a neat feature to have.

Setting boot order

This is a BIOS (not UEFI) laptop, so boot order could normally not be controlled from the OS. However here it is possible:

#define SCI_BOOT_ORDER 0x157

In this SCI register the boot order is stored as a list with each nibble indicating a device:

#define SCI_BOOT_ORDER_FDD 0x0
#define SCI_BOOT_ORDER_HDD 0x1
#define SCI_BOOT_ORDER_LAN 0x3
#define SCI_BOOT_ORDER_USB_MEMORY 0x4
#define SCI_BOOT_ORDER_USB_CD 0x7
#define SCI_BOOT_ORDER_USB_UNUSED 0xf

These are then combined as follows:

For example, consider the request to set boot order to USB memory, USB CD, HDD, LAN, FDD: {SCI_SET, SCI_BOOT_ORDER, 0xfff03174, 0, 0, 0}

Each nibble indicates a device, with the lowest nibble being the first device in the boot order. As this device doesn’t have a physical FDD I assume that this refers to USB attached devices, but I have not tested this (I do have a USB floppy drive if anyone really cares).

When reading the data out the result is a bit surprising: {0x0, 0x8505, 0xfff30174, 0x5, 0xfff30741, 0x0}

Presumably the other values also mean something. The boot order in this case is USB memory, USB CD, HDD, FDD, LAN, which is what the third value (0xfff30174) decodes to, so the third value is the boot order.
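The nibble packing can be expressed in a few lines. This sketch uses the device codes defined above; the function names are made up, and the decoder assumes that unused slots (0xf) only appear after the last real device.

```rust
const FDD: u32 = 0x0;
const HDD: u32 = 0x1;
const LAN: u32 = 0x3;
const USB_MEMORY: u32 = 0x4;
const USB_CD: u32 = 0x7;
const UNUSED: u32 = 0xf;

/// Packs up to eight device codes into one word, first device in the
/// lowest nibble, remaining slots filled with the "unused" marker.
fn encode_boot_order(devices: &[u32]) -> Option<u32> {
    if devices.len() > 8 || devices.iter().any(|&d| d > 0xf) {
        return None;
    }
    let mut word = 0u32;
    for slot in 0..8 {
        let nibble = devices.get(slot).copied().unwrap_or(UNUSED);
        word |= nibble << (4 * slot);
    }
    Some(word)
}

/// Unpacks the word again, stopping at the first unused slot
/// (assumes unused slots are always trailing).
fn decode_boot_order(word: u32) -> Vec<u32> {
    (0..8)
        .map(|slot| (word >> (4 * slot)) & 0xf)
        .take_while(|&nibble| nibble != UNUSED)
        .collect()
}

fn main() {
    let order = [USB_MEMORY, USB_CD, HDD, LAN, FDD];
    println!("{:#x?}", encode_boot_order(&order));
    println!("{:x?}", decode_boot_order(0xfff30174));
}
```

Encoding USB memory, USB CD, HDD, LAN, FDD gives 0xfff03174, matching the request above, and decoding 0xfff30174 gives USB memory, USB CD, HDD, FDD, LAN.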

Setting USB memory emulation

The BIOS can either treat USB memories as HDDs or FDDs for booting purposes:

#define SCI_BOOT_FLOPPY_EMULATION 0x511
#define SCI_BOOT_FLOPPY_EMULATION_FDD 0x1
#define SCI_BOOT_FLOPPY_EMULATION_HDD 0x0

To set: {SCI_SET, SCI_BOOT_FLOPPY_EMULATION, value, 0, 0, 0}
Getting/existence query: {SCI_GET, SCI_BOOT_FLOPPY_EMULATION, 0, 0, 0, 0}

Display during boot

This controls if BIOS/GRUB/etc is shown on just the internal monitor or not.

Note: When changing this in Windows it tells me a restart is required.

#define SCI_BOOT_DISPLAY 0x300
#define SCI_BOOT_DISPLAY_INTERNAL 0x1250
#define SCI_BOOT_DISPLAY_AUTO 0x3250

To set: {SCI_SET, SCI_BOOT_DISPLAY, value, 0, 0, 0}
Getting/existence query as usual.

CPU control

I presume this is only for operating systems that don’t manage this themselves, but I don’t know for sure. The wording in the documentation is vague, but I believe it controls CPU frequency behaviour.

Note: When changing this in Windows it tells me a restart is required.

#define SCI_CPU_FREQUENCY 0x132
#define SCI_CPU_FREQUENCY_DYNAMIC 0x0
#define SCI_CPU_FREQUENCY_HIGH 0x1
#define SCI_CPU_FREQUENCY_LOW 0x2

Set and get as usual: {SCI_GET/SET, SCI_CPU_FREQUENCY, value, 0, 0, 0} (You should be spotting a pattern by now.)

Wake on LAN (WoL)

Note! This only controls Wake on LAN when off/hibernated (and since this laptop has Intel Rapid Start, presumably in that mode too). It is not relevant to WoL when in sleep.

Here the Windows driver seems to query several possibilities until it hits on one that works:

#define SCI_WAKE_ON_LAN 0x700

#define SCI_WAKE_ON_LAN_OFF 0x1
#define SCI_WAKE_ON_LAN_ON 0x1

#define SCI_WAKE_ON_LAN_REG1 0x0
#define SCI_WAKE_ON_LAN_REG2 0x1000
#define SCI_WAKE_ON_LAN_REG3 0x800

To set: {SCI_SET, SCI_WAKE_ON_LAN, value | register, 0, 0, 0}
To get/query: {SCI_GET, SCI_WAKE_ON_LAN, register, 0, 0, 0}

For example on this specific laptop to enable WoL:

{SCI_SET, SCI_WAKE_ON_LAN, SCI_WAKE_ON_LAN_ON | SCI_WAKE_ON_LAN_REG3, 0, 0, 0}

REG1 and REG2 give return code TOS_INPUT_DATA_ERROR on this laptop, but presumably they are needed on some laptops, or the Windows program would not be attempting to use them.

SATA power control

This is another one that I don’t know what exactly it corresponds to, maybe it is something Linux can control directly:

#define SCI_SATA_POWER 0x406
#define SCI_SATA_POWER_BATTERY_LIFE 0x1
#define SCI_SATA_POWER_PERFORMANCE 0x0

Get/set/query as expected: {SCI_SET, SCI_SATA_POWER, value, 0, 0, 0}

Legacy USB

Controls Legacy USB support in BIOS.

Note: When changing this in Windows it tells me a restart is required.

#define SCI_LEGACY_USB 0x50c
#define SCI_LEGACY_USB_ENABLED 0x1
#define SCI_LEGACY_USB_DISABLED 0x0

Get/set/query as expected: {SCI_SET, SCI_LEGACY_USB, value, 0, 0, 0}

Wake on keyboard

This controls if pressing a key on the keyboard wakes the laptop from sleep. Otherwise, only opening the monitor or pressing the power button works for this.

#define SCI_WAKE_ON_KEYBOARD 0x137
#define SCI_WAKE_ON_KEYBOARD_ENABLE 0x8
#define SCI_WAKE_ON_KEYBOARD_DISABLE 0x0

Get/set/query as expected: {SCI_SET, SCI_WAKE_ON_KEYBOARD, value, 0, 0, 0}

Other features

Here is a summary of other features that I have not been fully able to decode and understand.

Power usage

The Windows software can read power usage in watts both when on AC and when on battery.

On startup of the program for this and when switching between AC and battery:

{HCI_SET, 0x42, 0x1, 0, 0, 0}
{HCI_SET, 0x42, 0x10, 0, 0, 0}

When on AC the following calls are involved:

{HCI_GET, 0xa7, 0x0, 0x0, 0x8b, 0x0}
{HCI_GET, 0xa7, 0x0, 0x0, 0x8b, 0x1}
{HCI_GET, 0xa8, 0x1, 0x0, 0x98, 0x0}
{HCI_GET, 0xa8, 0x1, 0x0, 0x98, 0x1}

When on battery the calls change:

{HCI_GET, 0xa1, 0x1, 0x0, 0x44, 0x0}
{HCI_GET, 0xa1, 0x1, 0x0, 0x44, 0x1}
{HCI_GET, 0xa8, 0x1, 0x0, 0x98, 0x0}
{HCI_GET, 0xa8, 0x1, 0x0, 0x98, 0x1}

Not all of these calls happen with the same frequency. The frequency also changes when going between AC and battery.

The returned data makes no sense to me, but it does vary with system load, so I suspect scaling and possibly masking is involved. However, I don’t have a good way to go any further with this without going into questionable methods such as decompilation. As such I have left this alone for now.

Mysterious other calls

I don’t even know what these do, but I have observed them under Windows:

When locking the screen under Windows: {HCI_SET, 0x25, 0x2, 0x1, 0, 0}
When putting the system to sleep under Windows: {HCI_SET, 0xbd, 0x81, 0, 0, 0}

Linux currently only uses HCI_GET on HCI_SYSTEM_INFO, Windows sometimes uses HCI_SET too:

On screen lock: {HCI_SET, HCI_SYSTEM_INFO, 0, 1, 0, 0}
On screen unlock: {HCI_SET, HCI_SYSTEM_INFO, 0, 0, 0, 0}

Toshiba Service Station causes this call to be performed once when it is opened:

{HCI_GET | 0x12, 0x9f, 0, 0, 0, 0}

The 0x12 makes no difference, but seems to be returned in the reply buffer. Thus, I speculate that the lower byte can be used as a sort of “transaction ID” to associate a request with a response. As to what the call does I can’t say, but it returns the same value (0x5988 in 4th integer in the buffer) every time.

In addition, on Windows, many calls that just fail (according to the status codes) are performed. These are presumably calls relevant to other models.

What’s next?

Now that everything is documented, and everything except the buttons tested, what is the next step?

At this point I plan to experiment with changing the toshiba_acpi kernel driver. For implementing support for the hardware buttons I plan to contact the relevant mailing list, as I’m unsure how to proceed. Is a Toshiba specific driver a good option or is a generic driver better?

Old school terminal: Informer D304

2021-08-12T00:00:00+02:00

I found a weird looking terminal in a storage room / museum in the basement of the local university:

A quick web search revealed that there was almost no information available online about this incredibly unergonomic terminal, which piqued my curiosity. I was able to find one image in a scan of a marketing brochure as well as a mention (without photos) on a wiki about terminals.

What is a terminal?

The uninitiated might at this point wonder what this is. What do I mean by a terminal? A terminal was not a computer, but would be used to connect to a large central computer (think hundreds of kilos and very expensive). Multiple such terminals could be connected to the same computer to allow many users to use the computer at once. This is a relic from the time before personal computing.

Specs & history

From the two sources above we can learn some information and specs. The most interesting are probably these:

Datapoint	Value
Introduction date	March 1979
Interface	RS-232C or 20 mA current loop (optional). Daisy chaining multiple terminals was supported as well.
Printer interface	Optional support for a “buffered printer port”
Baud rate	50 to 19200
Character Set	Upper and lower case, a total of 128 characters
Display Format	Multiple, not just the usual 80x24. Selectable by the host (presumably using control codes)

The company that made the terminal (Informer, Inc) appears to have been based in Los Angeles, CA, USA. I have not been able to find any additional information about them apart from their street address in the PDF above.

Photos

Because of the lack of information online I decided to take some photos of the unit to provide some documentation of it from various angles:

From the top, right next to some retro computers:

From the back we can see where the connectors are mounted:

When we take a closer look at the connectors (below), we can see that this model appears to be lacking the printer port option. The RS-232 connectors appear to be standard DB-25 connectors, but I’m not sure whether the ability to daisy chain to a “next terminal” follows any standard. I believe this unit does not have the current loop option.

Also, the power connector is not standard (8 pins, four on each row). This is unfortunate as I was not able to spot any power brick nearby. Finally there is a composite video out.

Looking at the bottom of the device there is some DIP switch configuration information as well as a nameplate (link to larger version):

Interestingly the device is marked as D304-K. The K seems to refer to this being the keyboard unit, as the monitor is marked D304-M (M for Monitor?):

There is also a contrast knob on the monitor underside as well as a tilt-and-swivel stand:

Finally, a closer look at the keyboard layout. As can be seen (link to larger version), this is a Swedish layout, but with some unusual keys. For example the Ü key is not standard on modern Swedish layouts, nor is it used in the Swedish language. Another difference is that Shift-4 is usually ¤ not $.

Apart from that there are several differences to modern keyboards in general. For example, there appears to be separate Return and LF (Line Feed?) keys. And the key cluster on the left is quite interesting.

What’s next for this terminal?

I would love to be able to get this up and running (connected to a modern Linux computer perhaps), but without the power brick I fear there is little chance of that happening.

If you have any more information regarding this type of terminal (especially with regards to the power connector), please get in touch. I’d love to be able to get this up and running at some point (assuming it is not broken). To contact me about matters regarding old computers use retro at <the domain of this website>.

QuickMCL vs AMCL performance

2019-04-07T00:00:00+02:00

This article compares the performance of QuickMCL and AMCL when it comes to computational resources. It does not compare them for speed or quality of localisation, but these should be identical if a parameter set is used that is supported by both programs (e.g. likelihood field model).

Test methodology

The tests were made with a fixed number of particles, effectively disabling KLD sampling (Fox, 2003). This is needed to get any sort of useful results, since otherwise the results will not be reproducible.

The test consisted of playing back a ROS bag file with a recorded scenario and measuring the CPU usage & memory usage at the end. This was repeated 5 times per parameter value tested and the average was taken.

The CPU (Intel Core i7-8550U) was set to run in performance mode to minimise the effects of CPU frequency scaling, and optimisations (-O3) were enabled for GCC 5.4.0-6ubuntu1~16.04.11.

For memory usage, the unique set size was measured, which is the amount of memory that would be freed should the program exit at that point. This gives a fairer measurement than resident set size, virtual set size or similar measurements, since it ignores shared memory-mapped resources such as the C standard library, but it does include libraries used only by the program.
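As an illustration of what unique set size means on Linux, here is a minimal sketch that sums the private pages listed in a process's smaps file. The function names are made up for this example, and it is not necessarily how the measurements in this article were collected.

```python
def unique_set_size_kib(smaps_text):
    """Sum the private (unshared) pages of all mappings, in KiB.

    Assumes the text has the format of /proc/<pid>/smaps on Linux.
    """
    total = 0
    for line in smaps_text.splitlines():
        # Private pages are not shared with any other process, so they
        # would be freed if this process exited.
        if line.startswith(("Private_Clean:", "Private_Dirty:")):
            total += int(line.split()[1])
    return total


def own_unique_set_size_kib():
    """Unique set size of the current process (Linux only)."""
    with open("/proc/self/smaps") as smaps:
        return unique_set_size_kib(smaps.read())
```

Shared pages (for example from the C standard library) show up under the Shared_Clean and Shared_Dirty fields instead and are deliberately left out.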

The tests were run with the bag file played back at an accelerated rate, after verifying that this did not change the results as long as the CPU load was below 100%, which was carefully monitored.

The code to re-create these results is available on GitHub.

CPU performance

First, let us look at CPU performance:

As can be seen, AMCL has quadratic complexity! It turns out this comes from a poor algorithm choice in the resampler: for every new particle it does a linear search through an array of cumulative particle weights, resulting in a complexity of $O(m\cdot n)$ where $m$ is the number of new particles and $n$ is the number of old particles.

As a contrast, QuickMCL uses a binary tree for this, resulting in a complexity of $O(m\cdot\log n)$
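To make the difference concrete, here is a minimal sketch of both strategies. It is not the code of either program: the function names are invented, and a binary search over a sorted cumulative table (Python's bisect module) stands in for the binary tree that QuickMCL uses, since it has the same $O(\log n)$ cost per draw.

```python
import bisect
import itertools
import random


def resample_linear(weights, count, rng):
    """O(m*n): scan the cumulative table from the start for every draw."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    picks = []
    for _ in range(count):
        r = rng.random() * total
        for i, upper in enumerate(cumulative):
            if r < upper:
                break
        picks.append(i)
    return picks


def resample_bisect(weights, count, rng):
    """O(m*log n): binary search in the same cumulative table."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    picks = []
    for _ in range(count):
        r = rng.random() * total
        # First index whose cumulative weight is strictly greater than r.
        i = bisect.bisect_right(cumulative, r)
        picks.append(min(i, len(cumulative) - 1))
    return picks
```

Given the same random number generator state, both functions select the same particle indices; only the lookup cost differs.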

Also of interest is that the conversion from LaserScan to PointCloud2 is essentially free, but it should be noted that there is some overhead in running it as two programs. Thus it is only recommended to use external laser processing if you need the point cloud for something else anyway, or if you want to do more advanced processing, such as merging data from multiple lasers.

The overhead is even larger if we include the resource usage of both programs (not shown in this graph).

Also of note is that switching from the default float to double for pose components is essentially free when it comes to CPU usage.

Memory usage

Here, both implementations at least scale linearly, but AMCL does use significantly more memory, both for the baseline and per particle.

AMCL does have some structure members that are no longer in use (or only used in certain situations) as well as cases of inefficient memory layout resulting in unnecessary padding, but it seems strange it would cause such a huge difference.

Thus it isn’t clear what in AMCL causes the difference, as even with internal laser processing QuickMCL doesn’t come close to AMCL’s memory usage.

Also of note is that switching from the default float to double for pose components is essentially free for memory usage as well.

Conclusions

There are three main points to draw from this study:

QuickMCL handily beats AMCL when it comes to resource usage, both for CPU and memory. AMCL is, of course, better tested and has more features (beam model, extra odometry models, etc).
It might be worth making double precision the default for QuickMCL since it doesn’t appear that the difference between single and double precision is very large, at least on modern x86 CPUs.
External laser processing only makes sense if you need the point cloud for something else, or need to do more advanced processing (such as merging data from multiple lasers). Otherwise, you are better off using the internal laser processing.

Bibliography

Dieter Fox. Adapting the Sample Size in Particle Filters Through KLD-Sampling. The International Journal of Robotics Research, 22(12):985–1003, 12 2003. doi:10.1177/0278364903022012001.

AMCL reverse engineering

2019-04-04T00:00:00+02:00

These are my notes from reverse engineering AMCL. I did this to get a better understanding of state-of-the-art localisation when implementing my own Monte Carlo Localisation as a coursework project during my master’s degree in robotics.

Hopefully, it will be useful to someone else, but remember that you learn more by doing it yourself than by just reading about it.

Note that this is an analysis of what AMCL does differently from the book Probabilistic Robotics (Thrun, Burgard and Fox, 2005), and thus it assumes the reader understands the basic concepts.

Odometry

There are two interesting extras here:

If $\delta_{trans}$ is too small (<0.01 m), set $\delta_{rot1}$ to 0. This avoids some numerical instability when rotating in place.
It has a rather neat check for driving backwards, taking the minimum of $\delta_{rotN}$ and $\pi-\delta_{rotN}$ as the basis for the variances. This prevents severe overestimating of the noise when reversing.

Both of these features have been implemented in my own implementation as well.
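The two extras can be sketched as follows. This is a rough illustration under my own assumptions (poses as (x, y, theta) tuples, invented function names), not a copy of the AMCL source.

```python
import math


def normalize_angle(angle):
    """Wrap an angle to the range [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def decompose_motion(old_pose, new_pose, min_trans=0.01):
    """Split an odometry step into (delta_rot1, delta_trans, delta_rot2)."""
    dx = new_pose[0] - old_pose[0]
    dy = new_pose[1] - old_pose[1]
    delta_trans = math.hypot(dx, dy)
    if delta_trans < min_trans:
        # Rotating in place: the direction of travel is just noise.
        delta_rot1 = 0.0
    else:
        delta_rot1 = normalize_angle(math.atan2(dy, dx) - old_pose[2])
    delta_rot2 = normalize_angle(new_pose[2] - old_pose[2] - delta_rot1)
    return delta_rot1, delta_trans, delta_rot2


def rotation_for_noise(delta_rot):
    """Rotation magnitude used as the basis for the noise variances.

    Taking the smaller of the distance to 0 and the distance to pi means
    that driving straight backwards is treated like driving straight ahead.
    """
    return min(abs(normalize_angle(delta_rot)),
               abs(normalize_angle(delta_rot - math.pi)))
```

When reversing in a straight line, delta_rot1 comes out as roughly $\pi$, but rotation_for_noise maps it to roughly 0, so the noise is not overestimated.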

Map (from likelihood model perspective)

For the map AMCL stores two attributes per cell:

Occupancy state (occupied, free, unknown)
Distance to the nearest occupied cell.

This means AMCL does not use the full probability information in the original evidence grid.

The distance to the nearest occupied cell is computed via something similar to Dijkstra’s algorithm, flood filling outwards from the occupied cells, up to a limited max distance. The distance computed appears to be the L2 norm, however.

Side note: This is computed in a really smart way, pre-computing a lookup table for one quadrant of the grid (up to the limited max distance). This can be found in CachedDistanceMap in map_cspace.cpp.

Side side note: However, it could be optimised further since currently it uses double-pointer indirection (double**) as opposed to linearising the array and doing it with a multiplication followed by addition. This is a micro-optimisation however.
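A minimal sketch of such a flood fill is shown below. It is my own simplified reconstruction, not AMCL's code: distances are in cells rather than metres, the precomputed lookup table is replaced by a direct math.hypot call, and the result is stored in a linearised array indexed as y * width + x (the layout suggested in the side side note).

```python
import heapq
import math


def occupancy_distances(occupied, max_dist):
    """Distance (in cells) from each cell to the nearest occupied cell.

    occupied is a list of rows of booleans. Cells further away than
    max_dist keep the value max_dist.
    """
    height = len(occupied)
    width = len(occupied[0])
    dist = [float(max_dist)] * (width * height)
    marked = [False] * (width * height)
    queue = []  # Entries: (distance, x, y, source_x, source_y)
    for y, row in enumerate(occupied):
        for x, is_occupied in enumerate(row):
            if is_occupied:
                dist[y * width + x] = 0.0
                marked[y * width + x] = True
                heapq.heappush(queue, (0.0, x, y, x, y))
    while queue:
        _, x, y, sx, sy = heapq.heappop(queue)
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if not (0 <= nx < width and 0 <= ny < height):
                continue
            idx = ny * width + nx
            if marked[idx]:
                continue
            # L2 distance back to the obstacle this wavefront started from.
            d = math.hypot(nx - sx, ny - sy)
            if d > max_dist:
                continue
            dist[idx] = d
            marked[idx] = True
            heapq.heappush(queue, (d, nx, ny, sx, sy))
    return dist
```

Each cell is marked the first time a wavefront reaches it, so with several obstacles the stored value is the distance to the obstacle whose wavefront arrived first.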

Sensor model: Likelihood field

AMCL has both likelihood models (two of them) and a beam model, but the focus for this analysis is exclusively on the likelihood model.

AMCL uses a limited number of beams (30 by default), spaced equidistantly in the sensor code.

Because of the difference in the map it is non-trivial to work out if I’m doing the same thing as AMCL for computing the probability, but here is what AMCL does in Python-esque pseudo code (“de-optimised”):

for particle in cloud:
    pose = particle.pose + laser_offset
    p = 0
    for beam in subset(scan, max_beams):
        if not sanity_check(beam):
            # check beam for max range, NaN etc
            continue
        endpoint = trigonometry(beam, pose)
        map_coords = math(endpoint)
        if map_coords is outside_map:
            z = max_occ_dist  # Max distance we precomputed distances for
        else:
            z = map_data[map_coords].occ_dist
        pz = z_hit * exp(- (z ** 2) / (2 * sigma_hit ** 2))
        pz += z_rand / range_max
        # Here is a helpful assert to ensure pz is in [0, 1]
        p += pz ** 3
    particle.weight = p

All the time, the total weight is tracked as well and everything is normalised to [0, 1] at the end (assuming non-zero weights).

From this, it can be seen that AMCL does not use the algorithm in the book, but rather adds probabilities, but in a non-linear way (cubing them). The cubing is documented as “ad-hoc” but “works well”. In the range of interest [0, 1] this has the effect of reducing the values. However, this reduction is non-linear and small values will be reduced more.
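The effect of the cubing is easy to see numerically. The sketch below uses the per-beam formula from the pseudo code above; the parameter values are only illustrative choices of mine.

```python
import math


def beam_likelihood(z, z_hit=0.95, z_rand=0.05, sigma_hit=0.2, range_max=10.0):
    """Likelihood of one beam whose endpoint is z metres from an obstacle."""
    pz = z_hit * math.exp(-(z ** 2) / (2 * sigma_hit ** 2))
    pz += z_rand / range_max
    return pz


for z in (0.0, 0.2, 0.5):
    pz = beam_likelihood(z)
    # pz ** 3 / pz == pz ** 2 is the factor by which cubing shrinks the value.
    print(f"z={z:.1f} m  pz={pz:.4f}  pz^3={pz ** 3:.6f}  factor={pz ** 2:.4f}")
```

Since the reduction factor is pz squared, beams that already match the map poorly contribute far less to the sum than their raw likelihood would suggest.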

In addition to this algorithm, there is an extended likelihood field algorithm with something called beam skipping (not enabled by default), which according to the documentation is supposed to help with unexpected obstacles. If enabled, this part of the algorithm is still only active after the particle set has converged.

In a strange oversight, AMCL does not precompute the probability, only the distance to the nearest obstacles. It would have been quite possible to precompute the probability on that map. Maybe this is because AMCL has some sort of support for dynamic parameter change, which would break this if sigma_hit would change without getting a new map?
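A sketch of what such a precomputation could look like, assuming the distances are stored in a flat list as in the map description above. The function name and signature are hypothetical; nothing like this exists in AMCL as far as I can tell.

```python
import math


def precompute_likelihood(occ_dist, z_hit, z_rand, sigma_hit, range_max):
    """Turn per-cell obstacle distances into per-cell beam likelihoods.

    The table has to be rebuilt whenever the map or one of the sensor
    model parameters changes.
    """
    z_rand_term = z_rand / range_max
    denominator = 2 * sigma_hit ** 2
    return [z_hit * math.exp(-(d * d) / denominator) + z_rand_term
            for d in occ_dist]
```

The sensor update then reduces to one table lookup per beam instead of one call to exp per beam.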

Particle filter

As for the particle filter itself, its design is informed by the choice of resampling procedure (KLD).

The filter has a min and max number of particles, as well as various parameters related to KLD.

Convergence

There is the concept of convergence of the filter. This is computed by checking whether all particles are within a certain threshold of the mean position. Only x and y are considered for this; $\theta$ is not used. This seems to be used exclusively for the likelihood field model with beam skipping.
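A minimal sketch of such a check, assuming poses are (x, y, theta) tuples and that the threshold is applied to each axis separately (the per-axis comparison is my assumption):

```python
def is_converged(poses, dist_threshold):
    """True if every particle is close to the mean position in x and y."""
    if not poses:
        return False
    mean_x = sum(pose[0] for pose in poses) / len(poses)
    mean_y = sum(pose[1] for pose in poses) / len(poses)
    # theta (pose[2]) is deliberately ignored.
    return all(abs(pose[0] - mean_x) <= dist_threshold
               and abs(pose[1] - mean_y) <= dist_threshold
               for pose in poses)
```

A single outlier is enough to make the filter count as not converged, no matter how tightly the remaining particles are grouped.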

Action & sensor update

The action update is straight-forward, trivial even, just processing all the particles with the odometry model in use.

The sensor update has a couple of extra steps however after the sensor model has executed. Simplified pseudo-python:

if total_weight > 0:
    normalise_weights()
    w_slow = low_pass(w_slow, avg_weight, alpha_slow)
    w_fast = low_pass(w_fast, avg_weight, alpha_fast)
else:
    set_all_weights(1/particle_count)

As a side note, in this code there is both a very stupid thing and a very smart thing:

It recomputes the total to compute avg_weight, even though it already has the total (stupid).
The low pass filter is implemented as x = x + alpha * (y - x) instead of the more common expression x = (1 - alpha) * x + alpha * y, where y is the new sample. This is one less arithmetic operation (smart).
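The two forms of the low pass filter are algebraically identical, which the following small sketch demonstrates (function names are my own, y denotes the new sample):

```python
def low_pass_incremental(x, y, alpha):
    """Three operations: one subtraction, one multiplication, one addition."""
    return x + alpha * (y - x)


def low_pass_weighted(x, y, alpha):
    """Four operations: one subtraction, two multiplications, one addition."""
    return (1 - alpha) * x + alpha * y
```

Expanding the first form gives x + alpha * y - alpha * x, which is (1 - alpha) * x + alpha * y, so the two only differ by floating point rounding.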

Resampling

AMCL uses adaptive KLD resampling. According to comments, this is incompatible with low variance sampling, so that isn’t used.

By default, the adaptive sampling is not enabled (both alpha parameters are set to 0), but enabling it does seem to improve localisation.

Pseudo code for what it does:

def resample(self):
    # Switch between two particle sets
    set_a = self.current_set
    set_b = self.next_set
    # Cumulative probability table (huh?)
    c = array(double, set_a.sample_count + 1)
    c[0] = 0
    c[1:] = running_total([a.weight for a in set_a])

    # KD-tree for adaptive sampling (huh?)
    set_b.kdtree.clear()

    total = 0
    set_b.sample_count = 0
    w_diff = 1 - self.w_fast / self.w_slow
    if w_diff < 0:
        w_diff = 0

    while set_b.sample_count < self.max_samples:
        sample_b = set_b.samples[set_b.sample_count++]
        if uniform_rand(0, 1) < w_diff:
            # See amcl_node.cpp uniformPoseGenerator, basically generates random
            # poses inside known free spaces
            sample_b.pose = generate_random_pose()
        else:
            # "Naive discrete event sampler"
            r = uniform_rand(0, 1)
            for i in range(0, set_a.sample_count):
                if c[i] <= r < c[i+1]:
                    break
            sample_b.pose = set_a.samples[i].pose
        sample_b.weight = 1.0  # Surely this will be overwritten soon anyway?
        total += sample_b.weight
        set_b.kdtree.insert(sample_b.pose, sample_b.weight)
        # KLD sample check:
        if set_b.sample_count > self.resample_limit(set_b.kdtree.leaf_count):
            break
    # Reset low pass filters, to not make it too random
    if w_diff > 0:
        self.w_slow = self.w_fast = 0
    normalise_weights()
    # Compute cluster statistics
    self.cluster_statistics(set_b)
    swap_particle_sets()
    # Only used for beamskip model
    self.update_convergence()

The KLD criterion seems to be slightly different from what the book has. I need to look further into the KD-tree implementation used here.

The function responsible for the KLD criterion has the following pseudo-code:

def resample_limit(self, k):
    # Apparently taken directly from "Fox et al."
    # Appears to be code on page 264, table 8.4, lines 15-16
    if k <= 1:
        return self.max_samples
    b = 2/(9*(k-1))
    c = sqrt(b) * self.pop_z
    x = 1 - b + c
    n = ceil((k-1)/(2*self.pop_err) * x ** 3)

    # Limit to [min, max]
    if n < self.min_samples:
        return self.min_samples
    if n > self.max_samples:
        return self.max_samples
    return n

The KD-tree is interesting in that it inserts into a discrete grid, tracking the total weight for each cell (used for cluster handling later on).

The grid size is 0.5 meters in x/y and 10 degrees in $\theta$ (hardcoded).
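The discretisation can be sketched as below. A plain dictionary stands in for the KD-tree, and the use of floor for binning as well as all names are assumptions made for this illustration.

```python
import math
from collections import defaultdict

CELL_XY = 0.5                  # metres
CELL_THETA = math.radians(10)  # radians


def bucket_key(pose):
    """Discrete grid cell for a pose given as (x, y, theta)."""
    x, y, theta = pose
    return (math.floor(x / CELL_XY),
            math.floor(y / CELL_XY),
            math.floor(theta / CELL_THETA))


def bucket_weights(particles):
    """Total weight per occupied cell for (pose, weight) pairs."""
    buckets = defaultdict(float)
    for pose, weight in particles:
        buckets[bucket_key(pose)] += weight
    return dict(buckets)
```

The number of keys in the returned dictionary corresponds to the number of non-empty bins, which is the k passed to resample_limit above.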

Clusters

AMCL identifies clusters in the sample sets. This appears to be used for the actual localisation, to select the best candidate as well as to compute the covariance of the filter.

The clustering is quite simple, using the coarse KD-tree from the KLD buckets.

Put all leaves in a queue.
For each item in the queue:
Check if it is already in a cluster, if so skip to the next node
Assign a cluster ID to the node (using a counter).
Check all neighbour buckets ($3^3$) to see if they are non-empty, and if so assign them to the same cluster. Do this recursively. There is a slight bug here however, in that the algorithm doesn’t handle wraparound for the neighbours in $\theta$.
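The steps above can be sketched roughly as follows, with buckets identified by integer (x, y, theta) keys. This is my own illustration, not AMCL's code: it uses an explicit stack instead of recursion, and the optional theta_bins parameter is a hypothetical addition showing how wraparound in $\theta$ could be handled (36 bins of 10 degrees, with keys from -18 to 17).

```python
import itertools


def label_clusters(bucket_keys, theta_bins=None):
    """Assign a cluster ID to every non-empty bucket.

    With theta_bins=None the theta index does not wrap around, which
    matches the behaviour described above.
    """
    keys = set(bucket_keys)
    labels = {}
    next_id = 0
    offsets = [offset
               for offset in itertools.product((-1, 0, 1), repeat=3)
               if offset != (0, 0, 0)]
    for start in sorted(keys):
        if start in labels:
            continue  # Already part of a cluster
        labels[start] = next_id
        stack = [start]
        while stack:
            i, j, k = stack.pop()
            for di, dj, dk in offsets:
                nk = k + dk
                if theta_bins is not None:
                    half = theta_bins // 2
                    nk = (nk + half) % theta_bins - half
                neighbour = (i + di, j + dj, nk)
                if neighbour in keys and neighbour not in labels:
                    labels[neighbour] = next_id
                    stack.append(neighbour)
        next_id += 1
    return labels
```

Without wraparound, two groups of particles pointing at just under and just over 180 degrees end up in separate clusters even though they describe almost the same heading.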

Then the code processes the clusters:

Reset the statistics (mean, covariance, summed weight, particle count) for the clusters
Reset overall filter statistics (as above)
For each particle:
Get the cluster ID
Update statistics for both the total filter and each cluster
For each cluster normalise the statistics: Divide by number of particles to get mean, compute final covariance and so on.
Same for overall filter statistics

Later on, it will use the cluster with the highest total weight.

Overall node

For the ROS node itself, there appears to be multi-threading going on, since there is a mutex being used. However, this seems to only be protecting the configuration parameters. May be related to the support for dynamic reconfiguration. However, some other ROS packages do not use mutexes to protect dynamic reconfiguration.

The laserReceived callback

The function of most interest would be laserReceived, where the actual localisation code is executed.

There appears to be some sort of support for having multiple lasers as well as caching of transforms related to this.

There is a flag, pf_init, to special-case the first iteration.

A very high level overview of the function is (with multi-laser code removed):

def laser_received(self, scan):
    if not has_map():
        return
    get_laser_transform()
    pose = get_odometry(scan.header.stamp)
    force_publication = False
    if self.pf_init:
        pose_delta = pose - self.prev_pose
        self.do_filter_update = pose_delta > threshold
    if not self.pf_init:
        self.prev_pose = pose
        self.pf_init = True
        self.do_filter_update = True
        force_publication = True  # Always publish after the first iteration
        self.resample_count = 0
    elif self.pf_init and self.do_filter_update:
        run_odometry_model()

    resampled = False
    if self.do_filter_update:
        run_sensor_model()
        self.do_filter_update = False
        self.prev_pose = pose
        self.resample_count += 1
        if self.resample_count % self.resample_interval == 0:
            resample()
            resampled = True
        publish_cloud()
    if resampled or force_publication:
        # Compute single position hypothesis
        max_weight = 0
        for cluster in clusters:
            if cluster.weight > max_weight:
                best_cluster = cluster
                max_weight = cluster.weight
        estimated_pose = pose(best_cluster.mean, filter.covariance)
        publish_pose(estimated_pose)
        publish_transform(estimated_pose - odom_transform)
    else:
        republish_last_transform()

From this analysis it appears there is nothing particularly unexpected, but the code is complicated quite a bit by handling multiple lasers, lasers mounted upside down, and many other things.

AMCL does resampling unconditionally every resample_interval filter updates, by default 2.

AMCL bugs

I found some bugs in AMCL:

Clustering doesn’t handle wrap-around from $-\pi$ to $\pi$ (or vice versa). (AMCL bug)
Clustering is broken for cluster ID 0, since it uses 0 both for “not yet assigned” and “first cluster”. It should, however, work out in the end since that first cluster will just be written to twice, getting ID 1 (assuming it consists of more than one bucket).
Incorrect computation of circular variance, it is actually computing circular standard deviation. (AMCL bug)
Update (2019-04-14): Turns out this was not actually a bug. The two expressions are in fact equivalent.

Summary

There is a lot of smart code in AMCL, but also quite a lot of stupid code, and it is all pretty poorly documented. Parts are written in C and parts in C++. Not unexpected of code with its roots in research that many people have worked on over the years.

I also found some bugs, mostly with minor impacts. In addition to the bugs, several places use sub-optimal algorithms (as demonstrated above) or compute the same value more than once unnecessarily.

Oh and by the way, I put up my own MCL implementation publicly, called QuickMCL. It lacks some features that AMCL has but uses less computational resources.