Adam Pritchard's blog and miscellany on adam-p

The best – but not good – way to limit string length

Mon, 28 Apr 2025 14:13:48 -0400

Getting the length of a string seems simple and is something we do in our code every day. Limiting the length of a string is also extremely common in both frontend and backend code. But both of those actions – especially length limiting – hide a lot of complexity, bug-risk, and even vulnerability danger. In this post, we’re going to examine string length limiting deeply enough to help us fully grok what it means when we do it and how best to do it… and discover that the best still isn’t great.

A TL;DR misses the “fully grok” part, but not everyone has time to read everything, so here are the key takeaways:

- Be aware that there are different ways of measuring string length.
- Really understand how your programming language stores strings in memory, exposes them to you, and determines string length.
- Make an intentional decision about how you’re going to count characters when limiting string length.
- Look carefully at how the “max length” features provided by your language (framework, etc.) actually work. There’s a very good chance that they do not match the limiting method you chose.
- Make sure you use that same counting method across all the layers of your architecture.
- Probably limit by counting normalized Unicode code points. (Like Google recommends.)

With that out of the way, let’s start our investigation by looking at some of our familiar string length functions:

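| Language | Function | “a” | “字” | “🔤” | “👨‍👩‍👧‍👦” | “र्स्प” | “x̴͙̹̬̑̓͝͝” |
|---|---|---|---|---|---|---|---|
| Go | `len(string)` | 1 | 3 | 4 | 25 | 15 | 17 |
| JavaScript | `String.length` | 1 | 1 | 2 | 11 | 5 | 9 |
| Python 3 | `len(str)` | 1 | 1 | 1 | 7 | 5 | 9 |
| Swift | `String.count` | 1 | 1 | 1 | 1 | 1 | 1 |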

Those four measurements of string length are exemplars of the approaches common to most programming languages: UTF-8 bytes, UTF-16 code units, Unicode code points, and grapheme clusters.

## Character encodings and terminology

There are good explanations of this stuff elsewhere, but let’s try to quickly get a handle on the concepts we need to go further. (Feel free to skip anything you’re already comfortable with.)

First, a working definition of a “character”: This is the human conceptual ideal of the smallest building block of most written languages¹: a letter, an emoji, an ideograph, a punctuation mark, a symbol, a grapheme. Later on we’ll also think of a character as “if a user types this, they expect a ‘character count’ to increase by 1”. All of the examples in the above table probably look like “a character” to you.

(We’re going to avoid using this term lightly, in favour of technical correctness, to avoid confusion. So I’m never going to say “Unicode character”.)

### Unicode

Unicode is humanity’s attempt to list all possible characters, plus a whole lot more: control characters, non-printable characters, fragments that can be combined to form characters, and so on.

Each entry in the Unicode space is called a “code point” and is represented with a 32-bit unsigned integer, though the actual usable space only needs 21 bits (code points run from U+0000 to U+10FFFF, about 1.1 million possible values) and only about 150,000 of those values have been assigned. You might also see the term “Unicode scalar values” – these are basically the same as code points, but exclude the reserved “surrogate pair” range.

Examples (character: code, decimal)

“a”: U+0061, 97
“字”: U+5B57, 23383
“🔤”: U+1F524, 128292
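As a quick check of those values, here is a minimal Python sketch. Python's `str` is indexed by code point, so `ord()` returns the code point value directly. The helper name `describe` is made up for illustration.

```python
def describe(ch: str) -> str:
    """Format a single code point as 'U+XXXX, decimal'."""
    cp = ord(ch)  # the code point value as an int
    return f"U+{cp:04X}, {cp}"

# "a", "字", "🔤" written as escapes so the source stays unambiguous
for ch in ["a", "\u5b57", "\U0001F524"]:
    print(describe(ch))
```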

Strictly speaking, a “Unicode code point” is an abstract concept, with a numerical value assigned to each character. There are 3 common concrete schemes for encoding those code points: UTF-8, UTF-16, and UTF-32. “UTF-32” is a direct representation: 32 bits to encode the 32 bits, with endian variants. I’m going to prefer just saying “Unicode code point” for clarity. We’ll discuss UTF-8 and UTF-16 at length below.

Note that in Go, a Unicode code point is typically called a “rune”. (Go seems to have introduced the term for the sake of brevity². I certainly appreciate that, but I’m going to stick with universal terms here.)

### Grapheme cluster

Some Unicode code points can be combined and rendered into a single visual character; we call this a grapheme cluster (or extended grapheme cluster).

We’ll go through some examples, to help illuminate the complexity.

“👨‍👩‍👧‍👦”: composed of 7 code points
1. “👨”: U+1F468, “man”
2. zero-width joiner: U+200D, HTML entity `&zwj;`
3. “👩”: U+1F469, “woman”
4. zero-width joiner: U+200D, HTML entity `&zwj;`
5. “👧”: U+1F467, “girl”
6. zero-width joiner: U+200D, HTML entity `&zwj;`
7. “👦”: U+1F466, “boy”

The family emoji is an example of a “zero-width joiner (ZWJ) sequence”.
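If you want to see the pieces of a ZWJ sequence for yourself, a small sketch like this will list them. It builds the family emoji from escapes and uses the standard `unicodedata` module to look up each code point's name.

```python
import unicodedata

# The family emoji: man, ZWJ, woman, ZWJ, girl, ZWJ, boy
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# Iterating a Python str yields one code point at a time
names = [unicodedata.name(cp) for cp in family]

for cp, name in zip(family, names):
    print(f"U+{ord(cp):04X} {name}")
```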

Not all possible combinations of emoji (or code points generally) can be combined in such a way to create a single grapheme cluster character. The Unicode consortium publishes all defined emoji, including multi-code-point compounds.

Whether and how an emoji is rendered depends on the platform you’re viewing it on. For example, the family emoji is drawn differently in Brave browser on Windows than it is on Android. [Images: the family emoji as rendered on Windows and on Android]

And rendering can change over time. In 2016, with Windows 10, Microsoft introduced the “ninjacat” ZWJ sequence emoji, combining the “cat” and “ninja” emoji. It wasn’t supported by any other platform. In 2021, Microsoft removed support for it and now it renders as two separate emoji.

The family emoji itself has gone through some major changes. (Maybe it was even removed from iOS at some point, but it seems to render fine now.)

There are also “combining marks” that are accents or other fragments that don’t require a ZWJ between the code points.

“é”: composed of 2 code points
1. ordinary letter “e”: U+0065
2. “́”: U+0301, “Combining Acute Accent”

This is an example of a grapheme cluster that can instead be represented with a single code point: U+00E9 is “Latin Small Letter E With Acute” and is visually identical to the above, decomposed cluster. (See “Unicode normalization”, below.)

Characters formed by grapheme clusters (that can’t be normalized away) are extremely uncommon in languages that use European and East Asian scripts, but fairly common in South Asian scripts; in Hindi, for example, ~25% of characters involve a combining mark.

“र्स्प”: composed of 5 code points
1. “र”: U+0930, “Devanagari Letter Ra”
2. “्”: U+094D, “Devanagari Sign Virama”
3. “स”: U+0938, “Devanagari Letter Sa”
4. “्”: U+094D, “Devanagari Sign Virama”
5. “प”: U+092A, “Devanagari Letter Pa”

And then there’s Zalgo text, which abuses combining marks in c̴͚͉͔̓̑͂͜r̷̙̎̎̿͊a̵̜͍̱̋̕z̷̭̰͉͊̎́͒y̵̺̿̔ ways:

“x̴͙̹̬̑̓͝͝”: composed of 9 code points
1. ordinary letter “x”
2. “̴”, U+0334, “Combining Tilde Overlay”
3. “͝”, U+035D, “Combining Double Breve”
4. “̑”, U+0311, “Combining Inverted Breve”
5. “͝”, U+035D, “Combining Double Breve” (again)
6. “̓”, U+0343, “Combining Greek Koronis”
7. “͙”, U+0359, “Combining Asterisk Below”
8. “̹”, U+0339, “Combining Right Half Ring Below”
9. “̬”, U+032C, “Combining Caron Below”

As far as I can tell, there is no limit on the number of code points that can contribute to a single grapheme cluster “character”. We’ll certainly keep that in mind when we think about how to limit string length, below.

(You will sometimes see the word “segmentation” used when talking about extracting grapheme clusters from a string. It generically refers to breaking a string into defined pieces; for example, the JavaScript Intl.Segmenter API can break a string into graphemes, words, or sentences.)

### Unicode normalization

Unicode normalization has two functional axes: composition/decomposition and compatibility simplification, resulting in four modes with standard names. The “NF” is “normalization form”; “C” is “canonical composition”, “D” is “canonical decomposition”; “K” is “compatibility simplification”. Which gives us:

| | No Simplification | Simplification |
|---|---|---|
| Composition | NFC | NFKC |
| Decomposition | NFD | NFKD |

Composition canonicalization will combine the two-code-point grapheme cluster form of “é” (U+0065 + U+0301) into the single-code-point form “é” (U+00E9); and similar for other grapheme clusters for which an equivalent single code point exists (and does nothing when there is no single code point; e.g., mostly doesn’t change Zalgo text). Decomposition canonicalization will do the opposite: single-code-point characters will be decomposed into multi-code-point grapheme clusters (this includes decomposing Korean Hangul into jamo phonetic components).

Compatibility simplification converts some fancy characters into more ordinary ones. For example, “ℍ” (U+210D) and “ℌ” (U+210C) become plain Latin “H”; superscript “²” becomes plain number “2”; “ﬃ” (ligature) becomes “ffi”. The non-simplifying forms don’t replace such characters. (Note that ordinary diacritics aren’t removed – “é” keeps its accent.) Unlike canonicalization, simplification is not reversible.
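To make the four forms concrete, here is a small sketch using Python's standard `unicodedata.normalize`. The helper name `normalize_all` is mine, and the samples are the characters mentioned above.

```python
import unicodedata

def normalize_all(text: str) -> dict:
    """Return the text in each of the four normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) for form in forms}

samples = {
    "decomposed e-acute": "e\u0301",  # "e" + Combining Acute Accent
    "composed e-acute": "\u00e9",     # single code point
    "double-struck H": "\u210d",
    "superscript two": "\u00b2",
    "ffi ligature": "\ufb03",
}

for label, text in samples.items():
    # Show the code point count of each form
    lengths = {form: len(value) for form, value in normalize_all(text).items()}
    print(label, lengths)
```

The composed and decomposed “é” converge under NFC (one code point) and under NFD (two), while only the “K” forms turn “ℍ” into a plain “H”.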

NFC is good for keeping a string as compact as possible while ensuring greater consistency.

NFKC is good if you want the string to be searchable or otherwise comparable using equivalent characters (like matching “ℍ𝔼𝕃𝕃𝕆” when searching for “HELLO”).

The decomposition forms (with additional processing) are useful for stripping out all accents; e.g., if you want plain ASCII for searching or for use in filenames.
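The “additional processing” for accent stripping can be as small as dropping combining marks after decomposing. This is only a rough sketch: `strip_accents` is a made-up helper, and it won't produce ASCII for letters that have no decomposition (“ø”, for example).

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose with NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining code points
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("caf\u00e9 cr\u00e8me"))  # cafe creme
```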

Note that the output of normalization may change between versions of Unicode. See the Unicode Versions section, below, for discussion of how that kind of thing can go wrong.

### UTF-8

UTF-8 encodes code points into a sequence of 1, 2, 3, or 4 one-byte code units. It has the very nice property of providing a compact encoding for most string data; in particular, all ASCII printable characters fit in one byte, with the same ASCII numeric value (e.g., if you open an ASCII view of a UTF-8 encoded source file, you can probably read it just fine). It is by far the most common encoding to find serialized to disk or on the wire (in recent years).

Note that there is overhead in the design³, so you don’t get the full 8*bytes number of bits to represent code points. Here’s how it breaks down:

1 byte: 7 bits
2 bytes: 11 bits
3 bytes: 16 bits
4 bytes: 21 bits
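Those bit budgets determine where the encoded length steps up. A quick sketch that probes the boundaries (the helper name `utf8_len` is made up):

```python
def utf8_len(code_point: int) -> int:
    """Number of UTF-8 bytes needed to encode one code point."""
    return len(chr(code_point).encode("utf-8"))

# The last code point that fits in each size, and the first that doesn't
for cp in [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]:
    print(f"U+{cp:04X}: {cp.bit_length()} bits -> {utf8_len(cp)} byte(s)")
```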

### UTF-16

UTF-16 doesn’t often get used on the wire or on disk, but many programming languages and OSes use it in-memory. The reason is that some platforms and languages originally supported UCS-2, which is the old 2-byte Unicode standard. When Unicode increased to 4 bytes, UTF-16 was created. It uses 1 or 2 two-byte code units in a code point sequence. Sequences of a single code unit are identical to and backwards compatible with UCS-2, making the transition for UCS-2 platforms reasonably easy.

(Two UTF-16 code units making a code point is called a surrogate pair. There is a “surrogate pair” reserved area in the UCS-2 spec that is used to indicate when a second UTF-16 code unit is used in the code point sequence.)
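For the curious, the surrogate pair arithmetic is short enough to sketch. `to_surrogates` is a hypothetical helper that applies the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves), and the last line checks it against Python's own encoder.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F524)  # the "🔤" emoji
print(f"{high:04X} {low:04X}")      # D83D DD24
print("\U0001F524".encode("utf-16-be").hex())
```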

UTF-16 has the nice property that the entire “Basic Multilingual Plane (BMP)” fits in a single UTF-16 code unit. That’s the “majority of the common characters used in the major languages of the world” (but doesn’t include emoji, notably). The downside is that it takes twice as much space to represent ASCII characters.

(Note that it takes three UTF-8 bytes to fit the entire BMP.)

### Other encodings

We’re not going to get into code pages, WTF-8, CESU, etc. They’re not relevant to the task at hand (and I don’t know enough to say anything useful).

## String lengths

Now that we understand character encoding better, let’s revisit the table above.

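| Encoding Count | “a” | “字” | “🔤” | “👨‍👩‍👧‍👦” | “र्स्प” | “x̴͙̹̬̑̓͝͝” |
|---|---|---|---|---|---|---|
| UTF-8 code units | 1 | 3 | 4 | 25 | 15 | 17 |
| UTF-16 code units | 1 | 1 | 2 | 11 | 5 | 9 |
| Unicode code points | 1 | 1 | 1 | 7 | 5 | 9 |
| Grapheme clusters | 1 | 1 | 1 | 1 | 1 | 1 |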
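The first three rows of that table are easy to reproduce from a single language. Here is a sketch in Python; the `lengths` helper is made up. The grapheme cluster row is left out because, as far as I know, Python's standard library has no grapheme segmenter.

```python
def lengths(text: str) -> dict:
    """Count the same string three different ways."""
    return {
        "utf8": len(text.encode("utf-8")),
        # "utf-16-le" avoids the byte order mark; each code unit is 2 bytes
        "utf16": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
    }

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
devanagari = "\u0930\u094d\u0938\u094d\u092a"
zalgo = "x\u0334\u035d\u0311\u035d\u0343\u0359\u0339\u032c"

for sample in ["a", "\u5b57", "\U0001F524", family, devanagari, zalgo]:
    print(lengths(sample))
```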

So, different programming languages (and functions within a language) give us different methods of counting. Some examples⁴:

- UTF-8: Go, Rust, C++’s std::u8string, Haskell’s Text v2
- UTF-16: Java/Android, JavaScript (and TypeScript), C#/.NET (and Windows), Objective-C/iOS (NSString), Haskell’s Text v1
- Unicode code points: Python 3, Elixir, Ruby, PostgreSQL, Haskell’s Char
- Grapheme cluster: Swift’s String.count, Elixir’s String.length, Perl 6

Note that many (probably all) languages provide ways of converting between encodings and counting “length” in those other encodings; the above are just defaults. There might also be a difference between which encoding is used in-memory versus which is presented as the main programming interface to access those strings.

To understand how our programming languages of choice deal with string length, it’s worth taking a step back and thinking about what a string is. The definition of “string” that many of us would give is something like “a bunch of characters”. But we’ve seen now that “character” only has an abstract meaning, so it’s not enough to help us when we use a string type. We need to know and keep in mind two things:

the underlying in-memory representation of strings, and
the view into that representation presented to us.

A few examples:

Go’s string type is really an array of bytes. The intention is that those bytes hold UTF-8 code units, but there’s no guarantee of UTF-8 sequence validity. len(string) gives the byte length of the string. If you want to iterate over the bytes/code units, index into the string or convert it to []byte, because if you just range over the string, each step will give you a Unicode code point/rune. Go provides unicode/utf8.RuneCountInString to get the code point count. It also has the unicode/utf16 package for converting between runes (code points) and UTF-16 code units. It has no built-in support for grapheme cluster segmentation.

JavaScript’s string type is a sequence of UTF-16 code units and string.length gives you a count of those code units. [...string] gives an array of Unicode code points. TextEncoder converts to UTF-8. Intl.Segmenter provides access to grapheme clusters.

Swift’s underlying representation used to be UTF-16 but has been UTF-8 since 2019. Its Character type holds a single grapheme cluster and String.count returns grapheme cluster count. To access encodings, it provides String.UTF8View, String.UTF16View, and String.UnicodeScalarView.⁵

Having a deeper understanding of how strings work under the hood will help prevent the confusion of discovering that a single emoji has a length of 7, and the bugs that can follow from that confusion.

### Limiting and Consistency

We finally get to the real point of this post!

Because there are 4 different ways to encode characters, there are 4 different ways to count string length. Because there are 4 different ways to count string length, there are (at least) 4 different ways to limit the length of a string.

That makes it really easy to be inconsistent across levels of your architecture, leading to bugs and bad user experience. They can be easy to miss in testing, since certain characters and character combinations might be needed to reveal them.

Here are a few length limiters that I was looking at when I decided I needed to write this post:

- Go’s validator package (not stdlib, but popular): The max and min limiters count Unicode code points (“runes” in Go-speak).
- If you create a column constraint with PostgreSQL’s char_length(), it’ll limit by Unicode code points (by default).
- The HTML attribute maxlength on input or textarea elements limits by UTF-16 code units. (Except… I have found that IE limits by Unicode code points instead.)
- React Native’s TextInput.maxLength limits by UTF-16 code units. This is because on iOS it uses NSString.length and on Android it uses InputFilter.LengthFilter.

Rant: It’s really annoying that the method of counting length often isn’t immediately apparent in the documentation. I should not have had to dig into React Native source code to get the answer. (And this isn’t just an RN problem.)

Inconsistencies can arise between frontend clients and backend API servers, between API and database, between different client implementations, between different servers accessing the same database, and so on. Let’s look at some problems that inconsistent length-limiting can lead to.

If the frontend allows longer input than the backend (perhaps by the frontend allowing 100 Unicode code points while the backend allows 100 UTF-8 or UTF-16 code units), the frontend may indicate that the user’s input is valid just to have it rejected by the backend.

If the frontend has a shorter input limit than the backend, the user will be unnecessarily restricted. In the case of something like a username, if the account was created in a frontend with consistent counting and then the user tries to log in on a client with too-short counting, they might be unable to. One can imagine other too-long data might cause a frontend to freak out: assertion failure, refuse to display it, force the user to change it, etc.

Similar problems will occur with length-restrictions at other levels, such as between backend(s) and database.
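Here is a toy illustration of the first scenario. Both validators and the limit are hypothetical; the only thing that differs between them is the counting unit.

```python
LIMIT = 100  # hypothetical limit that both layers believe they share

def frontend_accepts(text: str) -> bool:
    # Hypothetical frontend check: counts Unicode code points
    return len(text) <= LIMIT

def backend_accepts(text: str) -> bool:
    # Hypothetical backend check: counts UTF-16 code units
    return len(text.encode("utf-16-le")) // 2 <= LIMIT

message = "\U0001F524" * 100       # 100 emoji from outside the BMP
print(frontend_accepts(message))   # True: 100 code points
print(backend_accepts(message))    # False: 200 UTF-16 code units
```

Note that a plain ASCII test string passes both checks, which is exactly why this kind of mismatch is easy to miss in testing.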

Be intentional and consistent with how you limit string length. Don’t just blindly use the default on whatever OS you happen to be on and whatever language you happen to be programming in.

Problem severity reality check: Probably allowing this kind of inconsistency won’t result in a fatal bug. Probably it won’t render your UI unusable or crash your server. Probably it won’t introduce an ugly security vulnerability. Most users aren’t going to create usernames that are maximum length. Most users won’t use characters beyond the BMP in their username (although expect emoji elsewhere). You could probably go your whole career not paying any attention to this and probably not get bitten hard by it. But a) you’re curious, and b) you want to limit the margins on “most” and “probably”, so let’s figure out how to do this better.

### How to limit?

This is the question that really prompted me to write this post: What’s the best way to limit string length input?

Let’s go through each encoding type again and consider them as candidates for limiting. We’ll be thinking about it from both UI and API points of view. Keep in mind that the way you’re “limiting” is often different on the frontend versus backend: on the backend, you’re probably just rejecting the input (HTTP 400); on the frontend, you’re either a) showing the user they’ve typed beyond the limit and telling them to fix it, or b) preventing them typing beyond the limit, or c) truncating the input.

Let’s also make explicit: Usually, the reason we want to limit count is because we want to limit size. We don’t want infinitely long usernames, chat messages, or even books. Even if we had no trouble displaying them, we can’t process and store them. Limiting by count rather than byte-size can be more humane, and we’ll keep this in mind as we consider approaches.

#### Grapheme clusters

Because they’re closest to representing what humans think of as a “character”, grapheme clusters seem like they should be the right thing to count. But they’re actually the worst because, again, there’s no limit on the number of code points in a grapheme cluster. Maybe you can use them to limit count, but you can’t use them to limit size. (At maximum “craziness”, zalgo.org adorns the letter “x” with 360 accents and whatnot – for 361 total code points. Swift’s String.count still, correctly, says that it’s one grapheme cluster.)

(Perhaps you could limit the number of code points per cluster, but you’d be breaking the Unicode spec and it’s probably a bad idea. For something along those lines, see the “Hybrid Counting” section below.)
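To put numbers on the size problem, here is a sketch that builds a stand-in for the zalgo.org example: an “x” followed by 360 copies of one combining mark. (The real thing mixes many different marks; repeating U+0301 is my simplification.)

```python
# One base letter plus 360 combining marks; a Unicode-aware segmenter
# treats this as a single grapheme cluster.
zalgo = "x" + "\u0301" * 360

code_points = len(zalgo)                # 361
size_bytes = len(zalgo.encode("utf-8")) # 1 + 360 * 2 = 721

# A "100 character" message made only of these clusters
message = zalgo * 100
message_bytes = len(message.encode("utf-8"))

print(code_points, size_bytes, message_bytes)
```

So a limit of 100 grapheme clusters would happily admit this message of roughly 72 KB, and nothing stops an attacker from stacking more marks.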

Grapheme clusters also require additional processing and interpretation. This is more of a rendering problem (e.g., iPhones have had multiple crash bugs), but a complex code path during input validation (potentially before authentication, etc.) should give anyone pause.⁶

To a greater degree than the other options, operating at the grapheme cluster level is sensitive to differences between Unicode versions. For example, if a new grapheme cluster is added, a newer segmentation algorithm will count it as 1, while an older Unicode version may count it as the individual code points.

Not all programming languages have built-in grapheme cluster segmentation support. For example, Go has third-party packages, but no support in the stdlib or in golang.org/x/text.

#### UTF-16 code units

If you’re starting from the frontend, there’s a good chance that UTF-16 will be your default choice. There are reasons why it’s not a good choice.

A danger in length-limiting by counting code units is that you inadvertently break a character by truncating some of the code units in a sequence. React Native’s iOS limiter suffered this for at least 5 years before fixing it. It looks like RN’s Android limiter avoids this (“If a surrogate pair is encountered at the boundary, it backs off by one character to avoid splitting the pair”).
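A sketch of what “backing off” can look like when limiting by UTF-16 code units. This is my own illustration, not React Native's code, and `truncate_utf16` is a made-up name. It is only surrogate-pair-safe; it can still cut a grapheme cluster in half.

```python
def truncate_utf16(text: str, max_units: int) -> str:
    """Keep whole code points while staying within max_units UTF-16 code units."""
    units = 0
    kept = []
    for ch in text:
        width = 2 if ord(ch) > 0xFFFF else 1  # surrogate pair or single unit
        if units + width > max_units:
            break  # adding this code point would exceed the limit
        units += width
        kept.append(ch)
    return "".join(kept)

text = "ab\U0001F524"  # 2 + 2 = 4 UTF-16 code units

# Naive cut at 3 code units leaves a lone high surrogate at the end
naive_bytes = text.encode("utf-16-le")[: 3 * 2]
last_unit = int.from_bytes(naive_bytes[-2:], "little")
print(hex(last_unit))            # 0xd83d

print(truncate_utf16(text, 3))   # ab
```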

Counting by anything except grapheme clusters means that your count can jump by more than one per “character”, which is not the worst thing ever, but obviously undesirable UX. There are two reasons for this jump-counting:

- Applies to UTF-8, UTF-16, Unicode code points: If the character is a multi-code-point grapheme cluster.
- Applies to UTF-8, UTF-16: Multiple code units are required to encode the code point.

The first reason can be somewhat mitigated by using NFC normalization with every keystroke – to the extent possible, that will reduce clusters down to single code points. (Introducing latency while typing is also bad UX! Make sure to test the timing and feel.) We discuss this point more in the Unicode code points section below.

The second reason can’t be mitigated for UTF-8 and UTF-16 – by design, they can require multiple code units per code point. However, it’s much less of a problem for UTF-16, since the whole Basic Multilingual Plane fits in one code unit, though it depends on the type of input expected – it’ll still be a two-count per emoji, at least.

I’m going to reiterate the caveat to that last point: the vast majority of commonly used characters are located in the BMP, and therefore fit in one UTF-16 code unit. Therefore, for the vast majority of commonly used characters, counting by UTF-16 code units is equivalent to counting by Unicode code points. The big exception to this is emoji.

Another big factor is that, while UTF-16 is quite commonly used in-memory, it’s very unusual for it to be serialized on the wire or to disk (UTF-8 is used most commonly, rarely UTF-32). Because “size on wire” and “size on disk” are things we care about when limiting, it feels strange to count by the one format that we know won’t be serialized. But let’s not overstate it: UTF-16 will still get us within about a factor of two of serialized size.

Using UTF-16 maximizes the amount of encoding conversion we’re going to have to do.

Additionally, in programming languages that primarily use UTF-8 or Unicode code points, UTF-16 seems to be a second-class citizen. For example, in Go, strings are UTF-8 and runes are what you get when you iterate over a string, but UTF-16 support requires the unicode/utf16 package. PostgreSQL doesn’t seem to support UTF-16 at all.

So, UTF-16 isn’t obviously terrible, but we can do better.

#### UTF-8 bytes

UTF-8 is a tempting choice because a) it’s a simple byte count, and b) it’s likely what we’re serializing to on the wire and on disk. It’s also very space-efficient for English text, although less so for other languages.

Counting characters by UTF-8 bytes in the UI is doomed to be confusing. For anything but plain English, the count will regularly increment by 2 or 3 or 4, and in ways that are not reasonable for a human to predict.
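To see how jumpy a UTF-8 counter would be, this sketch lists how much the count grows as each code point is typed. It assumes one code point per keystroke, which is a simplification, and `utf8_increments` is a made-up helper.

```python
def utf8_increments(text: str) -> list:
    """How many bytes each successive code point adds to a UTF-8 counter."""
    return [len(ch.encode("utf-8")) for ch in text]

typed = "na\u00efve \u5b57\U0001F524"  # "naïve 字🔤"
print(utf8_increments(typed))          # [1, 1, 2, 1, 1, 1, 3, 4]
```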

It’s important to note that there is a wide range in the significance of a string length limit to the UI/UX. On one end of the spectrum, there are very short fields with visible character counts; an obvious example is the old 140-character tweet limit, which users would stare at while trying to figure out creative contractions that wouldn’t obscure their point too much. On the other end is, say, a 10,000-character message board limit, where it’s not expected that many users will get near it; probably you don’t even bother showing the count, and you just hard-limit the input.

When there’s a very high limit, added for sanity and safety’s sake, it doesn’t really matter how you count characters. You’re not exposing the count to the user, and you probably don’t care if the limit is 10,000 UTF-32 code points, 20,000 UTF-16 code units, or 40,000 UTF-8 bytes.

But when the count is low and exposed to the user… UTF-8 counting is going to look weird.

#### Unicode code points

Argument from authority: Google’s API design guidance states that string size limits must be measured in Unicode code points, and that those strings should be NFC Unicode normalized.

I think this is approximately the best approach – not just for APIs, but also for UIs – but it’s not ideal. (I would also upgrade the “should” be normalized to “should almost always”.)
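A minimal sketch of what that check could look like, assuming the input is normalized to NFC before counting. The function names are mine, not Google's.

```python
import unicodedata

def nfc_length(text: str) -> int:
    """Count Unicode code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

def within_limit(text: str, max_code_points: int) -> bool:
    return nfc_length(text) <= max_code_points

decomposed = "e\u0301" * 10           # ten "é", each typed as two code points
print(len(decomposed))                # 20 before normalization
print(within_limit(decomposed, 10))   # True: NFC composes each pair into one
```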

Counting by Unicode/UTF-32 code points means that everything in the Basic Multilingual Plane and many emoji get a count of 1. That’s an improvement over UTF-16 (and UTF-8). It also doesn’t suffer from the interpretative counting of grapheme clusters.

But…

It still suffers from many of the problems mentioned above:

- It can still count by more than one. Multi-code point grapheme clusters are not uncommon for some scripts (such as Devanagari, used by more than a billion people) and some emoji (such as all country flags).
- Input limiting by code point in a way that isn’t grapheme-cluster-aware might truncate characters – causing them to be nonsense or otherwise confusingly incorrect. This isn’t hypothetical! I found that React Native on Android (but not iOS) does exactly this. (The Canada flag emoji is a cluster made of a “Regional Indicator Symbol Letter C” followed by a “Regional Indicator Symbol Letter A”, and the length limit truncated it after the “C”.)
- Normalization is needed to keep the number of multi-code-point grapheme clusters down, possibly on every keypress in the UI. Not a big deal, but extra complication and processing.
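The flag truncation is easy to reproduce with a code-point-based cut. This sketch slices the Canada flag after one code point and looks up what is left.

```python
import unicodedata

# Canada flag: Regional Indicator C followed by Regional Indicator A
flag_ca = "\U0001F1E8\U0001F1E6"

truncated = flag_ca[:1]  # a "limit" that lands in the middle of the cluster
remaining_name = unicodedata.name(truncated)

print(len(flag_ca))      # 2 code points for one visible flag
print(remaining_name)    # REGIONAL INDICATOR SYMBOL LETTER C
```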

Even with those caveats, I think counting by Unicode code points is the sanest choice.

#### Hybrid counting

If we were inventing the best possible counting method, what would it look like?

Probably: Grapheme clusters without the unbounded risk.

We would want:

- a count of 1 for each “character” – even multi-code-point grapheme clusters in scripts like Devanagari
- a count of 1 for each emoji – even multi-code-point compounds
- a count of more than 1 for technically-a-single-grapheme Zalgo explosions

I think it would be reasonable to do that with logic like:

Iterate through grapheme clusters:
- Are there N or fewer code points in the cluster? It counts as 1.
- Otherwise, it counts as 1 plus the number of code points above N.
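The counting rule above is only a few lines once segmentation is done. This is a rough sketch of it, not a full implementation: it assumes the string has already been split into grapheme clusters by something else (Python's standard library can't do that), and `hybrid_length` and its default `n` are placeholders.

```python
from typing import Iterable

def hybrid_length(clusters: Iterable[str], n: int = 10) -> int:
    """Count pre-segmented grapheme clusters, penalizing oversized ones.

    `clusters` is assumed to come from a grapheme segmenter; each item is
    one cluster, and len(cluster) is its number of code points.
    """
    total = 0
    for cluster in clusters:
        code_points = len(cluster)
        if code_points <= n:
            total += 1                        # an ordinary "character"
        else:
            total += 1 + (code_points - n)    # 1 plus the excess over n
    return total

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # 7 code points
zalgo = "x" + "\u0301" * 39                                            # 40 code points

print(hybrid_length(["a", family]))    # 2
print(hybrid_length([zalgo], n=30))    # 11
```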

What should N be? (Following answer mostly based on this informative StackOverflow post.)

The longest well-defined grapheme cluster I have found is 10 code points ("👨🏻‍❤️‍💋‍👨🏼" – “Kiss - Man: Medium-Light Skin Tone, Man: Medium-Light Skin Tone”). So maybe 10 should be our limit.

Unicode defines a Stream-Safe Text Format, which applies a limit of 30 (post NFKD normalization). “The value of 30 is chosen to be significantly beyond what is required for any linguistic or technical usage. While it would have been feasible to chose a smaller number, this value provides a very wide margin, yet is well within the buffer size limits of practical implementations.” So maybe 30 should be our limit.

(If I understand the Stream-Safe Text Process correctly, it provides a decomposition algorithm that is similar to what I suggested above. However, it would be more generous: a Zalgo character with 100 combining marks (101 code points, counting the base letter) under my counting system would get a length of “72” (if N=30); under the Stream-Safe Text Process, it would have a length of “4”. I’m inclined to be less generous, but either approach still provides a bound, which is what we want.)

Even if I think this kind of approach is optimal, I don’t think it’s perfect.

The byte size allowed would need to be about 100x the length limit. That’s… kind of a lot? (My gut feeling is equivocal.)

Many of the problems mentioned above with using grapheme clusters also apply here, including Unicode version concerns.

It’s very hard to recommend such a non-standard method of counting for anything that’s intended to be consumed externally (e.g., APIs), as it will cause a lot of confusion. Even internally, standardizing that between frontend, backend, and DB would carry a lot of bug risk. So, it would be more feasible if there were a common standard to refer to, code to, and consume, allowing for consistency in understanding and implementation.

With a snappy name. Maybe “graph length”.

In the Appendix, I include some implementations of this algorithm.

##### Twitter counting

Twitter uses an interesting hybrid counting method (docs, code (Apache licensed)). Here are some examples (using twttr.txt.getTweetLength()):

“a”: 1
“ӑ”: 1
“字”: 2
“🔤”: 2
“👨‍👩‍👧‍👦”: 2
“र्स्प”: 5
“x̴͙̹̬̑̓͝͝”: 9

I won’t repeat the docs or logic, but generally:

- Many letters and punctuation get a count of 1
- Chinese, Japanese, Korean glyphs get a count of 2
- All valid emoji (even clusters) get a count of 2
- All other code points get a count of 2 each

NFC Unicode normalization is performed before any counting. The API requires UTF-8 encoding.

(Strange: The doc says that “Ồ” (U+1ED2) should have a count of 1, but I get 2.)

I like that there’s mostly an intuitive rule that’s like, “simple characters get a 1-count; complex characters get a 2-count”. Except:

My use of “simple” sure is Euro-centric
The rule doesn’t apply to more than a billion people (Hindi writers, etc.)

## Other considerations

There remain important questions and problems that are outside the scope of this post, but I’ll mention them here so you can keep them in mind.

### What to do when the limit is hit

This post is primarily about how to do the counting to determine if a limit has been hit, but also important is the question “and then what?” We’ll look at the two main categories of possible reactions and their frontend/backend concerns.

#### Reject

The simplest response to too-long input is just to reject it. You’re not manipulating the string to “fix” it, maybe you’re not even telling the user how much they’re over the limit, you just say “nope, too long”.

This is very common in backend code – an HTTP server will probably reply with 400. Maybe you’ll want to indicate which field was too long, maybe not. (You’ll definitely want to do request-size-sanity-limiting before you start considering individual fields.)

In the frontend, “rejecting” input might take the form of allowing the user to enter too much data into a field, detecting that it’s too long, displaying an error message and disabling form submission.

#### Truncate

In the frontend, this might take the form of halting further input into a field or it might mean actively truncating input before submitting. In the backend, it will mean actively truncating. Either form has the risk of breaking characters.

Unicode code points can be broken if we’re limiting by UTF-8 or UTF-16 and we truncate in the middle of a code unit sequence. We’ll end up with an invalid encoding sequence.

Grapheme clusters can be broken if we’re limiting by UTF-8, UTF-16, or Unicode code points and we truncate in the middle of a cluster sequence. We can end up with distorted or different characters or emoji.

So, truncating must be done in a sequence-aware manner. If the limit is hit in the middle of a code point or cluster sequence, you’ll need to back off until you find a boundary and do the truncating there. (Or, better yet, find some battle-tested library to do it for you.)
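For the UTF-8 case, backing off to a boundary is straightforward because continuation bytes are recognizable (they always look like `10xxxxxx`). A sketch with a made-up `truncate_utf8` helper follows. It protects code points only; a grapheme cluster can still be split, which is why a library is the better answer.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes of UTF-8 without splitting a code point."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    cut = max_bytes
    # If the first dropped byte is a continuation byte, the cut is in the
    # middle of a sequence: back off to the start of that sequence.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut].decode("utf-8")

print(truncate_utf8("a\u5b57", 2))  # "a": the 3-byte "字" doesn't fit
print(truncate_utf8("a\u5b57", 4))  # unchanged
```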

And then do a bunch of testing to make sure it works the way it should. (Note the React Native bugs mentioned elsewhere.)

Discussion about the relative UX merits of rejection versus truncation is interesting, but beyond our scope.

[This section was prompted by an HN comment from aidenn0.]

### Unicode versions

There have been 16 major versions of the Unicode specification. Older Unicode text is compatible with newer Unicode versions, but the reverse is not necessarily true. For example, new versions may:

Add new emoji and other characters (code points)
Add new valid grapheme clusters, including multi-code-point emoji
Update grapheme cluster segmentation rules
Add character normalization rules

Possible problems that could be introduced include:

If counting by grapheme clusters, there could be a mismatch between frontend and backend, causing them to disagree about whether a string is valid.
Normalization could be different between frontend and backend, introducing the possibility of string length counting differences.
A valid emoji entered on one client may not be valid on other clients with older system Unicode versions.
If there are two different backends with different Unicode versions, they may not normalize a username the same way, resulting in one of them being unable to look it up. (Similarly for passwords, pre-hashing. Or any other string matching.)
If strings are normalized and hashed (e.g., for signing or HMAC) in one place, and then the raw string is sent somewhere else for verification, the normalization step may not produce the same string, resulting in a different hash. (So don’t try to modify the input before checking. For example, the JSON Canonicalization Scheme specifies that “Although the Unicode standard offers the possibility of rearranging certain character sequences, referred to as “Unicode Normalization”, JCS-compliant string processing does not take this into consideration. That is, all components involved in a scheme depending on JCS MUST preserve Unicode string data “as is”.”)

And so on.
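The hashing problem can be seen even without any version skew, just by comparing the composed and decomposed forms of the same visible character. The sketch below assumes Node.js (for the built-in `crypto` module); the helper name `sha256Hex` is made up for the example. A Unicode version difference between two systems would produce the same kind of mismatch whenever their normalization results disagree.

```javascript
const { createHash } = require('crypto');

// Hypothetical helper: hex SHA-256 of the UTF-8 encoding of a string
const sha256Hex = (s) => createHash('sha256').update(s, 'utf8').digest('hex');

const composed = '\u00e9'; // "é" as a single code point
const decomposed = 'e\u0301'; // "e" followed by a combining acute accent

// Same rendered character, different bytes, so different digests
console.log(sha256Hex(composed) === sha256Hex(decomposed)); // false

// Only if both sides normalize identically do the digests agree
console.log(
  sha256Hex(composed.normalize('NFC')) === sha256Hex(decomposed.normalize('NFC'))
); // true
```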

For a real-life story of this kind of version mismatch, search for “operating system-provided ICU” here. It also links to a scary discussion (HN discussion) about Unicode version differences in glibc causing PostgreSQL index corruption when the same data is used on a different server machine. “No surprise; I have been expecting to hear about such breakage, and am surprised we hear about it so rarely. We really have no way of testing for breakage either. :-(” Yikes.

“Did Postgres Lose My Data?” is another good dive into Unicode collation (sorting) changes. The author did an analysis of actual inter-version differences. (From the Unicode Technical Report #10: “Collation order is not fixed. Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.”)

I have never seen “Unicode version” be specified as a requirement in an API (or other) spec, but I think it should be, strictly speaking. Otherwise you can’t be certain that both sides understand strings in the same way.

On the other hand… This might be another case of being overly paranoid?

### Normalization concerns

It might be tempting to have your API require that strings be Unicode normalized before submission, because: a) you’re already normalizing on the frontend in order to count the length, and b) it’s your API and you make the rules. But I think it’s a bad idea.

First of all, you’re not going to trust the API-caller to actually do it, so you’ll have to re-run the normalization on the backend anyway.

…And then compare your normalized input with the raw input. But then see above for Unicode version concerns – you could end up in a pathological state where the backend refuses to accept perfectly valid input.

Note that normalization is not zero-cost, and may be significant on very large data. For example, Django had a recent possible DoS vulnerability due to slow normalization.
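One way to square these concerns is to normalize only for the purpose of counting, leave the submitted string untouched, and put a cheap size guard in front of the normalization step. This is a minimal sketch under those assumptions: `nfcCodePointLength` and `checkLength` are hypothetical names, and `maxRawUnits` is a caller-chosen sanity limit, not a recommended value. (A guard that is too tight could reject strings that would have shrunk under NFC.)

```javascript
// Count code points after NFC normalization. The caller's string is not modified.
function nfcCodePointLength(s) {
  let n = 0;
  // Iterating a string visits one code point per step
  for (const _ of s.normalize('NFC')) {
    n++;
  }
  return n;
}

// Hypothetical validator. `maxRawUnits` is a cheap UTF-16 length guard applied
// before the more expensive normalization; `maxCodePoints` is the real limit.
function checkLength(s, maxCodePoints, maxRawUnits) {
  if (s.length > maxRawUnits) {
    return false; // Reject early without normalizing
  }
  return nfcCodePointLength(s) <= maxCodePoints;
}
```

With this, `'e\u0301'` (two code points as typed) and `'\u00e9'` (one code point) both count as 1, so the limit does not depend on which form the client happened to send.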

### Encoding errors

Another thing to learn about your programming languages of choice is how they handle bad UTF-8 or UTF-16 encodings, and how you’re going to handle that when it happens.

It seems common to replace broken sequences with the Unicode replacement character, U+FFFD. E.g., when Go’s json.Unmarshal encounters a bad UTF-8 code unit sequence, it silently replaces it with U+FFFD.

Should you detect invalid sequences? Treat them as bad input? Or just pass them through?
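If you decide to detect rather than pass through, the checks themselves are small. The following sketch shows one possible approach in JavaScript: a manual scan for unpaired UTF-16 surrogates, and a strict UTF-8 decode using `TextDecoder` with `fatal: true`. The function names `hasLoneSurrogate` and `decodeUtf8Strict` are made up for this example.

```javascript
// Returns true if the string contains an unpaired UTF-16 surrogate code unit.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);

    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // Valid pair, skip the low half
        continue;
      }
      return true;
    }

    if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // Low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// Decodes UTF-8 bytes, returning null for invalid input instead of
// silently substituting U+FFFD.
function decodeUtf8Strict(bytes) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    return null;
  }
}
```

Without `fatal: true`, `TextDecoder` behaves like the Go example above and substitutes U+FFFD for the bad sequence.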

### Processing overhead

In many cases, accessing (or counting) character encodings other than the one used for natively storing your strings is an O(n) operation. That is, the string needs to be scanned through to answer questions like, “how many UTF-16 code units are in this UTF-8 string?”, “how many Unicode/UTF-32 code points in this UTF-16 string?”, “how many grapheme clusters?” – converting from the internal encoding to the target.

For most situations, with most string lengths, this isn’t a significant performance problem, but keep it in mind if you’re doing this inside a big loop, or with huge data, or in an HPC context.

Swift, for one, leaves “breadcrumbs” after the first conversion, to speed up subsequent non-UTF-8 string accesses. I don’t know about other languages, but I don’t think this is common.
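To make the scanning concrete, here is a small sketch that measures one JavaScript string in all four units discussed in this post. The helper name `measure` is made up, and it assumes `TextEncoder` and `Intl.Segmenter` are available. Everything except `s.length` walks the entire string.

```javascript
// Measure a string in UTF-8 bytes, UTF-16 code units, code points, and grapheme clusters.
function measure(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

  // String iteration steps by code point, so this is a full scan
  let codePoints = 0;
  for (const _ of s) {
    codePoints++;
  }

  // Segmentation is another full scan, with more work per step
  let graphemes = 0;
  for (const _ of segmenter.segment(s)) {
    graphemes++;
  }

  return {
    utf8Bytes: new TextEncoder().encode(s).length, // Re-encodes the whole string
    utf16Units: s.length, // The native unit of JavaScript strings
    codePoints,
    graphemes,
  };
}

console.log(measure('字')); // { utf8Bytes: 3, utf16Units: 1, codePoints: 1, graphemes: 1 }
console.log(measure('🔤')); // { utf8Bytes: 4, utf16Units: 2, codePoints: 1, graphemes: 1 }
```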

## Additional reading

Henri Sivonen’s “It’s Not Wrong that “🤦🏼‍♂️”.length == 7” is great. It covers some of the same basics I mention here, but goes into different nerdy things. For example, it looks hard at the density of information relative to character count (and various encodings) in many, many different languages. This gets into the “fairness” of limiting different languages by the different encodings, which I didn’t really consider here. (It does this by analyzing the myriad translations of a single document: the Universal Declaration of Human Rights. Smart.)

Nikita Prokopov’s “The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)” does what it says on the tin. If you feel you want another pass over this stuff after reading this whole post, go there. It also gets into locale-dependent differences in handling of the same code points, which is good to know and also troubling.

Jeremy Schneider’s “Did Postgres Lose My Data?” is a good story about and investigation into the effects of Unicode version changes. It also links to good resources.

## Conclusions

What prompted me to write this was when I realized that UTF-16 was the (unstated) encoding used for string length counting and limiting in a project I was reviewing. This felt wrong to me: variable number of code units; not first-class in some languages I cared about; kind of a weird in-between. I also didn’t like that the choice wasn’t intentional or aware – it was just whatever the framework used for maxLength under the hood.

When I tried to swoop in with some “senior dev has seen it all” shit… I realized that I didn’t have a solid suggestion with clear reasons. And that’s a good excuse to develop them in a blog post.

One surprising – to me – outcome of this research is that UTF-16 isn’t actually a bad choice: for most languages, most of the time, only one code unit is needed for a non-emoji character. And it’s not like using Unicode code points is a huge improvement, since it also counts by more than one for grapheme clusters.

I was hoping for a clear right choice, backed by solid reasons, and I didn’t find one.

That being said, I think that counting by Unicode code points (with normalization) is the best approach. Did I take 7,000 words to say “just do what Google suggests”? Yes, okay, maybe, but now I know – and you know! – why. And that’s important. Plus, we got to learn some interesting and somewhat bonkers stuff about things we take for granted every day.

I’ll add a TL;DR at the top with takeaway points.

## Appendix: Implement Hybrid Counting ✨Graph Length✨

For fun, let’s see what implementations of the “hybrid counting” approach would look like.

JavaScript:

/**
 * Counts the number of grapheme clusters in a string, with a sanity limit on the number
 * of Unicode code points allowed in the cluster. After 10 code points in a single cluster,
 * the remaining code points in the cluster are counted as one each.
 * The limit is intended to be larger than the number of code points in legitimate
 * grapheme clusters (as used in emoji and human languages), to distinguish them from
 * less-legitimate uses, like Zalgo text.
 *
 * @param {string} s - the input string
 * @returns {number} - the grapheme length
 */
function graphLength(s) {
  // A bit of research suggests that the locale argument is ignored for grapheme segmentation
  const seg = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

  let total = 0;

  // Iterate through the grapheme clusters
  for (const { segment } of seg.segment(s)) {
    // Spread (...) turns the cluster string into an array of Unicode code points
    const n = [...segment].length;

    if (n <= 10) {
      // Short enough to count as one
      total += 1;
    } else {
      // Too long. The first 10 code points count as 1, and the rest each count as another 1
      total += 1 + (n - 10);
    }
  }

  return total;
}

Go (using a third-party segmentation package that seems like the de facto standard):

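import (
	"unicode/utf8"

	"github.com/rivo/uniseg"
)

// GraphLength counts grapheme clusters in s, with the same sanity limit as the
// JavaScript version: the first 10 code points of a cluster count as 1, and
// each code point beyond that counts as another 1.
func GraphLength(s string) int {
	var total int
	graphemes := uniseg.NewGraphemes(s)
	for graphemes.Next() {
		cluster := graphemes.Str()
		// Number of Unicode code points (runes) in this cluster
		n := utf8.RuneCountInString(cluster)

		if n <= 10 {
			total += 1
		} else {
			total += 1 + (n - 10)
		}
	}
	return total
}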

Those don’t seem terribly hideous. I would consider using that code.

See the Hybrid Counting section for thoughts about using 10 as the count, versus 30.

Whatever your limit, you may wish to add a log when you exceed it. If it’s just Zalgo text, that’s fine, but if a new character enters common use, then you probably want to know about it and change your limit.

You probably want to NFC-normalize a string before getting its graph length.

## Appendix: A personal length-limiting example

Years ago I wrote some code that did length-limiting for account creation fields. It’s really beside the point of this post, but I thought it might be slightly interesting – at least to myself – to consider that design in light of what I’ve researched here.

This was for a Go backend, with account creation done on a website and logins from native Android and iOS (and IE webview on Windows).

There were no length requirements in the spec, but there obviously couldn’t be unbounded input. I wanted to give lots of space for the username and especially the password, including flexible allowance of character types (PRECIS UsernameCasePreserved and OpaqueString). (Nothing about how the usernames would be used required plain ASCII.)

So I decided on these design points:

Backend:
1. Overall request size limited to 4 KB.
2. Username limited to 200 UTF-8 bytes.
3. Password limited to 800 UTF-8 bytes.
Frontend:
1. Username limited to 50 Unicode code points.
2. Password limited to 200 Unicode code points.

It makes some sense that I used UTF-8 bytes on the backend, since size was what I was really concerned about. It makes some sense that Unicode code points were used on the frontend, since they’re a way to stay safely within the backend limit. I chose values that I thought were large enough that the vast majority of users wouldn’t hit them.
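The relationship between those numbers comes from the fact that a single code point takes at most 4 bytes in UTF-8, so 50 code points can never exceed 200 bytes (and 200 can never exceed 800). A quick sketch of that arithmetic, using made-up helper and constant names:

```javascript
const encoder = new TextEncoder();

// Hypothetical helpers for the two counting methods in this design
const utf8ByteLength = (s) => encoder.encode(s).length;
const codePointLength = (s) => [...s].length;

// Username limits from the design points above
const FRONTEND_USERNAME_MAX_CODE_POINTS = 50;
const BACKEND_USERNAME_MAX_BYTES = 200;

// Worst case for the frontend limit: 50 code points that each need 4 UTF-8 bytes
const worstCase = '🔤'.repeat(FRONTEND_USERNAME_MAX_CODE_POINTS);
console.log(codePointLength(worstCase)); // 50
console.log(utf8ByteLength(worstCase)); // 200, exactly at the backend limit

// The reverse does not hold: 200 ASCII bytes pass the backend check
// but are 200 code points, four times the frontend limit
const asciiName = 'a'.repeat(BACKEND_USERNAME_MAX_BYTES);
console.log(utf8ByteLength(asciiName)); // 200
console.log(codePointLength(asciiName)); // 200
```

The second half is the mismatch in miniature: the frontend limit implies the backend limit, but not the other way around.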

But you can immediately see a few problems.

I didn’t properly (or at all) consider grapheme clusters. For a Hindi-speaker/writer, that username length could be reduced to 10 characters. I think that’s below acceptability.

After learning about PostgreSQL collation issues due to Unicode version changes, I would think carefully before allowing non-ASCII characters in an indexed field (e.g., username).

There’s a mismatch between frontend and backend encoding used for limiting. This is somewhat mitigated by a few things:

The limits are fairly generous. (…except for grapheme clusters.)
There is only one client where accounts can be created.
There is no frontend length limiting done when logging in – only when creating or modifying an account.

So a mundane user wouldn’t be able to get into a bad state. And if someone F12-removes the submit-button-disabled state on the frontend, they can create a username/password beyond 50/200 code points, but they will still be able to log in with them. So that’s okay-ish.

What’s less good is that a user with a too-long username won’t be able to edit their account – adding an email address, say – without changing the username, because the UI will flag the (unchanged) username as invalidly long. Sure, if you mess around in browser DevTools you get what you deserve, but it’s still a good example of unfortunate consequences due to inconsistency.

Hacker News discussion will be here, if any results.

When I make generalizations like this, please remember that I don’t actually know what I’m talking about. Maybe many languages consider something smaller than a character to be the atoms? For example, maybe for Japanese kanji it’s the radicals or even the strokes? I don’t know. This kind of glib generalization is just to give most readers a rough idea as we progress through concepts. ↩︎
After reading this post, Connor Taffe took a deep dive into the origins of “rune” (and UTF-8). Code archeology is cool. ↩︎
The overhead enables some cool properties of UTF-8, like: a) The first byte in a UTF-8 code unit sequence tells you how many bytes are in the complete sequence; b) if you look at any byte in a UTF-8 stream, you know if you’re at the start of a sequence or at a later position – handy if you want to back up or skip to the next sequence after an interruption. ↩︎
I don’t know most of these languages well or at all. Corrections welcome. ↩︎
I really like this explicit view approach, but I worry about the risks of defaulting to grapheme clusters. Someone should do a survey of Swift/Perl/Raku/Elixir projects to see how many are vulnerable to giant-grapheme-cluster attacks. ↩︎
I’m about to recommend Unicode normalization during validation, which is certainly also “additional processing and interpretation”. And converting between any encodings requires some kind of “interpretation”, but with varying complexity. ↩︎

Bug story: Sorting by timestamp

Sat, 23 Dec 2023 10:22:41 -0500

While reviewing a co-worker’s results-paging design I realized there was a bug in some paging code I wrote¹ a few years ago. It’s unlikely to manifest and kind of subtle, but I thought that describing it here might be useful to others writing such code (including my future self). It comes down to sorting by timestamp…

We’ll start with a table like this:

CREATE TABLE item(
  id TEXT PRIMARY KEY DEFAULT generate_unique_id(),
  created TIMESTAMPTZ NOT NULL DEFAULT NOW()
  -- plus other stuff not relevant to the story
);

Note that we don’t have anything like a SERIAL column², so created is our only option for traversing records chronologically.

Our clients will request pages of those items, persist them locally, and later check to see if there are new items. The paging query looks very generally like this:

-- last_known_id is input
SELECT * FROM item
WHERE created > (SELECT created FROM item WHERE id = last_known_id)
ORDER BY created ASC
LIMIT 10

This approach would be fine if these properties were true (per user):

The creation timestamp is unique
The creation timestamps are monotonically increasing

When writing the code I unthinkingly took those to be the case. But when I took the time to think about it a few days ago, I quickly realized that they’re bad assumptions. Let’s look at some problems with them…

PostgreSQL’s timestamp type has microsecond resolution³. That’s really small, but not infinitely so. It is surely theoretically possible to create more than one record with the same timestamp, even for the same user with cross-transaction serialization constraints. (It probably requires a transaction to begin and commit within a microsecond, but that’s not enough to discount it.) This becomes trivially true if your DB operations don’t have cross-transaction constraints.

From what I can find, Postgres does not provide a guarantee of stable sort for identical values. Based on what (little) I know about b-tree indexes, I think it’s probably true for indexed values (unless some other index type gets used? unless the index gets rebuilt or shuffled?), but it’s still not something that should be (sanely) relied on.

What happens to now() if the OS clock gets adjusted backwards (perhaps due to NTP update, assuming slew isn’t used⁴)? If it goes backwards, then we lose monotonicity. But even if it doesn’t update until a service restart, the restart can still be faster than the magnitude of the time change, so the clock can still effectively go backwards. Again, not monotonic.

If our DB server is in a (single writer, many readers) cluster, the same thing can happen if we fail over to a server that is behind the previous writer. now() again ends up in the past, until the time difference is caught up with.

So, our timestamp is not unique, not monotonically increasing, and we can’t even trust that identical values will sort stably⁵. This means that our clients could miss items when paging.

Monotonicity Failure Scenario: If the user has pulled down items to timestamp X and they manage to create a new item at timestamp X-1, they will never get it (at least until a full re-sync).

Uniqueness Failure Scenario: If the user has two items at timestamp X and they retrieve a page where the last item on the page is the first of those items, the following page will skip the second of those items, because the query is only looking for items with a greater timestamp. (If we change the query to look for items with greater-than-or-equal timestamp, then we’ll be getting duplicate items. Probably better, but still not good – especially if there’s a whole page of duplicates that can never be escaped from.)

Again, these failures are improbable. In our case, a single user creating two items within the same microsecond, and then having those align on a page boundary, is unlikely. As is the possibility that our server clocks drift badly enough that backwards movement occurs.

But it could still happen! Which sucks!

I think that the right way to fix this is to add an ordering column. In our case it only needs to be ordered per user, but it’s probably easiest to just use a BIGSERIAL and order the whole table. We would then use that column rather than created when sorting. It will provide the properties of uniqueness and monotonicity that we need.
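Both failure scenarios, and the effect of an ordering column, can be reproduced with a small in-memory simulation. This is only a sketch: the `items` array stands in for the table, `seq` plays the role of the `BIGSERIAL` column, and all the ids, timestamps, and function names are invented for the example. (Unlike the SQL above, a missing `lastKnownId` is treated as "start from the beginning".)

```javascript
// Hypothetical stand-in for the item table. `seq` mimics a BIGSERIAL column.
const items = [
  { id: 'q7', seq: 1, created: 1000 },
  { id: 'b2', seq: 2, created: 2000 },
  { id: 'x9', seq: 3, created: 2000 }, // Same timestamp as b2
  { id: 'a4', seq: 4, created: 3000 },
];

// Mirrors the original query: WHERE created > (last item's created) ORDER BY created
function pageByCreated(lastKnownId, limit) {
  const last = items.find((it) => it.id === lastKnownId);
  const after = last ? last.created : -Infinity;
  return items
    .filter((it) => it.created > after)
    .sort((a, b) => a.created - b.created)
    .slice(0, limit);
}

// The proposed fix: page on the unique, always-increasing ordering column
function pageBySeq(lastKnownId, limit) {
  const last = items.find((it) => it.id === lastKnownId);
  const after = last ? last.seq : -Infinity;
  return items
    .filter((it) => it.seq > after)
    .sort((a, b) => a.seq - b.seq)
    .slice(0, limit);
}

// Pull pages until an empty one comes back, collecting the ids seen
function fetchAll(pageFn, limit) {
  const seen = [];
  let lastId = null;
  for (;;) {
    const page = pageFn(lastId, limit);
    if (page.length === 0) return seen;
    seen.push(...page.map((it) => it.id));
    lastId = page[page.length - 1].id;
  }
}

// Uniqueness failure: b2 ends the first page, so x9 is skipped
console.log(fetchAll(pageByCreated, 2)); // [ 'q7', 'b2', 'a4' ]
console.log(fetchAll(pageBySeq, 2)); // [ 'q7', 'b2', 'x9', 'a4' ]

// Monotonicity failure: the clock stepped backwards, so a newer row
// (higher seq) gets an older timestamp
items.push({ id: 'z1', seq: 5, created: 1500 });
console.log(pageByCreated('a4', 2).map((it) => it.id)); // []
console.log(pageBySeq('a4', 2).map((it) => it.id)); // [ 'z1' ]
```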

There are a few takeaways here:

Thinking about time is hard.
Try your best to recognize your bedrock, implicit assumptions about things. Which is also hard, since they’re mostly subconscious.
Review other people’s code (and have your own reviewed, of course). It forces you to think broader, deeper, and different, and can help you with your own code.

There is discussion of this post on Hacker News here.

The implementation was part of a big change that got squashed – a practice I am having doubts about – so I don’t know for sure if I wrote it. It was years ago! But I certainly reviewed it, and didn’t know better, so I’m still taking the blame here. ↩︎
For reasons. We need a random primary key. ↩︎
A commenter on HN pointed out that the resolution of the timestamp might be coarser than microsecond if the resolution of the service providing time to Postgres is coarser. Probably it’s not coarser, but it’s another thing to not take for granted. ↩︎
If slew is used, then that might increase the likelihood that identical timestamps occur, as time effectively slows down while the correction is made. ↩︎
An HN commenter said: “I found it to be a generally useful rule to never ORDER BY created but instead ORDER BY created,id instead to achieve stable sorting.” Which, yeah, is great advice. It’s probably great advice for any “sort by timestamp” scenario – you just need another unique value to use as the secondary sort, even if it’s not itself usefully sortable. ↩︎

Inexpert Parenting Tips

Sat, 03 Jun 2023 06:17:56 -0400

These are some parenting tips/tricks/techniques that I’m capturing for a friend. I’m not pretending that the ideas are good or unique or will help anyone other than me, but… maybe?

My frame of reference: I have one 8-year-old son, healthy and neurotypical. He’s basically well-behaved now, but we had a lot of behavioural troubles (hitting other kids, etc.) until recently. I’ll probably be using “he” a lot, as this is basically a personal account.

Much of this stuff is based on books or articles I have read along the way. I’m going to make zero effort to figure out the sources, or else I’ll never write any of it.

## Carve your brain (and your child’s): Do what you want to be

“You are what you do”, for good or ill. Repeated actions or thoughts build pathways and structures in your brain that enable or ease those actions. This is unfortunately most often true with negative behaviours (like constant cell phone checking), but can be used with intentional effort to propel yourself along the path you’d like to be traveling.

This is a pretty abstract “tip”, but it’s going up top because it applies to so many of the tips below. Even if a particular repeated parental behaviour is ostensibly about guiding or providing something to the child, it will still result in carving paths in your own brain. And often this is part of the point.

A concrete example: My first year or so of being a father was very hard for me. I didn’t know that paternal post-partum depression was a thing (affecting 10-25% of fathers!), but I’m quite sure I suffered from it (I wrote a bit more here). During that time I read some advice about “fake it till you make it” and I decided to take that to heart. One way I did that was to say “You’re my special little guy and I love you” to my son often – at every nap time, at least. It felt false (I was in a bad place), but I wanted it to be true. I tried to say it as convincingly as possible (I also wanted him to believe it), I said it regularly, and I said it repeatedly. And eventually it became the most true thing I’ve ever uttered.

Yes, this is some N=1 sample size bullshit. Maybe I was already on the way out of the depression and this did literally nothing. I believe that it helped.

Say the words that you want to be true. Act the way you know you should be acting. Force yourself until it becomes natural.

## Always offer low-friction reconciliation

The time I have most clearly felt “oh, this is a parenting win” was when my (pretty little) kid was angry and hitting, and I was angry, and then I dropped to my knees and opened my arms and he just fell in, curled into a ball, instantly calmed down, and started crying a bit.

This didn’t work every time, but for years it allowed us to defuse a lot of bad situations, and in a way that plainly felt like a gain in love and attachment.

I decided that the lesson from the success of this tactic is that my child was often desperate for a low-friction path to reconciliation and that I should always offer it. By “low-friction” I mean not requiring an admission of guilt or an apology or any other loss of face – just zero steps from “angry” to “unconditional love”.

I’m also using the word “reconciliation” somewhat carefully. The behaviour that led to the conflict might be unacceptable (e.g., hitting other kids) and still need to be addressed – this isn’t necessarily “forgive and forget”. But nothing can be “addressed” while everyone is freaking out, so it’s essential to re-establish love and acceptance.

Brain-carving: It will be very hard to offer this reconciliation when you’re angry, but the more you do it, the easier and more automatic it will become. Just as importantly, you will be carving your child’s brain at the same time – you’ll be teaching them to look for the opportunity for reconciliation, and to take it. This will serve them well forever.

Bonus tip: I found kneeling was often effective in many situations. I think there’s a psychological effect on the kid when the parent is at their level – I think I felt more accessible to him. It’s also a nicer position for hugging a little guy.

Another bonus tip: When my son started hitting, I often found it effective to really obviously not defend myself. If I put my arms up to absorb the blows, I think he took that as meaning that it was more okay to hit me, because I wouldn’t really get hurt. But if I kneeled and opened my arms and offered him my face (i.e., offered him my throat), then he wouldn’t be as easily able to hit me without hurting me and often would stop. But not always. If he hadn’t completely lost his shit, then great, but if he was absolutely out of his mind then I’d have to realize that really quickly. (And then do a “safety hold” right on the busy sidewalk and oh god those were some bad days.)

Relationship tip: “Always offer low-friction reconciliation” is just obviously a good idea in all settings, with everyone, but especially your spouse. I’m pretty bad for holding grudges, and this is something I’m still working on.

## Emotional bookending and bridging

(To be clear, I’m making up bullshit terms that vaguely evoke what I’m trying to get at. I need some kind of section header, right?)

Every night the last thing I say to my son is, “I love you, monkey. I’ll see you in the morning.” Both parts are intentional: the first is pretty obvious (and was part of my own brain-carving, back when I needed it at that level), and the second is to reassure him that I’ll be there the next day. I want that to give him comfort as he falls asleep and if he wakes up in the night.

Every morning as soon as I see him I beeline to him and give him a kiss. This is the fulfillment of the previous night’s promise. For many hundreds of nights, I have communicated as clearly as I can: Our separation is temporary and my love is always with you.

I do this with almost all our separations, especially school. I always say, “I love you, buddy. I’ll see you after [school or whatever activity].” And then make sure I clearly communicate that I’m happy to see him afterwards. (For a long time I would explicitly say “I’m happy to see you.”)

This isn’t just about trying to be a loving parent. I want him to not call for us during the night. I wanted him to not feel anxious at daycare or school, because I suspected that it was contributing to behaviour problems. But, yeah, also for his own sake I want him to feel absolutely sure that he’s loved and that his parents will return to him, even when we’re separated.

(Personal psychological insight: I always dreaded my dad coming home from work. I really, REALLY never want my son to feel like that. I want to be a goddamn beacon of love and comfort. “Don’t be like my father” is not the healthiest guiding principle, but we all just work with what we got.)

## Turn towards

I got the term from a relationship article, but it applies equally to parenting. The idea is that when your partner wants or needs your attention (or affection, or help, or whatever), you “turn towards” them, literally and figuratively. Physically turn towards them and give them your full focus and attention. You don’t keep looking at your phone or the TV or whatever. This is a clear act and message of love and respect and commitment.

This idea applies equally to your child, but is much harder. Kids want so much attention, and they talk so much, and they are so fucking boring. It is impossible to turn towards them as much as they want. But that doesn’t mean we shouldn’t try.

When my son wants my attention, I try to turn toward him immediately. If I can’t, I’ll communicate to him why I can’t and give him the option of talking anyway or waiting until I’m free. When I lose time or patience in this attention-giving I’ll say, “Monkey, I can’t keep paying attention right now, but you can keep talking if you want.”

So, I treat him with respect (attention, truth, etc.) and give him what he needs for as long as I can. This might seem obvious, but, like many of the things here, I think it’s good to do it deliberately and with intention.

For lots of us, this kind of attention isn’t easy to do, and especially not for any amount of time. Again, kids are tedious. But, as with everything, you are what you do – the more you do it, the more you can do it.

Also, yeah, do this with your spouse.

## Ambient affection

For a long time (less lately), I would lightly and briefly touch my son almost every time I walked by – hand-on-head, hand-on-shoulder, that kind of thing. I intended it as a non-intrusive reminder of the proximity of love and affection. (And, as always, carving my own brain.)

I sometimes think in terms of physical stress factors: hormones, heart rate, etc. I want to keep those factors low in my son, and I think of any show of affection as bringing them down. (It’s also why I try to get him into nature often, as there’s actual evidence of efficacy.)

## Tell them when you’re cranky

I always tell my son when I’m in a bad mood (usually due to poor sleep). I tell him that I might be short-tempered and cranky, but that it isn’t – and won’t be – his fault, and that I don’t love him any less.

I’m not sure that the information actually changes his pain-in-the-ass behaviour, but that’s not the point. I only want him to feel… well, exactly what I said – that it’s not his fault and I don’t love him any less. And, really, kids are (or some are, or sometimes, or something) capable of empathy, even towards their parents.

Saying these words is also a reminder to myself to try to not be a jerk to my son just because I had a shit sleep.

## Show resolution after a spousal fight

If (ha!) you get into a fight with your spouse in front of your kid, try your very best to model good behaviour during and after. It can be very hard to stay respectful during a bad fight, but try. Also try to let your kid see the resolution of the fight – partly so that they can see it is resolved, and partly so that they learn a bit about resolving the fights that they themselves inevitably have and will have.

This is so much harder than it sounds.

## Don’t lie

Don’t lie to your kid. Ever. You can certainly omit things, but don’t lie to them. And remind them explicitly that you’ll never lie to them. You want them to believe you as quickly and with as little doubt as possible.

(I take this to include not making my son believe in Santa or the Easter Bunny. It’s up to you if you want to follow me that far, but I haven’t regretted it even a little.)

Besides being fundamentally respectful, this is also very much about laying the groundwork for your future relationship. Even if there aren’t times right now when you desperately need your child to believe you utterly and implicitly, you can bet your ass there will be important times in the future.

Which leads us to…

## Lay the groundwork for the teenage years

Being a teenager is hard. It was hard for us and it’ll be hard for our kids. They’ll be faced with risky decisions and bad situations, and the probability of not getting pregnant or dropping out or dying is much, much better if your kid knows deeply that you are on their side. They need to be able to trust you to love them no matter what, help the best you can, forgive them, not freak out on them, and so on.

Honestly, I think this is the big goal that we should be working towards. Doing your best for your kid at 5 is excellent and necessary, but setting your relationship up for success at 15 is the real goal. Luckily, the former serves the latter.

## Think hard about saying “no”

My knee-jerk reaction is – or was? – to say “no” to a lot of stuff. Silly stuff, irresponsible stuff, stuff we don’t have time for. Also, I’m just kind of a dick sometimes. But I don’t like being like that, and I don’t think it’s good for the kid or our relationship.

One reason is that you don’t want your child to come to think of you as someone who says “no” all the time. You want them to keep talking to you (see the “teenage years” section) and asking you for and about iffy things. You want them to know that you’re on their side and that you generally want them to have what they want to have. Of course, that doesn’t mean that you can always acquiesce, but you’ll always seriously consider it, give them reasons why not (if you’re still going to say no), and work with them to find some kind of compromise when possible.

Another reason is that allowing them the feeling of autonomy is important. Everyone wants to feel like they are the masters of themselves, even very little kids. The feeling of powerlessness can be terrible, cause bad backlash, and carve their brains in ways you don’t want. Saying “no” takes autonomy away from them. Obviously, you can’t let them be autonomous always (they’re dumbasses and won’t survive a day), but you can try hard to give them as much as you can when you can, to balance when you can’t.

Another big reason to avoid saying “no” quickly is so that we don’t allow ourselves to be beaten down by our child and then give in. I think it’s much better to say “yes” (or something like it) immediately than it is to teach the child that if they wheedle and cajole us hard enough they’ll get their way. None of us wants to incentivize that kind of shit.

Related to this, I think we should try to avoid asking for the child’s opinion when the only answer we intend to accept is compliance. So, don’t say “do you want to go to the library?” (or “how do you feel about going to the library?”) when you intend to make the kid go no matter what. Saying “we’re going to the library” might not be giving the child autonomy, but at least it’s not pretending to give them some and then ripping it away from them. That’s much worse, and disrespectful. It’s not really free will if you only have one choice.

## Give your parenting partner a break

This is probably too obvious to mention, but: Parenting is hard and we all get tired. Take your kid away from your spouse sometimes, and pretty often.

You don’t need to come up with some heroic activity – it doesn’t need to be long or elaborate. I mean, if it is, great, but it’s more important that you do something, anything. Make it as easy on yourself as possible, so you don’t burn out while trying to prevent your partner from burning out.

## Don’t get carried away by your spouse’s emotions

When my partner gets angry at our son, my gut reaction is usually to lose my shit. I hate seeing her upset. Which is the very wrong response.

The emotional onslaught of two parents is too much. Even from one parent it’s really a lot. You need to try to defuse the situation. Help your spouse regain composure. Let your child know that they are still loved and supported.

Sometimes this can mean directly opposing your spouse in defense of your kid, in front of your kid. I don’t know if this is great or not, but it usually feels okay and helps to calm things down. Be respectful and gentle to your spouse and the child will learn the right lesson.

## Emotional satiation

If your kid needs a hug (or other comfort), give it to them until they pull away first. Let them drink as much comfort from you as they need, and give them no sense that you’re pulling away while they need more.

This is harder than it sounds, and is sometimes beyond our emotional reserves and time. Remind yourself that it’ll be very rare after about age 6.

## Be wary of your intoxicated self

I quickly realized that I couldn’t have even one drink until after my son was asleep. When I drink even a little my patience drops like a rock, and that makes even the smallest difficulty bad.

It’s not like that for everyone, though. Know thy tipsy self. If, like me, you realize that you just can’t drink at all during the day, make peace with it and know that you won’t need to do it forever.

## Pay attention to your tone; aim for neutrality

It’s very easy to convey more than you mean to in your tone of voice and even facial expressions. Even if what might be conveyed is still “true”, it’s often not the message that you actually want to send.

The obvious case for this is when you’re providing negative feedback – when your kid did something wrong and you’re telling them so. What you want to do is have them understand that what they did was undesirable, why, how they should improve in the future, and maybe how they should make amends now. What you don’t want to convey is anything like this:

You’re hurt
You’re whining
Your happiness is dependent on their better behaviour
You’re flying off the handle
Or anything that generally indicates that your love has been partially withdrawn

Those kinds of things freak kids right out, make it hard for them to take a positive lesson from what just happened, and over time can make them hesitant to come to you for emotional safety.

But it’s really hard to not convey that extra information! We do it all the time! And we’re also often in a state where independent shitty factors – being tired, frustration with work, etc. – amplify and exacerbate everything. No matter how much you really do love your child, you’re going to have a really hard time communicating that in bad moments.

The less obvious time when tone neutrality is important is when giving positive feedback. There’s a risk that you overdo it and your kid starts behaving just to get that effusive praise reward from you. Now the reward is problematically extrinsic.

I recommend the Hidden Brain podcast episode about clicker training for dogs, dolphins, and… humans. I didn’t actually do it with my son, but I sure thought about it.

So, try to make a habit of paying attention to what your secondary communication channels – tone of voice, facial expressions, body posture – are communicating, and how strong that communication is. Think about how it might be making your child feel and how they’ll respond to it. Think about what sort of patterns might be getting established by repeated exposure to it.

And consider keeping your tone generally more moderate. Avoid sending your child on an emotional rollercoaster.

## Never make your child responsible for your emotions

(This point ties into the previous one but deserves its own spotlight.)

Your child cannot handle the responsibility for your happiness; do not put it upon them. You’re their parent – their emotional foundation, their bedrock, their safety. Additionally, they need to be allowed to be little and silly and tempestuous, and to grow up naturally, as they are neurologically and endocrinologically able to. Putting the weight of your happiness on them compromises both of those things.

Sometimes you might reverse this responsibility implicitly – tone of voice, etc. Sometimes you’ll do it explicitly – “you’re scaring me”, “you’re making me sad”, etc.

Be aware and beware of this power and responsibility reversal.

## Do the opposite of being angry

Your child is fooling around and knocks over a plant. You are, quite reasonably, some combination of displeased, disappointed, and angry. That suggests some obvious negative responses.

But. Your child knows they’ve done something wrong. They’re probably also upset. Yelling at them will achieve nothing positive. Saying, “Be more careful next time!” communicates and teaches nothing useful (I say that all the time, and it is garbage). Your child is already in a state of emotional distress – piling more on top will only ensure that they are lost in a haze of fear and shame and sadness, beyond the possibility of taking away any possible positive lesson.

So, your knee-jerk negative responses aren’t great. Then what?

Give them a hug. Say nothing for a bit. Hug until you feel your anger start to bleed away, which will also be around when they start to calm down. If there’s something constructive to say, you’ll think of it and be able to say it then. You’ll be able to say it in a voice that can be heard, and they will have the ability to hear it.

If you can’t think of any words of wisdom or guidance that aren’t just “be more careful next time”, then that’s okay – they have already learned something about the physics of mayhem. Just say “let’s clean this up together”.

You just turned a potentially really bad thing into a bunch of good things:

Your child trusts you to be calm and loving and helpful when things are tough. They won’t feel the need to hide such things from you. (Refer to the section on planning for teenagerhood.)
Your child did not have their brain bathed in stress hormones for an extended period. (Which, if it happens regularly, can have longer-term effects.)
You didn’t have your brain bathed in stress hormones for an extended period.
You don’t have to lie in bed awake because you know you fucked up and were pointlessly shitty to your little kid whom you love more than anything.

It’s so hard to do this, but so satisfying. You feel the rage coming on, you keep your mouth shut and hug.

## Hold but don’t grasp

Have your hand that’s on the side where your child is walking available for holding. Switch your bag to the other hand. When they decide to take your hand, be thankful and mindful, but don’t comment on it. Hold their hand as tightly or as loosely as they hold yours. When they decide to release your hand, let them go. Be thankful for the time you had holding it, but don’t comment on it (do not manipulate them into holding more or longer). Continue to keep your hand available long after they need to take it, feeling the loss but being pleased that your child is strong and independent.

Generalize and metaphorize.

## Give credit when it’s due

It’s easy for little achievements – chores completed, thoughtful acts, good behaviour, effort on little tasks that aren’t inherently fun – to go unremarked upon. Especially when they’re more frequent, they’ll just slip by.

I think it’s important, at least every now and then, to let your child know that you see the effort they’re making and that you appreciate it. As argued above, I think the tone of the praise should be neutral, and nothing about it needs to be overly effusive – just an acknowledgement. Like, “I know that dinner last night with a bunch of adults you don’t know must have been pretty boring, but I saw that you stayed patient and polite and I appreciate it.” Or, “I noticed that this morning you opened your blinds and turned off your light and I appreciate it.”

These are little things that are – or will become – just “expected” behaviours, but never forget that your child has a weensy little prefrontal cortex and anything without flashing lights requires a bunch of effort. You don’t need to give this kind of praise all the time, but take reasonable opportunities.

Do this with your spouse as well. We all like our daily little efforts to be appreciated and acknowledged. It communicates love as strongly as anything else.

## Be mealy-mouthed, get ignored

Kids don’t hear sentences that begin with “I think” or “I don’t think”. If you say “I think it’s time to go”, your child immediately discards it and continues to play unabated. And, really, who can blame them? What does it even mean? It sounds like some kind of indeterminate interior rumination. It’s barely actionable for yourself, and certainly not for another person. And definitely not for a zero-self-control little kid who just wants to play.

Kids also don’t know what “minutes” are. They know it has something to do with time and is less than a day, but that’s it. If you say “we’re leaving in 5 minutes”, it will mean nothing to your child and they will ignore you. (Hell, even as an adult, “we’re leaving in 5 minutes” means only something like “we’re leaving in less than an hour, probably”.)

So how are we supposed to make “transitions” smoother for our fragile little babies if “I think” and “minutes” are pointless? Well: A) I don’t know, good luck; and B) try to use fixed counts of activities, like “5 more times down the slide and then we leave”, and then count down. It’s still not going to work great (and they’ll cheat like crazy), but at least it’s an objective measure.

## Mindfulness helps

Shortly after becoming a parent I started practicing mindfulness meditation. (For related reasons.) I read 10% Happier and started using meditation apps (10% Happier, Headspace, Waking Up). And… it has helped me a lot. I’m not going to blab on about it, but I’ll mention a few key things.

First, mindfulness meditation focuses strongly on teaching you to be aware of your physical and emotional state. You pay a lot of attention to how various emotions make your body feel (tight in the chest, knot in the stomach, etc.), and you get much better at recognizing those manifestations of stress, anger, etc. before they take you over (or not too long after they do). This is incredibly valuable when parenting.

Secondly, the act of being mindful helps wonderfully with appreciating the joys of parenthood as they happen. Mindfulness meditation intentionally carves your mind in such a way as to make it easier (or even possible) to fully focus and appreciate the current moment. And when you have a child who is changing all the time – and will never return to the way they are this moment – it’s wonderful to appreciate as many moments as possible. (And, yeah, it helps some with coping with the daily annoying behaviours that it seems they’ll never grow out of.)

Thirdly, via mindfulness you come to the Buddhist understanding of “suffering”, which is something like “undue fretting over things that have already passed or may never come to pass”¹. Focusing on your actual experience, right now, helps reduce the fears that your kid is a psychopath, or will never read, or will forever be wearing diapers to bed.

## Do your best; be compassionate with yourself

A lot of the stuff I’ve written here can be very taxing as a parent. It is utterly impossible to do all of it all the time. As you do it, it will get somewhat easier to do it more of the time – as you carve your brain, and your child’s. But, again, it is not ever going to be possible to do it all the time. And that’s okay.

We do our best. We fuck up. We are kind to ourselves and we don’t beat ourselves up too much. We let go and begin again.

❤️

¹ I don’t know shit about shit, so take that Buddhism lesson with a big grain of salt. I am avoiding the word “skillful”.

More Than a Password

Fri, 14 Oct 2022 16:31:07 -0400

Here’s a quick-and-dirty explanation of why two-factor authentication is good, and why U2F/WebAuthn keys (like YubiKeys) are better than the alternatives. (So I have something to point friends and family at.)

Note that when I use the word “attacker”, I don’t necessarily mean someone that’s specifically out to get you. In fact, it’s usually someone trying to crack thousands of accounts at once. This isn’t about you being special or not special. You have an identity and some amount of money, so you’re a target.

(I use the word “attacker” out of habit – it’s what we use in software/security engineering.)

## Why memorized passwords (by themselves) are super bad

You can’t memorize strong passwords, and you certainly can’t memorize many of them. So you’re going to pick weak passwords and you’re going to reuse them between sites.

## Why weak passwords are bad

Attackers are very good at (automated) password guessing. They know the patterns people use (l33tSp34k, book passages, etc.), so it’s unlikely that some little trick you use will be effective. They also have the contents of database breaches (see below), so they use common real-world passwords in their attacks (e.g., “password spraying” attacks).

Maybe some of the services you use will be good at limiting the number of guesses at your password that an attacker gets. But others won’t be. And an attacker will guess your password on a weak service and then try it on all the others.

## Why reusing passwords is bad

Websites and services get breached all the time. Go see if you’ve already been caught in a breach at Have I Been Pwned – you probably have. And those are just the known breaches where the stolen data was made public. (Sign up for that site’s notification service so you’re told when you appear in a breached database.)

When such a breach occurs and the accounts database is stolen, often the passwords are stored incorrectly and can be extracted. That enables an extremely effective attack: credential stuffing. The attacker – or anyone else who gets access to the database – starts using the email or username plus password on every other site of interest.

That means that if you use the same password for Gmail and for the Flyfishing Fan Club site and the latter is breached, your email account will likely be compromised shortly after. And then every account that uses that email address for forgot-my-password recovery.

## Entering passwords by hand is pretty bad

The vast majority of people can’t tell the difference between a well-crafted phishing site and the site it’s copying. Indeed, usually the site is copied directly and looks identical. Additionally, there are many ways to make a domain name look like another, or just obfuscate it.

Every time you enter a password by hand, you’re betting on your ability to correctly guess whether a site is legit or not. There are certainly ways to mitigate this – visit via a search, browser history, etc. – but the general danger remains.

If you get phished when trying to log into your email, bank, Facebook, etc., it can be pretty bad, but also keep in mind that even if it happens on an “unimportant” site, you have instantly opened yourself up to credential stuffing attacks everywhere else.

## Use a password manager

Password managers (Bitwarden, LastPass, 1Password, etc.) are great and everyone should use them. They solve the “password memorization” and “password reuse” problems, and partly mitigate “entered by hand” phishing attacks.

The way they generally work is that you memorize (or even write down) one strong password that gets you into the manager… and that’s the only password you ever memorize again. You then generate random passwords using the manager, and it stores them. When you need to log in, you get the password from the manager.

Only one password, so no inhuman memorization. A different password for each site, so no more reuse.

The phishing (partial) mitigation comes from the password manager browser extension (I think they all have one). When you’re on a site with a properly matching domain, the auto-fill option is given. And when you’re not on a site with a matching domain name, auto-fill isn’t presented. So it’s basically doing an exact-match check for you. However… the browser extensions aren’t perfect, and it’s possible for them to not always show the auto-fill, which means users will get into the (bad) habit of sometimes having to copy from the manager and into the password field, even on legit sites.

(And then there are logins embedded in other sites. That can really mess up password managers and users.)

So, we’re in a pretty good place when using a password manager. We just shrug when Have I Been Pwned tells us that the Flyfishing Fan Club site is compromised, because we know that we haven’t used the password anywhere else (i.e., nowhere important). And since we’re having our password manager generate strong random passwords for us, we’re not concerned about guessing attacks on our passwords. (Like, it’ll take longer than the age of the universe to guess a 10-random-character password.)

But we’re still somewhat vulnerable to phishing attacks. So let’s address that.

## We need a second factor

We need another tool to help us close the phishing hole. We need a second factor.

“Two factor authentication” typically refers to “something you know” – the password – and either “something you have”, like a hardware token, or “something you are”, like your fingerprint.

### SMS

Probably the most common “something you have” second factor is in the form of SMS text messages with codes that you need to enter into a site or service after providing your username and password. This is better than literally nothing, but it’s not great.

One problem with it is that attackers have become proficient at “SIM hijacking”. This allows them to take over your phone number and receive your text messages. Because they’re getting the prompt to “enter the code we just texted to you”, they know they have the correct username and password, so the extra effort is worthwhile.

But note that it is extra effort for the attacker, and therefore does provide some protection.

The other major shortcoming is that an attacker that creates an effective phishing site can capture your username, password, and the SMS code. Then they’ll pass on all three to the target site/service and have access to your account. (There are even open source tools that make doing exactly this very easy.)

So that’s not great.

### Time-based codes

Another common form of “something you have” is an authenticator app that spits out time-based codes for a site or service. Cool kids call this TOTP: “time-based one-time password”.

This is similar to the SMS approach but better in two ways:

No SIM hijacking problem.
The time window in which the code can be used is typically smaller, so it again increases the effort. Somewhat.

But… It’s still not a full phishing mitigation.
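For the curious, here is roughly what an authenticator app computes. This is a minimal Python sketch of RFC 6238 with the common defaults (HMAC-SHA1, 30-second steps); the function name `totp` and its parameters are mine, for illustration only.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time code per RFC 6238, HMAC-SHA1 variant."""
    now = time.time() if at is None else at
    counter = int(now // step)  # whole time steps since the Unix epoch
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # RFC 4226 "dynamic truncation"
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)
```

Notice that nothing in the calculation involves the site you are typing the code into. Whoever holds the code during its time step can use it, which is why a phishing site that relays it quickly still wins.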

### Hardware keys

U2F (universal second factor), and its successor WebAuthn, is a standard for second factors that actually does completely negate the phishing threat. The implementation of the standard is usually in the form of a slim USB key that has a button on it, but it can vary. The most well-known brand of them is YubiKey.

Let’s see if I can describe how it works concisely and coherently…

With U2F enabled, after you log in to, say, Gmail, you are prompted to insert your key and touch the button on it. The browser then sends the domain name of the site you’re on to the key, which combines it with some cryptographic junk it stores to produce a code to send back to the Gmail server. Based on information you gave when you registered the key with them, Gmail can figure out if the code is good.

Because a phishing site won’t literally be at gmail.com, it won’t get a code that works for gmail.com. So it just cannot get what it needs to complete the login on your behalf, even if it knows your username and password.
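To make the domain-binding idea concrete, here is a toy model in Python. It is an assumption-laden sketch, not how real keys work internally: real U2F/WebAuthn keys use public-key signatures, and HMAC stands in here only so the example fits in the standard library. `ToyKey`, `register`, `respond`, and `site_verifies` are made-up names.

```python
import hashlib
import hmac
import os


class ToyKey:
    """Toy stand-in for a hardware key. Real keys sign with a private key, not HMAC."""

    def __init__(self) -> None:
        self._master = os.urandom(32)  # never leaves the key

    def _site_secret(self, origin: str) -> bytes:
        # A different secret for every domain, derived from the master secret.
        return hmac.new(self._master, origin.encode(), hashlib.sha256).digest()

    def register(self, origin: str) -> bytes:
        # Toy shortcut: the site stores this secret.
        # (A real key hands over a public key instead.)
        return self._site_secret(origin)

    def respond(self, origin: str, challenge: bytes) -> bytes:
        # The browser supplies `origin`; the page cannot choose it.
        return hmac.new(self._site_secret(origin), challenge, hashlib.sha256).digest()


def site_verifies(registered: bytes, challenge: bytes, response: bytes) -> bool:
    expected = hmac.new(registered, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

In this model a response produced while the browser is on `gmail.com.evil.example` is computed from a different per-domain secret, so relaying it to the real `gmail.com` fails verification.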

(Technical sidebar caveat: If an attacker subverts DNS and/or BGP and somehow gets a TLS cert for gmail.com, then they can fool your browser and U2F token. But that’s catastrophic for everyone, for lots of reasons, and is exceedingly rare. But not impossible.)

These keys even work if you leave them plugged into your computer. (YubiKey even makes ones that are stubby little things, designed to be left in.) The only loss in security is if the key is physically stolen and the thief also has your username and password. But this kind of in-person attack is much rarer than the huge dragnet attacks that occur across the internet. (And most physical thieves probably just want to sell your hardware.)

## What about “passkeys”

There are also some password-less authentication methods in the WebAuthn standard. Google, Apple, and Microsoft support (or soon will) “passkeys” that adhere to this standard. Which is cool. I bet they’re great and security-strong. But: a) I don’t know enough about them yet to say anything useful, and b) until they’re supported more widely it’s hard to be too excited.

## Why should you care?

Answering the question of why you should care about protecting your accounts seems too tedious to tackle, but at the very least consider: Your email is the skeleton key to your online life, and a lot of “online” life spills over into the rest of your life.

Email is used to recover almost every other kind of account. So if someone hijacks your email, they have the ability to control almost every other account you have. This can cost you a lot – money, time, etc.

You can probably think of some other services that would be bad to lose control over: Facebook, your bank, anything your business depends on, etc.

## Get a password manager and two YubiKeys

Run – don’t walk – to get a password manager. Memorized passwords are a ticking time bomb.

The case for second factors isn’t quite as convincing, but phishing is still a real and potentially devastating threat. So get two YubiKeys.

Why two? One on your keychain and one where you keep your passports. Maybe one more if you’re going to leave it plugged into your computer. Because eventually you’ll lose or wreck (like I did) the one in your pocket, and want it to be easy to recover.

Why YubiKey? Because they’re well-respected and “Made in USA or Sweden”. It maybe seems tinfoil-hat-y to worry about where your U2F token is made, but… You don’t want an adversarial country to have a copy of the cryptographic keys stored on it.

Forwarded Header Sabotage

Wed, 30 Mar 2022 09:37:14 -0400

We all know by now that the leftmost values in the X-Forwarded-For header can be spoofed and only the rightmost IPs – added by your own reverse proxies – can be trusted. The Forwarded header (RFC 7239, 2014) has that same problem, and a new one: If the header is parsed correctly, an attacker can sabotage the whole header.

Let’s take a quick trip to understanding how that can happen and how complicated Forwarded parsing can get. (Think about how you’d parse the header as we go.)

## Syntax

A simple Forwarded header might look like this:

Forwarded: for=1.1.1.1, For=2.2.2.2

Here’s what a header looks like with an IPv6 value:

Forwarded: for=1.1.1.1, FOR="[2001:db8:cafe::17]"

Colons and square brackets are not allowed in a “token”, so the IPv6 address needs to be quoted. But that means we could have:

Forwarded: host="with,comma=equals;semicolon";for=1.1.1.1

So now you can’t just split by comma – you need to be aware of quoted strings as well.

But characters can also be escaped, so this is also legal:

Forwarded: ext="escaped\"quote";for=1.1.1.1

The blessed semicolon-separated parameter names in an entry are “for”, “by”, “host”, and “proto” (case-insensitive). There is allowance for “extensions” using other tokens.

Some more legal things:

Anything can be escaped, including backslashes: \\. So don’t just delete them all.
Any amount of whitespace around the commas and semicolons.
There can be multiple instances of the header, and they must be considered a single list, top to bottom.

And some illegal things:

Can’t have whitespace around the equal sign.
Can’t have disallowed characters in parameter names (and not quotable).
Can’t have disallowed characters in a parameter value, if not quoted. (Which means, for example, that an unquoted IPv6 address is illegal.)
IPv6 addresses must have square brackets.
Backslash escaping is only allowed in quoted strings.

There is only one single library I’ve found that actually correctly parses the header: github.com/lpinca/forwarded-parse.¹ Everything else just does what you were probably thinking after the first couple of steps above:

Split by comma.
Trim whitespace.
Split by semicolon.
Trim the quotes off the value.
Done.

Hilariously, this half-assed, RFC-violating parsing is resistant to sabotage and proper parsing is not.
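The split-trim-split steps listed above amount to something like the following Python sketch. The function name is hypothetical and this is not taken from any particular library; it only illustrates the common approach.

```python
def parse_forwarded_naive(header: str) -> list[dict[str, str]]:
    """Split by comma, then semicolon, then trim quotes. Not RFC-compliant."""
    entries = []
    for element in header.split(","):
        params = {}
        for pair in element.strip().split(";"):
            name, sep, value = pair.strip().partition("=")
            if not sep:
                continue  # no "=" in this piece, so skip it
            params[name.lower()] = value.strip('"')
        entries.append(params)
    return entries
```

It mangles a quoted value that contains a comma, splitting it into bogus entries, but it never throws away the entries to the right of the damage.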

## Sabotage!

The Forwarded header is unique. It is the only header that:

Has untrusted values at the start and trusted values at the end.
Is official and specified.

This combination leads to its susceptibility to sabotage, where the whole header – including the trusted part – needs to be discarded because of chicanery in the untrusted part.

The RFC doesn’t (that I can find) provide any special instructions about salvaging the rest of the header if a single entry (“forwarded-element”) has a syntax error. So, in theory, the whole header needs to be thrown out if a spoofer adds, say, f@r= instead of for=.

The sabotage is even more fun with an unclosed double-quote:

Forwarded: for="1.1.1.1, for=2.2.2.2, for=3.3.3.3

It’s illegal to have an unclosed quote, so the whole thing is immediately garbage. But even if you wanted to salvage the header… Where do you close the quote? What do you salvage and discard?
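Here is a sketch of the quote-aware half of a stricter parser, to show where the sabotage bites. It is an illustration under my own assumptions (hypothetical function name, and it only splits elements; it does not validate tokens or parameter names).

```python
def split_forwarded_elements(header: str) -> list[str]:
    """Split on commas outside quoted strings; raise ValueError on an unclosed quote."""
    elements, current = [], []
    in_quotes = False
    i = 0
    while i < len(header):
        ch = header[i]
        if in_quotes and ch == "\\":
            if i + 1 >= len(header):
                raise ValueError("dangling backslash")
            current.append(header[i:i + 2])  # keep the escape pair intact
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "," and not in_quotes:
            elements.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unclosed quoted string")
    elements.append("".join(current).strip())
    return elements
```

With the unclosed-quote header above, the scan reaches the end of the string still inside the quote. The only honest result is an error, and the trusted `for=3.3.3.3` goes down with it.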

## Why is `X-Forwarded-For` not sabotage-able?

Because there’s no spec! People just split by comma, trim, and that’s it. Your trusted reverse proxy will add ", 1.1.1.1" and you don’t really need to care about what comes before that. (Unless you want the leftmost-ish value, but then you’re in the danger zone regardless.)

## Mitigations

“Half-assed, RFC-violating parsing” is the most obvious. If you’re using a rightmost-ish value, you should know if your reverse proxies are going to be quoting things, escaping, etc. – and they probably aren’t. So do a simple comma-splitting and throw away the stuff on the left.

Doing simple splitting means that you could end up with total garbage in your leftmost values – from spoofing or from valid-but-complicated values. You could probably make your parser more-complex-but-still-not-RFC-compliant by trying to handle quotes and escaping, without discarding everything in the case of bad data. Perhaps your deviation rule could be “no commas allowed in quotes or escaped; they always signal a new entry”. Or just don’t use a leftmost value.

(Note that differences in parsing at different points – reverse proxies, server, etc. – could result in parser mismatch vulnerabilities.)

At the reverse proxy level, the obvious mitigation is to discard any existing Forwarded headers and start fresh, so there are only trusted, well-formed values. If you don’t like the idea of discarding potentially valuable forensic information, maybe your reverse proxy could move the previous header value into some new X- header before starting fresh.

A variation on unconditionally discarding the Forwarded is to check for validity and, if it fails, discard or replace with for=unknown (which is also per spec). (I asked lpinca and this is what he prefers.)
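A validate-or-replace step could look something like this Python sketch. The pattern is assembled from RFC 7230’s token and quoted-string definitions. It is deliberately simplified and stricter than the RFC in places (for example, it rejects empty list elements), and `sanitize_forwarded` is a hypothetical name, not a real proxy API.

```python
import re

# Building blocks, following RFC 7230's token and quoted-string definitions.
TOKEN = r"[!#$%&'*+.^_`|~0-9A-Za-z-]+"
QUOTED = r'"(?:[\t \x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\[\t \x21-\x7E\x80-\xFF])*"'
PAIR = rf"{TOKEN}=(?:{TOKEN}|{QUOTED})"
ELEMENT = rf"{PAIR}(?:;{PAIR})*"
HEADER = re.compile(rf"{ELEMENT}(?:[ \t]*,[ \t]*{ELEMENT})*")


def sanitize_forwarded(incoming: str) -> str:
    """Return the header unchanged if it validates, else the placeholder."""
    if HEADER.fullmatch(incoming.strip(" \t")):
        return incoming
    return "for=unknown"
```

The reverse proxy would then append its own element to whatever this returns, so the trusted rightmost values always sit in a well-formed header.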

Another possibility is to have your reverse proxy not discard the existing Forwarded headers but to add a new one. Maybe the separate headers could be sanely interpreted separately. But this violates RFC 2616, which says “It MUST be possible to combine the multiple header fields into one “field-name: field-value” pair, without changing the semantics of the message”.

So, I can’t see how it’s possible to adhere to the spec and retain the existing header value without falling victim to sabotage.

## Conclusion

I wrote a half-assed, RFC-violating Forwarded parser. Then I decided I wanted it to be “correct” and started looking for better implementations.

Along the way I found an Nginx forum conversation from 2017 about adding Forwarded support. That conversation is interesting and brought the sabotage potential to my attention, but there’s no resolution. It just kind of trails off into hopelessness. (Interestingly, there was strong resistance to discarding the header and thereby losing information.)

Because my Forwarded parser is in a project that I am hoping will be a reference implementation for getting the “real” client IP, I really wanted the parser itself to be a reference implementation. But, as disappointing as it is, it seems like being spec-compliant is the wrong move. I also can’t assume that a user of the library has the ability to tweak their reverse proxy handling of Forwarded (I mean, I could state it as a requirement for use of the library, but that’s limiting and error-prone). Documenting the shortcomings seems about as good as it gets.

(Now I have to figure out how to summarize this in the MDN Forwarded page update that I’m on the hook for…)

## Addendum

[2022-04-03: Added this section.]

David Moles pointed out that Nginx has instructions for enabling Forwarded handling. Part of that is this amazing regex that should be used for validation:

^(,[ \t]*)*([!#$%&'*+.^_`|~0-9A-Za-z-]+=([!#$%&'*+.^_`|~0-9A-Za-z-]+|"([\t \x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\[\t \x21-\x7E\x80-\xFF])*"))?(;([!#$%&'*+.^_`|~0-9A-Za-z-]+=([!#$%&'*+.^_`|~0-9A-Za-z-]+|"([\t \x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\[\t \x21-\x7E\x80-\xFF])*"))?)*([ \t]*,([ \t]*([!#$%&'*+.^_`|~0-9A-Za-z-]+=([!#$%&'*+.^_`|~0-9A-Za-z-]+|"([\t \x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\[\t \x21-\x7E\x80-\xFF])*"))?(;([!#$%&'*+.^_`|~0-9A-Za-z-]+=([!#$%&'*+.^_`|~0-9A-Za-z-]+|"([\t \x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\[\t \x21-\x7E\x80-\xFF])*"))?)*)?)*$

David and Tim McCormack (via email) both suggested that maybe the Forwarded header could be parsed backwards. I initially didn’t think that would be much better than just splitting by comma and parsing each piece, but I’ve come around to the idea. It allows stricter RFC adherence (quoted commas) while still allowing salvaging of rightmost good values.

Seriously, big kudos to lpinca. ↩︎

Symbol and text labels

Mon, 28 Mar 2022 10:54:16 -0400

When I first saw this machine in the doctor’s office (waiting, bored) I thought the duplication of the labels – especially on the power button – was silly. Then I realized that I was being Anglo-centric and that having both enables use by more people, in more places.

Should you strip the IPv6 zone?

Tue, 22 Mar 2022 18:32:38 -0400

There have recently been three different (but related) contexts where I have asked or been asked that question:

When a reverse proxy is adding the client IP to the X-Forwarded-For header.
When the client IP is being used for rate limiting.
When checking if a client IP is contained in a configured list of ranges/prefixes/CIDRs.

As I understood more about zones my opinion on this changed. This is an attempt to capture my understanding and where I ended up.

## What are IPv6 zones?

Only link-local addresses have zone identifiers¹. The Wikipedia definition² of IPv6 link-local address assignment will get us started:

All interfaces of IPv6 hosts require a link-local address, which have the prefix fe80::/10. This prefix is combined with a 64-bit suffix, which the host can compute and assign by itself without the presence or cooperation of an external network component like a DHCP server, in a process called link-local address autoconfiguration.

Link-local addresses are guaranteed to be unique within a “link” – the subnet or network segment the interface is connected to. They have no uniqueness guarantees beyond that, and are not routed beyond the link they belong to.

Because a single machine might have multiple interfaces on separate links, there might be peers on those separate links with the same link-local address as each other. In order to correctly address those peers, there needs to be an additional qualifier – the zone identifier.

In addition to a link-local address, each interface gets a zone identifier. If you list the interfaces on your machine, you’ll see link-local addresses like fe80::2934:e0a2:73b1:80da%21 or fe80::9105:d0ed:bbda:9935%eth0 – the part after the percent sign is the zone.

If you want to connect to the link-local address of another machine you share a link with, it’s not enough to know that its IP is fe80::abcd – you also need to know which of your interfaces to connect through to get to it. So the zone-qualified address would be like fe80::abcd%eth0. You then have enough information to connect to that machine.
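To make the address/zone split concrete, here is a minimal Go sketch using the standard `net/netip` package (Go 1.18+). The helper name `splitZone` is made up for illustration; it only separates the two parts and says nothing about whether the zone should be kept.

```go
package main

import "net/netip"

// splitZone parses an IP address string and returns the address with its
// zone removed, plus the zone identifier ("" if the address had none).
// The function name is hypothetical; the netip calls are standard library.
func splitZone(s string) (netip.Addr, string, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return netip.Addr{}, "", err
	}
	// WithZone("") returns a copy of the address with no zone.
	return addr.WithZone(""), addr.Zone(), nil
}
```

Calling `splitZone("fe80::abcd%eth0")` yields the address `fe80::abcd` and the zone `eth0`, and `IsLinkLocalUnicast()` reports true for the returned address.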

It’s important to note that the zone identifier has no meaning outside of the computer that owns the interface. Your IP+zone for a peer machine might be fe80::abcd%eth0, but the address that machine has for its own interface might be fe80::abcd%wifi0 and its address for your machine would then be fe80::9105:d0ed:bbda:9935%wifi0.

Link-local addresses may change on reboot (depending on assignment method and the addresses of other peers on the link). Zones probably won’t change, since they use either the index or name of the interface (or both), but I don’t know if there’s a guarantee of that.

I’m not sure if zones are directly modifiable, but it’s possible in Linux to change interface names, which should result in a zone change.

## So should zones be stripped?

As RFC 6874 says:

It should be noted that zone identifiers have purely local meaning within the node in which they are defined, often being the same as IPv6 interface names. They are completely meaningless for any other node.

That suggests that we should be discarding the zone from addresses that are destined for use anywhere other than the machine that added the zone. But, as we’ve seen, link-local IP addresses are meaningless without the associated zone. So… what’s the right answer?

Let’s examine the individual contexts I mentioned at the top.

### Adding to the `X-Forwarded-For` header

(I wrote extensively about the hazards of using the X-Forwarded-For header. I won’t be repeating it here so we can stay out of the weeds.)

The main use of the X-Forwarded-For header is to get the “real” client IP. A secondary use is to see what path a request took.

The address being appended to the XFF header is the address of the peer connecting to the machine in question, via its own link-local address. So it’ll look like “remote_fe80_ip%local_zone_id”.

I think it’s better if the link-local address zone is preserved. Without it, we won’t know which link the address belongs to, so we won’t know which remote machine made the connection.

Of course, it depends on how the XFF values will actually be used. However, that’s not always known so it seems better to include more information rather than less.

### Rate limiting by IP

This is partly an extension of the XFF consideration, as that header is often the source of the IP we use for rate limiting.

First of all, let’s deal with the possibility of zone spoofing via X-Forwarded-For. If we include the zone in the key we’re using to rate limit, then the possibility arises of an attacker altering the zone to avoid being rate-limited. I think we can dismiss this concern because if an attacker has the ability to change the zone value, then it will likely be no harder (and probably easier) to change the IP itself.

If we’re getting the client IP directly from the socket (rather than a header), then we don’t need to worry about spoofing. However, one thing that can go wrong is if the rate-limiting server changes interface names. Then the same client will end up with different “ip%zone” values.

Which leads us to another potential problem: If the rate-limiting DB is shared between multiple servers, then the same client will have different “ip%zone” values for each server, as they’ll have different interface names/indexes. The rate limit for a single link-local client will effectively be multiplied by the number of servers.

If we strip the zone, then we lose some specificity, which may result in different machines on different links being rate-limited because they coincidentally have the same link-local IP. While this is true, the way link-local addresses are generated makes this extremely unlikely. So we can dismiss this as well.

The shared rate-limiting DB case seems more weighty than the others, so my feeling is that it’s better to strip the zone.

### Ranges/prefixes/CIDRs

I’ll only be considering the “does this prefix contain this IP” use of prefixes (mostly because that’s the use I was looking at in this context).

This one seems more obvious: If the user includes a zone in a prefix, then that zone should be respected; if there’s no zone in the prefix, then the zone should be stripped from the IP before checking if the prefix contains it.

To make that more concrete:

With zone: Prefix fe80::%eth0/10 should contain fe80::abcd%eth0 but not fe80::abcd%wifi0.
Without zone: Prefix fe80::/10 should contain fe80::abcd%eth0 and fe80::abcd%wifi0.

What if the prefix contains a zone but the IP has had the zone already stripped? I don’t think there’s a good answer – neither “contains” nor “does not contain” is entirely sane. So that’s another reason not to prematurely strip the zone.
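Here is a rough Go sketch of those prefix-matching rules. Go’s `netip.Prefix` can’t carry a zone, so the `zonedPrefix` type and its helpers are hypothetical wrappers that pull the zone out by hand. For the ambiguous case (zoned prefix, zone-less IP) this sketch arbitrarily answers “does not contain”; that is an assumption, not a recommendation.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// zonedPrefix pairs a prefix with an optional zone. An empty zone means
// "match any zone". This type is hypothetical, not part of net/netip.
type zonedPrefix struct {
	prefix netip.Prefix
	zone   string
}

// parseZonedPrefix accepts "fe80::/10" as well as "fe80::%eth0/10".
func parseZonedPrefix(s string) (zonedPrefix, error) {
	addrPart, bits, ok := strings.Cut(s, "/")
	if !ok {
		return zonedPrefix{}, fmt.Errorf("no prefix length in %q", s)
	}
	// Separate the zone, then give netip a zone-free prefix string.
	ipPart, zone, hasZone := strings.Cut(addrPart, "%")
	if hasZone && zone == "" {
		return zonedPrefix{}, fmt.Errorf("empty zone in %q", s)
	}
	p, err := netip.ParsePrefix(ipPart + "/" + bits)
	if err != nil {
		return zonedPrefix{}, err
	}
	return zonedPrefix{prefix: p, zone: zone}, nil
}

// contains applies the rules: a zoned prefix requires the same zone on the
// IP; an unzoned prefix ignores the IP's zone.
func (zp zonedPrefix) contains(ip netip.Addr) bool {
	if zp.zone != "" && ip.Zone() != zp.zone {
		return false
	}
	return zp.prefix.Contains(ip.WithZone(""))
}

// zonedContains is a convenience wrapper that parses both strings.
func zonedContains(prefix, ip string) (bool, error) {
	zp, err := parseZonedPrefix(prefix)
	if err != nil {
		return false, err
	}
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false, err
	}
	return zp.contains(addr), nil
}
```

With this, `zonedContains("fe80::%eth0/10", "fe80::abcd%wifi0")` is false, while `zonedContains("fe80::/10", "fe80::abcd%wifi0")` is true.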

### Additional concerns

There are other factors involved in all of these considerations.

The first is that link-local addresses probably shouldn’t be put to any use that requires these considerations. They should probably only be used for low-level automatic network coordination – any direct use of a link-local address would be better served by a unique local address. But if you’re writing a general-purpose library or reverse proxy you can’t just say “well, no one should use them” and wash your hands of it.

Another consideration is the restrictions imposed by your tools and programming language. For example, I recently wrote a post examining Go’s handling of prefixes and IPs with zones. In that case, Go’s address prefix types don’t support zones at all (in some confusing ways). I suspect that zone handling elsewhere is similarly uneven.

An example of such constraints limiting design is the Caddy reverse proxy stripping the zone because of the Go zone-handling limitations.

Finally, another quote from RFC 6874:

Today, [zone identifiers] are meaningful only when attached to addresses with less than global scope, but it is possible that other uses might be defined in the future.

So even if link-local addresses aren’t important to you now, it doesn’t mean that zones won’t be important to you later. (But, yeah, that’s hand-wavy and not very compelling.)

### Conclusions

Zones should be kept until the point of use, and then the decision to keep or strip them should be based on the specific use of the IP and ramifications of zones to that use.

This means that reverse proxies should be including the zone in the X-Forwarded-For header, rate limiters should probably be discarding them, and prefix-contains-IP checks should be based on whether there’s a zone in the prefix. But these are only examples – there are myriad uses of IP addresses, and the particular use will dictate (or at least inform) the fate of the zone.

Of course, this is all debatable. To see some other opinions, check out the few comments I got when I asked about this in the r/ipv6 subreddit. If you have a differing opinion or know of anyone else having written about this, please let me know.

## Appendix: Checking your own link-local addresses and zones

Windows:

$ ipconfig
...
Wireless LAN adapter Wi-Fi:

 Connection-specific DNS Suffix . : home
 Link-local IPv6 Address . . . . . : fe80::2934:e0a2:73b1:80da%21
 IPv4 Address. . . . . . . . . . . : 192.168.1.11
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 Default Gateway . . . . . . . . . : 192.168.1.1
...

$ ping fe80::2934:e0a2:73b1:80da%21

Pinging fe80::2934:e0a2:73b1:80da%21 with 32 bytes of data:
Reply from fe80::2934:e0a2:73b1:80da%21: time<1ms
...

$ ping fe80::2934:e0a2:73b1:80da%nope
Ping request could not find host fe80::2934:e0a2:73b1:80da%nope. Please check the name and try again.

Ubuntu (under WSL1 in this case, so same values):

$ ip address

...
21: wifi0:  mtu 1500 group default qlen 1
 link/ieee802.11 64:bc:58:11:a9:f9
 inet 192.168.1.11/24 brd 192.168.1.255 scope global dynamic
 valid_lft 244176sec preferred_lft 244176sec
 inet6 fe80::2934:e0a2:73b1:80da/64 scope link dynamic
 valid_lft forever preferred_lft forever
...

$ ping fe80::2934:e0a2:73b1:80da%21

PING fe80::2934:e0a2:73b1:80da%21(fe80::2934:e0a2:73b1:80da%wifi0) 56 data bytes
64 bytes from fe80::2934:e0a2:73b1:80da%wifi0: icmp_seq=1 ttl=128 time=0.283 ms
...

$ ping fe80::2934:e0a2:73b1:80da%wifi0

PING fe80::2934:e0a2:73b1:80da%wifi0(fe80::2934:e0a2:73b1:80da%wifi0) 56 data bytes
64 bytes from fe80::2934:e0a2:73b1:80da%wifi0: icmp_seq=1 ttl=128 time=0.298 ms
...

$ ping fe80::2934:e0a2:73b1:80da%nope
ping: fe80::2934:e0a2:73b1:80da%nope: Name or service not known

This is not entirely true. Site-local addresses also used zone identifiers, but they were deprecated and went out of use about 15 years ago. Also, the “zone identifier” is sometimes called a “zone index”. ↩︎
Well, one of the definitions. There’s also a whole entry devoted to link-local addresses. ↩︎

A tiny flaw in Go's netip design

Mon, 21 Mar 2022 18:30:38 -0400

Update 2022-03-23: Matt Layher created a Go issue about this.

Update 2022-04-14: In response to that issue, two weeks ago a change was committed to Go that makes netip.ParsePrefix behave like net.ParseCIDR: they both return an error when a zone is present. It wasn’t released in 1.18.1, but I’m guessing it’ll be in 1.18.2. So that’s great!

Does this surprise you? (Try it in the playground.)

prefix := netip.MustParsePrefix("fe80::%zone/10")
addr := netip.MustParseAddr("fe80::1%zone")
fmt.Println(prefix.Contains(addr)) // ==> false

Go’s new-as-of-1.18 netip package is better in every way than the previous net.IP, etc., but this one design decision will probably burn someone, somewhere, sometime.

If you pass a prefix with a zone to the older net.ParseCIDR it returns an error. If you pass a prefix with a zone to the newer netip.ParsePrefix, it succeeds but silently discards the zone. If you then pass an IP address that is clearly contained by the original prefix – including the zone – to netip.Prefix.Contains… it returns false!

## Why is it like this?

I learned about this from a Reddit comment by Matt Layher¹ who worked on netip (or the original inet.af/netaddr):

For what it’s worth, I helped work on the library that ultimately became Go’s net/netip and we decided we would remove zones in our CIDR prefix parser because we didn’t find any documented usage of a a CIDR like “fe80::%eth0/64” in the wild.

Which is fair, but I don’t think the resulting behaviour is ideal.

## What do the docs say?

The documentation for netip.Prefix.Contains does make clear the behaviour (emphasis added):

Contains reports whether the network p includes ip.

An IPv4 address will not match an IPv6 prefix. A v6-mapped IPv6 address will not match an IPv4 prefix. A zero-value IP will not match any prefix. If ip has an IPv6 zone, Contains returns false, because Prefixes strip zones.

It’s good that it’s documented, but… how many people are going to read the doc for that method? Most people who use it are going to know what it means for a prefix (or CIDR) to “contain” an IP address. And many of us will already be familiar with the older net.IPNet.Contains, which has the one-sentence documentation: “Contains reports whether the network includes ip.” And the doc for netip.ParsePrefix says nothing about discarding the zone.

## Why do I care about this fringe thing that no one uses?

I’m writing a library that will take a configured list of prefixes/CIDRs/ranges, parse them, and then later check if incoming IPs are contained by them. And whether the IP is contained or not could lead to security-relevant decisions, so the accuracy is important.

With the older net package, if the user tried to configure the library to use "fe80::%zone/10", the parsing would fail and there would be an immediate error. If I switch to using netip, the parsing will succeed but then the Contains checks will return false and the resulting behaviour will be wrong. (The ramifications of that will depend on how the library is being used. It could mean rate-limiting a link-local IP. It could mean using a link-local IP for an access control check where it should instead be an external IP.)

So even though the Go/netip/netaddr team didn’t find any instance of a link-local-with-zone-prefix “in the wild”, I still need to code (defensively) for the possibility of it.

To be safe I’m going to have to force the netip code to behave like the net code: return an error from the prefix parsing code if there’s a percent sign.
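A minimal sketch of that defensive wrapper might look like the following. The function name `parsePrefixRejectZone` is made up for illustration. On Go versions where `netip.ParsePrefix` already rejects zones, the extra check is redundant but harmless.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parsePrefixRejectZone behaves like netip.ParsePrefix, except that it
// always returns an error when the input contains a zone (a percent sign),
// matching the behaviour of the older net.ParseCIDR.
func parsePrefixRejectZone(s string) (netip.Prefix, error) {
	if strings.Contains(s, "%") {
		return netip.Prefix{}, fmt.Errorf("prefix %q must not contain a zone", s)
	}
	return netip.ParsePrefix(s)
}
```

Here `parsePrefixRejectZone("fe80::%zone/10")` fails immediately at configuration time, while `parsePrefixRejectZone("fe80::/10")` parses as usual.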

## Bonus: IPv4-mapped IPv6 handling has also changed

As hinted at in the netip.Prefix.Contains doc I quoted above…

prefix := netip.MustParsePrefix("1.0.0.0/8")

// Let's check that it's working as expected
addr := netip.MustParseAddr("1.1.1.1")
fmt.Println(prefix.Contains(addr)) // ==> true

// Now let's try the "IPv4-mapped IPv6" representation of the same address
addr = netip.MustParseAddr("::ffff:1.1.1.1")
fmt.Println(addr) // ==> "::ffff:1.1.1.1"
fmt.Println(prefix.Contains(addr)) // ==> false!

// But with the older net.IP and net.IPNet...
_, cidr, _ := net.ParseCIDR("1.0.0.0/8")
ip := net.ParseIP("::ffff:1.1.1.1")
fmt.Println(ip) // ==> "1.1.1.1"
fmt.Println(cidr.Contains(ip)) // ==> true!

(Try it in the playground.)

The older net code would convert IPv4-mapped IPv6 addresses to IPv4 addresses, with the result that they would be contained by IPv4 CIDRs. The new netip code does not convert to IPv4, and the resulting address is not contained by an IPv4 prefix.
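If you do want the older behaviour with netip, one option is to unmap the address before the containment check. This is only a sketch: `containsLikeNet` is a hypothetical helper, and it only unmaps the address, so an IPv4-mapped IPv6 prefix would still not match.

```go
package main

import "net/netip"

// containsLikeNet reports whether prefix contains ip, after converting an
// IPv4-mapped IPv6 address (like ::ffff:1.1.1.1) to plain IPv4 with Unmap.
func containsLikeNet(prefix, ip string) (bool, error) {
	p, err := netip.ParsePrefix(prefix)
	if err != nil {
		return false, err
	}
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false, err
	}
	// Unmap is a no-op for addresses that are not IPv4-mapped IPv6.
	return p.Contains(addr.Unmap()), nil
}
```

With this helper, `containsLikeNet("1.0.0.0/8", "::ffff:1.1.1.1")` is true, as it was with the older net code.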

I haven’t yet thought about this enough to form a strong opinion, but it’s good to know.

Update 2022-04-14: I created an issue about this a few weeks ago. (I also PR’d some documentation fixes regarding the consistent use of “IPv4-mapped IPv6”. It’s been merged.)

Who super helpfully answered my Reddit question and I’m totally not taking a swipe at him. To be clear, I still think netip is great and will be using it wherever I can make 1.18 the minimum Go version. ↩︎

The perils of the “real” client IP

Fri, 04 Mar 2022 13:20:10 -0500

## Summary

This post ended up being incredibly ~~long~~ comprehensive. I’m afraid that many people won’t read enough to get everything that’s important, so here are the key points:

When deriving the “real client IP address” from the X-Forwarded-For header, use the rightmost IP in the list.
The leftmost IP in the XFF header is commonly considered to be “closest to the client” and “most real”, but it’s trivially spoofable. Don’t use it for anything even close to security-related.
When choosing the rightmost XFF IP, make sure to use the last instance of that header.
Using special “true client IPs” set by reverse proxies (like X-Real-IP, True-Client-IP, etc.) can be good, but it depends on a) how the reverse proxy actually sets it, b) whether the reverse proxy sets it if it’s already present/spoofed, and c) how you’ve configured the reverse proxy (sometimes).
Any header not specifically set by your reverse proxy cannot be trusted. For example, you must not check the X-Real-IP header if you’re not behind Nginx or something else that always sets it, because you’ll be reading a spoofed value.
A lot of rate limiter implementations are using spoofable IPs and are vulnerable to rate limiter escape and memory exhaustion attacks.

If you use the “real client IP” anywhere in your code or infrastructure, you need to go check right now how you’re deriving it.

This is all explained in detail below, so keep reading. It’s a weird, scary, bumpy ride.

## Introduction

The state of getting the “real client IP” using X-Forwarded-For and other HTTP headers is terrible. It’s done incorrectly, inconsistently, and the result is used inappropriately. This leads to security vulnerabilities in a variety of projects, and will certainly lead to more in the future.

After thinking about rate limiters for a while, I started worrying about their IPv6 handling. I wrote a post detailing how bad IPv6 rate limiting can and does lead to rate limiter escape and memory exhaustion. Then I moved on to worrying about how rate limiters determine what IP to rate-limit when they’re behind a load balancer (or any reverse proxy). As you’ll see, the situation is bad.

But this isn’t just about rate limiters. If you ever touch code that looks at the X-Forwarded-For header, or if you use someone else’s code that uses or gives you the “real client IP”, then you absolutely need to be savvy and wary. This post will help you get there.

NOTE: Portions of this are redacted as I’m trying to disclose responsibly to the affected projects. Those portions will be added in as those projects choose to make the issues public. (So check back later!)

## It can’t be that hard to get the real client IP, right?

There are many reasons why web services are interested in the IP address of their clients: geographical stats, geo-targeting, auditing, rate-limiting, abuse-blocking, session history, etc.

When a client directly connects to a server, the server can see the client IP address of the immediate socket. If the client connects through one or more proxies (of any kind: forward, reverse, load balancer, API gateway, TLS offloading, IP access control, etc.), then the server only directly sees the IP address of the final proxy used by the client connection.

In order to pass the original IP address on to the server, there are several headers in common use:

X-Forwarded-For is a list of comma-separated IPs that gets appended to by each traversed proxy¹. The idea is that the first IP (added by the first proxy) is the true client IP. Each subsequent IP is another proxy along the path. The last proxy’s IP is not present (because proxies don’t add their own IPs, and because it connects directly to the server so its IP will be directly available anyway). We’re going to talk about this a lot, so it’ll be abbreviated to “XFF”.
Forwarded is the most official but seemingly least-used header. We look at it in more detail below, but it’s really just a fancier version of XFF that has the same problems that we’re about to discuss.
There are also special single-IP headers like X-Real-IP (Nginx), CF-Connecting-IP (Cloudflare), or True-Client-IP (Cloudflare and Akamai). We’ll talk more about these below, but they’re not the main focus of this post.

## Pitfalls

Before talking about how to use XFF properly, we’ll talk about the many ways that using X-Forwarded-For can go wrong.

### Headers are untrustworthy

First of all, and most importantly, you must always be aware that any XFF IPs that were added (or appear to have been added) by any proxy not controlled by you are completely unreliable. Any proxy could have added, removed, or modified the header any way it wants. The client as well could have initially set the header to anything it wants to get the spoof-ball rolling. For example, if you make this request to an AWS load balancer²…

curl -X POST https://my.load.balanced.domain/login -H "X-Forwarded-For: 1.2.3.4, 11.22.33.44"

…your server behind the load balancer will get this:

X-Forwarded-For: 1.2.3.4, 11.22.33.44, <actual client IP>

And this:

curl -X POST https://my.load.balanced.domain/login -H "X-Forwarded-For: oh, hi,,127.0.0.1,,,,"

…will give you this:

X-Forwarded-For: oh, hi,,127.0.0.1,,,,, <actual client IP>

As you can see, everything already present is just passed through, unchanged and unvalidated. The final, actual IP is just appended to whatever is already there.

(In addition to curl’ing and custom clients, there is also at least one Chrome extension that lets you set the XFF header in browser requests. But how you can set the header doesn’t really matter to us here, only that an attacker can do it.)

### Multiple headers

According to the HTTP/1.1 RFC (2616)³:

Multiple message-header fields with the same field-name MAY be present in a message if and only if the entire field-value for that header field is defined as a comma-separated list [i.e., #(values)]. It MUST be possible to combine the multiple header fields into one “field-name: field-value” pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma. The order in which header fields with the same field-name are received is therefore significant to the interpretation of the combined field value, and thus a proxy MUST NOT change the order of these field values when a message is forwarded.

That applies to XFF, as it is a comma-separated list. This can make getting the rightmost (or even leftmost) IP error-prone.

For example, Go has three ways to get a header value:

http.Header.Get(headerName) returns the first header value as a string.
http.Header.Values(headerName) returns a slice (array) of strings with the values of all instances of the header headerName. (headerName is canonicalized before lookup.)
http.Header is a map[string][]string and can be accessed directly. (The map keys are canonicalized header names.) This is similar to using Values.

So here’s the attack:

Eve makes a request with two spoofed XFF headers.
Your reverse proxy adds Eve’s true IP to the end of the second XFF header, per the RFC requirements.
You call req.Header.Get("X-Forwarded-For") and get the first header. You split it up and take the rightmost.
You have chosen a spoofed IP. You treat it as trustworthy. Bad things result.

Unlike Go, Twisted’s method for getting a single header value returns the last value. (Why is there no standard, common, accepted behaviour for this?) This avoids the above attack, but it can cause a different (less likely) problem: If you’re using the rightmost-ish algorithm (described below), you need to go backwards from the right looking for the first untrusted IP. But what if one of your reverse proxies has added a new header instead of appending (a valid thing to do, per the RFC)? Now the IP that you want is nowhere to be found in the last header – it’s full of trusted reverse proxy IPs and the real IP is in a previous instance of the XFF header.

There might be a subtle, hypothetical attack possible here:

You have (at least) two reverse proxies that you trust.
The second of those reverse proxies doesn’t like super long headers, so it creates a new one rather than appending if the XFF header is too long.
Eve knows this. And she wants to hide her IP from you.
Eve spoofs a long XFF in her request to you.
Your first reverse proxy adds her true IP to the XFF header.
Your second reverse proxy doesn’t like how long that header is, so it creates a new one. The header value is the IP of the first reverse proxy.
Your server software gets the last header and it has only a single IP, belonging to your first reverse proxy.
What does your logic do? Use that IP? Treat it as special because it’s private/trusted? Panic because it’s impossible that this IP should be trusted?

Note that when I tested with a server behind AWS ALB I found that ALB had already concatenated the XFF headers. So that’s good. I have no idea if other reverse proxies do the same, but I bet there’s no real consistency.

The best thing to do is merge all of the XFF headers yourself.
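In Go, merging could look something like this sketch. The function names are hypothetical. It takes every instance of the header in the order received, splits on commas, and trims whitespace; it does not validate that the items are IP addresses.

```go
package main

import (
	"net/http"
	"strings"
)

// splitXFF merges all X-Forwarded-For header values (in received order) and
// splits them into trimmed items. Items are NOT validated here, so the
// result may contain empty strings or garbage.
func splitXFF(headerValues []string) []string {
	var items []string
	for _, v := range headerValues {
		for _, item := range strings.Split(v, ",") {
			items = append(items, strings.TrimSpace(item))
		}
	}
	return items
}

// xffFromRequest uses Header.Values (all instances of the header) rather
// than Header.Get (only the first instance).
func xffFromRequest(r *http.Request) []string {
	return splitXFF(r.Header.Values("X-Forwarded-For"))
}
```

Given the two header values `1.1.1.1, 2.2.2.2` and `3.3.3.3,4.4.4.4`, the merged list has four items and the rightmost is `4.4.4.4`, regardless of which separator style each proxy used.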

(It is worthwhile asking – and checking – to make sure reverse proxies append to the correct header, because appending to the wrong header would wreck the trustworthiness of taking the rightmost. I have only checked AWS ALB and Cloudflare, and they’re doing it right. If anyone discovers something doing it wrong, please let me know.)

[2022-03-04: I created a Go issue arguing for a change to the behaviour of http.Header.Get. Not with any real expectation of a change, but we’ll see.]

### Private IPs

Even in completely non-malicious scenarios, any of the XFF IPs – but especially the leftmost – may be a private/internal IP address. If the client first connects to an internal proxy, it may add the private IP of the client to the XFF header. This address is never going to be useful to you.

### Splitting the IPs

Because X-Forwarded-For isn’t an official standard, there’s no formal spec for it. Most examples show the IP addresses comma-space (", ") separated, but the space isn’t strictly required. (For example, the HTTP/1.1 RFC says that headers like XFF are simply “comma separated”.) Most of the code I looked at splits by just comma and then trims the value, but at least one I found looks for comma-space.

When testing, it looks to me like AWS ALB uses comma-space when adding an IP, but Cloudflare uses only a comma.

### Unencrypted data is always untrustworthy

This should go without saying, but if you’re receiving HTTP-not-S requests, then anyone could have modified the headers before they got to you. This is worth mentioning because an interloper can’t mess with the “rightmost-ish” approach (described below) because they can’t mess with the IP of the final connection from the internet to your reverse proxy or server.

So just encrypt your traffic, okay?

### Other headers (`X-Client-IP`, `True-Client-IP`) might be present and spoofed

Some reverse proxies remove any unexpected or unwanted headers, but some (like AWS ALB) don’t. So an attacker can send headers like X-Client-IP and True-Client-IP straight through to your server. You need to not get fooled into using them if your reverse proxy didn’t specifically set them for you.

### Trying to learn about `X-Forwarded-For`

Trying to educate yourself about XFF is, unfortunately, also difficult.

MDN Web Docs are usually the gold standard for stuff like this, but the page about XFF doesn’t mention these risks at all; it says “the right-most IP address is the IP address of the most recent proxy and the left-most IP address is the IP address of the originating client” with no caveat. The Wikipedia entry is much better: “Since it is easy to forge an X-Forwarded-For field the given information should be used with care. The right-most IP address is always the IP address that connects to the last proxy, which means it is the most reliable source of information.”

[2022-03-09: Created an issue for the MDN documentation. 2022-03-19: I rewrote the page, PR’d it, and the change is live now. You can see a PDF of the original page here. Now to fix the Forwarded page…]

Other sources are similarly variable. Some say nothing whatsoever about the possibility of the header being spoofed or the presence of private addresses (1, 2, 3, 4, 5). Others do a pretty good job of mentioning the risks (6, 7, 8), but sometimes you have to read pretty deeply to get to the warnings.

## Avoiding those pits

Let’s make a few baseline statements:

Using an IP in the private address space as the “real” client IP is never the right choice⁴.
Using a value that’s not actually an IP address is never the right choice.
In the absence of chicanery, the leftmost non-private, non-invalid IP is the closest we can come to the “real” client IP. (Henceforth, “leftmost-ish”.)
The only client IP that we can trust is the first one added by a (reverse) proxy that we control. (Henceforth, “rightmost-ish”.)

The leftmost-ish is usually going to be the most “real”, while the rightmost-ish is going to be the most trustworthy. So which IP should you use? It depends on what you’re going to do with it.
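To make the two choices concrete, here is a rough Go sketch of both. It assumes the XFF headers have already been merged and split into a list, and that “proxies we control” is expressed as a configured list of trusted prefixes; that is one possible approach, not the only one. All function names are hypothetical. The list does not include the address of the final connection.

```go
package main

import "net/netip"

// leftmostIsh returns the first item from the left that is a valid IP and is
// not private, loopback, link-local, or unspecified. It is spoofable.
func leftmostIsh(xff []string) (netip.Addr, bool) {
	for _, item := range xff {
		addr, err := netip.ParseAddr(item)
		if err != nil {
			continue // not an IP at all; skip it
		}
		if addr.IsPrivate() || addr.IsLoopback() ||
			addr.IsLinkLocalUnicast() || addr.IsUnspecified() {
			continue
		}
		return addr, true
	}
	return netip.Addr{}, false
}

// rightmostIsh walks from the right, skipping addresses inside the trusted
// prefixes, and returns the first address that is not trusted.
func rightmostIsh(xff []string, trusted []netip.Prefix) (netip.Addr, bool) {
	for i := len(xff) - 1; i >= 0; i-- {
		addr, err := netip.ParseAddr(xff[i])
		if err != nil {
			// Garbage where a proxy-added IP should be; give up.
			return netip.Addr{}, false
		}
		if !inAny(addr, trusted) {
			return addr, true
		}
	}
	return netip.Addr{}, false
}

// inAny reports whether any of the prefixes contains addr.
func inAny(addr netip.Addr, prefixes []netip.Prefix) bool {
	for _, p := range prefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// parsePrefixes parses configured CIDR strings into prefixes.
func parsePrefixes(cidrs ...string) ([]netip.Prefix, error) {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, err
		}
		out = append(out, p)
	}
	return out, nil
}
```

For the list `spoofed, 10.1.1.1, 1.2.3.4, 5.6.7.8, 10.0.0.2` with `10.0.0.0/8` trusted, `leftmostIsh` picks `1.2.3.4` and `rightmostIsh` picks `5.6.7.8`.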

If you’re going to do something security-related, you need to use the IP you trust – the rightmost-ish. The obvious example here is rate-limiting. If you use the leftmost-ish IP for this, an attacker can just spoof a different XFF prefix value with each request and completely avoid being limited.

Additionally, they might be able to exhaust your server memory by forcing you to store too many individual entries – one for each fake IP. It may seem hard to believe that storing IP addresses in memory could lead to exhaustion – especially if they’re stored in a cache with finite time-to-live, but keep in mind:

The attacker won’t be limited to the 4 billion IPv4 addresses. They can use all the bazillion IPv6 addresses, if the limiter isn’t smart about prefixes.
Since many limiters don’t check for valid IPs, an attacker can use any random string they want.
Also note that these strings can be big; for example, Go’s default header block size limit is 1MB. That means a single random-string “IP” can be almost 1MB. That means adding 1MB of memory usage per request.
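A limiter can blunt those memory-exhaustion vectors by refusing to key on anything that isn’t a valid IP and by grouping IPv6 addresses by prefix. The sketch below uses a hypothetical `limiterKey` function; the /64 prefix length is an assumption chosen for illustration, not a recommendation for every deployment.

```go
package main

import "net/netip"

// limiterKey turns a raw client "IP" string into a bounded-size rate limiter
// key. It returns false for anything that is not a valid IP address, so
// arbitrary (possibly huge) strings never reach the limiter's storage.
func limiterKey(raw string) (string, bool) {
	addr, err := netip.ParseAddr(raw)
	if err != nil {
		return "", false
	}
	// Treat IPv4-mapped IPv6 as IPv4 and drop any zone.
	addr = addr.Unmap().WithZone("")
	if addr.Is4() {
		return addr.String(), true
	}
	// Group IPv6 addresses by /64 (an assumed prefix length).
	p, err := addr.Prefix(64)
	if err != nil {
		return "", false
	}
	return p.String(), true
}
```

With this, `2001:db8:1:2:3:4:5:6` and `2001:db8:1:2::ffff` share the key `2001:db8:1:2::/64`, and a random string is rejected before it can be stored.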

It still won’t be feasible for all attackers and configurations, but it shouldn’t be dismissed without consideration.

Or an attacker can force you to rate-limit/block the IP addresses of other users. They can supply a real – but not their – IP address, and you’ll eventually be fooled into rate-limiting it. (And if you’re using the “real” IP for abuse reports, you might end up complaining about the wrong person.)

The downside to using the rightmost-ish IP for rate-limiting is that you might block a proxy IP that’s not actually a source of abuse but is just used by a bunch of different clients and you would have realized that if you’d just used the leftmost-ish instead. Yeah, well. That doesn’t seem super likely, and it’s still infinitely more acceptable than allowing attackers to trivially bypass your rate limiter and crash your server.

If you’re doing something not obviously security-related… Think hard about your use case. Let’s say you just want to do an IP-geolocation lookup for your stats. Probably the leftmost-ish IP is what you want. The vast majority of your users won’t be doing any header spoofing, and the geolocation of random internet proxies is no good to you, so you’ll probably get the best results with the IP closest to the user.

On the other hand, you might want to consider how many internet-proxy-using users you expect to have. Probably few enough that it won’t hurt your stats if you geolocate the wrong thing. And is there a way an attacker could hurt you by purposely skewing your geo stats? Probably not, but take a moment to really think about it.

So be careful when writing your “GetRealClientIP(request)” function. Make sure it has a big warning comment about how it should be used. Or maybe write two functions: “GetUntrustworthyRealClientIP(request)” and “GetTrustworthyButLessRealClientIP(request)”. Which are horrible names. Maybe just pass a flag instead. Anyway, the point is to prevent the callers of your functions from having any confusion about the nature of the result.

Also be careful when using the results of that function. It’s easy to write code that gets the leftmost-ish IP to do some geo lookup and then later decide you also need to do rate limiting… so you might as well use the same “realClientIP” variable! Oops. This might be a good time to make wrong code look wrong.
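One way to make that mistake fail at compile time is to give the two kinds of IP different types. This is only a sketch: `TrustedIP`, `UntrustedIP`, `rateLimitKey`, and `geoLookupKey` are hypothetical names invented for illustration.

```go
package main

import (
	"fmt"
	"net/netip"
)

// TrustedIP wraps a rightmost-ish IP; UntrustedIP wraps a leftmost-ish one.
// Keeping them as distinct types stops one from being passed as the other.
type TrustedIP struct{ addr netip.Addr }
type UntrustedIP struct{ addr netip.Addr }

func (ip TrustedIP) String() string   { return ip.addr.String() }
func (ip UntrustedIP) String() string { return ip.addr.String() }

// rateLimitKey stands in for security-sensitive code: it only accepts TrustedIP.
func rateLimitKey(ip TrustedIP) string { return "rl:" + ip.String() }

// geoLookupKey stands in for best-effort, non-security code.
func geoLookupKey(ip UntrustedIP) string { return "geo:" + ip.String() }

func main() {
	trusted := TrustedIP{netip.MustParseAddr("203.0.113.7")}
	untrusted := UntrustedIP{netip.MustParseAddr("198.51.100.9")}

	fmt.Println(rateLimitKey(trusted))
	fmt.Println(geoLookupKey(untrusted))

	// rateLimitKey(untrusted) // does not compile: UntrustedIP is not a TrustedIP
}
```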

And remember that the final proxy IP – or the address of the client if it’s connecting directly – is not in the XFF header. You need to look at your request connection information for that. (http.Request.RemoteAddr in Go, the REMOTE_ADDR environment variable for many CGI servers, etc.)

### Algorithms

When reading this, remember that the final proxy IP is not in the XFF list – it’s the RemoteAddr. Also note that RemoteAddr might have the form ip:port, depending on your platform (like it does in Go) – be sure to only use the IP part.
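Stripping the port is easy to get subtly wrong, especially with IPv6, so here is a small sketch of one way to do it in Go. The `remoteIP` helper is a name I made up; it also tolerates a value with no port, on the assumption that some platforms hand you a bare IP.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// remoteIP is a hypothetical helper that extracts the IP from a
// RemoteAddr-style string such as "203.0.113.7:51234" or "[2001:db8::1]:443".
func remoteIP(remoteAddr string) (netip.Addr, error) {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		// No port present: assume the whole value is the host.
		host = remoteAddr
	}
	addr, err := netip.ParseAddr(host)
	if err != nil {
		return netip.Addr{}, err
	}
	// Treat IPv4-mapped IPv6 (::ffff:a.b.c.d) as plain IPv4.
	return addr.Unmap(), nil
}

func main() {
	for _, s := range []string{"203.0.113.7:51234", "[2001:db8::1]:443", "203.0.113.7"} {
		addr, err := remoteIP(s)
		fmt.Println(addr, err)
	}
}
```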

#### First: collect all of the IPs

Make a single list of all the IPs in all of the X-Forwarded-For headers.⁵ Also have the RemoteAddr available.
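In Go, that collection step might look like the sketch below. `collectXFF` is a hypothetical helper name; the important parts are that it is fed every header line (via `http.Header.Values`) and that it splits on a bare comma and trims each item.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// collectXFF flattens every X-Forwarded-For header line into one ordered list.
// Pass it r.Header.Values("X-Forwarded-For"), not r.Header.Get(...).
func collectXFF(lines []string) []string {
	var ips []string
	for _, line := range lines {
		for _, item := range strings.Split(line, ",") {
			// Keep empty or garbage items: dropping them would shift
			// positions when counting from the right.
			ips = append(ips, strings.TrimSpace(item))
		}
	}
	return ips
}

func main() {
	h := http.Header{}
	h.Add("X-Forwarded-For", "6.6.6.6, 203.0.113.7")
	h.Add("X-Forwarded-For", "10.0.0.2")

	fmt.Println(collectXFF(h.Values("X-Forwarded-For")))
}
```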

#### Second: decide what your security needs are

Default to using the rightmost-ish approach. Only use the leftmost-ish if you have to, and make sure you do so carefully.

#### Leftmost-ish: the closest to the “real IP”, but utterly untrustworthy

If your server is directly connected to the internet, there might be an XFF header or there might not be (depending on whether the client used a proxy). If there is an XFF header, pick the leftmost IP address that is a valid, non-private IPv4 or IPv6 address. If there is no XFF header, use the RemoteAddr.

If your server is behind one or more reverse proxies, pick the leftmost XFF IP address that is a valid, non-private IPv4 or IPv6 address. (If there’s no XFF header, you need to fix your network configuration problem right now.)

And never forget the security implications!
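For illustration, here is a rough sketch of the leftmost-ish selection. `leftmostPublic` and `isPublic` are hypothetical names, and the "non-private" test is an approximation built from `net/netip`'s classification methods; it does not cover every special-purpose range.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isPublic is an approximate "valid, non-private" check. It relies only on
// netip's built-in classifications, so some reserved ranges slip through.
func isPublic(addr netip.Addr) bool {
	return addr.IsValid() &&
		!addr.IsPrivate() &&
		!addr.IsLoopback() &&
		!addr.IsLinkLocalUnicast() &&
		!addr.IsUnspecified() &&
		!addr.IsMulticast()
}

// leftmostPublic returns the first entry, scanning from the left, that parses
// as an IP and passes isPublic. The result is attacker-controllable.
func leftmostPublic(xff []string) (netip.Addr, bool) {
	for _, item := range xff {
		addr, err := netip.ParseAddr(item)
		if err != nil {
			continue // skip garbage
		}
		if addr = addr.Unmap(); isPublic(addr) {
			return addr, true
		}
	}
	return netip.Addr{}, false
}

func main() {
	xff := []string{"garbage", "10.0.0.1", "203.0.113.7", "198.51.100.1"}
	fmt.Println(leftmostPublic(xff))
}
```

If nothing in the list qualifies, the caller falls back to the RemoteAddr.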

#### Rightmost-ish: the only useful IP you can trust

If your server is directly connected to the internet, the XFF header cannot be trusted, period. Use the RemoteAddr.

If your server is behind one or more reverse proxies and not directly accessible from the internet, you need to know either the IP addresses of those reverse proxies or the number of them that the request will pass through. We’ll call these the “trusted proxy IPs” and “trusted proxy count”. (Using “trusted proxy IPs” is preferable, for reasons described in the “network architecture changes” section.)

The trusted proxy IPs or trusted proxy count will tell you how far from the right of the XFF header you need to check before you find the first IP that doesn’t belong to one of your reverse proxies. This IP was added by your first trusted proxy and is therefore the only IP you can trust. Use it.

(Notice that I’m not saying “valid, non-private IP” here. It is tempting to do so, just to be extra safe, and I won’t blame you if you do, but if you can’t trust your own reverse proxy to add the proper IP, then you have a bigger problem.)

Again, if you’re behind one or more reverse proxies and there’s no XFF header, you need to immediately figure out how people are connecting to your server directly.
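Here is a sketch of both rightmost-ish variants, operating on the already-collected XFF list. The names `byTrustedCount`, `byTrustedRanges`, and `mustPrefixes` are mine. The count variant assumes the count includes the final proxy (the one whose address is the RemoteAddr), so the IP we want sits that many places from the right end of the list.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// mustPrefixes parses CIDR strings once at startup; it panics on a bad entry.
func mustPrefixes(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c).Masked())
	}
	return out
}

// byTrustedCount returns the entry added by the first trusted proxy, given the
// total number of trusted reverse proxies (including the RemoteAddr one).
func byTrustedCount(xff []string, trustedCount int) (netip.Addr, error) {
	i := len(xff) - trustedCount
	if trustedCount < 1 || i < 0 {
		return netip.Addr{}, errors.New("trusted proxy count does not fit the XFF list")
	}
	return netip.ParseAddr(xff[i])
}

// byTrustedRanges walks from the right and returns the first IP that is not
// inside any trusted proxy range.
func byTrustedRanges(xff []string, trusted []netip.Prefix) (netip.Addr, error) {
	for i := len(xff) - 1; i >= 0; i-- {
		addr, err := netip.ParseAddr(xff[i])
		if err != nil {
			return netip.Addr{}, errors.New("invalid entry in the trusted part of the XFF list")
		}
		addr = addr.Unmap()
		if !isTrusted(addr, trusted) {
			return addr, nil
		}
	}
	return netip.Addr{}, errors.New("no untrusted IP in the XFF list")
}

func isTrusted(addr netip.Addr, trusted []netip.Prefix) bool {
	for _, p := range trusted {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	// The client spoofed 6.6.6.6; proxy one added 203.0.113.7; proxy two added 10.0.0.2.
	xff := []string{"6.6.6.6", "203.0.113.7", "10.0.0.2"}
	fmt.Println(byTrustedCount(xff, 2))
	fmt.Println(byTrustedRanges(xff, mustPrefixes("10.0.0.0/8")))
}
```

Both calls pick 203.0.113.7 and ignore the spoofed leftmost value. Either error case means the request didn't arrive the way your configuration says it should.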

##### Tentative variation: rightmost non-private IP

If all of your reverse proxies are in the same private IP space as your server, I think it’s okay to use the rightmost non-private IP rather than using “trusted proxy IPs” or “trusted proxy count”. This is equivalent to adding all private IP ranges to your “trusted proxy IPs” list.

An example where this doesn’t work is if you’re behind an external reverse proxy service, like Cloudflare – it’s not in your private address space.

## Falling into those pits

Let’s look at real-world examples!

Warning: I got a little carried away here. I was only intending to look at a couple of projects that I was familiar with, but the hit-rate of dangerous-use-of-leftmost was so high that I just kept searching. (And there were some interesting and educational aspects even when it was done right.)

(If a tool or service isn’t mentioned here, it’s because I either didn’t look at it or couldn’t find enough information about it. I included all successes as well as failures.)

### Cloudflare, Nginx, Apache

Let’s start with some good news.

Cloudflare adds the CF-Connecting-IP header to all requests that pass through it; it adds True-Client-IP as a synonym for Enterprise users who require backwards compatibility. The value for these headers is a single IP address. The fullest description of these headers that I could find makes it sound like they are just using the leftmost XFF IP, but the example was sufficiently incomplete that I tried it out myself. Happily, it looks like they’re actually using the rightmost-ish.

Nginx offers a not-enabled-by-default module that adds the X-Real-IP header. This is also a single IP. When properly and fully configured⁶, it also uses the rightmost IP that isn’t on the “trusted” list. So, the rightmost-ish IP. Also good.

Similarly, when configured to look at X-Forwarded-For, Apache’s mod_remoteip picks the rightmost untrusted IP to set into REMOTE_ADDR.

### Akamai

TL;DR: always set “Send True Client IP Header” to “yes” and always set “Allow Clients To Set True Client IP Header” to “no”.

In the original version of this post, I said that “Akamai does very wrong things, but at least warns about it.” However, a reader recently brought to my attention that there’s a separate Akamai setting influencing the behaviour of the True-Client-IP header that improves the situation.

Because a lot of Akamai’s documentation is inaccessible without an account, I’m going to quote copiously, so we all have the same information.

In their document “Difference Between the True-Client-IP header sent by edge server and XForwarded-For header” they say:

Question

What is the difference between X-Forwarded-For and True-Client-IP Headers?

Answer

The X-Forwarded-For header is the default header proxies use to report the IPs that a content passes through. This header can be overwritten by other proxies and by parent edge servers. Because of this, it is not always complete.

The True-Client-IP header contains the connecting public IP seen by edge server contacted for a request. This means that if a client is behind a proxy, the header will log the value of that proxy in the True-Client-IP header.

That header is passed through to an origin. It will not be overwritten by proxy or other edge servers.

True-Client-IP has to be explicitly enabled in the Origin Server behavior section (https://techdocs.akamai.com/property-mgr/docs/origin-server#true-client-ip-header), your Property Manager configuration.

Important

If a header named True-Client-IP is present in the request coming from the client, it will not be overwritten or sent twice. It is not a security feature.

The connecting IP is appended to X-Forwarded-For header by proxy server and thus it can contain multiple IPs in the list with comma as separator. True-Client-IP contains only one IP.

That document was my primary reference when I wrote this, although it has changed considerably – see my previous quoting of it.

It suggests that True-Client-IP will have the actual connecting IP only if there isn’t already a True-Client-IP present (and if it’s enabled). So, it can be trivially spoofed. Not good, but also not completely true – that document fails to mention another setting: “Allow Clients To Set True Client IP Header”.

Here are the relevant contents of “What IP address does True-Client-IP header have?”:

Question

What IP address would True-Client-IP header sent to the origin have?

Are there any conditions to affect True-Client-IP header values?

Answer

If “Send True Client IP Header” option is set to Yes in property configurations, edge server sends True-Client-IP request header by default to the origin server.

If “Allow Clients To Set True Client IP Header” option is set to No in property configurations, edge server sends True-Client-IP header to the origin server, that has an IP address that connected to the edge server such as a proxy server.

If “Allow Clients To Set True Client IP Header” option is set to Yes and a client initiates a request with a True-Client-IP request header, the origin server receives the True-Client-IP header as is.

Here is the summary(Assume that “True Client IP Header Name” is left as True-Client-IP as default in property configurations.):

Send True Client IP Header: No

Client’s request	True-Client-Header the origin receives
Client’s request has a True-Client-Header	The origin receives the header as is.
Client’s request does not have any True-Client-Header	The origin does not receive the header.

Send True Client IP Header: Yes

Allow Clients To Set True Client IP Header	Client’s request	True-Client-Header the origin receives
No	Client’s request has/does not have a True-Client-Header.	The origin receives True-Client-IP header that has a connecting IP to edge server.
Yes	Client’s request has a True-Client-Header.	The origin receives True-Client-IP header that the client sent as is.
Yes	Client’s request does not have any True-Client-Header.	The origin receives True-Client-IP header that has a connecting IP to edge server.

Please refer to following document regarding “Allow Clients To Set True Client IP Header”, “Send True Client IP Header” and “True Client IP Header Name” options:

https://techdocs.akamai.com/property-mgr/docs/origin-server

The page at that last link isn’t login-gated, but the relevant section is short so we’ll quote it for posterity:

True Client IP Header

If you enable the Send True Client IP Header option, edge servers pass the original client IP address to the origin.

Normally, the client IP is passed in the X-Forwarded-For header that is routinely modified by proxies along the way. With this option enabled, the default header name True-Client-IP is used unless you set a custom name for the header in the True Client IP Header Name field. Additionally, with the Allow Clients To Set True Client IP Header toggle you can determine if the client name for this header is passed through and accepted, or whether to apply the value you defined in the True Client IP Header Name field instead.

So, to summarize:

If “Send True Client IP Header” is “no”, then True-Client-IP can be spoofed (and otherwise won’t be present).
If “Send True Client IP Header” is “yes” and “Allow Clients To Set True Client IP Header” is “yes”, then True-Client-IP can be spoofed.
If “Send True Client IP Header” is “yes” and “Allow Clients To Set True Client IP Header” is “no”, then True-Client-IP can’t be spoofed. Thank goodness.

It’s good that there is a configuration that allows for a safe True-Client-IP configuration. (For the record, I haven’t been able to figure out what the defaults are for those settings.)

Things that are less good:

“Allow Clients To Set True Client IP Header” can be set to “yes”.
The “What is the difference between X-Forwarded-For and True-Client-IP Headers?” document doesn’t mention the “Allow Clients To Set True Client IP Header” setting at all.
If “Send True Client IP Header” is set to “no”, the True-Client-IP still gets passed through. We’ll see below that some rate-limiting packages will use that header by default – so you could be using it even if you thought you explicitly disabled it in Akamai.

### Fastly

Fastly adds the Fastly-Client-IP header with a single IP value. I think it’s using the rightmost-ish XFF IP:

Essentially, Fastly-Client-IP is the non-Fastly thing that is making the request to Fastly.

However:

The value is not protected from modification at the edge of the Fastly network, so if a client sets this header themselves, we will use it. If you want to prevent this [you need to do some additional configuration].

So, by default Fastly-Client-IP is trivially spoofable. Again, it seems highly likely that there are a lot of people using its default behaviour for security-related purposes and making themselves vulnerable to attack.

### Azure

Azure Front Door adds the X-Azure-ClientIP and X-Azure-SocketIP headers. They are described like so:

X-Azure-ClientIP: Represents the client IP address associated with the request being processed. For example, a request coming from a proxy might add the X-Forwarded-For header to indicate the IP address of the original caller.

X-Azure-SocketIP: Represents the socket IP address associated with the TCP connection that the current request originated from. A request’s client IP address might not be equal to its socket IP address because the client IP can be arbitrarily overwritten by a user.

So, X-Azure-ClientIP is the leftmost-ish XFF IP and X-Azure-SocketIP is the rightmost-ish.

That’s reasonably good, but I think it could be a lot clearer. The only warning about X-Azure-ClientIP is a subtle hint in the description of the other header. I also hand-wavingly feel that the name of the less-secure header is more appealing than the more-secure one, and is probably leading many people into the wrong choice.

### go-chi/chi

Chi is a Go HTTP router and provides a RealIP middleware and a rate limiter. The RealIP middleware has this comment:

You should only use this middleware if you can trust the headers passed to you (in particular, the two [three, actually] headers this middleware uses), for example because you have placed a reverse proxy like HAProxy or nginx in front of Chi. If your reverse proxies are configured to pass along arbitrary header values from the client, or if you use this middleware without a reverse proxy, malicious clients will be able to make you very sad (or, depending on how you’re using RemoteAddr, vulnerable to an attack of some sort).

Which is a pretty good warning, right? Almost.

Let’s take this opportunity to talk about abusing X-Real-IP, True-Client-IP, etc. For example, AWS ALB “[passes] along arbitrary header values from the client” and, indeed, if you don’t realize that you will end up “very sad”. Because a request like this…

curl -X POST https://my.load.balanced.domain/login -H "X-Forwarded-For:1.1.1.1" -H "X-Real-IP:2.2.2.2" -H "True-Client-IP:3.3.3.3"

…results in your server getting these headers:

X-Forwarded-For: 1.1.1.1, <actual client IP>
X-Real-IP: 2.2.2.2
True-Client-IP: 3.3.3.3

chi.middleware.RealIP’s logic goes like: “use the True-Client-IP; if that doesn’t exist, use the X-Real-IP; if that doesn’t exist, use X-Forwarded-For”. So it falls victim to header spoofing.

But, as we’ve learned, the chi.middleware.RealIP warning also isn’t good enough when it comes to X-Forwarded-For because you can never, ever trust all of it⁷. In the RealIP code, the logic I just paraphrased actually ends with “use the leftmost XFF IP address”. And we’re now suitably scared of using the leftmost XFF IP. (It also does not check that the leftmost “IP” is valid and non-private.)

So chi.middleware.RealIP falls firmly into the “only safe for non-security use” category. And you must be aware of its header preference order and what your reverse proxy does or doesn’t set and let through. In short, it’s hard to recommend it.

Chi’s rate limiter has identical logic for obtaining the IP address and doesn’t have the same warning. So that’s bad, for the reasons we’ve discussed – the non-XFF headers could be spoofed, the XFF header could be spoofed, the IP can be garbage, the rate limiter could be bypassed, your memory could be exploded. The best way to use it is to not use its “real IP” logic and instead write your own “GetTrustworthyClientIP(request)” and pass that to its “rate limit by arbitrary keys” feature.

RealIP and httprate both use Go’s http.Header.Get to get the XFF header. As discussed above, this means that switching to taking the rightmost-ish IP wouldn’t be sufficient, as an attacker could force the wrong header to be used.
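To see why `Get` is the problem, here is a small demonstration of how Go treats a request carrying two X-Forwarded-For header lines. `buildHeader` is just a hypothetical helper for constructing the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// buildHeader creates a header set with one X-Forwarded-For line per argument,
// mimicking a request that arrives with several XFF headers.
func buildHeader(xffLines ...string) http.Header {
	h := http.Header{}
	for _, line := range xffLines {
		h.Add("X-Forwarded-For", line)
	}
	return h
}

func main() {
	h := buildHeader("6.6.6.6", "203.0.113.7")

	// Get returns only the first line, so anything on later lines is invisible.
	fmt.Printf("%q\n", h.Get("X-Forwarded-For"))

	// Values returns every line, in order.
	fmt.Printf("%q\n", h.Values("X-Forwarded-For"))
}
```

If the value you need ends up on a later line, code that relies on `Get` never sees it, no matter which end of the list it counts from.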

Chi’s rate limiter is also the one instance I found of the XFF list being split by comma-space instead of just comma. I think that’s wrong.

[2022-03-03: Disclosed to maintainer via email. 2022-03-04: Maintainer requested that I make an issue.]

### didip/tollbooth

The Tollbooth HTTP rate limiter is better, but you still need to be aware of what it’s doing in order to use it properly.

Its README says the order in which it looks for the “real” client IP address is…

By default it’s: “RemoteAddr”, “X-Forwarded-For”, “X-Real-IP” If your application is behind a proxy, set “X-Forwarded-For” first.

Strangely, that default order isn’t actually the default everywhere. If you call limiter.New() that is the default. But if you call tollbooth.NewLimiter() – “a convenience function to limiter.New” – the order is "X-Forwarded-For", "X-Real-IP", "RemoteAddr". Which is an important difference!

It doesn’t make clear that it doesn’t support CF-Connecting-IP, True-Client-IP, or any other arbitrary header – it just silently skips them if you add them (returning empty string, leading to over-limiting).

This is a general problem with Tollbooth – it “fails open”: if it can’t find an IP to use it doesn’t rate limit. So if the user of the library misconfigures it (by trying to use an unsupported header, or even making a letter-case mistake in a supported one), it won’t rate limit at all and will give no indication of this. In a comment on the PR to fix this I give my opinion on how it should behave, but this is a pretty tricky problem.

Tollbooth also suffers from the “multiple headers” problem – it uses Go’s r.Header.Get. So, even though it’s counting from the rightmost, which is good, it’s doing so with the first header, which is bad.

Speaking of counting from the rightmost… If Tollbooth is configured to count very far from the right (like, lmt.SetForwardedForIndexFromBehind(1000)), then it will effectively take the leftmost IP (it uses a minimum index of zero). In that case, it should be doing “leftmost-ish” IP validation.

There are two more things that bug me about tollbooth’s design. The first is that the ostensible default order has Go’s http.Request.RemoteAddr first. My understanding is that that field should never be empty (for an HTTP server, rather than client), so the rest of the list is guaranteed to be ignored. So why have a list? Also, if your server is behind a reverse proxy, RemoteAddr will be your proxy’s IP, which is useless.

The second thing that bugs me is going to get its very own section…

[2022-03-03: Disclosed to maintainer via email. 2022-03-04: Maintainer created a PR to fix it. Ongoing discussion there.]

### A default list of places to look for the client IP makes no sense

Where you should be looking for the “real” client IP is very specific to your network architecture and use case. A default configuration encourages blind, naïve use and will result in incorrect and potentially dangerous behaviour more often than not.

If you’re using Cloudflare you want CF-Connecting-IP. If you’re using ngx_http_realip_module, you want X-Real-IP. If you’re behind AWS ALB you want the rightmost-ish X-Forwarded-For IP. If you’re directly connected to the internet, you want RemoteAddr (or equivalent). And so on.

There’s never a time when you’re okay with just falling back across a big list of header values that have nothing to do with your network architecture. That’s going to bite you.

Even Tollbooth defaulting to using the rightmost XFF IP can be problematic. If your server is behind two layers of reverse proxies, then you’ll be looking at the IP of your first proxy instead of the client’s IP.⁸ (What will probably happen is that you will rate-limit your proxy almost immediately, nothing gets through, and then you fix your config. But it still would have been better to be forced to think about the correct configuration in the first place.)

So, even though I know it’s not very user friendly, I don’t think that rate-limiting libraries should have any default at all, and instead should require explicit configuration.

### ulule/limiter

Another Go rate limiter middleware. By default it doesn’t look at the XFF header, but if enabled it uses the leftmost XFF IP. The option is called “TrustForwardHeader”, but you can never trust the XFF header. So it falls victim to rate limit escape, etc.

It also uses Go’s http.Header.Get, so if it switches to rightmost-ish it will need to change how it gets the XFF header.

When TrustForwardHeader is true it first looks for XFF and then falls through to X-Real-IP and finally uses RemoteAddr. But as we just saw, “a default list of places to look for the client IP makes no sense”.

It returns net.ParseIP(ip) rather than just the raw string. This seems good at first, but net.ParseIP returns nil if the parse fails. So in the case of a garbage string, ulule/limiter doesn’t check for the failure, returns nil, and then, as far as I can tell, uses "" as the “IP” key. (I’m surprised that it doesn’t panic, but I don’t think it does.) The way this logic works makes memory exhaustion more difficult, but it might be achievable using valid IPv6 addresses.
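The fix for that particular gap is small: treat a nil result from `net.ParseIP` as an error instead of passing it along. A minimal sketch, with `parseClientIP` as a hypothetical wrapper name:

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// parseClientIP wraps net.ParseIP and turns its nil-on-failure result into an
// explicit error, so garbage can't silently become a limiter key.
func parseClientIP(s string) (net.IP, error) {
	ip := net.ParseIP(s)
	if ip == nil {
		return nil, errors.New("not a valid IP address")
	}
	return ip, nil
}

func main() {
	for _, s := range []string{"203.0.113.7", "definitely not an IP"} {
		ip, err := parseClientIP(s)
		fmt.Println(ip, err)
	}
}
```

What the caller does with the error (reject the request, fall back to RemoteAddr) is a policy decision that depends on your setup.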

[2022-03-04: Disclosed to maintainer via email. 2022-03-05: PR has been created with fixes (mostly documentation warnings). It’s public, so I’m un-redacting this. 2022-03-17: The PR was merged. I don’t love the changes, but it sounds like the next major version will address the shortcomings.]

### sethvargo/go-limiter

This is yet another Go rate limiter middleware. If its httplimit.IPKeyFunc is configured to look at the X-Forwarded-For header (which is given as an example in its comment), it will use the whole header as the rate limit key. That’s almost worse than taking the leftmost IP.

The way to work around this would be to avoid its “real IP” logic and create your own KeyFunc that extracts the correct IP for your network architecture.

If the library can’t find the configured header(s), it falls through to RemoteAddr. But, again, I don’t think default fallbacks are good.

The library also uses Go’s http.Header.Get().

[2022-03-04: Disclosed to maintainer via email. 2022-03-05: Maintainer indicated by email that I could un-redact this.]

### [REDACTED]

Pending disclosure

### Let’s Encrypt

It looks like Let’s Encrypt is using Nginx with X-Real-IP. If its configuration is good (I don’t think the config files are in GitHub), then it should be using rightmost-ish.

### [REDACTED]

Pending disclosure

### Jetty

It looks like the Jetty web server uses the leftmost XFF IP address. I haven’t dug far enough in to see what it’s used for (or how it’s exposed), but that’s a dangerous start.

### Express

Express is a NodeJS web framework. The default configuration ignores the XFF header, but it’s possible to configure it to use the leftmost or a rightmost-ish XFF IP. The setting that uses the leftmost has this warning:

When setting to true, it is important to ensure that the last reverse proxy trusted is removing/overwriting all of the following HTTP headers: X-Forwarded-For, X-Forwarded-Host, and X-Forwarded-Proto otherwise it may be possible for the client to provide any value.

So that’s pretty good. (But still a footgun that I’m sure someone will fall victim to.)

### Traefik

Traefik is a “cloud native network stack”. Its rate limiter is configured to use the “trusted proxy count” version of rightmost-ish. So that’s good.

### phpList

phpList is an “open source newsletter and email marketing software”. It uses the leftmost XFF IP. I can’t quite tell what it’s used for, but it’s something surrounding login.

### IIS

I couldn’t find anything to suggest that Microsoft IIS processes the XFF header, but an official support blog post entitled “How to use X-Forwarded-For header to log actual client IP address?” says:

If you see multiple IP addresses in X-Forwarded-For column, it means the client went through more than one network device. Each network device adds their own IP to the end of the value. The left-most IP address is the actual client IP address. Others belong to network devices the client go through.

Which is a woefully incomplete statement. I fear for the 97,641 people who read that post.

### Tor

Tor is an anonymity network. They have recently realized that they have a control server that is both directly connected to the internet and behind a reverse proxy and they are using XFF to give them trustworthy IPs, so they’re vulnerable to spoofing. It looks like they’re working towards limiting the public-ness of the interface, or doing better verification of CDN connections, or both.

(Because I looked into it, I’ll mention that it looks like they’re not falling victim to the “multiple headers” pitfall. It looks like they use Twisted and call request.getHeader to get the XFF value. The Twisted source for that method indicates that it returns the last matching header. That could cause problems if you need the Nth-from-the-right header, but I think it’s fine in this case.)

### gorilla/handlers.ProxyHeaders

[Section added 2022-03-27. u/Genesis2001 asked about this on Reddit, so I looked at the code and figured I should add some comments here.]

Gorilla is a Go web toolkit. It’s most known for its router, gorilla/mux. It has a ProxyHeaders middleware for handling XFF (that is intended for general consumption, not just for gorilla/mux users).

ProxyHeaders (source) is deficient in a number of ways, but at least it has a warning that the user’s first reverse proxy must strip out the headers being checked before adding them back in. So it’s good that it has that warning, but that requirement means that a) it won’t be usable for a lot of users, and b) it will be misused by a lot of users.

Let’s touch on the problems that are legitimately mitigated by stripping the headers:

It’s taking the leftmost XFF and Forwarded values.
It’s checking X-Forwarded-For and then X-Real-IP and then Forwarded. So it has the “default list” problem.
It’s using X-Forwarded-Host to replace r.Host. So that’s a new spoofable thing.
It’s using X-Forwarded-Proto-then-X-Forwarded-Scheme-then-Forwarded to replace r.URL.Scheme. Another new spoofable thing.

It’s using comma-space to parse X-Forwarded-For, contrary to RFC 2616. So it can set r.RemoteAddr to an IP like “1.1.1.1,10.1.1.1”, etc. It’s also not trimming the result, and I think the LWS rules of RFC 2616 mean that there can be spaces before the comma, so ProxyHeaders can also end up with strings like "1.1.1.1 " (note the trailing space).
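The difference between the two splitting approaches is easy to show. In this sketch, `splitCommaSpace` mimics the fragile behaviour described above and `splitComma` is the split-then-trim alternative; both names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// splitCommaSpace mimics the fragile approach: split on ", " and don't trim.
func splitCommaSpace(v string) []string {
	return strings.Split(v, ", ")
}

// splitComma splits on a bare comma and trims whitespace around each item.
func splitComma(v string) []string {
	items := strings.Split(v, ",")
	for i := range items {
		items[i] = strings.TrimSpace(items[i])
	}
	return items
}

func main() {
	for _, v := range []string{"1.1.1.1,10.1.1.1", "1.1.1.1 , 10.1.1.1"} {
		fmt.Printf("%q\n", splitCommaSpace(v)[0])
		fmt.Printf("%q\n", splitComma(v)[0])
	}
}
```

For both inputs the comma-space version yields a first item that isn't a valid IP, while the split-then-trim version yields "1.1.1.1".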

It also doesn’t support any single-IP headers besides X-Real-IP, which limits its general utility.

(And I keep wondering why stripping XFF at the first reverse proxy makes sense. If you have that much control – e.g., if you’re using Nginx – you should instead just set X-Real-IP and let XFF behave the way it’s intended to.)

[2022-03-27: Created issue and pull request.]

### Quickies

[2022-03-29: Added this section.]

I keep finding more examples of code making mistakes. They help to drive home the points I’m trying to make, but they would bloat this post if I gave a section to every one. Instead I’m going to write some short bullets.

sebest/xff: Claims to be a Forwarded (RFC 7239) parser but is instead an XFF parser. It uses a leftmost-non-private algorithm, but doesn’t document the risks (though there is a 6-year-old issue). The private IP ranges are incomplete. (Created issues.)
pbojinov/request-ip: Falls back between XFF and many single-IP headers. Uses leftmost for XFF. Has no warning about dangers. Claims to support Forwarded, but just returns the whole value. Has issues and PRs to address some of this, but project seems abandoned.
mo7zayed/reqip: Based on the previous one, with all the same shortcomings.
stanvit/go-forwarded: Takes the absolute rightmost, which is better than leftmost but still not right for all configurations. (Created an issue.)

## Advanced and theoretical pitfalls and attacks

I’ve talked a lot about two attacks on rate limiters: avoiding being limited and exhausting server memory. I’ve done this because rate limiters are what led me to this topic and because causing a map of IPs to fill memory was an obvious danger in many implementations.

But rate limiters are only one “security-related” use of X-Forwarded-For, and there are more, cooler possibilities for badness! They’re harder to find or reproduce, but they should be fun to speculate on…

### Server behind reverse proxy and directly connectable

This was briefly mentioned in the “algorithms” section, but is worth repeating.

Generally speaking, if your server is behind one or more reverse proxies, there are one or more rightmost IPs in the XFF header that you can trust. The “rightmost-ish” algorithm is predicated on that. But if your server can also be connected to directly from the internet, that is no longer true.

With some experimentation, an attacker can craft an XFF header to look exactly like the one you expect from your reverse proxy:

Attacker gets her IP limited/blocked by your server.
Attacker crafts XFF header so that the rightmost of it has different IPs in the private space, and different counts of those IPs.
Continue until the limit/block unexpectedly disappears.

Now you’re using an untrusted XFF IP and don’t realize it. Rate limiter escape, memory exhaustion, etc.

One way to mitigate this is to check the RemoteAddr to make sure it belongs to your reverse proxy before you try to use the XFF header.
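A sketch of that check follows. The `proxySet` type and its `mayUseXFF` method are hypothetical names, and the 10.0.0.0/8 range in the usage is only an example of where your proxies might live.

```go
package main

import (
	"fmt"
	"net/netip"
)

// proxySet holds the ranges your reverse proxies connect from.
type proxySet []netip.Prefix

// newProxySet parses CIDR strings once, at startup.
func newProxySet(cidrs ...string) (proxySet, error) {
	set := make(proxySet, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, err
		}
		set = append(set, p.Masked())
	}
	return set, nil
}

// mayUseXFF reports whether the connection's RemoteAddr ("ip:port") belongs to
// a known proxy. If it doesn't, ignore XFF entirely and use the RemoteAddr.
func (s proxySet) mayUseXFF(remoteAddr string) bool {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return false
	}
	addr := ap.Addr().Unmap()
	for _, p := range s {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	proxies, err := newProxySet("10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	fmt.Println(proxies.mayUseXFF("10.1.2.3:4000"))    // came through a proxy
	fmt.Println(proxies.mayUseXFF("203.0.113.7:4000")) // direct connection
}
```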

### Re-fronting attack

Thanks to Ryan Gerstenkorn for sending me his blog post about this.

If a) your backend is in-house or otherwise not directly externally accessible, and b) it’s fronted by a CDN, and c) you trust the IP addresses/ranges of your CDN, then you may be vulnerable to another class of attack.

With AWS CloudFront, it’s possible for an attacker to create a distribution that points to your origin. Now requests are coming to your origin from trusted IPs, but from a distribution not owned by you. But the real beauty/horror of this is that the attacker can use Lambda@Edge to modify the Host header so that you can’t tell that a different hostname was used to access your origin, and can also modify the X-Forwarded-For header to be whatever the attacker wants.

So your “trusted” reverse proxy IPs become untrusted and can lie to you about the client IP. This can be used to bypass your rate limiter, IP-based access control, etc.

The proper way to address this is to also verify that it’s your CDN distribution talking to you. This will usually involve a shared secret or client certificate.
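Here is a rough sketch of the shared-secret variant in Go. The header name `X-Origin-Secret` and both function names are hypothetical; the assumption is that you have configured your own distribution to add that header to every origin request, and that the secret is stored and rotated sensibly.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
)

// originSecretHeader is a hypothetical header name. The assumption is that
// your CDN distribution adds it to every request it sends to the origin.
const originSecretHeader = "X-Origin-Secret"

// validOriginSecret compares the presented secret with the expected one in
// constant time. An empty expected secret never matches.
func validOriginSecret(presented, expected string) bool {
	if expected == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(presented), []byte(expected)) == 1
}

// requireOriginSecret rejects any request that lacks the shared secret, before
// any XFF-based logic in next gets a chance to run.
func requireOriginSecret(expected string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !validOriginSecret(r.Header.Get(originSecretHeader), expected) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(validOriginSecret("s3cret", "s3cret")) // true
	fmt.Println(validOriginSecret("", "s3cret"))       // false: header missing
}
```

A request arriving through someone else’s distribution comes from the same trusted IPs but can’t present the secret, so it is rejected before the XFF header is ever consulted.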

Note that Gerstenkorn verified that this works for AWS CloudFront. I checked Cloudflare and found that it doesn’t work there: Cloudflare’s “Transform Rules” won’t let you “set” the XFF header, and if you “delete” the header, only the pre-existing header is deleted and a new one is added with the actual IP. And it’s similar when attempting to leverage Workers – the actual client IP is still appended to the XFF header after any other manipulation.

Always do strong verification of your CDN! Maybe there are other Cloudflare headers that are important to you and aren’t as protected as XFF. Or maybe you’re using some other CDN. And this general class of attacks might apply to third-party WAFs, etc., depending on how they’re configured.

### Many trusted reverse proxy IPs

This is more of a “difficulty” than a pitfall or attack, but this is as good a place as any to fit it in.

In the rightmost-ish algorithm, the “trusted proxy IPs” list might be very large and might involve ranges rather than specific IPs. For example:

There might be a pool of reverse proxies that feed into your server.
That pool might scale out with load. So your trusted IPs will have to include whatever range that pool uses.
You might use a large external service, like Cloudflare. So you’ll need to “trust” all of their very large set of IPs.

Especially in that last case – and especially if you’re accepting IPv6 connections – you can’t just have a big list of single IPs that you check. You’ll need to check a list of ranges.
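As an illustration of checking ranges rather than single IPs, here is a sketch of the rightmost-ish walk against a list of CIDR prefixes, using Go’s `net/netip`. The type and function names are invented for this example, and the ranges shown are placeholders rather than any real provider’s list.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// trustedRanges is a list of CIDR prefixes covering every trusted proxy.
type trustedRanges []netip.Prefix

// newTrustedRanges parses CIDR strings such as "10.0.0.0/8".
func newTrustedRanges(cidrs ...string) (trustedRanges, error) {
	out := make(trustedRanges, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, err
		}
		out = append(out, p.Masked()) // normalize away any host bits
	}
	return out, nil
}

func (tr trustedRanges) contains(addr netip.Addr) bool {
	addr = addr.Unmap()
	for _, p := range tr {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// rightmostUntrusted walks the XFF value from the right and returns the first
// IP that is not inside a trusted range. It gives up at the first entry that
// does not parse, rather than continuing leftward into attacker-supplied data.
func (tr trustedRanges) rightmostUntrusted(xff string) (string, bool) {
	parts := strings.Split(xff, ",")
	for i := len(parts) - 1; i >= 0; i-- {
		addr, err := netip.ParseAddr(strings.TrimSpace(parts[i]))
		if err != nil {
			return "", false
		}
		if !tr.contains(addr) {
			return addr.String(), true
		}
	}
	return "", false
}

func main() {
	// Placeholder ranges; a real list might hold a CDN's published prefixes.
	tr, err := newTrustedRanges("10.0.0.0/8", "2001:db8::/32")
	if err != nil {
		panic(err)
	}
	ip, ok := tr.rightmostUntrusted("192.0.2.66, 203.0.113.9, 10.0.0.5, 2001:db8::1")
	fmt.Println(ip, ok) // 203.0.113.9 true
}
```

A linear scan is fine for a handful of prefixes. With a very large list you would probably want a prefix trie or similar, which is beyond this sketch.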

### Network architecture changes

So, you’ve set everything up perfectly. Your configuration is exactly right and you’re picking the correct “client IP” every time. Great. It runs quietly and flawlessly for so long that you forget all about it.

And then you change your network architecture.

The scenarios with the less-bad result are when you’re using the rightmost-ish approach and you add a new level of reverse proxy.

You were accepting connections directly from the internet, so you were using RemoteAddr for rate-limiting. Then you added a load balancer. Now you’re rate-limiting your load balancer.
You were using a single reverse proxy. You were using the rightmost XFF IP – the one that gets added by that proxy – for rate-limiting. Then you added another level of reverse proxy. Now you’re rate-limiting one of your reverse proxies (whichever is first in the chain).
You had a complex setup of internal reverse proxies. You were rate-limiting by rightmost-ish XFF IP, with your whole internal IP range on the “trusted proxy” list. Then you added Cloudflare in front of it all. Now you’re rate-limiting Cloudflare.

Those are “less bad” because they don’t introduce security flaws, but they’re still going to result in a near-complete inability to process requests.

The “much more bad” scenarios tend to occur when removing reverse proxy levels, and introduce vulnerabilities that you won’t notice.

You were using a single reverse proxy. You were using the rightmost XFF IP – the one that gets added by that proxy – for rate-limiting. Then you decide that you don’t need that extra proxy level and instead you connect your server directly to the internet. But now no part of the XFF is trustworthy and you’re vulnerable to spoofing.
You were using Cloudflare in front of AWS ALB and relying on its CF-Connecting-IP header. Then you decide to save some money, drop Cloudflare, and now have requests go directly to ALB. So now you’re vulnerable to CF-Connecting-IP spoofing.
You were using two levels of reverse proxy in front of your server. You were using the rightmost-ish XFF IP with a “trusted proxy count” of 2 for rate-limiting – so you were always taking the second-from-the-right IP. You decide to remove a level of reverse proxy. Now you’re vulnerable to spoofing because the second-from-the-right XFF IP is no longer trustable.

Depending on your logic, it might take an attack before you notice that you’re vulnerable.
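To make the last of those scenarios concrete, here is a small Go sketch of the “trusted proxy count” selection. The function name and the example addresses are mine; the point is only to show how the same count picks a different, attacker-controlled entry once a proxy level is removed.

```go
package main

import (
	"fmt"
	"strings"
)

// ipByTrustedCount returns the XFF entry that sits count positions from the
// right end of the list (count == 1 is the rightmost entry).
func ipByTrustedCount(xff string, count int) (string, bool) {
	parts := strings.Split(xff, ",")
	i := len(parts) - count
	if count < 1 || i < 0 {
		return "", false
	}
	return strings.TrimSpace(parts[i]), true
}

func main() {
	// Two proxies: the client sent a spoofed "192.0.2.66", proxy 1 appended
	// the real client IP, and proxy 2 appended proxy 1's IP.
	before := "192.0.2.66, 203.0.113.7, 10.0.0.1"
	ip, _ := ipByTrustedCount(before, 2)
	fmt.Println(ip) // 203.0.113.7 (the real client)

	// One proxy removed, count still 2: the remaining proxy appends the real
	// client IP, so second-from-the-right is now the spoofed value.
	after := "192.0.2.66, 203.0.113.7"
	ip, _ = ipByTrustedCount(after, 2)
	fmt.Println(ip) // 192.0.2.66 (attacker-chosen)
}
```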

The takeaway here is pretty obvious – “when network architecture changes, configuration that depends on that architecture also needs to change” – but it can be very difficult to remember to update every dependent config file when you tweak something that “should” work and does appear to work. (Maybe there are fancy network-management tools that automatically push config changes in response to architecture changes? Would be cool.)

The “trusted proxy count” variation of the rightmost-ish algorithm is especially brittle to adding or removing reverse proxies. The “trusted proxy list” (especially with a big internal range) or “rightmost non-private IP” variations can better cope with changing the number of internal proxies (as long as you’re not going to zero).

#### Even worse: Third-party network architecture changes

Take a look at the Cloudflare IP list. Notice down at the bottom that there have been two times where Cloudflare removed IPs from the list.

Imagine you had those IPs on your trusted list. Imagine you didn’t realize they were removed. Now who owns those IPs? Whoever it is can put whatever they want in the XFF, pass it on to your trusted proxy chain, and you’ll use it as the “real” IP. Rate limit escape and memory exhaustion.

…The answer to that question is that Cloudflare still owns the IPs (I checked ARIN). But the point isn’t about Cloudflare and those particular IP ranges. Any CDN or reverse proxy service with a trusted IP list could change their list and cause problems.

(Edit: A reader shared the email Cloudflare sent when they most recently changed their IP set. He pointed out that even though Cloudflare still owns the IPs, they should be considered untrusted. The email mentions that there’s an API to get Cloudflare’s IPs, which is good. Anyway, my original point wasn’t just about Cloudflare, so it remains unchanged.)

### `X-Forwarded-For` parser mismatch

This is inspired by JSON interoperability vulnerabilities. These occur when different levels of code or architecture interpret JSON in different ways. So if the JSON parser at one level deals with, say, duplicate object keys by taking the first key and another level deals with it by taking the last key, you can have a problem. (E.g., an attacker passes a "username" value along with a matching password, but then also passes another "username" value. If your auth check uses the first username and the business logic uses the second, you’re going to access the wrong user data.)

If there’s one thing that’s certainly true of the XFF header it’s that there’s an abundance of ways of “interpreting” it. Let’s recap some of them and add a fun new one:

Which position? Leftmost? Rightmost? Nth-from-rightmost? And so on.
What value is acceptable? Don’t validate at all? Only IPs? Only non-private IPs?
How to split up the list of IPs? Comma? Comma-space?
How to handle multiple XFF headers?
New: How to handle weird characters? What if there’s a null byte? Or some other control character? What if there’s a UTF-8 sequence?

Any difference in the answers to any of those question marks can result in a mismatch between parsers.
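Here is a contrived Go sketch of two such answers disagreeing. Both parsers are hypothetical: one reads only the first XFF header line and splits on comma-space, the other joins every XFF line and splits on bare commas. Given the same request, they pick different “rightmost” values.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// rightmostNaive mimics a parser that looks only at the first XFF header
// line and assumes entries are separated by ", ".
func rightmostNaive(lines []string) string {
	if len(lines) == 0 {
		return ""
	}
	parts := strings.Split(lines[0], ", ")
	return parts[len(parts)-1]
}

// rightmostCareful joins every XFF header line in order, splits on commas,
// and trims surrounding whitespace.
func rightmostCareful(lines []string) string {
	if len(lines) == 0 {
		return ""
	}
	parts := strings.Split(strings.Join(lines, ","), ",")
	return strings.TrimSpace(parts[len(parts)-1])
}

func main() {
	h := http.Header{}
	h.Add("X-Forwarded-For", "192.0.2.66,198.51.100.2") // no space after comma
	h.Add("X-Forwarded-For", "203.0.113.7")             // second header line
	lines := h.Values("X-Forwarded-For")

	fmt.Printf("%q\n", rightmostNaive(lines))   // "192.0.2.66,198.51.100.2"
	fmt.Printf("%q\n", rightmostCareful(lines)) // "203.0.113.7"
}
```

Neither parser validates what it returns, which is its own answer to one of the questions above. The naive one hands back something that isn’t even an IP.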

I wish I had a slam-dunk example scenario for this, but I don’t. Here are some hand-wavy ones:

You block access to your service to requests from, say, Antarctica. You have a reverse proxy at one level that grabs an XFF IP and checks that. At another level, you have a reverse proxy that grabs a different XFF IP and collects geolocation statistics. You get confused about why you seem to have users connected from Antarctica. (One of them is doing it wrong, but this isn’t enough to tell you which.)
At one reverse proxy level, you check a user’s incoming IP address against your DB to make sure it’s acceptable for that user. At another reverse proxy level, you update that DB. If there’s a mismatch, you’ll end up too permissive, too restrictive, or both.
More generally… At one reverse proxy level you use the XFF header to determine the client’s IP. Allowing the request to proceed is an attestation that the client IP is acceptable for further processing. At a later reverse proxy level, the client IP is again derived from the XFF header and treated as trusted data because it is implicitly attested to by the previous level.⁹ A difference between the two levels in XFF parsing introduces a vulnerability.

## RFC 7239: Forwarded HTTP Extension, June 2014

After considering comma-vs-comma-space and other parser mismatches, you’re probably thinking, “There should be an RFC to concretely specify this.” And there is, kind of.

RFC 7239 specifies the Forwarded header. Its purpose is to replace and improve X-Forwarded-For. The big thing it addresses is that X-Forwarded-For (a list of client and proxy IP addresses), X-Forwarded-Host (the hostname requested by the client), X-Forwarded-Proto (the protocol used by the client; e.g., “https”), and X-Forwarded-By (the proxy IP address) are all separate-but-related headers. It becomes very easy for an intermediary proxy to mess up the relationship between them. To address this, the Forwarded header includes all that information in a single list.

Does the Forwarded header fix the security issues that X-Forwarded-For has? Not even a little. It can be misused in exactly the same ways that XFF can.

Does it have wide adoption? Not that I can see. It gets mentioned in documentation occasionally, but I don’t remember it actually being checked in any of the code I read.

Okay, does the RFC at least make clear how it should be used and not be used? Well, there is this section:

##### 8.1. Header Validity and Integrity

The “Forwarded” HTTP header field cannot be relied upon to be correct, as it may be modified, whether mistakenly or for malicious reasons, by every node on the way to the server, including the client making the request.

One approach to ensure that the “Forwarded” HTTP header field is correct is to verify the correctness of proxies and to whitelist them as trusted. This approach has at least two weaknesses. First, the chain of IP addresses listed before the request came to the proxy cannot be trusted. Second, unless the communication between proxies and the endpoint is secured, the data can be modified by an attacker with access to the network.

And that’s it.¹⁰

That warning is strictly true, but it’s not very helpful and could be clearer. Would you read those five sentences and then think, “Now I thoroughly understand the danger! It’s perfectly clear to me how to use this header in a secure manner.”? I wouldn’t.

I feel like it should be the responsibility of this RFC not only to specify how to create the header but also how to correctly consume it.

(Bonus: The RFC adds a variation on IPv6 parsing – quotation marks: “Note that as “:” and “[]” are not valid characters in “token”, IPv6 addresses are written as “quoted-string”.” E.g., "[2001:db8:cafe::17]:4711".)
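To give a feel for what that quoting means for a parser, here is a small Go sketch that extracts the IP from a single `for=` value. The function name is made up, and this is nowhere near a full RFC 7239 parser: it assumes the value has already been split out of the header, and it ignores quoted-string escapes and obfuscated identifiers.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseForNode extracts the IP from one Forwarded "for=" value, which may be
// a bare IPv4 address, or a quoted string holding an IPv4 or bracketed IPv6
// address with an optional port. Anything else (e.g. "unknown") is rejected.
func parseForNode(v string) (string, bool) {
	v = strings.Trim(v, `"`) // drop the quoted-string quotes, if any

	// "ip:port" and "[ip6]:port" forms.
	if ap, err := netip.ParseAddrPort(v); err == nil {
		return ap.Addr().String(), true
	}

	// Bare address, possibly a bracketed IPv6 address without a port.
	v = strings.TrimSuffix(strings.TrimPrefix(v, "["), "]")
	if a, err := netip.ParseAddr(v); err == nil {
		return a.String(), true
	}
	return "", false
}

func main() {
	fmt.Println(parseForNode(`"[2001:db8:cafe::17]:4711"`)) // 2001:db8:cafe::17 true
	fmt.Println(parseForNode("192.0.2.43"))                 // 192.0.2.43 true
	fmt.Println(parseForNode("unknown"))                    // (empty) false
}
```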

## Conclusions

Well, that was exhausting. I didn’t start out with the intention of writing a dissertation on this godforsaken header.

I have avoided giving this definition until now, just in case anyone skims the article and sees it, but I think we’re ready for it:

`X-Forwarded-For: <client>, <proxy1>, <proxy2>`

That’s what you’ll see on basically every page that describes the header. Is it any wonder that misuse of X-Forwarded-For is so prevalent?

Let’s summarize some of the things we’ve learned, the wisdom we’ve gained, and the opinions we’ve formed:

Danger on the left, trust on the right. There are, maybe, some situations where the leftmost-ish XFF IP can be used, but only very carefully. If there’s any doubt, use the rightmost-ish.
Any header – or any part of any header – that wasn’t set by your own reverse proxy is fundamentally untrustworthy.
Setting a special header to a helpful value but also just letting it through if it’s already set is a terrible idea. (See: Akamai, Fastly.)
If there’s no universally sane default, there should be no default. And “get the real IP” is a case where there’s no sane default. (See: Chi, Tollbooth.)
Defaults should not be insecure. There shouldn’t need to be extra configuration to avoid shooting yourself in the foot. (See: Fastly.)
If you know a function or value is dangerous (spoofable, etc.), put that in your documentation for it in big red letters. Don’t just coyly hint at it. (See: Azure, etc.)
Good specifications (i.e., RFCs) should tell you how to consume a value, not just how to produce it. And if there are different ways to consume that value that make sense in different situations, it should give you the necessary information – with sufficient clarity – to help you make that choice. A reference implementation would also help. [2022-03-24: I wrote a library that I hope will become that reference implementation.]
Inconsistency in security implementations is bad. Pick a tool or cloud service that I didn’t cover here to check for XFF behaviour. Can you guess beforehand what you’ll find with any certainty? You can’t, and that’s bad.
If you’re creating a security tool, product, or feature, you need to have as thorough an understanding of the problem space as possible. Even if you start with solid off-the-shelf components (like a token bucket library and expirable cache for a rate limiter), you still need to be fully aware of, for example, the nature and meaning of a header that’s being used as input. Otherwise you run the risk of introducing a new vulnerability (or two, etc.).
If you’re taking untrusted input – like a header value – and putting it into an ever-growing, pretty-long-lived, in-memory structure (like a rate limiter map), you need to be incredibly careful. (Especially if that structure isn’t behind a rate limiter and instead is the rate limiter.)
Whenever possible, read the code for your dependencies. It’s hard and a ton of work, but bad security surprises can be worse.

I have avoided saying that you should only use the rightmost-ish XFF IP and never, ever the leftmost. But, seriously, just don’t use it.

## Discussion

Comment and discuss at Hacker News.

There have been some interesting comments on HN and Reddit. I’ll share some tidbits here.

### Rust’s `HeaderMap::get` returns first

HN commenter scottlamb pointed out that Rust’s method to return a single header value also returns the value of the first such header. The commenter checked and discovered that they were using it wrong.

### AWS ELB/ALB has an option to make XFF even worse

HN commenter nickjj brought the AWS ELB/ALB “client port preservation” option to my attention. If enabled, the client port number is appended to the IP added to XFF. Turning that option on will a) violate the de facto standard form of the header, and b) mess up a lot of IP parsing code.

(And if the IP suddenly starts failing, then what? Does the rate limiter logic keep moving to the left until it finds a good IP? If done wrong, that could lead to using untrusted values. Does the rate limiter instead give up? And do what? Fail open? Fail closed? Panic? In a comment on the didip/tollbooth PR for this I talk more about this.)

### Consider a custom single-IP header, if you can

HN commenter terom said:

highly recommended to just override the entire XFF header with a single value at the appropriate point in your stack, if at all possible

Which is good advice and I didn’t really say in the post. If you have the ability to use one of the “good” single-IP headers, or add your own at your first proxy, that’s much better than messing around with XFF.

(The reason I didn’t really talk about the custom header is like: I was mostly writing for people who are trying to use what’s available rather than doing a lot of proxy tinkering. Or something.)

### Go’s `net/http/httputil.ReverseProxy` XFF behaviour being re-examined

Right now, httputil.ReverseProxy appends the client IP to the XFF header. It looks like they are considering either replacing the existing XFF header by default or adding options to append to, overwrite, or preserve the existing header.

My gut feeling is that the initial more-knobs-to-turn suggestion in the issue is better than the limited-and-awkward thing it seems to be turning into. (I guess I’ll express my opinion there.)

### Thoughts on overwriting the XFF header

After being prompted to look at httputil.ReverseProxy and Caddy, I started thinking more about overwriting the X-Forwarded-For header (which they both do). I had previously given the idea only a footnote⁷, but it deserves more consideration.

The idea is this: In a multi-reverse-proxy scenario, the first proxy replaces any existing XFF header(s) with one containing only RemoteAddr. All subsequent proxies (configured to trust the previous proxies) then append to the fresh XFF header.
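Sketched in Go, with hypothetical function names and example addresses, the two proxy roles look something like this. It only models the header value as a string; a real proxy would also have to deal with multiple XFF header lines.

```go
package main

import (
	"fmt"
	"net"
)

// hostOnly strips the port from an "ip:port" RemoteAddr-style string.
func hostOnly(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr // no port present; use as-is
	}
	return host
}

// overwriteXFF is what the first proxy does: ignore whatever XFF value came
// in and start a fresh one containing only the peer address.
func overwriteXFF(incoming, remoteAddr string) string {
	_ = incoming // deliberately discarded
	return hostOnly(remoteAddr)
}

// appendXFF is what every later proxy does: add its peer to the existing list.
func appendXFF(incoming, remoteAddr string) string {
	if incoming == "" {
		return hostOnly(remoteAddr)
	}
	return incoming + ", " + hostOnly(remoteAddr)
}

func main() {
	// The client connects from 203.0.113.7 and sends a spoofed XFF value.
	atProxy1 := overwriteXFF("192.0.2.66", "203.0.113.7:5000")
	// Proxy 1 connects to proxy 2 from 10.0.0.1.
	atProxy2 := appendXFF(atProxy1, "10.0.0.1:6000")
	fmt.Println(atProxy2) // 203.0.113.7, 10.0.0.1
}
```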

This approach has an obvious nice property: There are no untrusted values in the XFF list. You can’t possibly choose a spoofed value. But there are also aspects that I don’t like.

First of all, I think that it teaches bad XFF hygiene and introduces the possibility of mistakes leading to spoofing vulnerability. Because the XFF header is being replaced by the first proxy, the instructions for use become “use the leftmost”. But what if you then swap out your first proxy for one that doesn’t remove the XFF and instead appends to it? Spoofed!

Secondly, there’s no configuration-simplicity gain with this approach. Trusted proxies still need to be configured for all proxies but the first. If you’re doing that, you can use the rightmost-ish approach. And the rightmost-ish approach is more robust: Is your XFF overwritten? It works! Is your XFF list appended to? It works! Is your XFF list mostly spoofed? It works! And you’re helping people to understand the right way to think about the XFF header.

Thirdly, there are still legitimate uses for the leftmost-ish XFF IP (albeit with a lot of warnings attached). If your only mode of operation is to overwrite the XFF header, then you utterly deny those use cases.

Finally, hand-wavingly, I think this violates the spirit of the headers. XFF is supposed to be a comma-separated list of all the IPs involved. Which overwriting it negates. I think that, instead, X-Real-IP should be used, set by the first proxy. No need to hijack XFF for this effectively-single-IP purpose.

However, I wouldn’t fight to the death against overwriting the XFF. It’s still an improvement over the dismal widespread-ness of append-and-use-leftmost.

### Envoy’s XFF documentation is really something

HN commenter jrockway pointed me at the Envoy Proxy documentation for XFF use. It’s not exactly generally educational, but I think it’s a really good effort at making sure that Envoy users don’t shoot themselves in the foot.

### Real-world examples of doing it wrong

“Flask apps on Heroku susceptible to IP spoofing” (2013-05-12). Via eli on HN. It was using the leftmost.
“For many years, a very prominent computer science journal used XFF for guarding access — if you set it to an IP of some well-known universities, you’d be able to download all you want.” (HN)
“I remember the source code for a certain simple web app would check if X-Forwarded-For matched a certain IP as a way of granting admin powers. I spoofed it. It was sort of the first “hack” I ever did.” (Reddit)
“This was really helpful, I feel good after fixing something I didn’t know was even a problem 🤗” (Reddit)

## Acknowledgements

Thanks to Rod Hynes for proofreading and providing feedback. All mistakes are mine, of course.

Thanks to Psiphon Inc. for giving me the time to work on this. And employing me.

## TODO

Nodejs collapses XFF https://nodejs.org/api/http.html#messageheaders https://old.reddit.com/r/programming/comments/t7lxeb/the_perils_of_the_real_client_ip_or_all_the_wrong/hzkg18l/
add note about single-value header Get being combined list
all projects: if deciding to use leftmost, check for valid/non-private
finish reference implementation
probably add some diagrams
rethink hyphenating rate-limit* (right now I’m not doing it for nouns but am doing it for verbs, and I can’t decide what’s right)
AWS ALB returns 463 if there are more than 30 XFF IP addresses

…Except for anonymizing proxies that specifically hide the client IP. But that’s not an important difference for us here. ↩︎
I’m picking on AWS ALB mostly because that’s what I’m testing with. It’s also pretty popular and probably not unusual. But please append this caveat every time I mention it: “This is AWS ALB’s behaviour by default, or as I understand it.” ↩︎
I believe this is inherited unchanged into HTTP/2. ↩︎
A redditor points out that if you’re running a server for internal usage, and you want to know the originating internal IP, then you do want an IP in the private address space. Which is true. So if that scenario applies to you, modify the algorithm accordingly (i.e., remove the “non-private” checks). ↩︎
In the leftmost-ish approach, the IP you need might not be in the first XFF header. In the rightmost-ish approach, it might not be in the last XFF header. ↩︎
Seek a guide elsewhere. ↩︎
Okay, it’s possible for your trusted proxy to blow away the existing XFF value and start fresh. But a) that’s not how XFF is supposed to work, b) you’re losing potentially useful information, and c) you achieve the same thing by using the rightmost-ish algorithm. ↩︎ ↩︎
“Rightmost” is not “rightmost-ish”. ↩︎
Of course, the attestation reverse proxy should have set the X-Client-IP or otherwise passed on the IP it verified. But it’s not hard to imagine such a mistake. ↩︎
There is one more sentence elsewhere, but it doesn’t add anything: “With the header field format described in this document, it is possible to know what information belongs together, as long as the proxies are trusted.” (Emphasis added.) ↩︎

The scary state of IPv6 rate-limiting

Sun, 20 Feb 2022 13:44:57 -0500

IPv6 rate-limiting is scarily half-baked right now. If you run a server that does any kind of IP-based rate-limiting, consider not enabling IPv6 if possible. If you do use IPv6, check how your rate-limiter actually handles it.

## Four billion is a pretty small number

Most IPv4 rate-limiters will block individual addresses as they exceed the limit. That’s mostly okay, because there are only 4 billion IPv4 addresses. That means a) they are given out with some frugality, and b) it doesn’t take much memory to block a large proportion of them. If you and 1000 of your closest friends launch a brute-force or credential-stuffing login attack, any server will have no problem rate-limiting all of you.

But IPv6 is a very different matter.

## A gazillion IPs

When you ask your ISP for an IPv6 assignment, you get at least a /64 block – 2⁶⁴ assignable addresses. RIPE suggests giving a /56 prefix (2⁷² addresses == 256 /64 blocks) to home users and a /48 prefix (2⁸⁰ addresses == 65,536 /64 blocks) to businesses (or “If you want a simple addressing plan use a /48 for each end-user”). RFC 6177 agrees with this guidance, as does APNIC.

Searching for ISPs’ IPv6 prefix delegation policies shows that /64¹ and /56 are pretty common. Internode in Australia assigns /56 blocks to residential and business customers. In the US, Charter Spectrum also gives /56s. Cogent lets users request up to /48.

So, it’s safe to assume that an attacker can obtain at least a /56 and probably a /48. It’s also prudent to assume that a determined attacker can utilize all of the addresses at their disposal. And there is at least one tool that does exactly that – “freebind: IPv4 and IPv6 address rate limiting evasion tool”.

## What’s the right way to rate-limit a gazillion IPs?

This StackOverflow answer outlines the best approach I’ve found:

The best algorithm is to start blocking separate addresses. Then when multiple addresses are blocked in the same /64 you block the whole /64. Repeat that for bigger aggregates.

Prefixes are usually given out on nibble boundaries (multiples of 4, or one hexadecimal digit). So you might want to scale from /64 to /60, /56, /52, and /48. A /48 is usually the largest prefix given to a single site.

Depending how careful you want to be you can skip from /64 straight to /56 and /48.

A comment on that answer has a useful addition:

You can implement this gradual aggregation approach in a fairly simple way. Track separate rate limits at the /64, /56, and /48 level all the time. Use higher limits for higher levels. That way there is no aggregation logic at all. It’s just three separate limits based on different keys.
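Here is a minimal Go sketch of that “three separate limits” idea. Everything in it is an assumption for illustration: the type names, the limit values, and the lack of any time window, expiry, or size cap (all of which a real limiter needs).

```go
package main

import (
	"fmt"
	"net/netip"
)

// level pairs a prefix length with a request limit. The numbers are made up;
// the only idea taken from the text is "higher limits for higher levels".
type level struct{ bits, limit int }

var levels = []level{{64, 10}, {56, 40}, {48, 160}}

// limiter counts requests per key. It has no expiry and no size cap.
type limiter struct{ counts map[string]int }

func newLimiter() *limiter { return &limiter{counts: map[string]int{}} }

// allow records one request from ip and reports whether it is within the
// limit at every level. IPv4 addresses are limited per single address.
func (l *limiter) allow(ip string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false
	}
	addr = addr.Unmap()
	if addr.Is4() {
		key := addr.String()
		l.counts[key]++
		return l.counts[key] <= levels[0].limit
	}
	ok := true
	for _, lv := range levels {
		p, err := addr.Prefix(lv.bits) // keep only the top lv.bits bits
		if err != nil {
			return false
		}
		key := p.String()
		l.counts[key]++
		if l.counts[key] > lv.limit {
			ok = false
		}
	}
	return ok
}

// demoAddr returns an address in the i-th /64 of 2001:db8::/56 (0 <= i < 256).
func demoAddr(i int) string { return fmt.Sprintf("2001:db8:0:%x::1", i) }

func main() {
	// An attacker holding a /56 makes one request from each of its 256 /64s.
	l := newLimiter()
	allowed := 0
	for i := 0; i < 256; i++ {
		if l.allow(demoAddr(i)) {
			allowed++
		}
	}
	fmt.Println(allowed) // 40: the /56 limit kicks in, though no /64 repeats
}
```

The attacker never trips the /64 limit, but the /56 counter catches the rotation without any explicit aggregation logic.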

(Fun fact: If I google for “ipv6 rate limiting” (in a private browsing window), the “featured snippet” at the top is a link to the “rate limiting evasion tool” that I mentioned above. The first normal result is to that SO question. And note that it has only 6 votes and a single answer with only 10 votes. Are people just not thinking/talking about the problem? Or am I searching for the wrong thing?)

## How are real rate limiters actually doing it?

Let’s start with Cloudflare, since it’s nice and clear:

Once an individual IPv4 address or IPv6 /64 IP range exceeds a rule threshold, further requests to the origin web server are blocked

That’s pretty good, though it’s missing some of the nuance of the algorithm above. If there’s a large non-malicious site (apartment complex, school, business, etc.) behind the /64, the blocking might be over-aggressive. If an attacker has an assignment larger than /64, they might have between 256 and 65,536 /64s at their disposal. The large end of that range is getting big.

AWS WAF supports IPv6 for rules, inspection, and reporting, but doesn’t specify how it implements rate-limiting for IPv6. Concerningly, it has a really small limit on the number of IPs it can rate-limit at once: “AWS WAF can block up to 10,000 IP addresses. If more than 10,000 IP addresses send high rates of requests at the same time, AWS WAF will only block 10,000 of them.” Unless their IPv6-limiting algorithm is smart, it would be easy for an attacker to ensure they have more blockable units (IPs or /64s) than the limiter can hold. And that means that it would effectively be completely unlimited.

(This raises the question of what the limit on the number of blocked IPs is for other services. I found no such limit mentioned for anything else.)

I also couldn’t figure out what IPv6 strategy Google Cloud Armor uses, but it says this about its configurable rules: “Both IPv4 and IPv6 source addresses are supported, but IPv6 addresses must have subnet masks no larger than /64.” So maybe its rate-limiting is also /64-based, like Cloudflare? Or maybe that’s reading too much into a statement that’s only tangentially related.

Let’s Encrypt limits account creations by /48, because “it’s not uncommon for one person to have a /48 to themselves”. That seems very… cautious. On the one hand, I like how aggressive it is, but on the other hand… there could be 65,536 home or business networks (/64s) in a single rate-limited /48. I feel like this is too coarse-grained for general use.

A year ago, after a vulnerability report, Nextcloud changed from limiting IPv6 by individual addresses (/128) to limiting by /64. (There also is/was no size-limiting of the IP cache, that I can see.)

I also looked at a couple of Go HTTP rate-limiting libraries – github.com/didip/tollbooth and github.com/go-chi/httprate. Neither distinguishes between IPv4 and IPv6 and simply does per-IP blocking. So that’s bad. And neither has a size limit on the IPs in its limiter cache (only a time limit), so an attacker can consume all available memory, I think.²

(Fun fact: Even a terabyte drive can only store 2³⁶ IPv6 addresses. So you’d need about 270 million such disks to store the IP addresses accessible to a single /64 home user. Or 18 trillion disks for a /48.)

## How many “blockable units” is too many for an attacker?

If a rate limiter is blocking by single IP addresses, then that’s the “blockable unit”³. If it’s blocking by /64, then that’s the “blockable unit”. And so on. The rate limiter effectively “allows” an attacker to have a certain number of blockable units at her disposal depending on the limiting strategy used.

The obvious extremes: An attacker having a single blockable unit is acceptable (and unavoidable). An attacker having 2⁶⁴ blockable units is way too many.

But what if the attacker has 256 blockable units (blocking on /64, attacker has /56)? Or 65,536 blockable units (blocking on /64, attacker has /48)?

Let’s (charitably) assume that AWS WAF’s limit of blocking “10,000 IP addresses” applies to /64s for IPv6. If that’s true, then allowing an attacker 65,536 is too many. (To state the obvious, an attacker could cycle through her /64s and never be limited at all.)

Do other WAFs have a size limit that they’re not publishing? It seems likely, but not certain. Cloudflare, for example, prides itself on withstanding the largest attacks and is surely concerned about state-level attackers with access to at least a /32 prefix – 4 billion /64s. It would take about 40 GB of storage to keep track of that many prefixes (2³² * (8 bytes per prefix + overhead)). That’s not impossible for a big box of RAM, and certainly not for disk, of course (but disk feels a bit slow for this use case). Perhaps Cloudflare is comfortable with blocking that many addresses.

A big box of RAM dedicated to this purpose might be expensive for a smaller operator, but maybe using disk is more acceptable. If we’re talking about Nextcloud running on someone’s NAS box, then /32 attacks are surely outside of the threat model.

What about 256 blockable units? That’s… probably okay?

So, I don’t have a great answer to the question of how many blockable units is too many. What’s your comfort level? What’s your threat model?

And what about an attack that is both distributed and can utilize the full IP space? What multiple of 65,536 (or 256) are you comfortable with?

## Conclusions

I really like the idea of IPv6. I work for a company that would (probably) benefit from widespread IPv6 adoption (so that we’re, uh, harder to block). But as I said at the top: If you need to rate-limit access to something, avoid enabling IPv6 for now. The state of IPv6 rate-limiting just seems too immature.

But what if you have no choice? If you’re using a web application firewall, try to talk to the vendor about what it actually does. (And then let me know what they say!) If you’re doing the rate-limiting yourself, look closely at what your code is doing, because there’s a very good chance that it’s doing it inadequately.

For a quick fix, block IPv6 /64s rather than individual IPs. It might not be perfect, but it’s 2⁶⁴ times better.

I remain hopeful that this situation can improve rapidly. Good algorithms tend to get adopted quickly once they become available in a consumable format, and this isn’t likely a very complex case. (Yes, I am tempted to implement something myself, but this isn’t a problem I personally have right now so I wouldn’t actually use my own code, which is never a good starting point.)

## Postscript

The state of this seems so obviously sketchy that I think I must be missing something important. I am still an IPv6 neophyte. Please correct me if I have gotten anything wrong.

Edit 2022-02-21: I posted this to /r/ipv6 and there are some good contrary comments there. I particularly like this one that talks about IPv6 being better than IPv4 for rate limiting, since providers will generally have a single IPv6 prefix themselves and give out prefixes in a consistent manner, rather than the scattered, different-IP-each-reboot world of IPv4. The comments also talk a lot more about “bycatch” (over-blocking), which I didn’t really. But I still don’t feel they’re worried enough about how providers and libraries have actually implemented rate limiting at this point in time.

Edit 2022-02-22:

A coworker pointed out that the way I did the prefixed-IP-canonicalization in my PRs was overly complicated and can be achieved with the stdlib like ipv6.Mask(net.CIDRMask(56, 128)).String(). I had tried various approaches with the stdlib and didn’t come up with one that worked, but I guess I missed that one. Embarrassing.
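For reference, a runnable form of that one-liner might look like the following. The `prefixKey` wrapper and its IPv6-only restriction are my own framing for this sketch, not what went into the PRs.

```go
package main

import (
	"fmt"
	"net"
)

// prefixKey masks an IPv6 address down to its first bits bits and returns the
// canonical string form. It rejects IPv4 and unparseable input.
func prefixKey(ip string, bits int) (string, bool) {
	if bits < 0 || bits > 128 {
		return "", false
	}
	parsed := net.ParseIP(ip)
	if parsed == nil || parsed.To4() != nil {
		return "", false
	}
	return parsed.Mask(net.CIDRMask(bits, 128)).String(), true
}

func main() {
	key, _ := prefixKey("2001:db8:1234:5678:9abc::1", 56)
	fmt.Println(key) // 2001:db8:1234:5600::
}
```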

I did some searching for fail2ban+ipv6. Their IPv6 support master plan is interesting and relevant. For example: “I am not sure we will land/release 1 [per-IP blocking] alone since, as was stated, it could immediately be exploited by an attacker to cause resources exhaustion/DoS. May be only if treatment of IPv6 addresses would be made optional with a big fat warning on possible ramifications.” Even though it looks like per-IP IPv6 support was added in 0.10.

Reading through all of the comments on that issue suggests that fail2ban still only uses a per-IP strategy to block IPv6. And they are aware it’s insufficient. And stopped discussing it a year and a half ago.

Some ISPs also give a small multiple of /64s. But I feel like that case isn’t significantly different from a single /64 for our purposes. ↩︎
After writing this I realized that I’d better be part of the change I want to see, so I submitted PRs to tollbooth and httprate. Both have been accepted. But it’s unlikely that the only two rate-limiting libraries I checked are the only two with this problem, so I don’t think this changes the overall point of this post. ↩︎
To be clear, I’m making this term up for convenience of discussion. ↩︎

Git Submodule vs Subtree

Thu, 17 Feb 2022 21:35:37 -0500

Every now and then I need to make a choice between using git submodules or subtrees (or nothing), or I get asked about them by coworkers. This is infrequent enough that I forget some of the details each time and need to refresh my memory. So I wrote up these notes to share with my coworkers and to help my future self. Hopefully they’re of some use to others as well.

Disclaimer: My experience still isn’t large. I’ve only used each once or twice. And this isn’t a manual for using those commands/tools – it’s just a concise aid for choosing between them.

TL;DR: Subtree is better if you basically want to forget you have external code, or only infrequently update it; submodule is better if you don’t want to forget that the code is external, and/or if you maybe want to edit and push it.

I’d seen a lot of complaining about submodule during research, and had the vague idea that subtree was “better”, but I’ve come to realize that submodule has its place.

When you use subtree, you’re basically copying a remote code base into your file structure. The auto-commit comment at the moment you do it will record the remote commit hash, but otherwise there’s no indication anywhere in the repo a) that the subtree happened, b) what the remote repo was, or c) what the commit of the remote repo was.

From then on, any changes to the subtree code will be treated just like changes anywhere else in the repo. Any git subtree push or git subtree pull operation does weird git directory slicing (which I’ve used before to create a new repo from the subdirectory of an existing repo, retaining commit history for the files in that directory). It’s okay, but clunky.

This is in sharp contrast with submodule, where the remote repo code is not copied. Instead, the remote repo URL is recorded, and its location in the file structure, and the remote repo commit to use. When you clone the repo, you have to separately clone the submodule dependency(ies) (git submodule init+git submodule update or git clone --recurse-submodules). The submodule becomes a separate repo in the file structure, with its own .git dir. For example, if you do git status in the submodule directory, it tells you the status of the submodule code and not the outer repo. If you update the submodule repo directory, it changes the submodule’s tracked commit in the outer repo, and you commit that in the outer repo.

Treating the submodule as a separate entity is easy. You can just do all your usual branching, committing, pushing (the remote is the subrepo remote, not the outer remote).

Cloning the subtree is easier because there’s no separate step, but that’s not a big deal. For vendoring 3rd party code, submodule won’t actually make a copy, so that’s not okay¹ (unless you fork the 3rd party code first and submodule the fork).

So the main decision factors are like:

Use subtree when you just want to copy code from an external repo once, or maybe with occasional pulls.
Use submodule when you want to make your relationship to the external repo really explicit, or if you intend to make changes to the submodule code from within the context of your repo and push to the external repo.

Depending on your vendoring policy and the dependency management system. ↩︎

Diving into Go's HTTP server timeouts

Sat, 15 Jan 2022 00:00:00 +0000

Recently, I was adding timeouts to a Go HTTP server and ended up exploring how the different settings and approaches act and interact. I’m going to publish my notes here, along with the code I used for testing. Hopefully this will help someone else (or myself) in the future.

The timeout testing client can be found here: github.com/adam-p/httptimeout. There is a server in the examples directory that you can make requests to.

I link to it below, but I’m going to recommend here that you read Filippo Valsorda’s post “So you want to expose Go on the Internet”. It’s essential, but I didn’t find it had quite enough detail about timeouts, hence the below examination.

There are two different, overlapping levels of timeout in our HTTP server:

Read, write, and idle timeouts on the http.Server
The ServeHTTP timeout (this middleware)

The http.Server timeouts are overlapping and somewhat confusing (to me and others) so I’ll test and detail how they work (or seem to). (Another important but insufficiently thorough reference is the Cloudflare post “The complete guide to Go net/http timeouts”.)

IdleTimeout: “IdleTimeout is the maximum amount of time to wait for the next request when keepalives are enabled. If IdleTimeout is zero, the value of ReadTimeout is used.” Not relevant to request timeouts.
ReadTimeout: “The maximum duration for reading the entire request, including the body.” It’s implemented in net/http by calling SetReadDeadline immediately after Accept.
ReadHeaderTimeout: “ReadHeaderTimeout is the amount of time allowed to read request headers.” Implemented as above.
WriteTimeout: “WriteTimeout is the maximum duration before timing out writes of the response. It is reset whenever a new request’s header is read.” This effectively covers the lifetime of the ServeHTTP handler stack.

Observations:

The documentation makes a big deal out of ReadHeaderTimeout allowing for per-request timeouts based on the headers. “The connection’s read deadline is reset after reading the headers and the Handler can decide what is considered too slow for the body.” “Because ReadTimeout does not let Handlers make per-request decisions on each request body’s acceptable deadline or upload rate, most users will prefer to use ReadHeaderTimeout.” But since http.Request doesn’t provide access to the underlying net.Conn, I don’t see a way to set a connection deadline from the handler level. (Perhaps it intends the per-request timeout to be done via mw/context rather than via conn deadlines.)
Our TLS terminates at the load balancer, so mention of different TLS behaviour you might see doesn’t apply.
The zero values mean no timeout. These shouldn’t be used for anything but toy servers.
A timeout during header or body read means that there’s no response to the client. This is unfortunate but expected.
A timeout during header read means that there’s no server log written for the request. This is even more unfortunate but also not unexpected. The handler stack (including logging middleware) is not set up until the headers are read.
http.Server timeouts do not themselves cancel the request context. However, if a body read follows the timeout, the resulting error will trigger a context cancellation.
A ReadTimeout during body read results in a log with status of 503. This is initially somewhat surprising. The timeout interrupts the read connection, then the failed read attempt cancels the request context, then the http.TimeoutHandler (discussed below) receives the signal of that cancellation and sends the 503 response.

This is okay, but I’d prefer more control over it. (This might be a bigger problem later, when we try to handle “context canceled” with more nuance.)
The previous point illustrates (I think) that the read and write channels of the connection are severed by the timeouts separately (the response can be written even though the read is interrupted).
ReadHeaderTimeout by itself works as expected. The header read is deadlined, but nothing else is.
ReadTimeout by itself works as expected. The timeout is shared between the header read and body read.
ReadHeaderTimeout and ReadTimeout together:
- If set to the same value, behaviour is indistinguishable from just ReadTimeout being set.
- If ReadHeaderTimeout is a different value from ReadTimeout:
  - If the header read is too slow, then ReadHeaderTimeout is used.
  - If the body read is too slow, then ReadTimeout is used. The time allowed for the body read is the total ReadTimeout minus the time spent reading headers. (As in the ReadTimeout-by-itself case.)
I haven’t figured out (in a reasonable amount of time) a way to emulate/implement a slow read. I don’t know how to stream the response and slow down the client’s read of it.
- But if the WriteTimeout is set to 1ns the client gets EOF immediately.
- A sleep longer than WriteTimeout before writing the response results in the client getting no data, but the client still takes the sleep-time to disconnect rather than the timeout-time, which seems very strange to me.
One of the sources led me to believe that ReadHeaderTimeout+WriteTimeout would cover the whole request (“ReadHeaderTimeout … covers up to the request headers … WriteTimeout normally covers the time from the end of the request header read to the end of the response write (a.k.a. the lifetime of the ServeHTTP)”). What actually happens is that the header read timeout is correct, the write timeout is correct, but there’s no body read timeout. So the request can spend forever reading the body but when it goes to write the response the write connection has deadlined.

I believe that what’s happening is that the WriteTimeout is reset every time a read happens, so it’s not actually starting as long as there’s a body read. (The documentation says WriteTimeout “is reset whenever a new request’s header is read.” But that doesn’t seem to be exactly accurate.)
Otherwise rough testing with combinations of the read timeouts with WriteTimeout suggests they behave as expected (no interaction).

In addition to the http.Server timeouts we use a timeout middleware, which is basically a wrapper around http.TimeoutHandler. Here are some observations when the timeout middleware is in play and has a timeout shorter than the connection timeouts:

Unsurprisingly, the timeout mw’s timeout doesn’t start ticking until the handler stack is set up, so not until after the headers are read.
http.TimeoutHandler uses “503 Service Unavailable” as its timeout response. It seems like “408 Request Timeout” would be a more semantically appropriate response. We could intercept the response write to change that code, but it would get hack-y to distinguish between http.TimeoutHandler returning 503 and, say, our Ping endpoint returning it intentionally. Additionally, returning a 5xx error means that our clients will automatically retry the request, which is a good thing (probably).
- We could also use a copy of http.TimeoutHandler (~220 lines) to return whatever value we want.
It may seem silly to worry about sending a response to the client when its connection is so degraded that it probably can’t read it. But: a) the timeout response might be a lot smaller than whatever the client is trying to send, b) the client’s down pipe might be faster than its up pipe, and c) the timeout might actually be due to our server taking too long to process the request, rather than a problem with the client.
Whether the client receives the timeout mw 503 response depends on what it’s doing. (My test client that gets interrupted writing slowly can’t read the response, but if it’s trying to read when the timeout happens the response is received okay.)
A slow body read is interrupted by the mw timeout with an “i/o timeout” error. I believe this is due to the request context being canceled by the timeout.
A long time.Sleep isn’t magically interrupted, unsurprisingly. But selecting on ctx.Done and time.After ends early due to the context cancellation.
There are two cases when TimeoutHandler returns 503. The first is, of course, when the deadline it set on the context fires (it could have been set somewhere else, in theory). The other is if the context was canceled for some other reason (such as the client leaving). They are distinguishable from the client side because there’s no response body in the latter case.
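The observation above about time.Sleep versus selecting on ctx.Done can be sketched like this. It is a minimal illustration, not the code from my test server: `sleepCtx` and `demo` are hypothetical names, and the 50ms deadline stands in for the one http.TimeoutHandler would set on the request context.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepCtx waits for d, but returns early with ctx.Err() if the context is
// cancelled or its deadline passes first. A plain time.Sleep(d) would keep
// sleeping regardless of the context.
func sleepCtx(ctx context.Context, d time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(d):
		return nil
	}
}

// demo simulates a handler that wants to "work" for 2 seconds while the
// surrounding timeout only allows 50 milliseconds.
func demo() (time.Duration, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	start := time.Now()
	err := sleepCtx(ctx, 2*time.Second)
	return time.Since(start), err
}

func main() {
	elapsed, err := demo()
	fmt.Printf("returned after about %v with error: %v\n",
		elapsed.Round(10*time.Millisecond), err)
}
```

Inside a real handler, the context passed in would be r.Context(), so long-running work only ends early if it checks that context like this.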

Note that it is important that the timeout mw have a shorter timeout than the http.Server timeouts. We want the client to receive a response, if possible, rather than just having its connection severed.

This is not as simple as it might seem. The handler mw timeout must be shorter than either the WriteTimeout or the time remaining to the ReadTimeout after header reading. But at the handler level we don’t know how long the header read took, except that it took less than ReadHeaderTimeout. So our mw timeout should be min(WriteTimeout, ReadTimeout-ReadHeaderTimeout).

…Except that calculation ends up feeling very unnatural in practice. Instead, it makes more sense to first choose the desired handler timeout, then set the http.Server connection timeouts based on that. I think that it’s reasonable to use 0.5x the timeout for ReadHeaderTimeout and 1x the timeout for ReadTimeout and WriteTimeout.

We certainly can’t rely on the timeout mw while reading headers (because there is no middleware at that point), but it’s possible that body read and response write timeouts are redundant. Severing the connection seems safer than cancelling the context and hoping something checks it, so we’ll set the other timeouts anyway.

## Addendum

### Let’s work through the timeout math

Let’s say we want, generally, a 10-second request timeout. So we set TimeoutHandler’s timeout to 10 seconds.

We need to pick a ReadHeaderTimeout that is basically independent from that (because the handler timeout doesn’t start until after the header read is complete). It seems reasonable to pick 5 seconds.

As discussed above, we prefer the ReadTimeout to be longer than the handler timeout, so the client has a chance of getting the response. Because ReadTimeout ticks away during the header read, the calculation for this is something like:

ReadTimeout := handler_timeout + ReadHeaderTimeout + wiggle_room
e.g.,
= 10s + 5s + 200ms

So even if the header read takes 4.9s, we are still left with 10.3s for the body read – slightly longer than the handler timeout.

WriteTimeout covers from the end of the reads until the end of writing. If there’s no body to read, this is the whole post-header request time. So, we want it to be handler_timeout + wiggle_room, so something like 10.2s.

IdleTimeout… is independent of any of this stuff. It seems common to set it to a couple of minutes.
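Putting the math above into code, a server setup might look roughly like the following sketch. The function `newServer`, the constant names, the timeout message, and the 2-minute idle timeout are assumptions chosen to match the numbers in this addendum; treat it as a starting point rather than a tested production config.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	readHeaderTimeout = 5 * time.Second        // independent of the handler timeout
	wiggleRoom        = 200 * time.Millisecond // lets the handler timeout fire first
	idleTimeout       = 2 * time.Minute        // "a couple of minutes"
)

// newServer derives the connection timeouts from the desired handler timeout,
// so that http.TimeoutHandler gets a chance to respond before the connection
// deadlines sever the connection.
func newServer(addr string, h http.Handler, handlerTimeout time.Duration) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           http.TimeoutHandler(h, handlerTimeout, "request timed out"),
		ReadHeaderTimeout: readHeaderTimeout,
		// ReadTimeout keeps ticking during the header read, so add that on top.
		ReadTimeout:  handlerTimeout + readHeaderTimeout + wiggleRoom,
		WriteTimeout: handlerTimeout + wiggleRoom,
		IdleTimeout:  idleTimeout,
	}
}

func main() {
	srv := newServer(":8080", http.NotFoundHandler(), 10*time.Second)
	fmt.Println("ReadHeaderTimeout:", srv.ReadHeaderTimeout)
	fmt.Println("ReadTimeout:      ", srv.ReadTimeout)
	fmt.Println("WriteTimeout:     ", srv.WriteTimeout)
	fmt.Println("IdleTimeout:      ", srv.IdleTimeout)
}
```

Starting from the handler timeout and deriving everything else keeps the "handler timeout must be the shortest" rule true by construction.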

### AWS observations

Using an AWS load balancer in front of your Go server muddies the behaviour of some of these timeouts, but doesn’t completely obviate them.

ALB seems to buffer all the incoming headers, so ReadHeaderTimeout does nothing. ALB’s timeout for reading headers appears to be 60 seconds.

ALB doesn’t seem to have a body-read timeout (or at least not one shorter than a couple of minutes). It does seem to be buffering some of the incoming body, since the client can still send some data after the backend server has given up the connection. About 30 seconds after the server drops the connection, the load balancer responds with 502 Bad Gateway.

I didn’t test the write timeout, but I bet there isn’t one.

The ALB idle timeout seems to be 60 seconds.

The Ethics of Driving Speed in Travel Time Estimation

Sun, 28 Nov 2021 00:00:00 +0000

How should travel time be estimated? What are the ethical implications of the approach taken?

You enter your destination into your maps app. It finds a few likely routes. It determines the distance of each pretty easily. It checks traffic conditions along the routes. But we don’t yet have a travel time estimate. Time equals distance divided by speed, adjusted for traffic.

What travel speed (traffic notwithstanding) does the app use?

The speed limit along the route.
The speed at which the app user typically drives along the route.
The speed at which other motorists typically drive along the route.

We’ll set aside option #2 for now. There are many cases where the user has never made the trip that they’re planning. We’ll discuss an expanded version of it below.

The problem with picking between the other two options is that a) most people drive above the speed limit¹ and b) doing so is breaking the law.

If we use the speed limit method and our user drives the speed limit, our estimate is good. But most people drive faster than the speed limit, so our estimated time will typically be too high. The user will leave earlier than they need to, drive faster than we estimated, and arrive earlier than they intended to.

If we use the typical speed method and our user drives the typical speed, our estimate is good. But if our user is in the minority of law-abiding motorists, we underestimate their travel time and the user arrives late.

So there’s the quandary: The speed limit method is incorrect for most people and the typical speed method encourages illegal behaviour and punishes law-abiding drivers (by making them late).

I think that the speed limit approach is more ethically palatable, because it encourages legal behaviour and because being early is almost always better than being late. But giving travel time estimates that are usually incorrect is extremely unappealing – your maps app isn’t very good if one of your fundamental features is usually wrong.

So what do maps apps do and what should they do? To be clear, I know nothing about this domain. I own a car and a cell phone and that’s the extent of my qualifications. But we can have fun thinking about it…

One approach could be to split the difference. Give a time estimate in between the speed limit and the speed people usually drive. This doesn’t feel great because a) it’s still usually going to assume an illegal speed, and b) it’s still likely to be incorrect for many people. Nevertheless, I feel like it’s probably a good approach. It still errs on the side of being early, is closer to being correct for more people than the speed limit approach, and will result in a smaller speeding fine if followed closely.²

It’s also worth mentioning that below a certain speeding threshold, the probability of getting a ticket approaches zero. (Though this claim is confounded by automated systems, like speeding cameras.)

Another approach is to reintroduce option #2 from above, with some extra magic sprinkled on top. The app may not have seen the user drive the exact desired route before, but it has likely seen the user drive similar roads with similar speed limits and can make a very good guess about how fast the user will actually drive.

This is a very nice approach. It results in the most accurate predictions for the most users. And it largely allows the app developers/ethicists (probably one and the same) to wash their hands of the charge that they’re requiring people to break the law to avoid being late. If the user was law-abiding, they’d get law-abiding directions!

(For the brief amount of time before there’s any data about the user… probably keep it clean by giving a speed-limit estimate.)

I don’t know what apps really do. After messing around with Google Maps for a while I managed to find a route³ that seems to encourage breaking the speed limit.

16.4 kilometers in 9 minutes is 109 kilometers per hour, and the speed limit on highway 401 is 100 km/h. But… if that 9 minutes is rounded down from 9.9 minutes, then the speed drops to 99.4 km/h. And it did take me about 10 tries to find a route that exceeded the limit. (I hunted in the Toronto area because I know the speed limits, but there are probably much better, longer, traffic-free stretches elsewhere that would be better experiments. Except now it’s snowy everywhere that I know, and that will also surely factor into estimates.)

Is it okay for a maps app to encourage us to ever break the legal speed limit, even if our previous behaviour – or the behaviour of others on the same stretch of road – indicates that we likely will anyway?

Even seemingly mundane automated systems can have ethical impacts. As consumers of such systems we need to be cognizant of what behaviours such systems are pushing us towards (and away from), and we should get in the habit of consciously and explicitly asking ourselves how we’re being influenced.

For those of us who are developers of such systems, we need to make a habit of consciously stepping back and thinking about the impact of our design decisions on our users. What’s optimal might not always be what’s ethical. And for any non-trivial ethical question, it should be discussed with others. It’s difficult to see the ethical traps in one’s own design and even harder to find better ways out of them – the perspective of others is invaluable.

Disclaimer: As I said above, I have no domain knowledge here. I did some googling to see if there was discussion or papers about this and found nothing, but it’s entirely likely I wasn’t searching for the right words.

This might not be true everywhere, but it sure is where I live. ↩︎
There’s also a conversation worth having about the immorality of breaking the law to speed. And if the moral violation is lesser or greater depending on how badly you exceed the limit, or if being in sin is a binary state. Not a conversation I particularly want to have here, with myself, though. ↩︎
Using a longer link in case that shortened one breaks. ↩︎

Instructions on the ceiling

Sat, 27 Nov 2021 00:00:00 +0000

From the Neck Up does neck-up massage, including intra-oral. When the RMT has their fingers in your mouth it’s pretty hard to give directions. So there are hand signs for communicating with the RMT. And, since you’re lying on your back during the procedure, the guide to the signs is posted on the ceiling.

Timing attack mitigation must exclude network

Sat, 27 Nov 2021 00:00:00 +0000

TL;DR: When trying to prevent timing attacks (e.g., against login username enumeration) by making a request take constant time, make sure you exclude the network read and write time. If you don’t, an attacker can slow down their request to bypass it.

I’ll be covering some background and contextual information here. If you don’t need it, skip to “Exclude network time from constant-time limiting”.

## What is a “timing attack”?

Briefly, a timing attack (in this context) is when an attacker observes the time it takes for a server to handle a request to glean some information about the validity of the input they tried. The typical target for this attack is the login request, and in that context there are – unsurprisingly – two pieces of information that can be attacked: username and password.

Here’s a typical login flow, which we’ll reference below:

1. Read the request from the client.
2. Parse the request. Check for basic validity.
   - If the check fails, go to step 6, indicating a 400 response.
3. Look up the username in the DB. Retrieve the hashed¹ password.
   - If the username is not found, go to step 6, indicating a 401 response.
4. Hash the incoming password and compare against the stored one.
   - If the password does not match, go to step 6, indicating a 401 response.
5. Do other work to set up the login session.
6. Write the response to the client.

### Testing/discovering/enumerating usernames

The time that step #4 takes can be used to test for the existence of a username. If the username isn’t found in the DB, the response will come a little faster than if the username is found and then a password hash-and-compare occurs. So an attacker can try out usernames and watch the response time to find out when one exists in the database.

(Note that multiple requests for a single username may be required to nail down the subtle timing differences.)

An attacker might use this for testing variations on a single username to target a particular user, or might use huge common username lists or even brute force to enumerate all or nearly all of your users.

#### Why do I care?

Maybe you don’t. Maybe you respond with different information depending on whether the username or password is incorrect, because that’s more user-friendly. Maybe your site/service is innocuous, there’s no sensitive information, or everything is public. Maybe you’re confident that your web application firewall or CAPTCHA or other mitigations will prevent this kind of attack.

I care because I work on a tool that’s questionably legal in many countries. Users have a habit of putting their real name in their username or reusing usernames across multiple sites, many of which will connect that username to their real identity. So I – and we – think it best that we limit username testing as best we can (and encourage users to use pseudonyms).

### Testing passwords

A very bad and wrong way of checking for a password match is to do a simple string comparison on the plaintext password. First of all, you shouldn’t be storing plaintext passwords in your DB. Secondly, and relevant to timing attacks, doing a simple string comparison will result in different times taken depending on how many characters in the string match. The string comparison is likely doing a length equality check and then doing a character-by-character equality check, so it’s going to return false early as soon as there’s a mismatch.

This is a solved problem. Use a library with a proper password-hashing algorithm, and use its constant-time equality function. (And don’t just binary-compare the hashes, since the time that takes might also leak something.)
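As a minimal sketch of what “constant-time equality” means in Go, the standard library’s crypto/subtle package compares byte slices without returning early at the first mismatch. The helper name `hashesEqual` is made up for illustration, and the byte strings below are stand-ins for real hash output; in practice a password-hashing package (for example, bcrypt’s CompareHashAndPassword in golang.org/x/crypto) would normally do the comparison for you.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// hashesEqual reports whether two password hashes are identical, taking time
// that depends only on their length and not on where they first differ.
// (ConstantTimeCompare returns immediately if the lengths differ, which is
// fine for fixed-length hashes.)
func hashesEqual(stored, computed []byte) bool {
	return subtle.ConstantTimeCompare(stored, computed) == 1
}

func main() {
	stored := []byte("stand-in for a stored hash")
	fmt.Println(hashesEqual(stored, []byte("stand-in for a stored hash")))
	fmt.Println(hashesEqual(stored, []byte("stand-in for another hash!")))
}
```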

### Other types of requests

“Forgot my password” requests are similarly vulnerable. Typically, the user enters a username or email address, then there’s a lookup to see if the account exists and maybe whether the email address is confirmed, then a token gets generated and stored, then the recovery email is sent. As with the login flow, there are processing differences depending on whether the username or email is found or not, which means timing differences that can be used to discover if the input exists in the DB.

The same considerations about caring mentioned above apply here as well.

## Mitigating timing attacks

### Preventing automated requests

Using CAPTCHAs, rate limiting, or a web application firewall can help prevent automated requests that are attempting to enumerate your users. They won’t generally help with targeted username testing.

You should certainly employ these kinds of measures, but my opinion is that they should be used alongside other mitigations.

### Randomizing response times (not recommended)

If a random sleep is added to the processing of sensitive requests, then the response timing becomes more difficult to use for timing attacks. But only “more difficult” – with enough timing samples, the average can be taken and the attack again becomes viable.

### Constant-time responses

We’re getting closer to the point of this post now.

Forcing responses to take a fixed amount of time prevents timing analysis. If every response, regardless of input, takes the same amount of time, there’s nothing to differentiate and analyze.

The constant time value should be chosen to exceed the possible natural response time. There are likely going to be outliers where the natural response time exceeds the constant time – you should log and alert on these incidents, as they ruin the mitigation. But, generally, if they’re kept very infrequent they still won’t provide an attacker enough to work with.

Another approach to constant-time excesses would be to have multiple increments of constants. Like, limit the response to 1 second; but if it naturally takes more than 1 second, limit it to 2 seconds; etc. I’m not sure if this is warranted or adds very much. I wouldn’t bother. You will also have to be very sure that, say, bad usernames don’t always end up in the first time increment while bad passwords always end up in the second.

#### Exclusions

Since we’re only trying to prevent an attacker from distinguishing between “bad username” and “good username but bad password”, then any situations that don’t reveal that can be excluded from having a constant-time response. For example:

Successful login. The user knows that the username and password were both good, so a constant-time response achieves nothing except slowing down the valid-user experience.
“400 Bad Request” responses. For example, if the username is too long or has invalid characters in it. No account lookup is done, so nothing is revealed.

It might be tempting to also exclude 500 server errors. In theory, something like a DB communication error shouldn’t reveal information about the username or password. But such errors can occur for many reasons, some of which may be repeatable by an attacker. It’s probably best to keep server errors constant-time, if possible. (Also, such errors should be extremely rare for benign users and shouldn’t significantly impact the experience of the service.)

## Exclude network time from constant-time limiting

We’ve finally gotten to the point.

When forcing a response to be constant time, the network portion of the request processing must be excluded. By this I mean the time taken to read the request from the client and the time taken to write the response. We’ll see that it’s both acceptable and necessary to do so.

The login flow will end up looking like this:

1. Read the request from the client.
2. Record the response start time.
3. Parse the request. Check for basic validity.
   - If the check fails, go to step 7, indicating a 400 response.
4. Look up the username in the DB. Retrieve the hashed password.
   - If the username is not found, go to step 7, indicating a 401 response.
5. Hash the incoming password and compare against the stored one.
   - If the password does not match, go to step 7, indicating a 401 response.
6. Do other work to set up the login session.
7. Wait until the constant-time limit has passed since the start time.
8. Write the response to the client.

First of all, it is acceptable to exclude the network transfers from the constant-time limit because they are completely unaffected by the validity of the input. Additionally, the attacker controls the network input (the request) and has full visibility of the network output (the response). There is nothing to hide here.

Secondly, it is necessary to exclude the network time.

If the request read time is included in the constant-time consideration, the mitigation is effectively undermined. The attacker controls the client’s network speed, so they can trickle the request out until the constant-time limit is passed. Then they start measuring the response time from the point that the request writing is done. The server will then start processing the request. There will be no sleep in step #7 because the constant-time limit has already been exceeded, so the actual processing time will be plainly visible to the attacker.

(If the “multiple increments of constants” approach is used, the attack is more complicated, but I believe it’s still viable. The attacker will need to tweak the request speed so that bad-username requests fall into the first increment while bad-password requests fall into the second. Anyway, this hardly matters, since it’s acceptable to exclude the request read time.)

Excluding the response writing time is also necessary. As soon as the first byte of the response is written, the attacker is signaled that the processing is complete and they have the information they need. So any constant-time sleeping needs to occur before the response writing is begun.
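Here is a rough Go sketch of the revised login flow, to show where the clock starts and where the wait happens. Everything named here is hypothetical: `constantTimeLimit` is an arbitrary 1 second, `checkCredentials` is a placeholder for the parse, lookup, and hash steps, and the 64 KiB body cap is just a sane-looking number. It is not my production code.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// constantTimeLimit is an assumed value; it should exceed the natural
// processing time of the slowest normal login attempt.
const constantTimeLimit = 1 * time.Second

// remainingDelay returns how long to sleep so that processing appears to take
// exactly limit. If processing already took longer, it returns 0 (an outlier
// that should be logged).
func remainingDelay(elapsed, limit time.Duration) time.Duration {
	if elapsed >= limit {
		return 0
	}
	return limit - elapsed
}

// checkCredentials is a placeholder for steps 3 to 6. It returns the status
// code and whether the response must be held to the constant-time limit
// (400 responses reveal nothing, so they are excluded).
func checkCredentials(body []byte) (status int, constantTime bool) {
	if len(body) == 0 {
		return http.StatusBadRequest, false
	}
	return http.StatusUnauthorized, true
}

func loginHandler(w http.ResponseWriter, r *http.Request) {
	// Step 1: finish reading from the network BEFORE starting the clock, so a
	// client that trickles its request cannot eat up the constant-time limit.
	body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 1<<16))
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	start := time.Now()                            // step 2
	status, constantTime := checkCredentials(body) // steps 3 to 6

	if constantTime {
		// Step 7: wait before the first response byte is written.
		time.Sleep(remainingDelay(time.Since(start), constantTimeLimit))
	}
	w.WriteHeader(status) // step 8
}

func main() {
	http.HandleFunc("/login", loginHandler)
	fmt.Println(remainingDelay(300*time.Millisecond, constantTimeLimit))
	fmt.Println(remainingDelay(1500*time.Millisecond, constantTimeLimit))
}
```

Doing this inside the handler, rather than in middleware wrapped around it, is what keeps the body read out of the measured window.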

## Other concerns

I worry about the attacker using a simultaneous request flood to slow down all other request processing enough that many or all login requests will exceed the constant-time limit and start revealing the true processing time. Adequate capacity and/or scaling, combined with anti-denial-of-service measures, should be sufficient to mitigate this. It will also be a pretty expensive – and therefore unlikely – approach for an attacker to take.

## Not just HTTP

I wrote the above in terms of HTTP requests and responses, but it applies to any network protocol.

## Final words

I’m writing this because when I was implementing this I screwed up by including the network transfer in the constant-time limit – it felt cleanest to implement it as middleware, but that was at the wrong level. I only realized the problem while re-reading some tangentially-related code. Hopefully this helps someone else not make the same mistake.

Hashed using something appropriate, like argon2, scrypt, or bcrypt. ↩︎

My Baby Advice for Fathers

Wed, 14 Jul 2021 00:00:00 +0000

There’s a ton of “what to expect when you’re expecting” stuff out there that I have no intention of repeating. I found there was a lot of stuff I wasn’t prepared for – mentally or logistically – that I want to call out here.

A lot of what I’m going to say is negative. I feel like the negative experience for (some) fathers isn’t discussed very much and that that’s a major disservice to us – and a dangerous one, frankly.

(I’ll be saying “he” a fair bit, since my only child is a boy.)

## The most important thing I have to tell you

…that probably – and hopefully – won’t affect you.

Paternal post-partum depression is a thing. It affects 10%-25% of fathers, and it doesn’t get talked about very much. It can start a month or two before birth, or can start after. It’s fucking horrible.

If you find yourself sinking into a pit of despair and regret, seek outside help. Your partner is (probably) going to be the absolute worst person to seek help from – she’s going to be high on birth hormones and think that everything is wonderful. If you tell her that you wish a meteor would hit the city to erase everything, she’s going to call you a monster and you’ll feel worse.

I’m not saying you need to pre-arrange a shrink, but you do need to:

Monitor yourself for signs of depression. This will be harder than it sounds, because you’ll be sleep-deprived and harried a lot of the time. It’ll be difficult to accurately gauge your own state. But try to check in with yourself periodically and see if you’ve been feeling real dark and regretful.

Figure out who you will talk to if you need it. The ideal person to talk to is probably a mental health professional, but it might not need to be, depending on how bad your depression is†. The person needs to be someone who won’t be emotionally tied up in (or in love with) your new baby. They need to be able to listen and not judge. They’ll need to remind you that what you’re feeling is normal (or not abnormal – see above percentages) and that what you’re in now is a phase and will pass. It’s hard to believe that a year (let’s say) isn’t a long time, but you will get out the other side.

But before this freaks you out too much, remember that it doesn’t happen to most people. It’s just that the probability is high enough that it’s worth being aware of and making some amount of (mental) preparation for.

†: Disclaimer: You should always consult a professional and not listen to amateur hacks – like me – who suggest otherwise.

## T-8 months

We think that going with a midwife plus a hospital birth is better than obstetrician+hospital or midwife+home birth. You get much more personal care, but the full medical establishment is available if you need it.

When you have an obstetrician, you only see them during their shift at the hospital. Nurses also flit in and out, changing shifts. So there’s very little sense of continuity, and that’s not a great feeling when you’re teetering on the edge of panic after hour 12.

With midwives, you’ll meet the two or three who will be your team at your appointments. As you get into labour at home, they will come to your house to check you out. They’ll be around a fair bit at the hospital as well. And they’ll visit you at home afterwards. (And they’re covered by OHIP.)

I also recommend hiring a doula. I’ll say right now that they’re not cheap – around $1000. They’ll visit you at home before and after and do some training, but they’re basically experienced delivery room support. They’ll stay there as long as you do, and can fetch you food or water or take over for you for a while to help your partner push. They’ll also take photos for you, which I kind of recommend. Erin doesn’t think it was worth it, but I think the doula is a bigger help to the father than the mother.

## T-6 months

Everyone has an opinion about names and most of those opinions will conflict with yours. You probably spent a long time agonizing over the name choice, so it’ll piss you off to have people denigrate it. Even worse, it might make you start second-guessing the choice.

So don’t tell anyone what name you’ve chosen. Pick one that you and your partner are satisfied with, and then let everyone cope when you announce it after the birth. (And try your best to ignore the post facto criticisms as well.)

Figure out what daycares you’re interested in and get on their waiting lists. Maybe you’ll get in super early and you won’t understand why I’m saying this, or maybe this isn’t even early enough.

Start meditating. Get a mindfulness meditation app like Headspace or 10% Happier and do it every day. It gives (or helps with) a lot of super powers that you’re going to need and want:

Self-awareness. Knowing what state you’re in becomes more difficult and more important.
Reduced anxiety and depression. (Probably. The research is still in its infancy.)
Appreciation of the current moment. This might not sound like much, but it’s gold.
Calming down quickly from being angry. This sort of falls out of “self-awareness”, but it’s incredibly important as your kid gets older and more infuriating. Also good for your spousal relationship.
Compassion. (Eventually you’ll get to metta meditation, which teaches exactly this.)
Generally better mental and emotional resiliency.

## T-4 months

Join some Facebook parenting groups. They’re surprisingly good and useful.

Here’s the East Toronto Young Fathers group. Feel free to post questions, requests, or just vent. Everyone is pretty chill and helpful.

And the equivalent mothers group. However, tell your partner to stay out of the politics. It gets incredibly bad. The utility is well worth it, though (I’m told). It’ll give links to other buy/sell/trade groups which are also great.

Get an Audible subscription, so you can build up some credits. And/or get acquainted with the Libby app, and put some holds on a bunch of books (you can defer checking them out if they become available before you need them). You’re probably going to spend a lot of time pushing a stroller for hours so the kid can sleep. And doing other boring stuff that doesn’t allow you to have free hands, but does allow you to listen to something. Podcasts are okay too, but nothing too heavy – you won’t be in the mood.

And get headphones to listen to them on.

Get earplugs. I like these ones. And a sleep eye mask. I like this one.

## T-3 months

Get your partner to start researching postnatal physiotherapy. Whole important swathes of her body (like her pelvic floor) are probably going to be in bad shape after delivery. Maybe this isn’t so important for caesarean births? I don’t know.

Start dropping hints to everyone who idly offers assistance that they should bring you food after the baby is born. And only food. You will never want to cook and will always be hungry.

Order a bunch of Soylent. Seriously. Sometimes you just need some reasonably healthy calories to keep you alive and going. (Don’t order too much of the chocolate, so you’re less tempted to drink more than you need just because it’s sweet and delicious. Also be careful about overdoing the caffeinated ones.) [Edit: Start with buying a small amount of Soylent. Some people don’t seem to get along with it. If you live near me, I’ll give you a couple bottles to try.]

Get familiar with ordering groceries for delivery. Find some not-super-unhealthy, not-super-challenging delivery restaurants.

Make sure you have infant disposable diapers at home. Even if you’re going with a cloth diaper (service) later, you’re going to want fast-and-easy to start with.

If you’re a light sleeper, consider getting an inflatable bed or something to set up in the basement. If you’re not on baby duty, you need to get decent rest.

## T-2 months

Think about T-2 months as T-1, because you might get an early surprise. Don’t leave delivery-room prep to the last minute.

In your oh-shit-labor-time go-bag, pack some pre-mixed formula. The staff at Michael Garron Hospital (in 2014) are zealously against formula, and other places might be as well. (Which seems to be a pendulum-too-far-the-other-way response to previous formula-pushing, and is misguided.) But when your partner is anemic from birth-bleeding and can’t get the milk going and the kid is losing weight and the nurses are basically threatening to keep you there forever until the milk flows… you’ll want the goddamn formula.

Prearrange with some trusted people to be on speed-dial for bringing stuff to you in the hospital. Grandparents are okay, but not ideal – they can get emotional and irrational and become their own kind of problem for you.

Pack a notebook, and consider installing a voice recorder on your phone. You’re going to need to record events and their times post-recovery and it’s likely your brain will be absolute garbage and not be able to remember anything. Nurses can be weirdly terrible about recording things on the chart or checking the chart or something. Like, “When did you take your last iron pill?”… “Why are you asking me? Didn’t someone write it down? I don’t even know what day it is!” Plus feedings and everything else.

Seriously, you might not have slept in 24+ hours and just gone through the most stressful time of your life. Things that you normally rely on your brain for should be supplemented where possible.

Install your car seat now. If you don’t have a car (like us, at the time), practice installing and uninstalling it in someone else’s car. When you need to rush to the hospital or when you need to bring baby home are not the times to be struggling with it.

Arrange more than 2 weeks leave. I don’t know how much, since I only took 2, but it wasn’t enough. And don’t tell work that you’ll be “available”. Treat it as if you’re going away with just a cell phone and mediocre wi-fi. Prepare accordingly.

## T-0: delivery

The range of possible experiences is absolutely huge, and you can read about them anywhere, so I’m not going to try to “prepare” you. Power through.

Your kid will come out ugly. They all do. No one else in the room will seem to notice, but they’re either tired and high on mothering hormones or they’ve seen a million ugly babies and are too polite to mention it. Later on you’ll both look back at newborn pictures and say “man, that kid was ugly”. Don’t worry: how they look when they first get squeezed out has nothing to do with how they’ll look later.

In that moment you might find yourself… not overjoyed. You’re deadly tired and your wife is bleeding too much and you have your shirt off holding your ugly kid watching the doctors patch her up (hopefully). And you realize that you don’t instantly feel anything for the lump of now-well-differentiated cells in your arms. You just want your wife to be okay and you want to sleep, and you don’t particularly care about this baby thing anymore. This is to be expected. Again, you don’t have the same hormones that the mother has and you’re in physical and mental collapse. Play the role you need to in that moment and you will get through it.

## T+2 days: back at home

Hang a baby blanket or a diaper in your window or on your front porch. This is a somewhat-recognized signal that you’re home. Maybe people will start bringing you food.

Having (useful) friends or (useful) relatives around can be good at this point. Your partner will need help with a million things. And since she has to feed every three hours, she’ll need help day and night. If someone other than you can handle some of that, great.

But beware of friends or family who are going to sit on your couch and expect tea and biscuits and a lovely chat. They can fuck off for a few months.

Get sleep whenever you can, for as long as you can. If someone is around to help your partner, put in earplugs and sleep in the basement, any chance you get, any time of the day. You’re probably starting out sleep-deprived and it’ll get worse.

At some point you will stop feeling so tired-shitty. It’s important to remember that this is just a feeling and that you’re still sleep-deprived – your cognitive and physical function is no less impaired. (There is research demonstrating this.) You haven’t evolved beyond sleep. You still need it. You need it more than anything.

Start doing laundry pretty immediately. Your partner probably doesn’t have multiple days worth of what she needs. You’re going to be doing laundry every day. That’s fine – it’ll be the easiest job you have.

## T+2 weeks –> 1 year

I have talked to some fathers who just instantly love being a parent. They immediately know that it’s the best thing they’ve ever done, they love their kid and their role and their new lifestyle and everything. Maybe that’ll be you, too!

Or maybe not.

Maybe you feel (and by “feel” I mean “realize the actuality”) that your baby is a barely human lump. That you love your cats more than the baby. That you’ve never been so interminably bored in your whole life (once the hectic phase is over). That you deeply mourn the hobbies you used to enjoy and might not get back for years.

Maybe that counts as “postpartum depression”, but it also seems like a fairly rational response to the whole thing. Either way, it’s hard to get through.

One important tip that I (thankfully) read somewhere was to fake-it-till-you-make-it love for your child. Say the words, go through the motions of affection. Maybe it won’t look or feel entirely natural, but: a) it’s better than nothing, and, most importantly, b) it will lead you to feel it properly. It will also serve you very well for years to come, when you’re super angry with your kid, but you know that showing love and support is the right move.

It might take a long time before you fully care about your child. For me it was probably 10 months before he felt like an actual human, with a personality, who I could interact with and feel something other than boredom.

That’s a long time!

But the time will pass. Eventually a day will come that you realize you legitimately love the kid, and that you’re happy to see him, and to interact with him (up to a point – your kid’s capacity for dumb, boring crap will be higher than yours for a long, long time). And that you’d rescue him before the cat if the house was on fire.

## T+1 year –> ∞

One thing that’s always hard to keep in mind is how undeveloped your kid’s brain is. At his best moments, he’ll seem intelligent and mature and self-possessed, but: you can’t demand his best behaviour to be the norm. You may wish it, but you can’t expect it, and you certainly can’t enforce it.

The brain’s prefrontal cortex is the seat of self-control and executive function. It doesn’t fully develop until you’re in your 20s. You can count your kid’s prefrontal neurons on one hand (whatever, science). You just cannot expect them to behave rationally most of the time.

You know how sometimes you’re tired and in a bad mood and you snap at people, especially your partner? Now imagine you had way, way less brain-matter to help you keep that shit in check most of the time. Imagine it didn’t require “tired and bad mood” to make you snappy and shitty. Imagine you didn’t just snap at your spouse (with whom you naturally let your guard down), but did it around everyone. This is what it’s like to be a kid! Plus, they don’t know how the world works! Plus, they have very little autonomy and it grates on them! Plus, they’re tired and hungry a lot!

Being a little kid is super hard. When you see them behaving really well, try to only interpret it as a wonderful blip – maybe as a preview of the future. The rest of the time, try to be really patient and loving even when they’re being absolutely shitty.

When I look back on pictures of my son at age 3 (say), I can have trouble understanding how I could have expected so much and gotten so angry. He’s so little! But I did. In the moment you don’t have that hindsight. Try your best.

A couple books I recommend are Unconditional Parenting by Alfie Kohn and, if you have a boy, It’s a Boy! Understanding Your Son’s Development from Birth to Age 18 by Michael Thompson. I would say they’re basically essential reading to get some outside direction in compassion and understanding. There’s a lot of good stuff in those books that relieves me of having to write it here.

Consider Montessori schools. Most/many start at about age 3 (“casa”, they call it at that age), and they can be very hard to get into later, so start applying. I’m finding the public school system to be basically garbage, and I wish we had gone that route (or gone that route earlier – every year I call to stay on a waiting list, and still nothing). Who knows if it would actually make a difference, though.

Last thing, maybe: Remember that (almost) everything is a phase. Just because your kid is hitting and kicking a lot (or whatever) – for months! – doesn’t mean that he’s a psychopath or otherwise permanently broken. Try to be patient and loving. Your child will grow up and get through each phase and the only lasting harm that can be done is by you and your reaction to it. And then the next shitty phase will come along. (This is also true for nice phases, sadly.)

Make sure you have a backup Yubikey

Fri, 25 Jun 2021 00:00:00 +0000

For four years I carried a Yubikey NEO (USB Type-A) in my pocket, on my keychain. And then it died (would no longer be recognized by any computer).

Yubikey’s durability claim was:

Crush-resistant and waterproof, YubiKey NEO is practically indestructible during normal use, weighs only 3g, and attaches to your keychain alongside your house and car keys

So, they didn’t explicitly say “carrying it on your keychain won’t kill it”, but they sure did imply it.

The wording for their new USB Type-A products is much more tepid:

Design & Durability: Water Resistant, Crush Resistant

Luckily, a few months after I bought the first one I bought a backup that I stored safely, so I wasn’t much put out. Here are the two keys side-by-side – click to see more detail than you need:

(After looking at that photo up close I realized that the problem could be that the two middle pins were touching. So I took a small screwdriver and carved them apart. And it works! But that doesn’t really undermine the point here, which is…)

So make sure you have a backup Yubikey 2FA hardware token, or you could be very unhappy when it gets damaged beyond use. Then you can carry your day-to-day key in your pocket without worry.

Dev Story: Unicode URL length limit blues

Mon, 07 Jun 2021 00:00:00 +0000

I have enjoyed reading other people’s design and debugging train-of-thought posts, so after I spent two days wrestling with a code problem, I thought I’d write it up. It’s not technically exciting, but I think that describing it might be useful to someone – or my future self – someday. Or, at the very least, a little amusing.

(Bonus: While writing this I discovered an error I made while doing the actual work. See if you can spot it before I reveal it…)

## Background

### The Psiphon for Windows UI

Our Psiphon for Windows client uses an Internet Explorer-based HTML control as the GUI, talking to a C++ backend (which itself talks to the Go implementation of our censorship circumvention tech). It has been working reasonably well for the last 6 years. Before that we had a plain, grey-box, dialog-based win32 UI, but the idea of adding a settings UI and localizing everything prompted me to look for other approaches.

We have strict executable size requirements, because many of our users have limited bandwidth, and because we run an auto-responder that emails our client software – the Windows and Android clients attached to a single email. That ruled out a lot of fancy UI approaches, but using the native web control ended up working okay. It meant supporting Internet Explorer 7 through 11 as the HTML/CSS/JS engine, so it wasn’t always a lot of fun, but manageable. (And Go recently dropped support for Windows XP, which meant we could raise our minimum IE version to 8, which helps.)

For this story, the relevant part of the JS<->C++ communication is that when the JavaScript wants the C++ to start a login sequence, it does something like this:

window.location = 'psi:psicash?{"command":"login","username":"abc","password":"xyz"}'

The C++ code gets a window message when the location is going to change and figures out what to do with the URL. (And has the ability to trigger JS functions.)

### PsiCash accounts

Psiphon has an in-app credit system called PsiCash. Users get credit by buying it or by certain rewarded activities. They can spend PsiCash on Speed Boost, which uncaps the network speed and expands the allowed ports.

For the last year I’ve been working on implementing PsiCash user accounts. Until now a user’s PsiCash credit has been tied to a device (or a device’s local storage), but accounts will let the user’s balance roam across devices, be restored after device failure, and so on.

A PsiCash account has a username and password. The PsiCash server is the ultimate arbiter of what’s allowed for the username and password. We use the PRECIS spec for what characters are allowed, which is a pretty wide set (which is important to this story).

We wanted to let the usernames and passwords be pretty long, but we needed to give them fixed limits. We’re allowing 200 bytes for the username and 800 bytes for the password. For the widest UTF-8 characters, that’s 50 code points and 200 code points, respectively. (For now let’s say that a “code point” is basically what you think of as a “character” or “letter”, except not always.)

So, the implementation of accounts is pretty far along at this point, and I’m writing up test cases, and I’m thinking, “I should double-check some of these tests…”

## The problems begin

### Too many bytes

So I put a ton of letters into the username and password fields to see what will happen. And the JS window.onerror handler catches this: “The data area passed to a system call is too small”, and the C++ side doesn’t get the URL message.

Googling for that error doesn’t help much. It can happen with Desktop Bridge applications running on SQL Server. There are hotfixes, or the Microsoft Management Console, that can help. None of those a) seem to apply, or b) are reasonable to ask our users to do.

I found that passing 2020 or fewer bytes was fine, but passing 2022 or more bytes would trigger that error. But passing exactly 2021 bytes… was even worse. It would open a browser tab with a URL that started like res://ieframe.dll/unknownprotocol.htm#psi:psicash?%7B%22command%22%3A%22login%22%2C%22id%22%3A%22MC4yOTc5MjI5MTY4ODU3MjI4%22%2C%22password%22%3A%2201234567890…. And there’s the user’s password in the browser address bar! (We’ll call this the “>2020 error”.)

(This is surely due to IE having a URL length limit of 2048 for GET requests. I didn’t think of that at the time, and there was probably some URL overhead I wasn’t counting. Anyway, it doesn’t change the problem for me.)

Throwing up obscure, non-actionable error messages is bad enough, but the browser tab thing is terrible. So this can’t be allowed to happen.

### Tangent: Unicode

If you’re not familiar with Unicode and its encodings, you might want to skip down and read the appendix about it. But here are some quick definitions of terms I’ll be using:

Unicode: The system of defining all the letters and characters and emoji and so on. Each Unicode entry has a 32-bit number assigned to it.
Code point: The 32-bit value that indicates a Unicode “character”.
UTF-8 and UTF-16: These are the common ways of actually encoding Unicode entries. UTF-8 uses between 1 and 4 single bytes, and UTF-16 uses 1 or 2 double bytes. JavaScript and Windows C++ (wchar_t) use UTF-16. Almost everything else uses UTF-8.
Code unit: These are the individual chunks of an encoding – the single bytes of UTF-8 or the double bytes of UTF-16.

So a single Unicode code point may be encoded by up to 4 UTF-8 code units (4 bytes total) or 2 UTF-16 code units (4 bytes total).
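To make those terms concrete, here is a small sketch that counts the same string three ways. It assumes a modern JavaScript engine with `TextEncoder` (so not the IE versions this story is about), and `measure` is just an illustrative name.

```javascript
// Count one string as code points, UTF-16 code units, and UTF-8 bytes.
function measure(s) {
  return {
    codePoints: Array.from(s).length,              // string iteration is by code point
    utf16Units: s.length,                          // JS strings are UTF-16 code units
    utf8Bytes: new TextEncoder().encode(s).length, // TextEncoder always emits UTF-8
  };
}

console.log(measure("A"));  // { codePoints: 1, utf16Units: 1, utf8Bytes: 1 }
console.log(measure("あ")); // { codePoints: 1, utf16Units: 1, utf8Bytes: 3 }
console.log(measure("𪘀")); // { codePoints: 1, utf16Units: 2, utf8Bytes: 4 }
```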

### Limit the input

I hadn’t been limiting the username and password input fields because it didn’t seem necessary, for reasons like:

It’s important that the login interface allow at least as many characters as the server will allow in the creation of an account.
The bytes vs graphemes distinction makes things a little murky.
If we decided to raise the limit on the server side, it’d be nice if the clients just worked.
If the user enters too many characters… Then they’re entering bad credentials, and that’s really up to them.

But allowing the user to hit the >2020 bytes error is unacceptable, so I needed to add input limiting. The input element’s maxlength attribute seemed like an easy-and-sufficient way to do that.

maxlength is supposed to limit by UTF-16 code units, and does so in modern browsers, which is kind of weird. That means that “🍕” and “𪘀” count as two towards the length. Probably not coincidentally, that’s how JS’s String.length works: "🍕".length === 2. There’s no mention on MDN or CanIUse of IE deviating from this (that I can find), but it does – maxlength counts code points, not code units. Which is great! It means I can set maxlength=50 and get the expected username limiting – no custom validators required.
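For engines where maxlength counts UTF-16 code units, limiting by code points needs a custom check. A minimal sketch follows, assuming a modern engine (`Array.from` is not available in old IE); `truncateCodePoints` is a hypothetical helper and not something from the Psiphon code.

```javascript
// Keep at most maxCodePoints code points, never splitting a surrogate pair.
function truncateCodePoints(s, maxCodePoints) {
  return Array.from(s).slice(0, maxCodePoints).join("");
}

const pizzas = "🍕🍕🍕";
console.log(pizzas.length);                 // 6 (UTF-16 code units)
console.log(truncateCodePoints(pizzas, 2)); // 🍕🍕
console.log(pizzas.slice(0, 3).length);     // 3: slicing by code unit cuts the second pizza in half
```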

But it would still be nice to provide more space for input, because of the reasons I had for not limiting in the first place. So let’s do a little math:

// We want to be sure we don't hit the ~2020 limit. Let's say 1900 is a safe maximum.
absolute_byte_limit = 1900
bytes_per_code_point = 4
allowed_code_points = absolute_byte_limit / bytes_per_code_point
==> 475

So we have something like a 475 code point allowance to split between username and password. Let’s say 75 for the username and 400 for the password (we won’t be staying here, so it doesn’t really matter).

### Simple change, quick test (famous last words)

So I use my numbered input of 0123456789, repeated, to fill the max lengths. Works as expected.

Then I try with a big long string of “𪘀” and hit the >2020 error. Ugh.

The limiter is working, so that’s not the problem.

Inspecting the incoming URL on the C++ side reveals the problem: I’m calling encodeURIComponent on the query parameters part (after the ?) of the URL. So each “𪘀” becomes “%F0%AA%98%80”. That means the 4 bytes of the UTF-8 (or UTF-16) encoding of the code point become 12 bytes – there are 3 one-byte characters per byte of UTF-8.

Let’s do the math again:

absolute_byte_limit = 1900
bytes_per_code_point = 4 * 3 // 4x UTF-8 code units, 3 bytes per code unit
allowed_code_points = absolute_byte_limit / bytes_per_code_point
==> 158.3

So… 158 allowed code points? But we need to allow at least 50+200 code points for the username+password. Time to advance to the next level of problem.
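The same arithmetic can be checked directly in a JavaScript console. This is only a sketch: the 1900 budget is the “safe maximum” picked above, and `percentEncodedLength` and `maxCodePoints` are illustrative names.

```javascript
// URL characters needed for one code point after encodeURIComponent.
function percentEncodedLength(ch) {
  return encodeURIComponent(ch).length;
}

// How many worst-case code points fit in the byte budget.
function maxCodePoints(byteLimit, bytesPerCodePoint) {
  return Math.floor(byteLimit / bytesPerCodePoint);
}

console.log(encodeURIComponent("𪘀"));                     // %F0%AA%98%80
console.log(percentEncodedLength("𪘀"));                   // 12
console.log(maxCodePoints(1900, percentEncodedLength("𪘀"))); // 158
```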

(Disclosure: I have the juice to change the username and password limits. But I don’t want to and this seems like a weak reason to do so.)

### The best encoding is no (or little) encoding?

The most obvious thing to try to alleviate the encoding bloat is to just remove encodeURIComponent. And it works fine. It appears that the code points are going through as UTF-16 binary – taking up the minimum possible bytes – and both the JS and C++ sides were happy.

Then I try a space in the password and it automatically gets encoded as %20 (the code point for the space character is U+0020 and so %20 is the URL-escaped UTF-8-encoded version of it).

That’s a bit of a wrinkle, but fine. I could put the URL-decode call back into the C++ code. Except… what if there happens to also be the percent-and-two-numbers sequence naturally occurring in the password? We’ll unintentionally be altering it. For example: If the password is x%41y, it would get URL-decoded to xAy.

After doing some research to satisfy myself that percent-encoding is the only thing going on in URLs, I decide that I only need to percent-encode the percent sign. So the password x y%20z becomes x%20y%2520z (%25 being the percent-encoded percent sign) in the URL. URL-decoding will reverse that value properly.
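A sketch of that round trip, with two loud assumptions: `browserAutoEscape` is a stand-in that only escapes spaces (the real web control may touch other characters), and `decodeURIComponent` stands in for the win32 URL decoding done on the C++ side.

```javascript
// Escape only the percent sign; everything else is left for the browser.
function escapePercent(s) {
  return s.replace(/%/g, "%25");
}

// Stand-in for the browser's automatic escaping (assumption: spaces only).
function browserAutoEscape(s) {
  return s.replace(/ /g, "%20");
}

const inUrl = browserAutoEscape(escapePercent("x y%20z"));
console.log(inUrl);                     // x%20y%2520z
console.log(decodeURIComponent(inUrl)); // x y%20z

// Without escaping the percent sign first, decoding corrupts the password.
console.log(decodeURIComponent("x%41y")); // xAy
```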

So, yay? We’re back to a reasonable number of bytes. Even a password of 50 percent signs (the only thing getting escaped) will still only bloat to 150 bytes.

### But then IE8 strikes

I do all my development on a Windows 10 machine, with IE 11. The installed IE version is what gets used for the web control in the app. IE 11 has a handy developer tool that lets you test in various even-older-IE modes. But it’s not always 100% accurate, so sometimes I fire up a Windows 7 VM with IE 8, 9, or 10 installed to test for real.

The simple-ASCII-characters-only test works fine in the Win7+IE8 VM. The test with the maximum count of “𪘀” does not.

Again, I inspect the URL coming into the backend (which is a bit harder now, because I don’t have a development environment in the VM). And it looks like IE8 is automatically encoding “𪘀” as \ud869\ude00 (the two-code-unit UTF-16 encoding). Which is again 12 bytes instead of 4.

I start getting pretty frustrated at this point.

### When all else fails, base64

So how do I get only ASCII characters in the URL, without bloating by a factor of 3x? How about base64-encoding? That gives us ASCII and a size increase of 33%, which is tolerable.

Let’s check the math:

absolute_byte_limit = 1900
bytes_per_code_point = 4 // 2x UTF-16 code units
base64_bloat = 1.33
allowed_code_points = absolute_byte_limit / bytes_per_code_point / base64_bloat
==> 357

That gives us an extra 100 code points to play with above our absolute limit of 50+200. Phew!

So, I change the encoding to be btoa(JSON.stringify(payload)), with appropriate decoding on the C++ side. Works as expected on Win10.
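For reference, here is a sketch of UTF-8-safe base64 in a modern engine. It is not the code from the app: `toBase64Utf8` and `base64Length` are hypothetical names, and it assumes `TextEncoder` and `btoa` exist. The UTF-8 step matters because `btoa` rejects any code unit above 0xFF. With the exact 4/3 ratio instead of 1.33, the worst-case budget comes out at 356 code points rather than 357, which makes no practical difference given the margin (and ignores the JSON overhead either way).

```javascript
// UTF-8 encode, then base64. The bytes are mapped to a "binary string"
// first because btoa() only accepts code units up to 0xFF.
function toBase64Utf8(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Base64 turns every 3 input bytes into 4 output characters (with padding).
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

console.log(toBase64Utf8("𪘀"));    // 8KqYgA==
console.log(base64Length(4 * 356)); // 1900
console.log(base64Length(4 * 357)); // 1904
```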

Doesn’t work at all on Win7+IE8. Oh right, forgot, there is no btoa in IE8. But we already have a polyfill for that, so I just swap it in and try again.

And, again, the URL is still too long. I’m losing my mind a little bit now.

### Re-polyfill JSON

After weeping a little and doing some MessageBox-ing and digging, I realize that the culprit now is IE8’s JSON.stringify.

Here’s IE9, IE10, IE11, and every other browser:

JSON.stringify("𪘀")
==> '"𪘀"'

And here’s IE8:

JSON.stringify("𪘀")
==> '"\ud869\ude00"'

Well that’s JUST GREAT.

I think for a few minutes about how to encode objects without using JSON, but that’s dumb. And then I remember that, until recently, we used a JSON polyfill because we still supported WinXP+IE7 (which doesn’t have JSON support). So I try out the polyfill code in the IE8 console and… it gives the desired output! Oh, thank goodness.

So I modify the polyfill code to always replace the native JSON and conditionally include it for IE8.
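The “conditionally” part could also be done by feature test instead of by IE version. The sketch below is not what the app does. `stringifyEscapingNonAscii` only imitates the IE8 output shown above so the size difference is visible, and `nativeJsonKeepsNonAscii` is a hypothetical check. It is written for a modern engine (arrow functions, `padStart`), so it would need translating to ES3 to actually run in IE8.

```javascript
// Imitates the IE8 output described above: every non-ASCII UTF-16 code unit
// becomes a \uXXXX escape. (For size comparison only; not IE8's real code.)
function stringifyEscapingNonAscii(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

// Feature test: does the native JSON.stringify leave an astral character alone?
function nativeJsonKeepsNonAscii() {
  return JSON.stringify("\ud869\ude00") === '"\ud869\ude00"';
}

console.log(JSON.stringify("𪘀").length);            // 4: two quotes plus two code units
console.log(stringifyEscapingNonAscii("𪘀"));        // "\ud869\ude00"
console.log(stringifyEscapingNonAscii("𪘀").length); // 14
```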

And test. And it works. Everywhere. For every input.

## Do you see the mistake I made?

And later I decide that maybe this story would be amusing or educational for someone, so I should write up a blog post. And as I’m writing this blog post I realize that I got something wrong. This:

it looks like IE8 is automatically encoding “𪘀” as \ud869\ude00.

Nope. It was JSON doing that, not “automatic encoding”. Having encountered the space–>%20 automatic encoding, I think I was primed to lazily attribute more unexpected behaviour to magic.

Now, with the JSON polyfill replacement, I could go back to just percent-encoding-percent and regain even more code point space in the URL.

I don’t think I will, though. I can’t shake the question: “Is there anything besides percent-encoding that InternetCanonicalizeUrl(ICU_DECODE) (the win32 URL decode function) will try to decode?” If there is, then some user’s password will be unusable, and it’ll be super hard to diagnose. At the bottom of every email we say, “Psiphon will never ask you for your password”, so we can’t possibly figure out what’s wrong with it!

Fuzzing might be able to find other cases? Or maybe there’s source code for InternetCanonicalizeUrl that I can inspect (and hope it’s the same across Win 7, 8, 8.1, and 10)? But I already spent way too long on this and I can’t spend any more. Time to move on.

## This is an edited rendition

This may read like a logical progression of problems, investigations, and (attempted) solutions, but it was so much messier than that. This was my primary task for two days (not my only task, but still).

It was a painful cycle of:

Think everything works.
Test, expecting success.
Get weird results.
Debug, usually in VMs with the alert-and-MessageBox version of printfs.
Search for explanations. Read MSDN pages, Wikipedia pages, and anything else that might make it make sense.
Think of what to do to avoid the problem. Try stuff. Make it nominally work on Win10+IE11.
Repeat.

(You know what didn’t help? Running out of disk space while trying to work with VMs.)

There was also much, much more profanity than I have allowed here.

## Appendix: Unicode stuff, as I know it

I am not a Unicode pro, and quite a bit of what I know I learned during this work. I’ll give a quick-and-dirty description so we can be on the same page.

So, Unicode is a big list of, like, letters and characters and stuff. Each one gets a 32-bit number assigned to it (although only 21 bits are actually needed, since the highest code point is U+10FFFF). “A” is 0x00000041 (65 in decimal), “あ” is 0x00003042 (12354), “🍕” is 0x0001F355 (127829), “𪘀” is 0x0002A600 (173568). These numbers are often written like U+0041, without so many leading zeros.

Those 32-bit Unicode numbers are “code points”. Some represent “graphemes” (rendered entities), but some are accents and whatnot that are to be combined with other code points.

(Tangent within a tangent: Some things that you might think of as a single “character”, like the “keycap digit one” emoji “1️⃣”, are actually “grapheme clusters”. In the case of “1️⃣”, it’s actually a combination of three code points: the usual ASCII number “1”, the “VARIATION SELECTOR-16” (U+FE0F), and the “COMBINING ENCLOSING KEYCAP” (U+20E3).)

So, think of “Unicode” as an abstract list of code points. Then we need to actually encode those code points.

The most direct approach is called UTF-32. It uses 32 bits to encode the 32-bit code point. But the vast majority of code points in common use don’t need all 32 bits, so this is a pretty inefficient encoding.

The most common encoding is UTF-8. It uses between one and four 8-bit “code units” to encode a code point. It has the very nice property of encoding English letters, numbers, and punctuation exactly the same as they are in ASCII. If you’re encoding text, use UTF-8.

But JavaScript and Windows use UTF-16. It uses one or two 16-bit code units to encode a code point. Some quick Wikipedia reading suggests that, once upon a time, 16 bits were thought to be enough for Unicode code points. JavaScript and Windows probably adopted that early (UCS-2) spec, and then were stuck with it for legacy reasons. UTF-16 is compatible with that old spec, and here we are.
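To make the difference between those encodings concrete, here is a small sketch that measures the example characters from this appendix. `utf8Length` and `measure` are illustrative names, and the UTF-8 byte count is computed from the standard code point ranges instead of calling an encoder.

```javascript
// Number of UTF-8 bytes needed for one code point (standard ranges).
function utf8Length(codePoint) {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

function measure(s) {
  var codePoints = Array.from(s); // iterates by code point, not by code unit
  var utf8Bytes = 0;
  codePoints.forEach(function (ch) {
    utf8Bytes += utf8Length(ch.codePointAt(0));
  });
  return {
    codePoints: codePoints.length,
    utf16Units: s.length,           // JavaScript's String.length
    utf8Bytes: utf8Bytes,
    utf32Bytes: codePoints.length * 4
  };
}

// The last entry is the keycap emoji: "1" + U+FE0F + U+20E3.
["A", "あ", "🍕", "𪘀", "1\uFE0F\u20E3"].forEach(function (s) {
  console.log(s, JSON.stringify(measure(s)));
});
```

“🍕” and “𪘀” each come out as 1 code point, 2 UTF-16 code units, and 4 UTF-8 bytes, while the keycap emoji is 3 code points even though it renders as one thing.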

The short happy life of the Breached extension

Mon, 27 May 2019 00:00:00 +0000

In October 2017, Troy Hunt of Have I Been Pwned held a contest inviting people to do something cool with the HIBP API. I decided a) that I would kind of like the special edition ThinkPad he was giving away, and b) that I could probably whip something up pretty quickly.

I decided to create a browser extension that would simply pull HIBP breach information and show a browser notification – with the ability to view extra info – when the user visited a site that had been breached. And so was born the Breached extension. (Spoiler: I didn’t win.)

For non-technical readers: A “breach”, in this context, is when a hacker obtains the user database of a website. A breach generally includes email addresses, passwords (in some form), maybe credit cards, and other stuff you don’t want a hacker to have. So “breach information” about a website tells you that a breach occurred, when it happened, what data was stolen, and how much of it. Which is the kind of thing you should know about before using that website!

A month after I released the extension, someone created an issue pointing out that Mozilla “started working on integrating haveibeenpwned.com warnings into Firefox”. (As I promised there, I did (nominally) reach out, but it didn’t go anywhere.) A year later that project turned into Firefox Monitor.

So, Breached is basically redundant on Firefox. In a sense I’m bummed, but I’m more flattered – it was obviously a pretty okay idea!

Monitor has a different notification policy: They only show alerts for sites that have been breached within the last two months. (If I’m reading that right. I think the 12-month part is a one-off and is likely just to increase the number of people that will ever see a notification). Breached’s policy is… just, like, show them all. Because I didn’t think about that while coding it. I think Monitor’s time-limit is better, since many sites will have sorted themselves out and don’t deserve a black mark for all time, so I might also add a time limit. (I don’t find the notifications noisy, but maybe I just don’t visit enough shady sites.)
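If I do add a time limit, it could be as simple as the sketch below. This is a hypothetical helper, not code from Breached or Firefox Monitor; it assumes each breach object carries a `BreachDate` ISO date string as in the HIBP breach model, and the two-month window is only my reading of Monitor's policy.

```javascript
// Hypothetical helper: decide whether a breach is recent enough to notify about.
// `breach.BreachDate` is assumed to be an ISO date string like "2019-04-15".
function isRecentBreach(breach, now, maxAgeMonths) {
  var cutoff = new Date(now.getTime());
  cutoff.setMonth(cutoff.getMonth() - maxAgeMonths);
  return new Date(breach.BreachDate) >= cutoff;
}

var now = new Date("2019-05-27T00:00:00Z");
console.log(isRecentBreach({ BreachDate: "2019-04-15" }, now, 2)); // true
console.log(isRecentBreach({ BreachDate: "2017-10-01" }, now, 2)); // false
```

Breached's current behaviour is equivalent to skipping this check entirely.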

In case you’re wondering what Monitor notifications look like, here’s one:

And here is Breached’s notification:

And the additional-detail popup:

Well, at least Breached is still relevant on Chrome! For now…

First Post: someone's last post

Thu, 05 Jan 2017 00:00:00 +0000

I often come across digital and physical interface designs that are either great (coherent) or terrible (incoherent). For some mysterious reason I’ve been wanting to capture such interfaces and share them.

To start, a curious headstone.

The goal of the design of a headstone is to be read by the living. For a few decades some of the people who read it will have known the deceased, but many won’t, and after a while no one will have personally known them. The design of this headstone ensures that it will get more reading-attention than an ordinary headstone. I certainly stood in front of it longer than any other, while figuring out how to decipher and read it.

It’s just the right degree of indecipherable. You can instantly tell that there’s something there to read and that it probably won’t require you to bust out pen and paper to do frequency analysis or ROT-13. It’s just hard enough to keep you standing there for a few minutes, and no harder.

Markdown Here: Splitting the Firefox and Thunderbird Extension

Thu, 14 Jul 2016 00:00:00 +0000

[This started as notes to myself to help clarify the problem and solution. It’s probably more suited to a GitHub issue than a blog post, and it may get copied into one.]

# The story so far

The Firefox and Thunderbird versions of Markdown Here both used nearly the same code – an old-style XUL extension. Tb is only capable of using a XUL extension, while Fx supports at least three extension types: XUL-based, Add-on SDK (aka Jetpack, aka jpm), and WebExtensions. WebExtensions is the newest, and is essentially an implementation of Chrome’s extension API.

I’ve kept my eye on WebExtensions because it’s tempting to be able to use identical code across the many major browsers: Chrome, Firefox, Opera, and Edge(?). It seemed premature to do any real work towards using it as it’s not yet fully released, and there would be no perceived benefit to users (it’s a more-locked-down extension API, so there would be an imperceptible security benefit).

But… Firefox’s multi-process Electrolysis (E10s) update is nearing release, and it breaks MDH. There are two ways to go about fixing this:

Figure out what’s wrong in the XUL extension and correct it.
Switch to WebExtensions.

I tried to figure out why E10s is breaking the XUL extension, but without any success. Maybe I could figure it out if I tried longer/harder? Or maybe not.

I did some testing (and fixing) to make WebExtensions work, and it seems like the best option.

# The big problem: Splitting up Fx and Tb

Until now, the Firefox and Thunderbird extensions were literally the same extension – I upload a single file to Mozilla and check off the “Firefox” and “Thunderbird” boxes. There is only one extension ID for both. But Thunderbird only supports XUL extensions, so it’ll probably be using the XUL version forever.

We clearly have a problem: Fx and Tb will have to use fundamentally different extensions, but right now there’s only one extension for both. We’re going to have to split the userbase. And it’s going to be ugly.

Specifically, it’s going to be very ugly for either the Firefox users or the Thunderbird users. One platform will receive a message saying, “If you want Markdown Here to keep working for you, uninstall the one you have and go install this other one.” (The other platform will see no difference.) Ugh. That’s going to cost us some users for sure.

According to the Mozilla stats for MDH, there are two-thirds as many Thunderbird daily users of MDH as Firefox daily users. That makes it somewhat preferable to make life difficult for Tb users rather than Fx users.

However, I think it’s Firefox users who will have to jump through hoops. The XUL extension will still work in Firefox to a sufficient degree to show a message to the user. The WebExtensions extension will not work at all in Thunderbird. So existing Thunderbird users simply cannot be given the WebExtensions version, therefore the WebExtensions version must be the new, separate extension. Therefore it’s the Firefox users who must suffer.

(A couple of caveats: 1. I have asked in the Mozilla extension dev forum for suggestions for how to do this gracefully; so far I’ve received one reply, but it’s even more painful. 2. I have a terrible feeling that there actually is a graceful way to do this and I’m just not seeing it.)

# The plan, such as it is

Get the WebExtensions work done. (Most of the effort now is going to be building the separate versions in a sane way.)
Release the brand new WebExtensions version, with a separate ID. (And update the website to point to it, etc.)
Release a new XUL version that does nothing but show Firefox users a message telling them what they need to do to upgrade. (And explaining and apologizing profusely.)
Watch how many users are lost.

The E10s rollout is supposed to be gradual, so I plan on continuing to support Firefox in the XUL version for the time being, and continuing to show the switch-message when Firefox is detected.

Outstanding question: Continue using Addons.Mozilla.Org (aka AMO, aka the main Firefox extension site) or self-host. I’ve gone through ridiculous hassle getting MDH approved in the past, and I don’t want to do it again. Ever. (This is exacerbated by the fact that I’m planning on replacing the Markdown rendering library, and I fear push-back from the AMO reviewers for using a lot of third-party code that’s not pre-approved.) The downside of self-hosting is that MDH won’t show up in AMO searches, which is surely where a lot of users go when looking for a Firefox extension (versus googling).

(The current Firefox and Thunderbird extension is also used by Pale Moon, which is a Firefox fork, and Postbox and Icedove, which are Thunderbird forks. There aren’t enough users of them to change any of the rationale, and it looks like Pale Moon won’t be using E10s. So there’s Firefox, and then everything else.)

Android Non-Vulnerability: Steal a Device and Keep it Unlocked

Wed, 11 May 2016 00:00:00 +0000

While poking around in my Android phone’s developer options, I realized that if you steal a phone that’s currently unlocked because it’s in a “trusted place”, then you can force it to remain unlocked forever. (And then I got schooled about that not being a problem.)

## Security Feature: Smart Lock with Trusted Places

Android’s Smart Lock allows users to configure conditions under which to keep the phone unlocked. One of the conditions is location – you can set trusted locations where your phone shouldn’t prompt for a PIN/pattern/password when unlocking.

This is a pretty great feature. It’s difficult to convince people that the security gained by using a PIN outweighs the inconvenience of constantly entering it. Smart Lock helps mitigate the inconvenience by not requiring the user to constantly enter the PIN at home or at the office.

## Developer Feature: Mock Location

If you’re developing a location-aware app, you might want to trick the phone (and your app) into believing that it’s somewhere you’re not (in a restaurant, near a bus stop, etc.). The Android developer options provide the ability to set a “mock location app”. This is a separate app that allows you to configure your desired fake location. When the app is set as the mock location app, the phone pretends to be in the location specified by the app.

## Trusted Place + Mock Location = Perma-Unlock

So if Eve steals Alice’s phone from her desk at work, and wants to keep it unlocked until she has more time to peruse it, she can do this:

1. Install a mock location app. Set it to the current location.
2. Enable Developer Options.
3. Select the mock location app in the developer options.

And then Eve walks away, confident that Alice’s phone will remain unlocked.

Note that none of those steps requires Eve to type in the phone’s PIN. If any of them did, this attack would be nullified. (Adding a Smart Lock trusted location does require a PIN, but fooling the phone into thinking that it’s always in the current trusted location doesn’t.)

My recommended solution to the Android team: Require a PIN at step 3. If Alice is a developer, there might already be a mock location app installed and the phone will probably already have the developer options enabled. Step 3 seems like the best intervention point.

I also don’t like that Smart Lock (and Android Device Manager) respect the mock location. It seems to me that they should be “above” that.

### Bonus attack: Android Device Manager

Alice realizes that her phone is gone! She jumps on her computer and checks Android Device Manager! Except… it reports that the phone is still at the office, because it also uses the mock location being reported by the phone.

Hopefully Alice will do a remote lock (or wipe) anyway, or maybe the location confusion slows her down for a while.

## Vulnerability Disclosure

I filed a security issue with the Android team (#204776, but it’s not publicly visible). The response was that it is “working as intended”.

Once someone has access to an unlocked phone, they are able to do anything with it (attempt to root the device, install other malware, etc).

We appreciate the report but this is working as intended.

I see what they’re saying. In theory, the attacker could enable app side-loading, and then install some kind of data-snarfer service, and then give it sufficient permission to exfiltrate everything it can access. The data-snarfer could run even while the phone is locked.

Or the attacker could just keep touching the screen to keep it unlocked.

(Rooting typically requires a bootloader unlock, which wipes the device. But that’s irrelevant if rooting isn’t necessary to effect an equivalent attack.)

I’d really like to thank the Android team for taking the time to reply to my not-super-exciting bug report.

## Grand Conclusion

There isn’t one. The “Trusted Place + Mock Location” combo isn’t a real problem – it’s just a little distasteful. If someone steals your phone and it’s unlocked, you’d better hope they’re only after the hardware.

This is especially distressing for tablets. Unlike our phone, most of us don’t carry our tablet everywhere, so it’s much more likely to be stolen from our home. Entering your PIN on your tablet every time is almost as annoying as on your phone, so Smart Lock seems like a good choice. And that means that it’s almost certain that your tablet will be stolen unlocked. (If it gets stolen. I have no idea what the likelihood of that is.)

## Update

In Android O (a year after I wrote this) Google added a PIN prompt when enabling developer options. So, that negates this problem.

Unless you’re already a developer…

Why and How to Use a Contributor License Agreement

Tue, 17 Feb 2015 00:00:00 +0000

## Background and Motivation

I received a pull request for Markdown Here that was great: it found a bug, fixed it, and included tests for the fix. However, the PR submitter didn’t write the tests using the existing framework, so I figured I’d massage his test code into the proper form.

And then I noticed that he included a copyright line in the test file. It says “MIT License”, which is the license used for the rest of the project, but that got me thinking about what that might mean…

Wikipedia suggests that the MIT License would require me to include his copyright+license notice wherever I use his code. Not a big deal, but annoying. And maybe a slippery slope – what if I get a bunch more code submissions?

So I did some research into “Contributor License Agreements” and found that there are a couple more things to be concerned about:

If you ever want to change the project license, you have to get the agreement of all contributors. That includes dual-licensing. Good info about that (and CLAs in general), with specific KDE example: “In Defense of License Agreements”.

Patent something something. Contributors retain the patent rights unless explicitly granted in the CLA.

And just to be clear: Contributors have the copyright on their code by default, regardless of whether they put the © notice on it. I’m not sure about the license.

So I decided to require MDH contributors to “sign” a CLA. Seemed kind of irresponsible not to.

## Picking a CLA

Spending your weekend reading CLAs is a drag.

I decided to use Harmony Agreements to generate the agreement. (I chose “any license” for the “outbound license option”.) The agreement it provides seems pretty good and pretty standard.

(Well… when I first generated an agreement I chose the “copyright assignment” version instead of “copyright license”. I even committed it and got the pull-request submitter to sign it. But then I re-read it and realized it was a) not very standard, b) maybe not enforceable, and c) kind of heinous. So I changed to the “copyright license” form. The difference is something like “you’re transferring absolute power to me and you lose the ability to use your own code” versus “you’re letting me do whatever I want with your code but it’s still yours” – exclusive vs. non-exclusive license, kind of thing.)

## “Signing”

Ugh.

Some projects – like Apache – require you to print out the agreement, sign it, and mail, fax, or scan-and-email it back to them. Another one (I forget which) uses some Adobe e-signing plugin where you draw your signature on the screen. Google requires you to be signed in, but it’s just a button press; ditto Twitter (signed in with Twitter, I mean). CLAHub requires a GitHub sign-in (with optional typing of “I AGREE”). I think I also saw some projects that just require a filled-in form.

CLAHub is really cool. It gives you a nice link for your CLA, collects agreements, and runs a bot that watches your project’s PRs, checks against the CLAs, and comments on the PR as to whether the PR-user has agreed yet or not. Except… there’s a notice saying it’s not ready for prime-time, the bot is broken, and the blog hasn’t been updated in a year. Also, entrusting a (flaky?) third party with the agreements seems dangerous (although they can be downloaded, so blah).

The article I linked above mentions “some [projects] collect agreements through a simple web form (Google Doc is a fine choice)”. So I created a Google Form with the CLA and a form for the contributor to provide contact info and indicate agreement. But…

Then I started thinking about non-repudiability. So I did a little reading about electronic signatures (not to be confused with digital signatures). Which is a horrible rabbit hole. Anyway, it made me seriously doubt that a row in a Google Spreadsheet with someone’s maybe-correct address and the words “I AGREE” really constitutes a legal signature. (Even in the Google and Twitter cases – how would they prove that they didn’t just fiddle some bits to make it look like I signed?)

And if you’re going to do this annoying CLA crap then you might as well hope that it means something, right?

Then I found Medium’s open source project. It requires contributors to commit a file along with their pull request stating that they agree to the CLA. That seemed… totally reasonable. The agreement is in-band with the code. In a way that I can’t manipulate without invalidating. And much, much less annoying for the contributor than printing/signing/faxing.

So that’s how I did it. See MDH’s CONTRIBUTING.md. (That’s the filename to use.)

## Backlash warning

Doing the CLA read-and-agree dance is more effort than not doing it. And some people are offended at the idea of doing it (see the tweet at the top of the “In Defense Of” post). So, it seems unavoidable that a project with a CLA will get fewer contributions than one without – some people just won’t get past that hurdle.

But it still seems necessary.

[Note: This is actually an email I wrote to my co-workers after going through this CLA exercise with Markdown Here. That’s why the tone is a bit informal and “blah” is used as if it means something.]

Test post: Markdown Here in Disqus

Fri, 18 Oct 2013 00:00:00 +0000

This is just a stub test post to allow me to try out Markdown Here in Disqus comments.

Right now MDH won’t work with Disqus in Chrome because of cross-origin restrictions. See: https://github.com/adam-p/markdown-here/issues/124

Update: The Disqus edit box is contenteditable, and MDH will render in it, but all formatting seems to get stripped out when you actually post the comment. Seems like the rich-edit-ness is probably just to support Disqus’s add-an-image feature.

Final update: I’m no longer using Disqus, so there’s nothing here to try. Sorry!

Safari Extensions Gallery: half-baked

Tue, 25 Jun 2013 00:00:00 +0000

Trying to get Markdown Here listed in the Safari Extensions Gallery is by far the worst browser extension “store” experience I’ve had so far. Shockingly bad.

## No hosting

First of all, but least of all: There’s no hosting. Unlike the Chrome and Mozilla stores, the Safari store doesn’t host the extension for you – it’s really more of a listing of links to wherever you host your extension files. That’s not terrible, but:

It’s costing me a little bit of money each month to host them.
I don’t get nice install/usage stats like I do with Chrome and Mozilla.

## No communication

Submitting the extension was basically the same as everywhere else. But this is the confirmation email:

Dear Adam Pritchard, Thank you for submitting your Safari Extension. Apple reviews all submissions and reserves the right to omit, edit, or reject any submission. Please note you will not receive any further notifications. We appreciate your interest in Safari. Sincerely, Apple Developer

(From noreply@adc.apple.com)

So… You’re going to review my extension, and I’ll never know if it’s been accepted or rejected or what? And there’s no mechanism for me to get an update?

Sure enough, 5 weeks have gone by now. There’ve been no status update emails, and there’s no review status info anywhere on the developer website (that I can find). And I’m not the only one – there are other people on the forums in the same boat.

To be clear, this is not at all what the Chrome and Mozilla extension approval process is like. I’ve had my fair share of problems with the Mozilla approval process, but I had a queue number, an IRC channel, and reviewers I could communicate with. (The Chrome store has no apparent approval process, because I’m using the standard API. Which is similar to the Safari extension API I’m using. So… why is there any non-negligible review at all?)

## No search

To top it off, the Safari Extensions Gallery itself… has no search?!? Let’s pick a not-front-page extension at random… how about the “Entertainment” category and then the Turboglue extension. (Sorry, there’s no way to give you a link to that!) Now try to find it some other way. The search box in the upper-right of that page? “No results were found.” And… I can’t find another search box. (Unlike Firefox, there’s no in-browser extension search.) How about a Google site search? That wouldn’t really be an acceptable work-around even if it worked… but it doesn’t anyway.

(Maybe related: Here’s a forum post by a dev who has had his/her extension approved, but can’t actually find it in the Gallery.)

I’m not sure how this could be worse. Unfriendly to developers and unfriendly to users. And it’s not like Safari extensions are new – they’ve been around about the same amount of time as Chrome extensions.

(Screen-grabbed from the bottom of this page. Age of the Gallery derived from this article.)

## Sweet lemons

Safari extensions are clearly not Apple’s primary concern, and maybe that’s okay. It’s not a hardcore-extension-geek browser like Firefox; their browser isn’t also an OS, unlike Chrome; and they have guaranteed, bundled-with-OS market share, like Internet Explorer (I don’t know much about IE’s extension support, but it’s clearly not as robust as Firefox and Chrome). And they do give me a way to provide a Safari extension to my users and update it automatically.

## Bonus whinging

It’s necessary to generate and register a signing certificate before you can even start to develop a Safari extension (IIRC – but definitely before publishing). There’s no such stumbling block in Firefox and Chrome. Maybe this is due to the absence of hosting? I’m not entirely sure how painful it’s going to be to set up a new OS X development machine.

Check out the ghetto method of opening the Markdown Here options page in Safari. It’s probably not just my dumbness, since I stole the approach from AdBlock.

## Update: Accepted 6 months later

Two days ago (2013-11-04) I received an email indicating that Markdown Here had been accepted into the Safari Extensions Gallery. Phew!

I think the acceptance makes the whole thing even weirder. If they were just sending submissions directly to the trash I could understand, but… Complete silence and then acceptance after 6 months? Are they backed up half a year? Is my extension so complex that it took them that long to review? Wha?

No One Knows to Click on a Page Action

Sun, 24 Mar 2013 00:00:00 +0000

Page actions – the buttons in a browser’s address bar – are a surprising UI failure.

When adding a button for a browser extension, a choice must be made whether to make it a “page action” or a “browser action” (button on the toolbar). But browsers have failed to communicate the interactiveness of page actions, and almost no one – techy or layman – realizes that they’re clickable.

To complement the context menu item and hotkey, and to fulfil a user feature request, I decided to add a button to the Markdown Here browser extension. It turned out that simply deciding where to put the button was a big part of the effort…

## Page Action vs. Browser Action

I’m going to use the Chrome extension development terminology:

Page actions are the buttons and status indicators located in the address/omni/awesome bar. (See pageAction API info.)
Browser actions are buttons on the browser toolbar. (See browserAction API info.)

In the screenshot above you can see the two styles co-existing in Firefox, which suggests there’s no real implementation decision to make – just provide both, and let the user decide which style they like. That’s true in Firefox (although there’s still the lesser decision of whether or not to add the toolbar button by default), but in Chrome you can either have a page action or a browser action, not both.

The choice initially seemed pretty obvious: use a page action. From Chrome’s documentation for browser actions:

Don’t use browser actions for features that make sense for only a few pages. Use page actions instead.

Markdown Here’s button is only applicable to some rich-edit compose elements (email, mostly), so that admonition seems to apply pretty directly. Like many people, I don’t like occasional-use buttons cluttering up my toolbar, so I initially implemented the button as a page action.

## Apparently Imperceptible Affordance

…And then I showed the cool new button to my significant other, who said something along the lines of “I can click that?” Which is a pretty damning statement, for a button.

I must admit that I had some suspicions about the obviousness of page actions’ clickability. I’m fairly sure it took me a while to realize I could click them, and I’m a) pretty technically savvy, and b) pretty hover-over-everything-that-looks-interesting curious. But what if a user is not both of those things…?

So I asked around. I asked in the Markdown Here Google Group, the UX StackExchange, and on Google+. These are the sorts of responses I got:

“This [is] purely anecdotal, but I work in the web industry, and use [C]hrome everyday, and didn’t realise the page actions were clickable. I agree with you that they look more like signifiers than they do clickable buttons.”
“But I agree that they don’t function well as buttons, perhaps this is by the design of the icon (not “raising” the element to give it depth).”
“pageAction in the abstract is a great idea, but I always find its use a little jarring. And I agree it’s not button-like at all, more just informational.”

(Yes, there were some people who knew that page actions are clickable. But the fact that many computer/tech/web/UX-savvy people didn’t know is the more significant observation.)

I also asked around among people at the office (coders) and among non-programmer friends, and the vast majority of both groups didn’t know they could interact with page actions. At best they thought of them as status indicators, and at worst they couldn’t remember ever having noticed them before. Ugh.

### Missing Cues

It’s hard to blame users for this lack of affordance recognition. At least, not yet.

Page actions do not display any of the typical this-is-a-clickable-thing traits. For the most part, page actions:

are not raised or underlined, like a standard button or a link, so most people won’t hover over them, but even if the user does hover, page actions…
do not change at all when hovered over – no outline, no colour change, no raise-up, no clicky-hand mouse cursor.

Some page actions have a verb-based tooltip if you hover long enough. Some. If. Long enough.

It’s a little shocking how poorly the interactiveness is communicated to the user.

### Maybe our future selves will get it?

Above I coyly dropped “At least, not yet.” There is a trend in UI design toward everything on-screen being interactive unless explicitly disabled-looking. Windows 8 has gone this way, as have Chrome and, to a slightly lesser extent, Firefox. There’s very, very little text or window chrome that’s non-interactive.

But even if you accept the “everything is interactive” ideal, page actions are still different than most other elements, since there’s no hover effect. And page actions are further hampered by the minimalistic design aesthetic that Chrome and Firefox seem to have adopted for them – a monochrome outline icon that can easily be read as disabled.

Maybe once users have fully embraced/internalized the idea that there are no extraneous UI elements, they won’t need hover effects and raised borders. Maybe there’ll be a great awakening to the utility of page actions. But until then…

## How to rescue page actions

Page actions need to look less like small, monochrome, passive, static icons. They need some standard button cues, both initially and on hover; they should employ one or more of: raisèd-ness, colour, border, more visual strength.

(I suspect that even the Chrome-style toolbar buttons – like the three-line settings button – are also below most laypeople’s threshold to recognize the click affordance. I’ve seen that in action in my own family-tech-support experience. Those buttons also lack most historical click cues. But let’s tilt at one windmill at a time…)

### Tangent: Chrome needs to allow both page and browser actions

Finally, Chrome should allow extensions to provide both page actions and browser actions.

In the screenshot at the top of this post, you can see that Pocket’s Firefox extension uses both button styles: the page action is for saving the current page, while the browser action is for showing your saved pages. Similarly for the bookmarks buttons: page action for bookmarking the page, browser action for viewing bookmarks.

(Markdown Here also has a button in each place, but it’s not as compelling a use case, since it’s just a convenience to work around the page action affordance opacity. Both buttons toggle Markdown rendering; the page action only shows when focus is in a valid target; you can hide the toolbar button if you’re one of the few page-action-savvy users. But, still, I wish I could provide the same flexibility to my Chrome users that I do to my Firefox users.)

In Chrome, Pocket only has a browser action (which, oddly enough, acts only like its Firefox page action), and bookmarks only have a page action (and a whole toolbar). I can’t think of any reason for Chrome to prevent extensions from providing both, and there are certainly good use cases for allowing them.

## So it’s back to a browser action

I finally switched the Markdown Here toggle button in Chrome to be a browser action. Even though it clearly, spiritually, should be a page action, I just can’t ignore the fact that most users will not recognize it as clickable in that form.

I have had one complaint about the button location, but the user seemed satisfied that I made the rational choice after I explained it.

## Update 2017-10-28

Last year Chrome turned page actions into browser actions. The description of the change suggests that this was a security decision. Evil stealth extensions were being installed, so now all extensions have to have toolbar buttons to expose them to users. (FWIW, I don’t think this is a good solution to the problem. I’m sure the majority of users don’t pay much attention to random toolbar buttons.) It makes little sense for page action-centric extensions to have both a sometimes-visible address bar button and an always-visible toolbar button, so the Chrome team did away with the address bar button.

Firefox hasn’t made a similar change. You’d think that if there’s a solid security rationale for one browser, it’d apply to all browsers.

This makes the documentation for pageAction pretty confusing. Check out Chrome’s versus Mozilla’s documentation. Chrome’s documentation is kind of incoherent now. There seems to be no reason at all to use page actions, but the doc still suggests that you do.

I felt that page actions were kind of unusable before, but… this is worse?

The reason I was looking at this again is that I created another extension, called Breached, where a page action would, in theory, make the most sense. The button is enabled/shown when the user visits a site that has suffered a breach in the past, exposing user accounts. This is a pretty rare thing (for many people… who don’t use Yahoo Mail), so dedicating a perma-visible toolbar button to it is pretty wasteful. Still, I initially went with a browser action for the reasons given in this post. I even added a notification (shown once per site), so the user could hide the toolbar button and still notice when they visit a breached site.

Then a reviewer requested that it be a page action. I thought about it again and realized that the notification also helps with the “no one knows to click on a page action” problem by telling the user they can click on it. And the always-visible button still bugged me (and, let’s face it, most users won’t know how to hide buttons). So I changed it to a page action. And… discovered that there’s no visible difference in Chrome. Except page actions don’t support badge text, which I was using to show the number of breaches.

Anyway, I’m leaving Breached as a page action, for the sake of Firefox users.

Postscript: First blog post ever! Yay! Thanks to Casey Watts for suggesting that I write it.
