Add tests for common TextDecoder implementation mistakes #56892

ChALkeR · 2025-12-20T21:22:25Z

This testsuite collects cases where at least one implementation fails.

This is an export of https://github.com/ExodusOSS/bytes/blob/9c8c9baa/tests/encoding/mistakes.test.js

It demonstrates bugs in all major implementations: not a single one of the existing implementations was correct for all encodings.
In total, 10 different implementations were tested, plus my own implementation in JS.

Important

All three major browser engines got UTF-8 decoding wrong.

#56799 and #56844 were tiny parts of this.

There is no testcase on which all implementations agree on an incorrect behavior: for every testcase, at least one implementation is consistent with the spec.
Perhaps the most controversial part of this is the test for whatwg/encoding#115, but Firefox, Deno, and Servo agree with the spec, and the spec is clear on what should happen and documents that specific case. Also, WebKit and Chrome now treat concatenated input as invalid too, but perform replacement on it incorrectly.

On some tests,

For data, see https://docs.google.com/spreadsheets/d/1pdEefRG6r9fZy61WHGz0TKSt8cO4ISWqlpBN5KntIvQ/edit
Legend:

FAIL means returning incorrect results on some inputs, i.e. not matching the spec on a simple .decode() call.
STATE is more serious in some cases: it means an internal inconsistency where the result of a non-streaming .decode() call depends on the previous call, i.e. state leaks between calls.
STREAM can also be more serious in some cases and demonstrates an internal inconsistency:
The decoding result depends on the shape of the buffers: decode(a, { stream }) + decode(b) !== decode(a + b)
With Dynamic Record Sizing, this could potentially be controlled by a MitM and cause different decoding results for the same bytes, all without affecting TLS (as it verifies bytes).
Demo for utf-8 mishandling on WebKit and Chrome: https://tmp-demo.rray.org/utf-8 (small static html over https)
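
The STREAM and STATE conditions from the legend can be checked mechanically. The following is a minimal sketch, not code from the PR: `findStreamMismatches`, `hasStateLeak`, and the `makeDecoder` parameter are hypothetical names introduced here for illustration. It assumes a global `TextDecoder` and `Uint8Array` inputs.

```javascript
// Hypothetical helpers (not part of the PR) that flag STREAM and STATE style
// inconsistencies for one encoding and one byte sequence.
// `makeDecoder` is an assumption: it lets a different implementation be plugged in.
const defaultDecoder = (encoding) => new TextDecoder(encoding);

// STREAM: decode(a, { stream: true }) + decode(b) must equal decode(a + b)
// for every way of splitting the input into a prefix and a suffix.
// Returns the split offsets at which the two results differ.
function findStreamMismatches(encoding, bytes, makeDecoder = defaultDecoder) {
  const expected = makeDecoder(encoding).decode(bytes);
  const mismatches = [];
  for (let split = 0; split <= bytes.length; split++) {
    const decoder = makeDecoder(encoding);
    const head = decoder.decode(bytes.subarray(0, split), { stream: true });
    const tail = decoder.decode(bytes.subarray(split)); // final call flushes
    if (head + tail !== expected) mismatches.push(split);
  }
  return mismatches;
}

// STATE: a non-streaming decode() call must not depend on an earlier
// non-streaming call made on the same decoder instance.
function hasStateLeak(encoding, first, second, makeDecoder = defaultDecoder) {
  const fresh = makeDecoder(encoding).decode(second);
  const reused = makeDecoder(encoding);
  reused.decode(first); // non-streaming: should leave no state behind
  return reused.decode(second) !== fresh;
}

// Usage: "a€b" in UTF-8, with the first call ending inside the 3-byte sequence.
const sample = new Uint8Array([0x61, 0xe2, 0x82, 0xac, 0x62]);
console.log(findStreamMismatches('utf-8', sample));
console.log(hasStateLeak('utf-8', sample.subarray(0, 2), sample));
```

An empty array from the first helper and `false` from the second mean that no inconsistency was observed for that particular input. Neither result proves that the single-shot output itself matches the spec, which is what FAIL covers.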

Due to the nature of cross-tests, implementations not behaving correctly per spec on single-shot new TextDecoder(encoding).decode(arg) calls were not fully tested for STATE / STREAM, so FAIL takes priority (even in cases where it's less significant).

The rest is explained in the table and the testcases.

Bug refs, found with these cross-tests:

For a clean implementation in js, see @exodus/bytes/encoding.js

annevk

This is amazing work. Thanks so much!

I have two questions:

I guess the lack of semicolons was maybe copied from existing tests? It would be preferable to have them, but it's not required.
Can we perhaps split this file by encoding type? E.g., UTF-8, UTF-16, single-byte encodings, and multi-byte encodings?

annevk · 2025-12-21T07:26:44Z

+// Bun is incorrect
+test(() => {
+  // This is the only decoder which does not clear internal state before throwing in stream mode (non-EOF throws)
+  // So the internal state of this decoder can legitimately persist after an error was thrown


This seems very sketchy. I wonder if we should change this somehow in the standard. No need to change the test for now though as it indeed appears correct.

My take is that continuing streaming after fatal has thrown an error is a very bad idea, regardless of the encoding, and should be banned.

I will elaborate more in whatwg/encoding#358
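
Until the standard says anything about this, calling code can enforce the same rule on its own side by refusing to reuse a decoder once it has thrown. A minimal sketch, assuming a global `TextDecoder`; `PoisonOnErrorDecoder` is a hypothetical wrapper name, not something defined by the PR or the spec:

```javascript
// Hypothetical wrapper (not part of the PR or the Encoding Standard):
// once the wrapped fatal decoder throws, every later decode() call is rejected
// instead of resuming with whatever internal state the decoder was left in.
class PoisonOnErrorDecoder {
  #decoder;
  #poisoned = false;

  // The second parameter is an assumption that allows injecting another
  // decoder implementation; by default a fatal TextDecoder is created.
  constructor(encoding, decoder = new TextDecoder(encoding, { fatal: true })) {
    this.#decoder = decoder;
  }

  decode(bytes, options) {
    if (this.#poisoned) {
      throw new Error('decoder was discarded after a previous decoding error');
    }
    try {
      return this.#decoder.decode(bytes, options);
    } catch (err) {
      this.#poisoned = true; // never continue a stream after a fatal error
      throw err;
    }
  }
}
```

With this wrapper, the first error from the underlying decoder is rethrown unchanged, and any further `decode()` call on the same instance throws a plain `Error`, so the question of what state survives the error never reaches the caller.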

anonrig · 2026-02-27T01:40:50Z

I recommend merging this; we can iterate on it afterwards. None of the review comments seems to be a blocker.

wpt-pr-bot added the encoding label Dec 20, 2025

wpt-pr-bot assigned annevk Dec 20, 2025

wpt-pr-bot requested a review from annevk December 20, 2025 21:22

ChALkeR force-pushed the chalker/textdecoder-mistakes/0 branch 4 times, most recently from d412234 to 3cf258f Compare December 21, 2025 05:23

Add tests for common TextDecoder implementation mistakes

b132d29

ChALkeR force-pushed the chalker/textdecoder-mistakes/0 branch from 3cf258f to b132d29 Compare December 21, 2025 07:05

annevk reviewed Dec 21, 2025

annevk mentioned this pull request Dec 21, 2025

ISO-2022-JP does not reset state when returning error whatwg/encoding#358

Open

ChALkeR mentioned this pull request Dec 25, 2025

Encodings used in iconv-lite mismatch the WHATWG Encoding spec significantly pillarjs/iconv-lite#367

Open

This was referenced Feb 7, 2026

legacy multi-byte encodings in TextDecoder are incorrect cloudflare/workerd#6038

Closed

utf-16 TextDecoder is incorrect boa-dev/boa#4612

Open

anonrig approved these changes Feb 26, 2026

ChALkeR mentioned this pull request Feb 26, 2026

legacy multi-byte encodings in TextDecoder are wrong in streaming cloudflare/workerd#6193

Closed

anonrig merged commit 18f431a into web-platform-tests:master Feb 27, 2026
25 checks passed
