Add tests for common TextDecoder implementation mistakes#56892
Merged
anonrig merged 1 commit intoweb-platform-tests:masterfrom Feb 27, 2026
Merged
Add tests for common TextDecoder implementation mistakes#56892anonrig merged 1 commit intoweb-platform-tests:masterfrom
anonrig merged 1 commit intoweb-platform-tests:masterfrom
Conversation
This was referenced Dec 20, 2025
d412234 to
3cf258f
Compare
3cf258f to
b132d29
Compare
annevk
reviewed
Dec 21, 2025
Member
annevk
left a comment
There was a problem hiding this comment.
This is amazing work. Thanks so much!
If have two questions:
- I guess the lack of semicolons was maybe copied from existing tests? It would be preferable to have them, but it's not required.
- Can we perhaps split this file by encoding type? E.g., UTF-8, UTF-16, single-byte encodings, and multi-byte encodings?
| // Bun is incorrect | ||
| test(() => { | ||
| // This is the only decoder which does not clear internal state before throwing in stream mode (non-EOF throws) | ||
| // So the internal state of this decoder can legitimately persist after an error was thrown |
Member
There was a problem hiding this comment.
This seems very sketchy. I wonder if we should change this somehow in the standard. No need to change the test for now though as it indeed appears correct.
Contributor
Author
There was a problem hiding this comment.
My take is that continuing streaming after fatal has thrown an error is a very bad idea, regardless of the encoding, and should be banned.
I will elaborate more in whatwg/encoding#358
This was referenced Feb 7, 2026
anonrig
approved these changes
Feb 26, 2026
Member
|
I recommend merging this and we can iterate over it. None of the reviews seems to be a blocker. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This testsuite collects cases where at least one implementation fails
This is an export of https://github.com/ExodusOSS/bytes/blob/9c8c9baa/tests/encoding/mistakes.test.js
It demonstrates bugs in all major implementations - not a single of the existing ones was correct for all encodings.
In total, 10 different implementations were tested + my implementation in js.
Important
All of the three major browser engines got UTF-8 decoding wrong.
#56799 and #56844 were tiny parts of this.
On no testcase all implementations agree on an incorrect behavior: at least one is consistent with the spec.
Perhaps the most controversial part of this is a test for whatwg/encoding#115, but Firefox, Deno, Servo agree with spec and the spec is clear on what should happen and documents that specific case. Also, WebKit and Chrome now treat concatenated input as invalid too, but perform replacement on it incorrectly.
On some tests,
For data, see https://docs.google.com/spreadsheets/d/1pdEefRG6r9fZy61WHGz0TKSt8cO4ISWqlpBN5KntIvQ/edit
Legend:
FAILmeans returning incorrect results on some inputs, i.e. not matching the spec on a simple.decode()call.STATEis more serious in some cases - it means internal inconsistency, non-streaming.decode()result depends on previous call, state leaks between calls.STREAMalso could be more serious in some cases and demonstrates an internal inconsistencyDecoding result depends on the buffers shape:
decode(a, { stream }) + decode(b) !== decode(a + b)With Dynamic Record Sizing, this could potentially be controlled with a MitM and cause different decoding results for the same bytes, all without affecting TLS (as it verifies bytes).
Demo for
utf-8mishandling on WebKit and Chrome: https://tmp-demo.rray.org/utf-8 (small static html over https)Due to the nature of cross-tests, implementations not behaving correctly per spec on single-shot
new TextDecoder(encoding).decode(arg)calls were not fully tested forSTATE/STREAM, soFAILtakes priority (even in cases where it's less significant).The rest is explained in the table and the testcases
Bug refs, found with these cross-tests:
For a clean implementation in js, see
@exodus/bytes/encoding.js