gh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser #140904

serhiy-storchaka · 2025-11-02T09:47:11Z

Issue: html.parser(convert_charrefs=False) silently drops ampersand (&) on invalid named entities, causing data loss #140875

I have added many tests that check consistency between unclosed character references followed by EOF and those followed by another character, in both convert_charrefs modes.
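As a rough illustration of that kind of consistency check, here is a minimal sketch. `EventRecorder` and `parse_events` are hypothetical helpers written for this note (they are not the test suite's own code); they record parser callbacks as tuples so the two `convert_charrefs` modes can be compared side by side. The output for the unclosed inputs depends on whether the interpreter includes this fix, so none is shown.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records parser callbacks as tuples."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_charref(self, name):
        self.events.append(('charref', name))


def parse_events(source, convert_charrefs):
    """Feed the whole source, then close() so pending text is treated as EOF."""
    parser = EventRecorder(convert_charrefs)
    parser.feed(source)
    parser.close()
    return parser.events


# Closed and unclosed references, each parsed in both modes.
for source in ('&gt;', '&gt', '&g', '&#62', '&#x3e'):
    for mode in (False, True):
        print(repr(source), mode, parse_events(source, mode))
```

The closed form `'&gt;'` is a useful baseline: it yields an `entityref` event with `convert_charrefs=False` and decoded data with `convert_charrefs=True`.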

serhiy-storchaka · 2025-11-02T09:10:20Z

Lib/test/test_htmlparser.py

-        # Maybe HTMLParser should use self.unescape for these
-        data = [
-            ('a&', [('data', 'a&')]),
-            ('a&b', [('data', 'ab')]),


This is the reported bug. It was in the tests!
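One way to spot this kind of data loss is to rebuild the text from the callbacks and count the ampersands. The sketch below assumes `convert_charrefs=False`; `SourceRebuilder`, `rebuild`, and `loses_ampersand` are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser


class SourceRebuilder(HTMLParser):
    """Hypothetical helper: rebuilds the input text from parser callbacks."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # The callback does not say whether a ';' followed the name,
        # so the rebuilt text omits it.
        self.parts.append('&' + name)

    def handle_charref(self, name):
        self.parts.append('&#' + name)


def rebuild(source):
    parser = SourceRebuilder()
    parser.feed(source)
    parser.close()
    return ''.join(parser.parts)


def loses_ampersand(source):
    """Report whether any '&' from the input is missing after parsing."""
    return rebuild(source).count('&') < source.count('&')


# The old expectation above, ('a&b', [('data', 'ab')]), corresponds to a
# lost ampersand; the result here depends on the interpreter version.
print(loses_ampersand('a&b'))
```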

serhiy-storchaka · 2025-11-02T09:13:55Z

Lib/test/test_htmlparser.py

+        self._run_check('&gt', [('entityref', 'gt')], convert_charrefs=False)
+        self._run_check('&gt', [('data', '>')], convert_charrefs=True)
+
+        self._run_check('&g', [('entityref', 'g')], convert_charrefs=False)


The ampersand was swallowed only when a 1-character name preceded EOF.

serhiy-storchaka · 2025-11-02T09:15:02Z

Lib/test/test_htmlparser.py

+        self._run_check('& z', [('data', '& z')], convert_charrefs=True)
+
+    def test_eof_in_entityref(self):
+        self._run_check('&gt', [('entityref', 'gt')], convert_charrefs=False)


Before this change, this input was reported as data.
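Code that runs with `convert_charrefs=False` may see the difference between a data event and an entityref event for the same input. A minimal sketch of a consumer that resolves references in its own callbacks follows; `TextExtractor` and `extract_text` are hypothetical names, and delegating the lookup to `html.unescape` is an assumption of this example, not something the PR prescribes.

```python
from html import unescape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical consumer that collects text with convert_charrefs=False."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Delegate the named-reference lookup to html.unescape.
        self.parts.append(unescape(f'&{name};'))

    def handle_charref(self, name):
        # name is the text after '&#', e.g. '62' or 'x3e'.
        self.parts.append(unescape(f'&#{name};'))


def extract_text(source):
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return ''.join(parser.parts)


# An unclosed '&gt' at EOF: when it arrives as data it passes through
# literally, and when it arrives as an entityref it is decoded.
print(repr(extract_text('x &gt')))
```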

ezio-melotti · 2025-11-19T09:13:30Z

Lib/html/parser.py

+                elif i + 3 < n:  # larger than "&#x"
+                    # not the end of the buffer, and can't be confused
+                    # with some other construct
+                    self.handle_data("&#")


What's the reason for emitting &# as data right away, instead of emitting it later with the rest of the data? In other words, for '&# y', this emits handle_data('&#') + handle_data(' y') instead of a single handle_data('&# y').

Having multiple handle_data calls is not wrong per se, but if we remove the elif block and let the else branch break, we can simplify the code and emit a single handle_data.

The same might apply below, where a single & is emitted. Also note that for 'z &# y', the first handle_data gets called with only 'z ', even if it is followed by one or more additional handle_data calls.

serhiy-storchaka · 2025-11-19

Try '&#x &lt;'. With this PR it emits handle_data('&#'), handle_data('x '), handle_entityref('lt'), which is not optimal but correct. If we remove this elif, it will emit handle_data('&#x &lt;'), which is incorrect.
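Since adjacent handle_data calls are acceptable, callers that want one data event per run of text can merge them on their side. A minimal sketch, assuming `convert_charrefs=False`; `CoalescingCollector` and `collect` are hypothetical names used only in this example.

```python
from html.parser import HTMLParser


class CoalescingCollector(HTMLParser):
    """Hypothetical helper: records events, merging adjacent data chunks."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def _append(self, event):
        # Merge with the previous event when both are plain data.
        if event[0] == 'data' and self.events and self.events[-1][0] == 'data':
            self.events[-1] = ('data', self.events[-1][1] + event[1])
        else:
            self.events.append(event)

    def handle_data(self, data):
        self._append(('data', data))

    def handle_entityref(self, name):
        self._append(('entityref', name))

    def handle_charref(self, name):
        self._append(('charref', name))


def collect(chunks, convert_charrefs=False):
    """Feed the chunks one by one, close the parser, return merged events."""
    parser = CoalescingCollector(convert_charrefs=convert_charrefs)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events


# Two feed() calls produce two handle_data calls, reported here as one event.
print(collect(['a', 'b']))
# The input discussed above; how the data is split depends on the version.
print(collect(['&#x &lt;']))
```

With this normalization, comparisons depend only on the sequence of event types and the concatenated text, not on where the parser happens to split the data.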

miss-islington-app · 2025-11-19T11:55:14Z

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14.
🐍🍒⛏🤖

bedevere-app · 2025-11-19T11:55:22Z

GH-141745 is a backport of this pull request to the 3.14 branch.

bedevere-app · 2025-11-19T11:55:27Z

GH-141746 is a backport of this pull request to the 3.13 branch.

serhiy-storchaka added 2 commits November 2, 2025 10:49

pythongh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser

fe8bcb2

Update a NEWS entry.

8105625

serhiy-storchaka commented Nov 2, 2025

serhiy-storchaka requested a review from ezio-melotti as a code owner November 2, 2025 09:47

serhiy-storchaka added the "needs backport to 3.13" (bugs and security fixes) and "needs backport to 3.14" (bugs and security fixes) labels Nov 2, 2025

bedevere-app bot added the awaiting core review label Nov 2, 2025

bedevere-app bot mentioned this pull request Nov 2, 2025

html.parser(convert_charrefs=False) silently drops ampersand (&) on invalid named entities, causing data loss #140875

Closed

serhiy-storchaka force-pushed the htmlparser-eof-in-entityref branch from c5ce0aa to fe8bcb2 Compare November 2, 2025 09:50

ezio-melotti reviewed Nov 19, 2025

serhiy-storchaka added 2 commits November 19, 2025 11:46

Merge branch 'main' into htmlparser-eof-in-entityref

ebc287b

Update tests.

5dedffc

ezio-melotti approved these changes Nov 19, 2025

bedevere-app bot added awaiting merge and removed awaiting core review labels Nov 19, 2025

serhiy-storchaka merged commit 95296a9 into python:main Nov 19, 2025
87 of 89 checks passed

bedevere-app bot removed the awaiting merge label Nov 19, 2025

serhiy-storchaka deleted the htmlparser-eof-in-entityref branch November 19, 2025 11:55

bedevere-app bot removed the "needs backport to 3.14" (bugs and security fixes) label Nov 19, 2025

bedevere-app bot removed the "needs backport to 3.13" (bugs and security fixes) label Nov 19, 2025

serhiy-storchaka added a commit that referenced this pull request Nov 19, 2025

[3.13] gh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser (GH-140904) (GH-141746) (cherry picked from commit 95296a9) Co-authored-by: Serhiy Storchaka <[email protected]>

c7064e7

serhiy-storchaka added a commit that referenced this pull request Nov 19, 2025

[3.14] gh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser (GH-140904) (GH-141745) (cherry picked from commit 95296a9) Co-authored-by: Serhiy Storchaka <[email protected]>

562e23f

StanFromIreland pushed a commit to StanFromIreland/cpython that referenced this pull request Dec 6, 2025

pythongh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser (pythonGH-140904)

f4fc0e6

ashm-dev pushed a commit to ashm-dev/cpython that referenced this pull request Dec 8, 2025

pythongh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser (pythonGH-140904)

37aec79

jacobtylerwalls mentioned this pull request Dec 11, 2025

Refs #36499 -- Adjusted test_strip_tags following Python behavior change for incomplete entities. django/django#20390

Merged

6 tasks
