Skip to content

Conversation

@serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Nov 2, 2025

Copy link
Member Author

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added many tests to check consistency between unclosed character references followed by EOF and other character, with both convert_charrefs modes.

# Maybe HTMLParser should use self.unescape for these
data = [
('a&', [('data', 'a&')]),
('a&b', [('data', 'ab')]),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the reported bug. It was in the tests!

self._run_check('&gt', [('entityref', 'gt')], convert_charrefs=False)
self._run_check('&gt', [('data', '>')], convert_charrefs=True)

self._run_check('&g', [('entityref', 'g')], convert_charrefs=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ampersand was only swallowed in the case of 1-character name before EOF.

self._run_check('& z', [('data', '& z')], convert_charrefs=True)

def test_eof_in_entityref(self):
self._run_check('&gt', [('entityref', 'gt')], convert_charrefs=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was data before this change.

elif i + 3 < n: # larger than "&#x"
# not the end of the buffer, and can't be confused
# with some other construct
self.handle_data("&#")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for emitting &# as data now, instead of emitting it later with the rest of the data? IOW, in the case of '&x y', this will emit handle_data('&#') + handle_data(' y') instead of a single handle_data('&# y').

Having multiple handle_data is not wrong per se, but if we remove the elif block and let the else break, we can simplify the code and emit a single handle_data.

The same might apply below, where a single & is emitted. Also note that for ' z &x y', the first handle_data gets called with only 'z ', even if followed by one or more additional handle_data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try &#x &lt;. With this PR it emits handle_data('&#'), handle_data('x '), handle_entityref('lt') which is not optimal, but correct. If remove this elif, it will emit handle_data('&#x &lt;'), which is incorrect.

@serhiy-storchaka serhiy-storchaka merged commit 95296a9 into python:main Nov 19, 2025
87 of 89 checks passed
@miss-islington-app
Copy link

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14.
🐍🍒⛏🤖

@serhiy-storchaka serhiy-storchaka deleted the htmlparser-eof-in-entityref branch November 19, 2025 11:55
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Nov 19, 2025
…Parser (pythonGH-140904)

(cherry picked from commit 95296a9)

Co-authored-by: Serhiy Storchaka <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Nov 19, 2025

GH-141745 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Nov 19, 2025
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Nov 19, 2025
…Parser (pythonGH-140904)

(cherry picked from commit 95296a9)

Co-authored-by: Serhiy Storchaka <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Nov 19, 2025

GH-141746 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Nov 19, 2025
serhiy-storchaka added a commit that referenced this pull request Nov 19, 2025
…LParser (GH-140904) (GH-141746)

(cherry picked from commit 95296a9)

Co-authored-by: Serhiy Storchaka <[email protected]>
serhiy-storchaka added a commit that referenced this pull request Nov 19, 2025
…LParser (GH-140904) (GH-141745)

(cherry picked from commit 95296a9)

Co-authored-by: Serhiy Storchaka <[email protected]>
StanFromIreland pushed a commit to StanFromIreland/cpython that referenced this pull request Dec 6, 2025
ashm-dev pushed a commit to ashm-dev/cpython that referenced this pull request Dec 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants