-
-
Notifications
You must be signed in to change notification settings - Fork 33.6k
gh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser #140904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser #140904
Conversation
serhiy-storchaka
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added many tests to check consistency between unclosed character references followed by EOF and other character, with both convert_charrefs modes.
| # Maybe HTMLParser should use self.unescape for these | ||
| data = [ | ||
| ('a&', [('data', 'a&')]), | ||
| ('a&b', [('data', 'ab')]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the reported bug. It was in the tests!
| self._run_check('>', [('entityref', 'gt')], convert_charrefs=False) | ||
| self._run_check('>', [('data', '>')], convert_charrefs=True) | ||
|
|
||
| self._run_check('&g', [('entityref', 'g')], convert_charrefs=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ampersand was only swallowed in the case of 1-character name before EOF.
| self._run_check('& z', [('data', '& z')], convert_charrefs=True) | ||
|
|
||
| def test_eof_in_entityref(self): | ||
| self._run_check('>', [('entityref', 'gt')], convert_charrefs=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was data before this change.
c5ce0aa to
fe8bcb2
Compare
| elif i + 3 < n: # larger than "&#x" | ||
| # not the end of the buffer, and can't be confused | ||
| # with some other construct | ||
| self.handle_data("&#") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for emitting &# as data now, instead of emitting it later with the rest of the data? IOW, in the case of '&x y', this will emit handle_data('&#') + handle_data(' y') instead of a single handle_data('&# y').
Having multiple handle_data is not wrong per se, but if we remove the elif block and let the else break, we can simplify the code and emit a single handle_data.
The same might apply below, where a single & is emitted. Also note that for ' z &x y', the first handle_data gets called with only 'z ', even if followed by one or more additional handle_data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try &#x <. With this PR it emits handle_data('&#'), handle_data('x '), handle_entityref('lt') which is not optimal, but correct. If remove this elif, it will emit handle_data('&#x <'), which is incorrect.
|
Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14. |
…Parser (pythonGH-140904) (cherry picked from commit 95296a9) Co-authored-by: Serhiy Storchaka <[email protected]>
|
GH-141745 is a backport of this pull request to the 3.14 branch. |
…Parser (pythonGH-140904) (cherry picked from commit 95296a9) Co-authored-by: Serhiy Storchaka <[email protected]>
|
GH-141746 is a backport of this pull request to the 3.13 branch. |
…LParser (GH-140904) (GH-141746) (cherry picked from commit 95296a9) Co-authored-by: Serhiy Storchaka <[email protected]>
…LParser (GH-140904) (GH-141745) (cherry picked from commit 95296a9) Co-authored-by: Serhiy Storchaka <[email protected]>
html.parser(convert_charrefs=False)silently drops ampersand (&) on invalid named entities, causing data loss #140875