html.parser(convert_charrefs=False) silently drops ampersand (&) on invalid named entities, causing data loss #140875

@T90REAL

Bug report

Bug description:

When HTMLParser is initialized with convert_charrefs=False, it mishandles an invalid named entity reference (e.g., &A, which is not a valid HTML entity). The parser silently drops the & character and passes only the following A to handle_data. I think this is a silent data loss problem, since the & never reaches any handler.

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_data(self, data):
        print(f"handle_data received: {data!r}")

parser_false = MyParser(convert_charrefs=False)
parser_false.feed('&A')
parser_false.close()

Output:

handle_data received: 'A'
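
For comparison, here is a minimal sketch that runs the same input through both settings of convert_charrefs and records every text-related callback, so it is easier to see whether the & reaches any handler. RecordingParser and parse are hypothetical names used only for illustration; they are not part of html.parser. The result for convert_charrefs=False may differ between Python versions, so the sketch prints what it sees rather than assuming a particular outcome.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Hypothetical helper (not part of html.parser): records text callbacks."""

    def __init__(self, *, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Called for named references such as &amp; when convert_charrefs=False.
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Called for numeric references such as &#38; when convert_charrefs=False.
        self.events.append(("charref", name))


def parse(text, *, convert_charrefs):
    """Feed text to a fresh parser, close it, and return the recorded events."""
    parser = RecordingParser(convert_charrefs=convert_charrefs)
    parser.feed(text)
    parser.close()
    return parser.events


for flag in (True, False):
    events = parse("&A", convert_charrefs=flag)
    text = "".join(value for kind, value in events if kind == "data")
    print(f"convert_charrefs={flag}: events={events!r}, ampersand kept: {'&' in text}")
```

On an affected interpreter, the convert_charrefs=False line should report that the ampersand was not kept, which matches the report above. A well-formed reference such as &amp; is delivered through handle_entityref in that mode, so it is not affected in the same way.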

CPython versions tested on:

3.12

Operating systems tested on:

Linux

Metadata

Labels

3.13: bugs and security fixes
3.14: bugs and security fixes
3.15: new features, bugs and security fixes
stdlib: Standard Library Python modules in the Lib/ directory
type-bug: An unexpected behavior, bug, or error

Projects

Status

Done

Milestone

No milestone

