Skip to content

Conversation

@StanFromIreland
Copy link
Member

@StanFromIreland StanFromIreland commented Oct 13, 2025

Passing a non-ascii bytes already raises a ValueError. The requirement of non-ascii input has been documented, but not enforced.


📚 Documentation preview 📚: https://cpython-previews--140030.org.readthedocs.build/

@StanFromIreland StanFromIreland requested a review from a team as a code owner October 13, 2025 11:13
Copy link
Member

@malemburg malemburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the performance of the helper function.

@StanFromIreland
Copy link
Member Author

Thanks for the review! Using translate gives a nice performance bonus, changed the fallback to 'ascii' too.

return charset
sanitized = charset.translate(_SANITIZE_TABLE)
return sanitized if sanitized else fallback_charset

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the trigger for this change? Do I actually have a test that uses a non-ascii charset name? If I did it should be an error case, since non-ascii is not permitted in charset names per the RFCs. I'm surprised I don't appear to be registering a defect for that, though I didn't go through the code enough to be sure I don't ;)

Regardless it isn't clear to me that 'sanitizing' is a useful operation. It isn't likely to produce a valid charset name, we should just be falling back to ascii at that point. What led you to choose this approach?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently done by normalize_encoding.

Copy link
Member

@bitdancer bitdancer Nov 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. emal doesn't call lookup directly and no tests fail without the changes.

I presume you did this to preserve backward compatibility. Unless I'm missing something, I don't think we should bother to do that. Given a non-ascii charset name, there are two possible outcomes from the current code: the name after sanitizing is not a valid codec name, or it is. If it is valid after sanitizing, there are two cases: the sanitized name results in successful decoding, or it does not. It is only the first of these second two cases that would be affected by the post-deprecation change.

How often would that case occur in reality? I would guess it would be a vanishingly small number of cases, if it ever occurs at all.

I think it will be better to remove the changes to the email package from this PR. If anyone sees the deprecation warning maybe they'll open an issue, but I'm betting nobody ever sees it from the email package. The behavior after the deprecation is over is the behavior we want: if the codec name contains non-ascii it is not a valid codec name, so any non-ascii in the text being decoded using that charset name will ultimately get turned into the 'unknown character' glyph when decoded by the email package.

Copy link
Member Author

@StanFromIreland StanFromIreland Nov 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume you did this to preserve backward compatibility.

Yes, I'm no email expert and I did not dig into the specifications, so I did this to not change any behaviour. I can remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What't the conclusion here ? I still see the email package changes in place, but they look pretty harmless to me.

Comment on lines 5720 to 5723
import warnings
with warnings.catch_warnings():
warnings.simplefilter("ignore", DeprecationWarning)
self.assertEqual(msg.get_filename(), 'myfile.txt')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bitdancer It was indeed tested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed that. I guess maybe warnings weren't enabled the way I ran the tests, though I thought I was doing so...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you want to do, keep it like so or revert to my backwards-compatible approach?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I've confirmed that the tests will still pass after lookup starts raising a ValueError, which is what I was hoping would happen. I wonder if we should actually assert the deprecation warning here, though, so that we are reminded to remove the check when the deprecation turns into reality. I'm fine with whatever is standard practice for such cases, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted to asserts.

Copy link
Member

@bitdancer bitdancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@StanFromIreland
Copy link
Member Author

@malemburg Anything else I should do here?

@malemburg
Copy link
Member

@malemburg Anything else I should do here?

The PR looks fine to, but I want to hear back from @bitdancer about the email package changes before merging it.

@StanFromIreland
Copy link
Member Author

I thought we had reached agreement, he has said „LGTM” above.

@malemburg
Copy link
Member

Oh, ok, didn't see that.

@malemburg malemburg merged commit 5ba0a1a into python:main Nov 9, 2025
46 checks passed
@malemburg
Copy link
Member

Thanks, @StanFromIreland and @bitdancer for the reviews.

@StanFromIreland StanFromIreland deleted the encodings/non-ascii branch November 9, 2025 12:45
@hugovk
Copy link
Member

hugovk commented Nov 10, 2025

Looks like this is causing failures on refleaks buildbots:

======================================================================
FAIL: test_codecs_lookup (test.test_codecs.CodecNameNormalizationTest.test_codecs_lookup)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.refleak/build/Lib/test/test_codecs.py", line 3891, in test_codecs_lookup
    with self.assertWarns(DeprecationWarning):
         ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
AssertionError: DeprecationWarning not triggered
======================================================================
FAIL: test_rfc2231_bad_character_in_encoding (test.test_email.test_email.TestRFC2231.test_rfc2231_bad_character_in_encoding)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.refleak/build/Lib/test/test_email/test_email.py", line 5741, in test_rfc2231_bad_character_in_encoding
    with self.assertWarns(DeprecationWarning):
         ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
AssertionError: DeprecationWarning not triggered

======================================================================
FAIL: test_value_rfc2231_nonascii_in_charset_of_charset_parameter_value (test.test_email.test_headerregistry.TestContentTypeHeader.test_value_rfc2231_nonascii_in_charset_of_charset_parameter_value)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.refleak/build/Lib/test/test_email/__init__.py", line 160, in <lambda>
    getattr(self, name)(*params))
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.refleak/build/Lib/test/test_email/test_headerregistry.py", line 255, in content_type_as_value
    with self.assertWarns(DeprecationWarning):
         ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
AssertionError: DeprecationWarning not triggered

@hugovk
Copy link
Member

hugovk commented Nov 10, 2025

And I see the fix PR has just been merged :)

#141345

StanFromIreland added a commit to StanFromIreland/cpython that referenced this pull request Dec 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants