gh-136702: Deprecate passing non-ascii encoding (str) to `encodings.normalize_encoding` #140030

StanFromIreland · 2025-10-13T10:28:28Z

Passing non-ASCII bytes already raises a ValueError. The requirement that a str *encoding* be ASCII-only has been documented, but not enforced.
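For illustration, here is a minimal sketch of the rule being enforced. `strict_normalize_encoding` is a hypothetical wrapper, not part of CPython, and it assumes that a non-ASCII str will eventually be rejected with a ValueError in the same way non-ASCII bytes already are. Only `encodings.normalize_encoding` itself is a real stdlib function here.

```python
import encodings


def strict_normalize_encoding(encoding):
    """Hypothetical wrapper: reject non-ASCII names before normalizing.

    Not part of CPython; it only illustrates the rule under discussion.
    """
    if isinstance(encoding, bytes):
        # Raises UnicodeDecodeError (a ValueError subclass) on non-ASCII bytes.
        encoding = encoding.decode("ascii")
    if not encoding.isascii():
        # Assumed end state of the deprecation: non-ASCII str is an error.
        raise ValueError(f"encoding name must be ASCII: {encoding!r}")
    return encodings.normalize_encoding(encoding)


print(strict_normalize_encoding("utf-8"))  # prints utf_8
```

Because the wrapper checks the name before delegating, it never passes a non-ASCII str to `encodings.normalize_encoding`, so it behaves the same before and after the deprecation.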

Issue: Improve encodings.normalize_encoding behaviour or docs #136702

📚 Documentation preview 📚: https://cpython-previews--140030.org.readthedocs.build/

malemburg

Please check the performance of the helper function.

Lib/email/_header_value_parser.py

Lib/email/utils.py

StanFromIreland · 2025-10-16T15:49:54Z

Thanks for the review! Using `str.translate` gives a nice performance bonus, and I changed the fallback to 'ascii' too.
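A rough sketch of what a translate-based sanitizer could look like at this stage of the review. The names `_SANITIZE_TABLE` and `fallback_charset` come from the diff quoted below; the table class `_DropNonAscii`, the function name `sanitize_charset_name`, and the `isascii()` guard are assumptions made for illustration and may not match the PR's actual code.

```python
class _DropNonAscii:
    """Table for str.translate() that deletes non-ASCII characters."""

    def __getitem__(self, codepoint):
        if codepoint < 128:
            # LookupError tells str.translate() to keep the character as-is.
            raise LookupError(codepoint)
        # None tells str.translate() to delete the character.
        return None


_SANITIZE_TABLE = _DropNonAscii()


def sanitize_charset_name(charset, fallback_charset="ascii"):
    """Hypothetical helper: strip non-ASCII characters from a charset name."""
    if charset.isascii():
        # Fast path: nothing to strip.
        return charset
    sanitized = charset.translate(_SANITIZE_TABLE)
    # If nothing survives sanitizing, fall back to a known-good charset.
    return sanitized if sanitized else fallback_charset


print(sanitize_charset_name("utf-8\u201d"))  # prints utf-8
```

`str.translate` accepts any object that implements `__getitem__`, so the table does not need to enumerate every non-ASCII code point.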

bitdancer · 2025-10-31T21:10:18Z

Lib/email/utils.py

+        return charset
+    sanitized = charset.translate(_SANITIZE_TABLE)
+    return sanitized if sanitized else fallback_charset
+


What is the trigger for this change? Do I actually have a test that uses a non-ascii charset name? If I did it should be an error case, since non-ascii is not permitted in charset names per the RFCs. I'm surprised I don't appear to be registering a defect for that, though I didn't go through the code enough to be sure I don't ;)

Regardless, it isn't clear to me that 'sanitizing' is a useful operation. It isn't likely to produce a valid charset name; we should just be falling back to ascii at that point. What led you to choose this approach?

This is currently done by normalize_encoding.

OK. email doesn't call lookup directly, and no tests fail without the changes.

I presume you did this to preserve backward compatibility. Unless I'm missing something, I don't think we should bother to do that. Given a non-ascii charset name, there are two possible outcomes from the current code: the name after sanitizing is not a valid codec name, or it is. If it is valid after sanitizing, there are two cases: the sanitized name results in successful decoding, or it does not. Only the first of these latter two cases (a valid sanitized name that decodes successfully) would be affected by the post-deprecation change.

How often would that case occur in reality? I would guess it would be a vanishingly small number of cases, if it ever occurs at all.
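The case analysis above can be sketched as a small classifier. `classify_non_ascii_charset` is a hypothetical helper written for this illustration; it approximates the current sanitizing behaviour by dropping non-ASCII characters itself rather than calling into `encodings`.

```python
import codecs


def classify_non_ascii_charset(charset, payload):
    """Hypothetical helper mirroring the three outcomes described above."""
    # Approximate the pre-deprecation behaviour: drop non-ASCII characters.
    sanitized = "".join(ch for ch in charset if ch.isascii())
    try:
        codecs.lookup(sanitized)
    except LookupError:
        return "sanitized name is not a codec"
    try:
        payload.decode(sanitized)
    except (UnicodeError, LookupError):
        # Covers undecodable bytes and codecs that are not text encodings.
        return "sanitized name is a codec, decoding fails"
    return "sanitized name is a codec, decoding succeeds"


print(classify_non_ascii_charset("utf-8\u201d", b"abc"))
print(classify_non_ascii_charset("n\u00f6-such-codec", b"abc"))
print(classify_non_ascii_charset("ascii\u00e9", b"\xff"))
```

Only inputs that land in the "decoding succeeds" branch would behave differently once non-ASCII names are rejected outright.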

I think it will be better to remove the changes to the email package from this PR. If anyone sees the deprecation warning maybe they'll open an issue, but I'm betting nobody ever sees it from the email package. The behavior after the deprecation is over is the behavior we want: if the codec name contains non-ascii it is not a valid codec name, so any non-ascii in the text being decoded using that charset name will ultimately get turned into the 'unknown character' glyph when decoded by the email package.

> I presume you did this to preserve backward compatibility.

Yes, I'm no email expert and I did not dig into the specifications, so I did this to not change any behaviour. I can remove it.

What's the conclusion here? I still see the email package changes in place, but they look pretty harmless to me.

StanFromIreland · 2025-11-01T19:41:33Z

Lib/test/test_email/test_email.py

+        import warnings
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", DeprecationWarning)
+            self.assertEqual(msg.get_filename(), 'myfile.txt')


@bitdancer It was indeed tested.

Ah, I missed that. I guess maybe warnings weren't enabled the way I ran the tests, though I thought I was doing so...

What do you want to do, keep it like so or revert to my backwards-compatible approach?

I believe I've confirmed that the tests will still pass after lookup starts raising a ValueError, which is what I was hoping would happen. I wonder if we should actually assert the deprecation warning here, though, so that we are reminded to remove the check when the deprecation turns into reality. I'm fine with whatever is standard practice for such cases, though.

Converted to asserts.
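A minimal sketch of the pattern agreed on here: asserting the warning instead of silencing it, so the test starts failing (and reminds someone to update it) once the deprecated path stops warning. `legacy_normalize` is a hypothetical stand-in for an API in its deprecation period, not the function touched by this PR.

```python
import unittest
import warnings


def legacy_normalize(name):
    """Hypothetical stand-in for an API in its deprecation period."""
    if not name.isascii():
        warnings.warn("non-ASCII encoding names are deprecated",
                      DeprecationWarning, stacklevel=2)
        name = "".join(ch for ch in name if ch.isascii())
    return name


class DeprecationReminderTest(unittest.TestCase):
    def test_non_ascii_name_warns(self):
        # Fails if the warning is NOT raised, unlike
        # warnings.simplefilter("ignore", ...), which hides it either way.
        with self.assertWarns(DeprecationWarning):
            result = legacy_normalize("utf-8\u201d")
        self.assertEqual(result, "utf-8")
```

Saved as a module, this can be run with `python -m unittest`.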

bitdancer

LGTM

StanFromIreland · 2025-11-08T10:27:16Z

@malemburg Anything else I should do here?

malemburg · 2025-11-09T11:44:04Z

> @malemburg Anything else I should do here?

The PR looks fine to me, but I want to hear back from @bitdancer about the email package changes before merging it.

StanFromIreland · 2025-11-09T12:34:03Z

I thought we had reached agreement; he said "LGTM" above.

malemburg · 2025-11-09T12:37:12Z

Oh, ok, didn't see that.

malemburg · 2025-11-09T12:38:01Z

Thanks, @StanFromIreland and @bitdancer for the reviews.

hugovk · 2025-11-10T14:41:28Z

Looks like this is causing failures on refleaks buildbots:

======================================================================
FAIL: test_codecs_lookup (test.test_codecs.CodecNameNormalizationTest.test_codecs_lookup)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.refleak/build/Lib/test/test_codecs.py", line 3891, in test_codecs_lookup
    with self.assertWarns(DeprecationWarning):
         ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
AssertionError: DeprecationWarning not triggered

======================================================================
FAIL: test_rfc2231_bad_character_in_encoding (test.test_email.test_email.TestRFC2231.test_rfc2231_bad_character_in_encoding)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.refleak/build/Lib/test/test_email/test_email.py", line 5741, in test_rfc2231_bad_character_in_encoding
    with self.assertWarns(DeprecationWarning):
         ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
AssertionError: DeprecationWarning not triggered

======================================================================
FAIL: test_value_rfc2231_nonascii_in_charset_of_charset_parameter_value (test.test_email.test_headerregistry.TestContentTypeHeader.test_value_rfc2231_nonascii_in_charset_of_charset_parameter_value)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.refleak/build/Lib/test/test_email/__init__.py", line 160, in <lambda>
    getattr(self, name)(*params))
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.refleak/build/Lib/test/test_email/test_headerregistry.py", line 255, in content_type_as_value
    with self.assertWarns(DeprecationWarning):
         ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
AssertionError: DeprecationWarning not triggered
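The thread does not state why the warning was missing on these runs. As a general illustration only, one mechanism that can produce "warning on the first run, none on a repeat run" is a cache sitting in front of the code that warns. Everything below is hypothetical (`_normalize`, `cached_lookup`, `warnings_seen`) and is not a diagnosis of the buildbot failure.

```python
import functools
import warnings


def _normalize(name):
    """Hypothetical normalizer that warns on non-ASCII names."""
    if not name.isascii():
        warnings.warn("non-ASCII name", DeprecationWarning, stacklevel=2)
    return "".join(ch for ch in name if ch.isascii()).lower()


@functools.lru_cache(maxsize=None)
def cached_lookup(name):
    """Hypothetical cached lookup: the warning path runs only on a miss."""
    return _normalize(name)


def warnings_seen(name):
    """Return how many warnings one call to cached_lookup() emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        cached_lookup(name)
    return len(caught)


first = warnings_seen("utf-8\u201d")   # cache miss: the warning fires
second = warnings_seen("utf-8\u201d")  # cache hit: _normalize() is skipped
print(first, second)  # prints 1 0
```

A test that asserts the warning would pass on the first iteration and fail on later ones under such a setup, unless the cache is cleared or a fresh name is used each time.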

hugovk · 2025-11-10T14:45:33Z

And I see the fix PR has just been merged :)

#141345

…odings.normalize_encoding` (python#140030) Closes python#136702

deprecate non-ascii

c8fc658

StanFromIreland requested a review from malemburg October 13, 2025 10:28

bedevere-app bot added the awaiting review label Oct 13, 2025

bedevere-app bot mentioned this pull request Oct 13, 2025

Improve encodings.normalize_encoding behaviour or docs #136702

Open

StanFromIreland added 2 commits October 13, 2025 11:34

Relocate import

5b50daa

sanitize charset names in email

95f2e65

StanFromIreland requested a review from a team as a code owner October 13, 2025 11:13

malemburg approved these changes Oct 16, 2025

bedevere-app bot added awaiting merge and removed awaiting review labels Oct 16, 2025

Use table, replace with 'ascii'

fad52cd

Merge branch 'main' into encodings/non-ascii

3ac0804

bitdancer reviewed Oct 31, 2025

StanFromIreland requested a review from bitdancer November 1, 2025 19:39

Review

9d6f06e

StanFromIreland commented Nov 1, 2025

StanFromIreland added 2 commits November 1, 2025 20:22

Fix second warning

16697dc

Convert to asserts

e4036f8

bitdancer approved these changes Nov 1, 2025

StanFromIreland added 3 commits November 1, 2025 22:04

Fix for platforms with ordered tests

b8fc5f4

!fixup

7592af8

Fix CI on Android and iOS

8c59899

malemburg merged commit 5ba0a1a into python:main Nov 9, 2025
46 checks passed

bedevere-app bot removed the awaiting merge label Nov 9, 2025

StanFromIreland deleted the encodings/non-ascii branch November 9, 2025 12:45

StanFromIreland added a commit to StanFromIreland/cpython that referenced this pull request Dec 6, 2025

pythongh-136702: Deprecate passing non-ascii *encoding* (str) to `enc…

51b2f1e

…odings.normalize_encoding` (python#140030) Closes python#136702
