Correctly fold unknown-8bit originating from encoded words. (#142517)

The unknown-8bit trick was designed to deal with unknown bytes in an ASCII message, and it works fine for that. However, I also tried to extend it to handle bytes that can't be decoded using the charset specified in an encoded word, and there it fails because there can be other non-ASCII characters that were *successfully* decoded. The fix is simple: do the unknown-8bit encoding using the utf-8 codec. This is especially appropriate since anyone trying to do recovery on an unknown byte string will probably attempt utf-8 first.
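
The failure mode described above can be reproduced outside the email package. This is a minimal sketch, not the package's own code path: it assumes the decoder falls back to 'surrogateescape' when the declared charset cannot decode every byte, and it reuses the byte sequence from the test added in this commit (five complete UTF-8 characters followed by a truncated lead byte). The variable names are illustrative only.

```python
import base64

# Bytes taken from the encoded word in the new test: five complete
# UTF-8 characters followed by a lone, truncated lead byte (0xE7).
raw = bytes.fromhex("e5aea2e688b6e6ada3e8a68fe4baa4e7")

# Assumed fallback behaviour: decode with the declared charset and
# 'surrogateescape', so the undecodable byte becomes a lone surrogate.
text = raw.decode("utf-8", "surrogateescape")

# Old approach: encoding to ASCII fails, because the string also holds
# successfully decoded non-ASCII characters, not just escaped bytes.
try:
    text.encode("ascii", "surrogateescape")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# New approach: utf-8 with 'surrogateescape' restores the original bytes.
roundtrip = text.encode("utf-8", "surrogateescape")
b64 = base64.b64encode(roundtrip).decode("ascii")

print(ascii_ok)          # False
print(roundtrip == raw)  # True
print(b64)               # 5a6i5oi25q2j6KaP5Lqk5w==
```

The base64 payload printed at the end matches the one in the expected unknown-8bit encoded word in the new test.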
2026-05-06 12:49:07 -04:00 · 2025-12-24 09:14:39 -05:00
parent d4dc3dd9aa
commit 1e17ccd030
3 changed files with 13 additions and 1 deletions
@@ -219,7 +219,7 @@ def encode(string, charset='utf-8', encoding=None, lang=''):

    """
    if charset == 'unknown-8bit':
-        bstring = string.encode('ascii', 'surrogateescape')
+        bstring = string.encode('utf-8', 'surrogateescape')
    else:
        bstring = string.encode(charset)
    if encoding is None:
@@ -3340,5 +3340,13 @@ class TestFolding(TestEmailBase):
        token = parser.get_address_list(text)[0]
        self._test(token, expected, policy=policy)

+    def test_encoded_word_with_undecodable_bytes(self):
+        self._test(parser.get_address_list(
+            ' =?utf-8?Q?=E5=AE=A2=E6=88=B6=E6=AD=A3=E8=A6=8F=E4=BA=A4=E7?='
+                )[0],
+            ' =?unknown-8bit?b?5a6i5oi25q2j6KaP5Lqk5w==?=\n',
+            )
+
+
 if __name__ == '__main__':
    unittest.main()
@@ -0,0 +1,4 @@
+The non-``compat32`` :mod:`email` policies now correctly handle refolding
+encoded words that contain bytes that cannot be decoded in their specified
+character set.  Previously this resulted in an encoding exception during
+folding.