104 Commits

Author SHA1 Message Date
Stan Ulbrych f1f61bf872 gh-66802: Add unicodedata.block() function (#145042)
Closes #66802
2026-02-21 18:27:55 +01:00
Serhiy Storchaka 767b4d02e2 gh-144882: Optimize name tables in unicodedata by excluding names derived by rule NR2 (GH-144883)
Since the code for rule NR2 is already here, to support CJK unified
ideographs and Tangut ideographs, it can also be used for other names
derived by rule NR2.
2026-02-18 12:58:21 +02:00
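Rule NR2 of the Unicode Standard derives a character name from a fixed prefix followed by the code point in uppercase hexadecimal. Here is a minimal sketch of that derivation, checked against `unicodedata.name()` for one CJK unified ideograph. The helper name `nr2_name` is hypothetical and is not taken from the commit.

```python
import unicodedata

def nr2_name(prefix: str, code_point: int) -> str:
    """Derive a name per rule NR2: prefix followed by the code point in hex."""
    # At least four uppercase hex digits, with no "U+" or "0x" prefix.
    return f"{prefix}{code_point:04X}"

# The prefix for CJK unified ideographs; other NR2 ranges use other prefixes.
derived = nr2_name("CJK UNIFIED IDEOGRAPH-", 0x4E00)
print(derived, derived == unicodedata.name("\u4e00"))
```

Because such names can be computed from the code point alone, they do not need to be stored in a name table.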
Stan Ulbrych e49bfca87c gh-142224: unicodedata: support bidi classes for unassigned code points (GH-144815) 2026-02-18 12:54:07 +02:00
Serhiy Storchaka 8b7b5a9946 gh-80667: Fix Tangut ideographs names in unicodedata (GH-144789)
Co-authored-by: Pierre Le Marre <dev@wismill.eu>
2026-02-16 13:31:18 +02:00
Serhiy Storchaka bab1d7a561 gh-74902: Add Unicode Grapheme Cluster Break algorithm (GH-143076)
Add the unicodedata.iter_graphemes() function to iterate over grapheme
clusters according to rules defined in Unicode Standard Annex #29.

Add unicodedata.grapheme_cluster_break(), unicodedata.indic_conjunct_break()
and unicodedata.extended_pictographic() functions to get the properties
of the character which are related to the above algorithm.

Co-authored-by: Guillaume "Vermeille" Sanchez <guillaume.v.sanchez@gmail.com>
2026-01-14 14:37:57 +00:00
Benjamin Peterson 5bd4bf04c4 closes gh-138706: update Unicode to 17.0.0 (#138719) 2025-09-11 09:58:39 -07:00
Petr Viktorin c7364f79b2 gh-127833: lexical analysis: Improve section on Names (GH-131474)
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Blaise Pabon <blaise@gmail.com>
2025-05-21 16:01:52 +02:00
sobolevn f223efb2a2 gh-126525: Fix makeunicodedata.py output on macOS and Windows (#126526) 2024-11-12 13:23:57 +03:00
Benjamin Peterson bb904e063d closes gh-124016: update Unicode to 16.0.0 (#124017) 2024-09-13 07:47:04 -07:00
CF Bolz-Tereick 9573d14215 gh-96954: use a directed acyclic word graph for storing the unicodedata codepoint names (#97906)
Co-authored-by: Łukasz Langa <lukasz@langa.pl>
Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>
2023-11-04 15:56:58 +01:00
James Gerity def828995a fixes gh-109559: Update unicodedata for Unicode 15.1.0 (GH-109560)
Co-authored-by: Benjamin Peterson <benjamin@python.org>
2023-09-19 22:07:47 -07:00
LiarPrincess 0c1d7a06ed bpo-47243: Duplicate entry in 'Objects/unicodetype_db.h' (GH-32376)
Fix for duplicate 1st entry in 'Objects/unicodetype_db.h':

```c
/* a list of unique character type descriptors */
const _PyUnicode_TypeRecord _PyUnicode_TypeRecords[] = {
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 0}, /* <--- HERE: duplicate of the entry above */
    {0, 0, 0, 0, 0, 32},
    {0, 0, 0, 0, 0, 48},
    …
```

https://bugs.python.org/issue47243

Automerge-Triggered-By: GH:isidentical
2022-09-28 06:57:14 -07:00
Benjamin Peterson fd1e477f53 closes gh-96734: Update to Unicode 15.0.0. (GH-96809) 2022-09-13 15:45:12 -07:00
Carl Friedrich Bolz-Tereick 9c197bc8bf GH-96172 fix unicodedata.east_asian_width being wrong on unassigned code points (#96207) 2022-08-26 19:29:39 +03:00
Carl Friedrich Bolz-Tereick 2d9f252c0c gh-96019: Fix caching of decompositions in makeunicodedata (GH-96020) 2022-08-19 12:20:44 +03:00
Benjamin Peterson 024fda47d4 closes bpo-45190: Update Unicode data to version 14.0.0. (GH-28336) 2021-09-14 11:00:38 -07:00
Benjamin Peterson 51796e5d26 Update some www.unicode.org URLs to use HTTPS. (GH-18912) 2020-03-10 21:10:59 -07:00
Benjamin Peterson 051b9d08d1 closes bpo-39926: Update Unicode to 13.0.0. (GH-18910) 2020-03-10 20:41:34 -07:00
Greg Price a65678c5c9 bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. (GH-15265)
Now the fields have names!  Much easier to keep straight as a
reader than the elements of an 18-tuple.

Runs about 10-15% slower: from 10.8s to 12.3s, on my laptop.
Fortunately that's perfectly fine for this maintenance script.
2019-09-12 10:23:43 +01:00
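The readability gain described above can be illustrated with a small sketch. It assumes a record trimmed to three fields instead of eighteen, and the class and field names (`Record`, `codepoint`, `name`, `general_category`) are illustrative rather than the ones used in the script.

```python
from dataclasses import dataclass

# One UnicodeData.txt-style line, trimmed to three fields for brevity.
line = "0041;LATIN CAPITAL LETTER A;Lu"

# Before: positional access; the reader must remember what index 2 means.
fields = line.split(";")
category_by_index = fields[2]

# After: named access. Field names here are illustrative, not the script's.
@dataclass
class Record:
    codepoint: str
    name: str
    general_category: str

record = Record(*fields)
category_by_name = record.general_category
print(category_by_index, category_by_name)
```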
Greg Price 3e4498d35c bpo-37760: Avoid cluttering work tree with downloaded Unicode files. (GH-15128) 2019-08-14 18:18:53 -07:00
Greg Price c03e698c34 bpo-37760: Factor out standard range-expanding logic in makeunicodedata. (GH-15248)
Much like the lower-level logic in commit ef2af1ad4, we had
4 copies of this logic, written in a couple of different ways.
They're all implementing the same standard, so write it just once.
2019-08-13 19:28:38 -07:00
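For context, UnicodeData.txt abbreviates large ranges as a pair of lines whose names end in `, First>` and `, Last>`. The following is one way to write the expansion once, not the script's actual code. The function name `expand_ranges` is hypothetical, and the sample range is shortened for illustration.

```python
def expand_ranges(records):
    """Yield (code_point, fields) pairs, expanding First/Last range pairs."""
    range_start = None
    for fields in records:
        code_point = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            # Remember where the range begins; the matching Last line follows.
            range_start = code_point
        elif name.endswith(", Last>"):
            if range_start is None:
                raise ValueError(f"range end without start: {name}")
            for cp in range(range_start, code_point + 1):
                yield cp, fields
            range_start = None
        else:
            yield code_point, fields

# Sample data only: the real Extension A range is much longer.
sample = [
    ["0041", "LATIN CAPITAL LETTER A", "Lu"],
    ["3400", "<CJK Ideograph Extension A, First>", "Lo"],
    ["3403", "<CJK Ideograph Extension A, Last>", "Lo"],
]
expanded = list(expand_ranges(sample))
print([hex(cp) for cp, _ in expanded])
```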
Greg Price 99d208efed bpo-37760: Constant-fold some old options in makeunicodedata. (GH-15129)
The `expand` option was introduced in 2000 in commit fad27aee1.
It appears to have been always set since it was committed, and
what it does is tell the code to do something essential.  So,
just always do that, and cut the option.

Also cut the `linebreakprops` option, which isn't consulted anymore.
2019-08-12 22:59:30 -07:00
Greg Price ef2af1ad44 bpo-37760: Factor out the basic UCD parsing logic of makeunicodedata. (GH-15130)
There were 10 copies of this, and almost as many distinct versions of
exactly how it was written.  They're all implementing the same
standard.  Pull them out to the top, so the more interesting logic
that remains becomes easier to read.
2019-08-12 22:20:56 -07:00
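The shared standard here is the UCD file format: `;`-separated fields, `#` comments, and blank lines that carry no data. A minimal parser for it might look like this, though it is not the code from the commit. The name `parse_ucd_lines` is hypothetical, and the sample text is only shaped like a UCD data file.

```python
def parse_ucd_lines(lines):
    """Yield a list of stripped fields for each data line."""
    for line in lines:
        # Drop trailing comments, then skip lines that are empty.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        yield [field.strip() for field in line.split(";")]

# Sample input only, shaped like a UCD data file.
sample = """\
# EastAsianWidth-style data (sample only)
0020          ; Na # Zs SPACE

1100..115F    ; W  # Lo HANGUL CHOSEONG
""".splitlines()
rows = list(parse_ucd_lines(sample))
print(rows)
```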
Stefan Behnel faa2948654 Clean up and reduce visual clutter in the makeunicodedata.py script. (GH-7558) 2019-06-01 21:49:03 +02:00
Benjamin Peterson 3aca40d3cb closes bpo-36861: Update Unicode database to 12.1.0. (GH-13214)
Adds ㋿.
2019-05-08 20:59:35 -07:00
Inada Naoki 6fec905de5 bpo-36642: make unicodedata const (GH-12855) 2019-04-17 08:40:34 +09:00
Benjamin Peterson 738c19f4c5 closes bpo-33376: Update to Unicode 12.0.0. (GH-12256) 2019-03-09 16:25:55 -08:00
Benjamin Peterson 7c69c1c0fb update to Unicode 11.0.0 (closes bpo-33778) (GH-7439)
Also, standardize indentation of generated tables.
2018-06-06 20:14:28 -07:00
Benjamin Peterson 279a96206f bpo-30736: upgrade to Unicode 10.0 (#2344)
Straightforward. While we're at it, though, strip trailing whitespace from generated tables.
2017-06-22 22:31:08 -07:00
Jon Dufresne 3972628de3 bpo-30296 Remove unnecessary tuples, lists, sets, and dicts (#1489)
* Replaced list(<generator expression>) with list comprehension
* Replaced dict(<generator expression>) with dict comprehension
* Replaced set(<list literal>) with set literal
* Replaced builtin func(<list comprehension>) with func(<generator
  expression>) when supported (e.g. any(), all(), tuple(), min(), &
  max())
2017-05-18 07:35:54 -07:00
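The four replacements listed in this commit can be shown as before/after pairs. This sketch uses made-up data, and the variable names are illustrative; each pair evaluates to the same result.

```python
values = [3, 1, 2]

# list(<generator expression>) -> list comprehension
squares_before = list(v * v for v in values)
squares_after = [v * v for v in values]

# dict(<generator expression>) -> dict comprehension
index_before = dict((v, i) for i, v in enumerate(values))
index_after = {v: i for i, v in enumerate(values)}

# set(<list literal>) -> set literal
flags_before = set(["a", "b"])
flags_after = {"a", "b"}

# func(<list comprehension>) -> func(<generator expression>)
any_before = any([v > 2 for v in values])
any_after = any(v > 2 for v in values)

print(squares_before == squares_after, index_before == index_after,
      flags_before == flags_after, any_before == any_after)
```

The generator form passed to `any()` also avoids building an intermediate list.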
Benjamin Peterson 6775231597 Unicode 9.0.0
Not completely mechanical since support for East Asian Width changes—emoji
codepoints became Wide—had to be added to unicodedata.
2016-09-14 23:53:47 -07:00
Benjamin Peterson 4801383c29 upgrade to Unicode 8.0.0 2015-06-27 15:45:56 -05:00
R David Murray 2623a5db6f Merge: #18176: Change generic UCD PropList link to version specific link. 2014-10-09 20:47:31 -04:00
R David Murray 5f16f90d1b #18176: Change generic UCD PropList link to version specific link. 2014-10-09 20:45:59 -04:00
R David Murray 532783bd5e Merge: #18176: fix another reference and add it to the makeunicodedata comment. 2014-10-09 17:41:55 -04:00
R David Murray 5bd62420f4 #18176: fix another reference and add it to the makeunicodedata comment. 2014-10-09 17:39:48 -04:00
R David Murray 5ac125cde3 Merge: #18176: updated stdtypes UCD link, added reminder to makeunicodedata. 2014-10-09 17:33:15 -04:00
R David Murray 7445a383a6 #18176: updated stdtypes UCD link, added reminder to makeunicodedata.
Patch by Alexander Belopolsky.
2014-10-09 17:30:33 -04:00
Benjamin Peterson 3032ed7cb1 upgrade to unicode 7.0.0 2014-07-06 13:04:20 -07:00
Benjamin Peterson 94d08d908b upgrade unicode db to 6.3.0 (closes #19221) 2013-10-10 17:24:45 -04:00
Ezio Melotti d640fe2af5 #18803: merge with 3.3. 2013-08-26 01:33:30 +03:00
Ezio Melotti 7c4a7e6f3c #18803: fix more typos. Patch by Févry Thibault. 2013-08-26 01:32:56 +03:00
Antoine Pitrou 9ed5f27266 Issue #18722: Remove uses of the "register" keyword in C code. 2013-08-13 20:18:52 +02:00
Benjamin Peterson b8350f1c7d upgrade to UCD 6.2 2012-09-29 13:47:39 -04:00
Florent Xicluna c20740109d Some cleanup in the Tools directory. 2012-07-07 17:03:54 +02:00
Benjamin Peterson 71f660e00f update to Unicode 6.1 2012-02-20 22:24:29 -05:00
Benjamin Peterson ad9c569825 delta encoding of upper/lower/title makes a glorious return (#12736) 2012-01-15 21:19:20 -05:00
Benjamin Peterson d5890c8db5 add str.casefold() (closes #13752) 2012-01-14 13:23:30 -05:00
Benjamin Peterson b2bf01d824 use full unicode mappings for upper/lower/title case (#12736)
Also broaden the category of characters that count as lowercase/uppercase.
2012-01-11 18:17:06 -05:00
Ezio Melotti 931b8aac80 #12753: Add support for Unicode name aliases and named sequences. 2011-10-21 21:57:36 +03:00