56 Commits

Author SHA1 Message Date
Serhiy Storchaka a183a11db8 [3.12] gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser (GH-137837) (GH-140842) (GH-140850)
(cherry picked from commit a17c57eee5)
(cherry picked from commit 0329bd11c7)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2025-10-31 17:57:28 +01:00
Serhiy Storchaka dcf24768c9 [3.12] gh-135661: Fix CDATA section parsing in HTMLParser (GH-135665) (#137774)
"] ]>" and "]] >" no longer end the CDATA section.

Make CDATA section parsing context-dependent.
Add private method HTMLParser._set_support_cdata() to change the context.
If called with True, "<![CDATA[" starts a CDATA section which ends with "]]>".
If called with False, "<![CDATA[" starts a bogus comment which ends with ">".
(cherry picked from commit 0cbbfc4621)
2025-10-06 16:06:29 +02:00
Miss Islington (bot) f66c75f11d [3.12] gh-118350: Fix support of elements "textarea" and "title" in HTMLParser (GH-135310) (GH-136986)
(cherry picked from commit 4d02f31cdd)

Co-authored-by: Timon Viola <44016238+timonviola@users.noreply.github.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Łukasz Langa <lukasz@langa.pl>
2025-07-22 14:31:27 +02:00
Miss Islington (bot) ad695f5328 [3.12] gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908) (GH-136919)
This fixes a regression introduced in GH-135930.
(cherry picked from commit dee6501894)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2025-07-22 11:56:39 +02:00
Miss Islington (bot) ef053a92d5 [3.12] gh-102555: Fix comment parsing in HTMLParser according to the HTML5 standard (GH-135664) (GH-136273)
* "--!>" now ends the comment.
* "-- >" no longer ends the comment.
* Support abnormally ended empty comments "<!-->" and "<!--->".

---------
(cherry picked from commit 8ac7613dc8)


Co-author: Kerim Kabirov <the.privat33r+gh@pm.me>

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
2025-07-12 14:24:52 +02:00
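
The comment rules listed in this entry can be probed with a small collector. This is a minimal sketch and not part of the commit: `CommentCollector` and `comments_in` are hypothetical names, and the results for the special inputs depend on whether the running Python includes this fix, so no output is claimed for them.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: records the text of every comment seen."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def comments_in(markup):
    parser = CommentCollector()
    parser.feed(markup)
    parser.close()  # flush anything still buffered at end of input
    return parser.comments


# A well-formed comment behaves the same before and after the fix.
baseline = comments_in("<!--ok-->tail")

# Inputs touched by the fix; results vary with the interpreter version.
for markup in ("<!--a--!>b", "<!--a-- >b-->c", "<!-->x", "<!--->x"):
    print(repr(markup), comments_in(markup))
```

Running it on two interpreters, one with and one without the backport, is one way to see the change in behavior.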
Serhiy Storchaka c555f889c3 [3.12] gh-135661: Fix parsing start and end tags in HTMLParser according to the HTML5 standard (GH-135930) (GH-136268)
* Whitespace is no longer accepted between `</` and the tag name.
  E.g. `</ script>` does not end the script section.

* Vertical tab (`\v`) and non-ASCII whitespace characters are no longer
  recognized as whitespace. The only whitespace characters are `\t\n\r\f `.

* Null character (U+0000) no longer ends the tag name.

* Attributes and slashes after the tag name in end tags are now ignored,
  instead of terminating after the first `>` in quoted attribute value.
  E.g. `</script/foo=">"/>`.

* Multiple slashes and whitespaces between the last attribute and closing `>`
  are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.

* Multiple `=` between attribute name and value are no longer collapsed.
  E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".

* Whitespaces between the `=` separator and attribute name or value are no
  longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and
  "=bar", both with value None; `<a foo= bar>` produces two attributes:
  "foo" with value "" and "bar" with value None.

* Fix data loss after unclosed script or style tag (gh-86155).

Also backport test.support.subTests() (gh-135120).

---------
(cherry picked from commit 0243f97cba)

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Co-authored-by: Waylan Limberg <waylan.limberg@icloud.com>
2025-07-04 17:28:00 +02:00
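
The attribute cases described in this entry can be inspected with a short script. This is a sketch under stated assumptions: `StartTagCollector` and `start_tags_in` are hypothetical helpers, not part of html.parser. The output for the special inputs depends on the interpreter version, and a later entry in this log (GH-136908) reports the whitespace-around-"=" handling as a regression and fixes it, so no output is claimed for those inputs.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def start_tags_in(markup):
    parser = StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


# Ordinary markup: an attribute without "=" is reported with value None.
baseline = start_tags_in('<a href="x" hidden>')

# Inputs named in the commit message; results vary by Python version.
for markup in ("<a foo==bar>", "<a foo =bar>", "<a foo= bar>", "<a foo=bar/ //>"):
    print(repr(markup), start_tags_in(markup))
```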
Serhiy Storchaka ab0893fd5c [3.12] gh-135462: Fix quadratic complexity in processing special input in HTMLParser (GH-135464) (GH-135483)
End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
(cherry picked from commit 6eb6c5dbfb)
2025-07-04 00:12:10 +02:00
Dong-hee Na 157aef79b0 gh-95813: Improve HTMLParser from the view of inheritance (#95874)
* gh-95813: Improve HTMLParser from the view of inheritance

* gh-95813: Add unittest

* Address code review
2022-08-18 13:16:33 +02:00
Alberto Mardegan 562c0d7398 bpo-45421: Remove dead code from html.parser (GH-28847)
Support for HTMLParseError was removed back in 2014 with commit
73a4359eb0, but this small block was missed.
2021-10-12 10:12:21 -07:00
Christian Clauss 745c9d9dfc Fix typos in the Lib directory (GH-28775)
Fix typos in the Lib directory as identified by codespell.

Co-authored-by: Terry Jan Reedy <tjreedy@udel.edu>
2021-10-06 16:13:48 -07:00
Karl Dubost 9eb11a139f bpo-41748: Handles unquoted attributes with commas (#24072)
* bpo-41748: Adds tests for unquoted attributes with comma

* bpo-41748: Handles unquoted attributes with comma

* bpo-41748: Addresses review comments

* bpo-41748: Addresses review comments

* Adds more test cases
* Simplifies the regex for handling spaces

* bpo-41748: Moves attributes tests under the right class

* bpo-41748: Addresses review about duplicate attributes

* bpo-41748: Adds NEWS.d entry for this patch
2021-02-01 21:32:50 +01:00
Inada Naoki fae0ed5099 bpo-37328: remove deprecated HTMLParser.unescape (GH-14186)
It is deprecated since Python 3.4.
2019-08-27 11:48:06 +09:00
Motoki Naruse 3358d589fb bpo-30629: Remove second call of str.lower() in html.parser.parse_endtag. (#2099)
elem is the result of .lower() 6 lines above the handle_endtag call.
Patch by Motoki Naruse
2017-06-16 21:15:25 -04:00
Serhiy Storchaka c842efc6ae Revert "Fixed a typo in the HTMLParser.feed docstrings" (#1771)
* Revert "Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a rawstring. (#1759)"
The docstring was correct. I read the patch in the opposite direction, as *adding* the "r" prefix.
This reverts commit 5ba185039f.
2017-05-24 07:20:45 +03:00
Jani Šumak 5ba185039f Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a rawstring. (#1759) 2017-05-23 16:40:54 +03:00
R David Murray 44b548dda8 #27364: fix "incorrect" uses of escape character in the stdlib.
And most of the tools.

Patch by Emanuel Barry, reviewed by me, Serhiy Storchaka, and
Martin Panter.
2016-09-08 13:59:53 -04:00
Martin Panter 46f50726a0 Issue #27076: Doc, comment and tests spelling fixes
Most fixes to Doc/ and Lib/ directories by Ville Skyttä.
2016-05-26 05:35:26 +00:00
Ezio Melotti 20a2c6482e #23144: merge with 3.4. 2015-09-06 21:44:45 +03:00
Ezio Melotti 6f2bb98966 #23144: Make sure that HTMLParser.feed() returns all the data, even when convert_charrefs is True. 2015-09-06 21:38:06 +03:00
Ezio Melotti 6fc16d81af #21047: set the default value for the *convert_charrefs* argument of HTMLParser to True. Patch by Berker Peksag. 2014-08-02 18:36:12 +03:00
Ezio Melotti 73a4359eb0 #15114: the strict mode and argument of HTMLParser, HTMLParser.error, and the HTMLParseError exception have been removed. 2014-08-02 14:10:30 +03:00
Ezio Melotti 153d97b24e #20288: merge with 3.3. 2014-02-01 21:22:26 +02:00
Ezio Melotti f27b9a741a #20288: fix handling of invalid numeric charrefs in HTMLParser. 2014-02-01 21:21:01 +02:00
Ezio Melotti 95401c5f6b #13633: Added a new convert_charrefs keyword arg to HTMLParser that, when True, automatically converts all character references. 2013-11-23 19:52:05 +02:00
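
The effect of the convert_charrefs argument can be seen by recording parser callbacks in both modes. This is a minimal sketch: `EventCollector` and `collect` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records data and character reference callbacks."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def collect(markup, convert_charrefs):
    parser = EventCollector(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


markup = "<p>a &amp; b &#62; c</p>"
converted = collect(markup, convert_charrefs=True)
raw = collect(markup, convert_charrefs=False)

print(converted)  # references arrive already decoded inside the data
print(raw)        # references arrive through their own callbacks
```

With convert_charrefs=True the references are folded into the text passed to handle_data(); with False they are reported separately through handle_entityref() and handle_charref().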
Ezio Melotti f6de9eb2bb #19688: add back and deprecate the internal HTMLParser.unescape() method. 2013-11-22 05:49:29 +02:00
Ezio Melotti 4a9ee26750 #2927: Added the unescape() function to the html module. 2013-11-19 20:28:45 +02:00
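
A short usage sketch of html.unescape(), the function this entry adds; the sample string is made up for illustration.

```python
import html

# Named, decimal, and hexadecimal references are all converted.
escaped = "&lt;b&gt;caf&eacute; &amp; bar&lt;/b&gt; &#x3E;"
unescaped = html.unescape(escaped)
print(unescaped)  # → <b>café & bar</b> >
```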
Ezio Melotti b7038817fe #19480: merge with 3.3. 2013-11-07 18:35:27 +02:00
Ezio Melotti 7165d8b9ba #19480: HTMLParser now accepts all valid start-tag names as defined by the HTML5 standard. 2013-11-07 18:33:24 +02:00
Ezio Melotti 88ebfb129b #15114: The html.parser module now raises a DeprecationWarning when the strict argument of HTMLParser or the HTMLParser.error method are used. 2013-11-02 17:08:24 +02:00
Ezio Melotti f6ca26fbff #17802: merge with 3.3. 2013-05-01 16:20:00 +03:00
Ezio Melotti 8e596a765c #17802: Fix an UnboundLocalError in html.parser. Initial tests by Thomas Barlow. 2013-05-01 16:18:25 +03:00
Ezio Melotti 1698babd1b #14679: add an __all__ (that contains only HTMLParser) to html.parser. 2013-05-01 16:09:34 +03:00
Ezio Melotti 46495182d0 #15156: HTMLParser now uses the new "html.entities.html5" dictionary. 2012-06-24 22:02:56 +02:00
Ezio Melotti 3861d8b271 #15114: the strict mode of HTMLParser and the HTMLParseError exception are deprecated now that the parser is able to parse invalid markup. 2012-06-23 15:27:51 +02:00
Ezio Melotti 0780b6bc58 #14538: HTMLParser can now parse correctly start tags that contain a bare /. 2012-04-18 19:18:22 -06:00
Ezio Melotti 29877e8e04 HTMLParser is now able to handle slashes in the start tag. 2012-02-21 09:25:00 +02:00
Ezio Melotti e31ddedb0e Fix an index and clean up comments. 2012-02-13 20:20:00 +02:00
Ezio Melotti f4ab491901 Improve handling of declarations in HTMLParser. 2012-02-13 15:50:37 +02:00
Ezio Melotti 5211ffe4df #13993: HTMLParser is now able to handle broken end tags when strict=False. 2012-02-13 11:24:50 +02:00
Ezio Melotti fa3702dc28 #13960: HTMLParser is now able to handle broken comments when strict=False. 2012-02-10 10:45:44 +02:00
Ezio Melotti 15cb489234 #13358: HTMLParser now calls handle_data only once for each CDATA. 2011-11-18 18:01:49 +02:00
Ezio Melotti c2fe57762b #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. 2011-11-14 18:53:33 +02:00
Ezio Melotti 7de56f6a04 #670664: Fix HTMLParser to correctly handle the content of `<script>...</script>` and `<style>...</style>`. 2011-11-01 14:12:22 +02:00
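
A sketch of what handling script content means in practice: markup-like text inside `<script>` is delivered as data and is not parsed as tags. `TagAndDataCollector` is a hypothetical helper, and the data chunks are joined because the number of handle_data() calls is not assumed here.

```python
from html.parser import HTMLParser


class TagAndDataCollector(HTMLParser):
    """Hypothetical helper: records start tag names and raw data chunks."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.data.append(data)


parser = TagAndDataCollector()
parser.feed("<script>if (a < b) { document.write('<b>'); }</script>")
parser.close()

script_text = "".join(parser.data)
print(parser.tags)   # only the script element itself is reported as a tag
print(script_text)   # the '<b>' inside the script stays plain text
```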
Ezio Melotti f50ffa94ab #13273: fix a bug that prevented HTMLParser from properly detecting some tags when strict=False. 2011-10-28 13:21:09 +03:00
Ezio Melotti d9e0b068af #12888: Fix a bug in HTMLParser.unescape that prevented it from unescaping more than 128 entities. Patch by Peter Otten. 2011-09-05 17:11:06 +03:00
Éric Araujo 51b7aedadd Merge 3.1 2011-05-25 18:13:49 +02:00
Éric Araujo 39f180bb1f Fix display of html.parser.HTMLParser.feed docstring 2011-05-04 15:55:47 +02:00
Ezio Melotti 2e3607c1e7 #7311: fix html.parser to accept non-ASCII attribute values. 2011-04-07 22:03:31 +03:00
Senthil Kumaran 6c85838489 Merged revisions 87542 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/branches/py3k

........
  r87542 | senthil.kumaran | 2010-12-28 23:55:16 +0800 (Tue, 28 Dec 2010) | 3 lines

  Fix Issue10759 - html.parser.unescape() fails on HTML entities with incorrect syntax
........
2010-12-28 16:10:56 +00:00
Senthil Kumaran 164540fee1 Fix Issue10759 - html.parser.unescape() fails on HTML entities with incorrect syntax 2010-12-28 15:55:16 +00:00