pkgsrc-Changes-HG archive


[pkgsrc/trunk]: pkgsrc/textproc/py-ftfy Version 4.3.0 (December 29, 2016)



details:   https://anonhg.NetBSD.org/pkgsrc/rev/132d0d322ac9
branches:  trunk
changeset: 357055:132d0d322ac9
user:      rodent <rodent%pkgsrc.org@localhost>
date:      Thu Jan 12 00:45:43 2017 +0000

description:
Version 4.3.0 (December 29, 2016)

ftfy has gotten by for four years without dependencies on other Python libraries, but now we can spare ourselves some code and some maintenance burden by delegating certain tasks to other libraries that already solve them well. This version now depends on the html5lib and wcwidth libraries.

Feature changes:

    The remove_control_chars fixer will now remove some non-ASCII control characters as well, such as deprecated Arabic control characters and byte-order marks. Bidirectional controls are still left as is.

    This should have no impact on well-formed text, while cleaning up many characters that the Unicode Consortium deems "not suitable for markup" (see Unicode Technical Report #20).

    The unescape_html fixer uses a more thorough list of HTML entities, which it imports from html5lib.

    ftfy.formatting now uses wcwidth to compute the width that a string will occupy in a text console.
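
As a rough illustration of the kind of cleanup described for remove_control_chars, the sketch below strips byte-order marks and the deprecated format characters U+206A through U+206F (which include the Arabic form-shaping controls) while leaving bidirectional controls alone. It uses only the Python standard library, not ftfy itself. The name strip_some_controls and the exact character set are assumptions made for this example; ftfy's real table may cover more characters.

```python
# Illustrative subset only; this is NOT ftfy's actual removal table.
# U+FEFF is the byte-order mark; U+206A..U+206F are deprecated format characters.
REMOVABLE = dict.fromkeys([0xFEFF] + list(range(0x206A, 0x2070)))


def strip_some_controls(text):
    """Delete the code points listed in REMOVABLE and keep everything else."""
    # str.translate deletes any character whose table entry is None.
    return text.translate(REMOVABLE)


with_bom = "\ufeffhello\u206f"
with_bidi = "a\u200fb"  # U+200F RIGHT-TO-LEFT MARK is a bidirectional control

cleaned = strip_some_controls(with_bom)
untouched = strip_some_controls(with_bidi)
```

Well-formed text without these code points passes through unchanged, which matches the note above that the change should have no impact on such text.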

Heuristic changes:

    Updated the data file of Unicode character categories to Unicode 9, as used in Python 3.6.0. (No matter what version of Python you're on, ftfy uses the same data.)

Pending deprecations:

    The remove_bom option will become deprecated in 5.0, because it has been superseded by remove_control_chars.

    ftfy 5.0 will remove the previously deprecated name fix_text_encoding. It was renamed to fix_encoding in 4.0.

    ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users, please specify ftfy < 5 in your dependencies if you haven't already.

Version 4.2.0 (September 28, 2016)

Heuristic changes:

    Math symbols next to currency symbols are no longer considered 'weird' by the heuristic. This fixes a false positive where text that involved the multiplication sign and British pounds or euros (as in '5×£35') could turn into Hebrew letters.

    A heuristic that used to be a bonus for certain punctuation now also gives a bonus to successfully decoding other common codepoints, such as the non-breaking space, the degree sign, and the byte order mark.

    In version 4.0, we tried to "future-proof" the categorization of emoji (as a kind of symbol) to include codepoints that would likely be assigned to emoji later. The future happened, and there are even more emoji than we expected. We have expanded the range to include those emoji, too.

    ftfy is still mostly based on information from Unicode 8 (as Python 3.5 is), but this expanded range should include the emoji from Unicode 9 and 10.

    Emoji are increasingly being modified by variation selectors and skin-tone modifiers. Those codepoints are now grouped with 'symbols' in ftfy, so they fit right in with emoji, instead of being considered 'marks' as their Unicode category would suggest.

    This enables fixing mojibake that involves iOS's new diverse emoji.

    An old heuristic that wasn't necessary anymore considered Latin text with high-numbered codepoints to be 'weird', but this is normal in languages such as Vietnamese and Azerbaijani. This does not seem to have caused any false positives, but it caused ftfy to be too reluctant to fix some cases of broken text in those languages.

    The heuristic has been changed, and all languages that use Latin letters should be on even footing now.

Version 4.1.1 (April 13, 2016)

    Bug fix: in the command-line interface, the -e option had no effect on Python 3 when using standard input. Now, it correctly lets you specify a different encoding for standard input.

Version 4.1.0 (February 25, 2016)

Heuristic changes:

    ftfy can now deal with "lossy" mojibake. If your text has been run through a strict Windows-1252 decoder, such as the one in Python, it may contain the replacement character ? (U+FFFD) where 
there were bytes that are unassigned in Windows-1252.

    Although ftfy won't recover the lost information, it can now detect this situation, replace the entire lossy character with ?, and decode the rest of the characters. Previous versions would be 
unable to fix any string that contained U+FFFD.

    As an example, text in curly quotes that gets corrupted ??? like this ??? now gets fixed to be ? like this ?.

    Updated the data file of Unicode character categories to Unicode 8.0, as used in Python 3.5.0. (No matter what version of Python you're on, ftfy uses the same data.)

    Heuristics now count characters such as ~ and ^ as punctuation instead of wacky math symbols, improving the detection of mojibake in some edge cases.
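
The "lossy" mojibake case above can be reproduced with the standard library alone. The following is a minimal sketch of how the damage arises, not of how ftfy repairs it; the variable names are chosen for this example. It relies on Python's cp1252 codec leaving byte 0x9D unassigned, so decoding with errors="replace" yields U+FFFD for that byte.

```python
original = "\u201clike this\u201d"  # text wrapped in curly quotes
raw = original.encode("utf-8")      # each curly quote becomes three bytes

# Misdecode the UTF-8 bytes as Windows-1252. The closing quote ends in
# byte 0x9D, which Python's strict cp1252 codec cannot map, so that byte
# is replaced by U+FFFD and its value is lost.
garbled = raw.decode("cp1252", errors="replace")

# The opening quote decoded to three ordinary characters, so it can be
# reversed exactly by undoing the wrong decoding step.
recovered_open = garbled[:3].encode("cp1252").decode("utf-8")

# The closing quote cannot be reversed the same way: U+FFFD has no
# cp1252 encoding, so the original byte is unrecoverable.
closing_is_lossy = garbled.endswith("\ufffd")
```

Because the last character of the closing quote's mojibake is already U+FFFD, the best any fixer can do is replace the whole damaged sequence with a single replacement character, as described above.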

New features:

    A new module, ftfy.formatting, can be used to justify Unicode text in a monospaced terminal. It takes into account that each character can take up anywhere from 0 to 2 character cells.

    Internally, the utf-8-variants codec was simplified and optimized.
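
To illustrate why justifying text in a terminal needs per-character widths of 0, 1, or 2 cells, here is a rough sketch based only on the standard unicodedata module. It is an approximation written for this note, not the code in ftfy.formatting or wcwidth; the function names approx_cell_width, approx_text_width, and ljust_cells are hypothetical.

```python
import unicodedata


def approx_cell_width(ch):
    """Approximate terminal width of one character: 0, 1, or 2 cells."""
    # Combining marks and format characters are treated as zero-width here.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters occupy two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def approx_text_width(text):
    """Sum the approximate cell widths of all characters in text."""
    return sum(approx_cell_width(ch) for ch in text)


def ljust_cells(text, width, fill=" "):
    """Left-justify text to a number of terminal cells, not code points."""
    return text + fill * max(0, width - approx_text_width(text))


padded = ljust_cells("\u65e5\u672c", 6)  # two wide CJK characters fill 4 cells
```

Padding by len(text) instead would give the CJK string four fill characters rather than two, which is the misalignment a width-aware module avoids.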

Version 4.0.0 (April 10, 2015)

Breaking changes:

    The default normalization form is now NFC, not NFKC. NFKC replaces a large number of characters with 'equivalent' characters; some of these replacements are useful, but others are not desirable by default.

    The fix_text function has some new options that perform more targeted operations that are part of NFKC normalization, such as fix_character_width, without requiring hitting all your text with the huge mallet that is NFKC.
        If you were already using NFC normalization, or in general if you want to preserve the spacing of CJK text, you should be sure to set fix_character_width=False.

    The remove_unsafe_private_use parameter has been removed entirely, after two versions of deprecation. The function name fix_bad_encoding is also gone.
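
The difference between NFC and NFKC mentioned above can be seen with the standard unicodedata module. This sketch does not call ftfy; the helper normalize_both and the sample strings are chosen for this example.

```python
import unicodedata


def normalize_both(text):
    """Return the (NFC, NFKC) normalizations of text."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


fullwidth = "\uff26\uff34\uff26\uff39"  # "FTFY" in fullwidth Latin letters
ligature = "\ufb01x"                     # U+FB01 LATIN SMALL LIGATURE FI, then "x"
decomposed = "e\u0301"                   # "e" followed by a combining acute accent

fullwidth_nfc, fullwidth_nfkc = normalize_both(fullwidth)
ligature_nfc, ligature_nfkc = normalize_both(ligature)
decomposed_nfc, decomposed_nfkc = normalize_both(decomposed)
```

NFC only composes the accented letter and leaves the fullwidth letters and the ligature alone, while NFKC also rewrites them to plain ASCII. Width changes of that kind are what alter the spacing of CJK text.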

New features:

    Fixers for strange new forms of mojibake, including particularly clear cases of mixed UTF-8 and Windows-1252.

    New heuristics, so that ftfy can fix more stuff, while maintaining approximately zero false positives.

    The command-line tool trusts you to know what encoding your input is in, and assumes UTF-8 by default. You can still tell it to guess with the -g option.

    The command-line tool can be configured with options, and can be used as a pipe.

    Recognizes characters that are new in Unicode 7.0, as well as emoji from Unicode 8.0+ that may already be in use on iOS.

Deprecations:

    fix_text_encoding is being renamed again, for conciseness and consistency. It's now simply called fix_encoding. The name fix_text_encoding is available but emits a warning.

Pending deprecations:

    Python 2.6 support is largely coincidental.

    Python 2.7 support is on notice. If you use Python 2, be sure to pin a version of ftfy less than 5.0 in your requirements.

diffstat:

 textproc/py-ftfy/Makefile |   4 ++--
 textproc/py-ftfy/PLIST    |   5 ++++-
 textproc/py-ftfy/distinfo |  10 +++++-----
 3 files changed, 11 insertions(+), 8 deletions(-)

diffs (43 lines):

diff -r d33fef476bad -r 132d0d322ac9 textproc/py-ftfy/Makefile
--- a/textproc/py-ftfy/Makefile Thu Jan 12 00:45:31 2017 +0000
+++ b/textproc/py-ftfy/Makefile Thu Jan 12 00:45:43 2017 +0000
@@ -1,6 +1,6 @@
-# $NetBSD: Makefile,v 1.4 2017/01/03 13:23:04 jperkin Exp $
+# $NetBSD: Makefile,v 1.5 2017/01/12 00:45:43 rodent Exp $
 
-DISTNAME=      ftfy-3.4.0
+DISTNAME=      ftfy-4.2.0
 PKGNAME=       ${PYPKGPREFIX}-${DISTNAME}
 CATEGORIES=    python textproc
 MASTER_SITES=  ${MASTER_SITE_PYPI:=f/ftfy/}
diff -r d33fef476bad -r 132d0d322ac9 textproc/py-ftfy/PLIST
--- a/textproc/py-ftfy/PLIST    Thu Jan 12 00:45:31 2017 +0000
+++ b/textproc/py-ftfy/PLIST    Thu Jan 12 00:45:43 2017 +0000
@@ -1,4 +1,4 @@
-@comment $NetBSD: PLIST,v 1.1 2015/04/02 22:36:59 rodent Exp $
+@comment $NetBSD: PLIST,v 1.2 2017/01/12 00:45:43 rodent Exp $
 bin/ftfy${PYVERSSUFFIX}
 ${PYSITELIB}/${EGG_INFODIR}/PKG-INFO
 ${PYSITELIB}/${EGG_INFODIR}/SOURCES.txt
@@ -36,3 +36,6 @@
 ${PYSITELIB}/ftfy/fixes.py
 ${PYSITELIB}/ftfy/fixes.pyc
 ${PYSITELIB}/ftfy/fixes.pyo
+${PYSITELIB}/ftfy/formatting.py
+${PYSITELIB}/ftfy/formatting.pyc
+${PYSITELIB}/ftfy/formatting.pyo
diff -r d33fef476bad -r 132d0d322ac9 textproc/py-ftfy/distinfo
--- a/textproc/py-ftfy/distinfo Thu Jan 12 00:45:31 2017 +0000
+++ b/textproc/py-ftfy/distinfo Thu Jan 12 00:45:43 2017 +0000
@@ -1,6 +1,6 @@
-$NetBSD: distinfo,v 1.2 2015/11/04 02:00:02 agc Exp $
+$NetBSD: distinfo,v 1.3 2017/01/12 00:45:43 rodent Exp $
 
-SHA1 (ftfy-3.4.0.tar.gz) = 143f8eb98ae9e2f8e2c861cf654acdda023de18d
-RMD160 (ftfy-3.4.0.tar.gz) = de9d0f9cc874b6c1c18fb0a5800d3c3979b120f4
-SHA512 (ftfy-3.4.0.tar.gz) = a2f4161ac236035fc5127858a830dd00c8cdf229acdbe62a4304c2f4f6eec5813a4908ee3f55ae5ef754b1e202c9a3d0be2f9bb6bda5b39f62067c6fbc1c4b10
-Size (ftfy-3.4.0.tar.gz) = 26845 bytes
+SHA1 (ftfy-4.2.0.tar.gz) = 31b504c7abb80286210c4d484fd92e2717226232
+RMD160 (ftfy-4.2.0.tar.gz) = 9e0de31674bd19eb8f29fc1895c5db65c72628e4
+SHA512 (ftfy-4.2.0.tar.gz) = db46f865ec69ca28d2795b9f1dafc45c8968a4eda8d5aafc468f6fd027f37f81c417494ce289a636b80b9f055a7ea61dbf908c504717a56316bdab7a82b1d8a4
+Size (ftfy-4.2.0.tar.gz) = 35139 bytes


