NetBSD-Bugs archive


bin/58619: nawk 2024-08-17 broken and incompatible for non-UTF-8 and non-C locales



>Number:         58619
>Category:       bin
>Synopsis:       nawk 2024-08-17 broken and incompatible for non-UTF-8 and non-C locales
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Aug 20 07:10:00 +0000 2024
>Originator:     Rin Okuyama
>Release:        10.99.11
>Organization:
Internet Initiative Japan Inc.
>Environment:
NetBSD rp64 10.99.11 NetBSD 10.99.11 (GENERIC64) #2: Tue Aug 20 13:15:56 JST 2024  rin@dancena:/home/rin/src/sys/arch/evbarm/compile/GENERIC64 evbarm
>Description:
nawk 2024-08-17 has recently been imported as /usr/bin/awk.

This version is based on the "2nd edition", but compatibility with
8-bit-clean single-byte locales such as "C" seems to have been improved:

https://github.com/onetrueawk/awk/commit/1087d46

(BTW, their documentation is *REALLY* poor.)

However, it still gives broken results for non-UTF-8 multibyte
locales. The results are not only broken but also incompatible with
older versions, at least for non-8-bit-clean multibyte locales.

For example, in previous versions the length() builtin counts the
number of bytes for, e.g., ja_JP.eucJP. The new version instead
counts the number of characters, with the input misinterpreted as UTF-8 :(
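
To make the byte/character distinction concrete, here is a minimal sketch.
It assumes only a POSIX shell with printf and wc. The sample string and the
helper name eucjp_sample are illustrative and are not taken from euc.txt,
and the sketch does not try to reproduce nawk's UTF-8 decoder.

```shell
# Hypothetical helper: emit the string "日本語" encoded as EUC-JP.
# It is 3 characters, each encoded as 2 bytes (C6 FC, CB DC, B8 EC).
# Octal escapes are used because \x escapes are not portable in printf(1).
eucjp_sample() {
    printf '\306\374\313\334\270\354'
}

# wc -c counts bytes regardless of locale, which is what a
# byte-counting length() would report for this string.
eucjp_sample | wc -c
```

wc reports 6 here, while the string holds 3 characters. Any
implementation that decodes these bytes under a different encoding
can arrive at yet another number.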
>How-To-Repeat:
Try euc.txt, which I converted to EUC-JP from
http://www.jp.netbsd.org/ja/JP/index.html

---
$ ftp https://www.netbsd.org/~rin/euc.txt
...
$ env LC_CTYPE=ja_JP.eucJP \
awk 'BEGIN{sum = 0} {sum += length($0)} END{print sum}' euc.txt
---

Older versions and 2024-08-17 give 10978 and 10418, respectively.
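
One way to check a byte-based total independently of awk: length($0)
excludes the newline that terminates each record, so a byte-counting sum
should equal the file size minus the number of lines. The following is
only a sketch under the assumption that every line of the file ends with
a newline; byte_length_sum is a hypothetical helper name.

```shell
# Hypothetical helper: the total that summing length($0) over a file
# would give if length() counted bytes.
# Assumes each line, including the last, ends with a newline.
byte_length_sum() {
    echo $(( $(wc -c < "$1") - $(wc -l < "$1") ))
}

# Usage, with the file fetched above:
# byte_length_sum euc.txt
```

If the older behavior is pure byte counting, this value should agree with
the result from the older versions.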
>Fix:
A fix for just the example above:

https://gist.github.com/rokuyama/c7e6d12b6a7bcad0704f706c4f7e9569

However, I'm still not sure whether the "2nd edition" of
nawk should be used at all...


