tech-pkg archive
Re: nawk-20230911 problem
Paolo Vincenzo Olivo <vins%netbsd.org@localhost> writes:
> lang/nawk is now at 20230909 (pre-UTF8); looking forward to testing and
> upcoming bulk build reports.
Great, thanks.
> According to gawk's manual - 11.2.7.1 Modern Character Sets - [1] POSIX
> requires awk to work in terms of characters, not bytes, which is what
> nawk 2nd edition does.
> However, according to the Open Group specification [2], this should depend
> on the system locale:
>
> ```
> The following environment variables shall affect the execution of awk:
>
> [..]
>
> LC_CTYPE
> Determine the locale for the interpretation of sequences of bytes of
> text data as characters (for example, single-byte as opposed to
> multi-byte characters in arguments and input files), the behavior of
> character classes within regular expressions, the identification of
> characters as letters, and the mapping of uppercase and lowercase
> characters for the toupper and tolower functions.
> ```
>
> As far as I can see, this matches gawk's current behaviour, and is
> reflected by the fact that mozilla-rootcerts sets LC_ALL=C to allow gawk
> to parse DER cert files.
Thanks; that makes sense.
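
To make the bytes-versus-characters distinction in the quoted LC_CTYPE text concrete, here is a minimal sketch. It uses Python purely as a stand-in, not awk, and the sample string is an arbitrary assumption; it only shows the two counts that a byte-oriented and a character-oriented length() would be expected to report for the same input.

```python
# Illustrative stand-in only: this is not awk, and the sample word is arbitrary.
text = "na\u00efve"              # "naive" with U+00EF (i with diaeresis)
data = text.encode("utf-8")      # U+00EF takes two bytes in UTF-8

char_count = len(text)           # count when input is interpreted as characters
byte_count = len(data)           # count when input is interpreted as bytes (C locale view)

print(char_count, byte_count)
```

The two numbers differ only because of the single non-ASCII character; for pure ASCII input both interpretations agree.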
> Apparently nawk 2nd edition is only capable of parsing strings as
> Unicode code points and its behaviour is unaffected by locale
> environment variables.
That is a bug then, which I think is just about what you said regarding POSIX.
>> pkgsrc needs to define how the awk tool behaves. I don't see any
>> approach being reasonable other than defining an awk(1) tool, and then
>> defining an awk2 tool if packages are going to need awk2.
>
> A quick search suggests that gawk's Unicode support may be only
> partial and prone to pitfalls. This would make nawk2 the only nawk
> implementation to fully support Unicode and likely the preferred tool to
> be used by packages which rely on awk to process multibyte-encoded data.
>
> [1] https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html
> [2] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html
It seems there are then:
  - things that need awk to process bytes with LC_CTYPE=C
  - things that need awk to process utf-8
and we need both, at least until nawk2 is fixed to follow POSIX.
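
As a rough illustration of why the first category exists, the sketch below (Python again, as a stand-in rather than awk) checks whether a short byte string is valid UTF-8. The four bytes are an assumed example of how a DER-encoded SEQUENCE with a two-byte length could begin; they are not taken from any real certificate, and `is_valid_utf8` is a hypothetical helper name.

```python
# Assumed sample: DER SEQUENCE tag (0x30), long-form length marker (0x82),
# then two length bytes. Not taken from a real certificate.
der_prefix = bytes([0x30, 0x82, 0x03, 0x5A])

def is_valid_utf8(buf: bytes) -> bool:
    """Hypothetical helper: True if buf decodes cleanly as UTF-8."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0x82 is a UTF-8 continuation byte with no lead byte before it,
# so a strictly UTF-8 reader cannot interpret this input as characters.
utf8_ok = is_valid_utf8(der_prefix)

# A one-byte-per-character view round-trips every byte unchanged.
round_trip = der_prefix.decode("latin-1").encode("latin-1")

print(utf8_ok, round_trip == der_prefix)
```

Binary input like this can only be handled reliably when each byte is treated as its own unit, which is what the LC_CTYPE=C case asks of awk.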
I have the impression NetBSD's locale support is a bit off and we don't
really deal with LC_CTYPE properly. I wonder if that's the real issue.