Re: nawk-20230911 problem

To: Greg Troxel <gdt%lexort.com@localhost>
Subject: Re: nawk-20230911 problem
From: Paolo Vincenzo Olivo <vins%netbsd.org@localhost>
Date: Sun, 17 Sep 2023 11:15:15 +0000

On 23/09/16 08:34PM, Greg Troxel wrote:
> Certainly, I think it makes sense to change lang/nawk to be a version
> compatible with traditional awk.

lang/nawk is now at 20230909 (pre-UTF8); looking forward to testing and
to the upcoming bulk build reports.

> Whether we add awk2 right now, or we defer for more thought and
> discussion, is another matter.

Maybe we can cover the topic in a separate discussion in the future;
this will give the 2nd edition enough time to be tested (and allow any
important fixes to be patched upstream in a maintenance release).

> What does POSIX say?  Is awk2 non-compliant?  Is awk UB if not on ASCII
> only?

According to gawk's manual (11.2.7.1 Modern Character Sets) [1], POSIX
requires awk to work in terms of characters, not bytes, which is what
nawk 2nd edition does.
However, according to the Open Group specification [2], this should
depend on the system locale:

```
The following environment variables shall affect the execution of awk:

[..]

LC_CTYPE
    Determine the locale for the interpretation of sequences of bytes of
    text data as characters (for example, single-byte as opposed to
    multi-byte characters in arguments and input files), the behavior of
    character classes within regular expressions, the identification of
    characters as letters, and the mapping of uppercase and lowercase
    characters for the toupper and tolower functions.
```

As far as I can see, this matches gawk's current behaviour, and is
reflected by the fact that mozilla-rootcerts sets LC_ALL=C to allow gawk
to parse DER cert files.
Apparently nawk 2nd edition is only capable of parsing strings as
Unicode code points, and its behaviour is unaffected by locale
environment variables.
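
To make the bytes-vs-characters distinction concrete, here is a rough
sketch in Python (used only as a neutral stand-in; it does not call any
awk, and the sample string and the DER-like prefix are made up for
illustration). It shows how the same input has a different length under
character and byte semantics, and why binary data such as DER is
unlikely to be valid UTF-8, which is presumably why forcing the C locale
helps there:

```python
# Illustration only: Python stands in for awk's length() here.
# The sample data below is hypothetical, not taken from any package.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "naïve"                  # 5 characters; "ï" takes 2 bytes in UTF-8
raw = text.encode("utf-8")      # the bytes a tool would read from a file

chars_utf8 = len(raw.decode("utf-8"))  # character semantics (UTF-8 locale)
bytes_c = len(raw)                     # byte semantics (C locale)

# A DER-style prefix: 0x30 (SEQUENCE tag) followed by a long-form length.
# 0x82 cannot start a UTF-8 sequence, so this is not valid UTF-8 text.
der_prefix = bytes([0x30, 0x82, 0x03, 0x4A])

print(chars_utf8, bytes_c)                            # 5 6
print(is_valid_utf8(raw), is_valid_utf8(der_prefix))  # True False
```

An awk that honours LC_CTYPE can be switched between the two views of
the input; one that always decodes UTF-8 only ever offers the first.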

> pkgsrc needs to define how the awk tool behaves.  I don't see any
> approach being reasonable other than defining an awk(1) tool, and then
> defining an awk2 tool if packages are going to need awk2.

A quick search suggests that gawk's Unicode support may be only
partial and prone to pitfalls. This would make nawk2 the only awk
implementation to fully support Unicode, and likely the preferred tool
for packages which rely on awk to process multibyte-encoded data.

[1] https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html
