tech-pkg archive


Re: nawk-20230911 problem



Paolo Vincenzo Olivo <vins%netbsd.org@localhost> writes:

> lang/nawk is now at 20230909 (pre-UTF8); looking forward to testing and
> upcoming bulk build reports.

Great, thanks.

> According to gawk's manual - 11.2.7.1 Modern Character Sets - [1] POSIX
> requires awk to work in terms of characters, not bytes, which is what
> nawk 2nd edition does. 
> However, according to the Open Group specification [2], this should depend
> on the system locale:
>
> ```
> The following environment variables shall affect the execution of awk:
>
> [..]
>
> LC_CTYPE
>     Determine the locale for the interpretation of sequences of bytes of
>     text data as characters (for example, single-byte as opposed to
>     multi-byte characters in arguments and input files), the behavior of
>     character classes within regular expressions, the identification of
>     characters as letters, and the mapping of uppercase and lowercase
>     characters for the toupper and tolower functions.
> ```
>
> As far as I can see, this matches gawk's current behaviour, and is
> reflected by the fact that mozilla-rootcerts sets LC_ALL=C to allow gawk
> to parse DER cert files.

Thanks; that makes sense.

> Apparently nawk 2nd edition is only capable of parsing strings as
> Unicode code points and its behaviour is unaffected by locale
> environment variables.

That is a bug then, which is what I think you just about said about POSIX.

>> pkgsrc needs to define how the awk tool behaves.  I don't see any
>> approach being reasonable other than defining an awk(1) tool, and then
>> defining an awk2 tool if packages are going to need awk2.
>
> A quick search suggests that gawk's Unicode support may be only
> partial and prone to pitfalls. This would make nawk2 the only nawk
> implementation to fully support Unicode and likely the preferred tool to
> be used by packages which rely on awk to process multibyte encoded data.
>
> [1] https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html
> [2] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html

It seems there are then:

  things that need awk to process bytes with LC_CTYPE=C

  things that need awk to process utf-8

and we need both, at least until nawk2 is fixed to follow POSIX.
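
A rough way to see which of the two cases a given awk falls into is to
feed it a two-byte UTF-8 sequence and ask for its length under each
locale.  This is only a sketch: awk_length is a made-up helper name, and
en_US.UTF-8 is assumed to be an installed locale (substitute whatever
UTF-8 locale the system has).

```shell
#!/bin/sh
# Sketch only: report how the awk on PATH counts U+00E9, which is
# encoded in UTF-8 as the two bytes 0xC3 0xA9.
# awk_length is a hypothetical helper, not part of pkgsrc.
awk_length() {
    # $1 = value to use for LC_ALL while awk runs
    printf '\303\251\n' | LC_ALL="$1" awk '{ print length($0) }'
}

c_len=$(awk_length C)
# Assumption: en_US.UTF-8 exists on this system.
utf8_len=$(awk_length en_US.UTF-8)

echo "LC_ALL=C:           $c_len"
echo "LC_ALL=en_US.UTF-8: $utf8_len"
```

An awk that honors LC_CTYPE as POSIX describes would be expected to
print 2 for the C locale and 1 for the UTF-8 locale.  An awk that prints
the same number for both is not taking the locale into account, either
because it always works in bytes or because it always decodes UTF-8.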


I have the impression NetBSD's locale support is a bit off and we don't
really deal with LC_CTYPE properly.  I wonder if that's the real issue.

