NetBSD-Bugs archive
Re: bin/59766: awk does not handle RS="\0"
The following reply was made to PR bin/59766; it has been noted by GNATS.
From: Martin Neitzel <neitzel%hackett.marshlabs.gaertner.de@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: bin/59766: awk does not handle RS="\0"
Date: Mon, 17 Nov 2025 16:49:44 +0100 (CET)
I'm not at all against fixing things which are easily fixed
but for this report the following context should be kept
in mind:
(1) POSIX/Single Unix Specification says (in "Shell & Utilities", "awk"):
Input files to the awk program from any of the following sources
shall be text files:
with "text files" having a specific meaning, clarified in the "Base
Definitions", "Text File":
A file that contains characters organized into one or more
lines. The lines do not contain NUL characters [...]
That is, awk(1) is, strictly speaking, not the proper tool to deal
with "find -print0" output in the first place, and any support for
that would be a (non-portable) extension.
(2) awk's RS has a special meaning when it "is null" (the empty string): paragraphs
(separated by empty lines) become the records, lines the fields.
This was historically new with nawk ("the one true awk"), and
POSIX demands it, too:
RS
The first character of the string value of RS shall be the
input record separator; a <newline> by default. If RS
contains more than one character, the results are unspecified.
If RS is null, then records are separated by sequences
consisting of a <newline> plus one or more blank lines,
leading or trailing blank lines shall not result in empty
records at the beginning or end of the input, and a <newline>
shall always be a field separator, no matter what the value
of FS is.
The NetBSD-9-stable awk(1) man page fails to point this out, but
awk implements it just fine:
$ man awk | awk -v RS= '/split/ {print NR, $0 "\n"}'
man: Formatting manual page...
14 An input line is normally made up of fields separated by white space, or
by regular expression FS. The fields are denoted $1, $2, ..., while $0
refers to the entire line. If FS is null, the input line is split into
one field per character.
49 split(s, a, [fs])
splits the string s into array elements a[1], a[2], ..., a[n],
and returns n. The separation is done with the regular
expression fs or with the field separator FS if fs is not given.
An empty string as field separator splits the string into one
array element per character.
$
RS="" is the canonical way to set this within a script, and
I'd expect RS="\0" to behave no differently.
I haven't had a look at the suggested patch but this "paragraph
behaviour" should certainly not be broken.
Martin