NetBSD-Bugs archive


Re: bin/59766: awk does not handle RS="\0"



The following reply was made to PR bin/59766; it has been noted by GNATS.

From: Martin Neitzel <neitzel%hackett.marshlabs.gaertner.de@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: bin/59766: awk does not handle RS="\0"
Date: Mon, 17 Nov 2025 16:49:44 +0100 (CET)

 I'm not at all against fixing things which are easily fixed
 but for this report the following context should be kept
 in mind:
 
 (1) POSIX/Single Unix Specification says (in "Shell & Utilities", "awk"):
 
 	Input files to the awk program from any of the following sources
 	shall be text files:
 
 with "text files" having a specific meaning, clarified in the "Base
 Definitions", "Text File":
 
 	A file that contains characters organized into one or more
 	lines.  The lines do not contain NUL characters [...]
 
 That is, awk(1) is, strictly speaking, not the proper tool to deal
 with "find -print0" output in the first place, and any support for
 that would be a (non-portable) extension.
 
 
 (2) awk's RS has a special meaning when it "is NULL":  paragraphs
 (separated by empty lines) become the records, lines the fields.
 This was historically new with nawk ("the one true awk"), and
 POSIX demands it, too:
 
 RS
 	The first character of the string value of RS shall be the
 	input record separator; a <newline> by default. If RS
 	contains more than one character, the results are unspecified.
 	If RS is null, then records are separated by sequences
 	consisting of a <newline> plus one or more blank lines,
 	leading or trailing blank lines shall not result in empty
 	records at the beginning or end of the input, and a <newline>
 	shall always be a field separator, no matter what the value
 	of FS is.
 
 The NetBSD-9-stable awk(1) man page fails to point this out, but the
 implementation handles it just fine:
 
 $ man awk | awk -v RS= '/split/ {print NR, $0 "\n"}'
 man: Formatting manual page...
 14      An input line is normally made up of fields separated by white space, or
      by regular expression FS.  The fields are denoted $1, $2, ..., while $0
      refers to the entire line.  If FS is null, the input line is split into
      one field per character.
 
 49      split(s, a, [fs])
              splits the string s into array elements a[1], a[2], ..., a[n],
              and returns n.  The separation is done with the regular
              expression fs or with the field separator FS if fs is not given.
              An empty string as field separator splits the string into one
              array element per character.
 
 $ 
 
 Setting   RS=""   is the canonical way to enable this within a script,
 and I'd assume   RS="\0"   to act no differently.
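
 For reference, a small self-contained sketch of the paragraph
 behaviour with RS set inside the script.  The function name
 paragraphs and the sample input are invented for this illustration.

```shell
# RS = "" selects paragraph mode: runs of blank lines separate records,
# and a newline always separates fields, whatever FS is.
paragraphs() {
    printf 'a b\nc\n\n\nd e\n' |
        awk 'BEGIN { RS = "" } { print NR ": " NF " fields, first=" $1 }'
}

paragraphs
```

 The two blank-line-separated blocks come out as two records:
 "1: 3 fields, first=a" and "2: 2 fields, first=d".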
 
 
 I haven't had a look at the suggested patch but this "paragraph
 behaviour" should certainly not be broken.
 
 Martin
 

