NetBSD-Bugs archive
Re: bin/58014: wc no longer works with binary files
The following reply was made to PR bin/58014; it has been noted by GNATS.
From: Michael Cheponis <michael.cheponis%gmail.com@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 9 Mar 2024 12:05:40 -0800
It's indeed the case that in my arm64 test, where 'wc' 'worked' on binary
files, the environment variable "LC_ALL=C" was set.
I think the man page for wc needs updating, at least, to explain its
interaction with that environment variable. There *is* a discussion on
that man page about needing to use the POSIX iswspace() function, but when I
followed that reference, there was no detail about the LC_ALL environment
variable.
Also, historically, wc was something like this:
#include <stdio.h>

int main(int argc, char *argv[]) {
    int character, lineCount = 0, wordCount = 0, byteCount = 0, inWord = 0;

    /* Read stdin one byte at a time; no locale is consulted. */
    while ((character = getchar()) != EOF) {
        ++byteCount;
        if (character == '\n')
            ++lineCount;
        /* Only space, newline and tab separate words. */
        if (character == ' ' || character == '\n' || character == '\t')
            inWord = 0;
        else if (inWord == 0) {
            inWord = 1;
            ++wordCount;
        }
    }

    printf("%d %d %d\n", lineCount, wordCount, byteCount);
    return 0;
}
That is, because unix 'files' are simply strings-of-bytes, it may be
meaningless to count 'words' and 'lines' -- but yes, a character count
(file size) is useful.
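For contrast with the byte-oriented loop above, here is a minimal sketch of
locale-aware counting, assuming a conversion loop built on the standard
mbrtowc() and iswspace() functions. It is not taken from NetBSD's wc source;
count_text and struct counts are hypothetical names used only for
illustration. The point is that once bytes are decoded into characters, a
byte sequence that is not valid in the current locale has no character
count at all, which is presumably where "invalid byte sequence" comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical result type for this sketch; not from NetBSD's wc.c. */
struct counts {
    long lines;
    long words;
    long chars;
};

/*
 * Count lines, words and characters in buf[0..len) under the current
 * LC_CTYPE locale.  Returns 0 on success, or -1 if the bytes do not
 * form complete, valid characters in that locale.
 */
static int
count_text(const char *buf, size_t len, struct counts *out)
{
    mbstate_t st;
    size_t off = 0;
    int in_word = 0;

    assert(buf != NULL && out != NULL);
    memset(&st, 0, sizeof(st));
    memset(out, 0, sizeof(*out));

    while (off < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf + off, len - off, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;  /* invalid or truncated multibyte sequence */
        if (n == 0)
            n = 1;      /* NUL decoded; assume it occupied one byte */

        out->chars++;
        if (wc == L'\n')
            out->lines++;
        if (iswspace((wint_t)wc))
            in_word = 0;        /* whitespace is decided by the locale */
        else if (!in_word) {
            in_word = 1;
            out->words++;
        }
        off += n;
    }
    return 0;
}
```

A caller would normally run setlocale(LC_ALL, "") from <locale.h> first so
that LANG, LC_CTYPE and LC_ALL from the environment take effect; without
that call the program stays in the "C" locale. How bytes above 0x7f are
treated in the "C" locale differs between C libraries, so the sketch makes
no claim about what a given binary file will produce there.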
Generally, I use this when I want to know source size, and the program's
executable is in the source directory as an artifact - I do "wc *"
Anyway, I'm asking for a documentation change.
Thank you,
Mike
On Sat, Mar 9, 2024 at 1:55 AM Robert Elz <kre%munnari.oz.au@localhost> wrote:
> The following reply was made to PR bin/58014; it has been noted by GNATS.
>
> From: Robert Elz <kre%munnari.OZ.AU@localhost>
> To: gnats-bugs%netbsd.org@localhost
> Cc:
> Subject: Re: bin/58014: wc no longer works with binary files
> Date: Sat, 09 Mar 2024 16:50:02 +0700
>
> Date: Sat, 9 Mar 2024 07:50:00 +0000 (UTC)
> From: michael.cheponis%gmail.com@localhost
> Message-ID: <20240309075000.E456A1A9241%mollari.NetBSD.org@localhost>
>
> | when 'wc' is given input from a binary file, it now gives the error:
> |
> | wc: hello: invalid byte sequence
>
> | (Assuming 'hello' is a binary file)
>
> wc without flags needs to count characters. What is a character depends
> upon your locale settings. Do
>
> LC_ALL=C wc hello
>
> (or prefix that with "env" if you're a csh user) and it will work.
>
> | wc works as one would expect on arm64. This error only shows up on
> | amd64
>
> More likely your default locale (LANG, LC_CTYPE or LC_ALL) is different in
> the two cases.
>
> I am not sure that it makes sense to attempt to count characters, lines, or
> words, in a binary file - what would the answers mean? If you were looking
> to get the size of the file, wc is not the right tool.
>
> I see no bug here, nor any real need to explain that a "word count" program
> isn't intended to be sane on non word/character containing files in the
> manual page.
>
>
>