NetBSD-Bugs archive


Re: bin/58014: wc no longer works with binary files



The following reply was made to PR bin/58014; it has been noted by GNATS.

From: Michael Cheponis <michael.cheponis%gmail.com@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 9 Mar 2024 12:05:40 -0800

 It is indeed the case that in my arm64 test, where 'wc' 'worked' on binary
 files, the environment variable "LC_ALL=C" was set.
 
 I think the man page for wc needs updating, at least, to explain its
 interaction with that environment variable.   There *is* a discussion on
 that man page about needing to use the POSIX iswspace() function, but when I
 followed that reference, there was no detail about the LC_ALL environment
 variable.
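
As an illustration of why the locale matters here, the following is a minimal sketch (not the actual wc source) of counting characters in a byte buffer with the standard mbrtowc() function. The helper name count_chars_in_locale and its return convention are assumptions made for this example only.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of wc: count the characters in
 * buf[0..len) as decoded under the named LC_CTYPE locale.
 * Returns the character count, or -1 if the locale cannot be set or
 * the buffer holds a byte sequence that is invalid (or truncated)
 * in that locale.  Note that setlocale() changes process-wide state.
 */
long
count_chars_in_locale(const char *locale, const char *buf, size_t len)
{
    mbstate_t st;
    long count = 0;

    if (setlocale(LC_CTYPE, locale) == NULL)
        return -1;
    memset(&st, 0, sizeof(st));    /* start in the initial shift state */

    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf, len, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;             /* invalid or incomplete sequence */
        if (n == 0)
            n = 1;                 /* a NUL byte still counts as one character */
        buf += n;
        len -= n;
        count++;
    }
    return count;
}
```

Called with "C" as the locale, each byte is normally decoded as one character, so arbitrary binary data is counted without complaint on systems whose C locale is 8-bit clean. Called with a UTF-8 locale (for example "en_US.UTF-8", assuming it is installed), a byte such as 0xff is not a valid sequence and the helper returns -1, which is presumably the same kind of failure that wc reports as "invalid byte sequence".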
 
 Also, historically, wc was something like this:
 
 #include <stdio.h>

 int main(int argc, char *argv[]) {
     int character, lineCount = 0, wordCount = 0, byteCount = 0, inWord = 0;
 
     /* Read stdin one byte at a time; no locale or multibyte handling. */
     while ((character = getchar()) != EOF) {
         ++byteCount;
         if (character == '\n')
             ++lineCount;
         /* A word starts at the first non-blank byte after a blank. */
         if (character == ' ' || character == '\n' || character == '\t')
             inWord = 0;
         else if (inWord == 0) {
             inWord = 1;
             ++wordCount;
         }
     }
 
     printf("%d %d %d\n", lineCount, wordCount, byteCount);
     return 0;
 }
 
 That is, because Unix 'files' are simply strings-of-bytes, it may be
 meaningless to count 'words' and 'lines' -- but yes, the character count
 (file size) is useful.
 
 Generally, I use this when I want to know source size, and the program's
 executable is in the source directory as an artifact - I do "wc *"
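
For the file-size use case on its own, the byte count can be read from the file's metadata without decoding any characters. A minimal sketch, assuming a POSIX system; file_size_bytes is a hypothetical helper name used only for this example:

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: return the size of the file at path in bytes,
 * or -1 if stat() fails (for example, because the file does not exist).
 * The result does not depend on LANG, LC_CTYPE or LC_ALL.
 */
long long
file_size_bytes(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) == -1)
        return -1;
    return (long long)sb.st_size;
}
```

Because nothing is decoded, this works the same way for source files and for executables sitting in the same directory.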
 
 Anyway, I'm asking for a documentation change.
 
 Thank you,
 Mike
 
 On Sat, Mar 9, 2024 at 1:55 AM Robert Elz <kre%munnari.oz.au@localhost> wrote:
 
 > The following reply was made to PR bin/58014; it has been noted by GNATS.
 >
 > From: Robert Elz <kre%munnari.OZ.AU@localhost>
 > To: gnats-bugs%netbsd.org@localhost
 > Cc:
 > Subject: Re: bin/58014: wc no longer works with binary files
 > Date: Sat, 09 Mar 2024 16:50:02 +0700
 >
 >      Date:        Sat,  9 Mar 2024 07:50:00 +0000 (UTC)
 >      From:        michael.cheponis%gmail.com@localhost
 >      Message-ID:  <20240309075000.E456A1A9241%mollari.NetBSD.org@localhost>
 >
 >    | when 'wc' is given input from a binary file, it now gives the error:
 >    |
 >    | wc: hello: invalid byte sequence
 >
 >    | (Assuming 'hello' is a binary file)
 >
 >  wc without flags needs to count characters.   What is a character depends
 >  upon your locale settings.  Do
 >
 >         LC_ALL=C wc hello
 >
 >  (or prefix that with "env" if you're a csh user) and it will work.
 >
 >    | wc works as one would expect on arm64.  This error only shows up on
 >    | amd64
 >
 >  More likely your default locale (LANG, LC_CTYPE or LC_ALL) is different in
 >  the two cases.
 >
 >  I am not sure that it makes sense to attempt to count characters, lines, or
 >  words, in a binary file - what would the answers mean?    If you were
 >  looking to get the size of the file, wc is not the right tool.
 >
 >  I see no bug here, nor any real need to explain that a "word count"
 >  program isn't intended to be sane on non word/character containing files
 >  in the manual page.
 >
 >
 >
 


