NetBSD-Bugs archive


Re: bin/58014: wc no longer works with binary files



The following reply was made to PR bin/58014; it has been noted by GNATS.

From: Michael Cheponis <michael.cheponis%gmail.com@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 9 Mar 2024 12:16:32 -0800

 Crap.  I see what the problem is:  I want my "ll" alias to give me commas
 in the file length reported.
 
 This requires setting env vars like this:
 LANG=en_US.UTF-8
 LC_ALL=""
 LC_NUMERIC=en_US.UTF-8 locale -k thousands_sep
 
 Specifically, note that LC_ALL must be empty (set to "") or unset; any
 non-empty value overrides LC_NUMERIC.
 
 producing output like this
 -rwxr-xr-x  1 mac  users  17,096 Mar  8 23:31 n*
 -rw-r--r--  1 mac  users     560 Mar  8 23:31 n.c
 -rw-r--r--  1 mac  users     155 Mar  8 23:01 n.c~
 
 Now, if I set LC_ALL=C   (to make 'wc' count ok on binary files), then I
 get from my "ll" :
 -rwxr-xr-x  1 mac  users   17096 Mar  8 23:31 n*
 -rw-r--r--  1 mac  users     560 Mar  8 23:31 n.c
 -rw-r--r--  1 mac  users     155 Mar  8 23:01 n.c~
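
The two listings differ because of the POSIX precedence among the locale variables: a non-empty LC_ALL overrides every LC_* category variable, and a category variable in turn overrides LANG. A minimal sketch of that lookup, assuming a POSIX sh; the function name effective_locale is made up for illustration and is not a standard utility:

```shell
# effective_locale CATEGORY   (hypothetical helper, not a standard utility)
# Prints the locale name that applies to CATEGORY (e.g. LC_NUMERIC)
# under the POSIX precedence: LC_ALL, then CATEGORY itself, then LANG.
# Empty values are treated the same as unset ones.
effective_locale() {
    category_name=$1
    # Indirect lookup of the variable whose name is in $category_name.
    eval "category_value=\${$category_name:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Implementation-defined default; assumed to be "C" here.
        printf '%s\n' C
    fi
}

# The setup from the message: LC_ALL empty, LC_NUMERIC set.
( LANG=en_US.UTF-8; LC_ALL=; LC_NUMERIC=en_US.UTF-8
  effective_locale LC_NUMERIC )     # prints en_US.UTF-8
# With LC_ALL=C, LC_NUMERIC is ignored.
( LANG=en_US.UTF-8; LC_ALL=C; LC_NUMERIC=en_US.UTF-8
  effective_locale LC_NUMERIC )     # prints C
```

The helper only mirrors the lookup order; it does not check that the named locale is actually installed on the system.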
 
 Catch-22: I have to use an alias for wc that changes the environment
 variable only for the wc invocation:
 
 alias wc="LC_ALL=C wc"
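
Aliases are only expanded where the shell expands aliases (bash, for example, skips them in non-interactive scripts by default), so the same per-command override could also be written as a shell function. A sketch assuming a POSIX sh; the function name cwc and the sample file are invented for the demonstration, and whether an unwrapped wc rejects such a file depends on the wc implementation and the locale in effect:

```shell
# cwc: hypothetical wrapper with the same effect as the alias.
# LC_ALL=C is placed in the environment of this one wc process only;
# the calling shell's own locale variables are left alone.
# "command" makes sure the real wc utility runs, not a function of that name.
cwc() {
    LC_ALL=C command wc "$@"
}

# Demo input (assumed file name): 13 bytes, one of which (octal 351)
# is not a valid UTF-8 sequence on its own.
sample="${TMPDIR:-/tmp}/sample.bin"
printf 'caf\351 au lait\n' > "$sample"

cwc -c "$sample"        # reports the 13 bytes of the demo file
```

Because the assignment is a prefix to a single command, an "ll" alias that relies on LC_NUMERIC keeps its thousands separators in the same session.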
 
 Again, I'm not sure this is sufficiently documented.   I'd be happy to make
 suggested changes to the man page(s).
 
 Thanks again,
 Mike
 
 
 On Sat, Mar 9, 2024 at 12:05 PM Michael Cheponis <michael.cheponis@gmail.com>
 wrote:
 
 > It's indeed the case that on my arm64 test of 'wc' that 'worked' on binary
 > files, the environment variable "LC_ALL=C" was set.
 >
 > I think the man page for wc needs updating, at least, to explain its
 > interaction with that environment variable.   There *is* a discussion on
 > that man page about needing to use the POSIX iswspace() function, but when I
 > followed that page, there was no detail about the LC_ALL environment
 > variable.
 >
 > Also, historically, wc was something like this:
 >
 > #include <stdio.h>
 >
 > int main(int argc, char *argv[]) {
 >     int character, lineCount = 0, wordCount = 0, byteCount = 0, inWord = 0;
 >
 >     /* Treat the input as a plain byte stream; no locale is consulted. */
 >     while ((character = getchar()) != EOF) {
 >         ++byteCount;
 >         if (character == '\n')
 >             ++lineCount;
 >         /* Space, newline and tab end a word; any other byte starts or continues one. */
 >         if (character == ' ' || character == '\n' || character == '\t')
 >             inWord = 0;
 >         else if (inWord == 0) {
 >             inWord = 1;
 >             ++wordCount;
 >         }
 >     }
 >
 >     printf("%d %d %d\n", lineCount, wordCount, byteCount);
 >     return 0;
 > }
 >
 > That is, because unix 'files' are simply strings-of-bytes, it may be
 > meaningless to count 'words' and 'lines' -- but yes, the character count
 > (file size) is useful.
 >
 > Generally, I use this when I want to know source size, and the program's
 > executable is in the source directory as an artifact - I do "wc *"
 >
 > Anyway, I'm asking for a documentation change.
 >
 > Thank you,
 > Mike
 >
 > On Sat, Mar 9, 2024 at 1:55 AM Robert Elz <kre%munnari.oz.au@localhost> wrote:
 >
 >> The following reply was made to PR bin/58014; it has been noted by GNATS.
 >>
 >> From: Robert Elz <kre%munnari.OZ.AU@localhost>
 >> To: gnats-bugs%netbsd.org@localhost
 >> Cc:
 >> Subject: Re: bin/58014: wc no longer works with binary files
 >> Date: Sat, 09 Mar 2024 16:50:02 +0700
 >>
 >>      Date:        Sat,  9 Mar 2024 07:50:00 +0000 (UTC)
 >>      From:        michael.cheponis%gmail.com@localhost
 >>      Message-ID:  <20240309075000.E456A1A9241%mollari.NetBSD.org@localhost>
 >>
 >>    | when 'wc' is given input from a binary file, it now gives the error:
 >>    |
 >>    | wc: hello: invalid byte sequence
 >>
 >>    | (Assuming 'hello' is a binary file)
 >>
 >>  wc without flags needs to count characters.   What is a character depends
 >>  upon your locale settings.  Do
 >>
 >>         LC_ALL=C wc hello
 >>
 >>  (or prefix that with "env" if you're a csh user) and it will work.
 >>
 >>    | wc works as one would expect on arm64.  This error only shows up on amd64
 >>
 >>  More likely your default locale (LANG, LC_CTYPE or LC_ALL) is different in
 >>  the two cases.
 >>
 >>  I am not sure that it makes sense to attempt to count characters, lines, or
 >>  words, in a binary file - what would the answers mean?    If you were looking
 >>  to get the size of the file, wc is not the right tool.
 >>
 >>  I see no bug here, nor any real need to explain that a "word count" program
 >>  isn't intended to be sane on non word/character containing files in the
 >>  manual page.
 >>
 >>
 >>
 
 


