NetBSD-Bugs archive


Re: bin/58014: wc no longer works with binary files



Indeed, in my arm64 test, where 'wc' 'worked' on binary files, the environment variable LC_ALL was set to "C".

I think the man page for wc needs updating, at least to explain how wc interacts with that environment variable. There *is* a discussion on that man page about needing to use the POSIX iswspace() function, but when I followed that page, there was no detail about the LC_ALL environment variable.

Also, historically, wc was something like this:

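#include <stdio.h>

int main(void) {
    /* long counters, so that large inputs do not overflow an int */
    long lineCount = 0, wordCount = 0, byteCount = 0;
    int character, inWord = 0;

    /* Read stdin one byte at a time; no locale or encoding is involved. */
    while ((character = getchar()) != EOF) {
        ++byteCount;
        if (character == '\n')
            ++lineCount;
        /* A space, newline, or tab ends the current word. */
        if (character == ' ' || character == '\n' || character == '\t')
            inWord = 0;
        else if (inWord == 0) {
            /* First non-blank byte after a blank starts a new word. */
            inWord = 1;
            ++wordCount;
        }
    }

    printf("%ld %ld %ld\n", lineCount, wordCount, byteCount);
    return 0;
}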

That is, because Unix 'files' are simply strings of bytes, it may be meaningless to count 'words' and 'lines', but the character count (file size) is useful.

Generally, I use this when I want to know the source size and the program's executable is in the source directory as an artifact: I run "wc *".

Anyway, I'm asking for a documentation change.

Thank you,
Mike

On Sat, Mar 9, 2024 at 1:55 AM Robert Elz <kre%munnari.oz.au@localhost> wrote:
The following reply was made to PR bin/58014; it has been noted by GNATS.

From: Robert Elz <kre%munnari.OZ.AU@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: bin/58014: wc no longer works with binary files
Date: Sat, 09 Mar 2024 16:50:02 +0700

     Date:        Sat,  9 Mar 2024 07:50:00 +0000 (UTC)
     From:        michael.cheponis%gmail.com@localhost
     Message-ID:  <20240309075000.E456A1A9241%mollari.NetBSD.org@localhost>

   | when 'wc' is given input from a binary file, it now gives the error:
   |
   | wc: hello: invalid byte sequence

   | (Assuming 'hello' is a binary file)

 wc without flags needs to count characters.   What is a character depends
 upon your locale settings.  Do

        LC_ALL=C wc hello

 (or prefix that with "env" if you're a csh user) and it will work.
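
 To illustrate why the locale matters, here is a minimal sketch of
 locale-dependent character counting. It is not the actual wc source;
 count_chars() is a hypothetical helper written for this example, built on the
 standard mbrtowc() function. It decodes the buffer under the current LC_CTYPE
 locale and gives up when the bytes do not form a valid character, which is the
 kind of failure arbitrary binary data can trigger in a multibyte locale.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not taken from wc): count the characters in
 * buf[0..len) according to the current LC_CTYPE locale.
 * Returns the character count, or -1 with errno set to EILSEQ when the
 * bytes are not a valid (or complete) multibyte sequence.
 */
long count_chars(const char *buf, size_t len)
{
    mbstate_t state;
    long count = 0;

    assert(buf != NULL || len == 0);
    memset(&state, 0, sizeof state);    /* start in the initial shift state */

    while (len > 0) {
        size_t used = mbrtowc(NULL, buf, len, &state);

        if (used == (size_t)-1)
            return -1;                  /* invalid sequence; errno is EILSEQ */
        if (used == (size_t)-2) {
            errno = EILSEQ;             /* input ends in the middle of a character */
            return -1;
        }
        if (used == 0)
            used = 1;                   /* an embedded NUL counts as one character */

        buf += used;
        len -= used;
        count++;
    }
    return count;
}
```

 A caller would normally run setlocale(LC_CTYPE, "") first so that LC_ALL,
 LC_CTYPE and LANG take effect; without that call a C program runs in the "C"
 locale.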

   | wc works as one would expect on arm64.  This error only shows up on amd64

 More likely your default locale (LANG, LC_CTYPE or LC_ALL) is different in
 the two cases.
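
 The precedence among those three variables can be sketched as follows.
 pick_ctype_locale() is a hypothetical helper written for this example, not a
 libc function: LC_ALL wins if it is set and non-empty, then LC_CTYPE, then
 LANG. The final fallback to "C" is an assumption, since the default used when
 none of them is set is implementation-defined.

```c
#include <stddef.h>

/* True when an environment value is present and not the empty string. */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/*
 * Hypothetical helper: given the values of LC_ALL, LC_CTYPE and LANG
 * (NULL when unset), return the locale name that governs character
 * classification. LC_ALL overrides LC_CTYPE, which overrides LANG.
 */
const char *pick_ctype_locale(const char *lc_all, const char *lc_ctype,
                              const char *lang)
{
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_ctype))
        return lc_ctype;
    if (is_set(lang))
        return lang;
    return "C";     /* assumption: the real default is implementation-defined */
}
```

 Calling it as pick_ctype_locale(getenv("LC_ALL"), getenv("LC_CTYPE"),
 getenv("LANG")) on each machine would show whether the two environments
 really differ.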

 I am not sure that it makes sense to attempt to count characters, lines, or
 words, in a binary file - what would the answers mean?    If you were looking
 to get the size of the file, wc is not the right tool.

 I see no bug here, nor any real need to explain that a "word count" program
 isn't intended to be sane on non word/character containing files in the
 manual page.



