Re: File types [was: Re: wide characters and i18n]



markucz%gmail.com@localhost wrote:

> I don't understand what "alternate" means in this case. Is this "data stream"
> kept in the executable itself (that's what Windows does with icons, IMO since
> 3.x) or in a separate file but still being treated as a part of the
> executable (like DOS overlay files)?

(With the proviso that like much else in this thread, this is
nearly all "IMHO" -- disagreement is both expected and welcome
if it adds clarity or better ideas than I have.)

When you call open() you get the main data stream; there are other
calls to find out what resource forks exist and to open one of them
instead.  Sounds more like the Windows example you gave.
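
By way of illustration only: I don't have a resource-fork API
to hand, but extended attributes are the nearest everyday
analogue, and the "main stream plus named side streams" pattern
looks roughly like this in Python (os.listxattr/os.getxattr are
the Linux spellings; the file name is made up):

    import os

    path = "somefile"                   # hypothetical file name
    for name in os.listxattr(path):     # enumerate the extra streams
        data = os.getxattr(path, name)  # fetch one by name
        print(name, len(data), "bytes")
    with open(path, "rb") as f:         # plain open() still gets the main data
        main = f.read()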

> IMO you're right :-) but some people want to have ACLs, MAC and
> whatever else it takes to become B1 compliant "Trusted $OS" and
> hopefully make big buck$.

True.  But that market is so small (in number) that doing anything
to inconvenience the non-trusted users is inappropriate, I think.

Were I (perish the thought) trying to turn a Unix system into
something that would pass B1 certification (if that still
means anything) I would be deciding things like "if I have
ACLs, I'm not going to have Unix groups".

The implementations I have seen which have tried to maintain
"optional" ACLs on top of Unix groups always seemed awkward:
naive programs and users couldn't figure out what was going
on as soon as an ACL appeared.

> This again brings up the question if files are just bags of bytes or
> not. I'd say text files are just another file format. There're just
> too many programs which deal with it. Here Plan 9 got it right -
> assume all text is UTF-8 (I leave aside considerations about file size
> or bandwidth wasted) and add some functions to libc if needed. "Assume
> all text is ASCII" is just as valid though.

Historically, the "bag of bytes" model split into:

o text, which lots of things understood, and
o binary, which very likely had a specific format; that only the
  appropriate program could actually work with it was fine

_Plus_, by and large, there was no rule that you couldn't
apply a text handling utility to a file that happened to have
binary contents etc.  A pretty revolutionary idea when it was
new.

That goal was not and is not always achieved as well as it
might be: early versions of sed truncated(!) data lines at 512
bytes; a few utilities do object to "binary files" and modify
their behaviour when they see one:

    $ diff /mach_kernel /dev/null
    Binary files /mach_kernel and /dev/null differ

Here that's true, and I probably don't want to see the output of
that diff.  But it's annoying when there are only a few binary
characters and they're not on the lines relevant to the diff.
(Oddly, I lack a running NetBSD system right now; that example
is from OS X diff, which is GNU diff.  NetBSD's may be
different.)

Our present-day problem is that all the world's _not_ ASCII, so
our idea of what constitutes a "text file" is under pressure:
I (an English speaker in a country where English is the major
language spoken) have a mixture of ASCII and UTF-8 files.

A Japanese user would be likely to have some Shift-JIS files
as well, I suspect.

Who uses UTF-16 these days I'm not sure (Windows, for at least
some things?), but there are definitely files (and file
systems) which use it, and byte order raises its ugly head, so
a mixture of ASCII, UTF-8, and UTF-16 (maybe little-endian,
maybe big-endian, maybe both) is also reasonable to find on
an individual system.
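
To make the byte-order point concrete, here's the kind of
sniffing a tool ends up doing (a rough Python sketch, nothing
more; many UTF-16 files carry no BOM at all, so this only goes
so far, and the file name is invented):

    import codecs

    def guess_encoding(path):
        head = open(path, "rb").read(4)
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8 (with BOM)"
        if head.startswith(codecs.BOM_UTF16_BE):
            return "utf-16-be"
        if head.startswith(codecs.BOM_UTF16_LE):
            return "utf-16-le"   # a UTF-32-LE BOM would also match here
        return "no BOM: ASCII, UTF-8, Shift-JIS, ... take your pick"

    print(guess_encoding("names.txt"))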

Thus, it looks more reasonable than it did in 1970 or so to
place type (metadata) on text files; personally I'd make this
advisory: only applications that care (e.g. text editors) need
enquire, and applications that don't (e.g. cp(1)) would not
enquire except to propagate the metadata along with the file.
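
As a sketch of what "advisory" might mean in practice -- the
user.charset name here is purely hypothetical, there's no
established convention that I know of -- an application that
cares could do something like this (Python, Linux xattr calls):

    import os

    path = "notes.txt"                              # hypothetical file
    try:
        charset = os.getxattr(path, "user.charset") # ask only because we care
    except OSError:
        charset = None                              # untagged: guess, or assume ASCII
    # ... and an editor, on saving, might record what it wrote:
    os.setxattr(path, "user.charset", b"utf-8")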

Plan 9 had two luxuries the world in general doesn't have:

a) they were a new OS, largely unconstrained by backward
   compatibility

b) they punted on some serious issues, such as ignoring
   combining characters in UTF-8, thus "UTF-8" in Plan 9
   was effectively enforced (so far as I understand from
   the paper quoted earlier in the thread) to be UTF-8 in
   NFC (Normalization Form Canonical Composition).

   A translation utility was provided, but what input
   character sets it supported I don't know.

OS X increasingly places extended attributes on files;
browsers tag downloaded files and such.  I'm not aware
of charset tags (or conventions for them) -- but I think
they would be useful.

As people probably get tired of me saying, UTF-8 can be
unnormalised (a mixture of composed and decomposed characters)
or normalised to one of several forms.  Simply saying "UTF-8"
doesn't mean I can diff two files to see if the list of names
in each is the same: I have to normalise each line (name)
first unless I _know_ the file contents are normalised.
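
A minimal illustration of that (Python's unicodedata module;
the file names are invented, and the choice of NFC over NFD is
arbitrary -- the point is only that both sides go through the
same form before comparison):

    import unicodedata

    def names(path):
        with open(path, encoding="utf-8") as f:
            return [unicodedata.normalize("NFC", line.rstrip("\n"))
                    for line in f]

    same = names("list-a.txt") == names("list-b.txt")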

Sorting those names gets into issues of locales, and _that_
is a whole 'nuther problem (and one both Plan 9 and Google's
Go language have so far punted on, too).

I'm not a huge fan of POSIX locales -- they have several
issues -- but at the moment they're the only game in town
I've even heard about.  (More information welcome.)
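
For what it's worth, using POSIX locales to sort such names
looks roughly like this (Python sketch; the locale name is
picked arbitrarily and whether it's installed varies from
system to system):

    import locale

    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")  # or ja_JP.UTF-8, ...
    names = ["Zürich", "Zug", "Zurich"]
    names.sort(key=locale.strxfrm)   # order depends on the locale chosen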

> I believe XFS has xattrs by design so I wonder if all Irix
> programs were written to be xattr-aware or if it was pushed
> down to the filesystem level.  Maybe that's the right
> approach?

If you can figure out how to push it down to the file system
level, I think you'll be doing well; I don't think you can,
realistically.

Here's a case where copying extended attributes (and even
ACLs) is pretty clearly "The Right Thing" to do:

    $ cp -p original copy

On the other hand, what about when I fire up vi on a file I
don't have write permission on, and then save a copy after
I've edited it?  At minimum, I must maintain the right to edit
it (so ACLs become troublesome); and how does the file system
possibly work out what I want for extended attributes when I
use _multiple_ source files to create an entirely new file?

    # vi original1
    ...
    :r original2
    ...
    :w copy

If both files were UTF-8 normalised to NFD, then it might be
reasonable for vi to note that in metadata, but I don't think
the file system can do it for me automatically unless it
examines every byte in the file as they're written.  (And
then, how do I create an intentionally malformed file for
testing?)
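
If an editor did want to record that, the check itself is
straightforward -- a rough sketch (Python 3.8+; the file name
comes from the example above) -- though it does indeed mean
looking at every character:

    import unicodedata

    with open("original1", encoding="utf-8") as f:
        text = f.read()
    already_nfd = unicodedata.is_normalized("NFD", text)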

Cheers,

Giles

