tech-userlevel archive

Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)

On 16-Sep-08, at 1:03 PM, der Mouse wrote:

The ideal solution, given these limitations, is to maintain the
illusion that there is effectively only one true encoding and charset
(for "text" files, at least), just as Unix has always done,

I don't know what Unix you've been using, but I've certainly never had
any such "illusion".

I'm being pedantic by using "Unix".  :-)

So, I'm saying Unix(tm) is/was an ASCII-only system. Systems conforming to some POSIX extension/revision can give the illusion that each given "session" (usually per "user", but potentially per process) works with one given charset and encoding.

Before I had a more world-aware view of systems I often laughed at the oddness of things like the "text" and "binary" commands in communications tools such as FTP.

The impression I've gotten from the Unices I've used is that files
don't _have_ encodings or charsets; those are imposed, if at all, by
the software (and sometimes hardware, eg, serial terminals) that
interprets the octet stream stored in the file.  Indeed, there isn't
even any notion of a text file per se, only tools that treat files - or
more often octet streams - as text (and others that don't, of course).

Indeed, but that's obviously an isolationist view of things, wouldn't you agree?
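
To make the quoted point about octet streams concrete, here is a minimal sketch in Python (my choice of language for illustration; the helper name interpret() and the sample bytes are made up). The same octets are read under two different encodings, and only the interpreting software decides which one applies. Note what happens to the lone 0xA0 octet from the subject line: it is a no-break space under ISO-8859-1 and an invalid sequence under UTF-8.

```python
def interpret(octets: bytes, encoding: str) -> str:
    """Decode the same octets under a caller-chosen encoding.

    The file (or stream) carries no record of its encoding; the caller
    imposes one. Undecodable octets become U+FFFD instead of raising.
    """
    return octets.decode(encoding, errors="replace")


# Hypothetical sample: "naive" with a UTF-8 encoded i-diaeresis (0xC3 0xAF),
# followed by a bare 0xA0 octet and the word "text".
OCTETS = b"na\xc3\xafve\xa0text"

if __name__ == "__main__":
    for enc in ("latin-1", "utf-8"):
        # repr() keeps the no-break space and replacement character visible.
        print(enc, repr(interpret(OCTETS, enc)))
```

Neither reading is "the" right one as far as the filesystem is concerned, which is exactly the agnosticism under discussion.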

The _sanest_ alternatives would be to add content-specifying metadata
to the filesystem, and all the tools necessary to make sure it's
always set and used as correctly as possible;

I dunno; I wouldn't call that sane;

Indeed, that's why I pedantically used "_sanest_" to emphasize that it's not ideal in any way and that these alternatives are a distant second to the hopefully better idea promoted by Plan 9.

I'd call it diametrically opposed
to one of Unix's great strengths (that strength being a lack of
distinctions such as text vs binary or STREAM-LF vs fixed-length
records vs ISAM vs etc).

I think the Unix (and now Plan 9) way of looking at text encoding and charset is the best compromise given the way the tools were designed.

I'm still, after 28 years of pretty much full-on exposure to, and immersion in, the idea, not quite convinced that it's a good idea to always be so agnostic about text versus arbitrary and perhaps opaque binary content of files. There are of course many advantages, and perhaps even some of the minor disadvantages (such as forcing users to be cautious) are actually good in the long run. I see advantages in the way Apple's systems have used metadata to enhance functionality and to provide a "safer" environment for naive users.

Even further from sane, I suggested, almost entirely in jest, the idea of MIME processing in STDIO. Truly, though, I think that would be the only way to allow applications to be completely free from having to rely on user judgement about file contents. Files, at least text files, would always be wrapped in meta information about their content encoding and charset (and perhaps type too). Any file that was not binary data would therefore have MIME headers. I.e. this would be like having filesystem metadata to provide the same information, but it would keep everything in userland. Even more importantly, it would provide applications with a system-supplied set of "invisible" methods for dealing with all the conversion issues, while at the same time maintaining the illusion for users that they are free to view the whole world as using their personally chosen encoding and charset. That is, STDIO would automatically convert all MIME files into the user's current locale for use by all applications. It's sort of the extreme of what someone earlier proposed as the way to do everything "properly".
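
As a rough sketch of what that conversion layer might look like, here is a Python illustration built on the standard email and locale modules. To be clear, nothing like this exists in any real STDIO; the function names read_mime_text() and to_user_locale() and the sample file contents are hypothetical, and a real design would have to live in the C library and handle writing as well as reading.

```python
import email
import locale
from typing import Optional


def read_mime_text(raw: bytes, default_charset: str = "ascii") -> str:
    """Parse MIME headers at the top of a file and decode the body.

    Assumption for this sketch: a text file begins with MIME headers,
    a blank line, and then the body in the declared charset.
    """
    msg = email.message_from_bytes(raw)
    charset = msg.get_content_charset() or default_charset
    # decode=True undoes any Content-Transfer-Encoding and returns bytes.
    body = msg.get_payload(decode=True)
    return body.decode(charset)


def to_user_locale(text: str, encoding: Optional[str] = None) -> bytes:
    """Re-encode text into the user's locale encoding.

    Characters the locale cannot represent are replaced, not rejected.
    """
    encoding = encoding or locale.getpreferredencoding(False)
    return text.encode(encoding, errors="replace")


# Hypothetical on-disk contents of a MIME-wrapped ISO-8859-1 text file;
# 0xE9 is e-acute and 0xA0 is a no-break space in that charset.
SAMPLE = (
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9\xa0au lait\n"
)

if __name__ == "__main__":
    text = read_mime_text(SAMPLE)
    # An application would only ever see octets in its own locale.
    print(repr(to_user_locale(text)))
```

The application never looks at the headers; it sees only octets in the user's chosen encoding, which is the "illusion" being maintained.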

                                        Greg A. Woods; Planix, Inc.
