tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



> [...] i really hate to give up the 1:1 relation of array-index <->
> character, but it is no longer possible to do these kind of things
> anyway; you have to sequentially work a string from the front to the
> back, [...]

Sometimes, perhaps.  But only sometimes - only when you're working with
it as a _character_ string, rather than a char string.

>> Things like combining characters I would be inclined to not worry
>> about; it's basically the same issue we've had forever with things
>> like underscore-backspace-letter sequences.
> It is necessary to deal with these sequences for at least comparison
> purposes.

I definitely disagree with this - or, more precisely, I think it's true
significantly less often than you seem to think.

For example, in filenames, I would consider <combining-acute><e> to be
different from <e-acute>, just as today I consider
<underscore><backspace><a> in a filename different from
<a><backspace><underscore>.  The only places where I would consider
them the same are for display and for programs that are specifically
supposed to operate on characters rather than bytes - and even for
display, I would display them differently in roughly the same cases
where today I'd display _\ba differently from a\b_.
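
To make the distinction concrete, here is a minimal sketch (not anything
from the draft under discussion).  It assumes Unicode text encoded as
UTF-8, where the combining accent follows its base letter, and it uses
Python only because its standard library ships a normalizer.  The helper
names same_octets and same_characters are made up for illustration.

```python
import unicodedata

# Two spellings of what a reader sees as "e with acute accent".
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# As filenames they are octet sequences (UTF-8 is an assumption here).
name_a = precomposed.encode("utf-8")   # b'\xc3\xa9'
name_b = combining.encode("utf-8")     # b'e\xcc\x81'


def same_octets(a: bytes, b: bytes) -> bool:
    """Filename-layer comparison: plain octet equality, no interpretation."""
    return a == b


def same_characters(a: bytes, b: bytes, encoding: str = "utf-8") -> bool:
    """Human-interface comparison: decode, normalize to NFC, then compare."""
    text_a = unicodedata.normalize("NFC", a.decode(encoding))
    text_b = unicodedata.normalize("NFC", b.decode(encoding))
    return text_a == text_b


if __name__ == "__main__":
    print(same_octets(name_a, name_b))      # different files on disk
    print(same_characters(name_a, name_b))  # same text to a human reader
```

Only the second function needs to know an encoding at all, which is the
reason I'd keep it out of anything that handles filenames as such.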

That is, I think trying to turn filenames from octet sequences
(historically; char sequences, if char is widened) into character
sequences is a recipe for frustration and disaster.  Unix has never
used character sequences for filenames; they've always been octet
sequences, and I think that's as it should be (except that they should
actually be char sequences, if char != octet).  (If you think Unix
filenames are character sequences, consider the difference between
someone using 8859-1 naming a file <e-grave> and someone using 8859-7
naming a file <theta>: looking at the filename on disk, or at the
syscall interface, there is no difference.)
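
The 8859-1/8859-7 case can be checked directly.  This is only a sketch:
the octet 0xE8 is what both users' filenames come out as, and the
display helper is a hypothetical stand-in for whatever the terminal or
file browser does when it picks a charset.

```python
# The single octet stored on disk for both users' files.
raw_name = b"\xe8"


def display(name: bytes, charset: str) -> str:
    """Interpret an octet-sequence filename as characters for display only."""
    return name.decode(charset)


if __name__ == "__main__":
    # Same octets, two different characters, depending on the viewer.
    print(display(raw_name, "iso-8859-1"))  # e-grave (U+00E8)
    print(display(raw_name, "iso-8859-7"))  # small theta (U+03B8)
```

Nothing in raw_name records which charset was intended; that
information lives only with the person (or locale) looking at it.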

Interpreting octets - or chars - as characters is a human-interface
thing, and I think it should stay at the human-interface layer.

> E.g., in the terminal on Mac OS X tab-completion and file-globbing
> doesn't work for files on Apple-managed filesystems, because the
> names are normalized, [...]

If Apple treats filenames as character sequences, yes, I would expect
issues, because all the Unix software is built for a paradigm that
expects them to be octet sequences, not character sequences.

Shells are a difficult issue, because they straddle the boundary
between the underlying representation and the human presentation of it.
Speaking in traditional terms, this means that they have to consider
filenames as both octet sequences and as character sequences; the
resulting dissonance is where a lot of the issues with shells come
from.  I think it's no coincidence that the examples you named involve
filenames, but similar issues arise with command-line arguments in
general.

Historically, the difference between `char' and `character' has gotten
very blurred.  Most of the issues I see around this stuff come from
that confusion, I think, and one of the biggest benefits I expect to
come out of this is clarifying when octets (or, more generally, chars)
are characters and when they're codepoints - and when they're just
numbers.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

