tech-userlevel archive


Re: Encoding non-alphanumeric characters in manpage filenames



> While most ASCII punctuation characters are legal in Unix filenames,

I would warn against an assumption that could be (not "is") lurking
here.

UNIX filenames are not character strings.  They are octet strings,
which may be - often are - interpreted as encoding character strings.
Two octets, 0x2f and 0x00, have special significance.  But it is the
octets, not any characters they may or may not represent, that have the
significance.  (For example, posit a character encoding with shift
states in which a 0x2f octet may, depending on shift state, represent
something other than /.  Trying to put such a character in a filename
is going to cause trouble, even though the _character_ is not an "ASCII
punctuation character".  UTF-8 would cause similar issues if it didn't
guarantee that octets in the 0x00-0x7f range only ever encode
themselves, never part of a multi-octet character, which is what makes
0x00 and 0x2f safe.)
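
To make the octet-versus-character point concrete, here is a minimal
sketch in C.  The function name and the sample data are mine, chosen
for illustration; ISO-2022-JP stands in for the posited shift-state
encoding, and it is an assumption that it is a fair example of what was
meant.  The check looks only at octet values, which is all the kernel
does.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the octet string contains either octet that is special
 * in a pathname component (0x2f or 0x00), else 0.  Hypothetical
 * helper, not part of any real API. */
int has_special_octet(const unsigned char *s, size_t len)
{
	assert(s != NULL);
	for (size_t i = 0; i < len; i++)
		if (s[i] == 0x2f || s[i] == 0x00)
			return 1;
	return 0;
}

/* U+00E9 in UTF-8: every octet of a multi-octet sequence is >= 0x80. */
const unsigned char utf8_sample[] = { 0xc3, 0xa9 };

/* One two-octet JIS X 0208 code as it appears in ISO-2022-JP:
 * ESC $ B, the code 0x30 0x2f, then ESC ( B to shift back to ASCII.
 * The second octet of the code is 0x2f although no '/' is meant. */
const unsigned char iso2022jp_sample[] = {
	0x1b, 0x24, 0x42, 0x30, 0x2f, 0x1b, 0x28, 0x42
};
```

Running has_special_octet() over utf8_sample finds nothing, while
iso2022jp_sample trips it: used as a filename component, that octet
string would be split at the 0x2f.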

The difference tends to get blurred, especially in view of code like

	if (path[x] == '/')

rather than

	if (path[x] == 0x2f)

but it is still an important distinction to at least keep in the back
of your mind.  (Related issues are why SSH, as standardized, is,
strictly speaking, unimplementable on many UNIX variants.)
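
One way to keep the two comparisons above honest with each other is to
write the code in terms of the octet and have the compiler confirm
that '/' really is that octet on the platform at hand.  A sketch,
assuming C11 or later; PATH_SEP_OCTET and count_separators() are names
invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* The pathname separator, as an octet value rather than a character. */
#define PATH_SEP_OCTET 0x2f

/* Fails to compile where the execution character set does not encode
 * '/' as 0x2f, so the assumption is at least stated and checked. */
static_assert('/' == PATH_SEP_OCTET,
    "execution character set does not encode '/' as 0x2f");

/* Count separators in a 0x00-terminated octet string, by octet value. */
size_t count_separators(const unsigned char *path)
{
	size_t n = 0;

	assert(path != NULL);
	for (size_t i = 0; path[i] != 0x00; i++)
		if (path[i] == PATH_SEP_OCTET)
			n++;
	return n;
}
```

On an ASCII-based system, count_separators() applied to the octets of
"/usr/share/man" returns 3.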

I don't know whether anyone has done anything UNIXy based on any
character encoding where / is not 0x2f (EBCDIC maybe?) or 0x00 is not
the canonical string terminator (I think I've heard of using 0xff for
that).  If there is such a thing, it would be interesting to examine
its choices.  Does it use 0x2f, /, or something else as its pathname
separator?  What does it use as the terminator?  How does it handle the
i14y issues
resulting from its choice (either choice has such issues, just
different ones)?

What does POSIX say?  What about POSIX layers atop filesystems that
_don't_ represent pathnames as relatively unstructured octet strings?
ISTR that at least one Windows FS represents pathname components as
strings of two-octet BMP Unicode codepoints - how is the impedance
mismatch handled?

As for the problem at immediate hand, it strikes me as somewhat
difficult to define if you can encode any octet.  For example, what
happens if you find that you have both, say, ls.0 and %6Cs.0 in a cat1/
directory somewhere?  Or both foo::bar.0 and foo%3A%3Abar.0?  (And,
strictly speaking, even those two lines blur the distinction between
octets and characters in pathnames.  It's things like that that make it
hard to maintain the mental distinction.  Those encodings assume, of
course, the use of ASCII, or at least an ASCII superset.)
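
The collision is easy to demonstrate.  The following is a sketch only,
assuming a %XX scheme in which XX is two ASCII hex digits and any octet
may be encoded; nothing in the thread fixes those details, and
pct_decode() and names_collide() are hypothetical names.  It works on
octets throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of an ASCII hex digit octet, or -1 if it is not one. */
static int hexval(unsigned char c)
{
	if (c >= 0x30 && c <= 0x39) return c - 0x30;		/* 0-9 */
	if (c >= 0x41 && c <= 0x46) return c - 0x41 + 10;	/* A-F */
	if (c >= 0x61 && c <= 0x66) return c - 0x61 + 10;	/* a-f */
	return -1;
}

/* Decode %XX escapes from in[0..inlen) into out, which must have room
 * for inlen octets.  Returns the number of octets written.  A 0x25
 * octet not followed by two hex digits is copied through unchanged. */
size_t pct_decode(const unsigned char *in, size_t inlen, unsigned char *out)
{
	size_t o = 0;

	assert(in != NULL && out != NULL);
	for (size_t i = 0; i < inlen; i++) {
		if (in[i] == 0x25 && i + 2 < inlen) {
			int hi = hexval(in[i + 1]);
			int lo = hexval(in[i + 2]);
			if (hi >= 0 && lo >= 0) {
				out[o++] = (unsigned char)((hi << 4) | lo);
				i += 2;
				continue;
			}
		}
		out[o++] = in[i];
	}
	return o;
}

/* 1 if the two names decode to the same octet string, 0 if not,
 * -1 if either is too long for this sketch's fixed buffers. */
int names_collide(const char *a, const char *b)
{
	unsigned char da[256], db[256];
	size_t la = strlen(a), lb = strlen(b);

	if (la > sizeof da || lb > sizeof db)
		return -1;
	size_t na = pct_decode((const unsigned char *)a, la, da);
	size_t nb = pct_decode((const unsigned char *)b, lb, db);
	return na == nb && memcmp(da, db, na) == 0;
}
```

With this, names_collide("ls.0", "%6Cs.0") and
names_collide("foo::bar.0", "foo%3A%3Abar.0") both return 1 on an
ASCII-based system.  Any scheme that lets every octet be encoded needs
a rule for which of the two spellings is the canonical one.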

I've found myself caring about this, too, because I find myself using
both 8859-1 and 8859-14.  I'm not sure what the right resolution is.
(To forestall one likely suggestion: I am, however, sure that - at
least for my purposes - it is not UTF-8.  Variable-sized characters are
a disaster I do not want to go anywhere near.)

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse%rodents-montreal.org@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

