tech-userlevel archive

Re: Encoding non-alphanumeric characters in manpage filenames

> What does POSIX say?  What about POSIX layers atop filesystems that
> _don't_ represent pathnames as relatively unstructured octet strings?
> ISTR that at least one Windows FS represents pathname components as
> strings of two-octet BMP Unicode codepoints - how is the impedance
> mismatch handled?

The Open Group Base Specifications Issue 7, 2018 edition
IEEE Std 1003.1-2017 (Revision of IEEE Std 1003.1-2008)
6. Character Set
6.1 Portable Character Set

gives a table of portable characters that covers most of ASCII. However, the ASCII/Unicode code points in that table are given for reference only; it seems each character may be encoded differently in any given character set.

Happily, the Portable Character Set includes all of dash, underscore, dot, percent sign, and equals sign.

So we could specify the man conventions as follows:

1. The manpage name is encoded as UTF-8.

2. Each byte of the UTF-8 encoding is interpreted as ASCII, and the resulting ASCII character (if any) is looked up in the Portable Character Set.

3. If there is a matching character, use it. If not, insert a percent sign followed by the two-digit hex value of the byte.
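
The three steps above can be sketched roughly as follows. This is only an illustration, not a proposed implementation: the function name and the SAFE set are my own assumptions. In particular, I leave only letters, digits, dash, underscore and dot unescaped, and I escape the percent sign itself so that an encoded name decodes unambiguously. The exact set would still have to be agreed on.

```python
import string

# Assumption: the bytes that are written literally. The proposal does not
# fix this set; '%' is excluded here because it is the escape character.
SAFE = frozenset((string.ascii_letters + string.digits + "-_.").encode("ascii"))


def encode_manpage_name(name: str) -> str:
    """Encode a manpage name as UTF-8 and percent-escape every unsafe byte."""
    out = []
    for byte in name.encode("utf-8"):      # step 1: UTF-8 encode
        if byte in SAFE:                   # step 2: look the byte up
            out.append(chr(byte))          # step 3a: keep it as is
        else:
            out.append(f"%{byte:02X}")     # step 3b: hex-encode the byte
    return "".join(out)


print(encode_manpage_name("foo::bar"))     # foo%3A%3Abar
```

A non-ASCII character comes out as one escape per UTF-8 byte, so "é" becomes %C3%A9.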

> For example, what
> happens if you find that you have both, say, ls.0 and %6Cs.0 in a cat1/
> directory somewhere?  Or both foo::bar.0 and foo%3A%3Abar.0?

We can define a precedence rule about which variant of the name wins in case both are present.
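
To make that concrete, here is a minimal sketch of one possible precedence rule. The rule itself is an arbitrary choice for illustration (the shortest filename wins, i.e. the variant with the fewest escapes, with ties broken alphabetically), and pick_variant is a hypothetical helper, not part of any existing man implementation.

```python
from typing import List, Optional
from urllib.parse import unquote


def pick_variant(requested: str, filenames: List[str]) -> Optional[str]:
    """Return the preferred file among those that decode to `requested`."""
    # Collect every filename whose percent-decoded form is the requested name.
    matches = [f for f in filenames if unquote(f) == requested]
    if not matches:
        return None
    # Assumed rule: each escape makes a name two characters longer, so the
    # shortest name is the least-escaped one; sort by name to break ties.
    return min(matches, key=lambda f: (len(f), f))


print(pick_variant("ls.0", ["%6Cs.0", "ls.0", "cat.0"]))   # ls.0
```

The opposite rule (the fully encoded name wins) would work just as well; what matters is that man picks one deterministically.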

> I've found myself caring about this, too, because I find myself using
> both 8859-1 and 8859-14.  I'm not sure what the right resolution is.
> (To forestall one likely suggestion: I am, however, sure that - at
> least for my purposes - it is not UTF-8.  Variable-sized characters is
> a disaster I do not want to go anywhere near.)

Manpages are now commonly served over the web with URLs like this:

URLs use hex encoding with % signs. Reusing that in the local file system would have the benefit that the two conventions are much the same. (However, web browsers differ in which characters they show percent-encoded in their address bar, so the display may not always match.)
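
As a quick illustration of how close the two conventions are, Python's standard URL quoting already produces the form used in the examples earlier in this thread. The page name is just the example from above; whether a filename convention should use exactly the same set of unescaped characters as URLs is an open assumption.

```python
from urllib.parse import quote, unquote

page = "foo::bar.0"

# safe="" means no extra characters are exempt from escaping; letters,
# digits and a few punctuation marks such as '.', '-' and '_' stay literal.
url_form = quote(page, safe="")
round_trip = unquote(url_form)

print(url_form)      # foo%3A%3Abar.0
print(round_trip)    # foo::bar.0
```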

AFAIK the DNS now uses "Punycode" to encode non-ASCII Unicode characters in domain names. It's confusing and likely has no advantage over a straight-up hex encoding.
