tech-userlevel archive


Encoding non-alphanumeric characters in manpage filenames


We have a collection of manual pages with unusual ASCII punctuation
characters in their names. [1] These are for the Scheme programming
language. Any language that permits a large set of characters in its
identifiers (Lisp, Haskell, APL, Forth, ...) is likely to hit the same
problem if manpages are written for it.

While most ASCII punctuation characters are legal in Unix filenames,
many of them are not portable across different types of file systems.
FAT and NTFS have particularly stringent restrictions. Additionally,
many punctuation characters are shell metacharacters, making them
potentially unsafe to handle from scripts. Most significantly, the
forward slash (which we have in some identifiers) cannot appear in a
filename at all: it is the path separator, so using it would create a
subdirectory instead.

A simple fix would be for man(1) to recognize filenames in which
unusual bytes are escaped as a marker character followed by two hex
digits. Two familiar standards do this:

* URL encoding (using a "%" character before the digits).

* Quoted-printable (using a "=" character before the digits).
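
To make the two styles concrete, here is a rough Python sketch that
writes the same names both ways. The function names are made up for
illustration. It leans on urllib.parse.quote, which leaves "-", ".",
"_" and "~" alone, so it shows only the escape syntax, not a final
choice of safe characters.

```python
from urllib.parse import quote


def url_style(name: str) -> str:
    """Escape in URL style: "%" followed by two hex digits per byte."""
    # safe="" makes quote() escape "/" too; letters, digits and "-._~"
    # are never escaped by quote().
    return quote(name, safe="")


def qp_style(name: str) -> str:
    """Same escapes, but with "=" as the marker (quoted-printable style)."""
    # Every "%" in quote() output is an escape marker, because a literal
    # "%" in the input comes out as "%25". A literal "=" comes out as
    # "%3D", so swapping the marker stays unambiguous.
    return url_style(name).replace("%", "=")


for ident in ("call/cc", "set-car!", "char<=?"):
    print(ident, url_style(ident), qp_style(ident))
# call/cc call%2Fcc call=2Fcc
# set-car! set-car%21 set-car=21
# char<=? char%3C%3D%3F char=3C=3D=3F
```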

On the choice of "%" vs "=", both are valid in FAT/NTFS filenames. A
quick web search shows that "%" is forbidden in SharePoint filenames,
and must be escaped by doubling it as "%%" in Windows batch files. "="
might present the fewest problems.

Non-ASCII filenames could be handled by hex-escaping each unusual byte
of their UTF-8 encoding.

RFC 3986 uses the following set of unreserved characters in URIs:

    ALPHA / DIGIT / "-" / "." / "_" / "~"

That's a pretty good set for manpage names as well. However, "~" is
often used as a metacharacter, and "." is used for the filename
extension, so it would make sense to hex-encode these two as well.
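
As a minimal sketch of what that rule could look like (assuming "=" as
the marker and a safe set of ALPHA / DIGIT / "-" / "_"; the function
names are hypothetical and this is not code from any existing man(1)):

```python
import string

# Assumed safe set: the RFC 3986 unreserved characters minus "~" and ".".
SAFE_BYTES = frozenset((string.ascii_letters + string.digits + "-_").encode("ascii"))


def escape_name(name: str, marker: str = "=") -> str:
    """Hex-escape every byte of the UTF-8 form of name outside SAFE_BYTES."""
    out = []
    for byte in name.encode("utf-8"):
        if byte in SAFE_BYTES:
            out.append(chr(byte))
        else:
            out.append(f"{marker}{byte:02X}")
    return "".join(out)


def unescape_name(escaped: str, marker: str = "=") -> str:
    """Reverse escape_name(); raises ValueError on a malformed escape."""
    raw = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == marker:
            # The two characters after the marker are the hex byte value.
            raw.append(int(escaped[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(escaped[i]))
            i += 1
    return raw.decode("utf-8")


print(escape_name("call/cc"))       # call=2Fcc
print(escape_name("string->list"))  # string-=3Elist
print(escape_name("λ"))             # =CE=BB
```

The marker itself is outside the safe set, so a literal "=" becomes
"=3D" and decoding stays unambiguous.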

I ran a search on my computer and found that almost no manpage names
use characters other than the unreserved ones above. The only major
exception is that lots of Perl manpages use ":". However, if man(1)
stays backward compatible and keeps looking for the non-escaped
filenames as well, nothing that works today would break.
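
The backward-compatible lookup could be sketched roughly as follows.
This is an illustration in Python, not NetBSD's actual search code:
the "name.section" layout, the "=" marker, the safe set and both
function names are assumptions, and a real man(1) also deals with
search paths, cat pages and compressed files.

```python
import string

# Assumed safe set: ASCII letters, digits, "-" and "_".
SAFE = set(string.ascii_letters + string.digits + "-_")


def escape_name(name: str) -> str:
    """Hex-escape each UTF-8 byte of every character outside SAFE."""
    return "".join(
        ch if ch in SAFE
        else "".join(f"={b:02X}" for b in ch.encode("utf-8"))
        for ch in name
    )


def candidate_filenames(name: str, section: str) -> list[str]:
    """Filenames to try, in order: today's unescaped form first."""
    plain = f"{name}.{section}"
    escaped = f"{escape_name(name)}.{section}"
    # Existing pages (for example Perl's "::" names) still match the
    # first candidate; the escaped form is only an extra fallback.
    return [plain] if escaped == plain else [plain, escaped]


print(candidate_filenames("File::Find", "3"))
# ['File::Find.3', 'File=3A=3AFind.3']
print(candidate_filenames("printf", "3"))
# ['printf.3']
```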

Would you accept a patch to NetBSD's man(1) to look for hex-escaped
names in addition to the unescaped names that it currently finds?


