tech-userlevel archive


Re: Encoding non-alphanumeric characters in manpage filenames



On Fri, 5 Nov 2021 at 21:51, Lassi Kortela <lassi%lassi.io@localhost> wrote:
>
> Hello,
>
> We have a collection of manual pages with unusual ASCII punctuation
> characters in their names. [1] These are for the Scheme programming
> language. Any language that permits a large set of characters in its
> identifiers (Lisp, Haskell, APL, Forth, ...) is likely to hit the same
> problem if manpages are written for it.
>
> While most ASCII punctuation characters are legal in Unix filenames,
> many of them are not portable across different types of file systems.
> FAT and NTFS have particularly stringent restrictions. Additionally,
> many punctuation characters are shell metacharacters, making them
> potentially unsafe to handle from scripts. Most significantly, the
> forward slash (which we have in some identifiers) cannot be encoded in
> a filename at all without creating a subdirectory.
>
> A simple fix would be for man(1) to recognize filenames in which
> unusual bytes are escaped as hex digits. Two familiar standards do it:
>
> * URL encoding (using a "%" character before the digits).
>
> * Quoted-printable (using a "=" character before the digits).
>
> On the choice of "%" vs "=", both are valid in FAT/NTFS filenames. A
> quick web search shows that "%" is forbidden in SharePoint filenames,
> and must be escaped by doubling it as "%%" in Windows batch files. "="
> might present the fewest problems.
>
> Non-ASCII filenames could be handled by hex-escaping each unusual byte
> of their UTF-8 encoding.
>
> RFC 3986 uses the following set of unreserved characters in URIs:
>
>      ALPHA / DIGIT / "-" / "." / "_" / "~"
>
> That's a pretty good set for manpage names as well. However, "~" is
> often used as a metacharacter, and "." is used for the filename
> extension, so it would make sense to hex-encode these two as well.
>
> I ran a search on my computer and found that there are almost no
> manpage names using characters other than the above unreserved ones.
> The only major exception is that lots of Perl manpages use ":".
> However, if man(1) stays backward compatible and also looks for
> non-escaped filenames, no disruption is caused.
>
> Would you accept a patch to NetBSD's man(1) to look for hex-escaped
> names in addition to the unescaped names that it currently finds?
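
For illustration, the escaping proposed above could be sketched roughly as follows. This is only a sketch under assumptions, not an actual patch to man(1): it uses "=" as the escape marker, keeps only ALPHA / DIGIT / "-" / "_" unescaped (so "." and "~" are hex-encoded, as suggested), and escapes every other byte of the UTF-8 encoding. The names escape_name, unescape_name and candidate_filenames, and the "3scm" section suffix, are hypothetical.

```python
# Bytes left unescaped: RFC 3986 unreserved characters minus "." and "~".
SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-_"
)


def escape_name(name: str, marker: str = "=") -> str:
    """Hex-escape every byte of the UTF-8 encoding that is not in SAFE."""
    out = []
    for byte in name.encode("utf-8"):
        if byte in SAFE:
            out.append(chr(byte))
        else:
            # The marker itself is not in SAFE, so it is escaped too.
            out.append(f"{marker}{byte:02X}")
    return "".join(out)


def unescape_name(escaped: str, marker: str = "=") -> str:
    """Reverse escape_name; raises ValueError on malformed hex digits."""
    raw = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == marker:
            raw.append(int(escaped[i + 1:i + 3], 16))
            i += 3
        else:
            raw.extend(escaped[i].encode("utf-8"))
            i += 1
    return raw.decode("utf-8")


def candidate_filenames(name: str, section: str) -> list[str]:
    """Filenames a backward-compatible lookup might try, unescaped first."""
    plain = f"{name}.{section}"
    escaped = f"{escape_name(name)}.{section}"
    return [plain] if plain == escaped else [plain, escaped]
```

Under these assumptions, escape_name("call/cc") gives "call=2Fcc", and a lookup for a name with no unusual characters tries only the existing unescaped filename, so current pages are unaffected.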

Just curious - do we know if there is any prior art on other systems
for handling this?

David

