tech-userlevel archive

Re: Encoding non-alphanumeric characters in manpage filenames

Unix filenames have no "extension" - that's a concept from other
operating systems, and we should not pay that any regard at all.
A dot in a filename is simply a dot (though a leading one is
given some special status by some utilities).

Some utilities give specific meanings to filenames that end with
specific characters (like RCS likes ",v" endings) - but such
conventions belong to the utilities that implement them (having a
section name at the end of a man page is one like that - though it
is truly redundant, we don't really need the "1" twice in man1/cat.1;
man1/cat would have worked just fine).

Manpage filename escaping involves the conventions of man(1) and related programs. Operating systems are only relevant insofar as they restrict the portable character set we can use with these programs.

For present purposes, the extension of "foobar.1" is ".1" and the extension of "foobar.1.gz" is ".1.gz". We can call it "suffix" instead. The suffix is hopefully restricted to [0-9a-z.] in all cases, and hence doesn't need to be escaped.

Escaping "." in the stem part is good practice when the name of a manpage contains a dot. The manpage for "java.lang.System" in section "3java" should become something like "java%2Elang%2ESystem.3java". Then it's clear what's the suffix and what's not.
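To make the stem/suffix split concrete, here is a minimal sketch in Python. The function names and the exact "safe" character set are assumptions made for illustration; nothing here is taken from man(1) or any existing tool.

```python
import string

# Assumed safe set for the stem: ASCII letters, digits, "_", "-" and "+".
# "." and "%" are deliberately left out so that they are always escaped.
SAFE = frozenset(string.ascii_letters + string.digits + "_-+")


def escape_stem(name: str) -> str:
    """Percent-encode every UTF-8 byte of `name` that is not in SAFE."""
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in SAFE else f"%{byte:02X}")
    return "".join(out)


def manpage_filename(name: str, section: str, compression: str = "") -> str:
    """Build '<escaped stem>.<section>[.<compression>]' (hypothetical helper)."""
    suffix = f".{section}" + (f".{compression}" if compression else "")
    return escape_stem(name) + suffix


print(manpage_filename("java.lang.System", "3java"))  # java%2Elang%2ESystem.3java
print(manpage_filename("cat", "1", "gz"))             # cat.1.gz
```

With this scheme, plain names such as "cat" encode to themselves, and any "." left in the result belongs to the suffix.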

If the objective is to be portable to other systems, then some of those
impose a 6+3 naming rule, with a very limited char set (upper case letters,
digits, and a couple of other chars for some) - the only way to encode
anything reasonable for those would be to hash the original filename into
a 24 bit value, then use the name that that 24 bit value expands into.
And hope for no collisions (but 24 bits is 16 million, so you'd need nearly
5 thousand names in the same directory before the probability of a collision
gets above 50%).
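The collision estimate can be checked with the usual birthday approximation, p ~ 1 - exp(-n(n-1)/2N) with N = 2^24. The sketch below is only illustrative: the choice of SHA-256 truncated to 3 bytes and the 6-character hex expansion are assumptions, not something the 6+3 systems prescribe.

```python
import hashlib
import math


def short_name(original: str) -> str:
    """Hash a filename to 24 bits and expand it to 6 upper-case hex digits.

    Assumption: SHA-256 truncated to 3 bytes; any well-mixed hash would do.
    """
    digest = hashlib.sha256(original.encode("utf-8")).digest()
    return digest[:3].hex().upper()


def collision_probability(n_names: int, bits: int = 24) -> float:
    """Birthday approximation of P(at least one collision) among n_names."""
    space = 2 ** bits
    return 1.0 - math.exp(-n_names * (n_names - 1) / (2 * space))


print(short_name("java.lang.System.3java"))
for n in (1000, 4096, 4900):
    print(n, round(collision_probability(n), 3))
```

Under this approximation the probability is a little under 40% at 4096 names and crosses 50% somewhat below five thousand.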

If you're just going to aim for some other systems, then you'd have
to justify why those, and not others.

If the objective is to make reasonable, easy to manipulate (if perhaps
ugly to look at while encoded) names for unix systems, then there's a
totally different mindset when looking for an encoding method, and you
can easily find something where 99% of all real life man pages encode
into themselves (encoding changes nothing) which is, for this purpose,

What you shouldn't be attempting to do is solve all of the issues, or
generate something that is a panacea. That way just results in madness.

The objective is to find something portable to most modern OSes: BSD, Linux, Windows, WSL, Cygwin, Msys, Haiku, Minix, ... Most of these are close enough to "Unix", but Windows without Unix emulation is not.

Something very similar to web URLs is ideal, since manpages are often viewed on the web. That suggests "%NN" where NN is hex.
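One practical benefit of the URL-style "%NN" form is that existing URL libraries can decode it. As a rough sketch, assuming every "." in the stem has been escaped as "%2E", the first literal dot in a filename starts the suffix and the stem can be decoded with Python's standard urllib.parse.unquote. The helper name is hypothetical.

```python
from urllib.parse import unquote


def split_manpage_filename(filename: str) -> tuple[str, str]:
    """Return (decoded page name, suffix) for an encoded manpage filename.

    Assumes dots in the stem were escaped, so the first literal "."
    marks the start of the suffix.
    """
    stem, dot, suffix = filename.partition(".")
    return unquote(stem), dot + suffix


print(split_manpage_filename("java%2Elang%2ESystem.3java"))
print(split_manpage_filename("cat.1.gz"))
```

The first call yields the page name "java.lang.System" with suffix ".3java"; the second yields "cat" with suffix ".1.gz".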

Filename length limits are a moot point on current OSes, except perhaps for niche embedded stuff. I wouldn't worry about that.
