tech-userlevel archive


Re: Encoding non-alphanumeric characters in manpage filenames



It is irrelevant what you call it, with unix filenames they are just
characters (or, perhaps wrt Mouse's comments, octets).   Meaning is
given to them only by programs that for whatever reason desire to parse
the filenames.   There is nothing special.   Whether in the example
above .1.gz is interpreted by the program, or just .gz depends on which
program is involved.   (Neither the .1 nor the .gz are really required,
the .1 is implied by the file being inside man1/ and the .gz can be
determined by reading the first few bytes of the file ... these things
sometimes help humans tell what the file is about, but that's often it.)

The boundaries between Unix and other OSes are blurred by X desktop environments and file sharing over networks. Best to come up with conventions that are well suited to all the main ones.

Any desktop environment (including Windows and Mac) can do file associations on the suffix after the last dot, so things like ".1" and ".gz" work naturally. Subsection extensions like ".3lua" can be made to work, so they aren't much of a problem.

   | The suffix is hopefully restricted to [0-9a-z.] in all cases, and hence
   | doesn't need to be escaped.

Don't bet on that.   That kind of assumption is doomed to eventually fail.
(And that even if you include A-Z in your list of acceptable chars.)

I'm fine with calling escape() on the section part as well.

   | Escaping "." in the stem part is good practice when the name of a
   | manpage contains a dot.

Why?

   | The manpage for "java.lang.System" in section
   | "3java" should become something like "java%2Elang%2ESystem.3java". Then
   | it's clear what's the suffix and what's not.

Why?

Why not? If we're escaping other bytes, escaping dots comes for free.
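
As a minimal sketch of what escaping each part could look like: the thread does not pin down escape() here, so the safe set below (ASCII letters, digits, "_" and "-") and the helper name manpage_filename are assumptions made for illustration only.

```python
# Hypothetical sketch; the exact safe set is an assumption, not a spec.
SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789_-"
)

def escape(part: str) -> str:
    """Percent-encode every byte of `part` not in SAFE, including '.' and '%'."""
    out = []
    for byte in part.encode("utf-8"):
        if byte in SAFE:
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)

def manpage_filename(page: str, section: str) -> str:
    """Join the escaped page name and the escaped section with a literal dot."""
    return escape(page) + "." + escape(section)

print(manpage_filename("java.lang.System", "3java"))
# prints java%2Elang%2ESystem.3java
```

Because "%" is itself escaped, the mapping stays reversible, and the dot handling needs no extra branch.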

What's wrong with leaving it alone?

For the man program, there is a list of suffixes that it looks for in (or,
more often, adds to) names - whether the name might have some other
periods in it doesn't matter at all.   If we wanted to be able to put
a man page "foo.3" in section 6, resulting in foo.3.6 as the filename,
that might (just might, it also might not) cause an issue - but
one hopes that we don't actually want to do that.

You're mixing two layers of the design:

1. The way each part of the filename is encoded.

2. The list of specific choices that happen to be available for a filename part (in this case, the section part).

If all dots were escaped, these layers would be kept separate. Then manpage filenames would reliably be of the form:

    page "." section-extension ( "." other-extension )*

Such a filename can be correctly split at dots without knowing what manual sections, compression tools, and other tools are in existence.
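
A rough sketch of that split, assuming percent-encoding as in the "java%2Elang%2ESystem.3java" example quoted earlier; the function name split_escaped is hypothetical.

```python
from urllib.parse import unquote

def split_escaped(filename: str):
    """Split page "." section ( "." other )* at literal dots.

    Assumes every dot inside a part was escaped as %2E, so each
    remaining dot is a separator.
    """
    parts = filename.split(".")
    if len(parts) < 2:
        raise ValueError(f"no section extension in {filename!r}")
    page, section, *others = parts
    return unquote(page), unquote(section), [unquote(e) for e in others]

print(split_escaped("java%2Elang%2ESystem.3java.gz"))
# prints ('java.lang.System', '3java', ['gz'])
```

No table of known sections or compressors is consulted anywhere in this function.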

If the page name can contain dots, you have to use heuristics, such as:

[H1]: Look for the first extension that starts with a digit. That fails with "foo.3.6", as you say.

[H2]: Look for the _last_ extension that starts with a digit. Then "foo.3.6" is correctly identified as belonging to section 6. This works only as long as no compressor or other tool uses a filename extension starting with a digit.

[H3]: When the pathname is "man6/foo.3.6", match "man6" in the directory name with ".6" in the filename.
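
The three heuristics can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and [H3] is read here as "the last extension that begins with the digits following 'man' in the directory name".

```python
import posixpath

def starts_with_digit(s: str) -> bool:
    return s[:1] != "" and s[0] in "0123456789"

def section_h1(filename: str):
    """[H1]: split at the first extension that starts with a digit."""
    parts = filename.split(".")
    for i in range(1, len(parts)):
        if starts_with_digit(parts[i]):
            return ".".join(parts[:i]), parts[i]
    return None

def section_h2(filename: str):
    """[H2]: split at the last extension that starts with a digit."""
    parts = filename.split(".")
    for i in range(len(parts) - 1, 0, -1):
        if starts_with_digit(parts[i]):
            return ".".join(parts[:i]), parts[i]
    return None

def section_h3(pathname: str):
    """[H3]: match the manN directory against the extensions."""
    directory, filename = posixpath.split(pathname)
    dirname = posixpath.basename(directory)
    wanted = dirname[len("man"):] if dirname.startswith("man") else ""
    if not wanted:
        return None
    parts = filename.split(".")
    for i in range(len(parts) - 1, 0, -1):
        if parts[i].startswith(wanted):
            return ".".join(parts[:i]), parts[i]
    return None

print(section_h1("foo.3.6"))       # ('foo', '3'): wrong section
print(section_h2("foo.3.6"))       # ('foo.3', '6')
print(section_h3("man6/foo.3.6"))  # ('foo.3', '6')
```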

Designs that rely on heuristics predictably cause problems, and those problems are compounded when you layer one such design on top of another.

However, in practice we would almost certainly want to keep man(1) and other tools backward compatible, i.e. they should also look for non-escaped filenames. Then "foo.3.6" has to be a valid filename, and we should probably use heuristic [H2].

For what purpose, what are you actually attempting to achieve?

Explained in the first message of the thread.

Or is this some academic exercise?

No.

Viewing man pages over the web is irrelevant: if a name needs to be
encoded for the URL, it will be, and then decoded by the server at
the other end; the encoding scheme used for that is relevant only
as an example of such a scheme.

Prior art is rarely irrelevant. Consistency, familiarity, and re-use are some of the best design principles.

Actually copying it might be counter-productive, as to encode a
man page name, which has been encoded, would need to encode the %
chars to pass them as URLs, so you'd end up with an encoded encoded
name.

That's a valid point. However, serving static files from a web server's file system doesn't seem to be the custom. For example, the URL https://man.netbsd.org/intro.3 goes through a tool called man-cgi. In this case, the tool can handle the URL-to-filename mapping (if any).
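
To make the double-encoding concern concrete, here is a small sketch using Python's standard urllib.parse. It assumes an escaped filename were exposed directly as a URL path, which, as noted above, is not how man-cgi is described as working.

```python
from urllib.parse import quote, unquote

# An already-escaped manpage filename (taken from the example earlier).
filename = "java%2Elang%2ESystem.3java"

# Putting it in a URL path requires encoding the '%' characters again.
url_path = quote(filename)
print(url_path)
# prints java%252Elang%252ESystem.3java

# One round of URL decoding on the server recovers the on-disk filename.
assert unquote(url_path) == filename
```

A tool sitting between the URL and the file system can avoid exposing the doubled form by accepting the plain page name in the URL and applying the filename escaping itself.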
