tech-userlevel archive
Re: Encoding non-alphanumeric characters in manpage filenames
It is irrelevant what you call it; with Unix filenames they are just
characters (or, perhaps wrt Mouse's comments, octets). Meaning is
given to them only by programs that for whatever reason desire to parse
the filenames. There is nothing special. Whether in the example
above .1.gz is interpreted by the program, or just .gz, depends on which
program is involved. (Neither the .1 nor the .gz are really required,
the .1 is implied by the file being inside man1/ and the .gz can be
determined by reading the first few bytes of the file ... these things
sometimes help humans tell what the file is about, but that's often it.)
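To illustrate the point about reading the first few bytes: a minimal sketch that recognises gzip data by its two-byte magic number (0x1f 0x8b, per RFC 1952) rather than by the ".gz" suffix. The function name looks_gzipped and the sample file name are made up for this example; this is not how any particular man(1) implementation does it.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)

def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

# Usage: a compressed page without a .gz suffix is still recognised.
with tempfile.TemporaryDirectory() as d:
    page = os.path.join(d, "intro.1")  # hypothetical file, no .gz suffix
    with gzip.open(page, "wb") as f:
        f.write(b".Dd January 1, 2024\n")
    print(looks_gzipped(page))
```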
The boundaries between Unix and other OSes are blurred by X desktop
environments and file sharing over networks. Best to come up with
conventions that are well suited to all the main ones.
Any desktop environment (including Windows and Mac) can do file
associations on the suffix after the last dot, so things like ".1" and
".gz" work naturally. Subsection extensions like ".3lua" can be made to
work, so they aren't much of a problem.
| The suffix is hopefully restricted to [0-9a-z.] in all cases, and hence
| doesn't need to be escaped.
Don't bet on that. That kind of assumption is doomed to eventually fail.
(And that even if you include A-Z in your list of acceptable chars.)
I'm fine with calling escape() on the section part as well.
| Escaping "." in the stem part is good practice when the name of a
| manpage contains a dot.
Why?
| The manpage for "java.lang.System" in section
| "3java" should become something like "java%2Elang%2ESystem.3java". Then
| it's clear what's the suffix and what's not.
Why?
Why not? If we're escaping other bytes, escaping dots comes for free.
What's wrong with leaving it alone?
For the man program, there is a list of suffixes that it looks for (or,
more often, adds) on names - whether the name might have some other
periods in it doesn't matter at all. If we wanted to be able to put
a man page "foo.3" in section 6, resulting in foo.3.6 as the filename,
that might (just might, it also might not) cause an issue - but
one hopes that we don't actually want to do that.
You're mixing two layers of the design:
1. The way each part of the filename is encoded.
2. The list of specific choices that happen to be available for a
filename part (in this case, the section part).
If all dots were escaped, these layers would be kept separate. Then
manpage filenames would reliably be of the form:
page "." section-extension ( "." other-extension )*
Such a filename can be correctly split at dots without knowing what
manual sections, compression tools, and other tools are in existence.
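A rough sketch of what that separation could look like, in Python. The set of bytes left unescaped ([0-9A-Za-z_-]) and the names escape, unescape, make_filename and split_filename are assumptions made for this example only; the thread has not settled on either.

```python
import re

# Assumption: only these bytes are left as-is; everything else,
# including "." and "%", becomes %XX.
SAFE = re.compile(rb"[0-9A-Za-z_-]")

def escape(part):
    """Percent-encode every byte of `part` that is not in SAFE."""
    out = []
    for b in part.encode("utf-8"):
        ch = bytes([b])
        out.append(ch.decode("ascii") if SAFE.fullmatch(ch) else "%%%02X" % b)
    return "".join(out)

def unescape(part):
    """Inverse of escape(): decode each %XX sequence in a single pass."""
    raw = re.sub(rb"%([0-9A-Fa-f]{2})",
                 lambda m: bytes([int(m.group(1), 16)]),
                 part.encode("ascii"))
    return raw.decode("utf-8")

def make_filename(page, section, *others):
    """Build: page "." section-extension ( "." other-extension )*"""
    return ".".join([escape(page), escape(section), *map(escape, others)])

def split_filename(name):
    """Split at dots only; no knowledge of sections or compressors needed."""
    page, section, *others = name.split(".")
    return unescape(page), unescape(section), [unescape(o) for o in others]

print(make_filename("java.lang.System", "3java", "gz"))
print(split_filename("foo%2E3.6"))
```

Because every dot inside a part is escaped, any dot left in the filename is a separator, so split_filename() never has to guess. A name with no dot at all raises ValueError in this sketch.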
If the page name can contain dots, you have to use heuristics, such as:
[H1]: Look for the first extension that starts with a digit. That fails
with "foo.3.6", as you say.
[H2]: Look for the _last_ extension that starts with a digit. Then
"foo.3.6" is correctly identified as belonging to section 6. Hope that
no compressors or other tools use filename extensions starting with a digit.
[H3]: When the pathname is "man6/foo.3.6", match "man6" in the directory
name with ".6" in the filename.
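For comparison, the three heuristics can be sketched in a few lines of Python. The function names split_h1, split_h2 and split_h3 are invented here, each returns (page, section, other-extensions) or None, and "foo.6.7z" is a hypothetical file used only to show where [H2] breaks down.

```python
import posixpath

def _starts_with_digit(s):
    return len(s) > 0 and s[0] in "0123456789"

def _split_at(parts, i):
    # parts[i] is taken as the section; everything before it is the page name.
    return ".".join(parts[:i]), parts[i], parts[i + 1:]

def split_h1(filename):
    """[H1]: the first extension that starts with a digit is the section."""
    parts = filename.split(".")
    for i in range(1, len(parts)):
        if _starts_with_digit(parts[i]):
            return _split_at(parts, i)
    return None

def split_h2(filename):
    """[H2]: the last extension that starts with a digit is the section."""
    parts = filename.split(".")
    for i in range(len(parts) - 1, 0, -1):
        if _starts_with_digit(parts[i]):
            return _split_at(parts, i)
    return None

def split_h3(pathname):
    """[H3]: match the section in the "manN" directory name against the
    extensions, scanning from the end of the filename."""
    dirname, filename = posixpath.split(pathname)
    base = posixpath.basename(dirname)
    if not base.startswith("man") or len(base) == 3:
        return None
    dir_section = base[3:]
    parts = filename.split(".")
    for i in range(len(parts) - 1, 0, -1):
        if parts[i].startswith(dir_section):
            return _split_at(parts, i)
    return None

print(split_h1("foo.3.6"))       # picks section "3": the failure noted above
print(split_h2("foo.3.6"))       # picks section "6"
print(split_h2("foo.6.7z"))      # picks "7z" as the section
print(split_h3("man6/foo.3.6"))  # picks section "6"
```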
Designs that rely on heuristics predictably cause problems, and those
problems are compounded when you layer one such design on top of another.
However, in practice we would almost certainly want to keep man(1) and
other tools backward compatible, i.e. they should also look for
non-escaped filenames. Then "foo.3.6" has to be a valid filename, and we
should probably use heuristic [H2].
For what purpose, what are you actually attempting to achieve?
Explained in the first message of the thread.
Or is this some academic exercise?
No.
Viewing man pages over the web is irrelevant: if a name needs to be
encoded for the URL, it will be, and then decoded by the server at
the other end; the encoding scheme used for that is relevant only
as an example of such a scheme.
Prior art is rarely irrelevant. Consistency, familiarity, and re-use are
some of the best design principles.
Actually copying it might be counter-productive: to pass a man page
name that has already been encoded as a URL, you would need to encode
the % chars again, so you'd end up with an encoded encoded
name.
That's a valid point. However, serving static files from a web server's
file system doesn't seem to be the custom. For example, the URL
https://man.netbsd.org/intro.3 goes through a tool called man-cgi. In
this case, the tool can handle the URL-to-filename mapping (if any).
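The double encoding is easy to reproduce. A small sketch using Python's urllib.parse, assuming the on-disk name was already percent-escaped as in the "java%2Elang%2ESystem.3java" example:

```python
from urllib.parse import quote, unquote

escaped_filename = "java%2Elang%2ESystem.3java"  # name as stored on disk

# Putting that name into a URL path encodes each "%" again as "%25".
url_path = quote(escaped_filename)
print(url_path)  # java%252Elang%252ESystem.3java

# The server's URL decoding undoes only the outer layer, giving back the
# on-disk name, which still carries the filename-level escapes.
print(unquote(url_path) == escaped_filename)  # True
```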