Re: Encoding non-alphanumeric characters in manpage filenames

To: Robert Elz <kre%munnari.OZ.AU@localhost>
Subject: Re: Encoding non-alphanumeric characters in manpage filenames
From: Lassi Kortela <lassi%lassi.io@localhost>
Date: Thu, 11 Nov 2021 09:36:11 +0200

It is irrelevant what you call it, with unix filenames they are just
characters (or, perhaps wrt Mouse's comments, octets).   Meaning is
given to them only by programs that for whatever reason desire to parse
the filenames.   There is nothing special.   Whether in the example
above .1.gz is interpreted by the program, or just .gz depends on which
program is involved.   (Neither the .1 nor the .gz are really required,
the .1 is implied by the file being inside man1/ and the .gz can be
determined by reading the first few bytes of the file ... these things
sometimes help humans tell what the file is about, but that's often it.)

The boundaries between Unix and other OSes are blurred by X desktop environments and file sharing over networks. Best to come up with conventions that are well suited to all the main ones.

Any desktop environment (including Windows and Mac) can do file associations on the suffix after the last dot, so things like ".1" and ".gz" work naturally. Subsection extensions like ".3lua" can be made to work, so they aren't much of a problem.

   | The suffix is hopefully restricted to [0-9a-z.] in all cases, and hence
   | doesn't need to be escaped.

Don't bet on that.   That kind of assumption is doomed to eventually fail.
(And that even if you include A-Z in your list of acceptable chars.)


I'm fine with calling escape() on the section part as well.

   | Escaping "." in the stem part is good practice when the name of a
   | manpage contains a dot.

Why?

   | The manpage for "java.lang.System" in section
   | "3java" should become something like "java%2Elang%2ESystem.3java". Then
   | it's clear what's the suffix and what's not.

Why?


Why not? If we're escaping other bytes, escaping dots comes for free.
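
To make "comes for free" concrete, here is a minimal sketch of such an escape() in Python. The set of bytes left unescaped (ASCII letters, digits, "_" and "-") is my assumption for illustration, not something settled in this thread, and the helper names unescape() and manpage_filename() are hypothetical. Since "." and "%" are simply outside the safe set, they are escaped by the same rule as every other byte.

```python
# Sketch only: the safe set below is an assumption, not an agreed convention.
SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789_-"
)


def escape(part: str) -> str:
    """Percent-encode every UTF-8 byte of `part` that is not in SAFE."""
    out = []
    for byte in part.encode("utf-8"):
        if byte in SAFE:
            out.append(chr(byte))
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)


def unescape(part: str) -> str:
    """Inverse of escape(): turn each %XX back into the byte it stands for."""
    data = part.encode("ascii")
    raw = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x25:  # "%"
            raw.append(int(data[i + 1:i + 3].decode("ascii"), 16))
            i += 3
        else:
            raw.append(data[i])
            i += 1
    return raw.decode("utf-8")


def manpage_filename(page: str, section: str) -> str:
    """Hypothetical helper: escape both parts, join with a literal dot."""
    return escape(page) + "." + escape(section)
```

With this sketch, manpage_filename("java.lang.System", "3java") gives "java%2Elang%2ESystem.3java", the same shape as the example quoted above.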

What's wrong with leaving it alone?

For the man program, there is a list of suffixes that it looks for (or
adds more often) to names - whether the name might have some other
periods in it doesn't matter at all.   If we wanted to be able to put
a man page "foo.3" in section 6, resulting in foo.3.6 as the filename,
that might (just might, it also might not) cause an issue - but
one hopes that we don't actually want to do that.


You're mixing two layers of the design:

1. The way each part of the filename is encoded.

2. The list of specific choices that happen to be available for a filename part (in this case, the section part).

If all dots were escaped, these layers would be kept separate. Then manpage filenames would reliably be of the form:


    page "." section-extension ( "." other-extension )*

Such a filename can be correctly split at dots without knowing what manual sections, compression tools, and other tools are in existence.
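
As a rough illustration of that claim, the split could look like the following Python sketch. The function name split_manpage_filename() is hypothetical, and it assumes every dot inside the page name has already been escaped, so each remaining dot is a separator. The parts it returns are still in escaped form; decoding them is a separate step.

```python
def split_manpage_filename(filename: str):
    """Split 'page.section[.other...]' at its dots.

    Assumes dots inside the page name were escaped beforehand, so no
    table of known sections or compressors is needed to find the parts.
    """
    page, *extensions = filename.split(".")
    if not extensions:
        raise ValueError("no section extension in " + repr(filename))
    section, *others = extensions
    return page, section, others
```

For example, "java%2Elang%2ESystem.3java.gz" splits into the page "java%2Elang%2ESystem", the section "3java", and the remaining extensions ["gz"].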


If the page name can contain dots, you have to use heuristics, such as:

[H1]: Look for the first extension that starts with a digit. That fails with "foo.3.6", as you say.

[H2]: Look for the _last_ extension that starts with a digit. Then "foo.3.6" is correctly identified as belonging to section 6. Hope that no compressors or other tools use filename extensions starting with a digit.

[H3]: When the pathname is "man6/foo.3.6", match "man6" in the directory name with ".6" in the filename.
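
The three heuristics can be sketched in Python as below. This is only an illustration, not what man(1) does: the function names are hypothetical, everything after the first dot is treated as a candidate extension, and for [H3] I assume the directory is named "man" plus a section prefix and that a subsection such as "3lua" should match a directory "man3".

```python
import posixpath


def starts_with_digit(ext: str) -> bool:
    return ext != "" and ext[0] in "0123456789"


def section_h1(filename: str):
    """[H1]: the first extension that starts with a digit."""
    for ext in filename.split(".")[1:]:
        if starts_with_digit(ext):
            return ext
    return None


def section_h2(filename: str):
    """[H2]: the last extension that starts with a digit."""
    for ext in reversed(filename.split(".")[1:]):
        if starts_with_digit(ext):
            return ext
    return None


def section_h3(pathname: str):
    """[H3]: the extension that matches the enclosing manN directory."""
    directory, filename = posixpath.split(pathname)
    dirname = posixpath.basename(directory)
    if not dirname.startswith("man") or dirname == "man":
        return None
    wanted = dirname[len("man"):]
    # Assumption: prefer the rightmost extension that begins with the
    # directory's section, so "3lua" is accepted inside man3/.
    for ext in reversed(filename.split(".")[1:]):
        if ext.startswith(wanted):
            return ext
    return None
```

Here section_h1("foo.3.6") returns "3" while section_h2("foo.3.6") returns "6". The weak spot of [H2] shows up with a name like "foo.6.7z", where it returns "7z" instead of the section.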

Designs that rely on heuristics predictably cause problems, and those problems are compounded when you layer one such design on top of another.

However, in practice we would almost certainly want to keep man(1) and other tools backward compatible, i.e. they should also look for non-escaped filenames. Then "foo.3.6" has to be a valid filename, and we should probably use heuristic [H2].

For what purpose, what are you actually attempting to achieve?


Explained in the first message of the thread.

Or is this some academic exercise?

No.

Viewing man pages over the web is irrelevant: if a name needs to be
encoded for the URL, it will be, and then decoded by the server at
the other end, the encoding scheme used for that is relevant only
as an example of such a scheme.

Prior art is rarely irrelevant. Consistency, familiarity, and re-use are some of the best design principles.

Actually copying it might be counter-productive, as to encode a
man page name, which has been encoded, would need to encode the %
chars to pass them as URLs, so you'd end up with an encoded encoded
name.

That's a valid point. However, serving static files from a web server's file system doesn't seem to be the custom. For example, the URL https://man.netbsd.org/intro.3 goes through a tool called man-cgi. In this case, the tool can handle the URL-to-filename mapping (if any).
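
For reference, the double encoding in question looks like this with Python's standard urllib.parse. The two function names are hypothetical, and this is not a description of how man-cgi works; it only shows what happens if an already-escaped filename is passed through ordinary URL encoding and back.

```python
from urllib.parse import quote, unquote


def filename_to_url_path(filename: str) -> str:
    """URL-encode an on-disk filename; any '%' in it becomes '%25'."""
    return quote(filename, safe="")


def url_path_to_filename(path: str) -> str:
    """Undo exactly one layer of URL encoding."""
    return unquote(path)
```

An on-disk name "java%2Elang%2ESystem.3java" becomes "java%252Elang%252ESystem.3java" in the URL, and one round of decoding on the server side gives back the on-disk name unchanged.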
