tech-userlevel archive


Re: Encoding non-alphanumeric characters in manpage filenames



<slash> is posix speak for '/'

But is that "Unicode codepoint 47" or "ASCII codepoint 0x2f" or
"whatever the character set in use provides that is a line between
upper right and lower left" or what?

"A line between upper right and lower left", by my reading of POSIX.

Does POSIX mandate an ASCII
superset, for example?  C99 demands that certain characters be present,
but I don't think it mandates anything about their representations
except that they must all be strictly positive and the digits 0..9 are
consecutive and in order.

POSIX seems similar.

Hence asking if POSIX mandates an ASCII superset.

AFAICT it mandates that the Platonic Form of most (but not all) ASCII characters be present. Doesn't mandate ASCII codepoints for them.

And:
3.141 Filename
         A sequence of bytes consisting of 1 to {NAME_MAX} bytes used to
      name a file. The bytes composing the name shall not contain the <NUL>
      or <slash> characters.  [...]
I think for some character sets that may be ill-defined, and it
definitely contradicts existing practice (which is that the octet
string shall not contain 0x00 or 0x2f octets, regardless of what
characters they may or may not be part of).  Perhaps that's just sloppy
language in the spec, but perhaps not, too.

AFAICT the POSIX term "filename" means one pathname component.

No, that's not what I mean.  I mean, `you find you have a directory
entry whose d_name contains "ls.0" and another which contains
"%6Cs.0"', not `you want to have both a manpage called "ls" and another
manpage called "%6Cs"'.  (Or, perhaps harder to handle, one containing
%6Cs.0 and one containing l%73.0.)

The point is, man(1) has to find the underlying file.  But, when you
have encodings, you have multiple possible names.  In most cases, there
will be 2^N possible names, where N is the number of characters (or
possibly octets) in the name, fewer if any of the characters/octets
_must_ be encoded, such as / or %.  So, either it has to try all
possible encodings (which will be impractically large; for example,
XtDisplayStringConversionWarning would generate 16 (binary) billion
different names) or it has to read the directory and look for a name
that, after decoding, matches.  I was assuming the latter case and
pointing out that there's the question of what to do if you find
multiple different names that decode to matches.
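A minimal sketch of the "read the directory and decode" approach, to make the ambiguity concrete. It assumes %XX escapes with two hex digits (as in the "%6Cs.0" example) and an ASCII-compatible execution character set, so that 0x6C is 'l'. The names hexval, decode_name and count_matches are hypothetical and not part of man(1).

```c
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
int hexval(int c)
{
    static const char digits[] = "0123456789abcdef";
    const char *p;

    if (c == '\0')
        return -1;              /* strchr() would match the terminator */
    p = strchr(digits, tolower((unsigned char)c));
    return p != NULL ? (int)(p - digits) : -1;
}

/*
 * Decode %XX escapes from enc into out (capacity outsz, NUL-terminated).
 * Returns 0 on success, -1 on a malformed escape, an encoded NUL,
 * or a buffer that is too small.
 */
int decode_name(const char *enc, char *out, size_t outsz)
{
    size_t n = 0;

    while (*enc != '\0') {
        int c = (unsigned char)*enc++;

        if (c == '%') {
            int hi = hexval((unsigned char)enc[0]);
            int lo = hi < 0 ? -1 : hexval((unsigned char)enc[1]);

            if (lo < 0)
                return -1;
            c = hi * 16 + lo;
            enc += 2;
        }
        if (c == '\0' || n + 1 >= outsz)
            return -1;
        out[n++] = (char)c;
    }
    if (outsz == 0)
        return -1;
    out[n] = '\0';
    return 0;
}

/*
 * Count directory entries whose decoded name equals wanted.
 * Returns -1 if the directory cannot be opened.  Entries that fail
 * to decode (or do not fit in buf) are skipped.
 */
int count_matches(const char *dirpath, const char *wanted)
{
    char buf[1024];
    struct dirent *e;
    int n = 0;
    DIR *d = opendir(dirpath);

    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL)
        if (decode_name(e->d_name, buf, sizeof buf) == 0 &&
            strcmp(buf, wanted) == 0)
            n++;
    closedir(d);
    return n;
}
```

Under these assumptions "ls.0", "%6Cs.0" and "l%73.0" all decode to the same name, so count_matches() can return more than 1 for a single wanted name. That is exactly the case where a policy is needed.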

As you say, trying all combinations of escaped vs non-escaped octets is not wise.

Calling readdir() until a name matches avoids combinatorial explosion. But since readdir() returns names in arbitrary order, this would cause different results on different file systems (and in different directories on the same file system) that have the same set of files.

IMHO the best solution is to try two variants of the filename:

- One where no bytes are escaped (i.e. current behavior of man(1)).

- Another where all non-portable bytes are escaped.

We should specify in which order these are tried.
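A rough sketch of how the second variant could be generated. It assumes that "portable" means the POSIX portable filename character set (A-Z, a-z, 0-9, period, underscore, hyphen) and that escapes are written as % followed by two uppercase hex digits. Both are assumptions, and is_portable and encode_name are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if c is in the assumed portable set: letters, digits, '.', '_', '-'. */
int is_portable(int c)
{
    static const char set[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789._-";

    return c != '\0' && strchr(set, c) != NULL;
}

/*
 * Write name into out (capacity outsz, NUL-terminated) with every
 * non-portable byte replaced by %XX.  Since '%' is not in the portable
 * set, it is always escaped too.  Returns 0 on success, -1 if out is
 * too small.
 */
int encode_name(const char *name, char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    for (; *name != '\0'; name++) {
        unsigned char c = (unsigned char)*name;

        if (is_portable(c)) {
            if (n + 1 >= outsz)
                return -1;
            out[n++] = (char)c;
        } else {
            if (n + 3 >= outsz)
                return -1;
            out[n++] = '%';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 0x0F];
        }
    }
    if (n >= outsz)
        return -1;
    out[n] = '\0';
    return 0;
}
```

With this, a lookup needs at most two open() attempts per name: the raw name and the output of encode_name(). Which of the two is tried first is the open question above; the sketch does not settle it.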

In practice, we'll find the files via open("foo", ...) or similar. This means we use whatever character set and encoding open() uses. If C character and string literals use the right character set, and standard library functions like isalnum() from ctype.h work reasonably, we're all set.
