tech-userlevel archive


Re: Encoding non-alphanumeric characters in manpage filenames



> 3.243 Pathname

> [...] <slash> [...]

> <slash> is posix speak for '/'

But is that "Unicode codepoint 47" or "ASCII codepoint 0x2f" or
"whatever the character set in use provides that is a line between
upper right and lower left" or what?  Does POSIX mandate an ASCII
superset, for example?  C99 demands that certain characters be present,
but I don't think it mandates anything about their representations
except that they must all be strictly positive and the digits 0..9 are
consecutive and in order.

Hence asking if POSIX mandates an ASCII superset.

> And:

> 3.141 Filename

>         A sequence of bytes consisting of 1 to {NAME_MAX} bytes used to
>      name a file. The bytes composing the name shall not contain the <NUL>
>      or <slash> characters.  [...]

I think for some character sets that may be ill-defined, and it
definitely contradicts existing practice (which is that the octet
string shall not contain 0x00 or 0x2f octets, regardless of what
characters they may or may not be part of).  Perhaps that's just sloppy
language in the spec, but perhaps not, too.
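To make the character-versus-octet distinction concrete, here is a minimal
sketch.  UTF-16BE is picked purely as an illustrative encoding (the message
above does not name one), and both function names are hypothetical.  It shows
a string that contains no slash character but whose encoded form contains a
0x2f octet.

```python
# Illustration only: UTF-16BE is an assumed example encoding, and the two
# helper names below are hypothetical, not taken from POSIX or any library.

def chars_ok_for_filename(name: str) -> bool:
    """Literal reading of the spec text: no <NUL> or <slash> characters."""
    return "\0" not in name and "/" not in name


def octets_ok_for_filename(name: bytes) -> bool:
    """Existing practice as described above: no 0x00 or 0x2f octets."""
    return 0x00 not in name and 0x2F not in name


name = "\u012f"                     # LATIN SMALL LETTER I WITH OGONEK
encoded = name.encode("utf-16-be")  # two octets: 0x01 0x2f

print(chars_ok_for_filename(name))      # character-level rule
print(octets_ok_for_filename(encoded))  # octet-level rule
```

The character-level rule accepts this name and the octet-level rule rejects
it, which is the gap between the quoted wording and existing practice.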

>> For example, what happens if you find that you have both, say, ls.0
>> and %6Cs.0 in a cat1/ directory somewhere?
> Obviously, whenever one picks a character to have special meaning,
> there needs to be a way to encode that character, even though it
> looks like it could just be stored literally, so if there was an
> encoding scheme like that, a filename like "%6Cs.0" would be encoded
> as %256Cs.0 (or something).

No, that's not what I mean.  I mean, `you find you have a directory
entry whose d_name contains "ls.0" and another which contains
"%6Cs.0"', not `you want to have both a manpage called "ls" and another
manpage called "%6Cs"'.  (Or, perhaps harder to handle, one containing
%6Cs.0 and one containing l%73.0.)
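As a sketch of why those entries collide, assume the hypothetical scheme from
the quoted text: a `%` followed by two hex digits stands for the octet with
that value.  The function name `decode_name` is made up for this illustration.

```python
import re

# Assumed scheme (hypothetical, following the quoted suggestion): "%XX" with
# two hex digits stands for the character whose code is 0xXX.


def decode_name(name: str) -> str:
    """Replace each %XX escape with the character it encodes."""
    return re.sub(
        r"%([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )


names = ["ls.0", "%6Cs.0", "l%73.0"]
decoded = [decode_name(n) for n in names]
print(decoded)
```

All three directory entries decode to the same name, whereas a page genuinely
called "%6Cs" would be stored as "%256Cs.0" and decodes to something else.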

The point is, man(1) has to find the underlying file.  But, when you
have encodings, you have multiple possible names.  In most cases, there
will be 2^N possible names, where N is the number of characters (or
possibly octets) in the name, fewer if any of the characters/octets
_must_ be encoded, such as / or %.  So, either it has to try all
possible encodings (an impractically large number; for example,
XtDisplayStringConversionWarning would generate 16 (binary) billion
different names) or it has to read the directory and look for a name
that, after decoding, matches.  I was assuming the latter case and
pointing out that there's the question of what to do if you find
multiple different names that decode to matches.
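The two strategies can be sketched as follows, again assuming the hypothetical
`%XX` scheme.  The names `spellings` and `find_matches` are made up here, the
count ignores upper- versus lower-case hex digits, and the `.0` suffix on the
long name is an assumption: it is one reading that reproduces the 16 (binary)
billion figure, since the name is then 34 characters long.

```python
import re


def decode_name(name: str) -> str:
    """Replace each %XX escape (assumed scheme) with its character."""
    return re.sub(
        r"%([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )


def spellings(name: str) -> int:
    """Spellings if each character is independently literal or escaped."""
    return 2 ** len(name)


def find_matches(dir_entries, wanted):
    """Read-the-directory strategy: keep entries that decode to `wanted`."""
    return [e for e in dir_entries if decode_name(e) == wanted]


# Strategy 1: enumerate every spelling.  The count alone rules it out.
print(spellings("XtDisplayStringConversionWarning.0"))

# Strategy 2: scan the directory.  More than one entry can match.
entries = ["%6Cs.0", "cat.0", "l%73.0", "ls.0"]
matches = find_matches(entries, "ls.0")
print(matches)
```

The scan is cheap, but it returns a list rather than a single name, and the
sketch deliberately does not pick a winner among the matches.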

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse%rodents-montreal.org@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

