tech-userlevel archive


Re: Encoding non-alphanumeric characters in manpage filenames



    Date:        Mon, 8 Nov 2021 13:47:09 -0500 (EST)
    From:        Mouse <mouse%Rodents-Montreal.ORG@localhost>
    Message-ID:  <202111081847.NAA28875%Stone.Rodents-Montreal.ORG@localhost>


  | What does POSIX say?

From XBD (basic definitions)

3.243 Pathname

            A string that is used to identify a file. In the context
     of POSIX.1-202x, a pathname may be limited to {PATH_MAX} bytes,
     including the terminating null byte. It has optional beginning <slash>
     characters, followed by zero or more filenames separated by <slash>
     characters. A pathname can optionally contain one or more trailing
     <slash> characters. Multiple successive <slash> characters are
     considered to be the same as one <slash>, except it is
     implementation-defined whether the case of exactly two leading
     <slash> characters is treated specially.

<slash> is POSIX-speak for '/'.
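
To make the rule about successive slashes concrete, here is a minimal
sketch in Python.  The function name collapse_slashes is made up for
illustration, and keeping exactly two leading slashes untouched is just one
permitted choice, since the standard leaves that case
implementation-defined.

```python
import re


def collapse_slashes(path: str) -> str:
    """Collapse runs of '/' into one, as the Pathname definition describes.

    Exactly two leading slashes are kept as-is (an assumption: the standard
    only says that case is implementation-defined).
    """
    if path.startswith("//") and not path.startswith("///"):
        # Preserve the special "//" prefix, collapse everything after it.
        return "//" + re.sub(r"/+", "/", path[2:])
    # Any other run of slashes, leading or not, is equivalent to one slash.
    return re.sub(r"/+", "/", path)


print(collapse_slashes("///usr//share///man/"))  # three leading slashes collapse
print(collapse_slashes("//host/share//dir"))     # two leading slashes are kept
```

Trailing slashes are collapsed but not removed, since the definition allows
a pathname to end in one or more of them.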

And:

3.141 Filename

        A sequence of bytes consisting of 1 to {NAME_MAX} bytes used to
     name a file. The bytes composing the name shall not contain the <NUL>
     or <slash> characters. In the context of a pathname, each filename
     shall be followed by a <slash> or a <NUL> character; elsewhere, a
     filename followed by a <NUL> character forms a string (but not
     necessarily a character string).
     The filenames dot and dot-dot have special meaning. A filename is
     sometimes referred to as a ``pathname component''. See also
     Section 3.243 (on page 63).
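
Read literally, the Filename definition is a test on bytes, not on
characters.  A rough Python sketch of that test follows; is_valid_filename
is a hypothetical helper, and 255 is only an assumed stand-in for
{NAME_MAX}, which really depends on the filesystem (on a POSIX system
os.pathconf(dir, "PC_NAME_MAX") can report it).

```python
ASSUMED_NAME_MAX = 255  # assumption: the real limit is per-filesystem


def is_valid_filename(name: bytes, name_max: int = ASSUMED_NAME_MAX) -> bool:
    """Return True if `name` fits the Filename definition quoted above."""
    return (
        1 <= len(name) <= name_max   # 1 to {NAME_MAX} bytes
        and b"\0" not in name        # no <NUL>
        and b"/" not in name         # no <slash>
    )


print(is_valid_filename(b"ls.0"))      # ordinary name
print(is_valid_filename(b"\xff\xfe"))  # arbitrary bytes, not a character string
print(is_valid_filename(b"cat1/ls.0")) # contains a <slash>
```

Note that dot and dot-dot pass this test: they are filenames, just ones
with special meaning.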


  | What about POSIX layers atop filesystems that
  | _don't_ represent pathnames as relatively unstructured octet strings?

Unspecified.   As soon as you step outside a POSIX-defined filesystem
you're in uncharted territory, and POSIX does not apply.   That includes
relatively minor non-conforming filesystems, like NFS (which has no concept
of open files, and hence cannot retain a file in the filesystem after it
has been unlinked if it remains open, and requires tricks to simulate that),
as well as filesystems like FAT and NTFS, which are "kind of" similar in
general operation but don't support much of what is required (especially
FAT).   Anything wilder than that would be right off the chart.
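
For reference, the unlinked-but-still-open behaviour that NFS has to
simulate looks like this on a conforming local filesystem.  This is only an
illustrative Python sketch (read_after_unlink is a made-up name), and it
assumes the temporary directory is on such a filesystem.

```python
import os
import tempfile


def read_after_unlink(data: bytes) -> bytes:
    """Write data to a temp file, unlink it, then read it back via the fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)
        os.unlink(path)                  # the name is gone...
        assert not os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)     # ...but the open file is retained
        return os.read(fd, len(data))
    finally:
        os.close(fd)                     # storage is released on last close


print(read_after_unlink(b"still here"))
```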

  | As for the problem at immediate hand, it strikes me as somewhat
  | difficult to define if you can encode any octet.  For example, what
  | happens if you find that you have both, say, ls.0 and %6Cs.0 in a cat1/
  | directory somewhere?

Obviously, whenever one picks a character to have special meaning, there
needs to be a way to encode that character itself, even though it looks
like it could just be stored literally.  So if there were an encoding scheme
like that, a filename like "%6Cs.0" would be encoded as %256Cs.0 (or
something).  There's nothing odd about this; we do it all the time (in a C
string, '\' needs to be written "\\" as \ is used as part of the encoding
of \n \t ...).
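
A small sketch of what such a scheme could look like, in Python.  Nothing
here is a concrete proposal: the function names and the set of bytes stored
literally (ASCII letters, digits, '.', '_' and '-') are assumptions picked
for illustration.  The point is only that '%' itself is always escaped, so
ls.0 and %6Cs.0 cannot collide.

```python
import string

# Assumption: bytes stored literally; everything else, including '%', is escaped.
SAFE = frozenset((string.ascii_letters + string.digits + "._-").encode("ascii"))


def encode_name(raw: bytes) -> str:
    """Percent-encode every byte outside SAFE as %XX (uppercase hex)."""
    return "".join(chr(b) if b in SAFE else f"%{b:02X}" for b in raw)


def decode_name(enc: str) -> bytes:
    """Inverse of encode_name for ASCII input; malformed escapes raise ValueError."""
    out = bytearray()
    i = 0
    while i < len(enc):
        if enc[i] == "%":
            out.append(int(enc[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(enc[i]))
            i += 1
    return bytes(out)


def is_canonical(enc: str) -> bool:
    """True if enc is exactly what encode_name would have produced."""
    return encode_name(decode_name(enc)) == enc


print(encode_name(b"ls.0"))      # ls.0
print(encode_name(b"%6Cs.0"))    # %256Cs.0
print(is_canonical("%6Cs.0"))    # False
```

With a check like is_canonical, a stray %6Cs.0 sitting next to ls.0 in a
cat1/ directory is simply not a name the encoder would ever have written,
so a reader of the directory could ignore or reject it.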

I have a (private use) encoding scheme for filenames like this, though I
use it to represent book and movie (etc) titles as filenames, mostly for
conversion to HTML to create web indexes.  I use ',' as the (main) magic
char.  There are a few others: _ represents space, for example, ,u an
underscore, and ,z (for reasons too bizarre to go into) an actual ',',
except that ,_ represents a comma followed by a space, which is the normal
way that a comma is found in one of these titles.  This thing grew over
time, and is kind of, no, actually more than that, very, ugly.

It would not be suitable for the proposed purpose, as while it can encode
any Unicode char, it does so in a way derived from HTML, and uses HTML
char names where they exist, if there isn't a shorter encoding (like
,=agrave= and stuff like that).
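
To give a flavour of it, here is a partial decoder sketch in Python that
handles only the handful of rules mentioned above (_ for space, ,u  ,z  ,_
and ,=name= with an HTML character name).  It is not the real thing: the
function name decode_title is invented here, and everything the description
above leaves out is simply rejected.

```python
from html.entities import name2codepoint


def decode_title(name: str) -> str:
    """Decode the few ','-scheme rules described above; reject anything else."""
    out = []
    i = 0
    while i < len(name):
        c = name[i]
        if c == "_":                      # _ is a space
            out.append(" ")
            i += 1
        elif c == ",":
            nxt = name[i + 1:i + 2]
            if nxt == "u":                # ,u is an underscore
                out.append("_")
                i += 2
            elif nxt == "z":              # ,z is a literal comma
                out.append(",")
                i += 2
            elif nxt == "_":              # ,_ is a comma followed by a space
                out.append(", ")
                i += 2
            elif nxt == "=":              # ,=name= is an HTML character name
                end = name.index("=", i + 2)
                out.append(chr(name2codepoint[name[i + 2:end]]))
                i = end + 1
            else:
                raise ValueError(f"unknown sequence at offset {i}")
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(decode_title("Hello,_World"))
print(decode_title("d,=eacute=j,=agrave=_vu"))
```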

kre


