tech-userlevel archive


Re: Encoding non-alphanumeric characters in manpage filenames



On Mon, Nov 08, 2021 at 03:30:14PM -0500, Mouse wrote:
> >> What does POSIX say?
> > [...]
> > 2. Each byte in the UTF-8 encoding is interpreted as ASCII
> 
> As soon as any of the input codepoints are non-ASCII, UTF-8 generates
> octets which are outside the ASCII range and thus cannot be interpreted
> as ASCII (at least not without further processing).

As I've had to deal with UTF-8 in UDF, I'd say it's not a big deal. There is
AFAIK no possibility of confusion with ASCII; only string length
calculations can go wrong.

As Unicode carries compatibility glyphs next to the ordinary characters, it
offers alternate codepoints like U+FF05 (FULLWIDTH PERCENT SIGN) for
U+0025 (%). See https://www.compart.com/en/unicode/category/Po

Some of those replacements could be used in the filenames; though not really
typeable, they are readable and obvious.
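A tiny sketch of what such a replacement table might look like (the particular
mappings here are just my picks, not a proposal):

```python
# Hypothetical lookalike table: map awkward ASCII punctuation to
# visually similar Unicode codepoints for use in filenames.
FULLWIDTH = {
    '%': '\uFF05',  # FULLWIDTH PERCENT SIGN
    ':': '\uFF1A',  # FULLWIDTH COLON
    '/': '\u2044',  # FRACTION SLASH ('/' itself can't appear in a filename)
}

def to_lookalikes(name: str) -> str:
    """Replace listed ASCII characters with their lookalike codepoints."""
    return ''.join(FULLWIDTH.get(c, c) for c in name)
```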

> > 3. If there's a matching character, use that one.  If not, insert a
> > hex encoding of the byte.
> 
> Provided the "If not" case covers the case where the octet isn't ASCII,
> this is then well-defined...provided manpage names are taken as
> sequences of characters.
> 
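For what it's worth, the quoted scheme fits in a few lines (the percent-hex
escape syntax and the safe-character set here are my assumptions, purely
illustrative):

```python
def encode_name(name: str) -> str:
    """Keep safe ASCII bytes as-is; hex-escape every other byte."""
    safe = set(b"abcdefghijklmnopqrstuvwxyz"
               b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._-")
    out = []
    for byte in name.encode('utf-8'):    # steps 1+2: UTF-8, then byte-wise
        if byte in safe:                 # step 3: matching character
            out.append(chr(byte))
        else:                            # step 3: no match, hex-encode
            out.append('%%%02X' % byte)
    return ''.join(out)
```

Since every non-safe byte maps to a fixed three-character escape, the result
is unambiguous regardless of whether the input was ASCII or not.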
> > AFAIK the DNS now uses "Punycode" to encode non-ASCII Unicode
> > characters in domain names.  It's confusing and likely has no
> > advantage over a straight-up hex encoding.

It's smaller, for one; and it's more readable than &=0xF3;-style escapes.

> The major advantage I see is that it's more compact; hex encoding
> doubles or, with the % prefix, triples, octet count, and to compare
> fairly with punycode you have to first convert the Unicode codepoint
> string into an octet string; assuming this is done with UTF-8, it leads
> to two to five times as many octets in the intermediate string as there
> are codepoints in the original string (counting only the non-ASCII
> characters, of course).  This count is then doubled or tripled, leading
> to at least four and possibly as many as 15 output octets per
> (non-ASCII) input codepoint.  Since there is a small maximum - 63 - on
> DNS label length, this degree of expansion is undesirable.
> 
> Punycode is substantially more compact.  See the examples in RFC3492.
> 
> I am actually somewhat surprised they didn't just specify use of UTF-8.
> The DNS supports all 256 possible octet values in labels, except that
> there is the historical misfeature that 26 of them are treated as
> identical to a different 26.  I see no particular reason to not just
> use UTF-8 labels.  Presumably they had some, but if it's in 3492 then
> my reading missed it.

Indeed, UTF-8 would have sufficed IMHO.
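For comparison, Python ships a punycode codec (RFC 3492), which makes the size
arithmetic from the quoted mail easy to check; the percent-hex form below is
the doubled/tripled encoding being compared against:

```python
# Compare encoded sizes for a label with one non-ASCII character.
label = 'b\u00fccher'                # 'bücher'

utf8 = label.encode('utf-8')         # 7 octets: b'b\xc3\xbccher'
puny = label.encode('punycode')      # 9 octets: b'bcher-kva'
hexed = ''.join(
    c if ord(c) < 128 else
    ''.join('%%%02X' % b for b in c.encode('utf-8'))
    for c in label)                  # 11 octets: 'b%C3%BCcher'

print(len(utf8), len(puny), len(hexed))
```

With a single non-ASCII character the gap is small (9 vs. 11 octets), but
Punycode's advantage grows quickly for mostly-non-ASCII labels, as the
RFC 3492 examples show.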

Reinoud


