tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



tlaronde%polynum.com@localhost wrote:
  |On Mon, Apr 15, 2013 at 05:51:33PM -0400, James K. Lowden wrote:
  |> If he types "vi året", I think the file
  |> should open if the character strings match regardless of the
  |> byte-sequences, but today the odds are 1:4 against.  
  |> 
  |
  |No. If the policy _in this network_ is to considered that filenames do
  |not represent themselves but are an instance of a class of considered
  |equal filenames, and that a canonical class name is used as the filename
  |stored, vi(1) would not be the program but a shell wrapper, doing this:
  |
  |1) Normalize the argument $1 into a byte string (UTF-8) pattern, with
  |all the ligatures expanded ('oe' -> 'o' 'e'; 'fi' -> 'f' 'i') all the
  |equivalent characters replaced by the class representant.

I don't think that using any of the compatibility normalizations
(you seem to have NFKD in mind) is a good idea.  Let me quote
a person who is experienced (the full text, which is quite
interesting for this topic, is under [2]):

  Care should be used with mapping to NFKD or NFKC, as semantic
  information might be lost (for instance U+00B2 (SUPERSCRIPT TWO)
  maps to 2) and extra mark-up information might have to be added
  to preserve it (e.g., <SUP>2</SUP> in HTML).
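
A minimal sketch of the loss described in the quote, assuming
Python's standard unicodedata module (an illustration choice, not
something used in this thread).  Canonical decomposition leaves
U+00B2 alone, while compatibility decomposition folds it to a
plain digit:

```python
import unicodedata

s = "m\u00b2"  # "m" followed by U+00B2 SUPERSCRIPT TWO

nfd = unicodedata.normalize("NFD", s)    # canonical decomposition
nfkd = unicodedata.normalize("NFKD", s)  # compatibility decomposition

print(nfd == s)  # True: the superscript survives NFD
print(nfkd)      # m2: the superscript is gone, only the digit remains

# The "fi" ligature (U+FB01) is likewise expanded only by the
# compatibility forms.
fi_nfd = unicodedata.normalize("NFD", "\ufb01")
fi_nfkd = unicodedata.normalize("NFKD", "\ufb01")
print(fi_nfd == "\ufb01", fi_nfkd)  # True fi
```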

  |What is the problem? The main program to write is the normalize (a
  |translator), that is _not_ a regex program, but a program that
  |translates from UTF-8 to UTF-8 but replacing lower case by higher
  |case, ligatures by sequences, equivalent classes by a canonical
  |representant (this has not to be standardized: if one uses the
  |same translator for the text and the pattern, this is it, the
  |representant will be the same).

Well, actually normalization is heavily standardized [1] and is a
basic part of the Unicode character tables.  NFD is even pretty
cheap (except for the potential 1:many character mapping, and not
talking about cache efficiency);  NFC (canonical decomposition
followed by canonical composition) isn't, however.  Unfortunately
NFC seems to be preferred (see also [2]), because the resulting
precomposed character is what users (at least often) produce when
they type a key on their localized keyboard.
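
To make the two forms concrete, here is a small sketch using the
"året" example from earlier in the thread.  It assumes Python's
standard unicodedata module, which is only an illustration choice:

```python
import unicodedata

name = "\u00e5ret"  # "året" typed with the precomposed U+00E5

nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# NFD splits U+00E5 into "a" + U+030A COMBINING RING ABOVE (1:many).
print(len(nfc), len(nfd))  # 4 5 (code points)

# Same character string, different byte sequences on disk.
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 5 6 (bytes)

# Composing the decomposed form gives the precomposed form back.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```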

Apart from that I totally agree that it would be nice if
tab-completion or a glob/xy wildcard match could find anything
that starts with a given base character, *if* the filesystem
stores filenames in decomposed form.  But if the filesystem is
agnostic about that, why should a* match ä or â or á or any other
composition with the base character "a" when I type a*?
However, new magic characters for glob/xy that do match them would
possibly be nice, since typing POSIX equivalence classes is a pain
(and imo nothing that a normal user should be tortured with).
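
A rough sketch of what such an opt-in base-character match could
look like, kept separate from the ordinary a* semantics.  The
helper names strip_marks and base_glob are hypothetical, and
dropping every code point with a non-zero combining class is a
simplification, not what any existing shell does:

```python
import fnmatch
import unicodedata

def strip_marks(s):
    """Decompose to NFD and drop code points with a non-zero
    canonical combining class, leaving the base characters."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def base_glob(names, pattern):
    """Hypothetical opt-in match: compare mark-stripped names against
    the mark-stripped pattern; the original names are returned."""
    folded = strip_marks(pattern)
    return [n for n in names if fnmatch.fnmatchcase(strip_marks(n), folded)]

# Precomposed ä, decomposed â, plain a, and a name starting with b.
names = ["\u00e4pfel", "a\u0302ne", "abc", "b\u00e4r"]

plain = [n for n in names if fnmatch.fnmatchcase(n, "a*")]
folded = base_glob(names, "a*")

print(len(plain), len(folded))  # 2 3
```

The plain a* only finds the names whose first code point is
literally "a" (which includes the decomposed one); the opt-in
variant also finds the precomposed ä.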

So, shells etc. should possibly perform NFC normalization on file
paths before creating entries in the filesystem, because that
results in the most portable (and, who knows, iconv(3)able)
representation of a string.
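
As a sketch of that idea, the following normalizes a name to NFC
before the entry is created.  create_nfc is a hypothetical helper,
Python is used only for illustration, and some filesystems apply
their own normalization on top of whatever the caller does:

```python
import os
import tempfile
import unicodedata

def create_nfc(directory, name):
    """Hypothetical helper: create an empty file under the NFC form
    of name and return the name that was actually used."""
    nfc_name = unicodedata.normalize("NFC", name)
    # Mode "x" fails if the entry already exists.
    with open(os.path.join(directory, nfc_name), "x"):
        pass
    return nfc_name

with tempfile.TemporaryDirectory() as d:
    # Decomposed input: "a" + U+030A COMBINING RING ABOVE + "ret".
    used = create_nfc(d, "a\u030aret")
    created = os.path.exists(os.path.join(d, "\u00e5ret"))
    try:
        # The precomposed spelling now names the same entry.
        create_nfc(d, "\u00e5ret")
        collided = False
    except FileExistsError:
        collided = True

print(used == "\u00e5ret", created, collided)  # True True True
```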

[1] <http://www.unicode.org/reports/tr15/>
[2] <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

--steffen


