Subject: Re: Mount option to ignore case
To: Bill Studenmund <wrstuden@netbsd.org>
From: Johan Ihren <johani@autonomica.se>
List: tech-kern
Date: 04/02/2002 22:49:39
Bill Studenmund <wrstuden@netbsd.org> writes:

Bill,

> > If instead the problem is treated from a perspective of "characters",
> > i.e. a filename is a string of individual characters concatenated
> > together and these characters come from some sort of agreed upon set
> > of available characters (so far this has been ASCII, and will likely be
> > Unicode in the future) then the "locale" aspect disappears. This is
> > *good*.
> 
> The problem is that we are not just talking about characters, we are
> talking about case-insensitive character matching. Different languages
> have different ideas about what characters are the same.

Exactly. And as soon as you start down that path you will be lost.

What locale will you use? The one used by the creator of the file? The
one you (as the reader) use? The one indicated by the choice of
characters in the filename?

> > I.e. UTF-8 is only an encoding of a sequence of Unicode characters and
> > what is needed is mostly the case mapping tables as documented by the
> > Unicode consortium.
> 
> From the comments I heard in one of the plenary sessions at the last IETF,
> that mapping table (from the Unicode consortium) isn't too useful.
> It was made to please everyone and as such pleases no one.

That's more or less my impression also. However (somewhat going out on
a limb here), my impression is also that there is no solution to be
found within the Unicode framework, for exactly the reason that they
work with characters rather than language. I.e. we shouldn't scorn the
Unicode people for failing to solve problems that are inherent to the
framework they are working within.

So the best you can do is to decide on one mapping table (possibly
versioned) that handles the subset of characters it covers; the rest
will, for the time being, not be case converted in name
comparisons.
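
As a rough illustration of the single-table idea (a sketch only, not a
proposal for actual kernel code): the table contents, the version
constant and the function names below are all hypothetical. Characters
the table does not cover are simply compared as stored.

```python
# Hypothetical, versioned fold table. It covers only a chosen subset of
# characters; everything else is compared exactly as stored.
FOLD_TABLE_VERSION = 1  # assumed: recorded so a later table may differ

# ASCII A-Z folded to a-z.
FOLD_TABLE = {chr(c): chr(c + 32) for c in range(ord("A"), ord("Z") + 1)}
# A few extra letters, here Å Ä Ö, as an example of extending the subset.
FOLD_TABLE.update({"\u00c5": "\u00e5", "\u00c4": "\u00e4", "\u00d6": "\u00f6"})


def fold_name(name):
    """Fold covered characters; leave uncovered ones untouched."""
    return "".join(FOLD_TABLE.get(ch, ch) for ch in name)


def names_equal(a, b):
    """Case-insensitive comparison limited to what the table covers."""
    return fold_name(a) == fold_name(b)
```

With this table "README" and "readme" compare equal, while a Greek
capital omega and a small omega do not, because the table says nothing
about them.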

If I create a filename consisting of a mixture of Swedish, Turkish and
Ethiopian characters there is no language-based case conversion
*possible* that can sort out the result. For instance (well known
example), our lowercase "i" has different uppercase equivalents in
English and Turkish. But no language-sensitive system will ever be
able to sort out whether it was a lowercase Swedish "i" or a
lowercase Turkish "i" that was intended.  You will have to choose one
of:

a) choose one of the mappings and lose the other(s).

b) decide that "i" cannot be case converted for the purposes of
   filename comparison. That would hurt quite badly, since we *can*
   case convert "i" right now.

c) make everything *really* ugly and only allow characters from the
   present locale when creating new files (so that it is possible to
   know that all characters are from the same locale). Not even worth
   thinking about.

My vote would be for (a), but I realize that I'm probably biased due
to my expectation that the English case folding rules for "i" will be
considered useful to a larger audience than the Turkish version. Had I
been from Turkey I might have made a different choice.
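
To make the trade-off between (a) and (b) concrete, here is a small
sketch. It assumes plain ASCII uppercasing as the baseline and uses
three hypothetical override tables for "i"; none of the names come from
any existing code. In Turkish, "i" uppercases to U+0130 (dotted capital
I) and U+0131 (dotless small i) uppercases to "I".

```python
def ascii_upper(ch):
    """Baseline: uppercase a-z only, leave everything else alone."""
    return chr(ord(ch) - 32) if "a" <= ch <= "z" else ch


# Hypothetical override tables for the letter "i".
ENGLISH_I = {"i": "I"}                      # option (a), English rule
TURKISH_I = {"i": "\u0130", "\u0131": "I"}  # option (a), Turkish rule
NO_FOLD_I = {"i": "i"}                      # option (b), "i" is not converted


def upper_name(name, overrides):
    """Uppercase a name, letting the override table win for its keys."""
    return "".join(overrides.get(ch, ascii_upper(ch)) for ch in name)


def same_name(a, b, overrides):
    """Compare two names after uppercasing both with the same table."""
    return upper_name(a, overrides) == upper_name(b, overrides)
```

Under ENGLISH_I, "file" matches "FILE". Under TURKISH_I it does not
(only the dotless spelling does), and under NO_FOLD_I it does not match
either, which is the loss described in (b).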

... 

> > My point is that (a) avoid "language", stick to "characters" and (b)
> > before deciding on the "full solution" it is important to find out
> > whether such a solution *exists*. Oops, that's two points.
> 
> My point is that if we don't include "language", then we will get a
> case-insensitivity matching method which will be _not_ what a number
> of users want. Depending on what we do, different groups of users
> will be in the "not" group.

Correct. And by including "language" you'll get nowhere at all. Sorry.

Regards,

Johan