Subject: Re: Unicode support in iso9660.
To: Pavel Cahyna <pavel.cahyna@st.cuni.cz>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 11/22/2004 11:46:10

On Mon, Nov 22, 2004 at 11:07:30AM -0500, Allen Briggs wrote:
> On Mon, Nov 22, 2004 at 02:41:03PM +0100, Pavel Cahyna wrote:
> > The GTK applications do need to display the filenames using X11. How to do
> > that without assuming that filenames are encoded character sequences?
>
> The user needs to be able to read and write meaningful filenames.
> So the encoding / decoding has to happen somewhere.  The question
> is, where?
>
> Currently, the filesystem is agnostic.  As long as paths are
> separated by '/' and end with a NUL character, the kernel doesn't
> really care what the encoding is.  I think der Mouse's point is
> that this is the way it should be--why should the kernel care what
> the encoding is when it's essentially the province of userland to
> make sense of the data.
>
> In any event, a given piece of media will have filenames encoded
> in some fashion, be it ASCII, UTF-8, or "other".  I don't see how
> having the kernel know anything about the actual encoding would be
> particularly practical.

If I remember part of the thread right, the issue is that the above
statement isn't 100% true, and the places where this isn't true are the
ones that get the kernel involved.

Our kernel assumes that a file name is representable as a string of 8-bit
bytes. Thus if we ever see a NUL, we're at the end. If we ever see a '/',
we are hitting a component separator. Likewise the directory-reading API
we offer assumes file names are NUL-terminated strings of 8-bit chars. The
problem (assuming I understand it right) is that some file system naming
structures (Joliet) break that assumption. They store file names in
UTF-16 (well, in a 16-bit format). So there is no way that Joliet names can
be returned as-is for directory listings, nor can we ever expect one to be
passed in as a path component.
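
To make the problem concrete, here is a minimal sketch in C. It is not
NetBSD code; the names bytes_before_nul, contains_slash_byte, name_ab and
name_u012f are made up for illustration, and it assumes the 16-bit units
are stored big-endian, as Joliet does. It shows that a byte-oriented
reader sees an empty string for "AB" and a spurious '/' inside U+012F.

```c
#include <stddef.h>
#include <string.h>

/* "AB" as big-endian 16-bit units (4 name bytes), plus a guard NUL byte. */
static const unsigned char name_ab[] = { 0x00, 0x41, 0x00, 0x42, 0x00 };

/* U+012F as one big-endian 16-bit unit; its low byte is 0x2F, i.e. '/'. */
static const unsigned char name_u012f[] = { 0x01, 0x2F, 0x00 };

/* Length as seen by code that treats the name as a NUL-terminated string. */
static size_t
bytes_before_nul(const unsigned char *name)
{
	return strlen((const char *)name);
}

/* Nonzero if a byte-wise path parser would find a '/' in the name. */
static int
contains_slash_byte(const unsigned char *name, size_t len)
{
	return memchr(name, '/', len) != NULL;
}
```

Calling bytes_before_nul(name_ab) gives 0, because the very first byte is
NUL, and contains_slash_byte(name_u012f, 2) is true even though the name
contains no slash character.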

So the kernel has to do something inside the FS, and that something is
some sort of 16<->8 conversion. Since the FS has to convert to some 8-bit
form, how does it know which one? I think a lot of this thread has resulted
from the logic that the easiest way to tell the FS what 8-bit format to
convert to/from is for us to assume all system calls use UTF-8, and thus
the iso9660 conversion is specified.
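
As a rough sketch of what the 16->8 direction could look like if the
8-bit side is assumed to be UTF-8: the function below is hypothetical
(ucs2be_to_utf8 is not an existing kernel routine), assumes big-endian
UCS-2 input with no surrogate pairs, and rejects NUL and '/' as a policy
choice of this sketch, since neither can appear in a path component.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert srclen bytes of big-endian UCS-2 into NUL-terminated UTF-8.
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * input is malformed, contains NUL, '/' or a surrogate, or does not fit.
 */
static int
ucs2be_to_utf8(const unsigned char *src, size_t srclen,
    char *dst, size_t dstlen)
{
	size_t i, o = 0;

	if (srclen % 2 != 0 || dstlen == 0)
		return -1;

	for (i = 0; i < srclen; i += 2) {
		uint16_t c = (uint16_t)((src[i] << 8) | src[i + 1]);
		size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : 3;

		if (c == 0 || c == '/' || (c >= 0xD800 && c <= 0xDFFF))
			return -1;
		if (o + need >= dstlen)		/* keep room for the NUL */
			return -1;

		if (need == 1) {
			dst[o++] = (char)c;
		} else if (need == 2) {
			dst[o++] = (char)(0xC0 | (c >> 6));
			dst[o++] = (char)(0x80 | (c & 0x3F));
		} else {
			dst[o++] = (char)(0xE0 | (c >> 12));
			dst[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
			dst[o++] = (char)(0x80 | (c & 0x3F));
		}
	}
	dst[o] = '\0';
	return (int)o;
}
```

The result contains no embedded NUL and no '/' byte, because every byte of
a multi-byte UTF-8 sequence has its high bit set. That is what lets the
converted name pass through the existing byte-oriented interfaces.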

> The question is, how do you determine the encoding?  And where do
> you want that knowledge to be?  My inclination is that it would be
> much more flexible and expandable to have it in userland.  It would
> be more uniform to have it in the kernel, but it's not clear to me
> that the problem is well-enough defined yet for the kernel to do
> the right thing.

For the case of cd9660, maybe the thing to do is add a mount option to
control the 16<->8 conversion. Also, I think it'd be easy to whip up a
layered file system that converts one encoding to another. Such a layer
could also be used to experiment with different encoding manipulations,
like looking at an environment variable to determine what to do.
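
A mount option of that kind could boil down to picking a per-character
encoder. The sketch below is purely illustrative: the option strings
"utf8" and "latin1", the enum name_charset, and the functions
parse_charset and encode_unit are all assumptions, not existing
mount_cd9660 options or kernel interfaces. Characters that the chosen
8-bit form cannot represent are replaced with '?', which is one possible
policy among several.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical set of 8-bit forms a mount option might select. */
enum name_charset { CS_UNKNOWN = -1, CS_UTF8, CS_LATIN1 };

/* Map a (hypothetical) mount option value to a charset. */
static enum name_charset
parse_charset(const char *opt)
{
	if (strcmp(opt, "utf8") == 0)
		return CS_UTF8;
	if (strcmp(opt, "latin1") == 0)
		return CS_LATIN1;
	return CS_UNKNOWN;
}

/*
 * Encode one 16-bit code unit into out (at least 3 bytes).
 * Returns the number of bytes written, or 0 for an unknown charset.
 */
static size_t
encode_unit(enum name_charset cs, uint16_t c, unsigned char *out)
{
	switch (cs) {
	case CS_LATIN1:
		/* Latin-1 covers U+0000..U+00FF only; substitute the rest. */
		out[0] = c < 0x100 ? (unsigned char)c : '?';
		return 1;
	case CS_UTF8:
		if (c < 0x80) {
			out[0] = (unsigned char)c;
			return 1;
		}
		if (c < 0x800) {
			out[0] = (unsigned char)(0xC0 | (c >> 6));
			out[1] = (unsigned char)(0x80 | (c & 0x3F));
			return 2;
		}
		out[0] = (unsigned char)(0xE0 | (c >> 12));
		out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (c & 0x3F));
		return 3;
	default:
		return 0;
	}
}
```

The same switch could just as well live in a layered file system instead
of in cd9660 itself, which would make it easier to experiment with
different policies.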

> Coming from scratch, I'd think that it would be best to use some
> sort of i18n library to encode/decode paths to a known format (UTF-8
> or whatever) from the filesystem's encoding.  This would have to
> be initialized with the codeset information from the locale and/or
> from the media.

There is a lot of existing prior art. MacOS has been dealing with this for
almost 20 years. I'm not saying we should do what they did, but they sure
have a lot of mistakes we can avoid. :-)

I think the biggest issue we will run into is how we deal with existing
installs, especially ones that mix encodings.

Take care,

Bill
