Subject: Re: fs transcoding, was Re: Unicode support in iso9660.
To: None <tech-kern@NetBSD.org>
From: Ian Lance Taylor <ian@wasabisystems.com>
List: tech-kern
Date: 11/23/2004 12:49:01
der Mouse <mouse@Rodents.Montreal.QC.CA> writes:

> > I just checked SUSv3.  It says nothing particularly useful.
> 
> > "For a filename to be portable across implementations conforming to
> >  IEEE Std 1003.1-2001, it shall consist only of the portable filename
> >  character set as defined in Portable Filename Character Set.
> 
> That's very interesting information.  But it makes me ask, does the SUS
> specify any particular encoding scheme for converting those characters
> into addressing units, or is the encoding left unspecified?

Not really; the encoding is left largely unspecified.  There is this
description of the Portable Character Set (the Portable Filename
Character Set is a subset of this):

"IEEE Std 1003.1-2001 places only the following requirements on the
 encoded values of the characters in the portable character set:

    * If the encoded values associated with each member of the
      portable character set are not invariant across all locales
      supported by the implementation, if an application accesses any
      pair of locales where the character encodings differ, or
      accesses data from an application running in a locale which has
      different encodings from the application's current locale, the
      results are unspecified.

    * The encoded values associated with the digits 0 to 9 shall be
      such that the value of each character after 0 shall be one
      greater than the value of the previous character.

    * A null character, NUL, which has all bits set to zero, shall be
      in the set of characters.

    * The encoded values associated with the members of the portable
      character set are each represented in a single byte. Moreover,
      if the value is stored in an object of C-language type char, it
      is guaranteed to be positive (except the NUL, which is always
      zero)."
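
To make those requirements concrete, here is a minimal sketch in C of
what a program can actually check.  The function names are mine, not
anything from the standard, and the character list is the Portable
Filename Character Set (A-Z, a-z, 0-9, period, underscore, hyphen).
It checks only the properties quoted above; it says nothing about how
any other character is encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Portable Filename Character Set: A-Z a-z 0-9 . _ - */
static const char portable_filename_chars[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789._-";

/* Return 1 if each digit's value is one greater than the previous digit's. */
int digits_are_consecutive(void)
{
    const char *digits = "0123456789";

    for (size_t i = 1; digits[i] != '\0'; i++)
        if (digits[i] != digits[i - 1] + 1)
            return 0;
    return 1;
}

/* Return 1 if every byte of s is strictly positive when stored in a char. */
int all_positive_as_char(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (*s <= 0)
            return 0;
    return 1;
}

/*
 * Return 1 if name is non-empty and uses only characters from the
 * Portable Filename Character Set.  (Hypothetical helper; it does not
 * check length limits or any other portability rule.)
 */
int is_portable_filename(const char *name)
{
    assert(name != NULL);
    if (*name == '\0')
        return 0;
    return name[strspn(name, portable_filename_chars)] == '\0';
}
```

A name such as "README.txt" passes is_portable_filename(), while any
name containing a byte outside that list (a multi-byte character, a
space, a slash) does not.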

Also, I found this in the rationale:

"At the present time, the primary responsibility for truncating
 filenames containing multi-byte characters must reside with the
 application. Some industry groups involved in internationalization
 believe that in the future the responsibility must reside with the
 kernel. For the moment, a clearer understanding of the implications
 of making the kernel responsible for truncation of multi-byte
 filenames is needed.

 Character-level truncation was not adopted because there is no
 support in POSIX.1 that advises how the kernel distinguishes between
 single and multi-byte characters. Until that time, it must be
 incumbent upon application writers to determine where multi-byte
 characters must be truncated."
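
As an illustration of what "incumbent upon application writers" means
in practice, here is a rough sketch of application-side truncation.
It assumes the filename is valid UTF-8, which is my assumption and not
something POSIX.1 specifies; other multi-byte encodings need a
different boundary test.  The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the largest length <= max_bytes at which name can be cut
 * without splitting a UTF-8 sequence.  Assumes name is valid UTF-8.
 */
size_t utf8_truncated_length(const char *name, size_t max_bytes)
{
    assert(name != NULL);

    size_t len = strlen(name);
    if (len <= max_bytes)
        return len;

    /*
     * name[cut] is the first byte that would be dropped.  If it is a
     * continuation byte (10xxxxxx), the cut falls inside a character,
     * so back up until it lands on the start of one.
     */
    size_t cut = max_bytes;
    while (cut > 0 && ((unsigned char)name[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}

/*
 * Copy name into dst (dst_size bytes, including the terminating NUL),
 * truncating on a character boundary.  Returns the number of bytes
 * copied, not counting the NUL.
 */
size_t utf8_truncate_copy(char *dst, size_t dst_size, const char *name)
{
    assert(dst != NULL && dst_size > 0);

    size_t n = utf8_truncated_length(name, dst_size - 1);
    memcpy(dst, name, n);
    dst[n] = '\0';
    return n;
}
```

With a 5-byte buffer, the 5-byte name "caf" followed by the two-byte
sequence C3 A9 is cut to "caf" rather than leaving a dangling C3 byte.
The kernel cannot do this itself without knowing the encoding, which
is the point the rationale is making.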

Ian