Subject: Re: NetBSD NFS server with OS X NFS clients
To: Bill Stouder-Studenmund <wrstuden@netbsd.org>
From: Chuck Swiger <cswiger@mac.com>
List: netbsd-users
Date: 09/13/2007 14:38:41
On Sep 13, 2007, at 1:28 PM, Bill Stouder-Studenmund wrote:
>>> o TextEdit.app fails to open the test file whether I try File->Open
>>>  within TextEdit or if I specify TextEdit.app from the Finder
>>>
>>> o Safari was the same as TextEdit, and won't open the file
>>>
>>> o Firefox exited with 'The application Firefox quit unexpectedly' (I
>>>  tried twice, and it fell over twice) when I try to open the file
>>
>> The Mac HFS filesystem represents all filenames in Unicode (UTF16)
>> [1], which is not the case for Berkeley FFS aka the UFS of NetBSD.  I
>> suspect that the tools mentioned above notice that the filename
>> contains non-ASCII characters, and convert the path into the UTF16
>> representation they expect to find and fail because that pathname
>> doesn't really exist over NFS.
>
> Yes, but NFS isn't HFS.

I don't recall anyone saying it was, but true enough.

> No one's going to send UTF-16 over the wire for NFS.  It would
> never work.
> UTF-8 however would.

Sort of.  The XDR definition from RFC-1832 actually limits strings in
the NFS protocol to 7-bit US-ASCII, but it is common for
implementations to support filenames encoded in 8-bit ISO-Latin-1.
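
To make the distinction concrete, here is a minimal Python sketch that
sorts raw filename bytes into 7-bit ASCII, valid UTF-8, or some other
8-bit encoding such as ISO-Latin-1.  The helper name classify_name and
the sample filename are made up for illustration; they are not taken
from the original report.

```python
def classify_name(raw: bytes) -> str:
    """Classify raw filename bytes as 'ascii', 'utf-8', or 'other-8bit'."""
    # Pure 7-bit data is what a strict reading of XDR strings allows.
    if all(b < 0x80 for b in raw):
        return "ascii"
    # Otherwise, see whether the bytes form a well-formed UTF-8 sequence.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 8-bit data that is not UTF-8, e.g. ISO-Latin-1.
        return "other-8bit"


name = "caf\u00e9.txt"  # hypothetical filename containing e-acute

print(classify_name(b"plain.txt"))            # ascii
print(classify_name(name.encode("latin-1")))  # other-8bit
print(classify_name(name.encode("utf-8")))    # utf-8
```

A Latin-1 name with an accented character is not valid UTF-8, so
anything that insists on UTF-8 pathnames could plausibly reject it.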

Rumor has it that NFSv4 might have full support for Unicode via
either UTF-16 or UTF-8.

> A more likely failure is that one of the frameworks is detecting
> that the path is not valid UTF-8 and rejecting things based on that.

That's certainly possible.  From
http://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/Articles/FileEncodings.html:

"File Systems and Unicode Support

Different file systems in Mac OS X have different levels of Unicode
support:

Mac OS Extended (HFS+) uses canonically decomposed Unicode 3.2 in
UTF-16 format, which consists of a sequence of 16-bit codes.
(Characters in the ranges U2000-U2FFF, UF900-UFA6A, and U2F800-U2FA1D
are not decomposed.)  The UFS file system allows any character from
Unicode 2.1 or later, but uses the UTF-8 format, which consists
mostly of 8-bit ASCII codes but which may also include multibyte
codes. (Characters in the ranges U2000-U2FFF, UF900-UFA6A, and
U2F800-U2FA1D are not decomposed.)  Mac OS Standard (HFS) does not
support Unicode and instead uses legacy Mac encodings, such as
MacRoman.

Locking the canonical decomposition to a particular version of
Unicode does not exclude usage of characters defined in a newer
version of Unicode. Because the Unicode consortium has guaranteed not
to add any more precomposed characters, applications can expect to
store characters defined in future versions of Unicode without
compatibility issues.

All BSD system functions expect their string parameters to be in
UTF-8 encoding and nothing else. Code that calls BSD system routines
should ensure that the contents of all const *char parameters are in
canonical UTF-8 encoding.  In a canonical UTF-8 string, all
decomposable characters are decomposed; for example, é (0x00E9) is
represented as e (0x0065) + ´ (0x0301). To put things into a
canonical UTF-8 encoding, use the “file-system representation”
interfaces defined in Cocoa and Carbon (including Core Foundation)."
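
The decomposition described in that passage can be approximated with
Python's standard unicodedata module.  This is only a sketch: it
assumes Unicode NFD is close enough to Apple's "canonical" form for
illustration, whereas the quoted text says Apple leaves certain ranges
undecomposed.  The helper name to_decomposed_utf8 is hypothetical.

```python
import unicodedata


def to_decomposed_utf8(name: str) -> bytes:
    """Return the UTF-8 bytes of `name` after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", name).encode("utf-8")


composed = "\u00e9"  # e-acute as a single precomposed code point

# Decomposition yields 'e' (U+0065) followed by combining acute (U+0301).
decomposed = unicodedata.normalize("NFD", composed)
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']

# The same visible character has two different byte sequences in UTF-8.
print(composed.encode("utf-8"))      # b'\xc3\xa9'
print(to_decomposed_utf8(composed))  # b'e\xcc\x81'
```

A server that stores one byte sequence and a client that looks up the
other would disagree about whether the file exists, even though both
names are valid UTF-8.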

It'd be interesting for one of the people reporting the problem to
run tcpdump against their NFS traffic and see how these filenames are
actually being encoded in the requests.
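
When reading such a capture, it may help to know which byte sequences
to look for.  The following Python sketch prints the hex form of a
name under the encodings discussed in this thread, so it can be
compared by eye against a hex dump of the NFS request.  The function
name candidate_encodings and the sample character are assumptions for
illustration, and bytes.hex() with a separator needs Python 3.8 or
later.

```python
import unicodedata


def candidate_encodings(name: str) -> dict:
    """Map an encoding label to the space-separated hex bytes of `name`."""
    nfc = unicodedata.normalize("NFC", name)  # precomposed form
    nfd = unicodedata.normalize("NFD", name)  # decomposed form
    return {
        # Characters outside Latin-1 are replaced with '?' rather than failing.
        "latin-1": nfc.encode("latin-1", errors="replace").hex(" "),
        "utf-8 (composed)": nfc.encode("utf-8").hex(" "),
        "utf-8 (decomposed)": nfd.encode("utf-8").hex(" "),
        "utf-16-be": nfc.encode("utf-16-be").hex(" "),
    }


for label, hexbytes in candidate_encodings("\u00e9").items():
    print(f"{label:20} {hexbytes}")
```

For a lone e-acute this gives e9, c3 a9, 65 cc 81, and 00 e9
respectively, which are easy to tell apart in a dump.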

Regards,
-- 
-Chuck