tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Tue, Apr 02, 2013 at 07:45:42PM -0400, James K. Lowden wrote:
> On Tue, 2 Apr 2013 12:21:03 -0400
> Thor Lancelot Simon <tls%panix.com@localhost> wrote:
> 
> > On Tue, Apr 02, 2013 at 06:08:01PM +0200, tlaronde%polynum.com@localhost 
> > wrote:
> > > 
> > > That UTF-8 is the answer, since this allows to use C "char" (at
> > > least an octet, signed or unsigned) programs.
> > 
> > Except it can't, really, quite be UTF-8 -- it has to be "Modified
> > UTF-8", because C strings can't contain 0.
> 
> What are you referring to, exactly?  UTF-8 and ASCII both represent NUL
> with 0.  The filename rule is that only '/' and NUL are prohibited.  I

UTF-8 text can legitimately contain U+0000, which encodes as a single byte
with value 0, breaking C string handling.  There's a common workaround, but,
technically, once you apply it, you are no longer compliant with UTF-8.
You're emitting "Modified UTF-8", like Java does.  It's the great thing about
standards: pick one...
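
To make the workaround concrete, here is a rough sketch of the NUL part of
it: every 0 byte (which in valid UTF-8 can only be the encoding of U+0000)
is rewritten as the overlong two-byte form 0xC0 0x80, so the result can be
0-terminated.  The function name and the buffer contract are made up for
illustration; Java's Modified UTF-8 also treats supplementary characters
differently, which this sketch does not attempt.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not from any library): copy `len` bytes of UTF-8
 * from `src` to `dst`, replacing each 0x00 byte with 0xC0 0x80.
 * Assumption: `dst` has room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the terminating 0.
 */
static size_t
utf8_escape_nul(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            /* Overlong encoding of U+0000: invalid in strict UTF-8. */
            dst[out++] = 0xC0;
            dst[out++] = 0x80;
        } else {
            /* Every other byte is copied through unchanged. */
            dst[out++] = src[i];
        }
    }
    dst[out] = 0;    /* now safe to hand to str*() functions */
    return out;
}
```

Feeding it the three bytes 'a', 0, 'b' gives the four bytes 'a', 0xC0,
0x80, 'b' plus a terminator, and strlen() sees all four.  A strict UTF-8
decoder is required to reject the 0xC0 0x80 pair, which is exactly why the
output is no longer compliant.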


