tech-userlevel archive


Re: [PATCH] Support for mbsnrtowcs and wcsnrtomb



>>>>> On Fri, 26 Apr 2013 11:52:19 +0200,
      Antoine LECA <antoine.leca.1%gmail.com@localhost> said:

>> It seems the description in the OpenGroup specification has a problem
>> about this point.

> You really should file an austin-bug-report then.

That would be right if my understanding were correct,
but it seems I misunderstood the specification of mbsrtowcs().

>> I guess this is because mbsnrtowcs() was originally a glibc
>> extension, and OpenGroup just copied the glibc specification.
>> Note that glibc doesn't support stateful encodings, but ours does.

> Well, I am not that sure "Glibc does not support stateful encoding" when
> I read http://austingroupbugs.net/view.php?id=616.
> It's a sequel of http://austingroupbugs.net/view.php?id=601
> 
> Basically, this (already accepted, and already implemented) added
> interpretation requires that if the input buffer ends with an
> unterminated character, then the implementation should consume the
> available part, and record within the mbstate_t object all the needed
> information to be able to restart directly at the end of the buffer:
> this very much seems stateful encoding to me (although not as complex as
> ISO-2022-*.)

Some sort of state, yes.
But stateful encodings differ fundamentally from stateless encodings
in some respects.
For example, the behavior of mbsrtowcs() and mbsnrtowcs() in the
specification (i.e. resetting the state) makes those functions
unsuitable for parsing a file in a stateful encoding.
We should use mbrtowc() to parse such a file instead of those functions.
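
To illustrate, here is a minimal sketch of parsing such a file chunk by
chunk with mbrtowc().  The helper name convert_chunk() and its parameters
are hypothetical, not part of any library.  The point is only that the
caller keeps one mbstate_t alive across calls, so the shift state and any
partial character survive a chunk boundary.  It also assumes a null
character occupies one byte, which may not hold in every encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert one chunk of a multibyte stream.
 * *ps must persist across calls (zero-initialized before the first one).
 * Stores the number of input bytes consumed in *used.
 * Returns the number of wide characters stored in out[],
 * or (size_t)-1 if mbrtowc() reports an invalid sequence (EILSEQ).
 */
size_t
convert_chunk(mbstate_t *ps, const char *buf, size_t len,
    wchar_t *out, size_t outcap, size_t *used)
{
	size_t nout = 0;
	size_t pos = 0;

	assert(ps != NULL && buf != NULL && out != NULL && used != NULL);
	while (pos < len && nout < outcap) {
		size_t r = mbrtowc(&out[nout], buf + pos, len - pos, ps);

		if (r == (size_t)-1) {
			*used = pos;
			return (size_t)-1;	/* errno is EILSEQ */
		}
		if (r == (size_t)-2) {
			/* Incomplete character: the rest is absorbed into *ps. */
			pos = len;
			break;
		}
		if (r == 0)
			r = 1;	/* null character; assumed to take one byte */
		pos += r;
		nout++;
	}
	*used = pos;
	return nout;
}
```

The caller would read the next chunk from the file and call
convert_chunk() again with the same mbstate_t, never resetting it in
between.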

> I notice that this Austin-group interpretation botched the C99/C11
> description for the value to be returned in src when an EILSEQ occurs:
> in such a case, under the C99/C11 Standard you can reset the mbstate_t
> --since it's now undefined-- and then restart some process from the
> updated *src, which holds the pointer "past the last converted
> multibyte", ie a pointer to the start of still unconverted part.
> Under the Austin-group reading *src is to be updated to "last byte
> processed" which can be anything since the process detected an error;
> the potential for restarting is now close to 0. Worse, you cannot know
> exactly where you can insert a \0 to transform the original input string
> into a valid one.
> I understand implementations followed the same ideas as Austin Group
> commentators and probably did not fully observe the requirements of the
> C99/C11 Standard (thus botching the value returned in *src.)
> I also understand that the overwhelming majority of programs using those
> functions just abort when EILSEQ is detected.

It's probably better to use mbrtowc() to handle EILSEQ precisely,
because you can feed it one byte at a time that way.
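
As a rough sketch of that idea (the function name find_unconverted_start()
and its parameters are made up for illustration), the loop below feeds
mbrtowc() a single byte per call and remembers where the current character
began.  When EILSEQ is reported, it knows both the offending byte and the
start of the still unconverted part.  A trailing incomplete character is
not treated as an error here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: scan s[0..len) one byte at a time.
 * Returns the offset just past the last successfully converted
 * character when an invalid sequence is found, and stores the offset
 * of the byte that triggered the error in *bad (if bad is not NULL).
 * Returns len if no invalid sequence is detected.
 */
size_t
find_unconverted_start(const char *s, size_t len, size_t *bad)
{
	mbstate_t st;
	size_t start = 0;	/* where the current character began */
	size_t i;

	assert(s != NULL);
	memset(&st, 0, sizeof(st));	/* initial conversion state */
	for (i = 0; i < len; i++) {
		wchar_t wc;
		size_t r = mbrtowc(&wc, s + i, 1, &st);

		if (r == (size_t)-1) {
			/* errno is EILSEQ; st is now undefined */
			if (bad != NULL)
				*bad = i;
			return start;
		}
		if (r != (size_t)-2)
			start = i + 1;	/* a character was completed */
		/* on (size_t)-2 the byte was absorbed into st; keep going */
	}
	return len;
}
```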
-- 
soda

