tech-userlevel archive


Re: [PATCH] Support for mbsnrtowcs and wcsnrtomb



>>>>> On Fri, 26 Apr 2013 12:42:09 +0200,
      Antoine LECA <antoine.leca.1%gmail.com@localhost> said:

>> For example, the behavior of mbsrtowcs() and mbsnrtowcs() in the
>> specification (i.e. resetting the state) makes those functions
>> useless for parsing a file with a stateful encoding.

> I can agree with you about mbsrtowcs() here: if you pass a 0-terminated
> string to that function, it is required to end in the initial state,
> that is, to reset the state to initial.
> Note this is transposed into the iso-2022-jp encoding for emails: you
> are required to put (otherwise unnecessary) "ESC $ x Y" at the start of
> each line, even if the previous line ended in kanji.

Yeah, I know.  ISO-2022-JP is intentionally designed that way
to make it easier to parse.

> But I do not see why this would apply to mbsnrtowcs().
> Moreover, I believe the real point of that function is exactly here: to
> be able to call it with just the content of a line, _without_ the final
> \0; thus the call will translate all the characters in the line, but
> will keep the state information (since _only_ in case of terminated line
> it is reset to the initial state); thus it would allow to be called
> later with another line, without needing the introducing sequence.

Yeah.
Also, maybe it's better to remove the "reset the state at \0" part
from the specification of mbsnrtowcs().  mbsnrtowcs() doesn't have to
be completely compatible with mbsrtowcs(), and removing the clause
makes it more useful for stateful encodings.  More importantly,
the removal doesn't break any existing code for stateless encodings,
because with a stateless encoding the state is always the initial
state at that place anyway.
Perhaps I should propose that to the Austin Group.
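
To make the usage discussed above concrete, here is a rough sketch of
converting two unterminated lines with one shared mbstate_t.  The helper
names convert_chunk() and convert_two_lines() are made up for this
illustration; only mbsnrtowcs() and mbsinit() are real interfaces.  The
sketch assumes that the caller passes the line length without the
trailing \0, so the "reset at \0" clause never comes into play and the
state left by the first line is what the second line starts with.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert one unterminated chunk of 'len' bytes.
 * The conversion state in *st is carried over from the previous call. */
static size_t convert_chunk(wchar_t *dst, size_t dstlen,
                            const char *chunk, size_t len, mbstate_t *st)
{
    const char *src = chunk;
    /* 'len' limits the bytes read, 'dstlen' the wide characters written. */
    return mbsnrtowcs(dst, &src, len, dstlen, st);
}

/* Hypothetical helper: convert two lines back to back with one state.
 * Returns the total number of wide characters, or (size_t)-1 on error.
 * *state_initial is set to 1 if the final state is the initial state. */
static size_t convert_two_lines(wchar_t *dst, size_t dstlen,
                                const char *l1, const char *l2,
                                int *state_initial)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);          /* start in the initial state */

    /* strlen() excludes the terminator, so no \0 is ever converted. */
    size_t n1 = convert_chunk(dst, dstlen, l1, strlen(l1), &st);
    if (n1 == (size_t)-1)
        return n1;
    assert(n1 <= dstlen);

    /* The second line continues from whatever state the first one left. */
    size_t n2 = convert_chunk(dst + n1, dstlen - n1, l2, strlen(l2), &st);
    if (n2 == (size_t)-1)
        return n2;

    *state_initial = mbsinit(&st) != 0;
    return n1 + n2;
}
```

With a stateless encoding (for example the default "C" locale and plain
ASCII input), mbsinit() reports the initial state after every chunk, which
is the reason the proposed change would be invisible to such code.  Whether
a given system actually provides a locale with a stateful encoding to try
the other case is system-dependent.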
-- 
soda

