NetBSD-Bugs archive


PR/60369 CVS commit: src



The following reply was made to PR standards/60369; it has been noted by GNATS.

From: "Taylor R Campbell" <riastradh%netbsd.org@localhost>
To: gnats-bugs%gnats.NetBSD.org@localhost
Cc: 
Subject: PR/60369 CVS commit: src
Date: Sat, 4 Jul 2026 13:21:05 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sat Jul  4 13:21:05 UTC 2026
 
 Modified Files:
 	src/lib/libc/citrus/modules: citrus_utf8.c
 	src/lib/libc/locale: c8rtomb.c
 	src/tests/lib/libc/locale: t_c8rtomb.c
 
 Log Message:
 libc: Fix two bugs in UTF-8 decoding and add exhaustive tests.
 
 1. Despite the recent slew of changes, mbrtowc(3) in UTF-8 locales
    would still fail to report EILSEQ at the earliest byte where it
    can: it would silently consume the first two bytes encoding a
    surrogate code point and reject the sequence only after consuming
    the third byte.
 
 2. The byte classification table for the compressed DFA used by
    c8rtomb(3) was wrong for the bytes c0-c1 and f5-ff.  Somehow the
    program I used, over a decade ago, to generate the compressed DFA
    and classification table got those, and only those, wrong, and I
    can't find that program now, so I'll just have to correct this by
    hand.  Class 10 was missing from the table (but class 11 was not)
    and is obviously the right class for the always-invalid c0-c1 and
    f5-ff because all states transition to UTF8_REJECT (=96) on class
    10, and the same is not true for any other class.
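 
 As a rough illustration of both points (not the libc code), the sketch
 below lists which second bytes may follow each UTF-8 lead byte, based
 on the well-formed byte sequence ranges in the Unicode Standard.  The
 helper name second_byte_range is hypothetical.  It shows why a
 surrogate encoding can be rejected at its second byte, and why c0-c1
 and f5-ff can never start a sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: given a UTF-8 lead byte,
 * store the inclusive range of acceptable second bytes in *lo and *hi.
 * Returns false if the byte cannot start a multibyte sequence at all
 * (ASCII, a continuation byte, or the always-invalid c0-c1 and f5-ff).
 */
static bool
second_byte_range(unsigned char lead, unsigned char *lo, unsigned char *hi)
{
	assert(lo != NULL && hi != NULL);

	*lo = 0x80;
	*hi = 0xbf;
	if (lead < 0xc2 || lead > 0xf4)
		return false;

	switch (lead) {
	case 0xe0: *lo = 0xa0; break;	/* excludes overlong 3-byte forms */
	case 0xed: *hi = 0x9f; break;	/* excludes surrogates U+D800-U+DFFF */
	case 0xf0: *lo = 0x90; break;	/* excludes overlong 4-byte forms */
	case 0xf4: *hi = 0x8f; break;	/* excludes values above U+10FFFF */
	}
	return true;
}
```

 For lead byte ed the range is 80-9f, so a decoder that tracks this
 range can reject `ed a0' at the second byte without waiting for a
 third.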
 
 Now mbrtowc(3) in LC_CTYPE=C.UTF-8 is checked against c8rtomb(3)
 systematically to verify they agree on all possible inputs one byte
 at a time.  There are 4501261 distinct such inputs, including the
 valid encodings of all Unicode scalar values, the invalid encodings
 of Unicode surrogate code points, and other invalid encodings.  We
 stop at the first invalid byte, so it is not necessary to examine all
 ~four billion 4-byte strings.  I considered randomly subsampling to
 make this test take less time, but decided that would be too
 confusing.  For debugging purposes, you can run this new test with
 `atf-run -v c8rtomb_all_failures=yes' to show all failures rather
 than just the first one; this produces megabytes of output with some
 of the bugs we've had, so it's off by default.
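 
 The input count above can be reproduced with a small standalone
 sketch.  This is not the test in t_c8rtomb.c: it assumes a
 hypothetical byte-at-a-time validator (u8_step) in place of
 mbrtowc(3) and c8rtomb(3), feeds it every possible next byte, and
 counts each input that ends either in a complete character or at its
 first invalid byte.

```c
#include <assert.h>

enum step { STEP_MORE, STEP_DONE, STEP_BAD };

/* Hypothetical decoder state: bytes still expected, and the range
 * allowed for the next byte.  Start with { 0, 0x80, 0xbf }. */
struct u8state {
	unsigned remaining;
	unsigned char lo, hi;
};

/* Feed one byte; report whether more input is needed, a character is
 * complete, or the byte is invalid in this position. */
static enum step
u8_step(struct u8state *s, unsigned char b)
{
	if (s->remaining == 0) {
		s->lo = 0x80;
		s->hi = 0xbf;
		if (b < 0x80)
			return STEP_DONE;	/* ASCII */
		if (b < 0xc2 || b > 0xf4)
			return STEP_BAD;	/* 80-bf, c0-c1, f5-ff */
		s->remaining = b < 0xe0 ? 1 : b < 0xf0 ? 2 : 3;
		switch (b) {
		case 0xe0: s->lo = 0xa0; break;	/* no overlong forms */
		case 0xed: s->hi = 0x9f; break;	/* no surrogates */
		case 0xf0: s->lo = 0x90; break;	/* no overlong forms */
		case 0xf4: s->hi = 0x8f; break;	/* nothing above U+10FFFF */
		}
		return STEP_MORE;
	}
	if (b < s->lo || b > s->hi) {
		s->remaining = 0;
		return STEP_BAD;
	}
	s->lo = 0x80;
	s->hi = 0xbf;
	return --s->remaining == 0 ? STEP_DONE : STEP_MORE;
}

/* Count the inputs reachable from state s, stopping each one at a
 * complete character or at its first invalid byte. */
static unsigned long
count_inputs(struct u8state s)
{
	unsigned long n = 0;

	for (unsigned b = 0; b < 256; b++) {
		struct u8state t = s;	/* copy so siblings share a prefix */

		if (u8_step(&t, (unsigned char)b) == STEP_MORE)
			n += count_inputs(t);
		else
			n++;
	}
	return n;
}
```

 Starting from the initial state, count_inputs yields 4501261, which
 matches the figure quoted above.  A comparison test along these lines
 would run two decoders in lockstep at each step and check that their
 results agree, rather than only counting.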
 
 PR standards/60369: mbrtowc, mbrlen have wrong return value for some
 invalid byte sequences
 
 
 To generate a diff of this commit:
 cvs rdiff -u -r1.21 -r1.22 src/lib/libc/citrus/modules/citrus_utf8.c
 cvs rdiff -u -r1.9 -r1.10 src/lib/libc/locale/c8rtomb.c
 cvs rdiff -u -r1.7 -r1.8 src/tests/lib/libc/locale/t_c8rtomb.c
 
 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
 


