NetBSD-Bugs archive
PR/60369 CVS commit: src
The following reply was made to PR standards/60369; it has been noted by GNATS.
From: "Taylor R Campbell" <riastradh%netbsd.org@localhost>
To: gnats-bugs%gnats.NetBSD.org@localhost
Cc:
Subject: PR/60369 CVS commit: src
Date: Sat, 4 Jul 2026 13:21:05 +0000
Module Name: src
Committed By: riastradh
Date: Sat Jul 4 13:21:05 UTC 2026
Modified Files:
src/lib/libc/citrus/modules: citrus_utf8.c
src/lib/libc/locale: c8rtomb.c
src/tests/lib/libc/locale: t_c8rtomb.c
Log Message:
libc: Fix two bugs in UTF-8 decoding and add exhaustive tests.
1. Despite the recent slew of changes, mbrtowc(3) in UTF-8 locales
would still fail to report EILSEQ at the first byte where it can,
because it would silently consume the first two bytes encoding a
surrogate code point and only reject the sequence after consuming
the third byte.
2. The byte classification table for the compressed DFA used by
c8rtomb(3) was wrong for the bytes c0-c1 and f5-ff. Somehow the
program I used, over a decade ago, to generate the compressed DFA
and classification table got those, and only those, wrong, and I
can't find that program now, so I'll just have to correct this by
hand. Class 10 was missing from the table (but class 11 was not)
and is obviously the right class for the always-invalid c0-c1 and
f5-ff because all states transition to UTF8_REJECT (=96) on class
10, and the same is not true for any other class.
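The two points above can be illustrated with a small sketch. This is
not the code from citrus_utf8.c or c8rtomb.c; the function names, the
row-major table layout, and the toy 3-state, 4-class table are all
hypothetical. The first function shows why a surrogate is already
detectable after two bytes: the third byte of a 3-byte sequence only
supplies the low six bits of the code point. The second shows the
kind of check that singles out a class on which every state rejects.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if b0 b1 begins a 3-byte sequence that can only decode to a
 * surrogate (U+D800..U+DFFF), whatever the third byte turns out to be.
 */
static bool
utf8_3byte_prefix_is_surrogate(uint8_t b0, uint8_t b1)
{
	uint32_t hi;

	if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80)
		return false;
	/* The top 10 of the 16 payload bits are known after two bytes. */
	hi = ((uint32_t)(b0 & 0x0F) << 12) | ((uint32_t)(b1 & 0x3F) << 6);
	return (hi & 0xF800) == 0xD800;
}

/*
 * Return the unique class on which every state transitions to
 * `reject', or -1 if there is no such class or more than one.
 * trans is a row-major nstates x nclasses table (assumed layout).
 */
static int
find_all_reject_class(const unsigned char *trans, size_t nstates,
    size_t nclasses, unsigned char reject)
{
	int found = -1;

	for (size_t c = 0; c < nclasses; c++) {
		size_t s;

		for (s = 0; s < nstates; s++) {
			if (trans[s * nclasses + c] != reject)
				break;
		}
		if (s < nstates)
			continue;	/* some state survives class c */
		if (found != -1)
			return -1;	/* not unique */
		found = (int)c;
	}
	return found;
}

/* Toy table for illustration only; state 2 is the reject state. */
static const unsigned char toy_trans[3 * 4] = {
	/* class: 0  1  2  3 */
	/* s0 */  0, 1, 2, 2,
	/* s1 */  2, 2, 0, 2,
	/* s2 */  2, 2, 2, 2,
};
```

In the toy table only class 3 sends every state to the reject state,
which is the same argument the log makes for class 10 in the real
table.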
Now mbrtowc(3) in LC_CTYPE=C.UTF-8 is checked against c8rtomb(3)
systematically to verify they agree on all possible inputs one byte
at a time. There are 4501261 distinct such inputs, including the
valid encodings of all Unicode scalar values, the invalid encodings
of Unicode surrogate code points, and other invalid encodings. We
stop at the first invalid byte, so it is not necessary to examine all
~four billion 4-byte strings. I considered randomly subsampling to
make this test take less time, but decided that would be too
confusing. For debugging purposes, you can run this new test with
`atf-run -v c8rtomb_all_failures=yes' to show all failures rather
than just the first one; this produces megabytes of output with some
of the bugs we've had, so it's off by default.
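The enumeration described above can be sketched as follows. This is
not the code in t_c8rtomb.c; the names are hypothetical and the
validity rules are taken from the Unicode well-formed UTF-8 table
rather than from either implementation under test. It assumes an
"input" means a byte string that ends either at a complete valid
encoding or at its first invalid byte. Under that assumption the
count comes out to 4501261, the figure quoted in the log.

```c
#include <stddef.h>

enum step { STEP_DONE, STEP_MORE, STEP_BAD };

/* Classify s[0..n-1], assuming s[0..n-2] was classified STEP_MORE. */
static enum step
classify(const unsigned char *s, size_t n)
{
	unsigned char lo = 0x80, hi = 0xBF;
	size_t len;

	if (s[0] < 0x80)
		len = 1;
	else if (s[0] >= 0xC2 && s[0] <= 0xDF)
		len = 2;
	else if (s[0] >= 0xE0 && s[0] <= 0xEF)
		len = 3;
	else if (s[0] >= 0xF0 && s[0] <= 0xF4)
		len = 4;
	else
		return STEP_BAD;	/* 80-c1 and f5-ff never lead */

	if (n == 2) {
		/* The second byte has a narrower range after some leads. */
		switch (s[0]) {
		case 0xE0: lo = 0xA0; break;	/* overlong 3-byte */
		case 0xED: hi = 0x9F; break;	/* surrogates */
		case 0xF0: lo = 0x90; break;	/* overlong 4-byte */
		case 0xF4: hi = 0x8F; break;	/* above U+10FFFF */
		}
	}
	if (n >= 2 && (s[n - 1] < lo || s[n - 1] > hi))
		return STEP_BAD;
	return n == len ? STEP_DONE : STEP_MORE;
}

/* Count terminal inputs extending the valid incomplete prefix s[0..n-1]. */
static unsigned long
count_from(unsigned char *s, size_t n)
{
	unsigned long total = 0;

	for (unsigned b = 0; b < 256; b++) {
		s[n] = (unsigned char)b;
		if (classify(s, n + 1) == STEP_MORE)
			total += count_from(s, n + 1);
		else
			total++;	/* complete, or first invalid byte */
	}
	return total;
}

static unsigned long
count_terminal_inputs(void)
{
	unsigned char buf[4];

	return count_from(buf, 0);
}
```

Because the walk only descends below prefixes that are still valid
and incomplete, it visits a few million strings rather than all
~four billion 4-byte strings.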
PR standards/60369: mbrtowc, mbrlen have wrong return value for some
invalid byte sequences
To generate a diff of this commit:
cvs rdiff -u -r1.21 -r1.22 src/lib/libc/citrus/modules/citrus_utf8.c
cvs rdiff -u -r1.9 -r1.10 src/lib/libc/locale/c8rtomb.c
cvs rdiff -u -r1.7 -r1.8 src/tests/lib/libc/locale/t_c8rtomb.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.