NetBSD-Bugs archive
PR/60369 CVS commit: [netbsd-11] src
The following reply was made to PR standards/60369; it has been noted by GNATS.
From: "Martin Husemann" <martin%netbsd.org@localhost>
To: gnats-bugs%gnats.NetBSD.org@localhost
Cc:
Subject: PR/60369 CVS commit: [netbsd-11] src
Date: Sat, 4 Jul 2026 15:44:51 +0000
Module Name: src
Committed By: martin
Date: Sat Jul 4 15:44:51 UTC 2026
Modified Files:
src/lib/libc/citrus/modules [netbsd-11]: citrus_utf8.c
src/lib/libc/locale [netbsd-11]: c8rtomb.c
src/tests/lib/libc/locale [netbsd-11]: t_c8rtomb.c
Log Message:
Additionally pull up following revision(s) (requested by riastradh in ticket #339):
tests/lib/libc/locale/t_c8rtomb.c: revision 1.8
lib/libc/locale/c8rtomb.c: revision 1.10
lib/libc/citrus/modules/citrus_utf8.c: revision 1.20
lib/libc/citrus/modules/citrus_utf8.c: revision 1.21
lib/libc/citrus/modules/citrus_utf8.c: revision 1.22
Be truly pedantic about UTF-8 encodings
If we're not going to accept "legacy" UTF-8 (5 and 6 byte encodings
for code points >= 0x00200000, which the standards don't allow as
they won't fit in UTF-16), then we certainly should never be able to
generate them. Even more, we should be pedantic about rejecting the
various mis-coded forms for which there is no justification, but
which have been known to be used in attempts to violate security.
This, I believe, now enforces all the current restrictions; e.g.,
it will no longer be possible to encode ASCII in 2 bytes (0xc0 '.')
and similar. The shortest legal encoding is all that will be
accepted (and all that will be generated, but that was always true).
It is quite possible that this will break things, probably many
tests, as random garbage will no longer be accepted as valid; things
must be properly encoded.
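The restrictions described above (shortest form only, no surrogates,
nothing above U+10FFFF, no 5 or 6 byte forms) can be illustrated with
a small stand-alone decoder. This is only a sketch under those stated
rules; the name utf8_decode_strict and its interface are hypothetical
and are not taken from citrus_utf8.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the libc code.  Decodes one UTF-8 sequence
 * from s[0..n).  Returns the number of bytes consumed (1-4) and stores
 * the scalar value in *cp, or returns 0 if the input is truncated or
 * is not a strictly valid encoding.
 */
static size_t
utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* Smallest code point that legitimately needs `len' bytes. */
	static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	uint32_t c;
	size_t len, i;

	assert(s != NULL && cp != NULL);
	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xe0) == 0xc0) {
		len = 2;
		c = s[0] & 0x1f;
	} else if ((s[0] & 0xf0) == 0xe0) {
		len = 3;
		c = s[0] & 0x0f;
	} else if ((s[0] & 0xf8) == 0xf0) {
		len = 4;
		c = s[0] & 0x07;
	} else {
		return 0;	/* stray continuation or 5/6-byte lead */
	}
	if (n < len)
		return 0;	/* truncated */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3f);
	}
	if (c < min_cp[len])
		return 0;	/* overlong, e.g. c0 ae for '.' */
	if (c >= 0xd800 && c <= 0xdfff)
		return 0;	/* surrogate code point */
	if (c > 0x10ffff)
		return 0;	/* beyond the Unicode range */
	*cp = c;
	return len;
}
```

With this sketch the two-byte sequence c0 ae is rejected while the
single byte 2e decodes to '.'. Note that it validates only after
assembling the whole value, so it does not report which byte was the
first offending one.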
mbrtowc(): fix a stupid typo in the previous version.
No idea how I managed to miss this previously. This update should
make at least some of the ATF tests (and other stuff) which failed
after the previous change start working again.
libc: Fix two bugs in UTF-8 decoding and add exhaustive tests.
1. Despite the recent slew of changes, mbrtowc(3) in UTF-8 locales
would still fail to return EILSEQ at the first byte where it can,
because it would silently consume the first two bytes encoding a
surrogate code point and only reject it after consuming the third
byte.
2. The byte classification table for the compressed DFA used by
c8rtomb(3) was wrong for the bytes c0-c1 and f5-ff. Somehow the
program I used, over a decade ago, to generate the compressed DFA
and classification table got those, and only those, wrong, and I
can't find that program now, so I'll just have to correct this by
hand. Class 10 was missing from the table (but class 11 was not)
and is obviously the right class for the always-invalid c0-c1 and
f5-ff because all states transition to UTF8_REJECT (=96) on class
10, and the same is not true for any other class.
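To make "the first byte where it can" concrete, here is a sketch of a
byte-at-a-time validator that narrows the allowed range of the next
byte as soon as the lead byte is seen. It is not the compressed DFA
from c8rtomb.c and does not use its class numbers; the names
(struct dec, dec_step, first_bad_index) are hypothetical, and the
ranges are the well-formed UTF-8 byte ranges from the Unicode
Standard.

```c
#include <assert.h>
#include <stddef.h>

enum step { STEP_DONE, STEP_MORE, STEP_BAD };

struct dec {
	unsigned need;		/* continuation bytes still expected */
	unsigned char lo, hi;	/* allowed range for the next byte */
};

/* Feed one byte.  Rejects at the earliest byte that cannot be valid. */
static enum step
dec_step(struct dec *d, unsigned char b)
{
	if (d->need == 0) {
		d->lo = 0x80;
		d->hi = 0xbf;
		if (b < 0x80)
			return STEP_DONE;
		if (b >= 0xc2 && b <= 0xdf) {
			d->need = 1;
		} else if (b >= 0xe0 && b <= 0xef) {
			d->need = 2;
			if (b == 0xe0)
				d->lo = 0xa0;	/* no overlong forms */
			if (b == 0xed)
				d->hi = 0x9f;	/* no surrogates */
		} else if (b >= 0xf0 && b <= 0xf4) {
			d->need = 3;
			if (b == 0xf0)
				d->lo = 0x90;	/* no overlong forms */
			if (b == 0xf4)
				d->hi = 0x8f;	/* nothing above U+10FFFF */
		} else {
			return STEP_BAD;	/* 80-bf, c0-c1, f5-ff */
		}
		return STEP_MORE;
	}
	if (b < d->lo || b > d->hi)
		return STEP_BAD;
	d->lo = 0x80;
	d->hi = 0xbf;
	return --d->need == 0 ? STEP_DONE : STEP_MORE;
}

/* Index of the first rejected byte in s[0..n), or n if none is rejected. */
static size_t
first_bad_index(const unsigned char *s, size_t n)
{
	struct dec d = { 0, 0x80, 0xbf };
	size_t i;

	assert(s != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (dec_step(&d, s[i]) == STEP_BAD)
			return i;
	}
	return n;
}
```

In this sketch the surrogate encoding ed a0 80 is rejected at index 1
(the a0), not at index 2, and c0 or f5 is rejected at index 0, which
is the behaviour the two fixes above are aiming for.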
Now mbrtowc(3) in LC_CTYPE=C.UTF-8 is checked against c8rtomb(3)
systematically to verify they agree on all possible inputs one byte
at a time. There are 4501261 distinct such inputs, including the
valid encodings of all Unicode scalar values, the invalid encodings
of Unicode surrogate code points, and other invalid encodings. We
stop at the first invalid byte, so it is not necessary to examine all
~four billion 4-byte strings. I considered randomly subsampling to
make this test take less time, but decided that would be too
confusing. For debugging purposes, you can run this new test with
`atf-run -v c8rtomb_all_faliures=yes' to show all failures rather
than just the first one; this produces megabytes of output with some
of the bugs we've had, so it's off by default.
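The size of that input space can be reproduced with a short
enumeration. The sketch below is not the code in t_c8rtomb.c; it
assumes that an "input" is either a complete valid encoding or a
valid prefix followed by its first invalid byte, and it uses a
hypothetical strict stepper (struct dec, dec_step, count_inputs are
made-up names) in place of the libc functions.

```c
#include <assert.h>
#include <stddef.h>

enum step { STEP_DONE, STEP_MORE, STEP_BAD };

struct dec {
	unsigned need;		/* continuation bytes still expected */
	unsigned char lo, hi;	/* allowed range for the next byte */
};

/* Strict byte-at-a-time UTF-8 stepper (hypothetical, not the libc DFA). */
static enum step
dec_step(struct dec *d, unsigned char b)
{
	if (d->need == 0) {
		d->lo = 0x80;
		d->hi = 0xbf;
		if (b < 0x80)
			return STEP_DONE;
		if (b >= 0xc2 && b <= 0xdf) {
			d->need = 1;
		} else if (b >= 0xe0 && b <= 0xef) {
			d->need = 2;
			if (b == 0xe0)
				d->lo = 0xa0;
			if (b == 0xed)
				d->hi = 0x9f;
		} else if (b >= 0xf0 && b <= 0xf4) {
			d->need = 3;
			if (b == 0xf0)
				d->lo = 0x90;
			if (b == 0xf4)
				d->hi = 0x8f;
		} else {
			return STEP_BAD;
		}
		return STEP_MORE;
	}
	if (b < d->lo || b > d->hi)
		return STEP_BAD;
	d->lo = 0x80;
	d->hi = 0xbf;
	return --d->need == 0 ? STEP_DONE : STEP_MORE;
}

/*
 * Count every byte string that extends the prefix state `d' and ends
 * either at a complete valid encoding or at the first invalid byte.
 * Valid encodings are additionally tallied in *nvalid.
 */
static unsigned long
count_inputs(struct dec d, unsigned long *nvalid)
{
	unsigned long total = 0;
	unsigned b;

	assert(nvalid != NULL);
	for (b = 0; b < 256; b++) {
		struct dec next = d;	/* copy, so siblings are independent */

		switch (dec_step(&next, (unsigned char)b)) {
		case STEP_DONE:
			(*nvalid)++;
			total++;
			break;
		case STEP_BAD:
			total++;	/* stop here: first invalid byte */
			break;
		case STEP_MORE:
			total += count_inputs(next, nvalid);
			break;
		}
	}
	return total;
}
```

Starting from the empty prefix, { 0, 0x80, 0xbf }, count_inputs
returns 4501261, of which 1112064 are the valid encodings of Unicode
scalar values (0x110000 code points minus 0x800 surrogates). The
recursion depth is at most four, so the whole walk stays far below
the ~four billion 4-byte strings.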
PR standards/60369: mbrtowc, mbrlen have wrong return value for some
invalid byte sequences
To generate a diff of this commit:
cvs rdiff -u -r1.18.42.1 -r1.18.42.2 \
src/lib/libc/citrus/modules/citrus_utf8.c
cvs rdiff -u -r1.9 -r1.9.4.1 src/lib/libc/locale/c8rtomb.c
cvs rdiff -u -r1.7 -r1.7.4.1 src/tests/lib/libc/locale/t_c8rtomb.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.