NetBSD-Bugs archive


lib/57064: Import OpenBSD's script to autogen Unicode ctype definition?



>Number:         57064
>Category:       lib
>Synopsis:       Import OpenBSD's script to autogen Unicode ctype definition?
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Oct 18 04:30:01 +0000 2022
>Originator:     Rin Okuyama
>Release:        9.99.x, 9.x, 8.x
>Organization:
Department of Physics, Meiji University
>Environment:
NetBSD rp64 9.99.100 NetBSD 9.99.100 (GENERIC64EB) #1: Sat Oct  8 01:43:00 JST 2022  rin@latipes:/build/src/sys/arch/evbarm/compile/GENERIC64EB evbarm
>Description:
Unicode has been adding thousands of characters per year in a
totally unorganized way. Our ctype definition for UTF-8 has been
left untouched for the last decade, with very few exceptions:

[1] http://cvsweb.netbsd.org/bsdweb.cgi/src/share/locale/ctype/en_US.UTF-8.src

wiz@ noticed that OpenBSD uses a perl script:

[2] http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/share/locale/ctype/gen_ctype_utf8.pl

to automatically generate its UTF-8 ctype definition:

[3] http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/share/locale/ctype/en_US.UTF-8.src

Adding thousands of characters every year by hand clearly
exceeds our capacity, and I basically agree with adopting this
script. However, there are some concerns:

(1) The generated file [3] is not exactly the same as ours for the
characters we already have, so there may be compatibility
problems. It may (or may not?) be better to switch after the
netbsd-10 branch.

(2) The generated file [3] is under the Unicode license (see link
[3]). I'm not 100% sure whether this license is acceptable for
src/share/locale/ctype.

(3) I don't understand what script [2] actually does. It seems to
involve two conversions in principle: data supplied by the Unicode
Consortium --> perl module --> ctype definition. I'd like to
clarify what exactly happens, and document it in the source or the
commit log. Are there any Perl gurus familiar with its I18N
facilities?

(4) This is not a big problem, but we don't have perl in base.
The database can be regenerated only when explicitly requested,
using perl from pkgsrc. In any case, generating the ctype
definition takes on the order of hours on a modern Intel
processor, so it is unrealistic to build it every time.
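
To make the two-step conversion in concern (3) a bit more concrete,
here is a minimal sketch of the general idea: look up each code
point's general category in a Unicode database shipped with the
language runtime, then map that category to ctype-like classes. It
uses Python's unicodedata module rather than perl, and the function
name ctype_classes and the category-to-class mapping are my own
assumptions for illustration. It is not what script [2] actually
does, which is exactly the part that still needs to be clarified.

```python
import unicodedata

# Hypothetical mapping from Unicode general category to ctype-like
# class labels. This is an assumption for illustration only; it is
# not the table used by script [2].
def ctype_classes(cp):
    cat = unicodedata.category(chr(cp))
    classes = set()
    if cat[0] == "L":          # any letter category (Lu, Ll, Lt, Lm, Lo)
        classes.add("ALPHA")
    if cat == "Lu":
        classes.add("UPPER")
    elif cat == "Ll":
        classes.add("LOWER")
    elif cat == "Nd":          # decimal digits, not only ASCII 0-9
        classes.add("DIGIT")
    elif cat[0] == "P":        # all punctuation categories
        classes.add("PUNCT")
    elif cat == "Zs":          # space separators
        classes.add("SPACE")
    elif cat == "Cc":          # control characters
        classes.add("CNTRL")
    return sorted(classes)

# Step 1 of the pipeline: the runtime embeds one fixed Unicode version.
print("Unicode data version:", unicodedata.unidata_version)

# Step 2: derive classes for a few sample code points.
for cp in (0x41, 0x61, 0x35, 0x20, 0x21):
    print("0x%04x" % cp, ctype_classes(cp))
```

Note that the result depends on the Unicode version embedded in the
runtime, not on the latest data published by the Unicode Consortium,
which is one reason the intermediate step deserves documentation.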

Also note that switching to OpenBSD's ctype definition for UTF-8
does *not* completely resolve our UTF-8 problems. Our Citrus
locale does not recognize combining characters (incl. variation
selectors). Such characters may confuse applications.
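
As a rough illustration of why combining characters matter, the
sketch below (Python again, with hypothetical helper names is_mark
and count_base_chars) shows that a combining accent and a variation
selector are separate code points that attach to the preceding
character. Treating general categories Mn and Me as "combining" is a
simplifying assumption here, not a description of Citrus or of any
planned fix.

```python
import unicodedata

def is_mark(ch):
    # Mn = nonspacing mark, Me = enclosing mark. Variation selectors
    # such as U+FE0F are also in category Mn.
    return unicodedata.category(ch) in ("Mn", "Me")

def count_base_chars(s):
    # Count code points that are not marks attached to a previous one.
    return sum(1 for ch in s if not is_mark(ch))

samples = [
    "e\u0301",       # 'e' followed by COMBINING ACUTE ACCENT
    "\u2764\ufe0f",  # HEAVY BLACK HEART followed by VARIATION SELECTOR-16
]
for s in samples:
    print([hex(ord(ch)) for ch in s], len(s), count_base_chars(s))
```

An application that counts every code point as one character sees
two characters in each sample, while a user sees one.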
>How-To-Repeat:
n/a
>Fix:
described above


