NetBSD-Bugs archive


lib/44600: libedit does not properly handle UTF-8 when glyphs are multiple Unicode characters



>Number:         44600
>Category:       lib
>Synopsis:       libedit does not properly handle UTF-8 when glyphs are multiple Unicode characters
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    lib-bug-people
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 18 20:00:00 +0000 2011
>Originator:     Steven Vernon
>Release:        sources as of 2011/02/04
>Organization:
Citrix
>Environment:
>Description:
When using UTF-8, libedit assumes that one glyph (visible character)
corresponds to exactly one Unicode code point (character number) [and it
reasonably assumes that one glyph takes up one column and one row].
Unfortunately that is not always the case: some glyphs are encoded as a
sequence of several code points. Examples include accented letters in European
languages when they are entered in non-composed form (e.g. a French "e" with a
circumflex accent, encoded as the base letter followed by a separate combining
accent) and Indic scripts such as Devanagari (used for Hindi), where dependent
vowel signs and the virama are separate code points attached to a base
consonant.

For such input, libedit neither deletes characters correctly nor updates the
cursor position correctly.
>How-To-Repeat:
Enter data with non-composed accents or viramas, etc. Try backspacing over the 
data, moving the cursor left/right and deleting and/or inserting, and 
redisplaying after changes are made.

Beware that some character combinations also have pre-composed versions, which
are given a single Unicode code point, such as the above French "e" with a
circumflex accent. These were only created for backward compatibility with
certain character sets, such as Latin-1. Make sure you enter the non-composed
versions if testing with these values.
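
For reference, the difference between the two forms can be checked at the byte
level. The following is only a sketch and not libedit code: the helper
utf8_codepoints() and the three sample strings are names made up for this
illustration, and the helper assumes its input is valid UTF-8.

```c
#include <stddef.h>

/*
 * Count Unicode code points in a NUL-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (bit pattern 10xxxxxx).
 * Hypothetical helper for illustration; assumes the input is valid UTF-8.
 */
size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/* Pre-composed: U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX. */
static const char e_precomposed[] = "\xC3\xAA";

/* Non-composed: U+0065 'e' followed by U+0302 COMBINING CIRCUMFLEX ACCENT. */
static const char e_decomposed[] = "e\xCC\x82";

/* Devanagari: U+0915 LETTER KA followed by U+093F VOWEL SIGN I. */
static const char ka_i[] = "\xE0\xA4\x95\xE0\xA4\xBF";
```

The first string is one code point, while the second and third are two code
points each, although all three are normally displayed as a single glyph.
Feeding the second or third string to the line editor should exercise the
problem described above.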
>Fix:
libedit probably needs to import the Unicode information that determines which
characters are combining. I believe that in all cases such combining characters
follow the base character. See the Unicode site.
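
A rough sketch of how cursor movement and deletion could use such information,
assuming the line is held as an array of code points. None of the names below
exist in libedit; is_combining_sketch() stands in for a lookup generated from
the real Unicode data, and its two ranges (U+0300..U+036F and the Devanagari
signs U+093E..U+094D) are a deliberately incomplete placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a table built from the Unicode data files.
 * Covers only the Combining Diacritical Marks block and the Devanagari
 * dependent vowel signs plus virama; a real table would be far larger.
 */
static int
is_combining_sketch(uint32_t cp)
{
	return (cp >= 0x0300 && cp <= 0x036F) ||
	    (cp >= 0x093E && cp <= 0x094D);
}

/*
 * Return the index at which the glyph ending just before pos begins.
 * Backspace would delete buf[result .. pos-1]; cursor-left would move
 * to result. Relies on combining characters following their base.
 */
size_t
prev_cluster_start(const uint32_t *buf, size_t pos)
{
	if (pos == 0)
		return 0;
	pos--;
	while (pos > 0 && is_combining_sketch(buf[pos]))
		pos--;
	return pos;
}

/*
 * Return the index just past the glyph that starts at pos.
 * Delete-forward would remove buf[pos .. result-1]; cursor-right
 * would move to result.
 */
size_t
next_cluster_end(const uint32_t *buf, size_t len, size_t pos)
{
	assert(pos <= len);
	if (pos == len)
		return len;
	pos++;
	while (pos < len && is_combining_sketch(buf[pos]))
		pos++;
	return pos;
}
```

With the buffer { 'a', 'e', U+0302, 'b' } and the cursor at index 3, a
backspace would remove indices 1 and 2 together rather than only the accent.
This sketch does not try to join consonant conjuncts formed with a virama, and
display width would still have to be computed per glyph rather than per code
point.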


