bin/57616: sed(1) is unable to process multibyte unicode characters properly

To: gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: bin/57616: sed(1) is unable to process multibyte unicode characters properly
From: marc.fege%uni-bonn.de@localhost
Date: Mon, 11 Sep 2023 13:25:00 +0000 (UTC)

>Number:         57616
>Category:       bin
>Synopsis:       sed(1) is unable to process multibyte unicode characters properly
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Sep 11 13:25:00 +0000 2023
>Originator:     Marc Fege
>Release:        9.3 evbarm/i386/amd64
>Organization:
>Environment:
NetBSD rpi 9.3 NetBSD 9.3 (RPI) #0: Thu Aug  4 15:30:37 UTC 2022  mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/evbarm/compile/RPI evbarm
>Description:
Hello dear community,

sed(1) has a problem processing multibyte Unicode characters. For example, if you try to separate a sequence of characters into single characters, NetBSD's sed(1) fails as soon as a character is longer than one byte, i.e. is not plain ASCII. Examples are the German umlauts "Ä/ä", "Ö/ö" and "Ü/ü", which take two bytes each in UTF-8, and the capital sharp s "ẞ", which takes three bytes.

I experimented with the environment variables $LC_CTYPE, $LC_ALL and $LANG, setting them to "C" as well as to "de_DE.UTF-8". Neither setting fixed the problem.

All I am trying to achieve is to put a space after each character of a string that a running shell script has stored in a shell variable. The string is processed by a command comparable to the following:

     echo "abcÄÖÜxyz" | sed 's/./& /g'

I expect the following output format for further processing:
     "a b c Ä Ö Ü x y z "

Instead, depending on how the environment variables above are set, NetBSD's sed(1) either drops the multibyte characters or prints garbled "?" characters in their place, two per umlaut, matching the byte length of the characters it could not process. It never produces the desired output shown above. So NetBSD's sed(1) outputs either
     "a b c x y z "  ,
     "a b c       x y z"
           or
     "a b c ? ? ? ? ? ? x y z "  ,
rendering useless any shell script that expects full Unicode support from the shell userland in the 2020s.
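
To check the byte lengths mentioned above independently of sed, the input can be dumped byte by byte. The following is only a minimal sketch: `hex_bytes` is a hypothetical helper name, and the octal escapes spell out the UTF-8 encoding of Ä, Ö and Ü so that the script does not depend on the encoding used by the editor or terminal.

```shell
#!/bin/sh
# hex_bytes (hypothetical helper): print stdin as one contiguous
# lowercase hex string, two hex digits per input byte.
hex_bytes() {
    od -An -tx1 | tr -d ' \t\n'
}

# "abcÄÖÜxyz" written as UTF-8 bytes via octal escapes:
# Ä = \303\204, Ö = \303\226, Ü = \303\234
sample='abc\303\204\303\226\303\234xyz'

# Dump the bytes, then count them.
printf "$sample" | hex_bytes; echo
printf "$sample" | wc -c
```

Each umlaut appears as a two-byte sequence starting with c3, so the nine characters occupy twelve bytes. A sed that matches `.` against single bytes would therefore see six units where three characters are expected, which would be consistent with the six "?" in the output above.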

I also tested this with current GNU sed (gsed) from pkgsrc and with the sed implementation of FreeBSD's current 13.2 release on a native FreeBSD system. Both handle multibyte characters out of the box: each one is treated as a single character regardless of its byte length and is output correctly. NetBSD's sed(1) should be fixed in this regard, in keeping with POLA: users today find it annoying, if not confusing, when a common shell tool such as sed behaves this way even though they asked it to edit an input stream using correct syntax.

Anyway: you are doing a great job, guys!
Thumbs up!

Best regards,
Marc.
>How-To-Repeat:
On a shell that understands Unicode (e.g. env LANG="de_DE.UTF-8"), type German umlauts, echo them, and pipe the echoed output into sed(1):

echo "abcÄÖÜxyz" | sed 's/./& /g'
>Fix:
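A possible stopgap, not a fix for sed(1) itself: the spacing can be done at the byte level by treating every non-continuation byte plus the UTF-8 continuation bytes (0x80 to 0xBF) that follow it as one character. This is a sketch only. `space_utf8_chars` is a hypothetical helper name, and the approach assumes that sed in the C locale treats each byte as one character and accepts high bytes in bracket expressions. It has not been verified against NetBSD 9.3's sed.

```shell
#!/bin/sh
# space_utf8_chars (hypothetical helper): append a space after every
# UTF-8 encoded character on stdin, working purely on bytes.
space_utf8_chars() {
    # Bracket range covering the UTF-8 continuation bytes 0x80..0xBF.
    cont=$(printf '\200-\277')
    # One character = one non-continuation byte followed by any
    # continuation bytes. LC_ALL=C asks sed to work on single bytes
    # (assumption: the sed in use honours this).
    LC_ALL=C sed "s/[^$cont][$cont]*/& /g"
}

# "abcÄÖÜxyz" written as UTF-8 bytes via octal escapes.
printf 'abc\303\204\303\226\303\234xyz\n' | space_utf8_chars
```

Because the pattern never inspects the locale's idea of a character, the result should not depend on $LANG or $LC_CTYPE. It does assume that the input is valid UTF-8.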
