NetBSD-Bugs archive
Re: bin/59657: syslogd outputs BOM in the message
The following reply was made to PR bin/59657; it has been noted by GNATS.
From: Christos Zoulas <christos%zoulas.com@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: gnats-admin%netbsd.org@localhost,
        netbsd-bugs%netbsd.org@localhost
Subject: Re: bin/59657: syslogd outputs BOM in the message
Date: Fri, 19 Sep 2025 12:16:39 -0400
--Apple-Mail=_E881427E-F680-4FDC-972E-7B2A6D942B37
Content-Type: multipart/alternative;
boundary="Apple-Mail=_8D6F4724-65D5-4C18-9E64-8931C7F899E3"
--Apple-Mail=_8D6F4724-65D5-4C18-9E64-8931C7F899E3
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset=utf-8
> SYSLOG(3) and rfc5424 similarly state:
>
> "If the msgfmt contains UTF-8 characters, then it has to start with
> a Byte Order Mark."
>
>
> The BOM is unexpected as a prefix for every message logged:
>
> 2025-09-17T01:59:10.205820+01:00 funcube potato - - - <feff>Árvíztűrő tükörfúrógép
>> How-To-Repeat:
> syslogd -o rfc5424 -d
>
> logger $(printf "\xEF\xBB\xBF%s" "Árvíztűrő tükörfúrógép")
>
>
>
> tail -n 1 /var/log/messages | xxd
> 00000000: 3230 3235 2d30 392d 3137 5430 313a 3539  2025-09-17T01:59
> 00000010: 3a31 302e 3230 3538 3230 2b30 313a 3030  :10.205820+01:00
> 00000020: 2066 756e 6375 6265 2070 6f74 6174 6f20   funcube potato 
> 00000030: 2d20 2d20 2d20 efbb bfc3 8172 76c3 ad7a  - - - .....rv..z
> 00000040: 74c5 b172 c591 2074 c3bc 6bc3 b672 66c3  t..r.. t..k..rf.
> 00000050: ba72 c3b3 67c3 a970 0a                   .r..g..p.
>
Why do you say that? The BNF in the RFC says:

      MSG             = MSG-ANY / MSG-UTF8
      MSG-ANY         = *OCTET ; not starting with BOM
      MSG-UTF8        = BOM UTF-8-STRING
      BOM             = %xEF.BB.BF

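For illustration, the two alternatives in that grammar can be told apart by looking only at the first three octets of MSG. The following is a minimal sketch in C under that reading; it is not taken from syslogd.c, and the names `msg_kind`, `utf8_bom` and `classify_msg` are made up for this example. It does not check that what follows the BOM is a valid UTF-8-STRING.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only; not from syslogd.c. */
enum msg_kind { MSG_ANY, MSG_UTF8 };

/* BOM = %xEF.BB.BF, as in the ABNF quoted above. */
static const unsigned char utf8_bom[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Classify an RFC 5424 MSG part by its leading octets: MSG_UTF8 if it
 * starts with the BOM, MSG_ANY otherwise (including an empty message).
 * The octets after the BOM are not validated here.
 */
static enum msg_kind
classify_msg(const unsigned char *msg, size_t len)
{
	if (len >= sizeof(utf8_bom) &&
	    memcmp(msg, utf8_bom, sizeof(utf8_bom)) == 0)
		return MSG_UTF8;
	return MSG_ANY;
}
```

Under this sketch, the message from the How-To-Repeat section (which begins with EF BB BF) would be classified as MSG-UTF8, and the same text without the prefix as MSG-ANY.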
Now in practice according to ChatGPT:

- Almost all modern syslog implementations do not emit a BOM, even for
  UTF-8 content.
- Many receivers are tolerant and just assume UTF-8 without requiring BOM.
- Some parsers can actually get confused if a BOM is present.

And:

- RFC 5424 says the BOM is required if you send UTF-8 MSG.
- In practice, it’s usually skipped, and interoperability tends to be
  better without it.

If your tool (msgfmt) prepends a BOM automatically, you should check the
target syslog receiver. If it understands RFC 5424 to the letter, the
BOM is technically correct. But if you’re aiming for compatibility with
common syslog daemons (rsyslog, syslog-ng, journald forwarders),
skipping the BOM is typically safer.

Perhaps adding a flag to select the behavior? What should the default be?

christos
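
As a rough sketch of what a selectable behavior could look like, the helper below either keeps or skips a leading BOM when choosing the text to write out. It is an assumption for discussion, not code from syslogd.c: `msg_for_output` and `keep_bom` are hypothetical names, and no such option exists in syslogd today. How `keep_bom` would be set (command-line option, configuration file, per-destination) and what its default should be are left open.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of syslogd.c: return the MSG text to
 * write to the log.  If keep_bom is false and the message starts with
 * the UTF-8 BOM (EF BB BF), the BOM is skipped; otherwise the message
 * is returned unchanged.  *lenp is updated to the returned length.
 */
static const char *
msg_for_output(const char *msg, size_t *lenp, bool keep_bom)
{
	static const char bom[] = "\xEF\xBB\xBF";
	const size_t bomlen = sizeof(bom) - 1;

	if (!keep_bom && *lenp >= bomlen && memcmp(msg, bom, bomlen) == 0) {
		*lenp -= bomlen;
		return msg + bomlen;
	}
	return msg;
}
```

With `keep_bom` false this has the same visible effect as the `start += 3` change in the quoted patch; with it true the output stays as it is now.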
>
>> Fix:
> Index: ./usr.sbin/syslogd/syslogd.c
> ===================================================================
> RCS file: /cvsroot/src/usr.sbin/syslogd/syslogd.c,v
> retrieving revision 1.147
> diff -u -r1.147 syslogd.c
> --- ./usr.sbin/syslogd/syslogd.c 9 Nov 2024 16:31:31 -0000 1.147
> +++ ./usr.sbin/syslogd/syslogd.c 17 Sep 2025 01:08:30 -0000
> @@ -1243,6 +1243,7 @@
>              DPRINTF(D_DATA, "UTF-8 BOM\n");
>              utf8allowed = true;
>              p += 3;
> +            start += 3; /* skip BOM in output */
>          }
>
>          if (*p != '\0' && !utf8allowed) {
--Apple-Mail=_8D6F4724-65D5-4C18-9E64-8931C7F899E3--
--Apple-Mail=_E881427E-F680-4FDC-972E-7B2A6D942B37--