NetBSD-Bugs archive


Re: bin/59657: syslogd outputs BOM in the message



The following reply was made to PR bin/59657; it has been noted by GNATS.

From: Christos Zoulas <christos%zoulas.com@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: gnats-admin%netbsd.org@localhost,
 netbsd-bugs%netbsd.org@localhost
Subject: Re: bin/59657: syslogd outputs BOM in the message
Date: Fri, 19 Sep 2025 12:16:39 -0400

 --Apple-Mail=_E881427E-F680-4FDC-972E-7B2A6D942B37
 Content-Type: multipart/alternative;
 	boundary="Apple-Mail=_8D6F4724-65D5-4C18-9E64-8931C7F899E3"
 
 
 --Apple-Mail=_8D6F4724-65D5-4C18-9E64-8931C7F899E3
 Content-Transfer-Encoding: quoted-printable
 Content-Type: text/plain;
 	charset=utf-8
 
 
 
 > SYSLOG(3) and rfc5424 similarly state:
 >
 > "If the msgfmt contains UTF-8 characters, then it has to start with
 > a Byte Order Mark."
 >
 >
 > The BOM is unexpected as a prefix for every message logged:
 >
 > 2025-09-17T01:59:10.205820+01:00 funcube potato - - - <feff>Árvíztűrő tükörfúrógép
 >> How-To-Repeat:
 > syslogd -o rfc5424 -d
 >
 > logger $(printf "\xEF\xBB\xBF%s" "Árvíztűrő tükörfúrógép")
 >
 >
 >
 > tail -n 1 /var/log/messages | xxd
 > 00000000: 3230 3235 2d30 392d 3137 5430 313a 3539  2025-09-17T01:59
 > 00000010: 3a31 302e 3230 3538 3230 2b30 313a 3030  :10.205820+01:00
 > 00000020: 2066 756e 6375 6265 2070 6f74 6174 6f20   funcube potato
 > 00000030: 2d20 2d20 2d20 efbb bfc3 8172 76c3 ad7a  - - - .....rv..z
 > 00000040: 74c5 b172 c591 2074 c3bc 6bc3 b672 66c3  t..r.. t..k..rf.
 > 00000050: ba72 c3b3 67c3 a970 0a                   .r..g..p.
 >
 
 Why do you say that? The BNF in the RFC says:
 
       MSG             = MSG-ANY / MSG-UTF8
       MSG-ANY         = *OCTET ; not starting with BOM
       MSG-UTF8        = BOM UTF-8-STRING
       BOM             = %xEF.BB.BF
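
As a rough illustration of the grammar above, the MSG-ANY / MSG-UTF8 distinction comes down to checking the first three octets of the message. A minimal sketch in C; the helper name msg_has_bom is hypothetical and is not taken from syslogd.c:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The three octets of the UTF-8 byte order mark (BOM = %xEF.BB.BF). */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Hypothetical helper, for illustration only: returns true when the
 * message begins with the BOM, i.e. when it would match MSG-UTF8
 * rather than MSG-ANY in the grammar quoted above.
 */
static bool
msg_has_bom(const char *msg, size_t len)
{
	return len >= sizeof(UTF8_BOM) &&
	    memcmp(msg, UTF8_BOM, sizeof(UTF8_BOM)) == 0;
}
```

A message shorter than three octets can never match MSG-UTF8, which is why the length is checked before the comparison.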
 
 
 Now in practice according to ChatGPT:

 - Almost all modern syslog implementations do not emit a BOM, even for
   UTF-8 content.
 - Many receivers are tolerant and just assume UTF-8 without requiring BOM.
 - Some parsers can actually get confused if a BOM is present.

 And:

 - RFC 5424 says the BOM is required if you send UTF-8 MSG.
 - In practice, it’s usually skipped, and interoperability tends to be
   better without it.

 If your tool (msgfmt) prepends a BOM automatically, you should check the
 target syslog receiver. If it understands RFC 5424 to the letter, the BOM
 is technically correct. But if you’re aiming for compatibility with common
 syslog daemons (rsyslog, syslog-ng, journald forwarders), skipping the BOM
 is typically safer.

 Perhaps adding a flag to select the behavior? What should the default be?
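
To make the question concrete, such a flag could look roughly like the sketch below. Everything here is an assumption for illustration: the bom_policy enum and msg_output_start() are hypothetical names, not existing syslogd options or functions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical policy a command-line flag might select. */
enum bom_policy {
	BOM_KEEP,	/* write the message exactly as received */
	BOM_STRIP	/* drop a leading UTF-8 BOM before writing */
};

/*
 * Hypothetical helper: returns the position output should start from.
 * When the BOM is stripped, the caller must also shorten the length
 * it writes by three octets.
 */
static const char *
msg_output_start(const char *msg, size_t len, enum bom_policy policy)
{
	static const char bom[] = "\xEF\xBB\xBF";

	if (policy == BOM_STRIP && len >= 3 && memcmp(msg, bom, 3) == 0)
		return msg + 3;
	return msg;
}
```

With BOM_STRIP the returned pointer skips the mark, which has the same effect as the `start += 3` in the proposed patch; with BOM_KEEP the output is unchanged.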
 
 christos
 
 >
 >> Fix:
 > Index: ./usr.sbin/syslogd/syslogd.c
 > ===================================================================
 > RCS file: /cvsroot/src/usr.sbin/syslogd/syslogd.c,v
 > retrieving revision 1.147
 > diff -u -r1.147 syslogd.c
 > --- ./usr.sbin/syslogd/syslogd.c        9 Nov 2024 16:31:31 -0000       1.147
 > +++ ./usr.sbin/syslogd/syslogd.c        17 Sep 2025 01:08:30 -0000
 > @@ -1243,6 +1243,7 @@
 >                DPRINTF(D_DATA, "UTF-8 BOM\n");
 >                utf8allowed = true;
 >                p += 3;
 > +               start += 3;  /* skip BOM in output */
 >        }
 >
 >        if (*p != '\0' && !utf8allowed) {
 
 
 --Apple-Mail=_8D6F4724-65D5-4C18-9E64-8931C7F899E3--
 
 --Apple-Mail=_E881427E-F680-4FDC-972E-7B2A6D942B37--
 


