tech-userlevel archive


Re: Alternative to hash-bang



Hello,

Justin Cormack <justin%specialbusservice.com@localhost> wrote:
 |On Jul 19, 2014 4:00 PM, "Steffen Nurpmeso" <sdaoden%yandex.com@localhost> wrote:
 |> And because of this last part again i finally come to the conclusion
 |> that the UTF-8 BOM will become a vivid part of the future, because
 |> it carries information of a file's encoding along with the file as
 |> a part of the encoding itself.
 |
 |UTF8 BOMs are only really used on Windows due to its UTF16 heritage. I have
 |never seen them used on a Unix system. That is probably why Perl added
 |support. That should not mean the use should be encouraged.

Maybe.  Yes.  But with respect to the first two points i had to
learn that some Unix systems (AIX) also use UTF-16.  I don't know
how hard IBM, which is (i think) a paying core member of the POSIX
standard, will try to push UTF-16 into the standard once that
finally moves forward towards true support for the languages of
the world; maybe not at all (their ICU library seems to be
improving its UTF-8 support, though i think its core is still
UTF-16).

 |> The real question is: what should be done with BOMs in `$ cat f1
 |> f2 > f3', they cannot simply become stripped off?
 |
 |Write a utfcat command?
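
As a minimal sketch of what such a utfcat could do (an illustration
only; no such tool is specified in this thread): the hypothetical
function below copies its inputs byte for byte, drops a leading UTF-8
BOM from every input after the first, and keeps a BOM on the output
only if the first input had one.  That policy is an assumption, and
others are equally defensible.

```python
import codecs
import shutil

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def utfcat(paths, out):
    """Concatenate files to the binary stream `out`.

    Assumed policy (not from the thread): the output starts with a
    BOM only if the first input does; BOMs that lead later inputs
    are dropped so that none ends up in the middle of the output.
    """
    for index, path in enumerate(paths):
        with open(path, "rb") as src:
            head = src.read(len(BOM))
            # Write the first bytes unless they are a BOM on a
            # non-first input.
            if head != BOM or index == 0:
                out.write(head)
            # Copy the remainder of the file unchanged.
            shutil.copyfileobj(src, out)
```

Calling `utfcat(["f1", "f2"], sys.stdout.buffer)` (after `import sys`)
would then stand in for `cat f1 f2`.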

Well.  A locale modifier like POSIX.UTF-8@BOM wouldn't do the
right thing.  Martin Dürst of W3C wrote a few years ago:

  Yes exactly. In the RFC 2070 and HTML4 time-frame, nobody that I know 
  was thinking about a BOM for UTF-8. Only later BOMs at the start of 
  HTML4 started to turn up, and browser makers were surprised. Roughly the 
  same happened for XML. Early XML parsers didn't handle the BOM.

  When Windows notepad started to use the BOM to distinguish between UTF-8 
  and "ANSI" (the local system legacy encoding), this BOM leaked into 
  HTML, and was difficult to stop. So XML got updated, and parsers started 
  to get updated, too.

  ...

  The problem with the BOM in UTF-8 is that it can be quite helpful (for 
  quickly distinguishing between UTF-8 and legacy-encoded files) and quite 
  damaging (for programs that use the Unix/Linux model of text 
  processing), and that's why it creates so much controversy.
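
The "quickly distinguishing" half of that trade-off can be sketched
as follows.  The function name and the choice of Latin-1 as the
legacy fallback are assumptions made for illustration; the BOM check
is a three-byte comparison, whereas without a BOM the whole input has
to be validated as UTF-8 before anything can be said.

```python
import codecs


def guess_encoding(data, legacy="latin-1"):
    """Guess the encoding of the bytes object `data`.

    Returns "utf-8-sig" if a UTF-8 BOM leads the data, "utf-8" if
    the data decodes strictly as UTF-8, and the (assumed) legacy
    encoding otherwise.
    """
    # Cheap case: the BOM settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Expensive case: validate every byte as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return legacy
    return "utf-8"
```

The damaging half shows up just as easily: a check such as
`data.startswith(b"#!")` fails on a script that begins with a BOM,
because the first bytes are then EF BB BF rather than `#!`.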

--steffen

