tech-userlevel archive


Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)



Joerg Sonnenberger wrote:
> On Thu, Sep 11, 2008 at 05:22:51PM +0300, Andy Shevchenko wrote:
> > But most new versions of the common tools are becoming UTF-8
> > (wide char internally) compatible.  Thus, less, wc, etc. complain
> > about that kind of symbol, which looks like a Unicode sequence
> > starter.
> 
> I think you should only complain about files that are not valid latin1.

HTML files need to have their encoding "declared by the server" according
to the standard, http://www.w3.org/TR/html401/charset.html:

        "The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a
default character encoding when the "charset" parameter is absent from the
"Content-Type" header field. In practice, this recommendation has proved
useless because some servers don't allow a "charset" parameter to be sent,
and others may not be configured to send the parameter. Therefore, user
agents must not assume any default value for the "charset" parameter."

IMO the only reasonable choice for distributed HTML -- which by definition
can't control the server environment -- is to use the META "Content-Type"
declaration.  Absent that, the document's characters *can't* be wrong.  A
character can't be outside a set when the set is undefined.
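
For reference, this is the kind of declaration I mean -- the charset value
should be whatever the files actually contain; ISO-8859-1 here is only an
example:

        <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">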

So, if you're going to submit patches to satisfy Fedora, your patch should
in fairness also add a Content-Type declaration.  But the real fix, of
course, is to change the Makefile that produces the HTML.  ("openjade -b
utf-8" works pretty well.)

--jkl


