Subject: Re: WWW query engine bug (was Query-PR)
To: None <current-users@NetBSD.ORG>
From: der Mouse <mouse@Collatz.McRCIM.McGill.EDU>
List: current-users
Date: 02/21/1996 06:55:30
>>> But the question is: how can you tell 'intentional' html from
>>> something that just looks like HTML?
>> Force people to insert <HTML>...</HTML> around their text if they
>> don't want tags to be converted to &amp;, &lt; and &gt;.

> You didn't answer the second question: what impact does that have?

> i don't think that's workable, for several reasons:

> 	(1) PRs are sent as e-mail messages, and for the most part look
> 	    like e-mail messages.  How can you put that before your
> 	    headers, so it will do the right thing with HTML in the
> 	    headers?  (e.g. an X-Organization: header...)


> 	(2) the PR machinery appears to mangle some submissions in ways
> 	    that are not obvious to me, e.g. reordering some headers,
> 	    etc.  How are people supposed to set things up so that they
> 	    work right?

Don't put <HTML> in the headers, then.

> 	(3) if the user does a 'long-range' <html>, perhaps one which
> 	    is never closed, how does the scanner deal with that?  some
> 	    of the PRs are gigantic, and i think it's unreasonable to
> 	    have to have it parse them completely before it processes
> 	    any of them.

I don't see why there's any need to.  Your scanner just has to keep a
bit saying whether it's currently inside an unclosed <HTML>, and if
it's not, just do mindless mapping of < to &lt;, etc.  That needs no
lookahead, so output can start before the whole PR has been read.
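
As a rough illustration of the one-bit scanner, here is a minimal
sketch in Python.  It is not the query engine's actual code; the names
convert and _escape, the marker pattern, and the choice to drop the
<HTML> and </HTML> markers from the output are all assumptions made
for the example.  It works a line at a time and carries only the flag
from one line to the next.

```python
import re

# Assumed marker syntax: a literal <HTML> or </HTML>, in any case.
_MARKER = re.compile(r"</?html>", re.IGNORECASE)


def _escape(text):
    # Map & first, so the & in &lt; and &gt; is not escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def convert(lines):
    """Yield converted lines one by one; the only state kept is one flag."""
    inside = False  # True while within an unclosed <HTML>
    for line in lines:
        out = []
        pos = 0
        for m in _MARKER.finditer(line):
            chunk = line[pos:m.start()]
            # Text before the marker is handled under the current state.
            out.append(chunk if inside else _escape(chunk))
            # <HTML> turns the flag on, </HTML> turns it off.
            inside = not m.group().startswith("</")
            pos = m.end()
        tail = line[pos:]
        out.append(tail if inside else _escape(tail))
        yield "".join(out)
```

Because convert is a generator, each line can be written out as soon
as it has been read, and an <HTML> that is never closed just leaves
the flag set for the rest of the PR.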

> 	(4) this still doesn't solve the problem!  the user can _still_
> 	    supply bad html!

Oh, sure.  And within the marked-as-HTML portion of the text, you can
still do all the checks you used to do.  This just solves the
text-that-happens-to-look-like-HTML problem.  It doesn't do anything
about any others.

(Imagine what your code will do with a PR that includes C code to
generate HTML...it'll _really_ come out mangled.  This sort of thing is
why I want some way to get just and only the PR, as little modified
from the bits on disk as possible.  Perhaps I'm just weird.)

					der Mouse

			    mouse@collatz.mcrcim.mcgill.edu