Subject: Re: hardware caches and DMA
To: None <jfw@jfwhome.funhouse.com, port-i386@NetBSD.ORG>
From: Terry Moore <tmm@databook.com>
List: port-i386
Date: 04/17/1995 13:21:20
> Most x86 motherboards have write-through caches, but many have write-back,
> and according to the manual that came with mine, it has write-back. 

Mine too.  I've been running NetBSD using the cache in write-back mode
(according to the BIOS, anyway), with no problems.  Of course, I don't
think I have any DMA devices other than the FDC, which I hardly ever
use (but which works for TAR when I use it).  The IDE drives use
programmed I/O.

It is true that write-back caches without snooping appear here and
there.  I think, though, that DOS and Windows may be helping us
here.  If NetBSD is bad for write-back caches, imagine how poorly DOS
3.3 works in conjunction with a bus-mastering card.  I don't believe
that there is much cache coherency management in the bulk of DOS apps
(which is relevant, since so many DOS apps go directly to the
hardware).  Possibly there's a bifurcation:  the integrated DMA 
controllers are snooped, but not bus masters?

In any event, whatever the architecture, there are some arguments for
handling the cache coherency issues in software if possible.  It 
turns out that hardware cache coherency protocols are the bus-level
equivalent of CISC.  Lots of overhead is added to each bus cycle to 
solve a problem that doesn't really occur very often.  In multi-level
busses, you can end up (with a poorly designed protocol) tying up all
the busses doing snoop cycles.  (On a bad day, something like this
can happen with PCI, for example.  That's one reason you hear about
PCI bus switches and other exotic things.  It's also a reason cache
coherency stuff was dropped from the "CardBus" spec [PCI on a PCMCIA card].)
Supercomputers typically eschew caches for this reason -- they make
it software's problem.
 
Since NetBSD is sort of a research OS, it would be really cool to have 
the hooks in -- calling the appropriate invalidation routines at the
appropriate places in the drivers.  In which case, just adding wbinvd()
(or the appropriate set of abstractions) to the architecture-independent
set of abstractions would be a nice thing to do.  Yes, it would be a
wasted procedure call on the machines where it wasn't needed -- but
as a percentage increase of overhead, it would not be that expensive.
Or, it could be a macro that expands to whitespace.  Bakul's idea of
macros that specify the base and range is not a bad one -- on an x86,
it could translate to WBINVD; on an architecture with finer grain,
it could translate to whatever was appropriate.  Alas that WBINVD will
be such a pig of an operation -- most of what it will flush will be
stack data that will soon be discarded anyway.  Hence, the ability to
take advantage of chipset-specific line invalidation might really
improve I/O performance.
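
Here is a rough sketch of what a base-and-range macro of that kind might
look like.  Every name in it (DMA_CACHE_FLUSH, CACHE_IS_COHERENT,
cpu_wbinvd_stub, example_start_dma_write) is made up for illustration
and is not an existing NetBSD interface.  WBINVD is privileged, so the
sketch substitutes a counting stub that can be compiled and run in user
space; a real kernel would issue the instruction itself.

```c
#include <stddef.h>

/* Counts whole-cache flushes so the sketch can be exercised in user space. */
static unsigned long cache_flush_count;

/*
 * Hypothetical stand-in for the x86 WBINVD instruction.  WBINVD only
 * executes in ring 0, so a kernel would use something like
 *     __asm volatile("wbinvd" ::: "memory");
 * here instead of bumping a counter.
 */
static void cpu_wbinvd_stub(void)
{
    cache_flush_count++;
}

#if defined(CACHE_IS_COHERENT)
/* Hardware snoops DMA: the macro expands to (nearly) nothing. */
#define DMA_CACHE_FLUSH(base, len) ((void)(base), (void)(len))
#else
/*
 * No finer-grained operation is assumed, so base and len are ignored
 * and the whole cache is written back and invalidated.  A port with
 * per-line operations could walk [base, base + len) instead.
 */
#define DMA_CACHE_FLUSH(base, len)              \
    do {                                        \
        (void)(base);                           \
        if ((len) != 0)                         \
            cpu_wbinvd_stub();                  \
    } while (0)
#endif

/* Hypothetical driver fragment: flush before the device reads the buffer. */
static void example_start_dma_write(const void *buf, size_t len)
{
    DMA_CACHE_FLUSH(buf, len);
    /* ... program the DMA controller here ... */
}
```

The driver always states the range it cares about, and each port decides
how much of that information it can use.  On a machine that snoops, the
cost is an empty macro.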

--Terry