Subject: Re: change bus_space_barrier() semantic
To: Garrett D'Amore <garrett_damore@tadpole.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 01/07/2006 19:28:35
On Sat, Jan 07, 2006 at 09:43:48AM -0800, Garrett D'Amore wrote:
> Manuel Bouyer wrote:
> 
> >Hi,
> >while looking at how bus_space_read/write can be reordered on various ports
> >I've discovered the following:
> >1- not all drivers use bus_space_barrier() to ensure ordering of register
> >  accesses
> >2- because of 1, all bus_space implementations (on hardware where it's needed)
> >  have implicit barriers in the read and write methods
> >3- the implementation isn't in sync with the man page. The man page defines
> >  5 different barriers: BUS_SPACE_BARRIER_{READ,WRITE}_BEFORE_{READ,WRITE}
> >  and BUS_SPACE_BARRIER_SYNC, while the implementation has only
> >  BUS_SPACE_BARRIER_READ and BUS_SPACE_BARRIER_WRITE. The man page matches
> >  more closely what some hardware (e.g. sparc64) can do.
> >4- the man page mentions a BUS_SPACE_MAP_PREFETCHABLE flag to bus_space_map(),
> >  which should "try to map the space so that accesses can be prefetched by the
> >  system, and writes can be buffered". This implies that if
> >  BUS_SPACE_MAP_PREFETCHABLE is not used, reads and writes should stay
> >  ordered and bus_space_barrier() calls should not be needed.
> >  
> >
> Solaris provides the following options, which probably reflect what
> some SPARC or x86 hardware has supported in the past.
> 
> DDI_STRICTORDER_ACC -- strict ordering, etc.
> DDI_UNORDERED_OK_ACC -- CPU can reorder references
> DDI_MERGING_OK_ACC -- CPU can merge consecutive stores
> DDI_LOADCACHING_OK_ACC -- CPU can cache loaded data
> 
> Each of these consecutive levels implies all of the previous.  (I.e.
> LOADCACHING means that the mapped region can also support any of the
> previous methods.)
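
For reference, these orderings are selected on the Solaris side when the
registers are mapped, through a ddi_device_acc_attr_t passed to
ddi_regs_map_setup(). A rough sketch (the register number and the
endianness flag below are arbitrary):

    ddi_device_acc_attr_t attr;
    ddi_acc_handle_t handle;
    caddr_t regs;

    attr.devacc_attr_version = DDI_DEVICE_ATTR_V0;
    attr.devacc_attr_endian_flags = DDI_NEVERSWAP_ACC;
    attr.devacc_attr_dataorder = DDI_STRICTORDER_ACC;

    /* len == 0 means "map the whole register set" */
    if (ddi_regs_map_setup(dip, 1, &regs, 0, 0, &attr, &handle) !=
        DDI_SUCCESS)
            return (DDI_FAILURE);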

Isn't DDI_STRICTORDER_ACC incompatible with DDI_MERGING_OK_ACC?
Also, I'm not sure these definitions map properly to all hardware.
What does "merge consecutive stores" mean? Dropping the first store when
there are 2 stores to the same address, or e.g. merging 2 consecutive 8-bit
stores into one 16-bit store?
In the first case, I can't see how this could be useful in real life.
In the second case, the driver should probably use bus_space_write_2 in
the first place.
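
In other words, if a merging bus would turn this (sketch; REG_FOO is an
invented register offset):

    bus_space_write_1(t, h, REG_FOO, lo);
    bus_space_write_1(t, h, REG_FOO + 1, hi);

into the equivalent of this (byte order depending on the bus):

    bus_space_write_2(t, h, REG_FOO, (hi << 8) | lo);

then the driver should just issue the 16-bit write itself.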

> 
> >5- for most hardware, a barrier is needed for *each* bus_space access.
> >  
> >
> I disagree with this statement.  There are lots of devices I've run into
> where the barriers are only needed on e.g. one register, or a certain
> group of registers.

For normal operations (i.e. frequent operations), there are usually only a few
register accesses, and ordering is important (I'm not talking about device
setup, like what happens when an ethernet interface state changes from down
to up or when the multicast filter changes: this is rare enough not to care
about the performance penalty implied by the bus_space sync access).
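
Even something as simple as acknowledging interrupts depends on ordering
(sketch, register names invented):

    /* the write must reach the chip before the read is issued */
    bus_space_write_4(sc->sc_st, sc->sc_sh, REG_INTR_ACK, isr);
    isr = bus_space_read_4(sc->sc_st, sc->sc_sh, REG_INTR_STATUS);

With implicit barriers this just works; with explicit barriers every such
pair needs a BUS_SPACE_BARRIER_WRITE_BEFORE_READ between the two.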

> 
> There is also the case of larger windows where e.g. I might update a
> bunch of fields for a packet and only need a barrier when I finally tell
> the hardware I'm done updating fields and it can process the packet (or
> disk block, or crypto buffer, or whatever.)

For these devices (e.g. ethernet adapters with the buffer mapped in I/O space,
or framebuffer devices) the driver can use BUS_SPACE_MAP_PREFETCHABLE and
bus_space_barrier().
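
That is, something like (sketch; offsets and the doorbell register are
invented, error checking omitted):

    /* relaxed mapping of the buffer window, ordering is explicit */
    bus_space_map(sc->sc_st, addr, size, BUS_SPACE_MAP_PREFETCHABLE,
        &sc->sc_bufh);

    /* fill in the packet fields in any order the bus likes ... */
    bus_space_write_4(sc->sc_st, sc->sc_bufh, DESC_ADDR, pa);
    bus_space_write_4(sc->sc_st, sc->sc_bufh, DESC_LEN, len);

    /* ... and order them only before ringing the doorbell */
    bus_space_barrier(sc->sc_st, sc->sc_bufh, 0, size,
        BUS_SPACE_BARRIER_WRITE_BEFORE_WRITE);
    bus_space_write_4(sc->sc_st, sc->sc_regh, REG_DOORBELL, 1);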

> 
> >  The only case where this may not be needed is hardware providing some
> >  ROM or buffer space directly mappable.
> >
> >Because of this, I propose the following:
> >- change the implementation to be in sync with the man page (that is,
> >  remove BUS_SPACE_BARRIER_{READ,WRITE} and implement the 5 barriers
> >  defined in the man page)
> >  
> >
> The problem is that some ports don't have all these levels.  I think you
> still want to keep the older _READ/WRITE methods, but you can implement
> them in terms of the finer grained methods.

I'd say the opposite: keep only the finer-grained barriers, and on hardware
where these don't map directly, implement them using whatever is available.
If the hardware can't distinguish write-before-read from write-before-write,
flushing the write buffers will fit both cases anyway.
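
For example, a port that only has a write buffer flush could collapse all
five flags onto it (hypothetical sketch, wbflush() standing in for
whatever primitive the port really has):

    void
    bus_space_barrier(bus_space_tag_t t, bus_space_handle_t h,
        bus_size_t o, bus_size_t l, int flags)
    {
            /*
             * This hardware can't tell write-before-read from
             * write-before-write: all write barriers become a write
             * buffer flush.  Loads are not reordered here, so the
             * read barriers are no-ops.
             */
            if (flags & (BUS_SPACE_BARRIER_WRITE_BEFORE_READ |
                BUS_SPACE_BARRIER_WRITE_BEFORE_WRITE |
                BUS_SPACE_BARRIER_SYNC))
                    wbflush();
    }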

> 
> >- change the implementation so that if the space is mapped without
> >  BUS_SPACE_MAP_PREFETCHABLE, bus_space_barrier() calls are not needed.
> >  This may require using bus_space_barrier() in the read/write methods
> >  to enforce this, and in fact I'll probably be conservative and do it this
> >  way on most ports, leaving the portmasters to change this to better suit
> >  the hardware.
> >  
> >
> Changing the "implementation" in this case means changing a *lot* of
> implementations.

Yes.

> Each port has its own bus_space logic.  Not all of the
> bus_spaces have a structure associated with them where you can tag how
> the space was mapped.
> 
> I'm also not 100% certain about the change in semantic of
> BUS_SPACE_MAP_PREFETCHABLE.

A change in semantics compared to what? This would match what the man page says.

> 
> I would actually think it would be the reverse -- if you mapped it
> prefetchable, then you are saying the hardware doesn't care (e.g. memory
> without side effects) and you shouldn't need barriers.

Hmm, I don't see how this could be used. There would always be one point
where you care. This would seriously reduce the scope of
BUS_SPACE_MAP_PREFETCHABLE.

> 
> I think to get this right, you need to have more flags to bus_space_map
> which describe the needs of the target device (see the flags given by
> Solaris above) so that the implementation can do the "right" thing.

For now I don't see a use for more flags. Either the driver needs all
bus_space accesses to be ordered, or the driver chooses which are ordered
with bus_space_barrier() and no bus_space access should be ordered
implicitly. Anything in between would only fit a very small number of
drivers.
Also note that I only mentioned BUS_SPACE_MAP_PREFETCHABLE, but there are
other flags available: BUS_SPACE_MAP_LINEAR and BUS_SPACE_MAP_CACHEABLE.
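
To make the distinction concrete (sketch, names invented):

    /* device registers: strictly ordered, no flags */
    bus_space_map(t, regbase, regsize, 0, &regh);

    /* frame buffer memory: linear and prefetchable */
    bus_space_map(t, membase, memsize,
        BUS_SPACE_MAP_LINEAR | BUS_SPACE_MAP_PREFETCHABLE, &memh);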

> 
> Just randomly applying barriers to all bus space accesses could have a
> nasty impact on performance.  For a lot of hardware these barriers are
> implemented as a cache flush.  Now imagine that I'm working with DMA
> mapped descriptors (e.g. a typical NIC).  Instead of having a single
> cache flush when I finish writing several (maybe up to a dozen?) fields
> for a given descriptor, I am going to get several dozen cache flushes.

Hmm, I think you're mixing bus_space and bus_dma here. Memory barriers for
DMA memory are done via bus_dmamap_sync().
I'm only talking about bus_space. Note that, *right now*, all bus_space
implementations have implicitly ordered reads/writes, because most drivers
don't use bus_space_barrier() at all.
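
For DMA descriptors the driver is already expected to order things with
bus_dmamap_sync(), e.g. (sketch, the descriptor layout is invented):

    /* make the descriptor visible to the device before starting DMA */
    desc->d_addr = htole32(pa);
    desc->d_cmd = htole32(DESC_CMD_START);
    bus_dmamap_sync(sc->sc_dmat, sc->sc_dmamap, off, sizeof(*desc),
        BUS_DMASYNC_PREWRITE);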

> 
> Now since this probably happens with the interrupts blocked, it might
> not be that bad -- for a uniprocessor system -- but it probably depends
> on the hardware whether a cache flush is evil or not.  On SMP hardware
> it seems likely that this would be very unfriendly.  (Hopefully the SMP
> hardware has something to make these efficient, but in the worst case it
> might require some expensive calls to require *two* caches to be flushed.)

This is where implicit bus_space barriers are a win: the MD implementation
can choose to map these registers uncached, so no cache flushes are
needed. With the current implementation, a space should be mapped uncached
unless BUS_SPACE_MAP_CACHEABLE is specified, and I don't plan to change that.
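
In the MD bus_space_map() this is just (sketch, the pte bit is whatever
the MMU provides):

    /* registers are mapped uncached unless the driver asked for more */
    if ((flags & BUS_SPACE_MAP_CACHEABLE) == 0)
            pte |= PTE_NOCACHE;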

> 
> >- update drivers to match the new implementation (probably removing
> >  bus_space_barrier() calls for most of them, or converting them to
> >  BUS_SPACE_MAP_PREFETCHABLE)
> >  
> >
> Very, very few drivers can probably use BUS_SPACE_MAP_PREFETCHABLE.  
> Most of these devices use bus spaces for register accesses.

Yes, I think this makes sense only for framebuffers, ROMs or other kinds
of directly accessible buffers.

> 
> In general, I like the idea of removing this logic from drivers, they
> shouldn't have to worry about this (other than establishing the type of
> support needed when the device is first mapped) and the bus should take
> care of it.
> 
> But getting this right is going to be a big job, and require changes to
> each of the ports.

Not that big, because the current status quo is already to have implicit
barriers in bus_space accesses. Changing BUS_SPACE_MAP_PREFETCHABLE
and BUS_SPACE_MAP_CACHEABLE to do something is not required for correct
operation, it's just a performance improvement, and I'll leave this
to the portmasters.

> 
> Mightn't this be a good time to try to centralize some of this logic?  
> Imagine instead a common bus_space layer that provides a single handle,
> which exports function pointers for each of the bus space routines
> (including a function for barriers.)  Then you have somewhere to store
> the flags, and can probably export some "common" routines for use by
> typical ports -- e.g. memory-mapped busses that differ very little
> from one another.

Then you'll always have a function pointer dereference, which on some
ports is expensive. On some ports, bus_space is implemented as macros
instead.
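
Compare (rough sketch) a function-pointer based tag:

    struct bus_space_tag {
            uint32_t (*bst_read_4)(bus_space_handle_t, bus_size_t);
            void (*bst_write_4)(bus_space_handle_t, bus_size_t, uint32_t);
            /* ... map, barrier, etc. */
    };
    #define bus_space_read_4(t, h, o) ((t)->bst_read_4((h), (o)))

with what a simple memory-mapped port can do today:

    #define bus_space_read_4(t, h, o) \
            (*(volatile uint32_t *)((h) + (o)))

The indirect version pays a load and an indirect call on every single
register access.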

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference