Subject: Re: change bus_space_barrier() semantic
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Garrett D'Amore <garrett_damore@tadpole.com>
List: tech-kern
Date: 01/07/2006 12:14:16
Manuel Bouyer wrote:

>On Sat, Jan 07, 2006 at 09:43:48AM -0800, Garrett D'Amore wrote:
>  
>
>>Manuel Bouyer wrote:
>>
>>    
>>
>>>Hi,
>>>while looking at how bus_space_read/write can be reordered on various ports
>>>I've discovered the following:
>>>1- not all drivers use bus_space_barrier() to ensure ordering of register
>>> access
>>>2- because of 1, all bus_space implementations (on hardware where it's needed)
>>> have implicit barriers in the read and write methods
>>>3- the implementation isn't in sync with the man page. The man page defines
>>> 5 different barriers: BUS_SPACE_BARRIER_{READ,WRITE}_BEFORE_{READ,WRITE}
>>> and BUS_SPACE_BARRIER_SYNC, while the implementation has only
>>> BUS_SPACE_BARRIER_READ and BUS_SPACE_BARRIER_WRITE. The man page matches
>>> more closely what some hardware (e.g. sparc64) can do.
>>>4- the man page mentions a BUS_SPACE_MAP_PREFETCHABLE flag to bus_space_map(),
>>> which should "try to map the space so that accesses can be prefetched by the
>>> system, and writes can be buffered". This implies that if
>>> BUS_SPACE_MAP_PREFETCHABLE is not used, reads and writes should stay
>>> ordered and bus_space_barrier() calls should not be needed.
>>> 
>>>
>>>      
>>>
>>Solaris provides the following options, which probably reflect what
>>some SPARC or x86 hardware has supported in the past.
>>
>>DDI_STRICTORDER_ACC -- strict ordering, etc.
>>DDI_UNORDERED_OK_ACC -- CPU can reorder references
>>DDI_MERGING_OK_ACC -- CPU can merge consecutive stores
>>DDI_LOADCACHING_OK_ACC -- CPU can cache loaded data
>>
>>Each of these consecutive levels implies all of the previous.  (I.e.
>>LOADCACHING means that the mapped region can also support any of the
>>previous methods.)
>>    
>>
>
>Isn't DDI_STRICTORDER_ACC incompatible with DDI_MERGING_OK_ACC?
>  
>
Depends on how you mean.  You cannot do DDI_MERGING_OK_ACC if you have
DDI_STRICTORDER_ACC set.  These are not bits, but rather values.  All
kinds of mapping support DDI_STRICTORDER_ACC, but some kinds also
support these other values.  Supporting DDI_MERGING_OK_ACC means that
the hardware had better also support UNORDERED and STRICTORDER.  (And all
hardware always supports STRICTORDER, at least if it is operative.)
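
For reference, here is roughly how a Solaris driver picks one of these
levels when it maps its registers.  This is a minimal sketch; the
register number, offsets, and the choice of DDI_NEVERSWAP_ACC are just
illustrative:

    /* hypothetical fragment from a driver's attach(9E) routine */
    ddi_device_acc_attr_t attr;
    ddi_acc_handle_t regh;
    caddr_t regs;

    attr.devacc_attr_version = DDI_DEVICE_ATTR_V0;
    attr.devacc_attr_endian_flags = DDI_NEVERSWAP_ACC;
    /* tell the framework what reordering the device can tolerate */
    attr.devacc_attr_dataorder = DDI_STRICTORDER_ACC;

    if (ddi_regs_map_setup(dip, 1, &regs, 0, 0, &attr, &regh) !=
        DDI_SUCCESS)
            return (DDI_FAILURE);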

>Also I'm not sure these definitions maps properly to all hardware.
>What does "merge consecutive stores" ? Ignoring the first store when there
>are 2 stores at the same address, or e.g. merging 2 consecutive 8bit
>stores in one 16bit ?
>  
>
The latter case -- merging two smaller stores into a larger store.

I forgot to list one more value, DDI_STORECACHING_OK_ACC, which would do
just what it suggests.  STORECACHING permits all of the above behaviors.

>In the first case, I can't see how this can be useful in real life.
>In the second case, the driver should probably use bus_space_write_2 in
>the first place.
>  
>
Probably, but sometimes optimizations or bus logic makes that hard to
do.  For example, imagine two 8 bit registers side by side.  One of them
gets updated in an interrupt, and one gets updated in mainline code.
You can't make this work right easily from a software perspective, but
hardware can.
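
To make the scenario concrete, here is a hedged sketch (device and
register names invented): two adjacent 8-bit registers written from
different contexts.  Hardware with merging enabled could combine the two
stores into one 16-bit write, but software cannot simply substitute a
single bus_space_write_2() without serializing the two paths:

    #define FOO_REG_A       0x00    /* written from the interrupt handler */
    #define FOO_REG_B       0x01    /* written from mainline code */

    /* interrupt handler */
    bus_space_write_1(sc->sc_iot, sc->sc_ioh, FOO_REG_A, val_a);

    /* mainline code, possibly racing with the interrupt */
    bus_space_write_1(sc->sc_iot, sc->sc_ioh, FOO_REG_B, val_b);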

Granted, the win here is probably pretty small, and I'm not sure what
hardware actually supports this without supporting the other options.

The point is that these values are ways for the driver to tell the bus
implementation what it can or cannot do to its accesses.

>  
>
>>>5- for most hardware, a barrier is needed for *each* bus_space access.
>>> 
>>>
>>>      
>>>
>>I disagree with this statement.  There are a lot of devices I've run into
>>where the barriers are only needed on e.g. one register, or a certain
>>group of registers.
>>    
>>
>
>For normal operations (i.e. frequent operations), there are usually only a few
>register accesses, and ordering is important (I'm not talking about device
>setup, like what happens when an ethernet interface state changes from down
>to up or when the multicast filter changes: this is rare enough not to care
>about the performance penalty implied by synchronized bus_space accesses).
>  
>
What about PIO devices that stash entire frames/buffers in register
space?  Think of wi(4) or IDE as examples.  These don't need or want
prefetch or caching, but reading a bunch of 32-bit words shouldn't cause
a gazillion cache flushes to occur.
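
For example, a wi(4)-style PIO frame read boils down to something like
the following (the register name is made up).  With fine-grained
barriers, one ordering point around the whole transfer would do, rather
than a flush per 16-bit word:

    /* pull a received frame out of the device's data FIFO register */
    bus_space_read_multi_2(sc->sc_iot, sc->sc_ioh, FOO_DATA_REG,
        (u_int16_t *)buf, framelen / 2);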

>  
>
>>There is also the case of larger windows where e.g. I might update a
>>bunch of fields for a packet and only need a barrier when I finally tell
>>the hardware I'm done updating fields and it can process the packet (or
>>disk block, or crypto buffer, or whatever.)
>>    
>>
>
>For these devices (e.g. ethernet adapters with the buffer mapped in I/O space,
>or framebuffer devices) the driver can use BUS_SPACE_MAP_PREFETCHABLE and
>bus_space_barrier().
>  
>
I don't think so.  If e.g. I have a wi(4), then I want to ensure that
all reads actually do happen, and I care about the order of access. 
What I don't want is a cache flush between every read.

In fact, what I'd really like to have happen is on hardware that
supports it, not have a cacheable mapping to this memory.

If, however, you can set it up so that bus_space_barrier() knows whether
the memory space is cached or not (e.g. by way of the flags), and is a
noop for those mappings that are not cached, then this would probably be
very good.
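
Something along these lines, assuming a port whose bus_space_handle_t is
a structure that records the flags given to bus_space_map() (the field
names and the cache-flush helper are made up):

    void
    bus_space_barrier(bus_space_tag_t t, bus_space_handle_t h,
        bus_size_t offset, bus_size_t len, int flags)
    {

            /*
             * Uncached mapping: accesses already reach the device in
             * order, so the barrier can be a no-op.
             */
            if ((h.bsh_flags & BUS_SPACE_MAP_CACHEABLE) == 0)
                    return;

            /* cached mapping: fall back to an MD cache flush */
            md_cache_flush(h.bsh_base + offset, len);
    }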

>  
>
>>> The only case where this may not be needed is hardware providing some
>>> ROM or buffer space directly mappable.
>>>
>>>Because of this, I propose the following:
>>>- change the implementation to be in sync with the man page (that is,
>>> remove BUS_SPACE_BARRIER_{READ,WRITE} and implement the 5 barriers
>>> defined in the man page)
>>> 
>>>
>>>      
>>>
>>The problem is that some ports don't have all these levels.  I think you
>>still want to keep the older _READ/WRITE methods, but you can implement
>>them in terms of the finer grained methods.
>>    
>>
>
>I'd say the opposite: keep only the finer grained fields, and on hardware
>on which these don't map, implement using whatever is available.
>  
>
>If hardware can't distinguish between write-before-read and write-before-write,
>flushing the write buffers will fit both cases anyway.
>  
>
What I'm saying is that drivers that want _READ or _WRITE (and don't
care to distinguish between the variations) should be able to express
that simply.  (I.e. BARRIER_WRITE can be a bit field combining
read-before-write and write-before-write.)
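
Concretely, something like the following could live in the header (a
sketch only: the composition of BUS_SPACE_BARRIER_WRITE is what I
described above, and the one for BUS_SPACE_BARRIER_READ is just the
analogous guess):

    /* coarse flags expressed as combinations of the fine-grained ones */
    #define BUS_SPACE_BARRIER_WRITE                                 \
            (BUS_SPACE_BARRIER_READ_BEFORE_WRITE |                  \
             BUS_SPACE_BARRIER_WRITE_BEFORE_WRITE)
    #define BUS_SPACE_BARRIER_READ                                  \
            (BUS_SPACE_BARRIER_READ_BEFORE_READ |                   \
             BUS_SPACE_BARRIER_WRITE_BEFORE_READ)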

In fact, it is so much easier if the mappings are not cached, and you
don't have to worry about all this nonsense. 

>>>I'm also not 100% certain about the change in semantic of
>>>BUS_SPACE_MAP_PREFETCHABLE.
>>>      
>>>
>
>change in semantic for which ? This would match what the man page says.
>  
>
Maybe I'm not understanding correctly.

As I understand it, prefetchable mappings allow the use of a load cache.
The idea that PIO devices like IDE or wi(4) should use prefetchable
mappings to get good performance seems busted to me, because you really
*do not* want these to be cached -- which is what prefetchable says.

Each register read is meaningful, and you cannot have them get
skipped.  If the barrier implementation is a noop then doing it for
each read is harmless, but for some (many?) ports, the barrier flushes
the entire system cache, which is a really bad idea to do for every
32 bits of a disk block.

>  
>
>>I would actually think it would be the reverse -- if you mapped it
>>prefetchable, then you are saying the hardware doesn't care (e.g. memory
>>without side effects) and you shouldn't need barriers.
>>    
>>
>
>Hum, I don't see how this could be used. There would always be one point
>where you care. This would seriously reduce the scope of
>BUS_SPACE_MAP_PREFETCHABLE.
>  
>
PREFETCHABLE mappings (as I understand it) imply that load caching is
permitted.  The only point at which you should care (if you are using
prefetchable) is on the initial load.  So my interpretation is that
prefetchable should probably only be used for mapped ROMs and such.  Any
memory that has side effects probably should not be mapped
prefetchable.
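
Under that interpretation, the natural consumer looks something like
this (the address and size variables are placeholders):

    /*
     * Map an expansion ROM: no side effects, so prefetching and a
     * linear mapping are fine.
     */
    bus_space_handle_t romh;

    if (bus_space_map(sc->sc_memt, rom_addr, rom_size,
        BUS_SPACE_MAP_LINEAR | BUS_SPACE_MAP_PREFETCHABLE, &romh) != 0)
            return;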

I guess BUS_SPACE_MAP_CACHEABLE also implies storecaching is OK?

>  
>
>>I think to get this right, you need to have more flags to bus_space_map
>>which describe the needs of the target device (see the flags given by
>>Solaris above) so that the implementation can do the "right" thing.
>>    
>>
>
>For now I don't see a use for more flags. Either the driver needs all
>bus_space accesses to be ordered, or the driver chooses which are ordered
>with bus_space_barrier() and no bus_space access should be ordered
>implicitly. Anything in between would only fit a very small number of
>drivers.
>  
>
True enough.

>Also note that I only mentioned BUS_SPACE_MAP_PREFETCHABLE, but there are
>other flags available: BUS_SPACE_MAP_LINEAR and BUS_SPACE_MAP_CACHEABLE.
>
>  
>
>>Just randomly applying barriers to all bus space accesses could have a
>>nasty impact on performance.  For a lot of hardware these barriers are
>>implemented as a cache flush.  Now imagine that I'm working with DMA
>>mapped descriptors (e.g. a typical NIC).  Instead of having a single
>>cache flush when I finish writing several (maybe up to a dozen?) fields
>>for a given descriptor, I am going to get several dozen cache flushes.
>>    
>>
>
>Hum, I think you're mixing bus_space and bus_dma here. Memory barriers for
>DMA memory are done via bus_dmamap_sync().
>  
>
You're right -- I was briefly confused.  I thought I eradicated DMA
references, but I blew it.  Sorry about that.

>I'm only talking about bus_space. Note that, *right now*, all bus_space
>implementations have implicitly ordered reads/writes, because most drivers don't
>use bus_space_barrier() at all.
>  
>
Ordering read/write is okay, but I don't want ordered read/read or
write/write by default -- that would cause PIO performance to go down
the drain.

If you look at e.g. the mips bus_space_align_chipdep.c, you find that it
only does the implicit ordering for bus_space_xxxx_multi.  So I'm not
sure I agree with your statement above.

>>Now since this probably happens with the interrupts blocked, it might
>>not be that bad -- for a uniprocessor system -- but it probably depends
>>on the hardware whether a cache flush is evil or not.  On SMP hardware
>>it seems likely that this would be very unfriendly.  (Hopefully the SMP
>>hardware has something to make these efficient, but in the worst case it
>>might require some expensive calls to flush *two* caches.)
>>    
>>
>
>This is where implicit bus_space barriers are a win: the MD implementation
>can choose to map these registers uncached, so no cache flushes are
>needed. With the current implementation, it should be mapped uncached
>unless BUS_SPACE_MAP_CACHEABLE is specified, and I don't plan to change that.
>  
>
I agree.  Having the MD code do the "right" thing seems appropriate.
But this also means it's a bad idea to overload the meaning of
prefetchable in devices like wi(4) and IDE.

>>In general, I like the idea of removing this logic from drivers; they
>>shouldn't have to worry about this (other than establishing the type of
>>support needed when the device is first mapped) and the bus should take
>>care of it.
>>
>>But getting this right is going to be a big job, and require changes to
>>each of the ports.
>>    
>>
>
>Not that big, because the current status quo is already to have implicit
>barriers in bus_space accesses. Changing BUS_SPACE_MAP_PREFETCHABLE
>and BUS_SPACE_MAP_CACHEABLE to do something is not required for correct
>operation, it's just a performance improvement and I'll leave this
>to portmasters.
>  
>
As long as it's the read/write barrier that continues to be implicit.  But I
think there are device drivers that need read/read or write/write or
write/read barriers, and that removing the barriers from those drivers
means putting a more strict barrier in the implementation.  That will
cause things like wi(4) and IDE (in PIO mode) to go slower.

>>Mightn't this be a good time to try to centralize some of this logic?  
>>Imagine instead a common bus_space layer that provides a single handle,
>>which exports function pointers for each of the bus space routines
>>(including a function for barriers.)  Then you have somewhere to store
>>the flags, and can probably export some "common" routines for use by
>>typical ports -- e.g. memory mapped busses that have very little
>>differences from one another.
>>    
>>
>
>Then you'll always have a function pointer dereference, which on some
>ports is expensive. On some ports, bus_space is implemented as macros
>instead.
>
>  
>
Ah, but haven't we heard about how evil macros have caused the code size
to grow and have other impacts on performance (due to cache effects)?

What platforms exist where function pointer dereferences are that slow?
There is the cache impact, but if you're calling the same function over
and over again, it should be in the i-cache already.

The function call overhead could be an issue, but if you don't have
that, then you'll have that code growth problem caused by macros
instead.  And you're going to have function calls if you're going to add
implicit barriers, anyway.
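
For illustration, such a layer might look roughly like this (names
invented; some ports already implement their tags this way):

    /* sketch of a function-pointer based bus_space tag */
    struct bus_space_tag {
            uint8_t (*bst_read_1)(bus_space_handle_t, bus_size_t);
            void    (*bst_write_1)(bus_space_handle_t, bus_size_t, uint8_t);
            void    (*bst_barrier)(bus_space_handle_t, bus_size_t,
                        bus_size_t, int);
            /* ...the other widths and the _multi/_region variants... */
            int     bst_flags;      /* e.g. the BUS_SPACE_MAP_* flags */
    };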

-- 
Garrett D'Amore                          http://www.tadpolecomputer.com/
Sr. Staff Engineer          Extending the Power of 64-bit UNIX Computing
Tadpole Computer, Inc.                             Phone: (951) 325-2134