Subject: Re: change bus_space_barrier() semantic
To: Garrett D'Amore <garrett_damore@tadpole.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 01/07/2006 22:20:03
On Sat, Jan 07, 2006 at 12:14:16PM -0800, Garrett D'Amore wrote:
> >>
> >>Each of these consecutive levels implies all of the previous.  (I.e.
> >>LOADCACHING means that the mapped region can also support any of the
> >>previous methods.)
> >>    
> >>
> >
> >Isn't DDI_STRICTORDER_ACC incompatible with DDI_MERGING_OK_ACC ?
> >  
> >
> Depends on how you mean.  You cannot do DDI_MERGING_OK_ACC if you have
> DDI_STRICTORDER_ACC set.  These are not bits, but rather values.  All
> kinds of mapping support DDI_STRICTORDER_ACC, but some kinds also
> support these other values.  Supporting DDI_MERGING_OK_ACC means that
> the hardware better support UNORDERED and STRICTORDER.  (And all
> hardware always supports STRICTORDER, at least if it is operative.)

Hum, then I guess there is some hardware not well described by this model,
although I don't have examples at hand.

> >In the first case, I can't see how this can be useful in real life.
> >In the second case, the driver should probably use bus_space_write_2 in
> >the first place.
> >  
> >
> Probably, but sometimes optimizations or bus logic makes that hard to
> do.  For example, imagine two 8 bit registers side by side.  One of them
> gets updated in an interrupt, and one gets updated in main line code. 
> You can't make this work right easily from a software perspective, but
> hardware can.

But I can't think of hardware where this would be desirable. This is
typically the kind of situation where you want write order to be preserved,
and probably synchronous writes in the interrupt case.

> >
> >>>5- for most hardware, a barrier is needed for *each* bus_space access.
> >>> 
> >>>
> >>>      
> >>>
> >>I disagree with this statement.  There are lot of devices I've run into
> >>where the barriers are only needed on e.g. one register, or a certain
> >>group of registers.
> >>    
> >>
> >
> >For normal operations (i.e. frequent operations), there are usually only a few
> >registers access, and ordering is important (I'm not talking about device
> >setup, like what happens when an ethernet interface state changes from down
> >to up or when the multicast filter changes: this is rare enough to not care
> >about performances penalty implied by the bus_space sync access).
> >  
> >
> What about PIO devices that stash entire frames/buffers in register
> space.  Think of wi(4) or IDE as examples.  These don't need or want
> prefetch or caching, but reading a bunch of 32-bit words shouldn't cause
> a gazillion cache flushes to occur.

And if the bus_space was mapped without prefetch and caching, why would the
bus_space_read cause a cache flush to occur in this case?
This is specifically the case where you want implicit bus_space barriers.

> >
> >>There is also the case of larger windows where e.g. I might update a
> >>bunch of fields for a packet and only need a barrier when I finally tell
> >>the hardware I'm done updating fields and it can process the packet (or
> >>disk block, or crypto buffer, or whatever.)
> >>    
> >>
> >
> >For these devices (e.g. ethernet adapters with the buffer mapped in I/O space,
> >or framebuffer devices) the driver can use BUS_SPACE_MAP_PREFETCHABLE and
> >bus_space_barrier().
> >  
> >
> I don't think so.  If e.g. I have a wi(4), then I want to ensure that
> all reads actually do happen, and I care about the order of access. 
> What I don't want is a cache flush between every read.

And this is specifically the case where you don't want
BUS_SPACE_MAP_PREFETCHABLE. BUS_SPACE_MAP_PREFETCHABLE isn't appropriate for
drivers using bus_space_read/write_multi_*, but for drivers using
bus_space_read/write_region_*. In the _multi_ case, either the space can
be mapped uncached, and then no cache flush is needed (and the
barrier, implicit or explicit, shouldn't issue one), or the space
can't be mapped uncached and the cache flush is unavoidable.
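
To illustrate (a minimal sketch with hypothetical device and register names):
the _multi_ form hammers the same register address, so ordering between the
individual accesses matters and prefetch is out of the question, while the
_region_ form walks consecutive addresses and is the one that can benefit
from a PREFETCHABLE mapping:

#include <sys/types.h>
#include <machine/bus.h>

#define MYDEV_DATA_FIFO   0x00    /* hypothetical 16-bit data FIFO register */
#define MYDEV_BUF_WINDOW  0x100   /* hypothetical on-board buffer window */

/* every read hits the same address: must stay ordered, no prefetch wanted */
static void
mydev_read_frame(bus_space_tag_t iot, bus_space_handle_t ioh,
    uint16_t *buf, bus_size_t nwords)
{
        bus_space_read_multi_2(iot, ioh, MYDEV_DATA_FIFO, buf, nwords);
}

/* consecutive addresses: a prefetchable mapping is fine for this window */
static void
mydev_copy_buffer(bus_space_tag_t memt, bus_space_handle_t memh,
    uint16_t *buf, bus_size_t nwords)
{
        bus_space_read_region_2(memt, memh, MYDEV_BUF_WINDOW, buf, nwords);
}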

> 
> In fact, what I'd really like to have happen is on hardware that
> supports it, not have a cacheable mapping to this memory.

This is why there is a BUS_SPACE_MAP_CACHEABLE flag to bus_space_map.
The space won't be mapped cacheable (when the hardware makes that possible)
if BUS_SPACE_MAP_CACHEABLE is not specified.
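
For example (a sketch, with a hypothetical base address and size, error
handling trimmed):

        bus_space_handle_t memh;

        /*
         * memt comes from the parent bus' attach arguments; no
         * BUS_SPACE_MAP_CACHEABLE in the flags, so the MD code is expected
         * to give us an uncached mapping when the hardware allows it
         */
        if (bus_space_map(memt, MYDEV_BASE, MYDEV_SIZE, 0, &memh) != 0)
                return EIO;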

> 
> If however, you can set it up so that bus_space_barrier knows whether
> the memory space is cached or not (e.g. by way of the flags), and is a
> noop for those mappings that are not cached, then this would probably be
> very good.

bus_space_barrier already has the information needed for this. If it's not
using it, that's a bug.

> >
> >>> The only case where this may not be needed is hardware providing some
> >>> ROM or buffer space directly mappable.
> >>>
> >>>Because of this, I propose the following:
> >>>- change the implementation to be in sync with the man page (that is,
> >>> remove BUS_SPACE_BARRIER_{READ,WRITE} and implement the 5 barriers
> >>> defined in the man page)
> >>> 
> >>>
> >>>      
> >>>
> >>The problem is that some ports don't have all these levels.  I think you
> >>still want to keep the older _READ/WRITE methods, but you can implement
> >>them in terms of the finer grained methods.
> >>    
> >>
> >
> >I'd say the opposite: keep only the finer grained fields, and on hardware
> >on which these don't map, implement using whatever is available.
> >  
> >
> >If hardware can't make a difference between write before read and write
> >before write, flushing the write buffers will fit both cases anyway.
> >  
> >
> What I'm saying is that drivers that want _READ or _WRITE (and don't
> care to distinguish between the variations) should be able to express
> that simply.  (I.e. BARRIER_WRITE can be a bit field combining
> read-before-write and write-before-write.)

Ha, OK. Yes, we could keep this as a shortcut. But it's better if
drivers specify their needs accurately, so that the hardware layer can
provide exactly what they need. I think read-before-read and write-before-write
are the most common cases.
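
Something like this would do as the shortcut, assuming the fine-grained flag
names from the man page become the implemented ones (sketch only, not
existing code, following the combination you suggest above):

/* coarse names kept as shortcuts over the fine-grained flags */
#define BUS_SPACE_BARRIER_READ                                  \
        (BUS_SPACE_BARRIER_READ_BEFORE_READ |                   \
         BUS_SPACE_BARRIER_WRITE_BEFORE_READ)
#define BUS_SPACE_BARRIER_WRITE                                 \
        (BUS_SPACE_BARRIER_READ_BEFORE_WRITE |                  \
         BUS_SPACE_BARRIER_WRITE_BEFORE_WRITE)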

> 
> In fact, it is so much easier if the mappings are not cached, and you
> don't have to worry about all this nonsense. 

Cache is not the only issue. There is read/write reordering by the CPU,
instruction reordering by the CPU, and instruction reordering by the
compiler. The barrier also has to handle that.
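
For example (hypothetical register names), even with an uncached mapping a
weakly ordered CPU or host bridge can still let the status read overtake the
command write unless a barrier sits between the two accesses:

        bus_space_write_1(iot, ioh, MYDEV_CMD, MYDEV_CMD_GO);
        /*
         * write ordering: BUS_SPACE_BARRIER_WRITE with the current flags,
         * something like BUS_SPACE_BARRIER_WRITE_BEFORE_READ with the
         * proposed fine-grained ones
         */
        bus_space_barrier(iot, ioh, MYDEV_CMD, 1, BUS_SPACE_BARRIER_WRITE);
        status = bus_space_read_1(iot, ioh, MYDEV_STATUS);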

> 
> >>>I'm also not 100% certain about the change in semantic of
> >>>BUS_SPACE_MAP_PREFETCHABLE.
> >>>      
> >>>
> >
> >change in semantic for which ? This would match what the man page says.
> >  
> >
> Maybe I'm not understanding correctly.
> 
> As I understand, prefetchable mappings allow the use of a load cache.  The

"Try to map the space so that accesses can be prefetched by the system, and
writes can be buffered"

> idea that PIO devices like IDE or wi(4) should use prefetchable
> mappings to get good performance seems busted to me, because you really
> *do not* want these to be cached -- which is what prefetchable says.

Err, no. These devices really don't want prefetchable mappings. 

> 
> Each register read is meaningful, and you cannot have them get
> skipped.   If the barrier implementation is a noop then doing it for
> each read is harmless, but for some (many?) ports, the barrier flushes
> the entire system cache, which is a really bad idea to do for each
> 32-bits of a disk block.

Yes, of course.

I think we agree on this, but I don't see where one of us is misunderstanding
what the other wrote ...

> 
> >  
> >
> >>I would actually think it would be the reverse -- if you mapped it
> >>prefetchable, then you are saying the hardware doesn't care (e.g. memory
> >>without side effects) and you shouldn't need barriers.
> >>    
> >>
> >
> >Hum, I don't see how this could be used. There would always be one point
> >where you care. This would seriously reduce the scope of
> >BUS_SPACE_MAP_PREFETCHABLE.
> >  
> >
> PREFETCHABLE mappings (as I understand it) imply that load caching is
> permitted.  The only point at which you should care (if you are using
> prefetchable) is on the initial load.  So my interpretation is that
> prefetchable should probably only be used for mapped ROMs and such.  Any
> memory that has side effects with it probably should not be mapped
> prefetchable.
> 
> I guess BUS_SPACE_MAP_CACHEABLE also implies storecaching is OK?

Yes. Hum, I think we're agreeing on this, but don't understand each other.
An example where PREFETCHABLE is desirable, but you still want memory
barriers, are ethernet drivers for devices which have their buffer mapped
(for example ix(4), which has its ram mapped in memory space).
To transmit a packet you copy the data to the mapped ram (with
bus_space_write_region, not bus_space_write_multi - on hardware where this
is memory-mapped linearly, it's just a memcpy, but on some hardware it may
have to be a bus_space_write in a loop) and then start the transmit with
a write to another register. You don't care if the writes to the ram are
reordered, they're each at different addresses. However, all these writes
have to be completed before the write which starts the transmit.
In this case it's desirable to map the space PREFETCHABLE because the
data copy to/from the device's ram will be much faster, and you enforce
ordering of accesses to the control registers with a bus_space_barrier.
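
As a sketch (hypothetical register names and offsets, loosely modelled on
such a device):

#define MYDEV_TX_BUF    0x0000  /* hypothetical: mapped packet ram */
#define MYDEV_TX_START  0x4000  /* hypothetical: "start transmit" register */

static void
mydev_start_xmit(bus_space_tag_t memt, bus_space_handle_t memh,
    const uint8_t *pkt, bus_size_t len)
{
        /*
         * copy the frame into the device's ram; the mapping can be
         * PREFETCHABLE, these writes may be merged or reordered freely
         */
        bus_space_write_region_1(memt, memh, MYDEV_TX_BUF, pkt, len);

        /*
         * but they must all have reached the device before the command:
         * BUS_SPACE_BARRIER_WRITE with the current flags,
         * BUS_SPACE_BARRIER_WRITE_BEFORE_WRITE with the proposed ones
         */
        bus_space_barrier(memt, memh, MYDEV_TX_BUF, len,
            BUS_SPACE_BARRIER_WRITE);

        /* kick off the transmit */
        bus_space_write_2(memt, memh, MYDEV_TX_START, (uint16_t)len);
}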

> 
> >  
> >
> >>I think to get this right, you need to have more flags to bus_space_map
> >>which describe the needs of the target device (see the flags given by
> >>Solaris above) so that the implementation can do the "right" thing.
> >>    
> >>
> >
> >For now I don't see a use for more flags. Either the driver needs all
> >bus_space accesses to be ordered, or the driver chooses which are ordered
> >with bus_space_barrier() and no bus_space access should be ordered
> >implicitly. Anything between this would only fit a very small number of
> >drivers.
> >  
> >
> True enough.
> 
> >Also note that I only mentionned BUS_SPACE_MAP_PREFETCHABLE, but there are
> >other flags available: BUS_SPACE_MAP_LINEAR and BUS_SPACE_MAP_CACHEABLE.
> >
> >  
> >
> >>Just randomly applying barriers to all bus space accesses could have a
> >>nasty impact on performance.  For a lot of hardware these barriers are
> >>implemented as a cache flush.  Now imagine that I'm working with DMA
> >>mapped descriptors (e.g. a typical NIC).  Instead of having a single
> >>cache flush when I finish writing several (maybe up to a dozen?) fields
> >>for a given descriptor, I am going to get a several dozen cache flushes.
> >>    
> >>
> >
> >Hum, I think you're mixing bus_space and bus_dma here. Memory barriers for
> >dma memory are done via bus_dmamap_sync().
> >  
> >
> You're right -- I was briefly confused.  I thought I eradicated DMA
> references, but I blew it.  Sorry about that.
> 
> >I'm only talking about bus_space. Note that, *right now*, all bus_space
> >implementations have implicit ordered read/write, because most drivers don't
> >use bus_space_barrier() at all.
> >  
> >
> Ordering read/write is okay, but I don't want ordered read/read or
> write/write by default -- that would cause PIO performance to go down
> the drain.

Err, I don't understand. If you're referring to devices like IDE
or wi, these read/read or write/write accesses *have to* be ordered, as they
all access the same register address! You don't want to have 2 accesses
reordered, as that would swap 2 bytes in your data stream.

> 
> If you look at e.g. the mips bus_space_align_chipdep.c, you find that it
> only does the implicit ordering for bus_space_xxxx_multi.   So I'm not
> sure I agree with your statement above.

Hum. So either mips doesn't need anything special for register access
barriers, or some drivers will break. Most drivers assume implicit barriers.

> 
> >>Now since this probably happens with the interrupts blocked, it might
> >>not be that bad -- for a uniprocessor system -- but it probably depends
> >>on the hardware whether a cache flush is evil or not.  On SMP hardware
> >>it seems likely that this would be very unfriendly.  (Hopefully the SMP
> >>hardware has something to make these efficient, but in the worst case it
> >>might require some expensive calls to require *two* caches to be flushed.)
> >>    
> >>
> >
> >This is where implicit bus_space barriers are a win: the MD implementation
> >can choose to map these registers uncached, so no cache flushes are
> >needed. With the current implementation, it should be mapped uncached
> >unless BUS_SPACE_MAP_CACHEABLE is specified, and I don't plan to change that.
> >  
> >
> I agree.  Having the MD code do the "right" thing seems appropriate. 
> But this also means it's a bad idea to overload the meaning of
> prefetchable in devices like wi(4) and IDE.

Again I think we agree, but I don't understand what you mean by
"overload the meaning of prefetchable".

> 
> >>In general, I like the idea of removing this logic from drivers, they
> >>shouldn't have to worry about this (other than establishing the type of
> >>support needed when the device is first mapped) and the bus should take
> >>care of it.
> >>
> >>But getting this right is going to be a big job, and require changes to
> >>each of the ports.
> >>    
> >>
> >
> >Not that big, because the current status-quo is already to have implicit
> >barriers in bus_space accesses. Changing BUS_SPACE_MAP_PREFETCHABLE
> >and BUS_SPACE_MAP_CACHEABLE to do something is not required for correct
> >operations, it's just a performance improvement and I'll leave this
> >to portmasters.
> >  
> >
> As long as its read/write barrier that continues to be implicit.  But I
> think there are device drivers that need read/read or write/write or
> write/read barriers, and that removing the barriers from those drivers
> means putting a more strict barrier in the implementation.  That will
> cause things like wi(4) and IDE (in PIO mode) to go slower.

I can't see why this would be slower than the current code. These
devices need read/read and write/write barriers on every register
access.

> 
> >>Mightn't this be a good time to try to centralize some of this logic?  
> >>Imagine instead of common bus_space layer that provides a single handle,
> >>which exports function pointers for each of the bus space routines
> >>(including a function for barriers.)  Then you have somewhere to store
> >>the flags, and can probably export some "common" routines for use by
> >>typical ports -- e.g. memory mapped busses that have very little
> >>differences from one another.
> >>    
> >>
> >
> >Then you'll always have a function pointer dereference, which on some
> >ports is expensive. On some ports, bus_space is implemented as macros
> >instead.
> >
> >  
> >
> Ah, but haven't we heard about how evil macros have caused the code size
> to grow and have other impacts on performance (due to cache effects)?

The macro can just reference a function, or contain a single asm instruction,
or just degenerate to an empty string.
It depends on the hardware (and some hardware doesn't have a cache :)
Here's an example from sun68k:
#define bus_space_read_1(t, h, o)                                       \
            ((void)t, *(volatile uint8_t *)((h) + (o)))
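
And on a port that does need some ordering on each access, the same macro
can grow it inline without going through a function pointer (sketch only;
the real synchronizing instruction is MD, a plain compiler barrier stands in
for it here):

#define bus_space_read_1(t, h, o)                                       \
            ({                                                          \
                uint8_t __bsr1_v;                                       \
                (void)(t);                                              \
                __bsr1_v = *(volatile uint8_t *)((h) + (o));            \
                /* compiler barrier; a port would use its sync insn */  \
                __asm volatile("" ::: "memory");                        \
                __bsr1_v;                                               \
            })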

> 
> What platforms exist where function pointer deferences are that slow? 

I think vax is an example.

> There is the cache impact, but if you're calling the same function over
> and over again, it should be in the i-cache already.
> 
> The function call overhead could be an issue, but if you don't have
> that, then you'll have that code growth problem caused by macros
> instead.  And you're going to have function calls if you're going to add
> implicit barriers, anyway.

Again it depends on the platform. Some are not smart enough to reorder
memory accesses.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--