Subject: Re: change bus_space_barrier() semantic
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Garrett D'Amore <garrett_damore@tadpole.com>
List: tech-kern
Date: 01/09/2006 03:39:32
Manuel Bouyer wrote:

>On Sat, Jan 07, 2006 at 12:14:16PM -0800, Garrett D'Amore wrote:
>  
>
>>>>Each of these consecutive levels implies all of the previous.  (I.e.
>>>>LOADCACHING means that the mapped region can also support any of the
>>>>previous methods.)
>>>>   
>>>>
>>>>        
>>>>
>>>Isn't DDI_STRICTORDER_ACC incmpatible with DDI_MERGING_OK_ACC ?
>>> 
>>>
>>>      
>>>
>>Depends on how you mean.  You cannot do DDI_MERGING_OK_ACC if you have
>>DDI_STRICTORDER_ACC set.  These are not bits, but rather values.  All
>>kinds of mapping support DDI_STRICTORDER_ACC, but some kinds also
>>support these other values.  Supporting DDI_MERGING_OK_ACC means that
>>the hardware better support UNORDERED and STRICTORDER.  (And all
>>hardware always supports STRICTORDER, at least if it is operative.)
>>    
>>
>Hum, then I guess there are some hardware not well described by this model,
>although I don't examples in hand.
>  
>
Not described by which model, the NetBSD one or the Solaris one?

Are you saying that there is a case where hardware caching would be 
desirable, but yet merging or reordering accesses would be bad?  I can't 
imagine this to be true, but maybe I'm not understanding what you are 
meaning here.

>  
>
>>>In the first case, I can't see how this can be usefull in real life.
>>>In the second case, the driver should probably use bus_space_write_2 in
>>>the first place.
>>> 
>>>
>>>      
>>>
>>Probably, but sometimes optimizations or bus logic makes that hard to
>>do.  For example, imagine two 8 bit registers side by side.  One of them
>>gets updated in an interrupt, and one gets updated in main line code. 
>>You can't make this work right easily from a software perspective, but
>>hardware can.
>>    
>>
>
>But I don't think of a hardware were this would do desireable. This is
>typically the kind of situation where you want write order to be preserved,
>and probably synchronous in the interrupt case.
>  
>
You would want to ensure that the write completed in the interrupt case, 
but if another register write was beside it and had been written 
earlier, it might be benign to write them both concurrently in the 
interrupt case.

I can't think of any concrete examples of this, just hypothesizing that 
such devices might exist.

>  
>
>>>>There is also the case of larger windows where e.g. I might update a
>>>>bunch of fields for a packet and only need a barrier when I finally tell
>>>>the hardware I'm doing updating fields and it can process the packet (or
>>>>disk block, or crypto buffer, or whatever.)
>>>>   
>>>>
>>>>        
>>>>
>>>For these devices (e.g. ethernet adapters with the buffer mapped in I/O space,
>>>or framebuffer devices) the driver can use BUS_SPACE_MAP_PREFETCHABLE and
>>>bus_space_barrier().
>>> 
>>>
>>>      
>>>
>>I don't think so.  If e.g. I have a wi(4), then I want to ensure that
>>all reads actually do happen, and I care about the order of access. 
>>What I don't want is a cache flush between every read.
>>    
>>
>
>And this is specifially the case where you don't want
>BUS_SPACE_MAP_PREFETCHABLE. BUS_SPACE_MAP_PREFETCHABLE isn't appropriate for
>drivers using bus_space_read/write_multi_*, but for drivers using
>bus_space_read/write_region_*. in the _multi_ case, either the space can
>be mapped uncached, and then there is no cache flush needed (and the
>barrier, implicit or explicit, shouldn't issue one) 
>
A lot (most?) of MIPS-based ports seem to ignore the mapping and flush 
the write cache in the barrier.  I've not  adequately checked the other 
ports.  More about this "bug" below.

However, I think I am seeing your point now.

>or the space
>can't be mapped uncached and the cache flush is unavoidable.
>
>  
>
>>In fact, what I'd really like to have happen is on hardware that
>>supports it, not have a cacheable mapping to this memory.
>>    
>>
>
>This is why there is BUS_SPACE_MAP_CACHEABLE flag to bus_space_map.
>The space won't be mapped cachable (if possible) if BUS_SPACE_MAP_CACHEABLE
>is not specified.
>  
>
So a typical PCI implementation would then have nop for barrier?  
(Assuming driver mapped uncacheable and not prefetchable?)  This sounds 
pretty good to me.

>  
>
>>If however, you can set it up so that bus_space_barrier knows whether
>>the memory space is cached or not (e.g. by way of the flags), and is a
>>noop for those mappings that are not cached, then this would probably be
>>very good.
>>    
>>
>
>bus_space_barrier already has the informations for this. If it's not using it
>it's a bug.
>  
>
Okay.  Will you be fixing any implementation(s)? that suffer from this 
defect?

>>What I'm saying is that drivers that want _READ or _WRITE (and don't
>>care to distinguish between the variations) should be able to express
>>that simply.  (I.e. BARRIER_WRITE can be a bit field combining
>>read-before-write and write-before-write.)
>>    
>>
>
>Ha, OK. Yes, we could keep this as a shortcut. But it's better if
>drivers specify accurately their needs, to that hardware can acurately
>specify what they need. I think read-before-read and write-before-write
>are the most common cases.
>  
>
I think there are a lot of cases where the driver just wants plain, 
unordered, uncached access to device registers.  Accurately figuring out 
barriers beyond that is likely to be somewhat painful for a large 
variety of drivers.

A big part of my misunderstanding probably stemmed from a bug in the 
mips implementation (and also that I'm still pretty new to NetBSD.)  
More on that later.

>  
>
>>In fact, it is so much easier if the mappings are not cached, and you
>>don't have to worry about all this nonsense. 
>>    
>>
>
>Cache is not the only issue. There are read/write reordering by the CPU,
>instruction reordering by the CPU, and instruction reordering by the
>compiler. The barrier also have to handle that.
>  
>
OK, I missed that point.

>  
>
>>>>>I'm also not 100% certain about the change in semantic of
>>>>>BUS_SPACE_MAP_PREFETCHABLE.
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>change in semantic for which ? This would match what the man page says.
>>> 
>>>
>>>      
>>>
>>Maybe I'm not understanding correctly.
>>
>>As I understand, prefetchable mappings allow the use a load cache.  The
>>    
>>
>
>"Try to map the space so that accesses can be prefetched by the system, and
>writes can be buffered"
>  
>
I missed the second part of that statement (doh), and was assuming 
CACHEABLE semantics as well.  My apologies. 

>>idea that for PIO devices like IDE or wi(4) should use prefetchable
>>mappings to get good performance seems busted to me, because you really
>>*do not* want these to be cached -- which is what prefetchable says.
>>    
>>
>
>Err, no. These devices really don't want prefetchable mappings. 
>  
>
Which was my point.

[lots of detail snipped]

>Yes. Hum, I think we're agreeing on this, but don't understand each other.
>  
>
This is likely the case. :-)  More below.

>An example where PREFETCHABLE is desireable, but you still want memory
>barrier are ethernet drivers for device which have their buffer mapped
>(for example ix(4), which has its ram mapped in memory space).
>To transmit a packet you copy the data to the mapped ram (with
>bus_space_write_region, not bus_space_write_multi - on hardware where this
>is memory-mapped linearly, it's just a memcpy, but on some hardware it may
>have to be a bus_space_write in a loop) and then start the transmit with
>a write to another register. You don't care if the writes to the ram are
>reordered, they're each at different addresses. however, all these writes
>have to be completed before the write which starts the transmit.
>In this case it's desireable to to map the space PREFETCHABLE because the
>data copy to/from the device's ram will be much faster, and you enforce
>ordering of access to the control registers with a bus_space_barrier.
>  
>
I agree that this kind of device may work well.  But what impact will 
prefetch have on other register accesses to the device?  I guess the 
driver can explictly use barriers everywhere to get correct behavior.   
(This creates two kinds of driver -- those that use prefetch and 
explicit barriers, and those that rely on implicit barriers.)

I think the documentation will need to be updated.  It would be nice if 
the name of the flag BUS_SPACE_MAP_PREFETCHABLE somehow reflected the 
fact that barriers were explicit vs. implicit.

>  
>
>>>I'm only talking about bus_space. Note that, *right now*, all bus_space
>>>implementation have implicit ordered read/write, because most drivers don't
>>>use bus_space_barrier() at all.
>>> 
>>>
>>>      
>>>
>>Ordering read/write is okay, but I don't want ordered read/read or
>>write/write by default -- that would cause PIO performance to go down
>>the drain.
>>    
>>
>
>Err, I don't understand. If you're making reference to devices like IDE
>or wi, these read/read or write/write *have to* be ordered, as they all
>access the same register address ! you don't want to have 2 access reordered,
>as it would swap 2 bytes in your data stream.
>  
>
Yes, they do have to be ordered.  And this works today because these are 
located in uncached memory space, so the barrier isn't strictly 
needed.   More on that below, which will hopefully clear it up.

>  
>
>>If you look at e.g. the mips bus_space_align_chipdep.c, you find that it
>>only does the implicit ordering for bus_space_xxxx_multi.   So I'm not
>>sure I agree with your statement above.
>>    
>>
>
>Hum. So either mips don't need anything special for registers access barrier,
>or some drivers will break. Most drivers assume implicit barrier.
>  
>
Most device on MIPS plaforms are located in uncached memory space, so 
the assumption is *usually* correct.

Yet the barrier operation is non-null.  One could argue reasonably that 
this is a bug in the mips implementation.  But until this bug is fixed, 
your proposed changes will have a bad effect on performance, at least on 
MIPS hardware.

I wonder if other ports have similar issues.  I have not done a survey 
of them.  Have you?

>  
>
>>I agree.  Having the MD code do the "right" thing seems appropriate. 
>>But this also means its a bad idea to overload the meaning of
>>prefetchable in devices like wi(4) and IDE.
>>    
>>
>
>Again I think we agree but I don't understand what you mean with
>"overload the meaning of prefetchable"
>  
>
I went back and reread the description of prefetch in the man page, and 
I concur.  I think the name BUS_SPACE_MAP_PREFETCHABLE is a poor choice, 
because it doesn't describe the idempotent and the write side behavior 
(IMO), but in this case the misunderstanding was mine.  I apologize.

>>>>In general, I like the idea of removing this logic from drivers, they
>>>>shouldn't have to worry about this (other than establishing the type of
>>>>support needed when the device is first mapped) and the bus should take
>>>>care of it.
>>>>
>>>>But getting this right is going to be a big job, and require changes to
>>>>each of the ports.
>>>>   
>>>>
>>>>        
>>>>
>>>Not that big, because the current status-quo is already to have implicit
>>>barriers in bus_space accesses. Changing BUS_SPACE_MAP_PREFETCHABLE
>>>and BUS_SPACE_MAP_CACHEABLE to do something is not required for correct
>>>operations, it's just a performance improvements and I'll leave this
>>>to portmasters.
>>> 
>>>
>>>      
>>>
>>As long as its read/write barrier that continues to be implicit.  But I
>>think there are device drivers that need read/read or write/write or
>>write/read barriers, and that removing the barriers from those drivers
>>means putting a more strict barrier in the implementation.  That will
>>cause things like wi(4) and IDE (in PIO mode) to go slower.
>>    
>>
>
>This I can't see why it would be slower than the current code. These
>devices needs read/read and write/write barrier on every register
>access.
>  
>
Concrete example: wi(4) on mips.  Currently these just "work" because 
the PCI space is mapped uncached.  But the barrier ops for that platform 
issues a cache flush.

I guess a reasonable argument is that the platform code is busted 
because it should know whether it is uncached or not, and do the "right" 
thing.  But it doesn't, and the end result of your changes will 
seriously impact performance of these devices on (probably) all MIPS 
platforms.  I suspect that other ports may have similar issues, but I 
have not checked.

>The macro can just reference a function, or contain a single asm instruction,
>or just degerate to empty string.
>It depends on the hardware (and some hardware don't have cache :)
>Here's an example from sun68k:
>#define bus_space_read_1(t, h, o)                                       \
>            ((void)t, *(volatile uint8_t *)((h) + (o)))
>
>  
>
>>What platforms exist where function pointer deferences are that slow? 
>>    
>>
>
>I think vax is an example.
>  
>
Okay, if the consensus is that centralizing the implementation would 
impact performance on platforms that we care about, then I concede this 
point and retract my proposal to centralize the implementation.

    -- Garrett