Subject: Re: bus_dmamap_sync not enough ?
To: Jason R Thorpe <thorpej@wasabisystems.com>
From: Stephan Uphoff <ups@stups.com>
List: tech-kern
Date: 02/13/2003 21:17:58
> If you want to write clarifying text for the manual page, that would
> be terrific.

"Want" might not be the right word - but I guess I kind of volunteered.

I think we agree that adding an API to find out the bus_dmamap_sync 
footprint and describing the consequences of the footprint is a
good idea.

In additional HOWTO explanations should be grouped in three sections:

	Use as described by the interface:
		CPU or DMA use the memory - never both concurrently

	"Save" abuses of the interface:
 		CPU read operations and concurrent DMA read operations
		CPU read/write operation and concurrent DMA write
		operations

	"Dangerous" abuses of the interface
		CPU write operation and concurrent DMA read operation	 

Should I send drafts/questions directly to you and keep
it off the kernel list ? 

For the long term I would like to see an extended bus_dma interface so
that all drivers can be written without the use of the abuse sections.


 
>  > b) Unfortunately there is currently no machine independent way to determine the
>  >    bus_dmamap_sync footprint of a given bus_dmamap_sync call.
> 
> Yah, one possible enhancement to the API might be to add a call which
> takes a bus_dma_tag_t and returns the "sync footprint" (which would
> essentially be the cacheline size, in most cases).  Then drivers would
> make intelligent decisions about how to align data structures, etc.
> based on that information.

Maybe we can extend this to more properties.
###
>  > c) Data in the bus_dmamap_sync footprint should not be accessed by the 
>  > processor
>  >    if a BUS_DMASYNC_PREREAD is in effect for the bus_dmamap_sync footprint.
>  >    ( Example: XSCALE architecture?)
>  > 
>  >    Problems:
>  >     Reading data that falls in the bus_dmamap_sync footprint might cause a 
>  > cacheline to be loaded.
>  >     This cacheline might persist in the cache and present outdated memory 
>  > state to the CPU
>  >     even after a BUS_DMASYNC_POSTREAD sync operation.
>  >     The outdated state in the cacheline can be flushed by a 
>  > BUS_DMASYNC_PREREAD sync.
>  >     ( This could be fixed by  flushing cache lines on BUS_DMASYNC_POSTREAD)
> 
> No, I don't think this is a problem.  If you determine that the device is
> going to update the memory location again, then you issue another PREREAD,
> which will do the necessary non-writeback-invalidate.  This is what current
> drivers do.
> 
> Basically, the CPU should not even try to access DMA memory (read *or* write)
> while the device "owns" it (i.e. between PRE and POST ops).

Without knowing the bus_dmamap_sync footprint (the current status) 
drivers might accidently do this. ( Example: Multiple dma data and
control data mixed in one cacheline)

I really don't think it would take me a long time to come up with
a few drivers that have this race condition.

I also think that it is possible for cachelines to be loaded
speculatively.( micro load opcode speculatively executed - but not
retired due to missed branch prediction )
This would require BUS_DMASYNC_POSTREAD to flush any accidently
loaded cachelines instead of being a no-op. (on non main-memory 
snooping CPUs)
But this a problem with some architecture specific implementations
 - not the interface.

###
> 
>  >     Writing data also might lead to a stale cache line - however this condition
>  >     can not be reverted on a write-back cache (write through should be ok)
>  >     since the write marks the cacheline as dirty. (and a flush will just save 
>  > it to memory)
>  >     ( This might revert dma region memory with which it shares a cache line to 
>  > an earlier state)
> 
> ...right, which is why the CPU should not dirty that cache line (by issuing
> a write) until it knows it is safe to do so (i.e. knows the device is not
> going to update that region of memory again).

Easier said than done.

###
> 
>  > d) DMA_READ operation on memory not currently synced as BUS_DMASYNC_PREREAD.
>  > 
>  >    Same problems as (c) and the following additional problem that the
>  >    CPU might see some (but not all)  data from dma transfers and the CPU might 
>  > not see the
>  >    data in the order it was transfered by DMA.
>  >    Again BUS_DMASYNC_PREREAD should synchronize.
> 
> Err, right, a DMA read which is not preceeded by a PREREAD operation is
> basically incorrect use of the bus_dma API, and thus results in undefined
> behavior :-)

;-)

###
> 
>  > e) DMA_WRITE operation on memory not currently synced as BUS_DMASYNC_PREWRITE. 
>  > 
>  >     
>  >    DMA access could see some data written by the CPU before a 
>  > BUS_DMASYNC_PREWRITE operation
>  >    and it might see the data out of order.
> 
> Ditto.


;-)

###
> 
>  > 2) Examples of race conditions in the i82557 driver:
>  > 
>  > (I)
>  > Adding a new receive buffer at the end of the DMA queue.
>  > File i82257var.h:  
>  > 	#define	FXP_INIT_RFABUF(sc, m  \
>  > 	      [...]
>  > 		FXP_RFASYNC((sc), __p_m,				\
>  > 		    BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);	\
>  > 		/* ## a ## */
>  > 		memcpy((void *)&__p_rfa->link_addr, &__v,		\
>  > 		    sizeof(__v));					\
>  > 		__p_rfa->rfa_control &= htole16(~FXP_RFA_CONTROL_EL);	\
>  > 	        /* ## b ## */
>  > 		FXP_RFASYNC((sc), __p_m,				\
>  > 		    BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);		\
>  > 
>  > 
>  > Between /* ## a ## */ and /* ## b ## */ DMA reads might be active for the RFA 
>  > and
>  > for the frame data area.
> 
> Yes, and there isn't really any good way to fix this problem.  The i82557
> was clearly designed with blinders on, and relies on at least two features
> of the IA-32:
> 
> 	1. Fully coherent memory system.
> 	2. Ability to atomically update a 16-bit word in memory.
> 
> Luckily, the driver is able to arrange things to that we always win the
> race.  

How ?
One Example:
Assuming write-back cache with 32 Byte cacheline:	
	1) memcpy((void *)&__p_rfa->link ......
	   Loads the cacheline and dirties it.
	2) i82557 DMAs frame data to memory immediately following
            the rfa (rfa + 16)
           Since the rfa has an offset of 4 and a size of 16
            => 12 frame data bytes share the cacheline with the rfa.

	3) FXP_RFASYNC((sc), 
                  __p_m, BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
	   Flushes the cacheline - clobbering the first 12 Bytes of
	   frame data.

Perhaps I can agree with:
	Luckily, the driver is almost always lucky to win the race.
	Even if the driver loses the race it is almost always lucky
        enough that nothing serious happens. ;-)  

###
> There's not really much that can be done about it, in the case
> of the i82557.
> 
>  > Non-cacheable memory would do the trick but:
>  > Jason R Thorpe wrote: [..] driver should not rely on BUS_DMA_COHERENT for 
>  > correct operation [...]
> 
> Right, see above ... this is a "feature" of the i82557.

Unfortunately this is not the only chip requiring functionality not provided by
the current interface.
( And I think this or similar styles of dma chaining become more 
popular )

###
> 
>  > (II)
>  > File i82257.c
>  > fxp_rxintr(struct fxp_softc *sc)
>  > 	[...]
>  > 
>  > 	FXP_RFASYNC(sc, m,
>  > 		    BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);
>  > 
>  > 		/** A **/
>  > 		rxstat = le16toh(rfa->rfa_status);
>  > 
>  > 		if ((rxstat & FXP_RFA_STATUS_C) == 0) {
>  > 			/*
>  > 			 * We have processed all of the
>  > 			 * receive buffers.
>  > 			 */
>  > 			FXP_RFASYNC(sc, m, BUS_DMASYNC_PREREAD);
>  > 			return;
>  > 		}
>  > 
>  > 		IF_DEQUEUE(&sc->sc_rxq, m);
>  > 
>  > 		FXP_RXBUFSYNC(sc, m, BUS_DMASYNC_POSTREAD);
>  > 		/** B **/
>  > 		len = le16toh(rfa->actual_size) &
>  > 		    (m->m_ext.ext_size - 1);
>  > 
>  > 		if (len < sizeof(struct ether_header)) {
>  > 			/*
>  > 			 * Runt packet; drop it now.
>  > 			 */
>  > 			FXP_INIT_RFABUF(sc, m);
>  > 			continue;
>  > 		}
>  > 
>  > 
>  > The rfa might be dma-updated at point /** A **/
>  > Due to problem d) the driver might see the update of the rfa_status field but 
>  > not
>  > the update of the actual_size field at point /** B **/
> 
> No.  The chip does not transfer ownership of the RFA to the host (by setting
> the C bit) until the entire RFA descriptor is updated (I have PCI bus analyzer
> traces that confirm this :-)

I fully agree.

###
>, thus if the host sees C, then it is guaranteed
> that all of the fields in the RFA are valid.  This is what all descriptor-
> based messaging systems do with the "owner bit".

No.
I think we agree that if C is in memory, then  all of the other
fields in the RFA are also valid in memory.

Some CPUs (Example Intel P6) internally reorder their instructions
and even speculatively execute instructions.

So in the absents of special memory-barrier instructions the CPU 
is free to reorder instructions to read the rfa->actual_size
field before reading the rfa->status field.

If the driver is really unlucky the device DMAed both fields between
the read of rfa->actual_size and the read of rfa->status.

As such the driver sees C - but an invalid actual_size field.

I have seen this happening in stress tests on a proprietary 82257 driver
 - but maybe the RFA spanned two cache lines. 
 
The driver will probably just throw the packet away because the length is 
invalid - so this example is not a real problem.

Just imagine the detection of the C Bit as the acquisition 
of a mutex or spinnlock.
A mutex or spinnlock without a memory-barrier is pretty useless.

###
> Since the driver does not modify the RFA memory (does not dirty cache lines),
> and issues only a PREREAD, there is no chance that the CPU's view of memory
> will overwrite the device's view.

I agree



----------------------
Stephan Uphoff 
ups@stups.com
9275 Martin Road
Roswell, GA 30076
770-518-4058