Re: x86 bus_dmamap_sync



> Date: Fri, 27 Oct 2017 23:08:23 +0000
> From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
> 
> As noted by Mouse in a thread elsewhere, it looks like the LFENCE in
> x86's bus_dmamap_sync in the POSTREAD case is in the wrong place.

Correction: Since yamt@ committed a fix for PR 21665 in 2004, there
has been an LFENCE (or a stronger serializing instruction, on i386) in
bus_dmamap_sync proper, in the POSTREAD case, _before_ calling the
internal _bus_dmamap_sync logic.  So now I am puzzled by the LFENCE
and MFENCE that bouyer@ added back in 2008 for PR 38935 -- it seems to
me that the MFENCE is unnecessary for PREWRITE/POSTWRITE, that the
LFENCE is not enough for PREREAD (which I think needs an MFENCE), and
that a second LFENCE for POSTREAD is redundant.

I've attached a patch with a detailed comment explaining exactly what
ordering of loads and stores I believe we need in each case, and the
weakest x86 memory barrier that is needed to guarantee that ordering.
The summary that I've provisionally convinced myself of is:

[Pre-DMA: driver is about to issue store to device register to
initiate DMA transfer.]

PREREAD         MFENCE because program-prior loads must bus-precede
                program-subsequent store to device register that
                initiates DMA

PREWRITE        nothing because x86 guarantees bus-order of stores
                matches program-order of stores, and loads are
                irrelevant

[Post-DMA: driver just issued load from device register to learn of
DMA completion.]

POSTREAD        LFENCE because program-prior load from device register
                that notifies of DMA completion must bus-precede
                program-subsequent loads to read data

POSTWRITE       nothing because x86 guarantees program-order
                load/store is always issued as bus-order load/store,
                never as bus-order store/load
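
To make the intended ordering concrete, here is a minimal driver-side
sketch (not part of the patch: the softc layout, register offsets, and
hypo_* names are all made up for illustration) showing where each sync
op sits relative to the device-register store that initiates DMA and
the device-register load that learns of its completion:

/*
 * Illustration only.  The interesting part is the placement of
 * bus_dmamap_sync relative to the store that starts the DMA transfer
 * and the load that observes its completion.
 */
#include <sys/types.h>
#include <sys/bus.h>

struct hypo_softc {
	bus_space_tag_t		sc_iot;
	bus_space_handle_t	sc_ioh;
	bus_dma_tag_t		sc_dmat;
	bus_dmamap_t		sc_dmamap;
};

#define	HYPO_REG_START	0x00	/* store 1 here to initiate DMA */
#define	HYPO_REG_STATUS	0x04	/* loads nonzero once DMA is complete */

static void
hypo_start_dma(struct hypo_softc *sc, bus_size_t len)
{

	/*
	 * Pre-DMA: PREREAD takes an MFENCE so program-prior loads from
	 * the buffer complete before the store below; PREWRITE takes
	 * nothing because x86 keeps stores in program order.
	 */
	bus_dmamap_sync(sc->sc_dmat, sc->sc_dmamap, 0, len,
	    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);

	/* Store to a device register initiates the DMA transfer. */
	bus_space_write_4(sc->sc_iot, sc->sc_ioh, HYPO_REG_START, 1);
}

static void
hypo_dma_intr(struct hypo_softc *sc, bus_size_t len)
{
	uint32_t status;

	/* Load from a device register learns of DMA completion. */
	status = bus_space_read_4(sc->sc_iot, sc->sc_ioh, HYPO_REG_STATUS);
	if (status == 0)
		return;

	/*
	 * Post-DMA: POSTREAD takes an LFENCE so program-subsequent
	 * loads from the buffer are not issued before the status load
	 * above; POSTWRITE takes nothing because x86 never reorders
	 * program-order load/store into bus-order store/load.
	 */
	bus_dmamap_sync(sc->sc_dmat, sc->sc_dmamap, 0, len,
	    BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);

	/* ...now it is safe to read data out of the DMA buffer... */
}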

It also seems to me that it would be better to use membar_sync and
membar_consumer rather than x86_mfence and x86_lfence, so that we can
take advantage of the patching to actually use MFENCE and LFENCE on
i386 when SSE2 is available.

All that said, it is possible that the guarantees I cited above do not
apply between (in the PREWRITE case) a store to main memory and a
subsequent store to memory-mapped I/O, or (in the POSTWRITE case) a
load from memory-mapped I/O and a subsequent store to main memory[*].
If not, PREWRITE would need an SFENCE and POSTWRITE an MFENCE.  I'd
find that a little surprising, but I'm not an x86 wizard.


[*] Certainly these guarantees _do_ apply when it is an I/O
instruction rather than a memory-mapped I/O load/store, since I/O
instructions are all serializing.  Of course, in that case, no memory
barriers are needed at all.
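
If it came to that, the conservative variant of the fence selection in
_bus_dmamap_sync would look roughly like this (a sketch only, assuming
x86_sfence() is available alongside x86_lfence() and x86_mfence()):

	if (ops & BUS_DMASYNC_PREREAD)
		membar_sync();		/* MFENCE, as in the patch below */
	if (ops & BUS_DMASYNC_PREWRITE)
		x86_sfence();		/* SFENCE, only if MMIO stores need it */
	if (ops & BUS_DMASYNC_POSTREAD)
		membar_consumer();	/* LFENCE, as in the patch below */
	if (ops & BUS_DMASYNC_POSTWRITE)
		x86_mfence();		/* MFENCE, only if the MMIO load needs it */
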
Index: sys/arch/x86/x86/bus_dma.c
===================================================================
RCS file: /cvsroot/src/sys/arch/x86/x86/bus_dma.c,v
retrieving revision 1.77
diff -p -u -r1.77 bus_dma.c
--- sys/arch/x86/x86/bus_dma.c	31 Jul 2017 19:29:19 -0000	1.77
+++ sys/arch/x86/x86/bus_dma.c	28 Oct 2017 01:30:55 -0000
@@ -813,11 +813,79 @@ _bus_dmamap_sync(bus_dma_tag_t t, bus_dm
 #endif
 
 	/*
+	 * Pre-DMA operations, before the caller issues a store to
+	 * initiate a DMA transfer:
+	 *
+	 * PREREAD
+	 *
+	 *	All stores to the DMA buffer must be completed before
+	 *	the caller issues a store to initiate the DMA transfer,
+	 *	so that we don't step on the DMA engine's toes while
+	 *	it's reading.  No fence needed because x86 stores are
+	 *	always issued in program order, outside write-combining
+	 *	regions of memory and non-temporal stores.
+	 *
+	 *	All loads from the DMA buffer must be completed before
+	 *	the caller issues a store to initiate the DMA transfer,
+	 *	so that the DMA transfer doesn't clobber them.  Hence
+	 *	MFENCE, via membar_sync.
+	 *
+	 *	(If we have to bounce, the MFENCE may not be necessary,
+	 *	but it doesn't hurt either, and I'd wildly guess the
+	 *	performance impact of an MFENCE is negligible compared
+	 *	to the performance impact of using a bounce buffer.)
+	 *
+	 * PREWRITE
+	 *
+	 *	All stores to the DMA buffer must be completed before
+	 *	the caller issues a store to initiate a DMA transfer,
+	 *	so that the DMA engine transfers the data we want it
+	 *	to.  No fence needed because x86 stores are always
+	 *	issued in program order, outside write-combining
+	 *	regions of memory and non-temporal stores.
+	 *
+	 *	The DMA engine will issue no stores to the DMA buffer,
+	 *	so loads can happen before or after the DMA transfer
+	 *	without trouble.  No fence needed.
+	 *
+	 * Post-DMA, after the caller has issued a load from a device
+	 * register informing it that the DMA transfer has completed:
+	 *
+	 * POSTREAD
+	 *
+	 *	No stores to the DMA buffer may be issued until after
+	 *	the load that informed the caller that the DMA transfer
+	 *	had completed.  No fence needed because x86 never
+	 *	reorders program-order load/store to store/load.
+	 *
+	 *	No loads from the DMA buffer must be issued until after
+	 *	the load that informed the caller that the DMA transfer
+	 *	had completed.  Hence LFENCE, via membar_consumer, and
+	 *	be sure to do so before we bounce.
+	 *
+	 * POSTWRITE
+	 *
+	 *	No stores to the DMA buffer may be issued until after
+	 *	the load that informed the caller that the DMA transfer
+	 *	had completed.  No fence needed because x86 never
+	 *	reorders program-order load/store to store/load.
+	 *
+	 *	The DMA engine will issue no stores to the DMA buffer,
+	 *	so loads can happen before or after the DMA transfer
+	 *	without trouble.  No fence needed.
+	 */
+
+	if (ops & BUS_DMASYNC_PREREAD)
+		membar_sync();
+	if (ops & BUS_DMASYNC_POSTREAD)
+		membar_consumer();
+
+	/*
 	 * If we're not bouncing, just return; nothing to do.
 	 */
 	if (len == 0 || cookie == NULL ||
 	    (cookie->id_flags & X86_DMA_IS_BOUNCING) == 0)
-		goto end;
+		return;
 
 	switch (cookie->id_buftype) {
 	case X86_DMA_BUFTYPE_LINEAR:
@@ -940,21 +1008,6 @@ _bus_dmamap_sync(bus_dma_tag_t t, bus_dm
 		    cookie->id_buftype);
 		break;
 	}
-end:
-	if (ops & (BUS_DMASYNC_PREWRITE|BUS_DMASYNC_POSTWRITE)) {
-		/*
-		 * from the memory POV a load can be reordered before a store
-		 * (a load can fetch data from the write buffers, before
-		 * data hits the cache or memory), a mfence avoids it.
-		 */
-		x86_mfence();
-	} else if (ops & (BUS_DMASYNC_PREREAD|BUS_DMASYNC_POSTREAD)) {
-		/*
-		 * all past reads should have completed at before this point,
-		 * and future reads should not have started yet.
-		 */
-		x86_lfence();
-	}
 }
 
 /*
@@ -1336,9 +1389,6 @@ bus_dmamap_sync(bus_dma_tag_t t, bus_dma
 		return;
 	}
 
-	if (ops & BUS_DMASYNC_POSTREAD)
-		x86_lfence();
-
 	_bus_dmamap_sync(t, p, o, l, ops);
 }
 

