Subject: kern/9572: polled DMA can hang the machine (fxp)
To: None <gnats-bugs@gnats.netbsd.org>
From: John Hawkinson <jhawk@mit.edu>
List: netbsd-bugs
Date: 03/08/2000 01:57:34
>Number:         9572
>Category:       kern
>Synopsis:       polled DMA can hang the machine (fxp)
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Mar  8 01:57:00 2000
>Last-Modified:
>Originator:     John Hawkinson
>Organization:
	MIT
>Release:        NetBSD 1.4.1
>Environment:
	
System: NetBSD zorkmid.mit.edu 1.4.1 NetBSD 1.4.1 (ZORKMID) #62: Wed Mar 8 03:59:10 EST 2000 jhawk@zorkmid.mit.edu:/usr/src/sys/arch/i386/compile/ZORKMID i386


>Description:
	Some device drivers, such as dev/pci/if_fxp.c [moved around
in -current], use a construct like this to wait for a DMA operation
to complete (line 1318ff):

        while (!(cb_ias->cb_status & FXP_CB_STATUS_C))
                bus_dmamap_sync(sc->sc_dmat, sc->sc_tx_dmamaps[0],
                    0, sizeof(struct fxp_cb_ias),
                    BUS_DMASYNC_POSTREAD);

Unfortunately, if the device is confused and for some reason never
completes the DMA, the machine hangs in this loop forever. I
encountered this problem in kern/9571, where an APM suspend/resume
was trashing the PCI config state of the device. As a result, DMA
was not happening, the status bits never changed, and the machine
hung as soon as it executed this code.

Ideally, confused devices should never hang the machine. Certainly
the OS should not enter a tight, unbounded loop that assumes a device
is going to function properly when that can be avoided (with a
timeout, for example).
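
To illustrate what a bounded version of such a loop could look like,
here is a minimal userland sketch. It is not the fxp driver's code.
CB_STATUS_C, poll_complete(), the hook type, and the two simulated
devices are hypothetical names made up for this example. In a real
driver the hook would presumably be the bus_dmamap_sync() call followed
by a short delay, and the poll limit would have to be chosen per device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "command complete" bit, standing in for FXP_CB_STATUS_C. */
#define CB_STATUS_C 0x8000u

/* Hypothetical per-iteration hook: in a driver, sync the DMA map and delay. */
typedef void (*poll_hook_t)(volatile uint16_t *status, void *arg);

/*
 * Poll *status until CB_STATUS_C is set or max_polls iterations have
 * elapsed.  Returns true on completion, false on timeout, so the caller
 * can report an error instead of spinning forever.
 */
static bool
poll_complete(volatile uint16_t *status, unsigned max_polls,
    poll_hook_t hook, void *arg)
{
	assert(status != NULL);

	for (unsigned i = 0; i < max_polls; i++) {
		if (*status & CB_STATUS_C)
			return true;
		if (hook != NULL)
			hook(status, arg);
	}
	/* One last look after the final hook call. */
	return (*status & CB_STATUS_C) != 0;
}

/* Simulated healthy device: sets the bit after polls_left hook calls. */
struct fake_dev {
	unsigned polls_left;
};

static void
fake_dev_step(volatile uint16_t *status, void *arg)
{
	struct fake_dev *d = arg;

	if (d->polls_left > 0 && --d->polls_left == 0)
		*status |= CB_STATUS_C;
}

/* Simulated confused device: never sets the completion bit. */
static void
dead_dev_step(volatile uint16_t *status, void *arg)
{
	(void)status;
	(void)arg;
}
```

With the confused device, poll_complete() returns false after max_polls
iterations, and the caller can log the failure and back out rather than
hang the machine.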

I don't know how widespread this construct is, and the particular
problem I saw with the fxp driver isn't relevant for kern/9571 anymore,
so this may not be a terribly strong concern. Still, it is an
architectural issue, or possibly at least a driver author education
issue, and the construct does not seem right to me.

>How-To-Repeat:
	Boot stock 1.4.1 on a Sony VAIO Z505HE with built-in ethernet;
	ifconfig fxp0 up
	suspend the laptop
	resume
	observe a few "SCB timed out" errors, after which the machine
hangs hard and DDB is unresponsive (presumably because we are
at splnet()).
>Fix:
	WORKAROUND: apply the patch included with kern/9571 and wait
for another instance of this problem in the tree to bite someone
else?
>Audit-Trail:
>Unformatted: