Subject: pciide lost interrupt - losing access to the file system
To: None <current-users@netbsd.org>
From: Rick Byers <rickb@iaw.on.ca>
List: current-users
Date: 10/13/1999 22:04:59
Hi,

I'm having a problem with NetBSD-current/i386 (a few days old).  I wasn't
having this problem a few months ago (sorry I can't be more specific).  
I've got two hard drives: wd0 is my windoze drive, wd1 is my NetBSD drive.
When playing mp3s (music files) from my NetBSD drive I get the following
kernel messages every minute or so (and access to the disk is suspended
for a few seconds):

pciide0:0:1: lost interrupt
	type: ata
	c_bcount: 8192
	c_skip: 0
pciide0:0:1: Bus-Master DMA error: missing interrupt, status=0x61
wd1e: DMA error writing fsbn 1517072 of 1517072-1517087 (wd1 bn 1694480; cn 1681 tn 0 sn 32), retrying
wd1: soft error (corrected)
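
The block numbers in the wd1e line above can be cross-checked against the
geometry wd1 reports at boot (16 heads, 63 sectors per track).  The following
is only a rough sketch in Python, under two assumptions that are inferred from
the numbers lining up rather than from the driver source: that "sn" is a
zero-based sector offset within the track, and that "fsbn" is counted from the
start of the wd1e partition.  The function names are made up for illustration.

```python
# Geometry of wd1 as printed in the boot messages: 16 heads, 63 sectors/track.
HEADS = 16
SECTORS_PER_TRACK = 63


def block_to_chs(bn, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Split a linear block number into (cylinder, head, sector).

    Assumption: the sector value is a zero-based offset within the track,
    which is what makes the numbers in the error message agree.
    """
    cyl, rem = divmod(bn, heads * spt)
    head, sec = divmod(rem, spt)
    return cyl, head, sec


def chs_to_block(cyl, head, sec, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Inverse of block_to_chs under the same zero-based assumption."""
    return (cyl * heads + head) * spt + sec


# Values copied from the error message.
fsbn = 1517072        # block number as reported for wd1e
wd1_bn = 1694480      # block number as reported for the whole disk

print(block_to_chs(wd1_bn))   # (1681, 0, 32), i.e. cn 1681 tn 0 sn 32

# Assumption: fsbn is relative to the start of wd1e, so the difference
# is where that partition begins on the disk.
partition_offset = wd1_bn - fsbn
print(partition_offset, block_to_chs(partition_offset))   # 177408 (176, 0, 0)
```

Under those assumptions the message is self-consistent: the failing transfer
starts at cylinder 1681, and wd1e would begin exactly on cylinder 176.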

When playing mp3s from my Win95 drive, I get the same sort of message, but
instead of "wd1: soft error (corrected)", I get "pciide0:0:0: missing
untimeout" and the system never recovers - all access to either disk just
blocks forever.  I can break into the debugger and do a "reboot", but I
get "syncing disks ... panic: lockmgr: no context".

I can easily reproduce the problem by playing mp3s, but I occasionally see
the problem when just copying data from the Win95 drive.  I'm guessing
it's timing-specific - playing mp3s would read a chunk then pause, then
read another chunk (on average, I get the error after reading about 3-7 MB
of data).  I don't see the problem on normal disk access to the UNIX
drive, but since it doesn't hang the system when only wd1 is involved, I
could have missed it.

Here is the relevant information from the kernel boot messages:

pciide0 at pci0 dev 4 function 1: Intel 82371AB IDE controller (PIIX4)
pciide0: bus-master DMA support present
pciide0: primary channel wired to compatibility mode
wd0 at pciide0 channel 0 drive 0: <ST52520A>
wd0: drive supports 16-sector pio transfers, lba addressing
wd0: 2446MB, 4970 cyl, 16 head, 63 sec, 512 bytes/sect x 5010016 sectors
wd0: 32-bits data port
wd0: drive supports PIO mode 4, DMA mode 2
wd1 at pciide0 channel 0 drive 1: <ST33240A>
wd1: drive supports 16-sector pio transfers, lba addressing
wd1: 3077MB, 6253 cyl, 16 head, 63 sec, 512 bytes/sect x 6303024 sectors
wd1: 32-bits data port
wd1: drive supports PIO mode 4, DMA mode 2
pciide0: primary channel interrupting at irq 14
wd0(pciide0:0:0): using PIO mode 4, DMA mode 2 (using DMA data transfers)
wd1(pciide0:0:1): using PIO mode 4, DMA mode 2 (using DMA data transfers)
pciide0: secondary channel wired to compatibility mode
atapibus0 at pciide0 channel 1
pciide0: secondary channel interrupting at irq 15
cd0(pciide0:1:0): using PIO mode 4, DMA mode 2 (using DMA data transfers)
cd1(pciide0:1:1): using PIO mode 3
wd0: no disk label
boot device: wd1
root on wd1a dumps on wd1b
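
As a sanity check on the probe output above, the sizes and geometry printed
for wd0 and wd1 can be recomputed from the sector counts.  This is just an
illustrative Python sketch; it assumes the "MB" figure is the byte count
divided by 2^20 and truncated, which happens to match both drives, and the
helper names are hypothetical.

```python
SECTOR_BYTES = 512

# (cylinders, heads, sectors per track, total sectors), copied from the
# wd0 and wd1 lines in the boot messages.
DRIVES = {
    "wd0": (4970, 16, 63, 5010016),
    "wd1": (6253, 16, 63, 6303024),
}


def capacity_mb(total_sectors):
    """Size in MB, assuming 1 MB = 2**20 bytes and truncation."""
    return total_sectors * SECTOR_BYTES // (1024 * 1024)


def chs_sectors(cyl, heads, spt):
    """Number of sectors addressable through the reported CHS geometry."""
    return cyl * heads * spt


for name, (cyl, heads, spt, total) in DRIVES.items():
    extra = total - chs_sectors(cyl, heads, spt)
    print(name, capacity_mb(total), "MB,", extra, "sectors beyond CHS")
# wd0 2446 MB, 256 sectors beyond CHS
# wd1 3077 MB, 0 sectors beyond CHS
```

Both sizes agree with what the kernel printed; wd0's sector count is 256
sectors larger than its CHS geometry covers, while wd1's matches exactly.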

Any idea what might be causing this?  Doesn't feel like a hardware problem
to me (I would expect doing massive amounts of normal disk access - like
rebuilding the userland and xsrc - to cause problems as well).  Any
major changes to the pciide stuff in the last few months? Has anyone else
seen this sort of behaviour? Any suggestions?

Thanks,
	Rick

=========================================================================
Rick Byers                       University of Waterloo, Computer Science
rickb@iaw.on.ca                               http://www.iaw.on.ca/rickb/