Subject: Re: Odd data faults on U5 with Promise U66 IDE controller
To: None <tech-kern@netbsd.org>
From: Rafal Boni <rafal@pobox.com>
List: port-sparc64
Date: 08/28/2003 01:55:39
[...Following up to my own message; this looks like it may be sparc64-
 specific, but it may also be related to RAIDFrame, so I'm sending it
 on to tech-kern as well...]
 
In message <200308280112.h7S1CWp1001018@fearless-vampire-killer.waterside.net>,
I wrote:

-> I've retooled my U5 to be a more useful server box (which is what it has
-> been doing anyway), and to that aim I installed a Promise Ultra/66 IDE
-> controller with two Seagate 120GB IDE drives hanging off of it, which I
-> intended to use as a mirror set (I had to rip out the CDROM & floppy to
-> do this and be able to fit everything in, but those seldom got any use
-> anyway :-).
-> 
[...]

-> Each time I've tried this, so far (2 or 3 times), I've gotten an odd panic
-> from what looks like an async data error, like so:
-> 
->     data error type 32 sfsr=0 sfva=778000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30
->     data fault: pc=116a808 addr=778000 sfsr=0<ASI=0> 
->     kernel trap 32: data access error
->     Stopped at      netbsd:pdc202xx_pci_intr+0x24:  subcc     %l3, %o1, %g0

[ "this" above, was copying data over from a non-RF wd1 disk, to the raid
  partition on wd2, in preparation for adding the other disk to the mirror
  set once all the data had been copied over...]

This appears to be related to RAIDFrame and read/write activity on both
disks on the Promise controller; particularly, the RF disk being written
while reading the non-RF'ed disk.

Here's what I've tried:
	No RF, two parallel dd's reading from each of the disks (wd1 and wd2).
		* No problems.

	No RF, two parallel dd's, one reading from wd1, one writing to wd2,
	both through the FS and to the raw device (IIRC, I'm pretty sure
	I did the raw device as well).
		* No problems.

	RF (wd2 and nonexistent wd3 in a mirrored set) on wd2 being written
	while the non-RF device (wd1) is read.
		* BOOM!
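
The read-one-disk-while-writing-the-other load in the tests above can be
approximated with a small script. This is only a sketch in Python, not the
dd commands that were actually run; the function names are made up for
illustration, and the caller has to supply the paths. Pointing the write
side at a real disk or RAID device destroys whatever is on it.

```python
import threading


def read_all(path, block_size=64 * 1024):
    """Read `path` sequentially to EOF and return the number of bytes read."""
    total = 0
    # buffering=0 gives unbuffered raw I/O, roughly what dd does.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def write_blocks(path, blocks, block_size=64 * 1024):
    """Write `blocks` zero-filled blocks to `path`; return bytes written."""
    total = 0
    buf = bytes(block_size)
    with open(path, "wb", buffering=0) as f:
        for _ in range(blocks):
            total += f.write(buf)
    return total


def run_concurrently(read_path, write_path, blocks, block_size=64 * 1024):
    """Run one reader and one writer at the same time, like two parallel dd's."""
    results = {}

    def reader():
        results["read"] = read_all(read_path, block_size)

    def writer():
        results["written"] = write_blocks(write_path, blocks, block_size)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling run_concurrently() with a file on the non-RF disk as the read side
and a scratch file on the RAID set as the write side would mimic the failing
case; using two plain disks would mimic the cases that worked.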

The panics look slightly different when dd'ing or untar'ing to the RAID
disk (rather than using dump/restore as I tried before):

    data error type 32 sfsr=0 sfva=8b3c000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30
    panic: Privileged Async Fault: AFAR 0x1fe02000458 AFSR 84000000<PRIV,BERR,ETS=0,P_SYND=0>

and:

    data error type 32 sfsr=0 sfva=8df0000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30
    panic: Privileged Async Fault: AFAR 0x1fe02000458 AFSR 84000000<PRIV,BERR,ETS=0,P_SYND=0>

and again very similar to the original panic (this time copying using
tar and untar rather than dump/restore):

    data error type 32 sfsr=0 sfva=8be6000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30
    data fault: pc=116a808 addr=8be6000 sfsr=0<ASI=0> 
    kernel trap 32: data access error 
    Stopped in pid 700.1 (tar) at   netbsd:pdc202xx_pci_intr+0x24:  subcc
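
To compare the four panics field by field, the "data error" lines can be
parsed mechanically. The sketch below (Python, with made-up helper names)
pulls out the hex fields and reports which ones change between panics. The
two AFSR flag masks are an assumption: they are chosen so that 84000000
decodes as PRIV plus BERR, matching what the panic message itself prints,
and no other AFSR bits are decoded.

```python
import re

# Assumed masks: 0x80000000 | 0x04000000 == 0x84000000, which the panic
# message prints as <PRIV,BERR,...>. Other AFSR bits are not handled.
AFSR_PRIV = 0x80000000
AFSR_BERR = 0x04000000

# Matches name=value pairs such as "sfva=8b3c000" or "tf=0xe0017c30".
FIELD_RE = re.compile(r"\b(sfsr|sfva|afsr|afva|tf)=(?:0x)?([0-9a-fA-F]+)")

# The four "data error" lines from this message (first one un-wrapped).
PANICS = [
    "data error type 32 sfsr=0 sfva=778000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30",
    "data error type 32 sfsr=0 sfva=8b3c000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30",
    "data error type 32 sfsr=0 sfva=8df0000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30",
    "data error type 32 sfsr=0 sfva=8be6000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30",
]


def parse_data_error(line):
    """Return a dict mapping each field name to its value as an integer."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(line)}


def afsr_flags(afsr):
    """Name the two flag bits seen in these panics, if set."""
    flags = []
    if afsr & AFSR_PRIV:
        flags.append("PRIV")
    if afsr & AFSR_BERR:
        flags.append("BERR")
    return flags


def varying_fields(records):
    """Return the sorted names of fields that differ between records."""
    names = records[0].keys()
    return sorted(n for n in names if len({r[n] for r in records}) > 1)


if __name__ == "__main__":
    records = [parse_data_error(line) for line in PANICS]
    print("fields that vary:", varying_fields(records))
    print("AFSR flags:", afsr_flags(records[0]["afsr"]))
```

On these four lines only sfva varies; afsr, afva and tf are the same in
every panic.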

--rafal

-> Any ideas on how to debug this?  I'm guessing that this has something to do
-> with (heavier) concurrent access to both drives on the Promise controller,
-> as I've had no problems with the single drive on the box, nor with both
-> drives present but only one being accessed.  Just to make sure it's not
-> dump or restore doing something dumb, I suppose I could dump to /dev/null
-> and see if that breaks it (but only after I send this email out, since
-> the U5 is also my mail box :-) -- unfortunately, I don't have enough room
-> to keep the dump on the boot/root disk, but I could try and see how far
-> it gets.
-> 
-> I rebuilt the kernel from today's CVS as I needed to add RAIDFrame anyway,
-> and had noticed issues with processes never making it off the run queue in
-> the previous kernel I was running, so figured it was worth a try...
-> 
-> Thanks!
-> --rafal
-> 
-> dmesg follows:
-> 
-> NetBSD 1.6X (FEARLESS_VAMPIRE_KILLER) #11: Wed Aug 27 14:53:21 EDT 2003
-> 	rafal@fearless-vampire-killer.waterside.net:/extra/sparc64/obj/sys/arch/sparc64/compile/FEARLESS_VAMPIRE_KILLER
-> total memory = 128 MB
-> avail memory = 110 MB
-> using 832 buffers containing 6656 KB of memory
-> bootpath: /pci@1f,0/pci@1,1/ide@3,0/disk@0,0
-> mainbus0 (root): SUNW,Ultra-5_10
-> cpu0 at mainbus0: SUNW,UltraSPARC-IIi @ 360 MHz, version 0 FPU
-> cpu0: 32K instruction (32 b/l), 16K data (32 b/l), 256K external (64 b/l)
-> psycho0 at mainbus0 addr 0xfffc4000
-> SUNW,sabre: impl 0, version 0: ign 7c0 bus range 0 to 2; PCI bus 0
-> DVMA map: c0000000 to e0000000
-> IOTSB: a46000 to ac6000
-> pci0 at psycho0
-> pci0: i/o space, memory space enabled
-> ppb0 at pci0 dev 1 function 1: Sun Microsystems, Inc. Simba PCI bridge (rev. 0x13)
-> pci1 at ppb0 bus 1
-> pci1: i/o space, memory space enabled
-> ebus0 at pci1 dev 1 function 0
-> ebus0: Sun Microsystems, Inc. PCIO Ebus2, revision 0x01
-> auxio0 at ebus0 addr 726000-726003, 728000-728003, 72a000-72a003, 72c000-72c003, 72f000-72f003
-> power at ebus0 addr 724000-724003 ipl 37 not configured
-> SUNW,pll at ebus0 addr 504000-504002 not configured
-> sab0 at ebus0 addr 400000-40007f ipl 43: rev 3.2
-> sabtty0 at sab0 port 0
-> sabtty1 at sab0 port 1: console i/o
-> com0 at ebus0 addr 3083f8-3083ff ipl 41: ns16550a, working fifo       
-> kbd0 at com0
-> com1 at ebus0 addr 3062f8-3062ff ipl 42: ns16550a, working fifo       
-> ms0 at com1
-> lpt0 at ebus0 addr 3043bc-3043cb, 30015c-30015d, 700000-70000f ipl 34 
-> fdthree at ebus0 addr 3023f0-3023f7, 706000-70600f, 720000-720003 ipl 39 not configured
-> clock0 at ebus0 addr 0-1fff: mk48t59: hostid 80d164db
-> flashprom at ebus0 addr 0-fffff not configured
-> audiocs0 at ebus0 addr 200000-2000ff, 702000-70200f, 704000-70400f, 722000-722003 ipl 35 ipl 36: CS4231A
-> audio0 at audiocs0: full duplex
-> hme0 at pci1 dev 1 function 1: Sun Happy Meal Ethernet, rev. 1
-> hme0: interrupting at ivec 3021    
-> hme0: Ethernet address 08:00:20:XX:XX:XX
-> nsphy0 at hme0 phy 1: DP83840 10/100 media interface, rev. 1
-> nsphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
-> pciide0 at pci1 dev 3 function 0: CMD Technology PCI0646 (rev. 0x03)  
-> pciide0: bus-master DMA support present
-> pciide0: primary channel configured to native-PCI mode
-> pciide0: using ivec 1820 for native-PCI interrupt
-> wd0 at pciide0 channel 0 drive 0: <ST38410A>
-> wd0: drive supports 32-sector PIO transfers, LBA addressing
-> wd0: 8223 MB, 16708 cyl, 16 head, 63 sec, 512 bytes/sect x 16841664 sectors
-> wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 4 (Ultra/66)
-> wd0(pciide0:0:0): using PIO mode 4, DMA mode 2 (using DMA data transfers)
-> pciide0: secondary channel configured to native-PCI mode
-> pciide0: disabling secondary channel (no drives)
-> ppb1 at pci0 dev 1 function 0: Sun Microsystems, Inc. Simba PCI bridge (rev. 0x13)
-> pci2 at ppb1 bus 2
-> pci2: i/o space, memory space enabled
-> pciide1 at pci2 dev 1 function 0: Promise Ultra66/ATA Bus Master IDE Accelerator (rev. 0x01)
-> pciide1: bus-master DMA support present
-> pciide1: primary channel configured to native-PCI mode
-> pciide1: using ivec 10 for native-PCI interrupt
-> wd1 at pciide1 channel 0 drive 0: <ST3120026A>
-> wd1: drive supports 16-sector PIO transfers, LBA48 addressing
-> wd1: 111 GB, 232581 cyl, 16 head, 63 sec, 512 bytes/sect x 234441648 sectors
-> wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
-> wd1(pciide1:0:0): using PIO mode 4, Ultra-DMA mode 4 (Ultra/66) (using DMA data transfers)
-> pciide1: secondary channel configured to native-PCI mode
-> wd2 at pciide1 channel 1 drive 0: <ST3120026A>
-> wd2: drive supports 16-sector PIO transfers, LBA48 addressing
-> wd2: 111 GB, 232581 cyl, 16 head, 63 sec, 512 bytes/sect x 234441648 sectors
-> wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
-> wd2(pciide1:1:0): using PIO mode 4, Ultra-DMA mode 4 (Ultra/66) (using DMA data transfers)
-> wm0 at pci2 dev 2 function 0: Intel i82544EI 1000BASE-T Ethernet, rev. 2
-> wm0: interrupting at ivec 14
-> wm0: Ethernet address 00:02:b3:YY:YY:YY
-> makphy0 at wm0 phy 1: Marvell 88E1000 Gigabit PHY, rev. 0
-> makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
-> uhci0 at pci2 dev 3 function 0: VIA Technologies VT83C572 USB Controller (rev. 0x50)
-> uhci0: interrupting at ivec 18
-> usb0 at uhci0: USB revision 1.0
-> uhub0 at usb0
-> uhub0: VIA Technologies UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
-> uhub0: 2 ports with 2 removable, self powered
-> uhci1 at pci2 dev 3 function 1: VIA Technologies VT83C572 USB Controller (rev. 0x50)
-> uhci1: interrupting at ivec 19
-> usb1 at uhci1: USB revision 1.0
-> uhub1 at usb1
-> uhub1: VIA Technologies UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
-> uhub1: 2 ports with 2 removable, self powered
-> ehci0 at pci2 dev 3 function 2: VIA Technologies VT8237 EHCI USB Controller (rev. 0x51)
-> ehci0: EHCI version 0.95
-> ehci0: companion controllers, 2 ports each: uhci0 uhci1
-> usb2 at ehci0: USB revision 2.0
-> uhub2 at usb2
-> uhub2: VIA Technologie EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
-> uhub2: 4 ports with 4 removable, self powered
-> pcons at mainbus0 not configured   
-> No counter-timer -- using %tick at 360MHz as system clock.
-> Kernelized RAIDframe activated     
-> IPsec: Initialized Security Association Processing.
-> ehci0: handing over low speed device on port 1 to uhci0
-> uhub2: port 1, device disappeared after reset
-> ehci0: handing over full speed device on port 2 to uhci0
-> uhub2: port 2, device disappeared after reset
-> root on wd0a dumps on wd0b
-> [...]
-> 
-> (Yes, I know, it hardly looks like a U5 anymore... I should probably use a
->  cheap PC for this instead, but the only PC I've got that's more powerful
->  is a big honking machine that sounds like a 747, and I hate having it on
->  all the time :-)

----
Rafal Boni                                                     rafal@pobox.com
  We are all worms.  But I do believe I am a glowworm.  -- Winston Churchill