port-sparc64: Re: Odd data faults on U5 with Promise U66 IDE controller

Subject: Re: Odd data faults on U5 with Promise U66 IDE controller
To: Rafal Boni <rafal@pobox.com>
From: Eduardo Horvath <eeh@NetBSD.ORG>
List: port-sparc64
Date: 08/28/2003 19:09:52
On Thu, Aug 28, 2003 at 01:55:39AM -0400, Rafal Boni wrote:
> [...Following up to my own message; this looks like it may be sparc64-
>  specific, but it may also be related to RAIDFrame, so I'm sending it
>  on to tech-kern as well...]
>  
> In message <200308280112.h7S1CWp1001018@fearless-vampire-killer.waterside.net>,
> I wrote:
> 
> -> I've retooled my U5 to be a more useful server box (which is what it has
> -> been doing anyway), and to that aim I installed a Promise Ultra/66 IDE
> -> controller with two Seagate 120GB IDE drives hanging off of it, which I
> -> intended to use as a mirror set (I had to rip out the CDROM & floppy to
> -> do this and be able to fit everything in, but those seldom got any use
> -> anyway :-).
> -> 
> [...]
> 
> -> Each time I've tried this, so far (2 or 3 times), I've gotten an odd panic
> -> from what looks like an async data error, like so:
> -> 
> ->     data error type 32 sfsr=0 sfva=778000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30
> ->     data fault: pc=116a808 addr=778000 sfsr=0<ASI=0> 
> ->     kernel trap 32: data access error
> ->     Stopped at      netbsd:pdc202xx_pci_intr+0x24:  subcc     %l3, %o1, %g0

Let's see...  an AFSR of 84000000 has the PRIV and BERR bits set, which
means you got a privileged bus error accessing hardware address
0x1fe02000458.
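
For anyone who wants to check that decode by hand, here is a minimal
sketch in C.  The bit positions are my assumption from the UltraSPARC AFSR
layout (PRIV at bit 31, BERR at bit 26, ETS in bits 19:16, P_SYND in bits
15:0); they agree with the <PRIV,BERR,ETS=0,P_SYND=0> string the kernel
prints in the panics quoted in this thread, but verify them against the CPU
manual before relying on them.  The names afsr_fields and afsr_decode are
made up for illustration and are not kernel symbols.

```c
#include <stdint.h>
#include <stdio.h>

/* ASSUMED bit positions (UltraSPARC AFSR); check the CPU manual. */
#define AFSR_PRIV        (UINT64_C(1) << 31)  /* privileged-mode access */
#define AFSR_BERR        (UINT64_C(1) << 26)  /* bus error */
#define AFSR_ETS_SHIFT   16                   /* ETS field, bits 19:16 */
#define AFSR_ETS_MASK    UINT64_C(0xf)
#define AFSR_P_SYND_MASK UINT64_C(0xffff)     /* P_SYND field, bits 15:0 */

/* Hypothetical helper type: only the fields shown in the panic message. */
struct afsr_fields {
    int priv;
    int berr;
    unsigned ets;
    unsigned p_synd;
};

/* Split a raw AFSR value into the fields above. */
static struct afsr_fields afsr_decode(uint64_t afsr)
{
    struct afsr_fields f;

    f.priv = (afsr & AFSR_PRIV) != 0;
    f.berr = (afsr & AFSR_BERR) != 0;
    f.ets = (unsigned)((afsr >> AFSR_ETS_SHIFT) & AFSR_ETS_MASK);
    f.p_synd = (unsigned)(afsr & AFSR_P_SYND_MASK);
    return f;
}

/* Print the fields in roughly the same style as the panic message. */
static void afsr_print(uint64_t afsr)
{
    struct afsr_fields f = afsr_decode(afsr);

    printf("AFSR %llx<%s%sETS=%x,P_SYND=%x>\n",
        (unsigned long long)afsr,
        f.priv ? "PRIV," : "",
        f.berr ? "BERR," : "",
        f.ets, f.p_synd);
}
```

Calling afsr_print(0x84000000) from a small main() should show both PRIV
and BERR set with ETS and P_SYND zero, since 0x84000000 is just
0x80000000 | 0x04000000.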

> 
> [ "this" above, was copying data over from a non-RF wd1 disk, to the raid
>   partition on wd2 in preparation of adding the other disk to the mirror
>   set when all the data had been copied over...]
> 
> This appears to be related to RAIDFrame and read/write activity on both
> disks on the Promise controller; particularly, the RF disk being written
> while reading the non-RF'ed disk.
> 
> Here's what I've tried:
> 	No RF, two parallel dd's reading from each of the disks (wd1 and wd2).
> 		* No problems.
> 
> 	No RF, two parallel dd's, one reading from wd1, one writing to wd2,
> 	both through the FS and to the raw device (IIRC, I'm pretty sure
> 	I did the raw device as well).
> 		* No problems.
> 
> 	RF (wd2 and nonexistent wd3 in a mirrored set) on wd2 being written
> 	while the non-RF device (wd1) is read.
> 		* BOOM!
> 
> The panics appear a slight bit different when dd'ing, or when untar'ing to
> the RAID disk (rather than using dump/restore as I tried before):
> 
>     data error type 32 sfsr=0 sfva=8b3c000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30
>     panic: Privileged Async Fault: AFAR 0x1fe02000458 AFSR 84000000<PRIV,BERR,ETS=0,P_SYND=0>
> 
> and:
> 
>     data error type 32 sfsr=0 sfva=8df0000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30
>     panic: Privileged Async Fault: AFAR 0x1fe02000458 AFSR 84000000<PRIV,BERR,ETS=0,P_SYND=0>
> 
> and again very similar to the original panic (this time copying using
> tar and untar rather than dump/restore):
> 
>     data error type 32 sfsr=0 sfva=8be6000 afsr=84000000 afva=1fe02000458 tf=0xe0017c30
>     data fault: pc=116a808 addr=8be6000 sfsr=0<ASI=0> 
>     kernel trap 32: data access error 
>     Stopped in pid 700.1 (tar) at   netbsd:pdc202xx_pci_intr+0x24:  subcc

And these seem to be identical: PRIV and BERR bits set in the AFSR and the
faulting address is 0x1fe02000458.  This usually indicates that the access
to that hardware address was not acknowledged.  

You can figure out what device register that corresponds to by finding
the base address for that device in the boot messages and looking at
the driver headers to determine what that offset corresponds to.
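
The arithmetic for that lookup is just "fault address minus base", so a
rough sketch in C may help.  Everything in the table below is a
HYPOTHETICAL placeholder: I do not know the real base addresses or sizes
on this machine, so substitute the ranges your own boot messages report.
The names dev_range, example_ranges and find_device are also made up for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One mapped register range, as you would read it off the boot messages. */
struct dev_range {
    const char *name;
    uint64_t base;   /* physical base address of the range */
    uint64_t size;   /* length of the range in bytes */
};

/* HYPOTHETICAL values for illustration only; NOT taken from a real dmesg. */
static const struct dev_range example_ranges[] = {
    { "hypothetical-dev0", UINT64_C(0x1fe02000400), 0x10 },
    { "hypothetical-dev1", UINT64_C(0x1fe02000440), 0x20 },
};

/*
 * Find the range containing addr.  Returns its index and stores the
 * offset from the range base in *off, or returns -1 if nothing matches
 * (in which case *off is left untouched).
 */
static int find_device(const struct dev_range *r, size_t n,
    uint64_t addr, uint64_t *off)
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* Compare as (addr - base) < size to avoid overflow in base + size. */
        if (addr >= r[i].base && addr - r[i].base < r[i].size) {
            *off = addr - r[i].base;
            return (int)i;
        }
    }
    return -1;
}
```

With the placeholder table, looking up 0x1fe02000458 lands in the second
entry at offset 0x18; with real ranges, the offset is what you would then
look up in the driver headers.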

You can try reading and writing to that address from DDB or OBP to
determine if this condition occurs on every access or might be timing
related.

Then you can pester the driver maintainer to get it fixed.

Eduardo