Subject: parity error on disk - system hung
To: 'NetBSD/i386 list' <port-i386@NetBSD.ORG>
From: Gunnar Helliesen <gunnar@bitcon.no>
List: port-i386
Date: 02/13/1998 23:58:19
Had a bit of a crisis yesterday: our FTP server went down only days
after we moved it to Oslo. Having it in Oslo means that I need help if
the machine crashes, as I'm 500 km away. The irritating bit is that
it had been stable for ages until we moved it out of reach (of course).

System info: AOpen AP65, PPRO, 128 MB, Intel 440FX, 2 x AHA2940AU, 2 x
IDE HD, 11 x SCSI HD, 1 x SCSI tape, NetBSD/i386 1.3 release.

After my helping hand in Oslo managed to get it back up again, here's
what I found in /var/log/messages:


Feb 12 16:30:52 atlas /netbsd: sd4(ahc0:4:0): parity error during Command phase.
Feb 12 16:30:52 atlas /netbsd: ahc0: ahc_intr - referenced scb not valid during scsiint 0x17 scb(1)
Feb 12 16:30:53 atlas /netbsd: ahc0: WARNING no command for scb 1 (cmdcmplt)
Feb 12 16:30:53 atlas /netbsd: QOUTCNT == 0
Feb 12 16:31:07 atlas /netbsd: sd4(ahc0:4:0): parity error during Command phase.
Feb 12 16:31:07 atlas /netbsd: ahc0: ahc_intr - referenced scb not valid during scsiint 0x17 scb(0)


Here's what my helper reported was on the console before he rebooted:


sd4(ahc0:4:0): parity error during Command phase
ahc0:ahc_intr - referenced scb not valid during scsiint 0x17 scb(0)
sd4(ahc0:4:0): timed out in datain phase, SCSISIGI == 0xc6
sd4(ahc0:4:0): asserted ATN - device reset in message buffer
sd4(ahc0:4:0): timed out in datain phase, SCSISIGI == 0xd6
ahc0: Issued channel A Bus Reset #1: 2 SCBs aborted
sd4(ahc0:4:0): data overrun of 16773119 bytes detected. Forcing a retry
ahc0: target4 synchronous at 10.0 MHz, offset=0xf

sd4(ahc0:4:0): Check Condition on opcode 0x28
SENSE KEY: Not Ready
ASC/ASCQ: Logical Unit Not Ready, Cause Not Reportable

sd4(ahc0:4:0): Check Condition on opcode 0x28
SENSE KEY: Not Ready
ASC/ASCQ: Logical Unit Not Ready, Cause Not Reportable


There was no panic; the system just hung completely, with the above
messages the last thing printed to the console. It was impossible to get
a login prompt or any other response from the system, except that it
still responded to pings.

I know this looks like a bad disk, but my question is: should the system
hang completely if it detects a parity error on the SCSI bus? Shouldn't
it either panic and crash or (preferably) flag the disk in question as
read-only and then continue? I'd consider simply hanging like this a
bug. It's certainly not a situation I can live with for a remotely
managed server.

The weird thing is that since the machine was rebooted, the disk in
question has behaved just fine. No more errors, even though I've stressed
it for hours. Could this be a cabling or termination problem?
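
One rough way to probe the cabling theory is to tally parity-error
messages per disk across the whole log. The sketch below is only a
heuristic, and it assumes the message format shown in the log excerpt
above; the function name tally_parity_errors is made up for
illustration and is not part of any NetBSD tool.

```python
import re
from collections import Counter

# Matches the driver's parity-error message as it appears in the log
# excerpt above, e.g. "sd4(ahc0:4:0): parity error during".
PARITY_RE = re.compile(r"(sd\d+)\((ahc\d+):(\d+):(\d+)\): parity error")


def tally_parity_errors(lines):
    """Count parity-error messages per (controller, disk) pair."""
    counts = Counter()
    for line in lines:
        match = PARITY_RE.search(line)
        if match:
            disk, controller = match.group(1), match.group(2)
            counts[(controller, disk)] += 1
    return counts


# Two of the lines quoted above, used as sample input.
sample = [
    "Feb 12 16:30:52 atlas /netbsd: sd4(ahc0:4:0): parity error during",
    "Feb 12 16:30:53 atlas /netbsd: QOUTCNT == 0",
    "Feb 12 16:31:07 atlas /netbsd: sd4(ahc0:4:0): parity error during",
]
for (controller, disk), n in sorted(tally_parity_errors(sample).items()):
    print(f"{controller} {disk}: {n} parity error(s)")
```

To run it against the real log, pass an open file instead of the sample
list, e.g. tally_parity_errors(open("/var/log/messages")). If only one
disk ever shows up, that disk or its connector is the more likely
suspect; errors spread over several targets on the same controller would
point more towards cabling or termination.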

Gunnar

--
Gunnar Helliesen   | Bergen IT Consult AS  | NetBSD/VAX on a uVAX II
Systems Consultant | Bergen, Norway        | '86 Jaguar Sovereign 4.2
gunnar@bitcon.no   | http://www.bitcon.no/ | '73 Mercedes 280 (240D)