port-i386: Re: parity error on disk

Subject: Re: parity error on disk - system hung
To: Gunnar Helliesen <gunnar@bitcon.no>
From: Brian Buhrow <buhrow@cats.ucsc.edu>
List: port-i386
Date: 02/14/1998 14:41:03
	Sounds like a loose cable or, perhaps, a loose power connector on one
of the drives.  The fact that it started misbehaving after you moved it
makes me think one of the internal ribbon cables might have worked loose
or, as I've often seen, one of the drive power connectors.  It sounds like
the system tried to reset the bus and lost the disk completely.
-Brian

On Feb 13, 11:58pm, Gunnar Helliesen wrote:
} Subject: parity error on disk - system hung
} Had a bit of a crisis yesterday, our ftp server went down only days
} after we moved it to Oslo. Having it in Oslo means that I need help in
} case the machine crashes as I'm 500 km away. The irritating bit is that
} it's been stable for ages until we moved it out of reach (of course).
} 
} System info: AOpen AP65, PPRO, 128 MB, Intel 440FX, 2 x AHA2940AU, 2 x
} IDE HD, 11 x SCSI HD, 1 x SCSI tape, NetBSD/i386 1.3 release.
} 
} After my helping hand in Oslo managed to get it back up again here's
} what I found in /var/log/messages:
} 
} 
} Feb 12 16:30:52 atlas /netbsd: sd4(ahc0:4:0): parity error during Command phase.
} Feb 12 16:30:52 atlas /netbsd: ahc0: ahc_intr - referenced scb not valid during scsiint 0x17 scb(1)
} Feb 12 16:30:53 atlas /netbsd: ahc0: WARNING no command for scb 1 (cmdcmplt)
} Feb 12 16:30:53 atlas /netbsd: QOUTCNT == 0
} Feb 12 16:31:07 atlas /netbsd: sd4(ahc0:4:0): parity error during Command phase.
} Feb 12 16:31:07 atlas /netbsd: ahc0: ahc_intr - referenced scb not valid during scsiint 0x17 scb(0)
} 
} 
} Here's what my helper reported was on the console before he rebooted:
} 
} 
} sd4(ahc0:4:0): parity error during Command phase
} ahc0:ahc_intr - referenced scb not valid during scsiint 0x17 scb(0)
} sd4(ahc0:4:0): timed out in datain phase, SCSISIGI == 0xc6
} sd4(ahc0:4:0): asserted ATN - device reset in message buffer
} sd4(ahc0:4:0): timed out in datain phase, SCSISIGI == 0xd6
} ahc0: Issued channel A Bus Reset #1: 2 SCBs aborted
} sd4(ahc0:4:0): data overrun of 16773119 bytes detected. Forcing a retry
} ahc0: target4 synchronous at 10.0 MHz, offset=0xf
} 
} sd4(ahc0:4:0): Check Condition on opcode 0x28
} SENSE KEY: Not Ready
} ASC/ASCQ: Logical Unit Not Ready, Cause Not Reportable
} 
} sd4(ahc0:4:0): Check Condition on opcode 0x28
} SENSE KEY: Not Ready
} ASC/ASCQ: Logical Unit Not Ready, Cause Not Reportable
} 
} 
} There was no panic; the system just hung completely, with the above
} messages being the last thing printed to the console. It was impossible
} to get a login prompt or any other response from the system, except
} that it did respond to pings.
} 
} I know this looks like a bad disk, but my question is: Should the system
} hang completely if it detects a parity error on the SCSI bus? Shouldn't
} it either panic and crash or (preferably) flag the disk in question as
} read-only and then continue? I'd consider just going into a hang
} situation like this a bug. It's certainly not a situation I can live
} with for a remotely-managed server.
} 
} The weird thing is that after the machine was rebooted the disk in
} question has behaved just fine. No more errors even though I've stressed
} it for hours. Could this be a cabling or termination problem?
} 
} Gunnar
} 
} --
} Gunnar Helliesen   | Bergen IT Consult AS  | NetBSD/VAX on a uVAX II
} Systems Consultant | Bergen, Norway        | '86 Jaguar Sovereign 4.2
} gunnar@bitcon.no   | http://www.bitcon.no/ | '73 Mercedes 280 (240D)
} 
} 
>-- End of excerpt from Gunnar Helliesen
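
For reference, the two SCSISIGI values in the quoted console output can be
decoded bit by bit.  The sketch below is a minimal Python illustration and
not part of the original exchange.  The bit masks are assumed to follow the
SCSISIGI layout in the aic7xxx register definitions (C/D at 0x80 down to
ACK at 0x01) and should be checked against the driver source; the helper
names decode_scsisigi and bus_phase are made up for this example.

```python
# Assumed SCSISIGI bit layout (verify against the aic7xxx register definitions).
SCSISIGI_BITS = [
    (0x80, "CD"),   # Command/Data
    (0x40, "IO"),   # Input/Output
    (0x20, "MSG"),  # Message
    (0x10, "ATN"),  # Attention
    (0x08, "SEL"),  # Select
    (0x04, "BSY"),  # Busy
    (0x02, "REQ"),  # Request
    (0x01, "ACK"),  # Acknowledge
]

# SCSI bus phase as encoded by the MSG, C/D and I/O signals.
PHASES = {
    0x00: "Data Out",
    0x40: "Data In",
    0x80: "Command",
    0xC0: "Status",
    0xA0: "Message Out",
    0xE0: "Message In",
}


def decode_scsisigi(value):
    """Return the names of the signals asserted in a SCSISIGI value."""
    return [name for mask, name in SCSISIGI_BITS if value & mask]


def bus_phase(value):
    """Return the bus phase implied by the CD, IO and MSG bits."""
    return PHASES.get(value & 0xE0, "Reserved")


if __name__ == "__main__":
    for raw in (0xC6, 0xD6):
        print(hex(raw), decode_scsisigi(raw), bus_phase(raw))
```

Under those assumptions, both values show BSY and REQ asserted with the
phase lines indicating Status rather than Data In, and the second value
additionally has ATN set, which is consistent with the "asserted ATN"
message logged between the two timeouts.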