Subject: Re: File System Corruption
To: Ray Phillips <r.phillips@mailbox.uq.edu.au>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-alpha
Date: 01/09/2002 22:28:49
On Wed, Jan 09, 2002 at 08:48:00PM +1000, Ray Phillips wrote:
> Dear NetBSD/alpha:
> 
> I have NetBSD/alpha version 1.5.2 running on a 3000/400 with the 
> system disk (the only one at the moment) mounted internally.  About a 
> week after setting this machine up it crashed with messages like 
> these on its console:
> 
> asc0: STATUS_PHASE: msg 2
> sd0(asc0:2:0): max sync rate 5.00MB/s
> (asc0:2:0): selection failed; 3 left in FIFO [intr 18, stat 93, step 3]
> sd0(asc0:2:0): asc0: timed out [ecb 0xfffffe000001e150 (flags 0x1, 
> dleft 2000, >
> sd0(asc0:2:0):  Check Condition on CDB: 0x0a 01 1b d0 10 00
>      SENSE KEY:  Aborted Command
>       ASC/ASCQ:  SCSI Parity Error
> asc0: SCSI bus parity error
> dev = 0x803, ino = 157, fs = /usr
> panic: ifree: freeing free inode
> Stopped in nmbd at      cpu_Debugger+0x4:       ret     zero,(ra)
> db>
> 
> Some, such as the first, were repeated *many* times.  When I 
> rebooted, problems were found in its file system:
> 
> Automatic boot in progress: starting file system checks.
> /dev/rsd0a: UNALLOCATED  I=8299  OWNER=root MODE=0
> /dev/rsd0a: SIZE=0 MTIME=Dec 24 18:00 2001
> NAME=/var/log/messages.5.gz
> 
> /dev/rsd0a: UNEXPECTED INCONSISTENCY; RUN fsck_ffs MANUALLY.
> Automatic file system check failed; help!
> Dec 24 18:47:42 init: /bin/sh on /etc/rc terminated abnormally, going 
> to singlee
> Enter pathname of shell or RETURN for sh:
> 
> When I ran fsck_ffs on /dev/rsd0a and /dev/rsd0d I told it to:
> - correct all incorrect block counts it mentioned
> - clear the files it said had unknown type
> - fix files it said had bad type values
> - remove files it said were unallocated
> - reconnect directories it said were unref'ed, and
> - adjust the link count for files it said had an incorrect value
> 
> There were many of each type of error.  Luckily the files it 
> suggested I remove were ones I could easily replace--mostly from the 
> NetBSD distribution.  After this, the machine booted normally, but 
> the following morning it had crashed again with the same 
> symptoms.  I concluded the SCSI controller for the internal bus must 
> be faulty and attached the system disk to the external bus, agreed? 
> There've been no crashes in the week since then, so that seems 
> likely.  I presume the internal SCSI controller chip is soldered to 
> the system board and hence not replaceable?


The parity error would point to a problem between the SCSI
chip and the SCSI connector, so I'm not sure replacing the SCSI chip
would solve it.

> [...]
> When pax was running it generated a few error messages, which I can't 
> find now and can't quote verbatim, but they mentioned not being able 
> to extract some files because something couldn't be unlinked.  So, it 
> seems there are still some errors in the file system.  Is it likely 
> the only way to remove them is to newfs the disk?

No, this is probably caused by something else, such as some files
having gained a file flag. Try ls -lo on these files to see their flags.
If there are a lot of them, newfs may be the faster way of solving it, though.
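
As a rough sketch of that check, assuming the stray flag turns out to be
one of the immutable flags (uchg or schg), something like the following
would list each file with its flags and then clear them. The function
name clear_immutable is made up for illustration, and clearing schg
needs root and a securelevel that permits it (single-user mode, for
instance).

```shell
# Hypothetical helper: show the flags on each file, then clear the
# user and system immutable flags so the file can be unlinked again.
clear_immutable() {
    for f in "$@"; do
        # BSD ls: -o adds the file flags column to the long listing
        ls -lo "$f"
        # Assumption: the offending flag is uchg or schg; adjust to
        # whatever ls -lo actually reports
        chflags nouchg,noschg "$f"
    done
}
```

It would be called with the paths pax complained about, e.g.
clear_immutable /usr/some/file (a placeholder path).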

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
--