Subject: File System Corruption
To: None <port-alpha@netbsd.org>
From: Ray Phillips <r.phillips@mailbox.uq.edu.au>
List: port-alpha
Date: 01/09/2002 20:48:00
Dear NetBSD/alpha:

I have NetBSD/alpha version 1.5.2 running on a 3000/400 with the 
system disk (the only one at the moment) mounted internally.  About a 
week after setting this machine up it crashed with messages like 
these on its console:

asc0: STATUS_PHASE: msg 2
sd0(asc0:2:0): max sync rate 5.00MB/s
(asc0:2:0): selection failed; 3 left in FIFO [intr 18, stat 93, step 3]
sd0(asc0:2:0): asc0: timed out [ecb 0xfffffe000001e150 (flags 0x1, dleft 2000, >
sd0(asc0:2:0):  Check Condition on CDB: 0x0a 01 1b d0 10 00
     SENSE KEY:  Aborted Command
      ASC/ASCQ:  SCSI Parity Error
asc0: SCSI bus parity error
dev = 0x803, ino = 157, fs = /usr
panic: ifree: freeing free inode
Stopped in nmbd at      cpu_Debugger+0x4:       ret     zero,(ra)
db>

Some, such as the first, were repeated *many* times.  When I 
rebooted, problems were found in its file system:

Automatic boot in progress: starting file system checks.
/dev/rsd0a: UNALLOCATED  I=8299  OWNER=root MODE=0
/dev/rsd0a: SIZE=0 MTIME=Dec 24 18:00 2001
NAME=/var/log/messages.5.gz

/dev/rsd0a: UNEXPECTED INCONSISTENCY; RUN fsck_ffs MANUALLY.
Automatic file system check failed; help!
Dec 24 18:47:42 init: /bin/sh on /etc/rc terminated abnormally, going to singlee
Enter pathname of shell or RETURN for sh:

When I ran fsck_ffs on /dev/rsd0a and /dev/rsd0d I told it to:
- correct all incorrect block counts it mentioned
- clear the files it said had unknown type
- fix files it said had bad type values
- remove files it said were unallocated
- reconnect directories it said were unref'ed, and
- adjust the link count for files it said had an incorrect value

There were many of each type of error.  Luckily the files it 
suggested I remove were ones I could easily replace--mostly from the 
NetBSD distribution.  After this, the machine booted normally, but 
the following morning it had crashed again with the same 
symptoms.  I concluded the SCSI controller for the internal bus must 
be faulty and attached the system disk to the external bus.  Does 
that diagnosis sound right?  There've been no crashes in the week 
since then, so it seems likely.  I presume the internal SCSI 
controller chip is soldered to the system board and hence not 
replaceable?
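
For what it's worth, the failing command and the device number in the 
console messages above can be decoded with a little shell arithmetic. 
This is only a sketch under assumptions that the messages themselves 
don't confirm: that the CDB is a standard 6-byte SCSI command (opcode 
0x0a being WRITE(6)), that dev follows NetBSD's major()/minor() bit 
layout, and that alpha has 8 partitions per disk.

```shell
#!/bin/sh
# Bytes from "Check Condition on CDB: 0x0a 01 1b d0 10 00"
set -- 0x0a 0x01 0x1b 0xd0 0x10 0x00

opcode=$(( $1 ))
# Assumed 6-byte CDB layout: the low 5 bits of byte 1, then bytes 2
# and 3, form the 21-bit logical block address.
lba=$(( (($2 & 0x1f) << 16) | ($3 << 8) | $4 ))
# Byte 4 is the transfer length in blocks.
blocks=$(( $5 ))

# "dev = 0x803", split using the assumed NetBSD major()/minor() layout.
dev=$(( 0x803 ))
major=$(( ($dev >> 8) & 0xfff ))
minor=$(( (($dev >> 12) & 0xfff00) | ($dev & 0xff) ))

# Assumption: 8 partitions per disk, so minor = unit * 8 + partition.
unit=$(( $minor / 8 ))
part=$(( $minor % 8 ))
part_letter=$(echo abcdefgh | cut -c $(( $part + 1 )))

printf 'opcode=0x%02x lba=%d blocks=%d\n' "$opcode" "$lba" "$blocks"
printf 'major=%d minor=%d -> unit %d, partition %s\n' \
    "$major" "$minor" "$unit" "$part_letter"
```

If those assumptions hold, the failed command was a 16-block write 
starting at block 72656, and minor 3 is partition d of unit 0, which 
agrees with the panic naming /usr and with sd0d being the /usr 
partition.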

I tried to replace the system files I'd deleted with fsck by booting 
from a NetBSD CD, running sh from sysinst's utility menu and then

mount /dev/cd0a /mnt2
mount /dev/sd0a /mnt
mount /dev/sd0d /mnt/usr
cd /mnt
pax -zrpe -f /mnt2/alpha/binary/sets/base.tgz

When pax was running it generated a few error messages, which I can't 
find now and can't quote verbatim, but they mentioned not being able 
to extract some files because something couldn't be unlinked.  So, it 
seems there are still some errors in the file system.  Is it likely 
the only way to remove them is to newfs the disk?
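
Had I kept pax's error output I could quote it here.  A minimal 
sketch of how the extraction could be repeated with the messages 
saved follows; run_logged and the log file name are hypothetical, 
made up for this example, and are not part of NetBSD or sysinst.

```shell
#!/bin/sh
# Hypothetical helper: run a command and append everything it writes
# to stderr to a log file.  stdout is left alone, and the command's
# exit status is returned unchanged.
run_logged() {
    log=$1
    shift
    "$@" 2>>"$log"
}

# Intended use from the install CD's shell, with the same mounts as
# in the commands above (the log path is only an example):
#   cd /mnt
#   run_logged /mnt/pax-errors.log pax -zrpe -f /mnt2/alpha/binary/sets/base.tgz
```

Writing the log under /mnt puts it on the system disk, so it 
survives the reboot out of the install environment.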

By the way, why are sendmail and named included in the NetBSD 
distribution?  They're the only non-system programs that are, aren't 
they?  By running pax as above I (carelessly, I'll admit) overwrote 
version 8.12.1 of sendmail, which I'd previously installed. 
Upgrading from one version of NetBSD to another would be simpler in 
this case if sendmail and co. weren't in the way.


Ray