Subject: trying to track down filesystem corruption
To: None <current-users@NetBSD.ORG>
From: Daniel Carosone <danielce@ee.mu.OZ.AU>
List: current-users
Date: 01/29/1995 14:29:16
A friend brought over his disks so I could install NetBSD-current on
them for him. During the process, an fsck run detected corrupted
directories on one of these disks, after some heavy activity.

I have been able to replicate the corruption a number of times, and
it's always the same kind of trouble (corrupted directory, plus other
things that follow on from that, like missing .. and . entries).

At this stage, I'm not sure what the cause is, and I'm going through
trying to eliminate possibilities before I send-pr.  Has anyone else
seen trouble like this after heavy disk activity?

My first thought was bad blocks on the disk. Using the BIOS utility of
my SCSI controller (AHA-1542A), I reformatted the disk and ran a
surface verification to remap any bad sectors.  This didn't help.
However, both processes took rather less time than I expected, so
they probably weren't very thorough, and they certainly weren't at all
verbose in their output.

In a fresh filesystem, I dd'd /dev/zero into a file until the
filesystem filled, and then cmp'd the file with /dev/zero. No
differences were found. I decided I wanted to run the same test over
the raw disk, but for some reason I can't dd to /dev/rsd0*; it tells
me "read only filesystem" (even in single-user mode with
kern.securelevel = 0).  I *can* dd to the block device, but it's very
slow, much slower than going through the filesystem, and it slows the
rest of the machine to glacial speeds.
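
A minimal sketch of the fill-and-compare test described above. The
function names, the 64 KiB block size and the explicit block count are
assumptions made for illustration; the original test simply wrote
until the filesystem was full. Bounding the size lets cmp compare
against exactly the same number of zero bytes instead of hitting EOF
on the file.

```sh
#!/bin/sh
# Hypothetical helpers, not part of any NetBSD tool.

# write_zeros FILE BLOCKS
# Write BLOCKS blocks of 64 KiB of zeros into FILE.
write_zeros() {
    dd if=/dev/zero of="$1" bs=65536 count="$2" 2>/dev/null
}

# verify_zeros FILE BLOCKS
# Compare FILE against a fresh stream of BLOCKS * 64 KiB zero bytes.
# Prints "ok" or "mismatch"; returns non-zero on mismatch.
verify_zeros() {
    if dd if=/dev/zero bs=65536 count="$2" 2>/dev/null | cmp -s - "$1"; then
        echo "ok"
    else
        echo "mismatch"
        return 1
    fi
}
```

Something like `write_zeros /fs/g/zeros 1024 && verify_zeros /fs/g/zeros 1024`
would exercise 64MB of the filesystem. Note that a read-back done
straight after the write may be served from the buffer cache rather
than the disk, so unmounting and remounting between the two steps is
probably needed for the comparison to say anything about the media.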

Why?  How do I dd data over the entire physical disk?

Are there any tools for formatting and pattern-testing SCSI disks
under NetBSD?  I want to do more extensive testing of the disk itself.

As I write this, I'm trying to replicate the problem on an IDE disk,
to see if it's a filesystem code problem.  It worries me that the
damage is the same every time (but to different directory inode
numbers).  If I can't make it happen there, I will try to heist
another SCSI disk from a different machine to test for SCSI driver
problems.

What I did to create the problem (yes, I know it's a little kooky,
but that's what I did; also, /usr/src is a symlink to an NFS-mounted
directory, in case that matters):

  . newfs -i 8192 -m 5 /dev/rsd0a   
  . mount /dev/sd0a /fs/g
  . mkdir /fs/g/NetBSD /fs/g/X11R6
  . cd /usr/src/..  (to go to the dir with src, sup, doc, othersrc dirs)
  . tar cf - . | tar xCf /fs/g/NetBSD - &
  . (in a loop while the stuff is tarring) 
    find /fs/g/NetBSD -name obj.sparc -o -name obj.i386 | xargs rm -rf
 
During one run, I was also untarring X11R6 sources into the filesystem
at the same time, but it's happened without that. Anyway, heavy
filesystem and disk activity, as well as a fair bit of ethernet.
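
For anyone who wants to try this, the copy-and-prune part of the
procedure could be scripted roughly as below. This is a sketch under
assumptions, not the exact commands that were run: the function name
stress_copy, the one-second pause between prune passes, and the final
prune pass after the copy finishes are all additions for illustration.
tar errors are reported but not treated as fatal, since deleting
directories underneath a running extraction can make tar complain.

```sh
#!/bin/sh
# stress_copy SRC DST  (hypothetical helper)
# Copy the tree under SRC into DST through a tar pipeline while
# repeatedly deleting obj.sparc and obj.i386 directories from DST.
stress_copy() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1

    # Background copy: archive SRC to stdout, unpack it under DST.
    (cd "$src" && tar cf - .) | (cd "$dst" && tar xf -) &
    copy_pid=$!

    # Prune for as long as the extracting tar is still running.
    while kill -0 "$copy_pid" 2>/dev/null; do
        find "$dst" \( -name obj.sparc -o -name obj.i386 \) -print \
            2>/dev/null | xargs rm -rf
        sleep 1
    done

    # Collect the copy's exit status; a failure here is not unexpected.
    wait "$copy_pid" || echo "stress_copy: tar reported errors" >&2

    # Assumption: one last pass so the end state does not depend on timing.
    find "$dst" \( -name obj.sparc -o -name obj.i386 \) -print \
        2>/dev/null | xargs rm -rf
    return 0
}
```

With the paths used above this would be called as
`stress_copy /usr/src/.. /fs/g/NetBSD`, followed by unmounting the
filesystem and running fsck on it. Path names containing whitespace
would confuse the plain find | xargs pipeline.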

The last possibility I can see is a bug in fsck, since if there really
were corrupted directories, I would have expected those find(1) runs
to have come up with something nasty.  The only way I can think of to
test this is to provoke the problem, then try to crash the machine by
using the filesystem.
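
One way to "use the filesystem" after provoking the problem might be
a full read-only walk of the tree, as sketched below. The function
name walk_tree is made up for this example, and whether a walk like
this would actually trip over a damaged directory is an open question;
it only guarantees that every directory is listed and every regular
file is read once.

```sh
#!/bin/sh
# walk_tree DIR  (hypothetical helper)
# List every directory under DIR and read every regular file, counting
# how many of the two passes reported an error.
walk_tree() {
    dir=$1
    errs=0

    # Pass 1: stat and list every entry, recursively.
    ls -lR "$dir" >/dev/null 2>&1 || errs=$((errs + 1))

    # Pass 2: read the contents of every regular file.
    find "$dir" -type f -exec cat {} + >/dev/null 2>&1 || errs=$((errs + 1))

    echo "passes with errors: $errs"
    [ "$errs" -eq 0 ]
}
```

For the filesystem above this would be run as `walk_tree /fs/g/NetBSD`
before unmounting and running fsck, so that the two results can be
compared.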

If someone else wants to try a procedure like the one above to see if
they can find the problem too, that would help a lot.


Kernel boot-time messages from the machine in question:
------------------------------------------------------------------------
NetBSD 1.0A (_anarres_) #10: Thu Jan 26 09:07:33 EST 1995
    dan@anarres:/amd/oi/fs/f/l/NetBSD/src/sys/arch/i386/compile/_anarres_
CPU: i486DX (486-class CPU)
real mem  = 16384000
avail mem = 14069760
using 225 buffers containing 921600 bytes of memory
isa0 (root)
npx0 at isa0 port 0xf0-0xff: using exception 16
vt0 at isa0 port 0x60-0x6f irq 1: et4000, 80/132 col, color, 8 scr, mf2-kbd, [R3.00]
com0 at isa0 port 0x3f8-0x3ff irq 4: ns82450 or ns16450, no fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns82450 or ns16450, no fifo
com2 at isa0 port 0x3e8-0x3ef irq 5: ns16550a, working fifo
com3 at isa0 port 0x2e8-0x2ef irq 9: ns16550a, working fifo
lpt0 at isa0 port 0x378-0x37f irq 7
lpt2 at isa0 port 0x3bc-0x3c3: polled
aha0 at isa0 port 0x330-0x333 irq 11 drq 5
scsibus0 at aha0
aha0 targ 0 lun 0: <CONNER, CFP1060S 1.05GB, 2035> SCSI2 0/direct fixed
sd0 at scsibus0: 1013MB, 2756 cyl, 8 head, 94 sec, 512 bytes/sec
aha0 targ 2 lun 0: <ARCHIVE, VIPER 150  21247, -005> SCSI1 1/sequential removable
st0 at scsibus0: rogue, drive empty
aha0 targ 4 lun 0: <SONY, SDT-2000, 2.09> SCSI2 1/sequential removable
st1 at scsibus0: drive empty
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.2MB 80 cyl, 2 head, 15 sec
fd1 at fdc0 drive 1: 1.44MB 80 cyl, 2 head, 18 sec
wdc0 at isa0 port 0x1f0-0x1f7 irq 14
wd0 at wdc0 drive 0: 515MB, 1048 cyl, 16 head, 63 sec, 512 bytes/sec <WDC AC2540H>
wd0: using 16-sector 16-bit pio transfers, lba addressing
wd1 at wdc0 drive 1: 515MB, 1048 cyl, 16 head, 63 sec, 512 bytes/sec <WDC AC2540F>
wd1: using 16-sector 16-bit pio transfers, lba addressing
ed0 at isa0 port 0x280-0x29f iomem 0xd0000-0xd3fff irq 10: address 00:00:c0:98:37:20, type WD8013EBT (16-bit)
root device eisa not configured
root device pci not configured
biomask 4840 netmask 63a ttymask 23a


--
Dan.