Subject: Re: filesystem issues after rude powerdown
To: Ignatios Souvatzis <ignatios@cs.uni-bonn.de>
From: Brian <bmcewen@comcast.net>
List: netbsd-users
Date: 08/03/2004 07:50:59
On Tuesday, August 3, 2004, at 05:34 AM, Ignatios Souvatzis wrote:
>
> Then you should run something that listens to the UPS and shuts down the
> machine cleanly when the time is near. All a battery-powered UPS is
> supposed to do is to help you survive short power failures, give you time
> to start your diesel generators, or do a clean shutdown. If you insist on
> surviving longer outages on battery, you need ... bigger batteries.
>

What exists that would let me configure output to the serial port?  
This is a headless Cobalt Qube with no USB, so it would have to be 
something that takes the USB output from the UPS as input and then 
issues the shutdown commands on the serial console.

I'd read of issues (likely here) trying to get USB-capable NetBSD boxes 
to trigger shutdown based on UPS output; I didn't think that was 
working well even on systems with working USB.

>
> Uhm... mapping out bad blocks is a function of modern disks (IDE as well
> as SCSI). However, this might be configured off for your driver, or might
> only happen when you _write_ them, as the disk can not know what to write
> into the remapped blocks when it can't read the original ones.
>

I might have to pull the (IDE) HD out, put it in a desktop, and 
reformat the partition using appropriate tools that way.  But as you 
say, I would have expected bad blocks to get remapped automagically 
using the reserved areas.

> Assuming (check that!) that the error message was from the driver, and
> refers to disk block numbers (as opposed to coming from the file system
> and referring to filesystem sectors), you could try to
>
> umount /tmp (in single user mode, obviously)
>
> /sbin/sysctl kern.rawpartition
>
> if it is 3:
> dd bs=512 count=13 if=/dev/zero seek=756 of=/dev/rwd0d (on i386)
>
> if it is 2:
> dd bs=512 count=13 if=/dev/zero seek=756 of=/dev/rwd0c
>
> After that you'll have to "fsck -f" the affected file system.
>
> You do this at your own risk; read the manual pages until you understand
> what those commands do. Especially, as you didn't show the original
> error message, I have no idea whether it really referred to disk blocks
> or filesystem sectors (which would be relative to the partition boundary,
> and using units!)
>

At bootup, I get this (my system hung again, after bootup I captured 
the console output this time):

	Starting file system checks:
	/dev/rwd0a: file system is clean; not checking
	/dev/rwd0f: file system is clean; not checking
	wd0: transfer error, downgrading to Ultra-DMA mode 1
	wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 1 (using DMA data transfers)
	wd0g: error reading fsbn 736 of 736-847 (wd0 bn 4782688; cn 4744 tn 11 sn 43), retrying

bootup fails; I end up in single user mode, running fsck_ffs on 
/dev/rwd0g with output:

	** Phase 1 - Check Blocks and Sizes
	wd0g: error reading fsbn 752 of 736-847 (wd0 bn 4782704; cn 4744 tn 11 sn 59), retrying
	wd0: (uncorrectable data error)
	wd0g: error reading fsbn 752 of 736-847 (wd0 bn 4782704; cn 4744 tn 11 sn 59), retrying
	wd0: (uncorrectable data error)
	wd0g: error reading fsbn 752 of 736-847 (wd0 bn 4782704; cn 4744 tn 11 sn 59), retrying
	wd0: (uncorrectable data error)
	wd0g: error reading fsbn 752 of 736-847 (wd0 bn 4782704; cn 4744 tn 11 sn 59), retrying
	wd0: (uncorrectable data error)
	wd0g: error reading fsbn 756 of 736-847 (wd0 bn 4782708; cn 4744 tn 12 sn 0), retrying
	wd0: (uncorrectable data error)
	wd0g: error reading fsbn 756 of 736-847 (wd0 bn 4782708; cn 4744 tn 12 sn 0)wd0: (uncorrectable data error)
	CANNOT READ: BLK 736
	CONTINUE? [yn] y
	wd0g: error reading fsbn 756 (wd0 bn 4782708; cn 4744 tn 12 sn 0), retrying
	wd0: (uncorrectable data error)
[...]
	wd0g: error reading fsbn 756 (wd0 bn 4782708; cn 4744 tn 12 sn 0)wd0: (uncorrectable data error)
	wd0g: error reading fsbn 757 (wd0 bn 4782709; cn 4744 tn 12 sn 1), retrying
	wd0g: error reading fsbn 768 (wd0 bn 4782720; cn 4744 tn 12 sn 12)wd0: (uncorrectable data error)
[...]
	THE FOLLOWING DISK SECTORS COULD NOT BE READ: 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768,
	** Phase 2 - Check Pathnames
	** Phase 3 - Check Connectivity
	** Phase 4 - Check Reference Counts
	** Phase 5 - Check Cyl groups
	1 files, 1 used, 496302 free (14 frags, 62036 blocks, 0.0% fragmentation)
	MARK FILE SYSTEM CLEAN? [yn] y
	***** FILE SYSTEM MARKED CLEAN *****
	***** FILE SYSTEM WAS MODIFIED *****


I tried the dd copy (the point being to force a write to the blocks, to 
see if the IDE drive hardware would remap the bad stuff using reserved 
areas, yes?) and the copy reportedly completed successfully, but the 
areas remain bad during fsck. (I did overwrite the magic number for the 
partition, but that's fixable.  I guess "disk sector 756" equals "block 
736", i.e. the first one of the /tmp partition.)
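
A rough sanity check on the numbers in the console output above, as a minimal Python sketch. It assumes the "sn" field is zero-based and uses the 16-head, 63-sector geometry that fdisk reports; the function names are made up for illustration. Each kernel line gives both a partition-relative block (fsbn) and an absolute one (wd0 bn), so subtracting the two should give the start of wd0g on the disk.

```python
import re

# Kernel error lines copied from the console output in this message.
LOG = """\
wd0g: error reading fsbn 736 of 736-847 (wd0 bn 4782688; cn 4744 tn 11 sn 43), retrying
wd0g: error reading fsbn 752 of 736-847 (wd0 bn 4782704; cn 4744 tn 11 sn 59), retrying
wd0g: error reading fsbn 756 (wd0 bn 4782708; cn 4744 tn 12 sn 0), retrying
wd0g: error reading fsbn 768 (wd0 bn 4782720; cn 4744 tn 12 sn 12)wd0: (uncorrectable data error)
"""

# Geometry as reported by fdisk: 16 heads, 63 sectors/track.
HEADS, SECTORS_PER_TRACK = 16, 63

PATTERN = re.compile(
    r"fsbn (\d+).*?\(wd0 bn (\d+); cn (\d+) tn (\d+) sn (\d+)\)"
)


def parse(log):
    """Yield (fsbn, bn, cn, tn, sn) as integers for each error line."""
    for line in log.splitlines():
        m = PATTERN.search(line)
        if m:
            yield tuple(int(g) for g in m.groups())


def chs_to_lba(cn, tn, sn):
    """Cylinder/track/sector to absolute sector (assumes zero-based sn)."""
    return (cn * HEADS + tn) * SECTORS_PER_TRACK + sn


def partition_offset(log):
    """Return the (bn - fsbn) offset, which must agree across all lines."""
    offsets = {bn - fsbn for fsbn, bn, *_ in parse(log)}
    if len(offsets) != 1:
        raise ValueError(f"inconsistent offsets: {offsets}")
    return offsets.pop()


if __name__ == "__main__":
    offset = partition_offset(LOG)
    print("wd0g appears to start at absolute sector", offset)
    for fsbn, bn, cn, tn, sn in parse(LOG):
        print(fsbn, bn, chs_to_lba(cn, tn, sn) == bn)
```

On these lines the offset works out to 4781952 in every case, and the cn/tn/sn values agree with the wd0 bn values. If that reading is right, fsbn 756 is absolute sector 4782708, so a dd aimed at the whole-disk device (rwd0d or rwd0c) with seek=756 would have written near the start of the disk rather than to the failing sectors; seek=756 only lines up with them when the target is the wd0g partition itself.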

What's the best way to reformat /tmp or rebuild the partition map for 
this partition from within NetBSD?  Just fdisk it, or is there a more 
thorough way to format & test?  I didn't notice any   I built this 
bootable image using the Cobalt netboot CD from the cobalt-support 
area (so I didn't have to set up the partitions and prep them myself). 
I could pull the drive, put it in a Win98 desktop, and reformat just 
the /tmp partition, but I'm not sure I have any tools that know about 
BSD filesystems, and I'd have to know how to relabel it properly (it's 
not just getting fstab set up properly, is it?)

And now that I look at it, the partitioning is:

	Qube: {4} df -k
	Filesystem  1K-blocks     Used     Avail Capacity  Mounted on
	/dev/wd0a    54842010  9624912  42474996    18%    /
	/dev/wd0f     2064766    15876   1945650     0%    /var
	/dev/wd0g      496303        1    471486     0%    /tmp

	Qube: {1} fdisk
	Disk: /dev/rwd0d
	NetBSD disklabel disk geometry:
	cylinders: 16383 heads: 16 sectors/track: 63 (1008 sectors/cylinder)
	BIOS disk geometry:
	cylinders: 16383 heads: 16 sectors/track: 63 (1008 sectors/cylinder)
	Partition table:
	0: sysid 131 (Linux native)
	    start 1, size 61488 (30 MB), flag 0x0
	        beg: cylinder    0, head   0, sector  2
	        end: cylinder   61, head   0, sector  1
	1: sysid 130 (Linux swap or Prime or Solaris)
	    start 61488, size 525168 (256 MB), flag 0x0
	        beg: cylinder   61, head   0, sector  1
	        end: cylinder  581, head  15, sector 63
	2: sysid 169 (NetBSD)
	    start 586656, size 116644752 (56955 MB), flag 0x0
	        beg: cylinder  582, head   0, sector  1
	        end: cylinder  588, head  15, sector 63
	3: <UNUSED>.
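
For what it's worth, the fdisk numbers can be cross-checked with a few lines of Python. This is only a sketch: the start and size values are copied from the table above, 512-byte sectors are assumed, and 4781952 is taken from the kernel error lines earlier in this message (wd0 bn 4782708 minus fsbn 756) as the apparent start of wd0g. The helper names are hypothetical.

```python
SECTOR_BYTES = 512  # assumed sector size

# (index, sysid, start sector, size in sectors), copied from fdisk.
MBR = [
    (0, 131, 1, 61488),
    (1, 130, 61488, 525168),
    (2, 169, 586656, 116644752),
]

# Apparent start of wd0g, derived from the kernel log: 4782708 - 756.
WD0G_START = 4782708 - 756


def size_mb(sectors):
    """Size in whole megabytes (MB taken as 1024 * 1024 bytes)."""
    return sectors * SECTOR_BYTES // (1024 * 1024)


def containing_partitions(sector):
    """Return the indexes of MBR partitions whose range covers sector."""
    return [
        index
        for index, _sysid, start, size in MBR
        if start <= sector < start + size
    ]


if __name__ == "__main__":
    for index, sysid, start, size in MBR:
        print(index, sysid, start, size, size_mb(size), "MB")
    print("wd0g start lies in MBR partition(s):",
          containing_partitions(WD0G_START))
```

The sizes come out as 30, 256 and 56955 MB, matching what fdisk prints, and the wd0g start falls inside MBR partition 2 (the NetBSD one), which is at least consistent with the BSD disklabel partitions living inside that slice.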

Thanks for the help!  At this time, I have a machine that boots and 
runs, but eventually hangs after a couple of days and fails to reboot 
until I run fsck_ffs manually, always with the same issues in /tmp.

Brian