Subject: Sparing sectors with soft errors?
To: None <port-pmax@NetBSD.ORG>
From: Andy Sparrow <andy@aonix.com>
List: port-pmax
Date: 01/24/1997 15:33:09
Hi.

I've got a number of disk packs with 2 x RZ56s in them, and they
all work fine except for one drive which, I would guess, has an
excessive number of "soft" errors on it.

The drive takes a _looong_ time to come up, with mucho recalibration
to track 0 and re-seeking, and runs very slowly in use for the
same reason. The kernel also complains about a bunch of sectors.

Interestingly, if I try to use the disk pack with this drive in
it (booting from the other drive, not necessarily even mounting this
drive), I get a bunch of ECC memory errors reported from the kernel,
even after I switch this drive off/reset and reboot from another
disk pack altogether.

(If this drive isn't in the picture, I've not had any ECC errors
reported during some 3-4 weeks of running this 5000/200 almost
constantly).

So, it seems to me that I need to do something about this
drive. I _could_ use it as a doorstop, but this seems contrary
to the spirit of the whole exercise :)

A technique I've always used in the past (other machines
and OSs) was to spare out the problem sectors/tracks manually
in this scenario.

However, I have a couple of questions re: this procedure in NetBSD,
and I'm not clear on it, even after scanning the mail archives:


i)	My disktab and disklabel presently agree WRT the size of
	the 'c' partition, i.e. pc=1299174, rather than
	ns * nt * nc = 54 * 15 * 1632 = 1321920

	I note that this is the same information as the disktab
	in the minimal Ultrix installation I still have on one disk,
	so I figured that this was deliberate, in order to provide
	a pool of alternates for the bad block table which was out
	of the range of any filesystem.

ii)	Looking at the man pages, I figure I want to use 'bad144'
	rather than 'badsect', as anything done by the latter is 
	prone to suddenly vanishing if I decide to re-partition/newfs 
	the drive, as I might.

	But 'bad144' complains that it can't access the parts of
	the disk it wants to. 

	I guess that this is because the label on the disk says 
	that there are "only" 1299174 sectors, and it wants to
	talk to the end of the drive?
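
A quick sketch of the arithmetic behind (i) and the guess in (ii), in
Python purely for illustration. The geometry and label figures are the
ones quoted above. The table location (first five even-numbered sectors
of the last track) is an assumption taken from the layout described in
the 'bad144' man page, and the function names are made up for this
sketch.

```python
# Figures quoted in the message (RZ56 geometry and the label's 'c' size).
SECTORS_PER_TRACK = 54      # ns
TRACKS_PER_CYLINDER = 15    # nt
CYLINDERS = 1632            # nc
LABEL_C_SIZE = 1299174      # pc from disktab/disklabel


def geometry_total(ns, nt, nc):
    """Total sectors implied by the raw geometry."""
    return ns * nt * nc


def reserved_sectors(ns, nt, nc, label_size):
    """Sectors that lie past the end of the labelled 'c' partition."""
    return geometry_total(ns, nt, nc) - label_size


def last_track_table_sectors(ns, nt, nc, copies=5):
    """ASSUMED bad-sector table location: the first `copies`
    even-numbered sectors of the last track (hypothetical helper)."""
    last_track_start = geometry_total(ns, nt, nc) - ns
    return [last_track_start + 2 * i for i in range(copies)]


if __name__ == "__main__":
    geom = (SECTORS_PER_TRACK, TRACKS_PER_CYLINDER, CYLINDERS)
    per_cylinder = SECTORS_PER_TRACK * TRACKS_PER_CYLINDER
    spare = reserved_sectors(*geom, LABEL_C_SIZE)
    table = last_track_table_sectors(*geom)

    print("geometry total:", geometry_total(*geom))
    print("outside label: ", spare, "sectors =",
          spare // per_cylinder, "cylinders +",
          spare % per_cylinder, "sectors")
    print("assumed table sectors:", table)
    # True means every assumed table sector is past the labelled size.
    print("all beyond label:", all(s >= LABEL_C_SIZE for s in table))
```

With the numbers above, 22746 sectors (28 whole cylinders plus 66
sectors) fall outside the labelled 'c' partition, and the assumed table
sectors on the last track all sit beyond sector 1299174. That would be
consistent with the guess in (ii), provided 'bad144' really does look
for its table at the end of the physical geometry.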


Questions, questions, questions:
--------------------------------
Should I re-label the disk so that 'bad144' can address the
end of it to re-map the sectors etc., and then re-label it
afterwards so that any full-length filesystem (e.g. on the
'c' partition) cannot write over that table?

'bad144' is in the later snapshot. The 'badsect' man page
makes reference to 'format', which I can't find in any 
snapshot... Will there be a 'format' sometime?

Would it help to format the drive on some other machine?
My Ultrix installation won't have any Sys Admin tools
on it (didn't even have 'telnet', jeez..)

If I were to simply replace the drive with another HD
of similar (or larger) size, am I correct in assuming that
only DEC drives use the 'bad144' scheme, and that another vendor's
drive would use some other, possibly transparent, scheme?


TIA for any advice,

AS