Subject: Sparing sectors with soft errors?
To: None <port-pmax@NetBSD.ORG>
From: Andy Sparrow <andy@aonix.com>
List: port-pmax
Date: 01/24/1997 15:33:09
Hi.
I've got a number of disk packs with 2 x RZ56s in them, and they
all work fine except for one drive, which, I would guess, has an
excessive number of "soft" errors on it.
The drive takes a _looong_ time to come up, with mucho recalibration
to track 0 and re-seeking, and runs very slowly in use for the
same reason. The kernel also complains about a bunch of sectors.
Interestingly, if I try to use the disk pack with this drive in
it (booting from the other drive, not necessarily even mounting this
drive), I get a bunch of ECC memory errors reported from the kernel,
even after I switch this drive off/reset and reboot from another
disk pack altogether.
(If this drive isn't in the picture, I've not had any ECC errors
reported during some 3-4 weeks of running this 5000/200 almost
constantly).
So, it seems to me that I need to do something about this
drive. I _could_ use it as a doorstop, but this seems contrary
to the spirit of the whole exercise :)
A technique I've always used in the past (other machines
and OSs) was to spare out the problem sectors/tracks manually
in this scenario.
However, I have a couple of questions re: this procedure in NetBSD,
and I'm not clear on it, even after scanning the mail archives:
i) My disktab and disklabel presently agree WRT the size of
the 'c' partition, i.e. pc=1299174, rather than
ns * nt * nc = 54 * 15 * 1632 = 1321920
I note that this is the same information as the disktab
in the minimal Ultrix installation I still have on one disk,
so I figured that this was deliberate, in order to provide
a pool of alternates for the bad block table which was out
of the range of any filesystem.
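As a quick sanity check on those numbers, here is a small Python sketch that works out how much of the raw geometry lies beyond the end of the 'c' partition. The four input values are the ones quoted above; the variable names are just for illustration.

```python
# Geometry values quoted above (from the disktab/disklabel for the RZ56).
SECTORS_PER_TRACK = 54    # ns
TRACKS_PER_CYL = 15       # nt
CYLINDERS = 1632          # nc
C_PARTITION = 1299174     # pc, size of the 'c' partition in sectors

# Total sectors implied by the raw geometry.
raw_total = SECTORS_PER_TRACK * TRACKS_PER_CYL * CYLINDERS

# Sectors that lie past the end of the 'c' partition.
reserved = raw_total - C_PARTITION

# Express that reserved area as whole cylinders plus leftover sectors.
sectors_per_cyl = SECTORS_PER_TRACK * TRACKS_PER_CYL
whole_cyls, leftover = divmod(reserved, sectors_per_cyl)

print("raw total sectors:", raw_total)
print("sectors beyond 'c':", reserved)
print("that is", whole_cyls, "cylinders plus", leftover, "sectors")
```

So roughly 28 cylinders at the end of the drive are outside anything the label describes, which is at least consistent with the "pool of alternates" guess.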
ii) Looking at the man pages, I figure I want to use 'bad144'
rather than 'badsect', as anything done by the latter is
prone to suddenly vanishing if I decide to re-partition/newfs
the drive, as I might.
But 'bad144' complains that it can't access the parts of
the disk it wants to.
I guess that this is because the label on the disk says
that there are "only" 1299174 sectors, and it wants to
talk to the end of the drive?
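A rough way to test that guess is sketched below. It assumes my reading of the bad144 man page is right: five copies of the table in the first five even-numbered sectors of the last track, and up to 126 replacement sectors allocated backwards from just before that track. It also assumes the "last track" is derived from the raw ns * nt * nc geometry rather than from the label's partition size. Both are assumptions to verify against the man page and source.

```python
# Values quoted in this message.
SECTORS_PER_TRACK = 54
RAW_TOTAL = 54 * 15 * 1632        # ns * nt * nc
LABEL_SIZE = 1299174              # size of the 'c' partition in the label

# Assumptions taken from my reading of the bad144 man page.
TABLE_COPIES = 5                  # copies of the bad sector table
MAX_BAD_SECTORS = 126             # maximum number of replacement sectors

# First sector of the last track, assuming raw geometry is used.
last_track_start = RAW_TOTAL - SECTORS_PER_TRACK

# The table copies sit on the first five even-numbered sectors of that track.
table_sectors = [last_track_start + 2 * i for i in range(TABLE_COPIES)]

# Replacement sectors run backwards from just before the last track.
lowest_replacement = last_track_start - MAX_BAD_SECTORS

# True if everything bad144 would touch lies past the labelled 'c' partition.
all_beyond_label = lowest_replacement >= LABEL_SIZE

print("table copies at sectors:", table_sectors)
print("replacement sectors from", lowest_replacement, "to", last_track_start - 1)
print("all beyond the labelled 'c' partition:", all_beyond_label)
```

If those assumptions hold, every sector bad144 wants is past sector 1299174, so a label that stops there would explain the access complaint.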
Questions, questions, questions:
--------------------------------
Should I re-label the disk so that 'bad144' can address the
end of it to re-map the sectors etc., and then re-label it
afterwards so that any full-length filesystem (e.g. on the
'c' partition) cannot write over that table?
'bad144' is in the later snapshot. The 'badsect' man page
makes reference to 'format', which I can't find in any
snapshot... Will there be a 'format' sometime?
Would it help to format the drive on some other machine?
My Ultrix installation won't have any Sys Admin tools
on it (didn't even have 'telnet', jeez..)
If I were to simply replace the drive with another HD
of similar (or larger) size, am I correct in assuming that
only DEC drives use the 'bad144' scheme, and another vendor's
drive would use some other, possibly transparent, scheme?
TIA for any advice,
AS