Subject: Re: wd.c crashes/hard errors
To: current-users <current-users@sun-lamp.cs.berkeley.edu>
From: Burgess, David (TSgt) ~U <BurgessD@J64.STRATCOM.AF.MIL>
List: current-users
Date: 02/10/1994 17:55:00
>
>Bwaa hah hah. Never underestimate the power of the system to do anything it
>wants. In 1988 I was working for Sun Consulting and using an old Sun-3/160 at
>home. All of a sudden it started developing the tendency to get random
>Watchdog Resets and the net result would be that upon reboot, "fsck" would find
>300+ trashed files in /usr. After about the 3rd time this happened (and after
>the 4-5 hours I'd waste finding/restoring things, groveling through lost+found
>et al.), I said "To Hell with this" and got the CPU board replaced. Never had
>that problem again. Dunno why a Watchdog Reset would always cause the disk to
>get so scrambled, but that's life.
>
> - Greg
>
I had a similar problem when I tried to install NetBSD after I bought my new
IDE drive. I would be working along, and the disk would hiccup. It turned out
that I had a bad spot in the swap space. As soon as the system tried to swap
a page out to that spot, the drive would trash the block without notifying
the system at all. When the page was swapped back in, it would proceed to
trash large portions of all kinds of stuff. I have also had a similar problem
with a bad SIMM that would lose its mind from time to time. It always amazed
me how much of the system could get destroyed in a very short period of time.
All it takes is one or two warped pointers that used to point at
memory-resident disk structures for the system to go away.

If I had any suggestions, I would say to look closely at your hard drive. I
suspect that, although IDE isn't supposed to have 'bad media', your drive may
have a bad spot. I have also (anecdotally, of course) noticed that virtually
every one of my hard drive 'hangs' was caused by disk re-reads taking the
controller too long. If that is truly the reason, it would explain why some
people have the problem and others do not. Whether the controller locks up
would be a function of the controller (and its ability to remap flaky
sectors) or of the hard drive internals (and their ability to recover). It
would also be as random as it appears to be. Also, if your controller remaps
bad spots on the fly, it is just possible that it is initializing the
replacement sector with incorrect information read from the drive.

Of course, this is all about as authoritative as an X-Men comic, but I would
like to tell those of you who are having these problems that I too have had
them, and have overcome them by using 'bad144' and marking MANY spots on the
disk as bad, whether they were really bad or merely correctable.
TSgt Dave Burgess
------------------------------------------------------------------------------