Subject: Re: wd.c crashes/hard errors
To: current-users <current-users@sun-lamp.cs.berkeley.edu>
From: Burgess, David (TSgt) ~U <BurgessD@J64.STRATCOM.AF.MIL>
List: current-users
Date: 02/10/1994 17:55:00
>
>Bwaa hah hah.  Never underestimate the power of the system to do anything
>it wants.  In 1988 I was working for Sun Consulting and using an old
>Sun-3/160 at home.  All of a sudden it started developing the tendency to
>get random Watchdog Resets and the net result would be that upon reboot,
>"fsck" would find 300+ trashed files in /usr.  After about the 3rd time
>this happened (and after the 4-5 hours I'd waste finding/restoring things,
>groveling through lost+found et al.), I said "To Hell with this" and got
>the CPU board replaced.  Never had that problem again.  Dunno why a
>Watchdog Reset would always cause the disk to get so scrambled, but that's
>life.
>
>        - Greg
>

I had a similar problem when I tried to install NetBSD after I bought my
new IDE drive.  I would be working along, and the disk would hiccup.  It
turns out that I had a bad spot in the swap space.  As soon as the system
swapped a page out to that spot, the drive would trash the block without
reporting any error to the system.  When the page was swapped back in, the
corrupted data would proceed to trash large portions of all kinds of
stuff.  I have also had a similar problem with a bad SIMM that would lose
its mind from time to time.  It always amazed me how much of the system
could get destroyed in a very short period of time.  All it takes is one
or two warped pointers that used to point at memory-resident disk
structures for the system to go away.

If I had any suggestions, I would say to look closely at your hard drive.
I suspect that, although IDE isn't supposed to have 'bad media', your
drive may have a bad spot.  I have also (anecdotally, of course) noticed
that virtually every one of my hard drive 'hangs' has been caused by disk
re-reads that take the controller too long.  If that is truly the reason,
it would explain why some people have the problem and others do not:
whether or not the controller locks up would be a function of the
controller (and its ability to remap flaky sectors) or of the hard drive
internals (and their ability to recover).  It would also be as random as
it appears to be.  Also, if your controller remaps bad spots on the fly,
it is just possible that the drive is initializing the replacement sector
with incorrect data read from the bad spot.

Of course, this is all about as authoritative as an X-Men comic, but I
would like to tell those of you who are having these problems that I too
have had them, and have overcome them by using 'bad144' and marking MANY
spots on the disk as bad, whether they were really bad or merely
correctably bad.

TSgt Dave Burgess
