Subject: Re: heavy use of ffs breaks snap-20000304.sparc on sun4m machines.
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Eduardo E. Horvath <eeh@one-o.com>
List: port-sparc
Date: 03/07/2000 09:27:23
Dusting off my old support hat (I wonder if it still fits?):

> 	Hello.  Now that the 20000304 snapshot works with systems with 256MB
> of memory out of the box, I tried my news server that won't run under 1.4.x
> and 1.3.x to see if things improved with 1.4U.  The test takes 30 minutes
> on this machine, and about 3 minutes of innd accepting articles and,
> boom!!! just like the other versions.  Is there anything I can collect for

Boom how?  How boom?  Please describe the boom in excruciating detail.

> people, or suggestions on what I should do with -current to diagnose this
> 2-year-old problem?  Remember, I changed the machine, the memory, the
> cards, the disk drives, everything except the operating system.  I've even
> changed that too,  if you count 1.3.x and 1.4.x versions.  NetBSD 1.3.x ran
> more stable than did 1.4.x in that it could run for a week, 1.4.x can't
> make it 50 minutes.
> 	I've worked with Paul Kranenberg to try and reproduce the problem, but I
> guess I can't seem to replicate the exact conditions for this crash except
> on this production machine.  Would modern core dumps against the snapshot
> GENERIC_SCSI3 kernel help?
> Any ideas, solutions, spells, incantations, etc. would be gratefully
> accepted.  I've tried to convince the owner of this machine that NetBSD is
> a viable production system, but this would be the machine that won't run
> for a day.  All the other NetBSD machines I manage run for years at a time,
> with even more load.

I can give you several suggestions, but without knowing the details of
the boom in question, I don't know which one is most appropriate.

Booms of the panic variety can be of two flavors: immediate panics,
and delayed panics.

Immediate panics are caused by code that receives unexpected data
values, code that does not do proper validation of input, code that
detects but cannot fix certain error conditions, or just plain
incorrect code.  These sorts of things are relatively easy to analyze
and fix.  Just do a quick analysis of the kernel core file and say
"D'oh!  What was I thinking?  Try this patch and call me in the
morning."

Delayed panics are caused by data corruption and are not detected
until well after the corruption occurred.  These are much more
difficult to track down and usually require the addition of
instrumentation code to narrow down the root cause.  A good first step
in this type of situation is to build a kernel with both DEBUG and
DIAGNOSTIC turned on.
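
To make the distinction concrete, here is a toy userland sketch in
Python.  It has nothing to do with the actual kernel code, and every
name in it (the pool, the slots, the guard byte) is made up for
illustration.  A writer that validates its input fails on the spot; a
writer that does not silently tramples a guard byte, and nothing
notices until a consistency check runs later, which is roughly the
kind of check extra diagnostic code adds.

```python
SLOT_DATA = 8                # payload bytes per slot (hypothetical layout)
SLOT_SIZE = SLOT_DATA + 1    # payload plus one trailing guard byte
GUARD = 0xA5                 # value every intact guard byte must hold

pool = bytearray(2 * SLOT_SIZE)   # two slots laid out back to back


def slot_init(i):
    """Zero slot i's payload and plant its guard byte."""
    base = i * SLOT_SIZE
    pool[base:base + SLOT_DATA] = bytes(SLOT_DATA)
    pool[base + SLOT_DATA] = GUARD


def write_unchecked(i, src):
    """Buggy writer: trusts the caller and never validates the length."""
    base = i * SLOT_SIZE
    pool[base:base + len(src)] = src


def write_checked(i, src):
    """Validating writer: bad input is caught right where it happens."""
    if len(src) > SLOT_DATA:
        raise ValueError("panic: %d bytes into a %d-byte slot"
                         % (len(src), SLOT_DATA))
    write_unchecked(i, src)


def slot_ok(i):
    """Consistency check: is slot i's guard byte still intact?"""
    return pool[i * SLOT_SIZE + SLOT_DATA] == GUARD


def run_demo():
    events = []
    slot_init(0)
    slot_init(1)

    # Immediate flavor: the fault is reported at the faulty call.
    try:
        write_checked(0, b"0123456789")
    except ValueError:
        events.append("immediate")

    # Delayed flavor: the same bad write goes through without complaint.
    write_unchecked(0, b"0123456789")
    events.append("corrupted silently")

    # Much later, unrelated code trips over the damage.
    for i in (0, 1):
        if not slot_ok(i):
            events.append("delayed: slot %d guard bad" % i)
    return events


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

By the time the last check fires, the call that did the damage is long
gone from the stack, which is why a backtrace alone rarely settles a
delayed panic.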

So: build the DIAGNOSTIC and DEBUG kernel with full `gdb' symbols as
well, and run the stripped version of it.  After the system panics,
run gdb on the core dump and the `netbsd.gdb' image, do a `where' to
get a stack trace, and send that information along.  Then you will be
given further instructions. 8^)
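
As a rough sketch of those steps (the config file name, the crash
directory and the dump number below are assumptions; adjust them to
whatever your setup and savecore actually use):

```
# Additions to a copy of your kernel config, e.g.
# sys/arch/sparc/conf/NEWSBOX (hypothetical name):
options 	DIAGNOSTIC		# extra cheap consistency checks
options 	DEBUG			# more expensive debugging checks
makeoptions	DEBUG="-g"		# also builds netbsd.gdb with symbols

# After the panic, once savecore has written the dump
# (assumed here to be /var/crash/netbsd.0.core):
$ gdb netbsd.gdb
(gdb) target kcore /var/crash/netbsd.0.core
(gdb) where
```

Boot the plain `netbsd' from that build and keep the matching
`netbsd.gdb' around; the symbols are only useful against a dump from
the same build.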

Also, since this is a SPARC machine I can recommend the _Panic!_ book
by Chris Drake and Kimberly Brown, available from Sun.  (Maybe we
should put a plug for it on the SPARC port's web page.)  It's a very
good tutorial on how to do core file analysis.  Otherwise you'll need
to ship around multi-megabyte core files for people to look at.

=========================================================================
Eduardo Horvath				eeh@netbsd.org
	"I need to find a pithy new quote." -- me