port-mac68k: Re: 1.3K - "panic: pool_get: ffsinopl: page empty"

Subject: Re: 1.3K - "panic: pool_get: ffsinopl: page empty"
To: Bob Nestor <rnestor@metronet.com>
From: Colin Wood <cwood@ichips.intel.com>
List: port-mac68k
Date: 03/28/1999 12:42:33

Bob Nestor wrote:
> Ken Nakata <kenn@synap.ne.jp> wrote:
> 
> >panic: pool_get: ffsinopl: page empty
> >Stopped in fsck at	_Debugger+0x6:  unlk    a6
> >db> t
> >_Debugger(...) + 6
> >_panic(...) + 52
> >_pool_get(...) + 114
> >_ffs_vget(...) + 90
> >_ufs_lookup(...) + a32
> >_lookup(...) + 21c
> >_namei(...) + 2e2
> >_sys___stat13(...) + 2a
> >_syscall(...) + 11c
> >_trap0() + e
> >db> _
> >
> >Has anybody seen this? (admittedly, I haven't been paying much
> >attention to traffic on this list... )-:
> 
> Yes, this looks almost identical to some of the panics I've been seeing, 
> and this one is in fact identical to one I have experienced. Others have 
> reported seeing this when they try to access a second disk or when they 
> try and run fsck on a disk at startup.

that's a me too :-)  not to mention several others of basically similar
flavors.  although i don't think i've ever seen the second one you posted,
ken.  congrats on turning up something new...

> I've also been able to reproduce 
> it by running a certain combination and sequence of userland applications 
> right after bringing the system up.  The problem has been occurring in 
> prior versions of the kernel -- I can reproduce it with a stock 1.3.2 
> kernel, and others have seen it in kernels dating back to 0.9.

really?  is it the same thing?  i mean, i know we've always had a problem
with disk accesses, but since we've moved to the pool allocator, it has
certainly seemed to be a lot worse.  i guess it's just more readily
reproducible now.
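
one way to picture why the pool allocator might make this more visible: a
pool keeps bookkeeping about each page, so a stray write can leave the
bookkeeping disagreeing with itself, and the next allocation notices.  the
sketch below is only a toy model of that idea.  every name in it
(toy_page, toy_pool_get, toy_demo_stomp) is made up for illustration and
it is not the kernel's actual pool code.

```c
#include <stddef.h>

/* toy model only: every name here is made up, this is not the real pool code */
struct toy_item { struct toy_item *next; };

struct toy_page {
    struct toy_item *freelist;  /* items still free on this page */
    unsigned nfree;             /* how many items the pool believes are free */
};

enum toy_status { TOY_OK, TOY_EXHAUSTED, TOY_PAGE_EMPTY };

/* hand out one item; report an inconsistency instead of panicking */
static enum toy_status toy_pool_get(struct toy_page *pg, void **out)
{
    struct toy_item *it;

    *out = NULL;
    if (pg->nfree == 0)
        return TOY_EXHAUSTED;      /* honestly out of items */
    if (pg->freelist == NULL)
        return TOY_PAGE_EMPTY;     /* count says "free items", list says none */
    it = pg->freelist;
    pg->freelist = it->next;
    pg->nfree--;
    *out = it;
    return TOY_OK;
}

/* simulate a stray write clobbering the list head after one good allocation */
static enum toy_status toy_demo_stomp(void)
{
    struct toy_item items[2];
    struct toy_page pg;
    void *p;

    items[0].next = &items[1];
    items[1].next = NULL;
    pg.freelist = &items[0];
    pg.nfree = 2;

    if (toy_pool_get(&pg, &p) != TOY_OK)
        return TOY_EXHAUSTED;      /* not expected in this demo */
    pg.freelist = NULL;            /* the "stepped on" memory */
    return toy_pool_get(&pg, &p);  /* bookkeeping now disagrees */
}
```

in this toy, toy_demo_stomp() ends up in the TOY_PAGE_EMPTY branch, which
is the spot where a real allocator would most likely give up and panic
rather than return a status.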

> Some systems 
> and disk combinations are more sensitive to the problem than others and 
> the problem typically displays different footprints depending on when it 
> pops up.  This leads some of us to suspect that portions of the kernel 
> are getting stepped on and that damage isn't always immediately obvious. 
> It may be that everybody is being affected but most don't see it since 
> it's a non-critical section of kernel memory on their system 
> configuration that gets munged.  Since it seems to occur in a somewhat 
> random fashion, many of us suspect a rogue I/O interrupt that's not being 
> fielded properly.

i think it's the serial ports that're causing the problem.  for whatever
reason, i'm getting continual level4 interrupts on my se/30 without
anything connected to the serial ports.  at least, i'm seeing many calls
to zshard() every second, so i'm assuming that i'm getting interrupts from
the SCC.  i've tried turning interrupts off in the proper way (at least
from what i've read in various files), but it still has no effect.  i've
also written dr. bill about it, so hopefully i'll hear back from him soon.
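
for anyone who wants to put a number on "many calls every second", here
is a rough sketch of a rate counter.  it assumes you can bump a counter
from the interrupt handler and sample it from something that runs about
once a second.  the names (irq_meter, irq_meter_hit, irq_meter_sample,
irq_meter_is_storm) and the threshold are hypothetical and nothing here
comes from the zs driver.

```c
#include <stdint.h>

/* hypothetical helper, not part of any real driver */
struct irq_meter {
    uint32_t count;   /* bumped on every handler entry */
    uint32_t last;    /* value of count at the previous sample */
};

/* call from the top of the interrupt handler */
static void irq_meter_hit(struct irq_meter *m)
{
    m->count++;
}

/* call roughly once per second; returns interrupts seen since last call */
static uint32_t irq_meter_sample(struct irq_meter *m)
{
    uint32_t delta = m->count - m->last;  /* unsigned math survives wraparound */

    m->last = m->count;
    return delta;
}

/* crude test: more hits per second than the caller's threshold */
static int irq_meter_is_storm(uint32_t per_second, uint32_t threshold)
{
    return per_second > threshold;
}
```

with nothing plugged into the port, a steady nonzero result from
irq_meter_sample() would at least confirm that the handler is being
entered repeatedly, though it would not say why.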

of course, given how little i know about interrupt handling at the moment
(since a large portion of it dips into some low-level functions of the
hardware that i'm not quite familiar with), i could just be blowing smoke
here... 

> At any rate the problem seems to be infecting more and more users' 
> systems.  Let's hope this is good news as it should aid in finding and 
> fixing the problem.

hopefully so.  it is certainly giving us a lot of easily reproducible
symptoms :-(

later.

colin