current-users: Re: panics and lockups in -current

Subject: Re: panics and lockups in -current
To: Paul Dokas <dokas@cs.umn.edu>
From: Patrick Welche <prlw1@newn.cam.ac.uk>
List: current-users
Date: 02/24/2005 12:10:17
On Fri, Feb 18, 2005 at 03:31:37PM -0600, Paul Dokas wrote:
... 
> I declared victory too soon.  The machine locked up today with the following
> on the screen:
> 
>   ex0: too many segments, retrying
>   ex0: uplistptr was 0
> 
> Dropping into DDB, I got a stack trace very similar to the one shown above:
> 
>   Stopped in pid 10.1 (pagedaemon) at  netbsd:cpu_Debugger+0x4
>   db> bt
>   .
>   .
>   .
>   Xspllower(7,c0e09700,ffffffff,282,70000) at netbsd:Xspllower+0xe
>   m_freem(c0e09700,c0ef3800,cdb4048c,cd64e854,ccfdd854) at netbsd:m_freem+0x99
>   ex_intr(c0ef0000,0,10,6e860030,30010) at netbsd:ex_intr+0x16a
>   Xintr_legacy10() at netbsd:Xintr_legacy10+0xad
>   --- interrupt ---
>   lockmgr(c04dbe20,ce7fc000,ce800000,202,4215689c) at netbsd:lockmgr
>   uvm_swapout(cddbb088,0,ccfd4f3c,c02bc8c3,97) at netbsd:uvm_swapout+0x8d
>   uvm_swapout_threads(0,0,c0437134,54,8e10) at netbsd:uvm_swapout_threads+0xbb
>   uvmd_scan(0,0,55566d7e,8c14,0) at netbsd:uvmd_scan+0x1c9
>   uvm_pageout(cc66b4a4,56c000,574000,0,c0100321) at netbsd:uvm_pageout+0xdb
> 
> Looking at the kernel in gdb, it appears that the call to Xspllower is in
> the MBUFLOCK macro within the MFREE macro at uipc_mbuf.c line 454.  The
> actual call that seems to be deadlocked is the "splx(ms)" at sys/mbuf.h
> line 334.

I posted something just like this:

http://mail-index.netbsd.org/current-users/2005/02/15/0011.html

and today I found the same box with a slightly different backtrace:

ex0: too many segments, retrying
ex0: uplistptr was 0
ex1: uplistptr was 0
kernel: page fault trap, code=0
Stopped at netbsd:arpintr+0xc4: movzbl 0x34(%eax),%eax
db> bt
arpintr(c9e80010,30,f0010,c0480010,c054f000) at netbsd:arpintr+0xc4
DDB lost frame for netbsd:Xsoftnet+0x34, trying 0xc0552df0
Xsoftnet() at netbsd:Xsoftnet+0x34
--- interrupt ---
0x246:
db> 

So: apparently the same outcome, but with three different traces. I am
hopeful that this may be the fix:

----------------------------
revision 1.38
date: 2005/02/24 08:04:02;  author: martin;  state: Exp;  lines: +3 -3
Fix the size of psc_regs (0x3c >> 2 is the biggest index used to access
it now, so pick 0x40 >> 2). Fixes "Bug 6", reported by Ted Unangst on
tech-kern.
----------------------------

and hopefully the menace behind

http://mail-index.netbsd.org/netbsd-help/2005/02/01/0009.html
http://mail-index.netbsd.org/current-users/2004/10/13/0010.html

too. (ISTR someone having problems with an ex0 plugged into a 10Mb/s switch,
but I couldn't find the reference.)

Next step: try to crash the box, apply the above patch, and see whether it still crashes.

In passing: very few of us have the problem, but it is very noticeable
to those who do, so it isn't a general problem with ex0. Both
boxes that crash for me are gateways, and both are running ipf. Is yours?

Cheers,

Patrick