Subject: Re: kern/25285: i386 MP panic: TLB IPI rendezvous failed (mask 1)
To: None <current-users@netbsd.org>
From: Paul Dokas <dokas@cs.umn.edu>
List: current-users
Date: 04/30/2004 16:47:28
Does anyone know why this is happening?  I've got a machine that panics this way
about once a week.

Of course, I'm probably pushing the bounds of what I ought to be doing.  My machine
is a quad P4 Xeon with 6GB of memory and four 136GB drives bound in a RAID0 that
holds a filesystem on which I've turned on softdep.  I'm using that filesystem to
hold a moderately active PostgreSQL database.

When I get these panics, the machine is always stopped in one of the postgres
processes.  And, as in the first stack trace below, mine are always the result of
a call to write(2).
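
If the mask in the panic message is a bitmask of CPUs that have not yet
acknowledged the TLB IPI (an assumption from the wording of the message, not
something checked against the pmap source), then "mask 1" points at CPU 0,
which is the CPU the traces below switch to with "mach cpu 0".  A minimal
Python sketch of that decoding; the helper name cpus_in_mask is made up for
illustration and is not part of any NetBSD tool:

```python
def cpus_in_mask(mask):
    """Return the indices of the bits set in mask, lowest bit first.

    Assumption: bit N of the mask stands for CPU N.
    """
    if mask < 0:
        raise ValueError("mask must be non-negative")
    cpus = []
    bit = 0
    while mask:
        if mask & 1:
            cpus.append(bit)
        mask >>= 1
        bit += 1
    return cpus


# "mask 1" from the panic message, read as hexadecimal
print(cpus_in_mask(0x1))  # prints [0]
```

Under the same assumption, a mask of 5 on a four-CPU box would mean CPUs 0 and
2 had not responded.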


Paul



On Thu, 22 Apr 2004 17:18:58 -0400 (EDT) nathanw@mit.edu wrote:
> 
> >Number:         25285
> >Category:       kern
> >Synopsis:       i386 MP panic: TLB IPI rendezvous failed (mask 1)
> >Confidential:   no
> >Severity:       serious
> >Priority:       medium
> >Responsible:    kern-bug-people
> >State:          open
> >Class:          sw-bug
> >Submitter-Id:   net
> >Arrival-Date:   Thu Apr 22 21:41:00 UTC 2004
> >Closed-Date:
> >Last-Modified:
> >Originator:     Nathan J. Williams
> >Release:        NetBSD 2.0C
> >Organization:
> 	Massachvestts Institvte of Technology
> >Environment:
> 	
> 	
> System: NetBSD marvin-the-martian.nathanw.com 2.0C NetBSD 2.0C (MARVIN) #74: Tue Apr 20 14:57:53 EDT 2004 nathanw@marvin-the-martian.nathanw.com:/nbsd/src/sys/arch/i386/compile/MARVIN i386
> Architecture: i386
> Machine: i386
> >Description:
> 
> After upgrading my desktop box (dual athlon MP 2000+) to 2.0C, I
> decided to give MULTIPROCESSOR a spin on it. Under slight load
> (compiling another kernel, no -j option) with a MULTIPROCESSOR
> kernel, I got:
> 
> panic: TLB IPI rendezvous failed (mask 1)
> 
> Stopped in pid 8360.1 (cc1) at  netbsd:cpu_Debugger+0x4: leave
> db{1}> t
> cpu_Debugger()
> panic()
> pmap_tlb_shootnow(3,cc3c1000,61c016c,ce5ebcc0,c07794a0) at pmap_tlb_shootnow+0x108
> pmap_kremove(cc3c0000,2000,21c,ce5ebd18,c0782740) at pmap_kremove+0x56
> ubc_release(cc3c0000,0,0,0,d) at ubc_release+0x1ab
> ffs_write(ce5ebe24,40855555,ce5ebe5c,c0277094,c0381820) at ffs_write+0x41b
> VOP_WRITE(cec0a6f0,ce5ebec4,1,c1e14380,cec0a6f0) at VOP_WRITE+0x34
> vn_write(cde07d24,cde07d4c,cd5ebec4,c1e14380,1) at vn_write+0xbf
> dofilewrite(ce36c018,3,cde07d24,83ca000,2000) at dofilewrite+0x86
> sys_write(cde059d0,cd5ebf64,ce5ebf5c,30,c040a420) at sys_write+0x70
> syscall_plain() at syscall_plain+0x182
> --- syscall (number 4) ---
> 
> db{1}> mach cpu 0
> db{1}> t
> 
> netbsd:cpu_switch+0xda:
> 
> It is quickly repeatable, though not totally deterministic. On a
> second occasion the trace was:
> 
> panic: TLB IPI rendezvous failed (mask 1)
> Stopped in pid 9531.1 (cc) at netbsd:cpu_Debugger+0x4: leave
> db{1}> t
> cpu_Debugger()
> panic()
> pmap_tlb_shootnow(3,ce2a8cdc,0,25bf063,c1bb0c00) at +0x108
> pmap_do_remove(c0453320,cb6e8000,cb728000,0,cb728000) at +0xc1
> pmap_remove(c0453320,cb6e8000,cb728000,cb728000,253b) at +0x15
> uvm_unmap_remove(c1bb0c00,cb6e8000,cb728000,ce2a8d7c,cdcb0c8c) at +0x27c
> uvm_km_free_wakeup(c1bb0c00,cb6e8000,40000,cdcb0c8c,0) at +0xc9
> sys_execve(cdc2294c,ce2a8f64,ce2a8f5c,2c4,c03a842c) at +0x8e6
> syscall_plain() at +0x182
> --- syscall (number 59) ---
> 
> db{1}> mach cpu 0
> db{1}> t
> 
> acquire(c044cf60,cc01ef10,400040,0,600) at +0x5f
> _lockmgr(c044cf60,400042,0,c03dc3a0,315) at +0x4bd
> x86_softintlock(cdc20010,30,c03a0010,c0400010,cc01b000) at +0x21
> 
> >How-To-Repeat:
> 
> "Run on this system". I was unable to reproduce the problem on
> my Dell dual-Pentium 4 system.
> 
> >Fix:
> >Release-Note:
> >Audit-Trail:
> >Unformatted:


-- 
Paul Dokas                                            dokas@cs.umn.edu
======================================================================
Don Juan Matus:  "an enigma wrapped in mystery wrapped in a tortilla."