Re: port-alpha/38335 (kernel freeze on alpha MP system)

To: port-alpha-maintainer%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,jarle%uninett.no@localhost
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
From: "Michael L. Hitch" <mhitch%lightning.msu.montana.edu@localhost>
Date: Thu, 1 Oct 2009 18:25:02 +0000 (UTC)

The following reply was made to PR port-alpha/38335; it has been noted by GNATS.

From: "Michael L. Hitch" <mhitch%lightning.msu.montana.edu@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: gnats-admin%netbsd.org@localhost, jarle%uninett.no@localhost
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 1 Oct 2009 12:20:43 -0600 (MDT)

 On Thu, 1 Oct 2009, Jarle Greipsland wrote:
 
 > Annotated console log:
 > [ I pressed the halt switch here ]
 > halted CPU 0
 > CPU 1 is not halted
 >
 > halt code = 1
 > operator initiated halt
 > PC = fffffc0000736b08
 > P00>>>continue^U
 > P00>>>help
 > [ ... some help text here ... ]
 > P00>>>halt 1
 
    Halting the other cpu is not needed at this point, and might have caused
 the problem when continuing cpu 0.  Also, do you have ddb.onpanic set to
 0?  If so, you probably want it set to 1 so that you enter DDB when it panics.
 
 > WARNING: Unable to halt secondary CPUs (0x3)
 
    This would be caused by your halting cpu 1 earlier, although I'm pretty
 sure the same thing would have happened even if you hadn't halted it.
 
 > [ .. end of console log .. ]
 > This seems less helpful than I had imagined; oh well.  I hope you
 > can make some kind of sense out of it.  According to gdb, the PC
 > is somewhere in pmap_do_tlb_shootdown (offset +296 in the listing
 > below):
 
    It does show that what I thought had happened did indeed happen.
 
 > 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq     t1,24(t3)
 > 0xfffffc0000736ac4 <pmap_do_tlb_shootdown+228>: addq    t2,t1,t4
 > 0xfffffc0000736ac8 <pmap_do_tlb_shootdown+232>: ldq     t0,64(t1)
 > 0xfffffc0000736acc <pmap_do_tlb_shootdown+236>: and     s2,t0,t0
 > 0xfffffc0000736ad0 <pmap_do_tlb_shootdown+240>: bne     t0,0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>
 > 0xfffffc0000736ad4 <pmap_do_tlb_shootdown+244>: ldq     t0,88(t4)
 > 0xfffffc0000736ad8 <pmap_do_tlb_shootdown+248>: ldq     t1,8(t5)
 > 0xfffffc0000736adc <pmap_do_tlb_shootdown+252>: cmpeq   t0,t1,t0
 > 0xfffffc0000736ae0 <pmap_do_tlb_shootdown+256>: bne     t0,0xfffffc0000736b94 <pmap_do_tlb_shootdown+436>
 > 0xfffffc0000736ae4 <pmap_do_tlb_shootdown+260>: ldq     t3,0(t3)
 > 0xfffffc0000736ae8 <pmap_do_tlb_shootdown+264>: beq     t3,0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>
 > 0xfffffc0000736aec <pmap_do_tlb_shootdown+268>: unop
 > 0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>: lda     a0,3
 > 0xfffffc0000736af4 <pmap_do_tlb_shootdown+276>: ldq     t0,32(t3)
 > 0xfffffc0000736af8 <pmap_do_tlb_shootdown+280>: and     t0,0x10,t0
 > 0xfffffc0000736afc <pmap_do_tlb_shootdown+284>: beq     t0,0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>
 > 0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>: ldq     a1,16(t3)
 > 0xfffffc0000736b04 <pmap_do_tlb_shootdown+292>: call_pal        0x33
 > 0xfffffc0000736b08 <pmap_do_tlb_shootdown+296>: ldq     t3,0(t3)
 > 0xfffffc0000736b0c <pmap_do_tlb_shootdown+300>: bne     t3,0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>
 
    This is the loop where pmap_do_tlb_shootdown is processing the job queue 
 for this cpu (cpu 0).  T3 is the current queue entry and almost certainly 
 is linked to itself - meaning it keeps invalidating the same tlb entry 
 over and over and over......  At this point, cpu 0 will have the job queue 
 locked, and I'm certain that cpu 1 wants to invalidate another tlb entry, 
 and is attempting to acquire the lock that cpu 0 holds.
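 
    To illustrate why a self-linked entry hangs the loop, here is a minimal
 sketch in C.  It is not the kernel code: struct job and walk_jobs() are
 hypothetical stand-ins, and the entry limit exists only so that the example
 terminates.  The walk follows the same shape as the disassembly (process
 the entry, then load its next pointer and loop while it is non-zero).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a shootdown job; not the kernel's real struct. */
struct job {
    struct job *next;   /* link to the next queued job */
    unsigned long va;   /* address whose TLB entry would be invalidated */
};

/*
 * Walk the queue: handle the current entry, then follow its next pointer.
 * Gives up once 'limit' entries have been handled.  Returns the number of
 * entries handled, or -1 if the limit was hit, which suggests a cycle.
 */
int walk_jobs(const struct job *head, int limit)
{
    int n = 0;

    assert(limit >= 0);
    for (const struct job *j = head; j != NULL; j = j->next) {
        if (n == limit)
            return -1;  /* more entries than expected: probably a loop */
        n++;            /* stands in for invalidating j->va */
    }
    return n;
}
```

    Without the limit, an entry whose next pointer refers to itself never
 lets the walk reach a null pointer, so the same address is invalidated
 forever while the queue lock stays held.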
 
    This is the problem I'm still trying to figure out, both what causes
 it and how to fix it.  The patch I posted previously is a workaround that
 detects this particular problem and displays a message if it occurs.  The
 other alternative I posted (changing PMAP_TLB_SHOOTDOWN_MAXJOBS to 0)
 makes it never allocate queue entries, so it should never loop.
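 
    A rough sketch in C of why a job limit of 0 avoids the loop.  Everything
 here is hypothetical (struct queue, enqueue(), POOL_SIZE, the max_jobs
 field), and the fallback of setting a "flush everything" flag when no
 entry can be queued is an assumption made for the example, not something
 stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 6   /* hypothetical bound on preallocated jobs */

struct job {
    struct job *next;
    unsigned long va;
};

struct queue {
    struct job pool[POOL_SIZE]; /* preallocated entries */
    struct job *head;           /* queued jobs, newest first */
    int count;                  /* entries handed out so far */
    int max_jobs;               /* plays the role of the MAXJOBS limit */
    bool flush_all;             /* assumed fallback when nothing is queued */
};

/* Queue one request, or fall back to the flush-all flag at the limit. */
void enqueue(struct queue *q, unsigned long va)
{
    assert(q != NULL);
    if (q->count >= q->max_jobs || q->count >= POOL_SIZE) {
        q->flush_all = true;    /* no entry allocated, nothing to walk */
        return;
    }
    struct job *j = &q->pool[q->count++];
    j->va = va;
    j->next = q->head;
    q->head = j;
}
```

    With max_jobs set to 0 the list stays empty, so there is no linked
 entry that could ever end up pointing at itself.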
 
    What I think may be happening is that the IPI interrupt may be occurring
 at level 5, which is the same level as the clock interrupt.  The
 mutex for the tlb shootdown queue uses IPL_VM, which I think is at level
 4.  If I understand the mutex locking mechanism, I think this would allow
 an IPI request from another cpu to interrupt the code that manipulates the
 shootdown queue entries, and could possibly corrupt the job queue.
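 
    The priority argument can be written down as a small C sketch.  The
 two level values are only the guesses from the paragraph above (4 for the
 mutex, 5 for the IPI), and the names LEVEL_VM, LEVEL_IPI,
 interrupt_delivered() and section_protected() are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Levels as guessed in the message; neither value is confirmed there. */
enum { LEVEL_VM = 4, LEVEL_IPI = 5 };

/* An interrupt gets through only if its level is above the current one. */
bool interrupt_delivered(int current_ipl, int irq_level)
{
    assert(current_ipl >= 0 && irq_level >= 0);
    return irq_level > current_ipl;
}

/* True if code running at lock_ipl cannot be interrupted at irq_level. */
bool section_protected(int lock_ipl, int irq_level)
{
    return !interrupt_delivered(lock_ipl, irq_level);
}
```

    If those levels are right, section_protected(LEVEL_VM, LEVEL_IPI) is
 false: holding the queue mutex at level 4 would not keep a level 5 IPI
 from arriving in the middle of a queue update.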
 
 --
 Michael L. Hitch                       mhitch%montana.edu@localhost
 Computer Consultant
 Information Technology Center
 Montana State University       Bozeman, MT     USA
