NetBSD-Bugs archive
Re: port-alpha/38335 (kernel freeze on alpha MP system)
The following reply was made to PR port-alpha/38335; it has been noted by GNATS.
From: "Michael L. Hitch" <mhitch%lightning.msu.montana.edu@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: gnats-admin%netbsd.org@localhost, jarle%uninett.no@localhost
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 1 Oct 2009 12:20:43 -0600 (MDT)
On Thu, 1 Oct 2009, Jarle Greipsland wrote:
> Annotated console log:
> [ I pressed the halt switch here ]
> halted CPU 0
> CPU 1 is not halted
>
> halt code = 1
> operator initiated halt
> PC = fffffc0000736b08
> P00>>>continue^U
> P00>>>help
> [ ... some help text here ... ]
> P00>>>halt 1
Halting the other cpu is not needed at this point, and might have caused
the problem when continuing cpu 0. Also, do you have ddb.onpanic set to
0? If so, you probably want it set to 1 so that you enter DDB when it panics.
> WARNING: Unable to halt secondary CPUs (0x3)
This would be caused by your halting cpu 1 earlier, although I'm pretty
sure the same thing would have happened even if you hadn't halted it.
> [ .. end of console log .. ]
> This seems less helpful than I had imagined; oh well. I hope you
> can make some kind of sense out of it. According to gdb, the PC
> is somewhere in pmap_do_tlb_shootdown (offset +296 in the listing
> below):
It does show that what I thought had happened did indeed happen.
> 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq t1,24(t3)
> 0xfffffc0000736ac4 <pmap_do_tlb_shootdown+228>: addq t2,t1,t4
> 0xfffffc0000736ac8 <pmap_do_tlb_shootdown+232>: ldq t0,64(t1)
> 0xfffffc0000736acc <pmap_do_tlb_shootdown+236>: and s2,t0,t0
> 0xfffffc0000736ad0 <pmap_do_tlb_shootdown+240>: bne t0,0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>
> 0xfffffc0000736ad4 <pmap_do_tlb_shootdown+244>: ldq t0,88(t4)
> 0xfffffc0000736ad8 <pmap_do_tlb_shootdown+248>: ldq t1,8(t5)
> 0xfffffc0000736adc <pmap_do_tlb_shootdown+252>: cmpeq t0,t1,t0
> 0xfffffc0000736ae0 <pmap_do_tlb_shootdown+256>: bne t0,0xfffffc0000736b94 <pmap_do_tlb_shootdown+436>
> 0xfffffc0000736ae4 <pmap_do_tlb_shootdown+260>: ldq t3,0(t3)
> 0xfffffc0000736ae8 <pmap_do_tlb_shootdown+264>: beq t3,0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>
> 0xfffffc0000736aec <pmap_do_tlb_shootdown+268>: unop
> 0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>: lda a0,3
> 0xfffffc0000736af4 <pmap_do_tlb_shootdown+276>: ldq t0,32(t3)
> 0xfffffc0000736af8 <pmap_do_tlb_shootdown+280>: and t0,0x10,t0
> 0xfffffc0000736afc <pmap_do_tlb_shootdown+284>: beq t0,0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>
> 0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>: ldq a1,16(t3)
> 0xfffffc0000736b04 <pmap_do_tlb_shootdown+292>: call_pal 0x33
> 0xfffffc0000736b08 <pmap_do_tlb_shootdown+296>: ldq t3,0(t3)
> 0xfffffc0000736b0c <pmap_do_tlb_shootdown+300>: bne t3,0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>
This is the loop where pmap_do_tlb_shootdown processes the job queue
for this cpu (cpu 0). t3 holds the current queue entry, which is almost
certainly linked to itself, so the loop keeps invalidating the same tlb
entry over and over. At this point, cpu 0 will have the job queue
locked, and I'm certain that cpu 1 wants to invalidate another tlb entry
and is attempting to acquire the lock that cpu 0 holds.
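A minimal sketch of why a self-linked entry hangs this kind of loop. The
struct and function names are hypothetical stand-ins, not the real pmap
code; the walk mirrors the "load next, branch if non-null" shape of the
instructions at +296 and +300, with an added step limit so the sketch
itself terminates.

```c
#include <stddef.h>

/* Hypothetical stand-in for a shootdown job queue entry. */
struct job {
    struct job *next;   /* link to the next queued job */
    unsigned long va;   /* address whose tlb entry is to be invalidated */
};

/*
 * Walk the queue until the next pointer is NULL, giving up after
 * `limit` entries. Returns the number of entries visited, or -1 if
 * the limit was reached with entries still remaining (for a short
 * queue this means the list loops back on itself).
 */
int
walk_jobs(const struct job *head, int limit)
{
    int n = 0;

    for (const struct job *j = head; j != NULL; j = j->next) {
        if (n == limit)
            return -1;
        n++;    /* the real loop would invalidate j->va here */
    }
    return n;
}
```

Without the limit, an entry whose next pointer refers to itself never
yields a NULL link, so the walk never ends while the queue lock stays held.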
This is the problem I'm still trying to understand and figure out how
to fix. The patch I posted previously is a workaround that detects this
particular problem and displays a message if it occurs. The other
alternative I posted (changing PMAP_TLB_SHOOTDOWN_MAXJOBS to 0) makes it
never allocate queue entries, so it should never loop.
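A rough sketch of why a job limit of 0 avoids the loop. Everything here is
hypothetical (struct, field and function names), and it assumes that when
no queue entry can be allocated the request falls back to invalidating the
whole tlb; the message above only says that no entries get allocated.

```c
/* Hypothetical per-cpu shootdown queue state for this sketch. */
struct shoot_queue {
    int max_jobs;   /* cap on queued entries (the MAXJOBS value) */
    int count;      /* entries currently queued */
    int flush_all;  /* nonzero: invalidate everything instead */
};

/*
 * Try to queue one per-address job. Returns 1 if an entry was queued,
 * 0 if the cap was reached and a flush-all was requested instead
 * (assumed fallback, see above).
 */
int
enqueue_shootdown(struct shoot_queue *q)
{
    if (q->count >= q->max_jobs) {
        q->flush_all = 1;
        return 0;
    }
    q->count++;
    return 1;
}
```

With max_jobs set to 0 the first branch is always taken, so no linked
entries ever exist for the processing loop to walk.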
What I think may be happening is that the IPI interrupt may be occurring
at level 5, which is the same level as the clock interrupt. The
mutex for the tlb shootdown queue uses IPL_VM, which I think is at level
4. If I understand the mutex locking mechanism correctly, this would allow
an IPI request from another cpu to interrupt the code that manipulates the
shootdown queue entries, and that could corrupt the job queue.
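The masking rule behind this suspicion can be sketched as follows. The
enum names are made up for illustration, and the values 4 and 5 are only
the guesses stated above, not confirmed figures.

```c
/* Levels as guessed in this message; names are hypothetical. */
enum {
    SKETCH_IPL_VM  = 4,   /* level the queue mutex is thought to raise to */
    SKETCH_IPL_IPI = 5    /* level the IPI is thought to arrive at */
};

/*
 * An interrupt is taken only when its level is strictly above the
 * level the cpu is currently running at.
 */
int
interrupt_delivered(int current_ipl, int irq_level)
{
    return irq_level > current_ipl;
}
```

Under those assumed values, interrupt_delivered(SKETCH_IPL_VM,
SKETCH_IPL_IPI) is true, so holding the mutex would not keep the IPI
handler from running in the middle of a queue update.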
--
Michael L. Hitch mhitch%montana.edu@localhost
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA