NetBSD-Bugs archive
Re: port-alpha/38335 (kernel freeze on alpha MP system)
The following reply was made to PR port-alpha/38335; it has been noted by GNATS.
From: "Michael L. Hitch" <mhitch%lightning.msu.montana.edu@localhost>
To: Jarle Greipsland <jarle%uninett.no@localhost>
Cc: gnats-bugs%NetBSD.org@localhost, dholland%NetBSD.org@localhost
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sat, 26 Sep 2009 11:04:37 -0600 (MDT)
On Tue, 22 Sep 2009, Jarle Greipsland wrote:
> The problem is still there. I installed and booted GENERIC.MP,
> and built a full release. During extraction of the newly built
> sets, the system wedged again. This time I could not break into
> DDB from the console.
This was a CS20, according to the original PR? If so, you can get
into DDB using the halt switch [it's hidden in a small hole in the front
panel to the right]. After the machine is halted, you can enter
'continue', which will enter into a console trap. The halt should
show the PC of cpu 0 at the time of the halt, and I think that address
should be related to the backtrace of cpu 0. One problem will be that
it's very likely cpu 1 is not halted, and is likely spinning with
interrupts blocked, so the IPI sent by cpu 0 to halt it will not be
delivered and there won't be any registers available [I'm not even sure
that works now even when the IPI can be delivered - I need to look into that].
> Also, right after root file system detection, the kernel complains:
>
> WARNING: negative runtime; monotonic clock has gone backwards
>
> I don't know if this might be related to the wedging or not.
Not very likely, I think. I see this every MP boot, and it is related
to using the PCC timecounter. I haven't figured out exactly how that
is supposed to work on an MP system yet.
> If you want me to dig out more information, please ask, and I'll
> see what I can do.
The problem is likely some kind of locking deadlock, which I've gotten
a few times. I've been trying to debug a problem with the TLB shootdown
code where it gets a corrupted pool_cache entry and ends up with a
list that links to itself. You can try this patch I'm using to attempt
to detect this and work around it:
@@ -3700,6 +3700,12 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
* don't really have to do anything else.
*/
mutex_spin_enter(&pq->pq_lock);
+/**/ if (pj && pj == pq->pq_head.tqh_first) {
+/**/ printf("Whoa! pool_cache_get returned an in-use entry! ci_index %d pj %p\n",
+ self->ci_index, pj);
+/**/ /*panic("Oops");*/
+/**/ pj = NULL; /* XXX */
+/**/ }
pq->pq_pte |= pte;
if (pq->pq_tbia) {
mutex_spin_exit(&pq->pq_lock);
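To illustrate why an in-use entry produces a list that links to itself, here is a minimal user-space sketch. It assumes the job is appended at the tail of the queue after allocation; the names (struct job, struct jobq, jobq_insert_tail, job_is_in_use, demo_self_link) are hypothetical and are not the pmap.c code. It only models a tail-insert queue and the same "is this already the head?" test the patch performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the shootdown job and its per-cpu queue. */
struct job {
	struct job *next;
};

struct jobq {
	struct job *first;	/* head of the list */
	struct job **lastp;	/* address of the last element's next pointer */
};

static void
jobq_init(struct jobq *q)
{
	q->first = NULL;
	q->lastp = &q->first;
}

/* Tail insert, in the style of a tail queue. */
static void
jobq_insert_tail(struct jobq *q, struct job *j)
{
	j->next = NULL;
	*q->lastp = j;
	q->lastp = &j->next;
}

/* The check from the patch: did the allocator return the current head? */
static int
job_is_in_use(const struct jobq *q, const struct job *j)
{
	return j != NULL && j == q->first;
}

/*
 * Queue one job, then pretend the allocator hands out the same job
 * again and queue it a second time.  Returns nonzero if the job now
 * points to itself.
 */
static int
demo_self_link(void)
{
	struct jobq q;
	struct job a;

	jobq_init(&q);
	assert(!job_is_in_use(&q, &a));
	jobq_insert_tail(&q, &a);

	/* Second "allocation" returns the entry that is still queued. */
	assert(job_is_in_use(&q, &a));
	jobq_insert_tail(&q, &a);

	return a.next == &a;
}
```

In this model the second insert writes the job's own address into its next pointer, so anything walking the list never reaches the end. The check catches only the case where the duplicate is the head, which is the same case the patch tests for.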
An alternative workaround is to set PMAP_TLB_SHOOTDOWN_MAXJOBS to 0,
which will invalidate all tlbs instead of trying to invalidate
single entries.
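As a rough sketch of what that limit does, assuming the queue falls back to a full invalidate once the job count reaches the limit: the structure and function below (struct shootdown_queue, queue_shootdown) are hypothetical, not the kernel's, and maxjobs stands in for PMAP_TLB_SHOOTDOWN_MAXJOBS.

```c
#include <stddef.h>

/* Hypothetical model of a per-cpu shootdown queue. */
struct shootdown_queue {
	int count;	/* single-entry jobs currently queued */
	int flush_all;	/* nonzero: invalidate the whole TLB instead */
};

/*
 * Request one invalidation.  Returns 1 if it was queued as a
 * single-entry job, 0 if the queue is (or just became) a full flush.
 */
static int
queue_shootdown(struct shootdown_queue *q, int maxjobs)
{
	if (q->flush_all)
		return 0;		/* already flushing everything */
	if (q->count >= maxjobs) {
		q->count = 0;		/* queued jobs are now redundant */
		q->flush_all = 1;
		return 0;
	}
	q->count++;
	return 1;
}
```

With maxjobs set to 0 the first request already takes the full-flush path, so in this model no per-entry job is ever allocated or queued, and the allocation that the patch guards against is never reached.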
--
Michael L. Hitch mhitch%montana.edu@localhost
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA