NetBSD-Bugs archive


Re: port-alpha/38335 (kernel freeze on alpha MP system)



The following reply was made to PR port-alpha/38335; it has been noted by GNATS.

From: "Michael L. Hitch" <mhitch%lightning.msu.montana.edu@localhost>
To: Jarle Greipsland <jarle%uninett.no@localhost>
Cc: gnats-bugs%NetBSD.org@localhost, dholland%NetBSD.org@localhost
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sat, 26 Sep 2009 11:04:37 -0600 (MDT)

 On Tue, 22 Sep 2009, Jarle Greipsland wrote:
 
 > The problem is still there.  I installed and booted GENERIC.MP,
 > and built a full release.  During extraction of the newly built
 > sets, the system wedged again.  This time I could not break into
 > DDB from the console.
 
    This was a CS20, according to the original PR?  If so, you can get
 into DDB using the halt switch [it's hidden in a small hole in the front
 panel to the right].  After the machine is halted, you can enter
 'continue', which will enter into a console trap.  The halt should
 show the PC of cpu 0 at the time of the halt, and I think that address
 should be related to the backtrace of cpu 0.  One problem will be that
 it's very likely cpu 1 is not halted, and is likely spinning with
 interrupts blocked, so the IPI sent by cpu 0 to halt it will not be
 delivered and there won't be any registers available [I'm not even sure
 that works now even when the IPI can be delivered - I need to look into
 that].
 
 > Also, right after root file system detection, the kernel complains:
 >
 > WARNING: negative runtime; monotonic clock has gone backwards
 >
 > I don't know if this might be related to the wedging or not.
 
    Not very likely, I think.  I see this every MP boot, and it is related
 to using the PCC timecounter.  I haven't figured out exactly how that
 is supposed to work on an MP system yet.
 
 > If you want me to dig out more information, please ask, and I'll
 > see what I can do.
 
    The problem is likely some kind of locking deadlock, which I've gotten
 a few times.  I've been trying to debug a problem with the TLB shootdown
 code where it gets a corrupted pool_cache entry and ends up with a
 list that links to itself.  You can try this patch I'm using to attempt
 to detect this and work around it:
 
 @@ -3700,6 +3700,12 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
                   * don't really have to do anything else.
                   */
                  mutex_spin_enter(&pq->pq_lock);
 +/**/           if (pj && pj == pq->pq_head.tqh_first) {
 +/**/                   printf("Whoa!  pool_cache_get returned an in-use entry! ci_index %d pj %p\n",
 +                           self->ci_index,  pj);
 +/**/                   /*panic("Oops");*/
 +/**/                   pj = NULL;      /* XXX */
 +/**/           }
                  pq->pq_pte |= pte;
                  if (pq->pq_tbia) {
                          mutex_spin_exit(&pq->pq_lock);
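
 The check in the patch can be illustrated outside the kernel.  What
 follows is a minimal userland sketch, not the pmap code: struct job,
 struct jobq, job_is_queue_head(), enqueue_checked() and
 double_insert_self_links() are hypothetical names made up for the
 example, and it assumes the <sys/queue.h> TAILQ macros.  It shows that
 queueing an entry that is already on the queue makes the entry's next
 pointer refer to itself, and that comparing the new entry against
 tqh_first catches the case where the in-use entry is the queue head.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a shootdown job; not the kernel's struct. */
struct job {
	TAILQ_ENTRY(job) link;
	int id;
};
TAILQ_HEAD(jobq, job);

/* Same test as the patch: is the "fresh" entry already the queue head? */
static int
job_is_queue_head(const struct jobq *q, const struct job *pj)
{
	return pj != NULL && pj == q->tqh_first;
}

/* Queue pj unless the check fires; returns 0 if queued, -1 if rejected. */
static int
enqueue_checked(struct jobq *q, struct job *pj)
{
	if (job_is_queue_head(q, pj))
		return -1;
	TAILQ_INSERT_TAIL(q, pj, link);
	return 0;
}

/*
 * Model the corruption: insert the same entry twice with no check.
 * Returns 1 if the entry's next pointer ends up pointing at itself.
 */
static int
double_insert_self_links(void)
{
	struct jobq q;
	struct job a = { .id = 1 };

	TAILQ_INIT(&q);
	TAILQ_INSERT_TAIL(&q, &a, link);
	TAILQ_INSERT_TAIL(&q, &a, link);	/* the bug being modelled */
	return TAILQ_NEXT(&a, link) == &a;
}
```

 Note that a head-only comparison like this would miss an in-use entry
 sitting further down the queue; the sketch has the same limitation.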
 
    An alternative workaround is to set PMAP_TLB_SHOOTDOWN_MAXJOBS to 0, 
 which will invalidate all tlbs instead of trying to invalidate
 single entries.
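
 As a rough sketch of why a limit of 0 has that effect, assume the
 shootdown code falls back to a full invalidate once the number of
 queued jobs reaches the limit.  That rule, and the names
 shootdown_action and choose_action(), are assumptions made for
 illustration; they are not copied from pmap.c.

```c
#include <stddef.h>

enum shootdown_action { INVALIDATE_SINGLE, INVALIDATE_ALL };

/*
 * queued:  jobs already pending for the target cpu
 * maxjobs: the cap (the role PMAP_TLB_SHOOTDOWN_MAXJOBS is assumed to play)
 *
 * Once the cap is reached, give up on per-entry invalidation.
 */
static enum shootdown_action
choose_action(size_t queued, size_t maxjobs)
{
	return (queued >= maxjobs) ? INVALIDATE_ALL : INVALIDATE_SINGLE;
}
```

 Under that assumption, choose_action(n, 0) is INVALIDATE_ALL for every
 n, so no job entry would ever need to be taken from the pool_cache.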
 
 --
 Michael L. Hitch                       mhitch%montana.edu@localhost
 Computer Consultant
 Information Technology Center
 Montana State University       Bozeman, MT     USA
 

