Port-alpha archive


Re: Some alpha problems in current



On Mon, 4 Feb 2008, Michael L. Hitch wrote:

I'm now trying a kernel where I've added a SPINLOCK_SPIN_HOOK in the
places where sys/kern/mutex.c spins, and my MP kernel has been running
for over 2 hours.  I'm going to try a LOCKDEBUG kernel again after a
while to see if that's changed by the addition of the SPINLOCK_SPIN_HOOKs.

LOCKDEBUG kernels still hang on boot and can't be halted, so I can't use LOCKDEBUG yet.
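
For reference, the SPINLOCK_SPIN_HOOK change is roughly along these lines
(just a sketch, not the actual diff - "mutex_owner_running" here only
stands in for whatever condition the real spin loops in kern_mutex.c test):

	while (mutex_owner_running(mtx)) {	/* stand-in condition */
		SPINLOCK_BACKOFF(count);	/* existing backoff */
#ifdef SPINLOCK_SPIN_HOOK
		SPINLOCK_SPIN_HOOK;	/* added: let this CPU see pending IPIs */
#endif
	}

The idea is that a CPU busy-waiting on a spin mutex otherwise never gets
a chance to handle things like the fpsave IPI.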

  The machine survived 16 hours (until 04:55), at which point it panicked
with "fpsave ipi didn't".  CPU 1 was hung and I couldn't get a backtrace
from it.  That would also explain the panic: CPU 0 is waiting for the
other CPU to process the fpsave IPI, but that CPU is probably spinning
somewhere and unable to process it.  This type of panic used to be fairly
frequent before, but I was able to track down the deadlock at the time
and that particular problem was fixed.  It looks like it may be back again.

I've been getting the "fpsave ipi didn't" panics and the hangs in
pmap_do_tlb_shootdown() fairly often, which at least helps in trying to
get more information from them.

Because LOCKDEBUG doesn't work, I partially enabled some of the LOCKDEBUG
features - specifically the SPINLOCK_SPINOUT() stuff.  This has allowed
me to get more information from the pmap_do_tlb_shootdown() hang.  What
I'm seeing is that CPU 1 is spinning trying to acquire the pq_lock mutex
to add another entry to the job queue.  CPU 0 currently holds the mutex;
it shouldn't be holding it for very long, but it appears to be looping.
I've finally managed to get information from DDB on both CPUs and get a
good dump file.  The only place in pmap_do_tlb_shootdown() where I could
see any possibility of looping is the "TAILQ_FOREACH(pj, &jobs, pj_list)"
loop, but that's not supposed to be happening.

But that is exactly what is happening.  In the dump file I think I've
located the jobs variable on the stack, and it points to a
pmap_tlb_shootdown_job that links to itself:

*** stack frame for pmap_do_tlb_shootdown:

(gdb) x/10gx 0xfffffe000e501e58
0xfffffe000e501e58:     0xfffffc0000840468      0xfffffe0000084c70
                        ^ RA from call to pmap_do_tlb_shootdown
0xfffffe000e501e68:     0xfffffc0000b31ea8      0x0000000000000003
0xfffffe000e501e78:     0xfffffc0000b3b2f8      0xfffffc0000b6e3c8
0xfffffe000e501e88:     0xfffffc006f92e2c0      0xfffffc006f92e2c0
                        ^  jobs TAILQ_HEAD
0xfffffe000e501e98:     0xfffffc000083fd84      0xfffffc0000b31ea8

*** pmap_tlb_shootdown_job pointed to by jobs:

(gdb) x/x 0xfffffc006f92e2c0
0xfffffc006f92e2c0:     0xfffffc006f92e2c0
(gdb) print (struct pmap_tlb_shootdown_job)* 0xfffffc006f92e2c0
$4 = {pj_list = {tqe_next = 0xfffffc006f92e2c0, tqe_prev =  0xfffffe000e501e88},
                            ^^^^^^^^^^^^^^^^^^
                            EEEK!!!!!
  pj_va = 18446741874823061504, pj_pmap = 0xfffffc0000ba68a8, pj_pte = 16}
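
Just to illustrate why that is fatal, here's a small userland test (the
struct and field names only mimic the pmap ones) showing that once an
entry's tqe_next points back at itself, TAILQ_NEXT() keeps returning the
same element and TAILQ_FOREACH() never reaches the end of the list:

#include <sys/queue.h>
#include <stdio.h>

struct job {
	TAILQ_ENTRY(job) pj_list;
	int pj_id;
};
TAILQ_HEAD(jobhead, job);

int
main(void)
{
	struct jobhead jobs = TAILQ_HEAD_INITIALIZER(jobs);
	struct job a = { .pj_id = 1 }, b = { .pj_id = 2 };
	struct job *pj;
	int steps = 0;

	TAILQ_INSERT_TAIL(&jobs, &a, pj_list);
	TAILQ_INSERT_TAIL(&jobs, &b, pj_list);

	/* simulate the corruption seen in the dump: an entry links to itself */
	a.pj_list.tqe_next = &a;

	/* TAILQ_FOREACH(pj, &jobs, pj_list) would now never terminate;
	   walk with an explicit bound so this demo does */
	for (pj = TAILQ_FIRST(&jobs); pj != NULL && steps < 5;
	    pj = TAILQ_NEXT(pj, pj_list), steps++) {
		printf("visiting job %d at %p\n", pj->pj_id, (void *)pj);
		if (pj == TAILQ_NEXT(pj, pj_list)) {
			printf("job %d links to itself\n", pj->pj_id);
			break;
		}
	}
	return 0;
}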

Another oddity - the pmap_tlb_shootdown_q entry for CPU 0 shows a count
that doesn't match:

(gdb) print pmap_tlb_shootdown_q[0]
$5 = {pq_head = {tqh_first = 0x0, tqh_last = 0xfffffc0000b77480}, pq_lock  = {
    mtx_pad1 = 1025, mtx_pad2 = 1}, pq_pte = 16, pq_count = 2, pq_tbia =  0,
  pq_pad = '\0' <repeats 23 times>}

  The pq_count indicates there should be 2 entries in the job queue.

Somewhere something is corrupting the job queue, but I haven't been able
to spot it.  All the accesses look like they should be properly protected
by the pq_lock mutex.  I guess the next step will be to put checks in to
verify that the queue entries are sane and see if I can find where it's
getting corrupted.
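
Something along these lines, perhaps (untested sketch - the function name
pmap_tlb_shootdown_q_check() is made up, the field and type names are
taken from the dump above, and the caller would need to hold pq_lock):

static void
pmap_tlb_shootdown_q_check(struct pmap_tlb_shootdown_q *pq,
    const char *where)
{
	struct pmap_tlb_shootdown_job *pj;
	int n = 0;

	/* called with pq->pq_lock held */
	TAILQ_FOREACH(pj, &pq->pq_head, pj_list) {
		if (pj == TAILQ_NEXT(pj, pj_list))
			panic("%s: job %p links to itself", where, pj);
		if (++n > (int)pq->pq_count)
			panic("%s: more jobs than pq_count (%d)",
			    where, (int)pq->pq_count);
	}
	if (n != (int)pq->pq_count)
		panic("%s: %d jobs on queue, pq_count %d",
		    where, n, (int)pq->pq_count);
}

Calling something like that everywhere the queue is touched, right after
taking and right before dropping pq_lock, might narrow down where the
list first goes bad.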

--
Michael L. Hitch                        mhitch%NetBSD.org@localhost


