Port-alpha archive


Re: Some alpha problems in current



On Mon, 4 Feb 2008, Michael L. Hitch wrote:

I'm now trying a kernel where I've added a SPINLOCK_SPIN_HOOK in the
places where sys/kern/mutex.c spins, and my MP kernel has been running
for over 2 hours.  I'm going to try a LOCKDEBUG kernel again after a
while to see if that's changed by the addition of the SPINLOCK_SPIN_HOOKs.

LOCKDEBUG kernels still hang on boot and can't be halted, so I can't use LOCKDEBUG yet.
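
For reference, the SPINLOCK_SPIN_HOOK change is roughly along these lines
(just a sketch, not the actual diff - "mutex_owner_running" here only
stands in for whatever condition the real spin loops in kern_mutex.c test):

	while (mutex_owner_running(mtx)) {	/* stand-in condition */
		SPINLOCK_BACKOFF(count);	/* existing backoff */
#ifdef SPINLOCK_SPIN_HOOK
		SPINLOCK_SPIN_HOOK;	/* added: let this CPU see pending IPIs */
#endif
	}

The idea is that a CPU busy-waiting on a spin mutex otherwise never gets
a chance to handle things like the fpsave IPI.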

  The machine survived 16 hours (until 04:55), at which point it panicked
with "fpsave ipi didn't".  CPU 1 was hung and I couldn't get a backtrace
from it.  That would also explain the panic: CPU 0 is waiting for the
other CPU to process the fpsave IPI, but that CPU is probably spinning
somewhere and unable to process it.  This type of panic used to be fairly
frequent before, but I was able to track down the deadlock at the time
and that particular problem was fixed.  It looks like it may be back again.

I've been getting the "fpsave ipi didn't" panics and the hangs in
pmap_do_tlb_shootdown() fairly often, which at least helps in trying to
get more information from them.

Because LOCKDEBUG doesn't work, I partially enabled some of the LOCKDEBUG
features - specifically the SPINLOCK_SPINOUT() stuff.  This has allowed
me to get more information from the pmap_do_tlb_shootdown() hang.  What
I'm seeing is that CPU 1 is spinning trying to acquire the pq_lock mutex
to add another entry to the job queue.  CPU 0 currently holds the mutex;
it shouldn't be holding it for very long, but it appears to be looping.
I've finally managed to get information from DDB on both CPUs and get a
good dump file.  The only place in pmap_do_tlb_shootdown() where I could
see any possibility of looping is the "TAILQ_FOREACH(pj, &jobs, pj_list)"
loop, but that's not supposed to be happening.

But that is exactly what is happening.  In the dump file I think I've
located the jobs variable on the stack, and it points to a
pmap_tlb_shootdown_job that links to itself:

*** stack frame for pmap_do_tlb_shootdown:

(gdb) x/10gx 0xfffffe000e501e58
0xfffffe000e501e58:     0xfffffc0000840468      0xfffffe0000084c70
                        ^ RA from call to pmap_do_tlb_shootdown
0xfffffe000e501e68:     0xfffffc0000b31ea8      0x0000000000000003
0xfffffe000e501e78:     0xfffffc0000b3b2f8      0xfffffc0000b6e3c8
0xfffffe000e501e88:     0xfffffc006f92e2c0      0xfffffc006f92e2c0
                        ^  jobs TAILQ_HEAD
0xfffffe000e501e98:     0xfffffc000083fd84      0xfffffc0000b31ea8

*** pmap_tlb_shootdown_job pointed to by jobs:

(gdb) x/x 0xfffffc006f92e2c0
0xfffffc006f92e2c0:     0xfffffc006f92e2c0
(gdb) print (struct pmap_tlb_shootdown_job)* 0xfffffc006f92e2c0
$4 = {pj_list = {tqe_next = 0xfffffc006f92e2c0, tqe_prev =  0xfffffe000e501e88},
                            ^^^^^^^^^^^^^^^^^^
                            EEEK!!!!!
  pj_va = 18446741874823061504, pj_pmap = 0xfffffc0000ba68a8, pj_pte = 16}
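
Just to illustrate why that is fatal, here's a small userland test (the
struct and field names only mimic the pmap ones) showing that once an
entry's tqe_next points back at itself, TAILQ_NEXT() keeps returning the
same element and TAILQ_FOREACH() never reaches the end of the list:

#include <sys/queue.h>
#include <stdio.h>

struct job {
	TAILQ_ENTRY(job) pj_list;
	int pj_id;
};
TAILQ_HEAD(jobhead, job);

int
main(void)
{
	struct jobhead jobs = TAILQ_HEAD_INITIALIZER(jobs);
	struct job a = { .pj_id = 1 }, b = { .pj_id = 2 };
	struct job *pj;
	int steps = 0;

	TAILQ_INSERT_TAIL(&jobs, &a, pj_list);
	TAILQ_INSERT_TAIL(&jobs, &b, pj_list);

	/* simulate the corruption seen in the dump: an entry links to itself */
	a.pj_list.tqe_next = &a;

	/* TAILQ_FOREACH(pj, &jobs, pj_list) would now never terminate;
	   walk with an explicit bound so this demo does */
	for (pj = TAILQ_FIRST(&jobs); pj != NULL && steps < 5;
	    pj = TAILQ_NEXT(pj, pj_list), steps++) {
		printf("visiting job %d at %p\n", pj->pj_id, (void *)pj);
		if (pj == TAILQ_NEXT(pj, pj_list)) {
			printf("job %d links to itself\n", pj->pj_id);
			break;
		}
	}
	return 0;
}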

Another oddity - the pmap_tlb_shootdown_q entry for CPU 0 shows a count
that doesn't match:

(gdb) print pmap_tlb_shootdown_q[0]
$5 = {pq_head = {tqh_first = 0x0, tqh_last = 0xfffffc0000b77480}, pq_lock  = {
    mtx_pad1 = 1025, mtx_pad2 = 1}, pq_pte = 16, pq_count = 2, pq_tbia =  0,
  pq_pad = '\0' <repeats 23 times>}

  The pq_count indicates there should be 2 entries in the job queue.

Somewhere something is corrupting the job queue, but I haven't been able
to spot it.  All the accesses look like they should be properly protected
by the pq_lock mutex.  I guess the next step will be to put checks in to
verify that the queue entries are sane and see if I can find where it's
getting corrupted.
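
Something along these lines, perhaps (untested sketch - the function name
pmap_tlb_shootdown_q_check() is made up, the field and type names are
taken from the dump above, and the caller would need to hold pq_lock):

static void
pmap_tlb_shootdown_q_check(struct pmap_tlb_shootdown_q *pq,
    const char *where)
{
	struct pmap_tlb_shootdown_job *pj;
	int n = 0;

	/* called with pq->pq_lock held */
	TAILQ_FOREACH(pj, &pq->pq_head, pj_list) {
		if (pj == TAILQ_NEXT(pj, pj_list))
			panic("%s: job %p links to itself", where, pj);
		if (++n > (int)pq->pq_count)
			panic("%s: more jobs than pq_count (%d)",
			    where, (int)pq->pq_count);
	}
	if (n != (int)pq->pq_count)
		panic("%s: %d jobs on queue, pq_count %d",
		    where, n, (int)pq->pq_count);
}

Calling something like that everywhere the queue is touched, right after
taking and right before dropping pq_lock, might narrow down where the
list first goes bad.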

--
Michael L. Hitch                        mhitch%NetBSD.org@localhost


