Port-alpha archive


Re: Some alpha problems in current



On Fri, 8 Feb 2008, Michael L. Hitch wrote:

But that is exactly what is happening. I have a dump file in which I think I was able to locate the jobs variable on the stack, and it points to a pmap_tlb_shootdown_job that links to itself:

*** stack frame for pmap_do_tlb_shootdown:

(gdb) x/10gx 0xfffffe000e501e58
0xfffffe000e501e58:     0xfffffc0000840468      0xfffffe0000084c70
                       ^ RA from call to pmap_do_tlb_shootdown
0xfffffe000e501e68:     0xfffffc0000b31ea8      0x0000000000000003
0xfffffe000e501e78:     0xfffffc0000b3b2f8      0xfffffc0000b6e3c8
0xfffffe000e501e88:     0xfffffc006f92e2c0      0xfffffc006f92e2c0
                       ^  jobs TAILQ_HEAD
0xfffffe000e501e98:     0xfffffc000083fd84      0xfffffc0000b31ea8

*** pmap_tlb_shootdown_job pointed to by jobs:

(gdb) x/x 0xfffffc006f92e2c0
0xfffffc006f92e2c0:     0xfffffc006f92e2c0
(gdb) print (struct pmap_tlb_shootdown_job)* 0xfffffc006f92e2c0
$4 = {pj_list = {tqe_next = 0xfffffc006f92e2c0, tqe_prev = 0xfffffe000e501e88},
                           ^^^^^^^^^^^^^^^^^^
                           EEEK!!!!!
 pj_va = 18446741874823061504, pj_pmap = 0xfffffc0000ba68a8, pj_pte = 16}

Another oddity - the pmap_tlb_shootdown_q entry for CPU 0 shows a different count:

(gdb) print pmap_tlb_shootdown_q[0]
$5 = {pq_head = {tqh_first = 0x0, tqh_last = 0xfffffc0000b77480}, pq_lock = {
   mtx_pad1 = 1025, mtx_pad2 = 1}, pq_pte = 16, pq_count = 2, pq_tbia =  0,
 pq_pad = '\0' <repeats 23 times>}

 The pq_count indicates there should be 2 entries in the job queue.

Somewhere something is corrupting the job queue, but I haven't been able to spot it. All the accesses look like they should be properly protected via the pq_lock mutex. I guess the next step will be to put checks in to verify the proper queue entries and see if I can find where it's getting corrupted.
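
As a rough idea of what such a check could look like, here is a minimal userland sketch. It assumes the standard TAILQ layout from <sys/queue.h>; struct job, job_head and job_queue_check() are hypothetical stand-ins, not the actual kernel structures or any code from the tree. It walks the queue, checks each entry's forward and back pointers, and compares the number of entries against the recorded count:

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for pmap_tlb_shootdown_job. */
struct job {
	TAILQ_ENTRY(job) link;
	unsigned long va;
};
TAILQ_HEAD(job_head, job);

/*
 * Walk the queue and compare it against the expected entry count.
 * Returns 0 if the queue looks sane, or a negative code for the
 * first problem found.  The caller is assumed to hold the lock
 * that protects the queue.
 */
static int
job_queue_check(struct job_head *head, int expected)
{
	struct job *j;
	struct job **prevp = &head->tqh_first;
	int n = 0;

	for (j = head->tqh_first; j != NULL; j = j->link.tqe_next) {
		if (j->link.tqe_next == j)
			return -1;	/* entry links to itself */
		if (j->link.tqe_prev != prevp)
			return -2;	/* back pointer is wrong */
		if (++n > expected)
			return -3;	/* too many entries, or a cycle */
		prevp = &j->link.tqe_next;
	}
	if (head->tqh_last != prevp)
		return -4;		/* tail pointer is wrong */
	if (n != expected)
		return -5;		/* fewer entries than the count says */
	return 0;
}
```

Bounding the walk by the expected count keeps the check from spinning forever on a self-linked entry like the one in the dump above.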

  What I have found so far:

The pool_cache_get() in pmap_tlb_shootdown() is returning a pool entry which is already on the job queue, which results in that entry getting linked to itself. Checks added to pool_cache_get_paddr() and pool_cache_put_paddr() caught pool_cache_get_paddr() finding two consecutive pool entries in the cache with the same address. The corresponding check in pool_cache_put_paddr() never saw the duplicate entry being put back in the cache, so I don't know where it came from.
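
To illustrate how a doubly-allocated entry ends up linked to itself, here is a small userland sketch. It assumes the plain TAILQ macros from <sys/queue.h> with no debug checks enabled; struct dup_job and double_insert_self_links() are made-up names for illustration only. It queues the same entry twice, the way it would be if the allocator handed out an entry that is still on the queue:

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for pmap_tlb_shootdown_job. */
struct dup_job {
	TAILQ_ENTRY(dup_job) link;
	unsigned long va;
};
TAILQ_HEAD(dup_head, dup_job);

/*
 * Insert one entry at the tail twice.  Returns 1 if the entry's
 * tqe_next ends up pointing at the entry itself, 0 otherwise.
 */
static int
double_insert_self_links(void)
{
	struct dup_head head;
	struct dup_job j = { .va = 0 };

	TAILQ_INIT(&head);
	TAILQ_INSERT_TAIL(&head, &j, link);	/* legitimate insert */
	TAILQ_INSERT_TAIL(&head, &j, link);	/* same entry handed out again */

	return j.link.tqe_next == &j;
}
```

On the second insert the tail pointer already refers to the entry's own tqe_next field, so storing the "new" element through it makes the entry its own successor.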

I have now started looking at a different approach to this. Since the length of the job queue is now limited to 6 entries (after which the shootdown just invalidates all the tlb entries), I thought I'd try just allocating the job queue as a static array in each pmap_tlb_shootdown_q entry and not even try using the pool_cache. Initially, I ran into problems with the kernel_lock (big lock) spinning out and crashing while trying to rebuild the parity on my raidframe disk. After letting the parity rewrite complete while in single-user mode, I went multi-user and my system has been running for 7 hours now. I've done the operations that would usually induce the duplicate job queue entry fairly quickly several times, and have not experienced any problems so far (although that's not saying much).
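
The static array idea might look roughly like the sketch below. This is not the actual change; sd_job, sd_queue, sd_enqueue() and SD_MAXJOBS are hypothetical names, and the only details taken from the description above are the limit of 6 entries and the fallback to invalidating everything:

```c
#include <stdbool.h>
#include <stddef.h>

#define SD_MAXJOBS	6	/* queue limit mentioned above */

/* Hypothetical job record: the address to invalidate and its PTE bits. */
struct sd_job {
	unsigned long va;
	unsigned long pte;
};

/* Hypothetical per-CPU queue with the job storage embedded in it. */
struct sd_queue {
	struct sd_job jobs[SD_MAXJOBS];
	int count;		/* number of valid entries in jobs[] */
	bool tbia;		/* set: invalidate all TLB entries instead */
};

/*
 * Record one shootdown request.  The caller is assumed to hold the
 * lock protecting the queue.  Once the array is full, the individual
 * jobs are dropped and the queue is marked for a full invalidate.
 */
static void
sd_enqueue(struct sd_queue *q, unsigned long va, unsigned long pte)
{
	if (q->tbia)
		return;		/* a full invalidate is already pending */
	if (q->count == SD_MAXJOBS) {
		q->count = 0;
		q->tbia = true;
		return;
	}
	q->jobs[q->count].va = va;
	q->jobs[q->count].pte = pte;
	q->count++;
}
```

With the storage embedded in the queue there is no allocator involved at all, so an entry cannot be handed out twice.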

--
Michael L. Hitch                        mhitch%NetBSD.org@localhost


