Current-Users archive

Re: something's wrong



On Thu, 14 Nov 2013 23:41:33 +0900
Takahiro HAYASHI <t-hash%abox3.so-net.ne.jp@localhost> wrote:
> I happened unfortunately to meet this problem, but fortunately
> entered ddb.
> I was doing ./build release for amd64 on amd64 HEAD around Nov 9.
>
> Does this give any help?

Yes, thanks - this helps to narrow down the problem. I don't see the
real cause yet, but perhaps someone more familiar with synchronization
matters can make more sense of it. For now I am just extracting some
interesting data.

> db{0}> ps
> PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
> 2150     1 3   5         0   fffffe81c3a0f6e0                cc1
> xchicv

I see only this one LWP with pcu activity. It is waiting for another
CPU to save its FPU state to memory, so that the state can be loaded
and used on the current CPU:

> db{0}> bt/a fffffe81c3a0f6e0
> trace: pid 2150 lid 1 at 0xfffffe810ed35bc0
> sleepq_block() at netbsd:sleepq_block+0xa0
> cv_wait() at netbsd:cv_wait+0x9f
> xc_wait() at netbsd:xc_wait+0x4a
> pcu_load() at netbsd:pcu_load+0x79

We don't know which CPU the remote one is. The xcall is executed
in a softclk context. Three of the softclk handlers are blocked:

> 0       56 3   7       200   fffffe810e170ac0          softclk/7
> tstile
> 0       38 3   4       200   fffffe810e105a00          softclk/4
> tstile
> 0       26 3   2       200   fffffe810e0a1980          softclk/2
> tstile
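
The rows above were picked by eye out of the full ps listing. A minimal
sketch of doing the same mechanically, in Python: it assumes the listing
looks like the quoted mail text, i.e. each row may carry a "> " quote
prefix and the WAIT column may have wrapped onto the following line.
The names parse_ps and blocked_on are made up for this illustration and
are not part of ddb or any NetBSD tool.

```python
import re

# One row of "ps" output; the WAIT column is optional because it may
# have wrapped onto the next line in the mail.
ROW = re.compile(
    r"^(?P<pid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\d+)\s+(?P<cpu>\d+)\s+"
    r"(?P<flags>[0-9a-f]+)\s+(?P<lwp>[0-9a-f]+)\s+(?P<name>\S+)"
    r"(?:\s+(?P<wait>\S+))?$"
)

def parse_ps(text):
    """Return one dict per LWP row found in a (possibly quoted) listing."""
    rows = []
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()      # drop the mail quote prefix
        m = ROW.match(line)
        if m:
            rows.append(m.groupdict())
        elif rows and rows[-1]["wait"] is None and re.fullmatch(r"\w+", line):
            rows[-1]["wait"] = line          # wrapped WAIT column
    return rows

def blocked_on(rows, wchan):
    """Return (name, cpu) for every LWP sleeping on wait channel wchan."""
    return [(r["name"], int(r["cpu"])) for r in rows if r["wait"] == wchan]
```

Calling blocked_on(parse_ps(text), "tstile") on the rows quoted in this
mail returns the three softclk handlers and the softnet/0 handler.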

For the first one (softclk/7), we have a stack trace:

> db{0}> t/a fffffe810e170ac0
> trace: pid 0 lid 56 at 0xfffffe810e169be0
> sleepq_block() at netbsd:sleepq_block+0xa0
> turnstile_block() at netbsd:turnstile_block+0x2cc
> mutex_vector_enter() at netbsd:mutex_vector_enter+0x13d
> arptimer() at netbsd:arptimer+0x15
> callout_softclock() at netbsd:callout_softclock+0x174
> softint_dispatch() at netbsd:softint_dispatch+0x7b

It is apparently waiting for softnet_lock.
Another softint handler is also waiting for softnet_lock:

> 0        3 3   0       200   fffffe821dd69440          softnet/0
> tstile
> [...]
> db{0}> bt/a fffffe821dd69440
> trace: pid 0 lid 3 at 0xfffffe810e055c30
> sleepq_block() at netbsd:sleepq_block+0xa0
> turnstile_block() at netbsd:turnstile_block+0x2cc
> mutex_vector_enter() at netbsd:mutex_vector_enter+0x13d
> arpintr() at netbsd:arpintr+0x13
> softint_dispatch() at netbsd:softint_dispatch+0x7b

We don't know what the softclk handlers on cpu2 and cpu4 are
waiting for.
It doesn't look like two pcu operations directly blocking
each other. Other xcalls don't seem to be involved either.
(The only other user of high-priority xcalls is the "pserialize"
framework, and I don't see any traces of it.)
So it looks like a more complex lock order issue.
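
To make the shape of such a hang concrete, here is a toy user-space
model in Python. It is not kernel code and not a diagnosis: it merely
assumes that the cross-call can only run after the soft interrupt
thread has finished a callout that needs a mutex, which is one possible
reading of the traces above. The function name xcall_completes and the
timeout are invented for the illustration; the lock and event are only
stand-ins for softnet_lock and the xcall completion condvar.

```python
import threading

def xcall_completes(lock_held_elsewhere, timeout=0.5):
    """Toy model: does a cross-call finish within `timeout` seconds?"""
    softnet_lock = threading.Lock()   # stand-in for the kernel mutex
    xcall_done = threading.Event()    # stand-in for the completion condvar

    def soft_interrupt_thread():
        # Assumption: a callout that needs the lock is serviced first ...
        with softnet_lock:
            pass
        # ... and only afterwards does the pending cross-call get to run.
        xcall_done.set()

    if lock_held_elsewhere:
        softnet_lock.acquire()        # some other party owns the lock
    worker = threading.Thread(target=soft_interrupt_thread)
    worker.start()
    # The requester, like cc1 sleeping in xc_wait() above.
    finished = xcall_done.wait(timeout)
    if lock_held_elsewhere:
        softnet_lock.release()        # let the toy thread exit cleanly
    worker.join()
    return finished
```

With the lock free the call returns True; with the lock held by someone
else the requester times out and the call returns False. In the kernel
there is no timeout, so the requester would sleep indefinitely.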

best regards
Matthias




