Port-xen archive


Re: NetBSD DomU MP freeze under Linux Dom0



Manuel Bouyer wrote:
> On Thu, Sep 06, 2012 at 12:57:19PM +0200, Roger Pau Monne wrote:
>> Hello,
>>
>> Recently I've been running some benchmarks on NetBSD to compare the
>> performance of NetBSD and Linux as Dom0/DomU (this was presented
>> at XenSummit last week with Cherry G. Mathew; slides will probably be
>> uploaded soon).
>>
>> One of the benchmarks consisted of running build.sh inside a DomU, and
>> during this test I realised that it leads to a freeze when running a
>> Linux Dom0 and a NetBSD DomU with 4 vcpus. So far I haven't been able to
>> reproduce the problem without MP or with a NetBSD Dom0, which is kind of
>> strange, because I would say it is not related to blkfront: I've added
>> some debugging prints there, and blkfront does not seem to be the owner
>> of the lock when the freeze happens. The build of NetBSD inside the DomU
>> was using 8 simultaneous jobs, and it freezes to the point where I cannot
>> even access ddb. I was able to get a trace using gdbsx:
>>
>> Thread 4:
>>
>> #0  0xffffffff80101248 in hypercall_page ()
>> #1  0x000000000000e033 in ?? ()
>> #2  0x0000000000000000 in ?? ()
>>
>> Thread 3:
>>
>> #0  0xffffffff80130f32 in x86_pause ()
>> #1  0xffffffff801f67b1 in _kernel_lock ()
>> #2  0xffffffff8030b054 in bdev_strategy ()
>> #3  0xffffffff803037d8 in spec_strategy ()
>> #4  0xffffffff803a719a in VOP_STRATEGY ()
>> #5  0xffffffff8035ff7a in ufs_strategy ()
>> #6  0xffffffff803a719a in VOP_STRATEGY ()
>> #7  0xffffffff8038d3fa in bwrite ()
>> #8  0xffffffff803a6320 in VOP_BWRITE ()
>> #9  0xffffffff80357125 in ufs_dirremove ()
>> #10 0xffffffff8035dc47 in ufs_remove ()
>> #11 0xffffffff803a6b53 in VOP_REMOVE ()
>> #12 0xffffffff8039ac4f in do_sys_unlink ()
>> #13 0xffffffff8032b044 in syscall ()
>> #14 0xffffffff8010221d in Xsyscall ()
>>
>> Thread 2:
>>
>> #0  0xffffffff801f67b1 in _kernel_lock ()
>> #1  0xffffffff8030b054 in bdev_strategy ()
>> #2  0xffffffff803037d8 in spec_strategy ()
>> #3  0xffffffff803a719a in VOP_STRATEGY ()
>> #4  0xffffffff8035ff7a in ufs_strategy ()
>> #5  0xffffffff803a719a in VOP_STRATEGY ()
>> #6  0xffffffff8038d3fa in bwrite ()
>> #7  0xffffffff803a6320 in VOP_BWRITE ()
>> #8  0xffffffff80357125 in ufs_dirremove ()
>> #9  0xffffffff8035dc47 in ufs_remove ()
>> #10 0xffffffff803a6b53 in VOP_REMOVE ()
>> #11 0xffffffff8039ac4f in do_sys_unlink ()
>> #12 0xffffffff8032b044 in syscall ()
>> #13 0xffffffff8010221d in Xsyscall ()
>>
>> Thread 1:
>>
>> #0  0xffffffff801f67b1 in _kernel_lock ()
>> #1  0xffffffff8030b054 in bdev_strategy ()
>> #2  0xffffffff803037d8 in spec_strategy ()
>> #3  0xffffffff803a719a in VOP_STRATEGY ()
>> #4  0xffffffff8035ff7a in ufs_strategy ()
>> #5  0xffffffff803a719a in VOP_STRATEGY ()
>> #6  0xffffffff8038d3fa in bwrite ()
>> #7  0xffffffff803a6320 in VOP_BWRITE ()
>> #8  0xffffffff80357125 in ufs_dirremove ()
>> #9  0xffffffff8035dc47 in ufs_remove ()
>> #10 0xffffffff803a6b53 in VOP_REMOVE ()
>> #11 0xffffffff8039ac4f in do_sys_unlink ()
>> #12 0xffffffff8032b044 in syscall ()
>> #13 0xffffffff8010221d in Xsyscall ()
>>
>> My guess is that Thread 4 is holding the lock and is blocked for some
>> reason that's beyond my current knowledge of NetBSD internals, and the
>> stack trace does not help with that.
> 
> Do you have a way to know what hypercall thread 4 is doing?
> It looks like it's doing a hypercall with the kernel_lock held,
> and this hypercall blocks.

I'm not so sure this is related to Xen. I've been trying to debug this:
in the case above the hypercall was a do_console_io, but I've been
seeing many more of these crashes, and they all seem to be related to
the filesystem (probably the same bug that I emailed to tech-kern as
"Panic when deleting large number of files inside DomU").
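
For anyone wanting to repeat this, here is a minimal sketch in Python of
one way to map a frame-0 address inside hypercall_page to a hypercall
number. It assumes the hypercall page is a single 4 KiB page with one
32-byte stub per hypercall, and that the page base is the frame address
rounded down to a page boundary. The helper names (hypercall_number,
describe) are made up for illustration, and the name table is only a
partial copy of the numbers in Xen's public xen.h.

```python
# Sketch: recover a Xen hypercall number from an address inside hypercall_page.
# Assumptions: 4 KiB hypercall page, 32 bytes per hypercall stub, and a
# page-aligned hypercall_page symbol. Helper names are hypothetical.

PAGE_SIZE = 4096
STUB_SIZE = 32

# Partial table of hypercall numbers (from Xen's public xen.h).
HYPERCALL_NAMES = {
    6: "sched_op_compat",
    17: "xen_version",
    18: "console_io",
    20: "grant_table_op",
    29: "sched_op",
    32: "event_channel_op",
}


def hypercall_number(frame_addr: int) -> int:
    """Return the index of the stub that contains frame_addr."""
    offset = frame_addr & (PAGE_SIZE - 1)  # offset within the hypercall page
    return offset // STUB_SIZE


def describe(frame_addr: int) -> str:
    """Format the address together with the decoded hypercall name."""
    nr = hypercall_number(frame_addr)
    name = HYPERCALL_NAMES.get(nr, "unknown")
    return f"{frame_addr:#x}: hypercall {nr} ({name})"


if __name__ == "__main__":
    # Frame-0 addresses of Thread 4 in the two gdbsx traces of this thread.
    for addr in (0xFFFFFFFF80101248, 0xFFFFFFFF801010CA):
        print(describe(addr))
```

Under these assumptions the two Thread 4 addresses in this thread decode
to console_io and sched_op_compat respectively.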

Here is another crash; this time the hypercall is a do_sched_op_compat:

Thread 4:

#0  0xffffffff801010ca in hypercall_page ()
#1  0xffffffff807db030 in ?? ()
#2  0x0000000000000001 in ?? ()
#3  0xffffffff803b03ee in xenconscn_getc ()
#4  0xffffffff8013be10 in db_readline ()
#5  0xffffffff8013c934 in db_read_line ()
#6  0xffffffff80139eb5 in db_command_loop ()
#7  0xffffffff8013f43d in db_trap ()
#8  0xffffffff8013c7da in kdb_trap ()
#9  0xffffffff8034a525 in trap ()
#10 0xffffffff8010340f in calltrap ()
#11 0xffffffff80130bf5 in breakpoint ()
#12 0xffffffff803172f1 in vpanic ()
#13 0xffffffff80317410 in panic ()
#14 0xffffffff803a2ae6 in wapbl_register_deallocation ()
#15 0xffffffff8015ef1b in ffs_indirtrunc ()
#16 0xffffffff8015eec2 in ffs_indirtrunc ()
#17 0xffffffff8015eec2 in ffs_indirtrunc ()
#18 0xffffffff8016007f in ffs_truncate ()
#19 0xffffffff803575ef in ufs_inactive ()
#20 0xffffffff803a817d in VOP_INACTIVE ()
#21 0xffffffff8039f28c in vrelel ()
#22 0xffffffff8039c31c in do_sys_stat ()
#23 0xffffffff8039c3c9 in sys___lstat50 ()
#24 0xffffffff8032c2e4 in syscall ()
#25 0xffffffff8010221d in Xsyscall ()

Thread 3:

#0  0xffffffff8013c58f in ddb_suspend ()
#1  0xffffffff8013c898 in ddb_ipi ()
#2  0xffffffff803abae6 in xen_ipi_ddb ()
#3  0xffffffff803aba91 in xen_ipi_handler ()
#4  0xffffffff8014bc9b in evtchn_do_event ()
#5  0xffffffff801027ed in call_evtchn_do_event ()
#6  0xffffffff8017b76d in do_hypervisor_callback ()
#7  0xffffffff80105bae in hypervisor_callback ()
#8  0x00000000deadbeef in ?? ()
#9  0x00000000deadbeef in ?? ()
#10 0x0000000000000000 in ?? ()

Thread 2:

#0  0xffffffff8013c58f in ddb_suspend ()
#1  0xffffffff8013c898 in ddb_ipi ()
#2  0xffffffff803abae6 in xen_ipi_ddb ()
#3  0xffffffff803aba91 in xen_ipi_handler ()
#4  0xffffffff8014bc9b in evtchn_do_event ()
#5  0xffffffff801027ed in call_evtchn_do_event ()
#6  0xffffffff8017b76d in do_hypervisor_callback ()
#7  0xffffffff80105bae in hypervisor_callback ()
#8  0x00000000deadbeef in ?? ()
#9  0x00000000deadbeef in ?? ()
#10 0x0000000000000000 in ?? ()

Thread 1:

#0  0xffffffff8013c58f in ddb_suspend ()
#1  0xffffffff8013c898 in ddb_ipi ()
#2  0xffffffff803abae6 in xen_ipi_ddb ()
#3  0xffffffff803aba91 in xen_ipi_handler ()
#4  0xffffffff8014bc9b in evtchn_do_event ()
#5  0xffffffff801027ed in call_evtchn_do_event ()
#6  0xffffffff8017b76d in do_hypervisor_callback ()
#7  0xffffffff80105bae in hypervisor_callback ()
#8  0x00000000deadbeef in ?? ()
#9  0x00000000deadbeef in ?? ()
#10 0x0000000000000000 in ?? ()
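
To see at a glance what each vcpu is doing in traces like the ones
above, the gdb output can be grouped by the innermost named function.
The following is only a sketch in Python: the function names
(parse_backtraces, innermost_known, group_by_activity) are hypothetical,
the regular expressions assume the "#N 0xADDR in func ()" frame format
shown in this message, and skipping x86_pause (so that spinning threads
are reported under _kernel_lock) is my own choice.

```python
import re
from collections import defaultdict

THREAD_RE = re.compile(r"^Thread (\d+):")
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\S+) \(\)")


def parse_backtraces(text):
    """Return {thread_id: [function names, innermost frame first]}."""
    threads = {}
    current = None
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate mail quote prefixes
        m = THREAD_RE.match(line)
        if m:
            current = int(m.group(1))
            threads[current] = []
            continue
        m = FRAME_RE.match(line)
        if m and current is not None:
            threads[current].append(m.group(2))
    return threads


def innermost_known(frames, skip=("x86_pause", "??")):
    """First frame that is neither a spin helper nor an unknown symbol."""
    for name in frames:
        if name not in skip:
            return name
    return None


def group_by_activity(threads):
    """Map each innermost function to the list of threads stopped there."""
    groups = defaultdict(list)
    for tid, frames in sorted(threads.items()):
        groups[innermost_known(frames)].append(tid)
    return dict(groups)


# Shortened excerpt of the first trace, used only as sample input.
SAMPLE = """\
Thread 4:
#0  0xffffffff80101248 in hypercall_page ()
#1  0x000000000000e033 in ?? ()
Thread 3:
#0  0xffffffff80130f32 in x86_pause ()
#1  0xffffffff801f67b1 in _kernel_lock ()
Thread 2:
#0  0xffffffff801f67b1 in _kernel_lock ()
#1  0xffffffff8030b054 in bdev_strategy ()
"""

if __name__ == "__main__":
    for func, tids in group_by_activity(parse_backtraces(SAMPLE)).items():
        print(f"{func}: threads {tids}")
```

Fed the full first trace, this should report three threads under
_kernel_lock and one in hypercall_page; fed the second, three threads
in ddb_suspend and one in hypercall_page.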

This time I was also able to get a ddb session. Here is the output:

panic: wapbl_register_deallocation: out of resources
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff80130bf5 cs e030 rflags 246 cr2 7f7ff7b1f000 cpl 0 rsp ffffa0005b03b490
Stopped in pid 1425.1 (find) at netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x1f2
printf_nolog() at netbsd:printf_nolog
wapbl_register_inode() at netbsd:wapbl_register_inode
ffs_indirtrunc() at netbsd:ffs_indirtrunc+0x35b
ffs_indirtrunc() at netbsd:ffs_indirtrunc+0x302
ffs_indirtrunc() at netbsd:ffs_indirtrunc+0x302
ffs_truncate() at netbsd:ffs_truncate+0x9a4
ufs_inactive() at netbsd:ufs_inactive+0x2df
VOP_INACTIVE() at netbsd:VOP_INACTIVE+0x33
vrelel() at netbsd:vrelel+0x1bb
do_sys_stat() at netbsd:do_sys_stat+0x78
sys___lstat50() at netbsd:sys___lstat50+0x26
syscall() at netbsd:syscall+0xc4
ds          4000
es          b4d0
fs          100
gs          7500
rdi         0
rsi         d
rbp         ffffa0005b03b490
rbx         104
rdx         0
rcx         8
rax         1
r8          0
r9          1
r10         0
r11         1180
r12         ffffffff80427780    copyright+0x22f20
r13         ffffa0005b03b4d0
r14         4000
r15         772
rip         ffffffff80130bf5    breakpoint+0x5
cs          e030
rflags      246
rsp         ffffa0005b03b490
ss          e02b

The filesystem was clean, since I had just created it with newfs -O 2.

