A new-ish remote Xen server (which had never been rebooted, sigh) crashed
and then failed to reboot yesterday. It seemed to get its network configured,
but unfortunately it didn't seem to be starting sshd. Worse, when an
on-site person watched it boot, it cleared the screen quickly after
showing the boot messages but never presented a login prompt, so there was
no chance of really seeing how far it got (at least not without filming
the screen and pausing the video just before it blanked).
Unfortunately the swap partition had not yet been enabled, so there was no
crash dump from the original crash either.
It's running NetBSD-7.1 (amd64) release with Xen-4.8:
# uname -a
NetBSD xentral 7.1 NetBSD 7.1 (XEN3_DOM0.201703111743Z) amd64
# pkg_info -c xentools\* | sed 1q
Information for xentools48-4.8.0:
# pkg_info -c xenkern\* | sed 1q
Information for xenkernel48-4.8.0nb1:
It would still boot a GENERIC kernel, and nothing seemed wrong from
there, so after some staring at logs I decided to comment out the
xendomains list in /etc/rc.conf and try booting the Xen dom0 kernel again.
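In /etc/rc.conf that's just commenting out the list, i.e. something like
this (the actual domU names are elided here):

#xendomains="b2 ..."		# temporarily disabled so nothing auto-starts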
This time it booted fine, as expected, so I tried starting the first domU:
# xl create /usr/pkg/etc/xen/b2.conf
Parsing config from /usr/pkg/etc/xen/b2.conf
load: 0.06 cmd: sh 1767 [pipe_rd] 0.00u 0.00s 0% 1432k
load: 0.06 cmd: sh 1767 [pipe_rd] 0.00u 0.00s 0% 1432k
load: 0.04 cmd: sh 1767 [pipe_rd] 0.00u 0.00s 0% 1432k
libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/0 because of timeout
libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/4 because of timeout
libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/5 because of timeout
libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/6 because of timeout
libxl: error: libxl_create.c:1255:domcreate_launch_dm: unable to add disk devices
xenstore-read: couldn't read path /local/domain/0/backend/vbd/1/6/vnd
load: 0.02 cmd: xenstore-read 1646 [tstile] 0.00u 0.00s 0% 1200k
Oh, oh, there's my problem! Starting a domU deadlocks in the kernel!
First things first: add "sshd" to the "REQUIRE" line in
/etc/rc.d/xendomains! This was the main difficulty that required an
on-site person to help reboot -- there wasn't even a login prompt on the
console when it deadlocked. Both local and remote access MUST be
enabled before trying to start any domU!!!
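Concretely, that means appending sshd to the REQUIRE line in the rcorder
header at the top of the xendomains rc.d script -- the other entries shown
here are only illustrative, so add it to whatever your copy already lists:

# PROVIDE: xendomains
# REQUIRE: xencommons sshd
# KEYWORD: shutdown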
Now, back to debugging. Here's the config for the sticky domU:
# cat /usr/pkg/etc/xen/b2.conf | grep -v '^#'
name = "b2"
uuid = "96e60a02-9fc7-11e7-9f0d-d4ae528b9511"
kernel = "/netbsd-7.99.34-XEN3_DOMU"
extra = "root=xbd0a"
memory = 10000
maxmem = 16000
vcpus = 8
vif = [ 'bridge=bridge0' ]
disk = [
'format=raw, vdev=0x0, access=rw, target=/dev/mapper/vg0-lv10',
'format=raw, vdev=0x1, access=rw, target=/dev/mapper/vg1-lv11',
'format=raw, vdev=0x2, access=rw, target=/dev/mapper/vg0-lv12',
'format=raw, vdev=0x3, access=rw, target=/dev/mapper/vg0-lv13',
'format=raw, vdev=0x4, access=rw, target=/dev/mapper/vg0-lv14',
'format=raw, vdev=0x5, access=rw, target=/dev/mapper/vg1-lv15',
'format=raw, vdev=0x6, access=ro, devtype=cdrom, target=/images/NetBSD-7.99.34-amd64.iso'
]
OK, the error message suggested something wrong with vbd#6, so why the
sudden problem with /images?
# ls -l /images
total 8639360
-r--r--r-- 1 woods users 715941888 Sep 19 20:26 NetBSD-5.2_STABLE-amd64.iso
-rw-r--r-- 1 woods users 391266304 Mar 11 2017 NetBSD-7.1-amd64.iso
-r--r--r-- 1 woods users 3313895424 Sep 20 00:12 NetBSD-7.99.34-amd64.iso
hmmm... I forget, is /images a filesystem?
# df
[[ twiddle thumbs for a moment ... then hit <Ctrl-T> (how do linuxoids
do without it!?!?!?!) ]]
load: 0.01 cmd: df 2245 [tstile] 0.00u 0.00s 0% 1204k
Oh oh, that IS a nasty deadlock for sure!
[[ google and read some old email messages and bug reports.... ]]
Run "crash":
crash> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
386 > 1 7 0 0 ffffa00005ce04c0 crash
4397 1 3 0 80 ffffa00005c9f8c0 ksh pause
3213 1 3 0 80 ffffa00005c63060 ksh pause
1828 1 2 0 0 ffffa00005c63480 sshd
2795 1 3 0 80 ffffa00005bcb020 sshd select
2865 1 3 0 0 ffffa00005c638a0 ls tstile
2471 1 3 0 80 ffffa00005b59780 ksh pause
1464 1 3 0 80 ffffa00005c16880 sshd select
2598 1 3 0 80 ffffa00005c16460 sshd select
2245 1 3 0 1000000 ffffa000050f4ac0 df tstile
416 1 3 0 80 ffffa00005bb9840 ksh pause
1827 1 3 0 80 ffffa00005b59360 sshd select
1576 1 3 0 80 ffffa00004a00600 sshd select
1292 1 3 0 0 ffffa00005bb9420 xl tstile
1785 1 3 0 1000000 ffffa00005b26340 xl tstile
1264 1 3 0 80 ffffa00005b59ba0 ksh pause
905 1 3 0 0 ffffa00005bb9000 xenstore-read tstile
907 1 3 0 0 ffffa00005bcb860 xenstore-read tstile
1646 1 3 0 0 ffffa00005bcb440 xenstore-read tstile
906 1 3 0 80 ffffa00005c16040 xenstore-read xbrd
849 2 3 0 80 ffffa000059e1320 xl netio
849 1 3 0 80 ffffa00005b26b80 xl select
1373 1 3 0 80 ffffa000059e1b60 ksh pause
1219 1 3 0 80 ffffa00005b26760 ksh pause
1159 1 3 0 80 ffffa00005953b40 sshd select
322 1 3 0 80 ffffa000049eb1c0 sshd select
703 1 3 0 80 ffffa000049eba00 ksh pause
762 1 3 0 80 ffffa00004a1f620 sshd select
732 1 3 0 80 ffffa00004a001e0 sshd select
693 1 3 0 80 ffffa000059e1740 getty ttyraw
699 1 3 0 80 ffffa000059442e0 getty ttyraw
596 1 3 0 80 ffffa00004a00a20 getty ttyraw
531 1 3 0 80 ffffa000049eb5e0 getty ttyraw
630 1 3 0 80 ffffa00004870120 getty ttyraw
658 1 3 0 80 ffffa00005944700 cron nanoslp
644 1 3 0 80 ffffa000050c7aa0 inetd kqueue
289 1 3 0 80 ffffa00005953720 qmgr kqueue
290 1 3 0 80 ffffa00004e4fa80 pickup kqueue
702 1 3 0 80 ffffa00005953300 master kqueue
508 1 3 0 80 ffffa00005944b20 unbound kqueue
498 2 3 0 80 ffffa000050f46a0 xenconsoled netio
498 1 3 0 80 ffffa00004a1fa40 xenconsoled select
474 1 3 0 80 ffffa000053aa2c0 xenstored select
478 1 3 0 80 ffffa000053aab00 tset ttyraw
462 1 3 0 80 ffffa0000530b2a0 ksh pipe_rd
454 1 3 0 80 ffffa0000530b6c0 conserver select
445 1 3 0 80 ffffa0000530bae0 conserver select
307 2 3 0 80 ffffa000050f4280 sshguard nanoslp
307 1 3 0 80 ffffa00004a5a220 sshguard pipe_rd
421 1 3 0 80 ffffa000050c7680 sshd select
415 1 3 0 80 ffffa000050c7260 rwhod select
405 1 3 0 80 ffffa00004e4f660 powerd kqueue
411 1 3 0 80 ffffa00004e4f240 ntpd pause
372 1 3 0 80 ffffa00004a1f200 xenwatchdogd nanoslp
342 1 3 0 80 ffffa00004a5aa60 ipmon select
152 1 3 0 80 ffffa00004a5a640 syslogd kqueue
1 1 3 0 80 ffffa00004881160 init wait
0 53 3 0 200 ffffa00003ed04c0 ipmi0 ipmi0
0 52 3 0 200 ffffa000053aa6e0 xen_balloon xen_balloon
0 51 3 0 200 ffffa00004899180 physiod physiod
0 50 3 0 200 ffffa000049cd1a0 aiodoned aiodoned
0 49 3 0 200 ffffa000049cd5c0 ioflush syncer
0 48 3 0 200 ffffa000049cd9e0 pgdaemon pgdaemon
0 45 3 0 200 ffffa00004870540 usb7 usbevt
0 44 3 0 200 ffffa00004870960 usb6 usbevt
0 43 3 0 200 ffffa0000486f100 usb5 usbevt
0 42 3 0 200 ffffa0000486f520 usb4 usbevt
0 41 3 0 200 ffffa0000486f940 usb3 usbevt
0 40 3 0 200 ffffa000048200e0 usb2 usbevt
0 39 3 0 200 ffffa00004820500 usb1 usbevt
0 38 3 0 200 ffffa000048995a0 usb0 usbevt
0 37 3 0 200 ffffa000048999c0 cryptoret crypto_w
0 36 3 0 200 ffffa00004881580 unpgc unpgc
0 35 3 0 200 ffffa000048819a0 vmem_rehash vmem_rehash
0 34 3 0 200 ffffa00004871140 xenbus rdst
0 33 3 0 200 ffffa00004871560 xenwatch evtsq
0 32 3 0 200 ffffa00004871980 mfi0 mfi0
0 23 3 0 200 ffffa00004820920 usbtask-dr usbtsk
0 22 3 0 200 ffffa000044d50c0 usbtask-hc usbtsk
0 21 3 0 200 ffffa000044d54e0 scsibus0 sccomp
0 20 3 0 200 ffffa000044d5900 bnx1 bnx1
0 19 3 0 200 ffffa00003ed00a0 bnx0 bnx0
0 17 3 0 200 ffffa00003ed08e0 ipmi ipmipoll
0 16 3 0 200 ffffa00003eb0080 sysmon smtaskq
0 15 3 0 200 ffffa00003eb04a0 pmfsuspend pmfsuspend
0 14 3 0 200 ffffa00003eb08c0 pmfevent pmfevent
0 13 3 0 200 ffffa00003a13060 sopendfree sopendfr
0 12 3 0 200 ffffa00003a13480 nfssilly nfssilly
0 11 3 0 200 ffffa00003a138a0 cachegc cachegc
0 10 3 0 200 ffffa00003a11040 vrele vrele
0 9 3 0 200 ffffa00003a11460 vdrain vdrain
0 8 3 0 200 ffffa00003a11880 modunload mod_unld
0 7 3 0 200 ffffa00003a08020 xcall/0 xcall
0 6 1 0 200 ffffa00003a08440 softser/0
0 5 1 0 200 ffffa00003a08860 softclk/0
0 4 1 0 200 ffffa00003a06000 softbio/0
0 3 1 0 200 ffffa00003a06420 softnet/0
0 2 1 0 201 ffffa00003a06840 idle/0
0 1 3 0 200 ffffffff80c69860 swapper uvm
crash> ps | fgrep tstile
2865 1 3 0 0 ffffa00005c638a0 ls tstile
2245 1 3 0 1000000 ffffa000050f4ac0 df tstile
1292 1 3 0 0 ffffa00005bb9420 xl tstile
1785 1 3 0 1000000 ffffa00005b26340 xl tstile
905 1 3 0 0 ffffa00005bb9000 xenstore-read tstile
907 1 3 0 0 ffffa00005bcb860 xenstore-read tstile
1646 1 3 0 0 ffffa00005bcb440 xenstore-read tstile
crash>
crash>
crash>
crash> t/a ffffa00005bcb440
trace: pid 1646 lid 1 at 0xffffa00065156c90
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
vn_write() at vn_write+0x9b
dofilewrite() at dofilewrite+0x97
sys_write() at sys_write+0x5f
syscall() at syscall+0x9a
--- syscall (number 4) ---
7f7ff703c1fa:
crash>
crash>
crash>
crash> t/a ffffa00005bcb860
trace: pid 907 lid 1 at 0xffffa0006514f988
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
lookup_once() at lookup_once+0x1f4
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
vn_open() at vn_open+0x94
do_open() at do_open+0x111
do_sys_openat() at do_sys_openat+0x68
sys_open() at sys_open+0x24
syscall() at syscall+0x9a
--- syscall (number 5) ---
7f7ff703c43a:
crash>
crash>
crash>
crash> t/a ffffa00005bb9000
trace: pid 905 lid 1 at 0xffffa0006514c988
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
lookup_once() at lookup_once+0x1f4
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
vn_open() at vn_open+0x94
do_open() at do_open+0x111
do_sys_openat() at do_sys_openat+0x68
sys_open() at sys_open+0x24
syscall() at syscall+0x9a
--- syscall (number 5) ---
7f7ff703c43a:
crash>
crash>
crash>
crash> t/a ffffa00005bb9420
trace: pid 1292 lid 1 at 0xffffa00065149960
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
kernfs_root() at kernfs_root+0x41
lookup_once() at lookup_once+0x26e
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
vn_open() at vn_open+0x94
do_open() at do_open+0x111
do_sys_openat() at do_sys_openat+0x68
sys_open() at sys_open+0x24
syscall() at syscall+0x9a
--- syscall (number 5) ---
7f7ff603c43a:
crash>
crash>
crash>
crash> t/a ffffa00005c16040
trace: pid 906 lid 1 at 0xffffa00065172d08
sleepq_block() at sleepq_block+0x8e
xenbus_dev_read() at xenbus_dev_read+0x47
kernfs_try_fileop() at kernfs_try_fileop+0x56
VOP_READ() at VOP_READ+0x3c
vn_read() at vn_read+0x94
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff703c3ba:
crash>
crash>
crash>
It looks like one of these is holding a lock that all the others are
waiting on. I'm guessing xenbus_dev_read() should not be holding any
locks when it goes to sleep waiting for that read to complete.
I thought the problem might have had something to do with the last domU
disk image being on the root filesystem, since the other commands that
were hanging seemed to be hanging on the root filesystem.
However, after a reboot and after moving the .iso files to a separate
filesystem, it still hangs, so I think instead it has to do with
something more central in the Xen and VFS code.
# crash
Crash version 7.1, image version 7.1.
Output from a running system is unreliable.
crash> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
1457 > 1 7 0 0 ffffa00005b6b360 crash
1033 1 3 0 80 ffffa00005b6b780 ksh pause
1313 1 3 0 0 ffffa00005afa340 xl tstile
1671 1 3 0 0 ffffa00005bd4860 xenstore-write tstile
1129 1 3 0 80 ffffa00005bea460 xenstore-read xbrd
1122 1 3 0 0 ffffa00005bc9840 xenstore-read tstile
1003 2 3 0 80 ffffa00005955300 xl netio
1003 1 3 0 80 ffffa00005afa760 xl select
642 1 3 0 80 ffffa00005afab80 ksh pause
331 1 3 0 80 ffffa000049eb1c0 ksh pause
1134 1 3 0 80 ffffa000049eba00 sshd select
328 1 3 0 80 ffffa0000597f740 sshd select
767 1 3 0 80 ffffa00004a1f620 ksh pause
590 1 2 0 0 ffffa00004a001e0 sshd
440 1 3 0 80 ffffa00004a00600 sshd select
693 1 3 0 80 ffffa0000597f320 getty ttyraw
475 1 3 0 80 ffffa000050bf260 getty ttyraw
756 1 3 0 80 ffffa00004a00a20 getty ttyraw
691 1 3 0 80 ffffa000049eb5e0 getty ttyraw
662 1 3 0 80 ffffa00004870120 getty ttyraw
594 1 3 0 80 ffffa00005955720 cron nanoslp
804 1 3 0 80 ffffa00005955b40 inetd kqueue
698 1 3 0 80 ffffa00004a1f200 qmgr kqueue
701 1 3 0 80 ffffa000057492e0 pickup kqueue
638 1 3 0 80 ffffa0000597fb60 master kqueue
516 1 3 0 80 ffffa00005749700 unbound kqueue
434 2 3 0 80 ffffa000054bd6c0 xenconsoled netio
434 1 3 0 80 ffffa00004a1fa40 xenconsoled select
479 1 3 0 80 ffffa000054ed6e0 tset ttyraw
380 1 3 0 80 ffffa000054ed2c0 xenstored select
463 1 3 0 80 ffffa000054bd2a0 ksh pipe_rd
474 1 3 0 80 ffffa000050e1280 conserver select
441 1 3 0 80 ffffa000054bdae0 conserver select
438 2 3 0 80 ffffa000054edb00 sshguard nanoslp
438 1 3 0 80 ffffa00004efb660 sshguard pipe_rd
451 1 3 0 80 ffffa000050e1ac0 sshd select
414 1 3 0 80 ffffa000050bf680 rwhod select
403 1 3 0 80 ffffa00004efb240 powerd kqueue
405 1 3 0 80 ffffa000050bfaa0 ntpd pause
365 1 3 0 80 ffffa00004a5a220 xenwatchdogd nanoslp
358 1 3 0 80 ffffa00004a5aa60 ipmon select
152 1 3 0 80 ffffa00004a5a640 syslogd kqueue
1 1 3 0 80 ffffa00004881160 init wait
0 54 5 0 200 ffffa00003ed04c0 (zombie)
0 53 3 0 200 ffffa00005749b20 xen_balloon xen_balloon
0 52 3 0 200 ffffa00004efba80 ipmi0 ipmi0
0 51 3 0 200 ffffa00004899180 physiod physiod
0 50 3 0 200 ffffa000049cd1a0 aiodoned aiodoned
0 49 3 0 200 ffffa000049cd5c0 ioflush syncer
0 48 3 0 200 ffffa000049cd9e0 pgdaemon pgdaemon
0 45 3 0 200 ffffa00004870540 usb7 usbevt
0 44 3 0 200 ffffa00004870960 usb6 usbevt
0 43 3 0 200 ffffa0000486f100 usb5 usbevt
0 42 3 0 200 ffffa0000486f520 usb4 usbevt
0 41 3 0 200 ffffa0000486f940 usb3 usbevt
0 40 3 0 200 ffffa000048200e0 usb2 usbevt
0 39 3 0 200 ffffa00004820500 usb1 usbevt
0 38 3 0 200 ffffa000048995a0 usb0 usbevt
0 37 3 0 200 ffffa000048999c0 cryptoret crypto_w
0 36 3 0 200 ffffa00004881580 unpgc unpgc
0 35 3 0 200 ffffa000048819a0 vmem_rehash vmem_rehash
0 34 3 0 200 ffffa00004871140 xenbus rdst
0 33 3 0 200 ffffa00004871560 xenwatch evtsq
0 32 3 0 200 ffffa00004871980 mfi0 mfi0
0 23 3 0 200 ffffa00004820920 usbtask-dr usbtsk
0 22 3 0 200 ffffa000044d50c0 usbtask-hc usbtsk
0 21 3 0 200 ffffa000044d54e0 scsibus0 sccomp
0 20 3 0 200 ffffa000044d5900 bnx1 bnx1
0 19 3 0 200 ffffa00003ed00a0 bnx0 bnx0
0 17 3 0 200 ffffa00003ed08e0 ipmi ipmipoll
0 16 3 0 200 ffffa00003eb0080 sysmon smtaskq
0 15 3 0 200 ffffa00003eb04a0 pmfsuspend pmfsuspend
0 14 3 0 200 ffffa00003eb08c0 pmfevent pmfevent
0 13 3 0 200 ffffa00003a13060 sopendfree sopendfr
0 12 3 0 200 ffffa00003a13480 nfssilly nfssilly
0 11 3 0 200 ffffa00003a138a0 cachegc cachegc
0 10 3 0 200 ffffa00003a11040 vrele vrele
0 9 3 0 200 ffffa00003a11460 vdrain vdrain
0 8 3 0 200 ffffa00003a11880 modunload mod_unld
0 7 3 0 200 ffffa00003a08020 xcall/0 xcall
0 6 1 0 200 ffffa00003a08440 softser/0
0 5 1 0 200 ffffa00003a08860 softclk/0
0 4 1 0 200 ffffa00003a06000 softbio/0
0 3 1 0 200 ffffa00003a06420 softnet/0
0 2 1 0 201 ffffa00003a06840 idle/0
0 1 3 0 200 ffffffff80c69860 swapper uvm
crash> t/a ffffa00005bea460
trace: pid 1129 lid 1 at 0xffffa000651c9d08
sleepq_block() at sleepq_block+0x8e
xenbus_dev_read() at xenbus_dev_read+0x47
kernfs_try_fileop() at kernfs_try_fileop+0x56
VOP_READ() at VOP_READ+0x3c
vn_read() at vn_read+0x94
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff703c3ba:
crash> t/a ffffa00005bd4860
trace: pid 1671 lid 1 at 0xffffa000651b5a38
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
lookup_once() at lookup_once+0x1f4
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
fd_nameiat.isra.0() at fd_nameiat.isra.0+0x54
do_sys_statat() at do_sys_statat+0x72
sys___stat50() at sys___stat50+0x28
syscall() at syscall+0x9a
--- syscall (number 439) ---
7f7ff710870a:
crash> t/a ffffa00005bc9840
trace: pid 1122 lid 1 at 0xffffa000651a0c90
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
vn_write() at vn_write+0x9b
dofilewrite() at dofilewrite+0x97
sys_write() at sys_write+0x5f
syscall() at syscall+0x9a
--- syscall (number 4) ---
7f7ff703c1fa:
crash>
I'm guessing the real problem is this xenbus_dev_read() going to sleep
with a lock held. I don't see where that lock is taken, though.
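To make the failure mode concrete, here is a small userspace sketch of the
pattern I think I'm seeing. This is only an analogy written with pthreads,
not the kernel code, and every name in it (vnode_lock, data_cv, and so on)
is made up for illustration: one thread goes to sleep forever waiting for
data while the read/write lock its caller took is still held, so everything
else that needs that same lock parks behind it -- the userland equivalent of
all those processes stuck in "tstile" above.

/*
 * Sketch only: a thread sleeps on a condition variable while the rwlock
 * its caller took is still held, so everyone else needing that rwlock
 * blocks forever.  Build with: cc -o sketch sketch.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t vnode_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  data_mtx   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   data_cv    = PTHREAD_COND_INITIALIZER;
static int data_ready;				/* never set in this sketch */

/* analogue of the vn_read() -> xenbus_dev_read() path in the traces */
static void *
reader(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&vnode_lock);	/* roughly the vn_lock() */
	pthread_mutex_lock(&data_mtx);
	while (!data_ready)			/* sleeps with vnode_lock held */
		pthread_cond_wait(&data_cv, &data_mtx);
	pthread_mutex_unlock(&data_mtx);
	pthread_rwlock_unlock(&vnode_lock);
	return NULL;
}

/* analogue of df, ls, xenstore-read: anything needing the same lock */
static void *
bystander(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&vnode_lock);	/* parks here, i.e. "tstile" */
	puts("bystander: got the lock (never reached in this sketch)");
	pthread_rwlock_unlock(&vnode_lock);
	return NULL;
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, reader, NULL);
	sleep(1);				/* let the reader take the lock */
	pthread_create(&t2, NULL, bystander, NULL);
	sleep(2);
	puts("both threads are stuck now; the cure is to drop (or never take)");
	puts("the outer lock before going to sleep waiting for data");
	return 0;				/* exiting kills the stuck threads */
}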
I'm not sure why others haven't encountered this deadlock yet. I'm also
not sure why it didn't happen when these VMs were first configured.
Timing must be everything, I guess, though it's been 110% reproducible
on my setup, across many reboots.
Entirely removing the .iso file from the device config (the trimmed disk
list is shown after the output below) still caused it to encounter some
kind of problem, but this time it didn't deadlock and require a reboot:
# xl create /usr/pkg/etc/xen/b2.conf
Parsing config from /usr/pkg/etc/xen/b2.conf
libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/0 because of timeout
libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/5 because of timeout
libxl: error: libxl_create.c:1255:domcreate_launch_dm: unable to add disk devices
libxl: error: libxl.c:1575:libxl__destroy_domid: non-existant domain 1
libxl: error: libxl.c:1534:domain_destroy_callback: unable to destroy guest with domid 1
libxl: error: libxl.c:1463:domain_destroy_cb: destruction of domain 1 failed
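For reference, the disk list for that attempt was just the six LVM-backed
volumes from the original config, with the cdrom line dropped:

disk = [
'format=raw, vdev=0x0, access=rw, target=/dev/mapper/vg0-lv10',
'format=raw, vdev=0x1, access=rw, target=/dev/mapper/vg1-lv11',
'format=raw, vdev=0x2, access=rw, target=/dev/mapper/vg0-lv12',
'format=raw, vdev=0x3, access=rw, target=/dev/mapper/vg0-lv13',
'format=raw, vdev=0x4, access=rw, target=/dev/mapper/vg0-lv14',
'format=raw, vdev=0x5, access=rw, target=/dev/mapper/vg1-lv15',
]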
What's odd is that the two other VMs on this machine now start normally.
This problem domU happened to be first in the xendomains list, so it was
really causing a major headache, but it's the least important of them,
so now I can debug it in relative calm.
What's even more odd is that now that the other VMs are running, this
problem domU has started A-OK, without a peep.
HOWEVER!
I see in src-current that there have been "recent" changes to
xenbus_dev.c which get rid of the errant tsleep() call.
I'd say those changes, and any related to them, are critical fixes that
should be considered for a pull-up to netbsd-7.
Thanks!
--
Greg A. Woods <gwoods@acm.org>
+1 250 762-7675                      RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>      Avoncote Farms <woods@avoncote.ca>