Port-xen archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

a story of debugging a NetBSD/Xen server crash and hang



A new-ish remote Xen server (that had never been rebooted, sigh) crashed
and failed to reboot yesterday.  It seemed to get its network configured
but unfortunately it didn't seem to be starting sshd.  Worse when an
on-site person watched it boot it was clearing the screen quickly after
showing the boot messages, but never presenting a login prompt, so no
chance of really seeing how far it got (at least not without watching a
video of the screen and pausing just before it blanked).

Unfortunately the swap partition had not yet been enabled, so no crash
dump from the original crash either.

It's running NetBSD-7.1 (amd64) release with Xen-4.8:

    # uname -a
    NetBSD xentral 7.1 NetBSD 7.1 (XEN3_DOM0.201703111743Z) amd64
    # pkg_info -c xentools\* | sed 1q
    Information for xentools48-4.8.0:
    # pkg_info -c xenkern\* | sed 1q
    Information for xenkernel48-4.8.0nb1:

It would still boot a GENERIC kernel and nothing seemed wrong from
there, so after some staring at logs I decided to comment out the
xendomains list and try booting the xen dom0 kernel again.

This time it booted fine, as expected, so I tried starting the first domU:

    # xl create /usr/pkg/etc/xen/b2.conf
    Parsing config from /usr/pkg/etc/xen/b2.conf
    load: 0.06  cmd: sh 1767 [pipe_rd] 0.00u 0.00s 0% 1432k
    load: 0.06  cmd: sh 1767 [pipe_rd] 0.00u 0.00s 0% 1432k
    load: 0.04  cmd: sh 1767 [pipe_rd] 0.00u 0.00s 0% 1432k
    libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/0 because of timeout
    libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/4 because of timeout
    libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/5 because of timeout
    libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/6 because of timeout
    libxl: error: libxl_create.c:1255:domcreate_launch_dm: unable to add disk devices
    xenstore-read: couldn't read path /local/domain/0/backend/vbd/1/6/vnd
    
    load: 0.02  cmd: xenstore-read 1646 [tstile] 0.00u 0.00s 0% 1200k


Oh, oh, there's my problem!  Starting a domU deadlocks in the kernel!

First things first:  add "sshd" to the "REQUIRE" line in
/etc/rc.conf/xendomains!  This was the main difficulty that required an
on-site person to help reboot -- there wasn't even a login prompt on the
console when it deadlocked.  Both local and remote access MUST be
enabled before trying to start any domU!!!

Now, back to debugging.  Here's the config for the sticky domU:

    # cat /usr/pkg/etc/xen/b2.conf | grep -v '^#'
    name = "b2"
    uuid = "96e60a02-9fc7-11e7-9f0d-d4ae528b9511"
    kernel = "/netbsd-7.99.34-XEN3_DOMU"
    extra = "root=xbd0a"
    memory = 10000
    maxmem = 16000
    vcpus = 8
    vif = [ 'bridge=bridge0' ]
    disk = [
        'format=raw, vdev=0x0, access=rw, target=/dev/mapper/vg0-lv10',
        'format=raw, vdev=0x1, access=rw, target=/dev/mapper/vg1-lv11',
        'format=raw, vdev=0x2, access=rw, target=/dev/mapper/vg0-lv12',
        'format=raw, vdev=0x3, access=rw, target=/dev/mapper/vg0-lv13',
        'format=raw, vdev=0x4, access=rw, target=/dev/mapper/vg0-lv14',
        'format=raw, vdev=0x5, access=rw, target=/dev/mapper/vg1-lv15',
        'format=raw, vdev=0x6, access=ro, devtype=cdrom, target=/images/NetBSD-7.99.34-amd64.iso'
    ]


OK, the error message suggested something wrong with vbd#6, so why the
sudden problem with /images?

    # ls -l /images
    total 8639360
    -r--r--r--  1 woods  users   715941888 Sep 19 20:26 NetBSD-5.2_STABLE-amd64.iso
    -rw-r--r--  1 woods  users   391266304 Mar 11  2017 NetBSD-7.1-amd64.iso
    -r--r--r--  1 woods  users  3313895424 Sep 20 00:12 NetBSD-7.99.34-amd64.iso

hmmm... I forget, is /images a filesystem?

    # df
[[ twiddle thumbs for a moment ... then hit <Ctrl-T> (how to linuxoids
do without it!?!?!?!) ]]
    load: 0.01  cmd: df 2245 [tstile] 0.00u 0.00s 0% 1204k

Oh oh, that IS a nasty deadlock for sure!

google and read some old email messages and bug reports....

Run "crash":

    crash> ps
    PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
    386  >   1 7   0         0   ffffa00005ce04c0              crash
    4397     1 3   0        80   ffffa00005c9f8c0                ksh pause
    3213     1 3   0        80   ffffa00005c63060                ksh pause
    1828     1 2   0         0   ffffa00005c63480               sshd
    2795     1 3   0        80   ffffa00005bcb020               sshd select
    2865     1 3   0         0   ffffa00005c638a0                 ls tstile
    2471     1 3   0        80   ffffa00005b59780                ksh pause
    1464     1 3   0        80   ffffa00005c16880               sshd select
    2598     1 3   0        80   ffffa00005c16460               sshd select
    2245     1 3   0   1000000   ffffa000050f4ac0                 df tstile
    416      1 3   0        80   ffffa00005bb9840                ksh pause
    1827     1 3   0        80   ffffa00005b59360               sshd select
    1576     1 3   0        80   ffffa00004a00600               sshd select
    1292     1 3   0         0   ffffa00005bb9420                 xl tstile
    1785     1 3   0   1000000   ffffa00005b26340                 xl tstile
    1264     1 3   0        80   ffffa00005b59ba0                ksh pause
    905      1 3   0         0   ffffa00005bb9000      xenstore-read tstile
    907      1 3   0         0   ffffa00005bcb860      xenstore-read tstile
    1646     1 3   0         0   ffffa00005bcb440      xenstore-read tstile
    906      1 3   0        80   ffffa00005c16040      xenstore-read xbrd
    849      2 3   0        80   ffffa000059e1320                 xl netio
    849      1 3   0        80   ffffa00005b26b80                 xl select
    1373     1 3   0        80   ffffa000059e1b60                ksh pause
    1219     1 3   0        80   ffffa00005b26760                ksh pause
    1159     1 3   0        80   ffffa00005953b40               sshd select
    322      1 3   0        80   ffffa000049eb1c0               sshd select
    703      1 3   0        80   ffffa000049eba00                ksh pause
    762      1 3   0        80   ffffa00004a1f620               sshd select
    732      1 3   0        80   ffffa00004a001e0               sshd select
    693      1 3   0        80   ffffa000059e1740              getty ttyraw
    699      1 3   0        80   ffffa000059442e0              getty ttyraw
    596      1 3   0        80   ffffa00004a00a20              getty ttyraw
    531      1 3   0        80   ffffa000049eb5e0              getty ttyraw
    630      1 3   0        80   ffffa00004870120              getty ttyraw
    658      1 3   0        80   ffffa00005944700               cron nanoslp
    644      1 3   0        80   ffffa000050c7aa0              inetd kqueue
    289      1 3   0        80   ffffa00005953720               qmgr kqueue
    290      1 3   0        80   ffffa00004e4fa80             pickup kqueue
    702      1 3   0        80   ffffa00005953300             master kqueue
    508      1 3   0        80   ffffa00005944b20            unbound kqueue
    498      2 3   0        80   ffffa000050f46a0        xenconsoled netio
    498      1 3   0        80   ffffa00004a1fa40        xenconsoled select
    474      1 3   0        80   ffffa000053aa2c0          xenstored select
    478      1 3   0        80   ffffa000053aab00               tset ttyraw
    462      1 3   0        80   ffffa0000530b2a0                ksh pipe_rd
    454      1 3   0        80   ffffa0000530b6c0          conserver select
    445      1 3   0        80   ffffa0000530bae0          conserver select
    307      2 3   0        80   ffffa000050f4280           sshguard nanoslp
    307      1 3   0        80   ffffa00004a5a220           sshguard pipe_rd
    421      1 3   0        80   ffffa000050c7680               sshd select
    415      1 3   0        80   ffffa000050c7260              rwhod select
    405      1 3   0        80   ffffa00004e4f660             powerd kqueue
    411      1 3   0        80   ffffa00004e4f240               ntpd pause
    372      1 3   0        80   ffffa00004a1f200       xenwatchdogd nanoslp
    342      1 3   0        80   ffffa00004a5aa60              ipmon select
    152      1 3   0        80   ffffa00004a5a640            syslogd kqueue
    1        1 3   0        80   ffffa00004881160               init wait
    0       53 3   0       200   ffffa00003ed04c0              ipmi0 ipmi0
    0       52 3   0       200   ffffa000053aa6e0        xen_balloon xen_balloon
    0       51 3   0       200   ffffa00004899180            physiod physiod
    0       50 3   0       200   ffffa000049cd1a0           aiodoned aiodoned
    0       49 3   0       200   ffffa000049cd5c0            ioflush syncer
    0       48 3   0       200   ffffa000049cd9e0           pgdaemon pgdaemon
    0       45 3   0       200   ffffa00004870540               usb7 usbevt
    0       44 3   0       200   ffffa00004870960               usb6 usbevt
    0       43 3   0       200   ffffa0000486f100               usb5 usbevt
    0       42 3   0       200   ffffa0000486f520               usb4 usbevt
    0       41 3   0       200   ffffa0000486f940               usb3 usbevt
    0       40 3   0       200   ffffa000048200e0               usb2 usbevt
    0       39 3   0       200   ffffa00004820500               usb1 usbevt
    0       38 3   0       200   ffffa000048995a0               usb0 usbevt
    0       37 3   0       200   ffffa000048999c0          cryptoret crypto_w
    0       36 3   0       200   ffffa00004881580              unpgc unpgc
    0       35 3   0       200   ffffa000048819a0        vmem_rehash vmem_rehash
    0       34 3   0       200   ffffa00004871140             xenbus rdst
    0       33 3   0       200   ffffa00004871560           xenwatch evtsq
    0       32 3   0       200   ffffa00004871980               mfi0 mfi0
    0       23 3   0       200   ffffa00004820920         usbtask-dr usbtsk
    0       22 3   0       200   ffffa000044d50c0         usbtask-hc usbtsk
    0       21 3   0       200   ffffa000044d54e0           scsibus0 sccomp
    0       20 3   0       200   ffffa000044d5900               bnx1 bnx1
    0       19 3   0       200   ffffa00003ed00a0               bnx0 bnx0
    0       17 3   0       200   ffffa00003ed08e0               ipmi ipmipoll
    0       16 3   0       200   ffffa00003eb0080             sysmon smtaskq
    0       15 3   0       200   ffffa00003eb04a0         pmfsuspend pmfsuspend
    0       14 3   0       200   ffffa00003eb08c0           pmfevent pmfevent
    0       13 3   0       200   ffffa00003a13060         sopendfree sopendfr
    0       12 3   0       200   ffffa00003a13480           nfssilly nfssilly
    0       11 3   0       200   ffffa00003a138a0            cachegc cachegc
    0       10 3   0       200   ffffa00003a11040              vrele vrele
    0        9 3   0       200   ffffa00003a11460             vdrain vdrain
    0        8 3   0       200   ffffa00003a11880          modunload mod_unld
    0        7 3   0       200   ffffa00003a08020            xcall/0 xcall
    0        6 1   0       200   ffffa00003a08440          softser/0
    0        5 1   0       200   ffffa00003a08860          softclk/0
    0        4 1   0       200   ffffa00003a06000          softbio/0
    0        3 1   0       200   ffffa00003a06420          softnet/0
    0        2 1   0       201   ffffa00003a06840             idle/0
    0        1 3   0       200   ffffffff80c69860            swapper uvm
    crash> ps | fgrep tstile
    2865     1 3   0         0   ffffa00005c638a0                 ls tstile
    2245     1 3   0   1000000   ffffa000050f4ac0                 df tstile
    1292     1 3   0         0   ffffa00005bb9420                 xl tstile
    1785     1 3   0   1000000   ffffa00005b26340                 xl tstile
    905      1 3   0         0   ffffa00005bb9000      xenstore-read tstile
    907      1 3   0         0   ffffa00005bcb860      xenstore-read tstile
    1646     1 3   0         0   ffffa00005bcb440      xenstore-read tstile
    crash> 
    crash> 
    crash> 
    crash> t/a ffffa00005bcb440
    trace: pid 1646 lid 1 at 0xffffa00065156c90
    sleepq_block() at sleepq_block+0x8e
    turnstile_block() at turnstile_block+0x2c3
    rw_vector_enter() at rw_vector_enter+0x11f
    genfs_lock() at genfs_lock+0x7b
    VOP_LOCK() at VOP_LOCK+0x37
    vn_lock() at vn_lock+0x11
    vn_write() at vn_write+0x9b
    dofilewrite() at dofilewrite+0x97
    sys_write() at sys_write+0x5f
    syscall() at syscall+0x9a
    --- syscall (number 4) ---
    7f7ff703c1fa:
    crash> 
    crash> 
    crash> 
    crash> t/a ffffa00005bcb860
    trace: pid 907 lid 1 at 0xffffa0006514f988
    sleepq_block() at sleepq_block+0x8e
    turnstile_block() at turnstile_block+0x2c3
    rw_vector_enter() at rw_vector_enter+0x11f
    genfs_lock() at genfs_lock+0x7b
    VOP_LOCK() at VOP_LOCK+0x37
    vn_lock() at vn_lock+0x11
    lookup_once() at lookup_once+0x1f4
    namei_tryemulroot() at namei_tryemulroot+0x612
    namei() at namei+0x43
    vn_open() at vn_open+0x94
    do_open() at do_open+0x111
    do_sys_openat() at do_sys_openat+0x68
    sys_open() at sys_open+0x24
    syscall() at syscall+0x9a
    --- syscall (number 5) ---
    7f7ff703c43a:
    crash> 
    crash> 
    crash> 
    crash> t/a ffffa00005bb9000
    trace: pid 905 lid 1 at 0xffffa0006514c988
    sleepq_block() at sleepq_block+0x8e
    turnstile_block() at turnstile_block+0x2c3
    rw_vector_enter() at rw_vector_enter+0x11f
    genfs_lock() at genfs_lock+0x7b
    VOP_LOCK() at VOP_LOCK+0x37
    vn_lock() at vn_lock+0x11
    lookup_once() at lookup_once+0x1f4
    namei_tryemulroot() at namei_tryemulroot+0x612
    namei() at namei+0x43
    vn_open() at vn_open+0x94
    do_open() at do_open+0x111
    do_sys_openat() at do_sys_openat+0x68
    sys_open() at sys_open+0x24
    syscall() at syscall+0x9a
    --- syscall (number 5) ---
    7f7ff703c43a:
    crash> 
    crash> 
    crash> 
    crash> t/a ffffa00005bb9420
    trace: pid 1292 lid 1 at 0xffffa00065149960
    sleepq_block() at sleepq_block+0x8e
    turnstile_block() at turnstile_block+0x2c3
    rw_vector_enter() at rw_vector_enter+0x11f
    genfs_lock() at genfs_lock+0x7b
    VOP_LOCK() at VOP_LOCK+0x37
    vn_lock() at vn_lock+0x11
    kernfs_root() at kernfs_root+0x41
    lookup_once() at lookup_once+0x26e
    namei_tryemulroot() at namei_tryemulroot+0x612
    namei() at namei+0x43
    vn_open() at vn_open+0x94
    do_open() at do_open+0x111
    do_sys_openat() at do_sys_openat+0x68
    sys_open() at sys_open+0x24
    syscall() at syscall+0x9a
    --- syscall (number 5) ---
    7f7ff603c43a:
    crash> 
    crash> 
    crash> 
    crash> t/a ffffa00005c16040
    trace: pid 906 lid 1 at 0xffffa00065172d08
    sleepq_block() at sleepq_block+0x8e
    xenbus_dev_read() at xenbus_dev_read+0x47
    kernfs_try_fileop() at kernfs_try_fileop+0x56
    VOP_READ() at VOP_READ+0x3c
    vn_read() at vn_read+0x94
    dofileread() at dofileread+0x90
    sys_read() at sys_read+0x5f
    syscall() at syscall+0x9a
    --- syscall (number 3) ---
    7f7ff703c3ba:
    crash> 
    crash> 
    crash> 

It looks like one of these is holding a lock while starting another.
I'm guessing xenbus_dev_read() should not hold any locks when it goes
to sleep waiting for that read to complete.

I thought the problem might have something to do with the last domU disk
being on the root filesystem, as those other commands that were hanging
seemed to be hanging on the root filesystem.

However after a reboot and re-configuring the .iso files to be on a
separate filesystem it still hangs, so I think instead it has to do with
something more central in the Xen and VFS code.


    # crash
    Crash version 7.1, image version 7.1.
    Output from a running system is unreliable.
    crash> ps
    PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
    1457 >   1 7   0         0   ffffa00005b6b360              crash
    1033     1 3   0        80   ffffa00005b6b780                ksh pause
    1313     1 3   0         0   ffffa00005afa340                 xl tstile
    1671     1 3   0         0   ffffa00005bd4860     xenstore-write tstile
    1129     1 3   0        80   ffffa00005bea460      xenstore-read xbrd
    1122     1 3   0         0   ffffa00005bc9840      xenstore-read tstile
    1003     2 3   0        80   ffffa00005955300                 xl netio
    1003     1 3   0        80   ffffa00005afa760                 xl select
    642      1 3   0        80   ffffa00005afab80                ksh pause
    331      1 3   0        80   ffffa000049eb1c0                ksh pause
    1134     1 3   0        80   ffffa000049eba00               sshd select
    328      1 3   0        80   ffffa0000597f740               sshd select
    767      1 3   0        80   ffffa00004a1f620                ksh pause
    590      1 2   0         0   ffffa00004a001e0               sshd
    440      1 3   0        80   ffffa00004a00600               sshd select
    693      1 3   0        80   ffffa0000597f320              getty ttyraw
    475      1 3   0        80   ffffa000050bf260              getty ttyraw
    756      1 3   0        80   ffffa00004a00a20              getty ttyraw
    691      1 3   0        80   ffffa000049eb5e0              getty ttyraw
    662      1 3   0        80   ffffa00004870120              getty ttyraw
    594      1 3   0        80   ffffa00005955720               cron nanoslp
    804      1 3   0        80   ffffa00005955b40              inetd kqueue
    698      1 3   0        80   ffffa00004a1f200               qmgr kqueue
    701      1 3   0        80   ffffa000057492e0             pickup kqueue
    638      1 3   0        80   ffffa0000597fb60             master kqueue
    516      1 3   0        80   ffffa00005749700            unbound kqueue
    434      2 3   0        80   ffffa000054bd6c0        xenconsoled netio
    434      1 3   0        80   ffffa00004a1fa40        xenconsoled select
    479      1 3   0        80   ffffa000054ed6e0               tset ttyraw
    380      1 3   0        80   ffffa000054ed2c0          xenstored select
    463      1 3   0        80   ffffa000054bd2a0                ksh pipe_rd
    474      1 3   0        80   ffffa000050e1280          conserver select
    441      1 3   0        80   ffffa000054bdae0          conserver select
    438      2 3   0        80   ffffa000054edb00           sshguard nanoslp
    438      1 3   0        80   ffffa00004efb660           sshguard pipe_rd
    451      1 3   0        80   ffffa000050e1ac0               sshd select
    414      1 3   0        80   ffffa000050bf680              rwhod select
    403      1 3   0        80   ffffa00004efb240             powerd kqueue
    405      1 3   0        80   ffffa000050bfaa0               ntpd pause
    365      1 3   0        80   ffffa00004a5a220       xenwatchdogd nanoslp
    358      1 3   0        80   ffffa00004a5aa60              ipmon select
    152      1 3   0        80   ffffa00004a5a640            syslogd kqueue
    1        1 3   0        80   ffffa00004881160               init wait
    0       54 5   0       200   ffffa00003ed04c0           (zombie)
    0       53 3   0       200   ffffa00005749b20        xen_balloon xen_balloon
    0       52 3   0       200   ffffa00004efba80              ipmi0 ipmi0
    0       51 3   0       200   ffffa00004899180            physiod physiod
    0       50 3   0       200   ffffa000049cd1a0           aiodoned aiodoned
    0       49 3   0       200   ffffa000049cd5c0            ioflush syncer
    0       48 3   0       200   ffffa000049cd9e0           pgdaemon pgdaemon
    0       45 3   0       200   ffffa00004870540               usb7 usbevt
    0       44 3   0       200   ffffa00004870960               usb6 usbevt
    0       43 3   0       200   ffffa0000486f100               usb5 usbevt
    0       42 3   0       200   ffffa0000486f520               usb4 usbevt
    0       41 3   0       200   ffffa0000486f940               usb3 usbevt
    0       40 3   0       200   ffffa000048200e0               usb2 usbevt
    0       39 3   0       200   ffffa00004820500               usb1 usbevt
    0       38 3   0       200   ffffa000048995a0               usb0 usbevt
    0       37 3   0       200   ffffa000048999c0          cryptoret crypto_w
    0       36 3   0       200   ffffa00004881580              unpgc unpgc
    0       35 3   0       200   ffffa000048819a0        vmem_rehash vmem_rehash
    0       34 3   0       200   ffffa00004871140             xenbus rdst
    0       33 3   0       200   ffffa00004871560           xenwatch evtsq
    0       32 3   0       200   ffffa00004871980               mfi0 mfi0
    0       23 3   0       200   ffffa00004820920         usbtask-dr usbtsk
    0       22 3   0       200   ffffa000044d50c0         usbtask-hc usbtsk
    0       21 3   0       200   ffffa000044d54e0           scsibus0 sccomp
    0       20 3   0       200   ffffa000044d5900               bnx1 bnx1
    0       19 3   0       200   ffffa00003ed00a0               bnx0 bnx0
    0       17 3   0       200   ffffa00003ed08e0               ipmi ipmipoll
    0       16 3   0       200   ffffa00003eb0080             sysmon smtaskq
    0       15 3   0       200   ffffa00003eb04a0         pmfsuspend pmfsuspend
    0       14 3   0       200   ffffa00003eb08c0           pmfevent pmfevent
    0       13 3   0       200   ffffa00003a13060         sopendfree sopendfr
    0       12 3   0       200   ffffa00003a13480           nfssilly nfssilly
    0       11 3   0       200   ffffa00003a138a0            cachegc cachegc
    0       10 3   0       200   ffffa00003a11040              vrele vrele
    0        9 3   0       200   ffffa00003a11460             vdrain vdrain
    0        8 3   0       200   ffffa00003a11880          modunload mod_unld
    0        7 3   0       200   ffffa00003a08020            xcall/0 xcall
    0        6 1   0       200   ffffa00003a08440          softser/0
    0        5 1   0       200   ffffa00003a08860          softclk/0
    0        4 1   0       200   ffffa00003a06000          softbio/0
    0        3 1   0       200   ffffa00003a06420          softnet/0
    0        2 1   0       201   ffffa00003a06840             idle/0
    0        1 3   0       200   ffffffff80c69860            swapper uvm
    crash> t/a ffffa00005bea460
    trace: pid 1129 lid 1 at 0xffffa000651c9d08
    sleepq_block() at sleepq_block+0x8e
    xenbus_dev_read() at xenbus_dev_read+0x47
    kernfs_try_fileop() at kernfs_try_fileop+0x56
    VOP_READ() at VOP_READ+0x3c
    vn_read() at vn_read+0x94
    dofileread() at dofileread+0x90
    sys_read() at sys_read+0x5f
    syscall() at syscall+0x9a
    --- syscall (number 3) ---
    7f7ff703c3ba:
    crash> t/a ffffa00005bd4860
    trace: pid 1671 lid 1 at 0xffffa000651b5a38
    sleepq_block() at sleepq_block+0x8e
    turnstile_block() at turnstile_block+0x2c3
    rw_vector_enter() at rw_vector_enter+0x11f
    genfs_lock() at genfs_lock+0x7b
    VOP_LOCK() at VOP_LOCK+0x37
    vn_lock() at vn_lock+0x11
    lookup_once() at lookup_once+0x1f4
    namei_tryemulroot() at namei_tryemulroot+0x612
    namei() at namei+0x43
    fd_nameiat.isra.0() at fd_nameiat.isra.0+0x54
    do_sys_statat() at do_sys_statat+0x72
    sys___stat50() at sys___stat50+0x28
    syscall() at syscall+0x9a
    --- syscall (number 439) ---
    7f7ff710870a:
    crash> t/a ffffa00005bc9840
    trace: pid 1122 lid 1 at 0xffffa000651a0c90
    sleepq_block() at sleepq_block+0x8e
    turnstile_block() at turnstile_block+0x2c3
    rw_vector_enter() at rw_vector_enter+0x11f
    genfs_lock() at genfs_lock+0x7b
    VOP_LOCK() at VOP_LOCK+0x37
    vn_lock() at vn_lock+0x11
    vn_write() at vn_write+0x9b
    dofilewrite() at dofilewrite+0x97
    sys_write() at sys_write+0x5f
    syscall() at syscall+0x9a
    --- syscall (number 4) ---
    7f7ff703c1fa:
    crash> 

I'm guessing the real problem is this xenbus_dev_read() going to sleep
with a lock held.  I don't see where the lock is created though.

I'm not sure why others haven't encountered this deadlock yet.  I'm also
not sure why it didn't happen when these VMs were first configured.
Timing must be everything I guess, though it's been 110% reproducible
on my setup, and across many reboots.

Entirely removing the .iso file from the device config still caused it
to encounter some kind of problem, but this time it didn't deadlock and
require a reboot:

    # xl create /usr/pkg/etc/xen/b2.conf
    Parsing config from /usr/pkg/etc/xen/b2.conf
    libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/0 because of timeout
    libxl: error: libxl_aoutils.c:539:async_exec_timeout: killing execution of /usr/pkg/etc/xen/scripts/block /local/domain/0/backend/vbd/1/5 because of timeout
    libxl: error: libxl_create.c:1255:domcreate_launch_dm: unable to add disk devices
    libxl: error: libxl.c:1575:libxl__destroy_domid: non-existant domain 1
    libxl: error: libxl.c:1534:domain_destroy_callback: unable to destroy guest with domid 1
    libxl: error: libxl.c:1463:domain_destroy_cb: destruction of domain 1 failed

What's odd is that the two other VMs on this machine now start normally.
This problem one happened to be first in the xendomains list, so was
really causing a major headache, but it's the least important of them,
so now I can debug it in relative calm.

What's even more odd is that now the other VMs are running, this problem
one has started A-OK and without a peep.

HOWEVER!

I see in src-current that there have been "recent" changes to
xenbus_dev.c, and they get rid of the errant tsleep() call.

I'd say these changes and any related to them are critical fixes that
should be considered for pull up to netbsd-7.

Thanks!

-- 
					Greg A. Woods <gwoods%acm.org@localhost>

+1 250 762-7675                           RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>     Avoncote Farms <woods%avoncote.ca@localhost>

Attachment: pgpbjaiPiXTAu.pgp
Description: PGP signature



Home | Main Index | Thread Index | Old Index