Current-Users archive


Re: current dom0 panic on domu launch



On 21/10/2009 6:50 AM, Manuel Bouyer wrote:
On Tue, Oct 20, 2009 at 12:49:00PM +1100, Sarton O'Brien wrote:
If there is further information I can provide please let me know. In the
meantime I'm updating xentools, python and expat to see if that helps.
This hasn't helped. The logs report as follows:
Can you see in xenbackendd.log if xenbackendd detects the new device and
calls the scripts? And also if xenbackendd calls the script when a
domU is shut down, to release the devices?
At moments everything appears to be working flawlessly; the logs don't 
seem to be saying much more or less than usual.
I can't reproduce this issue on my system. There may be a race somewhere
which causes a device destroy event to be missed, or a condition in
a backend driver which causes it to fail to detach.
Sorry I haven't had much time to test thoroughly. I thought maybe there 
was something wrong with the update build so I tested a complete/clean 
build from another server.
I tested by starting a domU, stopping a domU ... and all seemed fine up 
until I had around 4 domUs running. Then when stopping, the same problem 
presented itself: domU gone but xvifs/vnds still there.
So there seems to be a magic number for when this occurs. By default I 
start 4 at boot. Actually, I just tested with only one domU and got 
the following:
## dmesg - Start one domU ##
xvif1.0: Ethernet address 00:16:3e:3c:07:01
sysctl_createv: sysctl_create(xvif1.0) returned 22
xvif1.0: could not attach sysctl nodes
xbd backend: attach device vnd0d (size 83886080) for domain 1
xbd backend: attach device sd0a (size 1953525168) for domain 1
xbd backend 0x0 for domain 1 using event channel 17, protocol x86_64-abi
xbd backend 0x1 for domain 1 using event channel 18, protocol x86_64-abi

## dmesg - Stop one domU ##
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 120 rsp_prod 119 rsp_prod_pvt 119 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 121 rsp_prod 120 rsp_prod_pvt 120 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 122 rsp_prod 121 rsp_prod_pvt 121 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 123 rsp_prod 122 rsp_prod_pvt 122 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 124 rsp_prod 123 rsp_prod_pvt 123 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 125 rsp_prod 124 rsp_prod_pvt 124 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 126 rsp_prod 125 rsp_prod_pvt 125 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 127 rsp_prod 126 rsp_prod_pvt 126 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 128 rsp_prod 127 rsp_prod_pvt 127 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 129 rsp_prod 128 rsp_prod_pvt 128 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 130 rsp_prod 129 rsp_prod_pvt 129 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 131 rsp_prod 130 rsp_prod_pvt 130 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 132 rsp_prod 131 rsp_prod_pvt 131 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 133 rsp_prod 132 rsp_prod_pvt 132 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 134 rsp_prod 133 rsp_prod_pvt 133 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 135 rsp_prod 134 rsp_prod_pvt 134 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 136 rsp_prod 135 rsp_prod_pvt 135 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 137 rsp_prod 136 rsp_prod_pvt 136 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 138 rsp_prod 137 rsp_prod_pvt 137 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 139 rsp_prod 138 rsp_prod_pvt 138 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 140 rsp_prod 139 rsp_prod_pvt 139 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 141 rsp_prod 140 rsp_prod_pvt 140 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 142 rsp_prod 141 rsp_prod_pvt 141 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 143 rsp_prod 142 rsp_prod_pvt 142 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 144 rsp_prod 143 rsp_prod_pvt 143 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 145 rsp_prod 144 rsp_prod_pvt 144 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 146 rsp_prod 145 rsp_prod_pvt 145 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 147 rsp_prod 146 rsp_prod_pvt 146 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 148 rsp_prod 147 rsp_prod_pvt 147 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 149 rsp_prod 148 rsp_prod_pvt 148 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 150 rsp_prod 149 rsp_prod_pvt 149 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 151 rsp_prod 150 rsp_prod_pvt 150 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 152 rsp_prod 151 rsp_prod_pvt 151 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 153 rsp_prod 152 rsp_prod_pvt 152 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 154 rsp_prod 153 rsp_prod_pvt 153 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 155 rsp_prod 154 rsp_prod_pvt 154 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 156 rsp_prod 155 rsp_prod_pvt 155 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 157 rsp_prod 156 rsp_prod_pvt 156 i 1

This keeps going over time.
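
[Note: the repeated lines above are easier to compare when the ring counters are pulled out into numbers. The following is a minimal sketch, not part of the original report. It assumes the message format is exactly as pasted above; the names parse_ring_lines and summarize are hypothetical, and reading req_prod minus req_cons as the number of unconsumed requests is an assumption based on the counter names, not something the log states.]

```python
import re

# Matches lines such as:
#   xvif1.0: req_prod 256 req_cons 120 rsp_prod 119 rsp_prod_pvt 119 i 1
RING_RE = re.compile(
    r"^(?P<ifname>xvif\d+\.\d+): req_prod (?P<req_prod>\d+) "
    r"req_cons (?P<req_cons>\d+) rsp_prod (?P<rsp_prod>\d+) "
    r"rsp_prod_pvt (?P<rsp_prod_pvt>\d+)"
)


def parse_ring_lines(text):
    """Return one dict per ring-counter line found in a dmesg excerpt."""
    rows = []
    for line in text.splitlines():
        m = RING_RE.match(line.strip())
        if m is None:
            continue  # skips the GNTTABOP_transfer lines and anything else
        row = {k: int(v) for k, v in m.groupdict().items() if k != "ifname"}
        row["ifname"] = m.group("ifname")
        rows.append(row)
    return rows


def summarize(rows):
    """Summarize how the counters move between the first and last sample."""
    if not rows:
        return None
    first, last = rows[0], rows[-1]
    return {
        "samples": len(rows),
        "req_prod_constant": all(r["req_prod"] == first["req_prod"] for r in rows),
        "req_cons_advance": last["req_cons"] - first["req_cons"],
        # Assumption: req_prod - req_cons is the count of unconsumed requests.
        "pending_first": first["req_prod"] - first["req_cons"],
        "pending_last": last["req_prod"] - last["req_cons"],
    }


# First two message pairs from the dmesg output above.
SAMPLE = """\
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 120 rsp_prod 119 rsp_prod_pvt 119 i 1
xvif1.0 GNTTABOP_transfer[0] -1
xvif1.0: req_prod 256 req_cons 121 rsp_prod 120 rsp_prod_pvt 120 i 1
"""

if __name__ == "__main__":
    print(summarize(parse_ring_lines(SAMPLE)))
```

[Run against the full excerpt, this would show req_prod fixed at 256 while req_cons moves up by one per message.]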

## ifconfig -a ##
bge0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        capabilities=3f80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx>
        enabled=0
        address: 00:13:72:18:02:ad
        media: Ethernet autoselect (100baseTX full-duplex,flowcontrol,rxpause,txpause)
        status: active
        inet 192.168.210.10 netmask 0xffffff00 broadcast 192.168.210.255
        inet6 fe80::213:72ff:fe18:2ad%bge0 prefixlen 64 scopeid 0x1
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 33648
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
bridge0: flags=41<UP,RUNNING> mtu 1500
xvif1.0: flags=8963<UP,BROADCAST,NOTRAILERS,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        capabilities=2800<TCP4CSUM_Tx,UDP4CSUM_Tx>
        enabled=0
        address: 00:16:3e:3c:07:01
        inet6 fe80::216:3eff:fe3c:701%xvif1.0 prefixlen 64 scopeid 0x4

## vnconfig -l ##
vnd0: /usr (/dev/raid0e) inode 11043390
vnd1: not in use
vnd2: not in use
vnd3: not in use
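
[Note: a leftover backend interface such as xvif1.0 can be spotted mechanically by comparing interface names against the domains still running. The following is a minimal sketch, not part of the original report. It assumes the number after "xvif" is the domain ID, which matches the dmesg output pairing xvif1.0 with "domain 1" but is not otherwise confirmed here. The function name stale_xvifs is hypothetical, and the set of running domain IDs is supplied by hand (in practice it might be read from "xm list").]

```python
import re

# Interface header lines start at column 0, e.g. "xvif1.0: flags=8963<...>".
# Assumption: the first number is the domain ID, the second a per-domain index.
XVIF_RE = re.compile(r"^(xvif(\d+)\.(\d+)):", re.MULTILINE)


def stale_xvifs(ifconfig_text, running_domids):
    """Return xvif interface names whose domain ID is not in running_domids."""
    stale = []
    for m in XVIF_RE.finditer(ifconfig_text):
        name, domid = m.group(1), int(m.group(2))
        if domid not in running_domids:
            stale.append(name)
    return stale


# Interface header lines taken from the ifconfig -a output above.
SAMPLE = """\
bge0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 33648
bridge0: flags=41<UP,RUNNING> mtu 1500
xvif1.0: flags=8963<UP,BROADCAST,NOTRAILERS,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        address: 00:16:3e:3c:07:01
"""

if __name__ == "__main__":
    # Only dom0 (ID 0) is assumed to be running after the domU was stopped.
    print(stale_xvifs(SAMPLE, {0}))
```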

Trying to start the domU again yields:

# xm create spike -c
Using config file "/usr/pkg/etc/xen/spike".

And nothing else. Still not much in the logs but I haven't had time to monitor them properly during testing, time was limited :).
So sometimes it works, sometimes it doesn't. I'm not sure what is 
actually triggering it :(
The domU that just triggered it is quite network intensive (nfs, yp/nis, 
samba, radius, ldap) and has a USB drive passed to it from dom0 but 
other than that they are all pretty much identical (all current). TBH, I 
think it is triggered more with this domU as I was avoiding it when 
testing earlier ... and it may have been the last one I booted.
To be thorough I just tested booting and shutting down a lightweight 
domU, all good. I had to CTRL-C past ypbind and mounting NFS due to it 
not being available. dmesg is clean.
Booting the network intensive one mentioned above and shutting it 
down relatively quickly worked ... though it wasn't clean:
xvif3.0 GNTTABOP_transfer[0] -1
xvif3.0: req_prod 383 req_cons 137 rsp_prod 136 rsp_prod_pvt 136 i 1
xbd backend: detach device vnd0d for domain 3
xbd backend: detach device sd0a for domain 3
xvif3.0: disconnecting

I'm willing to bet that if I had let it run for a while, the same problem from before would have occurred.
Seems to occur with time and network usage. Possibly more domU running 
helps trigger it.
The logs don't seem to contain anything other than what they usually do, 
they don't seem to be very verbose. Should I be booting a debug xen kernel?
Hopefully some of this info helps. :)

Sarton

