On 09/11/16 21:16, Alaric Snell-Pym wrote:
> On 17/10/16 18:43, Manuel Bouyer wrote:
>> this means that they're all stuck on a lock.
>> The first thing would be to use ddb (or crash(8), if this works on Xen)
>> to find where in the kernel these processes are stuck.
>>
>> The next step would be to which which process is actaully holding the lock
>> (this one is probably not in tstile state)
>>
>
>
> Thanks, Manuel... looks like crash(8) works, I'll check that out next
> time it happens! Does the lock show as the "wait channel" as shown by
> ps/w? How do I find what process currently holds it? And then I guess I
> do a "trace/t" on the pids involved to see what's what? I'm a little
> green with ddb.
>
It's happened again, during an xl create! I'm not sure how to interpret
these results, but here's what I was doing:
Pid 4406 is my hanging "xl create".
-+= 00001 root init
|--= 00526 root xl create /domU/domUs/vm2/xl.cfg
|--= 00892 root /usr/sbin/syslogd -s
|-+= 01180 root xl create /domU/domUs/vm4/xl.cfg
| \--- 02832 root xl create /domU/domUs/vm4/xl.cfg
|--- 01380 root /usr/pkg/sbin/xenstored --pid-file /var/run/xenstored.pid
|--- 01397 root /usr/pkg/sbin/xenconsoled
|--= 01492 root /usr/sbin/powerd
|--= 01557 root xl create /domU/domUs/vm3/xl.cfg
|--= 01637 root /usr/sbin/ntpd -p /var/run/ntpd.pid
|-+= 01673 root /usr/sbin/sshd
| |-+= 00379 root sshd: alaric [priv]
| | \-+- 00301 alaric sshd: alaric@pts/1 (sshd)
| | \-+= 00250 alaric -sh
| | \-+= 02415 root su -
| | \-+= 02493 root -sh
| | \--= 05496 root /usr/pkg/libexec/xen/bin/xenconsole --num 0
--type pv 3
| |-+= 02815 root sshd: alaric [priv]
| | \-+- 00423 alaric sshd: alaric@pts/0 (sshd)
| | \-+= 02096 alaric -sh
| | \-+= 00938 root su -
| | \-+= 00210 root -sh
| | \--= 04406 root xl create -c /domU/domUs/vm4/xl.cfg
| \-+= 08541 root sshd: alaric [priv]
| \-+- 05855 alaric sshd: alaric@pts/10 (sshd)
| \-+= 06627 alaric -sh
| \-+= 05133 root su -
| \-+= 07688 root -sh
| \-+= 16294 root /usr/pkg/bin/proctree
| \-+- 09586 root sh -c ps -axwwo user,pid,ppid,pgid,command
| \--- 17305 root ps -axwwo user,pid,ppid,pgid,command
|--= 01931 root /usr/sbin/cron
|-+= 02060 root /usr/libexec/postfix/master -w
| |--- 01779 postfix qmgr -l -t unix -u
| \--- 07732 postfix pickup -l -t unix -u
|--= 02089 root xl create /domU/domUs/vm5/xl.cfg
|--= 02206 root /usr/sbin/inetd -l
|--= 02475 root xl create /domU/domUs/vm8/xl.cfg
|--= 02511 root xl create /domU/domUs/vm1/xl.cfg
|--= 02630 root xl create /domU/domUs/vm6/xl.cfg
|--- 03217 root xenstore-read /local/domain/0/backend/vbd/4/3/params
|--- 04535 root xenstore-read /local/domain/0/backend/vbd/7/1/params
|--- 05442 root xenstore-rm /local/domain/0/backend/vif/4/0
|--= 05978 root xl create /domU/domUs/vm7/xl.cfg
|--- 08280 root xenstore-rm /local/domain/0/backend/vif/7/1
|--= 00366 root /usr/libexec/getty Pc console
|--= 02673 root /usr/libexec/getty Pc ttyE1
|--= 02617 root /usr/libexec/getty Pc ttyE2
\--= 00259 root /usr/libexec/getty Pc ttyE3
Here's what crash has to say:
crash> ps/a
PID COMMAND STRUCT PROC * UAREA * VMSPACE/VM_MAP
16638 crash ffffa00001feea70 ffffa0000e508000 ffffa00001ef88b8
7732 pickup ffffa00001fee030 ffffa0000e516000 ffffa00001ef8748
7688 sh ffffa00001560a58 ffffa0000e429000 ffffa00001c9d460
5133 su ffffa000020bea78 ffffa0000e377000 ffffa00001ef8188
6627 sh ffffa000020bed08 ffffa0000de60000 ffffa00002ae3d10
5855 sshd ffffa00001e5acf8 ffffa0000e519000 ffffa00001ef82f8
8541 sshd ffffa00001feed00 ffffa0000e4ad000 ffffa00001c9d5d0
4406 xl ffffa00001fee550 ffffa0000e505000 ffffa00001c9d8b0
2832 xl ffffa00001cff2b0 ffffa0000e41b000 ffffa00001ef85d8
5442 xenstore-r ffffa00001fee2c0 ffffa0000dd44000 ffffa00001ef8468
3217 xenstore-r ffffa000020be7e8 ffffa0000fff0000 ffffa00002ae35e0
5496 xenconsole ffffa00001560018 ffffa0000e314000 ffffa00001c9dd00
5978 xl ffffa00001cff7d0 ffffa0000e4be000 ffffa00001ef8018
8280 xenstore-r ffffa000020be558 ffffa0000e4eb000 ffffa00000fd4cf8
4535 xenstore-r ffffa000020be2c8 ffffa0000dfea000 ffffa00002ae3750
2475 xl ffffa00001fee7e0 ffffa0000e4e0000 ffffa00001c9d180
2630 xl ffffa00001e5a028 ffffa0000e487000 ffffa00001ef8a28
2089 xl ffffa00001e5aa68 ffffa0000e45a000 ffffa00001ef8b98
1180 xl ffffa00001e5a2b8 ffffa0000e422000 ffffa00001ef8d08
1557 xl ffffa00001e5a7d8 ffffa0000e3ca000 ffffa00001ef8e78
526 xl ffffa00001e5a548 ffffa0000e399000 ffffa00001c9d010
2511 xl ffffa00001cff020 ffffa0000e317000 ffffa00001c9d2f0
2493 sh ffffa00000f7e7b8 ffffa0000ddb0000 ffffa00000e455c0
2415 su ffffa00000e40520 ffffa0000dd1d000 ffffa00000e45a10
250 sh ffffa00000e407b0 ffffa0000dcd1000 ffffa00000e45b80
301 sshd ffffa00000f7ea48 ffffa0000dd9d000 ffffa00000e45450
379 sshd ffffa00001cff540 ffffa0000e30d000 ffffa00000e458a0
210 sh ffffa00000e40290 ffffa0000dd94000 ffffa00001c9d740
938 su ffffa000015602a8 ffffa0000e283000 ffffa00001c9db90
2096 sh ffffa00001cffcf0 ffffa0000e300000 ffffa00000fd4e68
423 sshd ffffa00001560538 ffffa0000e280000 ffffa00000fd4008
2815 sshd ffffa00000f7e528 ffffa0000ddb7000 ffffa00000e45170
366 getty ffffa00000e40a40 ffffa0000dcce000 ffffa00000e45cf0
259 getty ffffa0000130cce0 ffffa0000ddd6000 ffffa00000fd45c8
2617 getty ffffa00000f7ecd8 ffffa0000dd9a000 ffffa00000e452e0
2673 getty ffffa00000e40000 ffffa0000dd97000 ffffa00000e45730
1931 cron ffffa00001cffa60 ffffa0000e303000 ffffa00001c9da20
2206 inetd ffffa00000f7e008 ffffa0000ddd9000 ffffa00000fd4178
1779 qmgr ffffa0000130c7c0 ffffa0000dfcc000 ffffa00000fd42e8
2060 master ffffa000015607c8 ffffa0000dfed000 ffffa00001c9de70
1673 sshd ffffa00001560ce8 ffffa0000dfe7000 ffffa00000fd4a18
1492 powerd ffffa0000130c010 ffffa0000dfe4000 ffffa00000fd4458
1637 ntpd ffffa0000130c530 ffffa0000dfc9000 ffffa00000fd48a8
1397 xenconsole ffffa0000130c2a0 ffffa0000dd1a000 ffffa00000fd4738
1380 xenstored ffffa00000f7e298 ffffa0000ddba000 ffffa00000e45000
892 syslogd ffffa0000130ca50 ffffa0000ddd3000 ffffa00000fd4b88
1 init ffffa00000e40cd0 ffffa0000dcdd000 ffffa00000e45e60
0 system ffffffff80c60ea0 ffffa0000de6d000 ffffffff80cacf60
I took a look at the stacks of the ones that looked xen-related:
4406 1 xl netbsd 43 tstile ffffa00001a59c30
trace: pid 4406 lid 1 at 0xffffa0000e507960
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
kernfs_root() at kernfs_root+0x41
lookup_once() at lookup_once+0x26e
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
vn_open() at vn_open+0x94
do_open() at do_open+0x111
do_sys_openat() at do_sys_openat+0x68
sys_open() at sys_open+0x24
syscall() at syscall+0x9a
--- syscall (number 5) ---
7f7ff603c40a:
2832 1 xl netbsd 43 tstile ffffa00001a59c30
trace: pid 2832 lid 1 at 0xffffa0000e41d960
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
kernfs_root() at kernfs_root+0x41
lookup_once() at lookup_once+0x26e
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
vn_open() at vn_open+0x94
do_open() at do_open+0x111
do_sys_openat() at do_sys_openat+0x68
sys_open() at sys_open+0x24
syscall() at syscall+0x9a
--- syscall (number 5) ---
7f7ff603c40a:
5442 1 xenstore-rm netbsd 43 tstile ffffa00001e263e8
trace: pid 5442 lid 1 at 0xffffa0000dd46a38
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
lookup_once() at lookup_once+0x1f4
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
fd_nameiat.isra.0() at fd_nameiat.isra.0+0x54
do_sys_statat() at do_sys_statat+0x72
sys___stat50() at sys___stat50+0x28
syscall() at syscall+0x9a
--- syscall (number 439) ---
7f7ff710809a:
3217 1 xenstore-read netbsd 43 tstile ffffa00001868dd8
trace: pid 3217 lid 1 at 0xffffa0000fff2a38
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
lookup_once() at lookup_once+0x1f4
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
fd_nameiat.isra.0() at fd_nameiat.isra.0+0x54
do_sys_statat() at do_sys_statat+0x72
sys___stat50() at sys___stat50+0x28
syscall() at syscall+0x9a
--- syscall (number 439) ---
7f7ff710809a:
5496 2 xenconsole netbsd 43 netio ffffa00001e77f40
5496 1 xenconsole netbsd 43 select ffffa000006ed050
trace: pid 5496 lid 2 at 0xffffa0000e316cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff6c3c38a:
5978 2 xl netbsd 43 netio ffffa000020a1628
5978 1 xl netbsd 43 select ffffa000006ed050
trace: pid 5978 lid 2 at 0xffffa0000e4c0cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:
8280 1 xenstore-rm netbsd 43 xbrd ffffa0000079e000
trace: pid 8280 lid 1 at 0xffffa0000e4edd08
sleepq_block() at sleepq_block+0x8e
xenbus_dev_read() at xenbus_dev_read+0x47
kernfs_try_fileop() at kernfs_try_fileop+0x56
VOP_READ() at VOP_READ+0x3c
vn_read() at vn_read+0x94
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff703c38a:
4535 1 xenstore-read netbsd 43 tstile ffffa00001868dd8
trace: pid 4535 lid 1 at 0xffffa0000dfecc90
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
vn_write() at vn_write+0x9b
dofilewrite() at dofilewrite+0x97
sys_write() at sys_write+0x5f
syscall() at syscall+0x9a
--- syscall (number 4) ---
7f7ff703c1ca:
2475 2 xl netbsd 43 netio ffffa0000203b870
2475 1 xl netbsd 43 select ffffa000006ed050
trace: pid 2475 lid 2 at 0xffffa0000e4e2cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:
2630 2 xl netbsd 43 netio ffffa00001fe2628
2630 1 xl netbsd 43 select ffffa000006ed050
trace: pid 2630 lid 2 at 0xffffa0000e489cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:
2089 2 xl netbsd 43 netio ffffa00001fe2f48
2089 1 xl netbsd 43 select ffffa000006ed050
trace: pid 2089 lid 2 at 0xffffa0000e45ccd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:
1180 2 xl netbsd 43 netio ffffa00001f0acf8
1180 1 xl netbsd 43 select ffffa000006ed050
trace: pid 1180 lid 2 at 0xffffa0000e424cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:
1557 2 xl netbsd 43 netio ffffa00001ed5198
1557 1 xl netbsd 43 select ffffa000006ed050
trace: pid 1557 lid 2 at 0xffffa0000e3cccd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:
526 2 xl netbsd 43 netio ffffa00001e77190
526 1 xl netbsd 43 select ffffa000006ed050
trace: pid 526 lid 2 at 0xffffa0000e39bcd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:
2511 2 xl netbsd 43 netio ffffa00001cd9190
2511 1 xl netbsd 43 select ffffa000006ed050
trace: pid 2511 lid 2 at 0xffffa0000e319cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:
1397 2 xenconsoled netbsd 43 netio ffffa000013fc3d8
1397 1 xenconsoled netbsd 43 select ffffa000006ed050
trace: pid 1397 lid 2 at 0xffffa0000dd1ccd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff643c38a:
1380 1 xenstored netbsd 43 select ffffa000006ed050
trace: pid 1380 lid 1 at 0xffffa0000ddbcc58
sleepq_block() at sleepq_block+0x8e
sel_do_scan() at sel_do_scan+0x4c4
pollcommon() at pollcommon+0x12c
sys_poll() at sys_poll+0x82
syscall() at syscall+0x9a
--- syscall (number 209) ---
7f7ff703c3ea:
Does any of that point to what the problem might be? How can I tell what
process holds the lock we're queuing up on? To my untrained eye, it
looks like some kind of filesystem lock, but I'm not familiar with
NetBSD's locks...
I'm afraid I'm going to need to reboot this machine ("vm4" is my
database server, so my blog's down until it's back), so I won't be able
to try anything else until it happens again :-)
Thanks,
--
Alaric Snell-Pym
http://www.snell-pym.org.uk/alaric/
Attachment:
signature.asc
Description: OpenPGP digital signature