tech-userlevel archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Odd pigz behavior on build cluster - libpthread bug?



On Sat, Jun 14, 2014 at 09:20:27AM -0400, Thor Lancelot Simon wrote:
> 
> Now we seem to be stuck like this on the master:
> 
> 26269 builds     0    0    25M 1372K CPU/18   130:47 99.02% 99.02% nbpigz
> 21425 builds     0    0    25M 1344K CPU/13   130:31 99.02% 99.02% nbpigz
> 
> This looks very odd to me.  Why has pigz been running so long, and even
> if it's doing something, why's it on so few CPUs -- shouldn't each process
> have a thread per CPU?  It doesn't seem to, the load average is only 4.

OK, here's the deal: it's doing something, but, to me it also looks like
it may be exhibiting pathological libpthread behavior of some kind.

26269      1 nbpigz   EMUL  "netbsd"
26269      1 nbpigz   RET   ___lwp_park50 -1 errno 3 No such process
26269      1 nbpigz   CALL  ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269      1 nbpigz   RET   ___lwp_park50 -1 errno 3 No such process
26269      1 nbpigz   CALL  ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269      1 nbpigz   RET   ___lwp_park50 -1 errno 3 No such process
26269      1 nbpigz   CALL  ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269      1 nbpigz   RET   ___lwp_park50 -1 errno 3 No such process
26269      1 nbpigz   CALL  ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269      1 nbpigz   RET   ___lwp_park50 -1 errno 3 No such process
26269      1 nbpigz   CALL  ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269      1 nbpigz   RET   ___lwp_park50 -1 errno 3 No such process
26269      1 nbpigz   CALL  ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269      1 nbpigz   RET   ___lwp_park50 -1 errno 3 No such process
[repeat 685,651 times]
26269      2 nbpigz   CALL  _lwp_unpark(1,0x7f7ff7e12ac0)
26269      2 nbpigz   RET   _lwp_unpark 0
26269      2 nbpigz   CALL  ___lwp_park50(0,0,0x7f7ff7b030f8,0x7f7ff7b030f8)
26269      1 nbpigz   RET   ___lwp_park50 -1 errno 3 No such process
26269      1 nbpigz   CALL  __sigprocmask14(3,0x7f7fffffc120,0)
26269      1 nbpigz   RET   __sigprocmask14 0
26269      1 nbpigz   CALL  getcontext(0x7f7ff6400a88)
26269      1 nbpigz   RET   getcontext 0
26269      1 nbpigz   CALL  _lwp_create(0x7f7ff6400a88,0x40,0x7f7ff6400180)
26269      1 nbpigz   RET   _lwp_create 0
26269      1 nbpigz   CALL  ___lwp_park50(0,0,0x7f7ff6800030,0x7f7ff6800030)
26269      1 nbpigz   RET   ___lwp_park50 -1 errno 37 Operation already in 
progress
26269      1 nbpigz   CALL  ___lwp_park50(0,0,0x7f7ff6800030,0x7f7ff6800030)
26269      3 nbpigz   CALL  _lwp_ctl(1,0x7f7ff6400208)
26269      3 nbpigz   RET   _lwp_ctl 0
26269      3 nbpigz   CALL  mmap(0,0x1000,3,0x1002,0xffffffff,0,0)
[mmap more stuff]
26269      3 nbpigz   CALL  _lwp_unpark(2,0x7f7ff7b030f8)
 26269      3 nbpigz   RET   _lwp_unpark 0
 26269      2 nbpigz   CSW   stop kernel
 26269      2 nbpigz   CSW   resume kernel
 26269      2 nbpigz   RET   ___lwp_park50 0
 26269      2 nbpigz   CALL  write(1,0x7f7ff7747000,0x1127)
 26269      3 nbpigz   CALL  ___lwp_park50(0,0,0x7f7ff7b03098,0x7f7ff7b03098)
 26269      2 nbpigz   GIO   fd 1 wrote 4088 bytes
[after the write, there is a short burst of repeated calls to lwp_self
 and sigprocmask, then the cycle seems to resume]

WTF?  Ideas most welcome.

Thor


Home | Main Index | Thread Index | Old Index