Subject: deadlock in getnewb? (Re: running bonnie on -current/VAX)
To: current-users <current-users@netbsd.org>
From: Kazushi (Jam) Marukawa <jam@pobox.com>
List: current-users
Date: 03/07/2001 04:13:51
Hi,

This time, I experienced some deadlock like situation while
I was writing large block data to the SCSI disk.  So, I'm
reporting this.  I'm using serial console and it was off
when I got this problem, so I don't have console messages.

My machine is made of Celeron 366, Asus P3V4X, and 512MB
memory.  The version of kernel is:

NetBSD 1.5S (sou) #7: Sat Feb 24 05:21:30 CST 2001

First, I noticed that all programs I tried to execute hangs.
I was using screen, so I could switch among screens, but no
more screens nor new programs.  For example, if I typed "ls"
nothing happened and ^C, ^\, and ^Z didn't work.  Then,
screend also hangs since it called getnewb.  However, still
I can ping and send packets through nat on this machine.

I waited about 30 minutes with a hope it would recover.
However, it didn't.  So, I sent break through serial console
and got trace, ps, and uvmexp.  I had more information, but
I don't know what those information means.  So, this time, I
just put some of them at the bottom of this message.  If you
have an interest on this and want to see all log or the
result of some specific commands, please ask me so.  Or
should I send-pr about this?  Thanks.

Oh, and one more thing.  What I was doing is "dd
if=/dev/zero of=/dev/sd0 count=XXXX bs=8192" on a SCSI HDD
bigger than 40G.  I though I can zero-fill a hard drive in
this way before I sell this...

db> trace
cpu_Debugger(c0f2aa80,e5053560,e5053560,e5092e5c,0) at cpu_Debugger+0x4
comintr(c0f42c00) at comintr+0xcd
Xintr4() at Xintr4+0x70
--- interrupt ---
idle(e5053560) at idle+0x1b
bpendtsleep(e505495c,128,c0486fec,0,0) at bpendtsleep
sigsuspend1(e5053560,e5092f50,e5092f80,0,0) at sigsuspend1+0xf0
sys___sigsuspend14(e5053560,e5092f88,e5092f80) at sys___sigsuspend14+0x36
syscall_plain(1f,bfbf001f,bfbfdc69,0,bfbfdc40) at syscall_plain+0x98
db> c
Stopped at      cpu_Debugger+0x4:       leave
db> trace
cpu_Debugger(c0f2aa80,e50a4abc,e50a4abc,e50c8da8,0) at cpu_Debugger+0x4
comintr(c0f42c00) at comintr+0xcd
Xintr4() at Xintr4+0x70
--- interrupt ---
idle(e50a4abc) at idle+0x1b
bpendtsleep(c05e710c,118,c048954d,65,0) at bpendtsleep
sys_select(e50a4abc,e50c8f88,e50c8f80) at sys_select+0x309
syscall_plain(1f,1f,0,8,bfbfdc60) at syscall_plain+0x98
db> ps
 PID             PPID       PGRP        UID S   FLAGS          COMMAND    WAIT
 23132            193        193          0 3     0x4             sshd getnewb
 23131           9668      23130          0 3  0x4006             less getnewb
 23130           9668      23130          0 3  0x4006               ps getnewb
 23129           9560       9560       1000 3  0x4006        fetchmail getnewb
 23128          21992      21992       1000 3  0x4006             perl getnewb
 23105            277      23104       1000 3  0x4006             perl getnewb
 23038            214        214      32767 3   0x184            httpd  netcon
 23036            214        214      32767 3   0x184            httpd  netcon
 23035            214        214      32767 3   0x184            httpd  netcon
 22994            214        214      32767 3   0x184            httpd  netcon
 22971            214        214      32767 3   0x104            httpd getnewb
 22962            214        214      32767 3   0x184            htd  netcon
 22930            214        214      32767 3   0x184            httpd  netcon
 22859           9668      22859          0 4  0x5002               vi
 21992            325      21992       1000 3  0x4086               sh    wait
 21106            266      21106          0 3  0x4006               dd getnewb
 9668             270       9668          0 3  0x5086              zsh   pause
 9560             264       9560       1000 3  0x5086             mush    wait
 1163            1162       1161          0 4  0x4002             less
 1162            1161       1161          0 4  0x4082               sh
 1161             266       1161          0 4  0x5082              man
 325              263        325       1000 3  0x4082              zsh   pause
 270              263        270       1000 3  0x4082              zsh   pause
 266              265        266          0 3  0x4082              zsh   pause
 265              263        265       1000 3  0x4082              zsh   pause
 263                1        263       1000 3   0x104     screen-3.9.8 getnewb
 238                1          1          0 3  0x4004            getty  vnlock
 237                1        237          0 3  0x4006            getty getnewb
 236              206        236      32767 3  0x4184           pinger  select
 231                1        231          0 3     0x4            inetd  vnlock
 224                1        224          0 3     0x4         sendmail  vnlock
 214                1        214          0 3    0x84            httpd  select
 211              206        211      32767 3  0x4080          unlinkd   netio
 206              202          7          0 3  0x4106            squid getnewb
 202                1          7          0 3  0x4082               sh    wait
 193                1        193          0 3     0x4             sshd  vnlock
 12                 0          0          0 3 0x20204             raid rfwcond
 6                  0          0          0 3 0x20204         aiodoned aiodone
 5                  0          0          0 3 0x20204          ioflush getnewb
 4                  0          0          0 3 0x20204           reaper  reaper
 3                  0          0          0 3 0x20204       pagedaemon pgdaemo
 2                  0          0          0 3 0x20204             usb0  usbevt
 1                  0          1          0 3  0x4080             init    wait
 0                 -1          0          0 3 0x20204          swapper schedul
db> kill CE    <- squid in newcb
db> kill 48B   <- less in normal status, but I cannot kill either
db> show uvmexp
Current UVM status:
  pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
  127787 VM pages: 78530 active, 36887 inactive, 973 wired, 140 free
    14843 anon, 101138 vnode, 0 vtext
  freemin=64, free-target=85, inactive-target=38518, wired-max=42595
  faults=6992210, traps=85475941, intrs=83588156, ctxswitch=59582042
  softint=78595794, syscalls=292274228, swapins=1434, swapouts=1467
  fault counts:
    noram=3, noanon=0, pgwait=0, pgrele=0
    ok relocks(total)=25163(25163), anget(retrys)=613856(3), amapcopy=241944
    neighbor anon/obj pg=602787/3113099, gets(lock/unlock)=941513/25164
    cases: anon=416621, anoncow=197235, obj=819333, prcopy=122176, przero=219795
9
  daemon and swap counts:
    woke=797, revs=797, scans=5326460, obscans=4571110, anscans=3436
    busy=0, freed=4571111, reactivate=69923, deactivate=5554947
    pageouts=222, pending=222, nswget=3
    nswapdev=1, nanon=271923, nanonneeded=271923freeanon=257080
    swpages=152451, swpginuse=879, swpgonly=0 paging=0
  kernel pointers:
    objs(kern/kmem/mb)=0xc05cdd40/0xc05cde50/0xc05cde68
db> sync
syncing disks... 
(waited more than 5 minutes, no disk I/O, send break, send
 break...  no response.  reset)

Regards,
-- Kazushi

   On Jan 29,  7:01, Ken Wellsch wrote:
   > Subject: Re: running bonnie on -current/VAX
   > Tom Ivar Helbekkmo wrote:
   > > Tom Ivar Helbekkmo <tih@kpnQwest.no> writes:
   > > > Ken Wellsch <kwellsch@tampabay.rr.com> writes:
   > > > > Writing with putc()...^C^C^C^C^C^C^C^Z^Z^Z^Z^Z^Z^Z^\^\^\  (no affect)
   > > > > load: 6.38  cmd: bonnie 314 [uvn_fp1] 15.94u 7.80s 0% 296k
   > > > > load: 6.43  cmd: bonnie 314 [uvn_fp1] 15.93u 7.81s 0% 296k
   > > > > [...]
   > > > > with every new ^T I type, the load climbs...
   > > > 
   > > > This might be related to a problem I see [...] just like the above,
   > > > except that the state reported by ^T is "[getnewbuf]".
   > > 
   > > It's still there in -current (at least on the Sparc), and it only
   > > occurs during large scale writing to partitions that are mounted with
   > > the softdep option.
   > 
   > Good point - I've started enabling softdep on /usr to aid in disk performance
   > on my slower systems and had it enabled on the VAX that produced the
   > issue mentioned above.