Subject: Re: argh, spoke too soon
To: None <port-sparc@NetBSD.ORG>
From: der Mouse <mouse@Collatz.McRCIM.McGill.EDU>
List: port-sparc
Date: 02/28/1996 13:40:32
>> Is there anything useful I could do with the coredump to help anyone
>> figure out what's wrong?
> [pk explains about "-O paddr" to ps, and the proc command in kgdb]

Well, it hung again overnight, and at suspiciously close to the same
time as before (about 23:48; no cron jobs are visible then, and nobody
but me can log in to the machine, so it wasn't a user).  I used pk's recipe on
both the kernel coredump from last time and the live kernel from this
time (I left the machine up, with the make build hung, to poke around).

Here are the stack backtraces.  In the live system, make is what hung;
in the kernel coredump, it's cc1 that hung.  It does strike me as
interesting and probably significant that the nine innermost frames
(#0 through #8) of the two stacks are identical, addresses included.

    live kernel                    coredump

#0  0xf8029bd8 in mi_switch ()     #0  0xf8029bd8 in mi_switch ()
#1  0xf802944c in bpendtsleep ()   #1  0xf802944c in bpendtsleep ()
#2  0xf8044c3c in vflushbuf ()     #2  0xf8044c3c in vflushbuf ()
#3  0xf80b4740 in ffs_fsync ()     #3  0xf80b4740 in ffs_fsync ()
#4  0xf80448b8 in vinvalbuf ()     #4  0xf80448b8 in vinvalbuf ()
#5  0xf804559c in vclean ()        #5  0xf804559c in vclean ()
#6  0xf8045828 in vgone ()         #6  0xf8045828 in vgone ()
#7  0xf8044654 in getnewvnode ()   #7  0xf8044654 in getnewvnode ()
#8  0xf80b391c in ffs_vget ()      #8  0xf80b391c in ffs_vget ()
#9  0xf80bdd04 in ufs_lookup ()    #9  0xf80adc54 in ffs_valloc ()
#10 0xf8043a14 in lookup ()        #10 0xf80c3680 in ufs_makeinode ()
#11 0xf8043468 in namei ()         #11 0xf80c0644 in ufs_create ()
#12 0xf804855c in sys_stat ()      #12 0xf8049ff0 in vn_open ()
#13 0xf8107520 in syscall ()       #13 0xf80476ec in sys_open ()
#14 0xf8006714 in trapbase ()      #14 0xf8107520 in syscall ()
#15 0x72d8 in ?? ()                #15 0xf8006714 in trapbase ()
#16 0xda20 in ?? ()
#17 0x44b4 in ?? ()
[user-land frames 18..28 omitted]
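
The overlap between the two traces can be checked mechanically rather
than by eye.  A minimal Python sketch, using only the function names
from the backtraces above (the helper name common_innermost is made up
for illustration), prints how many innermost frames the two stacks
share and the last function they have in common:

```python
# Function names copied from the two backtraces, innermost frame (#0) first.
# Only the kernel frames are listed; user-land frames are left out.
shared_head = ["mi_switch", "bpendtsleep", "vflushbuf", "ffs_fsync",
               "vinvalbuf", "vclean", "vgone", "getnewvnode", "ffs_vget"]
live = shared_head + ["ufs_lookup", "lookup", "namei", "sys_stat",
                      "syscall", "trapbase"]
core = shared_head + ["ffs_valloc", "ufs_makeinode", "ufs_create",
                      "vn_open", "sys_open", "syscall", "trapbase"]


def common_innermost(a, b):
    """Return the frames a and b share, counting from frame #0 outward."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


shared = common_innermost(live, core)
print(len(shared), shared[-1])
```

The point where the traces diverge is the interesting part: one
process gets to ffs_vget via a lookup (stat), the other via file
creation (open), so the common path starts at the request for a vnode.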

"ps ax -O paddr,flags,wchan" on the live system shows the hung process
as

  PID  PADDR       F WCHAN  TT  STAT      TIME COMMAND
 1086 62ea00    4000 vflush ??  DW     0:00.64 make _THISDIR_ all 

F=4000 is P_TIMEOUT and nothing else.  "vflush" is presumably a
truncation of the "vflushbuf" specified in the tsleep call in
vflushbuf().

In passing, is there any way to tell ps to widen the columns, so it
doesn't truncate things like PADDR and WCHAN?  Specifying a long string
for the header widens the field but doesn't stop the truncation:

# ps ax -O paddr=PROCADDESS,flags,wchan=WAITCHANNEL
  PID PROCADDESS       F WAITCHANNEL TT  STAT      TIME COMMAND
[...]
 1086     62ea00    4000 vflushb&    ??  DW     0:00.64 make _THISDIR_ all 

(indeed, the & appears to be a garbage character; in another run it was
S instead, and on each run the same character appeared for all wait
channels that got truncated that run).  I suppose I should send-pr
this.
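
One reading of the two outputs above is that the name gets cut twice:
once somewhere before ps sees it, and once more by the column width.
The Python sketch below models only that guess.  Both limits (7
characters upstream, a default column width of 6) are assumptions
inferred from the output shown here, not taken from the ps or kernel
source, and the function name shown_wchan is hypothetical.  It does not
try to model the stray trailing character.

```python
# Assumption: the wait message is cut to 7 characters before ps formats it.
SOURCE_LIMIT = 7
# Assumption: the WCHAN column is 6 characters wide unless the header is longer.
DEFAULT_WIDTH = 6


def shown_wchan(wmesg, header):
    """Model what ps would display for a wait message under a given header."""
    width = max(len(header), DEFAULT_WIDTH)
    stored = wmesg[:SOURCE_LIMIT]   # first cut, independent of the header
    return stored[:width]           # second cut, by the column width


print(shown_wchan("vflushbuf", "WCHAN"))
print(shown_wchan("vflushbuf", "WAITCHANNEL"))
```

If this model is right, no header string can recover the full name,
since the first cut happens regardless of how wide the column is.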

I just now told the machine to reboot, to clear the stuck process
preparatory to building a kernel with -g, and it's hung at the
"syncing disks... " point.  I guess I'll have to nerve-pinch it.  Once
I have a debugging kernel, I'll start poking around and see how much I
can find out about what vflushbuf is doing, and why it's getting stuck.

As Calvin once said, "Further bulletins as events warrant". :-)

					der Mouse

			    mouse@collatz.mcrcim.mcgill.edu