tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: ioflush kernel thread chewing CPU time

Andrew Doran wrote:

> On Thu, Mar 19, 2009 at 10:21:49PM +1100, Simon Burge wrote:
> > I've recently upgraded my dual CPU i386 laptop from a September 2008
> > -current kernel to a netbsd-5 kernel from March 12.  I am using WAPBL on
> > the four ffs filesystems.
> What's the file system configuration? Do you have noatime, etc?
> UFS1 or UFS1? Any especially active applications?

All filesystems are UFS2, no extra mount options line noatime.  Other
than the ioflush thread, top says the system is completely idle (except
for top itself occassionaly).  There is a firefox running but it is
idle.  A handful of xterms as well.  The first time this happened, I'd
killed pretty much every process I could while still being logged in and
was still seeing the problem.

> > After a couple of days, I noticed the system
> > was really sluggish and found the ioflush kernel thread was using a
> > lot of CPU time.  I've rebooted, and then another fews days later same
> > issue.
> What happens if you type `sync'?

No effect..  There's no disk IO when I try this ssh'd in to the laptop,
and top continues to show ioflush using about the same amount of CPU time.

> Can you try a gprof kernel?  If you have
> concurrent activity running, lockstat can sometimes give you a picture of
> what is happening.

No real activity at all.  lockstat isn't showing anything of interest:

   euclid 25> lockstat sleep 30
   Elapsed time: 30.03 seconds.

   -- Adaptive mutex spin

   Total%  Count   Time/ms          Lock                       Caller
   ------ ------- --------- ---------------------- ------------------------
    67.84       3      0.09 00000000cc231f1c       uvm_fault_internal+e97
    19.26      25      0.03 00000000d0352d00       <all>
    15.95      21      0.02 00000000d0352d00       sopoll+1f
     2.93       3      0.00 00000000d0352d00       soreceive+a0c
     0.39       1      0.00 00000000d0352d00       soreceive+ab
     7.85       1      0.01 00000000e569a5c4       uvm_fault_internal+3e9
     4.03       3      0.01 xc_lock                cv_wait+137
     1.01       3      0.00 uvm_pageqlock          <all>
     0.65       2      0.00 uvm_pageqlock          uvm_fault_internal+1170
     0.36       1      0.00 uvm_pageqlock          uvm_fault_internal+1eac

   -- Spin mutex spin

   Total%  Count   Time/ms          Lock                       Caller
   ------ ------- --------- ---------------------- ------------------------
   100.00       2      0.00 00000000cad44d80       mi_switch+c6

   -- RW lock spin

   Total%  Count   Time/ms          Lock                       Caller
   ------ ------- --------- ---------------------- ------------------------
   100.00       1      0.00 00000000c1a5ae04       vm_map_lock+69

> I suggest putting in some counters to see what the syncer
> is doing. For example:
> - number VDIR vnodes flushed
> - number VREG vnodes flushed
> - number VT_VFS vnodes flushed (sync vnodes)
> If you put an integer switch in the kernel you can turn the counters on at
> runtime using gdb, when the problem starts to occur.

I'll try this before trying a gprof kernel.  Actually, maybe both - I'm 
not worried about the performance hit of profiling on this box.

> > Before I start digging, anyone else seen anything like this before? 
> Nope. But, processing a sync vnode involves a trawl through all vnodes
> associated with every file system. It sounds like that could be happening
> too often, or for some reason perhaps vnodes on the worklist aren't getting
> flushed.

That seems like a pretty reasonable assumption - maxvnodes is set to
128k here, and dropping it to 8k sees ioflush go pretty much idle!
ps shows that thread now using 1.05 cpu seconds of CPU time over 60
seconds.  Bumping maxvnodes back to 128k still shows ioflush idle, but
based on past experience I guess it's not going to show a problem for 48
or more hours.


Home | Main Index | Thread Index | Old Index