Subject: 3.0.1: softdep + ffsv2 + 'heavy' load = pauses
To: None <netbsd-users@netbsd.org>
From: Mark Cullen <mark.r.cullen@gmail.com>
List: netbsd-users
Date: 07/21/2006 15:16:57
I've just basically narrowed down my annoying pausing issues to softdep.

On a 2GHz (Athlon XP 2400+) test machine with 512MB of DDR RAM, a 
rather old 10GB ATA66 disk (I know.. I know, but bear with me) and one 
big / partition, I was seeing terrible pauses when trying to do anything 
at all in another shell (both logins via SSH) while copying the NetBSD 
source tree over NFS. I'd say the pauses came perhaps every 30 seconds. 
Each one would last anywhere between 10 and 30 seconds, even when 
running something as simple as `uptime`. The hard disk is quite old in 
this case, but I am experiencing very similar, though less pronounced, 
pausing (about 5 seconds) on another machine with much newer ATA100 
disks, although that machine's kernel was built with NEW_BUFQ_STRATEGY.
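
For anyone who wants to put rough numbers on pauses like these, a small 
probe along the following lines might help. It is only a sketch and was 
not used for the observations above. The names `probe` and `find_stalls`, 
the `/tmp` path and the 0.5s / 1.0s values are all made up for 
illustration. A cached `stat` may not block in the same way a fresh 
login does, so it could under-report.

```python
import os
import time


def find_stalls(timestamps, interval, threshold):
    """Return (start_time, extra_delay) for every gap between consecutive
    timestamps that overran the expected interval by more than threshold."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        extra = (cur - prev) - interval
        if extra > threshold:
            stalls.append((prev, extra))
    return stalls


def probe(path, duration=120.0, interval=0.5):
    """Poke the filesystem every `interval` seconds for `duration` seconds
    and return the time (relative to start) at which each poke finished."""
    stamps = []
    start = time.monotonic()
    while time.monotonic() - start < duration:
        os.stat(path)  # cheap metadata lookup; assumed stand-in for `uptime`
        stamps.append(time.monotonic() - start)
        time.sleep(interval)
    return stamps
```

Running `find_stalls(probe("/tmp"), 0.5, 1.0)` in a second shell while 
the copy or dbench is going should list each stall longer than about a 
second, along with roughly when it started.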

Another thing which seemed to trigger the problem was running 'dbench' 
with ~100 clients on the same machine, like 
`dbench -t 60 -D /tmp 100`. Once it got to the "executing" stage for its 
60 seconds, trying to log in via SSH again just totally hung until 
dbench had cleaned up and finished.

Setting vm.bufcache much lower (I set it to 1) seemed to help a bit with 
the copying over NFS, but with dbench it was still hanging. I also tried 
BUFQ_READPRIO and BUFQ_PRIOCSCAN, but it was *still* hanging when trying 
to log in again while dbench was running.

The solution, so far, appears to be to just turn off softdep. I'm seeing 
no pauses now with either of the tests, and during the dbench test SSH 
logs in nice and quick, as if the machine were still idle! I am sure 
turning off softdep has a horrible performance impact, though, so I 
would really like to keep using it.

I seem to remember this problem being mentioned before, and that the 
actual cause was known, but I can't remember where I saw it, or even 
whether it was going to be looked into at some point. I think it may 
have been something to do with bad interactions between the softdep 
code and the new buffer queue code, or something?



On a side note, I tried running `dbench` on the home server with just 20 
clients and it totally hung the machine when it got to the "executing" 
stage of the test. I'm not terribly sure why, either. It had been up for 
20 days, was using NEW_BUFQ_STRATEGY (=BUFQ_READPRIO?) and had 
softupdates enabled. I couldn't get this to happen again on the test 
machine with the same kernel. I can't even get it to happen again on the 
same machine, with the same kernel, after rebooting it!

-- 
Mark Cullen <mark.r.cullen@gmail.com>