Re: Finding out where biowait is stuck

To: tech-kern%netbsd.org@localhost
Subject: Re: Finding out where biowait is stuck
From: Stuart Brooks <stuartb%cat.co.za@localhost>
Date: Tue, 24 Feb 2009 08:40:53 +0200

On Mon, Feb 23, 2009 at 09:42:31PM +0000, Andrew Doran wrote:

On Mon, Feb 23, 2009 at 09:47:42AM -0500, Allen Briggs wrote:

The process is in src/sys/vfs_bio.c:biowait(),
but the question is why isn't it getting woken up--or if it's
getting woken up, why aren't B_DONE or B_DELWRI set?

I closed a PR with a similar symptom last year. We had sloppy manipulation
of buf::b_flag. Updates were made without any protection against disk
interrupts.

It was exacerbated by the advent of RISC (load, modify, store) and softdep.
A large chunk of the softdep code runs in interrupt context, which broke
some of the undocumented assumptions that the pre-5.0 buffer code made about
updates to b_flag.
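
To make the load, modify, store hazard described above concrete, here is a minimal user-space sketch in C. It is not the NetBSD buffer code: the struct, the flag values and the "interrupt" (an ordinary function called at the worst possible moment) are all hypothetical stand-ins chosen for illustration. It only shows how an unprotected update to a flags word can throw away a bit set by an interrupt handler, which is the kind of lost update that could leave a waiter asleep.

```c
#include <assert.h>

/* Illustrative values only; these are NOT the real NetBSD flag definitions. */
#define DEMO_BUSY 0x01u   /* stand-in for a flag managed in process context */
#define DEMO_DONE 0x02u   /* stand-in for the "I/O complete" flag */

struct demo_buf {
	unsigned flags;       /* stand-in for the buffer's flags word */
};

/* Hypothetical disk interrupt handler: marks the I/O as complete. */
void
demo_intr(struct demo_buf *bp)
{
	bp->flags |= DEMO_DONE;
}

/*
 * Unprotected update: the "interrupt" fires between the load and the
 * store, so the store writes back a stale copy and DEMO_DONE is lost.
 */
unsigned
demo_unprotected(void)
{
	struct demo_buf b = { DEMO_BUSY };
	unsigned tmp;

	tmp = b.flags;                 /* load */
	demo_intr(&b);                 /* interrupt lands here */
	assert(b.flags & DEMO_DONE);   /* the handler did set the bit... */
	tmp &= ~DEMO_BUSY;             /* modify the stale copy */
	b.flags = tmp;                 /* store: ...and this wipes it out */
	return b.flags;
}

/*
 * Protected update: the interrupt is held off for the whole
 * read-modify-write (in a kernel, by raising the interrupt priority
 * level or taking a lock), so it can only run before or after it.
 */
unsigned
demo_protected(void)
{
	struct demo_buf b = { DEMO_BUSY };

	/* interrupts blocked from here ... */
	b.flags &= ~DEMO_BUSY;
	/* ... to here; the pending interrupt is delivered afterwards */
	demo_intr(&b);
	return b.flags;
}
```

In this sketch demo_unprotected() ends with the done bit cleared even though the handler ran, while demo_protected() ends with it set. Whether this is what is happening on the machines in question is a separate matter; the sketch only illustrates the class of bug.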

The good news is that it is fixed in 5.0. The bad news is that there is not
yet any remediation for this issue in earlier releases (that I know of).


Although I should note, it could be a very different problem in this
instance. The one I note is a good candidate, though.

I'm guessing this is PR/38761 - I didn't pick this up in my search. Having had a quick look, it appears this code (or at least the filenames) has changed quite a bit between NetBSD 3/4 and 5, so it doesn't look like a simple cut and paste to backport the fix, if this is the problem. How much work do you think it would be to put it into NetBSD 4? Do the same principles hold in the biowait code?

We run quite a few NetBSD systems and the only ones we have seen this on are running easyRaid storage devices through a SCSI card - one is an Adaptec, the other an LSI:


mpt0 at pci3 dev 10 function 0: LSI Logic 53c1030 Ultra320 SCSI
mpt0: interrupting at irq 15

Most of the other systems have SATA drives, so the external RAID seems to be
the difference. And as I mentioned, the last few occurrences have all coincided
with the running of the daily script, which would obviously work over the
disks/filesystems.

Another thing which might be relevant is that these systems are receiving
fairly heavy network traffic, of the order of 30-40Mbps.

Thanks for the assistance,
Stuart

