NetBSD-Bugs archive


Re: kern/34461 (multiple problems; lfs-related)

The following reply was made to PR kern/34461; it has been noted by GNATS.

From: David Holland <>
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Sun, 13 Jul 2014 18:42:21 +0000

 On Sun, Jul 13, 2014 at 02:35:00PM +0000, Bert Kiers wrote:
  >  On 7/12/14 7:20 PM, wrote:
  >> If there's anyone still on the other end of this... is there any
  >> reason to believe this wasn't a driver or hardware issue? The fact
  >> that the problem starts with a driver-level error makes me think
  >> it probably isn't LFS.
  >> Reformatting with FFS and no longer seeing the issue doesn't
  >> really prove anything as FFS has much different I/O patterns, and
  >> in particular when rsyncing LFS will be ramming a lot more data
  >> down the disk system's throat.  (Or at least, a lot more at once.)
  >> If there was a load- or timing-dependent problem at the disk level
  >> it's quite possible that FFS wouldn't trigger it, especially if
  >> you weren't using softupdates.
  >  System is scrapped. If somebody is really interested I could retry with
  >  newer computer, NetBSD-current and same box with disks.

 Looking at it some more, I have the following conjecture:
  - Back in 2009 when WAPBL was new, it sometimes under load exhibited
 a dysfunctional operating state where it would be madly writing the
 same blocks over and over again and making very little real progress.
  - This turned out to be not WAPBL-specific but also possible (just
 much harder to get into) with regular FFS.
  - It was caused by bad dynamic behavior logic in the syncer that was
 triggered by the disks getting behind on the pending I/O.
  - It got fixed; some of the fix was FS-independent, but it isn't
 clear to me (without digging a lot deeper) how much.

 I think it's possible that you were seeing an LFS version of this same
 behavior. With the size of the RAID you had/have, it's quite plausible
 that the flood of writes this problem produced would render the system
 as slow as described.

 If so, it might now be fixed... or it might not. It would probably be
 interesting to find out, but see below.

 The allocation failure message that started it is scsipi-level; it
 means that the allocation pool for xs structures ran out. In the
 current code (this doesn't seem to have changed) this causes a
 half-second delay. It is quite likely that a sudden half-second delay
 while running at peak throughput would be enough to trigger the
 dysfunctional state described above, if it existed in LFS at the time.
 (Given that the message appeared only once, the half-second delay
 itself can't be the performance problem.)

 The problem could be something else entirely, though; e.g. something
 in the way LFS prepares segments, or something silly in lfs_putpages.
 Or this could be the same as PR 35187. Or a raidframe issue. Also, the
 allocation failure might conceivably be a red herring and not actually
 related at all; or the trigger (or even the problem) might be some
 other allocation failure that doesn't print anything.

 Trying to replicate this on a modern machine (with much more RAM and
 faster disks) might be much harder... or much easier. Even if my
 conjecture's correct, it's hard to guess. It will probably be harder
 to get the triggering allocation failure; you might have to insert
 fault injection code for that. If my conjecture's correct, without the
 half-second delay the problem might well not appear. There's some
 chance (especially if my conjecture's wrong) that it'll turn out to be
 easy to reproduce, but this doesn't seem too likely.

 So I would say: unless you're interested in working on LFS, trying to
 replicate it probably isn't worthwhile; it will take a fair amount of
 effort and isn't that likely to produce conclusive results.

 If you *are* interested in working on LFS, by all means go ahead
 though :-)

 (There's another unrelated bug: it seems that hitting that half-second
 delay causes mishandling of the iostat counters. However, that
 shouldn't matter much.)

 David A. Holland
