NetBSD-Bugs archive


Re: kern/34461 (multiple problems; lfs-related)

The following reply was made to PR kern/34461; it has been noted by GNATS.

From: David Holland <>
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Sun, 13 Jul 2014 18:42:21 +0000

 On Sun, Jul 13, 2014 at 02:35:00PM +0000, Bert Kiers wrote:
  >  On 7/12/14 7:20 PM, wrote:
  >> If there's anyone still on the other end of this... is there any
  >> reason to believe this wasn't a driver or hardware issue? The fact
  >> that the problem starts with a driver-level error makes me think
  >> it probably isn't LFS.
  >> Reformatting with FFS and no longer seeing the issue doesn't
  >> really prove anything as FFS has much different I/O patterns, and
  >> in particular when rsyncing LFS will be ramming a lot more data
  >> down the disk system's throat.  (Or at least, a lot more at once.)
  >> If there was a load- or timing-dependent problem at the disk level
  >> it's quite possible that FFS wouldn't trigger it, especially if
  >> you weren't using softupdates.
  >  System is scrapped. If somebody is really interested I could retry with
  >  newer computer, NetBSD-current and same box with disks.

 Looking at it some more, I have the following conjecture:
  - Back in 2009 when WAPBL was new, it sometimes under load exhibited
 a dysfunctional operating state where it would be madly writing the
 same blocks over and over again and making very little real progress.
  - This turned out to be not WAPBL-specific but also possible (just
 much harder to get into) with regular FFS.
  - It was caused by bad dynamic behavior logic in the syncer that was
 triggered by the disks getting behind on the pending I/O.
  - It got fixed; some of the fix was FS-independent, but it isn't
 clear to me (without digging a lot deeper) how much.

 I think it's possible that you were seeing an LFS version of this same
 behavior. With the size of the RAID you had/have, it's quite plausible
 that the flood of writes this problem produced would render the system
 as slow as described.

 If so, it might now be fixed... or it might not. It would probably be
 interesting to find out, but see below.

 The allocation failure message that started it is scsipi-level; it
 means that the allocation pool for xs structures ran out. In the
 current code (this doesn't seem to have changed) this causes a
 half-second delay. It is quite likely that a sudden half-second delay
 while running at peak throughput would be enough to trigger the
 dysfunctional state described above, if it existed in LFS at the time.
 (Given that the message appeared only once, the half-second delay
 itself can't be the performance problem.)

 The problem could be something else entirely, though; e.g. something
 in the way LFS prepares segments, or something silly in lfs_putpages.
 Or this could be the same as PR 35187. Or a raidframe issue. Also, the
 allocation failure might conceivably be a red herring and not actually
 related at all; or the trigger (or even the problem) might be some
 other allocation failure that doesn't print anything.

 Trying to replicate this on a modern machine (with much more RAM and
 faster disks) might be much harder... or much easier. Even if my
 conjecture's correct, it's hard to guess. It will probably be harder
 to get the triggering allocation failure; you might have to insert
 fault injection code for that. If my conjecture's correct, without the
 half-second delay the problem might well not appear. There's some
 chance (especially if my conjecture's wrong) that it'll turn out to be
 easy to reproduce, but this doesn't seem too likely.

 So I would say: unless you're interested in working on LFS, trying to
 replicate it probably isn't worthwhile; it will take a fair amount of
 effort and isn't that likely to produce conclusive results.

 If you *are* interested in working on LFS, by all means go ahead
 though :-)

 (There's another unrelated bug: it seems that hitting that half-second
 delay causes mishandling of the iostat counters. However, that
 shouldn't matter much.)

 David A. Holland
