Subject: Re: 1.4.2 Observations
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: port-i386
Date: 03/28/2000 12:16:41
On Tue, Mar 28, 2000 at 06:02:17PM +0200, Manuel Bouyer wrote:
> On Mon, Mar 27, 2000 at 04:44:01PM -0500, Thor Lancelot Simon wrote:
> > On Mon, Mar 27, 2000 at 10:53:28PM +0200, Manuel Bouyer wrote:
> > Nonetheless, it's been my experience that since shortly before the 1.4
> > release, our IDE subsystem has been prone to misbehave in the face of high
> > levels of I/O in ways which do make the whole system feel rather slow.
> 
> I have the same behavior on a system with IDE disks, and SCSI (aha2940UW).
> 
> > 
> > I don't understand quite what's going on, but doing things like rsync or
> > find or ls -lR or dump that hit the IDE disks with huge numbers of requests
> > do, in fact, from the statistics, get huge numbers of xfers/sec and very
> > high bytes/sec throughput, and CPU utilization does not, in fact, seem to
> > be particularly high.  On the other hand, in the midst of this type of
> > activity even keyboard input can seem sluggish, and if I do something that
> > generates *new* I/O requests to the IDE disk (e.g. 'ls' while an rsync is
> > running in the background) those requests take a *long* time to complete.
> > 
> > The first behaviour suggests that too much time is being spent at high SPL,
> > but from examination of the IDE code that doesn't seem correct.
> 
> This seems to be related to high IRQ load, whether or not disk I/O is
> involved (I've also seen this on machines with high network load but no
> disk I/O).

Interestingly, I have a system here with a parallel printer attached that
has 64MB of buffer memory.  I have seen over *40,000* IRQs/sec from the
lpt device on this system, while the system feels completely usable.  It
can't just be the number of IRQs.

> > Interestingly, using LFS, which makes almost all disk I/O asynchronous,
> > pretty much makes both problems go away.
> 
> I think this is also because the I/Os are larger, so the IRQ load is
> lower.

This might explain why some SCSI controllers avoid this problem: when you
get an interrupt, you are quite likely to find that multiple commands
have completed.
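
A minimal sketch of what I mean, not taken from any real driver: the
ring layout, RING_SIZE, and the names ring_post() and ring_intr() are
all hypothetical.  The point is only that one trip through the handler
can retire every command that finished since the last interrupt.

```c
#include <assert.h>

#define RING_SIZE 8	/* hypothetical depth of the completion ring */

struct done_ring {
	int      tags[RING_SIZE];	/* tags of completed commands */
	unsigned head;			/* next slot the driver reads */
	unsigned tail;			/* next slot the controller writes */
};

/* Controller side (simulated): record one completed command. */
static void
ring_post(struct done_ring *r, int tag)
{
	assert(r->tail - r->head < RING_SIZE);	/* ring must not overflow */
	r->tags[r->tail % RING_SIZE] = tag;
	r->tail++;
}

/*
 * Interrupt handler: drain everything that has completed and return
 * how many commands this single interrupt retired.
 */
static unsigned
ring_intr(struct done_ring *r)
{
	unsigned n = 0;

	while (r->head != r->tail) {
		/* A real driver would finish the request for this tag here. */
		(void)r->tags[r->head % RING_SIZE];
		r->head++;
		n++;
	}
	return n;
}
```

If three commands are posted before the handler runs, ring_intr()
retires all three in one call, so the interrupt count is a third of the
command count.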

> > With SCSI disks, they don't
> > seem to appear in the first place.  I'd suspect some kind of odd barrier
> > condition with !B_ASYNC buffers, but since we don't do disconnection or
> > multiple command queueing on IDE that doesn't seem likely, either.
> 
> What SCSI controller do you use ?

A variety of them: ahc, bha, and adw.  I haven't seen the problem we're
discussing with any of them.  I run 'ahc' with tagged queueing turned on,
BTW.

I don't think it can be just the *number* of IRQs.  I think we have to be
spending too much time with too many interrupts blocked in some devices' 
interrupt service routines.  Otherwise, my system with the fast printer
generating 40,000 IRQs/sec would be useless, and it's fine.
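
To put rough numbers on that argument, here is a back-of-the-envelope
sketch.  The function name isr_load() and the per-interrupt times are
made up for illustration; I have not measured either device.

```c
#include <assert.h>

/*
 * Fraction of each second spent inside an interrupt handler, given the
 * interrupt rate and the (assumed) time each interrupt takes to service.
 */
static double
isr_load(double irqs_per_sec, double usec_per_irq)
{
	assert(irqs_per_sec >= 0.0 && usec_per_irq >= 0.0);
	return irqs_per_sec * usec_per_irq / 1e6;
}
```

With a hypothetical 2 usec per lpt interrupt, isr_load(40000, 2) is
about 0.08, i.e. 8% of the time.  A device interrupting only 1,000
times/sec but holding interrupts blocked for a hypothetical 300 usec
each time gives isr_load(1000, 300), about 0.30, and each of those
stalls is also 150 times longer.  So a much lower IRQ count can still
hurt far more.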

However, I can't find a change from the period in which I recall this
phenomenon appearing (a month or so before 1.4) that looks likely to have
caused it.