port-macppc: Re: Oops: pciide0:0:0: lost interrupt

Subject: Re: Oops: pciide0:0:0: lost interrupt
To: None <port-macppc@NetBSD.org>
From: Donald Lee <MacPPC@caution.icompute.com>
List: port-macppc
Date: 03/12/2004 09:38:09
>> wd0 is hooked up to the "internal" ide, and disks wd1 and wd2
>> are connected to a PCI card
>> (pciide0 at pci0 dev 15 function 0: Promise Ultra133/ATA Bus Master IDE  Accelerator (rev. 0x02))
>> I get lots of these:
>> Mar 11 23:27:40 grace /netbsd:  type: ata tc_bcount: 8192 tc_skip: 0
>> Mar 11 23:27:49 grace /netbsd: pciide0:0:0: lost interrupt
>
>I've been complaining about this for ages with people telling me it's my
>hardware. I'm also using a Promise-type controller (Sonnet Tempo Trio),
>and in my case as well, heavy load seems to precipitate it. 1.6.2_RC3 didn't
>make it much better, although like you stated, the errors appear benign.

The word benign is too strong.  Non-fatal would be more accurate.  I'm not
sure I can live with this on a production machine.  (web/backup server)

I've been running experiments this morning.  As long as I only use one of
the channels at a time, everything seems fine.  If I try to use the
second channel while the other one is busy, everything stops and I
start getting "lost interrupt" in the log.

It does not seem to be a problem to have activity on the on-board and one
of the pciide channels simultaneously.  That works fine.  It starts
complaining when both Sonnet card channels get used simultaneously.

For instance, I run:

    ( cd /usr ; tar -cf - . ) | (cd /cuda/mnt && tar -xf - )

as a continuous load, and watch the disks with systat.  (/usr is on
internal ata, /cuda is one of the PCI channels.)  All I have to do
is a "du -s /other/*" and everything screeches to a halt.

It is pretty clear that the "workaround" in the wdc driver is capable of
recovering from this ugliness - none of the I/Os that are "lost"
actually result in errors.  Has anyone gone in and "tuned" the
timeout code so that it recovers faster?

What appears to happen in the wdc code is that it allows the I/Os to time out,
and when they do, the interrupt handler is called "manually".  The
trip through the interrupt handler is sufficient to do all the
"cleanup".  What would happen if I simply kept a counter of outstanding
requests, and called the interrupt handler every 20th of a second any time
there is at least one outstanding I/O?

Gross?  Disgusting?  Incorrect?  Sure!  Would it work?
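
A rough sketch of that polling idea, as a user-space model only.  None of
this is the real wdc code: the struct, the field names and the function
names are all made up for illustration, and the "interrupt handler" here
just reaps completions that the pretend hardware has already finished.
The 1/20 s period comes from the question above; the timer that would
drive channel_poll_tick() is assumed and not shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one IDE channel (not the real wdc structures). */
struct channel {
    int outstanding;  /* requests issued and not yet cleaned up */
    int hw_done;      /* requests the hardware finished, not yet reaped */
    int polled;       /* times the poll tick called the handler */
};

/* Stand-in for the interrupt handler: clean up whatever has finished. */
void channel_intr(struct channel *ch)
{
    ch->outstanding -= ch->hw_done;
    ch->hw_done = 0;
}

/* Issue a new request on the channel. */
void channel_start(struct channel *ch)
{
    ch->outstanding++;
}

/*
 * The pretend hardware finishes one request.  Passing
 * irq_delivered == false models the "lost interrupt" case:
 * the work is done but nobody calls the handler.
 */
void channel_hw_complete(struct channel *ch, bool irq_delivered)
{
    assert(ch->hw_done < ch->outstanding);
    ch->hw_done++;
    if (irq_delivered)
        channel_intr(ch);
}

/*
 * Meant to be called every 1/20 s by some timer (assumed).
 * It calls the handler "manually" only while I/O is outstanding,
 * so an idle channel costs nothing.
 */
void channel_poll_tick(struct channel *ch)
{
    if (ch->outstanding > 0) {
        ch->polled++;
        channel_intr(ch);
    }
}
```

In this model a request whose interrupt is lost stays outstanding until
the next tick, so the worst-case stall is one polling period rather than
a full timeout.  Whether the real handler tolerates being called while a
transfer is still in flight is exactly the open question.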

>> Can anyone tell me if it is at least a little improved in NetBSD 1.6.2?
>> Any advice for getting around this?  So far it looks pretty annoying,
>> but not fatal.  It'll be a pretty severe performance problem.
>> Now that I've been playing with it a bit, I'm finding that no
>> command that touches the disks on the Promise card  (like "df -k")
>> seems to complete....
>
>Ooog. Mine *does* work okay in that respect.

To be more precise, it *seems* to not complete.  It's actually
just really slow.

-dgl-