Subject: Re: generic HBA error messages on 1.6beta2
To: Matthew Jacob <mjacob@feral.com>
From: Matthias Buelow <mkb@informatik.uni-wuerzburg.de>
List: port-alpha
Date: 07/10/2002 02:21:42
Matthew Jacob writes:

>Fetch ftp://ftp.feral.com/pub/outgoing/patches.gz  and apply and try again.

Will do that tomorrow. Meanwhile, let's take an analytical look at
the problem (maybe it rings a bell for someone on the list):

1) the problem appears to occur only on machines with >= 1GB RAM
   installed (according to Mel Kravitz, who has seen the same problem),
2) the problem occurs here only after the machine has been running for
   at least 2-3 days; this might hint at a problem with higher
   physical addresses or mappings, where the kernel slowly migrates
   some mappings or buffers upwards over time, so that the problem
   shows up only after a couple of days,
3) the problem appears to be with the DMA mapping of the host adapter,
   or with DMA mapping more generally; considering that Jason made new
   SGMAP DMA improvements a while ago (according to the /alpha webpage),
   something might be broken there (with the direct-mapped DMA window,
   although the webpage only mentions mbufs and things being made
   "a bit more efficient"),
4) it does not seem to result from hardware bus collisions or similar,
   because the system is completely unloaded and the trigger seems to
   be elapsed uptime rather than i/o traffic,
5) no real data loss on the disks has been observed so far, at least
   not here; maybe it's just a bogus error (although I somewhat doubt
   that, and there hasn't been enough disk i/o to substantiate it),
6) it cannot be triggered from userland by consuming all available
   virtual memory (as much as is physically available, not swap) and
   doing disk i/o.
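
To make the idea behind points 1) to 3) concrete, here is a toy model of
how a buffer's physical address could decide between the direct-mapped
window and the SGMAP path. This is not NetBSD code: the 1GB window size
and the names dma_path and DIRECT_WINDOW_SIZE are assumptions for
illustration only, and the real threshold depends on the chipset and on
how the window is actually set up.

```python
# Toy model only, not NetBSD code. All names and sizes are assumptions.
GB = 1 << 30
DIRECT_WINDOW_SIZE = 1 * GB  # assumed size of the direct-mapped DMA window


def dma_path(phys_addr, length, window_size=DIRECT_WINDOW_SIZE):
    """Return "direct" or "sgmap" for a buffer in this toy model."""
    if phys_addr < 0 or length <= 0:
        raise ValueError("phys_addr must be >= 0 and length must be > 0")
    # The whole buffer has to lie inside the window to be direct-mapped.
    if phys_addr + length <= window_size:
        return "direct"
    # Anything reaching past the window would need scatter/gather mapping.
    return "sgmap"


if __name__ == "__main__":
    # A buffer low in memory stays in the direct-mapped window.
    print(dma_path(0x1000, 8192))
    # A buffer that has "migrated" above the window takes the other path.
    print(dma_path(GB + 0x1000, 8192))
```

If something like this holds, a machine whose buffers stay low in memory
for the first days would exercise only the direct-mapped path, and the
other path would be hit only once buffers end up at higher addresses.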

I haven't checked yet whether the problem also occurs on the Adaptec
controller that is also installed in the system (at least I have never
seen it there so far), which may or may not hint at specific problems
with the isp (QLogic) driver.  I somehow doubt that, though, but of
course I can't tell.

--mkb