Subject: Re: generic HBA error messages on 1.6beta2
To: Matthias Buelow <mkb@informatik.uni-wuerzburg.de>
From: Jason R Thorpe <thorpej@wasabisystems.com>
List: port-alpha
Date: 07/10/2002 15:52:06
On Wed, Jul 10, 2002 at 02:21:42AM +0200, Matthias Buelow wrote:
 > 1) the problem only appears to occur with machines with >= 1GB RAM
 >    installed (as Mel Kravitz claims, who has seen the same problem),
Yes, >= 1G causes the SGMAP code to be used.
 > 2) the problem only occurs here when the machine has been running for
 >    at least 2-3 days, this might hint at some problem with higher
 >    address spaces or physical memory or mappings, and the kernel
 >    migrates some mappings or buffers slowly upwards over time,
 >    making the problem appear after a couple of days,
Hm.  Well, pages are actually cycled through pretty quickly.  My guess
would instead be some kind of slow resource leak.
 > 3) the problem appears to be with the dma mapping of the host adapter,
 >    or more generally; considering that Jason has made new SGMAP DMA
 >    improvements a while ago (according to the /alpha webpage) this
 >    might be a hint that something might be broken there (with the
 >    direct-mapped DMA window, although it only mentions mbufs and
 >    things being made "a bit more efficient" on the webpage),
The improvements in question fixed some bugs, and also reduced resource
usage on disk->memory transfers.  Matt Thomas and I also recently fixed
a serious SGMAP resource-leaking bug.
Are you, perchance, running kernels built with "options DIAGNOSTIC"?
 > I haven't checked yet if the problem also occurs on the adaptec
 > controller (or at least, never have seen it for that one so far)
 > which is also installed in the system, which may or may not hint
 > at specific problems with the isp (qlogic) driver.  I somehow doubt
 > that, though, but I of course can't tell.
Here is what I would suggest:
In isp_pci_dmasetup(), in the error case for bus_dmamap_load(), print
out the errno.  EAGAIN and ENOMEM will be common; those can occur as
transient errors due to a temporary resource shortage, in which case
the scsipi layer backs off and retries the command.
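
Something along these lines would do for the printout.  This is only a
userland sketch of the idea, not the actual isp driver code: the helper
names dma_load_error_is_transient() and dma_load_error_format() are made
up for illustration, and in the kernel you would simply printf the value
returned by bus_dmamap_load() from the existing error path.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the isp driver): returns 1 if the
 * errno from a failed DMA map load is one of the two values that can
 * show up as a transient resource shortage, 0 for anything else.
 */
static int
dma_load_error_is_transient(int error)
{
	return (error == EAGAIN || error == ENOMEM);
}

/*
 * Hypothetical helper: format one diagnostic line into buf, tagging the
 * transient cases so the unexpected errnos stand out in the log.
 * Returns whatever snprintf() returns.
 */
static int
dma_load_error_format(char *buf, size_t len, const char *devname, int error)
{
	return snprintf(buf, len,
	    "%s: bus_dmamap_load failed, error = %d (%s)%s",
	    devname, error, strerror(error),
	    dma_load_error_is_transient(error) ? " [transient]" : "");
}
```

Tagging EAGAIN and ENOMEM separately is just a convenience, so that any
other errno is easy to spot among the expected retries.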
-- 
        -- Jason R. Thorpe <thorpej@wasabisystems.com>