Subject: Re: generic HBA error messages on 1.6beta2
To: Jason R Thorpe <thorpej@wasabisystems.com>
From: Matthew Jacob <mjacob@feral.com>
List: port-alpha
Date: 07/10/2002 15:58:58
Jason- read the thread. The patches I offered for him to try already have
this.


On Wed, 10 Jul 2002, Jason R Thorpe wrote:

> On Wed, Jul 10, 2002 at 02:21:42AM +0200, Matthias Buelow wrote:
> 
>  > 1) the problem only appears to occur with machines with >= 1GB RAM
>  >    installed (as Mel Kravitz claims, who has seen the same problem),
> 
> Yes, >= 1G causes the SGMAP code to be used.
> 
>  > 2) the problem only occurs here when the machine has been running for
>  >    at least 2-3 days, this might hint at some problem with higher
>  >    address spaces or physical memory or mappings, and the kernel
>  >    migrates some mappings or buffers slowly upwards over time,
>  >    making the problem appear after a couple of days,
> 
> Hm.  Well, pages are actually cycled through pretty quickly.  My guess
> would instead be some kind of slow resource leak.
> 
>  > 3) the problem appears to be with the dma mapping of the host adapter,
>  >    or more generally; considering that Jason has made new SGMAP DMA
>  >    improvements a while ago (according to the /alpha webpage) this
>  >    might be a hint that something might be broken there (with the
>  >    direct-mapped DMA window, although it only mentions mbufs and
>  >    things being made "a bit more efficient" on the webpage),
> 
> The improvements in question fixed some bugs, and also reduced resource
> usage on disk->memory transfers.  Matt Thomas and I also recently fixed
> a serious SGMAP resource-leaking bug.
> 
> Are you, perchance, running kernels built with "options DIAGNOSTIC"?
> 
>  > I haven't checked yet if the problem also occurs on the adaptec
>  > controller (or at least, never have seen it for that one so far)
>  > which is also installed in the system, which may or may not hint
>  > at specific problems with the isp (qlogic) driver.  I somehow doubt
>  > that, though, but I of course can't tell.
> 
> Here is what I would suggest:
> 
> In isp_pci_dmasetup(), in the error case for bus_dmamap_load(), print
> out the errno.  EAGAIN and ENOMEM will be common ... those can occur
> as transient errors due to temporary resource shortage ... the scsipi
> layer backs off in that case, and retries the command.
> 
> -- 
>         -- Jason R. Thorpe <thorpej@wasabisystems.com>
>
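
For reference, the errno check suggested in the quoted message could be
sketched roughly as below. This is a userland illustration only, not the
isp driver code and not the patch mentioned above: the helper names
dma_load_error_is_transient() and format_dma_load_error() are hypothetical,
and in the kernel the message would be printed from the bus_dmamap_load()
error path in isp_pci_dmasetup() rather than formatted into a buffer.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the isp driver): returns true when a
 * bus_dmamap_load() failure code indicates a temporary resource shortage,
 * i.e. the EAGAIN/ENOMEM cases described in the quoted message.
 */
static bool
dma_load_error_is_transient(int error)
{
	return error == EAGAIN || error == ENOMEM;
}

/*
 * Hypothetical helper: format a diagnostic that includes the errno value
 * and its text, and notes whether the error is one the upper layer is
 * expected to retry. Returns the snprintf() result.
 */
static int
format_dma_load_error(char *buf, size_t len, const char *devname, int error)
{
	return snprintf(buf, len,
	    "%s: bus_dmamap_load failed, error %d (%s)%s",
	    devname, error, strerror(error),
	    dma_load_error_is_transient(error) ? ", transient, will retry" : "");
}
```

Anything other than EAGAIN or ENOMEM showing up in that message would be
the interesting case, since those two are expected under temporary
resource shortage.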