Subject: Re: generic HBA error messages on 1.6beta2
To: Matthias Buelow <mkb@informatik.uni-wuerzburg.de>
From: Matthew Jacob <mjacob@feral.com>
List: port-alpha
Date: 07/09/2002 17:30:13
On Wed, 10 Jul 2002, Matthias Buelow wrote:
> Matthew Jacob writes:
> 
> >Fetch ftp://ftp.feral.com/pub/outgoing/patches.gz  and apply and try again.
> 
> Will do that tomorrow... besides, let's have an analytical look at
> the problem (maybe with someone on the list a bell might ring):
> 
> 1) the problem only appears to occur with machines with >= 1GB RAM
>    installed (as Mel Kravitz claims, who has seen the same problem),
> 2) the problem only occurs here when the machine has been running for
>    at least 2-3 days, this might hint at some problem with higher
>    address spaces or physical memory or mappings, and the kernel
>    migrates some mappings or buffers slowly upwards over time,
>    making the problem appear after a couple of days,
> 3) the problem appears to be with the dma mapping of the host adapter,
>    or more generally; considering that Jason has made new SGMAP DMA
>    improvements a while ago (according to the /alpha webpage) this
>    might be a hint that something might be broken there (with the
>    direct-mapped DMA window, although it only mentions mbufs and
>    things being made "a bit more efficient" on the webpage),
> 4) it does not seem to result from hardware bus collision or similar,
>    because the system is completely unloaded and what triggers it
>    seems to be related to elapsed uptime rather than to i/o traffic,
> 5) no real data loss to the disks has been observed so far,
>    at least not here, maybe it's just a bogus error (although I somewhat
>    doubt that, and there hasn't been enough disk i/o to substantiate that),
> 6) it cannot be triggered from userland by consuming all available
>    virtual memory (what's available physical, not swap) and doing disk
>    i/o.
> 
> I haven't checked yet if the problem also occurs on the adaptec
> controller (or at least, never have seen it for that one so far)
> which is also installed in the system, which may or may not hint
> at specific problems with the isp (qlogic) driver.  I somehow doubt
> that, though, but I of course can't tell.
> 
> --mkb
> 
> 
I'm also certain that it's a busdma map issue with sgmap. But I cannot, from
code inspection, figure out where I'm going wrong in isp_pci.c. I tried to
simulate this on my pc164 by faking out the code to use nothing but sgmap
instead of the direct-mapped window, and it ran fine. Tra La.
Therefore some slightly more informative error messages seem in order.
-matt
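
For illustration only, here is a rough sketch of what a "slightly more
informative" DMA error message might carry. It is not the isp_pci.c code or
the NetBSD bus_dma implementation. The 1GB window size, the names
choose_dma_path() and format_dma_error(), and the message layout are all
assumptions made up for this example; the point is only that logging the
physical address, the length, and which path (direct-mapped window or sgmap)
was chosen would help tie a failure to the >= 1GB RAM theory discussed above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMPTION: a 1GB direct-mapped window starting at physical address 0.
 * The real base and size are chipset-specific and not taken from any source. */
#define DIRECT_WINDOW_SIZE ((uint64_t)1 << 30)

enum dma_path { DMA_DIRECT, DMA_SGMAP };

/* Hypothetical helper: a buffer may use the direct-mapped window only if it
 * lies entirely inside it; anything else has to go through the sgmap. */
static enum dma_path
choose_dma_path(uint64_t paddr, uint64_t len)
{
	if (paddr < DIRECT_WINDOW_SIZE && len <= DIRECT_WINDOW_SIZE - paddr)
		return DMA_DIRECT;
	return DMA_SGMAP;
}

/* Hypothetical helper: build a diagnostic line that records the device name,
 * the error code, the chosen path, and the physical address and length.
 * Returns what snprintf() returns. */
static int
format_dma_error(char *buf, size_t buflen, const char *dev, int error,
    uint64_t paddr, uint64_t len)
{
	assert(buf != NULL && buflen > 0);

	return snprintf(buf, buflen,
	    "%s: DMA map load failed (error %d) via %s: pa 0x%llx len 0x%llx",
	    dev, error,
	    choose_dma_path(paddr, len) == DMA_DIRECT ? "direct window" : "sgmap",
	    (unsigned long long)paddr, (unsigned long long)len);
}
```

With these assumed values, a buffer that starts just below 1GB and crosses
the boundary is reported as going through the sgmap, which is the case that
would be interesting to see in the logs after a few days of uptime.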