Subject: Re: generic HBA error messages on 1.6beta2
To: Matthias Buelow <mkb@informatik.uni-wuerzburg.de>
From: Matthew Jacob <mjacob@feral.com>
List: port-alpha
Date: 07/09/2002 17:30:13
On Wed, 10 Jul 2002, Matthias Buelow wrote:
> Matthew Jacob writes:
>
> >Fetch ftp://ftp.feral.com/pub/outgoing/patches.gz and apply and try again.
>
> Will do that tomorrow... besides, let's have an analytical look at
> the problem (maybe with someone on the list a bell might ring):
>
> 1) the problem only appears to occur with machines with >= 1GB RAM
> installed (as Mel Kravitz claims, who has seen the same problem),
> 2) the problem only occurs here when the machine has been running for
> at least 2-3 days, this might hint at some problem with higher
> address spaces or physical memory or mappings, and the kernel
> migrates some mappings or buffers slowly upwards over time,
> making the problem appear after a couple of days,
> 3) the problem appears to be with the DMA mapping of the host adapter,
> or with DMA more generally; considering that Jason made new SGMAP DMA
> improvements a while ago (according to the /alpha webpage) this
> might be a hint that something might be broken there (with the
> direct-mapped DMA window, although it only mentions mbufs and
> things being made "a bit more efficient" on the webpage),
> 4) it does not seem to result from hardware bus collision or similar,
> because the system is completely unloaded and what triggers it
> seems rather to be related to passed uptime than to i/o traffic,
> 5) no real data loss to the disks has been observed so far,
> at least not here, maybe it's just a bogus error (although I somewhat
> doubt that, and there hasn't been enough disk i/o to substantiate that),
> 6) it cannot be triggered from userland by consuming all available
> virtual memory (as much as is physically available, not swap) and
> doing disk i/o.
>
> I haven't checked yet if the problem also occurs on the adaptec
> controller (or at least, never have seen it for that one so far)
> which is also installed in the system, which may or may not hint
> at specific problems with the isp (qlogic) driver. I somehow doubt
> that, though, but I of course can't tell.
>
> --mkb
>
>
I'm also certain that it's a busdma map issue with sgmap. But I cannot, from
code inspection, figure out where I'm going wrong in isp_pci.c. I tried to
simulate this on my pc164 by faking out the code to use nothing but sgmap
instead of the direct-mapped window, and it ran fine. Tra La.
Therefore some slightly more informative error messages seem in order.
-matt
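
The reasoning in this thread (trouble only with >= 1GB RAM, a direct-mapped
DMA window, and a suspected fall-through to sgmap) can be illustrated with a
minimal sketch. It is not taken from isp_pci.c or from the NetBSD/alpha
busdma code. The 1 GB window size, the window base of 0, and the names
choose_dma_path() and describe_dma_failure() are all assumptions made up for
illustration. The idea is only that a buffer lying wholly inside the window
can be direct-mapped, anything else needs sgmap, and a failure message that
prints the address, length and chosen path would say more than a generic
HBA error.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMPTION: a 1 GB direct-mapped window covering physical 0..1GB.
 * Real window bases and sizes depend on the chipset. */
#define DIRECT_WINDOW_BYTES 0x40000000ULL

enum dma_path { DMA_DIRECT, DMA_SGMAP };

/* Hypothetical helper: pick direct mapping only when the whole buffer
 * [paddr, paddr + len) fits inside the assumed window; otherwise sgmap.
 * Written to avoid overflow in paddr + len. */
static enum dma_path
choose_dma_path(uint64_t paddr, uint64_t len)
{
	if (len <= DIRECT_WINDOW_BYTES && paddr <= DIRECT_WINDOW_BYTES - len)
		return DMA_DIRECT;
	return DMA_SGMAP;
}

/* Hypothetical helper: format a more informative load-failure message.
 * Returns whatever snprintf() returns. */
static int
describe_dma_failure(char *buf, size_t buflen, uint64_t paddr,
    uint64_t len, int error)
{
	const char *path;

	assert(buf != NULL);
	path = (choose_dma_path(paddr, len) == DMA_DIRECT) ? "direct" : "sgmap";
	return snprintf(buf, buflen,
	    "dma load failed: error %d, paddr 0x%llx, len %llu, path %s",
	    error, (unsigned long long)paddr, (unsigned long long)len, path);
}
```

Under these assumptions a machine with less than 1 GB of RAM never takes the
sgmap branch, which would be consistent with the problem showing up only on
larger machines and only once buffers end up at higher physical addresses.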