Subject: Re: NMI on Compaq 1850R
To: None <current-users@NetBSD.org, port-i386@NetBSD.org>
From: Chris Ross <cross+netbsd@distal.com>
List: port-i386
Date: 01/20/2007 12:12:13

On Jan 18, 2007, at 18:19, Chris Ross wrote:
>   In June of 2004, I posted to current-users about a problem I was  
> having getting NetBSD 2 (point something) installed on a Compaq  
> 1850R.  I have a recollection of discussing this with someone else  
> from the list, off-list, and finding a really tiny kernel bug that  
> only affected some small class of memory systems.  PIIX3, perhaps?
>
>   In any case, I was fairly certain that change, which allowed me  
> to run with more than 512MB of memory without getting an NMI fairly  
> easily during heavy disk activity, was committed to the trunk, and  
> pulled up into 3.  [...]

   Hello again, all.  And, new lists for the more specific questions  
now to be asked.  As it turns out, I still had a 2.99.14 kernel tree  
sitting on the machine, and was able to find a [crude] patch to  
sys/arch/i386/pci/pchb.c that appears, tested now against 3.1-RELEASE, to  
solve my problem.  This was clearly never contributed back to the  
core, and that may well be my fault.  The relevant code in pchb.c  
hasn't changed in any significant way in a very long time.

   The aforementioned "patch" I am now running with simply removes  
the PCI_PRODUCT_INTEL_82443BX_AGP and PCI_PRODUCT_INTEL_82443BX_NOAGP  
case starting near line 193 of pchbattach().  This is noted to be a  
"BIOS BUG WORKAROUND".  But, at least for my machine (pchb0: Intel  
82443BX Host Bridge/Controller (AGP disabled) (rev. 0x03)), this  
"workaround" causes the machine to get an NMI fairly easily.

   I have confirmed that with 4 DIMMs making 768MB of memory, the  
above code will cause a crash within a few minutes when doing a cvs  
checkout of the NetBSD src tree.  Without the 20 lines of code (and  
comment) in that 'case', it runs just fine for multiple full  
checkouts/updates.  If I have only 2 DIMMs (either 256MB or 512MB) in  
the machine, though, it will work just fine with or without the above  
code.  As I mentioned in my first piece of email, which went only to  
current-users, I discussed this off-list with someone in the summer  
of 2004.  Sadly, I don't have that email.  But, I do remember now,  
vaguely, him noting something about this being incorrect code, at  
least with respect to some revisions of the 82443BX.  I wish I could  
remember which revisions he said this code did or didn't apply to,  
but clearly for rev 0x03, it causes a problem.
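
   To make the idea concrete, here is a minimal sketch of gating the
workaround on chip revision rather than deleting the case outright.
It is not the code in pchb.c: the helper name and the skip list are
hypothetical, and the only revision listed is 0x03, the one reported
above to misbehave.  Which other revisions belong there is unknown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical list of 82443BX revisions on which the BIOS workaround
 * should be skipped.  Only 0x03 is confirmed by the report above; any
 * further entries would be guesses and are deliberately left out.
 */
static const uint8_t bx_workaround_skip_revs[] = { 0x03 };

/*
 * Hypothetical helper (not a NetBSD function): returns true when the
 * workaround should still be applied for the given chip revision.
 */
static bool
bx_workaround_wanted(uint8_t revision)
{
	size_t i;
	size_t n = sizeof(bx_workaround_skip_revs) /
	    sizeof(bx_workaround_skip_revs[0]);

	for (i = 0; i < n; i++) {
		if (bx_workaround_skip_revs[i] == revision)
			return false;	/* known-bad revision: leave registers alone */
	}
	return true;			/* unlisted revision: keep existing behavior */
}
```

   In pchbattach() the revision would presumably come from
PCI_REVISION(pa->pa_class), with the existing case body wrapped in a
check like the one above.  A skip list is used here, rather than a
"rev >= N" comparison, only because a single revision has been tested.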

   Perhaps the person who "owns" that code in i386/pci/pchb.c, or the  
person I worked with a couple years ago, if he is on any of these lists,  
could discuss this with me so that we could find the "correct" solution and  
get it into the tree.  I can certainly run a patched kernel, but  
there must be other people with a Proliant 1850R, or some other  
machine with affected revisions of the 82443BX, that this would also  
help.  :-)

   Thanks much.  I hope to hear from you soon!

                                                     - Chris