Subject: Re: AlphaServer 4100 spew on reboot bug
To: Jason Thorpe <thorpej@nas.nasa.gov>
From: Chris G. Demetriou <cgd@netbsd.org>
List: port-alpha
Date: 04/19/1999 16:36:15
Jason Thorpe <thorpej@nas.nasa.gov> writes:
> Firmware shouldn't really be making any assumptions about the state of
> devices once a client program (i.e. the kernel) has run.

No, but unless it's fixed by a newer version of SRM, you can't count
on them ever getting it right.


> This may just be a general firmware bug.

If it's not been fixed in a later release of the firmware, it's
something that you have to work around anyway if you want to be
producing a reliable system.


The reality of 'new SRM' is that while it has many more features than
'old SRM' (old being defined as the versions in the TC boxes, and maybe
the Jensen, I forget, and new being defined as the happy drug-induced
version found in the various PCI-using boxes, as far as I can tell...),
'new SRM' is less stable and behaves much less like firmware should
than 'old SRM' does.

For instance, do you remember the 'unaligned access' messages printed
while the boot loader was running?  In a nutshell, for the rest of the
readers: I'd done a 'reboot' with a new kernel in place, but the new
kernel binary was such that the boot loader couldn't load it.  (ELF
headers were screwy and the loader was getting unaligned accesses
trying to read some data before loading the kernel.)  The output that
got printed was the normal NetBSD/alpha kernel unaligned access
printfs, followed by a "panic:" message, followed by a halt.  The
firmware had NEVER REINITIALIZED THE UNALIGNED ACCESS HANDLER after
halt, so the previously-run kernel's unaligned access handler was
being invoked by these new unaligned accesses in the boot block while
loading a new kernel...
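
To make that failure mode concrete, here is a toy model in C.  It is
only a sketch: a single function pointer stands in for whatever
mechanism really dispatches unaligned access traps, and every name in
it is hypothetical (none of this is SRM or NetBSD code).

```c
#include <assert.h>
#include <stddef.h>

/* Who ends up servicing an unaligned access in this toy model. */
enum handler_owner { OWNER_FIRMWARE, OWNER_KERNEL };

typedef enum handler_owner (*unaligned_handler_t)(void);

/* Hypothetical stand-ins for the firmware's and the kernel's handlers. */
enum handler_owner firmware_unaligned(void) { return OWNER_FIRMWARE; }
enum handler_owner kernel_unaligned(void)   { return OWNER_KERNEL; }

/* Stand-in for the trap dispatch state; starts out owned by firmware. */
unaligned_handler_t unaligned_vector = firmware_unaligned;

/* A client program (the kernel) installs its own handler when it runs. */
void kernel_takes_over(void) { unaligned_vector = kernel_unaligned; }

/* The behavior described above: halt leaves the stale handler in place. */
void halt_without_reinit(void) { /* intentionally does nothing */ }

/* What one would expect instead: halt restores the firmware's handler. */
void halt_with_reinit(void) { unaligned_vector = firmware_unaligned; }

/* The boot loader trips over an unaligned access while loading a kernel. */
enum handler_owner boot_loader_unaligned_access(void)
{
	assert(unaligned_vector != NULL);
	return unaligned_vector();
}
```

In this model, calling kernel_takes_over() and then
halt_without_reinit() leaves boot_loader_unaligned_access() running
the old kernel's handler, which is the symptom described above.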

Or, for instance, consider the reason that the APECS code leaves the
direct-mapped DMA window alone (1G at 1G, in window 2), when one might
easily use window 1 for that purpose instead.  In that case, I'd tried
using window 1 for the direct-mapped region and nuking window 2, and
lo and behold, on reboot the system failed to boot (and machine checked
until it reinitialized itself).  It turns out that the firmware was
reusing window 1 for its SGMAP DMA region or just turning it off (I
forget which), but was assuming that window 2 was going to be left
alone as 1G direct mapped at 1G...
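
The constraint that falls out of this can be written down as a small
check.  This is a sketch under stated assumptions: struct dma_window
is a simplified, hypothetical description of a PCI DMA window, not the
real APECS register layout, and the function names are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GIB UINT64_C(0x40000000)	/* 1G */

/* Hypothetical, simplified view of one PCI DMA window. */
struct dma_window {
	bool     enabled;
	bool     sgmap;		/* true: SGMAP, false: direct mapped */
	uint64_t bus_base;	/* PCI bus address where the window starts */
	uint64_t size;		/* window length in bytes */
};

/* True if the window is enabled, direct mapped, 1G long and based at 1G. */
bool window_is_1g_direct_at_1g(const struct dma_window *w)
{
	assert(w != NULL);
	return w->enabled && !w->sgmap &&
	    w->bus_base == GIB && w->size == GIB;
}

/*
 * Model of the firmware's apparent assumption on reboot: it reprograms
 * or disables window 1 itself, so window 1's state does not matter,
 * but it expects window 2 to still be 1G direct mapped at 1G.
 */
bool layout_survives_reboot(const struct dma_window *window1,
    const struct dma_window *window2)
{
	(void)window1;	/* firmware overrides window 1 in this model */
	return window_is_1g_direct_at_1g(window2);
}
```

A layout that moves the direct-mapped region into window 1 and turns
window 2 off fails this check, matching the failed reboot described
above.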

New SRM is fragile, and I've yet to see any indication that the
firmware folks have really attempted to make it less fragile.


Obviously, this is all my personal opinion, and shouldn't be construed
as the position of any current or previous employer.


cgd
-- 
Chris Demetriou - cgd@netbsd.org - http://www.netbsd.org/People/Pages/cgd.html
Disclaimer: Not speaking for NetBSD, just expressing my own opinion.