Subject: Re: SPARCengine Ultra AXi bogus power failure
To: Francis Devereux <francis@devrx.org>
From: Eduardo Horvath <eeh@NetBSD.ORG>
List: port-sparc64
Date: 02/28/2003 12:51:03
On Thu, Feb 27, 2003 at 10:14:00PM +0000, Francis Devereux wrote:
> On Thu, Feb 27, 2003 at 10:37:27AM -0800, Eduardo Horvath wrote:
> > On Thu, Feb 27, 2003 at 03:13:21PM +0000, Francis Devereux wrote:
> > > I have a Sunray workstation based on the SPARCengine Ultra AXi board.  When I
> > > boot NetBSD (1.6) I get the message "Power Failure Detected: Shutting down
> > > NOW." and the machine powers off.  Here are the messages I get:
> > > 
> > > console is /pci@1f,0/pci@1,1/ebus@1/se@14,400000:a
> > > Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002
> > >     The NetBSD Foundation, Inc.  All rights reserved.
> > > Copyright (c) 1982, 1986, 1989, 1991, 1993
> > >     The Regents of the University of California.  All rights reserved.
> > > 
> > > NetBSD 1.6 (KRAKEN) #0: Thu Feb 27 00:35:16 GMT 2003
> > >     francis@cepre.repton.int:/usr/src/sys/arch/sparc64/compile/KRAKEN
> > > total memory = 512 MB
> > > avail memory = 466 MB
> > > using 3289 buffers containing 26312 KB of memory
> > > bootpath: /pci@1f,0/pci@1,0/scsi@1,0/disk@1,0
> > > mainbus0 (root): SUNW,UltraSPARC-IIi-Engine
> > > cpu0 at mainbus0: SUNW,UltraSPARC-IIi @ 440.048 MHz, version 0 FPU
> > > cpu0: physical 32K instruction (32 b/l), 16K data (32 b/l), 2048K external (64 b/l)
> > > psycho0 at mainbus0 addr 0xfffc0000
> > > SUNW,sabre: impl 0, version 0: ign 7c0 bus range 0 to 128; PCI bus 0
> > > Power Failure Detected: Shutting down NOW.
> > 
> > I'd speculate that the firmware does interesting things to the bus
> > controller that generates or fails to clear the power fail interrupt 
> > during the reset sequence.  One thing to try is to clear the interrupt
> > (store 0LL in the power fail interrupt clear register) just before
> > installing the interrupt handler.  An interesting question that 
> > should be answered is whether doing this will prevent the interrupt
> > from being delivered later if there is a real power failure.  You
> > could try that fix, then boot into single user mode and power off
> > the machine and see if it manages to print anything.
> > 
> > Eduardo
> 
> I downloaded the -current source and made the following changes to psycho.c:
> --- psycho.c.orig	Wed Dec 11 11:05:00 2002
> +++ psycho.c	Thu Feb 27 21:54:37 2003
> @@ -447,6 +447,8 @@
>  		psycho_set_intr(sc, 15, psycho_bus_b,
>  			&sc->sc_regs->pciberr_int_map, 
>  			&sc->sc_regs->pciberr_clr_int);
> +        /* clear the powerfail interrupt */
> +        sc->sc_regs->power_clr_int = 0LL;
>  		psycho_set_intr(sc, 15, psycho_powerfail,
>  			&sc->sc_regs->power_int_map, 
>  			&sc->sc_regs->power_clr_int);
> @@ -749,8 +751,8 @@
>  	/*
>  	 * We lost power.  Try to shut down NOW.
>  	 */
> -	printf("Power Failure Detected: Shutting down NOW.\n");
> -	cpu_reboot(RB_POWERDOWN|RB_HALT, NULL);
> +	printf("Power Failure Detected: would shut down NOW without Francis' hack.\n");
> +	/* cpu_reboot(RB_POWERDOWN|RB_HALT, NULL); */
>  	return (1);
>  }
>  static 
> 
> Is that what you mean?

That looks like what I had in mind.
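
As a toy illustration of the ordering discussed above (clear any stale
latched interrupt first, then install the handler), here is a minimal
sketch in C.  It is not the psycho.c code: struct fake_intr and the
fake_* functions are hypothetical names invented for this model, and
whether the real hardware still delivers a later power failure after
the clear is exactly the open question, which the model simply assumes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the power fail interrupt state.
 * These are NOT the real psycho registers. */
struct fake_intr {
	bool pending;           /* interrupt left latched, e.g. by firmware */
	bool handler_installed; /* true once a handler has been installed */
	int  delivered;         /* number of times the handler has run */
};

/* Models storing 0 in the clear register: drop any latched interrupt. */
static void
fake_clear(struct fake_intr *i)
{
	assert(i != NULL);
	i->pending = false;
}

/* Models installing the handler: a latched interrupt is delivered at once. */
static void
fake_install(struct fake_intr *i)
{
	assert(i != NULL);
	i->handler_installed = true;
	if (i->pending) {
		i->delivered++;
		i->pending = false;
	}
}

/* Models the hardware raising the interrupt later (a real power failure).
 * Assumption: a new event is still delivered after an earlier clear. */
static void
fake_raise(struct fake_intr *i)
{
	assert(i != NULL);
	i->pending = true;
	if (i->handler_installed) {
		i->delivered++;
		i->pending = false;
	}
}
```

In this model, calling fake_install() on a state with pending set runs
the handler immediately, which corresponds to the bogus shutdown at
boot; calling fake_clear() first avoids that, and a later fake_raise()
is still delivered.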

> 
> I also commented out the line in the interrupt handler that actually powers
> down.  I now get an endless stream of "Power Failure Detected" messages; it
> seems like the interrupt is being triggered repeatedly.

It's beginning to look like the board you have is not wired together
correctly.  The process of dispatching the interrupt also clears it, so
if the power fail input is still asserted the interrupt controller
detects it again and sends a new interrupt packet.  Are you sure it's a
Sun OEM board that's been repackaged and not a clone board?
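
The repeated delivery described above can be sketched with a small
simulation.  This is only a model under stated assumptions, not driver
code: count_dispatches() and its parameters are hypothetical, and the
"poll" loop stands in for the interrupt controller sampling its input.

```c
#include <stdbool.h>

/*
 * Hypothetical model: count how many times the handler runs during
 * max_polls controller polls.  If line_stuck is true, the input line
 * stays asserted forever (as a miswired board might behave); otherwise
 * the source deasserts after a single event.
 */
static int
count_dispatches(bool line_stuck, int max_polls)
{
	bool line = true;  /* input line into the interrupt controller */
	bool idle = true;  /* controller is ready to latch a new interrupt */
	int dispatched = 0;
	int poll;

	for (poll = 0; poll < max_polls; poll++) {
		if (idle && line) {
			idle = false;   /* controller latches, sends a packet */
			dispatched++;   /* handler runs */
			idle = true;    /* dispatch clears the interrupt state */
			if (!line_stuck)
				line = false; /* healthy source deasserts */
		}
	}
	return dispatched;
}
```

With a stuck line the handler runs on every poll, which matches the
endless stream of messages; with a one-shot source it runs once.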

I'll have to think about this....  Since it seems that the OFW system
ID strings are not used very consistently, determining whether to install
the handler based on that is probably not the best idea.

Eduardo