Subject: port-alpha/23588: alpha SMP kernel dies horribly after completing autoconfig
To: gnats-bugs@gnats.netbsd.org
From: he@netbsd.org
List: netbsd-bugs
Date: 11/28/2003 11:52:45
>Number:         23588
>Category:       port-alpha
>Synopsis:       alpha SMP kernel dies horribly after completing autoconfig
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-alpha-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Nov 28 10:53:00 UTC 2003
>Closed-Date:
>Last-Modified:
>Originator:     Havard Eidnes
>Release:        NetBSD 1.6ZF Nov 28 2003
>Organization:
	Unorganized, Inc.
>Environment:
System: NetBSD kveite.urc.uninett.no 1.6ZF NetBSD 1.6ZF (CS20.MP) #8: Fri Nov 14 14:14:17 CET 2003  he@kveite.urc.uninett.no:/usr/obj/sys/arch/alpha/compile/CS20.MP alpha
Architecture: alpha
Machine: alpha
>Description:
	I just updated my sources, did an update build of the tools,
	and built a new kernel for my CS20 from scratch.

	The kernel completes the autoconfig phase, but either jumps
	into nowhere-land (does not respond to BREAK on the console)
	or dies horribly after init is started.  I have so far
	observed three different failure modes:

	1) it drops back to SRM:

root on sd0a dumps on sd0b
root file system type: ffs

halted CPU 0
CPU 1 is not halted

halt code = 4
invalid PTBR
PC = fffffc00004f2a30
P00>>>

	2) it appears to get stuck (DDB does not respond to BREAK):

root on sd0a dumps on sd0b
root file system type: ffs
Fri Nov 28 10:15:30 GMT 2003
[BREAK][BREAK]

	3) it gets a fatal kernel trap followed by what seems to be an
	endless loop of further fatal kernel traps and panics:

root on sd0a dumps on sd0b
root file system type: ffs
Fri Nov 28 10:17:55 GMT 2003

CPU 1: fatal kernel trap:

CPU 1    trap entry = 0x4 (unaligned access fault)
CPU 1    a0         = 0xfffffc0000420adc
CPU 1    a1         = 0x29
CPU 1    a2         = 0x2
CPU 1    pc         = 0xfffffc00004fe7fc
CPU 1    ra         = 0xfffffc00003004b8
CPU 1    pv         = 0xfffffc00004fdf00
CPU 1    curlwp    = 0xfffffc0000420a04

CPU 1: fatal kernel trap:

CPU 1    trap entry = 0x4 (unaligned access fault)
CPU 1    a0         = 0xfffffc0000420a34
CPU 1    a1         = 0x29
CPU 1    a2         = 0x12
CPU 1    pc         = 0xfffffc00004fde08
CPU 1    ra         = 0xfffffc00004fdde0
CPU 1    pv         = 0xfffffc0000443250
CPU 1    curlwp    = 0xfffffc0000420a04
panic: alpha_send_ipi: bogus cpu_id
Begin traceback...

CPU 1: fatal kernel trap:

CPU 1    trap entry = 0x4 (unaligned access fault)
CPU 1    a0         = 0xfffffc0000420a34
CPU 1    a1         = 0x29
CPU 1    a2         = 0x12
CPU 1    pc         = 0xfffffc00004fde08
CPU 1    ra         = 0xfffffc00004fdde0
CPU 1    pv         = 0x0
CPU 1    curlwp    = 0xfffffc0000420a04
alpha trace requires known PC =eject=
End traceback...

	etc. etc. etc.

	The difference between 3) and 2) is that in 3) I pressed ENTER
	on the console (and got it echoed before the ream of
	trap/panic messages).
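
	For what it's worth, here is a rough sanity check of the trap
	arguments shown in 3).  It is only a sketch and assumes the
	usual Alpha convention for unaligned access traps (a0 = faulting
	virtual address, a1 = opcode of the faulting instruction), which
	I have not verified against this kernel's trap code.  The helper
	names are made up for illustration.  Opcode 0x29 is LDQ, a
	quadword load that needs 8-byte alignment.

```c
#include <stdint.h>

/* Alpha load/store opcodes (architecture opcode table). */
#define OP_LDL 0x28	/* load longword, 4 bytes */
#define OP_LDQ 0x29	/* load quadword, 8 bytes */
#define OP_STL 0x2c	/* store longword, 4 bytes */
#define OP_STQ 0x2d	/* store quadword, 8 bytes */

/* Hypothetical helper: access size in bytes for an opcode, 0 if unknown. */
static unsigned
una_access_size(unsigned opcode)
{
	switch (opcode) {
	case OP_LDL:
	case OP_STL:
		return 4;
	case OP_LDQ:
	case OP_STQ:
		return 8;
	default:
		return 0;
	}
}

/*
 * Hypothetical helper: number of bytes by which va (assumed to be a0)
 * misses the natural alignment required by opcode (assumed to be a1).
 * Returns -1 if the opcode is not one of the four handled above.
 */
static int
una_misalignment(uint64_t va, unsigned opcode)
{
	unsigned size = una_access_size(opcode);

	if (size == 0)
		return -1;
	return (int)(va & (uint64_t)(size - 1));
}
```

	Under those assumptions both faulting addresses above
	(0xfffffc0000420adc and 0xfffffc0000420a34, opcode 0x29) are
	off by 4, i.e. they are longword-aligned but not
	quadword-aligned.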

	Dmesg output for "last good" and "this" kernel will be
	appended to this PR after initial submission.

>How-To-Repeat:
	Build a kernel from today's -current on an SMP alpha system and
	boot it.  Watch it fail in one of the ways described above.

>Fix:
	Don't know.  My last good kernel was built on Nov 14, so some
	change between Nov 14 and Nov 28 has introduced this bug.
>Release-Note:
>Audit-Trail:
>Unformatted: