Subject: suddenly my sparcs are crashing left, right, and centre!
To: NetBSD/sparc Discussion List <port-sparc@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: port-sparc
Date: 04/25/2000 14:07:23

I must have hexed my systems by posting uptimes the other day.

Since early last night I've not had an uptime of more than two hours on
either the diskless SS-1+ or its SS-2 server.  (Oddly enough the other
headless SS-1 has been running fine since rebooting last night!)

The crashes were very annoying too, as I couldn't see their cause.  There
was no core dump, and my 25-line console terminal had always scrolled the
information off the top before I could run downstairs to see what it
said.  At first I tried setting the nvram "auto-boot?" flag to false, but
that only prevented the boot on startup, not the reboot after a panic.
Finally I moved the console over to my terminal server (even though the
terminal server would have to boot from the same host should all the
power fail long enough for the UPSes to die).  What was revealed was a
series of "alignment fault" panics.  Here's the latest:

trap type 0x7: pc=0xf0108678 npc=0xf010867c psr=118000c0<S,PS>
panic: alignment fault
syncing disks... 4 4 1 done
Frame pointer is at 0xf1a28bb0
Call traceback:
  pc = 0xf01068b8  args = (0x0, 0x11000fe5, 0xf0133000, 0xf1a28cd0, 0xf1a28c60, 0x0, 0xf1a28c18) fp = 0xf1a28c18
  pc = 0xf002e954  args = (0x100, 0x0, 0x1, 0xf1a28d40, 0xf1a28cc8, 0x0, 0xf1a28c80) fp = 0xf1a28c80
  pc = 0xf010c21c  args = (0xf010bfb8, 0x100, 0x1, 0xf010867c, 0xf1a28d48, 0xc000204b, 0xf1a28ce8) fp = 0xf1a28ce8
  pc = 0xf000640c  args = (0x7, 0x118000c0, 0xf0108678, 0xf1a28df0, 0xef, 0xf0136400, 0xf1a28d90) fp = 0xf1a28d90
  pc = 0xf00d9e08  args = (0x8, 0x900ea007, 0x0, 0x1, 0xc0000000, 0x80000904, 0xf1a28e40) fp = 0xf1a28e40
  pc = 0xf010cbc0  args = (0xf08d9100, 0x63000, 0x2, 0x62ffc, 0x0, 0xf1a28e94, 0xf1a28ea8) fp = 0xf1a28ea8
  pc = 0xf00062f4  args = (0xf0840900, 0x8080, 0x63000, 0x11ff8, 0x11400083, 0xf1a28fb0, 0xf1a28f50) fp = 0xf1a28f50
  pc = 0x11fd0  args = (0x46, 0x62010, 0x11f64, 0x11f64, 0x11000084, 0xf1a28fb0, 0xeffff4d8) fp = 0xeffff4d8
rebooting
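
For readers who want to pick the "trap type" line apart, here is a
minimal sketch in Python.  It assumes the standard SPARC V8 PSR layout
(impl, ver, icc, EC, EF, PIL, S, PS, ET, CWP) and the V8 trap numbering,
in which 0x7 is mem_address_not_aligned.  The helper names decode_psr
and parse_trap_line are made up for this illustration and are not part
of NetBSD.

```python
import re

# Single-bit PSR fields, per the SPARC V8 PSR layout (assumed here).
PSR_FLAGS = [(13, "EC"), (12, "EF"), (7, "S"), (6, "PS"), (5, "ET")]

TRAP_RE = re.compile(
    r"trap type (0x[0-9a-f]+): pc=(0x[0-9a-f]+) npc=(0x[0-9a-f]+) psr=([0-9a-f]+)"
)


def decode_psr(psr):
    """Split a 32-bit SPARC V8 PSR value into its fields."""
    return {
        "impl": (psr >> 28) & 0xF,
        "ver": (psr >> 24) & 0xF,
        "icc": (psr >> 20) & 0xF,  # N, Z, V, C condition codes
        "pil": (psr >> 8) & 0xF,   # processor interrupt level
        "cwp": psr & 0x1F,         # current window pointer
        "flags": [name for bit, name in PSR_FLAGS if psr & (1 << bit)],
    }


def parse_trap_line(line):
    """Parse one 'trap type ...' console line into integers."""
    m = TRAP_RE.match(line)
    if m is None:
        raise ValueError("not a trap line: %r" % line)
    tt, pc, npc, psr = m.groups()
    return {
        "type": int(tt, 16),
        "pc": int(pc, 16),
        "npc": int(npc, 16),
        "psr": decode_psr(int(psr, 16)),  # psr is printed without a 0x prefix
    }


trap = parse_trap_line(
    "trap type 0x7: pc=0xf0108678 npc=0xf010867c psr=118000c0<S,PS>"
)
print(hex(trap["type"]), trap["psr"]["flags"])
```

With PS set, the processor was already in supervisor mode when the trap
was taken, which fits the pc being a kernel address.  Note that the
trap line gives only the address of the faulting instruction, not the
misaligned data address it tried to access.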


Note that this is NetBSD-1.3.2.  Remember that this machine had an
uptime of over 100 days previously.  

As I typed this message another reboot:

trap type 0x7: pc=0xf0108678 npc=0xf010867c psr=118000c0<S,PS>
panic: alignment fault
syncing disks... 2 2 done
Frame pointer is at 0xf1a16bb0
Call traceback:
  pc = 0xf01068b8  args = (0x0, 0x11000fe5, 0xf0133000, 0xf1a16cd0, 0xf1a16c60, 0x0, 0xf1a16c18) fp = 0xf1a16c18
  pc = 0xf002e954  args = (0x100, 0x0, 0x1, 0xf1a16d40, 0xf1a16cc8, 0x0, 0xf1a16c80) fp = 0xf1a16c80
  pc = 0xf010c21c  args = (0xf010bfb8, 0x100, 0x1, 0xf010867c, 0xf1a16d48, 0xc00038e8, 0xf1a16ce8) fp = 0xf1a16ce8
  pc = 0xf000640c  args = (0x7, 0x118000c0, 0xf0108678, 0xf1a16df0, 0xef, 0xf0136400, 0xf1a16d90) fp = 0xf1a16d90
  pc = 0xf00d9e08  args = (0x8, 0x900ea007, 0x0, 0x1, 0xc0000000, 0x800009d2, 0xf1a16e40) fp = 0xf1a16e40
  pc = 0xf010cbc0  args = (0xf087e600, 0x63000, 0x2, 0x62ffc, 0x0, 0xf1a16e94, 0xf1a16ea8) fp = 0xf1a16ea8
  pc = 0xf00062f4  args = (0xf088f900, 0x8080, 0x63000, 0x11ff8, 0x11400083, 0xf1a16fb0, 0xf1a16f50) fp = 0xf1a16f50
  pc = 0x11fd0  args = (0x46, 0x62010, 0x11f64, 0x11f64, 0x11000084, 0xf1a16fb0, 0xeffff4d8) fp = 0xeffff4d8
rebooting


The only explanation I can think of (other than a hardware failure of
some sort, which this definitely does not seem to be) is that there's
something happening on the network that causes this.  Either an errant
(malicious?) packet, or a bug being triggered by an application.  The
only application I've changed in the last few weeks is smail -- I
installed a new version just before the first crash in fact....  There
weren't any changes related to system calls as far as I'm aware, but I
was fixing a bug related to pointers so I might have corrupted some
memory that gets passed to the kernel somehow.....

I've taken the entire network off the Internet to avoid both errant
packets and of course SMTP connections, and I'm rebuilding a kernel with
DDB now.  So far the machine's been up for over an hour (almost 30
minutes of that waiting for mountd to time out on something though),
which is making me think my hypothesis is correct.

OK, the DDB kernel was installed, the network brought back online, and
no sooner said than done it rebooted again.  Note that it rebooted
without dropping into DDB!  I'll have to double-check my kernel config.
I think I only added "options DDB"; I didn't give DDB_ONPANIC because
the manual page says it should default to one, and I didn't think that
option was supported on 1.3.2/sparc anyway (at least there's no
"ddb.onpanic" showing up in sysctl)...

trap type 0x7: pc=0xf010e450 npc=0xf010e454 psr=118000c0<S,PS>
panic: alignment fault
syncing disks... 6 6 3 done
Frame pointer is at 0xf1a55bb0
Call traceback:
  pc = 0xf010c480  args = (0x0, 0x11000fe5, 0xf013e000, 0xf1a55cd0, 0xf1a55c60, 0x0, 0xf1a55c18) fp = 0xf1a55c18
  pc = 0xf0034354  args = (0x100, 0x0, 0x1, 0xf1a55d40, 0xf1a55cc8, 0x0, 0xf1a55c80) fp = 0xf1a55c80
  pc = 0xf0112dd4  args = (0xf0112b70, 0x100, 0x1, 0xf010e454, 0xf1a55d48, 0xf014a000, 0xf1a55ce8) fp = 0xf1a55ce8
  pc = 0xf00064ec  args = (0x7, 0x118000c0, 0xf010e450, 0xf1a55df0, 0xef, 0xf0142800, 0xf1a55d90) fp = 0xf1a55d90
  pc = 0xf00df808  args = (0x8, 0x900ea007, 0x0, 0x0, 0xf01c49f0, 0x80003f1b, 0xf1a55e40) fp = 0xf1a55e40
  pc = 0xf0113778  args = (0xf08da180, 0x63000, 0x0, 0x62ffc, 0x0, 0xf1a55e94, 0xf1a55ea8) fp = 0xf1a55ea8
  pc = 0xf000636c  args = (0xf08e2900, 0x8080, 0x63000, 0x11ff8, 0x11400083, 0xf1a55fb0, 0xf1a55f50) fp = 0xf1a55f50
  pc = 0x11fd0  args = (0x46, 0x62010, 0x11f64, 0x11f64, 0x11000084, 0xf1a55fb0, 0xeffff4d8) fp = 0xeffff4d8
rebooting
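
The three tracebacks can also be compared mechanically.  The following
Python sketch pulls the pc out of each frame and checks which frames
fall below the kernel's address range.  The KERNEL_FLOOR value is an
assumption inferred only from the addresses in these traces, the helper
names are hypothetical, and the sample text keeps just the "pc =" part
of each frame (the args and fp fields are dropped for brevity).

```python
import re

PC_RE = re.compile(r"^[ \t]*pc = (0x[0-9a-f]+)", re.MULTILINE)

# Assumption: kernel addresses start here (inferred from the traces only).
KERNEL_FLOOR = 0xF0000000


def traceback_pcs(text):
    """Return the pc of every frame in a pasted 'Call traceback' block."""
    return [int(pc, 16) for pc in PC_RE.findall(text)]


def user_pcs(pcs):
    """Keep only the frames whose pc lies below the assumed kernel floor."""
    return [pc for pc in pcs if pc < KERNEL_FLOOR]


# Frame pcs copied from the three panics, in the order they were printed.
first = """
  pc = 0xf01068b8
  pc = 0xf002e954
  pc = 0xf010c21c
  pc = 0xf000640c
  pc = 0xf00d9e08
  pc = 0xf010cbc0
  pc = 0xf00062f4
  pc = 0x11fd0
"""
second = """
  pc = 0xf01068b8
  pc = 0xf002e954
  pc = 0xf010c21c
  pc = 0xf000640c
  pc = 0xf00d9e08
  pc = 0xf010cbc0
  pc = 0xf00062f4
  pc = 0x11fd0
"""
third = """
  pc = 0xf010c480
  pc = 0xf0034354
  pc = 0xf0112dd4
  pc = 0xf00064ec
  pc = 0xf00df808
  pc = 0xf0113778
  pc = 0xf000636c
  pc = 0x11fd0
"""

same_path = traceback_pcs(first) == traceback_pcs(second)
user_frames = [user_pcs(traceback_pcs(t)) for t in (first, second, third)]
print(same_path, [[hex(pc) for pc in f] for f in user_frames])
```

The first two panics walk through exactly the same kernel addresses.
The third comes from the rebuilt DDB kernel, so its kernel addresses
differ, but it has the same number of frames and the same outermost
user-level pc, 0x11fd0.  That is consistent with the same user-level
code entering the kernel each time, though it does not by itself
identify which program that is.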

# sysctl ddb.onpanic
ddb.onpanic: value is not available

I've now backed out yesterday's new smail binary and I'll see how long
it runs on the old one....

I also started a tcpdump on my gateway to watch for any packets coming
from the Internet to my diskless workstation (there should be none), so
hopefully if it's a malicious packet I'll catch it.....

So far so good.... 30 minutes online again, and a few e-mails too.

So, does anyone know of a bug that allows a user-level program to
trigger a "panic: alignment fault" in NetBSD/sparc-1.3.2?

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>