Subject: Re: suddenly my sparcs are crashing left, right, and centre!
To: NetBSD/sparc Discussion List <port-sparc@netbsd.org>
From: Steven Grunza <steven_grunza@ieee.org>
List: port-sparc
Date: 04/25/2000 14:29:07
I used to crash a SunOS 4.1.3 system by passing non-32-bit-aligned memory
to the TCP/IP routines.  Could NetBSD have the same type of problem?
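
To illustrate what I mean by "non-32-bit-aligned", here is a minimal sketch
in C.  The helper names are made up for illustration; they are not from the
SunOS or NetBSD sources.  The first function tests whether an address sits on
a 4-byte boundary, and the second reads a 32-bit value through memcpy() so
that no misaligned word load is issued.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if p lies on a 4-byte boundary, else 0. */
static int is_aligned32(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/*
 * Hypothetical helper: reads a 32-bit value from any address.
 * Copying byte-wise through memcpy() avoids dereferencing a
 * possibly misaligned uint32_t pointer.
 */
static uint32_t load32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Code that casts an odd address to a word pointer and dereferences it
directly is the sort of thing that traps on SPARC; going through a copy like
load32_unaligned() sidesteps that.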


At 02:07 PM 4/25/00 -0400, you wrote:
>I must have hexed my systems by posting uptimes the other day.
>
>Since early last night I've not had an uptime of more than two hours on
>either the diskless SS-1+ or its SS-2 server.   (Oddly enough the other
>headless SS-1 has been running fine since rebooting last night!)
>
>The crashes were very annoying too as I couldn't see their cause.  There
>was no core dump and my 25-line console terminal had always scrolled the
>information off the top before I could run downstairs to see what it
>said.  At first I tried setting the nvram "auto-boot?" flag to false but
>that only prevented boot on startup, not the reboot.  Finally I moved
>the console over to my terminal server (even though the terminal server
>would have to boot from the same host should all the power fail long
>enough for the UPSes to die).  What was revealed was a series of
>"alignment fault" panics.  Here's the latest:
>
>trap type 0x7: pc=0xf0108678 npc=0xf010867c psr=118000c0<S,PS>
>panic: alignment fault
>syncing disks... 4 4 1 done
>Frame pointer is at 0xf1a28bb0
>Call traceback:
>  pc = 0xf01068b8  args = (0x0, 0x11000fe5, 0xf0133000, 0xf1a28cd0, 0xf1a28c60, 0x0, 0xf1a28c18) fp = 0xf1a28c18
>  pc = 0xf002e954  args = (0x100, 0x0, 0x1, 0xf1a28d40, 0xf1a28cc8, 0x0, 0xf1a28c80) fp = 0xf1a28c80
>  pc = 0xf010c21c  args = (0xf010bfb8, 0x100, 0x1, 0xf010867c, 0xf1a28d48, 0xc000204b, 0xf1a28ce8) fp = 0xf1a28ce8
>  pc = 0xf000640c  args = (0x7, 0x118000c0, 0xf0108678, 0xf1a28df0, 0xef, 0xf0136400, 0xf1a28d90) fp = 0xf1a28d90
>  pc = 0xf00d9e08  args = (0x8, 0x900ea007, 0x0, 0x1, 0xc0000000, 0x80000904, 0xf1a28e40) fp = 0xf1a28e40
>  pc = 0xf010cbc0  args = (0xf08d9100, 0x63000, 0x2, 0x62ffc, 0x0, 0xf1a28e94, 0xf1a28ea8) fp = 0xf1a28ea8
>  pc = 0xf00062f4  args = (0xf0840900, 0x8080, 0x63000, 0x11ff8, 0x11400083, 0xf1a28fb0, 0xf1a28f50) fp = 0xf1a28f50
>  pc = 0x11fd0  args = (0x46, 0x62010, 0x11f64, 0x11f64, 0x11000084, 0xf1a28fb0, 0xeffff4d8) fp = 0xeffff4d8
>rebooting
>
>
>Note that this is NetBSD-1.3.2.  Remember that this machine had an
>uptime of over 100 days previously.  
>
>As I typed this message another reboot:
>
>trap type 0x7: pc=0xf0108678 npc=0xf010867c psr=118000c0<S,PS>
>panic: alignment fault
>syncing disks... 2 2 done
>Frame pointer is at 0xf1a16bb0
>Call traceback:
>  pc = 0xf01068b8  args = (0x0, 0x11000fe5, 0xf0133000, 0xf1a16cd0, 0xf1a16c60, 0x0, 0xf1a16c18) fp = 0xf1a16c18
>  pc = 0xf002e954  args = (0x100, 0x0, 0x1, 0xf1a16d40, 0xf1a16cc8, 0x0, 0xf1a16c80) fp = 0xf1a16c80
>  pc = 0xf010c21c  args = (0xf010bfb8, 0x100, 0x1, 0xf010867c, 0xf1a16d48, 0xc00038e8, 0xf1a16ce8) fp = 0xf1a16ce8
>  pc = 0xf000640c  args = (0x7, 0x118000c0, 0xf0108678, 0xf1a16df0, 0xef, 0xf0136400, 0xf1a16d90) fp = 0xf1a16d90
>  pc = 0xf00d9e08  args = (0x8, 0x900ea007, 0x0, 0x1, 0xc0000000, 0x800009d2, 0xf1a16e40) fp = 0xf1a16e40
>  pc = 0xf010cbc0  args = (0xf087e600, 0x63000, 0x2, 0x62ffc, 0x0, 0xf1a16e94, 0xf1a16ea8) fp = 0xf1a16ea8
>  pc = 0xf00062f4  args = (0xf088f900, 0x8080, 0x63000, 0x11ff8, 0x11400083, 0xf1a16fb0, 0xf1a16f50) fp = 0xf1a16f50
>  pc = 0x11fd0  args = (0x46, 0x62010, 0x11f64, 0x11f64, 0x11000084, 0xf1a16fb0, 0xeffff4d8) fp = 0xeffff4d8
>rebooting
>
>
>The only explanation I can think of (other than a hardware failure of
>some sort, which this definitely does not seem to be) is that there's
>something happening on the network that causes this.  Either an errant
>(malicious?) packet, or a bug being triggered by an application.  The
>only application I've changed in the last few weeks is smail -- I
>installed a new version just before the first crash in fact....  There
>weren't any changes related to system calls as far as I'm aware, but I
>was fixing a bug related to pointers so I might have corrupted some
>memory that gets passed to the kernel somehow.....
>
>I've taken the entire network off the Internet to avoid both errant
>packets and of course SMTP connections, and I'm rebuilding a kernel with
>DDB now.  So far the machine's been up for over an hour (almost 30
>minutes waiting for mountd to timeout on something though), which is
>making me think my hypothesis is correct.
>
>OK, the DDB kernel was installed, the network brought back online, and
>no sooner said than done it rebooted again.  Note that it rebooted
>without dropping into DDB!  I'll have to double-check my kernel config
>(I think I only added "options DDB" though I didn't give DDB_ONPANIC
>because the manual page says it should default to one and I didn't think
>that option was supported on 1.3.2/sparc anyway, at least there's no
>"ddb.onpanic" showing up in sysctl)...
>
>trap type 0x7: pc=0xf010e450 npc=0xf010e454 psr=118000c0<S,PS>
>panic: alignment fault
>syncing disks... 6 6 3 done
>Frame pointer is at 0xf1a55bb0
>Call traceback:
>  pc = 0xf010c480  args = (0x0, 0x11000fe5, 0xf013e000, 0xf1a55cd0, 0xf1a55c60, 0x0, 0xf1a55c18) fp = 0xf1a55c18
>  pc = 0xf0034354  args = (0x100, 0x0, 0x1, 0xf1a55d40, 0xf1a55cc8, 0x0, 0xf1a55c80) fp = 0xf1a55c80
>  pc = 0xf0112dd4  args = (0xf0112b70, 0x100, 0x1, 0xf010e454, 0xf1a55d48, 0xf014a000, 0xf1a55ce8) fp = 0xf1a55ce8
>  pc = 0xf00064ec  args = (0x7, 0x118000c0, 0xf010e450, 0xf1a55df0, 0xef, 0xf0142800, 0xf1a55d90) fp = 0xf1a55d90
>  pc = 0xf00df808  args = (0x8, 0x900ea007, 0x0, 0x0, 0xf01c49f0, 0x80003f1b, 0xf1a55e40) fp = 0xf1a55e40
>  pc = 0xf0113778  args = (0xf08da180, 0x63000, 0x0, 0x62ffc, 0x0, 0xf1a55e94, 0xf1a55ea8) fp = 0xf1a55ea8
>  pc = 0xf000636c  args = (0xf08e2900, 0x8080, 0x63000, 0x11ff8, 0x11400083, 0xf1a55fb0, 0xf1a55f50) fp = 0xf1a55f50
>  pc = 0x11fd0  args = (0x46, 0x62010, 0x11f64, 0x11f64, 0x11000084, 0xf1a55fb0, 0xeffff4d8) fp = 0xeffff4d8
>rebooting
>
># sysctl ddb.onpanic
>ddb.onpanic: value is not available
>
>I've now backed out yesterday's new smail binary and I'll see how long
>it runs on the old one....
>
>I also started a tcpdump on my gateway to watch for any packets coming
>from the Internet to my diskless workstation (there should be none) and
>hopefully if it's a malicious packet I'll catch it..... 
>
>So far so good.... 30 minutes online again, and a few e-mails too.
>
>So, does anyone know of a bug that allows a user-level program to
>trigger a "panic: alignment fault" in NetBSD/sparc-1.3.2?
>
>-- 
>							Greg A. Woods
>
>+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
>Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
>
>
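
For reading the "trap type 0x7: ... psr=118000c0<S,PS>" lines quoted above:
on SPARC V8, trap type 0x07 is mem_address_not_aligned, and the psr word can
be split into its fields by hand.  A rough sketch in C follows, assuming the
SPARC V8 PSR layout; the struct and function names are mine, not taken from
the NetBSD sources.

```c
#include <stdint.h>

/* Hypothetical container for the SPARC V8 PSR fields of interest. */
struct psr_fields {
    unsigned impl;  /* bits 31:28, implementation */
    unsigned ver;   /* bits 27:24, version */
    unsigned icc;   /* bits 23:20, integer condition codes N,Z,V,C */
    unsigned pil;   /* bits 11:8,  processor interrupt level */
    unsigned s;     /* bit 7, supervisor mode */
    unsigned ps;    /* bit 6, previous supervisor mode */
    unsigned et;    /* bit 5, traps enabled */
    unsigned cwp;   /* bits 4:0, current window pointer */
};

/* Splits a raw 32-bit PSR value into its fields (SPARC V8 layout assumed). */
static struct psr_fields decode_psr(uint32_t psr)
{
    struct psr_fields f;
    f.impl = (psr >> 28) & 0xfu;
    f.ver  = (psr >> 24) & 0xfu;
    f.icc  = (psr >> 20) & 0xfu;
    f.pil  = (psr >> 8)  & 0xfu;
    f.s    = (psr >> 7)  & 1u;
    f.ps   = (psr >> 6)  & 1u;
    f.et   = (psr >> 5)  & 1u;
    f.cwp  = psr & 0x1fu;
    return f;
}
```

Feeding it 0x118000c0 gives S and PS both set, which agrees with the
"<S,PS>" the kernel printed: the fault was taken in supervisor mode, from
supervisor mode.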