Subject: Re: unusual panics on NetBSD/alpha 3.0_* and 4.0_BETA
To: Eric Schnoebelen <eric@cirr.com>
From: Simon Burge <simonb@NetBSD.org>
List: tech-kern
Date: 10/07/2006 16:38:01
Eric Schnoebelen wrote:

> 	I'm running NetBSD/alpha on an assortment of alpha
> hardware, but mostly DS10L's.  One of them, running 3.0_STABLE
> (circa 26 July 2006) is seeing the following panics on a
> semi-regular basis: (dmesg in the first attachment)
> 
> 		[-- eric@localhost attached -- Tue Sep 26 19:09:14 2006]
> 		db> bt
> 		cpu_Debugger() at netbsd:cpu_Debugger+0x4
> 		panic() at netbsd:panic+0x1f8
> 		trap() at netbsd:trap+0x120
> 		XentUna() at netbsd:XentUna+0x20
> 		--- unaligned access fault (from ipl 1) ---
> 		tcp_sack_option() at netbsd:tcp_sack_option+0x13c
> 		tcp_dooptions() at netbsd:tcp_dooptions+0x278
> 		tcp_input() at netbsd:tcp_input+0xa20
> 		ip_input() at netbsd:ip_input+0xb4c
> 		ipintr() at netbsd:ipintr+0xa0
> 		netintr() at netbsd:netintr+0x158
> 		softintr_dispatch() at netbsd:softintr_dispatch+0x160
> 		exception_return() at netbsd:exception_return+0x7c
> 		--- root of call graph ---

This looks like it happened in netinet/tcp_sack.c at:

        for (i = 0; i < num_sack_blks; i++, lp += 2) {
                memcpy(&left, lp, sizeof(*lp));
                memcpy(&right, lp + 1, sizeof(*lp));
--->            left = ntohl(left);
                right = ntohl(right);

Disassembly of tcp_sack.o shows:

../../../../netinet/tcp_sack.c:225
 168:   a2 09 e4 43     cmplt   zero,t3,t1
../../../../netinet/tcp_sack.c:224
 16c:   8f 0c 61 44     cmovle  t2,t0,fp
../../../../netinet/tcp_sack.c:225
 170:   0e 04 ff 47     clr     s5
 174:   20 00 40 e4     beq     t1,1f8 <tcp_sack_option+0x1b8>
../../../../netinet/tcp_sack.c:228
 178:   00 00 0c a2     ldl     a0,0(s3)
../../../../netinet/tcp_sack.c:227
 17c:   04 00 2c a1     ldl     s0,4(s3)

It looks like gcc is optimising the memcpy() calls away and doing direct
ldl loads through lp, which trap when lp is not 4-byte aligned.  We
probably need some sort of qualifier on a variable somewhere?
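
As a rough sketch of one way to stop the compiler from assuming alignment
(not necessarily the fix that belongs in tcp_sack.c), the option data can
be kept behind a byte pointer so that memcpy() has to cope with any
address.  The names sack_edges and sack_block_read below are hypothetical
and exist only for illustration.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for one SACK block; not from tcp_sack.c. */
struct sack_edges {
	uint32_t left;
	uint32_t right;
};

/*
 * Read one SACK block (two 32-bit values in network byte order) from
 * cp.  Because cp is a byte pointer, the compiler may not assume
 * 4-byte alignment, so it cannot lower the memcpy() calls to plain
 * aligned 32-bit loads.
 */
static struct sack_edges
sack_block_read(const unsigned char *cp)
{
	struct sack_edges e;

	memcpy(&e.left, cp, sizeof(e.left));
	memcpy(&e.right, cp + sizeof(e.left), sizeof(e.right));
	e.left = ntohl(e.left);
	e.right = ntohl(e.right);
	return e;
}
```

The point of the sketch is only that the pointer type carries the
alignment assumption: a u_int32_t pointer tells gcc the data is aligned,
a byte pointer does not.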

> unexpected machine check:
> 
>     mces    = 0x1
>     vector  = 0x670
>     param   = 0xfffffc0000006000
>     pc      = 0xfffffc0000589174
>     ra      = 0xfffffc0000589128
>     code    = 0x100000000
>     curlwp = 0xfffffc000fcfb800
>         pid = 7.1, comm = ioflush

The machine check is a totally different problem.  Google finds:

   > > A 0x670 vector machine check indicates a hardware failure specific to the
   > > CPU such as a cache failure.
   > 
   > On some machine models it could also be an access to nonexistent memory,
   > which can point to a broken memory-mapped video card driver.
   > If the video card is on a secondary bus it can also be a sparse
   > initialisation problem related to PCI-PCI bridges.
   > Moving slots may help in that case.

Simon.