Subject: Re: unusual panics on NetBSD/alpha 3.0_* and 4.0_BETA
To: Eric Schnoebelen <eric@cirr.com>
From: Simon Burge <simonb@NetBSD.org>
List: tech-kern
Date: 10/07/2006 16:38:01
Eric Schnoebelen wrote:
> I'm running NetBSD/alpha on an assortment of alpha
> hardware, but mostly DS10L's. One of them, running 3.0_STABLE
> (circa 26 July 2006) is seeing the following panics on a
> semi-regular basis: (dmesg in the first attachment)
>
> [-- eric@localhost attached -- Tue Sep 26 19:09:14 2006]
> db> bt
> cpu_Debugger() at netbsd:cpu_Debugger+0x4
> panic() at netbsd:panic+0x1f8
> trap() at netbsd:trap+0x120
> XentUna() at netbsd:XentUna+0x20
> --- unaligned access fault (from ipl 1) ---
> tcp_sack_option() at netbsd:tcp_sack_option+0x13c
> tcp_dooptions() at netbsd:tcp_dooptions+0x278
> tcp_input() at netbsd:tcp_input+0xa20
> ip_input() at netbsd:ip_input+0xb4c
> ipintr() at netbsd:ipintr+0xa0
> netintr() at netbsd:netintr+0x158
> softintr_dispatch() at netbsd:softintr_dispatch+0x160
> exception_return() at netbsd:exception_return+0x7c
> --- root of call graph ---
This looks like it happened in netinet/tcp_sack.c at:

        for (i = 0; i < num_sack_blks; i++, lp += 2) {
                memcpy(&left, lp, sizeof(*lp));
                memcpy(&right, lp + 1, sizeof(*lp));
--->            left = ntohl(left);
                right = ntohl(right);

Disassembly of tcp_sack.o shows:
../../../../netinet/tcp_sack.c:225
168: a2 09 e4 43 cmplt zero,t3,t1
../../../../netinet/tcp_sack.c:224
16c: 8f 0c 61 44 cmovle t2,t0,fp
../../../../netinet/tcp_sack.c:225
170: 0e 04 ff 47 clr s5
174: 20 00 40 e4 beq t1,1f8 <tcp_sack_option+0x1b8>
../../../../netinet/tcp_sack.c:228
178: 00 00 0c a2 ldl a0,0(s3)
../../../../netinet/tcp_sack.c:227
17c: 04 00 2c a1 ldl s0,4(s3)
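It looks like gcc is optimising the memcpy() calls away and doing the
loads directly: the two ldl instructions above read straight from 0(s3)
and 4(s3), and ldl traps if the address isn't 4-byte aligned. We probably
need some sort of qualifier on a variable somewhere?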
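
One common way to stop the compiler from assuming alignment is to do the
copy through a byte pointer, so that nothing in the pointer's type implies
4-byte alignment. The following is only a minimal userland sketch of that
idea, not the tcp_sack.c code or a tested kernel fix: load_be32() and
read_sack_block() are hypothetical names, and it assumes each SACK block is
two consecutive 32-bit values in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* ntohl() */

/*
 * Hypothetical helper (not from tcp_sack.c): fetch one 32-bit
 * network-order value from a possibly unaligned buffer.  The source
 * is a byte pointer, so the compiler has no basis for assuming
 * 4-byte alignment when it expands the memcpy() inline.
 */
static uint32_t
load_be32(const uint8_t *p)
{
	uint32_t v;

	assert(p != NULL);
	memcpy(&v, p, sizeof(v));
	return ntohl(v);
}

/*
 * Hypothetical helper: read the left and right edges of SACK block
 * number i, assuming 8 bytes per block.
 */
static void
read_sack_block(const uint8_t *blocks, size_t i,
    uint32_t *left, uint32_t *right)
{
	const uint8_t *p = blocks + i * 8;

	*left = load_be32(p);
	*right = load_be32(p + 4);
}
```

With a byte pointer the compiler has to emit an access sequence that is
safe for any alignment, at the cost of a few extra instructions on alpha.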
> unexpected machine check:
>
> mces = 0x1
> vector = 0x670
> param = 0xfffffc0000006000
> pc = 0xfffffc0000589174
> ra = 0xfffffc0000589128
> code = 0x100000000
> curlwp = 0xfffffc000fcfb800
> pid = 7.1, comm = ioflush
Machine checks are totally different. Google finds:
> > A 0x670 vector machine check indicates a hardware failure specific to the
> > CPU such as a cache failure.
>
> On some machine models it could also be accessing nonexistent memory,
> which can point to a broken memory-mapped video card driver.
> If the video card is on a secondary bus it can also be a sparse
> initialisation problem related to PCI-PCI bridges.
> Moving slots may help in that case.
Simon.