NetBSD-Bugs archive
Re: port-sparc/56788 Rump kernel panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed
The following reply was made to PR port-sparc/56788; it has been noted by GNATS.
From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: port-hppa-maintainer%netbsd.org@localhost
Subject: Re: port-sparc/56788 Rump kernel panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed
Date: Sun, 12 Jun 2022 14:19:35 -0400
Andreas Gustafsson writes:
> On the TNF sparc testbed, various networking tests randomly fail with
> a rump kernel panic:
> [ 1.3700050] panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed: file "/tmp/build/2022.04.05.05.04.04-sparc/src/sys/rump/net/lib/libshmif/if_shmem.c", line 157
> This is a long-standing issue, going back at least to 2011. I have
> not seen it on architectures other than sparc.
I see this on HPPA as well, and I concur that it seems to be the
underlying cause of intermittent failures in various ATF net/ tests.
For example, if I run
$ while /usr/tests/net/icmp/t_ping floodping2
do
:
done
for a while, many iterations produce such a report -- though t_ping
usually claims it succeeded anyway. Maybe it's atf-run's responsibility
to notice rump kernel failure?
The failing assertion is in shmif_unlockbus():
    old = atomic_swap_32(&busmem->shm_lock, LOCK_UNLOCKED);
    KASSERT(old == LOCK_LOCKED);
which for me immediately raises the suspicion that something's busted
in either atomic_swap_32 or atomic_cas_32 (the function used earlier to
acquire the shm_lock). On HPPA, atomic_swap_32 is implemented atop
atomic_cas_32, so that reduces to just one target of suspicion. In
kernel space, atomic_cas_32 is supposed to be implemented by _lock_cas
in lock_stubs.S, and (on my uniprocessor machine) that is a non-atomic
instruction sequence that is supposed to be made to act atomic by the
RAS restart mechanism. I can think of a few ways this might be going
wrong:
1. Maybe the rump kernel is linked to the userspace version of
atomic_cas_32, so that the address range being defended by the RAS
mechanism is the wrong one.
2. Maybe the rump kernel fails to implement the RAS checks in the
traps it takes.
3. Maybe this is related to PR 56837, and there is some case where
we can trap at _lock_cas_ras_start with tf_iioq_tail different from
_lock_cas_ras_start + 4.
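To make the three theories concrete, here is a minimal sketch of the check a trap handler has to perform for a restartable atomic sequence. It is a simplified model, not the NetBSD code: struct ras_range and ras_fixup_pc are hypothetical names, and it ignores the HPPA instruction address queue (head and tail) that theory 3 is about.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one restartable atomic sequence (RAS). */
struct ras_range {
	uintptr_t start;	/* address of the first instruction in the sequence */
	uintptr_t end;		/* first address past the sequence */
};

/*
 * Return the PC to resume at after a trap.  If the interrupted PC lies
 * inside the registered sequence, rewind to its start so the whole
 * load/compare/store is redone; otherwise leave the PC alone.
 */
static uintptr_t
ras_fixup_pc(const struct ras_range *ras, uintptr_t pc)
{
	if (pc >= ras->start && pc < ras->end)
		return ras->start;
	return pc;
}
```

In this model, theory 1 corresponds to the CAS instructions living at addresses outside the registered range, so the PC is never rewound; theory 2 corresponds to the trap path never calling such a check at all.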
It looks like uniprocessor SPARC also relies on RAS to make
atomic_cas_32 atomic, so Occam's razor suggests that the failure
mechanism is the same on both archs, which would favor explanation
1 or 2 over 3. However, then we'd expect to see these on every
arch that uses RAS for atomic_cas_32, which I think is more than
SPARC and HPPA.
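To illustrate why a missed restart would produce exactly this assertion, here is a deterministic single-threaded sketch that replays one bad interleaving by hand. Everything in it is an assumption for illustration: the two-step CAS (cas_load, cas_finish), swap_32, and unlock_assert_would_fail are hypothetical names, and the lock protocol is only modeled on the acquire-by-CAS, release-by-swap pattern described above.

```c
#include <assert.h>
#include <stdint.h>

#define LOCK_UNLOCKED	0
#define LOCK_LOCKED	1

/* A non-atomic CAS split into the two steps a trap could separate. */
struct cas_state {
	uint32_t loaded;	/* value read in step 1 */
};

/* Step 1: read the current value. */
static void
cas_load(struct cas_state *st, const uint32_t *p)
{
	st->loaded = *p;
}

/* Step 2: store newval if step 1 saw expected; return what step 1 saw. */
static uint32_t
cas_finish(struct cas_state *st, uint32_t *p, uint32_t expected,
    uint32_t newval)
{
	if (st->loaded == expected)
		*p = newval;
	return st->loaded;
}

/* Plain swap; nothing interrupts it in this replay. */
static uint32_t
swap_32(uint32_t *p, uint32_t newval)
{
	uint32_t old = *p;

	*p = newval;
	return old;
}

/*
 * Replay: A loads, is preempted, B acquires, A resumes.  restart_a says
 * whether A's interrupted sequence is restarted from its load.
 * Returns 1 if some unlock would observe old != LOCK_LOCKED.
 */
static int
unlock_assert_would_fail(int restart_a)
{
	uint32_t lock = LOCK_UNLOCKED;
	struct cas_state a, b;
	int a_holds, b_holds, failed = 0;

	cas_load(&a, &lock);		/* A sees UNLOCKED, then traps */
	cas_load(&b, &lock);
	b_holds = cas_finish(&b, &lock, LOCK_UNLOCKED, LOCK_LOCKED)
	    == LOCK_UNLOCKED;		/* B acquires the lock */
	if (restart_a)
		cas_load(&a, &lock);	/* restart: A now sees LOCKED */
	a_holds = cas_finish(&a, &lock, LOCK_UNLOCKED, LOCK_LOCKED)
	    == LOCK_UNLOCKED;

	if (b_holds && swap_32(&lock, LOCK_UNLOCKED) != LOCK_LOCKED)
		failed = 1;
	if (a_holds && swap_32(&lock, LOCK_UNLOCKED) != LOCK_LOCKED)
		failed = 1;
	return failed;
}
```

Without the restart, both A and B believe they hold the lock, and whichever releases second swaps out LOCK_UNLOCKED, which is the condition the KASSERT trips on. With the restart, A's CAS fails and only B ever unlocks.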
I just managed to get a core dump out of t_ping with librumpnet_shmif.so
attached, and it kind of looks like theory #1 might be the winner:
(gdb) x/16i atomic_cas_32
0xad53e0c0 <_atomic_cas_32>: addil L%2800,r19,r1
0xad53e0c4 <_atomic_cas_32+4>: ldw 3a4(r1),ret0
0xad53e0c8 <_atomic_cas_32+8>: stw rp,-14(sp)
0xad53e0cc <_atomic_cas_32+12>: stw,ma r4,40(sp)
0xad53e0d0 <_atomic_cas_32+16>: stw r19,-20(sp)
0xad53e0d4 <_atomic_cas_32+20>: ldw 0(ret0),r22
0xad53e0d8 <_atomic_cas_32+24>: b,l 0xad5375e8,r31
0xad53e0dc <_atomic_cas_32+28>: copy r31,rp
0xad53e0e0 <_atomic_cas_32+32>: ldw -54(sp),rp
0xad53e0e4 <_atomic_cas_32+36>: bv r0(rp)
0xad53e0e8 <_atomic_cas_32+40>: ldw,mb -40(sp),r4
0xad53e0ec <_atomic_cas_16>: addil L%2800,r19,r1
0xad53e0f0 <_atomic_cas_16+4>: ldw 3a8(r1),ret0
0xad53e0f4 <_atomic_cas_16+8>: stw rp,-14(sp)
0xad53e0f8 <_atomic_cas_16+12>: extrw,u r24,31,16,r24
0xad53e0fc <_atomic_cas_16+16>: stw,ma r4,40(sp)
This is not the code from lock_stubs.S.
Maybe there is an arch-specific linking problem here? Or maybe I don't
understand how this stuff is supposed to work in a rump kernel.
But in any case my money is on RAS not doing what it needs to.
regards, tom lane