Re: port-sparc/56788 Rump kernel panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed

To: port-sparc-maintainer%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,gson%gson.org@localhost (Andreas Gustafsson)
Subject: Re: port-sparc/56788 Rump kernel panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed
From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
Date: Sun, 12 Jun 2022 18:20:01 +0000 (UTC)

The following reply was made to PR port-sparc/56788; it has been noted by GNATS.

From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: port-hppa-maintainer%netbsd.org@localhost
Subject: Re: port-sparc/56788 Rump kernel panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed
Date: Sun, 12 Jun 2022 14:19:35 -0400

 Andreas Gustafsson writes:
 > On the TNF sparc testbed, various networking tests randomly fail with
 > a rump kernel panic:
 >  [   1.3700050] panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed: file "/tmp/build/2022.04.05.05.04.04-sparc/src/sys/rump/net/lib/libshmif/if_shmem.c", line 157
 > This is a long-standing issue, going back at least to 2011.  I have
 > not seen it on architectures other than sparc.

 I see this on HPPA as well, and I concur that it seems to be the
 underlying cause of intermittent failures in various ATF net/ tests.
 For example, if I run

 $ while /usr/tests/net/icmp/t_ping floodping2
 do
 :
 done

 for a while, many iterations produce such a report -- though t_ping
 usually claims it succeeded anyway.  Maybe it's atf-run's responsibility
 to notice rump kernel failure?

 The failing assertion is in shmif_unlockbus():

         old = atomic_swap_32(&busmem->shm_lock, LOCK_UNLOCKED);
         KASSERT(old == LOCK_LOCKED);

 which for me immediately raises the suspicion that something's busted
 in either atomic_swap_32 or atomic_cas_32 (the function used earlier to
 acquire the shm_lock).  On HPPA, atomic_swap_32 is implemented atop
 atomic_cas_32, so that reduces to just one target of suspicion.  In
 kernel space, atomic_cas_32 is supposed to be implemented by _lock_cas
 in lock_stubs.S, and (on my uniprocessor machine) that is a non-atomic
 instruction sequence that is supposed to be made to act atomic by the
 RAS restart mechanism.  I can think of a few ways this might be going
 wrong:

 1. Maybe the rump kernel is linked to the userspace version of
 atomic_cas_32, so that the address range being defended by the RAS
 mechanism is the wrong one.

 2. Maybe the rump kernel fails to implement the RAS checks in the
 traps it takes.

 3. Maybe this is related to PR 56837, and there is some case where
 we can trap at _lock_cas_ras_start with tf_iioq_tail different from
 _lock_cas_ras_start + 4.
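
 To make the common failure mode behind all three theories concrete,
 here is a minimal sketch (not the NetBSD code) of what happens when a
 load/compare/store sequence is interrupted and not restarted.  Only
 shm_lock, LOCK_LOCKED and LOCK_UNLOCKED come from the excerpt above;
 their numeric values, the helper names (cas_load, cas_finish, swap32,
 run_interleaving) and the hand-scheduled interleaving are assumptions
 made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LOCK_UNLOCKED 0u   /* assumed value */
#define LOCK_LOCKED   1u   /* assumed value */

/* Hypothetical stand-in for busmem->shm_lock. */
static uint32_t shm_lock = LOCK_UNLOCKED;

/* First half of a non-atomic compare-and-swap: the load. */
static uint32_t cas_load(const uint32_t *p)
{
    return *p;
}

/* Second half: compare the value loaded earlier and store on a match.
 * Returns the loaded value, in the style of atomic_cas_32. */
static uint32_t cas_finish(uint32_t *p, uint32_t seen,
                           uint32_t expected, uint32_t newval)
{
    if (seen == expected)
        *p = newval;
    return seen;
}

/* Plain (single-threaded) swap, standing in for atomic_swap_32. */
static uint32_t swap32(uint32_t *p, uint32_t newval)
{
    uint32_t old = *p;
    *p = newval;
    return old;
}

/* Simulate contexts A and B racing for the lock.  A is "preempted"
 * between its load and its store.  If restart is nonzero, A redoes
 * its load after being resumed, which is the effect a working RAS
 * restart would have.  Returns how many unlocks saw
 * old != LOCK_LOCKED, i.e. how often the assertion would fire. */
static int run_interleaving(int restart)
{
    int bad = 0;

    shm_lock = LOCK_UNLOCKED;

    uint32_t a_seen = cas_load(&shm_lock);      /* A loads UNLOCKED */
    /* ... A is preempted here; B runs its whole CAS ... */
    uint32_t b_seen = cas_load(&shm_lock);
    int b_got = cas_finish(&shm_lock, b_seen,
                           LOCK_UNLOCKED, LOCK_LOCKED) == LOCK_UNLOCKED;

    if (restart)
        a_seen = cas_load(&shm_lock);           /* restarted: sees LOCKED */
    int a_got = cas_finish(&shm_lock, a_seen,
                           LOCK_UNLOCKED, LOCK_LOCKED) == LOCK_UNLOCKED;

    /* Each context that believes it holds the lock now unlocks. */
    if (a_got && swap32(&shm_lock, LOCK_UNLOCKED) != LOCK_LOCKED)
        bad++;
    if (b_got && swap32(&shm_lock, LOCK_UNLOCKED) != LOCK_LOCKED)
        bad++;
    return bad;
}
```

 Called from a test harness, run_interleaving(0) models the case where
 the sequence is not restarted: both contexts acquire the lock and the
 second unlock finds it already unlocked.  run_interleaving(1) models
 a restart, in which case A's acquisition fails and no unlock
 misbehaves.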

 It looks like uniprocessor SPARC also relies on RAS to make
 atomic_cas_32 atomic, so Occam's razor suggests that the failure
 mechanism is the same on both archs, which would favor explanation
 1 or 2 over 3.  However, then we'd expect to see these on every
 arch that uses RAS for atomic_cas_32, which I think is more than
 SPARC and HPPA.
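
 For readers unfamiliar with restartable atomic sequences, the sketch
 below shows the general idea of the check a trap handler has to make,
 and why theory 1 would defeat it.  This is an illustration under
 stated assumptions, not the NetBSD implementation: struct ras_range
 and ras_fixup are hypothetical names, and the half-open range
 comparison is an assumed convention.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one registered restartable sequence. */
struct ras_range {
    uintptr_t start;   /* address of the first instruction */
    uintptr_t end;     /* address just past the last instruction */
};

/* If the interrupted pc lies inside the registered sequence, resume
 * at the start of the sequence so the load is redone; otherwise
 * resume where the trap occurred. */
static uintptr_t ras_fixup(const struct ras_range *r, uintptr_t pc)
{
    if (pc >= r->start && pc < r->end)
        return r->start;
    return pc;
}
```

 With a registered range of, say, { 0x1000, 0x1010 }, a trap at 0x1008
 resumes at 0x1000.  A trap inside a different copy of the function,
 for instance at an address such as 0xad53e0d4, falls outside the
 range and resumes unchanged, so the sequence gets no protection even
 though the mechanism itself works as designed.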

 I just managed to get a core dump out of t_ping with librumpnet_shmif.so
 attached, and it kind of looks like theory #1 might be the winner:

 (gdb) x/16i atomic_cas_32
    0xad53e0c0 <_atomic_cas_32>: addil L%2800,r19,r1
    0xad53e0c4 <_atomic_cas_32+4>:       ldw 3a4(r1),ret0
    0xad53e0c8 <_atomic_cas_32+8>:       stw rp,-14(sp)
    0xad53e0cc <_atomic_cas_32+12>:      stw,ma r4,40(sp)
    0xad53e0d0 <_atomic_cas_32+16>:      stw r19,-20(sp)
    0xad53e0d4 <_atomic_cas_32+20>:      ldw 0(ret0),r22
    0xad53e0d8 <_atomic_cas_32+24>:      b,l 0xad5375e8,r31
    0xad53e0dc <_atomic_cas_32+28>:      copy r31,rp
    0xad53e0e0 <_atomic_cas_32+32>:      ldw -54(sp),rp
    0xad53e0e4 <_atomic_cas_32+36>:      bv r0(rp)
    0xad53e0e8 <_atomic_cas_32+40>:      ldw,mb -40(sp),r4
    0xad53e0ec <_atomic_cas_16>: addil L%2800,r19,r1
    0xad53e0f0 <_atomic_cas_16+4>:       ldw 3a8(r1),ret0
    0xad53e0f4 <_atomic_cas_16+8>:       stw rp,-14(sp)
    0xad53e0f8 <_atomic_cas_16+12>:      extrw,u r24,31,16,r24
    0xad53e0fc <_atomic_cas_16+16>:      stw,ma r4,40(sp)

 This is not the code from lock_stubs.S.

 Maybe there is an arch-specific linking problem here?  Or maybe I don't
 understand how this stuff is supposed to work in a rump kernel.
 But in any case my money is on RAS not doing what it needs to.

 			regards, tom lane

Follow-Ups:
- re: port-sparc/56788 Rump kernel panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed
  - From: matthew green
