NetBSD-Bugs archive


Re: port-sparc/56788 Rump kernel panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed



The following reply was made to PR port-sparc/56788; it has been noted by GNATS.

From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: port-hppa-maintainer%netbsd.org@localhost
Subject: Re: port-sparc/56788 Rump kernel panic: kernel diagnostic assertion "old == LOCK_LOCKED" failed
Date: Fri, 17 Jun 2022 17:46:19 -0400

 I've spent a great deal of time digging into the t_ping test case,
 and I think I see what is happening on HPPA ... but I'm not sure
 if it explains anything for SPARC.
 
 In the first place, my theory about bogus RAS setup is all wet.
 What's actually being used in librump for HPPA is
 sys/rump/librump/rumpkern/atomic_cas_generic.c
 which does not use the RAS mechanism.  Instead it relies on
 __cpu_simple_lock(), a simple LDCW spinlock that looks entirely
 bulletproof --- but by experiment, it fails intermittently
 in various t_ping test cases.  I eventually realized that each
 of the problem tests is forking a child process and then running
 rump kernels in both the parent and child processes.  The two
 kernels communicate via a memory-mapped host file (cf.
 rump_pub_shmif_create) and we're trying to synchronize via
 atomic_cas_32() applied to a word in that shared memory.  That's
 fine, but what makes atomic_cas_32 actually atomic?  AFAICS, the
 LDCW spinlock storage is a static array in atomic_cas_generic.c,
 which means that *each rump kernel has its own copy* after the
 fork.  Therefore, __cpu_simple_lock() successfully interlocks
 between threads in each rump kernel, but not between threads in
 the two kernels, making atomic_cas_32() not at all atomic for
 if_shmem.c's "bus" lock.
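 
 To make the failure mode concrete, here is a minimal standalone
 sketch (my own illustration, not the actual rump code) of a CAS
 emulated with a process-local lock.  After fork(), each process has
 its own copy of the lock, so two processes doing this against the
 same word in a shared mapping can interleave freely:
 
    #include <stdint.h>
    #include <pthread.h>
 
    /* Copied into the child at fork(); no longer shared! */
    static pthread_mutex_t cas_lock = PTHREAD_MUTEX_INITIALIZER;
 
    uint32_t
    emulated_cas_32(volatile uint32_t *ptr, uint32_t old, uint32_t new)
    {
            uint32_t ret;
 
            /* Serializes threads within THIS process only. */
            pthread_mutex_lock(&cas_lock);
            ret = *ptr;
            if (ret == old)
                    *ptr = new;
            pthread_mutex_unlock(&cas_lock);
            return ret;
    }
 
 Both processes can be inside their "locked" sections at once, so the
 read-test-store sequence is not atomic across the two rump kernels.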
 
 According to Makefile.rumpkern, atomic_cas_generic.c is used when
 
 .if (${MACHINE_CPU} == "arm" && "${FEAT_LDREX}" != "yes") \
     || ${MACHINE_ARCH} == "coldfire" || ${MACHINE_CPU} == "hppa" \
     || ${MACHINE_CPU} == "mips" || ${MACHINE_CPU} == "sh3" \
     || ${MACHINE_ARCH} == "vax" || ${MACHINE_ARCH} == "m68000"
 
 so I'd kind of expect these tests to fail on all of those arches.
 (Their spinlock mechanisms vary of course, but the problem of
 the spinlock data being process-local will be the same for all.)
 It's striking though that SPARC is not in this list.  Perhaps
 it has some related failure mechanism?  I see that
 src/common/lib/libc/arch/sparc/atomic/atomic_cas.S
 makes use of a "locktab array" that looks like it'd have the
 same problem of being process-local, but I am not sure if that
 code is used in a rump kernel, or whether it applies to the
 SPARC variant being complained of here.
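 
 For reference, the scheme in both files is shaped roughly like this
 (the table size and hash below are illustrative, not copied from the
 real source): a static table of spinlocks, indexed by a hash of the
 target address.
 
    /* Illustrative sketch: the lock table sits in each process's
     * private data segment, so fork() duplicates it exactly like
     * the generic C version's locks. */
    static __cpu_simple_lock_t locktab[128];
 
    #define LOCKIDX(p)      (((uintptr_t)(p) >> 3) & 127)
 
    /* atomic_cas then does, in effect:
     *      __cpu_simple_lock(&locktab[LOCKIDX(ptr)]);
     *      ret = *ptr; if (ret == old) *ptr = new;
     *      __cpu_simple_unlock(&locktab[LOCKIDX(ptr)]);
     */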
 
 Anyway, I'm not sure that I see a practical solution to make
 these test cases work on these arches.  Potentially we could
 add a spinlock field to struct shmif_mem, but getting
 atomic_cas_32 to use it would entail some ugly API changes.
 Maybe it's better just to skip the problematic test cases
 on these arches.
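 
 For what it's worth, the shared-lock idea would look something like
 this (the field name is invented).  The lock itself would interlock
 correctly, since __cpu_simple_lock() operates on whatever storage it
 is handed; the ugly part is teaching atomic_cas_32() to find it:
 
    /* Hypothetical: put the lock word in the shared file mapping,
     * so both rump kernels spin on the same storage.  (On HPPA it
     * would also need LDCW's 16-byte alignment.) */
    struct shmif_mem {
            __cpu_simple_lock_t shm_buslock;  /* shared, not per-process */
            /* ... existing fields ... */
    };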
 
 			regards, tom lane
 

