Current-Users archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Patch to cache TLS in %fp7 to avoid _lwp_getprivate syscalls



I recently started hacking on NetBSD/amiga and have been working on a
patch that I'd like to contribute to -current. I still need to wrap
the code with an appropriate #ifdef and do some more testing, but I'd
like to get some feedback and find out if there's an open PR for this
issue, or if I should create a new PR for the patch. I've attached the
current version with comments below. It's only beneficial for m68k
systems with a hardware FPU.

The problem is that -current has new TLS support which is currently
implemented on m68k with a system call to _lwp_getprivate(). This
causes a storm of system calls when pthreads are linked, as every call
to pthread_self() results in a system call, including stdio calls,
which are protected by POSIX mutexes. Another symptom is that ATF test
cases mutex2 and mutex3 (in lib/libpthread/t_mutex) time out after 300
seconds.

I patched lib/libc/arch/m68k/sys/__m68k_read_tp.S to cache the value
returned from _lwp_getprivate in %fp7, with the high 32 bits set to
indicate a signaling NaN with a specific bit pattern (the default FP
register value is 0x7fffffff...). On entry to the function, %fp7 is
pushed to the stack, and if the high long word matches the special bit
pattern, the lower 32 bits are returned as the TLS pointer. If the
most significant 16 bits of %fp7 are not 0x7fff, then %fp7 is presumed
to be in use, and the value is not overwritten.

It was also necessary to patch _lwp_makecontext() in
lib/libc/arch/m68k/gen/_lwp.c to update %fp7 with the new TLS base
when a new context is created, or else it would inherit the cached TLS
base of the parent LWP. The FP registers are stored as extended
precision in the uc_mcontext, so the magic bit pattern and TLS base
are shifted 11 bits. The register is only modified if _UC_FPU is set
in uc_flags and the current value of %fp7 is a signaling NaN.

No kernel changes or other diffs are required. Because %fp7 is the
lowest priority register for GCC, I could find very little usage in
the NetBSD userland (at least from running "objdump -d" and grepping
for "fp7"). I'm investigating whether GCC generated code is any
different if I change %fp7 to reserved in FIXED_REGISTERS in
gcc/config/m68k/m68k.h. At any rate, all existing code will continue
to work due to the check for the special NaN bit pattern, unless there
is some code out there that is specifically (ab)using NaN values in FP
registers in the same way to hold integer values.

My system is stable so far, and test cases mutex2 and mutex3 now pass
(in 148 seconds, on a 40MHz 68040). The attached patch includes two
other small and somewhat related changes: I commented out the SMT
spinloop code in libpthread/pthread_lock.c (with #ifdef SMP), and
changed "tas" to "bset #7" in sys/arch/m68k/include/lock.h.

I'd like to see if all of the spinloop code can be commented out for
non-SMP archs like m68k, and changed to sched_yield() or similar. The
FreeBSD implementation of pthreads reads the "hw.ncpu" sysctl and only
spins if ncpu > 1. So I think it'd be nice if NetBSD only included the
spinloop code on archs that support SMP, and then only activated it
when the CPU count is > 1. The pthread_lock.c modification seems the
safest, as it already called sched_yield(), but there are other places
in pthreads that only spinwait, so I need to do more testing to see
how the behavior changes when sched_yield() is called in all places
that would otherwise spin. Any suggestions for the best way to
approach this set of changes would be appreciated.

As for the "tas" instruction, the Amiga hardware reference manual
warns that the special read-modify-write cycle it generates is unsafe
on the Amiga platform. I haven't seen any indication of flakiness on
my system, but it has the same effect as "bset #7" on non-SMP systems,
and because NetBSD doesn't support SMP on m68k (AFAIK), it should be
safest to change this for all m68k machines. There's a comment to this
effect in external/gpl3/gcc/dist/libstdc++-v3/config/cpu/m68k/atomicity.h,
where "tas" is only used when __mcpu32__ or __mcfisab__ or __mcfisac__
is defined, and "bset #7" is used otherwise.

For the %fp7 changes, what would be an appropriate #ifdef guard for
the new code? I'm thinking #ifndef SOFTFLOAT, but perhaps there's
something better? I'm pretty sure that the software FP emulator will
run it correctly, but I suspect it'd be slower than the current code
that always calls _lwp_getprivate. At least for Amiga, Atari, and
Mac68k, I'd assume that the vast majority of NetBSD systems
(especially running -current) have an FPU, so I think it should be
enabled by default for those platforms (and possibly all except for
sun2)?

Regards,
Jake Hamby

Attachment: use-fp7-for-tls-patch.diff
Description: Binary data



Home | Main Index | Thread Index | Old Index