Port-mips archive
Re: __atomic_test_and_set() and mips o32 - help wanted
On Wed, 19 Nov 2025, Jason Thorpe wrote:
> > The traps themselves are cheap, but that does not always equate to
> > “trap handling is cheap”.
> >
> > I agree with you vis-à-vis “making appropriate trade-offs is good,
> > actually”, but I’d also like to point out that the work described in the
> > original paper on restartable atomic sequences was done on an R3000
> > (DECstation 5000/200).
Interesting, thanks, I didn't know. I guess that came as a late attempt
to make up for an obviously missing architecture feature (even the low-end
8086 CPU had MP atomicity implemented several years before the MIPS ISA
came out, to say nothing of more developed CPU architectures of the time
such as the VAX).
> It’s also worth remembering that emulating ll/sc requires **two** traps
> per “atomic test-and-set”. Unless, of course, you’re going to increase
> the complexity of the trap handler to work forward from the ll and
> interpret the instructions up to the sc.
Yes, it's true that two traps are required for LL/SC emulation; however,
the trap handler can be optimised for this case if needed, as the Reserved
Instruction exception is not a common execution path, unlike for example
the Syscall exception. One possibility would be to avoid a full switch to
the kernel stack and to limit the use of registers so that not all
temporaries have to be saved/restored in the prologue/epilogue as they
normally are in an exception handler (conversely, the Syscall handler can
just follow the psABI as if it were an ordinary function call and save no
registers at all in its prologue/epilogue).
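To illustrate why LL/SC emulation costs two traps, here is a minimal
user-space model in C, not kernel code. Everything in it is an assumption
made for illustration: the trapframe layout, the per-thread llsc_state, the
user_mem array standing in for user memory, and the address check on SC
(real hardware need not compare addresses). Each call to emulate_llsc()
corresponds to one Reserved Instruction trap.

```c
#include <stdbool.h>
#include <stdint.h>

#define OP_LL 0x30u  /* major opcode of LL */
#define OP_SC 0x38u  /* major opcode of SC */

/* Hypothetical trap frame: saved user GPRs plus the exception PC. */
struct trapframe {
    uint32_t regs[32];
    uint32_t epc;
};

/* Hypothetical per-thread state replacing the hardware LLbit. */
struct llsc_state {
    bool     llbit;
    uint32_t lladdr;
};

/* Stand-in for user memory, covering byte addresses 0..255. */
uint32_t user_mem[64];

/* base register + sign-extended 16-bit offset */
static uint32_t effective_address(const struct trapframe *tf, uint32_t insn)
{
    uint32_t off = insn & 0xffffu;
    if (off & 0x8000u)
        off |= 0xffff0000u;
    return tf->regs[(insn >> 21) & 0x1f] + off;
}

/* Returns true if insn was an LL or SC and has been emulated. */
bool emulate_llsc(struct trapframe *tf, struct llsc_state *st, uint32_t insn)
{
    unsigned op = insn >> 26;
    unsigned rt = (insn >> 16) & 0x1f;

    if (op != OP_LL && op != OP_SC)
        return false;

    uint32_t addr = effective_address(tf, insn);
    if ((addr & 3) != 0 || addr >= sizeof(user_mem))
        return false;  /* a real handler would raise an address error */

    if (op == OP_LL) {
        /* First trap: remember the address and load the word. */
        st->llbit = true;
        st->lladdr = addr;
        if (rt != 0)
            tf->regs[rt] = user_mem[addr / 4];
    } else {
        /* Second trap: store only if the reservation is still held. */
        bool ok = st->llbit && st->lladdr == addr;
        if (ok)
            user_mem[addr / 4] = tf->regs[rt];
        st->llbit = false;
        if (rt != 0)
            tf->regs[rt] = ok ? 1 : 0;
    }
    tf->epc += 4;  /* assumes the instruction is not in a branch delay slot */
    return true;
}

/* Any context switch or exception return must drop the reservation. */
void llsc_context_switch(struct llsc_state *st)
{
    st->llbit = false;
}
```

Note that the handler indexes into the saved register array for both the
base register and rt, so all user registers have to be reachable from it.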
Decades ago I investigated a fast-path emulation of the RDHWR instruction
(for TLS pointer retrieval with MIPS architecture revisions that do not
have the CP0 UserLocal register) and got it down to fewer than 20
instructions executed in kernel mode in total. That can surely be on a
par with the overhead of a function call. OTOH RDHWR is obviously trivial
to emulate, but then again the function call for a restartable atomic
sequence will have other code beyond just making the call itself.
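As a rough idea of how little work RDHWR emulation involves, the decode
step can be sketched in C as below. This is not the fast path mentioned
above (which would be hand-written assembly); the trapframe layout and the
tls_pointer argument are assumptions made for illustration. The instruction
encoding used is the standard one, e.g. 0x7c03e83b for "rdhwr $3, $29".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical trap frame: saved user GPRs plus the exception PC. */
struct trapframe {
    uint32_t regs[32];
    uint32_t epc;
};

#define RDHWR_MASK  0xffe007ffu  /* opcode, rs, bits 10..6 and function */
#define RDHWR_MATCH 0x7c00003bu  /* SPECIAL3 (0x1f), function RDHWR (0x3b) */
#define HWR_ULR     29           /* hardware register 29: UserLocal (TLS) */

/*
 * Returns true if insn was "rdhwr rt, $29" and has been emulated by
 * writing tls_pointer to the saved copy of rt.
 */
bool emulate_rdhwr(struct trapframe *tf, uint32_t insn, uint32_t tls_pointer)
{
    if ((insn & RDHWR_MASK) != RDHWR_MATCH)
        return false;             /* not RDHWR: take the slow path */

    unsigned rt = (insn >> 16) & 0x1f;
    unsigned rd = (insn >> 11) & 0x1f;

    if (rd != HWR_ULR)
        return false;             /* only the TLS register is handled here */

    if (rt != 0)                  /* $0 is hard-wired to zero */
        tf->regs[rt] = tls_pointer;

    tf->epc += 4;  /* assumes the instruction is not in a branch delay slot */
    return true;
}
```

A mask-and-compare, two field extractions and one store are all that is
needed, which is why an instruction count comparable to a function call is
plausible for this case.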
Though for actual instruction emulation the handler has to have access to
an array of saved user registers to index into anyway, so there might not
be much room for manoeuvre here after all.
I think emulating the whole instruction sequence between LL and SC would
be asking for trouble and likely no more efficient. I was actually given
a suggestion to do that with the kernel side of the non-BWX Alpha issue I
mentioned earlier on, but I could not be convinced that it would be a more
robust approach.
Maciej