Port-mips archive


Re: __atomic_test_and_set() and mips o32 - help wanted



On Wed, 19 Nov 2025, Jason Thorpe wrote:

> > The traps themselves are cheap, but that does not always equate to 
> > “trap handling is cheap”.
> > 
> > I agree with you vis a vis “making appropriate trade-offs is good, 
> > actually”, but I’d also like to point out that work described in the 
> > original paper on restartable atomic sequences was done on an R3000 
> > (DECstation 5000/200).

 Interesting, thanks, I didn't know.  I guess that came as a late attempt 
to make up for an obviously missing architecture feature (even the low-end 
8086 CPU had MP atomicity implemented several years before the MIPS ISA 
came out, to say nothing of more developed CPU architectures of the time 
such as the VAX).

> It’s also worth remembering that emulating ll/sc requires **two** traps 
> per “atomic test-and-set”.  Unless, of course, you’re going to increase 
> the complexity of the trap handler to work forward from the ll and 
> interpret the instructions up to the sc.

 Yes, it's true that two traps are required for LL/SC emulation; however, 
the trap handler can be optimised for this case if needed, as the Reserved 
Instruction exception is not a commonly taken execution path, unlike for 
example the Syscall exception.  One possibility would be to avoid a full 
switch to the kernel stack and to limit the use of registers, so that not 
all temporaries have to be saved/restored in the prologue/epilogue as they 
normally are in an exception handler (conversely, the Syscall handler can 
just follow the psABI as if it were an ordinary function call and save no 
registers at all in its prologue/epilogue).

 Decades ago I investigated a fast-path emulation of the RDHWR instruction 
(for TLS pointer retrieval with MIPS architecture revisions that do not 
have the CP0 UserLocal register) and came to fewer than 20 instructions 
total executed in kernel mode.  That can surely be on a par with the 
overhead of a function call.  OTOH RDHWR is obviously trivial to emulate, 
but then the function call for a restartable atomic sequence will have 
other code beyond just making the call itself.
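
 To illustrate why RDHWR is trivial to emulate, here is a minimal sketch in 
C of the decode-and-emulate step.  It is not the fast-path handler 
mentioned above (which would be written in assembly); the `trapframe` 
layout, the `tls` field and the function name are assumptions made up for 
the example, and branch delay slots are ignored.  The only facts relied on 
are the RDHWR encoding (SPECIAL3 major opcode, function 0x3b) and hardware 
register 29 being UserLocal.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDHWR_MASK  0xffe007ffu /* opcode, rs, sa and function fields */
#define RDHWR_MATCH 0x7c00003bu /* SPECIAL3 opcode, RDHWR function */
#define HWR_ULR     29          /* hardware register 29: UserLocal */

/* Hypothetical trap frame: saved user GPRs, EPC and the TLS pointer. */
struct trapframe {
    uint32_t regs[32];
    uint32_t epc;
    uint32_t tls;
};

/* Return true if insn was "rdhwr rt, $29" and has been emulated. */
static bool emulate_rdhwr(struct trapframe *tf, uint32_t insn)
{
    if ((insn & RDHWR_MASK) != RDHWR_MATCH)
        return false;               /* not RDHWR at all */
    if (((insn >> 11) & 0x1f) != HWR_ULR)
        return false;               /* some other hardware register */

    unsigned rt = (insn >> 16) & 0x1f;
    if (rt != 0)                    /* writes to $0 are discarded */
        tf->regs[rt] = tf->tls;
    tf->epc += 4;                   /* assumes no branch delay slot */
    return true;
}
```

 For example, `rdhwr $3, $29` encodes as 0x7c03e83b, so passing that word 
copies `tf->tls` into `tf->regs[3]` and advances `tf->epc` by 4.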

 Though for actual instruction emulation the handler has to have access to 
an array of saved user registers to index into anyway, so there might not 
be much room for manoeuvre here after all.
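
 The dependence on a saved register array can be seen in a rough C model 
of LL/SC emulation such as the one below.  It is only a sketch under 
stated assumptions: the `uframe` structure, the `ll_bit`/`ll_addr` fields 
and the `user_mem` array standing in for user memory are all hypothetical, 
the opcodes are the pre-R6 encodings (0x30 for LL, 0x38 for SC), and a 
real kernel would also have to clear the link bit on context switch and 
handle faults and delay slots.  Comparing the SC address against the LL 
address is a choice made for this model.

```c
#include <stdbool.h>
#include <stdint.h>

#define OP_LL 0x30u /* major opcode of LL (pre-R6 encoding) */
#define OP_SC 0x38u /* major opcode of SC (pre-R6 encoding) */
#define MEM_WORDS 64

/* Hypothetical per-thread state kept by the emulating kernel. */
struct uframe {
    uint32_t regs[32]; /* saved user GPRs, indexed by register number */
    uint32_t epc;      /* address of the trapping instruction */
    bool     ll_bit;   /* emulated link bit */
    uint32_t ll_addr;  /* address the link was established on */
};

/* Stand-in for user memory so that the model runs as a normal program. */
static uint32_t user_mem[MEM_WORDS];

/* Emulate one LL or SC; each call models one trap into the kernel. */
static bool emulate_llsc(struct uframe *uf, uint32_t insn)
{
    uint32_t op   = insn >> 26;
    unsigned base = (insn >> 21) & 0x1f;
    unsigned rt   = (insn >> 16) & 0x1f;
    int32_t  off  = (int16_t)(insn & 0xffff); /* sign-extended offset */
    uint32_t addr = uf->regs[base] + (uint32_t)off;

    if (op != OP_LL && op != OP_SC)
        return false;  /* some other reserved instruction */
    if ((addr & 3) != 0 || addr / 4 >= MEM_WORDS)
        return false;  /* a real handler would raise an address error */

    if (op == OP_LL) {
        uf->ll_bit = true;
        uf->ll_addr = addr;
        if (rt != 0)
            uf->regs[rt] = user_mem[addr / 4];
    } else {
        bool ok = uf->ll_bit && uf->ll_addr == addr;
        if (ok)
            user_mem[addr / 4] = uf->regs[rt];
        uf->ll_bit = false;
        if (rt != 0)
            uf->regs[rt] = ok; /* SC writes 1 on success, 0 on failure */
    }
    uf->epc += 4;      /* assumes no branch delay slot */
    return true;
}
```

 Both the base and the target register are looked up by number in 
`regs[]`, which is why the handler cannot get away with saving only a 
couple of fixed temporaries.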

 I think emulating the whole instruction sequence between LL and SC would 
be asking for trouble and likely not any more efficient.  I was actually 
given a suggestion to do that with the kernel side of the non-BWX Alpha 
issue I mentioned earlier on, but I could not be convinced it would be a 
more robust approach.

  Maciej

