Port-sparc archive


Re: SPARCStation VSIMM



On Sat, 30 Oct 2021 13:20:13 +0200
Romain Dolbeau <romain%dolbeau.org@localhost> wrote:

> On Thu, 28 Oct 2021 at 20:58, Michael <macallan%netbsd.org@localhost> wrote:
> > That said, the suncg14 driver in -current supports a good part of the
> > most common xrender operations ( basically, what's needed to render
> > anti-aliased text and images with transparency ), it's disabled by
> > default since there aren't many people testing it. Options "xrender"
> > "true" will enable it.  
> 
> Very interesting ; so basically if you have some sort of sufficiently
> programmable core to deal with acceleration, you can do Xrender?

If the core can do per-channel multiplication, or SIMD multiplication,
you can. Some chips support this kind of operation directly in their
drawing engines ( like permedia2 or crime ), on others you had to abuse
the texture mapping engine. Modern chips are far more flexible; in a
way SX is the great-granddaddy of all those
gazillions-of-shader-cores-on-a-chip monstrosities.
Then there are oddballs like ffb which have ALUs in their RAM chips so
you can draw antialiased characters by poking a few parameters and then
simply writing your colour and alpha info into the RAMs which then do
the shading, more or less at the speed your CPU can shove the data
through the UPA port, which is 64bit at 120MHz on an Ultra 60.
 
> I'm guessing from 'transparency' you also presumably need 24+ bits
> TrueColor, not some 8-bits indexed mode?

These operate on pixels as RGB values; doing this on 8bit indexed
colour would be rather difficult. Wsdisplay cheats by using an R3G3B2
truecolour palette to render anti-aliased characters in 8bit.
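
A minimal sketch of the R3G3B2 idea, not wsdisplay's actual code: the
function names are made up for illustration, and expanding each field
back to 8 bits by bit replication is an assumption about how one might
fill the palette.

```c
#include <stdint.h>

/* Pack 8-bit-per-channel RGB into an R3G3B2 palette index:
 * top 3 bits of red, top 3 bits of green, top 2 bits of blue. */
static uint8_t rgb_to_r3g3b2(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((r & 0xe0) | ((g & 0xe0) >> 3) | (b >> 6));
}

/* Fill a 256-entry palette so that index i shows the colour encoded
 * by its own R3G3B2 bits. Fields are widened to 8 bits by repeating
 * their bit pattern (an assumption, other scalings are possible). */
static void r3g3b2_palette(uint8_t pal[256][3])
{
    for (int i = 0; i < 256; i++) {
        uint8_t r3 = (uint8_t)((i >> 5) & 7);
        uint8_t g3 = (uint8_t)((i >> 2) & 7);
        uint8_t b2 = (uint8_t)(i & 3);

        pal[i][0] = (uint8_t)((r3 << 5) | (r3 << 2) | (r3 >> 1));
        pal[i][1] = (uint8_t)((g3 << 5) | (g3 << 2) | (g3 >> 1));
        pal[i][2] = (uint8_t)(b2 * 0x55);
    }
}
```

With such a palette loaded, a blended RGB value can be written to an
8bit framebuffer as rgb_to_r3g3b2(r, g, b), at the cost of coarse
colour resolution.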

> Do you need FP, or is integer enough?

FP would be overkill in most of our cases: we've got 8 bits per
channel, so doing our calculations in 16 or 32bit integers is
sufficient.
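
To illustrate why integers are enough, here is a sketch of a
per-channel alpha blend in plain C. It is not taken from any driver;
the function names and the XRGB pixel layout are assumptions, and the
divide-by-255 uses the usual add-128-and-fold rounding trick.

```c
#include <stdint.h>

/* Blend one 8-bit channel:
 * result = round((src * alpha + dst * (255 - alpha)) / 255).
 * The intermediate value never exceeds 16 bits. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t t = (uint32_t)src * alpha + (uint32_t)dst * (255u - alpha) + 128u;

    /* (t + (t >> 8)) >> 8 is an exact rounded division by 255 here */
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Blend two pixels stored as 0x00RRGGBB with a constant alpha. */
static uint32_t blend_pixel(uint32_t src, uint32_t dst, uint8_t alpha)
{
    uint32_t out = 0;

    for (int shift = 0; shift < 24; shift += 8) {
        uint8_t s = (uint8_t)((src >> shift) & 0xff);
        uint8_t d = (uint8_t)((dst >> shift) & 0xff);

        out |= (uint32_t)blend_channel(s, d, alpha) << shift;
    }
    return out;
}
```

The per-channel multiply in blend_channel() is the operation that
accelerators with per-channel or SIMD multiplication can do for several
channels at once.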

> My current cg6 re-implementation [1] is basically a small RV32 core
> [2] and some microcode to implement the functionalities used by the
> PROM console and NetBSD (console and X11).
> It's not particularly fast, but as a proof-of-concept it seems to be
> working. It could be used for a 24/32-bits framebuffer PoC easily.
> (it has an optional FPU in SP or SP/DP but that adds a lot of area to
> the core).

Qemu went with emulating a tgx / S24 mostly because it's got a very
simple, stateless acceleration scheme and the S24 variant supports
24bit.

> An alternative I've considered would be to leverage Betrusted's
> Curve25519 crypto engine [3], it's basically a 50 MHz instruction
> sequencer that can do 100 MHz execution for multi-cycle instructions,
> with a wide (32x256 bits by default) register file (running at 200 MHz
> to get data fast enough);
> I've already added a simple Wishbone Load/Store unit to implement
> AES256 and GHASH (for the full GCM mode) in it directly out of a DVMA
> buffer.
> It's quite versatile, so using narrower registers and SIMD-like
> instructions would be very doable.
> The primary issue would be unaligned memory accesses; it seems to
> always be the issue...

... which is why (almost) nobody does 'real' 24bit colour.
SX has instructions to read a 32bit word into four registers and store
them that way too, even letting you pick which byte to grab from each
register. Very nice since it eliminates the need to shift things around
after multiplying.
All instructions have a count field: a single load/store instruction
can repeat up to 32 times on subsequent registers, math and logic
instructions up to 16. You can do things like take 16 registers, add
them to another 16 registers and store the results in yet another 16
registers, in one instruction. Not in one cycle though  ;)
Most instructions are generic but there are a few that scream graphics,
like ROP, which applies a standard X11 ROP to a range of registers, and
then there's one that takes two registers and a bitmask, picking which
one of the inputs to use based on the bitmask, on a range of memory
locations. In other words, colour expansion for monochrome font rendering.
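
In software terms, colour expansion amounts to something like the
sketch below. This is plain C, not SX code; the function name, the
MSB-first bit order and the 32bit destination pixels are assumptions
for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand one row of a 1-bit-per-pixel glyph into 32-bit pixels:
 * each set bit selects fg, each clear bit selects bg.
 * Bits are assumed MSB-first within each byte. */
static void expand_mono_row(const uint8_t *bits, size_t width,
                            uint32_t fg, uint32_t bg, uint32_t *dst)
{
    for (size_t x = 0; x < width; x++) {
        int set = (bits[x / 8] >> (7 - (x % 8))) & 1;

        dst[x] = set ? fg : bg;
    }
}
```

The hardware version does the same select across a range of memory
locations in one instruction instead of one pixel per loop iteration.
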
The annoying part is the way it synchronizes with the CPU - there's a
short, 8 entry pipeline you write instructions to, and if it's full SX
will stall the MBus until there is room, which can take a while, which
leads to nasty side effects like missed timer interrupts, IPIs timing
out and so on. There is no way to check the pipeline level, just a bit
that lights up when it's empty. No highwater interrupt or anything like
that. 
It's especially annoying on SMP machines where it stalls all CPUs on
the same MBus. My drivers try to avoid that by letting the pipeline
drain every couple of instructions and never feeding it enough to fill
the entire pipeline in one go.
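
The throttling idea can be sketched roughly as below. This is not the
actual driver code: struct fake_sx and every function here are
hypothetical stand-ins that only model the behaviour described above
(an 8 entry queue and an "empty" indication), and the batch size of 4
is an arbitrary assumption.

```c
#include <stdint.h>

#define SX_QUEUE_DEPTH 8  /* pipeline depth, as described above */
#define SX_BATCH       4  /* hypothetical batch size, below the depth */

/* Hypothetical software model of the instruction queue. */
struct fake_sx {
    int queued;      /* instructions currently in the queue */
    int max_queued;  /* highest fill level seen */
    int stalls;      /* writes that hit a full queue */
};

/* Model of writing one instruction; real hardware would stall the
 * bus on a full queue, here we only count that it happened. */
static void fake_sx_write(struct fake_sx *sx, uint32_t insn)
{
    (void)insn;
    if (sx->queued == SX_QUEUE_DEPTH) {
        sx->stalls++;
        sx->queued--;   /* pretend one entry retired during the stall */
    }
    sx->queued++;
    if (sx->queued > sx->max_queued)
        sx->max_queued = sx->queued;
}

/* Model of the only status available: "queue is empty". */
static int fake_sx_idle(const struct fake_sx *sx)
{
    return sx->queued == 0;
}

/* Model of the hardware retiring one instruction. */
static void fake_sx_tick(struct fake_sx *sx)
{
    if (sx->queued > 0)
        sx->queued--;
}

/* A real driver would poll the status bit here instead of ticking. */
static void sx_wait_idle(struct fake_sx *sx)
{
    while (!fake_sx_idle(sx))
        fake_sx_tick(sx);
}

/* Feed instructions in small batches, draining in between, so that a
 * write never finds the queue full. */
static void sx_submit(struct fake_sx *sx, const uint32_t *insns, int n)
{
    for (int i = 0; i < n; i++) {
        if (i > 0 && i % SX_BATCH == 0)
            sx_wait_idle(sx);
        fake_sx_write(sx, insns[i]);
    }
}
```

The trade-off is that waiting for the queue to drain leaves the engine
idle for a moment, in exchange for never stalling the bus.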

have fun
Michael

