Subject: Re: [RFC] Interface to hardware-assisted data movers
To: <thorpej@wasabisystems.com>
From: <cgd@broadcom.com>
List: tech-kern
Date: 06/21/2002 17:49:45
So, some comments.
Some of these are "things that need to be addressed now because what's
there is broken."  Some are "things that should be addressed now so
you can do things more dynamically / smarter in the future, if the
need arises."
There is no real ordering to these points, and there may not be much
sense to them, either.  8-)
* Suggest splitting the dmover front-end and back-end interfaces into
separate manual pages, and maybe changing function names, e.g.:
dm_*
dmbackend_* or something.
Users of the interface really don't care about half of this
document. 8-)
* I'm a bit concerned about the separation of this from the
'transform' API.
A single piece of HW may be able to provide both types of interfaces
([dm] generic data movement, zero, prefetch-into-l2, copy; [xform]
generic up-to-32-bit CRC generation, IP checksum).
It's therefore important to be able to have the two seamlessly
integrated into a single back-end driver, and I dunno how well that
will be possible with this API.
Also, note one impact of that: if the device has a certain
sustainable throughput/bandwidth, both types of use will contribute
to reaching that limit.  (A sketch of a unified back-end follows.)
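For instance (a sketch only; every name here is invented, nothing is
from the RFC), a back-end could hand the framework one table covering
both classes, so queueing and the shared bandwidth budget live in one
place:

#include <stdint.h>

/*
 * Sketch only: a single back-end advertises both dm-class and
 * xform-class functions through one registration, so one shared
 * throughput budget covers both kinds of use.
 */
struct hypo_function {
    const char *hf_name;     /* e.g. "copy", "zero", "crc32" */
    int         hf_class;    /* hypothetical DM vs. XFORM tag */
    int         hf_ninputs;
};

struct hypo_backend {
    const char                 *hb_name;
    const struct hypo_function *hb_funcs;   /* both classes, one table */
    int                         hb_nfuncs;
    uint64_t                    hb_bytes_per_sec;  /* shared budget */
};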
* The load-balancing algorithm, etc., seems a bit ad hoc.
Additionally, statically assigning sessions to back-ends for all
time also seems limiting.  Why restrict things by describing it that
way?
Random thought that popped into my head: if you have some kind of HW
assist module which gets removed from the system (!!), in the current
scheme all dmover clients whose sessions happened to be assigned to
that module will need to squish their sessions and create new ones.
The requirement that HW be used first is kinda lame... what if your
XOR engine is maxed out, but you've got a dual-processor system
that's idle waiting on XORs to finish?
* dmover_request struct:
* dreq_imm{8,16,64} don't seem to be used by any existing
method.
* For the 'fill' methods, is there any reason you don't just
use & require a simple linear buffer to supply the data? (i.e.,
supply a ptr to the value rather than the value itself in the
struct?)  That would simplify the interface a bit, at
approximately no cost.  dreq_imm32 would then go away.
(annoying thing about this is, then you need persistent
storage for asynchronous use of the data...)
* Hmm, maybe make it a ptr to the inbuf array?  It would be
nice if dmover_request structs were fixed-size (they could be
poolified), plus variable-sized structs are always fun to
allocate.
* The above immediates should be unioned together, if anything
more than a ptr to dreq_inbuf is left.
* Should include a count of dreq_inbuf array elements, to
allow the back-end to sanity-check.  (A sketch of the
resulting fixed-size struct follows.)
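Putting those together, the request might end up shaped something
like this; this is my guess at a shape (the sketch_* names are
invented), not the RFC's actual definition:

#include <stddef.h>
#include <stdint.h>

/* Buffer descriptor where the len travels with the ptr. */
struct sketch_dmover_buf {
    void   *db_addr;
    size_t  db_len;
};

/*
 * Fixed-size request: the inbuf array hangs off a pointer (so the
 * struct can be poolified), with an explicit count the back-end can
 * sanity-check, and the immediates collapse into one field, if they
 * survive at all.
 */
struct sketch_dmover_request {
    /* ...session ptr, callback, flags, error, etc... */
    struct sketch_dmover_buf   dreq_outbuf;
    struct sketch_dmover_buf  *dreq_inbuf;    /* ptr, not inline array */
    int                        dreq_ninputs;  /* count, for sanity checks */
    uint8_t                    dreq_immediate[8];
};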
* Use of hard-coded strings is nice for extensibility, but error-prone
and a bit wasteful.  Suggest providing #defines for "globally
recognized" methods.  OK, they can expand into strings or whatever,
but then at least you get compile-time checking of correctness and
maybe some space savings, e.g.:
/* declare in a header... */
extern const char dmover_function_zero_name[];
#define DMOVER_FUNCTION_ZERO dmover_function_zero_name

/* ...and define & init to "zero-block" in some .c file: */
const char dmover_function_zero_name[] = "zero-block";
People don't have to use the #defines, but if they do then they know
they probably got the function they wanted if it compiles. 8-)
* The 'hwassist' specifications are ... lame.  Make the back-ends
estimate their throughput in terms of bytes/sec (or, if you want to
be modern, MB/sec).
Provide a way for sessions to estimate their throughput requirements.
Try to do a somewhat saner assignment based on throughput/bandwidth
than it sounds like you're planning, or can do, with the current
interface.
Then, provide mechanisms to collect actual statistics as a
debugging/diagnostic tool. 8-)
Consider further: say you have two movers with equal capabilities,
both of which can copy data but only one of which is XORiffic.  You
first get 6 requests for sessions for high-bandwidth copying, and
one for really-high-bandwidth 5-way XOR.  With the current fixed
allocation scheme, the first 6 will get split 3 & 3, then the XOR
will get stuck on an already-loaded mover.  What you _should_ do is
shuffle the 6 moves off to the lame mover, but do the XOR on the
XORiffic one.  (I'm thinking 5-way -> 1 dest, so 6 * single-stream
bytes/s, or possibly seven depending on the DM implementation.  8-)
A sketch of this kind of bandwidth accounting follows.
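For instance, session assignment could track committed vs. estimated
bandwidth per back-end.  Again a sketch with invented names, assuming
each back-end self-reports a bytes/sec estimate:

#include <stddef.h>
#include <stdint.h>

struct hypo_backend_load {
    uint64_t hb_bytes_per_sec;  /* back-end's own estimate */
    uint64_t hb_committed;      /* sum of its sessions' estimates */
    int      hb_can_do_func;    /* supports the requested function? */
};

/* Pick the capable back-end with the most spare estimated bandwidth. */
static struct hypo_backend_load *
pick_backend(struct hypo_backend_load *v, int n, uint64_t need)
{
    struct hypo_backend_load *best = NULL;
    uint64_t best_spare = 0;
    int i;

    for (i = 0; i < n; i++) {
        uint64_t spare;

        if (!v[i].hb_can_do_func ||
            v[i].hb_committed >= v[i].hb_bytes_per_sec)
            continue;
        spare = v[i].hb_bytes_per_sec - v[i].hb_committed;
        if (spare >= need && spare > best_spare) {
            best = &v[i];
            best_spare = spare;
        }
    }
    return best;    /* NULL: nobody fits; fall back to software */
}

(This still binds a session to one back-end at creation time; the
bidding scheme at the end of this message avoids even that.)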
* dmover_bufptr_t: a uio includes the amount of stuff that needs to
be transferred (resid); a void * does not.  You seem to have a 'len'
outside these anyway, in dreq_outbuf and dreq_inbuf[].  Seems that
the void * should be paired with a len, and the uio should be left
to go on its merry way unassisted.  (Sketch below.)
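Something like this, say; a sketch of the pairing I mean, not
dmover_bufptr_t's actual layout:

#include <stddef.h>

struct uio;     /* opaque here; carries its own resid */

typedef union {
    struct {
        void   *dmbuf_addr;
        size_t  dmbuf_len;      /* len travels with the ptr */
    } dmbuf_linear;
    struct uio *dmbuf_uio;      /* length already lives inside */
} sketch_dmover_bufptr_t;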
* Can you have mismatches in total input and output length in this
interface?  8-)  Are the extra lengths just there for sanity?
* flags and processing:
* maybe a flag to say "owned by the dm system"? (i.e.,
dmover_process() called on this request)
* similarly, a flag to say "on hardware"? (i.e., assigned to
a specific back-end, on the hardware ring buffer, etc.)
* a mechanism to cancel/abort requests? (sketch of all three below)
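E.g. (invented names; purely illustrative):

/* Request-state bits along the lines suggested above. */
#define SDREQ_F_SUBMITTED   0x01  /* dmover_process() called; dm owns it */
#define SDREQ_F_ON_HARDWARE 0x02  /* handed to a back-end / on its ring */
#define SDREQ_F_CANCEL      0x04  /* abort requested; reap when safe */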
* dmover_session_create description: "specied"? (typo 8-)
* dmover_process: if an op has to be done in SW, or for other reasons,
can the callback fn be invoked before dmover_process returns? (I
think so!)  (Beware uses which continually increase stack depth; see
the sketch below!)
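The trap looks like this; a sketch reusing the invented sketch_*
names from above:

struct sketch_dmover_request;
void sketch_dmover_process(struct sketch_dmover_request *);

/*
 * If the completion callback may run synchronously from within
 * dmover_process() (e.g. on a SW fallback path), a callback that
 * resubmits recurses: process -> SW copy -> callback -> process
 * -> ... and the stack only ever gets deeper.
 */
static void
resubmit_cb(struct sketch_dmover_request *dreq)
{
    /* ...consume the completed request, then queue it again... */
    sketch_dmover_process(dreq);    /* may re-enter this callback
                                     * before this frame returns! */
}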
* For sessions, the number of inputs is determined by the session
function type.  For back-ends, there seem to be 'algorithm' and a
separate 'number of inputs'.  Will this be confusing?  (I assume the
goal here is to allow the interface to compose large-#-input XORs
from small-#-input XOR data movers?  Yet another reason why you don't
want to nail a session down to a specific back-end!  4-input XOR with
two 3-input XOR engines, or three 2-input ones!  Pipelining!  8-)
See the decomposition sketch below.
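To spell out the three-2-input-engine case in plain C, with each
xor2() call standing in for one pass through a 2-input engine (names
invented):

#include <stddef.h>
#include <stdint.h>

static void
xor2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        dst[i] = a[i] ^ b[i];
}

/* 4-input XOR from three 2-input passes; needs one temp buffer. */
static void
xor4(uint8_t *dst, const uint8_t *in[4], uint8_t *tmp, size_t len)
{
    xor2(tmp, in[0], in[1], len);   /* pass 1: tmp = i0 ^ i1 */
    xor2(dst, in[2], in[3], len);   /* pass 2: dst = i2 ^ i3 */
    xor2(dst, dst, tmp, len);       /* pass 3: dst = tmp ^ dst */
    /*
     * Passes 1 and 2 are independent: two engines could run them
     * in parallel, or one engine could pipeline all three.
     */
}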
I think the biggest problem I have with the current design is the
method of assigning requests to movers, and the static assignment of
sessions to movers.  I think this will probably require a bit of
study and evolution in the interface, but the current interface spec
is _so_ deficient in this regard that it pretty much has to get
better.  8-)
There are a few schemes which are probably worth considering.
* assignment of sessions to back-ends, with measurement of
utilization/performance & adjustment as needed.
* make back-ends bid for requests, offering an 'approx time to
completion', and selecting something close to the least, while trying
to keep HW queues / ring buffers 'full enough'.
Assuming you're not promising anything about guaranteed FIFO ordering
or guaranteed bandwidth, I suspect the latter may work best. (e.g.,
consider pipelining of requests, consider the case where you've got
two DMs that you're trying to use for a single session that needs both
of their throughput...)
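Roughly, bidding could look like this; a sketch with invented names,
reusing sketch_dmover_request from above:

#include <stdint.h>

struct sketch_dmover_request;

struct bidder {
    /*
     * Estimated nanoseconds until this request would complete,
     * given the back-end's current queue; UINT64_MAX if the
     * back-end can't do the requested function at all.
     */
    uint64_t (*bid)(void *cookie, const struct sketch_dmover_request *);
    void     *cookie;
};

/* Ask every back-end (SW fallback included); take the lowest bid. */
static struct bidder *
dispatch(struct bidder *b, int n, const struct sketch_dmover_request *req)
{
    struct bidder *best = NULL;
    uint64_t best_bid = UINT64_MAX;
    int i;

    for (i = 0; i < n; i++) {
        uint64_t t = b[i].bid(b[i].cookie, req);

        if (t < best_bid) {
            best_bid = t;
            best = &b[i];
        }
    }
    return best;    /* NULL only if every bid was UINT64_MAX */
}

Keeping queues 'full enough' falls out naturally: a back-end's bid
just grows with its current queue depth.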
cgd
--
Chris Demetriou Broadcom Corporation
Senior Staff Design Engineer Broadband Processor Business Unit
Any opinions expressed in this message are mine, not necessarily Broadcom's.