Subject: Re: [RFC] Interface to hardware-assisted data movers
To: None <thorpej@wasabisystems.com>
From: None <cgd@broadcom.com>
List: tech-kern
Date: 06/21/2002 17:49:45
So, some comments.

Some of these are "things that need to be addressed now because what's
there is broken."  Some of these are "things that should be addressed
now so you can do things more dynamically / smarter in the future, if
the need arises."

There is no real ordering to these points, and there may not be much
sense to them.  8-)


* Suggest splitting the dmover front-end and back-end interfaces into
  separate manual pages, and maybe changing the function names to
  match, e.g.:

	dm_*
	dmbackend_* or something.

  Users of the interface really don't care about half of this
  document.  8-)

* I'm a bit concerned about the separation of this from the
  'transform' API.

  A single piece of HW may be able to provide both types of interfaces
  ([dm] generic data movement, zero, prefetch-into-l2, copy; [xform]
  generic up-to-32bit CRC generation, IP checksum).

  It's therefore important that the two can be seamlessly integrated
  into a single back-end driver, and I dunno how well that will be
  possible with this API.

  Also note that one consequence of this is that if the device has a
  certain sustainable throughput/bandwidth, both types of use will be
  drawing from that same budget.
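
  To make that concrete, a rough sketch (purely hypothetical -- none of
  these names come from the RFC) of one back-end advertising both kinds
  of function against a single shared throughput figure:

	/* hypothetical sketch only; names invented, not from the RFC */
	#include <stdint.h>

	struct hypo_func {
		const char	*name;		/* function name string */
		unsigned int	 ninputs;	/* inputs it consumes   */
	};

	static const struct hypo_func mydev_funcs[] = {
		{ "copy",	1 },		/* dmover-style */
		{ "zero",	0 },
		{ "crc32",	1 },		/* xform-style  */
		{ "ip-cksum",	1 },
	};

	/* one throughput budget shared by *all* of the above */
	static const uint64_t mydev_bytes_per_sec =
	    200ULL * 1024 * 1024;		/* ~200 MB/s, say */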

* The load balancing algorithm, etc., seems a bit ad-hoc.
  Additionally, statically assigning sessions to back-ends for all
  time also seems limiting.  Why restrict things by describing it that
  way?

  Random thought that popped into my head: if you have some kind of HW
  assist module which gets removed from the system (!!), in the
  current scheme all dmover clients who happened to have their
  sessions assigned to that module will need to tear down their
  sessions and create them anew.

  The requirement that HW be used first is kinda lame...  What if your
  XOR engine is maxed out but you've got a dual-processor system
  that's sitting idle waiting on XORs to finish?

* dmover_request struct:

	* dreq_imm{8,16,64} don't seem to be used by any existing
	method.

	* for the 'fill' methods, is there any reason you don't just
	use & require a simple linear buffer to get the data?  (i.e.,
	supply a ptr to the value rather than just the value in the
	struct?)  That would simplify the interface a bit, with
	approximately no cost.  dreq_imm32 would then go away.
	(The annoying thing about this is that the caller then needs
	persistent storage for asynchronous use of the data...)

	* hmm, maybe make the inbuf array a pointer?  It would be nice
        if dmover_request structs were fixed-size (they could be
        poolified), plus variable-sized structs are always fun to
        allocate.

	* the above should be unioned together, if anything more than
	a ptr to a dreq_inbuf is left.

	* should include a count of dreq_inbuf array elements, to
        allow the back-end to sanity-check.  (A rough sketch pulling
        all of the above together follows.)
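
  Pulling those together, a sketch of what the request might look like
  (field names here are mine, not the RFC's):

	/* sketch only -- field names invented, not from the RFC */
	#include <stddef.h>
	#include <stdint.h>

	struct dmover_request_sketch {
		/* ...session ptr, callback, cookie, flags, error... */

		struct {
			void	*buf;		/* paired with a length */
			size_t	 len;
		} outbuf;

		unsigned int	 ninbufs;	/* back-end sanity check */
		struct {
			void	*buf;
			size_t	 len;
		} *inbufs;			/* pointer, so the struct
						   stays fixed-size and
						   poolifiable */

		const uint8_t	*fillptr;	/* 'fill' value by reference;
						   caller keeps it around for
						   async use */
	};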

* Use of hard-coded strings is nice for extensibility, but error-prone
  and a bit wasteful.  Suggest providing #defines for "globally
  recognized" methods.  OK, they can expand into strings or whatever,
  but then at least you get compile-time checking of correctness and
  maybe some space savings.  e.g.:

	/* declare & init to "zero-block" in some file... */
	extern const char dmover_function_zero_name[];
	#define DMOVER_FUNCTION_ZERO	dmover_function_zero_name

  People don't have to use the #defines, but if they do then they know
  they probably got the function they wanted if it compiles.  8-)
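
  E.g., a caller could then do something like the following (assuming
  a create call shaped roughly like the RFC's dmover_session_create(),
  taking a function name and returning a session):

	int
	my_setup(void)
	{
		struct dmover_session *sess;
		int error;

		/* a typo here is a compile error, not a silent
		   runtime lookup failure */
		error = dmover_session_create(DMOVER_FUNCTION_ZERO, &sess);
		if (error)
			return (error);
		/* ...queue up requests on 'sess'... */
		return (0);
	}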

* The 'hwassist' specifications are ... lame.  Make the back-ends
  estimate their throughput in terms of bytes/sec (or, if you want to
  be modern, MB/sec).

  Provide a way for sessions to estimate throughput requirements.

  Try to do a somewhat saner assignment based on throughput/bandwidth
  than it sounds like you're planning, or can do, w/ the current
  interface.

  Then, provide mechanisms to collect actual statistics as a
  debugging/diagnostic tool.  8-)

  To take it further: say you have two movers w/ equal capabilities,
  both of which can copy data but only one of which is XORiffic.  You
  first get 6 requests for sessions for high-bandwidth copying, and
  one for really-high-bandwidth 5-way XOR.  With the current fixed
  allocation scheme, the first 6 will get split 3 & 3, and then the
  XOR will get stuck on an already loaded mover.  What you _should_ do
  is shuffle the 6 copies off to the lame mover and do the XOR on the
  XORiffic one.  (I'm thinking 5 sources -> 1 dest, so 6 *
  single-stream bytes/s, or possibly seven depending on the DM
  implementation.  8-)
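
  As for what "saner assignment" might look like, something vaguely
  along these lines (completely made-up names, just to show the shape
  of the decision):

	#include <stddef.h>
	#include <stdint.h>

	/* sketch: pick the capable mover with the most uncommitted
	   bandwidth, rather than doing a fixed split */
	struct hypo_backend {
		uint64_t bytes_per_sec;		/* back-end's own estimate  */
		uint64_t committed_bps;		/* sum of session estimates */
		int	 has_xor;		/* stand-in capability flag */
	};

	static struct hypo_backend *
	pick_backend(struct hypo_backend *be, int nbe, int need_xor,
	    uint64_t session_bps)
	{
		struct hypo_backend *best = NULL;
		uint64_t headroom, best_headroom = 0;
		int i;

		for (i = 0; i < nbe; i++) {
			if (need_xor && !be[i].has_xor)
				continue;
			if (be[i].bytes_per_sec > be[i].committed_bps)
				headroom = be[i].bytes_per_sec -
				    be[i].committed_bps;
			else
				headroom = 0;
			if (best == NULL || headroom > best_headroom) {
				best = &be[i];
				best_headroom = headroom;
			}
		}
		if (best != NULL)
			best->committed_bps += session_bps;
		return (best);
	}

  (This alone still won't predict the 5-way XOR showing up after the 6
  copies, of course -- for that you need either re-balancing of
  existing sessions or per-request decisions, which is where the
  schemes at the end of this message come in.)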

* dmover_bufptr_t: a uio includes the amount of data that needs to be
  xferred (resid).  A void * does not.  You seem to have a 'len'
  outside these anyway, in dreq_outbuf and dreq_inbuf[].  Seems that
  the void * should be paired with a len, and the uio should be left
  to go on its merry way unassisted.
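
  I.e., something like (again, just a sketch, names mine):

	#include <stddef.h>

	struct uio;			/* <sys/uio.h> in the kernel */

	/* the linear variant carries its own length; the uio already
	   knows its resid, so leave it alone */
	typedef union {
		struct uio *dmbuf_uio;
		struct {
			void	*ptr;
			size_t	 len;
		} dmbuf_linear;
	} dmover_buffer_sketch_t;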

* Can you have mismatches in total input and output length in this
  interface?  8-)  Are the extra lengths just there for sanity
  checking?

* flags and processing:

	* maybe a flag to say "owned by dm system"?  (i.e.,
          dmover_process() called on this request)

	* similarly, a flag to say "on hardware"?  (i.e., assigned to
          a specific back-end, on the hardware ring buffer, etc.)

	* mechanism to cancel/abort requests?
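
  I.e., roughly (all names invented here, not from the RFC):

	struct dmover_request;

	#define	DMOVER_REQ_QUEUED	0x0001	/* dmover_process() called */
	#define	DMOVER_REQ_ONHW		0x0002	/* on a back-end's ring    */

	/* and maybe something like: */
	int	dmover_request_abort(struct dmover_request *);
			/* returning, say, EBUSY once it's on the hardware? */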

* dmover_session_create description: "specied"?  (typo 8-)

* dmover_process: if the op has to be done in SW, or for other
  reasons, can the callback fn be invoked before dmover_process()
  returns?  (I think so!)  (Beware uses which continually increase
  stack depth!)
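
  E.g., a client probably wants to do something like the following
  (assuming the completion callback gets handed the request back; the
  point is just "don't resubmit from inside the callback"):

	#include <sys/param.h>
	#include <sys/proc.h>			/* wakeup() */

	struct dmover_request;			/* from the proposed API */

	static volatile int my_xfer_done;

	static void
	my_done_cb(struct dmover_request *dreq)
	{

		/*
		 * If the op was done synchronously (e.g. SW fallback),
		 * this can run before dmover_process() has returned.
		 * Calling dmover_process() again from here would then
		 * recurse and grow the stack, so just record completion
		 * and let the submitting context (or a worker) carry on.
		 */
		my_xfer_done = 1;
		wakeup((void *)&my_xfer_done);
	}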

* For sessions, the number of inputs is determined by the session
  function type.  For back-ends, there seems to be an 'algorithm' and
  a separate 'number of inputs'.  Will this be confusing?  (I assume
  the goal here is to allow the interface to compose large-#-input
  XORs from small-#-input XOR data movers?  Yet another reason why you
  don't want to nail a session down to a specific back-end!  4-input
  XOR w/ 2 3-input XOR engines, or 3 2-input ones!  Pipelining!  8-)
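
  The arithmetic behind that composition, for the record (no API here,
  purely illustrative; assumes the engine takes at least 2 inputs):

	#include <stdio.h>

	/*
	 * Number of passes through a k-input XOR engine needed for an
	 * n-input XOR, chaining the partial result back in as one of
	 * the inputs on each pass after the first.
	 */
	static unsigned int
	xor_passes(unsigned int n_inputs, unsigned int k_engine)
	{

		if (n_inputs <= k_engine)
			return (1);
		/* first pass eats k inputs; each later pass eats k-1 more */
		return (1 + (n_inputs - k_engine + (k_engine - 2)) /
		    (k_engine - 1));
	}

	int
	main(void)
	{

		/* the example above: 2 passes on 3-input HW, 3 on 2-input */
		printf("%u %u\n", xor_passes(4, 3), xor_passes(4, 2));
		return (0);
	}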



I think the biggest problem I have w/ the current design is the way
requests get assigned to movers, and the fact that that assignment
(via sessions) is static.  I think this will probably require a bit of
study and evolution of the interface, but the current interface spec
is _so_ deficient in this regard that it pretty much has to get
better.  8-)

There are a few schemes which are probably worth considering.

* assignment of sessions to back-ends, with measurement of
  utilization/performance & adjustment as needed.

* make back-ends bid for requests, offering 'approx time to
  completion' and selecting something close to the least, while trying
  to keep HW queues / ring buffers 'full enough'.

Assuming you're not promising anything about guaranteed FIFO ordering
or guaranteed bandwidth, I suspect the latter may work best.  (e.g.,
consider pipelining of requests, consider the case where you've got
two DMs that you're trying to use for a single session that needs both
of their throughput...)
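
A rough cut at the bidding computation (made-up names and fields, and
ignoring overflow for the sake of the sketch):

	#include <stddef.h>
	#include <stdint.h>

	struct hypo_mover {
		uint64_t bytes_per_sec;		/* throughput estimate       */
		uint64_t queued_bytes;		/* backlog on its ring, etc. */
	};

	/* each mover "bids" its estimated time to finish the request,
	   backlog included; take the lowest bid */
	static struct hypo_mover *
	take_lowest_bid(struct hypo_mover *m, int nm, uint64_t req_bytes)
	{
		struct hypo_mover *best = NULL;
		uint64_t bid, best_bid = 0;
		int i;

		for (i = 0; i < nm; i++) {
			if (m[i].bytes_per_sec == 0)
				continue;	/* can't do this function */
			bid = ((m[i].queued_bytes + req_bytes) * 1000000) /
			    m[i].bytes_per_sec;	/* ~usec to completion */
			if (best == NULL || bid < best_bid) {
				best = &m[i];
				best_bid = bid;
			}
		}
		if (best != NULL)
			best->queued_bytes += req_bytes;
		return (best);
	}

Note that something like this naturally falls back to an idle
software/CPU back-end once the XOR engine's queue gets long, which
also addresses the "HW must be used first" complaint above.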


cgd
-- 
Chris Demetriou                                            Broadcom Corporation
Senior Staff Design Engineer                  Broadband Processor Business Unit
  Any opinions expressed in this message are mine, not necessarily Broadcom's.