tech-kern archive


Re: FW: ixg(4) performances



On Sun, Aug 31, 2014 at 12:07:38PM -0400, Terry Moore wrote:
> 
> This is not 2.5G Transfers per second. PCIe talks about transactions rather
> than transfers; one transaction requires either 12 bytes (for 32-bit
> systems) or 16 bytes (for 64-bit systems) of overhead at the transaction
> layer, plus 7 bytes at the link layer. 
> 
> The maximum number of transactions per second paradoxically moves the
> fewest bytes. At the other extreme, a 4K write takes 16+4096+5+2 byte
> times, so only about 60,000 such transactions are possible per second
> (moving about 248,000,000 bytes). [Real systems don't quite see this --
> Wikipedia claims, for example, that 95% efficiency is typical for
> storage controllers.]

The gain from large transfer requests is probably minimal.
There can be multiple requests outstanding at any one time (the limit
is negotiated; I'm guessing that 8 and 16 are typical values).
A typical PCIe DMA controller will generate multiple concurrent transfer
requests, so even if the requests are only 128 bytes each you can still
get reasonable overall throughput.
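
To put numbers on that, here is a rough back-of-the-envelope program
using the per-TLP overhead quoted above (16 bytes of transaction-layer
header for 64-bit addressing plus 7 bytes of link-layer framing) and
the 250 Mbyte/s of raw symbol bandwidth one 2.5GT/s lane provides after
8b/10b encoding. It is only a sketch of the arithmetic, not a model of
a real link (DLLPs, flow control and skips cost a little more):

/*
 * Back-of-the-envelope PCIe gen1 x1 throughput, using the per-TLP
 * overhead figures quoted above (16 byte transaction-layer header for
 * 64-bit addressing plus 7 bytes of link-layer framing).  250e6 is the
 * raw byte rate of one 2.5GT/s lane after 8b/10b encoding.
 */
#include <stdio.h>

#define LANE_BYTES_PER_SEC	250000000.0	/* 2.5GT/s, 8b/10b */
#define TLP_OVERHEAD		(16 + 5 + 2)	/* header + link layer */

int
main(void)
{
	static const int payloads[] = { 4, 128, 512, 4096 };
	size_t i;

	for (i = 0; i < sizeof(payloads) / sizeof(payloads[0]); i++) {
		int p = payloads[i];
		double wire = p + TLP_OVERHEAD;
		double tps = LANE_BYTES_PER_SEC / wire;

		printf("%4d byte payload: %8.0f TLP/s, %6.1f Mbyte/s"
		    " (%.0f%% efficient)\n",
		    p, tps, tps * p / 1e6, 100.0 * p / wire);
	}
	return 0;
}

For the 4-byte and 4K cases it reproduces the ~9M transactions/36 Mbyte/s
and ~60k transactions/248 Mbyte/s figures quoted above; 128-byte requests
come out at roughly 85% efficient, which is why overlapping a handful of
them is usually good enough.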

> A 4-byte write takes 16+4+5+2 byte times, and so roughly 9 million
> transactions are possible per second, but those 9 million transactions can
> only move 36 million bytes.

Except that nothing will generate adequately overlapped short transfers.

The real performance killer is CPU PIO cycles.
Every one the driver does eats into throughput - the CPU sits spinning
for a long, long time on each access (think ISA bus speeds).
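
The usual trick for keeping PIO out of the per-packet path is to have
the device DMA its completion status into the descriptor ring in host
memory, so the driver polls ordinary cacheable RAM instead of a device
register. A minimal sketch of the difference below - the descriptor
layout and the 'DD' bit are made up for illustration (loosely modelled
on typical Intel NIC descriptors), not the real ixg(4) structures:

/*
 * Why drivers avoid PIO in the per-packet path: a read of a device
 * register stalls the CPU for the full PCIe round trip, while a "done"
 * bit that the device has already DMA'd into the ring is just a load
 * from ordinary memory.  Hypothetical layout, not the real ixg(4) one.
 */
#include <stdint.h>

#define DESC_STATUS_DD	0x01		/* descriptor done, set by the device */

struct rx_desc {
	uint64_t	addr;		/* DMA address of the buffer */
	uint16_t	length;		/* filled in by the device */
	uint8_t		status;		/* DESC_STATUS_DD when complete */
	uint8_t		pad[5];
};

/* Slow: one PIO read per call, each one a full PCIe round trip. */
static inline int
rx_ready_pio(volatile uint32_t *head_reg, uint32_t next)
{
	return *head_reg != next;	/* this read crosses the link */
}

/*
 * Fast: the status byte was DMA'd into host memory, so this is a
 * cached load and no PCIe transaction happens at all.
 */
static inline int
rx_ready_dma(volatile struct rx_desc *ring, uint32_t next)
{
	return (ring[next].status & DESC_STATUS_DD) != 0;
}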

A side effect of this is that PCI-PCIe bridges (either way) are doomed
to be very inefficient.

> Multiple lanes scale things fairly linearly. But there has to be one byte
> per lane; an x8 configuration means that physical transfers are padded so
> that each 4-byte write (which takes 27 bytes on the bus) has to take 32
> bytes. Instead of getting 72 million transactions per second, you get 62.5
> million transactions/second, so it doesn't scale as nicely.

I think that individual PCIe transfer requests always use a single lane.
Multiple lanes help if you have multiple concurrent transfers.
So different chunks of an ethernet frame can be transferred in parallel
over multiple lanes, with the transfer not completing until all the
individual parts complete.
That means the ring status transfer can't be scheduled until all the
other data fragment transfers have completed.

I also believe that PCIe transfers are inherently 64-bit: there are
byte enables indicating which bytes of the first and last 64-bit words
are actually required.
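
To illustrate the byte-enable idea (without wanting to assert the exact
word size the spec uses - it is a parameter below), a transfer ends up
described by a mask of valid bytes in its first word and another for
its last word, with everything in between fully enabled:

/*
 * Sketch of byte enables: a 'len' byte transfer starting at byte
 * offset 'off' is described by a mask of valid bytes in the first word
 * and one for the last word; intermediate words are fully enabled.
 * The word size is a parameter rather than asserting 32 vs 64 bits.
 */
#include <stdint.h>

struct byte_enables {
	uint8_t	first;		/* bit n set => byte n of first word valid */
	uint8_t	last;		/* likewise for the last word */
};

struct byte_enables
compute_be(unsigned off, unsigned len, unsigned wordsz)
{
	struct byte_enables be;
	unsigned end = off + len;		/* one past the last byte */
	unsigned last_off = (end - 1) % wordsz;
	uint8_t full = (uint8_t)((1u << wordsz) - 1);

	be.first = (uint8_t)(full & (full << (off % wordsz)));
	be.last = (uint8_t)(full >> (wordsz - 1 - last_off));

	if (end - (off - off % wordsz) <= wordsz) {
		/* fits in a single word: only the first mask applies */
		be.first &= be.last;
		be.last = 0;
	}
	return be;
}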

The real thing to remember about PCIe is that it is a comms protocol,
not a bus protocol.
It is high throughput, high latency.
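
That is also the practical consequence for DMA reads: to keep the link
busy you need enough read requests in flight to cover the round-trip
time. A rough bandwidth-delay calculation - the 1us completion latency
is an assumed ballpark, not a measurement:

/*
 * Bandwidth-delay product for PCIe DMA reads.  The 1us round-trip
 * latency is an assumed ballpark; the 128-byte read request size
 * matches the figure mentioned earlier.
 */
#include <stdio.h>

int
main(void)
{
	double lane_bytes_per_sec = 250e6;	/* one 2.5GT/s lane */
	int lanes = 8;
	double latency_sec = 1e-6;		/* assumed round trip */
	double request_bytes = 128.0;

	double bdp = lane_bytes_per_sec * lanes * latency_sec;
	double reads = bdp / request_bytes;

	printf("x%d gen1: ~%.0f bytes (about %.1f 128-byte reads) must be"
	    " outstanding to hide %.1fus of latency\n",
	    lanes, bdp, reads, latency_sec * 1e6);
	return 0;
}

With those (assumed) numbers an x8 gen1 link wants around 2000 bytes,
i.e. roughly 16 reads of 128 bytes, in flight at once - the same range
as the 8-16 outstanding requests guessed at above.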

I've had 'fun' getting even moderate PCIe throughput into an FPGA.

	David

-- 
David Laight: david%l8s.co.uk@localhost

