Subject: Re: fixing send(2) semantics (kern/29750)
To: None <tech-kern@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 03/27/2005 13:42:42
>> You're saying that one syscall takes enough time for an entire full
>> interface queue to drain?  How can you ever fill it up with sends
>> then?
> One additional system call other than the sendto().  I tried adding
> this and I can't keep up with the network interface when I do it:
> struct timeval tv = { 0, 1 };
> (void)select(0, NULL, NULL, NULL, &tv);

A select with a nonzero timeout always waits at least until the next
clock tick, I think (well, assuming no fds become ready, as in your
case); while I don't know what your kernel uses for its clock tick, I
find indications saying i386 (you said "celeron", below, in text I cut)
uses 100Hz, i.e. a 10 ms tick, so that select is likely to be very
time-expensive compared to what you want.  Simply retrying the send()
when you get ENOBUFS is
more likely to do what you want.
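
A minimal sketch of such a retry loop, in case it helps.  The function
name and the max_retries bound are made up for illustration; the only
thing taken from the discussion is "call sendto() again when it fails
with ENOBUFS".

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Send one datagram, retrying immediately while the kernel reports
 * ENOBUFS (interface send queue full).  max_retries is a hypothetical
 * safety bound, not something from the thread; pass a large value to
 * spin more or less indefinitely.
 * Returns the sendto() result; on failure returns -1 with errno set
 * (still ENOBUFS if the retry bound was exhausted).
 */
ssize_t
send_retry_enobufs(int s, const void *buf, size_t len,
    const struct sockaddr *to, socklen_t tolen, unsigned long max_retries)
{
	ssize_t n;
	unsigned long tries = 0;

	for (;;) {
		n = sendto(s, buf, len, 0, to, tolen);
		if (n >= 0 || errno != ENOBUFS)
			return n;	/* success, or an error we do not handle */
		if (tries++ >= max_retries)
			return -1;	/* queue stayed full for every retry */
	}
}
```

This busy-waits, so it burns CPU while the queue is full; that is the
price of not sleeping until the next clock tick.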

> I send packets of 1024 bytes, plus 28 bytes of headers, that makes
> 8416 bits. At 100 Mb/s, it would take 0.084 ms to send a packet.
> That number is probably horribly wrong,

It's right for the data content, but it takes no account of the channel
seizure time or any such - I know real Ethernet has something
relatively large like 64 bit times of seizure; I don't know what
100baseTX (which is what I assume you're using) does along those lines.
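
For a rough feel of how much framing adds, here is the arithmetic as a
small sketch.  The overhead constants are the usual Ethernet framing
figures (preamble plus start-of-frame delimiter, MAC header, FCS,
minimum interframe gap) and are my assumption, not something measured
on your link; the function name is made up.

```c
/* Usual Ethernet per-frame overhead, in bytes (assumed, not measured). */
#define ETH_PREAMBLE_SFD	8	/* 7 bytes preamble + 1 byte SFD */
#define ETH_HEADER		14	/* dst MAC, src MAC, ethertype */
#define ETH_FCS			4	/* frame check sequence */
#define ETH_IFG			12	/* minimum interframe gap, 96 bit times */

/*
 * Time on the wire, in microseconds, for one packet of ip_bytes bytes
 * (payload plus IP/UDP headers) at mbit_per_s megabits per second.
 * If with_framing is nonzero, the Ethernet overhead above is added.
 */
double
wire_time_us(unsigned ip_bytes, double mbit_per_s, int with_framing)
{
	unsigned bytes = ip_bytes;

	if (with_framing)
		bytes += ETH_PREAMBLE_SFD + ETH_HEADER + ETH_FCS + ETH_IFG;
	/* bits divided by (Mbit/s) gives microseconds */
	return bytes * 8.0 / mbit_per_s;
}
```

With 1024 + 28 = 1052 bytes at 100 Mb/s this gives 84.16 us without
framing and, under the assumed overheads, 87.2 us with it, so the
per-packet budget moves only by a few percent.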

> Until I get a better idea, I'll assume that if I add system calls
> that last longer than 0.084 ms, the queue is consumed faster than I
> can feed it, and it never fills.

If you average more than that per packet, yes.

> I don't know how much time a sendto() consumes.  How can I evaluate
> that?

"route add -host 127.0.0.2 127.0.0.1 -discard" and loop sending to
127.0.0.2, to see how long it takes to send (say) 1000000 packets?
(The route add is to cut down the overhead as much as feasible, by
throwing away the output packets.)

> ktrace tells me that it takes up to 0.18 ms for a 1024-byte packet.
> But when I use ktrace on the test program in kern/29750, the ENOBUFS
> disappears, so I suspect ktrace makes the system calls take longer
> than they really do.

Yes, I've seen sensitive timing issues disappear when ktrace is
involved too.  Not surprising; ktrace will necessarily increase
overhead.  If you have enough RAM to do so, you might try ktracing to a
ramdisk, to reduce the ktrace overhead as much as possible.  (If you're
using mfs, do it several times and use the last values, since
never-written MFS pages have to take a trip through the page fault
handler, which will slow things down.)

> A 19 ms wait clearly explains why the ENOBUFS disappears: calling
> select causes me to feed the queue 200 times slower than it is
> consumed.  Another quick test shows that ktrace reports 0.005 ms for
> getuid().  So not all system calls are that slow, it's just
> select().

And quite possibly not just select(), but select() with a nonzero
timeout.  Try using a zero timeout, or poll(0,0,0), maybe.

> So I'd change my statement: waiting using select() causes the app to
> be so slow it cannot feed the queue fast enough to use full
> bandwidth.

If your machine has enough RAM to support it, I'd suggest hacking on
the driver so that the send queue is a lot larger - say, thousands or
even tens of thousands of packets.  You may have to grow your mbuf pool
too.  But it will greatly reduce the problem; when userland gets
ENOBUFS it then has a lot longer that it can delay before the queue
completely drains.

You might even add a cdev interface to the network driver so you can,
say, provide an ioctl that sleeps until there's space on the queue.
Since it appears that sleep/wakeup delay is likely not the problem
(your select is probably waiting for a clock tick), you may be able to
do ok with sleeping in an ioctl and being awakened to write more by the
driver TX interrupt routine.

> Question: how can I wait with a finer granularity than select()?

If you really want something time-based, increase HZ.  But since you
just want to run the link wide open, I'd first try the cdev interface
to the network interface, so it can wake you up, without depending on
clock ticks, when it has space.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B