Subject: fixing send(2) semantics (kern/29750)
To: None <tech-kern@netbsd.org>
From: Emmanuel Dreyfus <manu@netbsd.org>
List: tech-kern
Date: 03/26/2005 10:56:43
Hello

Some thoughts about how we can fix the send(2) semantics, which are
broken as explained in kern/29750:

Summary: the Single Unix Specification and our man page say that unless
non-blocking I/O is enabled, send(2) should block when it cannot send
the data because buffers are full.

The code that handles a socket send buffer with no space left works;
there is no problem there. The problem is lower in the network stack:
the socket layer calls the lower layers, and the send request
eventually hits the link layer. For instance we do sosend() ->
udp_output() -> ip_output() -> ether_output().

At the link layer, IFQ_ENQUEUE() is used to put the packet onto the
interface queue. If the interface queue is full, ENOBUFS is returned
and sosend() fails instead of blocking.

In order to be standards-compliant, we need to sleep when blocking I/O
is in use. This can be done at the socket layer or at the link layer.


1) sleeping at the socket layer, in sosend():

There is no way of telling that the interface queue is full. sosend()
does not know what layers are below; there might even be no underlying
interface.

So we'd need the lower layers, down to the link layer, to report that
information through a function. For instance udp_qavail() ->
ip_qavail() -> ether_qavail(). The change requires modifying all the
transport, network and link layers, so it's very intrusive. That is
probably not a good path.
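
To make the cost of that idea concrete, here is a rough user-space
sketch of such a chain. Everything in it is hypothetical: the *_model
structures and functions are stand-ins invented for illustration, not
existing kernel interfaces. Each protocol layer needs its own hook, and
the network layer has to find the outgoing interface before it can
answer at all.

```c
#include <stddef.h>

/* Hypothetical stand-ins for an interface queue, a route and a PCB. */
struct ifq_model {
	int len;			/* packets currently queued */
	int maxlen;			/* queue limit */
};

struct route_model {
	struct ifq_model *ifq;		/* NULL: no underlying interface */
};

struct pcb_model {
	struct route_model *route;	/* NULL: no route looked up yet */
};

/* Link layer: is there room for one more packet? */
int
ether_qavail_model(const struct ifq_model *ifq)
{
	return ifq->len < ifq->maxlen;
}

/* Network layer: find the interface, then ask the link layer. */
int
ip_qavail_model(const struct route_model *ro)
{
	if (ro == NULL || ro->ifq == NULL)
		return 1;	/* nothing to wait for in this sketch */
	return ether_qavail_model(ro->ifq);
}

/* Transport layer: only forwards the question downwards. */
int
udp_qavail_model(const struct pcb_model *pcb)
{
	return ip_qavail_model(pcb->route);
}
```

Even in this toy form, three layers had to grow a new entry point, and
every other transport, network and link layer would need the same.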

Alternatively we can loop, retrying the send until we no longer get
ENOBUFS. That does not seem very reasonable on the performance front,
because on each attempt we will walk through the whole network stack,
redoing things such as routing, IPsec processing and packet
fragmentation.
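
A minimal user-space sketch of the retry loop, to show where the cost
comes from. The names are assumptions made up for this example:
output_fn stands for one complete pass down the stack,
fake_stack_output() is a fake that reports ENOBUFS a given number of
times, and the max_tries bound is only there so the example terminates.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical: one complete pass down the stack for one packet. */
typedef ssize_t (*output_fn)(void *ctx, const void *buf, size_t len);

/* Fake stack: the interface queue stays full for full_for passes. */
struct fake_stack {
	unsigned int full_for;	/* passes that will still hit a full queue */
	unsigned int walks;	/* how many times the stack was walked */
};

ssize_t
fake_stack_output(void *ctx, const void *buf, size_t len)
{
	struct fake_stack *fs = ctx;

	(void)buf;
	fs->walks++;		/* routing, IPsec, fragmentation redone here */
	if (fs->full_for > 0) {
		fs->full_for--;
		errno = ENOBUFS;
		return -1;
	}
	return (ssize_t)len;
}

/*
 * Call out() until it stops failing with ENOBUFS or max_tries is
 * reached. The number of attempts is stored in *tries if non-NULL.
 */
ssize_t
send_retry_enobufs(output_fn out, void *ctx, const void *buf, size_t len,
    unsigned int max_tries, unsigned int *tries)
{
	ssize_t n;
	unsigned int count = 0;

	do {
		n = out(ctx, buf, len);
		count++;
	} while (n < 0 && errno == ENOBUFS && count < max_tries);

	if (tries != NULL)
		*tries = count;
	return n;
}
```

With a queue that stays full for three passes, the stack is walked four
times to send a single packet, and nothing in the loop waits for the
queue to drain between attempts.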


2) sleeping at the link layer, for instance in ether_output():

We have two problems: 
- All link layer output functions must be modified. That's intrusive,
but at least it's less intrusive than the *_qavail() idea.

- There is no way of telling if the socket has blocking I/O or not. In
order to address this, we'd need a reference to the socket. The network
layer output function has the reference, so we just need to change the
link layer output function interface so that the socket reference gets
propagated there. 
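
Once the socket reference reaches the link layer, the decision itself
is small. The sketch below is a user-space model under assumptions:
struct sock_model and the M_SS_NBIO flag are made-up stand-ins for the
socket and its non-blocking state bit, and treating a missing socket as
"never sleep" is a choice of this sketch, not something settled above.

```c
#include <stddef.h>

#define M_SS_NBIO	0x0100	/* model flag: non-blocking I/O requested */

/* Hypothetical stand-in for the socket seen by the output function. */
struct sock_model {
	int so_state;
};

/* Returns 1 if the link layer may sleep for this packet, 0 otherwise. */
int
may_sleep_for(const struct sock_model *so)
{
	if (so == NULL)
		return 0;	/* no socket behind the packet: never sleep */
	return (so->so_state & M_SS_NBIO) == 0;
}
```

The result is what would be handed to the enqueue step as its sleep
argument.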

IFQ_ENQUEUE() is called at splnet(). I am not sure we can afford to
sleep at this level: will the driver be able to send the data? I
assume we have to splx() before sleeping. All the work can be done in
IFQ_ENQUEUE() itself. I would add a sleep argument to the macro, set to
1 if the socket has blocking I/O, and set to 0 otherwise:

#define IFQ_ENQUEUE(ifq, m, pattr, err, sleep)                  \
do {                                                            \
        if (ALTQ_IS_ENABLED((ifq)))                             \
                ALTQ_ENQUEUE((ifq), (m), (pattr), (err));       \
        else {                                                  \
                (err) = 0;                                      \
                while (IF_QFULL((ifq))) {                       \
                        if ((sleep)) {                          \
                                /* XXX needs the saved level */ \
                                splx();                         \
                                tsleep((ifq), PCATCH|PRIBIO,    \
                                    "ifq_enqueue", 0);          \
                                splnet();                       \
                        } else {                                \
                                m_freem((m));                   \
                                (err) = ENOBUFS;                \
                                break;                          \
                        }                                       \
                }                                               \
                if (!(err))                                     \
                        IF_ENQUEUE((ifq), (m));                 \
        }                                                       \
        if ((err))                                              \
                (ifq)->ifq_drops++;                             \
} while (/*CONSTCOND*/ 0)


IFQ_DEQUEUE would call wakeup():

#define IFQ_DEQUEUE(ifq, m)                                     \
do {                                                            \
        if (TBR_IS_ENABLED((ifq)))                              \
                (m) = tbr_dequeue((ifq), ALTDQ_REMOVE);         \
        else if (ALTQ_IS_ENABLED((ifq)))                        \
                ALTQ_DEQUEUE((ifq), (m));                       \
        else                                                    \
                IF_DEQUEUE((ifq), (m));                         \
        wakeup((ifq));                                          \
} while (/*CONSTCOND*/ 0)   
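
For anyone who wants to play with the enqueue/dequeue handshake outside
the kernel, here is a rough user-space model. It is only an analogy
under stated assumptions: a pthread mutex stands in for splnet(), a
condition variable stands in for the tsleep()/wakeup() channel, and
struct model_ifq, MQ_MAXLEN and the mq_*() functions are names invented
for this sketch.

```c
#include <errno.h>
#include <pthread.h>

#define MQ_MAXLEN	4	/* hypothetical queue limit */

struct model_ifq {
	pthread_mutex_t	lock;		/* stands in for splnet() */
	pthread_cond_t	notfull;	/* stands in for the sleep channel */
	int		pkts[MQ_MAXLEN];
	int		head, len, drops;
};

void
mq_init(struct model_ifq *q)
{
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->notfull, NULL);
	q->head = q->len = q->drops = 0;
}

/* Returns 0 on success, ENOBUFS if the queue is full and sleep is 0. */
int
mq_enqueue(struct model_ifq *q, int pkt, int sleep)
{
	pthread_mutex_lock(&q->lock);
	while (q->len == MQ_MAXLEN) {
		if (!sleep) {
			q->drops++;
			pthread_mutex_unlock(&q->lock);
			return ENOBUFS;
		}
		/* Releases the lock while asleep, retakes it on wakeup. */
		pthread_cond_wait(&q->notfull, &q->lock);
	}
	q->pkts[(q->head + q->len) % MQ_MAXLEN] = pkt;
	q->len++;
	pthread_mutex_unlock(&q->lock);
	return 0;
}

/* Returns 1 and stores a packet in *pkt, or 0 if the queue is empty. */
int
mq_dequeue(struct model_ifq *q, int *pkt)
{
	int ok = 0;

	pthread_mutex_lock(&q->lock);
	if (q->len > 0) {
		*pkt = q->pkts[q->head];
		q->head = (q->head + 1) % MQ_MAXLEN;
		q->len--;
		ok = 1;
	}
	/* Like the wakeup() above: wake every sleeper, unconditionally. */
	pthread_cond_broadcast(&q->notfull);
	pthread_mutex_unlock(&q->lock);
	return ok;
}
```

A thread calling mq_enqueue(&q, pkt, 1) on a full queue sleeps until
another thread calls mq_dequeue(); with sleep set to 0 it gets ENOBUFS
at once, which is today's behaviour. Note that the waiter rechecks the
full condition after every wakeup, as the while loop in the macro does.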


Opinions?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@netbsd.org