tech-net archive


Proposal for new MP-safe network interface output queue API



NetBSD currently has a set of macros, IFQ_*(), that implement the default
network interface output queue used by the overwhelming majority of network
interface drivers in our kernel.  These macros are wrappers around the
historic BSD IF_*() ifqueue macros, and provide some basic synchronization
and integration with the ALTQ system.

Alas, as with many legacy APIs, there are some problems with these macros
in multiprocessor environments, and other deficiencies vis a vis how the
output queue interacts with the IFF_OACTIVE and IFF_RUNNING flags.
Furthermore, these macros are a source of ABI fragility that can be
easily avoided.

First, let me outline some problematic usage patterns that can be
seen in various network drivers in our kernel.

== The IF_PREPEND() problem ==

The current "best practice" for drivers to follow when processing their
output queue is like so:

        if ((ifp->if_flags & IFF_RUNNING) == 0) {
                return;
        }

        for (;;) {
                IFQ_POLL(&ifp->if_snd, m);
                if (m == NULL) {
                        break;
                }

                /* Do stuff to encapsulate the packet for hardware. */
                error = foo_encap(sc, m);
                if (error) {
                        if (error == FATAL_ERROR) {
                                IFQ_DEQUEUE(&ifp->if_snd, m);
                                if_statinc(ifp, if_oerrors);
                                m_freem(m);
                                continue;
                        }
                        /* Temporary resource shortage. */
                        break;
                }

                /* NOW COMMITTED TO TRANSMITTING THE PACKET. */
                IFQ_DEQUEUE(&ifp->if_snd, m);

                /* mbuf will be freed when transmission is complete. */
        }

Unfortunately, not all drivers are structured like this.  There are
some cases where hardware requirements might necessitate allocating
a new mbuf and copying the data.  Such drivers often use the following
pattern instead:

        if ((ifp->if_flags & IFF_RUNNING) == 0) {
                return;
        }

        for (;;) {
                IFQ_DEQUEUE(&ifp->if_snd, m);
                if (m == NULL) {
                        break;
                }

                /*
                 * Do stuff to encapsulate the packet for hardware,
                 * maybe allocating a new mbuf in the process.
                 */
                error = foo_encap(sc, &m);
                if (error) {
                        if (error == FATAL_ERROR) {
                                if_statinc(ifp, if_oerrors);
                                m_freem(m);
                                continue;
                        }
                        /* Temporary resource shortage. */
                        IF_PREPEND(&ifp->if_snd, m);
                        break;
                }

                /* mbuf will be freed when transmission is complete. */
        }

There are a couple of problems with this.

1. Conceptually, it's a bit of an abstraction violation to put a
   packet at the head of the queue like this.

2. It doesn't work at all with ALTQ, which is why such drivers are forced
   to use the IF_PREPEND() macro (note: it's not IFQ_PREPEND()), which,
   in addition to not working with ALTQ, does not actually lock the
   queue and thus isn't MP-safe at all.

It would be nice to provide an API that provides some flexibility for
drivers that need this currently-problematic pattern while also providing
crisp semantic guarantees.

== The ifp->if_flags problem ==

Currently, the Tx process for network interfaces typically consults 2
network interface flags: IFF_OACTIVE and IFF_RUNNING.

Regarding IFF_RUNNING, this is generally consulted at the top of
(*if_start)() before entering the main loop; if it is not set, the
routine simply returns.

Regarding IFF_OACTIVE, this is a bit whose meaning is not really reflected
in its name, especially when viewed in the context of modern hardware.
Historically, this bit meant "output is active", and the practical effect
is that the network stack will not call (*if_start)() if the bit is set.
Drivers set the bit when there are no more transmit slots left in the
hardware.  It's generally considered unnecessary these days, because modern
hardware has more complex transmit processing resources, and there has been
a gradual push to eliminate it.  However, even *I*, the most fervent
opponent of IFF_OACTIVE, acknowledge that there is a use case for something
to indicate that output is temporarily stalled and allows other parts of
the stack to make decisions based on that information.

The main problem with using these flags is synchronizing access to them.
(*if_start)() is frequently called in hard or soft interrupt context, but
those flags are only stable if the IFNET lock is held (which cannot
be guaranteed when (*if_start)() is called).  Moreover, manipulating them
from a hard interrupt context currently happens in every driver that uses
IFF_OACTIVE, which is definitely no bueno.

Furthermore, there is the issue of synchronizing against the transition
of IFF_RUNNING -> 0, due to the independent and asynchronous nature of
how (*if_start)() can be called (from hard or soft interrupt context, as
noted previously).

== The proposal ==

To address the above problems, I'm proposing a new interface output
queue API that tackles the above points while also being simple
to adopt (with a reasonably straight-forward mapping from the legacy
API to the new API).  Converting all drivers to this new API will
be pretty easy and will help move the needle forward vis a vis the
transition to a NET_MPSAFE world.

One aspect of the new design is that it is stateful and the standard
API flow lets the ifqueue itself track the various state transitions,
with no real need for the caller to explicitly manage state except for
start/stop.

==> IFQ_STATE_INVALID
This state exists as a transitional step in order to allow drivers to
be migrated over time, rather than all at once.  A queue is initialized
with this state, and queues in this state bypass all of the automatic
state transitions, thus allowing the ifqueue to behave like the old type,
and letting IFQ_*() / IF_*() be used.  A driver opts in to the new
behavior by calling ifq_start().  Eventually, this state will be removed
once all drivers are converted and the IFQ_*() / IF_*() macros garbage-
collected.

==> IFQ_STATE_STOPPED
This state indicates that the queue is stopped.  This is the rough analog
of IFF_RUNNING not being set in ifp->if_flags.

==> IFQ_STATE_READY
This state indicates that the queue is ready.  This is the rough analog
of IFF_RUNNING being set in ifp->if_flags.  if_start_lock() must find
the queue in this state in order to call (*if_start)().

==> IFQ_STATE_BUSY
This state indicates that the queue is being processed.  if_start_lock()
transitions from READY to BUSY immediately before calling (*if_start)().
If if_start_lock() encounters a queue that is not in READY state, it will
not call (*if_start)().  This is the rough analog of IFF_OACTIVE being set
in ifp->if_flags except the state does not need to be managed directly by
the driver.  Note that this state is not intended as a mutex around a
driver's private transmit-related data; drivers must still provide their
own serialization between (*if_start)() and the interrupt handler.  That is
because...

The transition out of IFQ_STATE_BUSY is handled automatically by ifq_get()
and ifq_stage().  If either of those functions returns a NULL, indicating
that the entire queue has been processed, then they will perform the state
transition automatically for the driver (back to READY, or to STOPPED if
someone is waiting for the queue to be stopped).  It works this way because
drivers will break out of their transmit loops upon the "no more packets"
indication, and would thus miss any additional packets that might be enqueued
between breaking out of that loop and returning from (*if_start)().  This
underscores the importance of a driver having its own serialization mechanism
around its transmit logic; there is a window in which two CPUs may be in
(*if_start)() at the same time (one having exited the transmit loop and
another attempting to enter it).  Because of this automatic state transition
behavior, this has implications for how the transmit loop should be structured,
which is covered in the examples below.

Note that if a driver breaks out of its transmit loop before draining all
packets from the queue (say, for example, because it ran out of transmit slots
in the hardware), then the queue state will remain BUSY, which is analogous
to IFF_OACTIVE remaining set, and thus preventing (*if_start)() from being
called again until more slots are available.

==> IFQ_WAIT_STOP
Someone has requested the Tx processing be stopped, but someone else was
already running the queue (i.e. the queue was BUSY).  When that processing
is finished, the state will transition to IFQ_STATE_STOPPED and the waiters
unblocked.
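
To make the transitions above concrete, here is a small userspace model
of the state machine.  It is only a sketch: every toy_* name is
hypothetical, the packet list is reduced to a counter, there is no
locking, and the sleeping half of ifq_stop() is not modelled at all.
What a stop request does to a queue still in the INVALID state is not
specified above, so the model simply leaves it alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the queue states described above (not kernel code). */
enum toy_ifq_state {
	TOY_IFQ_STATE_INVALID,	/* legacy behavior, no automatic transitions */
	TOY_IFQ_STATE_STOPPED,
	TOY_IFQ_STATE_READY,
	TOY_IFQ_STATE_BUSY,
	TOY_IFQ_WAIT_STOP,
};

struct toy_ifq {
	enum toy_ifq_state state;
	int npkts;		/* stands in for the list of queued mbufs */
};

/* Models ifq_start(): opt in (from INVALID) or restart (from STOPPED). */
void
toy_ifq_start(struct toy_ifq *q)
{
	if (q->state == TOY_IFQ_STATE_INVALID ||
	    q->state == TOY_IFQ_STATE_STOPPED)
		q->state = TOY_IFQ_STATE_READY;
}

/* Models if_start_lock(): true means (*if_start)() would be called. */
bool
toy_ifq_try_run(struct toy_ifq *q)
{
	if (q->state != TOY_IFQ_STATE_READY)
		return false;
	q->state = TOY_IFQ_STATE_BUSY;
	return true;
}

/* Models ifq_get(): hitting an empty queue is what leaves BUSY. */
bool
toy_ifq_get(struct toy_ifq *q)
{
	if (q->npkts > 0) {
		q->npkts--;
		return true;
	}
	if (q->state == TOY_IFQ_STATE_BUSY)
		q->state = TOY_IFQ_STATE_READY;
	else if (q->state == TOY_IFQ_WAIT_STOP)
		q->state = TOY_IFQ_STATE_STOPPED;
	return false;
}

/* Models the request half of ifq_stop(): true means "would sleep". */
bool
toy_ifq_request_stop(struct toy_ifq *q)
{
	if (q->state == TOY_IFQ_STATE_BUSY) {
		q->state = TOY_IFQ_WAIT_STOP;
		return true;
	}
	if (q->state == TOY_IFQ_STATE_READY)
		q->state = TOY_IFQ_STATE_STOPPED;
	return false;
}
```

Note how a caller that stops calling toy_ifq_get() while packets remain
leaves the model in BUSY, so toy_ifq_try_run() keeps returning false;
that is the behavior compared to IFF_OACTIVE above.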


==> void ifq_init(struct ifqueue *);
Initialize an interface queue.  For ifp->if_snd, this is done for
you in if_initialize().

==> void ifq_fini(struct ifqueue *);
Finalize and tear down an interface queue.  For ifp->if_snd, this is done
for you in if_detach().

==> void ifq_start(struct ifqueue *);
Start the Tx process on an interface queue.  This should be called in
(*if_init)() before setting IFF_RUNNING.

==> void ifq_stop(struct ifqueue *);
Stop the Tx process on an interface queue.  This may sleep waiting for
any outstanding Tx processing to complete.  When it returns, it is
guaranteed that (*if_start)() will not be called again until a subsequent
call to ifq_start() re-starts the Tx process.  This should be called in
(*if_stop)() before clearing IFF_RUNNING.

==> bool ifq_continue(struct ifqueue *);
This function allows transmit processing to continue after freeing up
transmit slots in the interrupt handler, and is thus the rough analog
of clearing IFF_OACTIVE.  It returns true if the resulting state is READY
and there are packets in the queue waiting to be processed, so that the
caller can arrange for (*if_start)() to be invoked.  THE DRIVER SHOULD
NOT CALL ITS START ROUTINE DIRECTLY; the preferred way is to use the
deferred-start mechanism, but a suitable wrapper (i.e. one not named
if_start_lock(), which is a horrible name, but with equivalent functionality)
will be provided for drivers that don't want to use deferred-start.
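
The return-value contract of ifq_continue() can be sketched with another
userspace toy.  The toy_* names are hypothetical, and the
deferred_starts counter merely stands in for a call to
if_schedule_deferred_start().  The model only handles the BUSY -> READY
case; what happens to a queue in the wait-for-stop state is not modelled,
since the description above does not spell it out.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_state { TOY_STOPPED, TOY_READY, TOY_BUSY };

struct toy_ifq {
	enum toy_state state;
	int npkts;		/* packets still waiting in the queue */
};

struct toy_softc {
	struct toy_ifq snd;
	int deferred_starts;	/* stands in for if_schedule_deferred_start() */
};

/*
 * Models ifq_continue(): un-stall a BUSY queue, then report whether
 * the caller should arrange for the start routine to run.
 */
bool
toy_ifq_continue(struct toy_ifq *q)
{
	if (q->state == TOY_BUSY)
		q->state = TOY_READY;
	return q->state == TOY_READY && q->npkts > 0;
}

/* Models the tail of an interrupt handler after slots were reclaimed. */
void
toy_intr(struct toy_softc *sc)
{
	/* Never call the start routine directly; only schedule it. */
	if (toy_ifq_continue(&sc->snd))
		sc->deferred_starts++;
}
```

In this model a stopped queue never schedules a start, even if packets
are sitting in it, and an empty queue goes back to READY without
scheduling anything.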

==> int ifq_put(struct ifqueue *, struct mbuf *);
Put a packet into the interface queue.  This is the equivalent of the
old IFQ_ENQUEUE().  Returns 0 on success or an error code indicating
the mode of failure (ENOBUFS if the queue is full).  This function always
consumes the packet (either places it in the queue or frees it if an
error occurs).
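
The "always consumes the packet" rule is the part callers most often
get wrong, so here is a userspace sketch of it.  Everything named toy_*
is hypothetical: the packet type is a stand-in for struct mbuf, the
queue is a plain bounded FIFO with an arbitrary limit of 2, and free()
stands in for m_freem().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_IFQ_MAXLEN	2	/* arbitrary limit for the sketch */

struct toy_pkt {
	struct toy_pkt *next;
	int id;
};

struct toy_ifq {
	struct toy_pkt *head, *tail;
	int len;
	int drops;
};

/* Models ifq_put(): the packet is consumed on success AND on failure. */
int
toy_ifq_put(struct toy_ifq *q, struct toy_pkt *p)
{
	if (q->len >= TOY_IFQ_MAXLEN) {
		q->drops++;
		free(p);		/* queue full: packet is freed */
		return ENOBUFS;
	}
	p->next = NULL;
	if (q->tail != NULL)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->len++;
	return 0;
}

/* Caller pattern: no cleanup path is needed after toy_ifq_put(). */
int
toy_output(struct toy_ifq *q, int id)
{
	struct toy_pkt *p = malloc(sizeof(*p));

	if (p == NULL)
		return ENOMEM;
	p->id = id;
	return toy_ifq_put(q, p);	/* p must not be touched after this */
}
```

Because ownership always transfers, the caller has exactly one rule to
remember: after the put call, the pointer is dead regardless of the
return value.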

==> struct mbuf *ifq_stage(struct ifqueue *);
Stage a packet for output.  This is the rough equivalent of IFQ_POLL(),
but it makes some additional guarantees.  Namely, ifq_stage() is guaranteed
to return the same mbuf each time it is called, no matter what queueing
discipline the ifqueue uses, until one of 3 things happens: the packet
is committed, aborted, or re-staged (see below).  ifq_get() (see below)
will also return the currently-staged packet before dipping into the
queue, if one exists.  As noted above, if the ifqueue state is
IFQ_STATE_BUSY and there are no more packets (staged or otherwise)
in the queue, ifq_stage() will atomically set the state to IFQ_STATE_READY
and return NULL.

==> void ifq_restage(struct ifqueue *, struct mbuf *);
Replace ("re-stage") the currently-staged packet with a new one.  This is the
replacement for prior uses of IF_PREPEND().  A packet must already
be staged in the ifqueue.  Once the new packet has taken the place
of the old one, the old packet will be freed.  It is OK to call
ifq_restage() with the pointer to the currently-staged packet; this
case is detected and treated as a no-op.

==> struct mbuf *ifq_get(struct ifqueue *);
Get a packet from the queue.  This is the equivalent of the old IFQ_DEQUEUE(),
with the caveat that if a packet has been staged, it will be returned before
ifq_get() dips into the queue.  ifq_get(), like ifq_stage(), will perform the
state transition from IFQ_STATE_BUSY to IFQ_STATE_READY, under the same
conditions.

==> void ifq_commit(struct ifqueue *);
Commits the currently-staged packet, freeing up the staging area for another
packet.  The ifq_stage() / ifq_commit() combination is the rough equivalent
of the existing IFQ_POLL() / IFQ_DEQUEUE() pattern.  The caller is responsible
for freeing the packet once the transmission has completed or the mbuf
is otherwise no longer needed.  Typically, for a network interface that is
doing DMA directly from the mbuf, the packet will be freed in the interrupt
handler.

==> void ifq_abort(struct ifqueue *);
This function is intended to be used when transmission of the packet has
encountered a fatal error.  The packet is removed from the staging area
and freed.
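
The ownership rules of the staging area (stage, restage, get, commit,
abort) can be summarized in a userspace toy.  This is a sketch under
stated assumptions, not the proposed implementation: the toy_* names are
hypothetical, the queue is a bare singly-linked list with no locking or
state tracking, and free() stands in for m_freem().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_pkt {
	struct toy_pkt *next;
	int id;
};

struct toy_ifq {
	struct toy_pkt *head;	/* packets still in the queue proper */
	struct toy_pkt *staged;	/* the single staging slot */
};

static struct toy_pkt *
toy_dequeue(struct toy_ifq *q)
{
	struct toy_pkt *p = q->head;

	if (p != NULL) {
		q->head = p->next;
		p->next = NULL;
	}
	return p;
}

/* Models ifq_stage(): same packet every time until the slot is emptied. */
struct toy_pkt *
toy_ifq_stage(struct toy_ifq *q)
{
	if (q->staged == NULL)
		q->staged = toy_dequeue(q);
	return q->staged;
}

/* Models ifq_get(): the staged packet, if any, comes out first. */
struct toy_pkt *
toy_ifq_get(struct toy_ifq *q)
{
	struct toy_pkt *p = q->staged;

	if (p != NULL) {
		q->staged = NULL;
		return p;
	}
	return toy_dequeue(q);
}

/* Models ifq_commit(): slot is emptied, caller now owns the packet. */
void
toy_ifq_commit(struct toy_ifq *q)
{
	assert(q->staged != NULL);
	q->staged = NULL;
}

/* Models ifq_abort(): slot is emptied and the packet is freed. */
void
toy_ifq_abort(struct toy_ifq *q)
{
	assert(q->staged != NULL);
	free(q->staged);
	q->staged = NULL;
}

/* Models ifq_restage(): old packet is freed unless it is the same one. */
void
toy_ifq_restage(struct toy_ifq *q, struct toy_pkt *p)
{
	assert(q->staged != NULL);
	if (p == q->staged)
		return;
	free(q->staged);
	q->staged = p;
}
```

The one asymmetry worth noticing is between commit and abort: both
empty the slot, but only abort frees the packet.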

==> void ifq_purge(struct ifqueue *);
Purges all packets from the interface queue and frees them.


== Examples ==

AN IMPORTANT TAKE-AWAY: The terminating condition of the transmit loop
should be "checking the packet queue returned NULL", and importantly NOT
"no available resources on the device".  The state transitions rely on this!
Check for packets and THEN check interface resources!

Here is an example of how a typical uses-DMA network interface driver
would use the new ifq API:

void
foo_start(struct ifnet *ifp)
{
        .
        .
        .

        mutex_spin_enter(&sc->sc_txlock);

        while ((m = ifq_stage(&ifp->if_snd)) != NULL) {

                /* Do stuff to encapsulate the packet for hardware. */
                error = foo_encap(sc, m);
                if (error) {
                        if (error == FATAL_ERROR) {
                                ifq_abort(&ifp->if_snd);
                                if_statinc(ifp, if_oerrors);
                                continue;
                        }
                        /* Temporary resource shortage. */
                        break;
                }

                /* NOW COMMITTED TO TRANSMITTING THE PACKET. */
                ifq_commit(&ifp->if_snd);

                /* mbuf will be freed when transmission is complete. */
        }

        /* Poke hardware to wake it up if packets were enqueued. */
        .
        .
        .

        mutex_spin_exit(&sc->sc_txlock);
}

void
foo_intr(struct foo_softc *sc)
{

        .
        .
        .

        mutex_spin_enter(&sc->sc_txlock);

        /* Do stuff to process the completed packet transmissions. */

        .
        .
        .

        mutex_spin_exit(&sc->sc_txlock);

        /*
         * More transmit slots are now available; get more packets going.
         */
        if (ifq_continue(&ifp->if_snd)) {
                if_schedule_deferred_start(ifp);
        }
}


What about a driver that previously used the IF_PREPEND() pattern?

void
foo_start(struct ifnet *ifp)
{
        .
        .
        .

        mutex_spin_enter(&sc->sc_txlock);

        while ((m = ifq_stage(&ifp->if_snd)) != NULL) {

                /*
                 * Do stuff to encapsulate the packet for hardware.
                 * N.B. might allocate a new mbuf.
                 */
                orig_m = m;
                error = foo_encap(sc, &m);
                if (orig_m != m) {
                        /*
                         * This block could actually be in foo_encap()
                         * in the case where it actually allocated a
                         * new one, but it's here for the example just
                         * for illustrative purposes.
                         *
                         * ifq_restage() frees the previously-staged
                         * packet (orig_m), so it must not be freed
                         * again here.
                         */
                        ifq_restage(&ifp->if_snd, m);
                }
                if (error) {
                        if (error == FATAL_ERROR) {
                                ifq_abort(&ifp->if_snd);
                                if_statinc(ifp, if_oerrors);
                                continue;
                        }
                        /* Temporary resource shortage. */
                        break;
                }

                /* NOW COMMITTED TO TRANSMITTING THE PACKET. */
                ifq_commit(&ifp->if_snd);

                /* mbuf will be freed when transmission is complete. */
        }

        /* Poke hardware to wake it up if packets were enqueued. */
        .
        .
        .

        mutex_spin_exit(&sc->sc_txlock);
}

(The interrupt routine is the same as the previous example.)


What about a driver for hardware that doesn't do DMA and can only process
one packet at a time (I'm looking at you, sun2 "ec" driver!)?

void
foo_start(struct ifnet *ifp)
{
        .
        .
        .

        mutex_spin_enter(&sc->sc_txlock);       /* could be splnet() on sun2 */

        m = ifq_get(&ifp->if_snd);
        if (m == NULL) {
                /* Nothing to send; don't return with the lock held. */
                mutex_spin_exit(&sc->sc_txlock);
                return;
        }

        /* Copy the packet to the hardware. */
        foo_writepkt(sc, m);

        /* All done with this mbuf. */
        m_freem(m);
        .
        .
        .

        mutex_spin_exit(&sc->sc_txlock);        /* could be splx(s) on sun2 */
}

(The interrupt routine is the same as the first example, except it obviously
would not need to free the mbuf because that's already been done.)


-- thorpej


