tech-kern archive
PCI MSI musings
I'm writing to share some thoughts I've had about PCI Message-Signaled
Interrupts (MSI), MSI-X, and their application:
Establishment of an MSI/MSI-X handler routine customarily happens
in stages like this:
a) Query MSI/MSI-X device capabilities: find out any
limitations to MSI/MSI-X message data & message address.
E.g., MSI has a 16-bit message data width, with the n low-order
bits reserved when the device requests 2^n messages, and either
a 32-bit or 64-bit message address width.
b) Establish a mapping,
(message address, message data) -> (CPU(s),
handler routine), in the interrupt controller (e.g.,
IOAPIC) and in the CPU (interrupt vector table).
c) Program the MSI/MSI-X registers with the message data &
address.
d) Enable MSI.
MSI/MSI-X are really useful when we use them in a customary mode, but
I think that there are useful ways that we can modify stage (b), above:
1) Device chaining: one PCI bus-master processes a memory buffer
and, when it has finished processing, triggers processing by a
second device. For example, a cryptographic coprocessor and a
network interface (NIC) share a network buffer. The cryptographic
coprocessor encrypts the buffer and signals completion by sending
a message. The target of the message is a memory location
corresponding to the NIC register that either triggers DMA
descriptor-ring polling or advances the descriptor-ring tail pointer.
2) Device polling 1: low-cost polling for coprocessor completions:
say that you have a userland driver for a PCI 3D graphics
coprocessor whose pattern of operation is to write a list of
triangles to render into memory that is shared with the device,
to issue a Draw Polygons command, to do other work until the
command completes, and to repeat. The driver tests for completion
of commands by polling a memory-mapped device register. Polling
a register is a costly operation at best. At worst, it may
introduce variable latency: the host CPU may have to retry its
transaction one or more times while a PCI bus bridge forwards
pending PCI transactions upstream.
In a much more efficient arrangement, the userland driver polls
a memory word that is the target for the coprocessor's
message-signaled completion interrupts. At least on DMA-coherent
systems like x86, the memory word can be cached, so polling it
is quite cheap.
3) Device polling 2: like above, but let us say that you have drivers
polling a bunch of NICs. Instead of polling with register reads, let
them check a shared word for changes.
4) Timer invalidation: sometimes reading hardware time sources involves
register reads that are costly. If I have an application that uses
the current time often but that doesn't need the full accuracy
that the time source provides, then the app may spend an
inordinate amount of time reading and re-reading the registers of
the time source.
If the time source can be programmed to interrupt at intervals
corresponding to the accuracy of time that your application
wants, and if the source supports MSI, then we can direct its
interrupt messages to a memory word that the app can treat as
a "cache invalidated" flag: when the app needs the current
time, it refers to the flag. If the flag is 0, then it reads
the current time from the time-source registers and caches it.
If the flag is 1, then it reads the current time from its cache.
Let the interrupt's message data be 0, so that signalling the
interrupt invalidates the app's cache.
5) I have been turning over and over in my head the idea that if there
are no processes eligible to run on a CPU except for a userland
device driver, if we want that device driver to wake and process an
interrupt with very low latency, if we are allergic for some reason
to spinning while waiting for the interrupt, and if MSI is available,
then maybe on x86 we can MONITOR/MWAIT the cacheline containing an
MSI target in the last few instructions of a return to userland. The
CPU will just hang there until either there is some other interrupt
(the hardclock ticks, say) or the message signalling the interrupt
lands.
Granted, I may have described such a rare alignment of conditions
that this is never worth it. The latency of "waking" a CPU from
its MWAIT may be very long, too: I think that typically MWAIT
is used to put the CPU into a power-saving state. I think that
the amount of power-saving is adjustable, though.
I think on most x86 CPUs, MONITOR/MWAIT are only available in
the privileged context, so another problem is that you may have
to MWAIT right on the brink of a kernel->user return.
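For what it's worth, the core of idea (5) might look something like
the following kernel-side sketch. It is not runnable from userland
(MONITOR/MWAIT fault outside ring 0), the variable name and C-state
hints are illustrative, and the re-check between MONITOR and MWAIT is
needed to close the race against a message that lands while we are
arming the monitor:

```c
static volatile uint32_t msi_target;	/* cacheline the device signals */

static void
wait_for_msi(void)
{
	while (msi_target == 0) {
		/* Arm the monitor on msi_target's cacheline. */
		__asm volatile("monitor"
		    :: "a" (&msi_target), "c" (0), "d" (0));
		if (msi_target != 0)	/* re-check: close the race */
			break;
		/* Sleep until a write to the line or an interrupt. */
		__asm volatile("mwait" :: "a" (0), "c" (0));
	}
}
```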
Dave
--
David Young
dyoung%pobox.com@localhost Urbana, IL (217) 721-9981