tech-kern archive
PCI MSI musings
I'm writing to share some thoughts I've had about PCI Message-Signaled
Interrupts (MSI), MSI-X, and their application:
Establishment of an MSI/MSI-X handler routine customarily happens
in stages like this:
a) Query MSI/MSI-X device capabilities: find out any
limitations to MSI/MSI-X message data & message address.
E.g., MSI has a 16-bit message data width, with the n low-order
bits reserved when the device requests 2^n messages, and either
a 32-bit or 64-bit message address width.
b) Establish a mapping,
(message address, message data) -> (CPU(s),
handler routine), in the interrupt controller (e.g.,
IOAPIC) and in the CPU (interrupt vector table).
c) Program the MSI/MSI-X registers with the message data &
address.
d) Enable MSI.
MSI/MSI-X are really useful when we use them in a customary mode, but
I think that there are useful ways that we can modify stage (b), above:
1) Device chaining: one PCI bus-master processes a memory buffer
and, when it has finished processing, triggers processing by a
second device. For example, a cryptographic coprocessor and a
network interface (NIC) share a network buffer. The cryptographic
coprocessor encrypts the buffer and signals completion by sending
a message. The target of the message is a memory location
corresponding to the NIC register that either triggers DMA
descriptor-ring polling or advances the descriptor-ring tail pointer.
2) Device polling 1: low-cost polling for coprocessor completions:
say that you have a userland driver for a PCI 3D graphics
coprocessor whose pattern of operation is to write a list of
triangles to render into memory that is shared with the device,
to issue a Draw Polygons command, to do other work until the
command completes, and to repeat. The driver tests for completion
of commands by polling a memory-mapped device register. Polling
a register is a costly operation at best. At worst, it may
introduce variable latency: the host CPU may have to retry its
transaction one or more times while a PCI bus bridge forwards
pending PCI transactions upstream.
In a much more efficient arrangement, the userland driver polls
a memory word that is the target for the coprocessor's
message-signaled completion interrupts. At least on DMA-coherent
systems like x86, the memory word can be cached, so polling it
is quite cheap.
3) Device polling 2: like above, but let us say that you have drivers
polling a bunch of NICs. Instead of polling with register reads, let
them check a shared word for changes.
4) Timer invalidation: sometimes reading hardware time sources involves
register reads that are costly. If I have an application that uses
the current time often but that doesn't need the full accuracy
that the time source provides, then the app may spend an
inordinate amount of time reading and re-reading the registers of
the time source.
If the time source can be programmed to interrupt at intervals
corresponding to the accuracy of time that your application
wants, and if the source supports MSI, then we can direct its
interrupt messages to a memory word that the app can treat as
a "cache invalidated" flag: when the app needs the current
time, it refers to the flag. If the flag is 0, then it reads
the current time from the time-source registers and caches it.
If the flag is 1, then it reads the current time from its cache.
Let the interrupt's message data be 0, so that signalling the
interrupt invalidates the app's cache.
5) I have been turning over and over in my head the idea that if there
are no processes eligible to run on a CPU except for a userland
device driver, if we want that device driver to wake and process an
interrupt with very low latency, if we are allergic for some reason
to spinning while waiting for the interrupt, and if MSI is available,
then maybe on x86 we can MONITOR/MWAIT the cacheline containing an
MSI target in the last few instructions of a return to userland. The
CPU will just hang there until either there is some other interrupt
(the hardclock ticks, say) or the message signalling the interrupt
lands.
Granted, I may have described such a rare alignment of conditions
that this is never worth it. The latency of "waking" a CPU from
its MWAIT may be very long, too: I think that typically MWAIT
is used to put the CPU into a power-saving state. I think that
the amount of power-saving is adjustable, though.
I think on most x86 CPUs, MONITOR/MWAIT are only available in
the privileged context, so another problem is that you may have
to MWAIT right on the brink of a kernel->user return.
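For what it's worth, the core of idea (5) might look something like
the following kernel-side sketch. It is not runnable from userland
(MONITOR/MWAIT fault outside ring 0), the variable name and C-state
hints are illustrative, and the re-check between MONITOR and MWAIT is
needed to close the race against a message that lands while we are
arming the monitor:

```c
static volatile uint32_t msi_target;	/* cacheline the device signals */

static void
wait_for_msi(void)
{
	while (msi_target == 0) {
		/* Arm the monitor on msi_target's cacheline. */
		__asm volatile("monitor"
		    :: "a" (&msi_target), "c" (0), "d" (0));
		if (msi_target != 0)	/* re-check: close the race */
			break;
		/* Sleep until a write to the line or an interrupt. */
		__asm volatile("mwait" :: "a" (0), "c" (0));
	}
}
```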
Dave
--
David Young
dyoung%pobox.com@localhost Urbana, IL (217) 721-9981