tech-kern archive


Re: physical address space management



David Young wrote:
I have in mind an MI API for managing physical address space in NetBSD.
I invite your feedback.

Here are some uses that I envision for a physaddr space API:

        * abstract cache-control, execute- and write-protection features
          of x86 MTRR, AMD Elan SC520, et cetera

        * provide for dynamic allocation of bus space for ExpressCard,
          CardBus, et cetera.

        * express memory topology---e.g., NUMA, cluster

On x86 there are the ACPI SLIT and SRAT tables. It is best to decode the SRAT table and build your own SLIT, since only a few BIOSes provide one.

        * bad RAM management

Before you reinvent the wheel, please read this first:
http://www.opensparc.net/pubs/papers/MPR_DSN06.pdf


        * replace multiple ad hoc, MD, broken and/or undocumented
          mechanisms for managing physical address space, including rbus.

The major entities in the API are an address space *arena*,
*interconnects* and *mappings* between arenas, address space *regions*
within each arena, and address region *type*, *use*, *protection*,
and *properties*.

The API lets us carve up address space into "arenas."  An arena is a
set of consecutive physical addresses with identical type and access
characteristics.  For example, arenas on a typical uniprocessor system may
be the system RAM, the system ROM, and a PCI bus.  In a NUMA system, there
may be more than one RAM arena.

Not just may - there are! On x86, within one NUMA node there are several regions: some below 1 MB, some below 4 GB, and some above. Also distinguish between DMA-safe and DMA-unsafe memory regions.
The big NUMA machines also have CPU-less NUMA nodes - they consist of just
a memory controller and some DIMMs. Access to this memory is slow for
CPUs, but not necessarily for DMA controllers and other peripheral devices.

Arenas are unified by "interconnects,"
objects that model the bus bridges, IOMMUs, HyperTransport links, et
cetera.  Each interconnect connects precisely two arenas.  Properties of
an interconnect include a mapping from the first arena to the second
(expresses IOMMU configuration), and a metric (expresses cost/delay/speed
of traversing the interconnect).

So you consider the binding of IO chips to the NUMA nodes in the topology? If yes, that's cool! A process doing heavy IO shows up to 25% more performance when it runs on the node with a direct connection to the IO chip than when it runs on another node. => The scheduler must be clever here.

It is not interrupt load, because interrupts are acknowledged immediately but their handling is deferred a little.


Within an arena are regions.  A region is a span of consecutive physical
addresses with identical ownership and access permissions, use, and
operational characteristics.  For example, a RAM arena may contain a
text region, a read-only data region, and a read/write data region.
A PCI bus arena may have both a prefetchable, write-combining region,
and several non-prefetchable regions.

Is a region physically contiguous? If yes, then it should be possible to use large memory pages for malloc() or mmap(). On x86 these are 2 MB/4 MB and 1 GB, and on Alpha they come in various sizes up to 512 MB.

Methods on arenas let the kernel load an arena with regions during
bootstrap, reserve regions, and map regions from one arena into
another.  Methods on regions let the kernel release a region, get/set
properties such as use (text, data, DMA buffer), access permissions
(read/write/execute), and operational characteristics (uncached,
prefetchable, write-back).

When device drivers for bus bridges, IOMMUs, CPUs, x86 MTRR, et cetera
attach, they register with the physaddr space manager.  When a region's
properties or mappings change, the manager will notify registered drivers
so that they can re-program MTRRs or IOMMUs, or adjust address windows
on bus bridges.

Here is the start of an API:

enums are bad in an API. Not the type, but its size: C leaves it
implementation-defined, and some ABIs (e.g. with -fshort-enums) use the
smallest integer type that holds the largest value. There, all enums
below are sizeof(char), and whenever a value > 0xff is added, you
change the ABI.


enum pmem_props {                       /* hardware implementation */
          PMEM_P_WTHRU          = 0x01  /* MTRR */
        , PMEM_P_WBACK          = 0x02  /* MTRR */
        , PMEM_P_WCOMB          = 0x04  /* MTRR */
        , PMEM_P_UNCACHED       = 0x08  /* MTRR, AMD Elan SC520 PAR */
        , PMEM_P_PREFETCH       = 0x10  /* PCI bus bridge */
};

enum pmem_prot {                        /* hardware implementation */
          PMEM_PROT_READ        = 0x01  /* PCI bus bridge, IOMMU */
        , PMEM_PROT_WRITE       = 0x02  /* PCI bus bridge, IOMMU, MTRR,
                                         * AMD Elan SC520 PAR
                                         */
        , PMEM_PROT_EXEC        = 0x04  /* AMD Elan SC520 PAR */
};

enum pmem_type {
          PMEM_T_RAM            = 0x01
        , PMEM_T_ROM            = 0x02
        , PMEM_T_PCI            = 0x04
};

enum pmem_use {
          PMEM_U_TEXT           = 0x01
        , PMEM_U_DMABUF         = 0x02
        , PMEM_U_DATA           = 0x04
        , PMEM_U_DEVREGS        = 0x08
        , PMEM_U_FRAMEBUF       = 0x10
        , PMEM_U_BROKEN         = 0x20  /* bad RAM */
};

typedef enum pmem_props pmem_props_t;
typedef enum pmem_prot pmem_prot_t;
typedef enum pmem_type pmem_type_t;
typedef enum pmem_use pmem_use_t;

typedef struct pmem_arena *pmem_arena_t;

pmem_arena_t
pmem_arena_create(pmem_type_t);

static const pmem_mapping_t pmem_mapping_identity = NULL;

/* Connect two arenas. */
int
pmem_arena_connect(pmem_arena_t left, pmem_arena_t right,
    pmem_mapping_t m, pmem_metric_t metric);

/* Load arena `a' with physical addresses [start, end) having the given
 * default properties.
 */
int
pmem_arena_prime(pmem_arena_t a, paddr_t start, paddr_t end,
    pmem_use_t use, pmem_prot_t prot, pmem_props_t props);

/* Reserve a region in arena `a' that meets the given criteria.
 * The region is returned with a reference count of at least 1.
 */
pmem_region_t
pmem_alloc(pmem_arena_t a, paddr_t start, paddr_t end,
    pmem_prot_t prot, pmem_props_t props, pmem_use_t use,
    size_t align, size_t len, pmem_metric_t maxmetric);

/* Get/set properties on the region `r'. */
int
pmem_get(pmem_region_t r, pmem_prot_t *prot, pmem_props_t *props,
    pmem_use_t *use);

int
pmem_set(pmem_region_t r, pmem_prot_t prot, pmem_props_t props,
    pmem_use_t use);

/* Count another reference to region `r'. */
void
pmem_incref(pmem_region_t r);

/* Reduce the reference count on `r' by one.  pmem_decref may reclaim the
 * resources held by `r'.
 */
void
pmem_decref(pmem_region_t r);

/* Map region `r' into arena `a'.
 *
 * Returns NULL on failure.  `paddr' is undefined on failure.
 *
 * On success, return `r' if region `r' belongs to arena `a', or else
 * return an alias for region `r' in `a'.  The returned region's reference
 * count is increased by one.  Set `paddr' to the physical address of
 * the start of the region `r' in arena `a'.
 */
pmem_region_t
pmem_map(pmem_arena_t a, pmem_region_t r, paddr_t *paddr);

/* Remove a mapping of `r' from its arena.  Decrease the reference count
 * by one.
 */
void
pmem_unmap(pmem_region_t r);

Dave



