Subject: power management
To: None <tech-kern@netbsd.org>
From: Jachym Holecek <freza@dspfpga.com>
List: tech-kern
Date: 06/22/2006 19:32:36
--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hello,

there's recently been a fair amount of work going on towards proper
ACPI support. This seems like a good opportunity to have a look
at power management support in NetBSD.

Currently, drivers can register powerhooks that other parts of the
kernel will run on global PM state transitions (i.e. ACPI/APM sleep).
I'd like to propose a replacement for this.

First, let's see why powerhooks aren't fully appropriate:

  * Powerhooks need to be run in the right order (sleep child device
    before putting parent bus to sleep, resume parent bus before
    resuming child device). They indeed are now -- by accident. It's
    a "side effect" of the way autoconfiguration works and of the
    assumption that drivers will call powerhook_establish() while
    they're attaching and before they run config_{found, search}.

  * Powerhooks are not explicitly tied to devices, so it's impossible
    to selectively power down unused devices, for instance. This also
    means it's impossible to expose per-device PM to userland; we can
    only run all powerhooks, or none.

  * On some platforms, model-specific ASICs can be used to power-{up,
    down} otherwise MI devices. This is the case for prep and sparcbook
    (thanks garbled@ for the examples). Powerhooks don't provide a clean
    way to handle this (i.e. run the MD handler and the driver's MI
    powerhook).

Talking about PM, it seems reasonable to distinguish the following:

  1. System-wide management, affecting the whole machine. This is what
     APM and ACPI sleep states do.

  2. Per-device power management. It should be possible to allow power
     saving operation (there can be more than one mode) for devices that
     support it (802.11 wireless etc). Furthermore, the user should be
     allowed to power off unused components (no point in running ethernet
     interfaces when you're sitting on a plane and the battery is running
     out). See next point.

  3. The system should be able to monitor device activity and take
     appropriate steps automatically, at least for the most obvious
     scenarios (acpiacad(4) disconnected --> put devices to power
     saving operation). There also needs to be a way for userland
     to monitor this so that more sophisticated PM policies are
     doable.

I'd like to handle (2) fairly soon -- it should provide a reasonable
basis for further work, in particular it's a prerequisite for (1). It's
also good to do this before more powerhooks get written (I could convert
the existing ones, there's only a handful).

A couple of thoughts on (3) are pasted below [@] to get the discussion
started. Also check the attachments for annotated comments from people
I've discussed PM with (thanks for the input!), there's a number of ideas
for future work -- mostly related to (3), some also touch (2) and (1).
Messages not included here are either (hopefully) covered by the proposal,
or mentioned in one of the attachments (I picked the longest replies ;).

Now to the point -- for (2) I'd propose:

  * Get rid of powerhooks as we know them

  * Distinguish the following "power levels"; it makes sense to define
    them in terms of performance and functionality:

    ON 		- Device is fully powered up. This is the initial state
		  after autoconf. This is "high" power level.

    LOWPOWER1 	- Moderate power saving. May impact performance, but the
		  device needs to stay operational. Imagine atactl's
		  "idle"/"standby" states.

    LOWPOWER2 	- Aggressive power saving. Sacrifice all performance and
		  feel free to make the device not-operational in some
		  non-vital way.

    OFF 	- Device is stone dead. This is "low" power level.

    I'm not particularly fixed on the set of levels/number of levels
    (the above is taken from AIX, IIRC), so feel free to suggest a better
    one, as long as it's strictly ordered and has clear "operational"
    and "performance" semantics.

  * Use ca_activate as the per-device PM entry point. The calling
    convention may need to change slightly; it seems good to pass a
    request structure pointer instead of an enum (the same request
    struct could be reused for (1)). In any case, the desired power
    level is the primary argument.

  * Have a kernel process handle all device-PM operations so that the
    devices can sleep on power level transition (may need to wait for
    DMA or just have a *long* OFF->ON period).

  * Handle devices with hierarchy in mind -- high->low transitions
    should hit children (recursively) before the parent, low->high
    transitions should hit the parent before any of its (recursive)
    children.

  * Provide a hook for MD code (see prep and sparcbook case above).
    When the hook is present, it will be used to wrap ca_activate
    calls for _any_ device in the system. This way MD code can even
    disable some PM operations for known-broken devices, or handle
    the need to access dedicated ASIC at appropriate point.

  * The userland interface will go via a character device (/dev/power
    would be good, except it's an optional component), probably as a
    dictionary-passing ioctl.

  * A powerctl tool should exist, "powerctl <dev> <level> <...>"
    would push <dev> to <level> the way described above. When
    <dev> is the root device, all devices are affected (obviously).

  * Busses need a rescan after (some) low->high transitions.

  * Don't care about "not-configured" devices *for now*. They
    definitely should be handled (by the parent bus), but it would
    be too intrusive/messy with the current state of autoconf.
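
To make the ordering rule and the MD hook a bit more concrete, here is a
minimal userland sketch, not kernel code. All names in it (pm_level,
pm_device, pm_set_level, pm_md_hook, pm_trace) are made up for
illustration; the real entry point would be ca_activate with whatever
calling convention we settle on. It assumes a larger enum value means
more power, and pm_trace only exists to show the order of calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical levels, strictly ordered: a larger value means more power. */
enum pm_level { PM_OFF, PM_LOWPOWER2, PM_LOWPOWER1, PM_ON };

#define PM_MAXCHILD	4
#define PM_MAXTRACE	16

struct pm_device {
	const char *name;
	enum pm_level level;
	struct pm_device *child[PM_MAXCHILD];	/* unused slots are NULL */
	/* Stand-in for the driver's ca_activate, may be NULL; 0 on success. */
	int (*activate)(struct pm_device *, enum pm_level);
};

/* Optional MD hook. When set, it wraps the activate call of every device. */
int (*pm_md_hook)(struct pm_device *, enum pm_level) = NULL;

/* Demonstration only: names of the devices in the order they were hit. */
const char *pm_trace[PM_MAXTRACE];
size_t pm_ntrace = 0;

static int
pm_activate_one(struct pm_device *dev, enum pm_level level)
{
	int error = 0;

	if (pm_md_hook != NULL)
		error = pm_md_hook(dev, level);	/* MD code is expected to call dev->activate */
	else if (dev->activate != NULL)
		error = dev->activate(dev, level);
	if (error != 0)
		return error;
	if (pm_ntrace < PM_MAXTRACE)
		pm_trace[pm_ntrace++] = dev->name;
	dev->level = level;
	return 0;
}

/* Push dev and everything below it to the requested level. */
int
pm_set_level(struct pm_device *dev, enum pm_level level)
{
	int lowering, error;
	size_t i;

	assert(dev != NULL);
	lowering = (level < dev->level);

	/* low->high: the parent goes first. */
	if (level > dev->level) {
		error = pm_activate_one(dev, level);
		if (error != 0)
			return error;
	}
	for (i = 0; i < PM_MAXCHILD; i++) {
		if (dev->child[i] == NULL)
			continue;
		error = pm_set_level(dev->child[i], level);
		if (error != 0)
			return error;
	}
	/* high->low: the children are done, now the parent. */
	if (lowering)
		return pm_activate_one(dev, level);
	return 0;
}
```

With a chain pci0 -> atabus0 -> wd0, all at ON, pm_set_level(&pci0,
PM_OFF) hits wd0, atabus0, pci0 in that order; going back to ON hits
them in the reverse order.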

I hope I didn't forget something. The actual diff would probably
be shorter than this mail...

	-- Jachym

[@] Activity monitoring:

  * The devices themselves are mostly _not_ competent to decide on their
    own activity. Instead, upper layers should indicate this. The network
    stack can best tell if (or "how much"?) a given interface is active.
    Wscons best knows when a display is active -- it can watch mouse
    and keyboard, and indicate inactivity after they don't send in any
    events for a while.

  * Higher levels need to be able to query the "power state" of devices;
    no point in sending data to a network card if it's off. There also
    needs to be a way to force a device into operational mode if
    it's powered off. This could happen just by marking the device
    active and waiting for the event to be propagated and handled
    eventually. Not sure about this.

  * It seems reasonable to send activity events to userland so that it
    has enough information for PM policy. If no daemon (?) is listening,
    the kernel itself should come up with a reasonable default action.

  * Transitions between active/inactive should be filtered by a
    configurable timeout. This would avoid spurious actions and event
    floods. "Just let me know when the disk is inactive for N seconds".
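
A rough sketch of the timeout filter from the last point, as plain
userland C. The names (pm_activity, pm_activity_mark, pm_activity_poll)
are invented, and time is passed in explicitly as seconds to keep the
example deterministic. As a simplification it only filters the
active->inactive direction; the inactive->active event is reported
immediately.

```c
#include <assert.h>

enum pm_event { PM_EV_NONE, PM_EV_ACTIVE, PM_EV_IDLE };

struct pm_activity {
	unsigned long last;	/* time of the last reported activity, seconds */
	unsigned long timeout;	/* N: quiet period before PM_EV_IDLE is sent */
	int idle;		/* state most recently announced to listeners */
};

void
pm_activity_init(struct pm_activity *a, unsigned long now,
    unsigned long timeout)
{
	assert(timeout > 0);
	a->last = now;
	a->timeout = timeout;
	a->idle = 0;
}

/* Called by the upper layer (network stack, wscons, ...) on every use. */
enum pm_event
pm_activity_mark(struct pm_activity *a, unsigned long now)
{
	a->last = now;
	if (a->idle) {
		a->idle = 0;
		return PM_EV_ACTIVE;
	}
	return PM_EV_NONE;	/* already active, nothing to announce */
}

/* Called periodically; announces idleness once per quiet period. */
enum pm_event
pm_activity_poll(struct pm_activity *a, unsigned long now)
{
	if (!a->idle && now - a->last >= a->timeout) {
		a->idle = 1;
		return PM_EV_IDLE;
	}
	return PM_EV_NONE;
}
```

With a timeout of 30 seconds, a device that is used every 20 seconds
never generates an event; one PM_EV_IDLE is sent only once 30 seconds
have passed since the last use.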

--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=garrett-damore

# Garrett D'Amore
> But there are several variations on power management:
> 
>     1) global system power (think suspend to disk, etc.)
>     2) power reduction in idle components (e.g. if a disk isn't being
> used, spin it down)
>     3) power off components that are not in use at all (i.e. not configured)
>     4) power reduction by reducing power consumption when e.g. battery
> is low (speed step, etc.)

Yep, and I think the framework should keep some of the above as separate
concepts, too.

> I'd start by looking at the power framework in Solaris.  From
> experience, I think they have it "mostly" correct.

Thanks for the pointer, I took some inspiration.

--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=gavan-fantom

# Gavan Fantom
> I would like to see each driver able to operate in one of three modes:
> 
> * Not operating (and powered down)
> * Operating (and powered up)
> * Somewhere in between - operating, but drawing power only when active.
> 
> Not all hardware will support all of these modes, but some devices,
> especially portable ones, will benefit from being able to power down, or
> almost power down, during periods of inactivity.
> 
> An obvious example would be a radio transmitter module. I think most of
> these tend to include this kind of PM in hardware.
> 
> A less obvious example could perhaps be an XOR acceleration module. The
> driver would be permanently hooked into dmover, but not necessarily
> always processing data. When not processing data, it could shut down the
> XOR module, and turn off the power to it.

Agreed. The "periods of inactivity" part is TBD though.

> Of course you'd have to be a bit intelligent about this, because the
> time taken to power down and up is likely to be significant. But the
> benefits to mobile devices is not insignificant.

Maybe we can have devices for which the cost is nonzero announce
the time needed to the pm framework somehow?
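
For instance (just a sketch, the struct and function names are
invented): a driver could announce its transition times, and the
framework would only bother powering the device down when the expected
idle period is longer than the round trip. The break-even rule used
here is an assumption, real policy may want a safety margin on top.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device cost announcement, in milliseconds. */
struct pm_cost {
	uint32_t down_ms;	/* time needed to power the device down */
	uint32_t up_ms;		/* time needed to make it operational again */
};

/* True if the device would spend some of the idle period powered down. */
bool
pm_worth_powering_down(const struct pm_cost *c, uint32_t idle_ms)
{
	uint64_t overhead;

	assert(c != NULL);
	overhead = (uint64_t)c->down_ms + c->up_ms;	/* cannot overflow */
	return (uint64_t)idle_ms > overhead;
}
```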

> Also, it would be good to have a generic framework for controlling CPU
> speed and power. x86 processors are by no means the only ones which can
> run at variable clock speed and/or voltage. A framework for other ports
> to attach CPU speed and power control to would be very useful.

Yep. The activity tracking needs to be figured out well enough to allow
scaling to an arbitrary number of a device's private power levels (in
the CPU case -- frequencies)...

--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=jesse-off

# Jesse Off
> I work for a company manufacturing low power single-board computers.  We
> have a very low power board (TS-7260) that can save power by using PLL's
> to clock down various internal to the CPU buses and SDRAM.  Also, RS232
> transceivers, USB power, ethernet and PC104 bus power can be switched off
> via software.
>
> I always thought a good framework would allow for continuous automatic
> tweaking of clocks based on system activity-- i.e. each time quanta the
> scheduler runs < 100% CPU utilization, the CPU clock is lowered a notch
> and each time quanta at 100% utilization the CPU clock is increased. 

Yes, that would be good. With a fixed set of supported power levels
(which I'd prefer), doing this might get a bit creative...
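
The stepping rule described in the quote is simple enough to sketch in
userland C. The names (cpu_clock, cpu_clock_tick) and the frequency
table used in the example are made up; they are not taken from any real
board. The function is meant to be called once per scheduler quantum.

```c
#include <assert.h>
#include <stddef.h>

struct cpu_clock {
	const unsigned *freq_mhz;	/* supported steps, lowest first */
	size_t nfreq;			/* number of steps */
	size_t cur;			/* index of the current step */
};

/*
 * One notch up after a fully busy quantum, one notch down otherwise.
 * Returns the frequency selected for the next quantum.
 */
unsigned
cpu_clock_tick(struct cpu_clock *c, unsigned util_percent)
{
	assert(c->nfreq > 0 && c->cur < c->nfreq);

	if (util_percent >= 100) {
		if (c->cur + 1 < c->nfreq)
			c->cur++;
	} else if (c->cur > 0) {
		c->cur--;
	}
	return c->freq_mhz[c->cur];
}
```

With steps of 50, 100 and 200 MHz and the clock starting at 200, two
idle-ish quanta bring it down to 50, and two fully busy ones bring it
back to 200. The clock stays put at either end of the table.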

> Reconfiguring the clock takes at most milliseconds and can even be
> instantaneous for some powers-of-2 increases/decreases.  Many applications
> are more power sensitive than speed sensitive so even if this type of thing
> "had some quirks" it would be an obvious net-win to embedded designs
> caring primarily about power.

One could always map power levels to frequencies in arbitrary ways,
but having this done dynamically across the full frequency range for
a given device would be great (see above).

> We are currently designing a daughter board with battery-backup and remote
> wakeup.  This allows the board to be 100% shut off for some period of time
> (from a few seconds to hours) and automatically re-power up.  It would be
> nice if the system could use this type of functionality as a pseudo- low
> power mode and automatically hibernate and spring back to life at required
> times.  Granted, the only way this could be possible is if the system
> didn't have any sockets open for LISTEN, all dirty blocks written, all com
> ports were closed, etc-- but it sure would be nice if applications written
> to sleep(3600) would power down, wait an hour, power up and return from
> the sleep() call if at all possible-- if there happened to be another
> thread doing something with the ethernet ports or the COM port, the system
> would remain running.

Yeah, I've heard such a scenario more than once for different platforms.
We can pass an optional "timeout" parameter as part of the power request,
so that the device could be programmed to wake up at a given time.

> We get a lot of design wins and losses based on board and OS bootup times.
>  A SBC with only 32MB of SDRAM and few peripherals should be able to come
> out of a hibernation type state really quick though it may bootup a full
> OS like NetBSD quite slowly.

Definitely.

--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=steven-bellovin

# Steven M. Bellovin
> [...]
> Anyway -- first, of course, is support for all standard user
> requests, such as standby, suspend, and hibernate (to disk) mode.  The
> latter is remarkably useful on machines with lots of RAM; you can go
> through a lot of battery keeping 2G powered!  

Yep -- hibernation would be part of (1). Whatever it will look like,
it will have a way to power-off all devices in the system in a generic
way (2) before doing its thing.

> It needs to be possible to power down peripherals that aren't being used,
> such as USB ports, Cardbus/PCMCIA slots, etc.  Ideally, this happens
> automatically -- on Windows, there's a checkbox for "allow Windows to
> power this down automatically".

Yes, this has actually been pointed out by more people. The proposal
addresses this, except that the decision isn't automatic (yet) -- the
user needs to run powerctl (from powerd script, presumably).

> There needs to be an indicator (used by various pkgsrc programs which can
> get tricky) that will let applications adjust their own behavior when on
> battery.  For example, I might want my mailer to examine my \aleph_0
> folders for new mail less frequently when I'm on battery, since it's an
> expensive operation.
> 
> In a related vein, there needs to be a "disconnected" indicator that
> applications can use for similar purposes -- there's no point in polling
> for new email if I have no IP connectivity.  One unified indicator
> structure?

Things to be considered when discussing (3).

> We need ACPI support for removing and installing devices.  This isn't just
> for frills like being able to insert my CD drive after boot; it's
> necessary on my laptop to remove and replace an Ultrabay battery.  (I fake
> it now by suspending the machine first, so that the BIOS doesn't notice I
> popped out a live "device".)

> We need better, integrated network support for suspend/resume.
> My /etc/apm/resume script is 46 lines, because I do things with dhclient,
> rtsol, mixerctl (the volume setting isn't saved/restored across such
> events by the driver), battery state, etc.

Save/restore kind of things should be handled by ca_activate when
entering/leaving low power modes.

> The VM subsystem needs to be aware that it's on battery -- don't flush
> pages gratuitously if the disk is spun down, but if it's ever spun up,
> flush everything in sight.  (Linux does this.)

Good point. I think higher-level PM (point 3, mostly) will necessarily
be per-subsystem code, i.e. designing something generic might not
be the best idea. Let's see what comes out of the discussion.

--vtzGhvizbBRQ85DL--