Subject: Re: power management
To: Jachym Holecek <freza@dspfpga.com>
From: Garrett D'Amore <garrett_damore@tadpole.com>
List: tech-kern
Date: 06/22/2006 11:40:49

We also need a way for a device driver that wants to access a device to
indicate so to the framework.  Solaris has pm_busy_component(),
pm_raise_power(), etc.  Take a close look at it.

For example, if I'm about to write data to disk, I call
pm_busy_component() on it, and then pm_raise_power().  This tells the
power management framework in the kernel that the device needs to be
powered up, and the framework powers up the device if it is not already
up (and maybe, for example, the busses or controllers to which the
device is attached!).  This can result in a recursive callback into a
different power management entry point in the device driver, btw.

Then when the write is done, the driver calls pm_idle_component() to
tell the framework that the device is no longer in use.
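
A minimal sketch of that busy / raise / idle pattern, in C.  This is a
userland mock and not the Solaris DDI: struct pm_dev and the pm_mock_*()
and mock_disk_write() names are made up for illustration, and raising the
parent to the same level as the child is a simplifying assumption.

```c
#include <stddef.h>

/* Hypothetical device node; not a Solaris or NetBSD structure. */
struct pm_dev {
	const char	*name;
	struct pm_dev	*parent;	/* bus or controller, NULL at the root */
	int		 level;		/* 0 = off, higher = more power */
	int		 busy;		/* outstanding busy marks */
};

/* Mark the device in use so the framework will not power it down. */
void
pm_mock_busy(struct pm_dev *d)
{
	d->busy++;
}

/* Drop one busy mark; at zero the framework may lower power again. */
void
pm_mock_idle(struct pm_dev *d)
{
	if (d->busy > 0)
		d->busy--;
}

/*
 * Raise the device to at least `level'.  Ancestors are powered up
 * first, which is where a real framework would call back into the
 * power entry points of other drivers.
 */
void
pm_mock_raise_power(struct pm_dev *d, int level)
{
	if (d->parent != NULL)
		pm_mock_raise_power(d->parent, level);
	if (d->level < level)
		d->level = level;
}

/* What a driver's write path might look like under these assumptions. */
int
mock_disk_write(struct pm_dev *disk)
{
	int powered;

	pm_mock_busy(disk);
	pm_mock_raise_power(disk, 1);
	powered = (disk->level >= 1);	/* the actual I/O would go here */
	pm_mock_idle(disk);
	return powered ? 0 : -1;
}
```

With a disk node whose parent is a bus node, both starting at level 0,
one call to mock_disk_write() leaves both at level 1 and the disk's busy
count back at zero.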

I can provide more info if you need it.

As far as "power states", I'd look closely at the PCI and USB power
management specs to see what they offer.  It would be nice to have
support for fully using the power features supplied by the most common
busses.

    -- Garrett
 

Jachym Holecek wrote:
> Hello,
>
> there's recently been a fair amount of work going on towards proper
> ACPI support. This seems like a good opportunity to have a look
> at power management support in NetBSD.
>
> Currently, drivers can register powerhooks that other parts of the kernel
> will run on global PM state transitions (i.e. ACPI/APM sleep). I'd like
> to propose a replacement for this.
>
> First, let's see why powerhooks aren't fully appropriate:
>
>   * Powerhooks need to be run in the right order (sleep child device
>     before putting parent bus to sleep, resume parent bus before
>     resuming child device). They indeed are now -- by accident. It's
>     a "side effect" of the way autoconfiguration works and of the
>     assumption that drivers will call powerhook_establish() while
>     they're attaching and before they run config_{found, search}.
>
>   * Powerhooks are not explicitly tied to devices, so it's impossible
>     to selectively power down unused devices, for instance. This also
>     means it's impossible to expose per-device PM to userland; we can
>     only run all powerhooks, or none.
>
>   * On some platforms, model-specific ASICs can be used to power-{up,
>     down} otherwise MI devices. This is the case for prep and sparcbook
>     (thanks garbled@ for the examples). Powerhooks don't provide a clean
>     way to handle this (i.e. run the MD handler and the driver's MI
>     powerhook).
>
> Talking about PM, it seems reasonable to distinguish the following:
>
>   1. System-wide management, affecting the whole machine. This is what
>      APM and ACPI sleep states do.
>
>   2. Per-device power management. It should be possible to allow power
>      saving operation (there can be more modes) for devices that support
>      it (802.11 wireless etc). Furthermore, the user should be allowed to
>      power off unused components (no point in running ethernet interfaces
>      when you're sitting on a plane and the battery is running out). See
>      next point.
>
>   3. The system should be able to monitor device activity and take
>      appropriate steps automatically, at least for the most obvious
>      scenarios (acpiacad(4) disconnected --> put devices to power
>      saving operation). There also needs to be a way for userland
>      to monitor this so that more sophisticated PM policies are
>      doable.
>
> I'd like to handle (2) fairly soon -- it should provide a reasonable basis
> for further work, in particular it's a prerequisite for (1). It's also good
> to do this before more powerhooks get written (I could convert the
> existing ones, there's only a handful).
>
> A couple of thoughts on (3) are pasted below [@] to get the discussion
> started. Also check the attachments for annotated comments from people
> I've discussed PM with (thanks for the input!), there's a number of ideas
> for future work -- mostly related to (3), some also touch (2) and (1).
> Messages not included here are either (hopefully) covered by the proposal,
> or mentioned in one of the attachments (I picked the longest replies ;).
>
> Now to the point -- for (2) I'd propose:
>
>   * Get rid of powerhooks as we know them
>
>   * Distinguish the following "power levels", it makes sense to define
>     this in terms of performance and functionality:
>
>     ON 		- Device is fully powered up. This is the initial state
> 		  after autoconf. This is "high" power level.
>
>     LOWPOWER1 	- Moderate power saving. May impact performance, but the
> 		  device needs to stay operational. Imagine atactl's
> 		  "idle"/"standby" states.
>
>     LOWPOWER2 	- Aggressive power saving. Sacrifice all performance and
> 		  feel free to make the device not-operational in some
> 		  non-vital way.
>
>     OFF 	- Device is stone dead. This is "low" power level.
>
>     I'm not particularly fixed on the set of levels/number of levels
>     (the above is taken from AIX, IIRC), so feel free to suggest a better
>     one, as long as it's strictly ordered and has clear "operational"
>     and "performance" semantics.
>
>   * Use ca_activate as the per-device PM entry point. The calling convention
>     may need to change slightly; it seems good to pass a request structure
>     pointer instead of an enum (the same request struct could be reused
>     for (1)). In any case, the desired power level is the primary
>     argument.
>
>   * Have a kernel process handle all device-PM operations so that the
>     devices can sleep on power level transition (may need to wait for
>     DMA or just have a *long* OFF->ON period).
>
>   * Handle devices with hierarchy in mind -- high->low transitions
>     should hit children (recursively) before the parent, low->high
>     transitions should hit the parent before any of its (recursive)
>     children.
>
>   * Provide a hook for MD code (see prep and sparcbook case above).
>     When the hook is present, it will be used to wrap ca_activate
>     calls for _any_ device in the system. This way MD code can even
>     disable some PM operations for known-broken devices, or handle
>     the need to access dedicated ASIC at appropriate point.
>
>   * Userland interface will go via a character device (/dev/power would
>     be good, except it's an optional component) as (probably) a
>     dictionary-passing ioctl.
>
>   * A powerctl tool should exist, "powerctl <dev> <level> <...>"
>     would push <dev> to <level> the way described above. When
>     <dev> is the root device, all devices are affected (obviously).
>
>   * Busses need a rescan after (some) low->high transitions.
>
>   * Don't care about "not-configured" devices *for now*. They
>     definitely should be handled (by parent bus), but it would
>     be too intrusive/messy with the current state of autoconf.
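
The ordering rule in the hierarchy bullet above can be sketched in C as
follows.  Everything here is hypothetical: struct pm_node, the numeric
values of the levels and pm_tree_set() are not existing NetBSD
interfaces, and pm_node_apply() merely stands in for whatever would end
up calling the driver's ca_activate.

```c
#include <stddef.h>

/* Strictly ordered levels, named after the proposal; values are assumed. */
enum pm_level { PM_OFF, PM_LOWPOWER2, PM_LOWPOWER1, PM_ON };

/* Hypothetical device-tree node (first-child / next-sibling links). */
struct pm_node {
	const char	*name;
	struct pm_node	*child;
	struct pm_node	*sibling;
	enum pm_level	 level;
	unsigned	 seq;	/* order in which the node was last switched */
};

static unsigned pm_seq;	/* global switch counter, for illustration only */

/* Stand-in for calling the driver's PM entry point. */
static void
pm_node_apply(struct pm_node *n, enum pm_level target)
{
	n->level = target;
	n->seq = ++pm_seq;
}

/* Move `n' and its whole subtree to `target', in hierarchy order. */
void
pm_tree_set(struct pm_node *n, enum pm_level target)
{
	struct pm_node *c;

	if (target < n->level) {
		/* high -> low: children (recursively) first, parent last */
		for (c = n->child; c != NULL; c = c->sibling)
			pm_tree_set(c, target);
		pm_node_apply(n, target);
	} else {
		/* low -> high (or no change): parent first, then children */
		pm_node_apply(n, target);
		for (c = n->child; c != NULL; c = c->sibling)
			pm_tree_set(c, target);
	}
}
```

The seq field is only there so the visiting order can be inspected
afterwards; a real implementation would run this from the kernel process
mentioned above so that each step may sleep.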
>
> I hope I didn't forget something. The actual diff would probably
> be shorter than this mail...
>
> 	-- Jachym
>
> [@] Activity monitoring:
>
>   * The devices themselves are mostly _not_ competent to decide on their
>     own activity. Instead, upper layers should indicate this. The network
>     stack can best tell if (or "how much"?) a given interface is active.
>     Wscons best knows when a display is active -- it can watch mouse
>     and keyboard, and indicate inactivity after they don't send in any
>     events for a while.
>
>   * Higher levels need to be able to query "power state" of devices,
>     no point in sending data to network card if it's off. There also
>     needs to be a way to force a device into operational mode if
>     it's powered off. This could happen just by marking the device
>     active and waiting for the event to be propagated and handled
>     eventually. Not sure about this.
>
>   * It seems reasonable to send activity events to userland so that it
>     has enough information for PM policy. If no daemon (?) is listening,
>     the kernel itself should come up with a reasonable default action.
>
>   * Transitions between active/inactive should be filtered by a
>     configurable timeout. This would avoid spurious actions and event
>     floods. "Just let me know when the disk is inactive for N seconds".
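
A minimal sketch of such a timeout filter, in C.  The struct and both
function names are hypothetical, times are plain seconds supplied by the
caller, and the periodic tick is assumed to come from some timer that is
not shown.

```c
#include <stdbool.h>

/* Hypothetical per-device activity filter; times are in seconds. */
struct pm_activity {
	unsigned long	last_event;	/* time of the most recent activity */
	unsigned long	timeout;	/* quiet period before "inactive" */
	bool		active;		/* state last reported upward */
};

/*
 * Record activity at time `now'.  Returns true only when this event
 * flips the reported state from inactive to active.
 */
bool
pm_activity_event(struct pm_activity *a, unsigned long now)
{
	a->last_event = now;
	if (a->active)
		return false;
	a->active = true;
	return true;
}

/*
 * Periodic check.  Returns true only when the device has been quiet
 * for at least `timeout' seconds and was still reported as active.
 */
bool
pm_activity_tick(struct pm_activity *a, unsigned long now)
{
	if (!a->active || now - a->last_event < a->timeout)
		return false;
	a->active = false;
	return true;
}
```

Only the calls that return true would generate an event for userland, so
a burst of disk I/O produces one "active" notification and, N seconds
after the last request, one "inactive" notification.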
>   
> ------------------------------------------------------------------------
>
> # Garrett D'Amore
>   
>> But there are several variations on power management:
>>
>>     1) global system power (think suspend to disk, etc.)
>>     2) power reduction in idle components (e.g. if a disk isn't being used,
>> spin it down)
>>     3) power off components that are not in use at all (i.e. not configured)
>>     4) power reduction by reducing power consumption when e.g. battery
>> is low (speed step, etc.)
>>     
>
> Yep, and I think the framework should keep some of the above as separate
> concepts, too.
>
>   
>> I'd start by looking at the power framework in Solaris.  From
>> experience, I think they have it "mostly" correct.
>>     
>
> Thanks for the pointer, I took some inspiration.
>   
> ------------------------------------------------------------------------
>
> # Gavan Fantom
>   
>> I would like to see each driver able to operate in one of three modes:
>>
>> * Not operating (and powered down)
>> * Operating (and powered up)
>> * Somewhere in between - operating, but drawing power only when active.
>>
>> Not all hardware will support all of these modes, but some devices,
>> especially portable ones, will benefit from being able to power down, or
>> almost power down, during periods of inactivity.
>>
>> An obvious example would be a radio transmitter module. I think most of
>> these tend to include this kind of PM in hardware.
>>
>> A less obvious example could perhaps be an XOR acceleration module. The
>> driver would be permanently hooked into dmover, but not necessarily
>> always processing data. When not processing data, it could shut down the
>> XOR module, and turn off the power to it.
>>     
>
> Agreed. The "periods of inactivity" part is TBD though.
>
>   
>> Of course you'd have to be a bit intelligent about this, because the
>> time taken to power down and up is likely to be significant. But the
>> benefit to mobile devices is not insignificant.
>>     
>
> Maybe we can have devices for which the cost is nonzero announce
> the time needed to the pm framework somehow?
>
>   
>> Also, it would be good to have a generic framework for controlling CPU
>> speed and power. x86 processors are by no means the only ones which can
>> run at variable clock speed and/or voltage. A framework for other ports
>> to attach CPU speed and power control to would be very useful.
>>     
>
> Yep. The activity tracking needs to be figured out well enough to allow
> scaling to an arbitrary number of device-private power levels (in the
> CPU case -- frequencies)...
>   
> ------------------------------------------------------------------------
>
> # Jesse Off
>   
>> I work for a company manufacturing low power single-board computers.  We
>> have a very low power board (TS-7260) that can save power by using PLLs
>> to clock down SDRAM and various buses internal to the CPU.  Also, RS232
>> transceivers, USB power, ethernet and PC104 bus power can be switched off
>> via software.
>>
>> I always thought a good framework would allow for continuous automatic
>> tweaking of clocks based on system activity-- i.e. each time quantum the
>> scheduler runs < 100% CPU utilization, the CPU clock is lowered a notch
>> and each time quantum at 100% utilization the CPU clock is increased.
>>     
>
> Yes, that would be good. With a fixed set of supported power levels
> (which I'd prefer), doing this might get a bit creative...
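
For what it's worth, the notch-up / notch-down rule described above maps
onto a fixed table of steps fairly directly.  A rough sketch in C,
assuming one decision per scheduler quantum; the table contents, the
names and the percentage-based utilization argument are all made up for
illustration and do not describe any particular board.

```c
/* Hypothetical table of supported clock steps, in MHz, lowest first. */
static const unsigned cpu_steps_mhz[] = { 50, 100, 200 };
#define CPU_NSTEPS (sizeof(cpu_steps_mhz) / sizeof(cpu_steps_mhz[0]))

/*
 * One governor decision per scheduler quantum.  `step' is the current
 * index into cpu_steps_mhz[] and `util_pct' the CPU utilization over
 * the last quantum; the return value is the index to use next.
 */
unsigned
cpu_next_step(unsigned step, unsigned util_pct)
{
	if (util_pct >= 100) {
		/* fully loaded: go one notch up, unless already at the top */
		if (step + 1 < CPU_NSTEPS)
			step++;
	} else if (step > 0) {
		/* some idle time: go one notch down, unless at the bottom */
		step--;
	}
	return step;
}
```

Starting from the middle step, a fully busy quantum selects 200 MHz and
any quantum with idle time selects 50 MHz.  A real policy would probably
want some hysteresis so the clock does not flap on every quantum.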
>
>   
>> Reconfiguring the clock takes at most milliseconds and can even be
>> instantaneous for some powers-of-2 increases/decreases.  Many applications
>> are more power sensitive than speed sensitive so even if this type of thing
>> "had some quirks" it would be an obvious net-win to embedded designs
>> caring primarily about power.
>>     
>
> One could always map power levels to frequencies in arbitrary ways,
> but having this done dynamically across the full frequency range for a
> given device would be great (see above).
>
>   
>> We are currently designing a daughter board with battery-backup and remote
>> wakeup.  This allows the board to be 100% shut off for some period of time
>> (from a few seconds to hours) and automatically re-power up.  It would be
>> nice if the system could use this type of functionality as a pseudo- low
>> power mode and automatically hibernate and spring back to life at required
>> times.  Granted, the only way this could be possible is if the system
>> didn't have any sockets open for LISTEN, all dirty blocks written, all com
>> ports were closed, etc-- but it sure would be nice if applications written
>> to sleep(3600) would power down, wait an hour, power up and return from
>> the sleep() call if at all possible-- if there happened to be another
>> thread doing something with the ethernet ports or the COM port, the system
>> would remain running.
>>     
>
> Yeah, I've heard such a scenario more than once for different platforms.
> We can pass an optional "timeout" parameter as part of the power request,
> so that the device could be programmed to wake up at a given time.
>
>   
>> We get a lot of design wins and losses based on board and OS bootup times.
>>  A SBC with only 32MB of SDRAM and few peripherals should be able to come
>> out of a hibernation type state really quick though it may bootup a full
>> OS like NetBSD quite slowly.
>>     
>
> Definitely.
>   
> ------------------------------------------------------------------------
>
> # Steven M. Bellovin
>   
>> [...]
>> Anyway -- first, of course, is support for all standard user
>> requests, such as standby, suspend, and hibernate (to disk) mode.  The
>> latter is remarkably useful on machines with lots of RAM; you can go
>> through a lot of battery keeping 2G powered!  
>>     
>
> Yep -- hibernation would be part of (3). Whatever it will look like,
> it will have a way to power-off all devices in the system in a generic
> way (2) before doing its thing.
>
>   
>> It needs to be possible to power down peripherals that aren't being used,
>> such as USB ports, Cardbus/PCMCIA slots, etc.  Ideally, this happens
>> automatically -- on Windows, there's a checkbox for "allow Windows to
>> power this down automatically".
>>     
>
> Yes, this has actually been pointed out by more people. The proposal
> addresses this, except that the decision isn't automatic (yet) -- the
> user needs to run powerctl (from powerd script, presumably).
>
>   
>> There needs to be an indicator (used by various pkgsrc programs which can
>> get tricky) that will let applications adjust their own behavior when on
>> battery.  For example, I might want my mailer to examine my \aleph_0
>> folders for new mail less frequently when I'm on battery, since it's an
>> expensive operation.
>>
>> In a related vein, there needs to be a "disconnected" indicator that
>> applications can use for similar purposes -- there's no point in polling
>> for new email if I have no IP connectivity.  One unified indicator
>> structure?
>>     
>
> Things to be considered when discussing (3).
>
>   
>> We need ACPI support for removing and installing devices.  This isn't just
>> for frills like being able to insert my CD drive after boot; it's
>> necessary on my laptop to remove and replace an Ultrabay battery.  (I fake
>> it now by suspending the machine first, so that the BIOS doesn't notice I
>> popped out a live "device".)
>>     
>
>   
>> We need better, integrated network support for suspend/resume.
>> My /etc/apm/resume script is 46 lines, because I do things with dhclient,
>> rtsol, mixerctl (the volume setting isn't saved/restored across such
>> events by the driver), battery state, etc.
>>     
>
> Save/restore kind of things should be handled by ca_activate when
> entering/leaving low power modes.
>
>   
>> The VM subsystem needs to be aware that it's on battery -- don't flush
>> pages gratuitously if the disk is spun down, but if it's ever spun up,
>> flush everything in sight.  (Linux does this.)
>>     
>
> Good point. I think higher-level PM (point 3, mostly) will necessarily
> be per-subsystem code, i.e. that designing something generic might not
> be the best idea. Let's see what comes out of the discussion.
>   


-- 
Garrett D'Amore, Principal Software Engineer
Tadpole Computer / Computing Technologies Division,
General Dynamics C4 Systems
http://www.tadpolecomputer.com/
Phone: 951 325-2134  Fax: 951 325-2191