tech-userlevel archive


Re: dbcool, envsys, powerd shutting down my machine



On 30 January 2016 at 11:01, Edgar Fuß <ef%math.uni-bonn.de@localhost> wrote:
> I don't know whether this is a userland or kernel issue or a layer 8 problem.
>
> After running a customized kernel, I found a server powered down.
> The culprit turned out to be dbcool->envsys->powerd fabulating some temperature
> rose above limits.
>
> envstat -d dbcool0 says:
>            Current  CritMax  WarnMax  WarnMin  CritMin  Unit
> [...]
> r2_temp:    53.250   54.000                     45.000 degC
> [...]
>
> sysctl hw.dbcool0 says:
> [...]
> hw.dbcool0.r2_temp.Tmin = 44
> hw.dbcool0.r2_temp.Ttherm = 57
> hw.dbcool0.r2_temp.Thyst = 2
>
> If I read that correctly, it means that at 54 degC, it's time for emergency
> shut-down, while only at 57 degC, fans have to run at full speed.
> (Also, it seems to be threatening the hardware if that temp falls below 45.)
>
> I have no clue where that magic value of 54 degC comes from. It's not in any
> config file I can find, I don't find such a value in sys/dev/i2c/dbcool.c.
> Is it the BIOS writing that value into the IC? Is it a chip manufacturer
> default?
>
> The board is a Tyan S2882-D, in case that matters.
> (Btw., does anyone know what r2_temp on that board is?)
>
> I turned off powerd for now.
>
> Thanks for any hints.

In general, I don't think it ever makes sense to shut down by default
when a temperature limit is exceeded, since most of these sensors
aren't all that reliable. That is especially true if you're reading
them over i2c, with potential bus locking issues and race conditions
with the BIOS / IPMI. As an aside: at the very least, the sensor values
should be dampened, which is what OpenBSD's sensorsd does; I'm not sure
whether anything similar is done here.
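
To illustrate what I mean by dampening, here is a minimal sketch in
Python. It is not how sensorsd or powerd actually implement anything;
the class name, the 54.0 limit (taken from the envstat output quoted
above) and the choice of three consecutive readings are assumptions
made purely for illustration. The idea is only that a single bogus
reading should not be enough to trigger a shutdown.

```python
from collections import deque


class DampedLimit:
    """Hypothetical helper: flag a limit as exceeded only after
    `needed` consecutive readings are above `crit_max`."""

    def __init__(self, crit_max, needed=3):
        self.crit_max = crit_max
        self.needed = needed
        # Remember only the last `needed` over/under results.
        self.recent = deque(maxlen=needed)

    def update(self, reading):
        """Record one reading; return True if the limit is considered
        exceeded after dampening."""
        self.recent.append(reading > self.crit_max)
        return len(self.recent) == self.needed and all(self.recent)


if __name__ == "__main__":
    limit = DampedLimit(crit_max=54.0, needed=3)
    # A single spike between normal readings is ignored.
    for temp in (53.25, 90.0, 53.25, 53.5):
        print(temp, limit.update(temp))
```

With this approach a glitch on the bus costs nothing, while a genuine
overheat is still caught after a few polling intervals.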

However, it does appear that
http://bxr.su/n/etc/powerd/scripts/sensor_temperature is the powerd
script responsible for such automated shutdown.

For what it's worth, the envsys temperature limits you mention are
most likely read directly from the chip in question; see
http://BXR.SU/NetBSD/sys/dev/i2c/dbcool.c#dbcool_get_temp_limits .
Have you tried modifying these limits from userland?  (That may or may
not work, especially if something else decides to modify them behind
your back.)  A potential solution may be to change them from CRIT to
WARN type, and/or to remove the immediate shutdown from the powerd
script.

Cheers,
Constantine.SU.

