Current-Users archive

Re: cpu temperature readings



    Date:        Sun, 2 Jul 2023 08:11:59 -0000 (UTC)
    From:        mlelstv%serpens.de@localhost (Michael van Elst)
    Message-ID:  <u7rbg7$4od$1%serpens.de@localhost>

  | In the end that means the chip either won't reach its maximum turbo
  | speed, or only for a shorter time, or only when cooled better. The
  | value that corresponds to this is called cTDP (and usually used
  | to raise the value for extreme overclocking, but it can also be
  | reduced).

Next time I am in the BIOS I will look more carefully for something
related to that.   There are vast numbers of settings, most of which
I have never seen... (never had a reason to look).

  | The values probably come from ACPI. I first thought there was a limit
  | of 16 states, but we (arbitrarily) have a limit of 256. So either
  | ACPI doesn't show all states that you can see in the BIOS interface
  | or we have a bug.

This is an extraordinarily unimportant issue, but if I find some free
time I will see if I can add some debugging to observe what is happening
with that.

  | coretemp doesn't have thresholds, so it cannot trigger powerd to shut down.

That's weird then.   Most times I have no idea what is causing
the halt (poweroff), but on a couple of occasions I have observed
powerd shutting things down because of over-temperature (I
haven't configured any limits).   Once I was ssh'd in from my phone
when it happened, and as that was a login session I got powerd's
broadcast messages.   Those shutdowns feel different from the others
though: things seem to sequence through the normal shutdown steps.

Could the powerd behaviour perhaps be related to these couple of lines
in arch/x86/x86/coretemp.c:coretemp_refresh_xcall()

        if ((msr & MSR_THERM_STATUS_CRIT_STA) != 0)
                edata->state = ENVSYS_SCRITICAL;

that is, rather than reaching some configured limit, simply being told
by the cpu that the status is critical ?
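
For reference, a minimal sketch of what that test looks at, assuming the
IA32_THERM_STATUS layout as I understand it from Intel's SDM (critical
temperature status in bit 4, digital readout in bits 22:16, reading valid
in bit 31).   The EX_* names and the helper are made up for illustration;
they are not the NetBSD definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of IA32_THERM_STATUS; illustrative names only. */
#define EX_THERM_CRIT_STATUS	(UINT64_C(1) << 4)	/* critical temperature status */
#define EX_THERM_READOUT_SHIFT	16			/* digital readout, bits 22:16 */
#define EX_THERM_READOUT_MASK	UINT64_C(0x7f)
#define EX_THERM_VALID		(UINT64_C(1) << 31)	/* reading valid */

struct ex_therm {
	bool valid;		/* readout field can be trusted */
	bool critical;		/* cpu itself reports a critical state */
	int below_tjmax;	/* degrees C below Tjmax */
};

/* Split a raw MSR value into the fields of interest. */
static struct ex_therm
ex_decode_therm_status(uint64_t msr)
{
	struct ex_therm t;

	t.valid = (msr & EX_THERM_VALID) != 0;
	t.critical = (msr & EX_THERM_CRIT_STATUS) != 0;
	t.below_tjmax =
	    (int)((msr >> EX_THERM_READOUT_SHIFT) & EX_THERM_READOUT_MASK);
	return t;
}
```

The point being that "critical" here is a flag set by the cpu, independent
of any limit configured in software.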

In my recent changes I added a debug printf (with relevant data included)
for when that happens.   I haven't ever seen it print - but if the system
shuts down quickly enough, there's no guarantee I would.

But there is no question that I saw powerd shut the system down because
of over-temperature, twice that I know of (a small fraction of all the
shutdowns), and coretemp provides the only temperature related info
available to the system.   (Well, I guess there is drive temp, from SMART,
but that wouldn't shut the system down, and none of the drives has ever
reached its upper limit - though some have come close once or twice.
There may also be temp sensors in the DIMMs, but nothing I'm aware of is
accessing those.)

  | Immediate power off also doesn't suggest that this is a shutdown.

Yes, agreed, or not a normal one anyway.  At least most of the time.

  | I would
  | guess it's either the CPU reaching its limit (unlikely to your description,

I'm less sure how unlikely it is, as:

  | but the temperature can change very very quickly)

it certainly does that: if I am doing a -j16 build, it will, using the old
code (Tjmax==100), ramp up to 90 (which probably was really 105) quite
quickly - but then cool down again just as quickly when a brief gap in
the cpu intensive part of the build happens.   I have done several builds
during this period of instability, and the system has never halted during
any of those.
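
The "90 was really 105" arithmetic, as a minimal sketch: it assumes the
sensor reports degrees below Tjmax, so the temperature is Tjmax minus the
readout, and a reading computed with the wrong Tjmax is off by the
difference between the two Tjmax values.   The function names are made up
for illustration:

```c
/* Temperature in degrees C from a "degrees below Tjmax" readout. */
static int
ex_temp_from_readout(int tjmax, int readout)
{
	return tjmax - readout;
}

/*
 * Re-express a temperature that was computed with old_tjmax as the
 * temperature the same raw readout gives with new_tjmax.
 */
static int
ex_rebase_temp(int reported, int old_tjmax, int new_tjmax)
{
	int readout = old_tjmax - reported;	/* recover the raw readout */

	return ex_temp_from_readout(new_tjmax, readout);
}
```

With the numbers above, ex_rebase_temp(90, 100, 115) gives 105.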

But when it is just idling (mostly) in turbo mode, the temps seem to
stay fairly stable for a while (hours perhaps), then go up a degree or
two, then rise again more quickly, and then again even more quickly,
and possibly by a bit more.   The at rest idle temp in turbo mode
(new code, with Tjmax==115) seems to be around 50, so my having observed
temps of about 60 (as the apparent baseline, not the occasional blip up
and down again, which happens all the time when something runs for a
second or two) suggests that this ramp up effect was happening.   The
higher the temp rises, the faster it rises further, so it easily could
have been that.

  | or something completely
  | different (motherboard power regulators or even the PSU?).

Certainly.   Anything is possible.   I suspect something changed
(broke, or wore out) about a month ago, as this is a new phenomenon
in the past month or so - clearly it is marginal, and only seems to
affect things in turbo mode (higher power draw).   Initially I thought
(kind of absurdly perhaps) it may have been somehow related to a very
large copy (set of copies) to an external USB drive I was doing, but that
finished weeks ago now (there are also I/O performance related issues I will
mention later related to that, but as a preview, I once observed a "sync"
command take 3 hours to complete).

I have been concentrating on the temperature issue, as that one is at
least observable, and has been behaving erratically (though some of
that is now explained).
  
  | The Z690 Taichi BIOS seems to have an event log, not sure what it actually
  | logs.

Thanks - more for me to look for next time I am in the BIOS.   I guess I
need to spend some time and look through everything that is there.

kre


