Re: cpu temperature readings

To: Masanobu SAITOH <msaitoh%execsw.org@localhost>
Subject: Re: cpu temperature readings
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Thu, 29 Jun 2023 00:52:29 +0700

I am going to reply to several messages in one reply...
But first, thanks for looking at this at all, x86 processors
have always been black magic to me.

    Date:        Wed, 28 Jun 2023 05:06:11 -0000 (UTC)
    From:        mlelstv%serpens.de@localhost (Michael van Elst)
    Message-ID:  <u7gf43$1or$1%serpens.de@localhost>

  | coretemp temperatures in that range are unlikely to be true.

Yes, of course, that was kind of the point of my message - something
is obviously reporting nonsense.   The question is why.

  | But you didn't tell what sensors you were reporting. Is that coretemp?

Sorry, but yes, I should have actually included some values.

                      Current  CritMax  WarnMax  WarnMin  CritMin  Unit
[coretemp0]
  cpu0 temperature:    21.000                                      degC
[coretemp1]
  cpu1 temperature:    15.000                                      degC
[coretemp2]
  cpu2 temperature:    16.000                                      degC
[coretemp3]
  cpu3 temperature:    13.000                                      degC
[coretemp4]
  cpu4 temperature:    14.000                                      degC
[coretemp5]
  cpu5 temperature:    15.000                                      degC
[coretemp6]
  cpu6 temperature:    12.000                                      degC
[coretemp7]
  cpu7 temperature:    13.000                                      degC

That's the entire envstat output on my system (as configured currently).
Those readings are from when I was generating this mail, room temp is
about 22C, cpu target frequency is 3400 (not 3401).   If I set it to
3401, and wait a minute or so, what appears is:

                      Current  CritMax  WarnMax  WarnMin  CritMin  Unit
[coretemp0]
  cpu0 temperature:    42.000                                      degC
[coretemp1]
  cpu1 temperature:    40.000                                      degC
[coretemp2]
  cpu2 temperature:    36.000                                      degC
[coretemp3]
  cpu3 temperature:    35.000                                      degC
[coretemp4]
  cpu4 temperature:    45.000                                      degC
[coretemp5]
  cpu5 temperature:    34.000                                      degC
[coretemp6]
  cpu6 temperature:    36.000                                      degC
[coretemp7]
  cpu7 temperature:    33.000                                      degC

Apart from the change of target (and actual) frequency (incidentally,
does machdep.cpu.frequency.target set all the CPUs?   I assume it does,
but is there any mechanism to alter them individually?) nothing else
of significance has changed (the room might be a tenth of a degree
cooler - the A/C has not been turned on for very long).   If (when) I
return from 3401 to 3400 the temps will go back to (give or take a few
degrees) the earlier readings, and if I leave things so the monitors
switch to dpms off mode (or whatever happens to them when they turn off)
and the CPUs have nothing to do, and the room cools a little more, the
temps recorded would drop even lower.
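
For reference, the per-core differences between the two listings above
can be worked out with a few lines of Python. This is only a throwaway
sketch: the values are copied by hand from the envstat output, and the
function name is made up for illustration.

```python
# Readings copied from the two envstat listings above (degC),
# in order cpu0 .. cpu7.
at_3400 = [21, 15, 16, 13, 14, 15, 12, 13]
at_3401 = [42, 40, 36, 35, 45, 34, 36, 33]


def per_core_delta(low, high):
    """Return the per-core difference (high - low), in degC."""
    if len(low) != len(high):
        raise ValueError("listings cover different numbers of cores")
    return [h - l for l, h in zip(low, high)]


deltas = per_core_delta(at_3400, at_3401)
for cpu, delta in enumerate(deltas):
    print(f"cpu{cpu}: +{delta} degC")
print(f"smallest +{min(deltas)}, largest +{max(deltas)}")
```

The differences are not the same on every core, which is worth keeping
in mind when guessing at causes.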

It is this strange behaviour that is interesting.   Certainly I expect that
if the CPUs are running faster, they will be hotter (and when they are really
working they certainly are) but when all they're doing is mostly sitting in
the idle loop, I'd have expected the target max freq to be more or less
irrelevant to the temperature (and certainly not for anything to ever
produce values less than is rationally possible).


  | Some ACPI value? A motherboard sensor (e.g. lm0)?

No ACPI sensors that I am aware of.   Motherboard sensors exist, but
either they aren't supported, or my kernel is not configured to support
them.   I certainly have no lm? devices configured, and as best I
can tell from a quick look, nor does the standard GENERIC kernel.   When
I next reboot (which is likely to be soon, as indicated below) I will
try booting GENERIC and see if anything extra shows up ... but my kernel
config was mostly created that way: initially running GENERIC, then making
a config file with all the devices (and options) I am never going to have,
or ever want, removed.   Any hardware which GENERIC matched a driver to
would have been retained.   But that GENERIC was 9.99.97 from more than a
year ago now.

  | That's a selected 241W chip that may heat up to > 100 C (Tjmax = 115 C)
  | and usually requires a liquid cooler. Idle temperature between 50C and 60C
  | are normal.

It has a liquid cooler.   What's more it seems to work quite effectively.
When I am doing a build the coretemps go way up when things are busy (that's
expected of course) and then drop (more quickly than I would have guessed
would happen) during the occasional idle periods (like when waiting for the
last parts of the main build to finish, before starting on the kernels).
When the build is truly completed, everything reverts to the state before
the build started, reasonably quickly.   So, in general, I believe the
cooling system is working OK.

    Date:        Wed, 28 Jun 2023 05:24:50 -0000 (UTC)
    From:        mlelstv%serpens.de@localhost (Michael van Elst)
    Message-ID:  <u7gg72$1ke$1%serpens.de@localhost>

  | The chip apparently reports a Tjmax of 100 C (as for the non-selected chip)
  | but actually has a real Tjmax of 115 C.

(Yes, 100 seems to be what I am seeing).   As for the real one, see below.

  | The temperature sensor reading is relative to Tjmax.

I had seen that in the code.   Initially I had been thinking that perhaps
this calculation was prone to errors, and had been considering doing the
calculation multiple times, and only accepting the results if the results
were similar over multiple (quick) readings - but I no longer believe that
would be a useful thing to do (more evidence collected over time) so I
never attempted that.    But the strangely different readings based upon
the target frequency suggest another possible issue - if the bits being
read to obtain the temp contained (on this processor) a bit that's being
set when "turbo mode" is turned off, then the reading would produce a
much larger number, resulting in a lower reported temp, when that value
is subtracted from the (currently) constant Tjmax.   Turn the bit off
again, and the number to subtract is smaller, and temps appear higher.

That is most probably not the explanation, but ...
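
To make that speculation concrete, here is a small sketch of the
arithmetic. It assumes the layout Intel documents for the
IA32_THERM_STATUS register, with a 7-bit digital readout in bits 22:16
that counts degrees below Tjmax; the raw values and the "stray bit" are
invented for illustration and were not read from this machine.

```python
TJMAX = 100  # degC, the value the driver reports on this system


def digital_readout(therm_status):
    """Extract the 7-bit readout (bits 22:16) from a raw register value.

    The readout counts degrees below Tjmax (assumed register layout).
    """
    return (therm_status >> 16) & 0x7F


def reported_temp(therm_status, tjmax=TJMAX):
    """Temperature as it would be reported: Tjmax minus the readout."""
    return tjmax - digital_readout(therm_status)


# Hypothetical raw value: a readout of 64 gives a reported 36 degC.
raw = 64 << 16
# Same value with one extra bit (worth 16) set inside the readout field.
raw_with_stray_bit = raw | (16 << 16)

print(reported_temp(raw))                 # 100 - 64
print(reported_temp(raw_with_stray_bit))  # 100 - 80
```

A larger number in the field means a lower reported temperature, so one
extra bit could shift a reading by a fixed amount. It would shift every
core by that same amount, though.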

  | So it could be 15C lower than reality (if the default of 100 instead
  | of 115 is used) or even 25C lower (if the Intel recommendation
  | is followed).

The absolute temperatures do not concern me, it isn't as if I am able
to get inside the core and use the heat for something - there are just
two issues.   The first, and the one that I was asking about in my original
message, is why altering the cpu frequency makes such a large difference
(and generates obviously absurd values).   Even if the readings are 15
lower than reality, as 100 is what is being used for Tjmax (see below),
the lowest I have seen reported (8C) would mean the actual core temp was
just 23, in a room with ambient temp at the time of about 22 perhaps, which
I don't think is really possible.
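
The correction used in that estimate is simple enough to write down.
A minimal sketch, assuming only that the reported value is Tjmax minus
the sensor readout (the function name is hypothetical):

```python
def rebase(reported, assumed_tjmax, actual_tjmax):
    """Re-express a reported temperature under a different Tjmax.

    Since reported = assumed_tjmax - readout, the readout is recovered
    first and then subtracted from the other Tjmax.
    """
    readout = assumed_tjmax - reported
    return actual_tjmax - readout


# Lowest value seen (8C) with Tjmax=100, re-expressed for the 15C and
# 25C corrections mentioned in the quoted text.
for offset in (15, 25):
    print(offset, rebase(8, 100, 100 + offset))
```

With the 15C correction the 8C reading becomes 23C, which is the figure
used above.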

The second issue (the one I started investigating) is that (with the
cpu freq at 3401, enabling turbo mode, and I assume, actual frequencies
up to 5500MHz) the temperatures recorded start creeping upwards (when
the system is mostly idle, and nothing is really changing at all) and
what's more, that seems to be on an exponential curve (positive feedback
perhaps).   That is, going from (reported values of) mid 30's to around 40
as the "resting" state, can take many hours, then from 40 to 50 or so, less
time, and then once it gets beyond 50 and is approaching 60, it might just
be minutes until it reaches Tjmax and powerd (or the cpu itself perhaps)
decides to shut things down (when powerd does it, I sometimes see its
broadcast message - but I often don't have a login terminal visible, so
often not) and once or twice, X has actually shut down, and I've seen at
least some of the normal shutdown sequence happening on the console.
Usually however, the power is (or seems to be) simply abruptly cut, and
everything stops instantly: working and doing things (like typing
an e-mail, or whatever) one second, and no power the next.   (And no, it
is not an external power issue, the system has a UPS, and in any case if
it lost external power, it would reboot as soon as that returned, this does
not do that, it behaves just like "poweroff" but seemingly without the
file system unmounting, ... that would normally happen.)

    Date:        Wed, 28 Jun 2023 15:08:17 +0900
    From:        Masanobu SAITOH <msaitoh%execsw.org@localhost>
    Message-ID:  <1b1763d8-f565-612c-9336-9fb71d496da5%execsw.org@localhost>

  | ark.intel.com often shows incorrect values. Looking at this page now,
  | it says Tjmax is 90 degrees.

Yes, I see that on that page as well.

  | Robert, could you show me the output of:
  |
  | 	dmesg -t | grep Tjmax

Certainly:
jacaranda$ dmesg -t | grep Tjmax
coretemp0 at cpu0: thermal sensor, 1 C resolution, Tjmax=100
coretemp1 at cpu1: thermal sensor, 1 C resolution, Tjmax=100
coretemp2 at cpu2: thermal sensor, 1 C resolution, Tjmax=100
coretemp3 at cpu3: thermal sensor, 1 C resolution, Tjmax=100
coretemp4 at cpu4: thermal sensor, 1 C resolution, Tjmax=100
coretemp5 at cpu5: thermal sensor, 1 C resolution, Tjmax=100
coretemp6 at cpu6: thermal sensor, 1 C resolution, Tjmax=100
coretemp7 at cpu7: thermal sensor, 1 C resolution, Tjmax=100
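
When comparing several saved dmesg files, it can help to pull the
sensor, cpu and Tjmax out of each line mechanically. A rough sketch
follows; the regular expression is written against the line format shown
in this message only, and the sample text is a shortened copy of it.

```python
import re

# Two lines in the format shown above; the second carries a timestamp
# prefix, as dmesg prints without -t.
SAMPLE = """\
coretemp0 at cpu0: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp1 at cpu2: thermal sensor, 1 C resolution, Tjmax=100
"""

PATTERN = re.compile(r"(coretemp\d+) at (cpu\d+): .*Tjmax=(\d+)")


def tjmax_by_sensor(text):
    """Map each coretemp sensor name to a (cpu, Tjmax) pair."""
    result = {}
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            sensor, cpu, tjmax = match.groups()
            result[sensor] = (cpu, int(tjmax))
    return result


table = tjmax_by_sensor(SAMPLE)
for sensor, (cpu, tjmax) in table.items():
    print(sensor, cpu, tjmax)
# True when every sensor reports the same Tjmax.
print(len({tjmax for _, tjmax in table.values()}) == 1)
```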

A much older (from a saved dmesg output) listing was (sorry, these
did not include -t)

[     1.031818] coretemp0 at cpu0: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp1 at cpu2: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp2 at cpu4: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp3 at cpu6: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp4 at cpu8: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp5 at cpu10: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp6 at cpu12: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp7 at cpu14: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp8 at cpu16: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp9 at cpu17: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp10 at cpu18: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp11 at cpu19: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp12 at cpu20: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp13 at cpu21: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp14 at cpu22: thermal sensor, 1 C resolution, Tjmax=100
[     1.031818] coretemp15 at cpu23: thermal sensor, 1 C resolution, Tjmax=100

That was from almost a year ago, when the system was still fairly new,
and I was experimenting with things, and saving a bunch of dmesg output
files for comparisons.

If hyperthreading were enabled, those would be cpu0 cpu2 cpu4 ...
and if the "economy" cores had not vanished

  | It seems that the MSR_TEMPERATURE_TARGET's value is not fixed
  | on newer chips. Please test the following diff:
  |
  | 	https://www.netbsd.org/~msaitoh/coretemp-20230628-0.dif

I have fetched it, and will do that.   Thanks.   I will let you know
the results (will take at least hours, if my system decides not to
co-operate, perhaps longer).   My build is starting now.

kre