Current-Users archive


Re: cpu temperature readings



Another reply to multiple messages in one, but starting from the last
one this time, as it is the most important I think.

    Date:        Thu, 29 Jun 2023 13:52:03 +0200
    From:        Michael van Elst <mlelstv%serpens.de@localhost>
    Message-ID:  <ZJ1wY2boyKkeEnFY%serpens.de@localhost>

  | One possibility would be that the 3401 mode didn't enable turbo frequencies
  | but actually throttled the CPU (e.g. due to a faulty BIOS). Then the low
  | temperature readings would have been only a logical consequence.

If that was the problem, then yes, that would have been a possibility, but
that's backwards.  At 3401 the temperature readings look OK (there's still
the other problem I was initially seeking, but this one needs to be solved
first).   At slower cpu frequencies, the temperatures are lower.   That,
when simply stated, with just that much info, looks like it deserves a "That
is how it should be, run slower, generate less heat" response - as indicated
by the data you gave in your previous message:

    Date:        Thu, 29 Jun 2023 13:45:11 +0200
    From:        Michael van Elst <mlelstv%serpens.de@localhost>
    Message-ID:  <ZJ1ux/x3XasTCIbv%serpens.de@localhost>

  | The Haswell CPU here (room temperature about 27C) runs idle at about 40C
  | when clocked at minimum 800, but heats up to 47C when idling at 3300 and
  | there is no difference to 3301.

That kind of thing is what I'd expect to see.   But that isn't what I am
seeing.   If I set 3401 as the target frequency (turbo mode - as Intel
still calls it) the temperatures of the cores (when idling) are probably
in the low 30's to low 40's (as reported, how that relates to real
heat in the chip is anyone's guess - but the BIOS also reports those kinds
of values).

When I set 3400, which is what I have right now, if that dropped to low 30's,
or high 20's, or just stayed the same while idling as your processor does,
then that would all make sense.   But it doesn't.  I am running at 3400 now,
and the coretemp readings are:

                       Current  CritMax  WarnMax  WarnMin  CritMin  Unit
[coretemp0]
   cpu0 temperature:    13.000                                      degC
[coretemp1]
   cpu1 temperature:    13.000                                      degC
[coretemp10]
  cpu10 temperature:    15.000                                      degC
[coretemp11]
  cpu11 temperature:    15.000                                      degC
[coretemp12]
  cpu12 temperature:    14.000                                      degC
[coretemp13]
  cpu13 temperature:    14.000                                      degC
[coretemp14]
  cpu14 temperature:    14.000                                      degC
[coretemp15]
  cpu15 temperature:    14.000                                      degC
[coretemp2]
   cpu2 temperature:    14.000                                      degC
[coretemp3]
   cpu3 temperature:    13.000                                      degC
[coretemp4]
   cpu4 temperature:    13.000                                      degC
[coretemp5]
   cpu5 temperature:    15.000                                      degC
[coretemp6]
   cpu6 temperature:    12.000                                      degC
[coretemp7]
   cpu7 temperature:    13.000                                      degC
[coretemp8]
   cpu8 temperature:    15.000                                      degC
[coretemp9]
   cpu9 temperature:    15.000                                      degC

Room temperature is about 21C at the minute (A/C maintained).   Even remaining
at 3400, if the workload drops even more (I am replying to this mail, which
means keyboard and mouse activity, some disc I/O as well, and the X server
needs to be processing everything - so we can go much more idle than that,
with the screens all off, so the X server has little to do, no keyboard or
mouse activity, ...) then the reported temps will sometimes drop into single
digits (8 or 9 ... I haven't seen less than 8).

Those values are absurdly low.
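
To put "absurdly low" in checkable form: a core that is dissipating any
power and is cooled by room air should not report less than ambient.  What
follows is only a minimal sketch in Python, not anything envstat or powerd
does; the function names are made up for illustration, and the sample lines
are a subset of the output above.  It flags readings below a given ambient
temperature:

```python
import re

# A subset of the envstat lines shown above, copied as-is.
SAMPLE = """\
[coretemp0]
   cpu0 temperature:    13.000                                      degC
[coretemp6]
   cpu6 temperature:    12.000                                      degC
[coretemp10]
  cpu10 temperature:    15.000                                      degC
"""

# Matches lines of the form "cpuN temperature:  <value>  degC".
LINE_RE = re.compile(r"^\s*(cpu\d+) temperature:\s+([\d.]+)\s+degC\s*$")


def parse_coretemp(text):
    """Return {cpu_name: degrees_C} for each 'cpuN temperature' line."""
    readings = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            readings[m.group(1)] = float(m.group(2))
    return readings


def below_ambient(readings, ambient_c):
    """Return only the readings that are lower than the ambient temperature."""
    return {cpu: t for cpu, t in readings.items() if t < ambient_c}


readings = parse_coretemp(SAMPLE)
suspect = below_ambient(readings, ambient_c=21.0)  # room is about 21C
print(sorted(suspect.items()))
```

With the room at about 21C, every reading in the sample is flagged.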

There doesn't seem to be much (if any) difference between the temps being
reported whether the frequency is 3400, or 800 (highest and lowest available
fixed frequencies).   Maybe just one or two degrees less at 800 than at 3400
(when mostly idling ... not fully idle right now, there's also some network
traffic at the minute - has been throughout this reply).

  | The xx01 frequency sets the maximum base clock and enables turbo mode...
  | on systems that support such a setting.
  |
  | On "modern CPUs" however, it is often sufficient to stay on that setting

That's what I used to do, before I started getting the original problem
(not yet really reported, as I don't yet have much of an idea what is
happening) - but as a first guess, the cores seemed to be overheating, or
at least powerd thought they were (powerd gets the same info as envstat,
which also showed rising temperatures) - this usually happening when the
system was idle (or mostly idle, there are all the usual low cpu usage
background noise processes running - clocks, cron, inetd, nothing that
normally causes even a blip in apparent cpu used).   In fact, if I made
the system really busy (like doing a full release build) I never saw a
problem (things get hot, then cool down again).

That actually suggests another possibility for this original problem to
be investigated later - perhaps when the CPU goes into idle mode, something
is happening to the (at least reported) core temperatures, and the more
time it spends idling, the more those appear to increase.   That's for later;
for now, unless we can trust what the CPU is telling us about the temperature,
worrying about probable nonsense numbers varying would be a waste of
time.

    Date:        Thu, 29 Jun 2023 13:24:23 +0200
    From:        Michael van Elst <mlelstv%serpens.de@localhost>
    Message-ID:  <ZJ1p5pZJGNOYXJ/g%serpens.de@localhost>

  | Then it gets really strange what the temperature sensor would see.

Yes, that's why I sent the original message.   It is indeed really strange.

  | One possibility would be that the Tjmax value is actually changed
  | dynamically (maybe some SMM code) and that the patch isn't complete
  | to handle this.

The possibility is certainly there.   The patch certainly doesn't handle it;
the code has been rearranged in a way that would make it much simpler to do,
but as it is now, it is really just doing the same as before - calculate the
Tjmax value to be used at sensor attach time, and never touch that again.

That is, I am not surprised that it didn't change anything.   However, if
we can work out when it would be reasonable to look for a new Tjmax (and
on which processor versions) now it will be trivial to make that happen,
where it wouldn't have been before.

However, to explain what is being observed, the Tjmax value would need
to be increasing as the cpu frequency decreases, since at the minute we're
using Tjmax==100, and getting 12 as the reported temp, so the value read
and subtracted from Tjmax must be 88.  To make that value be somewhere
around 30-35, (which is what I am seeing now ... I just went back to 3401
mode temporarily) then Tjmax would need to be in the 118-123 range (and so
perhaps 120).   That seems a bit unlikely to me (unlikely things are that
simple).   Perhaps reading a new Tjmax recalibrates the internal temp
monitoring though?   Clutching at straws...
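
The arithmetic above, written out as a small Python sketch.  It assumes
only what is stated here: the reported temperature is Tjmax minus the value
read from the sensor.  The function names are hypothetical, chosen just
for this illustration:

```python
def implied_offset(tjmax, reported):
    """Value that must have been read and subtracted from Tjmax."""
    return tjmax - reported


def tjmax_needed(offset, wanted):
    """Tjmax that would turn the same sensor value into 'wanted' degrees."""
    return offset + wanted


# Currently: Tjmax == 100 and the reported temperature is 12.
offset = implied_offset(100, 12)

# Tjmax needed for that same raw value to be reported as 30 to 35 instead.
low = tjmax_needed(offset, 30)
high = tjmax_needed(offset, 35)

print(offset, low, high)
```

That gives an offset of 88 and a Tjmax range of 118 to 123, as above.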

  | The scheduler did use first cores first, with performance cores
  | using low cpu numbers, they should be utilized first but not
  | necessarily for the important workloads.

That depends upon what it really means, that is, "use first" (use the first
cores *first*) wrt what?   System boot?   If it is doing that, and just
rotating through the cores, that might (kind of) match what I see.   But if
you mean "when a cpu is needed, the lowest numbered one which is currently
unallocated is used" then I don't see that happening at all.

What's even more peculiar is that we seem to be moving processes from core
to core for no apparent reason.   If I am running a single cpu bound process,
I can observe it move from cpu to cpu.   If all cpus were equal, then aside
from the L1 cache losses suffered doing that, it would make no real difference.
If it was moving processes to a more suitable cpu for the workload, that would
also make sense, but it isn't doing that either.   If there were lots of other
processes demanding CPU time, then bumping the busy one (which will have its
run priority dropping (increasing numerically)) to run others, and then
restarting the busy one on the next free CPU would also make sense - but I
doubt it is that either, there just aren't enough processes (even kernel
threads that might have a reason to run) to use all the cores - and the
chances of all the ones that might have something to do all wanting to use
their few required ns of processor time at the same instant are remote
indeed.   If this seemingly random movement happened only rarely, maybe I
could believe that, but it doesn't, it seems to be happening all the time,
almost as if any system call being run results in the possibility that the
resumption might be on a different cpu (the movement isn't frequent enough
to be every system call though - and it happens to processes that make none,
just infinite loop cpu wasters).

  | It now handles big.little configurations independent of cpu numbers,
  | but probably only on arm.

This processor needs more than that, though it would be a start.  It has
been quite a while since I looked at the specs for it in detail, but as
I remember it (and assuming we have no hyperthreading to occupy all the
odd numbered cpu numbers, so cpuN means coreN here) cpu0 (and maybe cpu1)
can run fastest (up to 5500MHz).   Then I think perhaps cpu1 (maybe cpu2)
can run up to 5300MHz, then the rest of the performance cores (...7) run
up to 5200MHz.   The economy cores (8..15) all run at a slower base freq
(2500 rather than 3400 for (all) the performance cores - despite cpuctl
on NetBSD claiming that their base is also 3400 .. I suspect there's just
one kernel "base" frequency, reported for all cpus) and up to 4000MHz
in turbo mode.

So there are certainly 3, perhaps 4, different processor classes, though
all the fast ones are reasonably close to each other (but when I was running
that openssl test, using turbo mode, I could see in the results the variations
that having a different cpu assigned made - which is why I just kept repeating
it until cpu0 happened to be chosen, for at least most of it).   That one
certainly runs faster than anything else (except maybe cpu1, which might be
the same).

kre


