Current-Users archive


Re: cpu temperature readings



On Thu, Jun 29, 2023 at 08:59:18PM +0700, Robert Elz wrote:
> 
> When I set 3400, which is what I have right now, if that dropped to low 30's,
> or high 20's, or just stayed the same while idling as your processor does,
> then that would all make sense.   But it doesn't.  I am running at 3400 now,
> and the coretemp readings are:
> 
>                        Current  CritMax  WarnMax  WarnMin  CritMin  Unit
> [coretemp0]
>    cpu0 temperature:    13.000                                      degC

> Room temperature is about 21C at the minute (A/C maintained).   Even remaining
> at 3400, if the workload drops even more (I am replying to this mail, which
> means keyboard and mouse activity, some disc I/O as well, and the X server
> needs to be processing everything - so we can go much more idle than that,
> with the screens all off, so the X server has little to do, no keyboard or
> mouse activity, ...) then the reported temps will sometimes drop into single
> digits (8 or 9 ... I haven't seen less than 8).
> 
> Those values are absurdly low.

Unless there is a BIAS on those numbers and the real values are maybe 15
degrees higher.

I can also easily imagine that temperatures rise with turbo mode
enabled, even when idle, in particular on selected dies like the
i9-12900KS.




>   | One possibility would be that the Tjmax value is actually changed
>   | dynamically (maybe some SMM code) and that the patch isn't complete
>   | to handle this.
> 
> The possibility is certainly there.   The patch certainly doesn't handle it,
> the code has been rearranged in a way that would make it much simpler to do,
> but as it is now, it is really just doing the same as before - calculate the
> Tjmax value to be used at sensor attach time, and never touch that again.


The code is supposed to follow the Linux example. If a fixed Tjmax can
be read, then that's it. If no fixed Tjmax can be read, a dynamic value
needs to be read from a new register every time you evaluate the temperature.
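
A minimal sketch of that rule in Python, not the actual coretemp or
Linux code. The three callbacks (read_fixed_tjmax, read_dynamic_tjmax,
read_offset) are hypothetical stand-ins for the register reads, and the
sensor is assumed to report degrees below Tjmax, as discussed in this
thread.

```python
class CoreTemp:
    """Illustrative model only; every name here is hypothetical."""

    def __init__(self, read_fixed_tjmax, read_dynamic_tjmax, read_offset):
        # Try the fixed Tjmax once, at attach time (None means "not readable").
        self.fixed_tjmax = read_fixed_tjmax()
        self.read_dynamic_tjmax = read_dynamic_tjmax
        self.read_offset = read_offset

    def temperature(self):
        if self.fixed_tjmax is not None:
            # A fixed Tjmax was readable: use it and never look again.
            tjmax = self.fixed_tjmax
        else:
            # No fixed Tjmax: re-read the dynamic value on every evaluation.
            tjmax = self.read_dynamic_tjmax()
        # The sensor reports how far below Tjmax the core currently is.
        return tjmax - self.read_offset()
```

With a fixed Tjmax of 100 and an offset of 88 this yields 12; with no
fixed value and a dynamic Tjmax of 120 the same offset yields 32.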

N.B. ASRock says they configure Tjmax to 105C for that board.



> However, to explain what is being observed, the Tjmax value would need
> to be increasing as the cpu frequency decreases, since at the minute we're
> using Tjmax==100, and getting 12 as the reported temp, so the value read
> and subtracted from Tjmax must be 88.  To make that value be somewhere
> around 30-35, (which is what I am seeing now ... I just went back to 3401
> mode temporarily) then Tjmax would need to be in the 118-123 range (and so
> perhaps 120).   That seems a bit unlikely to me (unlikely things are that
> simple).

120 seems plausible; an "official" number from Intel was 115. So if
ASRock puts 105 instead of 100, then it might also configure 120 instead
of 115.
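
To make the arithmetic concrete, here is a small Python sketch using
only the numbers quoted in this thread. The function name is
illustrative, and the sensor offset of 88 is derived from Tjmax==100
and a reported 12, as in the quoted text.

```python
def reported_temp(tjmax, offset):
    # The sensor value is subtracted from Tjmax to get degrees C.
    return tjmax - offset

# Offset implied by the current reading: Tjmax 100, reported 12.
offset = 100 - 12

# Candidate Tjmax values mentioned in this thread.
for tjmax in (100, 105, 115, 120):
    print(tjmax, reported_temp(tjmax, offset))
# prints:
# 100 12
# 105 17
# 115 27
# 120 32
```

Only the 120 case lands in the 30-35 range that was observed in the
other frequency mode.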



>   | The scheduler did use first cores first, with performance cores
>   | using low cpu numbers, they should be utilized first but not
>   | necessarily for the important workloads.
> 
> Depending upon what that really means, that is, "use first" (use the first
> cores *first*) wrt to what?

With respect to unit numbers: cpu0 before cpu1 before cpu2, etc. This
only happens when the scheduler searches for a core; it doesn't
(immediately) migrate LWPs that were started on higher units.
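
As a rough illustration of "lowest unit first" (a sketch only, not the
NetBSD scheduler code; pick_cpu and the idle list are hypothetical):

```python
def pick_cpu(idle):
    """Return the lowest-numbered idle cpu, or None if all are busy.

    idle[n] is True when cpu<n> currently has nothing to run.
    """
    for unit, is_idle in enumerate(idle):
        if is_idle:
            return unit
    return None

# cpu0 is busy, cpu1 and cpu2 are idle: the search settles on cpu1.
print(pick_cpu([False, True, True]))  # prints 1
```

The function is only consulted when a cpu has to be chosen; nothing in
it touches an LWP that is already running on a higher unit.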

There was code to regularly balance LWPs across all cores. It was
broken, I fixed it, and then it was ripped out by ad@.



> What's even more peculiar is that we seem to be moving processes from core
> to core for no apparent reason, if I am running a single cpu bound process,
> I can observe it move from cpu to cpu.

That happens when that cpu is used by another LWP (maybe a kernel thread
that is bound to that cpu) and the previously running LWP needs to be
migrated.




> If it was moving processes to a more suitable cpu for the workload, that would
> also make sense, but it isn't doing that either.

It has no idea what a "suitable CPU" is.


>   | It now handles big.little configurations independent of cpu numbers,
>   | but probably only on arm.
> 
> This processor needs more than that, though it would be a start.  It has
> been quite a while since I looked at the specs for it in detail, but as
> I remember it (and assuming we have no hyperthreading to occupy all the
> odd numbered cpu numbers,

Actually on AMD I mapped one thread of each core to the first cpus and
the other thread of each core to the later cpus. I.e.

cpu0: Cluster/Package ID 0
cpu0: Core ID 0
cpu0: SMT ID 0
cpu1: Cluster/Package ID 0
cpu1: Core ID 1
cpu1: SMT ID 0
cpu2: Cluster/Package ID 0
cpu2: Core ID 2
cpu2: SMT ID 0
...
cpu16: Cluster/Package ID 0
cpu16: Core ID 0
cpu16: SMT ID 1
cpu17: Cluster/Package ID 0
cpu17: Core ID 1
cpu17: SMT ID 1
...

With our simple scheduler strategy, that loads one thread per core and
only puts two threads on a core when you have more runnable threads
than cores (except for the bound system threads).


On Intel however (at least on this i5), the mapping alternates between
both threads:

cpu0: Cluster/Package ID 0
cpu0: Core ID 0
cpu0: SMT ID 0

cpu1: Cluster/Package ID 0
cpu1: Core ID 0
cpu1: SMT ID 1

cpu2: Cluster/Package ID 0
cpu2: Core ID 1
cpu2: SMT ID 0

cpu3: Cluster/Package ID 0
cpu3: Core ID 1
cpu3: SMT ID 1
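
The two numbering schemes can be compared with a small Python sketch.
It assumes 16 cores for the AMD case (inferred from cpu16 being core 0,
SMT 1 in the listing above) and two threads per core; the function
names are illustrative, and filling cpus strictly in unit order is a
simplification of what the scheduler does.

```python
def amd_topology(cpu, ncores=16):
    # First ncores units are SMT thread 0 of each core, the rest thread 1.
    return cpu % ncores, cpu // ncores   # (core id, smt id)

def intel_topology(cpu):
    # Units alternate between the two threads of the same core.
    return cpu // 2, cpu % 2             # (core id, smt id)

def cores_used(mapping, nthreads):
    # Distinct cores touched when cpu0..cpu<nthreads-1> are filled in order.
    return len({mapping(cpu)[0] for cpu in range(nthreads)})

print(amd_topology(17))               # prints (1, 1)
print(intel_topology(3))              # prints (1, 1)
print(cores_used(amd_topology, 4))    # prints 4
print(cores_used(intel_topology, 4))  # prints 2
```

Under the AMD-style numbering, four runnable threads placed in unit
order land on four different cores; under the alternating numbering
they would share two cores.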



I expect this to be replaced with something much more bizarre. There
are already data structures that describe the CPU topology.


Greetings,
-- 
                                Michael van Elst
Internet: mlelstv%serpens.de@localhost
                                "A potential Snark may lurk in every tree."

