Port-sparc64 archive


Re: envstat drivers on SUN Ultra-45?



Hi,

> >   https://ftp.netbsd.org/pub/NetBSD/misc/jdc/sparc64/
> 
> Thanks, grabbed & tested.
>  
> > "u45.diff" is the patch file, "netbsd" and "netbsd.gdb" is the kernel with
> > these and "u45-envsys.conf" is the sensor names for /etc/envsys.conf.
> > "u45.dmesg" is the full dmesg, but relevant lines are:
> 
> Just grabbed kernel image and envsys config, applied and rebooted:

>          pci-fan:         0                                       RPM
>      system-fan3:         0                                       RPM
>      system-fan4:         0                                       RPM
> 
> And yes, the 3x fan group (one FRU, two fans pointed at CPUs and memory
> banks, one at the PCI(e,-X) cards) not running is something I suspected
> (system thermal issues), but it's nice to be able to check for it, now
> I can figure out how to fix it. I think I misrouted at least one cable
> when I replaced the mainboard, thus causing mechanical alignment issues
> for the fan group connector.

It's nice to see the confirmation from the software!  The service manual
(819-1892-12.pdf) has a reasonable overview at the start of section 3.1
showing all the motherboard connections.  I hope that it's easy to solve.
If not, I can take pictures of the fan cable route on mine.

Your CPU temperatures look normal:

>      cpu0-sensor:    64.500                                      degC
>      cpu1-sensor:    60.250                                      degC

I guess that means that I do need to redo the thermal paste on mine (not
looking forward to that on something hard/expensive to replace).

> >   [     1.000000] bq4802rtc0 at ebus0 addr 100000-10000f: real time clock
 
When I was testing today, I saw:

  [  40.6304158] WARNING: clock lost 2 days

and I double-checked the chip documentation.  I see that I had inverted
the "stop" bit, so the driver stopped the clock whenever the power was off.
I've corrected this now and uploaded a new kernel and patches:

  https://ftp.netbsd.org/pub/NetBSD/misc/jdc/sparc64/netbsd
  https://ftp.netbsd.org/pub/NetBSD/misc/jdc/sparc64/u45.diff

This does mean that you'll need to try this kernel (or an autobuild one)
to correctly set up the RTC again - apologies for that.

This kernel also has some additions to the adt7462 driver.  Limits are
read from the chip, voltages are added and I also check the fan fault
register.  This does lead to slightly weird results though.

                      Current  CritMax  WarnMax  WarnMin  CritMin  Unit
  [adt7462sm0]
    ...
            V1.5 1:     1.490             1.989    0.250              V
            V1.5 2:     1.498             1.989                       V
            V3.3 1:     1.238             4.386                       V
             Vbatt:     1.232             3.978    1.997              V
             V12 3:    11.938            15.938                       V
              V5 3:     5.018             6.630                       V
         fan fault:    100000        0        0        0        0 none

It looks like Sun wired the 3.3V and the battery sensor to 1.2 volt lines,
but didn't remove the default low limit for the battery.  On boot, I see:

  [  33.5647675] adt7462sm0: warning under limit on 'Vbatt'

from the battery voltage low limit, and:

  [  33.6347678] adt7462sm0: fan fault status change: 00000000 -> 00100000
  [  33.7147673] adt7462sm0: critical limit on 'fan fault'

from the fan fault register.  Rather than display the hex value, I thought
that it would be nicer to display a 1 for each fan that has failed and a 0
otherwise.  The digits currently read from the right (fan 1 is rightmost),
so this tells me that fan 6 has failed.  I had assumed that there were only
5 fans, but it's possible that fan 6 is the PSU fan and just not shown on
Solaris.  You should see a value of 011100 or 111100 for your non-working
system fans (3, 4 and 5).  If you see the latter, I'll mask out fan 6 from
the results.
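
To make the digit convention concrete, here is a minimal sketch in Python
(not the driver code).  It assumes that bit 0 of a fault mask corresponds to
fan 1; the function names and the nfans parameter are hypothetical and only
illustrate the idea of one decimal digit per fan, read from the right.

```python
def fault_display(mask, nfans=6):
    """Build a decimal value with one digit per fan: 1 = failed, 0 = OK.

    Assumption: bit 0 of `mask` is fan 1, so fan 1 ends up as the
    rightmost digit.  Bits at or above `nfans` are ignored (masked out).
    """
    value = 0
    for fan in range(nfans):
        if mask & (1 << fan):
            value += 10 ** fan
    return value


def failed_fans(display):
    """Decode a displayed value back into a list of failed fan numbers."""
    fans = []
    fan = 1
    while display:
        display, digit = divmod(display, 10)
        if digit:
            fans.append(fan)
        fan += 1
    return fans


if __name__ == "__main__":
    # Only fan 6 flagged, as in the reading quoted above.
    print(fault_display(0b100000), failed_fans(100000))
    # Fans 3, 4 and 5 flagged, with fan 6 masked out by nfans=5.
    print(fault_display(0b111100, nfans=5), failed_fans(11100))
```

Under these assumptions, a value stored as a plain integer loses its leading
zero, so 011100 would be held as 11100; decoding from the right still yields
fans 3, 4 and 5.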

Finally, trying to pick a name for the sensors (hard :-) I renamed the
lm75a to nxp75a so that it is less likely to be confused with the lm75a
that lmtemp has.  This needs a matching change to envsys.conf.

> One other reason for the high thermal readouts besides the non-running
> fan group in my machine is that ambient temperature is ~ 30 degC, which
> is probably close to the upper end of the machine design envelope ;-)

That is a good question - 30 C was around the upper limit for running
servers, but I guess that without the main fans, it's definitely getting
toward the limits!

> Thank you very much for your work improving the sensor support for the
> Ultra-45, I very much appreciate it.

A pleasure!  I'm glad that it's useful!

I think that I'm close to getting the changes committed.  I was looking at
the dbcool manual page and I see that it has sysctl settings to vary the
automatic fan control settings (Tmin and Trange).  I did try manually
adjusting the system fan Trange and I could hear the change in fan sound.
I might commit first and then add that later though.

Regards,

Julian
