NetBSD-Bugs archive


port-sparc64/57848: NetBSD 9.3/sparc64 crash reboots under high I/O load



>Number:         57848
>Category:       port-sparc64
>Synopsis:       NetBSD 9.3/sparc64 crash reboots under high I/O load
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    port-sparc64-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 13 22:00:00 +0000 2024
>Originator:     Alexander Schreiber
>Release:        NetBSD 9.3 (release)
>Organization:
>Environment:
NetBSD laurelin.angband.thangorodrim.de 9.3 NetBSD 9.3 (TELPERION) #0: Sun Nov 27 15:17:20 CET 2022  root%telperion.angband.thangorodrim.de@localhost:/usr/obj/sys/arch/sparc64/compile/TELPERION sparc64

>Description:
Let me preface this with: I _suspect_ that something is actually wonky with the machine itself.

machine background:
 - this is a Sun Fire V100 with a SUNW,UltraSPARC-IIe @ 548 MHz CPU and
   maximum memory loadout (2G)
 - it has been frankensteined a bit (which would probably give a SUN field
   support engineer apoplexy if they were still around):
   - system board shelled and moved into MicroATX case
   - powered by 100W NanoPSU (which I suspect might be relevant)
   - wd0 is a 64G PATA SSD
   - wd1 and wd2 are 2 TB SATA SSDs, each behind a SATA-PATA converter
   - CPU cooling fan replaced with a quieter one, keeping the cooler in place
     but removing the airguide
   - NetBSD 9.3 was installed on wd0 from the netbooted installer ISO
   - system was rebuilt to enable ZFS
   - wd1 & wd2 were set up as a ZFS mirror-1 for data storage
 - in this configuration, the machine ran for at least 1y just fine
 - a few weeks ago, I accidentally power killed the machine (pulled the plug)
 - on next power up, it refused to boot from wd0, claiming there was nothing
   there ... also, probe-ide-all only showed the PATA SSD
 - netbooted NetBSD 9.3 installer again, reinstalled to wd0
 - system still refused to boot from wd0
 - copied install from wd0 to NFS host, switched to NFS root
 - rebuilt system to enable ZFS and use custom kernel config
 - two crash reboots mid system rebuild
 - imported ZFS pool
 - started copying (via rsync-over-ssh) a 75G data set to ZFS
 - this triggered crash reboots several times, sometimes after as little as 1h
 - updating a 210G git repo to ZFS sometimes also triggered that

It always seems to crash (and then reboot) at the same place. Crash message
copied from console (with ddb.onpanic=1):

[ 54990.7869753] data error type 32 sfsr=0 sfva=40bdf810 afsr=84000000 afva=1fe02004000 tf=0x1782cd850
[ 54990.9170412] data fault: pc=1083004 addr=40bdf810 sfsr=0x0<ASI=0x0>
[ 54990.9170412] kernel trap 32: data access error
Stopped in pid 0.29 (system) at netbsd:alipm_smb_exec+0x184:    andcc   %g1, 0xe4, %g0
db{0}> bt
iic_exec(101da35a8, 1, 18, 1782cdbfd, 1, 1782cdbff) at netbsd:iic_exec+0x1ac
admtemp_refresh(101dfc148, 10251f5b8, e0047ed0, 0, 103b1a0, 10251f588) at netbsd:admtemp_refresh+0x48
sysmon_envsys_refresh_sensor(101dfc148, 10251f5b8, 186d800, 1672000, 102553960, 102553a90) at netbsd:sysmon_envsys_refresh_sensor+0x1c
sme_events_worker(101dfc218, 101dfc148, 102553960, 10251f5b8, 101dfc148, 101dc3248) at netbsd:sme_events_worker+0x130
workqueue_worker(10254b0c0, 10254b120, 10254b130, 10254b108, 101dc3188, 10254b100) at netbsd:workqueue_worker+0xf0
lwp_trampoline(f0061134, 116000, 113a30, 1, fffc5c88, 0) at netbsd:lwp_trampoline+0x8
db{0}>
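
For reading the call path at a glance, here is a minimal Python sketch (not
part of the original report) that rejoins console-wrapped lines of ddb "bt"
output and lists one function+offset per frame. The helper names unwrap and
call_chain are hypothetical, and the wrap heuristic (a new frame is assumed
to start with "name(") is an assumption that happens to hold for this trace.

```python
import re

# Console-wrapped sample of the ddb "bt" output (80-column wrap).
TRACE = """\
iic_exec(101da35a8, 1, 18, 1782cdbfd, 1, 1782cdbff) at netbsd:iic_exec+0x1ac
admtemp_refresh(101dfc148, 10251f5b8, e0047ed0, 0, 103b1a0, 10251f588) at netbsd
:admtemp_refresh+0x48
sysmon_envsys_refresh_sensor(101dfc148, 10251f5b8, 186d800, 1672000, 102553960,
102553a90) at netbsd:sysmon_envsys_refresh_sensor+0x1c
sme_events_worker(101dfc218, 101dfc148, 102553960, 10251f5b8, 101dfc148, 101dc32
48) at netbsd:sme_events_worker+0x130
workqueue_worker(10254b0c0, 10254b120, 10254b130, 10254b108, 101dc3188, 10254b10
0) at netbsd:workqueue_worker+0xf0
lwp_trampoline(f0061134, 116000, 113a30, 1, fffc5c88, 0) at netbsd:lwp_trampolin
e+0x8
"""

# Assumption: a line that begins with "identifier(" opens a new frame;
# anything else is the wrapped tail of the previous line.
FRAME_START = re.compile(r"^\w+\(")
FRAME = re.compile(r"^\w+\(.*\) at netbsd:(\w+)\+0x([0-9a-f]+)$")


def unwrap(text):
    """Rejoin wrapped lines so that each list item is one whole frame."""
    frames = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if FRAME_START.match(line) or not frames:
            frames.append(line)
        else:
            frames[-1] += line  # the wrap split mid-token, so no separator
    return frames


def call_chain(text):
    """Return (function, offset) pairs, innermost frame first."""
    chain = []
    for frame in unwrap(text):
        m = FRAME.match(frame)
        if m:
            chain.append((m.group(1), int(m.group(2), 16)))
    return chain


if __name__ == "__main__":
    for name, offset in call_chain(TRACE):
        print(f"{name}+{offset:#x}")
```

Read from the bottom up, the chain is a workqueue worker running the envsys
sensor refresh, which calls admtemp_refresh and then iic_exec; the fault
itself is reported one level deeper, in alipm_smb_exec.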


contents of /etc/mk.conf:

# ===================
ACCEPTABLE_LICENSES+= vim-license
ACCEPTABLE_LICENSES+= gnu-agpl-v3
PKG_OPTIONS.python27=-x11
PKG_OPTIONS.python37=-x11
PKG_OPTIONS.ghostscript=-x11
ALLOW_VULNERABLE_PACKAGES=yes

UPDATE_TARGET=package-install
WRKOBJDIR = /usr/pkgobj
# WRKOBJDIR = /zfs/pkgobj
# WRKOBJDIR = /backup/1/pkgobj
PACKAGES=${PKGSRCDIR}/packages/${LOWER_OPSYS}-${OS_VERSION}-${MACHINE_ARCH}
USE_FORT=yes
USE_SSP=yes
# PKG_DBDIR=/var/db/pkg

MKZFS=yes
#=======================

Changes between GENERIC and the custom kernel are mostly commenting out
support for hardware the machine doesn't have and network support I
don't need, also:

-#options       BLINK           # blink the system LED
+options        BLINK           # blink the system LED

-#options       NFS_BOOT_BOOTP
+options        NFS_BOOT_BOOTP

-#options       DIAGNOSTIC      # extra kernel sanity checking
+options        DIAGNOSTIC      # extra kernel sanity checking

I can provide the full config if needed, of course. Interestingly, the kernel
compiled on this machine uses the _same_ config I have on another V100, with
only the config name changed. The kernel built on this machine is still
120 bytes larger.

I have another, almost identical Sun Fire V100, differences are
 - 3x SATA HDDs behind SATA-PATA adapters
 - full ATX PSU
This one exhibits _none_ of these issues. The machine only started
misbehaving after unexpectedly losing power. I suspect that since that
NanoPSU has no useful amount of capacitor energy storage, all the power
rails dropped at once, which might not be what the hardware expects and
may have therefore poked something in unexpected ways - blind guess, though.
>How-To-Repeat:
Borrow my (presumably slightly wonky) machine?
>Fix:
none


