NetBSD-Bugs archive


kern/53017: Kernel panic with "fpusave_lwp: did not" message



>Number:         53017
>Category:       kern
>Synopsis:       Kernel panics every now and then with "fpusave_lwp: did not" message.
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Feb 12 18:30:00 +0000 2018
>Originator:     Tero Kivinen
>Release:        NetBSD 8.0_BETA
>Organization:
IKI ry.
>Environment:
System: NetBSD vielako.iki.fi 8.0_BETA NetBSD 8.0_BETA (GENERIC) #0: Wed Nov 8 23:20:26 EET 2017 kivinen%vielako.iki.fi@localhost:/usr/obj/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64

Supermicro X8SIU (0123456789)
cpu0 at mainbus0 apid 0
cpu0: Intel(R) Xeon(R) CPU           X3430  @ 2.40GHz, id 0x106e5
cpu0: package 0, core 0, smt 0
cpu1 at mainbus0 apid 2
cpu1: Intel(R) Xeon(R) CPU           X3430  @ 2.40GHz, id 0x106e5
cpu1: package 0, core 1, smt 0
cpu2 at mainbus0 apid 4
cpu2: Intel(R) Xeon(R) CPU           X3430  @ 2.40GHz, id 0x106e5
cpu2: package 0, core 2, smt 0
cpu3 at mainbus0 apid 6
cpu3: Intel(R) Xeon(R) CPU           X3430  @ 2.40GHz, id 0x106e5
cpu3: package 0, core 3, smt 0

>Description:

Every now and then the machine crashes with the following panic:

"fpusave_lwp: did not"

I have crash dumps available for all of the crashes:

-rw-------  1 root  wheel  1966774296 Dec 23 01:08 /var/crash/netbsd.0.core
-rw-------  1 root  wheel  1927200792 Jan 12 00:08 /var/crash/netbsd.1.core
-rw-------  1 root  wheel  1903505432 Jan 15 11:08 /var/crash/netbsd.2.core
-rw-------  1 root  wheel  1947403800 Jan 24 15:09 /var/crash/netbsd.3.core
-rw-------  1 root  wheel  1917088792 Jan 30 01:24 /var/crash/netbsd.4.core
-rw-------  1 root  wheel  1929837080 Feb 12 13:09 /var/crash/netbsd.5.core

Looking at sys/arch/x86/x86/fpu.c, it seems fpusave_lwp() spins in a
loop checking hardclock_ticks until the value changes, and keeps a
spin count to make sure it does not stay there forever. The panic is
triggered once the loop has spun more than 100 million times.
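
To illustrate the pattern described above, here is a minimal user-space
sketch of a spin loop that waits for a tick counter to change and gives
up after a fixed number of iterations. It is not the code from fpu.c:
the names spin_until_tick_changes, tick_reader and fake_clock are
hypothetical, and where the kernel would panic this sketch just returns
false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that returns the current tick value. */
typedef unsigned (*tick_reader)(void *ctx);

/*
 * Spin until the tick value differs from the one seen on entry, or
 * until `limit` iterations have passed. Returns true if the tick
 * changed, false if the budget ran out (the kernel would panic here).
 */
static bool
spin_until_tick_changes(tick_reader read_tick, void *ctx, uint32_t limit)
{
	assert(read_tick != NULL);

	const unsigned start = read_tick(ctx);
	for (uint32_t spins = 0; spins < limit; spins++) {
		if (read_tick(ctx) != start)
			return true;
	}
	return false;
}

/*
 * Stand-in clock for demonstration: advances by one tick every
 * `period` reads; a period of 0 models a clock that never advances.
 */
struct fake_clock {
	unsigned reads;
	unsigned period;
};

static unsigned
fake_clock_read(void *ctx)
{
	struct fake_clock *c = ctx;

	return c->period != 0 ? c->reads++ / c->period : 0;
}
```

With a fake clock that never advances, the function exhausts its budget
and reports failure, which corresponds to the situation in which the
kernel prints the panic message.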

The panic usually happens a few minutes past the hour, because we run
our configuration update script every hour. The script takes a few
minutes to run, and during that time it does some floating point
arithmetic when generating graphics etc. The rest of the time the
machine just runs apache and a wiki, so no real floating point
calculations are done at all.

The bad thing was that because the machine crashed during the config
file update, some of the config files had not been fully written to
disk and ended with a long run of NULs. I.e., the size of the file was
correct, but the last few hundred kB of it were just zeros. We worked
around this by adding sync commands between generating the files and
doing the rest of the processing.
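
The workaround above uses sync commands in the script. As a sketch of
the same idea applied to a single file from C, the function below
writes a buffer and calls fsync(2) before closing, so the data is
handed to stable storage before later processing starts. The name
write_file_durably is hypothetical and the update script itself is not
shown in this report.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `len` bytes to `path`, replacing any previous contents, and
 * flush them with fsync() before closing. Returns 0 on success and
 * -1 on failure.
 */
static int
write_file_durably(const char *path, const void *data, size_t len)
{
	assert(path != NULL && (data != NULL || len == 0));

	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	const char *p = data;
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			close(fd);
			return -1;
		}
		p += n;			/* handle short writes */
		len -= (size_t)n;
	}

	/* Ask the kernel to push the file data to disk. */
	if (fsync(fd) != 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Unlike a system-wide sync, this flushes only the one file, so it would
have to be called for each generated config file.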

I have crash dumps available, and if more information is needed from
them I can try to dig it out.

>How-To-Repeat:

Seems to repeat itself every now and then on our hardware.

This might be related to kern/53016, as it is running on the same
hardware and the clock drift might be related to this failure too. Or
the two might be completely unrelated.

>Fix:

No fix known.

>Unformatted:
 
 NetBSD 8.0_BETA GENERIC from 2017-11-08.
 

