Re: specfs/spec_vnops.c diagnostic assertion panic

To: Taylor R Campbell <campbell+netbsd-current-users%mumble.net@localhost>
Subject: Re: specfs/spec_vnops.c diagnostic assertion panic
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Fri, 12 Aug 2022 23:28:13 +0700
    Date:        Thu, 11 Aug 2022 13:01:34 +0000
    From:        Taylor R Campbell <campbell+netbsd-current-users%mumble.net@localhost>
    Message-ID:  <20220811130139.B310860843%jupiter.mumble.net@localhost>

  | If that still doesn't help, you could try the attached patch (which
  | I'm not committing at the moment because it changes the kernel ABI).
  | With the patch, it may hit the following panic:
  |
  | panic: kernel diagnostic assertion "sd->sd_closing == NULL" failed: file "/syzkaller/jobs/netbsd/kernel/sys/miscfs/specfs/spec_vnops.c", line 1675 sd_closing=0xffffbc0013315500

It didn't, it hit the one in the previous line, KASSERT(!sd->sd_opened);
and I actually got that one to happen, 2 boots in a row, then the previous
panic the next boot (all slightly different kernels, but all very recent ones).
The last of those was missing your patch & spec_vnops changes.   All of them
had mlelstv@'s patch for the TSC vs PCIDIMP issue (which unfortunately I
didn't get to properly test, as the systems with PCIDUMP and that patch
both panic'd - but that's almost certainly just a coindidence).

So, no more info from the value of sd_closing ...

What I do now know (as the system stopped in ddb, and I could see what
was really happening) is that we are in the rc.d/fsck script, and the
current process was dkctl (all 3 times) - that's the one that ran the
close() which is generating the panic.

Note that all of the /etc/fstab lines which refer to on-disk filesystems
use the NAME= syntax to refer to a wedge label - I'm assuming that's not
completely unrelated.   The fsck process was in readisklabel() when the
panic happened, its kernel thread traceback goes...
(both fsck and dkctl had '>' in the ddb ps listing, which I believe
means assigned to a cpu, I think everything else marked that was as
an idle thread (idle/1 idle/3 idle/4 ...) though now I look I also
see a '>' beside softser/4 and softclk/4 (as well as idle/4 - all of
those shown as assigned to cpu 4 ... which means I am misinterpreting
things I expect).   [dkctl was on cpu0 and fsck on cpu2]

cv_wait() at netbsd:cv_wait+0x49
biowait() at netbsd:biowait+0x59
brelsel() at netbsd:brelsel+0x285
brelse() at netbsd:brelse+0x25
readdisklabel*( ar netbsd:readdisklabel+0x290
dk_getdisklabel() at netbsd:dk_getdisklabel+x0cb
?() at 80206468
?() at ffffcddc66b87780

The last at ff... is the lwp value for the fsck process (same
value I types to bt/a ffffc...)


At this stage (since fsck hasn't finished) all that should exist is the
root filesys, and whatever devpubd makes happen.

I have some images of the screen from when ddb entered from all 3 crashes

The first one was at line 1741 of (the patched) spec_vnops.c - so is the
second, that was essentially the same kernel, but without options PCIDUMP
enabled (otherwise compiled from the same sources).   The third crash was
from a kernel I built a couple of days earlier (back with PCIDUMP and
Michael's patch - I thought that that way I'd at least get to test that)
This one failed the same way as earlier, it was the sd_closing panic
(but without the mod from the patch to make it print the lwp which set it).

My next boot was the non-PCIDUMP kernel that is otherwise identical to
that last one (has Michael's patch, but without PCIDUMP the TSC has no
issues, so nothing very exciting).

Michael, if you happen to be reading this - while I didn't really get to
test your change in any way I'd be prepared to make any statement about,
my gut feeling is that it didn't fix things.   That's from watching the
dmesg timestamps of the messages from the kernel after the PCIDUMP sequence
ended.   I had decided to actually find out how long that takes (turns out
it is approx 6 minutes) so I had a stopwatch running.   After PCIDUMP
finished, it took about 40 seconds to get to the panic, but the msgbuf
timestamps had only grown to approx 20 seconds, which indicates that time
was still running slowly (I think).   Don't presume too much from this
however, I need to get the system running after having done a PCIDUMP so
I can actually measure what is happening with the clock then.

Back to specfs...

I think I am going to try backing out the patch, and just run with your
(Taylor's) updated specfs_vnops.c in case there is something in the patch
which is altering the behaviour (more than just the nature of the sd_closing
field type).

kre

ps: one other oddity (perhaps) - after all of these boots, the kernel
message buffer has messages from my last 7 kernel boots, including all of
the shutdown messages from the kernel that was running for the past several
days, and all 4 of tonight's boots (so far 4).   But none of the ones that
panicked (the 3 previous to the current one) run as far as the panic occurring,
they all cut off much earlier (but after the PCI dump finished in each of
the cases where that happened).

[     1.031821] acpicpu23 at cpu23: ACPI CPU
[     1.031821] coretemp15 at cpu23: thermal sensor, 1 C resolution, Tjmax=100
[     1.031821] cpu0: CPU base freq 3400000000 Hz
[     1.031821] cpu0: CPU max freq 5500000000 Hz
[     1.031821] cpu0: TSC freq CPUID 3417600000 Hz
[     1.031821] cpu0: TSC freq calibrated 9262831000 Hz
[     1.031821] timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
[     1.000000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
[     1.000000]     2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
[     1.000000]     2018, 2019, 2020, 2021, 2022
[     1.000000]     The NetBSD Foundation, Inc.  All rights reserved.
[     1.000000] Copyright (c) 1982, 1986, 1989, 1991, 1993
[     1.000000]     The Regents of the University of California.  All rights reserved.

[     1.000000] NetBSD 9.99.99 (JACARANDA:1.1-20220810) #11: Wed Aug 10 18:50:14 +07 2022
[     1.000000]         kre%jacaranda.noi.kre.to@localhost:/usr/obj/testing/kernels/amd64/JACARANDA

That's the switch from the previous boot to the current one.  The timestamps
on the previous boot should have gone up to about 20 seconds - actually I see
from the photo that it was 19.59... seconds.   Wonder what happened to all
that missing data, which includes essentially all of the normal autoconf
stuff happening.

An earlier one

[    13.700806] Initialized nouveau 1.3.1 20120801 for nouveau0 on minor 0
[    13.720806] nouveaufb0: framebuffer at 0x
[    15.220801] nouveau0: autoconfiguration error: error: DRM: base-1: timeout
[    15.980799] nouveau0: autoconfiguration error:e 640x480x67
[    17.810793] wsdisplay0 at nouveaufb0 kbdmux 1: cfiguration error: error: fifo: fault 01 [WRITE] at 00000000002f0000 engine 04 [B
AR1] client 08 [HUB/HOST_CPU_NB] reason 02 [PTE][   1.0000000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2
005,  
[     1.000000]     2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
[     1.000000]     2018, 2019, 2020, 2021, 2022
[     1.000000]     The NetBSD Foundation, Inc.  All rights reserved.
[     1.000000] Copyright (c) 1982, 1986, 1989, 1991, 1993
[     1.000000]     The Regents of the University of California.  All rights reserved.

[     1.000000] NetBSD 9.99.99 (JACARANDA:1.1-20220811) #12: Fri Aug 12 03:45:10 +07 2022
[     1.000000]         kre%jacaranda.noi.kre.to@localhost:/usr/obj/testing/kernels/amd64/JACARANDA

Again, lots has been dropped/overwritten.

On the other hand, the first boot this evening:

[ 400923.546890] unmounted /dev/dk39 on /local/old/src type ffs
[ 400924.156898] unmounted /dev/dk29 on /usr/src type ffs 
[ 400924.336900] unmounted /dev/dk20 on /usr/pkg type ffs
[ 400927.906946] unmounted /dev/dk8 on /var type ffs
[ 400927.936947] unmounted /dev/dk19 on /usr type ffs
[     1.000000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
[     1.000000]     2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
[     1.000000]     2018, 2019, 2020, 2021, 2022
[     1.000000]     The NetBSD Foundation, Inc.  All rights reserved.
[     1.000000] Copyright (c) 1982, 1986, 1989, 1991, 1993
[     1.000000]     The Regents of the University of California.  All rights reserved.

[     1.000000] NetBSD 9.99.99 (JACARANDA:1.1-20220811) #12: Fri Aug 12 03:45:49 +07 2022
[     1.000000]         kre%jacaranda.noi.kre.to@localhost:/usr/obj/testing/kernels/amd64/JACARANDA_PCI_DUMP

which all looks perfectly normal - /usr would be the last filesys unmounted
(given that root cannot be - /usr is the 2nd (non-comment) line in fstab).
Follow-Ups:
- Re: specfs/spec_vnops.c diagnostic assertion panic
  - From: Taylor R Campbell
References:
- Re: specfs/spec_vnops.c diagnostic assertion panic
  - From: Taylor R Campbell
Prev by Date: re: current USE_SSP=yes build failure
Next by Date: Re: specfs/spec_vnops.c diagnostic assertion panic
Previous by Thread: Re: specfs/spec_vnops.c diagnostic assertion panic
Next by Thread: Re: specfs/spec_vnops.c diagnostic assertion panic
Indexes:
Home | Main Index | Thread Index | Old Index