Current-Users archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: specfs/spec_vnops.c diagnostic assertion panic



    Date:        Fri, 12 Aug 2022 17:10:47 +0000
    From:        Taylor R Campbell <campbell+netbsd-current-users%mumble.net@localhost>
    Message-ID:  <20220812171053.3E95260A30%jupiter.mumble.net@localhost>

  | I added a couple more assertions (spec_vnops.c 1.213).

OK, once again, crash on first boot - but this time (as I had only
one new kernel to test) I simply booted it again next time - and for
that, no issues.

I am beginning to suspect something uninitialised here, rather than
some devious locking issue...

Boot once (and the trigger may be that the kernel being booted is not
the same version as the one last booted) and things fail.   Reboot, with
the same kernel, hence same memory layout (I have KASLR disabled), and
the boot which panicked left everything nice and clean, and the boot
succeeds.

The series of boots I did earlier were all (slightly) different kernels
(though the final 2 would have been almost identical).

The most recent crash is a new one, this time in specfs_open() rather
than close, at line 970 in the (patched) spec_vnops.c

That is:
        } else if (error == 0) { 
                /*
                 * Device has not been revoked, so our opencnt can't
                 * have gone away at this point -- transition to
                 * sn_gone=true happens before transition to
                 * sn_opencnt=0 in spec_node_revoke.
                 */
                KASSERT(sd->sd_opencnt);
                KASSERT(sn->sn_opencnt);
                KASSERTMSG(sn->sn_opencnt <= sd->sd_opencnt,
                    "sn_opencnt=%u > sd_opencnt=%u", 
                    sn->sn_opencnt, sd->sd_opencnt);
                KASSERTMSG(sd->sd_closing == NULL, "sd_closing=%p",
                    sd->sd_closing);    
                sd->sd_opened = true;
        } else if (sd->sd_opencnt == 1 && sd->sd_opened) {

the 4th KASSERT in that block, so we finally got a sd_closing=%p output.

The panic is:

panic: kernel diagnostic assertion "sd->sd_closing == NULL" failed:
	file "/.../spec_vnops.c", line 970 sd_closing=0xffffe83c5663fa00

cpu0: Begin traceback...
vpanic() ...
kern_aassert() ...
spec_open() at netbsd:spec_open+0x662
VOP_OPEN() at netbsd:VOP_OPEN+0x3e
dk_open_parent() at netbsd:dk_open_parent+0x57
dkopen() at nebsd:dkopen+0xaa
cdev_open() at netbsd:cdev_open+0x109
spec_open() at netbsd:spec_open+0x238
VOP_OPEN() at netbsd:VOP_OPEN+0x5e
vn_open() at netbsd_vn_open+0x23b
do_open() at netbsd:do_open+0xc3
do_sys_openat() ...
sys_open() at ...
syscall() at ...
--- syscall (number 5) ---
netbsd:syscall+0xa1:
cpu0: End traceback...

I have the details to fill in that which I deleted (shown as ...)
above, those parts just didn't seen likely to be relevant (that
is except for the ... after "End traceback" which is literal).

Once again, the process on cpu0 which hit the assertion
is dkctl (pid 14277)

The process with the lwp* of 0xffffe83c5663fa00 is fsck_ffs

However, that process seems to be running userland code at the
time of the panic - no (meaningful) stack traceback was possible.

There is another fsck_ffs process running however (different cpu)
which was in the kernel (ddb's ps says "physio"), it's traceback was

trace: pid 12439 lid 13439 at 0xffffa590a13fcc60
sleepq_block() at netbsd:sleepq_block+0x13a
cv_wait at netbsd:cv_wait+0x49
physio() at netbsd:physio_0x3ac
spec_read() at netbsd:spec_read+0x34a
VOP_READ() at netbsd:VOP_READ+0x42
vn_read() at netbsd:vn_read+0xa2
dofileread() at netbsd:dofileread+0x79
sys_pread() at netbsd:sys_pread+0x95
syscall() at netbsd:syscall+0xa1
--- syscall (number 173) ---
netbsd:syscall_0xa1

Apart from dkctl() and the two fsck_ffs processes, there was
the parent fsck (just in wait) and (or course) init, and a whole
set of sh processes (many in pipe_rd or similar) - which will be
the infrastructure used for running rc scripts.   There's also devpubd
(in wait) and sleep (in nanoslp).   That's it (and of course, lots and
lots of kernel threads - things with pid==0).

I guess I should have also mentioned that lots (not all) of the
filesystems (all of which are from GPT partioned devices) on this
system are on raidframe (also GPT partitioned) (raid1 and raid5) - which
I assume is what accounts for the specfs accesses.   The previous shutdown
was clean however, so I'm not exactly sure what fsck was doing - unless it
is merely reading the super block to get the "clean" flag.  It had
already printed several "filesystem is clean: not checking" messages,
some of which would have been raidframe filesystems.

That is, assuming the kernel assigned the same /dev/dkNN numbers to
the current boot as it did the previous one (nothing changed which
should have made a difference, but...)

Last - to correct one (unrelated) thing from an earlier message - there
I said that I thought the msgbuf output from the orderly shutdown looked
complete, as it ended with /usr being unmounted.   When I shut down to
do this test, I noticed that it should have contained detach messages for
all kinds of devices, ending with acpisata0, and didn't.   So now I suspect
that simply lots of msgbuf data is not being saved on a reboot (and it is
the more recent data, not the older stuff) - so I am wondering if perhaps
there's a cache flush (or similar) needed before shutdown, to make sure that
the current message buffer pointer is sync'd to RAM correctly ??

Alternatively, perhaps (and this could explain some other messy/corrupted
output) perhaps multiple CPUs have different values for that pointer (or
index, doesn't matter) cached - and it is simply whichever one happens to
be written last which is used on the next boot?   That is, that field isn't
being maintained properly between cpus.

Just speculating of course.

Thanks for the assistance so far.

kre



Home | Main Index | Thread Index | Old Index