Current-Users archive


Re: specfs/spec_vnops.c diagnostic assertion panic



    Date:        Fri, 12 Aug 2022 21:07:14 +0000
    From:        Taylor R Campbell <campbell+netbsd-current-users%mumble.net@localhost>
    Message-ID:  <20220812210719.CA66B60A30%jupiter.mumble.net@localhost>

  | Here's a hypothesis about what happened.
  |
  | - You have a RAID volume, say raid0, with a GPT partitioning it.
  |
  | - raid0 is configured with an explicit /etc/raid0.conf file, rather
  |   than with an autoconfigured label.
  |
  | - You have devpubd=YES in /etc/rc.conf.

All that is correct (2 RAID volumes actually), using config files rather
than raid autoconf, and with devpubd turned on.
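
For the record, the relevant knobs boil down to roughly this (an excerpt
only; the name of the second set's config file is my assumption, following
the same /etc/raidN.conf pattern that /etc/rc.d/raidframe looks for):

    # /etc/rc.conf (excerpt)
    devpubd=YES
    # plus /etc/raid0.conf and /etc/raid1.conf, which /etc/rc.d/raidframe
    # hands to raidctl at boot time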

  | On boot, the following sequence of events occurs:
  |
  | 1. /etc/rc.d/devpubd launches devpubd, which synchronously enumerates
  |    devices and invokes hooks for all the devices it finds.  This
  |    _excludes_ raid0 because it hasn't been configured yet.

Yes.

  | 2. /etc/rc.d/raidframe configures raid0 from /etc/raid0.conf.

Close enough.

  | 3. /etc/rc.d/fsck starts to run.

Yes.

  | At this point, two things happen concurrently:
  |
  | (a) /etc/rc.d/fsck runs fsck on dkN (some wedge of raid0)
  | (b) devpubd wakes and runs `dkctl raid0 listwedges' in 02-wedgenames

Entirely possible.   I had also wondered what was running dkctl; it seemed
like an odd thing for fsck to be doing ... I had forgotten about devpubd's
hook scripts.
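
For anyone else who had forgotten about them, the hooks live in
/libexec/devpubd-hooks/, and the bit that races with fsck amounts to no
more than something like:

    # run by devpubd when raid0 attaches (much simplified: the real
    # 02-wedgenames also parses the output and maintains the
    # /dev/wedges symlinks)
    dkctl raid0 listwedges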

  | fsck and dkctl race to work on raid0 -- the block device.  Sometimes
  | this happens without trouble.  Sometimes one will try to open it while
  | the other is still using it, and the open will fail with EBUSY.  But
  | sometimes one tries to open while the other has started, but not yet
  | finished, closing it -- and that's when the crash happens.  With my
  | last patch, it should just fail with EBUSY in that case too.

No signs of EBUSY from fsck, where I'd expect it to be noticed (I am
now running with that patch).  No idea what devpubd does with EBUSY, but
I'd guess not much (it doesn't seem to be very chatty).

Why is fsck running on the block device though?   And devpubd too?   Given
the reference to cdev_close() I'd assumed it was a char (raw) device that
was being used, which is certainly what fsck should be using.  But I see
that 02-wedgenames uses the block /dev/dkN device ... I should fix that.
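
To make that distinction concrete (the dk unit numbers below are just
examples, they vary from boot to boot):

    /dev/raid0d   /dev/rraid0d   # block and raw (char) nodes for the raid
    /dev/dk3      /dev/rdk3      # block and raw nodes for one of its wedges
    # fsck should be opening the raw (r*) nodes; the 02-wedgenames
    # symlinks point at the block dkN nodes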

  | Now there's a higher-level issue here -- once we fix the kernel crash,
  | dkctl as called via devpubd -> 02-wedgenames might lose the race with
  | fsck and then fail to create the wedge, or fsck might lose the race
  | and, well, fail to fsck your file system, both of which might be bad.

The former is not so bad (for me anyway); devpubd was more an experiment,
just to see what it would do (primarily for its ability to make more /dev/dkN
nodes on demand ... I kept running out).   The latter, yes, though all that
should result in is rc dropping to single user mode, after which it should
restart, and perhaps (even probably) succeed this time.

  | So we need to find a way to deal with this race even after we fix the
  | kernel crash.

Agreed.

Now for the earlier messages about this:

  | Can you try the attached patch and see if (a) it still panics, or if (b)
  | there are any other adverse consequences like fsck failing? 

Of course I can, and now have...   No panic.   No adverse consequences.
Unfortunately that means almost nothing, as the panic doesn't happen on
every boot, even with a kernel that is eventually going to hit it.  I will
reboot it some more later today.

I decided to test my (no longer credible anyway, for other reasons) uninit'd
hypothesis, and so booted a generic kernel from about a month ago that I
still had lying around.  I didn't expect that to fail, and it didn't (it
is of a similar vintage to when the first of these panics that I have
recorded occurred, so it wasn't that - but I booted it just to single user
mode, purely to use it to scramble RAM, so no devpubd or fsck happened).

That changed nothing: the kernel, as now patched, booted again without
issues (again, proof of nothing) - though, as I expected it would, it started
with a clean message buffer this time, since the generic boot destroyed the
old one.

  | Might help to know what the process command-lines were for fsck and dkctl if
  | you catch it again (possibly without the patch I just sent in a followup, in
  | case that fixes the kernel crash). 

I will boot the previous kernel and see if I can extract that info, but
I will need to read a bit of the ddb command man page first...  So sometime
your tomorrow (much later in my today).
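
Presumably the starting point is just ddb's `ps' (an alias for
`show all procs'), which gives pids and command names at least; whether
the full argument lists are as easy to get is what I need ddb(4) for:

    db{0}> ps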

  | Can you share dmesg, `drvctl -lt' output, and /etc/rc.conf (or any other
  | /etc/rc* configuration)?

If you think you still need it (your hypothesis seems good to me).

  | Can you describe all the physical and logical storage
  | devices on your system?

Sure (dmesg would reveal most of it, but would be a pain to dig out, and
would not reveal some of the extra bits).

There are 2 NVMe SSDs (ld0, ld1), each of which has a GPT - one of them
is currently wasted by a linux version I installed in the very early
days, the other has misc stuff like /usr/obj (and the first NetBSD I
installed - that one was just used to self-build the one I use now).  Swap
is also there (though I've never seen any of it used).

One SATA SSD (wd0) - that's the primary boot drive, with root, /usr, etc.
That's where NetBSD lives, and where linux will live if I can make it
install without deciding it should be upgrading the other copy instead
(a very low priority task...) - it also has an unactivated wintrash that
the people who built the system for me used to test that the hardware
all worked.

Then 5 rotating rust drives (wd1..wd5) - 2 are a RAID-1, the other 3 a
RAID-5.  One of the RAID-1 drives is currently away on vacation - the one
supplied with the system made seek scratch noises, and its SMART
said:
   SMART overall-health self-assessment test result: FAILED!
   Drive failure expected in less than 24 hours. SAVE ALL DATA.
even though it was (apparently) working OK (NetBSD saw no errors; no
idea whether the drive was slower than it should have been).   After about
a month (of the drive not failing in less than 24 hours) I decided to
send it back and have it replaced (I had been waiting for the promised
failure).   The replacement arrived, and on first power up (and again on a
subsequent attempt) made hideous seek scratch noises for 30 secs or so,
then told NetBSD it had 0 sectors (and 0 everything else, except the model
& serial numbers).  Totally dead (though the electronics still worked fine).

It turns out that under NetBSD one cannot even attempt to get the SMART
status of a drive in that condition - the open fails if there is no
associated storage.  I changed that, and then SMART worked to the extent
of telling me that the drive didn't support SMART and that it wasn't
enabled (ie: the capabilities stuff was all 0's too ... the drive also
didn't support DMA or LBA48 addressing ... it is supposed to be 16TB, so
using C/H/S would be interesting!), and when I asked it to ignore all that
and try anyway, the only response was an error.   So that drive went back
too.   The next replacement is yet to arrive.
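
For reference, the base system's way of poking at this is atactl(8) (the
output quoted above looks like smartctl from pkgsrc, but either should do);
wd1 below is just a placeholder for whichever unit the sick drive appears as:

    atactl wd1 smart status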

The RAID arrays are configured from config files, not autoconf, as one
drive from each array is (or should be - the missing one is one of those)
on an add-in SATA controller that the BIOS (firmware really, I suppose)
likes to occasionally pretend doesn't exist.   That's why I enabled PCIDUMP,
so I could see what was apparently there, and the answer really was nothing
(the firmware does a good conjuring job).

If RAID autoconf is turned on when I boot (even to single user mode) the
raids get configured in degraded mode when that happens (each array appears
to have one missing drive).  When I force the BIOS to make the SATA
controller reappear (lots of abracadabra involved in that) and reboot,
raidframe has already marked the temporarily missing drives as failed, so
reconstruction is required to rebuild everything - which takes almost 24
hours to complete (running both in parallel, not that that really affects
anything).

I got bored with having that happen over and over again, so I turned off
raid autoconfig and added an rc.d script that runs before raidframe is
started and just checks that at least one of the drives that should be on
the vanishing controller is visible (in hw.disknames) - if not it aborts
the boot, so I can fix things before raidframe discovers the problem.
(That strategy is working fine, though I'd like to find a way to defeat
the BIOS permanently - to be able to undo whatever wall it is building or
mirror it is using, as I know the controller never really leaves the case!)
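
Something along these lines, in case it is useful to anyone else - a sketch
only, from memory: the script name, drive units, and message text are
invented, and it assumes the stock /etc/rc.d/raidframe PROVIDEs `raidframe':

    #!/bin/sh
    #
    # PROVIDE: satacheck
    # BEFORE:  raidframe

    . /etc/rc.subr

    name="satacheck"
    rcvar=$name
    start_cmd="${name}_start"
    stop_cmd=":"

    satacheck_start()
    {
        # wd4 and wd5 stand in for the drives on the add-in controller;
        # if neither appears in hw.disknames the controller has vanished
        # again, so stop the boot before raidframe marks them failed.
        case " $(/sbin/sysctl -n hw.disknames) " in
        *" wd4 "*|*" wd5 "*)
            ;;
        *)
            echo "${name}: add-in SATA controller is missing"
            stop_boot
            ;;
        esac
    }

    load_rc_config $name
    run_rc_command "$1"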

There's also an optical media drive, which almost never has anything in it.

Apart from that, I have anything up to 6 external USB drives connected (though
5 is the typical max).  Right now while testing this, just 1, and that one
is powered off.   Very occasionally an eSATA drive might get connected.

Everything uses GPT (system uses UEFI booting).

  | (Privately is fine if you prefer.) 

No particular concern, except that the dmesg output is not small; it is
probably a bit much for this list.

Further, given your deductions above, I doubt that it is needed any
more.

kre

ps: there don't appear to be any wedges missing from /dev/wedges - I
didn't do a systematic inventory, but I see I am probably not using it
correctly, as it still has entries for the USB drives that are now
disconnected.  The disconnects happened while the system was down (that's
a time I know they're not in use!) so devpubd cannot be blamed for not
removing them - except that it should perhaps ensure that the directory
is empty when it starts (or the rc.d script should - or a tmpfs or
something could be made for it).   But for this purpose that behaviour
helped, as I could simply look and see what was old, and that turned out
to be only the unconnected USB drive partitions.  (This is no guarantee
that everything else is present, but it is likely.)  All the rest look to
have been created at the same time, so it is also unlikely that devpubd
missed something on its first attempt and then went back and added more
later.
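
(The `tmpfs or something' would amount to no more than, say,

    mount_tmpfs tmpfs /dev/wedges

done before devpubd starts - untested, just the flavour of the idea.)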



