[The original version of this message I sent yesterday apparently failed to
go thru, possibly due to the long length of the console log I included. I'm
retrying this with the console log excised and placed on an external
web site.]
Greg Troxel <gdt%lexort.com@localhost> writes:
Michael Cheponis <michael.cheponis%gmail.com@localhost> writes:
My guess is it's not necessarily something with this particular version of
-current, but is something in the disk structure?
My guess is that something is wrong on your filesystem that fsck does
not check for.
No, I'm seeing what I think is the same thing, and I think it's not the
kernel hitting existing corruption on an fs image, but a kernel bug
corrupting the fs and then causing the panic. See below.
Any suggestions on how I can fix this? (I'm happy to try newer kernels,
etc).
You can run fsdb and try to look at that inode, and maybe use clri.
Or, you can dump/newfs/restore if you just want a working system.
Well, I've been seeing similar behaviour on recent evbarm64 builds I
test under qemu before letting them loose on real hardware (and, given
the bug's filesystem-eating propensities, I didn't put any of those
recent builds on my real evbarm64 (Pine) hardware). I found the
following:
1) I could pretty reliably get the bug to hit by doing the "installing
world" part of sysupgrade (more precisely, "sysupgrade sets", where
it untars the tar files for /bin, /usr/bin, /usr/lib and so on)
2) doing an approach along the lines you suggest above, creating a new
disk image, newfsing it, and copying all the files over, then
booting off the new filesystem, doesn't avoid the problem -- it's
not a case of a pre-existing corrupt directory on the filesystem that
one just happened to hit that day.
I've got a console log here ( https://pastebin.com/eHRG8jGy )
exhibiting an example reproduction of the problem.
The first part has qemu booting with two virtual disks
attached, one the "usual" NetBSD-current disk image I've been using with that
VM, and the second a copy of said image which was going to get its /
newfsed and recreated by copying from the master. (I did the copying
using rsync, not dump/restore, but that shouldn't matter.) The second
part is qemu running off of the "copy" disk image, and going thru a typical
attempt to upgrade to an already-built NetBSD-current release (built
elsewhere, stashed on NFS on another machine) -- first sysupgrade
fetches everything, installs the new kernel, then we boot the new
kernel. Next we try to use sysupgrade to install world, but we hit a
    panic: ffs_blkfree: bad size: dev = 0x5c20, bno = 2432893633 bsize = 16384, size = 16384, fs = /
panic in the middle of extracting the tests tar file, and then fail
automatic fsck on reboot.
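
For what it's worth, the numbers in that panic can be checked by hand. What follows is only a minimal sketch, assuming the check has the usual FFS shape (the size is no larger than a block, is a whole number of fragments, and does not run past the end of the block that bno falls in) and assuming the newfs default of 8 fragments per block, i.e. a 2048-byte fragment size. Neither assumption is confirmed by the log, and the function and parameter names are made up for illustration:

```c
#include <stdint.h>

/*
 * Sketch of an FFS-style "bad size" test. Names are hypothetical; fsize is
 * the fragment size, which the panic message does not show (2048 is an
 * assumption). Returns 1 if freeing `size` bytes at fragment address `bno`
 * looks invalid, 0 otherwise.
 */
int
bad_size_sketch(int64_t bno, long size, long bsize, long fsize)
{
	long frags_per_block = bsize / fsize;
	long fragnum = (long)(bno % frags_per_block);	/* offset within block */
	long numfrags = size / fsize;			/* fragments being freed */

	if (size > bsize)			/* larger than one block */
		return 1;
	if (size % fsize != 0)			/* not a whole number of fragments */
		return 1;
	if (fragnum + numfrags > frags_per_block)	/* runs past end of block */
		return 1;
	return 0;
}
```

Under those assumptions, bno = 2432893633 sits one fragment into its block (2432893633 % 8 is 1), so freeing a full 16384-byte block from there would run past the block boundary. The size itself is fine; it is the block number that looks wrong, which would be consistent with a block pointer getting scribbled on.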
After a rather long and tedious set of "hg bisect/build.sh" iterations, it
looks like I've found the culprit: the problem seems to have been
introduced by the following commit:
changeset: 936632:a15d1b7af98f
branch: trunk
user: skrll <skrll%NetBSD.org@localhost>
date: Tue Sep 08 10:30:17 2020 +0000
files: sys/arch/arm/arm32/bus_dma.c
extra: branch=trunk
description:
A few bus_dmatag_subregion fixes
- return EOPNOTSUPP if min_addr isn't less than max_addr
- fix the subset check to ensure that all the ranges in the parent tag are
within the {min,max}_addr range. If so we can just continue to use the
parent tag.
- when building the new ranges read the parent tag range rather than
uninitialised memory.
- remove the max_addr != 0xffffffff check - the overflow should be handled
by the unsigned arithmetic for arm32.
- add a KASSERT
- add comments
Dunno what in particular in there is the problem, but obviously anything
going astray in the dma code has the potential to wreak havoc on the FS
when there's a lot of disk I/O going on...
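
For anyone who wants to stare at it, the first two items of that commit message (the min_addr/max_addr sanity check and the "all parent ranges inside the window" subset check) can be sketched roughly as below. This is not the code from sys/arch/arm/arm32/bus_dma.c; the struct, the function name and the choice of treating max_addr as inclusive are all assumptions made purely for illustration:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA range of a parent tag. */
struct dma_range_sketch {
	uint64_t start;	/* first bus address covered */
	uint64_t len;	/* length in bytes, assumed non-zero */
};

/*
 * Returns EOPNOTSUPP if min_addr is not below max_addr. Otherwise returns 0
 * and sets *is_subset to 1 only when every parent range lies entirely
 * inside [min_addr, max_addr], in which case the parent tag could be reused.
 */
int
subregion_check_sketch(const struct dma_range_sketch *r, size_t n,
    uint64_t min_addr, uint64_t max_addr, int *is_subset)
{
	if (min_addr >= max_addr)
		return EOPNOTSUPP;

	*is_subset = 1;
	for (size_t i = 0; i < n; i++) {
		uint64_t last = r[i].start + r[i].len - 1;	/* last address */

		if (r[i].start < min_addr || last > max_addr) {
			*is_subset = 0;	/* this range pokes outside the window */
			break;
		}
	}
	return 0;
}
```

If a check like this wrongly decides the parent tag is good enough, DMA could end up targeting memory outside the window the device can actually reach, which is the sort of thing that would show up as garbage in disk buffers.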