Current-Users archive


Re: continuous ffs_blkfree_common panic



On Sat, May 04, 2024 at 12:12:35AM +0300, Andrius V wrote:
> On Fri, May 3, 2024 at 3:59 PM Andrius V <vezhlys%gmail.com@localhost> wrote:
> >
> > Hi,
> >
> > Today I reinstalled one of my systems (NetBSD 10), and while setting up
> > the newly created user's password I received a panic. Since then the system
> > always panics in the same way on boot:
> >
> > [     6.252795] panic: ffs_blkfree_common: freeing free frag: dev = 0xa801, block = 186040, fs = /
> > [     6.252795] cpu0: Begin traceback...
> > [     6.252795] vpanic() at netbsd:vpanic+0x183
> > [     6.252795] panic() at netbsd:panic+0x3c
> > [     6.252795] ffs_blkfree_common.isra.0() at netbsd:ffs_blkfree_common.isra.0+0x951
> > [     6.252795] ffs_blkfree_cg() at netbsd:ffs_blkfree_cg+0x106
> > [     6.252795] ffs_realloccg() at netbsd:ffs_realloccg+0x899
> > [     6.262794] ffs_balloc() at netbsd:ffs_balloc+0x1290
> > [     6.262794] ufs_gop_alloc() at netbsd:ufs_gop_alloc+0xa7
> > [     6.262794] ufs_balloc_range() at netbsd:ufs_balloc_range+0x154
> > [     6.262794] ffs_write() at netbsd:ffs_write+0x34e
> > [     6.262794] VOP_WRITE() at netbsd:VOP_WRITE+0xa6
> > [     6.262794] vn_write() at netbsd:vn_write+0x10e
> > [     6.262794] dofilewrite() at netbsd:dofilewrite+0x80
> > [     6.262794] sys_write() at netbsd:sys_write+0x49
> > [     6.272794] syscall() at netbsd:syscall+0x1fc
> > [     6.272794] --- syscall (number 4) ---
> > [     6.272794] netbsd:syscall+0x1fc:
> > [     6.272794] cpu0: End traceback...
> >
> > The disk is a Kingston KC400 SSD, and there's more than enough space on it.
> > There are three GPT partitions (root/home/swap). root and home use
> > ffs2ea with wapbl (log) enabled. The actual boot happens from a USB drive,
> > but later altroot is used to switch to the SSD.

once a UFS fs that is used with logging becomes corrupted, it will often
stay corrupted until you manually run a full fsck on it ("fsck -fy").
the "fsck -p" that is run automatically only does log replay, and if
the metadata changes in the log do not fix the problem then the problem
will escape detection and remain unfixed even when the fs is mounted r/w.

we should add a way (such as setting a flag in the superblock) to mark
a file system as corrupted so that the automatic fsck will not mark the fs
as clean after only replaying the log, but instead require a full fsck
before the fs can be marked clean again.


> > [     3.402785] wd0 at atabus0 drive 0
> > [     3.402785] wd0: <KINGSTON SKC400S37512G>
> > [     3.402785] wd0: drive supports 16-sector PIO transfers, LBA48 addressing
> > [     3.402785] wd0: 476 GB, 992277 cyl, 16 head, 63 sec, 512 bytes/sect x 1000215216 sectors
> > [     3.402785] wd0: GPT GUID: d220d34c-d8d6-44a9-8a2a-09818d238115
> > [     3.402785] dk8 at wd0: "8baeac2c-5905-4005-a876-f61c560390f8", 262144 blocks at 2048, type: msdos
> > [     3.402785] dk9 at wd0: "nasroot", 724992000 blocks at 264192, type: ffs
> > [     3.402785] dk10 at wd0: "nashome", 242522112 blocks at 725258240, type: ffs
> >
> > Mounting partitions from another NetBSD system doesn't cause panic and
> > I can see files.
> >
> > Did I hit a hardware issue, or is it a software bug? What are my
> > options for going forward to restore the system (reinstall, relocate
> > files, etc.)?

corruption could be either hardware or software; there's often no way
to tell from a single occurrence.


> > Regards,
> > Andrius V
> 
> Seems like some data corruption occurred for some reason and caused the
> issue (/etc/passwd got filled with \xff values, master.passwd had
> gibberish at the end of the file, and the password database was
> corrupted as well). After some actions the crash is not reproducible
> anymore, though. I guess the issue can be considered some kind of fluke.

this can happen with our journal implementation because regular file data
is not journaled and there is no other mechanism to make sure that
uninitialized blocks are either initialized or removed from the file
after a crash.  no one has had the time to deal with this yet.

-Chuck

