kern/58553: ffs: garbage data appended after crash

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/58553: ffs: garbage data appended after crash
From: campbell+netbsd%mumble.net@localhost
Date: Sun, 4 Aug 2024 15:30:01 +0000 (UTC)

>Number:         58553
>Category:       kern
>Synopsis:       ffs: garbage data appended after crash
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Aug 04 15:30:01 +0000 2024
>Originator:     Taylor R Campbell
>Release:        current, 10, 9, 8, 7, 6, ...
>Organization:
The NetBSD Foundation
>Environment:
>Description:
When ffs appends data to a file, it does three things:

1. allocate data blocks
2. write to data blocks
3. increase inode size

During normal operation, these steps are taken under the exclusive vnode lock, so other threads and processes can't see the intermediate states.

But if the system crashes in the middle of the steps, it may end up with data blocks that have been allocated, an inode that has been extended, and garbage in the data blocks because step (2) never finished.

This can happen because although steps (1) and (3) are metadata updates, which traditional ffs issues synchronously and wapbl issues in a transaction with a write-ahead log, step (2) is a data update which is largely unordered with respect to metadata updates.
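
The crash window can be illustrated with a small user-space model.  This is only a sketch, not ffs code: the toy_disk structure, the 0xAA stale-data pattern, and the crash_before_data flag are all hypothetical names invented for the illustration.  It assumes the metadata updates (steps 1 and 3) reach the disk before the data write (step 2), which is the ordering that leaves garbage behind.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DISK_BYTES 64
#define STALE_BYTE 0xAA	/* hypothetical marker for leftover block contents */

/* Toy "disk": only what would survive a crash. */
struct toy_disk {
	unsigned char data[DISK_BYTES];	/* data blocks */
	size_t allocated;		/* bytes allocated to the file (metadata) */
	size_t inode_size;		/* length recorded in the inode (metadata) */
};

/* Start with an empty file whose free blocks hold stale contents. */
static void
toy_disk_init(struct toy_disk *d)
{
	memset(d->data, STALE_BYTE, sizeof(d->data));
	d->allocated = 0;
	d->inode_size = 0;
}

/*
 * Append len bytes.  Steps 1 and 3 are modeled as already durable;
 * if crash_before_data is true, step 2 never reaches the disk.
 */
static void
toy_append(struct toy_disk *d, const unsigned char *buf, size_t len,
    bool crash_before_data)
{
	size_t off = d->inode_size;

	assert(off + len <= DISK_BYTES);
	d->allocated = off + len;		/* 1. allocate data blocks */
	if (!crash_before_data)
		memcpy(d->data + off, buf, len); /* 2. write to data blocks */
	d->inode_size = off + len;		/* 3. increase inode size */
}

/* Count bytes inside the file's recorded length that are still stale. */
static size_t
toy_garbage_bytes(const struct toy_disk *d)
{
	size_t i, n = 0;

	for (i = 0; i < d->inode_size; i++)
		if (d->data[i] == STALE_BYTE)
			n++;
	return n;
}
```

In this model, an append that crashes before step 2 leaves inode_size covering bytes that were never written, which is the state described above.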

The issue is exacerbated by wapbl, which accelerates metadata updates without changing the rate of data updates.
>How-To-Repeat:
1. start a write-heavy workload
2. crash the system in the middle
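
One possible write-heavy workload for step 1 is a loop that appends fixed-size chunks to a file without ever calling fsync.  The sketch below is an assumption about what such a workload might look like, not the one used to find the bug; append_workload, CHUNK, and the 'A' fill pattern are hypothetical.  A recognizable fill pattern makes it easier to spot garbage at the end of the file after the crash.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 4096	/* hypothetical chunk size */

/*
 * Append count chunks of a recognizable pattern to path, with no fsync,
 * so the data writes stay unordered relative to the metadata updates.
 * Returns the number of bytes appended, or -1 on error.
 */
static long
append_workload(const char *path, long count)
{
	char buf[CHUNK];
	long i, total = 0;
	int fd;

	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd == -1)
		return -1;
	memset(buf, 'A', sizeof(buf));
	for (i = 0; i < count; i++) {
		ssize_t n = write(fd, buf, sizeof(buf));

		/* Treat short writes as errors to keep the sketch simple. */
		if (n != (ssize_t)sizeof(buf)) {
			close(fd);
			return -1;
		}
		total += n;
	}
	if (close(fd) == -1)
		return -1;
	return total;
}
```

After rebooting, any bytes inside the file's length that are not 'A' would be candidates for the garbage described in this report.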
>Fix:
1. Create a new type of logical log record, truncate(n,k), meaning `truncate inode n to byte k'.  (This requires versioning -- older versions of NetBSD won't be able to replay these logs, so it will require a newfs or tunefs option to opt in.)

2. Change ffs_write (WRITE in sys/ufs/ufs/ufs_readwrite.c) so that, when it extends a file from length k0 to length k1, it creates a record truncate(n,k0) in the next transaction.

3. Change ffs_fsync and ffs_full_fsync so that if they are syncing any prefix of the interval [k0, k1], say to byte k, they change the record to truncate(n,k) in the next transaction.  If they are syncing the whole interval, they delete the record in the next transaction.
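
The bookkeeping in steps 2 and 3 can be sketched as a user-space model.  None of this is wapbl or ffs code: toy_inode, trunc_rec, toy_extend, toy_fsync, and toy_replay_size are hypothetical names.  The model also assumes one detail the steps above do not spell out: if a record already exists from an earlier unsynced extension, a further extension keeps the existing (lower) offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pending log record: truncate inode ino to byte off. */
struct trunc_rec {
	bool present;
	uint64_t ino;
	uint64_t off;
};

struct toy_inode {
	uint64_t ino;
	uint64_t size;		/* length recorded in the inode */
	struct trunc_rec rec;	/* at most one pending record */
};

/* Step 2: extending from k0 = ip->size to k1 logs truncate(n,k0). */
static void
toy_extend(struct toy_inode *ip, uint64_t k1)
{
	assert(k1 >= ip->size);
	if (!ip->rec.present && k1 > ip->size) {
		ip->rec.present = true;
		ip->rec.ino = ip->ino;
		ip->rec.off = ip->size;
	}
	ip->size = k1;
}

/* Step 3: data up to byte k is now on disk. */
static void
toy_fsync(struct toy_inode *ip, uint64_t k)
{
	if (!ip->rec.present)
		return;
	if (k >= ip->size)
		ip->rec.present = false;	/* whole interval synced */
	else if (k > ip->rec.off)
		ip->rec.off = k;		/* prefix synced: truncate(n,k) */
}

/* File length that log replay would leave after a crash. */
static uint64_t
toy_replay_size(const struct toy_inode *ip)
{
	if (ip->rec.present && ip->rec.off < ip->size)
		return ip->rec.off;
	return ip->size;
}
```

In this model a crash at any point replays to a length that covers only data known to be on disk, so the unwritten tail is cut off rather than exposed as garbage.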

(We can also use truncate(n,k) records to make truncate itself atomic.  Currently it is split over multiple transactions, to avoid overflowing a transaction when truncating a very large file requires deallocating a large number of data blocks.  So if you crash in the middle of truncating a 100000-byte file to 100 bytes, you might find the file larger than 100 bytes but smaller than 100000 bytes.)

FreeBSD avoids the problem by enforcing a partial ordering of writes with a more elaborate system of block dependencies called soft updates or softdep.  We used softdep, but after years of struggling with it we concluded it was unmaintainable and undebuggable, and removed it 15 years ago: https://mail-index.netbsd.org/source-changes/2009/02/22/msg217531.html, https://mail-index.netbsd.org/netbsd-announce/2008/12/14/msg000051.html
