Current-Users archive

Re: Filesystem corruption in current 9.99.92 (posix1eacl & log enabled FFSv2)



On Thu, Dec 23, 2021 at 12:30:14PM +0100, Matthias Petermann wrote:
> Hello,
> 
> for tracking down an FFS issue in current I would appreciate some advice.
> There is a NetBSD 9.99.92 Xen/PV VM (storage provided by file backed VND).
> The kernel is built from ~2021-11-27 CVS source. The root partition is a
> normal FFSv2 with WAPBL. In addition there is a data partition for which I
> have posix1eacls enabled (for samba network shares and sysvol).
> 
> The data partition causes problems. Although the host has never crashed or
> been shut down uncleanly, the filesystem seems to have become inconsistent.
> I first noticed this because the "find" of the daily cron job was still
> running late in the morning with 100% CPU load but no disk I/O ongoing.
> 
> Then I took the filesystem offline for safety and forced a fsck. Errors were
> detected and repaired:
> 
> ```
> $ doas fsck -f NAME=export
> ** /dev/rdk3
> ** File system is already clean
> ** Last Mounted on /export
> ** Phase 1 - Check Blocks and Sizes
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> ** Phase 5 - Check Cyl groups
> CG 31: PASS5: BAD MAGIC NUMBER
> ALTERNATE SUPERBLK(S) ARE INCORRECT
> SALVAGE? [yn]
> 
> CG 31: PASS5: BAD MAGIC NUMBER
> ALTERNATE SUPERBLK(S) ARE INCORRECT
> SALVAGE? [yn] y
> 
> SUMMARY INFORMATION BAD
> SALVAGE? [yn] y
> 
> BLK(S) MISSING IN BIT MAPS
> SALVAGE? [yn] y
> 
> CG 799: PASS5: BAD MAGIC NUMBER
> CG 801: PASS5: BAD MAGIC NUMBER
> CG 806: PASS5: BAD MAGIC NUMBER
> CG 823: PASS5: BAD MAGIC NUMBER
> CG 962: PASS5: BAD MAGIC NUMBER
> CG 966: PASS5: BAD MAGIC NUMBER
> 482470 files, 113827090 used, 67860178 free (3818 frags, 8482045 blocks,
> 0.0% fragmentation)
> 
> ***** FILE SYSTEM WAS MODIFIED *****
> ```
> 
> I did not find much information about what the magic number of a cylinder
> group means and what could have caused it to be "bad" :-/

a "cylinder group" is a metadata structure in FFS that describes the
allocation state of a portion of the blocks and inodes of the file system
and contains the inode records themselves.  the header for this structure
also contains a "magic number" field that is supposed to contain a certain
constant value as a way to sanity-check that this metadata on disk was not
overwritten with some completely unrelated contents.
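
as a rough illustration of that check, here is a minimal user-space sketch
in C.  the struct and everything prefixed with sketch_ are made up for this
example; the real header is struct cg in sys/ufs/ffs/fs.h and has many more
fields, and the constant is assumed to match CG_MAGIC there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* assumed to match CG_MAGIC in sys/ufs/ffs/fs.h; check the header */
#define SKETCH_CG_MAGIC	0x090255

/* hypothetical, heavily simplified cylinder group header */
struct sketch_cg {
	int32_t	cg_magic;	/* should always hold SKETCH_CG_MAGIC */
	int32_t	cg_cgx;		/* index of this cylinder group */
};

/* returns true if the header still looks like a cylinder group */
bool
sketch_cg_magic_ok(const struct sketch_cg *cgp)
{
	assert(cgp != NULL);
	return cgp->cg_magic == SKETCH_CG_MAGIC;
}
```

a header that was overwritten with unrelated data will almost never hold
that exact value, which is what a "BAD MAGIC NUMBER" report is based on.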

in your case, since the magic number field does not actually contain the value
that it's supposed to contain, we know that the storage underneath the
file system has gotten corrupted somehow.  you'll want to track down
how that happened, but that is separate from your immediate problem.


> Anyway, a repeated fsck does not show further errors so I thought it should
> be fine. However, after mounting the FS on /export, by running
> 
> ```
> $ find /export
> ```
> 
> I can still reproducibly trigger the 100% CPU problem mentioned above.
> find always hangs at the same directory entry.
> 
> Does anyone have an idea how I can investigate this further? I have already
> done a ktrace on find, but in the state in question there seems to be no
> activity going on in find itself.
> 
> Kind regards
> Matthias

this sounds like a bug I have seen before, where the extended attribute block
for a file has been corrupted.  please try the attached patch and see if
this prevents the infinite loop.
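
to illustrate how a corrupted length field can make such a walk spin
forever, here is a minimal user-space sketch in C.  it is not the kernel
code: struct sketch_ea_hdr and sketch_count_records() are hypothetical and
only model a list of records that each report their own length.  if a record
claims a length smaller than its own header (for example 0), the cursor
never moves past it, so the sketch treats that as corruption and stops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* hypothetical record header; each record reports its own total length */
struct sketch_ea_hdr {
	uint32_t ea_length;	/* length of this record in bytes */
	uint8_t  ea_namespace;
	uint8_t  ea_pad[3];
};

/*
 * count the records in buf[0..len).
 * returns -1 if a record reports a length shorter than its own header.
 */
int
sketch_count_records(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int n = 0;

	assert(buf != NULL);
	while (off + sizeof(struct sketch_ea_hdr) <= len) {
		struct sketch_ea_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		/* make sure this record is complete */
		if (h.ea_length > len - off)
			break;
		/* corrupted length: without this check, a length of 0
		 * would leave "off" unchanged and the loop would never end */
		if (h.ea_length < sizeof(h))
			return -1;
		n++;
		off += h.ea_length;
	}
	return n;
}
```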

if that does prevent the infinite loop, then the file will probably appear
not to have an ACL anymore, and I'm not sure what will happen if you try
to set a new ACL on the file while it is in this state.  for now, the
safest thing to do is to make a copy of the file without preserving
extended attributes (i.e. do not use cp's "-p" option), delete the
original file, rename the copy to the original file's name, and then
set the new file's owner/group/mode/ACL to what the original file had.

-Chuck

Index: sys/ufs/ffs/ffs_extattr.c
===================================================================
RCS file: /home/chs/netbsd/cvs/src/sys/ufs/ffs/ffs_extattr.c,v
retrieving revision 1.8
diff -u -p -r1.8 ffs_extattr.c
--- sys/ufs/ffs/ffs_extattr.c	14 Dec 2021 11:06:50 -0000	1.8
+++ sys/ufs/ffs/ffs_extattr.c	23 Dec 2021 16:52:18 -0000
@@ -393,6 +393,9 @@ ffs_findextattr(u_char *ptr, u_int lengt
 		/* make sure this entry is complete */
 		if (EXTATTR_NEXT(eap) > eaend)
 			break;
+		/* handle corrupted ea_length */
+		if (EXTATTR_NEXT(eap) < eap + 1)
+			break;
 		if (eap->ea_namespace != nspace || eap->ea_namelength != nlen
 		    || memcmp(eap->ea_name, name, nlen) != 0)
 			continue;
@@ -857,6 +860,9 @@ ffs_listextattr(void *v)
 		/* make sure this entry is complete */
 		if (EXTATTR_NEXT(eap) > eaend)
 			break;
+		/* handle corrupted ea_length */
+		if (EXTATTR_NEXT(eap) < eap + 1)
+			break;
 		if (eap->ea_namespace != ap->a_attrnamespace)
 			continue;
 

