Subject: kern/8889: -current LFS corruption
To: None <gnats-bugs@gnats.netbsd.org>
From: None <jbernard@mines.edu>
List: netbsd-bugs
Date: 11/26/1999 15:30:57
>Number:         8889
>Category:       kern
>Synopsis:       null files, dirty files in clean segments, non-removable directories
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Nov 26 15:30:01 1999
>Last-Modified:
>Originator:     Jim Bernard
>Organization:
	Speaking for myself
>Release:        Nov. 25, 1999
>Environment:
1.4P, Nov. 25, 1999, i386


>Description:
	LFS is exhibiting several forms of corruption.  I don't know exactly
	what minimum sequence of events leads to each, so I'll describe the
	sequence that has so far caused the problems.

	I rebuilt the LFS (3 GB in size) under a Nov. 14 kernel with 
	userland from the Nov. 13 snapshot, and unpacked a -current source
	tree onto the filesystem.  No corruption was evident (except for
	the appearance of UNREF FILE messages from fsck_lfs, which is
	apparently harmless) in a couple of days of relatively light
	operation.  At some point (I'm not sure exactly when, but I think it
	was during this period), I noticed a couple of files with wrong
	timestamps (more recent than they should have been), but dismissed
	that as not critical.
	I then built a new kernel (sources supped after Nov. 25 supscan; built
	in a scratch directory on the LFS, with the source tree, also on LFS,
	union mounted beneath the scratch directory), and booted it, with no
	immediately apparent problems.  I then did a full system build (again
	with sources on the LFS filesystem union mounted beneath a scratch
	directory on the LFS), with no immediately apparent problems, and
	rebooted.  Then "fsck_lfs -n -d" reported in phase 1 some 15,000
	messages like:

	  ! INO 1318: daddr 0x2779b3 is in clean segment 1263

	(The number of these has changed with time: over a period of a bit
	less than 24 hours it fell to a minimum of about 7,000, then rose
	slightly.)  No other corruption, besides the UNREF FILE messages,
	was found by fsck_lfs.
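
	To follow how the number of these messages changes between runs,
	saved fsck_lfs output could be tallied per segment.  What follows is
	only a minimal sketch and not part of the original report: it assumes
	the output was captured as text and that every message has exactly
	the form quoted above.  The function name is hypothetical.

```python
import re
from collections import Counter

# Matches the message form quoted in the report, e.g.
#   ! INO 1318: daddr 0x2779b3 is in clean segment 1263
# Assumption: every such message follows this exact layout.
MSG = re.compile(r"INO (\d+): daddr (0x[0-9a-fA-F]+) is in clean segment (\d+)")


def tally_clean_segment_messages(lines):
    """Return (total matching messages, Counter of messages per segment)."""
    per_segment = Counter()
    for line in lines:
        m = MSG.search(line)
        if m:
            # Group 3 is the segment number reported as clean.
            per_segment[int(m.group(3))] += 1
    return sum(per_segment.values()), per_segment
```

	Passing it an open file of captured output (any iterable of lines
	will do) gives the total and the per-segment counts, which can then
	be compared from one fsck_lfs run to the next.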

	I then unpacked the latest xsrc tarball, and supped pkgsrc (which was
	about a week out of date) and xsrc (all onto the LFS), with no apparent
	problems.  BTW: at this point, df reported something like 600 MB of
	space in use.  I then successfully built and installed the tcl80
	package (the installed files go on a different filesystem, but the
	source tree and build space were on the LFS---no union mount was used
	here).  An immediately subsequent attempt to build tk80 failed
	miserably, because some directories and files in the work subdirectory
	(where the source tarball gets unpacked) were null (ls showed all mode
	bits off and 0 link count):

	  ----------  0 someuser  somegroup   (not sure of the rest)

	Furthermore, some directories could not be removed, reminiscent of
	the problem reported in kern/8815, which has, however, been fixed
	(the current directory-removal problem evidently occurs far less
	frequently).
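
	Entries that look null in the sense described above (all mode bits
	off and a link count of 0) could be searched for after each build
	attempt.  The sketch below is an illustration only, not something
	from the original report; both function names are hypothetical, and
	it assumes that lstat() on such an entry succeeds and reports the
	same values that ls displayed.

```python
import os
import stat


def looks_null(st_mode, st_nlink):
    """True if all permission bits are off and the link count is 0."""
    return stat.S_IMODE(st_mode) == 0 and st_nlink == 0


def find_null_entries(root):
    """Walk root and return the sorted paths of entries that look null."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # entry could not be examined; skip it
            if looks_null(st.st_mode, st.st_nlink):
                found.append(path)
    return sorted(found)
```

	On a healthy filesystem this should return an empty list, since a
	name that is still listed in a directory normally has a link count
	of at least 1.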

	Altogether, I tried building tk80 three times, finding different
	problem directories and files each time (only on the second try
	were there non-removable directories).  The third try crashed the
	machine (and I'm miles from the machine at the moment, so can't
	get a traceback right now---I'll submit an addendum when I get one).

>How-To-Repeat:
	I imagine any sustained use of LFS would eventually lead to these
	problems, but the sequence that did it for me is described above.

>Fix:
	Unknown.
>Audit-Trail:
>Unformatted: