Lost file-system story

To: tech-kern%netbsd.org@localhost
Subject: Lost file-system story
From: Donald Allen <donaldcallen%gmail.com@localhost>
Date: Tue, 6 Dec 2011 11:10:44 -0500

I recently installed NetBSD 5.1 on an old Thinkpad T41 that I use for
experimentation. I installed it with a single, monolithic filesystem,
which I mounted async,noatime. Yes, I'm fully aware that's dangerous
and was aware of it at the time. But I have a long history of
running Linux systems with ext2 filesystems and, now, journal-less ext4
filesystems, and in all the years of running those systems, where no
particular care is taken to write file-system meta-data in ordered
fashion, I have never lost a file-system. Linux crashes are extremely
rare, my systems are either laptops or on UPSes, and I never do
something as stupid as just whacking the power-button to shut them
down. On the rare occasions when a file-system has suffered an
improper shutdown, fsck has always been able to recover with little or
no damage. (I should perhaps mention that I'm retired now, having had
a long career in software development, with a lot of OS development
experience -- IBM CP/67, Tenex, TOPS20, Unix (Mach), and a LOT of
Linux sys-admin experience; less with the BSDs, but not zero).

The T41 has built-in Aironet Wireless Communications MPI350 wireless
hardware. The GENERIC 5.1 kernel did not see this device at boot time,
so no wireless. To fix this, I stuck an Atheros-based PCMCIA card in
the machine, which did work. I was attempting to build Gnucash via
pkgsrc on the T41 and had left the machine grinding away overnight
(webkit is one of Gnucash's dependencies, and it's huge). It had
finished the build when I got up the following morning and I installed
gnucash and then did a bunch of cleaning-up in /usr/pkgsrc. I then
tried to use firefox and found that my network connection was dead.
So I did a

 /etc/rc.d/network restart

and the system froze, completely dead.

Upon restart, the automatic fsck gave up and requested a manual fsck.
I tried that, but there were just too many things broken, a
consequence, I'm sure, of running async and having this crash occur
just after having done a lot of filesystem writing. The situation was
so bad, I had to abandon this install.

There are two issues here:

1. It looks like there's a bug in the Atheros driver.
2. I'm a little bit surprised that the filesystem was as much of a
mess as it was.

I mentioned all this to old friend Christos Zoulas and he suggested
that I post this message. It is certainly true that I had done a lot
of writing to the filesystem (as a result of my pkgsrc cleanup) and
that had occurred within, say 10 minutes of the crash, maybe less. So
it wasn't hours. But it also wasn't seconds. My Linux experience, and
this is strictly gut feel -- I have no hard evidence to back this up
-- tells me that if this had happened on a Linux system with an async,
unjournaled filesystem, the filesystem would have survived. In
suggesting that I post this, Christos mentioned that he's seen
situations where a lot of writing happened in a session (e.g., a
kernel build) and then the sync at shutdown time took a long time,
which has made him somewhat suspicious that there might be a problem
with the trickle sync that the kernel is supposed to be doing.
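
One rough way to test that suspicion would be to write a batch of
files, leave the machine idle for a while, and then time an explicit
sync. The sketch below is only an illustration: the helper names
(write_files, timed_sync, probe) and the file counts and sizes are my
own assumptions, not anything from the kernel or from pkgsrc. It also
assumes that sync() does not return until the dirty buffers have been
written; that is platform-dependent, so a short time here is
suggestive rather than conclusive.

```python
import os
import tempfile
import time


def write_files(directory, count=200, size=64 * 1024):
    """Create `count` files of `size` bytes each, without fsync,
    so the data initially sits in the buffer cache."""
    payload = b"x" * size
    for i in range(count):
        path = os.path.join(directory, "f%04d" % i)
        with open(path, "wb") as fh:
            fh.write(payload)
    return count * size


def timed_sync():
    """Return the wall-clock seconds taken by a system-wide sync()."""
    start = time.monotonic()
    os.sync()
    return time.monotonic() - start


def probe(idle_seconds, directory=None):
    """Write a batch of files, idle, then time the sync.

    `directory` (hypothetical parameter) should live on the
    filesystem under test; the default is the system temp dir.
    Returns (bytes_written, sync_seconds).
    """
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        written = write_files(tmp)
        time.sleep(idle_seconds)  # give trickle sync a chance to run
        return written, timed_sync()
```

Calling probe(0), probe(30) and probe(120) in turn and comparing the
sync times should show the trend: if background flushing is working,
the sync after a long idle period ought to be much quicker than the
one issued immediately after the writes.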

So my purpose in posting this is to ask: after doing 'make clean's of
perhaps 15 or 20 packages and their dependencies, what is your
estimate of the maximum time before everything gets safely written out
of the buffer cache? (This machine has a 1.6 GHz Pentium M, 2 GB of
memory, and a 7200 rpm 60 GB PATA disk -- yes, not a normal
configuration for a T41; I stuck in the memory and disk taken from
another, dead Thinkpad I have.) Is it seconds? Tens of seconds?
Minutes? If it's small compared with the ten minutes or so between my
cleanup and the crash, then I would suggest that a kernel
wizard have a look at the trickle sync stuff. I made the point to
Christos that I'm probably one of a very small number, maybe one, who
would mount the whole world async (and please, no lectures; I knew the
risk going in; this was an experiment and I knew it could end badly; I
did not have 10 years' worth of un-backed-up financial data on this
machine :-), and it is almost certainly true that if the filesystem
had been mounted sync or softdep, it would have survived the crash. So
if there's a problem with trickle sync, it would only have
catastrophic consequences in the very rare case of someone doing what
I did (mounting async, doing a lot of writing followed by a system
crash). I'm trying to make the argument that there could be a problem
that is benign in 99.99% of the NetBSD setups, and so you haven't
heard about it.

/Don Allen

Follow-Ups:
- Re: Lost file-system story
  - From: David Holland
- Re: Lost file-system story
  - From: Thor Lancelot Simon
- Re: Lost file-system story
  - From: Donald Allen
- Re: Lost file-system story
  - From: Greg Troxel
