Subject: Re: FFS reliability problems
To: NetBSD Kernel Technical Discussion List <tech-kern@netbsd.org>
From: Robert Elz <kre@munnari.OZ.AU>
List: tech-kern
Date: 06/10/2002 19:51:00
    Date:        Fri,  7 Jun 2002 13:19:22 -0400 (EDT)
    From:        woods@weird.com (Greg A. Woods)
    Message-ID:  <20020607171922.138D1AC@proven.weird.com>

  | The application is assuming the system will continue running smoothly
  | until it does what it does with the data and closes the file itself
  | (perhaps by exiting, cleanly or otherwise).

Rubbish.   The whole point of the creat()/unlink() maneuver is to handle
the case where things don't continue running smoothly in a semi-reasonable
way.   If the application could safely assume that it will exit properly,
it could simply unlink its temp files on the way out; that's so boringly
trivial to do I won't bother telling you how...

  | Applications do this in
  | order to implement a trivial garbage collection algorithm

Yes, to handle the abnormal termination case.

  | -- but that
  | doesn't mean the data they write is garbage right from the start.

No, and that's not what I said - what I said was "useless after the
application has vanished" (or some words like that).   If the application
is unlinking temp files that would still be useful after the application
has died in some abnormal way, including system crashes, then the
application is broken.

  | That data is recoverable.

Not always.   Consider a temp file that has pieces of some random
data, in binary, with the index that puts it all back together
left in memory (if you like, think of an ed temp file, rather than
a vi temp file, when the -x flag has been used).

If the data is recoverable, the application shouldn't be unlinking it,
because ...

  | Fsck has no business deleting it -- none whatsoever.

Of course it does.   Consider the other case, probably the more common
case where it is the application that aborts for some reason.   The same
unlinked temp file was open when the app died (SEGV'd or whatever), are
you now going to claim that the kernel has no business deleting it,
"none whatsoever" ???   The situation is just the same: the kernel knows
that the application aborted, just as fsck can infer that the system crashed.
The kernel knows even better than fsck does that the file in question
was one which was open, but had been unlinked.

So, should the kernel be taking such files and linking them into
lost+found instead of deleting them?   By your argument that's the only
possible conclusion, I think, yet it would be absurd.

  | Even the most juniour sysadmin can trivially clean it up
  | after the crash, but only if given the chance.

Huh?   Aside from the "I can rm /filesys/lost+found/*" trivial
solution (which is no different than having fsck do it in the first
place, except it also destroys files fsck wouldn't have removed)
how is your junior sysadmin supposed to figure out what files are
worth preserving, and which aren't?

The file in question in this case was (apparently) the temp file of
some graphics editor - which is just going to look like binary rubbish
to the vast majority of junior sysadmins, and the ideal candidate for
immediate removal.

Working out what in lost+found should be kept, and where it should be
put, takes time and experience (or blind luck in some cases).

  | No, sorry, but that's flat out wrong.  That might be what you'd like
  | application developer to do, but that's not what happens in the real
  | world.

Then any application that does this (when the file would be worth
recovering) should be fixed; otherwise you certainly lose when the
application dies (kill -9 aimed at the wrong pid by accident, or
whatever...)

  | Many _many_ applications create and then _immediately_ unlink
  | temporary files that they will later use to shuffle data around.  They
  | do so to make cleanup easy, not to say "the data I write here is trash".

Of course, but almost none of those applications realistically expect
the data in those temp files to be of any use if the app dies or the
system crashes - it is only useful while the app continues running.

  | <sarcasm weight=super-heavy>
  | If the application were doing what you claim it to be doing then it
  | might as well just open /dev/null and write its temporary data there
  | instead.  Get real.
  | </sarcasm>

For sarcasm, that's so weak as to not be worthy of comment.   All it
indicates is that you have zero idea what is really being discussed.

  | Perhaps, but it is a very real-world argument too.  I've heard people
  | say they've done this (successfully, I might add) more times than I can
  | count.  I've even seen people do it right in front of me.  It's pretty
  | damn hard to argue with someone who's just made what would otherwise
  | have been the mistake of a lifetime -- if they can recover their work
  | then all the power to them!

The "crash the system" technique is just fine (if a bit of a fluke when
it actually works).  What is absurd is using fsck as the data recovery
tool.   That isn't what it is for, nor what it does.

kre

ps: to Greywolf and der Mouse - even if it were my place, and it isn't,
I wouldn't really object to adding a -z flag to fsck to make it act
dumbly - after all, it is just more code bloat, and I know the two of
you are never bothered by any of that...    Making it the default I
certainly wouldn't like, but if it isn't the default, it also isn't
really very useful, as no-one will think of turning it on until just
after the one crash where it might have actually saved something really
worth saving.