tech-kern archive


Re: FFS corruption



> Everything was fine until I scp it over the network on a new machine.

> After this migration, the filesystem started corrupting the same file
> /usr/pkg/etc/httpd/httpd.conf at the same time, which happens to be
> /etc/daily end of execution.

Which filesystem?  The original or the copy?

> httpd.conf metadata did not change, its content was just [filled]
> with some fixed length binary records (sample included below, in case
> it rings a bell to someone).  Setting immutable flags did not prevent
> the corruption; And using ktrace on /etc/daily showed it did not
> touch httpd.conf nor even its parent directory.

> And fsck did not [find] anything wrong.  Is there anything ringing a
> bell to someone here?  Any explanation?

Offhand, this sounds like one of two things:

(1) The same piece of disk is being used by two filesystems at once,
and that just happens to be the place where both filesystems actually
_use_ overlapping pieces of disk (if a filesystem is mostly empty, most
of the space it's nominally using can be scribbled on without
corrupting the filesystem; two mostly empty filesystems nominally using
overlapping areas of disk might end up almost never both actually
depending on the same sectors).

(2) Somewhere in the data path for disk writes, the high bits of the
disk block numbers are getting lost, thereby directing writes to two
nominally different pieces of disk to the same sectors.  This could be
a software bug or a hardware issue (which could be a hardware bug, a
software bug, or a case of incompatibility).  As a simple example that
probably is not what's going on in your case, a SCSI driver that
doesn't know how to use 10-byte CDBs is stuck with the 21-bit block
addresses of 6-byte CDBs, and can end up redirecting sectors above the
1G point (2^21 sectors of 512 bytes) back onto the same sectors as
others that are below the 1G point.
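
To make the aliasing in (2) concrete, here is a minimal Python sketch of
what losing the high bits does to a block number.  It only models the
arithmetic; the function name, the 512-byte sector size, and the sample
block numbers are assumptions of mine, not anything taken from a real
driver.

```python
SECTOR_SIZE = 512        # bytes per sector (assumed)
LBA_BITS_6BYTE_CDB = 21  # block address width in a 6-byte READ/WRITE CDB


def truncate_lba(lba, bits=LBA_BITS_6BYTE_CDB):
    """Keep only the low `bits` bits of a block number (hypothetical helper)."""
    return lba & ((1 << bits) - 1)


# Byte offset at which block numbers start wrapping around.
wrap_bytes = (1 << LBA_BITS_6BYTE_CDB) * SECTOR_SIZE

low = 12345                             # a sector below the wrap point
high = low + (1 << LBA_BITS_6BYTE_CDB)  # a different sector above it

print(wrap_bytes)          # the wrap point in bytes
print(truncate_lba(low))   # unchanged
print(truncate_lba(high))  # lands on the same sector as `low`
```

Any two block numbers that differ only in the dropped bits end up on the
same sector, so a write meant for one silently overwrites the other.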

You mentioned that at least one of these machines was a Xen instance.
I don't know enough about Xen to do more than guess here, but it does
mean that there's at least one more layer of mapping between OS sector
numbers and hardware sector numbers, and thus at least one more layer
where two supposedly different pieces of disk could get mapped to the
same real sectors.  Those additional layers are also additional places
where the sort of botch outlined in (2) could strike.
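
For case (1), and for any extra mapping layer of the kind just
described, the check amounts to looking for extents that intersect.  A
minimal sketch in Python, assuming you have copied the start sector and
size of each partition (or virtual disk extent) out by hand; the
function name and the layout below are made up for illustration:

```python
from itertools import combinations


def find_overlaps(partitions):
    """Return pairs of names whose sector ranges intersect.

    partitions: iterable of (name, start_sector, sector_count) tuples.
    """
    overlaps = []
    for (n1, s1, c1), (n2, s2, c2) in combinations(partitions, 2):
        # Half-open ranges [start, start + count) intersect exactly when
        # each one starts before the other one ends.
        if s1 < s2 + c2 and s2 < s1 + c1:
            overlaps.append((n1, n2))
    return overlaps


# Hypothetical layout, in sectors.
layout = [
    ("a", 63, 1000000),
    ("b", 1000063, 500000),
    ("e", 1400000, 800000),  # starts inside "b"
]

print(find_overlaps(layout))
```

Leave out any entry that is meant to cover the whole disk, since it
overlaps everything by design.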

I realize this isn't very helpful, but it's about all that comes to
mind that explains your observations.  In particular, the metadata not
changing, the immutable flag making no difference, ktrace showing no
accesses - those all, to me, point to something corrupting the disk
behind the OS's back.  It could be either of the above, or perhaps even
broken disk firmware, though that strikes me as unlikely compared to
the above.  fsck noticing nothing wrong probably just means that the
only thing that got hit was data blocks.  Hit a metadata block (inode
table, superblock, etc) instead and fsck should get upset, but if all
you're damaging is data blocks, fsck shouldn't care.
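
Since fsck won't point at damaged data blocks, one way to see where the
damage sits is to compare the corrupted file against a known-good copy,
block by block.  A rough Python sketch; the helper name and the
512-byte block size are assumptions, and the sample data is synthetic:

```python
def damaged_blocks(good, bad, block_size=512):
    """Return indices of block_size-byte blocks that differ.

    good, bad: bytes objects holding the known-good and damaged contents.
    """
    length = max(len(good), len(bad))
    return [
        offset // block_size
        for offset in range(0, length, block_size)
        if good[offset:offset + block_size] != bad[offset:offset + block_size]
    ]


# Synthetic example: the second 512-byte block has been overwritten.
good = b"A" * 2048
bad = good[:512] + b"\x00" * 512 + good[1024:]

print(damaged_blocks(good, bad))
```

If the damage starts and ends on block boundaries, that would at least
be consistent with whole-sector writes landing in the wrong place.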

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse%rodents-montreal.org@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

