NetBSD-Users archive


Re: Corrupting Files



lundman <lundman%lundman.net@localhost> wrote:

> I'm just trying to get a feel for where the problem may be lying. What
> is strange is that the "mtime" of the file is not updated. Is it
> possible to write to files from userland without updating "mtime"?

Technically yes, but only a program which can write to the raw
disk, and such precise corruption is unlikely in that case.
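
To catch the symptom described (content changed, "mtime" unchanged), one
option is to snapshot a file's metadata and a digest, then compare snapshots
later. This is only a rough sketch: the names snapshot() and
changed_silently() are made up for illustration, and it assumes the files
are small enough to hash periodically.

```python
import hashlib
import os


def snapshot(path, chunk_size=1 << 20):
    """Return (size, mtime_ns, sha256 hex digest) for the file at path."""
    st = os.stat(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return (st.st_size, st.st_mtime_ns, h.hexdigest())


def changed_silently(old, new):
    """True if the content digest changed while size and mtime did not."""
    old_size, old_mtime, old_digest = old
    new_size, new_mtime, new_digest = new
    return (old_digest != new_digest
            and old_mtime == new_mtime
            and old_size == new_size)
```

Taking a snapshot right after the file is written and again some hours
later would show whether the change happens without any timestamp update.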

> If it is corrupting on disk (cache?) after it has been written, can I
> rule out userland problems? Should I look at memory stick, cgd
> implementation, perhaps even kernel vs userland mismatch?

I presume the file is copied from a memory stick?  If it's copied
to disk such that you can make a good copy of it from disk to disk,
then the memory stick isn't relevant.  (If your second copy is
also from the memory stick ... then no bets.)
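
The disk-to-disk test above could be scripted along these lines. It is a
minimal sketch, not a definitive check: copy_and_verify() is a hypothetical
helper, and a read immediately after a write is probably served from the
buffer cache, so a match here says little about what is on the platters.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then check that both still match the original digest."""
    before = sha256_of(src)
    shutil.copyfile(src, dst)
    # Re-hash both files: the source is checked again in case it changed
    # underneath us while the copy was running.
    return before == sha256_of(dst) == sha256_of(src)
```

Re-running sha256_of() on the copy after a remount or reboot would be a
stronger test than the immediate comparison.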

> What is also strange is the differences are nearly always (at least on
> the files I have checked) at offset xxxx0F0-xxx0100. 8 bytes changed,
> 4 unchanged, 4 bytes changed.

Useful to note and remember, but means nothing to me right now.
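
To collect that offset pattern systematically, something like the following
could list the differing byte runs between a good and a bad copy and tally
where they start within each 256-byte block. It is a sketch only: the
function names are invented here, and compare_files() reads both files fully
into memory, which assumes they are reasonably small.

```python
from collections import Counter


def diff_runs(a: bytes, b: bytes):
    """Return half-open (start, end) ranges where a and b differ."""
    runs = []
    start = None
    n = min(len(a), len(b))  # any trailing length difference is ignored
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n))
    return runs


def offset_histogram(runs, modulus=0x100):
    """Count how often a differing run starts at each offset mod modulus."""
    return Counter(start % modulus for start, _ in runs)


def compare_files(good_path, bad_path):
    """Read both files and return the differing runs between them."""
    with open(good_path, "rb") as f:
        good = f.read()
    with open(bad_path, "rb") as f:
        bad = f.read()
    return diff_runs(good, bad)
```

If the histogram clusters at 0xF0 and 0xFC across many files, that alignment
is worth including in any bug report.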

> The server (userland) has performed without issues for 8 years, but
> hardware was replaced about 2 months ago.

Well ... that makes the hardware more suspect than the software,
IMHO, except of course that the new hardware could have exposed a latent
software problem. :-(

Still, I'd bet on hardware, particularly for corruption of only
four bytes. (DMA to the wrong address will hit more than four bytes,
in my experience, whether it's disk I/O or network I/O landing in
the wrong place.  Panics would also be likely.)

> I'm currently trying to determine if it happens only on certain
> disks/partitions, and/or attempt to ktrace it.

I'd definitely try different disk(s) and/or controllers; I would
not expect ktrace to show the corruption.  If I had to instrument
anything, I'd be instrumenting the kernel, but whether it's a cgd,
an ffs, or a driver problem (if it *is* software ...) is unknown.

Can you try without cgd?  (Were you using cgd before?)
What file system options do you have in use?  (Softdeps?)
What did you change (hardware and software) two months ago?

Sorry not to be much direct help -- data corruption problems
are among the hardest to solve, due to the numerous possibilities
and difficulty of reproduction.

Chase the hardware first, I suggest.  If you can reproduce the
problem on multiple systems then it'll be a lot more likely that
it's software.  If the problem stays with one set of hardware,
then that hardware is probably guilty. :-)

On an enterprise-class system I'd be looking at all the firmware
revisions (motherboard, I/O controllers, disks), but on PC hardware
I don't know how practical that is.

Giles

