NetBSD-Users archive


Re: Corrupting Files




I presume the file is copied from a memory stick?  If it's copied
to disk such that you can make a good copy of it from disk to disk,
then the memory stick isn't relevant.  (If your second copy is
also from the memory stick ... then no bets.)

I should have been more precise. These are files uploaded by users via FTP. I make a copy "right next to" the original file, on the same filesystem and everything. I was quite lucky that it did corrupt (at least one of) the files I was testing this on, and in fact, so far, it keeps happening when a particular file is deleted and re-uploaded. (Trying again for good measure.)

It is rather annoying that it happens after CRC checks, so we generally do not discover it immediately.

The uploaded file tests good, and the md5sums are equal. As a test, I even tried setting O_SYNC and mode 0400, both during and after the upload. The file is correct on disk for several minutes, even after "sync". Then suddenly it gets changed.
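To pin down how long after the upload the change happens, something like the following could be left running against the fresh copy. This is only a rough sketch: the names file_md5 and watch, the 30 second interval, and the injectable sleep argument are made up for illustration. It also reads through the normal buffer cache, so it reports what the kernel hands back, which is not necessarily what is on the platters.

```python
import hashlib
import time


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def watch(path, interval=30.0, max_checks=None, sleep=time.sleep):
    """Re-hash path every `interval` seconds until its digest changes.

    Returns (check_number, baseline_digest, new_digest) on the first
    mismatch, or None if max_checks passes without any change.
    The sleep argument is only there so the loop can be tested quickly.
    """
    baseline = file_md5(path)
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(interval)
        checks += 1
        current = file_md5(path)
        if current != baseline:
            return checks, baseline, current
    return None
```

Multiplying the returned check number by the interval gives a rough time-to-corruption, which could then be compared against whatever else the machine was doing at that moment.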



Still, I'd bet on hardware, particularly for corruption of only
four bytes. (DMA to the wrong address will hit more than four bytes,
in my experience, whether it's disk I/O or network I/O landing in
the wrong place.  Panics would also be likely.)

If it is a disk issue, an 8-skip-4-then-4 byte change would mean two consecutive calls to write() and seek(). I then thought it is perhaps more likely that the in-memory cache is being corrupted, but why would it be flushed out so much later? Or is it possible that the file "on disk" is good, but the in-core read cache of the file is bad? (So, if I were to umount/mount, the file would be good again. Maybe I will try that too.)
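For comparing the good copy against the corrupted one, a byte-level diff that prints offsets and run lengths makes the pattern easier to see than md5sum alone. A minimal sketch, assuming two files of the same length; the function name changed_ranges is hypothetical, and if the lengths differ only the common prefix is compared.

```python
def changed_ranges(path_a, path_b, chunk_size=1 << 16):
    """Return a list of (offset, length) runs where the two files differ.

    Only the common prefix is compared if the files differ in length.
    """
    ranges = []
    start = None  # offset where the current run of differing bytes began
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            n = min(len(a), len(b))
            if n == 0:
                break
            # Skip the byte loop when the chunks match and no run is open.
            if start is not None or a[:n] != b[:n]:
                for i in range(n):
                    if a[i] != b[i]:
                        if start is None:
                            start = offset + i
                    elif start is not None:
                        ranges.append((start, offset + i - start))
                        start = None
            offset += n
    if start is not None:
        ranges.append((start, offset - start))
    return ranges
```

Running this before and after an umount/mount cycle would also show whether the bad bytes live only in the cached copy or have actually reached the disk.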



Can you try without cgd?  (Were you using cgd before?)
What file system options do you have in use?  (Softdeps?)
What did you change (hardware and software) two months ago?

It has always been on cgd. The filesystem is mounted as "noatime, soft, local". Is it worth trying without soft-dependencies?



Chase the hardware first, I suggest.  If you can reproduce the
problem on multiple systems then it'll be a lot more likely that
it's software.  If the problem stays with one set of hardware,
then that hardware is probably guilty. :-)

It is a dual AMD Opteron 246, 2U rack server. But it is on the other side of the globe from me, or I would have replaced the memory as the first test.

I appreciate the email; it has given me a few more ideas to try.

Lund



--
Jorgen Lundman       | <lundman%lundman.net@localhost>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)

