NetBSD-Users archive


Re: Corrupting Files



Jorgen Lundman <lundman%lundman.net@localhost> wrote:

> It is rather annoying that it happens after CRC checks, so we
> generally do not discover it immediately.

:-(

> It tests the uploaded file good, md5sums are equal. I even tried
> setting O_SYNC, and mode 0400 during, and after, upload, as a
> test. The file is correct on disk for several minutes. Even after
> "sync". Then suddenly it gets changed.

Smells like hardware to me.  What hardware, I don't know.
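
To pin down how long after the upload the change shows up, one option is to re-hash the file at intervals and note when the digest first differs. The following is only a sketch: the function names, the 30-second default interval and the `max_checks` parameter are all assumptions for illustration, not anything from the original setup. Note that it reads through the buffer cache, so it reports when the cached view of the file changes, which is not necessarily what is on the platters.

```python
import hashlib
import time


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def watch(path, interval=30.0, max_checks=None):
    """Re-hash `path` every `interval` seconds.

    Returns (elapsed_seconds, old_digest, new_digest) at the first change,
    or None if `max_checks` re-hashes pass without one.
    """
    baseline = file_md5(path)
    started = time.time()
    checks = 0
    while max_checks is None or checks < max_checks:
        time.sleep(interval)
        checks += 1
        current = file_md5(path)
        if current != baseline:
            return (time.time() - started, baseline, current)
    return None
```

Calling `watch("/path/to/upload")` right after the upload finishes would block until the digest changes; the elapsed time can then be compared against anything else that runs on the machine at that moment.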

> If it is a disk issue, 8-skip-4-then-4 byte change would mean 2
> consecutive calls to write() and seek().

If it's a disk issue, what you're writing to disk is not what you're
reading back.  Bad memory in the disk cache, bad controller, bad main
memory (although I would expect that to cause application core dumps and
kernel panics as well), or something I haven't thought of.
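
To see whether the damage follows a pattern (such as the 8-skip-4-then-4 change described above) and where it sits relative to sector boundaries, it can help to list every run of differing bytes between a known-good copy and the corrupted file. This is a minimal sketch, assuming both copies are available and small enough to read into memory; the function names and the 512-byte sector size in the output are assumptions for illustration.

```python
import sys


def diff_runs(good: bytes, bad: bytes):
    """Return (offset, length) for each run of consecutive differing bytes.

    Only the common prefix length of the two inputs is compared.
    """
    runs = []
    start = None
    n = min(len(good), len(bad))
    for i in range(n):
        if good[i] != bad[i]:
            if start is None:
                start = i  # a new run of differences begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, n - start))  # run extends to the end
    return runs


def compare_files(good_path, bad_path):
    """Read both files fully and return their differing byte runs."""
    with open(good_path, "rb") as g, open(bad_path, "rb") as b:
        return diff_runs(g.read(), b.read())


if __name__ == "__main__" and len(sys.argv) == 3:
    for offset, length in compare_files(sys.argv[1], sys.argv[2]):
        # 512 is an assumed sector size, used only to show alignment.
        print(f"offset {offset:#x} (mod 512 = {offset % 512}): "
              f"{length} bytes differ")
```

If the runs always land at the same offset within a sector or page, that points more towards a particular piece of hardware than towards the application.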

> (So, if I were to umount/mount, the file would be good again.. maybe I
> will try that too).

Yeah, worth a try.  Might even be worth reading the file off the disk
bypassing the buffer cache, if you have a program that will do that.
(There used to be a program called "icat", a long time ago, and I know I
wrote something similar once upon a time.  I could perhaps look, but not
tonight.)  The gist is to find the inode of the file you care about (ls
-i) and then open the raw device and read the file using that inode
number.

> It has always been cgd. Mounted as "noatime, soft, local". Worth
> trying without soft-dependencies?

I wouldn't rush to change the software now, if it's been stable for a
long time.  I'd try alternate hardware first, unless that is really hard
...

> Dual AMD Opteron 246, 2u rack server. But it is on the other side of
> the globe to me, or I would have replaced the memory as the first
> test.

... which it sounds like it may be, unless you've an excellent travel
budget, or have or can get some remote hands to do what you tell them!

> Appreciate the email, got a few more ideas to try.

I'd swap the disk(s) first, but that's just a guess.  Data corruption
problems like this one I've always solved by a process of elimination:
change stuff until the problem goes away.  Not an attractive solution
method, I grant, but the only other two I've found are recognising a
known problem (seems unlikely here) or inspiration. :-/

Good luck!

Giles

