Re: Corrupting Files
Jorgen Lundman <lundman%lundman.net@localhost> wrote:
> It is rather annoying that it happens after CRC checks, so we
> generally do not discover it immediately.
> It tests the uploaded file good, md5sums are equal. I even tried
> setting O_SYNC, and mode 0400 during, and after, upload, as a
> test. The file is correct on disk for several minutes. Even after
> "sync". Then suddenly it gets changed.
Smells like hardware to me. What hardware, I don't know.
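To pin down when the file changes, one option is to re-hash it at intervals and note how long it takes for the digest to stop matching. The following is only a rough sketch: the function names, the polling interval and the use of MD5 (chosen because md5sums are what is being compared above) are assumptions for illustration. It reads through the normal file interface, so it goes through the buffer cache and cannot say by itself whether the on-disk blocks have changed.

```python
import hashlib
import time


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def watch(path, expected, interval=30.0, max_checks=None):
    """Poll the file until its digest differs from `expected`.

    Returns the number of seconds elapsed when the mismatch was seen,
    or None if `max_checks` polls all matched.
    """
    start = time.time()
    checks = 0
    while max_checks is None or checks < max_checks:
        if md5_of(path) != expected:
            return time.time() - start
        checks += 1
        time.sleep(interval)
    return None
```

Called as `watch(path, md5_of(path))` right after an upload has tested good, it returns once the contents no longer match, which at least gives a timestamp to line up against other activity on the machine.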
> If it is a disk issue, 8-skip-4-then-4 byte change would mean 2
> consecutive calls to write() and seek().
If it's a disk issue, what you're writing to disk is not what you're
reading back. That could be bad memory in the disk cache, a bad
controller, bad main memory (although I would expect that to cause
application core dumps and kernel panics as well), or something I
haven't thought of.
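If a known-good copy of a corrupted file is still around, it can help to list exactly which byte ranges differ, since the offsets and lengths of the damaged runs may hint at where the corruption comes from. A minimal sketch, assuming both copies fit in memory; `diff_runs` is a hypothetical helper name, not an existing tool.

```python
def diff_runs(good, bad):
    """Return [(offset, length), ...] for each run of consecutive differing bytes.

    If the inputs differ in length, the extra tail is reported as a
    separate final entry.
    """
    runs = []
    start = None
    n = min(len(good), len(bad))
    for i in range(n):
        if good[i] != bad[i]:
            if start is None:
                start = i  # a new differing run begins here
        elif start is not None:
            runs.append((start, i - start))  # the run ended at i - 1
            start = None
    if start is not None:
        runs.append((start, n - start))  # run extends to the common end
    if len(good) != len(bad):
        runs.append((n, abs(len(good) - len(bad))))
    return runs
```

With two files, read each one with `open(path, "rb").read()` and pass the results in. Printing the offsets in hex makes it easier to see whether the damage lines up with sector or page boundaries.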
> (So, if I were to umount/mount, the file would be good again.. maybe I
> will try that too).
Yeah, worth a try. Might even be worth reading the file off the disk,
bypassing the buffer cache, if you have a program that will do that.
(There used to be a program called "icat", a long time ago; I know I
wrote something similar too, once upon a time. I could perhaps look,
but not tonight. The gist is to find the inode of the file you care
about (ls -i) and then open the raw device and read the file using
that inode.)
> It has always been cgd. Mounted as "noatime, soft, local". Worth
> trying without soft-dependencies?
I wouldn't rush to change the software now, if it's been stable for a
long time. I'd try alternate hardware first, unless that is really hard...
> Dual AMD Opteron 246, 2u rack server. But it is on the other side of
> the globe to me, or I would have replaced the memory as the first
... which it sounds like it may be, unless you've an excellent travel
budget, or have or can get some remote hands to do what you tell them!
> Appreciate the email, got a few more ideas to try.
I'd swap the disk(s) first, but that's just a guess. Data corruption
problems like this one I've always solved by a process of elimination:
change stuff until the problem goes away. Not an attractive solution
method, I grant, but the only other two I've found are recognising a
known problem (seems unlikely here) or inspiration. :-/