file corruption with 6.1.4

To: netbsd-users <netbsd-users%netbsd.org@localhost>
Subject: file corruption with 6.1.4
From: Dave Vitek <dvitek%grammatech.com@localhost>
Date: Tue, 16 Feb 2016 01:29:31 -0500

Hi all,

We have an amd64 NetBSD 6.1.4 (stable) machine that we use as a build server and also for testing. We're having an intermittent problem where occasionally a 4096-byte, 4096-byte-aligned chunk of an archive (.a) file gets overwritten with human-readable text that we recognize as stdout output from another process that is in no way related to building archive files.

This other process runs (potentially concurrently) as a different user in a far away directory on the same file system. It can produce quite a bit of output. The text is redirected to a file and/or sent over a socket. Either way, there's no way the middle of that log gets written to the middle of this unrelated archive file.

On Feb 10 between 14:41 and 19:00, the .a file was created by ar and ranlib. The linker used this file basically immediately, with success at the time. The linker would certainly have choked had the corruption been present. A copy of the archive was then made on the same file system for later use. There's no way the copy process should have had access to the text later observed in the file.

Later the same day, between 19:00 and 23:12, the copy of the archive file was read. By this time, it contained the damaged page and the linker complained.

I haven't yet determined when the process that logged the human-readable text ran. I may never know.

I have both the original undamaged .o file and the damaged .a file. I used "ar x" to extract the bad object file and did a binary diff of the entire file to find the messed up 4096-byte chunk. The rest of the file is unchanged. The entire archive is about 10 MB.
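For anyone wanting to repeat that kind of comparison, here is a minimal sketch in Python of a page-by-page binary diff. It is not the tool actually used above; the function name `differing_pages` is made up for illustration, and the 4096-byte page size is simply taken from the size of the damaged chunk.

```python
PAGE_SIZE = 4096  # matches the size of the damaged chunk described above


def differing_pages(path_a, path_b, page_size=PAGE_SIZE):
    """Return the indices of the page-sized chunks in which two files differ.

    Hypothetical helper: compares the files chunk by chunk and also
    reports trailing chunks if one file is longer than the other.
    """
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(page_size)
            b = fb.read(page_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

Multiplying a returned index by the page size gives the byte offset of the damaged chunk within the file.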

/var/log/messages shows a handful of these messages overlapping the time period in question:

/netbsd: file: table is full - increase kern.maxfiles or MAXFILES

At the risk of speculating: Are there any known issues with horrible things happening in the kernel when there is file descriptor pressure?

We've also seen software-layer I/O checksum errors intermittently, with the same sort of text overwriting chunks of files. Now that we've also seen it in these .a files, I'm leaning towards blaming the OS.

We're pretty sure the hardware isn't to blame: This machine was originally a virtual machine running the same version of NetBSD, and it had the same problem. Other guests had no problems. It's now a physical machine on completely different hardware and still has the problem. I don't know of a lot of hardware problems that would consistently manifest in this fashion anyway.


There's only one disk on the system:
/dev/sd0a at /

It likely always has 500 GB+ of free space. The machine has 16 logical cores and 24 GB of RAM. It's a busy system doing lots of I/O all the time.

There are also a few NFS mounts, but they aren't used much and shouldn't be involved with the data in question.

We have roughly the same setup on many other platforms (Linux, Mac, Solaris, FreeBSD, Windows), none of which have this problem.

I am not yet able to artificially cause the problem to manifest. I am thankful that the file corruption occurs in large enough chunks that the consequences are unlikely to be subtle.

I could start storing checksums alongside the archive files. Let's assume I do that and I discover that the file was OK when it was written, but the checksum no longer matches a couple of hours later. What next?
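A minimal sketch of what such checksumming could look like, in Python. Everything here is an assumption for illustration: the sidecar suffix `.pagesums`, the function names, and the choice of one SHA-256 digest per 4096-byte page (so that a later mismatch points at the damaged page rather than just at the file).

```python
import hashlib
import json

PAGE_SIZE = 4096
SIDECAR_SUFFIX = ".pagesums"  # hypothetical naming convention


def page_digests(path, page_size=PAGE_SIZE):
    """Return one SHA-256 hex digest per page-sized chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(page_size)
            if not chunk:
                break
            digests.append(hashlib.sha256(chunk).hexdigest())
    return digests


def write_sidecar(path):
    """Record the per-page digests next to the file, right after it is built."""
    with open(path + SIDECAR_SUFFIX, "w") as f:
        json.dump(page_digests(path), f)


def changed_pages(path):
    """Return the indices of pages whose digest no longer matches the sidecar."""
    with open(path + SIDECAR_SUFFIX) as f:
        recorded = json.load(f)
    current = page_digests(path)
    n = max(len(recorded), len(current))
    return [
        i for i in range(n)
        if i >= len(recorded) or i >= len(current) or recorded[i] != current[i]
    ]
```

Running `changed_pages` periodically would at least narrow down when a page went bad. Note that the sidecar lives on the same file system, so it could in principle be damaged the same way.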

I couldn't find any PRs that looked like this issue, but who knows if my search was any good. Does this sound familiar to anyone?


I could imagine trying several things at this point:
 - Turning on assertions in the kernel
 - Running in single CPU mode to see if it helps
 - Switching file systems
 - Trying different versions of NetBSD (6.1.5?)

Suggestions? I suspect I need something that maintains binary compatibility with the 6 series.


- Dave

