Subject: corrupted files on NFS server
To: None <tech-kern@netbsd.org>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: tech-kern
Date: 09/05/2003 13:22:05
Hi,
I've got 2 files corrupted in a strange way on my NFS server, running
NetBSD/alpha 1.6.1_RC1, sources from Aug 6.
These files are from a CVS repository, accessed only over NFS.
They have not been changed for months.
Sometime last month (I guess at about the same time), part of each file
was overwritten by a PNG image generated by MRTG. These MRTG files are
on a different disk, which is not exported over NFS.
The image is small (less than 2k), and appears complete in the CVS file.
For one it was at the beginning of the CVS file, for the second at offset 8192.
8192 is the alpha page size, and also the rsize/wsize used by the Linux NFS
clients.
In both cases, the corrupted part is exactly the size of the image (i.e. it's
not a whole page which is corrupted).
Note that this is very hard to reproduce: this box is the NFS server for more
than 100 clients, it has 3 partitions with about 150G of data, for a total
of more than 1500000 files. It can get more than 1000 NFS reqs/s at peak
(as shown by MRTG, so it's 1000 NFS reqs/s on a 5-minute average). I got one
file corrupted in the same way (except that the data inserted was from another
NFS file, so it could have been a bug on a client) earlier this year, so
in total I only got 3 corrupted files in a year.
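
A minimal sketch of how the rest of the repository could be checked for the
same kind of damage. It assumes the inserted image is a complete PNG stream
(as it appears to be in the two files above) and that repository files carry
the usual ",v" RCS suffix. The function names are hypothetical, and each file
is read fully into memory, which is only reasonable for a one-off check.

```python
import os

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # fixed 8-byte PNG file header
PNG_TRAILER = b"IEND\xaeB`\x82"        # IEND chunk type followed by its fixed CRC


def find_embedded_pngs(data):
    """Return (offset, length) for every PNG signature found in data.

    length is None when a signature is present but no IEND trailer follows it.
    """
    hits = []
    pos = data.find(PNG_SIGNATURE)
    while pos != -1:
        end = data.find(PNG_TRAILER, pos + len(PNG_SIGNATURE))
        length = None if end == -1 else end + len(PNG_TRAILER) - pos
        hits.append((pos, length))
        pos = data.find(PNG_SIGNATURE, pos + 1)
    return hits


def scan_tree(root):
    """Yield (path, offset, length) for each PNG found in ,v files under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(",v"):   # assumption: only RCS files matter
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            for offset, length in find_embedded_pngs(data):
                yield path, offset, length
```

Iterating over scan_tree() on the repository root and printing each result
would list every suspect file with the offset and size of the inserted image,
which makes it easy to see whether the offsets are always multiples of 8192.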

The fact that the corrupted area isn't a whole page, but exactly the size of
the image, makes me think we can exclude a bug in the NFS client (like a commit
for an area it didn't write).
To me it looks more like a locking bug somewhere, which occasionally causes the
same page to be mapped for 2 different files.

Does this ring a bell for anyone?

--
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
     NetBSD: 24 years of experience will always make the difference
--