Subject: kern/31926: NFS client intermittent data corruption and EINVAL on read
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <jld@panix.com>
List: netbsd-bugs
Date: 10/27/2005 03:35:00
>Number:         31926
>Category:       kern
>Synopsis:       NFS client intermittent data corruption and EINVAL on read
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Oct 27 03:35:00 +0000 2005
>Originator:     Jed Davis
>Release:        NetBSD 2.0.2
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD byzantium.nyc.access.net 2.0.2 NetBSD 2.0.2 (PANIX-STAFF) #0: Thu May 5 21:13:35 EDT 2005 root@juggler.panix.com:/devel/netbsd/2.0.2/src/sys/arch/i386/compile/PANIX-STAFF i386
Architecture: i386
Machine: i386
>Description:

Relatively rarely, under certain conditions (see below), read(2) calls
on a file over NFS return EINVAL for no apparent reason, or return a
block of zero bytes after the file data (possibly "reading" past the end
of the file?).

This has been reproduced with 2.0, 3.0_BETA, and -CURRENT on the client
host.

>How-To-Repeat:

We were able to reliably reproduce this by running "tail -f" on a log
file in a filesystem mounted read-only from a NetBSD 2.0 NFS server;
the file was being written to on the server only (by syslog-ng).  The
tail will generally exit within a few minutes when it gets an EINVAL,
and a ktrace may -- but will not always -- show that blocks of NULs not
present in the actual file were read.

Previously, similar problems had been observed occasionally with a
NetBSD 2.0 client and a NetApp server, but could not be reproduced on
demand; hence our suspicion of the NFS client side.

The log in question takes up somewhere between 2 and 3 GB daily, to
give an idea of the average append rate, but the problem can still be
reproduced shortly after log rotation when the file is not large.

The client and server are using the default parameters, apart from the
read-only flag on the mount; the NetApp case, however, was a read-write
mount.  No nullfs instances are involved.

We have a tcpdump trace of the NFS traffic, as well as several ktraces;
they contain mail log data and so are not public, but I can send them on
request.

>Fix:

None yet known.  Suggestions as to where to start looking in the source
code are welcome.