Subject: NFS/RPC and server clusters
To: None <tech-net@netbsd.org>
From: Matthias Drochner <M.Drochner@fz-juelich.de>
List: tech-net
Date: 10/15/2003 16:28:14

Hi -
As things look now, NetBSD NFS clients don't work well with
certain HA server clusters, at least the DEC "ASE" thing
(2 Alpha servers with a shared SCSI bus).
I'll explain how it fails in my setup:

The cluster consists of 2 machines, each of which has a node-specific
IP address, "node1" and "node2". A service is tied to another
IP address, "service1". The service can be located on either of the
nodes; the corresponding machine has "service1" as an IP address
alias as long as it provides that service. If a machine fails,
the other one takes over the service, the "service1" IP address,
and the filesystems belonging to it on the shared SCSI bus.
The clients mount the filesystems from the "service1" address;
they don't need to know the "node?" addresses. NFS file handles
are preserved on a service relocation, so in theory everything
should be completely transparent.
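
Just to illustrate the takeover step (this is not what DEC's software
literally does, and the interface name and addresses are made up):
on a BSD system, the node that currently provides the service could
configure "service1" as an interface alias roughly like this:

#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct ifaliasreq ifra;
	struct sockaddr_in *sin;
	int s;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		err(1, "socket");

	memset(&ifra, 0, sizeof(ifra));
	strncpy(ifra.ifra_name, "le0", sizeof(ifra.ifra_name));

	/* the "service1" address (made up) */
	sin = (struct sockaddr_in *)&ifra.ifra_addr;
	sin->sin_len = sizeof(*sin);
	sin->sin_family = AF_INET;
	sin->sin_addr.s_addr = inet_addr("192.0.2.10");

	/* its netmask */
	sin = (struct sockaddr_in *)&ifra.ifra_mask;
	sin->sin_len = sizeof(*sin);
	sin->sin_family = AF_INET;
	sin->sin_addr.s_addr = inet_addr("255.255.255.0");

	if (ioctl(s, SIOCAIFADDR, &ifra) == -1)	/* add the alias */
		err(1, "SIOCAIFADDR");
	close(s);
	return 0;
}
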
However, due to whatever implementation peculiarities on DEC's
side, replies to UDP RPC calls come from the "node?" address, even
though the client sent the calls to the "service1" address.
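
This can be seen with a plain UDP socket already; a minimal sketch
(addresses are made up, and the payload is just a placeholder; an
actual portmap query would have to be a well-formed RPC message):

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in dst, from;
	socklen_t fromlen = sizeof(from);
	char buf[2048];
	int s;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		err(1, "socket");

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(111);			/* portmapper */
	dst.sin_addr.s_addr = inet_addr("192.0.2.10");	/* "service1" */

	if (sendto(s, "x", 1, 0,
	    (struct sockaddr *)&dst, sizeof(dst)) == -1)
		err(1, "sendto");
	if (recvfrom(s, buf, sizeof(buf), 0,
	    (struct sockaddr *)&from, &fromlen) == -1)
		err(1, "recvfrom");

	/* On the cluster, "from" holds a node address, not "service1". */
	printf("sent to    %s\n", inet_ntoa(dst.sin_addr));
	printf("reply from %s\n", inet_ntoa(from.sin_addr));
	close(s);
	return 0;
}
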
What happens now is that NetBSD's "mount_nfs" does a "portmap"
call to "service1", gets the reply from "node1", puts the "node1"
address into the mount(2) argument structure and passes it to
the kernel. This works, of course, as long as "service1" and
"node1" refer to the same machine.
If the service gets relocated to "node2", the kernel still sends
the NFS client calls to "node1". Even when "node1" comes back up,
all requests are answered with ESTALE because "service1"'s
filesystems are not mounted on "node1" anymore.
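
For reference, that first step boils down to a portmapper query like
the one below (simplified, not the actual mount_nfs code; the address
is made up). The trouble happens below this level, inside the RPC
library, where the reply's source address replaces the address that
was asked for:

#include <rpc/rpc.h>
#include <rpc/pmap_clnt.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct sockaddr_in server;
	u_short port;

	memset(&server, 0, sizeof(server));
	server.sin_family = AF_INET;
	server.sin_addr.s_addr = inet_addr("192.0.2.10");	/* "service1" */

	/* 100003 is the NFS RPC program number */
	port = pmap_getport(&server, 100003, 3, IPPROTO_UDP);
	if (port == 0)
		fprintf(stderr, "portmap query failed\n");
	else
		printf("NFS served on UDP port %u\n", (unsigned)port);
	return 0;
}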

A solution would require putting the "service1" IP address into
the mount(2) argument structure.
Now I'm not sure whether to treat this as an NFS-specific matter, to
be solved in higher-level mount_nfs code (see the sketch below), or
as a general RPC problem.
The appended patch tries the latter approach, and it indeed makes
NFS mounts survive a server switchover. (The "inlen" thing
is just cleanup.)
I'm not sure about possible effects on RPC multicasts (or anycasts
with IPv6?). Is it specified whether such service specifications
should collapse into a specific node address at portmap time?
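
The NFS-specific variant mentioned above would boil down to something
like the sketch below: keep the sockaddr the user gave us ("service1")
and take only the port number from the portmapper's reply before the
address goes into the mount(2) arguments. (The function and all names
are made up; this is not actual mount_nfs code.)

#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

/*
 * Keep the address that was asked for; only the port number
 * comes from the portmapper's reply.
 */
static void
make_nfs_addr(const struct sockaddr_in *requested, in_port_t port,
    struct sockaddr_in *out)
{
	*out = *requested;
	out->sin_port = port;
}

int
main(void)
{
	struct sockaddr_in service, nfsaddr;

	memset(&service, 0, sizeof(service));
	service.sin_family = AF_INET;
	service.sin_addr.s_addr = inet_addr("192.0.2.10");	/* "service1" */

	make_nfs_addr(&service, htons(2049), &nfsaddr);	/* port as learned via portmap */
	printf("mount(2) would get %s port %u\n",
	    inet_ntoa(nfsaddr.sin_addr), (unsigned)ntohs(nfsaddr.sin_port));
	return 0;
}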

Comments? Clues?

best regards
Matthias



udprpc.txt:

--- clnt_dg.c.~1.12.~	Wed Sep 10 20:09:37 2003
+++ clnt_dg.c	Wed Oct 15 15:00:28 2003
@@ -325,7 +325,6 @@ clnt_dg_call(cl, proc, xargs, argsp, xre
 	sigset_t mask;
 #endif
 	sigset_t newmask;
-	socklen_t fromlen, inlen;
 	ssize_t recvlen = 0;
 
 	_DIAGASSERT(cl != NULL);
@@ -495,10 +494,8 @@ send_again:
 				 */
 				errno = 0;
 			}
-			fromlen = sizeof (struct sockaddr_storage);
-			recvlen = recvfrom(cu->cu_fd, cu->cu_inbuf,
-			    cu->cu_recvsz, 0, (struct sockaddr *)(void *)&cu->cu_raddr,
-			    &fromlen);
+			recvlen = recv(cu->cu_fd, cu->cu_inbuf,
+				       cu->cu_recvsz, 0);
 		} while (recvlen < 0 && errno == EINTR);
 		if (recvlen < 0) {
 			if (errno == EWOULDBLOCK)
@@ -516,13 +513,12 @@ send_again:
 		/* we now assume we have the proper reply */
 		break;
 	}
-	inlen = (socklen_t)recvlen;
 
 	/*
 	 * now decode and validate the response
 	 */
 
-	xdrmem_create(&reply_xdrs, cu->cu_inbuf, (u_int)inlen, XDR_DECODE);
+	xdrmem_create(&reply_xdrs, cu->cu_inbuf, (u_int)recvlen, XDR_DECODE);
 	ok = xdr_replymsg(&reply_xdrs, &reply_msg);
 	/* XDR_DESTROY(&reply_xdrs);	save a few cycles on noop destroy */
 	if (ok) {
