Subject: Locking trouble in net/if_tun.c
To: None <tech-net@netbsd.org>
From: Tom Ivar Helbekkmo <tih@eunetnorge.no>
List: tech-net
Date: 05/09/2004 21:56:46
I've been playing with openvpn, from pkgsrc, lately.  It works great,
except that it's been making my base station VPN gateway hang hard
whenever I pressed the tunnel a bit.  Every time I pushed more than a
little bit of data through the VPN tunnel, the VPN gateway would get
hung up at the start of tun_output, at the simple_lock() call:

int
tun_output(ifp, m0, dst, rt)
	struct ifnet   *ifp;
	struct mbuf    *m0;
	struct sockaddr *dst;
	struct rtentry *rt;
{
	struct tun_softc *tp = ifp->if_softc;
#ifdef INET
	int		s;
	int		error;
#endif
	int		mlen;
	ALTQ_DECL(struct altq_pktattr pktattr;)

	simple_lock(&tp->tun_lock);

The machine would be hung so hard, it didn't respond to keypresses on
the console, and DDB would show that it was at this location because
of a packet coming in over the ethernet, destined for the client side
of the VPN tunnel.  The DDB backtrace would always show either

    ip_input->ip_forward->ip_output->tun_output

or

    ip_input->tcp_input->tcp_output->ip_output->tun_output

Now, looking for a reson why the tunnel was locked, causing a deadlock
here, I first noticed that there's one single return from tun_output()
that doesn't unlock the tun_lock.  It looks like this:

	case AF_UNSPEC:
		s = splnet();
		IFQ_ENQUEUE(&ifp->if_snd, m0, &pktattr, error);
		if (error) {
			splx(s);
			ifp->if_collisions++;
			return (error);
		}

I added a call to simple_unlock() before the return, but this did not
help.  (I'm mentioning it anyway, because it looks to me like it's
wrong to leave the tun locked here.)  Then, I figured that a "try
again" type error return might do the trick, so I changed the start of
tun_output thus:

RCS file: /cvsroot/src/sys/net/if_tun.c,v
retrieving revision 1.68
diff -c -r1.68 if_tun.c
*** if_tun.c	1 Mar 2004 13:54:02 -0000	1.68
--- if_tun.c	9 May 2004 19:42:03 -0000
***************
*** 438,446 ****
  	int		mlen;
  	ALTQ_DECL(struct altq_pktattr pktattr;)
  
- 	simple_lock(&tp->tun_lock);
  	TUNDEBUG ("%s: tun_output\n", ifp->if_xname);
  
  	if ((tp->tun_flags & TUN_READY) != TUN_READY) {
  		TUNDEBUG ("%s: not ready 0%o\n", ifp->if_xname,
  			  tp->tun_flags);
--- 438,450 ----
  	int		mlen;
  	ALTQ_DECL(struct altq_pktattr pktattr;)
  
  	TUNDEBUG ("%s: tun_output\n", ifp->if_xname);
  
+ 	if (!simple_lock_try(&tp->tun_lock)) {
+ 		TUNDEBUG ("%s: already locked\n", ifp->if_xname);
+ 		return (ENOBUFS);
+ 	}
+ 
  	if ((tp->tun_flags & TUN_READY) != TUN_READY) {
  		TUNDEBUG ("%s: not ready 0%o\n", ifp->if_xname,
  			  tp->tun_flags);
***************
*** 496,501 ****
--- 500,506 ----
  		if (error) {
  			splx(s);
  			ifp->if_collisions++;
+ 			simple_unlock(&tp->tun_lock);
  			return (error);
  		}
  		mlen = m0->m_pkthdr.len;

The ENOBUFS is supposed to indicate a transient error.  Note that I've
kept the unlock in the collision case, because I still believe it's
supposed to be there -- but I may be wrong.  Anyway, after this
change, my VPN tunnel works great, there are no hangs, and I do not
see any degradation in performance.

Anyone care to look this over, and try to understand whether I've done
the right thing, or just masked the symptoms of another problem?  :-)

-tih
-- 
Tom Ivar Helbekkmo, Senior System Administrator, EUnet Norway
www.eunet.no  T: +47-22092958 M: +47-93013940 F: +47-22092901